# Relax Inc. Take Home Practice

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Introduction

# Predicting User Adoption in an Online Platform

The goal of this project is to identify factors that predict future user adoption in an online platform. We have two main datasets:

1. **User Data**: This dataset contains information about 12,000 users who signed up for the product. It includes user details such as name, email, account creation source, creation time, last session creation time, and more.

2. **User Engagement Data**: This dataset contains a summary of user activity, including login dates.

We define an "adopted user" as a user who has logged into the product on three separate days within at least one seven-day period. The task is to identify which factors predict whether a user will become an adopted user.

In this analysis, we will perform data cleaning, exploratory data analysis, and build predictive models to understand the key indicators of user adoption. The insights gained will help the platform improve long-term user retention.

Let's start by loading the data and conducting initial data exploration.


In [2]:
# Load 'takehome_users.csv' into a DataFrame
users = pd.read_csv('takehome_users.csv', encoding='latin-1')

# Load 'takehome_user_engagement.csv' into a DataFrame
user_engagement = pd.read_csv('takehome_user_engagement.csv')

In [3]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
# Rename object_id to user_id for easier processing
users.rename(columns={'object_id': 'user_id'}, inplace=True)

In [5]:
# Count nulls in each column of users
null_counts = users.isnull().sum()
print(null_counts)

user_id                          0
creation_time                    0
name                             0
email                            0
creation_source                  0
last_session_creation_time    3177
opted_in_to_mailing_list         0
enabled_for_marketing_drip       0
org_id                           0
invited_by_user_id            5583
dtype: int64


In [6]:
unique_users = users['user_id'].nunique()
unique_users

12000

In [7]:
# Drop the email and name columns; user_id is the unique identifier for each user
users.drop(columns=['email', 'name'], inplace=True)

# Set 'user_id' as the index
users.set_index('user_id', inplace=True)

In [8]:
# Convert creation_time to datetime, and last_session_creation_time from Unix seconds to datetime
users['creation_time'] = pd.to_datetime(users['creation_time'])

users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit='s')

## Dealing with Null Values

The "invited_by_user_id" column is null for 5,583 users. Because we only care whether a user was referred, not who referred them, we replace it with a binary column named "invited_by_user": users with a non-null "invited_by_user_id" get the value 1 (referred by another user), and users with a null value get 0 (not referred).

This handles the null values without imputation and simplifies the dataset while preserving the relevant information.


In [9]:
# Create a new column "invited_by_user" with 1 for non-null values and 0 for null values
users['invited_by_user'] = users['invited_by_user_id'].notna().astype(int)

# Now, drop the original column
users.drop('invited_by_user_id', axis=1, inplace=True)

In [10]:
users.head()

user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user
1,2014-04-22 03:53:30,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,1
2,2013-11-15 03:45:04,ORG_INVITE,2014-03-31 03:45:04,0,0,1,1
3,2013-03-19 23:14:52,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1
4,2013-05-21 08:09:28,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,1
5,2013-01-17 10:14:20,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,1


In [11]:
user_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [12]:
unique_users = user_engagement['user_id'].nunique()
unique_users

8823

With "invited_by_user_id" replaced by the binary "invited_by_user" column, the only remaining column with null values is "last_session_creation_time".

Its 8,823 non-null values match the 8,823 distinct users in the engagement table, that is, users who have logged into the product at least once. The remaining 3,177 null values are users who never logged in after creating their account. This column will be dropped before modeling: it describes activity after signup rather than an attribute known at signup, and the adoption label itself is derived from the engagement table.
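
The claim that null last-session values and missing engagement rows describe the same users can be checked directly. The following is a minimal sketch on hypothetical miniature tables (`users_demo` and `engagement_demo` are illustrative names, not the real data); the same comparison could be run on `users` and `user_engagement`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (illustrative values only)
users_demo = pd.DataFrame(
    {'last_session_creation_time': [1398139000.0, None, 1363735000.0, None]},
    index=pd.Index([1, 2, 3, 4], name='user_id'),
)
engagement_demo = pd.DataFrame({'user_id': [1, 1, 3], 'visited': [1, 1, 1]})

# Users with no recorded last session
never_logged_in = users_demo.index[users_demo['last_session_creation_time'].isna()]

# Users that appear at least once in the engagement table
has_engagement = users_demo.index.isin(engagement_demo['user_id'])

# True when "no last session" and "no engagement rows" pick out exactly the same users
consistent = set(never_logged_in) == set(users_demo.index[~has_engagement])
print(consistent)
```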

Our focus now is to identify adopted users based on user engagement data.


In [13]:
# Create a placeholder column set to 0 for every user
# (the actual label is computed later as 'adopted'; this column stays all-zero)
users['adopted_user'] = 0

users.head()

user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user,adopted_user
1,2014-04-22 03:53:30,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,1,0
2,2013-11-15 03:45:04,ORG_INVITE,2014-03-31 03:45:04,0,0,1,1,0
3,2013-03-19 23:14:52,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1,0
4,2013-05-21 08:09:28,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,1,0
5,2013-01-17 10:14:20,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,1,0


### Calculating User Engagement Duration

We compute each account's age as a proxy for how long the user has had access to the product. The reference date is the most recent timestamp in the user engagement data, and "usage_duration" is the number of whole days between a user's creation time and that reference date. Every user is measured against the same reference date, so this is time since signup, not the time between a user's first and last login.


In [14]:
# Find the latest date in user_engagement
latest_date = user_engagement['time_stamp'].max()

# Convert it to a datetime object
latest_date = pd.to_datetime(latest_date)

# Calculate the duration using the latest date as the reference
users['usage_duration'] = (latest_date - users['creation_time']).dt.days

users.head()

user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user,adopted_user,usage_duration
1,2014-04-22 03:53:30,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,1,0,45
2,2013-11-15 03:45:04,ORG_INVITE,2014-03-31 03:45:04,0,0,1,1,0,203
3,2013-03-19 23:14:52,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1,0,443
4,2013-05-21 08:09:28,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,1,0,381
5,2013-01-17 10:14:20,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,1,0,505


In [15]:
users['usage_duration'].describe()

count    12000.000000
mean       324.568000
std        216.646173
min          6.000000
25%        129.000000
50%        304.000000
75%        506.000000
max        736.000000
Name: usage_duration, dtype: float64

In [16]:
# One-hot encode the last non-numeric column used in modeling
users = pd.get_dummies(users, columns=['creation_source'], drop_first=True)
users.drop('creation_time', axis=1, inplace=True)

In [17]:
visited_counts = user_engagement['visited'].value_counts()
print(visited_counts)

1    207917
Name: visited, dtype: int64


In [18]:
user_engagement['time_stamp'] = pd.to_datetime(user_engagement['time_stamp'])

# Create an empty list to store adopted user IDs
adopted_users = []

# Iterate over unique user IDs
for user_id in user_engagement['user_id'].unique():
    user_data = user_engagement[user_engagement['user_id'] == user_id]
    
    if len(user_data) >= 3:
        user_data = user_data.set_index('time_stamp')
        if user_data.resample('D').count().rolling(window=7).sum()['visited'].max() >= 3:
            adopted_users.append(user_id)

# Now, 'adopted_users' contains the IDs of adopted users
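
One edge case in the loop above is worth checking. With an integer window, pandas' `rolling` defaults `min_periods` to the window size, so the first six daily rows of each user yield NaN, and a user whose whole login history spans fewer than seven days is never flagged. In addition, `count()` counts login rows, so several logins on one day would count as several days. The sketch below is one possible variant, not the method used for the results in this notebook; `is_adopted` is a hypothetical helper name and the timestamps are made up.

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """True if logins fall on >= min_days distinct days within any window_days-day span."""
    # Collapse to distinct calendar days so repeat logins on one day count once
    days = pd.DatetimeIndex(timestamps).normalize().unique()
    daily = pd.Series(1, index=days).sort_index()
    # Fill in a continuous daily grid; days without a login sum to 0
    daily = daily.resample('D').sum()
    # min_periods=1 keeps windows shorter than window_days instead of returning NaN
    best = daily.rolling(window=window_days, min_periods=1).sum().max()
    return bool(best >= min_days)

# Made-up login histories
short_history = ['2014-01-01 08:00', '2014-01-02 09:00', '2014-01-03 10:00']
same_day = ['2014-01-01 08:00', '2014-01-01 12:00', '2014-01-01 18:00']
spread_out = ['2014-01-01', '2014-01-10', '2014-01-20']

# The default-min_periods logic yields NaN for a history shorter than 7 days
original_max = (pd.Series(1, index=pd.to_datetime(short_history))
                .resample('D').count().rolling(window=7).sum().max())

print(pd.isna(original_max))      # short history is invisible to the default window
print(is_adopted(short_history))  # three distinct days within a week
print(is_adopted(same_day))       # one distinct day only
print(is_adopted(spread_out))     # three days, but never within one week
```

If this variant were applied to the full engagement table, the adopted-user count could differ from the one reported in this notebook.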

In [19]:
# Flag each user as adopted (1) or not (0) with a vectorized membership test
users['adopted'] = users.index.isin(adopted_users).astype(int)

users.head()

user_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user,adopted_user,usage_duration,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH,adopted
1,2014-04-22 03:53:30,1,0,11,1,0,45,0,0,0,0,0
2,2014-03-31 03:45:04,0,0,1,1,0,203,1,0,0,0,1
3,2013-03-19 23:14:52,0,0,94,1,0,443,1,0,0,0,0
4,2013-05-22 08:09:28,0,0,1,1,0,381,0,0,0,0,0
5,2013-01-22 10:14:20,0,0,193,1,0,505,0,0,0,0,0


In [20]:
# Count the values of the 'adopted' column
adopted_count = users['adopted'].value_counts()

# Drop the 'last_session_creation_time' column
users.drop(['last_session_creation_time'], axis=1, inplace=True)

# Display the count of adopted and non-adopted users
print(adopted_count)

0    10403
1     1597
Name: adopted, dtype: int64


## Data Preparation and Exploration

- We loaded two datasets, "takehome_users.csv" and "takehome_user_engagement.csv."
- We cleaned and preprocessed the data, handling null values and creating a target variable for "adopted users."
- The class distribution for the target variable is imbalanced, with 10,403 non-adopted users and 1,597 adopted users.

Now on to the modeling phase.

## Modeling Phase

In this phase, we will build a binary classification model to predict user adoption. We'll follow these steps:

1. **Prepare Data:** 
   - Define the feature variables (X) and the target variable (y).
   - Split the data into training and testing sets.

2. **Model Selection:**
   - Explore three different classification algorithms: Logistic Regression, Random Forest, and XGBoost.
   - Tune the hyperparameters for each model using grid search.

3. **Model Training and Evaluation:**
   - Train the models on the training data.
   - Evaluate model performance using accuracy, precision, recall, F1-score, and confusion matrices.


In [21]:
# Separate the features (X) from the target (y)
X = users.drop('adopted', axis=1)
y = users['adopted']

# Create a SMOTE instance
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Apply SMOTE to balance the dataset
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
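
The cell above resamples before splitting, so the test set is also balanced and contains synthetic rows. A common alternative is to split first and rebalance only the training set, leaving the test set at the original class ratio. The sketch below illustrates that ordering on synthetic data from `make_classification`. To stay within scikit-learn it uses plain random oversampling via `sklearn.utils.resample` in place of SMOTE; with imbalanced-learn, `smote.fit_resample(X_tr, y_tr)` could take the place of step 2. All variable names here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the users table (roughly 13% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.87, 0.13], random_state=42
)

# 1. Split first, stratifying so both sets keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# 2. Oversample the minority class in the training set only
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = resample(minority, replace=True,
                 n_samples=len(majority) - len(minority), random_state=42)
train_idx = np.concatenate([majority, minority, extra])
X_tr_bal, y_tr_bal = X_tr[train_idx], y_tr[train_idx]

# The training set is now balanced; the test set is untouched and still imbalanced
print("train class counts:", np.bincount(y_tr_bal))
print("test class counts: ", np.bincount(y_te))
```

Metrics computed on `X_te` then reflect the class balance the model would face on real users.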

### Grid Search - Hyperparameter Tuning for Random Forest and XGBoost

In [22]:
param_grid = {
    'n_estimators': [50, 100, 150, 200],  
    'max_depth': [10, 20, 30, None],
}

# Create the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Create a GridSearchCV instance
grid_search_rf = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Fit the grid search to your training data
grid_search_rf.fit(X_train, y_train)

# Get the best hyperparameters
best_n_estimators_rf = grid_search_rf.best_params_['n_estimators']
best_max_depth_rf = grid_search_rf.best_params_['max_depth']

# Print the best hyperparameters
print(f"Best n_estimators for Random Forest: {best_n_estimators_rf}")
print(f"Best max_depth for Random Forest: {best_max_depth_rf}")

Best n_estimators for Random Forest: 150
Best max_depth for Random Forest: 30


In [23]:
from xgboost import XGBClassifier

# Define the hyperparameter grid for XGBoost
param_grid_xgboost = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2]
}

# Create the XGBoost classifier
xgboost_classifier = XGBClassifier(random_state=42)

# Create a GridSearchCV instance
grid_search_xgboost = GridSearchCV(estimator=xgboost_classifier, param_grid=param_grid_xgboost, cv=3, scoring='accuracy', n_jobs=-1)

# Fit the grid search to your training data
grid_search_xgboost.fit(X_train, y_train)

# Get the best hyperparameters
best_n_estimators_xgboost = grid_search_xgboost.best_params_['n_estimators']
best_max_depth_xgboost = grid_search_xgboost.best_params_['max_depth']
best_learning_rate_xgboost = grid_search_xgboost.best_params_['learning_rate']

# Print the best hyperparameters for XGBoost
print(f"Best n_estimators for XGBoost: {best_n_estimators_xgboost}")
print(f"Best max_depth for XGBoost: {best_max_depth_xgboost}")
print(f"Best learning_rate for XGBoost: {best_learning_rate_xgboost}")

Best n_estimators for XGBoost: 200
Best max_depth for XGBoost: 5
Best learning_rate for XGBoost: 0.2
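
With the default `refit=True`, `GridSearchCV` refits the best configuration on the full training data and exposes it as `best_estimator_`, so the winning hyperparameters do not have to be copied by hand into a new model. The following is a minimal sketch on synthetic data with a deliberately small grid; the data and variable names are illustrative, not this notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [20, 50], 'max_depth': [3, None]},
    cv=3,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

# The refitted best model is ready to use directly
best_model = search.best_estimator_
print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(best_model.score(X_te, y_te), 3))
```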


# Logistic Regression Model

In this section, we build and evaluate a Logistic Regression model to predict user adoption. Logistic Regression is a linear classifier for binary outcomes, which makes it a natural baseline for the adopted / not adopted target.

The workflow for this model has four steps:

1. Data Split: We reuse the training and test sets created above.

2. Model Development: We create a Logistic Regression model and fit it on the training set.

3. Model Assessment: We evaluate the model on the test set using accuracy, precision, recall, and the F1-score.

4. Interpretation: We review the results to see how well the model predicts user adoption.

In [24]:
logreg = LogisticRegression(max_iter=1000, random_state=42)

logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

In [25]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.78
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.89      0.80      2054
           1       0.86      0.67      0.75      2108

    accuracy                           0.78      4162
   macro avg       0.79      0.78      0.77      4162
weighted avg       0.79      0.78      0.77      4162

Confusion Matrix:
[[1830  224]
 [ 705 1403]]

# Random Forest Model


In [26]:
# Create the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=150, max_depth=30, random_state=42)

# Train the model on the training data
rf_classifier.fit(X_train, y_train)

rf_y_pred = rf_classifier.predict(X_test)

In [27]:
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print(f"Accuracy: {rf_accuracy:.2f}")

print("Classification Report:")
print(classification_report(y_test, rf_y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, rf_y_pred))

Accuracy: 0.86
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.87      0.86      2054
           1       0.87      0.85      0.86      2108

    accuracy                           0.86      4162
   macro avg       0.86      0.86      0.86      4162
weighted avg       0.86      0.86      0.86      4162

Confusion Matrix:
[[1785  269]
 [ 306 1802]]

# XGBoost Model


In [28]:
xgb_params = {
    'n_estimators': 200,
    'max_depth': 5,
    'learning_rate': 0.2,
    'random_state': 42,
}

xgb_classifier = XGBClassifier(**xgb_params)

xgb_classifier.fit(X_train, y_train)

xgb_y_pred = xgb_classifier.predict(X_test)

In [29]:
xgb_accuracy = accuracy_score(y_test, xgb_y_pred)
print(f"Accuracy: {xgb_accuracy:.2f}")

print("Classification Report:")
print(classification_report(y_test, xgb_y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, xgb_y_pred))

Accuracy: 0.82
Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.87      0.83      2054
           1       0.86      0.77      0.81      2108

    accuracy                           0.82      4162
   macro avg       0.82      0.82      0.82      4162
weighted avg       0.82      0.82      0.82      4162

Confusion Matrix:
[[1781  273]
 [ 479 1629]]


### Model Comparison

| Model               | Accuracy | Precision | Recall | F1-Score |
|---------------------|----------|-----------|--------|----------|
| Logistic Regression | 0.78     | 0.79      | 0.78   | 0.77     |
| Random Forest       | 0.86     | 0.86      | 0.86   | 0.86     |
| XGBoost             | 0.82     | 0.82      | 0.82   | 0.82     |

Precision, recall, and F1-score are the macro averages from the classification reports above.

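- **Accuracy:** The proportion of all predictions that are correct.
- **Precision:** Of the users predicted to belong to a class, the fraction that truly do. High precision means few false positives.
- **Recall:** Of the users who truly belong to a class, the fraction the model identifies. High recall means few false negatives.
- **F1-Score:** The harmonic mean of precision and recall.

These metrics are computed on a test set drawn from the SMOTE-balanced data, so the test set is roughly evenly split between classes (2,054 non-adopted and 2,108 adopted rows) and includes synthetic samples. Performance on the original, imbalanced user population may differ.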
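
As a worked check of these definitions, the sketch below recomputes the metrics from the Logistic Regression confusion matrix printed earlier. It uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# Logistic Regression confusion matrix from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[1830, 224],
               [705, 1403]])

cm_accuracy = np.trace(cm) / cm.sum()
cm_precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / all predicted as that class
cm_recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / all truly in that class
cm_f1 = 2 * cm_precision * cm_recall / (cm_precision + cm_recall)

print(f"accuracy        : {cm_accuracy:.2f}")
print(f"precision (0, 1): {np.round(cm_precision, 2)}")
print(f"recall    (0, 1): {np.round(cm_recall, 2)}")
print(f"macro precision : {cm_precision.mean():.2f}")
print(f"macro recall    : {cm_recall.mean():.2f}")
print(f"macro F1        : {cm_f1.mean():.2f}")
```

The per-class and macro values agree with the classification report printed for that model.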


In [30]:
importances = rf_classifier.feature_importances_

# Get the names of the features
feature_names = X.columns

# Create a DataFrame to organize the feature names and their importance scores
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

# Sort the features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the top N features with their importance scores
top_n = 10  # You can change this to show more or fewer features
top_features = feature_importance_df.head(top_n)
print(top_features)

                              Feature  Importance
5                      usage_duration    0.336047
2                              org_id    0.255910
7   creation_source_PERSONAL_PROJECTS    0.098205
8              creation_source_SIGNUP    0.086424
9  creation_source_SIGNUP_GOOGLE_AUTH    0.068861
3                     invited_by_user    0.063204
6          creation_source_ORG_INVITE    0.057449
0            opted_in_to_mailing_list    0.019461
1          enabled_for_marketing_drip    0.014438
4                        adopted_user    0.000000
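
Impurity-based importances rank features by how much they contribute to the forest's splits, but they do not say whether a feature raises or lowers the chance of adoption. One simple way to see direction is to compare adoption rates across groups. The sketch below does this on a hypothetical toy table (the values are made up, not taken from the Relax data); on the real data it would need to run on the raw `creation_source` column, before one-hot encoding.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    'creation_source': ['SIGNUP', 'SIGNUP', 'ORG_INVITE',
                        'ORG_INVITE', 'GUEST_INVITE', 'GUEST_INVITE'],
    'adopted':         [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 label within each group is that group's adoption rate
adoption_rate = (toy.groupby('creation_source')['adopted']
                    .mean()
                    .sort_values(ascending=False))
print(adoption_rate)
```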


## Recommendations to Improve User Adoption

1. **Usage Duration**: `usage_duration` is the most important feature, but it measures account age, which the platform cannot change directly. What you can influence is whether users stay active as their accounts age, for example by offering incentives, loyalty programs, or improving the overall user experience.

2. **Organization Membership (org_id)**: `org_id` identifies the organization a user belongs to, so its importance suggests that adoption varies from one organization to another. Highlight the benefits of group engagement and collaboration, and consider organizing events, challenges, or forums for organizations to boost user involvement.

3. **Personal Projects (creation_source_PERSONAL_PROJECTS)**: Promote the creation of personal projects. Provide resources and tools to facilitate personal project development, and showcase success stories of users who have created valuable content or projects.

4. **Direct Sign-up (creation_source_SIGNUP)**: Make the sign-up process seamless and user-friendly. Provide clear benefits of using your platform and make it easy for users to create an account. Consider optimizing the sign-up flow to reduce friction.

5. **Google Authentication (creation_source_SIGNUP_GOOGLE_AUTH)**: Continue to offer Google authentication as a convenient sign-up method. Users appreciate quick and secure access. Ensure the process remains reliable and safe.

6. **Invitations (invited_by_user)**: Implement a referral program where users can invite others to join the platform. Reward users for successful invitations, and provide tools to facilitate the invitation process.

7. **Organization Invitations (creation_source_ORG_INVITE)**: Strengthen the organization invitation process. Encourage users to invite colleagues and friends to join organizations. Highlight the benefits of organization membership.

8. **Mailing List (opted_in_to_mailing_list)**: Utilize email marketing effectively to keep users informed about new features, updates, and community events. Provide a clear opt-in process during registration and explain the value of subscribing.

9. **Marketing Drip (enabled_for_marketing_drip)**: Implement a regular marketing drip campaign that educates users about the platform's features and benefits. Keep users engaged and informed with personalized content.

These recommendations follow the feature importance ranking, which shows how much each feature contributes to the model's predictions but not whether it raises or lowers adoption. It's essential to keep monitoring user behavior and adapt your strategies as needed based on data and user feedback. Conducting A/B tests for each initiative can help assess its actual impact on user adoption and retention.
