# Step 1: Problem Understanding and Data Loading

First, we need to understand the problem.

Task: identify which factors predict user adoption, where an "adopted user" is one who has logged in on at least three separate days within a seven-day period.

Approach:

- Load the provided CSV files.
- Explore the dataset to understand its structure and contents.
- Flag adopted users based on the engagement (usage) log.

In [6]:
import pandas as pd

# Load the data, specifying ISO-8859-1 instead of pandas' default UTF-8 encoding
users_df = pd.read_csv('takehome_users.csv', encoding='ISO-8859-1')
engagement_df = pd.read_csv('takehome_user_engagement.csv', encoding='ISO-8859-1')


In [7]:
# Display the first few rows of the users dataset
users_df.head()

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


In [9]:
engagement_df.head()

| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


In [10]:
print("Users Dataframe Info:")
users_df.info()

Users Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [12]:
print("User Engagement Dataframe Info:")
engagement_df.info()

User Engagement Dataframe Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


## Step 2: Feature Engineering

Before we model user adoption, we need to:

- Convert the time_stamp column in the engagement_df to a date format.
- Group user interactions by a 7-day period to define "adopted users" (those who logged in on 3 or more different days within a 7-day period).
- Merge this data back with the users_df to add an "adopted_user" flag for each user.

In [14]:
# Convert 'time_stamp' in engagement_df to datetime format
engagement_df['time_stamp'] = pd.to_datetime(engagement_df['time_stamp'])

# Count the distinct calendar days on which each user logged in (over their entire history)
engagement_df['login_date'] = engagement_df['time_stamp'].dt.date
login_day_counts = engagement_df.groupby('user_id')['login_date'].nunique().reset_index(name='n_login_days')

In [15]:
# Flag users with at least 3 distinct login days overall.
# Note: this is a looser proxy for the stated definition, because it does not
# check that the 3 days fall within a single 7-day window.
adopted_users = login_day_counts[login_day_counts['n_login_days'] >= 3]

# Mark adopted users in the users_df dataframe
users_df['adopted_user'] = users_df['object_id'].isin(adopted_users['user_id']).astype(int)

# Display the updated users_df
users_df.head()

| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_user |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 | 0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 | 1 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 | 0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 | 0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 | 0 |
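
The cells above count each user's distinct login days across their whole history. A stricter reading of the task definition (three distinct days inside one seven-day span) could be sketched as follows. This is a minimal illustration, not the notebook's method: the helper name `flag_adopted`, its parameters, and the toy data are assumptions introduced here.

```python
import pandas as pd


def flag_adopted(engagement: pd.DataFrame, window_days: int = 7, min_days: int = 3) -> pd.Series:
    """Hypothetical helper: True for users with at least `min_days` distinct
    login days inside some span of `window_days` consecutive calendar days."""
    days = (
        engagement.assign(login_date=pd.to_datetime(engagement["time_stamp"]).dt.normalize())
        .drop_duplicates(subset=["user_id", "login_date"])
        .sort_values(["user_id", "login_date"])
    )
    # For each login day, look ahead to the (min_days - 1)-th later login day of the same user.
    later = days.groupby("user_id")["login_date"].shift(-(min_days - 1))
    # The days fit in one window if the first and last are fewer than window_days apart.
    fits = (later - days["login_date"]) < pd.Timedelta(days=window_days)
    return fits.groupby(days["user_id"]).any().rename("adopted_user")


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "time_stamp": [
        "2014-01-01 09:00", "2014-01-03 10:00", "2014-01-07 11:00",  # user 1: 3 days within 7
        "2014-01-01 09:00", "2014-01-05 10:00", "2014-01-08 11:00",  # user 2: 3 days spread over 8
        "2014-01-01 09:00", "2014-01-01 18:00", "2014-01-02 09:00",  # user 3: only 2 distinct days
    ],
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})
adopted = flag_adopted(toy)
print(adopted)
```

On the toy data only user 1 is flagged. Applied to the real data, the flag could be attached with `users_df['object_id'].isin(adopted[adopted].index).astype(int)`; the labels would likely differ from the whole-history count, so any metrics computed from them would need to be rerun.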


## Step 3: Data Preprocessing

In [16]:
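# Handle missing values by filling with 0 (plain assignment avoids the
# chained in-place modification that newer pandas versions warn about)
users_df['last_session_creation_time'] = users_df['last_session_creation_time'].fillna(0)
users_df['invited_by_user_id'] = users_df['invited_by_user_id'].fillna(0)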

# One-hot encode the 'creation_source' column
users_df = pd.get_dummies(users_df, columns=['creation_source'], drop_first=True)

# Build the feature matrix: drop identifier columns, the raw creation time, and the target
X = users_df.drop(['name', 'email', 'object_id', 'adopted_user', 'creation_time'], axis=1)
y = users_df['adopted_user']

# Display the first few rows of the preprocessed data
X.head()

| | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | creation_source_ORG_INVITE | creation_source_PERSONAL_PROJECTS | creation_source_SIGNUP | creation_source_SIGNUP_GOOGLE_AUTH |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1398139000.0 | 1 | 0 | 11 | 10803.0 | False | False | False | False |
| 1 | 1396238000.0 | 0 | 0 | 1 | 316.0 | True | False | False | False |
| 2 | 1363735000.0 | 0 | 0 | 94 | 1525.0 | True | False | False | False |
| 3 | 1369210000.0 | 0 | 0 | 1 | 5151.0 | False | False | False | False |
| 4 | 1358850000.0 | 0 | 0 | 193 | 5240.0 | False | False | False | False |
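
Filling missing values with 0 mixes a sentinel into numeric columns. One possible alternative, sketched here on made-up rows, is to keep explicit indicator columns and to read `last_session_creation_time` as Unix epoch seconds. That interpretation is an assumption based on the magnitude of the values, and the column names `has_session`, `was_invited`, and `last_session_dt` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values, not taken from the dataset)
toy_users = pd.DataFrame({
    "last_session_creation_time": [1398138810.0, np.nan],
    "invited_by_user_id": [10803.0, np.nan],
})

# Indicator columns record whether a value was present at all
toy_users["has_session"] = toy_users["last_session_creation_time"].notna().astype(int)
toy_users["was_invited"] = toy_users["invited_by_user_id"].notna().astype(int)

# Assumption: the float column stores seconds since 1970-01-01 (UTC); missing values become NaT
toy_users["last_session_dt"] = pd.to_datetime(toy_users["last_session_creation_time"], unit="s")

print(toy_users)
```

Whether indicators like these help the model is an empirical question; the sketch only shows how they could be derived.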


## Step 4: Model Building

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.95      0.93      1956
           1       0.72      0.55      0.62       444

    accuracy                           0.88      2400
   macro avg       0.81      0.75      0.77      2400
weighted avg       0.87      0.88      0.87      2400



Interpretation of the Results:

- Class 0 (non-adopted users): The model identifies non-adopted users well. Recall is 0.95, so it catches 95% of them, and precision is 0.90, so 90% of its non-adopted predictions are correct.
- Class 1 (adopted users): The model is weaker on adopted users. Recall is 0.55, so it finds only 55% of them, and precision is 0.72, so 72% of its adopted predictions are correct. The resulting F1-score of 0.62 is well below the 0.93 for class 0, which is common when the positive class is the minority (here 444 of 2,400 test users).
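
To put these numbers in context, the short check below recomputes the class 1 F1-score from the reported precision and recall and compares the 0.88 accuracy with a majority-class baseline. It uses only figures copied from the report; because those are rounded to two decimals, the estimated count of detected adopters is approximate.

```python
# Figures copied from the classification report above (rounded to two decimals)
support = {0: 1956, 1: 444}
precision_1, recall_1 = 0.72, 0.55

# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy of a trivial model that always predicts the majority class (non-adopted)
majority_baseline = max(support.values()) / sum(support.values())

# Approximate number of adopted users the model finds in the test set
found_adopters = recall_1 * support[1]

print(f"class 1 F1: {f1_1:.2f}")                      # 0.62
print(f"majority baseline: {majority_baseline:.3f}")  # 0.815
print(f"adopters found: about {found_adopters:.0f} of {support[1]}")
```

The baseline of 0.815 shows that the 0.88 accuracy is a smaller gain than it first appears, which is why the per-class recall is the more informative figure here.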