## Introduction

The goal of this analysis is to identify which user attributes are associated with long-term engagement with the product. An **adopted user** is defined as someone who logs in on three separate days within any seven-day period. Using login activity to label adoption, we analyze how signup-time characteristics—such as account creation source, invitation status, and marketing preferences—relate to the likelihood of user adoption.


### Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Load the Data

In [4]:
engagement = pd.read_csv('/content/takehome_user_engagement.csv')
engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [5]:
user = pd.read_csv('/content/takehome_users.csv', encoding='latin-1')
user.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [7]:
# Inspect the columns

print(user.columns)
print(engagement.columns)

Index(['object_id', 'creation_time', 'name', 'email', 'creation_source',
       'last_session_creation_time', 'opted_in_to_mailing_list',
       'enabled_for_marketing_drip', 'org_id', 'invited_by_user_id'],
      dtype='object')
Index(['time_stamp', 'user_id', 'visited'], dtype='object')


### Convert time columns to Datetime
Converting these columns to datetime lets us compare dates, compute 7-day windows, and group logins by day.

In [8]:
user['creation_time'] = pd.to_datetime(user['creation_time'])
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'])
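
The `last_session_creation_time` column in the user table holds numeric values rather than date strings, so the conversion above does not cover it. As a minimal sketch, assuming those numbers are seconds since the Unix epoch (an assumption based on their magnitude, not something stated in the data), they could be converted with `unit='s'`. The two sample values are copied from the `user.head()` output as displayed there.

```python
import pandas as pd

# Sample values as displayed in user.head();
# ASSUMPTION: they are seconds since the Unix epoch
last_session = pd.Series([1398139000.0, 1396238000.0])

# unit='s' tells pandas to interpret the numbers as epoch seconds
last_session_dt = pd.to_datetime(last_session, unit='s')
print(last_session_dt)
```

This column is not used as a model feature later in the notebook, so the conversion is optional here.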

### Prepare Login Data - One row per user per day

In [9]:
# Create login_day column
engagement['login_day'] = engagement['time_stamp'].dt.normalize()
engagement.head()

Unnamed: 0,time_stamp,user_id,visited,login_day
0,2014-04-22 03:53:30,1,1,2014-04-22
1,2013-11-15 03:45:04,2,1,2013-11-15
2,2013-11-29 03:45:04,2,1,2013-11-29
3,2013-12-09 03:45:04,2,1,2013-12-09
4,2013-12-25 03:45:04,2,1,2013-12-25


### Remove duplicate logins


In [10]:
# Keep one row per user per day by dropping repeat logins on the same day
daily_logins = engagement.drop_duplicates(
    subset=['user_id','login_day']
    ).sort_values(['user_id','login_day'])

daily_logins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   time_stamp  207917 non-null  datetime64[ns]
 1   user_id     207917 non-null  int64         
 2   visited     207917 non-null  int64         
 3   login_day   207917 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(2)
memory usage: 6.3 MB
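
To illustrate what this step does, here is a small sketch on made-up rows (the timestamps and user ids are hypothetical, not taken from the dataset). Two logins by the same user on the same calendar day collapse into a single row.

```python
import pandas as pd

# Hypothetical toy data: user 1 logs in twice on 2014-01-01
toy = pd.DataFrame({
    'time_stamp': pd.to_datetime([
        '2014-01-01 08:00:00',
        '2014-01-01 21:30:00',
        '2014-01-02 09:15:00',
        '2014-01-01 10:00:00',
    ]),
    'user_id': [1, 1, 1, 2],
})

# Truncate each timestamp to midnight so same-day logins compare equal
toy['login_day'] = toy['time_stamp'].dt.normalize()

# Keep the first row for each (user, day) pair
toy_daily = (
    toy.drop_duplicates(subset=['user_id', 'login_day'])
       .sort_values(['user_id', 'login_day'])
)
print(toy_daily)
```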


### Check Adoption Status
An adopted user is defined as a user who has logged into the product on three separate days in at least one seven-day period.

In [14]:
def is_adopted(login_days):
  """
    A user is adopted if they log in on 3 different days
    within any 7-day period.
    """
  days = login_days.dt.dayofyear.sort_values()

  for i in range(len(days)):
    # count login days that fall no later than 6 days after days[i]
    num_logins = sum((days - days.iloc[i]) <=6)

    if num_logins >= 3:
      return 1

  return 0
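
One caveat about the function above: it compares `dayofyear` values, which ignore the year, and the condition `(days - days.iloc[i]) <= 6` is also true for every login day earlier than `days.iloc[i]`. As a result it can count logins that do not fall inside a single seven-day window. A stricter variant is sketched below. It is not the function used for the results in this notebook, the name `is_adopted_strict` is hypothetical, and it assumes each input value is a calendar date (or a timestamp normalized to midnight). Applying it may change the adoption figures reported below.

```python
from datetime import date

def is_adopted_strict(login_days):
    """Return 1 if three distinct login days fall within a 7-day window, else 0."""
    # Distinct days in chronological order (works for dates or pandas Timestamps)
    days = sorted(set(login_days))

    # In a sorted list of distinct days, the tightest window containing three
    # days starting at days[i] ends at days[i + 2]
    for i in range(len(days) - 2):
        if (days[i + 2] - days[i]).days <= 6:
            return 1
    return 0

# A window that spans a year boundary is handled correctly
print(is_adopted_strict([date(2013, 12, 30), date(2014, 1, 1), date(2014, 1, 3)]))

# Three logins spread over several weeks do not qualify
print(is_adopted_strict([date(2013, 11, 15), date(2013, 11, 29), date(2013, 12, 9)]))
```

It can be passed to the same `groupby('user_id')['login_day'].apply(...)` call, since differences between pandas `Timestamp` values also expose `.days`.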

### Create the adoption label


In [15]:
adopted_users = (
    daily_logins
    .groupby('user_id')['login_day']
    .apply(is_adopted)
    .reset_index(name='adopted')
)


### Merge adoption label into user table

In [18]:
df = user.merge(
    adopted_users,
    left_on='object_id',
    right_on='user_id',
    how='left'
)

# Users with no login history are not adopted
df['adopted'] = df['adopted'].fillna(0).astype(int)

# Sanity check
print('Overall adoption rate:', df.adopted.mean().round(3)* 100,'%')


Overall adoption rate: 18.7 %


In [21]:
df.adopted.describe()

Unnamed: 0,adopted
count,12000.0
mean,0.187333
std,0.390195
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


Using the provided definition, approximately 18.7% of users are classified as adopted, indicating that a minority of users develop sustained engagement with the product. This suggests meaningful opportunity to improve activation and early user experience.

### Adoption rate by Sign up source

In [29]:
df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,1.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2.0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,3.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,4.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,5.0,0


In [28]:
df.groupby('creation_source')['adopted'].mean().sort_values(ascending=False)

Unnamed: 0_level_0,adopted
creation_source,Unnamed: 1_level_1
GUEST_INVITE,0.232085
SIGNUP_GOOGLE_AUTH,0.226715
SIGNUP,0.200287
ORG_INVITE,0.184062
PERSONAL_PROJECTS,0.109427


Users who arrive through Guest Invites (23.2%) or Google Authentication signup (22.7%) show the highest adoption rates, which may reflect the benefit of social context and low-friction onboarding.
In contrast, users with the Personal Projects creation source adopt at a substantially lower rate (10.9%). This may indicate limited ongoing collaboration or weaker incentives to return regularly.
Overall, adoption varies with creation source, but the pattern is not purely social versus non-social: Organization Invites (18.4%) fall below standard Signup (20.0%). These are associations in observational data, so they suggest, rather than establish, that how a user is introduced to the product affects the likelihood of becoming a regular user.

### Adoption rate: Invited vs. Not Invited

In [31]:
# Create was_invited column
df['was_invited'] = df['invited_by_user_id'].notna()

# Invited summary
invited_summary = (df.groupby('was_invited')
  .agg(
      users = ('adopted','count'),
      adoption_rate = ('adopted','mean')
  ))
invited_summary

Unnamed: 0_level_0,users,adoption_rate
was_invited,Unnamed: 1_level_1,Unnamed: 2_level_1
False,5583,0.172488
True,6417,0.200249



Users who were invited by an existing user show a higher adoption rate (20.0%) than users who were not invited (17.2%), a difference of about 2.8 percentage points. This suggests that a social introduction to the product is associated with a modestly higher likelihood of adoption.
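
To gauge whether a gap of this size could plausibly be noise, one option is a two-proportion z-test. The sketch below uses only the group sizes and rates from the `invited_summary` table. The adopter counts are derived by multiplying and rounding, and the test assumes users are independent of one another. It addresses statistical distinguishability only, not causation.

```python
import math

# Group sizes and rates from the invited_summary table;
# adopter counts are DERIVED as users * adoption_rate, rounded
n_not, n_inv = 5583, 6417
x_not = round(n_not * 0.172488)  # adopters among non-invited users
x_inv = round(n_inv * 0.200249)  # adopters among invited users

p_not, p_inv = x_not / n_not, x_inv / n_inv
p_pool = (x_not + x_inv) / (n_not + n_inv)  # pooled rate under H0: equal rates

# Standard error of the difference under the null hypothesis
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_not + 1 / n_inv))
z = (p_inv - p_not) / se

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"difference = {p_inv - p_not:.4f}, z = {z:.2f}, p = {p_value:.2g}")
```

With roughly 6,000 users per group, even a difference of a few percentage points is large relative to its standard error, so statistical significance here should not be read as a large practical effect.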


### Adoption rate by Marketing Drip

In [33]:
df.groupby('enabled_for_marketing_drip')['adopted'].mean()

Unnamed: 0_level_0,adopted
enabled_for_marketing_drip,Unnamed: 1_level_1
0,0.186129
1,0.194196


Users enrolled in the marketing drip campaign show a slightly higher adoption rate (19.4%) than those who are not (18.6%). The difference is under one percentage point, suggesting at most a weak association with sustained engagement.


### Adoption rate by Mailing list

In [37]:
df.groupby('opted_in_to_mailing_list')['adopted'].mean()

Unnamed: 0_level_0,adopted
opted_in_to_mailing_list,Unnamed: 1_level_1
0,0.185432
1,0.193053


Users who opted into the mailing list have a marginally higher adoption rate (19.3%) than those who did not (18.5%). This is a weak association and suggests that email opt-in alone is not a strong driver of adoption.


### **Random Forest Model to Identify Key Predictors of Adoption**
We fit a simple Random Forest classifier to identify which signup-time features are most important for predicting user adoption.

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

In [41]:
# Target Variable
y = df['adopted']

# Features available at signup
# Only use pre-adoption features to avoid leakage

X = df[
    ['was_invited',
     'enabled_for_marketing_drip',
     'opted_in_to_mailing_list',
     'creation_source']
    ]

# One-hot encode creation_source
X = pd.get_dummies(X, columns=['creation_source'], drop_first=True)
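
As a small illustration of what `drop_first=True` does in the encoding step, the sketch below encodes the five `creation_source` values seen in this dataset. The first category in sorted order (`GUEST_INVITE`) gets no column of its own and becomes the all-zeros baseline, which is why it does not appear among the model features.

```python
import pandas as pd

# The five creation_source values that appear in this dataset
sources = pd.DataFrame({'creation_source': [
    'GUEST_INVITE', 'ORG_INVITE', 'PERSONAL_PROJECTS',
    'SIGNUP', 'SIGNUP_GOOGLE_AUTH',
]})

encoded = pd.get_dummies(sources, columns=['creation_source'], drop_first=True)

# drop_first=True removes the first category in sorted order (GUEST_INVITE),
# so the GUEST_INVITE row is encoded as all zeros
print(encoded.columns.tolist())
print(encoded.astype(int))
```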

In [42]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.25,
    random_state=42,
    stratify=y
    )

In [43]:
# Random Forest Model
model = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    class_weight='balanced'

)

model.fit(X_train, y_train)

In [51]:
# Model predict
y_pred = model.predict(X_test)

# Model predict y_proba
y_pred_prob = model.predict_proba(X_test)

# auc
auc = roc_auc_score(y_test, y_pred_prob[:, 1])
print('AUC:', auc)

# accuracy
accuracy = accuracy_score(y_test, y_pred)
print("accuracy score:", accuracy)

AUC: 0.568606056536628
accuracy score: 0.6186666666666667
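
For context on the accuracy figure, it helps to compare it with a majority-class baseline. The sketch below uses only numbers printed earlier in this notebook and assumes the stratified split keeps the test-set adoption rate close to the overall rate.

```python
# Figures copied from the outputs above
adoption_rate = 0.187333            # mean of df.adopted
model_accuracy = 0.6186666666666667  # accuracy_score on the test set

# A classifier that always predicts "not adopted" is right for every non-adopter
baseline_accuracy = 1 - adoption_rate

print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"random forest accuracy:           {model_accuracy:.3f}")
print("model beats baseline:", model_accuracy > baseline_accuracy)
```

The lower accuracy is expected with `class_weight='balanced'`, which upweights the minority class and so trades overall accuracy for more positive predictions. AUC is the more informative metric for this model.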


In [53]:
# Feature importance
feature_importance = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)

# Feature importance table
importance_df = feature_importance.reset_index()
importance_df.columns = ['feature', 'importance']
importance_df

Unnamed: 0,feature,importance
0,creation_source_PERSONAL_PROJECTS,0.468523
1,creation_source_ORG_INVITE,0.127038
2,was_invited,0.097795
3,opted_in_to_mailing_list,0.092749
4,creation_source_SIGNUP_GOOGLE_AUTH,0.082142
5,enabled_for_marketing_drip,0.068334
6,creation_source_SIGNUP,0.063419


**Model Performance:**  
A simple Random Forest model achieved weak discriminative performance (AUC ≈ 0.57, only slightly above the 0.5 expected from random ranking), and its accuracy (≈ 0.62) is below the roughly 0.81 that always predicting "not adopted" would achieve. This indicates that signup-time attributes alone provide limited predictive power for adoption. The model therefore serves mainly as a cross-check on the patterns observed in the exploratory analysis rather than as a tuned predictor.  
**Feature Importance:**  
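The model assigns the most importance to **account creation source**, with the **Personal Projects** indicator (0.47) far ahead of **Organization Invites** (0.13). Invitation status (0.10) and the marketing-related features (mailing list opt-in at 0.09, marketing drip enrollment at 0.07) rank lower. Impurity-based importances show how heavily a feature is used for splitting, not the direction of its effect. Read together with the group rates above, the Personal Projects signal reflects that group's low adoption rate (10.9%). Overall this is consistent with the earlier finding that **onboarding context and entry point matter more than email-based marketing**, although with an AUC near 0.57 none of these features is a strong predictor.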
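
Because `feature_importances_` is computed from the training data, a common cross-check is permutation importance on held-out data. The sketch below shows the technique on synthetic stand-in data, not the takehome dataset. The column names `informative_flag` and `noise_flag`, and all `demo_*` names, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data: one informative binary flag, one pure-noise flag
demo_X = pd.DataFrame({
    'informative_flag': rng.integers(0, 2, n),
    'noise_flag': rng.integers(0, 2, n),
})
# Outcome probability depends only on informative_flag (0.6 vs 0.2)
demo_prob = np.where(demo_X['informative_flag'] == 1, 0.6, 0.2)
demo_y = (rng.random(n) < demo_prob).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0, stratify=demo_y
)
demo_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Shuffle each column on held-out data and measure the mean drop in ROC AUC
result = permutation_importance(
    demo_model, Xte, yte, scoring='roc_auc', n_repeats=10, random_state=0
)
perm = (
    pd.Series(result.importances_mean, index=demo_X.columns)
    .sort_values(ascending=False)
)
print(perm)
```

On the real data, the same call could be made with `model`, `X_test`, and `y_test` from the cells above. Features whose permutation barely changes AUC contribute little to held-out ranking performance.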

