# Relax User Adoption Study

Defining an *"adopted user"* as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors **predict future user adoption.**

**Data Sources:**

1. takehome_users (information about a given user and how their account was created)


2. takehome_user_engagement (row for each day a user logged into the product)


In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.dummy import DummyClassifier

%matplotlib inline

sns.set_style('darkgrid')

In [53]:
eng = pd.read_csv('takehome_user_engagement.csv')

users = pd.read_csv('takehome_users.csv', encoding='latin-1')

In [54]:
print(users.info())
users.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB
None


| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


In [55]:
print(eng.info())
eng.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB
None


| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


Several items of note appear after this initial look:


1.  I need to change several columns to datetime objects


2. Fill in missing values - if any.


3. I need to create some dummy variables for account creation source


4. I'll need to create several new features: 

    - Invited by a member of the same organization (y/n?)

    - Adopted user (y/n?) 

    - \# of logins prior to becoming 'adopted'

5. Examine last session time and understand how this feature translates into datetime values, if at all.

In [56]:
# Converting columns to datetime objects (the raw strings look like '2014-04-22 03:53:30',
# so the format string uses '-' as the date separator)
users['creation_time'] = pd.to_datetime(users['creation_time'], format='%Y-%m-%d %H:%M:%S')
eng['time_stamp'] = pd.to_datetime(eng['time_stamp'], format='%Y-%m-%d %H:%M:%S')

# also converting the referral id to an integer and renaming the column to be easier to use

users.rename(columns = {'invited_by_user_id':'referral_id', 'object_id':'user_id'}, inplace=True)
users['referral_id'] = users['referral_id'].fillna(0)
users['referral_id'] = users['referral_id'].astype('int64')

In [57]:
print(users.info())
eng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   user_id                     12000 non-null  int64         
 1   creation_time               12000 non-null  datetime64[ns]
 2   name                        12000 non-null  object        
 3   email                       12000 non-null  object        
 4   creation_source             12000 non-null  object        
 5   last_session_creation_time  8823 non-null   float64       
 6   opted_in_to_mailing_list    12000 non-null  int64         
 7   enabled_for_marketing_drip  12000 non-null  int64         
 8   org_id                      12000 non-null  int64         
 9   referral_id                 12000 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(5), object(3)
memory usage: 937.6+ KB
None
<class 'pandas.core.frame.DataFrame

In [58]:
users.describe()

| | user_id | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | referral_id |
|---|---|---|---|---|---|---|
| count | 12000.0 | 8823.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 |
| mean | 6000.5 | 1379279000.0 | 0.2495 | 0.149333 | 141.884583 | 3188.691333 |
| std | 3464.24595 | 19531160.0 | 0.432742 | 0.356432 | 124.056723 | 3869.027693 |
| min | 1.0 | 1338452000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3000.75 | 1363195000.0 | 0.0 | 0.0 | 29.0 | 0.0 |
| 50% | 6000.5 | 1382888000.0 | 0.0 | 0.0 | 108.0 | 875.0 |
| 75% | 9000.25 | 1398443000.0 | 0.0 | 0.0 | 238.25 | 6317.0 |
| max | 12000.0 | 1402067000.0 | 1.0 | 1.0 | 416.0 | 11999.0 |


In [59]:
eng.describe()

| | user_id | visited |
|---|---|---|
| count | 207917.0 | 207917.0 |
| mean | 5913.314197 | 1.0 |
| std | 3394.941674 | 0.0 |
| min | 1.0 | 1.0 |
| 25% | 3087.0 | 1.0 |
| 50% | 5682.0 | 1.0 |
| 75% | 8944.0 | 1.0 |
| max | 12000.0 | 1.0 |


From the above we can see that the engagement dataframe is clean, and in fact we could drop the visited column if needed since every value is identical. For now I'm going to keep it since it may be useful for window functions and resampling the time series information.

It seems prudent to fill in 'invited_by_user_id' with zero, since no user has that ID and doing so prevents issues with null values later on.

However, I need to examine 'last_session_creation_time' in order to assess how to impute those missing values.

In [60]:
users.last_session_creation_time

0        1.398139e+09
1        1.396238e+09
2        1.363735e+09
3        1.369210e+09
4        1.358850e+09
             ...     
11995    1.378448e+09
11996    1.358275e+09
11997    1.398603e+09
11998    1.338638e+09
11999    1.390727e+09
Name: last_session_creation_time, Length: 12000, dtype: float64

This is in Unix timestamp format, and I'll need to convert it to the same format as the other datetime objects in order to simplify working with the time data. This will also allow me to impute the last login time from the engagement dataframe.

In [61]:
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'],unit='s')
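
As a quick sanity check on the conversion, here is a minimal sketch on a made-up two-row series (`raw` is a hypothetical stand-in for the real column). It assumes the values are whole seconds since the Unix epoch, and shows that missing values simply become `NaT`:

```python
import pandas as pd

# Hypothetical sample: epoch seconds, with one missing value
raw = pd.Series([1398138810.0, None])

# unit='s' tells pandas the numbers are seconds since 1970-01-01
converted = pd.to_datetime(raw, unit='s')
print(converted)
```

The first value should come back as the same moment as the first `creation_time` in the users table, which is a useful spot check that the unit is seconds rather than milliseconds.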

Next I need to fill in the missing values for 'last_session_creation_time'. However, the engagement data only contains rows for users who already have a 'last_session_creation_time', so it cannot be used to fill in these missing values. I'll need to decide on another method to fill those values if needed. Given that roughly a quarter of the values are missing (3,177 of 12,000), I'll wait to see how and if I need them before blindly choosing a method.
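
One way to confirm that claim is to count how many users with a missing last session appear in the engagement table at all. This is only a sketch on toy frames (`users_demo` and `eng_demo` are hypothetical stand-ins for `users` and `eng`):

```python
import pandas as pd

# Hypothetical stand-ins for the real dataframes
users_demo = pd.DataFrame({
    'user_id': [1, 2, 3],
    'last_session_creation_time': pd.to_datetime(['2014-04-22', None, '2013-03-19']),
})
eng_demo = pd.DataFrame({'user_id': [1, 3, 3]})

# IDs of users whose last session time is missing
missing_ids = users_demo.loc[users_demo['last_session_creation_time'].isna(), 'user_id']

# How many of those users have at least one engagement row?
overlap = missing_ids.isin(eng_demo['user_id']).sum()
print(overlap)
```

If the same check on the real data returns zero, the engagement table cannot help with imputation.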

### Engineering Several Features

In [62]:
# Creating binary feature that indicates if a user was invited to join by a member of the same org

for i in users.index:
    if users.loc[i, 'referral_id'] != 0:
        ref = users.loc[i, 'referral_id'] #storing the referrer ID
        new_org = users.loc[i, 'org_id'] # storing the org ID for the user
        ref_org = users['org_id'][users['user_id'] == ref].iloc[0] # selecting the first value in a pandas series as the org ID for the referrer
        if ref_org == new_org:
            users.loc[i, 'org_referral'] = 1 # If the referrer org and user org match, value is 1
        else: 
            users.loc[i, 'org_referral'] = 0
    else:
        users.loc[i, 'org_referral'] = 0  
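
The row-by-row loop above works, but the same feature can probably be built without a loop by mapping each referrer ID to its org. A minimal sketch on toy data (`users_demo` is hypothetical; it assumes `user_id` is unique and that 0 marks "not invited", as in the earlier cell):

```python
import pandas as pd

# Hypothetical stand-in for the users dataframe
users_demo = pd.DataFrame({
    'user_id':     [1, 2, 3, 4],
    'org_id':      [10, 10, 20, 20],
    'referral_id': [0, 1, 1, 0],   # 0 means "not invited"
})

# Lookup table: user_id -> org_id
org_by_user = users_demo.set_index('user_id')['org_id']

# Org of each user's referrer; user_id 0 does not exist, so those rows become NaN
ref_org = users_demo['referral_id'].map(org_by_user)

# NaN never equals an org_id, so uninvited users end up with 0
users_demo['org_referral'] = (ref_org == users_demo['org_id']).astype(int)
print(users_demo)
```

On 12,000 rows the difference in speed is small, so this is mostly a readability choice.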

Next I need to define the 'adopted' feature as a label for analysis. Here it indicates whether or not a user has logged in on 3 distinct days within any 7-day period. It does not matter when a 'week' begins or ends, which means that were we to be using SQL I'd do this with a window function.

In [63]:
# Creating a feature which indicates whether or not a user is 'adopted'
ids = list(eng.user_id.unique())
for i in ids:
    idx = users.index[users["user_id"] == i].tolist()[0]
    practice = eng[eng['user_id'] == i]
    practice = practice[['time_stamp', 'visited']]
    practice.index = practice['time_stamp']
    practice = practice.drop('time_stamp', axis=1)
    practice = practice.rolling('7d').sum()
    
    if len(practice[practice['visited'] >2]) >= 1:
        users.loc[idx, 'adopted'] = 1
    else:
        users.loc[idx, 'adopted'] = 0
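
The per-user loop above can also be expressed as a single grouped rolling window, which is close in spirit to a SQL window function. A minimal sketch on made-up data (`eng_demo` and its rows are hypothetical; it assumes at most one row per user per day, as in the engagement table):

```python
import pandas as pd

# Hypothetical engagement rows for two users
eng_demo = pd.DataFrame({
    'time_stamp': pd.to_datetime([
        '2014-01-01', '2014-01-03', '2014-01-06',   # user 1: three days inside one week
        '2014-01-01', '2014-01-10', '2014-01-20',   # user 2: logins spread out
    ]),
    'user_id': [1, 1, 1, 2, 2, 2],
    'visited': 1,
})

# Count logins in the trailing 7-day window, separately for each user
rolled = (
    eng_demo.sort_values(['user_id', 'time_stamp'])
            .set_index('time_stamp')
            .groupby('user_id')['visited']
            .rolling('7D')
            .sum()
)

# A user is adopted if any window contains three or more logins
adopted = (rolled.groupby(level='user_id').max() >= 3).astype(int)
print(adopted)
```

The result is indexed by `user_id`, so it could be joined back onto the users table; users with no engagement rows would still need to be handled separately.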

In [64]:
# Showing what percent of users are considered adopted
print(str(round(users.adopted.sum()/12000 *100, 2)), 'percent of users are considered adopted.')

13.35 percent of users are considered adopted.


### Preprocessing

Here I need to prepare the data set for use with the ML models, and manipulate the datetime objects into new columns that are numeric and can be used by ML. 

I'll be dropping the features which cannot have any predictive power:  name, email, etc.  

I'll also need to break out the dates into new features: day, month, and week of the year; the year; hour and second of the day; and the day of the week.

Finally, I'll create some dummy variables from a few other features that are categorical. 
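
As a rough illustration of what breaking out the dates and creating dummy variables could look like, here is a sketch on two made-up rows (`demo`, `parts`, and the `creation_*` column names are hypothetical, not columns from the original data):

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the users table
demo = pd.DataFrame({
    'creation_time': pd.to_datetime(['2014-04-22 03:53:30', '2013-11-15 03:45:04']),
    'creation_source': ['GUEST_INVITE', 'ORG_INVITE'],
})

# Numeric date parts via the .dt accessor
ct = demo['creation_time'].dt
parts = pd.DataFrame({
    'creation_year': ct.year,
    'creation_month': ct.month,
    'creation_dayofweek': ct.dayofweek,   # Monday = 0
    'creation_hour': ct.hour,
})

# One-hot encode the categorical source column
dummies = pd.get_dummies(demo['creation_source'], prefix='source')

features = pd.concat([parts, dummies], axis=1)
print(features)
```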

In [65]:
ids = users['user_id']
data = users.drop(['user_id', 'org_id', 'name', 'email', 'referral_id'], axis=1)
data.head(3)

| | creation_time | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_referral | adopted |
|---|---|---|---|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | GUEST_INVITE | 2014-04-22 03:53:30 | 1 | 0 | 1.0 | 0.0 |
| 1 | 2013-11-15 03:45:04 | ORG_INVITE | 2014-03-31 03:45:04 | 0 | 0 | 1.0 | 1.0 |
| 2 | 2013-03-19 23:14:52 | ORG_INVITE | 2013-03-19 23:14:52 | 0 | 0 | 1.0 | 0.0 |


In [66]:
data['timedelta'] = data['last_session_creation_time'] - data['creation_time']
data.head(3)

| | creation_time | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_referral | adopted | timedelta |
|---|---|---|---|---|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | GUEST_INVITE | 2014-04-22 03:53:30 | 1 | 0 | 1.0 | 0.0 | 0 days |
| 1 | 2013-11-15 03:45:04 | ORG_INVITE | 2014-03-31 03:45:04 | 0 | 0 | 1.0 | 1.0 | 136 days |
| 2 | 2013-03-19 23:14:52 | ORG_INVITE | 2013-03-19 23:14:52 | 0 | 0 | 1.0 | 0.0 | 0 days |


In [67]:
# I've found that there are missing values in the adopted column, as well as in the
# last_session_creation_time column, and will drop those rows here
data = data.dropna()
# Encoding the categorical feature using one-hot for logistic regression, and label encoding for ensemble models
creation_source = data['creation_source']
dummy  = pd.get_dummies(creation_source, prefix = 'source')
le = LabelEncoder()
label_coded = le.fit_transform(creation_source)
data = data.drop(['creation_source', 'creation_time', 'last_session_creation_time'], axis = 1)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8823 entries, 0 to 11999
Data columns (total 5 columns):
 #   Column                      Non-Null Count  Dtype          
---  ------                      --------------  -----          
 0   opted_in_to_mailing_list    8823 non-null   int64          
 1   enabled_for_marketing_drip  8823 non-null   int64          
 2   org_referral                8823 non-null   float64        
 3   adopted                     8823 non-null   float64        
 4   timedelta                   8823 non-null   timedelta64[ns]
dtypes: float64(2), int64(2), timedelta64[ns](1)
memory usage: 413.6 KB


In [68]:
data['timedelta'] = pd.to_numeric(data['timedelta'].dt.days, downcast='integer')
data.head(3)

| | opted_in_to_mailing_list | enabled_for_marketing_drip | org_referral | adopted | timedelta |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1.0 | 0.0 | 0 |
| 1 | 0 | 0 | 1.0 | 1.0 | 136 |
| 2 | 0 | 0 | 1.0 | 0.0 | 0 |


At this point I've created a data set from the provided data that will work with a ML model.  The next step is to build a basic predictive model and then forward select to include more data as needed to improve performance. 
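
For reference, forward selection can be automated with scikit-learn's `SequentialFeatureSelector`. The sketch below runs on synthetic data, not the study data: `X_demo`, `y_demo`, and the choice of two features are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: four candidate features, only columns 0 and 2 drive the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 2] > 0).astype(int)

# Start from no features and greedily add the one that most improves CV accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward',
    cv=5,
)
selector.fit(X_demo, y_demo)

# Boolean mask over the columns that were kept
selected = selector.get_support()
print(selected)
```

With the real data, the same pattern would take the engineered columns as `X` and `adopted` as `y`; how many features to keep is a judgment call.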

### Building a Basic Predictive Model

At this point we have engineered a new feature and cleaned up the original data.  The next step is to use a classification model to identify what features contribute to the determination of whether or not a user will adopt the software. 

The first step is to create a dummy classifier, which will be used as a 'baseline' from which I can forward select additional features to include in the model.

In [70]:
# starting with a small subset of the available features
X = data.drop('adopted', axis=1)
Y = data['adopted']


x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

In [71]:
dum = DummyClassifier()
dum.fit(x_train, y_train)
dpred = dum.predict(x_test)

print(classification_report(y_test, dpred, labels = [0,1]))

              precision    recall  f1-score   support

           0       0.82      1.00      0.90      1445
           1       0.00      0.00      0.00       320

    accuracy                           0.82      1765
   macro avg       0.41      0.50      0.45      1765
weighted avg       0.67      0.82      0.74      1765





In [72]:
logit = LogisticRegression(random_state=42, class_weight='balanced')
logit.fit(x_train, y_train)
lpred = logit.predict(x_test)

print(classification_report(y_test, lpred, labels = [0,1]))

              precision    recall  f1-score   support

           0       0.99      0.95      0.97      1445
           1       0.81      0.94      0.87       320

    accuracy                           0.95      1765
   macro avg       0.90      0.95      0.92      1765
weighted avg       0.96      0.95      0.95      1765



In [73]:
rf = RandomForestClassifier(random_state=42)
rf.fit(x_train, y_train)
rpred = rf.predict(x_test)

print(classification_report(y_test, rpred, labels = [0,1]))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1445
           1       0.87      0.87      0.87       320

    accuracy                           0.95      1765
   macro avg       0.92      0.92      0.92      1765
weighted avg       0.95      0.95      0.95      1765



In [74]:
gb = GradientBoostingClassifier(random_state=42)
gb.fit(x_train, y_train)
gpred = gb.predict(x_test)

print(classification_report(y_test, gpred, labels = [0,1]))

              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1445
           1       0.92      0.88      0.90       320

    accuracy                           0.96      1765
   macro avg       0.95      0.93      0.94      1765
weighted avg       0.96      0.96      0.96      1765
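
Since the goal is to identify which factors predict adoption, one possible next step is to inspect the fitted model's feature importances. The sketch below uses synthetic data with a label that depends only on `timedelta` by construction, so it says nothing about the real results; `demo_X`, `demo_y`, and the 60-day threshold are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic features that reuse two column names from the study data
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    'timedelta': rng.integers(0, 400, size=500),
    'opted_in_to_mailing_list': rng.integers(0, 2, size=500),
})
# Synthetic label that depends only on timedelta (an assumption for the demo)
demo_y = (demo_X['timedelta'] > 60).astype(int)

model = GradientBoostingClassifier(random_state=42).fit(demo_X, demo_y)

# Importances are normalized to sum to 1; larger means more influence on the splits
importances = pd.Series(model.feature_importances_, index=demo_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same two lines at the end could be applied to the `gb` or `rf` models fitted above, using `x_train.columns` as the index.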



It seems that the logistic regression model predicts adopted users with the 