### Relax Take Home Challenge

In [1]:
import pandas as pd
import datetime

#### Read and Explore Data

There are two CSV files. One contains login timestamps and the other contains user information. First I will read and inspect the user engagement file, which contains the login timestamps.

In [2]:
engagement = pd.read_csv("takehome_user_engagement.csv")
print(engagement.head())
print("")
print(engagement.describe())
print("")
print(engagement.info())

            time_stamp  user_id  visited
0  2014-04-22 03:53:30        1        1
1  2013-11-15 03:45:04        2        1
2  2013-11-29 03:45:04        2        1
3  2013-12-09 03:45:04        2        1
4  2013-12-25 03:45:04        2        1

             user_id   visited
count  207917.000000  207917.0
mean     5913.314197       1.0
std      3394.941674       0.0
min         1.000000       1.0
25%      3087.000000       1.0
50%      5682.000000       1.0
75%      8944.000000       1.0
max     12000.000000       1.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB
None


In [3]:
engagement.dtypes

time_stamp    object
user_id        int64
visited        int64
dtype: object

In [4]:
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'])

There are 207,917 logins recorded in the file, which contains a `time_stamp` variable and the `user_id`. The `visited` variable has a value of 1 in every row and can be ignored. There are no null values.

Next I will look at the `users` file.

In [5]:
users = pd.read_csv("takehome_users.csv", encoding = 'latin-1')
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [6]:
users.describe()

Unnamed: 0,object_id,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,12000.0,8823.0,12000.0,12000.0,12000.0,6417.0
mean,6000.5,1379279000.0,0.2495,0.149333,141.884583,5962.957145
std,3464.24595,19531160.0,0.432742,0.356432,124.056723,3383.761968
min,1.0,1338452000.0,0.0,0.0,0.0,3.0
25%,3000.75,1363195000.0,0.0,0.0,29.0,3058.0
50%,6000.5,1382888000.0,0.0,0.0,108.0,5954.0
75%,9000.25,1398443000.0,0.0,0.0,238.25,8817.0
max,12000.0,1402067000.0,1.0,1.0,416.0,11999.0


In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


The `last_session_creation_time` and `invited_by_user_id` variables have missing values.

I suspect the values in `last_session_creation_time` are not truly missing; those users may simply never have had a session. I can test this by checking whether the users who have a last session creation time are the same as the users appearing in the engagement file.

In [8]:
users_set = set(users.object_id[users.last_session_creation_time.notnull()])
engagement_set = set(engagement.user_id)

In [9]:
users_set - engagement_set

set()

The set difference is empty, meaning every user in the users file with a non-null `last_session_creation_time` also appears in the engagement file. This is consistent with the null `last_session_creation_time` values not being missing data: those users simply haven't had a session.
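The check above only tests one direction (`users_set - engagement_set`). The following is a minimal sketch of how both directions could be compared. It uses made-up toy IDs rather than the challenge data, and the names `toy_users`, `toy_engagement`, `with_session`, and `logged_in` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the two files
toy_users = pd.DataFrame({
    "object_id": [1, 2, 3, 4],
    "last_session_creation_time": [1398139000.0, None, 1363735000.0, None],
})
toy_engagement = pd.DataFrame({"user_id": [1, 1, 3]})

# Users that have a last session recorded in the users table
with_session = set(toy_users.object_id[toy_users.last_session_creation_time.notnull()])
# Users that appear at least once in the engagement table
logged_in = set(toy_engagement.user_id)

# Users with a last session but no recorded login
only_in_users = with_session - logged_in
# Users with a recorded login but no last session
only_in_engagement = logged_in - with_session

print(only_in_users, only_in_engagement, with_session == logged_in)
```

If both differences are empty, the two sets are equal, which is the stronger statement that the two files describe exactly the same group of active users.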

Similarly, the null values in the `invited_by_user_id` variable are not missing data; these users weren't invited by another user. This is confirmed by the `creation_source` variable, which is only `GUEST_INVITE` or `ORG_INVITE` for users invited by another user, and `SIGNUP`, `PERSONAL_PROJECTS`, or `SIGNUP_GOOGLE_AUTH` when `invited_by_user_id` is missing:

In [10]:
# creation source when invited_by_user_id is not null
users[users.invited_by_user_id.notnull()].creation_source.value_counts()

ORG_INVITE      4254
GUEST_INVITE    2163
Name: creation_source, dtype: int64

In [11]:
# creation source when invited_by_user_id is null
users[users.invited_by_user_id.isnull()].creation_source.value_counts()

PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64
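The two `value_counts()` calls above can also be viewed as a single cross-tabulation. This is a small sketch on hypothetical toy rows (not the challenge data); the column label `has_inviter` is an illustrative name introduced here.

```python
import pandas as pd

# Hypothetical toy rows, not the challenge data
toy_users = pd.DataFrame({
    "creation_source": ["ORG_INVITE", "GUEST_INVITE", "SIGNUP",
                        "PERSONAL_PROJECTS", "ORG_INVITE"],
    "invited_by_user_id": [316.0, 10803.0, None, None, 1525.0],
})

# Rows: creation source; columns: whether an inviter is recorded
table = pd.crosstab(
    toy_users["creation_source"],
    toy_users["invited_by_user_id"].notnull().rename("has_inviter"),
)
print(table)
```

If the reasoning holds, each creation source has a nonzero count in only one of the two columns.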

The individual user who invited the user is not really relevant; the values in the `creation_source` variable are all that is needed for modeling.

I will drop `invited_by_user_id` and convert the `creation_source` variable into dummy variables. Name and email can also be dropped. The cell below additionally drops `creation_time`, `last_session_creation_time`, and `org_id`, leaving only the mailing-list flags and the creation source as features.

In [12]:
users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [13]:
drop_cols = ['invited_by_user_id', 'name', 'email', 'creation_time','last_session_creation_time', 'org_id']
users.drop(drop_cols, axis = 1, inplace = True)

In [15]:
users = pd.get_dummies(users, columns = ['creation_source'], drop_first = False)

In [16]:
users.head()

Unnamed: 0,object_id,opted_in_to_mailing_list,enabled_for_marketing_drip,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1,1,0,1,0,0,0,0
1,2,0,0,0,1,0,0,0
2,3,0,0,0,1,0,0,0
3,4,0,0,1,0,0,0,0
4,5,0,0,1,0,0,0,0


In [17]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 8 columns):
object_id                             12000 non-null int64
opted_in_to_mailing_list              12000 non-null int64
enabled_for_marketing_drip            12000 non-null int64
creation_source_GUEST_INVITE          12000 non-null uint8
creation_source_ORG_INVITE            12000 non-null uint8
creation_source_PERSONAL_PROJECTS     12000 non-null uint8
creation_source_SIGNUP                12000 non-null uint8
creation_source_SIGNUP_GOOGLE_AUTH    12000 non-null uint8
dtypes: int64(3), uint8(5)
memory usage: 339.9 KB


#### Challenge Part 1: Define "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period.

The target feature is whether the user is adopted, defined as a user who has logged into the product on three separate days in at least one seven-day period.

To calculate this variable, I will first create a dictionary mapping each user to a True or False value for whether or not they are adopted. I will build it by looping over the users in the engagement data and testing whether any 7-day window starting at one of that user's logins contains at least 3 time stamps. Note that this counts time stamps rather than distinct calendar days.

In [18]:
seven_days = datetime.timedelta(7)

#Initialize empty dictionary for adopted status
adopted_dict = {}

#Loop over unique users
for user_id in sorted(list(engagement.user_id.unique())):
    
    adopted_user = False
    user_stamps = engagement[engagement.user_id == user_id].sort_values('time_stamp')
    # Skip items with less than 3 logins
    if len(user_stamps) < 3:
        adopted_dict[user_id] = adopted_user
        continue
    #For users with 3 or more logins, change adopted_user to True if any three were within 7 days
    for time_stamp in user_stamps['time_stamp']:
        # Logins falling in the 7-day window that starts at this login
        window = user_stamps[(user_stamps['time_stamp'] >= time_stamp) & (user_stamps['time_stamp'] <= (time_stamp + seven_days))]
        if len(window) >= 3:
            adopted_user = True
            break
    
    adopted_dict[user_id] = adopted_user
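The loop above rescans each user's logins for every time stamp. As an alternative, here is a hedged sketch of a grouped version that also reduces time stamps to distinct calendar days first. The helper `is_adopted`, its `window` parameter, and the toy frame are hypothetical. The default window reuses the same `<= 7 days` comparison as the loop above, which is an assumption about how "seven-day period" should be read.

```python
import pandas as pd

def is_adopted(stamps: pd.Series, window: pd.Timedelta = pd.Timedelta(days=7)) -> bool:
    """Return True if three distinct login days fall within `window` of each other."""
    # Reduce time stamps to unique calendar days, in chronological order
    days = stamps.dt.normalize().drop_duplicates().sort_values().reset_index(drop=True)
    if len(days) < 3:
        return False
    # Gap between each login day and the login day two positions later
    span = days.shift(-2) - days
    # Comparisons against NaT (the last two rows) evaluate to False
    return bool((span <= window).any())

# Hypothetical toy logins, not the challenge data
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "time_stamp": pd.to_datetime([
        "2014-01-01 08:00", "2014-01-03 09:00", "2014-01-06 10:00",  # user 1: three days in one week
        "2014-01-01 08:00", "2014-01-01 20:00", "2014-02-01 08:00",  # user 2: only two distinct days
        "2014-01-01 08:00",                                          # user 3: a single login
    ]),
})

adopted_sketch = toy.groupby("user_id")["time_stamp"].apply(is_adopted).to_dict()
print(adopted_sketch)
```

Because the days are sorted, it is enough to compare each day with the day two positions later: if any three days fit in the window, some run of three consecutive days does too.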

Now I can use the `adopted_dict` to create the target variable in the users data frame.

In [19]:
users['adopted'] = False

In [20]:
# Users with no recorded logins are absent from adopted_dict, so default them to False
users['adopted'] = users['object_id'].map(lambda user_id: adopted_dict.get(user_id, False))

In [21]:
users.head()

Unnamed: 0,object_id,opted_in_to_mailing_list,enabled_for_marketing_drip,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH,adopted
0,1,1,0,1,0,0,0,0,False
1,2,0,0,0,1,0,0,0,True
2,3,0,0,0,1,0,0,0,False
3,4,0,0,1,0,0,0,0,False
4,5,0,0,1,0,0,0,0,False


In [22]:
users.drop('object_id', axis = 1, inplace = True)

In [24]:
X = users.drop('adopted', axis = 1)
y = users.adopted

In [25]:
X.shape

(12000, 7)

#### Challenge Part 2: Identify which factors predict future user adoption.

I will use a random forest classifier to look for important features.

In [28]:
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [37]:
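# Use ADASYN, a synthetic oversampling method related to SMOTE, to balance the classes in the training set

from imblearn.over_sampling import ADASYN
from collections import Counter

# fit_resample replaces the older fit_sample name in newer imbalanced-learn releases
X_resampled, y_resampled = ADASYN().fit_resample(X_train, y_train)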

print(sorted(Counter(y_resampled).items()))

[(False, 7229), (True, 6703)]


In [38]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(criterion='gini', class_weight='balanced')

param_grid = {'n_estimators' : [40, 60, 80], 'min_samples_split' : [2, 3, 4], 
              'max_depth' : [4, 7, 10]}

rf_cv = GridSearchCV(rf, param_grid, cv = 5)

rf_cv.fit(X_resampled, y_resampled)

#Print out the best model
print('Best RF Params: {}'.format(rf_cv.best_params_))
print('Best RF Score : %f' % rf_cv.best_score_)

Best RF Params: {'max_depth': 4, 'min_samples_split': 3, 'n_estimators': 60}
Best RF Score : 0.549813


In [43]:
rf = RandomForestClassifier(n_jobs = -1, n_estimators = 60, max_depth = 4, min_samples_split = 3, oob_score = True)
rf.fit(X_resampled, y_resampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=60, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)

In [44]:
import numpy as np
from sklearn.metrics import mean_squared_error

rf_predictions = rf.predict(X_test)
rf_train_score = rf.score(X_train, y_train)
rf_test_score = rf.score(X_test, y_test)

In [45]:
print('Random Forest Train Score:  ', rf_train_score)
print('Random OOB Score:           ', rf.oob_score_)
print('Random Forest Test Score:   ', rf_test_score)

Random Forest Train Score:   0.6203571428571428
Random OOB Score:            0.5572782084409992
Random Forest Test Score:    0.6055555555555555


In [46]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, rf_predictions)

array([[1977, 1138],
       [ 282,  203]])
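Accuracy alone hides how the model treats the minority class. As a worked example, the sketch below recomputes accuracy, precision, and recall for the adopted class directly from the confusion matrix printed above. It assumes scikit-learn's usual layout, with rows as true labels and columns as predictions, ordered `[False, True]`.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions,
# assumed to be ordered [False, True]
cm = np.array([[1977, 1138],
               [ 282,  203]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test users classified correctly
precision = tp / (tp + fp)        # share of predicted-adopted users who are adopted
recall = tp / (tp + fn)           # share of adopted users the model finds

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

The accuracy works out to 2180 / 3600, about 0.606, which matches the test score above. Precision for the adopted class is about 0.151 and recall about 0.419, so most users flagged as adopted are not, and more than half of the adopted users are missed.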

In [47]:
fi = pd.DataFrame(rf.feature_importances_, columns = ['importance'])
label = pd.DataFrame(X.columns, columns = ['feature'])
feature_imp = pd.concat([label, fi], axis = 1)

In [48]:
feature_imp.sort_values('importance', ascending = False)

Unnamed: 0,feature,importance
4,creation_source_PERSONAL_PROJECTS,0.581204
6,creation_source_SIGNUP_GOOGLE_AUTH,0.195114
2,creation_source_GUEST_INVITE,0.084885
1,enabled_for_marketing_drip,0.041801
5,creation_source_SIGNUP,0.037103
0,opted_in_to_mailing_list,0.032718
3,creation_source_ORG_INVITE,0.027174
