<div class="alert alert-success">
<h1>SUMMARY</h1>
</div>

Overall, org_id and invited_by_user_id are the most predictive of whether or not a user will adopt the product.  In particular, the organizations with the ids 395, 392, 345, 305, and 387 have the highest mean user adoption rate.

**Summary of approach**

The factors org_id, invited_by_user_id, creation_source, opted_in_to_mailing_list, and enabled_for_marketing_drip were entered as predictor variables in a logistic regression model predicting adoption (1 or 0).  All predictor variables were scaled to be between 0 and 1.  The data were split into training and test sets (70% and 30%, respectively), and 5-fold cross-validation was performed.  The class_weight parameter was adjusted to account for the class imbalance (lower rate of 1 [adopted] vs. 0 [not adopted]).  Overall, the model achieved an AUC of 0.59 and F1 of 0.28.

(See code below for more details)

# Setup

## Import libraries

In [1]:
import pandas as pd
import chardet
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline 

## Read in files

In [2]:
engage = pd.read_csv('takehome_user_engagement.csv')
chardet.detect(open('takehome_users.csv','rb').read())
users = pd.read_csv('takehome_users.csv', encoding='ISO-8859-1', index_col='object_id')
users.index.names = ['user_id']

# Identify adopted users
(Users who have logged into product on three separate days in at least one seven-day period)

In [3]:
# convert time_stamp column to datetime object
engage['time_stamp'] = pd.to_datetime(engage['time_stamp'])

In [4]:
# create date column from time_stamp column
engage['date'] = engage['time_stamp'].dt.date

# delete time_stamp and visited columns
engage.drop(['visited', 'time_stamp'], axis=1, inplace=True)

In [5]:
def adopt_function(x):
    
    """
    Determine whether user logged in on three separate days in at least one seven day period
    If so, return 1; if not, return 0
    """
    
    from datetime import datetime, timedelta
    
    # create variable to store result (whether user is/isn't adopted)
    adopted = 0
    
    # convert inputted dates to list
    x = list(x)

    # sort user login dates
    x.sort()

    # create list to store time differences between each day and next day
    tds = []

    # calculate difference between current and next day and add to list 'tds'
    for i in range(len(x)-2):
        tds.append(x[i+1] - x[i])

    # create list to store instances where user logged in 3 or more times in 7 days
    logins = []
    
    # add up current day, day+1, and day+2
    # if any sum is less than or equal to 7 days, return 1 (else, return 0)
    for i in range(len(tds)-2):
        tds_sum = (tds[i] + tds[i+1] + tds[i+2])
        if tds_sum <= timedelta(days=7):
            adopted = 1

    return adopted

In [6]:
# for each user, determine whether they're considered "adopted"
adopted = engage.groupby('user_id').agg(adopt_function)
adopted.columns = ['adopted']

# add adopted column to user df
users = users.merge(adopted, left_index=True, right_index=True, how='left')

# Identifying predictive features

## Prepare data for modeling

In [7]:
df_mod = users.copy()

# remove rows with no entry for adopted
df_mod = df_mod[pd.notnull(df_mod['adopted'])]

# delete unnecessary columns (name, email)
df_mod.drop(['name', 'email'], axis=1, inplace=True)

# drop the 'creation_time' and 'last_session_creation_time' columns
df_mod.drop(['creation_time', 'last_session_creation_time'], axis=1, inplace=True)

# replace text columns with numeric representations
creation_dict = {'ORG_INVITE': 1,
                 'SIGNUP': 2,
                 'GUEST_INVITE': 3,
                 'SIGNUP_GOOGLE_AUTH': 4,
                 'PERSONAL_PROJECTS': 5}

df_mod['creation_source'] = df_mod['creation_source'].map(creation_dict)

# drop rows with NaNs
df_mod = df_mod.dropna() 

## Modeling setup

### Create X and y and scale values

In [8]:
# create X and y
y = df_mod['adopted']
X_cols = [x for x in df_mod.columns if (x != 'adopted')]
X = df_mod[X_cols]

# scale values in dataframe to be between 0 and 1
scaler = MinMaxScaler()
scaler.fit(X)
scaler.fit_transform(X)

array([[ 1.        ,  1.        ,  0.        ,  0.02650602,  0.9003001 ],
       [ 0.        ,  0.        ,  0.        ,  0.00240964,  0.02609203],
       [ 0.        ,  0.        ,  0.        ,  0.22650602,  0.12687563],
       ..., 
       [ 0.        ,  0.        ,  0.        ,  0.06024096,  0.32852618],
       [ 0.        ,  0.        ,  0.        ,  0.21445783,  0.68856285],
       [ 1.        ,  1.        ,  1.        ,  0.2       ,  0.6728076 ]])

### Create training and test sets

In [9]:
from sklearn.cross_validation import train_test_split

_, itest = train_test_split(range(df_mod.shape[0]), train_size = 0.7)
mask = np.zeros(df_mod.shape[0], dtype=np.bool)
mask[itest] = True

test = df_mod[mask]
train = df_mod[~mask]
Xtest = test[X_cols]
ytest = test['adopted']
Xtrain = train[X_cols]
ytrain = train['adopted']



In [10]:
# make sure proportion of classes in training and test sets is about equal
prop_adopt = sum(train['adopted'])/len(train)
print('Percent of adopted (1) in training set: {:.2f}%'.format(prop_adopt*100))

prop_adopt = sum(test['adopted'])/len(test)
print('Percent of adopted (1) in test set: {:.2f}%'.format(prop_adopt*100))

Percent of adopted (1) in training set: 15.70%
Percent of adopted (1) in test set: 16.12%


## Functions for modeling

In [11]:
"""
==================================================================
Perform cross-validation
==================================================================
"""

def cv_optimize(clf, dict_params, Xtrain, ytrain, scorer, n_folds):
    
    from sklearn.model_selection import GridSearchCV

    gs = GridSearchCV(clf,
                      param_grid = dict_params,
                      cv = n_folds,
                      scoring = scorer) 
    
    gs.fit(Xtrain, ytrain)
    
    print('Cross validation results:')
    print('Best parameters: ' + str(gs.best_params_))
    print('Best score: {:.2f}\n'.format(gs.best_score_))
    
    return gs.best_estimator_, gs.best_params_

"""
==================================================================
Evaluate model
==================================================================
"""

def model_eval(clf, Xtrain, ytrain, Xtest, ytest):
    
    # get predictions for y
    ypred_test = clf.predict(Xtest)
    ypred_train = clf.predict(Xtrain)
    ypred_proba = clf.predict_proba(Xtest)
    auc = metrics.roc_auc_score(ytest, ypred_proba[:,1])

    # create confusion matrix
    cnf_matrix = metrics.confusion_matrix(ytest, ypred_test)
    
    # print model performance summary
    print('Training accuracy: {:.2f}%'.format(100*metrics.accuracy_score(ypred_train, ytrain)),
          '\nTest accuracy: {:.2f}%'.format(100*metrics.accuracy_score(ypred_test, ytest)),
          '\nPrecision: {:.2f}'.format(metrics.precision_score(ytest, ypred_test)),
          '\nRecall: {:.2f}'.format(metrics.recall_score(ytest, ypred_test)),
          '\nF1: {:.2f}'.format(metrics.f1_score(ytest, ypred_test)),
          '\nAUC: {:.2f}'.format(auc),
          '\nTrue Positive Rate: {:.2f}'.format(float(cnf_matrix[0][0])/np.sum(cnf_matrix[0])),
          '\nTrue Negative Rate: {:.2f}'.format(float(cnf_matrix[1][1])/np.sum(cnf_matrix[1]))
         )

    print('Confusion matrix:')
    print(cnf_matrix)    

## Logistic regression model

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

clf = LogisticRegression()

dict_params = {'C': [.001, .001, .01, .1, 1],
               'class_weight': [{0:.15, 1:.85}], # to account for smaller proportion of adopted
               'fit_intercept': [False]}

scorer = metrics.make_scorer(metrics.auc, reorder=True)

# perform 5-fold cross-validation
clf_LR, best_params_LR = cv_optimize(clf, dict_params, Xtrain, ytrain, scorer, n_folds=5)

# print model evaluation metrics
model_eval(clf_LR, Xtrain, ytrain, Xtest, ytest)

# print feature importances
coefs = np.std(Xtest, 0)*clf_LR.coef_[0,0:]
coef_df = pd.DataFrame({'Feature': X_cols,
                        'Coefficient': coefs,
                        'Coeff_AV': abs(coefs)})
coef_df.sort_values(by = 'Coeff_AV', ascending = False, inplace=True)
coef_df

Cross validation results:
Best parameters: {'C': 0.001, 'class_weight': {0: 0.15, 1: 0.85}, 'fit_intercept': False}
Best score: 0.50

Training accuracy: 43.94% 
Test accuracy: 42.50% 
Precision: 0.18 
Recall: 0.71 
F1: 0.28 
AUC: 0.56 
True Positive Rate: 0.37 
True Negative Rate: 0.71
Confusion matrix:
[[446 756]
 [ 68 163]]


Unnamed: 0,Coeff_AV,Coefficient,Feature
org_id,0.137914,0.137914,org_id
invited_by_user_id,0.055387,-0.055387,invited_by_user_id
creation_source,0.011249,0.011249,creation_source
enabled_for_marketing_drip,0.001403,-0.001403,enabled_for_marketing_drip
opted_in_to_mailing_list,0.000475,0.000475,opted_in_to_mailing_list


## Examine top org_ids

In [13]:
# print top 5 organizations with the highest mean adoption rate
df_temp = df_mod[['org_id','adopted']]
df_temp = df_temp.groupby('org_id').mean()
df_temp.sort_values('adopted', ascending=False).head()

Unnamed: 0_level_0,adopted
org_id,Unnamed: 1_level_1
395,0.75
392,0.75
345,0.666667
305,0.666667
387,0.666667
