# Relax Challenge:

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.

We suggest spending 1 to 2 hours on this, but you're welcome to spend more or less.
Please send us a brief writeup of your findings (the more concise, the better; no more
than one page), along with any summary tables, graphs, code, or queries that can help
us understand your approach. Please note any factors you considered or investigation
you did, even if they did not pan out. Feel free to identify any further research or data
you think would be valuable.

## Summary Report and Recommendation for further research

My EDA below shows that the largest number of adopted users (574 of 1,656) came from the ORG_INVITE creation source. There were no notable "power inviters": no single user invited more than 4 users who went on to become adopted.

I tried 8 classifiers with initial hyperparameters, and all performed poorly on recall (the true positive rate) for adopted users: the best recall was about 0.11, and three models scored 0.0. Additional research and hyperparameter tuning could be done to increase recall.

The classes are imbalanced: only 1,656 of the 8,823 users who logged in during the period are adopted users. Additional modelling could upsample the adopted class to correct for this imbalance. One such method is SMOTE.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df1 = pd.read_csv('takehome_user_engagement.csv',encoding = 'utf-8')
df2 = pd.read_csv('takehome_users.csv',encoding = 'latin')

### Identify Adopted User

A usage summary table ("takehome_user_engagement") that has a row for each day
that a user logged into the product.

# EDA

Part 1: Inspect the data and build a list of users who logged in on 3 or more days within a 7-day period

Inspect the data

In [3]:
df1.head(10)

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
5,2013-12-31 03:45:04,2,1
6,2014-01-08 03:45:04,2,1
7,2014-02-03 03:45:04,2,1
8,2014-02-08 03:45:04,2,1
9,2014-02-09 03:45:04,2,1


In [4]:
df1.tail()

Unnamed: 0,time_stamp,user_id,visited
207912,2013-09-06 06:14:15,11996,1
207913,2013-01-15 18:28:37,11997,1
207914,2014-04-27 12:45:16,11998,1
207915,2012-06-02 11:55:59,11999,1
207916,2014-01-26 08:57:12,12000,1


Check for null values

In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


Confirm that the 'visited' column contains 1s and only 1s

In [12]:
# all visited values are 1 and only 1
df1['visited'].sum() == len(df1)

True

How many unique users out of the 12,000 logged in at least once over the period?

In [21]:
#out of 12000 user ID how many logged in over the period?
len(df1['user_id'].unique())

8823

Convert the time_stamp field to a date, dropping hours:mins:sec

In [31]:
df1['time_stamp_day']= df1['time_stamp'].values.astype('datetime64[D]')
df1.head()

Unnamed: 0,time_stamp,user_id,visited,time_stamp_day
0,2014-04-22 03:53:30,1,1,2014-04-22
1,2013-11-15 03:45:04,2,1,2013-11-15
2,2013-11-29 03:45:04,2,1,2013-11-29
3,2013-12-09 03:45:04,2,1,2013-12-09
4,2013-12-25 03:45:04,2,1,2013-12-25
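
As an alternative that stays within pandas, the same truncation to whole days can be sketched with `pd.to_datetime` and `.dt.normalize()`. The small `toy` frame below is an illustrative stand-in for the engagement file, not the real data.

```python
import pandas as pd

# Toy frame standing in for takehome_user_engagement.csv (illustrative values).
toy = pd.DataFrame({
    "time_stamp": ["2014-04-22 03:53:30", "2013-11-15 03:45:04"],
    "user_id": [1, 2],
})

# Parse the strings into Timestamps, then truncate each one to midnight of its day.
toy["time_stamp"] = pd.to_datetime(toy["time_stamp"])
toy["time_stamp_day"] = toy["time_stamp"].dt.normalize()

print(toy)
```

Keeping the column as a pandas datetime makes later date arithmetic (differences, grouping by day) straightforward.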


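Loop through the dataframe, comparing each row with the row 2 positions earlier (so the pair spans 3 consecutive logins), to see if they are
- same user_id
- within 7 days

and add that user_id to a list. This relies on the rows being ordered by user and then by time, as the preview above suggests.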

In [120]:
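user_logs = []

# start at 2 so that i-2 never wraps around to the end of the dataframe
for i in range(2, len(df1)):

    if df1.iloc[i]['user_id'] == df1.iloc[i-2]['user_id']:
        # the login two rows back (3 logins in total) belongs to the same user

        tdelta = df1.iloc[i]['time_stamp_day'] - df1.iloc[i-2]['time_stamp_day']
        # days elapsed between the first and the third of these logins

        if tdelta.days <= 7:
            # the 3 logins are at most 7 days apart
            # (a strict seven-day window would use tdelta.days < 7)
            user_logs.append(df1.iloc[i]['user_id'])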
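
The row-by-row loop above can be slow on 200k rows. A vectorized version of the same idea might look like the sketch below, which compares each login day with the one two rows earlier for the same user via `groupby(...).shift(2)`. The function name `find_adopted`, the `max_gap_days` parameter, and the toy data are assumptions made for illustration; `max_gap_days=7` mirrors the `<= 7` condition in the loop, while `6` would correspond to a strict seven-day window.

```python
import pandas as pd

def find_adopted(logins: pd.DataFrame, max_gap_days: int = 7) -> list:
    """Return sorted user_ids having 3 distinct login days at most max_gap_days apart."""
    days = (
        logins[["user_id", "time_stamp_day"]]
        .drop_duplicates()
        .sort_values(["user_id", "time_stamp_day"])
    )
    # Login day two rows earlier for the same user (NaT for a user's first two rows).
    two_back = days.groupby("user_id")["time_stamp_day"].shift(2)
    gap = (days["time_stamp_day"] - two_back).dt.days
    hit = gap <= max_gap_days  # NaN gaps compare as False
    return sorted(days.loc[hit, "user_id"].unique().tolist())

# Hypothetical toy logins: user 1 spans 4 days, user 2 is spread out,
# user 3 has only two logins, user 4 spans exactly 7 days.
toy_logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    "time_stamp_day": pd.to_datetime([
        "2014-01-01", "2014-01-03", "2014-01-05",
        "2014-01-01", "2014-01-10", "2014-01-20",
        "2014-02-01", "2014-02-02",
        "2014-03-01", "2014-03-04", "2014-03-08",
    ]),
})

print(find_adopted(toy_logins))      # same threshold as the loop
print(find_adopted(toy_logins, 6))   # strict seven-day window
```

User 4 in the toy data shows how the choice of threshold changes who counts as adopted, so the adopted total depends on which reading of "seven-day period" is used.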


Convert the list of User IDs to a set (of unique values) and then sort

In [126]:
adopted = sorted(list(set(user_logs)))

In [127]:
adopted[:5]

[2, 10, 20, 33, 42]

---

# EDA part 2

Explore the dataset of users

In [10]:
df2.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [24]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


Create a Target label feature "adopted_user" using the list of frequent users

In [131]:
# flag users whose object_id appears in the adopted list
df2['adopted_user'] = df2['object_id'].isin(adopted).astype(int)

In [132]:
df2.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0


The total number of adopted users is 1,656

In [249]:
df2['adopted_user'].sum()

1656

In [183]:
source = df2.groupby('creation_source')['adopted_user'].sum()
source

creation_source
GUEST_INVITE          369
ORG_INVITE            574
PERSONAL_PROJECTS     172
SIGNUP                302
SIGNUP_GOOGLE_AUTH    239
Name: adopted_user, dtype: int64

In [184]:
source = pd.DataFrame(source)

In [195]:
source = source.reset_index()

It appears the largest source of adopted users was the ORG_INVITE channel

In [196]:
source.head()

Unnamed: 0,creation_source,adopted_user
0,GUEST_INVITE,369
1,ORG_INVITE,574
2,PERSONAL_PROJECTS,172
3,SIGNUP,302
4,SIGNUP_GOOGLE_AUTH,239
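
Raw counts depend on how many users each channel brought in, so a useful follow-up would be the adoption rate per channel. The sketch below shows one way to compute it with a named aggregation; the `toy_users` frame is made up for illustration and does not reflect the real proportions.

```python
import pandas as pd

# Hypothetical toy data: ORG_INVITE has more adopted users in absolute terms,
# but GUEST_INVITE has the higher adoption rate.
toy_users = pd.DataFrame({
    "creation_source": ["ORG_INVITE"] * 6 + ["GUEST_INVITE"] * 2,
    "adopted_user":    [1, 1, 0, 0, 0, 0, 1, 0],
})

by_source = (
    toy_users.groupby("creation_source")["adopted_user"]
    .agg(adopted="sum", users="count", rate="mean")
    .sort_values("rate", ascending=False)
)
print(by_source)
```

Applied to `df2`, the same aggregation would show whether the channel with the most adopted users also converts best, which is the more relevant question for predicting adoption.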


Were there any particular 'power users' who invited more adopted users than others? It appears not: only 2 inviters reached 4 adopted users

In [247]:
invited = df2.groupby('invited_by_user_id')['adopted_user'].sum().sort_values().tail()
invited

invited_by_user_id
7882.0     3
9726.0     3
11267.0    3
10628.0    4
2354.0     4
Name: adopted_user, dtype: int64

From the 12,000-row dataframe, create a dataframe with the features we will use

In [134]:
df2.columns

Index(['object_id', 'creation_time', 'name', 'email', 'creation_source',
       'last_session_creation_time', 'opted_in_to_mailing_list',
       'enabled_for_marketing_drip', 'org_id', 'invited_by_user_id',
       'adopted_user'],
      dtype='object')

In [210]:
features = df2[['creation_source','opted_in_to_mailing_list','enabled_for_marketing_drip','org_id']]
labels = df2['adopted_user']

In [211]:
features.head()

Unnamed: 0,creation_source,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id
0,GUEST_INVITE,1,0,11
1,ORG_INVITE,0,0,1
2,ORG_INVITE,0,0,94
3,GUEST_INVITE,0,0,1
4,GUEST_INVITE,0,0,193


In [212]:
features = pd.get_dummies(features)

In [213]:
features.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1,0,11,1,0,0,0,0
1,0,0,1,0,1,0,0,0
2,0,0,94,0,1,0,0,0
3,0,0,1,1,0,0,0,0
4,0,0,193,1,0,0,0,0


In [214]:
#remove the excess dummy variable from the creation source spread
features.drop('creation_source_PERSONAL_PROJECTS', axis=1,inplace=True)
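
The encode-then-drop step can also be done in one call with the `drop_first` option of `pd.get_dummies`. Note that this is a sketch on toy data, and that `drop_first=True` removes the first category in sorted order (here `GUEST_INVITE`) rather than `PERSONAL_PROJECTS`, so the reference category differs from the one chosen above.

```python
import pandas as pd

# Toy frame with the same kind of columns as `features` (illustrative values).
toy = pd.DataFrame({
    "creation_source": ["GUEST_INVITE", "ORG_INVITE", "PERSONAL_PROJECTS", "SIGNUP"],
    "org_id": [11, 1, 94, 193],
})

# One-hot encode creation_source and drop one level to avoid a redundant column.
encoded = pd.get_dummies(toy, columns=["creation_source"], drop_first=True)
print(encoded.columns.tolist())
```

Either approach leaves one fewer dummy column than there are categories, which avoids perfectly collinear inputs for the linear models.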

Import classification Algorithms and Metrics:

In [224]:
#Import the classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score,recall_score, f1_score, classification_report, confusion_matrix

from sklearn.model_selection import train_test_split

Instantiate the classifiers with hyperparameters

In [139]:
#Assign the classifiers 
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier(n_neighbors=5)
mnb = MultinomialNB(alpha=0.2)
dtc = DecisionTreeClassifier(min_samples_split=7, random_state=111)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=31, random_state=111)
abc = AdaBoostClassifier(n_estimators=62, random_state=111)
bc = BaggingClassifier(n_estimators=9, random_state=111)

Create a dictionary to map the classifiers

In [140]:
clfs = {'SVC' : svc,'KN' : knc, 'NB': mnb, 'DT': dtc, 'LR': lrc, 'RF': rfc, 'AdaBoost': abc, 'BgC': bc}

Create a classifier trainer function

In [141]:
def train_classifier(clf, feature_train, labels_train):
    """create the classfier feeder to fit the features and labels"""
    clf.fit(feature_train, labels_train)

Create a predictor function

In [142]:
def predict_labels(clf, features):
    """predicted values from the features matrix"""
    return (clf.predict(features))

In [216]:
#Select Training and Test data 
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.4, random_state=111)

In [231]:
pred_scores = []
for k,v in clfs.items():
    train_classifier(v, features_train, labels_train)
    pred = predict_labels(v,features_test)
    pred_scores.append((k, accuracy_score(labels_test,pred),recall_score(labels_test,pred)))

In [239]:
df_result =pd.DataFrame(pred_scores)
df_result.columns=['Classifier', 'Accuracy','Recall (True Positive Rate)']
df_result

Unnamed: 0,Classifier,Accuracy,Recall (True Positive Rate)
0,SVC,0.740625,0.084948
1,KN,0.846667,0.028316
2,NB,0.860208,0.0
3,DT,0.811875,0.080477
4,LR,0.860208,0.0
5,RF,0.806042,0.114754
6,AdaBoost,0.860208,0.0
7,BgC,0.808542,0.110283
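
The accuracy of 0.860208 shared by NB, LR and AdaBoost, together with their recall of 0.0, matches what a model would score by predicting "not adopted" for every user. The sketch below checks this arithmetic, assuming the test-set class counts reported in the classification report for this split (4,129 non-adopted and 671 adopted).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the test-set support in the classification report.
n_negative, n_positive = 4129, 671
y_true = np.array([0] * n_negative + [1] * n_positive)

# A "model" that always predicts the majority class (not adopted).
y_all_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_zero)
baseline_recall = recall_score(y_true, y_all_zero)
print(round(baseline_accuracy, 6), baseline_recall)
```

This is why accuracy alone is a misleading metric here: on this split, any classifier scoring below about 0.86 accuracy is doing worse than the trivial baseline on that measure, even if its recall is higher.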


In [240]:
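#Print classification report and compute confusion matrix for BgC
pred_bgc = predict_labels(bc, features_test)  # bc was already fitted in the loop above
print ('\nClassification Report: BgC\n', classification_report(labels_test,pred_bgc))
confusion_matrix_graph = confusion_matrix(labels_test,pred_bgc)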


Classification Report: BgC
              precision    recall  f1-score   support

          0       0.86      0.92      0.89      4129
          1       0.19      0.11      0.14       671

avg / total       0.77      0.81      0.79      4800
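
The summary recommends upsampling the adopted class. As a minimal sketch of that idea, the block below uses plain random oversampling with `sklearn.utils.resample`; this is a simpler technique than SMOTE (which is provided by the separate imbalanced-learn package and synthesizes new points rather than repeating existing ones). The `train` frame, its single `feature` column, and the 14/86 class split are made-up assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical training split with roughly the same imbalance as the real labels.
rng = np.random.default_rng(111)
train = pd.DataFrame({
    "feature": rng.normal(size=100),
    "adopted_user": [1] * 14 + [0] * 86,
})

majority = train[train["adopted_user"] == 0]
minority = train[train["adopted_user"] == 1]

# Sample the minority class with replacement until it matches the majority size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=111
)

# Recombine and shuffle so the classes are interleaved.
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=111)
print(balanced["adopted_user"].value_counts())
```

Resampling should be applied to the training split only, so that the test set keeps the original class balance and the reported recall stays comparable with the table above.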

