## Relax Take-Home Challenge

Benhur Tedros

The data is available as two attached CSV files: takehome_user_engagement.csv and takehome_users.csv

The data has the following two tables:

1] A user table ( "takehome_users" ) with data on 12,000 users who signed up for the
product in the last two years. This table includes:

● name: the user's name

● object_id: the user's id

● email: email address

● creation_source: how their account was created. This takes on one
of 5 values:

    ○ PERSONAL_PROJECTS: invited to join another user's
personal workspace

    ○ GUEST_INVITE: invited to an organization as a guest
(limited permissions)

    ○ ORG_INVITE: invited to an organization (as a full member)

    ○ SIGNUP: signed up via the website
    
    ○ SIGNUP_GOOGLE_AUTH: signed up using Google Authentication (using a Google email account for their login id)

● creation_time: when they created their account

● last_session_creation_time: unix timestamp of last login

● opted_in_to_mailing_list: whether they have opted into receiving
marketing emails

● enabled_for_marketing_drip: whether they are on the regular
marketing email drip

● org_id: the organization (group of users) they belong to

● invited_by_user_id: which user invited them to join (if applicable).


2] A usage summary table ( "takehome_user_engagement" ) that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate
days in at least one seven-day period, identify which factors predict future user
adoption.

In [1]:
# Importing the required libraries

import os
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error



In [2]:
# Changing the PATH directory
os.chdir('F:\\BENHUR FOLDER\\Data Science Career Track\\Springboard_takehome_challenges\\relax_challenge')

In [3]:
# Loading the takehome_user_engagement csv datafile into pandas dataframes
user_eng= pd.read_csv('takehome_user_engagement.csv')
user_eng.head()

| | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |
| 3 | 2013-12-09 03:45:04 | 2 | 1 |
| 4 | 2013-12-25 03:45:04 | 2 | 1 |


In [4]:
# Loading the takehome_user data
tak_user= pd.read_csv('takehome_users.csv',encoding='latin-1')
print(tak_user.info())
tak_user.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB
None


| | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |
| 1 | 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 1396238000.0 | 0 | 0 | 1 | 316.0 |
| 2 | 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 1363735000.0 | 0 | 0 | 94 | 1525.0 |
| 3 | 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | GUEST_INVITE | 1369210000.0 | 0 | 0 | 1 | 5151.0 |
| 4 | 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | GUEST_INVITE | 1358850000.0 | 0 | 0 | 193 | 5240.0 |


In [5]:
## Renaming the object_id column to user_id
takehome_users = tak_user.rename(columns={'object_id': 'user_id'})


In [6]:
## Convert to datetime format
user_eng['time_stamp'] = pd.to_datetime(user_eng['time_stamp'])

In [7]:
# Counting the total number of logins recorded for each user
summ_vist = user_eng.groupby('user_id')['visited'].agg('sum')


# Selecting the users with at least 3 logins in total.
# Note: this is looser than the seven-day-window definition of an adopted user given above.
clean_summ = summ_vist[summ_vist >= 3]
clean_summ.head()

user_id
2      14
10    284
20      7
33     18
42    342
Name: visited, dtype: int64
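
The count above runs over a user's whole history, so it only approximates the adopted-user rule stated in the problem. A minimal sketch of the stricter rule (three distinct login days inside some seven-day span) might look like the following. The function name `adopted_user_ids`, its parameters, and the toy `demo` frame are illustrative assumptions, not part of the original analysis, and this function did not produce the results reported later in the notebook.

```python
import pandas as pd

def adopted_user_ids(engagement, window_days=7, min_days=3):
    """Return sorted user_ids with at least `min_days` distinct login days
    falling inside some span of `window_days` days (hypothetical helper)."""
    df = engagement[["user_id", "time_stamp"]].copy()
    # Reduce each login to its calendar day and keep one row per user per day.
    df["day"] = pd.to_datetime(df["time_stamp"]).dt.normalize()
    df = df.drop_duplicates(subset=["user_id", "day"]).sort_values(["user_id", "day"])
    # For each login day, look up the same user's (min_days - 1)-th later login day.
    later = df.groupby("user_id")["day"].shift(-(min_days - 1))
    # The user qualifies if those login days fit inside the window.
    hit = (later - df["day"]) < pd.Timedelta(days=window_days)
    return sorted(df.loc[hit, "user_id"].unique().tolist())

# Toy data: user 1 logs in on Jan 1, 3 and 6; users 2 and 3 never cluster.
demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "time_stamp": [
        "2014-01-01 08:00:00", "2014-01-03 09:00:00", "2014-01-06 10:00:00",
        "2014-01-01 08:00:00", "2014-01-10 08:00:00", "2014-01-20 08:00:00",
        "2014-01-01 08:00:00",
    ],
    "visited": 1,
})
print(adopted_user_ids(demo))
```

Comparing each login day with the user's second-next login day avoids an explicit rolling window: if those two days are fewer than seven days apart, three distinct days fall in one seven-day period.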

In [8]:
## Adding an adopted_user flag to tak_user

for i, v in clean_summ.items():
    tak_user.loc[tak_user['object_id'] == i, 'adopted_user'] = 1

## Filling NaN with 0 (this also fills missing invited_by_user_id and last_session_creation_time values)

tak_user = tak_user.fillna(0)
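
The row-by-row loop can also be written as a single vectorized assignment. The sketch below uses a small toy frame with made-up values standing in for `tak_user` and `clean_summ`; it is one possible alternative rather than the approach used in this notebook.

```python
import pandas as pd

# Toy stand-ins for tak_user and clean_summ (hypothetical values).
users = pd.DataFrame({
    "object_id": [1, 2, 3],
    "invited_by_user_id": [10803.0, None, 1525.0],
})
adopted_ids = pd.Index([2], name="user_id")

# Flag membership in one step: every user gets an explicit 0/1 value,
# so no blanket fillna(0) is needed and other columns keep their NaNs.
users["adopted_user"] = users["object_id"].isin(adopted_ids).astype(int)
print(users)
```

Keeping the missing `invited_by_user_id` values as NaN leaves room to handle them separately, for example with an indicator for whether the user was invited at all.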

In [9]:
## Dropping the data fields which are not needed for our prediction
final = tak_user.drop(columns=['creation_time', 'name', 'email', 'last_session_creation_time', 'object_id'])

In [10]:
# Converting the categorical creation_source field to integer codes
numeric = LabelEncoder()
final['creation_source'] = numeric.fit_transform(final['creation_source'].astype('str'))
final.head()

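| | creation_source | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_user |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 11 | 10803.0 | 0.0 |
| 1 | 1 | 0 | 0 | 1 | 316.0 | 1.0 |
| 2 | 1 | 0 | 0 | 94 | 1525.0 | 0.0 |
| 3 | 0 | 0 | 0 | 1 | 5151.0 | 0.0 |
| 4 | 0 | 0 | 0 | 193 | 5240.0 | 0.0 |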
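
LabelEncoder sorts the category strings and numbers them, so the five creation sources become 0 to 4 in alphabetical order, which a linear model reads as an ordered scale. One possible alternative, shown here as a sketch on a toy series rather than the real data, is one-hot encoding with `pd.get_dummies`; the `src` prefix is an arbitrary choice.

```python
import pandas as pd

# Toy column with three of the five creation sources.
sources = pd.Series(
    ["GUEST_INVITE", "ORG_INVITE", "SIGNUP", "ORG_INVITE"],
    name="creation_source",
)

# One indicator column per category; no ordering is implied between sources.
dummies = pd.get_dummies(sources, prefix="src")
print(dummies)
```

Each row has exactly one active indicator, so the model can learn a separate effect for every source.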


## Building Prediction Model

In [11]:
# Separating the predictors and the target variables

X=final.values[:,0:5]
y=final.values[:,5]


In [12]:
# Standardizing the dataset
X_scaled = preprocessing.scale(X)

In [13]:
# Sampling the train and test dataset

X_train, X_test, y_train, y_test = train_test_split( X_scaled, y, test_size = 0.3, random_state = 100)

# The number of observations in the training and testing datasets

print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

(8400, 5) (8400,)
(3600, 5) (3600,)
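
Note that `preprocessing.scale` in cell 12 standardizes with the mean and standard deviation of all 12,000 rows, so the test rows influence the scaling. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below shows the idea with plain NumPy on synthetic data; the array sizes and the 7/3 split are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(10, 3))  # synthetic features

# Hypothetical split: first 7 rows for training, last 3 for testing.
X_tr, X_te = X_demo[:7], X_demo[7:]

# Learn the scaling from the training rows only.
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)

X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std  # test rows reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training columns end up with zero mean and unit variance, while the test columns are only approximately standardized, which mirrors how the model would see new data.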


### Logistic Regression

In [14]:
lin_model = LogisticRegression().fit(X_train, y_train)
lin_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
## Model prediction
lin_pred = lin_model.predict(X_test)

## Model evaluation: calculating the RMSE on the test set

print('The RMSE=%0.4f ' % np.sqrt(mean_squared_error(y_test, lin_pred)))

The RMSE=0.4304 
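
For 0/1 labels and 0/1 predictions, each squared error is 1 for a misclassified row and 0 otherwise, so the squared RMSE equals the misclassification rate. The sketch below checks this on a toy example (the small arrays are made up) and converts the reported RMSE into the accuracy it implies.

```python
import numpy as np

# Toy labels and predictions to illustrate the identity RMSE^2 = 1 - accuracy.
y_true = np.array([0, 0, 1, 1, 0])
y_hat = np.array([0, 0, 0, 1, 0])

rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
accuracy = np.mean(y_true == y_hat)
assert np.isclose(rmse ** 2, 1 - accuracy)

# Applying the same identity to the RMSE reported above.
reported_rmse = 0.4304
implied_accuracy = 1 - reported_rmse ** 2
print(round(implied_accuracy, 3))
```

By this identity, the logistic regression RMSE of 0.4304 corresponds to a test accuracy of roughly 81.5%, which makes it directly comparable with the decision tree's accuracy.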


### Decision Tree

In [16]:
# Building the model and making predictions
model = DecisionTreeClassifier(max_depth=5)
dec_mod = model.fit(X_train, y_train)
decision_pred = dec_mod.predict(X_test)

# Evaluating the model: test accuracy and confusion matrix
print(' The test accuracy of the model =%0.4f\n Confusion Matrix:\n' % accuracy_score(y_test, decision_pred), confusion_matrix(y_test, decision_pred))


 The test accuracy of the model =0.8142
 Confusion Matrix:
 [[2926    7]
 [ 662    5]]
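
Accuracy alone can be misleading when one class dominates. As a rough check, the sketch below derives a few extra figures from the confusion matrix printed above, assuming the scikit-learn layout (rows are true classes, columns are predicted classes). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[2926, 7],
               [662, 5]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tn + tp) / total   # share of correct predictions
baseline = (tn + fp) / total   # accuracy of always predicting class 0
recall = tp / (tp + fn)        # share of class-1 users that were found
precision = tp / (tp + fp)     # share of class-1 predictions that were right

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f} "
      f"recall={recall:.4f} precision={precision:.4f}")
```

Read this way, the tree finds 5 of the 667 class-1 users in the test set, and its accuracy sits just below what a constant majority-class prediction would score.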


In [17]:
# Ranking the importance of each feature in the decision tree model
feat_imp1 = pd.Series(dec_mod.feature_importances_, index=final.columns[:-1], name='Feature Importance Rate').to_frame()
feat_imp1.sort_values(by='Feature Importance Rate')

| | Feature Importance Rate |
|---|---|
| opted_in_to_mailing_list | 0.0 |
| enabled_for_marketing_drip | 0.0 |
| creation_source | 0.258999 |
| invited_by_user_id | 0.324486 |
| org_id | 0.416515 |


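Logistic regression yielded an RMSE of 0.4304 on the test set, and the decision tree reached a test accuracy of 81.42%. These figures should be read against the class balance: the confusion matrix shows that 2,933 of the 3,600 test users are non-adopted, so always predicting the majority class would already score about 81.47%, and the tree correctly identifies only 5 of the 667 adopted users. From the decision tree's feature importances, three data fields stood out: org_id, invited_by_user_id, and creation_source (from most to least important). The two marketing fields, opted_in_to_mailing_list and enabled_for_marketing_drip, had zero importance. Since org_id and invited_by_user_id are identifiers that were treated as numeric values, this ranking is best taken as tentative.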