**Table 1 : takehome_users**

A user table ("takehome_users") with data on 12,000 users who signed up for the
product in the last two years. This table includes:
* name: the user's name
* object_id: the user's id
* email: email address
* creation_source: how their account was created. This takes on one
of 5 values:
  * PERSONAL_PROJECTS: invited to join another user's
  personal workspace
  * GUEST_INVITE: invited to an organization as a guest
  (limited permissions)
  * ORG_INVITE: invited to an organization (as a full member)
  * SIGNUP: signed up via the website
  * SIGNUP_GOOGLE_AUTH: signed up using Google
  Authentication (using a Google email account for their login
  id)

* creation_time: when they created their account
* last_session_creation_time: unix timestamp of last login
* opted_in_to_mailing_list: whether they have opted into receiving
marketing emails
* enabled_for_marketing_drip: whether they are on the regular
marketing email drip
* org_id: the organization (group of users) they belong to
* invited_by_user_id: which user invited them to join (if applicable).


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from google.colab import files
uploaded = files.upload()

Saving takehome_users.csv to takehome_users.csv


In [4]:
import io
# Reading the csv file with "ISO-8859-1" encoding
df2 = pd.read_csv(io.BytesIO(uploaded['takehome_users.csv']),encoding="ISO-8859-1")
# Dataset is now stored in a Pandas Dataframe
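The `encoding="ISO-8859-1"` argument presumably matters because some names in the file are not valid UTF-8. A minimal sketch of the effect, using a made-up two-row CSV (the bytes and the names `raw`, `sample`, and `utf8_ok` are illustrative and not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical sample: "José" encoded as ISO-8859-1 (é is the single byte 0xE9).
raw = "object_id,name\n1,José\n2,Ann\n".encode("ISO-8859-1")

# Decoding as UTF-8 (the pandas default) is expected to fail on byte 0xE9.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Declaring the encoding lets pandas decode the byte correctly.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_ok, sample["name"].tolist())
```

Outside Colab, the same call should work with a local file path in place of the `io.BytesIO` wrapper.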

**Table 2 : takehome_user_engagement**

A usage summary table ("takehome_user_engagement") that has one row for each day that a user logged into the product. The table includes:

* time_stamp
* user_id
* visited

In [5]:
from google.colab import files
uploaded = files.upload()

Saving takehome_user_engagement.csv to takehome_user_engagement.csv


In [6]:
import io
df1 = pd.read_csv(io.BytesIO(uploaded['takehome_user_engagement.csv']))
# Dataset is now stored in a Pandas Dataframe

In [7]:
# Convert the Unix timestamps of the last session login to pandas datetime
df2['last_session_creation_time'] = pd.to_datetime(df2['last_session_creation_time'],unit='s')
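As a quick illustration of what `unit='s'` does, the sketch below converts one epoch value and one missing value. The number 1398138810 is an assumption: it is chosen because it converts to the timestamp displayed for the first user, since the raw column values are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# 1398138810 is the number of seconds since 1970-01-01 00:00:00 UTC.
# NaN stands in for a user who never logged in.
raw_seconds = pd.Series([1398138810.0, np.nan])

# unit="s" tells pandas to read the numbers as seconds since the epoch.
converted = pd.to_datetime(raw_seconds, unit="s")
print(converted)
```

Missing values become `NaT`, which is why users with no recorded session show `NaT` in `last_session_creation_time`.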

In [8]:
type(df2['creation_time'][0])

str

In [9]:
# Also converting the creation_time from string to pandas datetime object
df2['creation_time'] = pd.to_datetime(df2['creation_time'])

In [10]:
df2.head(5)

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0


In [11]:
df1.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


**Task:**

Defining an ***"adopted user"*** as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.
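One way to make this definition concrete is sketched below. It reads "seven-day period" as seven consecutive calendar days, which is an assumption, and the helper `is_adopted` and the toy `logins` frame are hypothetical names used only for illustration.

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """True if the logins cover `min_days` distinct calendar days that all
    fall inside some span of `window_days` consecutive days."""
    # Reduce the logins to sorted, unique calendar days.
    days = (
        pd.to_datetime(pd.Series(timestamps))
        .dt.normalize()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    if len(days) < min_days:
        return False
    # Gap between each day and the day (min_days - 1) positions later.
    span = days.shift(-(min_days - 1)) - days
    return bool((span <= pd.Timedelta(days=window_days - 1)).any())

logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time_stamp": ["2014-01-01 09:00", "2014-01-03 09:00", "2014-01-07 09:00",
                   "2014-01-01 09:00", "2014-01-01 18:00", "2014-01-20 09:00"],
})
adopted = logins.groupby("user_id")["time_stamp"].apply(is_adopted)
print(adopted)
```

User 1 logs in on days 1, 3 and 7 of the same week and counts as adopted; user 2 has only two distinct login days and does not.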

In [13]:
#df1.drop(['count_in_last_7_days'],axis =1,inplace =True)

In [14]:
type(df1['time_stamp'][0])

str

In [15]:
# Convert time_stamp from string to pandas datetime object
df1['time_stamp'] = pd.to_datetime(df1['time_stamp'])

In [16]:
type(df1['time_stamp'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [17]:
# Count each user's logins within a rolling 7-day window; the trailing "- 1" excludes the current login from the count.
delta = 7
df1['count_in_last_%s_days' %(delta)] = df1.assign(count=1).groupby(['user_id'])\
                                        .apply(lambda x: x.rolling('%sD' %delta, on='time_stamp').sum())['count'].astype(int) - 1
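The behaviour of the time-based rolling count can be checked on a toy series for a single user. This is only a sketch: the timestamps and the names `logins`, `in_window` and `others_in_window` are hypothetical.

```python
import pandas as pd

# Hypothetical single user: three logins inside one week, then a later one.
stamps = pd.to_datetime(["2013-04-10 22:08:03", "2013-04-12 22:08:03",
                         "2013-04-14 22:08:03", "2013-04-25 22:08:03"])
logins = pd.Series(1, index=stamps)

# Time-based window: each row sums the logins in the 7 days ending at that row.
in_window = logins.rolling("7D").sum().astype(int)
# Same adjustment as count_in_last_7_days: drop the current login.
others_in_window = in_window - 1

print(in_window.tolist())
print(others_in_window.tolist())
```

With the `- 1` adjustment, a user with exactly three logins in a week reaches a value of 2, so any threshold applied to the adjusted column has to account for that offset.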

In [18]:
df1.head()

Unnamed: 0,time_stamp,user_id,visited,count_in_last_7_days
0,2014-04-22 03:53:30,1,1,0
1,2013-11-15 03:45:04,2,1,0
2,2013-11-29 03:45:04,2,1,0
3,2013-12-09 03:45:04,2,1,0
4,2013-12-25 03:45:04,2,1,0


In [19]:
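# Rows where the rolling 7-day count is 3 or more.
# Note: count_in_last_7_days excludes the current login, so a value of 3 means four logins in the window;
# the task's "three logins" definition would correspond to a threshold of 2.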
df1[df1['count_in_last_7_days'].ge(3)]

Unnamed: 0,time_stamp,user_id,visited,count_in_last_7_days
42,2013-04-17 22:08:03,10,1,3
43,2013-04-19 22:08:03,10,1,3
47,2013-04-30 22:08:03,10,1,3
48,2013-05-01 22:08:03,10,1,3
49,2013-05-02 22:08:03,10,1,4
...,...,...,...,...
207897,2014-05-21 11:04:47,11988,1,4
207898,2014-05-23 11:04:47,11988,1,5
207899,2014-05-24 11:04:47,11988,1,5
207900,2014-05-26 11:04:47,11988,1,4


In [20]:
act_product_user = df1[df1['count_in_last_7_days'].ge(3)].user_id.unique()

In [21]:
# Finally, obtain the array of users tagged as "adopted users" of the product by our definition
act_product_user

array([   10,    42,    43, ..., 11969, 11975, 11988])

In [22]:
# Creating a column with boolean values whether a user is adopted_user or not
df2["adopted user"] = df2['object_id'].isin(act_product_user)

In [23]:
df2.head(12)

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,2014-04-22 03:53:30,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,2014-03-31 03:45:04,0,0,1,316.0,False
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,2013-03-19 23:14:52,0,0,94,1525.0,False
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,2013-05-22 08:09:28,0,0,1,5151.0,False
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,2013-01-22 10:14:20,0,0,193,5240.0,False
5,6,2013-12-17 03:37:06,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,2013-12-19 03:37:06,0,0,197,11241.0,False
6,7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,2012-12-20 13:24:32,0,1,37,,False
7,8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,NaT,1,1,74,,False
8,9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,NaT,0,0,302,,False
9,10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,2014-06-03 22:08:03,1,1,318,4143.0,True


#### Start feature engineering

In [24]:
# extract month feature
months = df2.creation_time.dt.month

In [25]:
#df3.sort_values(by = 'count_in_last_7_days',ascending=False, inplace=False,).head(10)

In [26]:
# first: extract the day name literal
to_one_hot = df2.creation_time.dt.day_name()
# second: one hot encode to 7 columns
days = pd.get_dummies(to_one_hot)
# display data
days

Unnamed: 0,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0
2,0,0,0,0,0,1,0
3,0,0,0,0,0,1,0
4,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...
11995,1,0,0,0,0,0,0
11996,0,0,0,0,1,0,0
11997,0,0,0,1,0,0,0
11998,0,0,0,0,1,0,0


In [28]:
# daypart function
def daypart(hour):
    if hour in [2,3,4,5]:
        return "dawn"
    elif hour in [6,7,8,9]:
        return "morning"
    elif hour in [10,11,12,13]:
        return "noon"
    elif hour in [14,15,16,17]:
        return "afternoon"
    elif hour in [18,19,20,21]:
        return "evening"
    else: return "midnight"
# extract hour feature
hours = df2.creation_time.dt.hour
# utilize it along with apply method
df2_dayparts = hours.apply(daypart)
# one hot encoding
dayparts = pd.get_dummies(df2_dayparts)
# re-arrange columns for convenience
dayparts = dayparts[['dawn','morning','noon','afternoon','evening','midnight']]
# display data
dayparts

Unnamed: 0,dawn,morning,noon,afternoon,evening,midnight
0,1,0,0,0,0,0
1,1,0,0,0,0,0
2,0,0,0,0,0,1
3,0,1,0,0,0,0
4,0,0,1,0,0,0
...,...,...,...,...,...,...
11995,0,1,0,0,0,0
11996,0,0,0,0,1,0
11997,0,0,1,0,0,0
11998,0,0,1,0,0,0
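The six four-hour buckets used by `daypart` can also be computed without `apply`. The following is a sketch of one vectorized alternative; `daypart_vectorized` and `LABELS` are hypothetical names, and the bucket boundaries are copied from the function above.

```python
import pandas as pd

LABELS = ["dawn", "morning", "noon", "afternoon", "evening", "midnight"]

def daypart_vectorized(hours):
    """Map hours 0-23 to the same six 4-hour buckets as daypart()."""
    # Shift so that hour 2 becomes 0, wrap around midnight, then bucket by 4.
    bucket = ((hours - 2) % 24) // 4
    return bucket.map(dict(enumerate(LABELS)))

hours = pd.Series([3, 8, 10, 16, 21, 23, 0, 1])
print(daypart_vectorized(hours).tolist())
```

Hours 22, 23, 0 and 1 all land in the last bucket, matching the `else` branch that returns "midnight".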


In [29]:
# is_weekend flag 
day_names = df2.creation_time.dt.day_name()
is_weekend = day_names.apply(lambda x : 1 if x in ['Saturday','Sunday'] else 0)
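An equivalent flag can be derived from the numeric day of the week, which avoids comparing strings. A minimal sketch on three hypothetical dates (a Tuesday, a Saturday and a Sunday):

```python
import pandas as pd

created = pd.Series(pd.to_datetime(["2014-04-22", "2014-04-26", "2014-04-27"]))
# dayofweek numbers Monday as 0, so Saturday and Sunday are 5 and 6.
is_weekend = (created.dt.dayofweek >= 5).astype(int)
print(is_weekend.tolist())
```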

In [34]:
# one hot encoding creation_source
creation_source = pd.get_dummies(df2.creation_source)

In [36]:
creation_source.head()

Unnamed: 0,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH
0,1,0,0,0,0
1,0,1,0,0,0
2,0,1,0,0,0
3,1,0,0,0,0
4,1,0,0,0,0


In [56]:
# features table
# first step: include features with single column nature
features = pd.DataFrame({
    'month' : months,
    'hour' : hours,
    'is_weekend' : is_weekend
})
# second step: concatenate with the one-hot encoded and raw columns
features = pd.concat([creation_source,df2[['opted_in_to_mailing_list','enabled_for_marketing_drip', 'org_id']]\
                      ,features, days, dayparts], axis = 1)
# target column
target = df2['adopted user'].astype(int)

In [57]:
target.value_counts()

0    10703
1     1297
Name: adopted user, dtype: int64

In [58]:
features.head()

Unnamed: 0,GUEST_INVITE,ORG_INVITE,PERSONAL_PROJECTS,SIGNUP,SIGNUP_GOOGLE_AUTH,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,month,hour,...,Sunday,Thursday,Tuesday,Wednesday,dawn,morning,noon,afternoon,evening,midnight
0,1,0,0,0,0,1,0,11,4,3,...,0,0,1,0,1,0,0,0,0,0
1,0,1,0,0,0,0,0,1,11,3,...,0,0,0,0,1,0,0,0,0,0
2,0,1,0,0,0,0,0,94,3,23,...,0,0,1,0,0,0,0,0,0,1
3,1,0,0,0,0,0,0,1,5,8,...,0,0,1,0,0,1,0,0,0,0
4,1,0,0,0,0,0,0,193,1,10,...,0,1,0,0,0,0,1,0,0,0


#### SMOTE for Imbalanced Classification with Python

SMOTE with random undersampling of the majority class.

In [70]:
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# define pipeline
over = SMOTE(sampling_strategy=0.2) # oversample the minority class to 20% of the majority class size
under = RandomUnderSampler(sampling_strategy=0.5) # undersample the majority class until the minority is 50% of it
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(features, target)
# summarize the new class distribution
counter = Counter(y)
print(counter)

Counter({0: 4280, 1: 2140})
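The printed counts follow from the two sampling ratios and the original class sizes. The arithmetic can be sketched as below; truncating to an integer is an assumption about how the ratios are rounded, chosen because it reproduces the counter above.

```python
# Class counts before resampling, as reported by target.value_counts().
majority, minority = 10703, 1297

# SMOTE(sampling_strategy=0.2): grow the minority class to 20% of the majority.
# Truncation to an integer is an assumption about the rounding.
minority_after_smote = minority + int(majority * 0.2 - minority)

# RandomUnderSampler(sampling_strategy=0.5): shrink the majority class until
# minority / majority equals 0.5.
majority_after_under = int(minority_after_smote / 0.5)

print({0: majority_after_under, 1: minority_after_smote})
```

So 843 synthetic adopters are added and 6,423 non-adopters are dropped, leaving 6,420 rows in total.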


In [71]:
#split data into training and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle = False)
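With `shuffle = False`, the test set is simply the last 10% of the rows, so its class mix depends entirely on row order. The sketch below uses toy labels grouped by class to show the effect and a stratified alternative. Whether the resampled `X, y` are in fact grouped by class is an assumption here; inspecting `y_test.value_counts()` would confirm it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels grouped by class, mimicking a resampled set that is not shuffled.
y = np.array([0] * 400 + [1] * 200)
X = np.arange(len(y)).reshape(-1, 1)

# Without shuffling, the test set is the last 10% of the rows.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.1, shuffle=False)

# Shuffling with stratification keeps the class ratio in both parts.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

print(np.unique(y_test_ordered), np.bincount(y_test_strat))
```

Resampling before the split also means that synthetic SMOTE rows can end up in the test set; splitting first and resampling only the training part avoids that.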

#### Training a Model

In [76]:
from sklearn.ensemble import RandomForestClassifier
# define the model parameters
params = {'n_estimators': 500,
          'max_depth': 4,
          'min_samples_split': 5}
# instantiate the model (it is trained in the next cell)
clf = RandomForestClassifier(**params)


In [77]:
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=4, min_samples_split=5, n_estimators=500)
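Since the task asks which factors predict adoption, a fitted random forest's `feature_importances_` attribute is one place to look. The sketch below uses synthetic data rather than the notebook's `features` table; `toy_X`, `toy_y` and `toy_clf` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the features table: one informative column, one noise column.
toy_X = pd.DataFrame({
    "signal": rng.random(500),
    "noise": rng.random(500),
})
toy_y = (toy_X["signal"] > 0.5).astype(int)

toy_clf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
toy_clf.fit(toy_X, toy_y)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(toy_clf.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

Applied to the real model, the same pattern would be `pd.Series(clf.feature_importances_, index=X_train.columns)`, assuming `X_train` is still a DataFrame.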

#### Evaluation of Model

In [78]:
# import r2 score (a regression metric, applied here to the predicted class labels)
from sklearn.metrics import r2_score
# evaluate the metrics
y_true = y_test
y_pred = clf.predict(X_test)
print(f"RF model R2 is {round(r2_score(y_true, y_pred)* 100 , 2)} %")

RF model R2 is 0.0 %
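R2 measures explained variance and is designed for regression. When the true labels in the test set are all the same class, their variance is zero, and `r2_score` with default settings is expected to return 0.0 for any imperfect prediction. The sketch below shows that case and, on a second set of hypothetical labels, the classification metrics that are usually reported instead.

```python
from sklearn.metrics import confusion_matrix, precision_score, r2_score, recall_score

# Case 1: a test set containing only positives (hypothetical labels).
single_class_true = [1] * 10
single_class_pred = [1] * 6 + [0] * 4
# y_true has zero variance, so R2 carries no information here.
r2_single = r2_score(single_class_true, single_class_pred)

# Case 2: a mixed test set scored with classification metrics.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted
precision = precision_score(y_true, y_pred)  # share of predicted adopters that are correct
recall = recall_score(y_true, y_pred)        # share of true adopters that are found

print(r2_single, cm.tolist(), precision, recall)
```

For an imbalanced target such as adoption, precision and recall on the adopter class say more about the model than a single overall score.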


**Conclusion** :

The information currently available about users and their product logins is insufficient to classify users as future adopters or non-adopters of the product. More information about each user's usage, interactions, and feedback is required to comment further on future user engagement with the product.