# Introduction

This project addresses the take-home challenge component of the data-science interview process for Relax, Inc. The goal is to identify factors that predict future user adoption, where an adopted user is one who has logged into the product on three separate days in at least one seven-day period.

# Data Wrangling

Here, I will read in the data, check for missing values, and perform any other cleaning that might be needed.

In [1]:
import pandas as pd
import numpy as np

engage = pd.read_csv('takehome_user_engagement.csv')
engage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [2]:
# visited appears to be constant, so it likely carries no information
print(engage.visited.unique()) 

[1]
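The same check can be run on every column at once. This is a minimal sketch on a small hypothetical frame (`demo` and its values are illustrative stand-ins, not the real file):

```python
import pandas as pd

# Hypothetical stand-in for the engagement data (values are illustrative)
demo = pd.DataFrame({
    'time_stamp': ['2014-04-22 03:53:30', '2013-11-15 03:45:04', '2013-11-29 03:45:04'],
    'user_id': [1, 2, 2],
    'visited': [1, 1, 1],
})

# Columns with a single distinct value cannot help distinguish users
n_unique = demo.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```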


In [3]:
# drop visited column and convert time_stamp to datetime
engage_two = engage.drop(['visited'], axis=1)
engage_two['time_stamp'] = pd.to_datetime(engage_two['time_stamp'])
engage_two.head()

Unnamed: 0,time_stamp,user_id
0,2014-04-22 03:53:30,1
1,2013-11-15 03:45:04,2
2,2013-11-29 03:45:04,2
3,2013-12-09 03:45:04,2
4,2013-12-25 03:45:04,2


In [4]:
# identify adopted users (creation of dependent variable)

def check_adopt(user):
    """Return 1 if the user logged in on 3 separate days within a 7-day period, else 0."""
    days = sorted(set(user.dt.normalize()))  # sorted list of distinct login days
    # compare the 1st and 3rd days in each run of 3 consecutive login days
    for i in range(len(days) - 2):
        if (days[i + 2] - days[i]).days < 7:  # all 3 days fall within 7 calendar days
            return 1
    return 0  # also covers users with fewer than 3 login days

In [5]:
agg = engage_two.groupby('user_id')['time_stamp'].agg(check_adopt)
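The adoption rule from the introduction can also be computed without a per-user Python loop. The sketch below is one possible approach, not necessarily the one used later in this notebook. It assumes a toy frame `toy` with made-up logins, and it reads "seven-day period" as the first and third login days being fewer than 7 calendar days apart.

```python
import pandas as pd

# Toy data (hypothetical): user 1 should qualify, users 2 and 3 should not
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3],
    'time_stamp': pd.to_datetime([
        '2014-01-01 09:00', '2014-01-03 10:00', '2014-01-06 08:00',
        '2014-01-01 09:00', '2014-01-01 18:00', '2014-01-20 09:00',
        '2014-02-01 12:00',
    ]),
})

# One row per user per calendar day, in chronological order
days = (toy.assign(day=toy['time_stamp'].dt.normalize())
           .drop_duplicates(['user_id', 'day'])
           .sort_values(['user_id', 'day']))

# Login day two positions later for the same user (NaT if there is none)
third_day = days.groupby('user_id')['day'].shift(-2)
days['span'] = (third_day - days['day']).dt.days  # NaN where fewer than 3 days remain

# Adopted if any run of three login days fits inside 7 calendar days
adopted = (days['span'] < 7).groupby(days['user_id']).any().astype(int)
print(adopted.to_dict())
```

Deduplicating by calendar day first matters: two logins on the same day count as one day under the definition.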

In [14]:
# read in users data set

users = pd.read_csv('takehome_users.csv', encoding='latin-1')  # latin-1 avoids the decode error from the default UTF-8 codec
users.rename(columns={'object_id': 'user_id'}, inplace=True)  # change object_id to user_id for merge
users.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0
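The `last_session_creation_time` values look like Unix timestamps in seconds. Assuming that reading is right, they can be converted to datetimes as sketched below. The two values are copied from the rounded display above, so the resulting times are only approximate.

```python
import pandas as pd

# Values copied from the displayed (rounded) output, not from the raw file
last_session = pd.Series([1398139000.0, 1396238000.0])

# Interpret the floats as seconds since 1970-01-01 UTC (an assumption)
last_session_dt = pd.to_datetime(last_session, unit='s')
print(last_session_dt.dt.date.tolist())
```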


In [54]:
# merge data frames
merged = pd.merge(users, engage_two, on='user_id', how='inner')
merged.head()

Unnamed: 0,user_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,time_stamp
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,2014-04-22 03:53:30
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-11-15 03:45:04
2,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-11-29 03:45:04
3,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-12-09 03:45:04
4,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2013-12-25 03:45:04


In [55]:
# sanity checks of dimensions
print(users.shape)
print(engage_two.shape)
print(merged.shape)

print(len(merged.user_id.unique()))
print(len(engage_two.user_id.unique()))

(12000, 10)
(207917, 2)
(207917, 11)
8823
8823
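These counts imply that 12000 - 8823 = 3177 users have no engagement records (assuming one row per user in `users`), and the inner merge drops them. One way to list such users is a left merge with `indicator=True`. The sketch uses hypothetical frames `users_demo` and `engage_demo` rather than the real data:

```python
import pandas as pd

# Hypothetical stand-ins: users 3 and 4 never logged in
users_demo = pd.DataFrame({'user_id': [1, 2, 3, 4]})
engage_demo = pd.DataFrame({'user_id': [1, 2, 2, 2]})

# A left merge with indicator=True marks users that have no engagement rows
check = users_demo.merge(engage_demo.drop_duplicates('user_id'),
                         on='user_id', how='left', indicator=True)
never_logged_in = check.loc[check['_merge'] == 'left_only', 'user_id'].tolist()
print(never_logged_in)
```

Users without any logins cannot meet the adoption definition, so whether to keep them as non-adopted rows or leave them out is a modeling choice worth making explicitly.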


In [56]:
# check for missing data -- problems in invited_by_user_id
print(merged.isnull().sum())

user_id                           0
creation_time                     0
name                              0
email                             0
creation_source                   0
last_session_creation_time        0
opted_in_to_mailing_list          0
enabled_for_marketing_drip        0
org_id                            0
invited_by_user_id            91030
time_stamp                        0
dtype: int64


About 44% of the rows (91,030 of 207,917) have no value for the inviting user. The variable still carries useful information, so there is no need to drop it outright. I'll proceed on the assumption that a missing value means the user was not referred by another user in the first place. With that said, I'll use this variable to create a new binary column indicating whether a user was referred. (The alternative, dropping rows with missing data, would discard nearly half of the observations.)

In [67]:
# create binary variable to indicate if user was referred
merged['referred'] = np.where(pd.isnull(merged['invited_by_user_id']), 0, 1)
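The assumption that a missing inviter means "not referred" can be checked against `creation_source`: if it holds, invite-type sources should always have an inviter and other sources never should. This is a sketch on a hypothetical frame `demo`. The `'SIGNUP'` category is an assumed value, since only `GUEST_INVITE` and `ORG_INVITE` appear in the rows shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'SIGNUP' is an assumed category for illustration
demo = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'creation_source': ['GUEST_INVITE', 'ORG_INVITE', 'SIGNUP', 'SIGNUP'],
    'invited_by_user_id': [10803.0, 316.0, np.nan, np.nan],
})
demo['referred'] = np.where(demo['invited_by_user_id'].isnull(), 0, 1)

# Rows: creation source, columns: referred flag (0 or 1)
table = pd.crosstab(demo['creation_source'], demo['referred'])
print(table)
```

Because `merged` has one row per login, running this on the one-row-per-user `users` frame gives a cleaner picture than running it on `merged`.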

In [74]:
# drop unneeded variables, including invited_by_user_id
merged_drop = merged.drop(['invited_by_user_id', 'name', 'email'], axis=1)
merged_drop.head()

Unnamed: 0,user_id,creation_time,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,time_stamp,referred
0,1,2014-04-22 03:53:30,GUEST_INVITE,1398139000.0,1,0,11,2014-04-22 03:53:30,1
1,2,2013-11-15 03:45:04,ORG_INVITE,1396238000.0,0,0,1,2013-11-15 03:45:04,1
2,2,2013-11-15 03:45:04,ORG_INVITE,1396238000.0,0,0,1,2013-11-29 03:45:04,1
3,2,2013-11-15 03:45:04,ORG_INVITE,1396238000.0,0,0,1,2013-12-09 03:45:04,1
4,2,2013-11-15 03:45:04,ORG_INVITE,1396238000.0,0,0,1,2013-12-25 03:45:04,1
