# Relax Take Home Challenge

## Task

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.


## Data

There are two tables: one with user engagement (logins) and one with user characteristics.

## Step 1

Identify adopted users from the user engagement table, then flag those users in the user characteristics table. (See code for details.)
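
The rule can be illustrated on made-up data with a minimal sketch. The function name `find_adopted_users`, its parameters and the toy logins are assumptions for illustration only. It uses `groupby().shift()` rather than the rolling window used in the notebook cell, and its default span of 7 days mirrors the `b < 8` threshold there. A stricter reading of "seven-day period" would use a span of 6.

```python
import pandas as pd

def find_adopted_users(engagement, n_logins=3, max_span_days=7):
    """Return sorted ids of users with n_logins distinct login days
    whose first and last day are at most max_span_days apart."""
    days = engagement[['user_id']].copy()
    days['day'] = pd.to_datetime(engagement['time_stamp']).dt.floor('D')
    # Several logins on one day count once
    days = days.drop_duplicates().sort_values(['user_id', 'day'])
    # Login day (n_logins - 1) rows earlier for the same user
    earlier = days.groupby('user_id')['day'].shift(n_logins - 1)
    span = (days['day'] - earlier).dt.days  # NaN where there is no earlier row
    return sorted(days.loc[span <= max_span_days, 'user_id'].unique().tolist())

# Made-up logins: user 1 has three distinct days within 5 days,
# user 2 is spread over a month, user 3 logs in three times on one day,
# user 4 sits exactly on the 7-day boundary.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'time_stamp': [
        '2014-01-01 08:00:00', '2014-01-01 20:00:00', '2014-01-03 09:00:00', '2014-01-06 09:00:00',
        '2014-01-01 10:00:00', '2014-01-02 10:00:00', '2014-02-01 10:00:00',
        '2014-01-05 01:00:00', '2014-01-05 02:00:00', '2014-01-05 03:00:00',
        '2014-01-01 12:00:00', '2014-01-04 12:00:00', '2014-01-08 12:00:00',
    ],
})

adopted = find_adopted_users(toy)
adopted_strict = find_adopted_users(toy, max_span_days=6)
```

User 4 is counted as adopted only under the looser threshold, which shows how much the choice of boundary can matter.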

In [None]:
import os
import numpy as np
import pandas as pd

# Load data from the current working directory
basepath = os.getcwd()
file = 'takehome_user_engagement.csv'
file_name_path = os.path.join(basepath, file)
engagement = pd.read_csv(file_name_path)

# Preview the data
engagement.head(5)

# time_stamp is a date, so convert it to datetime, floor it to the day,
# and store it as an integer so that rolling() can operate on it
engagement['time_stamp'] = pd.to_datetime(engagement['time_stamp'])
engagement['time_stamp'] = engagement['time_stamp'].dt.floor('D').astype(np.int64)
engagement.pop('visited')

# Several logins on the same day count as a single visit, so drop duplicates;
# sort by user and day so that each window is in chronological order
engagement = engagement.sort_values(['user_id', 'time_stamp']).drop_duplicates()

# Create a window over every three consecutive login days of each user
a = engagement.groupby('user_id')['time_stamp'].rolling(window=3)

# b is the number of days between the earliest and latest day in each window
b = pd.to_timedelta(a.max() - a.min()).dt.days

# If b is at most 7 days (b < 8) for any window, the user is an adopted user
c = b[b < 8].index.get_level_values('user_id').tolist()

# Each adopted user should be listed only once
c = np.unique(np.array(c)).tolist()
print(len(c))

## Step 2- Clean user characteristics table

The table came with 9 columns and 12,000 users. Two of the columns were time variables (creation_time and last_session_creation_time), and last_session_creation_time contained null values. I extracted the month and year from both timestamps, creating four new features. This captures when accounts were created and last used, and whether that has any bearing on adoption. On its own, though, it would lose how long the user had been around, so I also created a feature called user_time: the number of days between the user’s creation time and last session time. This also let me get rid of the null values: if no last session time was listed, I assumed it was the same as the creation time, which gives a user_time of 0. I could then drop both time variables.
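
A minimal sketch of this cleaning step on two made-up users, assuming (as the notebook's `unit='s'` conversion suggests) that last_session_creation_time is a Unix timestamp in seconds. The rows and their values are hypothetical. Filling a missing last session with the creation time is one way to express the "user_time is 0" assumption.

```python
import numpy as np
import pandas as pd

# Two made-up users: one with a last session 10.5 days after creation,
# one with no recorded last session.
users = pd.DataFrame({
    'object_id': [1, 2],
    'creation_time': ['2014-01-01 00:00:00', '2013-06-01 00:00:00'],
    'last_session_creation_time': [1389441600.0, np.nan],  # 2014-01-11 12:00:00 UTC, missing
})

users['creation_time'] = pd.to_datetime(users['creation_time'])
users['last_session_creation_time'] = pd.to_datetime(
    users['last_session_creation_time'], unit='s')

# Assumption: a missing last session means the user never came back,
# so treat the last session as happening at creation time.
users['last_session_creation_time'] = (
    users['last_session_creation_time'].fillna(users['creation_time']))

# Month/year features and the tenure in whole days
users['creation_month'] = users['creation_time'].dt.month
users['creation_year'] = users['creation_time'].dt.year
users['last_session_month'] = users['last_session_creation_time'].dt.month
users['last_session_year'] = users['last_session_creation_time'].dt.year
users['user_time'] = (
    users['last_session_creation_time'] - users['creation_time']).dt.days

# The raw timestamps are no longer needed
users = users.drop(columns=['creation_time', 'last_session_creation_time'])
```

The second user ends up with a user_time of 0 and with last-session month and year equal to the creation month and year, so no nulls remain.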

In [None]:
# Load the user characteristics table
new_file = 'takehome_users.csv'
file_name_path2 = os.path.join(basepath, new_file)
df = pd.read_csv(file_name_path2, encoding='ISO-8859-1')
df = df.drop_duplicates()

# object_id is the user's id; a user is adopted if that id is in the adopted list c
df['adopted_user'] = df['object_id'].isin(c)

# Convert dates (last_session_creation_time is a Unix timestamp in seconds)
df['last_session_creation_time'] = pd.to_datetime(df['last_session_creation_time'], unit='s')
df['creation_time'] = pd.to_datetime(df['creation_time'])

# If there is no last session time, assume the last session was the first one,
# i.e. it happened at creation time (at least with regard to use date)
df['last_session_creation_time'] = df['last_session_creation_time'].fillna(df['creation_time'])

# Extract creation and last session month and year
df['creation_month'] = df['creation_time'].dt.month
df['creation_year'] = df['creation_time'].dt.year
df['last_session_month'] = df['last_session_creation_time'].dt.month
df['last_session_year'] = df['last_session_creation_time'].dt.year

# Number of days between account creation and last session
df['user_time'] = (df['last_session_creation_time'] - df['creation_time']).dt.days

# Both raw time columns are now redundant
df = df.drop(columns=['creation_time', 'last_session_creation_time'])

print(df.info())

## Step 3- Feature Manipulation

One of the nine original features was email, which had over 1100 unique values. However, an email address has three separate parts: account name, sub domain and domain. For example, for user relax89@yahoo.com these are the account name (relax89), the sub domain (yahoo) and the domain (.com). Sub domains and domains can both segment the data and can tell us a lot about the user (a .de domain, for example, suggests the user is in Germany). There were hundreds of sub domains, so I kept only those with more than 10 users and grouped the rest under "NA". The same process was repeated for org_id (keeping values with more than 50 users) and invited_by_user_id (more than 10 users).
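
The splitting and grouping can be sketched on a handful of made-up addresses. The helper name `bucket_rare` and the toy emails are hypothetical, and the threshold is lowered to 1 so the effect is visible on six rows.

```python
import pandas as pd

def bucket_rare(values, min_count, other="NA"):
    """Keep categories seen more than min_count times; replace the rest with `other`."""
    counts = values.value_counts()
    keep = counts[counts > min_count].index
    return values.where(values.isin(keep), other)

# Made-up addresses for illustration
emails = pd.Series([
    "a@gmail.com", "b@gmail.com", "c@gmail.com",
    "d@yahoo.com", "e@yahoo.com", "f@xyzmail.de",
])

# Everything after the @, then the part before the first dot
host = emails.str.split("@").str[-1]
sub_domain = host.str.split(".").str[0]

# With min_count=1, the sub domain that appears only once is grouped under "NA"
bucketed = bucket_rare(sub_domain, min_count=1)
```

Wrapping the logic in one helper keeps the thresholds for email sub domain, org_id and invited_by_user_id in one visible place each, rather than repeating the filtering code three times.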

In [None]:
print(df.nunique())
# Part of the email address after the @
df['email_loc'] = df['email'].str.split(pat="@").str[-1]
df['email_sub_domain'] = df['email_loc'].str.split(pat=".").str[0]
#df['.de_email'] = np.where(df['email_domain']=='de', True, False)
# Every .de domain belongs to the custav sub domain, so this was not useful

# Too many sub domains have only a handful of users, so keep those
# with more than 10 users and label the rest "NA"
sub_domain = df.email_sub_domain.value_counts().to_frame()
sub_domain = sub_domain[sub_domain > 10].dropna()
sub_domain_list = sub_domain.index.tolist()
df['email_sub_domain'] = np.where(df.email_sub_domain.isin(sub_domain_list), df.email_sub_domain, "NA")
print(df.nunique())

# Repeat with org_id
# Convert to string so it is treated as categorical later
df['org_id'] = df['org_id'].astype(str)

# Keep organizations with more than 50 users and label the rest "NA"
x = df['org_id'].value_counts().to_frame()
x = x[x > 50].dropna()
x_list = x.index.tolist()
df['org_id'] = np.where(df.org_id.isin(x_list), df.org_id, "NA")


# Keep inviting users who invited more than 10 users and label the rest "NA"
x = df['invited_by_user_id'].value_counts().to_frame()
x = x[x > 10].dropna()
x_list = x.index.tolist()
df['invited_by_user_id'] = np.where(df.invited_by_user_id.isin(x_list), df.invited_by_user_id, "NA")


## Step 4- Machine Learning

Having created these features, I split the data into 60% train and 40% test, encoded the categorical features and ran them through a random forest classifier. On the test data the random forest achieved a recall of 85% and an accuracy of 97%. Looking at feature importance, user_time turns out to have the highest value.

 <img src="Confusion Matrix.png"> 
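
The modelling code is not shown here, so the following is only a sketch of the described workflow, assuming scikit-learn and one-hot encoding via `pd.get_dummies`. The data is synthetic (the label is derived directly from user_time), the category labels are placeholders, and the hyperparameters are arbitrary, so the scores it produces say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned user table (not the real data)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'user_time': rng.integers(0, 400, size=n),
    'creation_source': rng.choice(['source_a', 'source_b', 'source_c'], size=n),
})
toy['adopted_user'] = toy['user_time'] > 150  # made-up label for illustration

# One-hot encode the categorical features
X = pd.get_dummies(toy.drop(columns='adopted_user'))
y = toy['adopted_user']

# 60% train / 40% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

recall = recall_score(y_test, pred)
accuracy = accuracy_score(y_test, pred)

# Feature importances, largest first
importances = pd.Series(
    model.feature_importances_, index=X.columns).sort_values(ascending=False)
```

Reporting recall alongside accuracy matters when one class is much rarer than the other, since a model that never predicts the rare class can still score a high accuracy.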

## Step 5- Visualize Feature Importance

One variable is by far the most important: how long the user has been using the account (user_time). 4 out of 5 adopted users have a user_time of over 100 days. No adopted user has a user_time of zero days (which makes sense), but over 50% of the non-adopted users do. On the flip side, about 50% of the adopted users reach a user_time of 200 days, while no non-adopted users do.

 <img src="Feature Importance Violin2.png"> 
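
Shares like the ones quoted above can be computed with a grouped aggregation. This is a sketch on made-up numbers; the toy values and output column names are hypothetical and are not the notebook's results.

```python
import pandas as pd

# Made-up user_time values for five adopted and four non-adopted users
toy = pd.DataFrame({
    'adopted_user': [True, True, True, True, True, False, False, False, False],
    'user_time':    [250, 180, 120, 300, 40, 0, 0, 3, 0],
})

# Readable group labels instead of True/False
group = toy['adopted_user'].map({True: 'adopted', False: 'not adopted'})

summary = toy.groupby(group)['user_time'].agg(
    share_zero_days=lambda s: (s == 0).mean(),       # fraction with user_time == 0
    share_over_100_days=lambda s: (s > 100).mean(),  # fraction with user_time > 100
    median_days='median',
)
```

Taking the mean of a boolean condition gives the fraction of rows in each group that satisfy it, which is the quantity the percentages in the text describe.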

## Step 6- Other Traits

Less important traits are the email sub domain, creation_source and when the user's last session took place. Their importances are much smaller than that of user_time.

 <img src="Feature Importance Small.png"> 