# Relax Take-home Challenge
I am given two datasets: user account information and user login activity. The purpose of this project is to **identify which factors predict future user adoption**. An adopted user is defined as a user who has logged into the product on three separate days in at least one seven-day period.

# 1. Import Libraries and Dataset

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore third-party warnings
import warnings
warnings.filterwarnings('ignore')

In [48]:
# Import datasets
user_df = pd.read_csv('takehome_users.csv', parse_dates=['creation_time'], encoding='ISO-8859-1')
engagement_df = pd.read_csv("takehome_user_engagement.csv", parse_dates=['time_stamp'])

## 1.1 User Dataset
Contains information on 12,000 users who signed up for the product in the last two years, including how and when each account was created, the time of the last login, etc.

In [49]:
user_df.head(1)

|   | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 1398139000.0 | 1 | 0 | 11 | 10803.0 |


In [50]:
# Convert last_session_creation_time column into datetime
user_df['last_session_creation_time'] = pd.to_datetime(user_df['last_session_creation_time'], unit='s')
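
The conversion above treats the raw numbers as Unix timestamps in seconds. A minimal sketch of the same call on a hypothetical three-value sample (the values are illustrative, not read from the file) shows that missing entries become `NaT`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two Unix timestamps in seconds and one missing value
epoch_seconds = pd.Series([1398138810, 1396237504, np.nan])

# unit='s' interprets the numbers as seconds since 1970-01-01;
# missing values are converted to NaT
converted = pd.to_datetime(epoch_seconds, unit='s')
print(converted)
```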

In [51]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   object_id                   12000 non-null  int64         
 1   creation_time               12000 non-null  datetime64[ns]
 2   name                        12000 non-null  object        
 3   email                       12000 non-null  object        
 4   creation_source             12000 non-null  object        
 5   last_session_creation_time  8823 non-null   datetime64[ns]
 6   opted_in_to_mailing_list    12000 non-null  int64         
 7   enabled_for_marketing_drip  12000 non-null  int64         
 8   org_id                      12000 non-null  int64         
 9   invited_by_user_id          6417 non-null   float64       
dtypes: datetime64[ns](2), float64(1), int64(4), object(3)
memory usage: 937.6+ KB


In [52]:
user_df.head(1)

|   | object_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 2014-04-22 03:53:30 | 1 | 0 | 11 | 10803.0 |


The user dataframe has missing data in the **last_session_creation_time** column (8,823 of 12,000 values present) and the **invited_by_user_id** column (6,417 of 12,000 values present).
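
As a quick check, missing values can also be counted per column directly. The sketch below uses a small hypothetical frame with the same two column names rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample mimicking the two columns with gaps
sample = pd.DataFrame({
    'last_session_creation_time': pd.to_datetime(['2014-04-22', None, '2013-03-19']),
    'invited_by_user_id': [10803.0, np.nan, np.nan],
})

# isna() marks NaN/NaT cells; sum() counts them per column
missing = sample.isna().sum()
print(missing)
```

Running `user_df.isna().sum()` on the real frame should give the same information as the non-null counts reported by `info()`.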

## 1.2 Engagement Dataset
The engagement dataset records user logins: each row holds a timestamp, a user ID, and a `visited` flag.

In [53]:
engagement_df.head(1)

|   | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |


In [54]:
engagement_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   time_stamp  207917 non-null  datetime64[ns]
 1   user_id     207917 non-null  int64         
 2   visited     207917 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.8 MB


## 2. Data Wrangling
This section uses both the user and engagement dataframes to label each user as adopted or not. An adopted user is defined as a user who has logged into the product on three separate days in at least one seven-day period.

### User

In [55]:
# Create new column called 'adopted_user'
user_df['adopted_user'] = 0 # Set 0 by default

# Rename 'object_id' column name to 'user_id' to be same as engagement's 'user_id' column
user_df.rename(columns={'object_id': 'user_id'}, inplace=True)

# Set user_id column as index 
user_df.set_index('user_id', inplace=True)

In [56]:
user_df.head(3)

| user_id | creation_time | name | email | creation_source | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_user |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | GUEST_INVITE | 2014-04-22 03:53:30 | 1 | 0 | 11 | 10803.0 | 0 |
| 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | ORG_INVITE | 2014-03-31 03:45:04 | 0 | 0 | 1 | 316.0 | 0 |
| 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | ORG_INVITE | 2013-03-19 23:14:52 | 0 | 0 | 94 | 1525.0 | 0 |


In [57]:
# One hot encode 'creation_source'
user_df = pd.get_dummies(user_df, columns=['creation_source'], prefix='s_')

# Lowercase all column names
user_df.columns = user_df.columns.str.lower()
user_df.head(3)

| user_id | creation_time | name | email | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | invited_by_user_id | adopted_user | s__guest_invite | s__org_invite | s__personal_projects | s__signup | s__signup_google_auth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | 2014-04-22 03:53:30 | 1 | 0 | 11 | 10803.0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | 2014-03-31 03:45:04 | 0 | 0 | 1 | 316.0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | 2013-03-19 23:14:52 | 0 | 0 | 94 | 1525.0 | 0 | 0 | 1 | 0 | 0 | 0 |
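
The double underscore in the encoded column names comes from combining `prefix='s_'` with the default `prefix_sep='_'` of `pd.get_dummies`. A small sketch on a hypothetical three-row frame shows the difference; it illustrates the naming only and does not change the columns used in this notebook:

```python
import pandas as pd

# Hypothetical sample with three of the creation sources
sources = pd.DataFrame({'creation_source': ['GUEST_INVITE', 'ORG_INVITE', 'SIGNUP']})

# prefix='s_' plus the default prefix_sep='_' yields a double underscore
double = pd.get_dummies(sources, columns=['creation_source'], prefix='s_')

# prefix='s' yields a single underscore
single = pd.get_dummies(sources, columns=['creation_source'], prefix='s')

print(list(double.columns))
print(list(single.columns))
```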


In [58]:
# Handle missing 'invited_by_user_id' values by creating a binary 'invited' column:
# 1 if the user was invited by another user, 0 otherwise
user_df['invited'] = user_df['invited_by_user_id'].apply(lambda val: 0 if np.isnan(val) else 1)

# Remove 'invited_by_user_id', since only the fact of being invited is kept as a feature
user_df.drop(['invited_by_user_id'], axis=1, inplace=True)

In [59]:
user_df['invited'].value_counts()

1    6417
0    5583
Name: invited, dtype: int64
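
The same flag can be built without a row-wise `apply`. This is a sketch of an equivalent vectorized form on a hypothetical four-value sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of inviter IDs, with NaN for users who were not invited
invited_by = pd.Series([10803.0, np.nan, 316.0, np.nan], name='invited_by_user_id')

# notna() gives True where an inviter ID is present; astype(int) maps True/False to 1/0
invited = invited_by.notna().astype(int)
print(invited.tolist())
```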

### Engagement
We now use the user engagement dataframe to fill in the 'adopted_user' column of the user dataframe.

In [60]:
engagement_df.head(3)

|   | time_stamp | user_id | visited |
|---|---|---|---|
| 0 | 2014-04-22 03:53:30 | 1 | 1 |
| 1 | 2013-11-15 03:45:04 | 2 | 1 |
| 2 | 2013-11-29 03:45:04 | 2 | 1 |


In [61]:
# Set 'time_stamp' column as datetime index
engagement_df = engagement_df.set_index(pd.DatetimeIndex(engagement_df['time_stamp']))

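# Group by 'user_id', resample into calendar weeks (bins ending on Sunday), and sum the visits in each week
# Note: sum() also adds up the 'user_id' column itself, so that column is not meaningful in the result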
engagement_df = engagement_df.groupby('user_id').resample('1W').sum()

In [62]:
engagement_df.head(10)

| user_id | time_stamp | user_id | visited |
|---|---|---|---|
| 1 | 2014-04-27 | 1 | 1 |
| 2 | 2013-11-17 | 2 | 1 |
| 2 | 2013-11-24 | 0 | 0 |
| 2 | 2013-12-01 | 2 | 1 |
| 2 | 2013-12-08 | 0 | 0 |
| 2 | 2013-12-15 | 2 | 1 |
| 2 | 2013-12-22 | 0 | 0 |
| 2 | 2013-12-29 | 2 | 1 |
| 2 | 2014-01-05 | 2 | 1 |
| 2 | 2014-01-12 | 2 | 1 |
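
Weekly resampling counts visits per calendar week, which approximates the "any seven-day period" wording of the definition. The sketch below uses a hypothetical user with logins on a Saturday, Sunday, and Monday to show how a run of three consecutive days can be split across two weekly bins:

```python
import pandas as pd

# Hypothetical user: logins on Saturday, Sunday, and Monday
logins = pd.DataFrame(
    {'visited': 1},
    index=pd.to_datetime(['2014-01-04', '2014-01-05', '2014-01-06']),
)

# 'W' bins are calendar weeks ending on Sunday, so the Monday login
# lands in the following bin
weekly = logins['visited'].resample('W').sum()
print(weekly)
```

Neither bin reaches three visits here, so a weekly threshold would not flag this user even though the three logins fall within three days. How often this occurs in the real data is not measured in this notebook.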


In [63]:
# Flag the weeks in which a user had 3 or more visits, then count those weeks per user
engagement_df = engagement_df['visited'] >= 3
engagement_df = engagement_df.groupby(level=0).apply(np.sum)

# Remove users that do not have any weeks with 3 or more visits
engagement_df = engagement_df[engagement_df != 0]

In [64]:
# Create new engagement dataframe
adopted_users_df = pd.DataFrame(engagement_df,index=engagement_df.index)

# Rename 'visited' to 'active_weeks'
adopted_users_df.rename(columns={'visited': 'active_weeks'}, inplace=True)

In [65]:
# Get min and max values
min_val = adopted_users_df['active_weeks'].min()
max_val = adopted_users_df['active_weeks'].max()

In [66]:
# Scale values to the range 0 to 1 (min-max scaling)
def min_max_scaler(val, min_val, max_val):
    '''
    Normalizes a value to the range 0 to 1.
    
    Parameters
    ----------
    val: int
        Value to scale
    min_val: int
        Minimum of the column being scaled
    max_val: int
        Maximum of the column being scaled

    Returns
    -------
    num: float64
        Value between 0 and 1
    '''
    return (val - min_val) / (max_val - min_val)

adopted_users_df['active_weeks'] = adopted_users_df['active_weeks'].apply(lambda val: min_max_scaler(val, min_val, max_val))
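
Min-max scaling can also be applied to the whole column at once. The function below is a sketch of that variant with a guard for the case where all values are equal; the name `min_max_scale` and the sample values are hypothetical:

```python
import pandas as pd

def min_max_scale(series):
    """Scale a numeric Series to [0, 1]; return zeros if all values are equal."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Avoid division by zero when the column is constant
        return pd.Series(0.0, index=series.index)
    return (series - lo) / (hi - lo)

# Hypothetical active-week counts indexed by user_id
active_weeks = pd.Series([1, 4, 1, 11], index=[2, 10, 20, 42])
scaled = min_max_scale(active_weeks)
print(scaled)
```

Note that the smallest value always maps to 0, so a user with the minimum number of active weeks ends up with the same scaled value as a user later filled in with 0.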

In [67]:
adopted_users_df

| user_id | active_weeks |
|---|---|
| 2 | 0.000000 |
| 10 | 0.548387 |
| 20 | 0.000000 |
| 33 | 0.000000 |
| 42 | 0.720430 |
| ... | ... |
| 11965 | 0.000000 |
| 11967 | 0.075269 |
| 11969 | 0.225806 |
| 11975 | 0.462366 |


In [68]:
# Set 'adopted_user' to 1 for every user with at least one active week
user_df.loc[adopted_users_df.index, 'adopted_user'] = 1

In [69]:
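# Add the 'active_weeks' column by joining on the shared 'user_id' index
# (passing on= together with left_index=True raises a MergeError in recent pandas versions)
user_df = user_df.merge(adopted_users_df, left_index=True, right_index=True, how='left')

# Fill NaN with 0 in the 'active_weeks' column
user_df['active_weeks'] = user_df['active_weeks'].fillna(value=0)
user_df.head()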

| user_id | creation_time | name | email | last_session_creation_time | opted_in_to_mailing_list | enabled_for_marketing_drip | org_id | adopted_user | s__guest_invite | s__org_invite | s__personal_projects | s__signup | s__signup_google_auth | invited | active_weeks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2014-04-22 03:53:30 | Clausen August | AugustCClausen@yahoo.com | 2014-04-22 03:53:30 | 1 | 0 | 11 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.0 |
| 2 | 2013-11-15 03:45:04 | Poole Matthew | MatthewPoole@gustr.com | 2014-03-31 03:45:04 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0.0 |
| 3 | 2013-03-19 23:14:52 | Bottrill Mitchell | MitchellBottrill@gustr.com | 2013-03-19 23:14:52 | 0 | 0 | 94 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.0 |
| 4 | 2013-05-21 08:09:28 | Clausen Nicklas | NicklasSClausen@yahoo.com | 2013-05-22 08:09:28 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.0 |
| 5 | 2013-01-17 10:14:20 | Raw Grace | GraceRaw@yahoo.com | 2013-01-22 10:14:20 | 0 | 0 | 193 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.0 |
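
The labeling and merge steps can be illustrated on a toy example. This sketch assumes two hypothetical frames that share a `user_id` index, and derives both the flag and the filled column from a single left join:

```python
import pandas as pd

# Hypothetical frames sharing a 'user_id' index
users = pd.DataFrame({'invited': [1, 0, 1]}, index=pd.Index([1, 2, 3], name='user_id'))
active = pd.DataFrame({'active_weeks': [0.5]}, index=pd.Index([2], name='user_id'))

# A left join keeps every user; users without a match get NaN
merged = users.join(active, how='left')

# A user is flagged as adopted if a matching row existed
merged['adopted_user'] = merged['active_weeks'].notna().astype(int)

# Users without any active week get 0
merged['active_weeks'] = merged['active_weeks'].fillna(0)
print(merged)
```

Deriving the flag before filling the gaps keeps the two columns consistent, since after `fillna(0)` a missing value can no longer be told apart from a scaled value of 0.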


In [172]:
# Remove columns that are not needed for the model
user_df.drop(['creation_time','name','email','last_session_creation_time','org_id'],axis=1, inplace=True)

In [185]:
user_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   opted_in_to_mailing_list    12000 non-null  int64  
 1   enabled_for_marketing_drip  12000 non-null  int64  
 2   adopted_user                12000 non-null  int64  
 3   s__guest_invite             12000 non-null  uint8  
 4   s__org_invite               12000 non-null  uint8  
 5   s__personal_projects        12000 non-null  uint8  
 6   s__signup                   12000 non-null  uint8  
 7   s__signup_google_auth       12000 non-null  uint8  
 8   invited                     12000 non-null  int64  
 9   active_weeks                12000 non-null  float64
dtypes: float64(1), int64(4), uint8(5)
memory usage: 621.1 KB
