## Identify which factors predict future user adoption.
- An **adopted user** is a user who has logged into the product on **three separate days** in at least one **seven-day period**.
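To make the definition concrete, here is a minimal sketch that checks it for a single user's login timestamps. The function name `is_adopted` and its parameters are hypothetical, and it assumes "seven-day period" means any seven consecutive calendar days, not a fixed calendar week.

```python
from datetime import datetime, timedelta


def is_adopted(login_times, min_days=3, window_days=7):
    """Return True if the logins fall on at least `min_days` distinct
    calendar days inside some span of `window_days` consecutive days."""
    # Collapse multiple logins on the same day into one entry
    days = sorted({t.date() for t in login_times})
    for i in range(len(days) - min_days + 1):
        # days[i] .. days[i + min_days - 1] are min_days distinct days;
        # they share a window if first and last are < window_days apart
        if days[i + min_days - 1] - days[i] < timedelta(days=window_days):
            return True
    return False


# Three separate days inside one seven-day span (Jan 1 to Jan 7)
example = [datetime(2014, 1, 1, 9), datetime(2014, 1, 3, 9), datetime(2014, 1, 7, 9)]
print(is_adopted(example))
```

Repeated logins on the same day count once here, because the definition speaks of separate days, not separate logins.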

In [1]:
#Import necessary modules
import pandas as pd
import numpy as np

### Load and Clean Datasets
- users_login_df
- users_engage_df

**users_login_df**

In [2]:
# Load CSV 1: user login records
users_login_df = pd.read_csv('takehome_user_engagement.csv', encoding='utf-8')
users_login_df.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [3]:
print('The length of user login dataframe is: ', len(users_login_df))
print('   ')
users_login_df.info()

The length of user login dataframe is:  207917
   
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
time_stamp    207917 non-null object
user_id       207917 non-null int64
visited       207917 non-null int64
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


*Observations*: The data appears clean. Every column reports 207,917 non-null values, which matches the total number of rows in the dataframe, so there are no null datapoints.
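The same check can be done directly instead of reading it off `info()`. A minimal sketch on a toy frame (the values below are illustrative and deliberately include one null; they are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for users_login_df (illustrative values only)
toy_logins = pd.DataFrame({
    "time_stamp": ["2014-04-22 03:53:30", "2013-11-15 03:45:04", None],
    "user_id": [1, 2, 2],
    "visited": [1, 1, 1],
})

# Count nulls per column; a column is complete when its count is 0
null_counts = toy_logins.isna().sum()
print(null_counts)

# Names of the columns that contain at least one null
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```

On the real dataframe, an empty `columns_with_nulls` list would confirm the observation above.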

**users_engage_df**

In [4]:
# Load CSV 2: user account records
users_engage_df = pd.read_csv('takehome_users.csv')
users_engage_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [5]:
print('The length of user engage dataframe is: ', len(users_engage_df))
print('   ')
users_engage_df.info()

The length of user engage dataframe is:  12000
   
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [6]:
# Show sample of columns with null values
users_engage_df[['last_session_creation_time','invited_by_user_id']].head()

Unnamed: 0,last_session_creation_time,invited_by_user_id
0,1398139000.0,10803.0
1,1396238000.0,316.0
2,1363735000.0,1525.0
3,1369210000.0,5151.0
4,1358850000.0,5240.0


*Observations*: Columns *last_session_creation_time* and *invited_by_user_id* contain null values (3,177 and 5,583 of the 12,000 rows, respectively). Both columns are numerical (float64), so a single representative value can be used to fill the nulls in each.

- The fill value for *last_session_creation_time* is the *mean*: the individual timestamps show no obvious pattern, so the mean serves as a generalization of the login times.

- The fill value for *invited_by_user_id* is the *mode*, since each value identifies one user. It makes more sense to attribute the missing invites to the user who most commonly invites other users than to use the mean of the user ids; user ids are randomly assigned, so their mean carries no meaning and does not pertain to the pattern of user login.
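Filling nulls erases the fact that a value was missing, which may itself be informative. A hedged sketch of one way to keep that information: record indicator columns first, then fill as described. The column names `never_logged_in` and `was_invited` are hypothetical, and reading a null as "never logged in" or "not invited" is an assumption about the data, not something the dataset states.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two columns with nulls (illustrative values only)
toy = pd.DataFrame({
    "last_session_creation_time": [1398139000.0, np.nan, 1363735000.0],
    "invited_by_user_id": [10803.0, np.nan, np.nan],
})

# Record missingness before it is overwritten (hypothetical column names)
toy["never_logged_in"] = toy["last_session_creation_time"].isna()
toy["was_invited"] = toy["invited_by_user_id"].notna()

# Fill with the mean and the mode, assigning the result back to the column
toy["last_session_creation_time"] = toy["last_session_creation_time"].fillna(
    toy["last_session_creation_time"].mean())
toy["invited_by_user_id"] = toy["invited_by_user_id"].fillna(
    toy["invited_by_user_id"].mode()[0])

print(toy)
```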

In [7]:
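# Fill missing last-session times with the column mean
# (assign the result back instead of using inplace=True on a column selection)
users_engage_df['last_session_creation_time'] = (
    users_engage_df['last_session_creation_time'].fillna(
        users_engage_df['last_session_creation_time'].mean()))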

In [8]:
# Fill missing inviter ids with the most common inviter (the mode)
users_engage_df['invited_by_user_id'] = (
    users_engage_df['invited_by_user_id'].fillna(
        users_engage_df['invited_by_user_id'].mode()[0]))

# Test output
users_engage_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    12000 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            12000 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


### Adjusting Data for Analysis
- Convert users_login_df['time_stamp'] to datetime objects
- Add a week column to users_login_df holding each login's week of the year
- Group users_login_df by week and user_id, and update users_login_df['visited'] to hold each user's number of logins per week
- Create the adopted_users list: all users with three or more logins within a single week (a calendar week stands in for the seven-day period)
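The steps above bucket logins by calendar week. For comparison, here is a sketch that applies the definition over the whole login table with a rolling window: it keeps one row per user per day, then checks whether any login day and the login day two rows later for the same user are fewer than seven days apart. The toy data and the names `logins`, `days`, `span` and `adopted_ids` are illustrative assumptions.

```python
import pandas as pd

# Toy login table (illustrative values only)
logins = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 2],
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-03 09:00", "2014-01-07 09:00",
        "2013-12-30 09:00", "2014-01-02 08:00", "2014-01-02 20:00",
        "2014-01-10 09:00",
    ]),
})

# One row per user per calendar day, ordered in time within each user
days = (
    logins.assign(day=logins["time_stamp"].dt.normalize())
    .drop_duplicates(["user_id", "day"])
    .sort_values(["user_id", "day"])
)

# Time from each login day to the same user's login day two rows later;
# NaT where the user has fewer than two later login days
span = days.groupby("user_id")["day"].shift(-2) - days["day"]

# Three distinct days fit in a seven-day period when the span is < 7 days
adopted_ids = days.loc[span < pd.Timedelta(days=7), "user_id"].unique().tolist()
print(adopted_ids)
```

User 2 logs in twice on the same day, which counts as a single day here, so only user 1 qualifies.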

In [9]:
# Import necessary modules
from datetime import datetime

In [10]:
# Convert column to datetime
users_login_df['time_stamp'] = pd.to_datetime(
    users_login_df['time_stamp'])

# Test output
users_login_df['time_stamp'].head()

0   2014-04-22 03:53:30
1   2013-11-15 03:45:04
2   2013-11-29 03:45:04
3   2013-12-09 03:45:04
4   2013-12-25 03:45:04
Name: time_stamp, dtype: datetime64[ns]

In [11]:
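# Create a week column holding the ISO week of the year for each login
# (Series.dt.week was removed in pandas 2.0; isocalendar() is its replacement)
users_login_df['week'] = users_login_df['time_stamp'].dt.isocalendar().week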
users_login_df.head()

Unnamed: 0,time_stamp,user_id,visited,week
0,2014-04-22 03:53:30,1,1,17
1,2013-11-15 03:45:04,2,1,46
2,2013-11-29 03:45:04,2,1,48
3,2013-12-09 03:45:04,2,1,50
4,2013-12-25 03:45:04,2,1,52
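Note that a week-of-year number repeats every year, so logins from the same week number in different years would land in the same group. A small sketch of a year-aware key, assuming an ISO year plus ISO week label is acceptable; the `year_week` name and the toy timestamps are illustrative:

```python
import pandas as pd

# Two logins a year apart that share the same week number
stamps = pd.Series(pd.to_datetime(["2013-04-23 10:00:00", "2014-04-22 03:53:30"]))

# isocalendar() returns a DataFrame with year, week and day columns
iso = stamps.dt.isocalendar()
week_only = iso["week"]

# Combine ISO year and zero-padded ISO week into one grouping key
year_week = iso["year"].astype(str) + "-W" + iso["week"].astype(str).str.zfill(2)

print(week_only.tolist())
print(year_week.tolist())
```

Grouping on a key like `year_week` instead of the bare week number keeps each year's logins separate.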


In [12]:
# Group user_login_df by ['user_id'] and ['week']
users_login_df = users_login_df.groupby(
    ['week','user_id'])['time_stamp'].count().reset_index(name="visited")
users_login_df.head()

Unnamed: 0,week,user_id,visited
0,1,2,1
1,1,10,5
2,1,42,4
3,1,43,1
4,1,46,1


In [13]:
# Create list of adopted_users
adopted_users = users_login_df[users_login_df.visited >= 3]
adopted_users = adopted_users['user_id'].tolist()
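A user with several qualifying weeks appears in this list once per week. If a list of distinct ids is preferred, a sketch along these lines would work; the toy `weekly` frame mirrors the grouped layout above with illustrative values:

```python
import pandas as pd

# Toy weekly counts in the same layout as the grouped dataframe
weekly = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "user_id": [10, 42, 10, 43],
    "visited": [5, 4, 3, 1],
})

# Keep rows with three or more logins, then reduce to distinct user ids
adopted_ids = weekly.loc[weekly["visited"] >= 3, "user_id"].unique().tolist()
print(adopted_ids)
```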

### Selecting Data for Model
- Create a binary column for adopted users:
    - TRUE: user_id is within adopted_users list
    - FALSE: user_id is not within adopted_users list

In [14]:
users_engage_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [15]:
# Flag each user whose object_id appears in the adopted_users list
user_adopted = users_engage_df['object_id'].isin(adopted_users).tolist()

In [16]:
# Add adopted user information to users_engage_df
users_engage_df['user_adopted'] = pd.Series(user_adopted)
users_engage_df.tail()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_adopted
11995,11996,2013-09-06 06:14:15,Meier Sophia,SophiaMeier@gustr.com,ORG_INVITE,1378448000.0,0,0,89,8263.0,False
11996,11997,2013-01-10 18:28:37,Fisher Amelie,AmelieFisher@gmail.com,SIGNUP_GOOGLE_AUTH,1358275000.0,0,0,200,10741.0,False
11997,11998,2014-04-27 12:45:16,Haynes Jake,JakeHaynes@cuvox.de,GUEST_INVITE,1398603000.0,1,1,83,8074.0,False
11998,11999,2012-05-31 11:55:59,Faber Annett,mhaerzxp@iuxiw.com,PERSONAL_PROJECTS,1338638000.0,0,0,6,10741.0,False
11999,12000,2014-01-26 08:57:12,Lima Thaís,ThaisMeloLima@hotmail.com,SIGNUP,1390727000.0,0,1,0,10741.0,False


In [17]:
# Drop unnecessary columns
users_engage_df = users_engage_df.drop(columns=['name','email'])

In [18]:
%store users_engage_df

Stored 'users_engage_df' (DataFrame)
