# Relax Challenge Notebook
### By [Anthony Medina](https://www.linkedin.com/in/anthony-medina-math/)

1. Notebook Objectives/Prompt
2. Strategy
3. Imports
4. Initial Data Loading
5. Initial Exploration
6. Creating a Feature Column
7. Features
8. Modeling
9. Conclusions

### 1. Notebook Objectives
The data is available as two attached CSV files:

takehome_user_engagement.csv - Table of dates of logins

takehome_users.csv - Table of people who have signed up in the last 2 years

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

### 2. Strategy
* Create a function that tells whether a user is an **adopted user**.
* The target variable is the True/False **adopted_user** column.
* Join the target column onto the user data set.
* Clean up the features we have.
* Encode the categories.
* Run the model (Random Forest sounds good for this).
* Look at the list of variables that contributed most to the target column.
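
The first bullet can be sketched as a small helper. This is only an illustration, not the code used later in the notebook: the name `is_adopted` and its parameters are hypothetical, and it assumes a strict calendar-day reading of the prompt (three distinct login days whose first and last day are at most six days apart), which is slightly stricter than the `<= 7` check used in section 6.

```python
import pandas as pd

def is_adopted(timestamps, window_days=7, min_days=3):
    """Hypothetical helper: True if the logins fall on at least `min_days`
    distinct calendar days inside some `window_days`-day window."""
    # Reduce the logins to distinct calendar days in chronological order.
    days = (
        pd.to_datetime(pd.Series(list(timestamps)))
        .dt.normalize()
        .drop_duplicates()
        .sort_values()
        .reset_index(drop=True)
    )
    if len(days) < min_days:
        return False
    # Gap between each day and the (min_days - 1)-th distinct day after it.
    span = days.shift(-(min_days - 1)) - days
    # A 7-day window means the first and last day differ by at most 6 days.
    return bool((span <= pd.Timedelta(days=window_days - 1)).any())

# Three consecutive days count as adoption; logins spread over weeks do not.
print(is_adopted(['2013-08-02 22:08:03', '2013-08-03 22:08:03', '2013-08-04 22:08:03']))
print(is_adopted(['2013-11-15 03:45:04', '2013-11-29 03:45:04', '2013-12-09 03:45:04']))
```

Working on distinct calendar days means several logins on the same day only count once, which matches the "three separate days" wording of the prompt.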

### 3. Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### 4. Initial Data Loading
There are two CSV files.
* loggin_data, which will be converted into the target column.
* user_data, which has the rest of the features of the data set.

In [2]:
loggin_data = pd.read_csv('takehome_user_engagement.csv')
loggin_data.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [3]:
user_data = pd.read_csv('takehome_users.csv')
user_data.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


### 5. Initial Exploration

In [4]:
loggin_data.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [5]:
loggin_data.groupby('user_id')['visited'].sum() >= 3

user_id
1        False
2         True
3        False
4        False
5        False
         ...  
11996    False
11997    False
11998    False
11999    False
12000    False
Name: visited, Length: 8823, dtype: bool

In [6]:
len(loggin_data)

207917

In [7]:
loggin_data.loc[0:100]

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1
...,...,...,...
96,2013-08-02 22:08:03,10,1
97,2013-08-03 22:08:03,10,1
98,2013-08-04 22:08:03,10,1
99,2013-08-06 22:08:03,10,1


In [8]:
# Can you visit the website more than once at a time?
loggin_data.groupby('visited')['visited'].count()

visited
1    207917
Name: visited, dtype: int64

In [9]:
loggin_data.isnull().sum()

time_stamp    0
user_id       0
visited       0
dtype: int64

### 6. Creating a Feature Column

In [10]:
user_id_list = []
adopted_user_list = []

for user_id in range(1, 12001):  # ids go from 1 to 12000
    user_id_list.append(user_id)
    adopted_user_value = False
    # Login timestamps for this user; the check below relies on the rows
    # already being in chronological order
    my_list = loggin_data[loggin_data['user_id'] == user_id]['time_stamp']
    length = len(my_list)

    # Compare each login with the login two rows later.
    # With fewer than 3 logins the loop body never runs and the value stays False.
    for i in range(length - 2):
        date1 = datetime.strptime(my_list.iloc[i], '%Y-%m-%d %H:%M:%S')
        date2 = datetime.strptime(my_list.iloc[i + 2], '%Y-%m-%d %H:%M:%S')
        date_difference = abs((date2 - date1).days)
        # Note: this counts logins up to 7 whole days apart as one seven-day period
        if date_difference <= 7:
            adopted_user_value = True
            break

    # Exactly one value is appended per user
    adopted_user_list.append(adopted_user_value)
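
The loop above filters the full login table once per user, which gets slow for 12,000 users. A minimal vectorized sketch of the same `<= 7` rule is shown below on a toy frame; the function name `adopted_flags` and the toy data are made up for illustration, and like the loop it assumes each user's rows are already in chronological order.

```python
import pandas as pd

# Toy stand-in for loggin_data (rows taken from the previews above).
toy = pd.DataFrame({
    'time_stamp': ['2014-04-22 03:53:30',
                   '2013-11-15 03:45:04', '2013-11-29 03:45:04', '2013-12-09 03:45:04',
                   '2013-08-02 22:08:03', '2013-08-03 22:08:03', '2013-08-04 22:08:03'],
    'user_id': [1, 2, 2, 2, 10, 10, 10],
})

def adopted_flags(logins, all_user_ids):
    """Hypothetical helper: one True/False flag per id in `all_user_ids`."""
    ts = pd.to_datetime(logins['time_stamp'])
    # Time between each login and the same user's login two rows earlier.
    gap = ts - ts.groupby(logins['user_id']).shift(2)
    # Rows without two earlier logins give NaN, which compares as False.
    hit = gap.dt.days.abs() <= 7
    flags = hit.groupby(logins['user_id']).any()
    # Users with no logins at all are filled in as not adopted.
    return flags.reindex(list(all_user_ids), fill_value=False)

flags = adopted_flags(toy, range(1, 11))
print(flags)
```

On the real data this would be called as `adopted_flags(loggin_data, range(1, 12001))`, and the result is indexed by user id, so it can be joined by key rather than by position.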

In [11]:
# Making a new data frame of the user_id and the target column.
data = {
    'user_id': user_id_list,
    'adopted_user': adopted_user_list
}

predicted_df = pd.DataFrame(data)

In [12]:
predicted_df.head()

Unnamed: 0,user_id,adopted_user
0,1,False
1,2,True
2,3,False
3,4,False
4,5,False


In [13]:
# These two need to have the same length in order to join them together. 
print(predicted_df.shape)
print(user_data.shape)

(12000, 2)
(12000, 10)


In [14]:
# Let's make sure that they have the same number of IDs as well.
print(user_data['object_id'].max(), loggin_data['user_id'].max())

12000 12000


In [15]:
# Adding the target column to the data frame.
# This assignment aligns on the row index, so it relies on both frames being ordered by id.
user_data['adopted_user'] = predicted_df['adopted_user']
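
Assigning the column directly works only while both frames share the same row order. A hedged alternative is to join on the id columns explicitly; the sketch below uses small made-up frames (`users`, `labels`) in place of `user_data` and `predicted_df`.

```python
import pandas as pd

# Toy frames; `users` is deliberately out of order to show the join still lines up.
users = pd.DataFrame({'object_id': [3, 1, 2], 'name': ['c', 'a', 'b']})
labels = pd.DataFrame({'user_id': [1, 2, 3], 'adopted_user': [False, True, False]})

merged = (
    users.merge(labels, left_on='object_id', right_on='user_id',
                how='left', validate='one_to_one')  # fail loudly on duplicate ids
    .drop(columns='user_id')
)
print(merged)
```

The `validate='one_to_one'` argument makes pandas raise an error if either key column contains duplicates, which is a cheap sanity check before modeling.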

In [16]:
user_data.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,False
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,True
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,False
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,False
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,False


### 7. Features

In [17]:
# We will need to go through and deal with missing values, and zeros
# We will need to encode creation source
# We will need to drop the object id, creation time, name, email
user_data.columns

Index(['object_id', 'creation_time', 'name', 'email', 'creation_source',
       'last_session_creation_time', 'opted_in_to_mailing_list',
       'enabled_for_marketing_drip', 'org_id', 'invited_by_user_id',
       'adopted_user'],
      dtype='object')

In [18]:
# Checking the categories in creation_source
user_data.groupby('creation_source')['creation_source'].count()

creation_source
GUEST_INVITE          2163
ORG_INVITE            4254
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

In [19]:
user_data[user_data['last_session_creation_time'] == 0]['last_session_creation_time'].count()

0

In [20]:
user_data['last_session_creation_time'] = user_data['last_session_creation_time'].fillna(0.0)
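
Filling missing values with 0 folds "never had a session" into the same numeric column as the timestamps. One option, sketched here on a toy frame, is to keep an explicit indicator and to read the values as dates. The column names `has_session` and `last_session_dt` are made up, and treating the values as Unix epoch seconds is an assumption based on how closely they track `creation_time` in the preview.

```python
import pandas as pd

# Toy stand-in for user_data before the fillna step (None marks a missing session).
toy = pd.DataFrame({'last_session_creation_time': [1398139000.0, None, 1363735000.0]})

# Explicit flag for whether any session was recorded.
toy['has_session'] = toy['last_session_creation_time'].notna()

# Assumption: values are seconds since 1970-01-01; missing values become NaT.
toy['last_session_dt'] = pd.to_datetime(toy['last_session_creation_time'], unit='s')
print(toy)
```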

In [21]:
user_data.groupby('opted_in_to_mailing_list')['opted_in_to_mailing_list'].count()

opted_in_to_mailing_list
0    9006
1    2994
Name: opted_in_to_mailing_list, dtype: int64

In [22]:
user_data.groupby('enabled_for_marketing_drip')['enabled_for_marketing_drip'].count()

enabled_for_marketing_drip
0    10208
1     1792
Name: enabled_for_marketing_drip, dtype: int64

In [23]:
user_data.groupby('org_id')['org_id'].count()

org_id
0      319
1      233
2      201
3      168
4      159
      ... 
412     17
413     16
414     20
415     16
416      2
Name: org_id, Length: 417, dtype: int64

In [24]:
user_data.groupby('invited_by_user_id')['invited_by_user_id'].count()

invited_by_user_id
3.0        1
7.0        5
10.0       1
21.0       1
23.0       3
          ..
11981.0    1
11986.0    1
11994.0    7
11997.0    1
11999.0    7
Name: invited_by_user_id, Length: 2564, dtype: int64

In [25]:
null_count = user_data['invited_by_user_id'].isnull().sum()
print(null_count)

5583


In [26]:
user_data['invited_by_user_id'] = user_data['invited_by_user_id'].fillna(0.0)

In [27]:
user_data.groupby('adopted_user')['adopted_user'].count()

adopted_user
False    10344
True      1656
Name: adopted_user, dtype: int64

In [28]:
# We will need to drop the object id, creation time, name, email
columns_to_drop = ['object_id', 'creation_time', 'name', 'email']
df = user_data.drop(columns_to_drop, axis=1)

In [29]:
# We will need to encode creation source

# Perform one-hot encoding on the creation_source column
encoded_df = pd.get_dummies(df, columns=['creation_source'], prefix=['creation_source'])

encoded_df.head()

Unnamed: 0,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted_user,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH
0,1398139000.0,1,0,11,10803.0,False,1,0,0,0,0
1,1396238000.0,0,0,1,316.0,True,0,1,0,0,0
2,1363735000.0,0,0,94,1525.0,False,0,1,0,0,0
3,1369210000.0,0,0,1,5151.0,False,1,0,0,0,0
4,1358850000.0,0,0,193,5240.0,False,1,0,0,0,0


In [30]:
# Count the null values in each column
encoded_df.isnull().sum()

last_session_creation_time            0
opted_in_to_mailing_list              0
enabled_for_marketing_drip            0
org_id                                0
invited_by_user_id                    0
adopted_user                          0
creation_source_GUEST_INVITE          0
creation_source_ORG_INVITE            0
creation_source_PERSONAL_PROJECTS     0
creation_source_SIGNUP                0
creation_source_SIGNUP_GOOGLE_AUTH    0
dtype: int64

### 8. Modeling

In [31]:
# Since the prompt is asking for a list of contributing features, this really screams Random Forest.

In [32]:
X = encoded_df.drop('adopted_user', axis=1)
y = encoded_df['adopted_user']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Positional indices stand in for feature names; In [36] maps them back to column names
feature_names = list(range(X.shape[1]))
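
Since only about 14% of users are adopted, a plain random split can leave the train and test sets with slightly different class proportions. A minimal sketch of a stratified split with `train_test_split(..., stratify=...)` follows; the toy arrays are made up and use a 20/80 class ratio so the counts come out even.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 positives and 80 negatives.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([True] * 20 + [False] * 80)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to passing `stratify=y` to the existing `train_test_split` call.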

In [33]:
# Double-check that the feature matrix X has no null values
null_values = X.isnull()

# Count the number of null values in each column
null_count_per_column = null_values.sum()

print(null_count_per_column)

last_session_creation_time            0
opted_in_to_mailing_list              0
enabled_for_marketing_drip            0
org_id                                0
invited_by_user_id                    0
creation_source_GUEST_INVITE          0
creation_source_ORG_INVITE            0
creation_source_PERSONAL_PROJECTS     0
creation_source_SIGNUP                0
creation_source_SIGNUP_GOOGLE_AUTH    0
dtype: int64


In [34]:
# Create a Random Forest classifier
random_forest = RandomForestClassifier(random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV with accuracy as the scoring metric
grid_search = GridSearchCV(random_forest, param_grid, cv=5, scoring='accuracy')

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best estimator from grid search
best_params = grid_search.best_params_
best_rf = grid_search.best_estimator_

# Evaluate the best model on your test data
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Test Accuracy:", accuracy)


Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 50}
Test Accuracy: 0.91375
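
To put the test accuracy in context, it helps to compare it with always predicting the majority class. The sketch below uses the class counts printed in In [27]; those counts cover the full data set, so the baseline on the 20% test split may differ slightly.

```python
# Class counts from the adopted_user value counts in In [27].
counts = {False: 10344, True: 1656}
total = sum(counts.values())

# Accuracy of a model that always predicts the majority class (not adopted).
baseline_accuracy = max(counts.values()) / total

# Test accuracy reported by the grid search cell.
test_accuracy = 0.91375

print(f"Majority-class baseline: {baseline_accuracy:.3f}")
print(f"Improvement over baseline: {test_accuracy - baseline_accuracy:.3f}")
```

With classes this imbalanced, accuracy alone can look better than it is, so the gap over the baseline is the more informative number.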


In [35]:
# Create a Random Forest classifier with the best parameters found by the grid search
random_forest = RandomForestClassifier(n_estimators=50, max_depth=10, min_samples_leaf=4, min_samples_split=10, random_state=42)

# Train the classifier on the training data
random_forest.fit(X_train, y_train)

# Get feature importances from the trained model
feature_importances = random_forest.feature_importances_

# Sort the features based on their importances
sorted_indices = np.argsort(feature_importances)[::-1]

# Print the importance scores and feature names
print("Feature Importances:")
for idx in sorted_indices:
    print(f"{feature_names[idx]}: {feature_importances[idx]}")

Feature Importances:
0: 0.8763631607939225
3: 0.06404044073512331
4: 0.033774291637366345
7: 0.006117350363084708
1: 0.004703758048080987
2: 0.004377198570480068
5: 0.0033161522558693844
6: 0.002563267712756153
9: 0.0023996232039213907
8: 0.002344756679395164


In [36]:
X.columns[0], X.columns[3], X.columns[4], X.columns[7], X.columns[1]

('last_session_creation_time',
 'org_id',
 'invited_by_user_id',
 'creation_source_PERSONAL_PROJECTS',
 'opted_in_to_mailing_list')
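
Looking up column names by position is easy to get wrong. A small sketch of pairing names and importances directly is shown below; it reuses the column order from In [33] and the importances printed in In [35], rounded to four decimals, rather than refitting the model.

```python
import pandas as pd

# Column order of X, as printed in In [33].
columns = [
    'last_session_creation_time', 'opted_in_to_mailing_list',
    'enabled_for_marketing_drip', 'org_id', 'invited_by_user_id',
    'creation_source_GUEST_INVITE', 'creation_source_ORG_INVITE',
    'creation_source_PERSONAL_PROJECTS', 'creation_source_SIGNUP',
    'creation_source_SIGNUP_GOOGLE_AUTH',
]
# Importances from In [35], in the same positional order, rounded.
importances = [0.8764, 0.0047, 0.0044, 0.0640, 0.0338,
               0.0033, 0.0026, 0.0061, 0.0023, 0.0024]

ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)
```

In the notebook itself the same idea would be `pd.Series(random_forest.feature_importances_, index=X.columns)`.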

### 9. Conclusions

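The most important feature by far is `last_session_creation_time` (importance of about 0.88). Because missing values were filled with 0, this feature also captures whether the person has any recorded session at all. It is followed by `org_id` (about 0.06, the organization the person belongs to), `invited_by_user_id` (about 0.03, which is 0 when the person was not invited by another user), and whether they came from the Personal Projects creation source (under 0.01).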