# Classification Notebook

Classification model using all data to determine buy or no buy. The data are balanced by reducing the number of people who did not buy so that it is similar to the number who did. (This is the model that informs the buy / no-buy decision.)

#### Importing all the libraries needed for the workbook

In [1]:
# Main Libraries
import pandas as pd  
import numpy as np
import statsmodels.api as sm
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
#from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import Dense
#from tensorflow.keras.utils import to_categorical



#### Providing a data dictionary with the definition of each feature in the dataset

In [2]:
# Define the data for the DataFrame
data = {
    "Variable": [
        "user_pseudo_id", "country", "city", "device_category", "operating_system",
        "sourcemedium", "new_user", "first_event", "last_event", "event_count",
        "items_viewed", "add_to_carts", "checkouts", "purchases", "total_revenue",
        "quantity", "sessions", "coupons", "click_jetzt_einkaufen", "click_gewinne",
        "click_primary_header_wrapper", "click_noch_primary", "click_link_noch",
        "click_dreh", "click_jetzt_spielen", "click_gluecksrad_drehe", "click_code_anzeigen"
    ],
    "Type": [
        "Float64", "Object", "Object", "Object", "Object", "Object", "Int64", 
        "datetime64[ns]", "datetime64[ns]", "Int64", "Int64", "Int64", "Int64", 
        "Int64", "Float64", "Int64", "Int64", "Object", "Int64", "Int64", 
        "Int64", "Int64", "Int64", "Int64", "Int64", "Int64", "Int64"
    ],
    "Definition": [
        "A numerical (float) identifier for users that has been anonymized.",
        "Categorical variable indicating the user's country.",
        "Categorical variable indicating the user's city.",
        "Categorical variable indicating the type of device (e.g., mobile, desktop).",
        "Categorical variable indicating the operating system of the user's device.",
        "Categorical variable indicating the source/medium through which the user accessed the site/app.",
        "Binary indicator whether the user is new (1) or returning (0).",
        "Datetime variable indicating the time of the first event recorded.",
        "Datetime variable indicating the time of the last event recorded.",
        "Numerical variable indicating the total number of events recorded.",
        "Numerical variable indicating the number of items viewed.",
        "Numerical variable indicating how many times items were added to the cart.",
        "Numerical variable indicating how many times a checkout process was initiated.",
        "Numerical variable indicating the number of purchases made.",
        "Numerical (float) indicating the total revenue generated from the user.",
        "Numerical variable indicating the quantity of items involved in transactions.",
        "Numerical variable indicating the number of sessions.",
        "Categorical variable indicating the use of coupons.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website.",
        "Numerical variable attributed to a specific click event recorded on the website."
    ]
}

# Create the DataFrame
variable_info_df = pd.DataFrame(data)

# Set display options
pd.set_option('display.max_colwidth', None)  # or use a large number like 1000 instead of None for older Pandas versions
pd.set_option('display.max_columns', None)  # Ensures all columns are displayed

# Display the DataFrame to verify
variable_info_df

Unnamed: 0,Variable,Type,Definition
0,user_pseudo_id,Float64,A numerical (float) identifier for users that has been anonymized.
1,country,Object,Categorical variable indicating the user's country.
2,city,Object,Categorical variable indicating the user's city.
3,device_category,Object,"Categorical variable indicating the type of device (e.g., mobile, desktop)."
4,operating_system,Object,Categorical variable indicating the operating system of the user's device.
5,sourcemedium,Object,Categorical variable indicating the source/medium through which the user accessed the site/app.
6,new_user,Int64,Binary indicator whether the user is new (1) or returning (0).
7,first_event,datetime64[ns],Datetime variable indicating the time of the first event recorded.
8,last_event,datetime64[ns],Datetime variable indicating the time of the last event recorded.
9,event_count,Int64,Numerical variable indicating the total number of events recorded.


#### Manipulating and cleaning the dataset to be used for classification modeling. 

Here we import the two CSV files that contain the data. I create a new column for each of the two gamification options that were run, 'gluecksrad_engagement' and 'wbyo_engagement', then merge the datasets and add the overall gamification engagement, 'gamification_engagement'.

In [3]:
# Import the wheel (Gluecksrad) data
file_path1 = 'nov24.csv'
wheel = pd.read_csv(file_path1)
# Import the wbyo data
file_path2 = 'nov17.csv'
wbyo = pd.read_csv(file_path2)

In [4]:
#adding the date each promotion was run.
wheel['date'] = pd.to_datetime('2023-11-24')
wbyo['date'] = pd.to_datetime('2023-11-17')

# Specify the columns to check for the condition on the wheel
columns_to_check1 = ['click_noch_primary', 'click_dreh', 'click_jetzt_spielen', 'click_gluecksrad_drehe', 'click_code_anzeigen']

# Use np.where to create the new column based on the condition on the wheel
wheel['gluecksrad_engagement'] = np.where(wheel[columns_to_check1].gt(0).any(axis=1), 1, 0)

# Specify the columns to check for the condition on the wbyo
columns_to_check2 = ['click_noch_primary', 'click_jetzt_einkaufen', 'click_gewinne', 'click_link_noch']

# Use np.where to create the new column based on the condition on the wbyo
wbyo['wbyo_engagement'] = np.where(wbyo[columns_to_check2].gt(0).any(axis=1), 1, 0)

#adding 0 to each dataframe for the new engagement columns so that the 2 dataframes can be merged.
wheel['wbyo_engagement'] = 0
wbyo['gluecksrad_engagement'] = 0

# Concatenate the DataFrames by adding the rows of 'wheel' to 'wbyo'
df = pd.concat([wheel, wbyo], ignore_index=True)

# Specify the engagement columns to check on the combined DataFrame
columns_to_check3 = ['wbyo_engagement', 'gluecksrad_engagement']

# Flag a user as engaged if either promotion-specific flag is set
df['gamification_engagement'] = np.where(df[columns_to_check3].gt(0).any(axis=1), 1, 0)

df

Unnamed: 0,user_pseudo_id,country,city,device_category,operating_system,sourcemedium,new_user,first_event,last_event,event_count,items_viewed,add_to_carts,checkouts,purchases,total_revenue,quantity,sessions,coupons,click_jetzt_einkaufen,click_gewinne,click_primary_header_wrapper,click_noch_primary,click_link_noch,click_dreh,click_jetzt_spielen,click_gluecksrad_drehe,click_code_anzeigen,date,gluecksrad_engagement,wbyo_engagement,gamification_engagement
0,1.653411e+09,Germany,Berlin,mobile,iOS,TBD,1,1700842391176249,1700842456749569,16,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0
1,1.147704e+09,Germany,Mindelheim,mobile,iOS,TBD,1,1700848653518514,1700848692399166,19,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0
2,1.582437e+09,Switzerland,Bulle,mobile,iOS,TBD,1,1700852395847646,1700852628033297,16,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0
3,1.800091e+09,Germany,Frankfurt,mobile,iOS,TBD,1,1700828032485232,1700828308513799,65,1,1,1,1,46.56,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0
4,3.656791e+08,Germany,,mobile,iOS,TBD,1,1700823107022159,1700823183682287,9,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195586,2.082985e+09,Germany,Cuxhaven,mobile,Android,TBD,1,1700220385242808,1700220398841543,6,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0
195587,2.129320e+09,Germany,Bielefeld,mobile,iOS,TBD,1,1700222757561626,1700222798355484,13,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0
195588,2.130216e+09,Germany,Hildesheim,mobile,iOS,TBD,1,1700220383090692,1700220414150397,12,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0
195589,2.131727e+09,Germany,Siegen,mobile,Android,TBD,1,1700220929845757,1700220929845757,3,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0


In [5]:
# Count the number of rows where 'wbyo_engagement' is 1
num_rows_value_1_wbyo_df = (df['wbyo_engagement'] == 1).sum()

print(f"Number of rows where 'wbyo_engagement' is 1 in df: {num_rows_value_1_wbyo_df}")

# Count the number of rows where 'gluecksrad_engagement' is 1
num_rows_value_1_wheel_df = (df['gluecksrad_engagement'] == 1).sum()

print(f"Number of rows where 'gluecksrad_engagement' is 1 in df: {num_rows_value_1_wheel_df}")

# Count the number of rows where 'gamification_engagement' is 1
num_rows_value_1_gamification_df = (df['gamification_engagement'] == 1).sum()

print(f"Number of rows where 'gamification_engagement' is 1 in df: {num_rows_value_1_gamification_df}")

Number of rows where 'wbyo_engagement' is 1 in df: 1826
Number of rows where 'gluecksrad_engagement' is 1 in df: 22056
Number of rows where 'gamification_engagement' is 1 in df: 23882


In [6]:
# Round floats to 2 decimal places in displayed output (e.g. describe())
pd.set_option('display.float_format', lambda x: '%.2f' % x)

Below I take the first and last event to be able to find the amount of time spent on the website.

In [7]:
# Convert 'first_event' and 'last_event' from Unix timestamps in microseconds to datetime
df['first_event'] = pd.to_datetime(df['first_event'], unit='us')
df['last_event'] = pd.to_datetime(df['last_event'], unit='us')

# Subtract 'first_event' from 'last_event' to get the time difference
df['time_diff'] = (df['last_event'] - df['first_event'])

# Convert the time difference to total seconds, creating a numerical column
df['time_spent_seconds'] = df['time_diff'].dt.total_seconds()

df

Unnamed: 0,user_pseudo_id,country,city,device_category,operating_system,sourcemedium,new_user,first_event,last_event,event_count,items_viewed,add_to_carts,checkouts,purchases,total_revenue,quantity,sessions,coupons,click_jetzt_einkaufen,click_gewinne,click_primary_header_wrapper,click_noch_primary,click_link_noch,click_dreh,click_jetzt_spielen,click_gluecksrad_drehe,click_code_anzeigen,date,gluecksrad_engagement,wbyo_engagement,gamification_engagement,time_diff,time_spent_seconds
0,1653410570.17,Germany,Berlin,mobile,iOS,TBD,1,2023-11-24 16:13:11.176249,2023-11-24 16:14:16.749569,16,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0,0 days 00:01:05.573320,65.57
1,1147704469.17,Germany,Mindelheim,mobile,iOS,TBD,1,2023-11-24 17:57:33.518514,2023-11-24 17:58:12.399166,19,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0,0 days 00:00:38.880652,38.88
2,1582437117.17,Switzerland,Bulle,mobile,iOS,TBD,1,2023-11-24 18:59:55.847646,2023-11-24 19:03:48.033297,16,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0,0 days 00:03:52.185651,232.19
3,1800090715.17,Germany,Frankfurt,mobile,iOS,TBD,1,2023-11-24 12:13:52.485232,2023-11-24 12:18:28.513799,65,1,1,1,1,46.56,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0,0 days 00:04:36.028567,276.03
4,365679149.17,Germany,,mobile,iOS,TBD,1,2023-11-24 10:51:47.022159,2023-11-24 10:53:03.682287,9,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-24,0,0,0,0 days 00:01:16.660128,76.66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195586,2082984818.17,Germany,Cuxhaven,mobile,Android,TBD,1,2023-11-17 11:26:25.242808,2023-11-17 11:26:38.841543,6,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0,0 days 00:00:13.598735,13.60
195587,2129320152.17,Germany,Bielefeld,mobile,iOS,TBD,1,2023-11-17 12:05:57.561626,2023-11-17 12:06:38.355484,13,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0,0 days 00:00:40.793858,40.79
195588,2130215726.17,Germany,Hildesheim,mobile,iOS,TBD,1,2023-11-17 11:26:23.090692,2023-11-17 11:26:54.150397,12,1,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0,0 days 00:00:31.059705,31.06
195589,2131727157.17,Germany,Siegen,mobile,Android,TBD,1,2023-11-17 11:35:29.845757,2023-11-17 11:35:29.845757,3,0,0,0,0,0.00,TBD,1,TBD,0,0,0,0,0,0,0,0,0,2023-11-17,0,0,0,0 days 00:00:00,0.00


Next, I drop the columns that are either no longer needed, because new columns have been derived from the raw data, or contain no usable data points and cannot be used in the analysis.

In [8]:
# List of specific columns to drop
columns_to_drop = ['time_diff', 'coupons', 'quantity', 'sourcemedium', 'first_event', 'last_event', 'click_jetzt_einkaufen', 'click_gewinne', 'click_primary_header_wrapper', 'click_noch_primary', 'click_link_noch', 'click_dreh', 'click_jetzt_spielen', 'click_gluecksrad_drehe', 'click_code_anzeigen']

# Dropping the columns from the DataFrame
df = df.drop(columns=columns_to_drop, errors='ignore')  # errors='ignore' to avoid errors if a column is missing

# Display the DataFrame to verify the columns have been dropped
df.info(verbose=True)

df1 = df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195591 entries, 0 to 195590
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   user_pseudo_id           195591 non-null  float64       
 1   country                  195588 non-null  object        
 2   city                     176991 non-null  object        
 3   device_category          195591 non-null  object        
 4   operating_system         195590 non-null  object        
 5   new_user                 195591 non-null  int64         
 6   event_count              195591 non-null  int64         
 7   items_viewed             195591 non-null  int64         
 8   add_to_carts             195591 non-null  int64         
 9   checkouts                195591 non-null  int64         
 10  purchases                195591 non-null  int64         
 11  total_revenue            195591 non-null  float64       
 12  sessions        

This code creates dummy variables for the device categories so that they can be used in the analysis. The original feature is then dropped.

In [9]:
# Create dummy variables for 'device_category'
device_category_dummies = pd.get_dummies(df['device_category'], prefix='device_category')

# Drop the original 'device_category' column from 'df'
df = df.drop('device_category', axis=1)

# Concatenate the dummy variables DataFrame with the original 'df' DataFrame
df = pd.concat([df, device_category_dummies], axis=1)
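
As a possible shortcut, `pd.get_dummies` can encode a column and drop the original in a single call through its `columns` argument. The sketch below illustrates this on a small made-up frame; `toy` and its values are assumptions for the example and are not part of the notebook's data.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up)
toy = pd.DataFrame({
    "device_category": ["mobile", "desktop", "tablet", "mobile"],
    "sessions": [1, 2, 1, 3],
})

# columns=[...] encodes the listed columns and removes the originals in one step
encoded = pd.get_dummies(toy, columns=["device_category"], prefix="device_category")

print(encoded.columns.tolist())
```

The result should be equivalent to the three-step version above (create dummies, drop the original, concatenate), just with less room for the steps to get out of sync.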

Here we filter the data to keep only users located in Germany, since we do not ship products outside of DE. A small number of users outside DE may be trying to purchase while on holiday or traveling, but because the company has multiple domains, many users accidentally arrive on the DE store when they meant to visit the Austrian or Swiss website. Eliminating all other countries therefore gives cleaner data.

Additionally, I created dummy features for the operating systems, as was done for devices, and then dropped the original feature.

Furthermore, I dropped the country, city and user ID, since these features cannot be used in the models.

Finally, I converted the date column to a binary variable: there were two gamification options, each run on a single day (different dates), so the date can be encoded as 0 or 1 and used in the models.

In [10]:
# Filter 'df' to keep only rows where the 'country' column is 'Germany'
# This is because purchases can only be made in Germany for this specific store.
df = df[df['country'] == 'Germany']

# Create dummy variables for 'operating_system'
operating_system_dummies = pd.get_dummies(df['operating_system'], prefix='operating_system')

# Drop the original 'operating_system' column from 'df'
df = df.drop('operating_system', axis=1)

# Concatenate the dummy variables DataFrame with the original 'df' DataFrame
df = pd.concat([df, operating_system_dummies], axis=1)

# Drop country and city: country is now always Germany and the scope of city is too wide
# user_pseudo_id is dropped as it is not needed for modeling either

# List of specific columns to drop
columns_to_drop = ['country', 'city', 'user_pseudo_id']

# Dropping the columns from the DataFrame
df = df.drop(columns=columns_to_drop, errors='ignore')  # errors='ignore' to avoid errors if a column is missing

# Convert 'date' column to dummy variable: 0 for '2023-11-17' and 1 for '2023-11-24'
df['date'] = df['date'].apply(lambda x: 1 if x == pd.Timestamp('2023-11-24') else 0)

# Display the DataFrame to verify the columns have been dropped
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 187930 entries, 0 to 195590
Data columns (total 23 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   new_user                     187930 non-null  int64  
 1   event_count                  187930 non-null  int64  
 2   items_viewed                 187930 non-null  int64  
 3   add_to_carts                 187930 non-null  int64  
 4   checkouts                    187930 non-null  int64  
 5   purchases                    187930 non-null  int64  
 6   total_revenue                187930 non-null  float64
 7   sessions                     187930 non-null  int64  
 8   date                         187930 non-null  int64  
 9   gluecksrad_engagement        187930 non-null  int64  
 10  wbyo_engagement              187930 non-null  int64  
 11  gamification_engagement      187930 non-null  int64  
 12  time_spent_seconds           187930 non-null  float64
 13 

In [11]:
df2 = df

In [12]:
# Keep df1 (the snapshot taken before dummy encoding, still holding the object columns) as cdf for the classification models.
cdf = df1


In [13]:
# Display the DataFrame to verify the columns have been dropped
df1.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195591 entries, 0 to 195590
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   user_pseudo_id           195591 non-null  float64       
 1   country                  195588 non-null  object        
 2   city                     176991 non-null  object        
 3   device_category          195591 non-null  object        
 4   operating_system         195590 non-null  object        
 5   new_user                 195591 non-null  int64         
 6   event_count              195591 non-null  int64         
 7   items_viewed             195591 non-null  int64         
 8   add_to_carts             195591 non-null  int64         
 9   checkouts                195591 non-null  int64         
 10  purchases                195591 non-null  int64         
 11  total_revenue            195591 non-null  float64       
 12  sessions        

#### Classification models suitable for predicting customer purchase behavior

1. Logistic Regression: A fundamental classification algorithm that models the probability of the default class (e.g., purchase). It's particularly useful for binary classification problems.
2. Decision Trees: A non-linear model that splits the data into branches to form a tree structure based on decision points. It's easy to interpret and can handle both numerical and categorical data.
3. Random Forest: An ensemble method that uses multiple decision trees to improve classification accuracy. It reduces overfitting and is robust against noise in the data.
4. Gradient Boosting Machines (GBM): Another ensemble technique that builds trees sequentially, with each tree trying to correct errors made by the previous ones. It's known for high accuracy and can handle various types of data.
5. Support Vector Machines (SVM): A powerful classifier that finds the optimal hyperplane for separating different classes in the feature space. It works well for high-dimensional data.
6. K-Nearest Neighbors (KNN): A simple, instance-based learning algorithm where the class of a sample is determined by the majority class among its k nearest neighbors. It's effective for datasets where similar instances lead to similar outcomes.
7. Naive Bayes: A set of algorithms based on Bayes' theorem, assuming independence between predictors. It's fast and works well with high-dimensional data.
8. Neural Networks: Deep learning models that can capture complex relationships in the data. For binary classification, a single output layer with a sigmoid activation function can be used to predict the probability of a purchase.

#### Before training these models, it's crucial to preprocess your data:

1. Encode categorical variables.
2. Handle missing values appropriately.
3. Normalize or standardize numerical features if necessary.
4. Split the data into training and testing sets to evaluate model performance.
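
The four steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data, not the preprocessing used later in this notebook: the `toy` frame, the median imputation strategy, and the choice of columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names mirror the notebook, values are made up
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "event_count": rng.integers(1, 60, n).astype(float),
    "sessions": rng.integers(1, 5, n).astype(float),
    "device_category": rng.choice(["mobile", "desktop", "tablet"], n),
})
toy.loc[::25, "event_count"] = np.nan  # inject a few missing values
toy["purchases"] = (toy["event_count"].fillna(0) + rng.normal(0, 10, n) > 40).astype(int)

numeric = ["event_count", "sessions"]
categorical = ["device_category"]

# Steps 1-3: impute and scale numeric columns, one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Step 4: hold out 30% of the rows for evaluation
X = toy[numeric + categorical]
y = toy["purchases"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```

Wrapping the preprocessing in a pipeline means the imputer, scaler and encoder are fitted on the training rows only and then reused unchanged on the test rows.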

In [14]:
# Keep only the users who engaged with either gamification promotion
engaged_df = df[df['gamification_engagement'] == 1]

# Display the first few rows of the filtered DataFrame to verify
print(engaged_df.head())


    new_user  event_count  items_viewed  add_to_carts  checkouts  purchases  \
20         1           35             1             1          0          0   
21         1            6             0             0          0          0   
28         1           18             0             0          0          0   
30         1            8             0             0          0          0   
31         1           19             0             0          0          0   

    total_revenue  sessions  date  gluecksrad_engagement  wbyo_engagement  \
20           0.00         1     1                      1                0   
21           0.00         1     1                      1                0   
28           0.00         1     1                      1                0   
30           0.00         1     1                      1                0   
31           0.00         1     1                      1                0   

    gamification_engagement  time_spent_seconds  device_catego

In [15]:
# Keep only the users who did not engage with either gamification promotion
unengaged_df = df[df['gamification_engagement'] == 0]

# Now, sample down to 23,882 rows
unengaged_df_sampled = unengaged_df.sample(n=23882, random_state=42)  # Use a fixed random_state for reproducibility

# Display the shape of the sampled DataFrame to verify
print(unengaged_df_sampled.shape)


(23882, 23)
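
The hard-coded sample size of 23,882 is the engaged-user count taken before the Germany filter; after filtering, the engaged subset can be smaller, which leaves the two groups slightly uneven. One way to keep them the same size, sketched below on a made-up frame (`toy` and its values are assumptions for the example), is to derive the sample size from the engaged subset itself.

```python
import pandas as pd

# Toy stand-in for df: 3 engaged rows and 7 unengaged rows (values are made up)
toy = pd.DataFrame({
    "gamification_engagement": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "purchases": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
})

engaged = toy[toy["gamification_engagement"] == 1]
unengaged = toy[toy["gamification_engagement"] == 0]

# Draw exactly as many unengaged rows as there are engaged rows
unengaged_sampled = unengaged.sample(n=len(engaged), random_state=42)

balanced = pd.concat([engaged, unengaged_sampled], ignore_index=True)
print(balanced["gamification_engagement"].value_counts())
```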


In [16]:
# Inspect the row count and column dtypes of the engaged subset
engaged_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22546 entries, 20 to 195573
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   new_user                     22546 non-null  int64  
 1   event_count                  22546 non-null  int64  
 2   items_viewed                 22546 non-null  int64  
 3   add_to_carts                 22546 non-null  int64  
 4   checkouts                    22546 non-null  int64  
 5   purchases                    22546 non-null  int64  
 6   total_revenue                22546 non-null  float64
 7   sessions                     22546 non-null  int64  
 8   date                         22546 non-null  int64  
 9   gluecksrad_engagement        22546 non-null  int64  
 10  wbyo_engagement              22546 non-null  int64  
 11  gamification_engagement      22546 non-null  int64  
 12  time_spent_seconds           22546 non-null  float64
 13  device_categor

In [17]:
# Inspect the row count and column dtypes of the sampled unengaged subset
unengaged_df_sampled.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23882 entries, 49163 to 39580
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   new_user                     23882 non-null  int64  
 1   event_count                  23882 non-null  int64  
 2   items_viewed                 23882 non-null  int64  
 3   add_to_carts                 23882 non-null  int64  
 4   checkouts                    23882 non-null  int64  
 5   purchases                    23882 non-null  int64  
 6   total_revenue                23882 non-null  float64
 7   sessions                     23882 non-null  int64  
 8   date                         23882 non-null  int64  
 9   gluecksrad_engagement        23882 non-null  int64  
 10  wbyo_engagement              23882 non-null  int64  
 11  gamification_engagement      23882 non-null  int64  
 12  time_spent_seconds           23882 non-null  float64
 13  device_categ

In [18]:
# Concatenate the engaged users with the sampled unengaged users
class_df = pd.concat([engaged_df, unengaged_df_sampled], ignore_index=True)

In [19]:
# Inspect the row count and column dtypes of the combined DataFrame
class_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46428 entries, 0 to 46427
Data columns (total 23 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   new_user                     46428 non-null  int64  
 1   event_count                  46428 non-null  int64  
 2   items_viewed                 46428 non-null  int64  
 3   add_to_carts                 46428 non-null  int64  
 4   checkouts                    46428 non-null  int64  
 5   purchases                    46428 non-null  int64  
 6   total_revenue                46428 non-null  float64
 7   sessions                     46428 non-null  int64  
 8   date                         46428 non-null  int64  
 9   gluecksrad_engagement        46428 non-null  int64  
 10  wbyo_engagement              46428 non-null  int64  
 11  gamification_engagement      46428 non-null  int64  
 12  time_spent_seconds           46428 non-null  float64
 13  device_category_

In [20]:
# Assuming class_df is already defined and includes the 'purchases' column
unique_purchases = class_df['purchases'].unique()

# Printing the unique values found in the 'purchases' column
print("Unique values in 'purchases':", unique_purchases)

Unique values in 'purchases': [0 1 2]


In [21]:
# Assuming class_df is already defined
# Function to convert non-zero values to 1
def convert_to_binary(x):
    return 1 if x != 0 else 0

# Apply this function to every item in the 'purchases' column
class_df['purchases'] = class_df['purchases'].apply(convert_to_binary)

# Print the modified unique values to verify the changes
print("Unique values in 'purchases' after conversion:", class_df['purchases'].unique())

Unique values in 'purchases' after conversion: [0 1]


In [22]:
# a: Total number of rows in the DataFrame
total_rows = len(class_df)

# b: Number of times 1 is used as a value in "purchases"
count_ones = class_df['purchases'].sum()  # Since 'purchases' is binary, summing will count all '1's

# c: Number of times 0 is used as a value in "purchases"
count_zeros = total_rows - count_ones  # Subtract the count of '1's from the total to get count of '0's

print("Total number of rows in the DataFrame:", total_rows)
print("Number of times 1 is used as a value in 'purchases':", count_ones)
print("Number of times 0 is used as a value in 'purchases':", count_zeros)

Total number of rows in the DataFrame: 46428
Number of times 1 is used as a value in 'purchases': 4065
Number of times 0 is used as a value in 'purchases': 42363
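
The counts above show that the target itself is still imbalanced, with far fewer purchasers than non-purchasers. If the goal is to reduce the non-buyers to match the buyers, one possible approach is a per-class downsample with `groupby(...).sample(...)`. The sketch below uses a made-up frame; `toy`, its values, and the choice to match the minority class exactly are assumptions for the example.

```python
import pandas as pd

# Toy stand-in for class_df: 2 purchasers and 8 non-purchasers (values are made up)
toy = pd.DataFrame({
    "purchases": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "event_count": [40, 5, 8, 3, 12, 55, 7, 9, 4, 6],
})

# Size of the smaller class
n_minority = int(toy["purchases"].value_counts().min())

# Sample the same number of rows from each class
balanced = (
    toy.groupby("purchases")
       .sample(n=n_minority, random_state=42)
       .reset_index(drop=True)
)
print(balanced["purchases"].value_counts())
```

In practice it is usually safer to downsample only the training split, so that the test set keeps the real-world purchase rate.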


In [23]:
test = engaged_df['purchases'].sum()
print("Total purchases by engaged users:", test)

test2 = df['gamification_engagement'].sum()
print("Number of engaged users in df:", test2)

test3 = df['purchases'].sum()
print("Total purchases in df:", test3)

Total purchases by engaged users: 2899
Number of engaged users in df: 22546
Total purchases in df: 11418


### Running the classification code on the merged df of 22,546 engaged users and 23,882 sampled unengaged users

In [24]:
# Prepare the Data
# Separate features and target
X = class_df.drop(['checkouts', 'total_revenue', 'purchases', 'date'], axis=1)
y = class_df['purchases']

# Standardize the numerical features
numeric_features = ['event_count', 'items_viewed', 'add_to_carts', 'sessions', 'time_spent_seconds']
scaler = StandardScaler()
X[numeric_features] = scaler.fit_transform(X[numeric_features])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training and 30% testing
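
Note that the cell above fits the scaler on the full feature matrix before splitting, so test-set statistics influence the scaling. A common alternative, sketched here on made-up data, is to split first and fit `StandardScaler` on the training rows only. The `toy` frame is an assumption for the example, and `stratify=y` is an additional assumption that keeps the purchase rate similar in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values are made up
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "event_count": rng.integers(1, 60, 100).astype(float),
    "time_spent_seconds": rng.uniform(0, 600, 100),
    "purchases": rng.integers(0, 2, 100),
})

X = toy.drop(columns="purchases")
y = toy["purchases"]
numeric_features = ["event_count", "time_spent_seconds"]

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_train = X_train.copy()
X_test = X_test.copy()

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])

# Reuse the training statistics on the test rows
X_test[numeric_features] = scaler.transform(X_test[numeric_features])
```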

In [25]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Trees': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting Machines': GradientBoostingClassifier(),
    'Support Vector Machine': SVC(),
    #'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'MLP Classifier': MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, activation='relu', solver='adam', random_state=42)
}


In [None]:
# Train and Evaluate Models

for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    print(f"Model: {name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("----------\n")


Model: Logistic Regression
Accuracy: 0.9358173594658626
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97     12684
           1       0.71      0.48      0.57      1245

    accuracy                           0.94     13929
   macro avg       0.83      0.73      0.77     13929
weighted avg       0.93      0.94      0.93     13929

----------

Model: Decision Trees
Accuracy: 0.9401967118960443
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97     12684
           1       0.67      0.65      0.66      1245

    accuracy                           0.94     13929
   macro avg       0.82      0.81      0.81     13929
weighted avg       0.94      0.94      0.94     13929

----------

Model: Random Forest
Accuracy: 0.9522578792447411
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.
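
In the reports above, overall accuracy is around 0.94 while recall for purchasers (class 1) is much lower, for example 0.48 for Logistic Regression. With only 1,245 purchasers among 13,929 test rows, accuracy alone can look strong even when many purchasers are missed. The sketch below, on synthetic data with roughly 10% positives, compares a trivial baseline against Logistic Regression with and without `class_weight="balanced"`. The data, the baseline, and the class weighting are assumptions for illustration and are not part of the notebook's results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives); not the notebook's data
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 1.8).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

candidates = {
    "Always predict 0": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Logistic Regression (balanced)": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

for name, model in candidates.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"balanced_accuracy={balanced_accuracy_score(y_test, y_pred):.3f}, "
          f"recall(1)={recall_score(y_test, y_pred):.3f}")
```

The baseline that never predicts a purchase has a balanced accuracy of exactly 0.5 and a recall of 0 for class 1, which makes it a useful reference point when reading the accuracy figures.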

### Running the classification code on the df of 22,546 engaged users only

In [27]:
# Assuming engaged_df is already defined.
# Work on an explicit copy so the assignment below does not trigger
# pandas' SettingWithCopyWarning (engaged_df was created as a slice).
engaged_df = engaged_df.copy()

# Convert every non-zero value in 'purchases' to 1 (vectorized)
engaged_df['purchases'] = (engaged_df['purchases'] != 0).astype(int)

# Print the modified unique values to verify the changes
print("Unique values in 'purchases' after conversion:", engaged_df['purchases'].unique())

Unique values in 'purchases' after conversion: [0 1]


In [28]:
# a: Total number of rows in the DataFrame
total_rows = len(engaged_df)

# b: Number of times 1 is used as a value in "purchases"
count_ones = engaged_df['purchases'].sum()  # Since 'purchases' is binary, summing will count all '1's

# c: Number of times 0 is used as a value in "purchases"
count_zeros = total_rows - count_ones  # Subtract the count of '1's from the total to get count of '0's

print("Total number of rows in the DataFrame:", total_rows)
print("Number of times 1 is used as a value in 'purchases':", count_ones)
print("Number of times 0 is used as a value in 'purchases':", count_zeros)

Total number of rows in the DataFrame: 22546
Number of times 1 is used as a value in 'purchases': 2860
Number of times 0 is used as a value in 'purchases': 19686
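
For context, the counts above imply a majority-class baseline: a model that always predicts "no purchase" would already be right on 19686 of 22546 rows. The short sketch below only does that arithmetic; `baseline_accuracy` is an illustrative name, not something defined elsewhere in the notebook.

```python
# Counts copied from the output above
total_rows = 22546
count_ones = 2860
count_zeros = 19686

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = max(count_ones, count_zeros) / total_rows
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The accuracy figures reported for the engaged-users models are best read against this floor of roughly 87%.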


In [29]:
# Prepare the Data
# Separate features and target
X = engaged_df.drop(['checkouts', 'total_revenue', 'purchases', 'date'], axis=1)
y = engaged_df['purchases']

# Standardize the numerical features
numeric_features = ['event_count', 'items_viewed', 'add_to_carts', 'sessions', 'time_spent_seconds']
scaler = StandardScaler()
X[numeric_features] = scaler.fit_transform(X[numeric_features])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training and 30% testing
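
One caveat with the cell above is that `StandardScaler` is fitted on all rows before the split, so the test rows influence the scaling statistics. A possible alternative, sketched here on synthetic data rather than on `engaged_df`, is to split first (optionally with `stratify=y` to preserve the class ratio) and wrap the scaler and model in a scikit-learn `Pipeline`, so that scaling is learned from the training rows only. The column names reuse the notebook's numeric features; the generated values and the target are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real DataFrame (values are made up)
rng = np.random.default_rng(42)
numeric_features = ['event_count', 'items_viewed', 'add_to_carts', 'sessions', 'time_spent_seconds']
X = pd.DataFrame(rng.poisson(5, size=(1000, len(numeric_features))), columns=numeric_features)
# Made-up binary target loosely tied to add_to_carts
y = (X['add_to_carts'] + rng.normal(0, 2, size=1000) > 7).astype(int)

# Split first; stratify keeps the buy / no-buy ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The pipeline fits the scaler on X_train only, then reuses it for X_test
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```

If only some columns should be scaled, as in the notebook, a `ColumnTransformer` could take the scaler's place inside the pipeline.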

In [30]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Trees': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting Machines': GradientBoostingClassifier(),
    'Support Vector Machine': SVC(),
    #'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'MLP Classifier': MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, activation='relu', solver='adam', random_state=42)
}


In [31]:
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    print(f"Model: {name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("----------\n")

Model: Logistic Regression
Accuracy: 0.9151389710230633
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.97      0.95      5929
           1       0.71      0.52      0.60       835

    accuracy                           0.92      6764
   macro avg       0.82      0.75      0.78      6764
weighted avg       0.91      0.92      0.91      6764

----------

Model: Decision Trees
Accuracy: 0.9098166765227675
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      5929
           1       0.63      0.64      0.64       835

    accuracy                           0.91      6764
   macro avg       0.79      0.79      0.79      6764
weighted avg       0.91      0.91      0.91      6764

----------

Model: Random Forest
Accuracy: 0.929183914843288
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.9



### Running the classification code for df of 23882 data points of unengaged users only

In [32]:
# Assuming unengaged_df_sampled is already defined
# Convert every non-zero value in 'purchases' to 1 (vectorized)
unengaged_df_sampled['purchases'] = (unengaged_df_sampled['purchases'] != 0).astype(int)

# Print the modified unique values to verify the changes
print("Unique values in 'purchases' after conversion:", unengaged_df_sampled['purchases'].unique())

Unique values in 'purchases' after conversion: [0 1]


In [33]:
# a: Total number of rows in the DataFrame
total_rows = len(unengaged_df_sampled)

# b: Number of times 1 is used as a value in "purchases"
count_ones = unengaged_df_sampled['purchases'].sum()  # Since 'purchases' is binary, summing will count all '1's

# c: Number of times 0 is used as a value in "purchases"
count_zeros = total_rows - count_ones  # Subtract the count of '1's from the total to get count of '0's

print("Total number of rows in the DataFrame:", total_rows)
print("Number of times 1 is used as a value in 'purchases':", count_ones)
print("Number of times 0 is used as a value in 'purchases':", count_zeros)

Total number of rows in the DataFrame: 23882
Number of times 1 is used as a value in 'purchases': 1205
Number of times 0 is used as a value in 'purchases': 22677


In [34]:
# Prepare the Data
# Separate features and target
X = unengaged_df_sampled.drop(['checkouts', 'total_revenue', 'purchases', 'date'], axis=1)
y = unengaged_df_sampled['purchases']

# Standardize the numerical features
numeric_features = ['event_count', 'items_viewed', 'add_to_carts', 'sessions', 'time_spent_seconds']
scaler = StandardScaler()
X[numeric_features] = scaler.fit_transform(X[numeric_features])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training and 30% testing

In [35]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Trees': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting Machines': GradientBoostingClassifier(),
    'Support Vector Machine': SVC(),
    #'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'MLP Classifier': MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, activation='relu', solver='adam', random_state=42)
}


In [36]:
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    print(f"Model: {name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("----------\n")

Model: Logistic Regression
Accuracy: 0.9666434054431263
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98      6836
           1       0.68      0.52      0.59       329

    accuracy                           0.97      7165
   macro avg       0.83      0.75      0.78      7165
weighted avg       0.96      0.97      0.96      7165

----------

Model: Decision Trees
Accuracy: 0.9697138869504536
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      6836
           1       0.67      0.67      0.67       329

    accuracy                           0.97      7165
   macro avg       0.83      0.83      0.83      7165
weighted avg       0.97      0.97      0.97      7165

----------

Model: Random Forest
Accuracy: 0.973761339846476
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.9

### Testing on a dataset balanced by gamification engagement and purchases

The following dataset is balanced in two ways: it first has an equal number of gamification-engaged (1) and non-engaged (0) users, as done previously, and it then also evens out purchases (1) versus non-purchases (0).

To run the classification models on a balanced subset, the data must be balanced across two variables, gamification_engagement and purchases, so that every combination of the two is equally represented. Either undersampling or oversampling could achieve this; undersampling is used here because it is simpler and keeps only real observations.

The step-by-step approach using pandas:

1. Split the Data: 
First, we'll separate the dataset into four groups based on the combinations of gamification_engagement and purchases:

- Group 1: Gamification engagement = 1 and Purchases = 1
- Group 2: Gamification engagement = 1 and Purchases = 0
- Group 3: Gamification engagement = 0 and Purchases = 1
- Group 4: Gamification engagement = 0 and Purchases = 0

2. Find the Minimum Group Size: 
To balance the dataset, find the size of the smallest group among these, as you'll need to undersample the other groups to this size.

3. Undersample Each Group: 
Randomly sample from each of the larger groups to match the size of the smallest group.

4. Combine the Groups: 
Concatenate these undersampled groups back into a single DataFrame.

In [37]:
# Splitting the data into four groups based on 'gamification_engagement' and 'purchases'
group_1 = df[(df['gamification_engagement'] == 1) & (df['purchases'] == 1)]
group_2 = df[(df['gamification_engagement'] == 1) & (df['purchases'] == 0)]
group_3 = df[(df['gamification_engagement'] == 0) & (df['purchases'] == 1)]
group_4 = df[(df['gamification_engagement'] == 0) & (df['purchases'] == 0)]

# Finding the minimum size among these groups
min_size = min(len(group_1), len(group_2), len(group_3), len(group_4))

# Sampling each group to the size of the smallest group
group_1_sample = group_1.sample(n=min_size, random_state=42)
group_2_sample = group_2.sample(n=min_size, random_state=42)
group_3_sample = group_3.sample(n=min_size, random_state=42)
group_4_sample = group_4.sample(n=min_size, random_state=42)

# Combining the samples into one balanced DataFrame
balanced_df = pd.concat([group_1_sample, group_2_sample, group_3_sample, group_4_sample])

# Shuffle the DataFrame (optional, for training purposes)
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)
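
The same four-way undersampling can likely be written more compactly with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The sketch below uses a small invented DataFrame, `toy_df`, in place of `df`; only the two balancing column names are taken from the notebook, and `balanced_toy` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Invented toy data with the two binary columns used for balancing
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'gamification_engagement': rng.integers(0, 2, size=500),
    'purchases': (rng.random(500) < 0.2).astype(int),
    'event_count': rng.poisson(10, size=500),
})

keys = ['gamification_engagement', 'purchases']

# Size of the smallest of the four (engagement, purchase) groups
min_size = int(toy_df.groupby(keys).size().min())

# Draw min_size rows from every group, then shuffle the result
balanced_toy = (
    toy_df.groupby(keys)
    .sample(n=min_size, random_state=42)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced_toy.groupby(keys).size())
```

This avoids naming each group by hand, at the cost of hiding the individual group sizes that the explicit version makes easy to inspect.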

In [38]:
# Print the unique values of 'purchases' to verify that the column is binary
print("Unique values in 'purchases':", balanced_df['purchases'].unique())

Unique values in 'purchases': [0 1]


In [39]:
# a: Total number of rows in the DataFrame
total_rows = len(balanced_df)

# b: Number of times 1 is used as a value in "purchases"
count_ones = balanced_df['purchases'].sum()  # Since 'purchases' is binary, summing will count all '1's

# c: Number of times 0 is used as a value in "purchases"
count_zeros = total_rows - count_ones  # Subtract the count of '1's from the total to get count of '0's

print("Total number of rows in the DataFrame:", total_rows)
print("Number of times 1 is used as a value in 'purchases':", count_ones)
print("Number of times 0 is used as a value in 'purchases':", count_zeros)

Total number of rows in the DataFrame: 11284
Number of times 1 is used as a value in 'purchases': 5642
Number of times 0 is used as a value in 'purchases': 5642


In [40]:
# Print the unique values of 'gamification_engagement' to verify that the column is binary
print("Unique values in 'gamification_engagement':", balanced_df['gamification_engagement'].unique())

Unique values in 'gamification_engagement': [0 1]


In [41]:
# a: Total number of rows in the DataFrame
total_rows = len(balanced_df)

# b: Number of times 1 is used as a value in "gamification_engagement"
count_ones = balanced_df['gamification_engagement'].sum()  # Since 'gamification_engagement' is binary, summing will count all '1's

# c: Number of times 0 is used as a value in "gamification_engagement"
count_zeros = total_rows - count_ones  # Subtract the count of '1's from the total to get count of '0's

print("Total number of rows in the DataFrame:", total_rows)
print("Number of times 1 is used as a value in 'gamification_engagement':", count_ones)
print("Number of times 0 is used as a value in 'gamification_engagement':", count_zeros)

Total number of rows in the DataFrame: 11284
Number of times 1 is used as a value in 'gamification_engagement': 5642
Number of times 0 is used as a value in 'gamification_engagement': 5642
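
The two checks above confirm that each column is balanced on its own. Since the four groups were sampled to the same size, each combination should hold 11284 / 4 = 2821 rows, and a cross-tabulation shows this jointly. The sketch below rebuilds a stand-in DataFrame with that layout; the name `balanced_like` and its construction are illustrative, not the notebook's `balanced_df`.

```python
import pandas as pd

# Stand-in with the same layout as the balanced data: 2821 rows per combination
cell_size = 11284 // 4
rows = [
    {'gamification_engagement': g, 'purchases': p}
    for g in (0, 1)
    for p in (0, 1)
    for _ in range(cell_size)
]
balanced_like = pd.DataFrame(rows)

# Joint counts of engagement x purchases; margins=True adds row/column totals
table = pd.crosstab(
    balanced_like['gamification_engagement'],
    balanced_like['purchases'],
    margins=True,
)
print(table)
```

On the real data the equivalent call would be `pd.crosstab(balanced_df['gamification_engagement'], balanced_df['purchases'], margins=True)`.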


In [42]:
# Prepare the Data
# Separate features and target
X = balanced_df.drop(['checkouts', 'total_revenue', 'purchases', 'date'], axis=1)
y = balanced_df['purchases']

# Standardize the numerical features
numeric_features = ['event_count', 'items_viewed', 'add_to_carts', 'sessions', 'time_spent_seconds']
scaler = StandardScaler()
X[numeric_features] = scaler.fit_transform(X[numeric_features])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # 70% training and 30% testing

In [43]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Trees': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'Gradient Boosting Machines': GradientBoostingClassifier(),
    'Support Vector Machine': SVC(),
    #'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'MLP Classifier': MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, activation='relu', solver='adam', random_state=42)
}


In [44]:
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate the model
    print(f"Model: {name}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("----------\n")

Model: Logistic Regression
Accuracy: 0.9409332545776727
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      1672
           1       0.93      0.96      0.94      1714

    accuracy                           0.94      3386
   macro avg       0.94      0.94      0.94      3386
weighted avg       0.94      0.94      0.94      3386

----------

Model: Decision Trees
Accuracy: 0.9267572356763142
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      1672
           1       0.92      0.93      0.93      1714

    accuracy                           0.93      3386
   macro avg       0.93      0.93      0.93      3386
weighted avg       0.93      0.93      0.93      3386

----------

Model: Random Forest
Accuracy: 0.9456585942114589
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.92      0.

