
[Data Source](https://www.kaggle.com/datasets/ermismbatuhan/digital-marketing-ecommerce-customer-behavior?select=data1.csv)

In [None]:
# All the packages and functions that are needed:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [None]:
# Suppress warning messages to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import the CSV file from GitHub (fields are separated by semicolons)
df = pd.read_csv("https://raw.githubusercontent.com/Jlok17/Data-Science-Projects/main/data1.csv", sep = ';')

In [None]:
# Previewing the .CSV file to see if it imported Correctly
df.head(5)

Unnamed: 0,account length,location code,user id,credit card info save,push status,add to wishlist,desktop sessions,app sessions,desktop transactions,total product detail views,session duration,promotion clicks,avg order value,sale product views,discount rate per visited products,product detail view per app session,app transactions,add to cart per session,customer service calls,churn
0,128,415,3824657,no,yes,25,265,45,17,110,197,87,2447,91,1101,10,3,27,1,0
1,107,415,3717191,no,yes,26,162,27,17,123,196,103,2544,103,1145,137,3,37,1,0
2,137,415,3581921,no,no,0,243,41,10,114,121,110,1626,104,732,122,5,329,0,0
3,84,408,3759999,yes,no,0,299,51,5,71,62,88,1969,89,886,66,7,178,2,0
4,75,415,3306626,yes,no,0,167,28,13,113,148,122,1869,121,841,101,3,273,3,0


In [None]:
# Checking out the Dataframe Shape
df.shape

(3333, 20)

In [None]:
# Check each column's dtype to find the string (object) columns
# that need to be converted to numeric values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   account length                       3333 non-null   int64 
 1   location code                        3333 non-null   int64 
 2   user id                              3333 non-null   int64 
 3   credit card info save                3333 non-null   object
 4   push status                          3333 non-null   object
 5   add to wishlist                      3333 non-null   int64 
 6   desktop sessions                     3333 non-null   int64 
 7   app sessions                         3333 non-null   int64 
 8   desktop transactions                 3333 non-null   int64 
 9   total product detail views           3333 non-null   int64 
 10  session duration                     3333 non-null   int64 
 11  promotion clicks                     3333 n

In [None]:
# Check the unique values of this column: the codes are labels with no numeric
# meaning, so they should be treated as categories rather than as ordered values
df["location code"].unique()

array([415, 408, 510])

In [None]:
# Convert location code from a number to a string so it is treated as a category
df["location code"] = df["location code"].astype(str)

In [None]:
# Checking the Unique Values of Column "Credit card... save"
df["credit card info save"].unique()

array(['no', 'yes'], dtype=object)

In [None]:
# Map the yes/no values to 1/0
df["credit card info save"] = df["credit card info save"].replace({"yes": 1, "no": 0})
df["push status"] = df["push status"].replace({"yes": 1, "no": 0})

In [None]:
# These columns use a European decimal comma; replace ',' with '.' before casting to float
df["avg order value"] = df["avg order value"].str.replace(',','.').astype(float)
df["discount rate per visited products"] = df["discount rate per visited products"].str.replace(',','.').astype(float)
df["product detail view per app session"] = df["product detail view per app session"].str.replace(',','.').astype(float)
df["add to cart per session"] = df["add to cart per session"].str.replace(',','.').astype(float)
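
As a sketch of an alternative, pandas can parse decimal commas while reading the file through the `decimal` parameter of `read_csv`, which avoids the per-column string replacement. The sample below uses a small made-up inline CSV (not the real dataset), and `sample_csv` / `sample_df` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Made-up two-row sample mimicking the file's layout: ';' separators and ',' decimals.
sample_csv = (
    "user id;avg order value;add to cart per session\n"
    "1;24,47;2,7\n"
    "2;25,44;3,7\n"
)

# decimal="," tells pandas to treat the comma as the decimal mark while parsing.
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=";", decimal=",")

print(sample_df.dtypes)
print(sample_df)
```

With this approach the affected columns arrive as floats directly, so no `.str.replace` step is needed afterwards.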

In [None]:
# One-hot encode location code: one indicator column per distinct value
df = pd.get_dummies(df, columns = ["location code"])
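
To illustrate what the one-hot encoding produces, here is a minimal sketch on a made-up frame (`toy` and `encoded` are hypothetical names). It passes `dtype=int` explicitly, on the assumption that 0/1 columns are wanted; recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Made-up sample containing the three location codes seen in the data.
toy = pd.DataFrame({"location code": ["415", "408", "510", "415"]})

# One indicator column per distinct code; dtype=int keeps 0/1 rather than True/False.
encoded = pd.get_dummies(toy, columns=["location code"], dtype=int)

print(encoded)
```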

In [None]:
# Drop the user id column: it is an identifier, not a predictive feature
df = df.drop("user id", axis = 1)

In [None]:
# Checking Dataframe to see if it is transformed properly
df.head(5)

Unnamed: 0,account length,credit card info save,push status,add to wishlist,desktop sessions,app sessions,desktop transactions,total product detail views,session duration,promotion clicks,...,sale product views,discount rate per visited products,product detail view per app session,app transactions,add to cart per session,customer service calls,churn,location code_408,location code_415,location code_510
0,128,0,1,25,265,45,17,110,197,87,...,91,11.01,10.0,3,2.7,1,0,0,1,0
1,107,0,1,26,162,27,17,123,196,103,...,103,11.45,13.7,3,3.7,1,0,0,1,0
2,137,0,0,0,243,41,10,114,121,110,...,104,7.32,12.2,5,3.29,0,0,0,1,0
3,84,1,0,0,299,51,5,71,62,88,...,89,8.86,6.6,7,1.78,2,0,1,0,0
4,75,1,0,0,167,28,13,113,148,122,...,121,8.41,10.1,3,2.73,3,0,0,1,0


In [None]:
# Getting all the Column Names
df.columns

Index(['account length', 'credit card info save', 'push status',
       'add to wishlist', 'desktop sessions', 'app sessions',
       'desktop transactions', 'total product detail views',
       'session duration', 'promotion clicks', 'avg order value',
       'sale product views', 'discount rate per visited products',
       'product detail view per app session', 'app transactions',
       'add to cart per session', 'customer service calls', 'churn',
       'location code_408', 'location code_415', 'location code_510'],
      dtype='object')

In [None]:
# Rescale the non-boolean columns before training the model.
# Note: Normalizer rescales each row to unit L2 norm; it does not scale each column independently
cols_to_scale = ['account length',
       'add to wishlist', 'desktop sessions', 'app sessions',
       'desktop transactions', 'total product detail views',
       'session duration', 'promotion clicks', 'avg order value',
       'sale product views', 'discount rate per visited products',
       'product detail view per app session', 'app transactions',
       'add to cart per session', 'customer service calls']
scaler = Normalizer()
scaled_data =  scaler.fit_transform(df[cols_to_scale])
scaled_df = pd.DataFrame(scaled_data, index = df.index, columns = cols_to_scale)

In [None]:
scaled_df.head(5)

Unnamed: 0,account length,add to wishlist,desktop sessions,app sessions,desktop transactions,total product detail views,session duration,promotion clicks,avg order value,sale product views,discount rate per visited products,product detail view per app session,app transactions,add to cart per session,customer service calls
0,0.275142,0.053739,0.569631,0.09673,0.036542,0.236451,0.423461,0.187011,0.525995,0.195609,0.023667,0.021496,0.006449,0.005804,0.00215
1,0.252755,0.061417,0.382676,0.063779,0.040157,0.290551,0.462991,0.243307,0.600944,0.243307,0.027047,0.032362,0.007087,0.00874,0.002362
2,0.345945,0.0,0.613611,0.103531,0.025251,0.287867,0.305543,0.277766,0.410589,0.262615,0.018484,0.030807,0.012626,0.008308,0.0
3,0.208327,0.0,0.741543,0.126484,0.0124,0.176086,0.153765,0.218247,0.488327,0.220727,0.021973,0.016369,0.017361,0.004415,0.00496
4,0.205041,0.0,0.456559,0.076549,0.035541,0.308929,0.404615,0.333534,0.510963,0.3308,0.022992,0.027612,0.008202,0.007464,0.008202
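
The values above lie between 0 and 1 because `Normalizer` divides every row by that row's own L2 norm. The following sketch, on a tiny made-up array, contrasts this with `StandardScaler`, which works per column. Which of the two suits this dataset is a judgment call and is not settled by this example.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up toy data: 3 samples (rows) x 2 features (columns).
toy = np.array([[3.0, 4.0],
                [6.0, 8.0],
                [1.0, 0.0]])

# Normalizer: each ROW is divided by its own L2 norm.
row_normalized = Normalizer().fit_transform(toy)

# StandardScaler: each COLUMN is shifted to mean 0 and scaled to unit variance.
col_standardized = StandardScaler().fit_transform(toy)

print(row_normalized)
print(np.linalg.norm(row_normalized, axis=1))  # norm of every row
print(col_standardized.mean(axis=0))           # mean of every column
```

Note that the first two toy rows become identical after row normalization, since they differ only in magnitude. That information is kept by column-wise scaling.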


In [None]:
# Replace the original columns with their scaled versions:
# drop the unscaled columns first so the merge does not produce duplicate columns
df = df.drop(cols_to_scale, axis = 1)
df = pd.merge(df, scaled_df, left_index = True, right_index = True)

In [None]:
df.head(5)

Unnamed: 0,credit card info save,push status,churn,location code_408,location code_415,location code_510,account length,add to wishlist,desktop sessions,app sessions,...,total product detail views,session duration,promotion clicks,avg order value,sale product views,discount rate per visited products,product detail view per app session,app transactions,add to cart per session,customer service calls
0,0,1,0,0,1,0,0.275142,0.053739,0.569631,0.09673,...,0.236451,0.423461,0.187011,0.525995,0.195609,0.023667,0.021496,0.006449,0.005804,0.00215
1,0,1,0,0,1,0,0.252755,0.061417,0.382676,0.063779,...,0.290551,0.462991,0.243307,0.600944,0.243307,0.027047,0.032362,0.007087,0.00874,0.002362
2,0,0,0,0,1,0,0.345945,0.0,0.613611,0.103531,...,0.287867,0.305543,0.277766,0.410589,0.262615,0.018484,0.030807,0.012626,0.008308,0.0
3,1,0,0,1,0,0,0.208327,0.0,0.741543,0.126484,...,0.176086,0.153765,0.218247,0.488327,0.220727,0.021973,0.016369,0.017361,0.004415,0.00496
4,1,0,0,0,1,0,0.205041,0.0,0.456559,0.076549,...,0.308929,0.404615,0.333534,0.510963,0.3308,0.022992,0.027612,0.008202,0.007464,0.008202


In [None]:
# Train Test split
X = df.drop("churn", axis = 1)
y = df["churn"]
# A 25/75 test/train split is used here, but the ratio is flexible;
# typical splits range from 20/80 to 30/70
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42 )

In [None]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(2499, 20) (834, 20) (2499,) (834,)
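
If churn turns out to be imbalanced (an assumption; the class balance is not shown above), passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. This is a minimal sketch on made-up labels, with hypothetical names such as `toy_X` and `toy_y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives and 20 positives (20% positive).
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 80 + [1] * 20)

# stratify=toy_y keeps the 20% positive rate in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())
```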


In [None]:
# Build Model
xgb_cl = xgb.XGBClassifier()
xgb_cl.fit(x_train, y_train)

In [None]:
preds = xgb_cl.predict(x_test)

In [None]:
acc = accuracy_score(y_test, preds)
print("Model Accuracy for Test Dataset:", acc)

Model Accuracy for Test Dataset: 0.9136690647482014


### Hyper Parameters Explained:

<br />

1. max_depth:

max_depth controls the maximum depth of each tree in the ensemble. A larger value allows the model to learn more complex interactions in the data, potentially leading to overfitting.
To figure out a suitable value, you can start with a small value (e.g., 3) and gradually increase it while monitoring the model's performance on a validation dataset. Stop increasing when you notice diminishing returns or overfitting signs (validation performance starts to decrease while training performance improves).

2. learning_rate:

The learning rate (alias eta) shrinks the contribution of each new tree, so it controls how large a step the ensemble takes toward the optimal solution in each boosting round. Smaller values converge more slowly and usually need more boosting rounds (n_estimators), but they often lead to better solutions.
You can perform a grid search or random search over a range of learning rates and compare training and validation performance. Plotting learning curves helps identify the rate at which the model converges without overfitting.

3. gamma:

The gamma parameter (alias min_split_loss) is the minimum loss reduction required to split a leaf node further. It regularizes tree complexity: larger values make the algorithm more conservative and lead to more pruning.
Try different values of gamma and observe how they affect the model's performance on validation data. A higher gamma might help prevent overfitting.

4. scale_pos_weight:

This parameter is particularly useful in imbalanced classification tasks, where the positive class is underrepresented. It scales the weight of positive examples to balance the contribution of the positive and negative classes.
A common starting point is the ratio of negative to positive class samples in the training data (number of negatives divided by number of positives); adjust it from there. Change the value incrementally and observe how it affects precision, recall, and F1-score on validation data.

5. subsample and colsample_bytree:

These parameters control the randomness in constructing trees. subsample is the fraction of training rows sampled before growing each tree, while colsample_bytree is the fraction of features sampled for each tree.
Experiment with different values to balance overfitting against model diversity. Values slightly below 1 (e.g., 0.8) can reduce overfitting by introducing randomness.
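
As a small sketch of the scale_pos_weight advice above, the starting value can be computed from the label counts, and precision, recall, and F1-score can be checked alongside accuracy. The labels and predictions here are made up for illustration, and `y_example` / `y_pred_example` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up binary labels standing in for y_train: 1 = churned, 0 = retained.
y_example = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

n_negative = int((y_example == 0).sum())
n_positive = int((y_example == 1).sum())

# Starting value: number of negatives divided by number of positives.
scale_pos_weight_start = n_negative / n_positive
print(scale_pos_weight_start)  # 4.0

# Made-up predictions: one false positive and one missed churner.
y_pred_example = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

accuracy = accuracy_score(y_example, y_pred_example)
precision = precision_score(y_example, y_pred_example)
recall = recall_score(y_example, y_pred_example)
f1 = f1_score(y_example, y_pred_example)
print(accuracy, precision, recall, f1)
```

In this toy case the accuracy is 0.8 even though only half of the churners are found, which is why the class-specific metrics are worth watching on imbalanced data.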

In [None]:
# hyperparameter tuning

### Providing Options for GridSearchCV to find Best Parameters
param_grid = {
    "max_depth": [5],
    # 0 is omitted: with a learning rate of 0 no tree contributes, so the model cannot learn
    "learning_rate": [0.01, 0.05, 0.1],
    "gamma": [1, 5, 10],
    "scale_pos_weight": [2, 5, 10, 20],
    "subsample": [1],
    "colsample_bytree": [1]
}

xgb_cl2 = xgb.XGBClassifier(objective = "binary:logistic")
grid_cv = GridSearchCV(xgb_cl2, param_grid, n_jobs = -1, cv = 3, scoring = "roc_auc")
_ = grid_cv.fit(x_train, y_train)
print("The Best Score:", grid_cv.best_score_)
print("The Best Params:", grid_cv.best_params_)

The Best Score: 0.8865203664723595
The Best Params: {'colsample_bytree': 1, 'gamma': 1, 'learning_rate': 0.1, 'max_depth': 5, 'scale_pos_weight': 2, 'subsample': 1}


### Different Classifier Options:

The objective parameter in XGBoost allows you to specify the optimization objective for the algorithm. Different objectives are suitable for different types of tasks, such as classification, regression, ranking, and more. Here are some commonly used values for the objective parameter in XGBoost:

1. Binary Classification:

"binary:logistic": Logistic regression for binary classification (default for binary classification tasks).
"binary:logitraw": Predict raw logits instead of probabilities.

2. Multiclass Classification:

"multi:softmax": Multiclass classification using the softmax objective; outputs the predicted class.
"multi:softprob": Same as softmax but outputs class probabilities.

3. Regression:

"reg:squarederror": Regression with squared error loss (default for regression tasks).
"reg:squaredlogerror": Regression with squared log error loss.
"reg:logistic": Logistic regression; outputs probabilities, so targets are expected to lie in [0, 1].

4. Ranking:

"rank:pairwise": Pairwise ranking objective for ranking tasks.
"rank:ndcg": NDCG (Normalized Discounted Cumulative Gain) ranking objective.
"rank:map": Mean Average Precision ranking objective.

5. Poisson Regression:

"count:poisson": Poisson regression for count data.

6. Gamma Regression:

"reg:gamma": Gamma regression for modeling gamma-distributed data.

7. Tweedie Regression:

"reg:tweedie": Tweedie regression for modeling compound Poisson-Gamma distributed data.

8. Custom Objectives:

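You can also define a custom objective by passing a callable Python function as the objective parameter. The function must return the gradient and Hessian (first and second derivatives) of your loss with respect to the predictions, which lets you tailor the loss function to your problem.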
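
A minimal sketch of such a callable follows, assuming the scikit-learn wrapper's convention in which the function receives `(y_true, y_pred)` with `y_pred` as raw (untransformed) scores. `logistic_objective` is a hypothetical name; it simply reimplements the log loss that "binary:logistic" already provides, purely for illustration.

```python
import numpy as np

def logistic_objective(y_true, y_pred):
    """Gradient and Hessian of binary log loss w.r.t. raw scores (illustrative helper)."""
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid turns raw scores into probabilities
    grad = prob - y_true                   # first derivative of the log loss
    hess = prob * (1.0 - prob)             # second derivative of the log loss
    return grad, hess

# Quick check on made-up values: a raw score of 0 corresponds to a probability of 0.5.
grad, hess = logistic_objective(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
print(grad)  # [-0.5  0.5]
print(hess)  # [0.25 0.25]

# Assumed usage with the scikit-learn wrapper (not run here):
# model = xgb.XGBClassifier(objective=logistic_objective)
```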

In [None]:
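# Refit a classifier on the training set using the best parameters from the grid search
final_cl = xgb.XGBClassifier(
    **grid_cv.best_params_, objective = "binary:logistic"
)
grid_final = final_cl.fit(x_train, y_train)
# Evaluate the tuned model on the held-out test set
preds = grid_final.predict(x_test)
acc = accuracy_score(y_test, preds)
print("Accuracy of the Final Model:", acc)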

Accuracy of the Final Model: 0.9244604316546763


