# Expresso Customer Churn Prediction

## 1.0 Introduction

### 1.1 Business Understanding / Project Objective

Per [Paddle](https://www.paddle.com/resources/customer-attrition#:~:text=Customer%20attrition%20is%20defined%20as,of%20business%20health%20over%20time.), customer churn may be defined as the loss of customers by a business. Although it is a normal part of the customer cycle, it is viewed as a key indicator of business health over time and must be managed to support the business's survival, the development of retention strategies, and growth.

It is also known as customer attrition or customer turnover, and is calculated as the percentage of customers that stopped using a company's product or service within a specified timeframe. To better manage customer churn, companies should be able to predict it with reasonable accuracy, and that is where machine learning comes in.
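
As a small illustration of the calculation described above, the sketch below computes a churn rate as a percentage of the customers present at the start of a period. The function name `churn_rate` and the figures used are hypothetical and are not taken from the dataset.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return churn over a period as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical figures: 2,000 customers at the start of a quarter, 150 of them lost
rate = churn_rate(2000, 150)
print(f"Quarterly churn rate: {rate:.1f}%")  # prints "Quarterly churn rate: 7.5%"
```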

This project focuses on Expresso, a telecommunications company, and aims to predict the likelihood that a customer will churn by modelling the key indicators of churn. It may also recommend strategies that could be explored and implemented to improve retention (or reduce churn).

### 1.2 Data Understanding

The dataset contains information about the location of clients, the services that they use, the regularity of service use, and their churn status. The columns in the dataset are described below:

- **user_id**: user ID
- **REGION**: the location of each client
- **TENURE**: duration in the network
- **MONTANT**: top-up amount
- **FREQUENCE_RECH**: number of times the client recharged
- **REVENUE**: monthly income of each client
- **ARPU_SEGMENT**: income over 90 days / 3
- **FREQUENCE**: number of times the client has made an income
- **DATA_VOLUME**: number of connections
- **ON_NET**: inter expresso call
- **ORANGE**: calls to orange
- **TIGO**: calls to Tigo
- **ZONE1**: calls to zones1
- **ZONE2**: calls to zones2
- **MRG**: a client who is going
- **REGULARITY**: number of times the client is active for 90 days
- **TOP_PACK**: the most active packs
- **FREQ_TOP_PACK**: number of times the client has activated the top pack packages
- **CHURN**: variable to predict - Target


## 2.0 Hypotheses and Questions

1. Customers with partners & dependents churn less
2. What is the distribution of customers by senior citizenship and how do they churn?
3. In terms of tenure, which range of users have churned least?
4. At what tenure levels do we lose most customers?
5. Customers who exceed the average tenure are less likely to churn
6. Users who don't use phone service churn more than phone service users
7. Does the use of multiple lines lead to reduced churn?
8. DSL users churn more than fibre-optic users
9. Users who stream both TV & movies churn less than those who stream only one
10. Customers with tech support churn less

## 3.0 Toolbox Loading

In [None]:
# Data Manipulation
import numpy as np
import pandas as pd
import re
import pickle

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Warnings
import warnings
warnings.filterwarnings("ignore")  # Hiding the warnings

# Feature Engineering
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import *
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
import xgboost as xgb
from xgboost import *

# Model evaluation
from sklearn import metrics
from sklearn.metrics import *

print("Loading complete.", "Warnings hidden.")

In [None]:
# Removing the restriction on columns to display
pd.set_option("display.max_columns", None)

## 4.0 Data Exploration

In [None]:
# Loading the data
dataset = pd.read_csv("Train.csv")
dataset

In [None]:
# Looking at information about the columns
dataset.info()

In [None]:
# Cast all column names to lowercase
dataset.columns = dataset.columns.str.lower()

In [None]:
# Checking for duplicates
dataset[dataset.duplicated()]

From the dataset preview and the info above, we make the following observations:
- Out of the 18 columns, only 5 have no missing values. The columns therefore have to be assessed, and necessary action taken on the columns to deal with the missing values.
- There are no duplicates in the dataset.
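
One way to assess the missing values column by column is to tabulate their counts and percentages. The sketch below does this on a small made-up DataFrame (`toy`, with invented values) rather than on the real data, so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up)
toy = pd.DataFrame({
    "region": ["DAKAR", None, "THIES", None],
    "montant": [500.0, np.nan, 1000.0, 250.0],
    "regularity": [12, 3, 45, 7],
})

# Count and percentage of missing values per column, worst columns first
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Applied to the real `dataset`, a table like this helps decide which columns can be imputed and which may be too sparse to keep.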

In [None]:
# # Performing initial cleaning on the dataset
# dataset["TotalCharges"] = dataset["TotalCharges"].replace(" ", np.nan)  # replacing the empty spaces in the column with nulls
# dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"])  # converting the column to a float

# # converting the values to Yes or No
# dataset["SeniorCitizen"] = np.where(dataset["SeniorCitizen"] == 0, "No", "Yes")

# # dropping the null values in the dataset
# dataset.dropna(inplace= True)
# dataset.reset_index(drop = True, inplace = True)

# # Dropping the customer ID column
# dataset.drop(columns=["customerID"], inplace=True)
# dataset.info()

### 4.1 Exploration of Numeric Columns

*What is the distribution of the columns with numeric values? Are there any outliers?*

In [None]:
# Looking at the descriptive statistics of the columns with numeric values
numerics = [column for column in dataset.columns if (dataset[column].dtype != "O") & (len(dataset[column].unique()) > 2)]
print("Summary table of the Descriptive Statistics of Columns with Numeric Values")
dataset[numerics].describe()
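
To complement the descriptive statistics when looking for outliers, the sketch below flags values with the common 1.5 × IQR rule. The helper `iqr_outlier_mask` and the `toy_revenue` values are hypothetical and only illustrate the idea.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up values with one obvious extreme observation
toy_revenue = pd.Series([100, 120, 130, 150, 160, 5000], name="revenue")
mask = iqr_outlier_mask(toy_revenue)
print(toy_revenue[mask])
```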

In [None]:
# Visualizing the distributions of the columns with numeric values
for column in dataset[numerics].columns:
    if len(dataset[column].unique()) > 2:

        # Visualizing the distribution of categories inside the column
        fig = px.box(dataset[numerics], y=column, labels={"color": "Churned",
                                                          "tenure": "Tenure (months)",
                                                          },
                     title=f"A visual representation of values in the {column} column"
                     )
        fig.show()

        # Visualizing the proportion of churn for each category inside the column
        fig = px.box(dataset[numerics], y=column, color=dataset["churn"], labels={"color": "Churned",
                                                                                  "tenure": "Tenure (months)",
                                                                                  },
                     title=f"A visual representation of values in the {column} column split by churn levels"
                     )
        fig.show()

### 4.2 Exploration of Categorical Columns

In [None]:
# Visualizing the distribution of the columns with categorical values and their churn levels
categoricals = [column for column in dataset.columns if (dataset[column].dtype == "O")]

for column in dataset[categoricals].columns:
    # Visualizing the distribution of the categories in the column
    fig = px.histogram(dataset, x=dataset[column], text_auto=True,
                       title=f"Distribution of values in the {column} column")
    fig.show()

    # Visualizing the churn proportions of the categories in the column
    fig = px.histogram(dataset, x=dataset[column], color="churn", barnorm="percent", text_auto=".2f",
                       title=f"Churn proportions of users in {column} column")
    fig.show()

### 4.3 Answering the other questions

## 5.0 Feature Engineering
### 5.1 Feature Encoding

In [None]:
# Looking at the unique values in each column
dataset.nunique()

**Preview**

Here, the columns with two unique values (yes or no) will be encoded using label encoding, where 1 is assigned to yes and 0 assigned to no.

For gender, since there is no ordinal arrangement between the two categories (male and female), it will be encoded using one-hot encoding.

The rest of the categorical columns will also be encoded using one-hot encoding.

The first columns will be dropped for all the columns that are encoded using one-hot encoding.

The numeric columns (tenure, monthly charges, and total charges) will be scaled using the MinMaxScaler to ensure that their structure/distributions are preserved.

In [None]:
gradio_set = dataset.drop(columns=["churn"])  # column names were cast to lowercase earlier
gradio_set

In [None]:
# Exporting the dataset for use in the Gradio app
gradio_set.to_csv("churn_prediction_dataset.csv")

In [None]:
# Encoding the churn column
dataset["churn"] = dataset["churn"].replace({"Yes": 1, "No": 0})

In [None]:
# Creating a list of categoricals
categoricals = [column for column in categoricals if column != "churn"]  # excluding the target
categoricals

In [None]:
# Encoding the categorical variables
oh_encoder = OneHotEncoder(drop="first", sparse_output=False)  # on scikit-learn < 1.2, use sparse=False instead
oh_encoder.fit(dataset[categoricals])
encoded_categoricals = oh_encoder.transform(dataset[categoricals])
encoded_categoricals = pd.DataFrame(encoded_categoricals, columns = oh_encoder.get_feature_names_out().tolist())
encoded_categoricals

In [None]:
# Adding the encoded categoricals to the DataFrame and dropping the original columns
complete_set = dataset.join(encoded_categoricals)
complete_set.drop(columns= categoricals, inplace= True)
complete_set.rename(columns= lambda x: re.sub("[^A-Za-z0-9_]+", "", x), inplace= True)
complete_set

In [None]:
complete_set.info()

In [None]:
# Profiling the final dataframe with SweetViz
import sweetviz as sv

final_df_report = sv.analyze(complete_set)
final_df_report.show_html(filepath="final_df_report.html")

### 5.2 Feature Selection

#### 5.2.1 Correlation Matrix

In [None]:
# Looking at the correlation between the variables in the merged dataframe
correlation = pd.DataFrame(complete_set.corr())

# Defining a colourscale for the correlation plot
colorscale = [[0.0, "rgb(255,255,255)"], [0.2, "rgb(255, 255, 153)"],
              [0.4, "rgb(153, 255, 204)"], [0.6, "rgb(179, 217, 255)"],
              [0.8, "rgb(240, 179, 255)"], [1.0, "rgb(255, 77, 148)"]
              ]

# Plotting the Correlation Matrix
fig = px.imshow(correlation,
                text_auto=".3f",
                aspect="auto",
                labels={"color": "Correlation Coefficient"},
                contrast_rescaling="minmax",
                color_continuous_scale=colorscale
                )
fig.update_xaxes(side="top")
fig.show()

The correlation matrix gives a comprehensive view of the relationships between the variables in the dataset, but the large number of features makes it hard to read. Other methods will therefore be used to explore the features and their potential importance for the modelling process.
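
When a full matrix is too crowded, one simple alternative is to look only at each feature's correlation with the target, ordered by absolute value. The sketch below uses a made-up frame (`toy`) with hypothetical column values; it illustrates the idea and is not a result from this dataset.

```python
import pandas as pd

# Toy data with a binary target column (all values are made up)
toy = pd.DataFrame({
    "regularity": [60, 55, 40, 10, 5, 2],
    "montant": [900, 800, 950, 300, 200, 250],
    "noise": [1, 0, 1, 0, 1, 0],
    "churn": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["churn"]
    .drop("churn")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```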

#### 5.2.2 Feature Selection using SelectKBest

In [None]:
# Defining the target & predictor variables
X = complete_set.drop(columns=["churn"])
y = complete_set["churn"]

# Splitting the dataframe into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state= 24, stratify= y)

In [None]:
# Fitting the variables to the function
best_features = SelectKBest(score_func=chi2, k="all")
fit = best_features.fit(X_train, y_train)

# Looking at the features & their importances
feature_scores = pd.DataFrame(fit.scores_)
selected_columns = pd.DataFrame(X_train.columns)
columns_x_scores = pd.concat([selected_columns, feature_scores], axis=1)
columns_x_scores.columns = ["Feature", "Score"]

# print 10 largest scores & features
print(columns_x_scores.nlargest(10, "Score"))

In [None]:
# Visualizing the top 10 most important features
fig = px.bar(columns_x_scores.nlargest(10, "Score"), x="Feature", y="Score")
fig.show()

From the plot and table above, we note that the top 5 most important features for churn prediction are total charges (by a mile), tenure, monthly charges, 2-year contracts, and electronic check payment method.

#### 5.2.3 Feature Importance using Extra Trees Classifier

In [None]:
# Instantiating the Classifier and fitting it to the training data
etc_model = ExtraTreesClassifier()
etc_model.fit(X_train, y_train)
print(etc_model.feature_importances_)

In [None]:
# Creating a dataframe of the features and their importances for plotting
feature_importance_lvls = pd.DataFrame(etc_model.feature_importances_, index= X_train.columns).reset_index()
feature_importance_lvls.rename(columns= {"index": "Feature", 0: "Importance"},inplace= True)
feature_importance_lvls.sort_values(by= "Importance", ascending= False, inplace= True)
feature_importance_lvls.head(10)

In [None]:
# Visualizing the top 20 most important features
fig = px.bar(feature_importance_lvls[:20], x="Feature", y="Importance")
fig.show()

Going by the results of the ExtraTreesClassifier (above), the top 5 most important features for churn prediction are tenure, total charges, monthly charges, electronic check payment method, and fibre optic internet service. This matches, to a large extent, the results from the SelectKBest model which also suggested 4 of the top 5 features here.

For now, no features will be removed before the modelling process.

## 6.0 Modelling
*With regard to the imbalance in the dataset*

**Preview**

1. Train_test_split
    
For the modelling, the training data already defined in Section 5.2.2 will be split again into training and testing samples, on which the models will be cross-validated and evaluated.

The test data from Section 5.2.2 will serve as the holdout sample. The best model(s) will make their predictions and receive their final evaluation on this data.

To do this, X_train and y_train will be combined as "train_data" before being split into the train and test samples.
X_test and y_test will also be combined as "test_data", which will be set aside until the prediction and final evaluation of the best model(s).

    
2. Dataset Balancing

Given that the dataset is imbalanced, it would have to be balanced before modelling to reduce the error in prediction since our target is the minority class. The 3 most common methods for balancing are oversampling, undersampling, and SMOTE. Since this project doubles as a study opportunity, I will apply each method to (copies of) the training data and build models under each. The models will then be evaluated before selecting the best one(s) for optimization and application on the final test set.
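
A caveat that applies to all of these methods: if rows are resampled before the data is split, copies of the same observation can land in both the training and the evaluation sample and inflate the scores. The sketch below shows one way to avoid this, by splitting first and oversampling only the training part. It uses `sklearn.utils.resample` on made-up data as an assumption for illustration and is not the procedure followed in the cells of this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 4 churned rows and 16 retained rows (made up)
toy = pd.DataFrame({
    "feature": range(20),
    "churn": [1] * 4 + [0] * 16,
})

# Split first, so the evaluation rows are never seen during resampling
train, holdout = train_test_split(
    toy, test_size=0.25, random_state=24, stratify=toy["churn"]
)

# Randomly oversample the minority class within the training part only
majority = train[train["churn"] == 0]
minority = train[train["churn"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=24
)
balanced_train = pd.concat([majority, minority_upsampled])

print(balanced_train["churn"].value_counts())
print(holdout["churn"].value_counts())
```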

In [None]:
# Putting the training dataset together for modelling
train_data = X_train.join(y_train)  # joins on the shared index
train_data.head()

In [None]:
train_data.shape

In [None]:
# Putting the holdout (test) dataset together for the final evaluation
test_data = X_test.join(y_test)  # joins on the shared index
test_data.head()

In [None]:
test_data.shape

### 6.1 Oversampling

In [None]:
# Making a copy of the training data and checking the shape
oversampling_data = train_data.copy()
count_not_churned, count_churned = oversampling_data["churn"].value_counts()
count_not_churned, count_churned

In [None]:
# Filtering the dataframe for observations for the various classes
not_churned = oversampling_data[oversampling_data["churn"] == 0]
churned = oversampling_data[oversampling_data["churn"] == 1]

In [None]:
# Oversampling the churned (minority) class and combining the "balanced" DataFrame
churn_oversampled = churned.sample(count_not_churned, replace=True, random_state=24)
df_oversampled = pd.concat([not_churned, churn_oversampled])

print("Random over-sampling:")
print(df_oversampled["churn"].value_counts())

In [None]:
# Defining the target & predictor variables
os_X = df_oversampled.drop(columns=["churn"])
os_y = df_oversampled["churn"]

# Splitting the dataframe
os_X_train, os_X_test, os_y_train, os_y_test = train_test_split(os_X, os_y, test_size=0.25, random_state=24)

In [None]:
# Scaling the numeric columns
columns_to_scale = ["tenure", "MonthlyCharges", "TotalCharges"]

os_scaler = MinMaxScaler()

os_X_train[columns_to_scale] = os_scaler.fit_transform(os_X_train[columns_to_scale])
os_X_test[columns_to_scale] = os_scaler.transform(os_X_test[columns_to_scale])

#### 6.1.1 Logistic Regression

In [None]:
# Logistic Regression
log_reg = LogisticRegression(random_state=24)
os_log_reg_model = log_reg.fit(os_X_train, os_y_train)

# Coefficients of the Logistic Regression model (used here as feature importances)
log_reg_importance = os_log_reg_model.coef_[0]
log_reg_importance = pd.DataFrame(log_reg_importance, index=os_X.columns)
log_reg_importance.reset_index(inplace=True)
log_reg_importance.rename(columns={
    "index": "Feature",
    0: "Score"
}, inplace=True)
log_reg_importance.sort_values(by="Score", ascending=False, inplace=True)
log_reg_importance

# Visualizing the feature importances
fig = px.bar(log_reg_importance, x="Feature", y="Score")
fig.show()

In [None]:
# Making predictions
oversampled_log_reg_pred = os_log_reg_model.predict(os_X_test)

# Evaluating the model
oversampled_log_reg_report = classification_report(os_y_test, oversampled_log_reg_pred, target_names=["Stayed", "Churned"])
print(oversampled_log_reg_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, oversampled_log_reg_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)
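
For reference, the F-beta score combines precision and recall, with beta > 1 weighting recall more heavily and beta < 1 weighting precision more heavily; the F2 score corresponds to beta = 2. The sketch below implements the standard formula directly on made-up precision and recall values. The helper name `f_beta` is hypothetical and is independent of scikit-learn's `fbeta_score`.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta formula)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up scores for a model with high precision but low recall
precision, recall = 0.9, 0.5
for beta in (0.5, 1, 2):
    print(f"beta={beta}: {f_beta(precision, recall, beta):.4f}")
```

With these values the score falls as beta grows, because larger beta values penalise the low recall more strongly.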

In [None]:
# Defining the Confusion Matrix
oslr_conf_mat = confusion_matrix(os_y_test, oversampled_log_reg_pred)
oslr_conf_mat = pd.DataFrame(oslr_conf_mat).reset_index(drop=True)
oslr_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(oslr_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.1.2 Decision Tree

In [None]:
# Initializing the model
dt_clf = DecisionTreeClassifier(random_state=24)
os_dt_model = dt_clf.fit(os_X_train, os_y_train)

# Feature importances
dt_importance = os_dt_model.feature_importances_
dt_importance = pd.DataFrame(dt_importance, columns=["score"]).reset_index()
dt_importance["Feature"] = list(os_X.columns)
dt_importance.drop(columns=["index"], inplace=True)

dt_importance.sort_values(by="score",
                          ascending=False,
                          ignore_index=True,
                          inplace=True)

# Plotting the feature importances
fig = px.bar(dt_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
oversampled_dt_pred = os_dt_model.predict(os_X_test)

# Evaluating the model
oversampled_dt_report = classification_report(os_y_test, oversampled_dt_pred)
print(oversampled_dt_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, oversampled_dt_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
os_dt_conf_mat = confusion_matrix(os_y_test, oversampled_dt_pred)
os_dt_conf_mat = pd.DataFrame(os_dt_conf_mat).reset_index(drop=True)
os_dt_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(os_dt_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.1.3 Random Forest

In [None]:
# Random Forests
rf_clf = RandomForestClassifier(random_state=24)
os_rf_model = rf_clf.fit(os_X_train, os_y_train)

# Feature Importance of the Random Forest Model
rf_importance = os_rf_model.feature_importances_
rf_importance = pd.DataFrame(rf_importance, columns=["score"]).reset_index()
rf_importance["Feature"] = list(os_X.columns)
rf_importance.drop(columns=["index"], inplace=True)
rf_importance.sort_values(by="score",
                          ascending=False,
                          ignore_index=True,
                          inplace=True)

# Visualizing the feature importances
fig = px.bar(rf_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
oversampled_rf_pred = os_rf_model.predict(os_X_test)

# Evaluating the model
oversampled_rf_report = classification_report(os_y_test, oversampled_rf_pred)
print(oversampled_rf_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, oversampled_rf_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
os_rf_conf_mat = confusion_matrix(os_y_test, oversampled_rf_pred)
os_rf_conf_mat = pd.DataFrame(os_rf_conf_mat).reset_index(drop=True)
os_rf_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(os_rf_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.1.4 XGBoost

In [None]:
# Fitting model to the training data
xgb_clf = XGBClassifier(random_state=24)
os_xgb_model = xgb_clf.fit(os_X_train, os_y_train)

# Feature Importance of the XGBoost Model
xgb_importance = os_xgb_model.feature_importances_
xgb_importance = pd.DataFrame(xgb_importance, columns=["score"]).reset_index()
xgb_importance["Feature"] = list(os_X.columns)
xgb_importance.drop(columns=["index"], inplace=True)
xgb_importance.sort_values(by="score",
                           ascending=False,
                           ignore_index=True,
                           inplace=True)

# Visualizing the feature importances
fig = px.bar(xgb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
oversampled_xgb_pred = os_xgb_model.predict(os_X_test)

# Evaluating the model
oversampled_xgb_report = classification_report(os_y_test, oversampled_xgb_pred)
print(oversampled_xgb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, oversampled_xgb_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
os_xgb_conf_mat = confusion_matrix(os_y_test, oversampled_xgb_pred)
os_xgb_conf_mat = pd.DataFrame(os_xgb_conf_mat).reset_index(drop=True)
os_xgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(os_xgb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.1.5 CatBoost

In [None]:
# Initializing CatBoostClassifier
oversampled_catb_clf = CatBoostClassifier(metric_period=100, random_state=24)

# Fitting it to the training data
oversampled_catb_model = oversampled_catb_clf.fit(os_X_train, os_y_train)

# Feature Importance of the Model
oversampled_catb_importance = oversampled_catb_model.feature_importances_

oversampled_catb_importance = pd.DataFrame(oversampled_catb_importance,
                                           columns=["score"]).reset_index()

oversampled_catb_importance["Feature"] = list(os_X.columns)

oversampled_catb_importance.drop(columns=["index"], inplace=True)

oversampled_catb_importance.sort_values(by="score",
                                        ascending=False,
                                        ignore_index=True,
                                        inplace=True)

fig = px.bar(oversampled_catb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
oversampled_catb_pred = oversampled_catb_model.predict(os_X_test)

# Evaluating the model
oversampled_catb_report = classification_report(os_y_test,
                                                oversampled_catb_pred,
                                                target_names=["Stayed", "Churned"])
print(oversampled_catb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, oversampled_catb_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
os_catb_conf_mat = confusion_matrix(os_y_test, oversampled_catb_pred)
os_catb_conf_mat = pd.DataFrame(os_catb_conf_mat).reset_index(drop=True)
os_catb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(os_catb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax
            )
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.1.6 LightGBM

In [None]:
# Initializing LightGBM Classifier
oversampled_lgb_clf = lgb.LGBMClassifier(random_state=24)

# Fitting it to the training data
oversampled_lgb_model = oversampled_lgb_clf.fit(os_X_train, os_y_train)

# Feature Importance of the Model
oversampled_lgb_importance = oversampled_lgb_model.feature_importances_

oversampled_lgb_importance = pd.DataFrame(oversampled_lgb_importance,
                                          columns=["score"]).reset_index()

oversampled_lgb_importance["Feature"] = list(os_X.columns)
oversampled_lgb_importance.drop(columns=["index"], inplace=True)
oversampled_lgb_importance.sort_values(by="score",
                                       ascending=False,
                                       ignore_index=True,
                                       inplace=True
                                       )

fig = px.bar(oversampled_lgb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
oversampled_lgb_pred = oversampled_lgb_model.predict(os_X_test)

# Evaluating the model
oversampled_lgb_report = classification_report(os_y_test, oversampled_lgb_pred, target_names=["Stayed", "Churned"])
print(oversampled_lgb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, oversampled_lgb_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
os_lgb_conf_mat = confusion_matrix(os_y_test, oversampled_lgb_pred)
os_lgb_conf_mat = pd.DataFrame(os_lgb_conf_mat).reset_index(drop=True)
os_lgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(os_lgb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

In [None]:
oversampling_models = {
    "os_lr": os_log_reg_model,
    "os_dt": os_dt_model,
    "os_rf": os_rf_model,
    "os_xgb": os_xgb_model,
    "os_catb": oversampled_catb_model,
    "os_lgb": oversampled_lgb_model
}

#### 6.1.7 Summarizing the Performance of the Models

In [None]:
# Defining a helper function to evaluate the models at a go
def evaluation(fit_models, X_test, y_test):
    lst = []
    for name, model in fit_models.items():
        pred = model.predict(X_test)

        # beta=2 gives the F2 score; it is kept numeric so the summary table can be sorted on it
        f2_score = round(fbeta_score(y_test, pred, beta=2), 5)

        lst.append([
            name,
            precision_score(y_test, pred, average="weighted"),
            recall_score(y_test, pred, average="weighted"),
            f1_score(y_test, pred, average="weighted"),
            accuracy_score(y_test, pred),
            f2_score
        ])

    eval_df = pd.DataFrame(lst, columns=["model", "precision", "recall", "f1_weighted", "accuracy", "f2_score"])
    eval_df.set_index("model", inplace=True)
    return eval_df

In [None]:
oversampled_models_eval = evaluation(oversampling_models, os_X_test, os_y_test)
oversampled_models_eval

### 6.2 Undersampling

In [None]:
# One more look at the training data
train_data.head()

In [None]:
# Undersampling the not-churned (majority) class and combining the "balanced" DataFrame
churn_undersampled = not_churned.sample(count_churned, random_state=24)
df_undersampled = pd.concat([churn_undersampled, churned])

print('Random under-sampling:')
print(df_undersampled["churn"].value_counts())

In [None]:
# Defining the target & predictor variables
X = df_undersampled.drop(columns=["churn"])
y = df_undersampled["churn"]

# Splitting the dataframe
us_X_train, us_X_test, us_y_train, us_y_test = train_test_split(X, y, test_size=0.25, random_state=24)

In [None]:
# Scaling the numeric columns
us_scaler = MinMaxScaler()
us_X_train[columns_to_scale] = us_scaler.fit_transform(us_X_train[columns_to_scale])
us_X_test[columns_to_scale] = us_scaler.transform(us_X_test[columns_to_scale])

#### 6.2.1 Logistic Regression

In [None]:
# Logistic Regression
log_reg = LogisticRegression(random_state=24)
us_log_reg_model = log_reg.fit(us_X_train, us_y_train)

# Coefficients of the Logistic Regression model (used here as feature importances)
us_log_reg_importance = us_log_reg_model.coef_[0]
us_log_reg_importance = pd.DataFrame(us_log_reg_importance, index=X.columns)
us_log_reg_importance.reset_index(inplace=True)
us_log_reg_importance.rename(columns={
    "index": "Feature",
    0: "Score"
},
    inplace=True)
us_log_reg_importance.sort_values(by="Score", ascending=False, inplace=True)
us_log_reg_importance

fig = px.bar(us_log_reg_importance, x="Feature", y="Score")
fig.show()

In [None]:
# Making predictions
us_log_reg_pred = us_log_reg_model.predict(us_X_test)

# Evaluating the model
us_log_reg_report = classification_report(us_y_test, us_log_reg_pred)
print(us_log_reg_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(us_y_test, us_log_reg_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
us_lr_conf_mat = confusion_matrix(us_y_test, us_log_reg_pred)
us_lr_conf_mat = pd.DataFrame(us_lr_conf_mat).reset_index(drop=True)
us_lr_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(us_lr_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.2.2 Decision Tree

In [None]:
# Decision Tree
dt_clf = DecisionTreeClassifier(random_state=24)
us_dt_model = dt_clf.fit(us_X_train, us_y_train)

# Feature importances
us_dt_importance = us_dt_model.feature_importances_
us_dt_importance = pd.DataFrame(us_dt_importance,
                                columns=["score"]).reset_index()
us_dt_importance["Feature"] = list(X.columns)
us_dt_importance.drop(columns=["index"], inplace=True)
us_dt_importance.sort_values(by="score",
                             ascending=False,
                             ignore_index=True,
                             inplace=True)

# Plotting the feature importances
fig = px.bar(us_dt_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
us_dt_pred = us_dt_model.predict(us_X_test)

# Evaluating the model
us_dt_report = classification_report(us_y_test, us_dt_pred)
print(us_dt_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(us_y_test, us_dt_pred, beta=2)  # beta=2 gives the F2 score (recall-weighted)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
us_dt_conf_mat = confusion_matrix(us_y_test, us_dt_pred)
us_dt_conf_mat = pd.DataFrame(us_dt_conf_mat).reset_index(drop=True)
us_dt_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(us_dt_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.2.3 Random Forest

In [None]:
# Random Forests
rf_clf = RandomForestClassifier(random_state=24)
us_rf_model = rf_clf.fit(us_X_train, us_y_train)

# Feature Importance of the Random Forest Model
us_rf_importance = us_rf_model.feature_importances_
us_rf_importance = pd.DataFrame(us_rf_importance,
                                columns=["score"]).reset_index()

us_rf_importance["Feature"] = list(X.columns)
us_rf_importance.drop(columns=["index"], inplace=True)

us_rf_importance.sort_values(by="score",
                             ascending=False,
                             ignore_index=True,
                             inplace=True)

fig = px.bar(us_rf_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
us_rf_pred = us_rf_model.predict(us_X_test)

# Evaluating the model
us_rf_report = classification_report(us_y_test, us_rf_pred)
print(us_rf_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(us_y_test, us_rf_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
us_rf_conf_mat = confusion_matrix(us_y_test, us_rf_pred)
us_rf_conf_mat = pd.DataFrame(us_rf_conf_mat).reset_index(drop=True)
us_rf_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(us_rf_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.2.4 XGBoost

In [None]:
# fitting model to the training data
xgb_clf = XGBClassifier(random_state=24)
us_xgb_model = xgb_clf.fit(us_X_train, us_y_train)

# Feature Importance of the XGBoost Model
us_xgb_importance = us_xgb_model.feature_importances_
us_xgb_importance = pd.DataFrame(us_xgb_importance,
                                 columns=["score"]).reset_index()

us_xgb_importance["Feature"] = list(X.columns)

us_xgb_importance.drop(columns=["index"], inplace=True)

us_xgb_importance.sort_values(by="score",
                              ascending=False,
                              ignore_index=True,
                              inplace=True)

fig = px.bar(us_xgb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
us_xgb_pred = us_xgb_model.predict(us_X_test)

# Evaluating the model
us_xgb_report = classification_report(us_y_test, us_xgb_pred)
print(us_xgb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(us_y_test, us_xgb_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
us_xgb_conf_mat = confusion_matrix(us_y_test, us_xgb_pred)
us_xgb_conf_mat = pd.DataFrame(us_xgb_conf_mat).reset_index(drop=True)
us_xgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(us_xgb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.2.5 CatBoost

In [None]:
# Initializing CatBoostClassifier
us_catb_clf = CatBoostClassifier(metric_period=100, random_state=24)

# Fitting it to the training data
us_catb_model = us_catb_clf.fit(us_X_train, us_y_train)

# Feature Importance of the Model
us_catb_importance = us_catb_model.feature_importances_
us_catb_importance = pd.DataFrame(us_catb_importance,
                                  columns=["score"]).reset_index()
us_catb_importance["Feature"] = list(X.columns)
us_catb_importance.drop(columns=["index"], inplace=True)
us_catb_importance.sort_values(by="score",
                               ascending=False,
                               ignore_index=True,
                               inplace=True)

fig = px.bar(us_catb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
us_catb_pred = us_catb_model.predict(us_X_test)

# Evaluating the model
us_catb_report = classification_report(
    us_y_test, us_catb_pred, target_names=["Stayed", "Churned"])
print(us_catb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(us_y_test, us_catb_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
us_catb_conf_mat = confusion_matrix(us_y_test, us_catb_pred)
us_catb_conf_mat = pd.DataFrame(us_catb_conf_mat).reset_index(drop=True)
us_catb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(us_catb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.2.6 LightGBM

In [None]:
# Initializing LightGBM Classifier
us_lgb_clf = lgb.LGBMClassifier(random_state=24)

# Fitting it to the training data
us_lgb_model = us_lgb_clf.fit(us_X_train, us_y_train)

# Feature Importance of the Model
us_lgb_importance = us_lgb_model.feature_importances_
us_lgb_importance = pd.DataFrame(us_lgb_importance,
                                 columns=["score"]).reset_index()
us_lgb_importance["Feature"] = list(X.columns)
us_lgb_importance.drop(columns=["index"], inplace=True)
us_lgb_importance.sort_values(by="score",
                              ascending=False,
                              ignore_index=True,
                              inplace=True)

fig = px.bar(us_lgb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
us_lgb_pred = us_lgb_model.predict(us_X_test)

# Evaluating the model
us_lgb_report = classification_report(
    us_y_test, us_lgb_pred, target_names=["Stayed", "Churned"])
print(us_lgb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(us_y_test, us_lgb_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
us_lgb_conf_mat = confusion_matrix(us_y_test, us_lgb_pred)
us_lgb_conf_mat = pd.DataFrame(us_lgb_conf_mat).reset_index(drop=True)
us_lgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(us_lgb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.2.7 Summarizing the Performance of the Models

In [None]:
undersampling_models = {
    "us_lr": us_log_reg_model,
    "us_dt": us_dt_model,
    "us_rf": us_rf_model,
    "us_xgb": us_xgb_model,
    "us_catb": us_catb_model,
    "us_lgb": us_lgb_model
}

In [None]:
undersampled_models_eval = evaluation(
    undersampling_models, us_X_test, us_y_test)
undersampled_models_eval

### 6.3 SMOTE

In [None]:
# Creating a copy of the training dataframe for SMOTE
smote_data = train_data.copy()
X = smote_data.drop(columns=["Churn"])
y = smote_data["Churn"]

In [None]:
# Resampling the dataframe using SMOTE
smote = SMOTE(sampling_strategy="auto", random_state = 24)
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()
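
As a rough intuition for what the resampling step does, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours, which is the core idea behind SMOTE. It is a simplified, dependency-free illustration on hypothetical two-feature data; the helper `smote_like_sample` is an assumption for this example and is not the `imblearn` implementation.

```python
import random


def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort the remaining points by squared Euclidean distance to the base point
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class rows with two numeric features
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [8.0, 9.0]]
rng = random.Random(24)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]

for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies on a line segment between two existing minority points, the new rows stay inside the range already covered by the minority class rather than duplicating rows exactly.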

In [None]:
# Splitting the dataframe
sm_X_train, sm_X_test, sm_y_train, sm_y_test = train_test_split(X_sm, y_sm, test_size=0.25, random_state=24, stratify=y_sm)
sm_y_train.value_counts()

In [None]:
# Scaling the numeric columns
sm_scaler = MinMaxScaler()

sm_X_train[columns_to_scale] = sm_scaler.fit_transform(sm_X_train[columns_to_scale])
sm_X_test[columns_to_scale] = sm_scaler.transform(sm_X_test[columns_to_scale])
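
The scaler is fitted on the training split only and then reused on the test split. The sketch below reproduces that pattern by hand on hypothetical values; `fit_min_max` and `transform_min_max` are illustrative helpers, not part of scikit-learn.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values only."""
    return min(values), max(values)


def transform_min_max(values, lo, hi):
    """Scale values using the minimum and maximum learned from training data."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Hypothetical top-up amounts
train = [10.0, 20.0, 30.0, 50.0]
test = [5.0, 30.0, 60.0]

lo, hi = fit_min_max(train)
train_scaled = transform_min_max(train, lo, hi)
test_scaled = transform_min_max(test, lo, hi)

print(train_scaled)
print(test_scaled)
```

Test values below the training minimum or above the training maximum fall outside the 0 to 1 range, which is expected when the test split is not used to fit the scaler.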

#### 6.3.1 Logistic Regression

In [None]:
# Logistic Regression
log_reg = LogisticRegression(random_state=24)
sm_log_reg_model = log_reg.fit(sm_X_train, sm_y_train)

# Coefficients of the Logistic Regression model (used as feature importances)
sm_log_reg_importance = sm_log_reg_model.coef_[0]
sm_log_reg_importance = pd.DataFrame(sm_log_reg_importance, index=X.columns)
sm_log_reg_importance.reset_index(inplace=True)
sm_log_reg_importance.rename(columns={
    "index": "Feature",
    0: "Score"
},
    inplace=True)
sm_log_reg_importance.sort_values(by="Score", ascending=False, inplace=True)
sm_log_reg_importance

fig = px.bar(sm_log_reg_importance, x="Feature", y="Score")
fig.show()
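
Sorting coefficients by their signed value places strongly negative coefficients last, even though they may influence the prediction as much as large positive ones. The sketch below contrasts the two orderings using hypothetical coefficient values and placeholder feature names (not results from this model).

```python
# Hypothetical coefficients, for illustration only
coefficients = {
    "feature_a": 0.8,
    "feature_b": -2.1,
    "feature_c": 0.1,
    "feature_d": -0.4,
}

# Ordering by signed value: the strongest negative effect ends up last
by_signed_value = sorted(coefficients, key=coefficients.get, reverse=True)

# Ordering by magnitude: the strongest effects come first, whatever their sign
by_magnitude = sorted(coefficients,
                      key=lambda name: abs(coefficients[name]),
                      reverse=True)

print(by_signed_value)
print(by_magnitude)
```

When the goal is to rank features by influence, ordering by magnitude may be the more useful view, while the sign still indicates the direction of the effect on churn.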

In [None]:
# Making predictions
sm_log_reg_pred = sm_log_reg_model.predict(sm_X_test)

# Evaluating the predictions
sm_log_reg_report = classification_report(sm_y_test, sm_log_reg_pred)
print(sm_log_reg_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(sm_y_test, sm_log_reg_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
sm_log_reg_conf_mat = confusion_matrix(sm_y_test, sm_log_reg_pred)
sm_log_reg_conf_mat = pd.DataFrame(sm_log_reg_conf_mat).reset_index(drop=True)
sm_log_reg_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(sm_log_reg_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.3.2 Decision Tree

In [None]:
# Decision Tree
dt_clf = DecisionTreeClassifier(random_state=24)
sm_dt_model = dt_clf.fit(sm_X_train, sm_y_train)

# Feature importances
sm_dt_importance = sm_dt_model.feature_importances_
sm_dt_importance = pd.DataFrame(sm_dt_importance,
                                columns=["score"]).reset_index()
sm_dt_importance["Feature"] = list(X.columns)
sm_dt_importance.drop(columns=["index"], inplace=True)
sm_dt_importance.sort_values(by="score",
                             ascending=False,
                             ignore_index=True,
                             inplace=True)

# Plotting the feature importances
fig = px.bar(sm_dt_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
sm_dt_pred = sm_dt_model.predict(sm_X_test)

# Evaluating the predictions
sm_dt_report = classification_report(sm_y_test, sm_dt_pred)
print(sm_dt_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(sm_y_test, sm_dt_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
sm_dt_conf_mat = confusion_matrix(sm_y_test, sm_dt_pred)
sm_dt_conf_mat = pd.DataFrame(sm_dt_conf_mat).reset_index(drop=True)
sm_dt_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(sm_dt_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.3.3 Random Forest

In [None]:
# Random Forests
rf_clf = RandomForestClassifier(random_state=24)
sm_rf_model = rf_clf.fit(sm_X_train, sm_y_train)

# Feature Importance of the Random Forest Model
sm_rf_importance = sm_rf_model.feature_importances_
sm_rf_importance = pd.DataFrame(sm_rf_importance,
                                columns=["score"]).reset_index()
sm_rf_importance["Feature"] = list(X.columns)
sm_rf_importance.drop(columns=["index"], inplace=True)
sm_rf_importance.sort_values(by="score",
                             ascending=False,
                             ignore_index=True,
                             inplace=True)

fig = px.bar(sm_rf_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making predictions
sm_rf_pred = sm_rf_model.predict(sm_X_test)

sm_rf_report = classification_report(sm_y_test, sm_rf_pred)
print(sm_rf_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(sm_y_test, sm_rf_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
sm_rf_conf_mat = confusion_matrix(sm_y_test, sm_rf_pred)
sm_rf_conf_mat = pd.DataFrame(sm_rf_conf_mat).reset_index(drop=True)
sm_rf_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(sm_rf_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.3.4 XGBoost

In [None]:
# fitting model to the training data
xgb_clf = XGBClassifier(random_state=24)
sm_xgb_model = xgb_clf.fit(sm_X_train, sm_y_train)

# Feature Importance of the XGBoost Model
sm_xgb_importance = sm_xgb_model.feature_importances_
sm_xgb_importance = pd.DataFrame(sm_xgb_importance,
                                 columns=["score"]).reset_index()
sm_xgb_importance["Feature"] = list(X.columns)
sm_xgb_importance.drop(columns=["index"], inplace=True)
sm_xgb_importance.sort_values(by="score",
                              ascending=False,
                              ignore_index=True,
                              inplace=True)

fig = px.bar(sm_xgb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
sm_xgb_pred = sm_xgb_model.predict(sm_X_test)

# Evaluating the model
sm_xgb_report = classification_report(sm_y_test, sm_xgb_pred)
print(sm_xgb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(sm_y_test, sm_xgb_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
sm_xgb_conf_mat = confusion_matrix(sm_y_test, sm_xgb_pred)
sm_xgb_conf_mat = pd.DataFrame(sm_xgb_conf_mat).reset_index(drop=True)
sm_xgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(sm_xgb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.3.5 CatBoost

In [None]:
# Initializing CatBoostClassifier
catb_clf = CatBoostClassifier(metric_period=100, random_state=24)

# Fitting it to the training data
sm_catb_model = catb_clf.fit(sm_X_train, sm_y_train)

# Feature Importance of the Model
sm_catb_importance = sm_catb_model.feature_importances_
sm_catb_importance = pd.DataFrame(sm_catb_importance,
                                  columns=["score"]).reset_index()

sm_catb_importance["Feature"] = list(X.columns)

sm_catb_importance.drop(columns=["index"], inplace=True)

sm_catb_importance.sort_values(by="score",
                               ascending=False,
                               ignore_index=True,
                               inplace=True)

fig = px.bar(sm_catb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
sm_catb_pred = sm_catb_model.predict(sm_X_test)

# Evaluating the model
sm_catb_report = classification_report(sm_y_test,
                                       sm_catb_pred,
                                       target_names=["Stayed", "Churned"])
print(sm_catb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(sm_y_test, sm_catb_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
sm_catb_conf_mat = confusion_matrix(sm_y_test, sm_catb_pred)
sm_catb_conf_mat = pd.DataFrame(sm_catb_conf_mat).reset_index(drop=True)
sm_catb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(sm_catb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.3.6 LightGBM

In [None]:
# Initializing LightGBM Classifier
sm_lgb_clf = lgb.LGBMClassifier(random_state=24)

# Fitting it to the training data
sm_lgb_model = sm_lgb_clf.fit(sm_X_train, sm_y_train)

# Feature Importance of the Model
sm_lgb_importance = sm_lgb_model.feature_importances_
sm_lgb_importance = pd.DataFrame(
    sm_lgb_importance, columns=["score"]).reset_index()
sm_lgb_importance["Feature"] = list(X.columns)
sm_lgb_importance.drop(columns=["index"], inplace=True)

sm_lgb_importance.sort_values(by="score",
                              ascending=False,
                              ignore_index=True,
                              inplace=True)

fig = px.bar(sm_lgb_importance, x="Feature", y="score")
fig.show()

In [None]:
# Making the predictions
sm_lgb_pred = sm_lgb_model.predict(sm_X_test)

# Evaluating the model
sm_lgb_report = classification_report(
    sm_y_test, sm_lgb_pred, target_names=["Stayed", "Churned"])
print(sm_lgb_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(sm_y_test, sm_lgb_pred, beta=2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
sm_lgb_conf_mat = confusion_matrix(sm_y_test, sm_lgb_pred)
sm_lgb_conf_mat = pd.DataFrame(sm_lgb_conf_mat).reset_index(drop=True)
sm_lgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(sm_lgb_conf_mat,
            annot=True,
            linewidth=1.0,
            fmt=".0f",
            cmap="RdPu",
            ax=ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.3.7 Summarizing the Performance of the Models

In [None]:
smote_models = {
    "sm_lr": sm_log_reg_model,
    "sm_dt": sm_dt_model,
    "sm_rf": sm_rf_model,
    "sm_xgb": sm_xgb_model,
    "sm_catb": sm_catb_model,
    "sm_lgb": sm_lgb_model
}

In [None]:
smote_models_eval = evaluation(smote_models, sm_X_test, sm_y_test)
smote_models_eval

### 6.4 Standalone tree-based models

In [None]:
# Defining the target & predictor variables
st_X = train_data.drop(columns=["Churn"])
st_y = train_data["Churn"]

# Splitting the dataframe into train and test
st_X_train, st_X_test, st_y_train, st_y_test = train_test_split(st_X, st_y, test_size= 0.25, random_state= 24, stratify= st_y)

In [None]:
# Scaling the numeric columns
st_scaler = MinMaxScaler()
st_X_train[columns_to_scale] = st_scaler.fit_transform(st_X_train[columns_to_scale])
st_X_test[columns_to_scale] = st_scaler.transform(st_X_test[columns_to_scale])

#### 6.4.1 Decision Tree

In [None]:
# Decision Tree
dt_clf = DecisionTreeClassifier(random_state= 24)
dt_model = dt_clf.fit(st_X_train, st_y_train)

# Feature importances
dt_importance = dt_model.feature_importances_
dt_importance = pd.DataFrame(dt_importance, columns= ["score"]).reset_index()
dt_importance["Feature"] = list(st_X.columns)
dt_importance.drop(columns= ["index"], inplace=True)

dt_importance.sort_values(by= "score",
                          ascending= False,
                          ignore_index= True,
                          inplace= True)

# Plotting the feature importances
fig = px.bar(dt_importance, x= "Feature", y= "score")
fig.show()

In [None]:
# Making predictions
dt_pred = dt_model.predict(st_X_test)

# Evaluating the model
dt_report = classification_report(st_y_test, dt_pred, target_names= ["Stayed", "Churned"])
print(dt_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(st_y_test, dt_pred, beta= 2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Defining the Confusion Matrix
dt_conf_mat = confusion_matrix(st_y_test, dt_pred)
dt_conf_mat = pd.DataFrame(dt_conf_mat).reset_index(drop= True)
dt_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(dt_conf_mat,
            annot= True,
            linewidth= 1.0,
            fmt= ".0f",
            cmap= "RdPu",
            ax= ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.4.2 Random Forest

In [None]:
# Random Forests
rf_clf = RandomForestClassifier(random_state= 24)
rf_model = rf_clf.fit(st_X_train, st_y_train)

# Feature Importance of the Random Forest Model
rf_importance = rf_model.feature_importances_
rf_importance = pd.DataFrame(rf_importance, columns= ["score"]).reset_index()
rf_importance["Feature"] = list(st_X.columns)
rf_importance.drop(columns= ["index"], inplace= True)

rf_importance.sort_values(by= "score",
                          ascending= False,
                          ignore_index= True,
                          inplace= True)

# Visualizing the feature importances
fig = px.bar(rf_importance, x= "Feature", y= "score")
fig.show()

In [None]:
# Making predictions
rf_pred = rf_model.predict(st_X_test)

# Evaluating the model
rf_report = classification_report(st_y_test, rf_pred, target_names= ["Stayed", "Churned"])
print(rf_report)

In [None]:
# Calculating the F2 Score
f2_score = fbeta_score(st_y_test, rf_pred, beta= 2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Confusion Matrix
rf_conf_mat = confusion_matrix(st_y_test, rf_pred)
rf_conf_mat = (pd.DataFrame(rf_conf_mat).reset_index(drop=True)).rename(columns={0: "Stayed", 1: "Churned"})
rf_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(rf_conf_mat,
            annot= True,
            linewidth= 1.0,
            fmt= ".0f",
            cmap= "RdPu",
            ax= ax)
ax.set_xlabel("y_pred")
ax.set_ylabel("y_true")
plt.show()

#### 6.4.3 Summarizing the results from the standalone tree-based models

In [None]:
# Defining the dictionary for the results of the standalone tree-based models
standalone_tree_models = {"dt": dt_model,
                          "rf": rf_model
                         }

In [None]:
# Putting the results of the standalone trees together
standalone_tree_models_eval = evaluation(standalone_tree_models, st_X_test, st_y_test)
standalone_tree_models_eval

### 6.5 Model Selection

In [None]:
# Putting all the model summaries together for ease of selection
all_models = pd.concat([oversampled_models_eval, undersampled_models_eval,
                        smote_models_eval, standalone_tree_models_eval])

# Sorting models by F2 score, F1 score and accuracy
all_models = all_models.sort_values(by= ["f2_score", "f1_weighted", "accuracy"], ascending= False)
all_models

**Notes on Features**

Across the models, the feature importances are fairly consistent. The top 5 most important features are:
- Total Charges: increasing total charges leads to increased churn
- Two-year contracts: 2-year contracts have a negative effect on churn; that is, customers with 2-year contracts churned less
- Monthly charges: increases in monthly charges lead to increased churn
- Tenure: increasing tenure leads to lower churn
- One-year contracts: 1-year contracts have a negative effect on churn; that is, customers with 1-year contracts churned less

## 7.0 Model Optimization
### Cross-Validation and Hyperparameter tuning

From Section 6.5 above, we note that the models from oversampling and SMOTE have the highest F2 and F1 scores, with the Oversampling Random Forest model ranking highest across all 17 models followed by 3 of 6 SMOTE models in the top 5. 

The standalone tree-based models are the worst performers, leaving the models based on the undersampling method in between.

Based on the foregoing, the Random Forest will be chosen as the optimal model as it has the highest F2 score. Its F1 score is also high, implying that regardless of the weight of the precision and recall, it performs well. It will therefore be cross-validated and the hyperparameters tuned to optimize it and improve its performance.

To leave nothing to chance, the SMOTE XGBoost model (the second-best model) will also be optimized to serve as a backup.
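
The weighting of precision and recall referred to above is controlled by the `beta` parameter of the F-beta score. The sketch below computes the score by hand for hypothetical precision and recall values (not results from this project), assuming the standard definition $F_\beta = (1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$.

```python
def f_beta(precision, recall, beta):
    """F-beta score: a weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical model with high precision but lower recall
precision, recall = 0.9, 0.6

scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"beta={beta}: {score:.4f}")
```

With `beta=2` (the F2 score), recall counts more than precision, so a model that misses churners is penalized more heavily; with `beta=0.5`, precision counts more. In `fbeta_score`, the F2 score therefore corresponds to `beta=2`.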

### 7.1 Oversampling Random Forest

For the best model, I will try K-fold cross-validation with different folds and different estimators (trees) to find the best number of estimators to use in getting the best version of the model.

A range of 3 different folds and 10 estimators will be used for this assessment.
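
For readers new to K-fold cross-validation, the sketch below shows how the row indices are partitioned so that each fold serves as the validation set exactly once. It is a simplified illustration with a hypothetical helper, `k_fold_indices`, and it omits the shuffling and stratification that scikit-learn can apply.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so that sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

The model is trained once per fold and the scores are averaged, so a larger number of folds means more fits on slightly larger training subsets.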

**Pasting the model here for ease of access**

*Oversampled Random Forest*
```python
rf_clf = RandomForestClassifier(random_state= 24)
os_rf_model = rf_clf.fit(os_X_train, os_y_train)
```

#### 7.1.1 K-Fold Cross-Validation

In [None]:
# # Defining the numbers of folds to use for cross-validation
# cv = list(range(10, 21, 5))

# # Using a loop to cross-validate with each number of folds
# for c in cv:
#     score = cross_val_score(estimator= os_rf_model,
#                             X= os_X_train,
#                             y= os_y_train,
#                             cv= c
#                            ).mean()
#     print(f"The average score after cross-validation for the model at {c} folds is:", "{0:.5}".format(score))

From the above, we note that the model generally performs well with an increasing number of estimators, although there is no clear pattern because results vary with the number of folds used for cross-validation.

As such, the number of estimators will be left open and tuned along with other hyperparameters to find the best version of the model. For this, RandomizedSearchCV and GridSearchCV will be used.

#### 7.1.2 RandomizedSearch Cross-Validation

Here, the random grid is defined by specifying some options for some hyperparameters for the model.

In [None]:
# # Defining the values and instantiating the grid to be used in the RandomizedSearch
# n_estimators = list(range(10, 1001, 50))
# random_grid = {"n_estimators": n_estimators,
#                "max_depth": [1, 5, 10, 20, 50, 75, 100, 150, 200, 300],
#                "bootstrap": [True, False],
#                "criterion": ["gini", "entropy"],
#                "max_features": ["sqrt", "log2", None],
#                "random_state": [24]
#               }
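
To give a sense of scale, the sketch below counts every combination in a grid with the same options as the one above and then draws 30 of them at random without replacement, which is roughly what a randomized search with `n_iter= 30` does when all options are given as lists. The sampled combinations here are illustrative and will not match the ones `RandomizedSearchCV` actually draws.

```python
import itertools
import random

# Same options as the random grid above, written out for illustration
grid = {
    "n_estimators": list(range(10, 1001, 50)),
    "max_depth": [1, 5, 10, 20, 50, 75, 100, 150, 200, 300],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", None],
    "random_state": [24],
}

# Every possible combination of hyperparameter values
all_combinations = list(itertools.product(*grid.values()))
print("Combinations in the full grid:", len(all_combinations))

# Draw 30 candidates without replacement
rng = random.Random(24)
sampled = rng.sample(all_combinations, 30)
candidates = [dict(zip(grid.keys(), combo)) for combo in sampled]

n_folds = 10
print("Model fits for the randomized search:", len(candidates) * n_folds)
print("Model fits for an exhaustive search:", len(all_combinations) * n_folds)
```

Sampling keeps the number of fits manageable, at the cost of possibly missing the best combination; this is why a narrower exhaustive grid search is a natural follow-up.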

In [None]:
# # Running the RandomizedSearch Cross-Validation with the grid
# rf_rscv = RandomizedSearchCV(estimator= os_rf_model,
#                              param_distributions= random_grid,
#                              n_iter= 30,
#                              cv= 10,
#                              random_state= 24,
#                              n_jobs= -1)

# # Fitting the model to the training data
# rf_rscv.fit(os_X_train, os_y_train)

In [None]:
# # Looking at the best combination of hyperparameters for the model
# best_params = rf_rscv.best_params_

# print("The best combination of hyperparameters for the model will be:")

# for param_name in sorted(best_params.keys()):
#     print(f"{param_name}: {best_params[param_name]}")

In [None]:
# # Looking at the best score for the model during cross-validation
# print("The mean cross-validated score of the model's best combination of hyperparameters is:",
#       "{0:.5}".format(rf_rscv.best_score_))

With the RandomizedSearchCV, we note a significant improvement in the score of the model. As such, we will build an "optimized" version of the model using the recommended parameters from above and assess it.

In [None]:
# Defining an optimized version of the model with the best parameters
best_rf_model_rscv = RandomForestClassifier(bootstrap= False,
                                            criterion= "gini",
                                            max_depth= 20,
                                            max_features= "log2",
                                            n_estimators= 510,
                                            random_state= 24
                                            )

In [None]:
# Fitting the model to the training data
best_rf_model_rscv = best_rf_model_rscv.fit(os_X_train, os_y_train)

# Predicting the test data
best_rf_pred = best_rf_model_rscv.predict(os_X_test)

In [None]:
# Evaluating the model
best_rf_report = classification_report(os_y_test, best_rf_pred, target_names= ["Stayed", "Churned"])
print(best_rf_report)

# Calculating the accuracy score
accuracy = accuracy_score(os_y_test, best_rf_pred)
accuracy = "{:.5f}".format(accuracy)
print("Accuracy:", accuracy)

# Calculating the F2 Score
f2_score = fbeta_score(os_y_test, best_rf_pred, beta= 2)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Confusion Matrix
best_rf_conf_mat = confusion_matrix(os_y_test, best_rf_pred)
best_rf_conf_mat = (pd.DataFrame(best_rf_conf_mat).reset_index(drop= True)).rename(columns= {0: "Stayed", 1: "Churned"})
best_rf_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_rf_conf_mat,
            annot= True,
            linewidth= 1.0,
            fmt= ".0f",
            cmap= "RdPu",
            ax= ax)
plt.show()

From the scores and confusion matrix above, we see that this is the best version of the model obtained so far. With a mean cross-validated score of about 89%, an F1 score of about 90%, and an F2 score of about 88%, we are reasonably confident that this model will identify which customers are likely to churn and enable Vodafone to take active steps to improve customer retention.

#### 7.1.3 GridSearch Cross-Validation

In [None]:
# # Defining the parameter grid for the GridSearchCV (chosen with reference to the best estimators from the RandomizedSearchCV)
# gscv_param_grid = {"n_estimators": [100, 200, 300, 400, 500, 600],
#                    "max_features": ["sqrt", "log2"],
#                    "max_depth": [10, 20, 40, 80, 160],
#                    "criterion": ["gini"], 
#                    "random_state": [24], 
#                    "bootstrap": [False]
#                   }

In [None]:
# # Executing GridSearchCV
# rf_gscv = GridSearchCV(estimator= os_rf_model,
#                        param_grid= gscv_param_grid,
#                        n_jobs= -1,
#                        cv= 10)

# # Fitting the model to the training data
# rf_gscv.fit(os_X_train, os_y_train)

In [None]:
# # Printing the best combination of hyperparameters for the model
# best_params = rf_gscv.best_params_
# print("The best combination of hyperparameters for the model will be:")

# for param_name in sorted(best_params.keys()):
#     print(f"{param_name}: {best_params[param_name]}")

In [None]:
# # Looking at the best score for the model during cross-validation
# print("The mean cross-validated score of the model's best combination of hyperparameters is:",
#       "{0:.5}".format(rf_gscv.best_score_))

In [None]:
# # Building a new model with the best parameters
# gscv_best_rf = RandomForestClassifier(random_state= 24,
#                                       bootstrap= False,
#                                       max_features= "sqrt",
#                                       n_estimators= 500,
#                                       max_depth= 20,
#                                       criterion= "gini"
#                                       )

In [None]:
# # Fitting the optimized model to the training data
# gscv_best_rf.fit(os_X_train, os_y_train)

# # Predicting the test data
# gscv_rf_pred = gscv_best_rf.predict(os_X_test)

In [None]:
# # Evaluating the model
# gscv_rf_report = classification_report(os_y_test, gscv_rf_pred, target_names= ["Stayed", "Churned"])
# print(gscv_rf_report)

# # Calculating the accuracy score
# accuracy = accuracy_score(os_y_test, gscv_rf_pred)
# accuracy = "{:.5f}".format(accuracy)
# print("Accuracy:", accuracy)

# # Calculating the F2 Score
# f2_score = fbeta_score(os_y_test, gscv_rf_pred, beta= 2)
# f2_score = "{:.5f}".format(f2_score)
# print("F2 Score:", f2_score)

In [None]:
# # Confusion Matrix
# gscv_rf_conf_mat = confusion_matrix(os_y_test, gscv_rf_pred)
# gscv_rf_conf_mat = (pd.DataFrame(gscv_rf_conf_mat).reset_index(drop= True)).rename(columns= {0: "Stayed", 1: "Churned"})
# gscv_rf_conf_mat

In [None]:
# # Visualizing the Confusion Matrix
# f, ax = plt.subplots()
# sns.heatmap(gscv_rf_conf_mat,
#             annot= True,
#             linewidth= 1.0,
#             fmt= ".0f",
#             cmap= "RdPu",
#             ax= ax
#            )
# plt.show()

### 7.2 SMOTE XGBoost

As indicated earlier, the second-best performing model will also be optimized so that, in the end, there are at least two optimized models from which to choose. The XGBoost model will therefore be optimized with K-Fold Cross-Validation and/or RandomizedSearchCV.

*Pasting the model here for ease of access*

**SMOTE XGBoost Classifier**
```python
xgb_clf = XGBClassifier(random_state= 24)
sm_xgb_model = xgb_clf.fit(sm_X_train, sm_y_train)
```

#### 7.2.1 K-Fold Cross-Validation

As was done with the Random Forest model, the XGBoost Classifier is cross-validated using K-Fold Cross-Validation with 3 different k-values and 10 different estimators.

In [None]:
# # Defining the numbers of folds for cross-validation and the range of estimators
# cv = list(range(10, 21, 5))
# n_estimators = list(range(10, 101, 10))

# # Cross-validating at each number of folds for each number of estimators
# for c in cv:
#     print(f"The model's average scores after cross-validation at {c} folds are:")
#     for n in n_estimators:
#         score = cross_val_score(estimator= XGBClassifier(n_estimators= n, random_state= 24),
#                                 X= sm_X_train, y= sm_y_train,
#                                 cv= c
#                                 ).mean()
#         print(f"score_{c}_folds_{n}_estimators:", "{0:.5}".format(score))

From the results above, we note that performance at each number of folds is best at 100 estimators, with the mean cross-validation score generally increasing with the number of estimators.

Since it is the only hyperparameter we tuned here, we may use these findings to inform further tuning and model optimization.
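
To make the mechanics of K-Fold Cross-Validation concrete, the sketch below splits sample indices into k folds and averages a score over them. It is a minimal illustration in plain Python, not the notebook's implementation: `k_fold_indices`, `cross_validate`, and the toy `score_fn` are hypothetical names. Note also that `cross_val_score` uses stratified folds for classifiers when `cv` is an integer, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    # The first n_samples % k folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, val_idx))
        start += size
    return folds


def cross_validate(score_fn, n_samples, k):
    """Average a (hypothetical) score_fn(train_idx, val_idx) over the k folds."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Toy usage: the "score" is just the validation fraction, to show the mechanics
folds = k_fold_indices(n_samples=10, k=3)
mean_score = cross_validate(lambda train, val: len(val) / 10, n_samples=10, k=3)
print([len(val) for _, val in folds], round(mean_score, 5))
```

Each sample lands in exactly one validation fold, so every observation is used for both training and validation across the k iterations.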

#### 7.2.2 RandomizedSearch Cross-Validation

In [None]:
# # Defining the values for the RandomizedSearchCV
# random_grid = {"colsample_bytree": [0.1, 0.3, 0.5, 0.7],
#                "learning_rate": [0.1, 0.3, 0.5, 0.7, 1.0],
#                "max_depth": [5, 10, 15, 20, 25, 30, 35],
#                "booster": ["gbtree", "gblinear", "dart"],
#                "n_estimators": [5, 10, 20, 50, 80, 100]
#                }
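
To see how much of this grid a randomized search actually explores, the sketch below enumerates every combination and draws 30 of them at random. It uses only the standard library and a hypothetical seed, so it will not reproduce the exact candidates that `RandomizedSearchCV` draws; it only illustrates the idea and the arithmetic.

```python
import itertools
import random

# Same candidate values as the grid above
random_grid = {"colsample_bytree": [0.1, 0.3, 0.5, 0.7],
               "learning_rate": [0.1, 0.3, 0.5, 0.7, 1.0],
               "max_depth": [5, 10, 15, 20, 25, 30, 35],
               "booster": ["gbtree", "gblinear", "dart"],
               "n_estimators": [5, 10, 20, 50, 80, 100]}

# Every possible combination (what an exhaustive grid search would evaluate)
all_combos = [dict(zip(random_grid, values))
              for values in itertools.product(*random_grid.values())]

# A randomized search evaluates only a fixed number of them, without replacement
rng = random.Random(24)  # illustrative seed, not scikit-learn's random state
sampled = rng.sample(all_combos, k=30)

coverage = len(sampled) / len(all_combos)
print(len(all_combos), len(sampled), f"{coverage:.2%}")
```

With 30 iterations and 10 folds, the search fits 300 models instead of the 25,200 an exhaustive grid search over the same values would need.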

In [None]:
# # Running the RandomizedSearch Cross-Validation with the above set of Parameters
# xgb_rs_cv_model = RandomizedSearchCV(estimator= sm_xgb_model,
#                                      param_distributions= random_grid,
#                                      n_iter= 30,
#                                      cv= 10,
#                                      random_state= 24,
#                                      n_jobs= -1)

# # Fitting the model to the training data
# xgb_rs_cv_model.fit(sm_X_train, sm_y_train)

In [None]:
# # Looking at the best combination of hyperparameters for the model
# best_params = xgb_rs_cv_model.best_params_
# print("The best combination of hyperparameters for the model will be:")
# for param_name in sorted(best_params.keys()):
#     print(f"{param_name} : {best_params[param_name]}")

In [None]:
# # Looking at the best score for the model during cross-validation
# print("The model's cross-validated score with the best combination of hyperparameters is:", 
#       "{0:.5}".format(xgb_rs_cv_model.best_score_))

In [None]:
# Defining the best version of the model with the best parameters
best_xgb_model = XGBClassifier(random_state= 24,
                               booster= "dart",
                               colsample_bytree= 0.5,
                               learning_rate= 0.5,
                               max_depth= 20,
                               n_estimators= 80
                               )

In [None]:
# Fitting the model to the training data
best_xgb_model = best_xgb_model.fit(sm_X_train, sm_y_train)

# Predicting the test data
best_xgb_pred = best_xgb_model.predict(sm_X_test)

In [None]:
# Evaluating the model
best_xgb_report = classification_report(sm_y_test, best_xgb_pred, target_names= ["Stayed", "Churned"])
print(best_xgb_report)

# Calculating the accuracy score
accuracy = accuracy_score(sm_y_test, best_xgb_pred)
accuracy = "{:.5f}".format(accuracy)
print("Accuracy:", accuracy)

# Calculating the F-beta score (beta= 0.5 favours precision; a true F2 score would use beta= 2)
f2_score = fbeta_score(sm_y_test, best_xgb_pred, beta= 0.5)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Confusion Matrix
best_xgb_conf_mat = confusion_matrix(sm_y_test, best_xgb_pred)
best_xgb_conf_mat = (pd.DataFrame(best_xgb_conf_mat).reset_index(drop= True)).rename(columns={0: "Stayed", 1: "Churned"})
best_xgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_xgb_conf_mat,
            annot= True,
            linewidth= 1.0,
            fmt= ".0f",
            cmap= "RdPu",
            ax= ax)
plt.show()

When the confusion matrix after optimization is compared with the confusion matrix of the original model in Section 6.3.4, we note no difference in the model's performance. We therefore go with the best model (from Section 7.1 above).
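
A comparison like the one above can also be made numerically by deriving the headline metrics from each confusion matrix. The sketch below assumes the scikit-learn layout (rows are actual classes, columns are predicted classes, with 0 = Stayed and 1 = Churned). The counts are made up purely for illustration and are not results from this notebook; `metrics_from_confusion` is a hypothetical helper.

```python
def metrics_from_confusion(matrix):
    """matrix[i][j] = number of samples with actual class i and predicted class j."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {"accuracy": (tn + tp) / total,
            "precision": tp / (tp + fp),  # of predicted churners, how many churned
            "recall": tp / (tp + fn)}     # of actual churners, how many were caught


# Made-up counts for illustration only
original = [[80, 20], [10, 90]]
optimized = [[80, 20], [10, 90]]

original_metrics = metrics_from_confusion(original)
optimized_metrics = metrics_from_confusion(optimized)
print(original_metrics == optimized_metrics)
```

Identical matrices necessarily give identical metrics, which is the situation described above.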

## 8.0 Future Prediction

### 8.1 Random Forest

As concluded in Section 7 above, the model selected for the predictions is the Random Forest trained with the oversampling method. As such, the data on which predictions are to be made will also be oversampled.

In [None]:
# Displaying the selected model and its parameters
best_rf_model_rscv

In [None]:
# Copying the test data and counting the observations in each class
final_test_data = test_data.copy()
count_not_churned, count_churned = final_test_data["Churn"].value_counts()
count_not_churned, count_churned

In [None]:
# Filtering the observations for the 2 classes
test_not_churned = final_test_data[final_test_data["Churn"] == 0]
test_churned = final_test_data[final_test_data["Churn"] == 1]

In [None]:
# Oversampling the churned class and combining with the original dataframe
test_churn_oversampled = test_churned.sample(count_not_churned, replace= True)
test_oversampled = pd.concat([test_not_churned, test_churn_oversampled])

print("Random over-sampling:")
print(test_oversampled["Churn"].value_counts())
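
The random over-sampling step above can be sketched without pandas to show what happens to the rows. This is a minimal illustration under assumptions: `random_oversample` and the toy rows are hypothetical, and the seed is arbitrary. As in the cell above, the minority class is resampled with replacement up to the size of the majority class.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=24):
    """Resample the minority class with replacement to match the majority class size."""
    counts = Counter(row[label_key] for row in rows)
    (majority, n_major), (minority, _) = counts.most_common(2)
    majority_rows = [row for row in rows if row[label_key] == majority]
    minority_rows = [row for row in rows if row[label_key] == minority]
    rng = random.Random(seed)
    resampled = rng.choices(minority_rows, k=n_major)  # sampling with replacement
    return majority_rows + resampled


# Toy data: 8 customers who stayed, 2 who churned
toy_rows = ([{"Churn": 0, "id": i} for i in range(8)]
            + [{"Churn": 1, "id": i} for i in range(8, 10)])
balanced = random_oversample(toy_rows, "Churn")
balanced_counts = Counter(row["Churn"] for row in balanced)
print(balanced_counts)
```

Because the extra minority rows are exact duplicates, no new information is added; only the class balance changes.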

In [None]:
# Defining the target & predictor variables
test_X = test_oversampled.drop(columns= ["Churn"])
test_y = test_oversampled["Churn"]

In [None]:
# Scaling the test data columns
test_X[columns_to_scale] = os_scaler.transform(test_X[columns_to_scale])
test_X

In [None]:
# Predicting the test data
best_rf_pred = best_rf_model_rscv.predict(test_X)

# Evaluating the model
best_rf_report = classification_report(test_y, best_rf_pred, target_names= ["Stayed", "Churned"])
print(best_rf_report)

# Calculating the F-beta score (beta= 0.5 favours precision; a true F2 score would use beta= 2)
f2_score = fbeta_score(test_y, best_rf_pred, beta= 0.5)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Confusion Matrix
best_rf_conf_mat = confusion_matrix(test_y, best_rf_pred)
best_rf_conf_mat = (pd.DataFrame(best_rf_conf_mat).reset_index(drop= True)).rename(columns= {0: "Stayed", 1: "Churned"})
best_rf_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(best_rf_conf_mat,
            annot= True,
            linewidth= 1.0,
            fmt= ".0f",
            cmap= "RdPu",
            ax= ax
           )
plt.show()

### 8.2 XGBoost

In [None]:
# A look at the parameters of the final XGB model
best_xgb_model

In [None]:
# Creating a copy of the test dataframe for SMOTE
smote_test = test_data.copy()

sm_X = smote_test.drop(columns= ["Churn"])
sm_y = smote_test["Churn"]

# Resampling the dataframe using SMOTE
smote = SMOTE(sampling_strategy= "minority")
smote_X, smote_y = smote.fit_resample(sm_X, sm_y)
smote_y.value_counts()
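
Unlike random over-sampling, SMOTE creates new minority samples by interpolating between existing ones. The sketch below shows only that interpolation step for a single pair of points. It omits the k-nearest-neighbour search that the `SMOTE` class performs, and `smote_like_sample` and the toy points are hypothetical.

```python
import random


def smote_like_sample(point, neighbour, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]


rng = random.Random(24)  # illustrative seed
a, b = [1.0, 10.0], [3.0, 20.0]  # two made-up minority-class samples
synthetic = smote_like_sample(a, b, rng)
print(synthetic)
```

The synthetic point always lies between the two originals on every feature, which is why SMOTE tends to add variety rather than exact duplicates.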

In [None]:
# Scaling the numeric columns
smote_X[columns_to_scale] = sm_scaler.transform(smote_X[columns_to_scale])
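
Calling `transform` (rather than `fit_transform`) reuses the statistics learned from the training data, so the test data is scaled on the same footing. The sketch below illustrates this with a tiny stand-in scaler. It assumes standard scaling (subtract the mean, divide by the standard deviation); the scaler type used earlier in the notebook is not shown in this section, and `TinyStandardScaler` is a hypothetical class.

```python
from statistics import mean, pstdev


class TinyStandardScaler:
    """Minimal stand-in for a standard scaler on a single numeric column."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # population std; guard against zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


train_values = [10.0, 20.0, 30.0]  # made-up training column
test_values = [20.0, 40.0]         # made-up test column

scaler = TinyStandardScaler().fit(train_values)  # statistics come from training data only
scaled_test = scaler.transform(test_values)      # test data reuses those statistics
print(scaled_test)
```

Refitting the scaler on the test data would shift the features relative to what the model saw during training.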

In [None]:
# Predicting the test data
final_xgb_pred = best_xgb_model.predict(smote_X)

# Evaluating the model
final_xgb_report = classification_report(smote_y, final_xgb_pred, target_names= ["Stayed", "Churned"])
print(final_xgb_report)

# Calculating the F-beta score (beta= 0.5 favours precision; a true F2 score would use beta= 2)
f2_score = fbeta_score(smote_y, final_xgb_pred, beta= 0.5)
f2_score = "{:.5f}".format(f2_score)
print("F2 Score:", f2_score)

In [None]:
# Confusion Matrix
final_xgb_conf_mat = confusion_matrix(smote_y, final_xgb_pred)
final_xgb_conf_mat = (pd.DataFrame(final_xgb_conf_mat).reset_index(drop= True)).rename(columns= {0: "Stayed", 1: "Churned"})
final_xgb_conf_mat

In [None]:
# Visualizing the Confusion Matrix
f, ax = plt.subplots()
sns.heatmap(final_xgb_conf_mat,
            annot= True,
            linewidth= 1.0,
            fmt= ".0f",
            cmap= "RdPu",
            ax= ax
           )
plt.show()

## 9.0 Conclusion

### 9.1 Summary of Key Insights and Recommendations

1. Most of the churned customers had tenures within 1 - 29 months, and customers who passed the 29-month mark generally churned less.
2. Customers generally churned more once they crossed the USD 56.15 monthly charge mark, with the majority falling between USD 56.15 (lower quartile) and USD 79.65 (median monthly charge of churned customers).
3. Churn levels spiked most at a monthly charge of USD 64.45.
4. Most of the customers who churned had total charges between USD 134.46 and USD 2,332.30. This is surprising and may be investigated further, as it indicates that total charges may not be the sole reason for churn.
5. Males are almost just as likely to churn as females. Hence, gender - like total charges - may not be a sole determinant for assessing the likelihood of churn
6. Customers without partners were about 67% more likely to churn than those with partners.
7. Fibre-optic service users were over twice as likely to churn as DSL users (41.89% vs. 19.00%). Given that, ideally, fibre-optic is supposed to be faster (and better) than DSL, the churn proportion there was particularly worrying. Vodafone may want to investigate the reasons for high churn among these customers. It may also review its fibre-optic internet services and reach out to customers for more information.
8. Over 63% of internet service users did not use online security services. Since customers in this group had about a 42% likelihood of churning, Vodafone may consider increased promotion of its online security services, as users of online security services did not churn much.
9. Less than 29% of customers used the Tech Support services and were 41.65% likely to churn. Given that tech support is critical to tech service delivery, Vodafone may want to bundle tech support offerings with other services to ensure that customers receive support as and when needed, and churn is reduced.
10. Internet service users who streamed (either TV or movies) were just as likely to churn as those who did not. This is concerning and raises questions about the streaming services offered by Vodafone. The company may want to evaluate its streaming services and engage with customers to improve streaming service delivery and reduce churn among streamers. Improving this should make customers who sign up for streaming services less likely to churn.
11. The churn proportion for electronic checks (45.29%) is concerning, and should be investigated and improved to ensure convenience and ease of use for customers.
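
Statements such as "over twice as likely" and "67% more likely" compare churn rates as a ratio. The sketch below works through that arithmetic using the two fibre-optic and DSL rates quoted in item 7; `relative_likelihood` is a hypothetical helper.

```python
def relative_likelihood(rate_a, rate_b):
    """Return how many times as likely group A is to churn compared with group B."""
    return rate_a / rate_b


fibre_churn, dsl_churn = 0.4189, 0.1900  # rates quoted in item 7
ratio = relative_likelihood(fibre_churn, dsl_churn)
percent_more = (ratio - 1) * 100  # "X% more likely" form of the same comparison
print(f"{ratio:.2f}x as likely, i.e. {percent_more:.0f}% more likely")
```

By the same arithmetic, "about 67% more likely" in item 6 corresponds to a ratio of roughly 1.67 between the two groups' churn rates.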

### 9.2 Conclusion

Per their confusion matrices and F2 scores, the XGBoost model (0.79 F2-score) generalizes and performs better on unseen data than the Random Forest model (0.69 F2-score). The XGBoost model is therefore recommended for further optimization and deployment.
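
Since the comparison rests on F-beta scores, it helps to see how `beta` changes the result. The sketch below implements the standard F-beta formula and evaluates it on made-up precision and recall values (not results from this notebook). The cells above call `fbeta_score` with `beta= 0.5`, which weights precision more heavily than recall, whereas `beta= 2` weights recall more heavily.

```python
def f_beta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up values for illustration only
precision, recall = 0.9, 0.6

f_half = f_beta(precision, recall, beta=0.5)  # favours precision
f_one = f_beta(precision, recall, beta=1.0)   # balanced (the usual F1)
f_two = f_beta(precision, recall, beta=2.0)   # favours recall
print(round(f_half, 4), round(f_one, 4), round(f_two, 4))
```

For a churn model, where missing a churner is usually costlier than a false alarm, the recall-weighted variant is often the more relevant one to report.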

## 10.0 Exporting

In [None]:
# Exporting the requirements
requirements = "\n".join(f"{m.__name__}=={m.__version__}" for m in globals().values() if getattr(m, "__version__", None))

with open("requirements.txt", "w") as f:
    f.write(requirements)

In [None]:
# Creating a dictionary of objects to export
exports = {"encoder": oh_encoder,
           "scaler": sm_scaler,
           "model": best_xgb_model}

In [None]:
# Exporting the dictionary with Pickle
with open("Gradio_App_toolkit", "wb") as file:
    pickle.dump(exports, file)
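
For reference, the exported toolkit can be read back with `pickle.load` and unpacked by key. The sketch below round-trips a dictionary with the same keys through an in-memory buffer. The values are simple placeholders rather than the real encoder, scaler, and model, so that the example runs on its own.

```python
import io
import pickle

# Placeholder objects standing in for the fitted encoder, scaler, and model
exports = {"encoder": {"categories": ["A", "B"]},
           "scaler": {"mean": 0.0},
           "model": {"n_estimators": 80}}

# Writing to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
pickle.dump(exports, buffer)

# Reading the dictionary back and unpacking its components
buffer.seek(0)
toolkit = pickle.load(buffer)
encoder, scaler, model = toolkit["encoder"], toolkit["scaler"], toolkit["model"]
print(sorted(toolkit))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.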

In [None]:
# Exporting the model
best_xgb_model.save_model("xgb_model.json")