# Importing the required Libraries

In [1]:
# Importing libraries used for Data manipulation and analysis
import pandas as pd
import numpy as np

# Import modules from scikit-learn, XGBoost, and imbalanced-learn for machine learning tasks.
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import precision_score, confusion_matrix, f1_score
from xgboost import XGBClassifier, XGBRFClassifier
from imblearn.over_sampling import SMOTE

# Import necessary module for saving model
import joblib

# Importing the dataset into a dataframe

In [2]:
Churn = pd.read_csv("./data/Cleaned_Churn_Modelling.csv")

# Making a copy of the dataset
churn_copy = Churn.copy()

# Data Preprocessing
Now that we have imported our cleaned dataset, let's take a brief look at it.

In [3]:
churn_copy.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,High_Payer
0,619,France,Female,42,2,0.0,1,1,0,101348.88,1,0
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,1


Now that everything looks nice, let's proceed to the data preprocessing phase. During this stage, we will transform the data into a format that can be fully utilized by our model.

**Step 1: Handling Categorical Columns**<br>
We'll begin by addressing our categorical columns and converting them into dummy variables.

In [4]:
categorical_columns = Churn.select_dtypes("object").columns
for column in categorical_columns:
  churn_copy = churn_copy.join(pd.get_dummies(churn_copy[column])).drop(columns=column)
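
As a side note, the same encoding can be done in a single call. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the notebook's data) and passes the `columns` argument of `pd.get_dummies`. Unlike the loop above, this variant prefixes each dummy column with its source column name.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the categorical columns of the dataset
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Male"],
})

# Encode both categorical columns at once; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)
print(encoded.columns.tolist())
```

The prefixed names (for example `Geography_France`) make it easier to trace a dummy column back to its source, which can help when reading a feature importance table later.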

Let's now examine our dataset to verify whether all the changes have been implemented correctly.

In [5]:
churn_copy.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,High_Payer,France,Germany,Spain,Female,Male
0,619,42,2,0.0,1,1,0,101348.88,1,0,1,0,0,1,0
1,608,41,1,83807.86,1,0,1,112542.58,0,1,0,0,1,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,1,1,0,0,1,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,1,0,0,1,0
4,850,43,2,125510.82,1,1,1,79084.1,0,1,0,0,1,1,0


Upon further examination, I observed that some columns, such as `NumOfProducts` and `Tenure`, are categorical in nature even though they are stored as numerical data types. We therefore convert them into dummy variables as well.

In [6]:
churn_copy = churn_copy.join(pd.get_dummies(churn_copy['NumOfProducts'], prefix="Products_Bought"))\
.drop(columns="NumOfProducts")

churn_copy = churn_copy.join(pd.get_dummies(churn_copy['Tenure'], prefix="Tenure")).drop(columns="Tenure")

Let's examine our dataset again to verify whether all the changes have been implemented correctly.

In [7]:
churn_copy.head()

Unnamed: 0,CreditScore,Age,Balance,HasCrCard,IsActiveMember,EstimatedSalary,Exited,High_Payer,France,Germany,...,Tenure_1,Tenure_2,Tenure_3,Tenure_4,Tenure_5,Tenure_6,Tenure_7,Tenure_8,Tenure_9,Tenure_10
0,619,42,0.0,1,0,101348.88,1,0,1,0,...,0,1,0,0,0,0,0,0,0,0
1,608,41,83807.86,0,1,112542.58,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,502,42,159660.8,1,0,113931.57,1,1,1,0,...,0,0,0,0,0,0,0,1,0,0
3,699,39,0.0,0,0,93826.63,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
4,850,43,125510.82,1,1,79084.1,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0


Now that all our changes have been implemented, let's move on to the next section.

**Step 2: Scaling of Numerical Columns**<br>
Scaling is a crucial data transformation technique used to ensure that numerical features are on the same scale, preventing some features from dominating others during machine learning model training.<br>
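We will use standardization (Z-score normalization), which rescales each feature to zero mean and unit variance. Note that tree-based models such as XGBoost are largely insensitive to feature scale, so this step is optional for the models used below.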

In [8]:
SS_Norm = preprocessing.StandardScaler()

# Standardize the three continuous columns in one call; each column is scaled independently
numerics = ['Balance', 'EstimatedSalary', 'CreditScore']
churn_copy[numerics] = SS_Norm.fit_transform(churn_copy[numerics])
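
To make the transformation concrete, here is a minimal sketch of the Z-score formula, z = (x - mean) / std, computed by hand with NumPy. It uses only the five `Balance` values visible in the preview above, so the results will not match the notebook output, which is computed over the full column. `StandardScaler` uses the population standard deviation, which is also the default for `np.std`.

```python
import numpy as np

# The five Balance values shown in the head() preview (illustrative subset only)
balances = np.array([0.0, 83807.86, 159660.80, 0.0, 125510.82])

# Z-score: subtract the mean, divide by the population standard deviation (ddof=0)
z = (balances - balances.mean()) / balances.std()
print(z.round(3))
```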

Let's check if the changes have been implemented.

In [9]:
churn_copy.head()

Unnamed: 0,CreditScore,Age,Balance,HasCrCard,IsActiveMember,EstimatedSalary,Exited,High_Payer,France,Germany,...,Tenure_1,Tenure_2,Tenure_3,Tenure_4,Tenure_5,Tenure_6,Tenure_7,Tenure_8,Tenure_9,Tenure_10
0,-0.326221,42,-1.225848,1,0,0.021886,1,0,1,0,...,0,1,0,0,0,0,0,0,0,0
1,-0.440036,41,0.11735,0,1,0.216534,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,-1.536794,42,1.333053,1,0,0.240687,1,1,1,0,...,0,0,0,0,0,0,0,1,0,0
3,0.501521,39,-1.225848,0,0,-0.108918,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
4,2.063884,43,0.785728,1,1,-0.365276,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0


**Step 3: Data Splitting**<br>
Great! The changes have been successfully implemented, and now it's time to split the dataset into two distinct sets: the training dataset and the testing dataset. Data splitting is a crucial step in machine learning as it allows us to assess the model's performance on unseen data.<br>
In the next steps, we'll split the dataset into these two subsets, ensuring that both the training and testing data are representative and properly randomized to avoid bias in our model evaluation.


In [10]:
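# Flag customers aged 50 or older (1) versus under 50 (0).
# Note: despite the column name, `young_person` is 1 for the older group.
churn_copy["young_person"] = (churn_copy.Age >= 50).astype(int)
X = churn_copy.drop(columns=["Exited"])
y = churn_copy.Exited.values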
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,
                                                                    y,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )
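
The split above does not pass `stratify`, so the churn rate in the two subsets can differ slightly by chance. Below is a minimal sketch, on made-up labels rather than the notebook's data, of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both subsets. The names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```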

**Step 4: Model Selection and Training**<br>
XGBoost was selected because it provides options for handling class imbalance, such as weighting the positive class in the loss function through the `scale_pos_weight` parameter. These options are not enabled by default, and the models below do not set them.
Model training is an iterative process, so we will try more than one configuration.
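
As an illustration of the class weighting mentioned above, the sketch below computes the ratio of negative to positive samples, which is a commonly suggested starting value for XGBoost's `scale_pos_weight` parameter. The labels are made up (`y_toy` is hypothetical), and the final call is shown only as a comment because this notebook does not use it.

```python
import numpy as np

# Hypothetical labels: 8 customers who stayed (0) and 2 who churned (1)
y_toy = np.array([0] * 8 + [1] * 2)

n_negative = int((y_toy == 0).sum())
n_positive = int((y_toy == 1).sum())

# Ratio of negatives to positives, a common starting point for scale_pos_weight
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)

# Possible usage (an assumption, not part of the original notebook):
# model = XGBClassifier(scale_pos_weight=scale_pos_weight)
```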

Let's first create a helper function to keep the model-building process tidy.

In [11]:
def model_builder(model, X_train, X_test, y_train, y_test):
    mod = model
    mod.fit(X_train, y_train)
    pred = mod.predict(X_test)
    print("f1", f1_score(y_test, pred), "precision", precision_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    return mod

Now that that is out of the way, let's go straight to model building.

In [12]:
model_1 = model_builder(XGBRFClassifier(), X_train, X_test, y_train, y_test)

f1 0.7080394922425952 precision 0.7943037974683544
[[1542   65]
 [ 142  251]]


In [13]:
model_2 = model_builder(XGBClassifier(n_estimators=1000, max_depth=6, learning_rate=0.01),
              X_train, X_test, y_train, y_test)

f1 0.7302904564315352 precision 0.8
[[1541   66]
 [ 129  264]]


The performance of the two models can be summarized as follows:

**Model 1:**
- F1 Score: 0.7080
- Precision: 0.7943
- Confusion Matrix:
```
[[1542   65]
 [ 142  251]]
```

**Model 2:**
- F1 Score: 0.7303
- Precision: 0.8000
- Confusion Matrix:
```
[[1541   66]
 [ 129  264]]
```

In the context of binary classification, these metrics provide insights into how well each model is performing:

- **F1 Score**: The harmonic mean of precision and recall. A higher F1 score indicates a better balance between finding the positive instances (recall, also called sensitivity) and avoiding false positives (precision).

- **Precision**: The share of positive predictions that are actually positive, TP / (TP + FP). A higher precision means fewer false positives among the customers flagged as churners.

- **Confusion Matrix**: The confusion matrix provides a detailed breakdown of the model's predictions. It includes the number of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). This breakdown helps in understanding the model's performance in different aspects.
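
The metrics above can be recomputed directly from the confusion matrices. This is a minimal sketch, assuming scikit-learn's layout `[[TN, FP], [FN, TP]]`. The helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tn, fp, fn, tp):
    """Derive precision, recall and F1 from confusion-matrix counts.

    tn is accepted to mirror the matrix layout but is not needed for these metrics.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the two confusion matrices reported above
models = {"Model 1": (1542, 65, 142, 251), "Model 2": (1541, 66, 129, 264)}
for name, counts in models.items():
    p, r, f = precision_recall_f1(*counts)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

The same counts also give the recall, which the notebook does not print: about 0.6387 for Model 1 and 0.6718 for Model 2.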

Based on the provided metrics and confusion matrices:

- Model 2 has a slightly higher F1 score (0.7303 vs. 0.7080) and precision (0.8000 vs. 0.7943) than Model 1.
- Model 2 finds more true positives (264 vs. 251) and so misses fewer churners (129 false negatives vs. 142), at the cost of one additional false positive (66 vs. 65) and one fewer true negative (1541 vs. 1542).

In conclusion, Model 2 appears to perform slightly better overall, achieving a higher F1 score and precision, which indicates a better balance between correctly classifying positive instances and minimizing false positives. However, both models still have areas that can be improved.

**Step 5: Addressing the data imbalance**

SMOTE helps address class imbalance by creating additional examples of the minority class, which allows the classifier to learn more effectively from the minority class's limited data. This can lead to better model performance and a reduction in the bias that might result from imbalanced datasets.
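
To illustrate how such synthetic examples can be created, here is a simplified NumPy sketch of the core idea: pick a minority row, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is not the imbalanced-learn implementation, and `smote_like_oversample` and `X_minority` are hypothetical names used only for this example.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows
    and one of their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between all minority rows
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)               # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour each
    gap = rng.random((n_new, 1))                                 # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class rows with two features
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

In practice, oversampling is usually applied to the training split only, so that the test split contains real observations.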

In [14]:
# Create an instance of the SMOTE class
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Apply SMOTE to the dataset to create synthetic samples
X_resampled, y_resampled = smote.fit_resample(X, y)

X_train_resamp, X_test_resamp, y_train, y_test = model_selection.train_test_split(X_resampled, y_resampled,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )


model_1 = model_builder(XGBRFClassifier(), X_train_resamp, X_test_resamp, y_train, y_test)

model_2 = model_builder(XGBClassifier(n_estimators=1000, max_depth=6, learning_rate=0.01),
                  X_train_resamp, X_test_resamp, y_train, y_test)


f1 0.8853483286472977 precision 0.8598300970873787
[[1402  231]
 [ 136 1417]]
f1 0.9205444761000317 precision 0.9053549190535491
[[1481  152]
 [  99 1454]]


In [15]:
feature_importance_df = pd.DataFrame({'Feature': churn_copy.drop(columns='Exited').columns,
                                      'Importance': model_2.feature_importances_})

# Sort the DataFrame by importance scores in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the feature importance table
print(feature_importance_df)

              Feature  Importance
4      IsActiveMember    0.571659
13  Products_Bought_2    0.122893
12  Products_Bought_1    0.033437
19           Tenure_3    0.020303
1                 Age    0.018375
24           Tenure_8    0.016165
21           Tenure_5    0.016141
7              France    0.015978
10             Female    0.015847
22           Tenure_6    0.015822
9               Spain    0.014049
23           Tenure_7    0.014005
8             Germany    0.013451
25           Tenure_9    0.013049
17           Tenure_1    0.012710
11               Male    0.012625
20           Tenure_4    0.012280
6          High_Payer    0.012228
18           Tenure_2    0.011136
26          Tenure_10    0.010675
16           Tenure_0    0.009058
2             Balance    0.006665
3           HasCrCard    0.003553
0         CreditScore    0.002826
5     EstimatedSalary    0.002794
14  Products_Bought_3    0.002277
15  Products_Bought_4    0.000000
27       young_person    0.000000


From the table above, we can observe the following insights:

1. `IsActiveMember` is by far the most important feature, with an importance of about 0.57. The number of products bought comes next: `Products_Bought_2` scores about 0.12 and `Products_Bought_1` about 0.03.

2. The columns at the bottom of the list hold minimal to no significance for our model. To ensure optimal model performance, we should consider either removing these columns entirely or engaging in feature engineering to ensure they do not hinder our model's accuracy and efficiency.

Let's remove the columns whose importance is 0.005 or lower.

In [16]:
X_mod = X_resampled[feature_importance_df.query("Importance>0.005").Feature]
X_train_resamp, X_test_resamp, y_train, y_test = model_selection.train_test_split(X_mod, y_resampled,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )


model_1 = model_builder(XGBRFClassifier(), X_train_resamp, X_test_resamp, y_train, y_test)

model_2 = model_builder(XGBClassifier(n_estimators=1000, max_depth=6, learning_rate=0.01),
                  X_train_resamp, X_test_resamp, y_train, y_test)

f1 0.8871571072319202 precision 0.8598187311178248
[[1401  232]
 [ 130 1423]]
f1 0.922102596580114 precision 0.907165109034268
[[1484  149]
 [  97 1456]]


# Conclusion
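In conclusion, our final classification model (the `XGBClassifier` trained on the reduced feature set) reached an `F1-score of 0.922` and a `precision of 0.907` on the held-out split. The confusion matrix shows a relatively low number of false positives (149) and false negatives (97), indicating that the model discriminates well between positive and negative instances. Because SMOTE was applied before the train/test split, the test set contains synthetic samples, so these figures are likely optimistic relative to performance on real, imbalanced data.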

However, while the model has achieved impressive results, there is always room for improvement. To further enhance the model's accuracy and robustness, one key avenue is to collect more data. Additional data can provide the model with a broader and more representative sample of instances, especially if the dataset is limited. This larger and more diverse dataset can help the model generalize better to unseen examples and further increase its predictive power.

In summary, while our model has achieved impressive results, the pursuit of excellence continues. Collecting more data is just one step in our journey to create a highly accurate and reliable classification model. By continually refining our approach and incorporating best practices, we aim to build a model that excels in its predictive capabilities.

# Saving of Model

In [17]:
joblib.dump(model_2, './model/Bank_Customer_Churn_Prediction_model.joblib')

['Bank_Customer_Churn_Prediction_model.joblib']