# **Bank Customer Churn Prediction Project**

- **Note:** "Feedback from phase 1 was taken into consideration as much as possible while working on this notebook."


--------------------------------------------------------------------------------
My name: Mostapha Abdulaziz

## Overview
This project focuses on predicting customer churn in the banking sector using machine learning. The workflow involves exploratory data analysis, clustering, building and comparing individual and ensemble models, handling imbalanced data, and evaluating results using robust metrics.

## Objectives
1. Understand the dataset and its features.
2. Utilize clustering for data preparation.
3. Train individual machine learning models and ensemble methods.
4. Handle class imbalance using various techniques and evaluate their impact.
5. Analyze results to identify the best-performing model and interpret its bias/variance.

## Workflow
1. **Data Exploration and Preprocessing**:
   - Handle missing values, outliers, and feature encoding.
2. **Clustering**:
   - Apply clustering algorithms for customer segmentation.
3. **Modeling**:
   - Train models like Logistic Regression, SVM, and Random Forest.
   - Build ensemble methods like Bagging, Boosting, and Stacking.
4. **Imbalanced Data Handling**:
   - Use methods like SMOTE, undersampling, and class weighting.
5. **Evaluation**:
   - Compare models using accuracy, precision, recall, F1-score, confusion matrix, and ROC-AUC.
6. **Documentation**:
   - Provide insights and conclusions in a detailed report (Submitted seperatly).

## Key Features
- **Dataset**: Bank customer data with attributes like credit score, balance, tenure, and churn status.
- **Algorithms**: Logistic Regression, SVM, Random Forest, and ensemble techniques.
- **Evaluation**: ROC curves, confusion matrix, and other metrics.

## Prerequisites
- Python.
- Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn.

## Outputs
- Model performance comparisons.
- Insights on the impact of class imbalance handling techniques.

Instructions
1. Clone the repository and navigate to the project folder:
   ```bash
   git clone https://github.com/mostapha227824/Bank_customer_churn_prediction.git
   cd Bank_customer_churn_prediction




---



---



---



---



# 1. **Describing and preprocessing the dataset**

# Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
import plotly.graph_objects as go
from plotly.subplots import make_subplots



---



# Exploration and preprocessing step

In [None]:
data_path = "/kaggle/input/bank-customer-churn-prediction/Churn_Modelling.csv"
df = pd.read_csv(data_path)
df.head(10)

# Summary of the dataset

In [None]:
print("\nDataset Info:")
print(df.info())



---



# Checking for missing values

In [None]:
print("\nMissing Values:")
print(df.isnull().sum())

**It seems that theres Only One null values in 'Geography', 'Age', HasCrCard' and 'IsActiveMember' columns.**

In [None]:
df[df.isnull().any(axis=1)]

- Since there are only 4 rows with null values in the entire dataset, I will drop these rows.

# dropping rows with null values

In [None]:
df.dropna(inplace=True)
print("\nMissing Values:")
print(df.isnull().sum())



---



# Checking for duplicates

In [None]:
print("\nDuplicate Rows:", df.duplicated().sum())

# dropping duplicated row

In [None]:
df.drop_duplicates(inplace=True)
print("\nDuplicate Rows:", df.duplicated().sum())



---



# data shape

In [None]:
df.shape

# checking for unique values of columns

In [None]:
unique_values_df = pd.DataFrame({
    'Column': df.columns,
    'Unique Values': [df[col].unique() for col in df.columns],
    'No. of Unique Values': [df[col].nunique() for col in df.columns]
})

# Feature types Classification

In [None]:
numerical_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
categorical_features = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

# Displaing the DataFrame in a well-formatted way

In [None]:
unique_values_df



---



# Detectting outliers using boxplots

In [None]:
num_cols = df.select_dtypes(include=['number']).columns

# Columns to explore
to_explore = [col for col in num_cols if col not in ['Exited']]

for column in to_explore:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identifing outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    outlier_count = len(outliers)

    fig = make_subplots(rows=1, cols=2, subplot_titles=(f'{column} - Boxplot', f'{column} - Distplot'))

    # Boxplot
    fig.add_trace(go.Box(
        y=df[column],
        name='Boxplot',
        boxmean='sd'
    ), row=1, col=1)

    # Distplot (Density Plot)
    fig.add_trace(go.Histogram(
        x=df[column],
        name='Distplot',
        histnorm='probability density',
        nbinsx=30,
        opacity=0.6,
        marker_color='skyblue'
    ), row=1, col=2)

    # Add annotations to Boxplot
    fig.add_annotation(
        x=1,  # Center of the boxplot
        y=df[column].max(),
        text=f'No Outliers: {outlier_count}',
        showarrow=False,
        arrowhead=2,
        ax=0,
        ay=-50,
        xref='x1',
        yref='y1'
    )

    # Update layout
    fig.update_layout(
        title=f'Interactive Plot for {column}',
        xaxis_title='Values',
        yaxis_title='Density',
        xaxis2_title='Values',
        yaxis2_title='Density',
        height=600,
        width=1000,
        showlegend=False
    )
    fig.show()

In [None]:
df.describe()

**Observations:**

- Based on the data description, there are no obvious outliers in the dataset's columns.

- As, the CreditScore values does not exhibit significant outliers, as they are not far from the overall distribution therefore no action is needed.

- In the NumOfProducts column, a value of 4 is not considered an outlier despite its low frequency.

- For the Age column, since the class from 70 to 100 has minimal contribution to the dataset, any age value greater than 70 will be capped at 70.

- One notable observation is the imbalance in the Exited column, which indicates the target variable. This imbalance will be further investigated and visualized to assess its impact on the model's performance and determine what resampling techniques needed.

# assignning value of 70 to any higher 'Age' value

In [None]:
df.loc[df['Age'] > 70, 'Age'] = 70

# Saving the processed

In [None]:
processed_data_path = "processed_bank_churn.csv"
df.to_csv(processed_data_path, index=False)
print(f"Processed data saved to {processed_data_path}")



---



# **Visualizations**

In [None]:
df.hist(figsize=(15, 10))
plt.show()

**Key Observations from Data Histograms:**

- Two clear observations emerged from the data histogram plots:

1. 'Balance' Column: theres many zeros which stands out as unusual. This was already noted during the review of the first 10 rows of data. i will investigate more to see why this happen and how it correlates the other features,

2. 'Exited' Column Imbalance: The target variable, Exited, shows a significant class imbalance. Around 80% of the data represents customers who stayed (1), while only 20% exited (0).

dorpping Surname column will be dropped from the dataset before modeling as it has no potentional need.

In [None]:
df.drop('Surname', axis=1, inplace=True)
print("Updated Columns:")
print(df.columns)

# countplot for 'Geography'

In [None]:
geography_labels = ['France', 'Spain', 'Germany']
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Geography', hue='Exited')
plt.xticks(ticks=[0, 1, 2], labels=geography_labels)
plt.xlabel('Geography')
plt.ylabel('Count')
plt.title('Count of Exited Customers by Geography')
plt.show()

- **The Geography countplot highlights a noticeable imbalance in the Exited column across different regions. This imbalance is particularly in France and Spain, where a significant number of customers did not exit while germany shows a slightly more balanced distribution these regional differences in customer behavior may provide valuable insights for the model**



---



# Feature Engineering

In [None]:
zero_balance_df = df[df['Balance'] == 0].groupby(['IsActiveMember', 'Exited']).size().reset_index(name='Count')
zero_balance_df

This shows a conflict. How a member to Exit and still being an active member (180 member).

So, i'll change these members to be inactive

In [None]:
# change 0 Balance and 1 Exited members IsActiveMember to 0
df.loc[(df['Balance'] == 0) & (df['Exited'] == 1) & (df['IsActiveMember'] == 1), 'IsActiveMember'] = 0

In [None]:
# group df according to NumOfProducts, HasCrCard, IsActiveMember
active_level_df = df.groupby([ 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Exited']).size()
active_level_df

It seems weired that some customers that exited the bank still recorded as active members and have credit card.

I'll change any exited member to inactive and has no credit card.

In [None]:
# change any exited member to IsActiveMember 0 and HasCrCard 0
df.loc[(df['Exited'] == 1) & (df['IsActiveMember'] == 1), 'IsActiveMember'] = 0
df.loc[(df['Exited'] == 1) & (df['HasCrCard'] == 1), 'HasCrCard'] = 0

# Feature Creation:

In [None]:
# total_active column
df['total_active'] = df['IsActiveMember'] + df['NumOfProducts'] + df['HasCrCard']

#  balance to salary
df['Balance_to_Salary'] = df['Balance'] / df['EstimatedSalary']

# tenure to age column
df['Tenure_to_Age'] = df['Tenure'] / df['Age']

# balance age interaction
df['Balance_Age_Interaction'] = df['Balance'] * df['Age']

# product age interaction
df['Products_Age_Interaction'] = df['NumOfProducts'] * df['Age']

# balance age ratio
df['Balance_to_Age'] = df['Balance'] / df['Age']

# balance product ratio
df['Balance_to_Products'] = df['Balance'] / df['NumOfProducts']

In [None]:
df.head(10)



---



# Data Preprocessing

- In this section, i will perform the following tasks:

1. Encoding categorical features as needed.
2. Splitting data.
3. Normalizing features to ensure they are on a similar scale.-

Since all categorical features are nominal and do not have any inherent order, I will use One-Hot Encoding to handle them appropriately.

# encoding categorical features

In [None]:
cat_cols = ['Geography', 'Gender']
bank_df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=False)

# transforming bool columns to int

In [None]:
bool_cols = bank_df_encoded.select_dtypes(include=['bool']).columns
bank_df_encoded[bool_cols] = bank_df_encoded[bool_cols].astype(int)

# Splitting Data

In [None]:
X = bank_df_encoded.drop('Exited', axis=1)
y = bank_df_encoded['Exited']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Normalization

In [None]:
# normalizing features using MinMax
scaler = MinMaxScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)



---



---



---



---



# 2. **Clustring**

# Standardizing the numerical feature

In [None]:
numerical_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

# Standardizing the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(bank_df_encoded[numerical_features])

# Appling K-means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
bank_df_encoded['Cluster'] = kmeans.fit_predict(scaled_data)

# Visualizing the clusters

In [None]:
# Reducing dimensions to 2 using PCA for visualization
pca = PCA(n_components=2)
pca_components = pca.fit_transform(scaled_data)

plt.figure(figsize=(8, 6))
plt.scatter(pca_components[:, 0], pca_components[:, 1], c=bank_df_encoded['Cluster'], cmap='viridis')
plt.title("K-means Clustering of Data")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar(label='Cluster Label')
plt.show()

In [None]:
cluster_summary = bank_df_encoded.groupby('Cluster')[numerical_features].mean()
print(cluster_summary)

- **Observations:**

1. Cluster 1 and Cluster 3 have high balances and relatively high credit card ownership but differ in age and tenure, Cluster 1 has more young customers with longer tenure while Cluster 3 has younger customers and shorter tenure.

2. Cluster 0 seems to represent a group with a moderate data: a normal credit score, moderate balance, and average activity with a reasonable number of products.

3. Cluster 2 is the most distinct with very low activity and credit card ownership these customers tend to have high balances but relatively low product usage. The low IsActiveMember value suggests they may be more d less engaged to the bank.

# Including 'Cluster' as a feature

In [None]:
X = bank_df_encoded.drop(columns=['Cluster', 'Exited'])
y = bank_df_encoded['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)




---



---



---



---



---



# 3. **individual and ensemble models**

- # **Individual Models**

# 1. Random Forest

In [None]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=rf.classes_, yticklabels=rf.classes_)
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# roc curve for Random Forest

In [None]:
# Get predicted probabilities for the positive class
y_pred_prob = rf.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Random Forest', color="r")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve', fontsize=16)
plt.show()



---



# 2. Support Vector Machine (SVM)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

params = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

# Setting up the GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=SVC(), param_grid=params, cv=5, scoring='accuracy', n_jobs=-1)


grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test_scaled)

print("Classification Report:")
print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=best_model.classes_, yticklabels=best_model.classes_)
plt.title("Confusion Matrix - SVM (Best Model)")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# 3. Logistic Regression

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)

print("Logistic Regression Accuracy: ", accuracy_score(y_test, lr_pred))
print("Logistic Regression Classification Report: \n", classification_report(y_test, lr_pred))
conf_matrix = confusion_matrix(y_test, lr_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Greens", xticklabels=lr_model.classes_, yticklabels=lr_model.classes_)
plt.title("Confusion Matrix - Logistic Regression")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# roc curve for Logestic Reggression

In [None]:
# Get predicted probabilities for the positive class
y_pred_prob = lr_model.predict_proba(X_test_scaled)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression', color="g")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve', fontsize=16)
plt.show()

# Comparison:

- Accuracy: SVM has the highest accuracy (94%), followed by random forest (93%) and logistic regression (92%).

- Precision: Random Forest has the highest precision for Class 0 (0.96), but SVM performs best for Class 1 (0.83) which indicates the best classification for predicting the customers who leaves. logistic regression precision for Class 1 is the lowest (0.78) and scince my main goal is to predict who leaves the bank this model can't be chosen.

- Recall: SVM is the best in recall for Class 1 (0.87), while Random Forest and Logistic Regression are both strong for Class 0 (around 0.96), so random forest is the best as class 1 is more important than class 0.

- Logistic Regression has the lowest recall for Class 0 (0.94).

- F1-Score: SVM and random forest provide similar F1-scores for both classes, with SVM showing a slightly better overall balance.



---



# **Ensamble Models**

# 1. Voting Classifier

In [None]:
# Combine individual models in a voting ensemble
voting_model = VotingClassifier(estimators=[
    ('svm', best_model),
    ('lr', lr_model),
    ('rf', rf)
], voting='hard')  # 'hard' for majority voting, 'soft' for averaging probabilities

voting_model.fit(X_train_scaled, y_train)

voting_pred = voting_model.predict(X_test_scaled)
print("Voting Classifier Accuracy: ", accuracy_score(y_test, voting_pred))
print("Voting Classifier Classification Report: \n", classification_report(y_test, voting_pred))

conf_matrix = confusion_matrix(y_test, voting_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=voting_model.classes_, yticklabels=voting_model.classes_)
plt.title("Confusion Matrix - Voting Classifier")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# 2. Bagging (Using Random Forest)

In [None]:
# Train Bagging model using Random Forest
bagging_model = BaggingClassifier(estimator=RandomForestClassifier(), n_estimators=50, random_state=42)
bagging_model.fit(X_train, y_train)

# Predict and evaluate
bagging_pred = bagging_model.predict(X_test)
print("Bagging Model Accuracy: ", accuracy_score(y_test, bagging_pred))
print("Bagging Model Classification Report: \n", classification_report(y_test, bagging_pred))
conf_matrix = confusion_matrix(y_test, bagging_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=bagging_model.classes_, yticklabels=bagging_model.classes_)
plt.title("Confusion Matrix - Bagging Model", fontsize=16)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# roc curve for Bagging (Using Random Forest)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Bagging Model (RF Estimator)', color="b")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Bagging Model ROC Curve', fontsize=16)
plt.show()

# 3. Boosting (Using Gradient Boosting)

In [None]:
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

gb_pred = gb_model.predict(X_test)
print("Gradient Boosting Accuracy: ", accuracy_score(y_test, gb_pred))
print("Gradient Boosting Classification Report: \n", classification_report(y_test, gb_pred))
conf_matrix = confusion_matrix(y_test, gb_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=gb_model.classes_, yticklabels=gb_model.classes_)
plt.title("Confusion Matrix - Gradient Boosting", fontsize=16)
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# roc curve for Boosting (Using Gradient Boosting)

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Gradient Boosting', color="m")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Gradient Boosting ROC Curve', fontsize=16)
plt.show()

- **Bagging Classifier** (with Random Forest) scored the highest accuracy (93.86%) and has a balanced performance for both classes.

- **Gradient Boosting and Voting Classifier** scored similar results with slightly lower accuracy(93.73) than Bagging.

- The recall for Class 1 (positive class) is consistently high across all models, indicating good detection of positive instances and the percestion is slightly higher for Bagging and Gradient Boosting.



---



# **Comparing Individual vs Ensemble Models**

# Comparison of Individual vs Ensemble Models

- **Note** The tables were generated using GPT, while all analyses were independently written manually.

## **1. Individual Models Performance**

| **Model**              | **Accuracy** | **Precision (Class 0)** | **Precision (Class 1)** | **Recall (Class 0)** | **Recall (Class 1)** | **F1-Score (Class 0)** | **F1-Score (Class 1)** |
|------------------------|--------------|-------------------------|-------------------------|----------------------|----------------------|------------------------|------------------------|
| **Random Forest**       | 93.0%        | 0.96                    | 0.84                    | 0.96                 | 0.84                 | 0.96                   | 0.84                   |
| **SVM (Best Model)**    | 94.0%        | 0.97                    | 0.83                    | 0.96                 | 0.87                 | 0.96                   | 0.85                   |
| **Logistic Regression** | 92.1%        | 0.96                    | 0.78                    | 0.94                 | 0.86                 | 0.95                   | 0.82                   |

## **2. Ensemble Models Performance**

| **Model**               | **Accuracy** | **Precision (Class 0)** | **Precision (Class 1)** | **Recall (Class 0)** | **Recall (Class 1)** | **F1-Score (Class 0)** | **F1-Score (Class 1)** |
|-------------------------|--------------|-------------------------|-------------------------|----------------------|----------------------|------------------------|------------------------|
| **Voting Classifier**    | 93.3%        | 0.96                    | 0.82                    | 0.95                 | 0.86                 | 0.96                   | 0.84                   |
| **Bagging (Random Forest)** | 93.9%      | 0.97                    | 0.84                    | 0.96                 | 0.87                 | 0.96                   | 0.85                   |
| **Gradient Boosting**    | 93.7%        | 0.96                    | 0.84                    | 0.96                 | 0.86                 | 0.96                   | 0.85                   |

## **Key Metrics Breakdown**

### **1. Accuracy**
- **Ensemble Models** (bagging, gradient boosting, and voting) showed a slight improvement over **Individual Models** in accuracy.
  - **Bagging** scored the highest accuracy (**93.9%**) followed by **Gradient Boosting** (**93.7%**) and **Voting Classifier** (**93.3%**).
  - **SVM** scored the highest individual model accuracy (**94.0%**) while **Random Forest** and **Logistic Regression** were slightly behind so those two will be eliminated from choosing the final model.

### **2. Precision**
  - **Class 0 (Negative)**: All models perform well, with **Bagging** achieving the highest precision for Class 0 (**0.97**).
  - **Class 1 (Positive)**: **Bagging** and **Gradient Boosting** performed the best for Class 1 with precision of **0.84**, while **SVM** is slightly lower at **0.83**. and since percestion for class 1 is what i foucus on bagging and gradient boosting is the two i will be choosing one of them to comment on in terms of bias and variance.

### **3. Recall**
  - **Class 0**: Models like **Random Forest**, **SVM**, and **Gradient Boosting** are effective in detecting negatives, all with recall values close to **0.96**.
  - **Class 1**: **SVM** and **Bagging** had the best recall for Class 1, both at **0.87**, which is slightly better than the rest. **Logistic Regression** has a lower recall of **0.86** for Class 1.

### **4. F1-Score**
  - **Class 0**: All models have very similar performance for Class 0 with **0.96** F1-Score.
  - **Class 1**: **SVM** **Bagging** and **Gradient Boosting** scored the highest F1-Scores for Class 1, all at **0.85**, while **Logistic Regression** is slightly lower at **0.82**.


## **Conclusion**

- **Ensemble Models** like **Bagging** and **Gradient Boosting** provide slight improvements in **recall** and **precision** for **Class 1 (positive class)** making them better at detecting positive cases.
- **Bagging** performed the best overall, with the highest accuracy and precision for **Class 0** and the best recall for **Class 1**.
- **SVM** performed best in terms of **accuracy** (**94.0%**), but the ensemble models provide a better balance between **precision**, **recall** and **F1-Score** for both classes.

In summary as my aim is class 1 i will be choosing Bagging as it has the highst accuracy and and best f1 score to comment on its Bais and Variance.




---



---



---



---



# 4. **bias / variance**

## Bagging for Bias-Variance Analysis

For the **bias-variance analysis**, I will focus on the **Bagging model** due to the following reasons:

- **Balanced Performance**: Based on the performance metrics bagging had shown a good balance between **accuracy** **precision** **recall** and **F1-score** across different classes. Its performance, especially on Class 1 (positive class), is competitive making it an ideal to explore bias and variance.

- **Bagging's ability to stabilize predictions** and reduce overfitting by averaging predictions from multiple models makes it a great choice for evaluating bias and variance.

Given these strengths, **Bagging** is the most suitable model to evaluate bias and variance.


## **Bias-Variance Analysis: Bagging Model**

### 1. Bagging Model Performance on Training Data:
- **Accuracy**: 94.0%
- **Precision (Class 0)**: 0.97
- **Precision (Class 1)**: 0.84
- **Recall (Class 0)**: 0.96
- **Recall (Class 1)**: 0.87
- **F1-Score (Class 0)**: 0.96
- **F1-Score (Class 1)**: 0.85

### 2. Bagging Model Performance on Test Data:
- **Accuracy**: 93.9%
- **Precision (Class 0)**: 0.97
- **Precision (Class 1)**: 0.84
- **Recall (Class 0)**: 0.96
- **Recall (Class 1)**: 0.87
- **F1-Score (Class 0)**: 0.96
- **F1-Score (Class 1)**: 0.85

### Bias-Variance Analysis:

#### Training vs Test Performance:
- The performance on both the **training** and **test datasets** isalmost identicall.

#### Interpretation:
- The mini gap between training and test performances suggests that the Bagging model is **not overfitting** to the training data. therefore it has **low variance**.
- The performance on both datasets is also good, with **high precision**, **recall**, and **F1-scores** since both metrics are good it indicates that the model is **generalizing well** indicating **low bias** as well.

### Conclusion:
The **Bagging model** appears to have **low bias** and **low variance** based on the performance metrics. it shows no signs of **underfitting** or **overfitting** meaning the model is well-tuned.




---



---



---



---



# 5. **Imbalance handling and cost-sensitive classification**

In [None]:
# Check the distribution of classes in the training set
print(pd.Series(y_train).value_counts())
print(pd.Series(y_test).value_counts())

- **Based on the class distribution, it seems that the dataset is imbalanced. In both the training and test datasets.**

### Percentage of Class 0 and Class 1 in Training Data:

- **Percentage of Class 0 in Training Data**:

 = 79.7\%


- **Percentage of Class 1 in Training Data**:

 =20.3\%


### Percentage of Class 0 and Class 1 in Testing Data:

- **Percentage of Class 0 in Testing Data**:

  = 79.6\%


- **Percentage of Class 1 in Testing Data**:

  = 20.4\%




---



# 1. SMOTE (Synthetic Minority Over-sampling Technique):

In [None]:
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Resample the training data
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Check the class distribution after applying SMOTE
print(pd.Series(y_train_smote).value_counts())

# 2. Undersampling:

In [None]:
# Initialize RandomUnderSampler
undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=42)

X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)

print(pd.Series(y_train_under).value_counts())

# 3. Class Weights (for RandomForest):

In [None]:
# Initialize RandomForestClassifier with class weights
model_with_weights = RandomForestClassifier(class_weight='balanced', random_state=42)
model_with_weights.fit(X_train, y_train)

class_distribution = y_train.value_counts()
print(class_distribution)

# Trainning the Models and Making Predictions

In [None]:
# Train a model on the SMOTE resampled data
model_smote = RandomForestClassifier(random_state=42)
model_smote.fit(X_train_smote, y_train_smote)

# Train a model on the Undersampled data
model_under = RandomForestClassifier(random_state=42)
model_under.fit(X_train_under, y_train_under)

# Predictions for each model on the test data
y_pred_smote = model_smote.predict(X_test)
y_pred_under = model_under.predict(X_test)
y_pred_weights = model_with_weights.predict(X_test)

# Confusion matrix for each model
cm_smote = confusion_matrix(y_test, y_pred_smote)
cm_under = confusion_matrix(y_test, y_pred_under)
cm_weights = confusion_matrix(y_test, y_pred_weights)

# ROC curve for each model
fpr_smote, tpr_smote, _ = roc_curve(y_test, model_smote.predict_proba(X_test)[:, 1])
fpr_under, tpr_under, _ = roc_curve(y_test, model_under.predict_proba(X_test)[:, 1])
fpr_weights, tpr_weights, _ = roc_curve(y_test, model_with_weights.predict_proba(X_test)[:, 1])

# Plotting confusion matrices and ROC curves
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title("Confusion Matrix Comparison")
plt.plot(cm_smote, label="SMOTE", color='r')
plt.plot(cm_under, label="Undersampling", color='g')
plt.plot(cm_weights, label="Class Weights", color='b')
plt.legend()


plt.subplot(1, 2, 2)
plt.title("ROC Curve Comparison")
plt.plot(fpr_smote, tpr_smote, label="SMOTE", color='r')
plt.plot(fpr_under, tpr_under, label="Undersampling", color='g')
plt.plot(fpr_weights, tpr_weights, label="Class Weights", color='b')
plt.legend()
plt.show()

## Models comparison: SMOTE, Undersampling, and Class Weights

### ROC Curve:

1. **ROC Curves Comparison**:
   - All three models (smote, undersampling, class weights) showes an  increase in the true positive rate up to around 0.8 on the y-axis and almost 0.01 on the x-axis. this idicates that the models are able to distinguish between the classes with increasing precision.

   -After the initial rise, the curves level off and move toward the top-right corner of the graph (around 0.2 on the x-axis and 1.0 on the y-axis). This shows that the models are performing well, accurately identifying the minority class (Exited).

2. **class weights with the largest area under the curve**:
   - The **Class Weights** model has the largest Area Under the Curve idicating that it has the best ability to distinguish between classes across different thresholds.

---

### Confusion Matrix Interpretation:

1. **Confusion Matrix for SMOTE, Undersampling, and Class Weights**:

     - **SMOTE**: Lines starting from around **2300** on the y-axis and decreasing as the x-axis approaches 0.8, indicating False Negatives and True Positives.

     - **Undersampling**: Lines starting from around **2200** on the y-axis, indicating a higher number of False Positives and True Negatives.

     - **Class Weights**: Lines starting from around **2350** on the y-axis, showing the best balance between False Positives and True Positives.

   - The confusion matrices show:
     - **Class Weights**: The model starts from around **100** on the y-axis, with better performance in handling False Positives and True Positives.

     - **SMOTE**: The model starts from around **200** on the y-axis, indicating slightly more False Positives compared to Class Weights.
    
     - **Undersampling**: The model starts from around **250** on the y-axis, indicating the lowest number of False Positives but a higher number of False Negatives.

---

### Conclusion:

- **Class Weights** showed the best performance in ROC curve and the better balance of predictions in the confusion matrix. that means it less missclassify the minor class that led to improved generalization and better handling of the class imbalance.

- Both **SMOTE** and **Undersampling** were effective but showed some limitations in handling False Negatives or False Positives.

Thus, **Class Weights** is the most effective method for handling the class imbalance in my case, providing the best overall performance when compared to **SMOTE** and **Undersampling**.




---



---



---



---



# 6. **Analyzing ALL the obtained results are provided in the report**