<div style=" text-align: center; justify-content: center; align-items: center; background-color: #FAF884
; padding: 10px;  border-radius: 10px;">
    <h1 style="color: black;  margin-top: 9px; margin-bottom: 9px; "><center>
        Credit Card Customers Churn Prediction💡
</center></h1>
</div>

# Introduction 📖

The bank's manager is concerned about customers leaving their credit card services. **The goal** of this analysis is to create a model that predicts whether a customer is likely to leave or not. This prediction will help the bank proactively engage with at-risk customers, provide better services, and ultimately reduce customer churn.

----------------------------
# Table of Contents 📑

> - [1 - Import Libraries 📚](#1)
> - [2 - Data Exploration 🔎](#2)
> - [3 - Exploratory Data Analysis 📊](#3)
>    - [3.1- Univariate Analysis](#3.1)
>    - [3.2- Bivariate Analysis](#3.2)
> - [4 - Data Preprocessing ⚒️](#4)
>    - [4.1- Handling Missing Data](#4.1)
>    - [4.2- Handling Outliers](#4.2)
>    - [4.3- Handling Categorical Data](#4.3)
>    - [4.4- Data Split to Train and Test Sets](#4.4)
>    - [4.5- Handling Imbalanced Data](#4.5)
>    - [4.6- Feature Scaling](#4.6)
> - [5 - Models Training and Evaluation ⚙️](#5)
>    - [5.1- K-fold Cross-Validation Evaluation and Feature Selection](#5.1)
>    - [5.2- Training the Chosen Model (XGBoost Classifier)](#5.2)
> - [6 - Hyperparameter Tuning 🛠️](#6)
>    - [6.1- ROC Curve for Final Model (XGBoost Classifier)](#6.1)

---------------------------------------
<a class="anchor"  id="1"></a>
# 1- Import Libraries 📚

In [None]:
# EDA Libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.subplots as sp
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import pycountry
sns.set_style("whitegrid")

# Data Preprocessing Libraries
from datasist.structdata import detect_outliers
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from category_encoders import BinaryEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Machine Learning (classification models) Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector, SelectKBest, f_regression, RFE, SelectFromModel
from imblearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report, roc_curve, roc_auc_score 
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

-------------------
<a class="anchor"  id="2"></a>
# 2- Data Exploration 🔎

In [None]:
df = pd.read_csv('BankChurners.csv')
df.sample(10)

In [None]:
# Showing all column names
df.columns

In [None]:
# Dropping the two precomputed Naive Bayes classifier output columns, which are not customer features
df = df.drop(['Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
             'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1)

### Feature Exploration

> - **`CLIENTNUM`** : Unique client identifier.
> - **`Attrition_Flag`** : Indicates whether the customer's account is active or has churned.
> - **`Customer_Age`** : Age of the customer.
> - **`Gender`** : Gender of the customer.
> - **`Dependent_count`** : Number of dependents of the customer.
> - **`Education_Level`** : Educational level of the customer.
> - **`Marital_Status`** : Marital status of the customer.
> - **`Income_Category`** : Income category of the customer.
> - **`Card_Category`** : Category of the credit card held by the customer.
> - **`Months_on_book`** : Number of months the customer has been a bank client.
> - **`Total_Relationship_Count`** : Total number of bank products held by the customer.
> - **`Months_Inactive_12_mon`** : Number of months with inactivity in the last 12 months.
> - **`Contacts_Count_12_mon`** : Number of contacts with the bank in the last 12 months.
> - **`Credit_Limit`** : Credit limit on the credit card.
> - **`Total_Revolving_Bal`** : Total revolving balance on the credit card.
> - **`Avg_Open_To_Buy`** : Average open to buy credit line on the credit card.
> - **`Total_Amt_Chng_Q4_Q1`** : Change in transaction amount from Q1 to Q4 (Q4 over Q1).
> - **`Total_Trans_Amt`** : Total transaction amount in the last 12 months.
> - **`Total_Trans_Ct`** : Total transaction count in the last 12 months.
> - **`Total_Ct_Chng_Q4_Q1`** : Change in transaction count from Q1 to Q4 (Q4 over Q1).
> - **`Avg_Utilization_Ratio`** : Average utilization ratio of the credit card.

In [None]:
# check the dataset shape
df.shape

In [None]:
# data information
df.info()

In [None]:
# Dropping CLIENTNUM column as it's a unique identifier and not useful for predictions.
df = df.drop( 'CLIENTNUM', axis=1)

In [None]:
df.shape

In [None]:
# checking for duplicated values
df.duplicated().sum()

- The data doesn't contain any duplicated rows.

In [None]:
# Counting the number of unique values in each column of the data
df.nunique()

In [None]:
# Descriptive analysis for numerical data
df.describe()

In [None]:
# Descriptive analysis for categorical data
df.describe(include='O')

-----------------------------------------
<a class="anchor"  id="3"></a>
# 3- Exploratory Data Analysis 📊

In [None]:
# Splitting columns into Categorical and Numerical Features
categorical_features = [
    'Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status',
    'Income_Category', 'Card_Category'
]

numerical_features = [
    'Customer_Age', 'Dependent_count', 'Months_on_book', 
    'Total_Relationship_Count', 'Months_Inactive_12_mon',
    'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
    'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
    'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'
]

<a class="anchor"  id="3.1"></a>
## 3.1- Univariate Analysis

### Exploration: Categorical Features

In [None]:
fig = px.pie(df, names='Attrition_Flag', 
             title='Attrition Flag Distribution',
             color_discrete_sequence=['#ff7f0e', '#3498db'],# Setting custom color
            )

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))


# Show the pie chart
fig.show()

In [None]:
fig = px.pie(df, names='Gender', 
             title='Gender Distribution',
             color_discrete_sequence=['#ff7f0e', '#3498db'],# Setting custom color
            )

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))


# Show the pie chart
fig.show()

In [None]:
fig = px.histogram(df, x='Education_Level', 
                   title='Education Level Distribution',
                   color_discrete_sequence=['#3498db'],  # Setting custom color
                  )

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

In [None]:
fig = px.histogram(df, x='Marital_Status', 
                   title='Marital Status Distribution',
                   color_discrete_sequence=['#3498db'],  # Setting custom color
                  )

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

In [None]:
fig = px.histogram(df, x='Income_Category', 
                   title='Income Category Distribution',
                   color_discrete_sequence=['#3498db'],  # Setting custom color
                  )

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

In [None]:
fig = px.histogram(df, x='Card_Category', 
                   title='Card Category Distribution',
                   color_discrete_sequence=['#3498db'],  # Setting custom color
                  )

fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

### Exploration: Numerical Features

In [None]:
# Create subplots with specified dimensions
fig, axes = plt.subplots(7, 2, figsize=(14, 20))  # Adjust the figsize according to your preference

# Plot a box plot for each numerical feature
for i, column in enumerate(numerical_features):
    row = i // 2  # Calculate the row for the subplot
    col = i % 2   # Calculate the column for the subplot
    sns.boxplot(data=df, x=column, ax=axes[row, col], orient='h', color='skyblue')
    axes[row, col].set_title(f'Distribution of {column}', fontsize=14)

# Remove any empty subplots if there are fewer numerical features than expected
for i in range(len(numerical_features), 7 * 2):
    row = i // 2
    col = i % 2
    fig.delaxes(axes[row, col])

plt.tight_layout()
plt.show()

### Skewed Continuous Features Exploration

In [None]:
# Create subplots with specified dimensions
fig, axes = plt.subplots(7, 2, figsize=(14, 20))  # Adjust the figsize according to your preference

# Plot a histogram with a KDE curve for each numerical feature
for i, column in enumerate(numerical_features):
    row = i // 2  # Calculate the row for the subplot
    col = i % 2   # Calculate the column for the subplot
    sns.histplot(data=df, x=column, kde=True, ax=axes[row, col], color='skyblue')
    axes[row, col].set_title(f'Distribution of {column}', fontsize=14)

# Remove any empty subplots if there are fewer numerical features than expected
for i in range(len(numerical_features), 7 * 2):
    row = i // 2
    col = i % 2
    fig.delaxes(axes[row, col])

plt.tight_layout()
plt.show()

---------------------------------------------
<a class="anchor"  id="3.2"></a>
## 3.2- Bivariate Analysis

### What is the relationship between churn status and other categorical variables like Gender, Education Level, Marital Status, and Income Category?
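
Because the two churn classes differ a lot in size, raw counts can be hard to compare across categories. One way to make the question concrete is to compute the share of attrited customers within each category. The sketch below is only an illustration: the `toy` frame and the `churn_rate_by` helper are hypothetical, and the values are made up, although the column names match the dataset.

```python
import pandas as pd

# Toy data with the same column names as the notebook (values are made up).
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "Attrition_Flag": [
        "Existing Customer", "Attrited Customer", "Existing Customer",
        "Existing Customer", "Existing Customer", "Attrited Customer",
        "Existing Customer", "Attrited Customer",
    ],
})

def churn_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of attrited customers within each category of `column`."""
    churned = frame["Attrition_Flag"].eq("Attrited Customer")
    return churned.groupby(frame[column]).mean()

rates = churn_rate_by(toy, "Gender")
print(rates)
```

The same helper could be called with `Education_Level`, `Marital_Status`, or `Income_Category` on the real `df`, as long as `Attrition_Flag` still holds its original string labels.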

In [None]:
fig = px.histogram(df, x='Gender', color='Attrition_Flag',
             title='Churn Status by Gender',
             labels={'Attrition_Flag': 'Churn Status'},
             template='plotly_white', barmode='group',
             color_discrete_sequence=['#ff7f0e', '#3498db']
            )

# Customizing marker appearance
fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

# Show the plot
fig.show()

In [None]:
fig = px.histogram(df, x='Education_Level', color='Attrition_Flag',
             title='Churn Status by Education Level',
             labels={'Education_Level': 'Education Level', 'Attrition_Flag': 'Churn Status'},
             template='plotly_white', barmode='group',
             color_discrete_sequence=['#ff7f0e', '#3498db']
            )

# Customizing marker appearance
fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

# Show the plot
fig.show()

In [None]:
fig = px.histogram(df, x='Marital_Status', color='Attrition_Flag',
             title='Churn Status by Marital Status',
             labels={'Marital_Status': 'Marital Status', 'Attrition_Flag': 'Churn Status'},
             template='plotly_white', barmode='group',
             color_discrete_sequence=['#ff7f0e', '#3498db']
            )

# Customizing marker appearance
fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

# Show the plot
fig.show()

In [None]:
fig = px.histogram(df, x='Income_Category', color='Attrition_Flag',
             title='Churn Status by Income Category',
             labels={'Income_Category': 'Income Category', 'Attrition_Flag': 'Churn Status'},
             template='plotly_white', barmode='group',
             color_discrete_sequence=['#ff7f0e', '#3498db']
            )

# Customizing marker appearance
fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

# Show the plot
fig.show()

### How does customer age correlate with churn status? Are younger or older customers more likely to churn?

In [None]:
# Creating a histogram using Plotly Express to visualize the relationship between customer age and churn status
fig = px.histogram(df, x='Customer_Age', color='Attrition_Flag', title='Customer Age Distribution by Churn Status',
                   labels={'Customer_Age': 'Customer Age', 'Attrition_Flag': 'Churn Status'}, 
                   marginal='box', barmode='group',
                   color_discrete_sequence=['#ff7f0e', '#3498db']
                 )

# Customizing marker appearance
fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

### Is there a relationship between credit card utilization and churn status? Do customers with higher card utilization have a higher likelihood of churn?

In [None]:
fig = px.box(df, y='Avg_Utilization_Ratio', x='Attrition_Flag', 
                color='Attrition_Flag', 
                title='Relationship Between Credit Card Utilization and Churn Status',
                color_discrete_sequence=['#ff7f0e', '#3498db'],
                labels={'Avg_Utilization_Ratio': 'Credit Card Utilization Ratio', 'Attrition_Flag': 'Churn Status'})

# Format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False),
    yaxis=dict(title='Credit Card Utilization Ratio', zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

### Is there a correlation between credit limit and churn status? Do customers with higher credit limits tend to stay with the bank?

In [None]:
fig = px.box(df, y='Credit_Limit', x='Attrition_Flag', 
                 color='Attrition_Flag', 
                 title='Correlation Between Credit Limit and Churn Status',
                 color_discrete_sequence=['#ff7f0e', '#3498db'],
                 labels={'Credit_Limit': 'Credit Limit', 'Attrition_Flag': 'Churn Status'},)

# Format the layout
fig.update_layout(
    xaxis=dict(title='Churn Status', zeroline=False, gridcolor='white'),
    yaxis=dict(title='Credit Limit', showgrid=False, zeroline=False),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)',
)

fig.show()

### Does the number of contacts with the bank in the last 12 months influence churn? Are customers who are contacted more frequently less likely to churn?

In [None]:
fig = px.box(df, y='Contacts_Count_12_mon', x='Attrition_Flag', 
                color='Attrition_Flag', 
                title='Influence of Contacts Count on Churn Status',
                color_discrete_sequence=['#ff7f0e', '#3498db'],
                labels={'Contacts_Count_12_mon': 'Number of Contacts in Last 12 Months', 'Attrition_Flag': 'Churn Status'},
                category_orders={'Attrition_Flag': ['Existing Customer', 'Attrited Customer']})  

# Customizing marker appearance
fig.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')))

# Format the layout
fig.update_layout(
    yaxis=dict(title='Number of Contacts in Last 12 Months', showgrid=False, zeroline=False),
    xaxis=dict(title='Churn Status', zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233, 233, 233)',
    plot_bgcolor='rgb(233, 233, 233)',
)

fig.show()

------------------
<a class="anchor"  id="4"></a>
# 4- Data Preprocessing ⚒️

<a class="anchor"  id="4.1"></a>
## 4.1- Handling Missing Data

In [None]:
# checking for missing values in data
df.isna().sum()

- Data does not contain any missing values

<a class="anchor"  id="4.2"></a>
## 4.2- Handling Outliers
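
As a rough illustration of what an outlier rule can look like, the sketch below applies the common interquartile-range (Tukey) fences to a single column. It is an assumption that a rule of this kind is appropriate here. The `iqr_outlier_mask` helper is hypothetical and does not reproduce how `detect_outliers` from datasist combines flags across several columns.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sample: six similar values and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 95]
mask = iqr_outlier_mask(sample)
print(mask)
```

Only the last value falls outside the fences in this sample. On real data, it is worth checking how many rows such a rule would remove before dropping them, since heavy-tailed features such as credit limits can lose a large share of legitimate customers.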

In [None]:
# Detect outliers in numerical features
outliers_indices = detect_outliers(df, features=numerical_features, n=1.5)
number_of_outliers = len(outliers_indices)

# Print the number of outliers
print(f'Number of outliers: {number_of_outliers}')

In [None]:
# Removing all Outliers
df = df.drop(outliers_indices)

In [None]:
print(f"Dataset Shape After Removing Outliers {df.shape}")

<a class="anchor"  id="4.3"></a>
## 4.3- Handling Categorical Data

- **Nominal**: Categories without a meaningful order or ranking (**Gender, Marital Status**). These are one-hot encoded. The binary target (**Attrition Flag**) is simply mapped to 0 and 1.
- **Ordinal**: Categories with a meaningful order or ranking (**Education Level, Income Category, Card Category**). These are mapped to increasing integers.
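
The difference between the two encodings can be sketched on a tiny, made-up frame. The `toy` data and `card_order` dictionary are hypothetical stand-ins. The check after `map` is there because `map` returns `NaN` for any label that is missing from the dictionary, which is easy to overlook.

```python
import pandas as pd

# Made-up rows using two of the dataset's column names.
toy = pd.DataFrame({
    "Card_Category": ["Blue", "Gold", "Silver", "Blue"],
    "Gender": ["F", "M", "M", "F"],
})

# Ordinal feature: keep the ranking by mapping labels to increasing integers.
card_order = {"Blue": 0, "Silver": 1, "Gold": 2, "Platinum": 3}
toy["Card_Category"] = toy["Card_Category"].map(card_order)

# `map` yields NaN for labels missing from the dictionary, so verify coverage.
assert toy["Card_Category"].notna().all(), "unmapped category found"

# Nominal feature: one indicator column per label, with no implied order.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded)
```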

In [None]:
# Encoding the binary target and the ordinal features with pandas `map` method.

attrition_flag_dic = {
    'Existing Customer' : 0,
    'Attrited Customer' : 1
}

# Graduate ranks below Post-Graduate so the integers follow the education order
edu_level_dic = {  
    'Unknown': 0, 
    'Uneducated': 1, 
    'High School': 2, 
    'College': 3,
    'Graduate': 4, 
    'Post-Graduate': 5, 
    'Doctorate': 6
} 

income_cat_dic = {
    'Unknown': 0,
    'Less than $40K': 1,
    '$40K - $60K': 2,
    '$60K - $80K': 3,
    '$80K - $120K': 4,
    '$120K +': 5
}

card_cat_dic = {
    'Blue': 0,
    'Silver': 1,
    'Gold': 2,
    'Platinum': 3
}

df['Attrition_Flag'] = df['Attrition_Flag'].map(attrition_flag_dic)

df['Education_Level'] = df['Education_Level'].map(edu_level_dic)

df['Income_Category'] = df['Income_Category'].map(income_cat_dic)

df['Card_Category'] = df['Card_Category'].map(card_cat_dic)

df.head()

In [None]:
# Working with Nominal Features with pandas `get_dummies` function.
df = pd.get_dummies(df, columns=['Gender', 'Marital_Status'])

encoded = list(df.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

In [None]:
df.head()

<a class="anchor"  id="4.4"></a>
## 4.4- Data Split to Train and Test Sets

In [None]:
# First we extract the x features and the y label
x = df.drop(['Attrition_Flag'], axis=1)
y = df['Attrition_Flag']

In [None]:
x.shape, y.shape

In [None]:
# Then we split the data into training and testing sets (80% training, 20% testing),
# stratifying on the label so both sets keep the same churn proportion
X_train, X_test, y_train, y_test = train_test_split(x, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    stratify = y,
                                                    random_state = 42)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

<a class="anchor"  id="4.5"></a>
## 4.5- Handling Imbalanced Data

In [None]:
y_train.value_counts()

> - The training data is imbalanced, so we oversample the minority class with SMOTE. Under-sampling would instead discard majority-class rows and lose information. SMOTE is applied to the training set only, so the test set keeps its original class distribution.
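
To give a feel for what SMOTE does, the sketch below creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. It is a simplified illustration of the idea in plain NumPy, not the imbalanced-learn implementation. The `smote_like_samples` function and the `minority_rows` array are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))        # pick a minority row
        j = rng.choice(neighbours[i])          # pick one of its neighbours
        lam = rng.random()                     # interpolation weight in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Made-up minority-class rows with two features.
minority_rows = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_rows = smote_like_samples(minority_rows, n_new=5)
print(new_rows.shape)
```

Because neighbours are found by distance, features on very different scales can dominate the neighbour search. That is one reason scaling is sometimes applied before oversampling.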

In [None]:
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [None]:
y_train.value_counts()

> - The training data is now balanced.

<a class="anchor"  id="4.6"></a>
## 4.6- Feature Scaling

### Standardizing Continuous Features with StandardScaler

In [None]:
# Creating a StandardScaler instance
scaler = StandardScaler()

# Fitting the StandardScaler on the training data
scaler.fit(X_train[numerical_features])

# Transforming (standardize) the continuous features in the training and testing data
X_train_cont_scaled = scaler.transform(X_train[numerical_features])
X_test_cont_scaled = scaler.transform(X_test[numerical_features])

# Replacing the scaled continuous features in the original data
X_train[numerical_features] = X_train_cont_scaled
X_test[numerical_features] = X_test_cont_scaled

X_train

--------------------------------------------------------
<a class="anchor"  id="5"></a>
# 5- Models Training and Evaluation ⚙️

In [None]:
# List of classifiers to evaluate
classifiers = [
    ("Logistic Regression", LogisticRegression(random_state=42, max_iter= 1500, n_jobs=-1)),
    ("Decision Tree", DecisionTreeClassifier(random_state=42)),
    ("Random Forest", RandomForestClassifier(random_state=42, n_jobs =-1)),
    ("AdaBoost", AdaBoostClassifier(random_state=42)),
    ("Gradient Boosting", GradientBoostingClassifier(random_state=42)),
    ("LightGBM", lgb.LGBMClassifier(random_state=42, verbose=-1)),
    ("XGBoost", xgb.XGBClassifier(random_state=42, n_jobs =-1))
]

<a class="anchor"  id="5.1"></a>
## 5.1- K-fold Cross-Validation Evaluation and Feature Selection

> Cross-validating a pipeline evaluates each model on several train/validation splits instead of a single one, which gives a more stable performance estimate. Placing feature selection inside the pipeline means the features are re-selected on the training part of every fold, so the validation part never influences which features are kept. Features that hold up across folds are more likely to generalize to new data.
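
A minimal sketch of this pattern, using only scikit-learn and synthetic data, might look like the following. The dataset from `make_classification`, the decision-tree estimators, and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions for illustration. The notebook itself uses LightGBM inside RFE and the real churn data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

demo_pipeline = Pipeline(steps=[
    # The selector is refit inside every training fold, so the validation
    # fold never influences which features are kept.
    ("feature_selector", RFE(DecisionTreeClassifier(random_state=42),
                             n_features_to_select=4)),
    ("model", DecisionTreeClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(demo_pipeline, X_demo, y_demo, cv=cv,
                        scoring="accuracy", return_train_score=True)

print(scores["test_score"].mean(), scores["train_score"].mean())
```

The `Pipeline` class from imbalanced-learn, which this notebook already imports, also accepts samplers such as `SMOTE` as steps. Placing the sampler there would apply oversampling to the training part of each fold only, which keeps synthetic rows out of the validation folds.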

In [None]:
# Initialize RFE feature selector
RFE_selector = RFE(lgb.LGBMClassifier(random_state=42, verbose=-1), n_features_to_select=12)


# Creating lists for classifier names, mean_test_accuracy_scores, and results.
results = []
mean_test_accuracy_scores = []
classifier_names = []

for model_name, model in classifiers:
    # Print model name
    print(f"For {model_name}:")
    
    # Steps Creation
    steps = list()
    
    steps.append(('feature_selector', RFE_selector))  # RFE feature selection
    
    steps.append((model_name, model))

    # Create the pipeline
    pipeline = Pipeline(steps=steps)
                        
    # 5-fold Stratified Cross-Validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Perform cross-validation with train scores
    cv_results = cross_validate(pipeline, X_train, y_train, cv=cv, scoring='accuracy',n_jobs=-1, return_train_score=True)
    
    print(f"Cross-validation completed successfully for {model_name}")
    print('*' * 50)

    # Append results to the list
    results.append({
        "Model Name": model_name,
        "Mean Train Accuracy": np.mean(cv_results['train_score']),
        "Mean Test Accuracy": np.mean(cv_results['test_score'])
    })
    
    mean_test_accuracy_scores.append(np.mean(cv_results['test_score']))
    classifier_names.append(model_name)

# Create a DataFrame from the results list
results_df = pd.DataFrame(results)

# Display the DataFrame
display(results_df)

### Mean Test Accuracy Scores by Classifiers

In [None]:
# Creating a DataFrame from the data
data = pd.DataFrame({'Classifier': classifier_names, 'Test Accuracy': mean_test_accuracy_scores})

# Creating Plotly bar chart
fig = px.bar(data, x='Test Accuracy', y='Classifier', orientation='h', color='Test Accuracy',
             title='Mean Test Accuracy Scores by Classifiers', text='Test Accuracy', color_continuous_scale='viridis')

# Customizing the layout
fig.update_layout(
    xaxis_title='Test Accuracy',
    yaxis_title='Classifier',
    xaxis=dict(range=[0, 1]),
    yaxis=dict(categoryorder='total ascending'),
    showlegend=False,
    height=500,
    width=900
)

fig.show()

> Among the models evaluated with cross-validation, the XGBoost classifier was the top performer. Its mean train accuracy and mean test accuracy were both high, and it showed no signs of overfitting, so we select it for further analysis.
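
One simple way to back up an overfitting check is to look at the gap between mean train and mean test accuracy for each model. The sketch below uses hypothetical scores in a frame shaped like `results_df`. The numbers are invented for illustration and are not the notebook's results.

```python
import pandas as pd

# Hypothetical scores for illustration only; not the notebook's results.
demo_results = pd.DataFrame({
    "Model Name": ["Model A", "Model B"],
    "Mean Train Accuracy": [1.00, 0.95],
    "Mean Test Accuracy": [0.90, 0.94],
})

# A large positive gap suggests the model fits the training folds
# much better than the validation folds.
demo_results["Train-Test Gap"] = (
    demo_results["Mean Train Accuracy"] - demo_results["Mean Test Accuracy"]
)
ranked = demo_results.sort_values("Mean Test Accuracy", ascending=False)
print(ranked)
```

In this made-up example, Model B ranks first on test accuracy and also has the smaller gap, which is the kind of pattern that supports choosing a model.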

## Selected Features 

In [None]:
# Initialize RFE feature selector
RFE_selector = RFE(lgb.LGBMClassifier(random_state=42, verbose=-1), n_features_to_select=12)

# Fit RFE selector to the training data
RFE_selector.fit(X_train, y_train)

# Get the indices of the selected features
selected_feature_indices = np.where(RFE_selector.support_)[0]

# Get the names of the selected features based on their indices
selected_feature_names = X_train.columns[selected_feature_indices]

# Print the names of the selected features
print("Selected Features:")
print(selected_feature_names)

<a class="anchor"  id="5.2"></a>
## 5.2- Training the Chosen Model (XGBoost Classifier)

In [None]:
# Define the pipeline with the feature selector
pipeline = Pipeline(steps=[
    ('feature_selector', RFE_selector),
    ("XGBoost", xgb.XGBClassifier(random_state=42, n_jobs =-1))
])

pipeline.fit(X_train, y_train)

# Predictions on test data
y_pred = pipeline.predict(X_test)

# Calculate F1-score
f1 = f1_score(y_test, y_pred, average='weighted')

# Printing model details
print(f'Model: XGBoost')
print(f'Training Accuracy: {accuracy_score(y_train, pipeline.predict(X_train))}')
print(f'Testing Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'F1-score: {f1}')
print('------------------------------------------------------------------')
print(f'Testing Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')
print('------------------------------------------------------------------')
print(f'Testing Classification report: \n{classification_report(y_test, y_pred)}')
print('------------------------------------------------------------------')

## Models Predictions Conclusion

> - XGBoost performs well on both the training and test sets, shows no signs of overfitting, and reaches a weighted F1-score of 96%.
> - Next, we'll work on improving the XGBoost model to see if we can make it more accurate.
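
A weighted F1-score averages the per-class scores by class size, so on imbalanced test data it is dominated by the majority class. The sketch below computes both views from a confusion matrix to show how they can differ. The matrix values are hypothetical and the `f1_per_class` helper is an illustration, not part of scikit-learn.

```python
def f1_per_class(cm):
    """Per-class F1 from a square confusion matrix (rows = true, cols = predicted)."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Hypothetical confusion matrix: 90 retained and 10 churned customers.
cm = [[88, 2],
      [5, 5]]
per_class = f1_per_class(cm)
support = [sum(row) for row in cm]
weighted = sum(f * s for f, s in zip(per_class, support)) / sum(support)
print(per_class, weighted)
```

In this made-up case the weighted score stays high while the churn class scores much lower. For that reason, the churn-class row of the classification report is worth reading alongside the weighted figure.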

--------------------------------------------------------
<a class="anchor"  id="6"></a>
# 6- Hyperparameter Tuning  🛠️

> Hyperparameter tuning with grid search evaluates every combination of candidate values with cross-validation and keeps the best-scoring one. It can improve accuracy and help control overfitting, but it is computationally expensive, because the number of model fits grows with the product of the number of candidate values for each parameter.
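
The cost of a grid search can be estimated before running it by counting combinations. The sketch below does this for the candidate values used in this section with 5-fold cross-validation. The `grid` dictionary drops the `XGBoost__` step prefix for readability, and `n_folds` is just a local name.

```python
from itertools import product

# Same candidate values as the grid used in this section.
grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5],
}
n_folds = 5

# Every combination is fitted once per fold.
combinations = list(product(*grid.values()))
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)
```

Each of these fits also reruns the feature selector inside the pipeline, so the real running time is higher than the count alone suggests.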

### Hyperparameter Tuning for XGBoost Classifier

In [None]:
param_grid = {
    'XGBoost__learning_rate': [0.01, 0.1, 0.2],  
    'XGBoost__n_estimators': [100, 200, 300],  
    'XGBoost__max_depth': [3, 4, 5],  
}

In [None]:
steps=[]
steps.append(('feature_selector', RFE_selector))
steps.append(("XGBoost", xgb.XGBClassifier(random_state=42, n_jobs =-1)))
pipeline=Pipeline(steps=steps)

In [None]:
# Create GridSearchCV instance
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1,
                           return_train_score=True)

# Fit the grid search on the training data only, so the test set stays unseen
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

In [None]:
# Get the mean test score and mean train score for the best estimator
mean_test_score = grid_search.cv_results_['mean_test_score'][grid_search.best_index_]
mean_train_score = grid_search.cv_results_['mean_train_score'][grid_search.best_index_]

print("Mean Test Score:", mean_test_score)
print("Mean Train Score:", mean_train_score)

> The initial settings for XGBoost worked well, and even after trying different configurations, we didn't see much improvement in accuracy.

In [None]:
final_model=grid_search.best_estimator_

In [None]:
final_model

<a class="anchor"  id="6.1"></a>
## 6.1- ROC Curve for Final Model (XGBoost Classifier)

In [None]:
# Predict probabilities for the positive class on the held-out test set
y_probabilities = final_model.predict_proba(X_test)[:, 1]

# Calculate the ROC curve and AUC score on the test set
fpr, tpr, thresholds = roc_curve(y_test, y_probabilities)
auc = roc_auc_score(y_test, y_probabilities)

# Plotting the ROC curve
sns.set(style='whitegrid')
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for XGBoost Classifier')
plt.legend(loc='lower right')
plt.show()

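> The AUC measures how well the model ranks churned customers above retained ones: 1.00 means perfect separation and 0.50 means chance-level ranking. An AUC that rounds to 1.00 should be read with caution. If the curve is computed on rows the model saw during fitting, it reflects training fit rather than generalization, so only a score on held-out test data gives an honest estimate.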
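
The ranking interpretation of AUC can be checked by hand on a few scores. The sketch below computes it as the fraction of positive-negative pairs in which the positive example receives the higher score. The `auc_from_scores` helper and the four example rows are hypothetical, and in practice `roc_auc_score` is the function to use.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as half)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(labels, scores))
```

Three of the four positive-negative pairs are ranked correctly here, giving 0.75. A value of exactly 1.00 would require every churned customer to score above every retained one.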