<a href="https://www.kaggle.com/code/divyam6969/bank-churn-binary-classification?scriptVersionId=159353958" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Bank Churn Binary Classification
Done by
Divyam 
[Github](https://github.com/divyam6969)

This project involves building a predictive model using XGBoost to identify potential customer churn in a financial dataset. The process includes preprocessing steps like handling missing values and outliers, as well as one-hot encoding categorical variables. The trained model is evaluated on a test set, and predictions are saved to a CSV file. The project output provides a detailed breakdown of churn predictions by customer segments, offering insights into the model's performance.

For users interested in replicating or extending the project, the README file serves as a concise guide, detailing project structure, dependencies, and usage instructions. Overall, this work contributes to customer relationship management by employing machine learning to proactively detect and address potential churn in a financial context.

## Reading the data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
pip install pandas

In [None]:
import pandas as pd

In [None]:
train_data = pd.read_csv('/kaggle/input/playground-series-s4e1/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s4e1/test.csv')


In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
train_data.describe()

Now that the data has been read, we explore it.

## Exploring / Pre Processing Data

In [None]:
train_data.info()

#### Removing negative values (in Age and Balance)

In [None]:
# Keep only the rows where 'Balance' and 'Age' are non-negative
train_data = train_data[(train_data['Balance'] >= 0) & (train_data['Age'] >= 0)]
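The filter above combines two boolean masks with `&`. As a minimal sketch of that idea, here is the same pattern on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the competition data.

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_data
toy = pd.DataFrame({
    "Age": [25, -3, 40, 31],
    "Balance": [1000.0, 500.0, -20.0, 0.0],
})

# A row is kept only when both conditions are True
mask = (toy["Age"] >= 0) & (toy["Balance"] >= 0)
filtered = toy[mask]

print(f"kept {len(filtered)} of {len(toy)} rows")
print(filtered)
```

One side effect worth knowing: a comparison such as `NaN >= 0` evaluates to `False`, so any row with a missing `Age` or `Balance` is also dropped by this kind of filter.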


In [None]:
train_data.describe()

#### Removing the outliers

In [None]:
# Select the numeric columns once, excluding the last column ("Exited")
numeric_part = train_data.iloc[:, :-1].select_dtypes(include=['float64', 'int64'])

# Calculate the interquartile range (IQR) for each numeric column
Q1 = numeric_part.quantile(0.25)
Q3 = numeric_part.quantile(0.75)
IQR = Q3 - Q1

# Flag rows where any numeric value falls outside the 1.5 * IQR fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers = ((numeric_part < lower_fence) | (numeric_part > upper_fence)).any(axis=1)

# Remove outliers
train_data = train_data[~outliers]
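To make the fence rule above concrete, here is a small sketch on a single made-up series. The name `toy_feature` and its values are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Made-up values: six typical points and one extreme point
values = pd.Series([10, 12, 11, 13, 12, 11, 100], name="toy_feature")

q1 = values.quantile(0.25)          # 11.0
q3 = values.quantile(0.75)          # 12.5
iqr = q3 - q1                       # 1.5

lower_fence = q1 - 1.5 * iqr        # 8.75
upper_fence = q3 + 1.5 * iqr        # 14.75

# A value is an outlier when it lies outside [lower_fence, upper_fence]
is_outlier = (values < lower_fence) | (values > upper_fence)
cleaned = values[~is_outlier]

print(cleaned.tolist())
```

Note that the rule is applied to every numeric column, including 0/1 flag columns. If more than 75% of a flag column holds the same value, its IQR is 0 and every row with the other value is flagged, so it may be worth checking such columns with `value_counts()` before filtering.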


In [None]:
train_data.describe()

#### Replacing null values in numeric columns with the column mean

In [None]:
# Extract numeric columns excluding the last column ("Exited")
numeric_columns = train_data.iloc[:, :-1].select_dtypes(include=['float64', 'int64']).columns

# Calculate column-wise means excluding the last column ("Exited")
column_means = train_data.iloc[:, :-1][numeric_columns].mean()

# Fill null values in the numeric columns with the corresponding column mean
train_data[numeric_columns] = train_data[numeric_columns].fillna(column_means)
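A minimal sketch of the same mean-imputation step on a made-up frame may help show what it does and does not touch. The frame `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing numeric value and one missing text value
toy = pd.DataFrame({
    "CreditScore": [600.0, np.nan, 700.0],
    "Geography": ["France", None, "Spain"],
})

numeric_columns = toy.select_dtypes(include=["float64", "int64"]).columns

# mean() skips NaN by default, so the mean here is (600 + 700) / 2
column_means = toy[numeric_columns].mean()

# fillna with a Series fills each column with its own mean
toy[numeric_columns] = toy[numeric_columns].fillna(column_means)

print(toy)
```

The missing `Geography` entry is left as is, which is why a separate step is needed for non-numeric columns.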


#### Dropping rows with null values in non-numeric columns

In [None]:
train_data.dropna(inplace=True)

In [None]:
train_data.describe()

In [None]:
train_data.info()

We have now removed the useless values and the data is preprocessed. Next we train the model and test it on the test.csv file.

#### Plotting the data to check for any remaining outliers

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import math

# train_data is the cleaned DataFrame from the preprocessing steps above

# All numeric columns are treated as continuous features for the boxplots
continuous_features = train_data.select_dtypes(include=['float64', 'int64']).columns
categorical_features = ['Geography', 'Gender']  # Categorical columns for the countplots

# Set up the subplots
num_cols_continuous = len(continuous_features)
num_cols_categorical = len(categorical_features)

num_rows_continuous = math.ceil(num_cols_continuous / 2)
num_rows_categorical = math.ceil(num_cols_categorical / 2)

# Set up the subplots for continuous features
fig, ax = plt.subplots(num_rows_continuous, 2, figsize=(16, 5 * num_rows_continuous))

# Choose a suitable color palette
box_color_palette = sns.color_palette("pastel")

for i, col in enumerate(continuous_features):
    # Create a boxplot using sns.boxplot with styling
    sns.boxplot(data=train_data, x=col, color=box_color_palette[i % len(box_color_palette)], ax=ax.flatten()[i])
    
    ax.flatten()[i].set_title(f'Boxplot of {str(col).upper()}', fontsize=14)
    ax.flatten()[i].set_xlabel(col, fontsize=12)
    ax.flatten()[i].set_ylabel('Value', fontsize=12)
    
    # Add grid for better readability
    ax.flatten()[i].grid(axis='y', linestyle='--', alpha=0.7)

# Hide any empty subplots
for j in range(i + 1, num_rows_continuous * 2):
    fig.delaxes(ax.flatten()[j])

# Set up the subplots for categorical features
fig, ax = plt.subplots(num_rows_categorical, 2, figsize=(16, 5 * num_rows_categorical))

# Choose a suitable color palette
countplot_color_palette = sns.color_palette("muted")

for i, col in enumerate(categorical_features):
    # Create a countplot using sns.countplot with styling
    sns.countplot(x=col, data=train_data, palette=countplot_color_palette, ax=ax.flatten()[i])
    
    ax.flatten()[i].set_title(f'Countplot of {str(col).upper()}', fontsize=14)
    ax.flatten()[i].set_xlabel(col, fontsize=12)
    ax.flatten()[i].set_ylabel('Count', fontsize=12)
    
    # Add grid for better readability
    ax.flatten()[i].grid(axis='y', linestyle='--', alpha=0.7)

# Hide any empty subplots
for j in range(i + 1, num_rows_categorical * 2):
    fig.delaxes(ax.flatten()[j])

# Adjust layout and show the plot
plt.tight_layout()
plt.show()


### Next we train a model on the data, check its accuracy, and generate predictions for test.csv

## Data Modelling

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
print(train_data.describe())

In [None]:
print(test_data.describe())

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Separate features (X) and target variable (y)
X = train_data.drop(columns=['Exited'])  # Exclude the target variable
y = train_data['Exited']


#### Training the model using the XGBClassifier model

In [None]:

# Define categorical and numeric features
categorical_features = ['Surname', 'Geography', 'Gender']
numeric_features = [col for col in X.columns if col not in categorical_features]


In [None]:

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
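The `handle_unknown='ignore'` setting matters when the test data contains a category that never appeared during fitting. The sketch below shows that behaviour on two tiny made-up frames; `train` and `new` and their values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up frames: "Germany" is not present when the encoder is fitted
train = pd.DataFrame({"Age": [30, 45, 52], "Geography": ["France", "Spain", "France"]})
new = pd.DataFrame({"Age": [28], "Geography": ["Germany"]})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Geography"]),
    ])

preprocessor.fit(train)
encoded = preprocessor.transform(new)

# The result may be sparse or dense depending on its density; normalise to dense
dense = encoded.toarray() if hasattr(encoded, "toarray") else encoded

# Columns are: Age, Geography_France, Geography_Spain
print(dense)
```

The unseen category is encoded as all zeros instead of raising an error. One-hot encoding also creates one column per distinct value, so a column with many distinct values, such as `Surname`, can produce a very wide matrix.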



In [None]:

# Combine the preprocessor with the classifier in a pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('classifier', XGBClassifier())])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 'Exited' is expected to hold 0/1 labels, so thresholding at 0.5
# simply casts it to an integer 0/1 target without changing any label
threshold = 0.5

y_binary = (y_train > threshold).astype(int)

# Train the model
pipeline.fit(X_train, y_binary)



In [None]:
# Predict labels for the competition test set and for the training split
test_predictions = pipeline.predict(test_data)
train_predictions = pipeline.predict(X_train)

# Accuracy on the training split, i.e. on rows the model was fitted on
train_accuracy = accuracy_score(y_train, train_predictions)
print(f'Training Accuracy: {train_accuracy}')

# Save the predictions to a CSV file
predictions_df = pd.DataFrame({'id': test_data['id'], 'Exited': test_predictions})
predictions_df.to_csv('DivyamKaAnswer.csv', index=False)
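Training accuracy is measured on rows the model has already seen, so it tends to be optimistic. The sketch below shows one way the held-out split could be scored instead. The helper `report_holdout_metrics` is hypothetical (it is not part of the notebook), and the synthetic data and `LogisticRegression` model are stand-ins so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def report_holdout_metrics(model, X_eval, y_eval):
    """Hypothetical helper: score a fitted classifier on rows it was not fitted on."""
    labels = model.predict(X_eval)
    positive_probability = model.predict_proba(X_eval)[:, 1]
    return {
        "accuracy": accuracy_score(y_eval, labels),
        "roc_auc": roc_auc_score(y_eval, positive_probability),
    }


# Synthetic stand-in data; in the notebook these would be X and y
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Stand-in model; in the notebook this would be the fitted `pipeline`
stand_in_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

metrics = report_holdout_metrics(stand_in_model, X_te, y_te)
print(metrics)
```

In the notebook, the equivalent call would presumably be `report_holdout_metrics(pipeline, X_test, y_test)`, since `X_test` and `y_test` are created by `train_test_split` but not otherwise used.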


In [None]:
sub = pd.DataFrame()
sub["id"] = test_data['id']
sub["Exited"] = pipeline.predict_proba(test_data)[:,1]

sub.to_csv("submission.csv",header=True,index=False)
sub