**YBI FOUNDATION PROJECT-1**
**TITLE OF THE PROJECT**

*Customer Clustering and Churn Prediction in a Bank*

In [None]:
#The objective of this project is to cluster bank customers and predict customer churn using machine learning techniques.

In [None]:
#DATA SOURCE
#The dataset for this project is available on Kaggle: Bank Customer Churn Modeling.
# https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

In [None]:
#IMPORT LIBRARIES

import numpy as np
import scipy as sc
import sklearn as sk
import pandas as pd
import seaborn as sb
import random
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from matplotlib.ticker import MaxNLocator
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization

pd.options.mode.chained_assignment = None


In [None]:
#IMPORT DATA

# Loading the .csv
bank_data = pd.read_csv("/content/churn_dataset_Bank.csv")



In [None]:
#DATA PREPROCESSING
# Eliminating unnecessary attributes
bank_data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

# Renaming and encoding the 'Gender' column
bank_data.rename(columns={'Gender': 'IsMale'}, inplace=True)
bank_data['IsMale'] = bank_data['IsMale'].apply(lambda x: 1 if x == 'Male' else 0)

# Separating numerical and categorical variables
num_subset = bank_data.select_dtypes('number')
cat_subset = bank_data.select_dtypes('object')

# One-hot encoding categorical variables
cat_subset = pd.get_dummies(cat_subset)

# Saving a denormalized but organized version of the dataset
denorm_bank_data = pd.concat([cat_subset, num_subset], axis=1)

# Normalizing numerical variables
maxvals = num_subset.astype(float).max()
numericalColumns = {'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary'}
for col in numericalColumns:
    num_subset[col] = num_subset[col] / maxvals[col]
bank_data = pd.concat([cat_subset, num_subset], axis=1)

# Printing dataset types and displaying dataset head
print(bank_data.dtypes)
display(bank_data.head())


In [None]:
#Define Target Variable (Y) and Feature Variables (X):

# Defining X and Y
bankX = bank_data.iloc[:, :12]
bankY = bank_data.iloc[:, 12:13]

X = bankX.values  # numpy array (10000, 12)
Y = bankY.values  # numpy array (10000, 1)


In [None]:
#Train Test Split

iterations = 10  # number of independent train/evaluate runs
origDataModel = []  # will hold one model per iteration, each trained on the original (imbalanced) data

for i in range(iterations):

  print("Iteration Nº", i, ": \n")

  X_train, X_test, Y_train, Y_test = train_test_split(X, Y.ravel(), test_size=0.3)

  model = tf.keras.Sequential()

  model.add(Dense(256, activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(0.3))

  model.add(Dense(128,  activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(0.3))

  model.add(Dense(64, activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(0.3))

  model.add(Dense(32, activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(0.3))

  model.add(Dense(8,  activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(0.3))

  model.add(Dense(1, activation='sigmoid'))

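  # tf.train.AdamOptimizer is the TensorFlow 1.x API; TensorFlow 2.x uses tf.keras.optimizers.Adam
  model.compile(optimizer=tf.keras.optimizers.Adam(),
                loss='binary_crossentropy',
                metrics=['accuracy'])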

  model_logs = model.fit(X_train,
          Y_train,
          batch_size=32,
          epochs=120,
          verbose=0,  # silent mode
          validation_data=(X_test, Y_test))  # Only to check that the model is not "overfitting" the training data

  origDataModel.append(model)

  # No separate validation set is defined, so evaluate on the held-out test split

  score = model.evaluate(X_test, Y_test, verbose=0)  # tensorflow default threshold = 0.5
  print("Accuracy training with imbalanced data (default threshold=0.5): {:.2f} %".format(score[1] * 100))

  Y_pred = model.predict(X_test) > 0.5  # manual threshold
  matConf = confusion_matrix(Y_test, Y_pred)
  valsize = Y_test.shape[0]

  plt.figure(figsize=(6, 5))  # Establishing the heatmap size before plotting
  ax = sb.heatmap(matConf, annot=True, fmt=".0f")
  ax.set_ylabel('Original', fontsize=15)
  ax.set_xlabel('Predicted', fontsize=15)
  plt.title("Imbalanced data")
  plt.show()

In [None]:
#Data Visualization:

list_binary = [0, 1, 2, 4, 9, 10]
list_normal = [3, 5, 6, 7, 8, 11]
order = [0, 1, 2, 4, 9, 10, 8, 6, 5, 3, 7, 11]

fig = plt.figure(figsize=(20, 8))
for i in range(len(order)):
    xi = denorm_bank_data.values[:, order[i]]
    # Use integer division // to ensure an integer number of columns
    ax1 = fig.add_subplot(2, len(order) // 2, i + 1)
    plt.title(list(bank_data)[order[i]], fontsize=16)
    if i < 6:
        # Convert boolean values to numerical (0 and 1) for histogram
        plt.hist(xi.astype(int), 2)
        plt.xticks([0.25, 0.75], np.arange(0, 2, 1))
    else:
        # Cast to float: mixed bool/numeric columns can make .values an object array
        plt.hist(xi.astype(float), 100)
plt.suptitle('Distributions of the Variables', fontsize=30)
plt.show()

In [None]:
#Train-Test Split and Model Training:

iterations = 10
origDataModel = []

for i in range(iterations):
    print(f"Iteration Nº {i}: \n")

    # Convert X and Y to float32 before splitting
    X_train, X_test, Y_train, Y_test = train_test_split(X.astype(np.float32),
                                                        Y.ravel().astype(np.float32),
                                                        test_size=0.3)

    model = tf.keras.Sequential()

    model.add(Dense(256, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))

    model.add(Dense(128, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))

    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))

    model.add(Dense(32, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))

    model.add(Dense(8, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))

    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    model_logs = model.fit(X_train, Y_train, batch_size=32, epochs=120, verbose=0,
                           validation_data=(X_test, Y_test))

    origDataModel.append(model)

    # Pseudo-validate with common validation data
    score = model.evaluate(X_test, Y_test, verbose=0)
    print(f"Accuracy training with imbalanced data (default threshold=0.5): {score[1] * 100:.2f} %")

    Y_pred = model.predict(X_test) > 0.5
    matConf = confusion_matrix(Y_test, Y_pred)

    plt.figure(figsize=(6, 5))
    ax = sb.heatmap(matConf, annot=True, fmt=".0f")
    ax.set_ylabel('Original', fontsize=15)
    ax.set_xlabel('Predicted', fontsize=15)
    plt.title("Imbalanced data")
    plt.show()

In [None]:
# What is the proportion of customers for each nationality?

fig=plt.figure(figsize=(5,5))
# First, we will find how many customers each country has
Customers_France = bankX.Geography_France.sum()
Customers_Germany = bankX.Geography_Germany.sum()
Customers_Spain = bankX.Geography_Spain.sum()

# We label, color and plot our data
labels = ['Germany','Spain','France']
sizes = [Customers_Germany, Customers_Spain,Customers_France]
colors = ['lightcoral','gold', 'cadetblue']
plt.title('Nationality - Proportion', fontsize=20)
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140,textprops={'fontsize': 14})
plt.axis('equal')
plt.show()

In [None]:
# What is the GENDER Ratio:

fig=plt.figure(figsize=(5,5))
# Data to plot
Men = bankX.loc[bankX['IsMale']==1, 'IsMale'].count()
Women = bankX.loc[bankX['IsMale']==0, 'IsMale'].count()
labels = ['Men', 'Women']
sizes = [Men,Women]
colors = ['#5539cc','#cb416b']
plt.title('Gender - Proportion', fontsize=20)
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90,textprops={'fontsize': 14})
plt.axis('equal')
plt.show()

In [None]:
# Correlation between the variables

fig=plt.figure(figsize=(10,10))
CX=bank_data.corr()
# Use bool instead of np.bool
mask = np.zeros_like(CX, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# sb.heatmap draws on the current axes of fig, so no extra add_subplot call is needed
heat = sb.heatmap(CX, mask=mask, annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='RdBu_r')
plt.show()

### **Project Summary: Customer Clustering and Churn Prediction in a Bank**

#### **Objective:**
The main goal of this project is to segment bank customers and predict customer churn using machine learning techniques. By identifying patterns and predicting churn, banks can devise strategies to retain customers and improve their services.
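
The notebook imports `KMeans` and `silhouette_score`, but none of the cells shown use them. The following is a minimal sketch of how the segmentation step might look, run on synthetic data rather than the bank dataset. The blob centers, the range of `k`, and the names `X_demo`, `scores`, and `best_k` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled customer features: three well-separated groups.
centers = [[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Fit k-means for several candidate k and score each partition.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest mean silhouette is one common heuristic choice.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On the real data, `X_demo` would be replaced by the scaled feature matrix, and the silhouette values are likely to be far less clear-cut than on these artificial blobs.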

#### **Data Source:**
The dataset used for this project is available on Kaggle: [Bank Customer Churn Modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling).

#### **Libraries and Tools:**
- **Data Manipulation and Analysis:** NumPy, Pandas, SciPy
- **Visualization:** Matplotlib, Seaborn
- **Machine Learning:** Scikit-learn, TensorFlow, Keras
- **Miscellaneous:** Random, mpl_toolkits.mplot3d, Matplotlib patches

#### **Data Preprocessing:**
1. **Loading the Dataset:**
   - The dataset is loaded from a CSV file.
2. **Dropping Unnecessary Columns:**
   - Columns like `RowNumber`, `CustomerId`, and `Surname` are dropped.
3. **Renaming and Encoding:**
   - The `Gender` column is renamed to `IsMale` and encoded to binary (1 for Male, 0 for Female).
4. **Separating Numerical and Categorical Variables:**
   - Numerical variables and categorical variables are separated for further processing.
5. **One-Hot Encoding:**
   - Categorical variables are one-hot encoded.
6. **Normalization:**
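   - The six continuous or count columns (`CreditScore`, `Age`, `Tenure`, `Balance`, `NumOfProducts`, `EstimatedSalary`) are divided by their column maximum, so the largest value in each column becomes 1. The minimum is not mapped to 0, so this is max-scaling rather than min-max scaling.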
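
The encoding and scaling steps above can be sketched on a toy table. The row values below are hypothetical; only the column names follow the dataset. The sketch also shows min-max scaling for comparison, which is an alternative and not what the notebook does.

```python
import pandas as pd

# Toy rows with made-up values; only the column names follow the dataset.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "CreditScore": [600, 850, 425, 700],
    "Exited": [1, 0, 0, 1],
})

# Binary-encode gender under the new column name (1 = Male, 0 = otherwise).
toy["IsMale"] = (toy.pop("Gender") == "Male").astype(int)

# One-hot encode the remaining text column into Geography_* indicator columns.
encoded = pd.get_dummies(toy, columns=["Geography"])

# Max-scaling, as in the notebook: the column maximum becomes 1.
encoded["CreditScore"] = encoded["CreditScore"] / encoded["CreditScore"].max()

# Min-max scaling (alternative): maps the column onto exactly [0, 1].
cs = toy["CreditScore"]
min_max = (cs - cs.min()) / (cs.max() - cs.min())

print(encoded)
print(min_max.tolist())
```

With max-scaling the smallest credit score here stays at 0.5 rather than 0, which is the practical difference between the two approaches.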

#### **Data Visualization:**
- **Distribution of Variables:** Histograms are used to visualize the distribution of various numerical and categorical variables.
- **Nationality Proportion:** A pie chart is used to show the proportion of customers from Germany, Spain, and France.
- **Gender Ratio:** A pie chart displays the gender distribution among customers.
- **Correlation Matrix:** A heatmap is used to visualize the correlation between different variables in the dataset.

#### **Model Training:**
1. **Defining Features and Target Variable:**
   - The features (`X`) and the target variable (`Y`) are defined from the processed dataset.
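2. **Train-Test Split:**
   - The dataset is split 70/30 into training and test sets, with a new random split drawn in each of the 10 iterations.
3. **Model Architecture:**
   - A feed-forward neural network is built with TensorFlow's Keras API: five hidden `Dense` layers (256, 128, 64, 32, and 8 units, ReLU activation), each followed by `BatchNormalization` and `Dropout(0.3)`, and a single sigmoid output unit.
4. **Model Compilation and Training:**
   - The model is compiled with the Adam optimizer and binary cross-entropy loss, then trained for 120 epochs with a batch size of 32.
5. **Model Evaluation:**
   - The model's performance is evaluated using accuracy on the test split at a 0.5 decision threshold, and a confusion matrix is plotted for further insights.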
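
The notebook calls `train_test_split` without `stratify` or `random_state`, so the churn ratio can differ slightly between partitions and each run draws a different split. A minimal sketch of a stratified, reproducible split follows, assuming one wants the class ratio preserved. The synthetic arrays, the 20% positive rate, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 12 features, 20% positive labels (assumed ratio).
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 12)).astype(np.float32)
y_demo = np.zeros(1000, dtype=np.float32)
y_demo[:200] = 1.0

# stratify=y_demo keeps the class ratio the same in both partitions;
# random_state makes the split reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(float(y_tr.mean()), float(y_te.mean()))
```

Whether to fix the seed depends on the goal: a fixed seed helps reproducibility, while the notebook's repeated unseeded splits give a rough sense of run-to-run variance.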

#### **Key Insights:**
1. **Nationality Distribution:**
   - The majority of customers are from France, followed by Germany and Spain.
2. **Gender Distribution:**
   - The dataset contains more male customers compared to female customers.
3. **Correlation Analysis:**
   - The lower-triangle heatmap shows the pairwise Pearson correlation for every pair of columns, including each feature's correlation with the churn label, which helps identify strongly related features.

#### **Model Performance:**
- **Accuracy:** Test accuracy is printed after each of the 10 training runs. Because the notebook trains on the original, imbalanced data, accuracy alone can overstate how well churners are detected, so it should be read together with the confusion matrix.
- **Confusion Matrix:** The confusion matrix helps in understanding the model's performance in terms of true positives, true negatives, false positives, and false negatives.
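
As an illustration of how the four counts can be read out of the matrix, here is a small sketch using `confusion_matrix` from scikit-learn, as in the notebook. The labels and probabilities are hypothetical values chosen for the example, not model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and sigmoid outputs, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.7, 0.2, 0.9, 0.3, 0.6, 0.05])

# Apply the same 0.5 threshold the notebook uses.
y_hat = (y_prob > 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

In a churn setting, the false negatives (churners predicted to stay) are usually the count of most interest.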

#### **Conclusion:**
This project demonstrates how to preprocess a dataset, visualize key insights, and build a machine learning model to predict customer churn. By understanding customer segments and predicting churn, banks can implement targeted strategies to retain customers and enhance their overall experience. Further evaluation metrics such as precision, recall, F1 score, and ROC-AUC would give deeper insight into the model's performance.
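
The additional metrics named above are available in `sklearn.metrics`. A minimal sketch follows, using hypothetical labels and scores rather than the notebook's predictions. Note that `roc_auc_score` takes the raw probabilities, while the other three take thresholded predictions.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical true labels and predicted churn probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.7, 0.2, 0.9, 0.3, 0.6, 0.05])
y_hat = (y_prob > 0.5).astype(int)  # thresholded predictions

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)         # threshold-independent ranking quality

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```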