# **Bank Customer Exit Predictor (CI PP-5)** 

# **Exited Customer Data Analysis**

## Objectives

* Answer business requirement 1:
  * The bank wants to identify, from the available data, the customer attributes that are most strongly correlated with customer exit.

## Inputs

* outputs/datasets/collection/BankCustomerData.csv

## Outputs

* Code that answers business requirement 1 and supports building the Streamlit app


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from its current folder to the parent folder.


1. We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

2. We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You have set a new current directory")

3. Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
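The parent-folder lookup above can also be sketched with `pathlib`. This is only an illustration: the helper name `parent_dir` is hypothetical and not part of the notebook, and the snippet computes the parent folder without changing the working directory.

```python
import os
from pathlib import Path


def parent_dir(path):
    """Return the parent folder of `path` as a string."""
    return str(Path(path).parent)


current_dir = os.getcwd()
new_dir = parent_dir(current_dir)

# Path.parent and os.path.dirname agree for the path returned by os.getcwd()
print(new_dir == os.path.dirname(current_dir))
```

Note that re-running the `os.chdir` cell moves one level further up each time, so the confirmation in step 3 is worth checking after every kernel restart.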

---

# Load Data

* We load the dataset from the outputs folder and drop the variables CustomerId, Surname and RowNumber, as they are only identifiers and don't affect the exit study.

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/BankCustomerData.csv")
    .drop(['CustomerId','Surname','RowNumber'], axis=1)
    )
df.head(3)


---

# Data Exploration

We create a profile report of the dataset to examine variable types and distributions and to check the level of missing data. We also try to understand the relevance of these variables in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

# Correlation Study

* We use OneHotEncoder to convert the categorical variables 'Gender' and 'Geography' into numeric columns.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=False)
df_enc = encoder.fit_transform(df)
print(df_enc.shape)
df_enc.head(5)
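To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up three-row frame. It uses `pandas.get_dummies` as a stand-in for the `feature_engine` encoder so that it runs with pandas alone; the rows are illustrative and are not taken from the bank dataset.

```python
import pandas as pd

# Illustrative rows only, not taken from BankCustomerData.csv
toy = pd.DataFrame({
    "Age": [42, 35, 51],
    "Gender": ["Female", "Male", "Female"],
    "Geography": ["France", "Germany", "Spain"],
})

# Columns to encode, listed explicitly for this sketch
categorical = ["Gender", "Geography"]

# One 0/1 column per category; no category is dropped
toy_enc = pd.get_dummies(toy, columns=categorical, dtype=int)
print(toy_enc.columns.to_list())
```

Each categorical variable becomes one 0/1 column per category, which corresponds to keeping every category with `drop_last=False` in the notebook's encoder.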

We use the .corr() method from pandas with the 'spearman' and 'pearson' methods to identify the top correlations.
* We sort the values by their absolute value, which is done by setting key=abs.
* After sorting, the first item is the correlation of Exited with itself, so we exclude it using [1:].

In [None]:
corr_spearman = df_enc.corr(method='spearman')['Exited'].sort_values(key=abs, ascending=False)[1:]
corr_spearman
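The sorting step can be illustrated on a small hand-written Series. The correlation values below are made up for the example. The sketch also shows a label-based alternative, `drop("Exited")`, which does not rely on the target landing in the first position; this is a suggestion, not what the notebook does.

```python
import pandas as pd

# Made-up correlation values, for illustration only
corr = pd.Series({"Tenure": -0.01, "Exited": 1.0, "Age": 0.30, "IsActiveMember": -0.15})

# Positional approach: sort by magnitude, then skip the first row
by_position = corr.sort_values(key=abs, ascending=False)[1:]

# Label-based alternative: remove the target explicitly, then sort
by_label = corr.drop("Exited").sort_values(key=abs, ascending=False)

print(by_label.index.to_list())
```

Because `key=abs` ranks by magnitude, a negative correlation such as the illustrative `IsActiveMember` value is placed ahead of a weaker positive one.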

* Similarly for Pearson

In [None]:
corr_pearson = df_enc.corr(method='pearson')['Exited'].sort_values(key=abs, ascending=False)[1:]
corr_pearson
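As a reminder of why both methods are worth running: Pearson measures linear association, while Spearman measures monotonic association by correlating ranks. A minimal sketch on synthetic data (not the bank dataset) shows the two can differ:

```python
import pandas as pd

# Synthetic data: y always grows with x, but not along a straight line
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 2

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson, 3), round(spearman, 3))
```

Here Spearman reaches 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson stays slightly below 1 because the relationship is curved.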

Based on both the Spearman and Pearson methods, we notice only weak levels of correlation between Exited and the other variables.
* Ideally we would consider variables with a strong correlation level; however, this is not always possible.

We now collect the variable names from both rankings into a single set.

In [None]:
set(corr_pearson.index.to_list() + corr_spearman.index.to_list())

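We now consider the 7 variables with the highest correlation levels. The one-hot encoded columns are listed under their original variable names, Gender and Geography.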

In [None]:
imp_vars = ['Age','Balance','CreditScore','EstimatedSalary','Gender','Geography','NumOfProducts']
imp_vars 
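A list like this could also be derived programmatically instead of being typed by hand. The sketch below is one possible approach under stated assumptions: the correlation values are made up, the cut-off `n` is hypothetical, and the helpers `top_n` and `original_name` are illustrative names that do not exist in the notebook. It assumes encoded columns are named `<variable>_<category>`.

```python
import pandas as pd

# Made-up correlation values, for illustration only
corr_a = pd.Series({"Age": 0.32, "Geography_Germany": 0.17, "Gender_Female": 0.11, "Tenure": -0.01})
corr_b = pd.Series({"Age": 0.29, "NumOfProducts": -0.13, "Geography_Germany": 0.17, "Tenure": -0.02})


def top_n(corr, n):
    """Names of the n entries with the largest absolute correlation."""
    return corr.sort_values(key=abs, ascending=False).head(n).index.to_list()


def original_name(column, categorical=("Gender", "Geography")):
    """Map an encoded column such as 'Geography_Germany' back to 'Geography'."""
    for var in categorical:
        if column.startswith(var + "_"):
            return var
    return column


n = 3  # hypothetical cut-off
candidates = sorted({original_name(c) for c in top_n(corr_a, n) + top_n(corr_b, n)})
print(candidates)
```

Taking the union of both top-n lists keeps a variable if either method ranks it highly, and mapping back to the original names lets the result be used directly against the unencoded `df`.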

# Exploratory Data Analysis (EDA) Of Chosen Variables

In [None]:
df_imp = df.filter(imp_vars + ['Exited'])
df_imp.head(3)

# Variables Distribution by Exited
We plot the distribution of the selected variables by Exited using custom plots (these custom plots were obtained from Code Institute's walkthrough project).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):
    # Bar chart of category counts, split by the target variable
    plt.figure(figsize=(8, 3))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    # Histogram with a KDE curve, split by the target variable
    plt.figure(figsize=(6, 3))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'Exited'
for col in imp_vars:
    # Text columns get a count plot; numeric columns get a histogram
    if df_imp[col].dtype == 'object':
        plot_categorical(df_imp, col, target_var)
    else:
        plot_numerical(df_imp, col, target_var)
    print("\n\n")

# Conclusion and Next Steps

## Conclusions:

From the plots above we can notice that:
* The average age of customers who exited is 45 years, while for those who didn't exit it is 35 years.
* Customers having more than one product tend to exit less.
* Customers from Germany tend to exit more than those from France and Spain.
* Customers who exited usually have credit scores in the range of 600 to 675, whereas customers who didn't exit tend to have credit scores in the range of 625 to 700.
* Customers who exited don't belong to any specific salary range.
* Customers with lower account balances tend to exit less compared to customers with higher account balances.
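
Observations read off plots can be cross-checked numerically with a `groupby`. The sketch below uses made-up rows, not the bank dataset, so its numbers say nothing about the real customers; it only shows the pattern that could be applied to `df_imp`.

```python
import pandas as pd

# Illustrative rows only, not taken from BankCustomerData.csv
toy = pd.DataFrame({
    "Age": [30, 38, 52, 44, 29, 47],
    "Geography": ["France", "Spain", "Germany", "Germany", "France", "Germany"],
    "Exited": [0, 0, 1, 1, 0, 0],
})

# Mean age for customers who stayed (0) and who exited (1)
mean_age = toy.groupby("Exited")["Age"].mean()

# Share of exited customers per country (mean of a 0/1 column)
exit_rate = toy.groupby("Geography")["Exited"].mean()

print(mean_age)
print(exit_rate)
```

Since Exited is coded as 0/1, its mean within a group is the exit rate of that group, which gives a direct figure to set beside each count plot.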

## Next Steps:

* Data Cleaning