## Bank Churn Prediction: Practical Insights into Preprocessing, Model Building, and Predictive Modeling
In banking, one of the biggest challenges is keeping customers satisfied so that they stay with the bank. Retaining an existing customer is generally cheaper than acquiring a new one. Still, some customers decide to leave, and banks want to know why. That's where this project comes in: we use machine learning to predict which customers are likely to leave.

We start by looking at the data the bank has collected over time. We then clean it so that the information is accurate and ready for analysis, and we identify the columns that matter most. After that, we use feature engineering to keep the most informative parts of the data. This helps us uncover patterns and trends that may indicate whether a customer will leave.

Once the data is prepared, we build machine learning models using simple but effective methods to predict which customers might leave the bank. We also test these models on separate, held-out data to evaluate how well they perform. By doing all this, we hope to give banks the tools they need to keep their customers satisfied and loyal over the long term.

### Reading and Analyzing Bank Customer Churn Data

In [16]:
import pandas as pd

df = pd.read_csv('churn.csv')
df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,7496,15589541,Sutherland,557,France,Female,27,2,0.00,2,0,1,4497.55,0
7496,7497,15608804,Allan,824,Germany,Male,49,8,133231.48,1,1,1,67885.37,0
7497,7498,15645820,Folliero,698,France,Male,27,7,0.00,2,1,0,111471.55,0
7498,7499,15659031,Mordvinova,630,France,Female,36,8,126598.99,2,1,1,134407.93,0


### Counting Null Values in the Data

In [17]:
null_values = df.isnull().sum()
null_values

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

### Counting Duplicates
The dataset has no null values. Next, we check for duplicate rows, because duplicates would give some records extra weight and reduce the accuracy of the analysis.

In [18]:
duplicates = df.duplicated().sum()
duplicates

0
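
If 'RowNumber' and 'CustomerId' are unique per row (as their names suggest), a full-row duplicate check will report zero even when the same customer details appear twice. A minimal sketch of checking for duplicates while ignoring the identifier columns, on a small made-up frame (the values are illustrative, not taken from churn.csv):

```python
import pandas as pd

# Toy data for illustration only; not taken from churn.csv
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [101, 102, 103],
    "CreditScore": [619, 619, 502],
    "Age": [42, 42, 39],
})

# Full-row check: the unique identifiers make every row distinct
full_row_dupes = int(toy.duplicated().sum())

# Check only the feature columns, ignoring the identifiers
feature_cols = [c for c in toy.columns if c not in ("RowNumber", "CustomerId")]
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```

Whether two rows with identical features are true duplicates or simply two similar customers is a judgment call that depends on the data.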

### Exited Customer Distribution Analysis
Next, we check the distribution of the 'Exited' column, which is the target we want to predict.

In [19]:
values = df['Exited'].value_counts()
values

0    5954
1    1546
Name: Exited, dtype: int64
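
The raw counts are easier to interpret as percentages. A small sketch that starts from the two counts printed above; on the full DataFrame, `df['Exited'].value_counts(normalize=True)` should give the same proportions directly:

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 5954, 1: 1546}, name="Exited")

# Share of each class, as a percentage of all customers
percentages = (counts / counts.sum() * 100).round(2)
print(percentages)

# Ratio of retained customers to exited customers
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(round(imbalance_ratio, 2))
```

Roughly one customer in five exited, so the classes are imbalanced.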

### Removing Unnecessary Columns
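Now that we know the distribution, we drop the columns that carry no predictive information: 'RowNumber', 'CustomerId', and 'Surname'. Removing these identifiers keeps the dataset focused on features that can help predict churn.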

In [20]:
df.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
7495,557,France,Female,27,2,0.00,2,0,1,4497.55,0
7496,824,Germany,Male,49,8,133231.48,1,1,1,67885.37,0
7497,698,France,Male,27,7,0.00,2,1,0,111471.55,0
7498,630,France,Female,36,8,126598.99,2,1,1,134407.93,0


### Categorizing Dataset Columns
Next, we group the feature columns into numerical and nominal types so that each group can be explored and preprocessed in the way that suits it. Note that the discrete columns 'NumOfProducts' and 'Tenure' are treated as nominal here.

In [21]:
numbCol = ['EstimatedSalary', 'Balance', 'CreditScore', 'Age']
nomCol = ['HasCrCard', 'IsActiveMember', 'Geography', 'Gender', 'NumOfProducts', 'Tenure']

print(numbCol)
print(nomCol)

['EstimatedSalary', 'Balance', 'CreditScore', 'Age']
['HasCrCard', 'IsActiveMember', 'Geography', 'Gender', 'NumOfProducts', 'Tenure']


### Outlier Detection Analysis
With the columns categorized, we look for outliers in the numerical columns. We flag any value whose Z-score has an absolute value greater than 3, that is, any value more than three standard deviations from its column mean.

In [22]:
from scipy import stats

# Define the threshold for outlier detection
threshold = 3

# Create a dictionary to store the count of outliers for each column
outlier_counts = {}

# Iterate over each column in the list numbCol containing the names of numerical columns
for column in numbCol:
    # Calculate the Z-scores for the values in the current column
    z_scores = stats.zscore(df[column])
    
    # Identify outliers by comparing the absolute Z-scores against the defined threshold
    outliers = (abs(z_scores) > threshold)
    
    # Count the number of outliers for the current column
    outlier_count = int(outliers.sum())
    
    # Store the count of outliers for the current column in the outlier_counts dictionary
    outlier_counts[column] = outlier_count


outlier_counts

{'EstimatedSalary': 0, 'Balance': 0, 'CreditScore': 5, 'Age': 99}
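
To make the Z-score rule concrete, here is a minimal sketch that computes the scores by hand with NumPy on made-up values. It is an illustration of the formula, not a replacement for the cell above:

```python
import numpy as np

# Toy values for illustration only; the last one is deliberately extreme
values = np.array([40.0] * 19 + [95.0])

# Z-score: distance from the mean in units of standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also the default used by scipy.stats.zscore.
z_scores = (values - values.mean()) / values.std()

threshold = 3
outlier_mask = np.abs(z_scores) > threshold
print(int(outlier_mask.sum()))
```

Only the extreme value exceeds the threshold; the other values sit close to the mean.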

### Age Data Integrity Check
The Z-score test flagged values in two columns: 5 in 'CreditScore' and 99 in 'Age'. Before dropping them, we should confirm that they are genuine errors rather than valid but unusual values. A customer's age can legitimately lie far from the mean, so we set plausible limits (1 to 100 years) and count the values that fall outside them. This check helps maintain data integrity and reliability in subsequent analyses.

In [23]:
# Count the rows whose "Age" value is greater than 100 or less than 1
ages_check = len(df[(df['Age'] > 100) | (df['Age'] < 1)])

ages_check

0

### Credit Score Data Integrity Check
No 'Age' value falls outside the plausible range, so the flagged ages are not real outliers. Next, we check the credit score. Domain knowledge helps here: credit scores typically range from 300 to 900, depending on the scoring system (FICO scores, for example, range from 300 to 850).

In [24]:
credit_check = len(df[(df['CreditScore'] > 900) | (df['CreditScore'] < 300)])
credit_check

0
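
The two range checks follow the same pattern, so they can be written once. A minimal sketch using `Series.between` on made-up values; the `limits` dictionary is a hypothetical helper introduced for this example:

```python
import pandas as pd

# Toy data for illustration only; not taken from churn.csv
toy = pd.DataFrame({
    "Age": [42, 27, 130, 0],
    "CreditScore": [619, 850, 250, 700],
})

# Plausible (inclusive) limits per column, mirroring the checks above
limits = {"Age": (1, 100), "CreditScore": (300, 900)}

# Series.between is inclusive on both ends by default
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in limits.items()
}
print(out_of_range)
```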

### Converting Estimated Salary and Balance to Integer Data Type
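The credit score column has no out-of-range values either. Next, we convert the 'EstimatedSalary' and 'Balance' columns from float to integer. Note that `astype(int)` truncates the fractional part rather than rounding it (for example, 101348.88 becomes 101348).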

In [25]:
df['EstimatedSalary'] = df['EstimatedSalary'].astype(int)

df['Balance'] = df['Balance'].astype(int)

df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0,1,1,1,101348,1
1,608,Spain,Female,41,1,83807,1,0,1,112542,0
2,502,France,Female,42,8,159660,3,1,0,113931,1
3,699,France,Female,39,1,0,2,0,0,93826,0
4,850,Spain,Female,43,2,125510,1,1,1,79084,0
...,...,...,...,...,...,...,...,...,...,...,...
7495,557,France,Female,27,2,0,2,0,1,4497,0
7496,824,Germany,Male,49,8,133231,1,1,1,67885,0
7497,698,France,Male,27,7,0,2,1,0,111471,0
7498,630,France,Female,36,8,126598,2,1,1,134407,0
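
The conversion above drops the fractional part instead of rounding it. If rounding to the nearest integer is preferred, one option is to call `round()` first. A small sketch comparing the two on values from the tables above:

```python
import pandas as pd

# One EstimatedSalary value and one Balance value from the tables above
amounts = pd.Series([101348.88, 83807.86])

truncated = amounts.astype(int)        # drops the fractional part
rounded = amounts.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

For amounts of this size the difference is at most one unit, so either choice is unlikely to affect the model much.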


### Label Encoding for Categorical Data
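Next, we convert the text categories into numbers. LabelEncoder assigns integer codes in sorted order, so 'Geography' becomes France = 0, Germany = 1, Spain = 2, and 'Gender' becomes Female = 0, Male = 1.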

In [26]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

lstforle = ['Geography', 'Gender']
for feature in lstforle:
    df[feature] = le.fit_transform(df[feature])

df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,0,0,42,2,0,1,1,1,101348,1
1,608,2,0,41,1,83807,1,0,1,112542,0
2,502,0,0,42,8,159660,3,1,0,113931,1
3,699,0,0,39,1,0,2,0,0,93826,0
4,850,2,0,43,2,125510,1,1,1,79084,0
...,...,...,...,...,...,...,...,...,...,...,...
7495,557,0,0,27,2,0,2,0,1,4497,0
7496,824,1,1,49,8,133231,1,1,1,67885,0
7497,698,0,1,27,7,0,2,1,0,111471,0
7498,630,0,0,36,8,126598,2,1,1,134407,0
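
The loop above reuses a single `le` object, so after it finishes the encoder only remembers the 'Gender' mapping. A possible variation, sketched here on made-up rows, keeps one encoder per column so that each mapping can be inspected or reversed later. The `encoders` and `mappings` names are introduced for this example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only, using the category names from the dataset
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One encoder per column, so each learned mapping can be looked up later
encoders = {}
for feature in ["Geography", "Gender"]:
    encoders[feature] = LabelEncoder()
    toy[feature] = encoders[feature].fit_transform(toy[feature])

# classes_ lists the original labels in the order of their integer codes
mappings = {
    feature: {str(label): code for code, label in enumerate(enc.classes_)}
    for feature, enc in encoders.items()
}
print(mappings)
```

Keeping the encoders also makes it possible to call `inverse_transform` when predictions need to be reported with the original labels.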


### Percentage of Churned Customers based on Credit Card Ownership
Next, we check how the 'HasCrCard' column relates to the 'Exited' column. This helps us understand whether holding a credit card is associated with customer retention.

In [27]:
credit_churn_percentage = df.groupby('HasCrCard')['Exited'].mean() * 100
credit_churn_percentage

HasCrCard
0    21.246078
1    20.345417
Name: Exited, dtype: float64
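
A gap of about one percentage point is small, but a statistical test can help judge whether it is more than noise. A rough sketch using a chi-square test of independence on made-up counts (the toy frame below only imitates similar churn rates; it is not the real data):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only; not taken from churn.csv
toy = pd.DataFrame({
    "HasCrCard": [0] * 100 + [1] * 200,
    "Exited": [1] * 21 + [0] * 79 + [1] * 41 + [0] * 159,
})

# Contingency table of counts: rows are HasCrCard, columns are Exited
table = pd.crosstab(toy["HasCrCard"], toy["Exited"])

# Chi-square test of independence between the two columns
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")
```

A large p-value means the data give no evidence of an association between the two columns. On the real dataset, the same call could be applied to `pd.crosstab(df['HasCrCard'], df['Exited'])`.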

### Removing Credit Card Ownership Column
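The churn rates are nearly identical for customers without a credit card (about 21.2%) and with one (about 20.3%), so credit card ownership appears to have little association with churn. We therefore drop the column.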

In [28]:
df.drop(columns=['HasCrCard'], inplace=True)
df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,IsActiveMember,EstimatedSalary,Exited
0,619,0,0,42,2,0,1,1,101348,1
1,608,2,0,41,1,83807,1,1,112542,0
2,502,0,0,42,8,159660,3,0,113931,1
3,699,0,0,39,1,0,2,0,93826,0
4,850,2,0,43,2,125510,1,1,79084,0
...,...,...,...,...,...,...,...,...,...,...
7495,557,0,0,27,2,0,2,1,4497,0
7496,824,1,1,49,8,133231,1,1,67885,0
7497,698,0,1,27,7,0,2,0,111471,0
7498,630,0,0,36,8,126598,2,1,134407,0


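### Standardizing Numerical Data with StandardScaler
The numerical columns are on very different scales, so we standardize each of them to zero mean and unit variance.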

In [29]:
from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler
sc = StandardScaler()

# Fit StandardScaler on the numerical columns of the full DataFrame
# (no train/test split has been made at this point)
sc.fit(df[numbCol])

# Replace the numerical columns with their standardized values
df[numbCol] = sc.transform(df[numbCol])
df

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,IsActiveMember,EstimatedSalary,Exited
0,-0.319190,0,0,0.295670,2,-1.230265,1,1,0.018331,1
1,-0.432858,2,0,0.200198,1,0.111903,1,1,0.213069,0
2,-1.528202,0,0,0.295670,8,1.326687,3,0,0.237233,1
3,0.507485,0,0,0.009254,1,-1.230265,2,0,-0.112526,0
4,2.067833,2,0,0.391141,2,0.779776,1,1,-0.368987,0
...,...,...,...,...,...,...,...,...,...,...
7495,-0.959863,0,0,-1.136407,2,-1.230265,2,1,-1.666549,0
7496,1.799164,1,1,0.963972,8,0.903427,1,1,-0.563812,0
7497,0.497151,0,1,-1.136407,7,-1.230265,2,0,0.194437,0
7498,-0.205522,0,0,-0.277161,8,0.797200,2,1,0.593446,0
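
The scaler above is fitted on every row, so the test rows created later also influence the mean and standard deviation. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. This is one possible ordering, not the one used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train)

# Apply the same training statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean zero, while the test columns are only close to zero because they are scaled with statistics they did not contribute to.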


### Splitting Data into Features and Target Variable
With preprocessing done, we separate the predictor columns from the target column, 'Exited'.

In [30]:
x1 = df.drop(columns=['Exited'])
y1 = df['Exited']

print("Shape of x1 (features):", x1.shape)
print("Shape of y1 (target variable):", y1.shape)

Shape of x1 (features): (7500, 9)
Shape of y1 (target variable): (7500,)


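### Oversampling Minority Class with SMOTE
The target classes are imbalanced (5954 retained customers versus 1546 who exited), so we use SMOTE to synthesize minority-class samples until both classes have 5954 rows.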

In [34]:
from imblearn.over_sampling import SMOTE

# Create an instance of SMOTE
over = SMOTE(sampling_strategy=1)

# Apply SMOTE to oversample the minority class
x1_resampled, y1_resampled = over.fit_resample(x1, y1)
x1_resampled = x1_resampled.values
y1_resampled = y1_resampled.values

print("Shape of x1_resampled (resampled features):", x1_resampled.shape)
print("Shape of y1_resampled (resampled target variable):", y1_resampled.shape)

Shape of x1_resampled (resampled features): (11908, 9)
Shape of y1_resampled (resampled target variable): (11908,)
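
The resampled size follows from the class counts: with `sampling_strategy=1` the minority class is grown to match the majority, giving 2 × 5954 = 11908 rows. The core idea behind SMOTE is to place each synthetic sample on the line segment between a minority point and one of its nearest minority neighbours. A simplified NumPy sketch of that single step follows; it is not imblearn's implementation, which picks randomly among the k nearest neighbours (five by default):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points for illustration only
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])

# Pick one minority point and find its nearest minority neighbour
i = 0
distances = np.linalg.norm(minority - minority[i], axis=1)
distances[i] = np.inf  # exclude the point itself
neighbour = minority[np.argmin(distances)]

# The synthetic sample lies at a random position on the line segment
# between the chosen point and its neighbour
lam = rng.uniform(0.0, 1.0)
synthetic = minority[i] + lam * (neighbour - minority[i])
print(synthetic)
```

Because the interpolation treats every column as continuous, integer-coded categories such as 'Geography' can receive fractional values in the synthetic rows.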


### Splitting Oversampled Data for Training and Testing
Next, we split the oversampled data into a training set (80%) and a test set (20%).

In [35]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the oversampled rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x1_resampled, y1_resampled, test_size=0.2, random_state=42)

print("Shape of x_train (training features):", x_train.shape)
print("Shape of x_test (testing features):", x_test.shape)
print("Shape of y_train (training target):", y_train.shape)
print("Shape of y_test (testing target):", y_test.shape)

Shape of x_train (training features): (9526, 9)
Shape of x_test (testing features): (2382, 9)
Shape of y_train (training target): (9526,)
Shape of y_test (testing target): (2382,)
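
A plain random split does not guarantee that both subsets keep the same class balance. If that matters, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic balanced data, shown as an option rather than what the cell above does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic balanced data for illustration only
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# stratify=y keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With stratification, both subsets keep the 50/50 class balance of the full toy dataset.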
