# Business problem: 
 
‘ABC’ bank operates in several countries. The bank is losing revenue to customer churn, so it intends to design and launch a special incentives and rewards program to retain customers in the country with the highest churn, thereby reducing overall revenue losses.

The bank also wants to be able to predict which customers are going to churn so that some standard set of actions can be implemented to retain customers.

    

# Project objectives:

1. To determine the country with the highest churn
2. To identify the characteristics of churned customers in this country using several visualizations
3. To build an ML model that predicts a customer’s propensity to churn based on various features


In [5]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns

import os


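# reading the Bank.csv dataset
f1 = pd.read_csv("Bank.csv")
f1.info()  # info() prints its summary directly and returns None, so wrapping it in print() only adds a stray "None"
f1.head(10)
# the 'Bank.csv' dataset contains 10,000 rows and 12 columns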


In [6]:
f1.describe(include ='object')


# Data Cleaning and Preparation

In [None]:
#checking number of missing values in each column of the dataset
f1.isnull().sum() 

#The dataset appears to have 84 null values in the 'age' column


In [None]:
#checking the mean value of age
age_mean = f1['age'].mean()

#replacing null values in 'age' with the mean age
#(assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
f1['age'] = f1['age'].fillna(age_mean)
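
The mean imputation above can be checked on a small example. This is a minimal sketch on made-up values (the `toy_ages` frame is hypothetical and not taken from `Bank.csv`); it only illustrates that `mean()` skips missing values and that assigning the result of `fillna` back to the column removes them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the 'age' column
toy_ages = pd.DataFrame({"age": [30.0, np.nan, 50.0, 40.0, np.nan]})

# mean() ignores NaN by default, so this is (30 + 50 + 40) / 3
toy_mean = toy_ages["age"].mean()

# Assign the filled column back rather than relying on inplace=True
toy_ages["age"] = toy_ages["age"].fillna(toy_mean)

print(toy_mean)
print(toy_ages["age"].isnull().sum())
```

Mean imputation keeps the column average unchanged but shrinks its variance; whether that matters here depends on how the 84 missing ages are distributed, which this notebook does not examine.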

In [None]:
#rechecking if dataset has null values after null value treatment
print(f1.isnull().sum())  # printed explicitly: only the last expression of a cell is displayed automatically
print(f1)

#dropping columns which are unnecessary for exploration
f11 = f1.drop('customer_id', axis=1)

#converting float type numerical variables to int type
f2 = f11.copy()
f2['balance'] = f2['balance'].astype(int)
f2['estimated_salary'] = f2['estimated_salary'].astype(int)
f2.head()


In [None]:
#creating one categorical variable for each of the numeric variables: age, tenure, balance, estimated salary and credit score
#these binned variables make the visualisations easier to read
age_bins = [0, 20, 30, 40, 50, 60, 70, 80, 100]
tenure_bins = [0, 2, 4, 6, 8, 10]
balance_bins = [0, 50000, 100000, 150000, 200000, 250000, 300000]
salary_bins = [0, 50000, 100000, 150000, 200000]
age_labels = ['<20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80+']
tenure_labels = ['0-2', '2-4', '4-6', '6-8', '8-10']
balance_labels = ['<$50k', '$50k-100k', '$100k-150k', '$150k-200k', '$200k-250k', '>$250K']
salary_labels = ['<$50K', '$50K-$100K', '$100K-$150K', '>$150K']
credit_bins = [300, 579, 669, 739, 799, 850]
credit_labels = ['Poor', 'Fair', 'Good', 'Very Good', 'Excellent']

f2['credit_score_range'] = pd.cut(f2['credit_score'], bins= credit_bins, labels= credit_labels, include_lowest= True)
f2['age_range'] = pd.cut(f2['age'], bins= age_bins, labels= age_labels)
f2['tenure_range'] = pd.cut(f2['tenure'], bins= tenure_bins, labels= tenure_labels, include_lowest= True)
f2['balance_range'] = pd.cut(f2['balance'], bins= balance_bins, labels= balance_labels, include_lowest= True)
f2['salary_range'] = pd.cut(f2['estimated_salary'], bins= salary_bins, labels= salary_labels)
f2.head()
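
A note on the bin edges used above: by default `pd.cut` builds intervals that are open on the left, so a value equal to the first edge is left unbinned unless `include_lowest=True` is passed. The sketch below uses made-up salary values (not rows from the dataset) with the same salary edges and labels to show the difference. Whether the data actually contains values exactly on the lowest edge is not checked here.

```python
import pandas as pd

# Hypothetical values chosen to sit on or next to the bin edges
edge_values = pd.Series([0, 50000, 50001, 200000])
toy_bins = [0, 50000, 100000, 150000, 200000]
toy_labels = ['<$50K', '$50K-$100K', '$100K-$150K', '>$150K']

# Default: intervals are (0, 50000], (50000, 100000], ... so 0 falls outside
default_cut = pd.cut(edge_values, bins=toy_bins, labels=toy_labels)

# include_lowest=True closes the first interval on the left: [0, 50000]
inclusive_cut = pd.cut(edge_values, bins=toy_bins, labels=toy_labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

In the cell above, `age_range` and `salary_range` are built without `include_lowest`, so any row with a value of exactly 0 in those columns would get a missing category.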

# Basic Exploratory Data Analysis to understand distribution of features 

In [None]:
##Distribution of numeric features

import matplotlib.pyplot as plt
f2.hist(figsize=(14,14))

plt.show() 


In [None]:
#Distribution of categorical features 

# Bar plot for "Gender"
plt.figure(figsize=(4,4))
f2['gender'].value_counts().plot.bar(color=['b', 'r'])
plt.ylabel('Count')
plt.xlabel('Gender')
plt.xticks(rotation=0)
plt.show()


In [None]:
# Bar plot for "country"
plt.figure(figsize=(6,4))
f2['country'].value_counts().plot.bar(color=['b', 'g', 'r'])
plt.ylabel('Count')
plt.xlabel('Geography')
plt.xticks(rotation=0)
plt.show()


## Insights from basic distribution analysis 

### There is a high number of customers who:

1. have a credit score between 650 and 700
2. are aged 30-40
3. have a tenure of 9-10 years
4. have a balance of 0
5. have purchased only 1 product from the bank
6. have a credit card from the bank
7. have an active account, i.e. they have made at least one transaction in the last month

- Approximately 20% of total customers have churned 

- France has the highest number of customers

- The customer base has more females than males 


## EDA to understand distribution of churned customers

- 20% of all the bank's customers have exited. We look at how these customers are distributed across the categorical variables 'country' and 'gender'.

### 1. Which country has the most churned customers?
- Germany

In [None]:
#keeping only churned customers (i.e. dropping all rows for retained customers)
f8 = f2[f2['churn'] == 1].copy()
s3 = px.pie(f8, values='churn', names='country', title='Churn by Country')
s3.update_traces(textposition='inside', textinfo='percent+label')
s3.show()

#Hence, the number of customers who exited the bank was highest in Germany
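
The pie chart compares the number of churned customers per country. A count and a rate can rank countries differently when customer bases differ in size, so it may help to compute both. The sketch below uses a made-up ten-row frame (`toy_customers` is hypothetical); on the real data the same `groupby` would be applied to `f2`, which still contains retained customers.

```python
import pandas as pd

# Hypothetical toy data: 1 = churned, 0 = retained
toy_customers = pd.DataFrame({
    "country": ["France"] * 5 + ["Germany"] * 3 + ["Spain"] * 2,
    "churn":   [1, 1, 0, 0, 0,   1, 1, 0,   0, 1],
})

# Per country: number of churned customers, total customers, and churn rate
churn_summary = toy_customers.groupby("country")["churn"].agg(
    churned="sum", customers="count", churn_rate="mean"
)
print(churn_summary)
```

In this toy example France and Germany have the same number of churned customers, yet Germany's churn rate is higher because it has fewer customers overall.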


### Who churns more: men or women?
- More female customers than male customers have churned in all three countries

  Additional information from graph:
  - Although Germany has the highest churn %, it has fewer churned female customers than France.
  - Germany has the highest number of churned male customers overall


In [None]:
aa3=f8.copy()
aa3 = aa3[aa3['churn'] == 1]

# group the data by gender and country
gender_country = aa3.groupby(['gender', 'country']).size().reset_index(name='Count')
male_counts = gender_country[gender_country['gender'] == 'Male']['Count'].tolist()
female_counts = gender_country[gender_country['gender'] == 'Female']['Count'].tolist()
countries = sorted(set(gender_country['country']))

# create the bar chart
bar_width = 0.35
x = np.arange(len(countries))
fig, ax = plt.subplots()
male_bars = ax.bar(x - bar_width/2, male_counts, bar_width, label='Male')
female_bars = ax.bar(x + bar_width/2, female_counts, bar_width, label='Female')

# add labels and legend
ax.set_xlabel('Country')
ax.set_ylabel('Count')
ax.set_title('Churned count by Gender and Country')
ax.set_xticks(x)
ax.set_xticklabels(countries)
ax.legend()

# add data labels
for bar in male_bars + female_bars:
    height = bar.get_height()
    ax.annotate('{}'.format(height), xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')
                
plt.show()
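
The grouped bar chart above builds separate lists of male and female counts and relies on every gender/country combination being present. As an alternative sketch, `pd.crosstab` produces an aligned table and fills missing combinations with 0. The `toy_churned` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical churned customers; note there is no churned male customer in Spain
toy_churned = pd.DataFrame({
    "gender":  ["Female", "Male", "Female", "Female", "Male", "Female"],
    "country": ["France", "France", "Germany", "Germany", "Germany", "Spain"],
})

# Rows are countries, columns are genders, cells are counts (0 where a combination is absent)
gender_by_country = pd.crosstab(toy_churned["country"], toy_churned["gender"])
print(gender_by_country)
```

With the list-based approach, the missing Spain/Male combination would leave `male_counts` one element shorter than `countries`; the crosstab keeps both columns the same length, and `gender_by_country.plot.bar()` would draw the grouped bars directly.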





### German customers in which age range and FICO credit score category churned the most?

- Age 40-50
- Credit score: Fair

In [None]:
import plotly.graph_objs as go
import plotly.io as pio
f9 = f8.copy()#dataframe used for Question 3 
# Filtering the data to only include customers who churned in Germany
f9.drop(f9[f9['country'] != 'Germany'].index, inplace = True)

#Grouping the data by a categorical variable
grouped = f9.groupby(['age_range', 'credit_score_range'], observed=False).size().reset_index(name='count')  # observed=False keeps empty category combinations as 0

#Pivoting the data to create a matrix of counts
pivoted = grouped.pivot(index='age_range', columns='credit_score_range', values='count')

#Creating a stacked bar chart
data = []
for col in pivoted.columns:
    trace = go.Bar(x=pivoted.index, y=pivoted[col], name=col)
    data.append(trace)

layout = go.Layout(title='Customer Churn in Germany by age and FICO credit score ',
                   xaxis=dict(title='Age Group'),
                   yaxis=dict(title='Count'),
                   barmode='stack',
                   legend=dict(title='FICO credit score'))

fig = go.Figure(data=data, layout=layout)
pio.show(fig)

#Most customers in Germany who exited the bank belonged to the 40-50 age group and they had a 'Fair' FICO credit score 
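
The conclusion above is read off the stacked bar chart. The same answer can be obtained numerically by counting each combination and taking the largest. This is a small sketch on made-up rows (`toy_german_churn` is hypothetical), not the notebook's actual result.

```python
import pandas as pd

# Hypothetical churned German customers with their binned age and credit score
toy_german_churn = pd.DataFrame({
    "age_range": ["40-50", "40-50", "40-50", "30-40", "50-60", "40-50"],
    "credit_score_range": ["Fair", "Fair", "Good", "Fair", "Poor", "Fair"],
})

# Count customers per (age range, credit score range) pair
combo_counts = toy_german_churn.groupby(["age_range", "credit_score_range"]).size()

top_combo = combo_counts.idxmax()   # the pair with the most customers
top_count = combo_counts.max()      # how many customers fall in that pair
print(top_combo, top_count)
```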



### What is the most common salary range of churned customers in Germany?
- Most customers in this group have a salary below $50K

In [None]:
aa2 =f9.copy() 

# replace 0 with 'Retained' and 1 with 'Churned'
aa2['churn'] = aa2['churn'].replace({0: 'Retained', 1: 'Churned'})

# counting occurrences of churn
## keeping only churned customers; after the replace above the labels are strings, so comparing with 0 would match nothing
aa2 = aa2[aa2['churn'] == 'Churned']
salary_counts = aa2['salary_range'].value_counts()
plt.pie(salary_counts, labels= salary_counts.index, autopct='%1.1f%%')
plt.legend(title="Salary_Range", loc="center left", bbox_to_anchor=(1, 0.5))
plt.title('Churn Proportion by Salary Range')
plt.show()

#interpretation: customers earning less than $50K make up the largest share of churned German customers

### What is the balance range of churned German customers?

-   Most churned German customers have an account balance of $100k-150k


In [None]:
s10 =f9.copy() 
balance_count1 = s10.groupby(['balance_range'], observed=False).size().reset_index(name='count')  # observed=False lists empty ranges with a count of 0
print(balance_count1)
#there are no churned customers with account balance <$50k, $200k-250k or >$250K


In [None]:
#creating a bar chart to visualize the above result
excluded_ranges = ['<$50k', '$200k-250k', '>$250K']
balance_count1= balance_count1[~balance_count1['balance_range'].isin(excluded_ranges)]

# create an interactive bar chart using plotly.express
fig18 = px.bar(balance_count1, x='balance_range', y='count', 
               title='Count of Churned German Customers in each Account Balance Range',
               labels={'balance_range': 'Balance Range', 'count': 'Count'},
               color_discrete_sequence=['#636EFA'])

# remove the legend
fig18.update_layout(showlegend=False)

fig18.show()



### What are the tenure and activity status of most churned German customers?

Most customers in this group had a tenure of 0-2 years and were inactive members


In [None]:
aa4 = f9.copy()

tenure_distribution = aa4.groupby(['tenure_range','active_member'], observed=False).size()
# Unstack the series by active_member level
pivoted = tenure_distribution.unstack(level=-1)

# Creating a stacked bar chart
data1 = []
for col in pivoted.columns:
    trace1 = go.Bar(x=pivoted.index, y=pivoted[col], name=col)
    data1.append(trace1)

layout1 = go.Layout(title='Customer Churn by tenure and activity status in Germany',
                   xaxis=dict(title='Tenure (years)'),
                   yaxis=dict(title='Count'),
                   barmode='stack',
                   legend=dict(title='Active Member'))

fig20 = go.Figure(data=data1, layout=layout1)
pio.show(fig20)

# Conclusion of EDA:

After comparing churn across countries, we found that Germany has the highest number of churned customers. Further analysis showed that more female than male customers churned in Germany.

Moreover, customers earning less than $50K make up the largest share of churned customers in Germany.

Additionally, most churned customers in Germany are aged 40-50 and have a Fair FICO credit score.

Most churned German customers had an account balance between $100K and $150K.

Most churned German customers had been with the bank for 2 years or less and were inactive.


Therefore, our analysis suggests that the bank should keep these characteristics of churned German customers in mind when designing an incentive/rewards program to gain their loyalty and thereby reduce churn.



In [None]:
f11.columns

## Data Preparation for Modeling

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, f1_score, recall_score, confusion_matrix, classification_report

# num_cols1 and cat_cols1 list the numerical and categorical feature columns, respectively
# note: the preprocessor below is fitted on the full dataset, so the scaling statistics
# also include rows that are later used for validation and testing

num_cols1= ['age', 'credit_score','tenure', 'balance','products_number', 'credit_card', 'active_member', 'estimated_salary']   # features / independent variables 
cat_cols1 = ['country', 'gender']
target = ['churn']     # target / dependent variable 

num_transformer = StandardScaler()
cat_transformer = OneHotEncoder()

# combine the transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols1),
        ('cat', cat_transformer, cat_cols1)])

# apply the preprocessor to the dataframe
X = preprocessor.fit_transform(f11)
Y = f11['churn']


In [None]:
#from sklearn.preprocessing import StandardScaler
#X = StandardScaler().fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
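X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)  # stratify keeps the churn ratio equal in both splits; fixed seed makes the split reproducible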
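
In the cells above the scaler and encoder are fitted on all rows before any splitting. One common way to keep preprocessing inside each cross-validation fold is to wrap it in a `Pipeline` together with the model. The sketch below shows the idea on randomly generated data (the `toy_bank` frame and its values are hypothetical, and Logistic Regression is used only as an example estimator).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical random data with a few columns shaped like the bank dataset
rng = np.random.default_rng(0)
n_rows = 200
toy_bank = pd.DataFrame({
    "age": rng.integers(18, 80, size=n_rows),
    "balance": rng.uniform(0, 200000, size=n_rows),
    "country": rng.choice(["France", "Germany", "Spain"], size=n_rows),
    "churn": rng.integers(0, 2, size=n_rows),
})

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

# The pipeline refits the preprocessor on the training part of every fold
toy_model = Pipeline([
    ("prep", toy_preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

toy_scores = cross_validate(
    toy_model, toy_bank.drop(columns="churn"), toy_bank["churn"],
    cv=5, scoring="accuracy",
)
print(toy_scores["test_score"].mean())
```

Because the labels here are random, the accuracy itself is meaningless; the point is only that `cross_validate` receives the raw DataFrame and the scaling statistics never see the held-out fold.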

## Making Predictions

### KNN

In [None]:
from sklearn.model_selection import cross_validate     # import cross_validation function from sklearn
from sklearn.neighbors import KNeighborsClassifier    # import KNN classifer 

knn = KNeighborsClassifier()

## fit with 5-fold cross validation

scores = cross_validate(estimator=knn, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

### Logistic Regression 

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

scores = cross_validate(estimator=lr, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

### SVM 

In [None]:
from sklearn.svm import SVC

svm = SVC()

scores = cross_validate(estimator=svm, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

### Naive Bayes 

In [None]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

scores = cross_validate(estimator=nb, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

### Decision Tree 

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

scores = cross_validate(estimator=tree, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

scores = cross_validate(estimator=rf, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

### Gradient Boost Classifier 

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()

scores = cross_validate(estimator=gb, X=X, y=Y, cv=5, scoring="accuracy")

print("Average Fitting Time:", scores['fit_time'].mean())
print("Average Accuracy:", scores['test_score'].mean())

## Model Selection 

### We write a loop to search for the best model for the problem. 

In [None]:
## candidate models 
knn = KNeighborsClassifier()
lr = LogisticRegression()
svm = SVC()
nb = GaussianNB()
tree = DecisionTreeClassifier()
rf = RandomForestClassifier()


models = {"KNN": knn, "LogisticRegression": lr, "SVM": svm, 
          "Naive Bayes": nb, "DecisionTree": tree, "RandomForest": rf}
results = []

for model_name, model in models.items():
    default = {"Model":model_name, "Fitting Time": np.nan, "Accuracy": np.nan}
    scores = cross_validate(estimator=model, X=X, y=Y, cv=5, scoring="accuracy")
    default['Fitting Time'] = scores['fit_time'].mean()
    default['Accuracy'] = scores['test_score'].mean()
    results.append(default)

In [None]:
results = pd.DataFrame(results)
results.sort_values("Accuracy", ascending=False)
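
Since only about 20% of customers churned, accuracy alone can look good even for a model that never predicts churn. The sketch below makes this concrete with a majority-class baseline on made-up labels (100 hypothetical customers, 20 of them churned). It also shows that `cross_validate` accepts a list of metrics, in which case the result keys are named `test_<metric>`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical labels with a 20% churn rate; features are irrelevant for this baseline
y_dummy = np.array([1] * 20 + [0] * 80)
X_dummy = np.zeros((100, 1))

# Always predicts the most frequent class, i.e. "retained"
baseline = DummyClassifier(strategy="most_frequent")

baseline_scores = cross_validate(
    baseline, X_dummy, y_dummy, cv=5, scoring=["accuracy", "recall"]
)
print("Baseline accuracy:", baseline_scores["test_accuracy"].mean())
print("Baseline recall:", baseline_scores["test_recall"].mean())
```

The baseline reaches 80% accuracy while finding none of the churned customers, so the accuracies in the table above are best read relative to roughly 80%, and recall on the churn class is a useful companion metric.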

In [None]:
# visualize accuracy against fitting time for each model

ax = sns.scatterplot(x="Fitting Time", y="Accuracy", data=results)
for x, y, model_name in results[['Fitting Time', 'Accuracy', 'Model']].values:
    ax.text(x+.005, y, model_name)

As we can see, the best model is the Random Forest model. We can either:
1. fine-tune the best model to get the best prediction result, or
2. combine a set of the best models into an ensemble model for prediction.

We have chosen to fine-tune the best model, i.e. Random Forest.

### Next Step 1: Fine Tune the Best Model 

In [None]:
from sklearn.model_selection import GridSearchCV    ## use cross validation to search for best model hyper-parameters 

rf = RandomForestClassifier()

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 15],
    'criterion': ['gini', 'entropy'],
    'class_weight':[None, "balanced_subsample"]
}

In [None]:
# Perform grid search
grid_search = GridSearchCV(rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, Y)

# Print best parameters and score
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)
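
The grid search above is fitted on the full data with the default scoring (accuracy for classifiers). A variation worth considering is to search on a training split only and to choose the metric explicitly, so that a held-out test split stays untouched for the final check. The sketch below illustrates this on synthetic data from `make_classification` with a deliberately tiny grid; the data, the grid values and the choice of recall as the metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced data (about 80% / 20%) standing in for the bank features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Small illustrative grid; scoring="recall" selects the model by churn-class recall
toy_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [None, 3]},
    cv=5,
    scoring="recall",
)
toy_search.fit(X_toy_train, y_toy_train)

# score() uses the same metric as the search (recall here), evaluated on unseen rows
toy_test_recall = toy_search.score(X_toy_test, y_toy_test)
print(toy_search.best_params_, toy_test_recall)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `toy_search.best_estimator_` is ready to use on the test split.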

# Precision, recall and F1 scores: 

In [None]:
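# refit a Random Forest with the tuned hyper-parameters on the training split only, so the test split stays unseen
rf = RandomForestClassifier(**grid_search.best_params_).fit(X_train, Y_train)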

In [None]:
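#Testing metrics (printed explicitly, since only the last expression of a cell is displayed)
Y_pred = rf.predict(X_test)
Y_prob = rf.predict_proba(X_test)[:, 1]   # probability of churn, needed for a meaningful ROC AUC
print("ROC AUC:", roc_auc_score(Y_test, Y_prob))
print("Recall:", recall_score(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))
print(classification_report(Y_test, Y_pred))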

- As we can see, the fine-tuned RF model is 0.004 higher in accuracy than the RF model with default parameters.
- The model classifies customers as churned or retained with ~86% overall accuracy
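
To see how overall accuracy relates to the precision and recall reported for the churn class, the metrics can be worked out by hand from a confusion matrix. The sketch below uses ten made-up predictions (not output from the model above) and checks the manual arithmetic against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = churned, 0 = retained
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all customers classified correctly
manual_precision = tp / (tp + fp)                   # share of predicted churners who really churned
manual_recall = tp / (tp + fn)                      # share of actual churners the model found

print(manual_accuracy, manual_precision, manual_recall)
print(accuracy_score(y_true_toy, y_pred_toy),
      precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy))
```

In this toy case accuracy is 0.8 while only half of the churned customers are found, which is why the recall for class 1 in the classification report deserves as much attention as the headline accuracy.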