<a href="https://colab.research.google.com/github/SurabhiInamdar/Finance_ML_Projects/blob/master/credit_card_customer_churn_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msn
from collections import Counter

## 2. Loading the Dataset

### Load data into a Pandas DataFrame

In [2]:
df=pd.read_csv('./BankChurners.csv')

### Print the Datatypes of the dataset

In [None]:
df.dtypes

In [None]:
df.info()

* This dataset has 10127 rows and 23 columns

## 3. Data Cleaning

### Drop duplicates if any


In [None]:
print(df.duplicated().sum())  # number of fully duplicated rows (only the last expression of a cell is displayed)
df.drop_duplicates(inplace=True)
df.shape

* As you can see, there are no duplicates

### Check for the null values in each column

In [None]:
df.isnull().any()

- As you can see, this dataset doesn't have null values.
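The null check only detects real `NaN` values. Several categorical columns in this notebook also contain the literal category `Unknown`, which acts as a placeholder for missing information. The sketch below uses a small made-up frame (not the real data) to show how such placeholders could be counted alongside the null check.

```python
import pandas as pd

# Toy frame standing in for two categorical columns; the rows are made up.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School", "Unknown"],
    "Marital_Status": ["Married", "Single", "Unknown", "Married"],
})

# isnull() only sees real NaN values, so every column reports zero here.
null_counts = toy.isnull().sum()

# Counting the placeholder string reveals the hidden gaps per column.
unknown_counts = (toy == "Unknown").sum()

print(null_counts.to_dict())
print(unknown_counts.to_dict())
```

On the real DataFrame the same comparison could be applied to the categorical columns only.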

## 4. Exploratory Data Analysis and Data Visualization

### Customer age distribution

In [None]:
sns.histplot(df['Customer_Age'], kde=True)  # distplot is deprecated in recent seaborn releases
plt.title('Credit Card Customer Age Distribution')

* Customer age is approximately normally distributed.
* Most customer ages are clustered around the mean (between 40 and 60).
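Reading normality off a histogram is a judgment call, so a quick numeric check can back it up. The sketch below computes skewness and excess kurtosis, both close to 0 for a normal distribution, on synthetic ages. The mean of 46 and standard deviation of 8 are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic ages drawn from a normal distribution (parameters are assumptions).
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=46, scale=8, size=5000)).round()

skewness = ages.skew()           # about 0 for a symmetric distribution
excess_kurtosis = ages.kurt()    # pandas reports excess kurtosis, about 0 for a normal
share_40_60 = ages.between(40, 60).mean()  # fraction of ages in [40, 60]

print(skewness, excess_kurtosis, share_40_60)
```

The same three lines could be run on `df['Customer_Age']`; values far from 0 would suggest the "approximately normal" reading is too generous.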

### Proportion of customer gender count

In [None]:
# count customers by gender
df['Gender'].value_counts()

In [None]:
# visualize gender count
sns.countplot(data=df, x='Gender')

In [None]:
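gender_counts = df['Gender'].value_counts()  # take labels from the index so they always match the wedge order
plt.pie(gender_counts, labels=gender_counts.index.map({'F': 'Female', 'M': 'Male'}), autopct='%1.1f%%', shadow=True, startangle=90)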
plt.title('Proportion of Gender count', fontsize = 16)
plt.show()

* The gender split is almost even.

### Proportion of existing and attrited customers count

In [None]:
plt.pie(df['Attrition_Flag'].value_counts(), labels = ['Existing Customer', 'Attrited Customer'], 
        autopct='%1.1f%%', startangle = 90)
plt.title('Proportion of Existing and Attrited Customer count', fontsize = 16)
plt.show()

* As you can see, the split between existing and attrited customers is highly imbalanced compared with the gender split.

* So I'd like to see the proportion of existing and attrited customers by gender.

### Proportion of existing and attrited customer by gender

In [None]:
#visualize to see the number of existing and attrited customers by gender
plt.figure(figsize=(10,6))
sns.countplot(x='Gender', hue='Attrition_Flag', data=df)
plt.title('Existing and Attrited Customers by Gender', fontsize=20)

In [None]:
# visualize to see the proportion of existing and attrited customers by gender

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,15))

gender_names = {'F': 'Female', 'M': 'Male'}

# take the labels from the value_counts index so they always match the wedge order
attrited_gender = df.loc[df["Attrition_Flag"] == "Attrited Customer", "Gender"].value_counts()
ax1.pie(x=attrited_gender, labels=attrited_gender.index.map(gender_names), autopct='%1.1f%%', startangle=90)
ax1.set_title('Attrited Customer vs Gender', fontsize=16)

existing_gender = df.loc[df["Attrition_Flag"] == "Existing Customer", "Gender"].value_counts()
ax2.pie(x=existing_gender, labels=existing_gender.index.map(gender_names), autopct='%1.1f%%', startangle=90)
ax2.set_title('Existing Customer vs Gender', fontsize=16)

* The gender split is comparable in both pie charts. Among attrited customers, the share of female customers is 14.4 percentage points higher than the share of male customers.
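A gap between two shares can be read either as percentage points or as a relative percentage, and the two are easy to mix up. The sketch below works through both on made-up counts (60 and 40, which are not the dataset's figures).

```python
import pandas as pd

# Made-up counts of attrited customers per gender, for illustration only.
attrited_counts = pd.Series({"F": 60, "M": 40})

# Shares of the attrited group, in percent (what a pie chart displays).
shares = attrited_counts / attrited_counts.sum() * 100

# Difference between the two shares, in percentage points.
gap_pp = shares["F"] - shares["M"]

# Relative difference between the two counts, in percent.
relative_gap = (attrited_counts["F"] / attrited_counts["M"] - 1) * 100

print(shares.to_dict(), gap_pp, relative_gap)
```

With these counts the shares differ by 20 percentage points, while one group is 50% larger than the other, so it matters which of the two a sentence reports.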

### Proportion of entire education levels

In [None]:
edu = df['Education_Level'].value_counts().to_frame('Counts') 
plt.figure(figsize = (8,8))
plt.pie(edu['Counts'], labels = edu.index, autopct = '%1.1f%%')
plt.title('Proportion of Education Levels', fontsize = 18)
plt.show()

### Proportion of education level by existing and attrited customer

In [None]:
# Proportion of education level by existing and attrited customer

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,15))

# take the labels from the value_counts index so they always match the wedge order
attrited_edu = df.loc[df["Attrition_Flag"] == "Attrited Customer", "Education_Level"].value_counts()
ax1.pie(x=attrited_edu, labels=attrited_edu.index, autopct='%1.1f%%', startangle=90)
ax1.set_title('Attrited Customer vs Education Level', fontsize=16)

existing_edu = df.loc[df["Attrition_Flag"] == "Existing Customer", "Education_Level"].value_counts()
ax2.pie(x=existing_edu, labels=existing_edu.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Existing Customer vs Education Level', fontsize=16)


### Proportion of education level by gender

In [None]:
# By pieplot

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,15))

# take the labels from the value_counts index so they always match the wedge order
female_edu = df.loc[df["Gender"] == "F", "Education_Level"].value_counts()
ax1.pie(x=female_edu, labels=female_edu.index, autopct='%1.1f%%', startangle=90)
ax1.set_title('Female vs Education Level', fontsize=16)

male_edu = df.loc[df["Gender"] == "M", "Education_Level"].value_counts()
ax2.pie(x=male_edu, labels=male_edu.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Male vs Education Level', fontsize=16)



In [None]:
# By countplot
plt.figure(figsize=(10,6))
sns.countplot(x='Gender', hue='Education_Level', data=df)
plt.title('Education Level by gender', fontsize=20)

- For both customer groups and both genders, education level is concentrated at the **Graduate** level, followed by High School.

### Proportion of marital status by attrited and existing customers 

In [None]:
df['Marital_Status'].value_counts()

In [None]:
# Proportion of marital status by customer

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,15))

# take the labels from the value_counts index so they always match the wedge order
attrited_mar = df.loc[df["Attrition_Flag"] == "Attrited Customer", "Marital_Status"].value_counts()
ax1.pie(x=attrited_mar, labels=attrited_mar.index, autopct='%1.1f%%', startangle=90)
ax1.set_title('Attrited Customer vs Marital_Status', fontsize=16)

existing_mar = df.loc[df["Attrition_Flag"] == "Existing Customer", "Marital_Status"].value_counts()
ax2.pie(x=existing_mar, labels=existing_mar.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Existing Customer vs Marital_Status', fontsize=16)


In [None]:
# By countplot
plt.figure(figsize=(10,6))
sns.countplot(x='Attrition_Flag', hue='Marital_Status', data=df)
plt.title('Attrited and Existing Customers by Marital Status', fontsize=20)

- Among attrited customers, the most common marital status is Married (43.6%), followed by Single (41.1%).

### Proportion of income category by customer

In [None]:
# Proportion of income category by customer

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,15))

# take the labels from the value_counts index so they always match the wedge order
attrited_inc = df.loc[df["Attrition_Flag"] == "Attrited Customer", "Income_Category"].value_counts()
ax1.pie(x=attrited_inc, labels=attrited_inc.index, autopct='%1.1f%%', startangle=90)
ax1.set_title('Attrited Customer vs Income_Category', fontsize=16)

existing_inc = df.loc[df["Attrition_Flag"] == "Existing Customer", "Income_Category"].value_counts()
ax2.pie(x=existing_inc, labels=existing_inc.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Existing Customer vs Income_Category', fontsize=16)


- For both attrited and existing customers, the income category is highly concentrated in the Less than 40K group.

### Correlation using heatmap

In [None]:
f, ax = plt.subplots(figsize=(12, 8)) 
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="Blues")  # restrict to numeric columns; newer pandas raises on text columns
plt.show()

### Customer age count by customer

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Customer_Age', data=df, hue='Attrition_Flag')

## 5. Customer Churn Prediction

Since the predictive models require **numerical values**, the categorical columns need to be transformed. The categorical features are **one-hot encoded**, and the target column `Attrition_Flag` is **label encoded**.

### Preprocessing to transform categorical data to numerical data for prediction

In [None]:
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score, average_precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV

In [None]:
df_categorical = df[['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']]
df_categorical.head()

In [None]:
df_numerical = df[['Customer_Age', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
                      'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct',
                      'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']]
df_numerical.head()

In [None]:
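enc = OneHotEncoder()
# keep string column names and the original index so the concat below aligns row by row
df_categorical_enc = pd.DataFrame(enc.fit_transform(df_categorical).toarray(),
                                  columns=enc.get_feature_names_out(), index=df_categorical.index)
df_categorical_enc.head()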

### Merge categorical and numerical dataframe

In [None]:
df_all = pd.concat([df_categorical_enc, df_numerical], axis=1)
df_all.head()

In [None]:
X = df_all

In [None]:
y = df['Attrition_Flag']

In [None]:
le = LabelEncoder()
y = le.fit_transform(y)

### Test Train Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
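Since the target classes are imbalanced, a plain random split can leave the train and test sets with slightly different class ratios. One option, sketched below on made-up labels (20 positives out of 100, an assumption for illustration), is to pass `stratify` to `train_test_split` so that both parts keep roughly the same ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 20 positives and 80 negatives.
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42
)

print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing call; whether that changes the reported metrics noticeably would need to be checked on the real data.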

In [None]:
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
target_names = ['Attrited Customer', 'Existing Customer']

In [None]:
parameters_randomforest = {'n_estimators':range(10,400,5), 'max_depth':range(2,8,2)}

### RandomForestClassifier

In [None]:
randomforest = RandomForestClassifier(class_weight = 'balanced')
clf_randomforest = RandomizedSearchCV(randomforest, parameters_randomforest, random_state=0)
clf_randomforest.fit(X_train, y_train)

In [None]:
y_pred_randomforest = clf_randomforest.predict(X_test)

In [None]:
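y_score_randomforest = clf_randomforest.predict_proba(X_test)[:, 1]  # ranking metrics need scores; column 1 is 'Existing Customer'
average_precision_score(y_test, y_score_randomforest), roc_auc_score(y_test, y_score_randomforest)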

In [None]:
print(classification_report(y_test, y_pred_randomforest, target_names=target_names))

In [None]:
parameters_gb = {'learning_rate':(0.1,0.01), 'n_estimators':range(10,400,5),
                'max_depth':range(2,8,2)
              }

In [None]:
gb = GradientBoostingClassifier()

clf_gb = RandomizedSearchCV(gb, parameters_gb, random_state=0)

clf_gb.fit(X_train, y_train)

In [None]:
y_pred_gb = clf_gb.predict(X_test)

In [None]:
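y_score_gb = clf_gb.predict_proba(X_test)[:, 1]  # ranking metrics need scores; column 1 is 'Existing Customer'
average_precision_score(y_test, y_score_gb), roc_auc_score(y_test, y_score_gb)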

In [None]:
sns.heatmap(confusion_matrix(y_test, y_pred_gb), annot=True)

In [None]:
print(classification_report(y_test, y_pred_gb, target_names=target_names))

In [None]:
clf_lg = LogisticRegression(C=0.5, penalty='l2',n_jobs=6, random_state=0)

clf_lg.fit(X_train, y_train)

In [None]:
y_pred_lg = clf_lg.predict(X_test)

In [None]:
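y_score_lg = clf_lg.predict_proba(X_test)[:, 1]  # ranking metrics need scores; column 1 is 'Existing Customer'
average_precision_score(y_test, y_score_lg), roc_auc_score(y_test, y_score_lg)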

In [None]:
sns.heatmap(confusion_matrix(y_test, y_pred_lg), annot=True)

In [None]:
print(classification_report(y_test, y_pred_lg, target_names=target_names))

In [None]:
y_pred_all = (0.5*y_pred_gb) + (y_pred_randomforest*0.3) + (y_pred_lg*0.2)

In [None]:
average_precision_score(y_test, y_pred_all), roc_auc_score(y_test, y_pred_all)
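The blend above mixes hard 0/1 predictions, so it can only take a handful of distinct values. A possible alternative is to blend predicted probabilities with the same 0.5 / 0.3 / 0.2 weights, which yields a finer-grained score for ranking metrics. The sketch below runs on synthetic data from `make_classification`; the dataset, the model settings, and the variable names are assumptions for illustration, not the notebook's tuned models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the scaled feature matrix.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.2, 0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# (model, weight) pairs; the weights mirror the 0.5 / 0.3 / 0.2 blend.
models = {
    "gb": (GradientBoostingClassifier(random_state=0), 0.5),
    "rf": (RandomForestClassifier(n_estimators=50, random_state=0), 0.3),
    "lg": (LogisticRegression(max_iter=1000), 0.2),
}

# Weighted average of the predicted probability of class 1.
blend = np.zeros(len(y_te))
for name, (model, weight) in models.items():
    model.fit(X_tr, y_tr)
    blend += weight * model.predict_proba(X_te)[:, 1]

blend_auc = roc_auc_score(y_te, blend)
print(round(blend_auc, 3))
```

scikit-learn's `VotingClassifier` with `voting='soft'` and a `weights` argument offers the same idea as a single estimator, which may be more convenient if the blend needs cross-validation.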

## 6. Conclusion

### How can the bank retain credit card customers who are likely to churn?
   * 16.07% of customers have churned.
   * The gender split is almost even (52.9% female and 47.1% male) compared with the split between existing and attrited customers (83.9% and 16.1%), which is highly imbalanced.
   * Among attrited customers, **the share of female customers is 14.4 percentage points higher than the share of male customers**.
   * **Graduate is the largest education group among customers who have churned** (29.9%), followed by High School (18.8%).
   * Most customers who have churned are Married (43.6%) or Single (41.1%), compared with Divorced (7.4%) and Unknown (7.9%) - **the marital status of attrited customers is highly clustered in Married and Single**.
   * Among attrited customers, the income category is highly concentrated in the Less than 40K group (37.6%), followed by 40K - 60K (16.7%) and 80K - 120K (14.9%). **I assume that customers with higher income are less likely to leave their credit card services than lower-income customers**.
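The percentages above are shares of the attrited group, so they also reflect how many customers each category has overall. To compare how likely each group is to leave, the attrition rate within each category is the more direct measure. The sketch below contrasts the two on a made-up frame; the rows and the exact category strings are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: six customers in one income group, two in another.
toy = pd.DataFrame({
    "Income_Category": ["Less than $40K"] * 6 + ["$120K +"] * 2,
    "Attrition_Flag": [
        "Attrited Customer", "Existing Customer", "Existing Customer",
        "Existing Customer", "Existing Customer", "Existing Customer",
        "Attrited Customer", "Existing Customer",
    ],
})

attrited = toy["Attrition_Flag"] == "Attrited Customer"

# Share of the attrited group that falls in each category (what the pie charts show).
share_of_attrited = toy.loc[attrited, "Income_Category"].value_counts(normalize=True)

# Attrition rate inside each category (fraction of that category that churned).
rate_within_category = attrited.groupby(toy["Income_Category"]).mean()

print(share_of_attrited.to_dict())
print(rate_within_category.to_dict())
```

In this toy example both groups contribute half of the attrited customers, yet the smaller group churns at a much higher rate, so the two views can lead to different conclusions.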