# PROBLEM STATEMENT

Given the attributes of a bank's customers, we need to build a model that predicts, based on current activity, whether a customer's average balance will fall below the minimum balance in the upcoming quarter.

**FEATURES IN DATASET**

1. **Customer_ID** - ID of the customer
2. **Vintage** - vintage of the customer with the bank, in days
3. **Age** - age of the customer
4. **Gender** - gender of the customer
5. **Dependents** - number of dependents
6. **Occupation** - occupation of the customer
7. **City** - city of the customer (integer coded)
8. **customer_nw_category** - net worth of the customer (integer coded)
9. **branch_code** - branch code for the customer account
10. **days_since_last_transaction** - number of days since the last credit in the past year
11. **current_balance** - balance as of today
12. **previous_month_end_balance** - balance at the end of the previous month
13. **average_monthly_balance_prevQ** - average monthly balance in the previous quarter
14. **average_monthly_balance_prevQ2** - average monthly balance in the quarter before the previous one
15. **current_month_credit** - total credit amount in the current month
16. **previous_month_credit** - total credit amount in the previous month
17. **current_month_debit** - total debit amount in the current month
18. **previous_month_debit** - total debit amount in the previous month
19. **current_month_balance** - average balance in the current month
20. **previous_month_balance** - average balance in the previous month
21. **churn** - whether the customer's average balance falls below the minimum balance in the next quarter (1: yes, 0: no)


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display_html

from scipy import stats

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score


from sklearn import metrics

from xgboost import XGBRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
customer_data = pd.read_csv(r"C:\Users\Sasidharan\Desktop\Data Science Projects\Customer Churn Prediction\churn_prediction.csv")

**Exploratory Data Analysis (EDA)**

In [3]:
customer_data.shape

(28382, 21)

In [4]:
customer_data.head()

| | customer_id | vintage | age | gender | dependents | occupation | city | customer_nw_category | branch_code | days_since_last_transaction | ... | previous_month_end_balance | average_monthly_balance_prevQ | average_monthly_balance_prevQ2 | current_month_credit | previous_month_credit | current_month_debit | previous_month_debit | current_month_balance | previous_month_balance | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3135 | 66 | Male | 0.0 | self_employed | 187.0 | 2 | 755 | 224.0 | ... | 1458.71 | 1458.71 | 1449.07 | 0.2 | 0.2 | 0.2 | 0.2 | 1458.71 | 1458.71 | 0 |
| 1 | 2 | 310 | 35 | Male | 0.0 | self_employed | NaN | 2 | 3214 | 60.0 | ... | 8704.66 | 7799.26 | 12419.41 | 0.56 | 0.56 | 5486.27 | 100.56 | 6496.78 | 8787.61 | 0 |
| 2 | 4 | 2356 | 31 | Male | 0.0 | salaried | 146.0 | 2 | 41 | NaN | ... | 5815.29 | 4910.17 | 2815.94 | 0.61 | 0.61 | 6046.73 | 259.23 | 5006.28 | 5070.14 | 0 |
| 3 | 5 | 478 | 90 | NaN | NaN | self_employed | 1020.0 | 2 | 582 | 147.0 | ... | 2291.91 | 2084.54 | 1006.54 | 0.47 | 0.47 | 0.47 | 2143.33 | 2291.91 | 1669.79 | 1 |
| 4 | 6 | 2531 | 42 | Male | 2.0 | self_employed | 1494.0 | 3 | 388 | 58.0 | ... | 1401.72 | 1643.31 | 1871.12 | 0.33 | 714.61 | 588.62 | 1538.06 | 1157.15 | 1677.16 | 1 |


In [5]:
customer_data.dtypes

customer_id                         int64
vintage                             int64
age                                 int64
gender                             object
dependents                        float64
occupation                         object
city                              float64
customer_nw_category                int64
branch_code                         int64
days_since_last_transaction       float64
current_balance                   float64
previous_month_end_balance        float64
average_monthly_balance_prevQ     float64
average_monthly_balance_prevQ2    float64
current_month_credit              float64
previous_month_credit             float64
current_month_debit               float64
previous_month_debit              float64
current_month_balance             float64
previous_month_balance            float64
churn                               int64
dtype: object

In [6]:
customer_data.describe()

| | customer_id | vintage | age | dependents | city | customer_nw_category | branch_code | days_since_last_transaction | current_balance | previous_month_end_balance | average_monthly_balance_prevQ | average_monthly_balance_prevQ2 | current_month_credit | previous_month_credit | current_month_debit | previous_month_debit | current_month_balance | previous_month_balance | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28382.0 | 28382.0 | 28382.0 | 25919.0 | 27579.0 | 28382.0 | 28382.0 | 25159.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 | 28382.0 |
| mean | 15143.508667 | 2364.336446 | 48.208336 | 0.347236 | 796.109576 | 2.22553 | 925.975019 | 69.997814 | 7380.552 | 7495.771 | 7496.78 | 7124.209 | 3433.252 | 3261.694 | 3658.745 | 3339.761 | 7451.133 | 7495.177 | 0.185329 |
| std | 8746.454456 | 1610.124506 | 17.807163 | 0.997661 | 432.872102 | 0.660443 | 937.799129 | 86.341098 | 42598.71 | 42529.35 | 41726.22 | 44575.81 | 77071.45 | 29688.89 | 51985.42 | 24301.11 | 42033.94 | 42431.98 | 0.388571 |
| min | 1.0 | 180.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | -5503.96 | -3149.57 | 1428.69 | -16506.1 | 0.01 | 0.01 | 0.01 | 0.01 | -3374.18 | -5171.92 | 0.0 |
| 25% | 7557.25 | 1121.0 | 36.0 | 0.0 | 409.0 | 2.0 | 176.0 | 11.0 | 1784.47 | 1906.0 | 2180.945 | 1832.507 | 0.31 | 0.33 | 0.41 | 0.41 | 1996.765 | 2074.408 | 0.0 |
| 50% | 15150.5 | 2018.0 | 46.0 | 0.0 | 834.0 | 2.0 | 572.0 | 30.0 | 3281.255 | 3379.915 | 3542.865 | 3359.6 | 0.61 | 0.63 | 91.93 | 109.96 | 3447.995 | 3465.235 | 0.0 |
| 75% | 22706.75 | 3176.0 | 60.0 | 0.0 | 1096.0 | 3.0 | 1440.0 | 95.0 | 6635.82 | 6656.535 | 6666.887 | 6517.96 | 707.2725 | 749.235 | 1360.435 | 1357.553 | 6667.958 | 6654.693 | 0.0 |
| max | 30301.0 | 12899.0 | 90.0 | 52.0 | 1649.0 | 3.0 | 4782.0 | 365.0 | 5905904.0 | 5740439.0 | 5700290.0 | 5010170.0 | 12269850.0 | 2361808.0 | 7637857.0 | 1414168.0 | 5778185.0 | 5720144.0 | 1.0 |


In [7]:
customer_data.isnull().sum()

customer_id                          0
vintage                              0
age                                  0
gender                             525
dependents                        2463
occupation                          80
city                               803
customer_nw_category                 0
branch_code                          0
days_since_last_transaction       3223
current_balance                      0
previous_month_end_balance           0
average_monthly_balance_prevQ        0
average_monthly_balance_prevQ2       0
current_month_credit                 0
previous_month_credit                0
current_month_debit                  0
previous_month_debit                 0
current_month_balance                0
previous_month_balance               0
churn                                0
dtype: int64

**Gender**

In [8]:
customer_data["gender"].value_counts()

Male      16548
Female    11309
Name: gender, dtype: int64

In [9]:
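# Missing genders become a third category, "other".
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions deprecate.
customer_data["gender"] = customer_data["gender"].fillna("other")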

**Dependents, Occupation, City**

In [10]:
# All of these are treated as categorical variables, so we can use mode imputation.
customer_data["dependents"].value_counts()

0.0     21435
2.0      2150
1.0      1395
3.0       701
4.0       179
5.0        41
6.0         8
7.0         3
36.0        1
52.0        1
25.0        1
9.0         1
50.0        1
32.0        1
8.0         1
Name: dependents, dtype: int64

In [11]:
customer_data["city"].value_counts()

1020.0    3479
1096.0    2016
409.0     1334
146.0     1291
834.0     1138
          ... 
629.0        1
527.0        1
1212.0       1
530.0        1
70.0         1
Name: city, Length: 1604, dtype: int64

In [12]:
customer_data["occupation"].value_counts()

self_employed    17476
salaried          6704
student           2058
retired           2024
company             40
Name: occupation, dtype: int64

In [13]:
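# Impute each column with its mode (the most frequent value in the counts above).
customer_data["dependents"] = customer_data["dependents"].fillna(0)
customer_data["occupation"] = customer_data["occupation"].fillna('self_employed')
customer_data["city"] = customer_data["city"].fillna(1020)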
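
The modes above are typed in by hand from the `value_counts()` output. As a minimal sketch of the same idea computed from the data, here is a standard-library version on made-up values (the helper `fill_with_mode` is hypothetical; in pandas the equivalent would be along the lines of `series.fillna(series.mode()[0])`):

```python
from collections import Counter

def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]

# Made-up occupations with two missing entries (not taken from the bank data).
occupations = ["self_employed", "salaried", None, "self_employed", "student", None]
filled = fill_with_mode(occupations)
print(filled)
# ['self_employed', 'salaried', 'self_employed', 'self_employed', 'student', 'self_employed']
```

Computing the mode rather than hardcoding it keeps the imputation correct if the dataset is refreshed and the most frequent value changes.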

**Days Since Last Transaction**

In [14]:
customer_data["days_since_last_transaction"].describe()

count    25159.000000
mean        69.997814
std         86.341098
min          0.000000
25%         11.000000
50%         30.000000
75%         95.000000
max        365.000000
Name: days_since_last_transaction, dtype: float64

In [15]:
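# Sentinel value outside the observed 0-365 range flags rows where the value was missing.
customer_data["days_since_last_transaction"] = customer_data["days_since_last_transaction"].fillna(999)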

**Encoding Categorical Variables**

In [16]:
customer_data = pd.get_dummies(customer_data)

In [17]:
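# These casts run after get_dummies, so city and branch_code stay as single
# integer-coded columns rather than being one-hot encoded.
customer_data["city"] = customer_data["city"].astype('object')
customer_data["branch_code"] = customer_data["branch_code"].astype('object')
customer_data["churn"] = customer_data["churn"].astype('category')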

In [18]:
customer_data["dependents"].value_counts()

0.0     23898
2.0      2150
1.0      1395
3.0       701
4.0       179
5.0        41
6.0         8
7.0         3
36.0        1
52.0        1
25.0        1
9.0         1
50.0        1
32.0        1
8.0         1
Name: dependents, dtype: int64

In [19]:
# Cap the few implausibly large dependent counts at 8.
customer_data["dependents"] = customer_data["dependents"].clip(upper=8)

In [20]:
num_cols = ['current_balance',
            'previous_month_end_balance', 'average_monthly_balance_prevQ2', 'average_monthly_balance_prevQ',
            'current_month_credit','previous_month_credit', 'current_month_debit', 
            'previous_month_debit','current_month_balance', 'previous_month_balance']
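# Shift by 17000 so the log argument stays positive even for negative balances.
for col in num_cols:
    customer_data[col] = np.log1p(customer_data[col] + 17000)

# Standardise the log-transformed columns; reuse the original index so the later join aligns.
scaler = StandardScaler()
scaled = scaler.fit_transform(customer_data[num_cols])
scaled = pd.DataFrame(scaled, columns=num_cols, index=customer_data.index)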
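
The offset of 17000 in the cell above is presumably chosen because the most negative balance in the `describe()` output is -16506.1, and `log1p` is only defined for inputs greater than -1. A minimal sketch of that reasoning using only the standard library (the helper `shifted_log` is hypothetical and only mirrors `np.log1p(x + 17000)` for a single value):

```python
import math

SHIFT = 17000  # offset used in the notebook; it must exceed the most negative balance

def shifted_log(value, shift=SHIFT):
    """Return log(1 + value + shift) for a single number."""
    return math.log1p(value + shift)

min_balance = -16506.1  # minimum of average_monthly_balance_prevQ2 in describe()

# Without the shift the argument is below -1, so the logarithm is undefined.
try:
    math.log1p(min_balance)
    unshifted_ok = True
except ValueError:
    unshifted_ok = False

print(unshifted_ok)                   # False
print(shifted_log(min_balance) > 0)   # True
```

A fixed offset like this only works while no future balance drops below -17001, so it is worth re-checking the minimum whenever the data changes.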

In [21]:
customer_data = customer_data.drop(columns=num_cols)
customer_data = customer_data.join(scaled)

In [22]:
y = customer_data["churn"]
x = customer_data.drop(["customer_id","churn"],axis=1)

**Cross Validation**

In [23]:
# The class distribution in the dataset can be used to set the prediction threshold.

customer_data["churn"].value_counts() / customer_data.shape[0]

0    0.814671
1    0.185329
Name: churn, dtype: float64
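
One way to read the comment above: with only about 18.5% churners, a default cut-off of 0.5 tends to flag few customers, so the churn rate itself can serve as a lower cut-off. A minimal sketch with made-up probabilities (the data and the helper `apply_threshold` are illustrative assumptions, not taken from the notebook):

```python
def apply_threshold(probabilities, threshold):
    """Label a customer as churn (1) when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

# Made-up labels and predicted churn probabilities for ten customers.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probabilities = [0.05, 0.10, 0.12, 0.15, 0.08, 0.30, 0.02, 0.11, 0.40, 0.65]

# Class prior: share of churners, analogous to value_counts() / shape[0].
churn_rate = sum(labels) / len(labels)

flagged_default = apply_threshold(probabilities, 0.5)
flagged_prior = apply_threshold(probabilities, churn_rate)

print(churn_rate)            # 0.2
print(sum(flagged_default))  # 1
print(sum(flagged_prior))    # 3
```

In this toy case the lower cut-off catches both churners at the price of one false alarm, which is the trade-off the notebook accepts by using 0.185.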

In [24]:
from statistics import mean

def scorer(model, threshold):
    """Return the mean ROC AUC, recall and precision over 5 stratified folds."""
    kf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
    roc_scores = []
    recall_scores = []
    precision_scores = []

    for train_index, val_index in kf.split(x, y):
        # kf.split yields positional indices, so select rows with iloc
        xtr, xvl = x.iloc[train_index], x.iloc[val_index]
        ytr, yvl = y.iloc[train_index], y.iloc[val_index]

        # fit() retrains the estimator from scratch on each fold
        model.fit(xtr, ytr)
        pred_prob = model.predict_proba(xvl)[:, 1]

        # Label as churn when the predicted probability exceeds the threshold
        pred_val = (pred_prob > threshold).astype(int)

        roc_scores.append(metrics.roc_auc_score(yvl, pred_prob))
        recall_scores.append(metrics.recall_score(yvl, pred_val))
        precision_scores.append(metrics.precision_score(yvl, pred_val))

    return mean(roc_scores), mean(recall_scores), mean(precision_scores)

In [25]:
scores_rf = scorer(RandomForestClassifier(),0.185)
scores_lr = scorer(LogisticRegression(),0.185)

In [26]:
print("Random Forest Model Cross Validation Metrics:")
print("Auc-Roc Score : " + str(scores_rf[0]), "Recall Score :" + str(scores_rf[1]),'\n')

print("Logistic Regression Model Cross Validation Metrics:")
print("Auc-Roc Score : " + str(scores_lr[0]), "Recall Score :" + str(scores_lr[1]),'\n')

Random Forest Model Cross Validation Metrics:
Auc-Roc Score : 0.8346849794906871 Recall Score :0.7623574144486692 

Logistic Regression Model Cross Validation Metrics:
Auc-Roc Score : 0.7491168971456754 Recall Score :0.7129277566539924 



**Recall is a relevant metric here because false negatives are the costlier error: if a customer who will churn is predicted not to churn, the bank takes no retention action and may lose that customer. Since we want to minimize false negatives, recall is a suitable measure of model quality. With our best model (random forest), about 76% of the customers who churn are correctly identified.**
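
To make the metric concrete: recall is TP / (TP + FN) and precision is TP / (TP + FP), where TP, FN and FP count true positives, false negatives and false positives. A small self-contained sketch on made-up labels (not the bank data; `recall_and_precision` is a hypothetical helper with the same intent as `metrics.recall_score` and `metrics.precision_score`):

```python
def recall_and_precision(y_true, y_pred):
    """Compute recall and precision for binary labels where 1 means churn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Four churners, six non-churners; one churner is missed, two non-churners are flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

recall, precision = recall_and_precision(y_true, y_pred)
print(recall)     # 0.75
print(precision)  # 0.6
```

The missed churner lowers recall, while the two false alarms lower only precision, which is why a bank that fears losing customers would watch recall first.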

**Model performance on unseen data**

In [27]:
xtrain, xtest, ytrain, ytest = train_test_split(x,y,test_size=0.30, random_state = 42)

model = RandomForestClassifier()
model.fit(xtrain,ytrain)
pp = model.predict_proba(xtest)[:,1]

# Using the same threshold as in validation
ypt = (pp > 0.185).astype(int)

roc_score = metrics.roc_auc_score(ytest,pp)
recall_score = metrics.recall_score(ytest,ypt)
precision_score = metrics.precision_score(ytest,ypt)

In [29]:
# Performance on unseen data

print("Test Roc Score : " + str(roc_score), "Test Recall Score :" + str(recall_score))

Test Roc Score : 0.8353773919055094 Test Recall Score :0.7648578811369509


**The current model has a good ROC AUC score and a decent recall. It can serve as a baseline and be further optimized (hyperparameter tuning, top-feature selection, etc.) to improve the overall metrics.**
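
As a rough illustration of those next steps, the sketch below runs a small grid search and ranks features by importance. It uses synthetic data from `make_classification` rather than the bank dataset, and the grid values are arbitrary assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the bank data with a roughly 80/20 class balance.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=42)

# Illustrative grid; the candidate values are assumptions.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))

# Rank features by impurity-based importance and keep the indices of the top three.
importances = search.best_estimator_.feature_importances_
top_features = sorted(range(len(importances)),
                      key=lambda i: importances[i], reverse=True)[:3]
print(top_features)
```

On the real data, `X` and `y` would be the notebook's `x` and `y`, and `scoring` could be switched to `"recall"` if recall at the default cut-off is the quantity to optimize.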

*****************************