# Problem Statement and Data Description

**Customer Churn Prediction (Classification)**

A bank wants to improve customer retention for its savings-account product. It wants you to identify customers whose balances are likely to fall below the minimum balance (churn). You have customer information such as age, gender, and other demographics, along with their transactions with the bank.

Your task as a data scientist is to predict the propensity to churn for each customer.


### Data description


## I. Demographic information about customers
-	customer_id - Customer id 
-	vintage - Vintage of the customer with the bank, in days 
-	age - Age of customer 
-	gender - Gender of customer 
-	dependents - Number of dependents 
-	occupation - Occupation of the customer 
-	city - City of the customer (anonymized) 


## II. Customer Bank Relationship
-	customer_nw_category - Net worth of customer (3: Low 2: Medium 1: High) 
-	branch_code - Branch Code for a customer account 
-	days_since_last_transaction - Number of days since the last credit in the last year 


## III. Transactional Information
-	current_balance - Balance as of today 
-	previous_month_end_balance - End of Month Balance of previous month 
-	average_monthly_balance_prevQ - Average monthly balances (AMB) in Previous Quarter 
-	average_monthly_balance_prevQ2 - Average monthly balances (AMB) in the quarter before the previous quarter 
-	current_month_credit - Total Credit Amount current month 
-	previous_month_credit - Total Credit Amount previous month 
-	current_month_debit - Total Debit Amount current month 
-	previous_month_debit - Total Debit Amount previous month 
-	current_month_balance - Average Balance of current month 
-	previous_month_balance - Average Balance of previous month 
-	churn - Average balance of customer falls below minimum balance in the next quarter (1/0) 


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the churn dataset (local path from the author's machine)
df = pd.read_csv("C:\\Users\\hp pc\\Desktop\\kuchbhi\\Project\\Project\\churn_prediction.csv")
df.shape

(28382, 21)

In [2]:
df.head()

| Unnamed: 0 | customer_id | vintage | age | gender | dependents | occupation | city | customer_nw_category | branch_code | days_since_last_transaction | ... | previous_month_end_balance | average_monthly_balance_prevQ | average_monthly_balance_prevQ2 | current_month_credit | previous_month_credit | current_month_debit | previous_month_debit | current_month_balance | previous_month_balance | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3135 | 66 | Male | 0.0 | self_employed | 187.0 | 2 | 755 | 224.0 | ... | 1458.71 | 1458.71 | 1449.07 | 0.2 | 0.2 | 0.2 | 0.2 | 1458.71 | 1458.71 | 0 |
| 1 | 2 | 310 | 35 | Male | 0.0 | self_employed |  | 2 | 3214 | 60.0 | ... | 8704.66 | 7799.26 | 12419.41 | 0.56 | 0.56 | 5486.27 | 100.56 | 6496.78 | 8787.61 | 0 |
| 2 | 4 | 2356 | 31 | Male | 0.0 | salaried | 146.0 | 2 | 41 |  | ... | 5815.29 | 4910.17 | 2815.94 | 0.61 | 0.61 | 6046.73 | 259.23 | 5006.28 | 5070.14 | 0 |
| 3 | 5 | 478 | 90 |  |  | self_employed | 1020.0 | 2 | 582 | 147.0 | ... | 2291.91 | 2084.54 | 1006.54 | 0.47 | 0.47 | 0.47 | 2143.33 | 2291.91 | 1669.79 | 1 |
| 4 | 6 | 2531 | 42 | Male | 2.0 | self_employed | 1494.0 | 3 | 388 | 58.0 | ... | 1401.72 | 1643.31 | 1871.12 | 0.33 | 714.61 | 588.62 | 1538.06 | 1157.15 | 1677.16 | 1 |


### Missing value and outlier treatment

In [3]:
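# Keep customers with fewer than 7 dependents. NaN < 7 is False, so rows with
# a missing 'dependents' value are dropped here as well.
df = df[df['dependents'] < 7].copy()
# plt.scatter(df['age'], df['dependents'])
# Impute remaining gaps with fixed fill values (assignment avoids chained inplace fillna)
df['dependents'] = df['dependents'].fillna(0)  # no-op after the filter above
df['gender'] = df['gender'].fillna('Male')
df['occupation'] = df['occupation'].fillna('self_employed')
df['days_since_last_transaction'] = df['days_since_last_transaction'].fillna(70)
# Drop rows that still contain missing values (e.g. city)
df = df.dropna()
# df = df.drop("customer_id", axis=1)
# df = df.drop("branch_code", axis=1)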
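The filter on `dependents` interacts with missing values in a way that is easy to overlook. The sketch below uses a small toy frame (hypothetical values, not the bank data) to show that a `< 7` comparison also removes rows where `dependents` is missing, and how an explicit `isna()` branch would keep them if that were the intent.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one missing and one extreme 'dependents' entry
toy = pd.DataFrame({"dependents": [0.0, 2.0, np.nan, 50.0]})

# NaN < 7 evaluates to False, so the comparison drops the missing row too
filtered = toy[toy["dependents"] < 7]

# Keeping the missing row requires an explicit isna() branch
kept_missing = toy[(toy["dependents"] < 7) | toy["dependents"].isna()]

print(len(filtered), len(kept_missing))
```

Which variant is appropriate depends on whether rows with unknown `dependents` should be imputed or discarded; the notebook as written discards them.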

### Variable Transformation

### Model building

In [4]:
# One-hot encode the categorical (string) columns such as gender and occupation
df = pd.get_dummies(df)
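As a small illustration of what `pd.get_dummies` does to a frame, the sketch below encodes a toy frame (hypothetical rows) with two string columns. Numeric columns pass through unchanged and each string column is expanded into one indicator column per category.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one numeric and two string columns
toy = pd.DataFrame({
    "age": [66, 35, 31],
    "gender": ["Male", "Female", "Male"],
    "occupation": ["self_employed", "salaried", "self_employed"],
})

# Each string column becomes one indicator column per distinct value
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```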

In [5]:
x = df.drop(['churn'], axis=1)
y = df['churn']
x.shape, y.shape

((25175, 25), (25175,))

In [6]:
# Stratified split keeps the churn ratio the same in train and test
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=56, stratify=y)
# Fit the scaler on the training set only, then apply it to the test set
scaler = MinMaxScaler()
cols = train_x.columns
train_x_scaled = scaler.fit_transform(train_x)
train_x = pd.DataFrame(train_x_scaled, columns=cols)
test_x_scaled = scaler.transform(test_x)
test_x = pd.DataFrame(test_x_scaled, columns=cols)

### Random forest

In [7]:
clf = RandomForestClassifier(random_state=50, max_depth=6, n_estimators=76)
clf.fit(train_x, train_y)
# f1_score expects (y_true, y_pred)
trainscore = f1_score(train_y, clf.predict(train_x))
pred1 = clf.predict(test_x)
testscore = f1_score(test_y, pred1)
print(trainscore, testscore)      # F1 on train and test
print(clf.score(test_x, test_y))  # accuracy on test

0.4876182806523051 0.49112426035502954
0.8633619319987289
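The scores above show a fairly high accuracy next to a much lower F1. That pattern is common when the positive class is a minority, because accuracy rewards correct negatives while F1 only looks at the positive class. The sketch below uses hand-made labels (hypothetical, not the model's predictions) to show the effect, and also that binary F1 gives the same value if the two arguments are swapped, since swapping only exchanges false positives and false negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 2 positives out of 10, one of which is missed
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)    # 9 of 10 correct
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
f1_swapped = f1_score(y_pred, y_true)   # arguments swapped

print(acc, prec, rec, f1, f1_swapped)
```

Precision and recall, unlike F1, do change when the arguments are swapped, so passing `(y_true, y_pred)` in that order is the safer habit.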


### Decision Tree

In [8]:
clf = DecisionTreeClassifier(random_state=96, max_depth=4, min_samples_leaf=240)
clf.fit(train_x, train_y)
# f1_score expects (y_true, y_pred)
trainscore = f1_score(train_y, clf.predict(train_x))
pred2 = clf.predict(test_x)
testscore = f1_score(test_y, pred2)
print(trainscore, testscore)      # F1 on train and test
print(clf.score(test_x, test_y))  # accuracy on test

0.521919944550338 0.5360928823826351
0.8539879250079441


### Logistic Regression

In [9]:
logreg = LogReg(solver='saga')
logreg.fit(train_x, train_y)
# f1_score expects (y_true, y_pred)
trainscore = f1_score(train_y, logreg.predict(train_x))
pred3 = logreg.predict(test_x)
testscore = f1_score(test_y, pred3)
print(trainscore, testscore)         # F1 on train and test
print(logreg.score(test_x, test_y))  # accuracy on test

0.002320185614849188 0.003481288076588338
0.8180807117890054
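An F1 close to zero alongside an accuracy of about 0.82 suggests the model predicts the negative class for almost every customer. One possible lever, not verified on this dataset, is the `class_weight="balanced"` option of scikit-learn's `LogisticRegression`, which reweights samples inversely to class frequency. The sketch below compares both settings on synthetic imbalanced data from `make_classification`; the data, sizes, and `max_iter` value are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score

# Synthetic imbalanced data (about 15% positives); a stand-in, not the bank dataset
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Same model with and without class reweighting
plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

pred_plain = plain.predict(X)
pred_balanced = balanced.predict(X)

print("recall plain:   ", recall_score(y, pred_plain))
print("recall balanced:", recall_score(y, pred_balanced))
print("F1 plain:       ", f1_score(y, pred_plain))
print("F1 balanced:    ", f1_score(y, pred_balanced))
```

Reweighting usually trades precision for recall, so whether F1 improves has to be checked on held-out data rather than assumed.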


### K-Nearest Neighbors (KNN)

In [10]:
k_nn = KNN(n_neighbors=5)
k_nn.fit(train_x, train_y)
# f1_score expects (y_true, y_pred)
trainscore = f1_score(train_y, k_nn.predict(train_x))
pred4 = k_nn.predict(test_x)
testscore = f1_score(test_y, pred4)
print(trainscore, testscore)       # F1 on train and test
print(k_nn.score(test_x, test_y))  # accuracy on test

0.2701465201465202 0.08531468531468532
0.792183031458532


### Ensemble

In [19]:
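from statistics import mode
# Majority vote across the four models. On a 2-2 tie, mode() (Python 3.8+) returns
# the first value encountered, i.e. the random forest's prediction (pred1).
final_pred = np.array([mode(votes) for votes in zip(pred1, pred2, pred3, pred4)])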

In [22]:
from sklearn.metrics import accuracy_score
print("Accuracy_score=", accuracy_score(test_y, final_pred))
# f1_score expects (y_true, y_pred)
print("F1_score=", f1_score(test_y, final_pred))

Accuracy_score= 0.8603431839847474
F1_score= 0.46823956442831216
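Because the vote combines four models, 2-2 ties can occur, and the outcome then depends on how ties are broken. The sketch below shows a vectorized majority vote with an explicit tie rule on hypothetical predictions. The tie-break used here (fall back to the first model) is an assumption chosen to mirror what `statistics.mode` does on Python 3.8+; the array `preds` is made up for illustration.

```python
import numpy as np

# Hypothetical predictions from four binary classifiers (rows) on five samples (columns)
preds = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

n_models = preds.shape[0]
votes = preds.sum(axis=0)                      # number of models voting 1 per sample
majority = (votes * 2 > n_models).astype(int)  # 1 only with a strict majority
ties = votes * 2 == n_models                   # exactly half the models vote 1

# Tie-break assumption: use the first model's prediction
final = np.where(ties, preds[0], majority)
print(final, ties)
```

Using an odd number of voters, or averaging predicted probabilities instead of hard labels, would avoid ties altogether.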
