# Approaching (Almost) Any Churn Prediction Problem for Classification on Kaggle

![](https://i0.wp.com/www.everythingai.co.in/wp-content/uploads/2018/01/Churn.png?resize=900%2C450&ssl=1)

In this post, I'll talk about approaching churn prediction problems on Kaggle. As an example, we will use the data from this competition. I first build a very basic model with each classification algorithm and then improve the algorithm parameters.

## Classification Algorithms Covered
* LogisticRegression
* XGBClassifier
* MultinomialNB
* AdaBoostClassifier
* KNeighborsClassifier
* GradientBoostingClassifier
* ExtraTreesClassifier
* DecisionTreeClassifier

## Road Map 
* Library for Preprocessing and Cleaning
* Load all Classification Packages and Accuracy Packages
* Load Data Set
* Analyse the Data 
* LabelEncoder
* Split the Data Train and Validation
* Train Model and Check Validation Data Accuracy

### Library for Preprocessing and Cleaning

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import seaborn as sns

# Any results you write to the current directory are saved as output.

### Load all Classification Packages and Accuracy Packages

In [None]:
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier,RadiusNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC,SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier,ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score,roc_auc_score

### The Importance of Predicting Customer Churn

The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents a huge additional potential revenue source for every online business. Besides the direct loss of revenue that results from a customer abandoning the business, the costs of initially acquiring that customer may not have already been covered by the customer’s spending to date. (In other words, acquiring that customer may have actually been a losing investment.) Furthermore, it is always more difficult and expensive to acquire a new customer than it is to retain a current paying customer.

Reference : [Link](https://www.optimove.com/learning-center/customer-churn-prediction-and-prevention)

### Load Data set

In [None]:
df = pd.read_csv('../input/bigml_59c28831336c6604c800002a.csv')
df.head(5)

### Overview: Remove Column, Shape, Null Values and Data Types

In [None]:
df = df.drop(['phone number'],axis=1)
df.shape

In [None]:
df.isnull().sum()

In [None]:
print("------  Data Types  ----- \n",df.dtypes)
print("------  Data type Count  ----- \n",df.dtypes.value_counts())

### Label Encoding for Categorical Variables

In [None]:
# Collect the names of all boolean and string (object) columns
cate = df.select_dtypes(include=['bool', 'object']).columns.tolist()

In [None]:
le = preprocessing.LabelEncoder()
for i in cate:
    le.fit(df[i])
    df[i] = le.transform(df[i])
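
One caveat with the loop above: it reuses a single `LabelEncoder`, so once the loop finishes only the mapping for the last column is still available. The sketch below is a minimal illustration on a made-up two-column frame (the values are hypothetical, not taken from the competition data) that keeps one fitted encoder per column, so the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical values; column names mimic the churn data
toy = pd.DataFrame({
    "state": ["OH", "NJ", "OH", "KS"],
    "international plan": ["no", "yes", "no", "no"],
})

encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep one fitted encoder per column

# classes_ holds the original labels; position i maps to integer code i
state_classes = list(encoders["state"].classes_)
decoded = list(encoders["state"].inverse_transform(toy["state"]))
print(state_classes, toy["state"].tolist(), decoded)
```

Keep in mind that label encoding imposes an arbitrary order on the categories (alphabetical here). Tree-based models usually cope with that, while linear and distance-based models may read the order as meaningful.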
    

### Correlation Plot

In [None]:
corrmat = df.corr(method='pearson')
f, ax = plt.subplots(figsize=(8, 8))

# Draw the heatmap using seaborn
sns.heatmap(corrmat, vmax=1., square=True)
plt.title("Important variables correlation map", fontsize=15)
plt.show()
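
A heatmap is good for a first look, but it can be hard to read exact values from it. As a rough complement, the sketch below ranks features by the absolute value of their Pearson correlation with the target. It runs on synthetic data that I made up for illustration (`day_minutes` and `noise` are hypothetical columns), not on the competition file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic, made-up data: "day_minutes" drives churn, "noise" does not
day_minutes = rng.normal(180, 50, n)
noise = rng.normal(0, 1, n)
churn = (day_minutes + rng.normal(0, 20, n) > 220).astype(int)
toy = pd.DataFrame({"day_minutes": day_minutes, "noise": noise, "churn": churn})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr(method="pearson")["churn"]
    .drop("churn")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

On the real frame the same chain would be applied to `df` before the `churn` column is dropped.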

In [None]:
y = df['churn']
df = df.drop(['churn'], axis=1)

### Feature Importance by XGB

XGBClassifier achieved high accuracy on this data, so I use it to get insight into which features matter most.

In [None]:
# n_jobs is the scikit-learn style name for the thread count (nthread is the older alias)
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, n_jobs=10, learning_rate=0.1)
clf.fit(df, y)
# plot the important features #
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(clf, max_num_features=50, height=0.8, ax=ax)
plt.show()
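
If you want the importances as numbers rather than a plot, fitted tree ensembles expose a `feature_importances_` attribute. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in so it runs without xgboost (to my knowledge `XGBClassifier` offers the same attribute), and the data and column names are synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 300
# Synthetic features with hypothetical names
X = pd.DataFrame({
    "day_minutes": rng.normal(180, 50, n),
    "service_calls": rng.integers(0, 8, n),
    "noise": rng.normal(0, 1, n),
})
# Made-up rule for the target: heavy daytime users churn
y = (X["day_minutes"] > 230).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```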

### Split Train and Validation Dataset

In [None]:
xtrain, xvalid, ytrain, yvalid = train_test_split(df, y, 
                                                  stratify=y, 
                                                  random_state=42, 
                                                  test_size=0.1, shuffle=True)

In [None]:
print(xtrain.shape, xvalid.shape, ytrain.shape, yvalid.shape)
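
Why `stratify=y`? Churn data is usually imbalanced, and a stratified split keeps the churn rate roughly the same in the train and validation parts. Here is a small sketch with 1000 synthetic labels and an assumed 15% churn rate (a figure picked for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows with a 15% positive rate (assumed for illustration)
y_toy = np.array([1] * 150 + [0] * 850)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.1, random_state=42, shuffle=True
)

# Share of positives in each part of the split
train_rate = y_tr.mean()
valid_rate = y_va.mean()
print(train_rate, valid_rate)
```

The same rate is also a useful baseline: with 15% churners, a model that always predicts "no churn" already scores 85% accuracy, so accuracy alone can look better than it is.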

### Classification Algorithms Covered
* LogisticRegression
* XGBClassifier
* MultinomialNB
* AdaBoostClassifier
* KNeighborsClassifier
* GradientBoostingClassifier
* ExtraTreesClassifier
* DecisionTreeClassifier 

### LogisticRegression

In [None]:
clf = LogisticRegression(C=1.0)
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
lr = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]
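
A note on the AUC numbers: the cells in this notebook pass hard 0/1 predictions to `roc_auc_score`. That returns a valid number, but for a binary problem it equals the balanced accuracy at a single threshold rather than the area under the full ROC curve. The sketch below shows both variants side by side on synthetic data from `make_classification` (an assumption, not the competition data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
labels = model.predict(X_va)              # hard 0/1 predictions
scores = model.predict_proba(X_va)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_va, labels)
auc_from_scores = roc_auc_score(y_va, scores)
bal_acc = balanced_accuracy_score(y_va, labels)
print(auc_from_labels, bal_acc, auc_from_scores)
```

If you want a threshold-free ranking metric, the probability-based value is usually the one to report.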

In [None]:
algo = pd.DataFrame([lr])

### XGBClassifier

In [None]:
# n_jobs is the scikit-learn style name for the thread count (nthread is the older alias)
clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, n_jobs=10, learning_rate=0.1)
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
xg = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]

In [None]:
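# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
algo = pd.concat([algo, pd.DataFrame([xg])], ignore_index=True)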

### MultinomialNB

In [None]:
clf = MultinomialNB()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
mnb = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]
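
`MultinomialNB` works here because every feature in this data set is non-negative (counts, minutes, charges and label codes). It rejects negative inputs, so on other data you may need to rescale first. A minimal sketch on a tiny made-up array, using `MinMaxScaler` as one possible fix:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up data with negative values in the second column
X = np.array([[3.0, -1.0], [0.0, 2.0], [5.0, 0.5], [1.0, -2.0]])
y = np.array([1, 0, 1, 0])

try:
    MultinomialNB().fit(X, y)
    raised = False
except ValueError:
    raised = True  # negative inputs are rejected

# Rescale each column to [0, 1] so every value is non-negative
X_scaled = MinMaxScaler().fit_transform(X)
model = MultinomialNB().fit(X_scaled, y)
preds = model.predict(X_scaled)
print(raised, X_scaled.min(), preds)
```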

In [None]:
algo = pd.concat([algo, pd.DataFrame([mnb])], ignore_index=True)

### AdaBoostClassifier

In [None]:
clf = AdaBoostClassifier()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
abc = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]

In [None]:
algo = pd.concat([algo, pd.DataFrame([abc])], ignore_index=True)

### KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
knc = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]
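
KNN is distance based, so features measured on a large scale can dominate the neighbour search. Standardising the features first often helps, though not always. The sketch below compares plain KNN with a `StandardScaler` + KNN pipeline on synthetic data in which one feature is deliberately inflated (all of this is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X[:, 0] *= 1000.0  # inflate one feature so it dominates raw distances
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=1
)

raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)
# The pipeline learns the scaling on the training part only
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_acc = accuracy_score(y_va, raw_knn.predict(X_va))
scaled_acc = accuracy_score(y_va, scaled_knn.predict(X_va))
print(raw_acc, scaled_acc)
```

Putting the scaler inside the pipeline keeps validation rows out of the scaling statistics.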

In [None]:
algo = pd.concat([algo, pd.DataFrame([knc])], ignore_index=True)

### GradientBoostingClassifier

In [None]:
clf = GradientBoostingClassifier()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
gbc = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]

In [None]:
algo = pd.concat([algo, pd.DataFrame([gbc])], ignore_index=True)

### ExtraTreesClassifier

In [None]:
clf = ExtraTreesClassifier()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
etc = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]

In [None]:
algo = pd.concat([algo, pd.DataFrame([etc])], ignore_index=True)

### DecisionTreeClassifier

In [None]:
clf = DecisionTreeClassifier()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xvalid)
print("accuracy_score",accuracy_score(yvalid, predictions))
print("auc",roc_auc_score(yvalid, predictions))
dtc = [clf.__class__,accuracy_score(yvalid, predictions),roc_auc_score(yvalid, predictions)]
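
All models so far use default (or hand-picked) parameters. One common way to improve them is a cross-validated grid search. The sketch below tunes a `DecisionTreeClassifier` on synthetic data; the grid values are arbitrary assumptions chosen to keep the example fast, not recommended settings for this data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem standing in for the churn data
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Hypothetical grid: 4 depths x 3 leaf sizes = 12 candidates
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated AUC
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is the tree refit on all the data passed to `fit` with the best parameters.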

In [None]:
algo = pd.concat([algo, pd.DataFrame([dtc])], ignore_index=True)

In [None]:
algo.sort_values([1], ascending=[False])
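
The per-model cells above all follow the same pattern, so they can also be folded into one loop. This is only a sketch of that idea: `evaluate` is a hypothetical helper of mine, the data is synthetic, and only four of the models are included to keep it short. It also names the result columns, which makes the final table easier to read than the numeric column labels.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_tr, y_tr, X_va, y_va):
    """Fit one model and return a row of validation metrics."""
    model.fit(X_tr, y_tr)
    preds = model.predict(X_va)
    scores = model.predict_proba(X_va)[:, 1]  # probability of the positive class
    return {
        "model": type(model).__name__,
        "accuracy": accuracy_score(y_va, preds),
        "auc": roc_auc_score(y_va, scores),
    }


# Synthetic stand-in for the churn data
X, y = make_classification(
    n_samples=800, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.1, random_state=42
)

models = [
    LogisticRegression(max_iter=1000),
    GradientBoostingClassifier(random_state=0),
    ExtraTreesClassifier(random_state=0),
    DecisionTreeClassifier(random_state=0),
]

results = pd.DataFrame([evaluate(m, X_tr, y_tr, X_va, y_va) for m in models])
results = results.sort_values("accuracy", ascending=False).reset_index(drop=True)
print(results)
```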