## Churn Prediction using AMLWorkbench

This notebook introduces the churn dataset and uses it to build churn prediction models. The data comes from the SIGKDD 2009 competition. It consists of heterogeneous, noisy, anonymized data (numerical and categorical variables) from the French telecom company Orange.

We will use the .dprep file created from the datasource wizard.

In [11]:
import dataprep
from dataprep.Package import Package
import pickle

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd
import numpy as np
import csv
from sklearn.metrics import accuracy_score
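# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split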
from sklearn.preprocessing import LabelEncoder

## Read Data

We first load the data into a data frame from the .dprep package that we created with the datasource wizard, then print the first few rows with head().

In [4]:
with Package.open_package('CATelcoCustomerChurnTrainingSample.dprep') as pkg:
    df = pkg.dataflows[0].get_dataframe()

df.head()

| | age | annualincome | calldroprate | callfailurerate | callingnum | customerid | customersuspended | education | gender | homeowner | ... | penaltytoswitch | state | totalminsusedinlastmonth | unpaidbalance | usesinternetservice | usesvoiceservice | percentagecalloutsidenetwork | totalcallduration | avgcallduration | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.0 | 168147.0 | 0.06 | 0.0 | 4251078000.0 | 1.0 | Yes | Bachelor or equivalent | Male | Yes | ... | 371.0 | WA | 15.0 | 19.0 | No | No | 0.82 | 5971.0 | 663.0 | 0.0 |
| 1 | 12.0 | 168147.0 | 0.06 | 0.0 | 4251078000.0 | 1.0 | Yes | Bachelor or equivalent | Male | Yes | ... | 371.0 | WA | 15.0 | 19.0 | No | No | 0.82 | 3981.0 | 995.0 | 0.0 |
| 2 | 42.0 | 29047.0 | 0.05 | 0.01 | 4251043000.0 | 2.0 | Yes | Bachelor or equivalent | Female | Yes | ... | 43.0 | WI | 212.0 | 34.0 | No | Yes | 0.27 | 7379.0 | 737.0 | 0.0 |
| 3 | 42.0 | 29047.0 | 0.05 | 0.01 | 4251043000.0 | 2.0 | Yes | Bachelor or equivalent | Female | Yes | ... | 43.0 | WI | 212.0 | 34.0 | No | Yes | 0.27 | 1729.0 | 432.0 | 0.0 |
| 4 | 58.0 | 27076.0 | 0.07 | 0.02 | 4251056000.0 | 3.0 | Yes | Master or equivalent | Female | Yes | ... | 403.0 | KS | 216.0 | 144.0 | No | No | 0.48 | 3122.0 | 624.0 | 0.0 |


## Encode Columns

Convert each categorical variable into dummy/indicator variables using pandas.get_dummies. Several columns share the same values (for example Yes/No), so we prefix each indicator column with the name of its source column to keep the column names unique.

In [5]:
columns_to_encode = list(df.select_dtypes(include=['category','object']))
print(columns_to_encode)
for column_to_encode in columns_to_encode:
    dummies = pd.get_dummies(df[column_to_encode])
    one_hot_col_names = []
    for col_name in list(dummies.columns):
        one_hot_col_names.append(column_to_encode + '_' + col_name)
    dummies.columns = one_hot_col_names
    df = df.drop(column_to_encode, axis=1)
    df = df.join(dummies)

print("Encoded columns:")
print(df.columns)

['customersuspended', 'education', 'gender', 'homeowner', 'maritalstatus', 'noadditionallines', 'occupation', 'state', 'usesinternetservice', 'usesvoiceservice']
Encoded columns:
Index(['age', 'annualincome', 'calldroprate', 'callfailurerate', 'callingnum',
       'customerid', 'monthlybilledamount', 'numberofcomplaints',
       'numberofmonthunpaid', 'numdayscontractequipmentplanexpiring',
       'penaltytoswitch', 'totalminsusedinlastmonth', 'unpaidbalance',
       'percentagecalloutsidenetwork', 'totalcallduration', 'avgcallduration',
       'churn', 'customersuspended_No', 'customersuspended_Yes',
       'education_Bachelor or equivalent', 'education_High School or below',
       'education_Master or equivalent', 'education_PhD or equivalent',
       'gender_Female', 'gender_Male', 'homeowner_No', 'homeowner_Yes',
       'maritalstatus_Married', 'maritalstatus_Single', 'noadditionallines_\N',
       'occupation_Non-technology Related Job', 'occupation_Others',
       'occupation_Te
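As a smaller illustration of the same idea, the sketch below encodes a toy data frame. The frame `toy` and its values are made up for this example. It relies on the `columns` argument of `pandas.get_dummies`, which prefixes each indicator column with its source column name by default, so it is an alternative to the explicit renaming loop above rather than a description of what the notebook does.

```python
import pandas as pd

# Toy frame (hypothetical values): two categorical columns share "Yes"/"No".
toy = pd.DataFrame({
    "homeowner": ["Yes", "No", "Yes"],
    "usesvoiceservice": ["No", "No", "Yes"],
    "age": [12, 42, 58],
})

# With `columns=...`, get_dummies names each indicator "<column>_<value>",
# so "homeowner_Yes" and "usesvoiceservice_Yes" do not collide.
encoded = pd.get_dummies(toy, columns=["homeowner", "usesvoiceservice"])
print(list(encoded.columns))
```

Without the prefix, both source columns would produce indicator columns named `No` and `Yes`, which is the collision the renaming step avoids.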

## Modeling

First, we will build a Gaussian Naive Bayes model for churn classification using GaussianNB. Naive Bayes methods are a set of supervised learning algorithms that apply Bayes' theorem with the “naive” assumption that every pair of features is conditionally independent given the class label.
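For reference, the decision rule can be written out. This is the standard textbook form, not something specific to this notebook. The notation is introduced here: $x_i$ is the value of feature $i$, $y$ is the class (churn or no churn), and $\mu_{y,i}$ and $\sigma_{y,i}^2$ are the mean and variance of feature $i$ within class $y$.

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)
```

The product over features is where the independence assumption enters. The Gaussian form of $P(x_i \mid y)$ is what distinguishes the Gaussian variant from other Naive Bayes models.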

We will also build a decision tree classifier for comparison, with the following settings:

- min_samples_split=20 requires at least 20 samples in a node for it to be split
- random_state=99 seeds the random number generator
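To make the effect of `min_samples_split` concrete, here is a minimal sketch on synthetic data. The helper `fit_tree` and the toy data are assumptions for illustration. Each data set has one feature, and the label flips halfway through, so a single split would separate the classes perfectly.

```python
from sklearn.tree import DecisionTreeClassifier

def fit_tree(n_samples):
    """Fit a tree on a toy one-feature data set (hypothetical helper)."""
    X = [[i] for i in range(n_samples)]
    y = [0 if i < n_samples // 2 else 1 for i in range(n_samples)]
    return DecisionTreeClassifier(min_samples_split=20, random_state=99).fit(X, y)

small = fit_tree(10)  # root holds 10 samples (< 20), so it is never split
large = fit_tree(40)  # root holds 40 samples (>= 20), so it can be split

print(small.tree_.node_count, large.tree_.node_count)
```

The small tree stays a single leaf even though a perfect split exists. Raising `min_samples_split` limits how finely the tree can partition the data in this way.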

In [6]:
model = GaussianNB()

# Hold out 30% of the rows for testing
train, test = train_test_split(df, test_size=0.3)

# Separate the label from the features in the training set
target = train['churn'].values
train = train.drop('churn', axis=1).values
model.fit(train, target)

# Do the same for the test set, then score the predictions
expected = test['churn'].values
test = test.drop('churn', axis=1).values
predicted = model.predict(test)
print("Naive Bayes Classification Accuracy", accuracy_score(expected, predicted))

Naive Bayes Classification Accuracy 0.892666666667


In [7]:
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt.fit(train, target)
predicted = dt.predict(test)
print("Decision Tree Classification Accuracy", accuracy_score(expected, predicted))

Decision Tree Classification Accuracy 0.914333333333
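Accuracy figures like these are easier to interpret next to a simple baseline. The sketch below computes the accuracy of always predicting the most frequent label. The function name `majority_baseline_accuracy` and the example labels are hypothetical and are not taken from the churn data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels for illustration only: nine non-churners and one churner.
example_labels = [0.0] * 9 + [1.0]
print(majority_baseline_accuracy(example_labels))
```

In the notebook, calling `majority_baseline_accuracy(list(expected))` would give the baseline for the held-out set. If the classes are imbalanced, a model needs to beat that number before its accuracy says much.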
