The goal of this notebook is to predict whether a customer will churn (i.e. leave) given a set of predictors. 

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

In [2]:
# Load the data
data = pd.read_csv('churn.csv')
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
# Remove unnecessary columns
data.drop(['customerID', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod'], axis = 1, inplace = True)

# Create dummy variables for categorical features
data = pd.get_dummies(data = data, columns = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'TechSupport', 'StreamingTV', 'Contract'], drop_first = True)
data.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,MultipleLines_Yes,InternetService_Fiber optic,InternetService_No,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,Contract_One year,Contract_Two year
0,0,1,29.85,29.85,No,0,1,0,0,1,0,0,0,0,0,0,0,0,0
1,0,34,56.95,1889.5,No,1,0,0,1,0,0,0,0,0,0,0,0,1,0
2,0,2,53.85,108.15,Yes,1,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,45,42.3,1840.75,No,1,0,0,0,1,0,0,0,0,1,0,0,1,0
4,0,2,70.7,151.65,Yes,0,0,0,1,0,0,1,0,0,0,0,0,0,0


In [4]:
X = data.drop('Churn', axis = 1)
y = data.Churn

The next task is to split the data 80%/20% into training and test sets. We then train an SVM with a linear kernel and measure its classification error on the test set, standardizing the features beforehand and tuning the regularization parameter `C` by 10-fold cross-validation.

In [5]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state = 1)
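
The split above is purely random. As an optional variation that this notebook does not use, `train_test_split` also accepts a `stratify` argument that keeps the class proportions equal in both parts. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo` and the 25% churn share are made up for the example and are not taken from `churn.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 customers, 25% labelled 'Yes'
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(['Yes'] * 250 + ['No'] * 750)

# stratify=y_demo keeps the share of 'Yes' equal in the training and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

train_rate = np.mean(y_tr == 'Yes')
test_rate = np.mean(y_te == 'Yes')
print('Share of churners in train:', train_rate)
print('Share of churners in test: ', test_rate)
```

Switching the notebook to a stratified split would change the test set, so the error rates reported here would no longer be directly comparable.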

#### Customer churn prediction using SVM with linear kernel

In [6]:
# Classifier
svm_lin = Pipeline([ ('scaler', StandardScaler()),
                         ('clf', SVC(kernel = 'linear', random_state = 1)) ])
parameters = {'clf__C' : [0.1, 1, 10, 100]}
svm_clf_lin = GridSearchCV(svm_lin, parameters, cv=10, n_jobs = -1)
svm_clf_lin.fit(X_train, y_train)

print(svm_clf_lin.best_params_)
print('Classification error', np.mean(svm_clf_lin.predict(X_test) != y_test))

{'clf__C': 0.1}
Classification error 0.21677327647476902


Result: The classification error of the SVM with a linear kernel on the test set is 21.68%.
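
A single error rate does not show which kind of mistake the model makes. The `confusion_matrix` function imported at the top of the notebook can break the errors down into missed churners and false alarms. The following is a minimal sketch on a handful of made-up labels (`y_true` and `y_pred` are illustrative, not the notebook's predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (assumption), not the notebook's predictions
y_true = np.array(['No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'])
y_pred = np.array(['No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No'])

# Classification error, computed the same way as in the cells above
error = np.mean(y_pred != y_true)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=['No', 'Yes'])
tn, fp, fn, tp = cm.ravel()

# Share of actual churners that the model catches
churn_recall = tp / (tp + fn)

print('Classification error', error)
print(cm)
print('Recall on churners', churn_recall)
```

On the real data the same calls would take `y_test` and `svm_clf_lin.predict(X_test)` as inputs.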

Now let's try polynomial and RBF kernels and see whether they improve the test error.

Note: Make sure to set a seed (`random_state`) so that the results are reproducible.
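
For reference, the three kernels compared in this notebook can be written as follows. This follows the parameterization used by scikit-learn's `SVC`, where $d$ corresponds to `degree`, $r$ to `coef0` and $\gamma$ to `gamma`.

```latex
\begin{aligned}
K_{\text{linear}}(x, x') &= \langle x, x' \rangle \\
K_{\text{poly}}(x, x')   &= \left( \gamma \, \langle x, x' \rangle + r \right)^{d} \\
K_{\text{rbf}}(x, x')    &= \exp\left( -\gamma \, \lVert x - x' \rVert^{2} \right)
\end{aligned}
```

With $d = 1$ and $r = 0$ the polynomial kernel is just a rescaled linear kernel, so a grid that includes `degree = 1` also covers the linear case up to that scaling.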

#### Customer churn prediction using SVM with polynomial kernel

In [7]:
# Polynomial
svm_poly = Pipeline([ ('scaler', StandardScaler()),
                     ('clf', SVC(kernel = 'poly', random_state = 1)) ])
# C must be strictly positive, so the candidate values are listed explicitly
parameters = {'clf__C' : [0.1, 1, 10, 100],
              'clf__degree': [1,2,3,4],
              'clf__coef0': [0,1,2]}
svm_clf_poly = GridSearchCV(svm_poly, parameters, cv=10, n_jobs = -1)
svm_clf_poly.fit(X_train, y_train)

print(svm_clf_poly.best_params_)
print('Classification error',np.mean(svm_clf_poly.predict(X_test) != y_test))


{'clf__C': 100, 'clf__coef0': 0, 'clf__degree': 2}
Classification error 0.21321961620469082


#### Customer churn prediction using SVM with radial basis function kernel

In [8]:
# RBF
svm_rbf = Pipeline([ ('scaler', StandardScaler()),
                     ('clf', SVC(kernel = 'rbf', random_state = 1)) ])
# C must be strictly positive, so the candidate values are listed explicitly
parameters = {'clf__C' : [0.1, 1, 10, 100],
              'clf__gamma': [0.1,0.5,1,2,3,4]}
svm_clf_rbf = GridSearchCV(svm_rbf, parameters, cv=10, n_jobs = -1)
svm_clf_rbf.fit(X_train, y_train)

print(svm_clf_rbf.best_params_)
print('Classification error', np.mean(svm_clf_rbf.predict(X_test) != y_test))

{'clf__C': 1, 'clf__gamma': 0.1}
Classification error 0.2068230277185501


Best result: The SVM with the radial basis function kernel has the lowest test error (20.68%), so on this train/test split it predicts customer churn slightly more accurately than the SVMs with a linear kernel (21.68%) or a polynomial kernel (21.32%).
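
The three SVM test errors differ by about one percentage point, so it helps to know how much a test error of this size fluctuates by chance. The sketch below computes a rough binomial standard error for each reported error rate. The test-set size `n_test = 1407` is an assumption inferred from the printed fractions (for example 0.2068... is very close to 291/1407); it is not stated in the notebook.

```python
import math

# Assumption: test-set size inferred from the printed error fractions
n_test = 1407

# Test errors as printed by the cells above
errors = {
    'linear': 0.21677327647476902,
    'poly': 0.21321961620469082,
    'rbf': 0.2068230277185501,
}

def binomial_se(p, n):
    """Standard error of an error rate p estimated from n independent test cases."""
    return math.sqrt(p * (1 - p) / n)

for name, p in errors.items():
    n_wrong = round(p * n_test)  # implied number of misclassified customers
    print(f'{name:6s} error={p:.4f}  misclassified={n_wrong}  se={binomial_se(p, n_test):.4f}')
```

Under these assumptions the standard error is roughly one percentage point, about the same size as the gaps between the models, so the ranking could change with a different split.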

As an exercise, let's also fit a regularized logistic regression to this data set and see whether it achieves a lower test error.

In [9]:
# Logistic regression with L1 (lasso) regularization, fitted with the liblinear solver
clf_logit = Pipeline([('scale', StandardScaler()),
                      ('clf', LogisticRegression(solver='liblinear', penalty='l1', random_state=1))])

# max_iter has to be an integer, hence dtype=int
parameters = {'clf__C' : np.logspace(-2, 10, 10), 'clf__max_iter' : np.linspace(1000, 10000, 10, dtype=int)}

clf_logistic = GridSearchCV(clf_logit, param_grid = parameters, cv=10, scoring='accuracy', n_jobs=-1)
clf_logistic.fit(X_train, y_train)  # fit model on training set

print(clf_logistic.best_params_)
print('Classification error', np.mean(clf_logistic.predict(X_test) != y_test))

{'clf__C': 0.21544346900318834, 'clf__max_iter': 1000.0}
Classification error 0.21393034825870647


Observation: 

Logistic regression with lasso (L1) regularization achieves a lower test error (21.39%) than the linear SVM (21.68%), but a higher one than the SVMs with polynomial (21.32%) and radial basis function (20.68%) kernels.
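
One practical property of the lasso penalty is that it can set some coefficients exactly to zero, which acts as a form of feature selection. The sketch below illustrates this on synthetic data from `make_classification`; the data, the helper `n_nonzero` and the values of `C` are assumptions chosen for the example and do not come from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=1)

def n_nonzero(C):
    """Fit an L1-penalized logistic regression and count its non-zero coefficients."""
    model = Pipeline([
        ('scale', StandardScaler()),
        ('clf', LogisticRegression(solver='liblinear', penalty='l1', C=C, random_state=1)),
    ])
    model.fit(X_demo, y_demo)
    coefs = model.named_steps['clf'].coef_.ravel()
    return int(np.sum(coefs != 0))

# Smaller C means stronger regularization and typically fewer non-zero coefficients
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    print('C =', C, '-> non-zero coefficients:', n_nonzero(C))
```

For the model fitted in this notebook, the corresponding coefficients would be available as `clf_logistic.best_estimator_.named_steps['clf'].coef_`.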