# Classification - Churn

In [168]:
#Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [169]:
#Import dataset

dataset = pd.read_csv('churn.csv')

In [170]:
dataset

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


# Exploring the dataset

#### Check if any values are missing

In [171]:
print(dataset.isnull().sum())

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


#### Check unique values for each column

In [172]:
print(dataset.nunique())

customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64


##### Check unique values for columns with 3 or more unique values that we might want to simplify or correct

In [173]:
cols_to_check = ["MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup",
                 "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
                 "Contract", "PaymentMethod"]

#Print each column name followed by its unique values
for col in cols_to_check:
    print(col)
    print(dataset[col].unique())

MultipleLines
['No phone service' 'No' 'Yes']
InternetService
['DSL' 'Fiber optic' 'No']
OnlineSecurity
['No' 'Yes' 'No internet service']
OnlineBackup
['Yes' 'No' 'No internet service']
DeviceProtection
['No' 'Yes' 'No internet service']
TechSupport
['No' 'Yes' 'No internet service']
StreamingTV
['No' 'Yes' 'No internet service']
StreamingMovies
['No' 'Yes' 'No internet service']
Contract
['Month-to-month' 'One year' 'Two year']
PaymentMethod
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']


# Data preprocessing

### There's no reason to have two values for 'No'.

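In the service add-on columns, 'No internet service' means the same as 'No', and in MultipleLines 'No phone service' does too. InternetService is the one exception: its 'No' is a category of its own next to 'DSL' and 'Fiber optic'.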

In [174]:
#Changing 'No internet service' and 'No phone service' to 'No'.

dataset = dataset.replace(['No internet service', 'No phone service'], 'No')

### Every column should be numeric.

In [175]:
#Converting churn Yes/No into 1/0
dataset['Churn'] = dataset['Churn'].map({'Yes': 1, 'No': 0})

#InternetService has three possible values ('DSL', 'Fiber optic', 'No'), so it stays categorical
#and is one-hot encoded together with the other categorical columns below.

#### Handle empty strings in TotalCharges since they're not convertible to floats.
We can drop the affected rows, set the value to 0 (which might skew the model), or replace it with the median.

Here the rows are dropped since there are only a few of them.

In [176]:
#Blank strings mark missing TotalCharges: turn them into NaN, then drop those rows
dataset['TotalCharges'] = dataset['TotalCharges'].replace(' ', np.nan)
dataset = dataset.dropna(subset=['TotalCharges']).reset_index(drop=True)

#Convert strings into floats
dataset["TotalCharges"] = pd.to_numeric(dataset["TotalCharges"])
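The median alternative mentioned above could look roughly like this. It is a minimal sketch on a made-up three-row frame (`toy` is not part of churn.csv); `errors="coerce"` makes `pd.to_numeric` turn unparseable strings such as `' '` into NaN.

```python
import pandas as pd

# Hypothetical toy data: one blank TotalCharges entry, as in the raw file
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# Unparseable strings (here the single space) become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Fill the gaps with the median of the observed values
median_charge = toy["TotalCharges"].median()
toy["TotalCharges"] = toy["TotalCharges"].fillna(median_charge)

print(toy)
```

Imputing keeps every customer in the dataset, at the cost of inserting a value that was never observed for those rows.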

### Convert categorical variables
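pandas `get_dummies` converts categorical variables into dummy/indicator variables.

However, we don't want more columns than necessary, so we drop the first category of each variable. This doesn't affect the outcome, because the dropped category is implied when all the remaining indicators are 0, and it prevents redundant data.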

In [177]:
dataset = pd.get_dummies(dataset, columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents','PhoneService', 
                                             'MultipleLines','OnlineSecurity', 'DeviceProtection', 'TechSupport', 
                                             'StreamingTV', 'StreamingMovies','Contract', 'PaperlessBilling', 'PaymentMethod',
                                            'InternetService','OnlineBackup'],
                        drop_first=True)
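To illustrate why dropping one indicator per variable loses no information, here is a small sketch on a made-up three-row frame that uses the Contract categories from the dataset. The frame `toy` and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-row sample using the Contract categories from the dataset
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})

full = pd.get_dummies(toy, columns=["Contract"])
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# The dropped category is the row where every remaining indicator is 0
is_month_to_month = reduced.sum(axis=1) == 0
print(is_month_to_month.tolist())
```

With `drop_first=True` the first category in sorted order ('Month-to-month') has no column of its own, yet it can still be recovered from the two remaining indicators.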

## Feature scaling
The value ranges differ greatly between the numeric columns, but the columns are equally important.

Many models treat larger numbers as more important, so we scale these columns to a comparable range.

In [178]:
from sklearn.preprocessing import StandardScaler

#Scale TotalCharges, MonthlyCharges and tenure.
sc = StandardScaler()
cols_for_scaling = ['MonthlyCharges', 'TotalCharges', 'tenure']
dataset[cols_for_scaling] = sc.fit_transform(dataset[cols_for_scaling])
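As a sanity check on what `StandardScaler` does, the sketch below compares it with the manual formula: subtract the column mean and divide by the column standard deviation. The four tenure values are picked only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical tenure values, chosen only for illustration
tenure = np.array([[1.0], [34.0], [2.0], [45.0]])

scaled = StandardScaler().fit_transform(tenure)

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (tenure - tenure.mean(axis=0)) / tenure.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, which is why the values in the table that follows are small numbers centred around zero.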

In [179]:
dataset

Unnamed: 0,customerID,tenure,MonthlyCharges,TotalCharges,Churn,gender_Male,SeniorCitizen_1,Partner_Yes,Dependents_Yes,PhoneService_Yes,...,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,InternetService_Fiber optic,InternetService_No,OnlineBackup_Yes
0,7590-VHVEG,-1.280248,-1.161694,-0.994194,0,0,0,1,0,0,...,0,0,0,1,0,1,0,0,0,1
1,5575-GNVDE,0.064303,-0.260878,-0.173740,0,1,0,0,0,1,...,0,1,0,0,0,0,1,0,0,0
2,3668-QPYBK,-1.239504,-0.363923,-0.959649,1,1,0,0,0,1,...,0,0,0,1,0,0,1,0,0,1
3,7795-CFOCW,0.512486,-0.747850,-0.195248,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,9237-HQITU,-1.239504,0.196178,-0.940457,1,0,0,0,0,1,...,0,0,0,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,6840-RESVB,-0.343137,0.664868,-0.129180,0,1,0,1,1,1,...,1,1,0,1,0,0,1,0,0,0
7028,2234-XADUH,1.612573,1.276493,2.241056,0,0,0,1,1,1,...,1,1,0,1,1,0,0,1,0,1
7029,4801-JZAZL,-0.872808,-1.170004,-0.854514,0,0,0,1,1,0,...,0,0,0,1,0,1,0,0,0,0
7030,8361-LTMKD,-1.158016,0.319168,-0.872095,1,1,1,1,0,1,...,0,0,0,1,0,0,1,1,0,0


## Splitting the dataset into the Training set and Test set

In [180]:
X = dataset.drop(["Churn", "customerID"], axis = 1)
y = dataset["Churn"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# AI Modelling

In [181]:
from sklearn.metrics import accuracy_score, confusion_matrix

## Logistic Regression

In [182]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state = 40)

### Train

In [183]:
logreg.fit(X_train, y_train)

### Predict

In [184]:
logreg_result = logreg.predict(X_test)

### Accuracy

In [185]:
accuracy_score(y_test, logreg_result)

0.806680881307747

In [186]:
logreg_mat = confusion_matrix(y_test, logreg_result)
print(logreg_mat)

[[934 104]
 [168 201]]


In [187]:
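#934: Correct! (stayed, predicted to stay)
#104: Incorrect! (stayed, predicted to churn)
#168: Incorrect! (churned, predicted to stay)
#201: Correct! (churned, predicted to churn)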
print(934 + 201, "correct predictions")

1135 correct predictions
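Accuracy alone hides how the errors are distributed. As a sketch, the same confusion matrix can also be read as precision and recall for the churn class. scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns; the variable names below are chosen for illustration.

```python
# Confusion matrix reported above for logistic regression:
# rows are actual classes (0 = stayed, 1 = churned), columns are predictions
tn, fp = 934, 104
fn, tp = 168, 201

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Here recall comes out at roughly 0.54, so the model misses a little under half of the customers who actually churned, even though overall accuracy is about 0.81.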


## Random Forest Classifier

In [188]:
from sklearn.ensemble import RandomForestClassifier

### Train

In [189]:
forest = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 40)
forest.fit(X_train, y_train)

### Predict

In [190]:
forest_result = forest.predict(X_test)

### Accuracy

In [191]:
accuracy_score(y_test, forest_result)

0.7967306325515281

In [192]:
forest_mat = confusion_matrix(y_test, forest_result)
print(forest_mat)

[[936 102]
 [184 185]]


In [193]:
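#936: Correct! (stayed, predicted to stay)
#102: Incorrect! (stayed, predicted to churn)
#184: Incorrect! (churned, predicted to stay)
#185: Correct! (churned, predicted to churn)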
print(936 + 185, "correct predictions")

1121 correct predictions


## K-Nearest Neighbors

In [194]:
from sklearn.neighbors import KNeighborsClassifier

### Train

In [195]:
neighbors = KNeighborsClassifier()
neighbors.fit(X_train, y_train)

### Predict

In [196]:
neighbors_result = neighbors.predict(X_test)

### Accuracy

In [197]:
accuracy_score(y_test, neighbors_result)

0.7718550106609808

In [198]:
neighbors_mat = confusion_matrix(y_test, neighbors_result)
print(neighbors_mat)

[[885 153]
 [168 201]]


In [199]:
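#885: Correct! (stayed, predicted to stay)
#153: Incorrect! (stayed, predicted to churn)
#168: Incorrect! (churned, predicted to stay)
#201: Correct! (churned, predicted to churn)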
print(885 + 201, "correct predictions")

1086 correct predictions


## SVC

In [200]:
from sklearn.svm import SVC

### Train

In [201]:
svc = SVC(kernel="linear", random_state=40)
svc.fit(X_train, y_train)

### Predict

In [202]:
svc_result = svc.predict(X_test)

### Accuracy

In [203]:
accuracy_score(y_test, svc_result)

0.798862828713575

In [204]:
svc_mat = confusion_matrix(y_test, svc_result)
print(svc_mat)

[[928 110]
 [173 196]]


In [205]:
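#928: Correct! (stayed, predicted to stay)
#110: Incorrect! (stayed, predicted to churn)
#173: Incorrect! (churned, predicted to stay)
#196: Correct! (churned, predicted to churn)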
print(928 + 196, "correct predictions")

1124 correct predictions


# Results

Logistic Regression : 1135 correct predictions

Random Forest : 1121 correct predictions

K-Nearest Neighbors : 1086 correct predictions

SVC : 1124 correct predictions
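The counts above can be turned back into accuracies by dividing by the size of the test set, which is the sum of any one confusion matrix. A minimal sketch, with dictionary and variable names chosen for illustration ("K-Nearest Neighbors" refers to the `KNeighborsClassifier` model):

```python
# Correct-prediction counts reported above
correct = {
    "Logistic Regression": 1135,
    "Random Forest": 1121,
    "K-Nearest Neighbors": 1086,
    "SVC": 1124,
}

# Test-set size: the four cells of the logistic regression confusion matrix
test_size = 934 + 104 + 168 + 201

accuracy = {name: n / test_size for name, n in correct.items()}
for name, acc in sorted(accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {acc:.4f}")

best = max(accuracy, key=accuracy.get)
```

The four models land within about 3.5 percentage points of each other, so the ranking rests on a fairly small margin.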

In [206]:
#We can present the chance of churn in a new column.

dataset["Chance_of_Churn"] = logreg.predict_proba(dataset[X_test.columns])[:,1]
dataset[["customerID", "Chance_of_Churn"]].head(10)

Unnamed: 0,customerID,Chance_of_Churn
0,7590-VHVEG,0.635158
1,5575-GNVDE,0.047866
2,3668-QPYBK,0.302404
3,7795-CFOCW,0.033413
4,9237-HQITU,0.694138
5,9305-CDSKC,0.792482
6,1452-KIOVK,0.41782
7,6713-OKOMC,0.290284
8,7892-POOKP,0.618156
9,6388-TABGU,0.010671
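One possible use of these probabilities is to flag customers above a chosen cut-off. The sketch below uses the ten values shown above; the threshold of 0.5 and the names `THRESHOLD` and `at_risk` are assumptions for illustration, not part of the original analysis.

```python
# The ten probabilities shown above, keyed by customerID
chance_of_churn = {
    "7590-VHVEG": 0.635158,
    "5575-GNVDE": 0.047866,
    "3668-QPYBK": 0.302404,
    "7795-CFOCW": 0.033413,
    "9237-HQITU": 0.694138,
    "9305-CDSKC": 0.792482,
    "1452-KIOVK": 0.41782,
    "6713-OKOMC": 0.290284,
    "7892-POOKP": 0.618156,
    "6388-TABGU": 0.010671,
}

# Assumed cut-off; a lower value would flag more customers at the cost of more false alarms
THRESHOLD = 0.5

# Customers at or above the threshold, highest risk first
at_risk = sorted(
    (cid for cid, p in chance_of_churn.items() if p >= THRESHOLD),
    key=chance_of_churn.get,
    reverse=True,
)
print(at_risk)
```

Four of the ten customers end up flagged with this cut-off.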


# Conclusion

Logistic Regression has the highest accuracy, with 1135 correct predictions out of 1407 (about 80.7%). LR predicts a categorical outcome, like ours, where the dependent variable is zero or one, false or true. The other models tested here can do this as well, but their underlying algorithms also have regression variants. LR can be used for churn, for deciding whether a customer should be granted a loan, and so on. So it's no surprise that LR did well at the task it specializes in.
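To make the zero-or-one prediction concrete, here is a minimal sketch of how a logistic regression turns a weighted sum into a probability with the sigmoid function. The weights, bias and customer values are invented for illustration and are not the coefficients of the fitted `logreg` model.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and one hypothetical (scaled) customer
weights = {"tenure": -1.0, "MonthlyCharges": 0.5}
bias = -1.0
customer = {"tenure": -1.24, "MonthlyCharges": 0.20}

# Weighted sum of the features, then squashed into a probability
z = bias + sum(weights[name] * customer[name] for name in weights)
probability = sigmoid(z)

# Class 1 (churn) when the probability reaches 0.5
prediction = int(probability >= 0.5)
print(round(probability, 3), prediction)
```

In this made-up example a short tenure pushes the weighted sum above zero, so the customer is classified as likely to churn.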

Random Forest can predict more than just true or false: it can be used for regression as well as classification. It trains a number of decision trees (100 here) and combines their outputs into a single result. In this case it did well, with 1121 correct predictions (about 79.7%).
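The way a forest combines its trees can be sketched with plain numbers. The five per-tree probabilities below are invented; scikit-learn's `RandomForestClassifier` averages the trees' probability estimates, and a simple majority vote is shown next to it for comparison.

```python
# Hypothetical churn probabilities from five trees for one customer
tree_probabilities = [0.9, 0.2, 0.7, 0.6, 0.4]

# Averaging the probability estimates of all trees
mean_probability = sum(tree_probabilities) / len(tree_probabilities)
prediction = int(mean_probability >= 0.5)

# For comparison: how many trees would individually vote for churn
hard_votes = sum(p >= 0.5 for p in tree_probabilities)

print(round(mean_probability, 2), prediction, hard_votes)
```

Individual trees can disagree, but the combined result is more stable than any single tree.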

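K-Nearest Neighbors (KNN) classifies a data point based on its nearest neighbours: it stores the training data, calculates the distance from the new point to each stored point, and sorts the distances to find the closest ones. On large datasets this is not optimal. However, our dataset is fairly small, and KNN still reached 1086 correct predictions (about 77.2%), the lowest of the four models.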
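The store, calculate and sort steps described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation; the function `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical scaled features (tenure, MonthlyCharges) and churn labels
X_train = np.array([[-1.2, 0.2], [-1.1, 0.3], [-1.0, 0.5], [1.5, -0.5], [1.2, -0.8]])
y_train = np.array([1, 1, 1, 0, 0])

print(knn_predict(X_train, y_train, np.array([-1.0, 0.3])))
print(knn_predict(X_train, y_train, np.array([1.4, -0.6])))
```

Every prediction computes a distance to all stored points, which is why the approach becomes slow as the training set grows.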

SVC separates the categories with a hyperplane. The data points closest to the hyperplane, the support vectors, determine where it is placed. Its accuracy is the closest to LR, with 1124 correct predictions (about 79.9%).
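A small sketch of the hyperplane idea with a linear-kernel `SVC`: the fitted model exposes its support vectors, and a new point is classified by which side of the hyperplane it falls on. The six 2-D points are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear").fit(X, y)

# Hyperplane parameters: the boundary is where w . x + b = 0
w, b = model.coef_[0], model.intercept_[0]
print("support vectors:\n", model.support_vectors_)

# A point is classified by the side of the hyperplane it falls on
new_point = np.array([3.5, 3.5])
side = np.dot(w, new_point) + b
print(int(side > 0), model.predict([new_point])[0])
```

Only the points nearest the boundary end up as support vectors; the remaining training points could be removed without moving the hyperplane.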

Based on this exercise, Logistic Regression is considered the best model for classifying churn in this dataset. It predicts about eight out of ten customers correctly.