# Telecom Churn Prediction Project

## Final Model Evaluation

This notebook takes the model selected in the Modeling_final notebook and applies it to the test dataset to
make a final assessment of the model.

The model selected from the evaluations in the Modeling_final notebook is logistic regression.

As described in the upstream notebooks, there is a strong correlation between month-to-month contracts and churn. Because the output of the model will be used for targeted marketing, the model needs to perform better than the simple baseline of marketing to every month-to-month customer.
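One way to make that baseline concrete is to score the rule "flag every month-to-month customer" with the same precision and recall used for the model. The sketch below is only an illustration: the helper `rule_precision_recall` and the toy arrays are hypothetical, and it assumes both the contract flag and the churn label are coded as 0/1.

```python
import numpy as np

def rule_precision_recall(y_true, y_flag):
    """Precision and recall of a 0/1 targeting rule against 0/1 churn labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_flag = np.asarray(y_flag).astype(bool)
    tp = int(np.sum(y_true & y_flag))    # churners the rule targets
    fp = int(np.sum(~y_true & y_flag))   # non-churners the rule targets
    fn = int(np.sum(y_true & ~y_flag))   # churners the rule misses
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy values for illustration only (not taken from the project data).
month_to_month = [1, 1, 1, 1, 0, 0, 0, 0]
churn          = [1, 1, 0, 0, 1, 0, 0, 0]

baseline_precision, baseline_recall = rule_precision_recall(churn, month_to_month)
print(baseline_precision, baseline_recall)
```

With the project data, the same helper could be called on the `Churn` and `Month-to-month` columns of the test set, assuming they are 0/1 encoded, to get the numbers the model has to beat.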

In [19]:
import os
import sys
import time

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

import umap


In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve,f1_score
from sklearn.metrics import roc_auc_score, fbeta_score, confusion_matrix

from sklearn.metrics import accuracy_score

from yellowbrick.classifier import ClassificationReport

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV

In [21]:
# Read the CSV files saved by the clean/EDA notebook
train_df = pd.read_csv('../data/churn_train_clean.csv').drop('Unnamed: 0', axis=1)
test_df = pd.read_csv('../data/churn_test_clean.csv').drop('Unnamed: 0', axis=1)
train_df.columns

Index(['customerID', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges',
       'Churn', 'Month-to-month', 'One year', 'DSL', 'Fiber optic', 'Female'],
      dtype='object')

In [22]:
# Define the columns that will be used for the model training and inputs.
input_columns = ['SeniorCitizen', 'Partner', 'Dependents',
                 'tenure', 'PhoneService', 'MultipleLines',
                 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                 'StreamingTV', 'StreamingMovies', 'MonthlyCharges', 'TotalCharges',
                 'Month-to-month', 'One year', 'Fiber optic', 'Female']

# Create the training dataset
X_train = train_df[input_columns]
y_train = train_df['Churn']

# Create the testing dataset
X_test = test_df[input_columns]
y_test = test_df['Churn']

Now that the training and test datasets have been defined, let's build the model.

In [23]:
# Define the logistic regression model using the parameters output from the grid search in the upstream notebooks
log_best = LogisticRegression(C=0.03, penalty='l2', solver='liblinear', random_state=45)

# Fit the model using the training dataset
log_best.fit(X_train,y_train)

# Generate a first pass accuracy score
log_y_hat = log_best.predict(X_test)
log_y_proba = log_best.predict_proba(X_test)
print(accuracy_score(y_test, log_y_hat))

0.8109028960817717


In [24]:
roc_auc_score(y_test, log_y_proba[:,1])

0.8568659030286054

In [25]:
recall_score(y_test, log_y_hat)

0.5448851774530271

In [26]:
f1_score(y_test, log_y_hat)

0.6105263157894737

Next we look more closely at the model's performance. We define a probability threshold, apply it to the predicted churn probabilities to produce new class predictions, and generate a confusion matrix from those predictions.

In [16]:
# Define the threshold we will use to generate a new prediction
# (0.5 matches the cutoff that predict() applies by default)
threshold = 0.5

# Compute the churn probabilities once, then apply the threshold.
churn_proba = log_best.predict_proba(X_test)[:, 1]
y_predict = (churn_proba > threshold).astype(int)

# Generate a confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_predict).ravel()
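To see why the choice of threshold matters, it can help to sweep several values and watch precision and recall trade off against each other. The following is a minimal sketch on made-up labels and scores; `sweep_thresholds`, `toy_y`, and `toy_proba` are hypothetical names, not part of the notebook.

```python
import numpy as np

def sweep_thresholds(y_true, proba, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    rows = []
    for t in thresholds:
        pred = proba > t                     # same rule as the cell above
        tp = int(np.sum(pred & y_true))
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall))
    return rows

# Toy labels and scores for illustration only.
toy_y = [1, 1, 1, 0, 1, 0, 0, 0]
toy_proba = [0.9, 0.7, 0.45, 0.6, 0.35, 0.4, 0.2, 0.1]

results = sweep_thresholds(toy_y, toy_proba, [0.3, 0.5, 0.7])
for t, p, r in results:
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall at the expense of precision. For the real test set, scikit-learn's `precision_recall_curve` (already imported above) computes the same trade-off across every distinct score.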

From the confusion matrix we can calculate precision and recall. For this model we want to maximize recall because of the relatively high cost of losing a customer: each false negative is a churner the marketing campaign never reaches. The formulas for precision and recall are:

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

In [17]:
# Calculate Precision and Recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)

In [18]:
precision, recall

(0.6941489361702128, 0.5448851774530271)
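As a sanity check, the formulas can be worked through with explicit counts. The counts below are an inference: they were back-calculated from the accuracy, precision, recall, and F1 values printed in this notebook, which never prints the confusion matrix itself, so treat them as an assumption. The F2 score at the end is an addition for illustration; it weights recall more heavily than precision, which fits the stated preference for recall.

```python
# Counts back-calculated from the scores reported in this notebook
# (an inference; the notebook does not print the confusion matrix).
tn, fp, fn, tp = 1167, 115, 218, 261

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

# F-beta with beta = 2 treats recall as more important than precision.
beta = 2
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(precision, recall, accuracy, f1, f2)
```

If the inferred counts are right, the first four values should agree with the precision, recall, accuracy, and F1 scores reported earlier.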