# Telecom Churn Prediction Project 

## Modeling and Grid Search for a Hybrid

In this notebook we're going to explore using a hybrid model.<br>

The telecom customers in this dataset can be split into two distinct groups based on the length of their contract.  Customers with month to month contracts represent 88% of the churn.  It seems possible that two different models could be developed, one for each of the groups.  On the final model testing the predictions would then come from the appropriate model.
<br>
<br>
The original training dataset was split into two groups.  Each of these groups was then used as the input to a grid search of the same classifiers used for the single classifier solutions.  The best model for each group will be used in the final training/testing phase of the process and compared to the single classifier models.

In [1]:
import os
import sys
import time

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

sys.path.append('/Users/maboals/Documents/Work/Programming/PyStuff/MyTools/src')
from MLModelingTools import model_test, model_testN

sys.path.append('../src')
from my_eval_tools import calc_roc_data
from my_eval_tools import calc_hybrid_roc_data 
from my_eval_tools import hybrid_predict, hybrid_predict_proba
from my_eval_tools import calc_pr_sweep
from my_eval_tools import predict_sweep

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, precision_recall_curve,f1_score, fbeta_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

from yellowbrick.classifier import ClassificationReport
from sklearn.metrics import confusion_matrix

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

from sklearn.model_selection import GridSearchCV

Using TensorFlow backend.


## Get the clean training data

Read the training data file.  This file was created by running the notebooks:
* Telecom to SQL
* Telecom clean and eda

In [3]:
# Read the csv file save by the clean/eda notebook
train_df = pd.read_csv('../data/churn_train_clean.csv')


# Sometimes the index column is read as an unnamed column, if so drop it
if 'Unnamed: 0' in train_df.columns :
    train_df = train_df.drop('Unnamed: 0', axis=1)
    
train_df.columns

Index(['customerID', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges',
       'Churn', 'Month-to-month', 'One year', 'DSL', 'Fiber optic', 'Female'],
      dtype='object')

Split off just the portion of the dataset that is month to month customers


In [5]:
# Define which columns we're going to use in our modeling.
train_columns1 = ['Month-to-month', 'SeniorCitizen', 'Partner', 'Dependents', \
       'tenure', 'PhoneService', 'MultipleLines',  \
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', \
       'StreamingTV', 'StreamingMovies', 'MonthlyCharges', 'TotalCharges', \
       'Fiber optic', 'Female']

train_columns2 = train_columns1.copy()
train_columns2.append('Churn')

train_columns = train_columns1

month_df = train_df[train_df['Month-to-month'] == 1]
not_month_df = train_df[train_df['Month-to-month'] == 0]

X_month = month_df[train_columns].drop('Month-to-month', axis=1)
y_month= month_df['Churn']

X_not_month = not_month_df[train_columns].drop('Month-to-month', axis=1)
y_not_month = not_month_df['Churn']
X_not_month.info(), y_not_month.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2389 entries, 6 to 5281
Data columns (total 16 columns):
SeniorCitizen       2389 non-null int64
Partner             2389 non-null int64
Dependents          2389 non-null int64
tenure              2389 non-null int64
PhoneService        2389 non-null int64
MultipleLines       2389 non-null int64
OnlineSecurity      2389 non-null int64
OnlineBackup        2389 non-null int64
DeviceProtection    2389 non-null int64
TechSupport         2389 non-null int64
StreamingTV         2389 non-null int64
StreamingMovies     2389 non-null int64
MonthlyCharges      2389 non-null float64
TotalCharges        2389 non-null float64
Fiber optic         2389 non-null int64
Female              2389 non-null int64
dtypes: float64(2), int64(14)
memory usage: 317.3 KB


(None, (2389,))

In [7]:
# Split the original dataset into train and validation groups
#  set stratify to true so both classes are represented it the splits
X_nm_train, X_nm_test, y_nm_train, y_nm_test = train_test_split(X_not_month, y_not_month, test_size=0.2)


X_m_train, X_m_test, y_m_train, y_m_test = train_test_split(X_month, y_month, test_size=0.2)


### Train, test split

Make a training data subset and a validation data subset

Then make a balanced training dataset using smote

In [8]:
# Make a balance set for model training
sm = SMOTE(random_state=42)
X_train_month_smt, y_train_month_smt = sm.fit_resample(X_m_train, y_m_train)
X_train_not_month_smt, y_train_not_month_smt = sm.fit_resample(X_nm_train, y_nm_train)

### Baseline with Logistic Regression

Make a process pipeline to use the grid search cross validation tool

In [28]:
sm = SMOTE(random_state=42)
log_model = LogisticRegression()
steps = [('smt', sm), ('LOG', log_model)]

pipeline = Pipeline(steps) # define the pipeline object.

In [29]:
log_param_grid = {
    'smt__random_state': [45],
    'LOG__solver': ['liblinear'],
    'LOG__C' : [0.001, 0.01, 0.02, 0.03, 0.07, 0.1, 0.5, 0.75, 1, 1.5, 3, 10, 20],
    'LOG__penalty' : ['l1', 'l2']
}

In [30]:
log_grid_month = GridSearchCV(pipeline, param_grid=log_param_grid, scoring='recall',cv=5, n_jobs=-1)
log_grid_not_month = GridSearchCV(pipeline, param_grid=log_param_grid, scoring='recall', cv=5, n_jobs=-1)

In [31]:
log_grid_month.fit(X_train_month_smt, y_train_month_smt)
log_grid_not_month.fit(X_train_not_month_smt, y_train_not_month_smt)

print("score = %3.2f" %(log_grid_month.score(X_m_test,y_m_test)))
print(log_grid_month.best_params_)

print("score = %3.2f" %(log_grid_not_month.score(X_nm_test,y_nm_test)))
print(log_grid_not_month.best_params_)


score = 0.70
{'LOG__C': 10, 'LOG__penalty': 'l1', 'LOG__solver': 'liblinear', 'smt__random_state': 45}
score = 0.34
{'LOG__C': 10, 'LOG__penalty': 'l1', 'LOG__solver': 'liblinear', 'smt__random_state': 45}


In [32]:
'''

recall
score = 0.76
{'LOG__C': 0.01, 'LOG__penalty': 'l1', 'LOG__solver': 'liblinear', 'smt__random_state': 45}

'''

"\n\nrecall\nscore = 0.76\n{'LOG__C': 0.01, 'LOG__penalty': 'l1', 'LOG__solver': 'liblinear', 'smt__random_state': 45}\n\n"

### Create model validation pipeline and do grid search for KNN Model

In [33]:
from imblearn.pipeline import Pipeline
sm = SMOTE(random_state=42)
knn = KNeighborsClassifier()
steps = [('smt', sm), ('KNN', knn)]

pipeline = Pipeline(steps) # define the pipeline object.

Set the parameters for the pipeline steps

In [34]:
knn_param_grid = {
    'smt__random_state': [45],
    'KNN__n_neighbors': [2, 4, 6, 8, 10, 20, 50],
}


Use grid search to find the optimum parameters for the Knn model

In [35]:
knn_grid_month = GridSearchCV(pipeline, param_grid=knn_param_grid, scoring='recall', cv=5, n_jobs=-1)
knn_grid_not_month = GridSearchCV(pipeline, param_grid=knn_param_grid, scoring='recall', cv=5, n_jobs=-1)

In [36]:
knn_grid_month.fit(X_train_month_smt, y_train_month_smt)
print("score = %3.2f" %(knn_grid_month.score(X_m_test,y_m_test)))
print(knn_grid_month.best_params_)

knn_grid_not_month.fit(X_train_not_month_smt, y_train_not_month_smt)
print("score = %3.2f" %(knn_grid_month.score(X_nm_test,y_nm_test)))
print(knn_grid_not_month.best_params_)

score = 0.60
{'KNN__n_neighbors': 20, 'smt__random_state': 45}
score = 0.22
{'KNN__n_neighbors': 8, 'smt__random_state': 45}


In [37]:
'''
recall
score = 0.75
{'KNN__n_neighbors': 10, 'smt__random_state': 45}

'''

"\nrecall\nscore = 0.75\n{'KNN__n_neighbors': 10, 'smt__random_state': 45}\n\n"

## Random Forest Classifier

Build the pipeline and search parameter grid for random forest

In [38]:
sm = SMOTE(random_state=42)
rf = RandomForestClassifier()
steps = [('smt', sm), ('RFC', rf)]

rf_pipeline = Pipeline(steps) # define the pipeline object.

In [39]:
rf_param_grid = {
    'smt__random_state': [10],
    'RFC__n_estimators': [50, 100, 150, 200, 1000],
    'RFC__max_depth' : [2,3,4],
    'RFC__max_features' : [5, 10, 15],
    'RFC__criterion' : ['gini', 'entropy'],
    'RFC__random_state' :[42]
}

In [40]:
rf_grid_month = GridSearchCV(rf_pipeline, param_grid=rf_param_grid, scoring='recall', cv=5, n_jobs=-1)
rf_grid_not_month = GridSearchCV(rf_pipeline, param_grid=rf_param_grid, scoring='recall', cv=5, n_jobs=-1)

In [41]:
rf_grid_month.fit(X_train_month_smt, y_train_month_smt)

print("score = %3.2f" %(rf_grid_month.score(X_m_test,y_m_test)))
print(rf_grid_month.best_params_)

score = 0.71
{'RFC__criterion': 'entropy', 'RFC__max_depth': 2, 'RFC__max_features': 5, 'RFC__n_estimators': 1000, 'RFC__random_state': 42, 'smt__random_state': 10}


In [42]:
rf_grid_not_month.fit(X_train_not_month_smt, y_train_not_month_smt)

print("score = %3.2f" %(rf_grid_not_month.score(X_nm_test,y_nm_test)))
print(rf_grid_not_month.best_params_)

score = 0.69
{'RFC__criterion': 'entropy', 'RFC__max_depth': 4, 'RFC__max_features': 15, 'RFC__n_estimators': 1000, 'RFC__random_state': 42, 'smt__random_state': 10}
