***Authors:  Jarod Carroll, Daihong Chen, Mihir Bhagat***

# Predicting Customer Churn in SyriaTel

**Data Source: https://www.kaggle.com/becksddf/churn-in-telecoms-dataset**

## Methodology

1. Data Acquisition
   - Get from Kaggle
2. Baseline Model
    - Get features so they are usable in a model
    - Split data into training and holdout
    - Split training data into sub training data and evaluation data
    - Train a stacking model
    - Evaluate the model
3. Improve the model
    - Work with the features
    - Change estimators in the stacking model
    - Tune Hyperparameters
4. Final Model
    - Train on full training set
    - Evaluate on holdout set

#### Import Necessary Packages

In [1]:
import pickle
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
from src.model_maker import *

## Data Acquisition

The data was downloaded from Kaggle and placed in the data folder.

## Making a Baseline Model

To make a baseline model, some features were changed. First, the columns for international plan and voice mail plan were converted to integers. Then the state column was one hot encoded so it could be used in the model. The target was set to the churn column.

In [3]:
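# Load the raw customer data
df = pd.read_csv('../../data/Customer Churn Data.csv')
# Convert the yes/no plan columns to 1/0 integers
df['international plan'] = (df['international plan'] == 'yes').astype(int)
df['voice mail plan'] = (df['voice mail plan'] == 'yes').astype(int)
# One hot encode the state column and append the new columns.
# Note: on scikit-learn 1.2 and later, `sparse` is named `sparse_output` and
# `get_feature_names` has been replaced by `get_feature_names_out`.
ohe = OneHotEncoder(sparse = False)
ohe_states = pd.DataFrame(ohe.fit_transform(pd.DataFrame(df['state'])), columns = ohe.get_feature_names())
df = pd.concat([df, ohe_states], axis = 1)
df = df.drop(['state'], axis = 1)

# The target is the churn column; the features exclude it and the identifier-like columns
y = df['churn']
X = df.drop(['churn', 'area code', 'phone number'], axis = 1)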

The data was then split into a training set and a holdout set, and the training data was further split into a training set and a testing set. The testing set was used to check whether changes improved the model. The holdout set is reserved for the final model evaluation.

In [7]:
X_full_train, X_holdout, y_full_train, y_holdout = train_test_split(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_full_train, y_full_train)
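
The calls above use the `train_test_split` defaults, so 25% of the rows go to the second part of each split and the shuffle is unseeded, meaning the exact rows differ from run to run. Below is a sketch of a seeded, stratified variant on synthetic data. The seed value, the `stratify` argument, and the `X_demo`/`y_demo` names are assumptions for illustration and were not used in the original split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (hypothetical, not the real dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=400),
    "feature_b": rng.normal(size=400),
})
y_demo = pd.Series(rng.random(400) < 0.15, name="churn")  # minority positive class

SEED = 42  # arbitrary; any fixed integer makes the split repeatable

# First split: full training data vs. holdout
X_full_train, X_holdout, y_full_train, y_holdout = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=SEED
)
# Second split: sub training data vs. testing data
X_train, X_test, y_train, y_test = train_test_split(
    X_full_train, y_full_train, stratify=y_full_train, random_state=SEED
)

print(len(X_train), len(X_test), len(X_holdout))
```

Stratifying keeps the share of churners roughly equal across the three sets, which can matter when the positive class is small.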

A stacking model was then trained on the training data, with KNN, a random forest classifier, and a gradient boosting classifier as the base estimators. The fitted model was saved to a pickle file and is loaded from there.
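
The training code itself lives outside this notebook. As a rough sketch of what such a stacking model could look like, the example below fits a `StackingClassifier` with the three estimator types named above on synthetic data and round-trips it through `pickle` in memory. The hyperparameters, the synthetic data, and the default final estimator are assumptions, not the settings used for `base_model.pickle`.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical stand-in for the churn features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Base estimators named in the text; settings here are illustrative
estimators = [
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# With no final_estimator given, scikit-learn combines the base
# predictions with a LogisticRegression
stack = StackingClassifier(estimators=estimators)
stack.fit(X_tr, y_tr)

# Serialize and restore in memory; writing to a file with pickle.dump works the same way
restored = pickle.loads(pickle.dumps(stack))
print((restored.predict(X_te) == stack.predict(X_te)).all())
```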

In [8]:
base_model = read_pickle('../../src/base_model.pickle')

To gauge how well this model performs, we evaluated it on the testing set.

In [9]:
metrics(y_test, base_model.predict(X_test))

{'Accuracy': '0.9616',
 'Precision': '0.9879518072289156',
 'Recall': '0.780952380952381',
 'F1': '0.8723404255319148',
 'confusion_matrix': array([[519,   1],
        [ 23,  82]])}
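
The scores above all follow from the confusion matrix. The sketch below recomputes them by hand, assuming the scikit-learn layout in which rows are true classes and columns are predicted classes. It is a worked check of the arithmetic, not the `metrics` helper from `src.model_maker`.

```python
import numpy as np

# Confusion matrix as printed above: [[TN, FP], [FN, TP]]
cm = np.array([[519, 1],
               [23, 82]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted churners who really churned
recall = tp / (tp + fn)              # share of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The 23 false negatives are what pull recall down to about 0.78, while the single false positive keeps precision close to 1.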

## Improving the Model

Several things were tried to improve the model. First, we experimented with the features. Few changes helped, but one was important: summing all of the charge columns gave a larger improvement than anything else, so the summed charge was used in the final model. The state column did not contribute to the model, so it was removed. \
Next, we changed the base estimators. Many were tried, and we found that replacing KNN with a logistic regression improved the model. \
Finally, the hyperparameters of the estimators were tuned. The only change that improved the model was setting the solver of the logistic regression to liblinear. \
After all of these optimizations, a final model was built.
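
The two main changes can be sketched as follows. The column names assume the Kaggle file's `total ... charge` naming, the `total charge` column name and the tiny `demo` frame are hypothetical, and keeping the random forest and gradient boosting estimators alongside the logistic regression is an assumption based on the description above. The stack is only constructed here, not fitted.

```python
import pandas as pd
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

# Made-up values; column names assumed to match the dataset's charge columns
demo = pd.DataFrame({
    "total day charge": [10.0, 20.0],
    "total eve charge": [5.0, 4.0],
    "total night charge": [2.5, 3.0],
    "total intl charge": [1.0, 2.0],
})

# Sum every charge column into a single feature
charge_cols = [c for c in demo.columns if c.endswith("charge")]
demo["total charge"] = demo[charge_cols].sum(axis=1)

# KNN replaced by a logistic regression using the liblinear solver
improved_stack = StackingClassifier(estimators=[
    ("logreg", LogisticRegression(solver="liblinear")),
    ("rf", RandomForestClassifier()),
    ("gb", GradientBoostingClassifier()),
])

print(demo["total charge"].tolist())
```

Whether the individual charge columns were kept next to the summed feature is not stated in this notebook, so the sketch leaves them in place.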

## Final Model

With all of these improvements applied, the final model was trained on the full training set and saved to a pickle file, which is loaded here.

In [10]:
final_model = read_pickle('../../src/model.pickle')

This model was then evaluated using the holdout set.

In [12]:
X_train, X_holdout, y_train, y_holdout = get_train_and_test_data()
metrics(y_holdout, final_model.predict(X_holdout))

{'Accuracy': '0.984',
 'Precision': '1.0',
 'Recall': '0.8881118881118881',
 'F1': '0.9407407407407408',
 'confusion_matrix': array([[857,   0],
        [ 16, 127]])}

As the results show, the model performs very well on the holdout set, with an accuracy of 98.4% and no false positives. Its one weakness is recall: it sometimes predicts that a customer will stay when they actually leave, which happened for 16 of the 143 churners in the holdout set (a recall of 88.8%).