<h1>Capstone Project #1</h1>
<h2>Predicting Customer Churn</h2>
<h3>Cliff Robbins</h3>

<h3>Proposal</h3>
<p>My project will focus on a problem that 28 million businesses face every day of operation: customer churn.</p>

<h3>Description:</h3>
<p><strong>Customer churn</strong>, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers.  Many companies include customer churn rate in their monitoring metrics because retaining a current customer costs much less than acquiring a new one.  
Churn can be voluntary or involuntary.  Voluntary churn occurs when a customer chooses to leave, while involuntary churn is attributable to events such as relocation to a long-term care facility, death, or a move to a different state or geography.  Most analytical models exclude involuntary churn from the metric.
</p>

<h3>Formulation of a Question</h3>
<p>When a company first starts up, the founding members can typically handle all of the various customer concerns.  As the company continues to grow, the founders can no longer serve every client, and support is handed to a customer service team.  That team focuses on current issues, and the proactive approach is lost.</p>
<p>As the company grows, the company still cares about its clients; however, due to the large customer base they can no longer address each and every customer.  This is a real problem for companies.  How does a company proactively predict if a customer is happy or unhappy?  How does a company know if a customer is so unhappy that they are willing to leave?  If a company knew if a customer was getting ready to leave, could they reach out to the customer and mend the relationship?</p>
<h3>Hypothesis</h3>
<p>I believe past customer data can predict future customer churn. </p>
<h3>Prediction</h3>
<p>If I have past customer data that records various features and whether each customer stayed or churned, I can use that data to predict the outcomes for current customers.</p>
<h3>Testing</h3>
<p>To test my hypothesis, I will use a set of customer data with various features along with whether they churned or not.</p>
<p>The data has 7043 rows and can be found at:</p>
<p>https://www.kaggle.com/blastchar/telco-customer-churn</p>

<h2>Beyond Baseline Machine Learning</h2>
In the previous notebook, I fit the data to a Logistic Regression model using L1 and L2 regularization.  The accuracy was 75% for L1 and 74% for L2.  The performance of the model showed an imbalance with respect to customers that churned.  The F1 score for customers that did not churn was 74% and the F1 score for customers that did churn was 27% (for L1 regularization).<br><br>
This reflects the data set, in which customers that churned make up a much smaller share than those that did not.  This is a class imbalance problem that cannot be fixed by adding more data, because the imbalance between the classes is natural.<br><br>
In this notebook I will use different models and data sampling techniques to test if the accuracy and/or performance improves.

In [5]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Load features and labels
X = np.load('data/cp1-3-X-data.npy')
y = np.load('data/cp1-3-y-data.npy')

print(X.shape)
print(y.shape)

(7043, 7)
(7043,)


Split the data and apply SMOTE oversampling to the training data only, so that the test set keeps its original class distribution.<br>
I am using oversampling because there are only 7043 entries.  That is not a lot of data, and undersampling is not recommended when the dataset is small.
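To illustrate what SMOTE does, here is a minimal pure-Python sketch of the core idea: each synthetic point is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. The function name `smote_like_samples`, its parameters, and the four sample points are my own illustrative assumptions; this is not the imbalanced-learn implementation used in the next cell.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (illustration only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority-class points, not taken from the churn data.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (3.0, 3.5)]
new_points = smote_like_samples(minority_points, n_new=5)
print(len(new_points))
```

Because every synthetic point is an interpolation, it stays inside the region already covered by the minority class rather than duplicating existing rows.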

In [4]:
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE 

# REF https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.SMOTE.html

X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42,stratify=y)

sm = SMOTE(random_state=42)
X_train_res,y_train_res = sm.fit_resample(X_train,y_train)

# Class counts of the training labels before and after resampling
print('Original test shape %s' % Counter(y_train))
print('Resample test shape %s' % Counter(y_train_res))


Original test shape Counter({False: 3880, True: 1402})
Resample test shape Counter({False: 3880, True: 3880})
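The counts above also put the accuracy figures in context. As a small sketch, the snippet below computes the share of churners in the training split and the accuracy of a trivial model that always predicts "no churn". The variable names are my own; the counts are the ones printed above.

```python
from collections import Counter

# Training-set class counts as printed above (before resampling).
train_counts = Counter({False: 3880, True: 1402})

total = sum(train_counts.values())
churn_share = train_counts[True] / total
# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(train_counts.values()) / total

print('Churn share: %.3f' % churn_share)
print('Majority-class baseline accuracy: %.3f' % majority_baseline)
```

Roughly a quarter of the customers churned, so a model can score about 73% accuracy without identifying a single churner. This is why I look at precision, recall, and F1 for the churn class and not only at accuracy.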


<h2>Model Selection</h2>
Now that the data is prepared, I will use a variety of models and then compare the accuracy and performance of the models.

<h2>Logistic Regression</h2>

In [17]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
# REF https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

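# The 'liblinear' solver supports the L1 penalty; the default solver
# in recent scikit-learn releases (lbfgs) does not.

# Original Data set
lr = LogisticRegression(penalty='l1',C=0.1,solver='liblinear')
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
lr_res = LogisticRegression(penalty='l1',C=0.1,solver='liblinear')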
lr_res.fit(X_train_res,y_train_res)
y_pred_res = lr_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7444633730834753
[[1205   89]
 [ 361  106]]
              precision    recall  f1-score   support

       False       0.77      0.93      0.84      1294
        True       0.54      0.23      0.32       467

   micro avg       0.74      0.74      0.74      1761
   macro avg       0.66      0.58      0.58      1761
weighted avg       0.71      0.74      0.70      1761


Resampled Data Set
Accuracy Score 0.6700738216922203
[[793 501]
 [ 80 387]]
              precision    recall  f1-score   support

       False       0.91      0.61      0.73      1294
        True       0.44      0.83      0.57       467

   micro avg       0.67      0.67      0.67      1761
   macro avg       0.67      0.72      0.65      1761
weighted avg       0.78      0.67      0.69      1761
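To make the link between the confusion matrices and the reports explicit, the sketch below recomputes the churn-class precision, recall, and F1 from the two matrices printed above. The helper name `minority_metrics` is my own; the matrix layout (rows are true labels, columns are predictions, ordered False then True) is the one scikit-learn's `confusion_matrix` uses.

```python
def minority_metrics(cm):
    """Precision, recall, and F1 for the churn (True) class, plus accuracy.

    cm is laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# Confusion matrices copied from the Logistic Regression output above.
original = [[1205, 89], [361, 106]]
resampled = [[793, 501], [80, 387]]

for name, cm in [('Original', original), ('Resampled', resampled)]:
    p, r, f1, acc = minority_metrics(cm)
    print('%s: precision=%.2f recall=%.2f f1=%.2f accuracy=%.4f' % (name, p, r, f1, acc))
```

Resampling lowers accuracy, but it raises churn recall from 0.23 to 0.83 because the model misses far fewer churners (80 false negatives, down from 361).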





<h2>Naïve Bayes</h2>

In [18]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.naive_bayes import GaussianNB
# REF https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

# Original Data set
nb = GaussianNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
nb_res = GaussianNB()
nb_res.fit(X_train_res,y_train_res)
y_pred_res = nb_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7410562180579217
[[1005  289]
 [ 167  300]]
              precision    recall  f1-score   support

       False       0.86      0.78      0.82      1294
        True       0.51      0.64      0.57       467

   micro avg       0.74      0.74      0.74      1761
   macro avg       0.68      0.71      0.69      1761
weighted avg       0.77      0.74      0.75      1761


Resampled Data Set
Accuracy Score 0.7052810902896082
[[881 413]
 [106 361]]
              precision    recall  f1-score   support

       False       0.89      0.68      0.77      1294
        True       0.47      0.77      0.58       467

   micro avg       0.71      0.71      0.71      1761
   macro avg       0.68      0.73      0.68      1761
weighted avg       0.78      0.71      0.72      1761



<h2>Decision Tree</h2>

In [26]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.tree import DecisionTreeClassifier
# REF https://scikit-learn.org/stable/modules/tree.html

depth = 6 # Tested various depths from 3 to 10

# Original Data set
tree = DecisionTreeClassifier(max_depth=depth,random_state=42,max_features=None,min_samples_leaf=15)
tree.fit(X_train,y_train)
y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
tree_res = DecisionTreeClassifier(max_depth=depth,random_state=42,max_features=None,min_samples_leaf=15)
tree_res.fit(X_train_res,y_train_res)
y_pred_res = tree_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7563884156729132
[[1129  165]
 [ 264  203]]
              precision    recall  f1-score   support

       False       0.81      0.87      0.84      1294
        True       0.55      0.43      0.49       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.68      0.65      0.66      1761
weighted avg       0.74      0.76      0.75      1761


Resampled Data Set
Accuracy Score 0.7240204429301533
[[900 394]
 [ 92 375]]
              precision    recall  f1-score   support

       False       0.91      0.70      0.79      1294
        True       0.49      0.80      0.61       467

   micro avg       0.72      0.72      0.72      1761
   macro avg       0.70      0.75      0.70      1761
weighted avg       0.80      0.72      0.74      1761



<h2>kNN</h2>

In [37]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
# REF https://scikit-learn.org/stable/modules/neighbors.html

# A lower k gives a more complex model and can lead to overfitting.
# A higher k gives a less complex model and can lead to underfitting.
neighbors = 7 # Tested values of k from 1 to 10

# Original Data set
knn = KNeighborsClassifier(n_neighbors=neighbors)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
knn_res = KNeighborsClassifier(n_neighbors=neighbors)
knn_res.fit(X_train_res,y_train_res)
y_pred_res = knn_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7438955139125497
[[1107  187]
 [ 264  203]]
              precision    recall  f1-score   support

       False       0.81      0.86      0.83      1294
        True       0.52      0.43      0.47       467

   micro avg       0.74      0.74      0.74      1761
   macro avg       0.66      0.65      0.65      1761
weighted avg       0.73      0.74      0.74      1761


Resampled Data Set
Accuracy Score 0.7308347529812607
[[1049  245]
 [ 229  238]]
              precision    recall  f1-score   support

       False       0.82      0.81      0.82      1294
        True       0.49      0.51      0.50       467

   micro avg       0.73      0.73      0.73      1761
   macro avg       0.66      0.66      0.66      1761
weighted avg       0.73      0.73      0.73      1761



<h2>SVM</h2>

In [45]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.svm import SVC
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

c = 0.8 # Tested values of C from 0.025 to 1

# Original Data set
svm = SVC(C=c,random_state=42,gamma='auto')
svm.fit(X_train,y_train)
y_pred = svm.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
svm_res = SVC(C=c,random_state=42,gamma='auto')
svm_res.fit(X_train_res,y_train_res)
y_pred_res = svm_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7563884156729132
[[1119  175]
 [ 254  213]]
              precision    recall  f1-score   support

       False       0.82      0.86      0.84      1294
        True       0.55      0.46      0.50       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.68      0.66      0.67      1761
weighted avg       0.74      0.76      0.75      1761


Resampled Data Set
Accuracy Score 0.7047132311186826
[[854 440]
 [ 80 387]]
              precision    recall  f1-score   support

       False       0.91      0.66      0.77      1294
        True       0.47      0.83      0.60       467

   micro avg       0.70      0.70      0.70      1761
   macro avg       0.69      0.74      0.68      1761
weighted avg       0.80      0.70      0.72      1761



<h2>Random Forest with Random Hyperparameter Grid Search</h2>

In [50]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

# Hyperparameter testing
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# Original Data set
rfm = RandomForestClassifier()
rfm_random = RandomizedSearchCV(estimator = rfm, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rfm_random.fit(X_train,y_train)
y_pred = rfm_random.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
rfm_res = RandomForestClassifier()
rfm_random_res = RandomizedSearchCV(estimator = rfm_res, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rfm_random_res.fit(X_train_res,y_train_res)
y_pred_res = rfm_random_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(rfm_random.best_params_)
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))
print(rfm_random_res.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   18.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  3.5min finished


Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   30.7s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  4.9min finished


Original Data Set
Accuracy Score 0.7620670073821693
[[1122  172]
 [ 247  220]]
              precision    recall  f1-score   support

       False       0.82      0.87      0.84      1294
        True       0.56      0.47      0.51       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.69      0.67      0.68      1761
weighted avg       0.75      0.76      0.76      1761

{'n_estimators': 400, 'min_samples_split': 10, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 90, 'bootstrap': True}

Resampled Data Set
Accuracy Score 0.7228847245883021
[[900 394]
 [ 94 373]]
              precision    recall  f1-score   support

       False       0.91      0.70      0.79      1294
        True       0.49      0.80      0.60       467

   micro avg       0.72      0.72      0.72      1761
   macro avg       0.70      0.75      0.70      1761
weighted avg       0.79      0.72      0.74      1761

{'n_estimators': 600, 'min_samples_split': 2, 'min_samples_l

<h2>Random Forest with Hyperparameter Grid Search</h2>
I take the best parameters from the randomized search above and run an exhaustive grid search over values near them.

In [54]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

# Hyperparameter testing

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300,400, 1000]
}

param_grid_res = {
    'bootstrap': [True],
    'max_depth': [40, 50, 70, 80],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [ 2, 4,8],
    'n_estimators': [300,400,500,600,700,800,1000]
}


# Original Data set
rfm = RandomForestClassifier()
rfm_random = GridSearchCV(estimator = rfm, param_grid = param_grid, cv = 3, verbose=2, n_jobs = -1)
rfm_random.fit(X_train,y_train)
y_pred = rfm_random.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
rfm_res = RandomForestClassifier()
rfm_random_res = GridSearchCV(estimator = rfm_res, param_grid = param_grid_res, cv = 3, verbose=2, n_jobs = -1)
rfm_random_res.fit(X_train_res,y_train_res)
y_pred_res = rfm_random_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(rfm_random.best_params_)
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))
print(rfm_random_res.best_params_)

Fitting 3 folds for each of 360 candidates, totalling 1080 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   30.4s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 1080 out of 1080 | elapsed:  4.1min finished


Fitting 3 folds for each of 504 candidates, totalling 1512 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 997 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 1442 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done 1512 out of 1512 | elapsed: 11.2min finished


Original Data Set
Accuracy Score 0.7620670073821693
[[1123  171]
 [ 248  219]]
              precision    recall  f1-score   support

       False       0.82      0.87      0.84      1294
        True       0.56      0.47      0.51       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.69      0.67      0.68      1761
weighted avg       0.75      0.76      0.75      1761

{'bootstrap': True, 'max_depth': 90, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}

Resampled Data Set
Accuracy Score 0.7223168654173765
[[899 395]
 [ 94 373]]
              precision    recall  f1-score   support

       False       0.91      0.69      0.79      1294
        True       0.49      0.80      0.60       467

   micro avg       0.72      0.72      0.72      1761
   macro avg       0.70      0.75      0.70      1761
weighted avg       0.79      0.72      0.74      1761

{'bootstrap': True, 'max_depth': 80, 'max_features': 'auto',
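The number of fits reported in the logs above follows directly from the grid definitions: an exhaustive grid search tries every combination, so the candidate count is the product of the list lengths, and each candidate is fit once per cross-validation fold. The sketch below reproduces that arithmetic; the helper name `n_candidates` is my own, and the grids are copied from the cell above.

```python
from math import prod

param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 400, 1000],
}

param_grid_res = {
    'bootstrap': [True],
    'max_depth': [40, 50, 70, 80],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [2, 4, 8],
    'n_estimators': [300, 400, 500, 600, 700, 800, 1000],
}

def n_candidates(grid):
    """Number of parameter combinations an exhaustive search evaluates."""
    return prod(len(values) for values in grid.values())

cv = 3  # number of cross-validation folds used above
for name, grid in [('original', param_grid), ('resampled', param_grid_res)]:
    candidates = n_candidates(grid)
    print('%s: %d candidates, %d fits' % (name, candidates, candidates * cv))
```

Adding one more value to any list multiplies the total, which is why the resampled search, with seven values for `n_estimators`, needed 1512 fits against 1080 for the original data.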

<h2>Random Forest with Best Estimators</h2>
I fit the Random Forest model directly with the best parameters selected by the grid search.

In [56]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

# Fit each model with the parameters selected from the grid search

# Original Data set
rfm = RandomForestClassifier(bootstrap=True, max_depth=90, max_features='auto', min_samples_leaf=4, min_samples_split=10, n_estimators=100)
rfm.fit(X_train,y_train)
y_pred = rfm.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
rfm_res = RandomForestClassifier(bootstrap=True, max_depth=80, max_features='auto', min_samples_leaf=5, min_samples_split=2, n_estimators=500)
rfm_res.fit(X_train_res,y_train_res)
y_pred_res = rfm_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7586598523566156
[[1125  169]
 [ 256  211]]
              precision    recall  f1-score   support

       False       0.81      0.87      0.84      1294
        True       0.56      0.45      0.50       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.68      0.66      0.67      1761
weighted avg       0.75      0.76      0.75      1761


Resampled Data Set
Accuracy Score 0.7223168654173765
[[899 395]
 [ 94 373]]
              precision    recall  f1-score   support

       False       0.91      0.69      0.79      1294
        True       0.49      0.80      0.60       467

   micro avg       0.72      0.72      0.72      1761
   macro avg       0.70      0.75      0.70      1761
weighted avg       0.79      0.72      0.74      1761



<h2>AdaBoost</h2>

In [60]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.ensemble import AdaBoostClassifier
# REF https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

n = 100

# Original Data set
abc = AdaBoostClassifier(n_estimators=n, random_state=42)
abc.fit(X_train,y_train)
y_pred = abc.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
abc_res = AdaBoostClassifier(n_estimators=n, random_state=42)
abc_res.fit(X_train_res,y_train_res)
y_pred_res = abc_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7643384440658717
[[1135  159]
 [ 256  211]]
              precision    recall  f1-score   support

       False       0.82      0.88      0.85      1294
        True       0.57      0.45      0.50       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.69      0.66      0.67      1761
weighted avg       0.75      0.76      0.75      1761


Resampled Data Set
Accuracy Score 0.7194775695627484
[[901 393]
 [101 366]]
              precision    recall  f1-score   support

       False       0.90      0.70      0.78      1294
        True       0.48      0.78      0.60       467

   micro avg       0.72      0.72      0.72      1761
   macro avg       0.69      0.74      0.69      1761
weighted avg       0.79      0.72      0.74      1761



<h2>Resample with RUS and AdaBoost Model</h2>

In [66]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.ensemble import AdaBoostClassifier
from imblearn.under_sampling import RandomUnderSampler
# https://github.com/dialnd/imbalanced-algorithms

# Resample the training data using random undersampling (RUS)
rus = RandomUnderSampler(random_state=42)
X_train_res,y_train_res = rus.fit_resample(X_train,y_train)

n = 100

# Original Data set
abc = AdaBoostClassifier(n_estimators=n, random_state=42)
abc.fit(X_train,y_train)
y_pred = abc.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

# Resample Data set
abc_res = AdaBoostClassifier(n_estimators=n, random_state=42)
abc_res.fit(X_train_res,y_train_res)
y_pred_res = abc_res.predict(X_test)
accuracy_res = accuracy_score(y_pred_res,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print()
print('Resampled Data Set')
print('Accuracy Score %s' % accuracy_res)
print(confusion_matrix(y_test, y_pred_res))
print(classification_report(y_test, y_pred_res))

Original Data Set
Accuracy Score 0.7643384440658717
[[1135  159]
 [ 256  211]]
              precision    recall  f1-score   support

       False       0.82      0.88      0.85      1294
        True       0.57      0.45      0.50       467

   micro avg       0.76      0.76      0.76      1761
   macro avg       0.69      0.66      0.67      1761
weighted avg       0.75      0.76      0.75      1761


Resampled Data Set
Accuracy Score 0.7262918796138558
[[913 381]
 [101 366]]
              precision    recall  f1-score   support

       False       0.90      0.71      0.79      1294
        True       0.49      0.78      0.60       467

   micro avg       0.73      0.73      0.73      1761
   macro avg       0.70      0.74      0.70      1761
weighted avg       0.79      0.73      0.74      1761



<h2>Leveraging RUSBoost model/resampling library</h2>

In [69]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import rus
# https://github.com/dialnd/imbalanced-algorithms

# Original Data set
abc = rus.RUSBoost(n_estimators=n, n_samples=300)
abc.fit(X_train,y_train)
y_pred = abc.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Original Data Set
Accuracy Score 0.6422487223168655
[[724 570]
 [ 60 407]]
              precision    recall  f1-score   support

       False       0.92      0.56      0.70      1294
        True       0.42      0.87      0.56       467

   micro avg       0.64      0.64      0.64      1761
   macro avg       0.67      0.72      0.63      1761
weighted avg       0.79      0.64      0.66      1761



<h2>Leveraging SMOTEBoost model/resampling library</h2>
Refer to https://www3.nd.edu/~dial/publications/hoens2013imbalanced.pdf

In [70]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import smote
# REF https://github.com/dialnd/imbalanced-algorithms
# REF https://www3.nd.edu/~dial/publications/hoens2013imbalanced.pdf

# Original Data set
abc = smote.SMOTEBoost(n_estimators=n, n_samples=300)
abc.fit(X_train,y_train)
y_pred = abc.predict(X_test)
accuracy = accuracy_score(y_pred,y_test)

print('Original Data Set')
print('Accuracy Score %s' % accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Original Data Set
Accuracy Score 0.5883021010789324
[[587 707]
 [ 18 449]]
              precision    recall  f1-score   support

       False       0.97      0.45      0.62      1294
        True       0.39      0.96      0.55       467

   micro avg       0.59      0.59      0.59      1761
   macro avg       0.68      0.71      0.59      1761
weighted avg       0.82      0.59      0.60      1761



<h2>Assessing results</h2>
For problems with class imbalance, metrics such as precision, recall, and F1 score give good insight into how a classifier performs with respect to the minority class. Depending on the problem, the goal is to optimize the precision and/or recall of the classifier. In this case, I want a model that catches as many instances of the minority class as possible, even if that increases the number of false positives. A classifier with a high recall score will flag the greatest number of potential churners, or at least raise a flag on most of the cases.
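As a rough way to apply that criterion, the sketch below ranks the models by churn-class recall using the confusion matrices printed in this notebook (the resampled run where both were reported, and the Random Forest fitted with the best parameters). The dictionary labels and the helper `recall_precision` are my own naming; the numbers are copied from the outputs above.

```python
# Confusion matrices copied from the outputs in this notebook.
# Layout: [[TN, FP], [FN, TP]] with churn (True) as the positive class.
results = {
    'Logistic Regression (SMOTE)': [[793, 501], [80, 387]],
    'Naive Bayes (SMOTE)': [[881, 413], [106, 361]],
    'Decision Tree (SMOTE)': [[900, 394], [92, 375]],
    'kNN (SMOTE)': [[1049, 245], [229, 238]],
    'SVM (SMOTE)': [[854, 440], [80, 387]],
    'Random Forest (SMOTE)': [[899, 395], [94, 373]],
    'AdaBoost (SMOTE)': [[901, 393], [101, 366]],
    'AdaBoost (RUS)': [[913, 381], [101, 366]],
    'RUSBoost': [[724, 570], [60, 407]],
    'SMOTEBoost': [[587, 707], [18, 449]],
}

def recall_precision(cm):
    """Return (recall, precision) for the churn class."""
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fn), tp / (tp + fp)

# Sort by recall, highest first
ranked = sorted(
    ((name, *recall_precision(cm)) for name, cm in results.items()),
    key=lambda row: row[1],
    reverse=True,
)

for name, recall, precision in ranked:
    print('%-30s recall=%.2f precision=%.2f' % (name, recall, precision))
```

On these numbers SMOTEBoost has the highest churn recall (0.96) but also the lowest precision (0.39), which is the trade-off described above: more churners are caught at the cost of more false alarms.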