<a id="TableOfContents"></a>
# TABLE OF CONTENTS:
<li><a href='#imports'>Imports</a></li>
<li><a href="#telco">Get Telco Dataset</a></li>
<li><a href="#split">Split Telco Dataset</a></li>
<li><a href="#keycol">Key Columns</a></li>
<li><a href="#DTC">Decision Tree Classifier Modeling</a></li>
<li><a href="#RFC">Random Forest Classifier Modeling</a></li>
<li><a href="#KNN">K-Nearest Neighbors Modeling</a></li>
<li><a href="#LR">Logistic Regression Modeling</a></li>
<li><a href="#top3">Top 3 Models</a></li>

##### Orientation:
This notebook builds models that predict whether a customer will churn and identifies the best-performing model of each type.

<a id='imports'></a>
# IMPORTS:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [1]:
# tabular data
import numpy as np
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# DecisionTree modeling:
from sklearn.tree import DecisionTreeClassifier as DTC, export_text, plot_tree

# RandomForest modeling:
from sklearn.ensemble import RandomForestClassifier as RFC

# KNN modeling:
from sklearn.neighbors import KNeighborsClassifier as KNN

# Logistic Regression modeling:
from sklearn.linear_model import LogisticRegression as LR

# Other sklearn utilities (metrics and splitting)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# .py files
import prepare
import evaluation
import model

from sklearn.feature_selection import SelectKBest, chi2

<a id='telco'></a>
# Get Telco Dataset:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [2]:
# Acquire cleaned 'telco' dataset
telco = prepare.prep_telco()
telco.sample()


Unnamed: 0,gender,senior_citizen,partner,dependents,tenure,phone_service,multiple_lines,online_security,online_backup,device_protection,...,sign_dayofweek_1,sign_dayofweek_2,sign_dayofweek_3,sign_dayofweek_4,sign_dayofweek_5,sign_dayofweek_6,total_services,total_extra_services,value_per_total_services,value_per_total_extra_services
5432,Female,0,Yes,Yes,63,Yes,No,No internet service,No internet service,No internet service,...,0,0,0,1,0,0,1,0,19.35,inf


<a id='split'></a>
# Split Telco Dataset:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [3]:
# Split telco dataset
train, val, test = prepare.split(telco, 'churn')

train.shape:(3943, 100)
validate.shape:(1691, 100)
test.shape:(1409, 100)


In [4]:
# Verify split shapes
train.shape, val.shape, test.shape

((3943, 100), (1691, 100), (1409, 100))
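The body of `prepare.split` is not shown in this notebook. The printed shapes are consistent with a 20% test hold-out followed by a 30% validation hold-out from the remainder, so a minimal sketch under that assumption might look like the following. The function body, the seed, and the stratification are all assumptions, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split(df, target, seed=100):
    """Hypothetical stand-in for prepare.split: stratified train/val/test."""
    # Hold out 20% of rows for test, preserving the target's class balance.
    train_val, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Carve 30% of the remaining rows off as the validation set.
    train, val = train_test_split(
        train_val, test_size=0.3, random_state=seed, stratify=train_val[target]
    )
    return train, val, test


# Synthetic frame with the same row count as the three splits combined (7,043).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=7043),
    "churn": rng.choice(["Yes", "No"], size=7043, p=[0.27, 0.73]),
})
train_demo, val_demo, test_demo = split(demo, "churn")
print(train_demo.shape, val_demo.shape, test_demo.shape)
```

Stratifying on the target keeps the share of churners roughly equal across the three splits, which matches the near-identical baseline rates reported for train, val, and test.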

<a id='keycol'></a>
# Key Columns:
<li><a href='#TableOfContents'>Table of Contents</a></li>

- Columns:
    - 'partner'
    - 'dependents'
    - 'online_security'
    - 'online_backup'
    - 'device_protection'
    - 'tech_support'
    - 'streaming_tv'
    - 'streaming_movies'
    - 'payment_type'
    - 'contract_type'
    - 'total_services'
    - 'total_extra_services'

In [712]:
# List of pertinent columns
# Bin signup_date
keylist = [
    'online_security_No',
    'online_backup_No',
    'device_protection_No',
    'tech_support_No',
    'contract_type_Month-to-month',
    'internet_service_type_Fiber_optic',
    'payment_type_Electronic_check',
    'sign_year',
    'tenure',
    'value_per_total_services'
]
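Names such as `online_security_No` and `internet_service_type_Fiber_optic` look like one-hot encoded columns. The encoding step lives in `prepare` and is not shown here, so the following is only a sketch of how such names could arise with `pd.get_dummies`. The toy values and the space-to-underscore renaming are assumptions.

```python
import pandas as pd

# Toy categorical data; the values are illustrative, not taken from the dataset.
raw = pd.DataFrame({
    "online_security": ["No", "Yes", "No internet service"],
    "internet_service_type": ["Fiber optic", "DSL", "None"],
    "contract_type": ["Month-to-month", "One year", "Two year"],
})

# One-hot encode each categorical column; new names are "<column>_<value>".
encoded = pd.get_dummies(raw, columns=list(raw.columns))

# Assumption: spaces in the generated names are replaced with underscores,
# which would explain a name like 'internet_service_type_Fiber_optic'.
encoded.columns = encoded.columns.str.replace(" ", "_")

print(sorted(encoded.columns))
```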

In [713]:
# Assign x/y train/val/test cols:
x_train = train[keylist]
y_train = train['churn']
x_val = val[keylist]
y_val = val['churn']
x_test = test[keylist]
y_test = test['churn']

<a id='DTC'></a>
# Decision Tree Classifier Modeling:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [714]:
# Create base dictionary
dtcscores = {
    'model' : ['actual'],
    'train' : [100],
    'val' : [100],
    'diff' : [0],
    'test' : [100]
}

In [715]:
# Add baseline to dictionary
baselinetrain = round((train.churn == 'No').mean(), 5)
baselineval = round((val.churn == 'No').mean(), 5)
baselinediff = round(abs(baselinetrain - baselineval), 5)
baselinetest = round((test.churn == 'No').mean(), 5)
dtcscores['model'].append('baseline')
dtcscores['train'].append(baselinetrain)
dtcscores['val'].append(baselineval)
dtcscores['diff'].append(baselinediff)
dtcscores['test'].append(baselinetest)

In [716]:
# Confirm df functionality with actual and baseline
pd.DataFrame.from_dict(dtcscores)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456


In [717]:
# Baseline predictions for TP / TN
other = pd.DataFrame(y_train)
other['base'] = 'No'
matrix = confusion_matrix(other.churn, other.base, labels=('Yes', 'No'))
TNbase = (matrix[0, 0] / (matrix[0, 0] + matrix[1, 0]))
TPbase = (matrix[1, 1] / (matrix[1, 1] + matrix[0, 1]))
print(f'True Negative base prediction ("Yes"): {TNbase}')
print(f'True Positive base prediction ("No"): {TPbase}')

True Negative base prediction ("Yes"): nan
True Positive base prediction ("No"): 0.7347197565305605
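The `nan` for "Yes" comes from a 0 / 0 division: the baseline never predicts "Yes", so the predicted-"Yes" column of the confusion matrix sums to zero. A minimal sketch of a guarded version of the same column-wise ratio follows. It uses the test-split class counts (374 "Yes", 1035 "No") implied by the test-set matrices elsewhere in this notebook, and `column_ratio` is a hypothetical helper name.

```python
import numpy as np

# Baseline confusion matrix on the test split: rows are actual labels,
# columns are predicted labels, in the order ('Yes', 'No'). The baseline
# always predicts 'No', so the predicted-'Yes' column is all zeros.
baseline = np.array([[0, 374],
                     [0, 1035]])


def column_ratio(matrix, col):
    """Share of predictions in column `col` that are correct.

    Returns None when the label was never predicted (0 / 0 is undefined).
    """
    predicted = matrix[:, col].sum()
    if predicted == 0:
        return None
    return matrix[col, col] / predicted


print(column_ratio(baseline, 0))  # 'Yes' was never predicted
print(column_ratio(baseline, 1))  # share of 'No' predictions that are correct
```

Returning `None` (or reporting the value as undefined) avoids the runtime warning and makes it explicit that the baseline has no "Yes" predictions to evaluate.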



In [718]:
# Make and fit DTC of 2-25 depth to train
for i in range(2, 26):
    dtc = DTC(max_depth=i, random_state=100)
    dtc.fit(x_train, y_train)
    model = dtc.predict(x_train)
    trainscore = round(dtc.score(x_train, y_train), 5)
    valscore = round(dtc.score(x_val, y_val), 5)
    diffscore = round(abs(trainscore - valscore), 5)
    testscore = round(dtc.score(x_test, y_test), 5)
    dtcscores['model'].append(f'model{i}')
    dtcscores['train'].append(trainscore)
    dtcscores['val'].append(valscore)
    dtcscores['diff'].append(diffscore)
    dtcscores['test'].append(testscore)
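The same fit-and-score pattern is repeated for every model family in this notebook. One way to express it once is a small helper that returns a row of scores. This is only a sketch: `score_model` is a hypothetical name, and the data below is synthetic so the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def score_model(estimator, splits, name):
    """Fit on train and return one row of accuracy scores (hypothetical helper)."""
    (x_tr, y_tr), (x_va, y_va), (x_te, y_te) = splits
    estimator.fit(x_tr, y_tr)
    train = round(estimator.score(x_tr, y_tr), 5)
    val = round(estimator.score(x_va, y_va), 5)
    return {
        "model": name,
        "train": train,
        "val": val,
        "diff": round(abs(train - val), 5),
        "test": round(estimator.score(x_te, y_te), 5),
    }


# Synthetic stand-in data: 600 rows, 4 features, a noisy binary target.
rng = np.random.default_rng(100)
X = rng.normal(size=(600, 4))
y = np.where(X[:, 0] + rng.normal(scale=0.5, size=600) > 0, "Yes", "No")
splits = [(X[:300], y[:300]), (X[300:450], y[300:450]), (X[450:], y[450:])]

rows = [
    score_model(DecisionTreeClassifier(max_depth=d, random_state=100), splits, f"model{d}")
    for d in range(2, 6)
]
for row in rows:
    print(row)
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`, and the helper works unchanged for any estimator with `fit` and `score` methods.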

In [719]:
# Find best 'train' model
pd.DataFrame.from_dict(dtcscores).sort_values(by='train', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
25,model25,0.99087,0.72679,0.26408,0.7225


In [720]:
# Find best 'val' model
pd.DataFrame.from_dict(dtcscores).sort_values(by='val', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
4,model4,0.80903,0.78829,0.02074,0.76508


In [721]:
# Find lowest 'diff' value 
pd.DataFrame.from_dict(dtcscores).sort_values(by='diff', ascending=True).head(3)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456
3,model3,0.80167,0.78238,0.01929,0.76792


In [722]:
# Find best 'test' model
pd.DataFrame.from_dict(dtcscores).sort_values(by='test', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
5,model5,0.81512,0.78652,0.0286,0.77147
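The cells above rely on `head(2)` or `head(3)` to look past the `actual` and `baseline` reference rows. An alternative sketch is to filter those rows out before ranking, shown here on a handful of rows copied from the decision tree tables above. The variable names are illustrative.

```python
import pandas as pd

# A few rows copied from the decision tree score tables above.
scores = pd.DataFrame({
    "model": ["actual", "baseline", "model3", "model4", "model5", "model25"],
    "train": [100.0, 0.73472, 0.80167, 0.80903, 0.81512, 0.99087],
    "val":   [100.0, 0.73448, 0.78238, 0.78829, 0.78652, 0.72679],
    "diff":  [0.0, 0.00024, 0.01929, 0.02074, 0.0286, 0.26408],
    "test":  [100.0, 0.73456, 0.76792, 0.76508, 0.77147, 0.7225],
})

# Keep only fitted models so the reference rows cannot win a ranking.
fitted = scores[scores["model"].str.startswith("model")]

best_val = fitted.loc[fitted["val"].idxmax(), "model"]
lowest_diff = fitted.loc[fitted["diff"].idxmin(), "model"]
best_test = fitted.loc[fitted["test"].idxmax(), "model"]
print(best_val, lowest_diff, best_test)  # model4 model3 model5
```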


In [723]:
dtc = DTC(max_depth=3,
          random_state=100)
dtc.fit(x_train, y_train)
model = dtc.predict(x_test)
matrix = confusion_matrix(y_test, model, labels=('Yes', 'No'))
print(matrix)
TNdtc = (matrix[0, 0] / (matrix[0, 0] + matrix[1, 0]))
TPdtc = (matrix[1, 1] / (matrix[1, 1] + matrix[0, 1]))
print(f'True Negative DTC prediction ("Yes"): {TNdtc}')
print(f'True Positive DTC prediction ("No"): {TPdtc}')

[[125 249]
 [ 78 957]]
True Negative DTC prediction ("Yes"): 0.6157635467980296
True Positive DTC prediction ("No"): 0.7935323383084577


- Best Model:
    - model 3
        - Train: 80.2%
        - Val: 78.2%
        - Diff: 2.0%
        - Test: 76.8%
        - TP('No'): 79.4%
        - TN('Yes'): 61.6%
        - Restrictions:
            - max_depth = 3
            - random_state=100
- Baseline:
    - Test: 73.5%
    - TP('No'): 73.5%
    - TN('Yes'): 0.0%
    
    
- Model 3 & baseline comparison
    - Test: + 3.3%
    - TP('No'): + 5.9%
    - TN('Yes'): + 61.6%
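The two ratios reported above divide each diagonal cell of the confusion matrix by its column total, that is, by the number of times the label was predicted. That quantity is usually called precision. Dividing by the row total instead gives recall, the share of actual members of a class that the model finds. A short sketch computing both from the depth-3 test matrix printed above:

```python
import numpy as np

# Test-set confusion matrix for the depth-3 tree, as printed above.
# Rows are actual labels, columns are predicted labels, order ('Yes', 'No').
matrix = np.array([[125, 249],
                   [78, 957]])

correct = np.diag(matrix)
# Column totals: how many times each label was predicted.
precision = correct / matrix.sum(axis=0)
# Row totals: how many customers actually carry each label.
recall = correct / matrix.sum(axis=1)
accuracy = correct.sum() / matrix.sum()

for label, p, r in zip(("Yes", "No"), precision, recall):
    print(f"{label}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.5f}")
```

The precision values reproduce the 61.6% and 79.4% figures, and the accuracy reproduces the 0.76792 test score. Recall for "Yes" is worth checking alongside them when the goal is to catch customers who churn.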

<a id='RFC'></a>
# Random Forest Classifier Modeling:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [724]:
# Create base dictionary
rfcscores = {
    'model' : ['actual'],
    'train' : [100],
    'val' : [100],
    'diff' : [0],
    'test' : [100]
}

In [725]:
# Add baseline to dictionary
baselinetrain = round((train.churn == 'No').mean(), 5)
baselineval = round((val.churn == 'No').mean(), 5)
baselinediff = round(abs(baselinetrain - baselineval), 5)
baselinetest = round((test.churn == 'No').mean(), 5)
rfcscores['model'].append('baseline')
rfcscores['train'].append(baselinetrain)
rfcscores['val'].append(baselineval)
rfcscores['diff'].append(baselinediff)
rfcscores['test'].append(baselinetest)

In [726]:
# Confirm df functionality with actual and baseline
pd.DataFrame.from_dict(rfcscores)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456


In [727]:
# Make and fit RFC of 2-25 depth to train
for i in range(2, 26):
    rfc = RFC(max_depth=i, random_state=100)
    rfc.fit(x_train, y_train)
    model = rfc.predict(x_train)
    trainscore = round(rfc.score(x_train, y_train), 5)
    valscore = round(rfc.score(x_val, y_val), 5)
    diffscore = round(abs(trainscore - valscore), 5)
    testscore = round(rfc.score(x_test, y_test), 5)
    rfcscores['model'].append(f'model{i}')
    rfcscores['train'].append(trainscore)
    rfcscores['val'].append(valscore)
    rfcscores['diff'].append(diffscore)
    rfcscores['test'].append(testscore)

In [728]:
# Find best 'train' model
pd.DataFrame.from_dict(rfcscores).sort_values(by='train', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
20,model20,0.99239,0.767,0.22539,0.76224


In [729]:
# Find best 'val' model
pd.DataFrame.from_dict(rfcscores).sort_values(by='val', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
8,model8,0.85569,0.78947,0.06622,0.77999


In [730]:
# Find lowest 'diff' value 
pd.DataFrame.from_dict(rfcscores).sort_values(by='diff', ascending=True).head(4)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456
2,model2,0.79939,0.78119,0.0182,0.76011
3,model3,0.80852,0.78238,0.02614,0.76011


In [731]:
# Find best 'test' model
pd.DataFrame.from_dict(rfcscores).sort_values(by='test', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
7,model7,0.83338,0.78829,0.04509,0.78141


In [732]:
rfc = RFC(max_depth=7,
          random_state=100)
rfc.fit(x_train, y_train)
model = rfc.predict(x_test)
matrix = confusion_matrix(y_test, model, labels=('Yes', 'No'))
print(matrix)
TNrfc = (matrix[0, 0] / (matrix[0, 0] + matrix[1, 0]))
TPrfc = (matrix[1, 1] / (matrix[1, 1] + matrix[0, 1]))
print(f'True Negative RFC prediction ("Yes"): {TNrfc}')
print(f'True Positive RFC prediction ("No"): {TPrfc}')

[[190 184]
 [124 911]]
True Negative RFC prediction ("Yes"): 0.6050955414012739
True Positive RFC prediction ("No"): 0.8319634703196347


- Best Model:
    - model 7
        - Train: 83.3%
        - Val: 78.8%
        - Diff: 4.5%
        - Test: 78.1%
        - TP('No'): 83.2%
        - TN('Yes'): 60.5%
        - Restrictions:
            - max_depth = 7
            - random_state=100
- Baseline:
    - Test: 73.5%
    - TP('No'): 73.5%
    - TN('Yes'): 0.0%
    
    
- Model 7 & baseline comparison
    - Test: + 4.6%
    - TP('No'): + 9.7%
    - TN('Yes'): + 60.5%

<a id='KNN'></a>
# K-Nearest Neighbors Modeling:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [733]:
# Create base dictionary
knnscores = {
    'model' : ['actual'],
    'train' : [100],
    'val' : [100],
    'diff' : [0],
    'test' : [100]
}

In [734]:
# Add baseline to dictionary
baselinetrain = round((train.churn == 'No').mean(), 5)
baselineval = round((val.churn == 'No').mean(), 5)
baselinediff = round(abs(baselinetrain - baselineval), 5)
baselinetest = round((test.churn == 'No').mean(), 5)
knnscores['model'].append('baseline')
knnscores['train'].append(baselinetrain)
knnscores['val'].append(baselineval)
knnscores['diff'].append(baselinediff)
knnscores['test'].append(baselinetest)

In [735]:
# Confirm df functionality with actual and baseline
pd.DataFrame.from_dict(knnscores)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456


In [736]:
# Make and fit KNN of 1-25 neighbors to train
for i in range(1, 26):
    knn = KNN(n_neighbors=i)
    knn.fit(x_train, y_train)
    model = knn.predict(x_train)
    trainscore = round(knn.score(x_train, y_train), 5)
    valscore = round(knn.score(x_val, y_val), 5)
    diffscore = round(abs(trainscore - valscore), 5)
    testscore = round(knn.score(x_test, y_test), 5)
    knnscores['model'].append(f'model{i}')
    knnscores['train'].append(trainscore)
    knnscores['val'].append(valscore)
    knnscores['diff'].append(diffscore)
    knnscores['test'].append(testscore)


In [737]:
# Find best 'train' model
pd.DataFrame.from_dict(knnscores).sort_values(by='train', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
2,model1,0.99087,0.71082,0.28005,0.71043


In [738]:
# Find best 'val' model
pd.DataFrame.from_dict(knnscores).sort_values(by='val', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
17,model16,0.81131,0.7806,0.03071,0.76934


In [739]:
# Find lowest 'diff' value 
pd.DataFrame.from_dict(knnscores).sort_values(by='diff', ascending=True).head(4)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456
26,model25,0.80599,0.7806,0.02539,0.76721
25,model24,0.80776,0.77942,0.02834,0.76721


In [740]:
# Find best 'test' model
pd.DataFrame.from_dict(knnscores).sort_values(by='test', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
17,model16,0.81131,0.7806,0.03071,0.76934


In [741]:
knn = KNN(n_neighbors=16)
knn.fit(x_train, y_train)
model = knn.predict(x_test)
matrix = confusion_matrix(y_test, model, labels=('Yes', 'No'))
print(matrix)
TNknn = (matrix[0, 0] / (matrix[0, 0] + matrix[1, 0]))
TPknn = (matrix[1, 1] / (matrix[1, 1] + matrix[0, 1]))
print(f'True Negative KNN prediction ("Yes"): {TNknn}')
print(f'True Positive KNN prediction ("No"): {TPknn}')

[[153 221]
 [104 931]]
True Negative KNN prediction ("Yes"): 0.5953307392996109
True Positive KNN prediction ("No"): 0.8081597222222222



- Best Model:
    - model 16
        - Train: 81.1%
        - Val: 78.1%
        - Diff: 3.0%
        - Test: 76.9%
        - TP('No'): 80.8%
        - TN('Yes'): 59.5%
        - Restrictions:
            - n_neighbors = 16
- Baseline:
    - Test: 73.5%
    - TP('No'): 73.5%
    - TN('Yes'): 0.0%
    
    
- Model 16 & baseline comparison
    - Test: + 3.4%
    - TP('No'): + 7.3%
    - TN('Yes'): + 59.5%

<a id='LR'></a>
# Logistic Regression Modeling:
<li><a href='#TableOfContents'>Table of Contents</a></li>

In [742]:
# Create base dictionary
lrscores = {
    'model' : ['actual'],
    'train' : [100],
    'val' : [100],
    'diff' : [0],
    'test' : [100]
}

In [743]:
# Add baseline to dictionary
baselinetrain = round((train.churn == 'No').mean(), 5)
baselineval = round((val.churn == 'No').mean(), 5)
baselinediff = round(abs(baselinetrain - baselineval), 5)
baselinetest = round((test.churn == 'No').mean(), 5)
lrscores['model'].append('baseline')
lrscores['train'].append(baselinetrain)
lrscores['val'].append(baselineval)
lrscores['diff'].append(baselinediff)
lrscores['test'].append(baselinetest)

In [744]:
# Confirm df functionality with actual and baseline
pd.DataFrame.from_dict(lrscores)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
1,baseline,0.73472,0.73448,0.00024,0.73456


In [745]:
# Make and fit LR to train
lr = LR(random_state=100)
lr.fit(x_train, y_train)
model = lr.predict(x_train)
trainscore = round(lr.score(x_train, y_train), 5)
valscore = round(lr.score(x_val, y_val), 5)
diffscore = round(abs(trainscore - valscore), 5)
testscore = round(lr.score(x_test, y_test), 5)
lrscores['model'].append('model1')
lrscores['train'].append(trainscore)
lrscores['val'].append(valscore)
lrscores['diff'].append(diffscore)
lrscores['test'].append(testscore)

In [746]:
# Find best 'train' model
pd.DataFrame.from_dict(lrscores).sort_values(by='train', ascending=False).head(2)

Unnamed: 0,model,train,val,diff,test
0,actual,100.0,100.0,0.0,100.0
2,model1,0.8108,0.79007,0.02073,0.77502


In [748]:
model = lr.predict(x_test)
matrix = confusion_matrix(y_test, model, labels=('Yes', 'No'))
print(matrix)
TNlr = (matrix[0, 0] / (matrix[0, 0] + matrix[1, 0]))
TPlr = (matrix[1, 1] / (matrix[1, 1] + matrix[0, 1]))
print(f'True Negative LR prediction ("Yes"): {TNlr}')
print(f'True Positive LR prediction ("No"): {TPlr}')

[[201 173]
 [144 891]]
True Negative LR prediction ("Yes"): 0.5826086956521739
True Positive LR prediction ("No"): 0.8374060150375939


- Best Model:
    - model 1
        - Train: 81.1%
        - Val: 79.0%
        - Diff: 2.1%
        - Test: 77.5%
        - TP('No'): 83.7%
        - TN('Yes'): 58.3%
        - Restrictions:
            - random_state = 100
- Baseline:
    - Test: 73.5%
    - TP('No'): 73.5%
    - TN('Yes'): 0.0%
    
    
- Model 1 & baseline comparison
    - Test: + 4.0%
    - TP('No'): + 10.2%
    - TN('Yes'): + 58.3%

<a id='top3'></a>
# Top 3 Models:
<li><a href='#TableOfContents'>Table of Contents</a></li>

### Overview of the four models above compared to baseline
- Baseline:
    - Test: 73.5%
    - TP('No'): 73.5%
    - TN('Yes'): 0.0%
- Decision Tree Classifier:
    - Model 3
        - Test: + 3.3%
        - TP('No'): + 5.9%
        - TN('Yes'): + 61.6%
- Random Forest Classifier:
    - Model 7
        - Test: + 4.6%
        - TP('No'): + 9.7%
        - TN('Yes'): + 60.5%
- K-Nearest Neighbors:
    - Model 16
        - Test: + 3.4%
        - TP('No'): + 7.3%
        - TN('Yes'): + 59.5%
- Logistic Regression:
    - Model 1
        - Test: + 4.0%
        - TP('No'): + 10.2%
        - TN('Yes'): + 58.3%

### Top Models
- For overall score:
    - Random Forest Classifier
        - Model 7
            - + 4.6%
- For True Positives ('No'):
    - Logistic Regression
        - Model 1
            - + 10.2%
- For True Negatives ('Yes'):
    - Decision Tree Classifier
        - Model 3
            - + 61.6%
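The ranking above can be reproduced mechanically from the per-model summaries. The sketch below uses the rounded test-set percentages reported in this notebook and picks the top model for each metric. The dictionary layout and key names are illustrative.

```python
# Test-set percentages taken from the per-model summaries above.
baseline = {"test": 73.5, "no": 73.5, "yes": 0.0}
models = {
    "Decision Tree (depth 3)": {"test": 76.8, "no": 79.4, "yes": 61.6},
    "Random Forest (depth 7)": {"test": 78.1, "no": 83.2, "yes": 60.5},
    "KNN (16 neighbors)": {"test": 76.9, "no": 80.8, "yes": 59.5},
    # 'yes' is the printed ratio 0.5826 rounded to one decimal place.
    "Logistic Regression": {"test": 77.5, "no": 83.7, "yes": 58.3},
}

best = {}
for metric in ("test", "no", "yes"):
    # Pick the model with the highest value for this metric.
    name = max(models, key=lambda m: models[m][metric])
    gain = round(models[name][metric] - baseline[metric], 1)
    best[metric] = (name, gain)
    print(f"{metric}: {name} (+{gain} points over baseline)")
```

Keeping the figures in one structure makes it easy to re-rank if a model is refit or another metric is added.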