### Which metric is important to your client? 
- Recall over precision. It is better to cast the net wider and offer interventions to more potential churners than to pull back and miss them. The business investment is in the creative development of interventions, so if some misclassified customers receive them it shouldn't incur large costs. The creative guideline would be to avoid calling out people directly.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from ipywidgets import interactive, FloatSlider

import imblearn.over_sampling

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, roc_curve 

%matplotlib inline

In [2]:
data = pd.read_csv('/Users/jennihawk/Documents/Data Science/Classification/Churn Project/Models/chatr_clean.csv')

In [3]:
data.head()

Unnamed: 0,customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,InternetService_Fiber,InternetService_No,Contract_One_Year,Contract_Two_year,PaymentMethod_Crcard,...,DeviceProtection_No_internet_serv,DeviceProtection_Yes,TechSupport_No_internet_serv,TechSupport_Yes,StreamingTV_No_internet_serv,StreamingTV_Yes,StreamingMovies_No_internet_serv,StreamingMovies_Yes,PaperlessBilling_Yes,Churn_Yes
0,7590-VHVEG,0,1.0,29.85,29.85,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,5575-GNVDE,0,34.0,56.95,1889.5,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
2,3668-QPYBK,0,2.0,53.85,108.15,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
3,7795-CFOCW,0,45.0,42.3,1840.75,0,0,1,0,0,...,0,1,0,1,0,0,0,0,0,0
4,9237-HQITU,0,2.0,70.7,151.65,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1


In [4]:
#data.info()

In [5]:
data.columns

Index(['customerID', 'SeniorCitizen', 'tenure', 'MonthlyCharges',
       'TotalCharges', 'InternetService_Fiber', 'InternetService_No',
       'Contract_One_Year', 'Contract_Two_year', 'PaymentMethod_Crcard',
       'PaymentMethod_Electr_Check', 'PaymentMethod_Mailed_check',
       'MultipleLines_No_phone_serv', 'MultipleLines_Yes', 'Dependents_Yes',
       'gender_Male', 'Partner_Yes', 'PhoneService_Yes',
       'OnlineSecurity_No_internet_serv', 'OnlineSecurity_Yes',
       'OnlineBackup_No_Internet_Serv', 'OnlineBackup_Yes',
       'DeviceProtection_No_internet_serv', 'DeviceProtection_Yes',
       'TechSupport_No_internet_serv', 'TechSupport_Yes',
       'StreamingTV_No_internet_serv', 'StreamingTV_Yes',
       'StreamingMovies_No_internet_serv', 'StreamingMovies_Yes',
       'PaperlessBilling_Yes', 'Churn_Yes'],
      dtype='object')

### Model Setup

In [6]:
features_in = ['SeniorCitizen', 'tenure', 'MonthlyCharges',
       'TotalCharges', 'InternetService_Fiber', 'InternetService_No',
       'Contract_One_Year', 'Contract_Two_year', 'PaymentMethod_Crcard',
       'PaymentMethod_Electr_Check', 'PaymentMethod_Mailed_check',
       'MultipleLines_No_phone_serv', 'MultipleLines_Yes', 'Dependents_Yes',
       'gender_Male', 'Partner_Yes', 'PhoneService_Yes',
       'OnlineSecurity_No_internet_serv', 'OnlineSecurity_Yes',
       'OnlineBackup_No_Internet_Serv', 'OnlineBackup_Yes',
       'DeviceProtection_No_internet_serv', 'DeviceProtection_Yes',
       'TechSupport_No_internet_serv', 'TechSupport_Yes',
       'StreamingTV_No_internet_serv', 'StreamingTV_Yes',
       'StreamingMovies_No_internet_serv', 'StreamingMovies_Yes',
       'PaperlessBilling_Yes']

y = data['Churn_Yes']
X = data[features_in]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()), ('LogReg', LogisticRegression())])

In [7]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(5274, 30)
(1758, 30)
(5274,)
(1758,)


### Fit Logistic Regression to Training Data

In [8]:
pipe.fit(X_train, y_train)  # applies scaling on training data

### Metrics

#### Average Rate of Churn on Training and Test Data
- Check for class imbalance.
- The average churn rate is well below 50%, so the classes are imbalanced.

In [9]:
np.mean(y_train)

0.26753886992794845

In [10]:
np.mean(y_test)

0.2605233219567691

### Oversampling to Address Class Imbalance

In [11]:
# Set up the sampling_strategy argument for RandomOverSampler
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
ratio = {1 : n_pos * 4, 0 : n_neg}

# Randomly oversample positive samples: create 4x as many
ROS = imblearn.over_sampling.RandomOverSampler(sampling_strategy = ratio, random_state=42)

# Use fit_resample to create the dataset with the desired proportion
# (pipe was fit on X_train above; refit on X_tr_rs, y_tr_rs to use the oversampled data)
X_tr_rs, y_tr_rs = ROS.fit_resample(X_train, y_train)
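As a quick check on what the 4x multiplier does, the sketch below works out the class counts before and after oversampling. It uses only the training shape (5274 rows) and the training churn rate printed earlier; no data is resampled here, and the variable names are illustrative.

```python
# Class counts implied by the training split shown above:
# 5274 rows with a churn rate of about 0.2675.
n_train = 5274
n_pos = round(n_train * 0.26753886992794845)  # churners
n_neg = n_train - n_pos                        # non-churners

# Same dictionary shape that is passed as sampling_strategy.
ratio = {1: n_pos * 4, 0: n_neg}

total_after = ratio[1] + ratio[0]
pos_share_after = ratio[1] / total_after

# Multiplier that would make both classes the same size.
balancing_multiplier = n_neg / n_pos

print(f"before: {n_pos} positives, {n_neg} negatives")
print(f"after:  {ratio[1]} positives, {ratio[0]} negatives")
print(f"positive share after oversampling: {pos_share_after:.3f}")
print(f"multiplier for a 50/50 split: {balancing_multiplier:.2f}")
```

With these counts, a 4x multiplier turns churners into the majority class rather than balancing the two; a multiplier between 2 and 3 would land closer to an even split.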

#### Hard Class Predictions
Predict Churn / Not Churn

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
y_pred

#### Soft Class Predictions
- Gives the probability of belonging to each class
- If the class labels are strings, sklearn orders the probability columns alphabetically. If they are numeric, the columns are in ascending order

In [None]:
pipe.predict_proba(X_test)[:5]

#### Accuracy
Percentage of observations that were correctly classified.
When one class is significantly less common than the other, accuracy is often not the most helpful metric to optimize.
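To see why accuracy can mislead here, consider a baseline that predicts "no churn" for everyone. The sketch below uses only the test-set size (1758) and the test churn rate printed above; the baseline itself is a hypothetical comparison, not a model from this notebook.

```python
# Test-set size and churn rate as printed earlier in the notebook.
n_test = 1758
churn_rate = 0.2605233219567691

n_pos = round(n_test * churn_rate)  # customers who churned
n_neg = n_test - n_pos              # customers who stayed

# A baseline that always predicts "no churn" gets every non-churner right
# and every churner wrong.
baseline_accuracy = n_neg / n_test
baseline_recall = 0 / n_pos

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall:   {baseline_recall:.3f}")
```

The baseline scores roughly 74% accuracy while catching no churners at all, which is why recall is the more useful metric for this problem.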

In [None]:
#accuracy score on train data
pipe.score(X_train, y_train)

In [None]:
#accuracy score on test data
pipe.score(X_test, y_test)

#### Confusion Matrix

In [None]:
#sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)

TODO: also print a percentage version of the confusion matrix.

In [None]:
logreg_confusion = confusion_matrix(y_test, y_pred)

In [None]:
logreg_confusion

In [None]:
def make_confusion_matrix(model, threshold = 0.5):
    # Predict class 1 if the probability of class 1 is at least the threshold
    # (model.predict(X_test) does this automatically with a threshold of 0.5)
    y_predict = (model.predict_proba(X_test)[:, 1] >= threshold)
    # Use the thresholded predictions, not the global y_pred, so the slider has an effect
    churn_conf = confusion_matrix(y_test, y_predict)
    plt.figure(dpi=80)
    sns.heatmap(churn_conf, cmap=plt.cm.Blues, annot=True, square=True, fmt='d',);

    #plt.savefig('confusion_matrix.png', dpi=300) 

#### Lowering threshold from the 0.5 default increases recall. Precision gets worse.

In [None]:
# confusion matrix with threshold slider
interactive(lambda threshold: make_confusion_matrix(pipe, threshold), threshold=(0.0,1.0,0.02))

In [None]:
def make_confusion_matrix(model, threshold = 0.5):
    # Predict class 1 if the probability of class 1 is at least the threshold
    # (model.predict(X_test) does this automatically with a threshold of 0.5)
    y_predict = (model.predict_proba(X_test)[:, 1] >= threshold)
    # Use the thresholded predictions, not the global y_pred, so the slider has an effect
    churn_conf = confusion_matrix(y_test, y_predict, normalize = 'all')
    plt.figure(dpi=80)
    sns.heatmap(churn_conf, cmap=plt.cm.Blues, annot=True, square=True, fmt='.2%');
    #plt.savefig('confusion_matrix_percent.png', dpi=300) 

Troubleshooting why the slider isn't working
If the slider does not change the confusion matrix, first check that the function passes its own thresholded predictions (y_predict), not the global y_pred, to confusion_matrix. Then investigate what pipe outputs: does it give hard classes or soft probabilities, and if they are probabilities, are they mostly extreme (close to 0 or 1) or spread out?
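As a minimal sketch of what the slider is meant to show, the toy example below thresholds a handful of made-up probabilities and counts the confusion matrix cells at three thresholds. The labels, the probabilities, and the helper `confusion_counts` are hypothetical and do not come from the notebook's data.

```python
# Hypothetical labels and predicted churn probabilities (not from the notebook).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.2, 0.7, 0.45, 0.3, 0.2, 0.1, 0.05]


def confusion_counts(y_true, y_prob, threshold):
    """Return (tn, fp, fn, tp) after thresholding the probabilities."""
    tn = fp = fn = tp = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


for threshold in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_counts(y_true, y_prob, threshold)
    print(f"threshold={threshold}: tn={tn} fp={fp} fn={fn} tp={tp}")
```

If the counts stay identical across thresholds in a real run, the predictions being passed to the confusion matrix are probably not the thresholded ones.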

In [None]:
# confusion matrix with threshold slider
# how the widget works https://ipywidgets.readthedocs.io/en/stable/examples/Using%20Interact.html?highlight=interact
interactive(lambda threshold: make_confusion_matrix(pipe, threshold), threshold=(0.0,1.0,0.02))

Confusion matrix findings
The model:
- correctly classified 1151 customers who kept their subscription.
- correctly classified 237 customers who canceled their subscription.
- incorrectly classified 149 customers as churners when they actually kept their subscription.
- incorrectly classified 221 customers as staying when they actually canceled.

### Precision, Recall, F1 Scores
- Precision goes down as you decrease the threshold, while recall goes up. This is called the _precision-recall tradeoff_.
- Precision = true positives (customers correctly classified as churners) divided by all of the model's predicted positives. (100% precision means every positive identified by the model was an actual positive.)
- Recall = true positives divided by all actual positives in the dataset.
- F1 = the harmonic mean of precision and recall. It's designed to penalize situations where precision or recall is significantly better than the other metric. 
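The definitions above can be worked through by hand using the four counts reported in the confusion matrix findings (1151, 149, 221, 237). This is plain arithmetic on those counts rather than a call to sklearn, and the variable names are illustrative.

```python
# Counts reported in the confusion matrix findings.
tn, fp, fn, tp = 1151, 149, 221, 237

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

On these counts recall comes out near 0.52, meaning almost half of the actual churners are missed at the default threshold, which supports lowering the threshold for this client.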

In [None]:
print("Default threshold:")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1: {f1_score(y_test, y_pred)}")

### ROC Curve
- The ROC AUC (area under the curve) metric is 1 for a perfect classifier and 0.5 for a model that performs no better than random guessing.

How the variables below work
- The roc_curve function returns three arrays, stored here as fpr (false positive rate), tpr (true positive rate), and thresholds. fpr and tpr are used later to plot the chart.
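To make the fpr and tpr arrays less of a black box, here is a hand-rolled sketch that builds the ROC points for a tiny made-up example and integrates them with the trapezoid rule. The labels, scores, and both helper functions are hypothetical; sklearn's roc_curve may choose which thresholds to report differently, so this only illustrates the idea.

```python
# Hypothetical labels and scores (not from the notebook).
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]


def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists, one point per distinct threshold, strictest first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]  # start at the origin: nothing predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


fpr, tpr = roc_points(y_true, y_score)
auc = trapezoid_auc(fpr, tpr)
print("fpr:", fpr)
print("tpr:", tpr)
print(f"AUC: {auc:.4f}")
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, and the area under the curve equals that same fraction.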

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, pipe.predict_proba(X_test)[:,1])

In [None]:
plt.plot(fpr, tpr,lw=2)
plt.plot([0,1],[0,1],c='orange',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])


plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve');
print("ROC AUC score = ", roc_auc_score(y_test, pipe.predict_proba(X_test)[:,1]))
#plt.savefig('ROC Curve.png', dpi=300) 

### Interpreting Coefficients: which features are more strongly associated with churn?
- A one-unit increase in x increases the log odds by beta.
- In other words, a one-unit increase in x multiplies the odds by e^beta.
- Because the pipeline standardizes the features, one unit here is one standard deviation of the original feature.
- If the feature's coefficient beta is positive, increasing that feature makes the positive class more likely
- If beta is negative, increasing the feature does the opposite and the positive class becomes less likely
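A small worked example may help with the log-odds arithmetic. The coefficients, feature names, and starting probability below are made up for illustration and are not the fitted values from this model; `shift_probability` is a hypothetical helper.

```python
import math

# Hypothetical coefficients in log-odds units (not the fitted values).
example_coefs = {"feature_a": 0.5, "feature_b": -0.5, "feature_c": 0.0}

# Exponentiating a coefficient gives the factor by which the odds are multiplied.
odds_ratios = {name: math.exp(beta) for name, beta in example_coefs.items()}


def shift_probability(p, beta):
    """Probability after a one-unit increase in a feature with coefficient beta."""
    odds = p / (1 - p)
    new_odds = odds * math.exp(beta)
    return new_odds / (1 + new_odds)


start_p = 0.2  # hypothetical starting churn probability
for name, beta in example_coefs.items():
    new_p = shift_probability(start_p, beta)
    print(f"{name}: beta={beta:+.1f}, odds ratio={odds_ratios[name]:.3f}, "
          f"p: {start_p:.3f} -> {new_p:.3f}")
```

An odds ratio above 1 raises the churn probability, one below 1 lowers it, and a coefficient of 0 leaves it unchanged. The size of the change in probability depends on the starting probability, which is why the odds ratio is the quantity that stays constant.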

### Coefficient For Each Feature

#### Coefficients in Log Odds Units

In [None]:
#.T transposes coef_ from shape (1, n_features) to (n_features, 1), which fixes a shape error
coefs_tst_data = pd.DataFrame(pipe['LogReg'].coef_.T, X.columns, columns = ['Coeff_Log_Odds'])

In [None]:
#coefs_tst_data.sort_values(by='Coeff_Log_Odds', ascending = False)

#### Coefficients: Exponentiate to Convert from Log Odds
These are now odds ratios, NOT log odds

In [None]:
coefs_tst_data['Coeff_Odds'] = coefs_tst_data['Coeff_Log_Odds']

In [None]:
#coefs_tst_data

In [None]:
coefs_tst_data['Coeff_Odds'] = coefs_tst_data['Coeff_Odds'].apply(lambda x: np.exp(x))

In [None]:
coefs_tst_data.sort_values(by='Coeff_Odds', ascending = False)

In [None]:
#pipe['LogReg'].coef_

In [None]:
#np.exp(pipe['LogReg'].coef_)

In [None]:
#type(pipe['LogReg'])