# Evaluating and Tuning a Binary Classification Model

## Goals

After this lesson, you should be able to:

- Build and explain confusion matrices from a model output
- Calculate various binary classification metrics
- Explain the AUC/ROC curve, why it matters, and how to use it
- Understand when and how to optimize a model for various metrics
- Optimize a classification model based on costs

## Heart Disease Data Set

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold

from sklearn.preprocessing import OneHotEncoder

import matplotlib.pyplot as plt

[Dataset info](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

In [None]:
df = pd.read_csv('./data/heart.csv')

In [None]:
print(df.shape)
df.head(3)

Our dataset has 303 patients, 13 independent variables, and 1 binary target variable.

When working with classification problems, it is always good practice to check the class balance.

In [None]:
df['target'].value_counts(normalize = True)

We see that approximately 54% of the patients are in class 0, which refers to 'no presence' of heart disease. Consequently, about 45% of the patients have heart disease.

## Creating Train-Test Split

In [None]:
## For model evaluation we split our data into two parts: Train - Test

X = df.drop('target', axis = 1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 77, 
                                                    stratify = y, # in classification problems 
                                                                  # when you split the data 
                                                                  # you want to keep the ratio in the classes.
                                                    test_size = .2 # This is usually the ratio but it might change 
                                                                   # according to the problem at hand.
                                                   )

In [None]:
## Let's check the proportion of 1s (the positive class) in both datasets
y_train.mean(), y_test.mean()
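To see what `stratify` does in isolation, here is a minimal sketch on toy labels (the arrays `X_toy` and `y_toy` are made up for illustration and are not the heart data). With 30% positives overall, both parts of a stratified split should keep that rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 30 positives and 70 negatives, purely illustrative.
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# With stratify, both parts keep the 30% positive rate.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters most for small or imbalanced datasets.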

!! Now __forget__ the test set

[sklearn - Why do we need Train-Test-Validation?](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)

## Choosing a Performance Metric for Model Evaluations

__Model Selection vs Model Evaluation__

- Model Selection/Model Comparison: What are the best parameters for a given model? Among different models, which one models reality better?

Ex: If we are working on an app that runs a machine learning algorithm, model selection is the process of choosing a final algorithm to deploy.


- Model Evaluation: After selecting a 'best' model through model selection, how will this model perform in the 'real' case?

Ex: Model evaluation is where we want to predict how successful this algorithm will be.

[Available tools in sklearn](https://scikit-learn.org/stable/model_selection.html)

<img src= 'images/table.png' width = 450 />

### Accuracy

$$\text{Accuracy} =  \frac{\text{# of Correct Predictions}}{\text{# of Total Cases}}$$

- Accuracy gives a good overall idea of an estimator's performance, but sometimes it is not directly relevant to the problem. (Especially with an imbalanced dataset, we should expect that even a dummy model can achieve high accuracy.)
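The point about imbalanced data can be checked with a small sketch. The labels below are hypothetical (95 negatives, 5 positives), and `DummyClassifier` simply predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_imb = np.array([0] * 95 + [1] * 5)
X_imb = np.zeros((100, 1))  # features are irrelevant for the dummy model

dummy = DummyClassifier(strategy="most_frequent").fit(X_imb, y_imb)
y_dummy = dummy.predict(X_imb)

baseline_accuracy = accuracy_score(y_imb, y_dummy)  # 95 of 100 correct
baseline_recall = recall_score(y_imb, y_dummy)      # no positive is found
print(baseline_accuracy, baseline_recall)
```

The dummy model reaches 95% accuracy while finding none of the positive cases, which is why accuracy alone can mislead here.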

### Recall

$$ \text{Recall} = \frac{\text{# True Positives}}{\text{# of Condition Positive}} = \frac{\text{TP}}{\text{TP + FN}} $$

- __Q__: Given that the total number of "Condition Positives" is fixed, how can we improve the __Recall__ score?


- In our case, the recall score answers: out of 100 patients with heart disease, how many are successfully predicted as positive?

### Precision

$$ \text{Precision} = \frac{\text{# True Positives}}{\text{# of Predicted Positive}} = \frac{\text{TP}}{\text{TP + FP}} $$

- __Q__: Given that the total number of "Condition Positives" is fixed, how can we improve the __Precision__ score?

- In our case, the precision score answers: out of 100 positive predictions, how many of those patients really have heart disease?
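As a worked example of the two formulas, the sketch below counts TP, FP, FN, and TN on a handful of made-up labels (`y_true` and `y_hat` are hypothetical) and computes both metrics by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)      # 3 / (3 + 1)
precision = tp / (tp + fp)   # 3 / (3 + 1)
print(recall, precision)
```

Both values agree with `recall_score(y_true, y_hat)` and `precision_score(y_true, y_hat)`.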

__Your turn__

- Suppose we are trying to classify whether videos are safe for kids or not. Which metric makes more sense to use? (safe = 1, not_safe = 0)

- We are training a classification algorithm for fraud detection for a bank. Which metric makes more sense to use? (fraud = 1, normal = 0)

## Data Prep Before Training a Model

[A good blog post on handling categorical variables](https://www.bogotobogo.com/python/scikit-learn/scikit_machine_learning_Data_Preprocessing-Missing-Data-Categorical-Data.php)

In [None]:
# we can also check the categorical variables with a scatter matrix plot,
# but notice that this is not practical in higher dimensions
pd.plotting.scatter_matrix(df, figsize= (14, 10))
plt.show()

In [None]:
categorical_variables = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

In [None]:
remaining_list = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

[There are many interesting tools for processing data](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector)

__Your Turn__

- Convert Categorical Variables to OneHotEncoding

- [Dummies vs OneHot: Read the second answer](https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-are-the-pros-and-cons)

In [None]:
pd.get_dummies(X_train, columns= categorical_variables, drop_first= True).shape

Now try to transform the test data with the get_dummies method.
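One thing to watch for with `get_dummies`: it only creates columns for the categories it sees, so train and test can end up with different columns. The toy frames below are hypothetical, and the `reindex` step is just one possible workaround.

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks category 'c'.
train_toy = pd.DataFrame({"cp": ["a", "b", "c", "a"]})
test_toy = pd.DataFrame({"cp": ["a", "b", "b"]})

train_dummies = pd.get_dummies(train_toy, columns=["cp"], drop_first=True)
test_dummies = pd.get_dummies(test_toy, columns=["cp"], drop_first=True)

print(train_dummies.columns.tolist())  # ['cp_b', 'cp_c']
print(test_dummies.columns.tolist())   # ['cp_b']

# One possible workaround: align the test columns to the training columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.shape)
```

A fitted `OneHotEncoder` avoids this problem because it remembers the categories learned from the training data.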

In [None]:
from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler

In [None]:
ss = StandardScaler()

In [None]:
## create an encoder object. This will help us to convert
## categorical variables to new columns
encoder = OneHotEncoder(handle_unknown= 'error',
                        drop='first',
                        categories= 'auto')

## Create a ColumnTransformer object.
## This will help us to merge transformed columns
## with the rest of the dataset.

ct = ColumnTransformer(transformers =[('ohe', encoder, categorical_variables)],
                                    remainder= 'passthrough')
## Fit on the training data only and keep the transformed array.
X = ct.fit_transform(X_train)

In [None]:
X.shape

In [None]:
## get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
ct.transformers_[0][1].get_feature_names_out(categorical_variables)

In [None]:
X[:5, :6]

Now try to transform the test dataset using the ct object.

__Don't forget!!__

- Apply the same transformations to the test data.

In [None]:
Xtest  = ct.transform(X_test)
Xtest.shape

__Scaling Features__ 

-- Let's go back to the column transformer.

[Different Scalers and Their Effect on Data](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py)
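Going back to the column transformer, one possible design is to let it do the scaling too, so that only the numeric columns are standardized and the one-hot columns stay as 0/1. The sketch below uses hypothetical toy frames (`toy_train`, `toy_test`) rather than the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one categorical and one numeric column.
toy_train = pd.DataFrame({"cp": [0, 1, 2, 1], "age": [40.0, 50.0, 60.0, 70.0]})
toy_test = pd.DataFrame({"cp": [2, 0], "age": [55.0, 45.0]})

ct_toy = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(drop="first"), ["cp"]),
        ("scale", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training data only, then reuse the fitted object on the test data.
toy_train_t = ct_toy.fit_transform(toy_train)
toy_test_t = ct_toy.transform(toy_test)

print(toy_train_t.shape, toy_test_t.shape)
print(np.round(toy_train_t[:, -1].mean(), 6))  # scaled 'age' has mean 0 on train
```

The scaled test column will generally not have mean 0, because the test data is transformed with the mean and standard deviation learned from the training data.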

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
standard_scaler = StandardScaler()
standard_scaler.fit(X)
X = standard_scaler.transform(X)
## apply the trained transformations to test.

Xtest = standard_scaler.transform(Xtest)

In [None]:
Xtest.shape

In [None]:
np.mean(X,axis = 0)

## What do you expect if you check the means of Xtest? Try it.

## Model Training

[Check sklearn for documentation of Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


[For solvers](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

In [None]:
## penalty=None requires scikit-learn >= 1.2; older versions expect penalty='none'
log_reg = LogisticRegression(penalty = None, max_iter= 10000)
log_reg.fit(X, y_train)

In [None]:
## What is this score?
print(log_reg.score(X, y_train))

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

In [None]:
y_pred = log_reg.predict(X)

score = log_reg.score(X, y_train)


In [None]:
cm = confusion_matrix(y_train, y_pred)

In [None]:
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Pastel1');  # counts are integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15);
plt.savefig('toy_Digits_ConfusionSeabornCodementor.png')
#plt.show();


__Your Turn__

- Find Recall and Precision scores

__Reminder__

$$ \text{True Positive Rate} = \text{Recall} = \frac{\text{# True Positives}}{\text{# of Condition Positive}} = \frac{\text{TP}}{\text{TP + FN}} $$

$$ \text{Precision} = \frac{\text{# True Positives}}{\text{# of Predicted Positive}} = \frac{\text{TP}}{\text{TP + FP}} $$

In [None]:
## find them here

### Using sklearn for precision and recall

In [None]:
## Recall

In [None]:
from sklearn.metrics import recall_score

In [None]:
recall_score(y_train, y_pred)

In [None]:
from sklearn.metrics import precision_score

In [None]:
precision_score(y_train, y_pred)

In [None]:
## there are other important metrics too

from sklearn.metrics import f1_score

f1_score(y_train, y_pred)
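The F1 score is the harmonic mean of precision and recall, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. A quick check on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 0, 1, 0, 0, 0])

p = precision_score(y_true_toy, y_pred_toy)  # TP=2, FP=1 -> 2/3
r = recall_score(y_true_toy, y_pred_toy)     # TP=2, FN=2 -> 1/2
f1_manual = 2 * p * r / (p + r)              # harmonic mean of p and r

print(f1_manual, f1_score(y_true_toy, y_pred_toy))
```

Because it is a harmonic mean, F1 is pulled toward the smaller of the two values, so a model cannot score well by maximizing only one of them.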

### Using Cross Validation Scores for Model Evaluation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
## penalty=None requires scikit-learn >= 1.2; older versions expect penalty='none'
log_reg = LogisticRegression(penalty = None, max_iter= 10000 )

In [None]:
y_scores = cross_val_score(log_reg, X, y_train, cv = 5, scoring= 'f1')

In [None]:
y_scores
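A hedged variation on the cell above: if the scaler is fit on the whole training set before cross validation, each held-out fold has already influenced the scaling. One way to avoid this is to put the scaler and the model in a pipeline. The sketch uses synthetic data from `make_classification` as a stand-in (an assumption, not the heart data) and the default L2-penalized logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the heart disease set).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler inside the pipeline is refit on each training fold,
# so the held-out fold never influences the scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# StratifiedKFold keeps the class ratio roughly equal across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X_syn, y_syn, cv=cv, scoring="f1")

print(fold_scores.mean(), fold_scores.std())
```

Reporting the mean together with the spread across folds gives a better sense of how stable the score is than a single number.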

## ROC Curves for Model Selection

<img src='./images/conf_matrix_classification_metrics.png' width=650/>

In [None]:
## penalty=None requires scikit-learn >= 1.2; older versions expect penalty='none'
log_reg_vanilla = LogisticRegression(penalty= None, max_iter= 10000)

log_reg_l2 = LogisticRegression(penalty = 'l2', C = 0.01, max_iter= 10000)

In [None]:
log_reg_vanilla.fit(X, y_train)

y_probs_vanilla = log_reg_vanilla.predict_proba(X)

In [None]:
log_reg_l2.fit(X, y_train)
y_probs_l2 = log_reg_l2.predict_proba(X)

In [1]:
## let's change the threshold to see the effect of it on FPR and TPR

In [None]:
predicts = []
for item in log_reg_vanilla.predict_proba(X):
    if item[1] <= .20:
        predicts.append(0)
    else:
        predicts.append(1)
        
conf_matrix = pd.DataFrame(confusion_matrix(y_train, predicts),
                           index = ['actual 0', 'actual 1'], 
                           columns = ['predicted 0', 'predicted 1'])
conf_matrix
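The same thresholding can be written without a loop, which makes it easy to compare several thresholds. The labels and probabilities below are hypothetical, and `rates_at` is a helper introduced only for this sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted probabilities of class 1.
y_true_thr = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_hat = np.array([0.05, 0.15, 0.30, 0.60, 0.25, 0.55, 0.80, 0.95])

def rates_at(threshold):
    """Return (TPR, FPR) when predicting 1 for probabilities above the threshold."""
    y_hat = (p_hat > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true_thr, y_hat, labels=[0, 1]).ravel()
    return tp / (tp + fn), fp / (fp + tn)

for t in (0.2, 0.5, 0.8):
    tpr, fpr = rates_at(t)
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Raising the threshold lowers both rates: fewer false alarms, but also fewer detected positives. Each threshold gives one (FPR, TPR) point, and the ROC curve traces all of them.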

### Plotting ROC curves

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
from sklearn.metrics import roc_curve

In [None]:
fpr_v, tpr_v, thresholds_v = roc_curve(y_train, y_probs_vanilla[:,1])
fpr_l2, tpr_l2, thresholds_l2 = roc_curve(y_train, y_probs_l2[:,1])

In [None]:
def plot_roc_curve(fpr, tpr, label = None):
    plt.plot(fpr, tpr, linewidth =2 , label = label)
    plt.plot([0,1], [0,1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    
plot_roc_curve(fpr_v, tpr_v, label = 'Vanilla')
plot_roc_curve(fpr_l2, tpr_l2, label = 'L2-Penalty')
plt.legend()
plt.show()

We can also measure the Area Under the Curve (AUC) score.


In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
## AUC for the vanilla model
roc_auc_score(y_train, y_probs_vanilla[:,1])

In [None]:
roc_auc_score(y_train, y_probs_l2[:,1])
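One way to read the AUC: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as half). The sketch below checks this on hypothetical scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, for illustration only.
y_true_auc = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.7, 0.35, 0.8, 0.9])

pos = scores[y_true_auc == 1]
neg = scores[y_true_auc == 0]

# Fraction of (positive, negative) pairs where the positive gets the
# higher score; ties count as half.
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0) + 0.5 * (diff == 0)).mean()

print(auc_pairs, roc_auc_score(y_true_auc, scores))
```

Here 7 of the 9 pairs are ordered correctly, so both computations give 7/9. Note that the AUC depends only on the ranking of the scores, not on any particular threshold.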

### The Default Measure (in most prebuilt models) - Accuracy

$$ \frac{(TP + TN)}{(TP + FP + TN + FN)} $$

<img src='./images/conf_matrix_classification_metrics.png' width=650/>

Category definitions - possible outcomes in binary classification

- TP = True Positive (class 1 correctly classified as class 1) - e.g. Patient with cancer tests positive for cancer
- TN = True Negative (class 0 correctly classified as class 0) - e.g. Patient without cancer tests negative for cancer
- FP = False Positive (class 0 incorrectly classified as class 1) - e.g. Patient without cancer tests positive for cancer
- FN = False Negative (class 1 incorrectly classified as class 0) - e.g. Patient with cancer tests negative for cancer

 $$ \text{Possible misclassifications} $$

<img src='./images/type-1-type-2.jpg' width=400/>
 

Remember that Logistic Regression gives probability predictions for each class, in addition to the final classification. By default, the threshold for the prediction is set to 0.5, but we can adjust that threshold.

In [None]:
predicts = []
for item in log_reg_vanilla.predict_proba(X):
    if item[1] <= .20:
        predicts.append(0)
    else:
        predicts.append(1)

In [None]:
conf_matrix = pd.DataFrame(confusion_matrix(y_train, predicts),
                           index = ['actual 0', 'actual 1'], 
                           columns = ['predicted 0', 'predicted 1'])
conf_matrix
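Adjusting the threshold is also how a model can be tuned for costs. The sketch below is one possible approach, not a definitive recipe: it assumes a false negative costs five times as much as a false positive (`COST_FP` and `COST_FN` are made-up values), scans a grid of thresholds on hypothetical data, and keeps the cheapest one.

```python
import numpy as np

# Hypothetical labels, predicted probabilities, and costs (all assumptions).
y_true_cost = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_cost = np.array([0.05, 0.15, 0.30, 0.60, 0.25, 0.55, 0.80, 0.95])
COST_FP = 1.0  # e.g. an unnecessary follow-up test
COST_FN = 5.0  # e.g. a missed diagnosis

def total_cost(threshold):
    """Total misclassification cost when predicting 1 above the threshold."""
    y_hat = (p_cost > threshold).astype(int)
    fp = np.sum((y_hat == 1) & (y_true_cost == 0))
    fn = np.sum((y_hat == 0) & (y_true_cost == 1))
    return COST_FP * fp + COST_FN * fn

thresholds = np.arange(5, 100, 5) / 100  # 0.05, 0.10, ..., 0.95
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]  # first threshold with minimal cost

print(best_threshold, costs.min())
```

Because false negatives are assumed to be expensive, the cheapest threshold lands well below 0.5. In practice the threshold should be chosen on validation data or with cross validation, not on the test set.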

### The AUC / ROC curve (Area Under the Curve of the Receiver Operating Characteristic)

<img src='images/pop-curve.png' width=500/>


In [None]:
results_df = X_train.copy()

In [None]:
results_df['probabilities'] = log_reg_vanilla.predict_proba(X)[:, 1]
results_df['target'] = y_train

In [None]:
results_df.head(2)