## Evaluating a Classification Model

In [14]:
# read the data into a Pandas DataFrame
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv('pima-indians-diabetes.csv', skiprows=range(0, 9), header=None)
pima.columns = col_names
pima.head()

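|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |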


In [20]:
pima.shape

(768, 9)

* label  
    1: diabetes  
    0: no diabetes  

**Question:** Can we predict the diabetes status of a patient given their health measurements?

In [15]:
# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']

# X is a 2-D DataFrame: index with a list of column names to select the features
X = pima[feature_cols]

# y is a 1-D Series: attribute access (pima.label) is equivalent to pima['label']
y = pima.label

In [23]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [25]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression

# instantiate model
logreg = LogisticRegression()

# fit model
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [28]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

**Classification accuracy:** percentage of correct predictions

In [31]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.6927083333333334


The classification accuracy is about 69%.

**Null accuracy:** accuracy that could be achieved by always predicting the most frequent class

Always compare classification accuracy against this baseline.

In [41]:
# check most frequent class in y_test
y_test.value_counts()

0    130
1     62
Name: label, dtype: int64

In [45]:
# calculate the percentage of zeros
1 - y_test.mean()

0.6770833333333333

In [46]:
# calculate null accuracy in a single line of code
# only for binary classification problems coded as 0/1
max(y_test.mean(), 1 - y_test.mean())

0.6770833333333333

This means that a dumb model that always predicts 0 would be right about 68% of the time.

* Our classification accuracy of 69% is barely better than this baseline, so the model adds little over always predicting the majority class
* Null accuracy is a useful baseline: it is the minimum accuracy our models should beat
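
The one-line shortcut above only works for binary labels coded as 0/1. As a minimal sketch using only the standard library, null accuracy for any label encoding is the share of the most frequent class. The helper name `null_accuracy` is hypothetical, and the label list is rebuilt from the `value_counts()` output shown above (130 zeros, 62 ones) rather than read from the dataset.

```python
from collections import Counter

def null_accuracy(labels):
    """Return (most frequent class, share of that class) for a sequence of labels."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Rebuild the test labels from the counts reported by y_test.value_counts()
y_test_labels = [0] * 130 + [1] * 62

majority_class, baseline = null_accuracy(y_test_labels)
print(majority_class, baseline)
```

With a pandas Series, `y_test.value_counts(normalize=True).max()` should give the same baseline.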

In [49]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])

True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]


**Conclusion:**

* Classification accuracy is the **easiest classification metric to understand**  
* But, it does not tell you the **underlying distribution of response values**  
    * We examine this by calculating the null accuracy  
* And, it does not tell you what **"types" of errors** your classifier is making

### Confusion Matrix

In [53]:
# IMPORTANT: first argument is true values, second argument is predicted values
# this produces a 2x2 numpy array (matrix)
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion)
# index as [row, column]: rows are actual classes, columns are predicted classes
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

[[118  12]
 [ 47  15]]
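
To make the row and column layout concrete, here is a minimal sketch that tallies the four cells by hand. It uses only the first 25 true and predicted responses printed earlier, so its counts cover that subset and differ from the full 192-sample matrix above. The helper name `confusion_counts` is hypothetical.

```python
# First 25 true and predicted responses, copied from the printed output
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

def confusion_counts(y_true, y_pred):
    """Return a 2x2 list indexed as [actual][predicted] for labels coded 0/1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

matrix = confusion_counts(y_true, y_pred)
tn, fp = matrix[0]  # row 0: actual negatives
fn, tp = matrix[1]  # row 1: actual positives
print(matrix)
print('TP:', tp, 'TN:', tn, 'FP:', fp, 'FN:', fn)
```

On this subset the tally is `[[15, 1], [7, 2]]`: most of the errors are false negatives, which is the same pattern the full matrix shows.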


### Metrics computed from a confusion matrix

**Classification Accuracy:** Overall, how often is the classifier correct?

In [55]:
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))

0.6927083333333334
0.6927083333333334


**Classification Error:** Overall, how often is the classifier incorrect?

   * Also known as "Misclassification Rate"

In [58]:
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))

0.3072916666666667
0.30729166666666663
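
The two printed values differ in the last digits only because of floating-point rounding; the two formulas are algebraically identical. A minimal sketch with the counts from the confusion matrix above, using `fractions.Fraction` for exact arithmetic:

```python
import math
from fractions import Fraction

# Counts taken from the confusion matrix [[118, 12], [47, 15]]
TP, TN, FP, FN = 15, 118, 12, 47
total = TP + TN + FP + FN

# Floating-point versions of the two formulas
error_direct = (FP + FN) / total
error_complement = 1 - (TP + TN) / total
print(math.isclose(error_direct, error_complement))

# Exact rational arithmetic confirms both formulas give the same number
exact_direct = Fraction(FP + FN, total)
exact_complement = 1 - Fraction(TP + TN, total)
print(exact_direct, exact_direct == exact_complement)
```

When comparing such values in code, `math.isclose` is safer than `==`.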


**Sensitivity:** When the actual value is positive, how often is the prediction correct?

   * Something we want to maximize
   * How "sensitive" is the classifier to detecting positive instances?
   * Also known as *"True Positive Rate"* or *"Recall"*
   * TP / all positive
   * all positive = TP + FN

In [63]:
sensitivity = TP / float(FN + TP)
print(sensitivity)
print(metrics.recall_score(y_test, y_pred_class))

0.24193548387096775
0.24193548387096775


**Specificity:** When the actual value is negative, how often is the prediction correct?

   * Something we want to maximize
   * How "specific" (or "selective") is the classifier in predicting positive instances?
   * TN / all negative
   * all negative = TN + FP

In [65]:
specificity = TN / float(TN + FP)
print(specificity)

0.9076923076923077


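Our classifier is:

   * Highly specific (about 91% of actual negatives are predicted correctly)
   * Not sensitive (only about 24% of actual positives are detected)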

**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

In [68]:
false_positive_rate = FP / float(TN + FP)
print(false_positive_rate)
print(1 - specificity)

0.09230769230769231
0.09230769230769231


**Precision:** When a positive value is predicted, how often is the prediction correct?

   * How "precise" is the classifier when predicting positive instances?

In [71]:
precision = TP / float(TP + FP)
print(precision)
print(metrics.precision_score(y_test, y_pred_class))

0.5555555555555556
0.5555555555555556
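
As a recap, the metrics in this section can be collected into one small helper. This is a sketch and not part of scikit-learn: `metrics_from_counts` is a hypothetical name, and the counts are copied from the confusion matrix printed above.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Compute the metrics discussed above from the four confusion-matrix cells."""
    total = tp + tn + fp + fn
    return {
        'accuracy': (tp + tn) / total,
        'error': (fp + fn) / total,
        'sensitivity': tp / (tp + fn),          # recall, true positive rate
        'specificity': tn / (tn + fp),
        'false_positive_rate': fp / (tn + fp),  # equals 1 - specificity
        'precision': tp / (tp + fp),
    }

# Counts from the confusion matrix [[118, 12], [47, 15]]
results = metrics_from_counts(tp=15, tn=118, fp=12, fn=47)
for name, value in results.items():
    print(f'{name}: {value:.4f}')
```

The helper raises `ZeroDivisionError` when a denominator is zero, for example when the classifier never predicts the positive class, so real code would need to handle that case.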


**Which metrics should you focus on?**

* Choice of metric depends on your business objective
    * Identify whether false positives (FP) or false negatives (FN) are more important to reduce
    * Choose a metric whose formula contains that error type (FP or FN)
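
To see which metrics respond to which error type, here is a hypothetical what-if built on the counts from the confusion matrix above. The scenario of correcting 10 errors is invented purely for illustration and is not a result from the model; the `calc_*` helper names are also hypothetical.

```python
def calc_sensitivity(tp, fn):
    return tp / (tp + fn)

def calc_specificity(tn, fp):
    return tn / (tn + fp)

def calc_precision(tp, fp):
    return tp / (tp + fp)

# Baseline counts from the confusion matrix [[118, 12], [47, 15]]
base = dict(tp=15, tn=118, fp=12, fn=47)

# Hypothetical scenario A: 10 false negatives become true positives
fewer_fn = dict(tp=base['tp'] + 10, tn=base['tn'], fp=base['fp'], fn=base['fn'] - 10)
# Hypothetical scenario B: 10 false positives become true negatives
fewer_fp = dict(tp=base['tp'], tn=base['tn'] + 10, fp=base['fp'] - 10, fn=base['fn'])

# Print sensitivity, specificity, and precision for each scenario
for label, c in [('baseline', base), ('fewer FN', fewer_fn), ('fewer FP', fewer_fp)]:
    print(label,
          round(calc_sensitivity(c['tp'], c['fn']), 3),
          round(calc_specificity(c['tn'], c['fp']), 3),
          round(calc_precision(c['tp'], c['fp']), 3))
```

Sensitivity moves only when FN changes, and specificity moves only when FP changes. Precision improves in both scenarios because it depends on both TP and FP. If FN is the costlier error, sensitivity is the metric to watch; if FP is costlier, watch specificity or precision.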

https://www.ritchieng.com/machine-learning-evaluate-classification-model/