# Various Performance Metrics

Once we have the predicted labels for our test set and the actual test set labels, it's easy to compute a variety of evaluation metrics.

When you have several models coming from different Python packages, I believe it's important to compare validation/test metrics in a separate notebook, using the saved predicted and expected labels. That way, you can be sure your evaluation metrics were calculated consistently across models.
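
One way to keep that comparison consistent is to push every model's predictions through the same scoring function. Here is a minimal sketch, assuming each model's predictions are already loaded into memory. The helper `score_model`, the model names, and the label arrays are hypothetical placeholders, not files from this project.

```python
import numpy as np
from sklearn import metrics

def score_model(y_true, y_pred):
    """Compute the same set of metrics for any model's predictions."""
    return {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        "precision": metrics.precision_score(y_true, y_pred),
        "recall": metrics.recall_score(y_true, y_pred),
    }

# hypothetical expected labels and per-model predictions (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predictions = {
    "model_a": np.array([1, 0, 1, 0, 0, 0, 1, 1]),
    "model_b": np.array([1, 1, 1, 1, 0, 0, 1, 0]),
}

# every model goes through the identical scoring code path
results = {name: score_model(y_true, y_pred) for name, y_pred in predictions.items()}
for name, scores in results.items():
    print(name, scores)
```

Because the scoring code lives in one place, a difference between two models reflects the predictions rather than how each package happens to report its own metrics.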

**Classification accuracy:** percentage of correct predictions

In [9]:
import pickle
import glob
import numpy as np
from sklearn import metrics
#sklearn to save the day yet again!

In [10]:
ROOT_DIR = "/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/1_text_classification"
DATA_DIR = "%s" % ROOT_DIR
EVAL_DIR = "%s/evaluation" % ROOT_DIR
MODEL_DIR = "%s/models" % ROOT_DIR

In [11]:
eval_files = glob.glob("%s/*SGDC*" % EVAL_DIR)

In [12]:
eval_files

['/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/1_text_classification/evaluation/y_test_SGDClassifier.p',
 '/content/drive/MyDrive/MSDS_marketing_text_analytics/master_files/1_text_classification/evaluation/y_pred_SGDClassifier.p']

In [13]:
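# glob does not guarantee file order, so sort by name: y_pred_* sorts before y_test_*
y_pred_file, y_test_file = sorted(eval_files)
y_test = np.asarray(pickle.load(open(y_test_file, 'rb')))
y_pred = np.asarray(pickle.load(open(y_pred_file, 'rb')))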

In [14]:
print(metrics.accuracy_score(y_test, y_pred))

0.837471783295711


In [15]:
# calculate the percentage of ones
y_test.mean()

0.5048908954100828

In [26]:
y_pred.mean()

0.510910458991723

In [16]:
# calculate the percentage of zeros
1 - y_test.mean()

0.4951091045899172

In [17]:
# calculate null accuracy: the accuracy of always predicting the most frequent class
# (this shortcut only works for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())

0.5048908954100828
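
The `max(...)` shortcut above relies on the labels being 0/1. A minimal sketch of the same idea for arbitrary label sets follows, using only NumPy. The function name `null_accuracy` and the toy labels are made up for illustration.

```python
import numpy as np

def null_accuracy(y_true):
    """Accuracy of always predicting the most frequent class in y_true."""
    # count how often each distinct label occurs
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    # the majority-class share is the best a constant predictor can do
    return counts.max() / counts.sum()

# toy labels (hypothetical, not the notebook's data)
toy_binary = [0, 1, 1, 1, 0]
toy_multiclass = ["a", "b", "b", "c", "b", "a"]

print(null_accuracy(toy_binary))      # 3 of 5 labels are 1
print(null_accuracy(toy_multiclass))  # 3 of 6 labels are "b"
```

For 0/1 labels this gives the same value as `max(y.mean(), 1 - y.mean())`.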

## Confusion matrix

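A table that describes the performance of a classification model by cross-tabulating actual and predicted labels. In scikit-learn's output, rows are the actual classes and columns are the predicted classes.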

In [18]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred))

[[546 112]
 [104 567]]


**Basic terminology**

- **True Positives (TP):** we *correctly* predicted that it is healthy living
- **True Negatives (TN):** we *correctly* predicted that it is not healthy living
- **False Positives (FP):** we *incorrectly* predicted that it is healthy living (a "Type I error")
- **False Negatives (FN):** we *incorrectly* predicted that it is not healthy living (a "Type II error")

In [19]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
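
For a binary problem, the four counts can also be unpacked in one step with `ravel()`, which flattens the 2x2 matrix row by row in the order TN, FP, FN, TP. Here is a small sketch on made-up labels. The `toy_` arrays are illustrative, not the notebook's data.

```python
import numpy as np
from sklearn import metrics

# made-up labels for illustration only
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# first argument is true values, second argument is predicted values
toy_confusion = metrics.confusion_matrix(toy_true, toy_pred)

# row-major flattening of [[TN, FP], [FN, TP]]
tn, fp, fn, tp = toy_confusion.ravel()
print(toy_confusion)
print(tn, fp, fn, tp)
```

This is equivalent to the explicit indexing used above and avoids mixing up the row and column positions.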

## Metrics computed from a confusion matrix

**Classification Accuracy:** Overall, how often is the classifier correct?

In [20]:
print((TP + TN) / (TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred))

0.837471783295711
0.837471783295711


**Classification Error:** Overall, how often is the classifier incorrect?

- Also known as "Misclassification Rate"

In [21]:
print((FP + FN) / (TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred))

0.16252821670428894
0.16252821670428896


**Sensitivity:** When the actual value is positive, how often is the prediction correct?

- How "sensitive" is the classifier to detecting positive instances?
- Also known as "True Positive Rate" or "Recall"

In [22]:
print(TP / (TP + FN))
print(metrics.recall_score(y_test, y_pred))

0.8450074515648286
0.8450074515648286


**Specificity:** When the actual value is negative, how often is the prediction correct?

- How "specific" (or "selective") is the classifier: how rarely does it flag a negative instance as positive?
- Also known as "True Negative Rate"

In [23]:
print(TN / (TN + FP))

0.8297872340425532
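
`sklearn.metrics` has no dedicated specificity function, but specificity is just recall computed for the negative class. One possible cross-check is to call `recall_score` with `pos_label=0`. This is a sketch on made-up labels, assuming the classes are coded as 0/1.

```python
import numpy as np
from sklearn import metrics

# made-up labels for illustration only
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

tn, fp, fn, tp = metrics.confusion_matrix(toy_true, toy_pred).ravel()

# specificity from the confusion matrix counts
specificity_manual = tn / (tn + fp)
# the same quantity: recall with class 0 treated as the "positive" label
specificity_sklearn = metrics.recall_score(toy_true, toy_pred, pos_label=0)

print(specificity_manual)
print(specificity_sklearn)
```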


**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

- Equal to 1 minus the specificity

In [24]:
print(FP / (TN + FP))

0.1702127659574468


**Precision:** When a positive value is predicted, how often is the prediction correct?

- How "precise" is the classifier when predicting positive instances?

In [25]:
print(TP / (TP + FP))
print(metrics.precision_score(y_test, y_pred))

0.8350515463917526
0.8350515463917526


In [27]:
# precision (first line) next to recall (second line) for comparison
print(TP / (TP + FP))
print(metrics.recall_score(y_test, y_pred))

0.8350515463917526
0.8450074515648286
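
To recap, all of the rates above can be derived from the four cells of the confusion matrix. The sketch below bundles them into one helper. The function name `rates_from_confusion` is made up for illustration, and it assumes a 2x2 matrix laid out as scikit-learn returns it, `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

def rates_from_confusion(confusion):
    """Return the metrics discussed above for a 2x2 matrix [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(confusion).ravel()
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),
        "precision": tp / (tp + fp),
    }

# counts copied from the confusion matrix printed earlier in this notebook
summary = rates_from_confusion([[546, 112], [104, 567]])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```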
