# Classification model evaluation

The purpose of this lesson is to learn how to evaluate the performance of classification models.

How do we know if our model is performing well? It's important to establish a baseline against which we can compare the performance of our model. For classification problems, the baseline is often the most prevalent (mode) label present in our dataset. Our baseline "model" simply predicts the most abundant label for every observation. I will demonstrate how to create a baseline in a dataframe.

How do we compare the performance of our models to our baseline? We need a way to quantify model and baseline performance. We'll cover three of the most common evaluation metrics for classification models: accuracy, precision, and recall. Once we've computed these metrics for our models, we can compare them to identify the top performer.

How do we compute these evaluation metrics? We will generate the aptly named confusion matrix and use it to calculate the accuracy, precision, and recall. I will explain how to calculate each metric and we will do it by hand!

Will we always have to calculate these by hand? No, that's the beauty of computers. At the end of the lesson, we'll learn how to automatically calculate these metrics using the sklearn library.

In [1]:
#imports
import pandas as pd

from sklearn.metrics import classification_report

Let's create sample model predictions using everyone here today. I'll input whether or not each person is wearing glasses, and then generate some model "predictions" over this same dataset (it's just me making bad guesses).

In [2]:
#example df
df = pd.DataFrame({'glasses': ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no'],
                   'preds': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes']})

df

   glasses  preds
0      yes    yes
1       no     no
2      yes    yes
3      yes     no
4      yes    yes
5       no    yes
6      yes     no
7      yes    yes
8       no    yes


Before we can evaluate the quality of predictions made by our robust classification model, we need to establish a baseline. Our baseline will be the most abundant (mode) label in our dataset (the glasses column). In our small example, we could count the instances of yes and no by hand. That isn't practical with a much larger dataset, so it's important to do these operations programmatically! We can use the pandas Series method .mode() to get the baseline prediction.

In [3]:
#baseline
df.glasses.mode()

0    yes
Name: glasses, dtype: object

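We should add the baseline prediction to the dataframe as a new column. This will make it more convenient for us to evaluate model performance as we work through the notebook. Note that .mode() returns a Series (a column can have more than one mode if there is a tie), so we take its first value with [0].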

In [4]:
#add baseline to df
df['baseline'] = df.glasses.mode()[0]
df

   glasses  preds  baseline
0      yes    yes       yes
1       no     no       yes
2      yes    yes       yes
3      yes     no       yes
4      yes    yes       yes
5       no    yes       yes
6      yes     no       yes
7      yes    yes       yes
8       no    yes       yes


Now that we have the actual labels, our model's predictions, and the baseline in the same dataframe, let's analyze the performance of the baseline.

Why evaluate the baseline first? It's nice to know the baseline at the beginning of model evaluation, because then we know what benchmarks we are attempting to beat!

To evaluate our baseline, we'll create a confusion matrix. This is done with pandas using the pd.crosstab() function. We will create a crosstab between the actual values (glasses column) and our baseline's predicted values (baseline column).

In [5]:
#baseline crosstab
pd.crosstab(df.glasses, df.baseline)

baseline  yes
glasses
no          3
yes         6


### Baseline evaluation

Our crosstab is small because our baseline model only predicts one label (yes). The rows represent the actual labels from our dataset. We can see that 6 people in the room have glasses and 3 people don't. The columns represent the predictions made by our baseline model. It predicted "yes" 9 times (adding down the column), and it was correct 6 of those times. Let's calculate the accuracy, precision, and recall for our baseline model.

Accuracy: the number of correct guesses made over all guesses.  
In our case, the baseline accuracy is 6/9 or 67%.

Precision: the number of true positive guesses over the total number of positive guesses.  
In our case, we are treating glasses as the positive case. A true positive (TP) is when our model guesses glasses and is correct. A false positive (FP) is when our model guesses glasses and is wrong.  
The equation for precision is often written as TP / (TP + FP).  
Using the equation, our baseline has 6 true positives and 3 false positives, so its precision is 6 / (6 + 3) = 6/9 or 67%.

Recall: the number of true positives identified over all actual positives present in the dataset.  
Again, we are treating having glasses as the positive case. A false negative (FN) is when our model guesses somebody is not wearing glasses but they are.  
The equation for recall is often written as TP / (TP + FN).  
Our baseline never guesses "no", so it has 0 false negatives. Using the equation, our baseline recall is 6 / (6 + 0) = 6/6 or 100%.
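The hand calculations above can also be checked in code. This is a minimal sketch, assuming the same data as the lesson's dataframe. The variable names (positive, tp, fp, fn, and the baseline_* results) are my own and are not part of pandas.

```python
import pandas as pd

# Same actual labels as the lesson's dataframe
df = pd.DataFrame({'glasses': ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']})
df['baseline'] = df.glasses.mode()[0]

positive = 'yes'  # we treat wearing glasses as the positive case

# Count each kind of outcome with boolean masks
tp = ((df.baseline == positive) & (df.glasses == positive)).sum()  # guessed yes, actually yes
fp = ((df.baseline == positive) & (df.glasses != positive)).sum()  # guessed yes, actually no
fn = ((df.baseline != positive) & (df.glasses == positive)).sum()  # guessed no, actually yes

baseline_accuracy = (df.baseline == df.glasses).mean()  # correct guesses / all guesses
baseline_precision = tp / (tp + fp)
baseline_recall = tp / (tp + fn)

print(f"accuracy:  {baseline_accuracy:.2f}")
print(f"precision: {baseline_precision:.2f}")
print(f"recall:    {baseline_recall:.2f}")
```

The printed values should line up with the 67%, 67%, and 100% we worked out by hand.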

Now that we've evaluated our baseline, let's see how our predictive model measures up!

In [6]:
#model crosstab
pd.crosstab(df.glasses, df.preds)

preds    no  yes
glasses
no        1    2
yes       2    4


### Model evaluation

We have an additional column in our confusion matrix! Unlike the example with the baseline, our model guessed both possible labels (yes and no). Let's work through the confusion matrix and calculate the same evaluation metrics for our predictive model.

Accuracy: we need to add the true negatives (where "no" intersects) and true positives (where "yes" intersects) and divide by all guesses.  
We have 1 true negative and 4 true positives, which means we divide 5 by 9 to get an accuracy of 5/9 or 56%.

Precision: we are still treating having glasses as the positive case. Remember, our equation for precision is TP / (TP + FP).  
We have 4 true positives. How do we identify false positives? These are instances where our model predicted "yes", but the correct label was "no". Looking at the table, we follow the first row (no glasses) to the second column (predicted yes glasses) to see there are two false positives.  
Our precision is 4/6 or 67%.

Recall: our equation for recall is TP / (TP + FN).  
We know we have 4 true positives from calculating the precision. How many false negatives do we have? We need to look at the second row (yes glasses) and the first column (we predicted they aren't wearing glasses). Our model made 2 false negative predictions.  
Our recall is also 4/6 or 67%.
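Reading cells off the confusion matrix can be mirrored in a few lines of plain Python. This is only a sketch: it rebuilds the lesson's two columns as lists and counts each (actual, predicted) pair, which is one way to get the same four cells that pd.crosstab() shows. The names cells, tp, fp, fn, tn, and the model_* results are my own.

```python
from collections import Counter

# Same actual labels and model predictions as the lesson's dataframe
glasses = ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']
preds = ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes']

# Each (actual, predicted) pair corresponds to one cell of the confusion matrix
cells = Counter(zip(glasses, preds))

tp = cells[('yes', 'yes')]  # second row, second column
fp = cells[('no', 'yes')]   # first row, second column
fn = cells[('yes', 'no')]   # second row, first column
tn = cells[('no', 'no')]    # first row, first column

model_accuracy = (tp + tn) / (tp + tn + fp + fn)
model_precision = tp / (tp + fp)
model_recall = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy:  {model_accuracy:.2f}")
print(f"precision: {model_precision:.2f}")
print(f"recall:    {model_recall:.2f}")
```

The counts should match the crosstab (4, 2, 2, and 1), and the metrics should match the 56%, 67%, and 67% calculated by hand.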

How does our predictive model compare to the baseline?

### Comparison

The baseline had better accuracy (67% vs. 56%) and better recall (100% vs. 67%) than the model, and the same precision (67%). We would conclude that our model is NOT outperforming the baseline and is unfit for deployment.

Why do we have more evaluation metrics? Isn't accuracy enough? Precision and recall measure different kinds of mistakes, and they have important real-life applications.

Accuracy answers a simple question: out of all the guesses made by our model, how many were correct? In many cases, that is good enough.

Why precision? Optimizing for precision will minimize false positives. We prioritize precision when a false positive would be the most costly mistake!

On the other hand, optimizing for recall will minimize false negatives. We want high recall when a false negative is the most damaging mistake!

How can we calculate these metrics more efficiently with the sklearn library?

In [7]:
#model eval with sklearn
print(classification_report(df.glasses, df.preds))

              precision    recall  f1-score   support

          no       0.33      0.33      0.33         3
         yes       0.67      0.67      0.67         6

    accuracy                           0.56         9
   macro avg       0.50      0.50      0.50         9
weighted avg       0.56      0.56      0.56         9



One function calculates everything! Note how precision and recall are calculated treating each possible label as the positive class. Since we treated the "yes" label as the positive class in our calculations, we can see how that aligns with the second row of the classification report. There are some other metrics present in the report, including the f1-score. The f1-score is another prominent evaluation metric for classification problems. If you enjoyed this lesson on classification model evaluation, I encourage you to research additional evaluation metrics!
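If you only need a single number rather than the full report, sklearn.metrics also provides one function per metric. The sketch below assumes the same labels and predictions as the lesson's dataframe, rebuilt here as plain lists so the cell runs on its own. Because our labels are strings, we pass pos_label='yes' to tell sklearn which label to treat as the positive class. The f1-score is the harmonic mean of precision and recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Same actual labels and model predictions as the lesson's dataframe
glasses = ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no']
preds = ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes']

# pos_label selects which label counts as the positive class
accuracy = accuracy_score(glasses, preds)
precision = precision_score(glasses, preds, pos_label='yes')
recall = recall_score(glasses, preds, pos_label='yes')
f1 = f1_score(glasses, preds, pos_label='yes')

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
```

These values should agree with the "yes" row and the accuracy row of the classification report above.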