<a href="https://colab.research.google.com/github/RenatodaCostaSantos/Machine-Learning---Lessons/blob/main/Supervised%20ML/Logistic%20regression/LR__metrics_lesson_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Metrics: Evaluating a logistic regression model performance

So far we've learned how to instantiate a logistic regression model and how to fit it to data. However, how do we judge the model's performance?

We need a way to measure how the predicted outcomes relate to the actual outcomes. In machine learning, this is done through a metric.

In this lesson, we will learn about six different types of common metrics used in logistic regression problems:

- Accuracy,

- Sensitivity,

- Specificity,

- Positive predictive value (PPV),

- Negative predictive value (NPV),

- F1-score.

Once again we will use the auto dataset to exemplify the concepts we introduce during this lesson. Let's read the dataset:

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Load auto dataset
auto = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Logistic regression/automobiles.csv')

In [3]:
auto.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
1,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
2,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
3,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
4,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430


In [4]:
# Create a binary outcome column
auto['high_price'] = 0
auto.loc[auto['price'] > 15000, 'high_price'] = 1

In [5]:
# Check values for new column
auto['high_price'].value_counts()

0    119
1     40
Name: high_price, dtype: int64

In [6]:
# Create features and target variables
X = auto.drop(['high_price','price'], axis = 1)
y = auto['high_price']

In [7]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 712)

# Accuracy

The first and most widely used metric in classification problems is accuracy. It is defined by:
$$
\text{accuracy} = \frac{\text{Number of correct predictions}}{\text{Number of observations}}.
$$

In a binary problem, a correct prediction happens when the model predicts 1 when the outcome was 1 and 0 when the outcome was 0. Accuracy is therefore the fraction of observations that the model labeled correctly.
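
The definition above can be sketched by hand before relying on any library. The arrays below are made-up labels for illustration only (they are not taken from the auto dataset):

```python
import numpy as np

# Hypothetical outcomes and predictions, for illustration only
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# A prediction is correct when it matches the outcome
correct = (y_true == y_pred)

# Accuracy = number of correct predictions / number of observations
accuracy = correct.sum() / len(y_true)
print(f'Accuracy: {accuracy:.2f}')
```

Six of the eight predictions match the outcomes, so the accuracy here is 0.75.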

In practice, the LogisticRegression class of sklearn has a [score method](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) that quickly performs the accuracy calculation (keep in mind that not every sklearn class has this method, and that for regressors it returns $R^2$ rather than the accuracy). Let's practice it using the auto dataset. We will use only the horsepower to create the first model of this lesson:



In [8]:
# Create subset of X
X_sub = X_train[['horsepower']]

In [9]:
# Instantiate a model
model = LogisticRegression()
# Fit the model using X
model.fit(X_sub,y_train)
# Check model's accuracy
score = model.score(X_sub,y_train)

print(f'The accuracy of the model on the training set was {score*100:.2f}%.')

The accuracy of the model on the training set was 86.61%.


The model correctly predicted 86.61% of the outcomes in the training set. Evaluating a model on the training set usually gives an optimistic picture of its performance, because those are the very observations the model was fitted to. The proper way to judge a model's performance is on the test set, which the model has never seen. Since every look at the test set can influence later modeling choices, one should evaluate a model on it only once, when we are confident about the model.
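
As a rough sketch of that workflow, the snippet below fits a model and scores it on both splits. It uses synthetic data from `make_classification` as a stand-in (an assumption made so the example runs on its own; it is not the auto dataset), and the names `X_demo`, `y_demo` and `demo_model` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the auto dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

demo_model = LogisticRegression()
demo_model.fit(X_tr, y_tr)

# Accuracy on the data the model was fitted to (usually optimistic)
train_score = demo_model.score(X_tr, y_tr)
# Accuracy on held-out data, computed once at the very end
test_score = demo_model.score(X_te, y_te)

print(f'Train accuracy: {train_score*100:.2f}%')
print(f'Test accuracy:  {test_score*100:.2f}%')
```

With the lesson's own variables, the equivalent final check would presumably be `model.score(X_test[['horsepower']], y_test)`.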

# Sensitivity

Sometimes it is more important to have a measurement of how many predictions, among all the positive outcomes, were correctly identified by a model. That's what sensitivity does. It is defined as:
$$
\text{sensitivity} = \frac{TP}{TP + FN},
$$
where $TP$ and $FN$ stand for true positives and false negatives, respectively. Both true positives and false negatives are positive outcomes. However, true positives are positive outcomes that were correctly identified by the model, while false negatives are positive outcomes that were mistakenly predicted as negative by the model.

The image below provides a way to visualize the sensitivity:

![sensitivity](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/sensitivity.png)



The word true means the labels of the predictions and outcomes match; both are either 0 or 1. The word false means the opposite, *i.e.*, the predictions and outcomes do not match. The words positive and negative are associated with the labels of the predictions made by the model.
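
The four combinations of true/false and positive/negative can be counted in one call with `confusion_matrix` from `sklearn.metrics`. Here is a minimal sketch on made-up labels (the arrays are hypothetical and only serve to show the naming):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical outcomes and predictions, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# For binary labels {0, 1}, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f'True positives:  {tp}')   # prediction 1, outcome 1
print(f'False negatives: {fn}')   # prediction 0, outcome 1
print(f'False positives: {fp}')   # prediction 1, outcome 0
print(f'True negatives:  {tn}')   # prediction 0, outcome 0
```

In this toy example there are 3 true positives, 1 false negative, 1 false positive and 3 true negatives.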


One can think of the sensitivity as a conditional probability statement. In other words, given that an outcome was positive, what is the probability of a model correctly identifying it?

It is important to be aware that sensitivity is sometimes called the **recall** in machine learning literature.

Let's calculate the sensitivity of the model we created using the auto dataset. We will use the predict method of the LogisticRegression class to retrieve the labels of the predictions:

In [10]:
# Find the number of true positives
tp = sum( (model.predict(X_sub) == 1) & (y_train ==1))
print(tp)
# Find the number of false negatives
fn = sum( (model.predict(X_sub) == 0) & (y_train == 1))
print(fn)
# Calculate sensitivity
sensitivity = tp / (tp+fn)

print(f'The sensitivity of the model is {sensitivity*100:.2f}%.')

20
14
The sensitivity of the model is 58.82%.
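
As a cross-check, `recall_score` from `sklearn.metrics` computes the same quantity. The sketch below rebuilds label arrays from the two counts printed above (20 true positives and 14 false negatives), which is enough because only positive outcomes enter the sensitivity:

```python
import numpy as np
from sklearn.metrics import recall_score

tp, fn = 20, 14  # counts printed above

# Only positive outcomes enter the sensitivity, so rebuild just those rows
y_true = np.ones(tp + fn, dtype=int)
y_pred = np.concatenate([np.ones(tp, dtype=int), np.zeros(fn, dtype=int)])

recall = recall_score(y_true, y_pred)
print(f'Recall from sklearn: {recall*100:.2f}%')
```

This prints 58.82%, matching the manual calculation. On the actual data one would presumably call `recall_score(y_train, model.predict(X_sub))` directly.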


Sensitivity answers how many of the positive outcomes were correctly predicted by a model/test. However, sometimes it is more useful to know if a prediction is likely to agree with the outcome. That leads us to the next metric which is related to sensitivity.

# Negative predictive value (NPV)

Imagine a common real-life scenario: 

- A patient who tested negative was told the test used had high sensitivity. Should she trust the prediction of the test and go home relieved?

She is not sure and decides to ask her doctor. He, as an average doctor, gives her a complex answer:

- A negative prediction in a high-sensitivity test is useful to **rule out** a **positive outcome**. 

According to the sentence above, she is very likely not to be sick and should go home relieved. That's what doctors usually do; they say a beautiful long sentence and send you home. But she is a bit stubborn and wants to understand why she should be relieved. Let's consider an example to help her. Suppose a test has 90.9% sensitivity. That means that, if the test was tried on 11 sick patients, it correctly identified 10 of them. It rarely missed sick patients. You can visualize it by focusing on the grey circle of the figure below:

![NPV](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/npv.png)

Now, this does not explain why she should be relieved after receiving a negative result from the high-sensitivity test. The reason why she can be relieved can be visualized by the orange circle of the figure above. It answers the following conditional probability question:

Given that this high-sensitivity **test prediction was negative**, what is the probability that the **patient is not sick**? In other words, what is $P(\text{outcome} = 0 \, | \, \text{prediction} = 0)$ for this test? 

As we can see from the figure, $P(\text{outcome} = 0 \, | \, \text{prediction} = 0) = 88.8 \%$, and if the patient got a negative result, she now has a good reason to be relieved (in practice one would like an even higher probability).

An intuitive way of thinking about it also comes from the drawing above. A high-sensitivity test leaves very few false negatives (upper left-hand corner of the figure). That makes a negative prediction very likely to be correct given that, from all negative predictions, very few of them are false negatives in a high-sensitivity test.

This conditional probability we just described is called **negative predictive value** or NPV. We can write it as:
$$
NPV = P(\text{outcome} = 0 \, | \, \text{prediction} = 0) = \frac{TN}{TN + FN}
$$

Note that it relies on the true negative outcomes, so it can only be computed on data whose outcomes are known (here, the training set). However, it answers a different question than the sensitivity metric: it estimates how likely it is that a patient is not sick, given that she tested negative.

A high-sensitivity model **does not rule in positive outcomes**, though. To rule in a positive result for a test, we should have a high-probability answer for the following question:

Given that a patient tested positive in a high-sensitivity test, what is the probability that he is sick?

By focusing on the blue circle of the image above, we can see that the answer to this question is $P(\text{outcome} = 1 \, | \, \text{prediction} = 1) = 58.8 \%$, which is not very high. That's because the definition of sensitivity does not consider false positives at all! However, they will impact the probability of a positive prediction agreeing with the outcome.
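
To make this concrete, the sketch below keeps the 10 true positives and 1 false negative from the example fixed and varies only the number of false positives. The false-positive counts are hypothetical; 7 was chosen because it reproduces the 58.8% quoted above, which is an assumption about the figure rather than something stated in the text:

```python
def sensitivity(tp, fn):
    # Share of positive outcomes that were predicted as positive
    return tp / (tp + fn)

def ppv(tp, fp):
    # Share of positive predictions that are positive outcomes
    return tp / (tp + fp)

tp, fn = 10, 1  # from the example: 10 of 11 sick patients identified

for fp in (0, 7, 30):  # hypothetical false-positive counts
    print(f'FP={fp:2d}  sensitivity={sensitivity(tp, fn):.3f}  PPV={ppv(tp, fp):.3f}')
```

The sensitivity stays at 0.909 in every row, while the probability that a positive prediction is correct falls from 1.000 to 0.588 to 0.250 as false positives accumulate.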

Let's calculate the negative predictive value for the model we have. Since the sensitivity for this model was not good, we do not know what to expect for the NPV. Let's check out its value:



In [11]:
# Compute true negatives
tn = sum((y_train == 0) & (model.predict(X_sub) == 0))
print(tn)
# Compute false negatives
fn = sum((y_train == 1) & (model.predict(X_sub) == 0))
print(fn)
# Calculate negative predictive value
npv = tn/(tn+fn)

print(f'The NPV of the model is {npv*100:.2f}%.')

90
14
The NPV of the model is 86.54%.


Even though the sensitivity was not high, the NPV was reasonably good: a negative prediction from this model is likely to correspond to a negative outcome.

# Specificity

In some cases, it is more important to check if negative outcomes are correctly predicted by a model. None of the metrics above provides that information. That is the role of specificity, which is defined by:
$$
\text{specificity} = \frac{TN}{TN + FP},
$$
where $TN$ and $FP$ stand for true negatives and false positives, respectively. Both true negatives and false positives are negative outcomes. However, true negatives are negative outcomes correctly predicted by the model, while false positives are negative outcomes mistakenly predicted as positive by the model.

The image below provides a way to visualize the specificity:

![specificity](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/specificity.png)

One can think of the specificity as a conditional probability statement. In other words, given that an outcome was negative, what is the probability of a model correctly identifying it?


Let's calculate the specificity of the model we created using the auto dataset. We will use the predict method of the LogisticRegression class to retrieve the labels of the predictions:

In [12]:
# Compute true negatives
tn = sum((y_train == 0) & (model.predict(X_sub) == 0))
print(tn)
# Compute false positives
fp = sum((y_train == 0) & (model.predict(X_sub) == 1))
print(fp)

# Calculate specificity
specificity = tn/(tn+fp)

print(f'The specificity of the model is {specificity*100:.2f}%.')

90
3
The specificity of the model is 96.77%.


The model has a high specificity (at least in the training set). That means, among all negative outcomes, most of them were correctly predicted by the model.
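
sklearn has no dedicated specificity or NPV function, but both can be obtained from `recall_score` and `precision_score` by treating 0 as the label of interest through the `pos_label` parameter. The sketch below rebuilds label arrays from the four training-set counts printed in the cells above (20 true positives, 14 false negatives, 90 true negatives, 3 false positives):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

tp, fn, tn, fp = 20, 14, 90, 3  # training-set counts printed in the cells above

# Rebuild label arrays that reproduce those four counts
y_true = np.array([1] * tp + [1] * fn + [0] * tn + [0] * fp)
y_pred = np.array([1] * tp + [0] * fn + [0] * tn + [1] * fp)

# Treating 0 as the "positive" label turns recall into specificity...
specificity = recall_score(y_true, y_pred, pos_label=0)
# ...and precision into the negative predictive value
npv = precision_score(y_true, y_pred, pos_label=0)

print(f'Specificity: {specificity*100:.2f}%')
print(f'NPV:         {npv*100:.2f}%')
```

This reproduces the 96.77% specificity and the 86.54% NPV obtained by hand.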

Specificity answers how many of the negative outcomes were correctly predicted by a model/test. However, sometimes it is more useful to know if a prediction is likely to agree with the outcome. We already saw that the NPV provides the answer for the negative predictions. For the positive predictions we are led to the next metric which is related to specificity.


# Positive predictive value (PPV)

Once again, imagine the common real-life scenario: 

- A patient who **tested positive** was told the test used had high specificity. Should she trust the prediction of the test and be worried?

She is not sure and decides to ask her doctor. He, as an average doctor, gives her a complex answer:

- A positive prediction in a high-specificity test is useful to **rule out** a **negative outcome**. 

According to the sentence above, she is very likely to be sick and should worry. That's what doctors usually do; they say a dreadful long sentence and send you home. But she is a bit stubborn and wants to understand why she should worry. Let's consider an example to help her accept this result. Suppose a test has 92.3% specificity. That means that, if the test was tried on 13 healthy patients, it correctly identified 12 of them. It rarely mislabeled healthy patients. You can visualize it by focusing on the purple circle of the figure below:

![PPV](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/ppv.png)


Now, this does not explain why the patient above should be worried once she got a positive result from the high-specificity test. The reason why she could start worrying can be visualized in the blue circle of the figure above. It answers the following conditional probability question:

Given that this high-specificity **test prediction was positive**, what is the probability that the **patient is sick**? In other words, what is $P(\text{outcome} = 1| \text{prediction} = 1)$ for this test? 

As we can see from the figure, $P(\text{outcome} = 1 \, | \, \text{prediction} = 1) = 92.3 \%$, and if the patient got a positive result, her worries would be justified (in practice one would like an even higher probability).

An intuitive way of thinking about it also comes from the drawing above. A high-specificity test leaves very few false positives (bottom right-hand corner of the figure). This is what makes a positive prediction very likely to agree with the outcome given that, from all positive predictions, very few of them are false positives in a high-specificity test.

This conditional probability we just described is called **positive predictive value** or PPV. We can write it as:
$$
PPV = P(\text{outcome} = 1 \, | \, \text{prediction} = 1) = \frac{TP}{TP + FP}
$$

Note that it relies on the true positive outcomes, so it can only be computed on data whose outcomes are known (here, the training set). However, it answers a different question than the specificity metric: it estimates how likely it is that a patient is sick, given that she tested positive.

It is important to be aware that the PPV is also called the **precision** in machine learning literature.


A high-specificity model/test **does not rule in negative outcomes**, though. To rule in a negative result for a test, we would have to have a high-probability answer for the following question:

Given that a patient tested negative in a high-specificity test, what is the probability that he is not sick?

By focusing on the dark red circle of the image above, we can see that the answer to this question is $P(\text{outcome} = 0 \, | \, \text{prediction} = 0) = 52.4 \%$, which is not very high. That's because the definition of specificity does not consider false negatives at all! However, they will impact the probability of a negative prediction agreeing with the outcome.

Let's calculate the positive predictive value for the model we have. Since the specificity for this model was very good, we expect a high PPV. Let's check whether that expectation holds:


In [13]:
# Compute the number of true positives
tp = sum((model.predict(X_sub) == 1) & (y_train == 1))
print(tp)
# Compute the number of false positives
fp = sum((model.predict(X_sub) == 1) & (y_train == 0))
print(fp)

ppv = tp/(tp+fp)

print(f'The positive predictive value of the model is {ppv*100:.2f}%.')

20
3
The positive predictive value of the model is 86.96%.


As expected, the PPV is high, meaning that a positive prediction from this model is likely to correspond to a positive outcome.

# F1-score

Sometimes, for example in medical classification problems, it is crucial to predict positive cases correctly. A model/test that is able to do that can save lives. Thus, an important question arises: How can we tell that a model/test is good at predicting positive cases correctly?

The answer to that question is not straightforward. There are essentially two sides to the story.

1 - We need to be able to check if the model/test is good at selecting sick patients. For that, we need to apply the test/model to patients who are known to be sick and check how many of them were correctly identified. That's the role of **recall** or **sensitivity**, as one can see in the image below:

![high_recall](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/high_recall_avg_precision.png)

As we can see, recall is a metric that informs if a test/model is very good at identifying sick patients once we **know** that the patients were sick. It provides information about the model/test, but no extra information about the patients (we already knew the patients' condition).

2 - We have to know if a patient who tested positive is indeed sick. For that, we apply the test/model on patients without knowing their condition and compare the predictions with the true outcomes. That's what the **precision** or **positive predictive value** does.

![high_precision](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/high_precision_avg_recall.png)


The images above show that a test/model with high sensitivity does not imply it will have high precision. Also, a model/test with high precision does not imply high sensitivity.
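
Two extreme, made-up tests illustrate the point. The population below is hypothetical (10 sick and 90 healthy patients); one test flags everyone as sick, the other flags a single patient who happens to be sick:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical population: 10 sick patients followed by 90 healthy ones
y_true = np.array([1] * 10 + [0] * 90)

# Test A flags everyone as sick
pred_all = np.ones(100, dtype=int)
# Test B flags a single patient, who happens to be sick
pred_one = np.zeros(100, dtype=int)
pred_one[0] = 1

for name, y_pred in [('flags everyone', pred_all), ('flags one patient', pred_one)]:
    r = recall_score(y_true, y_pred)
    p = precision_score(y_true, y_pred)
    print(f'{name:18s} recall={r:.2f}  precision={p:.2f}')
```

Test A reaches a recall of 1.00 with a precision of only 0.10, while test B reaches a precision of 1.00 with a recall of only 0.10.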

To be sure we are correctly identifying and communicating our findings to patients, we need a metric that will return a high number only when a test/model has high recall and high precision. This is the scenario shown in the image below:

![great_model](https://raw.githubusercontent.com/RenatodaCostaSantos/Machine-Learning---Lessons/main/Supervised%20ML/Logistic%20regression/images/high_precision_high_recall.png)

For that, we need a metric that takes into consideration both the precision and the recall. That's what the f1-score does. It is defined as:
$$
\text{f1-score} = \frac{2 \, p \, r}{p + r},
$$
where $p$ is the precision (also known as PPV) and $r$ is the recall (also known as sensitivity).

The f1-score has two important properties:

- It lies between the precision and recall values.

- It is never greater than the arithmetic mean of the precision and the recall. The numerical value is weighted towards the minimum of the $p$ and $r$ values, only equal to the arithmetic mean when $p=r$.
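
Both properties follow from a short calculation, sketched here under the assumption that $p + r > 0$. The gap between the arithmetic mean and the f1-score is
$$
\frac{p + r}{2} - \frac{2 \, p \, r}{p + r} = \frac{(p + r)^2 - 4 \, p \, r}{2 (p + r)} = \frac{(p - r)^2}{2 (p + r)} \geq 0,
$$
which is zero only when $p = r$. Likewise, taking $p \leq r$ without loss of generality,
$$
\frac{2 \, p \, r}{p + r} - p = \frac{p \, (r - p)}{p + r} \geq 0, \qquad r - \frac{2 \, p \, r}{p + r} = \frac{r \, (r - p)}{p + r} \geq 0,
$$
so the f1-score is never below the smaller of the two values nor above the larger one.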

The f1-score is a better metric for judging whether positive outcomes are classified reliably. It returns a value between the precision and the recall, leaning towards the worse of those values. If the f1-score is high, we can say with more confidence, in one sentence, that the model/test is good at selecting sick patients **and** that, once a patient has tested positive, it is very likely he/she is, in fact, sick.
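
As an illustration, the f1-score of the horsepower model can be sketched from the training-set counts printed earlier in the lesson (20 true positives, 14 false negatives, 3 false positives). The variable names below are chosen for this example only:

```python
tp, fn, fp = 20, 14, 3  # training-set counts printed earlier in the lesson

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Arithmetic mean, for comparison
mean = (precision + recall) / 2

print(f'precision={precision:.4f}  recall={recall:.4f}')
print(f'f1-score={f1:.4f}  arithmetic mean={mean:.4f}')
```

The f1-score comes out at about 0.7018: it sits between the recall (0.5882) and the precision (0.8696), and below their arithmetic mean (0.7289), pulled towards the weaker recall. `sklearn.metrics.f1_score` applied to the outcomes and the predicted labels should return the same number.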

# Summary

In this lesson we learned about the following metrics:

- Accuracy

- Sensitivity

- Specificity

- Positive predictive value

- Negative predictive value

- F1-score


We also presented examples and visualizations to clarify the concepts above, created a logistic regression model, and used it to calculate the values of five of those metrics. 

