<a href="https://colab.research.google.com/github/RenatodaCostaSantos/Machine-Learning---Lessons/blob/main/Supervised%20ML/Logistic%20regression/Interpreting_logistic_regression_parameters_lesson_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Logistic regression is a classification model that can predict a binary or multiclass outcome. In this lesson we focus on binary classification. Problems with a binary answer are common in practice: predicting whether a person is sick or healthy, for example, is a typical application of logistic regression.

In this lesson, we will learn how to interpret the coefficients of a logistic regression model.

We will use the same auto dataset from the previous lesson to illustrate the concepts with a real-world problem.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Read dataset
auto = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/automobiles.csv')

In [3]:
auto.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
1,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
2,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
3,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
4,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430


We will follow the same steps as in the previous lesson and create a new binary column that is 1 for cars priced above $15,000 and 0 otherwise.

In [4]:
# Create high_price column
auto['high_price'] = 0
auto.loc[auto['price'] > 15000, 'high_price'] = 1

In [5]:
# Check new column values
auto['high_price'].value_counts()

0    119
1     40
Name: high_price, dtype: int64

In [6]:
# Create features and target variables
X = auto.drop(['high_price','price'], axis = 1)
y = auto['high_price']

In [7]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 712)

# The logistic regression object

To build a model we will use the sklearn logistic regression [class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). It accepts many parameters, as the documentation shows. In this lesson we will use it to build a logistic regression model and explore some of its attributes and methods.

We start by importing the LogisticRegression class from sklearn and build a model right away:

In [8]:
from sklearn.linear_model import LogisticRegression
# Instantiate a model
model = LogisticRegression()

There you go. We have created a model that we can train and test on the data. The model is an object of the LogisticRegression class. It saves us all the time we spent in the previous lesson, where we built a model from scratch. The object we now have in hand is much more powerful and efficient than the ones we created in the previous lesson, but we have an idea of how it works under the hood. Keep in mind that, by default, this class applies L2 regularization (C=1.0), so the fitted coefficients are shrunk toward zero compared with an unregularized fit.

# Fitting the model

We will follow similar steps as in the previous lesson and start by using only the 'horsepower' feature to predict the 'high_price' target:


In [9]:
# Define a subset of the features
X_sub = X_train[['horsepower']]
# Fit the model
model.fit(X_sub,y_train)

LogisticRegression()

Done! We have a model that was trained with the 'horsepower' feature. As simple as that!

Our goal is to interpret the coefficients of the model we just created. For that, it is useful to recall the equation for the logistic regression model that we derived in the previous lesson:
$$
\log\left(\frac{EY}{1-EY}\right) = \alpha + \beta X.
$$

In the linear regression model, the coefficients are interpreted as the intercept and the slope of a line. In logistic regression, however, the predictors $X$ are not linearly related to the binary outcome $y$ itself, but to the log-odds of the success probability $EY$ (see the previous lesson notes if this sentence confuses you).
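To make the idea of log-odds concrete, here is a minimal sketch in plain Python. The helper names `log_odds` and `inverse_log_odds` are illustrative and not part of sklearn; they simply map a probability to its log-odds and back.

```python
import math

def log_odds(p):
    """Map a probability in (0, 1) to its log-odds, log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inverse_log_odds(z):
    """Map a log-odds value back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Probabilities live in (0, 1), but log-odds can be any real number,
# which is why a linear expression alpha + beta * X can model them.
for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    z = log_odds(p)
    print(f"p = {p:.2f} -> log-odds = {z:+.3f} -> back to p = {inverse_log_odds(z):.2f}")
```

Note that a probability of 0.5 corresponds to log-odds of zero, and that probabilities p and 1 - p give log-odds of equal size and opposite sign.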

To interpret the parameters, we can use an approach similar to the one used in linear regression. First, when **the predictor $X$** is zero, we obtain:
$$
\log\left(\frac{EY}{1-EY}\right) = \alpha.
$$
This tells us that the intercept $\alpha$ is the log-odds of success when $X = 0$.

Log-odds are hard to interpret, but the expression above provides an important step to start clarifying the meaning of the coefficients.

In practice, one can easily obtain the intercept $\alpha$ of a fitted sklearn logistic regression through the intercept_ attribute:

In [10]:
# Obtain the logistic regression intercept parameter
alpha = model.intercept_
print(model.intercept_)

[-8.97855833]


# Odds interpretation

If we apply the exponential on both sides of the log-odds equation for the intercept, we obtain:
 $$
Odds = \frac{EY}{1-EY} = e^\alpha.
$$
Odds are easier to interpret than log-odds. They give the probability of a positive result divided by the probability of a negative one. If the odds are greater than 1, the event is more likely to happen than not; if they are less than 1, the opposite holds.

Odds are especially useful when events happen many times, because they tell us directly what to expect in the long run: odds of 1 to 3, for example, mean that we expect about one positive outcome for every three negative ones.
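As a small illustration, the class counts printed earlier (40 high-price cars and 119 low-price cars) can be turned into odds and a probability. This is only a sketch in plain Python, and the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_high = 40   # cars with high_price == 1
n_low = 119   # cars with high_price == 0

# Odds compare positives to negatives; probability compares positives to the total
odds_high = n_high / n_low
prob_high = n_high / (n_high + n_low)

print(f"Odds of a high price: {odds_high:.3f}")
print(f"Probability of a high price: {prob_high:.3f}")

# The two quantities carry the same information and convert into each other
assert abs(odds_high - prob_high / (1 - prob_high)) < 1e-12
assert abs(prob_high - odds_high / (1 + odds_high)) < 1e-12
```

The odds are below 1 here, which matches the fact that there is roughly one high-price car for every three low-price cars in the dataset.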

Let's apply what we learned to the auto dataset by computing the odds of a car having a price higher than $15,000 when its horsepower is zero:


In [11]:
# Calculate the odds of a price above $15,000 when horsepower is zero
odds = np.exp(alpha)
print(odds)

[0.00012608]


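As we can see, the odds of a car with zero horsepower having a high price are very close to zero, meaning that it is very unlikely. No real car has zero horsepower, so the intercept is an extrapolation of the model rather than a description of an actual car.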
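For reference, the odds can be converted into a success probability with EY = Odds / (1 + Odds). The sketch below uses only the intercept value printed above; the variable names are illustrative.

```python
import math

alpha = -8.97855833  # intercept printed by model.intercept_ above

odds_at_zero = math.exp(alpha)                     # odds when horsepower = 0
prob_at_zero = odds_at_zero / (1 + odds_at_zero)   # matching success probability

print(f"odds: {odds_at_zero:.8f}")
print(f"probability: {prob_at_zero:.8f}")
```

When the odds are this small, the probability is almost identical to the odds, because the denominator 1 + Odds is very close to 1.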

# Interpreting the slope

To interpret the slope $\beta$, we subtract the value of the logistic model when the predictor $X = 0$:
$$
\log\left(\frac{EY(0)}{1-EY(0)}\right) = \alpha,
$$
from the value when $X=1$:
$$
\log\left(\frac{EY(1)}{1-EY(1)}\right) = \alpha + \beta.
$$
The subtraction leads to:
$$
\log\left(\frac{Odds(1)}{Odds(0)}\right) = \beta,
$$
where the argument of the log function above is called the **odds ratio**. It tells us by what factor the odds change when we increase the predictor $X$ by one unit. We can rewrite the equation above as:
$$
\frac{Odds(1)}{Odds(0)} = e^{\beta}.
$$
If the odds ratio is greater than one, the odds grow by a factor of $e^{\beta}$ for each unit increase in $X$, so the probability of success $EY$ increases non-linearly with the predictor $X$ (the odds depend on $EY$, and the odds ratio equals the exponential of $\beta$ when $X$ goes from 0 to 1).
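The same argument works for any starting value $x$, not only for $X = 0$. As a short worked step, assuming the single-predictor model above:
$$
\log\left(Odds(x+1)\right) - \log\left(Odds(x)\right) = \left[\alpha + \beta (x+1)\right] - \left[\alpha + \beta x\right] = \beta
\quad\Longrightarrow\quad
\frac{Odds(x+1)}{Odds(x)} = e^{\beta}.
$$
So every one-unit increase in $X$ multiplies the odds by the same factor $e^{\beta}$, regardless of where we start.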

Let's use the auto dataset to illustrate this point in practice. We use the coef_ attribute of the fitted logistic regression model to obtain the slope $\beta$:

In [14]:
# Retrieving the slope from the logistic regression model
beta = model.coef_

In [15]:
# Compute the odds ratio
odds_ratio = np.exp(beta)
print(odds_ratio)

[[1.07991717]]


The odds ratio for the model is greater than one. Each additional unit of horsepower multiplies the odds of a high price by about 1.08, that is, an increase of roughly 8%. In other words, cars with more powerful engines are more likely to have a higher price.
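The sketch below checks this numerically with the values printed above. It assumes the single-predictor model, recovers $\beta$ from the printed odds ratio, and uses a few illustrative horsepower values; the helper functions are not part of sklearn.

```python
import math

alpha = -8.97855833           # intercept printed by the model above
beta = math.log(1.07991717)   # slope recovered from the printed odds ratio

def odds(horsepower):
    """Odds of a high price for a given horsepower."""
    return math.exp(alpha + beta * horsepower)

def probability(horsepower):
    """Success probability that corresponds to those odds."""
    o = odds(horsepower)
    return o / (1 + o)

for hp in [60, 120, 180]:  # illustrative horsepower values
    ratio = odds(hp + 1) / odds(hp)
    delta_p = probability(hp + 1) - probability(hp)
    print(f"hp {hp} -> {hp + 1}: odds ratio = {ratio:.4f}, change in probability = {delta_p:.5f}")
```

The odds ratio is the same at every horsepower value, while the change in probability depends on where we start. This is why the slope is interpreted through the odds rather than through the probability.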

# Multiple predictors

We have used only one feature to build the model above. Let's generalize it by including more features.

The interpretation of the coefficients changes slightly in this case. Each slope is now interpreted as the change in the log-odds (the log of the odds ratio) when we increase one of the predictors $X$ by one unit while keeping the other predictors constant (the same idea as taking a partial derivative).

Let's build a logistic regression model with multiple predictors and interpret it:

In [17]:
# Instantiate a logistic model
model2 = LogisticRegression()

In [18]:
# Create a subset of the features set
X_subset = X_train[['horsepower','highway_mpg']]

In [28]:
# Fit the model with the features above
model2.fit(X_subset,y_train)
# Get parameters
oddsRatio_horsepower = np.exp(model2.coef_[0,0])
oddsRatio_highway = np.exp(model2.coef_[0,1])

print(f'The odds ratio for the horsepower variable is {oddsRatio_horsepower:.2f}.')
print(f'The odds ratio for the highway_mpg variable is {oddsRatio_highway:.2f}.')

The odds ratio for the horsepower variable is 1.04.
The odds ratio for the highway_mpg variable is 0.73.


The odds ratio is greater than one for the horsepower variable when highway_mpg is held constant, and less than one for highway_mpg when horsepower is held constant. In other words:

1- Cars with the same highway mileage but more powerful engines are more likely to be expensive and,

2- Cars with the same horsepower but higher highway mileage (that is, lower fuel consumption) are less likely to be expensive.
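The "keep the other predictor constant" reading can be sketched as follows. The slopes are recovered from the rounded odds ratios printed above; the intercept of this model was not printed, so the value used here is purely hypothetical, as are the horsepower and highway_mpg values.

```python
import math

# Slopes recovered from the rounded odds ratios printed above
beta_hp = math.log(1.04)    # horsepower
beta_mpg = math.log(0.73)   # highway_mpg
alpha = -5.0                # hypothetical intercept, for illustration only

def odds(horsepower, highway_mpg):
    """Odds of a high price under the two-predictor model."""
    return math.exp(alpha + beta_hp * horsepower + beta_mpg * highway_mpg)

# Increase one predictor by one unit while holding the other fixed
base = odds(100, 30)
print(f"+1 horsepower, same highway_mpg: odds x {odds(101, 30) / base:.2f}")
print(f"+1 highway_mpg, same horsepower: odds x {odds(100, 31) / base:.2f}")
```

The intercept cancels in each ratio, so the multiplicative factors do not depend on the hypothetical value chosen for it.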

# Success probability

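For interpreting the coefficients, the odds and the odds ratio are more convenient than the success probability itself: a one-unit change in $X$ multiplies the odds by a constant factor, whereas its effect on the probability depends on the starting value of $X$. That's why we sidestepped the connection between the predictors $X$ and the success probability $EY$. However, one can still compute $EY$ from the mathematical expressions above. Solving the log-odds equation for $EY$ gives:
$$
EY = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}.
$$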

The coefficients do not have a simple interpretation in terms of the success probability $EY$. However, we can still compute $EY$ for any given value of the predictor $X$.
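As a minimal sketch of that calculation, the function below evaluates the expression for $EY$ with the intercept and odds ratio printed for the single-predictor model. The function name and the horsepower values are illustrative.

```python
import math

alpha = -8.97855833           # intercept printed by the single-predictor model
beta = math.log(1.07991717)   # slope recovered from the printed odds ratio

def success_probability(x):
    """EY = exp(alpha + beta * x) / (1 + exp(alpha + beta * x))."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

for hp in [50, 100, 150, 200]:  # illustrative horsepower values
    print(f"horsepower = {hp}: EY = {success_probability(hp):.3f}")

# The probability equals 0.5 where the log-odds alpha + beta * x are zero
hp_half = -alpha / beta
print(f"EY = 0.5 at horsepower of about {hp_half:.1f}")
```

The probability rises along an S-shaped curve and crosses 0.5 exactly where the odds equal 1.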

Scikit-learn provides a quicker way to perform the calculation above through the [predict_proba](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba) method.

It is important to mention that an individual outcome can still be negative even if its predicted probability of being positive is high. This is another reason why the odds and odds ratio are preferred: they indicate how many positive and negative outcomes one would expect in a dataset.

To illustrate how predict_proba works, let's compute the predicted success probabilities for all the data points of the subset we defined above:


In [29]:
# Array containing all predicted probabilities for the outcomes of X_subset
model2.predict_proba(X_subset)

array([[1.64812456e-01, 8.35187544e-01],
       [9.94480393e-01, 5.51960662e-03],
       [9.94480393e-01, 5.51960662e-03],
       [9.16053075e-01, 8.39469252e-02],
       [9.96792251e-01, 3.20774860e-03],
       [7.98588049e-01, 2.01411951e-01],
       [4.25664094e-01, 5.74335906e-01],
       [9.53188185e-01, 4.68118147e-02],
       [9.43713718e-01, 5.62862825e-02],
       [9.95623324e-01, 4.37667600e-03],
       [2.69324022e-01, 7.30675978e-01],
       [9.99754222e-01, 2.45777611e-04],
       [8.04750559e-02, 9.19524944e-01],
       [3.79219443e-01, 6.20780557e-01],
       [7.98588049e-01, 2.01411951e-01],
       [1.06784383e-01, 8.93215617e-01],
       [5.71353216e-01, 4.28646784e-01],
       [9.79540125e-01, 2.04598753e-02],
       [9.99974398e-01, 2.56018410e-05],
       [9.98740349e-01, 1.25965136e-03],
       [9.99637902e-01, 3.62097743e-04],
       [9.98740349e-01, 1.25965136e-03],
       [6.08930337e-01, 3.91069663e-01],
       [7.51094892e-01, 2.48905108e-01],
       [7.181450

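Each row of the array holds two probabilities for one car: the first column is the probability of class 0 (price of $15,000 or less) and the second column is the probability of class 1 (high price). As we can see, it is much harder to quickly obtain useful information from the array above.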
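To connect predict_proba with the formula for $EY$, the sketch below fits a model on a small synthetic dataset (not the auto data) and compares the second column of predict_proba with a manual calculation from intercept_ and coef_. The names `X_demo`, `y_demo`, and `demo_model` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset (illustrative only, not the auto data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Manual calculation: EY = 1 / (1 + exp(-(alpha + beta . x)))
log_odds = demo_model.intercept_ + X_demo @ demo_model.coef_.ravel()
manual_proba = 1 / (1 + np.exp(-log_odds))

proba = demo_model.predict_proba(X_demo)
print(demo_model.classes_)                     # column order of predict_proba
print(np.allclose(proba[:, 1], manual_proba))  # compare second column with the formula
print(np.allclose(proba.sum(axis=1), 1.0))     # check that each row sums to one
```

The columns of predict_proba follow the order of the classes_ attribute, so for a 0/1 target the second column is the success probability $EY$.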

# Summary

In this lesson we learned:

- How to isolate and interpret the coefficients of a logistic regression model in terms of the odds and odds ratio.

- How to build a logistic regression model using sklearn.

- How to retrieve the coefficients from the model.

- How to use predict_proba to calculate the success probability of every data point in a dataset.

- Why odds and odds ratios are preferred over the success probability when interpreting the coefficients of a logistic regression model.