# Logistic Regression Refresher

Based on [this resource](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/)

### Import Data and Libraries

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def probability(logodds):
    '''
    Convert log odds from a logistic regression into a probability
    using the inverse logit.
    '''
    return np.exp(logodds)/(1 + np.exp(logodds))

def extract_odds_ratio(params):
    '''
    Print the exponentiated coefficients (odds ratios), rounded to 2 decimals.
    '''
    print(round(np.exp(params), 2))

In [2]:
data = pd.read_csv("data/sample.csv")
data = sm.add_constant(data)
data.head()

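| | const | female | read | write | math | hon | femalexmath |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0 | 57 | 52 | 41 | 0 | 0 |
| 1 | 1.0 | 1 | 68 | 59 | 53 | 0 | 53 |
| 2 | 1.0 | 0 | 44 | 33 | 54 | 0 | 0 |
| 3 | 1.0 | 0 | 63 | 44 | 47 | 0 | 0 |
| 4 | 1.0 | 0 | 47 | 52 | 57 | 0 | 0 |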


### Logistic regression with no predictor variables

Let’s start with the simplest logistic regression, a model without any predictor variables.  In equation form, we are modeling $logit(p)=\beta_0$.

In [3]:
lr = sm.Logit(data['hon'], data['const']).fit()
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.556775
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                    hon   No. Observations:                  200
Model:                          Logit   Df Residuals:                      199
Method:                           MLE   Df Model:                            0
Date:                Fri, 03 Feb 2023   Pseudo R-squ.:               8.068e-11
Time:                        17:52:55   Log-Likelihood:                -111.36
converged:                       True   LL-Null:                       -111.36
Covariance Type:            nonrobust   LLR p-value:                       nan
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.1255      0.164     -6.845      0.000      -1.448      -0.803


In [4]:
# # Formula notation, identical results
# lr = smf.logit('hon ~ 1', data=data).fit()
# lr.summary()

This means $log(p/(1-p)) = -1.12546$.  What is $p$ here?  It turns out that $p$ is the overall probability of being in honors class (hon = 1).  Let’s take a look at the frequency table for hon.

In [5]:
data.hon.value_counts()

0    151
1     49
Name: hon, dtype: int64

So $p = 49 / (151 + 49) =  .245$. The odds are $.245/(1-.245) = .3245$ and the log of the odds (logit) is $log(.3245) = -1.12546$.  In other words, the intercept from the model with no predictor variables is the estimated log odds of being in honors class for the whole population of interest.  We can also transform the log of the odds back to a probability: $p = exp(-1.12546)/(1+exp(-1.12546)) = .245$, if we like.
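The arithmetic above can be reproduced directly from the two counts in the frequency table. This is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Counts taken from the frequency table for hon above
n_not_hon, n_hon = 151, 49

p = n_hon / (n_hon + n_not_hon)   # overall probability of hon = 1
odds = p / (1 - p)                # odds of being in the honors class
log_odds = np.log(odds)           # should match the fitted intercept

# Inverse logit: map the log odds back to a probability
p_back = np.exp(log_odds) / (1 + np.exp(log_odds))

print(round(p, 3), round(odds, 4), round(float(log_odds), 4), round(float(p_back), 3))
```

The last step is the same transformation as the `probability` helper defined in the first cell.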

### Logistic regression with a single dichotomous predictor variable

Now let’s go one step further by adding a binary predictor variable, female, to the model.  Writing it in an equation, the model describes the following linear relationship: $logit(p) = \beta_0 + \beta_1 * \text{female}$

In [6]:
lr = sm.Logit(data['hon'], data[['female','const']]).fit()
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.549016
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                    hon   No. Observations:                  200
Model:                          Logit   Df Residuals:                      198
Method:                           MLE   Df Model:                            1
Date:                Fri, 03 Feb 2023   Pseudo R-squ.:                 0.01394
Time:                        17:52:55   Log-Likelihood:                -109.80
converged:                       True   LL-Null:                       -111.36
Covariance Type:            nonrobust   LLR p-value:                   0.07811
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
female         0.5928      0.341      1.736      0.083      -0.076       1.262
const         -1.4709      0.

Before trying to interpret the two parameters estimated above, let’s take a look at the crosstab of the variable hon with female.

In [7]:
pd.crosstab(data.hon, data.female)

| hon | female = 0 | female = 1 |
|---|---|---|
| 0 | 74 | 77 |
| 1 | 17 | 32 |


In our dataset, what are the odds of a male being in the honors class and what are the odds of a female being in the honors class?  We can calculate these odds by hand from the table: for males, the odds of being in the honors class are $(17/91)/(74/91) = 17/74 = .23$; for females, they are $(32/109)/(77/109) = 32/77 = .42$.  The ratio of the odds for females to the odds for males is $(32/77)/(17/74) = (32*74)/(77*17) = 1.809$.  So the odds for males are 17 to 74, the odds for females are 32 to 77, and the odds for females are about 81% higher than the odds for males.

Now we can relate the odds for males and females to the output from the logistic regression.  The intercept of -1.471 is the log odds for males, since male is the reference group (`female = 0`).  Using the odds we calculated above for males, we can confirm this: $log(.23) = -1.47$.  The coefficient for female is the log of the odds ratio between the female group and the male group: $log(1.809) = .593$.  So we can get the odds ratio by exponentiating the coefficient for female. Most statistical packages display both the raw regression coefficients and the exponentiated coefficients for logistic regression models. In Python, it takes a small manual step:


In [8]:
extract_odds_ratio(lr.params)

female    1.81
const     0.23
dtype: float64
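As a cross-check, the same odds ratio and log odds can be recovered from the four crosstab counts without fitting a model. This is a sketch with illustrative variable names; the counts are the ones shown in the crosstab above.

```python
import numpy as np

# Cell counts from the hon-by-female crosstab above
male_hon, male_not = 17, 74
female_hon, female_not = 32, 77

odds_male = male_hon / male_not
odds_female = female_hon / female_not
odds_ratio = odds_female / odds_male

# The logs of these quantities should match the fitted coefficients
log_odds_male = float(np.log(odds_male))     # compare with const
log_odds_ratio = float(np.log(odds_ratio))   # compare with female

print(round(odds_male, 2), round(odds_female, 2), round(odds_ratio, 3))
print(round(log_odds_male, 4), round(log_odds_ratio, 4))
```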


### Logistic regression with a single continuous predictor variable

Another simple example is a model with a single continuous predictor variable such as the model below.  It describes the relationship between students’ math scores and the log odds of being in an honors class, like this: $logit(p) = \beta_0 + \beta_1 * \text{math}$.


In [9]:
lr = sm.Logit(data['hon'], data[['math','const']]).fit()
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.417683
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                    hon   No. Observations:                  200
Model:                          Logit   Df Residuals:                      198
Method:                           MLE   Df Model:                            1
Date:                Fri, 03 Feb 2023   Pseudo R-squ.:                  0.2498
Time:                        17:52:55   Log-Likelihood:                -83.537
converged:                       True   LL-Null:                       -111.36
Covariance Type:            nonrobust   LLR p-value:                 8.718e-14
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
math           0.1563      0.026      6.105      0.000       0.106       0.207
const         -9.7939      1.

In this case, the estimated coefficient for the intercept is the log odds of a student with a math score of zero being in an honors class.  In other words, the odds of being in an honors class when the math score is zero are exp(-9.793942) = .00005579.

These odds are very low, but if we look at the distribution of the variable **math**, we will see that no one in the sample has a math score lower than 30.  In fact, all the test scores in the data set were standardized to a mean of 50 and a standard deviation of 10.  So the intercept in this model corresponds to the log odds of being in an honors class when math is at the hypothetical value of zero.
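To see how the intercept plays out at more realistic scores, the sketch below plugs the reported estimates into the fitted equation. The scores 40 through 70 are illustrative choices near the stated mean of 50, not values taken from the data.

```python
import numpy as np

# Estimates reported in the text (intercept and math coefficient)
b0, b1 = -9.793942, 0.1563404

# Odds when math = 0, a hypothetical value far below any observed score
odds_at_zero = float(np.exp(b0))

# Predicted probability at a few illustrative scores (assumed, not from the data)
math_scores = np.array([40, 50, 60, 70])
log_odds = b0 + b1 * math_scores
prob = np.exp(log_odds) / (1 + np.exp(log_odds))

print(f"{odds_at_zero:.8f}")
for score, p in zip(math_scores.tolist(), prob.tolist()):
    print(score, round(p, 3))
```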

How do we interpret the coefficient for math?  The coefficient and intercept estimates give us the following equation:

$log(p / (1 - p)) = logit(p) = -9.7939 + 0.1563 * \text{math}$

Let’s fix **math** at some value. We will use 54.  Then the conditional logit of being in an honors class when the **math** score is held at 54 is

$log(p / (1 - p))(\text{math} = 54) = -9.7939 + 0.1563 * 54$

We can examine the effect of a one-unit increase in math score.  When the math score is held at 55, the conditional logit of being in an honors class is

$log(p / (1 - p))(\text{math} = 55) = -9.7939 + 0.1563 * 55$

Taking the difference of the two equations, we have the following:

$log(p / (1 - p))(\text{math} = 55) - log(p / (1 - p))(\text{math} = 54) = .1563$

Or, computed directly:

In [10]:
round((-9.7939 + 0.1563 * 55) - (-9.7939 + 0.1563 * 54), 4)

0.1563

We can say now that the coefficient for **math** is the difference in the log odds.  In other words, for a one-unit increase in the math score, the expected change in log odds is .1563404.

Can we translate this change in log odds into a change in odds? Indeed, we can.  Recall that the logarithm converts multiplication and division into addition and subtraction. Its inverse, exponentiation, converts addition and subtraction back into multiplication and division.  Exponentiating both sides of our last equation therefore turns the difference in log odds into a ratio of odds, $odds(\text{math} = 55)/odds(\text{math} = 54) = exp(.1563) = 1.17$, which is the exponentiated coefficient for math:

In [11]:
extract_odds_ratio(lr.params)

math     1.17
const    0.00
dtype: float64


So we can say that for a one-unit increase in math score, we expect to see about a 17% increase in the odds of being in an honors class.  This 17% increase does not depend on the value at which math is held.
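The claim that the odds ratio does not depend on the starting score can be checked numerically. The sketch below uses the rounded estimates from the summary; the helper `odds` and the starting scores are illustrative assumptions.

```python
import numpy as np

b0, b1 = -9.7939, 0.1563  # rounded estimates from the summary above

def odds(math):
    """Odds of being in the honors class at a given math score."""
    return float(np.exp(b0 + b1 * math))

# Ratio of odds for a one-unit increase, at several illustrative starting scores
ratios = [odds(m + 1) / odds(m) for m in (40, 54, 65)]

print([round(r, 4) for r in ratios])   # every ratio equals exp(b1)
print(round(float(np.exp(b1)), 4))
```

The odds ratio is constant, although the corresponding change in probability is not, because probability is a nonlinear function of the log odds.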

### Logistic regression with multiple predictor variables and no interaction terms

In general, we can have multiple predictor variables in a logistic regression model: 

$logit(p) = \beta_0 + \beta_1 * x_1 + ... + \beta_k * x_k$

Applying such a model to our example dataset, each estimated coefficient is the expected change in the log odds of being in an honors class for a one-unit increase in the corresponding predictor variable, holding the other predictor variables constant.  Each exponentiated coefficient is the ratio of two odds, or the change in odds on the multiplicative scale for a one-unit increase in the corresponding predictor variable, holding the other variables constant.  Here is an example:

$logit(p) = \beta_0 + \beta_1 * \text{math} + \beta_2 * \text{female} + \beta_3 * \text{read}$

In [12]:
lr = sm.Logit(data['hon'], data[['math', 'female', 'read','const']]).fit()
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.390424
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                    hon   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Fri, 03 Feb 2023   Pseudo R-squ.:                  0.2988
Time:                        17:52:55   Log-Likelihood:                -78.085
converged:                       True   LL-Null:                       -111.36
Covariance Type:            nonrobust   LLR p-value:                 2.348e-14
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
math           0.1230      0.031      3.931      0.000       0.062       0.184
female         0.9799      0.

In [13]:
extract_odds_ratio(lr.params)

math      1.13
female    2.66
read      1.06
const     0.00
dtype: float64


This fitted model says that, holding **math** and **read** at fixed values, the odds of getting into an honors class for **females** (female = 1) over the odds of getting into an honors class for **males** (female = 0) is `exp(.9799) = 2.66`.  In terms of percent change, we can say that the odds for females are 166% higher than the odds for males.  The coefficient for math says that, holding female and read at fixed values, we will see a 13% increase in the odds of getting into an honors class for a one-unit increase in math score, since `exp(.1229589) = 1.13`.
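The percent-change reading follows from the odds ratio as `(exp(coef) - 1) * 100`. Here is a small sketch using the two coefficients quoted above; the dictionary names are illustrative.

```python
import numpy as np

# Coefficients quoted in the paragraph above
coefs = {"female": 0.9799, "math": 0.1229589}

# Odds ratio and percent change in odds for a one-unit increase
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}
pct_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(f"{name}: odds ratio {odds_ratios[name]:.2f}, "
          f"{pct_change[name]:.0f}% change in odds")
```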

### Logistic regression with an interaction term of two predictor variables

In all the previous examples, we have said that the regression coefficient of a variable corresponds to the change in log odds and its exponentiated form corresponds to the odds ratio.  This is only true when our model does not have any interaction terms.  When a model has interaction term(s) of two predictor variables, it attempts to describe how the effect of a predictor variable depends on the level/value of another predictor variable.  The interpretation of the regression coefficients becomes more involved.

Let’s take a simple example.

$$logit(p) = log(p/(1-p))= \beta_0 + \beta_1 * \text{female} + \beta_2 * \text{math} + \beta_3 * \text{female} * \text{math}$$

In [14]:
# Create Interaction
data['femalexmath'] = data['female'] * data['math']

lr = sm.Logit(data['hon'], data[['const', 'female', 'math', 'femalexmath']]).fit()
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.399417
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                    hon   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Fri, 03 Feb 2023   Pseudo R-squ.:                  0.2826
Time:                        17:52:55   Log-Likelihood:                -79.883
converged:                       True   LL-Null:                       -111.36
Covariance Type:            nonrobust   LLR p-value:                 1.381e-13
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const          -8.7458      2.129     -4.108      0.000     -12.919      -4.573
female         -2.8999    

In [15]:
# # Formula notation, identical results. No need to create interaction var first
# lr = smf.logit(formula='hon ~ female + math + female:math', data=data).fit()
# lr.summary()

In the presence of the interaction term of female by math, we can no longer talk about the effect of `female` holding all other variables at a certain value, since it does not make sense to fix `math` and `female:math` at a certain value and still allow `female` to change from 0 to 1!

In this simple example, where we examine the interaction of a binary variable and a continuous variable, we can think of the model as **two equations**: one for males and one for females.  For males `(female = 0)`, the equation is simply

$$logit(p) = log(p/(1-p))= \beta_0 + \beta_2 * \text{math}$$

For females, the equation is

$$logit(p) = log(p/(1-p))= (\beta_0 + \beta_1) + (\beta_2 + \beta_3 ) * \text{math}$$
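The two equations can be written as two small functions to show that the gap between females and males changes with the math score. This is a sketch only: `b0` and `b1` are the values visible in the summary output, while `b2` and `b3` are the rounded values (.13 and .067) used in the surrounding text, so the printed numbers are approximate. The function names and the scores are illustrative.

```python
# Approximate estimates: b0 (const) and b1 (female) from the summary output,
# b2 (math) and b3 (female:math) are rounded values from the surrounding text.
b0, b1, b2, b3 = -8.7458, -2.8999, 0.13, 0.067

def logit_male(math):
    """Log odds of hon = 1 when female = 0."""
    return b0 + b2 * math

def logit_female(math):
    """Log odds of hon = 1 when female = 1."""
    return (b0 + b1) + (b2 + b3) * math

# The female-minus-male gap in log odds equals b1 + b3 * math,
# so it depends on the math score (scores here are illustrative).
for math in (40, 50, 60):
    gap = logit_female(math) - logit_male(math)
    print(math, round(gap, 3))
```

This is why a single "effect of female" cannot be read off the `female` coefficient alone once the interaction term is in the model.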

Now we can map the logistic regression output to these two equations. So we can say that the **coefficient for math is the effect of math when female = 0**.  More explicitly, we can say that for male students, a one-unit increase in math score yields a change in log odds of 0.13.  

On the other hand, for the female students, a one-unit increase in math score yields a change in log odds of $(.13 + .067) = 0.197$.  In terms of odds ratios, we can say that for male students, the odds ratio is $exp(.13)  = 1.14$ for a one-unit increase in math score and the odds ratio for female students is $exp(.197) = 1.22$ for a one-unit increase in math score.  **The ratio of these two odds ratios (female over male) turns out to be the exponentiated coefficient for the interaction term of female by math: $1.22/1.14 = exp(.067) = 1.07$.**

In [16]:
extract_odds_ratio(lr.params)

const          0.00
female         0.06
math           1.14
femalexmath    1.07
dtype: float64
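The relationship between the two group-specific odds ratios and the interaction coefficient can be verified with the rounded values from the text. This is a sketch under that assumption; the variable names are illustrative.

```python
import numpy as np

# Rounded values quoted in the text: b2 for math, b3 for the female-by-math interaction
b2, b3 = 0.13, 0.067

or_math_male = float(np.exp(b2))          # odds ratio per math point when female = 0
or_math_female = float(np.exp(b2 + b3))   # odds ratio per math point when female = 1
ratio_of_ratios = or_math_female / or_math_male

print(round(or_math_male, 2), round(or_math_female, 2), round(ratio_of_ratios, 2))
```

Because `exp(b2 + b3) / exp(b2) = exp(b3)`, the ratio of the two odds ratios is exactly the exponentiated interaction coefficient.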
