In [None]:
import pandas
import plotnine
import numpy
import scipy
import math
from plotnine import *


#   Logistic Regression

So far we have seen how to use linear regression to identify relationships between two continuous quantities. In many cases we are actually interested in the probability of an event occurring and how it is affected by other factors.  In order to study this we use logistic regression.  

**Example: Understanding Risk Factors**

In medicine, it is often of interest to determine factors that are associated with the occurrence or non-occurrence of an illness. For example, we might ask whether eating refined sugar is a risk factor for diabetes.  To study this we would define the variables 
$$
y = \left\{
\begin{array}{cl}
1& \mbox{If a patient has diabetes}\\
0& \mbox{else}
\end{array}
\right.
$$
and
$$
x = \mbox{amount of sugar consumed in grams/day}.
$$
We might then want to assume that 
$$
p(x)=\Pr(\mbox{patient has diabetes})=f(x)
$$ 
where $y$ is a Bernoulli random variable with success probability $p(x)$,
$$
y\sim Bern\left(p(x)\right).
$$

We will see in this module how to estimate the function $f(x)$ using logistic regression.

##    Linear Regression for Binomial Data

We have seen linear regression as a straightforward method of "fitting" a line to data in order to understand the linear relationship between two variables $y$ and $x$, i.e. we assume that 
$$
y\propto x.
$$
In the case where we are interested in the relationship between $x$ and $\Pr(y=1)$, we typically have a set of observations of $y\in \{0,1\}$ where 
$$
\Pr(y=1;x)=p(x)
$$
and the resulting probability mass function is
$$
\Pr(y;x)=p(x)^y(1-p(x))^{1-y}.
$$
The resulting likelihood function for the data $\mathbf{y}=(y_1,\ldots,y_n)$ and $\mathbf{x}=(x_1,\ldots,x_n)$ is
$$
L(p(x)|\mathbf{y})=\prod_{i=1}^np(x_i)^{y_i}(1-p(x_i))^{1-y_i}.
$$
Notice that we assume that $x_i\in\mathbb{R}$ but that $p(x_i)\in (0,1)$, so we can't simply write 
$$
p(x_i)=x_i\beta
$$
without having some complicated restrictions on $\beta$.  Instead we assume that 

$$
p(x_i) = \frac{\exp(x_i\beta)}{1+\exp(x_i\beta)}
$$
Writing $p_i=p(x_i)$ and solving for $x_i\beta$, we get
$$
\log\left(\frac{p_i}{1-p_i}\right)=x_i\beta.
$$
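
The rearrangement above can be checked step by step. As a sketch, write $\eta_i = x_i\beta$ for the linear predictor (a shorthand introduced here only for convenience):
\begin{align}
p_i &= \frac{\exp(\eta_i)}{1+\exp(\eta_i)}\\
1-p_i &= \frac{1}{1+\exp(\eta_i)}\\
\frac{p_i}{1-p_i} &= \exp(\eta_i)\\
\log\left(\frac{p_i}{1-p_i}\right) &= \eta_i = x_i\beta
\end{align}
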
Note that it is possible to include an intercept and additional covariates, in which case we would write
$$
\log\left(\frac{p_i}{1-p_i}\right)=\beta_0+\sum_{j=1}^K\beta_jx_{ij}.
$$

**NOTE**

Alternatively we can write
$$
p(x_i) = \frac{1}{1+\exp(-x_i\beta)}.
$$
The function 
$$
\log\left(\frac{p}{1-p}\right)
$$
is referred to as the _logit_ function, hence we call the regression problem 
$$
\log\left(\frac{p}{1-p}\right)=x\beta
$$
of estimating $\beta$ _logistic regression_.
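
To make the two forms concrete, here is a minimal sketch of the logit function and its inverse using only the standard library. The function names `logit` and `inv_logit` are chosen here for illustration and are not taken from any particular package.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a real-valued linear predictor eta back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Each probability is recovered after a round trip through the log-odds scale
for p in (0.1, 0.5, 0.9):
    print(p, logit(p), inv_logit(logit(p)))
```

Whatever real value $x\beta$ takes, `inv_logit` returns a value strictly between 0 and 1, which is exactly the restriction that the linear form $p(x_i)=x_i\beta$ could not guarantee.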


##    Computing Results

If we make the substitution of
$$
p(x_i) = \frac{\exp(x_i\beta)}{1+\exp(x_i\beta)}
$$
into the likelihood function, we can see that the result is quite cumbersome:
$$
L(\beta|\mathbf{y})=\prod_{i=1}^n\left(\frac{\exp(x_i\beta)}{1+\exp(x_i\beta)}\right)
^{y_i}\left(\frac{1}{1+\exp(x_i\beta)}\right)^{1-y_i}.
$$
The maximum likelihood estimate can be written as 
$$
\hat{\beta}=\arg\max_{\beta}\prod_{i=1}^n\left(\frac{\exp(x_i\beta)}{1+\exp(x_i\beta)}\right)
^{y_i}\left(\frac{1}{1+\exp(x_i\beta)}\right)^{1-y_i}
$$
but no closed-form expression for $\hat{\beta}$ is available.  Instead we rely on numerical techniques to find estimates of the model parameters. 
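
As an illustration of one such numerical technique, the sketch below maximises the log of the likelihood for the single-coefficient model with Newton-Raphson iterations. The data are made up for this example, and the function names are hypothetical; this is not necessarily the routine that any given library uses.

```python
import math

# Toy data, made up for illustration (not from the diabetes data set)
x = [-2.0, -1.0, -1.0, 0.0, 1.0, 1.0, 2.0, 3.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

def prob(beta, xi):
    """p(x_i) = exp(x_i*beta) / (1 + exp(x_i*beta))."""
    return 1.0 / (1.0 + math.exp(-xi * beta))

def log_likelihood(beta):
    """Log of L(beta | y) for the single-coefficient model."""
    return sum(yi * math.log(prob(beta, xi))
               + (1 - yi) * math.log(1.0 - prob(beta, xi))
               for xi, yi in zip(x, y))

def score(beta):
    """First derivative of the log-likelihood with respect to beta."""
    return sum(xi * (yi - prob(beta, xi)) for xi, yi in zip(x, y))

def information(beta):
    """Negative second derivative of the log-likelihood."""
    return sum(xi ** 2 * prob(beta, xi) * (1.0 - prob(beta, xi)) for xi in x)

def fit_newton(beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations: beta <- beta + score / information."""
    for _ in range(max_iter):
        step = score(beta) / information(beta)
        beta += step
        if abs(step) < tol:
            break
    return beta

beta_hat = fit_newton()
print(beta_hat, log_likelihood(beta_hat))
```

Working with the log of the likelihood turns the product into a sum, which is easier to differentiate and numerically better behaved. At the estimate the score is (numerically) zero, which is the condition the analytic approach could not solve in closed form.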

### Example: Diabetes Risk Factor Modelling 
Diabetes is a chronic disease affecting millions of people around the world. Diabetes occurs when the body either can't produce or properly use insulin, a hormone that regulates blood sugar. Hyperglycemia, or elevated blood sugar, can damage many vital organs and systems in the human body, increasing risks of heart attacks and strokes, blindness, poor circulation and potential loss of limbs, and kidney failure. 

The Pima Indian Diabetes Data Set consists of 768 observations of 9 variables:

`pregnant`	Number of times pregnant

`glucose`	Plasma glucose concentration (glucose tolerance test)

`pressure`	Diastolic blood pressure (mm Hg)

`triceps`	Triceps skin fold thickness (mm)

`insulin`	2-Hour serum insulin (mu U/ml)

`mass`	Body mass index (weight in kg/(height in m)\^2)

`pedigree`	Diabetes pedigree function

`age`	Age (years)

`diabetes`	Class variable (test for diabetes)

Fit a logistic regression model to determine which of these various factors affect the probability that an individual in the data set has diabetes. 

#### Solution

A logistic regression model is a good choice to fit this data.  The response variable is 
\begin{align}
y=& \left\{\begin{array}{cc}
1,& \text{if the patient has diabetes}\\
0, & \text {if the patient doesn't have diabetes}
\end{array}\right.
\end{align} 
and the covariates are all numeric, most of them on continuous scales (`pregnant` is a count and `age` is recorded in years).

The resulting model will be
$$
\log\left(\frac{p_i}{1-p_i}
\right)=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\dots+\beta_8x_{i8}
$$
with the likelihood function
$$
L(\mathbf{\beta}|\mathbf{y})=\prod_{i=1}^n\left(\frac{\exp\left(\beta_0+\sum_{j=1}^8\beta_jx_{ij}\right)}{1+\exp\left(\beta_0+\sum_{j=1}^8\beta_jx_{ij}\right)}\right)^{y_i}\left(1-\frac{\exp\left(\beta_0+\sum_{j=1}^8\beta_jx_{ij}\right)}{1+\exp\left(\beta_0+\sum_{j=1}^8\beta_jx_{ij}\right)}
\right)^{1-y_i}.
$$
Of course, maximising this likelihood requires software that implements a numerical optimisation routine.

#### Code

In [None]:

##  Importing the submodule makes statsmodels.formula.api available
import statsmodels.formula.api

##  We load the data from a CSV file
##  Note that the variable diabetes has been coded as a 0-1 variable

df = pandas.read_csv("../DATA/diabetes.csv")

##  We fit the model using the logit function from statsmodels.formula.api

model = statsmodels.formula.api.logit('diabetes~pregnant+glucose+pressure+triceps+insulin+mass+pedigree+age', df)

##  fit() estimates the coefficients by numerical optimisation

res = model.fit()

res.summary()


##    Interpreting Results

The results from linear regression have relatively straightforward interpretations: the parameters of the linear model are the coefficients describing the linear relationships between the dependent and independent variables. The results of a logistic regression are a bit more difficult to interpret, as the linear relationship is between a non-linear function of the probability of the dependent variable and the independent variables.  In order to interpret the coefficients of the logistic regression we first define the concept of odds. 

###   Odds
Odds are defined as the ratio of a probability and its complement.  For example, consider an event $A$; the odds of $A$ are
$$
\text{Odds of $A$}=\frac{\Pr(A)}{1-\Pr(A)}.
$$
The odds are interpreted as a ratio in terms of "for" and "against".  For example, if the odds of $A$ are 2, then we would say that the odds are 2 to 1 in favor of $A$; in other words, the probability that $A$ occurs is twice the probability that $A$ _doesn't_ occur.  

We can solve for $\Pr(A)$ algebraically as follows.  Assuming the odds of $A$ are $2$ 
\begin{align}
2&=\frac{\Pr(A)}{1-\Pr(A)}\\
\frac{1}{2}&=\frac{1-\Pr(A)}{\Pr(A)}\\
\frac{3}{2}&=\frac{1}{\Pr(A)}\\
\frac{2}{3}&=\Pr(A)
\end{align}
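
The same algebra gives a general conversion in both directions. A minimal sketch, with helper names chosen here for illustration:

```python
def odds_from_prob(p):
    """Odds of an event with probability p, for 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Invert odds = p / (1 - p) to recover p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

# Odds of 2 (2 to 1 in favor) correspond to a probability of two thirds
print(prob_from_odds(2.0))
```

Even odds (odds of 1) correspond to a probability of one half, and odds below 1 correspond to events that are less likely to occur than not.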

### Logistic Regression Coefficients

We can see immediately that the linear relationship in the logistic regression model is between the _log_ of the odds and the independent variables.  It is natural then to exponentiate both sides of the equation so that we can interpret the effects on the odds
$$
\begin{align}
\log\left(\frac{p_i}{1-p_i}\right) &= \beta_0+\beta_1x_{i1}+\cdots +\beta_Kx_{iK}\\
\frac{p_i}{1-p_i}&=\exp\left(\beta_0+\beta_1x_{i1}+\cdots +\beta_Kx_{iK}\right)\\
\frac{p_i}{1-p_i}&=e^{\beta_0}e^{\beta_1x_{i1}}\cdots e^{\beta_Kx_{iK}}.
\end{align}
$$
The interpretation is that a unit increase in the independent variable $x_{ik}$, with the other variables held fixed, multiplies the odds of the event by a factor of $\exp(\beta_k)$.  



    $\beta$    $\exp(\beta)$     Effect on $\Pr(A)$
 ----------- ----------------- --------------------------------
  $\beta<0$   $\exp(\beta)<1$   $x\uparrow$ $\Pr(A)\downarrow$ 
  $\beta>0$   $\exp(\beta)>1$   $x\uparrow$ $\Pr(A)\uparrow$ 
  $\beta=0$   $\exp(\beta)=1$   $x\uparrow$ no change 
 

These results give us a means of interpreting the coefficients of a logistic regression model. Note that as the odds of $A$ increase, so does $\Pr(A)$, so the direction of the effect indicated by each coefficient also applies to $\Pr(A)$. 
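
A short numerical check of this interpretation, using hypothetical coefficients picked only for illustration (they are not estimates from any data set):

```python
import math

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -1.0, 0.5

def odds(x):
    """Odds implied by the model log(p / (1 - p)) = beta_0 + beta_1 * x."""
    return math.exp(beta_0 + beta_1 * x)

def prob(x):
    """Probability corresponding to those odds."""
    return odds(x) / (1.0 + odds(x))

# The ratio of odds for a unit increase in x is the same at every x
for x in (0.0, 1.0, 2.0):
    print(x, odds(x), odds(x + 1.0) / odds(x), prob(x))

print(math.exp(beta_1))
```

The odds ratio for a unit increase equals $\exp(\beta_1)$ regardless of the starting value of $x$, whereas the corresponding change in probability depends on where $x$ starts. This is why the coefficients are usually reported on the odds scale.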


###   Diabetes Risk Factor Modelling 
From the summary of the model fitted above we can see that some of the variables appear to be statistically significant. The intercept is not directly useful: mathematically it is the log of the odds when the independent variables are all equal to 0, which is not a realistic situation.  

Based on the signs and significance of the coefficients, the risk of diabetes increases as glucose increases and insulin decreases.  Type I diabetes is defined as a case where the pancreas produces little or no insulin, and Type II diabetes is defined as a deficient processing of glucose, so these relationships are expected and provide no insight into risk factors for diabetes.  

Of the remaining covariates, all coefficients are positive with the exception of `pressure`, the patient's diastolic blood pressure; as this increases, the risk of diabetes decreases.  The coefficients for body mass index (`mass`), family history (`pedigree`) and number of pregnancies (`pregnant`) are positive and statistically significant: a higher body mass index, a family history of diabetes and a greater number of pregnancies are associated with an increased probability of diabetes. 

#### Plots

In [None]:

##  res.fittedvalues holds the linear predictor; the inverse logit converts it to a probability
df["fitted_values"] = 1 / (1 + numpy.exp(-res.fittedvalues))
df["resid"] = res.resid_generalized

(ggplot(df)+
geom_point(aes(x = 'pregnant', y = 'diabetes' ))+
geom_line(aes(x = 'pregnant', y = 'fitted_values'))
)

(ggplot(df)+
geom_point(aes(x = 'glucose', y = 'diabetes' ))+
geom_line(aes(x = 'glucose', y = 'fitted_values'))
)

(ggplot(df)+
geom_point(aes(x = 'pressure', y = 'diabetes' ))+
geom_line(aes(x = 'pressure', y = 'fitted_values'))
)

(ggplot(df)+
geom_point(aes(x = 'insulin', y = 'diabetes' ))+
geom_line(aes(x = 'insulin', y = 'fitted_values'))
)

(ggplot(df)+
geom_point(aes(x = 'mass', y = 'diabetes' ))+
geom_line(aes(x = 'mass', y = 'fitted_values'))
)

(ggplot(df)+
geom_point(aes(x = 'pedigree', y = 'diabetes' ))+
geom_line(aes(x = 'pedigree', y = 'fitted_values'))
)

(ggplot(df)+
geom_point(aes(x = 'age', y = 'diabetes' ))+
geom_line(aes(x = 'age', y = 'fitted_values'))
)


#   Exercises for Additional Practice

##    Example 1

On January 28, 1986 the US Space Shuttle Challenger launched from Kennedy Space Center in Florida.  Seventy-three seconds into the flight the spacecraft broke up due to an explosion of the external fuel tank (ET).  Subsequent investigations traced the root cause of failure to a rubber o-ring in a joint on the solid rocket booster (SRB) that failed, allowing hot gases from the SRB to impinge on the ET, leading to the explosion.  

Investigations collected data from previous flights including: recorded launch temperature, order of flight, and pre-flight pressure check tests for joint leakage.  Analyse these data using logistic regression for the failure of joints in previous launches. Note that the indicator for failure is the variable `n_failure`, launch temperature is `temp`, and pressure check levels are `psi`.

In [None]:

df = pandas.read_csv("../DATA/srb.csv")




**Solution:**

In [None]:

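import statsmodels.formula.api

df = pandas.read_csv("../DATA/srb.csv")

##  Model the failure indicator using temperature and pressure check level
model = statsmodels.formula.api.logit("n_failure~temp+psi", df)

results = model.fit()

results.summary()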



Now fit the model considering temperature only.

In [None]:

df = pandas.read_csv("../DATA/srb.csv")




**Solution:**

In [None]:

import statsmodels.formula.api

df = pandas.read_csv("../DATA/srb.csv")

##  Model the failure indicator using temperature only
model = statsmodels.formula.api.logit("n_failure~temp", df)

results = model.fit()

results.summary()



Given these results, if the launch temperature on 28 January 1986 was 28 F, what was the probability of o-ring failure according to this model?  Hint: you can use `predict`. 

In [None]:

df_predict = pandas.DataFrame({'temp': [28]})


**Solution:**

In [None]:

df_predict = pandas.DataFrame({'temp': [28]})

results.predict(df_predict)
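
For a logistic regression, the predicted probability is the inverse logit of the fitted linear predictor. The sketch below does that calculation by hand. The coefficient values `b0` and `b1` are placeholders made up for illustration, not the fitted values; in practice they would be read from `results.params`.

```python
import math

def predicted_probability(intercept, slope, temp):
    """Inverse logit of the linear predictor intercept + slope * temp."""
    eta = intercept + slope * temp
    return 1.0 / (1.0 + math.exp(-eta))

# Placeholder coefficients for illustration only; replace with the fitted values
b0, b1 = 10.0, -0.2

p_28 = predicted_probability(b0, b1, 28)
print(p_28)
```

With a negative temperature coefficient, as in this placeholder example, lower launch temperatures give higher predicted probabilities of failure.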
