## Multiple Linear Regression

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

We'll be using a dataset on possums in Australia and New Guinea, borrowed from the OpenIntro Statistics textbook.

You can read more about it here: https://www.openintro.org/data/index.php?data=possum

Our goal will be to understand the relationship between the total length in cm (`total_l`) and the other variables in our dataset.

In [None]:
possum = pd.read_csv('../data/possum.csv')

In [None]:
possum.head(2)

In [None]:
import statsmodels.formula.api as sm
import statsmodels.api as stats

**Example 1:**

**Question:** Is tail length significantly related to total length after controlling for head length?

**Null Hypothesis:** $\beta_{tail\_l} = 0$

**Alternative Hypothesis:** $\beta_{tail\_l} \neq 0$

To test this, we'll fit a reduced model (without `tail_l`) and a full model (with `tail_l`).

In [None]:
lr_reduced = sm.ols('total_l ~ head_l', data = possum).fit()
lr_full = sm.ols('total_l ~ head_l + tail_l', data = possum).fit()

Then we compare the two nested models with the `anova_lm` function, which runs an F-test.

In [None]:
stats.stats.anova_lm(lr_reduced, lr_full)

Based on this, we can reject the null hypothesis and conclude that tail length is significantly related to total length after controlling for head length.
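To make the `anova_lm` comparison above less of a black box, here is a minimal sketch of the partial F-test it is based on, computed by hand with NumPy and SciPy. The data are synthetic stand-ins for `head_l`, `tail_l`, and `total_l` (not the possum data), and the helper `rss` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(0)
n = 100

# Synthetic stand-ins for head_l, tail_l and total_l (NOT the possum data)
head = rng.normal(92, 3.5, n)
tail = rng.normal(37, 2, n)
total = 10 + 0.6 * head + 0.8 * tail + rng.normal(0, 3, n)

def rss(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
X_reduced = np.column_stack([ones, head])        # total ~ head
X_full = np.column_stack([ones, head, tail])     # total ~ head + tail

rss_reduced = rss(X_reduced, total)
rss_full = rss(X_full, total)

q = X_full.shape[1] - X_reduced.shape[1]   # number of coefficients being tested
df_resid = n - X_full.shape[1]             # residual degrees of freedom, full model

# F = (drop in RSS per tested coefficient) / (residual variance of the full model)
f_stat = ((rss_reduced - rss_full) / q) / (rss_full / df_resid)
p_value = f_dist.sf(f_stat, q, df_resid)
print(f_stat, p_value)
```

A large F statistic means that adding the extra predictor reduced the residual sum of squares by more than we would expect from noise alone.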

The dataset also contains a site variable, which indicates where the possum was trapped.

**Question:** Does the average total length differ depending on the site?

**Null Hypothesis:** $\beta_{site\_i} = 0$ for all sites

**Alternative Hypothesis:** $\beta_{site\_i} \neq 0$ for at least one site

Note that `site` is encoded as an integer. We need to let statsmodels know that it is a categorical variable, which we can do by wrapping it in `C()` in our formula.

In [None]:
lr_reduced = sm.ols('total_l ~ 1', data = possum).fit()
lr_full = sm.ols('total_l ~ C(site)', data = possum).fit()

stats.stats.anova_lm(lr_reduced, lr_full)
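As a rough illustration of what `C(site)` does to the design matrix: by default a categorical variable is expanded into one indicator column per level, with the first level left out as the baseline. The sketch below mimics that with `pd.get_dummies` on a hypothetical toy DataFrame. This is a stand-in to show the idea, not the call statsmodels makes internally.

```python
import pandas as pd

# Hypothetical toy data with integer site codes, like the possum `site` column
toy = pd.DataFrame({
    'site': [1, 2, 3, 1, 2, 3],
    'total_l': [89.0, 91.5, 84.0, 88.0, 90.0, 85.5],
})

# One indicator column per level, dropping the first level (site 1) as the baseline
dummies = pd.get_dummies(toy['site'], prefix='site', drop_first=True, dtype=int)
print(dummies)
```

Each remaining column gets its own coefficient, which measures the difference in average total length between that site and the baseline site. Without `C()`, the integer codes would be treated as a single numeric predictor with one slope.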

In [None]:
sns.boxplot(data = possum, x = 'site', y = 'total_l');

**Question:** Does the average total length differ depending on the site, after controlling for the effect of all other variables?

**Null Hypothesis:** $\beta_{site\_i} = 0$ for all sites

**Alternative Hypothesis:** $\beta_{site\_i} \neq 0$ for at least one site

In [None]:
lr_reduced = sm.ols('total_l ~ pop + sex + age + head_l + skull_w + tail_l', data = possum).fit()
lr_full = sm.ols('total_l ~ pop + sex + age + head_l + skull_w + tail_l + C(site)', data = possum).fit()

stats.stats.anova_lm(lr_reduced, lr_full)

Even after accounting for all other variables, the site is significant.

## Interactions

To create interaction terms, you separate your variables by a `:` in the formula.

In [None]:
lr_full = sm.ols('total_l ~ head_l + sex + head_l:sex', data = possum).fit()
lr_full.summary()
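The formula language also has a shorthand: `a * b` expands to `a + b + a:b`. The sketch below checks this on a synthetic DataFrame (hypothetical data, not the possum dataset), using the same `sm` alias for `statsmodels.formula.api` as this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(1)
n = 80

# Synthetic data in which the head_l slope differs by sex (NOT the possum data)
toy = pd.DataFrame({
    'head_l': rng.normal(92, 3.5, n),
    'sex': rng.choice(['f', 'm'], n),
})
slope = np.where(toy['sex'] == 'm', 0.9, 0.5)
toy['total_l'] = 30 + slope * toy['head_l'] + rng.normal(0, 2, n)

# Main effects and the interaction written out explicitly
explicit = sm.ols('total_l ~ head_l + sex + head_l:sex', data = toy).fit()
# The same model using the `*` shorthand
shorthand = sm.ols('total_l ~ head_l * sex', data = toy).fit()

print(explicit.params)
print(shorthand.params)
```

Both fits should report the same four coefficients: the intercept, the two main effects, and the interaction.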

**Question:** Is the interaction term significant?

**Null Hypothesis:** $\beta_{head\_l:sex} = 0$

**Alternative Hypothesis:** $\beta_{head\_l:sex} \neq 0$ 
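To spell out what this hypothesis means, suppose `sex` enters the model as a single 0/1 indicator, written here as $\text{male}_i$ (an assumption about the coding; the baseline level depends on how statsmodels orders the categories). The full model is then

$$\text{total length}_i = \beta_0 + \beta_1 \cdot \text{head length}_i + \beta_2 \cdot \text{male}_i + \beta_3 \cdot \text{head length}_i \cdot \text{male}_i + \epsilon_i$$

For the baseline group ($\text{male}_i = 0$) the slope on head length is $\beta_1$, while for the other group ($\text{male}_i = 1$) it is $\beta_1 + \beta_3$. Here $\beta_3$ plays the role of $\beta_{head\_l:sex}$, so the null hypothesis says that both groups share the same slope.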

In [None]:
lr_reduced = sm.ols('total_l ~ head_l + sex', data = possum).fit()
lr_full = sm.ols('total_l ~ head_l + sex + head_l:sex', data = possum).fit()

stats.stats.anova_lm(lr_reduced, lr_full)

**Conclusion:** The interaction term is significant.

## Polynomial Regression

Let's revisit the cars dataset.

In [None]:
cars = pd.read_csv('../data/auto-mpg.csv')

In [None]:
cars.plot(kind = 'scatter', x = 'displacement', y = 'mpg', figsize = (10,6));

Last time, when we fit a simple linear regression model, we saw a distinct pattern in the residuals.

In [None]:
lr_cars = sm.ols('mpg ~ displacement', data = cars).fit()

plt.figure(figsize = (10,6))
plt.scatter(cars['displacement'], lr_cars.resid)
xmin, xmax = plt.xlim()
plt.hlines(y = 0, xmin = xmin, xmax = xmax)
plt.xlim(xmin, xmax);

It looks like the relationship is not linear, but instead is curved. We can try to capture this using a polynomial.

$$\text{mpg}_i = \beta_0 + \beta_1 \cdot \text{displacement}_i + \beta_2 \cdot \text{displacement}_i^2 + \epsilon_i$$

Inside a formula, operators such as `**` have a special meaning, so to tell statsmodels that we want the arithmetic square of `displacement`, we wrap that term in `I()`.

In [None]:
lr_poly = sm.ols('mpg ~ displacement + I(displacement**2)', data = cars).fit()
lr_poly.summary()
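To see why the `I()` wrapper matters, the sketch below fits the formula with and without it on synthetic data (a hypothetical curved relationship, not the cars dataset). Without `I()`, `displacement**2` is read as a formula operator rather than as arithmetic, and no squared column is added.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

rng = np.random.default_rng(2)

# Synthetic curved relationship (NOT the cars data)
x = np.linspace(70, 450, 60)
toy = pd.DataFrame({'displacement': x})
toy['mpg'] = 45 - 0.15 * x + 0.0002 * x**2 + rng.normal(0, 1, x.size)

# Without I(): `**` is treated as a formula operator, so no squared term appears
no_protect = sm.ols('mpg ~ displacement + displacement**2', data = toy).fit()
# With I(): the expression is evaluated as ordinary Python arithmetic
protected = sm.ols('mpg ~ displacement + I(displacement**2)', data = toy).fit()

print(no_protect.params)
print(protected.params)
```

Comparing the two coefficient tables is a quick way to confirm that the quadratic term actually made it into the model.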

Inspecting the residuals, it looks like we have removed the nonlinearity.

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(cars['displacement'], lr_poly.resid)
xmin, xmax = plt.xlim()
plt.hlines(y = 0, xmin = xmin, xmax = xmax)
plt.xlim(xmin, xmax);

However, we have a different problem: the residuals don't appear to have constant variance.

If we look at the residuals vs the fitted values, we can see that larger values of the response have higher variance.

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(lr_poly.fittedvalues, lr_poly.resid)
xmin, xmax = plt.xlim()
plt.hlines(y = 0, xmin = xmin, xmax = xmax)
plt.xlim(xmin, xmax);

A potential fix for this is to model the logarithm of the target.

In [None]:
plt.scatter(x = cars['displacement'], y = np.log(cars['mpg']));
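To see why a log transform can help, here is a small sketch on synthetic data with multiplicative noise, where the spread of the response grows with its level. The data and the helper `resid_sd_by_half` are hypothetical, introduced only for this illustration. It fits a quadratic on each scale and compares the residual standard deviation for low and high fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 400)

# Multiplicative noise: the spread of y grows with the level of y
y = np.exp(0.3 * x + rng.normal(0, 0.2, x.size))

def resid_sd_by_half(x, y):
    """Fit a quadratic, then return the residual SD for low and high fitted values."""
    coefs = np.polyfit(x, y, 2)
    fitted = np.polyval(coefs, x)
    resid = y - fitted
    high = fitted > np.median(fitted)
    return resid[~high].std(), resid[high].std()

raw_low, raw_high = resid_sd_by_half(x, y)
log_low, log_high = resid_sd_by_half(x, np.log(y))

raw_ratio = raw_high / raw_low   # well above 1 suggests non-constant variance
log_ratio = log_high / log_low   # close to 1 suggests roughly constant variance
print(raw_ratio, log_ratio)
```

On the original scale the residual spread is much larger for high fitted values, while on the log scale the two halves are comparable. Real data will rarely be this clean, so the residual plots remain the main diagnostic.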

In [None]:
lr_poly_log = sm.ols('np.log(mpg) ~ displacement + I(displacement**2)', data = cars).fit()
lr_poly_log.summary()

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(lr_poly_log.fittedvalues, lr_poly_log.resid)
xmin, xmax = plt.xlim()
plt.hlines(y = 0, xmin = xmin, xmax = xmax)
plt.xlim(xmin, xmax);

Now, let's see what the confidence and prediction intervals look like.

Note that since our target is the logarithm of mpg, we need to exponentiate the predictions and interval endpoints to get back to the original mpg scale.

In [None]:
var = 'displacement'

x_pred = pd.DataFrame({
    var: np.linspace(start = cars[var].min(),
                     stop = cars[var].max(),
                     num = 250)
})

pred = lr_poly_log.get_prediction(x_pred).summary_frame()

cars.plot(kind = 'scatter', x = var, y = 'mpg', figsize = (10,6))

plt.plot(x_pred[var], np.exp(pred['mean']), color = 'grey', label = 'predicted mean')

plt.plot(x_pred[var], np.exp(pred['mean_ci_lower']), color = 'blue', label = 'confidence interval')
plt.plot(x_pred[var], np.exp(pred['mean_ci_upper']), color = 'blue')

plt.plot(x_pred[var], np.exp(pred['obs_ci_lower']), color = 'black', label = 'prediction interval')
plt.plot(x_pred[var], np.exp(pred['obs_ci_upper']), color = 'black')

plt.legend();
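One caveat about the back-transformation used above: because `exp` is increasing, exponentiating the interval endpoints gives valid intervals on the mpg scale, but exponentiating the predicted mean of the log does not give the mean of mpg. If the errors are roughly normal on the log scale, it estimates the median instead. The sketch below illustrates this with a simulated response (hypothetical values, not the cars data).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical response whose logarithm is normal with mean 3 and SD 0.5
log_y = rng.normal(3.0, 0.5, 100_000)
y = np.exp(log_y)

# Exponentiate the mean on the log scale, as done for the curve in the plot
back_transformed = np.exp(log_y.mean())

print(back_transformed, np.median(y), y.mean())
```

The back-transformed value sits close to the median of `y` and below its mean, so the grey curve in the plot is best read as a typical (median) mpg for each displacement.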