# Multiple linear regression with categorical variables

This exercise illustrates:

- an alternative way to perform multiple linear regression
- the use of categorical variables
- model fit measures

Import all the packages that we need.

In [ ]:
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

%matplotlib inline
plt.style.use('ggplot') # emulate pretty r-style plots

Read the data from Carseats.csv.

In [ ]:
carseats_df = pd.read_csv('Data/Carseats.csv', index_col=0)
carseats_df.head()

### a) Fit a multiple regression model to predict 'Sales' using 'Population', 'Urban', and 'US' as predictors

In [ ]:
model = smf.ols('Sales ~ Population + Urban + US', data=carseats_df)
estimate = model.fit()
print(estimate.summary())

### b) Provide an interpretation of each coefficient in the model.

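The model can be represented as follows: $Sales = \beta_0 + \beta_{Urban} \cdot Urban + \beta_{US} \cdot US + \beta_{Population} \cdot Population + \epsilon$, where $Urban \in \{0,1\}$ and $US \in \{0,1\}$ are dummy variables (1 for 'Yes', 0 for 'No') and $\epsilon$ is the error term.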
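To make the dummy coding concrete, here is a minimal sketch that builds the design matrix by hand and solves the least-squares problem with NumPy. The data are synthetic stand-ins (not the Carseats values), and the names `urban_dummy`, `us_dummy` and `beta_true` are hypothetical, chosen only for this illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the Carseats values), for illustration only.
rng = np.random.default_rng(0)
n = 200
population = rng.uniform(10, 500, size=n)   # hypothetical population sizes
urban = rng.choice(["Yes", "No"], size=n)   # two-level categorical
us = rng.choice(["Yes", "No"], size=n)      # two-level categorical

# Dummy coding: the reference level "No" maps to 0, "Yes" maps to 1.
urban_dummy = (urban == "Yes").astype(float)
us_dummy = (us == "Yes").astype(float)

# Assumed coefficients used to simulate Sales:
# intercept, Urban, US, Population.
beta_true = np.array([7.0, 0.0, 1.2, 0.001])
X = np.column_stack([np.ones(n), urban_dummy, us_dummy, population])
sales = X @ beta_true + rng.normal(scale=1.0, size=n)

# Ordinary least squares on the dummy-coded design matrix.
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)

# beta_hat[2] estimates the gap in mean Sales between US == "Yes" and
# US == "No" at fixed Urban and Population.
names = ["Intercept", "Urban", "US", "Population"]
print(dict(zip(names, beta_hat.round(3))))
```

The formula interface used in this exercise performs this encoding automatically for string-valued columns, so the hand-built matrix is only meant to show what each coefficient multiplies.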

Urban:  
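The coefficient on Urban is the difference in average sales between urban and non-urban stores, holding US and Population fixed. Negative coefficient -> urban stores sell less on average than non-urban stores (it does not mean that urban sales are declining over time). 
High p-value -> not significant: we cannot reject the null hypothesis $(\beta_{Urban} = 0)$. -> drop it from the model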

US:  
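The coefficient on US is the difference in average sales between US and non-US stores, holding Urban and Population fixed. Positive coefficient -> US stores sell more on average than non-US stores.
Low p-value -> significant: we reject the null hypothesis $(\beta_{US} = 0)$ and conclude $\beta_{US} \neq 0$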

Population:  
The coefficient on Population is the change in average sales for a one-unit increase in Population, holding Urban and US fixed. Note that Population measures population size, not density. Positive coefficient -> stores in more populous areas sell more on average. High p-value -> not significant: we cannot reject the null hypothesis $(\beta_{Population} = 0)$. -> drop it from the model
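The estimates and p-values discussed above can also be read programmatically instead of from the printed summary. The following is a sketch on synthetic data (the frame `demo_df` and its values are made up for illustration and are not the Carseats data); it assumes the formula interface labels the dummy for the 'Yes' level as `US[T.Yes]`, with 'No' as the reference level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for carseats_df (values are made up for illustration).
rng = np.random.default_rng(1)
n = 300
demo_df = pd.DataFrame({
    "Population": rng.uniform(10, 500, size=n),
    "Urban": rng.choice(["Yes", "No"], size=n),
    "US": rng.choice(["Yes", "No"], size=n),
})
# Assumption: only US has a real effect in this simulated data.
demo_df["Sales"] = 7.0 + 1.2 * (demo_df["US"] == "Yes") + rng.normal(size=n)

demo_fit = smf.ols("Sales ~ Population + Urban + US", data=demo_df).fit()

# One row per coefficient: estimate, p-value and 95% confidence interval.
ci = demo_fit.conf_int()
coef_table = pd.DataFrame({
    "coef": demo_fit.params,
    "p_value": demo_fit.pvalues,
    "ci_low": ci[0],
    "ci_high": ci[1],
})
print(coef_table)
```

A confidence interval that contains zero tells the same story as a high p-value: the data do not rule out a coefficient of zero for that predictor.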

### c) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

In [ ]:
reduced_model = smf.ols('Sales ~ US', data=carseats_df)
reduced_estimate = reduced_model.fit()
print(reduced_estimate.summary())

### d) How well do the models fit the data? Produce the diagnostic plots of the reduced model.

In [ ]:
# Obtain the fitted values, residuals, studentized residuals and leverages for the reduced model
influence = OLSInfluence(reduced_estimate)
fitted_values = reduced_estimate.fittedvalues
residuals = reduced_estimate.resid.values
studentized_residuals = influence.resid_studentized_internal
leverages = influence.hat_matrix_diag  # leverage = diagonal of the hat matrix

# Plot
fig, (ax1, ax2, ax3) = plt.subplots(1,3,figsize=(16,4))
# Residuals
ax1.scatter(fitted_values, residuals, facecolor='none', edgecolor='b')
ax1.set_xlabel('fitted values')
ax1.set_ylabel('residuals')
# Studentized residuals
ax2.scatter(fitted_values, studentized_residuals, facecolor='none', edgecolor='b')
ax2.set_xlabel('fitted values')
ax2.set_ylabel('studentized residuals')
# Leverages
ax3.scatter(leverages, studentized_residuals, facecolor='none', edgecolor='b')
ax3.set_xlabel('leverages')
ax3.set_ylabel('studentized residuals')
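For the first half of the question (how well the models fit), one possible approach is to put the usual fit measures of both models side by side and run a partial F-test for the dropped predictors. The sketch below uses synthetic data, so the frame `fit_df` and the printed numbers are illustrative only; with the real data the same attributes can be read from the fitted full and reduced models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for carseats_df (values are made up for illustration).
rng = np.random.default_rng(2)
n = 300
fit_df = pd.DataFrame({
    "Population": rng.uniform(10, 500, size=n),
    "Urban": rng.choice(["Yes", "No"], size=n),
    "US": rng.choice(["Yes", "No"], size=n),
})
# Assumption: only US has a real effect in this simulated data.
fit_df["Sales"] = 7.0 + 1.2 * (fit_df["US"] == "Yes") + rng.normal(size=n)

full_fit = smf.ols("Sales ~ Population + Urban + US", data=fit_df).fit()
small_fit = smf.ols("Sales ~ US", data=fit_df).fit()

# R^2, adjusted R^2 and residual standard error (RSE) for both models.
fit_measures = pd.DataFrame(
    {
        "R2": [full_fit.rsquared, small_fit.rsquared],
        "adj_R2": [full_fit.rsquared_adj, small_fit.rsquared_adj],
        "RSE": [np.sqrt(full_fit.mse_resid), np.sqrt(small_fit.mse_resid)],
    },
    index=["full", "reduced"],
)
print(fit_measures)

# Partial F-test: H0 says the coefficients dropped from the full model are all zero.
f_value, p_value, df_diff = full_fit.compare_f_test(small_fit)
print(f"F = {f_value:.3f}, p = {p_value:.3f}, dropped terms = {df_diff:.0f}")
```

Because the models are nested, R2 can only go down when predictors are removed; adjusted R2, the RSE and the F-test are the more informative comparisons, since they account for the number of parameters.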