<img align=right src="images/inmas.png" width=130x />

# Notebook 03b - Multiple Linear Regression - Supplement

Material covered in this notebook:

This notebook follows along with the notes [here](Notes/3_MultipleLinearRegression.pdf).


### Prerequisite
Notebook 03a

------------------------------------

In [None]:
import numpy as np
import pandas as pd  # needed for pd.read_csv in the cells below
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sbn
import scipy.stats as stats

# Feature Selection for Regression

We introduced multiple regression, which involved estimating the coefficients of a linear relationship between a dependent/response variable $Y$ and independent/predictor/feature variables $X_1,\dots,X_k$:

$$Y=\beta_0 + \beta_1X_1 +\cdots + \beta_kX_k$$

How do we know that we have put the right covariates into the model? How can we tell if our model is "good" or even "best"?

Let's revisit the course evaluation dataset and see if we can do better than our simple linear regression model.

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/akmand/datasets/master/openintro/evals.csv")
display(data)

data.columns

In [None]:
linear_model_formulation = smf.ols("score ~ bty_avg", data = data)

lm_results = linear_model_formulation.fit()
lm_results.params

In [None]:
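# np.polyval expects the highest-degree coefficient first,
# so flip (intercept, slope) into (slope, intercept)
beta_ests = np.flip(np.array(lm_results.params))
x_vals = np.linspace(0,11,1000)
y_vals = np.polyval(beta_ests, x_vals)

# Fitted regression line together with the raw data
plt.plot(x_vals, y_vals, color= 'blue', linewidth=3, zorder = 2)
plt.scatter(data.bty_avg, data.score, color = 'black', zorder = 2)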

In [None]:
lm_summary = lm_results.summary()
print(lm_summary)

In [None]:
#R^2 adjusted for the linear model
lm_results.rsquared_adj

In [None]:
lm_results.aic

In [None]:
lm_results.bic

We are only explaining about 3% of the variation in course evaluation scores. What else can we add to the model to help us without the model getting too unwieldy?

## Model Selection

Clearly a balance must be struck between a model's complexity and its capacity to effectively predict new data. The methods by which we strike this balance in practice fall under the category of **model selection**. The model selection task is typically cast as an application of **Occam's razor**, which may be summarized as:

   "Entities should not be multiplied beyond necessity."
    
I.e., in the presence of (equally likely) competing explanations for a phenomenon, the simplest should be preferred.

In parametric settings, this is typically carried out by computing, for each model under consideration, an expression of the form

$$\text{log-likelihood} - \text{complexity penalty}$$

Commonly used model selection criteria which fall under this paradigm include the **Akaike information criterion** (AIC) and the **Bayesian information criterion** (BIC), which subtract the following penalties from the log-likelihood, where $k$ here counts the model's estimated parameters and $n$ is the number of observations:

$$\begin{aligned}
p_{\text{AIC}} &= k \\
p_{\text{BIC}} &= \frac{k}{2}\log n
\end{aligned}$$
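As a quick numerical check of these penalties, the sketch below fits a least-squares line to synthetic data with NumPy, evaluates the Gaussian log-likelihood at the fitted parameters, and forms "log-likelihood minus penalty" for both criteria. The data, the seed, and the choice to count only the regression coefficients in $k$ are assumptions made for illustration. statsmodels reports `aic` and `bic` on the $-2\times$ scale, so the last lines convert to that scale as well.

```python
import numpy as np

# Synthetic data (illustrative only): y = 1 + 2x + noise
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Design matrix with an intercept column; k counts the regression coefficients
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# Least-squares fit and maximum-likelihood estimate of the error variance
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = np.mean(resid ** 2)

# Gaussian log-likelihood evaluated at the fitted parameters
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

# "log-likelihood - complexity penalty": larger is better on this scale
penalized_aic = log_lik - k
penalized_bic = log_lik - 0.5 * k * np.log(n)

# The -2x scale, on which smaller is better
aic = -2 * penalized_aic
bic = -2 * penalized_bic
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

Because $\log n > 2$ once $n \geq 8$, the BIC penalty is the heavier of the two for all but very small samples, so BIC tends to favor smaller models.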

## Forward Selection of Features

Forward selection involves the following procedure:

   1. Start with an empty regression model (i.e., just an intercept, if appropriate).
   2. For each feature not yet included, fit the model obtained by adding only that feature to the current model.
   3. Note either the significance of the new feature (in the `.summary()` call) or a desired model selection criterion.
   4. If there is at least one feature which is significant or improves the criterion, keep the feature which is most significant or most improves the criterion and return to step 2.
   5. If no additional feature is significant or improves the criterion, return the model without any new features.

We will first use the adjusted $R^2$ as our criterion since there are so few data. The adjusted $R^2$ accounts for the fact that we add complexity every time we add a new covariate. The regular $R^2$ never decreases when we add a new covariate, whether or not that covariate is useful. This is not necessarily true of the adjusted version.
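To make the difference concrete, recall the usual definition $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, where $k$ is the number of predictors excluding the intercept. The sketch below uses synthetic data and a hypothetical helper, `r2_and_adjusted`, to compare both quantities before and after adding a pure-noise column. Everything in it is an illustrative assumption and is not part of the course dataset.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for least squares of y on X plus an intercept."""
    n, k = X.shape  # k predictors, not counting the intercept
    design = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta_hat) ** 2)  # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Synthetic data: y depends on x only; "noise" is unrelated to y by construction
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=(n, 1))
y = 0.5 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 1))

r2_small, adj_small = r2_and_adjusted(x, y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x, noise]), y)

print(f"R^2:          {r2_small:.4f} -> {r2_big:.4f}")
print(f"adjusted R^2: {adj_small:.4f} -> {adj_big:.4f}")
```

The plain $R^2$ cannot go down when the noise column is added, while the adjusted value is free to move in either direction depending on whether the small gain in fit outweighs the lost degree of freedom.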

To simplify, we will consider only the following subset of the covariates:

- rank
- ethnicity
- gender
- age
- cls_perc_eval
- bty_avg


In [None]:
results_rank_only = smf.ols("score ~ rank", data = data).fit()
results_ethnicity_only = smf.ols("score ~ ethnicity", data = data).fit()
results_gender_only = smf.ols("score ~ gender", data = data).fit()
results_age_only = smf.ols("score ~ age", data = data).fit()
results_cls_perc_eval_only = smf.ols("score ~ cls_perc_eval", data = data).fit()


print("Adj Rsquared for rank only is "+ str(results_rank_only.rsquared_adj))
print("Adj Rsquared for ethnicity only is "+ str(results_ethnicity_only.rsquared_adj))
print("Adj Rsquared for gender only is "+ str(results_gender_only.rsquared_adj))
print("Adj Rsquared for age only is "+ str(results_age_only.rsquared_adj))
print("Adj Rsquared for cls_perc_eval only is "+ str(results_cls_perc_eval_only.rsquared_adj))
# bty_avg alone was already fit above as lm_results; print it for comparison
print("Adj Rsquared for bty_avg only is "+ str(lm_results.rsquared_adj))


It looks like using bty_avg is the best first covariate. Now what can we add to it to improve the most?

In [None]:
results_bty_rank_only = smf.ols("score ~ rank + bty_avg", data = data).fit()
results_bty_ethnicity_only = smf.ols("score ~ ethnicity + bty_avg", data = data).fit()
results_bty_gender_only = smf.ols("score ~ gender + bty_avg", data = data).fit()
results_bty_age_only = smf.ols("score ~ age + bty_avg", data = data).fit()
results_bty_cls_perc_eval_only = smf.ols("score ~ cls_perc_eval + bty_avg", data = data).fit()


print("Adj Rsquared for bty_rank only is "+ str(results_bty_rank_only.rsquared_adj))
print("Adj Rsquared for bty_ethnicity only is "+ str(results_bty_ethnicity_only.rsquared_adj))
print("Adj Rsquared for bty_gender only is "+ str(results_bty_gender_only.rsquared_adj))
print("Adj Rsquared for bty_age only is "+ str(results_bty_age_only.rsquared_adj))
print("Adj Rsquared for bty_cls_perc_eval only is "+ str(results_bty_cls_perc_eval_only.rsquared_adj))

bty_avg and cls_perc_eval work best together. Now what do we add?

In [None]:
results_bty_cls_rank_only = smf.ols("score ~ rank + bty_avg + cls_perc_eval", data = data).fit()
results_bty_cls_ethnicity_only = smf.ols("score ~ ethnicity + bty_avg  + cls_perc_eval", data = data).fit()
results_bty_cls_gender_only = smf.ols("score ~ gender + bty_avg  + cls_perc_eval", data = data).fit()
results_bty_cls_age_only = smf.ols("score ~ age + bty_avg  + cls_perc_eval", data = data).fit()


print("Adj Rsquared for bty_cls_rank only is "+ str(results_bty_cls_rank_only.rsquared_adj))
print("Adj Rsquared for bty_cls_ethnicity only is "+ str(results_bty_cls_ethnicity_only.rsquared_adj))
print("Adj Rsquared for bty_cls_gender only is "+ str(results_bty_cls_gender_only.rsquared_adj))
print("Adj Rsquared for bty_cls_age only is "+ str(results_bty_cls_age_only.rsquared_adj))


Adding gender helps. Let's keep going.

In [None]:
results_bty_cls_gender_rank_only = smf.ols("score ~ rank + bty_avg + cls_perc_eval + gender", data = data).fit()
results_bty_cls_gender_ethnicity_only = smf.ols("score ~ ethnicity + bty_avg  + cls_perc_eval + gender", data = data).fit()
results_bty_cls_gender_age_only = smf.ols("score ~ age + bty_avg  + cls_perc_eval + gender", data = data).fit()


print("Adj Rsquared for bty_cls_gender_rank only is "+ str(results_bty_cls_gender_rank_only.rsquared_adj))
print("Adj Rsquared for bty_cls_gender_ethnicity only is "+ str(results_bty_cls_gender_ethnicity_only.rsquared_adj))
print("Adj Rsquared for bty_cls_gender_age only is "+ str(results_bty_cls_gender_age_only.rsquared_adj))

Rank helps but we start to see diminishing returns.

In [None]:
results_bty_cls_gender_rank_ethnicity_only = smf.ols("score ~ ethnicity + bty_avg  + cls_perc_eval + gender + rank", data = data).fit()
results_bty_cls_gender_rank_age_only = smf.ols("score ~ age + bty_avg  + cls_perc_eval + gender + rank", data = data).fit()


print("Adj Rsquared for bty_cls_gender_rank_ethnicity only is "+ str(results_bty_cls_gender_rank_ethnicity_only.rsquared_adj))
print("Adj Rsquared for bty_cls_gender_rank_age only is "+ str(results_bty_cls_gender_rank_age_only.rsquared_adj))

Age helps. Should we add the last variable?

In [None]:
results_bty_cls_gender_rank_age_ethnicity_only = smf.ols("score ~ ethnicity + bty_avg  + cls_perc_eval + gender + rank + age", data = data).fit()


print("Adj Rsquared for bty_cls_gender_rank_age_ethnicity only is "+ str(results_bty_cls_gender_rank_age_ethnicity_only.rsquared_adj))


This marginally helps. So the "full" model is not overkill based on this metric. Recall that this isn't really a full model because we only considered a subset of variables.
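The manual search above repeats the same pattern at every step, so it can be wrapped in a loop. The following is a minimal sketch, assuming adjusted $R^2$ as the criterion and using a hypothetical helper `forward_select` on a synthetic data frame (`toy`, with made-up columns `x1`, `x2`, `group`). It is not the notebook's official solution, only one way to automate the procedure.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def forward_select(df, response, candidates):
    """Greedy forward selection on adjusted R^2 (hypothetical helper)."""
    selected = []
    remaining = list(candidates)
    # Step 1: start from the intercept-only model
    best_score = smf.ols(f"{response} ~ 1", data=df).fit().rsquared_adj
    while remaining:
        # Steps 2-3: try each unused feature and record the criterion
        scores = {}
        for term in remaining:
            formula = f"{response} ~ " + " + ".join(selected + [term])
            scores[term] = smf.ols(formula, data=df).fit().rsquared_adj
        best_term = max(scores, key=scores.get)
        # Step 5: stop when no candidate improves the criterion
        if scores[best_term] <= best_score:
            break
        # Step 4: keep the best candidate and repeat
        selected.append(best_term)
        remaining.remove(best_term)
        best_score = scores[best_term]
    return selected, best_score

# Synthetic data: y depends on x1 and group, but not on x2
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": rng.choice(["a", "b"], size=n),
})
toy["y"] = 1.0 + 2.0 * toy["x1"] + 3.0 * (toy["group"] == "b") + rng.normal(size=n)

chosen, score = forward_select(toy, "y", ["x1", "x2", "group"])
print(chosen, round(score, 3))
```

Swapping the criterion for AIC or BIC only requires reading `.aic` or `.bic` from the fitted results and flipping the comparison, since lower values are better for those two.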

## Backward Selection of Features

If forward selection involves incrementally adding features to our model, then it naturally stands to reason that **backward selection** involves incrementally *removing* features. The procedure is as follows:

 1. Start with a full regression model.
 2. For each feature included in the model, fit the model which omits that feature.
 3. Note the model selection criterion in question.
 4. If there is at least one feature which, upon omission, improves the model selection criterion, omit the feature which results in the largest improvement, and return to step 2.
 5. Otherwise, if no single omission improves the model selection criterion, return the current model.
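The five steps above translate almost line for line into a loop. The sketch below is one possible version, assuming AIC as the model selection criterion (lower is better in statsmodels) and a hypothetical helper `backward_select` applied to synthetic data. The column names and the data-generating model are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_select(df, response, features):
    """Greedy backward elimination on AIC (hypothetical helper)."""
    def fit_aic(terms):
        rhs = " + ".join(terms) if terms else "1"
        return smf.ols(f"{response} ~ {rhs}", data=df).fit().aic

    current = list(features)
    best_aic = fit_aic(current)  # Step 1: start with the full model
    while current:
        # Steps 2-3: refit without each feature in turn and record the AIC
        trials = {term: fit_aic([t for t in current if t != term])
                  for term in current}
        drop = min(trials, key=trials.get)
        # Step 5: stop when no omission lowers the AIC
        if trials[drop] >= best_aic:
            break
        # Step 4: remove the feature whose omission helps the most
        current.remove(drop)
        best_aic = trials[drop]
    return current, best_aic

# Synthetic data: only x1 carries signal
rng = np.random.default_rng(3)
n = 200
toy = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
toy["y"] = 2.0 + 1.5 * toy["x1"] + rng.normal(size=n)

full_aic = smf.ols("y ~ x1 + x2 + x3", data=toy).fit().aic
kept, final_aic = backward_select(toy, "y", ["x1", "x2", "x3"])
print(kept, round(full_aic, 2), round(final_aic, 2))
```

By construction the returned model never has a higher AIC than the full model, because a feature is only dropped when doing so strictly lowers the criterion.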

In this case we are going to see if the variables are statistically significant in the presence of the other variables (at the 0.05 level).


In [None]:
full_mod = smf.ols("score ~ ethnicity + bty_avg  + cls_perc_eval + gender + rank + age", data = data).fit()
full_summary = full_mod.summary()
print(full_summary)

The binary categorical variable ethnicity is not significant. We can remove it from the model.

In [None]:
smaller_mod = smf.ols("score ~ bty_avg  + cls_perc_eval + gender + rank + age", data = data).fit()
smaller_summary = smaller_mod.summary()
print(smaller_summary)

Everything else is statistically significant. We can stop with this model.
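The significance checks above were read off the printed summaries. The same numbers are available programmatically: a fitted statsmodels result exposes per-coefficient p-values through `.pvalues`, and `anova_lm` gives one test per term, which is useful when a categorical variable with more than two levels is spread over several dummy coefficients. The sketch below uses synthetic data with an invented three-level factor `level`, and the choice of a type II table is an assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data: y depends on x; the three-level factor has no true effect
rng = np.random.default_rng(2)
n = 300
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "level": rng.choice(["low", "mid", "high"], size=n),
})
toy["y"] = 1.0 + 1.5 * toy["x"] + rng.normal(size=n)

fit = smf.ols("y ~ x + level", data=toy).fit()

# One p-value per coefficient (the factor contributes two dummy coefficients)
coef_pvalues = fit.pvalues
print(coef_pvalues)

# One F-test per term, so the factor is assessed as a whole
term_table = anova_lm(fit, typ=2)
print(term_table["PR(>F)"])
```

For a binary factor the two views agree, since the term has a single coefficient. For a multi-level factor, the term-level test is usually the more natural basis for deciding whether to drop the variable.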

## Your turn: Feature Selection for Iris Data

Perform forward feature selection on Fisher's *iris* dataset. Use `Petal_Length` as the response variable and the remaining four variables as the predictors. Note that `Species` is a categorical factor.

Use either BIC or AIC as your criterion. Recall that statsmodels reports both on the $-2\times\text{log-likelihood}$ scale, so in Python "lower BIC/AIC" equates to "better model." See [here](https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other) for some more details about the two metrics.

Then perform backward feature selection using a p-value cutoff of 0.06.

In [None]:
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
iris = pd.read_csv(csv_url, names=col_names)

print(iris)

## Check your understanding

Consider the two models you built. Which of them would you choose? Convince me using at least 2 pieces of evidence.  