# Review module

**Instructions**

To complete this review module, follow these instructions:

1. Complete the code cells provided in this notebook, but do **not** change the names of the variables. If you rename a variable, the autograder will fail and you will not receive any points.
2. Run all the answered cells before you run the testing cells. The answers must exist before they are graded!
3. In each cell, replace the code `raise NotImplementedError()` with your implementation.
4. Do not round any quantity.

### Exercise 1 (1 point)

The `wine` DataFrame contains data for about 1,600 Portuguese red wine samples, including quality ratings assigned by human tasters (the `quality` column) and chemical markers obtained via lab tests:

In [1]:
import pandas as pd
import statsmodels.formula.api as sm
wine = pd.read_csv("data/winequality-red.csv", sep=";")
wine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


Using `statsmodels`, code this simple linear regression:

$$
quality = \beta_0 + \beta_1 alcohol + \varepsilon
$$

Save the **fitted** model in a variable called `reg1`.

In [2]:
# YOUR CODE HERE

reg1 = sm.ols(formula='quality ~ alcohol', data=wine).fit()
reg1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | quality | R-squared: | 0.227 |
| Model: | OLS | Adj. R-squared: | 0.226 |
| Method: | Least Squares | F-statistic: | 468.3 |
| Date: | Sat, 21 Jan 2023 | Prob (F-statistic): | 2.83e-91 |
| Time: | 02:53:00 | Log-Likelihood: | -1721.1 |
| No. Observations: | 1599 | AIC: | 3446.0 |
| Df Residuals: | 1597 | BIC: | 3457.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.8750 | 0.175 | 10.732 | 0.000 | 1.532 | 2.218 |
| alcohol | 0.3608 | 0.017 | 21.639 | 0.000 | 0.328 | 0.394 |

| | | | |
|---|---|---|---|
| Omnibus: | 38.501 | Durbin-Watson: | 1.748 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 71.758 |
| Skew: | -0.154 | Prob(JB): | 2.62e-16 |
| Kurtosis: | 3.991 | Cond. No. | 104.0 |
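
To make the coefficient table concrete, here is a minimal sketch that plugs the two printed coefficients into the regression equation from this exercise. The numbers are copied from the summary above as rounded in the printout, and the helper name `predict_quality` is hypothetical, not part of the notebook or of `statsmodels`.

```python
# Coefficients copied from the reg1 summary above (rounded as printed).
intercept = 1.8750      # beta_0
slope_alcohol = 0.3608  # beta_1

def predict_quality(alcohol):
    """Return the fitted-line value beta_0 + beta_1 * alcohol."""
    return intercept + slope_alcohol * alcohol

# Predicted quality for an alcohol value of 10.0
print(predict_quality(10.0))

# Raising alcohol by one unit changes the prediction by exactly beta_1
print(predict_quality(11.0) - predict_quality(10.0))
```

With the fitted model in hand, `reg1.predict(...)` would do the same computation at full precision; the hand calculation is only meant to show how the `coef` column is read.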


### Exercise 2 (1 point)

Using `statsmodels`, code this multiple linear regression:

$$
quality = \beta_0 + \beta_1 alcohol + \beta_2 pH + \beta_3 sulphates + \beta_4 density + \beta_5 residual{\ }sugar + \beta_6 citric{\ }acid + \varepsilon
$$

Save the **fitted** model in a variable called `reg2`.

**Hint:** `statsmodels` formulas expect variable names that don't have spaces in them. To include a variable name with spaces, use [`Q` syntax](https://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q) (`Q` stands for "quote") like this: `Q("total sulfur dioxide")`
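
As a small illustration of the `Q` syntax from the hint, the sketch below fits a regression on a hypothetical toy DataFrame whose only predictor, `unit price`, has a space in its name. The data and column names are invented for the example and have nothing to do with the `wine` dataset.

```python
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical toy data: "sales" is exactly 1 + 2 * "unit price"
toy = pd.DataFrame({
    "unit price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [3.0, 5.0, 7.0, 9.0, 11.0],
})

# Q("...") quotes the column name so the formula parser accepts the space
toy_fit = sm.ols(formula='sales ~ Q("unit price")', data=toy).fit()

# The quoted name is also the label of the coefficient in the results
print(toy_fit.params)
```

Note that the coefficient is labeled `Q("unit price")`, so that is the key to use when looking it up in `params`.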

In [3]:
# YOUR CODE HERE
reg2 = sm.ols(formula='quality ~ alcohol + pH + Q("sulphates") + density + Q("residual sugar") + Q("citric acid")', data=wine).fit()
reg2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | quality | R-squared: | 0.289 |
| Model: | OLS | Adj. R-squared: | 0.286 |
| Method: | Least Squares | F-statistic: | 107.6 |
| Date: | Sat, 21 Jan 2023 | Prob (F-statistic): | 5.38e-114 |
| Time: | 02:53:04 | Log-Likelihood: | -1654.4 |
| No. Observations: | 1599 | AIC: | 3323.0 |
| Df Residuals: | 1592 | BIC: | 3361.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 16.9484 | 13.314 | 1.273 | 0.203 | -9.166 | 43.063 |
| alcohol | 0.3409 | 0.022 | 15.822 | 0.000 | 0.299 | 0.383 |
| pH | -0.4047 | 0.139 | -2.911 | 0.004 | -0.677 | -0.132 |
| Q("sulphates") | 0.8048 | 0.107 | 7.498 | 0.000 | 0.594 | 1.015 |
| density | -14.1879 | 13.255 | -1.070 | 0.285 | -40.187 | 11.811 |
| Q("residual sugar") | -0.0085 | 0.014 | -0.626 | 0.532 | -0.035 | 0.018 |
| Q("citric acid") | 0.3996 | 0.121 | 3.306 | 0.001 | 0.162 | 0.637 |

| | | | |
|---|---|---|---|
| Omnibus: | 37.898 | Durbin-Watson: | 1.745 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 59.752 |
| Skew: | -0.216 | Prob(JB): | 1.06e-13 |
| Kurtosis: | 3.843 | Cond. No. | 12500.0 |
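
The summary reports both `R-squared` and `Adj. R-squared`. As a rough cross-check, the sketch below recomputes the adjusted value from the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, using the observation count and `Df Model` from the summary and the full-precision $R^2$ that this notebook prints in Exercise 3. The variable names are illustrative only.

```python
# Values taken from this notebook's output for reg2
r_squared = 0.28855431357406913  # full-precision R squared printed in Exercise 3
n_obs = 1599                     # "No. Observations" in the summary
n_predictors = 6                 # "Df Model" in the summary

# Adjusted R squared penalizes R squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))
```

Rounded to three decimals, the result agrees with the 0.286 shown in the summary; on a fitted model the same quantity is available as `reg2.rsquared_adj`.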


### Exercise 3 (1 point)

Complete the code in the cell below to extract the following values from the model you fitted in the previous exercise:

* The $R^2$ (save it in a variable called `rsq`)
* The intercept $\beta_0$ (save it in a variable called `beta0`)
* The coefficient of the $alcohol$ variable, $\beta_1$ (save it in a variable called `beta1`)

**Note:** Do not change the name of the function.

In [4]:
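def extract_results(fitted_model):
    """
    Extracts R squared, intercept and beta 1 from the model.
    """
    # YOUR CODE HERE
    rsq = fitted_model.rsquared
    # params is a pandas Series indexed by term name; look values up by label,
    # since positional access such as params[0] is deprecated in recent pandas
    beta0 = fitted_model.params['Intercept']
    beta1 = fitted_model.params['alcohol']
    return rsq, beta0, beta1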

rsq, beta0, beta1 = extract_results(reg2)
print("R squared: ", rsq)
print("Intercept: ", beta0)
print("Beta 1: ", beta1)

R squared:  0.28855431357406913
Intercept:  16.948376515840327
Beta 1:  0.3408938824939653


## Testing cells

Run the cells below to check your answers. Run your solution cells first; otherwise, these checks will raise an error.

In [5]:
# Ex. 1
assert "reg1" in globals(), "Ex. 1 - You need to create the reg1 variable!"
assert type(reg1).__name__ == 'RegressionResultsWrapper', "Ex. 1 - The reg1 variable is not a statsmodels regression model! Did you forget to fit it? Remember to use the .fit() method!"
print("Exercise 1 passed the preliminary sanity check.")

Exercise 1 passed the preliminary sanity check.
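
The type check in the assertion above distinguishes a model that has been fitted from one that has only been specified. A minimal sketch of the difference, using a hypothetical toy DataFrame rather than the `wine` data:

```python
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical toy data, only used to build a model object
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.1, 3.9, 6.2, 7.8],
})

unfitted = sm.ols(formula="y ~ x", data=toy)  # model specification only
fitted = unfitted.fit()                       # estimation happens here

unfitted_name = type(unfitted).__name__
fitted_name = type(fitted).__name__
print(unfitted_name)
print(fitted_name)
```

Only the object returned by `.fit()` has the type name the sanity check looks for, which is why forgetting `.fit()` makes the assertion fail.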


In [6]:
# Ex. 2
assert "reg2" in globals(), "Ex. 2 - You need to create the reg2 variable!"
assert type(reg2).__name__ == 'RegressionResultsWrapper', "Ex. 2 - The reg2 variable is not a statsmodels regression model! Did you forget to fit it? Remember to use the .fit() method!"
assert len(reg2.params) == 7, "Ex. 2 - Your regression has either too many or too few input variables! There must be seven betas: One intercept and six input variables"
print("Exercise 2 passed the preliminary sanity check.")

Exercise 2 passed the preliminary sanity check.


In [7]:
# Ex. 3
assert "extract_results" in globals(), "Ex. 3 - You need to create the extract_results function! Did you forget to run the cell where you defined it?"
print("Exercise 3 passed the preliminary sanity check.")

Exercise 3 passed the preliminary sanity check.


## Attribution

"Wine quality", October 07, 2009, Cortez, Paulo *et al.*, Creative Commons Attribution 4.0 International license, https://archive-beta.ics.uci.edu/ml/datasets/wine+quality