# Topics in Econometrics and Data Science: Tutorial 7

#### General Note

You will very likely find the solution to these exercises online. We, however, strongly encourage you to work on these exercises without doing so. Understanding someone else's solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

# Section 2: Linear Regression

## Exercise 1: Linear Regression: Prediction

We consider a regression problem in which we want to predict the dependent variable $Y$ from the explanatory variable $X$. In this exercise, we try to construct an optimal prediction of $Y$ given $X$.

### A)

1. Load the [`prediction.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/prediction.csv) data set and assess its structure. \
\
**Hint**: You can use [`pandas.DataFrame.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) to assess the structure of the data set.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [None]:
os.chdir("[INSERT YOUR PATH HERE!]")

data = pd.read_csv('data/prediction.csv', sep=',', na_values=".")
data.head()

|   | x | y |
|---|---|---|
| 0 | 1.047198 | 9.242999 |
| 1 | 1.186824 | 6.86105 |
| 2 | 1.32645 | -2.258883 |
| 3 | 1.466077 | 3.645338 |
| 4 | 1.605703 | 5.755031 |

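Beyond `shape`, a few other pandas attributes and methods help to assess the structure of a data set. The following is a minimal sketch on a small hand-made frame called `toy`; the name and the values are assumptions for illustration and are not taken from `prediction.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for a loaded CSV file; values are made up.
toy = pd.DataFrame({"x": [1.0, 1.5, 2.0, 2.5], "y": [0.3, np.nan, 1.1, 0.8]})

n_rows, n_cols = toy.shape   # tuple: (number of rows, number of columns)
print(toy.shape)             # (4, 2)
print(toy.dtypes)            # data type of each column
print(toy.isna().sum())      # number of missing values per column
print(toy.describe())        # summary statistics of the numeric columns
```

The same calls can be applied to `data` once the file has been read.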

2. Generate a scatter plot of the $Y$ values against the $X$ values. Use [`plt.figure`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html) to create the figure and `plt.scatter` to draw the points.

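As a rough illustration of the plotting calls, the sketch below draws a scatter plot of synthetic data. The arrays `x_demo` and `y_demo` and the sine-shaped relationship are assumptions made up for this example; they are not the exercise data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only (not the contents of prediction.csv).
rng = np.random.default_rng(0)
x_demo = np.linspace(1, 7, 50)
y_demo = np.sin(x_demo) + rng.normal(scale=0.5, size=x_demo.size)

fig = plt.figure(figsize=(8, 5))                    # create an empty figure
plt.scatter(x_demo, y_demo, label="observations")   # one point per (x, y) pair
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter plot of Y against X")
plt.legend()
ax = fig.gca()                                      # handle to the axes, useful for adding lines later
```

In a notebook the figure is displayed at the end of the cell; in a script, `plt.show()` would be needed.
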
### B)

Now, we want to run a linear regression using the [`sklearn`](https://scikit-learn.org/stable/) package and add the regression line to the scatter plot. Then, we compute the in-sample Mean Squared Error (MSE) in prediction. \
\
**Hint:** *in-sample* means that we estimate the model and evaluate its performance using the same data.

1. To solve this task, first, save `x` and `y` in two separate data frames using [`pd.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

2. Run a linear regression using [`sklearn.linear_model.LinearRegression`](https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html). \
\
**Hint**: First, fit the linear model (`sklearn.linear_model.LinearRegression.fit`) using the entire data set ($X$ and $Y$). Then, use the estimated/fitted linear model to predict $Y$ (`sklearn.linear_model.LinearRegression.predict`) based on the same $X$ data that was used to estimate/fit the model. 

In [1]:
from sklearn.linear_model import LinearRegression
linmod = LinearRegression()

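The fit-then-predict pattern described in the hint can be sketched on a tiny noise-free example where the true line is known. The names `x_demo`, `y_demo`, `model_demo` and `y_hat_demo` are hypothetical, and the data are constructed so that $y = 2 + 3x$ holds exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2 + 3x (illustration only).
x_demo = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y_demo = pd.DataFrame({"y": [2.0, 5.0, 8.0, 11.0]})

model_demo = LinearRegression()
model_demo.fit(x_demo, y_demo)            # estimate intercept and slope by least squares
y_hat_demo = model_demo.predict(x_demo)   # in-sample predictions, array of shape (4, 1)

print(model_demo.intercept_)   # approximately [2.]
print(model_demo.coef_)        # approximately [[3.]]
```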
3. Add the regression line to the scatter plot created in part A.

4. Compute the in-sample MSE in prediction.

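For reference, the MSE of predictions $\hat{y}_i$ for observations $y_i$, $i = 1, \dots, n$, is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The sketch below computes it in two ways on four made-up numbers, once by hand with NumPy and once with `sklearn.metrics.mean_squared_error`. The arrays `y_true_demo` and `y_pred_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed values and predictions (illustration only).
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Manual computation: average of the squared prediction errors.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# Same quantity via scikit-learn.
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)   # both 0.375
```

The squared errors are 0.25, 0.25, 0 and 1, so their mean is 1.5 / 4 = 0.375.
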
### C)

In part C, we evaluate how well the model estimated in part B predicts the new observations provided in the data set `predictiontest.csv`. Compute the out-of-sample MSE in prediction and illustrate your results in the scatter plot. \
\
**Hint:** *out-of-sample* means that the model is estimated using one data set and its performance is evaluated on a different data set.

Load the [`predictiontest.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/predictiontest.csv) data set:

In [None]:
os.chdir("[INSERT YOUR PATH HERE!]")

datatest = pd.read_csv('data/predictiontest.csv', sep=',', na_values=".")
datatest.head()

|   | xtest | ytest |
|---|---|---|
| 0 | 1.047198 | 5.808596 |
| 1 | 1.186824 | 1.27423 |
| 2 | 1.32645 | 4.892476 |
| 3 | 1.466077 | -3.124069 |
| 4 | 1.605703 | 6.282259 |


In [None]:
datatest.shape

(49, 2)

1. First, rename the columns of `datatest` to `x` and `y` using [`pandas.DataFrame.rename`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html#pandas-dataframe-rename).

2. Save the columns `x` and `y` in two separate data frames using [`pd.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and name them, e.g., `xtest` and `ytest`.

3. Then, generate out-of-sample predictions. For this, use the estimated model from part B and predict the new observations provided in `predictiontest.csv`. \
\
**Hint:** Use `sklearn.linear_model.LinearRegression.predict`.

4. Create a scatter plot showing the test data and the predicted values.

5. Compute the out-of-sample MSE in prediction.

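The steps above (rename, split, predict, evaluate) can be sketched on synthetic data as follows. The frames `train_demo` and `test_demo` and the linear data-generating process are assumptions for illustration. Renaming matters because scikit-learn, when fitted on a data frame, compares the column names seen during `fit` with those passed to `predict`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic training and test samples from the same (made-up) process.
train_demo = pd.DataFrame({"x": np.linspace(0, 5, 40)})
train_demo["y"] = 1.0 + 2.0 * train_demo["x"] + rng.normal(size=40)
test_demo = pd.DataFrame({"xtest": np.linspace(0, 5, 40)})
test_demo["ytest"] = 1.0 + 2.0 * test_demo["xtest"] + rng.normal(size=40)

# Rename the test columns so that they match the names used when fitting.
test_demo = test_demo.rename(columns={"xtest": "x", "ytest": "y"})

# Estimate the model on the training sample only.
model_demo = LinearRegression().fit(train_demo[["x"]], train_demo[["y"]])

# Out-of-sample: predict observations the model has not seen during fitting.
y_hat_test = model_demo.predict(test_demo[["x"]])
mse_oos_demo = mean_squared_error(test_demo[["y"]], y_hat_test)
print(mse_oos_demo)
```

Note that the model is not refitted on the test sample; only `predict` is called on it.
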
### D) 

In part D, we try to improve the predictive performance by including higher-order polynomial terms of the variable $X$ in our regression. Then, we add the predictions from the polynomial model to the scatter plot. \
Do this for both the in-sample predictions and the out-of-sample predictions.

1. Transform `x` into polynomial features of, e.g., degree $q=5$ using [`sklearn.preprocessing.PolynomialFeatures`](https://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) and its method `fit_transform`.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

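To see what the transformation produces, here is a small sketch with degree 3 on three made-up values. The names `x_demo`, `poly_demo`, `x_poly_demo` and `x_poly_df` are hypothetical, and the column labels are chosen by hand for readability. By default the transformer also adds a constant column of ones; the `include_bias` parameter controls this.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Three made-up x values (illustration only).
x_demo = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

poly_demo = PolynomialFeatures(degree=3)        # columns: 1, x, x^2, x^3
x_poly_demo = poly_demo.fit_transform(x_demo)   # NumPy array of shape (3, 4)

# Wrap the array in a data frame with hand-picked column labels.
x_poly_df = pd.DataFrame(x_poly_demo, columns=["1", "x", "x^2", "x^3"])
print(x_poly_df.head())
# rows: [1, 1, 1, 1], [1, 2, 4, 8], [1, 3, 9, 27]
```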
2. Convert your transformed `x` into a data frame using `pandas.DataFrame` and display the first few rows of the transformed data.

3. Then, estimate/fit the model based on your transformed `x` data and `y`.

4. Generate the in-sample and out-of-sample predictions. \
\
**Hint:** Before you can generate the out-of-sample predictions, you need to generate transformations (i.e. high-order polynomials) for the x-values (`xtest`) of the test sample.

5. Compute the in-sample and out-of-sample MSE.

6. Add the predictions from the polynomial model to the scatter plot. For this, create two scatter plots: one for the in-sample predictions and one for the out-of-sample predictions. \
\
**Hint:** The first scatter plot shows the training data (based on the data set `prediction.csv`), the in-sample predicted values of the linear model of task B and the in-sample predicted values of the polynomial model. The second scatter plot shows the test data (based on the data set `predictiontest.csv`), the out-of-sample predicted values of the linear model of task C and the out-of-sample predicted values of the polynomial model.

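One possible layout for the two plots is sketched below on synthetic data. All names (`x_tr`, `y_tr`, `x_te`, `y_te`, `lin_demo`, `poly_model_demo`) and the sine-shaped data are assumptions for illustration. The sketch fits the transformer on the training `x` and reuses it with `transform` on the test `x`, and it relies on the x values being sorted so that the prediction lines are drawn left to right.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)

# Synthetic nonlinear data (illustration only), stored as column vectors of shape (n, 1).
x_tr = np.linspace(1, 7, 50).reshape(-1, 1)
y_tr = np.sin(x_tr) + rng.normal(scale=0.3, size=x_tr.shape)
x_te = np.linspace(1, 7, 50).reshape(-1, 1)
y_te = np.sin(x_te) + rng.normal(scale=0.3, size=x_te.shape)

# Linear benchmark, estimated on the training sample.
lin_demo = LinearRegression().fit(x_tr, y_tr)

# Polynomial model: set up the transformation on the training x, reuse it on the test x.
poly_demo = PolynomialFeatures(degree=5)
x_tr_poly = poly_demo.fit_transform(x_tr)
x_te_poly = poly_demo.transform(x_te)
poly_model_demo = LinearRegression().fit(x_tr_poly, y_tr)

# Two panels: in-sample (left) and out-of-sample (right).
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
panels = [(x_tr, y_tr, x_tr_poly, "In-sample"), (x_te, y_te, x_te_poly, "Out-of-sample")]
for ax, (x_raw, y_obs, x_poly, title) in zip(axes, panels):
    ax.scatter(x_raw, y_obs, s=15, color="grey", label="data")
    ax.plot(x_raw, lin_demo.predict(x_raw), color="tab:blue", label="linear")
    ax.plot(x_raw, poly_model_demo.predict(x_poly), color="tab:red", label="polynomial, q=5")
    ax.set_title(title)
    ax.set_xlabel("X")
    ax.legend()
axes[0].set_ylabel("Y")
```

Both models are estimated once on the training sample; the right panel only evaluates them on the test sample.
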
### E)

Increase the polynomial order in your approximation of the regression curve and see how the in-sample and out-of-sample MSE behave.

### F)

In task F, generate a plot of the in-sample and out-of-sample MSE as a function of the polynomial order $q$ in the regression function.

1. First, you need to write a for-loop that iterates over different values of degree $q$ to generate the in-sample and out-of-sample predictions and calculate the in-sample MSE and out-of-sample MSE. Include the following steps in your for-loop: \
\
i. For each value of $q$, generate polynomial features of degree $q$ for both the training data (`x`) and test data (`xtest`). \
ii. Estimate/fit a polynomial model using the polynomial features of degree $q$ on the training data (transformed `x` and `y`). \
iii. Make in-sample and out-of-sample predictions. \
iv. For each value of $q$, compute the in-sample MSE and the out-of-sample MSE.

In [6]:
q = np.arange(1,12,1)
MSE_ins = np.zeros(len(q))
MSE_oos = np.zeros(len(q))

2. Finally, generate the plot.
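
The loop steps i. to iv. and the final plot can be sketched as follows. The sketch reuses the names `q`, `MSE_ins` and `MSE_oos` from the cell above, but runs on synthetic data: `x_tr`, `y_tr`, `x_te`, `y_te` and the sine-shaped relationship are assumptions for illustration, so the resulting curves say nothing about the exercise data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Synthetic training and test samples (illustration only).
x_tr = np.linspace(1, 7, 50).reshape(-1, 1)
y_tr = np.sin(x_tr) + rng.normal(scale=0.3, size=x_tr.shape)
x_te = np.linspace(1, 7, 50).reshape(-1, 1)
y_te = np.sin(x_te) + rng.normal(scale=0.3, size=x_te.shape)

q = np.arange(1, 12, 1)          # polynomial degrees 1, ..., 11
MSE_ins = np.zeros(len(q))
MSE_oos = np.zeros(len(q))

for i, degree in enumerate(q):
    # i. polynomial features of the current degree for training and test x
    poly = PolynomialFeatures(degree=int(degree))
    x_tr_poly = poly.fit_transform(x_tr)
    x_te_poly = poly.transform(x_te)
    # ii. fit the model on the transformed training data
    model = LinearRegression().fit(x_tr_poly, y_tr)
    # iii. and iv. predict and store both MSE values
    MSE_ins[i] = mean_squared_error(y_tr, model.predict(x_tr_poly))
    MSE_oos[i] = mean_squared_error(y_te, model.predict(x_te_poly))

# Plot both error curves against the polynomial degree.
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(q, MSE_ins, marker="o", label="in-sample MSE")
ax.plot(q, MSE_oos, marker="s", label="out-of-sample MSE")
ax.set_xlabel("polynomial degree q")
ax.set_ylabel("MSE")
ax.legend()
```

Storing the results by position `i` rather than by `degree` keeps the arrays aligned with `q`, which starts at 1 rather than 0.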