## Description


In today's class, we will learn how to perform linear regression.  
It's a method of fitting a linear function to a data set with a single predicted numerical variable $Y$ and several explanatory variables (a.k.a. predictors or independent variables) stored in a matrix $X$.  
The model of linear regression can be written as  
$$Y_i = X_i\beta + \epsilon_i,$$
where $X_i$ is the $i$-th row of the matrix $X$ and $\epsilon_i$ is the effect of factors that influence the value of $Y$ but are not included in $X$. This term is called the *error*, which is a slightly misleading name: it does not denote a mistake, only the part of $Y$ that the predictors do not explain.
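
To make the roles of $X$, $\beta$ and $\epsilon$ concrete, the following sketch simulates data from the model and then recovers the coefficients by least squares. Everything in it is an assumption made for illustration: the sample size, the two made-up predictors, the values in `beta_true`, the noise scale and the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n = 200
# Design matrix: a column of ones (intercept) and two made-up predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical coefficients
epsilon = rng.normal(scale=0.3, size=n)  # influences on Y that are not in X

# The model itself: Y_i = X_i beta + epsilon_i
Y = X @ beta_true + epsilon

# Least-squares estimate of beta, computed from X and Y only
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print('true coefficients:     ', beta_true)
print('estimated coefficients:', beta_hat)
```

The estimates should land close to `beta_true` but not exactly on it, because the error term is never observed directly.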

When we want to include categorical variables as predictors (e.g. to model the differences in income between voivodeships), we create so-called *dummy variables*: for each category of the variable, we create a binary dummy variable that is equal to 1 if a given observation belongs to this category and 0 otherwise.
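
A minimal sketch of this encoding with `pandas.get_dummies` is shown below. The tiny data frame, its column names and its values are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Made-up data: four municipalities with a categorical 'voivodeship' column.
df = pd.DataFrame({
    'voivodeship': ['Mazowieckie', 'Pomorskie', 'Mazowieckie', 'Lubelskie'],
    'income': [5.2, 4.1, 6.0, 3.3],
})

# One binary column per category: 1 if the row belongs to it, 0 otherwise.
dummies = pd.get_dummies(df['voivodeship'], dtype=int)
print(dummies)

# When the model also contains an intercept, one category is usually dropped.
# Otherwise the dummy columns sum to the constant column (perfect collinearity).
dummies_ref = pd.get_dummies(df['voivodeship'], dtype=int, drop_first=True)
print(dummies_ref)
```

With `drop_first=True`, the dropped category becomes the reference level, and each remaining coefficient is interpreted as a difference relative to it.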

In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
!gdown https://drive.google.com/uc?id=1FInZ2jrlZGNColU4sHF9JKGHP39fTVut
!gdown https://drive.google.com/uc?id=1n1qS6dcVVKcVJOuUIIm0VTz6cSyrtzDH

## Data & library imports

In this notebook, we'll introduce another Python library for statistical data analysis. The `statsmodels` library implements several statistical tests and methods for linear regression.

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
from scipy.stats import norm, uniform
import statsmodels.api as sm

In [None]:
income = pd.read_csv('BDL municipality incomes 2015-2020.csv', sep=';', dtype={'Code': 'str'})
income.dropna(inplace=True)

**Exercise 1** (pen & paper or blackboard). Find the OLS estimator for the model $$y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i,$$ where $n = 4$, $x_1^T = [1, 1, 2, -4]$, $x_2^T = [-3, -3, 5, 1]$, $y^T = [1, 2, 3, 1]$. Find the residuals, the vector of the fitted values, and the value of $R^2$.
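
As a reminder for the pen-and-paper computation, the standard OLS formulas in matrix notation are collected below. Here $X$ is assumed to be the $n \times 3$ design matrix whose columns are a vector of ones, $x_1$ and $x_2$, and $\bar{y}$ denotes the sample mean of $y$. The symbols $\hat{\beta}$, $\hat{y}$ and $e$ (estimator, fitted values, residuals) are notation introduced for this summary.

$$\hat{\beta} = (X^TX)^{-1}X^Ty, \qquad \hat{y} = X\hat{\beta}, \qquad e = y - \hat{y}, \qquad R^2 = 1 - \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2}.$$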

**Exercise 2.** In this exercise, we'll learn how to estimate a linear regression model in Python. Check your answers from Exercise 1 by constructing a linear regression model. Provide an interpretation of the coefficients.

In [None]:
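# NumPy arrays (rather than plain lists) so that the arithmetic below is element-wise
x1 = np.array([1, 1, 2, -4])
x2 = np.array([-3, -3, 5, 1])
y = np.array([1, 2, 3, 1])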


X = pd.DataFrame({
    'x1': x1,
    'x2': x2,
})

# this one is important -- we have to add an intercept manually
X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

## Get the fitted values:
fitted_values_summary = results.get_prediction().summary_frame()
fitted_values = fitted_values_summary['mean']
## Compute the residuals:
residuals = y - fitted_values
## The average prediction error can be estimated by RMSE
## (it is computed on the training data, so it tends to underestimate the error on new data)
print('RMSE of the model:', np.sqrt(np.mean(residuals**2)))
# for log-transformed data, we need to transform the predictions back:
# print('RMSE of back-transformed model:', np.sqrt(np.mean((10**Y - 10**fitted_values)**2)))
## Standardize the residuals & compute the square root:
residuals_standardized = (residuals - residuals.mean())/residuals.std()
residuals_sqroot = np.sqrt(np.abs(residuals_standardized))
## Calculate R^2:
R2 = 1 - np.sum(residuals**2)/np.sum((y - np.mean(y))**2)
print('Coefficient of determination:', R2)

## Forecasting

**Exercise 4.** In this exercise, we'll learn how to use linear regression for forecasting. Construct a linear regression model that explains the income of municipalities in 2017 (our dependent variable) by their incomes in 2015 and 2016 (our independent variables).  
Calculate the RMSE (root mean squared error) of the fitted values.

Now, use this model to predict the incomes in 2018 based on the incomes from 2016 and 2017. Compute the RSS (residual sum of squares). Did the prediction error change? Why?

Predict the incomes in 2019 and 2020. Do you notice anything particular about the RMSE values? What is the consequence for forecasting with machine learning models?

In [None]:
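## With intercept (the constant column has to be added manually):
X = sm.add_constant(income[['2015', '2016']])
model = sm.OLS(income['2017'], X)
results = model.fit()
print(results.summary())

## Predict each year from the two preceding years; the constant is added
## to every new design matrix so that its columns match the fitted model.
predict2017 = results.predict()
predict2018 = results.predict(sm.add_constant(income[['2016', '2017']]))
predict2019 = results.predict(sm.add_constant(income[['2017', '2018']]))
predict2020 = results.predict(sm.add_constant(income[['2018', '2019']]))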

In [None]:
## Compute the residuals:
residuals2017 = income['2017'] - predict2017
residuals2018 = income['2018'] - predict2018
residuals2019 = income['2019'] - predict2019
residuals2020 = income['2020'] - predict2020
## The average prediction error can be estimated by RMSE.
## Only the 2017 value is computed on the training data (so it is optimistically
## biased); the 2018-2020 values are out-of-sample prediction errors.
print('RMSE of the model for predict2017:', np.sqrt(np.mean(residuals2017**2)))
print('RSS of the model for predict2017:', np.sum(residuals2017**2))

print('RMSE of the model for predict2018:', np.sqrt(np.mean(residuals2018**2)))
print('RSS of the model for predict2018:', np.sum(residuals2018**2))

print('RMSE of the model for predict2019:', np.sqrt(np.mean(residuals2019**2)))
print('RSS of the model for predict2019:', np.sum(residuals2019**2))

print('RMSE of the model for predict2020:', np.sqrt(np.mean(residuals2020**2)))
print('RSS of the model for predict2020:', np.sum(residuals2020**2))

df = pd.DataFrame({'year': ['2017','2018','2019','2020'],
                   'RSS': [np.sum(residuals2017**2),
                           np.sum(residuals2018**2),
                           np.sum(residuals2019**2),
                           np.sum(residuals2020**2)]})
print(df)
fig = px.line(df, x="year", y="RSS", title='')
fig.show()
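
The repeated RMSE and RSS computations above could also be wrapped in two small helpers and collected in one table. The sketch below is one possible way to do it. The function names `rss` and `rmse` and the numbers in `actual` and `predicted` are made up for illustration and are not results from the income data.

```python
import numpy as np
import pandas as pd

def rss(actual, predicted):
    """Residual sum of squares."""
    resid = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sum(resid ** 2))

def rmse(actual, predicted):
    """Root mean squared error, i.e. sqrt(RSS / n)."""
    return float(np.sqrt(rss(actual, predicted) / len(actual)))

# Made-up actual and predicted values for two years (three observations each)
actual = {'2017': [10.0, 12.0, 9.0], '2018': [11.0, 13.5, 9.5]}
predicted = {'2017': [10.5, 11.5, 9.0], '2018': [10.0, 12.0, 9.0]}

metrics = pd.DataFrame({
    'year': list(actual),
    'RSS': [rss(actual[year], predicted[year]) for year in actual],
    'RMSE': [rmse(actual[year], predicted[year]) for year in actual],
})
print(metrics)
```

Since RSS equals $n \cdot \text{RMSE}^2$, both measures show the same trend when every year has the same number of observations. RMSE has the advantage of being expressed in the units of the dependent variable.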