<center><img src='https://drive.google.com/uc?export=view&id=12CrUdXDAiltLBT26sG7HZ_HciIhvGyT8'></center>

# *Statistical Machine Learning* - Notebook 10, version for students
**Author: Michał Ciach**  


## Data & module loading

In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
!gdown https://drive.google.com/uc?id=1xOJfD-jexDbHSOCg1EiyAxqc5kXjMvX0

Downloading...
From: https://drive.google.com/uc?id=1GW1pjKOCoKOlC4Jqbqql_ghYD_n0iC6O
To: /content/BDL municipality incomes 2015-2020.csv
100% 228k/228k [00:00<00:00, 55.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1xOJfD-jexDbHSOCg1EiyAxqc5kXjMvX0
To: /content/protein_lengths.tsv
100% 29.3M/29.3M [00:00<00:00, 32.6MB/s]


In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import numpy.random as rd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoLars, ElasticNet
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA
from scipy.stats import uniform, norm
import statsmodels.api as sm

In [None]:
income = pd.read_csv('BDL municipality incomes 2015-2020.csv', sep=';', dtype={'Code': 'str'})
income = income.sample(frac=1)  # row permutation
income

Unnamed: 0,Code,Region,2015,2016,2017,2018,2019,2020
761,1017072,Pątnów (2),5618088.01,6041046.58,6388953.20,6540889.40,8357211.32,8.264981e+06
1333,1608013,Dobrodzień (3),12888649.88,13760652.90,15407907.28,16298020.97,18342796.59,2.053157e+07
2152,2817043,Pasym (3),7495574.18,9336405.26,9827614.61,9647872.56,11288611.57,1.658072e+07
1505,1818042,Radomyśl nad Sanem (2),8193930.06,9416574.62,9713684.07,9513654.65,9828816.82,1.061634e+07
1537,2002052,Juchnowiec Kościelny (2),28734177.62,33221505.65,34363010.82,36601698.29,44582910.88,4.738575e+07
...,...,...,...,...,...,...,...,...
2220,3008033,Kępno (3),49319597.07,52966611.14,56995670.58,61607738.32,75775056.38,7.853859e+07
1654,2202011,Chojnice (1),63683744.60,67366571.71,73469481.45,76272132.54,86835866.18,1.127390e+08
1189,1425072,Kowala (2),11443175.14,12581845.04,13966847.34,15335980.61,17293242.19,1.821279e+07
1983,2604192,Zagnańsk (2),16861060.52,17399850.73,18247830.50,20815124.37,22377275.99,2.891039e+07


In [None]:
voivodeship_names = {
    '02': 'Dolnośląskie',
    '04': 'Kujawsko-pomorskie',
    '06': 'Lubelskie',
    '08': 'Lubuskie',
    '10': 'Łódzkie',
    '12': 'Małopolskie',
    '14': 'Mazowieckie',
    '16': 'Opolskie',
    '18': 'Podkarpackie',
    '20': 'Podlaskie',
    '22': 'Pomorskie',
    '24': 'Śląskie',
    '26': 'Świętokrzyskie',
    '28': 'Warmińsko-mazurskie',
    '30': 'Wielkopolskie',
    '32': 'Zachodniopomorskie'
}
code_list = [s[:2] for s in income["Code"]]
name_list = [voivodeship_names[code] for code in code_list]
income['Voivodeship'] = name_list

In [None]:
is_a_city = [s[-1] == '1' for s in income['Code']]
city_income = income[is_a_city].dropna()
# removing zero incomes; .copy() makes an independent frame, so the in-place
# assignment below does not raise pandas' SettingWithCopyWarning
city_income = city_income[city_income.apply(lambda x: all(x!=0), axis=1)].copy()
city_income.iloc[:, 2:8] = city_income.iloc[:, 2:8].apply(np.log10)  # log-transformation
city_income


Unnamed: 0,Code,Region,2015,2016,2017,2018,2019,2020,Voivodeship
1882,2415011,Pszów (1),7.381327,7.307716,7.321516,7.344537,7.389390,7.538188,Śląskie
6,0202011,Bielawa (1),7.734602,7.768836,7.783880,7.788475,7.845954,7.831974,Dolnośląskie
170,0401011,Aleksandrów Kujawski (1),7.241407,7.205308,7.227717,7.259916,7.357145,7.366252,Kujawsko-pomorskie
531,0664011,Zamość (1),8.087314,8.113054,8.178470,8.206739,8.256196,8.286722,Lubelskie
1626,2013011,Wysokie Mazowieckie (1),7.331218,7.382608,7.404417,7.436529,7.617855,7.559426,Podlaskie
...,...,...,...,...,...,...,...,...,...
1712,2211011,Hel (1),6.982967,7.018176,7.056795,7.106389,7.246894,7.382643,Pomorskie
1304,1603011,Kędzierzyn-Koźle (1),8.167595,8.203248,8.227040,8.248268,8.306276,8.251505,Opolskie
1918,2463011,Chorzów (1),8.440343,8.443416,8.480234,8.541380,8.535107,8.546033,Śląskie
1549,2003021,Brańsk (1),6.849000,6.879375,6.912214,6.960343,7.008355,7.055538,Podlaskie


## Regularized linear models

When fitting linear models to the data (i.e. estimating the values of the $\beta$ parameters), we encounter the same *bias-variance tradeoff* as with any other estimation procedure. In our case, the estimators of $\beta$ are random variables, because they're calculated using the dependent variable $Y\sim \mathcal{N}(X\beta, \sigma^2 I)$, where $\sigma^2$ is the variance of the error term $\epsilon$.
We also have a bias-variance tradeoff when we compute predicted values for new data, because our predictions are based on the observed values of $Y$.

This tradeoff is closely connected to the problem of *overfitting*, which happens when our model fits the training data too well and learns to predict not only the trends, but also the random errors.
An overfitted model reproduces the errors from the training data set when we use it for forecasting, i.e. its predictions have a high variance and generalize poorly to new data. This usually happens when our training data set is too small, e.g. when the number of observations is similar to the number of predictors.
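
The effect can be illustrated on simulated data. The block below is a minimal sketch, not part of the course data: the sample sizes, the true coefficients and the random seed are all assumptions chosen for illustration. It fits OLS with almost as many predictors as observations and compares the RMSE on the training data with the RMSE on fresh data from the same model.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility only
n_train, n_test, p = 30, 1000, 25    # almost as many predictors as observations
beta = np.zeros(p)
beta[:3] = 1.0                       # assumed truth: only three predictors matter


def simulate(n):
    """Draw n observations from Y = X beta + eps with eps ~ N(0, 1)."""
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=1.0, size=n)
    return X, y


def rmse(X, y, b):
    """Root mean squared error of the predictions X @ b."""
    return float(np.sqrt(np.mean((y - X @ b) ** 2)))


X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Ordinary least squares without an intercept
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_rmse = rmse(X_train, y_train, beta_hat)
test_rmse = rmse(X_test, y_test, beta_hat)
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

The training error looks optimistic because the 25 coefficients have partly absorbed the noise of the 30 training observations; the error on new data is what matters for forecasting.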

One of the approaches to deal with overfitting and the bias-variance tradeoff is to use *regularized* linear models, which penalize large values of the coefficients. The most commonly used are the $L_2$-regularized model, called Ridge regression, and the $L_1$-regularized model, called the LASSO:

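$$\hat{\beta}_{\text{Ridge}} = \text{argmin}_\beta ||Y - X\beta||_2^2 + \alpha ||\beta||_2^2$$

$$\hat{\beta}_{\text{LASSO}} = \text{argmin}_\beta ||Y - X\beta||_2^2 + \alpha ||\beta||_1$$

Here $\alpha \geq 0$ controls the strength of the penalty, $||\beta||_2^2 = \sum_j \beta_j^2$ and $||\beta||_1 = \sum_j |\beta_j|$.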

Although they seem similar, they have different properties and applications.
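
To make the two penalties concrete, here is a small sketch in plain `numpy` on simulated data (the data, the seed and the value of $\alpha$ are assumptions for illustration). The Ridge estimator has the closed form $(X^TX + \alpha I)^{-1}X^TY$, while the LASSO has no closed form in general; the sketch uses coordinate descent with soft-thresholding, which is one standard way to compute it and not necessarily what any particular library does internally.

```python
import numpy as np

rng = np.random.default_rng(1)                      # arbitrary seed
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0])   # assumed truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)


def ridge(X, y, alpha):
    """Minimize ||y - Xb||^2 + alpha * ||b||_2^2 via the closed form."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso(X, y, alpha, n_iter=200):
    """Minimize ||y - Xb||^2 + alpha * ||b||_1 by coordinate descent."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]        # residual ignoring predictor j
            b[j] = soft_threshold(X[:, j] @ r_j, alpha / 2) / col_sq[j]
    return b


alpha = 80.0
b_ols = ridge(X, y, 0.0)        # alpha = 0 reduces Ridge to OLS
b_ridge = ridge(X, y, alpha)
b_lasso = lasso(X, y, alpha)

print("OLS:  ", np.round(b_ols, 3))
print("Ridge:", np.round(b_ridge, 3))
print("LASSO:", np.round(b_lasso, 3))
```

Note that `scikit-learn` scales the squared-error term differently (for example, `Lasso` divides it by twice the number of samples), so the same numerical value of `alpha` does not correspond to the same amount of regularization there.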

In the cell below, I have prepared a data set for you to use in the next few exercises. I have taken the city incomes, added squares, cubes and interactions, and then centered and scaled the variables. Note that since all the variables are centered (including the response `Y`), we don't need to include intercepts in our linear models, as they will always be equal to zero. Also note that the observations were already permuted in the *Data & module loading* section to avoid biasing cross-validation.    

In [1]:
X = city_income[['2015', '2016', '2017']].copy()
X['2015sq'] = X['2015']**2
X['2016sq'] = X['2016']**2
X['2017sq'] = X['2017']**2
X['prod56'] = X['2015']*X['2016']
X['prod67'] = X['2016']*X['2017']
X['prod57'] = X['2015']*X['2017']
X['2015cb'] = X['2015']**3
X['2016cb'] = X['2016']**3
X['2017cb'] = X['2017']**3
X['prod567'] = X['2015']*X['2016']*X['2017']

# Standardization:
original_column_means = {}
original_column_stds = {}
for c in X.columns:
  original_column_means[c] = X[c].mean()
  original_column_stds[c] = X[c].std()
  X[c] -= X[c].mean()
  X[c] /= X[c].std()

Y = city_income['2018'].copy()
Y -= Y.mean()
Y /= Y.std()


**Exercise 1.** In this exercise, we will compare the ordinary least squares, Ridge regression and Lasso to predict incomes of the Polish cities in 2018.  

1. Create an OLS (ordinary least squares) linear model explaining the income in 2018 (in the `Y` vector) using incomes in the three previous years together with the interactions between all variables (in the `X` data frame), without an intercept. Use the `statsmodels` library (the documentation is available [here](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html)). Check the summary of the model.   
  1.1. Is the whole model statistically significant? Inspect the p-value of the F test (the *Prob (F-statistic)* value in the summary header).  
  1.2. Which variables are statistically significant? Is the model simple to interpret?   
  1.3. What is the size of the model (i.e. the number of non-zero $\beta$ coefficients)? Are there any coefficients equal to zero?  
2. Evaluate the test RMSE of the model using 10-fold cross-validation. For this, you will need to create the model again, this time as a `LinearRegression` object from `scikit-learn` (with `fit_intercept=False`, since our model has no intercept). For cross-validation, use the `cross_val_score` function from `scikit-learn` with parameters `cv=10`, `scoring='neg_root_mean_squared_error'`. Note that this scoring returns the negated RMSE, so you need to flip the sign of the scores.    
3. Now, fit a Ridge regression model to the data, using the `Ridge` class from `scikit-learn`. Use the parameter value `alpha=0.001`.  
  3.1. Evaluate the RMSE using cross-validation. You can use the `cross_val_score` function from `scikit-learn`. Is it lower or higher than for the OLS? How about its standard deviation?     
  3.2. Inspect the coefficients of the model. Are there any coefficients equal to zero? Are the coefficients similar to or different from the ones from OLS? Are they smaller in magnitude? Do they have the same signs?   
  3.3. Does the model obtained with the Ridge regression have a similar interpretation to the one obtained with OLS? Answer by comparing the coefficients and checking which ones are positive or negative and their magnitudes.   
4. Now, fit a Lasso model to the data, using either the `Lasso` or the `LassoLars` class from `scikit-learn` (they implement two different algorithms to calculate the coefficients). Use the parameter value `alpha=0.001`. Note that, as opposed to Ridge and OLS, the coefficients of the Lasso model are calculated numerically using iterative procedures. If you run into convergence problems, you can try switching the algorithm or increasing the number of iterations by adjusting the `max_iter` parameter.  
  4.1. Evaluate the RMSE using cross-validation. Is it lower or higher than the one for the OLS and Ridge?   
  4.2. Inspect the coefficients of the model. Are there any coefficients equal to zero?  
  4.3. Does the fitted model have a similar interpretation to any of the two previous ones?    
5. Now, let's compare the performance of regularized linear models to the performance of a forward (or backward) model selection in OLS.  
  5.1. Use the `SequentialFeatureSelector` class from `scikit-learn` to select a set of features for prediction with OLS, in the direction (forward/backward) of your choice.  
  5.2. Fit an OLS model on the selected features.  
  5.3. Evaluate the RMSE using cross-validation and compare it to the other three approaches.   
  5.4. Inspect the coefficients of the model and compare them to the other approaches, including the interpretation of the model.   
  5.5.\* Use the `statsmodels` library to fit an OLS model on the selected features and check how model selection has influenced its statistics. In particular, are there some significant variables now? Are all variables significant?  
6. \* Try to combine two or more modelling techniques used in this exercise and see if you can obtain an even better model.    
7. Which of the tested approaches would you choose for this data set, and why?
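
The cross-validation calls used throughout this exercise follow one pattern, sketched below on simulated data (the arrays `X_demo` and `y_demo` and the seed are assumptions for illustration, not the city income data). The sketch mainly shows the sign convention: `scikit-learn` scorers are "higher is better", so `'neg_root_mean_squared_error'` returns the RMSE with a minus sign.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)                       # arbitrary seed
X_demo = rng.normal(size=(100, 5))                   # stand-in for the X data frame
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=100)

# Models without an intercept, as for the centered city income data
models = {
    "OLS": LinearRegression(fit_intercept=False),
    "Ridge": Ridge(alpha=0.001, fit_intercept=False),
}

cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=10,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores                          # flip the sign to get the RMSE
    print(f"{name}: mean RMSE = {cv_rmse[name].mean():.4f}, "
          f"std = {cv_rmse[name].std():.4f}")
```

With an integer `cv`, the folds are consecutive blocks of rows without shuffling, which is why the rows of the income data were permuted when the data were loaded.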

**Exercise 2.** In the previous exercise, Lasso and Ridge regularized models obtained a comparable performance to each other and to an OLS model constructed after a forward feature selection. This is not always the case! To highlight some of the differences between these techniques, in this exercise we will compare the performance of the Ridge and Lasso regularizations on simulated data.

1. Simulate a $240\times 100$ matrix of independent variables $X$ from the uniform distribution $\text{Unif}(0, 1)$. We will use the upper half of the matrix (the first 120 rows) as the training data set, and the lower half as the test data set.
2. Simulate 6 dependent variables $Y_1, \dots, Y_6$ by combining the following two models:
  - $\beta_1 = \dots = \beta_{100} = 1$,
  - $\beta_1 = \dots = \beta_{10} = 1,\ \beta_{11} = \dots = \beta_{100} = 0$,

  with three values of the standard deviation of the error term: $\sigma = 0.01, 0.1, 1$.

3. For each of the dependent variables, use the first 120 observations to fit a linear model, a Ridge regression model with $\alpha=3$, and a LASSO model with $\alpha=0.1$. Use each fitted model to predict the values of the dependent variable for the lower half of the $X$ matrix and calculate the test RMSE.

4. Create a bar plot showing the RMSE for the three algorithms depending on the data set. What is the advantage of each regularization? Are there cases when no regularization should be used? Is there a single regularization that works best for all data sets? Run your code several times to compare the results for different $X$ matrices.
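
The data-generating part (steps 1 and 2) can be sketched as follows. This is one possible layout, not the only one: the names `X_sim`, `betas` and `Y_sim`, the dictionary keyed by (model, $\sigma$), and the use of `np.random.default_rng` are all choices made for this illustration. Fitting the models and plotting are left as in the exercise.

```python
import numpy as np

rng = np.random.default_rng(3)                        # arbitrary seed
X_sim = rng.uniform(0.0, 1.0, size=(240, 100))        # Unif(0, 1) design matrix

# The two coefficient vectors from step 2
betas = {
    "dense": np.ones(100),                                   # all coefficients equal to 1
    "sparse": np.concatenate([np.ones(10), np.zeros(90)]),   # only the first 10 are non-zero
}
sigmas = [0.01, 0.1, 1.0]

# Six dependent variables: one per (model, sigma) combination
Y_sim = {}
for name, beta in betas.items():
    for sigma in sigmas:
        noise = rng.normal(scale=sigma, size=X_sim.shape[0])
        Y_sim[(name, sigma)] = X_sim @ beta + noise

# Upper half for training, lower half for testing
X_train, X_test = X_sim[:120], X_sim[120:]
Y_train = {key: y[:120] for key, y in Y_sim.items()}
Y_test = {key: y[120:] for key, y in Y_sim.items()}
```

Removing the fixed seed (or changing it) gives a different $X$ matrix on each run, which is what the last question of the exercise asks you to try.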


**Exercise 3.** In the previous two exercises, we have used Ridge and Lasso with seemingly arbitrary regularization parameters. In this exercise, we will finally see how to properly select the values of these parameters. The answer is, again, cross-validation. We create a range of parameter values to inspect (e.g. with `np.linspace(0, 1, num=100)`), and, for each value, we perform a cross-validation to estimate the prediction error. Finally, we select the parameter which gives the lowest error.  
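
The procedure described above can be written out by hand in a few lines. The sketch below does so for Ridge regression on simulated data, using only `numpy`; the data, the seed, the grid of values and the helper names `ridge_fit` and `cv_rmse` are assumptions for illustration. `GridSearchCV` automates the same loop and additionally stores the per-fold results.

```python
import numpy as np

rng = np.random.default_rng(4)                    # arbitrary seed
n, p = 120, 20
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n)


def ridge_fit(X, y, alpha):
    """Ridge coefficients without intercept: (X'X + alpha I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def cv_rmse(X, y, alpha, k=10):
    """Mean RMSE over k consecutive folds (rows assumed to be already shuffled)."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False              # hold out the current fold
        b = ridge_fit(X[train_mask], y[train_mask], alpha)
        resid = y[test_idx] - X[test_idx] @ b
        errors.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(errors))


alphas = np.logspace(-3, 3, num=25)               # candidate values of the parameter
scores = np.array([cv_rmse(X, y, a) for a in alphas])
best_alpha = alphas[np.argmin(scores)]
print(f"best alpha: {best_alpha:.4g}, CV RMSE: {scores.min():.4f}")
```

A logarithmic grid is used here because the useful values of the parameter often span several orders of magnitude; if the minimum falls on the edge of the grid, the range should be extended.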

1. Use the `GridSearchCV` class from `scikit-learn` to find the optimal regularization parameter for the Ridge regression on the city income data with interactions (the cell that generates these data is in the *Regularized linear models* section above). Remember that we use models without intercept for these data. Pay attention to the `cv` and `scoring` keyword arguments. Set any reasonable range of the parameter values to inspect, as we will adjust it in a moment anyway.  
  1.1. Create a data frame with the results of the grid search. Use it to create a plot of the mean test score against the value of the regularization parameter.   
  1.2. Based on the plot, see if you need to adjust the range of the parameter values to find the (approximate) optimum. Adjust it accordingly.  
  1.3. What is the optimal RMSE?   
2. Use the `GridSearchCV` function to optimize the Lasso parameter. Repeat the steps from point 1.  
  2.1. Do you encounter numerical problems for small values of the regularization parameter? If so, how would you choose the optimal value?  
  2.2. Is the optimal value of the parameter the same as for the Ridge regression? Is this a feature of the two models, or of this particular data set?   
  2.3. Is the optimal RMSE lower or higher than for the Ridge regression in this case? Do you suspect that this is a feature of Lasso, or of this particular data set?   
3. \* An *Elastic net* is a straightforward combination of Ridge and Lasso regularizations: we simply add both penalty terms to the squared-error term. Use the appropriate `scikit-learn` classes to fit an Elastic net model to the city income data. Use cross-validation to find optimal regularization parameters. Try to come up with a way to visualize the dependence of RMSE on the values of the parameters. Can you find a set of parameters which results in a lower RMSE than the Ridge and Lasso models? Is the method numerically stable for low values of the regularization parameters?   
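
In the notation used for Ridge and LASSO in the *Regularized linear models* section, the Elastic net estimator can be written with two separate penalty weights (the symbols $\alpha_1$, $\alpha_2$ are introduced here for illustration; $||\beta||_1 = \sum_j |\beta_j|$ and $||\beta||_2^2 = \sum_j \beta_j^2$):

$$\hat{\beta}_{\text{EN}} = \text{argmin}_\beta ||Y - X\beta||_2^2 + \alpha_1 ||\beta||_1 + \alpha_2 ||\beta||_2^2$$

The `ElasticNet` class in `scikit-learn` uses a different parametrization of the same idea, with an overall strength `alpha` and a mixing parameter `l1_ratio` (denoted $\rho$ below), and with the squared-error term scaled by the number of observations $n$:

$$\hat{\beta} = \text{argmin}_\beta \frac{1}{2n} ||Y - X\beta||_2^2 + \alpha \rho ||\beta||_1 + \frac{\alpha (1 - \rho)}{2} ||\beta||_2^2$$

Setting $\rho = 1$ gives a pure $L_1$ penalty and $\rho = 0$ a pure $L_2$ penalty, so a two-dimensional grid over `alpha` and `l1_ratio` covers both special cases.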

**Exercise 4.** Let's get back to Bajtowo, where the city council has lost the income data for 2017. Their income database reads as follows:

2015: 23 070 510.29 PLN  
2016: 24 454 660.81 PLN  
2017: ???  
2018: 27 085 401.60 PLN  
2019: 31 890 616.12 PLN    
2020: 33 421 082.55 PLN    

Use your newly acquired regularization skills to try and create a better model to predict the missing income. Remember to transform and scale the data properly. You may use the `original_column_means` and `original_column_stds` variables, created in the data preparation cell of the *Regularized linear models* section.

In [None]:
bajtowo = pd.DataFrame({'2015': [np.log10(23070510.29)],
                        '2016': [np.log10(24454660.81)],
                        '2018': [np.log10(27085401.60)],
                        '2019': [np.log10(31890616.12)],
                        '2020': [np.log10(33421082.55)]})
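
Since the models are fitted on log-transformed and standardized data, a prediction has to be mapped back through both transformations before it can be read in PLN. The sketch below shows the round trip; the helper names and the values `demo_mean` and `demo_std` are hypothetical placeholders for illustration, and in the notebook the corresponding entries of `original_column_means` and `original_column_stds` would be used instead.

```python
import numpy as np

# Hypothetical placeholder values; NOT computed from the city income data
demo_mean = 7.6    # mean of a log10-income column
demo_std = 0.45    # standard deviation of the same column


def standardize(log_value, mean, std):
    """Center and scale a log10-income the same way as the training column."""
    return (log_value - mean) / std


def to_pln(z, mean, std):
    """Undo the standardization, then undo the log10 transformation."""
    return 10 ** (z * std + mean)


income_2015 = 23070510.29                         # Bajtowo's income in 2015, in PLN
z_2015 = standardize(np.log10(income_2015), demo_mean, demo_std)
recovered = to_pln(z_2015, demo_mean, demo_std)   # back on the PLN scale

print(f"standardized value: {z_2015:.4f}")
print(f"recovered income: {recovered:.2f} PLN")
```

The same two steps, applied to a model's standardized prediction with the mean and standard deviation of the predicted column, give the predicted income in PLN.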
