# Machine Learning: Session 2

## Regression, regularization and cross-validation

In this task you will experiment with linear regression and see what happens when we use regularized versions of it. More precisely, you will try out Ridge and Lasso regularization. In addition, we will see how cross-validation gives more stable estimates of model performance.

Read in the data in **data.csv** and split it into training (50%) and testing (50%) sets. Use random seed 0 (`train_test_split` method).
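A minimal sketch of a 50%/50% split with a fixed seed, shown on a small synthetic frame rather than on **data.csv**. The frame `demo` and its column names (including the target name `y`) are assumptions made for illustration only; the real file may be laid out differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data.csv: 10 rows, 3 features and a target column "y".
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(10, 4)), columns=["f1", "f2", "f3", "y"])

demo_X = demo.drop(columns="y")
demo_y = demo["y"]

# test_size=0.5 gives the 50%/50% split; random_state=0 makes the shuffle reproducible.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.5, random_state=0
)
print(demo_X_train.shape, demo_X_test.shape)
```

Rerunning the cell with the same `random_state` returns the same rows in each part.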

In [None]:
import pandas as pd
import sklearn
import numpy as np

CRED = '\033[91m'
CEND = '\033[0m'

data = pd.read_csv("data.csv", index_col = 0)

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

## Task 1. Multivariate linear regression (1 point)

#### <font color='purple'>(a) Implement the fitting procedure of non-regularized multivariate ordinary least squares linear regression, as presented in the lecture slides (matrix operations). Fit on the training data and save the coefficients and the intercept for use in subtask (1c). Print out the coefficients corresponding to the first five features.</font>
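As an illustration of the matrix form, the sketch below solves the normal equations $(X^\top X)\beta = X^\top y$ on synthetic data. It assumes the intercept is handled by prepending a column of ones; the lecture slides may present it differently. All names (`X_demo`, `y_demo`, `true_w`) are hypothetical.

```python
import numpy as np

# Synthetic, noise-free data so that OLS recovers the generating parameters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_w

# Prepend a column of ones so the intercept is estimated together with the weights.
X_aug = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])

# Solve (X^T X) beta = X^T y; np.linalg.solve avoids forming an explicit inverse.
beta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)
demo_intercept, demo_coefficients = beta[0], beta[1:]
print(demo_intercept, demo_coefficients)
```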

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
first_five_my_ols_coefficients = ...

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

In [None]:
print('Coefficients of five first features according to my OLS implementation:', first_five_my_ols_coefficients)

#### <font color='purple'>(b) Use the `sklearn.linear_model.LinearRegression` learning algorithm from the sklearn package. Fit the model on the training data and save it for use in the following subtasks. Print out the coefficients corresponding to the first five features.</font>

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
first_five_sklearn_ols_coefficients = ...

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

In [None]:
print('Coefficients of five first features according to sklearn OLS implementation:', first_five_sklearn_ols_coefficients)

#### <font color='purple'>(c) Demonstrate that the methods of subtasks (1a) and (1b) give the same results by showing that they find the same coefficients and intercept. </font>

You may not get exactly the same results because of floating-point precision, so the idea is to check whether the values are equal up to some precision (e.g. check that the difference is less than 0.000001). If for some reason you cannot get the assertions to pass with the given precision, please change the precision so that they pass.
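A short illustration of why an exact equality check is too strict for floats and how a tolerance-based comparison behaves. The arrays `a` and `b` are made up for this example; `np.allclose` is one possible alternative to a loop of `abs(...) < precision` checks.

```python
import numpy as np

# Classic floating-point effect: the sum is not exactly 0.3.
exactly_equal = (0.1 + 0.2) == 0.3
close_enough = abs((0.1 + 0.2) - 0.3) < 0.000001

# Element-wise comparison of two coefficient-like vectors with an absolute tolerance.
a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9
precision = 0.000001
all_close = np.allclose(a, b, rtol=0.0, atol=precision)
max_difference = np.max(np.abs(a - b))
print(exactly_equal, close_enough, all_close)
```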

In [None]:
try:
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    raise NotImplementedError("Please assign intercepts and coefficients to the given variables.")
    my_intercept = ...
    my_coefficients = ...
    sklearn_intercept = ...
    sklearn_coefficients = ...
    precision = 0.000001
    ##### YOUR CODE ENDS ##### (please do not delete this line)
    assert(abs(my_intercept - sklearn_intercept) < precision)
    for i in range(99):
        assert(abs(my_coefficients[i] - sklearn_coefficients[i]) < precision)
    print('The assertions have passed with precision:',precision)
except NotImplementedError as e:
    print(CRED, "TODO:", e, CEND)

#### <font color='purple'>(d) Using the sklearn model from subtask (1b), predict the results on the training and testing sets, then calculate and show the root mean square errors (RMSE). Since you will need the same evaluation in later tasks, please implement a function `evaluate` for this.</font>
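For reference, RMSE is the square root of the mean squared difference between targets and predictions, $\sqrt{\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. The helper below is a hypothetical sketch of that formula only (its name `rmse_sketch` is not part of the assignment).

```python
import numpy as np

def rmse_sketch(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Every prediction is off by 2, so the RMSE is 2 as well.
print(rmse_sketch([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))
```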

In [None]:
def evaluate(regression_model_class_instance, trainX, trainY, testX, testY):
    print("\n#################\n")
    print(regression_model_class_instance, '\n')
    
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    raise NotImplementedError("Implement RMSE for train and test sets.")
    rmse_tr = ...
    rmse_te = ...
    ##### YOUR CODE ENDS ##### (please do not delete this line)
    print("RMSE train:", rmse_tr)
    print("RMSE test:", rmse_te)
    
    return rmse_tr, rmse_te

try:
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    evaluate(..., ..., ..., ..., ...)
    ##### YOUR CODE ENDS ##### (please do not delete this line)
except NotImplementedError as e:
    print(CRED, "TODO:", e, CEND)

## Task 2. Ridge & Lasso regularized regression  (1 point)

This blogpost might clarify regularization a bit: https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

Intuition behind the regularization: https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261

#### <font color='purple'>(a) Implement the fitting procedure of ridge regression, as presented in the lecture slides (matrix operations). Fit on the training data with regularization parameter equal to 1 and save the coefficients and the intercept for use in subtask (2c). Print out the coefficients corresponding to the first five features.</font>
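One common closed form for ridge regression is $w = (X_c^\top X_c + \lambda I)^{-1} X_c^\top y_c$ on centered data, with the intercept recovered afterwards so that it is not penalized. The sketch below assumes this formulation; the lecture slides may treat the intercept differently, which is one reason results can deviate. The function name and data are hypothetical.

```python
import numpy as np

def ridge_closed_form_sketch(X, y, lam=1.0):
    """Ridge fit on centered data; the intercept is left unpenalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + lam * I) w = Xc^T yc.
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return b, w

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = 1.5 + X_demo @ np.array([3.0, -2.0, 0.0, 1.0])

b0, w0 = ridge_closed_form_sketch(X_demo, y_demo, lam=0.0)  # reduces to OLS
b1, w1 = ridge_closed_form_sketch(X_demo, y_demo, lam=1.0)
print(w0, w1)
```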

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
first_five_my_ridge_coefficients = ...

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

In [None]:
print('Coefficients of five first features according to my ridge implementation:', first_five_my_ridge_coefficients)

#### <font color='purple'>(b) Use the `sklearn.linear_model.Ridge` learning algorithm from the sklearn package. Fit the model on the training data with regularization parameter equal to 1 and save it for use in the following subtasks. Print out the coefficients corresponding to the first five features.</font>

Use the parameters `solver = "cholesky", tol = 0.000000000001` to get results closer to your own implementation. The default regularization parameter is already 1, so there is no need to specify it. The parameters `solver` and `tol` are needed to force sklearn to use the closed-form solution. Otherwise it would use numerical optimization, which would differ more from your results. **In the future tasks, please use the default option and don't force it to use the closed-form solution (numerical will be faster!).**
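A minimal sketch of the call with these parameters, fitted on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration and are not the assignment data).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with 5 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=40)

# alpha=1.0 is the default regularization strength; it is written out here for clarity.
ridge_demo = Ridge(alpha=1.0, solver="cholesky", tol=0.000000000001)
ridge_demo.fit(X_demo, y_demo)

print(ridge_demo.coef_[:5], ridge_demo.intercept_)
```

The fitted weights live in `coef_` and the intercept in `intercept_`.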

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
first_five_sklearn_ridge_coefficients = ...

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

In [None]:
print('Coefficients of five first features according to sklearn ridge implementation:', first_five_sklearn_ridge_coefficients)

#### <font color='purple'>(c) Demonstrate the correctness of your implementation the same way as in the previous exercise. To do this, compare your coefficients and intercept from subtask (2a) with the coefficients and intercept from sklearn, as obtained in subtask (2b). The results can actually vary quite a bit due to implementation differences in matrix operations. Check that the differences in results (coefficients and intercept) are less than 0.02. If for some reason you cannot get the assertions to pass with the given precision, please change the precision so that they pass.</font>

In [None]:
try:
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    raise NotImplementedError("Please assign intercepts and coefficients to the given variables.")
    my_ridge_intercept = ...
    my_ridge_coefficients = ...
    sklearn_ridge_intercept = ...
    sklearn_ridge_coefficients = ...
    precision = 0.02
    ##### YOUR CODE ENDS ##### (please do not delete this line)
    assert(abs(my_ridge_intercept - sklearn_ridge_intercept) < precision)
    for i in range(99):
        assert(abs(my_ridge_coefficients[i] - sklearn_ridge_coefficients[i]) < precision)
    print('The assertions have passed with precision:',precision)
except NotImplementedError as e:
    print(CRED, "TODO:", e, CEND)

#### <font color='purple'>(d) Train a Lasso model using the sklearn package (use the default regularization parameter) and save it for future use. Print out the coefficients corresponding to the first five features.</font>

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
first_five_sklearn_lasso_coefficients = ...

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

In [None]:
print('Coefficients of five first features according to sklearn lasso implementation:', first_five_sklearn_lasso_coefficients)

#### <font color='purple'>(e) Evaluate the sklearn Ridge and Lasso models on the training and testing set and calculate and show the RMSE, using the function 'evaluate' from subtask (1d).

In [None]:
try:
    print('Evaluation of sklearn ridge regression model:')
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    evaluate(..., ..., ..., ..., ...)
    ##### YOUR CODE ENDS ##### (please do not delete this line)
    
    print('Evaluation of sklearn lasso regression model:')
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    evaluate(..., ..., ..., ..., ...)
    ##### YOUR CODE ENDS ##### (please do not delete this line)
except NotImplementedError as e:
    print(CRED, "TODO:", e, CEND)

## Task 3. Choosing a suitable regularization parameter  (1 point)

Since different parameters can lead to very different results we need to do some parameter tuning and find a suitable regularization parameter for both Ridge and Lasso. We could try out different values and see which ones lead to the best results on the test set. However, then we would overfit to our test data and we would not have an adequate estimate of how good the model is in the end. That is why we need to do parameter tuning only using the training set.

Use **alphas = np.linspace(0.01, 10, 100)** for Ridge and **alphas = np.linspace(0.01, 5, 100)** for Lasso. `np.linspace` generates 100 evenly spaced values from the first argument to the second, with both endpoints included.
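A quick check of what these grids contain; the variable names `ridge_alphas` and `lasso_alphas` are chosen for this example only.

```python
import numpy as np

ridge_alphas = np.linspace(0.01, 10, 100)
lasso_alphas = np.linspace(0.01, 5, 100)

# 100 points span 99 equal steps, so the Ridge step is (10 - 0.01) / 99.
ridge_step = ridge_alphas[1] - ridge_alphas[0]
print(len(ridge_alphas), ridge_alphas[0], ridge_alphas[-1], ridge_step)
```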

#### <font color='purple'>(a) **Method 1:** Divide the training set into training and validation set using 90%/10% split and a random seed 0 (train_test_split method). Train Ridge and Lasso models with different alpha values on the training set and calculate the RMSE values on the validation set. Choose and report the alpha that has the best RMSE for Ridge and another alpha that has the best RMSE for Lasso (save both alpha and RMSE values).
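The general pattern of a single hold-out search can be sketched as follows on synthetic data, here for Ridge only. This is an illustration under assumed data and names (`X_demo`, `val_rmse`, and so on), not the expected solution layout for `method_1`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic training data standing in for the real training set.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([4.0, 2.0, 0.0, 0.0, 1.0]) + rng.normal(size=200)

# Hold out 10% of the training data for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

alphas = np.linspace(0.01, 10, 100)
val_rmse = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_fit, y_fit)
    residuals = y_val - model.predict(X_val)
    val_rmse.append(np.sqrt(np.mean(residuals ** 2)))

# Pick the alpha with the lowest validation RMSE.
best_index = int(np.argmin(val_rmse))
best_alpha, best_rmse = alphas[best_index], val_rmse[best_index]
print(best_alpha, best_rmse)
```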

In [None]:
def method_1(model,alphas,random_seed):
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    raise NotImplementedError("Calculate RMSE for Ridge and Lasso models.")
    ...
    ##### YOUR CODE ENDS ##### (please do not delete this line)
    return best_alpha,rmse

try:
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    method_1(..., ..., ...) ## ridge
    method_1(..., ..., ...) ## lasso
    ##### YOUR CODE ENDS ##### (please do not delete this line)
except NotImplementedError as e:
    print(CRED, "TODO:", e, CEND)

#### <font color='purple'>(b) **Method 2:** Instead of doing only one training/validation split, use 10-fold cross-validation. For each alpha value, calculate the validation errors for each of the folds and average the results. Then choose and report the alpha that has the best RMSE for Ridge and another alpha that has the best RMSE for Lasso (save both alpha and RMSE values). For the 10-fold split, use the sklearn method KFold (`kf = KFold(n_splits=10, random_state = 0, shuffle = True)`). To see how to iterate through the folds, see the documentation for the method.</font>
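Iterating through the folds can be sketched like this: `kf.split` yields a pair of index arrays (training rows, validation rows) for each fold. The toy arrays below are assumptions for illustration; with pandas objects you would index by position (e.g. `.iloc`) instead.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

kf = KFold(n_splits=10, random_state=0, shuffle=True)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    # Positional indices select the rows belonging to this fold.
    X_tr, X_val = X_demo[train_idx], X_demo[val_idx]
    y_tr, y_val = y_demo[train_idx], y_demo[val_idx]
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

Each sample ends up in the validation part of exactly one fold, so averaging the per-fold errors uses every training row once for validation.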

In [None]:
def method_2(model,alphas,random_seed):
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    raise NotImplementedError("Implement 10-fold cross-validation.") 
    ...
    ##### YOUR CODE ENDS ##### (please do not delete this line)
    return best_alpha,rmse

try:
    ##### YOUR CODE STARTS ##### (please do not delete this line)
    method_2(..., ..., ...) ## ridge
    method_2(..., ..., ...) ## lasso
    ##### YOUR CODE ENDS ##### (please do not delete this line)
except NotImplementedError as e:
    print(CRED, "TODO:", e, CEND)

## Task 4. Comparing the stability of Method 1 and Method 2  (1 point)

#### <font color='purple'>(a) Run Method 1 and Method 2 both 10 times, every time using a different value 0,1,2,...,9 as the random_state. Report the best alpha and RMSE for both parameter tuning methods and for both regularization methods for each of the 10 trials.

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

#### <font color='purple'>(b) What can you say about the stability of the methods? Which one gives more stable information about which alpha to use? Which alpha values turn out to be best in the end for these data?

**Answer:**

#### <font color='purple'>(c) Create two plots (one for Ridge and one for Lasso) where on each plot there are two boxplots - one for showing the distribution of the RMSE values for the 10 trials for Method 1 and the other for Method 2.

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

#### <font color='purple'>(d) Comment on why the results look the way they do. In general, when tuning parameters, is it better to use one training-validation split or K-fold cross-validation? Why?</font>

**Answer:**

## Task 5. Regularization parameter effect on the coefficients  (1 point)

#### <font color='purple'>(a) The regularization parameter influences the values of the coefficients. Create two plots (one for Ridge and one for Lasso) that have the regularization parameter on the x-axis and coefficient values on the y-axis. You don't have to take all 99 values, you can take for example the first 20. Show each coefficient as a line (on the same plot) and comment on what happens when the regularization parameter increases. </font>
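One way to organize the data for such a plot is a matrix with one row per regularization value and one column per coefficient, sketched here on synthetic data. The data, the grid of 20 values and the names `ridge_path` and `lasso_path` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with 6 features, three of which do not influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
y_demo = X_demo @ np.array([5.0, 3.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

alphas = np.linspace(0.01, 5, 20)

# Row i holds the fitted coefficients for alphas[i].
ridge_path = np.array([Ridge(alpha=a).fit(X_demo, y_demo).coef_ for a in alphas])
lasso_path = np.array([Lasso(alpha=a).fit(X_demo, y_demo).coef_ for a in alphas])

print(ridge_path.shape, lasso_path.shape)
```

Passing `alphas` and one of these matrices to matplotlib's `plt.plot` draws one line per column, i.e. one line per coefficient.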

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

#### <font color='purple'>(b) What does Ridge regression do and what does Lasso regression do? How do they differ? </font>

**Answer:**

## Task 6. Evaluating different models  (1 point)

#### <font color='purple'>(a) Choose the values of alpha for Ridge and Lasso according to subtask (4b). Now let's see which model works best for our data by evaluating the test RMSE. Compare the following models by reporting the training and testing set RMSE: </font>

1. Non-regularized linear regression
2. Ridge regression with your chosen parameter
3. Lasso regression with your chosen parameter
4. A "dumb" model that always predicts the mean value of y_train
5. An ideal model that the instructors have used for generating the data (the true coefficients are [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, ..., 0] and intercept 0).
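Models 4 and 5 need no fitting beyond a mean, so their predictions can be written directly. The sketch below uses synthetic data generated from the stated coefficients; the feature distribution, the noise level and the number of features (99) are assumptions, and the helper `rmse` is hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# True coefficients as stated: 10, 9, ..., 1 followed by zeros; intercept 0.
n_features = 99
true_coefficients = np.zeros(n_features)
true_coefficients[:10] = np.arange(10, 0, -1)

# Assumed data-generating process: standard normal features and unit-variance noise.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(30, n_features))
X_te = rng.normal(size=(30, n_features))
y_tr = X_tr @ true_coefficients + rng.normal(size=30)
y_te = X_te @ true_coefficients + rng.normal(size=30)

# "Dumb" baseline: always predict the mean of the training targets.
baseline_pred = np.full_like(y_te, y_tr.mean())

# Ideal model: apply the true coefficients directly (intercept 0).
ideal_pred = X_te @ true_coefficients

print("baseline test RMSE:", rmse(y_te, baseline_pred))
print("ideal test RMSE:", rmse(y_te, ideal_pred))
```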

In [None]:
##### YOUR CODE STARTS ##### (please do not delete this line)

In [None]:
##### YOUR CODE ENDS ##### (please do not delete this line)

#### <font color='purple'>(b) Which method gives the best results, and which value supports that claim? Why do you think this method worked best?</font>

**Answer:**

#### <font color='purple'>(c) Were all of the "smart" models better than the "dumb" one (baseline)? What would it mean if a learned model gave worse results?</font>

**Answer:**

#### <font color='purple'>(d) Were the learned models far from the ideal one? Were the learned coefficients similar to the true ones?

**Answer:**

#### <font color='purple'>(e) Which model overfitted the most? How can you see that?</font>

**Answer:**

#### <font color='purple'>(f) Are regularized methods always better than methods without regularization (not only in this case but in general)? Why/why not?</font>

**Answer:**

## <font color='red'>This was the last task! Please restart and run all before submission!</font>