#### General guidance

This notebook is a template that guides you through the implementation of this task. We advise
reading the whole template first to get a sense of the overall structure of the code before filling in any of the TODO gaps.
This is the Jupyter notebook version of the template. For the Python file version, please refer to `template_solution.py`.

First, we import necessary libraries:

In [101]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold


# Add any additional imports here (however, the task is solvable without using 
# any additional imports)
# import ...
from sklearn.linear_model import Ridge
import os

#### Loading data

In [102]:
# Pull the data if it is not already present
DATA_PATH = 'data'
if not os.path.exists(DATA_PATH):
    !bash pull_data.sh
else:
    print("Data already fetched!")

Data already fetched!


In [103]:
df = pd.read_csv("data/train.csv")

Y = df.iloc[:, 0].to_numpy()
X = df.iloc[:, 1:].to_numpy()


#### Calculating the average RMSE

In [104]:
def calculate_RMSE(w, X, y):
    """This function takes test data points (X and y), and computes the empirical RMSE of 
    predicting y from X using a linear model with weights w. 

    Parameters
    ----------
    w: array of floats: dim = (13,), optimal parameters of ridge regression 
    X: matrix of floats, dim = (15,13), inputs with 13 features
    y: array of floats, dim = (15,), input labels

    Returns
    ----------
    RMSE: float: dim = 1, RMSE value
    """
    
    assert(w.shape == (13, ))
    assert(X.shape == (15, 13))
    assert(y.shape == (15, ))

    y_actual = y
    y_predicted = np.dot(X, w)

    assert(y_predicted.shape == (15, ))

    # Square root of the mean squared residual between labels and predictions
    RMSE = np.sqrt(np.square(np.subtract(y_actual, y_predicted)).mean())
    
    assert np.isscalar(RMSE)
    return RMSE
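
As a quick sanity check of the formula RMSE = sqrt(mean((y - Xw)^2)), the sketch below evaluates it on a tiny hand-made example. The helper name `rmse` and the toy arrays are illustrative assumptions, not part of the template; unlike `calculate_RMSE`, this helper does not assert the (15, 13) fold shapes.

```python
import numpy as np


def rmse(w, X, y):
    """Root mean squared error of the linear predictions X @ w against y."""
    residuals = y - X @ w
    return float(np.sqrt(np.mean(residuals ** 2)))


# Toy data (hypothetical): 3 samples, 2 features.
X_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
w_toy = np.array([2.0, 3.0])
y_toy = np.array([2.0, 3.0, 8.0])  # the predictions X_toy @ w_toy are [2, 3, 5]

# The residuals are [0, 0, 3], so RMSE = sqrt(9 / 3) = sqrt(3).
toy_rmse = rmse(w_toy, X_toy, y_toy)
print(toy_rmse)
```

Working through one case by hand like this makes it easier to trust the value before running it on the real folds.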

#### Fitting the regressor

In [105]:
def fit(X, y, lam):
    """
    This function receives training data points, then fits the ridge regression on this data
    with regularization hyperparameter lambda. The weights w of the fitted ridge regression
    are returned. 

    Parameters
    ----------
    X: matrix of floats, dim = (135,13), inputs with 13 features
    y: array of floats, dim = (135,), input labels
    lam: float. lambda parameter, used in regularization term

    Returns
    ----------
    w: array of floats: dim = (13,), optimal parameters of ridge regression
    """
    assert(X.shape[1] == 13)

    model = Ridge(alpha=lam, fit_intercept=False)
    model.fit(X, y)
    
    w = model.coef_
    
    assert w.shape == (13,)
    return w
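
For intuition, ridge regression without an intercept has the closed-form solution w = (X^T X + lambda * I)^(-1) X^T y. The sketch below implements it with NumPy only; the function name `fit_closed_form` and the synthetic data are assumptions made for illustration. Its result should agree with `Ridge(alpha=lam, fit_intercept=False)` up to numerical tolerance, but that comparison is not run here.

```python
import numpy as np


def fit_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge weights w."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    b = X.T @ y
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(A, b)


# Synthetic data (hypothetical), with the same shapes as one training fold.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(135, 13))
y_demo = rng.normal(size=135)
lam_demo = 10.0

w_demo = fit_closed_form(X_demo, y_demo, lam_demo)

# At the optimum, the gradient of ||y - Xw||^2 + lam * ||w||^2 vanishes.
gradient = 2 * X_demo.T @ (X_demo @ w_demo - y_demo) + 2 * lam_demo * w_demo
print(w_demo.shape, np.abs(gradient).max())
```

Checking that the gradient is numerically zero confirms that the returned weights minimize the ridge objective.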

#### Performing computation

In [106]:
"""
Main cross-validation loop, implementing 10-fold CV. In every iteration 
(for every train-test split), the RMSE for every lambda is calculated, 
and then averaged over iterations.

Parameters
---------- 
X: matrix of floats, dim = (150, 13), inputs with 13 features
y: array of floats, dim = (150, ), input labels
lambdas: list of floats, len = 5, values of lambda for which ridge regression is fitted and RMSE estimated
n_folds: int, number of folds (pieces in which we split the dataset), parameter K in KFold CV

Compute
----------
avg_RMSE: array of floats: dim = (5,), average RMSE value for every lambda
"""

# Regularization strengths to evaluate and the number of CV folds
lambdas = [0.1, 1, 10, 100, 200]
n_folds = 10

RMSE_mat = np.zeros((n_folds, len(lambdas)))

k_fold = KFold(n_splits=n_folds)

for i, (train_index, test_index) in enumerate(k_fold.split(X)):

    fold_X_train, fold_X_test = X[train_index], X[test_index]
    fold_y_train, fold_y_test = Y[train_index], Y[test_index]

    for j, lam in enumerate(lambdas):

        fold_w = fit(fold_X_train, fold_y_train, lam)
        fold_rmse = calculate_RMSE(fold_w, fold_X_test, fold_y_test)
        RMSE_mat[i, j] = fold_rmse
        

avg_RMSE = np.mean(RMSE_mat, axis=0) # avg_RMSE: array of floats: dim = (5,), average RMSE value for every lambda
display(avg_RMSE)
assert avg_RMSE.shape == (5,)
print(RMSE_mat)
print(avg_RMSE)

array([5.5036383 , 5.48040028, 5.46988555, 5.93193113, 6.2433465 ])

[[7.4412339  7.47793331 7.58146902 8.19645872 8.50748161]
 [5.12826602 4.88393133 4.45282503 3.55256057 3.60399295]
 [7.70764701 7.70279186 7.72774439 7.77994411 7.88968326]
 [4.54006065 4.50059526 4.33989198 4.94678494 5.24054741]
 [4.07531646 4.0726219  4.19425423 4.89426434 5.24272243]
 [5.10975443 5.15192098 5.39446252 7.11982551 7.87880031]
 [6.55136404 6.56151161 6.55275056 7.13545034 7.47865912]
 [6.04021985 6.09837022 6.36717035 7.59224397 7.99355906]
 [4.88759828 4.88177704 4.70584949 4.30299719 4.45025962]
 [3.5549224  3.47254925 3.38243796 3.79878164 4.14775924]]
[5.5036383  5.48040028 5.46988555 5.93193113 6.2433465 ]
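
The task only asks for the averaged RMSE values, but they are typically used to choose the regularization strength. As a minimal sketch, assuming the averages printed above, the lambda with the lowest average RMSE can be read off with `np.argmin`; the names `best_idx` and `best_lambda` are introduced here for illustration.

```python
import numpy as np

lambdas = [0.1, 1, 10, 100, 200]
# Average RMSE per lambda, copied from the output above.
avg_RMSE = np.array([5.5036383, 5.48040028, 5.46988555, 5.93193113, 6.2433465])

# Index of the smallest average RMSE and the corresponding lambda.
best_idx = int(np.argmin(avg_RMSE))
best_lambda = lambdas[best_idx]
print(best_lambda)  # prints 10
```

The gap between lambda = 1 and lambda = 10 is small compared with the spread across folds in `RMSE_mat`, so this choice should not be read as strongly significant.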


#### Creating outputs


In [107]:
# Save results in the required format
np.savetxt("./output.csv", avg_RMSE, fmt="%.12f")
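
To see what the saved file looks like, the following sketch writes the same kind of array to a temporary file and reads it back. The temporary path and the variable names `values`, `lines`, and `restored` are assumptions for illustration; the notebook itself writes to `./output.csv`.

```python
import os
import tempfile

import numpy as np

values = np.array([5.5036383, 5.48040028, 5.46988555, 5.93193113, 6.2433465])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")
    # One value per line, fixed-point notation with 12 decimal places.
    np.savetxt(path, values, fmt="%.12f")
    with open(path) as fh:
        lines = fh.read().splitlines()
    restored = np.loadtxt(path)

print(len(lines), lines[0])
```

Because a one-dimensional array is written one element per line, the file contains five lines and no header or delimiter.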

In [108]:
## end of task ##

!jupyter nbconvert --to python task.ipynb

import re  # Python regular-expression module
with open('task.py', 'r') as f_orig:
    script = re.sub(r'# In\[.*\]:\n','', f_orig.read())
    script = script.replace('## end of task ##',
"""
## Exit here, the rest is only used for creating this file
exit(0)
"""
    , 1)
    script = script.replace("get_ipython().system('bash pull_data.sh')",
"""# get_ipython().system('bash pull_data.sh')
    print("We are missing the data/ folder, please download the data manually and extract everything to data/.")
    exit(1)""", 1
)
with open('task.py','w') as fh:
    fh.write(script[:script.index("\n")])
    fh.write("""
   
## Note: This file was automatically generated from a Jupyter Notebook.

def display(X):
    print(X)

""")
    fh.write(script[script.index("\n"):])


[NbConvertApp] Converting notebook task.ipynb to script
[NbConvertApp] Writing 4358 bytes to task.py
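
The post-processing cell above relies on `re.sub` to strip the `# In[...]:` cell markers that nbconvert leaves in the exported script. A small sketch of that step on a made-up excerpt (the `sample` string is hypothetical, not actual nbconvert output):

```python
import re

# Hypothetical excerpt of an nbconvert-generated script.
sample = (
    "# In[101]:\n"
    "import numpy as np\n"
    "# In[ ]:\n"
    "x = np.zeros(3)\n"
)

# Same pattern as in the notebook: drop each marker line, including its newline.
cleaned = re.sub(r'# In\[.*\]:\n', '', sample)
print(cleaned)
```

Since `.` does not match a newline by default, each match stays within a single marker line and the code between markers is left untouched.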
