# 1 Introduction

Welcome to the fifth practical session of CS233 - Introduction to Machine Learning.  
In this exercise class we will start using Machine Learning methods to solve regression problems.

In [None]:
# Useful starting lines
%matplotlib notebook
import numpy as np
np.random.seed(42)
import matplotlib.pyplot as plt


# 2 Regression Problem

Let $f$ be a function $f: \mathbb{R}^d \rightarrow \mathbb{R}^v$ with a data set $\left(X \subseteq \mathbb{R}^d, y
\subseteq\mathbb{R}^v \right )$. The regression problem is the task of estimating an approximation $\hat{f}$ of $f$.
Within this exercise we consider the special case of $v=1$, i.e. the problem is univariate as opposed to multivariate.
Specifically, we will analyze the Boston house prices data set and predict costs based on properties such as per capita crime rate by town, pupil-teacher ratio by town etc.

We will model the given data by means of a linear regression model, i.e. a model that explains a dependent variable in terms of a linear combination of independent variables.

**Q.** How does a regression problem differ from a classification problem?  
**Q.** Why is the linear regression model a linear model? Is it linear in the dependent variables? Is it linear in the parameters?  


## 2.1 Load and inspect data

For this exercise, we use the Boston Housing dataset. The task is to predict the price of a house given a set of 12 [features](https://scikit-learn.org/stable/datasets/index.html#boston-dataset). Before jumping into the algorithm directly, it is good practice to visualize some features and the price distribution. This step is called data exploration. We start by loading the train and test data.

**Q.** Explore the relation between different features and the house prices. Describe what you see. Can you identify any trends?  
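Besides scatter plots, one possible way to get a first numerical impression of feature/price relations is the Pearson correlation between each feature column and the target. The sketch below uses synthetic stand-in data, not the Boston data; `X_demo`, `y_demo`, `corr` and `ranking` are hypothetical names introduced only for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the Boston data): 100 samples, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
# The target depends strongly on feature 0, weakly on feature 1, not on feature 2.
y_demo = 3.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Pearson correlation between each feature column and the target.
corr = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
                 for j in range(X_demo.shape[1])])

# Indices of the features, ordered from strongest to weakest linear relation.
ranking = np.argsort(-np.abs(corr))
print(corr, ranking)
```

Note that a correlation coefficient only captures linear trends, so it complements the plots rather than replacing them.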


In [None]:
X_train    = np.load('X_train.npy')
y_train    = np.load('y_train.npy')
X_test     = np.load('X_test.npy')
y_test     = np.load('y_test.npy')

In [None]:
print(f'Number of training examples: {X_train.shape[0]}')
print(f'Number of test examples: {X_test.shape[0]}')
print(f'Number of features: {X_train.shape[1]}')

In [None]:
# Plot the distribution of prices
plt.figure(figsize=(4,4))

plt.title("Price Distribution for Train Set")
plt.hist(y_train)
plt.xlabel("Price Bins $y$")
plt.ylabel("Occurrences")


In [None]:
# Exploratory analysis of the data. Have a look at the distribution of prices vs features
plt.figure(figsize=(9,4))
plt.subplot(1,2,1)
feature =  # choose different feature index
plt.scatter(X_train[:,feature], y_train)
plt.xlabel(f"Attribute {feature}")
plt.ylabel("Price $y$")
plt.title(f"Attribute {feature} vs Price $y$")

plt.subplot(1,2,2)
feature =  # choose different feature index
plt.scatter(X_train[:,feature], y_train)
plt.xlabel(f"Attribute {feature}")
plt.ylabel("Price $y$")
plt.title(f"Attribute {feature} vs Price $y$")

We normalize the data such that each feature has zero mean and unit standard deviation. Please fill in the required code and complete the function `normalize`.
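As a minimal sketch of the idea on toy data (not the exercise solution itself), the snippet below standardizes each feature column and applies the training statistics to the test data as well. `normalize_demo`, `X_tr` and `X_te` are hypothetical names, and computing the statistics along `axis=0` is an assumption about the data layout (one row per sample).

```python
import numpy as np

def normalize_demo(X, mean, std):
    # Shift each column to zero mean and scale it to unit standard deviation.
    return (X - mean) / std

# Toy data: 3 training samples and 1 test sample with 2 features each.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_te = np.array([[2.0, 30.0]])

# Statistics are computed per feature (column) on the training data only.
mean_tr = X_tr.mean(axis=0)
std_tr = X_tr.std(axis=0)

X_tr_n = normalize_demo(X_tr, mean_tr, std_tr)
# The test data reuses the training mean and std.
X_te_n = normalize_demo(X_te, mean_tr, std_tr)
print(X_tr_n.mean(axis=0), X_tr_n.std(axis=0))
```

Reusing the training statistics for the test data keeps both sets on the same scale without letting information from the test set influence preprocessing.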


In [None]:
'''
Make mean 0 and std dev 1 of the data.
'''
def normalize(X,mean,std):
    """
    Please fill in the required code here
    """ 
    X  = 
    return X

# normalize the data
mean  = 
std   = 
norm_X_train = normalize(X_train,mean,std)
norm_X_test = normalize(X_test,mean,std)

## 2.2 Closed-form solution for linear regression

We represent our output $\mathbf{y} \in \mathbb{R}^{N\times1}$ as a linear combination of the variables $\mathbf{X} \in \mathbb{R}^{N\times D}$. The objective of the linear regression task is to find a set of weights $\mathbf{w} \in \mathbb{R}^{D\times1}$ for the predictor variables that minimizes our loss, the simplest being the $l_2$ loss function shown below:
\begin{align}
L(\mathbf{w}) &=\frac{1}{N} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|^2  \\
\nabla L(\mathbf{w}) &= -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) 
\end{align}

Setting $\nabla L(\mathbf{w}) = 0$, the condition for a minimum, we get

\begin{align}
\mathbf{w} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}


This is called an [analytical solution](https://en.wikipedia.org/wiki/Linear_least_squares). 
Please use this solution to complete the functions `get_w_analytical` and `get_loss`. Tip: before implementing, think about the dimensions of the returned variable. 

**Q.** What is the time complexity of this approach?  
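To make the normal equation concrete, here is a small sketch on synthetic data (not the Boston data). It solves $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$ with `np.linalg.solve` and checks that the gradient from the formula above vanishes at the solution. `X_demo`, `y_demo`, `w_true` and `w_hat` are hypothetical names.

```python
import numpy as np

# Synthetic data generated from known weights plus a little noise.
rng = np.random.default_rng(0)
N, D = 200, 3
X_demo = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_true + rng.normal(scale=0.01, size=N)

# Solve the linear system (X^T X) w = X^T y instead of inverting X^T X.
w_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# MSE loss and its gradient at the solution, following the formulas above.
residual = y_demo - X_demo @ w_hat
mse = np.mean(residual ** 2)
grad = -2.0 / N * X_demo.T @ residual
print(w_hat, mse, np.abs(grad).max())
```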


In [None]:
def get_w_analytical(X_train,y_train):
    """
    compute the weight parameters w
    """
    
    """
    Please fill in the required code here
    """
        
    # compute w via the normal equation
    # Tip: use np.linalg.solve instead of np.linalg.inv, as it is numerically more stable
    w = 
    return w

def get_loss(w, X_train, y_train,X_test,y_test,val=False):
    # predict dependent variables and MSE loss for seen training data
    """
    Please fill in the required code here
    """
    loss_train = 
    loss_train_std = 
    
    # predict dependent variables and MSE loss for unseen test data
    """
    Please fill in the required code here
    """
    loss_test = 
    loss_test_std = 
    if not val:
        print("The training loss is {} with std:{}. The test loss is {} with std:{}.".format(loss_train, loss_train_std, loss_test,loss_test_std))
    else:
        print("The training loss is {} with std:{}. The val loss is {} with std:{}.".format(loss_train, loss_train_std, loss_test,loss_test_std))

    return loss_train, loss_test


In [None]:
# compute w and calculate its goodness
w_ana = get_w_analytical(norm_X_train,y_train)
get_loss(w_ana, norm_X_train, y_train, norm_X_test,y_test)

### 2.2.1 Adding $w_0$
We add a bias term $w_0$ to our formulation such that $y = x_D w_D + x_{D-1} w_{D-1} + \dots + x_1 w_1 + 1 \cdot w_0$. This increases the feature dimension from $D$ to $D+1$ and replaces $\mathbf{X}$ with $[\mathbf{1} ~~~ \mathbf{X}]$, which corresponds to adding a column of ones.

**Q.** How does this term help?  
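A toy illustration of the augmentation, under the assumption that the column of ones is prepended with `np.hstack` (other ways of concatenating work equally well). The data, `X_aug` and the other names are hypothetical and chosen so that the target has a non-zero offset.

```python
import numpy as np

# Toy 1-D data with a constant offset of 5: y = 5 + 2 * x.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 5.0 + 2.0 * X_demo[:, 0]

# Prepend a column of ones so that the first weight plays the role of w_0.
ones = np.ones((X_demo.shape[0], 1))
X_aug = np.hstack([ones, X_demo])

# Least-squares fit without and with the bias column.
w_no_bias = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
w_bias = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

mse_no_bias = np.mean((y_demo - X_demo @ w_no_bias) ** 2)
mse_bias = np.mean((y_demo - X_aug @ w_bias) ** 2)
print(mse_no_bias, mse_bias, w_bias)
```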


In [None]:
# add a column of ones to X_train and X_test and see if loss values change.
X_train_aug = 
X_test_aug = 
w_ana = get_w_analytical(X_train_aug,y_train)
get_loss(w_ana, X_train_aug,y_train, X_test_aug,y_test)

## 2.3 Numerical solution for linear regression


The linear regression model has an analytical solution, but we can also get the weight parameters $w$ numerically, e.g. via gradient descent. Please use this approach to complete the function `get_w_numerical` below.

**Q.** How do these results compare against those of the analytical solution? Explain the differences or similarities!   
**Q.** In which cases may it be preferable to use the numerical approach over the analytical solution?  
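The following is a minimal gradient descent sketch on synthetic, noise-free data, using the MSE gradient $-\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$ from Section 2.2. The learning rate, iteration count, and the names `mse_gradient`, `X_demo` and `w_closed` are illustrative assumptions, not values prescribed by the exercise.

```python
import numpy as np

def mse_gradient(w, X, y):
    # Gradient of (1/N) * ||y - X w||^2 with respect to w.
    return -2.0 / X.shape[0] * X.T @ (y - X @ w)

# Synthetic data generated from known weights.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = X_demo @ np.array([2.0, -1.0])

# Start from all-ones weights and take a fixed number of gradient steps.
w = np.ones(X_demo.shape[1])
lr = 0.1
for _ in range(500):
    w = w - lr * mse_gradient(w, X_demo, y_demo)

# Closed-form solution on the same data, for comparison.
w_closed = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(w, w_closed)
```

On a well-conditioned problem like this one, the iterates approach the closed-form solution; with a learning rate that is too large, the updates can diverge instead.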


In [None]:
# return the gradient of loss function
def gradient(w,X,y):
    return 

# implement the numerical method to solve linear regression
# max_iteration: number of times Gradient update happens
# lr: learning rate 
def get_w_numerical(X_train,y_train,X_test,y_test,max_iteration,lr):
    """compute the weight parameters w"""
    
    """
    Please fill in the required code here
    """
    
    # initialize the weights
    w  = np.ones(X_train.shape[1])
    
    # iterate a given number of epochs over the training data
    for iteration in range(max_iteration):
        
            
        #calculate gradient 
        grad = gradient(w,X_train,y_train)

        # update the weights
        w =
            
        if iteration % 500 == 0:
            print(f"Iteration {iteration}/{max_iteration}")
            get_loss(w, X_train,y_train, X_test,y_test)
            
    return w

In [None]:
# compute w and calculate its goodness, try different learning rate and see what happens

w_num = get_w_numerical(X_train_aug,y_train,X_test_aug,y_test,6500,1e-2)

**Q.** What is the stopping criterion for this method?

## 2.4 Ridge Regression


As seen in the lecture, we suffer from overfitting when our model complexity increases. There are different ways to tackle this problem, such as getting more data, changing the prediction method, or regularization. For the task of regression, we'll add a regularization term to our training objective to mitigate this problem. Intuitively, regularization restricts the domain from which the values of the model parameters are taken, which means that we are biasing our model.  

In Ridge Regression, we restrict the $l_2$ norm of the coefficients $\mathbf{w}$. Our loss function looks as follows:
\begin{align}
L(\mathbf{w}) &=\frac{1}{N} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|^2 + \frac{\lambda}{N}\|\mathbf{w}\|^2 \\
\nabla L(\mathbf{w}) &= -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\frac{\lambda}{N}\mathbf{w}
\end{align}

Setting $\nabla L(\mathbf{w}) = 0$, the condition for a minimum, we get

\begin{align}
\mathbf{w} &= (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}

The dimensions are as follows: $\mathbf{w}$ is $D\times1$; $\mathbf{y}$ is $N\times1$; $\mathbf{X}$ is $N\times D$; $\mathbf{I}$ is the identity matrix of dimension $D \times D$.

$\lambda$ is our penalty term, also known as weight decay. By varying its value, we control how strongly the model is biased.
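A small sketch of the ridge closed form on synthetic data, showing how the norm of $\mathbf{w}$ changes as $\lambda$ grows and checking that the regularized gradient above vanishes at the solution. `ridge_closed_form_demo`, `X_demo` and the chosen $\lambda$ values are hypothetical and only serve as an illustration.

```python
import numpy as np

def ridge_closed_form_demo(X, y, lmda):
    # Solve (X^T X + lambda * I) w = X^T y.
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lmda * np.eye(D), X.T @ y)

# Synthetic data generated from known weights plus noise.
rng = np.random.default_rng(0)
N = 50
X_demo = rng.normal(size=(N, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=N)

# Norm of the weight vector for increasing values of lambda.
lambdas_demo = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_closed_form_demo(X_demo, y_demo, l))
         for l in lambdas_demo]
print(norms)

# Gradient of the regularized loss at the solution for lambda = 10.
w10 = ridge_closed_form_demo(X_demo, y_demo, 10.0)
grad10 = -2.0 / N * X_demo.T @ (y_demo - X_demo @ w10) + 2.0 * 10.0 / N * w10
```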

**Q.** When $\lambda$ is high, is our model more or less complex?



For this task, we load the same Boston Housing dataset with a feature expansion strategy (to be covered in upcoming lectures) to get a more complex model.  
**Q.** What is the feature dimension in the complex model?  


In [None]:
# load the expanded features
X_train_expanded = np.load('X_train_feat_augmentation.npy')
X_test_expanded = np.load('X_test_feat_augmentation.npy')

# normalize the features
mean  = 
std   = 
norm_X_train = normalize(X_train_expanded,mean,std)
norm_X_test = normalize(X_test_expanded,mean,std)

# add the ones column for bias-term
X_train_aug = 
X_test_aug = 


In [None]:
def get_w_analytical_with_regularization(X_train,y_train,lmda):
    """compute the weight parameters w with ridge regression"""
    
    """
    Please fill in the required code here
    """
    #create lambda matrix 
    lmda_mat = 
    # compute w via the normal equation
    # np.linalg.solve is more stable than np.linalg.inv
    w = 
    return w

In [None]:
w_reg = get_w_analytical_with_regularization(X_train_aug,y_train,lmda=0)
get_loss(w_reg,X_train_aug,y_train,X_test_aug,y_test)

**Q.** How do the above train and test losses compare with those of the simpler regression model?  


### 2.4.1 Cross Validation


Cross-validation (CV) is used to choose the value of $\lambda$. As seen in the previous exercise, we will use K-fold CV.
We will create K splits of our training set to choose the best $\lambda$, and finally evaluate on our test set.
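The procedure can be sketched end to end on synthetic data as follows. This is only one possible arrangement: leftover samples that do not fit into equally sized folds are dropped, the $\lambda$ grid is an arbitrary choice, and `make_folds_demo` and `ridge_demo` are hypothetical helper names.

```python
import numpy as np

def make_folds_demo(num_examples, k_fold, rng):
    # Shuffle the indices and cut them into k_fold equally sized rows.
    ind = rng.permutation(num_examples)
    split_size = num_examples // k_fold
    return ind[: split_size * k_fold].reshape(k_fold, split_size)

def ridge_demo(X, y, lmda):
    # Closed-form ridge solution: (X^T X + lambda * I) w = X^T y.
    return np.linalg.solve(X.T @ X + lmda * np.eye(X.shape[1]), X.T @ y)

# Synthetic regression data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(82, 5))
y_demo = X_demo @ rng.normal(size=5) + rng.normal(scale=0.5, size=82)

folds = make_folds_demo(X_demo.shape[0], 4, rng)
lambdas = np.logspace(-3, 3, num=7)
val_loss = np.zeros(len(lambdas))

for i, lmda in enumerate(lambdas):
    fold_losses = []
    for k in range(folds.shape[0]):
        # Fold k is the validation set, the remaining folds form the train set.
        val_ind = folds[k]
        train_ind = np.delete(folds, k, axis=0).reshape(-1)
        w = ridge_demo(X_demo[train_ind], y_demo[train_ind], lmda)
        fold_losses.append(np.mean((y_demo[val_ind] - X_demo[val_ind] @ w) ** 2))
    # Average validation MSE over the folds for this lambda.
    val_loss[i] = np.mean(fold_losses)

best_lambda = lambdas[np.argmin(val_loss)]
print(val_loss, best_lambda)
```

The test set is never touched inside the loop, so it remains available for a final, unbiased evaluation of the selected $\lambda$.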

In [None]:
# Function to split data indices
# num_examples: total samples in the dataset
# k_fold: number fold of CV
# returns: array of shuffled indices with shape (k_fold, num_examples//k_fold)
def fold_indices(num_examples,k_fold):
    ind = np.arange(num_examples)
    split_size = num_examples//k_fold
    
    #important to shuffle your data
    np.random.shuffle(ind)
    
    k_fold_indices = []
    # Generate k_fold set of indices
    k_fold_indices = 
    print(k_fold_indices)     
    return np.array(k_fold_indices)


In [None]:
# Function for using the kth split as validation set to get the loss
# and the other k-1 splits to train our model
def do_cross_validation_reg(k,k_fold_ind,X,Y,lmda=0):
    
    # use one split as val
    val_ind = k_fold_ind[k]
    # use k-1 split to train
    train_splits = [i for i in range(k_fold_ind.shape[0]) if i != k]
    train_ind = k_fold_ind[train_splits,:].reshape(-1)
   
    #Get train and val using train and val indices
    cv_X_train = 
    cv_Y_train = 
    cv_X_val = 
    cv_Y_val =
    
    #fit on train set using regularised version
    w = get_w_analytical_with_regularization(cv_X_train,cv_Y_train,lmda)
    
    #get loss for val
    loss_train,loss_test = get_loss(w,cv_X_train,cv_Y_train,cv_X_val,cv_Y_val,val=True)
    print(loss_test,lmda)
    return loss_train,loss_test

In [None]:

# Grid Search Function
# params: hyperparameter to tune
# k_fold: fold for CV to be done
# fold_ind: splits of training set
# X,Y: training examples
# return: returns the training and validation loss for the range of hyperparameter values

def grid_search_cv(params,k_fold,fold_ind,function,X,Y):
    
    #save the values for the combination of hyperparameters
    grid_train = np.zeros(len(params))
    grid_val = np.zeros(len(params))
       
    for i, p in enumerate(params):
        print('Evaluating for {} ...'.format(p))
        loss_train = np.zeros(k_fold)
        loss_test = np.zeros(k_fold)
        for k in range(k_fold):
            loss_train[k], loss_test[k] = function(k,fold_ind,X,Y,p)
        grid_train[i] = 
        grid_val[i] = 
        
    
    return grid_train, grid_val

Let's do 4-fold CV. 

In [None]:
k_fold = 4
fold_ind = fold_indices(X_train_aug.shape[0],k_fold)

In [None]:
#list of lambda values to try.. use np.logspace
minimum_pow =
maximum_pow = 
search_lambda = np.logspace(minimum_pow,maximum_pow,num=5000)

#call to the grid search function
grid_train,grid_val = grid_search_cv(search_lambda,k_fold,fold_ind,do_cross_validation_reg,X_train_aug,y_train)

In [None]:
# plot the curves for losses
search_lambda = [round(s,4) for s in search_lambda]
plt.figure(figsize=(10,10))
plt.subplot(2,1,1)
plt.plot(grid_train)
plt.xticks(np.arange(0,len(search_lambda),500), search_lambda[::500])
plt.xlabel('lambda')
plt.ylabel('Train loss')
plt.title('Train Loss for different lambda')

plt.subplot(2,1,2)
plt.plot(grid_val)
plt.xticks(np.arange(0,len(search_lambda),500), search_lambda[::500])
plt.xlabel('lambda')
plt.ylabel('Val loss')
plt.title('Val Loss for different lambda')

**Q.** Looking at the above plots, what can you conclude about model complexity as we increase lambda?  

In [None]:
# best val score
best_score = 
print(best_score)

# params which give best val score
l= 
best_lambda = search_lambda[l]
print('Best score achieved using lambda:{}'.format(best_lambda))


In [None]:
#Evaluate on the test set
w = get_w_analytical_with_regularization(X_train_aug,y_train,best_lambda)

get_loss(w,X_train_aug,y_train,X_test_aug,y_test)

**Q.** How does this value compare with the one obtained before cross-validation?  


**Q.** How would you proceed to improve the prediction?

