# Lab 3
### Introduction to Machine Learning, 2021-2022 period 4

This assignment is to be done with a partner. Please only submit ONE .ipynb (not .py) file per pair!

**Total points: 10**

**Deadline: 2022-06-03 17:00**

**Write your names and student ids here before submission:____**

## Linear Regression
In a linear regression problem, the goal is to predict the value of a certain variable $y$. In contrast to classification algorithms (such as $k$-Nearest Neighbours from the previous exercise), $y$ takes continuous values.
The output value $y$ will be modelled as a linear combination of the (transformed) input values $\mathbf{x}$. We expect these input values to influence $y$ in some way, just like with classification. The extent to which the variables influence our output is determined by an unknown target function $f$. We try to estimate that target function by using linear regression.

Like any model, a linear regression model is a simplified version of reality. We may not have access to variables that do influence $f$. We may also include input variables in our model that have no effect on the real $f(\mathbf{x})$.

Moreover, the odds of $f$ being an actual linear combination of (a transformation of) our input variables are very small. But often, a linear estimate is the best we can do.

This assignment is designed to give insight into the behaviour of linear models through multiple simulation experiments in different environments.
In such a simulation experiment, the data does not come from a file; instead, we generate it ourselves (probabilistically). This gives us full control over the distribution of our data, so we can see how different properties of that distribution affect our in-sample error and our estimate of the out-of-sample error.

We can compare the performance of the linear model under different circumstances: What happens with a more complicated target function $f$? What happens if there is more noise? What happens with less training data?

## Our Experiment
### Data generation
Every datapoint $(\mathbf{x},y)$ will be sampled randomly. In our case $\mathbf{x}$ is a vector with six numbers. As per usual in a linear model, $x_0 = 1$. The other elements of $\mathbf{x}$ are distributed normally with expected value = 0 and variance = 1. (This is called a *standard normal distribution*.)

Once we have our $\mathbf{x}$, we generate our output label $y$ according to $y = f(\mathbf{x}) + \epsilon$, where  $\epsilon$ is normally distributed with expected value = 0 and variance = $\sigma^2$; we will experiment with various values for $\sigma$.
All the random numbers should be generated **independently** from one another. 

### Settings to experiment with:
+ The target function $f$:
    - $f_1(\mathbf{x}) = 1 + x_1$
    - $f_2(\mathbf{x}) = 1 + x_1 + 0.3x_2 + 0.1x_3 + 0.03x_4 + 0.01x_5$
+ The hypothesis class:
    - $d=1$, which means our algorithm can only see $x_0$ and $x_1$. 
    - $d=5$, which means our algorithm can use the whole vector $\mathbf{x}$. 
+ Noise:
    - $\sigma^2=0.2$
    - $\sigma^2=0.02$
+ $N$, the number of training datapoints:
    - $N=10$
    - $N=50$
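
As an illustration of what "can only see $x_0$ and $x_1$" means in practice, the sketch below restricts a data matrix to its first $d+1$ columns. It assumes the data is stored as a NumPy array with one row per datapoint; the names `X_toy` and `X_visible` and the values in the matrix are made up for this example.

```python
import numpy as np

# Toy data matrix: 3 datapoints, 6 columns (x0..x5); the values are arbitrary
X_toy = np.arange(18, dtype=float).reshape(3, 6)
X_toy[:, 0] = 1.0  # x0 is always 1

d = 1
# Keep only the columns x0..xd that the hypothesis class is allowed to use
X_visible = X_toy[:, : d + 1]
print(X_visible.shape)  # (3, 2)
```

With $d=5$ the same slice returns the whole matrix, so one code path can serve both hypothesis classes.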

## Experiment setup
For every combination of values listed above, perform the simulation experiment as follows:
- Repeat 100 times:
    - Create a training dataset of size $N$
    - Determine $\mathbf{w}_{\rm lin}$, the least-squares estimator (use the formulas given in chapter 3 of the book)
    - Determine the (quadratic) in-sample error
    - Create a test dataset of size 100 using the same parameters
    - Use the test data to estimate the out-of-sample error
- Look at the means of $E_{\rm in}$ and $E_{\rm out}$ over the 100 repeats
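
As a reminder, the quantities used in these steps can be sketched in standard least-squares notation. This assumes $X$ is the data matrix whose rows are the (visible parts of the) datapoints, $\mathbf{y}$ is the vector of labels, and $X^{\rm T}X$ is invertible; the book's formulas remain the reference.

$$
\mathbf{w}_{\rm lin} = (X^{\rm T}X)^{-1}X^{\rm T}\mathbf{y},
\qquad
E_{\rm in}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{w}^{\rm T}\mathbf{x}_n - y_n\right)^2 = \frac{1}{N}\lVert X\mathbf{w} - \mathbf{y}\rVert^2
$$

The estimate of $E_{\rm out}$ uses the same average squared error, evaluated on the test dataset instead of the training dataset.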

## Your Code
Feel free to add new cells or structure your code however you like; just make sure the grader can understand it and knows how to run it with different parameters.

It is recommended to use numpy or another library of your choice for functionalities like normal distribution generators and matrix multiplication. [Here is a handy guide (in notebook form) that deals with numpy arrays, matrices and number generation.](https://github.com/ageron/handson-ml/blob/master/tools_numpy.ipynb)

**1.** Write code you can use to generate a dataset, where you can choose parameters like $N$, $\sigma^2$ and the target function ($f_1$ or $f_2$). Be sure to check whether your normal-distribution generator needs $\sigma$ (standard deviation) or $\sigma^2$ (variance) as its input parameter. **(1.5pt)**

*Hint:* Use the `test_generate_data()` function to check that your function is working as expected.
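
The standard-deviation-versus-variance point can be checked empirically. The sketch below assumes NumPy's `Generator.normal`, whose `scale` argument is a standard deviation; the seed and the sample size are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility
sigma_squared = 0.2             # intended noise variance

# `scale` is the standard deviation, so pass the square root of the variance
noise = rng.normal(loc=0.0, scale=np.sqrt(sigma_squared), size=100_000)

# The empirical variance should be close to sigma_squared (not sigma_squared ** 2)
empirical_variance = noise.var()
print(empirical_variance)
```

If the printed value were close to $0.04$ instead of $0.2$, the variance would have been passed where a standard deviation was expected.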

In [None]:
import numpy as np
import math

def generate_data(f_number, N, sigma_squared):
    """
    f_number is the target function (1 or 2)
    N is the number of (training) datapoints
    sigma_squared is the variance of the noise
    
    The return value should be a tuple (X, y) where X is a matrix whose rows are datapoints
    and whose columns are the dimensions of the vector x for each datapoint, and y is a (column) vector
    with the target values for each datapoint
    """
    
    # FILL THIS IN
    
    # Hint: remember that x0 needs to be 1 for all datapoints
    
    return (X, y)

def test_generate_data():
    # Check that without noise the label of a datapoint matches the target function
    X, y = generate_data(1, 1, 0)
    assert len(y) == 1
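    # f_1(x) = 1 + x_1 = x_0 + x_1; compare with a tolerance rather than exact float equality
    assert np.isclose(y[0], X[0][0] + X[0][1])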
    
test_generate_data()

**2.** Write code that fits a linear regression model to the training data; in other words, a function that computes $\mathbf{w}_{\rm lin}$. **(1.5pt)**

In [None]:
def fit(X, y, d):
    """
    The inputs are the training data (X,y), and the d that defines the hypothesis class
    """
    
    # FILL THIS IN
    
    return w_lin

**3.** Write code to evaluate a model: determine the $E_{\rm in}$ and $E_{\rm out}$ ($E_{\rm out}$ will be estimated on the test dataset). 

Choose two sets of parameters, and for each of these, plot the target function $y = f(\mathbf{x})$ and the learned regression function $y = \mathbf{w}_{\rm lin}^{\rm T} \mathbf{x}$ in one image (use different colours for the functions). To keep your plots 2-dimensional, plot just $x_1$ on the x-axis; this means your plot will only show the functions' behaviour for $x_2 = x_3 = x_4 = x_5 = 0$. It is recommended to pick $d=1$, $f=f_1$ for your experiments in this question, because then those inputs don't matter. Also plot the training and test data that was used (as dots or similar markers). Include a legend on the plots. 
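
To make the "slice" at $x_2 = x_3 = x_4 = x_5 = 0$ concrete, the sketch below builds the input vectors along the $x_1$ axis and evaluates a linear function on them. The grid range, the names `x1_grid`, `X_grid` and `w_example`, and the weight values are assumptions for illustration only; `w_example` is a stand-in, not a fitted result.

```python
import numpy as np

# Grid of x1 values for the horizontal axis (range chosen arbitrarily)
x1_grid = np.linspace(-3, 3, 61)

# Full input vectors along the slice: x0 = 1, x1 from the grid, x2..x5 = 0
X_grid = np.zeros((x1_grid.size, 6))
X_grid[:, 0] = 1.0
X_grid[:, 1] = x1_grid

# Hypothetical weight vector standing in for a fitted w_lin (not a real result)
w_example = np.array([0.9, 1.1, 0.0, 0.0, 0.0, 0.0])

# Function values along the slice; these can be handed to a plotting call
y_grid = X_grid @ w_example
```

The same `X_grid` can be used for the target function, so both curves are evaluated at identical $x_1$ positions.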

How can you assess the performance of these hypotheses looking at the two plots? Does this agree with their computed $E_{\rm in}$ and $E_{\rm out}$? In which of the two experiments do you get better performance and why? **(1.5pt)**

*Hint:* Use the `test_compute_error()` function to check that your function is working as expected.
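
The expected value in `test_compute_error()` can be verified by hand. With the mock data $\mathbf{w} = (0.1, 1, 2)$, $\mathbf{x} = (1, 2, 3)$ and $y = 0.5$, and a single datapoint so that the average is over one term:

$$
\mathbf{w}^{\rm T}\mathbf{x} = 0.1 \cdot 1 + 1 \cdot 2 + 2 \cdot 3 = 8.1,
\qquad
\left(\mathbf{w}^{\rm T}\mathbf{x} - y\right)^2 = (8.1 - 0.5)^2 = 7.6^2 = 57.76
$$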

In [None]:
# Solution should include:
#  - a function that computes the error(s);
#  - plotting the functions and datasets from two experiments;
#  - the discussion questions.

def compute_error(w, X, y):
    # X and y can be either training or test data
    # w is the w_lin resulting from calling the fit function above
    # The return value should be the average squared error (see Eqs. 3.3 and 3.4 in the book)
    
    # FILL THIS IN
    
    return avg_squared_error

def test_compute_error():
    # Use some mock data to check that the error computation is working properly
    w = np.array([0.1, 1, 2]).T
    X = np.array([np.array([1, 2, 3])])
    y = np.array([0.5]).T
    expected_E = 57.76
    assert np.abs(expected_E - compute_error(w, X, y)) < 0.01
    
test_compute_error()

In [None]:
# FILL THIS IN
"""
You need to write a piece of code that:
- chooses two sets of parameters from the "settings to experiment with" above
- creates training and test datasets for each of those sets of parameters, and calls the fit function to get w_lin
- calls the compute_error function to compute E_in and E_out
- plots the target and hypothesis functions as required
"""

*Your answer to the discussion question here*

**4.** Write a function that, given all our parameters, performs the experiment `num_repeats` times (100 in this assignment) and calculates the mean performance of the trained models.
**(1.5pt)**

In [None]:
def experiment(f_number, d, N, sigma_squared, num_repeats):
    """
    f_number is the target function (1 or 2)
    d is the hypothesis class (1 or 5)
    N is the number of training datapoints
    sigma_squared is the variance of the noise
    num_repeats is how many times you want to run the experiment
    
    The function should print out the average in- and out-of-sample error
    """
    
    # FILL THIS IN
    
    # Hint: you can use most of the work from question 3 to fill out this function


## Results
Enter your results in these tables (table 1 for $f_1$, table 2 for $f_2$).
To keep the tables neat, round each result to 3 decimal places. **(1pt)**
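
One possible way to get three-decimal values ready for pasting into a table row is Python's format specification, sketched below. The two numbers are made-up placeholders, not experimental results, and the names `e_in`, `e_out` and `row` are chosen for this example only.

```python
# Hypothetical error values, for illustration only (not experimental results)
e_in, e_out = 0.0187342, 0.2051987

# The `.3f` format specifier keeps three digits after the decimal point
row = f"| {e_in:.3f} | {e_out:.3f} |"
print(row)  # | 0.019 | 0.205 |
```

Note that `.3f` rounds to the nearest value at three decimals rather than truncating.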


In [None]:
for target_function in [1, 2]:
    for d in [1, 5]:
        for N in [10, 50]:
            for sigma_squared in [.02, .2]:
                experiment(target_function, d, N, sigma_squared, num_repeats=100)

|$f_1$ |            || d=1      |           || d=5      |            |
|------|------------||----------|-----------||----------|------------|
| $N$  | $\sigma^2$ || $E_{in}$ | $E_{out}$ || $E_{in}$ | $E_{out}$  |
| 10   | .02        ||          |           ||          |            |
| 10   | .20        ||          |           ||          |            |
| 50   | .02        ||          |           ||          |            |
| 50   | .20        ||          |           ||          |            |



|$f_2$ |            || d=1      |           || d=5      |            |
|------|------------||----------|-----------||----------|------------|
| $N$  | $\sigma^2$ || $E_{in}$ | $E_{out}$ || $E_{in}$ | $E_{out}$  |
| 10   | .02        ||          |           ||          |            |
| 10   | .20        ||          |           ||          |            |
| 50   | .02        ||          |           ||          |            |
| 50   | .20        ||          |           ||          |            |


<sup>(Press Enter to edit a table and Ctrl+Enter to render it again.
 If you love making $\LaTeX$ tables or hate this one, feel free to make your own, as long as you stick to this parameter format.)</sup>

## Report
Report on the experimental results. Discuss the following questions: **(3pt)**


#### 1. Which hypothesis class produced better (estimated) out-of-sample errors? Under which circumstances?

*Your answer here*

#### 2. For given training data, to what extent are the in-sample errors of a hypothesis class informative for the out-of-sample errors?

*Your answer here*

#### 3. Can you explain your results using bias-variance analysis? How?

*Your answer here*