In [None]:
import requests
from IPython.core.display import HTML
HTML(f"""
<style>
@import "https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css";
</style>
""")

# Generalising linear regression


In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

In this exercise you will generalise regression to $N$-th order polynomials and use it to predict the price of a house (in Canadian dollars) based on its lot size (in square feet).
Suppose you want to buy a house in the City of Windsor, Canada. You contact a real-estate salesperson to get information about current house prices and receive details on 546 properties sold in Windsor in the last two years. You would like to estimate the expected price of a house given only its lot size. Fortunately, this dataset has only one independent variable (`lotsize`, the lot size of a property) and one dependent variable (`price`, the sale price of a house). You will fit a polynomial regression model to the dataset and use it to predict house prices.
A polynomial _model_ of order $N$ is defined by:

$$
f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_N x^N,
$$

in which the coefficients $\theta_i$ are the parameters of the model. Notice that the function is linear in the parameters, i.e. for a fixed $x$, $f_\theta(x)$ is a linear function of $\theta$. To estimate the parameters $\theta_i$, you can therefore set up a system of linear equations, one per data point $(x_i, y_i)$, and solve for $\theta$:

$$
\begin{bmatrix}
    1 & x_1 & x_1^2 & x_1^3 & \dots & x_1^N \\
    1 & x_2 & x_2^2 & x_2^3 & \dots & x_2^N \\
    1 & x_3 & x_3^2 & x_3^3 & \dots & x_3^N \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & x_m & x_m^2 & x_m^3 & \dots & x_m^N
\end{bmatrix}
\times
\begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3 \\
    \vdots \\
    \theta_N
\end{bmatrix}
=
\begin{bmatrix}
    y_1 \\
    y_2 \\
    y_3 \\
    \vdots \\
    y_m
\end{bmatrix},
$$

or more compactly: $A \theta = y$. 
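
As a small illustration (the numbers are made up for this example and are not taken from the dataset), take $N = 2$ and the three points $(1, 2)$, $(2, 3)$ and $(3, 6)$. The system then reads:

$$
\begin{bmatrix}
    1 & 1 & 1 \\
    1 & 2 & 4 \\
    1 & 3 & 9
\end{bmatrix}
\times
\begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2
\end{bmatrix}
=
\begin{bmatrix}
    2 \\
    3 \\
    6
\end{bmatrix}
$$

Here $m = N + 1$, so the system has the exact solution $\theta = (3, -2, 1)$, i.e. $f_\theta(x) = 3 - 2x + x^2$. With more data points than parameters, as in the house price dataset, an exact solution generally does not exist, which is why a cost function is needed.
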
The _loss function_ $\ell(\hat{y}_i, y_i)$ for linear regression is the squared error between a known output $y_i$ and the corresponding predicted output $\hat{y}_i=f_{\theta}(x_i)$ of the model:

$$
\ell(\hat{y}_i, y_i) = (\hat{y}_i-y_{i})^2
$$

This function is simply the squared error of a single data point. We know from the projection method that least squares minimises the sum of squared errors. In other words, the parameters $\theta$ can be found by solving the following optimisation problem:

$$
\theta = \underset{\theta}{\operatorname{argmin}} \frac{1}{m}\sum_{i=1}^{m} \ell(\hat{y}_i, y_i)
$$

To summarise, our _model_ $f_\theta(x)$ is a polynomial function of $x$. We want to find the parameters $\theta$ that minimise the mean squared distance between the predictions and the known outputs (the factor $\frac{1}{m}$ only scales the cost and does not change the minimiser). We know from linear algebra that projecting the vector $y = (y_1, \dots, y_m)$ onto the column space of the design matrix $A$ is equivalent to solving this optimisation problem.
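
The link between projection and least squares can be checked numerically. The sketch below uses made-up data and a first-order model (both are assumptions for illustration). It solves the normal equations $A^T A \theta = A^T y$, compares the result with `np.linalg.lstsq`, and prints $A^T$ times the residual, which should be close to zero if $A\theta$ is the projection of $y$ onto the column space of $A$.

```python
import numpy as np

# Made-up data: m = 5 points, fitted with a 1st-order polynomial (a line).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.9, 5.2, 6.8, 9.1])

# Design matrix with columns [1, x].
A = np.column_stack([np.ones_like(x), x])

# Least squares solution via the normal equations A^T A theta = A^T y.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# The same solution from NumPy's least squares solver.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# A @ theta is the projection of y onto the column space of A, so the
# residual is orthogonal to every column of A (up to rounding error).
residual = y - A @ theta_normal
print(theta_normal, theta_lstsq)
print(A.T @ residual)
```

In practice `np.linalg.lstsq` is usually preferred over forming $A^T A$ explicitly, since the latter tends to be less well conditioned.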
## Data exploration
Load the dataset described above:


In [None]:
filename = "./data/simple_windsor.csv"
names = ["lotsize", "price"]
dataset = np.loadtxt(filename, delimiter=',', dtype=np.int32)

X_full, y_full = dataset.T  # columns: lot size, price

Let us visualise the data:


In [None]:
plt.scatter(X_full, y_full)
plt.xlabel('Lot size')
plt.ylabel('House price');

This visualisation already tells us a lot about the usefulness of the data. Try to answer the following questions to the best of your abilities:

---
**Task 1 (easy): Questions**
Notice the large spread in house prices for relatively similar lot sizes. 
1. List at least three reasons for the house price variability given the lot size.


---
### Splitting into train and test data
We use a helper function from the _scikit-learn_ library to split the dataset into $80\%$ training data and $20\%$ test data:


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=42)
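
Conceptually, a shuffled 80/20 split assigns each sample to exactly one of the two sets. The following NumPy-only sketch shows the idea on hypothetical toy data. It is not scikit-learn's implementation, and the names `X_toy`, `y_toy` and the rounding of the test set size are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data standing in for X_full and y_full.
X_toy = np.arange(10)
y_toy = X_toy * 100

rng = np.random.default_rng(42)        # fixed seed for reproducibility
indices = rng.permutation(len(X_toy))  # shuffled sample indices

n_test = len(X_toy) // 5               # 20% of the samples (rounding is an assumption)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Inputs and outputs are indexed with the same indices, so pairs stay together.
X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
print(len(X_tr), len(X_te))
```

Fixing the seed, like `random_state=42` above, makes the split reproducible between runs.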

## Generalising regression
The steps for linear regression are:
1. Define a model, e.g. linear or polynomial, and identify knowns and unknowns.
2. Generate a design matrix $A$ for the input dataset (see the `get_design_matrix` function below).
3. Estimate the model parameters using least squares ([Task 2](#estimate)).

The following exercises will guide you through this process.
We provide the function below for creating design matrices for polynomials of arbitrary order:


In [None]:
def get_design_matrix(x, order=1):
    """
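    Build the design matrix of a polynomial model of the given order.
    
    :param x: One-dimensional NumPy array of m input values.
    :param order: Order N of the polynomial (at least 1).
    :return: Array of shape (m, order + 1) whose column i holds x**i.
        If order < 1 or x is not one-dimensional, x is returned unchanged.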
    """
    
    if order < 1 or x.ndim != 1:
        return x

    count = x.shape[0]
    matrix = np.ones((count, order + 1), np.float64)

    for i in range(1, order+1):
        matrix[:, i] = np.power(x, i)

    return matrix
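
To see what such a design matrix looks like, the sketch below rebuilds it column by column for three made-up inputs and compares it with `np.vander`, which produces the same matrix when called with `increasing=True`. The inputs and the order are assumptions chosen to keep the output small.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # made-up inputs
order = 2

# Column-by-column construction, mirroring the loop in get_design_matrix.
manual = np.ones((x.shape[0], order + 1))
for i in range(1, order + 1):
    manual[:, i] = x ** i

# np.vander builds the same matrix when the powers are increasing.
vander = np.vander(x, order + 1, increasing=True)

print(manual)
```

Each row holds the powers $1, x_i, x_i^2$ of one input value, so the matrix has one row per sample and one column per parameter.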


---
**Task 2 (easy): Estimate parameters**
1. Implement the function `estimate(X, y, order)` below. The function should use `np.linalg.lstsq()` to estimate the model parameters. Use `get_design_matrix(X, order)` to generate an appropriate design matrix.


---


In [None]:
def estimate(X, y, order):
    """
    :param X: Input vector.
    :param y: Training data values.
    :param order: Order of the model to estimate.
    
    :return: Parameters of model.
    """
    ...


---
**Task 3 (easy): Implement linear model**
1. Use the learned model to predict house prices given an input vector $X$ of lot sizes. Implement the prediction function `predict(X, params)` in the cell below.


---
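
Since the model is linear in its parameters, evaluating it amounts to a matrix-vector product $A\theta$. The sketch below checks this on made-up parameters against NumPy's own polynomial evaluation. The values of `theta` and `x` are assumptions, and the snippet is an illustration of the idea rather than the intended solution.

```python
import numpy as np
from numpy.polynomial import polynomial as P

theta = np.array([3.0, -2.0, 1.0])  # made-up parameters: 3 - 2x + x^2
x = np.array([0.0, 1.0, 2.0, 3.0])

# Design matrix for a 2nd-order polynomial: columns 1, x, x^2.
A = np.vander(x, len(theta), increasing=True)

y_matrix = A @ theta             # prediction as a matrix-vector product
y_polyval = P.polyval(x, theta)  # same polynomial evaluated directly

print(y_matrix)
```

The order of the polynomial does not need to be passed separately, because it follows from the number of parameters.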


In [None]:
def predict(X, params):
    """
    :param X: Input vector.
    :param params: Estimated parameters.
    
    :return: Predicted y-values.
    """
    ...


---
**Task 4 (easy): Prediction**
In this task you will combine the functions above to learn the model parameters for a polynomial model and use it for predictions of house prices. Implement the following steps in the code cell below.
1. Estimate parameters from `X_train` and `y_train`.
2. Then calculate the predicted `y`-values for the provided lot sizes in the variable `values` in the cell below.
3. Plot the predicted house prices as a line-plot.


---


In [None]:
values = np.linspace(X_full.min(), X_full.max(), 50)

# (A) Estimate parameters

# (B) Evaluate model

plt.scatter(X_train, y_train)

# (C) Plot predicted values


---
**Task 5 (easy)**
In this task you will experiment with the order of the polynomial model.
1. Increase the order of the polynomial in your implementation above and evaluate the results for:
a. A 3rd-order polynomial.
b. A 4th-order polynomial.
c. A 7th-order polynomial.

You should see the predictions starting to deviate drastically for the 7th-order polynomial.

2. Try to explain why this happens. _Hint: It has to do with the behavior of floating point numbers at extreme values._


---
The above problem can be solved by normalizing the input vectors. Normalization scales and translates (transforms) the input values to the interval $[0, 1]$ using the minimum and maximum values of the inputs.
The cell below provides helper functions for normalizing and unnormalizing (the inverse transformation) input vectors:


In [None]:
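def normalized(X):
    # Min-max scaling: maps the range of X_full onto [0, 1].
    return (X - np.min(X_full)) / (np.max(X_full) - np.min(X_full))

def unnormalized(X):
    # Inverse of normalized(): maps [0, 1] back to the original lot sizes.
    return X * (np.max(X_full) - np.min(X_full)) + np.min(X_full)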
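
One way to quantify the effect of scaling is the condition number of the design matrix, which indicates how strongly rounding errors can be amplified when solving for $\theta$. The sketch below compares raw and min-max scaled inputs. The lot sizes are an assumed, evenly spaced range and are not read from the dataset.

```python
import numpy as np

# Assumed lot sizes in square feet; the range is illustrative only.
lots = np.linspace(1500.0, 16000.0, 50)
scaled = (lots - lots.min()) / (lots.max() - lots.min())  # min-max scaling to [0, 1]

for order in (3, 4, 7):
    A_raw = np.vander(lots, order + 1, increasing=True)
    A_scaled = np.vander(scaled, order + 1, increasing=True)
    # A large condition number means small rounding errors are strongly amplified.
    print(order, np.linalg.cond(A_raw), np.linalg.cond(A_scaled))
```

For raw inputs the columns of the design matrix differ by many orders of magnitude, whereas after scaling all entries lie between 0 and 1.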


---
**Task 6 (medium): Higher order polynomials with normalization**
In this task you will need to modify the code in [Task 4](#learn) using the helper functions above to normalize the inputs $X_{train}$ and re-train the model.
1. Normalize the inputs in the variable `X_train`.
2. Re-train the model parameters using the normalized inputs and plot the results. Use 3rd-, 4th-, and 7th-order polynomials as done in [Task 4](#learn).
3. How much did the results improve for the different model orders? Explain why normalization achieves better performance.


---


In [None]:
values = np.linspace(X_full.min(), X_full.max(), 50)

# Estimate parameters and predict y-values

plt.scatter(X_train, y_train, c="g")

# Plot predicted values

---
## Evaluation
We now want to evaluate the model using the test data. You will calculate the _root mean squared error_ for polynomials of various orders and use the error to decide which order gives the best tradeoff between bias and variance (underfitting/overfitting).
The _root mean squared error_ is simply the square root of the _mean squared error_: 

$$
 \sqrt{\frac{1}{m}\sum_{i=1}^{m}(f_{\theta}(x_{i})-y_{i})^2}
$$

We use it because it represents the average error in the same units as the data, i.e. house prices in our case.
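
As a quick numerical illustration of the units argument, the sketch below computes both errors for four made-up prices where every prediction is off by exactly 2000 dollars. The arrays are assumptions for this example only.

```python
import numpy as np

# Made-up prices in Canadian dollars.
y_true = np.array([50000.0, 60000.0, 70000.0, 80000.0])
y_pred = np.array([52000.0, 58000.0, 72000.0, 78000.0])

mse = np.mean((y_pred - y_true) ** 2)  # mean squared error, in dollars squared
rmse_value = np.sqrt(mse)              # root mean squared error, in dollars

print(mse, rmse_value)
```

The mean squared error is 4,000,000 (dollars squared), while the root mean squared error is 2000 dollars, which matches the size of the individual errors.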

---
**Task 7 (easy): Error calculation**
Implement the _root mean squared error_ in the `rmse` function below.
1. Normalize the `X` values using the `normalized` function.
2. Predict the prices using the normalized `X` values and model parameters `theta`.
3. Calculate and return the _root mean squared error_ of the predicted values.


---


In [None]:
def rmse(theta, X, y):
    ...


---
**Task 8 (easy): Model evaluation**
The function `evaluate_models` in the cell below should calculate the _root mean squared error_ for models with polynomial orders from 1 to 20. You have to finish the implementation, starting at the `# Add code here` comment. Do the following:
1. Estimate the model parameters using the `estimate` function.
2. Calculate the _root mean squared error_ of the train and test sets respectively.


---


In [None]:
def evaluate_models():
    losses_train = []
    losses_test = []
    for order in range(1, 21):  # orders 1 to 20 inclusive
        # Add code here
        # first, estimate parameters
        rmse_train = ...
        rmse_test = ...
        losses_train.append(rmse_train)
        losses_test.append(rmse_test)
    return losses_train, losses_test


---
**Task 9 (easy): Plotting results**
1. Plot the training and test losses in the cell below. 
2. Are the results what you expected? Explain why the two loss values evolve differently as the order of the polynomial increases.
3. Relate the results to the dilemma of underfitting and overfitting.


---
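
If you want a reference point before looking at the house data, the sketch below runs the same kind of experiment on synthetic data: a noisy quadratic sampled on $[0, 1]$. The data, the split and the helper `rmse_for_order` are all assumptions made for this example. It prints the training and test error for each order so that their trends can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy quadratic on [0, 1].
x = rng.uniform(0.0, 1.0, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, x.size)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def rmse_for_order(order):
    """Fit on the training part and return (train RMSE, test RMSE)."""
    A_tr = np.vander(x_tr, order + 1, increasing=True)
    A_te = np.vander(x_te, order + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    err_tr = np.sqrt(np.mean((A_tr @ theta - y_tr) ** 2))
    err_te = np.sqrt(np.mean((A_te @ theta - y_te) ** 2))
    return err_tr, err_te

for order in range(1, 9):
    print(order, *rmse_for_order(order))
```

Because a polynomial of order $N$ is a special case of one of order $N+1$, the training error of an exact least squares fit cannot increase with the order. The test error carries no such guarantee.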


In [None]:
# Write your solution here


---
**Task 10 (medium): Reflection**
1. Is it possible to improve the test loss to an arbitrarily low value if you could use a different model?
2. Explain why this is or is not possible.


---
