<h3>Week 2: Vector/Matrix form linear regression lab</h3>


#### Aims 

* Use `scikit-learn` to implement simple linear regression
* Practice general linear regression with polynomial and RBF basis functions on the Olympic 100m data


#### Tasks 
* Replicate Lab 1 results with `scikit-learn`
* Rescale our data
* Write functions to construct the design matrix, $X$, with polynomials and RBFs
* Compute the least-squares solutions for both design matrices
* Solve the same problem with more sophisticated gradient descent (code provided)


#### Task 1: Again, we start by loading the Olympic 100m men's data

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('olympic100m.txt', delimiter=',') # make sure olympic100m.txt is in the right folder
x = data[:,0][:,None] # make x a column vector of shape (N, 1)
t = data[:,1][:,None] # make t a column vector of shape (N, 1)

#### Task 2: `scikit-learn` is a widely-used library of ML algorithms. It implements linear regression as `sklearn.linear_model.LinearRegression`. Using this class, try to replicate your results from Lab Sheet 1.

*Note `scikit-learn` also includes many of the other algorithms we cover on this course and is commonly used in the real world. However these lab sheets guide you to implement the methods yourself in order to achieve a better understanding. Once you know how a method works, you can better understand how to adapt and tune existing standard implementations for your specific task.*
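As a rough orientation for the `LinearRegression` interface, here is a minimal sketch on made-up data (the arrays `x_demo` and `t_demo` are assumptions for illustration, not the Olympic data). It only shows where the fitted intercept and slope end up; adapting it to `x` and `t` is left to you.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (values are made up for illustration)
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 20)[:, None]   # shape (N, 1), as sklearn expects
t_demo = 3.0 - 0.5 * x_demo + 0.1 * rng.standard_normal(x_demo.shape)

model = LinearRegression()     # fits an intercept by default
model.fit(x_demo, t_demo)

w0_demo = model.intercept_[0]  # intercept, plays the role of w_0
w1_demo = model.coef_[0, 0]    # slope, plays the role of w_1
print(w0_demo, w1_demo)
```

Because `t_demo` is a column vector, `intercept_` has shape `(1,)` and `coef_` has shape `(1, 1)`, hence the indexing.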

#### Task 3: Vector/Matrix form least square solution

#### Task 3.1 Rescale $x$ 
We rescale $x$ to make its values small. Doing so stabilises the computation; otherwise fitting polynomials quickly becomes infeasible, because the raw values of $x$ are around $2000$. Let's test the following two options:
- Option 1: `(x-1896)/40`
- Option 2: `(x-np.mean(x))/np.std(x)`
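To get a feel for what the two options do, the following sketch applies them to a handful of illustrative years (the array `years` is an assumption, not the full data set) and compares the size of a 9th-power feature before and after rescaling.

```python
import numpy as np

# A few illustrative years, as a column vector
years = np.array([1896.0, 1900.0, 1936.0, 1976.0, 2008.0])[:, None]

x_opt1 = (years - 1896) / 40                        # Option 1: fixed shift and scale
x_opt2 = (years - np.mean(years)) / np.std(years)   # Option 2: zero mean, unit standard deviation

print(x_opt1.ravel())
print(x_opt2.mean(), x_opt2.std())

# Largest entry of a 9th-order polynomial feature, raw versus rescaled
print(np.max(years ** 9), np.max(x_opt1 ** 9))
```

The raw 9th powers are astronomically larger than the rescaled ones, which is what makes the unscaled problem numerically fragile.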

In [None]:
x = ...# Test both options
...# plot the rescaled data

#### Task 3.2: Check the effect of the rescaling on the contour plot of the loss for the simple linear model $w_0 + w_1 x$
The rescaling reveals the elliptical contours that were previously difficult to see. For both rescaling options, you can use $5$ to $15$ for $w_0$ and $-2$ to $1$ for $w_1$.

In [None]:
num_candidates = ...# number of candidates
w0_candidates = ...# generate a numpy array of possible w0 values e.g. 5 to 15
w1_candidates = ...# generate a numpy array of possible w1 values e.g. -2 to 1
L = np.zeros( shape = (num_candidates,num_candidates) ) # Pre-allocate the loss. We are going to have num_candidates times num_candidates of them

...# Write two nested for loops to compute L; plt.contour expects L[i, j] to be the loss at (w0_candidates[j], w1_candidates[i])

plt.contour(w0_candidates, w1_candidates, L, 50) # A different way to plot contour without using meshgrid
plt.xlabel('$w_0$')
plt.ylabel('$w_1$')

#### Task 4.1 Write your own function to construct the design matrix with polynomials

$$\mathbf{X} = \begin{bmatrix}
    1       & x_{1} & x_{1}^2 & \dots & x_{1}^K \\
    1       & x_{2} & x_{2}^2 & \dots & x_{2}^K \\
    \vdots & \vdots &\vdots &\ddots &\vdots\\
    1       & x_{N} & x_{N}^2 & \dots & x_{N}^K
\end{bmatrix} $$
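One way to check your own function is to compare it against NumPy's built-in Vandermonde matrix. The sketch below uses made-up inputs (`x_demo` and `K` are assumptions); note that `np.vander` requires a 1-D array, so a column vector `x` of shape `(N, 1)` would need to be passed as `x[:, 0]`.

```python
import numpy as np

x_demo = np.array([0.0, 0.5, 1.0, 2.0])  # illustrative inputs, 1-D as np.vander requires
K = 3                                     # maximum polynomial order

# With increasing=True the columns are x^0, x^1, ..., x^K
X_check = np.vander(x_demo, N=K + 1, increasing=True)
print(X_check.shape)
print(X_check)
```

If your `polynomial(x, maxorder)` is correct, `np.allclose` between its output and the corresponding `np.vander` result should return `True`.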

In [None]:
def polynomial(x, maxorder): # The np.hstack function might be helpful
    ... # Write your own code here
    return X

#### Task 4.2 Construct the design matrix with a predefined maximum polynomial order, say $9$ 

In [None]:
maxorder = 9
X_poly = polynomial(x, maxorder)
X_poly.shape

#### Task 4.3: Compute the least square solution

$$ \widehat{\mathbf{w}} = (\mathbf{X}^{T}\mathbf{X})^{-1} \mathbf{X}^T \mathbf{t} $$

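*You can use either your own implementation or `sklearn.linear_model.LinearRegression`. If the design matrix already contains the column of ones, pass `fit_intercept=False` so that the intercept is not fitted twice.*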
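The formula above can be tried out on a small synthetic problem before applying it to the lab data. The sketch below is only an illustration under assumed data (`X_demo`, `w_true` and `t_demo` are made up); it solves the normal equations with `np.linalg.solve` and compares the result with NumPy's own least-squares routine, `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up design matrix: a column of ones plus two random features
X_demo = np.hstack([np.ones((30, 1)), rng.standard_normal((30, 2))])
w_true = np.array([[1.0], [-2.0], [0.5]])
t_demo = X_demo @ w_true + 0.01 * rng.standard_normal((30, 1))

# Normal equations: solve (X^T X) w = X^T t without forming the inverse explicitly
w_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ t_demo)

# Reference solution from NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_demo, t_demo, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```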

In [None]:
def least_square(X, t):
    ... # write your code here; if you do it 'by hand' you'll need np.linalg.solve (to solve the linear system rather than inverting the matrix) and np.dot (or the @ operator)
    return w
 
w_poly =  least_square(X_poly, t)   
w_poly

#### Task 4.4: Plot the fitted line

To make model predictions, we need to transform any $x_{new}$ with the basis functions:
$$\mathbf{x}_{new} = \left[\begin{array}{c}h_0(x_{new})\\\vdots\\h_K(x_{new})\end{array}\right], \quad t_{new} = \mathbf{x}_{new}^T \widehat{\mathbf{w}} $$

You need to construct a new design matrix for model prediction, e.g. `polynomial(x_test, maxorder)`

In [None]:
plt.plot(x, t, "ro")
x_test = ... # generate a separate set of x for plotting, min(x) to max(x) could be a good choice
f_test = ... # compute the corresponding prediction by the fitted model
plt.plot(x_test, f_test, linewidth=3)

#### Task 4.5: Linear regression with RBF

Use `sklearn.kernel_ridge.KernelRidge` to perform an RBF regression, and plot the results. Set `kernel='rbf'` when creating the object to force use of an RBF kernel. `KernelRidge.fit` takes the original data as input, and constructs the design matrix $\mathbf{X}$ internally. Note that the `alpha` parameter must be set to zero (or very small) to match what we saw in lectures; it adds an extra term to the loss, which is sometimes useful to avoid overfitting.
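The following is a minimal sketch of the `KernelRidge` workflow on made-up data (`x_demo`, `t_demo` and the value of `width` are assumptions). It also sets the `gamma` parameter: scikit-learn's RBF kernel is $\exp(-\gamma (x - x')^2)$, so assuming a basis function of the form $\exp(-(x-x')^2 / (2\,\mbox{width}))$, the matching choice would be `gamma = 1 / (2 * width)`.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

x_demo = np.linspace(-2.0, 2.0, 15)[:, None]  # made-up inputs, shape (N, 1)
t_demo = np.sin(2.0 * x_demo)                 # made-up targets

width = 0.5                                   # illustrative value only
# alpha is tiny rather than exactly zero to keep the linear solve stable
kr = KernelRidge(kernel='rbf', alpha=1e-8, gamma=1.0 / (2.0 * width))
kr.fit(x_demo, t_demo)

# Dense grid between min(x) and max(x) for plotting the fitted curve
x_grid = np.linspace(x_demo.min(), x_demo.max(), 200)[:, None]
f_grid = kr.predict(x_grid)
print(f_grid.shape)
```

If `gamma` is left at its default, scikit-learn picks a value based on the number of features, which will generally not match a hand-chosen width.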

In [None]:
...  # create a KernelRidge object and fit it to the data

plt.plot(x, t, "ro")
x_test = ...  # generate a separate set of x for plotting, min(x) to max(x) could be a good choice
f_test_sklearn = ...  # predict using sklearn's kernel regression
plt.plot(x_test, f_test_sklearn, linewidth=3)

**The remainder of Task 4 is trickier; it walks you through re-implementing the above RBF regression by hand.** It will help you to understand the mechanics of the algorithm better, but is not important for everyday use.

#### Task 4.6: Write your own function to construct the design matrix with one RBF kernel centered at each data point

$$ h_k(x) = \exp \left( -\frac{ (x-\mbox{center}[k]) ^2}{2\mbox{width}}  \right)$$
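The formula can be evaluated for all inputs and centres at once using NumPy broadcasting. The sketch below uses tiny made-up arrays (`x_demo`, `center_demo`, `width_demo` are assumptions) and is one possible approach rather than the required one.

```python
import numpy as np

x_demo = np.array([[0.0], [1.0], [3.0]])  # inputs, shape (N, 1); made-up values
center_demo = np.array([[0.0], [2.0]])    # centres, shape (K, 1); made-up values
width_demo = 2.0

# (N, 1) minus (1, K) broadcasts to an (N, K) matrix of pairwise differences
diff = x_demo - center_demo.T
H = np.exp(-diff ** 2 / (2.0 * width_demo))

print(H.shape)
print(H)
```

Row $n$, column $k$ of `H` holds $h_k(x_n)$, so an input that coincides with a centre gives an entry of exactly 1.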


In [1]:
def rbf (x, center, width):
    ... # write your code here
    return X

#### Task 4.7 Construct  the design matrix with $x$ itself as the center parameter
Start with `width = 10` and test different values. The result should be a `(27,27)` matrix

In [None]:
center = x
width = 10
X_rbf = rbf(x, center, width)

#### Task 4.8 Compute the least square solution with the previously defined function

Also compare your results with those from `scikit-learn`

In [None]:
w_rbf = least_square(X_rbf,t)
w_rbf

#### Task 4.9 Plot the fitted line

In [None]:
plt.plot(x, t, "ro")
x_test = ... # generate a separate set of x for plotting, min(x) to max(x) could be a good choice
f_test = ... # compute the corresponding prediction by the fitted model
plt.plot(x_test, f_test, linewidth=3)

#### Task 4.10 (advanced)

Change your (custom, not `sklearn`) implementation of `least_square` to use `np.linalg.inv` instead of `np.linalg.solve`. Does this still work? If so, which solution would you prefer? If not, why?

#### Task 5 (advanced): Instead of using the least-squares solution, we test gradient descent in the general linear regression setting

- Average squared loss: $L(\mathbf{w}) = \frac{1}{N} (\mathbf{t} - \mathbf{X}\mathbf{w})^T(\mathbf{t} - \mathbf{X}\mathbf{w})$
- Gradient: $ \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = -\frac{2}{N} \left(\mathbf{X}^T\mathbf{t} - \mathbf{X}^T \mathbf{X} \mathbf{w}\right) $
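For reference, the gradient follows from expanding the loss and differentiating term by term, using the standard identities $\partial(\mathbf{w}^T\mathbf{a})/\partial\mathbf{w} = \mathbf{a}$ and $\partial(\mathbf{w}^T\mathbf{A}\mathbf{w})/\partial\mathbf{w} = 2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$:

$$
\begin{aligned}
L(\mathbf{w}) &= \frac{1}{N}\left(\mathbf{t}^T\mathbf{t} - 2\,\mathbf{w}^T\mathbf{X}^T\mathbf{t} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right) \\
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} &= \frac{1}{N}\left(-2\,\mathbf{X}^T\mathbf{t} + 2\,\mathbf{X}^T\mathbf{X}\mathbf{w}\right) = -\frac{2}{N}\left(\mathbf{X}^T\mathbf{t} - \mathbf{X}^T\mathbf{X}\mathbf{w}\right)
\end{aligned}
$$

Setting the gradient to zero gives $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{t}$, which recovers the least-squares solution $\widehat{\mathbf{w}} = (\mathbf{X}^{T}\mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$.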

#### Task 5.1 Define your own functions for the average squared loss and its gradient 

You will need to change the shape of `w` and `g` so that they fit the requirements of `scipy.optimize.minimize`, which expects the gradient (`g`) to have shape `(d,)`, where `d` is the dimension of `w` and `g`.

In [1]:
def loss(w, X, t): # define the loss function
    L = ... # the average squared loss function
    return L

def gradient(w, X, t): # define the gradient function
    g = ...
    return g

#### Task 5.2 This cell checks whether your gradient function is correct by comparing it with a numerical approximation

In [None]:
w0 = np.ones(X_rbf.shape[1]) # 1-D starting point of shape (d,)

eps    = 1e-4 # step size 
mygrad = gradient(w0, X_rbf, t)
fdgrad = np.zeros(w0.shape)
for d in range(len(w0)): # perturb each dimension in turn
    mask = np.zeros(w0.shape) # a binary mask that only allows the selected dimension to change
    mask[d]   = 1
    fdgrad[d] = (loss(w0 + eps*mask, X_rbf, t) - loss(w0 - eps*mask, X_rbf, t))/(2*eps) # central-difference approximation of the partial derivative

print("MYGRAD: ", mygrad) # my gradient output
print("FDGRAD: ", fdgrad) # numerical gradient
print("Error: ", np.linalg.norm(mygrad-fdgrad)/np.linalg.norm(mygrad+fdgrad) ) # error 

#### Task 5.3: We run an advanced gradient descent method, BFGS, which automatically determines the learning rate
SciPy's `minimize` already implements a range of different methods.
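To illustrate how `minimize` is called, here is a small self-contained sketch on a made-up least-squares problem (`X_demo`, `t_demo`, `demo_loss` and `demo_gradient` are hypothetical names, and the targets are 1-D for simplicity). It passes the loss and gradient as callables, forwards the data through `args`, and compares the result with the closed-form solution.

```python
import numpy as np
import scipy.optimize as opt

rng = np.random.default_rng(2)
# Made-up design matrix and 1-D targets
X_demo = np.hstack([np.ones((40, 1)), rng.standard_normal((40, 2))])
t_demo = X_demo @ np.array([0.5, 1.5, -1.0]) + 0.01 * rng.standard_normal(40)

def demo_loss(w, X, t):
    # average squared loss, returns a scalar
    r = t - X @ w
    return r @ r / len(t)

def demo_gradient(w, X, t):
    # gradient of the average squared loss, shape (d,)
    return -2.0 / len(t) * (X.T @ t - X.T @ X @ w)

res_demo = opt.minimize(demo_loss, np.zeros(3), args=(X_demo, t_demo),
                        method='BFGS', jac=demo_gradient)

# Closed-form least-squares solution for comparison
w_exact = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ t_demo)
print(res_demo.x, w_exact)
```

The returned object exposes the minimiser as `res_demo.x` and the final loss as `res_demo.fun`.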

In [None]:
import scipy.optimize as opt

res = opt.minimize(loss, w0, args=(X_rbf, t), method='BFGS', jac=gradient, 
                   options={'gtol': 1e-7, 'disp': True}) # see the SciPy minimize documentation for more information
res.x # solution

In [2]:
plt.plot(x, t, "ro")
x_test = ... # generate a separate set of x for plotting, min(x) to max(x) could be a good choice
f_test = ... # compute the corresponding prediction by the fitted model
plt.plot(x_test, f_test, linewidth=3)

The loss at the least-squares solution is still lower:

In [None]:
loss(w_rbf[:,0], X_rbf, t) < loss(res.x, X_rbf, t)