# Kernel Methods

###### COMP4670/8600 - Introduction to Statistical Machine Learning - Tutorial 4

## Discussion

Get into groups of two or three and take turns explaining the following (about 2 minutes each):
- regression vs classification
- Fisher's discriminant
- generative vs discriminative probabilistic methods
- logistic regression
- support vector machines
- basis functions vs kernels

$\newcommand{\RR}{\mathbb{R}}$
$\newcommand{\dotprod}[2]{\langle #1, #2 \rangle}$

## Setting up the environment

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## The data set

This is the same dataset we used in Tutorial 2.

*We will use an old dataset on housing prices in Boston (see [description](https://archive.ics.uci.edu/ml/datasets/Housing)). The aim is to predict the median value of owner-occupied homes from various other factors. We will use a normalised version of this data, where each row is an example. The median value of homes is given in the first column (the label), and each subsequent feature has been normalised to lie in the range $[-1,1]$. Download this dataset from [mldata.org](http://mldata.org/repository/data/download/csv/housing_scale/).*

Read in the data using pandas. Remove the column containing the binary variable 'CHAS' using ```drop```, which should give you a DataFrame with 506 rows (examples) and 13 columns (1 label and 12 features).

In [15]:
names = ['medv', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']
data = pd.read_csv('housing_scale.csv', header=None, names=names)
data.head()
data.drop('chas', axis=1, inplace=True)
data.shape

<bound method NDFrame.head of      medv      crim    zn     indus       nox        rm       age       dis  \
0      24 -1.000000 -0.64 -0.864370 -0.370370  0.155011  0.283213 -0.461594   
1      21 -0.999528 -1.00 -0.515396 -0.654321  0.095995  0.565396 -0.302076   
2      34 -0.999529 -1.00 -0.515396 -0.654321  0.388772  0.198764 -0.302076   
3      33 -0.999414 -1.00 -0.873900 -0.699588  0.317111 -0.116375 -0.102911   
4      36 -0.998590 -1.00 -0.873900 -0.699588  0.374210  0.056643 -0.102911   
5      28 -0.999471 -1.00 -0.873900 -0.699588  0.099444  0.149331 -0.102911   
6      22 -0.998157 -0.75 -0.456745 -0.427984 -0.060740  0.312049 -0.194155   
7      27 -0.996893 -0.75 -0.456745 -0.427984  0.000575  0.919670 -0.123226   
8      16 -0.995393 -0.75 -0.456745 -0.427984 -0.206745  1.000000 -0.099292   
9      18 -0.996320 -0.75 -0.456745 -0.427984 -0.063805  0.709578 -0.006538   
10     15 -0.995087 -0.75 -0.456745 -0.427984  0.079134  0.882595 -0.051169   
11     18 -0.997501 -0

## Constructing new kernels

In the lectures, we saw that certain operations on kernels preserve positive semidefiniteness. Recall that a symmetric matrix $K\in \RR^{n \times n}$ is positive semidefinite if for all vectors $a\in\RR^n$ we have the inequality
$$
a^T K a \geqslant 0.
$$

Prove the following relations:
1. Given positive semidefinite matrices $K_1$, $K_2$, show that $K_1 + K_2$ is positive semidefinite, and hence a valid kernel matrix.
2. Given a positive semidefinite matrix $K$, show that $K^2 = K\cdot K$ is positive semidefinite, and hence a valid kernel matrix, where the multiplication is elementwise (pointwise) multiplication, not matrix multiplication.
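
Before attempting the proofs, it can help to check both relations numerically. The following is a minimal sketch, not a proof: it draws random positive semidefinite matrices (the helper names `random_psd` and `min_eigenvalue` are made up for this illustration) and confirms that the smallest eigenvalue of the sum and of the elementwise product is non-negative up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(n, rng):
    """Return a random symmetric PSD matrix of the form B B^T (illustrative helper)."""
    B = rng.standard_normal((n, n))
    return B @ B.T

def min_eigenvalue(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

K1 = random_psd(5, rng)
K2 = random_psd(5, rng)

K_sum = K1 + K2   # relation 1: sum of two PSD matrices
K_sq = K1 * K1    # relation 2: elementwise product, not K1 @ K1

# Allow for floating-point rounding, scaled to the size of the entries.
tol = 1e-9 * max(np.abs(K_sum).max(), np.abs(K_sq).max())

print(min_eigenvalue(K_sum) >= -tol)
print(min_eigenvalue(K_sq) >= -tol)
```

A single random draw only illustrates the claim; the exercise still asks for an argument that holds for every positive semidefinite matrix.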

### Solution description

## Polynomial kernel using closure

Using the properties proven above, show that the inhomogeneous polynomial kernel of degree 2
$$k(x,y) = (\dotprod{x}{y} + 1)^2$$
is positive semidefinite.

### Solution description
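
One possible route is sketched here, not necessarily the argument given in the lectures. It introduces two symbols that do not appear in the question: $X$, the matrix whose rows are the data points, and $\mathbf{1}$, the vector of all ones. Both the Gram matrix of the linear kernel and the all-ones matrix are positive semidefinite, since for any $a\in\RR^n$
$$
a^T X X^T a = \|X^T a\|^2 \geqslant 0, \qquad a^T \mathbf{1}\mathbf{1}^T a = (\mathbf{1}^T a)^2 \geqslant 0.
$$
By the sum relation, $X X^T + \mathbf{1}\mathbf{1}^T$ is positive semidefinite, and its $(i,j)$ entry is $\dotprod{x_i}{x_j} + 1$. By the elementwise product relation, squaring every entry preserves positive semidefiniteness, and the resulting matrix has entries
$$
(\dotprod{x_i}{x_j} + 1)^2 = k(x_i, x_j).
$$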

## Empirical comparison

Recall from Tutorial 2 that we could explicitly construct the polynomial basis function. In fact this demonstrates the relation
$$
k(x,y) = (\dotprod{x}{y} + 1)^2 = \dotprod{\phi(x)}{\phi(y)},
$$
where
$$
\phi(x) = (x_1^2, x_2^2, \ldots, x_n^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_{n-1} x_n, \sqrt{2}x_1, \ldots, \sqrt{2}x_n, 1)
$$
*This is sometimes referred to as an explicit feature map or the primal version of a kernel method.*

For the data above, construct two kernel matrices, one using the explicit feature map and the second using the equation for the polynomial kernel. Confirm that they are the same.

In [45]:
# Solution goes here

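def phi(x):
    # Explicit feature map for k(x, y) = (<x, y> + 1)^2.
    x = np.asarray(x, dtype=float)
    D = len(x)
    # Cross terms x_i x_j (i != j) get weight sqrt(2); squares x_i^2 get weight 1.
    tmp = np.outer(x, x) * np.sqrt(2)
    tmp[np.diag_indices(D)] /= np.sqrt(2)
    # Feature order: constant, linear terms, then the upper triangle of tmp.
    return np.hstack((1, x * np.sqrt(2), tmp[np.triu_indices(D)]))

def kernel_matrices_EFM(data):
    # One row of Phi per example, then all pairwise inner products at once.
    Phi = np.array([phi(row) for row in data.to_numpy()])
    return Phi.dot(Phi.T)

def kernel_polynomial(data):
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    X = data.to_numpy()
    return (X.dot(X.T) + 1) ** 2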

K  = kernel_matrices_EFM(data)
K.shape

K2 = kernel_polynomial(data)
K2-K

array([[  5.82076609e-11,   8.73114914e-11,   1.16415322e-10, ...,
          0.00000000e+00,  -5.82076609e-11,   0.00000000e+00],
       [  8.73114914e-11,   2.91038305e-11,   0.00000000e+00, ...,
          1.16415322e-10,   5.82076609e-11,   1.45519152e-11],
       [  1.16415322e-10,   0.00000000e+00,  -2.32830644e-10, ...,
          0.00000000e+00,   0.00000000e+00,   2.91038305e-11],
       ..., 
       [  0.00000000e+00,   1.16415322e-10,   0.00000000e+00, ...,
          2.32830644e-10,  -1.16415322e-10,   7.27595761e-11],
       [ -5.82076609e-11,   5.82076609e-11,   0.00000000e+00, ...,
         -1.16415322e-10,  -8.73114914e-11,  -7.27595761e-12],
       [  0.00000000e+00,   1.45519152e-11,   2.91038305e-11, ...,
          7.27595761e-11,  -7.27595761e-12,  -3.63797881e-12]])
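
Raw differences of order `1e-10` are hard to judge by eye when the kernel values themselves are large. A tolerance-based comparison gives a clearer answer. The following is a minimal sketch using `np.allclose` on synthetic data standing in for the housing features; the vectorised `feature_map` helper and the array `X` are illustrative and not part of the tutorial code.

```python
import numpy as np

def feature_map(X):
    """Explicit degree-2 feature map applied to every row of X.

    Column order: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j for i < j.
    """
    n, d = X.shape
    i, j = np.triu_indices(d, k=1)          # index pairs with i < j
    cross = np.sqrt(2) * X[:, i] * X[:, j]  # cross terms
    return np.hstack([np.ones((n, 1)), np.sqrt(2) * X, X**2, cross])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 4))  # synthetic stand-in for the features

Phi = feature_map(X)
K_explicit = Phi @ Phi.T          # primal: inner products of explicit features
K_kernel = (X @ X.T + 1) ** 2     # dual: kernel evaluated directly

# Compare up to floating-point tolerance instead of inspecting the difference.
print(np.allclose(K_explicit, K_kernel))
print(Phi.shape)
```

The order of the features differs from `phi` above, which does not matter: the inner product is unchanged by any fixed permutation of the features.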

There are pros and cons for each method of computing the kernel matrix. Discuss.

### Solution description

## Regularized least squares with kernels

This section is analogous to the part in Tutorial 2 about regularized least squares.

State the cost function and the regulariser carefully, defining all symbols. Then show that the regularized least squares solution can be expressed as in Lecture 5 and Lecture 9:
$$
w = \left( \lambda \mathbf{I} + \Phi^T \Phi\right)^{-1} \Phi^T t.
$$
Please describe the reason for each step.

### Solution description
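
A sketch of one way to organise the derivation follows. The notation is assumed, since the lecture notation is not reproduced here: $\Phi$ is the $N\times M$ design matrix whose $n$-th row is $\phi(x_n)^T$, $t\in\RR^N$ is the vector of targets, and $\lambda > 0$ is the regularisation parameter. The cost is the sum-of-squares error plus a quadratic regulariser,
$$
J(w) = \frac{1}{2}\|\Phi w - t\|^2 + \frac{\lambda}{2}\|w\|^2.
$$
Setting the gradient with respect to $w$ to zero gives
$$
\nabla_w J(w) = \Phi^T(\Phi w - t) + \lambda w = 0
\quad\Longleftrightarrow\quad
\left(\lambda \mathbf{I} + \Phi^T\Phi\right) w = \Phi^T t.
$$
For $\lambda > 0$ the matrix $\lambda \mathbf{I} + \Phi^T\Phi$ is positive definite and therefore invertible, and since $J$ is convex this stationary point is the minimiser, with $\Phi^T t$ on the right-hand side of the closed form.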

By substituting $w = \Phi^T a$, derive the regularized least squares method in terms of the kernel matrix $K$.

### Solution description
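
A sketch of the substitution follows, assuming the cost $J(w) = \frac{1}{2}\|\Phi w - t\|^2 + \frac{\lambda}{2}\|w\|^2$ with $\Phi$ the $N\times M$ design matrix, and writing $K = \Phi\Phi^T$ for the kernel matrix. Substituting $w = \Phi^T a$ gives
$$
J(a) = \frac{1}{2}\|K a - t\|^2 + \frac{\lambda}{2} a^T K a.
$$
Using the symmetry of $K$, the gradient is
$$
\nabla_a J(a) = K(Ka - t) + \lambda K a = K\left((K + \lambda \mathbf{I})a - t\right),
$$
which vanishes for
$$
a = \left(K + \lambda \mathbf{I}\right)^{-1} t.
$$
The prediction for a new point $x$ then depends on the data only through kernel evaluations:
$$
y(x) = \dotprod{w}{\phi(x)} = \sum_{n=1}^{N} a_n\, k(x_n, x).
$$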

## Comparing solutions in $a$ and $\mathbf{w}$

Implement the kernelized regularized least squares as above. 
*This is often referred to as the dual version of the kernel method.*

Compare this with the solution from Tutorial 2. Implement two classes:
* ```RLSPrimal```
* ```RLSDual```

each of which contains a ```train``` and a ```predict``` method.

Think carefully about the interfaces to the training and test procedures for the two different versions of regularized least squares. Also think about the parameters that need to be stored in the class.

In [None]:
# Solution goes here
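
One possible shape for the two classes is sketched below, under several assumptions: the degree-2 polynomial kernel from earlier in this tutorial, a cost of the form $\frac{1}{2}\|\Phi w - t\|^2 + \frac{\lambda}{2}\|w\|^2$, and illustrative helper names (`poly2_features`, `poly2_kernel`, `reg`) that do not come from the tutorial. The primal class stores the weight vector; the dual class has to store the coefficients and the training inputs, because prediction needs kernel values between test and training points.

```python
import numpy as np

def poly2_features(X):
    """Explicit feature map for k(x, y) = (<x, y> + 1)^2, one row per example."""
    n, d = X.shape
    i, j = np.triu_indices(d, k=1)
    return np.hstack([np.ones((n, 1)), np.sqrt(2) * X, X**2,
                      np.sqrt(2) * X[:, i] * X[:, j]])

def poly2_kernel(A, B):
    """Kernel values between every row of A and every row of B."""
    return (A @ B.T + 1) ** 2

class RLSPrimal:
    def __init__(self, reg, feature_map=poly2_features):
        self.reg = reg
        self.feature_map = feature_map
        self.w = None

    def train(self, X, t):
        Phi = self.feature_map(X)
        A = self.reg * np.eye(Phi.shape[1]) + Phi.T @ Phi
        self.w = np.linalg.solve(A, Phi.T @ t)  # solve rather than invert
        return self

    def predict(self, X):
        return self.feature_map(X) @ self.w

class RLSDual:
    def __init__(self, reg, kernel=poly2_kernel):
        self.reg = reg
        self.kernel = kernel
        self.a = None
        self.X_train = None

    def train(self, X, t):
        self.X_train = X
        K = self.kernel(X, X)
        self.a = np.linalg.solve(K + self.reg * np.eye(len(X)), t)
        return self

    def predict(self, X):
        return self.kernel(X, self.X_train) @ self.a

# Synthetic data, used only to check that both versions agree.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(30, 3))
t_train = rng.standard_normal(30)
X_test = rng.uniform(-1, 1, size=(5, 3))

pred_p = RLSPrimal(reg=0.1).train(X_train, t_train).predict(X_test)
pred_d = RLSDual(reg=0.1).train(X_train, t_train).predict(X_test)
print(np.allclose(pred_p, pred_d))
```

The primal version solves an $M\times M$ system (number of features), the dual an $N\times N$ system (number of examples), which is one practical basis for choosing between them.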

## (optional) General kernel

Consider how you would generalise the two classes above to a polynomial kernel of degree 3. For the primal version, assume you have a function ```feature_map(X)``` that returns the explicit feature map for the kernel; for the dual version, assume you have a function ```kernel_matrix(X)``` that returns the kernel matrix.
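
As a starting point for the discussion, here is a minimal sketch of the dual side with a degree-3 kernel. It assumes a two-argument kernel function, here called `poly_kernel(A, B, degree)`, which is a hypothetical variation on `kernel_matrix(X)`: prediction needs kernel values between new points and training points, not only among the training points. The helper `n_poly_features` is also illustrative; it counts how many explicit features the primal version would need.

```python
import numpy as np
from math import comb

def poly_kernel(A, B, degree=3):
    """Inhomogeneous polynomial kernel between rows of A and rows of B."""
    return (A @ B.T + 1) ** degree

def n_poly_features(d, degree):
    """Number of monomials of total degree <= `degree` in d variables."""
    return comb(d + degree, degree)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 12))   # synthetic data with 12 features
t = rng.standard_normal(50)
X_new = rng.uniform(-1, 1, size=(5, 12))

reg = 0.1  # regularisation parameter (arbitrary value for illustration)

# Dual training: only the kernel function changes when the degree changes.
K = poly_kernel(X, X, degree=3)
a = np.linalg.solve(K + reg * np.eye(len(X)), t)

# Dual prediction uses kernel values between new and training points.
pred = poly_kernel(X_new, X, degree=3) @ a

# Size of the explicit feature space for degree 2 and degree 3.
print(n_poly_features(12, 2), n_poly_features(12, 3))  # 91 455
```

The dual system stays at $50\times 50$ regardless of the degree, while the explicit feature space for 12 inputs grows from 91 to 455 dimensions.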