# Kernel methods

## Kernel ridge regression

We have seen that, when the regressor is defined in terms of the $\alpha$ coefficients, it is a linear combination of the inner products between the argument $\mathbf{x}$ and the feature vectors $\mathbf{x}^j$ of the dataset.

This form leaves open interesting possibilities. Instead of the inner product $\langle x, x^j \rangle$ in the Euclidean space $\mathbb{R}^d$, we could map our elements to a different representation $\phi(\mathbf{x})$, where $\phi$, called the _feature map_, is a mapping from $\mathbb{R}^d$ to some Hilbert space $\mathcal{H}$, and replace the Euclidean inner product with
$$ \langle \phi(x), \phi(x^j) \rangle_\mathcal{H}.$$
If we use $\phi(\mathbf{x})$ instead of $\mathbf{x}$ in the ridge regressor, the regressor is no longer linear with respect to the input space $\mathbb{R}^d$, which allows it to capture more intricate relationships within the data. The function
$$\kappa(x, x') = \langle \phi(x), \phi(x') \rangle_\mathcal{H}$$
is called a _kernel_, and a function of the form
$$\tilde{f}(x) = \sum_{j=1}^m \alpha_j \kappa(x, x^j)$$
is a _kernel machine_. 
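To make the definition concrete, here is a minimal sketch using the homogeneous quadratic kernel $\kappa(x, x') = \langle x, x' \rangle^2$, chosen here only for illustration because its feature map (all pairwise products $x_i x_j$) is easy to write down. The function names and the random coefficients `alpha` are assumptions of this example, not part of the text above.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit feature map: all pairwise products x_i * x_j (dimension d**2)."""
    return np.outer(x, x).ravel()

def kappa_quadratic(x, x_prime):
    """Homogeneous quadratic kernel, computed without building phi."""
    return np.dot(x, x_prime) ** 2

def kernel_machine(x, alpha, X_train, kappa):
    """Evaluate f(x) = sum_j alpha_j * kappa(x, x^j)."""
    return sum(a * kappa(x, xj) for a, xj in zip(alpha, X_train))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))  # m = 5 points in R^3
alpha = rng.normal(size=5)         # arbitrary coefficients, for illustration only
x = rng.normal(size=3)

# Same value computed through the kernel and through explicit features.
f_kernel = kernel_machine(x, alpha, X_train, kappa_quadratic)
f_explicit = sum(
    a * (phi_quadratic(x) @ phi_quadratic(xj)) for a, xj in zip(alpha, X_train)
)
```

Both computations agree up to floating-point error, but the kernel version never materializes the $d^2$-dimensional feature vectors.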


### Comparison between the primal and dual forms of the kernel ridge regressor

Suppose we can efficiently compute the feature map $\phi : \mathbb{R}^d \to \mathcal{H}$. By mapping any feature vector $\mathbf{x}$ to its enhanced representation $\phi(\mathbf{x})$, we obtain a regressor in the following form:

$$\tilde{f}(\mathbf{x}) = \sum_{j = 1}^{\dim \mathcal{H}} \tilde{w}_j \phi_j(\mathbf{x})$$

with 

$$\tilde{w} = (\Phi(X)^\top \Phi(X) + \lambda I)^{-1} \Phi(X)^\top \mathbf{y}.$$

We can easily obtain the dual form from the primal one by explicitly constructing the kernel function as the inner product of the transformed vectors. The regressor takes the form:

$$\tilde{f}(\mathbf{x}) = \sum_{j = 1}^{m} \alpha_j \langle \phi(x), \phi(x^j) \rangle_\mathcal{H}.$$
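The equivalence of the two forms can be checked numerically. The sketch below assumes a small explicit feature map (pairwise products, as an illustrative choice) and the standard dual ridge solution $\alpha = (K + \lambda I)^{-1} \mathbf{y}$ with $K = \Phi(X)\Phi(X)^\top$, which is assumed here to match the earlier derivation of the $\alpha$ coefficients. The data are random and serve only to compare predictions.

```python
import numpy as np

def feature_matrix(X):
    """Rows are phi(x) = all pairwise products x_i * x_j (illustrative choice)."""
    m = X.shape[0]
    return np.einsum("ni,nj->nij", X, X).reshape(m, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # m = 20 training points in R^3
y = rng.normal(size=20)        # synthetic targets
lam = 0.1                      # ridge parameter lambda

# Primal: solve a (dim H x dim H) system for the weights.
Phi = feature_matrix(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Dual: solve an (m x m) system for alpha, using only kernel values.
K = (X @ X.T) ** 2             # equals Phi @ Phi.T for this feature map
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predictions on new points coincide up to numerical error.
X_new = rng.normal(size=(4, 3))
pred_primal = feature_matrix(X_new) @ w
pred_dual = ((X_new @ X.T) ** 2) @ alpha
```

The primal system has the size of the feature space, while the dual system has the size of the dataset, which is what drives the choice between the two.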

However, we may be able to compute the kernel function $\kappa$ efficiently even when we have no access to the corresponding feature map. This is the case, for example, with the Gaussian kernel:

$$\kappa(x, x') = \exp(-c \lVert x - x' \rVert^2)$$

with $c > 0$. The corresponding feature map sends $x \in \mathbb{R}^d$ to a Gaussian function in the space $L_2(\mathbb{R}^d)$ centered at $x$ and with fixed variance. Working directly with $\kappa$, which we can compute efficiently, without ever evaluating $\phi$ is known as the _kernel trick_.
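As a minimal sketch of the dual form in this setting, the code below fits a kernel ridge regressor using only kernel evaluations. It assumes the Gaussian kernel in its usual squared-distance form $\exp(-c \lVert x - x' \rVert^2)$ and the standard dual solution $\alpha = (K + \lambda I)^{-1} \mathbf{y}$. The synthetic sine data and the values of `c` and `lam` are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, c=1.0):
    """Matrix of exp(-c * ||a - b||^2) for all rows a of A and b of B."""
    sq_dists = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-c * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))  # m = 50 one-dimensional inputs
y = np.sin(X[:, 0])                   # synthetic noiseless targets
lam = 1e-3                            # ridge parameter lambda

# Fit: only the m x m kernel matrix is needed, never phi.
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict: f(x) = sum_j alpha_j * kappa(x, x^j).
X_test = np.linspace(-2, 2, 5)[:, None]
pred = gaussian_kernel(X_test, X) @ alpha
```

The feature space never appears in the code: training and prediction both reduce to evaluating $\kappa$ on pairs of points.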

Creating an exact primal form for a kernel machine that uses this kernel is challenging, since the feature space is infinite-dimensional. However, we can try to construct a feature map $\bar{\phi}$ that is efficiently computable and satisfies $\langle \bar{\phi}(x), \bar{\phi}(x') \rangle \approx \kappa(x, x')$. Many techniques have been proposed to build such an approximation, one of them being random Fourier features \[rff08\]. In this case, the feature map is approximated with a chosen number of random projections. The more projections are used, the better the approximation tends to be.
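The following sketch illustrates the random Fourier feature idea for the Gaussian kernel $\exp(-c \lVert x - x' \rVert^2)$. It assumes the common cosine construction, with frequencies drawn from $\mathcal{N}(0, 2c\,I)$ and phases drawn uniformly from $[0, 2\pi]$. The names `rff_features`, `W`, `b` and the number of projections `D` are choices made for this example.

```python
import numpy as np

def rff_features(X, W, b):
    """Approximate feature map: sqrt(2/D) * cos(W x + b), one row per input."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, c = 3, 5000, 0.5  # input dimension, number of projections, kernel parameter

# Random projections: frequencies from N(0, 2c I), phases uniform on [0, 2*pi].
W = rng.normal(scale=np.sqrt(2 * c), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

X = rng.normal(size=(10, d))
Z = rff_features(X, W, b)
K_approx = Z @ Z.T  # inner products of the approximate features

# Exact Gaussian kernel matrix for comparison.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-c * sq_dists)

max_err = np.abs(K_approx - K_exact).max()
```

With the approximate features in hand, the primal ridge solution can be computed on `Z` as for any explicit feature map. Increasing `D` reduces `max_err` on average, at the price of a larger primal system.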

How can we choose whether to use the primal or dual form of the kernel regressor? Here are some guidelines:

- When the feature map is easy to compute, use the primal form.
- When the feature map is hard to compute and the dataset size $m$ is large, either approximate the kernel with an explicit feature map and use the primal form, or subsample the dataset and use the dual form.
- When the feature map is hard to compute, and the dataset size $m$ is modest, use the dual form.
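
A rough cost comparison may help explain these guidelines. It assumes direct (non-iterative) solvers and a finite-dimensional feature space, and it introduces two symbols for illustration: $D = \dim \mathcal{H}$ and $T_\kappa$, the cost of one kernel evaluation. Training then costs approximately

$$\text{primal: } O(m D^2 + D^3), \qquad \text{dual: } O(m^2 T_\kappa + m^3).$$

The primal form is therefore attractive when $D \ll m$. The dual form is attractive when $m \ll D$, and it is the only exact option when $\mathcal{H}$ is infinite-dimensional.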