# Kernel models

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChemAI-Lab/AI4Chem/blob/main/website/modules/02-kernel_models.ipynb)

**References:**
1. **Chapter 6**: [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/wp-content/uploads/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf), C. M. Bishop.
2. **Chapter 2**: [Gaussian Processes for Machine Learning](https://direct.mit.edu/books/oa-monograph-pdf/2514321/book_9780262256834.pdf), C. E. Rasmussen, C. K. I. Williams.
3. **Chapter 4**: [Machine Learning in Quantum Sciences](https://arxiv.org/pdf/2204.04198)
4. [**The Kernel Cookbook**](https://www.cs.toronto.edu/~duvenaud/cookbook/)

# Introduction

So far we have covered regression models of the form
$$
f(\mathbf{\phi}(\mathbf{x}),\mathbf{w}) = \mathbf{w}^\top \mathbf{\phi}(\mathbf{x})= \sum_{i=1}^{d'} w_i \, \phi_i(\mathbf{x}),
$$
where the non-linear transformations $\phi_i$, when chosen properly, yield powerful regression models.
The change from 

$$
\mathbf{x} = \underbrace{\begin{bmatrix}
 x_1 \\
 \vdots \\
x_d
\end{bmatrix}}_{\text{input space}} \;\;  \to \;\; \mathbf{\phi}(\mathbf{x}) = \underbrace{\begin{bmatrix}
\phi_1(\mathbf{x}) \\
 \vdots \\
\phi_{d'}(\mathbf{x})
\end{bmatrix}}_{\text{feature space}}
$$
is also known as a **feature transformation**; a polynomial expansion is one example. <br>
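As a minimal sketch of a feature transformation, the snippet below maps scalar inputs to polynomial features and evaluates the linear model $\mathbf{w}^\top \mathbf{\phi}(\mathbf{x})$. The helper name `poly_features`, the choice of a one-dimensional input, and the weight values are illustrative assumptions, not part of the original material.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x, shape (N,), to [1, x, ..., x^degree], shape (N, degree + 1)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**p for p in range(degree + 1)], axis=1)

# Three scalar inputs (input space)
x = np.array([0.0, 1.0, 2.0])

# Feature-space representation: one row phi(x_n) per input
Phi = poly_features(x, degree=3)

# Illustrative weights (hypothetical values, not fitted to any data)
w = np.array([1.0, 0.0, -2.0, 0.5])

# The model is linear in w but non-linear in x
f = Phi @ w
print(Phi)
print(f)
```

The model stays linear in the parameters $\mathbf{w}$, so the linear regression machinery still applies; only the inputs have been transformed.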


<!-- There is another class of models, where prediction is done through a linear combination of a **kernel function** evaluated on at the training data points.  -->

* An alternative approach builds the regression model from the **distance** or **similarity** between training data points, as measured by a **kernel** function.
* The kernel function allows us to *implicitly* use a high-dimensional feature space, e.g., an infinite polynomial expansion.
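To make the idea of an implicit feature space concrete, here is a small sketch for two-dimensional inputs. It checks numerically that the quadratic kernel $k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^\top\mathbf{z})^2$ equals the inner product of the explicit features $[x_1^2, \sqrt{2}x_1x_2, x_2^2]$. The function names and the example points are assumptions made for illustration; the Gaussian (RBF) kernel is included only as an example of a distance-based similarity.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit degree-2 feature map for a 2D input."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def k_quadratic(x, z):
    """Quadratic kernel: same inner product, computed without building phi."""
    return np.dot(x, z) ** 2

def k_rbf(x, z, lengthscale=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    sq_dist = np.sum((x - z) ** 2)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi_quadratic(x) @ phi_quadratic(z)  # inner product in feature space
implicit = k_quadratic(x, z)                    # kernel evaluated in input space
print(explicit, implicit)
print(k_rbf(x, x), k_rbf(x, z))
```

The kernel only ever touches the original inputs, which is what makes very high-dimensional (or infinite-dimensional, as for the RBF kernel) feature spaces tractable.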


# Kernel trick
Here, we present a derivation of the kernel trick following [Appendix B in Machine Learning in Quantum Sciences](https://arxiv.org/pdf/2204.04198).

1. Let's define the ridge regression loss function
$$
\begin{equation}
    {\mathcal L}(\mathbf{w},\mathbf{X},\mathbf{y}) = \left\| \mathbf{X}\mathbf{w} - \mathbf{y} \right\|_2^2  + \lambda \left\| \mathbf{w} \right\|_2^2,
\end{equation}
$$
where
$$
\mathbf{X} = \begin{bmatrix}
\mathbf{x}_1^\top \\
\mathbf{x}_2^\top \\
\vdots \\
 \mathbf{x}_{N}^\top \\
\end{bmatrix} = \begin{bmatrix}
x_{1,1} & x_{1,2} &\cdots & x_{1,d}\\
x_{2,1} & x_{2,2}&\cdots & x_{2,d}\\
\vdots & \vdots &\ddots & \vdots\\
x_{N,1} & x_{N,2}&\cdots & x_{N,d}\\
\end{bmatrix} \text{ and  } \mathbf{y} = \begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_{N}
\end{bmatrix},
$$
and $\lambda \geq 0$ controls the strength of the regularization.

The optimal set of parameters $\mathbf{w}^*$ is found by minimizing ${\mathcal L}(\mathbf{w},\mathbf{X},\mathbf{y})$ with respect to $\mathbf{w}$,
$$
    \mathbf{w}^* = \arg \min_{\mathbf{w}} {\mathcal L}(\mathbf{w},\mathbf{X},\mathbf{y}) = \arg \min_{\mathbf{w}} \left\| \mathbf{X}\mathbf{w} - \mathbf{y} \right\|_2^2  + \lambda \left\| \mathbf{w} \right\|_2^2.
$$
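The loss above can be evaluated and minimized numerically, as in the following sketch. It relies on the fact that the ridge loss equals an ordinary least-squares objective on a stacked system, where $\sqrt{\lambda}\,\mathbf{I}$ is appended to $\mathbf{X}$ and zeros to $\mathbf{y}$. The synthetic data, the value of `lam`, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Ridge loss: ||X w - y||_2^2 + lam * ||w||_2^2."""
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
N, d = 20, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)
lam = 0.1

# Stacked system: ||X_aug w - y_aug||^2 = ||X w - y||^2 + lam * ||w||^2
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])

# Least-squares solution of the stacked system minimizes the ridge loss
w_star, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print("w* =", w_star)
print("loss at w*       =", ridge_loss(w_star, X, y, lam))
print("loss at w* + 0.1 =", ridge_loss(w_star + 0.1, X, y, lam))
```

Any perturbation of `w_star` should increase the loss, which is a quick sanity check on the minimizer.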

1. We find the value of $\mathbf{w}$ where  $\nabla_{\mathbf{w}}{\mathcal L} = \mathbf{0}$. 