# Kernels

## Learning Objectives

After reading this notebook, students will be able to:

- Understand linearly separable and non-linearly separable data.
- Understand the concept of data transformation.
- Understand the concept of kernels.
- Create a new kernel.

## Introduction
So far we have explored the concepts and uses of linearly inseparable datasets and types of kernels. In this section, we will learn the working mechanisms of **kernels**.

## Quick Recap: Dual Form of SVM

In Support Vector Machines, instead of solving the primal optimization problem directly, we often switch to the dual formulation, which is more efficient when the number of features is high or when using kernel functions.

The dual form of the hard-margin SVM can be written as:
$$\max_{\alpha} \quad W(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j)$$

subject to:

$$\alpha_i \geq 0, \quad i = 1, \ldots, n$$
$$\sum_{i=1}^n \alpha_i y_i = 0$$
where,

$\alpha_i$ are the Lagrange multipliers (dual variables),

$x_i$ and $x_j$ are training data points,

$y_i$ and $y_j$ are their corresponding class labels.
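
To make the dual objective concrete, here is a minimal NumPy sketch that evaluates $W(\alpha)$ and checks the two constraints for a given $\alpha$. The helper names `dual_objective` and `is_feasible`, and the two-point toy dataset, are illustrative assumptions and not part of any library.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    gram = X @ X.T                  # gram[i, j] = x_i^T x_j
    Q = np.outer(y, y) * gram       # Q[i, j] = y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def is_feasible(alpha, y, tol=1e-9):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0 (up to a tolerance)."""
    return bool(np.all(alpha >= -tol) and abs(alpha @ y) <= tol)

# Hypothetical toy data: one point per class, placed symmetrically.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

print(is_feasible(alpha, y))        # True
print(dual_objective(alpha, X, y))  # 0.5
```

Note that the data enter the objective only through the inner products $x_i^T x_j$, which is what makes it possible to swap in a kernel later.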
<div>
  <div align="center">
    <figure>
     <img src="https://i.postimg.cc/fT8dnRtp/support-vector.png" width="500">
     <figcaption>
     Figure 1: Representation of support vectors
     </figcaption>
    </figure>
  </div>
</div>

A kernel function $K(x_i, x_j)$ computes the inner product of two vectors $x_i$ and $x_j$ after they have been mapped to a higher-dimensional space by a transformation function $\phi(x)$:

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$

where:

- $x_i, x_j \in \mathbb{R}^n$: input vectors in the original feature space
- $\phi(x)$: a mapping function that transforms the input into a higher-dimensional space
- $\langle \cdot, \cdot \rangle$: the dot product in the transformed space

The kernel allows us to measure similarity in a transformed space without ever explicitly computing the transformation $\phi(x)$. This idea is fundamental to the kernel trick, which we will explore later in this notebook.

## Mapping Function ϕ(x)

The transformation function ϕ(x) is responsible for projecting the data from its original low-dimensional space into a higher-dimensional (possibly infinite-dimensional) feature space. The goal of this transformation is to make the data linearly separable in the new space.

However, explicitly computing $\phi(x)$ for high-dimensional or infinite-dimensional spaces can be computationally expensive or infeasible. That's where the kernel trick comes in, which allows us to work with the inner products $\langle \phi(x_i), \phi(x_j) \rangle$ directly, without ever computing $\phi(x)$.

Before diving into the kernel trick, let's first understand what data transformation is.

## Data Transformation
Data transformation is one of the most common techniques for handling data that is not linearly separable. It adds new features to the training data by recombining the existing ones.

The idea behind data transformation is that if we increase the dimensionality of the data without altering the original features, the separation between the classes can become more pronounced, which makes the data easier to classify.

To understand the idea, imagine we have a function $\phi(x)$ that transforms a feature vector $x$ by creating various combinations of its existing features. For example, suppose we have a data point $x$ with three features,
$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3\end{bmatrix}$. The transformation of the feature vector would be
$$ \phi(x) = \begin{bmatrix} 1\\ x_1\\ x_2\\ x_3\\ x_1 x_2\\x_1 x_3\\ x_2 x_3\\ x_1 x_2 x_3 \end{bmatrix}
$$
A single transformed feature in the new feature space can be thought of as either taking an old feature into computation or not. For example, the first term $1$ of this transformed feature ($\phi(x)$) is obtained by considering none of the features of $x$. On the other hand, the last term $x_1 x_2 x_3$ is considering all the features of $x$. With this kind of transformation, the number of features in the transformed feature space is $2^n$, where $n$ is the dimension of vector $x$.

For a more general case, if we have a feature vector $$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_d \end{bmatrix}$$ then,
$$
\phi(x) = \begin{bmatrix} 1\\ x_1\\ x_2\\ \vdots\\ x_d\\ x_1 x_2\\ x_1 x_3\\ \vdots\\ x_1 x_d\\ \vdots\\ x_1 x_2 \cdots x_d \end{bmatrix}
$$


The number of elements in $\phi(x)$ is $2^d$, so the size of $\phi(x)$ grows exponentially with the dimension $d$ of the data point $x$. In practice, $d$ is often in the hundreds and sometimes in the thousands, and for such sizes computing $\phi(x)$ explicitly becomes computationally infeasible. You may wonder why we are discussing a transformation that is generally not feasible to compute. We will build on this idea throughout this chapter. In fact, we will also see a function that maps our data to an infinite-dimensional space, which cannot be computed explicitly no matter how much we increase our computational resources.
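
The subset-product transformation described above can be sketched in a few lines of standard-library Python. The function name `subset_product_features` is a hypothetical helper chosen for this illustration. It builds one feature per subset of the coordinates, which makes the $2^d$ growth easy to see.

```python
from itertools import combinations
from math import prod

def subset_product_features(x):
    """Return one feature per subset of the coordinates of x (the product of that subset)."""
    d = len(x)
    features = []
    for size in range(d + 1):
        for idx in combinations(range(d), size):
            # The empty subset gives the empty product, which is 1.
            features.append(prod(x[i] for i in idx))
    return features

phi = subset_product_features([2.0, 3.0, 5.0])
print(phi)       # [1, 2.0, 3.0, 5.0, 6.0, 10.0, 15.0, 30.0]
print(len(phi))  # 8 = 2**3

# The feature count doubles with every extra input dimension.
for d in (5, 10, 20):
    print(d, 2 ** d)
```

Already at $d = 20$ the transformed vector has more than a million entries, which is why computing $\phi(x)$ explicitly does not scale.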

You might be wondering how we can work with data transformed to a high-dimensional, or even infinite-dimensional, space. This is where a beautiful technique called the **Kernel Trick** comes in, which we cover in detail below. The kernel trick gives us the effect of transforming our features to a higher-dimensional space while only ever computing inner products between data points in the original space. If this is unclear to you, do not be intimidated. After going through this chapter, you will have a thorough understanding of the concept as well as the mathematical derivation of the kernel trick.

## Kernel Trick

<!-- In the above solution, we just multiplyed features to make a new feature and called it a day. In doing so the computional cost for the feature creation increasses exponentionaly and hence we use kernel trick.

Lets consider a linear model.
> $$Y = W^TX$$
As we already know that this linear model cannot classify the no-linear data hence our option ia to add features to them to make the complex model.  
After we make our model complex our model will look somrhing like this.
$$Y = W^T\phi(X)$$
But how do we find $\phi(x)$, in this scetion we will discuss below.

Consider, the cost function of our linear model with respcet to weight.  
cost function$(J(w))$ = $\sum_{i=1} ^n(y_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} \sum_{i=1}^n ||w||^2$
taking derivative with respect to weight and equatin it to $0$, we get the optimal weight($w^*$)

$w^* = \frac{1}{\lambda} \sum^n_{n-1} (y_n - w^T \phi(x_n)) \phi(x_n)$  
 To make our equation somehow easier lets introduce a new varaible $\alpha$.
 $$\alpha = \frac{1}{\lambda} \sum^n_{n-1} (y_n - w^T \phi(x_n))$$
 Hence we can write out equation as  
 $w^* = \sum _{n=1}^{n} \alpha_n \phi(x_n)$

Vectorizing and substutiding with respect to weight($w$)

$J(w) = (y - \phi(w))^T ~ (y - \phi(w)) + \frac{\lambda}{2} w^Tw$

Simplifying the above equation we get,   
$J(w) = y^Ty -  y^T\phi w - w^T\phi ^T y + w^T\phi^T\phi w + \frac{\lambda}{2} w^Tw$

$J(\alpha) = y^Ty - y^T\phi(\phi^T\alpha) - (\phi^T\alpha)^T\phi^Ty + (\phi^T\alpha)^T \phi^T\phi (\phi^T\alpha)  + \frac{\lambda}{2} (\phi^T\alpha)^T(\phi^T\alpha)$

$J(\alpha) = y^Ty - y^T \phi \phi^T \alpha - \alpha \phi \phi^T y + \alpha^T \phi \phi^T \phi \phi^T\alpha + \frac{\lambda}{2} \alpha^T \phi \phi^T \alpha$

In the above expression note thae we have **$\phi \phi^T$**. which is called **Gram Matrix** or **Kernel matrix** represented by $K$. It exibits two properties:
* Symmetry  
 i.e. $K^T = K$  
* positive semi-definate  
 i.e. $\alpha^T K \alpha \geq 0$

 now again, lets simplify the above expression even forther.  
As we have:  

$J(\alpha) = y^Ty - y^T \phi \phi^T \alpha - \alpha \phi \phi^T y + \alpha^T \phi \phi^T \phi \phi^T\alpha + \frac{\lambda}{2} \alpha^T \phi \phi^T \alpha$

This expression can be written as:  
$J(\alpha) = y^Ty - y^T K \alpha - \alpha^TKy + \alpha^T K^2 \alpha + \frac{\lambda}{2} \alpha^T K \alpha$  
Since, $K = K^T$, we can rewrite above expression as:
$J(\alpha) = y^Ty - y^T K \alpha - y^T K \alpha + \alpha^T K^2 \alpha + \frac{\lambda}{2} \alpha^T K \alpha$

$J(\alpha) = y^Ty -2 y^T K \alpha + \alpha^T K^2 \alpha + \frac{\lambda}{2} \alpha^T K \alpha$

Now we minimize the value of $J(\alpha)$ by taking partial derivatives with respect to $\alpha$. Hence our expression will be:  
$\frac{\delta J \alpha}{\delta \alpha} = 0 - 2y^TK + 2 \alpha^T K^2 + \lambda \alpha^T K = 0$   
$-2y^T + 2 \alpha^T K + \lambda \alpha ^T = 0$   
$\alpha^Tk + \frac{\lambda}{2} \alpha^T = y^T$  
$\alpha^T = y^T(K + \frac{\lambda}{2}I)^{-1}$  
$\alpha = (K + \frac{\lambda}{2}I)^{-1} y$  
Which is our optimal $\alpha$. After getting our optimal alpha, we can express our weights in term of this kernel matrix.  
$\alpha = (K + \lambda^{'}I)^{-1}y$

$w = \phi^T \alpha = \phi^T(K + \lambda^{'}I)^{-1}y$
 -->


The kernel trick is a mathematical technique that allows Support Vector Machines (SVM) (and other algorithms) to operate in a high-dimensional, even infinite-dimensional, feature space without ever explicitly computing the transformation ϕ(x).

A kernel function returns the inner product of two data points after they have been mapped to a higher-dimensional feature space. For given data points $x_1$ and $x_2$, we write the kernel function $k$ as:
$$k(x_1,x_2) = \langle \phi(x_1), \phi(x_2) \rangle$$  
Here, $x_1$ and $x_2$ are the given data points, and $\phi(x_1)$ and $\phi(x_2)$ are their non-linear images in the feature space.


### Intuition of Kernels

To build a mathematical understanding of kernel methods, let's begin with a simple linear regression model. In its basic form, the hypothesis function is:

$$y = w^T x$$

When we introduce composite (non-linear) features into the model, we map the original input *x* into a new feature space using a transformation function ϕ(x). The model now becomes:

$$y = w^T \phi(x)$$  

### Linear Regression with Feature Transformation

Taking the cost function of our linear regression model,

$J(w) = \frac{1}{2} \sum_{i=1}^n(y_i - w^T x_i)^2$

the optimum weight is:  
$w^* = \arg\min_w ~ J(w)$

After vectorizing and simplifying we get,  
$w = (X^TX)^{-1} X^T y$  

where $X$ is the matrix whose $i$-th row is $x_i^T$. For the transformed features, we can write the above expression as:  
$w = (\phi^T \phi)^{-1} \phi^T y$

where $\phi$ now denotes the matrix whose $i$-th row is $\phi(x_i)^T$.

Now let's get into **kernelization**. We have our cost function for the linear regression:  

$J(w) = \frac{1}{2} \sum_{i=1}^n(y_i - w^T\phi(x_i))^2$  

Taking the derivative with respect to the weight $w$ gives  
$\nabla_w J(w) = -\sum_{i=1}^n(y_i - w^T \phi(x_i)) \phi(x_i)$  

The gradient is a linear combination of the feature vectors $\phi(x_i)$ (for example, gradient descent started from $w = 0$ never leaves the span of the $\phi(x_i)$). An optimum weight $w^*$ can therefore be written as such a combination, with one coefficient $\alpha_i$ per training point:

$w^* = \sum_{i=1}^n \alpha_i \phi(x_i)$

Therefore, $w^* = \phi^T \alpha$

Revisiting our cost function (dropping the constant factor $\frac{1}{2}$, which does not change the minimizer):  
$J(w) = \sum_{i=1} ^n(y_i - w^T\phi(x_i))^2$  
We now vectorize, substitute $w = \phi^T \alpha$, and keep simplifying the expression.  
$J(w) = (y - \phi w)^T (y - \phi w)$  
$J(w) = y^Ty - y^T \phi w - w^T \phi ^T y + w^T \phi ^T \phi w $  
$J(\alpha) = y^Ty - y^T \phi (\phi^T \alpha) - (\phi^T \alpha)^T \phi ^T y + (\phi^T \alpha)^T \phi ^T \phi (\phi^T \alpha)$

$J(\alpha) = y^Ty - y^T \phi \phi^T \alpha - \alpha^T \phi \phi ^T y + \alpha^T \phi \phi ^T \phi \phi^T \alpha$

Here $\phi \phi^T$ is the Gram matrix, or kernel matrix. We will denote it as $K$ from this point on.  
$$K = \begin{bmatrix}
\phi(x_1)^T \phi(x_1)&\cdots&\phi(x_1)^T \phi(x_n)\\
\vdots & \ddots & \vdots\\
\phi(x_n)^T \phi(x_1)&\cdots&\phi(x_n)^T \phi(x_n)
\end{bmatrix}$$

The elements of the Gram matrix are the dot products of every pair of feature vectors, and it exhibits the following properties:
* Symmetry:  
 $K^T = K$
* Positive semi-definiteness:  
 $A^T K A \geq 0$ for any vector $A$, i.e. this quadratic form is never negative.

Continuing with our equation,  
$J(\alpha) = y^Ty - y^T \phi \phi^T \alpha - \alpha^T \phi \phi ^T y + \alpha^T \phi \phi ^T \phi \phi^T \alpha$  

Simplifying the above equation,  

$J(\alpha) = y^Ty - y^T K \alpha - \alpha^T K y + \alpha^T K K \alpha$  
$J(\alpha) = y^Ty - y^T K \alpha - y^T K \alpha + \alpha^T K^2 \alpha$ $~~~~~(\because K = K^T)$

$J(\alpha) = y^Ty - 2y^T K \alpha + \alpha^T K^2 \alpha$

To get the optimum value of $\alpha$, we have to minimize the function. We can do that by taking the derivative with respect to $\alpha$ and setting it to zero. Hence our equation will be:  
$\frac{\partial J(\alpha)}{\partial \alpha} = -2y^TK + 2\alpha^TK^2 = 0$  
$-y^T + \alpha^TK = 0$ (assuming $K$ is invertible)  
$\alpha^T = y^TK^{-1}$  
$\alpha = K^{-1}y$

Once we get the optimal $\alpha$, we can express the weights in terms of the kernel matrix:  
$w = \phi^T\alpha = \phi^TK^{-1}y = \phi^T (\phi \phi^T)^{-1}y$

Now let's compare this with the equation that we derived earlier without kernels:

Without kernels: $w  = (\phi^T \phi)^{-1} \phi^T y$,  

With kernels: $w = \phi^T (\phi \phi^T)^{-1}y$


What is the difference? Before, we needed $\phi^T \phi$, which plays the role of the feature covariance matrix, whereas $\phi \phi^T$ is the kernel matrix. It may look like we still need $\phi$. However, as we discussed before, $\phi \phi^T$ is a kernel matrix, and being a kernel matrix it is symmetric and positive semi-definite.

If we look at Mercer's theorem, it says that a symmetric positive semi-definite kernel can be written as a dot product of feature vectors $\phi$. Each entry of $K$ can therefore be computed by a kernel function $k$, i.e.  
$$\phi \phi^T = K =
\begin{bmatrix}
\phi(x_1)^T \phi(x_1)&\cdots&\phi(x_1)^T \phi(x_n)\\
\vdots & \ddots & \vdots\\
\phi(x_n)^T \phi(x_1)&\cdots&\phi(x_n)^T \phi(x_n)
\end{bmatrix}
=
\begin{bmatrix}
k(x_1, x_1)&\cdots&k(x_1, x_n)\\
\vdots & \ddots & \vdots\\
k(x_n, x_1)&\cdots&k(x_n, x_n)
\end{bmatrix}$$

This is the famous kernel trick.

Let's look at an example.
Consider a valid kernel:  
$$k(x, z) = (x^Tz)^2$$  
Suppose our data points are two-dimensional, so $x = (x_1, x_2)$ and $z=(z_1, z_2)$. Now we expand the above equation,
$$k(x, z) = (x^Tz)^2 = (x_1z_1 + x_2z_2)^2$$
$$k(x, z) = (x_1z_1)^2 + 2x_1z_1x_2z_2 + (x_2z_2)^2$$
$$k(x, z) = (x_1^2, \sqrt{2}x_1x_2, x_2^2) (z_1^2, \sqrt{2}z_1z_2, z_2^2)^T$$

$$k(x, z) = \langle \phi(x), \phi(z) \rangle$$

where $\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)^T$. We can see that the dot product of the non-linear basis functions $\phi$ is a simple function of the original input vectors. We can therefore compute the kernel matrix without knowing the explicit form of $\phi$.
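
The identity above is easy to confirm numerically. The following sketch, with illustrative helper names `poly2_kernel` and `phi` and arbitrarily chosen input values, computes the kernel both directly in the input space and through the explicit three-dimensional feature map.

```python
import numpy as np

def poly2_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed in the original 2-D space."""
    return float(x @ z) ** 2

def phi(x):
    """Explicit feature map for the kernel above: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly2_kernel(x, z))  # (1*3 + 2*(-1))^2 = 1.0
print(phi(x) @ phi(z))     # same value, up to floating-point rounding
```

The first route needs one dot product in two dimensions. The second needs the mapped vectors first, and that gap widens quickly as the input dimension or the degree grows.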




## Constructing Kernels
While kernel functions intuitively map input data into higher-dimensional feature spaces, not every function can serve as a valid kernel. A kernel must satisfy certain mathematical properties to ensure it corresponds to an inner product in some (possibly infinite-dimensional) feature space.

### Approach 1: Define the Feature Map Explicitly

One method is to explicitly define a feature mapping $\phi(x)$, and compute the kernel as the inner product in the transformed space:
$$k(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle = \sum _{i=1} ^M \phi_i(x_1) \phi_i(x_2)$$

### Approach 2: Define Kernel Directly
Alternatively, we can define the kernel function directly, without specifying the mapping $\phi$. To verify its validity, we construct a matrix $K$ by evaluating the function over all pairs of input points: $K_{ij} = k(x_i, x_j)$

If the resulting matrix $K$ is symmetric and positive semidefinite (i.e., it satisfies the properties of a Gram matrix) for every possible choice of input points, then $k$ is a valid kernel.
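
As a rough numerical illustration of this check, the sketch below builds the Gram matrix of a candidate function on a random sample and inspects its symmetry and smallest eigenvalue. The helpers `gram_matrix` and `passes_gram_check`, the sample, and the two candidate functions are assumptions made for this example. A check on one finite sample can only rule a function out. Passing it does not prove validity.

```python
import numpy as np

def gram_matrix(kernel, X):
    """K[i, j] = kernel(x_i, x_j) for all pairs of rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def passes_gram_check(kernel, X, tol=1e-9):
    """True if the Gram matrix on X is symmetric with no eigenvalue below -tol."""
    K = gram_matrix(kernel, X)
    symmetric = np.allclose(K, K.T)
    min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
    return bool(symmetric and min_eig >= -tol)

def rbf(x, z):
    return np.exp(-0.5 * np.sum((x - z) ** 2))

def negative_dot(x, z):
    # Not a valid kernel: its Gram matrix is -X X^T, which has negative eigenvalues.
    return -float(x @ z)

X = np.random.default_rng(0).normal(size=(20, 3))
print(passes_gram_check(rbf, X))           # True
print(passes_gram_check(negative_dot, X))  # False
```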

### Approach 3: Combine Valid Kernels

Another powerful approach is to construct new kernels by combining known valid kernels. If $k_1(x, z)$ and $k_2(x, z)$ are valid kernels, then the following constructions also yield valid kernels:

$k(x, z) = ck_1(x, z)$  
$k(x, z) = f(x) k_1(x, z) f(z)$  
$k(x, z) = q(k_1(x, z))$  
$k(x, z) = \exp(k_1(x, z))$  
$k(x, z) = k_1(x, z) + k_2(x, z)$  
$k(x, z) = k_1(x, z) k_2(x, z)$  
$k(x, z) = k_3(\phi(x), \phi(z))$  
$k(x, z) = x^T A z$  
$k(x, z) = k_a(x_a, z_a) + k_b(x_b, z_b)$  
$k(x, z) = k_a(x_a, z_a) k_b(x_b, z_b)$

 where $c > 0$ is a constant,

 $f(·)$ is any function,

 $q(·)$ is a polynomial with non-negative coefficients,

 $\phi(x)$ is a function from $x$ to $\mathbb{R}^M$,

 $k_3(·, ·)$ is a valid kernel in $\mathbb{R}^M$,

 $A$ is a symmetric positive semidefinite matrix,

 $x_a$ and $x_b$ are variables (not necessarily disjoint) with $x = (x_a, x_b)$,

 $k_a$ and $k_b$ are valid kernel functions over their respective spaces.
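
A few of these closure rules can be sanity-checked numerically. The sketch below is only an illustration on one random sample, not a proof. It starts from the Gram matrices of a linear kernel and an RBF kernel, applies several of the constructions entrywise, and confirms that each result is still positive semidefinite up to a tolerance. The helper `is_psd`, the function `f`, and the specific constants are assumptions chosen for the example.

```python
import numpy as np

def is_psd(K, rel_tol=1e-8):
    """True if the symmetric matrix K has no eigenvalue below a small relative tolerance."""
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)
    return bool(eigvals.min() >= -rel_tol * max(1.0, abs(eigvals.max())))

X = np.random.default_rng(1).normal(size=(15, 2))

# Gram matrices of two known valid kernels on the same points.
K1 = X @ X.T                                                    # linear kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-0.5 * sq_dists)                                    # RBF kernel, gamma = 0.5

f = 1.0 + X[:, 0] ** 2                                          # an arbitrary function f(x)

results = {
    "c * k1": is_psd(3.0 * K1),
    "f(x) k1 f(z)": is_psd(f[:, None] * K1 * f[None, :]),
    "q(k1) = k1^2 + 2 k1 + 1": is_psd(K1 ** 2 + 2.0 * K1 + 1.0),
    "exp(k1)": is_psd(np.exp(K1)),
    "k1 + k2": is_psd(K1 + K2),
    "k1 * k2": is_psd(K1 * K2),                                 # entrywise product
}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

Note that a product of kernels corresponds to the entrywise (Hadamard) product of their Gram matrices, not the matrix product.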



## Making Predictions  
In linear models, the prediction is typically expressed as:

$y = w^T \phi(x)$.   
In the kernelized form, we substitute the weights $w$ derived from the training data:

$w^T \phi(x) = y^T(\phi \phi^T)^{-1} \phi \, \phi(x)$ $~~~~~(\because w = \phi^T (\phi \phi^T)^{-1}y)$  
$w^T \phi(x)= y^T K^{-1} k_x$   
where we replaced $\phi \, \phi(x)$, the product of the training feature matrix $\phi$ with the feature vector of the new point $x$, by:  
$\phi \, \phi(x) = \begin{bmatrix} \phi(x_1)^T \phi(x )\\
\phi(x_2)^T\phi(x)\\
\vdots \\
\phi(x_n)^T\phi(x)\end{bmatrix} = \begin{bmatrix} k(x_1, x)\\
k(x_2, x)\\
\vdots \\
k(x_n, x)\end{bmatrix} = k_x$

Here we see that we do not need to know the explicit form of $\phi$ at all.
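
The prediction rule can be checked end to end on a small example. This is a minimal sketch under stated assumptions: the kernel $(1 + x^Tz)^2$, its explicit six-dimensional feature map, the five training points, and the targets are all chosen for illustration, and $K$ is assumed invertible (which holds for these points). It compares the kernel-only prediction $y^T K^{-1} k_x$ with the prediction $w^T \phi(x)$ computed through the explicit features.

```python
import numpy as np

def kernel(x, z):
    """Polynomial kernel k(x, z) = (1 + x^T z)^2."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces the kernel above."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

# Hypothetical training set and query point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
y = np.array([1.0, 2.0, 0.5, 3.0, -1.0])
x_new = np.array([0.5, 0.25])

# Kernel route: only kernel evaluations, phi is never called.
K = np.array([[kernel(a, b) for b in X] for a in X])
k_x = np.array([kernel(a, x_new) for a in X])
alpha = np.linalg.solve(K, y)            # alpha = K^{-1} y
pred_kernel = alpha @ k_x                # y^T K^{-1} k_x

# Explicit route: build the feature matrix and the weight vector.
Phi = np.array([phi(a) for a in X])      # row i is phi(x_i)^T
w = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)
pred_explicit = w @ phi(x_new)

print(pred_kernel, pred_explicit)        # the two predictions agree
```

Solving the linear system with `np.linalg.solve` instead of forming $K^{-1}$ explicitly is a common numerical choice and does not change the result.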

## Kernel Parameters

Kernel parameters are the internal constants or coefficients within a kernel function that control how the input data is transformed and how flexible or complex the resulting decision boundary can be. The most common kernel parameters include:

- γ (gamma) in the Radial Basis Function (RBF) kernel
- d (degree) in the Polynomial kernel
- C in the SVM regularization term


### γ (Gamma) : in RBF Kernel

The gamma parameter controls the influence of each training sample. Mathematically, it appears in the RBF kernel as:

$$K(x, z) = \exp(-\gamma \|x - z\|^2)$$

Low γ:

- Wider Gaussian functions.

- The influence of each data point is spread out.

- Leads to smoother decision boundaries.

- High bias, low variance.

High γ:

- Narrow Gaussians.

- Each data point has more localized influence.

- Results in complex boundaries, possibly overfitting.

- Low bias, high variance.
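
To see how γ changes the reach of a single training point, the sketch below evaluates the RBF kernel between two points at distance 1 for a few values of γ. The function name `rbf_kernel` and the chosen points are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.0])
z = np.array([1.0, 0.0])  # at distance 1 from x

for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, z, gamma))
# gamma = 0.1  -> about 0.905   (the point still looks very similar)
# gamma = 1.0  -> about 0.368
# gamma = 10.0 -> about 4.5e-05 (the point is effectively ignored)
```

With a low γ the similarity decays slowly with distance, so each point influences a wide region. With a high γ it drops to almost zero within a short distance, which is what produces the highly localized boundaries.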

### d (Degree) : in Polynomial Kernel
In polynomial kernels, the degree determines the polynomial power of the dot product:   $$K(\mathbf{x}, \mathbf{z}) = (1 + \mathbf{x}^\intercal \mathbf{z})^d$$


Low d:
- Simpler models, like linear or quadratic boundaries.
- Higher bias, lower variance.

High d:
- More flexible decision boundaries capturing intricate patterns.
- Lower bias, higher variance.
- Greater risk of overfitting, especially with small datasets.
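
One way to quantify the growth in flexibility is to count the features implied by the kernel. Expanding $(1 + \mathbf{x}^\intercal \mathbf{z})^d$ for $n$ input features gives every monomial of degree at most $d$, and there are $\binom{n+d}{d}$ of them. The sketch below uses a hypothetical helper `poly_feature_count` and an example input size of 10 to tabulate this.

```python
from math import comb

def poly_feature_count(n_features, degree):
    """Number of monomials of degree <= `degree` in `n_features` variables."""
    return comb(n_features + degree, degree)

for d in (1, 2, 3, 5):
    print(d, poly_feature_count(10, d))
# 1 11
# 2 66
# 3 286
# 5 3003
```

The kernel itself still costs a single dot product and one power regardless of $d$. It is the implicit feature space, and with it the capacity to overfit, that grows.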

### C — Regularization Parameter in SVM

While not a parameter of the kernel function itself, C determines the trade-off between a smooth decision boundary and correctly classifying training points.

Low C:

- More tolerant of misclassifications (soft margin).
- Simpler model, higher bias, lower variance.

High C:
- Strives to classify every training point correctly (approaching a hard margin).
- More complex model, lower bias, higher variance.


## Tips on Practical Use
**1. Setting C**:

In scikit-learn, C is 1 by default, and this is a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.

LinearSVC and LinearSVR are less sensitive to C when it becomes large, and prediction results stop improving after a certain threshold. Meanwhile, larger C values will take more time to train, sometimes up to 10 times longer.

## Key Takeaways:
* We need the following to handle linearly inseparable data:
  * Multiple linear models
  * Transforming data
* Data transformation adds dimensions to the data to make it linearly separable.
* A kernel computes the inner product of two data points in the transformed feature space without computing the mapping explicitly.
* Using kernels makes data transformation computationally cheap.