## Linear Regression using Kernel Trick
### Kernel Trick
One of the limitation of linear regression is that it is usually implemented to utilize the linear relationship between the predictors in the input space, but this tend to overlook the interaction between these predictors in the input space. One of the approaches to mitigate this probelem is by mapping the input space into the feature space($\underline{x}\rightarrow\phi(\underline{x}$). In which thie feature space tends to be a in a higher dimension than the original input space, and in this space we can apply linear regression and achieve good result by finding the parameter of the non-linear model with respect to the feature. Nevertheless, this approach suffers from that it is computationally expensive to compute the feature vector for each observation. To solve this issue we use what is known as the kernel trick and a better name for this approach would be kernel substitution, in which instead of calculating the feature vector explicitly we could calculate it implicitly using the kernel function. The kernel function is an inner product between two feature vectors which can be expressed as follows:- $k(x,x')=\langle \phi(x), \phi(x')\rangle=\phi(x)^T\phi(x')$. There are many kernels from the like of linear kernel($k(x,x')=x^Tx'$), stationary kernel ($k(x,x')=c(x-x')$), Gaussian kernel ($k(x, x')=exp(\frac{-||x-x'||_2^2}{2\sigma^2})$), and etc. One can view kernels as a similarity measure between two observation. There are many conditions to choose a kernel one of those condition are sufficient condition(Mercer kernel) in which the kernel matrix must be P(S)D, in which this can be tested by checking for the eignevalues of the kernel matrix. THe kernel matrix is the outer product of the design matrix, and the design matrix will have its rows as the feature vector of each observation. The equation of kernel marix is as follows:- $K = \Phi\Phi^T$in which:-
$$
\begin{align*}
&\Phi=
\begin{pmatrix}
& \phi(\underline{x_1})^T\\
& \phi(\underline{x_1})^T\\
&....\\
& \phi(\underline{x_n})^T\\
\end{pmatrix}
\end{align*}
$$  
Therefore, the kernel matrix size is $R^{nxn}$ which is a symmetric matrix (or sometimes called Gram m). One of the core ideas in kernel is that you can build more complex kernels from atomic kernels, like, the linear kernel. Let's see this idea with the Gaussian kerenel.
$$
\begin{align*}
\begin{split}
&k(x, x')=exp(\frac{-||x-x'||_2^2}{2\sigma^2})=exp(\frac{-(x-x')^(x-x')}{2\sigma^2})=exp(-\frac{x^Tx-x^Tx'-x'^Tx+x'^Tx'}{2\sigma^2})\\
&k(x,x')=exp(\frac{k1(x, x')}{\sigma^2})exp(-\frac{k1(x', x')}{2\sigma^2})exp(-\frac{k1(x, x')}{2\sigma^2});\ where\ is\ k1(.,.)\ is\ linear\ kernel\\
&We\ can\ change\ this\ kernel\ to\ be\ nonlinear\ kernel\ from\ the\ like\ of\ k2(x, x')=(x^Tx'+c)^M\\
&Where\ k2\ can\ be\ also\ build\ from\ atomic\ kernels\ from\ the\ like\ of\ the\ linear\ kernel\
\end{split}
\end{align*}
$$

Also, it is known that the feature vector that is used to build the Gaussian kernel is a infinte dimension, hence, this would be computationally infeasible and this is indicative of the importance of kernel trick.

### Kernel Trick on Linear Regression Direct Solution
The cost function that we usually minimize for the linear regression is the mean squared error which can be driven by maximizing the likelihood. The following cost function will be used for the regularized cost function which is equivalent of using a gaussian prior.
$$
\begin{align*}
\begin{split}
&\nabla_w\big( J(w)=\frac{1}{2}\sum_{n}(w^T\phi(x_n) -t_n)^2 + \frac{\lambda}{2}||w||_2^2 \big)\\
&w = \frac{-1}{\lambda}\sum{n}(w^T\phi(x_n) - t_n)\phi(x_n) = \frac{-1}{\lambda}\sum{n}a_n\phi(x_n)=\Phi^T\underline{a}; substituting\ this\ solution\ into\ J(w)\\
&J(w) =\frac{1}{2}(\Phi(\Phi^Ta) - t)^T(\Phi(\Phi^Ta) - t) + \frac{\lambda}{2}(\Phi^Ta)^T\\
&J(w)=\frac{1}{2}a^T\Phi\Phi^T\Phi^T\Phi a-a^T\Phi\Phi^Tt -\frac{1}{2}t^t+\frac{\lambda}{2}a^T\Phi\Phi^Ta;by\ using\ K=\Phi\Phi^T\\
&As\ can\ be\ seen\ the\ w\ have\ disappeared\ and\ were\ replaced\ by\ a\, so\ we\ are\ maximizing\ w.r.t\ a\\
&\nabla_aJ(a)=\nabla_a(\frac{1}{2}a^TKK^Ta-a^TKt -\frac{1}{2}t^t+\frac{\lambda}{2}a^TK a)\\
&KK^Ta-K^Tt+\lambda Ka=0\rightarrow a = (KK^T+\lambda K)^{-1}K^Tt;\ by\ K=K^T\\
&a=(K+\lambda I_N)^{-1}t
\end{split}
\end{align*}
$$
And to make prediction $y(\phi(x_n), w) = w^T\phi(x_n)=a^T\Phi\phi(x_n)=\phi(x_n)^T\Phi^T(K+\lambda IN)^{-1}t=\underline(k(x)^T(K+\lambda I_N)^{-1}t$, where is k(x) is just the inner product of feature vector for xn and the design matrix. And this can be expressed as inner product of xn with every obsertation and this indicate that we need to store the dataset to make a prediction. So, one would think this is a severe downfall for this method but as we will see in SVM there would be few vectors to be stored which usually are called support vectors. As can be seen the prediction and the optimization were complete expressed by the kernel fucntion, hence, the name of kernel substitution.

Also, as can be seen from the equation for a that minimize the cost function, have a dimension of nx1 instead of px1. So, this raise an issue that if we have a large dataset then out parameter(a) would be large, but this trade off will be really appreciated when a simple linear model with respect to the predictors doesn't perform well and explicit computation for the feature vector is computationally infeasible. Also, the kernel trick will be a signficant factor in SVM that relies on changing the space of the dataset to be linearly separable in this new space while it would be non-linearly separable in the original space.

In [1]:
%matplotlib inline
import numpy as np 
import sklearn.preprocessing
import sklearn.datasets
import pandas as pd
import sklearn.model_selection
import numpy.random
import math
import sklearn.metrics
import sklearn.kernel_ridge

In [2]:
X, y = sklearn.datasets.load_boston(return_X_y=True)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=42)
standard = sklearn.preprocessing.StandardScaler()
X_train = standard.fit_transform(X_train)
y_train = standard.fit_transform(y_train.reshape(-1, 1))

training_data = np.c_[X_train, y_train]#All of the features are continuous, so, no need to use one-hot encoder and we can directly standard normalize the features of the data set

X_test = standard.transform(X_test)
y_test = standard.transform(y_test.reshape(-1, 1))

test_data = np.c_[X_test, y_test]
print(training_data.shape)
print(test_data.shape)

(379, 14)
(127, 14)


In [3]:
def gaussian_kernel(x, x_star, sigma):
    return np.exp(np.divide(-1*(np.linalg.norm(x-x_star)**2), 2*sigma**2))

def estimating_a(X, y, lambd, sigma):
    K = np.zeros((X.shape[0], X.shape[0]))
    for i in range(0, X.shape[0]):
        for j in range(0, X.shape[0]):
            K[i, j] = gaussian_kernel(X[i, :], X[j, :], sigma)
        #K[i, :] = gaussian_kernel(X[i, :], X[:, :], sigma)
    #print(K)
    I = np.eye(X.shape[0])
    a = np.dot(np.linalg.inv(K + lambd * I), y)
    return a, K

def prediction(xn, X, t, K, lambd, sigma):
    k = np.zeros((X.shape[0], 1))
    for i in range(0, X.shape[0]):
        k[i] = gaussian_kernel(xn, X[i, :], sigma)
    I = np.eye(X.shape[0])
    return (np.dot( (np.dot(k.T, np.linalg.inv( K + lambd*I ))), t))[0]



In [41]:
a, K = estimating_a(X_train, y_train, 0.3, 2)
pred = []
for x in X_train:
    pred.append(prediction(x, X_train, y_train, K, 0.3, 2))

sklearn.metrics.mean_squared_error(y_train, pred)#Should be in range of [-1, 1]

0.04700475472406587

In [50]:

pred = []
for x in X_test:
    pred.append(prediction(x, X_train, y_train, K, 0.3, 2))

sklearn.metrics.mean_squared_error(y_test, pred)#The test error became large which indicative of overfitting, so, we need to either change from Gaussian kernel or make the lambda with larger value to make the parameter more sparse

0.8148325652363119

### References 
* Chapter 3, and Chapter 5 from Bishop, C. (2006). Pattern Recognition and Machine Learning. Cambridge: Springer.
* Andrew Ng, Lec 7: (https://www.youtube.com/watch?v=s8B4A5ubw6c)
* Andrew Ng, Lec 8: (https://www.youtube.com/watch?v=bUv9bfMPMb4)