## Lecture 6: Non Linear Classification Class Notes

Data is not always linearly separable. Even if we allow some slack in accuracy, linear classification methods will not get very far on such data.

The good news is that linear classification methods extend easily to non-linear classification. All we need is a way to extend our feature vectors so that the data becomes linearly separable. This lecture covers techniques for extending feature vectors to higher dimensions in which the data becomes linearly separable.

**Higher Order Feature Vectors**

Feature maps $\phi(x)$ are the primary theoretical constructs that will help in extending the linear classifiers to handle non-linear problems.

Definition:

Let $x \in R$. A feature map $\phi$ maps the data to $R^2$ if it takes $x \in R$ as its argument and produces a two-dimensional feature vector such as:

$\phi(x) = \begin{bmatrix}
            \phi_1(x)\\
            \phi_2(x)
            \end{bmatrix}
= \begin{bmatrix}
x\\
x^2
\end{bmatrix}
$

To build intuition for how a feature map can make non-linearly separable data linearly separable, refer to the Excel sheet tab named **Non Linear Seperability**.

There are two distinct concepts to understand:

1. The new coordinates of each original point $x$ are found by applying the feature map $\phi(x)$ to it.
2. The shape of the decision boundary in the original feature space is found by tracing the locus of points where $\theta.\phi(x)+\theta_0 = 0$, i.e. where the prediction $sign(\theta.\phi(x)+\theta_0)$ changes.
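The two concepts above can be illustrated with a minimal sketch. The data points and the values of `theta` and `theta_0` are hand-picked assumptions for illustration, not learned parameters and not taken from the Excel sheet.

```python
import math

def phi(x):
    """Feature map from R to R^2: x -> [x, x^2]."""
    return [x, x * x]

def predict(theta, theta_0, x):
    """Return sign(theta . phi(x) + theta_0) as +1 or -1."""
    score = sum(t * f for t, f in zip(theta, phi(x))) + theta_0
    return 1 if score > 0 else -1

# Hypothetical 1-D data: positives lie on both sides of the negatives,
# so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked (not learned) linear classifier in the mapped space:
# theta . phi(x) + theta_0 = x^2 - 2.5
theta, theta_0 = [0.0, 1.0], -2.5

# Concept 1: the new coordinates of each original point.
mapped = [phi(x) for x in xs]

# Concept 2: the boundary in the original space is where the score is zero,
# i.e. x^2 = 2.5, which gives two points on the real line.
boundary = [-math.sqrt(2.5), math.sqrt(2.5)]

predictions = [predict(theta, theta_0, x) for x in xs]
```

In the mapped space the classifier is a single horizontal line $x^2 = 2.5$, while in the original space the same boundary appears as two separate thresholds.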



**Introduction to Non-Linear Classification**

Both for classification and regression one can use the idea of feature maps to extend the original features to higher dimensions. The general mechanism by which one can use these higher dimensional features is as follows:

- Non linear classification: $sign(\theta.\phi(x)+\theta_0)$
- Non linear regression: $\theta.\phi(x)+\theta_0$

$\phi(x)$ can take many forms, for example:

1. $\phi(x) = \begin{bmatrix}
                x\\
                x^2\\
                \end{bmatrix}            
               $
               
               
2.  $\phi(x) = \begin{bmatrix}
                x\\
                x^2\\
                x^3\\
                \end{bmatrix}            
               $
               
               
3.  $\phi(x) = \begin{bmatrix}
                x\\
                x^2\\
                x^3\\
                x^4\\
                \end{bmatrix}            
               $
               
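But how can one decide which feature transformation to use? One technique is k-fold cross validation: each candidate feature map is evaluated on held-out folds of the training data, and the one with the best average validation performance is chosen. The intuition behind this is explained in the sheet named **"cross validation"**.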
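The mechanics of k-fold cross validation can be sketched in plain Python as follows. The helper names `k_fold_indices` and `average_fold_score` are hypothetical, and `fit` and `score` are caller-supplied placeholders rather than library functions.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train, validation) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def average_fold_score(fit, score, X, y, k):
    """Average validation score of a model over k folds.

    fit(X_train, y_train) returns a model;
    score(model, X_val, y_val) returns a number (higher is better).
    """
    scores = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / len(scores)
```

To compare feature maps, one would apply each candidate $\phi$ to the inputs, call `average_fold_score` on the mapped data, and keep the map with the highest average. The folds here are contiguous, so the data should be shuffled beforehand if it is stored in any particular order.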


Using feature maps directly might not always be a great idea, for two main reasons:

1. The number of features grows very quickly when $x \in R^d$ and $d$ is large, say $d \geq 30$.

To illustrate this point, imagine a feature map $\phi(x)$ that maps $x$ to its 1st, 2nd and 3rd order polynomial terms. If $x \in R^d$ with $d \geq 30$, the feature vector will have at least:

- 30 1st order terms
- $\binom{30+2-1}{2} = 465$ 2nd order terms
- $\binom{30+3-1}{3} = 4960$ 3rd order terms

2. The computation time also increases as the dimension of the feature vectors increases.
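The term counts in the first point can be reproduced with a short sketch. The helper name `n_monomials` is hypothetical, and the code takes $d = 30$ as in the example above.

```python
from math import comb

def n_monomials(d, p):
    """Number of distinct monomials of exact degree p in d variables."""
    return comb(d + p - 1, p)

d = 30
counts = {p: n_monomials(d, p) for p in (1, 2, 3)}
total = sum(counts.values())  # size of the full feature vector up to order 3
```

Changing `d` or extending the tuple of orders shows how fast the explicit feature vector grows.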

**Motivation for Kernel Methods**

Since creating explicit feature vectors can be computationally expensive, this section lays the groundwork for reducing the computational load.

For some feature maps it is very easy to compute the dot product. Consider two vectors $x,x' \in R^2$, and assume a feature transformation $\phi(x)$ defined as below:

$\phi(x) = [x_1,x_2,x_1^2,\sqrt{2}x_1x_2,x_2^2]$

$\phi(x') = [x_1',x_2',x_1'^2,\sqrt{2}x_1'x_2',x_2'^2]$

Now if we evaluate $\phi(x).\phi(x')$, we can see that the following holds:

$\phi(x).\phi(x') = x.x' + (x.x')^2$
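The identity can be checked numerically with a minimal sketch. The two test vectors are arbitrary assumptions chosen for illustration.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map from R^2 to R^5 as defined above."""
    x1, x2 = x
    return [x1, x2, x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2]

def kernel(x, x_prime):
    """Same quantity computed from the original vectors only."""
    s = dot(x, x_prime)
    return s + s ** 2

# Arbitrary example vectors (assumed for illustration).
x, x_prime = [1.0, 2.0], [3.0, -1.0]

explicit = dot(phi(x), phi(x_prime))  # builds both 5-dimensional vectors
implicit = kernel(x, x_prime)         # needs only one 2-dimensional dot product
```

Both values agree up to floating-point rounding, but the second never constructs the higher-dimensional vectors.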

The major implications of this example are:

1. The dot product of the feature vectors can be written in terms of the dot product of the original vectors.
2. If only the dot product of the feature vectors is needed, there is no need to create the feature vectors explicitly and then take their dot product.
3. The two points above hold only for certain forms of feature maps.

This leads to the following definition:

When 3 holds, we use the term kernel function. A kernel function takes the original vectors and returns the dot product of a specific feature map applied to them, i.e.

$K(x,x')=\phi(x).\phi(x')$


**Common Kernel Functions and Decision Boundaries**

1. Polynomial Kernel $K(x,x') = (1+x.x')^p$

![](./imgs/polynomial.png)

2. RBF Kernel $K(x,x') = \exp(-\frac{1}{2}||x-x'||^2)$

![](./imgs/rbf.png)
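Both kernels can be written directly from their formulas. The following is a minimal sketch in plain Python; the function names and the default `p=2` are assumptions for illustration.

```python
import math

def polynomial_kernel(x, x_prime, p=2):
    """Polynomial kernel (1 + x . x')^p."""
    return (1.0 + sum(a * b for a, b in zip(x, x_prime))) ** p

def rbf_kernel(x, x_prime):
    """RBF kernel exp(-1/2 * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-0.5 * sq_dist)
```

Neither function builds a feature vector: each works only with the original coordinates.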


**Kernel Perceptron Algorithm**

Using the kernel perceptron algorithm, one can easily show that non-linear classifiers (such as non-linear SVMs) can classify points in a higher-dimensional space without creating explicit feature vectors, using only kernel functions. See the deck named **Algorithm Demos.pptx**.
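A minimal sketch of a kernel perceptron is given below. It is not taken from the deck: the XOR-style data, the choice of a degree-2 polynomial kernel, and the absence of a separate offset term (the constant 1 inside the kernel plays that role) are all assumptions made for illustration.

```python
def polynomial_kernel(x, x_prime, p=2):
    """Polynomial kernel (1 + x . x')^p."""
    return (1.0 + sum(a * b for a, b in zip(x, x_prime))) ** p

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Return mistake counts alpha_i, one per training point."""
    alphas = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(X)):
            # Score of point i using only kernel evaluations.
            score = sum(alphas[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))
            if y[i] * score <= 0:
                alphas[i] += 1  # mistake: give this point more weight
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: stop
            break
    return alphas

def predict(X, y, alphas, kernel, x_new):
    """Classify x_new as +1 or -1 from the training points and their alphas."""
    score = sum(a * yj * kernel(xj, x_new) for a, yj, xj in zip(alphas, y, X))
    return 1 if score > 0 else -1

# Hypothetical XOR-style data: not linearly separable in R^2.
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]

alphas = kernel_perceptron(X, y, polynomial_kernel)
train_predictions = [predict(X, y, alphas, polynomial_kernel, x) for x in X]
```

The classifier is stored as one count per training point rather than as a parameter vector $\theta$, which is why the explicit feature vectors are never needed.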

**Choosing kernels and regularization parameter using k-fold cross validation**

The following code snippet uses `scikit-learn`, an off-the-shelf Python library for applied machine learning, to demonstrate how to choose among different kernels and different values of $C$.

In [1]:
import pandas as pd
data = pd.read_csv("iris.csv")
data.head()

| Unnamed: 0 | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


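In [2]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
# Features: every column except the label. Note that this keeps 'Unnamed: 0'
# (the row index stored in the CSV) as a feature; drop it too if it should not be used.
X = data.drop('Species', axis=1).values
y = data['Species'].values
# Encode the species names as integer class labels.
enc = LabelEncoder()
y = enc.fit_transform(y)
# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_tests = train_test_split(X, y, test_size=0.20, random_state=42)
# Candidate kernels and values of C; GridSearchCV scores every combination
# with k-fold cross validation on the training set.
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
clf = GridSearchCV(SVC(), parameters)
clf = clf.fit(X_train, y_train)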

In [3]:
clf.best_estimator_

SVC(C=1, kernel='linear')

In [4]:
clf.cv_results_

{'mean_fit_time': array([0.00071588, 0.00045323, 0.00031552, 0.00036564]),
 'std_fit_time': array([7.65839795e-04, 8.44015240e-06, 2.22926344e-05, 1.07416500e-05]),
 'mean_score_time': array([0.00018816, 0.00023756, 0.00014033, 0.00018311]),
 'std_score_time': array([7.59373317e-05, 2.57492065e-06, 3.20085267e-06, 3.97237122e-06]),
 'param_C': masked_array(data=[1, 1, 10, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['linear', 'rbf', 'linear', 'rbf'],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'linear'},
  {'C': 1, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'}],
 'split0_test_score': array([1., 1., 1., 1.]),
 'split1_test_score': array([0.95833333, 0.95833333, 0.95833333, 0.95833333]),
 'split2_test_score': array([0.875     , 0.83333333, 0.83333333, 0.83333333]),
 'split