In [1]:
import pandas as pd
import kernel_fda

## Linearly Separable DataSet Example

We'll start with the example data in Figure 12.1. The classes are linearly separable, so we'll use a linear Fisher Discriminant to construct a classifier. A linear discriminant for a two-class problem uses the orthogonal distance of a point ${\bf x}$ from a line ${\bf w}$ to determine which class the point is in. If the point ${\bf x}$ is on one side of the line we say it is in class 1, whilst if it is on the other side of the line we say it is in class 2.

Measuring how far a point is from the line ${\bf w}$ is equivalent to measuring how far the point ${\bf x}$ lies along the line $\boldsymbol{\beta}$, which is normal to the line ${\bf w}$. So we can express the classifier as the mathematical condition $\boldsymbol{\beta}^{\top}{\bf x} > c$, where $c$ is some constant. If the point ${\bf x}$ satisfies this condition it is in one class; if it doesn't, it is in the other class.
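
As a minimal sketch of this decision rule, the snippet below projects two points onto a direction $\boldsymbol{\beta}$ and compares the result with a threshold $c$. The values of `beta`, `c` and `points` are made up purely for illustration; in practice $\boldsymbol{\beta}$ and $c$ come out of training.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the rule beta^T x > c.
beta = np.array([1.0, -1.0])   # direction normal to the decision line
c = 0.0                        # threshold

points = np.array([
    [0.8, -0.7],
    [-0.6, 0.6],
])

# Project each point onto beta, then compare with the threshold.
projections = points @ beta
on_positive_side = projections > c

print(projections)
print(on_positive_side)
```

Points whose projection exceeds `c` fall on one side of the line and are assigned to one class; the rest are assigned to the other.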

Training the linear discriminant is the process of determining the optimal line ${\bf w}$, the one that best separates the two classes in the training dataset. Determining the optimal line ${\bf w}$ is equivalent to finding the optimal line $\boldsymbol{\beta}$. Since we can see from Figure 12.1 that the two classes, the red and the blue points, can be separated by a straight line, we know that a linear discriminant using just the features we have, $x_{1}$ and $x_{2}$, will be sufficient to achieve a high level of accuracy.
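
To make "finding the optimal $\boldsymbol{\beta}$" concrete, here is a sketch of the textbook two-class Fisher direction, $\boldsymbol{\beta} \propto S_{w}^{-1}({\bf m}_{a} - {\bf m}_{b})$, where ${\bf m}_{a}, {\bf m}_{b}$ are the class means and $S_{w}$ is the within-class scatter matrix. The function name `fisher_direction` and the synthetic data are assumptions made for this illustration; this is not necessarily how the `KFDA_Poly` class computes its solution.

```python
import numpy as np

def fisher_direction(X, y):
    """Textbook two-class Fisher direction: beta proportional to S_w^{-1} (m_a - m_b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    if labels.size != 2:
        raise ValueError("expected exactly two classes")
    Xa, Xb = X[y == labels[0]], X[y == labels[1]]
    mean_a, mean_b = Xa.mean(axis=0), Xb.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of the two classes.
    S_w = (Xa - mean_a).T @ (Xa - mean_a) + (Xb - mean_b).T @ (Xb - mean_b)
    beta = np.linalg.solve(S_w, mean_a - mean_b)
    return beta / np.linalg.norm(beta)

# Synthetic data (not the notebook's CSV): two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([-1.0, -1.0], 0.3, size=(200, 2)),
    rng.normal([1.0, 1.0], 0.3, size=(200, 2)),
])
y_demo = np.repeat([1, 2], 200)

beta_demo = fisher_direction(X_demo, y_demo)
print(beta_demo)
```

Projecting the data onto `beta_demo` pushes the two class means apart relative to the spread within each class, which is what the Fisher criterion rewards.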

We will use the KFDA_Poly class I have written to train the Fisher Discriminant. The class is in the kernel_fda.py module, which can be found in the same directory as this notebook and which we imported at the start of the notebook. The KFDA_Poly class allows you to do kernel Fisher Discriminant Analysis with pure polynomial kernels of the form $\left ( {\bf x}\cdot{\bf y}\right )^{p}$. When calling the constructor for the KFDA_Poly class we specify the degree $p$ of the kernel we want to use. To do standard linear Fisher Discriminant Analysis, we specify degree $p=1$.
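
The pure polynomial kernel itself is simple to write down. The sketch below evaluates $\left ( {\bf x}\cdot{\bf y}\right )^{p}$ for every pair of rows in two small arrays; the helper name `poly_kernel_matrix` and the example points are assumptions for illustration and are not taken from kernel_fda.py.

```python
import numpy as np

def poly_kernel_matrix(X, Y, degree=1):
    """Pure polynomial kernel (x . y)^degree for every pair of rows of X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return (X @ Y.T) ** degree

# Two illustrative points, one per row.
X_demo = np.array([[1.0, 2.0],
                   [3.0, 1.0]])

K_linear = poly_kernel_matrix(X_demo, X_demo, degree=1)
K_quadratic = poly_kernel_matrix(X_demo, X_demo, degree=2)

print(K_linear)      # ordinary dot products
print(K_quadratic)   # element-wise squares of the dot products
```

With `degree=1` the kernel matrix is just the matrix of ordinary dot products, which is why $p=1$ recovers the standard linear discriminant.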

The data for Figure 12.1 is in the file lda_ex1.csv in the Data directory of the GitHub repository. We have labelled the classes 1 and 2, rather than "red" and "blue". 

First we'll read in the data.

In [2]:
# Read in the data
df_LDA_ex1 = pd.read_csv('../Data/lda_ex1.csv')

Let's take a look at the data.

In [3]:
# Take a quick look at the dataframe
df_LDA_ex1

Unnamed: 0,x1,x2,class
0,-0.297128,0.477975,2
1,-0.575548,-0.274354,1
2,-0.793637,-0.681858,1
3,0.842911,-0.766655,2
4,-0.566261,0.621195,2
...,...,...,...
995,-0.589864,0.320973,1
996,-0.503128,0.594245,2
997,0.105585,0.528042,2
998,0.005938,-0.026639,1


Now we'll build a linear discriminant. We'll instantiate a KFDA_Poly object with a linear kernel (degree=1). Specifying a linear kernel means we are doing kernel Fisher Discriminant analysis (kernel FDA) with a linear kernel, which is equivalent to standard linear Fisher Discriminant analysis.

In [4]:
# Create the linear classifier
linear_classifier_ex1 = kernel_fda.KFDA_Poly(degree=1)

Now we'll fit the linear classifier using the training data we have just read in.

In [5]:
# Fit the linear classifier
linear_classifier_ex1.fit(X=df_LDA_ex1[['x1','x2']], y=df_LDA_ex1['class'])

We can then score the trained linear classifier on the training set using the in-built score function.

In [6]:
linear_classifier_ex1.score(X=df_LDA_ex1[['x1','x2']], y_true=df_LDA_ex1['class'])

0.998

We can see that the model scores very well on the training set. The proportion of the training points that the classifier correctly classifies is 0.998, i.e. nearly 100% accuracy on the training set. This is to be expected, as we know just by looking at Figure 12.1 that the two classes are separable by a straight line, i.e. a properly trained LDA classifier should be capable of fitting the training data almost perfectly. We would also expect it to predict any hold-out datapoints accurately, provided they are drawn from the same distribution as the training data, so for the purposes of this simple illustration there is no need to test our classifier on a holdout sample.
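
For reference, the "proportion correctly classified" is simply the fraction of predicted labels that match the true labels. The sketch below computes it on made-up labels; the `accuracy` helper is hypothetical and only illustrates the quantity described above, not the internals of the `score` method.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Made-up labels: four of the five predictions are correct.
y_true_demo = [1, 2, 2, 1, 2]
y_pred_demo = [1, 2, 1, 1, 2]

print(accuracy(y_true_demo, y_pred_demo))
```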

## Non-linearly Separable DataSet Example

Now we'll repeat the process but using the data from Figure 12.2. We know from looking at Figure 12.2 that a straight line can't separate the two classes perfectly. Consequently a trained linear discriminant should score poorly even on the training data. Let's check.

The data for Figure 12.2 is in the file lda_ex2.csv in the Data directory of the GitHub repository. It is in the same format as the previous example. First we'll read in the data.

In [7]:
# Read in the data
df_LDA_ex2 = pd.read_csv('../Data/lda_ex2.csv')

Next we'll take a quick look at the data.

In [8]:
# Take a quick look at the dataframe
df_LDA_ex2

Unnamed: 0,x1,x2,class
0,0.074847,-0.244870,1
1,0.084495,-0.631542,1
2,0.644712,-0.622468,2
3,-0.059989,0.772946,2
4,0.571118,-0.393794,1
...,...,...,...
995,-0.582585,-0.805099,2
996,0.293623,0.110988,1
997,-0.300592,-0.486035,1
998,0.891114,0.702401,2


Now we'll repeat the process we went through with the first example and train a linear discriminant on this data. First we instantiate the linear classifier.

In [9]:
# Create the linear classifier
linear_classifier_ex2 = kernel_fda.KFDA_Poly(degree=1)

Next we'll fit it to the training data from Figure 12.2.

In [10]:
# Fit the linear classifier
linear_classifier_ex2.fit(X=df_LDA_ex2[['x1','x2']], y=df_LDA_ex2['class'])

And now we'll score the trained linear classifier on the training set data.

In [11]:
linear_classifier_ex2.score(X=df_LDA_ex2[['x1','x2']], y_true=df_LDA_ex2['class'])

0.502

We can see the score on the training set is close to 0.5, i.e. only about 50% accuracy, a lot lower than in our first example. This is to be expected: no straight line can separate the two classes in Figure 12.2.

Can you think why the accuracy on the training set was close to 0.5, even though we have trained, i.e. optimized, this linear classifier on the training data?

We know that the red and blue points in Figure 12.2 are separated by the boundary $x_{1}^{2} + x_{2}^{2} = \frac{1}{2}$. So if a point ${\bf x}$ has $x_{1}^{2} + x_{2}^{2} > \frac{1}{2}$ it is in the red class, whilst if ${\bf x}$ has $x_{1}^{2} + x_{2}^{2} < \frac{1}{2}$ it is in the blue class.

So overall we can write our classifier condition as $x_{1}^{2} + x_{2}^{2} > \frac{1}{2}$. This classifier condition can also be written as $\boldsymbol{\beta}^{\top}\boldsymbol{\Phi} > \frac{1}{2}$, where $\boldsymbol{\Phi} = (x_{1}^{2}, x_{2}^{2}, \sqrt{2} x_{1}x_{2})$ and $\boldsymbol{\beta} = (1,1,0)$. This is in the form of a linear discriminant classifier, but with a new feature vector $\boldsymbol{\Phi}$. Moreover, the vector $\boldsymbol{\Phi}$ is precisely the new feature vector that was implicitly created when we used a quadratic dot-product kernel in our Mercer's theorem example in the main text. This suggests that if we train a kernel Fisher Discriminant classifier using a quadratic dot-product kernel $f({\bf x}, {\bf y}) = \left ( {\bf x}\cdot {\bf y} \right )^{2}$, the trained classifier should be capable of perfectly separating the red and the blue points in the training data shown in Figure 12.2. Let's see.
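
Both claims are easy to check numerically. The sketch below verifies, for two illustrative points (not taken from lda_ex2.csv), that $\left ( {\bf x}\cdot {\bf y} \right )^{2} = \boldsymbol{\Phi}({\bf x})\cdot\boldsymbol{\Phi}({\bf y})$ and that $\boldsymbol{\beta}^{\top}\boldsymbol{\Phi}$ with $\boldsymbol{\beta} = (1,1,0)$ recovers $x_{1}^{2} + x_{2}^{2}$. The function name `phi` is just a label chosen for this example.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic dot-product kernel in two dimensions."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

beta = np.array([1.0, 1.0, 0.0])

# Illustrative points, one inside and one outside the circle x1^2 + x2^2 = 1/2.
x = np.array([0.3, -0.2])   # x1^2 + x2^2 = 0.13
y = np.array([0.9, 0.7])    # x1^2 + x2^2 = 1.30

# The kernel value equals the dot product of the explicit feature vectors.
kernel_value = np.dot(x, y) ** 2
feature_dot = np.dot(phi(x), phi(y))
print(np.isclose(kernel_value, feature_dot))

# The linear rule in feature space reproduces the circular boundary.
for point in (x, y):
    score = beta @ phi(point)
    print(point, score, score > 0.5)
```

The kernel never needs $\boldsymbol{\Phi}$ to be computed explicitly; the feature map is written out here only to confirm the equivalence.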

First we instantiate a kernel classifier object by specifying a polynomial kernel of degree 2.

In [12]:
# Create a quadratic dot-product kernel linear discriminant
kernel_classifier = kernel_fda.KFDA_Poly(degree=2)

Next, we fit the kernel Fisher Discriminant to the training data from Figure 12.2.

In [13]:
# Fit the kernel classifier to the training data
kernel_classifier.fit(X=df_LDA_ex2[['x1','x2']], y=df_LDA_ex2['class'])

Finally, we'll score the trained kernel Fisher Discriminant classifier on the training data. We should get something a lot higher than 0.5, and much closer to 1.

In [14]:
# Score the trained classifier on the training data
kernel_classifier.score(X=df_LDA_ex2[['x1','x2']], y_true=df_LDA_ex2['class'])

0.911

And we do get a trained classifier that fits the training data much better than a standard linear discriminant. The trained classifier doesn't fit the training data perfectly, i.e. the accuracy is not 1, which we attribute to sampling variation. If we increased the size of the training data we would expect the score to move closer to 1.