You suspect your data is linearly inseparable and want to reduce its
dimensionality.

Use an extension of principal component analysis that uses kernels to allow for
non-linear dimensionality reduction:


# Load libraries
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles

# Create linearly inseparable data
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Apply kernel PCA with radial basis function (RBF) kernel
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kpca.shape[1])

Original number of features: 2
Reduced number of features: 1


PCA is able to reduce the dimensionality of our feature matrix (i.e., the number
of features). Standard PCA uses linear projection to reduce the features. If the
data is linearly separable (i.e., you can draw a straight line or hyperplane
between different classes), then PCA works well. However, if your data is not
linearly separable (e.g., you can only separate classes using a curved decision
boundary), the linear transformation will not work as well. In our solution, we
used scikit-learn’s make_circles to generate a simulated dataset with a target
vector of two classes and two features. make_circles makes linearly
inseparable data; specifically, one class is surrounded on all sides by the other
class.

![](./pics/linearyInseperable.jpg)
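To make the contrast concrete, here is a minimal sketch that projects the same simulated data onto one component with both standard PCA and RBF kernel PCA, then checks how well a linear classifier separates the two classes in each projection. The logistic regression step and the helper name linear_accuracy are illustrative assumptions, not part of the recipe.

```python
# Load libraries
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Create the same linearly inseparable data, keeping the target this time
features, target = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Project to one component with linear PCA and with RBF kernel PCA
features_pca = PCA(n_components=1).fit_transform(features)
features_kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1).fit_transform(features)

def linear_accuracy(X, y):
    """Hypothetical helper: training accuracy of a linear classifier on X."""
    return LogisticRegression().fit(X, y).score(X, y)

pca_accuracy = linear_accuracy(features_pca, target)
kpca_accuracy = linear_accuracy(features_kpca, target)

print("Accuracy on linear PCA projection:", pca_accuracy)
print("Accuracy on kernel PCA projection:", kpca_accuracy)
```

If the kernel projection has pulled the inner class away from the outer one, the second accuracy should be clearly higher than the first.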

Kernels allow us to project the linearly inseparable data into a higher dimension
where it is linearly separable; this is called the kernel trick. Don’t worry if you
don’t understand the details of the kernel trick; just think of kernels as different
ways of projecting the data. There are a number of kernels we can use in
scikit-learn’s KernelPCA, specified using the kernel parameter. A common kernel to
use is the Gaussian radial basis function kernel (rbf), but other options are the
polynomial kernel (poly) and sigmoid kernel (sigmoid). We can even specify a
linear projection (linear), which will produce the same results as standard
PCA.

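The claim about the linear kernel can be checked directly. The sketch below compares the two projections on the simulated data; it compares absolute values because the sign of each component is arbitrary, and it sets eigen_solver="dense" only to keep the comparison numerically tight (both choices are assumptions of this example).

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Create the simulated data
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Standard PCA and kernel PCA with a linear kernel, two components each
features_pca = PCA(n_components=2).fit_transform(features)
features_linear_kpca = KernelPCA(
    kernel="linear", n_components=2, eigen_solver="dense"
).fit_transform(features)

# Components may be flipped in sign, so compare magnitudes
same_up_to_sign = np.allclose(
    np.abs(features_pca), np.abs(features_linear_kpca), atol=1e-6
)
print("Same projection up to sign:", same_up_to_sign)
```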
One downside of kernel PCA is that there are a number of parameters we need to
specify. For example, in Recipe 9.1 we set n_components to 0.99 to make PCA
select the number of components to retain 99% of the variance. We don’t have
this option in kernel PCA. Instead, we have to define the number of components
(e.g., n_components=1). Furthermore, kernels come with their own
hyperparameters that we will have to set; for example, the radial basis function
kernel requires a gamma value.

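KernelPCA will not pick the number of components from a variance fraction, but we can estimate it ourselves. One possible workaround, sketched below as an assumption rather than a built-in feature, is to keep every non-zero component (n_components=None), measure the variance of each projected column, and count how many components are needed to reach 99% of the total. Note that this is variance in the kernel feature space, not in the original features.

```python
# Load libraries
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Create the simulated data
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# n_components=None keeps every component with a non-zero eigenvalue
kpca_all = KernelPCA(kernel="rbf", gamma=15, n_components=None)
features_kpca_all = kpca_all.fit_transform(features)

# Variance of each projected column, then the cumulative share of the total
component_variance = np.var(features_kpca_all, axis=0)
cumulative_ratio = np.cumsum(component_variance) / np.sum(component_variance)

# Smallest number of components whose cumulative share reaches 99%
n_needed = int(np.searchsorted(cumulative_ratio, 0.99) + 1)
print("Components needed for 99% of kernel-space variance:", n_needed)
```

The resulting count depends on the kernel and on gamma, so it has to be recomputed whenever those change.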
So how do we know which values to use? Through trial and error. Specifically,
we can train our machine learning model multiple times, each time with a
different kernel or different value of the parameter. Once we find the
combination of values that produces the highest-quality predictions, we are
done. We will learn about this strategy in depth in Chapter 12.
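As a preview of that trial-and-error strategy, here is a minimal sketch that wraps kernel PCA and a classifier in a pipeline and lets a grid search try each combination of kernel and gamma. The choice of logistic regression, the candidate values, and the step names kpca and classifier are assumptions made for illustration.

```python
# Load libraries
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Create the simulated data, keeping the target for the classifier
features, target = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Chain kernel PCA and a linear classifier
pipeline = Pipeline([
    ("kpca", KernelPCA(n_components=1, random_state=1)),
    ("classifier", LogisticRegression()),
])

# Candidate kernels and gamma values (gamma is ignored by the linear kernel)
search_space = {
    "kpca__kernel": ["linear", "rbf"],
    "kpca__gamma": [0.1, 1, 15],
}

# Try every combination with five-fold cross-validation
grid = GridSearchCV(pipeline, search_space, cv=5)
grid.fit(features, target)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
```

Putting KernelPCA inside the pipeline means it is refit on each training fold, so the held-out fold never influences the projection being evaluated.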