# Dimensionality Reduction: Principal Component Analysis

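Sometimes we have far too many features to work with. As the number of features grows, the data becomes increasingly sparse and many algorithms become slower and less reliable. This problem is known as the *curse of dimensionality*.

*Principal component analysis* (PCA) is an algorithm for reducing the number of features by combining them whilst preserving as much information as possible. It works by finding orthogonal directions (the *principal components*) along which the data varies the most, then projecting the data onto the first few of those directions.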
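To make the idea concrete, here is a minimal sketch of PCA written directly in NumPy using the singular value decomposition. The toy data and all variable names (`X_toy`, `X_centred`, `X_reduced`) are assumptions made up for illustration; scikit-learn's `PCA` does considerably more than this.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 points stretched along a diagonal
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_toy = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# 1. Centre each feature on zero
X_centred = X_toy - X_toy.mean(axis=0)

# 2. SVD: the rows of Vt are the principal directions, ordered by variance
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# 3. Variance along each direction, and its share of the total
explained_variance = S**2 / (len(X_toy) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the first principal component (2 features -> 1)
X_reduced = X_centred @ Vt[:1].T

print(explained_variance_ratio)
```

Because the two toy features are almost perfectly correlated, nearly all of the variance should land on the first component.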

References:<br>
Scikit learn PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html <br>
Principal Component Analysis (PCA) with Scikit-learn: https://towardsdatascience.com/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0 <br>
A demo of K-Means clustering on the handwritten digits data: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py <br>

## Installation

In [None]:
%pip install numpy
%pip install pandas
%pip install -U matplotlib
%pip install scikit-learn
%pip install seaborn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.cluster import KMeans

## Reducing the Breast Cancer Dataset

The Wisconsin breast cancer dataset has a whopping 30 features! That is too many to visualise directly, and many of them are strongly correlated, which makes the dataset a good candidate for PCA.

### Loading the data

Let's load the data and see what it looks like.

In [None]:
# Load the data as a Pandas DataFrame


# Load it again as a bunch object for plotting later on


<details><summary>Click to cheat</summary>

```python
# Load the data as a Pandas DataFrame
X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# Load it again as a bunch object for plotting later on
cancer = datasets.load_breast_cancer()
```
</details>

In [None]:
# Show the first few rows of features and their names
X.head()

In [None]:
# Show the first few labels
# There are two labels: 0 for 'malignant' and 1 for 'benign'
y.head()

### Preprocessing our data

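PCA is sensitive to the scale of each feature, so it's best to preprocess our data first.

We'll use *z*-scaling (subtract each feature's mean, then divide by its standard deviation) so that no feature's variance dominates the others simply because of its units.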

In [None]:
from sklearn.preprocessing import StandardScaler

# Create the scaler


# Scale the data


<details><summary>Click to cheat</summary>

```python
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Scale the data
X_scaled = scaler.fit_transform(X)
```
</details>
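As a sanity check on what *z*-scaling does, the sketch below computes it by hand and compares the result with `StandardScaler`. The names `X_check`, `X_manual` and `X_sklearn` are assumptions for illustration and are kept separate from the notebook's own variables.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load the features as a plain NumPy array
X_check, _ = datasets.load_breast_cancer(return_X_y=True)

# z-scaling by hand: subtract each column's mean, divide by its standard deviation
X_manual = (X_check - X_check.mean(axis=0)) / X_check.std(axis=0)

# StandardScaler uses the same population standard deviation (ddof=0)
X_sklearn = StandardScaler().fit_transform(X_check)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(6)[:3], X_sklearn.std(axis=0).round(6)[:3])
```

After scaling, every column should have a mean of (approximately) zero and a standard deviation of one.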

### Create the PCA

In [None]:
from sklearn.decomposition import PCA

# create the PCA with 30 components
# We use all 30 so we can analyse the variance captured by each component


# fit and transform the scaled data


<details><summary>Click to cheat</summary>

```python
from sklearn.decomposition import PCA

# create the PCA with 30 components
# We use all 30 so we can analyse the variance captured by each component
pca = PCA(n_components=30)

# fit and transform the scaled data in a single step
X_pca = pca.fit_transform(X_scaled)
```
</details>

### Variance preserved

Now let's see how much of the original dataset's variance each component preserves.

In [None]:
# Percentage of variance explained by each component
pca.explained_variance_ratio_ * 100

In [None]:
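# Now let's plot the cumulative variance against the number of components
# (the x-axis starts at 1 so it reads as a component count, not an array index)
n_components = np.arange(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(n_components, np.cumsum(pca.explained_variance_ratio_ * 100))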
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
plt.grid()
plt.show()

Hence, the first 10 of the 30 components alone retain about 95% of the variance, and the first 6 retain just under 90%!
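The number of components can also be chosen programmatically. As a sketch: scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps just enough components to explain at least that share of the variance. The names `X_raw`, `X_std`, `pca_95` and `X_95` are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and scale the data so this cell stands on its own
X_raw, _ = datasets.load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X_raw)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_95 = pca_95.fit_transform(X_std)

# n_components_ holds the number of components that were actually kept
print(pca_95.n_components_, pca_95.explained_variance_ratio_.sum())
```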

### Plotting the first few components

In [None]:
# create a new PCA with only three components


# fit the data using the scaled data


# transform the data


<details><summary>Click to cheat</summary>

```python
# create a new PCA with only three components
pca = PCA(n_components=3)

# fit the data using the scaled data
pca.fit(X_scaled)
# transform the data
X_pca = pca.transform(X_scaled)
```
</details>

### Plot the results

In [None]:
# Plotting the PCA data
fig = plt.figure(figsize=(12, 8))
ax = plt.axes(projection='3d')

var = np.cumsum(pca.explained_variance_ratio_ * 100)[-1]
ax.scatter3D(X_pca[y == 0, 0], X_pca[y == 0, 1], X_pca[y == 0, 2], s=50, alpha=0.7, color='m')
ax.scatter3D(X_pca[y == 1, 0], X_pca[y == 1, 1], X_pca[y == 1, 2], s=50, alpha=0.7, color='b')
plt.title(f"3D Scatterplot: {var:.2f}% variance captured")
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
ax.set_zlabel("3rd principal component")
ax.legend(["Malignant", "Benign"])
plt.show()

## Simplifying the Digits Dataset for Clustering

The digits dataset has a whopping 64 features: one for each pixel of an 8×8 image!

This is a lot of features to work with, so we'll reduce it first, then cluster it.

### Load the digits

In [None]:
# Load the features and labels as NumPy arrays
data, labels = datasets.load_digits(return_X_y=True)

### Create the PCA

In [None]:
# Create a PCA with 2 components


# Fit and transform the data


<details><summary>Click to cheat</summary>

```python
# Create a PCA with 2 components
pca = PCA(n_components=2)

# Fit and transform the data
data_pca = pca.fit_transform(data)
```
</details>
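Squeezing 64 features into 2 discards a lot of information, so it is worth checking how much variance survives. A minimal sketch, with `digits_data`, `pca_2d` and `retained` as assumed names:

```python
from sklearn import datasets
from sklearn.decomposition import PCA

# Load the raw pixel features
digits_data, _ = datasets.load_digits(return_X_y=True)

# Fit a 2-component PCA and add up the variance share of both components
pca_2d = PCA(n_components=2).fit(digits_data)
retained = pca_2d.explained_variance_ratio_.sum() * 100

print(f"Variance retained by 2 components: {retained:.1f}%")
```

A low percentage here means the 2D picture is a convenient view for plotting rather than a faithful summary of the data.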

### Create the Clusterer

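K-Means assumes roughly spherical clusters of similar size. That is only approximately true for the reduced digits data, but it is good enough for a demonstration.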

In [None]:
# Create a K-means clusterer with k = 10 (one for each digit)


# fit the data


<details><summary>Click to cheat</summary>

```python
# Create a K-means clusterer with k = 10 (one for each digit)
kmeans = KMeans(n_clusters=10)

# fit the data
kmeans.fit(data_pca)
```
</details>
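Since the true digit labels are available, one way to check how well the clusters line up with them is the adjusted Rand index. This is an optional sketch, not part of the original exercise; `n_init` and `random_state` are fixed only to make the run repeatable, and the names `digits_X`, `digits_y`, `digits_2d` and `km` are assumptions.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Reload and reduce the digits so this cell stands on its own
digits_X, digits_y = datasets.load_digits(return_X_y=True)
digits_2d = PCA(n_components=2).fit_transform(digits_X)

# n_init and random_state are fixed here only to make the run repeatable
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits_2d)

# 1.0 means perfect agreement with the true digits; values near 0 mean chance level
score = adjusted_rand_score(digits_y, km.labels_)
print(f"Adjusted Rand index: {score:.2f}")
```

The score does not depend on how the clusters happen to be numbered, which matters because K-Means cluster IDs are arbitrary.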

### Plot the results

In [None]:
# Step size of the mesh. Decrease it for a finer-grained picture of the cluster regions.
h = 0.02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
x_min, x_max = data_pca[:, 0].min() - 1, data_pca[:, 0].max() + 1
y_min, y_max = data_pca[:, 1].min() - 1, data_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=plt.cm.Paired,
    aspect="auto",
    origin="lower",
)

plt.plot(data_pca[:, 0], data_pca[:, 1], "k.", markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    s=169,
    linewidths=3,
    color="w",
    zorder=10,
)
plt.title(
    "K-means clustering on the digits dataset (PCA-reduced data)\n"
    "Centroids are marked with white cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.xlabel("1st Component")
plt.ylabel("2nd Component")
plt.show()