<h1>Chapter 08. Dimensionality Reduction</h1>

Dimensionality reduction reduces the number of input features in a dataset while preserving as much of the relevant information as possible. It transforms the data into a lower-dimensional space, which lowers computational cost, can filter out some noise, and makes tasks such as visualization, clustering, and classification more manageable. Popular techniques include Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Linear Discriminant Analysis (LDA).

<h2>PCA</h2>

PCA (Principal Component Analysis) identifies the directions along which the data varies the most and projects the data onto the first few of them, so that a high-dimensional dataset can be represented with fewer variables while losing as little variance as possible.

Let's build a simple 3D dataset:

In [1]:
import numpy as np


np.random.seed(4)

m = 60  # number of samples

# Define weights and noise level
w1, w2 = 0.1, 0.3
noise = 0.1

# Generate random angles
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5

# Create an empty array to store data
X = np.empty((m, 3))

# Generate features using trigonometric functions and noise
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

<h3>Principal Components</h3>

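Principal Components (PCs) are the directions of maximum variance in the data. The first PC is the axis that accounts for the most variance, the second is orthogonal to the first and accounts for the most remaining variance, and so on. Together they form a new coordinate system in which each axis is a linear combination of the original features. They can be obtained from the Singular Value Decomposition (SVD) of the centered data: the rows of `Vt` are the principal components.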

In [2]:
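# Center the data: PCA assumes each feature has zero mean
X_centered = X - X.mean(axis=0)

# Perform Singular Value Decomposition (SVD) on the centered data
U, s, Vt = np.linalg.svd(X_centered)

# The first two principal components are the first two columns of V (rows of Vt)
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]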

In [3]:
m, n = X.shape

# Rebuild the m x n matrix of singular values (np.linalg.svd returns them as a 1D array)
S = np.zeros(X_centered.shape)
S[:n, :n] = np.diag(s)

In [4]:
np.allclose(X_centered, U.dot(S).dot(Vt))

True
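
As a sanity check on the decomposition, the sketch below verifies two properties that the projection steps rely on: the rows of `Vt` form an orthonormal set, and the singular values come back in decreasing order. It uses a small synthetic dataset (`X_demo` and `mixing` are illustrative assumptions, not the `X` built above) so that it runs on its own.

```python
import numpy as np

# Illustrative data only: 60 correlated 3D points (not the chapter's dataset)
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.1, 0.2, 0.1]])
X_demo = rng.normal(size=(60, 3)) @ mixing
X_demo_centered = X_demo - X_demo.mean(axis=0)

_, s_demo, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)

# Rows of Vt are the principal components: unit length and mutually orthogonal
is_orthonormal = np.allclose(Vt_demo @ Vt_demo.T, np.eye(3))

# NumPy returns the singular values sorted from largest to smallest
is_sorted = bool(np.all(np.diff(s_demo) <= 0))

print(is_orthonormal, is_sorted)
```

Both flags are expected to be `True`. The decreasing order is what makes "the first `d` rows of `Vt`" the same thing as "the `d` directions of largest variance".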

<h3>Projecting down to <i>d</i> Dimensions</h3>

To project the data down to `d` dimensions, multiply the centered data by the matrix whose columns are the first `d` principal components. Here `d = 2`, so `W2` holds the first two columns of `Vt.T`.

In [5]:
W2 = Vt.T[:, :2]
X2D = X_centered.dot(W2)

In [6]:
X2D_using_svd = X2D
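
The same projection can also be read directly off the SVD factors, since multiplying the centered data by the first `d` principal components gives the first `d` columns of `U` scaled by the corresponding singular values. The sketch below checks this identity on illustrative random data (`X_demo` is an assumption, not the chapter's dataset).

```python
import numpy as np

# Illustrative data only (not the chapter's dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
X_demo_centered = X_demo - X_demo.mean(axis=0)

U_demo, s_demo, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)

d = 2
W_d = Vt_demo[:d].T  # columns are the first d principal components

# Projection as a matrix product with the principal components
proj = X_demo_centered @ W_d

# Equivalent form: first d left singular vectors scaled by their singular values
proj_alt = U_demo[:, :d] * s_demo[:d]

same_projection = np.allclose(proj, proj_alt)
print(same_projection)
```

The matrix product form is the one used in this chapter. The second form is only shown to make the link between the projection and the SVD factors explicit.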

<h3>Using Scikit-Learn</h3>

Scikit-Learn's `PCA` class computes the principal components and projects the data in a single call. It also takes care of mean centering.

In [7]:
from sklearn.decomposition import PCA


pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
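
To see the mean centering at work, the following sketch fits `PCA` on data with a deliberately non-zero mean and then repeats the projection by hand using the fitted `mean_` and `components_` attributes. The dataset `X_demo` and the offsets are illustrative assumptions, not the `X` used in this chapter.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data with a deliberately non-zero mean (not the chapter's dataset)
rng = np.random.default_rng(5)
scales = np.array([2.0, 1.0, 0.2])
offsets = np.array([10.0, -5.0, 3.0])
X_demo = rng.normal(size=(60, 3)) * scales + offsets

pca_demo = PCA(n_components=2)
X2D_demo = pca_demo.fit_transform(X_demo)

# The fitted object stores the per-feature mean it subtracted
mean_matches = np.allclose(pca_demo.mean_, X_demo.mean(axis=0))

# Projecting by hand with the stored mean and components gives the same result
X2D_manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
projection_matches = np.allclose(X2D_demo, X2D_manual)

print(mean_matches, projection_matches)
```

Because the comparison reuses the components from the same fitted object, it is not affected by the sign ambiguity of the principal components.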

In [8]:
X2D[:5]

array([[ 1.26203346,  0.42067648],
       [-0.08001485, -0.35272239],
       [ 1.17545763,  0.36085729],
       [ 0.89305601, -0.30862856],
       [ 0.73016287, -0.25404049]])

In [9]:
X2D_using_svd[:5]

array([[-1.26203346, -0.42067648],
       [ 0.08001485,  0.35272239],
       [-1.17545763, -0.36085729],
       [-0.89305601,  0.30862856],
       [-0.73016287,  0.25404049]])

Running PCA on slightly different datasets may give slightly different results. In general, the only difference is that some axes may be flipped, because the sign of each principal component is arbitrary. In this example, Scikit-Learn's PCA gives the same projection as the SVD approach, except that both axes are flipped:

In [10]:
np.allclose(X2D, -X2D_using_svd)

True
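
When two sets of components need to be compared directly, one option is to normalize the signs first. The sketch below uses a hypothetical helper, `align_signs`, that flips each component so that its largest-magnitude entry is positive. This is one possible convention chosen for illustration, not necessarily the rule any particular library applies, and `X_demo` is synthetic data.

```python
import numpy as np

def align_signs(components):
    """Flip each row so that its largest-magnitude entry is positive.

    Hypothetical helper: this is just one possible sign convention.
    """
    components = np.asarray(components, dtype=float)
    rows = np.arange(components.shape[0])
    largest = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[rows, largest])
    return components * signs[:, np.newaxis]

# Illustrative data only (not the chapter's dataset)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 3)) * np.array([3.0, 1.0, 0.2])
X_demo_centered = X_demo - X_demo.mean(axis=0)
_, _, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)

components = Vt_demo[:2]
flipped = -components  # an equally valid set of principal components

# After normalizing the signs, both versions agree
aligned_a = align_signs(components)
aligned_b = align_signs(flipped)
agree = np.allclose(aligned_a, aligned_b)
print(agree)
```

Flipping a component flips the corresponding column of the projected data, so the same signs would have to be applied to the projections as well.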

Recover the 3D points projected on the plane (the PCA 2D subspace):

In [11]:
X3D_inv = pca.inverse_transform(X2D)

Some information was lost during the projection step, so the recovered 3D points are not exactly equal to the original 3D points:

In [12]:
np.allclose(X3D_inv, X)

False

Compute the reconstruction error, here the mean squared distance between the original points and their reconstructions:

In [13]:
np.mean(np.sum(np.square(X3D_inv - X), axis=1))

0.01017033779284855
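
The reconstruction error is tied to the singular values that were discarded: the mean squared distance equals the sum of the squared discarded singular values divided by the number of samples. The sketch below checks this on illustrative data (`X_demo` is an assumption, not the chapter's dataset), doing the projection and reconstruction with plain NumPy.

```python
import numpy as np

# Illustrative data only (not the chapter's dataset)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 3)) * np.array([2.0, 1.0, 0.1])
mean_demo = X_demo.mean(axis=0)
X_demo_centered = X_demo - mean_demo

_, s_demo, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)

d = 2
W_d = Vt_demo[:d].T

# Project to d dimensions, map back to 3D, then undo the mean centering
X_rec = X_demo_centered @ W_d @ W_d.T + mean_demo

# Mean squared distance between the original points and their reconstructions
mse = np.mean(np.sum(np.square(X_rec - X_demo), axis=1))

# Same quantity computed from the discarded singular values
from_singular_values = np.sum(s_demo[d:] ** 2) / X_demo.shape[0]

print(np.isclose(mse, from_singular_values))
```

This is why dropping components with small singular values loses little information.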

The inverse transform in the SVD approach looks like this:

In [14]:
X3D_inv_using_svd = X2D_using_svd.dot(Vt[:2, :])

The reconstructions from the two methods are not identical because Scikit-Learn's `PCA` class adds the mean back in `inverse_transform`, whereas the SVD version stays in centered coordinates. Subtracting the mean from the Scikit-Learn reconstruction gives the same result:

In [15]:
np.allclose(X3D_inv_using_svd, X3D_inv - pca.mean_)

True

The `PCA` object gives access to the principal components that it computed:

In [16]:
pca.components_

array([[-0.93636116, -0.29854881, -0.18465208],
       [ 0.34027485, -0.90119108, -0.2684542 ]])

<h3>Explained Variance Ratio</h3>

The explained variance ratio is the proportion of the dataset's variance that lies along each principal component. It indicates how much each component contributes to representing the data.

In [17]:
pca.explained_variance_ratio_

array([0.84248607, 0.14631839])

In [18]:
1 - pca.explained_variance_ratio_.sum()

0.011195535570688975

This tells us that 84.2% of the dataset's variance lies along the first axis and 14.6% along the second. That leaves about 1.1% for the third axis, so it is reasonable to assume that it carries little information.
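
The explained variance ratio can also be derived from the singular values: the variance along each component is the squared singular value divided by `m - 1`, and the ratio is that variance divided by the total. The following is a minimal sketch on illustrative data (`X_demo` is an assumption, not the chapter's dataset).

```python
import numpy as np

# Illustrative data only (not the chapter's dataset)
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Only the singular values are needed here
s_demo = np.linalg.svd(X_demo_centered, compute_uv=False)

# Variance along each principal component (sample variance, hence m - 1)
explained_variance = s_demo ** 2 / (X_demo.shape[0] - 1)

# Proportion of the total variance carried by each component
ratio = explained_variance / explained_variance.sum()

# Fraction of variance lost when keeping only the first two components
lost = 1 - ratio[:2].sum()

print(ratio)
print(lost)
```

The `m - 1` factor cancels in the ratio, so dividing the squared singular values by their sum gives the same proportions.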