<a href="https://colab.research.google.com/github/RohanOpenSource/ml-notebooks/blob/main/DimensionalityReduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The more dimensions our data has, the longer a model takes to train, the more extreme values there will be, and the higher the likelihood that the data will not be represented properly. In data science this is referred to as the **Curse of Dimensionality**. Dimensionality reduction counters it by projecting the data onto fewer dimensions, and the most popular way to do this is PCA, which is short for principal component analysis. In PCA we find the line along which the data varies the most, then the line orthogonal to it that captures the most remaining variance, then the line orthogonal to both of those, and so on until we run out of dimensions. Then we project the data onto the first few of these lines, which gives a lower-dimensional representation in which the easy-to-find pattern in the data is not lost.

In [1]:
import numpy as np
import sklearn

Let us start by making a random 3D dataset.

In [9]:
np.random.seed(42) 
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

First, we will implement PCA using NumPy.

In [10]:
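X_centered = X - X.mean(axis=0)  # center the data at the origin
U, S, VT = np.linalg.svd(X_centered)  # rows of VT are the principal components
c1 = VT.T[:, 0]  # first principal component
c2 = VT.T[:, 1]  # second principal component

W2 = VT.T[:, :2]  # matrix whose columns are the first two components
X2D = X_centered.dot(W2)  # project the centered data onto those two components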
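
The SVD also tells us how much variance each component carries. The following is a minimal sketch on hypothetical stand-in data (`X_demo` and `mixing` are made up for illustration, not the dataset above): it checks that the components are orthonormal and derives the explained variance from the singular values `S`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 60 samples, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.1]])
X_demo = rng.normal(size=(60, 3)) @ mixing

X_demo_centered = X_demo - X_demo.mean(axis=0)
U, S, VT = np.linalg.svd(X_demo_centered, full_matrices=False)

# The rows of VT are unit-length and mutually orthogonal,
# so VT @ VT.T should be the identity matrix
gram = VT @ VT.T

# The variance of the data along component i is S[i]**2 / (m - 1)
explained_variance = S**2 / (len(X_demo) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Projecting onto the first two components keeps exactly those variances
X2D_demo = X_demo_centered @ VT[:2].T
projected_variance = X2D_demo.var(axis=0, ddof=1)

print(explained_variance_ratio)
```

The ratios sum to 1 and are sorted from largest to smallest, because `np.linalg.svd` returns the singular values in descending order.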

We can do this much more easily using scikit-learn, which also centers the data for us.

In [18]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
pca.explained_variance_ratio_

array([0.85406025, 0.13622918])
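
The two approaches should agree, except that the sign of each component is arbitrary and may be flipped between them. As a rough check, here is a sketch on hypothetical stand-in data (`X_demo` is made up for illustration) that compares the NumPy projection with scikit-learn's and computes how much variance the dropped third component carried.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 60 samples, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.1]])
X_demo = rng.normal(size=(60, 3)) @ mixing

# Manual PCA with NumPy
X_demo_centered = X_demo - X_demo.mean(axis=0)
_, _, VT = np.linalg.svd(X_demo_centered)
X2D_numpy = X_demo_centered @ VT.T[:, :2]

# PCA with scikit-learn (centers the data internally)
pca_demo = PCA(n_components=2)
X2D_sklearn = pca_demo.fit_transform(X_demo)

# Each column may differ by a factor of -1, so compare absolute values
same_up_to_sign = np.allclose(np.abs(X2D_numpy), np.abs(X2D_sklearn), atol=1e-6)

# Fraction of the variance that lay along the dropped third component
variance_lost = 1 - pca_demo.explained_variance_ratio_.sum()

print(same_up_to_sign, variance_lost)
```

Applied to the ratios printed above, the same subtraction shows that roughly 1% of the variance was lost by dropping the third dimension.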

To what dimension should we reduce the data? 2 was a completely arbitrary number of dimensions, and too few dimensions can be just as bad as too many. The easiest way to find a good number of dimensions is to add up the explained variance ratio of the components, one at a time, until the total is sufficiently big (95% here).

In [19]:
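pca2 = PCA()  # don't specify n_components, so all components are kept
pca2.fit(X)
cs = np.cumsum(pca2.explained_variance_ratio_)  # cumulative explained variance
d = np.argmax(cs >= 0.95) + 1  # smallest number of components that reaches 95%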

In [28]:
pca_o = PCA(n_components=d)
X_r = pca_o.fit_transform(X)
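
scikit-learn can also do this search for us: when `n_components` is a float between 0 and 1, `PCA` treats it as the fraction of variance to preserve. A minimal sketch, assuming hypothetical stand-in data (`X_demo`, built from 3 made-up latent factors, is not the dataset above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 200 samples, 10 features driven by 3 latent factors
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Manual search, as in the cells above
cs_demo = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
d_manual = np.argmax(cs_demo >= 0.95) + 1

# Let scikit-learn pick the number of components that preserves 95% of the variance
pca_ratio = PCA(n_components=0.95)
X_demo_reduced = pca_ratio.fit_transform(X_demo)

print(d_manual, pca_ratio.n_components_)
print(pca_ratio.explained_variance_ratio_.sum())
```

After fitting, the chosen number of dimensions is available as `pca_ratio.n_components_`.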

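Just as batch gradient descent is slower than stochastic gradient descent, calculating our principal components with a full SVD of the entire dataset can be quite slow on large datasets. A quicker way to do it is randomized PCA, which uses a stochastic algorithm to quickly find an approximation of the first d principal components.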

In [33]:
pca_rand = PCA(n_components=d, svd_solver="randomized")
X_reduced = pca_rand.fit_transform(X)
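
To get a feel for how close the approximation is, the sketch below compares the randomized solver with the full SVD on a larger dataset. The data (`X_big`, built from 5 made-up latent factors) is hypothetical, and `random_state` is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 500 samples, 50 features driven by 5 strong latent factors
latent = 10.0 * rng.normal(size=(500, 5))
X_big = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(500, 50))

# Exact components from the full SVD
pca_full = PCA(n_components=5, svd_solver="full").fit(X_big)

# Approximate components from the randomized solver
pca_rnd = PCA(n_components=5, svd_solver="randomized", random_state=42).fit(X_big)

# Largest difference between the two explained variance ratios
max_gap = np.max(np.abs(pca_full.explained_variance_ratio_
                        - pca_rnd.explained_variance_ratio_))
print(max_gap)
```

On data like this, where a few components dominate, the two solvers should agree very closely. The speed advantage of the randomized solver only shows up when the dataset is much larger than the number of components we keep.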