# Why do we need dimensionality reduction?

Each additional feature enlarges the space that has to be explored. If feature 1 can take $N_{1}$ possible values and feature 2 can take $N_{2}$, there are $N_{1} \times N_{2}$ possible combinations. With $m$ features there are $N_{1}\times N_{2} \times \dots \times N_{m}$ possible combinations, so the space grows exponentially with the number of features.
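
As a rough illustration of this growth, the sketch below multiplies out the number of combinations as features are added. The value counts (10 per feature) are made up for the example and do not come from any dataset in this notebook.

```python
import math

# Hypothetical number of distinct values per feature (illustrative only).
values_per_feature = [10, 10, 10, 10, 10]

# Number of possible combinations after adding each feature in turn.
for n_used in range(1, len(values_per_feature) + 1):
    combinations = math.prod(values_per_feature[:n_used])
    print(f"{n_used} feature(s): {combinations} combinations")

total_combinations = math.prod(values_per_feature)
```

Every extra feature multiplies the count by its number of possible values, which is why the space becomes hard to cover with a fixed number of samples.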

Dimensionality reduction, as its name says, reduces the dimension of the feature space. This notebook covers Principal Component Analysis (PCA), which uses linear algebra to find the linear combinations of features along which the data varies the most. The procedure is as follows:

1. Transform the dataset $X$ so that each feature is centered at 0. This can be done by subtracting the mean of each feature from its column, or by standardizing the data so that each feature has mean 0 and standard deviation 1.

2. Get the covariance matrix:
$$\text{cov}(X) = \frac{X^{T}X}{n-1}$$

where $n$ is the number of rows and $X$ is the centered dataset with shape $n \times m$, where $m$ is the number of features.

3. Calculate the eigenvectors and eigenvalues of the system:
$$\text{cov}(X)\vec{p_{a}} = \lambda_{a}\vec{p_{a}}$$

i.e., the eigenvectors $\vec{p_{a}}$, whose direction is unchanged by the covariance matrix (they are only scaled), and their corresponding eigenvalues $\lambda_{a}$, which give the variance of the data along each eigenvector.

4. Transform the data by multiplying it by the matrix $V_{N}$ whose columns are the $N$ eigenvectors with the largest eigenvalues:
$$X_{\text{transformed}} = X\cdot V_{N}$$
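
The four steps above can be sketched as a single NumPy function. This is a minimal sketch, not a reference implementation: the name `pca_sketch`, the argument `k` (playing the role of $N$) and the random demo data are assumptions made for the example, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_sketch(X, k):
    """Project X (n rows, m features) onto its k leading principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each feature at 0.
    X_centered = X - X.mean(axis=0)
    # Step 2: sample covariance matrix, shape (m, m).
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    # Step 3: eigen-decomposition of the symmetric covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:k]
    V_k = eigenvectors[:, order]
    return X_centered @ V_k, eigenvalues[order], V_k

# Hypothetical demo data: 5 features with clearly different spreads.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

X_proj, top_eigenvalues, V_k = pca_sketch(X_demo, k=2)
print(X_proj.shape)  # (200, 2)
```

A useful sanity check is that the sample variance of each projected column equals the corresponding eigenvalue, since the eigenvalues measure the variance along each eigenvector.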


## Dataset creation

In [5]:
# Create a sample dataset with sklearn.datasets.make_blobs
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

n_samples = 1000
n_features = 5

data =  make_blobs(n_samples = n_samples, 
                   n_features = n_features, 
                   centers = 1,
                   random_state = 12)
feature_data = data[0]

## Center data

In [12]:
center_data = feature_data - feature_data.mean(axis = 0)
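
Step 1 mentions two options, centering and standardizing. The sketch below compares them on a small random dataset; `X_demo` and its means and scales are made up for the example, so it is only meant to show what each transformation guarantees.

```python
import numpy as np

# Hypothetical data: 3 features with different means and spreads.
rng = np.random.default_rng(12)
X_demo = rng.normal(loc=[5.0, -2.0, 0.5], scale=[1.0, 4.0, 0.2], size=(1000, 3))

# Centering: subtract the per-feature mean (what the cell above does).
centered = X_demo - X_demo.mean(axis=0)

# Standardizing: additionally divide by the per-feature standard deviation.
standardized = centered / X_demo.std(axis=0, ddof=1)

print(np.allclose(centered.mean(axis=0), 0.0))             # means are 0
print(np.allclose(standardized.std(axis=0, ddof=1), 1.0))  # spreads are 1
```

Centering keeps the original scale of each feature, while standardizing puts all features on the same scale, which matters when features are measured in different units.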

## Get covariance matrix (normalized)

In [17]:
import numpy as np

cov = center_data.T @ center_data / (len(center_data) - 1)  # dividing by n - 1 normalizes the covariance
cov

array([[ 0.99607584, -0.02450582,  0.02846781,  0.01337188, -0.03843541],
       [-0.02450582,  1.01683351,  0.00938314,  0.01686747, -0.02826945],
       [ 0.02846781,  0.00938314,  0.99091033, -0.02925299,  0.00343573],
       [ 0.01337188,  0.01686747, -0.02925299,  1.04513819,  0.02365677],
       [-0.03843541, -0.02826945,  0.00343573,  0.02365677,  1.0432768 ]])
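
The manual formula can be cross-checked against `np.cov`, which also divides by $n-1$ by default. The sketch below uses random stand-in data (`X_demo` is an assumption, not the blob dataset) so that it runs on its own.

```python
import numpy as np

# Hypothetical stand-in data with the same shape as the notebook's dataset.
rng = np.random.default_rng(12)
X_demo = rng.normal(size=(1000, 5))
centered = X_demo - X_demo.mean(axis=0)

# Manual covariance, as in the formula cov(X) = X^T X / (n - 1).
cov_manual = centered.T @ centered / (len(centered) - 1)

# rowvar=False tells np.cov that columns are features and rows are samples.
cov_numpy = np.cov(X_demo, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
```

`np.cov` centers the data internally, so it can be called on the uncentered array.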

## Get eigenvectors and eigenvalues of covariance matrix (normalized)

In [18]:
eig_sys = np.linalg.eig(cov)
eig_sys

EigResult(eigenvalues=array([0.92650081, 1.07989176, 1.00115088, 1.03113944, 1.05355178]), eigenvectors=array([[ 0.60542414,  0.30527195, -0.2389985 , -0.68016109, -0.1432873 ],
       [ 0.39135557,  0.09934794, -0.39782425,  0.64159948, -0.51677537],
       [-0.48934024,  0.26033264, -0.80391314, -0.07854956,  0.20081504],
       [-0.31891962, -0.54831639, -0.12010783, -0.3441484 , -0.68174438],
       [ 0.37301317, -0.72699042, -0.35201329,  0.03350894,  0.45531296]]))
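
Because a covariance matrix is symmetric, `np.linalg.eigh` is an alternative worth considering: it is designed for symmetric matrices and returns the eigenvalues in ascending order. The sketch below applies it to the covariance matrix printed earlier (copied by hand, so rounded to 8 decimals) and checks the eigenvalue equation.

```python
import numpy as np

# Covariance matrix copied from the notebook output (rounded to 8 decimals).
cov = np.array([
    [ 0.99607584, -0.02450582,  0.02846781,  0.01337188, -0.03843541],
    [-0.02450582,  1.01683351,  0.00938314,  0.01686747, -0.02826945],
    [ 0.02846781,  0.00938314,  0.99091033, -0.02925299,  0.00343573],
    [ 0.01337188,  0.01686747, -0.02925299,  1.04513819,  0.02365677],
    [-0.03843541, -0.02826945,  0.00343573,  0.02365677,  1.0432768 ],
])

# eigh returns eigenvalues in ascending order; eigenvectors are the columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)

# Check cov @ p_a = lambda_a * p_a for every eigenpair at once.
residual = cov @ eigenvectors - eigenvectors * eigenvalues
print(np.abs(residual).max())
```

The eigenvalues agree with the `np.linalg.eig` result up to ordering and rounding; the eigenvectors may differ by a sign, which does not change the directions they describe.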

## Transform the data

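We want the top 3 eigenvectors, i.e. those with the largest eigenvalues. Note that `np.linalg.eig` does not return the eigenvalues in sorted order: the slice `[:, :3]` below takes the first 3 columns, whose eigenvalues (0.9265, 1.0799, 1.0012) include the smallest one, so it is not the top 3. The cell also projects the uncentered `feature_data` rather than `center_data`, which shifts every projected column by a constant offset.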

In [23]:
X_transf = feature_data@eig_sys[1][:, :3]
X_transf

array([[-5.34601898,  3.9805265 ,  7.99370598],
       [-2.38634649,  3.71377939,  7.60816424],
       [-4.45121296,  5.09147019,  7.6649479 ],
       ...,
       [-3.81499632,  5.64604836,  8.22068892],
       [-3.50139373,  1.15440944,  5.81720655],
       [-4.76827168,  4.12086166,  6.3339521 ]])
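
To select the eigenvectors by eigenvalue rather than by position, the columns can be ranked with `np.argsort`. The sketch below uses the eigenvalues printed by `np.linalg.eig` earlier, copied by hand; the variable names `order`, `top3` and `explained` are introduced only for this example.

```python
import numpy as np

# Eigenvalues copied from the np.linalg.eig output, in the order returned.
eigenvalues = np.array([0.92650081, 1.07989176, 1.00115088, 1.03113944, 1.05355178])

# Column indices sorted from largest to smallest eigenvalue.
order = np.argsort(eigenvalues)[::-1]
top3 = order[:3]
print(top3)  # column indices of the three largest eigenvalues

# Fraction of the total variance kept by those three components.
explained = eigenvalues[top3].sum() / eigenvalues.sum()
print(round(explained, 3))
```

With `eig_sys` and `center_data` from the cells above, the projection onto the three leading components would then be `center_data @ eig_sys[1][:, top3]`. The three components keep only about 62% of the variance here, which is expected: a single blob spreads almost equally in all five directions, so no direction dominates.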