## ML2 Notebook

### Dimensionality Reduction

We have a dataset matrix $\mathbf{X}\in \mathbb{R}^{N\times D}$ that consists of $N$ data points, each of dimensionality $D$. We want to reduce the dimensionality of our dataset so that it takes up less space and is easier to visualise.

This means that we want a new dataset matrix $\mathbf{X_D}\in \mathbb{R}^{N\times M}$ that consists of $N$ data points, each of dimensionality $M$ where $M<D$. 

We can achieve this by using a transformation matrix $\mathbf{W}\in \mathbb{R}^{D\times M}$. This transforms our data to a lower dimension using $\mathbf{X_D} = \mathbf{X}\mathbf{W}$.
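As a quick sanity check on the shapes involved, here is a minimal sketch using random stand-in data. The names `X_demo`, `W_demo` and `X_D_demo` and the sizes are made up for illustration and are not part of the iris example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

N, D, M = 6, 4, 2                      # hypothetical sizes, chosen only for illustration
X_demo = rng.standard_normal((N, D))   # stand-in dataset, one row per data point
W_demo = rng.standard_normal((D, M))   # arbitrary D x M transformation matrix

X_D_demo = X_demo @ W_demo             # (N x D) @ (D x M) gives (N x M)

print(X_demo.shape, W_demo.shape, X_D_demo.shape)
```

The number of rows (data points) is unchanged; only the number of columns drops from $D$ to $M$.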

<img src="./transform1.png" title="transform1"/>


Let's illustrate this on our (standardised) iris dataset ($N=150, D=4$). Consider the scenario where we want to reduce our data to 2D. This gives us $M=2$.

$\mathbf{W}$ in this case must be a $4\times 2$ matrix. In this notebook we will use the following matrix for $\mathbf{W}$:

$\mathbf{W}=
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
0 & 0 \\
0 & 0 \\
\end{bmatrix}$

This particular $\mathbf{W}$ is **not optimal** and is used for illustration. You will find out how to get the optimal $\mathbf{W}$ in the next video!

In [None]:
import numpy as np # Import the numpy package

X = np.load('iris_standardised.npy') # Load our standardised iris data

print(f'Our dataset is represented by a matrix X that is of shape {X.shape}')

In [None]:
W = np.array([[1,0],[0,1],[0,0],[0,0]]) # Create W
X_D = X @ W # Multiply X by W
print(X_D) # Print our reduced dimensionality dataset

In [None]:
print(f'Our dimensionality-reduced dataset is represented by a matrix X_D that is of shape {X_D.shape}')

We have chosen our $\mathbf{W}$ so that the transformed dataset keeps the first two dimensions of $\mathbf{X}$ and ignores the other two. As I mentioned, this isn't optimal, as we're throwing away information that might be valuable!
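To see why this $\mathbf{W}$ simply keeps the first two columns, the sketch below compares the matrix product with plain column slicing. It uses random stand-in data (`X_demo` is a hypothetical name) rather than the iris file, so it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.standard_normal((5, 4))   # stand-in for the iris matrix (hypothetical values)

W = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])

projected = X_demo @ W      # multiply by the selection matrix
sliced = X_demo[:, :2]      # slice out the first two columns directly

same = np.allclose(projected, sliced)
print(same)
```

Each column of $\mathbf{W}$ has a single 1, so each output column is a copy of one input column.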



### Reconstruction

Our matrix $\mathbf{W}$ can be used to reduce the dimensionality of our dataset using $\mathbf{X_D} = \mathbf{X}\mathbf{W}$.

We also want to use $\mathbf{W}^\text{T}$ to try to reconstruct our original data from $\mathbf{X_D}$ using $\mathbf{\widetilde{X}} = \mathbf{X_D}\mathbf{W}^\text{T}$, where $\mathbf{\widetilde{X}}$ is our reconstructed dataset.

<img src="./transform2.png" title="transform2"/>



Why do we want this? Well, **this setup lets us evaluate how good the transformation $\mathbf{W}$ is!**  If our reconstructed dataset resembles our original dataset then our transformation $\mathbf{W}$ was good; it was able to transform the dataset to a lower dimensional space, while keeping most of the important information intact.
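A small sketch of this idea, using two made-up datasets and the illustrative $\mathbf{W}$ from earlier. The helper `round_trip` and the arrays `X_a` and `X_b` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
W = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])

# Hypothetical dataset A: all the information sits in the first two columns
X_a = np.hstack([rng.standard_normal((5, 2)), np.zeros((5, 2))])
# Hypothetical dataset B: every column carries information
X_b = rng.standard_normal((5, 4))

def round_trip(X, W):
    """Reduce with W, then map back up with W.T."""
    return (X @ W) @ W.T

a_recovered = np.allclose(round_trip(X_a, W), X_a)  # nothing was lost for dataset A
b_recovered = np.allclose(round_trip(X_b, W), X_b)  # last two columns were lost for dataset B
print(a_recovered, b_recovered)
```

The same $\mathbf{W}$ is a good transformation for dataset A and a poor one for dataset B, which is exactly what comparing the reconstruction with the original reveals.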



Let's use $\mathbf{W}^\text{T}$ to bring our iris dataset back up to 4 dimensions, in an attempt to reconstruct the original dataset.

In [None]:
X_tilde = X_D @ W.T
print(X_tilde)

We're back in 4 dimensions! But we have lost all the values in the last two columns. These are now zero. We know that this is bad, but we can quantify how bad it is by computing the **reconstruction error**.

<img src="./recon.png" title="recon"/>



In [None]:
N = X.shape[0] # Number of data points (150 for iris)
E = 0 # Start our reconstruction error as 0 and then add to it
for i in range(N):
    E += np.linalg.norm(X[i,:]-X_tilde[i,:]) # Add the error for each data point.

print(f'Reconstruction error is {E:0.2f}. This is bad!')
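Written out to match the code in this cell (an assumption about notation: $\mathbf{x}_i$ and $\mathbf{\widetilde{x}}_i$ denote the $i$-th rows of $\mathbf{X}$ and $\mathbf{\widetilde{X}}$), the quantity being accumulated is the sum of the Euclidean distances between each data point and its reconstruction:

$$E = \sum_{i=1}^{N} \left\lVert \mathbf{x}_i - \mathbf{\widetilde{x}}_i \right\rVert_2 = \sum_{i=1}^{N} \sqrt{\sum_{d=1}^{D} \left(x_{id} - \widetilde{x}_{id}\right)^2}$$

With our choice of $\mathbf{W}$, the first two columns are reconstructed exactly, so only the last two columns of $\mathbf{X}$ contribute to $E$.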

Note that the loop above is illustrative. As is often the case with NumPy, we can compute the same quantity more quickly without a loop by operating on whole arrays at once.

In [None]:
# Subtract X_tilde from X, square all the entries, sum across each row, take the square root, then sum over the rows
E = np.sum(np.sum((X - X_tilde)**2, axis=1)**0.5)
print(E)
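An alternative way to write this is to let `np.linalg.norm` compute the row-wise norms via its `axis` argument. The sketch below checks that this agrees with the loop, using random stand-in data (`X_demo` and `X_tilde_demo` are hypothetical names) so that it runs without the iris file.

```python
import numpy as np

rng = np.random.default_rng(3)
X_demo = rng.standard_normal((150, 4))          # stand-in data with the same shape as iris
W = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])
X_tilde_demo = (X_demo @ W) @ W.T               # reduce, then reconstruct

# Loop version: add up the Euclidean distance for each data point
E_loop = 0.0
for i in range(X_demo.shape[0]):
    E_loop += np.linalg.norm(X_demo[i, :] - X_tilde_demo[i, :])

# Vectorised version: one norm per row (axis=1), then sum
E_vec = np.linalg.norm(X_demo - X_tilde_demo, axis=1).sum()

print(np.isclose(E_loop, E_vec))
```

Both versions compute the same number; the vectorised one is shorter and avoids a Python-level loop.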