# Chapter 8: Dimensionality Reduction Exercises

## 1.

> What are the main motivations for reducing a dataset's dimensionality?

> What are the main drawbacks?

The main motivations for reducing a dataset's dimensionality are: 

- Speeding up training
- Data visualization

With thousands or millions of features, training becomes extremely slow and finding a good solution becomes harder. Reducing a dataset to 2 or 3 dimensions also makes it possible to plot it, which can reveal patterns (such as clusters) that would otherwise be missed and helps when communicating results to others.

The main drawbacks are:

- Loss of information
- Rarely improves model quality: it may filter out some noise, but in general it only speeds up training
- Pipeline becomes more complex and harder to maintain

Dimensionality reduction is akin to compressing an image file: the downstream algorithm runs faster, but usually at the cost of slightly worse performance (a lower-quality picture, in the image analogy). Reducing dimensions may filter out some noise and unnecessary detail, but this is not guaranteed.

## 2.

> What is the curse of dimensionality?

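The curse of dimensionality refers to the problems that arise in high-dimensional spaces but not in low-dimensional ones. High-dimensional datasets tend to be very sparse: there is so much space that most training instances are far apart from each other, and a new instance is likely to be far from any training instance. Predictions then rest on much larger extrapolations, which increases the risk of overfitting.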

Recall the average distance between two points picked at random in a unit hypercube:

- 2D (unit square): about 0.52
- 3D (unit cube): about 0.66
- 1,000,000-D (unit hypercube): about 408.25
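
These figures can be checked with a small Monte Carlo simulation. The sketch below is only an illustration: the function name `mean_pairwise_distance`, the sample sizes, and the seed are assumptions, not part of the original exercise.

```python
import numpy as np


def mean_pairwise_distance(n_dims, n_pairs, seed=42):
    """Monte Carlo estimate of the mean distance between two random points
    drawn uniformly from the unit hypercube [0, 1]^n_dims."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))  # first point of each pair
    b = rng.random((n_pairs, n_dims))  # second point of each pair
    return float(np.linalg.norm(a - b, axis=1).mean())


dist_2d = mean_pairwise_distance(2, 200_000)
dist_3d = mean_pairwise_distance(3, 200_000)
# A few pairs are enough in very high dimensions: the distance barely varies.
dist_1m = mean_pairwise_distance(1_000_000, 3)

print(f"2D: {dist_2d:.2f}, 3D: {dist_3d:.2f}, 1,000,000-D: {dist_1m:.2f}")
```

For large $n$ the estimate approaches $\sqrt{n/6}$, because the expected squared difference between two uniform coordinates is $1/6$.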

## 3.

> Once a dataset's dimensionality has been reduced, is it possible to reverse the operation?

> If so, how? If not, why?

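Once a dataset's dimensionality has been reduced, it is almost always impossible to reverse the operation perfectly, because some information is lost during the reduction. Some algorithms, such as PCA, nevertheless offer a simple inverse transformation that gives an approximate reconstruction: multiply the projected data by the transpose of the matrix of principal components used for the projection.

$$ \mathbf{X}_{\text{recovered}} = \mathbf{X}_{d\text{-proj}} \mathbf{W}_d^T $$

where $\mathbf{W}_d$ is the matrix whose columns are the first $d$ principal components, so $\mathbf{W}_d^T$ maps the $d$-dimensional projection back to the original space.

Because of the information lost in the initial projection, the recovered data will be close to the original but not identical. Other algorithms, such as t-SNE, do not provide an inverse transformation at all.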
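
The inverse transformation can be sketched in NumPy using the SVD. This is a minimal illustration under stated assumptions: the toy dataset and all variable names are made up, and the mean is subtracted before projecting and added back afterwards, a step the formula above leaves implicit by assuming centered data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 200 instances in 3-D lying close to a 2-D plane.
latent = rng.normal(size=(200, 2))
plane = np.array([[1.0, 0.5, 0.2],
                  [0.0, 1.0, -0.3]])
X = latent @ plane + 0.05 * rng.normal(size=(200, 3))

# PCA assumes centered data, so remove the mean first.
mean = X.mean(axis=0)
X_centered = X - mean

# Rows of Vt are the principal components, ordered by decreasing variance.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2
W_d = Vt[:d].T                       # columns = first d principal components
X_proj = X_centered @ W_d            # project down to d dimensions
X_recovered = X_proj @ W_d.T + mean  # inverse transformation

# Mean squared distance between the original and reconstructed instances.
reconstruction_mse = float(np.mean(np.sum((X - X_recovered) ** 2, axis=1)))
print(f"Reconstruction error: {reconstruction_mse:.5f}")
```

The reconstruction error is small but nonzero: it is the variance that was dropped along the discarded third component.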

## 4.

> Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

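PCA can significantly reduce the dimensionality of most datasets, even highly nonlinear ones, because it can at least get rid of useless dimensions. If there are no useless dimensions, as in a Swiss roll dataset, linear PCA will lose too much information: the manifold needs to be unrolled, not squashed.

For such data, the kernel trick can be applied to PCA, giving Kernel PCA (kPCA). Instead of projecting directly onto a hyperplane in the original space, kPCA implicitly maps the dataset into a higher-dimensional feature space (possibly even an $\infty$-dimensional one) and performs linear PCA there, which corresponds to a complex nonlinear projection in the original space.

Like any dimensionality reduction, this still loses some information; the aim is to preserve the structure that matters, such as clusters of instances.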

## 5.

> Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%.

> How many dimensions will the resulting dataset have?

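The explained variance ratio of a principal component is the proportion of the dataset's variance that lies along that component. Setting it to 95% asks PCA for the smallest number of components whose cumulative explained variance reaches 95%. Without knowing the dataset, there isn't enough information to say how many dimensions that is: the answer can be anywhere between 1 and about 950.

If the instances are almost perfectly aligned along a line, PCA can reduce the dataset to 1-D and still preserve 95% of the variance. If the instances are perfectly random points scattered across all 1,000 dimensions, roughly 950 dimensions are needed to preserve 95% of the variance.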
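
The two extremes can be illustrated on a scaled-down example. The sketch below uses 50 dimensions instead of 1,000 to keep it fast; the helper `n_components_for_variance` and both toy datasets are assumptions made for illustration.

```python
import numpy as np


def n_components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    s = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance_ratio = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained_variance_ratio)
    # argmax returns the index of the first True value.
    return int(np.argmax(cumulative >= target) + 1)


rng = np.random.default_rng(42)
m, n = 5000, 50

# Case 1: instances almost perfectly aligned along one direction.
t = rng.normal(size=(m, 1))
direction = rng.normal(size=(1, n))
X_line = t @ direction + 1e-3 * rng.normal(size=(m, n))

# Case 2: perfectly random points scattered across all n dimensions.
X_random = rng.random((m, n))

d_line = n_components_for_variance(X_line)
d_random = n_components_for_variance(X_random)
print(f"aligned data: {d_line} component(s); random data: {d_random} of {n}")
```

The aligned dataset needs a single component, while the random one needs nearly all of them, which mirrors the 1 versus roughly 950 range for the 1,000-dimensional case.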

## 6.

> In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

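When to use each PCA method:

- Vanilla PCA:
    - The default choice, as long as the training set fits in memory
    - Data that can be projected reasonably well onto a linear subspace
- Incremental PCA:
    - Large training sets that do not fit in memory (it is slower than vanilla PCA, so prefer vanilla PCA when the data fits)
    - Online PCA, where new instances arrive on the fly and must be incorporated incrementally
- Randomized PCA:
    - Much faster than vanilla PCA when d is much smaller than n
    - Acceptable when an approximation of the first d principal components is enough
    - Rule of thumb: m or n greater than 500 and d less than 80% of m or n (m instances, n features, d target dimensions)
- Kernel PCA:
    - Complex nonlinear projections
    - Often good at preserving clusters of instances after projection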
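
As a rough sketch of how the four variants look in Scikit-Learn, the example below runs each on the same toy data. The data, the batch count, and the hyperparameter values (`d`, `gamma`, the RBF kernel) are arbitrary assumptions chosen only to make the code run.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))  # toy data, assumed for illustration
d = 5

# Vanilla PCA: exact solution computed with a full SVD.
X_full = PCA(n_components=d, svd_solver="full").fit_transform(X)

# Randomized PCA: stochastic approximation of the first d components.
X_rand = PCA(n_components=d, svd_solver="randomized",
             random_state=42).fit_transform(X)

# Incremental PCA: feed the training set one mini-batch at a time.
inc_pca = IncrementalPCA(n_components=d)
for batch in np.array_split(X, 10):
    inc_pca.partial_fit(batch)
X_inc = inc_pca.transform(X)

# Kernel PCA: nonlinear projection through an RBF kernel.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05).fit_transform(X)

print(X_full.shape, X_rand.shape, X_inc.shape, X_kpca.shape)
```

In practice the mini-batches for Incremental PCA would be read from disk or a stream rather than split from an in-memory array.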

## 7.

> How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

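One way to evaluate performance is to apply the inverse transformation to project the reduced dataset back to its original dimensionality, then compare the reconstruction with the original by computing the reconstruction error (the mean squared distance between the two). Not every algorithm provides an inverse transformation, though. When dimensionality reduction is a preprocessing step for another algorithm, such as a classifier, you can instead measure the performance of that second algorithm.

Intuitively, a dimensionality reduction algorithm performs well if it eliminates many dimensions while losing as little information as possible (i.e., it keeps the reconstruction error low), so that downstream algorithms run faster without performing much worse.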
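
A minimal sketch of the reconstruction-error approach with Scikit-Learn's `PCA` follows. It uses the small digits dataset bundled with Scikit-Learn as a stand-in (an assumption, chosen because it needs no download); the helper `reconstruction_error` and the component counts are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1,797 images of 8x8 = 64 pixels


def reconstruction_error(X, n_components):
    """Mean squared distance between X and its PCA reconstruction."""
    pca = PCA(n_components=n_components, svd_solver="full")
    X_reduced = pca.fit_transform(X)
    X_recovered = pca.inverse_transform(X_reduced)
    return float(np.mean(np.sum((X - X_recovered) ** 2, axis=1)))


errors = {d: reconstruction_error(X, d) for d in (2, 8, 32, 64)}
for d, err in errors.items():
    print(f"{d:2d} components: reconstruction error {err:.4f}")
```

The error shrinks as more components are kept and vanishes when all 64 are retained, so the useful question is how few components keep it acceptably low.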

## 8.

> Does it make any sense to chain two different dimensionality reduction algorithms?

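It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is to use PCA first to quickly get rid of a large number of useless dimensions, then apply a slower algorithm, such as LLE, to the much smaller result. This two-step approach will likely give roughly the same result as the slower algorithm alone, in a fraction of the time.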
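
One possible way to chain PCA and LLE is a Scikit-Learn `Pipeline`. The sketch below is an illustration under assumptions: the dataset is a Swiss roll padded with 47 near-empty dimensions, and the hyperparameters (`n_neighbors`, the 95% variance threshold) are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)

# Toy data (assumption): a 3-D Swiss roll plus 47 nearly empty dimensions.
X_roll, _ = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
X = np.hstack([X_roll, 0.01 * rng.normal(size=(1000, 47))])

pca_lle = Pipeline([
    # Step 1: PCA quickly drops the dimensions that carry almost no variance.
    ("pca", PCA(n_components=0.95)),
    # Step 2: LLE unrolls the manifold in the much smaller remaining space.
    ("lle", LocallyLinearEmbedding(n_components=2, n_neighbors=10,
                                   random_state=42)),
])
X_2d = pca_lle.fit_transform(X)

n_kept = pca_lle.named_steps["pca"].n_components_
print(f"PCA kept {n_kept} of {X.shape[1]} dimensions; final shape {X_2d.shape}")
```

LLE then only has to search for neighbors in the few dimensions PCA kept, rather than in all 50.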

## 9.

> 1. Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).

> 2. Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.

> 3. Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.

> 4. Train a new Random Forest classifier on the reduced dataset and see how long it takes.

> 5. Was training much faster?

> 6. Next, evaluate the classifier on the test set.

> 7. How does it compare to the previous classifier?
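
A possible outline for this exercise is sketched below. To stay small and offline it uses Scikit-Learn's bundled 8x8 digits dataset as a stand-in for MNIST, which is an assumption and not what the exercise asks for; the helper `time_random_forest`, the split index, and the forest settings are also illustrative. Whether training is faster after PCA depends on the model and the data, so the timings are printed rather than assumed.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def time_random_forest(X_train, y_train, X_test, y_test):
    """Train a Random Forest; return (training seconds, test accuracy)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, accuracy_score(y_test, clf.predict(X_test))


# Stand-in data (assumption): the small digits set bundled with Scikit-Learn.
# For the real exercise, load MNIST instead, for example with
# fetch_openml("mnist_784", version=1, as_frame=False), and split at 60,000.
X, y = load_digits(return_X_y=True)
split = 1500
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Steps 1-2: train and evaluate on the full-dimensional data.
time_full, acc_full = time_random_forest(X_train, y_train, X_test, y_test)

# Step 3: fit PCA on the training set only, keeping 95% of the variance.
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)

# Steps 4-7: train and evaluate on the reduced data, then compare.
time_reduced, acc_reduced = time_random_forest(
    X_train_reduced, y_train, X_test_reduced, y_test)

print(f"original: {X_train.shape[1]} dims, {time_full:.2f}s, accuracy {acc_full:.3f}")
print(f"reduced:  {X_train_reduced.shape[1]} dims, {time_reduced:.2f}s, accuracy {acc_reduced:.3f}")
```

Note that PCA is fitted on the training set only and then applied to the test set, so no information from the test set leaks into the projection.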

## 10.

> 1. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib.

> 2. You can use a scatterplot using 10 different colors to represent each image's target class.

> 3. Alternatively, you can replace each dot in the scatterplot with the corresponding instance's class (a digit from 0 to 9).

> 4. Or even plot scaled-down versions of the digit images themselves.

> Note: If you plot all digits, the visualization will be too cluttered, so you should either:
>
> - Draw a random sample.
> - Or plot an instance only if no other instance has already been plotted at a close distance.

> 5. You should get a nice visualization with well-separated clusters of digits.

> 6. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.