In [3]:
import os
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

#To make the notebook's output stable across runs
np.random.seed(42)

#Uses Jupyter's own backend to plot
%matplotlib inline

#To make pretty figures
mpl.rc("axes", labelsize=14)
mpl.rc("xtick", labelsize=12)
mpl.rc("ytick", labelsize=12)

#Path to saving images
IMAGE_PATH = os.path.join("images")
os.makedirs(IMAGE_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGE_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

The ***curse of dimensionality*** arises when each training instance has a very large number of features: thousands or even millions. All these features make training extremely slow, and they also make it much harder to find a good solution.

It's often possible to reduce the number of features considerably. For instance, in the MNIST images, the pixels on the image borders are almost always white, so we can drop them without losing much information. Moreover, two neighboring pixels are often highly correlated, so we can merge them into a single pixel (by taking the mean of the two pixel intensities) without losing much information.
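The two tricks described for MNIST can be sketched on a stand-in image. This is only an illustration under assumptions: the image is random noise rather than real MNIST data, and the 2-pixel border width is an arbitrary choice, not a value from these notes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for one 28x28 grayscale image (synthetic values, not real MNIST data)
image = rng.random((28, 28))

# Drop a 2-pixel border on every side (the border width is an assumption)
cropped = image[2:-2, 2:-2]                                  # shape (24, 24)

# Merge each horizontal pair of neighboring pixels by averaging their intensities
rows, cols = cropped.shape
merged = cropped.reshape(rows, cols // 2, 2).mean(axis=2)    # shape (24, 12)

print("number of features:", image.size, "->", cropped.size, "->", merged.size)
```

Each step shrinks the feature vector while keeping most of the image content, which is the trade-off dimensionality reduction aims for.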

> Reducing dimensionality does cause some information loss, so even though it will speed up training, it may make the system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So, if training is too slow, first try to train your system with the original data before considering dimensionality reduction.

Dimensionality reduction is also extremely useful for data visualization (DataViz). Reducing the number of dimensions down to two or three makes it possible to plot a condensed view of a high-dimensional training set on a graph and often gain some **important insights by visually detecting patterns such as clusters.** DataViz is also **critical for communicating your conclusions to people who are not data scientists**, in particular the decision makers who will use your results.

We'll discuss the **two main approaches to dimensionality reduction: _projection_ and _Manifold Learning_**, and cover three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.

# Section: The Curse Of Dimensionality

There's plenty of space in high dimensions. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they are based on much larger extrapolations.
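A quick simulation makes the sparsity concrete. The sketch below estimates the mean distance between two points drawn uniformly from the unit hypercube; the helper name `mean_pair_distance`, the sample size, and the list of dimensions are all illustrative choices, not part of these notes.

```python
import numpy as np

def mean_pair_distance(n_dims, n_pairs=2_000, seed=42):
    """Monte Carlo estimate of the mean distance between two random
    points drawn uniformly from the unit hypercube [0, 1]^n_dims."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

for n_dims in (2, 3, 100, 1_000):
    print(f"{n_dims:>5} dims: mean distance ~ {mean_pair_distance(n_dims):.2f}")
```

The estimate keeps growing with the number of dimensions (for large $d$ it approaches $\sqrt{d/6}$), even though every coordinate still lies between 0 and 1.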

In short, the more dimensions the training set has, the greater the risk of overfitting it.

In theory, one solution could be to increase the size of the training set to reach a sufficient density of training instances. **Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions. With just 100 features, you would need more training instances than there are atoms in the observable universe for the instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.**
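The exponential growth can be checked with a back-of-the-envelope count. This is a crude sketch, not a rigorous density argument: it assumes the instances sit on a regular grid with spacing 0.1 inside the unit hypercube, and it takes $10^{80}$ as a commonly quoted order-of-magnitude estimate for the number of atoms in the observable universe.

```python
# Crude estimate: cover the unit hypercube with a regular grid whose
# spacing equals the target distance, then count the grid points.
spacing = 0.1                           # target distance between neighbors
points_per_axis = round(1 / spacing)    # 10 grid points along each axis
atoms_in_universe = 10 ** 80            # order-of-magnitude estimate (assumption)

required = {n_dims: points_per_axis ** n_dims for n_dims in (1, 2, 3, 10, 100)}

for n_dims, n_points in required.items():
    verdict = "more" if n_points > atoms_in_universe else "fewer"
    print(f"{n_dims:>3} dims: about {float(n_points):.0e} instances "
          f"({verdict} than the atom estimate)")
```

Under these assumptions the required count is $10^d$, so every additional feature multiplies the required training set size by 10.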

# End Of Section: The Curse Of Dimensionality

# Section: Main Approaches For Dimensionality Reduction

##### Projection

In most real-world problems, training instances are **not** spread out uniformly across all dimensions: many features are almost constant, while others are highly correlated. As a result, all training instances lie within (or close to) a much lower-dimensional ***subspace*** of the high-dimensional space.

Let's illustrate this

## INSERT FIGURE 8-2

We can see that all training instances lie close to a plane: **this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space.** If we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in figure 8-3. As a result, **we have just reduced the dataset's dimensionality from 3D to 2D.**

## INSERT FIGURE 8-3
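The projection step can be sketched in NumPy. The notes do not say how the plane in the figure was found, so this example makes its own assumptions: the 3D data is synthetic (all coefficients are made up), and the plane is taken to be the one spanned by the two directions of largest variance, found with SVD, which is the same idea PCA builds on.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200

# Synthetic 3D data lying close to a plane (all coefficients are made up)
x1 = rng.normal(size=m)
x2 = rng.normal(size=m)
x3 = 0.3 * x1 - 0.2 * x2 + 0.05 * rng.normal(size=m)
X = np.column_stack([x1, x2, x3])

# Center the data, then use SVD to find the two directions of largest variance
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt[:2].T                     # 3x2 matrix whose columns span the plane

X2D = X_centered @ W2             # coordinates of each instance within the plane
X3D_projected = X2D @ W2.T        # the projected points, expressed back in 3D

# Mean squared length of the short lines joining each instance to the plane
projection_error = np.mean(np.sum((X_centered - X3D_projected) ** 2, axis=1))
print(X.shape, "->", X2D.shape, "| mean squared projection distance:", projection_error)
```

Because the instances were generated close to a plane, the projection distance is small, which is why so little information is lost when going from 3D to 2D here.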

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, as in the _Swiss roll_ toy dataset shown in figure 8-4.

## INSERT FIGURE 8-4

Simply projecting onto a plane (e.g., by dropping $x_3$) would squash different layers of the Swiss roll together, as shown on the left side of figure 8-5. **What we really want is to unroll the Swiss roll to obtain the 2D dataset on the right side of figure 8-5.**

## INSERT FIGURE 8-5
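The difference between squashing and unrolling can be checked numerically. The sketch below builds a Swiss roll by hand with NumPy; the parametrization (range of `t`, width of 21) is an assumption that mirrors a common way of generating this toy dataset, and using `t` directly as the unrolled coordinate is a simplification, since arc length along the spiral would be more faithful.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 1000

# Swiss roll built by hand: t is the position along the rolled-up sheet
t = 1.5 * np.pi * (1 + 2 * rng.random(m))
x1 = t * np.cos(t)
x2 = 21 * rng.random(m)
x3 = t * np.sin(t)
X = np.column_stack([x1, x2, x3])

squashed = X[:, :2]                     # projection: simply drop x3
unrolled = np.column_stack([t, x2])     # unrolling: use the position along the sheet

# Among pairs that are far apart along the sheet, find the pair that
# ends up closest together once x3 is dropped.
d_squashed = np.linalg.norm(squashed[:, None] - squashed[None, :], axis=2)
far_along_roll = np.abs(t[:, None] - t[None, :]) > np.pi
masked = np.where(far_along_roll, d_squashed, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)

d_unrolled = np.linalg.norm(unrolled[i] - unrolled[j])
print("distance after dropping x3:", d_squashed[i, j])
print("distance after unrolling:  ", d_unrolled)
```

The two instances nearly coincide after dropping $x_3$ even though they sit far apart on the sheet, which is the squashing effect described above.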

##### Manifold Learning

The Swiss roll is an example of a 2D ***manifold***. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.

More generally, a **d-dimensional manifold is a part of an n-dimensional space (where d<n) that locally resembles a d-dimensional hyperplane.** In the case of the Swiss roll, d=2 and n=3: it locally resembles a 2D plane, but it is rolled in the third dimension.

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called **Manifold Learning**. It relies on:
- **The manifold assumption (manifold hypothesis)**, which states that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
- **An implicit assumption that the task at hand (classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold.** For example, in the top row of figure 8-6 the Swiss roll is split into two classes: in the 3D space (left) the decision boundary is fairly complex, but in the 2D unrolled manifold space (right) the decision boundary is a straight line.
    - However, this implicit assumption doesn't always hold. For instance, in the bottom row of figure 8-6, the decision boundary is located at $x_1 = 5$. This decision boundary looks very simple in the original 3D space, but it looks more complex in the unrolled manifold (a collection of four independent line segments that divide the green from the yellow).

In short, reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution: **it depends on the dataset.**

## INSERT FIGURE 8-6
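The four line segments mentioned for the $x_1 = 5$ boundary can be reproduced with a small check. This is a sketch under assumptions: the position along the sheet `t` is taken to range from $1.5\pi$ to $4.5\pi$ with $x_1 = t\cos t$ (a common Swiss roll parametrization), and the threshold `t > 3 * np.pi` is a hypothetical stand-in for a boundary that is simple on the manifold, not the one used in the figure.

```python
import numpy as np

# Position along the rolled-up sheet (assumed range), and the matching x1 coordinate
t = np.linspace(1.5 * np.pi, 4.5 * np.pi, 10_000)
x1 = t * np.cos(t)

def count_class_changes(labels):
    """Number of times the class label flips as we move along the unrolled axis."""
    return int(np.count_nonzero(labels[1:] != labels[:-1]))

# A boundary that is simple in 3D (x1 > 5) versus one that is simple on the manifold
changes_x1 = count_class_changes(x1 > 5)
changes_t = count_class_changes(t > 3 * np.pi)    # hypothetical threshold

print("x1 > 5    ->", changes_x1, "boundaries in the unrolled space")
print("t > 3*pi  ->", changes_t, "boundary in the unrolled space")
```

With this parametrization the $x_1 > 5$ rule flips class four times along the unrolled axis, matching the four line segments, while the threshold on `t` flips only once.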