<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/203_svd-pca.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Singular Value Decomposition and Principal Component Analysis
___

In this notebook, we will look at two significant linear algebra concepts: singular value decomposition (SVD) and principal component analysis (PCA). SVD and PCA are essential in machine learning, especially when we want to extract crucial information from our data and/or reduce its dimensionality.

There is a decent amount of mathematical theory behind both concepts, but this should not scare you! While there will be a fair bit of maths in this notebook, it's okay if you need help understanding everything on your first read-through. As always, it's key to understand the intuition behind our tools and why and when to use them.

___
## Singular Value Decomposition (SVD)


### 🙀 🤯 Mathematical Formulation

For a real $m \times n$ matrix $\mathbf{A}$, the SVD is a factorization of $\mathbf{A}$ into three matrices:

$$\mathbf{A} = \mathbf{U \Sigma V}^\top,$$

where
+ $\mathbf{U}$ is an $m \times m$ [orthogonal matrix](https://en.wikipedia.org/wiki/Orthogonal_matrix) (i.e., $\mathbf{U}^\top \mathbf{U} = \mathbf{UU}^\top = \mathbf{I}$), 
+ $\mathbf{\Sigma}$ is an $m \times n$ (rectangular) [diagonal matrix](https://en.wikipedia.org/wiki/Diagonal_matrix) (i.e., only its diagonal values can be non-zero, everything outside of the diagonal is zero), and 
+ $\mathbf{V}$ is an $n \times n$ [orthogonal matrix](https://en.wikipedia.org/wiki/Orthogonal_matrix) (i.e., $\mathbf{V}^\top \mathbf{V} = \mathbf{VV}^\top = \mathbf{I}$).

Writing it out gives:

$$\begin{align*}
\underset{m\times n}{\overbrace{
\begin{bmatrix}
a_{11} & a_{12}  & \dots & a_{1n} \\
a_{21} & a_{22}  & \dots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2}  & \dots & a_{mn} \\
\end{bmatrix}}^{\mathbf{A}}} =
\underset{m\times m}{\overbrace{\begin{bmatrix}
u_{11}  & \dots  & u_{1m} \\
\vdots &  \ddots & \vdots \\
u_{m1} & \dots & u_{mm}
\end{bmatrix}}^{\mathbf{U}}} \
\underset{m\times n}{\overbrace{\begin{bmatrix}
\sigma_1 & 0  & \dots & 0 \\
0 & \sigma_2  & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots\\
0 & 0  & \dots & 0 \\
\end{bmatrix}}^{\mathbf{\Sigma}}} \
\underset{n\times n}{\overbrace{\begin{bmatrix}
v_{11}  & \dots  & v_{n1} \\
\vdots &  \ddots & \vdots \\
v_{1n} & \dots & v_{nn}
\end{bmatrix}}^{\mathbf{V}^\top}}
\end{align*}$$

Beware that $\mathbf{\Sigma}$ has at most $\min(m, n)$ non-zero values, all of them on its diagonal. If $\mathbf{A}$ is not a square matrix (i.e., if $m \neq n$), $\mathbf{\Sigma}$ also contains entire rows or columns of zeros.
___ 

Okay, let's reiterate with a bit less math: **SVD decomposes a matrix into three matrices.**
+ There is one orthogonal matrix which has as many rows and columns as $\mathbf{A}$ has **rows**.
+ There is one orthogonal matrix which has as many rows and columns as $\mathbf{A}$ has **columns**.
+ There is a matrix which is zero everywhere outside of the diagonal. This matrix has the same dimension as $\mathbf{A}$.


[Wikipedia proposes a very nice illustration of SVD](https://upload.wikimedia.org/wikipedia/commons/c/c8/Singular_value_decomposition_visualisation.svg).


We call the diagonal elements of $\mathbf{\Sigma}$ the **singular values**. They are non-negative and, by convention, sorted from largest to smallest. The columns of $\mathbf{U}$ and $\mathbf{V}$ are called the **left-singular** and **right-singular vectors**, respectively.

It is interesting to note that **the SVD always exists!** While many other matrix factorizations do not exist for every real matrix, the SVD does. This might be a lot to process, so let's look at SVD in Python.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from numpy.linalg import svd # Import svd directly to make our code shorter

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Set the random seed for replicability
np.random.seed(72)

# Define the number of rows and columns
m, n = 4, 12

# Create a random m x n matrix
A = np.random.rand(m, n) 

# Compute its singular value decomposition
# (note: numpy returns the transpose of V, so the variable V below holds V^T)
U, S, V = svd(A)

# Print the shape of the matrices
print(f"A has shape {A.shape}")
print(f"U has shape {U.shape}")
print(f"S has shape {S.shape}")
print(f"V has shape {V.shape}")

Looking at the shapes, we notice that `S` is a vector and not a diagonal matrix like we initially claimed. This is how many numerical libraries return the SVD: only the singular values are stored. Fortunately, it's very easy to get our $m \times n$ diagonal matrix from this.
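
As a minimal sketch of that step, one way to rebuild the rectangular matrix is to copy the singular values onto the diagonal of an $m \times n$ matrix of zeros. The names `A_demo`, `Sigma`, `U_full`, and `Vt_full` are ours and only serve this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 12
A_demo = rng.random((m, n))          # hypothetical example matrix

# Full SVD: U is m x m, Vt is n x n, S_vec holds the min(m, n) singular values
U_full, S_vec, Vt_full = np.linalg.svd(A_demo)

# Place the singular values on the diagonal of an m x n matrix of zeros
Sigma = np.zeros((m, n))
r = len(S_vec)
Sigma[:r, :r] = np.diag(S_vec)

# U and V^T are orthogonal, and the product recovers A_demo
u_is_orthogonal = np.allclose(U_full.T @ U_full, np.eye(m))
v_is_orthogonal = np.allclose(Vt_full @ Vt_full.T, np.eye(n))
reconstructs = np.allclose(U_full @ Sigma @ Vt_full, A_demo)
print(Sigma.shape, u_is_orthogonal, v_is_orthogonal, reconstructs)
```

Here the last 8 columns of `Sigma` are entirely zero, so the last 8 rows of `Vt_full` never contribute to the product.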

However, since $\mathbf{\Sigma}$ contains rows or columns that are entirely zero whenever $m \neq n$, we can also ignore them and drop the corresponding columns of $\mathbf{U}$ or rows of $\mathbf{V}^\top$. To do this, we simply pass the parameter `full_matrices=False` to our `svd` command.

In [None]:
# Compute the SVD in reduced-form
U, S, V = svd(A, full_matrices=False)

# Print the shape of the matrices
print(f"A has shape {A.shape}")
print(f"U has shape {U.shape}")
print(f"S has shape {S.shape}")
print(f"V has shape {V.shape}")

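Hence, by using this reduced-form SVD, we have dropped 8 rows of our $\mathbf{V}^\top$ matrix (note that numpy returns $\mathbf{V}^\top$ rather than $\mathbf{V}$, so the variable `V` already holds the transpose). Finally, we still need to transform our vector of singular values `S` into a diagonal matrix; this can easily be done using `np.diag`.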

So, if SVD really does work, we should be able to obtain `A` by multiplying `U`, `np.diag(S)`, and `V` together... let's find out!

In [None]:
A_reconstructed = U @ np.diag(S) @ V

In [None]:
# Make sure that A and A_reconstructed are the same 
# (up to some small numerical error)
np.all(np.abs(A - A_reconstructed) <= 1e-12)

Great, we got SVD to work! ... but what shall we make of it? Why is such a factorization even useful? Are we just doing linear algebra for fun? 🤨

Unfortunately, SVD is somewhat of an acquired taste. At first, it seems like plain linear algebra, but as your knowledge about machine learning grows, you will start seeing SVD in many places, even some you would not expect. SVD is a key component in signal processing, image processing, big data and many more domains!

### Image Compression

Let us look at SVD in the context of image processing, in particular image compression. As you already know, you can think of an image as a matrix: **A grayscale image with a height of $m$ pixels and a width of $n$ pixels can be represented by an $m \times n$ matrix, where each cell is the intensity of the pixel.** Different programming languages use different conventions. In Python, we will use *integer* values in $[0, 255]$ to represent the intensity: $0$ is completely black, and $255$ is completely white. Of course, if you have a colored image, e.g., using RGB (red-green-blue) colors, you can split this into three matrices: one for the red color intensity, one for the green color intensity, and one for the blue color intensity.
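
To make this representation concrete, here is a small sketch with a made-up 2 x 3 RGB image (the array `tiny_img` is hypothetical and not one of the course images). It shows how the three color matrices can be pulled out of a single array.

```python
import numpy as np

# A hypothetical 2 x 3 RGB image: shape is (height, width, 3), dtype uint8
tiny_img = np.zeros((2, 3, 3), dtype=np.uint8)
tiny_img[0, 0] = [255, 0, 0]       # top-left pixel: pure red
tiny_img[1, 2] = [255, 255, 255]   # bottom-right pixel: white

# One 2 x 3 intensity matrix per color
red, green, blue = (tiny_img[:, :, c] for c in range(3))
print(red)
print(green.shape, blue.dtype)
```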

A well-known package called Pillow allows us to work with images in Python. We can load them, display them, and transform them to numpy arrays (we can also do many more things, but that's what we will be doing for now).

In [None]:
# Pillow is a package that allows us to work with images
from PIL import Image

In [None]:
# Load an image using PIL
img = Image.open(f"{DATA_PATH}/images/gauss.jpg")

In [None]:
img.resize((300, 400)) # Show the image (resized to a smaller size)

In [None]:
# Transform the image to a numpy array
img_arr = np.asarray(img)
# Show the shape of the array
img_arr.shape

As expected, the dimensions of the image array in numpy refer to `(height, width, number of colors)`.

In [None]:
# Show the first 10 rows and 10 columns of the "red" layer of the above image
img_arr[:10, :10, 0]

In [None]:
# Convert to grayscale by taking the mean of all three color layers.
# The mean is a float, so we cast back to 8-bit integers in [0, 255],
# which is what Pillow expects for a grayscale image.
def img_to_grayscale(img):
    # Only transform to grayscale if it has multiple layers in the 3rd dimension
    if len(img.shape) > 2:
        return img.mean(axis=2).astype(np.uint8)
    return img

In [None]:
# Transform the image to grayscale
img_gray = img_to_grayscale(img_arr)

# Convert the array back to a Pillow image and display it again
Image.fromarray(img_gray).resize((300, 400))

Now that we better understand how to load an image, convert it to a numpy array, and convert a numpy array back to an image, all we have to do is add a step where we perform linear algebra on our numpy array. So let's get back to our SVD.

SVD is particularly helpful to **approximate a matrix using a lower rank** (recall, the [rank](https://en.wikipedia.org/wiki/Rank_(linear_algebra)) is the number of linearly independent columns in a matrix).
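
As a quick illustration of the rank, the sketch below builds a matrix whose columns are all multiples of one vector and checks its rank with `np.linalg.matrix_rank`. The vectors `u` and `v` are made up for the example.

```python
import numpy as np

# Outer product of two vectors: every column is a multiple of u, so the rank is 1
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, -1.0, 2.0])
rank_one = np.outer(u, v)                 # 3 x 4 matrix

# The rank equals the number of non-zero singular values
# (matrix_rank counts those above a small numerical tolerance)
print(np.linalg.matrix_rank(rank_one))    # 1
print(np.linalg.matrix_rank(np.eye(3)))   # 3
```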

Let's say that we want to approximate the image above by using a rank of $k$. In this case, we can use **truncated SVD**, which means that we will reconstruct our initial matrix by using only some parts of the $\mathbf{U}$, $\mathbf{\Sigma}$, and $\mathbf{V}^\top$ matrices. To be more precise, the truncated SVD of rank $k$ is:

$$\tilde{\mathbf{A}}_k = \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}^\top_k,$$

where:
+ $\mathbf{U}_k$ is the $m \times k$ matrix formed by the first $k$ columns of $\mathbf{U}$
+ $\mathbf{\Sigma}_k$ is the diagonal matrix of the first $k$ singular values, and
+ $\mathbf{V}_k^\top$ is the $k \times n$ matrix formed by the first $k$ rows of $\mathbf{V}^\top$



#### ➡️ ✏️ <font style="color: green">**Question 1**</font>

Write a function, called `truncated_svd(A, k)`, which takes as input a matrix `A` and a rank `k`. This function should

1. Perform `svd` using numpy on the matrix `A` (use `full_matrices=False`)
2. Return the matrices $\mathbf{U}_k$, $\mathbf{\Sigma}_k$ and $\mathbf{V}^\top_k$. This means you have to subset the matrices correctly. If you do not remember how to subset matrices, go back to notebook 01d (it is very close to subsetting in pandas!). You also want to directly turn the vector of singular values into a diagonal matrix, since `svd` gives you a vector of singular values instead of the diagonal matrix.

In [None]:
# Enter your code here
def truncated_svd(A, k):
    pass # ⬅️ ... replace this with your code

In [None]:
# Once your function seems good, run this code to check if it works
#check_truncated_svd(truncated_svd)

Let's recap. We claimed that, using truncated SVD, we can approximate our initial matrix by a lower-rank matrix, but how good will this approximation be? Let's look at a visual example. In the code below, $k$ is the rank of our approximation: the higher the $k$, the better the approximation will be.
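
Before looking at images, the claim can be checked numerically. The sketch below is an illustration on a random matrix (`A_demo` is a hypothetical stand-in for an image): it builds the approximation one rank-one term $\sigma_i \mathbf{u}_i \mathbf{v}_i^\top$ at a time and records the relative error after each step.

```python
import numpy as np

rng = np.random.default_rng(1)
A_demo = rng.random((30, 20))             # hypothetical stand-in for an image
U, S, Vt = np.linalg.svd(A_demo, full_matrices=False)

approx = np.zeros_like(A_demo)
errors = []
for i in range(len(S)):
    # Add the i-th rank-one term s_i * u_i * v_i^T
    approx += S[i] * np.outer(U[:, i], Vt[i, :])
    # Relative Frobenius-norm error of the rank-(i + 1) approximation
    errors.append(np.linalg.norm(A_demo - approx) / np.linalg.norm(A_demo))

print(np.round(errors[:5], 3))   # the error shrinks as k grows
print(errors[-1] < 1e-10)        # with all terms we recover A_demo
```

The error never increases when a term is added, and the largest singular values account for the biggest drops.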

We will also define a useful function below that allows us to rescale the values of our data in the range $[0, 255]$.

In [None]:
# Rescale an array such that its values are within [0, 255]
rescale = lambda x: (x - x.min()) / (x.max() - x.min()) * 255
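
To see what this helper does, here is a small sketch on made-up numbers. A truncated reconstruction can contain values below 0 or above 255, which is why we rescale before casting to 8-bit integers. The names `rescale_demo` and `values` are hypothetical; the formula is the same as in `rescale`.

```python
import numpy as np

# Same formula as the rescale helper, written as a named function
def rescale_demo(x):
    return (x - x.min()) / (x.max() - x.min()) * 255

# Hypothetical reconstructed intensities, some outside [0, 255]
values = np.array([-20.0, 0.0, 100.0, 300.0])
scaled = rescale_demo(values)
print(scaled)                      # smallest value maps to 0, largest to 255
print(scaled.astype(np.uint8))     # now within the range of 8-bit integers
```

Note that the formula divides by zero if all values in the array are equal, so it assumes the image is not a single flat color.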

#### ➡️ ✏️ <font style="color: green">**Question 2**</font>

Plot the singular values for each of the three color layers. What do they tell you?

#### ➡️ ✏️ <font style="color: green">**Question 3**</font>

Run the code below, change the value of `k`, e.g., try out `k = 2`, `k = 5`, `k = 10`, `k = 20`, `k = 30`, `k = 50`, `k = 100`. Of course, you can also select another image from the images folder!


What do you observe? How is what you see related to Question 2?

In [None]:
# The rank of our approximation
k = 1

# Select the image
img_path = f"{DATA_PATH}/images/gauss.jpg"
img = np.asarray(Image.open(img_path))

# Perform truncated SVD
U, S, V = truncated_svd(img_to_grayscale(img), k)

# Reconstruct the image and display it using Pillow
Image.fromarray(rescale(U @ S @ V).astype(np.uint8))

In [None]:
# Of course, SVD also works with colors, we just have to apply SVD to every
# one of the three matrices for red, green, and blue, let's write two functions
# to make things easier

# Recompose a matrix given the U, S, and V matrices
recompose = lambda U, S, V: rescale(U @ S @ V).astype(np.uint8)
# Do truncated SVD with rank k and recompose the matrix
svd_recompose = lambda A, k: recompose(*truncated_svd(A, k))

# The rank of our approximation
k = 1

# Select the image and pass it to a numpy array
img_path = f"{DATA_PATH}/images/ml_linalg.png"
img = np.asarray(Image.open(img_path))

# Perform truncated SVD and recomposition on each axis 
# 🙀 🤯 the min(3, img.shape[2]) is because sometimes, images have 4 
# "color" matrices: R, G, B, and transparency
# We want to ignore transparency for this example
img_approximated = np.dstack(
    [svd_recompose(img[:, :, i], k=k) for i in range(min(3, img.shape[2]))]
)

# Reconstruct the image and display it using Pillow
Image.fromarray(img_approximated)

### How to Choose the Rank $k$

In [None]:
# Compute the share of variance explained by each singular vector
# Select the image
img_path = f"{DATA_PATH}/images/gauss.jpg"
img = np.asarray(Image.open(img_path))
# Scale the image matrix before SVD
img_mat_scaled = (img - img.mean()) / img.std()
# Compute the SVD for one color (layer 0, i.e., red); the singular values
# are the same with or without full_matrices
U, S, V = svd(img_mat_scaled[:, :, 0], full_matrices=False)
var_explained = np.round(S**2 / np.sum(S**2), decimals=3)

# Variance explained by the top 50 singular vectors
print(var_explained[0:50])

# Plot the variance explained by the top 50 singular vectors
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(var_explained[0:50], color="dodgerblue")
ax.set_xlabel("Singular Vector", fontsize=16)
ax.set_ylabel("Variance Explained", fontsize=16)
fig.tight_layout()
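
One possible way to turn such a curve into a choice of $k$ is to accumulate the shares and pick the first $k$ that reaches a target. The sketch below is an assumption on our part rather than the notebook's official method: `smallest_k` and `S_demo` are hypothetical names, and the shares are left unrounded.

```python
import numpy as np

def smallest_k(singular_values, threshold):
    """Smallest k whose first k squared singular values reach `threshold` of the total."""
    share = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(share)
    # First index where the cumulative share is at least the threshold
    # (the tiny slack guards against floating-point round-off)
    return int(np.searchsorted(cumulative, threshold - 1e-12) + 1)

S_demo = np.array([10.0, 5.0, 2.0, 1.0])   # hypothetical singular values
print(smallest_k(S_demo, 0.95))            # 2
print(smallest_k(S_demo, 0.98))            # 3
```

With these made-up values, the squared singular values are 100, 25, 4, and 1, so the first two already account for 125/130 of the total.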

#### ➡️ ✏️ <font style="color: green">**Question 4**</font>

Can you now decide better what $k$ to take? Analyse your previous results and determine how many singular values to keep for each color so that the mean explained variance is 0.98. Reconstruct the image using that number $k$. What are your conclusions? What is the compression ratio we achieved?

In [None]:
# We provide the solution for one color only; compute the rest yourself
# np.sum(var_explained[:21])
k = 21
# Select the image
img_path = f"{DATA_PATH}/images/gauss.jpg"
img = np.asarray(Image.open(img_path))

# Perform truncated SVD
U, S, V = truncated_svd(img_to_grayscale(img), k)

# Reconstruct the image and display it using Pillow
Image.fromarray(rescale(U @ S @ V).astype(np.uint8))
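
One common way to quantify the compression is to compare the number of values we need to store. The sketch below assumes a simple storage model: the original image needs $m \cdot n$ numbers, while the truncated SVD needs $k(m + n + 1)$ numbers for $\mathbf{U}_k$, the $k$ singular values, and $\mathbf{V}_k^\top$. The function name and the image size are hypothetical, and differences in number format (8-bit integers versus floats) are ignored.

```python
def compression_ratio(m, n, k):
    """Ratio of original to stored values under the storage model described above."""
    original = m * n
    truncated = k * (m + n + 1)   # U_k: m*k, singular values: k, V_k^T: k*n
    return original / truncated

# Hypothetical 400 x 300 grayscale image approximated with k = 21
print(round(compression_ratio(400, 300, 21), 2))   # 8.15
```

A ratio below 1 means the truncated form takes more space than the original, which happens when $k$ is close to $\min(m, n)$.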

___
## Principal Component Analysis (PCA)

We have now learned a bit about SVD and we have seen how it can be used to approximate matrices. In this context, images are particularly insightful. As humans, visualizing how increasing the rank $k$ of our approximation yields a clearer image is easier to understand than simply looking at numbers in a matrix.

In a very non-formal way, using SVD, we find that we don't need all the information in the original image to get a clear picture. Indeed, if you play around with the $k$ in the examples above, you will not be able to see a big difference between an approximation with a relatively high $k$ and the original image.

Furthermore, when our chosen approximation rank $k$ is low, an increase of one unit in $k$ will yield a much bigger improvement than if it is already large. For instance, try going from $k=1$ to $k=2$ and then try going from $k=100$ to $k=101$. You will probably not notice any difference in the latter case.

Pushing this idea further, we can imagine that not all information is equal. When dealing with a massive data set, it could clearly be useful to somehow find a way to summarize the information it contains.

Karl Pearson proposed principal component analysis (PCA) in 1901 as a technique to **explain variance in the data**. Once again, PCA is relatively heavy on linear algebra but the main point for you is to get a good intuition as to why and when PCA can be useful.

Before diving into the idea of PCA, it is good to revise the concept of an [orthonormal basis](https://en.wikipedia.org/wiki/Orthonormal_basis). Without going too deep into the mathematics, an orthonormal basis is a set of unit vectors (i.e., vectors with length one) that are orthogonal to each other (i.e., their inner product is zero). The simplest orthonormal basis we can think of is defined by the basis vectors $[1, 0]$ and $[0, 1]$; it is also the standard basis in $\mathbb{R}^2$.
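
As a small sketch of this definition, we can store the basis vectors as the columns of a matrix $\mathbf{B}$ and check whether $\mathbf{B}^\top \mathbf{B} = \mathbf{I}$. The helper `is_orthonormal` and the example matrices are ours.

```python
import numpy as np

def is_orthonormal(basis):
    """Check that the columns of `basis` are unit length and mutually orthogonal."""
    return np.allclose(basis.T @ basis, np.eye(basis.shape[1]))

standard = np.array([[1.0, 0.0],
                     [0.0, 1.0]])

theta = np.pi / 6                      # rotate the standard basis by 30 degrees
rotated = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

skewed = np.array([[1.0, 1.0],
                   [0.0, 1.0]])        # second column is neither unit length nor orthogonal

print(is_orthonormal(standard), is_orthonormal(rotated), is_orthonormal(skewed))
# True True False
```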

The main goal behind PCA is to create an orthonormal basis with basis vectors such that:
1. the data projected along the first basis vector (PC1) has the **maximum variance**
2. the data projected along the second basis vector (PC2) has the maximum variance given that it is orthogonal to the first basis vector.
3. more generally, the data projected along the $n^\text{th}$ basis vector has the maximum variance given that it is orthogonal to all of the previous $n-1$ basis vectors.

The basis vectors defined by PCA are also called the **principal components**.
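
One common way to compute such a basis is to take the SVD of the centered data: the right-singular vectors then play the role of the principal components. The sketch below uses made-up data and numpy only; it is an illustration of the idea, not necessarily what any particular library does internally.

```python
import numpy as np

rng = np.random.default_rng(12)
x1 = 0.75 * rng.standard_normal(500)
x2 = 0.75 * x1 + 0.5 * rng.standard_normal(500)
X = np.column_stack((x1, x2))              # hypothetical 500 x 2 data set

X_centered = X - X.mean(axis=0)            # PCA works on centered data
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt                            # rows are PC1, PC2 (up to sign)

# Variance of the data projected on each principal component
projected_var = (X_centered @ components.T).var(axis=0, ddof=1)

# These match the eigenvalues of the covariance matrix, largest first
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]
print(projected_var, eigenvalues)
```

The first entry of `projected_var` is the largest variance any unit direction can achieve on this data, and the second is what remains along the orthogonal direction.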

___
### Visualizing PCA

Thinking about orthonormal bases might seem somewhat abstract and unnecessarily complicated. Luckily, when working in two dimensions, it is easy to visualize linear algebra ideas, so let's try making it clearer with some plots...

Consider that you have the following data points:

In [None]:
np.random.seed(12)
N = 500 # Number of data points
X1 = 0.75 * np.random.randn(N,1) # Generate N random X1 values
X2 = .75 * X1 + np.random.randn(N,1) * 5e-1 # Generate X2 values based on X1

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

# Add data points
ax.scatter(X1, X2, alpha=0.8)

# Add standard basis vectors
ax.arrow(0, 0, 0, 1, head_width=0.1, width=0.02, color="black", label="Standard basis vectors")
ax.arrow(0, 0, 1, 0, head_width=0.1, width=0.02, color="black")

# Beautify plot
ax.set_xlabel("X1")
ax.set_ylabel("X2")
ax.set_xlim((-3, 3))
ax.set_ylim((-3, 3))
ax.legend(loc="upper left")
ax.grid()

#### ➡️ ✏️<font color=green>**Question 5**</font>
If you had to draw a line in this plot such that the points along this line have the largest variance, what would this line look like?

In [None]:
from sklearn.decomposition import PCA

In [None]:
# Compute PCA with 2 components on the data points
pca = PCA(n_components=2)
pca.fit(np.hstack((X1, X2))) # Fit PCA on the data
c = pca.components_ # Extract components to make later code more concise

In [None]:
# 🙀 🤯 There is no need to focus too much on this code, the important part is 
# the resulting plot!

# Create the figure
fig, ax = plt.subplots(figsize=(10, 10))

# Add data points
ax.scatter(X1, X2, alpha=0.8)

# Add standard basis vectors
ax.arrow(0, 0, 0, 1, head_width=0.1, width=0.02, color="black", 
         label="Standard basis vectors")
ax.arrow(0, 0, 1, 0, head_width=0.1, width=0.02, color="black")

# Add PCA basis vectors
ax.arrow(0, 0, c[0][0], c[0][1], head_width=0.1, width=0.02, color="red", 
         label="PCA basis vectors")
ax.arrow(0, 0, c[1][0], c[1][1], head_width=0.1, width=0.02, color="red")

# Beautify plot
ax.set_xlabel("X1")
ax.set_ylabel("X2")
ax.set_xlim((-3, 3))
ax.set_ylim((-3, 3))
ax.legend(loc="upper left")
ax.grid()

Perhaps you answered the question above by saying that the line which explains the most variance should be the diagonal from the bottom left to the top right of the plot. If that is the case, you are right. The above plot shows us the basis vectors of the orthonormal basis defined by PCA in red. As we can see from the data points, the data varies the most along the first of those vectors.

Consider our randomly generated data and the basis vectors of the orthonormal basis of the PCA. We can transform our data to this new basis, as the animation below shows. Notice how, **after the change of basis, the first principal component (the one which explains the most variance) is on the horizontal axis, and the second principal component is on the vertical axis**. Of course, all of this generalizes to more than two dimensions, but it becomes more difficult to grasp the intuition visually!

![Animation of the PCA change of basis](https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data/images/pca.gif)

Notice how, as we transform the data space, the PCA basis vectors take the place of our standard basis vectors. This is a visualization of a **change of basis**.

⚠️ However, be careful with *reading too much into visualizations*. In particular, the above plot shows how the change of basis due to PCA is akin to a *rotation* of the data. This is not the whole story. Without getting into too many details, a change of basis is a matrix multiplication, and in general it can be any invertible linear transformation. For PCA, the matrix is orthogonal, so the change of basis is a rotation, possibly combined with a reflection, applied to the centered data. If you are not too sure of what a change of basis is or how to interpret a matrix multiplication, we highly recommend looking at 3Blue1Brown's amazing playlist: [The Essence of Linear Algebra](https://www.3blue1brown.com/topics/linear-algebra). It is without a doubt one of the most intuitive explanations of linear algebra basics.
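
For PCA specifically, the matrix of principal directions has orthonormal rows, so it preserves lengths, and the sign of its determinant tells us whether a reflection is involved. The sketch below checks this on made-up data, assuming the principal directions are obtained from the SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))  # hypothetical data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions (assumed SVD-based PCA)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt.T               # data expressed in the PCA basis

is_orthogonal = np.allclose(Vt @ Vt.T, np.eye(3))
determinant = np.linalg.det(Vt)          # +1: rotation, -1: includes a reflection
# Lengths of the centered data points are unchanged by the change of basis
lengths_kept = np.allclose(np.linalg.norm(scores, axis=1),
                           np.linalg.norm(X_centered, axis=1))
print(is_orthogonal, np.round(determinant, 6), lengths_kept)
```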

___
## Exploratory Data Analysis
Once again, we have to ask ourselves: **but why is this useful?** Given the basis vectors of the PCA, we can multiply single (centered) data points by these vectors, which gives us the **principal components of each data point**, i.e., its coordinates in the new basis.
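
As a minimal sketch of this multiplication, the code below computes the principal components of a single data point and then maps them back. The data and the names `point`, `scores`, `restored`, and `approx` are made up, and the basis vectors are again taken from the SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.standard_normal(300)
X = np.column_stack((x1, 0.75 * x1 + 0.5 * rng.standard_normal(300)))  # hypothetical data
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)   # rows: PC1, PC2

point = X[0]
scores = (point - mean) @ Vt.T          # principal components of this one point
restored = scores @ Vt + mean           # changing back to the original basis

# Keeping only PC1 gives a one-number summary and an approximate point
approx = scores[0] * Vt[0] + mean
print(scores, np.allclose(restored, point))
print(point, approx)
```

The distance between `point` and `approx` is exactly the size of the discarded second component, which is small when PC1 captures most of the variation.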

Do you remember when we learned about overfitting and we saw that we could add polynomials of our features to enrich our data and obtain a better fit? In a sense, we can also view principal components as potential new features, in fact, **the features which explain the most variance in the data!**

PCA is particularly useful in **exploratory data analysis**: it can give us insights on whether the **variation** in the data explains something. We might want to use it as a precursor for a more formal data analysis or to tease out some meaningful differences in the data. Let's have a look at how PCA can be useful using our old friend, the iris data set.

In [None]:
# Correlation plot helper. No need to focus on this code! This is only useful to make the correlation plots
# you see below. Focus on the plots, not the code to create them.
def corrplot(df, features, color_features, shape_features=None, figsize=(12, 12)):
    markers = ["o", "^", "s", "P", "*", "H", "X"]
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
    p = len(features)
    # Create the canvas
    fig, axs = plt.subplots(p, p, figsize=figsize)
    for i in range(p):
        for j in range(p):
            ax = axs[i, j]
            ax.xaxis.set_visible(False)
            ax.yaxis.set_visible(False)
            if i == j:
                ax.annotate(features[i], (0.5, 0.5), ha="center", va="center")    
            else:
                #ax.grid(linestyle="dashdot")
                for c,f in enumerate(df[color_features].unique()):
                    if shape_features:
                        for k,m in enumerate(df[shape_features].unique()):
                            dft = df[(df[color_features] == f) & (df[shape_features] == m)]
                            if i == 0 and j == 1:
                                ax.scatter(dft[features[i]], dft[features[j]], label=f"{f} ({m})",
                                           alpha=0.8, marker=markers[k], color=colors[c])
                            else:
                                ax.scatter(dft[features[i]], dft[features[j]], marker=markers[k], alpha=0.8, color=colors[c])
                    else:
                        dft = df[df[color_features] == f]
                        if i == 0 and j == 1:
                            ax.scatter(dft[features[i]], dft[features[j]], label=f, alpha=0.8)
                        else:
                            ax.scatter(dft[features[i]], dft[features[j]], alpha=0.8)
                if i == 0: #
                    if j % 2 == 1:
                        ax.xaxis.set_visible(True)
                        ax.xaxis.tick_top()
                elif i == p - 1:
                    if j % 2 == 0:
                        ax.xaxis.set_visible(True)
                ###
                if j == 0:
                    if i % 2 == 1:
                        ax.yaxis.set_visible(True)
                elif j == p - 1:
                    if i % 2 == 0:
                        ax.yaxis.set_visible(True)
                        ax.yaxis.tick_right()
    fig.legend()
    fig.suptitle("Correlation Plot")

In [None]:
import pandas as pd
iris = pd.read_csv(f"{DATA_PATH}/iris.csv")

In [None]:
# For PCA, we want to standardize our data, i.e., de-mean it and divide it by the standard deviation
standardize = lambda x: (x - x.mean()) / x.std()
features = iris.columns[:-1] # All sepal/petal length/width
iris[features] = standardize(iris[features])
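
To illustrate why we standardize first, the sketch below compares the first principal direction before and after standardization on two made-up, independent features measured in very different units. All names and numbers are hypothetical, and the principal direction is computed with numpy's SVD rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two hypothetical, independent features on very different scales
height_m = 1.7 + 0.1 * rng.standard_normal(1000)          # metres
weight_g = 70_000 + 10_000 * rng.standard_normal(1000)    # grams
X = np.column_stack((height_m, weight_g))

def first_component(data):
    """First principal direction via SVD of the centered data."""
    centered = data - data.mean(axis=0)
    return np.linalg.svd(centered, full_matrices=False)[2][0]

# Same formula as standardize (pandas' std also uses ddof=1 by default)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(np.round(np.abs(first_component(X)), 3))      # almost entirely the large-scale feature
print(np.round(np.abs(first_component(X_std)), 3))  # both features weighted equally
```

Without standardization, the choice of units alone decides what PC1 looks like; after standardization, both features enter on an equal footing.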

In [None]:
# Column names for our principal components
principal_components = [f"PC{i+1}" for i in range(len(features))]

# Run PCA with 4 principal components on our iris data
pca = PCA(n_components=4)
iris[principal_components] = pca.fit_transform(iris[features])

In [None]:
# Display a correlation plot of the (standardized) features
corrplot(iris, features, "species")

In [None]:
# Display a correlation plot of the principal components
corrplot(iris, principal_components, "species")

#### ➡️ ✏️<font color=green>**Question 6**</font>
Compare the principal component correlation plot with the one based on the *true* features. Discuss it with your classmates: 
+ What do you observe?
+ How do you feel about the first principal component **PC1** compared to the further principal components?

In [None]:
crabs = pd.read_csv(f"{DATA_PATH}/crabs.csv")
crabs.head(10)

In [None]:
features = ["FL", "RW", "CL", "CW", "BD"]
crabs[features] = standardize(crabs[features])

In [None]:
# Column names for our principal components
principal_components = [f"PC{i+1}" for i in range(len(features))]

# Run PCA with 5 principal components on our crabs data
pca = PCA(n_components=5)
crabs[principal_components] = pca.fit_transform(crabs[features])

In [None]:
# Display a correlation plot of the (standardized) features
corrplot(crabs, features, "species", "sex")

In [None]:
# Display a correlation plot of the principal components
corrplot(crabs, principal_components, "species", "sex")

#### ➡️ ✏️<font color=green>**Question 7**</font>
Compare the principal component correlation plot with the one based on the *true* features. Discuss it with your classmates: 
+ What do you observe?
+ What has changed between this plot and the one we did for the iris data set?
+ For which of the two data sets do you think PCA makes more sense?

#### ➡️ ✏️<font color=green>**Question 8**</font>
Use the same method for deciding the number of components as in the SVD image compression. Use the iris and crabs data and provide your analysis.
Run dimensionality reduction and reconstruction of the Gauss picture with PCA.
How do your results compare to the ones in the SVD part?