# Principal Component Analysis - Solutions

<hr style="clear:both">

**Author:** Sabri El Amrani

<hr style="clear:both">

In [None]:
# Function to align all tables to the left (useful for later on)

In [None]:
%%html
<style>
table {float:left}
</style>

### Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
sns.set()

import helpers

## 0. Introduction

*__Background:__
As a data scientist, you are tasked with analyzing a subset of leaf samples from an experiment conducted at INRA in Angers, France. The dataset includes reflectance measurements across wavelengths from 400 to 2450 nm for 18 leaf samples. Your objective is to apply dimensionality reduction and clustering techniques to uncover patterns and group the samples by species, which are unknown in this subset. Understanding these spectral profiles and identifying distinct species will provide insights into plant classification and the relationship between spectral data and species characteristics.*

<img src="images/leaf.jpg" style="width:700px"/>

[Source](https://www.flickr.com/photos/bob_81667/24688196150)

Here is a link to the [dataset](https://ecosis.org/package/angers-leaf-optical-properties-database--2003-). This notebook and the next take inspiration from a case study found in the book [Machine Learning for Engineers](https://link.springer.com/book/10.1007/978-3-030-70388-2).

## 1. Data loading & pre-processing

Let's start with some pre-processing.

In [None]:
# In Pandas, a data table is called a DataFrame (abbreviated to df)
df = pd.read_csv('data/angers-leaf-optical-properties.csv')

print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")
# Show 5 randomly sampled rows of the data
df.sample(5)

In [None]:
# As we only have 18 samples and we are doing unsupervised learning, we use all data for training
X_train, y_train, X_test, y_test, feature_names, label_map = helpers.preprocess_data(df=df, label="English Name", train_size=1.0, seed=44)

In [None]:
label_map

We have just revealed the 3 species from which our leaf samples are taken. In next week's notebook on clustering we will, however, forget about this for the sake of the exercise ;)

## 2. Data visualization

Let's now visualize the reflectance spectra of our training samples.

In [None]:
# Wavelength range from 400 to 2450
wavelength = np.arange(400, 2451)

# Define colors for each label
colors = ['blue', 'green', 'red']
label_colors = {label: colors[i] for i, label in enumerate(label_map.keys())}

# Plotting
plt.figure(figsize=(12, 6))

# Plot each row in X_train with consistent colors based on labels
for i in range(X_train.shape[0]):
    label = y_train[i]
    plt.plot(wavelength, X_train[i], color=label_colors[label], label=label_map[label])

# Adding legend (unique labels only)
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys())

# Adding axis labels
plt.xlabel('Wavelength (nm)')
plt.ylabel('Reflectance')

# Show the plot
plt.show()

The three species seem to have fairly similar behaviour, though some differences do appear. To facilitate next week's clustering task, let's try to reduce the dimensionality of our samples (currently 2051, which is a lot, especially for a dataset of only 18 samples!).

## 3. PCA

Let's now get into the topic of the day: dimensionality reduction using principal component analysis, a.k.a. PCA.

### Reminder of the algorithm:

1. __Standardization:__  Standardize (a.k.a. normalize) each feature of the dataset to have a mean of zero and a variance of one.

2. __Compute Covariance Matrix:__ Calculate the covariance matrix of the standardized data. The covariance matrix captures the variance and the relationship between different features in the dataset.

3. __Compute & Sort Eigenvalues and Eigenvectors:__ Compute the eigenvalues and eigenvectors of the covariance matrix. The eigenvalues represent the variance explained by each principal component, and the eigenvectors represent the direction of the principal components. Don't forget to sort the eigenvalues in descending order and arrange the eigenvectors accordingly, so that the principal components explaining the most variance come first.

4. __Transform Data:__ Transform the standardized data into the new PCA space by projecting it onto the eigenvectors. This results in a new feature matrix where the columns are the principal components.

5. __Compute Explained Variance:__ Calculate the explained variance ratio for each principal component by dividing each eigenvalue by the total sum of eigenvalues. This step helps in understanding the proportion of variance each principal component explains.

6. __Dimensionality Reduction:__ The algorithm outputs the transformed data (principal components), the explained variance ratios, and the eigenvectors (principal component directions). Using the explained variance, we can choose how many dimensions we keep to represent our data.

Let's implement the various steps one by one.

### 3.1 Standardization

In [None]:
# Compute the mean and standard deviation for each feature of the training set

### START CODE HERE ### (≈ 2 lines of code)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
### END CODE HERE ###


# Implement the normalize function
def normalize(X: np.ndarray, mean: np.ndarray, std: np.ndarray):
    """ Normalization of array using Z-score standardization
     Args:
        X: Dataset of shape (N, D)
        mean: Mean of shape (D, )
        std: Standard deviation of shape(D, )
    """
    ### START CODE HERE ###
    X_normalized = (X - mean) / std
    ### END CODE HERE ###
    return X_normalized

# Normalize the features of the training set using its own mean and std (there is no val or test set in this notebook)
X_train = normalize(X_train, mean, std)
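
A quick sanity check of Z-score standardization can be sketched on toy data. The array `X_toy` and the zero-variance guard `safe_std` below are illustrative assumptions, not part of the notebook; whether the leaf dataset contains constant features is not stated here.

```python
import numpy as np

# Toy data (assumption): 18 samples, 4 features, the last one constant
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(18, 4))
X_toy[:, 3] = 1.0

mean_toy = X_toy.mean(axis=0)
std_toy = X_toy.std(axis=0)

# Hypothetical guard: replace a zero std by 1 to avoid dividing by zero
safe_std = np.where(std_toy == 0, 1.0, std_toy)
X_toy_norm = (X_toy - mean_toy) / safe_std

# Each non-constant feature should now have mean ~0 and std ~1
print(X_toy_norm.mean(axis=0).round(6))
print(X_toy_norm.std(axis=0).round(6))
```

Without such a guard, a constant feature would produce `nan` values after division, which would then propagate into the covariance matrix.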

### 3.2 Compute Covariance Matrix

In [None]:
def compute_covariance_matrix(X_standardized):
    """
    Compute the covariance matrix of the standardized data.

    Parameters:
    X_standardized (np.ndarray): Standardized feature matrix.

    Returns:
    np.ndarray: Covariance matrix.
    """
    ### START CODE HERE ### 
    return np.cov(X_standardized, rowvar=False)
    ### END CODE HERE ### 

cov_matrix = compute_covariance_matrix(X_train)
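
To see what `np.cov(..., rowvar=False)` computes, here is a minimal sketch on toy data (the array `X_toy` is an assumption for illustration). It also shows a small subtlety: `X.std(axis=0)` divides by N by default, while `np.cov` divides by N - 1, so the diagonal of the covariance matrix of standardized data is N / (N - 1) rather than exactly 1. This constant factor does not change the eigenvectors or the explained variance ratios.

```python
import numpy as np

# Toy data (assumption): 18 samples, 5 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(18, 5))

# Standardize as in the notebook (population std, i.e. ddof=0)
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

n = X_std.shape[0]
cov_np = np.cov(X_std, rowvar=False)      # features are in columns
cov_manual = X_std.T @ X_std / (n - 1)    # valid because X_std is centered

print(np.allclose(cov_np, cov_manual))
print(np.diag(cov_np))  # every entry equals n / (n - 1)
```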

### 3.3 Compute & Sort Eigenvalues and Eigenvectors

In [None]:
def compute_eigenvalues_and_eigenvectors(cov_matrix):
    """
    Compute the eigenvalues and eigenvectors of the covariance matrix.

    Parameters:
    cov_matrix (np.ndarray): Covariance matrix.

    Returns:
    tuple: Sorted eigenvalues and corresponding eigenvectors.
    """
    ### START CODE HERE ### 
    # Compute the eigenvalues and eigenvectors
    # Hint: check np.linalg.eig
    eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
    ### END CODE HERE ### 

    # Keep only the real part: the covariance matrix is symmetric, so any imaginary part is numerical noise
    eigenvalues = np.real(eigenvalues)
    eigenvectors = np.real(eigenvectors)
    
    ### START CODE HERE ### 
    # Sort the eigenvalues and eigenvectors in descending order of the eigenvalues
    sorted_index = np.argsort(eigenvalues)[::-1]
    sorted_eigenvalues = eigenvalues[sorted_index]
    sorted_eigenvectors = eigenvectors[:, sorted_index]
    return sorted_eigenvalues, sorted_eigenvectors
    ### END CODE HERE ### 

eig_val, eig_vec = compute_eigenvalues_and_eigenvectors(cov_matrix)
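
As an aside, since a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it always returns real values, with eigenvalues in ascending order. The sketch below compares the two on toy data (`X_toy` and `cov_toy` are assumptions for illustration); it is not the approach used in this notebook.

```python
import numpy as np

# Toy covariance matrix (assumption): 18 samples, 5 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(18, 5))
cov_toy = np.cov(X_toy, rowvar=False)

# eigh: for symmetric matrices, real output, ascending eigenvalues
vals_h, vecs_h = np.linalg.eigh(cov_toy)
vals_h = vals_h[::-1]        # flip to descending order
vecs_h = vecs_h[:, ::-1]     # reorder the columns accordingly

# eig: general solver, no ordering guarantee
vals_g, vecs_g = np.linalg.eig(cov_toy)
order = np.argsort(np.real(vals_g))[::-1]
vals_g = np.real(vals_g)[order]

print(np.allclose(vals_h, vals_g))                       # same eigenvalues
print(np.allclose(vecs_h.T @ vecs_h, np.eye(5)))         # orthonormal eigenvectors
```

Eigenvectors are only defined up to sign, so the columns returned by the two functions may differ by a factor of -1.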

### 3.4 Transform data

Project the data onto the computed eigenvectors.

__Note:__ Though not relevant in this notebook, we can also project the validation and test sets into the PCA space, but using the eigenvectors computed with the training data.

In [None]:
def transform_data(X_standardized, eigenvectors):
    """
    Transform the data into the new PCA space.

    Parameters:
    X_standardized (np.ndarray): Standardized feature matrix.
    eigenvectors (np.ndarray): Eigenvectors of the covariance matrix.

    Returns:
    np.ndarray: Transformed feature matrix.
    """
    ### START CODE HERE ### 
    return np.dot(X_standardized, eigenvectors)
    ### END CODE HERE ### 

X_pca_train = transform_data(X_train, eig_vec)
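
The note about validation and test sets can be illustrated with a minimal sketch. The arrays `X_tr` and `X_new` are toy stand-ins (assumptions), since this notebook has no held-out data; the key point is that the mean, standard deviation and eigenvectors all come from the training data only.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(18, 5))    # toy training set (assumption)
X_new = rng.normal(size=(4, 5))    # hypothetical held-out samples

# Statistics and eigenvectors are computed on the training set only
mean_tr = X_tr.mean(axis=0)
std_tr = X_tr.std(axis=0)
X_tr_std = (X_tr - mean_tr) / std_tr

vals, vecs = np.linalg.eigh(np.cov(X_tr_std, rowvar=False))
vecs = vecs[:, np.argsort(vals)[::-1]]  # sort by descending eigenvalue

# Held-out data: standardize with training statistics, then project
X_new_std = (X_new - mean_tr) / std_tr
X_new_pca = X_new_std @ vecs

print(X_new_pca.shape)
```

Recomputing the statistics or the eigenvectors on the held-out data would place it in a different coordinate system from the training data.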

### 3.5 Compute Explained Variance

Reminder of the formula:

<img src="images/formula.png" style="width:500px"/>

[Source](https://towardsdatascience.com/principal-component-analysis-ac90b73f68f5)
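
In case the image does not render, the ratio described in step 5 of the reminder can be written as follows, where $\lambda_i$ denotes the $i$-th eigenvalue and $D$ the number of features (this notation is an assumption and may differ from the one used in the image):

$$
\text{explained variance ratio}_i = \frac{\lambda_i}{\sum_{j=1}^{D} \lambda_j}
$$

By construction, the ratios are non-negative (up to numerical noise) and sum to 1.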

In [None]:
def compute_explained_variance(eigenvalues):
    """
    Compute the explained variance ratio for each principal component.

    Parameters:
    eigenvalues (np.ndarray): Eigenvalues of the covariance matrix.

    Returns:
    np.ndarray: Explained variance ratios.
    """
    ### START CODE HERE ### 
    return eigenvalues / np.sum(eigenvalues)
    ### END CODE HERE ### 

explained_variance = compute_explained_variance(eig_val)

Now let's plot the explained variance.

In [None]:
helpers.plot_explained_variance(explained_variance)

__Question:__ How many dimensions should we keep to explain at least 90 % of the variance?

__Answer:__ 3, the smallest number of components for which the cumulative explained variance exceeds 0.9 on the plot.
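
Instead of reading the number off the plot, it could also be computed. The sketch below uses made-up ratios (the array `ratios` is an assumption, not the values obtained on the leaf dataset) and finds the smallest number of components whose cumulative explained variance reaches a threshold.

```python
import numpy as np

# Hypothetical explained variance ratios, sorted in descending order
ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
threshold = 0.9

cumulative = np.cumsum(ratios)

# argmax returns the index of the first True; add 1 to get a count
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_keep)  # 3 for these made-up values
```

In the notebook, the same two lines could presumably be applied to `explained_variance` to confirm the answer above.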

### 3.6 Dimensionality Reduction

Let's reconstruct the training data from the first 3 principal components only (i.e. transform it back to the original space).

In [None]:
def reconstruct_data(X_pca, eigenvectors, n_components, mean, std):
    """
    Reconstruct the data from the PCA-transformed space using a subset of principal components and denormalize it.

    Parameters:
    X_pca (np.ndarray): Transformed feature matrix with all principal components.
    eigenvectors (np.ndarray): Eigenvectors of the covariance matrix.
    n_components (int): Number of principal components to use for reconstruction.
    mean (np.ndarray): Mean used for standardizing the original data.
    std (np.ndarray): Standard deviation used for standardizing the original data.

    Returns:
    np.ndarray: Reconstructed and denormalized feature matrix in the original space.
    """
    ### START CODE HERE ### 
    # Keep only the first n_components of X_pca
    X_pca_reduced = X_pca[:, :n_components]
    
    # Select the first n_components eigenvectors
    selected_eigenvectors = eigenvectors[:, :n_components]
    
    # Reconstruct the original data
    X_reconstructed = np.dot(X_pca_reduced, selected_eigenvectors.T)
    
    # Denormalize the reconstructed data
    X_reconstructed = X_reconstructed * std + mean
    
    return X_reconstructed
    ### END CODE HERE ### 

# Example usage
n_components = 3  # Specify the number of principal components to use for reconstruction
X_reconstructed_train = reconstruct_data(X_pca_train, eig_vec, n_components, mean, std)
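
One way to quantify how much is lost is to measure the reconstruction error as a function of the number of components kept. The following is a minimal sketch on toy data with a low-rank structure (the arrays `latent`, `mixing` and `X_toy` are assumptions for illustration); it follows the same steps as `reconstruct_data`.

```python
import numpy as np

# Toy data (assumption): 18 samples, 6 features driven by 2 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(18, 2))
mixing = rng.normal(size=(2, 6))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(18, 6))

# Standardize, then compute sorted eigenvectors and the PCA projection
mean_toy = X_toy.mean(axis=0)
std_toy = X_toy.std(axis=0)
X_std = (X_toy - mean_toy) / std_toy

vals, vecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
vecs = vecs[:, np.argsort(vals)[::-1]]
X_pca = X_std @ vecs

# Mean squared reconstruction error for k = 1, ..., D components
errors = []
for k in range(1, X_toy.shape[1] + 1):
    X_rec = (X_pca[:, :k] @ vecs[:, :k].T) * std_toy + mean_toy
    errors.append(np.mean((X_toy - X_rec) ** 2))

for k, err in enumerate(errors, start=1):
    print(f"{k} components: MSE = {err:.6f}")
```

The error decreases as components are added and vanishes (up to floating-point precision) when all components are kept, since the full projection is just a rotation of the standardized data.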

We can now plot these reconstructed spectra. How do they compare with the original plots we made in Section 2?

In [None]:
# Wavelength range from 400 to 2450
wavelength = np.arange(400, 2451)

# Define colors for each label
colors = ['blue', 'green', 'red']
label_colors = {label: colors[i] for i, label in enumerate(label_map.keys())}

# Plotting
plt.figure(figsize=(12, 6))

# Plot each row in X_reconstructed_train with consistent colors based on labels
for i in range(X_reconstructed_train.shape[0]):
    label = y_train[i]
    plt.plot(wavelength, X_reconstructed_train[i], color=label_colors[label], label=label_map[label])

# Adding legend (unique labels only)
handles, labels = plt.gca().get_legend_handles_labels()
by_label = dict(zip(labels, handles))
plt.legend(by_label.values(), by_label.keys())

# Adding axis labels
plt.xlabel('Wavelength (nm)')
plt.ylabel('Reflectance')

# Show the plot
plt.show()

Pretty good, right?

__Note:__ Remember that we reduced the dimensionality of our dataset from 2051 to 3, which is a massive reduction!

## 4. Save projected data for later processing

For the sake of visualization and for later processing in the upcoming notebook on k-means, let's now keep our first two principal components and visualize them in the projected space.

In [None]:
helpers.plot_pca_scatter(X_pca_train, y_train, label_map)

__Preliminary question for next notebook:__ Do the data points form clusters we can visualize in this reduced PCA space? If so, how many?

__Answer:__ To be determined in the notebook on k-means!

In [None]:
with open('data/pca_preprocessed_angers_dataset.pkl', 'wb') as f:
    pickle.dump((X_pca_train, y_train, label_map), f)
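
For reference, here is a minimal sketch of how such a pickle file could be read back later. It uses small stand-in objects and a temporary file (the names `X_pca_demo`, `y_demo`, `label_map_demo` and the file name `pca_demo.pkl` are assumptions) so that it runs on its own; the structure of `label_map` as a dictionary from integer label to species name is also assumed.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins (assumptions) for X_pca_train, y_train and label_map
X_pca_demo = np.arange(12, dtype=float).reshape(3, 4)
y_demo = np.array([0, 1, 2])
label_map_demo = {0: "species A", 1: "species B", 2: "species C"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pca_demo.pkl")

    # Save the tuple, in the same way as the cell above
    with open(path, "wb") as f:
        pickle.dump((X_pca_demo, y_demo, label_map_demo), f)

    # Load it back, unpacking in the same order it was saved
    with open(path, "rb") as f:
        X_loaded, y_loaded, label_map_loaded = pickle.load(f)

# The saved array holds all components; slice to keep the first two
X_2d = X_loaded[:, :2]
print(X_2d.shape)
```

Note that `pickle` should only be used to load files from trusted sources, as loading a pickle can execute arbitrary code.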

That concludes this tutorial on PCA. Thank you for taking part! In the next notebook we will use k-means to determine how our projected data clusters, and whether these identified clusters match the underlying species.