# Practical 1: Dimensionality Reduction

**Course:** WBCS032-05 Introduction to Machine Learning  
**Student Names:**  Winand Metz, Matthias Nijman  
**Student Numbers:**  S6417221, S4667999

---

## Assignment Overview

In this assignment, you will implement Principal Component Analysis (PCA) to reduce the dimensionality of the data, as discussed in the lecture.

You will work with the `COIL20.mat` dataset on Themis. The dataset consists of 1440 images, where each image has a size of $32 \times 32$ pixels and is flattened into a vector of length 1024. All images are stored in one matrix of size $1440 \times 1024$, where each row represents one image and each column corresponds to a pixel. The images come from 20 different objects, and each object is recorded at 72 different rotation angles, with a rotation step of 5 degrees.
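
As a rough illustration of this layout, the sketch below uses a zero-filled placeholder array in place of the real data. The names `placeholder_data`, `first_image`, and `rotation_angles` are made up for this example. It assumes row-major flattening; if the images were flattened column by column, as is common for data exported from MATLAB, `order='F'` or a transpose would be needed instead.

```python
import numpy as np

n_objects, n_rotations = 20, 72
image_shape = (32, 32)

# Placeholder with the same shape as the dataset matrix (not the real data).
placeholder_data = np.zeros((n_objects * n_rotations, image_shape[0] * image_shape[1]))

# Each row is one flattened image; reshape it to recover the 2D image.
# Assumption: row-major flattening (use order='F' if the result looks transposed).
first_image = placeholder_data[0].reshape(image_shape)

# Rotation angles in degrees for the 72 views of a single object.
rotation_angles = 5 * np.arange(n_rotations)

print(placeholder_data.shape)  # (1440, 1024)
print(first_image.shape)       # (32, 32)
print(rotation_angles[-1])     # 355
```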

## 1. Introduction (1 point)

Describe the goal of this assignment and briefly explain why dimensionality reduction and PCA are useful in data analysis.

**Your answer here:**

## 2. Methods (3 points)

### 2.1 Explain the PCA Algorithm (0.5 points)
Explain the PCA algorithm in a general manner.

**Your answer here:**  
PCA (Principal Component Analysis) is a linear feature-extraction method for unsupervised machine learning.
The goal is to find a new set of dimensions that are linear combinations of the original dimensions.
This is done by mapping data points from a high-dimensional space to a low-dimensional space while minimizing information loss, that is, while retaining as much of the variance as possible.
The steps of PCA are:
1. Standardize the data set.
2. Compute the principal components:
    - Compute the covariance matrix.
    - Compute its eigenvalues and eigenvectors.
3. Choose the reduced dimensionality *d*.
4. Keep only the first *d* eigenvectors, sorted by descending eigenvalue. These eigenvectors are the principal components, and projecting the standardized data onto them gives the reduced data.  
In general, the i-th eigenvector of the covariance matrix is associated with the i-th eigenvalue, which equals the variance of the data along that eigenvector.
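
The steps above can be sketched in NumPy as follows. This is a minimal illustration on random toy data, not the graded implementation: the function name `pca_sketch` is hypothetical, `np.linalg.eigh` is used here because the covariance matrix is symmetric, and the sketch assumes that no feature has zero standard deviation.

```python
import numpy as np

def pca_sketch(x, d):
    """Project x (samples in rows, features in columns) onto d components."""
    # Step 1: standardize each feature (assumes no zero-variance columns).
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Step 2: covariance matrix of the standardized data and its eigenpairs.
    cov = np.cov(z, rowvar=False)
    eigen_values, eigen_vectors = np.linalg.eigh(cov)  # ascending order
    # Sort the eigenpairs by descending eigenvalue.
    order = np.argsort(eigen_values)[::-1]
    eigen_values = eigen_values[order]
    eigen_vectors = eigen_vectors[:, order]
    # Steps 3 and 4: keep the first d eigenvectors and project onto them.
    u_d = eigen_vectors[:, :d]
    return u_d, eigen_values, z @ u_d

rng = np.random.default_rng(0)
toy_data = rng.normal(size=(100, 5))  # made-up data: 100 samples, 5 features
u_d, eigen_values, z_d = pca_sketch(toy_data, d=2)
print(u_d.shape, eigen_values.shape, z_d.shape)  # (5, 2) (5,) (100, 2)
```

In this sketch the variance of each column of `z_d` matches the corresponding eigenvalue, which is a quick sanity check for the sorting and projection steps.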


### 2.2 Implementation (2.5 points)

You need to implement the PCA algorithm **yourself**. Both the code quality and correctness will be graded.

*__Note:__* **Do not change the cell labels! Themis will use them to automatically grade your submission.**

In [None]:
# Load required libraries

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
from sklearn.manifold import TSNE

# Data configuration
data_file_path = 'COIL20.mat'
image_shape = (32, 32)

# PCA parameters
d = 40  # Target dimensionality

# t-SNE parameters
tsne_perplexity = 4

#### PCA Algorithm Steps

Implement the following steps:

1. **Normalize the data:**
   $$Z = \frac{X - \mu}{\sigma}$$
   where $\mu$ is the per-feature (column-wise) mean over all samples and $\sigma$ is the corresponding per-feature standard deviation.

2. **Compute the covariance matrix of the normalized data** and obtain its eigenvalues $D$ and eigenvectors $U$.  
   You may use `np.linalg.eig` in Python.

3. **Sort the eigenvectors in descending order of their eigenvalues** and select the first $d$ principal components to form $U_d$.

4. **Reduce the dimensionality of the data** by projecting the normalized data onto the selected principal components.
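
Written out as equations, steps 2 to 4 might look as follows. The symbols $N$ (number of samples) and $C$ (covariance matrix) are introduced here for illustration, and the $\frac{1}{N-1}$ factor is an assumption that matches the default of `np.cov`; a $\frac{1}{N}$ factor only rescales the eigenvalues and leaves the eigenvectors unchanged.

$$C = \frac{1}{N-1} Z^\top Z, \qquad C\,U = U\,D, \qquad Z_d = Z\,U_d$$

Here $D$ is the diagonal matrix of eigenvalues, the columns of $U$ are the corresponding eigenvectors, and $U_d$ holds the first $d$ columns of $U$ after sorting by descending eigenvalue.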

In [2]:
# Implement PCA here
def PCA(x, d):
    """
    Apply Principal Component Analysis.

    Args:
        x (np.ndarray): Dataset matrix (each column represents a variable)
        d (int): Dimensionality of the projection

    Returns:
        tuple: (U_d, eigen_values, Z_d)
            - U_d (np.ndarray): Matrix of principal components, sorted descending
            - eigen_values (np.ndarray): Eigenvalues, sorted descending
            - Z_d (np.ndarray): Reduced version of the dataset
    """
    pass  # TODO: Implement PCA

In [3]:
# Extract dataset 

In [None]:
# Apply PCA to the dataset
U_d, eigen_values, Z_d = PCA(...)  # TODO: Pass the correct parameters

## 3. Experimental Results (4 points)

*__Note:__* This section is not graded by Themis.

### 3.1 Eigenvalue Profile

Write code in the cell below to plot the eigenvalue profile of the data. This plot helps determine how many principal components to retain for dimensionality reduction. Make sure that all plots are clearly labeled. Each figure must include labeled axes, an appropriate title, and a legend where applicable.

- **X-axis:** Eigenvalue indices $(1, 2, \ldots, 1024)$
- **Y-axis:** Eigenvalue magnitude
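
A plot with these axes could be produced along the following lines. The sketch uses made-up, exponentially decaying values in `example_eigen_values` instead of the real eigenvalues, and the logarithmic y-axis is an optional choice, not something the assignment prescribes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up eigenvalues for illustration only (already in descending order).
example_eigen_values = 100.0 * np.exp(-0.01 * np.arange(1024))
indices = np.arange(1, example_eigen_values.size + 1)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(indices, example_eigen_values, label="Eigenvalue")
ax.set_yscale("log")  # optional: makes the small trailing eigenvalues visible
ax.set_xlabel("Eigenvalue index")
ax.set_ylabel("Eigenvalue magnitude")
ax.set_title("Eigenvalue profile (illustrative values)")
ax.legend()
fig.tight_layout()
```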


In [None]:
# TODO: Plot the eigenvalue profile

### 3.2 Dimensionality Table

Create a table reporting the dimensionality $d$ required to retain a fraction of 0.90, 0.95, and 0.98 of the total variance. Write code in the cell below to compute these values, then fill in the table.

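Use the fraction of the total variance retained by the first $d$ components, and pick the smallest $d$ for which it reaches the threshold:
$$\frac{\sum_{i=1}^{d}\lambda_i}{\sum_{i=1}^{n}\lambda_i} \geq \text{threshold}$$
where $\lambda_i$ are the eigenvalues sorted in descending order and $n$ is the total number of eigenvalues.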
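
One way to evaluate this ratio for every $d$ at once is a cumulative sum. The sketch below uses a small made-up eigenvalue array, and the helper name `smallest_d_for` is hypothetical. It returns the smallest $d$ whose retained-variance fraction reaches a given threshold, which is one reasonable reading of the formula.

```python
import numpy as np

def smallest_d_for(eigen_values, threshold):
    """Smallest d whose first d eigenvalues hold at least `threshold` of the total."""
    sorted_values = np.sort(eigen_values)[::-1]  # descending order
    retained = np.cumsum(sorted_values) / sorted_values.sum()
    # Index of the first retained fraction >= threshold, converted to a count.
    return int(np.searchsorted(retained, threshold) + 1)

example_eigen_values = np.array([5.0, 2.5, 1.7, 0.5, 0.3])  # made-up values
for threshold in (0.90, 0.95, 0.98):
    print(threshold, smallest_d_for(example_eigen_values, threshold))
```

For these example values the retained fractions are roughly 0.50, 0.75, 0.92, 0.97, and 1.00, so the three thresholds map to $d = 3$, $4$, and $5$.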

In [None]:
# TODO: Calculate d for variance thresholds: 0.9, 0.95, 0.98

| Variance Fraction | Dimensionality (d) |
|-------------------|-------------------|
| 0.90              |                   |
| 0.95              |                   |
| 0.98              |                   |

### 3.3 t-SNE Visualization

Visualize the reduced data using t-SNE in a 2-dimensional feature space.

- Use different colors for data points from different objects
- Every 72 data examples correspond to one object
- You can use `sklearn.manifold.TSNE` in Python with the configured perplexity parameter
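
A minimal sketch of such a visualization is shown below, run on a small random stand-in for the reduced data so that it finishes quickly. The sizes (`n_objects = 5`, `samples_per_object = 12`), the random data, and the `tab20` colormap are assumptions made for this example. With the real data, the labels would be built from 20 objects with 72 samples each, assuming the rows are ordered by object.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

n_objects, samples_per_object = 5, 12  # small stand-in sizes for a quick run
rng = np.random.default_rng(0)

# Stand-in for the PCA-reduced data: one shifted random cluster per object.
reduced = np.vstack([
    rng.normal(loc=3.0 * k, size=(samples_per_object, 10))
    for k in range(n_objects)
])
# Consecutive blocks of samples share one object label: 0, 0, ..., 1, 1, ...
labels = np.repeat(np.arange(n_objects), samples_per_object)

embedding = TSNE(n_components=2, perplexity=4, random_state=0).fit_transform(reduced)

fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab20", s=15)
ax.set_xlabel("t-SNE dimension 1")
ax.set_ylabel("t-SNE dimension 2")
ax.set_title("t-SNE of stand-in data (colored by object)")
fig.colorbar(scatter, ax=ax, label="Object index")
```

Fixing `random_state` keeps the layout reproducible between runs, which makes it easier to compare plots while tuning the perplexity.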

In [None]:
# TODO: Visualize reduced data using t-SNE in 2D
# Hint: Use different colors for each of the 20 objects

## 4. Discussion (2 points)

Discuss your observations on the obtained results:
- What does the eigenvalue profile tell you about the data?
- How well does PCA reduce the dimensionality while preserving variance?
- What do you observe in the t-SNE visualization? Are the objects well-separated?

**Your answer here:**

### A note on Themis
Themis grades only your implementation of the PCA algorithm, so it awards at most `2.5` points. It does so by executing every cell up to and including the PCA call. Make sure your code runs without errors and produces the expected outputs before submitting.

## Contribution

State your individual contribution.

**Your answer here:**