# Week5 lab1: Dimensionality reduction

**This lab**: the aim is to use PCA dimensionality reduction for data compression, data visualisation, and feature selection. While working through this lab, it is recommended to consult the scikit-learn documentation on PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

You'll need `olivettifaces.mat` and `titanic.csv` files to work on this lab.

## Exploring data in PCA plots: "titanic" dataset

Install needed packages

In [1]:
# your code here


Load the Titanic dataset (titanic.csv) using Pandas' read_csv function. From this data, extract a binary (0-1) vector y of labels/outcomes indicating whether each passenger survived. Create a feature matrix X whose columns are the Pclass, Sex, Age, Siblings_SpousesAboard, Parents_ChildrenAboard, and Fare features. You may need to convert the Sex variable (currently a string) into a 0-1 numerical variable to include it in the feature matrix.

In [2]:
# import pandas as pd
# import numpy as np
# from sklearn.decomposition import PCA

# your code here
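
As a starting point, here is a minimal sketch of building X and y. It uses a tiny inline table instead of titanic.csv so that it runs on its own. The label column name `Survived` and the `male`/`female` values of `Sex` are assumptions about the file, so check `df.columns` on the real data.

```python
import pandas as pd

# Tiny stand-in for pd.read_csv("titanic.csv"); names and values are illustrative only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Siblings_SpousesAboard": [1, 1, 0, 0],
    "Parents_ChildrenAboard": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Map the string-valued Sex column to 0/1 (which value becomes 1 is arbitrary).
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

feature_cols = ["Pclass", "Sex", "Age", "Siblings_SpousesAboard",
                "Parents_ChildrenAboard", "Fare"]
X = df[feature_cols]            # feature matrix, one row per passenger
y = df["Survived"].to_numpy()   # binary label vector

print(X.shape, y)
```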

Use scikit-learn's PCA class from the sklearn.decomposition module to visualize the feature data: project the features onto the first two principal components and scatter plot the result.

In [6]:
# import matplotlib.pyplot as plt
# Perform PCA
# your code here 
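
One possible approach is sketched below on a synthetic stand-in for the feature matrix. Standardising the features before PCA is an assumption on our part, not a requirement of the lab: features such as Fare and Sex live on very different scales, and without scaling the largest-scale feature tends to dominate the first PC.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the 6-column feature matrix, with very different column scales.
X_demo = rng.normal(size=(100, 6)) * np.array([1, 0.5, 14, 1, 1, 50])

X_scaled = StandardScaler().fit_transform(X_demo)  # zero mean, unit variance per column
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)                # columns are the PC1 and PC2 scores

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

The two columns of `X_pca` can then be passed to Matplotlib's scatter function, for example `plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)`.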

To detect anomalies, follow these steps:

First, compute the median of each feature in the dataset X using the median() method of a Pandas DataFrame. The median is the middle value of a feature, separating the lower 50% of values from the upper 50% of values.

Next, compute a threshold that represents the maximum distance an observation can be from the median before it is considered an anomaly. Set the threshold to 3 times the standard deviation of each feature, using the std() method of a Pandas DataFrame. The standard deviation measures the amount of variation or dispersion in the data.

Finally, identify the anomalies by checking which observations differ from the median by more than the threshold in absolute value. Use NumPy's where() function to return the indices of the anomalies in the dataset. Print out these anomalies.

In [7]:
# your code here 
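
The steps above can be sketched as follows, on a synthetic two-column DataFrame with one planted outlier. Flagging a row as anomalous when any of its features exceeds the threshold is an assumption; the lab text does not say how to combine features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for X, with one planted outlier in row 7.
X = pd.DataFrame({"Age": rng.normal(30, 5, 200), "Fare": rng.normal(20, 4, 200)})
X.loc[7, "Fare"] = 500.0

median = X.median()       # per-feature median
threshold = 3 * X.std()   # per-feature threshold: 3 standard deviations

# True where a value lies further than the threshold from its feature's median.
is_far = (X - median).abs().gt(threshold, axis="columns")

# Row positions with at least one out-of-range feature (assumed combination rule).
anomaly_idx = np.where(is_far.any(axis=1))[0]
print(X.iloc[anomaly_idx])
```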


Using the PCA-transformed data, plot the non-anomalous data points in blue and the anomalies in red to see where they fall in the PCA plot. You can use Matplotlib's scatter function to create the plot. The x-axis and y-axis should correspond to the first two principal components. Label the plot so that the anomalous data points can be distinguished.

In [8]:
# your code here 
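
A minimal plotting sketch, assuming the PCA scores and the anomaly indices are already available. Here both are synthetic stand-ins, and the names `X_pca` and `anomaly_idx` are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X_pca = rng.normal(size=(100, 2))     # stand-in for the PC1/PC2 scores
anomaly_idx = np.array([3, 42, 77])   # stand-in for the detected anomaly rows

# Boolean mask so that normal points and anomalies can be plotted separately.
is_anomaly = np.zeros(len(X_pca), dtype=bool)
is_anomaly[anomaly_idx] = True

fig, ax = plt.subplots()
ax.scatter(X_pca[~is_anomaly, 0], X_pca[~is_anomaly, 1], c="blue", label="normal")
ax.scatter(X_pca[is_anomaly, 0], X_pca[is_anomaly, 1], c="red", label="anomaly")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```

In a notebook the figure is displayed at the end of the cell; in a plain script, call `plt.show()` afterwards.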

## Dimensionality reduction: compression/reconstruction

Gaussian distribution data: Create a 2D feature matrix (dataset) containing 500 samples from a Gaussian distribution centred at [0,0] with a random covariance matrix (which must be symmetric and positive semi-definite). Scatter plot this data. Since the covariance matrix is generally not diagonal, you should be able to see that the data has correlated features.

In [9]:
# your code here 
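
One way to obtain a valid random covariance matrix is to multiply a random matrix by its own transpose, which is symmetric positive semi-definite by construction. This is a sketch of that idea, not the only option:

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.normal(size=(2, 2))
cov = A @ A.T   # symmetric positive semi-definite, generally not diagonal

# 500 samples from a 2D Gaussian centred at the origin.
X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)

print(X.shape)
print(np.cov(X, rowvar=False))   # sample covariance, should be close to cov
```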


Use the PCA class to compute the principal components (PCs) of this data. Plot the PC axes over the 2D dataset's scatter plot. Verify that the two components are orthogonal to each other, and that the first PC indeed points in the direction along which the data varies the most (i.e., has the largest variance).

In [11]:
# Perform PCA on the data
# pca = PCA(n_components=2)

# your code here 
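
The two checks can be sketched numerically as below. The fixed covariance matrix is just an example so that the snippet runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

pca = PCA(n_components=2).fit(X)
pc1, pc2 = pca.components_   # each row is a unit-length PC direction

# Orthogonality: the dot product should be numerically zero.
print("pc1 . pc2 =", pc1 @ pc2)

# Variance of the centred data projected on each PC; the first should be the largest.
proj_var = ((X - pca.mean_) @ pca.components_.T).var(axis=0, ddof=1)
print(proj_var, pca.explained_variance_)
```

To draw the axes on the scatter plot, one option is to draw arrows starting at `pca.mean_` along each row of `pca.components_`, scaled by the square root of the matching entry of `pca.explained_variance_`.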

Using PCA for dimensionality reduction involves zeroing out the projections along one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance. Use the appropriate output of the fitted PCA object to get the data's projections along each of the PC axes. Keep only the projections along the first (most dominant) PC; this is the 1D dimension-reduced version of the original 2D data.

In [12]:
# Transform the data to get the projections along the PC axes
# your code here 

# Keep only the projections along the first (most dominant) PC
# your code here 

To understand the effect of this dimensionality reduction, we can perform the inverse transform of the reduced data and plot it along with the original data. Follow the lecture notes to approximately reconstruct the 2D data from this 1D projection. This should involve appropriate outputs of the fitted PCA object, e.g., the mean/centre of the dataset, the first PC vector, and the projected data in the coordinate of the first PC (i.e., the scores).

In [13]:

# Perform the inverse transform to reconstruct the 2D data
# your code here 

# Plot the original data and the reconstructed data
# your code here 
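
A sketch of the projection and reconstruction steps, assuming the usual formula "reconstruction = mean + score × first PC direction". The result is cross-checked against scikit-learn's `inverse_transform` on a one-component PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)   # projections on PC1 and PC2
z1 = scores[:, [0]]         # keep PC1 only, shape (500, 1)

# Reconstruction: dataset mean plus score times the first PC direction.
X_rec = pca.mean_ + z1 @ pca.components_[[0], :]

# Cross-check with a 1-component PCA and its inverse transform.
pca1 = PCA(n_components=1).fit(X)
X_rec_check = pca1.inverse_transform(pca1.transform(X))
print(np.allclose(X_rec, X_rec_check, atol=1e-6))
```

When plotted over the original data, the reconstructed points all lie on a straight line through the mean along the first PC.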



## Visualisation: PCA vs. tSNE

Load the Fisher Iris dataset using scikit-learn's load_iris function. Use scikit-learn's PCA class to reduce the feature dimension of this data from 4 to 2. 

Visualize the dimension-reduced 2D dataset using Matplotlib's scatter function, where data points are colored according to their class (species) labels.

In [14]:
# import matplotlib.pyplot as plt
# from sklearn.datasets import load_iris
# from sklearn.decomposition import PCA

# your code here 
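
A minimal sketch of this task. Plotting one scatter per species, rather than a single call with a colour array, is simply a convenient way to get a legend.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(iris.data)   # (150, 4) -> (150, 2)

fig, ax = plt.subplots()
for label, name in enumerate(iris.target_names):
    mask = iris.target == label   # rows belonging to this species
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], label=name)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```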

Repeat the previous task using scikit-learn's TSNE class to reduce the dimension of data to 2D using t-distributed Stochastic Neighbour Embedding (t-SNE), and then visualize this dimension-reduced data using Matplotlib's scatter function. 

Compare the resulting plots from PCA and t-SNE to determine which method gives a clearer insight about the underlying data clusters.

In [16]:
# from sklearn.manifold import TSNE
# reduce the dimension of data from 4 to 2 using tSNE
# your code here 
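
A sketch of the t-SNE embedding step. The `perplexity` and `random_state` values are illustrative choices, and the embedding can change noticeably with either of them.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()

# t-SNE is stochastic; fixing random_state makes the embedding repeatable.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(iris.data)   # (150, 4) -> (150, 2)

print(X_tsne.shape)
```

Unlike PCA, TSNE only offers `fit_transform` and cannot project new points, and its axes have no direct interpretation. Only the grouping of points is meaningful when comparing the two plots.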


Download the olivettifaces.mat dataset from https://cs.nyu.edu/~roweis/data.html. The dataset contains 400 grayscale face images (number of samples n=400), a few images each of several different people, with pixel values in the range [0, 255]. Each image is 64x64 pixels (feature size d=4096). Load this data using the loadmat function from the scipy.io module, i.e., loadmat('olivettifaces.mat').

Once loaded, access the face data using the 'faces' key in the loaded dictionary. Check the shape of the face data to confirm that it is (4096, 400).

Visualize one of the faces from the dataset using Matplotlib's imshow function. Choose a face index from 0 to 399 and reshape the corresponding 4096-dimensional feature vector (one column of the face data) into a 64x64 image. Plot the image using the cmap='gray' argument to display it in grayscale.

In [17]:
# import numpy as np
# import matplotlib.pyplot as plt
# from scipy.io import loadmat

# Load the olivettifaces.mat dataset
# data = loadmat('olivettifaces.mat')

# Access the face data
# faces = data['faces']

# Check the shape of the face data
# your code here 

# Example of visualizing one of the faces
# your code here 
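
The reshape step is sketched below on synthetic data. It assumes each image was flattened column by column, as MATLAB does by default; whether that holds for the real file is an assumption worth checking. If a face appears transposed with a plain `reshape(64, 64)`, this is the likely cause.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in images, flattened column-major into a 4096 x 400 matrix
# with one image per column, mimicking the assumed layout of data['faces'].
images = rng.integers(0, 256, size=(400, 64, 64))
faces = np.stack([img.flatten(order="F") for img in images], axis=1)
print(faces.shape)

i = 10
# order="F" undoes the column-major flattening; equivalent to reshape(64, 64).T
face = faces[:, i].reshape(64, 64, order="F")
# plt.imshow(face, cmap="gray") would then display the image.
```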


Visualise all faces in an image grid (montage). Use NumPy functions to reshape and rearrange the face images into a grid, then display the montage using Matplotlib.


In [18]:
# your code here 
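
One way to build the montage with reshape and transpose only is sketched here, on synthetic data and with an assumed 20x20 grid:

```python
import numpy as np

rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(4096, 400))   # synthetic stand-in, one image per column

rows, cols, h, w = 20, 20, 64, 64                # 20 x 20 grid of 64 x 64 tiles
imgs = faces.T.reshape(rows, cols, h, w)         # axes: (grid row, grid col, y, x)

# Move the pixel-row axis next to the grid-row axis, then merge the axes pairwise.
montage = imgs.transpose(0, 2, 1, 3).reshape(rows * h, cols * w)
print(montage.shape)
# plt.imshow(montage, cmap="gray") would then display the grid.
```

If the individual tiles of the real data come out transposed, swapping the last two axes of `imgs` before building the montage (for example with `imgs.transpose(0, 1, 3, 2)`) should fix it.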

Use PCA to compress the face data using the first 20 PCs. Reshape these 20 PC vectors into 20 images and display them. These are known as eigenfaces: every image in the face dataset can be (approximately) described as the mean face plus a linear combination of these 20 PCs. Can you see which facial features each eigenface captures?

In [19]:
# your code here  
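
A sketch of the eigenface computation on synthetic data. Note that scikit-learn expects samples in rows, so the 4096 x 400 face matrix has to be transposed first.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
faces = rng.normal(size=(4096, 400))   # synthetic stand-in for data['faces']

X = faces.T                            # samples in rows: (400, 4096)
pca = PCA(n_components=20, random_state=0).fit(X)

# Each row of components_ is one PC of length 4096; view it as a 64 x 64 image.
eigenfaces = pca.components_.reshape(20, 64, 64)
print(eigenfaces.shape)
```

Each `eigenfaces[k]` can be shown with `plt.imshow(..., cmap="gray")`. As with the individual faces, the real data may need a transpose to appear upright.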


Finally, use the appropriate output of the fitted PCA object to print how much (%) of the variance (energy) of the original dataset is explained by the first 20 principal components. How many principal components would be needed to have a compression loss of less than 5%? NB: PCA dimensionality reduction is a lossy compression; a measure of PCA's compression loss could be 1 minus the cumulative explained variance ratio of the retained components.

Hints:
- To calculate the explained variance for the first 20 principal components, you can sum the first 20 entries of the explained_variance_ratio_ attribute of the PCA object.
- To determine how many principal components are needed to have a compression loss of less than 5%, you can apply np.argmax to a boolean condition on the cumulative sum (np.cumsum) of explained_variance_ratio_.

In [20]:
# your code here 
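
The two hints can be combined as in the sketch below. It runs on synthetic low-rank-plus-noise data, so the printed numbers say nothing about the face dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: a rank-30 signal plus small noise, so a few PCs carry most variance.
X = rng.normal(size=(400, 30)) @ rng.normal(size=(30, 200))
X = X + 0.1 * rng.normal(size=(400, 200))

pca = PCA().fit(X)                                    # keep all components
cum_var = np.cumsum(pca.explained_variance_ratio_)    # cumulative explained variance ratio

print(f"first 20 PCs explain {100 * cum_var[19]:.1f}% of the variance")

# np.argmax returns the first index where the condition is True; +1 turns it into a count.
k = int(np.argmax(cum_var > 0.95)) + 1
print("components needed for a loss below 5%:", k)
```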