# Feature Reduction with PCA (Principal Component Analysis)

## Objective
Feature reduction, also known as dimensionality reduction, transforms data from a high-dimensional space to a low-dimensional space while retaining meaningful properties of the original data. **Principal Component Analysis (PCA)**, introduced by Karl Pearson in 1901, is a widely used technique for this purpose. PCA applies an orthogonal transformation that converts a dataset of possibly correlated features into a set of linearly uncorrelated variables called principal components; keeping only the leading components yields fewer dimensions than the original features. PCA is sensitive to the relative scaling of the original features, which is why the data is standardized before the transformation in this lab. The aim of this lab is to apply PCA for feature reduction.

## Prerequisites
Complete all materials in submodule 3.6, specifically the lecture slides on PCA, to understand the underlying algorithm and its application.

## PCA Algorithm
To implement PCA, follow these steps:

1. **Center the features** by subtracting each feature's mean from its values.
2. **Calculate the covariance matrix** of the centered features to understand how features vary together.
3. **Perform eigendecomposition** on the covariance matrix to extract eigenvalues and eigenvectors.
4. **Sort eigenvalues and eigenvectors** in descending order based on the eigenvalues, which represent the variance captured by each eigenvector.
5. **Select the top K eigenvectors** based on the sorted eigenvalues, where K is the number of principal components you wish to retain.
6. **Construct the principal components (PCs)** by projecting the centered data onto the selected eigenvectors; the result is the new, lower-dimensional representation of your data.
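
The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the lab's reference solution: `pca_sketch` is a hypothetical helper name, the demo data is random, and `np.linalg.eigh` (intended for symmetric matrices) is swapped in for a general eigendecomposition.

```python
import numpy as np

def pca_sketch(X, k):
    """Hypothetical helper: project X (n_samples, n_features) onto its top-k PCs."""
    # 1. Center each feature (column) at zero
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix; rowvar=False because features are in columns
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigendecomposition of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. eigh returns ascending eigenvalues, so reorder to descending
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Keep the k eigenvectors with the largest eigenvalues
    W = eigvecs[:, :k]
    # 6. Project the centered data onto the selected eigenvectors
    return X_centered @ W, eigvals

# Random demo data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
scores, eigvals = pca_sketch(X_demo, k=2)
print(scores.shape)
```

The returned `scores` array has one row per sample and `k` columns, and its columns are uncorrelated with one another.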

## Instructions
- Implement the PCA algorithm using Python.
- Apply PCA to the **IRIS dataset** by selecting the first two principal components for feature reduction. The dataset is available as `iris.data` or from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data).
- **Plot the data** using the two principal components to visualize the reduced dimensionality.

This exercise will help you gain hands-on experience with PCA and understand how it can be applied to reduce feature dimensions while preserving the essence of the original dataset.

In [24]:
# !pip install scikit-learn 
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
import pandas as pd
import numpy as np

### Load the IRIS dataset

In [48]:
# Features: the length and width of the sepals and petals
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

Note that `iris.target` holds the class labels.

In [49]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

#### iris.data has the data

In [58]:
data = iris.data
data

array([[5.1, 3.5, 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [4.6, 3.4, 1.4, 0.3],
       [4.4, 2.9, 1.4, 0.2]])

### Standardize the data to remove bias due to differences in feature scales
Call `StandardScaler().fit_transform`.\
How would this be computed by hand?\
Why is standardization of the data important?

In [61]:
# code here
std_data = StandardScaler().fit_transform(data)
std_data[:10]

array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
       [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
       [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
       [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
       [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
       [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]])
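
As one possible answer to the "by hand" question above, standardization subtracts each column's mean and divides by its standard deviation. The sketch below assumes the population standard deviation (`ddof=0`, NumPy's default), which is what `StandardScaler` uses; `manual_std` is an illustrative name.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

data = datasets.load_iris().data

# z = (x - mean) / std, computed column by column.
# data.std uses ddof=0 (population std) by default.
manual_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Compare against scikit-learn's result
print(np.allclose(manual_std, StandardScaler().fit_transform(data)))
```

After this step every feature has mean 0 and unit variance, so no feature dominates the covariance matrix merely because of its scale.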

Transpose so that each row holds all the values of one feature

In [69]:
# code here
data_T = std_data.T
data_T[0]

array([-0.90068117, -1.14301691, -1.38535265, -1.50652052, -1.02184904,
       -0.53717756, -1.50652052, -1.02184904, -1.74885626, -1.14301691,
       -0.53717756, -1.26418478, -1.26418478, -1.87002413, -0.05250608,
       -0.17367395, -0.53717756, -0.90068117, -0.17367395, -0.90068117,
       -0.53717756, -0.90068117, -1.50652052, -0.90068117, -1.26418478,
       -1.02184904, -1.02184904, -0.7795133 , -0.7795133 , -1.38535265,
       -1.26418478, -0.53717756, -0.7795133 , -0.41600969, -1.14301691,
       -1.02184904, -0.41600969, -1.14301691, -1.74885626, -0.90068117,
       -1.02184904, -1.62768839, -1.74885626, -1.02184904, -0.90068117,
       -1.26418478, -0.90068117, -1.50652052, -0.65834543, -1.02184904,
        1.40150837,  0.67450115,  1.2803405 , -0.41600969,  0.79566902,
       -0.17367395,  0.55333328, -1.14301691,  0.91683689, -0.7795133 ,
       -1.02184904,  0.06866179,  0.18982966,  0.31099753, -0.29484182,
        1.03800476, -0.29484182, -0.05250608,  0.4321654 , -0.29

### Calculate the covariance matrix
Use the NumPy `np.cov` function.\
What does this matrix tell us?

In [30]:
# code here
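
One way to fill in this cell is sketched below. It rebuilds `data_T` so the snippet runs on its own; `cov_matrix` is an illustrative name. By default `np.cov` treats each row as one variable, which is why the transposed array is passed in.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

std_data = StandardScaler().fit_transform(datasets.load_iris().data)
data_T = std_data.T  # shape (4, 150): one row per feature

# Each row is a variable (rowvar=True is the default), giving a 4 x 4 matrix
cov_matrix = np.cov(data_T)
print(cov_matrix.shape)
print(np.round(cov_matrix, 3))
```

Entry `(i, j)` is the covariance between features `i` and `j`. Because the data is standardized, the off-diagonal entries are close to the correlation coefficients, and the diagonal is slightly above 1 because `np.cov` divides by `n - 1` while `StandardScaler` divides by `n`.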


### Generate eigenvalues and eigenvectors
Use the `np.linalg.eig` function to extract the eigenvalues and eigenvectors of the covariance matrix.\
This step could also be done with SVD.\
Repeat the lab using SVD for this step.

In [31]:
# code here
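
The sketch below shows one possible way to do this step and to check it against the SVD route. Variable names are illustrative. It assumes the data has been standardized (and is therefore centered), in which case the squared singular values divided by `n - 1` equal the eigenvalues of the covariance matrix, and the rows of `Vt` match the eigenvectors up to sign.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

std_data = StandardScaler().fit_transform(datasets.load_iris().data)
cov_matrix = np.cov(std_data.T)

# Eigendecomposition; np.linalg.eig does not guarantee any ordering,
# so sort by descending eigenvalue. Column i of eig_vectors pairs with eig_values[i].
eig_values, eig_vectors = np.linalg.eig(cov_matrix)
order = np.argsort(eig_values)[::-1]
eig_values, eig_vectors = eig_values[order], eig_vectors[:, order]

# SVD of the centered data matrix: std_data = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(std_data, full_matrices=False)
svd_eig_values = S**2 / (std_data.shape[0] - 1)

print(np.allclose(eig_values, svd_eig_values))
# Eigenvectors are only defined up to sign, so compare absolute values
print(np.allclose(np.abs(eig_vectors), np.abs(Vt.T), atol=1e-6))
```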

Display the eigenvectors

In [32]:
# code here

Display the eigenvalues\
What do the eigenvalues tell us?

In [33]:
# code here

### Estimate the importance of each principal component
Which eigenvectors do we keep and why?

In [34]:
# code here
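
A common way to judge importance, sketched here under the assumption that "importance" means the share of total variance, is to divide each eigenvalue by the sum of all eigenvalues. The names `explained_ratio` and `cumulative` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

std_data = StandardScaler().fit_transform(datasets.load_iris().data)
eig_values, eig_vectors = np.linalg.eig(np.cov(std_data.T))

# Sort eigenvalues in descending order
eig_values = np.sort(eig_values)[::-1]

# Fraction of the total variance captured by each component
explained_ratio = eig_values / eig_values.sum()
# Running total: variance captured by the first 1, 2, 3, 4 components
cumulative = np.cumsum(explained_ratio)

print(np.round(explained_ratio, 4))
print(np.round(cumulative, 4))
```

Components are typically kept until the cumulative share is judged high enough; for this dataset the first two components together account for well over 90% of the variance.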

### Project the data onto the first basis (i.e., the 'axis' of the most important eigenvector)
That is, take the dot product of the standardized data with the most important eigenvector.

In [35]:
# code here
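
The projection can be sketched as a single matrix-vector product. The snippet recomputes the eigenvectors so it runs on its own; `pc1` and `projected_1` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

std_data = StandardScaler().fit_transform(datasets.load_iris().data)
eig_values, eig_vectors = np.linalg.eig(np.cov(std_data.T))

# The most important eigenvector is the column with the largest eigenvalue
pc1 = eig_vectors[:, np.argmax(eig_values)]

# Dot product of every sample (row) with the eigenvector: 150 scalar coordinates
projected_1 = std_data @ pc1
print(projected_1.shape)
print(projected_1[:5])
```

The sample variance of `projected_1` equals the largest eigenvalue, which is what "variance captured by the component" refers to. The sign of the coordinates is arbitrary, because an eigenvector multiplied by -1 is still an eigenvector.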

Parse into a pandas DataFrame.\
The y-axis is set to 0 so that a 2D graph can display the one-dimensional data.

In [36]:
# code here
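
One possible layout for the DataFrame is sketched below. The column names `PC1`, `y`, and `label` are assumptions chosen for illustration, and the species names are taken from `iris.target_names`.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
std_data = StandardScaler().fit_transform(iris.data)
eig_values, eig_vectors = np.linalg.eig(np.cov(std_data.T))
pc1 = eig_vectors[:, np.argmax(eig_values)]

df_pc1 = pd.DataFrame({
    "PC1": std_data @ pc1,                    # coordinate along the first component
    "y": np.zeros(len(std_data)),             # constant 0 so the points lie on one line
    "label": iris.target_names[iris.target],  # species name for each sample
})
print(df_pc1.head())
```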

Visualize with a scatter plot.\
What do you observe?

In [37]:
# code here

### Repeat with 2nd most important principal component

Project onto the second basis

In [38]:
# code here

Parse into a pandas dataframe

In [39]:
# code here

Plot the 2D graph

In [40]:
# code here
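
A 2D scatter plot of the first two components might look like the sketch below. It assumes matplotlib is installed (the notebook does not import it elsewhere) and selects the non-interactive `Agg` backend so the snippet also runs outside a notebook; inside Jupyter that line can be dropped. Column and variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
std_data = StandardScaler().fit_transform(iris.data)
eig_values, eig_vectors = np.linalg.eig(np.cov(std_data.T))

# Take the two eigenvectors with the largest eigenvalues as a 4 x 2 matrix
order = np.argsort(eig_values)[::-1]
W = eig_vectors[:, order[:2]]

df = pd.DataFrame(std_data @ W, columns=["PC1", "PC2"])
df["label"] = iris.target_names[iris.target]

# One scatter series per species so each gets its own color and legend entry
fig, ax = plt.subplots()
for name, group in df.groupby("label"):
    ax.scatter(group["PC1"], group["PC2"], label=name)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```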

How do these plots compare to projecting onto PC3 or some other 'axis' that captures less variance?\
Project the data onto the third basis.


In [41]:
# code here

Generate plot with the third basis only

In [42]:
# code here

## Alternatively, use PCA from sklearn

In [43]:
from sklearn.decomposition import PCA

Call PCA with the number of components

In [44]:
# code here
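
A minimal sketch of the scikit-learn route is shown below, assuming the same standardized input as the manual steps. `n_components`, `fit_transform`, and `explained_variance_ratio_` are part of scikit-learn's `PCA` class; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

std_data = StandardScaler().fit_transform(datasets.load_iris().data)

# Keep the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(std_data)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The resulting coordinates should agree with the manual projection up to the sign of each column, since the direction of each principal axis is arbitrary.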

Parse into pandas dataframe

In [45]:
# code here

Display plot

In [46]:
# code here