<h1 align="center">Zewail University of Science and Technology</h1>
<h2 align="center">CIE 417 (Fall 2018)</h2>
<h2 align="center">Lab 9: PCA</h2>
<h3 align="center">15/11/2018</h3>

## <font color="#00cccc">PCA</font>
### Run PCA Demo

## <font color = "#af00af"> What does PCA do? </font>

### 1) PCA learns a k-dimensional subspace given d-dimensional data
### 2) The new subspace is represented by k basis vectors
### 3) The k basis vectors are orthonormal, and they try to capture as much variance in the original data as possible
### 4) The 1st basis vector (1st principal component) points in the direction of maximum variance in the data; each subsequent component points in the direction of maximum remaining variance, orthogonal to the previous ones
### $$ X \approx ZW $$
### $$ (n \times d) \approx (n \times k)(k \times d) $$

### The basis vectors are the rows of W
### The representations in the new basis are the rows of Z
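
The shapes in the factorization above can be checked with a minimal sketch. It assumes synthetic random data and hypothetical names (`X_demo`, `pca_demo`, `n`, `d`, `k`) that are not part of the lab; scikit-learn stores the basis vectors as the rows of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): n samples, d features
rng = np.random.default_rng(0)
n, d, k = 100, 5, 2
X_demo = rng.normal(size=(n, d))
X_demo = X_demo - X_demo.mean(axis=0)  # X ~ ZW assumes centered data

pca_demo = PCA(n_components=k)
Z = pca_demo.fit_transform(X_demo)  # rows of Z: representations in the new basis, shape (n, k)
W = pca_demo.components_            # rows of W: basis vectors, shape (k, d)

X_approx = Z @ W                    # (n, k)(k, d) -> (n, d)
print(Z.shape, W.shape, X_approx.shape)

# The rows of W are orthonormal, so W W^T is the k x k identity
print(np.allclose(W @ W.T, np.eye(k)))
```

With `k < d` the product `Z @ W` is only an approximation of the centered data; with `k = d` it would reproduce it up to rounding.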

<img src="PCA_Linear_Combination.PNG">
source: https://ubc-cs.github.io/cpsc340/lectures/L24.pdf
### Instead of using 784-dimensional images (28x28), we can use the 7-dimensional z to represent them (for "3", the z = [1,1,1,1,1,0,0])

## <font color = "#af00af"> How to calculate principal components? </font>

### 1) Subtract the mean of the data (and preferably standardize the data)
### 2) Calculate the covariance matrix
### $$ \Sigma = \frac{1}{n-1}(X-\mu)^T(X-\mu) = \frac{1}{n-1}X^T X \quad \text{(as we centered the data)} $$
### 3) Calculate the eigenvalues and the eigenvectors of the covariance matrix
### 4) Construct the transformation matrix W, whose rows are the eigenvectors that correspond to the k largest eigenvalues; these eigenvectors are your new basis vectors

### Alternatively, PCA can be calculated using SVD (which is more common)
### See here for more details on SVD: https://www.youtube.com/watch?v=daHVmoOrLrI
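
The four steps above, and the SVD alternative, can be sketched in NumPy as follows. This is an illustration on synthetic data, not the lab's reference solution; the names `X_demo`, `W_eig`, and `W_svd` are hypothetical. The two routes should agree up to the sign of each basis vector, since an eigenvector and its negation are equally valid.

```python
import numpy as np

# Synthetic correlated data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
k = 2

# 1) Subtract the mean
Xc = X_demo - X_demo.mean(axis=0)
n = Xc.shape[0]

# 2) Covariance matrix of the centered data
cov = Xc.T @ Xc / (n - 1)

# 3) Eigen-decomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order, so sort them descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Rows of W are the eigenvectors of the k largest eigenvalues
W_eig = eigvecs[:, :k].T
Z_eig = Xc @ W_eig.T  # projected data, shape (n, k)

# Alternative: SVD of the centered data, Xc = U S Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
W_svd = Vt[:k]                  # right singular vectors are the basis vectors
svd_eigvals = S**2 / (n - 1)    # singular values relate to covariance eigenvalues

# Compare the two results, ignoring the sign of each vector
same_up_to_sign = np.allclose(np.abs(W_eig), np.abs(W_svd), atol=1e-6)
print(same_up_to_sign, np.allclose(eigvals, svd_eigvals))
```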

## Load IRIS Dataset

##### Attributes:
1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 

We will use all four features, then apply PCA to reduce them to two components for easier visualization.

##### class: 
* Iris Setosa 
* Iris Versicolour 
* Iris Virginica

<img src="Lab9_petal_sepal.png">
source: https://www.wpclipart.com/plants/diagrams/plant_parts/petal_sepal_label.png.html

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# import iris dataset
iris = datasets.load_iris()

# Use all four features; PCA will reduce them to two components later
X = iris.data
y = iris.target

#split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, shuffle = True, random_state = 0)

del X, y
print(f"training set size: {X_train.shape[0]} samples \ntest set size: {X_test.shape[0]} samples")

In [None]:
from numpy.linalg import svd
import numpy as np

## Standardize Data

In [None]:
mu = np.mean(X_train,axis=0)
std = np.std(X_train, axis=0)
X_train_std = (X_train-mu)/std
del X_train

## <font color = "#af00af"> Why do we need to standardize data? </font>
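
One way to explore this question is the sketch below. It assumes a hypothetical change of units in which sepal width is multiplied by 1000, then compares the first principal component with and without standardization. The names `X_rescaled`, `W_raw`, and `W_std` are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_iris = datasets.load_iris().data

# Hypothetical unit change: express sepal width (column 1) on a 1000x larger scale
X_rescaled = X_iris.copy()
X_rescaled[:, 1] *= 1000.0

# PCA on the raw (only centered) data: the large-scale feature dominates the variance
W_raw = PCA(n_components=1).fit(X_rescaled).components_

# PCA after standardization: every feature has unit variance before PCA
X_standardized = (X_rescaled - X_rescaled.mean(axis=0)) / X_rescaled.std(axis=0)
W_std = PCA(n_components=1).fit(X_standardized).components_

print("first component, raw:         ", np.round(np.abs(W_raw[0]), 3))
print("first component, standardized:", np.round(np.abs(W_std[0]), 3))
```

Without standardization the first component should point almost entirely along the rescaled feature, which reflects the choice of units rather than structure in the data.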

## Apply PCA to Reduce Features Dimensions From 4 to 2

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
new_projected_data = pca.fit_transform(X_train_std)

In [None]:
from lab9_functions import plot_iris_data
plot_iris_data(new_projected_data,y_train)

## <font color = "#af00af"> Which features do these two components represent? </font>

## <font color = "#af00af"> How much information did we lose in the previous operation? </font>

In [None]:
pca.explained_variance_ratio_
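
If "information" is taken to mean variance, one possible way to quantify the loss is one minus the sum of the explained variance ratios. The sketch below repeats the pipeline from the cells above under hypothetical names (`X_tr_std`, `pca_demo`) so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Same split and standardization as in the cells above
iris_demo = datasets.load_iris()
X_tr, _, y_tr, _ = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.25, shuffle=True, random_state=0
)
X_tr_std = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)

pca_demo = PCA(n_components=2).fit(X_tr_std)

# Fraction of the total variance kept by the two components, and the remainder
retained = pca_demo.explained_variance_ratio_.sum()
lost = 1.0 - retained
print(f"retained: {retained:.3f}, lost: {lost:.3f}")
```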

## <font color = "#af00af"> What are the applications of PCA? </font>

### Supervised Learning (reduce the number of features to cut computation, reduce overfitting, etc.)
### Visualization
### Dimensionality Reduction

## <font color = "#af00af"> When does PCA fail? </font>

<img src="PCA_fails.png">
source: A Tutorial on Principal Component Analysis, Jonathon Shlens, https://arxiv.org/pdf/1404.1100.pdf

## <font color = "#ff0000"> Exercise: </font>

In [None]:
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

X,y = load_digits(return_X_y=True)

In [None]:
plt.imshow(X[0,:].reshape(8,8), interpolation='nearest')

### Split Data into Training and Testing Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, shuffle = True, random_state = 0)

### Center Data (no need to divide by standard deviation as they are all on the same scale)

### Use Sklearn LinearSVC classifier to classify data, and print the training and testing accuracy

In [None]:
from sklearn.svm import LinearSVC


### Apply PCA using two components

### Plot Data (with different colors for different numbers)

### Use Sklearn LinearSVC classifier to classify the new projected data, and print the training and testing accuracy

### Apply PCA such that 90% of the variance in the data is preserved
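
As a hint, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps as many components as are needed to explain that fraction of the variance. The sketch below shows this on synthetic data (an assumption, so the exercise on the digits data is left to you); `X_toy` and `pca_90` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

# A float n_components is a variance target; the full SVD solver supports it
pca_90 = PCA(n_components=0.90, svd_solver="full")
Z_toy = pca_90.fit_transform(X_toy)

# Number of components actually kept, and the variance they explain together
cumulative = np.cumsum(pca_90.explained_variance_ratio_)
print(pca_90.n_components_, cumulative[-1])
```

The fitted attribute `n_components_` reports how many components were selected.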

### How many components did you use?

### Use Sklearn LinearSVC classifier to classify the new projected data, and print the training and testing accuracy