## Principal Component Analysis

The principal application of PCA is **dimension reduction**. Given high-dimensional data, PCA lets us reduce its dimensionality so that the bulk of the variation spread across many dimensions is captured in fewer dimensions.

The key idea of PCA is to reduce the dimension of the dataset while preserving as much of its information (variation) as possible.

### Curse of Dimensionality

**Sparseness in N-Dimensional Space:** 

The curse of dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset is the number of attributes/features it contains. Two commonly discussed aspects of the curse of dimensionality are "data sparsity" and "distance concentration".

**Data Sparsity**:
Data sparsity (or simply "sparsity") in high-dimensional data arises when the training samples do not capture all combinations of the attributes. Training a model on sparse data can lead to high variance or overfitting: the model learns the frequently occurring combinations of attributes and predicts them accurately, but fails on the less frequent combinations that appear in reality (unseen data).
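
As a rough illustration of sparsity, the sketch below draws a fixed number of samples and measures what fraction of all possible attribute combinations they cover as the number of features grows. The setup is an assumption made for the example: each feature is categorical with 3 equally likely levels, and `coverage` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # fixed training-set size (assumed for illustration)


def coverage(n_features, n_levels=3):
    """Fraction of all attribute combinations seen at least once in the sample."""
    data = rng.integers(0, n_levels, size=(n_samples, n_features))
    n_seen = len(np.unique(data, axis=0))      # distinct rows = distinct combinations
    return n_seen / n_levels ** n_features     # n_levels ** n_features combinations exist


for d in (2, 5, 10, 15):
    print(f"{d:>2} features: {coverage(d):.4%} of combinations observed")
```

With the sample size held fixed, the covered fraction shrinks quickly because the number of possible combinations grows exponentially with the number of features.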

**Distance Concentration**:
Distance concentration refers to the pairwise distances between different samples/points in the space converging to the same value as the dimensionality of the data increases. Several machine learning methods, such as clustering or nearest-neighbour methods, use distance-based metrics to identify similar or nearby samples. Due to distance concentration, proximity or similarity between samples may no longer be a meaningful notion in higher dimensions.
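
Distance concentration can be observed numerically. The sketch below assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_contrast`, that compares the largest and smallest pairwise Euclidean distances; it is one possible way to measure the effect, not a standard API.

```python
import numpy as np

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=200):
    """(max - min) / min over all pairwise Euclidean distances."""
    points = rng.random((n_points, n_dims))
    # Squared distances via the Gram matrix: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    upper = np.triu_indices(n_points, k=1)               # each pair once, no self-pairs
    dists = np.sqrt(np.clip(sq_dists[upper], 0, None))   # clip guards tiny negative round-off
    return (dists.max() - dists.min()) / dists.min()


for d in (2, 10, 100, 1000):
    print(f"{d:>4} dims: relative contrast = {relative_contrast(d):.2f}")
```

The contrast drops as the dimension grows, meaning the nearest and farthest pairs become hard to tell apart.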


### Steps to Conducting PCA

#### Mathematical Approach:

1. Center each feature by subtracting the feature mean
2. Calculate the covariance matrix of the centered dataset
3. Calculate the eigenvectors/eigenvalues of the covariance matrix (eigen decomposition)
4. Project the data onto the eigenvectors: take the dot product of the transpose of the eigenvector matrix with the transpose of the centered data
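
The four steps can be sketched with NumPy as follows. The toy data and all variable names are assumptions made for illustration. The projection here is written as `X_centered @ eigenvectors`, which gives the same numbers as the product of transposes in step 4, only with samples in rows instead of columns.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (assumed): 200 samples, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Step 1: center each feature
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix (features are columns, hence rowvar=False)
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigen decomposition; eigh is meant for symmetric matrices and
# returns eigenvalues in ascending order, so reorder to put PC1 first
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column k is the eigenvector of PC(k+1)

# Step 4: project the centered data onto the eigenvectors
scores = X_centered @ eigenvectors      # shape (n_samples, n_features)

print(eigenvalues)
print(scores[:3])
```

Keeping only the first `k` columns of `eigenvectors` before the projection gives the reduced, `k`-dimensional representation.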

#### Graphical Approach:

Assume two predictors / features case: $X_1$ and $X_2$

1. Plot the two predictors on a scatter plot

![PCA1](img/pca1.png)

2. Standard scale both $X_1$ and $X_2$ and fit the best PC line to the data

![PCA2](img/pca2.png)

3. Maximize the Sum of Squared Distance (SSD) for PC1

![PCA3](img/pca3.png)

Note: SSD for PC1 is the Eigenvalue for PC1

4. Calculate the Eigenvector for PC1

![PCA4](img/pca4.png)

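Note: the Eigenvector for PC1 is also called the "singular vector". It is the unit-length direction of PC1 and consists of $\frac{\Delta X_1}{m}$ and $\frac{\Delta X_2}{m}$, where $\Delta X_1$ and $\Delta X_2$ are the changes in $X_1$ and $X_2$ along PC1 and $m = \sqrt{\Delta X_1^2 + \Delta X_2^2}$ is the length of that step.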

5. PC2 is just the line perpendicular to PC1 in this case

![PCA5](img/pca5.png)

Note: the Eigenvalue and Eigenvector for the PC2 line can be calculated using the same logic.

6. Rotate the plot to make PC1 the horizontal axis and PC2 the vertical axis

![PCA6](img/pca6.png)

At this stage, we can calculate the variation for PC1 and PC2:

$Variation_{PC1} = \frac{SSD_{PC1}}{n-1}$

$Variation_{PC2} = \frac{SSD_{PC2}}{n-1}$

Total Variation = $Variation_{PC1} + Variation_{PC2}$

Therefore, the proportion of the total variation in the data explained by PC1 and PC2 is:

$$Explained_{PC1} = \frac{Variation_{PC1}}{TotalVariation}$$

$$Explained_{PC2} = \frac{Variation_{PC2}}{TotalVariation}$$
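
The formulas above can be checked numerically. The sketch below assumes a small synthetic two-feature dataset; the names `ssd`, `variation` and `explained` simply mirror the symbols in the formulas. The SSD of each component is computed from the projected points' distances to the origin.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Two correlated predictors (assumed toy data), then standard scaled
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Principal directions from the covariance matrix, PC1 first
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

# Coordinates of each sample along PC1 and PC2
scores = X @ eigenvectors

ssd = (scores ** 2).sum(axis=0)          # SSD_PC1, SSD_PC2
variation = ssd / (n - 1)                # Variation_PC1, Variation_PC2
explained = variation / variation.sum()  # Explained_PC1, Explained_PC2

print("Variation:", variation)
print("Explained:", explained)
```

In this sketch `variation` matches the eigenvalues of the sample covariance matrix, and the explained proportions sum to 1.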

For higher-dimensional data, the Eigenvector of each principal component has as many elements as there are predictors / features in the dataset.

e.g. Principal Component Functions

$$PC_1 = ev_{11}(x_1) + ev_{12}(x_2) + ... + ev_{1p}(x_p)$$

$$PC_2 = ev_{21}(x_1) + ev_{22}(x_2) + ... + ev_{2p}(x_p)$$

$$PC_3 = ev_{31}(x_1) + ev_{32}(x_2) + ... + ev_{3p}(x_p)$$

...

where 

- $ev_{11}, ev_{12}, ... ev_{1p}$ are the values in the eigenvector (which is calculated by $\frac{\Delta X_1}{m}, \frac{\Delta X_2}{m}$, ... in our example)   
- $x_1, x_2, ... x_p$ are the predictor columns  
- $p$ refers to the number of predictors in the dataset  
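
To connect these functions to scikit-learn: after fitting, each row of `pca.components_` holds the eigenvector values of one principal component. The sketch below assumes random, already-centered toy data and rebuilds $PC_1$ as the weighted sum above, then compares it with the output of `pca.transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Toy data (assumed): 50 samples, p = 4 predictors, centered column-wise
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

pca = PCA(n_components=2).fit(X)
ev = pca.components_   # shape (2, 4): row k holds the eigenvector values of PC(k+1)

# PC1 written out as ev_11 * x_1 + ev_12 * x_2 + ... + ev_1p * x_p
pc1_manual = sum(ev[0, j] * X[:, j] for j in range(X.shape[1]))

# The same values as computed by scikit-learn
pc1_sklearn = pca.transform(X)[:, 0]

print(np.allclose(pc1_manual, pc1_sklearn))
```

Note that `pca.transform` subtracts the training mean before projecting, which is why the data is centered first in this sketch.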

### Python Code

SKLearn Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

**Calculate the Explained Variance by the PCA**

``` Python
# Import dependencies
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standard Scale the predictors
X = StandardScaler().fit_transform(X)

# Create a PCA object that reduces the data to 2 principal components
pca = PCA(n_components = 2)

# Fit the predictors to the PCA object
principalComponents = pca.fit_transform(X)

# Putting the components into a dataframe
principal_df = pd.DataFrame(data = principalComponents,
                            columns = ['principal component 1', 'principal component 2'])

# Explained variance by the PCA
pca.explained_variance_ratio_
```

#### Diagnostic of the Principal Components

We usually use a **scree plot** to decide how many principal components to include in the model.

``` Python
# Import dependencies
import numpy as np
import matplotlib.pyplot as plt

# Create the PC values for the horizontal axis
PCs = [f'PC{i + 1}' for i in range(len(pca.explained_variance_ratio_))]

# Create the explained variance values for each component
exp_var = pca.explained_variance_ratio_.tolist()

# Create bar plot with the explained variance ratio
plt.bar(PCs, exp_var, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Components')
plt.ylabel('Proportion of Variance Explained')
plt.show()
```

**Dimension Reduction by Minimum Explained Variance Threshold**

``` Python
# Import dependencies
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

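# Standard scale the predictors: fit on the training set only, then apply to both sets
# (X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create a PCA object that chooses the minimum number of principal components such that 90% of the variance is retained
pca_90 = PCA(n_components = 0.90)

# Fit PCA on training set
pca_90.fit(X_train)

# Transform both training set and test set with the pca_90 object
pca_X_train = pca_90.transform(X_train)
pca_X_test = pca_90.transform(X_test)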

# Apply transformed training set for ML model, like logistic regression
lr = LogisticRegression(solver='lbfgs')
lr.fit(pca_X_train, y_train)

# Make prediction based on the testing set
lr.predict(pca_X_test)

# Calculate the evaluation metric (consider using other measure based on situation)
lr.score(pca_X_test, y_test)
```

You can try setting different ```n_components``` values, such as 0.95, 0.9, 0.85, 0.8, 0.75, 0.7 ..., and compare the number of components (features or predictors) passed to the model, the training time, and the testing accuracy to see whether training time improves significantly without losing too much accuracy.
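
Before timing any models, it can help to see how many components each threshold keeps. The sketch below is an illustration only: it assumes scikit-learn's small built-in digits dataset (`load_digits`) rather than any dataset used in this notebook, fits one full PCA, and reads the component counts off the cumulative explained variance.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small built-in dataset (assumed for illustration): 64 pixel features
X = StandardScaler().fit_transform(load_digits().data)

# Fit PCA once with all components and accumulate the explained variance
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# For each threshold, count the components needed to reach it
thresholds = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70]
n_needed = {t: int(np.searchsorted(cumulative, t)) + 1 for t in thresholds}

for t, k in n_needed.items():
    print(f"{t:.0%} of variance -> {k} components")
```

Each count should correspond to what `PCA(n_components=t)` selects for the same threshold, so this is a cheap way to preview the trade-off before training and timing a model per threshold.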



Now, let's play around with this image dataset from SKLearn and see if you can reduce the number of features while still making accurate predictions.

In [None]:
# Import dependencies
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_openml

In [None]:
# Import the image dataset from SKLearn
mnist = fetch_openml('mnist_784')

In [None]:
# Test Train Split
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=0.15, random_state=999)

In [None]:
# Create the Scaling object
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_img)

# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

In [None]:
# Create a PCA object with 500 principal components


# Fit the training data to the PCA



In [None]:
# Create the PC values for the horizontal axis


# Create the explained variance for each component


# Create bar plot with the explained variance ratio



In [None]:
# Create the PCA object to preserve minimum 90% of variance 


# Fit on training set only


# Transform the training set and testing set with the PCA object




In [None]:
# Build your own model for this classification problem



In [None]:
# Validate the prediction with CV or testing set



In [None]:
# Try to create a for loop that runs the process with different n_components values and tracks the accuracy and run time


