# <center>PCA</center>

## References

* Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow - Aurélien Géron
* Machine learning - Fast reference guide - Matt Harrison
* https://www.youtube.com/@patloeber
* https://www.youtube.com/@Dataquestio
* https://medium.com/turing-talks/aprendizado-n%C3%A3o-supervisionado-redu%C3%A7%C3%A3o-de-dimensionalidade-479ecfc464ea

## Overview

PCA (Principal Component Analysis) is an unsupervised learning algorithm that reduces the dimensionality of a dataset without simply discarding attributes. It is based on the variance of the data: it builds a new, lower-dimensional representation that preserves as much of the original variance as possible. The result is a matrix whose columns are uncorrelated and are linear combinations of the original columns.
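
As a minimal sketch of the two properties mentioned above (linear combinations and uncorrelated columns), the following uses scikit-learn on synthetic data. The data, the random seed, and the names `X_demo`, `pca_demo`, and `scores` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration):
# 200 samples, 3 features, the first two strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(X_demo)

# Each row of components_ holds the weights of one linear combination
# of the original columns
print(pca_demo.components_.shape)  # (2, 3)

# The off-diagonal entries of the covariance of the scores are ~0,
# i.e. the new columns are uncorrelated
print(np.cov(scores, rowvar=False).round(6))
```

The transformed data is the centered input multiplied by the transposed component matrix, which is what "linear combinations of the original columns" means in practice.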

When a dataset has a large number of attributes, we run into a phenomenon known as the Curse of Dimensionality. Models trained on such data tend to have many parameters, which can lead to overfitting and high computational cost. Highly correlated attributes and attributes that carry no useful information for the problem also become more common.

One way to reduce the complexity of the model is to use a feature selection method. In general, these methods discard the least relevant attributes, so any information those attributes carried is lost entirely. Principal component analysis, by contrast, aims to reduce the size of the data with the least possible loss of information.

PCA therefore reduces the complexity of the model while preserving as much of the variance of the data as possible. It also makes it possible to visualize a multidimensional dataset in 2D or 3D. One disadvantage is the loss of interpretability: each new column mixes several original attributes, so the resulting dataset is not easily related to the original data.
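
One way to see how much variance is preserved is to inspect the explained variance ratio. The sketch below assumes synthetic data with 10 features driven by 2 latent factors. The names `X_demo`, `pca_95`, and `X_reduced` and the 0.95 threshold are illustrative choices, not values from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 10 features generated from 2 latent factors plus small noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# A float n_components asks scikit-learn to keep as many components as
# needed to explain at least that fraction of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X_demo)

print(X_reduced.shape)
print(pca_95.explained_variance_ratio_.cumsum())
```

Because the 10 columns are almost entirely determined by 2 factors, very few components are enough to pass the threshold, which is the situation where PCA pays off most.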

## Math

### Calculation Steps

* Subtract the mean of each column from X
* Calculate the covariance matrix Cov(X, X)
* Calculate the eigenvectors and eigenvalues of the covariance matrix
* Sort the eigenvectors by their eigenvalues, in decreasing order
* Transform the original n-dimensional data into k dimensions by projecting it onto the first k eigenvectors
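
The steps above can be sketched with NumPy as follows. This is a minimal illustration, not the `my_PCA` implementation used later in this notebook. The function name `pca_steps` is hypothetical, and the sketch only centers the data (it does not standardize it), following the list literally.

```python
import numpy as np

def pca_steps(X, k):
    """Project X (n_samples x n_features) onto its first k principal components."""
    X = np.asarray(X, dtype=float)
    # 1. Subtract the mean of each column
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigen-decomposition; eigh is suited to symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort the eigenvectors by decreasing eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Project the centered data onto the first k eigenvectors
    return X_centered @ eigenvectors[:, :k], eigenvalues[:k]

X_demo = np.array([
    [1, 2, 4, 6, 1],
    [4, 1, 2, 4, 3],
    [5, 4, 8, 3, 1],
    [7, 2, 3, 7, 6],
])
scores, variances = pca_steps(X_demo, k=2)
print(scores.shape)  # (4, 2)
```

Each returned eigenvalue equals the sample variance of the corresponding projected column, which is why sorting by eigenvalue keeps the directions with the most variance.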

---

## Imports

In [76]:
import numpy as np
import pandas as pd

## Data

In [103]:
# Create some data (4 samples, 5 features)
# np.array is used because NumPy discourages np.matrix in new code
data = np.array([
    [1,2,4,6,1],
    [4,1,2,4,3],
    [5,4,8,3,1],
    [7,2,3,7,6]
])

# To dataframe
X = pd.DataFrame(data)

# View
X

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 2 | 4 | 6 | 1 |
| 1 | 4 | 1 | 2 | 4 | 3 |
| 2 | 5 | 4 | 8 | 3 | 1 |
| 3 | 7 | 2 | 3 | 7 | 6 |


# Models

## From sklearn


In [104]:
# Imports
from sklearn.decomposition import PCA

In [105]:
# Standardize the data (pandas' std() uses the sample standard deviation, ddof=1)
std_data = (X - X.mean())/X.std()
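
A note on the standardization step: pandas divides by the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, which uses illustrative names such as `X_demo`, `z_pandas`, and `z_sklearn`, shows that the two results differ only by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X_demo = pd.DataFrame([
    [1, 2, 4, 6, 1],
    [4, 1, 2, 4, 3],
    [5, 4, 8, 3, 1],
    [7, 2, 3, 7, 6],
])
n = len(X_demo)

# pandas: sample standard deviation (ddof=1)
z_pandas = ((X_demo - X_demo.mean()) / X_demo.std()).to_numpy()

# scikit-learn: population standard deviation (ddof=0)
z_sklearn = StandardScaler().fit_transform(X_demo.to_numpy())

# The two results differ only by the constant factor sqrt(n / (n - 1))
print(np.allclose(z_sklearn, z_pandas * np.sqrt(n / (n - 1))))  # True
```

With only 4 samples the factor is noticeable (about 1.15), so PCA scores computed after each kind of scaling will differ by that same factor.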

In [106]:
# Define PCA with 3 components, then fit and transform
pca_sklearn = PCA(n_components=3).fit_transform(std_data)

# View
pca_sklearn.round(2)

array([[ 0.17,  1.41, -0.75],
       [-0.74,  0.66,  1.03],
       [ 2.26, -0.82,  0.11],
       [-1.69, -1.25, -0.38]])
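
As a side note, centering 4 samples leaves at most 3 independent directions, so 3 components should capture essentially all of the variance in this small dataset. The following sketch checks this on the same numbers. `X_demo`, `std_demo`, and `pca_demo` are illustrative names, not variables from the cells above.

```python
import numpy as np
from sklearn.decomposition import PCA

X_demo = np.array([
    [1, 2, 4, 6, 1],
    [4, 1, 2, 4, 3],
    [5, 4, 8, 3, 1],
    [7, 2, 3, 7, 6],
], dtype=float)

# Same standardization as above (sample standard deviation, ddof=1)
std_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

pca_demo = PCA(n_components=3).fit(std_demo)

# Fraction of the total variance explained by each component
print(pca_demo.explained_variance_ratio_.round(3))
# The three fractions add up to (approximately) 1
print(pca_demo.explained_variance_ratio_.sum())
```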

## From scratch

In [107]:
# Imports (my_PCA is the local from-scratch module; this import shadows sklearn's PCA)
from my_PCA import PCA

In [108]:
# Define PCA
pca = PCA(3)

# Fit
pca.fit(X)

# Transform
pca_scratch = pca.transform(X)

# View
pca_scratch.round(2)

array([[-0.17, -1.41, -0.75],
       [ 0.74, -0.66,  1.03],
       [-2.26,  0.82,  0.11],
       [ 1.69,  1.25, -0.38]])
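
The first two columns have the opposite sign from the scikit-learn output. This is expected: an eigenvector multiplied by -1 is still an eigenvector, so principal components are only defined up to sign. The sketch below copies the two rounded outputs by hand and checks that they agree once the signs are aligned. The helper `align_signs` is hypothetical and exists only for this comparison.

```python
import numpy as np

# Rounded outputs copied from the two cells above
pca_sklearn_out = np.array([
    [ 0.17,  1.41, -0.75],
    [-0.74,  0.66,  1.03],
    [ 2.26, -0.82,  0.11],
    [-1.69, -1.25, -0.38],
])
pca_scratch_out = np.array([
    [-0.17, -1.41, -0.75],
    [ 0.74, -0.66,  1.03],
    [-2.26,  0.82,  0.11],
    [ 1.69,  1.25, -0.38],
])

def align_signs(reference, other):
    """Flip columns of `other` so each one points the same way as in `reference`."""
    signs = np.sign((reference * other).sum(axis=0))
    return other * signs

aligned = align_signs(pca_sklearn_out, pca_scratch_out)
print(np.allclose(aligned, pca_sklearn_out))  # True
```

Since `pca.fit(X)` receives the raw data while the scikit-learn run used `std_data`, the match suggests that `my_PCA` standardizes its input internally, although that module is not shown here.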

___