# PCA (Principal Component Analysis)

PCA is a dimensionality reduction technique that can significantly speed up an unsupervised feature learning algorithm. It can also be used to visualize high-dimensional data in 2D or 3D. In this exercise, we will apply PCA to the Wine dataset using scikit-learn.

### Why PCA?

The motivation behind the algorithm is that a few directions in feature space often capture a large percentage of the variance in the original dataset. PCA finds these directions of maximum variance, which are called principal components, and projects the dataset onto them.

### How PCA works

The steps of PCA are as follows:

1. Center the data and compute its covariance matrix.
2. Compute the eigenvalues and eigenvectors of the covariance matrix.
3. Take the eigenvectors corresponding to the largest eigenvalues and use them as the columns of a projection matrix.
4. Project the data onto the subspace spanned by those eigenvectors.

<div align="center">
<img src="pca.png" width=500 height=350/>
</div>
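The four steps above can be sketched directly in NumPy. This is a minimal illustration rather than how scikit-learn implements PCA internally; the function name `pca_from_scratch` and the random demo data `X_demo` are assumptions made for this example.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center the data so the covariance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Step 1: covariance matrix of the features (n_features x n_features).
    cov = np.cov(X_centered, rowvar=False)
    # Step 2: eigen-decomposition; eigh is meant for symmetric matrices
    # and returns the eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    projection = eigenvectors[:, order]        # (n_features, n_components)
    # Step 4: project the centered data onto the selected directions.
    return X_centered @ projection, eigenvalues[order]

# Hypothetical demo data: 200 samples, 5 features with different spreads.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

X_proj, top_eigenvalues = pca_from_scratch(X_demo, n_components=2)
print(X_proj.shape)        # (200, 2)
print(top_eigenvalues)     # variances along the two kept directions
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue picks the directions of maximum variance.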

In [21]:
from sklearn import datasets
import pandas as pd

# Load the Wine dataset and keep only the 13 numeric features (no target column)
wine = datasets.load_wine(as_frame=True)
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.20,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.50,16.8,113.0,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,13.71,5.65,2.45,20.5,95.0,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740.0
174,13.40,3.91,2.48,23.0,102.0,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750.0
175,13.27,4.28,2.26,20.0,120.0,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835.0
176,13.17,2.59,2.37,20.0,120.0,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840.0


In [22]:
df.shape

(178, 13)

Let's standardize the data first, because PCA is sensitive to the scale of the features.
Standardization rescales each feature so that its mean is zero and its variance is one.
This ensures that all the features are on a similar scale.

In [23]:
# Preprocessing
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)
scaled_df = scaler.transform(df) # could also use fit_transform
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,1.518613,-0.562250,0.232053,-1.169593,1.913905,0.808997,1.034819,-0.659563,1.224884,0.251717,0.362177,1.847920,1.013009
1,0.246290,-0.499413,-0.827996,-2.490847,0.018145,0.568648,0.733629,-0.820719,-0.544721,-0.293321,0.406051,1.113449,0.965242
2,0.196879,0.021231,1.109334,-0.268738,0.088358,0.808997,1.215533,-0.498407,2.135968,0.269020,0.318304,0.788587,1.395148
3,1.691550,-0.346811,0.487926,-0.809251,0.930918,2.491446,1.466525,-0.981875,1.032155,1.186068,-0.427544,1.184071,2.334574
4,0.295700,0.227694,1.840403,0.451946,1.281985,0.808997,0.663351,0.226796,0.401404,-0.319276,0.362177,0.449601,-0.037874
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,0.876275,2.974543,0.305159,0.301803,-0.332922,-0.985614,-1.424900,1.274310,-0.930179,1.142811,-1.392758,-1.231206,-0.021952
174,0.493343,1.412609,0.414820,1.052516,0.158572,-0.793334,-1.284344,0.549108,-0.316950,0.969783,-1.129518,-1.485445,0.009893
175,0.332758,1.744744,-0.389355,0.151661,1.422412,-1.129824,-1.344582,0.549108,-0.422075,2.224236,-1.612125,-1.485445,0.280575
176,0.209232,0.227694,0.012732,0.151661,1.422412,-1.033684,-1.354622,1.354888,-0.229346,1.834923,-1.568252,-1.400699,0.296498
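To make the effect of standardization concrete, here is a small check that computes the z-scores by hand and compares them with `StandardScaler`. It is a sketch that reloads the Wine data as a plain array; the variable names `X_manual` and `X_sklearn` are chosen only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_wine().data                  # shape (178, 13)

# Z-score by hand: subtract each column's mean, divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), as X.std() does.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))           # same values as the scaler
print(np.allclose(X_manual.mean(axis=0), 0.0))    # every feature has zero mean
print(np.allclose(X_manual.std(axis=0), 1.0))     # every feature has unit variance
```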


Now let's perform PCA!

In sklearn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.

Parameters of PCA:

- n_components: Number of components to keep. If n_components is not set, all components are kept.
- whiten: When True (False by default), the projected data are divided, component by component, by the square root of that component's explained variance, which yields uncorrelated outputs with unit component-wise variances. Whitening removes some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of downstream estimators by making their data respect some hard-wired assumptions.

For more details, see https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
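The effect of the whiten parameter can be seen by comparing the variance of the projected data with and without it. The snippet below is a minimal sketch on the standardized Wine data; the names `plain` and `whitened` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_wine().data)

plain = PCA(n_components=3).fit_transform(X)
whitened = PCA(n_components=3, whiten=True).fit_transform(X)

# Without whitening, each component keeps its own variance (largest first).
plain_var = np.var(plain, axis=0, ddof=1)
print(plain_var)

# With whitening, every component is rescaled to unit variance.
whitened_var = np.var(whitened, axis=0, ddof=1)
print(whitened_var)
```

Whitening only rescales the axes: dividing each column of `plain` by the square root of its variance gives the same values as `whitened`.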

In [24]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
new_data = pca.fit_transform(scaled_df)   # new data with 3 principal components
new_data = pd.DataFrame(new_data, columns=['PC1', 'PC2', 'PC3'])
new_data

Unnamed: 0,PC1,PC2,PC3
0,3.316751,-1.443463,-0.165739
1,2.209465,0.333393,-2.026457
2,2.516740,-1.031151,0.982819
3,3.757066,-2.756372,-0.176192
4,1.008908,-0.869831,2.026688
...,...,...,...
173,-3.370524,-2.216289,-0.342570
174,-2.601956,-1.757229,0.207581
175,-2.677839,-2.760899,-0.940942
176,-2.387017,-2.297347,-0.550696


In [30]:
print(pca.explained_variance_ratio_)    # fraction of the total variance explained by each principal component
print(pca.singular_values_)             # singular values corresponding to each principal component
print(pca.explained_variance_)          # variance explained by each principal component (eigenvalues of the covariance matrix)

[0.36198848 0.1920749  0.11123631]
[28.94203422 21.08225141 16.04371561]
[4.73243698 2.51108093 1.45424187]
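The three arrays printed above are related to each other. As a sketch (refitting the same model on the standardized Wine data), the variances can be recovered from the singular values, and the ratios from the variances; `variance_from_svd` and `ratio_from_variance` are names introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_wine().data)
n_samples = X.shape[0]

pca = PCA(n_components=3).fit(X)

# Variance along each component, recovered from the singular values.
variance_from_svd = pca.singular_values_ ** 2 / (n_samples - 1)

# Total variance of the data: the sum of the per-feature sample variances.
total_variance = X.var(axis=0, ddof=1).sum()
ratio_from_variance = pca.explained_variance_ / total_variance

print(np.allclose(variance_from_svd, pca.explained_variance_))
print(np.allclose(ratio_from_variance, pca.explained_variance_ratio_))
```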


In [26]:
print(pca.components_)  # principal axes in feature space, one row per component: shape (n_components, n_features)

[[ 0.1443294  -0.24518758 -0.00205106 -0.23932041  0.14199204  0.39466085
   0.4229343  -0.2985331   0.31342949 -0.0886167   0.29671456  0.37616741
   0.28675223]
 [-0.48365155 -0.22493093 -0.31606881  0.0105905  -0.299634   -0.06503951
   0.00335981 -0.02877949 -0.03930172 -0.52999567  0.27923515  0.16449619
  -0.36490283]
 [-0.20738262  0.08901289  0.6262239   0.61208035  0.13075693  0.14617896
   0.1506819   0.17036816  0.14945431 -0.13730621  0.08522192  0.16600459
  -0.12674592]]
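Each row of components_ is a unit-length direction in the 13-dimensional feature space, and projecting data onto these rows is what transform does. The following sketch checks both properties on the standardized Wine data; `W` and `manual_scores` are names used only here.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_wine().data)
pca = PCA(n_components=3).fit(X)

W = pca.components_                    # shape (3, 13): one row per component

# The rows are orthonormal, so W @ W.T is the 3x3 identity matrix.
print(np.allclose(W @ W.T, np.eye(3)))

# transform() centers the data and projects it onto the rows of W.
manual_scores = (X - pca.mean_) @ W.T
print(np.allclose(manual_scores, pca.transform(X)))
```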


In [27]:
print(pca.get_covariance()) # data covariance estimated by the fitted PCA model (kept components plus isotropic noise), in the original feature space

[[ 1.05579312  0.05482031  0.18366728 -0.28802092  0.36093877  0.27904562
   0.22702748 -0.19211219  0.20219036  0.50552844 -0.1140751   0.03321036
   0.57041942]
 [ 0.05482031  0.80872556  0.20624483  0.30246803  0.00205704 -0.37203396
  -0.43330102  0.34321028 -0.29820104  0.32808029 -0.43497691 -0.4578201
  -0.14324449]
 [ 0.18366728  0.20624483  1.04342443  0.38485763  0.27836946  0.13221557
   0.09000633  0.12995853  0.1181486   0.26070718 -0.131359   -0.00543064
   0.1559266 ]
 [-0.28802092  0.30246803  0.38485763  1.06467545 -0.07115812 -0.31611516
  -0.34087189  0.41223283 -0.23001815 -0.0059975  -0.2458137  -0.27972899
  -0.38162344]
 [ 0.36093877  0.00205704  0.27836946 -0.07115812  0.72770416  0.30052015
   0.27586481 -0.14152778  0.23542691  0.2569887   0.01878937  0.14926839
   0.38473515]
 [ 0.27904562 -0.37203396  0.13221557 -0.31611516  0.30052015  1.13702106
   0.73882122 -0.47681781  0.55877969 -0.09913736  0.47794369  0.6400969
   0.51642409]
 [ 0.22702748 -0.4333010

In [28]:
pca.get_precision() # precision matrix: the inverse of the covariance returned by get_covariance()

array([[ 1.73200446, -0.10240719, -0.08037323,  0.28411099, -0.27265867,
        -0.12906772, -0.0736106 ,  0.11954683, -0.08017507, -0.50269942,
         0.19427729,  0.09253647, -0.46088392],
       [-0.10240719,  2.05253422, -0.224265  , -0.20425464, -0.07357472,
         0.15230063,  0.19507175, -0.18825861,  0.12145114, -0.25050529,
         0.25729595,  0.23750805,  0.00895679],
       [-0.08037323, -0.224265  ,  1.47027152, -0.60710327, -0.30894132,
        -0.18337037, -0.14695793, -0.18889347, -0.17164143, -0.17911984,
         0.08254812, -0.06637621, -0.08961758],
       [ 0.28411099, -0.20425464, -0.60710327,  1.5677838 , -0.05140309,
         0.05424197,  0.06250529, -0.31421315,  0.01020438,  0.10088191,
         0.05835625,  0.0210871 ,  0.27357411],
       [-0.27265867, -0.07357472, -0.30894132, -0.05140309,  2.04679584,
        -0.18354179, -0.1541329 ,  0.03605232, -0.14575023, -0.24490132,
         0.05270646, -0.05244737, -0.264302  ],
       [-0.12906772,  0.152300
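The matrices returned by get_covariance and get_precision come from the fitted model, not directly from the data. As a sketch of that relationship, assuming the default whiten=False, the model covariance can be rebuilt from components_, explained_variance_ and noise_variance_; the names `signal` and `model_cov` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_wine().data)
n_features = X.shape[1]
pca = PCA(n_components=3).fit(X)

W = pca.components_                                    # shape (3, 13)
signal = pca.explained_variance_ - pca.noise_variance_

# Model covariance: variance along the kept components plus isotropic noise.
model_cov = W.T @ np.diag(signal) @ W + pca.noise_variance_ * np.eye(n_features)
print(np.allclose(model_cov, pca.get_covariance()))

# The precision matrix is the inverse of that model covariance.
product = pca.get_precision() @ pca.get_covariance()
print(np.allclose(product, np.eye(n_features)))

# With only 3 of 13 components kept, it differs from the sample covariance.
sample_cov = np.cov(X, rowvar=False)
print(np.allclose(model_cov, sample_cov))
```

The first two checks should print True and the last one False, since the discarded components are replaced by a single averaged noise variance.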

Explained variance is a concept associated with dimensionality reduction techniques, particularly Principal Component Analysis (PCA). In PCA, each principal component corresponds to an eigenvalue of the covariance matrix, which equals the variance of the data along that component. The total variance is the sum of all these eigenvalues.

Here's a breakdown of the terms:

### Eigenvalue: 
In PCA, eigenvalues represent the amount of variance captured by each principal component. Larger eigenvalues indicate that the corresponding principal components explain more variance in the data.

### Explained Variance:
The explained variance, expressed as a ratio, is the proportion of the total variance in the dataset that is "explained" or captured by a subset of the principal components. It is the sum of the eigenvalues of the selected principal components divided by the sum of all eigenvalues.

$$
\text{Explained Variance} = \frac{\sum_{i=1}^{k} \text{Eigenvalue}_i}{\sum_{i=1}^{n} \text{Eigenvalue}_i}
$$

Here, $k$ is the number of selected principal components, and $n$ is the total number of principal components.
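As a worked instance of the formula, the ratio can be computed directly from the eigenvalues of the covariance matrix. This sketch uses the standardized Wine data with k = 3; the variable names are chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_wine().data)

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

k = 3                                   # number of selected components
n = len(eigenvalues)                    # total number of components (13)
explained = eigenvalues[:k].sum() / eigenvalues[:n].sum()
print(round(explained, 4))
```

The result is about 0.665, which matches the sum of the three explained_variance_ratio_ values printed earlier.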
### Interpretation:

A higher explained variance indicates that the selected principal components retain more information about the original data. It is often used to assess how well the reduced-dimensional representation (lower-dimensional space) preserves the variability present in the original data.

For example, if 90% of the total variance is explained by a certain number of principal components, it implies that using only those components retains most of the information in the data.

### Application:

Explained variance is frequently used to select the number of principal components to retain. One common approach is to choose the smallest number of components that collectively explain a sufficiently high percentage of the total variance, e.g., 90% or 95%.
In scikit-learn's PCA implementation, the variance explained by each component is available in the attribute explained_variance_, and the corresponding fraction of the total variance in explained_variance_ratio_. The per-component values are often visualized with a scree plot, alongside the cumulative explained variance, to help determine the appropriate number of components to retain for dimensionality reduction.
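One way to apply this selection rule is sketched below: fit PCA with all components, accumulate the explained variance ratios, and pick the smallest number of components that reaches a threshold. The 95% threshold and the names `full_pca`, `n_needed` and `auto_pca` are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_wine().data)

# Fit with all components to see the full variance profile.
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)

# scikit-learn applies a similar rule when n_components is a float in (0, 1).
auto_pca = PCA(n_components=threshold, svd_solver="full").fit(X)
print(auto_pca.n_components_)
```

Passing a float to n_components saves the manual step, while the cumulative array is still useful for plotting or for comparing several thresholds.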