# Principal Component Analysis (PCA)

## Outline
- Introduction to Principal Component Analysis (PCA)
- Manually calculating PCA
- PCA using sklearn
- Wrap up

  
## Introduction to Principal Component Analysis
   
Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

PCA reduces the number of variables in your data by combining the original variables into a smaller set of components. It lowers the dimensionality of the data while aiming to retain as much information (variance) as possible.

To interpret each principal component, examine the magnitude and sign of the coefficients for the original variables. The larger the absolute value of a coefficient, the more the corresponding variable contributes to the component. How large a coefficient must be to count as important is subjective. Use your domain knowledge to decide at what level a coefficient is important.
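As a rough illustration of reading the coefficients, the sketch below ranks the variables of a small made-up dataset by the absolute value of their coefficient in the first principal component. The data and the variable names (`height`, `weight`, `noise`) are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows), 3 variables (columns).
names = ["height", "weight", "noise"]
X = np.array([
    [1.0, 2.1, 0.3],
    [2.0, 3.9, -0.2],
    [3.0, 6.2, 0.1],
    [4.0, 7.8, -0.1],
    [5.0, 10.1, 0.2],
])

# Center the columns and compute the covariance matrix of the variables.
C = X - X.mean(axis=0)
V = np.cov(C, rowvar=False)

# eigh is intended for symmetric matrices such as a covariance matrix.
values, vectors = np.linalg.eigh(V)

# Coefficients of the component with the largest eigenvalue.
first = vectors[:, np.argmax(values)]

# Rank variables by absolute coefficient, largest first.
order = np.argsort(-np.abs(first))
for i in order:
    print(f"{names[i]:>6}: {first[i]:+.3f}")
```

In this toy data the `noise` column barely varies with the other two, so its coefficient in the first component is close to zero.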


## Manually calculating PCA

There is no **pca()** function in NumPy, but we can calculate PCA step by step using NumPy functions.

The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then computes the eigendecomposition of the covariance matrix. The eigenvectors are taken as the principal components, the eigenvalues as the variance along each component, and the eigenvectors are then used to project the original data.

PCA is an operation applied to a dataset, represented by an n × m matrix A, that results in a projection of A, which we will call B. Let’s walk through the steps of this operation.
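The steps can be summarized in matrix form. This is a sketch under the assumption that the rows of A are samples and the columns are variables; the symbols W (matrix of eigenvectors) and Λ (diagonal matrix of eigenvalues) are notation introduced here, not part of the original walkthrough.

```latex
\begin{aligned}
M_j &= \frac{1}{n}\sum_{i=1}^{n} A_{ij}
  && \text{(step 1: column means)} \\
C_{ij} &= A_{ij} - M_j
  && \text{(step 2: center the columns)} \\
V &= \frac{1}{n-1}\, C^{\top} C
  && \text{(step 3: covariance matrix)} \\
V &= W \Lambda W^{\top}
  && \text{(step 4: eigendecomposition)} \\
B &= C\, W
  && \text{(projection)}
\end{aligned}
```

Keeping only the columns of W with the largest eigenvalues gives a projection B with fewer columns than A.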


In [2]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print("Original matrix")
print(A)
# step 1: calculate the mean of each column
M = mean(A.T, axis=1)
print("Calculate the means of each columns")
print(M)
# step 2: center columns by subtracting column means
C = A - M
print("Center columns")
print(C)
# step 3: calculate covariance matrix of centered matrix
V = cov(C.T)
print("Calculate covariance matrix of centered matrix")
print(V)
# step 4: perform the eigendecomposition of covariance matrix
values, vectors = eig(V)
print("eigenvectors (PCA components)")
print(vectors)
print("eigenvalues (PCA variance)")
print(values)
# project data into the subspace (reduction)
P = vectors.T.dot(C.T)
print("PCA reduction")
print(P.T)

Original matrix
[[1 2]
 [3 4]
 [5 6]]
Calculate the means of each columns
[3. 4.]
Center columns
[[-2. -2.]
 [ 0.  0.]
 [ 2.  2.]]
Calculate covariance matrix of centered matrix
[[4. 4.]
 [4. 4.]]
eigenvectors (PCA components)
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
eigenvalues (PCA variance)
[8. 0.]
PCA reduction
[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
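Note that `eig` does not promise any particular ordering of the eigenvalues. A possible variant, sketched below, uses `numpy.linalg.eigh` (intended for symmetric matrices such as a covariance matrix), sorts the components by decreasing eigenvalue, and keeps only the top `k`. The helper name `pca_manual` is hypothetical.

```python
import numpy as np

def pca_manual(A, k):
    """Hypothetical helper: project the rows of A onto the top-k components."""
    # Center the columns and compute the covariance of the variables.
    C = A - A.mean(axis=0)
    V = np.cov(C, rowvar=False)
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    values, vectors = np.linalg.eigh(V)
    # Reorder so the largest eigenvalue comes first.
    order = np.argsort(values)[::-1]
    values, vectors = values[order], vectors[:, order]
    # Keep the top-k eigenpairs and project the centered data.
    return values[:k], vectors[:, :k], C @ vectors[:, :k]

A = np.array([[1, 2], [3, 4], [5, 6]])
values, components, B = pca_manual(A, k=1)
print(values)      # variance along the retained component
print(components)  # one column per retained component
print(B)           # projected data, shape (3, 1)
```

The sign of an eigenvector is arbitrary, so the projected values may come out negated relative to the output above.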


## PCA using sklearn

We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.

When creating the class, the number of components can be specified as a parameter.

The instance is first fit on a dataset by calling the fit() method, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() method.

Once fit, the eigenvalues and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.
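A minimal sketch of this fit-then-transform workflow is shown below, assuming the small 3×2 matrix from the manual section as training data. The rows in `new` are hypothetical unseen samples, added only to illustrate reuse of a fitted projection.

```python
import numpy as np
from sklearn.decomposition import PCA

train = np.array([[1, 2], [3, 4], [5, 6]])
new = np.array([[7, 8], [0, 1]])  # hypothetical unseen rows

# Fit once on the training data, keeping a single component.
pca = PCA(n_components=1)
pca.fit(train)

# Eigenvalue (variance) and direction of the retained component.
print(pca.explained_variance_)
print(pca.components_)

# Reuse the fitted projection: transform() centers the new rows with the
# training means and projects them onto the fitted component.
projected = pca.transform(new)
print(projected)
```

Because `transform()` uses the means learned during `fit()`, the new rows are projected consistently with the training data.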

The example below demonstrates using this class by first creating an instance, fitting it on a 3×10 matrix, accessing the values and vectors of the projection, and transforming the original data.

In [5]:
# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA
# define a 3x10 matrix (3 samples, 10 features)
A = array([
[1,2,3,4,5,6,7,8,9,10],
[11,12,13,14,15,16,17,18,19,20],
[21,22,23,24,25,26,27,28,29,30]])
print('original matrix')
print(A)
# create the PCA instance, keeping 2 components
pca = PCA(n_components=2)
# fit on data
pca.fit(A)
# access values and vectors
print('PCA components')
print(pca.components_)
print('PCA explained variance')
print(pca.explained_variance_)
# transform data
B = pca.transform(A)
print('PCA transform')
print(B)

original matrix
[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]
 [21 22 23 24 25 26 27 28 29 30]]
PCA components
[[-0.31622777 -0.31622777 -0.31622777 -0.31622777 -0.31622777 -0.31622777
  -0.31622777 -0.31622777 -0.31622777 -0.31622777]
 [ 0.9486833  -0.10540926 -0.10540926 -0.10540926 -0.10540926 -0.10540926
  -0.10540926 -0.10540926 -0.10540926 -0.10540926]]
PCA explained variance
[1.00000000e+03 7.09974815e-30]
PCA transform
[[ 3.16227766e+01 -3.10862447e-15]
 [ 0.00000000e+00  0.00000000e+00]
 [-3.16227766e+01  3.10862447e-15]]


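As in the manual example, all of the variance lies along a single direction: the first component has an explained variance of 1000, while the second is zero up to floating point rounding. Because this example uses a 3×10 matrix rather than the 3×2 matrix from the manual calculation, the numbers themselves differ, but the structure of the result is the same. The sign of a component is arbitrary, so the projected values may come out negated.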
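One way to check the agreement between the manual calculation and scikit-learn is to run both on the same matrix. The sketch below does this for the 3×10 matrix above and compares the largest eigenvalue and the first projected column up to sign. The variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same 3x10 matrix as in the scikit-learn example: rows 1..10, 11..20, 21..30.
A = np.arange(1, 31).reshape(3, 10)

# Manual route: center, covariance, eigendecomposition, projection.
C = A - A.mean(axis=0)
V = np.cov(C, rowvar=False)
values, vectors = np.linalg.eigh(V)  # ascending eigenvalues
top_value = values[-1]               # largest eigenvalue
manual = C @ vectors[:, -1]          # projection onto the top component

# scikit-learn route.
pca = PCA(n_components=1)
sk = pca.fit_transform(A).ravel()

print(top_value, pca.explained_variance_[0])
# Compare absolute values because the sign of a component is arbitrary.
print(np.allclose(np.abs(manual), np.abs(sk)))
```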

## Wrap up
We discussed:
- Introduction to Principal Component Analysis (PCA)
- Manually calculating PCA
- PCA using sklearn

Examples adapted from Jason Brownlee, PhD, retrieved from: https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/