# Reference
https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/

# Manual Calculation

In [1]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig

Define a 3×2 matrix

In [2]:
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)

[[1 2]
 [3 4]
 [5 6]]


Center the data in the matrix

In [3]:
# calculate the mean of each column
M = mean(A, axis=0)
print(M)

[3. 4.]


In [4]:
# center columns by subtracting column means
C = A - M
print(C)

[[-2. -2.]
 [ 0.  0.]
 [ 2.  2.]]


Calculate covariance matrix of centered data

Correlation: a normalized measure of how strongly, and in which direction (positive or negative), two columns change together. <br>
Covariance: the unnormalized counterpart of correlation; it keeps the units of the data and extends to many columns through the covariance matrix.

A covariance matrix holds the covariance of every column of a given matrix with every other column, including itself (the diagonal entries are the column variances).
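The definitions above can be checked by hand on the same 3×2 matrix. The following is a minimal sketch, assuming the sample covariance (division by n - 1, which is also the default of `numpy.cov`); the names `V_manual`, `V_numpy`, and `R_manual` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same 3x2 example matrix used in this notebook
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Center each column, then apply the sample covariance formula C^T C / (n - 1)
C = A - A.mean(axis=0)
n = A.shape[0]
V_manual = C.T @ C / (n - 1)

# numpy.cov treats rows as variables by default; rowvar=False treats columns as variables
V_numpy = np.cov(A, rowvar=False)

# Correlation is covariance divided by the standard deviations of both columns
std = np.sqrt(np.diag(V_manual))
R_manual = V_manual / np.outer(std, std)

print(V_manual)
print(R_manual)
```

Because the two columns of this matrix move together exactly, every entry of the correlation matrix is 1, while the covariance entries keep the scale of the data.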

In [5]:
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)

[[4. 4.]
 [4. 4.]]


Calculate eigendecomposition of covariance matrix

The eigenvectors of the covariance matrix are taken as the principal components, and the eigenvalues give the variance along each of them. The eigenvectors are then used to project the original data.

Eigenvectors: directions or components for the reduced subspace <br>
Eigenvalues: magnitudes for the directions
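To make the roles of the two quantities concrete, this short sketch checks the defining relation V v = λ v for the 2×2 covariance matrix of this example. It assumes, as `numpy.linalg.eig` documents, that the eigenvectors are returned as the columns of the second result; the variable names are illustrative.

```python
import numpy as np

# Covariance matrix of the centered example data
V = np.array([[4.0, 4.0], [4.0, 4.0]])

values, vectors = np.linalg.eig(V)

# Each column of `vectors` is one eigenvector; check V @ v == lambda * v
for i in range(len(values)):
    v = vectors[:, i]
    residual = V @ v - values[i] * v
    print(f"eigenvalue {values[i]:.1f}, residual norm {np.linalg.norm(residual):.1e}")

# The eigenvectors are unit length, so they only encode a direction
lengths = np.linalg.norm(vectors, axis=0)
```

The sign of each eigenvector is arbitrary, so a different library or version may return the same directions with flipped signs.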

In [6]:
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print("eigenvectors:")
print(vectors)
print("eigenvalues:")
print(values)

eigenvectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
eigenvalues:
[8. 0.]


In [7]:
# project data
P = vectors.T.dot(C.T)
print(P.T)

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]


The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace.

If all eigenvalues have a similar value, the existing representation may already be reasonably compressed or dense, and the projection may offer little. <br>
If there are eigenvalues close to 0, they represent components that may be discarded. <br>
Select k eigenvectors (principal components) that have the k largest eigenvalues.

In this example only the first eigenvector is required, because the second eigenvalue is 0. <br>
This suggests that the 3×2 matrix could be projected onto a 3×1 matrix with little loss.
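The selection step described above can be sketched as follows. This is a minimal illustration rather than the reference article's code: `k`, `W`, `order`, and `A_approx` are names introduced here, and the reconstruction at the end is added only to show how little is lost.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
mean = A.mean(axis=0)
C = A - mean

values, vectors = np.linalg.eig(np.cov(C, rowvar=False))

# Rank the eigenpairs by eigenvalue, largest first
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

# Keep the k eigenvectors with the k largest eigenvalues
k = 1
W = vectors[:, :k]        # shape (2, k)
P = C @ W                 # projected data, shape (3, k)

# Map the reduced data back to the original space to inspect the loss
A_approx = P @ W.T + mean
print(P)
print(np.abs(A - A_approx).max())
```

For this matrix the reconstruction error is at the level of floating-point rounding, which matches the zero second eigenvalue.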

# Reusable
Use the `PCA` class in the scikit-learn library.
When creating an instance, the number of components can be specified as a parameter.

The instance is first fit on a dataset by calling `fit()`.
The data can then be projected into a subspace with the chosen number of dimensions by calling `transform()`.

Once fit, the eigenvalues and principal components can be accessed on the instance via the `explained_variance_` and `components_` attributes.

In [8]:
# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA

In [9]:
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)

[[1 2]
 [3 4]
 [5 6]]


In [10]:
# create the PCA instance
pca = PCA(n_components=2)

In [11]:
# fit on data
pca.fit(A)

PCA(n_components=2)

In [12]:
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)

[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
[8.00000000e+00 2.25080839e-33]


In [13]:
# transform data
B = pca.transform(A)
print(B)

[[-2.82842712e+00  2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00 -2.22044605e-16]]
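Since the second component carries essentially no variance, the same class can also be asked for a single component. The sketch below is an illustration, not part of the original notebook; it uses `fit_transform()`, `explained_variance_ratio_`, and `inverse_transform()` from scikit-learn's `PCA`, and the variable names are chosen here.

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Keep only the first principal component
pca = PCA(n_components=1)
B = pca.fit_transform(A)          # fit and project in one call, shape (3, 1)

# Fraction of the total variance captured by the kept component
ratio = pca.explained_variance_ratio_

# Map the 3x1 projection back to the original 3x2 space
A_restored = pca.inverse_transform(B)

print(B)
print(ratio)
print(A_restored)
```

Note that `components_` stores one component per row, whereas `numpy.linalg.eig` returns eigenvectors as columns, and the signs of the components may differ between the two approaches.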
