# Principal Component Analysis (PCA) in Python

A step-by-step tutorial on Principal Component Analysis, a simple yet powerful transformation technique.

Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.

It can be thought of as a projection method: data with m columns (features) is projected into a subspace with m or fewer columns, while retaining as much of the variance of the original data as possible.

The PCA method can be described and implemented using the tools of linear algebra.

PCA is an operation applied to a dataset, represented by an n x m matrix A, that results in a projection of A, which we will call B. Let’s walk through the steps of this operation.
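
As a compact reference, the steps carried out in the cells of this walkthrough can be summarized as follows. This is only a sketch of the notation: the symbols $\mathbf{1}$ (a column of n ones), $Q$ (the matrix whose columns are the eigenvectors) and $\Lambda$ (the diagonal matrix of eigenvalues) are assumptions introduced here for illustration, and the $1/(n-1)$ factor reflects the default normalization of NumPy's `cov`.

$$
\begin{aligned}
M_j &= \frac{1}{n}\sum_{i=1}^{n} A_{ij} && \text{(column means)} \\
C &= A - \mathbf{1} M^{\top} && \text{(centered data)} \\
V &= \frac{1}{n-1}\, C^{\top} C && \text{(covariance matrix)} \\
V Q &= Q \Lambda && \text{(eigendecomposition)} \\
B &= C\, Q && \text{(projection)}
\end{aligned}
$$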

In [9]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig

In [8]:
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)

[[1 2]
 [3 4]
 [5 6]]


In [12]:
A.shape

(3, 2)

In [3]:
# calculate the mean of each column
M = mean(A, axis=0)
print(M)

[3. 4.]


In [4]:
# center columns by subtracting column means
C = A - M
print(C)

[[-2. -2.]
 [ 0.  0.]
 [ 2.  2.]]


In [5]:
# calculate covariance matrix of centered matrix
# (cov treats each row as a variable, so pass the transpose)
V = cov(C.T)
print(V)

[[4. 4.]
 [4. 4.]]


In [6]:
# eigendecomposition of covariance matrix
# (each column of vectors is an eigenvector; eig does not sort the eigenvalues)
values, vectors = eig(V)
print(vectors)
print(values)

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
[8. 0.]


In [7]:
# project the centered data onto the eigenvectors
P = vectors.T.dot(C.T)
print(P.T)

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]
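
The projection above keeps both components, so the number of columns is not actually reduced. The sketch below wraps the same steps in a small function that keeps only the k components with the largest eigenvalues. The function name `pca_project` and its parameter `k` are hypothetical, introduced here for illustration. It also swaps in `numpy.linalg.eigh` for `eig`, on the assumption that the covariance matrix is symmetric, and sorts the eigenvalues explicitly because neither routine returns them in descending order.

```python
import numpy as np

def pca_project(A, k):
    """Project the rows of A onto the k directions of largest variance.

    Hypothetical helper, not part of NumPy or scikit-learn.
    """
    # center each column on its mean
    C = A - A.mean(axis=0)
    # covariance of the columns (rowvar=False: columns are the variables)
    V = np.cov(C, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvalues come back ascending
    values, vectors = np.linalg.eigh(V)
    # indices of the k largest eigenvalues, largest first
    order = np.argsort(values)[::-1][:k]
    # project the centered data onto the selected eigenvectors
    return C @ vectors[:, order], values[order]

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
B, variances = pca_project(A, 1)
print(B.shape)      # one column remains
print(variances)    # variance along the retained component
```

The sign of each projected column depends on the sign of the eigenvector the solver returns, so the values may appear negated relative to the output above.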


### PCA Using scikit-learn

In [14]:
# Principal Component Analysis
from numpy import array
from sklearn.decomposition import PCA

In [16]:
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)

[[1 2]
 [3 4]
 [5 6]]


In [19]:
A.shape

(3, 2)

In [21]:
# create the PCA instance, keeping 2 components
pca = PCA(n_components=2)

In [23]:
# fit on data
pca.fit(A)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [25]:
# access the principal axes (one per row of components_) and the variance along each
print(pca.components_)
print(pca.explained_variance_)

[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]
[8. 0.]


In [26]:
# transform data
B = pca.transform(A)
print(B)

[[-2.82842712e+00 -2.22044605e-16]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.82842712e+00  2.22044605e-16]]


In [27]:
B.shape

(3, 2)
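
With `n_components=2` the transformed data still has two columns. As a minimal sketch of an actual reduction, the example below asks scikit-learn for a single component and then checks how much variance it retains. The variable names `B1` and `A_back` are illustrative. For this particular matrix the three points lie on a straight line, so one component should capture essentially all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# keep only the first principal component
pca = PCA(n_components=1)
B1 = pca.fit_transform(A)   # fit and project in one call
print(B1.shape)

# fraction of the total variance captured by the retained component
print(pca.explained_variance_ratio_)

# map the reduced data back to the original two-column space
A_back = pca.inverse_transform(B1)
print(np.allclose(A_back, A))
```

On data that does not lie exactly in a lower-dimensional subspace, the explained variance ratio would be below 1 and the reconstruction would only approximate the original matrix.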