### Another Example of Principal Component Analysis

In [1]:
import numpy as np

In [2]:
X = np.array([[10,20,10],
              [2,5,2],
              [8,17,7],
              [9,20,10],
              [12,22,11]])

In [3]:
print(X)

[[10 20 10]
 [ 2  5  2]
 [ 8 17  7]
 [ 9 20 10]
 [12 22 11]]


#### Let's first try to derive the principal components directly, without using scikit-learn's decomposition module

In [4]:
# To do PCA, you first need to center your data matrix
X = np.asmatrix(X)         # np.asmatrix replaces np.mat, which newer NumPy releases no longer provide
meanVals = X.mean(axis=0)
A = X - meanVals           # A is the zero-mean (centered) version of X
C = np.cov(A, rowvar=0)    # C is the covariance matrix of X
print(C) # X has 3 features, so you end up with a 3 x 3 covariance matrix

[[14.2  25.3  13.5 ]
 [25.3  46.7  24.75]
 [13.5  24.75 13.5 ]]


<p>Each column is a feature, so you compute the mean of each feature, subtract it from the original matrix, and then compute the covariance of the centered data.</p>

In [5]:
meanVals

matrix([[ 8.2, 16.8,  8. ]])

<p>The first feature's mean is 8.2, the second is 16.8, and the third is 8.</p>
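The centering step can be checked directly. The sketch below uses a plain NumPy array instead of a matrix object (an assumption made for simplicity; the names `mean_vals` and `A` are chosen here for illustration) and confirms that every column of the centered data has a mean of approximately zero.

```python
import numpy as np

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

mean_vals = X.mean(axis=0)   # one mean per column (feature)
A = X - mean_vals            # broadcasting subtracts the means from every row

print(mean_vals)             # the per-feature means
print(A.mean(axis=0))        # approximately zero for every feature
```

Centering matters because the covariance formula used in this notebook assumes the data have zero mean in every feature.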

In [6]:
# Note that C = (1/(N-1)) * A.T * A
# you can compute C this way and verify that it matches what numpy computed in cell 4
N = np.shape(X)[0]
print(np.dot(A.T,A)/(N-1))

[[14.2  25.3  13.5 ]
 [25.3  46.7  24.75]
 [13.5  24.75 13.5 ]]
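To make the formula concrete, a single entry of the covariance matrix can be written out term by term. This is a small sketch using a plain array; `c01` is a name introduced here for the covariance between the first and second features.

```python
import numpy as np

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

A = X - X.mean(axis=0)   # centered data
N = X.shape[0]           # number of data points

# Covariance of feature 0 and feature 1:
# sum over rows of (centered value 0 * centered value 1), divided by N - 1
c01 = np.sum(A[:, 0] * A[:, 1]) / (N - 1)

print(c01)                               # hand-computed entry
print(np.cov(X, rowvar=False)[0, 1])     # the same entry from np.cov
```

Both printed values should agree with the 25.3 shown in row 1, column 2 of the matrix above.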


#### Now we can obtain eigenvalues and eigenvectors of the covariance matrix:

In [7]:
# with numpy we can obtain the eigenvectors of C; we pass in the covariance matrix C computed above
# and get the eigenvalues and eigenvectors at once (the eigenvectors are the columns of the result)
eigen_values, eigen_vectors = np.linalg.eig(C)
print("Eigen Values:", eigen_values, "\n")
print("Eigen Vectors:\n", eigen_vectors)

Eigen Values: [73.71803604  0.38355337  0.29841058] 

Eigen Vectors:
 [[ 0.43405692  0.89979879 -0.04423488]
 [ 0.79486757 -0.40562416 -0.45128105]
 [ 0.42400487 -0.16072079  0.89128486]]


<p>The eigenvalues show a very large first component (73.718) and almost nothing on the second and third components.</p>
<p>The eigenvectors are the columns of the printed matrix, so the most important eigenvector of this data set is the first column, [0.43405692 0.79486757 0.42400487].</p>
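Because `np.linalg.eig` returns the eigenvectors as columns, it is worth checking the defining relation C v = λ v column by column. The sketch below also computes each eigenvalue's share of the total variance; the name `explained` is introduced here for illustration.

```python
import numpy as np

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

C = np.cov(X, rowvar=False)                  # np.cov centers the data itself
eigen_values, eigen_vectors = np.linalg.eig(C)

# Column i of eigen_vectors pairs with eigen_values[i]
for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]
    print(i, np.allclose(C @ v, eigen_values[i] * v))

# Share of the total variance carried by each component
explained = eigen_values / eigen_values.sum()
print(explained)
```

The largest share belongs to the first component, which is why a single dimension is enough for this data set.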

#### Now we can transform the full data into the new feature space based on the eigenvectors:

In [8]:
# we transform the old features by using the transpose of the eigenvectors
# this is similar to the U * A product on the slide

newFeatures = eigen_vectors.T
XTrans = np.dot(newFeatures, A.T)
print(XTrans.T) 

[[ 4.17288843e+00  1.98923940e-04  2.58847587e-01]
 [-1.46146195e+01  1.71937350e-01  2.51663442e-01]
 [-3.51842744e-01 -1.00363798e-01 -9.72694089e-01]
 [ 3.73883152e+00 -8.99599868e-01  3.03082462e-01]
 [ 7.05474229e+00  8.27827393e-01  1.59100599e-01]]


##### This is the transformed matrix in the new space, with all the dimensions for our data points, but we are not going to need all of them.

<p>Of these new features, we need only one in the new space.</p>
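One way to see why a single new feature is enough: in the new space the features are uncorrelated, and the variance of each one equals the corresponding eigenvalue. The following sketch checks this with plain arrays; `Z` and `cov_Z` are names introduced here.

```python
import numpy as np

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

A = X - X.mean(axis=0)
C = np.cov(A, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eig(C)

Z = A @ eigen_vectors              # rows are data points in the new space
cov_Z = np.cov(Z, rowvar=False)    # covariance of the transformed features

np.set_printoptions(precision=4, suppress=True)
print(cov_Z)                       # off-diagonal entries are approximately zero
print(eigen_values)                # matches the diagonal of cov_Z
np.set_printoptions()              # restore default print options
```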

#### However, typically, we want a lower-dimensional space. We can sort the eigenvectors in decreasing order of their eigenvalues and take the top k. In cell 9, we'll take only the first principal component (it already has the largest eigenvalue, so no sorting is necessary):
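Before the single-component example, here is a minimal sketch of the general case, where the eigenvalues are not already in decreasing order. The helper name `top_k_components` is hypothetical, and `np.linalg.eigh` is used here in place of `np.linalg.eig` because a covariance matrix is symmetric.

```python
import numpy as np

def top_k_components(C, k):
    """Return the k largest eigenvalues of C and their eigenvectors (as columns)."""
    eigen_values, eigen_vectors = np.linalg.eigh(C)   # eigh is meant for symmetric matrices
    order = np.argsort(eigen_values)[::-1]            # indices, largest eigenvalue first
    top = order[:k]
    return eigen_values[top], eigen_vectors[:, top]

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

C = np.cov(X, rowvar=False)
vals, vecs = top_k_components(C, k=2)

print(vals)         # the two largest eigenvalues, in decreasing order
print(vecs.shape)   # one column per kept component
```

Sorting by index with `np.argsort` keeps each eigenvector paired with its own eigenvalue, which sorting the eigenvalues alone would not do.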

In [9]:
# the point is that we can drop features and keep only one of the eigenvectors:
# take the first column of eigen_vectors from cell 7, [0.43405692 0.79486757 0.42400487],
# which is the direction of the first principal component, and use
# it in the dot product with A.T (transpose of A);
# this gives you only one dimension in the new space.
reducedFeatures = eigen_vectors[:,0].T
reducedXTrans = np.dot(reducedFeatures, A.T)
print(reducedXTrans.T)

[[  4.17288843]
 [-14.6146195 ]
 [ -0.35184274]
 [  3.73883152]
 [  7.05474229]]


<p>In the new space, this is the only feature that we have.</p>
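To get a feel for how little is lost by keeping one feature, the data can be mapped back to the original space from that single coordinate. This is a sketch under the assumption that reconstruction is done as `z * v + mean`; the names `X_hat`, `sse`, and `discarded` are introduced here. The squared reconstruction error, divided by N - 1, should equal the sum of the discarded eigenvalues.

```python
import numpy as np

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

mean_vals = X.mean(axis=0)
A = X - mean_vals
N = X.shape[0]

eigen_values, eigen_vectors = np.linalg.eigh(np.cov(A, rowvar=False))
v = eigen_vectors[:, np.argmax(eigen_values)]   # first principal direction

z = A @ v                                # 1-D coordinates, one per data point
X_hat = np.outer(z, v) + mean_vals       # approximate reconstruction of X

sse = np.sum((X - X_hat) ** 2)           # total squared reconstruction error
discarded = eigen_values.sum() - eigen_values.max()

print(sse / (N - 1), discarded)          # the two numbers should agree
```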

#### We can also use scikit-learn's decomposition module to do the same thing:

<p>In scikit-learn, we never have to look at these calculations.</p>

In [10]:
from sklearn import decomposition

In [11]:
pca = decomposition.PCA(svd_solver='randomized')
XTrans = pca.fit_transform(X) # fit PCA to your data and immediately get the transformed data



In [12]:
np.set_printoptions(precision=3, suppress=True)

print(XTrans) # this is the full transform that we got through multiple steps in cell 8; the signs are flipped, which is fine because an eigenvector's sign is arbitrary

[[-4.173 -0.    -0.259]
 [14.615 -0.172 -0.252]
 [ 0.352  0.1    0.973]
 [-3.739  0.9   -0.303]
 [-7.055 -0.828 -0.159]]


<p>This just computes your new features.</p>
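scikit-learn's documentation describes its PCA as a singular value decomposition of the centered data. The sketch below is not scikit-learn's actual code and ignores the randomized solver chosen in cell 11; it only illustrates, with `np.linalg.svd`, why an SVD route gives the same new features as the eigenvector route up to the sign of each column.

```python
import numpy as np

X = np.array([[10, 20, 10],
              [2, 5, 2],
              [8, 17, 7],
              [9, 20, 10],
              [12, 22, 11]], dtype=float)

A = X - X.mean(axis=0)
N = X.shape[0]

# Route 1: eigenvectors of the covariance matrix, largest eigenvalue first
eigen_values, eigen_vectors = np.linalg.eigh(np.cov(A, rowvar=False))
order = np.argsort(eigen_values)[::-1]
scores_eig = A @ eigen_vectors[:, order]

# Route 2: SVD of the centered data, A = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)
scores_svd = U * S                       # same as A @ Vt.T

# Columns may differ in sign, so compare absolute values
print(np.allclose(np.abs(scores_eig), np.abs(scores_svd)))
# Squared singular values, scaled by N - 1, are the eigenvalues
print(np.allclose(S ** 2 / (N - 1), eigen_values[order]))
```

The sign difference is harmless: v and -v describe the same direction, so either one is a valid principal component.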

#### The remaining part of this notebook is another example of using PCA for dimensionality reduction.

In [13]:
# we have a new matrix of data
M = np.array([[2.5, 2.4],
           [0.5, 0.7],
           [2.2, 2.9],
           [1.9, 2.2],
           [3.1, 3.0],
           [2.3, 2.7],
           [2, 1.6],
           [1, 1.1],
           [1.5, 1.6],
           [1.1, 0.9]])

In [14]:
meanM = M.mean(axis=0) # compute the mean 
# center the matrix (matrix - mean)
MC = M - meanM                 # MC is the zero-mean (centered) version of M

# compute the covariance
CovM = np.cov(MC, rowvar=0)    # CovM is the covariance matrix of M
print("Zero Mean Matrix:\n", MC,"\n")
print("Covariance Matrix:\n", CovM,"\n")

Zero Mean Matrix:
 [[ 0.69  0.49]
 [-1.31 -1.21]
 [ 0.39  0.99]
 [ 0.09  0.29]
 [ 1.29  1.09]
 [ 0.49  0.79]
 [ 0.19 -0.31]
 [-0.81 -0.81]
 [-0.31 -0.31]
 [-0.71 -1.01]] 

Covariance Matrix:
 [[0.617 0.615]
 [0.615 0.717]] 



In [15]:
eigVals, eigVecs = np.linalg.eig(CovM)
print("Eigenvalues:\n", eigVals,"\n")
print("Eigenvectors:\n", eigVecs,"\n")

Eigenvalues:
 [0.049 1.284] 

Eigenvectors:
 [[-0.735 -0.678]
 [ 0.678 -0.735]] 



<p>Here are the eigenvalues (1.284 is the largest) and the eigenvectors for our two features. The eigenvectors are the columns of the printed matrix, so the second column, [-0.678 -0.735], which pairs with 1.284, is the more important one.</p>
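Note that `np.linalg.eig` does not return the eigenvalues in any particular order; here the largest one happens to come second. Rather than hard-coding the column index, a small sketch like the following picks the principal eigenvector with `np.argmax` (the names `idx` and `principal` are introduced here).

```python
import numpy as np

M = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7],
              [2, 1.6],
              [1, 1.1],
              [1.5, 1.6],
              [1.1, 0.9]])

CovM = np.cov(M, rowvar=False)
eigVals, eigVecs = np.linalg.eig(CovM)

idx = np.argmax(eigVals)        # position of the largest eigenvalue
principal = eigVecs[:, idx]     # matching eigenvector is the column at that position

print(idx, eigVals[idx])
print(principal)                # may come out with either sign
```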

In [16]:
# We are going to take just one of these vectors
# The second eigenvalue (1.284 in cell 15) is the larger one, so we take the second column, [-0.678 -0.735], for the transformation
newFeatures = eigVecs[:,1].T 
print(newFeatures)

[-0.678 -0.735]


In [17]:
MTrans = np.dot(newFeatures, MC.T)
print(np.asmatrix(MTrans).T)   # np.asmatrix replaces np.mat, which newer NumPy releases no longer provide

[[-0.828]
 [ 1.778]
 [-0.992]
 [-0.274]
 [-1.676]
 [-0.913]
 [ 0.099]
 [ 1.145]
 [ 0.438]
 [ 1.224]]


<p>So instead of two features, you are working with just one feature.</p>
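How much information does the single feature keep? A rough measure is the share of total variance carried by the largest eigenvalue. The sketch below uses `np.linalg.eigvalsh`, which returns only the eigenvalues of a symmetric matrix; `retained` is a name introduced here.

```python
import numpy as np

M = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7],
              [2, 1.6],
              [1, 1.1],
              [1.5, 1.6],
              [1.1, 0.9]])

CovM = np.cov(M, rowvar=False)
eigVals = np.linalg.eigvalsh(CovM)        # eigenvalues only, for a symmetric matrix

retained = eigVals.max() / eigVals.sum()  # fraction of variance kept by one component
print(round(retained, 3))
```

With the eigenvalues from cell 15 (about 1.284 and 0.049), this works out to roughly 96% of the variance, so dropping the second feature loses very little.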

In [18]:
# Instead, we can use scikit-learn's decomposition.PCA and specify the number of components

pca2 = decomposition.PCA(n_components=1)
MTrans2 = pca2.fit_transform(M)
print(MTrans2)

[[-0.828]
 [ 1.778]
 [-0.992]
 [-0.274]
 [-1.676]
 [-0.913]
 [ 0.099]
 [ 1.145]
 [ 0.438]
 [ 1.224]]


<p>In scikit-learn you tell it how many components you want. You could keep them all, but if you ask for just one, you get exactly the same answer as the manual computation in cell 17.</p>