# Principal Component Analysis: Data Visualisation

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris 
%matplotlib inline

# Loading the Iris Dataset

In [3]:
# load dataset into dataframe

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None, sep=',')
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

In [4]:
# show the last 5 lines of the table

df.tail()

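|     | sepal length | sepal width | petal length | petal width | class          |
|-----|--------------|-------------|--------------|-------------|----------------|
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | Iris-virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | Iris-virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | Iris-virginica |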


In [5]:
# split data table into data X and class labels y

X = df.iloc[:,0:4].values
y = df.iloc[:,4].values

# Standardise

In [6]:
# standardise each feature to a unit scale (mean=0, variance=1)

X_std = StandardScaler().fit_transform(X)
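
As a rough check of what the standardisation step does, the sketch below applies the same transformation by hand with NumPy and confirms that every column ends up with mean 0 and variance 1. The array `X_demo` is a synthetic stand-in for the Iris measurements (an assumption, so that the snippet runs without downloading anything), and the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` uses.

```python
import numpy as np

# Synthetic stand-in for the 150 x 4 Iris feature matrix (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.8, 3.0, 3.7, 1.2],
                    scale=[0.8, 0.4, 1.7, 0.7],
                    size=(150, 4))

# Subtract the column mean and divide by the column (population) standard deviation
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

col_means = X_demo_std.mean(axis=0)
col_vars = X_demo_std.var(axis=0)

print("column means close to 0:", np.allclose(col_means, 0.0))
print("column variances close to 1:", np.allclose(col_vars, 1.0))
```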

# Calculating the Principal Components

There are two main ways to calculate the eigenvectors that correspond to each principal component: eigendecomposition of the correlation matrix, and singular value decomposition (SVD) of the data matrix. Since SVD is computationally more efficient, most PCA implementations use the SVD method. I will show both methods to confirm that they give the same results.
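
The reason the two methods agree can be sketched in a few lines. This assumes the data matrix $X_{std}$ (with $n$ rows) has been standardised with the population standard deviation, as `StandardScaler` does, so that its correlation matrix $C$ is simply a scaled product of the data with itself. The symbols $C$, $U$, $\Sigma$, $V$ and $\sigma_i$ are notation introduced here for illustration.

$$
C = \frac{1}{n} X_{std}^{\top} X_{std},
\qquad
X_{std}^{\top} = U \Sigma V^{\top}
\;\;\Longrightarrow\;\;
C = \frac{1}{n} U \Sigma V^{\top} V \Sigma^{\top} U^{\top} = U \left( \frac{1}{n} \Sigma \Sigma^{\top} \right) U^{\top}
$$

So the columns of $U$ (the left singular vectors of $X_{std}^{\top}$) are eigenvectors of $C$, and the corresponding eigenvalues are $\lambda_i = \sigma_i^2 / n$, where $\sigma_i$ are the singular values.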

## Correlation Matrix

In [19]:
cor_mat = np.corrcoef(X_std.T)

eig_vals, eig_vecs = np.linalg.eig(cor_mat)

print("Eigenvectors \n %s" %eig_vecs)
print("\n")
print("Eigenvalues \n %s" %eig_vals)

Eigenvectors 
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]


Eigenvalues 
[2.91081808 0.92122093 0.14735328 0.02060771]


## Singular Value Decomposition

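The left singular vectors of the standardised data match the eigenvectors of the correlation matrix. In the output below, the first and third columns have the opposite sign to those from the eigendecomposition; this is expected, because an eigenvector is only defined up to sign.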

In [10]:
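# columns of u are the left singular vectors of X_std.T (the principal axes);
# s holds the singular values and v the transposed right singular vectors
u, s, v = np.linalg.svd(X_std.T)
u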

array([[-0.52237162, -0.37231836,  0.72101681,  0.26199559],
       [ 0.26335492, -0.92555649, -0.24203288, -0.12413481],
       [-0.58125401, -0.02109478, -0.14089226, -0.80115427],
       [-0.56561105, -0.06541577, -0.6338014 ,  0.52354627]])
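
The agreement between the two methods can also be checked numerically rather than by eye. The sketch below is a minimal version of that check on a synthetic matrix `A` (an assumption, used so the snippet is self-contained). It uses `np.linalg.eigh` instead of `np.linalg.eig`, since the correlation matrix is symmetric, compares the eigenvectors with the left singular vectors up to sign, and compares the eigenvalues with the squared singular values divided by the number of samples.

```python
import numpy as np

# Synthetic data with correlated columns (stand-in for the standardised Iris data)
rng = np.random.default_rng(0)
A = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
A_std = (A - A.mean(axis=0)) / A.std(axis=0)
n_samples = A_std.shape[0]

# Method 1: eigendecomposition of the correlation matrix.
# eigh returns eigenvalues in ascending order, so reorder to descending.
vals, vecs = np.linalg.eigh(np.corrcoef(A_std.T))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Method 2: SVD of the transposed data (singular values come back in descending order)
u, s, vh = np.linalg.svd(A_std.T, full_matrices=False)

# Eigenvalues should equal the squared singular values divided by n
eigvals_match = np.allclose(vals, s**2 / n_samples)

# Matching unit vectors have |dot product| = 1, whatever their sign
cosines = np.abs(np.sum(u * vecs, axis=0))
vecs_match = np.allclose(cosines, 1.0, atol=1e-6)

print("eigenvalues match:", eigvals_match)
print("eigenvectors match up to sign:", vecs_match)
```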

# Sort the Eigenpairs

In order to decide which eigenvectors can be removed without losing too much of the original information, look at the eigenvalues of the correlation matrix: each eigenvalue measures how much of the data's variance lies along its eigenvector. The eigenvectors with the lowest eigenvalues can be removed without losing much information.
Therefore, we will rank the eigenvalues from highest to lowest and keep the top $k$ eigenvectors.

In [26]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

Eigenvalues in descending order:
2.910818083752051
0.9212209307072246
0.14735327830509562
0.02060770723562511
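
Once the eigenpairs are sorted, keeping the top $k$ eigenvectors amounts to stacking them as the columns of a projection matrix and multiplying the standardised data by it. The sketch below illustrates this with `k = 2` on synthetic data; the names `B`, `pairs`, `W` and `B_proj` are hypothetical and chosen only for this example.

```python
import numpy as np

# Synthetic data with correlated columns (stand-in for the standardised Iris data)
rng = np.random.default_rng(1)
B = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
B_std = (B - B.mean(axis=0)) / B.std(axis=0)

# Build and sort (eigenvalue, eigenvector) pairs from high to low
vals, vecs = np.linalg.eigh(np.corrcoef(B_std.T))
pairs = [(vals[i], vecs[:, i]) for i in range(len(vals))]
pairs.sort(key=lambda p: p[0], reverse=True)

# Stack the top-k eigenvectors as columns of the projection matrix W (4 x k)
k = 2
W = np.column_stack([vec for _, vec in pairs[:k]])

# Project the data onto the k-dimensional subspace (150 x k)
B_proj = B_std @ W

# The variance of each projected column equals the corresponding eigenvalue
proj_vars = B_proj.var(axis=0)
print("projected shape:", B_proj.shape)
print("variances match eigenvalues:",
      np.allclose(proj_vars, [pairs[0][0], pairs[1][0]]))
```

Choosing `k = 2` here is only for illustration; it is convenient for a two-dimensional scatter plot.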


# Visualisation of Variance
    