In [32]:
# initial library import
import numpy as np 
import pandas as pd
import seaborn as sns

**Principal component analysis of a dataset :**
- Despite what the name might suggest, PCA is a dimensionality reduction technique: it projects the data onto the few directions of highest variance, which makes the data easier to process and visualize.
- In this notebook, we'll demonstrate this by reducing the 784-dimensional images of the MNIST dataset to a 2D visualization.

In [33]:
# data import
data = pd.read_csv('../input/mnist-data/train.csv') 
data.head() 

In [34]:
# separating the labels from the pixel features
label = data['label'] # save label data for later use
data.drop('label', axis = 1, inplace = True)
data.head()

**Data standardization :**

PCA looks for the directions of greatest variance, so variables measured on larger scales dominate the result. If the features are not scaled, the components reflect the units of measurement rather than the structure of the data. For example, if one variable lies in the range 50-100 and another in the range 5-10, PCA will give far more weight to the first. Standardizing the dataset before applying PCA resolves this.
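
The effect of scale can be seen on a small synthetic example. The sketch below is only an illustration: the data, the helper `first_component`, and all variable names are assumptions made for this example and are not part of the MNIST workflow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent synthetic variables on very different scales
# (illustrative data, not taken from MNIST).
big = rng.uniform(50, 100, size=1000)   # range 50-100
small = rng.uniform(5, 10, size=1000)   # range 5-10
X = np.column_stack((big, small))

def first_component(X):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
    return eigenvectors[:, -1]

# Without scaling, the first component points almost entirely along `big`.
raw_pc1 = first_component(X)

# After standardizing each column to mean 0 and standard deviation 1,
# both variables have equal variance, so scale alone no longer decides the result.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)

print("first component, raw data:         ", np.round(raw_pc1, 3))
print("first component, standardized data:", np.round(std_pc1, 3))
```

On the raw data the first component is essentially the large-scale variable by itself; after standardization both variables contribute with equal magnitude.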

In [35]:
# scaling data to have a mean of 0 and standard deviation of 1
from sklearn.preprocessing import StandardScaler
data_standardized = StandardScaler().fit_transform(data)
data_standardized

In [36]:
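# covariance matrix to determine dimensional relationships
# (the columns are already zero-mean, so X^T X / (n - 1) is the sample covariance)
n_samples = data_standardized.shape[0]
covMatrix = np.matmul(data_standardized.T, data_standardized) / (n_samples - 1)
covMatrix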
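
For zero-mean columns, the product X^T X is the sample covariance matrix scaled by n - 1, so both matrices share the same eigenvectors and their eigenvalues differ only by that factor. A minimal check on synthetic data (the array `X` and every name below are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small synthetic dataset: 200 samples, 5 features (illustrative only).
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)            # center each column
n_samples = X.shape[0]

scatter = X.T @ X                 # unnormalized product
cov = scatter / (n_samples - 1)   # sample covariance matrix

# np.cov with rowvar=False treats columns as variables and uses the same n - 1 divisor.
print(np.allclose(cov, np.cov(X, rowvar=False)))

# Scaling a matrix scales its eigenvalues but leaves its eigenvectors unchanged.
scatter_vals, scatter_vecs = np.linalg.eigh(scatter)
cov_vals, cov_vecs = np.linalg.eigh(cov)
print(np.allclose(scatter_vals / (n_samples - 1), cov_vals))
```

Because only the eigenvectors are used for the projection, either matrix leads to the same principal directions.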

In [37]:
# eigenvalue & eigenvector calculation to determine principal components
from scipy.linalg import eigh
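# subset_by_index replaces the deprecated eigvals keyword (SciPy 1.5+);
# indices 782 and 783 select the two largest eigenvalues
values, vector = eigh(covMatrix, subset_by_index=[782, 783])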
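# eigh sorts eigenvalues in ascending order; reverse so the first principal component comes first
values, vector = values[::-1], vector.T[::-1]
values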

In [38]:
# projecting the standardized data onto the two principal components
projectedData = np.matmul(vector, data_standardized.T)
projectedData
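
One way to sanity-check a projection like this is to compare it with a different route to the same components. The sketch below uses a singular value decomposition instead of an eigen-decomposition, on a synthetic stand-in for the standardized data; the SVD route and all names here are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the standardized data: 300 samples, 6 correlated features.
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition route: top-2 eigenvectors of X^T X, largest first.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)   # ascending order
top2 = eigenvectors[:, ::-1][:, :2].T                 # shape (2, n_features)
projected_eig = top2 @ X.T                            # shape (2, n_samples)

# SVD route: rows of Vt are the principal directions, sorted largest first.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
projected_svd = Vt[:2] @ X.T

# Each component is only defined up to sign, so compare absolute values.
print(np.allclose(np.abs(projected_eig), np.abs(projected_svd)))
# Squared singular values equal the eigenvalues of X^T X.
print(np.allclose(S[:2] ** 2, eigenvalues[::-1][:2]))
```

The sign ambiguity also means that a 2D plot of the projected data may appear mirrored between implementations without being wrong.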

In [39]:
# preparing stacked data for visualization
reducedData = np.vstack((projectedData, label)).T 
reducedData = pd.DataFrame(reducedData, columns = ['pca_1', 'pca_2', 'label'])

**Visualization using FacetGrid :**

FacetGrid plots conditional relationships across subsets of a dataset. The basic workflow is to initialize the FacetGrid object with the dataset and the variables that structure the grid, apply one or more plotting functions to each subset by calling FacetGrid.map() or FacetGrid.map_dataframe(), and then customize the result, for example by adding a legend.

In [40]:
# data visualization 
# the size parameter was renamed to height in seaborn 0.9
sns.FacetGrid(reducedData, hue = 'label', height = 8).map(sns.scatterplot, 'pca_1', 'pca_2').add_legend()

In [41]:
# visualization of what the dataset actually represents
import matplotlib.pyplot as plt

index = 1234 # random index chosen for representation purposes
fig_data = np.array(data.iloc[index]).reshape(28,28) 
# the string 'none' disables interpolation; None would fall back to the matplotlib default
plt.imshow(fig_data, interpolation = 'none', cmap = 'gray') 
plt.show()
print('Digit represented : ', label[index])