# Introduction

**Dimension Reduction:**
There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:
* Feature Elimination
* Feature Extraction

**Principal Component Analysis (PCA):** 
PCA is a dimension reduction technique used extensively for visualization of high-dimensional data. It is a feature extraction technique: it combines all input variables into new variables (the principal components), then drops the least important components while retaining the most valuable ones. In this method we calculate the eigenvectors and eigenvalues of the covariance matrix of the data. Once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance. The eigenvector with the highest eigenvalue is the first principal component of the data set.
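
The steps described above can be sketched with plain NumPy. This is a minimal illustration, not how `sklearn.decomposition.PCA` is implemented internally; the helper name `pca_eig` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_eig(X, n_components=2):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Center each feature so the covariance is computed around the mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder from highest to lowest eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the leading eigenvectors and project the centered data onto them
    components = eigvecs[:, :n_components]
    return X_centered @ components, eigvals

# Toy data (hypothetical): 200 points in 5 dimensions with unequal spread per axis
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
projected, eigvals = pca_eig(X_toy, n_components=2)
print(projected.shape)  # (200, 2)
print(eigvals)          # eigenvalues, largest first
```

The variance of the data along each projected axis equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by significance.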

**t-Distributed Stochastic Neighbor Embedding (t-SNE):**
t-SNE is a non-linear, unsupervised technique primarily used for data exploration and visualization of high-dimensional data. It gives you an intuition of how the data is arranged in the high-dimensional space. The t-SNE algorithm calculates a similarity measure between pairs of points in the high-dimensional space using a Gaussian distribution, then does the same in the low-dimensional space using a Student t-distribution with one degree of freedom (a Cauchy distribution). Finally, it measures the mismatch between the two sets of similarities with the Kullback-Leibler (KL) divergence and minimizes this KL cost function using gradient descent.
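
The quantities named above can be sketched as follows. This is a simplified illustration under stated assumptions: a single fixed `sigma` is used for every point, whereas the actual algorithm tunes a bandwidth per point from the perplexity and symmetrizes conditional probabilities. All function names and the toy data are hypothetical, and the gradient descent step is left out.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows in Z."""
    sq_norms = np.sum(Z ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def high_dim_similarities(X, sigma=1.0):
    """Gaussian similarities P. ASSUMPTION: one fixed sigma for all points."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)  # a point is not its own neighbour
    return P / P.sum()

def low_dim_similarities(Y):
    """Student-t (one degree of freedom, i.e. Cauchy) similarities Q."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that t-SNE minimizes."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Toy data (hypothetical): 50 points in 10-D and a random 2-D embedding of them
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 10))
Y_toy = rng.normal(size=(50, 2))
P = high_dim_similarities(X_toy, sigma=2.0)
Q = low_dim_similarities(Y_toy)
print(kl_divergence(P, Q))  # cost of the random embedding
```

A random embedding gives a relatively high cost; t-SNE moves the low-dimensional points to bring this value down.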

**PCA vs t-SNE:**
PCA is a linear dimension reduction technique that seeks to maximize variance, which means it mainly preserves large pairwise distances. In other words, things that are different end up far apart. This can lead to poor visualization, especially when dealing with non-linear manifold structures. t-SNE differs from PCA by preserving only small pairwise distances, or local similarities.

> *In this notebook we will compare PCA and t-SNE on the MNIST digits dataset*




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import manifold
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition

> Import data as Input

* The MNIST digits dataset is used for this purpose. The training data consists of 42000 entries, each with 784 pixel values.

In [None]:
data = pd.read_csv('../input/digit-recognizer/train.csv')
data.head()

In [None]:
df_labels = data.label
df_data = data.drop('label', axis = 1)


In [None]:
# Count plot of the labels (pass the Series as x so it also works on recent seaborn)
sns.countplot(x=df_labels)

The data is standardized into a 2D array of shape 10000x784: the first 10000 images are extracted, and each 28x28 pixel image is flattened into 784 values.
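
As a rough sketch of what the standardization step does, the toy example below compares a manual per-column computation with `StandardScaler`. The array values are hypothetical; the constant last column stands in for a pixel that is blank in every image.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for pixel data (hypothetical values): 4 samples, 3 features,
# where the last column is constant, like an always-blank border pixel
X_toy = np.array([[0.0, 10.0, 0.0],
                  [2.0, 20.0, 0.0],
                  [4.0, 30.0, 0.0],
                  [6.0, 40.0, 0.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)                  # population standard deviation (ddof=0)
std_safe = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
X_manual = (X_toy - mean) / std_safe

X_sklearn = StandardScaler().fit_transform(X_toy)
print(np.allclose(X_manual, X_sklearn))
```

Each non-constant column ends up with zero mean and unit variance, so no single pixel dominates the distance calculations purely because of its scale.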

In [None]:
# Extract the first 10000 data points, then standardize each pixel column
df_data = df_data.head(10000)
df_labels = df_labels.head(10000)
pixel_df = StandardScaler().fit_transform(df_data)
pixel_df.shape

In [None]:
sample_data = pixel_df

**Implementation of PCA**

In [None]:
pca = decomposition.PCA(n_components = 2, random_state = 42)

In [None]:
pca_data = pca.fit_transform(sample_data)
print("shape of pca_data = ", pca_data.shape)

In [None]:
# attaching the label for each 2-d data point 
pca_data = np.column_stack((pca_data, df_labels))

# creating a new data frame for plotting of data points
pca_df = pd.DataFrame(data=pca_data, columns=("X", "Y", "labels"))
print(pca_df.head(10))
# `size` was renamed to `height` in seaborn 0.9
sns.FacetGrid(pca_df, hue="labels", height=6).map(plt.scatter, 'X', 'Y').add_legend()
plt.show()

**Implementation of t-SNE**

In [None]:
# Note: recent scikit-learn releases (1.5 and later) rename `n_iter` to `max_iter`
tsne = manifold.TSNE(n_components = 2, random_state = 42, verbose = 2, n_iter = 2000)
transformed_data = tsne.fit_transform(sample_data)

In [None]:
#Creation of new dataframe for plotting of data points
tsne_df = pd.DataFrame(
    np.column_stack((transformed_data, df_labels)),
    columns = ['x', 'y', 'labels'])
# Plain column assignment so the integer dtype actually replaces the float column
tsne_df['labels'] = tsne_df['labels'].astype(int)
print(tsne_df.head(10))


In [None]:
# `size` was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(tsne_df, hue='labels', height = 8)
grid.map(plt.scatter, 'x', 'y').add_legend()