## Dimension Reduction [1] : PCA

![](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png)

This notebook is the first in a series on dimension reduction. The other parts are:

- [🎛 Dimension Reduction [2] : LDA](https://www.kaggle.com/subinium/dimension-reduction-2-lda)
- [🎛 Dimension Reduction [3] : T-SNE](https://www.kaggle.com/subinium/dimension-reduction-3-t-sne)
- [🎛 Dimension Reduction [4] : UMAP](https://www.kaggle.com/subinium/dimension-reduction-4-umap)

## Import Library & Default Setting

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib as mpl
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Since the final plot is drawn with Plotly, these matplotlib settings are not strictly necessary, but let's set up a custom configuration anyway.

In [None]:
# matplotlib configure
plt.rcParams['image.cmap'] = 'gray'
# Color from R ggplot colormap
color = ['#6388b4', '#ffae34', '#ef6f6a', '#8cc2ca', '#55ad89', '#c3bc3f', '#bb7693', '#baa094', '#a9b5ae', '#767676']

In [None]:
mnist = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
mnist.head()

In [None]:
label = mnist['label']
mnist.drop(['label'], inplace=True, axis=1)

## Check Dataset

Almost everyone knows MNIST, but let's take a look at what kind of data it is.

In [None]:
def arr2img(arr, img_size=(28, 28)):
    return arr.reshape(img_size)

fig, axes = plt.subplots(2, 5, figsize=(10, 2))

for idx, ax in enumerate(axes.flat):
    ax.imshow(arr2img(mnist[idx:idx+1].values))
    ax.set_title(label[idx], fontweight='bold', fontsize=8)
    ax.axis('off')

plt.subplots_adjust(bottom=0.1, right=0.5, top=0.9)
plt.show()


We can recognize the 784 grayscale pixel values as a 28x28 image, but the computer cannot.

The computer simply sees each row as a 784-dimensional vector.
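
To make the image/vector relationship concrete, here is a minimal sketch. The array `flat` is a hypothetical stand-in for one row of the dataset (random values, not real MNIST pixels); it only illustrates how a 28x28 image and a 784-dimensional vector hold the same numbers in a different layout.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 grayscale values in 0..255.
rng = np.random.default_rng(0)
flat = rng.integers(0, 256, size=784)

image = flat.reshape(28, 28)   # the layout a person looks at
restored = image.reshape(-1)   # the layout the model receives

print(image.shape, restored.shape)

# With row-major order, the pixel at (row r, column c) sits at index r * 28 + c.
r, c = 3, 5
print(image[r, c] == flat[r * 28 + c])
```

Reshaping does not change any value, so no information is lost. The vector form simply drops the explicit notion of which pixels are neighbors.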

## PCA & Result

**PCA** stands for Principal Component Analysis. 

The goal is to reduce the number of dimensions in multi-dimensional data, and to do this we use techniques from linear algebra.

In short, it finds the axes along which the data varies the most and discards the remaining orthogonal axes, which carry less variance.

The assumption is that the axes with the highest variance carry the most information, so the meaning of the data is preserved to some extent after the dimension is reduced.

scikit-learn provides many of these dimensionality reduction methods. There are several variants of PCA, but here only the most basic one is applied.
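
The idea described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not how scikit-learn implements PCA internally; the function `pca_sketch` and the toy array `X_toy` are hypothetical names introduced only for this example.

```python
import numpy as np

def pca_sketch(X, n_components=2):
    """Project X (n_samples, n_features) onto its highest-variance axes."""
    X_centered = X - X.mean(axis=0)             # PCA works on centered data
    cov = np.cov(X_centered, rowvar=False)      # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]              # columns are the principal axes
    return X_centered @ components, eigvals[order]

# Toy data: 500 points in 5 dimensions, with most variance in the first two features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

projected, variances = pca_sketch(X_toy, n_components=2)
print(projected.shape)
print(variances)  # variance kept along each retained axis, largest first
```

Each eigenvalue equals the variance of the data along the matching axis, which is why keeping the largest ones keeps the most variance. The sign of each axis is arbitrary, so a projection computed this way may be mirrored compared with the scikit-learn result.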

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(mnist)
mnist_pca = pca.transform(mnist)
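
One way to check how much of the original variance survives the reduction is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below uses a hypothetical random array `X_toy` in place of the MNIST frame so that it runs on its own; the same attribute is available on the `pca` object fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 300 samples, 20 features with decreasing spread.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.5, 20)

pca_toy = PCA(n_components=2)
X_toy_2d = pca_toy.fit_transform(X_toy)  # fit and transform in one call

ratio = pca_toy.explained_variance_ratio_
print(ratio)        # share of total variance kept by each component
print(ratio.sum())  # share kept by the 2-D projection as a whole
```

A low total here would mean that the 2-D scatter plot shows only a small part of the structure in the data, which is worth keeping in mind when reading the clusters.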

Because the visualization is interactive, you can check how well each digit clusters by clicking its entry in the legend above the plot.

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

for idx in range(10):
    fig.add_trace(go.Scatter(
        x = mnist_pca[:,0][label==idx],
        y = mnist_pca[:,1][label==idx],
        name=str(idx),
        opacity=0.6,
        mode='markers',
        marker=dict(color=color[idx])
    ))

fig.update_layout(
    width = 800,
    height = 800,
    title = "PCA result",
    yaxis = dict(
      scaleanchor = "x",
      scaleratio = 1
    ),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)


fig.show()