<a href="https://colab.research.google.com/github/Machine-Learning-Tokyo/ELSI-DL-Bootcamp/blob/master/data_visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization (by Alisher)

## Data pre-processing: read and preprocess the data

Import necessary libraries 

In [0]:
import os
import csv
import numpy as np

Get the GitHub repository: clone the repository into your Google Drive.
If the folder (ELSI-DL-Bootcamp) already exists in your Google Drive, you can either skip this step or run it as is: the first line deletes the current folder so that the repo is cloned from scratch.

In [0]:
!rm -rf ELSI-DL-Bootcamp
!git clone https://github.com/Machine-Learning-Tokyo/ELSI-DL-Bootcamp.git


List the files in the repository. 

We need:


  **1. sat6annotations.csv**: class_names

  **2. X_train_subset.csv**: subset of X_train (i.e., images from DeepSat6 Kaggle Dataset)

  **3. y_train_subset.csv**: subset of y_train (i.e., image labels of X_train images)


In [0]:
from subprocess import check_output
print(check_output(["ls", "./ELSI-DL-Bootcamp"]).decode("utf8"))
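
As a quick sanity check, the listing can be compared against the three files named above. This is a minimal sketch using only the standard library; `missing_files` and `REQUIRED_FILES` are hypothetical names introduced here, not part of the repository.

```python
import os

# The three files this notebook reads (names taken from the list above).
REQUIRED_FILES = [
    "sat6annotations.csv",
    "X_train_subset.csv",
    "y_train_subset.csv",
]

def missing_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    if not os.path.isdir(folder):
        # Nothing can be present if the folder itself does not exist.
        return list(required)
    present = set(os.listdir(folder))
    return [name for name in required if name not in present]

print("Missing files:", missing_files("./ELSI-DL-Bootcamp"))
```

An empty list means that all three files are in place and the following cells should be able to read them.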

Import the pandas library and read the CSV files with it:
- [pandas](https://pandas.pydata.org): an open-source library providing high-performance, easy-to-use data structures and data analysis tools for Python.

In [0]:
import pandas as pd
X_train = pd.read_csv('./ELSI-DL-Bootcamp/X_train_subset.csv', header=None)
y_train = pd.read_csv('./ELSI-DL-Bootcamp/y_train_subset.csv', header=None)
class_names = list(pd.read_csv('./ELSI-DL-Bootcamp/sat6annotations.csv', header=None)[0])

print("Shape of X_train: {}".format(X_train.shape))
print("Shape of y_train: {}".format(y_train.shape))
print("Class names: {}".format(class_names))

**y_train (labels) are one-hot-vectors, indicating the class with "1" entries**

In [0]:
print(y_train.loc[:5,:].values)

**We need categorical labels (class names and class indices) so that the plots can be colored consistently by class**

In [0]:
feat_cols = ['pixel'+str(i) for i in range(X_train.shape[1])]
df = pd.DataFrame(X_train.values, columns=feat_cols)

labels = []
numeric_labels = []
for _, cols in y_train.iterrows():
    labels.append(class_names[np.argmax(cols.to_list())])
    numeric_labels.append(np.argmax(cols.to_list()))


df['y'] = labels
df['numeric_label'] = numeric_labels
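
The row-by-row loop above can also be written as a single vectorized `argmax`. The sketch below uses a tiny hypothetical one-hot matrix and hypothetical class names (`class_a`, `class_b`, `class_c`) rather than the real data, so it runs on its own.

```python
import numpy as np

# Hypothetical class names and one-hot labels, for illustration only.
example_class_names = ["class_a", "class_b", "class_c"]
example_one_hot = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# argmax along axis 1 returns the column index of the "1" in every row.
example_numeric = example_one_hot.argmax(axis=1)
example_labels = [example_class_names[i] for i in example_numeric]

print(example_numeric)
print(example_labels)
```

With the notebook's variables, `y_train.values.argmax(axis=1)` should give the same indices as `numeric_labels`, assuming every row of `y_train` contains exactly one "1".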


**Let's check how many samples we have for each class:**

In [0]:
for label_name in class_names:
  print("{}: {}".format(label_name, labels.count(label_name)))
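
The same count can be obtained in one pass with `collections.Counter` from the standard library. The label list below is hypothetical and only serves to make the sketch runnable on its own.

```python
from collections import Counter

# Hypothetical label list, for illustration only.
example_labels = ["water", "trees", "water", "road", "trees", "water"]

# Counter tallies every distinct label in a single pass over the list.
counts = Counter(example_labels)
for name, count in counts.most_common():
    print("{}: {}".format(name, count))
```

One difference from the loop over `class_names`: a `Counter` only lists classes that actually occur, so a class with zero samples is not printed (looking it up still returns 0).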

## Data Visualization (Reference: [data visualization](https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b))

### Import necessary libraries:

- [seaborn](https://seaborn.pydata.org): a Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
- [sklearn](http://scikit-learn.github.io/stable): a Python library that provides many unsupervised and supervised learning algorithms. It is built upon NumPy, SciPy, and Matplotlib.
- [matplotlib](https://matplotlib.org): a Python plotting library that produces high-quality figures.
- [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html): **Principal Component Analysis (PCA)** is a linear dimensionality reduction technique that uses the Singular Value Decomposition of the data to project it to a lower-dimensional space.
- [sklearn.manifold.TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html): **t-distributed Stochastic Neighbor Embedding (t-SNE)** is a tool to visualize high-dimensional data. It is usually a good idea to first reduce the dimensionality with another algorithm (for example, PCA) before applying t-SNE.
- matplotlib.pyplot: matplotlib's plotting interface, used here to create and show the figures.


In [0]:
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
import matplotlib.pyplot as plt
import time


This is for reproducibility of the results (seed):

In [0]:
np.random.seed(42)
rndperm = np.random.permutation(df.shape[0])
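
To see what the seed buys us, the sketch below draws the same permutation twice after re-seeding NumPy's global generator. `seeded_permutation` is a hypothetical helper written for this illustration.

```python
import numpy as np

def seeded_permutation(n, seed=42):
    """Seed NumPy's global generator, then draw a permutation of range(n)."""
    np.random.seed(seed)
    return np.random.permutation(n)

first = seeded_permutation(10)
second = seeded_permutation(10)

# Both draws start from the same generator state, so they are identical.
print(np.array_equal(first, second))
```

Without the call to `np.random.seed`, the two permutations would in general differ from each other and from one run of the notebook to the next.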

**Fit PCA (with 3 principal components) to our data. PCA exploits the correlation between dimensions to find a small number of new variables (the principal components) that retain as much as possible of the variation in the original data.**


In [0]:
pca = PCA(n_components=3)
pca_result = pca.fit_transform(df[feat_cols].values)

# add new columns to our dataFrame: pca-one, pca-two, pca-three
df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1] 
df['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
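
To make the link between PCA and the Singular Value Decomposition concrete, here is a minimal NumPy sketch on hypothetical toy data. `pca_svd` is an illustrative helper, not the scikit-learn implementation, and the signs of its components may differ from what `sklearn.decomposition.PCA` returns.

```python
import numpy as np

def pca_svd(data, n_components):
    """Project `data` (n_samples x n_features) onto its top principal components."""
    # PCA works on mean-centered data.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Each squared singular value is proportional to the variance along its direction.
    explained_variance = s ** 2 / (data.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # Coordinates of every sample in the new, lower-dimensional basis.
    projected = centered @ vt[:n_components].T
    return projected, ratio[:n_components]

# Hypothetical toy data: 200 samples, 5 features with very different spreads.
rng = np.random.RandomState(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

projected, ratio = pca_svd(toy, n_components=3)
print(projected.shape)
print(ratio)
```

The returned ratios play the same role as `pca.explained_variance_ratio_` in the cell above: each one is the share of the total variance captured by one component.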

**These explained variance ratios mean:**
*   the 1st principal component of PCA (first dimension of compressed data) explains the largest share of the variance: **0.65892326**
*   the 2nd principal component of PCA (second dimension of compressed data) explains a share of: **0.17176854**
*   the 3rd principal component of PCA (third dimension of compressed data) explains a share of: **0.01565496**

**Which means:**
*   with 3 dimensions we keep about **84.6%** of the variance ("information") of the original data (images)
*   when we describe the data with 3 dimensions, we lose about **15.4%** of the information
*   when we describe the data with 2 dimensions, we lose about **16.9%** of the information
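
The kept and lost percentages follow from cumulative sums of the ratios quoted above. A short sketch of that arithmetic, with the three ratios typed in by hand:

```python
import numpy as np

# Ratios quoted above for the three principal components.
ratios = np.array([0.65892326, 0.17176854, 0.01565496])

# Variance retained by the first 1, 2 and 3 components, and what is lost at each cut-off.
cumulative = np.cumsum(ratios)
lost = 1.0 - cumulative

for k, (kept, dropped) in enumerate(zip(cumulative, lost), start=1):
    print("{} component(s): kept {:.2%}, lost {:.2%}".format(k, kept, dropped))
```

In the notebook itself, `np.cumsum(pca.explained_variance_ratio_)` gives the same cumulative values without typing the numbers by hand.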






In [0]:
plt.figure(figsize=(16,10))
g_pca_2d = sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="y",
    palette=sns.color_palette("bright", 6),
    data=df.loc[:,:],
    legend="full",
    alpha=0.3
)

g_pca_2d.legend(loc='upper right')
g_pca_2d.set_title("First two dimensions of PCA-compressed-data")

In [0]:
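from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions
# add_subplot replaces gca(projection='3d'), which recent Matplotlib versions no longer accept
ax = plt.figure(figsize=(16,10)).add_subplot(111, projection='3d')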
ax.scatter(
    xs=df.loc[rndperm,:]["pca-one"], 
    ys=df.loc[rndperm,:]["pca-two"], 
    zs=df.loc[rndperm,:]["pca-three"], 
    c=df.loc[rndperm,:]["numeric_label"], 
    cmap='tab10'
)
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')

plt.show()




---



---

**Fit TSNE to PCA-compressed data**:
*   First, compress the data to 3 dimensions
*   Then, use TSNE to visualize the data in 2D



In [0]:
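# shuffle the rows with the fixed permutation and extract the pixel columns
df_subset = df.loc[rndperm,:].copy()
data_subset = df_subset[feat_cols].values

# step 1: compress the pixels to 3 principal components
pca = PCA(n_components=3)
pca_result = pca.fit_transform(data_subset)
df_subset['pca-one'] = pca_result[:,0]
df_subset['pca-two'] = pca_result[:,1] 
df_subset['pca-three'] = pca_result[:,2]

# step 2: embed the 3-dimensional PCA output in 2D with t-SNE
# note: recent scikit-learn releases rename n_iter to max_iter
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(pca_result)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))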
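
The two-step pipeline can be tried on a small synthetic dataset before running it on the images. This is a sketch under assumptions: the toy data (three shifted Gaussian groups) and the `perplexity` and `random_state` values are made up for illustration, and the iteration count is left at the scikit-learn default.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical toy data: 150 samples with 50 features, in three shifted groups.
rng = np.random.RandomState(0)
toy = np.vstack([rng.normal(loc=shift, size=(50, 50)) for shift in (0.0, 3.0, 6.0)])

# Step 1: PCA down to 3 dimensions.
toy_pca = PCA(n_components=3).fit_transform(toy)

# Step 2: t-SNE down to 2 dimensions; perplexity must be smaller than the sample count.
toy_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(toy_pca)

print(toy_pca.shape, toy_tsne.shape)
```

Because t-SNE only ever sees the PCA output, any structure that PCA discards in step 1 cannot reappear in the final 2D embedding.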

**Plot TSNE using seaborn's scatterplot:**

In [0]:
df_subset['tsne-2d-one'] = tsne_results[:,0]
df_subset['tsne-2d-two'] = tsne_results[:,1]
plt.figure(figsize=(16,10))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("bright", 6),
    data=df_subset,
    legend="full",
    alpha=0.3
)



---



---

**You don't have to compress to 3 dimensions with PCA:**
*   If you want to retain a larger portion of the information before the TSNE visualization, reduce the dimensionality to k dimensions instead
*   For example, reduce the dimensionality of the data from **3136** to **5**.
*   Fit another PCA with 5 principal components
*   Fit TSNE to the PCA-compressed data (with 5 dimensions) to visualize the data in 2D



In [0]:
pca_5 = PCA(n_components=5)
pca_result_5 = pca_5.fit_transform(data_subset)
print('Explained variation for 5 principal components: {}'.format(pca_5.explained_variance_ratio_))
print('Cumulative explained variation for 5 principal components: {}'.format(np.sum(pca_5.explained_variance_ratio_)))


**These explained variance ratios mean:**
*   the 1st principal component of PCA (first dimension of compressed data) explains the largest share of the variance: **0.65892326**
*   the 2nd principal component of PCA (second dimension of compressed data) explains a share of: **0.17176854**
*   the 3rd principal component of PCA (third dimension of compressed data) explains a share of: **0.01565496**
*   the 4th principal component of PCA (fourth dimension of compressed data) explains a share of: **0.00871125**
*   the 5th principal component of PCA (fifth dimension of compressed data) explains a share of: **0.00635331**

**Which means:**
*   with 5 dimensions we keep **86.14%** of the variance ("information") of the original data (images)
*   when we describe the data with 5 dimensions, we lose **13.86%** of the information
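
Instead of fixing k in advance, one can ask for the smallest k that reaches a target share of retained variance. The sketch below does this with the five ratios quoted above; `smallest_k` is a hypothetical helper, and the target values are arbitrary examples.

```python
import numpy as np

def smallest_k(ratios, target):
    """Smallest number of leading components whose ratios sum to at least `target`.

    Returns None if even all the given components fall short of `target`.
    """
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value that is >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return k if k <= len(cumulative) else None

# Ratios quoted above for the five principal components.
ratios_5 = [0.65892326, 0.17176854, 0.01565496, 0.00871125, 0.00635331]

print(smallest_k(ratios_5, 0.80))
print(smallest_k(ratios_5, 0.85))
print(smallest_k(ratios_5, 0.90))
```

A result of `None` signals that more than five components would be needed, so PCA would have to be refitted with a larger `n_components` to find the answer.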


In [0]:
time_start = time.time()
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300)
tsne_pca_results = tsne.fit_transform(pca_result_5)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

Let's plot three different plots together:


1.   PCA plot (fit PCA with 3 principal components and directly visualize)
2.   TSNE plot (fit TSNE to PCA-compressed-data with 3 principal components) on 2D
3.   TSNE plot (fit TSNE to PCA-compressed-data with 5 principal components) on 2D



In [0]:
df_subset['tsne-pca5-one'] = tsne_pca_results[:,0]
df_subset['tsne-pca5-two'] = tsne_pca_results[:,1]


plt.figure(figsize=(30,8))
ax1 = plt.subplot(1, 3, 1)
g1 = sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="y",
    palette=sns.color_palette("bright", 6),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax1
)
ax2 = plt.subplot(1, 3, 2)
g2 = sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("bright", 6),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax2
)
ax3 = plt.subplot(1, 3, 3)
g3 = sns.scatterplot(
    x="tsne-pca5-one", y="tsne-pca5-two",
    hue="y",
    palette=sns.color_palette("bright", 6),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax3
)

g1.legend(loc='upper right')
g1.set_title("PCA plot")
g2.legend(loc='upper right')
g2.set_title("t-SNE plot of PCA (compressed into 3 dimensions)")
g3.legend(loc='upper right')
g3.set_title("t-SNE plot of PCA (compressed into 5 dimensions)")




---



---

# **Thank you for listening!!!**

In [0]:
!pip install art==3.6
from art import *
tprint("Thank you from", font="sub-zero")
tprint("(MLT Team)", "sub-zero")