[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## UnSupervised Learning - Dimensionality Reduction - t-SNE

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 26/02/2023 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0042DimensionalityReductionTSNE.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE
from sklearn.decomposition import KernelPCA

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


## Dimensionality Reduction by t-SNE

The t-SNE method was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 with the motivation of visualizing high dimensional data (for instance, features learned by Deep Neural Networks).  
So its main use, originally, was visualization. Yet in practice it is also one of the most widely used non linear dimensionality reduction methods.

In this notebook:

 - We'll apply the t-SNE algorithm to the MNIST data set.
 - We'll compare the results to those of Kernel PCA or IsoMap.

In [None]:
# Parameters

# Data
numRows  = 3
numCols  = 3
tImgSize = (28, 28)

numSamples = 5_000

# Model
lowDim      = 2
paramP      = 10
metricType  = 'l2'

# Visualization
imgShift        = 5
numImgScatter   = 70


In [None]:
# Auxiliary Functions

def PlotMnistImages(mX: np.ndarray, vY: np.ndarray = None, numRows: int = 1, numCols: int = 1, imgSize = (28, 28), randomChoice: bool = True, hF = None):

    numSamples  = mX.shape[0]

    numImg = numRows * numCols

    # tFigSize = (numRows * 3, numCols * 3)
    tFigSize = (numCols * 3, numRows * 3)

    if hF is None:
        hF, hA = plt.subplots(numRows, numCols, figsize = tFigSize)
    else:
        hA = hF.axes
    
    hA = np.atleast_1d(hA) #<! To support numImg = 1
    hA = hA.flat

    
    for kk in range(numImg):
        if randomChoice:
            idx = np.random.choice(numSamples)
        else:
            idx = kk
        mI  = np.reshape(mX[idx, :], imgSize)
    
        hA[kk].imshow(mI, cmap = 'gray')
        hA[kk].tick_params(axis = 'both', left = False, top = False, right = False, bottom = False, labelleft = False, labeltop = False, labelright = False, labelbottom = False)
        labelStr = f', Label = {vY[idx]}' if vY is not None else ''
        hA[kk].set_title(f'Index = {idx}' + labelStr)
    
    plt.show()

def PlotScatterData(mX: np.ndarray, vL: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, markerSize: int = MARKER_SIZE_DEF, lineWidth: int = LINE_WIDTH_DEF, axisTitle: str = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    vU = np.unique(vL)
    numClusters = len(vU)

    for ii in range(numClusters):
        vIdx = vL == vU[ii]
        hA.scatter(mX[vIdx, 0], mX[vIdx, 1], s = ELM_SIZE_DEF, edgecolor = EDGE_COLOR, label = vU[ii]) #<! Label by the class value (Not the loop index)
    
    hA.set_xlabel('${{x}}_{{1}}$')
    hA.set_ylabel('${{x}}_{{2}}$')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.grid()
    hA.legend()

    return hA

def PlotEmbeddedImg(mZ: np.ndarray, mX: np.ndarray, vL: np.ndarray = None, numImgScatter: int = 50, imgShift: float = 5.0, tImgSize: Tuple[int, int] = (28, 28), hA: plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, markerSize: int = MARKER_SIZE_DEF, edgeColor = EDGE_COLOR, lineWidth: int = LINE_WIDTH_DEF, axisTitle: str = None):

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    numSamples = mX.shape[0]
    
    lSet = list(range(1, numSamples))
    lIdx = [0] #<! First image
    for ii in range(numImgScatter):
        mDi  = sp.spatial.distance.cdist(mZ[lIdx, :], mZ[lSet, :])
        vMin = np.min(mDi, axis = 0)
        idx  = np.argmax(vMin) #<! Farthest image
        lIdx.append(lSet[idx])
        lSet.remove(lSet[idx])
    
    for ii in range(numImgScatter):
        idx = lIdx[ii]
        x0  = mZ[idx, 0] - imgShift
        x1  = mZ[idx, 0] + imgShift
        y0  = mZ[idx, 1] - imgShift
        y1  = mZ[idx, 1] + imgShift
        mI  = np.reshape(mX[idx, :], tImgSize)
        hA.imshow(mI, aspect = 'auto', cmap = 'gray', zorder = 2, extent = (x0, x1, y0, y1))
    
    if vL is not None:
        vU = np.unique(vL)
        numClusters = len(vU)
    else:
        vL = np.zeros(numSamples)
        vU = np.zeros(1)
        numClusters = 1

    for ii in range(numClusters):
        vIdx = vL == vU[ii]
        hA.scatter(mZ[vIdx, 0], mZ[vIdx, 1], s = markerSize, edgecolor = edgeColor, label = vU[ii]) #<! Label by the class value (Not the loop index)
    
    hA.set_xlabel('${{x}}_{{1}}$')
    hA.set_ylabel('${{x}}_{{2}}$')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.grid()
    hA.legend()

    return hA


## Generate / Load Data

In this notebook we'll use the MNIST data set, loaded from OpenML by [SciKit Learn's `fetch_openml`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html).  

In [None]:
# Loading / Generating Data

mX, vY = fetch_openml('mnist_784', version = 1, return_X_y = True, as_frame = False, parser = 'auto') #<! MNIST: 70,000 samples, 784 features (28x28 images)

print(f'The features data shape: {mX.shape}')
print(f'The features data type: {mX.dtype}')

In [None]:
# Sub Sample the Data

vIdx = np.random.choice(mX.shape[0], numSamples, replace = False)
mX = mX[vIdx]
vY = vY[vIdx]

print(f'The features data shape: {mX.shape}')

### Plot the Data

In [None]:
# Plot the Data
hF, hA = plt.subplots(nrows = numRows, ncols = numCols, figsize = (8, 8))
PlotMnistImages(mX, vY, numRows = numRows, numCols = numCols, imgSize = tImgSize, hF = hF) #<! Shows the figure, returns nothing

## Applying Dimensionality Reduction - t-SNE

The t-SNE algorithm is an improvement of the SNE algorithm:

![](https://i.imgur.com/CNK26ly.png)

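The heavy tailed Student's t distribution, used for the similarities in the low dimension, improves on SNE by relieving the crowding problem while keeping the local structure.  
It lets the model pay only a small penalty when points at a moderate distance in the high dimension are placed far apart in the low dimension, which leaves room to keep close neighbors together.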
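
To get a feel for the heavy tail, the following minimal sketch compares the unnormalized similarity kernels in the low dimension: the Gaussian kernel of SNE against the Student's t kernel (1 degree of freedom) of t-SNE. The distance values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Pairwise distances in the low dimensional space (Illustrative values)
vD = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

# Unnormalized similarity kernels
vGauss    = np.exp(-np.square(vD))      #<! Gaussian kernel (SNE)
vStudentT = 1.0 / (1.0 + np.square(vD)) #<! Student's t kernel, 1 degree of freedom (t-SNE)

for valD, valG, valT in zip(vD, vGauss, vStudentT):
    print(f'd = {valD:4.1f}: Gaussian = {valG:.3e}, Student t = {valT:.3e}')
```

The Gaussian similarity vanishes quickly as the distance grows, while the Student's t similarity decays only polynomially. Hence moving points far apart in the embedding changes the low dimensional similarity only mildly.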

We'll use SciKit Learn's implementation of the algorithm [`TSNE`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

* <font color='brown'>(**#**)</font> The t-SNE algorithm preserves local structure.
* <font color='brown'>(**#**)</font> The t-SNE algorithm is random by its nature.
* <font color='brown'>(**#**)</font> The t-SNE algorithm is heavy to compute.
* <font color='brown'>(**#**)</font> The t-SNE algorithm does not support out of sample data inherently.
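
Two of the notes above can be checked on a toy example. The sketch below uses small synthetic data (an assumption for speed, not MNIST) and `init = 'random'` so the effect of the seed is visible. It checks that the `TSNE` class exposes no `transform()` method and compares runs with equal and different `random_state` values.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data for illustration only (Not MNIST)
oRng    = np.random.default_rng(0)
mXSmall = oRng.normal(size = (60, 10))

# t-SNE embeds the given samples, it does not learn a mapping for new samples
hasTransform = hasattr(TSNE, 'transform')
print(f'TSNE has a transform() method: {hasTransform}')

# Fixing `random_state` is meant to make runs repeatable
mZ1 = TSNE(n_components = 2, perplexity = 10, init = 'random', random_state = 0).fit_transform(mXSmall)
mZ2 = TSNE(n_components = 2, perplexity = 10, init = 'random', random_state = 0).fit_transform(mXSmall)
mZ3 = TSNE(n_components = 2, perplexity = 10, init = 'random', random_state = 1).fit_transform(mXSmall)

print(f'Same seed, max abs difference     : {np.max(np.abs(mZ1 - mZ2))}')
print(f'Different seed, max abs difference: {np.max(np.abs(mZ1 - mZ3))}')
```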



In [None]:
# Apply the t-SNE

# Construct the object
oTsneDr = TSNE(n_components = lowDim, perplexity = paramP, metric = metricType)
# Build the model and transform data
mZ = oTsneDr.fit_transform(mX)

* <font color='red'>(**?**)</font> In production, what do we deliver?
* <font color='red'>(**?**)</font> Look at the documentation of [`TSNE`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), why don't we have the `transform()` method?
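
The `perplexity` parameter passed to `TSNE` can be read as an effective number of neighbors per sample. In t-SNE, the width $\sigma_i$ of the Gaussian around each sample is searched so that the conditional distribution $p_{j|i}$ has the requested perplexity, $2^{H(P_i)}$. The sketch below, with a hypothetical helper `CalcPerplexity()` and arbitrary distances, only shows how the perplexity grows with $\sigma$. It is not SciKit Learn's implementation.

```python
import numpy as np

def CalcPerplexity(vD2: np.ndarray, sigma: float) -> float:
    # Hypothetical helper for illustration
    # Conditional distribution p_{j|i} from the squared distances to sample i
    vP = np.exp(-vD2 / (2.0 * sigma * sigma))
    vP = vP / np.sum(vP)
    # Shannon entropy in bits (The small constant guards log2(0))
    entropyVal = -np.sum(vP * np.log2(vP + 1e-12))
    return 2.0 ** entropyVal

# Squared distances from sample i to its 5 neighbors (Illustrative values)
vD2 = np.array([1.0, 1.0, 4.0, 9.0, 16.0])

for sigma in [0.5, 1.0, 2.0, 100.0]:
    print(f'sigma = {sigma:6.1f}: perplexity = {CalcPerplexity(vD2, sigma):.3f}')
```

A small $\sigma$ concentrates the probability on the 2 nearest neighbors, while a very large $\sigma$ spreads it almost uniformly over all 5 neighbors.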

In [None]:
# Plot the Low Dimensional Data (With the Digits)

hA = PlotEmbeddedImg(mZ, mX, vL = vY, numImgScatter = numImgScatter, imgShift = imgShift, tImgSize = tImgSize)
hA.set_title(f'Low Dimension Representation by t-SNE with Perplexity = {paramP}')

* <font color='blue'>(**!**)</font> Change the `perplexity` parameter and see the results.
* <font color='blue'>(**!**)</font> Apply the KernelPCA or IsoMap to the data and compare results (Run time as well).
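
A possible starting point for the comparison task is sketched below. To keep it fast and self contained it uses SciKit Learn's bundled `load_digits` data (8x8 images) as a stand in for MNIST. The kernel, `gamma`, `n_neighbors` and `perplexity` values are assumptions chosen for illustration, not tuned settings.

```python
import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE, Isomap

# Small bundled digits data set as a stand in for MNIST
mXDigits, vYDigits = load_digits(return_X_y = True)
mXDigits = mXDigits[:500]
vYDigits = vYDigits[:500]

# Models to compare (Hyper parameters are illustrative)
dModels = {
    'Kernel PCA': KernelPCA(n_components = 2, kernel = 'rbf', gamma = 1e-3),
    'IsoMap'    : Isomap(n_neighbors = 10, n_components = 2),
    't-SNE'     : TSNE(n_components = 2, perplexity = 10, random_state = 0),
}

dEmbeddings = {}
for modelName, oModel in dModels.items():
    startTime = time.perf_counter()
    dEmbeddings[modelName] = oModel.fit_transform(mXDigits) #<! Fit and embed the same samples
    runTime = time.perf_counter() - startTime
    print(f'{modelName:10s}: output shape = {dEmbeddings[modelName].shape}, run time = {runTime:.2f} [Sec]')
```

Each entry of `dEmbeddings` can then be passed to `PlotScatterData()` with `vYDigits` as the labels to compare the embeddings visually.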