# ML Explainer - How do t-SNE and UMAP work?

Pieter Overdevest  
2024-03-12

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

### How to work with this Jupyter Notebook yourself?

- Get a copy of the repository ('repo') [machine-learning-with-python-explainers](https://github.com/EAISI/machine-learning-with-python-explainers) from EAISI's GitHub site, either by cloning the repo or by downloading the zip-file. Both options are explained in this YouTube video by [Coderama](https://www.youtube.com/watch?v=EhxPBMQFCaI).

- Copy the folder 'ml-explainers\\' located in the folder 'example-solutions\\' to your own project folder.

#### Aim

To explain how t-SNE and UMAP embeddings work, using a dummy non-linear data set.

#### Initialization

We start by importing a few packages,

In [23]:
import pandas as pd
import numpy as np
import altair as alt
import random
import warnings

from plotly.express import scatter, scatter_3d
from sklearn.preprocessing import StandardScaler
from sklearn import manifold
from umap import UMAP
from numpy.random import rand

In [24]:
# This helps to silence a 'FutureWarning' from the Altair library, specifically regarding
# the use of the convert_dtype parameter in the apply() method.
warnings.simplefilter(action='ignore', category=FutureWarning)

#### Get the data

We create a dummy data set with non-linear relationships between the three features, `x`, `y`, and `z`. In particular, the data set contains a ‘bucket’-like structure with another cluster sitting inside it. Below, we show a 3D scatter plot made with Plotly. This allows us to rotate the box holding the data around the x, y, and z-axes, to observe the data from different directions. In addition, the data have been projected onto each of the three planes spanned by a pair of axes. This helps to understand how the data overlap along each direction.

In this notebook, we investigate the data structure using t-SNE and UMAP. The same data set is analysed with PCA in ML Explainer `how-does-pca-work`. There you will see that PCA has trouble distinguishing the non-linear data structures.

In [4]:
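# Seed NumPy's generator: the data below are drawn with numpy.random.rand,
# which is not affected by Python's random.seed().
np.random.seed(10)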

# Cluster A (red)
l_a_x = list( 5 + 3 * rand(25)); l_a_y = list( 5 + 3 * rand(25)); l_a_z = list(10 + 1 * rand(25))

# Cluster B (green)
l_b_x = list( 5 + 3 * rand(25)); l_b_y = list(15 + 3 * rand(25)); l_b_z = list( 5 + 1 * rand(25))

# Cluster C (blue)
l_c_x = list(14 + 3 * rand(25)); l_c_y = list(12 + 3 * rand(25)); l_c_z = list(15 + 1 * rand(25))

# Cluster D (magenta)
l_d_x = list( 5 + 3 * rand(50)); l_d_y = list(15 + 3 * rand(50)); l_d_z = list( 7 + 1 * rand(50))

# Cluster E (orange)
n_x_center = 15; n_y_center = 10; n_z_center = 15
n_radius   = 5
n_data     = 250

v_phi   = np.pi * rand(n_data)
v_theta = np.pi * rand(n_data)

v_e_x = n_x_center + n_radius * np.sin(v_phi) * np.cos(v_theta)
v_e_y = n_y_center + n_radius * np.sin(v_phi) * np.sin(v_theta) * 2
v_e_z = n_z_center + n_radius * np.cos(v_phi)

l_e_x = list(v_e_x); l_e_y = list(v_e_y); l_e_z = list(v_e_z)

# Cluster label.
ps_y = pd.Series(["A"]*25 + ["B"]*25 + ["C"]*25 + ["D"]*50 + ["E"]*n_data)

# Concatenation of cluster data.
l_df_X_names = ['x', 'y', 'z']
m_X       = np.array([
    l_a_x + l_b_x + l_c_x + l_d_x + l_e_x,
    l_a_y + l_b_y + l_c_y + l_d_y + l_e_y,
    l_a_z + l_b_z + l_c_z + l_d_z + l_e_z
]).transpose()

# Convert to dataframe, df_X.
df_X  = pd.DataFrame(    
    m_X,
    columns = l_df_X_names
)

# Create copy of df_X.
df_data = df_X.copy()

# Add cluster label to df_data.
df_data['label'] = ps_y

# Create shadows on x, y, and z planes.
df_data_x = df_data.copy(); df_data_y = df_data.copy(); df_data_z = df_data.copy()

df_data_x['x'] = 0; df_data_y['y'] = 0; df_data_z['z'] = 0

# Concatenate data.
df_data_total = pd.concat([
    
    df_data,
    df_data_x,
    df_data_y,
    df_data_z
], axis = 0)


##### Plot data in high-dimensional space

Now, we plot the data in the high-dimensional space. Below, a 3D scatter
plot is shown made with Plotly. Use your mouse to rotate the data and
the x, y, and z-axes, to observe the data from different directions. In
addition, the data have been projected on each of the x, y, and z planes
(shadows). This helps to understand how data are overlapping in one of
the three directions.

We see that cluster C (blue) is located within cluster E (orange),
i.e., a non-linear structure. Clusters B and D are close to each other,
but linearly separable, as you can see with PCA (see ML Explainer
‘how-does-pca-work’).
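
To make ‘linearly separable’ concrete, here is a minimal sketch that re-creates clusters B and D with the same value ranges as above and checks that a single plane separates them. The seed, the variable names, and the threshold `z = 6.5` are assumptions chosen for illustration; they are not part of the notebook's data pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for illustration only

# Re-create clusters B and D with the same ranges as in the notebook.
m_b = np.column_stack([5 + 3 * rng.random(25), 15 + 3 * rng.random(25), 5 + rng.random(25)])
m_d = np.column_stack([5 + 3 * rng.random(50), 15 + 3 * rng.random(50), 7 + rng.random(50)])

# B has z in [5, 6) and D has z in [7, 8), so the plane z = 6.5 lies between them.
n_z_threshold = 6.5
b_below = bool(np.all(m_b[:, 2] < n_z_threshold))
d_above = bool(np.all(m_d[:, 2] > n_z_threshold))

print(b_below, d_above)  # True True
```

No such plane exists for clusters C and E, because C sits inside the bucket formed by E. That is the case where we need a non-linear method.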

In [5]:
# Plot the data.
fig = scatter_3d(

    df_data_total, 
    x                  = 'x',
    y                  = 'y', 
    z                  = 'z',
    color              = 'label',
    color_discrete_map = {"A": 'red', "B": 'green', "C": 'blue', "D": 'magenta', "E": 'orange'},
    range_x            = (0,25),
    range_y            = (0,25),
    range_z            = (0,25)
)

fig.update_layout(
    autosize = True,
    width    = 800,
    height   = 800)
    
fig.show()

##### Pre-processing

We scale the data,

In [6]:
m_X_scaled = StandardScaler().fit_transform(m_X)

and apply dimension reduction to the scaled data. Choose embedding type:

In [44]:
#c_embedding = 'UMAP'
c_embedding = 't-SNE'

In [47]:
if c_embedding == 'UMAP':

    embedding = UMAP(
        n_neighbors  = 9,
        n_components = 2,
        metric       = 'euclidean'
    )

elif c_embedding == 't-SNE':

    embedding = manifold.TSNE(
        
        perplexity    = 15, # This is the 'hyperparameter' to change.
        init          = "random",
        n_iter        = 1000, # Note: renamed to 'max_iter' in scikit-learn 1.5; use that name on newer versions.
        learning_rate = 100,
        random_state  = 0,
    )

else:

    raise ValueError("Embedding method must be 'UMAP' or 't-SNE'.")


df_X_embedding = (

    pd.DataFrame(
        embedding.fit_transform(m_X_scaled),
        columns = ['dim1', 'dim2']
    )
    .assign(target = ps_y)
)
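
On the scaling step in this section: t-SNE and UMAP both work with distances between data points, so a feature with a much larger range than the others would dominate the embedding. The sketch below, on a small made-up matrix `m_demo` (an assumption, not the notebook's data), shows that `StandardScaler` subtracts the column mean and divides by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix; the second column has a much larger range than the first.
m_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Manual standardization: zero mean and unit (population) standard deviation per column.
m_manual = (m_demo - m_demo.mean(axis=0)) / m_demo.std(axis=0)

# The same operation with scikit-learn.
m_sklearn = StandardScaler().fit_transform(m_demo)

print(np.allclose(m_manual, m_sklearn))  # True
```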

##### Plot data in low-dimensional space

Now, we plot the data in the low-dimensional space, see below. We observe that the non-linear data structure ‘C-in-E’ is separated into two clusters, which PCA cannot do, see ML Explainer ‘how-does-pca-work’. Compare the t-SNE embeddings for smaller and larger values of the perplexity.

In [48]:
dc_color = {
    "A": 'red',
    "B": 'green',
    "C": 'blue', 
    "D": 'magenta', 
    "E": 'orange'
}

alt.Chart(df_X_embedding).mark_circle(size=60, opacity = 0.5).encode(
    x     = 'dim1',
    y     = 'dim2',
    color = alt.Color(
        'target',
        scale=alt.Scale(
            domain = list(dc_color.keys()),
            range  = list(dc_color.values())
        )
    )
).properties(
    width  = 500,
    height = 500,
    title = f"Embedding: {c_embedding}"
).interactive()
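
One way to compare perplexity values is to fit t-SNE in a loop and keep each embedding for plotting. The sketch below uses a small stand-in data set of two Gaussian blobs (an assumption, to keep it self-contained and fast). In this notebook you would pass `m_X_scaled` instead and plot each result with the Altair code above.

```python
import numpy as np
from sklearn import manifold

rng = np.random.default_rng(0)  # hypothetical seed, for illustration only

# Stand-in data (not the notebook's data): two 3D Gaussian blobs of 30 points each.
m_demo = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(5, 1, (30, 3))])

# Fit one embedding per perplexity value; perplexity must be smaller than the number of samples.
d_embeddings = {}
for n_perplexity in [5, 15, 30]:
    tsne = manifold.TSNE(perplexity=n_perplexity, init="random", random_state=0)
    d_embeddings[n_perplexity] = tsne.fit_transform(m_demo)

for n_perplexity, m_embedding in d_embeddings.items():
    print(n_perplexity, m_embedding.shape)
```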

When we set perplexity to 250, the data structure ‘C-in-E’ is no longer separated. A larger perplexity moves the ‘horizon’ further away: more neighbours are considered for each data point, because the standard deviation of the normal distribution used to weigh the neighbours increases. As a result, data structures that are separated in the high-dimensional space lump together in the low-dimensional space. The opposite occurs when we decrease the perplexity. There is no objectively optimal value for perplexity. It is chosen based on the insights that can be gained from the t-SNE embeddings, ideally in discussion with domain experts.
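
The link between perplexity and the standard deviation can be sketched for a single data point. For a given standard deviation `sigma`, the neighbours get Gaussian weights, and the perplexity is 2 to the power of the entropy of those weights. t-SNE searches, per data point, for the `sigma` that yields the requested perplexity. The code below is a simplified illustration on made-up data, with hypothetical function names; it is not scikit-learn's implementation.

```python
import numpy as np

def perplexity_for_sigma(v_sq_dist, n_sigma):
    """Perplexity (2**entropy) of the Gaussian neighbour weights around one point."""
    v_logits = -v_sq_dist / (2.0 * n_sigma**2)
    v_p = np.exp(v_logits - v_logits.max())  # subtract max for numerical stability
    v_p /= v_p.sum()
    n_entropy = -np.sum(v_p * np.log2(v_p + 1e-12))
    return 2.0**n_entropy

def sigma_for_perplexity(v_sq_dist, n_target, n_steps=60):
    """Bisection on a log scale; perplexity increases monotonically with sigma."""
    n_lo, n_hi = 1e-3, 1e3
    for _ in range(n_steps):
        n_mid = np.sqrt(n_lo * n_hi)  # geometric midpoint
        if perplexity_for_sigma(v_sq_dist, n_mid) < n_target:
            n_lo = n_mid
        else:
            n_hi = n_mid
    return n_mid

rng = np.random.default_rng(0)  # hypothetical seed, for illustration only
m_demo = rng.normal(size=(100, 3))

# Squared distances from the first point to the 99 other points.
v_sq_dist = np.sum((m_demo[1:] - m_demo[0])**2, axis=1)

n_sigma_small = sigma_for_perplexity(v_sq_dist, 5)
n_sigma_large = sigma_for_perplexity(v_sq_dist, 50)

print(n_sigma_small < n_sigma_large)  # True
```

A higher target perplexity requires a wider Gaussian, so points further away receive a meaningful weight. This is the ‘horizon’ moving outwards.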