:::{.callout-info appearance="simple"}
For suggestions/questions regarding this notebook, please contact [Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/) (pieter@innovatewithdata.nl).
:::

## Aim

In this exercise we explore high-dimensional non-linear data with Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). We use 3D data (three features) as a stand-in for high-dimensional data, because three is the highest number of dimensions we humans can visualize. Real-world data sets can easily have hundreds, or even billions, of features.

## Initialization

We start by importing a few packages:

In [1]:
import pandas as pd
import numpy as np
import random

from plotly.express import scatter_3d, scatter
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from numpy.random import rand

## A. Data simulation, visualisation, and scaling

We create a dummy 3D data set with non-linear relationships between the three features, `x1`, `x2`, and `x3`.

In [2]:
# This block is given in the exercise:

# Discrete color mapping.
dc_color = {"A": 'red', "B": 'green', "C": 'magenta', "D": 'blue', "E": 'orange'}

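# Fix the seed to reproduce our results. The data below are drawn with numpy.random.rand,
# so NumPy's generator must be seeded as well; random.seed() only seeds Python's built-in generator.
random.seed(10)
np.random.seed(10)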

# Cluster A (red)
l_a_x1 =  5 + 3 * rand(25); l_a_x2 =  5 + 3 * rand(25); l_a_x3 = 10 + 1 * rand(25)

# Cluster B (green)
l_b_x1 =  5 + 3 * rand(25); l_b_x2 = 15 + 3 * rand(25); l_b_x3 =  5 + 1 * rand(25)

# Cluster C (magenta). Cluster C floats slightly above cluster B. The distance between
# the two clusters is given by 'n_x3_dist' (default: 2).
n_x3_dist = 2

l_c_x1 = 5 + 3 * rand(25); l_c_x2 = 15 + 3 * rand(25); l_c_x3 = 5 + 1 * rand(25) + n_x3_dist

# Cluster D (blue)
l_d_x1 = 14 + 3 * rand(50); l_d_x2 = 12 + 3 * rand(50); l_d_x3 = 15 + 1 * rand(50)

# Cluster E (orange)
n_x1_center = 15; n_x2_center = 10; n_x3_center = 15
n_radius    = 5
n_e_data    = 500

v_phi   = np.pi * rand(n_e_data)
v_theta = np.pi * rand(n_e_data)

l_e_x1 = n_x1_center + n_radius * np.sin(v_phi) * np.cos(v_theta)
l_e_x2 = n_x2_center + n_radius * np.sin(v_phi) * np.sin(v_theta) * 2
l_e_x3 = n_x3_center + n_radius * np.cos(v_phi)

# Cluster label.
ps_label = pd.Series(["A"]*25 + ["B"]*25 + ["C"]*25 + ["D"]*50 + ["E"]*n_e_data)

# Feature names to be used for df_X.
l_df_X_names = ['x1', 'x2', 'x3']

# Concatenation of cluster data.
m_X = np.array([
    np.concatenate((l_a_x1, l_b_x1, l_c_x1, l_d_x1, l_e_x1)),
    np.concatenate((l_a_x2, l_b_x2, l_c_x2, l_d_x2, l_e_x2)),
    np.concatenate((l_a_x3, l_b_x3, l_c_x3, l_d_x3, l_e_x3))
]).transpose()

# Convert to dataframe, df_X.
df_X  = pd.DataFrame(    
    data    = m_X,
    columns = l_df_X_names
)

# Create copy of df_X.
df_data = df_X.copy()

# Add cluster label to df_data.
df_data['label'] = ps_label

# Create 'shadows' on x, y, and z planes.
df_data_x1 = df_data.copy(); df_data_x2 = df_data.copy(); df_data_x3 = df_data.copy()

df_data_x1['x1'] = 0; df_data_x2['x2'] = 0; df_data_x3['x3'] = 0

# Concatenate data.
df_data_total = pd.concat([
    
    df_data,
    df_data_x1,
    df_data_x2,
    df_data_x3
], axis = 0)
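
The simulated clusters are drawn with `rand` from `numpy.random`, so their reproducibility depends on NumPy's seed rather than on Python's `random` module. The following is a minimal sketch of that difference; the variable names are illustrative only and it uses NumPy's legacy global generator.

```python
import random

import numpy as np
from numpy.random import rand

# Seeding Python's built-in generator leaves NumPy's generator untouched,
# so two draws after the same random.seed() call differ.
random.seed(10)
v_first = rand(3)
random.seed(10)
v_second = rand(3)
b_python_seed_reproduces = np.array_equal(v_first, v_second)

# Seeding NumPy's legacy global generator makes rand() repeatable.
np.random.seed(10)
v_third = rand(3)
np.random.seed(10)
v_fourth = rand(3)
b_numpy_seed_reproduces = np.array_equal(v_third, v_fourth)

print(b_python_seed_reproduces, b_numpy_seed_reproduces)
```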

### 1. Plot the simulated data in a 3D scatter plot, i.e., the high-dimensional space (tip: use Plotly Express' `scatter_3d()` function).

In [8]:
fig = scatter_3d(

    df_data_total, 
    x                  = 'x1',
    y                  = 'x2', 
    z                  = 'x3',
    color              = 'label',
    color_discrete_map = dc_color,
    range_x            = (0,25),
    range_y            = (0,25),
    range_z            = (0,25)
)

fig.update_layout(
    width    = 700,
    height   = 700
)
    
fig.update_traces(
    marker_size = 5
)

fig.show()

### 2. Use your mouse to rotate the data and observe the data from different directions. What do you observe in terms of the cluster locations relative to each other?

As you can see in the plot, the data are also projected onto each of the three coordinate planes (x1 = 0, x2 = 0, and x3 = 0), like shadows. These projections help to see in which directions the clusters overlap.

In [5]:
# We observe that cluster D (blue) is located within cluster E (orange), like a ball in a bucket.
# Clusters B and C are close to each other, but not overlapping.
# Cluster A lies separate from the other clusters.

### 3. Create an object holding the standardized data in `df_X`

In [6]:
m_X_scaled = StandardScaler().fit_transform(df_X)
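
To make explicit what standardization does, the sketch below compares `StandardScaler` with a manual calculation on a small hypothetical data set (`m_demo` is a stand-in, not the `df_X` from this exercise). Each column is centered on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df_X: three features on different scales.
rng = np.random.default_rng(0)
m_demo = rng.normal(loc=[5, 15, 10], scale=[1, 3, 0.5], size=(100, 3))

# StandardScaler subtracts the column mean and divides by the column
# standard deviation (population version, ddof=0, NumPy's default).
m_manual = (m_demo - m_demo.mean(axis=0)) / m_demo.std(axis=0)
m_sklearn = StandardScaler().fit_transform(m_demo)

print(np.allclose(m_manual, m_sklearn))
print(m_sklearn.mean(axis=0).round(6), m_sklearn.std(axis=0).round(6))
```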

## B. Dimensionality reduction with PCA and clustering

### 1. Create a PCA object with as many principal components as there are columns in data frame df_X.

In [7]:
# Number of features in original data.
n_components = df_X.shape[1]

# Print results.
print(f"Number of features in the original data: {n_components}")

# Create pca object.
pca_ = PCA(n_components = n_components)

Number of features in the original data: 3


### 2. Calculate the principal components from the scaled data using the PCA object, put them in a data frame with headers `PC1`, `PC2`, ..., and add the cluster labels in `ps_label` as a column to the data frame. Observe the first 5 rows to check your result.

In [8]:
# Principal components (PCs).
m_pc = pca_.fit_transform(m_X_scaled)

# Header names.
l_pc_names = ["PC" + str(i+1) for i in range(n_components)]

print(l_pc_names)

# Put PCs in data frame.
df_pc = pd.DataFrame(

    data    = m_pc,
    columns = l_pc_names
)

# Add the cluster labels in `ps_label` as fourth column
df_pc['target'] = ps_label

# Show first five rows.
df_pc.head(5)

['PC1', 'PC2', 'PC3']


|   | PC1 | PC2 | PC3 | target |
|---|-----|-----|-----|--------|
| 0 | 2.313925 | 2.010835 | -0.918284 | A |
| 1 | 2.459066 | 2.625184 | -0.930229 | A |
| 2 | 2.175736 | 1.958362 | -0.657875 | A |
| 3 | 2.107596 | 2.118872 | -0.645811 | A |
| 4 | 2.229299 | 2.228798 | -0.77558 | A |
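
How much of the total variance each principal component captures can be read from the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below illustrates this on hypothetical data (`m_demo`, with two strongly correlated features and one independent feature), not on the data of this exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: features 1 and 2 share a common signal, feature 3 is independent.
rng = np.random.default_rng(1)
v_base = rng.normal(size=200)
m_demo = np.column_stack([
    v_base + 0.1 * rng.normal(size=200),
    v_base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Fit PCA on the standardized data, keeping all three components.
pca_demo = PCA(n_components=3).fit(StandardScaler().fit_transform(m_demo))

# Fraction of the total variance per component, and the running total.
v_ratio = pca_demo.explained_variance_ratio_
print(v_ratio.round(3))
print(v_ratio.cumsum().round(3))
```

With all components kept, the ratios sum to one, and they are sorted from largest to smallest.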


### 3. Obtain the loadings from the fitted PCA object. Put them in a data frame with the same header names as in `df_X`. Set the indices of the rows to `PC1`, `PC2`, … What do you conclude from the loadings of the first and second principal component? How does this relate to your observation in the 3D scatterplot?

In [9]:
df_loadings = pd.DataFrame(
    
    data    = pca_.components_,
    columns = l_df_X_names,
    index   = l_pc_names
)

df_loadings.transpose()

|    | PC1 | PC2 | PC3 |
|----|-----|-----|-----|
| x1 | -0.70332 | 0.078802 | 0.706492 |
| x2 | -0.096026 | -0.995259 | 0.015416 |
| x3 | -0.704358 | 0.056999 | -0.707553 |


In [10]:
# x1 and x3 are the main contributors to the first principal component, with nearly equal loadings
# of the same sign, and x2 is almost the sole contributor to the second principal component.
# So, x1 and x3 are strongly correlated, something we also observe in the 3D scatterplot.
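
The link between loadings and correlation can be checked numerically: for standardized data, the rows of `components_` are the eigenvectors of the correlation matrix, up to sign. The sketch below uses hypothetical data in which the first and third feature are correlated, mimicking `x1` and `x3`; it is an illustration, not a re-analysis of the exercise data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns 0 and 2 share a common signal, column 1 is independent.
rng = np.random.default_rng(2)
v_base = rng.normal(size=300)
m_demo = np.column_stack([
    v_base + 0.3 * rng.normal(size=300),
    rng.normal(size=300),
    v_base + 0.3 * rng.normal(size=300),
])
m_demo_scaled = StandardScaler().fit_transform(m_demo)

# Eigenvectors of the correlation matrix, sorted by decreasing eigenvalue.
m_corr = np.corrcoef(m_demo, rowvar=False)
v_eigval, m_eigvec = np.linalg.eigh(m_corr)
m_eigvec = m_eigvec[:, np.argsort(v_eigval)[::-1]]

# Loadings from PCA on the standardized data (one row per component).
m_loadings = PCA(n_components=3).fit(m_demo_scaled).components_

# The sign of each component is arbitrary, so compare absolute values.
b_match = np.allclose(np.abs(m_loadings), np.abs(m_eigvec.T), atol=1e-6)
print(b_match)
print(m_loadings[0].round(3))
```

In this sketch the two correlated features get large first-component loadings of the same sign, the same pattern as `x1` and `x3` above.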

### 4. Create a 2D scatter plot of `PC1` against `PC2`

In [11]:
scatter(
    
    data_frame         = df_pc, 
    x                  = 'PC1', 
    y                  = 'PC2', 
    color              = 'target',
    color_discrete_map = dc_color,
    title              = "First two PC's of a simulated non-linear data set",
    width              = 600,
    height             = 600
)

### 5. What do you observe in the 2D scatter plot in terms of the cluster locations relative to each other? What explains your observation?

In [12]:
# Cluster A is separated from the rest.

# Clusters B and C are only partly separated. Although they are linearly separable in 3D (along x3), the direction
# in which they are separable explains only a small amount of the total variance. Clusters B and C are so close to
# each other that their data partly overlap when projected on the first PC.

# Clusters D and E fully overlap. Cluster D lies inside cluster E, so whatever the direction of the PCs, the data in
# clusters D and E are never separated.

#----------------

# References

# Does PCA preserve linear separability for every linearly separable set?
# No, it depends on the direction of the discriminative information (cluster separation) relative to that of the
# principal components. If the two directions coincide, the clusters are separated in a biplot; if they are
# orthogonal, the clusters are not separated.
# https://stats.stackexchange.com/questions/566696/does-pca-preserve-linear-separability-for-every-linearly-separable-set
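
The point made in the reference can be illustrated with a small sketch. It uses two hypothetical 2D clusters that are only separable along the direction of least variance, and it deliberately skips standardization so that the difference in variance between the two axes is preserved (an assumption of this sketch, unlike the exercise above).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n_points = 200

# Two hypothetical clusters: wide spread along the first axis,
# separated only by a narrow gap along the second axis.
m_a = np.column_stack([rng.normal(0, 5, n_points), rng.normal(-2, 0.15, n_points)])
m_b = np.column_stack([rng.normal(0, 5, n_points), rng.normal(+2, 0.15, n_points)])

m_pc = PCA(n_components=2).fit_transform(np.vstack([m_a, m_b]))
m_pc_a, m_pc_b = m_pc[:n_points], m_pc[n_points:]

def gap(v_a, v_b):
    """Positive if the two 1D samples do not overlap, negative otherwise."""
    return max(v_a.min() - v_b.max(), v_b.min() - v_a.max())

# Along PC1 (most variance) the clusters overlap; along PC2 they separate.
n_gap_pc1 = gap(m_pc_a[:, 0], m_pc_b[:, 0])
n_gap_pc2 = gap(m_pc_a[:, 1], m_pc_b[:, 1])
print(round(n_gap_pc1, 2), round(n_gap_pc2, 2))
```

A plot of PC1 alone would therefore hide the two clusters, even though they are perfectly linearly separable.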

### 6. Increase `n_x3_dist` to six and re-run the script to this point. What happens to the separation of clusters B and C in the 2D scatterplot? What explains your observation? When you are ready, return the value back to two before going to the next question and re-run the script to this point.

In [13]:
# If we increase the distance between B and C ('n_x3_dist') to six, we observe that clusters B and C become separated
# in the scatter plot. This is because if we project clusters B and C on the first principal component,
# they become separated.

# Cluster A is not overlapping with clusters B and C, because PC2 depends primarily on x2 and clusters B and C are
# separated from cluster A in the direction of x2.

## C. Dimensionality reduction with t-SNE and clustering

Let's follow another approach (t-SNE) to reduce the dimensionality of the data to study clusters in the data.

### 1. Create a t-SNE object.

In [65]:
# Default perplexity is 30 and init is pca.
t_sne = TSNE(perplexity=30, init='pca')

# t_sne = TSNE(
#     perplexity    = 10,
#     #perplexity    = 250,
#     #init          = "random",
#     #n_iter        = 1000,
#     #learning_rate = 100,
#     #random_state  = 0,
# )

### 2. Apply the scaled data to the t-SNE object, put the resulting matrix in a data frame with headers `dim1` and `dim2`, and add the cluster labels in `ps_label` as third column to the data frame.

In [66]:
m_X_t_sne = t_sne.fit_transform(m_X_scaled)

df_X_t_sne = pd.DataFrame(m_X_t_sne, columns = ['dim1', 'dim2'])

df_X_t_sne['target'] = ps_label
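
As a compact illustration of the call pattern, the sketch below runs t-SNE on two hypothetical, well-separated blobs (`m_demo`, not the exercise data). Note that scikit-learn's `TSNE` offers `fit_transform()` but no separate `transform()`, so the embedding is always computed for the data it is fitted on. The parameter values are chosen for the small sample size only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: two tight blobs of 30 points each in 3D.
rng = np.random.default_rng(3)
m_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 3)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 3)),
])

# Perplexity must stay well below the number of samples (60 here).
t_sne_demo = TSNE(n_components=2, perplexity=10, init='pca', random_state=0)
m_demo_embedded = t_sne_demo.fit_transform(m_demo)

# One row per sample, one column per embedding dimension.
print(m_demo_embedded.shape)

# Kullback-Leibler divergence after optimization (lower is a better fit).
print(t_sne_demo.kl_divergence_)
```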

### 3. Create a 2D scatter plot of `dim1` against `dim2`.


In [67]:
scatter(
    
    data_frame         = df_X_t_sne,
    x                  = 'dim1', 
    y                  = 'dim2', 
    color              = 'target',
    color_discrete_map = dc_color,
    title              = "t-Distributed Stochastic Neighbor Embedding",
    width              = 600,
    height             = 600
)

### 4. What do you observe in the 2D scatter plot in terms of the cluster locations relative to each other?

In [None]:
# Cluster A is separated from the rest.

# Clusters B and C are somewhat separated: the colored dots form two groups, but these lie so close together
# that we would not know they are separate clusters if they had the same color.

# Clusters D and E are separated now.

### 5. Investigate the influence of the `perplexity` parameter in `TSNE()`. Check and explain the results for perplexity values 10, 30 (default), and 200.

In [68]:
# perplexity = 10: Clusters B&C and D&E are separated, and more.
# We throw a smaller net, so fewer points are considered to be each other's neighbors.
# The standard deviation of the normal distribution scales with perplexity. As we saw in the
# morning lecture, the normal distribution is used to determine whether data points are
# neighbors or not. With a smaller std, only the really close points are
# marked as neighbors.
# This even causes sub-clustering of data within clusters B and E.

# perplexity = 200: Clusters B&C are somewhat separated and clusters D&E are not separated.
# Of course we see the blue dots as a separate cluster within the orange dots, but we would
# not know they are different clusters if they had no distinguishing color.
# Here, we throw a bigger net, so more points are considered to be each other's neighbors.
# The wider normal distribution (larger std) also causes more distant data to become
# neighbors, so they also overlap on the 2D canvas. This lumps data in the low-dimensional
# space (2D) that are separated in the high-dimensional space (3D).
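
The relation between perplexity and the width of the normal distribution can be made concrete with a small worked example. The sketch below is not scikit-learn's implementation; it is a simplified illustration, under the usual t-SNE definition of perplexity as 2 to the power of the entropy of the neighbor distribution, for a single point of a hypothetical data set. The function names are made up for this example.

```python
import numpy as np

def perplexity_for_sigma(v_sq_dist, sigma):
    """Perplexity 2**H of the Gaussian neighbor distribution of one point."""
    # Subtracting the smallest distance avoids underflow and cancels on normalization.
    v_p = np.exp(-(v_sq_dist - v_sq_dist.min()) / (2.0 * sigma ** 2))
    v_p /= v_p.sum()
    entropy = -np.sum(v_p * np.log2(v_p + 1e-12))
    return 2.0 ** entropy

def sigma_for_perplexity(v_sq_dist, target, lo=1e-3, hi=1e3, n_steps=100):
    """Bisect on a log scale for the sigma that yields the target perplexity."""
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)
        if perplexity_for_sigma(v_sq_dist, mid) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

# Hypothetical data: squared distances from the first point to the 199 others.
rng = np.random.default_rng(0)
m_points = rng.normal(size=(200, 3))
v_sq_dist = np.sum((m_points[1:] - m_points[0]) ** 2, axis=1)

# A higher target perplexity requires a wider Gaussian (larger sigma).
for target in (10, 30, 100):
    sigma = sigma_for_perplexity(v_sq_dist, target)
    print(target, round(sigma, 3), round(perplexity_for_sigma(v_sq_dist, sigma), 2))
```

The fitted sigma grows with the requested perplexity, which is the "bigger net" described in the comments above.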


### 6. Extra - Investigate the influence of the `init` parameter. Why is the result always the same for the value 'pca' (default) and is always different for the value 'random'?

In [None]:
# When we use 'pca' the outcome is always the same, because the low-dimensional plot (2D canvas) is initialized
# with PCA, which is fixed (no randomness). With 'random' the output is always slightly different, because we
# start with a random distribution of the data on the 2D canvas.

In [None]:
# Main reasons to evaluate different init settings:

# When init='pca' is used, it initializes t-SNE by first applying Principal Component Analysis (PCA) to the data and
# then using the resulting lower-dimensional representation as the initial configuration for t-SNE. This approach can
# provide some benefits:

# Faster convergence: By using the PCA initialization, t-SNE starts from a configuration that already captures most
# of the variance in the data. This can help t-SNE converge more quickly since it has a good starting point.

# Stability: PCA provides a stable and deterministic initialization. This means that if you run t-SNE multiple times
# with the same dataset and parameters, using PCA initialization can result in consistent results, which may facilitate
# reproducibility.

# Improved preservation of global structure: PCA focuses on capturing the most significant directions of variance in
# the data. By using the PCA initialization, t-SNE can benefit from the global structure information contained in the
# principal components, potentially leading to better preservation of global relationships in the lower-dimensional
# representation.

# However, it's important to note that using PCA initialization doesn't guarantee superior results in all cases.
# The performance of t-SNE heavily depends on the specific characteristics of the data, and different initialization
# strategies may yield different outcomes. It is recommended to experiment with different initialization methods and
# other t-SNE parameters to find the best configuration for your particular dataset and visualization goals.
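
Independently of the `init` setting, runs with random initialization can be made repeatable by fixing `random_state`. The sketch below illustrates this on hypothetical data; it assumes the slower `method='exact'` so that the optimization does not depend on multi-threaded summation order, and the helper `embed` is made up for this example.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: 60 points in 3D.
rng = np.random.default_rng(4)
m_demo = rng.normal(size=(60, 3))

def embed(seed):
    """Run t-SNE with random initialization and a fixed seed."""
    t_sne_demo = TSNE(
        perplexity=10,
        init='random',
        method='exact',
        random_state=seed,
    )
    return t_sne_demo.fit_transform(m_demo)

m_run_a = embed(0)
m_run_b = embed(0)
m_run_c = embed(1)

# Same seed: same starting positions, hence the same embedding.
b_same_seed_equal = np.allclose(m_run_a, m_run_b)

# Different seed: different starting positions, hence a different embedding.
b_other_seed_equal = np.allclose(m_run_a, m_run_c)

print(b_same_seed_equal, b_other_seed_equal)
```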