:::{.callout-info appearance="simple"}
For suggestions/questions regarding this notebook, please contact [Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/) (pieter@innovatewithdata.nl).
:::

## Aim

In this exercise we explore high-dimensional, non-linear data with Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). We use 3D data (three features) as a stand-in for high-dimensional data, because three dimensions is the most we humans can visualize. Real-world data sets can easily have hundreds to billions of features.

## Initialization

We start by importing the packages we need:

In [1]:
import pandas as pd
import numpy as np
import random

from plotly.express import scatter_3d, scatter
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from numpy.random import rand

## A. Data simulation, visualisation, and scaling

We create a dummy 3D data set with non-linear relationships between the three features, `x1`, `x2`, and `x3`.

In [2]:
# This block is given in the exercise:

# Discrete color mapping.
dc_color = {"A": 'red', "B": 'green', "C": 'magenta', "D": 'blue', "E": 'orange'}

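# Fix the seed to reproduce our results. `rand` comes from numpy.random, so
# NumPy's global generator must be seeded (random.seed() has no effect on it).
np.random.seed(10)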

# Cluster A (red)
l_a_x1 =  5 + 3 * rand(25); l_a_x2 =  5 + 3 * rand(25); l_a_x3 = 10 + 1 * rand(25)

# Cluster B (green)
l_b_x1 =  5 + 3 * rand(25); l_b_x2 = 15 + 3 * rand(25); l_b_x3 =  5 + 1 * rand(25)

# Cluster C (magenta). Cluster C floats slightly above cluster B. The distance between
# the two clusters is given by 'n_x3_dist' (default: 2).
n_x3_dist = 2

l_c_x1 = 5 + 3 * rand(25); l_c_x2 = 15 + 3 * rand(25); l_c_x3 = 5 + 1 * rand(25) + n_x3_dist

# Cluster D (blue)
l_d_x1 = 14 + 3 * rand(50); l_d_x2 = 12 + 3 * rand(50); l_d_x3 = 15 + 1 * rand(50)

# Cluster E (orange)
n_x1_center = 15; n_x2_center = 10; n_x3_center = 15
n_radius    = 5
n_e_data    = 500

v_phi   = np.pi * rand(n_e_data)
v_theta = np.pi * rand(n_e_data)

l_e_x1 = n_x1_center + n_radius * np.sin(v_phi) * np.cos(v_theta)
l_e_x2 = n_x2_center + n_radius * np.sin(v_phi) * np.sin(v_theta) * 2
l_e_x3 = n_x3_center + n_radius * np.cos(v_phi)

# Cluster label.
ps_label = pd.Series(["A"]*25 + ["B"]*25 + ["C"]*25 + ["D"]*50 + ["E"]*n_e_data)

# Feature names to be used for df_X.
l_df_X_names = ['x1', 'x2', 'x3']

# Concatenation of cluster data.
m_X = np.array([
    np.concatenate((l_a_x1, l_b_x1, l_c_x1, l_d_x1, l_e_x1)),
    np.concatenate((l_a_x2, l_b_x2, l_c_x2, l_d_x2, l_e_x2)),
    np.concatenate((l_a_x3, l_b_x3, l_c_x3, l_d_x3, l_e_x3))
]).transpose()

# Convert to dataframe, df_X.
df_X  = pd.DataFrame(    
    data    = m_X,
    columns = l_df_X_names
)

# Create copy of df_X.
df_data = df_X.copy()

# Add cluster label to df_data.
df_data['label'] = ps_label

# Create 'shadows' by projecting the data onto the planes x1 = 0, x2 = 0, and x3 = 0.
df_data_x1 = df_data.copy(); df_data_x2 = df_data.copy(); df_data_x3 = df_data.copy()

df_data_x1['x1'] = 0; df_data_x2['x2'] = 0; df_data_x3['x3'] = 0

# Concatenate data.
df_data_total = pd.concat([
    
    df_data,
    df_data_x1,
    df_data_x2,
    df_data_x3
], axis = 0)
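
The simulated clusters are drawn with `rand`, which is imported from `numpy.random`, so the draws are controlled by NumPy's global seed. The following is a minimal sketch of that behavior, not part of the original exercise; the variable names are illustrative only.

```python
import random

import numpy as np
from numpy.random import rand

# Seeding NumPy's global generator makes rand() repeatable.
np.random.seed(10)
v_first = rand(3)
np.random.seed(10)
v_second = rand(3)
print(np.array_equal(v_first, v_second))

# Seeding Python's built-in `random` module leaves NumPy's generator untouched,
# so the two rand() calls below simply continue the same stream and differ.
random.seed(10)
v_third = rand(3)
random.seed(10)
v_fourth = rand(3)
print(np.array_equal(v_third, v_fourth))
```

If the clusters change between runs, checking which generator was seeded is a good first step.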

### 1. Plot the simulated data in a 3D scatter plot, i.e., the high-dimensional space (tip: use Plotly Express' `scatter_3d()` function).

In [None]:
#

### 2. Use your mouse to rotate the data and observe the data from different directions. What do you observe in terms of the cluster locations relative to each other?

As you can see in the plot, the data have also been projected onto each of the three coordinate planes, like shadows. These projections help to show where clusters overlap when viewed along one of the three axes.

In [None]:
#

### 3. Create an object holding the standardized data in `df_X`

In [None]:
#
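
A minimal sketch of standardization with `StandardScaler`, shown on a hypothetical data frame `df_demo` rather than on `df_X` so that the block is self-contained. After scaling, each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df_X with features on different offsets and ranges.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "x1": 5 + 3 * rng.random(100),
    "x2": 15 + 3 * rng.random(100),
    "x3": 10 + 1 * rng.random(100),
})

# fit_transform() learns each column's mean and standard deviation and
# returns (x - mean) / std as a NumPy array.
m_demo_scaled = StandardScaler().fit_transform(df_demo)

print(m_demo_scaled.mean(axis=0).round(6))
print(m_demo_scaled.std(axis=0).round(6))
```

Note that the result is a NumPy array, not a data frame, so the column names have to be re-attached if they are needed later.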

## B. Dimensionality reduction with PCA and clustering

### 1. Create a PCA object with as many principal components as there are columns in data frame `df_X`.

In [None]:
#

### 2. Calculate the principal components from the scaled data using the PCA object, put them in a data frame with headers `PC1`, `PC2`, ..., and add the cluster labels in `ps_label` as a column to the data frame. Observe the first 5 rows to check your result.

In [None]:
#
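
The two steps above (creating the PCA object and computing the principal components) can be sketched as follows. The data, the labels, and all names ending in `_demo` are assumptions for illustration; the exercise itself uses the scaled `df_X` and `ps_label`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for df_X and ps_label.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(60, 3)), columns=["x1", "x2", "x3"])
ps_demo_label = pd.Series(["A"] * 30 + ["B"] * 30)

m_demo_scaled = StandardScaler().fit_transform(df_demo)

# One principal component per original feature.
pca_demo = PCA(n_components=df_demo.shape[1])

# fit_transform() fits the components and projects the data onto them.
m_demo_pc = pca_demo.fit_transform(m_demo_scaled)

# Name the columns PC1, PC2, ... and attach the cluster labels.
l_pc_names = [f"PC{i + 1}" for i in range(m_demo_pc.shape[1])]
df_demo_pc = pd.DataFrame(m_demo_pc, columns=l_pc_names)
df_demo_pc["label"] = ps_demo_label

print(df_demo_pc.head())
print(pca_demo.explained_variance_ratio_)
```

Because all components are kept, the explained variance ratios add up to 1; they are sorted from the largest to the smallest share.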

### 3. Obtain the loadings from the fitted PCA object. Put them in a data frame with the same header names as in `df_X`. Set the indices of the rows to `PC1`, `PC2`, … What do you conclude from the loadings of the first and second principal component? How does this relate to your observation in the 3D scatterplot?

In [None]:
#

In [None]:
#
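
A possible way to tabulate the loadings is sketched below. It assumes that "loadings" refers to the rows of the fitted object's `components_` attribute (unit-length direction vectors); some texts additionally scale these by the square root of the explained variance. The demo data, in which `x2` closely follows `x1`, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: x2 is almost a copy of x1, x3 is independent noise.
rng = np.random.default_rng(0)
v_x1 = rng.normal(size=200)
df_demo = pd.DataFrame({
    "x1": v_x1,
    "x2": v_x1 + 0.1 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})

pca_demo = PCA(n_components=3).fit(StandardScaler().fit_transform(df_demo))

# components_ has one row per principal component and one column per feature.
df_demo_loadings = pd.DataFrame(
    pca_demo.components_,
    columns=df_demo.columns,
    index=[f"PC{i + 1}" for i in range(pca_demo.n_components_)],
)

print(df_demo_loadings.round(2))
```

In this demo the first component is dominated by `x1` and `x2`, which reflects that these two features vary together. The sign of a whole row is arbitrary, so only the relative magnitudes and signs within a row carry meaning.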

### 4. Create a 2D scatter plot of `PC1` against `PC2`

In [None]:
#

### 5. What do you observe in the 2D scatter plot in terms of the cluster locations relative to each other? What explains your observation?

In [None]:
#

### 6. Increase `n_x3_dist` to six and re-run the script up to this point. What happens to the separation of clusters B and C in the 2D scatter plot? What explains your observation? When you are done, set the value back to two and re-run the script up to this point before moving on to the next question.

In [None]:
#

## C. Dimensionality reduction with t-SNE and clustering

Let's follow another approach (t-SNE) to reduce the dimensionality of the data to study clusters in the data.

### 1. Create a t-SNE object.

In [None]:
#

### 2. Fit the t-SNE object to the scaled data, put the resulting matrix in a data frame with headers `dim1` and `dim2`, and add the cluster labels in `ps_label` as a third column to the data frame.

In [None]:
#
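
The two steps above (creating the t-SNE object and embedding the scaled data) can be sketched as follows. The two-cluster demo data, the perplexity of 15, and the fixed `random_state` are assumptions chosen to keep the example small and repeatable; the exercise uses the scaled `df_X` and `ps_label`.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: two well-separated clusters of 30 points each.
rng = np.random.default_rng(0)
m_demo = np.vstack((
    rng.normal(loc=0.0, size=(30, 3)),
    rng.normal(loc=8.0, size=(30, 3)),
))
ps_demo_label = pd.Series(["A"] * 30 + ["B"] * 30)
m_demo_scaled = StandardScaler().fit_transform(m_demo)

# perplexity must be smaller than the number of samples (60 here).
tsne_demo = TSNE(n_components=2, perplexity=15, random_state=42)

# t-SNE has no separate transform() step; fit_transform() returns the embedding.
m_demo_tsne = tsne_demo.fit_transform(m_demo_scaled)

df_demo_tsne = pd.DataFrame(m_demo_tsne, columns=["dim1", "dim2"])
df_demo_tsne["label"] = ps_demo_label

print(df_demo_tsne.head())
```

Unlike PCA, the fitted t-SNE object cannot project new points, so the embedding has to be recomputed whenever the data change.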

### 3. Create a 2D scatter plot of `dim1` against `dim2`.


In [None]:
#

### 4. What do you observe in the 2D scatter plot in terms of the cluster locations relative to each other?

In [None]:
#

### 5. Investigate the influence of the `perplexity` parameter in `TSNE()`. Check and explain the results for perplexity values 10, 30 (default), and 200.

In [None]:
#
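
One way to compare perplexity values is to loop over them and keep each embedding for plotting, as in the sketch below. Perplexity roughly sets how many neighbors each point takes into account, so low values emphasize local structure and high values emphasize more global structure. The demo data and the dictionary name `dc_embedding` are assumptions; the data set has 240 points so that the largest perplexity (200) stays below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: three blobs of 80 points each (240 samples).
rng = np.random.default_rng(0)
m_demo = np.vstack([rng.normal(loc=c, size=(80, 3)) for c in (0.0, 6.0, 12.0)])
m_demo_scaled = StandardScaler().fit_transform(m_demo)

# Fit one embedding per perplexity value and store it under that value.
dc_embedding = {}
for n_perplexity in (10, 30, 200):
    tsne_demo = TSNE(n_components=2, perplexity=n_perplexity, random_state=42)
    dc_embedding[n_perplexity] = tsne_demo.fit_transform(m_demo_scaled)

for n_perplexity, m_embedding in dc_embedding.items():
    print(n_perplexity, m_embedding.shape)
```

Each stored matrix can then be turned into a data frame and plotted, which makes it easy to view the three results side by side.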

### 6. Extra - Investigate the influence of the `init` parameter. Why is the result always the same for the value 'pca' (default) and always different for the value 'random'?

In [None]:
#
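
A small experiment along these lines is sketched below. With `init='random'` the starting coordinates are drawn from the random number generator, so they change with `random_state`; with `init='pca'` the starting coordinates are computed from the data. The helper `embed()`, the demo data, and the seed values are hypothetical, and how closely the two `'pca'` runs agree may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in data: two well-separated clusters of 30 points each.
rng = np.random.default_rng(0)
m_demo = np.vstack((
    rng.normal(loc=0.0, size=(30, 3)),
    rng.normal(loc=8.0, size=(30, 3)),
))


def embed(init, seed):
    """Return the 2D t-SNE embedding for a given init mode and random_state."""
    tsne_demo = TSNE(n_components=2, perplexity=15, init=init, random_state=seed)
    return tsne_demo.fit_transform(m_demo)


m_random_seed_1 = embed("random", seed=1)
m_random_seed_2 = embed("random", seed=2)
m_pca_seed_1 = embed("pca", seed=1)
m_pca_seed_2 = embed("pca", seed=2)

# Largest coordinate difference between the two runs of each init mode.
print("random:", np.abs(m_random_seed_1 - m_random_seed_2).max())
print("pca:   ", np.abs(m_pca_seed_1 - m_pca_seed_2).max())
```

Comparing the two printed differences gives a first indication of how much each initialization depends on the seed.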