In [None]:
from google.colab import drive
drive.mount('/content/drive/')

%cd drive/MyDrive/clariah2025-dse-ml/materials/2025-09-09/

# Experiments in Dimensionality Reduction
### CLARIAH-AT Summer School on Machine Learning for Digital Scholarly Editions
Bernhard C. Geiger (TU Graz, Know Center Research GmbH)

## Data Generation
First, we generate three-dimensional toy datasets with scikit-learn. Although real embeddings have many more dimensions, we restrict ourselves to three so that the data can be visualized.

In [None]:
import pandas as pd
import plotly.express as px

import numpy as np
from sklearn import datasets, manifold

# Generate Swiss roll with hole
sh_points, sh_color = datasets.make_swiss_roll(n_samples=1500, hole=True, random_state=0)

df_sh = pd.DataFrame(sh_points)
df_sh['color']=sh_color

# Generate s-curve
s_points, s_color = datasets.make_s_curve(n_samples=1500, noise=0.05, random_state=0)

df_s = pd.DataFrame(s_points)
df_s['color']=s_color

# Generate Gaussian blobs
g_points, g_color = datasets.make_blobs(n_samples=1500, centers=5, n_features=3,
                                        random_state=1)

df_g = pd.DataFrame(g_points)
df_g['color']=g_color

We next visualize the data using Plotly. Feel free to swap in the other datasets, and experiment with the parameters of the data generation methods (e.g., more or fewer centers for the Gaussian blobs, more or less noise for the S-curve).

In [None]:
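df = df_g  # swap in df_sh or df_s to view the other datasets

# Create interactive 3D scatter plot
fig = px.scatter_3d(df, x=0, y=1, z=2, color='color', opacity=0.7)

# size_max only takes effect together with a `size` column,
# so the marker size is set directly on the traces instead
fig.update_traces(marker=dict(size=2))

fig.show()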

## Linear Dimensionality Reduction
We will now project the data onto a two-dimensional Euclidean space, both with a prior rotation (PCA, truncated SVD) and without one. For each of the three datasets, what do you observe? Is dimensionality reduction "successful"? How do you determine that?
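One possible way to put a number on "success" is scikit-learn's `trustworthiness` score, which measures how well local neighborhoods of the original data are preserved in the low-dimensional representation (values lie between 0 and 1). This is only a sketch of one criterion, not necessarily the one intended in this course; the choice of `n_neighbors=10` and the smaller sample size are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.manifold import trustworthiness

# Smaller S-curve sample so that the score is quick to compute
X, _ = datasets.make_s_curve(n_samples=500, noise=0.05, random_state=0)

# Candidate 2D representation: keep only the first two coordinates
X_2d = X[:, :2]

# Fraction-like score in [0, 1]; higher means neighborhoods are better preserved
score = trustworthiness(X, X_2d, n_neighbors=10)
print(f"Trustworthiness: {score:.3f}")
```

The same call can be applied to the output of any of the methods below, which makes it easier to compare them on equal terms. Visual inspection remains a useful complement.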

### Projection
(just remove one of the coordinates)
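A minimal sketch of this projection, assuming the same blob parameters as in the data generation cell. The helper name `drop_coordinate` is made up for this example.

```python
import numpy as np
from sklearn import datasets

points, color = datasets.make_blobs(n_samples=1500, centers=5, n_features=3,
                                    random_state=1)

def drop_coordinate(points, axis):
    """Project onto the plane spanned by the two remaining coordinate axes."""
    return np.delete(points, axis, axis=1)

# Remove the third coordinate (index 2); try 0 or 1 to see other views
proj = drop_coordinate(points, axis=2)
print(proj.shape)
```

The result can be plotted with `px.scatter(x=proj[:, 0], y=proj[:, 1], color=color)`. Dropping a different coordinate may give a noticeably different picture, especially for the Swiss roll and the S-curve.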

### Truncated singular value decomposition (SVD)
(see slides for code examples)

Try to vary the data. For example, what happens if you add an offset to one of the dimensions?
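The slides contain the reference code; the following is only a rough sketch using scikit-learn's `TruncatedSVD`. The offset of 100 along the first dimension is an arbitrary value chosen for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD

points, color = datasets.make_blobs(n_samples=1500, centers=5, n_features=3,
                                    random_state=1)

# Truncated SVD works on the raw data matrix, i.e., it does not center the data
svd = TruncatedSVD(n_components=2, random_state=0)
points_svd = svd.fit_transform(points)

# Same data with an offset added to the first dimension
shifted = points + np.array([100.0, 0.0, 0.0])
svd_shifted = TruncatedSVD(n_components=2, random_state=0)
points_svd_shifted = svd_shifted.fit_transform(shifted)

# Compare the leading directions found in both cases
print("original:", svd.components_[0])
print("shifted: ", svd_shifted.components_[0])
```

Plot `points_svd` and `points_svd_shifted` side by side and inspect the printed directions to see how the offset influences the result.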

### Principal components analysis (PCA)
(see slides for code examples)

Vary the data in this case as well. If you add an offset to one of the dimensions, does the result still change?
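Again, the slides contain the reference code; this sketch uses scikit-learn's `PCA` and repeats the offset experiment with the same arbitrary offset of 100.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

points, color = datasets.make_blobs(n_samples=1500, centers=5, n_features=3,
                                    random_state=1)

# PCA subtracts the mean of each feature before computing the decomposition
pca = PCA(n_components=2)
points_pca = pca.fit_transform(points)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Same data with an offset added to the first dimension
shifted = points + np.array([100.0, 0.0, 0.0])
points_pca_shifted = PCA(n_components=2).fit_transform(shifted)

# Largest difference between the two projections (ignoring possible sign flips)
diff = np.max(np.abs(np.abs(points_pca) - np.abs(points_pca_shifted)))
print("max difference:", diff)
```

The attribute `explained_variance_ratio_` reports the share of the total variance captured by each component, which is one common indicator of how much information the projection keeps.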

## Nonlinear Dimensionality Reduction
We will now investigate nonlinear dimensionality reduction methods such as t-SNE and UMAP. Here it is important to study the influence of the hyperparameters. How do you set these parameters such that dimensionality reduction is "successful"?

### t-distributed stochastic neighbor embedding (t-SNE)
(see slides for code example)
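As a starting point for the hyperparameter study, the sketch below runs scikit-learn's `TSNE` for a few perplexity values. The specific values (5, 30, 100), the PCA initialization, and the reduced sample size are assumptions chosen to keep the runtime short, not recommendations from the slides.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

# Smaller S-curve sample so that several runs finish quickly
points, color = datasets.make_s_curve(n_samples=500, noise=0.05, random_state=0)

# One embedding per perplexity value; perplexity must be smaller than n_samples
embeddings = {}
for perplexity in (5, 30, 100):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(points)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can be plotted with `px.scatter(x=emb[:, 0], y=emb[:, 1], color=color)`. Comparing the plots shows how strongly the perplexity shapes the result.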

### Uniform manifold approximation and projection (UMAP)
(see slides for code example)
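UMAP is not part of scikit-learn; the sketch below assumes the separate `umap-learn` package (installable with `pip install umap-learn`) and skips the computation if it is missing. The values for `n_neighbors` and `min_dist` are the package defaults and serve only as a starting point for experiments.

```python
from sklearn import datasets

try:
    import umap  # provided by the umap-learn package
except ImportError:
    umap = None

points, color = datasets.make_s_curve(n_samples=500, noise=0.05, random_state=0)

embedding = None
if umap is not None:
    # n_neighbors balances local vs. global structure; min_dist controls how
    # tightly points may be packed in the embedding
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                        random_state=0)
    embedding = reducer.fit_transform(points)
    print(embedding.shape)
else:
    print("umap-learn is not installed; run `pip install umap-learn` first")
```

Varying `n_neighbors` (e.g., 5, 15, 50) and `min_dist` (e.g., 0.0, 0.1, 0.5) and plotting the resulting embeddings is a simple way to study their influence.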

## Experiments with Text Embeddings
The following cell loads utterance embeddings of the ParlaMint corpus, obtained with a sentence transformer. These are dense embeddings. Reduce their dimensionality to two or three dimensions and visualize the results. Do you see topic clusters? (Note that in subsequent experiments you will use higher dimensions; we use these low dimensions just for the sake of illustration.)

In [None]:
# df=pd.read_pickle("SENTENCE_TRANSFORMER_LARGE_PER_UTTERANCE.pkl")
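How the embeddings are stored in the pickled DataFrame is not shown here, so the following sketch uses a random stand-in matrix instead of the real data. The matrix shape (1000 utterances, 384 dimensions) is purely hypothetical; replace `embeddings` with the actual embedding matrix once it is loaded.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real embedding matrix (rows: utterances, columns: dimensions).
# Both the number of rows and the dimensionality are assumptions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))

# Reduce to three dimensions for plotting
pca = PCA(n_components=3)
embeddings_3d = pca.fit_transform(embeddings)

print(embeddings_3d.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())
```

The array `embeddings_3d` can be passed to `px.scatter_3d(x=embeddings_3d[:, 0], y=embeddings_3d[:, 1], z=embeddings_3d[:, 2])`. PCA is used here only because it is fast; t-SNE or UMAP can be substituted in the same place.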