# From Audio Tracks to Acoustic Embeddings and Clusters

In this notebook, we will:

1. Take a **folder of audio files** (e.g., songs or stems).  
2. For each track, compute a compact **acoustic embedding** using:
   - **Mel-spectrogram statistics** (how energy is distributed across frequency)
   - **MFCC statistics** (a classic timbre representation)
   - **Spectral features** (centroid, bandwidth, rolloff, flatness, RMS, zero-crossing rate)
3. Stack these embeddings into a feature matrix (one vector per track).  
4. Run two clustering methods:
   - **k-means** (you choose the number of clusters)
   - **HDBSCAN** (finds clusters and outliers automatically)

> The acoustic embedding is a **fixed-length numeric vector** that summarizes the “sound fingerprint” of a track.  
> We will use these vectors as inputs for clustering (k-means & HDBSCAN).


### Install dependencies

In [28]:
# --- Install dependencies ---
!pip install librosa hdbscan umap-learn --quiet


### Imports and configuration

We will:

- Import **librosa** for audio feature extraction.  
- Import **NumPy / pandas** for numerical work and tables.  
- Import **scikit-learn** for k-means and scaling.  
- Import **HDBSCAN** for density-based clustering.  
- Set the folder that contains our audio files.

> In Colab, you can either:
> - Upload audio files directly, or  
> - Mount Google Drive and point `AUDIO_FOLDER` to a folder in your Drive.
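
A minimal sketch of the second option, assuming the notebook may run either in Colab or on a local machine. The helper names `running_in_colab` and `resolve_audio_folder` and both example paths are hypothetical; `google.colab.drive.mount` only exists inside Colab, so the call is guarded.

```python
from pathlib import Path


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False


def resolve_audio_folder(drive_subpath: str, local_path: str) -> Path:
    """Use a Google Drive folder in Colab, otherwise a local folder."""
    if running_in_colab():
        from google.colab import drive
        drive.mount("/content/drive")  # asks for authorization in the browser
        return Path("/content/drive/MyDrive") / drive_subpath
    return Path(local_path)


# Hypothetical paths; replace them with your own.
folder = resolve_audio_folder("LS100/audio_tracks", "./audio_tracks")
print(folder)
```

The returned path can then be assigned to `AUDIO_FOLDER` in the configuration cell.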



In [3]:
# --- Cell 2: Imports and configuration ---
from pathlib import Path
import warnings
import math

import numpy as np
import pandas as pd
import librosa

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import hdbscan

print("librosa version:", librosa.__version__)

# === USER: Set this to your audio folder ===
# Example if using Google Drive (after mounting):
# AUDIO_FOLDER = Path("/content/drive/MyDrive/LS100/audio_tracks")
AUDIO_FOLDER = Path("/Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips")  # <-- CHANGE THIS

SAMPLE_RATE_TARGET = 22050   # resample target for feature extraction (Hz)
N_MELS = 64                  # number of mel bands
N_MFCC = 20                  # number of MFCC coefficients

print("Audio folder:", AUDIO_FOLDER)


librosa version: 0.11.0
Audio folder: /Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips


### Helper: list files and quick check

We’ll search the folder (and subfolders) for common audio extensions and make sure we have files to work with.


In [4]:
# --- Cell 3: Helper function: list audio files ---

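def list_audio_files(folder: Path):
    """Recursively collect audio files under `folder`, sorted by path."""
    exts = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}
    # rglob("*") already covers the top-level folder and all subfolders;
    # comparing the lowercased suffix also catches names like "TRACK.WAV"
    files = [p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() in exts]
    return sorted(files)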

if not AUDIO_FOLDER.exists():
    raise FileNotFoundError(f"AUDIO_FOLDER does not exist: {AUDIO_FOLDER}")

audio_files = list_audio_files(AUDIO_FOLDER)
print(f"Found {len(audio_files)} audio files.")
for p in audio_files[:5]:
    print("  ", p.name)


Found 13 audio files.
   Mediu Zhiga.wav
   Ra Bacheeza.wav
   [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3
   [SP] Alvaro Carrillo - Pinotepa Nacional.mp3
   [SP] Lagrimas Negras.mp3


### Acoustic Embedding: Idea

For each audio file, we will compute a **fixed-length feature vector** that summarizes its sound:

1. **Mel-spectrogram statistics**
   - Compute a Mel-spectrogram (energy over time and frequency, with the frequency axis on the perceptually motivated mel scale).
   - Convert to dB.
   - Take **mean** and **standard deviation** across time for each Mel band.  
   → This captures overall spectral shape / timbre.

2. **MFCC statistics**
   - Compute MFCCs (a compact representation of timbre).
   - Take mean and standard deviation over time for each MFCC coefficient.

3. **Spectral features (each with mean and std over time)**
   - Spectral centroid (brightness)
   - Spectral bandwidth
   - Spectral rolloff (the frequency below which 85% of the spectral energy lies)
   - Spectral flatness (tonal vs noise-like)
   - RMS energy (a rough proxy for loudness)
   - Zero-crossing rate (higher for noisy or percussive signals)

All these are concatenated into a single **embedding vector** per track:

> One row per track, one column per feature → ready for clustering.
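
To see why summarizing over time yields a fixed-length vector regardless of track duration, here is a small sketch on synthetic data. The arrays stand in for per-frame features such as a mel-spectrogram; `summarize_over_time` is an illustrative helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def summarize_over_time(frames: np.ndarray) -> np.ndarray:
    """Collapse a (n_features, n_frames) matrix to mean and std per feature."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])


# Two synthetic "tracks" with 64 bands but very different numbers of frames
short_track = rng.normal(size=(64, 500))
long_track = rng.normal(size=(64, 8000))

emb_short = summarize_over_time(short_track)
emb_long = summarize_over_time(long_track)
print(emb_short.shape, emb_long.shape)  # (128,) (128,)

# With this notebook's settings: mean + std for 64 mel bands, 20 MFCCs,
# and 6 spectral features
n_mels, n_mfcc, n_spectral = 64, 20, 6
print(2 * n_mels + 2 * n_mfcc + 2 * n_spectral)  # 180
```

The trade-off is that all temporal ordering is discarded: two tracks with the same frames in a different order would get the same embedding.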


### Acoustic embedding function

In [5]:
# --- Cell 4: Acoustic embedding for one track ---

def extract_acoustic_embedding(
    path: Path,
    sr_target: int = SAMPLE_RATE_TARGET,
    n_mels: int = N_MELS,
    n_mfcc: int = N_MFCC,
) -> dict:
    """
    Load an audio file, compute mel + MFCC + spectral statistics,
    and return a dict with:
      - file_name, duration, sample_rate
      - embedding (1D numpy array)
      - feature_names (list of strings, same length as embedding)
    """
    # Load mono audio at target sample rate
    y, sr = librosa.load(path, sr=sr_target, mono=True)
    duration = len(y) / sr

    # Trim leading/trailing silence a bit (optional, helps avoid long tails of silence)
    y_trim, _ = librosa.effects.trim(y, top_db=30)
    if len(y_trim) < int(0.5 * sr):
        # too little audio after trimming → fall back to original
        y_trim = y

    # Mel-spectrogram
    S = librosa.feature.melspectrogram(
        y=y_trim,
        sr=sr,
        n_fft=2048,
        hop_length=512,
        n_mels=n_mels,
        power=2.0,
    )
    S_db = librosa.power_to_db(S, ref=np.max)

    mel_mean = np.mean(S_db, axis=1)
    mel_std  = np.std(S_db, axis=1)

    # MFCCs computed from the log-power (dB) mel-spectrogram
    mfcc = librosa.feature.mfcc(S=S_db, sr=sr, n_mfcc=n_mfcc)
    mfcc_mean = np.mean(mfcc, axis=1)
    mfcc_std  = np.std(mfcc, axis=1)

    # Spectral features (computed on trimmed waveform)
    spec_cent = librosa.feature.spectral_centroid(y=y_trim, sr=sr)[0]
    spec_bw   = librosa.feature.spectral_bandwidth(y=y_trim, sr=sr)[0]
    spec_roll = librosa.feature.spectral_rolloff(y=y_trim, sr=sr, roll_percent=0.85)[0]
    spec_flat = librosa.feature.spectral_flatness(y=y_trim)[0]
    rms       = librosa.feature.rms(y=y_trim)[0]
    zcr       = librosa.feature.zero_crossing_rate(y_trim)[0]

    def stats(x):
        return np.array([np.mean(x), np.std(x)], dtype=float)

    spec_stats = np.concatenate([
        stats(spec_cent),
        stats(spec_bw),
        stats(spec_roll),
        stats(spec_flat),
        stats(rms),
        stats(zcr),
    ])

    # Build feature names
    feat_names = []
    # mel
    for i in range(n_mels):
        feat_names.append(f"mel_mean_{i}")
    for i in range(n_mels):
        feat_names.append(f"mel_std_{i}")
    # mfcc
    for i in range(n_mfcc):
        feat_names.append(f"mfcc_mean_{i}")
    for i in range(n_mfcc):
        feat_names.append(f"mfcc_std_{i}")
    # spectral
    spec_labels = [
        "spec_cent", "spec_bw", "spec_roll", "spec_flat",
        "rms", "zcr"
    ]
    for name in spec_labels:
        feat_names.append(f"{name}_mean")
        feat_names.append(f"{name}_std")

    embedding = np.concatenate([mel_mean, mel_std, mfcc_mean, mfcc_std, spec_stats])

    assert len(embedding) == len(feat_names), "Feature length mismatch"

    return {
        "file_name": path.name,
        "file_path": str(path),
        "duration_sec": float(duration),
        "sample_rate": int(sr),
        "embedding": embedding,
        "feature_names": feat_names,
    }

# Quick sanity check on one file (if available)
if audio_files:
    test_meta = extract_acoustic_embedding(audio_files[0])
    print("One embedding size:", len(test_meta["embedding"]))
    print("First 5 feature names:", test_meta["feature_names"][:5])


One embedding size: 180
First 5 feature names: ['mel_mean_0', 'mel_mean_1', 'mel_mean_2', 'mel_mean_3', 'mel_mean_4']


### Compute embeddings for all tracks

Now we will loop over all audio files in the folder and:

- Compute the acoustic embedding for each track.  
- Store embeddings and basic metadata in a pandas DataFrame.  
- This DataFrame will be our **feature matrix** for clustering.


### Compute embeddings

In [6]:
# --- Cell 5: Compute embeddings for all tracks ---

all_rows = []
feature_names = None

for i, path in enumerate(audio_files):
    print(f"[{i+1}/{len(audio_files)}] {path.name}")
    try:
        meta = extract_acoustic_embedding(path)
        if feature_names is None:
            feature_names = meta["feature_names"]
        row = {
            "file_name": meta["file_name"],
            "file_path": meta["file_path"],
            "duration_sec": meta["duration_sec"],
            "sample_rate": meta["sample_rate"],
        }
        # add embedding dimensions
        for fname, val in zip(feature_names, meta["embedding"]):
            row[fname] = float(val)
        all_rows.append(row)
    except Exception as e:
        print(f"  ⚠️ Error on {path.name}: {e}")

emb_df = pd.DataFrame(all_rows)
print("\nEmbedding DataFrame shape:", emb_df.shape)
emb_df.head()


[1/13] Mediu Zhiga.wav
[2/13] Ra Bacheeza.wav
[3/13] [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3
[4/13] [SP] Alvaro Carrillo - Pinotepa Nacional.mp3
[5/13] [SP] Lagrimas Negras.mp3
[6/13] [SP] Los Panchos - Contigo.mp3
[7/13] [SP] Los Panchos - Jamas Jamas Jamas.mp3
[8/13] [SP] Los Panchos - Te Quiero Dijiste.mp3
[9/13] [SP] Soledad y el Mar - Natalia Lafourcade.mp3
[10/13] [ZAP] Binni Gula_za - Ni_bixi Dxi Zina.mp3
[11/13] [ZAP] Mediu Zhiga.mp3
[12/13] [ZAP] Ra Bacheeza.mp3
[13/13] [ZAP] Sabor a Mi - Trio Galenos Y Mario Carrillo.mp3

Embedding DataFrame shape: (13, 184)


|   | file_name | file_path | duration_sec | sample_rate | mel_mean_0 | mel_mean_1 | mel_mean_2 | mel_mean_3 | mel_mean_4 | mel_mean_5 | ... | spec_bw_mean | spec_bw_std | spec_roll_mean | spec_roll_std | spec_flat_mean | spec_flat_std | rms_mean | rms_std | zcr_mean | zcr_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mediu Zhiga.wav | /Users/souvikmandal/Documents/06_Teaching_Ment... | 188.05551 | 22050 | -24.645851 | -15.943171 | -17.33828 | -20.33445 | -24.937132 | -28.436523 | ... | 1926.905314 | 467.466582 | 2638.648068 | 1392.164225 | 0.002418 | 0.004076 | 0.157222 | 0.061175 | 0.041478 | 0.020515 |
| 1 | Ra Bacheeza.wav | /Users/souvikmandal/Documents/06_Teaching_Ment... | 187.585306 | 22050 | -34.979458 | -21.680338 | -20.455389 | -19.014982 | -22.339806 | -28.481764 | ... | 1797.522246 | 594.470306 | 2732.360967 | 1819.846687 | 0.002189 | 0.004974 | 0.127954 | 0.067294 | 0.052537 | 0.034499 |
| 2 | [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3 | /Users/souvikmandal/Documents/06_Teaching_Ment... | 198.629433 | 22050 | -26.876875 | -30.842556 | -35.46133 | -32.830112 | -29.271029 | -32.626991 | ... | 3088.165319 | 325.530537 | 7335.079551 | 1333.095666 | 0.04075 | 0.054341 | 0.048351 | 0.029437 | 0.114178 | 0.054867 |
| 3 | [SP] Alvaro Carrillo - Pinotepa Nacional.mp3 | /Users/souvikmandal/Documents/06_Teaching_Ment... | 186.619864 | 22050 | -47.052795 | -34.028484 | -20.187807 | -17.814312 | -18.787128 | -22.830231 | ... | 905.491822 | 255.589968 | 1454.355524 | 557.564729 | 0.000114 | 0.000618 | 0.141318 | 0.066385 | 0.049439 | 0.021564 |
| 4 | [SP] Lagrimas Negras.mp3 | /Users/souvikmandal/Documents/06_Teaching_Ment... | 80.34068 | 22050 | -25.006502 | -17.543041 | -21.075603 | -24.617285 | -26.99081 | -28.331671 | ... | 1897.391561 | 380.930731 | 2885.990661 | 1163.436448 | 0.003032 | 0.006082 | 0.113102 | 0.046774 | 0.064276 | 0.032124 |


### Prepare feature matrix for clustering

To cluster tracks, we will:

1. Extract only the **numeric feature columns** (embedding dimensions).  
2. Standardize them using **z-score scaling** (mean 0, std 1) so that:
   - Mel bands, MFCCs, and spectral features are comparable in scale.  
3. Keep `file_name` and `duration_sec` for interpreting results later.
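
As a worked example of step 2, the sketch below applies z-score scaling by hand to a toy matrix with NumPy. The numbers are made up (a spectral centroid in Hz next to a zero-crossing rate) to show two features on very different scales; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Toy matrix: 4 tracks x 2 features on very different scales (made-up values)
X_toy = np.array([
    [1800.0, 0.04],
    [2700.0, 0.05],
    [900.0, 0.11],
    [3100.0, 0.06],
])

mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)   # population std (ddof=0)
sigma[sigma == 0] = 1.0     # leave constant columns unscaled instead of dividing by zero
X_z = (X_toy - mu) / sigma

print(X_z.mean(axis=0).round(6))  # each column now has mean ~0
print(X_z.std(axis=0).round(6))   # and standard deviation ~1
```

Without this step, Euclidean distances would be dominated by whichever features happen to have the largest numeric range.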


### Build X and scale

In [7]:
# --- Cell 6: Build feature matrix X and standardize ---

if feature_names is None:
    raise RuntimeError("No embeddings were computed. Check earlier cells.")

# X will contain only the embedding dimensions
X = emb_df[feature_names].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Feature matrix shape (num_tracks, num_features):", X_scaled.shape)


Feature matrix shape (num_tracks, num_features): (13, 180)


### k-means clustering on acoustic embeddings

We first use **k-means**:

- You choose the number of clusters `k`.  
- The algorithm partitions the tracks into `k` groups by assigning each one to the nearest cluster center in embedding space.  
- Each track gets a **cluster ID**: 0, 1, 2, …, k−1.

> k-means assumes clusters are roughly spherical and of similar size.  
> It is simple and fast, but sometimes misses irregular or uneven clusters.
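
To make the mechanics concrete, here is a bare-bones sketch of the classic Lloyd iteration behind k-means, written with NumPy on synthetic 2D blobs. It is an illustration only: `lloyd_kmeans` is a hypothetical helper, and scikit-learn's `KMeans` adds better initialization and multiple restarts (`n_init`) on top of this idea.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids


# Two well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X_demo = np.vstack([blob_a, blob_b])

labels, centroids = lloyd_kmeans(X_demo, k=2)
print(np.bincount(labels))  # [20 20]
```

Because every point is always assigned to some centroid, k-means never leaves a track unclustered, which is the main contrast with HDBSCAN.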


In [14]:
# --- Cell 7: k-means clustering ---

# === USER: choose the number of clusters ===
K = 2  # try 3, 4, 5, ... and compare

kmeans = KMeans(n_clusters=K, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)

emb_df["cluster_kmeans"] = kmeans_labels

print("k-means cluster counts:")
print(emb_df["cluster_kmeans"].value_counts().sort_index())

emb_df[["file_name", "cluster_kmeans"]].head(20)


k-means cluster counts:
cluster_kmeans
0    9
1    4
Name: count, dtype: int64


|   | file_name | cluster_kmeans |
|---|---|---|
| 0 | Mediu Zhiga.wav | 0 |
| 1 | Ra Bacheeza.wav | 1 |
| 2 | [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3 | 0 |
| 3 | [SP] Alvaro Carrillo - Pinotepa Nacional.mp3 | 0 |
| 4 | [SP] Lagrimas Negras.mp3 | 0 |
| 5 | [SP] Los Panchos - Contigo.mp3 | 0 |
| 6 | [SP] Los Panchos - Jamas Jamas Jamas.mp3 | 0 |
| 7 | [SP] Los Panchos - Te Quiero Dijiste.mp3 | 1 |
| 8 | [SP] Soledad y el Mar - Natalia Lafourcade.mp3 | 1 |
| 9 | [ZAP] Binni Gula_za - Ni_bixi Dxi Zina.mp3 | 0 |


### HDBSCAN clustering

Next, we use **HDBSCAN** (Hierarchical Density-Based Spatial Clustering of Applications with Noise):

- It can find clusters of **different densities and shapes**.  
- It **does not require choosing k** ahead of time.  
- It can label some tracks as **noise/outliers** with label `-1`.

We will:

- Run HDBSCAN on the same standardized feature matrix.  
- Get a cluster label per track (`cluster_hdbscan`).
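
HDBSCAN's notion of density rests on the *mutual reachability distance*: the distance between two points is raised to at least the core distance of each, where a point's core distance is its distance to its `min_samples`-th nearest neighbor. The NumPy sketch below illustrates that one ingredient on four toy points. The helper `mutual_reachability` is hypothetical, and whether a point counts as its own neighbor differs between implementations, so the exact indexing is an assumption.

```python
import numpy as np


def mutual_reachability(X, min_samples):
    """Pairwise Euclidean, core, and mutual reachability distances for rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Core distance: distance to the min_samples-th nearest neighbor.
    # Column 0 of the sorted matrix is the point itself (distance 0);
    # counting from column 1 is an assumption about the convention.
    core = np.sort(dist, axis=1)[:, min_samples]
    mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return dist, core, mreach


# Three tightly packed points plus one isolated point
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
dist, core, mreach = mutual_reachability(X_demo, min_samples=2)

print(core.round(3))    # the isolated point has a much larger core distance
print(mreach.round(3))  # never smaller than the plain Euclidean distance
```

Points in sparse regions get large core distances, which pushes them away from everything else; this is why isolated tracks tend to end up with the noise label `-1`.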


In [13]:
# --- Cell 8: HDBSCAN clustering ---

# === USER: tweak these if needed ===
MIN_CLUSTER_SIZE = 4   # minimum number of tracks to form a cluster
MIN_SAMPLES      = 2  # None → defaults to MIN_CLUSTER_SIZE; or set an int

hdbscan_clusterer = hdbscan.HDBSCAN(
    min_cluster_size=MIN_CLUSTER_SIZE,
    min_samples=MIN_SAMPLES,
    metric="euclidean",
    cluster_selection_method="eom"
)

hdbscan_labels = hdbscan_clusterer.fit_predict(X_scaled)
emb_df["cluster_hdbscan"] = hdbscan_labels

print("HDBSCAN cluster counts (including -1 = noise):")
print(emb_df["cluster_hdbscan"].value_counts().sort_index())

emb_df[["file_name", "cluster_hdbscan"]].head(10)


HDBSCAN cluster counts (including -1 = noise):
cluster_hdbscan
-1    13
Name: count, dtype: int64




|   | file_name | cluster_hdbscan |
|---|---|---|
| 0 | Mediu Zhiga.wav | -1 |
| 1 | Ra Bacheeza.wav | -1 |
| 2 | [SP] Alfonso Ortiz Tirado - TE QUIERO DIJISTE.mp3 | -1 |
| 3 | [SP] Alvaro Carrillo - Pinotepa Nacional.mp3 | -1 |
| 4 | [SP] Lagrimas Negras.mp3 | -1 |
| 5 | [SP] Los Panchos - Contigo.mp3 | -1 |
| 6 | [SP] Los Panchos - Jamas Jamas Jamas.mp3 | -1 |
| 7 | [SP] Los Panchos - Te Quiero Dijiste.mp3 | -1 |
| 8 | [SP] Soledad y el Mar - Natalia Lafourcade.mp3 | -1 |
| 9 | [ZAP] Binni Gula_za - Ni_bixi Dxi Zina.mp3 | -1 |


### Save embeddings and cluster labels

Finally, we will save:

- A CSV file with:
  - file_name, duration, embedding features, k-means cluster, HDBSCAN cluster.  
- (Optional) You can also save as JSON or pickle if you want to reload easily.

This CSV can then be used in a separate notebook for:

- visualizations (e.g., 2D scatter plots using PCA/UMAP),  
- checking which songs fall into which cluster,  
- building recommendation or similarity tools.
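
A minimal sketch of how a separate notebook might reload the CSV and recover the feature matrix. It uses a tiny stand-in DataFrame and a temporary directory so that it runs on its own; the column values are invented, and the set of non-feature column names is an assumption based on the columns created in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for emb_df; the real file has 180 feature columns
demo_df = pd.DataFrame({
    "file_name": ["a.wav", "b.mp3", "c.mp3"],
    "duration_sec": [188.1, 187.6, 80.3],
    "mel_mean_0": [-24.6, -35.0, -25.0],
    "rms_mean": [0.157, 0.128, 0.113],
    "cluster_kmeans": [0, 1, 0],
    "cluster_hdbscan": [-1, -1, -1],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "acoustic_embeddings_with_clusters.csv"
    demo_df.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

# Everything that is not metadata or a cluster label is an embedding feature
non_feature_cols = {"file_name", "file_path", "duration_sec", "sample_rate",
                    "cluster_kmeans", "cluster_hdbscan"}
feature_cols = [c for c in loaded.columns if c not in non_feature_cols]
X_loaded = loaded[feature_cols].to_numpy()

print(feature_cols)    # ['mel_mean_0', 'rms_mean']
print(X_loaded.shape)  # (3, 2)
```

Remember to re-apply scaling after reloading, since the CSV stores the raw (unscaled) feature values.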


In [15]:
# --- Cell 9: Save embeddings + cluster labels to CSV ---

OUTPUT_CSV = AUDIO_FOLDER / "acoustic_embeddings_with_clusters.csv"
emb_df.to_csv(OUTPUT_CSV, index=False)

print("Saved embeddings + clusters to:")
print(OUTPUT_CSV)


Saved embeddings + clusters to:
/Users/souvikmandal/Documents/06_Teaching_Mentoring/LS100_comp_etho/2025/media/audio/music_tracks/Audio_Clips/acoustic_embeddings_with_clusters.csv


### Visualizing Clusters in 2D with PCA + Plotly - Setup

Our acoustic embeddings live in a **high-dimensional space** (180 features per track).  
To visualize them, we’ll compress them down to **2 dimensions** using **PCA** (Principal Component Analysis):

- PCA finds directions (components) that capture the most variance in the data.  
- We’ll project each track’s embedding to `(PC1, PC2)` and make **scatter plots**.

We’ll then color points by:

- **k-means cluster ID**  
- **HDBSCAN cluster ID** (with `-1` = noise/outliers)

This will let us see how clusters are arranged in the acoustic space.
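
For intuition about what the projection computes, the sketch below performs a 2D PCA by hand with a singular value decomposition in NumPy. The data are synthetic (13 rows and 180 columns, driven mostly by two latent factors) and `pca_2d` is an illustrative helper; scikit-learn's `PCA` should give the same coordinates up to a possible sign flip per component.

```python
import numpy as np


def pca_2d(X):
    """Project rows of X onto the first two principal components via SVD."""
    Xc = X - X.mean(axis=0)              # PCA operates on centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T               # (n_samples, 2) scores on PC1 and PC2
    var = S ** 2                         # proportional to variance per component
    ratio = var[:2] / var.sum()          # explained variance ratio of PC1, PC2
    return coords, ratio


rng = np.random.default_rng(0)
# Synthetic data: 13 "tracks", 180 features, mostly driven by 2 latent factors
latent = rng.normal(size=(13, 2))
mixing = rng.normal(size=(2, 180))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(13, 180))

coords, ratio = pca_2d(X_demo)
print(coords.shape)
print(ratio.round(3), ratio.sum().round(3))
```

Note that with only 13 tracks, the centered data has at most 12 directions of non-zero variance, no matter how many features there are.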


In [None]:
# --- PCA projection to 2D and Plotly setup ---

from sklearn.decomposition import PCA
import plotly.express as px

# Compute 2D PCA embedding from X_scaled
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

emb_df["pca_x"] = X_pca[:, 0]
emb_df["pca_y"] = X_pca[:, 1]

print("Explained variance by PC1 and PC2:",
      pca.explained_variance_ratio_[0],
      pca.explained_variance_ratio_[1])


Explained variance by PC1 and PC2: 0.2872734305504054 0.23867184477917308


### PCA scatter plot colored by k-means clusters

Each point is one track:

- Position = projection of its acoustic embedding to the first two principal components (PC1, PC2).  
- Color = **k-means cluster ID**.  
- Hover text = file name.

> This gives a geometric picture of how k-means partitioned the acoustic space.


#### Color options for the points in the plot

Plotly Express ships several qualitative palettes, for example:

```python
px.colors.qualitative.Vivid
px.colors.qualitative.Dark24
px.colors.qualitative.Set1   # very bold, good for small number of clusters
px.colors.qualitative.Set3   # pastel but distinct
px.colors.qualitative.Alphabet  # huge palette
```

To switch palettes, replace `px.colors.qualitative.Bold` in the following line of the plotting cell with one of the options above:

```python
color_discrete_sequence=px.colors.qualitative.Bold
```


In [22]:
# --- k-means PCA scatter plot ---

# Ensure cluster labels are treated as categorical, not numeric
emb_df["cluster_kmeans_str"] = emb_df["cluster_kmeans"].astype(str)

fig_k = px.scatter(
    emb_df,
    x="pca_x",
    y="pca_y",
    color="cluster_kmeans_str",             # use string labels → categorical colors
    color_discrete_sequence=px.colors.qualitative.Bold,  # <-- HIGH CONTRAST
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_kmeans_str": True,
    },
    title="PCA Projection of Acoustic Embeddings — Colored by k-means Cluster",
    labels={"pca_x": "PC1", "pca_y": "PC2", "cluster_kmeans_str": "k-means cluster"},
)

fig_k.update_layout(
    legend_title_text="k-means cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",     # subtle grey for better contrast
)

fig_k.show()



### PCA scatter plot colored by HDBSCAN clusters

Now we color the same PCA projection by **HDBSCAN** cluster labels:

- Each color = HDBSCAN cluster ID.  
- Label `-1` means “**noise**” or “unclustered / outlier” points.  

This can look quite different from k-means:

- k-means forces every track into some cluster.  
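- HDBSCAN is allowed to say, “these tracks don’t belong to any dense group.”

In the run shown above, HDBSCAN labeled all 13 tracks as noise (`-1`), so every point in this plot gets the noise color.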


In [23]:
# --- Improved HDBSCAN PCA scatter plot with discrete colors ---

# Convert cluster labels to string categories
emb_df["cluster_hdbscan_str"] = emb_df["cluster_hdbscan"].astype(str)
emb_df.loc[emb_df["cluster_hdbscan"] == -1, "cluster_hdbscan_str"] = "noise (-1)"

# Prepare a high-contrast color palette
palette = px.colors.qualitative.Dark24.copy()

# Ensure noise has a consistent neutral color
noise_color = "#7f7f7f"  # medium grey

unique_clusters = emb_df["cluster_hdbscan_str"].unique().tolist()

# Assign colors: noise gets grey, others use Dark24 cycling
color_map = {}
palette_i = 0
for c in unique_clusters:
    if c == "noise (-1)":
        color_map[c] = noise_color
    else:
        color_map[c] = palette[palette_i % len(palette)]
        palette_i += 1

# Build the figure
fig_h = px.scatter(
    emb_df,
    x="pca_x",
    y="pca_y",
    color="cluster_hdbscan_str",
    color_discrete_map=color_map,             # <-- Force our custom mapping
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_hdbscan": True,
    },
    title="PCA Projection of Acoustic Embeddings — Colored by HDBSCAN Cluster",
    labels={"pca_x": "PC1", "pca_y": "PC2", "cluster_hdbscan_str": "HDBSCAN cluster"},
)

fig_h.update_layout(
    legend_title_text="HDBSCAN cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",   # subtle grey to help contrast
)

fig_h.show()


## UMAP: Nonlinear 2D Embedding of Acoustic Space

PCA gave us a **linear** 2D summary of the embeddings.  
Now we will use **UMAP (Uniform Manifold Approximation and Projection)**:

- UMAP is a nonlinear dimensionality reduction method.
- It tries to preserve **local neighborhoods**: points that are close in high dimension
  tend to stay close in the 2D map.
- This often reveals **curved or irregular cluster structure** that PCA misses.

We will:

1. Compute a 2D UMAP projection of our standardized feature matrix `X_scaled`.  
2. Make two scatter plots (Plotly):
   - one colored by **k-means cluster**  
   - one colored by **HDBSCAN cluster**
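
One rough way to check how well any 2D map preserves local neighborhoods is to compare each point's nearest neighbors before and after projection. The sketch below does this with NumPy on synthetic data. `neighborhood_overlap` is a hypothetical helper, not part of UMAP; in this notebook it could be called with `X_scaled` and the UMAP coordinates.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (excluding itself)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]


def neighborhood_overlap(X_high, X_low, k=5):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared))


rng = np.random.default_rng(0)
X_high = rng.normal(size=(13, 180))     # stand-in for X_scaled
X_random_2d = rng.normal(size=(13, 2))  # a 2D layout unrelated to X_high

print(neighborhood_overlap(X_high, X_high, k=5))       # 1.0
print(neighborhood_overlap(X_high, X_random_2d, k=5))  # chance-level overlap
```

A value near 1 means the 2D map keeps most local neighbors; a value close to the random baseline suggests the plot should be read with caution.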


In [29]:
# --- Cell: Compute UMAP 2D projection ---

import umap

umap_model = umap.UMAP(
    n_neighbors=10,      # how many neighbors define "local"
    min_dist=0.1,        # how compact clusters are
    metric="euclidean",
    random_state=42,
)

X_umap = umap_model.fit_transform(X_scaled)

emb_df["umap_x"] = X_umap[:, 0]
emb_df["umap_y"] = X_umap[:, 1]

print("UMAP embedding shape:", X_umap.shape)



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



UMAP embedding shape: (13, 2)


### UMAP + k-means visualization

In [30]:
# --- UMAP scatter plot for k-means clusters ---

# Ensure categorical labels
emb_df["cluster_kmeans_str"] = emb_df["cluster_kmeans"].astype(str)

fig_umap_k = px.scatter(
    emb_df,
    x="umap_x",
    y="umap_y",
    color="cluster_kmeans_str",
    color_discrete_sequence=px.colors.qualitative.Dark24,
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_kmeans_str": True,
    },
    title="UMAP of Acoustic Embeddings — Colored by k-means Cluster",
    labels={"umap_x": "UMAP-1", "umap_y": "UMAP-2", "cluster_kmeans_str": "k-means cluster"},
)

fig_umap_k.update_layout(
    legend_title_text="k-means cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",
)

fig_umap_k.show()


### UMAP + HDBSCAN visualization

In [31]:
# --- UMAP scatter plot for HDBSCAN clusters ---

# Make sure we have the string labels with "noise (-1)"
emb_df["cluster_hdbscan_str"] = emb_df["cluster_hdbscan"].astype(str)
emb_df.loc[emb_df["cluster_hdbscan"] == -1, "cluster_hdbscan_str"] = "noise (-1)"

palette = px.colors.qualitative.Dark24.copy()
noise_color = "#7f7f7f"

unique_clusters = emb_df["cluster_hdbscan_str"].unique().tolist()
color_map = {}
palette_i = 0
for c in unique_clusters:
    if c == "noise (-1)":
        color_map[c] = noise_color
    else:
        color_map[c] = palette[palette_i % len(palette)]
        palette_i += 1

fig_umap_h = px.scatter(
    emb_df,
    x="umap_x",
    y="umap_y",
    color="cluster_hdbscan_str",
    color_discrete_map=color_map,
    hover_name="file_name",
    hover_data={
        "file_path": False,
        "duration_sec": True,
        "cluster_hdbscan": True,
    },
    title="UMAP of Acoustic Embeddings — Colored by HDBSCAN Cluster",
    labels={"umap_x": "UMAP-1", "umap_y": "UMAP-2", "cluster_hdbscan_str": "HDBSCAN cluster"},
)

fig_umap_h.update_layout(
    legend_title_text="HDBSCAN cluster",
    width=800,
    height=500,
    plot_bgcolor="#F0F2F5",
)

fig_umap_h.show()


### Cluster Summaries

To interpret the clusters, we will compute simple **summary statistics**:

- Number of tracks in each cluster  
- Average, minimum, and maximum track duration  

We will do this for:

- **k-means clusters**  
- **HDBSCAN clusters** (ignoring the noise cluster `-1` for summaries)


### Summaries for k-means

In [34]:
# --- Cluster summaries: k-means (all numeric metadata) ---

# Columns to exclude from "metadata" summarization
exclude_cols = set(feature_names) | {
    "cluster_kmeans",
    "cluster_kmeans_str",
    "cluster_hdbscan",
    "cluster_hdbscan_str",
    "pca_x", "pca_y",
    "umap_x", "umap_y",
}

# Metadata candidates = all other columns
metadata_cols = [c for c in emb_df.columns if c not in exclude_cols]

# Among those, pick only numeric columns for aggregation
numeric_meta_cols = emb_df[metadata_cols].select_dtypes(include="number").columns.tolist()

print("Numeric metadata columns being summarized (k-means):")
print(numeric_meta_cols)

# Build aggregation dict: for each numeric metadata column, compute mean/min/max
agg_dict = {"file_name": ("file_name", "count")}  # count = number of tracks
for col in numeric_meta_cols:
    agg_dict[f"{col}_mean"] = (col, "mean")
    agg_dict[f"{col}_min"]  = (col, "min")
    agg_dict[f"{col}_max"]  = (col, "max")

summary_k = (
    emb_df
    .groupby("cluster_kmeans")
    .agg(**agg_dict)
    .rename(columns={"file_name": "num_tracks"})
    .reset_index()
    .sort_values("cluster_kmeans")
)

summary_k



Numeric metadata columns being summarized (k-means):
['duration_sec', 'sample_rate']


|   | cluster_kmeans | num_tracks | duration_sec_mean | duration_sec_min | duration_sec_max | sample_rate_mean | sample_rate_min | sample_rate_max |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 9 | 157.716115 | 55.681088 | 198.629433 | 22050.0 | 22050 | 22050 |
| 1 | 1 | 4 | 184.327256 | 145.960635 | 216.177778 | 22050.0 | 22050 | 22050 |


### Summaries for HDBSCAN (ignoring noise)

In [35]:
# --- Cluster summaries: HDBSCAN (all numeric metadata, non-noise only) ---

mask_non_noise = emb_df["cluster_hdbscan"] != -1
df_h = emb_df[mask_non_noise].copy()

if df_h.empty:
    print("No non-noise HDBSCAN clusters to summarize.")
else:
    # Reuse the same logic as before, but on df_h
    exclude_cols_h = set(feature_names) | {
        "cluster_kmeans",
        "cluster_kmeans_str",
        "cluster_hdbscan",
        "cluster_hdbscan_str",
        "pca_x", "pca_y",
        "umap_x", "umap_y",
    }

    metadata_cols_h = [c for c in df_h.columns if c not in exclude_cols_h]
    numeric_meta_cols_h = df_h[metadata_cols_h].select_dtypes(include="number").columns.tolist()

    print("Numeric metadata columns being summarized (HDBSCAN, non-noise):")
    print(numeric_meta_cols_h)

    agg_dict_h = {"file_name": ("file_name", "count")}
    for col in numeric_meta_cols_h:
        agg_dict_h[f"{col}_mean"] = (col, "mean")
        agg_dict_h[f"{col}_min"]  = (col, "min")
        agg_dict_h[f"{col}_max"]  = (col, "max")

    summary_h = (
        df_h
        .groupby("cluster_hdbscan")
        .agg(**agg_dict_h)
        .rename(columns={"file_name": "num_tracks"})
        .reset_index()
        .sort_values("cluster_hdbscan")
    )

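    # A bare expression inside an if/else block is not auto-displayed by Jupyter,
    # so show the table explicitly
    from IPython.display import display
    display(summary_h)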



No non-noise HDBSCAN clusters to summarize.
