# DecVAE Tutorial: SimVowels Dataset

Complete workflow example for the SimVowels dataset.

## Setup

In [1]:
# Import necessary libraries
import os
import json
from pathlib import Path

# Set the working directory to the DecVAE root
# Adjust this path to your local DecVAE directory
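# Compare the directory name itself; a substring test on the full path would
# also match unrelated parent directories that happen to contain 'examples'
DECVAE_ROOT = Path.cwd().parent if Path.cwd().name == 'examples' else Path.cwd()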
os.chdir(DECVAE_ROOT)
print(f"Working directory: {os.getcwd()}")

Working directory: /home/student3/Documents/IoannisZiogas/DecVAE


## 1. Generate SimVowels Dataset

Generate from scratch:

In [3]:
!python scripts/simulations/simulated_vowels.py

Succesfully saved training set part  1
Succesfully saved dev set
Succesfully saved test set


Alternatively, download the dataset from [Google Drive](https://drive.google.com/drive/folders/1VE4mkC3P1GEDrorThmRgL07NdEoLtyf9?usp=sharing) and place it in the directory "../sim_vowels" (at the same level as the DecVAE project directory).
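Before moving on, it can help to confirm that the dataset is where the later steps expect it. The sketch below assumes the directory is named `sim_vowels` and sits next to the project root, as described above; the helper `find_dataset_dir` is hypothetical and not part of DecVAE.

```python
from pathlib import Path


def find_dataset_dir(project_root, name="sim_vowels"):
    """Return the sibling directory `name` of project_root, or None if missing."""
    candidate = Path(project_root).resolve().parent / name
    return candidate if candidate.is_dir() else None


if __name__ == "__main__":
    dataset_dir = find_dataset_dir(Path.cwd())
    if dataset_dir is None:
        print("sim_vowels not found next to the project directory")
    else:
        # Count regular files as a rough sanity check of the download
        n_files = sum(1 for p in dataset_dir.rglob("*") if p.is_file())
        print(f"Found {dataset_dir} with {n_files} files")
```

Run it from the DecVAE root (the working directory set in the Setup cell) so that the sibling lookup resolves correctly.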

## 2. Decompose the SimVowels dataset

We first decompose the dataset to obtain the components of every audio sample; these components are used as masks during self-supervised pre-training. The input to the decomposition is a raw audio utterance X of shape [1,F,S], and the output stacks the raw utterance with its C components into shape [C+1,F,S], where F is the number of frames in the utterance and S is the receptive field (segment length) of a frame.
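To make these shapes concrete, the sketch below frames a waveform into [1,F,S] and stacks C placeholder components on top of it to obtain [C+1,F,S]. The non-overlapping framing, the values of S and C, and the random stand-in components are assumptions for illustration only; the actual decomposition is computed by the DecVAE preprocessing code.

```python
import numpy as np


def frame_signal(x, segment_len):
    """Split a 1-D signal into non-overlapping frames -> shape [1, F, S]."""
    n_frames = len(x) // segment_len
    trimmed = x[: n_frames * segment_len]  # drop the incomplete last frame
    return trimmed.reshape(1, n_frames, segment_len)


rng = np.random.default_rng(0)
S, C = 400, 3                        # assumed segment length and component count
x = rng.standard_normal(16000)       # one second of stand-in audio at 16 kHz

X = frame_signal(x, S)                                 # [1, F, S]
components = rng.standard_normal((C,) + X.shape[1:])   # placeholder OCs, [C, F, S]
decomposed = np.concatenate([X, components], axis=0)   # [C+1, F, S]

print(X.shape, decomposed.shape)
```

Channel 0 of `decomposed` is the raw utterance and channels 1 to C hold the components, which matches the [X,OC1,...,OCn] ordering used later in the visualizations.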

We will execute the pre-training script by first setting the preprocessing_only flag to true:

In [2]:
!python utils/update_config.py config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json preprocessing_only true

Updated 'preprocessing_only' to True in config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json
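For readers who prefer to edit the config from Python, the same kind of update can be sketched as below. This is not the code of `utils/update_config.py`; the function name `set_config_value`, the choice to parse the value as JSON (so that `true` becomes a boolean), and the assumption that the key sits at the top level of the file are all illustrative.

```python
import json
from pathlib import Path


def set_config_value(config_path, key, raw_value):
    """Set a top-level key in a JSON config file and return the parsed value."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    try:
        value = json.loads(raw_value)   # "true" -> True, "8" -> 8
    except json.JSONDecodeError:
        value = raw_value               # keep plain strings as they are
    config[key] = value
    path.write_text(json.dumps(config, indent=2))
    return value
```

Under these assumptions, `set_config_value(path_to_config, "preprocessing_only", "true")` would mirror the command above, with `path_to_config` pointing at the pre-training config file.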


The decomposition can be distributed across multiple processes for faster execution. The number of processes is controlled by the "preprocessing_num_workers" parameter, currently set to 8.

Running the cell below requires 23.4 GB of disk space for the decomposed data. The same data is reused later for visualization, pre-training and evaluation.
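Given the size of the decomposed data, a quick free-space check before launching can save a failed run. This is a minimal sketch using the standard library; the helper `has_free_space` is hypothetical, and checking the current directory assumes the decomposed data is written to the same drive.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    # 23.4 GB is the figure quoted above; "." assumes the data lands on this drive
    print("Enough space:", has_free_space(".", 23.4))
```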

In [2]:
!accelerate launch scripts/pre-training/base_models_ssl_pretraining.py --config_file config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json

Added /home/student3/Documents/IoannisZiogas/DecVAE to Python path
loading configuration file preprocessor_config.json from cache at /home/student3/.cache/huggingface/hub/models--patrickvonplaten--wav2vec2-base-v2/snapshots/9371f1849947b4613f451680a8e96d907617ce86/preprocessor_config.json
Feature extractor Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

  correlogram[o,r] = np.cov(OCs[o],OCs[r])[0,1] / (np.std(OCs[o])*np.std(OCs[r]))

## 3. Input Visualization

We generate TSNE visualizations of the model inputs: the raw audio signal (X) and the components obtained from the decomposition. We visualize individual components (OC1, OC2, ..., OCn) as well as aggregated representations, e.g. the concatenation of X and all components, [X,OC1,OC2,...,OCn]. Points are colored by the frequency correspondence of the inputs or by generative factor (vowel, speaker).

For the development set (500 utterances) of SimVowels, we will visualize the inputs to all models. The inputs are the Mel filterbank coefficients of the initial audio signal X and its components OC1,OC2,...,OCn.  

Because the whole dev set has to be projected with TSNE, this can take a while (3-4 hours). To visualize only a subset of the dev set instead, skip ahead to the reduced-sample instructions later in this section.

In [3]:
# Visualize frame-level inputs
!accelerate launch scripts/visualize/low_dim_vis_input.py \
    --config_file config_files/input_visualizations/config_visualizing_input_frames_vowels.json

Added /home/student3/Documents/IoannisZiogas/DecVAE to Python path
loading configuration file preprocessor_config.json from cache at /home/student3/.cache/huggingface/hub/models--patrickvonplaten--wav2vec2-base-v2/snapshots/9371f1849947b4613f451680a8e96d907617ce86/preprocessor_config.json
Feature extractor Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

  return f(*args, **kwargs)
Explained variance for mel domain original frame PCA: 92.01%
Explained variance for mel OC 1 frame PCA: 98.16%
Explained variance for mel OC 2 frame PCA: 98.67%
Explained variance for mel OC 3 frame PCA: 99.10%
Explained variance for mel domain OCs_concat frame PCA: 95.63%

Next, we compute the sequence-level TSNE visualization of the train set (4000 utterances). Here each utterance is represented as a single sample, so the input for one utterance has shape [C+1,F*S].

In [4]:
# Visualize sequence-level inputs
!accelerate launch scripts/visualize/low_dim_vis_input.py \
    --config_file config_files/input_visualizations/config_visualizing_input_sequences_vowels.json

Added /home/student3/Documents/IoannisZiogas/DecVAE to Python path
loading configuration file preprocessor_config.json from cache at /home/student3/.cache/huggingface/hub/models--patrickvonplaten--wav2vec2-base-v2/snapshots/9371f1849947b4613f451680a8e96d907617ce86/preprocessor_config.json
Feature extractor Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "sampling_rate": 16000
}

  return f(*args, **kwargs)
Explained variance for mel original sequence PCA: 29.63%
Explained variance for mel domain OC 1 sequence PCA: 25.93%
Explained variance for mel domain OC 2 sequence PCA: 32.74%
Explained variance for mel domain OC 3 sequence PCA: 29.63%
Explained variance for mel domain OCs_concat sequence PCA: 35.32%
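The difference between the frame-level and sequence-level inputs comes down to a reshape. The sketch below uses assumed sizes for C, F and S and a dummy array in place of real features; it only illustrates how [C+1,F,S] becomes [C+1,F*S], not how the DecVAE scripts build their inputs.

```python
import numpy as np

C, F, S = 3, 40, 400                      # assumed sizes for illustration
frames = np.arange((C + 1) * F * S, dtype=np.float32).reshape(C + 1, F, S)

# Frame-level view: every frame is its own sample of length S
frame_samples = frames[0]                 # raw signal X, shape [F, S]

# Sequence-level view: concatenate the frames of each channel end to end
sequence = frames.reshape(C + 1, F * S)   # shape [C+1, F*S]

print(frame_samples.shape, sequence.shape)
```

At frame level an utterance contributes F points per channel to the projection, while at sequence level it contributes one, which is why the sequence-level run covers 4000 utterances with far fewer samples.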

For a faster frame-level TSNE visualization, reduce the number of samples. First, set the number of framed utterances to a small value, e.g. 50:

In [None]:
!python utils/update_config.py config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json frames_to_vis 50

Then re-run the visualization:

In [None]:
# Visualize frame-level inputs
!accelerate launch scripts/visualize/low_dim_vis_input.py \
    --config_file config_files/input_visualizations/config_visualizing_input_frames_vowels.json
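Conceptually, limiting the visualization to 50 utterances amounts to drawing a reproducible subset of the set being visualized. How the DecVAE scripts choose that subset is not shown here; the sketch below assumes uniform random selection with a fixed seed, and `sample_utterances` is a hypothetical helper.

```python
import random


def sample_utterances(n_total, n_keep, seed=0):
    """Pick n_keep distinct utterance indices out of n_total, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_keep, n_total)))


# 500 is the dev-set size quoted earlier; 50 matches the example value above
subset = sample_utterances(n_total=500, n_keep=50)
print(len(subset), subset[:5])
```

Fixing the seed keeps the chosen utterances identical across runs, so differences between plots come from the projection rather than from the sample.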

The above can be repeated for the other sets (train, dev, test), for mel filterbank and waveform inputs, and with UMAP instead of TSNE. Check the corresponding config files for more information. The resulting TSNE visualizations may vary slightly between runs because TSNE is stochastic (even with a fixed random seed).

## 3. Pre-training DecVAE

Single-GPU: use the --gpu_ids argument to specify the id of the GPU (0,1,2,...), i.e. accelerate launch --gpu_ids <id> scripts... . Alternatively, omit this argument and the default GPU of your system will be used (as below).

In [None]:
# Pre-train DecVAE on single GPU
!accelerate launch scripts/pre-training/base_models_ssl_pretraining.py \
    --config_file config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json

Multi-GPU (specify GPU IDs):

In [None]:
# Pre-train DecVAE on multiple GPUs (e.g., GPU 0 and 1)
# Uncomment and modify as needed:
# !accelerate launch --gpu_ids 0,1 scripts/pre-training/base_models_ssl_pretraining.py \
#     --config_file config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json
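The single-GPU and multi-GPU commands differ only in the `--gpu_ids` argument, so they can be assembled from one helper when launching from a script instead of a notebook cell. This is a sketch: `build_launch_command` is a hypothetical name, and it passes only the arguments already used in this tutorial.

```python
import shlex


def build_launch_command(script, config_file, gpu_ids=None):
    """Assemble the accelerate launch command as a list of arguments."""
    cmd = ["accelerate", "launch"]
    if gpu_ids:  # e.g. [0] or [0, 1]; omitted -> accelerate's default device
        cmd += ["--gpu_ids", ",".join(str(i) for i in gpu_ids)]
    cmd += [script, "--config_file", config_file]
    return cmd


cmd = build_launch_command(
    "scripts/pre-training/base_models_ssl_pretraining.py",
    "config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json",
    gpu_ids=[0, 1],
)
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to start the run from Python.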

View configuration:

In [None]:
import json

with open("config_files/DecVAEs/sim_vowels/pre-training/config_pretraining_sim_vowels_NoC3.json", 'r') as f:
    config = json.load(f)

print(json.dumps(config, indent=2))
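Printing the whole file can be long, so it may be easier to look at just the keys this tutorial touches. Step 2 set "preprocessing_only" to true in this same config file; if the flag does what its name suggests, it would need to be set back to false before the pre-training run does any training. The sketch below uses a stand-in dictionary so that it runs on its own; `summarize_config` is a hypothetical helper, and the assumption is that these keys sit at the top level of the config.

```python
import json


def summarize_config(config, keys):
    """Return {key: value} for the requested keys, marking absent ones."""
    return {k: config.get(k, "<missing>") for k in keys}


# Stand-in for the `config` dict loaded in the cell above
example_config = {"preprocessing_only": True, "preprocessing_num_workers": 8}

summary = summarize_config(
    example_config,
    ["preprocessing_only", "preprocessing_num_workers", "frames_to_vis"],
)
print(json.dumps(summary, indent=2))
```

In the notebook, pass the `config` dictionary loaded above instead of `example_config`.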

## 4. Latent Evaluation

In [None]:
# Evaluate latent representations
!accelerate launch scripts/post-training/latents_post_analysis.py \
    --config_file config_files/DecVAEs/sim_vowels/latent_evaluations/config_latent_anal_sim_vowels.json

## 5. Latent Visualization

Frame-level:

In [None]:
# Visualize frame-level latent representations
!accelerate launch scripts/visualize/low_dim_vis_latents.py \
    --config_file config_files/DecVAEs/sim_vowels/latent_visualizations/config_latent_frames_visualization_vowels.json

Sequence-level:

In [None]:
# Visualize sequence-level latent representations
!accelerate launch scripts/visualize/low_dim_vis_latents.py \
    --config_file config_files/DecVAEs/sim_vowels/latent_visualizations/config_latent_sequences_visualization_vowels.json

## 6. Latent Traversals

Generate controlled dataset:

In [None]:
# Generate small-scale dataset with controlled generative factors
!python scripts/simulations/simulated_vowels_for_latent_traversal.py

Perform traversal analysis:

In [None]:
# Perform latent traversal analysis
!accelerate launch scripts/latent_response_analysis/latent_traversal_analysis.py \
    --config_file config_files/DecVAEs/sim_vowels/latent_traversals/config_latent_traversals_sim_vowels.json