# Latent Audio
## Disentangling Yamnet's latent representations into materials and actions

The artificial neural network Yamnet takes sound spectrograms as input and classifies them into 521 audio event classes. Yamnet's latent space appears to encode the materials and actions involved in making a sound. To understand these latent representations, this notebook first explores the latent space and then disentangles it.

Latent space exploration involves
- projection with principal component analysis (PCA),
- t-distributed stochastic neighbor embedding (t-SNE),
- classification of latent representations as materials and actions using k-nearest neighbors (KNN), and
- disentanglement with a flow model.
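
The first three exploration steps can be sketched with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (three well-separated clusters standing in for material classes), and names such as `Z_prime`, `Z` and `labels` are hypothetical stand-ins that mirror the notation used further below, not the notebook's actual variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for high-dimensional latents: 3 classes, 100 sounds each, 256 dims
n_per_class, dim = 100, 256
centers = rng.normal(scale=5.0, size=(3, dim))
Z_prime = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
labels = np.repeat(np.arange(3), n_per_class)

# Step 1: project to a manageable dimensionality with PCA
Z = PCA(n_components=64, random_state=0).fit_transform(Z_prime)

# Step 2: embed into 2 dimensions with t-SNE, e.g. for a scatter plot
Z_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(Z)

# Step 3: classify the projected latents with k-nearest neighbors
Z_train, Z_test, y_train, y_test = train_test_split(
    Z, labels, test_size=0.25, stratify=labels, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5).fit(Z_train, y_train)
accuracy = knn.score(Z_test, y_test)
print(f"KNN test accuracy on synthetic data: {accuracy:.2f}")
```

A high KNN accuracy on real latents would suggest that neighboring points in a layer's latent space tend to share a material or action label.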

Latent space manipulation involves
- invertible projection with PCA. This requires a complete PCA model whose output dimensionality equals its input dimensionality, which is processing-intensive to set up.
- disentanglement with a flow model. This flow model needs to be invertible; the one listed for exploration can be used as it is.
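
Why manipulation needs a complete PCA model can be seen in a small round-trip experiment. The sketch below uses scikit-learn's `PCA` on toy data (32 dimensions instead of a real layer's dimensionality; all variable names are illustrative assumptions): a full model reconstructs the input up to floating-point error, whereas a reduced model loses information.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Correlated toy data standing in for one layer's latent vectors (200 sounds, 32 dims)
X = rng.normal(size=(200, 32)) @ rng.normal(size=(32, 32))

# Full model: output dimensionality equals input dimensionality, hence invertible
full_pca = PCA(n_components=32).fit(X)
# Reduced model: cheaper, but the discarded components cannot be recovered
reduced_pca = PCA(n_components=8).fit(X)

X_full = full_pca.inverse_transform(full_pca.transform(X))
X_reduced = reduced_pca.inverse_transform(reduced_pca.transform(X))

full_error = np.max(np.abs(X - X_full))
reduced_error = np.max(np.abs(X - X_reduced))
print(f"full round-trip error: {full_error:.2e}, reduced: {reduced_error:.2e}")

# Manipulation: shift one sound along the first principal axis, then map back
Z = full_pca.transform(X[:1])
Z_edit = Z.copy()
Z_edit[0, 0] += 1.0
X_edit = full_pca.inverse_transform(Z_edit)
```

With the full model, an edit made in the projected space maps back to a well-defined point in the original latent space, which is what a manipulation needs before the result is passed on through the network.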

The preparation for these two analyses is largely the same. Both require converting the data from the waveform domain to Yamnet's latent representation for each layer and projecting it to a manageable dimensionality, e.g. 64 dimensions using PCA. Computing a full PCA model is resource-intensive, and a small model with e.g. 64 dimensions suffices for most layers. For the majority of layers, the original latent representations, which have higher dimensionality than the projection, are no longer needed after projection. Only the layers whose latent space is to be manipulated need the full PCA model for invertibility, together with the original latent representation. In the current study, this applies only to layer 9.

The diagram below illustrates this pipeline. Containers represent entire data sets, squares represent single data points (i.e. a single sound), scripts refer to code snippets and rectangles represent models. $W_1$ is the original waveform data and $W_2$ is the augmented waveform data. $Z'$ is Yamnet's latent representation, obtained for every layer. $Z$ is the latent representation after projection to a lower dimension. $Z_a, Z_b$ form a pair of sounds whose similarity with respect to actions and materials is indicated in $Y_{ab}$. $Z, Y$ correspond to a single sound (in projected latent representation) and its material/action label.

Important: The conversion from waveform to projected latent Yamnet representation takes several hours to run, and memory as well as disk storage demands can temporarily peak.

![Pipeline diagram](Pipeline.png)

# Pre-processing

In [1]:
from latent_audio.scripts import (
    audio_to_latent_yamnet as aud2lat,
    create_scalers_and_PCA_model_for_latent_yamnet as lat2pca,
    latent_yamnet_to_calibration_data_set as lat2cal,
)
import os
import shutil

# Layers that keep the full PCA model and their original latent representations
full_dim_layer_indices = [9]
reduced_target_dimensionality = 64

for layer_index in range(14):
    print(f'Layer {layer_index}')

    # Convert audio to latent Yamnet representation of original dimensionality
    aud2lat.run(layer_index=layer_index)

    # Create standard scalers and PCA for projection to lower dimensional space
    lat2pca.run(
        layer_index=layer_index,
        target_dimensionality=None if layer_index in full_dim_layer_indices else reduced_target_dimensionality,
    )

    # Perform the projection (this is needed for all layers)
    lat2cal.run(layer_index=layer_index, dimensionality=reduced_target_dimensionality)

    # Delete latent representations of original dimensionality to save disk storage
    if layer_index not in full_dim_layer_indices:
        shutil.rmtree(os.path.join("data", "latent yamnet", "original", f"Layer {layer_index}"))

Layer 9
Running script to create scalers and PCA model for latent yamnet
	Loading sample of latent data Completed. Shape == [instance count, dimensionality] == (13516, 12288)
	Fitting Pre-PCA Standard Scaler to sample Completed
	Fitting 12288-dimensional PCA to sample Completed
	Fitting Post-PCA Standard Scaler to sample Completed
	Run Completed
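
The log above names three fitted components: a pre-PCA standard scaler, the PCA itself and a post-PCA standard scaler. How the scripts implement them is not shown here, so the following is only a minimal sketch with scikit-learn on toy data. The helper functions `project` and `reconstruct` are hypothetical, and truncating the full projection to its leading components is an assumption about how a lower dimensionality could be obtained from a full model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy sample standing in for latent data of shape [instance count, dimensionality]
sample = rng.normal(loc=3.0, scale=2.0, size=(500, 48))

# Fit the three stages in the order reported by the log
pre_scaler = StandardScaler().fit(sample)
pca = PCA(n_components=48).fit(pre_scaler.transform(sample))
post_scaler = StandardScaler().fit(pca.transform(pre_scaler.transform(sample)))

def project(x, dimensionality):
    """Hypothetical helper: scale, rotate, rescale, then keep the leading components."""
    z = post_scaler.transform(pca.transform(pre_scaler.transform(x)))
    return z[:, :dimensionality]

def reconstruct(z_full):
    """Hypothetical helper: invert all three stages (needs every component)."""
    x_scaled = pca.inverse_transform(post_scaler.inverse_transform(z_full))
    return pre_scaler.inverse_transform(x_scaled)

Z = project(sample, dimensionality=16)       # reduced projection
Z_full = project(sample, dimensionality=48)  # full projection, invertible
X_back = reconstruct(Z_full)
```

In this sketch the post-PCA scaler gives every projected dimension zero mean and unit variance on the fitting sample, which keeps the leading components from dominating distance-based methods such as KNN.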
