# Lecture 6: Self-Supervised Learning for Speech Representation

## Motivation

- Speech datasets with labelled transcriptions can be scarce, and producing them can be costly depending on the task.
- At the same time, we have an abundance of unlabeled speech data available.

## Learning

+ ***Supervised Learning***
+ ***Unsupervised Learning***
+ ***Self-Supervised Learning:***

    This sits between the two: the data has no labels, but we use the data itself to create 'pseudo-labels' (via a pretext task) and then train in a 'supervised' way.

    E.g. given a speech signal in which we mask a region, the model predicts the missing (or next) word/frame, and in doing so it learns about the features present in the data.

    ### Typical Model Architecture

    The raw waveform is framed and windowed, then given to a CNN to obtain the latent speech representations Z (local representations). These are then given to a transformer, which encodes the sequence and gives us context representations.

    ### Training Methods

    - Generative Architecture:

        We have two compatible signals X & Y: we encode X, then try to reconstruct parts of the signal Y from that encoding.

        + Wav2vec: It solves a next time-step prediction task. The model learns to distinguish a sample $z_{i+k}$ that is k steps into the future from distractor samples drawn from a proposal distribution. 

        + Wav2vec 2.0: It solves a contrastive task defined over a quantization of the latent representations Z. Instead of predicting the next time steps, we mask part of the input and ask the model to predict what is in the masked region. The quantization acts as a 'vocabulary' (codebook) for speech.

        + BERT (Text): We mask a word (or a sequence of words) in a sentence, and the model predicts the most probable word for the masked region.

        + HuBERT: We encode speech using CNNs, mask a region, then ask the model to predict the masked region. The loss is computed only over the masked regions.

        + WavLM: Same as HuBERT, but we add noise to the speech before encoding it.

    - Joint-Embedding Architecture:

        Instead of predicting/reconstructing the output signal from the (masked) input signal, we have two compatible signals X & Y and two encoders. We want the encoders to produce similar embeddings for compatible input signals and dissimilar embeddings for incompatible input signals.

        + BYOL: Image $\rightarrow$ Speech. The input signal is augmented in two different ways; one version goes to an Online Network and the other to a Target Network. Both encode their input, and the goal is to produce the same embedding for the two augmented signals, i.e. to ignore the noise and focus on the signal.

    - Joint-Embedding Predictive Architecture:

        Combines both architectures. We still have compatible signals and two encoders, but once we have encoded X, we predict the embeddings of the Y signal.

        + Data2vec: Image, Speech & Language. It uses a Student Network and a Teacher Network. The Teacher gets the full input signal. The Student gets a masked version of the input signal and predicts the Teacher's embeddings for the masked regions.
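
The contrastive task mentioned for Wav2vec 2.0 in the list above can be sketched for a single masked time step. This is a minimal illustration in plain Python, not the reference implementation: the function names, the temperature of 0.1, and the toy vectors are assumptions made for the example. The context vector at the masked position should be more similar to the true (quantized) latent than to distractors taken from other positions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time step.

    context:     context representation at the masked position
    target:      true (quantized) latent for that position
    distractors: latents taken from other positions
    temperature: assumed value, for illustration only
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, c) / temperature for c in candidates]
    # Numerically stable log-softmax of the true target (index 0).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_sum_exp)

# Toy example: a target aligned with the context gives a lower loss
# than a target pointing away from it.
context = [1.0, 0.0]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
good = contrastive_loss(context, [0.9, 0.1], distractors)
bad = contrastive_loss(context, [-0.9, 0.1], distractors)
print(good < bad)
```

In a real model this loss would be averaged over all masked positions in a batch; the sketch keeps one position to make the comparison against distractors explicit.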

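The student/teacher setup described for Data2vec, together with the idea from HuBERT that the loss only covers masked regions, can be sketched as below. This is a simplified illustration under stated assumptions: mean squared error stands in for the actual regression loss, parameters are flat lists of floats, and the names `masked_regression_loss`, `ema_update`, and the decay values are hypothetical choices for this example. The teacher is treated as an exponential moving average of the student.

```python
def masked_regression_loss(student_out, teacher_out, mask):
    """Mean squared error over masked frames only.

    student_out / teacher_out: one embedding vector per frame
    mask: True where the student's input frame was masked
    """
    total, count = 0.0, 0
    for s, t, is_masked in zip(student_out, teacher_out, mask):
        if is_masked:
            total += sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
            count += 1
    return total / count if count else 0.0

def ema_update(teacher_params, student_params, decay=0.999):
    """Move each teacher parameter a small step toward the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: frame 0 is unmasked, so its large error is ignored.
student_out = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher_out = [[5.0, 5.0], [1.0, 3.0], [2.0, 2.0]]
mask = [False, True, True]
loss = masked_regression_loss(student_out, teacher_out, mask)
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
print(loss, new_teacher)
```

Only the student would receive gradients from this loss; the teacher changes solely through `ema_update`, which is what keeps the prediction targets stable during training.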

## Why does SSL work?

- It can make use of very large amounts of data, since it is not limited by the availability of labels. The transformer architecture is well suited to learning context and embeddings. Asking the model to predict (parts of) the audio pushes it to learn representations that encode information useful for the given task.

## Model Evaluation

Since the models' outputs are just representations, how do we benchmark them?

We encode audio using the models, then take these representations and train smaller models for specific tasks, using the representations as input features (the labels come from the task itself), and see how well these models perform on the given task.
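
A minimal sketch of this evaluation idea: the representations are kept fixed and only a small classifier is trained on top of them. The toy 2-D "representations", the binary task, and the hyperparameters are all assumptions for illustration; real benchmarks use actual model outputs and task-specific heads.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on fixed features with gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return the predicted class (0 or 1) for one feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy stand-ins for representations produced by a pretrained encoder.
train_x = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
train_y = [1, 1, 0, 0]
test_x = [[1.0, 0.5], [-1.0, -0.5]]
test_y = [1, 0]

w, b = train_linear_probe(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(accuracy)
```

The better the representations separate the classes, the less capacity the probe needs, which is why the downstream score can be read as a measure of representation quality.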

+ SUPERB Benchmark - A wide range of tasks regarding speech
+ HEAR Benchmark - A wide range of tasks regarding general audio 

## SSL for Synthesis

Synthesis is where we generate new audio/speech waveforms.

- Voice Conversion: Audio to Audio

    We encode the speech, extract the acoustic features, pass them through the conversion model to get the converted acoustic features, then pass those through a vocoder to get the converted speech audio.

    The challenge is the encoder/decoder: artifacts are introduced by the encoder and the vocoder...

    What about the voice conversion model? Frames are close to each other in feature space if their phonetic/linguistic information is similar, even though the voice is different.

    - kNN-VC: We have a source audio signal and a target speaker audio signal, both encoded as frame-level features. For each source frame, we use kNN to find the k closest frames in the target speaker's features and take their average as the converted frame. This works well for zero-shot conversion.

- Text-to-Speech: Text to Audio

    We encode the text input, extract the linguistic features and pass them through an acoustic model, then pass the acoustic features through a similar type of vocoder in order to generate the speech output audio.

    - kNN-TTS: We have source features derived from the input text, and a target speaker audio signal. For each source frame, we use kNN to find the k closest frames in the target speaker's features and take their average as the converted frame. Sometimes we want to interpolate the selected features with the source speaker features so that the voice is more controlled and mixed (e.g. 25% source voice, 75% target voice).
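
The kNN matching and the interpolation step described above can be sketched on toy feature frames. This is an illustration under assumptions, not the published implementations: Euclidean distance is used for simplicity, the 2-D frames stand in for real encoder features, and the names `knn_convert`, `interpolate`, and `target_weight` are hypothetical.

```python
import math

def knn_convert(source_frames, target_frames, k=2):
    """Replace each source frame by the mean of its k nearest target frames."""
    converted = []
    for src in source_frames:
        nearest = sorted(target_frames, key=lambda tgt: math.dist(src, tgt))[:k]
        converted.append(
            [sum(f[j] for f in nearest) / len(nearest) for j in range(len(src))]
        )
    return converted

def interpolate(source_frames, converted_frames, target_weight=0.75):
    """Blend source and converted features frame by frame.

    target_weight=0.75 corresponds to 25% source voice, 75% target voice.
    """
    return [
        [(1.0 - target_weight) * s + target_weight * c for s, c in zip(src, conv)]
        for src, conv in zip(source_frames, converted_frames)
    ]

# Toy frames: two source frames and five target-speaker frames.
source = [[0.0, 0.0], [10.0, 10.0]]
target = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [11.0, 11.0], [50.0, 50.0]]

converted = knn_convert(source, target, k=2)
mixed = interpolate(source, converted, target_weight=0.75)
print(converted)
print(mixed)
```

Because nothing is trained on the target speaker, only a pool of target frames is needed, which is what makes the approach usable in a zero-shot setting. The converted (or mixed) frames would then go to a vocoder to produce audio.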

## Summary

- Self-supervised learning enables learning useful general speech representations from unlabeled data that can be used in a wide range of tasks.

- Benchmarks that evaluate these models show that they improve performance on the given tasks, especially when labelled data is scarce.

- Very good representations can simplify task-specific models.