# Assignment 7: Transformers

To get you warmed up and familiar with some of the libraries, we start out easy with a BERT tutorial by Jay Alammar.
The tutorial builds a simple sentiment analysis model on top of pretrained BERT models with the [HuggingFace](https://huggingface.co/) library.
It will familiarize you with the library and make the next exercise a bit easier.
The [Visual Guide](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) has nice graphics and visualizations and should deepen your general understanding of transformers, and of the BERT model in particular.

---

## Task 1) Wav2vec 2.0 for keyword recognition

After the warm-up with BERT, this exercise is a bit more advanced and you will be mostly on your own.
The task is to build a keyword recognition system based on wav2vec 2.0.
You will have to weigh several options and decide which implementation path you want to follow.

You can use the HuggingFace [Audio Classification Tutorial](https://github.com/huggingface/notebooks/blob/main/examples/audio_classification.ipynb) as a starting point.
There are several options, which vary in complexity and will lead to different performance on this problem.
You should be able to justify the design and implementation choices you made.
Choose the option that suits you best or the one that you think might yield the best performance.
1. Which model will you use (```BASE vs. LARGE```), and which pretrained weights (```ASR vs. BASE```, ```XLSR53 vs. ENGLISH```)?
2. HuggingFace or ```torchaudio.pipelines```?
3. Use a simple neural classification head?
4. Extract features and use them with some downstream classifier (e.g., SVM, Naive Bayes, etc.)
    1. What pooling strategy will you use (mean, statistical, etc.)?
    2. Compare downstream classifiers (e.g., SVM vs. MLP vs. CNN).
    3. Should you use a dimensionality reduction method?
5. Or use CTC loss and a greedy decoder? (closed vocab!)
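
To make the CTC option more concrete, here is a minimal sketch of greedy decoding in plain Python. It assumes the per-frame argmax has already been taken. The toy vocabulary, the blank index `0`, and the helper names `ctc_greedy_decode` and `snap_to_keyword` are illustrative assumptions, not part of any library. Because the vocabulary is closed, the decoded string is snapped to the most similar keyword with `difflib`.

```python
from difflib import get_close_matches
from itertools import groupby

# Hypothetical character vocabulary; index 0 plays the role of the CTC blank.
VOCAB = ["<blank>", "f", "g", "n", "o", "p", "s", "t"]
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, blank_id=BLANK_ID):
    """Collapse runs of identical frame predictions, then drop blanks."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


def ids_to_text(token_ids, vocab=VOCAB):
    """Join the decoded token ids into a string."""
    return "".join(vocab[i] for i in token_ids)


def snap_to_keyword(text, keywords):
    """Closed vocabulary: return the keyword most similar to the decoded text."""
    return get_close_matches(text, keywords, n=1, cutoff=0.0)[0]


# Per-frame argmax ids for "stop": repeated frames collapse, blanks vanish.
stop_frames = [0, 6, 6, 0, 7, 7, 7, 4, 0, 5, 5]
# A doubled letter ("off") survives only if a blank separates the two f's.
off_frames = [4, 4, 0, 1, 0, 1]

stop_text = ids_to_text(ctc_greedy_decode(stop_frames))
off_text = ids_to_text(ctc_greedy_decode(off_frames))
```

A slightly wrong decoding such as `"stpp"` would still be mapped to `"stop"` by `snap_to_keyword`, which is the main benefit of knowing the keyword set in advance.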

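The pooling question from the feature-extraction option can be sketched as follows. This is a minimal example under the assumption that the hidden states of one utterance are already available as a NumPy array of shape `(frames, hidden_size)` together with a mask of valid (non-padded) frames. The function names are hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden, mask):
    """Average the hidden states over valid frames only."""
    valid = hidden[np.asarray(mask, dtype=bool)]
    return valid.mean(axis=0)


def statistics_pool(hidden, mask):
    """Concatenate per-dimension mean and standard deviation over valid frames."""
    valid = hidden[np.asarray(mask, dtype=bool)]
    return np.concatenate([valid.mean(axis=0), valid.std(axis=0)])


# Three frames with two features each; the last frame is padding.
hidden = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = [1, 1, 0]

mean_vec = masked_mean_pool(hidden, mask)   # shape (2,)
stats_vec = statistics_pool(hidden, mask)   # shape (4,)
```

Either vector can then be fed to a downstream classifier. Statistical pooling doubles the feature dimension, which is one reason a dimensionality reduction step may be worth considering.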
## Dataset

For this exercise, please use the [speech-commands-dataset](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html) from Google to train and evaluate your keyword recognition systems.
The data can also be obtained using the [HuggingFace API](https://huggingface.co/datasets/speech_commands), or you can use [torchaudio](https://pytorch.org/audio/stable/_modules/torchaudio/datasets/speechcommands.html).

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

### Prepare the Data

In [13]:
### YOUR CODE HERE

import tarfile
from typing import Iterable, Optional
import requests
import os
from sklearn.model_selection import train_test_split
from torch import Tensor
import torchaudio
from torch.utils.data import Dataset, DataLoader
import numpy as np


DATASET_URL = "http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz"
DATA_DIR_PATH = "./data"
DATASET_TARBALL_PATH = f"{DATA_DIR_PATH}/speech_commands_v0.01.tar.gz"
DATASET_PATH = f"{DATA_DIR_PATH}/speech_commands_v0.01"

RANDOM_SEED = 42

TEST_RATIO = 0.2
VAL_RATIO = 0.15

if not os.path.exists(DATA_DIR_PATH):
    os.mkdir(DATA_DIR_PATH)
if not os.path.exists(DATASET_TARBALL_PATH):
    with open(DATASET_TARBALL_PATH, "wb") as fp:
        fp.write(requests.get(DATASET_URL).content)
if not os.path.exists(DATASET_PATH):
    with tarfile.open(DATASET_TARBALL_PATH) as tar:
        tar.extractall(DATASET_PATH)

audios = []
labels = []
label_set = set()
for p in os.listdir(DATASET_PATH):
    dir_p = f"{DATASET_PATH}/{p}"
    if p.startswith("_") or not os.path.isdir(dir_p):
        continue
    label_set.add(p)
    for wp in os.listdir(dir_p):
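        # torchaudio.load returns (waveform, sample_rate); keep only the waveform tensor
        waveform, _sample_rate = torchaudio.load(f"{dir_p}/{wp}")
        audios.append(waveform)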
        labels.append(p)
distinct_labels = sorted(label_set)
print(f"distinct labels: {distinct_labels}")

train_idcs, temp = train_test_split(np.arange(len(audios)), test_size=TEST_RATIO + VAL_RATIO, random_state=RANDOM_SEED)
test_idcs, val_idcs = train_test_split(temp, test_size=VAL_RATIO / (VAL_RATIO + TEST_RATIO), random_state=RANDOM_SEED)

class CommandsDataset(Dataset):
    def __init__(self, audios: Iterable[Tensor], labels: Iterable[str], distinct_labels: Optional[list[str]] = None):
        self.__audios = list(audios)
        self.__labels = list(labels)
        assert len(self.__audios) == len(self.__labels)
        # Use the stored list: `labels` may be a one-shot iterator already consumed above
        self.__distinct_labels = sorted(set(self.__labels)) if distinct_labels is None else list(distinct_labels)
        self.__label_index_lookup = {l: i for i, l in enumerate(self.__distinct_labels)}

    def __len__(self) -> int:
        return len(self.__audios)

    def __getitem__(self, i: int) -> tuple[Tensor, int]:
        return self.__audios[i], self.__label_index_lookup[self.__labels[i]]
    
    def get_label(self, i: int) -> str:
        return self.__distinct_labels[i]

train_dataset = CommandsDataset(map(audios.__getitem__, train_idcs), map(labels.__getitem__, train_idcs), distinct_labels)
val_dataset = CommandsDataset(map(audios.__getitem__, val_idcs), map(labels.__getitem__, val_idcs), distinct_labels)
test_dataset = CommandsDataset(map(audios.__getitem__, test_idcs), map(labels.__getitem__, test_idcs), distinct_labels)

### END YOUR CODE

distinct labels: ['bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin', 'nine', 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two', 'up', 'wow', 'yes', 'zero']

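The clips differ in length, so a `DataLoader` cannot stack them without padding. The following is a minimal sketch of a collate function, assuming each dataset item is a `(waveform, label_index)` pair whose waveform has shape `(1, num_samples)`. The name `pad_collate` is hypothetical. Passing raw arrays to the feature extractor with `padding=True` would be an alternative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch):
    """Zero-pad mono waveforms to the longest clip in the batch."""
    waveforms = [waveform.squeeze(0) for waveform, _ in batch]   # each (num_samples,)
    lengths = torch.tensor([w.shape[0] for w in waveforms])
    padded = pad_sequence(waveforms, batch_first=True)           # (batch, max_samples)
    # 1 marks real samples, 0 marks padding.
    positions = torch.arange(padded.shape[1])[None, :]
    attention_mask = (positions < lengths[:, None]).long()
    labels = torch.tensor([label for _, label in batch])
    return padded, attention_mask, labels


# Two toy clips of different lengths stand in for real recordings.
toy_batch = [(torch.ones(1, 3), 0), (torch.ones(1, 5), 2)]
padded, attention_mask, labels = pad_collate(toy_batch)
```

Such a function could be handed to the loader as `DataLoader(train_dataset, batch_size=..., collate_fn=pad_collate)`.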

### Train the wav2vec model

In [None]:
### YOUR CODE HERE

import torch
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
from torch.optim import Adam
from torch.nn import Linear

MODEL_NAME = "superb/wav2vec2-large-superb-sid"
LR = 0.001
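DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present

def build_model(model_name: str, class_count: int) -> tuple[Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification]:
    # Load from the `model_name` argument rather than the global constant
    ftx: Wav2Vec2FeatureExtractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
    model: Wav2Vec2ForSequenceClassification = Wav2Vec2ForSequenceClassification.from_pretrained(model_name) # type: ignore
    # Replace the pretrained head with a fresh one sized for our keyword classes
    model.classifier = Linear(model.projector.out_features, class_count)
    # Keep the config in sync, since the built-in loss reshapes logits with config.num_labels
    model.config.num_labels = class_count
    return ftx, model

ftx, model = build_model(MODEL_NAME, len(distinct_labels))
model.to(DEVICE)  # move parameters to the target device before creating the optimizer
opt = Adam(model.parameters(), LR)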



### END YOUR CODE
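One possible shape for the missing training loop is sketched below. To stay runnable without downloading a checkpoint, it uses a small `nn.Linear` stand-in and random features, which are assumptions for illustration only. The helper name `train_one_epoch` is hypothetical. With the HuggingFace model, `model(inputs).logits` would take the place of `model(inputs)`.

```python
import torch
from torch import nn


def train_one_epoch(model, batches, optimizer, device="cpu"):
    """Run one pass over the batches and return the mean cross-entropy loss."""
    model.train()
    loss_fn = nn.CrossEntropyLoss()
    total_loss, seen = 0.0, 0
    for inputs, labels in batches:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(inputs)           # stand-in model returns logits directly
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        seen += labels.size(0)
    return total_loss / seen


# Stand-in setup: 8 random features per example, 3 classes, fixed toy batches.
torch.manual_seed(0)
toy_model = nn.Linear(8, 3)
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.05)
toy_batches = [(torch.randn(4, 8), torch.tensor([0, 1, 2, 0])) for _ in range(5)]

losses = [train_one_epoch(toy_model, toy_batches, toy_opt) for _ in range(20)]
```

Tracking the per-epoch loss like this, alongside validation accuracy, gives a simple signal for when to stop training.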

### Evaluate your model

In [None]:
### YOUR CODE HERE

    

### END YOUR CODE
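For the evaluation step, a minimal sketch in plain Python is given below. It assumes the predictions have already been collected as label strings. The helper names and the toy predictions are illustrative assumptions, not results from a trained model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall per label: correct predictions divided by occurrences of that label."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


def top_confusions(y_true, y_pred, n=3):
    """Most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Toy labels for illustration only.
y_true = ["yes", "yes", "no", "no", "go"]
y_pred = ["yes", "no", "no", "no", "go"]

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
confusions = top_confusions(y_true, y_pred)
```

Per-class recall and the most frequent confusions are often more informative than accuracy alone for short, acoustically similar keywords.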