# HW5: Unsupervised Speech Recognition (USR)

Welcome to HW5 of Introduction to Deep Learning (11-685). In this homework you will work on unsupervised speech recognition with GANs: you will reimplement, and then improve on, the model from the USR paper by Facebook AI.<br>
Link: https://arxiv.org/abs/2105.11084


# Installations

In [None]:
! pip install git+https://github.com/pytorch/fairseq
# You can install other libraries such as torchsummaryX, wandb and so on

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/pytorch/fairseq
  Cloning https://github.com/pytorch/fairseq to c:\users\kyle\appdata\local\temp\pip-req-build-ol7n3gov
  Resolved https://github.com/pytorch/fairseq to commit 3f6ba43f07a6e9e2acf957fc24e57251a7a3f55c
  Installing build dependencies: started
  Installing build dependencies: still running...
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none --quiet https://github.com/pytorch/fairseq 'C:\Users\Kyle\AppData\Local\Temp\pip-req-build-ol7n3gov'
  Running command git submodule update --init --recursive -q


# Kaggle

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir -p /root/.kaggle/

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":"u","key":"k"}') # TODO: Put your kaggle username & key here

!chmod 600 /root/.kaggle/kaggle.json
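
Building the JSON string by hand makes it easy to drop a quote or a brace. A minimal sketch that lets `json.dump` produce the file instead is shown below. The helper name `write_kaggle_credentials` and the temporary directory are illustrative assumptions, and `u` and `k` stand in for your own username and key.

```python
import json
import os
import tempfile

def write_kaggle_credentials(directory, username, key):
    """Write kaggle.json into `directory` and return its path (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kaggle.json")
    with open(path, "w") as f:
        # json.dump guarantees well-formed JSON (balanced quotes and braces)
        json.dump({"username": username, "key": key}, f)
    os.chmod(path, 0o600)  # same intent as `chmod 600`
    return path

# Demo in a temporary directory; in the notebook you would pass "/root/.kaggle"
demo_dir = tempfile.mkdtemp()
cred_path = write_kaggle_credentials(demo_dir, "u", "k")
with open(cred_path) as f:
    print(json.load(f))
```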

In [1]:
!kaggle competitions download -c 11-685-s23-hw5
!mkdir '/content/data'

!unzip -qo '/content/11-685-s23-hw5.zip' -d '/content/data'

Downloading 11-685-s23-hw5.zip to c:\Users\Kyle\Documents\Coding\DeepLearning\HW5\hw5_handout_S23




  

# Imports

In [2]:
import torch
from torch import nn, optim
from torch.utils import data
from torch.nn.utils.rnn import *

import numpy as np
from tqdm import tqdm
import sys
import json

# add any other imports that you want 

has_cuda = torch.cuda.is_available()
if has_cuda:
  print("GPU: ", torch.cuda.get_device_name(0))
device = torch.device("cuda:0" if has_cuda else "cpu")
print("Device: ", device)



GPU:  NVIDIA GeForce RTX 3070 Ti
Device:  cuda:0


In [None]:
# TODO
%cd /Path/To/Your/hw5_handout/Directory/

# Dataset and DataLoaders

Before running these cells, complete the TODOs in `task/unpaired_audio_text.py`. You only need to replace the paths; the original codebase can serve as a reference.



In [None]:
from task import UnpairedAudioText

task = UnpairedAudioText()

In [None]:
train_dataloader_args = dict(batch_size=160, #feel free to change these values
                             num_workers=4,
                            ) if has_cuda else dict(batch_size=64)
train_dataloader_args["shuffle"] = True
train_dataloader_args["collate_fn"] = task.datasets["train"].collater

validation_dataloader_args = train_dataloader_args.copy()
validation_dataloader_args["shuffle"] = False
validation_dataloader_args["collate_fn"] = task.datasets["valid"].collater

train_dataloader = data.DataLoader(task.datasets["train"], **train_dataloader_args)
validation_dataloader = data.DataLoader(task.datasets["valid"], **validation_dataloader_args)

# Model and Training Configurations

Before running this cell, complete the TODOs in `model/wav2vec_u.py`. You can use the original codebase as a reference.
Original Codebase: https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/unsupervised/


In [None]:
from model import Wav2vec_U

model = Wav2vec_U(task.target_dictionary).to(device)
print(model)

For a GAN, you need optimizers for both the discriminator and the generator. Configure the optimizers according to fairseq's configuration given in the link:
https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/unsupervised/config/gan/w2vu.yaml


In [None]:
from itertools import chain

num_epochs = 2000
epoch_start = 1

if epoch_start == 1:
    # define 2 optimizers for different parts of the model at the start of the training
    optimizer = {
      "discriminator": optim.Adam(model.discriminator.parameters(),
                                  # TODO: define lr, weight decay, betas and other relevant parameters
                                 ),
      "generator": optim.Adam(chain(model.generator.parameters(), model.segmenter.parameters()),
                              # TODO: define lr, weight decay, betas and other relevant parameters
                              )
    }
    
# Optional TODO: Consider using mixed-precision to speed up training
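
The two-optimizer setup can be tried out on toy modules before wiring it to the real model. The sketch below is only an illustration: the `toy_*` modules are hypothetical stand-ins, and the hyperparameters are placeholders, not the values from `w2vu.yaml`, which you should read from the linked config yourself.

```python
from itertools import chain

import torch
from torch import nn, optim

# Toy stand-ins for the real sub-modules (hypothetical sizes)
toy_generator = nn.Linear(8, 4)
toy_segmenter = nn.Linear(8, 1)
toy_discriminator = nn.Linear(4, 1)

# Placeholder hyperparameters, NOT copied from w2vu.yaml
toy_optimizer = {
    "discriminator": optim.Adam(
        toy_discriminator.parameters(),
        lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4,
    ),
    "generator": optim.Adam(
        chain(toy_generator.parameters(), toy_segmenter.parameters()),
        lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0,
    ),
}

# One discriminator-only update: the generator weights must stay untouched
before = toy_generator.weight.detach().clone()
toy_optimizer["discriminator"].zero_grad()
loss = toy_discriminator(toy_generator(torch.randn(2, 8))).sum()
loss.backward()
toy_optimizer["discriminator"].step()
print(torch.equal(before, toy_generator.weight.detach()))
```

Stepping only one optimizer leaves the other group's parameters unchanged, although gradients still accumulate on them, so each optimizer should call `zero_grad()` before its own update.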

The next cell contains several TODOs. <br><br>
Tip: Instead of completing the whole `run_model` function and then debugging it while running the Experiments section, create a new cell and write your own sanity check first. This can help you understand what the dataloader returns, what needs to be moved to the device, how the model is called, and what `loss_stats` contains.
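
One way to write such a sanity check is a small helper that walks a nested batch and lists what it finds. The sketch below assumes the batch is made of nested dicts and lists whose leaves may expose a `.shape`. The names `describe` and `FakeTensor`, and the `id` and `padding_mask` keys, are hypothetical and only there so the example runs without the dataset.

```python
def describe(obj, prefix=""):
    """Return one line per leaf of a nested dict/list, with its type and shape."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.extend(describe(value, f"{prefix}{key}."))
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            lines.extend(describe(value, f"{prefix}{i}."))
    else:
        shape = getattr(obj, "shape", None)
        name = prefix.rstrip(".")
        # Show the shape for tensor-like leaves, the value itself otherwise
        detail = f"shape={tuple(shape)}" if shape is not None else repr(obj)
        lines.append(f"{name}: {type(obj).__name__} {detail}")
    return lines

class FakeTensor:
    """Stand-in for a tensor: it only carries a `.shape`."""
    def __init__(self, *shape):
        self.shape = shape

batch = {
    "id": [0, 1],
    "net_input": {
        "features": FakeTensor(2, 50, 512),
        "padding_mask": FakeTensor(2, 50),
    },
}
summary = describe(batch)
print("\n".join(summary))
```

In the notebook, the same helper could be applied to a real batch, for example `describe(next(iter(train_dataloader)))`.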

In [None]:
# Hint: You may find pdb to be a great tool in helping you understand returned values
# from the dataloader and the model. Usage:
# import pdb
# pdb.set_trace()

def run_model(model, dataloader):
    cumulative_stats = dict()

    for data in tqdm(dataloader, desc="Train" if model.training else "Eval "):
        net_input = data['net_input']
        # What are the keys and values obtained from the data loader?
        # TODO: move all tensors to GPU
        # Tip: Checking what is inside net_input might help

        if model.training:
            # TODO: We are training the model, so the optimizer needs attention here.
            # Remember that you are training a GAN: the two optimizers are not used at the same time.
            # You may need an if statement or something similar to select the right optimizer.
            # You may have to use the discrim_step() method of the Wav2vec_U class


            loss_stats = model(**net_input) # forward pass
            

            total_loss = 0.0 

            # TODO: accumulate losses into total_loss for backprop during training
            # loss_stats["losses"] is a dictionary containing various loss components
            # some losses can be None if they are not used

            total_loss /= net_input["features"].size(0) # average by batch


            group = model.get_groups_for_update() 
            # Look at what the get_groups_for_update() function does in the Wav2vec_U class
            # Think about how discrim_step(), mentioned above, relates to it

            # TODO: backprop loss and run the corresponding optimizer (Tip: See what 'group' is)

        else:
            # validation
            loss_stats = task.valid_step(data, model)


        # accumulate batch stats
        for k, v in loss_stats.items():
            if type(v) is dict:
                # flatten inner dictionary
                key_value_pairs = [(k + "_" + kn, vn) for kn, vn in v.items()]
            else:
                key_value_pairs = [(k, v)]

            # TODO: accumulate all statistics into cumulative_stats, a dictionary
            # NOTE: you should convert any returned tensors to either values or numpy arrays
            # cumulative_stats shouldn't have nested dictionaries

          
    # average stats over the dataset
    # Note that some metrics are already averaged over batch, so the result won't make sense
    # You can fix them if needed
    for k, v in cumulative_stats.items():
        v = v / len(dataloader.dataset)
        if type(v) is np.ndarray:
            v = v.tolist()
        cumulative_stats[k] = v

    return cumulative_stats
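
The flatten-and-accumulate step in `run_model` can be sketched on its own. This version assumes each batch's `loss_stats` holds plain numbers or one level of nested dicts, and that `None` entries should be skipped. The helper name `accumulate_stats` and the example keys are hypothetical. With real tensors you would convert each value first, for example with `.item()`.

```python
def accumulate_stats(cumulative_stats, loss_stats):
    """Add one batch's stats into cumulative_stats, flattening one level of nesting."""
    for k, v in loss_stats.items():
        if isinstance(v, dict):
            # flatten inner dictionary, e.g. losses -> losses_a, losses_b
            pairs = [(k + "_" + kn, vn) for kn, vn in v.items()]
        else:
            pairs = [(k, v)]
        for name, value in pairs:
            if value is None:  # unused loss components are skipped
                continue
            cumulative_stats[name] = cumulative_stats.get(name, 0.0) + float(value)
    return cumulative_stats

totals = {}
accumulate_stats(totals, {"losses": {"a": 1.5, "b": None}, "sample_size": 4})
accumulate_stats(totals, {"losses": {"a": 0.5, "b": None}, "sample_size": 4})
print(totals)
```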

# Experiments

In [None]:
print(f"Training for {num_epochs} epochs", file=sys.stderr)

eval_interval = 10 # evaluation after how many epochs?

for epoch in range(epoch_start, num_epochs + 1):
    print(f"Epoch {epoch}", file=sys.stderr)

    model.train()
    # The model uses the epoch number to decide which part of the network to train
    model.set_num_updates(epoch) # Look at what this function does in the Wav2vec_U class

    train_stats = run_model(model, train_dataloader)

    if epoch % eval_interval == 0:

        model.eval()
        with torch.no_grad():
            eval_stats = run_model(model, validation_dataloader)
            
        # TODO: perhaps save your model and optimizer here
        # Tip: You can even save the model after every epoch along with the best model. 
        # This may help to continue training even if the best model is from a very early epoch.

    else:
        eval_stats = {}

    # TODO: Log training/eval statistics
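
The loop above calls `set_num_updates(epoch)` so that the model can decide which part of the network to train. A toy sketch of that scheduling follows. It assumes a parity rule (odd update counts train the discriminator, even ones the generator); the handout's `Wav2vec_U` may use a different rule, so check `discrim_step()` and `get_groups_for_update()` before relying on it. `ToyGAN` and `CountingOptimizer` are hypothetical stand-ins, not the real classes.

```python
class CountingOptimizer:
    """Stand-in for a torch optimizer that only counts step() calls."""
    def __init__(self):
        self.steps = 0

    def zero_grad(self):
        pass

    def step(self):
        self.steps += 1

class ToyGAN:
    """Mimics only the update-scheduling interface, not the real model."""
    def __init__(self):
        self.update_num = 0

    def set_num_updates(self, num_updates):
        self.update_num = num_updates

    def discrim_step(self):
        # Assumption: odd updates train the discriminator
        return self.update_num % 2 == 1

    def get_groups_for_update(self):
        return "discriminator" if self.discrim_step() else "generator"

toy_optimizer = {"discriminator": CountingOptimizer(), "generator": CountingOptimizer()}
toy_model = ToyGAN()
schedule = []
for update in range(1, 7):
    toy_model.set_num_updates(update)
    group = toy_model.get_groups_for_update()
    toy_optimizer[group].zero_grad()
    # ... the forward pass and total_loss.backward() would go here ...
    toy_optimizer[group].step()
    schedule.append(group)
print(schedule)
```

Indexing the optimizer dictionary by the returned group name keeps the training code free of a separate branch per optimizer.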

# Testing and submission to Kaggle

In [None]:
from dataset import extracted_features_dataset
test_path = None  # TODO: replace None with the path to the test data (PathToTest)
testset = extracted_features_dataset.ExtractedFeaturesDataset(path=test_path,
                                                              split='test')

test_dataloader_args = train_dataloader_args.copy()
test_dataloader_args["shuffle"] = False
test_dataloader_args["collate_fn"] = testset.collater

test_dataloader = data.DataLoader(testset, **test_dataloader_args)

Write the `test_step` function. It can be written much like `valid_step` in `task/unpaired_audio_text.py`.

In [None]:
def test_step(inputs, model):
    # TODO
    pass

Write some code to run the evaluation and collect the results. You are free to write the cells below however you like.

In [None]:
model.eval()
results = []

for data in tqdm(test_dataloader, desc="Test"):
    # TODO
    pass

In [None]:
# TODO: Replace the path and get the phoneme_map.json for mapping
with open("/Path/To/phoneme_map.json", "r") as file:
    phon_map = json.load(file)

In [None]:
predictions = []

for line in results:
    # TODO: Map results with phon_map
    pass
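
The mapping step might look like the sketch below. It assumes each entry of `results` is a space-separated string of phoneme symbols and that `phon_map` maps each symbol to a replacement string. Neither assumption is confirmed by the handout, so inspect `phoneme_map.json` and your own `results` first. The helper `map_phonemes`, the toy map, and the joining without separators are all hypothetical.

```python
def map_phonemes(line, phon_map, unknown="?"):
    """Map each space-separated symbol through phon_map (hypothetical helper)."""
    tokens = line.split()
    # Symbols missing from the map are replaced by `unknown` instead of raising
    return "".join(phon_map.get(tok, unknown) for tok in tokens)

toy_map = {"AH": "a", "B": "b", "K": "k"}
toy_results = ["B AH K", "K AH ZZ"]
toy_predictions = [map_phonemes(line, toy_map) for line in toy_results]
print(toy_predictions)
```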

In [None]:
# TODO: Make the CSV and submit to Kaggle
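
Writing the submission file can be done with the standard `csv` module. In the sketch below the column names `id` and `label`, the row numbering, and the helper `write_submission` are assumptions; check the competition's sample submission for the required header and id format.

```python
import csv
import io

def write_submission(predictions, fileobj, header=("id", "label")):
    """Write one row per prediction, numbered from 0 (hypothetical layout)."""
    writer = csv.writer(fileobj)
    writer.writerow(header)
    for i, pred in enumerate(predictions):
        writer.writerow([i, pred])

# Demo with an in-memory buffer instead of a real file
buffer = io.StringIO()
write_submission(["bak", "ka?"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass a handle opened with `open("submission.csv", "w", newline="")` so that the `csv` module controls the line endings.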