# HW5: Unsupervised Speech Recognition (USR)

Welcome to HW5 in Introduction to Deep Learning 11685. You will be working on Unsupervised Speech Recognition with GANs in this HW. You will be reimplementing and further improving on the model given in the USR paper by Facebook AI.<br>
Link: https://arxiv.org/abs/2105.11084


Most comments in the code below come from the provided starter code; I left the original comments and TODOs in place as I filled them in.

# Installations

In [None]:
! pip install git+https://github.com/pytorch/fairseq
! pip install torchsummaryX
! pip install wandb -q
# You can install other libraries such as torchsummaryX, wandb and so on

# Kaggle

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir -p /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    # TODONE: Put your kaggle username & key here
    # key deleted for obvious reasons
    f.write('{"username":"u","key":"k"}') 

!chmod 600 /root/.kaggle/kaggle.json
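
If you would rather not paste credentials into the notebook, the following is a minimal sketch that builds `kaggle.json` from environment variables instead. The helper name `write_kaggle_credentials` and its `target_dir` parameter are hypothetical; `KAGGLE_USERNAME` and `KAGGLE_KEY` are the variable names the Kaggle CLI itself recognizes.

```python
import json
import os
import stat
from pathlib import Path


def write_kaggle_credentials(target_dir, username=None, key=None):
    """Write kaggle.json into target_dir with owner-only permissions (hypothetical helper)."""
    # Fall back to environment variables when no explicit values are given.
    username = username or os.environ.get("KAGGLE_USERNAME")
    key = key or os.environ.get("KAGGLE_KEY")
    if not username or not key:
        raise ValueError("Kaggle username and key must be provided")

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    cred_path = target / "kaggle.json"
    with open(cred_path, "w") as f:
        json.dump({"username": username, "key": key}, f)
    os.chmod(cred_path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return cred_path
```

In the notebook this would be called as `write_kaggle_credentials("/root/.kaggle")` after setting the two environment variables, which keeps the key out of the saved cell contents.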

In [None]:
!kaggle competitions download -c 11-685-s23-hw5 --force
!mkdir -p '/content/data'

!unzip -qo '/content/11-685-s23-hw5.zip' -d '/content/data'

# Imports

In [None]:
import torch
from torch import nn, optim
from torch.utils import data
from torch.nn.utils.rnn import *

import numpy as np
from tqdm import tqdm
import sys
import json
from google.colab import drive

# add any other imports that you want 

has_cuda = torch.cuda.is_available()
if has_cuda:
  print("GPU: ", torch.cuda.get_device_name(0))
device = torch.device("cuda:0" if has_cuda else "cpu")
print("Device: ", device)

drive.mount('/content/drive', force_remount=True)

Note that the next cell copies the handout's .py files from Google Drive; I ran this project on Colab, and this was the easiest setup for testing.

In [None]:
# TODONE
%cp -r "/content/drive/MyDrive/School Stuff/Deep Learning/hw5_handout_S23" "/content/handout"
%cd /content/handout

# Dataset and DataLoaders

You have TODOs which need to be completed in `task/unpaired_audio_text.py` before you run these cells. You just need to replace the paths. You can use the original code base as a reference.



In [None]:
from task import UnpairedAudioText

task = UnpairedAudioText()

In [None]:
train_dataloader_args = dict(batch_size=160, #feel free to change these values
                             num_workers=4,
                            ) if has_cuda else dict(batch_size=64)
train_dataloader_args["shuffle"] = True
train_dataloader_args["collate_fn"] = task.datasets["train"].collater

validation_dataloader_args = train_dataloader_args.copy()
validation_dataloader_args["shuffle"] = False
validation_dataloader_args["collate_fn"] = task.datasets["valid"].collater

train_dataloader = data.DataLoader(task.datasets["train"], **train_dataloader_args)
validation_dataloader = data.DataLoader(task.datasets["valid"], **validation_dataloader_args)

Code for Quiz

In [None]:
'''from fairseq.data import (
    Dictionary,
    data_utils,
    StripTokenDataset,
)
import os'''
'''text_dataset = data_utils.load_indexed_dataset(
                os.path.join(task.cfg.text_data, 'train'), task.target_dictionary
            )
'''
#print(f'The dataset has {len(task.datasets["train"])} speech segments')
#print(f'The dataset has {len(text_dataset)} text segments')

# Model and Training Configurations

You need to complete the TODOs in `model/wav2vec_u.py` before you run this cell. You can use the original codebase as a reference to complete this.
Original Codebase: https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/unsupervised/.


In [None]:
from model import Wav2vec_U

model = Wav2vec_U(task.target_dictionary).to(device)
print(model)

For a GAN, you need optimizers for both the discriminator and the generator. Configure the optimizers according to fairseq's configuration given in the link:
https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/unsupervised/config/gan/w2vu.yaml


In [None]:
GENERATOR_CONFIG = {
  "lr": 0.0004,
  "adam_betas": (0.5,0.98),
  "adam_eps": 1e-06,
  "weight_decay": 0,
}
DISCRIMINATOR_CONFIG = {
  "lr": 0.0003,
  "adam_betas": (0.5,0.98),
  "adam_eps": 1e-06,
  "weight_decay": 0.0001,
}

In [None]:
from itertools import chain

num_epochs = 2000
epoch_start = 1

if epoch_start == 1:
    # define 2 optimizers for different parts of the model at the start of the training
    optimizer = {
      "discriminator": optim.Adam(model.discriminator.parameters(),
                                  # TODONE: define lr, weight decay, betas and other relevant parameters
                                  lr=DISCRIMINATOR_CONFIG["lr"],
                                  betas=DISCRIMINATOR_CONFIG["adam_betas"],
                                  eps=DISCRIMINATOR_CONFIG["adam_eps"],
                                  weight_decay=DISCRIMINATOR_CONFIG["weight_decay"]

                                 ),
      "generator": optim.Adam(chain(model.generator.parameters(), model.segmenter.parameters()),
                              # TODONE: define lr, weight decay, betas and other relevant parameters
                              lr=GENERATOR_CONFIG["lr"],
                              betas=GENERATOR_CONFIG["adam_betas"],
                              eps=GENERATOR_CONFIG["adam_eps"],
                              weight_decay=GENERATOR_CONFIG["weight_decay"]
                              )
    }
    
# Optional TODO: Consider using mixed-precision to speed up training
scaler = torch.cuda.amp.GradScaler()

There are several TODOs in the next cell. <br><br>
Tip: Instead of completing the whole `run_model` function and then debugging it while running the Experiments section, you can create a new cell and write your own sanity check. It may help you understand what the dataloader returns, what needs to be moved to the device, how the model is called, and what `loss_stats` contains.
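
One way to write such a sanity check is a small helper that walks the nested batch dictionary and reports what each entry is. This is only a sketch: `describe_batch` is a hypothetical name, and it relies on nothing more than tensors exposing a `.shape` attribute.

```python
def describe_batch(batch, prefix=""):
    """Return one line per leaf: tensors/arrays as shapes, everything else as a type name."""
    lines = []
    for key, value in batch.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dictionaries such as batch["net_input"].
            lines.extend(describe_batch(value, prefix=name + "."))
        elif hasattr(value, "shape"):
            lines.append(f"{name}: shape={tuple(value.shape)}")
        else:
            lines.append(f"{name}: {type(value).__name__}")
    return lines


# Usage in the notebook (assumes `train_dataloader` from the cells above):
# batch = next(iter(train_dataloader))
# print("\n".join(describe_batch(batch)))
```

Running it on a single batch shows which entries are tensors that need `.to(device)` and which are plain Python objects.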

In [None]:
# Scheduler for use with the discriminator optimizer.
# Zeroes the discriminator LR except on every (unfreeze_every*2)-th scheduler step,
# to give the generator time to catch up.
def freeze_lr(epoch):
  unfreeze_every = 2
  if epoch % (unfreeze_every*2) == 0:
    return 1
  else:
    return 0
# Uncommenting this line and the scheduler.step() line farther below will enable the freeze_lr scheduler.
# LambdaLR scales the initial LR by the returned factor; MultiplicativeLR would compound
# the factors, so a single 0 would leave the LR at 0 permanently.
#scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer['discriminator'], lr_lambda=freeze_lr)
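
A note on the choice of scheduler class for a 0/1 factor like this one: `LambdaLR` multiplies the initial LR by the factor at each step, whereas `MultiplicativeLR` multiplies the current LR by it. The pure-Python sketch below simulates both update rules without PyTorch; the `simulate_*` helpers are hypothetical and only mirror the documented formulas.

```python
def freeze_lr(epoch, unfreeze_every=2):
    """Copy of the factor function above: 1 on every (unfreeze_every * 2)-th step, else 0."""
    return 1 if epoch % (unfreeze_every * 2) == 0 else 0


def simulate_lambda_lr(base_lr, steps):
    # LambdaLR rule: lr_t = base_lr * f(t)
    return [base_lr * freeze_lr(t) for t in range(steps)]


def simulate_multiplicative_lr(base_lr, steps):
    # MultiplicativeLR rule: lr_0 = base_lr, then lr_t = lr_{t-1} * f(t) for t >= 1
    lrs = [base_lr]
    for t in range(1, steps):
        lrs.append(lrs[-1] * freeze_lr(t))
    return lrs


print(simulate_lambda_lr(0.0003, 6))          # [0.0003, 0.0, 0.0, 0.0, 0.0003, 0.0]
print(simulate_multiplicative_lr(0.0003, 6))  # [0.0003, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With the multiplicative rule the LR never recovers after the first 0 factor, so only the `LambdaLR` rule produces the intended freeze and unfreeze pattern.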

In [None]:
# Hint: You may find pdb to be a great tool in helping you understand returned values
# from the dataloader and the model. Usage:
# import pdb
# pdb.set_trace()

def run_model(model, dataloader):
    cumulative_stats = dict()

    for data in tqdm(dataloader, desc="Train" if model.training else "Eval "):
        net_input = data['net_input']
        # What are the keys and values obtained from the data loader?
        # TODONE: move all tensors to GPU
        #data['id'] = data['id'].to(device)

        data['net_input']['features'] = data['net_input']['features'].to(device)
        data['net_input']['padding_mask'] = data['net_input']['padding_mask'].to(device)
        # Tip: Checking what is inside net_input might help

        if model.training:
            # TODONE: We are training the model. Might need to do something with the optimizer?
            data['net_input']['random_label'] = data['net_input']['random_label'].to(device)
            if model.discrim_step(model.update_num):
              optimizer['discriminator'].zero_grad()
            else:
              optimizer['generator'].zero_grad()
            #model.zero_grad()
            # Remember that you are training a GAN. Both optimizers won't be used at the same time.
            # You may have to write an if statement or something similar to use the specific optimizer.
            # You may have to use the discrim_step() attribute in the Wav2vec_U class

            with torch.cuda.amp.autocast():
              loss_stats = model(**net_input) # forward pass
            

            total_loss = 0.0 

            # TODONE: accumulate losses into total_loss for backprop during training
            losses = loss_stats['losses']
            for k, v in losses.items():
              if v is not None:
                total_loss += v
            # loss_stats["losses"] is a dictionary containing various loss components
            # some losses can be None if it's not used

            total_loss /= net_input["features"].size(0) # average by batch


            group = model.get_groups_for_update() 
            # Look at what the get_groups_for_update() function does in the Wav2vec_U class
              # Outputs either 'discriminator' or 'generator', depending on which is in use
            # Can you try to think how you can use discrim_step() previously?
              # discrim_step outputs a bool, true if it's stepping or false otherwise

            # backprop loss and run the corresponding optimizer
            scaler.scale(total_loss).backward() # This is a replacement for loss.backward()
            scaler.step(optimizer[group]) # This is a replacement for optimizer.step()
            scaler.update()


        else:
            # validation
            loss_stats = task.valid_step(data, model)


        # accumulate batch stats
        for k, v in loss_stats.items():
            if type(v) is dict:
                # flatten inner dictionary
                key_value_pairs = [(k + "_" + kn, vn) for kn, vn in v.items()]
            else:
                key_value_pairs = [(k, v)]

            # accumulate all statistics into cumulative_stats, a dictionary
            for pair in key_value_pairs:
              key, value = pair
              if value is None:
                continue
              if torch.is_tensor(value):
                value = value.cpu().detach().numpy()
              if key in cumulative_stats:
                cumulative_stats[key] += value
              else:
                cumulative_stats[key] = value
            # NOTE: you should convert any returned tensors to either values or numpy arrays
            # cumulative_stats shouldn't have nested dictionaries

        del net_input, loss_stats
        torch.cuda.empty_cache()

    # average stats over the dataset
    # Note that some metrics are already averaged over batch, so the result won't make sense
    # You can fix them if needed
    for k, v in cumulative_stats.items():
        v = v / len(dataloader.dataset)
        if type(v) is np.ndarray:
            v = v.tolist()
        cumulative_stats[k] = v

    return cumulative_stats
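
The note above about metrics that are already averaged per batch can be handled by weighting each batch-level mean by its batch size before dividing by the dataset size. The sketch below is a toy illustration in plain Python; the function name `dataset_mean` and the `(batch_mean, batch_size)` input format are assumptions for illustration, not part of the handout.

```python
def dataset_mean(batch_means_and_sizes):
    """Combine per-batch means into a dataset-level mean, weighting by batch size."""
    total = 0.0
    count = 0
    for batch_mean, batch_size in batch_means_and_sizes:
        total += batch_mean * batch_size  # recover the per-batch sum
        count += batch_size
    if count == 0:
        raise ValueError("no samples")
    return total / count


# Two full batches of 4 samples and a final partial batch of 2 samples.
stats = [(2.0, 4), (4.0, 4), (10.0, 2)]
print(dataset_mean(stats))
```

A plain average of the three batch means would overweight the small final batch, which is why the weights matter whenever the last batch is not full.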

# Experiments

In [None]:
def save_model(model, optimizers, metric, epoch, path):
    # Use the `optimizers` argument rather than the global `optimizer` dict.
    torch.save(
        {'model_state_dict'         : model.state_dict(),
         'discriminator_optimizer_state_dict'     : optimizers['discriminator'].state_dict(),
         'generator_optimizer_state_dict'     : optimizers['generator'].state_dict(),
         metric[0]                  : metric[1], 
         'epoch'                    : epoch}, 
         path
    )
  
def load_model(path, model, optimizers=None, metric='edit_distance'):
    # `optimizers` comes before `metric` so that positional calls such as
    # load_model(path, model, optimizer) restore the optimizer state as intended.
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizers is not None:
        optimizers['discriminator'].load_state_dict(checkpoint['discriminator_optimizer_state_dict'])
        optimizers['generator'].load_state_dict(checkpoint['generator_optimizer_state_dict'])

    epoch   = checkpoint['epoch']
    #metric  = checkpoint[metric]

    return [model, optimizers, epoch]

In [None]:
best_edit_dist = 37 # if you're restarting from some checkpoint, use what you saw there.
epoch_model_path = '/content/drive/MyDrive/pth/HW5_temp.pth'# set the model path( Optional, you can just store best one. Make sure to make the changes below )
best_model_path = '/content/drive/MyDrive/pth/HW5_final.pth'# set best model path 

In [None]:
import wandb
wandb.login()  # prompts for the API key (or reads WANDB_API_KEY); key removed for the same reason as the Kaggle one

In [None]:
model, optimizer, epoch = load_model(best_model_path, model, optimizer)

In [None]:
run = wandb.init(
    name = "Reduced LR + Augmentations Finetune", ## Wandb creates random run names if you skip this field
    reinit = True, ### Allows reinitializing runs when you re-run this cell
    #id = '9qelwp3k',### Insert specific run id here if you want to resume a previous run
    #resume = "must", ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw5-ablations", ### Project should be created in your wandb account 
    # This uses only the generator config because it was a quick hacky fix that I didn't feel the need to rewrite later.
    config = GENERATOR_CONFIG ### Wandb Config for your run
)

In [None]:
print(f"Training for {num_epochs} epochs", file=sys.stderr)

eval_interval = 10 # evaluation after how many epochs?

for epoch in range(epoch_start, num_epochs + 1):
    print(f"Epoch {epoch}", file=sys.stderr)

    model.train()
    # The model uses the epoch number to decide which part of the network to train
    model.set_num_updates(epoch) # Look at what this function does in the Wav2vec_U class

    train_stats = run_model(model, train_dataloader)

    print(train_stats)
    if epoch % eval_interval == 0:

      model.eval()
      with torch.no_grad():
          eval_stats = run_model(model, validation_dataloader)
          
      # TODONE: perhaps save your model and optimizer here
      dist = eval_stats['edit_distance']
      save_model(model, optimizer, ['edit_distance', dist], epoch, epoch_model_path)
      print("Saved epoch model")

      if dist <= best_edit_dist:
          best_edit_dist = dist
          save_model(model, optimizer, ['edit_distance', dist], epoch, best_model_path)
          print("Saved best model")
        # You may find it interesting to explore Wandb Artifacts to version your models
          # Tip: You can even save the model after every epoch along with the best model. 
          # This may help to continue training even if the best model is from a very early epoch.

      # TODONE: Log training/eval statistics
      try:
        wandb.log(train_stats)
        wandb.log(eval_stats)
      except Exception as e:
        print(f'Wandb logging failed: {e}')
      #scheduler.step()
      print(eval_stats)
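
The alternation between the two optimizers in the loop above can be pictured with a small stand-in class. This sketch assumes the handout keeps the fairseq reference behavior, where `discrim_step` returns `True` on odd update numbers; `ToyGAN` is a hypothetical stand-in, not the real `Wav2vec_U` class.

```python
class ToyGAN:
    """Hypothetical stand-in that mimics only the update-scheduling methods."""

    def __init__(self):
        self.update_num = 0

    def set_num_updates(self, num_updates):
        self.update_num = num_updates

    def discrim_step(self, num_updates):
        # Assumption: odd update numbers train the discriminator.
        return num_updates % 2 == 1

    def get_groups_for_update(self):
        return "discriminator" if self.discrim_step(self.update_num) else "generator"


toy = ToyGAN()
schedule = []
for epoch in range(1, 5):
    toy.set_num_updates(epoch)
    schedule.append((epoch, toy.get_groups_for_update()))
print(schedule)
# [(1, 'discriminator'), (2, 'generator'), (3, 'discriminator'), (4, 'generator')]
```

Under this assumption, calling `set_num_updates(epoch)` once per epoch means every batch within an epoch updates the same half of the model, and the two halves swap from one epoch to the next.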

# Testing and submission to Kaggle

In [None]:
from dataset import extracted_features_dataset
# TODONE: PathToTest
test_path = "/content/data/11-685-s23-hw5/test"
testset = extracted_features_dataset.ExtractedFeaturesDataset(path=test_path,
                                                              split='test')

test_dataloader_args = train_dataloader_args.copy()
test_dataloader_args["shuffle"] = False
test_dataloader_args["collate_fn"] = testset.collater

test_dataloader = data.DataLoader(testset, **test_dataloader_args)

Write the `test_step` function, which can be coded very similarly to the `valid_step` given in `task/unpaired_audio_text.py`.

In [None]:
import logging
import math
import editdistance

# Based on the given code
def test_step(inputs, model):
    res = model(
        **inputs["net_input"],
        dense_x_only=True,
    )

    dense_x = res["logits"]
    padding_mask = res["padding_mask"]

    z = dense_x.argmax(-1)
    z[padding_mask] = task.target_dictionary.pad()

    output = []
    for i, (x, t, id) in enumerate(
        zip(
            z,
            inputs["target"] if "target" in inputs else [None] * len(z),
            inputs["id"],
        )
    ):

        pred_units_arr = x

        pred_units_arr = pred_units_arr.tolist()

        chars = []
        for char in pred_units_arr:
            chars.append(task.target_dictionary.string([char]))
        output.append(chars)
        #output.append(pred_units_arr)

    return output

Write some code to evaluate the model and collect the results. You are free to write the cells below however you want.

In [None]:
model.eval()
results = []
model, optimizer, epoch = load_model(best_model_path, model, optimizer)

for data in tqdm(test_dataloader, desc="Test"):
    data['net_input']['features'] = data['net_input']['features'].to(device)
    data['net_input']['padding_mask'] = data['net_input']['padding_mask'].to(device)
    with torch.no_grad():
        out = test_step(data, model)
        for batch in out:
          results.append(batch)

In [None]:
# TODONE: Replace the path and get the phoneme_map.json for mapping
with open("/content/data/11-685-s23-hw5/test/phoneme_map.json", "r") as file:
    phon_map = json.load(file)

In [None]:
predictions = []

for line in results:
    prediction = "".join([phon_map[index] for index in line])
    # TODONE: Map results with phon_map
    predictions.append(prediction)
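
The mapping step can be illustrated with a toy map. This sketch assumes the keys of the map are the unit strings produced by `test_step`; the three entries in `toy_phon_map` and the helper `map_prediction` are hypothetical, and the real `phoneme_map.json` defines the actual symbols.

```python
# Hypothetical entries; the real phoneme_map.json defines the actual symbols.
toy_phon_map = {"AH": "a", "B": "b", "<pad>": ""}


def map_prediction(units, phon_map, default=""):
    """Join the mapped symbol for each predicted unit, substituting `default` for unknown units."""
    return "".join(phon_map.get(unit, default) for unit in units)


print(map_prediction(["B", "AH", "<pad>", "??"], toy_phon_map))  # ba
```

Unlike indexing with `phon_map[index]`, which raises a `KeyError` on an unexpected unit, `dict.get` silently drops it; which behavior is preferable depends on whether you want mapping problems to fail loudly.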


In [None]:
import pandas as pd
# TODONE: Make the CSV and submit to kaggle
data_dir = '/content/data/11-685-s23-hw5/test/sample_submission.csv'
df = pd.read_csv(data_dir)
df.label = predictions
df.to_csv('submission.csv', index = False)