# CS753 2023 -- Assignment 2

This assignment is due on or before **11.59 pm on April 9, 2023**. The submission portal on Moodle will be open until midnight on April 11th, with a 5% penalty for each additional day after the 9th. This assignment adds up to 25 points overall. This is a group assignment. You can form groups of 2-3 students.

## **Acknowledgements**

* All of Task 0's ASR-related code snippets have been borrowed from the ASR Notebook at [CS224S's Assignment 4 at Stanford](https://web.stanford.edu/class/cs224s/assignments/a4/) which are, in turn, borrowed from the [SpeechBrain toolkit](https://github.com/speechbrain/speechbrain/).

* SpeechBrain models are downloaded from their host site on [Huggingface](https://huggingface.co/speechbrain).  

* The following [SpeechBrain tutorial](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=J6N0Fb51pFnZ) will give you a code walkthrough of how an ASR system is coded from scratch using this toolkit.

## **Dataset**

We will use *HarperValleyBank (HVB)* -- a publicly-available spoken dialog corpus. Click [here](https://arxiv.org/pdf/2010.13929.pdf) for more details about the HVB corpus, how it was collected and what it is annotated for.  

## **What to submit**

On Moodle, you will have to submit a text file `README.txt` with the answers requested in the tasks below and a `.ipynb` file containing all your code, using the following naming convention: LDAP IDs of team members delimited by `_`. E.g., `220022022_220022021.ipynb`.

## **Getting Started**
* Make a copy of this notebook in your personal Google Drive to make edits.
* Change your runtime type (under "Runtime") to GPU.
* Go through all the steps in Task 0 to get set up with the first ASR task.
* **Important:** ASR training, as in Task 0, will take close to **1.5 hours** for two epochs. Keep this in mind when scheduling your runs. You should consider saving the checkpoints from a training run if you want to use it for other experiments or for additional finetuning.


# Dependencies

In [None]:
from google.colab import drive  # mount Google Drive in Colab
drive.mount('/content/drive')

Mounted at /content/drive


If you have any issues using `gdown`, the same data and config files are available directly via Google Drive [here](https://drive.google.com/drive/folders/1xQxvR9NRlwK-75KMd0i1y4Dy0fimTsM9).  

In [None]:
# setup
#!gdown 1oJh0U3g_bUx6UPX4xix2UHMVHeCE_H1y
!unzip /content/drive/MyDrive/cs753-2023-assgmt2/hvb.zip
#!mv content/data /content/
#!rm -r /content/content

#!gdown 1a0EGlsLbXnGn1xwZoSqT0tcdAQ1L2nfd # train.py
#!gdown 1yCmjRbxXRxfEN5LXdnE1Zpl8ZOIzdrAO # train.yaml
#!gdown 1KHmdcLVFI9ontvGmi5J6vfaropGYuKcr # inference.yaml

!pip install speechbrain -q

Streaming output truncated to the last 5000 lines.
  inflating: content/data/segments/5224.wav  
  inflating: content/data/segments/7770.wav  
  inflating: content/data/segments/15016.wav  
  inflating: content/data/segments/21930.wav  
  inflating: content/data/segments/7698.wav  
  inflating: content/data/segments/12353.wav  
  inflating: content/data/segments/7926.wav  
  inflating: content/data/segments/16266.wav  
  inflating: content/data/segments/24681.wav  
  inflating: content/data/segments/20442.wav  
  inflating: content/data/segments/9689.wav  
  inflating: content/data/segments/7934.wav  
  inflating: content/data/segments/6896.wav  
  inflating: content/data/segments/12117.wav  
  inflating: content/data/segments/19455.wav  
  inflating: content/data/segments/937.wav  
  inflating: content/data/segments/16852.wav  
  inflating: content/data/segments/19350.wav  
  inflating: content/data/segments/13098.wav  
  inflating: content/data/segments/24202.wav  
  in

In [None]:
import speechbrain as sb
from speechbrain.pretrained import EncoderDecoderASR
import json
import torchaudio
import torch
from torch import nn
from tqdm import tqdm
from collections import Counter

from IPython.display import Audio
from scipy.io import wavfile

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Task 0: Evaluate a pretrained CRDNN model, further finetuned with HVB (5 points)

We will first load a [SpeechBrain CRDNN model pretrained on LibriSpeech](https://huggingface.co/speechbrain/asr-crdnn-rnnlm-librispeech). SpeechBrain has some utils with built-in options to source pre-trained models from a repository on HuggingFace.

In Task 0, you will first load this pretrained CRDNN model from HuggingFace and use it for inference on the first 500 examples from `test_manifest.json` (included within `hvb.zip` you have already downloaded earlier). Subsequently, you will fine tune this model on the HVB training dataset before reevaluating on the test examples and note the difference in performance.

In [None]:
crdnn = EncoderDecoderASR.from_hparams(
    source='speechbrain/asr-crdnn-rnnlm-librispeech',
    savedir='asr-crdnn-rnnlm-librispeech',
    run_opts={'device': device}  # falls back to CPU when no GPU is available
)

The manifests we prepared for use with SpeechBrain are JSON files with the following structure:
```
{
    "15748": {
        "wav": "/content/data/segments/15748.wav",
        "length": 1.86,
        "words": "WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT"
    },
    ...
}
```

We first load them and define a function to batch them into a format that our `EncoderDecoderASR` object can ingest:

In [None]:
TEST_SIZE = 500 # for faster processing

with open('/content/content/data/test_manifest.json', 'r') as f:
    test_manifest = json.load(f)
test_manifest = {k: v for k, v in list(test_manifest.items())[:TEST_SIZE]}


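def batchify(manifest, batch_size):
    """Yield (keys, padded waveforms, relative lengths) for each batch."""
    keys = list(manifest.keys())
    # Manifest paths start at /content/data, but the archive was unzipped to /content/content/data.
    wav_paths = list(map(lambda x: '/content'+x['wav'], manifest.values()))
    num_examples = len(manifest)
    for i in range(0, num_examples, batch_size):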
        batch_wavs = nn.utils.rnn.pad_sequence([
            torchaudio.load(path)[0].squeeze(0)
            for path in wav_paths[i:min(i + batch_size, num_examples)]
        ], batch_first=True)
        batch_keys = keys[i:min(i + batch_size, num_examples)]
        batch_wav_lens = torch.tensor([
            manifest[key]['length'] for key in batch_keys
        ])
        batch_wav_lens = batch_wav_lens / batch_wav_lens.max()
        yield batch_keys, batch_wavs, batch_wav_lens
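
To see what `batchify` hands to the model without loading any audio, here is a minimal sketch of the same chunking and relative-length logic in plain Python. The toy manifest values and the helper name `batch_keys_and_rel_lens` are made up for illustration; only the manifest layout is taken from the example above.

```python
# Toy manifest in the same shape as the HVB manifests (values are made up).
toy_manifest = {
    "a": {"wav": "/data/segments/a.wav", "length": 2.0, "words": "HELLO"},
    "b": {"wav": "/data/segments/b.wav", "length": 1.0, "words": "HI THERE"},
    "c": {"wav": "/data/segments/c.wav", "length": 4.0, "words": "OKAY"},
}


def batch_keys_and_rel_lens(manifest, batch_size):
    """Yield (keys, relative_lengths) per batch, without loading audio."""
    keys = list(manifest.keys())
    for i in range(0, len(keys), batch_size):
        batch_keys = keys[i:i + batch_size]  # slicing clips at the end by itself
        lengths = [manifest[k]["length"] for k in batch_keys]
        longest = max(lengths)
        # Each length is expressed as a fraction of the longest item in the batch.
        yield batch_keys, [length / longest for length in lengths]


toy_batches = list(batch_keys_and_rel_lens(toy_manifest, batch_size=2))
for keys, rel_lens in toy_batches:
    print(keys, rel_lens)
# ['a', 'b'] [1.0, 0.5]
# ['c'] [1.0]
```

The relative lengths tell the model how much of each zero-padded waveform is real audio, so the longest utterance in a batch always gets 1.0.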

Next, we feed our test examples through the pretrained ASR model:

In [None]:
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

def inference(model, test_manifest, batch_size=8):
    torch.cuda.empty_cache()
    pred_dict = {}
    # Ceiling division gives the exact number of batches for the progress bar.
    for keys, wavs, wav_lens in tqdm(batchify(test_manifest, batch_size), total=(len(test_manifest) + batch_size - 1) // batch_size):
        transcriptions, _ = model.transcribe_batch(wavs.to(device), wav_lens.to(device))
        for key, transcription in zip(keys, transcriptions):
            pred_dict[key] = transcription
    return pred_dict

pred_dict = inference(crdnn, test_manifest)

Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:862.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
100%|██████████| 63/63 [03:56<00:00,  3.76s/it]


## Q1: Evaluate WER of pretrained model predictions (2 points)

Check the word error rate on the test instances from `test_manifest.json` that you just transcribed (the first `TEST_SIZE` entries, 500 in the cell above). Note that we want WERs, so we need to split our transcripts into lists of words.

You don't need to implement anything new here. Just follow along and ensure you can run the code to obtain WER on the results you just generated.

In [None]:
# this data structure stores WER information we use later.
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)

In [None]:
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)

{'WER': 75.84080717488789,
 'SER': 87.8,
 'num_edits': 2706,
 'num_scored_tokens': 3568,
 'num_erraneous_sents': 439,
 'num_scored_sents': 500,
 'num_absent_sents': 0,
 'num_ref_sents': 500,
 'insertions': 1127,
 'deletions': 307,
 'substitutions': 1272}

What is the WER value you obtain? Write it down in `README.txt` that you will upload on Moodle, along with your Colab notebook.

We expect WER of this pretrained system to be somewhat high on HVB data (around 72%+ WER). That is really quite high! Note that we already re-sampled the audio to 16kHz to make the HVB audio features more similar to the training inputs of the pre-trained model.
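
To make the summary above concrete: WER is the number of word-level edits (substitutions + deletions + insertions) divided by the number of reference words, which is how 2706 edits over 3568 scored tokens gives roughly 75.8%. The sketch below is a plain-Python illustration of that definition, not SpeechBrain's implementation; `word_edit_counts` and `wer` are hypothetical helper names.

```python
def word_edit_counts(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = (total_edits, subs, dels, ins) for ref_words[:i] vs hyp_words[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, subs, dels, ins = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diag = (total, subs, dels, ins)  # match, no cost
            else:
                diag = (total + 1, subs + 1, dels, ins)
            total, subs, dels, ins = dp[i - 1][j]
            delete = (total + 1, subs, dels + 1, ins)
            total, subs, dels, ins = dp[i][j - 1]
            insert = (total + 1, subs, dels, ins + 1)
            # Tuples compare on total edits first, so min() picks a cheapest path.
            dp[i][j] = min(diag, delete, insert)
    _, subs, dels, ins = dp[n][m]
    return subs, dels, ins


def wer(reference, hypothesis):
    """Word error rate in percent; can exceed 100 when there are many insertions."""
    ref_words = reference.split()
    subs, dels, ins = word_edit_counts(ref_words, hypothesis.split())
    return 100.0 * (subs + dels + ins) / len(ref_words)


wer_example = wer("WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT",
                  "WHAT DAY WOULD YOU LIKELY APPOINT APPOINTMENT")
wer_inserted = wer("ALRIGHT", "RIPE RIGHT RIPE")
print(wer_example, wer_inserted)  # 37.5 300.0
```

Because the denominator is the reference length only, a one-word reference with a long hypothesis yields a WER far above 100%.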

ASR errors often have specific error modes or correlations, so let's see if we can understand where our pretrained system is failing on HVB data. We can start by checking some of the utterances with the highest WER.

In [None]:
def summarize(detail_dict, true_dict, pred_dict):
    print(f"{detail_dict['key']}: {detail_dict['WER']}")
    print(f"\tTrue: {true_dict[detail_dict['key']]}")
    print(f"\tPred: {pred_dict[detail_dict['key']]}")

for wer_dict in sb.utils.edit_distance.top_wer_utts(details_by_utterance, 10)[0]:
    summarize(wer_dict, true_dict, pred_dict)

3015: 6900.0
	True: ZERO
	Pred: AIN'T IT OAT OAT OAT OAT OAK OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT OAT
5788: 2633.3333333333335
	True: YOU AS WELL
	Pred: YOU'RE THE WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WEST WE
1130: 2400.0
	True: ALRIGHT
	Pred: RIPE RIGHT RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE RIPE
8731: 2100.0
	True: ALRIGHT
	Pred: HE ARRAYED ARRAYED ARRAYED ARRAYED ARRAYED ARRAYED ARRA

It seems that the model keeps outputting the same word over and over. Let's see why.

Select some examples the model wrongly predicts, and try to build a hypothesis around what in the data is associated with the model making mistakes. Examples of mistake types include:
- Repeated wrong word
- A few correct words but clearly wrong transcript

Give examples of at least 3 audio files and different kinds of errors you have identified in these audio files. Add this to `README.txt`.


Here's a code snippet to help you get started.

In [None]:
example = details_by_utterance[0]
test_manifest[example['key']]['wav']


'/content/data/segments/15748.wav'

In [None]:
example = details_by_utterance[0]
summarize(example, true_dict, pred_dict)
Audio('/content'+test_manifest[example['key']]['wav'])


15748: 37.5
	True: WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT
	Pred: WHAT DAY WOULD YOU LIKELY APPOINT APPOINTMENT


(Hint: In our initial checks, poor audio quality seems to be associated with repeated word errors.)
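
One way to collect candidates for the "repeated wrong word" error type is to flag hypotheses dominated by a single word. This is only a heuristic sketch: the helper names, the threshold of 0.5, and the minimum length of 6 words are arbitrary assumptions that you would want to tune by listening to the flagged files.

```python
from collections import Counter


def repetition_ratio(transcript):
    """Fraction of the words taken up by the single most frequent word."""
    words = transcript.split()
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)


def flag_repetitive(pred_dict, threshold=0.5, min_words=6):
    """Return keys whose prediction is long enough and mostly one repeated word."""
    return [
        key
        for key, text in pred_dict.items()
        if len(text.split()) >= min_words and repetition_ratio(text) >= threshold
    ]


# Toy predictions modelled on the outputs shown above (shortened).
toy_preds = {
    "1130": "RIPE RIGHT RIPE RIPE RIPE RIPE RIPE RIPE",
    "15748": "WHAT DAY WOULD YOU LIKELY APPOINT APPOINTMENT",
}
flagged = flag_repetitive(toy_preds)
print(flagged)  # ['1130']
```

Calling `flag_repetitive(pred_dict)` on the real predictions gives a list of keys whose audio you can then play with `Audio(...)` to test the hypothesis in the hint.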

## Q2: Finetune the pretrained CRDNN model (2 points)

The performance of this pretrained model is disappointing. However, the model was trained on a different data domain than call-center transcripts as in HVB. To see if we can derive better performance on our dataset, we fine-tune the pretrained model with HVB training data and test it. For this experiment, you won't need to modify it much. Just get training working, and you can try adjusting some training or decoding parameters as you like. The key thing to learn here is simply developing an understanding of how things work when finetuning on a new corpus using SpeechBrain. (This is a state-of-the-art approach to building and adjusting ASR models that might be used in industry projects.)

We've set up the training script, experiment yaml, and inference yaml for you, but we encourage you to take a look at how it works (and most importantly what a neat ML experiment yaml file looks like). Training this model for 2 epochs on Colab GPUs should take around **1.5 hours**.

The model will save checkpoints during training, which you can specify for use during inference / testing below. That means you should be able to use a fine-tuned version of the model even if you don't wait for training to finish. It is okay to submit the homework with a partially fine-tuned model, i.e. one trained for fewer than 2 full epochs.

In [None]:
# this downloads the training and config files for our fine tuning setup
!gdown 1v_3Kl8OrUd6_1_D0ZGoYVFEuOKhZ7YMo # train.py
!gdown 17cQIpx5kLLMCD23EDaE0EYg2E9LPqMCF # train.yaml
!gdown 1CWYOD2PC97gXguW4krc9122HKAraHkYS # inference.yaml

Downloading...
From: https://drive.google.com/uc?id=1v_3Kl8OrUd6_1_D0ZGoYVFEuOKhZ7YMo
To: /content/train.py
100% 18.3k/18.3k [00:00<00:00, 28.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=17cQIpx5kLLMCD23EDaE0EYg2E9LPqMCF
To: /content/train.yaml
100% 12.1k/12.1k [00:00<00:00, 17.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1CWYOD2PC97gXguW4krc9122HKAraHkYS
To: /content/inference.yaml
100% 4.96k/4.96k [00:00<00:00, 9.66MB/s]


**Finetuning with HVB data**:

There are two files you've just downloaded which specify the architecture to train, and the main training loop for improving the neural net ASR system.

- `train.yaml` is the yaml config file SpeechBrain uses to specify both the network / ASR system architecture, as well as the parameters of training procedures, loss functions, datasets, etc. This is a good starting point for understanding the architecture of the ASR system you're working with. Note that you are not able to modify much about the ASR network architecture as it needs to match what we load from file. You can adjust things like loss function weights, learning rate, and training time to adjust the fine tuning setup.
- `train.py` specifies the main training loop for fitting the acoustic model. You do not need to modify this file.

Edit the training yaml file and run the training loop as shown below. Finetune with the modified hyperparameters that worked best for you. Copy/paste the train loss, valid loss, valid CER and valid WER from your training output into `README.txt`, along with the epoch number. (Using `train.yaml` as-is, we obtained training and validation loss < 1.75 after epoch 1.)
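
If you prefer to script your hyperparameter edits rather than change `train.yaml` by hand, a small text-level helper is enough for simple top-level `key: value` lines. This is a sketch under assumptions: the helper `set_top_level_scalar` is hypothetical, and the key names `number_of_epochs` and `lr` as well as all values in the sample text are placeholders, so check the actual keys in your copy of `train.yaml`.

```python
import re


def set_top_level_scalar(yaml_text, key, value):
    """Replace the value on a top-level `key: value` line, keeping a trailing comment."""
    pattern = re.compile(
        rf"^({re.escape(key)}:[ \t]*)([^#\n]*?)([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", yaml_text, count=1
    )
    if count == 0:
        raise KeyError(f"no top-level key named {key!r}")
    return new_text


# Made-up sample text; the real train.yaml will look different.
sample = "seed: 1234\nnumber_of_epochs: 2  # assumed key name\nlr: 1.0\n"
updated = set_top_level_scalar(sample, "number_of_epochs", 1)
print(updated)
```

On the real file you would read `train.yaml`, apply the helper, and write the result back before launching training. Values can also be overridden on the command line, as the training cell in this notebook does with `--batch_size=4`.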



In [None]:
!pip install hyperpyyaml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# `train.yaml` is a HyperPyYAML file: tags such as `!apply:torch.manual_seed`
# cannot be constructed by `yaml.safe_load`, so print the raw text to inspect it.
with open("train.yaml", "r") as stream:
    print(stream.read())


In [None]:
torch.cuda.empty_cache()

!python train.py train.yaml --batch_size=4
# OOM on batch_size=5

Downloading http://www.openslr.org/resources/28/rirs_noises.zip to ./data/rirs_noises.zip
rirs_noises.zip: 1.31GB [01:35, 13.8MB/s]                
Extracting ./data/rirs_noises.zip to ./data
Traceback (most recent call last):
  File "/content/train.py", line 417, in <module>
    hparams = load_hyperpyyaml(fin, overrides)
  File "/usr/local/lib/python3.9/dist-packages/hyperpyyaml/core.py", line 188, in load_hyperpyyaml
    hparams = yaml.load(yaml_stream, Loader=loader)
  File "/usr/local/lib/python3.9/dist-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/usr/local/lib/python3.9/dist-packages/ruamel/yaml/constructor.py", line 121, in get_single_data
    return self.construct_document(node)
  File "/usr/local/lib/python3.9/dist-packages/ruamel/yaml/constructor.py", line 126, in construct_document
    data = self.construct_object(node)
  File "/usr/local/lib/python3.9/dist-packages/ruamel/yaml/constructor.py", line 154, in construct_object
    dat

## Q3: Evaluate your finetuned model (1 point)

To run inference, you need to use a different yaml to be compatible with the `EncoderDecoderASR` class.

NOTE: You need to set your checkpoint path in a few locations to make this work. If things aren't working (e.g. SpeechBrain tries to download from HuggingFace), check that your paths are set correctly before debugging anything else.

To get inference working there are three steps:
1. Note the directory that your checkpoints are saved in (under `./results/CRDNN_BPE_960h_LM/2602/save/{your ckpt here}`)
2. Paste this directory into the ckptdir entry in `inference.yaml`
3. Paste this directory after the `ckpt_path = ` in the below cell.

You can then use the same inference procedure as in Q1.
NOTE: Set `ckpt_path` below AND point `ckptdir` in `inference.yaml` to the desired checkpoint. You can edit `inference.yaml` before or after it is copied into your checkpoint directory, but if you edit it afterwards, make sure you change the copy inside the checkpoint directory.
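
Rather than typing the checkpoint directory by hand, you can look it up. The sketch below picks the most recently modified `CKPT+*` directory under a save folder. It assumes the checkpoint directories follow the `CKPT+<timestamp>` naming seen in the path above, and `latest_checkpoint` is a hypothetical helper. Note that "most recent" is not necessarily "best on validation". The demo runs on throwaway directories so it works anywhere.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(save_dir):
    """Return the most recently modified `CKPT+*` directory in save_dir, or None."""
    candidates = [p for p in Path(save_dir).glob("CKPT+*") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo on temporary directories with made-up checkpoint names.
with tempfile.TemporaryDirectory() as tmp:
    older = Path(tmp) / "CKPT+2023-04-01+10-00-00+00"
    newer = Path(tmp) / "CKPT+2023-04-01+11-00-00+00"
    older.mkdir()
    newer.mkdir()
    os.utime(older, (1_000, 1_000))  # force distinct modification times
    os.utime(newer, (2_000, 2_000))
    picked_name = latest_checkpoint(tmp).name
    empty_result = latest_checkpoint(older)  # contains no checkpoints

print(picked_name)  # CKPT+2023-04-01+11-00-00+00
```

In the notebook you would call it on your own save folder (the `.../save` directory from step 1) and use `str(...)` of the result as `ckpt_path`.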


In [None]:
ckpt_path = "/content/results/CRDNN_BPE_960h_LM/2602/save/CKPT+2022-05-09+17-22-55+00"
!cp inference.yaml {ckpt_path}

cp: cannot create regular file '/content/results/CRDNN_BPE_960h_LM/2602/save/CKPT+2022-05-09+17-22-55+00': No such file or directory


Evaluate your finetuned model on the first 50 test sentences in `test_manifest.json`. Generate predictions by setting up a model object and calling `inference()`. Remember that your checkpoint paths must be set correctly in the copy of `inference.yaml` that is read at inference time. For this step, simply populate `pred_dict` with predictions for your test subset. Write down the WER in `README.txt`. Also compute the WER of the pretrained model on this 50-utterance test subset and write it down in `README.txt`.

NOTE: Running inference on ~50 utterances might require ~15 minutes of computation on a Colab CPU. The code below uses CPU inference as we could not get checkpoint-loaded DNNs to work with SpeechBrain's inference on the GPU (you are free to try this).

In [None]:
device = 'cpu'
our_model = EncoderDecoderASR.from_hparams(
    source=ckpt_path,
    hparams_file='inference.yaml',
    savedir="our_ckpt",
    run_opts={'device': device}
)

HFValidationError: ignored

In [None]:
TEST_SIZE = 50 # for faster processing
with open('/content/content/data/test_manifest.json', 'r') as f:
    test_manifest = json.load(f)
test_manifest = {k: v for k, v in list(test_manifest.items())[:TEST_SIZE]}

pred_dict = inference(our_model.to(device), test_manifest)

# Task 1: Train a sentiment detection model (20 points)


Apart from the audio files and transcriptions, the HVB corpus also comes with annotations for intent, sentiment/emotion and dialog actions. Use the command below to download `transcript.zip`.

In [None]:
!gdown 1-s2e8dZYSjhVgfo_TL0V_89RZVhGnZ1Y
!unzip -q transcript.zip

!gdown 1ChdI1XyhmGq9z8Y8M38yXPMob6oPRqPO  #train.txt
!gdown 10w15DnUbJcQRBSZWP03qjM6Oq8l7WSVQ  #dev.txt

Downloading...
From: https://drive.google.com/uc?id=1-s2e8dZYSjhVgfo_TL0V_89RZVhGnZ1Y
To: /content/transcript.zip
100% 4.08M/4.08M [00:00<00:00, 102MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ChdI1XyhmGq9z8Y8M38yXPMob6oPRqPO
To: /content/train.txt
100% 30.5k/30.5k [00:00<00:00, 41.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=10w15DnUbJcQRBSZWP03qjM6Oq8l7WSVQ
To: /content/dev.txt
100% 12.3k/12.3k [00:00<00:00, 23.9MB/s]


In [None]:
with open('transcript/370981f1f0254ebc.json', 'r') as f:
  fl=f.read()

In [None]:
import json
# Load one conversation and look at its first utterance.
with open('transcript/370981f1f0254ebc.json') as f:
    data = json.load(f)
data[0]

{'channel_index': 2,
 'dialog_acts': ['gridspace_greeting'],
 'duration_ms': 2970,
 'emotion': {'neutral': 0.3537469506263733,
  'negative': 0.03589286655187607,
  'positive': 0.6103601455688477},
 'human_transcript': 'hello this is harper valley national bank my name is michael',
 'index': 1,
 'offset_ms': 6020,
 'speaker_role': 'agent',
 'start_ms': 3970,
 'start_timestamp_ms': 1591055856446,
 'transcript': 'hello this is harper valley national bank my name is michael',
 'word_durations_ms': [330, 180, 120, 270, 300, 390, 270, 150, 150, 90, 450],
 'word_offsets_ms': [0,
  600,
  780,
  900,
  1170,
  1470,
  1860,
  2130,
  2280,
  2430,
  2520]}

Each transcript json file within `/content/transcript/` refers to a conversation with a list of utterances. Each segment (utterance) is associated with a json object and one of its keys is labeled as "emotion" that maps to three probability values associated with positive, negative and neutral sentiments. You can consider the sentiment with maximum probability to be the ground-truth label for each utterance.
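
As a small illustration of the labelling rule above, the following sketch turns an "emotion" dictionary into a class index by taking the most probable sentiment. The class order in `LABELS` and the helper name `emotion_to_label` are arbitrary choices made for this example, not part of the corpus.

```python
# Fixed class order so that label ids are stable across the whole dataset.
LABELS = ("negative", "neutral", "positive")


def emotion_to_label(emotion):
    """Return (class_id, class_name) for the sentiment with maximum probability."""
    name = max(LABELS, key=lambda label: emotion[label])
    return LABELS.index(name), name


# The "emotion" entry of the first utterance printed above.
example_emotion = {
    "neutral": 0.3537469506263733,
    "negative": 0.03589286655187607,
    "positive": 0.6103601455688477,
}
label_id, label_name = emotion_to_label(example_emotion)
print(label_id, label_name)  # 2 positive
```

Looking values up by name, rather than relying on the order of keys in the JSON object, keeps the label ids consistent even if a file lists the sentiments in a different order.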

For this task, you will write new code to create a sentiment classification model for HVB utterances. Here are the various steps:
- Create train and dev splits from `train.txt` and `dev.txt` downloaded in the code cell above. To understand the format in `train.txt` (and `dev.txt`), consider a line in  `train.txt`: *010d38f5ada54e0d:1,2,3,4,5,6,7,8*. This refers to conversation-ID 010d38f5ada54e0d with eight utterances appearing sequentially within the file `/content/transcript/010d38f5ada54e0d.json`. These utterances are numbered by the field "index" in the json file. Similarly interpret the lines in `dev.txt`. Use these text files (`train.txt` and `dev.txt`) to create train and dev sets and extract the relevant metadata ("emotion") you need from the respective transcript json files.
- Load the pretrained CRDNN Librispeech model as in Task 0 and extract the encoder.
- Add a linear layer mapping the encoding of the audio signal to a prediction of the underlying sentiment. This will be randomly initialized.
- Train both the new linear layer and all the pretrained encoder layers using a cross-entropy loss with the reference emotion labels derived as described at the top of this cell.
- Evaluate your trained sentiment detection model on utterances listed in `dev.txt` and compute overall accuracy for sentiment prediction.
- Write down the accuracy you obtained in `README.txt`.
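
The split-file format described in the first step can be parsed with a few lines of plain Python. This is a minimal sketch: `parse_split_line` and `parse_split_text` are hypothetical helper names, and the sample line is the one quoted above.

```python
def parse_split_line(line):
    """Split `<conversation-id>:<i>,<j>,...` into (conversation_id, [i, j, ...])."""
    conv_id, _, indices = line.strip().partition(":")
    return conv_id, [int(tok) for tok in indices.split(",") if tok.strip()]


def parse_split_text(text):
    """Map each conversation id to its utterance indices, skipping blank lines."""
    return dict(parse_split_line(line) for line in text.splitlines() if line.strip())


split_example = parse_split_text("010d38f5ada54e0d:1,2,3,4,5,6,7,8\n\n")
print(split_example)
# {'010d38f5ada54e0d': [1, 2, 3, 4, 5, 6, 7, 8]}
```

Skipping blank lines matters because a trailing newline at the end of the file would otherwise produce an empty conversation id.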



## Rough Score Breakdown

---

1. Data preprocessing (4 points)
2. Loading the CRDNN model and extracting the encoder layers (6 points)
3. Adding a linear layer + training the model (7 points)
4. Obtaining accuracies similar to or better than our solution code (3 points)  
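
The shape of the model in steps 2 and 3 can be sketched as follows. This is not the reference solution: `SentimentClassifier` and `DummyEncoder` are hypothetical names, mean pooling over time is one possible choice, and the dummy encoder only stands in for the pretrained one so that the example runs without SpeechBrain. With the real model, the encoder would be taken from the loaded CRDNN object (possibly `crdnn.mods.encoder`, which is an assumption to verify, along with its output dimension).

```python
import torch
from torch import nn


class DummyEncoder(nn.Module):
    """Stand-in encoder: cuts the waveform into frames and projects each frame."""

    def __init__(self, frame=160, dim=16):
        super().__init__()
        self.frame = frame
        self.proj = nn.Linear(frame, dim)

    def forward(self, wavs, wav_lens):
        batch, samples = wavs.shape
        num_frames = samples // self.frame
        frames = wavs[:, : num_frames * self.frame].reshape(batch, num_frames, self.frame)
        return self.proj(frames)  # [batch, time, dim]


class SentimentClassifier(nn.Module):
    """Encoder followed by masked mean pooling and a linear layer over 3 classes."""

    def __init__(self, encoder, enc_dim, num_classes=3):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(enc_dim, num_classes)  # randomly initialized

    def forward(self, wavs, wav_lens):
        feats = self.encoder(wavs, wav_lens)  # [batch, time, dim]
        num_frames = feats.shape[1]
        # Convert relative lengths into frame counts and mask out padded frames.
        frame_lens = (wav_lens.to(feats.device) * num_frames).round().clamp(min=1)
        mask = torch.arange(num_frames, device=feats.device)[None, :] < frame_lens[:, None]
        mask = mask.unsqueeze(-1).to(feats.dtype)
        pooled = (feats * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)  # unnormalized class scores


torch.manual_seed(0)
model = SentimentClassifier(DummyEncoder(), enc_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # updates encoder and head

wavs = torch.randn(4, 1600)                      # random stand-in audio
wav_lens = torch.tensor([1.0, 0.5, 0.75, 1.0])   # relative lengths, as in batchify
labels = torch.tensor([2, 1, 0, 1])              # made-up class ids

logits = model(wavs, wav_lens)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
accuracy = (logits.argmax(dim=1) == labels).float().mean().item()
```

Passing `model.parameters()` to the optimizer is what makes both the new linear layer and the encoder layers trainable. Dev-set accuracy is the same `argmax` comparison, averaged over all dev utterances under `torch.no_grad()`.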

In [None]:
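import re
def create_dataset(file_name_str):
  # Each line looks like `<conversation-id>:<idx>,<idx>,...`; skip blank lines.
  with open(file_name_str, "r") as split_file:
    lines = [line for line in split_file.read().split('\n') if line.strip()]
  conv_ids = [line.split(':')[0] for line in lines]
  utt_indices = [list(map(int, re.findall(r'\d+', line.split(':')[-1]))) for line in lines]
  split_dict = dict(zip(conv_ids, utt_indices))
  transcr_list = []
  emotion_list = []
  for key in split_dict.keys():
    # Open the transcript file that belongs to this conversation.
    fname = 'transcript/' + key + '.json'
    with open(fname) as f:
      data_dic = json.load(f)
    for data in data_dic:
      if data['index'] in split_dict[key]:
        transcr_list.append(data['transcript'])
        emotion_list.append(list(data['emotion'].values()))
  # NOTE: keyed by transcript text, so repeated utterances (e.g. 'okay') collapse into one entry.
  return dict(zip(transcr_list, emotion_list))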

In [None]:
train_dataset = create_dataset("train.txt")
train_dataset

{'hello this is harper valley national bank my name is michael': [0.3537469506263733,
  0.03589286655187607,
  0.6103601455688477],
 'how can i help you today': [0.352674663066864,
  0.06029738858342171,
  0.587027907371521],
 '[noise]': [1.0, 0.0, 0.0],
 '[noise] hi my name is elizabeth brown i need a new checkbook': [0.32787075638771057,
  0.04341552034020424,
  0.6287136673927307],
 'what is your address': [0.5000655651092529,
  0.3527262210845947,
  0.14720825850963593],
 'okay': [0.5097575783729553, 0.2189798504114151, 0.27126264572143555],
 'my address is eight five six first street': [0.47416213154792786,
  0.43734416365623474,
  0.088493712246418],
 'uh': [0.5228667259216309, 0.29485175013542175, 0.1822814792394638],
 'mm hmm': [0.5318920016288757, 0.31008899211883545, 0.1580190360546112],
 'forrest ranch': [0.43634673953056335,
  0.11246293783187866,
  0.451190322637558],
 'alright': [0.5168510675430298, 0.24006950855255127, 0.24307942390441895],
 'oregon': [0.5262944102287292


# Extra Credit: Use Whisper's pretrained model to evaluate HVB (5 points)

Write code to evaluate [Whisper's **small** model](https://github.com/openai/whisper/blob/main/model-card.md) on the first 200 test utterances in `test_manifest.json` that you used in Q1 of Task 0. Compute the WER with predictions from Whisper and add it to `README.txt`.
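
Whisper typically returns cased, punctuated text, while the references in `test_manifest.json` are upper-case words without punctuation, so some text normalization is needed before the WER comparison is fair. The sketch below shows one simple normalization. It is an assumption about what is appropriate here, not a prescribed recipe, and `normalize_for_wer` is a hypothetical helper. It does not convert digits to spelled-out words, which you may also need to handle. The transcription itself could be obtained with the `openai-whisper` package via `whisper.load_model("small")` and `model.transcribe(path)["text"]`.

```python
import re


def normalize_for_wer(text):
    """Upper-case the text and drop everything except letters, digits, apostrophes."""
    text = text.upper()
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)  # punctuation becomes a space
    return " ".join(text.split())              # collapse repeated whitespace


normalized = normalize_for_wer(" What day would you like for your appointment?")
print(normalized)  # WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT
```

Applying the same function to every Whisper hypothesis before calling `wer_details_by_utterance` keeps casing and punctuation differences from being counted as word errors.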

In [None]:
##############################################################
#### YOUR EVALUATION CODE USING Whisper-small GOES BELOW #####
##############################################################
