In [1]:
# Downloads required packages and files
required_files = "https://github.com/jhu-intro-hlt/jhu-intro-hlt.github.io/raw/master/assignments/hw4-files/student/required_files.zip"
! wget $required_files && unzip -o required_files.zip
! pip install -r requirements.txt

--2022-12-14 19:10:13--  https://github.com/jhu-intro-hlt/jhu-intro-hlt.github.io/raw/master/assignments/hw4-files/student/required_files.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/assignments/hw4-files/student/required_files.zip [following]
--2022-12-14 19:10:13--  https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/assignments/hw4-files/student/required_files.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2395 (2.3K) [application/zip]
Saving to: ‘required_files.zip.1’


2022-12-14 19:10:13 (47.3 MB/s) - ‘requ

In [2]:
import tensorflow as tf

In [3]:
# Initialize Otter
import otter

grader = otter.Notebook(colab=True)

# Assignment 4

You have now learnt about end-to-end speech recognition models. In this assignment, you will build a CTC-based end-to-end model for ASR. This assignment is based on the tutorial available [here](https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch), and you are encouraged to read through the post and other end-to-end ASR resources available on the internet.

# Setup

For this assignment, as in the previous one, we will use Google Colab for both the code and the descriptive questions. Your task is to answer all the questions in the Colab notebook, then upload a PDF version of the notebook and a viewable link to Gradescope.

### Google Colaboratory

Before getting started, get familiar with Google Colaboratory:
https://colab.research.google.com/notebooks/welcome.ipynb

This is a neat Python environment that runs in the cloud and does not require you to
set up anything on your personal machine
(it also has some built-in IDE features that make writing code easier).
Moreover, it allows you to copy any existing Colaboratory file, alter it, and share it
with other people.

__Note:__
1. You may need to change your Runtime setting to GPU in order to run the following code blocks.
2. After changing the Runtime setting, you will need to re-run the earlier code blocks.

### Submission

Before you start working on this homework, do the following:

1. Select __File > Save a copy in Drive...__. This gives you your own copy that you can edit.
2. Follow all the steps in this Colaboratory file and write / change / uncomment code as necessary.
3. Remember to select __File > Save__ occasionally to save your progress.
4. Once all changes are done and your progress is saved, press the __Share__ button (top right corner of the page), press __get shareable link__, and make sure the option __Anyone with the link can view__ is selected. Copy the link and paste it in the box below.
5. After completing the notebook, select __File > Download .ipynb__ to download a local copy to your computer, then upload the file to Gradescope.
6. Export the notebook to PDF and upload the PDF to the written part.

__Special handling for model checkpoints.__
7. Because the homework requires training neural models, you must also submit the trained model checkpoint together with the notebook, so that nothing has to be re-trained during grading. Checkpoints are stored in the `./lightning_logs` directory. First locate this directory in the left side panel (`Files`) on Colab.
8. Enter `./lightning_logs` and find the training version that you would like to submit. The versions are labelled according to the training calls.
9. Download the `.ckpt` file from `./lightning_logs/<your_version>/checkpoints/<name>.ckpt`.
10. **Upload the checkpoint file to your JHU OneDrive and get a share link. The permission should be at least `can be viewed by anyone with the link`. Put the link into the corresponding cell.**
11. Submit your notebook to the autograder.

__Paste your notebook link in the box below.__ _(0 points)_



```
https://colab.research.google.com/drive/1v16BhaE60rjELg3A8-blkjl4GMr7K8gY?usp=sharing

```



## Installing the requirements

In this assignment, we will use torchaudio for speech feature extraction. You learned about Mel Frequency Cepstral Coefficients (MFCCs) in class. [Here](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html) is another useful blog post describing MFCCs. In the Kaldi tutorial, you extracted MFCCs for ASR, but in this assignment, we will do something slightly different.

First let's set up some helper code and import the libraries which we will use in this experiment.

In [4]:
import os
from typing import Dict, List, Tuple, Any, Optional, Union
from dataclasses import dataclass

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset
from transformers.modeling_outputs import CausalLMOutput

try:
    from vocabulary import Vocabulary, encode_as_tensor, decode_as_str
except ImportError:  # Fall back to the unzipped `files` package
    from files.vocabulary import Vocabulary, encode_as_tensor, decode_as_str

In [5]:
# Checks whether it is in the autograder grading mode
# Checks whether GPU accelerators are available
is_autograder = os.path.exists('is_autograder.py')
if torch.cuda.is_available() and not is_autograder:
    accelerator = 'gpu'
elif os.environ.get('COLAB_TPU_ADDR') is not None and not is_autograder:
    accelerator = 'tpu'
else:
    accelerator = 'cpu'
print(f'The notebook is running for "{"autograder" if is_autograder else "student"}".')
print('Students should make sure you are running under the "student" mode.')
print(f'You are using "{accelerator}".')

The notebook is running for "student".
Students should make sure you are running under the "student" mode.
You are using "gpu".


In [6]:
# Seed everything to make sure all experiments are reproducible
pl.seed_everything(seed=777)

INFO:pytorch_lightning.utilities.seed:Global seed set to 777


777

In [7]:
# Defines constants
HOMEWORK_DATA_URL = "https://github.com/jhu-intro-hlt/jhu-intro-hlt.github.io/raw/master/assignments/hw4-files/student/"

In [8]:
# Attention!!!
# Set this to True, if you are using local machine instead of Colab
RUN_LOCALLY = False
# But you have to make sure that your machine has CUDA support
# Otherwise the training would be super slow.
LOCAL_DATA_STORE_PATH = 'librispeech_data'  # If run locally, please configure a path to store the dataset.
# Only try to use the Colab and Google Drive when not under the autograder environment
if not is_autograder:
    if not RUN_LOCALLY:
        try:
            # LIBRISPEECH_DATA_PATH = './gdrive/My Drive/librispeech_data'
            LIBRISPEECH_DATA_PATH = LOCAL_DATA_STORE_PATH
            from google.colab import drive

            drive.mount('/content/gdrive')
        except Exception:  # Fall back to local storage
            LIBRISPEECH_DATA_PATH = LOCAL_DATA_STORE_PATH
    else:
        LIBRISPEECH_DATA_PATH = LOCAL_DATA_STORE_PATH
else:
    LIBRISPEECH_DATA_PATH = LOCAL_DATA_STORE_PATH
os.makedirs(LIBRISPEECH_DATA_PATH, exist_ok=True)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [9]:
if not is_autograder:
    train_dataset = torchaudio.datasets.LIBRISPEECH(LIBRISPEECH_DATA_PATH, url='train-clean-100', download=True)
    dev_dataset = torchaudio.datasets.LIBRISPEECH(LIBRISPEECH_DATA_PATH, url='dev-clean', download=True)
else:
    train_dataset = None
    dev_dataset = None

  0%|          | 0.00/5.95G [00:00<?, ?B/s]

  0%|          | 0.00/322M [00:00<?, ?B/s]

# Vocabulary

In this part, we reuse the `Vocabulary` class from the previous homework (hw3). To keep the indexing consistent across submissions and models, we create a vocabulary instance here. The only difference is the addition of a `<SPACE>` special token.

In [10]:
char_to_map = char_map_str = """'
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z"""

vocab = Vocabulary()
for c in char_to_map.split('\n'):
    vocab.add_token(token=c.strip())

## Downloading the data

In the Kaldi hands-on session, you worked with the Mini-LibriSpeech data, which is a tiny version of the 960h LibriSpeech corpus (the most popular English ASR benchmark dataset). In this assignment, you will use a 100h subset of this data to train your models. Run the code cell below to download the training and testing data.

__Note:__ The training data is approx. 6 GB and may take several minutes to download, depending on your Internet bandwidth. To avoid downloading the data again in every session, we download it to Google Drive. When you run the following code block, you will need to authenticate your Google Drive so that Colab can access its contents. The download may fail if your Drive does not have sufficient storage.

## Feature extraction



In the lectures, you have learnt about different methods for feature extraction from audio data. Two of these are:

  1. Mel Frequency Cepstral Coefficients
  2. Mel Spectrogram

You used the MFCC features for training in the Kaldi hands-on session. Answer the following questions about these feature extraction methods.

In this assignment, you will use 64-dim Mel Spectrogram features for training the models. `torchaudio` makes it easy to extract these features: look at `torchaudio.transforms` for a list of audio transformations available in the library.
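
As a rough illustration of what "64-dim" means here: a mel filterbank places its filters at centre frequencies that are evenly spaced on the mel scale, which for 16 kHz audio spans 0 Hz up to the 8 kHz Nyquist frequency. The sketch below assumes the HTK-style mel formula and only computes band centres, not the triangular filters themselves; the helper names are hypothetical and are not part of `torchaudio`.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style conversion from Hz to mel (assumed convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of `hz_to_mel`."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(n_mels: int, f_min: float, f_max: float) -> list:
    """Centre frequencies (Hz) of `n_mels` bands evenly spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)  # n_mels interior points between the edges
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centres = mel_band_centres(n_mels=64, f_min=0.0, f_max=8000.0)
print(len(centres), round(centres[0], 1), round(centres[-1], 1))
```

The centres are densely packed at low frequencies and spread out at high frequencies, which mirrors how the mel scale allocates more resolution where human hearing is more sensitive.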

Additionally, we use [SpecAugment](https://arxiv.org/pdf/1904.08779.pdf) to add randomness (as a form of data augmentation). This is a simple data augmentation scheme proposed recently that has become very popular since it provides significant WER gains and can be used on-the-fly during training.

Think about which 2 transformations from [this list](https://pytorch.org/tutorials/beginner/audio_preprocessing_tutorial.html#transformations) can be used to implement SpecAugment.

Complete the following code block to extract MelSpectrogram features with SpecAugment from the training data. For the test data, we just extract MelSpectrogram features without any SpecAugment (since data augmentation is only applied to training data).

Time Masking and Frequency Masking.

In [11]:
# Feature extraction for training data with SpecAugment. Note that Librispeech
# has a sample rate of 16 kHz. Extract 64-dim MelSpectrogram features, followed by
# the 2 operations needed for SpecAugment.
# Hint: use `freq_mask_param=30` for frequency masking and `time_mask_param=100`
# for time masking. These specify how much randomness is added to the input
# features.

train_audio_transforms = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64),
    torchaudio.transforms.FrequencyMasking(freq_mask_param=30),
    torchaudio.transforms.TimeMasking(time_mask_param=100)
)

# Feature extraction for test data. Use the same sample rate and number of mels
# as the training data. No SpecAugment is needed here.

valid_audio_transforms = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
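
To make the two masking operations concrete, here is a minimal pure-Python sketch of what frequency and time masking do to a spectrogram stored as a list of rows (one row per mel bin). It is an illustration under simplifying assumptions, not torchaudio's implementation: the function `mask_band` and its arguments are hypothetical, the band width is drawn uniformly between 0 and `max_width`, and masked cells are simply set to zero.

```python
import random


def mask_band(spec, max_width, axis, rng):
    """Zero one random contiguous band of `spec` (rows = mel bins, cols = frames).

    axis=0 masks mel bins (frequency masking); axis=1 masks frames (time masking).
    """
    n_rows, n_cols = len(spec), len(spec[0])
    size = n_rows if axis == 0 else n_cols
    width = rng.randint(0, min(max_width, size))  # band width, capped at the axis size
    start = rng.randint(0, size - width)          # band start so the band fits
    out = [row[:] for row in spec]                # copy so the input is left untouched
    for r in range(n_rows):
        for c in range(n_cols):
            idx = r if axis == 0 else c
            if start <= idx < start + width:
                out[r][c] = 0.0
    return out


rng = random.Random(0)
spec = [[1.0] * 200 for _ in range(64)]  # toy 64 x 200 "spectrogram"
augmented = mask_band(spec, max_width=30, axis=0, rng=rng)        # frequency mask
augmented = mask_band(augmented, max_width=100, axis=1, rng=rng)  # time mask
```

Because a new band is drawn on every call, the same utterance looks different in each epoch, which is why this kind of augmentation is applied to the training data only.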

In [12]:
@dataclass
class ASRBatch:
    spectrograms: torch.Tensor  # (batch_size, channel, feature, time)
    labels: torch.Tensor  # (batch_size, label_len)
    attention_mask: torch.Tensor  # (batch_size, time)
    label_mask: torch.Tensor  # (batch_size, label_len)
    label_strs: List[str]


class SpeechRecognitionDataModule(pl.LightningDataModule):
    """Wraps PyTorch dataset as a lightning data module."""

    def __init__(
            self,
            datasets: Dict[str, Dataset],
            vocab: Vocabulary,
            batch_size: int = 32,
            shuffle: bool = True
    ):
        super(SpeechRecognitionDataModule, self).__init__()

        self.datasets: Dict[str, Dataset] = {k: v for k, v in datasets.items() if v is not None}
        self.vocab = vocab
        self.batch_size = batch_size
        self.shuffle = shuffle

    @staticmethod
    def _pad_sequence(seq: torch.Tensor, max_length: int, padding_value: Union[int, float, bool]) -> torch.Tensor:
        seq_len = seq.shape[-1]
        if seq_len < max_length:
            return torch.cat(
                [seq, torch.tensor([padding_value] * (max_length - seq_len), dtype=seq.dtype, device=seq.device)],
                dim=-1
            )
        else:
            return seq

    def collate_fn(self, data, phase='train'):
        spectrograms = []
        labels = []
        label_strs = []
        input_lengths = []
        label_lengths = []
        for (waveform, _, utterance, _, _, _) in data:
            if phase == 'train':
                spec = train_audio_transforms(waveform).squeeze(0).transpose(0, 1)
            elif phase == 'val' or phase == 'test':
                spec = valid_audio_transforms(waveform).squeeze(0).transpose(0, 1)
            else:
                raise ValueError(f"Unknown phase: {phase!r}")
            spectrograms.append(spec)
            label_str = utterance.lower()
            label = encode_as_tensor(self.vocab, label_str).squeeze(0)
            labels.append(label)
            label_strs.append(label_str)
            # The original template hardcoded `spec.shape[0] // 2` here, assuming a
            # stride-2 convolution. That is fragile: a stride-2 convolution does not
            # necessarily yield exactly T // 2 frames, and any change to the
            # architecture would break it. We therefore record the raw frame count
            # and let the model recompute the post-convolution lengths itself.
            # input_lengths.append(spec.shape[0] // 2)
            input_lengths.append(spec.shape[0])
            label_lengths.append(label.shape[-1])

        spectrograms = nn.utils.rnn.pad_sequence(spectrograms, batch_first=True).unsqueeze(1).transpose(2, 3)
        labels = nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=self.vocab.pad_id())

        max_input_len = max(input_lengths)
        max_label_len = max(label_lengths)
        attention_mask = torch.stack(
            [
                self._pad_sequence(torch.ones(l, dtype=torch.bool), max_length=max_input_len, padding_value=False)
                for l in input_lengths
            ],
            dim=0
        )
        label_mask = torch.stack(
            [
                self._pad_sequence(torch.ones(l, dtype=torch.bool), max_length=max_label_len, padding_value=False)
                for l in label_lengths
            ],
            dim=0
        )

        return ASRBatch(
            spectrograms=spectrograms,
            labels=labels,
            attention_mask=attention_mask,
            label_mask=label_mask,
            label_strs=label_strs
        )

    def train_dataloader(self):
        return DataLoader(self.datasets['train'],
                          batch_size=self.batch_size,
                          shuffle=self.shuffle,
                          collate_fn=lambda x: self.collate_fn(x, phase='train'))

    def val_dataloader(self):
        return DataLoader(self.datasets['val'],
                          batch_size=self.batch_size,
                          shuffle=False,
                          collate_fn=lambda x: self.collate_fn(x, phase='val'))

    def test_dataloader(self):
        return DataLoader(self.datasets['test'],
                          batch_size=self.batch_size,
                          shuffle=False,
                          collate_fn=lambda x: self.collate_fn(x, phase='test'))
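
The padding and masking logic in `collate_fn` can be hard to picture from the tensor code alone. The following pure-Python sketch shows the same idea on plain lists: every sequence is padded to the longest length in the batch, and a boolean mask marks which positions are real. The function name `pad_and_mask` is hypothetical and exists only for this illustration.

```python
def pad_and_mask(sequences, pad_value=0):
    """Pad variable-length sequences to a common length and build boolean masks."""
    max_len = max(len(s) for s in sequences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[True] * len(s) + [False] * (max_len - len(s)) for s in sequences]
    return padded, mask


padded, mask = pad_and_mask([[5, 3, 9], [7], [2, 4]])
print(padded)  # [[5, 3, 9], [7, 0, 0], [2, 4, 0]]
print(mask)    # [[True, True, True], [True, False, False], [True, True, False]]
```

Summing a mask row recovers the original length of that sequence, which is how the model later derives `input_lengths` and `target_lengths` from `attention_mask` and `label_mask`.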

In [13]:
test_datamodule = SpeechRecognitionDataModule(
    datasets={
        'train': train_dataset,
        'val': dev_dataset
    },
    vocab=vocab,
    batch_size=4
)

In [14]:
train_dataloader = test_datamodule.train_dataloader()
for i, bc in enumerate(train_dataloader):
    if i > 0:
        break
    # print(bc)
    print(bc.spectrograms.size())
    print(bc.attention_mask.size())

torch.Size([4, 1, 64, 981])
torch.Size([4, 981])
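
The last dimension of the spectrogram batch is the number of frames of the longest utterance in the batch. As a sanity check, the sketch below relates frame counts to audio duration. It assumes the `MelSpectrogram` defaults of a 200-sample hop with centred frames, which is worth verifying against the installed torchaudio version; the helper names are hypothetical.

```python
def num_frames(num_samples: int, hop_length: int = 200) -> int:
    """Frames produced by a centred STFT (assumed default hop of 200 samples)."""
    return num_samples // hop_length + 1


def approx_duration_seconds(frames: int, hop_length: int = 200, sample_rate: int = 16000) -> float:
    """Approximate audio duration that corresponds to a given frame count."""
    return (frames - 1) * hop_length / sample_rate


print(num_frames(16000))             # 81 frames for one second of 16 kHz audio
print(approx_duration_seconds(981))  # 12.25
```

Under these assumptions, the 981 frames printed above correspond to an utterance of roughly 12 seconds.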


In [15]:
class CNNLayerNorm(nn.Module):
    """Layer normalization over the feature axis of CNN inputs."""
    def __init__(self, feat_size):
        super(CNNLayerNorm, self).__init__()
        self.layer_norm = nn.LayerNorm(feat_size)

    def forward(self, x):
        # x (batch, channel, feature, time)
        x = x.transpose(2, 3).contiguous() # (batch, channel, time, feature)
        x = self.layer_norm(x)
        return x.transpose(2, 3).contiguous() # (batch, channel, feature, time)

class ResidualCNN(nn.Module):
    """Residual CNN inspired by https://arxiv.org/pdf/1603.05027.pdf
        except with layer norm instead of batch norm
    """
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 dropout,
                 feat_size):
        super(ResidualCNN, self).__init__()
        padding = kernel_size // 2 if isinstance(kernel_size, int) else (kernel_size[0] // 2, kernel_size[1] // 2)
        self.cnn1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding=padding)
        self.cnn2 = nn.Conv2d(out_channels, out_channels, kernel_size, stride, padding=padding)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.layer_norm1 = CNNLayerNorm(feat_size)
        self.layer_norm2 = CNNLayerNorm(feat_size)


    def forward(self, x):
        residual = x  # (batch, channel, feature, time)
        x = self.layer_norm1(x)
        x = F.gelu(x)
        x = self.dropout1(x)
        x = self.cnn1(x)
        x = self.layer_norm2(x)
        x = F.gelu(x)
        x = self.dropout2(x)
        x = self.cnn2(x)
        x += residual
        return x # (batch, channel, feature, time)

In [16]:
class FeatureExtractor(nn.Module):
    """Extracts features for ASR.

    If needed, you have to use your auxiliary modules to extract features.
    Please modify the `__init__` to make it compatible with your model designs.
    """

    def __init__(
            self,
            in_channels: int,
            out_channels: int,
            kernel_size: Union[int, tuple],
            dropout: float,
            feat_size: int,
            n_cnn_layers: int,
            stride
    ):
        super(FeatureExtractor, self).__init__()
        s0 = stride if isinstance(stride,int) else stride[0]
        k0 = kernel_size if isinstance(kernel_size,int) else kernel_size[0]
        convd_feat_size = int((feat_size + 2 * (k0 // 2) - (k0 - 1) - 1) / s0 + 1)
        padding = kernel_size // 2 if isinstance(kernel_size, int) else (kernel_size[0] //2, kernel_size[1] // 2)

        # Shrink the size of the input features by stride
        # Possible optimization: modify the first cnn to broaden the vertical receptive field
        self.cnn = nn.Conv2d(in_channels=in_channels,
                             out_channels=out_channels,
                             kernel_size=kernel_size,
                             stride=stride,
                             padding=padding)
        
        self.rescnn = nn.Sequential(*[ResidualCNN(in_channels=out_channels,
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              stride=1,
                              dropout=dropout,
                              feat_size=convd_feat_size)
                            for _ in range(n_cnn_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """This forward method processes the input spectrogram features and returns
        a tensor that represents transformed features.

        Please pay attention to the output tensor shape, you have to use `transpose`
        to make sure dimensions aligned.

        Parameters
        ----------
        x : torch.Tensor
            The input spectrograms, in the shape of (batch_size, channel, feature, time)

        Returns
        -------
        torch.Tensor
            (batch_size, out_channels, conv_feature, conv_time); the caller
            flattens the channel and feature axes and moves time to axis 1.

        """
        x = self.cnn(x)
        x = self.rescnn(x)
        return x

class GRUBlock(nn.Module):
    def __init__(self,
                 input_size: int,
                 hidden_size: int,
                 num_layers: int,
                 dropout: float = 0.1,
                 batch_first: bool = True,
                 bidirectional: bool = True) -> None:
        super().__init__()

        self.rnn = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=batch_first,
            bidirectional=bidirectional,
            # GRU dropout acts between stacked layers, so it only applies when num_layers > 1
            dropout=dropout if num_layers > 1 else 0
        )

        self.layer_norm = nn.LayerNorm(input_size)
        self.gelu = nn.GELU()

    def forward(self, x):
        x = self.gelu(self.layer_norm(x))
        x, _ = self.rnn(x)
        return x


class AttentionBlock(nn.Module):
    def __init__(self,
                 input_size: int,
                 num_heads: int = 8,
                 dropout: float = 0.1,
                 batch_first: bool = True) -> None:
        super().__init__()

        self.attn = nn.MultiheadAttention(embed_dim=input_size,
                                         num_heads=num_heads,
                                         dropout=dropout,
                                         batch_first=batch_first
                                         )
        self.layer_norm = nn.LayerNorm(input_size)
        self.gelu = nn.GELU()
    
    def forward(self, x, attention_mask):
        residual = x
        x = self.gelu(self.layer_norm(x))
        # Self-attention only
        x, _ = self.attn(query=x, key=x, value=x, key_padding_mask=attention_mask)
        # return x
        return x + residual


class SpeechRecgonitionModel(nn.Module):
    def __init__(  # Please modify the `__init__` to make it compatible with your model designs.
            self,
            vocab: Vocabulary,
            in_channels: int,
            out_channels: int,
            kernel_size: Union[int, tuple],
            dropout: float,
            feat_size: int,
            n_cnn_layers: int,
            n_rnn_layers: int,
            rnn_input_size: int,
            rnn_hidden_size: int,
            num_heads: int,
            stride: Union[int, tuple],
    ):
        super(SpeechRecgonitionModel, self).__init__()
        self.vocab = vocab
        self.stride = stride
        self.kernel_size = kernel_size
        self.s1 = stride if isinstance(stride, int) else stride[1]
        # In this model, you basically need three modules:
        # 1. A `FeatureExtractor` that extracts features
        # 2. A sequence model that processes the sequence (the template suggests an
        #    RNN; this implementation uses two self-attention blocks instead)
        # 3. A LM head that generates tokens (like in previous homework)

        self.fe = FeatureExtractor(in_channels=in_channels,out_channels=out_channels,kernel_size=kernel_size,stride=stride,
                                   dropout=dropout,
                                   feat_size=feat_size,
                                   n_cnn_layers=n_cnn_layers)
        
        self.s0 = stride if isinstance(stride, int) else stride[0]
        self.k0 = kernel_size if isinstance(kernel_size, int) else kernel_size[0]
        self.k1 = kernel_size if isinstance(kernel_size, int) else kernel_size[1]
        convd_feat_size = int((feat_size + 2 * (self.k0 // 2) - (self.k0 - 1) - 1) / self.s0 + 1)

        self.fc1 = nn.Linear(in_features=out_channels * convd_feat_size,
                             out_features=rnn_input_size)
        
        self.attn1 = AttentionBlock(input_size=rnn_input_size,
                                   num_heads=num_heads,
                                   batch_first=True,
                                   dropout=dropout
                                   )
        
        self.attn2 = AttentionBlock(input_size=rnn_input_size,
                                   num_heads=num_heads,
                                   batch_first=True,
                                   dropout=dropout
                                   )

        self.head = nn.Sequential(
            nn.Linear(in_features=rnn_input_size,
                      out_features=rnn_hidden_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(rnn_hidden_size, len(vocab))
        )

    def forward(
            self,
            input_values: Optional[torch.Tensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            labels: Optional[torch.Tensor] = None,
            label_mask: Optional[torch.Tensor] = None
    ) -> CausalLMOutput:
        """

        Parameters
        ----------
        input_values : Optional[torch.Tensor]
        attention_mask : Optional[torch.Tensor]
        labels : Optional[torch.Tensor]
        label_mask: Optional[torch.Tensor]

        Returns
        -------

        """        
        # Manually compute the length of the input after convolution
        device = input_values.device
        input_lengths = torch.sum(attention_mask, dim=-1)
        input_lengths = torch.floor((input_lengths + 2 * (self.k1 // 2) - (self.k1 - 1) - 1) / self.s1 + 1)
        input_lengths = input_lengths.type(torch.LongTensor)
        max_input_len = max(input_lengths)
        attention_mask = ~torch.stack(
            [
                SpeechRecognitionDataModule._pad_sequence(torch.ones(l, dtype=torch.bool), max_length=max_input_len, padding_value=False)
                for l in input_lengths
            ],
            dim=0
        )
        attention_mask = attention_mask.to(device)

        x = self.fe(input_values)
        dims = x.size()
        x = x.view(dims[0], -1, dims[-1]).transpose(1, 2)
        x = self.fc1(x)
        x = self.attn1(x, attention_mask=attention_mask)
        x = self.attn2(x, attention_mask=attention_mask)
        logits = self.head(x)

        loss = None
        if labels is not None:
            target_lengths = label_mask.sum(-1)
            flattened_targets = labels.masked_select(label_mask)
            log_probs = nn.functional.log_softmax(logits, dim=-1).transpose(0, 1)

            loss = nn.functional.ctc_loss(
                log_probs=log_probs,
                targets=flattened_targets,
                input_lengths=input_lengths,
                target_lengths=target_lengths,
                blank=self.vocab.pad_id(),
                zero_infinity=True
            )

        return CausalLMOutput(loss=loss,
                              logits=logits)
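
The length bookkeeping in `forward` relies on the standard output-size formula for a convolution with dilation 1, `floor((L + 2p - (k - 1) - 1) / s) + 1`, with padding `p = k // 2` as in the code above. The sketch below evaluates it for the configuration used later in this notebook (kernel `(3, 5)`, stride `(2, 3)`); the function name `conv_out_len` is hypothetical.

```python
def conv_out_len(length: int, kernel: int, stride: int) -> int:
    """Output length of a 1-D convolution with padding = kernel // 2 and dilation 1."""
    padding = kernel // 2
    return (length + 2 * padding - (kernel - 1) - 1) // stride + 1


print(conv_out_len(64, kernel=3, stride=2))   # 32  (feature axis: 64 mel bins)
print(conv_out_len(981, kernel=5, stride=3))  # 327 (time axis: 981 frames)
```

With 32 output channels and 32 features after the convolution, the flattened per-frame vector fed to `fc1` has 32 * 32 = 1024 entries. Note also that CTC needs at least as many output frames as target tokens, so a larger time stride shortens the sequence at the risk of making long transcripts infeasible.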

In [17]:
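def predictions(predicted_ids, pad_id):
    """Greedy CTC collapse: keep the first id of each run, blank (pad) elsewhere."""
    res = torch.ones_like(predicted_ids) * pad_id
    for i, seq in enumerate(predicted_ids):
        prev_token = pad_id
        for j, token_id in enumerate(seq):
            # A token is kept only when it differs from its predecessor; a blank
            # resets `prev_token`, so a repeated character after a blank is kept.
            if token_id != prev_token:
                res[i][j] = token_id
                prev_token = token_id
    return res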
    
def wrapped_generate(model_to_wrap: nn.Module, **kwargs) -> torch.Tensor:
    """Wraps the generate method. This function is what will be actually
    called by the evaluation routine for the leaderboard.

    In this function, you can wrap your `generate()` method to allow
    different generation configurations to be used at the test time.

    Parameters
    ----------
    model_to_wrap : nn.Module
        Your ASR model.
    kwargs
        Argument dict that passes all arguments to the generate function.

    Returns
    -------
        Generated token id sequences in the form of torch.Tensor.
        In the shape of (batch_size, time).
    """
    # Please implement your decoding strategy here.
    # You have to pass inputs to the model and run the model once (in no_grad mode)
    # and decode with your decoding strategy to get the tensor of token ids
    # The simplest one is the greedy decoding - recall that we did it in hw2
    # It is also the one used in the `validation_step`.
    # You might see multiple repeated characters. In this case, you could also implement
    # some heuristics in this function to do post-processing to remove
    # repeated nonsensical characters.

    pad_id = model_to_wrap.vocab.pad_id()
    input_values = kwargs["input_values"]
    attention_mask = kwargs["attention_mask"]
    # Run the model once without tracking gradients, as the note above asks
    with torch.no_grad():
        out = model_to_wrap(
            input_values=input_values,
            attention_mask=attention_mask,
        )
    logits = out.logits
    # Greedy decoding: take the argmax at each frame, then collapse repeats
    pre_ids = torch.argmax(logits, dim=-1)

    return predictions(pre_ids, pad_id)
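
The collapsing rule behind greedy CTC decoding is easiest to see on a plain list of ids: merge runs of identical ids first, then drop the blanks. The sketch below is a list-based illustration of that rule, not the tensor version used in this notebook (which keeps the output at a fixed length and leaves blanks in place); the function name `ctc_greedy_collapse` is hypothetical.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id):
    """Merge runs of identical ids, then remove blanks."""
    return [tok for tok, _ in groupby(ids) if tok != blank_id]


frames = [0, 3, 3, 0, 3, 5, 5, 0, 0]  # per-frame argmax ids, 0 = blank
print(ctc_greedy_collapse(frames, blank_id=0))  # [3, 3, 5]
```

The order of the two steps matters: the blank between the two runs of `3` is what allows a genuinely doubled character (as in "hello") to survive the merge.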

In [18]:
from torchmetrics import CharErrorRate, WordErrorRate
from pytorch_lightning.utilities.types import STEP_OUTPUT

class AutomaticSpeechRecognitionTask(pl.LightningModule):
    def __init__(self,
                 model: nn.Module,
                 vocab: Vocabulary,
                 learning_rate: float = 0.001):
        super(AutomaticSpeechRecognitionTask, self).__init__()
        self.model = model
        self.learning_rate = learning_rate
        self.vocab = vocab

        self.cer = CharErrorRate()
        self.wer = WordErrorRate()

    def training_step(self, batch: ASRBatch) -> STEP_OUTPUT:
        """Defines the training step.

        Parameters
        ----------
        batch : ASRBatch
            The batched training instances.

        Returns
        -------
        loss : torch.Tensor
            The loss computed from the ASR model.
        """
        # Please make necessary modifications to accommodate your design
        outputs = self.model(
            input_values=batch.spectrograms,
            attention_mask=batch.attention_mask,
            labels=batch.labels,
            label_mask=batch.label_mask
        )
        return outputs.loss

    def validation_step(self, batch: ASRBatch, batch_idx: int):
        """Defines the validation step - for this module, we have the same
        training and validation behaviors. Usually, we would compute a metric that is
        used to select the best performing model checkpoint.

        Parameters
        ----------
        batch : ASRBatch
            The batched training instances.
        batch_idx: int
            The index of the batch.
        """
        # You are free to modify this function to ensure they are being called correctly.
        outputs = self.model(
            input_values=batch.spectrograms,
            attention_mask=batch.attention_mask,
        )
        logits = outputs.logits
        # take argmax and decode - Using greedy decoding
        predicted_ids = torch.argmax(logits, dim=-1)
        predicted_ids = predictions(predicted_ids, self.vocab.pad_id())
        decoded_outputs = decode_as_str(self.vocab, predicted_ids)
        curr_cer = self.cer(decoded_outputs, batch.label_strs)
        curr_wer = self.wer(decoded_outputs, batch.label_strs)
        self.log('val_cer', self.cer)
        self.log('val_wer', self.wer)

    def configure_optimizers(self):
        """Configures optimizers for the training."""
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer
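
Both metrics logged in `validation_step` are edit distances normalised by the reference length: CER counts character edits, WER counts word edits. The sketch below computes them for a single reference/hypothesis pair with a plain Levenshtein distance. It is an illustration only; the helper names are hypothetical, and a library metric may aggregate errors over a whole batch rather than averaging per-utterance rates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the number of reference units."""
    return edit_distance(ref_units, hyp_units) / len(ref_units)


ref, hyp = "the cat sat", "the cat sit"
cer = error_rate(list(ref), list(hyp))      # 1 wrong character out of 11
wer = error_rate(ref.split(), hyp.split())  # 1 wrong word out of 3
print(cer, wer)
```

A single wrong character costs a whole word in WER, which is why WER is typically much higher than CER for character-level models like this one.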

In [19]:
# Configure data module
asr_datamodule = SpeechRecognitionDataModule(
    datasets={
        'train': train_dataset,
        'val': dev_dataset
    },
    vocab=vocab,
    batch_size=64
)
# Please modify the model and trainer configurations to accommodate your design
asr_model = SpeechRecgonitionModel(
    vocab=vocab,
    # The following args might not align with your model!
    in_channels=1,
    out_channels=32,
    feat_size=64,
    n_cnn_layers=3,
    kernel_size=(3, 5),
    stride=(2, 3),
    dropout=0.1,
    rnn_input_size=512,
    rnn_hidden_size=512,
    n_rnn_layers=3,
    num_heads=8
)
if is_autograder:  # Make sure that your checkpoint can be loaded correctly by the autograder
    # Please upload your checkpoint to your Google Drive
    # and assign the share link to `CHECKPOINT_TO_DOWNLOAD` below
    CHECKPOINT_TO_DOWNLOAD = 'https://livejohnshopkins-my.sharepoint.com/:u:/g/personal/yluo53_jh_edu/Ece_uV8HuGBDp27yUyRoZYIBE2-1jmzkaW8AmBXC0KVAyQ?e=cGN0nL'
    # Downloads the checkpoint
    from onedrivedownloader import download as onedrive_download
    onedrive_download(CHECKPOINT_TO_DOWNLOAD, filename='checkpoint.ckpt', unzip=False)
    asr_pl_module = AutomaticSpeechRecognitionTask.load_from_checkpoint(
        checkpoint_path='checkpoint.ckpt',
        vocab=vocab,
        model=asr_model
    )
else:
    # In student mode, a new model is trained.
    # You are allowed to change the training hyperparameters,
    # but you are not allowed to create a new task.
    asr_pl_module = AutomaticSpeechRecognitionTask(
        vocab=vocab,
        model=asr_model,
        learning_rate=7e-4
    )
    asr_trainer = pl.Trainer(
        accelerator=accelerator,
        min_epochs=10,
        max_epochs=20,
        default_root_dir="gdrive/MyDrive/HLT/hw4",
        callbacks=[pl.callbacks.EarlyStopping(monitor='val_cer', mode='min')]
    )
    asr_trainer.fit(model=asr_pl_module, datamodule=asr_datamodule)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                   | Params
-------------------------------------------------
0 | model | SpeechRecgonitionModel | 3.0 M 
1 | cer   | CharErrorRate          | 0     
2 | wer   | WordErrorRate          | 0     
-------------------------------------------------
3.0 M     Trainable params
0         Non-trainable params
3.0 M     Total params
12.002    Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.
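
The final log line reports that training ran until `max_epochs=20`, which suggests the `EarlyStopping` callback on `val_cer` never triggered. The sketch below shows the general patience rule such a callback follows: stop once the monitored value has failed to improve for `patience` consecutive checks. It is a simplified illustration, not Lightning's implementation; `first_stop_epoch` and the metric values are hypothetical, and `patience=3` with `min_delta=0.0` is assumed to match the callback's defaults.

```python
from typing import List, Optional


def first_stop_epoch(
    history: List[float], patience: int = 3, min_delta: float = 0.0
) -> Optional[int]:
    """Return the index of the check that triggers a stop, or None."""
    best = float("inf")  # mode='min': lower values are better
    wait = 0             # consecutive checks without improvement
    for epoch, value in enumerate(history):
        if value < best - min_delta:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical val_cer curves, not taken from the run above.
plateau = [0.9, 0.8, 0.7, 0.71, 0.72, 0.73]
improving = [0.9, 0.8, 0.7, 0.6]

print(first_stop_epoch(plateau))    # prints 5
print(first_stop_epoch(improving))  # prints None
```

Note that the trainer's `min_epochs=10` additionally keeps training going for at least ten epochs, whatever the callback decides.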


In [20]:
asr_trainer.logged_metrics

{'val_cer': tensor(0.3788), 'val_wer': tensor(0.8840)}
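
Both logged metrics are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, counted over characters for CER and over words for WER. The sketch below computes both with a plain Levenshtein distance. It is a minimal illustration under that definition, not the `torchmetrics` implementation used by the task; `edit_distance` and `error_rate` are hypothetical helper names.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[str], hyps: List[str],
               tokenize: Callable[[str], Sequence]) -> float:
    """Total edits divided by total reference length."""
    edits = sum(edit_distance(tokenize(r), tokenize(h))
                for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return edits / total


refs = ["the cat"]
hyps = ["the bat"]
cer = error_rate(refs, hyps, tokenize=list)       # characters, spaces included
wer = error_rate(refs, hyps, tokenize=str.split)  # whitespace-separated words
print(round(cer, 3), wer)  # prints 0.143 0.5
```

A single wrong character makes the whole word count as an error, which is one reason a WER can sit far above the CER, as it does in the values logged above.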

In [21]:
if not is_autograder:
    # You can run this cell to test whether your model has been trained
    # You should expect to see a list of decoded text sequences
    train_dataloader = asr_datamodule.train_dataloader()
    test_example = None
    for i, bc in enumerate(train_dataloader):
        if i > 0:
            break
        test_example = bc

    print(decode_as_str(
        asr_pl_module.model.vocab,
        wrapped_generate(
            model_to_wrap=asr_pl_module.model,
            input_values=test_example.spectrograms.to(asr_pl_module.device),
            attention_mask=test_example.attention_mask.to(asr_pl_module.device)
        )
    ))

['jhot e ap tfin ner ser le ho it with he lar bein got at thar cae pr fest she jrahed it u be missistringons sha e e   brd the gais misestrinan odlah felt  e e    e ', 'lhat ast be apy with at qlilyian qurking hat ist ae bev the sind bline ly hat is te sinet it is lor hev hesseus loed sais women a    od un hotpiints rnmine     ', 'ae he e e hadyet was af fin i deo so his nopc ore king for woncs and be gand ae plamhe soihe ration bey foit that her ato be s wem ng mraeses an tre felng hin testa a  e e  e', 'of fbach ad so ortean sestin conse eation to priisi oha no se hat the com pbars ind would laod ing wor fhit giv lye qpon my shattin hnt ha ok e', 'rint ll is he hahet eople drouvn lo the hears oppe hel lifh to ly re h oserv it oihagbe gentof fhank thim mes pe rlng hod he ets tin tos sa thikets tern ton dro ', ' pon thesan pencsipposta prtended by o erble criticsisom ondtit tene of ta latnvire to ra bis hron the tain his nagit ve voce and the leoslature', 'wo cti ma a bot e any com la 

In [22]:
print(test_example.label_strs)

["chop me up fine or serve me whole it was a way of being got at that kate professed she dreaded it would be missus stringham's however she understood because missus stringham oddly felt", 'let us be happy without quibbling and quirking let us obey the sun blindly what is the sun it is love he who says love says woman ah ah behold omnipotence women', 'all the villagers said it was a fine idea so they stopped working for once and began to plan the celebration they thought that there ought to be swimming races and tree felling contests', 'of farce and so forth is a suggestive contribution to criticism i am not sure that the comparison would not have been more effectively put in a chapter than a book', 'growing old is a habit people travel along the years up the hill of life till they reach a certain point where they begin to think they must be growing old think its time to sag think its time to droop', 'upon the same principles they pretended by a verbal criticism on the tense of a latin

In [23]:
grader.check("model-generate-test")