# Assignment 3 [15% of your grade, 70 points in total]

Hi! Welcome to assignment 3. Here, we are going to build a simple automatic speech recognition (ASR) system using the SpeechBrain framework, and check your understanding of some important concepts related to ASR. This assignment constitutes 15% of your final grade.

You are required to:
- Finish this notebook. Successfully run all the code cells and answer all the questions.
- When you need to embed a screenshot in the notebook, put the picture in './resources'.

**Submission**
After finishing, **zip the whole assignment directory (but please exclude the "datasets" directory)**, then submit to Canvas. **Naming: "eXXXXXXX_Name_Assignment3.zip"**.

**Late Policy**
Please submit before **Wednesday, Recess Week, 27 September 2023, 23:59**. For each late day, you will lose 25% of the marks.

**Honor Code**
Note that plagiarism will not be condoned. You may discuss the questions with your classmates or search the internet for references, but you MUST NOT submit code/answers that are copied directly from other sources. If you referred to code or a tutorial somewhere, please explicitly attribute the source somewhere in your code, e.g., in a comment.

**Note** You might need to restart the jupyter kernel to clear the imported py files before running some code cells.

**Useful Resources**
- (Paper) [Recent Advances in End-to-End Automatic Speech Recognition](https://arxiv.org/abs/2111.01690)
- (Code) [SpeechBrain ASR from Scratch](https://colab.research.google.com/drive/1aFgzrUv3udM_gNJNUoLaHIm78QHtxdIz?usp=sharing#scrollTo=IVCCe6cXPzJ0)
- (Video) [End-to-End Models for Speech Processing](https://www.youtube.com/watch?v=3MjIkWxXigM)

## Getting Started

We will continue using the same conda environment as in Assignment 2, but some additional packages are needed.
1. Enter the conda environment by:

        conda activate 4347
2. Install packages

        # Install SpeechBrain and other libraries
        pip install -r requirement.txt

        # Install CMU Dictionary (run the lines below inside the Python interpreter)
        python
        import nltk
        nltk.download('cmudict')
        exit()

3. When you run this notebook in your IDE, switch the interpreter to the 4347 conda environment.
4. You may be prompted to install the jupyter package. Click "confirm" in this case.

## Section 1 - Automatic Speech Recognition (ASR) [28 mark(s)]
An automatic speech recognition (ASR) system recognizes spoken words from audio. If we build it using singing data, it becomes a lyric transcription system. As you have learned in the lecture, in recent decades, the performance of ASR systems has advanced significantly thanks to end-to-end (E2E) ASR models and large-scale open-source datasets.

We are not going to build a well-performing E2E ASR system in this assignment because it is too demanding in both computational resources and the scale of data. Instead, we will
- Use phonemes as the recognition unit. In English, phonemes have a tighter relationship with pronunciation than graphemes do, hence they are less data-demanding.
- Use a simple model with a toy dataset.
- Train the model from scratch.
- Decode the output without a language model.

This is just for simplicity, to give you the general idea of an ASR system and the SpeechBrain framework; it is not what we do to solve real-world problems. Current state-of-the-art ASR systems tend to
- Use graphemes (e.g., characters, words, sub-words) as the recognition unit. This makes the recognition workflow simpler.
- Use huge models with huge datasets.
- Adopt transfer learning: systems are first trained on large-scale corpora from various domains, or even on unlabeled data (audio-only, no text annotation), and then fine-tuned with some domain-specific labeled data.
- Use language models in the decoding process, which makes the output more fluent.

Since we will be using phonemes as the targets, our goal is to recognize a sequence of spoken phonemes from audio. But many speech datasets do not provide phoneme annotations (as is the case in this assignment), so we need to obtain the phoneme sequences from the sentences ourselves.

### Task 1: Prepare phoneme annotation  [4 mark(s)]
1. Please finish the code of the PhonemeUtil class in utils.py, so that you can pass the tests below. Please use the CMU Dictionary in nltk to obtain the pronunciation. Use the first pronunciation if multiple ones exist for a word. If a word is not in the dictionary, mark its phoneme as "\<UNK\>".  **[2 mark(s)]**


In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
from utils import *
phoneme_util = PhonemeUtil()
sentences = [
    "This is a test asdfdsaf",
    "For you phoneme tool",
    "thhat ensure you can get",
    "Correct labels",
]
out = [phoneme_util.word_to_phoneme_sequence(s) for s in sentences]
ans = [['DH', 'IH', 'S', 'IH', 'Z', 'AH', 'T', 'EH', 'S', 'T', '<UNK>'], ['F', 'AO', 'R', 'Y', 'UW', 'F', 'OW', 'N', 'IY', 'M', 'T', 'UW', 'L'], ['<UNK>', 'EH', 'N', 'SH', 'UH', 'R', 'Y', 'UW', 'K', 'AE', 'N', 'G', 'EH', 'T'], ['K', 'ER', 'EH', 'K', 'T', 'L', 'EY', 'B', 'AH', 'L', 'Z']]
for i,j in zip(out, ans):
    assert i == j
print('Congratulations!')

Congratulations!
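
The lookup rule in step 1 (first pronunciation, `<UNK>` for missing words) can be sketched as follows. This is a minimal sketch, not the reference `PhonemeUtil`: the function name `words_to_phonemes` and the tiny `toy_dict` are hypothetical stand-ins for the CMU dictionary, which nltk exposes through `nltk.corpus.cmudict.dict()` as a mapping from a lowercase word to a list of pronunciations with stress digits (e.g. `IH1`). Stripping the stress digits is an assumption inferred from the expected outputs in the test above.

```python
import re

def words_to_phonemes(sentence, pron_dict):
    """Map a sentence to a flat phoneme list using the first pronunciation per word."""
    phonemes = []
    for word in sentence.lower().split():
        prons = pron_dict.get(word)
        if not prons:
            # Word not found in the dictionary
            phonemes.append("<UNK>")
            continue
        # Take the first pronunciation and drop stress digits (IH1 -> IH)
        phonemes.extend(re.sub(r"\d", "", p) for p in prons[0])
    return phonemes

# Hypothetical two-word dictionary in the same shape as cmudict.dict()
toy_dict = {
    "this": [["DH", "IH1", "S"], ["DH", "IH0", "S"]],
    "is": [["IH1", "Z"], ["IH0", "Z"]],
}
print(words_to_phonemes("This is asdfdsaf", toy_dict))
```

With the real dictionary, `pron_dict` would be loaded once and reused for every sentence, since building it is comparatively slow.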


2. Run the code below to obtain phoneme annotations for the tiny LibriSpeech dataset. After this, the phoneme annotations will be stored in the 'phn' property of the annotation files for each audio.  **[2 mark(s)]**

In [4]:
phoneme_util = PhonemeUtil()
dataset_dir = './datasets/tiny_librispeech'
annot_dir_complete = jpath(dataset_dir, 'annotation_word')
annot_dir_word = jpath(dataset_dir, 'annotation')
if not os.path.exists(annot_dir_word):
    os.mkdir(annot_dir_word)
splits = ['train', 'valid', 'test']
for split in splits:
    annot_fp_old = jpath(annot_dir_complete, split+'.json')
    annot_fp_new = jpath(annot_dir_word, split+'.json')
    data = read_json(annot_fp_old)
    for id in data:
        entry = data[id]
        sentence = entry['words']
        phonemes = phoneme_util.word_to_phoneme_sequence(sentence)
        data[id]['phn'] = ' '.join(phonemes)
    save_json(data, annot_fp_new)
data = read_json(jpath(dataset_dir, 'annotation', 'test.json'))

t = 'R AA B AH N <UNK> S AO DH AE T HH IH Z D AW T S AH V W AA R AH N T AH N HH AE D B IH N AH N F EH R AH N D HH IY B IH K EY M AH SH EY M D AH V HH IH M S EH L F F AO R HH AA R B ER IH NG DH EH M'
assert data['61-70970-0036']['phn'] == t
print('Congrats!')

Congrats!


### Task 2: Prepare tokenizer [3 mark(s)]
In both training and inference, a tokenizer helps to convert labels (in our case, phoneme annotations) from text to integers so that the model can handle them easily.

1. Please finish the code of the PhonemeTokenizer class in utils.py so that it can pass the cell below. **[3 mark(s)]**

In [7]:
from utils import PhonemeTokenizer
tokenizer = PhonemeTokenizer()
assert len(tokenizer.vocab) == 41
assert tokenizer.token_to_id['<UNK>'] == 40
assert tokenizer.id_to_token[0] == '<blank>'

phn_seqs = [
    ['CH', 'AO', 'B', 'T', 'S', 'OY'],
    ['B', 'AE', 'AA', 'AH', 'ER', 'TH'],
    ['<UNK>', 'D', 'B', '<UNK>', 'HH', 'TH']
]
ans = [
    [8, 4, 7, 31, 29, 26],
    [7, 2, 1, 3, 12, 32],
    [40, 9, 7, 40, 16, 32],
]

assert tokenizer.encode_seq(phn_seqs[0]) == ans[0]
assert tokenizer.encode_seq(phn_seqs[1]) == ans[1]
assert tokenizer.encode_seq(phn_seqs[2]) == ans[2]
assert tokenizer.decode_seq(ans[0]) == phn_seqs[0]
assert tokenizer.decode_seq(ans[1]) == phn_seqs[1]
assert tokenizer.decode_seq(ans[2]) == phn_seqs[2]
assert tokenizer.decode_seq_batch(ans) == phn_seqs

print('Congrats!')

Congrats!
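
One way to get the id layout that the assertions above expect is sketched here: id 0 for the CTC blank, ids 1 to 39 for the ARPAbet phonemes in alphabetical order, and id 40 for `<UNK>`. The class name `SketchPhonemeTokenizer` is hypothetical and the layout is inferred from the asserted values, so treat it as an illustration rather than the reference `PhonemeTokenizer` in utils.py.

```python
PHONEMES = (
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()

class SketchPhonemeTokenizer:
    def __init__(self):
        # Blank first, phonemes in alphabetical order, unknown token last
        self.vocab = ["<blank>"] + sorted(PHONEMES) + ["<UNK>"]
        self.token_to_id = {tok: i for i, tok in enumerate(self.vocab)}
        self.id_to_token = dict(enumerate(self.vocab))

    def encode_seq(self, seq):
        # Unknown tokens fall back to the <UNK> id
        unk = self.token_to_id["<UNK>"]
        return [self.token_to_id.get(tok, unk) for tok in seq]

    def decode_seq(self, ids):
        return [self.id_to_token[i] for i in ids]

    def decode_seq_batch(self, batch):
        return [self.decode_seq(ids) for ids in batch]

sketch_tokenizer = SketchPhonemeTokenizer()
print(sketch_tokenizer.encode_seq(["CH", "AO", "B", "T", "S", "OY"]))
```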


### Task 3: ASR Baseline [8 mark(s)]

We are now ready to build the first ASR system. Please finish the tasks below:

1. The current code uses the validation set as the testing set, while the code for preparing the test data is missing. Please complete it. **[1 mark(s)]**
2. Please use the Checkpointer class of SpeechBrain to save the model with the lowest Phoneme Error Rate (PER) during training. Save the checkpoint under the directory "results/baseline/best_ckpt". **[1 mark(s)]**
3. Load the best model (lowest PER) for evaluation, instead of using the model from the last epoch. **[1 mark(s)]**
4. Please use speechbrain.utils.metric_stats.ErrorRateStats.write_stats to save the output of your model on the whole test set, so that you can better understand your model's performance. In the output file, please use phoneme tokens instead of token ids (numbers). Save the file to "results/baseline/results.txt". **[1 mark(s)]**
5. Please log your training, validation, and evaluation statistics to the result folder, in whatever way you like. **[1 mark(s)]**

Run the training and testing by

    python train.py hparam_baseline.yaml
Expected PER: 90%.

**NOTE**: Please keep the (1) training log, (2) model checkpoint and the (3) corresponding result files, when submitting your assignment. **[3 mark(s)]**

In [1]:
!python train_baseline.py hparam_baseline.yaml

python: can't open file '/Users/niharika/Documents/Study_Material/Sem_3/Sound and Music/Assignments/Assignment 3/train_.py': [Errno 2] No such file or directory


### Task 4: Modifying the Model [13 mark(s)]

You may have spotted some issues during training, such as slow convergence and overfitting. Please make the following changes to your model by modifying the yaml file.
1. (Please create a new .yaml file from the hparam_baseline.yaml, naming it hparam_modified.yaml) **[1 mark(s)]**
2. Increase the N_epoch to 20. **[1 mark(s)]**
3. Increase the learning rate to 5e-3 **[1 mark(s)]**
4. Add weight decay = 0.1 to the optimizer **[1 mark(s)]**
5. Add a variable named "drop_p", with value 0.2. **[1 mark(s)]**
6. Add 3 dropout layers to the model, after act1, act2, and RNN. All with the same dropout rate of "drop_p" (you need to use a variable reference here). **[1 mark(s)]**
7. Change the output_dir from "results/baseline" to "results/drop0.2x2_lr0.005_wd0.1". **[1 mark(s)]**

There are some other changes you need to make in the train.py file:
1. Use speechbrain.nnet.schedulers.NewBobScheduler to schedule the learning rate according to the loss on the validation set. If the validation loss does not decrease after an epoch of training, use the scheduler to adjust the learning rate. **[2 mark(s)]**
2. Before the training of each epoch, print out and log the current learning rate. **[1 mark(s)]**
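
The scheduling rule in step 1 can be illustrated without SpeechBrain. The class below is a simplified, hypothetical stand-in (`SimpleNewBob` is not a SpeechBrain class): after each epoch it compares the new validation loss with the previous one and multiplies the learning rate by an annealing factor when the relative improvement falls below a threshold. The values 0.5 and 0.0025 and the validation losses are assumptions chosen for illustration; check the SpeechBrain documentation for the actual `NewBobScheduler` parameters and behaviour.

```python
class SimpleNewBob:
    """Simplified 'new bob' style learning-rate annealing (illustrative only)."""

    def __init__(self, initial_lr, annealing_factor=0.5, improvement_threshold=0.0025):
        self.lr = initial_lr
        self.annealing_factor = annealing_factor
        self.improvement_threshold = improvement_threshold
        self.prev_loss = None

    def step(self, valid_loss):
        """Update the learning rate from the latest validation loss."""
        old_lr = self.lr
        if self.prev_loss is not None and self.prev_loss > 0:
            # Relative improvement compared with the previous epoch
            improvement = (self.prev_loss - valid_loss) / self.prev_loss
            if improvement < self.improvement_threshold:
                self.lr *= self.annealing_factor
        self.prev_loss = valid_loss
        return old_lr, self.lr

scheduler = SimpleNewBob(initial_lr=5e-3)
# Made-up validation losses; the third epoch gets worse, so the rate is halved
for epoch, loss in enumerate([3.4, 3.2, 3.3, 2.9]):
    old_lr, new_lr = scheduler.step(loss)
    print(f"Epoch: {epoch}, valid loss: {loss}, LR: {old_lr} -> {new_lr}")
```

Printing the learning rate at the start of each epoch, as step 2 asks, then amounts to logging the value that the scheduler currently holds.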

Run the training and testing by

    python train.py hparam_modified.yaml
Expected PER: 65%.

**NOTE**: Please keep the (1) training log, (2) model checkpoint and the (3) corresponding result files, when submitting your assignment. **[3 mark(s)]**

In [16]:
!python train.py hparam_modified.yaml

[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
Epoch: 0, Train LR: 0.005
100%|██████████████████████████| 80/80 [00:31<00:00,  2.53it/s, train_loss=3.69]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 10.69it/s]
Epoch 0 complete
Train loss: 3.69
Stage.VALID loss: 3.43
Stage.VALID PER: 100.00
Epoch: 1, Train LR: 0.005
100%|██████████████████████████| 80/80 [00:28<00:00,  2.84it/s, train_loss=3.39]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 13.56it/s]
Epoch 1 complete
Train loss: 3.39
Stage.VALID loss: 3.27
Stage.VALID PER: 99.91
Epoch: 2, Train LR: 0.005
100%|██████████████████████████| 80/80 [00:28<00:00,  2.77it/s, train_loss=3.07]
100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 13.29it/s]
Epoch 2 complete
Train loss: 3.07
Stage.VALID loss: 2.86
Stage.VALID PER: 90.28
Epoch: 3, Train LR: 0.005
100%|██████████████████████████| 80/80 [00:30<00:00,  2.60it/s, train_loss=2.55]
100%|██████

## Section 2 - Questions [42 marks]

### - Result Analysis [2 mark(s)]
1. How does your system perform? Briefly introduce your system's performance with objective metric scores and the result file for the test set. **[2 mark(s)]**

(Your Answer)

System's performance:
1. Validation PER: 56.713, Validation Loss: 2.152
2. Test PER: 59.756, Test Loss: 2.224

Results on test set: 
Average Word Error Rate (WER) = 57.53%

Analysis:
Let's take an example of a sample from the test data.

```
Reference sentence = "ROBIN CAREFULLY DESCENDED THE LADDER AND FOUND HIMSELF SOON UPON FIRM ROCKY GROUND",

Alignment (1st sentence is ground truth, 3rd sentence is prediction, 2nd sentence shows min-edit operations on phonemes) = 
61-70970-0027, %WER 64.41 [ 38 / 59, 0 ins, 21 del, 17 sub ]
R ;   AA  ;   B   ; AH ; N ; K  ;   EH  ;   R   ; F ;   AH  ; L ; IY ; D ; IH ; S ; EH ; N ;   D   ;   AH  ;   D   ;   DH  ;   AH  ; L ;   AE  ; D ; ER ; AH ; N ;   D   ; F ; AW ; N ; D  ; HH ;   IH  ;   M   ; S ; EH ; L ;   F   ; S ; UW ; N ; AH ; P ; AA ; N ; F ; ER ; M ;   R   ;   AA  ;   K   ; IY ; G ;   R   ;   AW  ; N ;   D  
S ;   D   ;   D   ; =  ; S ; S  ;   D   ;   D   ; = ;   D   ; = ; =  ; = ; S  ; = ; =  ; S ;   D   ;   D   ;   D   ;   D   ;   D   ; = ;   D   ; = ; S  ; S  ; = ;   D   ; = ; S  ; S ; S  ; S  ;   D   ;   D   ; = ; S  ; = ;   D   ; = ; S  ; S ; =  ; = ; S  ; = ; = ; S  ; S ;   D   ;   D   ;   D   ; =  ; = ;   D   ;   D   ; = ;   D  
W ; <eps> ; <eps> ; AH ; T ; ER ; <eps> ; <eps> ; F ; <eps> ; L ; IY ; D ; AH ; S ; EH ; T ; <eps> ; <eps> ; <eps> ; <eps> ; <eps> ; L ; <eps> ; D ; AY ; IH ; N ; <eps> ; F ; AA ; T ; AH ; N  ; <eps> ; <eps> ; S ; OW ; L ; <eps> ; S ; IY ; D ; AH ; P ; AO ; N ; F ; AH ; P ; <eps> ; <eps> ; <eps> ; IY ; G ; <eps> ; <eps> ; N ; <eps>
```

- This alignment shows that the sentence has an error rate of 64.41% (reported as %WER, but computed over phonemes): 38 edits over 59 reference phonemes, i.e., 21 deletions and 17 substitutions are needed to align the prediction with the ground truth.

- Looking at the prediction, a large number of positions are `<eps>`, the alignment placeholder for a reference phoneme that has no predicted counterpart (a deletion). This means the system outputs far fewer phonemes than the reference contains and misses large parts of most words.

Overall, the performance of the system is not very good, since a lot of edit operations (more than 50% of the reference length) are required to align the predictions with the ground truth. This means that our system is predicting highly erroneous annotations.

### - Tokenization [8 mark(s)]
1. Do you think detecting phoneme sequence from speech recording is more difficult than detecting character or word sequence? Why? **[2 mark(s)]**
(Your Answer)
```
I feel that detecting phoneme sequences from a speech recording can be more challenging than word sequences but less challenging than character sequences.

1. Characters are the most granular part of a word. Therefore, detecting them from a speech recording can be exceptionally challenging since it highly depends on every character being enunciated properly. For example,
    - if someone with an accent pronounces the word "water" as "watahh" vs "waterr", then character-level detection is difficult.
    - Multiple words have silent characters in them, such as, "jalapeno" which is pronounced as "halepeno". In this case, the system will fail to detect the character properly.

2. On the other hand, 
    - it is possible that the ground truth phonemes for the same word are pronounced differently by speakers with varied accents, or by the same speaker in a different context. For example,
        - "mate" being pronounced as "ma-ai-tt" in an Irish accent vs "m-ae-t" in an Indian accent.
    - It is also possible that the speech recording is either too noisy or too fast, and a lot of the phonemes go completely undetected.

3. However, detecting a sequence of words from a speech recording can be easier compared to the above 2 processes, since no matter how the pronunciations of the granular phonemes are, it will finally coalesce into a single meaningful word at the end. (This still might fail if the system is not robust to highly obscure accents). It is also relatively easy to detect the start and stop of a word, due to long pauses or silences (in a normal recording).
```
</br>

2. For the task of speech recognition, what are the drawbacks of using phoneme as the detecting unit? **[2 mark(s)]**
(Your Answer)
```
1. Modelling phonemes distribution:
    - Modelling phonemes to recognize all of their contextual variations is a complex problem compared to directly recognizing words. 
    - Some languages have a large number of phonemes, wherein managing a large inventory of phonemes requires substantial computational resources and data. 
    - Annotating a substantial amount of phonetic data for training can be resource-intensive.

2. Different speakers may pronounce phonemes differently due to accents. This variability of diverse speakers can affect system performance if the model is not robust.
```
</br>

3. What is the advantage of sub-word tokenizer compared to word-level tokenizer? **[2 mark(s)]**
(Your Answer)
```
1. Multiple sub-words come together to form a new word. This means that there are fewer distinct sub-words than words. Therefore, a sub-word tokenizer is advantageous since its vocabulary is much smaller than that of a word-level tokenizer, while any word can still be represented as a sequence of sub-words.

2. Sub-word tokenization can break down words into smaller units, which helps in dealing with OOV words, i.e., previously unseen words can be represented by a combination of known sub-words, allowing the model to generalize better.
```
</br>

4. If we are changing our tokenizer to the type of grapheme, which level do you think is the best, among {character, word, sub-word}? Please state your reason. **[2 mark(s)]**
(Your Answer)
```
I think that a sub-word level tokenizer is the best amongst the three. 

1. It has a smaller vocabulary than a word-level tokenizer, as described above.
2. It generalizes better to out-of-vocabulary words by breaking them down into smaller sub-units.
3. For the context of speech recognition, it is easier to detect sub-words rather than characters due to the intuition of how natural language is spoken (more emphasis given to sub-words/phonemes than individual characters).
```

### - Modeling [7 mark(s)]
Connectionist Temporal Classification (CTC) is a type of loss function that is commonly used in ASR, especially when we do not know the precise alignment between the annotation and the audio.

1. Explain how CTC deals with the misalignment issue between audio and annotation, i.e., the number of frames in the audio is much higher than the number of phonemes/characters/sub-words/words in the annotation, and we do not know their correspondence. **[1 mark(s)]**
2. Why does CTC need an additional blank token in the prediction? **[1 mark(s)]**
3. Here are several decoded outputs from a CTC model. Write out their final recognition results. ("-" is the CTC blank token, and "_" represents a space) **[2 mark(s)]**

    (1) heeel-ll-l_lllooo--wooooorld

    (2) hhhhee-llow_wo--rr-rllll--dd
    
4. Recall the formula of CTC loss:
   $$L_{CTC} = -\log\left(\sum_{\pi \in B^{-1}(W)} \prod_{t=1}^{T} p(\pi_t|\mathbf{x}_t)\right)$$
   Does this summation mean that we have to list out all possible alignments between frames and texts, compute the probability for each alignment, and add them together? Is there a more efficient way to compute the CTC loss? If you think so, please briefly explain a more efficient algorithm. **[3 mark(s)]**

(Your Answer)

1. 

        CTC Loss: The purpose of CTC loss is to align transcripts with the audio features.
        
        Transcript: "of"

        CTC will try to align the transcript by repeating the characters. CTC also introduces a blank token $\epsilon$

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
|----|----|----|----|----|
| $\epsilon$ | $\epsilon$ | o | o | f |
| ... | ... | ... | ... | ... |
| $\epsilon$ | o | o | f | f |
| $\epsilon$ | o | o | o | f |


        After all the possible alignments are enumerated, CTC computes the probability of each alignment given the audio features and sums them to obtain $P(W|X)$.
        The negative log of this sum is the CTC loss.

        So, if the model assigns high probability to alignments of the correct transcript, $P(W|X)$ is higher and the loss is lower.


2. 

        Once CTC has found the word alignment, it will collapse consecutive characters that are the same.

        o o o f f → of. This was a simple case and naively combining characters worked.

        If the word contains repeated letters, this naive method will give erroneous outputs.

        h h e e e l l l o o o o → helo. This is incorrect, we wanted it to be "hello".

        To address this issue CTC introduces a blank token. A blank between the two l's (e.g., h e l - l o) keeps them from being merged, and the blanks are removed after the repeats are collapsed.

3. 
Applying the CTC collapse rule (merge consecutive repeated characters, then remove the blanks),

        a. heeel-ll-l_lllooo--wooooorld -> helll_loworld
        b. hhhhee-llow_wo--rr-rllll--dd -> helow_worrld
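
The collapse step used for these two outputs can be written as a short function. This is a minimal sketch: the name `ctc_collapse` is hypothetical, and it assumes `-` is the blank symbol, as stated in the question.

```python
from itertools import groupby

def ctc_collapse(path, blank="-"):
    """Merge consecutive repeats, then drop blank tokens."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)

for raw in ["heeel-ll-l_lllooo--wooooorld", "hhhhee-llow_wo--rr-rllll--dd"]:
    print(raw, "->", ctc_collapse(raw))
```

The order of the two operations matters: removing blanks first would merge letters that the blank was meant to keep apart.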

4. 

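        The formula does sum over every alignment that collapses to W, but listing them out is very inefficient because their number grows exponentially with the number of frames. The sum can instead be computed exactly with dynamic programming (the CTC forward algorithm): keep, for every frame t and every position s in the blank-extended label sequence, the total probability of all partial alignments that end at s, and update these values frame by frame. This costs O(T * |W|) instead of enumerating all paths. Beam search, outlined below, is a decoding heuristic used at inference time to find a likely output sequence; it keeps only the top beam_size candidates and does not compute the loss.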

```
Beam Search Algorithm:
        
Initialization:

1. Start with an empty list of candidate sequences.
2. Initialize a single candidate sequence with a special "start" token (e.g., <start>).

Generating Candidate Sequences:

1. At each step, expand the top-k candidate sequences from the previous step. k is a hyperparameter = beam_size.

2. For each candidate sequence, generate a set of next-token candidates using the probabilistic model.

3. Compute the probability score for each next-token candidate based on the model's predictions.

4. Combine the next-token candidates with their parent sequences to create new candidate sequences.

5. Keep only the top-k candidate sequences based on their cumulative probability scores.

Checking for Termination:

1. Check if any of the candidate sequences end with a special "end" token (e.g., <end>). If so, mark them as completed sequences.
2. Continue expanding candidates until a maximum length is reached or until a certain number of completed sequences are obtained.

Termination and Output:

1. Sort the completed sequences based on their cumulative probability scores.
2. Return the top-ranked completed sequence as the final output.
```
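
For the loss itself, the sum over alignments in $L_{CTC}$ can be computed by dynamic programming instead of enumeration. The sketch below implements the CTC forward recursion in plain Python and checks it against brute-force enumeration on a tiny example. The function names and the per-frame probabilities are made up for illustration, and a practical implementation would work in log space to avoid underflow.

```python
from itertools import groupby, product

def ctc_forward_prob(probs, target, blank=0):
    """Total probability of all alignments that collapse to `target`.

    probs[t][k] is the model probability of symbol k at frame t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    # alpha[s]: probability of all partial alignments ending at ext[s]
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][blank]
    if len(ext) > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        new_alpha = [0.0] * len(ext)
        for s, label in enumerate(ext):
            total = alpha[s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]         # advance by one position
            if s >= 2 and label != blank and label != ext[s - 2]:
                total += alpha[s - 2]         # skip the blank between different labels
            new_alpha[s] = total * frame[label]
        alpha = new_alpha
    # Valid alignments end on the last label or on the trailing blank
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)

def brute_force_prob(probs, target, blank=0):
    """Reference computation: enumerate every path (exponential in T)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        collapsed = [k for k, _ in groupby(path) if k != blank]
        if collapsed == list(target):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total

# Made-up per-frame distributions over {0: blank, 1, 2} for four frames
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]
fast = ctc_forward_prob(probs, [1, 2])
slow = brute_force_prob(probs, [1, 2])
print(fast, slow)
```

The CTC loss for this example is the negative log of the returned probability.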

References: 
1. https://medium.com/@kushagrabh13/modeling-sequences-with-ctc-part-2-14ab45ef896e



### - Language Model [7 mark(s)]
1. Consider the two sequences below:
    - A: I like Singapore's weather.
    - B: I Singapore like ? weathers.

    For a well-trained language model, which sentence will have lower perplexity from this model? Why? **[1 mark(s)]**

```
- Perplexity measures how well a language model can predict the next word in a sequence based on the preceding context. Since a well-trained language model must have been trained on vast amounts of text data that contain well-formed sentences, they are more likely to assign lower perplexity to grammatical and fluent sentences.

- Therefore, sentence A will have lower perplexity from this model.

- Whereas, sentence B contains non-standard language elements like the question mark in the middle of the sentence and the unusual word "weathers". These elements increase the perplexity of the sentence.
```
</br>

2. Given the corpus below:

            I love to play football
            He loves to watch football
            I love to watch movies
            She loves to play tennis
    (1) Assuming we are using a word-level tokenizer. Calculate the below bigram probability by #this bigram/#all bigram: **[3 mark(s)]**
    [Not done since it's not graded].
    
    a. P(love | I)

    b. P(to | love)

    c. P(football | play)
    
    d. P(movies | watch)
   </br>
   
    (2) Use the probability you obtained above, calculate the probability of below sentences **[2 mark(s)]**
    a. I love to watch football
    b. She loves to play football
   </br>
   
    (3) Why is it not a good idea to use a large n value for n-gram language models? **[1 mark(s)]**
    ```
    Larger values of n mean longer sequences of words. These longer sequences are less likely to appear frequently in the training data (mostly only once or not at all), so the counts become extremely sparse. Therefore, without proper smoothing, it is difficult to estimate accurate probabilities for them. This results in overfitting to the training data and poor model generalization.
    ```
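
The bigram quantities in part (1) can be computed mechanically from counts. The sketch below assumes the usual maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(bigrams starting with w1), a whitespace word-level tokenizer, and no sentence-boundary tokens. If the intended definition is the count of the bigram over the count of all bigrams, the denominator would be the total number of bigrams instead. The helper name `bigram_prob` is hypothetical.

```python
from collections import Counter

corpus = [
    "I love to play football",
    "He loves to watch football",
    "I love to watch movies",
    "She loves to play tennis",
]

# Count every adjacent word pair within each sentence
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))

# Number of bigrams that start with each word
starts = Counter()
for (first, _), n in bigrams.items():
    starts[first] += n

def bigram_prob(word, given):
    """MLE estimate of P(word | given) from the toy corpus."""
    return bigrams[(given, word)] / starts[given]

for word, given in [("love", "I"), ("to", "love"), ("football", "play"), ("movies", "watch")]:
    print(f"P({word} | {given}) = {bigram_prob(word, given)}")
```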


### - Beam Search [4 mark(s)]
Assume we have a simplified language model that can predict the probability of the next word. We have already generated the beginning of the sentence, "I want to", and now use beam search to predict the rest. Let the letter "G" denote the generated part, and use a beam size of 2 for this question.

        Probability calculated by language model:
        p(eat | G): 0.4
        p(play | G): 0.3
        p(go | G): 0.2
        p(watch | G): 0.1
        p(a sandwich | G eat): 0.5
        p(dinner | G eat): 0.4
        p(an apple | G eat): 0.1
        p(football | G play): 0.6
        p(games | G play): 0.4
1. Let's continue the generation from G="I want to". After the first step of beam search, what tokens will be selected, and what are the resulting candidate sequences? **[1 mark(s)]**

```
Tokens selected: ["eat", "play"]
Candidate sequences: [
        "I want to eat" (probability 0.4),
        "I want to play" (probability 0.3)
]
```

2. In the 2nd step of beam search, what are the two beams starting with "G eat"? What are their respective probabilities? **[1 mark(s)]**

```
"I want to eat a sandwich" : 0.5*0.4 = 0.20
"I want to eat dinner" : 0.4*0.4 = 0.16
```

3. In the 2nd step of beam search, what are the two beams starting with "G play"? What are their respective probabilities? **[1 mark(s)]**

```
"I want to play football" : 0.6*0.3 = 0.18
"I want to play games" : 0.4*0.3 = 0.12
```

4. What are the resulting candidate sequences from the 2nd step of beam search? **[1 mark(s)]**

```
Candidate sequences: [
        "I want to eat a sandwich",
        "I want to play football"
]
```
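
The two beam search steps in this question can be traced with a few lines of code. This is a minimal sketch with a beam size of 2: the function name `beam_search_step` and the choice to key the probability table by the generated prefix are assumptions made for illustration, while the probabilities are the ones given in the question.

```python
def beam_search_step(beams, next_probs, beam_size):
    """Expand every beam by one token and keep the `beam_size` best."""
    candidates = []
    for prefix, score in beams:
        for token, p in next_probs.get(prefix, {}).items():
            # Score of a sequence is the product of its token probabilities
            candidates.append((prefix + (token,), score * p))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

# Probability table from the question, keyed by the generated prefix
next_probs = {
    ("G",): {"eat": 0.4, "play": 0.3, "go": 0.2, "watch": 0.1},
    ("G", "eat"): {"a sandwich": 0.5, "dinner": 0.4, "an apple": 0.1},
    ("G", "play"): {"football": 0.6, "games": 0.4},
}

beams = [(("G",), 1.0)]
for step in (1, 2):
    beams = beam_search_step(beams, next_probs, beam_size=2)
    print(f"step {step}:", [(" ".join(p), round(s, 2)) for p, s in beams])
```

Note that "G eat dinner" (0.16) is dropped in the second step even though it extends the best first-step beam, because pruning compares all expanded candidates together.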

### - Word Error Rate [3 mark(s)]

Consider an automatic speech recognition system that transcribes a spoken segment into text. We compare the transcription of the system with a human-annotated reference transcript to calculate the system's Word Error Rate.

Reference Transcript:
"I am excited to learn about speech recognition."

System's Transcription (Hypothesis):
"I am excited learn about speech recognise."

1. Calculate the number of insertions, deletions, and substitutions. **[1 mark(s)]**
2. Compute the Word Error Rate (WER) using the formula: **[1 mark(s)]**
$$WER=\frac{\text{Insertions}+\text{Deletions}+\text{Substitutions}}{\text{Number of words in Reference}}$$
3. Why might WER be a more reasonable metric for ASR compared to a simple accuracy rate (correct words divided by total words)? **[1 mark(s)]**

(Your Answer)

1. 

<center>

|I| am| excited |to |learn |about |speech |recognition|
|---|---|---|---|---|---|---|---|
|=|=|=|D|=|=|=|S|
|I| am |excited | |learn| about| speech| recognise|

</center>

        Insertions: 0
        Deletions: 1 ("to" is missing from the hypothesis)
        Substitutions: 1 ("recognition" -> "recognise")

2. 

        WER = 2 / 8 = 0.25
3. 

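        Simple accuracy only counts correct words, so it cannot penalize inserted words and it ignores word order. WER is based on a minimum-edit alignment between hypothesis and reference, so it accounts for insertions, deletions, and substitutions, and reflects the ordering and alignment of the phonemes/words/sub-words that we care about in an ASR system.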
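
The edit counts and the WER for this example can be computed with a minimum-edit alignment. The function below is a minimal sketch: the name `edit_counts` is hypothetical, and punctuation is assumed to be stripped before the comparison.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total edits, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, k)            # match, no edit
            else:
                diagonal = (t + 1, s + 1, d, k)    # substitution
            t, s, d, k = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = cost[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins

ref = "I am excited to learn about speech recognition".split()
hyp = "I am excited learn about speech recognise".split()
subs, dels, ins = edit_counts(ref, hyp)
wer = (subs + dels + ins) / len(ref)
print(f"subs={subs}, dels={dels}, ins={ins}, WER={wer}")
```

The same function works on phoneme lists, in which case the ratio is the phoneme error rate.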

### - Possible Improvement [3 mark(s)]
1. The performance of the recognition system in Section 1 might still have room to improve. What are possible reasons for the not-so-good performance, and directions of improvement? Please list 3 pairs of them. **[3 mark(s)]**

```
1. Hyperparameter Tuning: By analysing the validation and test logs, we can see that the final validation and test PER scores are similar and the model hasn't overfit yet. This means that we can perform some hyperparameter tuning to get the model to converge, e.g., increase the number of epochs.

2. Add regularization: With longer training on such a small dataset, the model may start to overfit (training loss keeps falling while validation PER stops improving). We can then add further regularization to our model, in the form of more dropout layers, a higher dropout rate, or a regularization term in the model objective.

3. More training data: The toy dataset is small, so the model sees few examples of each phoneme in context. Increasing the training data reduces the variance of the model and lets it learn more complex patterns and phoneme relationships in the annotations.

All these methods in combination can help increase system performance.
```

### - Speech vs Singing [6 mark(s)]
1. What are the properties that differ between the audio of a speech recording and that of a singing recording? What are the similar/same properties that are shared between them? **[2 mark(s)]**
2. What are the properties that are different between spoken texts and lyrics? What are the similar/same properties that are shared between them? **[1 mark(s)]**
3. Given the limited paired singing dataset of audio and lyric, how can we build a lyric transcription system with better performance? Please answer from 3 perspectives. **[3 mark(s)]**

(Your Answer)

1. 

        Differences:

            - Range of Pitch: Speaking voice normally ranges from 75Hz to 600Hz, but the range of pitch in singing voice is very big (eg: opera)
            - Loudness: Speaking voice is usually not loud, whereas singing voice can vary a lot.
            - Rate of Speech: Speaking voice has a standard speaking rate, but singing voice depends on the genre of music. (Rap: Fast, Opera: Slow)
            - Periodicity: While both types are aperiodic, singing voice follows a rhythm and generally has patterns.
            - Sampling Rate: The sampling rate for singing voice needs to be much higher in order to reliably reconstruct the audio.

        Similarities:

            - Both involve content that can be broken down into phonemes. While the articulation/intonation will be different, the phonemes don't change.
            - Both produce sounds that fall within our hearing range.
            - Both are aperiodic

2. 

        Differences:

            - Intonation: Lyrics contain information on how to say the words.
            - Rhythm: Lyrics often follow a rhythm based on the instruments. Spoken texts do not have this constraint.
        
        Similarities:

            - Language: Both are represented using some form of language.
            - Grammar: Both follow the syntax and rules of grammar (at least for English)

3. 
        Model Architecture: 
        
            - We can add attention mechanism to our architecture. Lyrics have a strong dependence on the rhythm of the music. The stressors in phoneme are more important in the case of lyrics, compared to spoken voice.

        Training:

            - Incorporate feedback from human annotators. Humans will evaluate the misclassified results and the feedback will be used to fine tune the model. (eg: RLHF).
            - Incorporate a language model over the outputs to significantly improve accuracy.
            - Add regularization during the training process (Dropout layers/L2/L1).
            - Tune the hyperparameters such as learning rate, dropout rate etc so that the model converges.

        Data Augmentation: 

            - Generate synthetic data from the training data distribution to increase the size.

### - Timing Survey [2 mark(s)]

- What do you think is the most difficult part? Which part did you spend the most time on? **[1 mark(s)]**
```
Answering the theory questions and understanding the SpeechBrain library.
```
</br>

- How much time did you spend on the assignment? Please fill in an estimated time here if you did not time yourself. **[1 mark(s)]**
```
3 days.
```
</br>