# DL4ASP - Lab Assignment 3 - Automatic Speech Recognition (2023-24)

By Doroteo Torre Toledano & Beltrán Labrador Serrano.

**Based on IS09 TUTORIAL "Advanced methods for neural end-to-end speech processing – unification, integration, and <span style="color:red">implementation</span> -"**
By: [Shigeki Karita](https://github.com/ShigekiKarita);
NTT Communication Science Laboratories;
15, September, 2019

**Based on "CMU 11751/18781 Fall 2022: ESPnet Tutorial"**
By: [Yifan Peng](mailto:yifanpen@andrew.cmu.edu); https://espnet.github.io/espnet/notebook/espnet2_recipe_tutorial_CMU_11751_18781_Fall2022.html


# Abstract

This lab assignment introduces the task of Automatic Speech Recognition, ASR (also known as Speech-To-Text, STT), which aims to assign an output sequence of words, letters or phonemes to an input sequence of audio (or audio features).

Speech Recognition has evolved from template-matching methods (such as Dynamic Time Warping, DTW) in its early days to statistical methods (such as Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs)). In recent years it has been dominated by deep learning methods, first combined with the previous HMM machinery (hybrid HMM-DNN methods), and more recently solving the whole sequence-to-sequence problem with end-to-end deep learning approaches.

In this lab assignment you will learn to build one of the most modern end-to-end deep learning approaches with a recent tool developed exclusively for sequence-to-sequence problems. [ESPnet](https://github.com/espnet/espnet) is a widely-used end-to-end speech processing toolkit that currently supports various speech processing tasks. ESPnet uses PyTorch as its main deep learning engine, and also follows Kaldi style recipes to provide a complete setup for speech recognition and other speech processing experiments.

## Sessions & Report
The lab assignment consists of two sessions. During these sessions, each student will write a report answering the questions posed in this notebook under the titles **'Questions for the report'**. The report must be delivered after the last session and before the last day of the course.
To answer the questions proposed for the report you may need to:
-	Search the ESPnet online documentation (https://espnet.github.io/espnet).
-	See the output of the scripts (both in the Google Colab notebook and in the files/folders generated by each stage).

**NOTE:** Running some parts of the code takes a lot of time. In particular, training the neural network and decoding can take over 20 minutes and scoring can take over 30. Therefore, it is advisable to run all the code, and continue answering the questions while the code is running.


## Materials:
- [ESPnet repository](https://github.com/espnet/espnet)
- [ESPnet documentation](https://espnet.github.io/espnet/)
- These slides https://github.com/espnet/interspeech2019-tutorial
- API documentation https://espnet.github.io/espnet/
- TIMIT corpus (**can only be used for this lab assignment**) (https://drive.google.com/open?id=14Nz-80FDX4G6oY6UeluKc0zDxdP5Y73-)


## Objectives
After this tutorial, you are expected to know how to:
- Run existing recipes (data prep, training, inference and scoring) in ESPnet2
- Change the training and decoding configurations

Optional extensions:
- Modifying the training and decoding configurations (e.g. other features or models)
- Using pre-trained models
- Recognizing your own voice
- Using real-time decoding
- Other ideas...


## Useful links

- Installation https://espnet.github.io/espnet/installation.html
- Usage https://espnet.github.io/espnet/espnet2_tutorial.html


## Contents

1. ESPnet Overview
1. ESPnet Installation
1. Download and exploration of TIMIT dataset
1. Running an existing recipe step by step
1. Optional Extensions


# 1. ESPnet Overview

ESPnet provides **Bash recipes and a Python library** for speech processing. This lab assignment explores the Bash recipes for ASR.


## 1.1 Bash recipe overview

ESPnet supports **many** ASR tasks including

- Multilingual ASR: en, zh, ja, etc
- Noise robust and far-field ASR
- Multi-channel ASR: joint training with speech enhancement
- Speech Translation: ASR + MT

It also supports a number of tasks other than ASR:
- Text-to-speech (TTS), also known as Speech Synthesis
- Speech Enhancement
- Speaker recognition

For more up-to-date details:
https://github.com/espnet/espnet/tree/master/egs2


## 1.2 ASR Performance

Performance on different corpora is reported in the README.md files in each recipe.

Some examples:

https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1/README.md

https://github.com/espnet/espnet/tree/master/egs2/timit/asr1/README.md




## 1.3 Pretrained models

Many pretrained models are available on Huggingface (filter by `espnet`):

https://huggingface.co/models?sort=downloads&search=espnet

https://huggingface.co/docs/hub/main/en/espnet

# 2. ESPnet Installation
Based on https://espnet.github.io/espnet/notebook/espnet2_recipe_tutorial_CMU_11751_18781_Fall2022.html

Modified by DTT (Doroteo Torre Toledano)



In [1]:
def print_date_and_time():
  from datetime import datetime
  import pytz

  now = datetime.now(pytz.timezone("America/New_York"))
  print("=" * 60)
  print(f' Current date and time: {now.strftime("%m/%d/%Y %H:%M:%S")}')
  print("=" * 60)

# example output
print_date_and_time()

 Current date and time: 12/13/2023 05:21:45


**NOTE: You will need a GPU to run this code. Please check that you have selected the appropriate running environment in Colab**

In [2]:
#!nvidia-smi

Wed Dec 13 11:21:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   43C    P8    12W / 170W |   2789MiB / 12050MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# It takes a few seconds
#!git clone --depth 5 https://github.com/espnet/espnet

# We use a specific commit just for reproducibility.
# (DTT) - This git checkout does not work, so we comment it out and use the latest commit
#%cd ./espnet
#!git checkout 3a22d1584317ae59974aad62feab8719c003ae05

fatal: destination path 'espnet' already exists and is not an empty directory.
/home/javiermunoz/Universidad/MasterDeepLearning/DL4ASP/Practica/Lab3/espnet


In [5]:
# It takes 35 seconds
#%cd ./espnet/tools
#!./setup_anaconda.sh anaconda espnet 3.9

[Errno 2] No such file or directory: './espnet/tools'
/home/javiermunoz/Universidad/MasterDeepLearning/DL4ASP/Practica/Lab3/espnet
/bin/bash: ./setup_anaconda.sh: No such file or directory


In [None]:
# It may take 12 minutes
#%cd ./espnet/tools
#!make TH_VERSION=1.12.1 CUDA_VERSION=11.6 # 11.6

After the main installation of ESPnet, we can install optional packages. However, no additional packages are required for this lab assignment.






If other listed packages are necessary, you can install any of them using

`. ./activate_python.sh && ./installers/install_xxx.sh`

In [None]:
# s3prl and fairseq are necessary if you want to use self-supervised pre-trained models
# It takes 50s
# We do not install them by default
#%cd /content/espnet/tools

#!. ./activate_python.sh && ./installers/install_s3prl.sh      # install s3prl to use SSLRs
#!. ./activate_python.sh && ./installers/install_fairseq.sh    # install fairseq to use Wav2Vec2 / HuBERT model series

Now let’s make sure torch, torch cuda, and espnet are successfully installed. We should get something like

`...`

`[x] torch=1.12.1`

`[x] torch cuda=11.6`

`[x] torch cudnn=8302`

`...`

`[x] espnet=202310`

`...`

In [None]:
%cd ./espnet/tools
!. ./activate_python.sh && python3 check_install.py | head -n 40

# NOTE: Checkpoint 1
print_date_and_time()

# 3. Download and explore the TIMIT dataset



We will use the TIMIT speech corpus for training and testing the ASR system.

**WARNING: You can use TIMIT only for the purpose of this lab assignment. You are not allowed to copy, distribute or use this corpus for any other task.**

In this section we will download the corpus and explore the different types of file it contains.

Download the TIMIT corpus with the following code:

In [None]:
# It takes 25 seconds
!pip install --upgrade --no-cache-dir gdown
# Alternative download location, in case first is not working
#!gdown -O TIMIT.tgz https://drive.google.com/uc?id=1JAbQTuumm_0CIXciBkeyWtQeJ-7yFh5Y
# %cd /content
!gdown -O TIMIT.tgz https://drive.google.com/uc?id=1cpnMCdFdQuWINW-eEjiGIYBLOSr_5Nz9



In [None]:
# It takes 12 seconds
#%cd /content

!tar -xzf TIMIT.tgz


Now have a look at the contents of the corpus:

In [None]:
!apt-get install tree
!tree -L 1 /content/timit/


This directory contains 3 subdirectories, of which only the TIMIT subdirectory is of interest for this lab.

In [None]:
!tree -L 2 /content/timit/TIMIT

In this directory, you will find one directory with documentation (DOC) and two containing the real data, divided into training (TRAIN) and test (TEST). These two last directories share the same structure, but we need to introduce a few concepts about the dataset before explaining this structure.

TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States.
```
      Dialect
      Region(dr)    #Male    #Female    Total
      ----------  --------- ---------  ----------
         1         31 (63%)  18 (37%)   49 (8%)
         2         71 (70%)  31 (30%)  102 (16%)
         3         79 (77%)  23 (23%)  102 (16%)
         4         69 (69%)  31 (31%)  100 (16%)
         5         62 (63%)  36 (37%)   98 (16%)
         6         30 (65%)  16 (35%)   46 (7%)
         7         74 (74%)  26 (26%)  100 (16%)
         8         22 (67%)  11 (33%)   33 (5%)
       ------     --------- ---------  ----------
        ALL       438 (70%) 192 (30%)  630 (100%)

```

There are three different types of sentences according to the type of text and the number of speakers that read them:
```
  Sentence Type   #Sentences   #Speakers   Total   #Sentences/Speaker
  -------------   ----------   ---------   -----   ------------------
  Dialect (SA)          2         630       1260           2
  Compact (SX)        450           7       3150           5
  Diverse (SI)       1890           1       1890           3
  -------------   ----------   ---------   -----    ----------------
  Total              2342                   6300          10
```
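The totals in the two tables above can be cross-checked with a few lines of arithmetic. The following is only a sanity-check sketch; the variable names are illustrative and are not part of TIMIT or ESPnet.

```python
# Sanity check of the TIMIT sentence counts listed in the tables above.
# Each entry: (number of distinct texts, number of speakers reading each text)
sentence_types = {
    "SA": (2, 630),    # dialect sentences, read by every speaker
    "SX": (450, 7),    # compact sentences, each read by 7 speakers
    "SI": (1890, 1),   # diverse sentences, each read by a single speaker
}

n_speakers = 630

# Utterances per sentence type = distinct texts x readers per text
totals = {name: texts * readers for name, (texts, readers) in sentence_types.items()}
distinct_texts = sum(texts for texts, _ in sentence_types.values())
total_utterances = sum(totals.values())

print(totals)
print("distinct texts:", distinct_texts)
print("utterances:", total_utterances)
print("utterances per speaker:", total_utterances // n_speakers)
```

The per-type totals add up to the 6300 utterances of the corpus, that is, 10 utterances for each of the 630 speakers.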
The corpus is divided into a training part and a test part in a way that:
-	A speaker appears in either train or test, but not in both.
-	All sentences appearing in test are different from those in train.
  -	Dialect sentences (SA) are excluded from test (because they are read by every speaker, in both train and test).


For each dialectal region (DRN) we have a number of speakers according to the following structure:
```
<DIALECT>/<SEX><SPEAKER_ID>
```
Where:
```
DIALECT :== DR1 | DR2 | DR3 | DR4 | DR5 | DR6 | DR7 | DR8
SEX :== M | F
SPEAKER_ID :== <INITIALS><DIGIT>
    INITIALS :== speaker initials, 3 letters
    DIGIT :== number 0-9 to differentiate speakers with
              identical initials                              
```

In [None]:
!tree -L 1 /content/timit/TIMIT/TRAIN/DR1

For each speaker we have several utterances, and several files for each utterance according to the following file naming convention:  
```
<SENTENCE_ID>.<FILE_TYPE>
```
Where:
```
SENTENCE_ID :== <TEXT_TYPE><SENTENCE_NUMBER>             
    TEXT_TYPE :== SA | SI | SX
    SENTENCE_NUMBER :== 1 ... 2342                
FILE_TYPE :== WAV | TXT | WRD | PHN
```
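The two naming conventions above can be combined into a small parser. This is a minimal sketch, assuming upper-case file names as shown in this notebook; `parse_timit_path` is a hypothetical helper, not part of TIMIT or ESPnet.

```python
import re
from pathlib import PurePosixPath

# Pattern for the last three path components:
# <DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>
_TIMIT_RE = re.compile(
    r"^(?P<dialect>DR[1-8])/"
    r"(?P<sex>[MF])(?P<speaker>[A-Z]{3}[0-9])/"
    r"(?P<text_type>SA|SI|SX)(?P<number>[0-9]+)\.(?P<file_type>WAV|TXT|WRD|PHN)$"
)

def parse_timit_path(path):
    """Split a TIMIT file path into the fields of the naming convention."""
    tail = "/".join(PurePosixPath(path).parts[-3:])
    match = _TIMIT_RE.match(tail)
    if match is None:
        raise ValueError(f"not a TIMIT utterance path: {path}")
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    return fields

info = parse_timit_path("/content/timit/TIMIT/TRAIN/DR1/FCJF0/SX397.WAV")
print(info)
```

For the example path, the parser reports dialect region `DR1`, a female speaker with ID `CJF0`, and compact sentence number 397.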

In [None]:
!tree -L 1 /content/timit/TIMIT/TRAIN/DR1/FCJF0

For each utterance we have:
- The transcription (.TXT): orthographic transcription of the utterance (with start and end audio sample of the utterance).



In [None]:
!cat /content/timit/TIMIT/TRAIN/DR1/FCJF0/SX397.TXT

- The time-aligned word transcription (.WRD) (with start and end audio sample of each word)

In [None]:
!cat /content/timit/TIMIT/TRAIN/DR1/FCJF0/SX397.WRD

- The time-aligned phone transcription (.PHN) (with start and end audio sample of each phone)

In [None]:
!cat /content/timit/TIMIT/TRAIN/DR1/FCJF0/SX397.PHN

- The speech audio recording in SPHERE wav format (.WAV).
   The SPHERE format is not a standard audio format.
   You need to convert it to a standard wav format to listen to it, for instance with SOX.

The TIMIT corpus is unusual in this respect. Normally you have neither time alignments nor phonetic transcriptions: in a standard corpus you only have the .wav and the .txt files, or even the .wav files only!
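Since the .TXT, .WRD and .PHN files all store start and end positions in audio samples, converting them to seconds only requires the sampling rate (16 kHz for TIMIT). The sketch below assumes one `<start> <end> <label>` entry per line; `parse_alignment` is a hypothetical helper and the example lines are made up, not copied from the corpus.

```python
SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz

def parse_alignment(text, sample_rate=SAMPLE_RATE):
    """Parse '<start_sample> <end_sample> <label>' lines into (start_s, end_s, label)."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # maxsplit=2 keeps multi-word labels (e.g. the sentence in a .TXT file) intact
        start, end, label = line.split(maxsplit=2)
        segments.append((int(start) / sample_rate, int(end) / sample_rate, label))
    return segments

# Made-up lines in the .PHN layout (not taken from the corpus).
example = """0 1600 h#
1600 4000 sh
4000 8000 iy
"""
segments = parse_alignment(example)
for start_s, end_s, label in segments:
    print(f"{label:>3}: {start_s:.3f} s -> {end_s:.3f} s")
```

The same function can be applied to the contents of a .WRD file to obtain word boundaries in seconds.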

Take a look at the audio file characteristics and listen to a few of them.

In [None]:
# It takes 12 seconds
!apt-get install sox
!mkdir -p /content/tmpdata
!sox /content/timit/TIMIT/TRAIN/DR1/FCJF0/SX397.WAV /content/tmpdata/SX397.wav
import wave
obj = wave.open('/content/tmpdata/SX397.wav','r')
print( "Number of channels",obj.getnchannels())
print ( "Sample width",obj.getsampwidth())
print ( "Frame rate.",obj.getframerate())
print ("Number of frames",obj.getnframes())
obj.close()
from scipy.io import wavfile
import matplotlib.pyplot as pyplot
import IPython
samplerate, data = wavfile.read('/content/tmpdata/SX397.wav')

#Plot file
pyplot.plot(data)
pyplot.show()

# Play wave (may not work depending on the browser, but you can download and play the file)
IPython.display.Audio('/content/tmpdata/SX397.wav')

Now let's have a look at the Mel-frequency spectrogram of this file. This is very similar to the features that the speech recognizer takes as input.


In [None]:
import librosa
import librosa.display
import numpy as np
y, sr = librosa.load('/content/tmpdata/SX397.wav')
melspec=librosa.feature.melspectrogram(y=y, sr=sr)
pyplot.figure(figsize=(10, 4))
librosa.display.specshow(librosa.power_to_db(melspec), y_axis='mel', fmax=8000, x_axis='time')
pyplot.colorbar(format='%+2.0f dB')
pyplot.title('Mel spectrogram')
pyplot.tight_layout()



---


# *Questions for the report (Q3.1):*

-	Explore the TRAIN and TEST directories and listen to some files, checking that the audio is the expected one (gender, text, dialectal region if you can recognize the dialect, etc.). You can do this by modifying the corresponding parts of the Google Colab notebook. Provide one example in the report including gender, text and dialectal region.
-	Check the contents of the .txt, .wrd and .phn files against the audio (.wav) file for a few files. You can do this by modifying the corresponding parts of the Google Colab notebook. Include in the report (a few lines of) these files for the example in the previous question.
-	Explore the documentation of the TIMIT corpus and try to find other information associated with the speakers. Can you imagine other applications of this corpus beyond training and testing Speech-To-Text systems? Include your answers in the report.

---

# 4. Run an existing recipe step by step


## 4.1 KALDI-style recipes and directory structure
ESPnet follows the KALDI-style recipe structure in Bash for ASR systems.
This structure has changed somewhat in the new version (ESPnet2), but it is still inspired by the KALDI-style recipe structure.

There is a directory for each corpus under the directory `espnet/egs2/`.

Have a look at the different recipes contained in ESPnet:

In [None]:
!ls /content/espnet/egs2

In this lab assignment we will be working with the recipe for the TIMIT database.

It is located in `espnet/egs2/timit/asr1`:

In [None]:
!tree -L 1 /content/espnet/egs2/timit/asr1/

All the recipes have the same structure:
```
 - conf/      # Configuration files for training, inference, etc.
 - scripts/   # Bash utilities of espnet2
 - pyscripts/ # Python utilities of espnet2
 - steps/     # From Kaldi utilities
 - utils/     # From Kaldi utilities
 - db.sh      # The directory path of each corpus
 - path.sh    # Setup script for environment variables
 - cmd.sh     # Configuration for your backend of job scheduler
 - run.sh     # Entry point
 - asr.sh     # Invoked by run.sh
```

You need to modify `db.sh` to specify the location of your corpus before executing `run.sh`. For example, when you use the `egs2/timit` recipe, you need to change the path of TIMIT in `db.sh` (but don't worry about this, because this notebook does it for you later!).

Some corpora can be obtained freely from the web; their paths are initially set to `downloads/`. You can also change them to your own corpus path if the corpus is already downloaded.

`path.sh` is used to set up the environment for `run.sh`. Note that the Python interpreter used for ESPnet is not the current Python of your terminal, but it’s the Python which was installed at `tools/`. Thus you need to source `path.sh` to use this Python.

```
. path.sh
python
```
`cmd.sh` is used for specifying the backend of the job scheduler. If you don’t have such a system in your local machine environment, you don’t need to change anything in this file. You definitely do not have to change this file in Google Colab.


`conf` is a directory containing configuration files for the recipe. We will start with the standard configuration, but we have to modify it to run different experiments. Have a look at its contents and try to figure out the use of each file (their names help).

In [None]:
!tree -L 1 /content/espnet/egs2/timit/asr1/conf


> One of the most important configuration files is `train_asr.yaml`, which defines the neural network architecture and the training parameters. Examine the file and try to figure out what the different parameters mean.



In [None]:
!cat /content/espnet/egs2/timit/asr1/conf/train_asr.yaml

> Similarly, `decode.yaml` contains parameters for decoding.

In [None]:
!cat /content/espnet/egs/timit/asr1/conf/decode.yaml

- `README.md` contains the results obtained with this recipe with different configurations. Explore the different results obtained with this database. You have to look at the `Sum/Avg` row. The columns represent (from left-to-right):
>- Number of files.
>- Number of words.
>- % Correct tokens (phones/characters/words depending on the token units used in the configuration).
>- % Substituted tokens.
>- % Deleted tokens.
>- % Inserted tokens.
>- Phone / Character / Word Error Rate in %.
>- Sentence Error Rate in %.



In [None]:
!cat /content/espnet/egs2/timit/asr1/README.md

`run.sh` is the main script, which we often call the “recipe”. It runs all the stages of a DNN experiment: data preparation, training, and evaluation.

You can execute the whole recipe by just running
```
%cd /content/espnet/egs2/timit/asr1
!./run.sh
```

However, we will be executing the recipe stage by stage to show and explain the different parts of the process.

First of all, we need to tell ESPnet where the TIMIT corpus is.
We do this by modifying the `db.sh` file in the recipe.


In [None]:
%cd /content/espnet/egs2/timit/asr1
!cat db.sh | grep TIMIT

In [None]:
!mv db.sh db.sh.bk

In [None]:
!sed -e "s/TIMIT=/TIMIT=\/content\/timit\/TIMIT/" db.sh.bk > db.sh
!cat db.sh | grep TIMIT
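For readers less familiar with `sed`, the substitution above can be sketched in Python. This only illustrates what the command does to the text of `db.sh`; `set_timit_path` is a hypothetical helper, and the sample contents are an invented excerpt rather than the real file.

```python
def set_timit_path(db_sh_text, timit_dir="/content/timit/TIMIT"):
    """Insert timit_dir after the first 'TIMIT=' on each line, like sed's s/// without 'g'."""
    lines = []
    for line in db_sh_text.splitlines():
        lines.append(line.replace("TIMIT=", "TIMIT=" + timit_dir, 1))
    return "\n".join(lines) + "\n"

# Hypothetical excerpt of db.sh; the real file lists many more corpora.
sample = "LIBRISPEECH=downloads\nTIMIT=\nWSJ0=\n"
patched = set_timit_path(sample)
print(patched)
```

Because the `sed` command reads from the backup copy `db.sh.bk` and writes a fresh `db.sh`, rerunning that cell alone does not append the path a second time.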

First, let's have a look at the main script.

In [None]:
!cat run.sh



The main script just calls the `asr.sh` script with some configurations.
So, let's have a look at the `asr.sh` script as well (it is very long and it is not necessary to understand it all).

In [None]:
!cat asr.sh

`asr.sh` is quite long, but it is structured in stages and you don't need to know exactly how all of them work. It is important, however, to understand the different stages.

You can run only the stages you want by passing the arguments `--stage` and `--stop_stage` to the `run.sh` or `asr.sh` scripts.

For instance, the following command will execute stages 2 to 5:

`!./run.sh --stage 2 --stop_stage 5`
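The effect of `--stage` and `--stop_stage` can be thought of as a range filter over the stage numbers. The sketch below is an illustration only: both functions are hypothetical helpers, and `last_stage=16` is an assumption based on the highest stage number mentioned in this notebook.

```python
def stages_to_run(stage, stop_stage, last_stage=16):
    """List the stage numbers executed for the given --stage / --stop_stage values."""
    # A stage runs when stage <= N <= stop_stage
    return [n for n in range(1, last_stage + 1) if stage <= n <= stop_stage]

def run_command(stage, stop_stage):
    """Build the command line in the '--name value' form used by the cells in this notebook."""
    return f"./run.sh --stage {stage} --stop_stage {stop_stage}"

print(stages_to_run(2, 5))
print(run_command(2, 5))
```

Setting both arguments to the same number runs a single stage, which is how the rest of this section proceeds.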



---


# *Questions for the report (Q4.1):*

-	What types of recipes (recipes for different purposes such as ASR, speech synthesis, etc.) can you identify in the egs2 directory?
-	Which files would you need to modify to change the structure of the DNNs used by the STT system (e.g. adding more layers to the encoder)?
- Which files would you need to modify to change the sequence-to-sequence mapping technique used (e.g. only CTC or only an attention-based encoder-decoder)?



---

### 4.1.1 Stage 1: Data Preparation

The first stage is completely corpus dependent. Its main purpose is to process the data in a specific corpus (mainly speech in some audio format and labels in text format) and organize it in a uniform way, so that all the recipes can proceed from a unified data format.

Note that `--stage <N>` specifies the starting stage and `--stop_stage <N>` specifies the stopping stage.

In [None]:
# It takes 30 seconds.
!./run.sh --stage 1 --stop_stage 1

After this stage is finished, please check the `data` directory

In [None]:
%cd /content/espnet/egs2/timit/asr1

In [None]:
!ls data

In this recipe, we use `train` as the training set and `dev` as the validation set (used to monitor training progress by checking the validation score). We also use `test` and (again) `dev` for the final speech recognition evaluation.

Let's check one of the training data directories:


In [None]:
!ls -1 data/train/

These are the speech and corresponding text and speaker information based on the Kaldi format. Please also check https://kaldi-asr.org/doc/data_prep.html
```
spk2utt # Speaker information
text    # Transcription file
utt2spk # Speaker information
wav.scp # Audio file

```
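The `utt2spk` and `spk2utt` files contain the same information in two directions: the first maps each utterance to its speaker, the second lists the utterances of each speaker. A minimal sketch of the conversion is shown below; `utt2spk_to_spk2utt` is a hypothetical helper and the utterance IDs are invented for illustration.

```python
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_text):
    """Invert 'utt_id spk_id' lines into 'spk_id utt_id1 utt_id2 ...' lines."""
    by_speaker = defaultdict(list)
    for line in utt2spk_text.splitlines():
        if not line.strip():
            continue
        utt_id, spk_id = line.split()
        by_speaker[spk_id].append(utt_id)
    # Sort speakers and utterances so the output is deterministic
    return "\n".join(
        f"{spk} {' '.join(sorted(utts))}" for spk, utts in sorted(by_speaker.items())
    )

# Invented utterance and speaker IDs, for illustration only.
utt2spk = """fadg0_si1279 fadg0
fadg0_si1909 fadg0
faks0_si943 faks0
"""
spk2utt = utt2spk_to_spk2utt(utt2spk)
print(spk2utt)
```

The `text` and `wav.scp` files follow the same one-line-per-utterance layout, with the utterance ID as the first field.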



---


# *Questions for the report (Q4.1.1):*

- What is the purpose of stage 1?
- Which are the 3 subsets of TIMIT used? How many utterances does each of them have?
- What type of transcriptions is used in this recipe? Can you use another type? Which one?




---

### 4.1.2 Stage 2: Speed perturbation (one of the data augmentation methods)

We do not use speed perturbation for this demo, but you can turn it on by adding the argument `--speed_perturb_factors "0.9 1.0 1.1"` to the shell script.

In [None]:
!./run.sh --stage 2 --stop_stage 2



---


# *Questions for the report (Q4.1.2):*

- What is “speed perturbation”? Do you think that it may help in this case?




---

### 4.1.3 Stage 3: Format wav.scp: data/ -> dump/raw

We dump the data in a specified format (flac in this case) so that it can be used efficiently.

Note that you can specify the number of CPU jobs by adding `--nj <N>`. Please set it according to your CPU resources and disk access speed.

In [None]:
# It takes 2 minutes
!./run.sh --stage 3 --stop_stage 3

### 4.1.4 Stage 4: Remove long/short data: dump/raw/org -> dump/raw

Some audio files are too long or too short, which is harmful for efficient training. These files are removed from the list.

In [None]:
!./run.sh --stage 4 --stop_stage 4
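The idea behind this stage can be sketched as a simple filter on utterance durations. The bounds below (0.1 s and 20 s) are placeholder values chosen for illustration, and `filter_by_duration` is a hypothetical helper; check `asr.sh` for the options that the recipe actually uses.

```python
def filter_by_duration(durations, min_s=0.1, max_s=20.0):
    """Keep utterances whose duration in seconds lies within [min_s, max_s]."""
    return {utt: d for utt, d in durations.items() if min_s <= d <= max_s}

# Hypothetical utterance durations in seconds.
durations = {"utt_a": 0.05, "utt_b": 3.2, "utt_c": 25.0, "utt_d": 1.7}
kept = filter_by_duration(durations)
print(sorted(kept))
```

Very long utterances force large padded mini-batches, while very short ones may be shorter than the label sequence after subsampling, which is why both ends are filtered.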



---


# *Questions for the report (Q4.1.4):*

- Stage 4 removes very long or very short segments for efficiency reasons. Does it do this to all the subsets? Why?




---

### 4.1.5 Stage 5: Generate token_list from dump/raw/train/text using BPE.

Byte Pair Encoding (BPE) is a way of compressing the information in a vocabulary by finding frequent sub-words.

This is important for text processing. In this example, we build a dictionary based on English characters.

We use the `sentencepiece` toolkit developed by Google.

**NOTE:** This lab assignment does not use BPE since the task is phonetic recognition. "Words" are really phones in this lab assignment, and there is a finite and very limited number of different phones.
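Although BPE is not used in this lab, one merge step of the classic algorithm is easy to sketch: count all adjacent symbol pairs, then merge the most frequent pair into a new symbol. The code below is a toy illustration with an invented vocabulary; it is not how `sentencepiece` is implemented internally, and both function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy vocabulary: each word is a tuple of characters with a made-up frequency.
vocab = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
         tuple("bun"): 4, tuple("hugs"): 5}
pair = most_frequent_pair(vocab)
vocab = merge_pair(vocab, pair)
print(pair)
print(vocab)
```

Repeating these two steps a fixed number of times yields the sub-word vocabulary; the number of merges controls the vocabulary size.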

In [None]:
!./run.sh --stage 5 --stop_stage 5

Let's check the content of the dictionary. There are several special symbols, e.g.,

```
<blank>   used for CTC
<unk>     unknown symbols that do not appear in the training data
<sos/eos> start and end sentence symbols
```
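To make the role of these special symbols concrete, the sketch below builds a small token list from made-up phone transcripts and maps tokens to integer IDs. The placement of `<blank>` and `<unk>` at the start and `<sos/eos>` at the end is an assumption of this sketch, regular tokens are sorted alphabetically for simplicity, and `build_token_list` is a hypothetical helper, so compare with the actual `tokens.txt` printed by the next cell.

```python
def build_token_list(transcripts):
    """Collect the distinct tokens of space-separated transcripts and add special symbols."""
    tokens = sorted({tok for line in transcripts for tok in line.split()})
    # Assumed layout: <blank> and <unk> first, <sos/eos> last.
    return ["<blank>", "<unk>"] + tokens + ["<sos/eos>"]

# Made-up phone transcripts, one utterance per string.
transcripts = ["sil sh iy sil", "sil hh ae d sil"]
token_list = build_token_list(transcripts)
token2id = {tok: i for i, tok in enumerate(token_list)}

# Tokens that are not in the list fall back to the <unk> ID.
ids = [token2id.get(tok, token2id["<unk>"]) for tok in "sil sh zz".split()]
print(token_list)
print(ids)
```

Here `zz` does not occur in the transcripts, so it is mapped to the ID of `<unk>`.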


In [None]:
!cat data/token_list/word/tokens.txt



---


# *Questions for the report (Q4.1.5):*

- Although not used in this example, find the meaning of g2p (a parameter of the main script of this stage) and explain it in the report.
-	Find an example output of this stage for one particular file of the training corpus and include it in the report.





---

### 4.1.6-9 Language modeling (skipped in this tutorial)

**Stages 6--9: Stages related to language modeling.**

We skip the language modeling part in the recipe (stages 6 -- 9) in this tutorial.

In [None]:
!./run.sh --stage 6 --stop_stage 6

In [None]:
!./run.sh --stage 7 --stop_stage 7

In [None]:
!./run.sh --stage 8 --stop_stage 8

In [None]:
!./run.sh --stage 9 --stop_stage 9



---


# *Questions for the report (Q4.1.9):*

- What is language modelling? Why is it normally used in ASR?
- Why is language modelling not included in this recipe?






---

### 4.1.10 Stage 10: ASR collect stats

We estimate the mean and variance of the data in order to normalize it. We also collect the input and output lengths, which are used to create mini-batches efficiently.



In [None]:
# It takes 1 minute
!./run.sh --stage 10 --stop_stage 10
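The two things collected in this stage can be illustrated with NumPy: global per-dimension statistics for normalization, and per-utterance lengths. This is a sketch on random data, assuming 80-dimensional features; the function names are hypothetical and do not correspond to ESPnet's API.

```python
import numpy as np

def collect_stats(feature_matrices):
    """Compute per-dimension mean and standard deviation over all frames."""
    frames = np.concatenate(feature_matrices, axis=0)  # (total_frames, feat_dim)
    return frames.mean(axis=0), frames.std(axis=0)

def normalize(features, mean, std, eps=1e-20):
    """Apply global mean-variance normalization to one utterance."""
    return (features - mean) / np.maximum(std, eps)

rng = np.random.default_rng(0)
# Two made-up utterances with 80-dimensional features and different lengths.
utts = [rng.normal(5.0, 2.0, size=(120, 80)), rng.normal(5.0, 2.0, size=(95, 80))]

mean, std = collect_stats(utts)
normed = np.concatenate([normalize(u, mean, std) for u in utts], axis=0)
lengths = [len(u) for u in utts]

print(normed.mean(axis=0)[:3], normed.std(axis=0)[:3])
print("lengths:", lengths)
```

After normalization every feature dimension has approximately zero mean and unit variance, and the stored lengths allow utterances of similar duration to be grouped into the same mini-batch.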

### 4.1.11 Stage 11: ASR Training.

This is the main training loop. It takes a lot of time, but you can monitor the progress in different ways while training the network.

One way is to explore the following files, which are updated every epoch:
- log file /content/espnet/egs2/timit/asr1/exp/asr_train_asr_raw_word/train.log
- loss /content/espnet/egs2/timit/asr1/exp/asr_train_asr_raw_word/images/loss.png
- accuracy /content/espnet/egs2/timit/asr1/exp/asr_train_asr_raw_word/images/acc.png

There are many more files to monitor progress, but these are the most important.

You can also use Tensorboard to analyze the training and validation logs while training the network.

**NOTE:** You need to launch Tensorboard before starting training. It will not find any logs at first, of course. You need to refresh it from time to time (by clicking the reload icon) to see how training is progressing.

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard
# Launch Tensorboard before training starts
%tensorboard --logdir /content/espnet/egs2/timit/asr1/exp

In [None]:
# It takes 15 minutes
!./run.sh --stage 11 --stop_stage 11



---


# *Questions for the report (Q4.1.11):*

- Have a look at the different options for the `train_asr.yaml` file in the ESPnet documentation. Read the whole `train_asr.yaml` file and try to identify which parameter you would change to modify:
  -	The number of layers in the encoder.
  - The type of the encoder.
  - The number of units in each encoder layer.
  -	The type of attention in the decoder.
  -	The number of units in the decoder.
- Have a look at the different graphs **in Tensorboard** and answer the following questions:
  - Can you identify the parameters (e.g. the loss) corresponding to CTC and Attention? Include them in the report.
  - Check the evolution of the different parameters (particularly the loss and the accuracy) during training and compare the evolution of the results in train and validation. Do you think that training worked properly?
- An alternative way to check the evolution of training is using the graphs generated during the training process in /content/espnet/egs2/timit/asr1/exp/asr_train_asr_raw_word/. You can have a look at them during training using the file explorer, but you need to wait until training ends to answer these questions:
  - Can you see a difference in the evolution of the CTC and the Attention loss? Can you explain this different behavior?
  - What is the final Phone Error Rate (PER) achieved in train and validation? What is the difference between them?








---

### 4.1.12 Stage 12: Decoding.

Note that you can use `--inference_nj <N>` to specify the number of inference jobs.

Let's monitor the log /content/espnet/egs2/timit/asr1/exp/asr_train_asr_raw_word/decode_asr_asr_model_valid.acc.ave/dev/logdir/asr_inference.1.log

In [None]:
# It takes 30 minutes
!./run.sh --stage 12 --stop_stage 12 --inference_nj 4

### 4.1.13 Stage 13: Scoring

This stage computes the word error rate (WER), character error rate (CER), etc. for each test set.

In [None]:
# It takes 13 seconds
!./run.sh --stage 13 --stop_stage 13
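Error rates of this kind are based on a minimum edit distance alignment between the reference and the hypothesis, from which substitutions (S), deletions (D) and insertions (I) are counted. The following is a minimal sketch of that computation on made-up phone sequences; it is not the scoring tool used by the recipe, which may break ties between equally good alignments differently.

```python
def align_counts(ref, hyp):
    """Return (correct, substitutions, deletions, insertions) from a Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1)          # insertion
    # Backtrace from the end of both sequences to count each error type.
    corr = sub = dele = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                corr += 1
            else:
                sub += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return corr, sub, dele, ins

def error_rate(ref, hyp):
    """Error rate in percent: 100 * (S + D + I) / number of reference tokens."""
    _, sub, dele, ins = align_counts(ref, hyp)
    return 100.0 * (sub + dele + ins) / len(ref)

# Made-up phone sequences (not taken from the recipe output).
ref = "sil sh iy hh ae d sil".split()
hyp = "sil s iy ae d ax sil".split()
print(align_counts(ref, hyp), f"{error_rate(ref, hyp):.1f}%")
```

Note that the denominator is the number of reference tokens, so the error rate can exceed 100% when the hypothesis contains many insertions.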



---


# *Questions for the report (Q4.1.13):*

- Explore the file `decode.yaml` and answer the following questions:
  -	What is the “beam-size” parameter?
  - What is the “ctc-weight” parameter?
- Explore the output of the scoring phase and answer the following questions:
  - What is the meaning of the different values produced by the scoring software and included in the heading `|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|`?
  - Find the final Phone Error Rate (equivalent to the Word Error Rate in this experiment) for the two sets evaluated.
- You can also check the breakdown of the phone error rate in /content/espnet/egs2/timit/asr1/exp/asr_train_asr_raw_word/decode_asr_asr_model_valid.acc.ave/dev/score_wer/result.txt. Explore this file and answer the following questions:
  - Find the best and the worst phone error rate for the different speakers. Do you think that these results are very different? Do you think it is normal? Why?
  -	Explore the file until it starts dumping the alignments between hypothesis and reference. Compute manually the PER for the second alignment appearing (sentence fadg0-fadg0_si1909). What is the PER for that file?









---

### 4.1.14 Stage 14: Packing the model for uploading

ESPnet scripts are prepared to pack the generated model and upload it to Zenodo/Huggingface. This stage packs the model.

We skip this stage in this lab assignment.


In [None]:
!./run.sh --stage 14 --stop_stage 14

### 4.1.15 Stage 15: Uploading the model to Zenodo

ESPnet scripts are prepared to pack the generated model and upload it to Zenodo. This stage uploads the packed model.

We skip this stage in this lab assignment.


In [None]:
#!./run.sh --stage 15 --stop_stage 15

### 4.1.16 Stage 16: Uploading the model to Huggingface

ESPnet scripts are prepared to pack the generated model and upload it to Huggingface. This stage uploads the packed model.

We skip this stage in this lab assignment.


In [None]:
#!./run.sh --stage 16 --stop_stage 16

## 4.2 Upload results to Google Drive
Given that Section 4.1 takes a lot of time, it is a good idea to upload the results of that section to Google Drive. You can then resume the lab assignment without rerunning all the previous steps by following the procedure explained in Section 4.3.

**NOTE: Google Colab will ask you for permissions to upload this file to your Google Drive and it will not save it unless you give that permission**



In [None]:
from google.colab import drive
drive.mount('/content/drive')
import shutil

%cd /content/espnet/egs2/timit
!mkdir -p /content/drive/MyDrive/Backup_P3_DL4ASP/
!tar -cvzf timit_asr1.tgz asr1
shutil.copy("/content/espnet/egs2/timit/timit_asr1.tgz","/content/drive/MyDrive/Backup_P3_DL4ASP")

## 4.3 Download results from Google Drive
If you lose the connection and the execution environment, you can restore it (relatively) quickly by:
- Running sections 1-3 (install ESPnet and download TIMIT)
- Executing the cells in this section

After doing this, you should have exactly the same execution environment that you obtained by running sections 1 through 4.2.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import shutil

%cd /content/espnet/egs2/timit
shutil.copy("/content/drive/MyDrive/Backup_P3_DL4ASP/timit_asr1.tgz","/content/espnet/egs2/timit")
!rm -rf /content/espnet/egs2/timit/asr1
!tar -xzf timit_asr1.tgz
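
After extracting the archive, a quick check that the recipe directory looks complete can save time later. The sketch below is only illustrative: `missing_paths` is a hypothetical helper, and the list of expected entries (`run.sh`, `conf`, `exp`) is an assumption about what your backup contains, so adjust it to the stages you actually ran.

```python
from pathlib import Path


def missing_paths(root, expected):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


# Assumed layout of the restored recipe; edit this list to match your backup.
EXPECTED_ENTRIES = ["run.sh", "conf", "exp"]
```

For example, `missing_paths("/content/espnet/egs2/timit/asr1", EXPECTED_ENTRIES)` returns an empty list when every expected entry was restored.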


#5. Optional Extensions
There are a number of possible optional extensions for this lab assignment (but rest assured that you can get the highest grade without making any of these extensions).

You can either complete one or more of these optional extensions as part of the lab assignment or for the final project.

The first optional extension (changing the training and/or decoding config) does not change the task, only the model or the features. Here are some examples of modifications you can explore:
- Changing the Encoder or Decoder network topology (adding or removing layers, making layers wider or narrower in terms of units, etc.).
- Changing the sequence-to-sequence mapping function (using only CTC, using only Attention, using Transformer, etc.).

Either of these two modifications, or any combination of them, will yield a different PER on the same task and will let you take part in an optional challenge: the student with the best PER will be awarded an extra point, the 2nd best 0.8 points, and the 3rd best 0.7 points. Any student completing a meaningful modification of the original system will be awarded an extra 0.5 points in the lab assignment.
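
As a reminder of what is being compared in the challenge, PER (phoneme error rate) counts the substitutions, deletions and insertions needed to turn the recognized phoneme sequence into the reference, divided by the number of reference phonemes. The sketch below is a plain-Python illustration of that definition with hypothetical function names; it is not the scoring code used by the ESPnet recipe, whose scoring stage remains the reference for the assignment.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(references, hypotheses):
    """PER in percent over paired, space-separated phoneme strings."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    if total == 0:
        raise ValueError("references contain no phonemes")
    return 100.0 * errors / total


# One deleted phoneme out of five reference phonemes.
per = phone_error_rate(["sh iy hh ae d"], ["sh iy ae d"])
```

Errors are accumulated over the whole test set before dividing, so long utterances weigh more than short ones.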

The rest of the proposed optional extensions change the task, so their results cannot be compared with those of the original system proposed in the lab assignment. In any case, these optional extensions give you a closer idea of real applications of automatic speech recognition, and you can use them as your final project or as part of it.

##5.1 Changing the training and/or decoding config

###5.1.1 By changing config files
All training options are changed by using a config file.

Please check https://espnet.github.io/espnet/espnet2_training_option.html

Let's first check the config files prepared in the `timit` recipe:

```
- LSTM-based E2E ASR /content/espnet/egs2/timit/asr1/conf/train_asr_rnn.yaml
- Transformer based E2E ASR /content/espnet/egs2/timit/asr1/conf/train_asr_transformer.yaml
```



You can run

**RNN**
```
./asr.sh --stage 10 \
   --train_set train_nodev \
   --valid_set train_dev \
   --test_sets "train_dev test" \
   --nj 4 \
   --inference_nj 4 \
   --use_lm false \
   --asr_config conf/train_asr_rnn.yaml
```

**Transformer**
```
./asr.sh --stage 10 \
   --train_set train_nodev \
   --valid_set train_dev \
   --test_sets "train_dev test" \
   --nj 4 \
   --inference_nj 4 \
   --use_lm false \
   --asr_config conf/train_asr_transformer.yaml
```


You can also find various configs in `espnet/egs2/*/asr1/conf/`, including
- Conformer `espnet/egs2/librispeech/asr1/conf/train_asr_conformer.yaml`
- Wav2vec2.0 pre-trained model and fine-tuning `https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_wav2vec2_960hr_large.yaml`
- HuBERT pre-trained model and fine-tuning `https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_hubert_960hr_large.yaml`

###5.1.2 By using command line arguments

You can also customize the training configuration by editing the config file or by passing command-line arguments, e.g.,

```
./run.sh --stage 10 --asr_args "--model_conf ctc_weight=0.3"
```
```
./run.sh --stage 10 --asr_args "--optim_conf lr=0.1"
```

See https://espnet.github.io/espnet/espnet2_tutorial.html#change-the-configuration-for-training
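
When you try several settings, it can help to generate the `--asr_args` string from a small dictionary instead of typing it by hand. The following sketch is only a convenience under stated assumptions: `build_asr_args` and `build_run_command` are hypothetical helpers, and emitting one `--<group> key=value` pair per override (including repeating the same group for several keys) is an assumption about how the training script parses its arguments, so check the documentation linked above before relying on it.

```python
import shlex


def build_asr_args(overrides):
    """Turn {"model_conf": {"ctc_weight": 0.3}} into "--model_conf ctc_weight=0.3"."""
    parts = []
    for group, values in overrides.items():
        for key, value in values.items():
            # One "--group key=value" pair per overridden option.
            parts.append(f"--{group} {key}={value}")
    return " ".join(parts)


def build_run_command(overrides, stage=10):
    """Return the run.sh invocation as an argument list."""
    return ["./run.sh", "--stage", str(stage),
            "--asr_args", build_asr_args(overrides)]


command = build_run_command({"model_conf": {"ctc_weight": 0.3}})
# shlex.join quotes the --asr_args value so it stays a single shell argument.
command_line = shlex.join(command)
```

The resulting `command_line` string can be pasted after a `!` in a Colab cell, or the `command` list can be passed to `subprocess.run` from the recipe directory.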

##5.2. Using pre-trained models, recognizing your own voice or real-time decoding

See the following link for how to use pre-trained models (from espnet_model_zoo), recognize other audio files or your own voice, and use real-time decoding (which requires specific models).

https://espnet.github.io/espnet/notebook/espnet2_asr_realtime_demo.html

You can also use models downloaded from Hugging Face. Check this webpage for more info:

https://huggingface.co/docs/hub/main/en/espnet
