# COMP47700 Speech and Audio PL6: Automatic Speech Recognition (ASR) with HuggingFace
---

## Learning outcomes
This practical tutorial covers the following learning outcomes within the COMP47700 Speech and Audio module:
* Recognise and reflect on the relationship between the underlying theories and state-of-the-art research (**LO4**)
  * Familiarise yourself with current available frameworks and tools for ASR (HuggingFace, Whisper)
  * Load and test SotA models for ASR using HuggingFace
* Create programmes to conduct experiments on speech and audio samples building on third-party software libraries (**LO6**)
  * Evaluate models' performance for ASR
  * Familiarise yourself with current available datasets for ASR tasks


## Module topics
This practical tutorial builds on the following core topics:
* Automatic speech recognition and speaker recognition (Unit 9)

## Why is it important?
* Understanding ASR and speaker recognition is crucial for developing a wide range of applications, from voice-controlled interfaces to security systems and personalized user experiences. These technologies play a key role in enhancing human-computer interaction and making technology more versatile and accessible.
* Getting familiar with SoTA frameworks like HuggingFace is essential for leveraging the platform effectively, developing customized models, integrating with other NLP tools, and contributing to the collaborative and dynamic community focused on advancing speech-related tasks, natural language processing and machine learning.

## Structure of this tutorial
This practical tutorial contains different sections:
* **Live coding:** Basic theory, demos and coding examples presented by the lecturer on site (unmarked)
* **Student activity:** Familiarisation and coding exercises to be completed by the students and followed by a short discussion on site (unmarked). These activities introduce key concepts and skills necessary to complete the assignments.
* **Assignment:** Three (3) take home problem/coding questions to be completed by the students and due in two (2) weeks from the day the practical tutorial is given. Assignment questions represent fifteen (15) mark points.

## Setup notes
We will be using Google Colab for our labs, but if you wish to run speech and audio projects locally (not recommended) you will need to manage your own Python environment and install a number of important packages.

Some important libraries and packages for signal analysis in Python are:

[datasets](https://huggingface.co/docs/datasets/index): library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

[transformers](https://huggingface.co/docs/transformers/index): provides APIs and tools to download and train state-of-the-art pretrained models.

[evaluate](https://huggingface.co/docs/evaluate/index): library for evaluating ML models and datasets.

[ffmpeg](https://python-ffmpeg.readthedocs.io/en/latest/): is a powerful and versatile multimedia processing tool that can handle a variety of audio and video formats. Provides a Python wrapper for the FFmpeg library, allowing developers to use FFmpeg functionality in Python scripts.

**To install the required libraries, run the command below:**

In [None]:
!pip install datasets transformers huggingface_hub torchaudio librosa jiwer evaluate

---
### **Live coding:** Collecting data from zipped file
1. From your local system, select the .zip file provided for PL6 (`PL6_files.zip`).
2. Use `zipfile` to extract the files to your Google Colab environment.

**Notes:** You can inspect the extracted folder (phonemes) in the files section at the table of contents.

In [None]:
import zipfile
from google.colab import files

zipname = 'PL6_files.zip'
uploaded = files.upload()

In [None]:
# Extract the zip file
with zipfile.ZipFile(zipname, 'r') as zip_ref:
  zip_ref.extractall()  # Extract all files to the current directory

---
### **Live coding:** Inference Examples with HuggingFace
HuggingFace hosts over 7000 ASR models on **The Hub**. They can be used through the inference API or through libraries such as `transformers`, `speechbrain` and `espnet`. In this tutorial, we will show examples using the inference API and the `transformers` library.

1. Define an inference query function
2. Load a sample wave file
3. Test the **wav2vec 2.0** model from api-inference.huggingface.co
4. Test the **wav2vec 2.0** model loading the model from the `transformers` library

In [None]:
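# query function requires an API_TOKEN from HuggingFace (passed in through headers)
import json
import requests

def query(filename, headers):
    # Read the audio file as raw bytes
    with open(filename, "rb") as f:
        data = f.read()
    # API_URL is a global variable set in the next cell, before query() is called
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))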

In [None]:
import IPython.display as ipd
import urllib.request as urllib2
import IPython
import ipywidgets as widgets
import json
import requests

# Load sample file
h_comp='hinesCOMP47700.wav'

# Set API_TOKEN from HuggingFace
API_TOKEN = "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"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# Call the 'facebook/wav2vec2-base-960h' pre-trained model from the api-inference.huggingface portal
API_URL = "https://api-inference.huggingface.co/models/facebook/wav2vec2-base-960h"

data = query(h_comp, headers)
print(data['text'])
ipd.Audio(filename=h_comp)

In [None]:
import transformers
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", "facebook/wav2vec2-base-960h")
data = pipe("hinesCOMP47700.wav")
print(data['text'])
ipd.Audio(filename=h_comp)

### **Student activity #1:** Inference for multiple files
1. Find the folder `tcdvoip_sample` containing a small set of wave files and create a dataframe named `dftranscriptions` listing the file names under the column `file`
2. Use the **wav2vec 2.0** model to generate the transcriptions for the list of files and store them in the dataframe under the column `stt_wav2vec2`

In [None]:
###############################
## Student activity solution #1
###############################

import requests, zipfile, io, os
from os import listdir
import numpy as np
import pandas as pd

# Set a dataframe for our dataset
basedir='./tcdvoip_sample/'
wavefilenames = listdir(basedir)
wavefilenames = sorted(wavefilenames)
dftranscriptions = pd.DataFrame(columns = ['file'])
dftranscriptions['file'] = wavefilenames
dftranscriptions

In [None]:
###############################
## Student activity solution #1
###############################

# set task and model parameters for pipeline function
task = "automatic-speech-recognition"
modelname = 'facebook/wav2vec2-base-960h'
pipe = pipeline(task, modelname)

stranscription = []

# process wavefiles
for fname in wavefilenames:
  transcription = pipe(basedir+fname)
  stranscription.append(transcription['text'])

dftranscriptions['stt_wav2vec2'] = stranscription
dftranscriptions

### The Hub
[The Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) is a cloud-based platform provided by HuggingFace that serves as a central repository for sharing and versioning models and datasets. It allows users to upload, download, and share pre-trained models for NLP, computer vision, and audio/speech tasks.


For ASR, over 7000 models are available at [The Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads). They vary in the libraries they are built on (e.g., transformers, PyTorch, ESPnet, TensorFlow, SpeechBrain, etc.) and in the languages they support (e.g., English, German, Spanish, Hindi, Chinese, etc.). In this tutorial, we will test four different pre-trained models from the HuggingFace repository.

* **'facebook/wav2vec2-base-960h'**: model description [here](https://huggingface.co/facebook/wav2vec2-base-960h)
* **'facebook/hubert-large-ls960-ft'**: model description [here](https://huggingface.co/facebook/hubert-large-ls960-ft)
* **'openai/whisper-base'**: model description [here](https://huggingface.co/openai/whisper-base)
* **'openai/whisper-tiny'**: model description [here](https://huggingface.co/openai/whisper-tiny)

### **Live coding:** Exploring ASR Models in HuggingFace

1. Define a function to compute the transcriptions for a list of wavefiles

In [None]:
# test a huggingface model from a list of wavefiles, a filepath, and the model's path in The Hub
# returns a list with the transcriptions

def testHugModelwavlist(wavelist, filepath, hfmodelname):
  # set pipe with target task
  task = "automatic-speech-recognition"
  hfmodel = hfmodelname
  pipe = pipeline(task, hfmodel)

  stranscription = []

  # Process wavfiles
  for fname in wavelist:
    transcription = pipe(filepath+fname)
    stranscription.append(transcription['text'])

  return stranscription

### **Student activity #2:** Inference for multiple models
1. Use the function `testHugModelwavlist` to generate transcriptions for the files listed in the dataframe `dftranscriptions`, using the following models hosted on the HuggingFace Hub:
   - facebook/wav2vec2-base-100h
   - facebook/hubert-large-ls960-ft
   - openai/whisper-base
   - openai/whisper-tiny
2. Store the generated transcriptions in the dataframe `dftranscriptions` under the following columns (the same names are used later when computing the WER):
   - facebook/wav2vec2-base-100h: `sst_wav2vec2-base`
   - facebook/hubert-large-ls960-ft: `stt_hubert`
   - openai/whisper-base: `stt_whisper-base`
   - openai/whisper-tiny: `stt_whisper-tiny`

In [None]:
###############################
## Student activity solution #2
###############################

from os import listdir
import numpy as np
import pandas as pd

# Set a dataframe for our dataset
basedir='./tcdvoip_sample/'
wavefilenames = listdir(basedir)
wavefilenames = sorted(wavefilenames)
#dftranscriptions = pd.DataFrame(columns = ['file'])
#dftranscriptions['file'] = wavefilenames[:6]

# call the test function and add the results to our dataframe
stt = testHugModelwavlist(wavefilenames[:6], basedir, 'facebook/wav2vec2-base-100h')
dftranscriptions['sst_wav2vec2-base'] = stt

stt = testHugModelwavlist(wavefilenames[:6], basedir, 'facebook/hubert-large-ls960-ft')
dftranscriptions['stt_hubert'] = stt

stt = testHugModelwavlist(wavefilenames[:6], basedir, 'openai/whisper-base')
dftranscriptions['stt_whisper-base'] = stt

stt = testHugModelwavlist(wavefilenames[:6], basedir, 'openai/whisper-tiny')
dftranscriptions['stt_whisper-tiny'] = stt

dftranscriptions



### Word Error Rate
For this tutorial, we will evaluate the performance of the loaded models with the **Word Error Rate (WER)**. WER is a common metric of the performance of an ASR system.

WER can be computed in two equivalent ways, since N = S + D + C:
* WER = (S + D + I) / N
* WER = (S + D + I) / (S + D + C)

where

* **S**: number of substitutions
* **D**: number of deletions
* **I**: number of insertions
* **C**: number of correct words
* **N**: number of words in the reference
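
As a minimal sketch of the first formula, the snippet below counts S + D + I as the word-level edit distance between a reference and a hypothesis, then divides by N. The function names `word_edit_distance` and `simple_wer` and the hypothesis sentence are invented for illustration; this is not how the `evaluate` library is implemented, only a way to see where the number comes from.

```python
def word_edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions (S + D + I)
    needed to turn the reference word sequence into the hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j]: distance between the reference words seen so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def simple_wer(reference, hypothesis):
    """WER = (S + D + I) / N, where N is the number of reference words."""
    n_ref = len(reference.split())
    return word_edit_distance(reference, hypothesis) / n_ref


# Reference taken from the tutorial sentences; hypothesis is made up
ref = "POST NO BILLS ON THIS OFFICE WALL"
hyp = "POST NO BILL ON THE OFFICE WALL PLEASE"

# 2 substitutions + 1 insertion = 3 errors over 7 reference words
print(simple_wer(ref, hyp))
```

Because insertions are counted in the numerator but not in N, WER can exceed 1 when the hypothesis contains many extra words.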

### **Live coding:** Evaluate Model's performance

1. Add the transcription references to our wavefiles dataframe
2. Define a function to compute the WER for our model's transcriptions
3. Observe the resulting performance and handle issues (uppercase)

In [None]:
# Add column with reference text
reference = ['HIS HIP STRUCK THE KNEE OF THE NEXT PLAYER THERE IS A LAG BETWEEN THOUGHT AND ACT',
             'GREEN ICE FROSTED THE PUNCH BOWL THE STEADY DRIP IS WORSE THAN A DRENCHING RAIN',
             'WHEN THE FROST HAS COME IT IS TIME FOR TURKEY LOOP THE BRAID TO THE LEFT AND THEN OVER',
             'HOIST THE LOAD TO YOUR LEFT SHOULDER BOTH LOST THEIR LIVES IN THE RAGING STORM',
             'POST NO BILLS ON THIS OFFICE WALL THEY SANG THE SAME TUNES AT EACH PARTY',
             'A CRUISE IN WARM WATERS IN A SLEEP YACHT IS FUN TEAR A THIN SHEET FROM THE YELLOW PAD'
             ]

dftranscriptions['reference'] = reference

In [None]:
from evaluate import load

# computes the WER for a list of models (columns) over a dataframe
# returns a dataframe with WER scores for each model in the list
def computeWERdf(df, models):
  dfscores = pd.DataFrame(columns = ['model', 'WER'])
  dfscores['model'] = models
  # load metric
  wer = load("wer")
  wer_scores = []
  for model in models:
    predictions = df[model]
    references = df["reference"]
    wer_scores.append(wer.compute(predictions=predictions, references=references))

  dfscores['WER'] = wer_scores
  return dfscores


In [None]:
from evaluate import load

computeWERdf(dftranscriptions,['stt_wav2vec2', 'sst_wav2vec2-base', 'stt_hubert', 'stt_whisper-base', 'stt_whisper-tiny'])

* These values indicate the average number of errors per reference word
* Why are the Whisper models getting high WER scores?

### **Student activity #3:** Transcriptions issues
1. Use the function `str.upper()` to fix the transcriptions generated by the whisper models and replace them in the dataframe.
2. Compute the WER again for all the models and observe the changes in performance.

In [None]:
###############################
## Student activity solution #3
###############################

# Change to uppercase columns for whisper models
dftranscriptions['stt_whisper-base'] = dftranscriptions['stt_whisper-base'].str.upper()
dftranscriptions['stt_whisper-tiny'] = dftranscriptions['stt_whisper-tiny'].str.upper()

# compute WER again
computeWERdf(dftranscriptions,['stt_wav2vec2', 'sst_wav2vec2-base', 'stt_hubert', 'stt_whisper-base', 'stt_whisper-tiny'])

**Whisper** models are capable of recognizing patterns in speech signals, including changes in intonation, pauses, and other cues that indicate the presence of punctuation. These models can generate text output that includes commas, periods, question marks, exclamation points, and other common punctuation marks.
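
Since the references used here contain no punctuation, punctuation in a transcription also counts as word errors (for example, `WALL.` does not match `WALL`). The sketch below shows one possible normalisation step that uppercases the text and strips punctuation before scoring. The function name `normalise_transcript` and the sample sentence are illustrative assumptions, and the regular expression only keeps unaccented English letters, digits and apostrophes, so it would need adapting for other languages.

```python
import re

def normalise_transcript(text):
    """Uppercase the text, replace punctuation with spaces and collapse whitespace."""
    text = text.upper()
    # Keep only A-Z, digits, apostrophes and spaces (English-only assumption)
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)
    return " ".join(text.split())


# Made-up example in the style of a Whisper transcription
sample = " Post no bills on this office wall. They sang the same tunes at each party!"
print(normalise_transcript(sample))
```

A function like this could be applied to a transcription column with `dftranscriptions[column].apply(normalise_transcript)` before computing the WER again.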

### Datasets in HuggingFace
HuggingFace allows users to download and prepare datasets with a suite of functions that enable efficient pre-processing. As with ASR models, The Hub hosts a number of datasets for audio and speech tasks.

Datasets can vary in terms of domain, language, tasks, and sizes ([ASR datasets](https://huggingface.co/datasets?task_categories=task_categories:automatic-speech-recognition&sort=downloads)).

Datasets can be downloaded using the **load_dataset** function from the **datasets** library. This library contains useful methods to process dataset objects (e.g., the **remove_columns** method).

### **Live coding:** Exploring datasets from HuggingFace

1. Load the **ciempiess** dataset from The Hub repository and explore the elements included in it
2. Pre-process the dataset using some of the methods available in the **datasets** library (e.g., remove columns, resample audio elements, filter elements in the dataset)
3. Present the **Streaming Mode** and understand its functionality


For practical purposes, we selected the **ciempiess** dataset, which is light enough to be downloaded during the live tutorial.


The **CIEMPIESS TEST Corpus** is a gender-balanced corpus designed to test acoustic models for the speech recognition task. It was created from recordings and human transcripts of 10 male and 10 female speakers. The language of the corpus is Spanish with the accent of Central Mexico, except for speaker M_09, who comes from El Salvador. More information is available at [Ciempiess-HuggingFace](https://huggingface.co/datasets/ciempiess/ciempiess_test).

In [None]:
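# load_metric is not used in this tutorial and is no longer available in recent
# versions of datasets, so only the names we need are imported
from datasets import load_dataset, Audio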

# Load and inspect datasets format in HuggingFace
spanish_dataset = load_dataset("ciempiess/ciempiess_test")

print('Dataset features available:')
print('===========================')
print(spanish_dataset["test"])
print('===========================')


print('Sample element from dataset:')
print('============================')
spanish_dataset["test"][0]

* A dataset object (*spanish_dataset["test"]*) lists the **features** (column headers) included in the dataset, which vary from dataset to dataset, and **num_rows**, the number of elements in the dataset.

* For any ASR dataset, features can vary, but two elements are always included: **audio** (which has inner elements like path, array, and sampling_rate) and **transcription** (in this dataset we can see it labelled as **normalized_text**).


The **remove_columns()** method removes columns from our dataset.

In [None]:
# Pre-processing: remove_columns function

columns_to_remove = ['audio_id', 'speaker_id', 'gender', 'duration']
spanish_dataset = spanish_dataset.remove_columns(columns_to_remove)

print(spanish_dataset["test"])

The **cast_column()** method casts a column to another feature type. For an audio column, casting to `Audio` with a new `sampling_rate` changes the rate at which the samples are decoded.

In [None]:
# Pre-processing: Resampling

from datasets import Audio

spanish_dataset = spanish_dataset.cast_column("audio", Audio(sampling_rate=8000))

spanish_dataset["test"][0]

* The sampling_rate value has changed to 8000.
* We can resample the audio column again and go back to a sampling_rate of 16000

In [None]:
spanish_dataset = spanish_dataset.cast_column("audio", Audio(sampling_rate=16000))

spanish_dataset["test"][0]

We can filter elements from our dataset depending on the task we want to complete (e.g., duration of speech samples, gender of speaker, etc.). The **filter()** function returns rows that match a specified condition.

We will filter out the audio samples in our dataset that are 10 seconds or longer.

In [None]:
# Pre-processing: Filtering

# Reload the dataset, since the 'duration' column was removed earlier
spanish_dataset = load_dataset("ciempiess/ciempiess_test")

def is_audio_length_in_range(duration):
    return duration < 10

print('Before filtering:')
print(spanish_dataset["test"].num_rows)

spanish_dataset["test"] = spanish_dataset["test"].filter(is_audio_length_in_range, input_columns=["duration"])

print('After filtering:')
print(spanish_dataset["test"].num_rows)

What is the **Streaming Mode** in HuggingFace?

* A major challenge when working with datasets in HuggingFace is their size.
* The smallest GigaSpeech configuration has 10 hours of training data and takes 13 GB of storage. Configurations with 10,000 hours require over 1 TB of storage space.

The **datasets** library allows the use of a **streaming** mode to load data progressively. In this mode, data is loaded sample by sample by iterating over the dataset.

In [None]:
librispeech_asr = load_dataset("librispeech_asr", split="test.clean", streaming=True)
print(next(iter(librispeech_asr)))

There is one caveat to streaming mode. In the traditional approach, the downloading and processing operations are performed only once. In streaming mode, the data is not saved to disk, so the download and any processing operations are repeated every time we iterate over the dataset.
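
The caveat can be illustrated with a toy analogy in plain Python, without the **datasets** library. The generator `stream_samples` and the counter `processing_calls` are hypothetical stand-ins for a streamed dataset and its download/decoding work; real streaming involves network access, which this sketch does not perform.

```python
import itertools

processing_calls = 0  # counts how many samples were "downloaded and processed"

def stream_samples(n_samples):
    """Toy stand-in for a streamed dataset: each sample is produced on demand."""
    global processing_calls
    for index in range(n_samples):
        processing_calls += 1  # stands for downloading and decoding one sample
        yield {"id": index, "text": f"utterance {index}"}

# First pass: only the three requested samples are produced,
# even though the "dataset" has 1000 elements
first_pass = list(itertools.islice(stream_samples(1000), 3))

# Second pass: nothing was stored, so the same work is done again
second_pass = list(itertools.islice(stream_samples(1000), 3))

print(processing_calls)  # 3 samples per pass, 2 passes
```

The first pass shows the benefit (only the samples that are consumed get processed), while the second pass shows the cost (the work is repeated on every iteration).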

# **Additional Material**


* Fine-Tune Wav2Vec2 for English ASR with HuggingFace Transformers [post link](https://huggingface.co/blog/fine-tune-wav2vec2-english)
* Automatic Speech Recognition example [post link](https://huggingface.co/docs/transformers/main/tasks/asr)
* A complete guide to Audio Datasets [post link](https://huggingface.co/blog/audio-datasets)
* Loading a dataset in HuggingFace [post link](https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html)
* Stream Mode [post link](https://huggingface.co/docs/datasets/stream)

---
# Assignment Questions PL6


Upload the file provided for this assignment (`PL6_files_assignment.zip`) from your local system.

The zip file contains the complete version of the TCD-VoIP dataset, plus a csv file `tcdvoip-transcriptions.csv` with the corresponding transcriptions for each wav file in the dataset.

In [None]:
from google.colab import files
zipname = 'PL6_files_assignment.zip'
uploaded = files.upload()

In [None]:
import zipfile
# Extract the zip file
with zipfile.ZipFile(zipname, 'r') as zip_ref:
  zip_ref.extractall()  # Extract all files to the current directory

## Assignment question 1
Using the Hub platform at HuggingFace, look up the following ASR models and test them on the TCD-VoIP dataset (5pts)

| Developer  | Model name               |
|------------|--------------------------|
| microsoft  | speecht5_asr             |
| openai     | whisper-small            |
| openai     | whisper-base.en          |
| facebook   | data2vec-audio-base-100h |
| facebook   | wav2vec2-base-960h       |

In [None]:
##################################
## Assignment question solution #1
##################################

## Assignment question 2
Report the WER scores comparing the results of all five (5) models. To do so, use a scatter plot to depict the models (x-axis) and the corresponding WER scores (y-axis). Add the corresponding labels to the plot. Make sure you're solving any issues that might be affecting the WER scores (e.g., case sensitivity). (3pts)

Add a brief comment (100 words max.) reflecting on the models' performance. You can use the descriptive information available at HuggingFace to help compare the models. (2pts)

In [None]:
##################################
## Assignment question solution #2
##################################

## Assignment question 3
Report the WER scores comparing the results for all five (5) types of degradation (echo, clip, competing speaker, noise, chop). To do so, use a scatter plot to depict the type of degradation (x-axis) and the corresponding WER scores (y-axis). Add the corresponding labels to the plot. Make sure you're solving any issues that might be affecting the WER scores (e.g., case sensitivity). (3pts)

Add a brief comment (100 words max.) reflecting on the results across different types of degradation. (2pts)

In [None]:
##################################
## Assignment question solution #3
##################################