# ASR baseline experiment using Whisper and Covost2 (Spanish-English setup)

In this notebook, we use the OpenAI pre-trained model [Whisper](https://openai.com/index/whisper/) for automatic speech recognition (ASR) on the [Covost2](https://huggingface.co/datasets/facebook/covost2) speech translation corpus (using the Spanish-English setup).

First, we import the OpenAI Whisper library together with some additional packages (e.g. `jiwer` for computing the Word Error Rate, WER), and load the pre-trained `base` model.

In [1]:
import whisper
from whisper.normalizers.basic import BasicTextNormalizer

from tqdm.notebook import tqdm
import pandas as pd

import jiwer

model = whisper.load_model("base")



Next, we load the Covost2 dataset (Spanish-English setup) from Hugging Face. Beforehand, the audio data in the source language (Common Voice version 4) must be downloaded from [Common Voice](https://commonvoice.mozilla.org/en/datasets) and placed in the directory passed as `data_dir`.

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("facebook/covost2", 'es_en', data_dir="/home/josanna/josanna/doc/ta/lab/covost2")

print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 79015
    })
    validation: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 13221
    })
    test: Dataset({
        features: ['client_id', 'file', 'audio', 'sentence', 'translation', 'id'],
        num_rows: 13221
    })
})


Let's take a closer look at the features of the dataset:

In [3]:
raw_datasets["train"].features

{'client_id': Value(dtype='string', id=None),
 'file': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None),
 'sentence': Value(dtype='string', id=None),
 'translation': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [4]:
raw_datasets["train"][:5]["file"]

['/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742144.mp3',
 '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742146.mp3',
 '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742323.mp3',
 '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742324.mp3',
 '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742325.mp3']

In [5]:
raw_datasets["train"][:5]["audio"]

[{'path': '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742144.mp3',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 16000},
 {'path': '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742146.mp3',
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 16000},
 {'path': '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742323.mp3',
  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         -3.60844243e-09, -2.13593054e-09, -1.02855546e-08]),
  'sampling_rate': 16000},
 {'path': '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742324.mp3',
  'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         -8.43218004e-07, -5.10772225e-07, -2.45845513e-07]),
  'sampling_rate': 16000},
 {'path': '/home/josanna/josanna/doc/ta/lab/covost2/clips/common_voice_es_19742325.mp3',
  'array': array([-2.27373675e-13, -5.45696821e-12, -5.91171556e-12, ...,
   

Show the first 5 Spanish references

In [6]:
raw_datasets["train"][:5]["sentence"]

['Tras su lanzamiento ha recibido positivas reseñas por parte de la crítica especializada.',
 'Las hojas se secan a la sombra, en un lugar aireado.',
 'Por este motivo no pudo integrar la selección de su país.',
 'Es profundo y navegable sin problemas a medio canal.',
 '"Pretendía recoger la herencia de revistas como ""La Codorniz"" o ""El hermano lobo""."']

Show the first 5 English translations

In [7]:
raw_datasets["train"][:5]["translation"]

['After its release, it has received positive feedback from expert critics.',
 'Leaves are dried in the shade, in a ventilated place.',
 'For this reason, he could not be part of his country’s national team.',
 'The middle of the channel is deep and you can navigate without any problem.',
 'It intended to preserve the heritage of magazines such as “La Codorniz” or “El hermano lobo.”']


We take the first 1000 audio samples from the validation split to be automatically transcribed

In [9]:
data=raw_datasets["validation"][:1000]

We transcribe each of the selected audio files with the multilingual Whisper `base` model, setting Spanish as the decoding language. The ASR output is collected in the list `hypotheses`.

In [10]:
hypotheses = []
for sample in data["file"]:
    hypotheses.append((model.transcribe(sample, language="Spanish"))['text'])

We add the output transcriptions to the data dictionary

In [11]:
data["hypothesis"]=hypotheses

Show the first 5 output transcriptions

In [12]:
data["hypothesis"][:5]

[' Su álgase dio con el cambio de sitio.',
 ' Es un originario de lo este África Tropical y de Borneo.',
 ' Actualmente, milita en el club Oriente Petrolero de la Primera División de Bolivia.',
 ' La voz es de gran belleza y amplia.',
 ' Tienen notables colecciones arqueológicas y etnográficas.']

Transcription hypotheses, references and translations are normalized with Whisper's `BasicTextNormalizer`, which lowercases the text and strips punctuation, so that the WER is not inflated by differences in casing or punctuation

In [13]:
normalizer = BasicTextNormalizer()

data["hypothesis_clean"] = [normalizer(text) for text in data["hypothesis"]]
data["sentence_clean"] = [normalizer(text) for text in data["sentence"]]
data["translation_clean"] = [normalizer(text) for text in data["translation"]]
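
To make the effect of normalization concrete, here is a minimal standard-library sketch of this kind of cleaning. The helper `simple_normalize` is hypothetical and only approximates the behaviour; it is not the Whisper implementation, which may differ in details such as bracket handling and whitespace trimming.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Rough stand-in for a basic ASR text normalizer (hypothetical helper)."""
    # Lowercase and put the text into a canonical Unicode form.
    text = unicodedata.normalize("NFKC", text.lower())
    # Replace every mark (M), symbol (S) or punctuation (P) character with a space.
    text = "".join(" " if unicodedata.category(ch)[0] in "MSP" else ch for ch in text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(simple_normalize(" Su álgase dio con el cambio de sitio."))
# prints: su álgase dio con el cambio de sitio
```

Note that accented letters are kept, since they are part of Spanish orthography, while apostrophes and quotation marks are replaced by spaces.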


Finally, we compute the transcription WER using [JIWER](https://github.com/jitsi/jiwer), a simple and fast Python package for evaluating ASR performance. The references are passed as the first argument and the hypotheses as the second.

In [14]:

wer = jiwer.wer(list(data["sentence_clean"]), list(data["hypothesis_clean"]))

print(f"WER: {wer * 100:.2f} %")

WER: 22.04 %
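
For intuition about what this number measures, the following sketch computes WER by hand as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, pooled over all sentences. The function names are hypothetical, and the pooling over the corpus is an assumption about how the score is aggregated; this is an illustration of the metric, not JIWER's own code.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from the empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(
        word_edit_distance(r.split(), h.split())
        for r, h in zip(references, hypotheses)
    )
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


# First validation sample after normalization: 2 substitutions + 1 deletion
# over 9 reference words.
ref = ["su auge se dio con el cambio de siglo"]
hyp = ["su álgase dio con el cambio de sitio"]
print(f"WER: {corpus_wer(ref, hyp) * 100:.2f} %")
# prints: WER: 33.33 %
```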


Hypotheses, references and translations, in both raw and normalized form, are stored in a pandas DataFrame

In [15]:
dataframe = pd.DataFrame(dict(transcription=data["hypothesis"], sentence=data["sentence"], translation=data["translation"], transcription_clean=data["hypothesis_clean"],  sentence_clean=data["sentence_clean"], translation_clean=data["translation_clean"] ))
pd.set_option('display.max_colwidth', None)
dataframe.head(1)

| | transcription | sentence | translation | transcription_clean | sentence_clean | translation_clean |
|---|---|---|---|---|---|---|
| 0 | Su álgase dio con el cambio de sitio. | Su auge se dio con el cambio de siglo. | Its boom came with the turn of the century. | su álgase dio con el cambio de sitio | su auge se dio con el cambio de siglo | its boom came with the turn of the century |


All the data is saved to a file in CSV format

In [16]:
dataframe.to_csv('L4.1_ASR_Whisper_Baseline_dev_Covost2.csv', encoding='utf-8')

# Exercise

Perform a similar experiment using a different Covost2 source-English setup, and evaluate the performance of different Whisper models.
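
One possible way to organise the comparison is sketched below. It is only a scaffold under stated assumptions: `compare_models`, `transcribe_with` and `score` are hypothetical names, and the transcription and scoring functions are passed in as arguments so that the loop itself stays independent of Whisper and JIWER.

```python
from typing import Callable, Dict, List


def compare_models(
    model_names: List[str],
    transcribe_with: Callable[[str], List[str]],
    references: List[str],
    score: Callable[[List[str], List[str]], float],
) -> Dict[str, float]:
    """Return one score per model name (hypothetical helper)."""
    results = {}
    for name in model_names:
        # transcribe_with(name) must return one hypothesis per reference.
        hypotheses = transcribe_with(name)
        results[name] = score(references, hypotheses)
    return results


# In this notebook the two callables could be wired up roughly like this
# (assumes `whisper`, `jiwer`, `normalizer` and `data` from the cells above):
#
# def transcribe_with(name):
#     model = whisper.load_model(name)
#     return [normalizer(model.transcribe(f, language="Spanish")["text"])
#             for f in data["file"]]
#
# results = compare_models(["tiny", "base", "small"], transcribe_with,
#                          list(data["sentence_clean"]), jiwer.wer)
```

For a different source language, the `language` argument and the dataset configuration would need to change accordingly.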