# Evaluation Metrics for Automatic Speech Recognition (ASR) Systems

ASR systems are most often evaluated with the Word Error Rate (WER) and the Character Error Rate (CER). Both metrics compare the output of the ASR system with a reference transcription and count the edits needed to turn one into the other. WER is the most common metric. CER is mainly used for languages such as Chinese, where text is not split into words by spaces. Together, the two metrics give a good picture of how accurate a speech recognition system is.

## Evaluation Metrics
Both metrics are built from three types of error, found by aligning the prediction with the reference:

1. Substitutions (S): where we transcribe the **wrong word** in our prediction ("sit" instead of "sat")
2. Insertions (I): where we add an **extra word** in our prediction
3. Deletions (D): where we **leave out a word** from our prediction
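
The three error types above can be counted by aligning the two word sequences. The following is a minimal sketch, assuming the standard Levenshtein (edit-distance) alignment. `count_errors` is an illustrative helper written for this notebook, not a function from `evaluate` or `jiwer`, and the example sentences are made up.

```python
def count_errors(reference: str, prediction: str) -> dict:
    """Count word-level substitutions (S), deletions (D) and insertions (I).

    Illustrative helper: fills a Levenshtein table, then walks it backwards
    to label each edit. N is the number of words in the reference.
    """
    ref, hyp = reference.split(), prediction.split()
    rows, cols = len(ref) + 1, len(hyp) + 1

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every reference word
    for j in range(cols):
        dist[0][j] = j  # insert every predicted word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Backtrace from the bottom-right corner to classify each edit.
    counts = {"S": 0, "D": 0, "I": 0, "N": len(ref)}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        diagonal = i > 0 and j > 0
        if diagonal and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # correct word, no error
        elif diagonal and dist[i][j] == dist[i - 1][j - 1] + 1:
            counts["S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


print(count_errors("the cat sat on the mat", "the cat sit on the"))
# {'S': 1, 'D': 1, 'I': 0, 'N': 6}
```

When several alignments have the same total cost, the split between S, D and I depends on the tie-breaking order in the backtrace, but the total number of errors does not.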

### Word Error Rate (WER)
The Word Error Rate (WER) is the total number of errors divided by N, the number of words in the reference transcription:
$$
\begin{aligned}
WER &= \frac{S + I + D}{N} \\
\end{aligned}
$$
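
As a worked example, with sentences made up for illustration: take the reference "the cat sat on the mat" (N = 6) and the prediction "the cat sit on the". The prediction contains one substitution ("sit" for "sat"), one deletion ("mat"), and no insertions, so:

$$
\begin{aligned}
WER &= \frac{1 + 0 + 1}{6} = \frac{2}{6} \approx 0.33 \\
\end{aligned}
$$

Because insertions are counted in the numerator but not in N, WER can exceed 1 when the prediction contains many extra words.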

### Character Error Rate (CER)
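The Character Error Rate (CER) is calculated in the same way as WER, but on characters instead of words: S, I, and D count substituted, inserted, and deleted characters, and N is the number of characters in the reference. The formula for CER is: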

$$
\begin{aligned}
CER &= \frac{S + I + D}{N} \\
\end{aligned}
$$


In [None]:
%pip install evaluate jiwer torchaudio


The following example illustrates how to compute WER and CER using Python.

The example has a word missing and a substitution in the prediction.


In [4]:
reference = "This is a test sentence."
prediction = "This is test sentenceWrong."

In [None]:
from evaluate import load

wer_metric = load("wer")

wer = wer_metric.compute(references=[reference], predictions=[prediction])
wer

0.4

In [11]:
print(f"Word Error Rate (WER): {wer:.3f} (lower is better)")
print(f"Accuracy: {1 - wer:.3f} (higher is better)")

Word Error Rate (WER): 0.400 (lower is better)
Accuracy: 0.600 (higher is better)


In [None]:
from evaluate import load

cer_metric = load("cer")

cer = cer_metric.compute(references=[reference], predictions=[prediction])

In [14]:
print(f"Character Error Rate (CER): {cer:.3f} (lower is better)")
print(f"Accuracy: {1 - cer:.3f} (higher is better)")

Character Error Rate (CER): 0.292 (lower is better)
Accuracy: 0.708 (higher is better)


CER is much more forgiving than WER, because small errors in the transcription are penalized less. CER measures character-level accuracy, while WER measures word-level accuracy. A single wrong character makes the whole word count as an error for WER, so a small mistake can raise WER a lot while barely affecting CER.
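
The gap between the two scores can be reproduced by hand. The sketch below recomputes both metrics for the example sentences without any library, assuming the usual Levenshtein definition of edit distance. `edit_distance`, `wer_by_hand` and `cer_by_hand` are illustrative names, not part of `evaluate` or `jiwer`.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (word lists or strings)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j - 1] + cost,  # match or substitution
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
            ))
        previous = current
    return previous[-1]


reference = "This is a test sentence."
prediction = "This is test sentenceWrong."

# Word level: compare lists of whitespace-separated words.
ref_words = reference.split()
word_errors = edit_distance(ref_words, prediction.split())
wer_by_hand = word_errors / len(ref_words)

# Character level: compare the raw strings, spaces included.
char_errors = edit_distance(reference, prediction)
cer_by_hand = char_errors / len(reference)

print(f"WER: {word_errors}/{len(ref_words)} = {wer_by_hand:.3f}")   # WER: 2/5 = 0.400
print(f"CER: {char_errors}/{len(reference)} = {cer_by_hand:.3f}")  # CER: 7/24 = 0.292
```

The same two mistakes (a dropped "a" and the extra "Wrong") cost 2 of 5 words but only 7 of 24 characters.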

However, CER is used less often than WER. CER only looks at the characters in the transcription and does not check whether whole words are correct, so it says little about how well the model handles the language itself, such as its vocabulary and grammar. We want to encourage the model to gain a better understanding of the language, and not just of the characters. This is why WER is more commonly used.


## Normalization
When normalizing a dataset for ASR, one typically removes casing and punctuation. This makes the speech recognition task easier, as the model does not have to learn to distinguish between different cases and punctuation marks, such as "Hello" versus "hello", or "Hello," versus "Hello". Normalizing both the references and the predictions before scoring usually gives a markedly lower WER, since casing and punctuation mismatches no longer count as errors.
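
To make the idea concrete, here is a minimal sketch of such a normalizer using only the standard library. `simple_normalize` is a hypothetical, simplified stand-in for illustration and does not reproduce the exact behavior of the Whisper normalizer.

```python
import re
import string


def simple_normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse whitespace.

    Simplified for illustration: it only knows ASCII punctuation and also
    strips apostrophes, so "don't" becomes "dont".
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


print(simple_normalize("Hello, World!"))  # hello world
print(simple_normalize("Hello,") == simple_normalize("hello"))  # True
```

After normalization, "Hello," and "hello" are the same string, so a metric such as WER no longer counts the difference as an error.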

In [15]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

prediction = " He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly is drawn from eating and its results occur most readily to the mind."
normalized_prediction = normalizer(prediction)

normalized_prediction

' he tells us that at this festive season of the year with christmas and roast beef looming before us similarly is drawn from eating and its results occur most readily to the mind '

In [16]:
reference = "HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND"
normalized_reference = normalizer(reference)

wer = wer_metric.compute(
    references=[normalized_reference], predictions=[normalized_prediction]
)
wer

0.03488372093023256

## Fine-tuning ASR Model

In [21]:
from transformers import pipeline
import torch

if torch.cuda.is_available():
    device = "cuda:0"
    torch_dtype = torch.float16
else:
    device = "cpu"
    torch_dtype = torch.float32

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    torch_dtype=torch_dtype,
    device=device,
)

Device set to use cpu


In [17]:
from huggingface_hub import notebook_login

notebook_login()


In [19]:
from datasets import load_dataset

common_voice_test = load_dataset(
    "mozilla-foundation/common_voice_13_0", "nn-NO", split="test"
)


In [23]:
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

all_predictions = []

# run streamed inference
for prediction in tqdm(
    pipe(
        KeyDataset(common_voice_test, "audio"),
        generate_kwargs={"task": "transcribe"},
        batch_size=32,
    ),
    total=len(common_voice_test),
):
    all_predictions.append(prediction["text"])


ImportError: torchaudio is required to resample audio samples in AutomaticSpeechRecognitionPipeline. The torchaudio package can be installed through: `pip install torchaudio`.

In [None]:
from evaluate import load

wer_metric = load("wer")

wer_ortho = 100 * wer_metric.compute(
    references=common_voice_test["sentence"], predictions=all_predictions
)
wer_ortho

In [None]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()

# compute normalised WER
all_predictions_norm = [normalizer(pred) for pred in all_predictions]
all_references_norm = [normalizer(label) for label in common_voice_test["sentence"]]

# filtering step to only evaluate the samples that correspond to non-zero references
all_predictions_norm = [
    all_predictions_norm[i]
    for i in range(len(all_predictions_norm))
    if len(all_references_norm[i]) > 0
]
all_references_norm = [
    all_references_norm[i]
    for i in range(len(all_references_norm))
    if len(all_references_norm[i]) > 0
]

wer = 100 * wer_metric.compute(
    references=all_references_norm, predictions=all_predictions_norm
)

wer
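
The filtering step matters because a reference that is empty after normalization has N = 0 words, so it cannot contribute a meaningful error rate. As a sketch of the same filter written in a single pass over paired lists, the example below uses made-up toy strings in place of `all_predictions_norm` and `all_references_norm`. All names in it are illustrative.

```python
# Toy stand-ins for the normalized predictions and references (made-up strings).
toy_predictions = ["hello world", "uh", "good morning"]
toy_references = ["hello world", "", "good morning all"]

# Keep only the pairs whose reference is non-empty, so the lists stay aligned.
kept_pairs = [
    (pred, ref)
    for pred, ref in zip(toy_predictions, toy_references)
    if len(ref) > 0
]
filtered_predictions = [pred for pred, _ in kept_pairs]
filtered_references = [ref for _, ref in kept_pairs]

print(filtered_predictions)  # ['hello world', 'good morning']
print(filtered_references)   # ['hello world', 'good morning all']
```

Filtering predictions and references together avoids any risk of the two lists drifting out of alignment.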