# Evaluation of ASR Systems Using WER

This notebook demonstrates the evaluation of Automatic Speech Recognition (ASR) systems using the Word Error Rate (WER) metric.

## Understanding Word Error Rate

Word Error Rate (WER) is a crucial metric used to evaluate the performance of an Automatic Speech Recognition (ASR) system. It measures how accurately the ASR system transcribes spoken language into text by comparing the machine-generated transcription to a human-generated reference transcription.

The formula for WER is:

$$ WER = \frac{S + D + I}{N} $$

Where:
- \( S \) represents the number of substitutions, which occur when a word from the reference is replaced by a different word in the hypothesis.
- \( D \) represents the number of deletions, where a word from the reference is missing in the hypothesis.
- \( I \) represents the number of insertions, where a word not present in the reference appears in the hypothesis.
- \( N \) is the total number of words in the reference transcription.

WER is the ratio of word-level errors (substitutions, deletions, and insertions) in the hypothesis to the total number of words in the reference, and it is usually reported as a percentage. A WER of 0% means the hypothesis matches the reference exactly, while a higher WER indicates more discrepancies between the two. Because insertions are counted against the reference length, WER can exceed 100%.
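As a worked illustration of the formula (the sentences here are made up for this example and are not part of the notebook's data), take the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat". One word is substituted ("sat" became "sit") and one is deleted (the second "the"), so:

$$ S = 1,\quad D = 1,\quad I = 0,\quad N = 6 \quad\Rightarrow\quad WER = \frac{1 + 1 + 0}{6} \approx 33.33\% $$

Insertions alone can push the ratio past 1. With the reference "hello" and the hypothesis "hello there world":

$$ S = 0,\quad D = 0,\quad I = 2,\quad N = 1 \quad\Rightarrow\quad WER = \frac{0 + 0 + 2}{1} = 200\% $$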

### Intuition behind WER

WER is designed to provide a simple yet effective way to quantify the performance of an ASR system. It is the word-level edit distance between the hypothesis and the reference, normalized by the reference length: the minimum number of word substitutions, deletions, and insertions needed to transform the hypothesis into the reference, divided by \( N \). Words are compared as whole tokens, so a word that is off by a single letter counts as one full error.

Additionally, analyzing the types of errors (substitutions, deletions, and insertions) can help developers fine-tune the ASR algorithms, focusing on reducing specific kinds of errors that are more prevalent or more impactful on the user experience.


Next, we take a look at calculating WER in Python. First, let's install `jiwer` for calculating the Word Error Rate.

```bash
pip install jiwer
```

```python
# Import necessary libraries
import numpy as np
import torch
import jiwer  # This is a simple library to calculate WER

# Ensure that PyTorch is installed
print(torch.__version__)
```

```
2.2.2
```


### Example Data

We will use a set of simulated transcriptions and their references to illustrate how WER is calculated.

```python
# Example data
references = [
    "hello world",
    "this is a test",
    "the quick brown fox jumps over the lazy dog"
]

hypotheses = [
    "helo world",
    "this is test",
    "the quick brown fox jump over the lazy"
]
```


```python
# Function to calculate WER
def calculate_wer(references, hypotheses):
    wer_scores = []
    for ref, hyp in zip(references, hypotheses):
        wer_score = jiwer.wer(ref, hyp)
        wer_scores.append(wer_score)
    return wer_scores

# Calculate WER for each pair
wer_scores = calculate_wer(references, hypotheses)

# Display the results
for i, score in enumerate(wer_scores):
    print(f"Reference {i+1}: {references[i]}")
    print(f"Hypothesis {i+1}: {hypotheses[i]}")
    print(f"WER: {score:.2%}\n")
```


```
Reference 1: hello world
Hypothesis 1: helo world
WER: 50.00%

Reference 2: this is a test
Hypothesis 2: this is test
WER: 25.00%

Reference 3: the quick brown fox jumps over the lazy dog
Hypothesis 3: the quick brown fox jump over the lazy
WER: 22.22%
```
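The scores above are single ratios and do not show which error types produced them. One way to see the split is a word-level edit-distance alignment with a backtrace. The following is a minimal pure-Python sketch, not how `jiwer` is implemented; `count_word_errors` is a hypothetical helper written for this illustration, and it assumes whitespace tokenization and a non-empty reference.

```python
def count_word_errors(reference: str, hypothesis: str) -> dict:
    """Align two transcripts word by word and count hits, S, D, and I."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = minimum number of word edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner and label each step.
    counts = {"hits": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            counts["hits" if same else "substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1

    errors = counts["substitutions"] + counts["deletions"] + counts["insertions"]
    counts["wer"] = errors / len(ref)
    return counts


# Same example pairs as above
pairs = [
    ("hello world", "helo world"),
    ("this is a test", "this is test"),
    ("the quick brown fox jumps over the lazy dog",
     "the quick brown fox jump over the lazy"),
]
for ref, hyp in pairs:
    print(count_word_errors(ref, hyp))
```

When several alignments have the same minimum cost, the backtrace here prefers a diagonal step (hit or substitution) over a deletion, and a deletion over an insertion. The total error count is the same either way, but the split between types can differ from what another tool reports.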



### Explanation of WER Results

#### Example 1:
- **Reference**: "hello world"
- **Hypothesis**: "helo world"
- **WER**: 50.00%

**Breakdown**:
- **Substitutions**: 1 ("hello" was transcribed as "helo")
- **Deletions**: 0 (Every reference word has a counterpart in the hypothesis)
- **Insertions**: 0 (No extra words were added)
- **Total Words in Reference (N)**: 2

WER compares whole words, so the misspelling "helo" does not match "hello" and counts as one substitution. With 1 error and 2 reference words:

$$ WER = \frac{1}{2} = 50\% $$

The WER is high because a single error weighs heavily in a two-word reference.

#### Example 2:
- **Reference**: "this is a test"
- **Hypothesis**: "this is test"
- **WER**: 25.00%

**Breakdown**:
- **Substitutions**: 0
- **Deletions**: 1 ("a" is missing in the hypothesis)
- **Insertions**: 0
- **Total Words in Reference (N)**: 4

The hypothesis is missing one word:

$$ WER = \frac{1}{4} = 25\% $$

One of the four reference words has to be restored to make the hypothesis match the reference, which gives a WER of 25%.

#### Example 3:
- **Reference**: "the quick brown fox jumps over the lazy dog"
- **Hypothesis**: "the quick brown fox jump over the lazy"
- **WER**: 22.22%

**Breakdown**:
- **Substitutions**: 1 ("jumps" was transcribed as "jump")
- **Deletions**: 1 ("dog" is missing in the hypothesis)
- **Insertions**: 0
- **Total Words in Reference (N)**: 9

There are two errors, one substitution and one deletion:

$$ WER = \frac{1 + 1}{9} = \frac{2}{9} \approx 22.22\% $$

Two of the nine reference words are in error: "jump" counts as a full substitution even though it differs from "jumps" by a single letter, and the final word "dog" was dropped.

#### Summary
WER quantifies transcription errors relative to the length of the reference. A low WER is desirable, indicating fewer errors relative to the total number of reference words. These examples show that even a single missing or misspelled word has a large effect on WER, especially in short sentences. Each error, whether a substitution, deletion, or insertion, carries the same weight in the calculation, so a one-letter misspelling costs as much as a completely wrong word.
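Because short sentences inflate per-sentence scores, it can be useful to compare the plain average of the per-sentence WERs with a pooled figure that divides the total number of errors by the total number of reference words. The sketch below does this in pure Python for the example data; `word_edit_distance` is a hypothetical helper written for this illustration (not part of `jiwer`), and whitespace tokenization is assumed.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions, and insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


references = [
    "hello world",
    "this is a test",
    "the quick brown fox jumps over the lazy dog",
]
hypotheses = [
    "helo world",
    "this is test",
    "the quick brown fox jump over the lazy",
]

pairs = [(r.split(), h.split()) for r, h in zip(references, hypotheses)]
errors = [word_edit_distance(r, h) for r, h in pairs]   # errors per sentence
lengths = [len(r) for r, _ in pairs]                    # reference words per sentence

mean_sentence_wer = sum(e / n for e, n in zip(errors, lengths)) / len(pairs)
pooled_wer = sum(errors) / sum(lengths)

print(f"Mean of per-sentence WERs: {mean_sentence_wer:.2%}")  # 32.41%
print(f"Pooled WER (4 errors / 15 words): {pooled_wer:.2%}")  # 26.67%
```

The pooled figure is lower here because the two-word sentence, with its 50% WER, contributes only 2 of the 15 reference words instead of a full third of the average.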
