NOTE: This notebook runs inference on your given datasets. Because inference does not require much computation, any default GPU runtime is sufficient, including the free Colab GPU, so you can run the notebook directly without spending your Colab credits.

#### **What is Word Error Rate (WER)**

**Word Error Rate (WER)** is a common metric used to evaluate the performance of automatic speech recognition (ASR) systems. It measures the percentage of words that were incorrectly predicted by the ASR system compared to a reference transcription. WER is calculated using the following formula:


#### *WER = (Substitutions + Deletions + Insertions) / Total Words in Reference*



- **Substitutions**: The number of words that were incorrectly transcribed as different words.
- **Deletions**: The number of words that were omitted from the transcription.
- **Insertions**: The number of extra words that were added to the transcription.

A lower WER indicates better ASR performance. For instance, a WER of 5% means that the system made 5 word errors (substitutions, deletions, or insertions) per 100 reference words. Because insertions count as errors, WER can exceed 100%.
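
As a minimal sketch of the formula above, the following computes the three error counts with a word-level edit distance. The helper name `wer_breakdown` is hypothetical and is not part of jiwer; it assumes whitespace tokenization and applies no text normalization.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Count substitutions, deletions, and insertions on one minimal alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Each cell holds (total_errors, substitutions, deletions, insertions).
    # Row 0: an empty reference needs j insertions to reach j hypothesis words.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, ref_word in enumerate(ref, start=1):
        # Column 0: an empty hypothesis means i reference words were deleted.
        cur = [(i, 0, i, 0)]
        for j, hyp_word in enumerate(hyp, start=1):
            if ref_word == hyp_word:
                cur.append(prev[j - 1])  # match, no extra cost
                continue
            t, s, d, n = prev[j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cur[j - 1]
            insertion = (t + 1, s, d, n + 1)
            cur.append(min(substitution, deletion, insertion))
        prev = cur

    total, s, d, n = prev[-1]
    return {"substitutions": s, "deletions": d, "insertions": n,
            "wer": total / len(ref)}


result = wer_breakdown("the quick brown fox jumps over the lazy dog",
                       "the quick brown fox jump over the lazy dogs")
print(result)
```

For this sentence pair the two changed words count as two substitutions over nine reference words. When several alignments share the same minimal cost, the split between the three error types can differ from what another tool reports, although the WER itself does not.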

## Example Calculation:

```python
from jiwer import wer

# Reference transcriptions (ground truth)
references = ["the quick brown fox jumps over the lazy dog"]

# Hypothesis transcriptions (ASR output)
hypotheses = ["the quick brown fox jump over the lazy dogs"]

# Calculate WER
error_rate = wer(references, hypotheses)
error_rate_percentage = error_rate * 100
print(f"Word Error Rate (WER): {error_rate_percentage:.2f}%")
```


**Installation and Setup Instructions**

This section covers the installation of various Python libraries and packages necessary for our ASR project using the Whisper model.

1. **datasets**: This library is used for handling and processing datasets. It provides easy access to a wide range of datasets and tools for processing them.

2. **transformers**: This library from Hugging Face allows us to use state-of-the-art models like Whisper for various NLP tasks, including ASR.

3. **librosa**: This library is used for audio and music processing. It helps in loading, analyzing, and extracting features from audio files.

4. **evaluate**: This library provides various evaluation metrics, which are essential for assessing the performance of our ASR model.

5. **jiwer**: This library is used for calculating Word Error Rate (WER), a common metric to evaluate the accuracy of ASR systems by comparing the predicted transcriptions to the reference transcriptions.

6. **ipython-autotime**: This extension helps measure the execution time of each code cell in Jupyter notebooks, which is useful for performance evaluation.
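
After installation, it can help to confirm which versions are actually present in the runtime. The sketch below uses only the standard library; `installed_versions` and `PACKAGES` are illustrative names, and the distribution names are assumed to match the list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the list above
PACKAGES = ["datasets", "transformers", "librosa", "evaluate", "jiwer",
            "ipython-autotime"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```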

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
project_path = "/content/drive/MyDrive/dissertation project"
os.chdir(project_path)
print("Change to the location:", os.getcwd())

Change to the location: /content/drive/MyDrive/dissertation project


In [None]:
# Uninstall the old version of datasets
!pip uninstall -y datasets

# Force-install a newer datasets again together with a matching fsspec
# !pip install -U "datasets>=2.14.6" "fsspec>=2023.9.2,<2023.10.0"

Found existing installation: datasets 2.14.4
Uninstalling datasets-2.14.4:
  Successfully uninstalled datasets-2.14.4


In [None]:
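# Version specifiers are quoted: unquoted, the shell treats ">" as output redirection and ignores the pin
!pip install "datasets>=2.6.1"
!pip install transformers==4.41.0
!pip install librosa
# Assumption: the original pin "0.30" is read as 0.3.0
!pip install "evaluate>=0.3.0"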
!pip install jiwer
!pip install ipython-autotime
%load_ext autotime

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.5.3.2 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cuda-cupti-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cuda-cupti-cu12 12.5.82 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cuda-nvrtc-cu12 12.5.82 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cuda-runtime-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cuda-runtime-cu12 12.5

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun Jun 22 19:43:37 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

This block of code checks if the GPU information contains the word 'failed'. If it does, it indicates that no GPU is connected. Otherwise, it prints the GPU information.
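
The string search above depends on the wording of the `nvidia-smi` error message. As a hedged alternative, the sketch below checks whether the tool exists and whether it exits successfully; `gpu_summary` is a hypothetical helper, and the query fields are standard `nvidia-smi` options.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per GPU (name, total memory), or None if no GPU is reachable."""
    if shutil.which("nvidia-smi") is None:
        return None  # the driver tools are not installed in this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # the tool ran but could not talk to a GPU
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "Not connected to a GPU")
```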

In [None]:
from transformers import pipeline
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from datasets import load_dataset, Audio, DatasetDict
import evaluate

time: 11.5 s (started: 2025-06-22 19:45:24 +00:00)


This section covers the imports of the Python libraries and modules needed for our ASR project using the Whisper model.

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the model and processor
# Example model ID: model_id = "huggingface_repo/model_path"
model_id = "liuh6/whisper-tiny_to_Chinese_accent"  # Path to the model you want to use for inference
# model_id = "openai/whisper-tiny"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


time: 7.1 s (started: 2025-06-22 19:45:43 +00:00)


The model_id variable specifies the Hugging Face Hub path of the Whisper checkpoint and is used to load both the model and the processor. This notebook loads the fine-tuned model "liuh6/whisper-tiny_to_Chinese_accent"; the commented-out alternative, "openai/whisper-tiny", is the original pre-trained model. To run inference with your own fine-tuned model, change the path, for example model_id = "huggingface_repo/finetuned_model".

In [None]:
# Move the model to GPU if available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 384)
      (layers): ModuleList(
        (0-3): 4 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=384, out_features=384, bias=False)
            (v_proj): Linear(in_features=384, out_features=384, bias=True)
            (q_proj): Linear(in_features=384, out_features=384, bias=True)
            (out_proj): Linear(in_features=384, out_features=384, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=384, out_features=1536, bias=True)
          (fc2): Linear(in_features=1536, out_features=384, bias=True)
          

time: 273 ms (started: 2025-06-22 19:45:53 +00:00)


This line moves the Whisper model to the specified device (either GPU or CPU) to ensure that computations are performed on the appropriate hardware.

In [None]:
from datasets import load_dataset

# Load the dataset
references = load_dataset("naharte/chinese_english", split="test")

# Split into 30% dev and 70% test
split_dataset = references.train_test_split(test_size=0.7, seed=42)

# The result contains two subsets
dev_set = split_dataset["train"]     # 30% as dev set
true_test_set = split_dataset["test"]  # 70% as test set

# Check how many samples each subset has
print(f"Dev set: {len(dev_set)} samples")
print(f"Test set: {len(true_test_set)} samples")


README.md:   0%|          | 0.00/436 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/233M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/82.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1558 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/629 [00:00<?, ? examples/s]

Dev set: 188 samples
Test set: 441 samples
time: 9.79 s (started: 2025-06-22 19:45:57 +00:00)


In [None]:
from datasets import Dataset

true_test_set = Dataset.from_parquet("data/train.parquet")
true_test_set[0]

Generating train split: 0 examples [00:00, ? examples/s]

{'audio': {'path': 'batch_outputs3/000_Anna_bought_a_snack_at_the_mal.wav',
  'array': array([-2.42650160e-04, -1.93879678e-04, -1.06603991e-04, ...,
         -3.41895975e-05, -6.97939031e-05,  1.48760382e-05]),
  'sampling_rate': 22050},
 'sentence': 'anna bought a snack at the mall'}

time: 1.45 s (started: 2025-06-22 20:08:59 +00:00)


In [None]:
from datasets import Audio

true_test_set = true_test_set.cast_column("audio", Audio(decode=False))


time: 3.07 ms (started: 2025-06-22 20:09:06 +00:00)


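This block loads the dataset with the identifier "naharte/chinese_english" and selects its test split. The loaded dataset is stored in the references variable and then divided into a 30% dev set and a 70% test set. If you want to use both the test and train splits for inference, you can pass "test+train" as the split parameter. The two cells that follow replace true_test_set with a local Parquet file (data/train.parquet) and cast the audio column with decode=False, so each example carries a file path rather than a decoded array.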

In [None]:
def transcribe_audio(batch):
    audio = batch["audio"]["array"]  # Ensure you access the raw audio array

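    # Convert audio to model input (this assumes the audio is already sampled at 16 kHz;
    # the version of transcribe_audio defined in a later cell resamples first)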
    inputs = processor(audio, return_tensors="pt", sampling_rate=16000)
    input_features = inputs.input_features.to(model.device)

    # Force Whisper to use English
    forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

    # Decode output
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    return transcription[0]

time: 1.13 ms (started: 2025-06-22 20:03:45 +00:00)


This function processes the audio data, generates predictions using the Whisper model, and returns the transcribed text.

In [None]:
import torchaudio
import torch

def resample_audio_array_to_16k(audio_array, original_sr):
    waveform = torch.tensor(audio_array).unsqueeze(0).float()  # [1, T]
    if original_sr != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=original_sr, new_freq=16000)
        waveform = resampler(waveform)
    return waveform.squeeze(0).numpy()  # Convert back to NumPy so it works with the processor

def transcribe_audio(batch):
    path = batch["audio"]["path"]
    try:
        waveform, sr = torchaudio.load(path)

        if sr != 16000:
            resampler = torchaudio.transforms.Resample(sr, 16000)
            waveform = resampler(waveform)

        inputs = processor(waveform.squeeze(0).numpy(), return_tensors="pt", sampling_rate=16000)
        input_features = inputs.input_features.to(model.device)
        forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

        with torch.no_grad():
            predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
        return transcription[0]

    except Exception as e:
        print(f"[Skipping bad file] {path} error: {e}")
        return "[ERROR]"



time: 3.26 ms (started: 2025-06-22 20:09:10 +00:00)


In [None]:
#hypotheses = references.map(lambda batch: {"transcription": transcribe_audio(batch)}, remove_columns=["audio"])

# hypotheses = true_test_set.map(lambda batch: {"transcription": transcribe_audio(batch)}, remove_columns=["audio"])
# references = true_test_set




# hypotheses = dev_set.map(lambda batch: {"transcription": transcribe_audio(batch)}, remove_columns=["audio"])
# references = dev_set


time: 519 µs (started: 2025-06-22 19:46:45 +00:00)


In [None]:
# Step 1: keep the original reference set first
references = true_test_set

# Step 2: then run recognition on the original data
hypotheses = true_test_set.map(
    lambda batch: {"transcription": transcribe_audio(batch)},
    remove_columns=["audio"]
)


Map:   0%|          | 0/600 [00:00<?, ? examples/s]

time: 5min 15s (started: 2025-06-22 20:09:14 +00:00)


In [None]:
print("Reference: ", references['sentence'][6])
print("Hypothesis: ", hypotheses['transcription'][6])

Reference:  the cat fixed the fan at noon
Hypothesis:   Can't fix to the plane that near
time: 1.78 ms (started: 2025-06-22 20:14:29 +00:00)


In [None]:
import re

# Define the cleaning function
def clean_text(batch):
    # Convert to lowercase
    text = batch["transcription"].lower()
    # Remove all punctuation
    text = re.sub(r"[^\w\s]", "", text)
    return {"transcription": text}

# Apply the cleaning function to hypotheses
hypotheses = hypotheses.map(clean_text)

Map:   0%|          | 0/600 [00:00<?, ? examples/s]

time: 48.6 ms (started: 2025-06-22 20:14:29 +00:00)


This block uses the map method from the datasets library to apply the transcribe_audio function to every example in true_test_set. Because map is called without batched=True, the lambda receives one example at a time (despite the parameter name batch), and the resulting transcription is stored in the transcription column. The remove_columns parameter drops the audio column from the resulting dataset. The clean_text step then lowercases each transcription and removes punctuation.
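
The cleaning cell above touches only the hypotheses. If the references could also contain capital letters or punctuation, the same normalization would need to be applied to both sides before scoring. A minimal sketch under that assumption follows; `normalize_text` is a hypothetical helper, and the reference string is given a capital letter and a full stop purely for illustration.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


# Illustrative reference (capital letter and full stop added for the example)
reference = "The cat fixed the fan at noon."
# Hypothesis as printed earlier in the notebook, including its leading space
hypothesis = " Can't fix to the plane that near"

print(normalize_text(reference))
print(normalize_text(hypothesis))
```

Applying one function to both columns keeps formatting differences from being counted as word errors.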

In [None]:
print("Hypothesis: ", hypotheses)

Hypothesis:  Dataset({
    features: ['sentence', 'transcription'],
    num_rows: 600
})
time: 637 µs (started: 2025-06-22 20:14:29 +00:00)


In [None]:
!pip install num2words

time: 4.26 s (started: 2025-06-22 20:14:29 +00:00)


In [None]:
from num2words import num2words
import re

def normalize_numbers(text):
    # First match the digits in the text and convert them to English words (hyphens become spaces)
    def replace_number(match):
        num = int(match.group())
        return num2words(num).replace('-', ' ')
    text = re.sub(r'\d+', replace_number, text)
    # Then remove punctuation and any other character that is not a letter, digit, or whitespace
    text = re.sub(r'[^\w\s]', '', text)
    return text

def normalize_fn(example):
    # Normalize the "transcription" field of each example (number normalization + punctuation removal)
    example["transcription"] = normalize_numbers(example["transcription"])
    return example

# hypotheses is assumed to be a Hugging Face Dataset with a "transcription" feature
hypotheses = hypotheses.map(normalize_fn)

# Print the normalized transcription of the example at index 88
print("Normalized Hypothesis[88]:", hypotheses[88]["transcription"])

Map:   0%|          | 0/600 [00:00<?, ? examples/s]

Normalized Hypothesis[88]:  weve her part in there
time: 59.1 ms (started: 2025-06-22 20:14:33 +00:00)


In [None]:
print("Reference: ", references['sentence'][5])
print("Hypothesis: ", hypotheses['transcription'][5])

Reference:  a man fixed the fan on monday
Hypothesis:   men seek to defend one woman
time: 2.95 ms (started: 2025-06-22 20:14:33 +00:00)


In [None]:
from jiwer import wer

# Calculate WER
ref_text = references["sentence"]
hyp_text = hypotheses["transcription"]

error_rate = wer(ref_text, hyp_text)
error_rate_percentage = error_rate * 100
print(f"Word Error Rate (WER): {error_rate_percentage:.2f}%")


Word Error Rate (WER): 75.65%
time: 11.3 ms (started: 2025-06-22 20:14:33 +00:00)


This code snippet calculates and displays the Word Error Rate between the reference and hypothesis transcriptions for your given accented dataset.
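
When `wer` receives lists of sentences, the errors are pooled over all reference words rather than averaged per sentence, so long sentences weigh more. The sketch below illustrates the difference on toy data with plain Python; `edit_distance`, `pooled_wer`, and `mean_sentence_wer` are hypothetical helpers, not jiwer functions.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions, and insertions."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous = current
    return previous[-1]


def pooled_wer(refs, hyps):
    """Total errors divided by total reference words (corpus level)."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def mean_sentence_wer(refs, hyps):
    """Unweighted average of the per-sentence WER values."""
    scores = [edit_distance(r.split(), h.split()) / len(r.split())
              for r, h in zip(refs, hyps)]
    return sum(scores) / len(scores)


toy_refs = ["a b c d", "e"]
toy_hyps = ["a b c d", "x"]
print(pooled_wer(toy_refs, toy_hyps))         # 1 error over 5 reference words
print(mean_sentence_wer(toy_refs, toy_hyps))  # average of 0.0 and 1.0
```

The two figures answer different questions, so it is worth stating which one is reported when comparing models.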

In [None]:
from jiwer import wer

# Collect the indices of sentences whose individual WER exceeds 50%.
# The reference text is stored in the "sentence" column.
Sentence_Attention = []
for i in range(len(references['sentence'])):
  ref = references['sentence'][i]
  hy = hypotheses['transcription'][i]
  error_rate = wer(ref, hy)
  error_rate_percentage = error_rate * 100
  if error_rate_percentage > 50:
    Sentence_Attention.append(i)
    print(f"Word Error Rate (WER): {error_rate_percentage:.2f}%")
    print("Reference: ", ref)
    print("Hypothesis: ", hy)
    print(i)
print(len(Sentence_Attention))

time: 30.1 ms (started: 2025-06-22 20:02:22 +00:00)


In [None]:
!pip install nltk

time: 11.1 s (started: 2025-06-19 15:19:01 +00:00)


In [None]:
import nltk

time: 1.96 s (started: 2025-06-19 15:19:12 +00:00)


In [None]:
nltk.download('cmudict')

[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


True

time: 165 ms (started: 2025-06-19 15:19:14 +00:00)


In [None]:
import re
from nltk.corpus import cmudict

cmu_dict = cmudict.dict()

contractions = {
    "you're": "you are", "thats": "that is", "didnt": "did not",
    "doesnt": "does not", "shes": "she is", "ive": "i have",
    "isnt": "is not", "youd": "you would", "hed": "he would",
    "theres": "there is", "dont": "do not", "wont": "will not",
    "cant": "cannot", "im": "i am", "wasnt": "was not"
}

def word_to_phonetics(word):
    """Convert a single word into its phoneme string."""
    word = word.lower()
    word = re.sub(r"[^\w']", '', word)  # Keep apostrophes so contractions can be handled

    # Expand contractions
    if word in contractions:
        expanded = contractions[word]
        return ' '.join([word_to_phonetics(w) for w in expanded.split()])

    if word in cmu_dict:
        return ' '.join(cmu_dict[word][0])  # Use only the first pronunciation
    else:
        return "<UNK>"


def phrase_to_phonetics(phrase):
    words = phrase.split()
    phonetic_words = [word_to_phonetics(word) for word in words]
    return '   '.join(phonetic_words)

time: 1.38 s (started: 2025-06-19 15:19:14 +00:00)


In [None]:
print(phrase_to_phonetics("You're amazing!"))

Y UW1 AA1 R   AH0 M EY1 Z IH0 NG
time: 1.8 ms (started: 2025-06-19 15:19:15 +00:00)


In [None]:
from jiwer import wer


Sentence_Attention=[]
for i in range(len(references['sentence'])):  # the reference text is in the "sentence" column
  ref=references['sentence'][i]
  hy=hypotheses['transcription'][i]
  error_rate = wer(ref, hy)
  error_rate_percentage = error_rate * 100
  if error_rate_percentage>50:
    Sentence_Attention.append(i)
    re_tra=phrase_to_phonetics(ref)
    hy_tra=phrase_to_phonetics(hy)
    print(f"Word Error Rate (WER): {error_rate_percentage:.2f}%")
    print("Reference: ", ref)
    print("Hypothesis: ", hy)
    print("Reference Phonetics: ", re_tra)
    print("Hypothesis Phonetics: ", hy_tra)
    print(i)
print(len(Sentence_Attention))

Word Error Rate (WER): 66.67%
Reference:  play sandino again
Hypothesis:   play some dinner again
Reference Phonetics:  P L EY1   S AE0 N D IY1 N OW0   AH0 G EH1 N
Hypothesis Phonetics:  P L EY1   S AH1 M   D IH1 N ER0   AH0 G EH1 N
25
Word Error Rate (WER): 66.67%
Reference:  id never seen anything like that
Hypothesis:   i dont ever see anything like that
Reference Phonetics:  IH1 D   N EH1 V ER0   S IY1 N   EH1 N IY0 TH IH2 NG   L AY1 K   DH AE1 T
Hypothesis Phonetics:  AY1   D UW1 N AA1 T   EH1 V ER0   S IY1   EH1 N IY0 TH IH2 NG   L AY1 K   DH AE1 T
70
Word Error Rate (WER): 75.00%
Reference:  modern materials allow us to do modern things
Hypothesis:   more than material allows us to do more than things
Reference Phonetics:  M AA1 D ER0 N   M AH0 T IH1 R IY0 AH0 L Z   AH0 L AW1   AH1 S   T UW1   D UW1   M AA1 D ER0 N   TH IH1 NG Z
Hypothesis Phonetics:  M AO1 R   DH AE1 N   M AH0 T IH1 R IY0 AH0 L   AH0 L AW1 Z   AH1 S   T UW1   D UW1   M AO1 R   DH AE1 N   TH IH1 NG Z
124
Word Er

In [None]:
import re

def remove_stress(phoneme):
    return re.sub(r'\d$', '', phoneme)

def align_sequences(ref, hyp):
    n = len(ref)
    m = len(hyp)
    dp = [[0]*(m+1) for _ in range(n+1)]
    backtrace = [[None]*(m+1) for _ in range(n+1)]

    for i in range(n+1):
        dp[i][0] = i
        backtrace[i][0] = 'del'
    for j in range(m+1):
        dp[0][j] = j
        backtrace[0][j] = 'ins'

    backtrace[0][0] = 'match'  # Starting point

    for i in range(1, n+1):
        for j in range(1, m+1):
            if ref[i-1] == hyp[j-1]:
                dp[i][j] = dp[i-1][j-1]
                backtrace[i][j] = 'match'
            else:
                del_cost = dp[i-1][j] + 1
                ins_cost = dp[i][j-1] + 1
                sub_cost = dp[i-1][j-1] + 1
                min_cost = min(del_cost, ins_cost, sub_cost)

                dp[i][j] = min_cost
                if min_cost == sub_cost:
                    backtrace[i][j] = 'sub'
                elif min_cost == del_cost:
                    backtrace[i][j] = 'del'
                else:
                    backtrace[i][j] = 'ins'

    # Backtrace to build the sequence of operations
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = backtrace[i][j]
        if op == 'match':
            ops.append(('match', ref[i-1], hyp[j-1]))
            i -= 1
            j -= 1
        elif op == 'sub':
            ops.append(('sub', ref[i-1], hyp[j-1]))
            i -= 1
            j -= 1
        elif op == 'del':
            ops.append(('del', ref[i-1], None))
            i -= 1
        elif op == 'ins':
            ops.append(('ins', None, hyp[j-1]))
            j -= 1
    ops.reverse()
    return ops


# Input data: convert every sentence to a phoneme string
# (the reference text is in the "sentence" column)
references = [phrase_to_phonetics(transcript) for transcript in references['sentence']]
hypotheses = [phrase_to_phonetics(transcript) for transcript in hypotheses['transcription']]

error_counts_ref = {}
error_counts_hyp = {}

for ref_str, hyp_str in zip(references, hypotheses):
    ref_phonemes = ref_str.split()
    hyp_phonemes = hyp_str.split()
    ops = align_sequences(ref_phonemes, hyp_phonemes)
    for op, ref_ph, hyp_ph in ops:
        if op in ["sub", "del"] and ref_ph is not None:
            ph = remove_stress(ref_ph)
            error_counts_ref[ph] = error_counts_ref.get(ph, 0) + 1
        if op in ["sub", "ins"] and hyp_ph is not None:
            ph = remove_stress(hyp_ph)
            error_counts_hyp[ph] = error_counts_hyp.get(ph, 0) + 1

ref_error_array = sorted(error_counts_ref.items(), key=lambda x: x[1], reverse=True)
hyp_error_array = sorted(error_counts_hyp.items(), key=lambda x: x[1], reverse=True)

print("Reference-side errors:")
print(ref_error_array)
print("Hypothesis-side errors:")
print(hyp_error_array)



Reference-side errors:
[('AH', 68), ('T', 62), ('IH', 55), ('D', 47), ('N', 44), ('AE', 43), ('Z', 40), ('R', 37), ('IY', 36), ('S', 36), ('DH', 32), ('EH', 31), ('ER', 30), ('AA', 28), ('AY', 24), ('L', 23), ('V', 23), ('EY', 19), ('K', 18), ('B', 17), ('AO', 17), ('UW', 17), ('<UNK>', 15), ('OW', 15), ('W', 15), ('P', 13), ('Y', 12), ('F', 11), ('AW', 9), ('M', 9), ('G', 7), ('TH', 7), ('JH', 7), ('HH', 6), ('NG', 5), ('UH', 5), ('CH', 5), ('SH', 4)]
Hypothesis-side errors:
[('AH', 102), ('IH', 73), ('T', 61), ('R', 44), ('L', 42), ('S', 39), ('IY', 38), ('D', 33), ('<UNK>', 33), ('AA', 33), ('DH', 32), ('N', 31), ('EH', 26), ('ER', 21), ('P', 20), ('V', 20), ('AE', 20), ('Z', 20), ('AO', 16), ('UW', 15), ('M', 13), ('EY', 13), ('K', 12), ('AW', 11), ('W', 11), ('Y', 9), ('OW', 8), ('F', 7), ('UH', 7), ('NG', 7), ('B', 6), ('HH', 6), ('AY', 6), ('JH', 5), ('TH', 3), ('ZH', 2), ('CH', 2), ('G', 2), ('SH', 1), ('OY', 1)]
time: 842 ms (started: 2025-06-19 15:19:19 +00:00)


In [None]:
from collections import Counter

# Counters for substitution statistics
sub_ref_counter = Counter()
sub_hyp_counter = Counter()
sub_pair_counter = Counter()

for ref_str, hyp_str in zip(references, hypotheses):
    ref_phonemes = ref_str.split()
    hyp_phonemes = hyp_str.split()
    ops = align_sequences(ref_phonemes, hyp_phonemes)
    for op, ref_ph, hyp_ph in ops:
        # Count substitutions only. error_counts_ref and error_counts_hyp were
        # already filled by the previous cell, so updating them again here
        # would double every count.
        if op == "sub":
            ref_clean = remove_stress(ref_ph)
            hyp_clean = remove_stress(hyp_ph)
            sub_ref_counter[ref_clean] += 1
            sub_hyp_counter[hyp_clean] += 1
            sub_pair_counter[(ref_clean, hyp_clean)] += 1

time: 760 ms (started: 2025-06-19 15:19:20 +00:00)


In [None]:
print("Most frequently substituted phonemes (reference side):")
print(sub_ref_counter.most_common(10))

print("Phonemes most often produced in error (hypothesis side):")
print(sub_hyp_counter.most_common(10))

print("Most common substitution pairs:")
print(sub_pair_counter.most_common(10))


Most frequently substituted phonemes (reference side):
[('AH', 40), ('IH', 39), ('T', 38), ('N', 34), ('AE', 34), ('D', 33), ('Z', 29), ('S', 28), ('IY', 26), ('R', 25)]
Phonemes most often produced in error (hypothesis side):
[('AH', 62), ('T', 39), ('IH', 35), ('S', 35), ('IY', 30), ('L', 29), ('<UNK>', 27), ('ER', 25), ('N', 25), ('D', 24)]
Most common substitution pairs:
[(('Z', 'S'), 17), (('T', 'D'), 15), (('S', 'Z'), 15), (('D', 'T'), 12), (('AE', 'AH'), 10), (('IH', 'IY'), 8), (('IH', 'AH'), 8), (('N', 'M'), 7), (('IY', 'IH'), 7), (('N', 'NG'), 7)]
time: 8.41 ms (started: 2025-04-22 20:23:12 +00:00)
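
Several of the top pairs above are the same confusion in both directions (for example Z→S and S→Z). As a small sketch, the counts can be merged per unordered pair to see which contrasts are confused most often overall. The dictionary below copies only the ten pairs printed above, and `merge_directions` is a hypothetical helper.

```python
from collections import Counter

# Top substitution pairs copied from the output above: (reference, hypothesis) -> count
sub_pairs = {
    ("Z", "S"): 17, ("T", "D"): 15, ("S", "Z"): 15, ("D", "T"): 12,
    ("AE", "AH"): 10, ("IH", "IY"): 8, ("IH", "AH"): 8, ("N", "M"): 7,
    ("IY", "IH"): 7, ("N", "NG"): 7,
}


def merge_directions(pair_counts):
    """Sum the counts of (a, b) and (b, a) under one unordered key."""
    merged = Counter()
    for (ref_ph, hyp_ph), count in pair_counts.items():
        merged[frozenset((ref_ph, hyp_ph))] += count
    return merged


merged = merge_directions(sub_pairs)
for pair, count in merged.most_common(3):
    print(" <-> ".join(sorted(pair)), count)
```

On the full `sub_pair_counter` the same function could be called as `merge_directions(sub_pair_counter)`, since a Counter also exposes `items()`.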
