## Prepare Environment

We first mount Google Drive, switch to the project directory, and enable module autoreloading. We then verify that we've been assigned a GPU and view its specifications:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
project_path = "/content/drive/MyDrive/dissertation project/dissertation_note"
os.chdir(project_path)
print("Change to the location:", os.getcwd())

Change to the location: /content/drive/MyDrive/dissertation project/dissertation_note


In [None]:
# Autoreload setup (you don't need to edit this cell); instructions to:
#   i) enable autoreloading of modules
%load_ext autoreload
#  ii) import the modules 'sp' (speed perturbation) and 'pp' (pitch perturbation) in an autoreloadable way
%aimport sp
%aimport pp
# iii) indicate that we want autoreloading to happen on every evaluation.
%autoreload 1


In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Jun 19 14:30:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   42C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

We'll employ several popular Python packages to fine-tune the Whisper model.
We'll use `datasets` to download and prepare our training data and
`transformers` to load and train our Whisper model. We'll also require
the `soundfile` package to pre-process audio files, and `evaluate` and `jiwer` to
assess the performance of our model. Finally, the `autotime` extension (from `ipython-autotime`) reports how long each notebook cell takes to run, which gives a good idea of how much time the Whisper fine-tuning takes.

In [None]:
#!pip uninstall -y datasets

In [None]:
# Uninstall the old version of datasets
!pip uninstall -y datasets

# Force-reinstall a newer datasets together with a matching fsspec
!pip install -U "datasets>=2.14.6" "fsspec>=2023.9.2,<2023.10.0"

Found existing installation: datasets 3.6.0
Uninstalling datasets-3.6.0:
  Successfully uninstalled datasets-3.6.0
Collecting datasets>=2.14.6
  Using cached datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Using cached datasets-3.6.0-py3-none-any.whl (491 kB)
Installing collected packages: datasets
Successfully installed datasets-3.6.0


In [None]:
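# Quote version specifiers: unquoted, the shell treats '>' as an output redirect
!pip install "datasets>=2.6.1"
!pip install transformers==4.49.0
!pip install librosa
# '0.30' parses as minor version 30, which no evaluate release matches; 0.3.0 is assumed to be the intended minimum
!pip install "evaluate>=0.3.0"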
!pip install jiwer
!pip install ipython-autotime
!pip install nltk
!pip install accelerate -U
# !apt-get install -y sox
# !pip install --upgrade torchaudio
%load_ext autotime

time: 646 µs (started: 2025-06-19 14:31:16 +00:00)
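Because several of the packages above are pinned, upgraded, or reinstalled, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed in the cell above
for name in ["datasets", "transformers", "evaluate", "jiwer", "accelerate"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a printed version does not match the pin, restarting the Colab runtime after installation is usually needed before the new version is importable.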


## Log in to Hugging Face

In [None]:
from huggingface_hub import notebook_login

notebook_login()


time: 488 ms (started: 2025-06-19 14:31:29 +00:00)


## Load Dataset

In [None]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("naharte/chinese_english", split="train")
common_voice["test"] = load_dataset("naharte/chinese_english", split="test")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


time: 10.3 s (started: 2025-06-19 14:31:41 +00:00)


In [None]:
common_voice = common_voice.rename_column("transcription", "sentence")  # rename the 'transcription' column to 'sentence'
print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 1558
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 629
    })
})
time: 4.32 ms (started: 2025-06-19 14:32:01 +00:00)


## Use VITS-Generated Dataset

In [None]:
from datasets import Dataset

time: 1.45 ms (started: 2025-06-15 22:31:21 +00:00)


In [None]:
common_voice['train'] = Dataset.from_parquet("data/train.parquet")

time: 10.1 ms (started: 2025-06-15 22:31:23 +00:00)


In [None]:
common_voice['train'][0]

{'audio': {'path': 'batch_outputs1/000_Anna_delivers_a_little_agenda_.wav',
  'array': array([-2.07609773e-04, -2.11950712e-04, -1.82910095e-04, ...,
         -2.51644906e-05,  3.89215202e-05,  1.62629833e-04]),
  'sampling_rate': 22050},
 'sentence': 'anna delivers a little agenda in the area'}

time: 1.91 s (started: 2025-06-15 22:31:25 +00:00)


## Prepare Feature Extractor, Tokenizer and Data

The ASR pipeline can be decomposed into three stages:

1) A feature extractor which pre-processes the raw audio inputs

2) The model which performs the sequence-to-sequence mapping

3) A tokenizer which post-processes the model outputs to text format


### Load WhisperFeatureExtractor

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

time: 1.05 s (started: 2025-06-19 14:32:09 +00:00)
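To give a feel for what the feature extractor does before computing log-Mel features, here is a minimal pure-Python sketch of the pad-or-truncate step. It assumes Whisper's standard settings (16 kHz audio, fixed 30-second windows, a hop of 160 samples between spectrogram frames); the function name `pad_or_trim` is illustrative only, and the real `WhisperFeatureExtractor` additionally computes the log-Mel spectrogram itself.

```python
SAMPLING_RATE = 16_000   # Hz, the rate Whisper expects
CHUNK_SECONDS = 30       # Whisper operates on fixed 30-second windows
HOP_LENGTH = 160         # samples between successive spectrogram frames


def pad_or_trim(samples, target_len=SAMPLING_RATE * CHUNK_SECONDS):
    """Zero-pad or truncate a list of samples to exactly `target_len` entries."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


clip = [0.1] * (SAMPLING_RATE * 5)   # a 5-second dummy clip
fixed = pad_or_trim(clip)
n_frames = len(fixed) // HOP_LENGTH
print(len(fixed), n_frames)          # 480000 3000
```

Every clip therefore yields an input of the same shape regardless of its original duration, which is why the audio inputs need no further length-dependent padding later on.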


### Load WhisperTokenizer

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", task="transcribe")

time: 444 ms (started: 2025-06-19 14:32:11 +00:00)


### Combine To Create A WhisperProcessor

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny",  task="transcribe")

time: 3.32 s (started: 2025-06-19 14:35:20 +00:00)


### Prepare Data

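Whisper expects audio sampled at 16 kHz, so we first cast the audio column to that
sampling rate. We then print the first example of the accented dataset to see
what form the data is in: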

In [None]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

time: 6.04 ms (started: 2025-06-19 14:32:13 +00:00)


In [None]:
print(common_voice['train'][0]['audio'])


{'path': 'G00021S1053.wav', 'array': array([-0.00024414, -0.00033569, -0.00030518, ..., -0.00018311,
       -0.00027466, -0.00036621]), 'sampling_rate': 16000}
time: 1.67 s (started: 2025-06-19 14:32:16 +00:00)


In [None]:
!apt-get install -y sox
!pip install --upgrade torchaudio


^C
ERROR: Operation cancelled by user
time: 4.18 s (started: 2025-05-10 16:32:24 +00:00)


## Perform PHaPS

In [None]:
from phoneme_selector import select_top_phoneme_samples

top_samples = select_top_phoneme_samples(common_voice['train'], top_k=600)

[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.


time: 4.68 s (started: 2025-06-13 16:41:47 +00:00)


In [None]:
top_600_indices = [sample['index'] for sample in top_samples]
# Select the corresponding samples
sub_training_set = common_voice['train'].select(top_600_indices)

time: 5.4 ms (started: 2025-06-13 16:41:51 +00:00)


In [None]:
sub_training_set

Dataset({
    features: ['audio', 'sentence'],
    num_rows: 600
})

time: 2.46 ms (started: 2025-06-13 16:41:51 +00:00)


In [1]:
import os
import torch
import torchaudio
from collections import defaultdict

# 1. Build speaker_groups: the first six characters of the file name identify the speaker
speaker_groups = defaultdict(list)
for idx, sample in enumerate(sub_training_set):
    speaker_id = sample['audio']['path'][:6]
    speaker_groups[speaker_id].append(idx)

# 2. Build the speaker_id -> spk1..spk5 mapping
sorted_speakers = sorted(speaker_groups.keys())  # keep the ordering deterministic
speaker_map = {spk_id: f"spk{i+1}" for i, spk_id in enumerate(sorted_speakers)}

# 3. Set the output directory
base_dir = "/content/drive/MyDrive/dissertation project/vits_ready_data"
os.makedirs(base_dir, exist_ok=True)

# 4. Save the data of each speaker separately
for spk_id, indices in speaker_groups.items():
    spk_dir = os.path.join(base_dir, speaker_map[spk_id])
    os.makedirs(spk_dir, exist_ok=True)

    metadata_path = os.path.join(spk_dir, "metadata.csv")
    with open(metadata_path, "w", encoding="utf-8") as f:
        for i, idx in enumerate(indices):
            sample = sub_training_set[idx]
            waveform = sample["audio"]["array"]
            sample_rate = sample["audio"]["sampling_rate"]
            text = sample["sentence"].strip()

            # Save the audio
            wav_name = f"{i:04d}.wav"
            wav_path = os.path.join(spk_dir, wav_name)

            # Convert to a float32 tensor (the decoded array may be float64)
            waveform = torch.as_tensor(waveform, dtype=torch.float32)

            torchaudio.save(wav_path, waveform.unsqueeze(0), sample_rate)

            # Write one "file|transcript" line to metadata.csv
            f.write(f"{wav_name}|{text}\n")

print("All speaker data saved successfully!")


NameError: name 'sub_training_set' is not defined

## Perform Speed Perturbation

In [None]:
import os
os.environ['LD_LIBRARY_PATH'] = '/usr/lib/x86_64-linux-gnu'
from sp import speed_perturb_dataset

import torch

subset_SP_voice=speed_perturb_dataset(sub_training_set)


Map:   0%|          | 0/600 [00:00<?, ? examples/s]

time: 1h 35min 43s (started: 2025-05-22 16:00:48 +00:00)
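The `sp` module itself is not shown in this notebook. As a rough illustration of the idea, the sketch below implements resampling-based speed perturbation with linear interpolation in pure Python. This is an assumption about the general technique, not the contents of `speed_perturb_dataset`, and the function name `speed_perturb` is hypothetical. The factors 0.9 and 1.1 match the variants used in this notebook.

```python
def speed_perturb(samples, factor):
    """Change playback speed by `factor` using linear interpolation.

    factor > 1 shortens the clip (faster speech); factor < 1 lengthens it.
    Pitch shifts together with tempo, as in resampling-based speed perturbation.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor                        # fractional position in the input
        lo = min(int(pos), len(samples) - 1)    # sample to the left
        hi = min(lo + 1, len(samples) - 1)      # sample to the right
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


clip = [float(i) for i in range(1000)]
slow = speed_perturb(clip, 0.9)   # 0.9x speed: longer clip
fast = speed_perturb(clip, 1.1)   # 1.1x speed: shorter clip
print(len(slow), len(clip), len(fast))   # 1111 1000 909
```

A production implementation would normally use a band-limited resampler (for example through `torchaudio` or `sox`) rather than linear interpolation, which is used here only to keep the sketch dependency-free.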


In [None]:
from datasets import Dataset
import torchaudio
import numpy as np

new_rows = []

for item in subset_SP_voice:
    # Original audio
    new_rows.append({
        'audio_path': None,
        'audio_array': np.array(item['audio']['array'], dtype=np.float32),  # force float32
        'sampling_rate': item['audio']['sampling_rate'],
        'sentence': item['sentence'],
        'source': 'original'
    })
    # 0.9x speed
    waveform, sr = torchaudio.load(item['audio_sp09'])
    new_rows.append({
        'audio_path': item['audio_sp09'],
        'audio_array': waveform.squeeze().numpy().astype(np.float32),  # force float32
        'sampling_rate': sr,
        'sentence': item['sentence'],
        'source': 'sp09'
    })
    # 1.1x speed
    waveform, sr = torchaudio.load(item['audio_sp11'])
    new_rows.append({
        'audio_path': item['audio_sp11'],
        'audio_array': waveform.squeeze().numpy().astype(np.float32),  # force float32
        'sampling_rate': sr,
        'sentence': item['sentence'],
        'source': 'sp11'
    })

# Create the new Dataset
flat_dataset = Dataset.from_list(new_rows)

# Print to check
print(flat_dataset)


Dataset({
    features: ['audio_path', 'audio_array', 'sampling_rate', 'sentence', 'source'],
    num_rows: 1800
})
time: 24.3 s (started: 2025-05-22 17:43:24 +00:00)


In [None]:
SP_flat_dataset = flat_dataset.shuffle(seed=42)

time: 10.5 ms (started: 2025-05-22 17:49:28 +00:00)


In [None]:
# Save the shuffled dataset created in the previous cell
SP_flat_dataset.save_to_disk("my_SP_flat_dataset_PHaPS")

Saving the dataset (0/2 shards):   0%|          | 0/1800 [00:00<?, ? examples/s]

time: 7.9 s (started: 2025-05-22 17:49:47 +00:00)


In [None]:
from datasets import Dataset

SP_dataset_PHaPS = Dataset.load_from_disk("my_SP_flat_dataset_PHaPS")
print(SP_dataset_PHaPS)

Dataset({
    features: ['audio_path', 'audio_array', 'sampling_rate', 'sentence', 'source'],
    num_rows: 1800
})
time: 25.1 ms (started: 2025-06-04 18:04:20 +00:00)


## Perform Pitch Perturbation

In [None]:
from pp import pitch_perturb_dataset
subset_PP_voice=pitch_perturb_dataset(common_voice["train"].select(range(600)))

Map:   0%|          | 0/600 [00:00<?, ? examples/s]

KeyboardInterrupt: 

time: 12.7 s (started: 2025-05-13 15:01:34 +00:00)


In [None]:
subset_PP_voice

Dataset({
    features: ['audio', 'sentence', 'audio_p-2', 'audio_pu2'],
    num_rows: 600
})

time: 4.03 ms (started: 2025-05-10 16:03:06 +00:00)
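The two extra columns hold pitch-shifted copies of each clip at -2 and +2 semitones. To make the size of that shift concrete, the sketch below converts a semitone offset into a frequency ratio, assuming standard 12-tone equal temperament. The helper name `semitone_ratio` is illustrative and not part of the `pp` module.

```python
def semitone_ratio(n_semitones: float) -> float:
    """Frequency ratio for a shift of `n_semitones` in 12-tone equal temperament."""
    return 2.0 ** (n_semitones / 12.0)


for shift in (-2, +2):
    print(f"{shift:+d} semitones -> x{semitone_ratio(shift):.4f}")
# -2 semitones -> x0.8909
# +2 semitones -> x1.1225
```

In other words, the perturbed copies move every frequency down by roughly 11% or up by roughly 12%, while the transcript stays the same.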


In [None]:
from flatten_dataset import flatten_dataset


variant_dict = {
    'pp-2': 'audio_p-2',           # pitch-perturb -2 semitones
    'pp+2': 'audio_pu2',           # pitch-perturb +2 semitones
}

flat_ds = flatten_dataset(subset_PP_voice, variant_map=variant_dict)

print(flat_ds)
# >>> Dataset({
#         features: ['audio_path', 'audio_array', 'sampling_rate', 'sentence', 'source'],
#         num_rows: <original rows × (1 + number of valid augmentations)>
#     })


NameError: name 'subset_PP_voice' is not defined

time: 2.47 s (started: 2025-05-14 16:47:23 +00:00)
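The `flatten_dataset` helper is imported from a local module that is not shown here. The following is a minimal sketch of what such a flattening step could look like on plain Python dictionaries, under the assumption that each row yields one "original" entry plus one entry per augmented variant that is actually present. The function name `flatten_rows` and the toy rows are hypothetical.

```python
def flatten_rows(rows, variant_map):
    """Emit one row per original clip plus one per available augmented variant.

    `rows` is a list of dicts with an 'audio' entry and a 'sentence';
    `variant_map` maps a source label to the column holding that variant.
    """
    flat = []
    for row in rows:
        flat.append({"audio": row["audio"], "sentence": row["sentence"], "source": "original"})
        for label, column in variant_map.items():
            if row.get(column) is None:   # skip variants that are missing
                continue
            flat.append({"audio": row[column], "sentence": row["sentence"], "source": label})
    return flat


variant_map = {"pp-2": "audio_p-2", "pp+2": "audio_pu2"}
toy_rows = [
    {"audio": "a.wav", "sentence": "hello", "audio_p-2": "a_dn.wav", "audio_pu2": "a_up.wav"},
    {"audio": "b.wav", "sentence": "world", "audio_p-2": "b_dn.wav", "audio_pu2": "b_up.wav"},
]
flat = flatten_rows(toy_rows, variant_map)
print(len(flat))   # 6
```

With 600 source rows and two valid variants each, the same rule gives 600 × 3 = 1800 rows.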


In [None]:
PP_flat_dataset = flat_ds.shuffle(seed=44)

time: 13.4 ms (started: 2025-05-10 16:20:09 +00:00)


In [None]:
PP_flat_dataset.save_to_disk("my_PP_flat_dataset")

NameError: name 'PP_flat_dataset' is not defined

In [None]:
from datasets import Dataset

PP_dataset = Dataset.load_from_disk("my_PP_flat_dataset")
print(PP_dataset)

Dataset({
    features: ['audio_path', 'audio_array', 'sampling_rate', 'sentence', 'source'],
    num_rows: 1800
})


## Prepare Dataset

In [None]:
# def prepare_dataset(batch):
#     # load and resample audio data from 48 to 16kHz
#     audio = batch["audio"]

#     # compute log-Mel input features from input audio array
#     batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

#     # encode target text to label ids
#     batch["labels"] = tokenizer(batch["sentence"]).input_ids
#     return batch




def prepare_dataset(batch):
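    # Check whether the row comes from the original dataset or from a flattened, augmented one
    if "audio" in batch and batch["audio"] is not None:
        # Original dataset: the audio is stored in a nested 'audio' dict
        audio_array = batch["audio"]["array"]
        sampling_rate = batch["audio"]["sampling_rate"]
    else:
        # Flattened SP/PP dataset: the audio is stored in top-level columns
        audio_array = batch["audio_array"]
        sampling_rate = batch["sampling_rate"]

    # compute log-Mel input features from the audio array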
    batch["input_features"] = feature_extractor(
        audio_array, sampling_rate=sampling_rate
    ).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids

    return batch


time: 2.62 ms (started: 2025-06-19 14:32:39 +00:00)


In [None]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)

Map (num_proc=2):   0%|          | 0/1558 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/629 [00:00<?, ? examples/s]

time: 1min 19s (started: 2025-06-19 14:32:41 +00:00)


In [None]:
SP_dataset_PHaPS = SP_dataset_PHaPS.map(prepare_dataset, remove_columns = SP_dataset_PHaPS.column_names, num_proc=2)

NameError: name 'SP_dataset_PHaPS' is not defined

time: 28.5 ms (started: 2025-06-15 20:08:05 +00:00)


In [None]:
PP_dataset = PP_dataset.map(prepare_dataset, remove_columns = PP_dataset.column_names, num_proc=2)

## Training and Evaluation

### Define a Data Collator

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is prepended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

time: 3.53 ms (started: 2025-06-19 14:34:01 +00:00)
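The label handling in the collator can be illustrated without any tensors. The sketch below pads label sequences to a common length with -100 (the value that PyTorch's cross-entropy loss ignores by default) and strips a leading BOS token when every sequence starts with one. The function name `pad_labels` and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_labels(label_seqs, bos_token_id=None):
    """Right-pad label sequences with IGNORE_INDEX and optionally drop a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [list(seq) + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Drop the first token only if every sequence starts with the BOS token
    if bos_token_id is not None and all(seq and seq[0] == bos_token_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


batch = pad_labels([[1, 5, 6], [1, 7]], bos_token_id=1)
print(batch)   # [[5, 6], [7, -100]]
```

Padding positions marked with -100 contribute nothing to the loss, so short transcripts are not penalized for the padding added to match longer ones in the batch.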


Initialise the data collator we've just defined:

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

time: 2.66 ms (started: 2025-06-19 14:35:39 +00:00)


### Evaluation Metrics

We evaluate the model with the word error rate (WER), loaded from the `evaluate` library:

In [None]:
import evaluate
metric = evaluate.load("wer")

time: 4.23 s (started: 2025-06-19 14:35:42 +00:00)


Define a function that decodes the predictions and labels and returns the WER as a percentage:

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

time: 2.14 ms (started: 2025-06-19 14:35:46 +00:00)
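For intuition about the number `compute_metrics` reports, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. Below is a minimal single-sentence sketch in pure Python. It is not the implementation used by `evaluate` or `jiwer`, and the hypothesis string is invented for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "anna delivers a little agenda in the area"
hypothesis = "anna delivers little agenda in the aria"   # one deletion, one substitution
print(100 * word_error_rate(reference, hypothesis))       # 25.0
```

Two errors against eight reference words give a WER of 25%. Note that WER can exceed 100% when the hypothesis contains many insertions.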


### Load a Pre-Trained Checkpoint

Load the pre-trained `openai/whisper-tiny` checkpoint:

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

time: 1.05 s (started: 2025-06-19 14:35:46 +00:00)


Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)):

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

time: 3.28 ms (started: 2025-06-19 14:35:49 +00:00)


### Define the Training Configuration

Set a seed to make the experiment reproducible:

In [None]:
from transformers import set_seed
set_seed(42)

time: 6.49 ms (started: 2025-06-19 14:35:51 +00:00)


In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny_to_Chinese_accent",  # change to a repo name of your choice
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=1500,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    seed=42,
)

time: 28 ms (started: 2025-06-19 14:36:36 +00:00)




Define a subset that contains 600 samples

In [None]:

# ---- Random baseline: the first 600 training samples ----
train_subset = common_voice["train"].select(range(600))
print(train_subset.shape)


# ---- PHaPS ----
# top_600_indices = [sample['index'] for sample in top_samples]
# # Select the corresponding samples
# sub_training_set = common_voice['train'].select(top_600_indices)

# # Print to confirm
# print(sub_training_set)


# ---- Data augmentation ----

(600, 2)
time: 6.75 ms (started: 2025-06-19 14:36:39 +00:00)
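With the training arguments and the 600-sample subset in place, a quick back-of-the-envelope check shows how many passes over the data the run makes and how the learning rate evolves. The sketch assumes a single GPU and the default linear schedule with warmup; both helper names are hypothetical.

```python
def epochs_covered(max_steps, per_device_batch_size, grad_accum_steps, num_samples, num_devices=1):
    """Number of passes over the training set implied by the step budget."""
    samples_seen = max_steps * per_device_batch_size * grad_accum_steps * num_devices
    return samples_seen / num_samples


def linear_warmup_decay(step, base_lr, warmup_steps, max_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


print(epochs_covered(max_steps=1500, per_device_batch_size=2,
                     grad_accum_steps=1, num_samples=600))   # 5.0

for step in (250, 500, 1000, 1500):
    print(step, linear_warmup_decay(step, base_lr=1e-5, warmup_steps=500, max_steps=1500))
```

So 1500 steps at an effective batch size of 2 amount to five epochs over the 600 samples, with the learning rate peaking at step 500 and reaching zero at step 1500.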


In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_subset,
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)



time: 266 ms (started: 2025-06-19 14:36:41 +00:00)


We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:

In [None]:
processor.save_pretrained(training_args.output_dir)

[]

time: 387 ms (started: 2025-06-19 14:36:44 +00:00)


### Training

In [None]:
trainer.train()

| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|-----------|
| 500  | 0.2024        | 0.364832        | 16.425735 |
| 1000 | 0.0151        | 0.358419        | 16.791104 |
| 1500 | 0.0035        | 0.366914        | 16.18745  |


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=1500, training_loss=0.25792478333910307, metrics={'train_runtime': 1287.8782, 'train_samples_per_second': 2.329, 'train_steps_per_second': 1.165, 'total_flos': 7.385665536e+16, 'train_loss': 0.25792478333910307, 'epoch': 5.0})

time: 21min 48s (started: 2025-06-19 14:36:46 +00:00)
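Since the run sets `load_best_model_at_end=True`, `metric_for_best_model="wer"` and `greater_is_better=False`, the checkpoint reloaded at the end is the one with the lowest evaluation WER. The selection rule can be sketched on the logged values from this run; the helper `best_checkpoint` is illustrative and not part of the `Trainer` API.

```python
# Evaluation results logged during this run (step, validation loss, WER)
eval_log = [
    {"step": 500, "eval_loss": 0.364832, "wer": 16.425735},
    {"step": 1000, "eval_loss": 0.358419, "wer": 16.791104},
    {"step": 1500, "eval_loss": 0.366914, "wer": 16.18745},
]


def best_checkpoint(log, metric="wer", greater_is_better=False):
    """Pick the logged evaluation with the best value of `metric`."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda row: row[metric])


print(best_checkpoint(eval_log)["step"])                      # 1500
print(best_checkpoint(eval_log, metric="eval_loss")["step"])  # 1000
```

The choice of metric matters here: validation loss is lowest at step 1000, but WER is lowest at step 1500, so selecting on WER keeps the final checkpoint.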


You will receive a different WER for each multi-accented dataset. You can compare the performance of your fine-tuned model with that of other, non-fine-tuned models using the Whisper-Inference notebook.

In [None]:
kwargs = {
    "dataset_tags": "naharte/chinese_english",  # dataset ID on the Hugging Face Hub (the dataset loaded earlier)
    "dataset": "Chinese English",       # Pretty name for the training dataset
    "dataset_args": "config: default, split: test",
    "language": "en",                # ISO 639-1 code for English
    "model_name": "Whisper tiny Chinese with pitch perturbation",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}


time: 4.22 ms (started: 2025-06-19 14:59:11 +00:00)


The training results can now be uploaded to the Hub. To do so, execute the `push_to_hub` command and save the preprocessor object we created:

In [None]:
trainer.push_to_hub(**kwargs)

CommitInfo(commit_url='https://huggingface.co/liuh6/whisper-tiny_to_Chinese_accent/commit/79d3154d7d105d63399af5c8e5ef13acbd0246a9', commit_message='End of training', commit_description='', oid='79d3154d7d105d63399af5c8e5ef13acbd0246a9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/liuh6/whisper-tiny_to_Chinese_accent', endpoint='https://huggingface.co', repo_type='model', repo_id='liuh6/whisper-tiny_to_Chinese_accent'), pr_revision=None, pr_num=None)

time: 13.7 s (started: 2025-06-19 14:59:14 +00:00)
