<a href="https://colab.research.google.com/github/SEAS-CVN/SEAS-2025/blob/central-region-nlp/Projects/Central_Region_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install & Import libraries

Restart the Colab session after installation so that the newly installed package versions are loaded.

In [1]:
!pip install fsspec==2023.9.2
!pip install -q transformers accelerate datasets evaluate jiwer
!pip install librosa

Collecting fsspec==2023.9.2
  Downloading fsspec-2023.9.2-py3-none-any.whl.metadata (6.7 kB)
Downloading fsspec-2023.9.2-py3-none-any.whl (173 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2023.9.2 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.5.3.2 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cuda-cupti-cu12==12.4.127; platform_sy

In [2]:
import evaluate
from google.colab import drive
# `Audio` is not imported from `datasets` here: the name is used for IPython.display.Audio below
from datasets import load_dataset, Dataset, load_from_disk, concatenate_datasets
from pathlib import Path
import librosa
import os
import math
from typing import Literal, Any, Dict, List, Union
import pandas as pd
import numpy as np
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, TrainerControl, TrainerCallback
from dataclasses import dataclass
import torch
from IPython.display import Audio

# A. Central-region Dialectal Speech Recognition

## I. Data Preparation

In [3]:
# Mount Google Drive (a pop-up window will ask you to log in and authorize access)
drive.mount('/content/drive')

# Google Drive paths for the resampled dataset and for the fine-tuned model checkpoints
SAVE_PATH = '/content/drive/MyDrive/ViMD_Central_downsampled_16Hz'
OUTPUT_PATH = "/content/drive/MyDrive/phowhisper-vimd-ft"

Mounted at /content/drive


### 1. Load Dataset

The full dataset is too large to download and store in Colab, so we load it with [streaming](https://huggingface.co/docs/datasets/stream). Streaming lets us iterate over a large dataset one example at a time without first downloading the whole dataset. Because we only need the 'Central' region, we iterate over the stream, keep only the 'Central' examples, resample them to 16 kHz, and save them to Google Drive in chunks.
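
The filter-then-chunk pattern described above can be sketched with plain Python generators. This is only an illustration under stated assumptions: `filter_region`, `iter_chunks`, and the toy records are hypothetical stand-ins, not part of the `datasets` API or of the ViMD data.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def filter_region(examples: Iterable[Dict], region: str) -> Iterator[Dict]:
    """Lazily yield only the examples whose 'region' field matches."""
    return (ex for ex in examples if ex.get("region") == region)


def iter_chunks(examples: Iterable[Dict], chunk_size: int) -> Iterator[List[Dict]]:
    """Group a (possibly very large) iterable into lists of at most chunk_size items."""
    iterator = iter(examples)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Toy stand-in for the streamed dataset (hypothetical records, not real ViMD rows)
toy_stream = ({"id": i, "region": "Central" if i % 2 == 0 else "North"} for i in range(10))

for idx, chunk in enumerate(iter_chunks(filter_region(toy_stream, "Central"), chunk_size=2)):
    print(f"chunk_{idx}: {[ex['id'] for ex in chunk]}")
```

Only one chunk is held in memory at a time, which is the property that makes this pattern usable on a dataset that does not fit on disk.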

In [None]:
def resample_audio(audio_dict, target_sr=16000):
    """Resample one decoded audio example to `target_sr` Hz (returned unchanged if it already matches)."""
    original_sr = audio_dict["sampling_rate"]
    if original_sr == target_sr:
        return audio_dict
    y = np.asarray(audio_dict["array"])
    y_resampled = librosa.resample(y, orig_sr=original_sr, target_sr=target_sr)
    return {
        # Downcast from the original float64 to reduce the size on disk
        "array": y_resampled.astype(np.float16),
        "sampling_rate": target_sr,
        "path": audio_dict.get("path", None)
    }

In [None]:
def save_dataset_in_chunks(split:Literal['train','valid','test'], save_path:str):
    """
    Save a dataset in chunks to disk, with resume capability.

    Args:
        split (str): Dataset split to load ('train', 'valid', 'test')
        save_path (str): Path where chunks will be saved
    """
    # Setup to save in chunks with split subfolder
    split_path = os.path.join(save_path, split)
    Path(split_path).mkdir(parents=True, exist_ok=True)
    existing_chunks = {int(p.name.split("_")[1]) for p in Path(split_path).glob("chunk_*") if p.is_dir()}
    print(f"Existing chunks: {existing_chunks}")

    buffer = []
    CHUNK_SIZE = 300
    chunk_idx = 0
    num_examples = {'train':4705, 'test':623, 'valid':602}
    # Skip streaming if every expected chunk is already on disk
    if len(existing_chunks) != math.ceil(num_examples[split] / CHUNK_SIZE):
        # Stream the dataset and filter for 'Central' region
        streamed_dataset = load_dataset("nguyendv02/ViMD_Dataset", split=split, streaming=True)
        central_streamed = (ex for ex in streamed_dataset if ex.get("region") == "Central")

        # Save only new chunks, in case of interruption
        for example in central_streamed:
            example["audio"] = resample_audio(example["audio"])
            buffer.append(example)
            if len(buffer) >= CHUNK_SIZE:
                if chunk_idx not in existing_chunks:
                    Dataset.from_list(buffer).save_to_disk(os.path.join(split_path, f"chunk_{chunk_idx}"))
                    print(f"Saved chunk {chunk_idx} with {len(buffer)} examples")
                else:
                    print(f"Skipping existing chunk {chunk_idx}")
                buffer = []
                chunk_idx += 1

        # Save the final chunk separately, since it may hold fewer than CHUNK_SIZE examples
        if buffer:
            if chunk_idx not in existing_chunks:
                Dataset.from_list(buffer).save_to_disk(os.path.join(split_path, f"chunk_{chunk_idx}"))
                print(f"Saved final chunk {chunk_idx} with {len(buffer)} examples")
            else:
                print(f"Skipping existing final chunk {chunk_idx}")
    else:
        print(f'{split.capitalize()} data already loaded in full to {split_path}')
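
The completeness check in `save_dataset_in_chunks` compares the number of chunk folders on disk with `math.ceil(num_examples / CHUNK_SIZE)`. The arithmetic can be checked on its own with the split sizes hard-coded in the function; `expected_chunks` is a hypothetical helper introduced only for this sketch.

```python
import math

CHUNK_SIZE = 300
# Central-region split sizes, as hard-coded in save_dataset_in_chunks
num_examples = {"train": 4705, "valid": 602, "test": 623}


def expected_chunks(n_examples: int, chunk_size: int) -> int:
    """Number of chunk folders needed to hold n_examples."""
    return math.ceil(n_examples / chunk_size)


for split, n in num_examples.items():
    total = expected_chunks(n, CHUNK_SIZE)
    # Every chunk except the last one is full
    last = n - (total - 1) * CHUNK_SIZE
    print(f"{split}: {total} chunks, last chunk holds {last} examples")
```

Note that the check only counts folders. It does not verify that each saved chunk is complete.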

In [None]:
save_dataset_in_chunks("train", SAVE_PATH)

Existing chunks: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
Train data already loaded in full to /content/drive/MyDrive/ViMD_Central_downsampled_16Hz/train


In [None]:
save_dataset_in_chunks("valid", SAVE_PATH)

Existing chunks: {0, 1, 2}
Valid data already loaded in full to /content/drive/MyDrive/ViMD_Central_downsampled_16Hz/valid


In [None]:
save_dataset_in_chunks("test", SAVE_PATH)

Existing chunks: {0, 1, 2}
Test data already loaded in full to /content/drive/MyDrive/ViMD_Central_downsampled_16Hz/test


In [5]:
def load_dataset_chunks(split:Literal['train','valid','test'], save_path:str):
    """
    Load and combine all chunks for a dataset split.

    Args:
        split (str): Dataset split to load ('train', 'valid', 'test')
        save_path (str): Main path where split subfolders are located

    Returns:
        Dataset: Combined dataset from all chunks
    """
    split_path = os.path.join(save_path, split)
    all_chunks = []

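    # Sort by chunk index so that chunk_10 comes after chunk_9 rather than after chunk_1
    for chunk_dir in sorted(Path(split_path).glob("chunk_*"), key=lambda p: int(p.name.split("_")[1])):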
        if chunk_dir.is_dir():
            print(f"Loading {chunk_dir}")
            all_chunks.append(load_from_disk(str(chunk_dir)))

    # Combine into a single Dataset
    combined_dataset = concatenate_datasets(all_chunks)
    print(f"Total examples: {len(combined_dataset)}")

    return combined_dataset

In [7]:
# train_dataset = load_dataset_chunks("train", SAVE_PATH)
# val_dataset = load_dataset_chunks("valid", SAVE_PATH)
test_dataset = load_dataset_chunks("test", SAVE_PATH)

Loading /content/drive/MyDrive/ViMD_Central_downsampled_16Hz/test/chunk_0
Loading /content/drive/MyDrive/ViMD_Central_downsampled_16Hz/test/chunk_1
Loading /content/drive/MyDrive/ViMD_Central_downsampled_16Hz/test/chunk_2
Total examples: 623




### 3. Data Analysis

In [None]:
train_dataset.features

{'region': Value(dtype='string', id=None),
 'province_code': Value(dtype='int64', id=None),
 'province_name': Value(dtype='string', id=None),
 'filename': Value(dtype='string', id=None),
 'text': Value(dtype='string', id=None),
 'speakerID': Value(dtype='string', id=None),
 'gender': Value(dtype='int64', id=None),
 'audio': {'array': Sequence(feature=Value(dtype='float32', id=None), length=-1, id=None),
  'path': Value(dtype='string', id=None),
  'sampling_rate': Value(dtype='int64', id=None)}}

In [None]:
audio_array = train_dataset[0]['audio']['array']
sampling_rate = train_dataset[0]['audio']['sampling_rate']
print("Sampling Rate:",sampling_rate)
duration = len(audio_array) / sampling_rate
print(f"Duration: {duration:.2f} seconds")

Sampling Rate: 16000
Duration: 21.40 seconds


In [None]:
# can convert to pandas dataframe for easier analysis
train_df = train_dataset.to_pandas()

In [None]:
train_df.head()

| | region | province_code | province_name | filename | text | speakerID | gender | audio |
|---|---|---|---|---|---|---|---|---|
| 0 | Central | 36 | ThanhHoa | 36_0001.wav | Rất là tiện đấy ạ. thí dụ như là tôi muốn về t... | spk_36_0001 | 0 | {'array': [-0.004214028, -0.0073286155, -0.005... |
| 1 | Central | 36 | ThanhHoa | 36_0002.wav | Kiến nghị với các cơ quan chức năng nhà nước c... | spk_36_0002 | 1 | {'array': [0.0015753801, 0.0031400772, 0.00293... |
| 2 | Central | 36 | ThanhHoa | 36_0003.wav | Mình cũng đề nghị với các cấp các ngành tìm ra... | spk_36_0003 | 1 | {'array': [-0.0013298234, -0.0027826256, 0.000... |
| 3 | Central | 36 | ThanhHoa | 36_0004.wav | Hiện nay, thì một số cơ sở dịch vụ thẩm mỹ hoặ... | spk_36_0004 | 1 | {'array': [0.012952287, 0.0040204944, -0.01261... |
| 4 | Central | 36 | ThanhHoa | 36_0005.wav | Tuy nhiên đâu đó cũng đang còn chưa dứt điểm B... | spk_36_0004 | 1 | {'array': [0.0019082483, 0.003462267, 0.002922... |


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4705 entries, 0 to 4704
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   region         4705 non-null   object
 1   province_code  4705 non-null   int64 
 2   province_name  4705 non-null   object
 3   filename       4705 non-null   object
 4   text           4705 non-null   object
 5   speakerID      4705 non-null   object
 6   gender         4705 non-null   int64 
 7   audio          4705 non-null   object
dtypes: int64(2), object(6)
memory usage: 294.2+ KB


In [None]:
train_df.describe(include='object')

| | region | province_name | filename | text | speakerID | audio |
|---|---|---|---|---|---|---|
| count | 4705 | 4705 | 4705 | 4705 | 4705 | 4705 |
| unique | 1 | 19 | 4705 | 4683 | 3197 | 4705 |
| top | Central | NgheAn | 77_0242.wav | Chúng tôi đánh giá rất cao phong trào dân vận ... | spk_82_0073 | {'array': [-0.00044601632, -0.0007242712, -0.0... |
| freq | 4705 | 280 | 1 | 3 | 9 | 1 |


In [None]:
audio_lengths = train_df['audio'].apply(lambda x: len(x['array']))
sampling_rates = train_df['audio'].apply(lambda x: x['sampling_rate'])
durations = audio_lengths / sampling_rates
print(f"Average duration: {durations.mean():.2f} seconds")

Average duration: 19.11 seconds


### 2. Data Pre-processing

## II. Model Finetuning

### 1. Load Pre-trained Model

Nguyen to do: introduce PhoWhisper

In [None]:
processor = AutoProcessor.from_pretrained("vinai/PhoWhisper-base")
model = AutoModelForSpeechSeq2Seq.from_pretrained("vinai/PhoWhisper-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 2. Model Settings

Nguyen to do: find a way to speed up `prepare_batch`

In [None]:
def prepare_batch(batch):
    # Log-Mel input features computed from the raw waveform
    audio = batch["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"])
    batch["input_features"] = inputs["input_features"][0]

    # Token ids of the reference transcript, used as labels
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch


# Drop the original columns so that only input_features and labels remain
train_ds = train_dataset.map(prepare_batch, remove_columns=train_dataset.column_names, num_proc=1)
val_ds = val_dataset.map(prepare_batch, remove_columns=val_dataset.column_names, num_proc=1)



In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_PATH,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    warmup_steps=300,
    max_steps=1000,
    gradient_accumulation_steps=2,
    eval_strategy="steps",
    eval_steps=300,
    save_steps=300,
    logging_steps=100,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    generation_max_length=256,
    report_to="none",
    load_best_model_at_end=True
)
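
To read these settings, it helps to work out how many examples each optimizer step consumes. The sketch below assumes a single GPU (the usual Colab setup) and uses two hypothetical helpers. The epoch count is an approximation, and the value the Trainer reports can differ slightly because of how it counts batches.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_epochs(steps: int, batch_size: int, dataset_size: int) -> float:
    """Approximate number of passes over the training set after `steps` optimizer steps."""
    return steps * batch_size / dataset_size


# Values taken from the training arguments and the training set size above
batch = effective_batch_size(per_device_batch=8, grad_accum_steps=2)
print("Effective batch size:", batch)
print(f"Approx. epochs at max_steps=1000: {approx_epochs(1000, batch, 4705):.2f}")
```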

In [None]:
@dataclass
class CustomDataCollator:
    """Pad audio features and label ids separately, then mask the label padding for the loss."""
    processor: Any
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Inputs and labels have different lengths, so each is padded by its own component
        input_features = [{"input_features": f["input_features"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Replace label padding with -100 so that the loss ignores those positions
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels
        return batch

In [None]:
data_collator = CustomDataCollator(processor=processor)
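
The label handling inside the collator can be illustrated without tensors. The sketch below is a pure-Python approximation with a hypothetical helper, `pad_and_mask_labels`: it pads every label sequence to the longest one in the batch, then replaces the padded positions with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_and_mask_labels(label_seqs: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad label sequences and replace the padded positions with IGNORE_INDEX."""
    max_len = max(len(seq) for seq in label_seqs)
    masked = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded = seq + [pad_id] * n_pad
        attention_mask = [1] * len(seq) + [0] * n_pad
        # Mask by position, not by value, so a real token equal to pad_id is kept
        masked.append([tok if keep else IGNORE_INDEX for tok, keep in zip(padded, attention_mask)])
    return masked


print(pad_and_mask_labels([[5, 6, 7], [8]], pad_id=0))
```

Masking by attention mask rather than by comparing against the pad id matters when the pad token and the end-of-sequence token share an id, because the real end-of-sequence label is then preserved.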

### Evaluation metric: Word Error Rate (WER)

**Word Error Rate (WER)** is a common evaluation metric for Automatic Speech Recognition (ASR) systems. It quantifies the difference between the predicted transcription and the ground truth by computing:

$$
\text{WER} = \frac{S + D + I}{N}
$$

where:

- **Substitutions (S)**: wrong words
- **Deletions (D)**: missing words
- **Insertions (I)**: extra words
- **N**: Total number of words in the reference

#### Example
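- Reference: `tôi đang học lập trình`  
- Prediction: `tôi học lập trình`  
- WER = 1 deletion / 5 words = **20%** (WER tools split on whitespace, so each space-separated Vietnamese syllable counts as one word)

A WER closer to **0%** means better transcription quality. In this notebook, a WER below **30%** is used as the target for fine-tuning PhoWhisper on Central-region dialect speech.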
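
The formula above can be sketched as a word-level edit distance. This is a minimal per-sentence illustration with hypothetical helper names, not the implementation used by the `evaluate` metric, which aggregates errors over the whole evaluation set.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(S + D + I) / N for whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over six reference words
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.3f}")
```

Because insertions count as errors, WER can exceed 100% when the prediction is much longer than the reference.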


In [None]:
class StopOnWERCallback(TrainerCallback):
    def __init__(self, threshold=0.3):
        self.threshold = threshold

    def on_evaluate(self, args, state, control, metrics, **kwargs):
        wer = metrics.get("eval_wer", None)
        if wer is not None and wer < self.threshold:
            print(f"\n WER {wer:.3f} < {self.threshold} — stopping training early!")
            control.should_training_stop = True
        return control

In [None]:
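wer = evaluate.load("wer")

def compute_metrics(pred):
    pad_id = processor.tokenizer.pad_token_id
    # -100 marks ignored/padded positions and cannot be decoded reliably; swap in the pad token id
    pred_ids = np.where(pred.predictions == -100, pad_id, pred.predictions)
    label_ids = np.where(pred.label_ids == -100, pad_id, pred.label_ids)

    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    return {"wer": wer.compute(predictions=pred_str, references=label_str)}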



### 3. Finetuning

In [None]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[StopOnWERCallback(threshold=0.3)]
)

trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Step | Training Loss | Validation Loss | Wer |
|---|---|---|---|
| 300 | 0.6375 | 0.616321 | 0.327232 |
| 600 | 0.4341 | 0.510005 | 0.285027 |


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



 WER 0.285 < 0.3 — stopping training early!


TrainOutput(global_step=600, training_loss=0.900865732828776, metrics={'train_runtime': 2098.947, 'train_samples_per_second': 7.623, 'train_steps_per_second': 0.476, 'total_flos': 6.207101632512e+17, 'train_loss': 0.900865732828776, 'epoch': 2.033955857385399})

### 4. Result Analysis (with a data sample)

## III. Inference (DEMO)

In [29]:
sample = test_dataset[100]
audio = sample["audio"]["array"]
print(sample['province_name'])
sampling_rate = sample["audio"]["sampling_rate"]
Audio(audio, rate=sampling_rate)

HaTinh


In [40]:
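# Load a fine-tuned checkpoint; the folder name must match one that exists under OUTPUT_PATH (checkpoint-<step>)
processor = AutoProcessor.from_pretrained(OUTPUT_PATH + '/checkpoint-500')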
model = AutoModelForSpeechSeq2Seq.from_pretrained(OUTPUT_PATH + '/checkpoint-500')
model.eval()

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 512, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(512, 512, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 512)
      (layers): ModuleList(
        (0-5): 6 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=False)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          

In [41]:
inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
# Pass forced_decoder_ids=None so that the forced decoder ids stored in generation_config are not applied
with torch.no_grad():
    generated_ids = model.generate(
        inputs["input_features"],
        forced_decoder_ids=None,
        max_length=256
    )

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)


`generation_config` default values have been modified to match model-specific defaults: {'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50358, 50359, 50360, 50361, 50362], 'begin_suppress_tokens': [220, 50257]}. If this is not desired, please set these values explicitly.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTok

Transcription: lắc thải tối về tối về là nó đồng đấy thôi mua ý thủi ảnh ngượng sức khỏe trong cây dấn với nói chung gia súc gia cầm thì nước chảy quá nó uống không được cho sạch.


# B. Downstream Tasks with Texts
---
(1) Dialectizing Vietnamese Standard Texts | (2) Standardizing Dialectal Texts

## I. Data Preparation

### 1. Load Dataset

### 2. Central-region Dictionary Collection

### 3. Synthesize Parallel Data

APPROACH 01: Rule-based Transformation

APPROACH 02: GPT-based Transformation

### 4. Data Analysis

## II. Model Finetuning

### 1. Load Pre-trained Model

### 2. Model Settings

### 3. Finetuning

### 4. Result Analysis (with a data sample)

## III. Inference (DEMO)