# Deep Learning Project Scratch

## Preprocessing
- Choosing frequency (known as *sampling rate* and measured in Hz). "Sampling with a higher sampling rate results in a better approximation of the continuous speech signal, but also requires storing more values per second"
- **Whisper feature extractor** expects audio at a sampling rate of 16kHz, padded or truncated to a length of 30s
  - Whisper expects a log-Mel spectrogram as input
  - Do a sample preprocessing on a single training example? Show that spectrogram
- Then, use the WhisperTokenizer to give the index within the particular vocabulary
  - As in the fine-tuning article, encode and decode to verify tokenizer working correctly
  - The tokenizer uses byte-pair encoding (BPE)
- **WhisperProcessor** wraps the WhisperFeatureExtractor and the WhisperTokenizer, so we only need the processor
- Data Preparation: need to down- or upsample input audio to be 16kHz
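
The resampling and pad-or-truncate steps in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: `resample_linear` and `pad_or_truncate` are hypothetical helpers, linear interpolation stands in for a proper resampler (it has no anti-aliasing filter), and in the actual pipeline the feature extractor and the `datasets` audio decoding would do this work.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample by linear interpolation (illustrative only, no anti-aliasing)."""
    if orig_rate == target_rate or not samples:
        return list(samples)
    n_out = int(round(len(samples) * target_rate / orig_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_rate / target_rate   # position in the original signal
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def pad_or_truncate(samples, sampling_rate=16_000, max_seconds=30):
    """Cut to at most max_seconds of audio, then zero-pad up to exactly that length."""
    target_len = sampling_rate * max_seconds
    clipped = list(samples[:target_len])
    return clipped + [0.0] * (target_len - len(clipped))


one_second_48k = [0.0] * 48_000                      # one second of silence at 48kHz
audio_16k = resample_linear(one_second_48k, 48_000, 16_000)
window = pad_or_truncate(audio_16k)                  # fixed 30s window at 16kHz
print(len(audio_16k), len(window))
```

A higher sampling rate means more values per second, which is why the 30s window at 16kHz always holds 16,000 x 30 samples regardless of how long the original clip was.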

## Model
- pretrained model: Tagalog results from original paper

|          | Translation (BLEU)   | Transcription (% WER) |
| -------- | -------------------- | --------------------- |
| tiny     | 0.8                  | 65.6                  |
| base     | 2.1                  | 45.8                  |
| small    | 12.0                 | 27.7                  |
| medium   | 20.5                 | 19.1                  |
| large    | 22.7                 | 15.8                  |
| large-v2 | 24.4                 | 13.8                  |

*Note: translation and transcription results are on the Fleurs dataset.*


- Large-v2 scores best and tiny scores worst on both tasks. We could probably get away with small, as the absolute WER improvements start to taper off going from small to medium
- possibly sticking with just transcription task (although I'm unsure if we can make it stay with just one task)
- If training with small is fast enough, let's move up to medium
- *What the heck is the stem and convolutional layer doing at the start of the encoder?*
- Keeping the training epochs relatively low as this is a small dataset (1,884 samples in the training set); will stick with 10 epochs
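
To make the "tapering off" in the notes above concrete, here is a small sketch that computes the absolute WER drop between consecutive model sizes, using only the numbers from the table. The variable names are illustrative.

```python
# Transcription WER (%) per model size, copied from the table above.
wer_by_size = {
    "tiny": 65.6,
    "base": 45.8,
    "small": 27.7,
    "medium": 19.1,
    "large": 15.8,
    "large-v2": 13.8,
}

sizes = list(wer_by_size)
wer_drops = {}
for smaller, larger in zip(sizes, sizes[1:]):
    # Absolute WER improvement (percentage points) from moving up one size.
    drop = wer_by_size[smaller] - wer_by_size[larger]
    wer_drops[f"{smaller} -> {larger}"] = round(drop, 1)

for step, drop in wer_drops.items():
    print(f"{step:>18}: {drop:5.1f} points")
```

Each step up buys fewer points than the one before it, which is the trade-off behind choosing small or medium.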

## Communication
- probably using Streamlit instead of Gradio
- in the spirit of the original paper, maybe training on one dataset and testing on another

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [None]:
!pip install --upgrade pip
!pip install --upgrade datasets[audio] transformers accelerate evaluate jiwer tensorboard streamlit
# !pip install torch torchvision
# !pip install gradio

Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.3.1
Collecting transformers
  Downloading transformers-4.47.0-py3-none-any.whl.metadata (43 kB)
Collecting accelerate
  Downloading accelerate-1.2.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jiwer
  Downloading jiwer-3.0.5-py3-none-any.whl.metadata (2.7 kB)
Collecting tensorboard
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting streamlit
  Downloading streamlit-1.40.2-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting dataset

In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Preprocessing

### Importing Fleurs Dataset

"Training sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is used and "unit error rate" (characters, signs) of all languages is averaged."

`"fil_ph"` is the code name for Tagalog with ~1.36 GB of data.

https://huggingface.co/datasets/google/fleurs

As will be shown, the dataset contains records for train, validation, and test.

In [None]:
from datasets import load_dataset
fleurs = load_dataset("google/fleurs", "fil_ph", trust_remote_code=True)

# from torch.utils.data import DataLoader
# dataloader = DataLoader(fleurs, batch_size=32)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


README.md:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

fleurs.py:   0%|          | 0.00/12.5k [00:00<?, ?B/s]

train.tar.gz:   0%|          | 0.00/1.39G [00:00<?, ?B/s]

dev.tar.gz:   0%|          | 0.00/367M [00:00<?, ?B/s]

test.tar.gz:   0%|          | 0.00/889M [00:00<?, ?B/s]

data/fil_ph/train.tsv:   0%|          | 0.00/1.23M [00:00<?, ?B/s]

data/fil_ph/dev.tsv:   0%|          | 0.00/278k [00:00<?, ?B/s]

data/fil_ph/test.tsv:   0%|          | 0.00/676k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
fleurs

DatasetDict({
    train: Dataset({
        features: ['id', 'num_samples', 'path', 'audio', 'transcription', 'raw_transcription', 'gender', 'lang_id', 'language', 'lang_group_id'],
        num_rows: 1884
    })
    validation: Dataset({
        features: ['id', 'num_samples', 'path', 'audio', 'transcription', 'raw_transcription', 'gender', 'lang_id', 'language', 'lang_group_id'],
        num_rows: 418
    })
    test: Dataset({
        features: ['id', 'num_samples', 'path', 'audio', 'transcription', 'raw_transcription', 'gender', 'lang_id', 'language', 'lang_group_id'],
        num_rows: 964
    })
})

In [None]:
# !pip install --force-reinstall -v numpy==2.0.1
!pip uninstall numpy -y

Found existing installation: numpy 2.0.1
Uninstalling numpy-2.0.1:
  Successfully uninstalled numpy-2.0.1


In [None]:
!pip install numpy==2.0.1

Collecting numpy==2.0.1
  Using cached numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.2 MB)
Installing collected packages: numpy
ERROR: Could not install packages due to an OSError: [Errno 122] Disk quota exceeded

In [None]:
import numpy as np
"NumPy version:", np.__version__

('NumPy version:', '1.26.4')

In [None]:
# load audio sample on the fly
audio_input = fleurs["train"][0]["audio"]  # first decoded audio sample
transcription = fleurs["train"][0]["transcription"]  # first transcription
raw_transcription = fleurs["train"][0]["raw_transcription"]  # first raw transcription
# use `audio_input` and `transcription` to fine-tune your model for ASR

In [None]:
audio_input

{'path': 'train/10004409263773767431.wav',
 'array': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        4.61935997e-05, 2.56896019e-05, 3.20672989e-05]),
 'sampling_rate': 16000}

In [None]:
transcription

'napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog'

In [None]:
raw_transcription

'Napapalibutan ng mga dagat ang Turkey sa tatlong panig: ang Dagat ng Aegean sa gawing kanluran, Dagat na Itim sa gawing hilaga at ang Dagat ng Mediterranean sa gawing timog.'

As you can see, the raw_transcription is the non-normalized text, which includes capitals and punctuation.
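
As a rough illustration of the difference between the two columns, the sketch below lowercases the raw text and strips punctuation. `simple_normalize` is a hypothetical helper, not the normalization Fleurs actually applies; it merely reproduces the normalized string for this one example.

```python
import re


def simple_normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (illustrative only)."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(no_punct.split())


raw = (
    "Napapalibutan ng mga dagat ang Turkey sa tatlong panig: ang Dagat ng Aegean "
    "sa gawing kanluran, Dagat na Itim sa gawing hilaga at ang Dagat ng "
    "Mediterranean sa gawing timog."
)
normalized = simple_normalize(raw)
print(normalized)
```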

In [None]:
language_class = fleurs["train"][0]["lang_id"]  # first id class
language = fleurs["train"].features["lang_id"].names[language_class]

print(language)

fil_ph


We remove all columns except audio samples (audio) and corresponding transcribed text (transcription). I will also keep the raw_transcription column to see if I can get anything out of it or we could use it for checks.

In [None]:
fleurs = fleurs.remove_columns([
    'id','num_samples','path','gender','lang_id','language','lang_group_id'
])

### Load WhisperFeatureExtractor

In the future: I'd like to extract the log-Mel spectrogram here and plot it.
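
Until that plot exists, the shapes involved can at least be worked out by hand. The sketch below assumes a hop length of 160 samples for the feature extractor (an assumption, not something this notebook prints); the 80 mel bins and the two convolution settings are taken from the `Conv1d` layers in the model printout later in this notebook.

```python
SAMPLING_RATE = 16_000   # Hz, required by Whisper
CHUNK_SECONDS = 30       # every input is padded or truncated to 30s
HOP_LENGTH = 160         # assumption: samples between spectrogram frames
N_MELS = 80              # matches Conv1d(80, 768, ...) in the encoder


def conv1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


n_samples = SAMPLING_RATE * CHUNK_SECONDS
n_frames = n_samples // HOP_LENGTH

# conv1: kernel 3, stride 1, padding 1 keeps the number of frames.
after_conv1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
# conv2: kernel 3, stride 2, padding 1 halves the number of frames.
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)

print(f"log-Mel input: {N_MELS} x {n_frames}")
print(f"encoder positions after the conv stem: {after_conv2}")
```

Under these assumptions the stem's output length lines up with the encoder's `Embedding(1500, 768)` position table, which suggests the two convolutions mainly project the mel bins to the model width and halve the time resolution before the transformer layers.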



In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

preprocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

### Load WhisperTokenizer

The Whisper tokenizer maps text to a sequence of token ids that index the Whisper vocabulary. It uses byte-pair encoding (BPE), so a token is often a piece of a word rather than a whole word.
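
For intuition, here is a toy sketch of how byte-pair encoding learns merges: it repeatedly joins the most frequent pair of adjacent symbols. This is a character-level simplification with hypothetical helper names; the real Whisper tokenizer works on bytes and ships with a pre-learned merge table (`merges.txt`).

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair across all words, or None."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words


corpus = ["dagat", "dagat", "dagat", "gawing", "gawing"]
merges, segmented = learn_bpe(corpus, num_merges=1)
print(merges)
print(segmented[0], segmented[3])
```

In this tiny corpus the pair `g` + `a` occurs most often, so it becomes the first merged token and both words end up sharing it.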

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Tagalog", task="transcribe"
)

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/836k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

We do a little sanity check to see whether the tokenizer round-trips text correctly. The tokenizer adds special tokens such as <|startoftranscript|> and <|endoftext|>, along with tokens that specify the language and task. If the skip_special_tokens argument of decode is set to True, only the sentence is returned without the special tokens, allowing us to compare the original string with our decoded string:

In [None]:
input_str = transcription
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

Input:                 napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog
Decoded w/ special:    <|startoftranscript|><|tl|><|transcribe|><|notimestamps|>napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog<|endoftext|>
Decoded w/out special: napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog
Are equal:             True
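
The wrapping seen in the output above can be mimicked at the string level. This sketch is only an illustration of the token layout, not how the tokenizer is implemented (the real tokenizer works with token ids); `add_special_tokens` and `strip_special_tokens` are hypothetical helpers.

```python
import re

# Matches markers of the form <|...|>, as seen in the decoded output above.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def add_special_tokens(text, language="tl", task="transcribe"):
    """Wrap text in the same prefix and suffix shown in the decoded output."""
    prefix = f"<|startoftranscript|><|{language}|><|{task}|><|notimestamps|>"
    return f"{prefix}{text}<|endoftext|>"


def strip_special_tokens(text):
    """Remove every <|...|> marker, leaving only the sentence."""
    return SPECIAL_TOKEN.sub("", text)


sentence = "napapalibutan ng mga dagat ang turkey"
wrapped = add_special_tokens(sentence)
print(wrapped)
print(strip_special_tokens(wrapped) == sentence)
```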


### Load WhisperProcessor
The WhisperProcessor wraps the WhisperFeatureExtractor and the WhisperTokenizer, so we only need to use the processor.

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small",
    language="Tagalog",
    task="transcribe"
)

We can verify the processor does the same thing by running the following:

In [None]:
labels = processor.tokenizer(input_str).input_ids
decoded_with_special = processor.tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = processor.tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

Input:                 napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog
Decoded w/ special:    <|startoftranscript|><|tl|><|transcribe|><|notimestamps|>napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog<|endoftext|>
Decoded w/out special: napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog
Are equal:             True


In [None]:
print(fleurs["train"])

Dataset({
    features: ['id', 'num_samples', 'path', 'audio', 'transcription', 'raw_transcription', 'gender', 'lang_id', 'language', 'lang_group_id'],
    num_rows: 1884
})


In [None]:
print(fleurs["train"][0])

{'id': 432, 'num_samples': 254400, 'path': '/root/.cache/huggingface/datasets/downloads/extracted/8e688797ccd97677d806828c4ae01c17015237a916ab943710d3d095889d13aa/10004409263773767431.wav', 'audio': {'path': 'train/10004409263773767431.wav', 'array': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
       4.61935997e-05, 2.56896019e-05, 3.20672989e-05]), 'sampling_rate': 16000}, 'transcription': 'napapalibutan ng mga dagat ang turkey sa tatlong panig ang dagat ng aegean sa gawing kanluran dagat na itim sa gawing hilaga at ang dagat ng mediterranean sa gawing timog', 'raw_transcription': 'Napapalibutan ng mga dagat ang Turkey sa tatlong panig: ang Dagat ng Aegean sa gawing kanluran, Dagat na Itim sa gawing hilaga at ang Dagat ng Mediterranean sa gawing timog.', 'gender': 1, 'lang_id': 25, 'language': 'Filipino', 'lang_group_id': 5}


We can note that the Fleurs dataset is already sampled at 16kHz which is the sampling rate required by Whisper, so we do not need to resample the audio.

In [None]:
fleurs["train"][0]["audio"]

{'path': 'train/10004409263773767431.wav',
 'array': array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        4.61935997e-05, 2.56896019e-05, 3.20672989e-05]),
 'sampling_rate': 16000}

### Prepare Dataset function

This function transforms the examples on the fly, applying each of the steps above through the processor's feature extractor and tokenizer.

This is for use in model training.

In [None]:
def prepare_dataset(batch):
    # load the audio (Fleurs is already sampled at 16kHz, so no resampling is needed)
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch


fleurs = fleurs.map(
    prepare_dataset,
    remove_columns=fleurs.column_names["train"], # removes all cols except input_features and labels
    num_proc=4
)

Map (num_proc=4):   0%|          | 0/1884 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/418 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/964 [00:00<?, ? examples/s]

In [None]:
fleurs

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 1884
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 418
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 964
    })
})

## Training and Evaluation



### Whisper Small Checkpoint

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
    # .to(device)

# setting this prevents model from predicting the incorrect language
model.generation_config.language = "tagalog"
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None

config.json:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.87k [00:00<?, ?B/s]

### Data Collator

A data collator is needed to pad input features and labels to the length of the longest entry in the batch (NOT the entire dataset), reminiscent of the masked softmax we learned in lecture for Seq2Seq.

According to the [huggingface transformers docs](https://huggingface.co/docs/transformers/tasks/asr#preprocess), there is no data collator for ASR, so we need to adapt the DataCollatorWithPadding class.

We also need the data collator to handle the input_features and labels independently as they are different lengths; the log-Mel spectrogram input_features will not correspond on a one-to-one level with the labels. Since the feature extractor has already padded or truncated the input_features to 30s, they have a fixed dimension and need no further padding; the feature extractor simply converts them to batched tensors. Since the labels are encoded via BPE to index the Whisper vocabulary, they will be handled by the tokenizer. After the tokenizer pads the labels, the padded positions are replaced by -100 so that they are ignored when computing the loss.

Finally, the bos token is cut from the start of the labels, since it is added back later during training.
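
The label handling described above can be sketched without any tensors. This is a minimal illustration, assuming plain Python lists and made-up token ids; `collate_labels` is a hypothetical helper, not part of the collator defined below.

```python
IGNORE_INDEX = -100  # positions with this value are skipped by the loss


def collate_labels(label_lists, decoder_start_token_id):
    """Pad labels to the longest sequence in the batch and drop a shared start token."""
    max_len = max(len(labels) for labels in label_lists)
    padded = [
        list(labels) + [IGNORE_INDEX] * (max_len - len(labels))
        for labels in label_lists
    ]
    # If every sequence begins with the start token, cut it off,
    # because it is added back later during training.
    if all(row[0] == decoder_start_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


START = 1  # hypothetical start-token id, for illustration only
batch = [[START, 7, 8, 9, 2], [START, 5, 2]]
print(collate_labels(batch, decoder_start_token_id=START))
```

Note that padding is relative to the longest sequence in this batch only, so a batch of short transcriptions stays short.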

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here as it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

### Evaluate and compute metrics

The evaluation metric we will use is the word error rate (WER), the most common metric for ASR tasks.
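
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below follows that standard definition in plain Python; it is an illustration with hypothetical helper names, and the `evaluate` metric used in this notebook may differ in details.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(references, predictions):
    """Corpus-level WER: total word edits over total reference words."""
    total_edits = sum(
        word_edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return total_edits / total_words  # assumes at least one reference word


print(wer(["ang dagat na itim"], ["ang dagat ng itim"]))
print(wer(["ang dagat"], ["ang dagat na itim sa hilaga"]))
```

Because insertions count as errors, a prediction that is much longer than (or unrelated to) the reference can push the WER above 100%.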

In [None]:
import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    # pred_logits = pred.predictions
    # pred_ids = np.argmax(pred_logits, axis=-1)
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # decode ids to text, applying Whisper's basic normalization before scoring
    pred_str = tokenizer.batch_decode(pred_ids, basic_normalize=True, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, basic_normalize=True, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

# CHECK IF THIS WORKS WITHOUT USING LOGITS/NUMPY

### TRAIN! - Training args and Trainer
First, we must define all parameters for training using Seq2SeqTrainingArguments (see [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments)).

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tl",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=100,
    # max_steps=5000,  # excessive for my small dataset
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    # group_by_length=True,
    # evaluation_strategy="steps",  # deprecated
    eval_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=200,  # how often to save intermediate checkpoints to the Hub
    eval_steps=50,  # how often to evaluate intermediate checkpoints
    logging_steps=25,
    report_to=["tensorboard"],  # log metrics to TensorBoard
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)


In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args, # using previously-defined training args
    model=model,  # our predefined Whisper-small model
    train_dataset=fleurs["train"],
    eval_dataset=fleurs["validation"], # saving the test set for the very end
    data_collator=data_collator,  # our data collator
    compute_metrics=compute_metrics,  # WER function
    # tokenizer=processor.feature_extractor  # this is deprecated
    processing_class=processor
)


In [None]:
processor.save_pretrained(training_args.output_dir)

[]

Training takes about 3 hours on Google Colab and about 2 hours on Wulver.

In [None]:
trainer.train(
    # resume_from_checkpoint=True  # if training interrupted, use last checkpoint from output_dir
)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


Step,Training Loss,Validation Loss,Wer
50,0.6855,0.590301,24.883156
100,0.4628,0.465644,20.377641
150,0.2687,0.422355,18.769864
200,0.2655,0.39978,19.452234
250,0.1747,0.393307,18.863339
300,0.109,0.385717,17.442513
350,0.1171,0.381318,16.872313
400,0.0551,0.388085,17.741634
450,0.0566,0.386823,16.498411
500,0.0247,0.39503,17.227519


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

TrainOutput(global_step=1180, training_loss=0.11129694690896293, metrics={'train_runtime': 11047.968, 'train_samples_per_second': 1.705, 'train_steps_per_second': 0.107, 'total_flos': 5.4369489420288e+18, 'train_loss': 0.11129694690896293, 'epoch': 10.0})
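
The numbers in this `TrainOutput` can be cross-checked against the training arguments. The sketch below assumes a single device and that the last partial batch of each epoch is kept; all values are copied from this notebook.

```python
import math

train_examples = 1884            # rows in the train split
per_device_batch_size = 16
gradient_accumulation_steps = 1
num_epochs = 10
train_runtime_s = 11047.968      # from TrainOutput

effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

samples_per_second = train_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s
runtime_hours = train_runtime_s / 3600

print(f"steps per epoch:  {steps_per_epoch}")
print(f"total steps:      {total_steps}")
print(f"samples/second:   {samples_per_second:.3f}")
print(f"steps/second:     {steps_per_second:.3f}")
print(f"runtime (hours):  {runtime_hours:.2f}")
```

Under these assumptions the total comes out to the reported `global_step=1180`, and the runtime is a little over 3 hours.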

## Results

### Pushing results to the Hub

First, let's push the results to the Hub for safekeeping.

In [None]:
kwargs = {
    "dataset_tags": "google/fleurs",
    "dataset": "Fleurs",  # a 'pretty' name for the training dataset
    "dataset_args": "config: tl, split: test",
    "language": "tl",
    "model_name": "Whisper Small Tl - Jesse Hilario",  # a 'pretty' name for your model
    "finetuned_from": "openai/whisper-small",
    "tasks": "automatic-speech-recognition",
}


In [None]:
trainer.push_to_hub(**kwargs)

CommitInfo(commit_url='https://huggingface.co/IroquoisHadoop/whisper-tl/commit/c0e9bf8e099b1a57c3774ba22a1f3da799010d3b', commit_message='End of training', commit_description='', oid='c0e9bf8e099b1a57c3774ba22a1f3da799010d3b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/IroquoisHadoop/whisper-tl', endpoint='https://huggingface.co', repo_type='model', repo_id='IroquoisHadoop/whisper-tl'), pr_revision=None, pr_num=None)

### Reinstantiating model from the Hub

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("IroquoisHadoop/whisper-tl")
processor = WhisperProcessor.from_pretrained("IroquoisHadoop/whisper-tl")


config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.79k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/339 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

### Test Evaluation (first attempt!)

This attempt failed because I had neglected to run `trainer.push_to_hub(**kwargs)` to push the results of training to the Hub. As a result, the model did not know that it was doing a transcription task. I believe it started doing a translation task instead, returning English and resulting in a WER over 100%.

At least we know that the WER metric works here.

In [None]:
fleurs

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 1884
    })
    validation: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 418
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 964
    })
})

The first half (commented out) comes from an example in the docs; the second half repurposes it to (hopefully) evaluate the WER of the test set correctly.

Recall: the labels from Fleurs have been tokenized to their input_ids, and the model's predictions are also token ids.

Keep in mind, too, that `evaluate.load("wer")` expects strings, not token IDs, so both the predictions and the labels must be decoded before computing the WER.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [None]:
# Define map_to_pred
def map_to_pred(batch):
    input_features = batch["input_features"]

    # Ensure input_features is a 3D tensor
    input_features = torch.tensor(input_features)
    if len(input_features.shape) == 2:
        input_features = input_features.unsqueeze(0)

    input_features = input_features.to(device)

    # Create attention mask for non-zero values
    attention_mask = (input_features != 0).float().to(device)

    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            attention_mask=attention_mask
        )

    batch["prediction"] = predicted_ids.tolist()  # Convert to list for JSON compatibility
    return batch

# Apply map_to_pred to the test set
result = fleurs["test"].map(
    map_to_pred,
    remove_columns=["input_features"],  # Remove unnecessary columns
    num_proc=4  # four worker processes (the second attempt uses num_proc=1)
)


Map:   0%|          | 0/964 [00:00<?, ? examples/s]

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

In [None]:
# Decode predictions and references
prediction = processor.batch_decode(result["prediction"], skip_special_tokens=True)
reference = processor.batch_decode(fleurs["test"]["labels"], skip_special_tokens=True)

# Compute and print WER
print(100 * metric.compute(references=reference, predictions=prediction))

Dataset({
    features: ['labels', 'prediction'],
    num_rows: 964
})

In [None]:
# Flatten nested predictions
flat_predictions = [pred for sublist in result["prediction"] for pred in sublist]

# Decode predictions and references
prediction = processor.batch_decode(flat_predictions, skip_special_tokens=True)
reference = processor.batch_decode(fleurs["test"]["labels"], skip_special_tokens=True)

# Compute and print WER
print(100 * metric.compute(references=reference, predictions=prediction))


100.12791193457112


In [None]:
prediction[0]

" The long-lasting names have the ten most written names, including the outer sky above the tree's"

In [None]:
reference[0]

'ang mahahaba nitong mga panga ay may 70 na mga matutulis na ngipin kasama na ang ekstrang hanay sa itaas ng bunganga nito nangangahulugan na walang matatakasan ang sinuman na makasalubong nito'

### Test Evaluation (second attempt)

Consider using transcribe function here to transcribe instead of having to decode?

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [None]:
# Define map_to_pred
def map_to_pred(batch):
    input_features = batch["input_features"]

    # Ensure input_features is a 3D tensor
    input_features = torch.tensor(input_features)
    if len(input_features.shape) == 2:
        input_features = input_features.unsqueeze(0)

    input_features = input_features.to(device)

    # Create attention mask for non-zero values
    attention_mask = (input_features != 0).float().to(device)

    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            attention_mask=attention_mask
        )

    batch["prediction"] = predicted_ids.tolist()  # Convert to list for JSON compatibility
    return batch

In [None]:
# Apply map_to_pred to the test set
result = fleurs["test"].map(
    map_to_pred,
    remove_columns=["input_features"],  # Remove unnecessary columns
    num_proc=1  # For debugging, keep it single-process
)

Map:   0%|          | 0/964 [00:00<?, ? examples/s]

In [None]:
# Flatten nested predictions
flat_predictions = [pred for sublist in result["prediction"] for pred in sublist]

# Decode predictions and references
prediction = processor.batch_decode(flat_predictions, skip_special_tokens=True)
reference = processor.batch_decode(fleurs["test"]["labels"], skip_special_tokens=True)

# Compute and print WER
print(100 * metric.compute(references=reference, predictions=prediction))


16.7603395480445


In [None]:
prediction[0]

'ang mahahaba nitong mga panga ay may pitungpu ng mga matutulis na ngipin kasama na ang estrang hanay sa itaas ng bunganga nito nangangahulugan na walang matatakasan ang sinuman na makasalubong nito'

In [None]:
reference[0]

'ang mahahaba nitong mga panga ay may 70 na mga matutulis na ngipin kasama na ang ekstrang hanay sa itaas ng bunganga nito nangangahulugan na walang matatakasan ang sinuman na makasalubong nito'
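
To see exactly where the prediction departs from the reference, a word-level diff helps. The following is a minimal sketch using only the standard library; `word_diff` is a hypothetical helper (not part of the notebook), and the two strings are shortened excerpts of the sample pair above.

```python
from difflib import SequenceMatcher


def word_diff(reference: str, prediction: str):
    """Return the word-level spans where the prediction departs from the reference."""
    ref_words, pred_words = reference.split(), prediction.split()
    matcher = SequenceMatcher(a=ref_words, b=pred_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], pred_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, and deletions
    ]


# Shortened excerpts of the sample reference and prediction shown above
reference_excerpt = "ay may 70 na mga matutulis na ngipin kasama na ang ekstrang hanay"
prediction_excerpt = "ay may pitungpu ng mga matutulis na ngipin kasama na ang estrang hanay"

for tag, ref_span, pred_span in word_diff(reference_excerpt, prediction_excerpt):
    print(tag, ref_span, "->", pred_span)
```

On this sample the mismatches are the spelled-out number (`pitungpu ng` instead of `70 na`) and the missing "k" in `ekstrang`. Neither is a casing or punctuation issue.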

#### Very interesting...
Make sure to decode with `basic_normalize=True`: the original paper (p. 21) specifies basic text normalization when evaluating multilingual transcripts (see the documentation [here](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperTokenizer.decode)).
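
To illustrate why normalization changes the score, here is a rough sketch that computes a corpus-level WER by hand, with and without normalization. `simple_normalize` is a simplified stand-in written for this example (lowercasing and stripping punctuation), not Whisper's actual basic normalizer, and the sentence pair is made up for illustration.

```python
import re


def simple_normalize(text: str) -> str:
    """Simplified stand-in for basic normalization: lowercase and drop punctuation."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(references, hypotheses) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


# Made-up pair: identical words, differing only in casing and punctuation
refs = ["Kumusta ka, kaibigan?"]
hyps = ["kumusta ka kaibigan"]

raw_wer = wer(refs, hyps)
normalized_wer = wer([simple_normalize(r) for r in refs],
                     [simple_normalize(h) for h in hyps])
print(100 * raw_wer, 100 * normalized_wer)
```

Without normalization, every word in this pair counts as an error even though the transcription is correct, which is the kind of penalty the normalized score avoids.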

In [None]:
# Flatten nested predictions
flat_predictions1 = [pred for sublist in result["prediction"] for pred in sublist]

# Decode predictions and references. Must use basic_normalize=True since it's multilingual
prediction1 = processor.batch_decode(flat_predictions1, basic_normalize=True, skip_special_tokens=True)
reference1 = processor.batch_decode(fleurs["test"]["labels"], basic_normalize=True, skip_special_tokens=True)

# Compute and print WER
print(100 * metric.compute(references=reference1, predictions=prediction1))


16.082645879058504


In [None]:
prediction1[0]

'ang mahahaba nitong mga panga ay may pitungpu ng mga matutulis na ngipin kasama na ang estrang hanay sa itaas ng bunganga nito nangangahulugan na walang matatakasan ang sinuman na makasalubong nito'

In [None]:
reference1[0]

'ang mahahaba nitong mga panga ay may 70 na mga matutulis na ngipin kasama na ang ekstrang hanay sa itaas ng bunganga nito nangangahulugan na walang matatakasan ang sinuman na makasalubong nito'

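Comparing the paper's results with the fine-tuned model:

The fine-tuned model reaches a test word error rate of 16.76% without normalization and 16.08% with `basic_normalize=True`, much better than the 27.7% the original paper reports for Whisper-small on Tagalog. The sample prediction above also confirms that the model is transcribing rather than translating.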

I was considering using Whisper-medium rather than Whisper-small for the accuracy gains. However, the gap between training and validation loss suggests that the model is already overfitting: the training loss drops below 0.01, while the validation loss reaches a minimum of 0.39 and then rises slightly to 0.42. A larger model like Whisper-medium might make this overfitting worse on such a small dataset.
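
One common way to limit this kind of overfitting is to stop training once the validation loss stops improving and keep the best checkpoint. The sketch below shows the idea in plain Python; `best_epoch` is a hypothetical helper, and the loss values are illustrative (only the 0.39 minimum and the final 0.42 come from the run described above).

```python
def best_epoch(val_losses, patience=2):
    """Return (index of the best epoch, index at which training would stop).

    Training stops once `patience` consecutive evaluations pass without the
    validation loss improving on its best value so far.
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_losses) - 1


# Illustrative validation losses per evaluation step (hypothetical values)
val_losses = [0.52, 0.43, 0.39, 0.40, 0.41, 0.42]
best, stopped = best_epoch(val_losses, patience=2)
print(f"best checkpoint: step {best}, stop at step {stopped}")
```

If training goes through the Hugging Face `Trainer`, `EarlyStoppingCallback` together with `load_best_model_at_end=True` provides similar behavior without custom code.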

## Implementation in a demo

In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.8.0-py3-none-any.whl.metadata (16 kB)

In [None]:
from transformers import pipeline
import gradio as gr

# The pipeline loads both the fine-tuned model and its processor from the Hub,
# so they do not need to be instantiated separately.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="IroquoisHadoop/whisper-tl",
)

def transcribe(audio):
    # `audio` is the path to the recorded file because of type="filepath"
    text = transcriber(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    # Gradio 4+ takes `sources` (a list) instead of the older `source` argument
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs="text",
    title="Whisper Small Tagalog",
    description="Realtime demo for Tagalog speech recognition using a fine-tuned Whisper small model.",
)

iface.launch()


Device set to use cuda:0


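TypeError: Audio.__init__() got an unexpected keyword argument 'source'

Note: this traceback appears to come from an earlier run that passed `source=` to `gr.Audio`. Gradio 4.0 renamed that argument to `sources`, which is what the cell above uses.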
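
If the demo has to run on both older and newer Gradio versions, one option is to choose the keyword by inspecting the component's constructor. This is only a sketch: `microphone_kwargs` is a hypothetical helper, the two stand-in classes exist solely to demonstrate it without importing Gradio, and it assumes the real `gr.Audio.__init__` signature is introspectable.

```python
import inspect


def microphone_kwargs(audio_cls):
    """Pick the microphone keyword argument that `audio_cls` accepts."""
    params = inspect.signature(audio_cls.__init__).parameters
    if "sources" in params:  # newer API: a list of sources
        return {"sources": ["microphone"]}
    if "source" in params:  # older API: a single source string
        return {"source": "microphone"}
    raise TypeError(f"{audio_cls.__name__} accepts neither 'sources' nor 'source'")


# Stand-in classes mimicking the two constructor styles (not real Gradio classes)
class NewStyleAudio:
    def __init__(self, sources=None, type="numpy"):
        self.sources, self.type = sources, type


class OldStyleAudio:
    def __init__(self, source="upload", type="numpy"):
        self.source, self.type = source, type


print(microphone_kwargs(NewStyleAudio))
print(microphone_kwargs(OldStyleAudio))

# Intended use with the real component (assumes `import gradio as gr`):
# inputs = gr.Audio(type="filepath", **microphone_kwargs(gr.Audio))
```

Pinning the Gradio version in the install cell is the simpler alternative when only one environment needs to be supported.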