# Finetuning Whisper-large-V2 on Colab using PEFT-Lora + BNB INT8 training

In this Colab, we present a step-by-step guide on how to fine-tune Whisper on any multilingual ASR dataset using Hugging Face 🤗 Transformers and 🤗 PEFT. Using 🤗 PEFT and `bitsandbytes`, you can train `whisper-large-v2` on a Colab T4 GPU (16 GB VRAM); the cells below set `model_name_or_path` to `openai/whisper-large-v3-turbo` instead. Most of this notebook is adapted from [fine_tune_whisper.ipynb](https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb#scrollTo=BRdrdFIeU78w) to train with PEFT LoRA + BNB INT8.

For more details on the model, datasets and metrics, refer to the blog post [Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-whisper).



## Initial Setup

In [4]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg


In [5]:
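# quote version specifiers so the shell does not treat `>` as an output redirect
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
# the original `evaluate>=0.30` would likely match no published release once quoted; `0.3.0` is assumed to be the intent
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main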

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-0yx681ba
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-0yx681ba
  Resolved https://github.com/huggingface/transformers to commit bc30dd1efb99f571d45b2e2131a555d09285ddd8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=10875597 sha256=8d787d082af805ef4b8e222b13a3b5243606388359bebffdfc21e20467aeec01
  Stored in directory: /tmp/pip-ephem-wheel-cache-45hyp8ec/wheels/04/a3/f1/b88775f8e1665827525b19ac7590250f1038d947067beba9fb
Successfully built transformer

In [6]:
import subprocess
import pandas as pd
from google.colab import drive
from datasets import Dataset, Audio
import datetime
from pathlib import Path

Linking the notebook to the Hub is straightforward - it simply requires entering your Hub authentication token when prompted. Find your Hub authentication token [here](https://huggingface.co/settings/tokens):

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
from google.colab import userdata
from huggingface_hub import login

HF_TOKEN = userdata.get('HF_TOKEN')

if HF_TOKEN:
    login(token=HF_TOKEN)
    print("Successfully logged in to Hugging Face!")
else:
    print("HF_TOKEN not found. Please set it in your Colab secrets.")


Successfully logged in to Hugging Face!


In [8]:
# Select CUDA device index
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True" # Optimize memory allocation:
model_name_or_path = "openai/whisper-large-v3-turbo"
language = "Chinese"
language_abbr = "zh"  # Whisper / ISO 639-1 language code for Chinese
task = "transcribe"

## Load Dataset

In [9]:
def process_video(video_path, srt_path, output_dir):
    # Extract video file name without extension
    video_name = Path(video_path).stem

    # Read SRT file
    with open(srt_path, 'r', encoding='utf-8-sig') as f:  # explicit encoding; utf-8-sig also drops a BOM if present
        srt_content = f.read().strip().split('\n\n')

    audio_files = []
    sentences = []

    for idx, segment in enumerate(srt_content):
        lines = segment.split('\n')
        if len(lines) < 3: # Skip segments with less than 3 lines
            print(f"Skipping empty segment {idx}")
            continue

        try:
            times = lines[1].split(' --> ')
            start_time = times[0].replace(',', '.')
            end_time = times[1].replace(',', '.')
            sentence = ' '.join(lines[2:])
        except IndexError as e:
            print(f"Error in segment {idx}: {str(e)}")
            print(f"Segment content: {segment}")
            continue

        # Add small buffer to start time
        start_parts = start_time.split(':')
        seconds = max(0, float(start_parts[-1]) - 0.023)
        start_parts[-1] = f"{seconds:.3f}"
        start_time = ':'.join(start_parts)

        # Include video name in the output file name
        output_file = os.path.join(output_dir, f"{video_name}_segment_{idx:04d}.wav")

        # Check if the file already exists
        if os.path.exists(output_file):
            print(f"Skipping existing file: {output_file}")
            audio_files.append(output_file)
            sentences.append(sentence)
            continue

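        # Cut and convert to 16 kHz mono WAV in one command. Passing an argument
        # list (no shell) keeps paths with spaces or special characters intact.
        cmd = ["ffmpeg", "-y", "-i", str(video_path), "-ss", start_time, "-to", end_time,
               "-ac", "1", "-ar", "16000", output_file]
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)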
        if result.returncode != 0:
            print(f"FFmpeg error for segment {idx}: {result.stderr}")
            continue

        audio_files.append(output_file)
        sentences.append(sentence)

    return audio_files, sentences
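
The timestamp handling inside `process_video` can be checked in isolation before running ffmpeg over a whole video. The following is a minimal sketch, not the cell's actual implementation: `parse_srt_block`, `to_seconds` and the `lead_in` parameter are hypothetical names, and it works in total seconds rather than editing only the seconds field, so the 23 ms offset also carries across a minute boundary.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert SRT clock fields (strings) to seconds as a float."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt_block(block, lead_in=0.023):
    """Return (start, end, text) for one SRT block, or None if it is malformed.

    `lead_in` is subtracted from the start time and the result is clamped at 0.
    """
    lines = block.strip().splitlines()
    if len(lines) < 3:  # index line, timing line, at least one text line
        return None
    match = TIMING.search(lines[1])
    if match is None:
        return None
    fields = match.groups()
    start = max(0.0, to_seconds(*fields[:4]) - lead_in)
    end = to_seconds(*fields[4:])
    return round(start, 3), round(end, 3), " ".join(lines[2:])


sample_block = "1\n00:00:01,000 --> 00:00:04,500\nfirst line\nsecond line"
print(parse_srt_block(sample_block))
```

Returning `None` for malformed blocks lets the caller skip them explicitly, which mirrors the `continue` branches in the cell above.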

In [None]:
# Process video and SRT files
video_paths = ["/content/drive/MyDrive/data/CLC014-1-字幕版.mp4"]
srt_paths = ["/content/drive/MyDrive/data/CLC014-1-final.srt",]
output_dir = "/content/drive/MyDrive/audio_test_14_1"
os.makedirs(output_dir, exist_ok=True)

# In the main processing loop:
all_audio_files = []
all_sentences = []

for video_path, srt_path in zip(video_paths, srt_paths):
    print(f"Processing: {video_path}")
    try:
        audio_files, sentences = process_video(video_path, srt_path, output_dir)
        all_audio_files.extend(audio_files)
        all_sentences.extend(sentences)
    except Exception as e:
        print(f"Error processing {video_path}: {str(e)}")
        continue  # Skip to the next video if there's an error

# Create DataFrame with all processed data
df = pd.DataFrame({
    "audio": all_audio_files,
    "sentence": all_sentences
})

# Convert to Dataset and save
dataset_test = Dataset.from_pandas(df)
dataset_test = dataset_test.cast_column("audio", Audio(sampling_rate=16000))
dataset_test.save_to_disk(os.path.join(output_dir, "audio_test_14_1"))

Processing: /content/drive/MyDrive/data/CLC014-1-字幕版.mp4


Saving the dataset (0/1 shards):   0%|          | 0/616 [00:00<?, ? examples/s]

In [None]:
print(dataset_test)

In [None]:
print(dataset_test[0]["audio"])  # This will return a dictionary with path, array, and sampling_rate
print(dataset_test[0]["sentence"])

{'path': '/content/drive/MyDrive/audio_test_14_1/CLC014-1-字幕版_segment_0000.wav', 'array': array([-0.0541687 , -0.03048706, -0.00344849, ..., -0.00717163,
       -0.00708008, -0.00680542]), 'sampling_rate': 16000}
好 我們這一節課呢 接著把這個「三皇五帝」這一部分


In [None]:
print(dataset_test[-1]["audio"])  # This will return a dictionary with path, array, and sampling_rate
print(dataset_test[-1]["sentence"])

In [None]:
# Process video and SRT files
srt_paths = ["/content/drive/MyDrive/data/第13課-1-final.srt",
             "/content/drive/MyDrive/data/第13課-2-final.srt"]

video_paths = ["/content/drive/MyDrive/data/第13課-1-字幕版.mp4",
               "/content/drive/MyDrive/data/第13課-2-字幕版.mp4"]

output_dir = "/content/drive/MyDrive/audio_train_13_1_2"
os.makedirs(output_dir, exist_ok=True)

all_audio_files = []
all_sentences = []

for video_path, srt_path in zip(video_paths, srt_paths):
    print(f"Processing: {video_path}")
    audio_files, sentences = process_video(video_path, srt_path, output_dir)
    all_audio_files.extend(audio_files)
    all_sentences.extend(sentences)

# Create DataFrame with all processed data
df = pd.DataFrame({
    "audio": all_audio_files,
    "sentence": all_sentences
})

# Convert to Dataset and save
dataset_train = Dataset.from_pandas(df)
dataset_train = dataset_train.cast_column("audio", Audio(sampling_rate=16000))
dataset_train.save_to_disk(os.path.join(output_dir, "audio_train_13_1_2"))

In [None]:
from datasets import load_from_disk, DatasetDict

common_voice = DatasetDict()
common_voice["train"] = load_from_disk("/content/drive/MyDrive/audio_train_13_1_2/audio_train_13_1_2")
common_voice["test"] = load_from_disk("/content/drive/MyDrive/audio_test_14_1/audio_test_14_1")
print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 508
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 616
    })
})


## Prepare Feature Extractor, Tokenizer and Data

In [10]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

In [11]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.71M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

In [12]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)

### Prepare Data

Whisper expects audio sampled at 16kHz. In this notebook the segments were already written as 16kHz mono WAV files by ffmpeg (`-ar 16000 -ac 1`), and the `audio` column was cast with `Audio(sampling_rate=16000)` when the datasets were built, so no further resampling is needed here.

If your own audio uses a different rate (for example 48kHz), dataset's
[`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=cast_column#datasets.DatasetDict.cast_column)
method handles it. This operation does not change the audio in-place,
but rather signals to `datasets` to resample audio samples _on the fly_ the
first time that they are loaded.

Loading the first training sample confirms the sampling rate:

In [None]:
print(common_voice["train"][0])

{'audio': {'path': '第13課-1-字幕版_segment_0000.wav', 'array': array([ 0.08483887,  0.13439941,  0.31784058, ..., -0.00128174,
       -0.00241089, -0.00585938]), 'sampling_rate': 16000}, 'sentence': '我們這節課把中國歷史的部分 《三字經》這一部分 把它結束'}


Now we can write a function to prepare our data ready for the model:
1. We load and resample the audio data by calling `batch["audio"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.
2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.
3. We encode the transcriptions to label ids through the use of the tokenizer.

In [None]:
def prepare_dataset(batch):
    # load the audio; 🤗 Datasets decodes it and resamples to 16kHz on the fly if needed
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"],
                sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

We can apply the data preparation function to all of our training examples using dataset's `.map` method. The argument `num_proc` specifies how many CPU cores to use. Setting `num_proc` > 1 will enable multiprocessing. If the `.map` method hangs with multiprocessing, set `num_proc=1` and process the dataset sequentially.

In [None]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)

Map:   0%|          | 0/508 [00:00<?, ? examples/s]

Map:   0%|          | 0/616 [00:00<?, ? examples/s]

In [None]:
common_voice["train"]

Dataset({
    features: ['input_features', 'labels'],
    num_rows: 508
})

In [None]:
common_voice["test"]

Dataset({
    features: ['input_features', 'labels'],
    num_rows: 616
})

In [None]:
output_dir = "/content/drive/MyDrive/smalldata0304"
os.makedirs(output_dir, exist_ok=True)

common_voice.save_to_disk(output_dir)

Saving the dataset (0/2 shards):   0%|          | 0/508 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/616 [00:00<?, ? examples/s]

#### Load Large Preprocessed Data

In [13]:
from datasets import load_from_disk
common_voice = load_from_disk("/content/drive/MyDrive/processed_whisper_dataset")

Loading dataset from disk:   0%|          | 0/37 [00:00<?, ?it/s]

In [14]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 11793
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 2959
    })
})

## Training and Evaluation

### Define a Data Collator

In [15]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

Let's initialise the data collator we've just defined:

In [16]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
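
To see why the collator masks labels through the attention mask rather than by comparing against the pad token id, here is a small pure-Python sketch of the same two steps (pad, then replace padded positions with `-100`). The function name `pad_and_mask` and the toy token ids are illustrative only; the sketch assumes, as is believed to be the case for Whisper's tokenizer, that the pad token and the end-of-text token can share one id.

```python
def pad_and_mask(label_seqs, pad_id, ignore_index=-100):
    """Right-pad token-id lists to a common length and hide the padding from the loss."""
    max_len = max(len(seq) for seq in label_seqs)
    batch = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        # step 1: pad with the pad token and record which positions are real
        input_ids = list(seq) + [pad_id] * n_pad
        attention_mask = [1] * len(seq) + [0] * n_pad
        # step 2: replace only the padded positions with ignore_index
        batch.append(
            [tok if keep else ignore_index for tok, keep in zip(input_ids, attention_mask)]
        )
    return batch


# Toy ids: 99 plays the role of both the end-of-text token and the pad token
toy_labels = [[11, 12, 13, 99], [21, 99]]
print(pad_and_mask(toy_labels, pad_id=99))
```

The genuine end-of-text token at the end of each sequence survives, while the padding added after it is ignored by the loss. Masking every position equal to `pad_id` would have removed both.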

### Evaluation Metrics

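We'll use the character error rate (CER) rather than the word error rate (WER), the 'de-facto' metric for assessing
ASR systems: Chinese text is not split into words by whitespace, so errors are counted per character instead. For more information on WER, refer to the WER [docs](https://huggingface.co/metrics/wer). We'll load the CER metric from 🤗 Evaluate: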

In [17]:
import evaluate

metric = evaluate.load("cer")

Downloading builder script:   0%|          | 0.00/5.60k [00:00<?, ?B/s]

We then simply have to define a function that takes our model
predictions and returns the CER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the CER between the predictions and reference labels:

In [18]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    cer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"cer": cer}
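
For intuition about what the reported number means, here is a minimal sketch of a character error rate: the character-level edit distance summed over all utterances, divided by the total number of reference characters. It is written from the textbook definition and is not the `evaluate` implementation, which may differ in details such as whitespace handling. The names `edit_distance` and `corpus_cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(references, predictions):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


refs = ["三皇五帝", "歷史"]
preds = ["三王五帝", "歷史"]
print(f"CER = {100 * corpus_cer(refs, preds):.2f}%")
```

One substituted character out of six reference characters gives a CER of one sixth, which `compute_metrics` would report on the same percentage scale after multiplying by 100.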

### Load a Pre-Trained Checkpoint

Now let's load the pre-trained Whisper checkpoint named by `model_name_or_path` (`openai/whisper-large-v3-turbo`) with 8-bit weights. Again, this
is trivial through use of 🤗 Transformers!

In [19]:
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# pass the 8-bit setting through `quantization_config`; the bare `load_in_8bit` argument is deprecated
model = WhisperForConditionalGeneration.from_pretrained(
    model_name_or_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# model.hf_device_map - this should be {"": 0}

config.json:   0%|          | 0.00/1.26k [00:00<?, ?B/s]


model.safetensors:   0%|          | 0.00/1.62G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.77k [00:00<?, ?B/s]

Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)):

In [20]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

### Post-processing on the model

Finally, we need to apply some post-processing to the 8-bit model to enable training: we freeze all the base layers and cast the layer norms to `float32` for stability. We also cast the output of the last layer to `float32` for the same reason.

In [21]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model) # up-to-date approach for preparing 8-bit models for training

### Apply LoRA

Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`.
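
To make "low-rank adapters" concrete: for each targeted projection, LoRA keeps the pre-trained weight $W_0$ frozen and learns a low-rank correction on top of it. A sketch in the standard LoRA notation (the symbols are not taken from this notebook), assuming the default `peft` scaling of `lora_alpha / r`:

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With the values used in this notebook's `LoraConfig`, $r = 32$ and $\alpha = 64$, the adapter output is scaled by $64 / 32 = 2$, and only $A$ and $B$ receive gradients.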


In [22]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 6,553,600 || all params: 815,431,680 || trainable%: 0.8037


We are training only about **0.8%** of the model's parameters, thereby performing **Parameter-Efficient Fine-Tuning**.
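
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes architecture figures for `openai/whisper-large-v3-turbo` that are not stated in this notebook (hidden size 1280, 32 encoder layers, 4 decoder layers), and that `q_proj` and `v_proj` appear once per attention block. The helper `lora_param_count` is a hypothetical name.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed architecture figures (not stated in this notebook)
d_model = 1280
encoder_layers = 32        # one self-attention block each
decoder_layers = 4         # one self-attention and one cross-attention block each
targets_per_attention = 2  # q_proj and v_proj

attention_blocks = encoder_layers + 2 * decoder_layers
adapted_modules = attention_blocks * targets_per_attention
trainable = adapted_modules * lora_param_count(d_model, d_model, rank=32)

all_params = 815_431_680  # "all params" as printed by print_trainable_parameters()
percent = 100 * trainable / all_params
print(f"{trainable:,} trainable parameters ({percent:.4f}% of all parameters)")
```

Under these assumptions the 80 adapted projections account for the full trainable count, which suggests no other parameters were left unfrozen.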

### Define the Training Configuration

In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).

In [23]:
from transformers import Seq2SeqTrainingArguments
from huggingface_hub import HfApi, ModelCardData

training_args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/whisperLarge0305V3",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=5e-5,
    warmup_steps=100,
    num_train_epochs=3,
    eval_strategy="steps",  # called `evaluation_strategy` in older Transformers releases
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    load_best_model_at_end=True,  # Load the best model at the end of training
    metric_for_best_model="eval_loss",  # Use evaluation loss to determine the best model
    fp16=True,
    per_device_eval_batch_size=8,
    # eval_accumulation_steps=1,  # Explicitly control eval memory usage
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,  # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
)



**A Few Important Notes:**
1. `remove_unused_columns=False` and `label_names=["labels"]` are required as the PeftModel's forward doesn't have the signature of the base model's forward.

2. INT8 training requires autocasting. `predict_with_generate` can't be passed to the Trainer because it internally calls Transformers' `generate` without autocasting, which leads to errors.

3. Because of point 2, `compute_metrics` isn't passed to `Seq2SeqTrainer` (it is commented out in the cell below).

In [24]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    processing_class=processor.feature_extractor,  # replaces the deprecated `tokenizer=` argument
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!

# Enable gradient checkpointing to trade compute for memory.
# NOTE: `use_memory_efficient_attention` does not appear to be a documented Whisper config
# field, so the assignment below is likely a no-op; it is kept from the original cell.
model.config.use_memory_efficient_attention = True
model.gradient_checkpointing_enable()


In [25]:
trainer.train(resume_from_checkpoint="/content/drive/MyDrive/whisperLarge0305V3/checkpoint-2000")


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 2500 | 0.9173        | 0.443578        |
| 3000 | 0.7167        | 0.433069        |
| 3500 | 0.5651        | 0.444271        |
| 4000 | 0.7046        | 0.423587        |




TrainOutput(global_step=4425, training_loss=0.4309207974450063, metrics={'train_runtime': 12201.5329, 'train_samples_per_second': 2.9, 'train_steps_per_second': 0.363, 'total_flos': 6.085367184162816e+19, 'train_loss': 0.4309207974450063, 'epoch': 3.0})
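
The `global_step=4425` in the output above follows directly from the dataset size and the training arguments. A quick check, assuming a single GPU (`num_devices` is a hypothetical name, the other values are taken from this notebook):

```python
import math

train_examples = 11_793          # rows in common_voice["train"]
per_device_batch_size = 8
gradient_accumulation_steps = 1
num_devices = 1                  # assumption: one Colab GPU
num_epochs = 3

effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(train_examples / effective_batch)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With `eval_steps=500` and `save_steps=500`, this also explains why a run resumed from `checkpoint-2000` logs evaluations at steps 2500 through 4000 and then finishes before another evaluation is due.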

In [None]:
trainer.train(resume_from_checkpoint="/content/drive/MyDrive/whisperLarge0305V3/checkpoint-1500")


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 2000 | 0.9419        | 0.452113        |




In [None]:
trainer.train(resume_from_checkpoint="/content/drive/MyDrive/whisperLarge0305V2/checkpoint-500")


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1000 | 1.0907        | 0.48908         |
| 1500 | 0.832         | 0.475805        |






In [None]:
"""
trainer.train()
"""



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.1129        | 0.509519        |




In [None]:
"""
trainer.train()
"""

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.4644        | 0.528824        |
| 2     | 0.2706        | 0.495149        |
| 3     | 0.127         | 0.494306        |




TrainOutput(global_step=231, training_loss=0.4824646582335105, metrics={'train_runtime': 1682.8358, 'train_samples_per_second': 1.098, 'train_steps_per_second': 0.137, 'total_flos': 3.17865359572992e+18, 'train_loss': 0.4824646582335105, 'epoch': 3.0})

In [None]:
"""
trainer.train()
"""

***** Running training *****
  Num examples = 3927
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1473
  Number of trainable parameters = 15728640


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.25          | 0.257494        |
| 2     | 0.1681        | 0.2194          |
| 3     | 0.0799        | 0.214743        |


***** Running Evaluation *****
  Num examples = 1816
  Batch size = 8
Saving model checkpoint to temp/checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Feature extractor saved in temp/checkpoint-500/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 1816
  Batch size = 8
Saving model checkpoint to temp/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Feature extractor saved in temp/checkpoint-1000/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 1816
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1473, training_loss=0.20141667617233286, metrics={'train_runtime': 12791.5224, 'train_samples_per_second': 0.921, 'train_steps_per_second': 0.115, 'total_flos': 2.52799085113344e+19, 'train_loss': 0.20141667617233286, 'epoch': 3.0})

In [None]:
"""
model_name_or_path = "openai/whisper-large-v3"
peft_model_id = "smangrul/" + f"{model_name_or_path}-{model.peft_config.peft_type}-colab".replace("/", "-")
model.push_to_hub(peft_model_id)
print(peft_model_id)
"""

Uploading the following files to smangrul/openai-whisper-large-v2-LORA-colab: adapter_model.bin,adapter_config.json


Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.bin:   0%|          | 0.00/63.1M [00:00<?, ?B/s]

smangrul/openai-whisper-large-v2-LORA-colab


In [26]:
# Prepare metadata for model push
model_name = "whisper-large-v3-chinese-custom-ft-0306"
username = "SylviaThsu"
peft_model_id = f"{username}/{model_name}"

# Detailed model card metadata
# NOTE: this object is only constructed here; it is not attached to the upload below
model_card_data = ModelCardData(
    language=["zh"],  # ISO 639-1 code for Chinese; update with your specific languages
    license="apache-2.0",
    datasets=["common_voice"],  # Update with your training dataset
    metrics=["cer"],  # Character error rate, matching the metric loaded earlier
    model_name=model_name,
    description="""
    Fine-tuned Whisper Large v3 Turbo model for Chinese speech recognition.
    Trained with custom dataset to improve speech transcription accuracy.

    Key Features:
    - Base Model: OpenAI Whisper Large v3 Turbo (openai/whisper-large-v3-turbo)
    - Training Approach: Parameter-efficient fine-tuning (PEFT)
    - Quantization: 8-bit training
    - Key Improvements: [Briefly describe your specific improvements]
    """,
)

model.push_to_hub(peft_model_id)
print(peft_model_id)

adapter_model.safetensors:   0%|          | 0.00/26.2M [00:00<?, ?B/s]

SylviaThsu/whisper-large-v3-chinese-custom-ft-0306


## Evaluation and Inference

**Important points to note during inference**:
1. As `predict_with_generate` can't be used, we write the eval loop with `torch.cuda.amp.autocast()` as shown below.
2. As the base model is frozen, the PEFT model sometimes fails to recognise the language while decoding. Hence, we force the starting tokens to name the language we are transcribing. For Marathi, this is done via `forced_decoder_ids = processor.get_decoder_prompt_ids(language="Marathi", task="transcribe")` and passing that to the `model.generate` call; for this notebook, substitute `language=language, task=task`. The evaluation loop below achieves the same effect by passing the first four label tokens (start-of-transcript, language, task and no-timestamps) as `decoder_input_ids`.
3. Please note that the [AutoEvaluate Leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=mozilla-foundation%2Fcommon_voice_11_0&only_verified=0&task=automatic-speech-recognition&config=mr&split=test&metric=wer) for the `mr` language on `common_voice_11_0` has a bug wherein OpenAI's `BasicTextNormalizer` is used during evaluation, leading to degenerate output text. An example is shown below:
```
without normalizer: 'स्विच्चान नरुवित्तीची पद्दत मोठ्या प्रमाणात आमलात आणल्या बसोन या दुपन्याने अनेक राथ प्रवेश केला आहे.'
with normalizer: 'स व च च न नर व त त च पद दत म ठ य प रम ण त आमल त आणल य बस न य द पन य न अन क र थ प रव श क ल आह'
```
After fixing this bug, we report two metrics for the top model of the leaderboard and the PEFT model:
1. `wer`: `wer` without using the `BasicTextNormalizer`, as it doesn't cater to most Indic languages. This is what we consider the true performance metric.
2. `normalized_wer`: `wer` using the `BasicTextNormalizer`, to be comparable to the leaderboard metrics.

Below are the results:

| Model          | DrishtiSharma/whisper-large-v2-marathi | smangrul/openai-whisper-large-v2-LORA-colab |
|----------------|----------------------------------------|---------------------------------------------|
| wer            | 35.6457                                | 36.1356                                     |
| normalized_wer | 13.6440                                | 14.0165                                     |

We see that the PEFT model's performance is comparable to that of the fully fine-tuned model at the top of the leaderboard. At the same time, we are able to train the large model in a Colab notebook with limited GPU memory, with the added advantage that the resulting checkpoint is just `63` MB.
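To make the difference between the two reported metrics concrete, here is a minimal pure-Python sketch of word error rate with an optional normalization hook. It is a stand-in for illustration only, not the `evaluate` metric used in this notebook; `edit_distance`, `word_error_rate`, and the `normalize` parameter are hypothetical names.

```python
from typing import Callable, Optional, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution/match
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(
    references: Sequence[str],
    predictions: Sequence[str],
    normalize: Optional[Callable[[str], str]] = None,
) -> float:
    """Corpus-level WER in percent: total word edits / total reference words."""
    errors, total = 0, 0
    for ref, hyp in zip(references, predictions):
        if normalize is not None:
            ref, hyp = normalize(ref), normalize(hyp)
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    if total == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / total


refs = ["the cat sat on the mat"]
preds = ["The cat sat on mat"]
print(word_error_rate(refs, preds))                       # raw text
print(word_error_rate(refs, preds, normalize=str.lower))  # normalized text
```

Any callable from `str` to `str` can be passed as `normalize`; the scores it yields are only comparable to other scores computed with the same normalizer.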



In [27]:
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration

# Read the adapter config to find the base model the LoRA weights belong to.
peft_model_id = "SylviaThsu/whisper-large-v3-chinese-custom-ft-0306"
peft_config = PeftConfig.from_pretrained(peft_model_id)
# Load the base model in INT8, then attach the LoRA adapter on top of it.
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

adapter_config.json:   0%|          | 0.00/907 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


adapter_model.safetensors:   0%|          | 0.00/26.2M [00:00<?, ?B/s]
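
The deprecation warning above recommends passing a `BitsAndBytesConfig` through `quantization_config` instead of `load_in_8bit=True`. A minimal sketch of the same loading step written that way follows. The helper name `load_int8_peft_whisper` is hypothetical, the imports are kept inside the function so that defining it does not itself need the heavy libraries, and it assumes `transformers`, `peft`, and `bitsandbytes` are installed on a GPU runtime.

```python
def load_int8_peft_whisper(peft_model_id: str):
    """Load the base Whisper model in INT8 and attach the LoRA adapter."""
    # Deferred imports: only needed when the function is actually called.
    from peft import PeftConfig, PeftModel
    from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

    peft_config = PeftConfig.from_pretrained(peft_model_id)
    base_model = WhisperForConditionalGeneration.from_pretrained(
        peft_config.base_model_name_or_path,
        # Replaces the deprecated `load_in_8bit=True` keyword argument.
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    return PeftModel.from_pretrained(base_model, peft_model_id)
```

Calling `load_int8_peft_whisper(peft_model_id)` should yield a model equivalent to the one loaded in the cell above.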

In [29]:
import gc

import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

eval_dataloader = DataLoader(common_voice["test"], batch_size=8, collate_fn=data_collator)

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            # Seed the decoder with the first four tokens of each label sequence.
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            # Swap the -100 loss-mask value back to the pad token so labels can be decoded.
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    # Free per-batch tensors to keep memory usage flat across the loop.
    del generated_tokens, labels, batch
    gc.collect()
cer = 100 * metric.compute()
print(f"{cer=}")

100%|██████████| 370/370 [29:48<00:00,  4.83s/it]


cer=33.39755213055304


## Using AutomaticSpeechRecognitionPipeline

**A few important notes:**
1. `pipe()` should be called inside the autocast context manager `with torch.cuda.amp.autocast():`.
2. `forced_decoder_ids` specifying the `language` being transcribed should be provided in the `generate_kwargs` dict.
3. You will get a warning along the lines of the one below, which is **safe to ignore**.
```
The model 'PeftModel' is not supported for . Supported models are ['SpeechEncoderDecoderModel', 'Speech2TextForConditionalGeneration', 'SpeechT5ForSpeechToText', 'WhisperForConditionalGeneration', 'Data2VecAudioForCTC', 'HubertForCTC', 'MCTCTForCTC', 'SEWForCTC', 'SEWDForCTC', 'UniSpeechForCTC', 'UniSpeechSatForCTC', 'Wav2Vec2ForCTC', 'Wav2Vec2ConformerForCTC', 'WavLMForCTC'].

```

In [None]:
import torch
import gradio as gr
from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperTokenizer,
    WhisperProcessor,
)
from peft import PeftModel, PeftConfig


peft_model_id = "SylviaThsu/whisper-large-v3-chinese-custom-ft-0306"
language = "Chinese"
task = "transcribe"
# Load the INT8 base model named in the adapter config, then attach the LoRA adapter.
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)

model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = WhisperTokenizer.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
processor = WhisperProcessor.from_pretrained(peft_config.base_model_name_or_path, language=language, task=task)
feature_extractor = processor.feature_extractor
# Decoder prompt ids that pin the language and task during generation.
forced_decoder_ids = processor.get_decoder_prompt_ids(language=language, task=task)
pipe = AutomaticSpeechRecognitionPipeline(model=model, tokenizer=tokenizer, feature_extractor=feature_extractor)


def transcribe(audio):
    # `audio` is a file path, because the Gradio input below uses type="filepath".
    with torch.cuda.amp.autocast():
        text = pipe(audio, generate_kwargs={"forced_decoder_ids": forced_decoder_ids}, max_new_tokens=255)["text"]
    return text


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title="PEFT LoRA + INT8 Whisper Large V3 turbo",
    description="Realtime demo for turbo speech recognition using `PEFT-LoRA+INT8` fine-tuned Whisper Large V3 model.",
)

# share=True creates a temporary public URL; debug=True keeps the cell running to show logs.
iface.launch(share=True, debug=True)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Device set to use cuda:0


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://c14b734d59f04935d7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


