<a href="https://www.kaggle.com/code/matteoparrotta/speech2text-model-training-and-evaluation?scriptVersionId=245813864" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction:

This part of the project tests and fine-tunes some of the best open-source speech-to-text models, with a view to deploying them in a client-server platform. The first part of the project reviewed the literature on this use case, the main paid services (accessed through APIs), and the available open-source models. Older and less performant models are not tested in this notebook.

Note: part of the fine-tuning code was taken from the open-source guide available at: https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb

In [2]:
%%capture
!pip install soundfile speechbrain accelerate 
!pip install evaluate jiwer

In [3]:
from huggingface_hub import login
from datasets import load_dataset, DatasetDict, Audio
from transformers import Wav2Vec2FeatureExtractor, WhisperTokenizer, WhisperProcessor, WhisperForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from torch.nn.parallel import DistributedDataParallel as DDP
import evaluate
from torchmetrics.text import CharErrorRate
from tqdm import tqdm
import torch
import time
import re
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from dataclasses import dataclass
from typing import Any, Dict, List, Union

HF_KEY = "your_hf_key"
WHISPER_VERSION = "openai/whisper-base"
WHISPER_LANGUAGE = "it"
M4T2_LANGUAGE = "ita"
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

login(token=HF_KEY)


# Section 1: Dataset implementation and pre-processing

Note that, for each dataset, the second code cell (column selection, renaming, and resampling) prepares the data for the comparison in Section 3 of the project.

## 1.1) Dataset implementation

In [8]:
sampling_rate = 16000  # target sampling rate in Hz (an integer, as expected by Audio)

### 1.1.1) DATASET 1: ITALIC

In [None]:
italic = DatasetDict()

# Uncomment to load the training data; Sections 1.2 and 2 (fine-tuning) rely on italic["train"]
#italic["train"] = load_dataset("RiTA-nlp/ITALIC","hard_speaker", split="train+validation", token=True,)
italic["test"] = load_dataset("RiTA-nlp/ITALIC","hard_speaker",
                              split="test", token=True)


The repository for RiTA-nlp/ITALIC contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/RiTA-nlp/ITALIC.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y



In [5]:
#Columns selection
dataset = italic["test"].select_columns(["utt","audio"])

#Set the sampling rate of the audio tracks
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

dataset = dataset.rename_column("utt","text")

### 1.1.2) DATASET 2: Common-Voice

In [6]:
common_voice = DatasetDict()


#common_voice["train"] = load_dataset("mozilla-foundation/common_voice_13_0","it", split="train+validation", token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_13_0","it",
                              split="test", token=True)


In [9]:
dataset = common_voice["test"].select_columns(["audio","sentence"])

dataset = dataset.rename_column("sentence","text")

dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

#Selection of the first 1500 rows
dataset = dataset.select(range(0,1500))

### 1.1.3) DATASET 3: Minds14

In [6]:
minds14 = DatasetDict()

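#minds14["train"] = load_dataset("PolyAI/minds14","it-IT", split="train+validation", token=True)
# No split is passed here, so load_dataset returns a DatasetDict; the split is selected in the next cell
minds14["test"] = load_dataset("PolyAI/minds14","it-IT",
                               token=True)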

The repository for PolyAI/minds14 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/PolyAI/minds14.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y



In [None]:
dataset = minds14["test"].select_columns(["audio","transcription"])

dataset = dataset.rename_column("transcription","text")

dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

#Select the "train" split from the DatasetDict
dataset = dataset["train"]

#dataset = dataset.select(range(0,1500))

### 1.1.4) DATASET 4: AMI (English Only)

In [None]:
ami = DatasetDict()

#ami["train"] = load_dataset("edinburghcstr/ami","ihm", split="train+validation", token=True)
ami["test"] = load_dataset("edinburghcstr/ami","ihm",
                              split="test", token=True)

In [None]:
dataset = ami["test"].select_columns(["audio","text"])

dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

#Select the first 1500 rows
dataset = dataset.select(range(0,1500))

## 1.2) Dataset preparation for whisper finetuning
 


### 1.2.1) Definition of whisper processor
The Whisper processor wraps two components:
1) Whisper feature extractor
2) Whisper tokenizer

Together they convert the raw audio and the transcripts into the format expected by the Whisper model.

In [6]:
#choose the preferred language (the processor holds no model weights, so no device_map is needed)
processor = WhisperProcessor.from_pretrained(WHISPER_VERSION, language=WHISPER_LANGUAGE, task="transcribe")

feature_extractor = processor.feature_extractor
tokenizer = processor.tokenizer

# Sample transcript from the training split (requires italic["train"] to be loaded in 1.1.1)
input_str = italic["train"][0]['utt']
labels = tokenizer(input_str).input_ids

decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

# Check that the tokenizer encodes and decodes the labels (transcribed text) correctly
print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")


Input:                 svegliami alle cinque di mattina questa settimana
Decoded w/ special:    <|startoftranscript|><|it|><|transcribe|><|notimestamps|>svegliami alle cinque di mattina questa settimana<|endoftext|>
Decoded w/out special: svegliami alle cinque di mattina questa settimana
Are equal:             True


### 1.2.2) Definition of the Whisper Model (Encoder - Decoder) architecture

In [7]:
model = WhisperForConditionalGeneration.from_pretrained(WHISPER_VERSION,device_map='auto')

#It's possible to choose the preferred language
model.generation_config.language = WHISPER_LANGUAGE
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None


In [8]:
#The dataset preparation method
def prepare_dataset(batch):
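    # the audio is decoded when the column is accessed; the feature extractor expects 16 kHz input,
    # so cast the column with Audio(sampling_rate=...) beforehand if the source rate differs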
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["utt"]).input_ids
    return batch
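
As a sanity check on what the feature extractor produces, the expected shape of the log-Mel input can be worked out by hand. The sketch below assumes Whisper's standard front-end settings (16 kHz audio, 30-second windows, a hop of 160 samples, 80 Mel bins for the base checkpoint and 128 for large-v3). `expected_feature_shape` is a hypothetical helper used only for illustration, not part of the notebook.

```python
def expected_feature_shape(n_mels=80, chunk_seconds=30, sampling_rate=16000, hop_length=160):
    """Return (mel bins, frames) for one padded 30-second window (assumed settings)."""
    n_samples = chunk_seconds * sampling_rate   # 480000 samples per window
    n_frames = n_samples // hop_length          # one frame every 10 ms
    return (n_mels, n_frames)

print(expected_feature_shape())             # (80, 3000)
print(expected_feature_shape(n_mels=128))   # (128, 3000)
```

Because shorter clips are padded to the full window, every example ends up with the same input shape regardless of its original duration.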


In [9]:
italic = italic.map(prepare_dataset, remove_columns=italic.column_names["train"], num_proc=8)
italic


DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 15080
    })
    test: Dataset({
        features: ['input_features', 'labels'],
        num_rows: 1441
    })
})

# Section 2: Finetuning (optional)

The fine-tuning step requires powerful hardware. For example, on the P100 GPU provided by Kaggle, each iteration (forward + backward pass) over a batch of 8 examples takes approximately 17 seconds for the Whisper Medium model.
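
To put the 17-second figure in perspective, here is a rough back-of-the-envelope estimate. It assumes the 15,080 training examples reported in Section 1.2 and a constant time per step. `estimate_epoch_hours` is a hypothetical helper, not part of the notebook.

```python
import math

def estimate_epoch_hours(num_examples, batch_size, seconds_per_step):
    """Rough wall-clock time for one pass over the data, in hours."""
    steps = math.ceil(num_examples / batch_size)
    return steps * seconds_per_step / 3600

hours = estimate_epoch_hours(num_examples=15080, batch_size=8, seconds_per_step=17)
print(f"{hours:.1f} hours per epoch")  # 8.9 hours per epoch
```

Under these assumptions a single epoch of Whisper Medium would take most of a working day, which is why the rest of this section uses the smaller base checkpoint.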

## 2.1) Data Collator definition
This component is needed to:
1) Handle variable-length sequences (such as the tokenized labels) by padding them within each batch
2) Enable efficient batch processing on the GPU
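
The padding idea can be sketched in plain Python before looking at the tensor-based collator. This is a simplified stand-in, not the collator itself: `pad_labels` is a hypothetical helper and the token ids are made up. The value -100 is the index that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_labels(label_seqs, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in label_seqs]

# Two label sequences of different lengths (illustrative token ids)
batch = pad_labels([[50258, 11, 12, 50257], [50258, 21, 50257]])
for row in batch:
    print(row)
```

After padding, all rows have the same length and can be stacked into a single tensor for the GPU.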

In [10]:
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here since it is appended again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


In [11]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor = processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

## 2.2) Hugging face trainer definition and training process

In [36]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * wer_metric.compute(predictions = pred_str, references = label_str)
    cer = 100 * cer_metric.compute(predictions = pred_str, references = label_str)

    return {"wer": wer, "cer": cer}


In [35]:
model.config.use_cache = False

BATCH_SIZE = 128

training_args = Seq2SeqTrainingArguments(
    dataloader_num_workers = 4,
    output_dir="./whisper_base", 
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=2, 
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=700,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    per_device_eval_batch_size=BATCH_SIZE,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=40,
    eval_steps=20,
    logging_steps=25,
    report_to= "none",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    #push_to_hub=True,
    #hub_model_id="matteoparrott/whisper_large_v3_it"
    
)
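
Two consequences of these arguments are easy to miss: the effective batch size and the shape of the learning-rate schedule. The sketch below works them out under two assumptions: a single GPU, and the Trainer's default linear scheduler (warmup followed by linear decay to zero). The helper functions are hypothetical and only mirror the arithmetic.

```python
PER_DEVICE_BATCH = 128
GRAD_ACCUM_STEPS = 2
BASE_LR = 1e-5
WARMUP_STEPS = 500
MAX_STEPS = 700

def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Number of examples contributing to one optimizer step."""
    return per_device * accum_steps * num_devices

def linear_lr(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, max_steps=MAX_STEPS):
    """Linear warmup to base_lr, then linear decay to zero at max_steps."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup))

print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))  # 256
for step in (20, 60, 500, 600, 700):
    print(step, f"{linear_lr(step):.2e}")
```

With these settings the learning rate is still warming up for the first 500 of the 700 steps, so the early evaluation points are computed at a small fraction of the nominal learning rate.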


In [37]:
def partition_dataset(dataset, num_partitions):
    partition_size = len(dataset) // num_partitions
    return [dataset.select(range(i * partition_size, (i + 1) * partition_size)) for i in range(num_partitions)]

num_partitions = 16  # Number of partitions
partitions = partition_dataset(italic["train"], num_partitions)
partitions[0]

Dataset({
    features: ['input_features', 'labels'],
    num_rows: 942
})
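
The 942 rows per partition follow from integer division of the 15,080 training rows by 16. A small sketch of the index arithmetic also shows that the remainder is dropped. `partition_bounds` is a hypothetical helper that mirrors `partition_dataset` without touching the data.

```python
def partition_bounds(num_rows, num_partitions):
    """Return the (start, end) row indices of each equally sized partition."""
    size = num_rows // num_partitions
    return [(i * size, (i + 1) * size) for i in range(num_partitions)]

bounds = partition_bounds(15080, 16)
print(bounds[0], bounds[-1])   # (0, 942) (14130, 15072)
print(15080 - bounds[-1][1])   # 8 rows are not assigned to any partition
```

Losing 8 rows out of 15,080 is negligible here, but the effect grows if the number of partitions is large relative to the dataset size.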

In [41]:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=partitions[0],
    eval_dataset=partitions[1],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)



In [None]:
trainer.train()

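| Step | Training Loss | Validation Loss | Wer | Cer |
|------|---------------|-----------------|-----|-----|
| 20 | No log | 1.349006 | 52.061582 | 18.387127 |
| 40 | 1.380000 | 1.160457 | 48.646257 | 17.427231 |
| 60 | 1.185400 | 0.924857 | 41.019289 | 13.861017 |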




## 2.3) Evaluation

In [None]:
eval_result = trainer.evaluate()
trainer.save_metrics("eval", eval_result)  # evaluation metrics are stored under the "eval" split
eval_result

# Section 3: Comparison M4T2 model - Whisper

This section compares two of the best open-source speech-to-text models, SeamlessM4T v2 (M4T2) and Whisper. The models are compared on the following metrics:
1) WER (Word Error Rate)
2) CER (Character Error Rate)
3) Inference time
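
For reference, both error rates are edit distances divided by the reference length: WER counts words, CER counts characters. The following is a minimal pure-Python sketch for a single sentence pair. The notebook itself relies on the `evaluate` metrics, which aggregate errors over the whole test set, and the helper names below are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

ref = "svegliami alle cinque di mattina"
hyp = "svegliami alle cinque mattina"   # the word "di" is missing
print(f"WER: {100 * wer(ref, hyp):.1f}%")  # WER: 20.0%
print(f"CER: {100 * cer(ref, hyp):.1f}%")  # CER: 9.4%
```

One missing word out of five gives a WER of 20%, while the same error costs only three characters out of 32, which is why CER is usually the lower of the two figures.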


## 3.1) Implementation of M4T2 model and data pre-processing

In [10]:
from transformers import AutoProcessor, SeamlessM4Tv2Model
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large", device_map = 'auto')

#model.to(device)



### 3.1.1) Simple inference with the model

In [18]:
# Index the row first so that only this example's audio is decoded
sample = dataset[0]
audio_sample = sample["audio"]
label = sample["text"]

audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt", sampling_rate=16000)
audio_inputs = audio_inputs.to(device)

output_tokens = model.generate(**audio_inputs, tgt_lang="ita", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
print("The text transcribed by the model is: ",translated_text_from_audio)
print("The label text is: ", label)

The text transcribed by the model is:  Il libro ha suscitato molte polemiche a causa dei suoi contenuti.
The label text is:  Il libro ha suscitato molte polemiche a causa dei suoi contenuti.


## 3.2) Implementation of Whisper model

In [6]:
from transformers import WhisperProcessor
from transformers import WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3", language=WHISPER_LANGUAGE, task="transcribe")

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3",device_map='auto')

model.generation_config.language = WHISPER_LANGUAGE
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None


## 3.3) Comparison between Whisper and M4T2

### 3.3.1) SeamlessM4T2-Test loop

In [19]:
length = dataset.num_rows

all_predictions = []
all_references = dataset['text'][0:length]

t=0
#for each item in the dataset, transcribe and store results in all_predictions
for i in tqdm(range(0,length)):
    input_speech = dataset[i]['audio']
    t0= time.time()
    
    input_features = processor(audios = input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt")
    input_features.to(device)
    with torch.no_grad(): 
        output_tokens = model.generate(**input_features, tgt_lang=M4T2_LANGUAGE, generate_speech=False)
    
    transcription = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
    
    t+= (time.time() - t0)
    all_predictions.append(transcription)


100%|██████████| 1500/1500 [19:38<00:00,  1.27it/s]


### 3.3.2) Whisper test loop

In [11]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
length = dataset.num_rows

all_predictions = []
all_references = dataset['text'][0:length]

t=0

for i in tqdm(range(0,length)):
    input_speech = dataset[i]['audio']
    t0= time.time()
    input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features.to(device)
    predicted_ids = model.generate(input_features)#, forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    t+= (time.time() - t0)
    all_predictions.append(transcription[0])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 1500/1500 [24:36<00:00,  1.02it/s]


## 3.4) Result visualization

To compute the WER and CER metrics, a normalization step is needed to remove elements such as extra whitespace, uppercase letters, punctuation, and other artifacts. The Whisper normalizer used here also applies Unicode normalization; it removes diacritics only when created with `remove_diacritics=True`, which is not the case below.

Note that the Whisper `BasicTextNormalizer` performs well on English text, but its effectiveness in other languages may be limited.
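
The effect of normalization on word-level comparison can be seen with a toy example. The function below is a deliberately simplified stand-in (lowercasing, ASCII punctuation removal, whitespace collapsing), not the Whisper normalizer itself. `simple_normalize` is a hypothetical name used only for illustration.

```python
import re
import string

def simple_normalize(text):
    """Simplified stand-in normalizer: lowercase, drop ASCII punctuation, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    return re.sub(r"\s+", " ", text).strip()

reference = "Il libro ha suscitato molte polemiche."
prediction = "il libro ha suscitato molte polemiche"

# Raw comparison: "Il" vs "il" and "polemiche." vs "polemiche" count as word errors
print(reference.split() == prediction.split())                                      # False
# After normalization the two transcripts match word for word
print(simple_normalize(reference).split() == simple_normalize(prediction).split())  # True
```

Without normalization, a transcript that is acoustically perfect would still be penalized for casing and punctuation, which is why both raw and normalized scores are reported.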

In [21]:
import json

#Metrics (cer_metric was loaded in the setup cell)
wer_metric = evaluate.load("wer")

#Normalizer provided by Whisper
normalizer = BasicTextNormalizer()

#Normalize both predictions and references, stripping the whitespace the normalizer can leave at the edges
all_predictions_normalized = [normalizer(p).strip() for p in all_predictions]
all_references_normalized = [normalizer(r).strip() for r in all_references]

#Scores on the normalized text
wer_norm = 100 * wer_metric.compute(
    references=all_references_normalized, predictions=all_predictions_normalized
)
cer_norm = 100 * cer_metric.compute(
    references=all_references_normalized, predictions=all_predictions_normalized
)

metrics_norm = {
    'WER': wer_norm,
    'CER': cer_norm,
    'Time': t
}

print('WERNorm:', wer_norm)
print('CERNorm:', cer_norm)

with open('metricsNorm(M4T2-ami).json', 'w') as f:
    json.dump(metrics_norm, f, indent=4)

#Scores on the raw (non-normalized) text
wer = 100 * wer_metric.compute(
    references=all_references, predictions=all_predictions
)
cer = 100 * cer_metric.compute(references=all_references, predictions=all_predictions)

metrics = {
    'WER': wer,
    'CER': cer,
    'Time': t
}

print('WER:', wer)
print('CER:', cer)
print('Time:', t)

with open('metrics(M4T2-ami).json', 'w') as f:
    json.dump(metrics, f, indent=4)


WERNorm: 8.607060106310481
CERNorm: 3.4840729679016733
WER: 12.675958188153311
CER: 4.381951822380261
Time: 1161.9507732391357
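
Two derived figures make these numbers easier to compare across models. The sketch below assumes that the timing refers to the 1,500 utterances processed in the test loop of Section 3.3.1. The variable names are illustrative.

```python
wer_raw, wer_norm = 12.675958188153311, 8.607060106310481
total_seconds, num_samples = 1161.9507732391357, 1500

seconds_per_sample = total_seconds / num_samples
relative_wer_drop = 100 * (wer_raw - wer_norm) / wer_raw

print(f"{seconds_per_sample:.2f} s per utterance")                              # 0.77 s per utterance
print(f"{relative_wer_drop:.1f}% relative WER reduction after normalization")  # 32.1% ...
```

Roughly a third of the raw word errors disappear after normalization, which suggests they stem from casing and punctuation rather than from misrecognized words.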
