The Whisper checkpoints come in five configurations of varying model sizes.
The smallest four are trained on either English-only or multilingual data.
The largest checkpoints are multilingual only. All 11 of the pre-trained checkpoints
are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The
checkpoints are summarised in the following table with links to the models on the Hub:

| Size     | Layers | Width | Heads | Parameters | English-only                                         | Multilingual                                        |
|----------|--------|-------|-------|------------|------------------------------------------------------|-----------------------------------------------------|
| tiny     | 4      | 384   | 6     | 39 M       | [✓](https://huggingface.co/openai/whisper-tiny.en)   | [✓](https://huggingface.co/openai/whisper-tiny.)    |
| base     | 6      | 512   | 8     | 74 M       | [✓](https://huggingface.co/openai/whisper-base.en)   | [✓](https://huggingface.co/openai/whisper-base)     |
| small    | 12     | 768   | 12    | 244 M      | [✓](https://huggingface.co/openai/whisper-small.en)  | [✓](https://huggingface.co/openai/whisper-small)    |
| medium   | 24     | 1024  | 16    | 769 M      | [✓](https://huggingface.co/openai/whisper-medium.en) | [✓](https://huggingface.co/openai/whisper-medium)   |
| large    | 32     | 1280  | 20    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large)    |
| large-v2 | 32     | 1280  | 20    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 | 32     | 1280  | 20    | 1550 M     | x                                                    | [✓](https://huggingface.co/openai/whisper-large-v3) |


For demonstration purposes, we'll fine-tune the multilingual version of the
[`"small"`](https://huggingface.co/openai/whisper-small) checkpoint with 244M params (~= 1GB).
As for our data, we'll train and evaluate our system on a low-resource language
taken from the [Common Voice](https://huggingface.co/datasets/mozilla-foundation/dataset_11_0)
dataset. We'll show that with as little as 8 hours of fine-tuning data, we can achieve
strong performance in this language.

In [None]:
!pip install --upgrade --quiet pip
!pip install --upgrade --quiet datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio

In [1]:
import torch, gc 

gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

torch.cuda.empty_cache()
gc.collect()

Fri Apr 11 11:53:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  GRID V100S-32Q                 On  |   00000000:03:00.0  On |                    0 |
| N/A   N/A    P0             N/A /  N/A  |    1130MiB /  32768MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

0

In [None]:
from huggingface_hub import notebook_login

# notebook_login()

!huggingface-cli login --token TOKEN_HUGGINGFACE_A_COLLER_ICIs

  from .autonotebook import tqdm as notebook_tqdm


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ziane212/.cache/huggingface/token
Login successful


! pip install -U openai-whisper

import whisper
model = whisper.load_model("medium")
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Loading our checkpoint state dict here.
finetune_model = WhisperForConditionalGeneration.from_pretrained("Rziane/whisper-medium-test")
model.load_state_dict(finetune_model.model.state_dict())

In [5]:
model_id = "openai/whisper-large-v3-turbo"
output_model_path = "/mnt/PROJET_CAENNAIS/expe_frapeor_ASR/models/whisper-large-v3-turbo-CAENNAIS_GB_E2"

kwargs = {
    "dataset_tags": "CAENNAIS_GB_E2",
    "dataset": "CAENNAIS_GB_E2",  # a 'pretty' name for the training dataset
    "dataset_args": "config: fr, split: test",
    "language": "fr",
    "model_name": "Whisper large v3 turbo CAENNAIS_GB_E2",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-large-v3-turbo",
    "tasks": "automatic-speech-recognition",
}

path_train = "/mnt/PROJET_CAENNAIS/expe_frapeor_ASR/dataset_caennais_GB_E2/dataset/train_dataset_GB_E2.tsv"
path_test = "/mnt/PROJET_CAENNAIS/expe_frapeor_ASR/dataset_caennais_GB_E2/dataset/test_dataset_GB_E2.tsv"

In [6]:
from datasets import load_dataset, Audio

file_dict = {
  "train" : path_train,
  "test" : path_test
}

dataset = load_dataset('csv',
                     data_files=file_dict,
                     delimiter='\t',
                     column_names=['audio','text'],
                     skiprows=1
                     )

dataset_train = dataset['train']
dataset_test= dataset['test']

print(dataset_train)

Dataset({
    features: ['audio', 'text'],
    num_rows: 1028
})


In [7]:
# import re 
# def remove_bracketed_text(batch):
#     batch["text"] = re.sub(r'\[.*?\]', '', batch["text"])
#     return batch

# dataset['train'] = dataset['train'].map(remove_bracketed_text)
# dataset['test'] = dataset['test'].map(remove_bracketed_text)


In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'text'],
        num_rows: 1028
    })
    test: Dataset({
        features: ['audio', 'text'],
        num_rows: 257
    })
})

In [9]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)

In [10]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(model_id, language="french", task="transcribe")

In [11]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(model_id, language="french", task="transcribe")

In [12]:
print(dataset["train"][0])

{'audio': '/mnt/PROJET_CAENNAIS/expe_frapeor_ASR/dataset_caennais_GB_E2/audio/GB_E2_LOCB1_745430_746110.wav', 'text': 'étaient'}


In [13]:
from datasets import Audio

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

In [14]:
print(dataset["train"][0])

{'audio': {'path': '/mnt/PROJET_CAENNAIS/expe_frapeor_ASR/dataset_caennais_GB_E2/audio/GB_E2_LOCB1_745430_746110.wav', 'array': array([ 0.00073775, -0.00260267, -0.011188  , ..., -0.00399018,
       -0.00315858, -0.00228711]), 'sampling_rate': 16000}, 'text': 'étaient'}


In [15]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

In [16]:
dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names["train"], num_proc=4)

Map (num_proc=4): 100%|██████████| 1028/1028 [03:22<00:00,  5.07 examples/s]
Map (num_proc=4): 100%|██████████| 257/257 [00:50<00:00,  5.05 examples/s]


In [17]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(model_id)

In [18]:
model.generation_config.language = "french"
model.generation_config.task = "transcribe"

model.config.return_token_timestamps = True
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

model.generation_config.forced_decoder_ids = None

In [19]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [20]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

In [21]:
import evaluate

metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

In [22]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    # cer = cer_metric.compute(predictions=pred_str, references=label_str)

    # return {"wer": wer, "cer": cer}
    return {"wer": wer}

In [23]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=output_model_path,  # change to a repo name of your choice
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    gradient_checkpointing=True,
    fp16=True,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
    save_total_limit=2
)



In [24]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

  trainer = Seq2SeqTrainer(


In [25]:
processor.save_pretrained(training_args.output_dir)

[]

The peak GPU memory for the given training configuration is approximately 15.8GB.
Depending on the GPU allocated to the Google Colab, it is possible that you will encounter a CUDA `"out-of-memory"` error when you launch training.
In this case, you can reduce the `per_device_train_batch_size` incrementally by factors of 2
and employ [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments.gradient_accumulation_steps)
to compensate.

To launch training, simply execute:

In [None]:
trainer.train()

  0%|          | 0/330 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
 10%|█         | 33/330 [07:17<44:18,  8.95s/it]  You have passed task=transcribe, but also have set `forced_decoder_ids` to [(1, 50265), (2, 50360), (3, 50364)] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Trainer.tokenize

{'eval_loss': 1.3546198606491089, 'eval_wer': 39.415268002165675, 'eval_runtime': 129.1086, 'eval_samples_per_second': 1.991, 'eval_steps_per_second': 0.132, 'epoch': 1.0}


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
 20%|██        | 66/330 [17:12<39:31,  8.98s/it]  Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You shoul

{'eval_loss': 0.551038920879364, 'eval_wer': 29.507309149972926, 'eval_runtime': 133.7542, 'eval_samples_per_second': 1.921, 'eval_steps_per_second': 0.127, 'epoch': 2.0}


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
 23%|██▎       | 75/330 [22:13<1:10:28, 16.58s/it]

In [28]:
# trainer.save_pretrained(save_directory = output_model_path, safe_serialization=False)
# model = WhisperForConditionalGeneration.from_pretrained(output_model_path)
trainer.push_to_hub()


events.out.tfevents.1736961078.V301V-JGRCC1.campus.unicaen.fr.7784.0: 100%|██████████| 9.53k/9.53k [00:00<00:00, 24.9kB/s]


CommitInfo(commit_url='https://huggingface.co/Rziane/whisper-large-v3-turbo-seg/commit/99acada2b1806f5631448e9857aa89248c69f451', commit_message='End of training', commit_description='', oid='99acada2b1806f5631448e9857aa89248c69f451', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Rziane/whisper-large-v3-turbo-seg', endpoint='https://huggingface.co', repo_type='model', repo_id='Rziane/whisper-large-v3-turbo-seg'), pr_revision=None, pr_num=None)

In [28]:
a

NameError: name 'a' is not defined

In [1]:
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
import torch
import gc
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Libérer la mémoire GPU
torch.cuda.empty_cache()
gc.collect()

file_dict = {
    "train": "/home/ziane212/projects/MMS_ASR_finetuning/dataset_seg/train.tsv",
    "test": "/home/ziane212/projects/MMS_ASR_finetuning/dataset_seg/test.tsv"
}

# Charger le dataset
dataset = load_dataset(
    'csv',
    data_files=file_dict,
    delimiter='\t',
    column_names=['audio', 'text', 'timecodes', 'speaker'],
    skiprows=1
)

# Convertir les chemins audio en objets Audio
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Séparer en ensemble d'entraînement et de test
dataset_train = dataset['train']
dataset_test = dataset['test']

# Charger le modèle pré-entraîné et le processeur
model_name = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Prétraiter les données
def prepare_dataset(batch):
    # Charger et rééchantillonner les données audio de 48 kHz à 16 kHz
    audio = batch["audio"]

    # Calculer les caractéristiques log-Mel à partir du tableau audio d'entrée
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Encoder le texte cible en identifiants d'étiquettes
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

# Appliquer le prétraitement
dataset_train = dataset_train.map(prepare_dataset, remove_columns=["timecodes", "speaker"])
dataset_test = dataset_test.map(prepare_dataset, remove_columns=["timecodes", "speaker"])

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Séparer les entrées et les étiquettes car elles doivent avoir des longueurs différentes et des méthodes de padding différentes
        # Traiter d'abord les entrées audio en renvoyant simplement des tenseurs torch
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Obtenir les séquences d'étiquettes tokenisées
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # Appliquer le padding aux étiquettes pour atteindre la longueur maximale
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Remplacer le padding par -100 pour ignorer correctement la perte
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Si le token bos est ajouté dans l'étape de tokenisation précédente,
        # supprimer le token bos ici car il est ajouté plus tard
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

# Initialiser le collateur de données
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

# Configurer le modèle pour générer des timestamps
model.config.return_token_timestamps = True
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Configurer les arguments d'entraînement
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned-timestamps",
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=4,  # Réduit pour éviter les problèmes de mémoire
    gradient_accumulation_steps=4,
    save_total_limit=1,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),  # Activer FP16 si le GPU le supporte
    logging_dir="./logs",
)

# Définir le Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_test,
    data_collator=data_collator,  # Utilisation du collateur personnalisé
)

# Lancer l'entraînement
trainer.train()

# Sauvegarder le modèle fine-tuné
trainer.save_model("./whisper-finetuned-timestamps")


  from .autonotebook import tqdm as notebook_tqdm
  0%|          | 0/220 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
                                                
 20%|██        | 44/220 [05:13<17:59,  6.14s/it]

{'eval_loss': 0.9869877099990845, 'eval_runtime': 41.7505, 'eval_samples_per_second': 4.263, 'eval_steps_per_second': 0.551, 'epoch': 0.99}


                                                
 40%|████      | 88/220 [10:25<13:28,  6.13s/it]

{'eval_loss': 0.9118785262107849, 'eval_runtime': 41.8205, 'eval_samples_per_second': 4.256, 'eval_steps_per_second': 0.55, 'epoch': 1.99}


                                                 
 60%|██████    | 132/220 [15:37<08:57,  6.11s/it]

{'eval_loss': 0.8953073620796204, 'eval_runtime': 41.5885, 'eval_samples_per_second': 4.28, 'eval_steps_per_second': 0.553, 'epoch': 2.98}


                                                 
 80%|████████  | 177/220 [20:46<03:38,  5.09s/it]

{'eval_loss': 0.9079563021659851, 'eval_runtime': 41.2409, 'eval_samples_per_second': 4.316, 'eval_steps_per_second': 0.558, 'epoch': 4.0}


                                                 
100%|██████████| 220/220 [25:49<00:00,  7.04s/it]


{'eval_loss': 0.9392129778862, 'eval_runtime': 44.043, 'eval_samples_per_second': 4.042, 'eval_steps_per_second': 0.522, 'epoch': 4.97}
{'train_runtime': 1549.4104, 'train_samples_per_second': 2.285, 'train_steps_per_second': 0.142, 'train_loss': 0.6746267838911577, 'epoch': 4.97}


In [34]:
print(dataset["train"][0]["audio"])

KeyError: 'audio'

In [2]:
from datasets import load_dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import random

# Charger le dataset
file_dict = {
  "train": "/home/ziane212/projects/MMS_ASR_finetuning/caennais_gold_12.11.24/dataset_caennais_12.11.24/train_dataset_caennais_12.11.24.tsv",
  "test": "/home/ziane212/projects/MMS_ASR_finetuning/caennais_gold_12.11.24/dataset_caennais_12.11.24/test_dataset_caennais_12.11.24.tsv"
}

eval_dataset = load_dataset('csv',
                            data_files=file_dict,
                            delimiter='\t',
                            column_names=['audio', 'sentence'],
                            skiprows=1)

eval_common_voice_test = eval_dataset['test']
eval_common_voice_test = eval_common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))

# Charger le modèle fine-tuné
model_id = "/home/ziane212/projects/MMS_ASR_finetuning/whisper-finetuned-timestamps"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Configurer le modèle pour générer des timestamps
model.config.return_token_timestamps = True
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Tester les prédictions avec timestamps
for i in range(0, 5):
    rand = random.randint(0, len(eval_common_voice_test) - 1)  # Index aléatoire
    audio = eval_common_voice_test[rand]["audio"]["array"]
    reference = eval_common_voice_test[rand]["sentence"]

    # Prétraitement de l'audio
    input_features = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True).input_features

    # Prédictions avec timestamps activés
    outputs = model.generate(input_features, max_length=448)

    # Décoder les tokens générés
    tokens_with_timestamps = processor.tokenizer.convert_ids_to_tokens(outputs[0])

    # Vérifier si des timestamps sont générés
    print(f"\nAudio ID: {rand}")
    print("Tokens générés :", tokens_with_timestamps)
    print("Reference :", reference.lower())


OSError: /home/ziane212/projects/MMS_ASR_finetuning/whisper-finetuned-timestamps does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//home/ziane212/projects/MMS_ASR_finetuning/whisper-finetuned-timestamps/tree/main' for available files.

In [None]:
from datasets import load_dataset, Audio

file_dict = {
  "train" : "/home/ziane212/projects/MMS_ASR_finetuning/caennais_gold_12.11.24/dataset_caennais_12.11.24/train_dataset_caennais_12.11.24.tsv",
  "test" : "/home/ziane212/projects/MMS_ASR_finetuning/caennais_gold_12.11.24/dataset_caennais_12.11.24/test_dataset_caennais_12.11.24.tsv"
}

eval_dataset = load_dataset('csv',
                     data_files=file_dict,
                     delimiter='\t',
                     column_names=['audio', 'sentence'],
                     skiprows=1
                     )

eval_common_voice_train = eval_dataset['train']
eval_common_voice_test = eval_dataset['test']

eval_common_voice_train = eval_common_voice_train.cast_column("audio", Audio(sampling_rate=16_000))
eval_common_voice_test = eval_common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))

from transformers import WhisperTokenizer

model_id = "/home/ziane212/projects/MMS_ASR_finetuning/whisper-large-v3-turbo-seg"
model_id2 = "openai/whisper-large-v3-turbo"

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

processor2 = WhisperProcessor.from_pretrained(model_id2)
model2 = WhisperForConditionalGeneration.from_pretrained(model_id2)
forced_decoder_ids2 = processor2.get_decoder_prompt_ids(language="french", task="transcribe")

import torch
# processor.tokenizer.set_target_lang("fr")
# model.load_adapter("aeb")

import random

for i in range(0, 9):
    # Sélectionne un chiffre aléatoire entre 0 et 100
    rand = random.randint(0, 100)
    input_dict = processor(eval_common_voice_test[rand]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding=True).input_features
    predicted_ids = model.generate(input_dict, language='french').to("cuda")
    transcription = processor.batch_decode(predicted_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    input_dict2 = processor2(eval_common_voice_test[rand]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding=True).input_features
    predicted_ids2 = model2.generate(input_dict2, forced_decoder_ids=forced_decoder_ids2, language='french').to("cuda")
    transcription2 = processor2.batch_decode(predicted_ids2)
    transcription2 = processor2.batch_decode(predicted_ids2, skip_special_tokens=True)

    print(rand)
    print("\nPrediction:", transcription)
    print("Prediction2:", transcription2)
    print("Reference:", eval_common_voice_test[rand]["sentence"].lower())

You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50265], [2, 50360], [3, 50364]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
You have passed language=french, but also have set `forced_decoder_ids` to [(1, 50265), (2, 50360), (3, 50364)] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=french.


81

Prediction: [' ->•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••']
Prediction2: [' vous donnez vous-même des cours, vous savez ce que vous allez faire.']
Reference: vous donnez vous même des cours vous savez pas –fin vous savez ce que vous allez faire
14

Prediction: [' Laink et euh ouais ça va dépendre mais ça enfin je me débrouille']
Prediction2: [' Et ouais, ça va dépendre. Mais ça va, je me débrouille. Je me débrouille.']
Reference: et euh ouais ça va dépendre mais ça va j’ me débrouille
3

Prediction: [" quand j'étais dix-neuf ans"]
Prediction2: [" quand j'étais 19 ans."]
Reference: quand

KeyboardInterrupt: 

In [30]:
from datasets import load_dataset, Audio

file_dict = {
  "train" : "/home/ziane212/projects/MMS_ASR_finetuning/caennais_gold_12.11.24/dataset_caennais_12.11.24/train_dataset_caennais_12.11.24.tsv",
  "test" : "/home/ziane212/projects/MMS_ASR_finetuning/caennais_gold_12.11.24/dataset_caennais_12.11.24/test_dataset_caennais_12.11.24.tsv"
}

eval_dataset = load_dataset('csv',
                     data_files=file_dict,
                     delimiter='\t',
                     column_names=['audio', 'sentence'],
                     skiprows=1
                     )

eval_common_voice_train = eval_dataset['train']
eval_common_voice_test = eval_dataset['test']

eval_common_voice_train = eval_common_voice_train.cast_column("audio", Audio(sampling_rate=16_000))
eval_common_voice_test = eval_common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))

from transformers import WhisperTokenizer

model_id = "/home/ziane212/projects/MMS_ASR_finetuning/whisper-medium-CAENNAIS"
model_id2 = "openai/whisper-large-v3-turbo"

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

import torch
# processor.tokenizer.set_target_lang("fr")
# model.load_adapter("aeb")

In [31]:
for id_, i in enumerate(eval_common_voice_test):
    input_dict = processor(eval_common_voice_test[id_]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding=True).input_features
    predicted_ids = model.generate(input_dict, language='french').to("cuda")
    transcription = processor.batch_decode(predicted_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    # eval_common_voice_test[id_]['text'] = transcription

    print("\nPrediction:", transcription)
    print("Reference:", eval_common_voice_test[rand]["sentence"].lower())


Prediction: ["i’ fallait qu'on fasse euh une quinzaine de pages et donc du coup là i’ fallait qu' nous mêmes"]
Reference: donc c'est pas- c’est le cas pour moi en fait que

Prediction: ['ouais bro- ouais des cra- ouais exactement ouais']
Reference: donc c'est pas- c’est le cas pour moi en fait que

Prediction: ["donc du coup à ce moment-là fin voilà donc du coup les annotations c'est juste les détails sur ce genre de"]
Reference: donc c'est pas- c’est le cas pour moi en fait que

Prediction: ["quand j'étais dix-neuf ans"]
Reference: donc c'est pas- c’est le cas pour moi en fait que

Prediction: ['ok ouais']
Reference: donc c'est pas- c’est le cas pour moi en fait que

Prediction: ['ou voyager ou juste travailler pa’ce que']
Reference: donc c'est pas- c’est le cas pour moi en fait que


KeyboardInterrupt: 

In [None]:
from datasets import load_dataset, Audio

file_dict = {
  "train" : "/home/ziane212/projects/MMS_ASR_finetuning/transcription_auto_caennais/train_dataset_caennais.tsv",
  "test" : "/home/ziane212/projects/MMS_ASR_finetuning/transcription_auto_caennais/test_dataset_caennais.tsv"
}

eval_dataset = load_dataset('csv',
                     data_files=file_dict,
                     delimiter='\t',
                     column_names=['audio', 'sentence'],
                     skiprows=1
                     )

eval_common_voice_train = eval_dataset['train']
eval_common_voice_test = eval_dataset['test']

eval_common_voice_train = eval_common_voice_train.cast_column("audio", Audio(sampling_rate=16_000))
eval_common_voice_test = eval_common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))

from transformers import WhisperTokenizer

model_id = "/home/ziane212/projects/MMS_ASR_finetuning/whisper-medium-fr"
model_id2 = "openai/whisper-medium"

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french")

processor2 = WhisperProcessor.from_pretrained(model_id2)
model2 = WhisperForConditionalGeneration.from_pretrained(model_id2)
forced_decoder_ids2 = processor2.get_decoder_prompt_ids(language="french", task="transcribe")

import torch
# processor.tokenizer.set_target_lang("fr")
# model.load_adapter("aeb")

import random

for i in range(0, 9):
    # Sélectionne un chiffre aléatoire entre 0 et 100
    rand = random.randint(0, 100)
    input_dict = processor(eval_common_voice_test[rand]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding=True).input_features
    predicted_ids = model.generate(input_dict, forced_decoder_ids=forced_decoder_ids).to("cuda")
    transcription = processor.batch_decode(predicted_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    input_dict2 = processor2(eval_common_voice_test[rand]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding=True).input_features
    predicted_ids2 = model2.generate(input_dict2, language='fr').to("cuda")
    transcription2 = processor2.batch_decode(predicted_ids2)
    transcription2 = processor2.batch_decode(predicted_ids2, skip_special_tokens=True)

    print("\nPrediction:", transcription)
    print("Prediction2:", transcription2)
    print("Reference:", eval_common_voice_test[rand]["sentence"].lower())


Prediction: ["pa’ce que c'était trop d’ travail d'un coup euh"]
Prediction2: [" parce que c'était trop de travail d'un coup."]
Reference: pa’ce que c’était trop d’ travail d’un coup euh

Prediction: ["c'est"]
Prediction2: [" C'est..."]
Reference: c'est

Prediction: ['ouais']
Prediction2: [' ...']
Reference: ouais

Prediction: ["c'est logique je pense"]
Prediction2: [" C'est logique, je pense."]
Reference: c'est logique je pense

Prediction: ['ouais oui un train']
Prediction2: [' Ouais. Oui. Un train.']
Reference: comment on dit ils euh le petit parade

Prediction: ['euh on a on a le']
Prediction2: [' On a... on a le...']
Reference: euh on a on a lu

Prediction: ['mmh et m’ic start']
Prediction2: [' On va commencer.']
Reference: m- mjukstart

Prediction: ['t’est p’t-êt’ pas trop mon']
Prediction2: [" C'est peut-être pas trop..."]
Reference: mais c'était p’t-êt' pas trop mon

Prediction: ["je j'ai r-"]
Prediction2: [" J'ai... J'ai ri."]
Reference: je j'ai ris beaucoup


Prediction: ['euh on a on a le']
Prediction2: [' On a... on a le...']
Reference: euh on a on a lu

Prediction: ['mmh et m’ic start']
Prediction2: [' On va commencer.']
Reference: m- mjukstart

Prediction: ["pa’ce que c'était trop d’ travail d'un coup euh"]
Prediction2: [" parce que c'était trop de travail d'un coup."]
Reference: pa’ce que c’était trop d’ travail d’un coup euh

Prediction: ['t’est p’t-êt’ pas trop mon']
Prediction2: [" C'est peut-être pas trop..."]
Reference: mais c'était p’t-êt' pas trop mon

Prediction: ["j’ pense que euh c'est plus facile d’ aimer ce film-là que euh"]
Reference: j’ pense que euh c'est plus facile d'aimer ce film-là que

Prediction: ['tu fais une licence et puis ensuite tu fais le master si tu veux']
Reference: tu fais une licence et p’is ensuite tu fais le master si si tu veux

Prediction: ["je j'ai r-"]
Prediction2: [" J'ai... J'ai ri."]
Reference: je j'ai ris beaucoup

Prediction: ['euh i’ s’ faisait noir mais']
Reference: et- i’ s- faisait noir mais

Prediction: ["d'accord oui ils font comme-fin c'est un peu comme en suède du coup j'imagine"]
Reference: d’accord oui i’s font comme -fin c'est un peu comme en suède du coup j'imagine euh

Prediction: ['i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i- i-']
Reference: i’s ont i’s ont pas allumé en fait c'est juste une longue euh

Prediction: ["c'est p’t-êt' trop long mais p’t-êt' p’t-êt' que vous allez l'aimer pa’ce que y’ a beaucoup de gens qu' i’s aiment"]
Reference: c’est c’est un peu ça le le but je pense mais c’était un peu trop long mais c'est peut-êt' que vous vous allez l'aimer pa’ce que y’ a y’ a beaucoup de gens qui l'aiment mais euh

Prediction: ["on va avoir des cours de littérature de philosophie euh des cours de langue des cours d'histoire on va avoir en fait"]
Reference: on va avoir des cours de littérature de philosophie euh des cours de langue des cours d'histoire on va avoir en fait

Prediction: ["et puis euh donc je pense que l'année prochaine quand je vais aller à l'université en norvège ça va"]
Prediction2: [" Et puis je pense que l'année prochaine, quand je vais aller à l'université en Norvège, ça va..."]
Reference: et p'is euh donc je pense que l'année prochaine quand je vais aller à l'université en norvège ça va

Prediction: ["bah c'est bien ça"]
Prediction2: [" Moi c'est bien ça."]
Reference: ah c’est bien ça

Prediction: ['ouais']
Prediction2: [' A la prochaine !']
Reference: ouais

Prediction: ['ouais']
Prediction2: [' Ouais.']
Reference: ouais

Prediction: ["ou général- c'est pas généralement des"]
Prediction2: [" Ou généralement, c'est pas généralement des..."]
Reference: où général-  c'est pas généralement des

Prediction: ['ouais et puis on dans le film on suive aussi le le processus nan le procès']
Prediction2: [' Et puis dans le film on suit aussi le processus... Non, le procès.']
Reference: ouais et puis on dans le film on suit aussi le euh le processus non le procès

Prediction: ["j’ pense que euh c'est plus facile d’ aimer ce film-là que euh"]
Prediction2: [" Je pense que c'est plus facile d'aimer ce film-là que..."]
Reference: j’ pense que euh c'est plus facile d'aimer ce film-là que

Prediction: ['pas the square']
Prediction2: [' ...']
Reference: pas the square



In [None]:
a

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="sanchit-gandhi/whisper-small-hi")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
    title="Whisper Small Hindi",
    description="Realtime demo for Hindi speech recognition using a fine-tuned Whisper small model.",
)

iface.launch()

## Closing Remarks

In this blog, we covered a step-by-step guide on fine-tuning Whisper for multilingual ASR
using 🤗 Datasets, Transformers and the Hugging Face Hub. For more details on the Whisper model, the Common Voice dataset and the theory behind fine-tuning, refere to the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper). If you're interested in fine-tuning other
Transformers models, both for English and multilingual ASR, be sure to check out the
examples scripts at [examples/pytorch/speech-recognition](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition).