# Hands-on exercise

In this unit, we explored the challenges of fine-tuning ASR models, acknowledging the time and resources needed to fine-tune a model like Whisper (even a small checkpoint) on a new language. To provide hands-on experience, we designed an exercise that walks you through fine-tuning an ASR model on a smaller dataset. The main goal of this exercise is to familiarize you with the process rather than to reach production-level results. We deliberately set a lenient metric threshold so that you can meet it even with limited resources.

Here are the instructions:

- Fine-tune the "openai/whisper-tiny" model using the American English ("en-US") subset of the "PolyAI/minds14" dataset.

- Use the first 450 examples for training and the rest for evaluation. Make sure you set num_proc=1 when pre-processing the dataset with the .map method (this ensures your model is submitted correctly for evaluation).

- To evaluate the model, use the wer and wer_ortho metrics as described in this Unit. However, do not convert the metric to a percentage by multiplying by 100 (for example, if the WER is 42%, we expect to see the value 0.42 in this exercise).
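
To make the fraction convention concrete, here is a minimal pure-Python sketch of word error rate as word-level edit distance divided by the number of reference words. The helper name `word_error_rate` and the example sentences are hypothetical, and this is not the implementation used by the `evaluate` library; it only illustrates the scale of the value expected in this exercise.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes a non-empty reference; an empty one would divide by zero.
    """
    ref = reference.split()
    hyp = prediction.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from the prediction
                    curr[j - 1] + 1,     # extra word in the prediction
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# two of the seven reference words are missing, so the value is 2 / 7
wer_fraction = word_error_rate(
    "i would like to make a payment", "i like to make payment"
)
print(round(wer_fraction, 2))  # 0.29, a fraction rather than 29
```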


Once you have fine-tuned a model, make sure to upload it to the Hub with the following kwargs:
```
kwargs = {
    "dataset_tags": "PolyAI/minds14",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}
```

You will pass this assignment if your model's normalized WER (wer) is lower than 0.37.

## Load Dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("PolyAI/minds14", name="en-US", trust_remote_code=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 563
    })
})

In [2]:
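# shuffle=False keeps the original order, so the first 450 examples form the training split
dataset = dataset["train"].train_test_split(train_size=450, shuffle=False)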
dataset

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})

In [3]:
dataset["train"][0]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~PAY_BILL/602bae7bbb1e6d0fbce92264.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/28aa727f91fee90575c34956bab09d1716cfaf460c6afcba86a10f04a7d58b83/en-US~PAY_BILL/602bae7bbb1e6d0fbce92264.wav',
  'array': array([-0.00024414,  0.        , -0.00024414, ..., -0.00024414,
          0.        ,  0.        ]),
  'sampling_rate': 8000},
 'transcription': "I'd like to make a payment",
 'english_transcription': "I'd like to make a payment",
 'intent_class': 13,
 'lang_id': 4}

In [4]:
dataset = dataset.select_columns(["audio", "transcription"])
dataset

DatasetDict({
    train: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 450
    })
    test: Dataset({
        features: ['audio', 'transcription'],
        num_rows: 113
    })
})

In [5]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="english", task="transcribe"
)

In [6]:
dataset["train"].features

{'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None)}

In [7]:
from datasets import Audio

sampling_rate = processor.feature_extractor.sampling_rate
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
dataset["train"].features

{'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None)}

In [8]:
def prepare_dataset(example):
    audio = example["audio"]

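    # compute log-mel input features from the audio and tokenize the transcription
    example = processor(
        audio=audio["array"],
        sampling_rate=audio["sampling_rate"],
        text=example["transcription"],
    )

    # length of the audio sample in seconds, used later for filtering
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]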

    return example

In [9]:
dataset = dataset.map(
    prepare_dataset, remove_columns=dataset.column_names["train"], num_proc=1
)
dataset

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 450
    })
    test: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 113
    })
})

In [10]:
max_input_length = 30.0

def is_audio_in_length_range(length):
    return length < max_input_length

dataset["train"] = dataset["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)

dataset

DatasetDict({
    train: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 447
    })
    test: Dataset({
        features: ['input_features', 'labels', 'input_length'],
        num_rows: 113
    })
})
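
The length filter above compares each clip's duration in seconds against a 30-second limit. A minimal sketch of that arithmetic on plain numbers follows; the sample counts are hypothetical and chosen only to show the boundary case.

```python
MAX_INPUT_LENGTH = 30.0  # seconds, the same limit as max_input_length above


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Clip duration is the number of samples divided by the sampling rate."""
    return num_samples / sampling_rate


# hypothetical clips given as (number of samples, sampling rate in Hz)
clips = [(160_000, 16_000), (480_000, 16_000), (512_000, 16_000)]
durations = [duration_seconds(n, sr) for n, sr in clips]
kept = [d for d in durations if d < MAX_INPUT_LENGTH]

print(durations)  # [10.0, 30.0, 32.0]
print(kept)       # [10.0]
```

Because the comparison is strict, a clip of exactly 30 seconds is dropped as well. In the notebook the filter is applied to the training split only, which is why the test split keeps all 113 rows.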

In [11]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    
    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
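        # inputs and labels have different lengths and need different padding methods
        # the processor output keeps a leading batch dimension of size 1, so take element 0
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # pad the tokenized label sequences to the longest one in the batch
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so these positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if every sequence starts with the BOS token, cut it here
        # because the start token is added again when the decoder inputs are built
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch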
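
The label handling in the collator can be followed on plain Python lists. The sketch below is an illustration with hypothetical token ids and helper names (`pad_and_mask`, `strip_leading_bos`), not the tokenizer's own padding code. It masks by attention value rather than by token id, which matters when a real token shares the pad id (the training log further down notes that the pad token is the same as the EOS token).

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips this target value by default


def pad_and_mask(label_seqs, pad_id):
    """Right-pad every sequence to the longest one, then mask the padding."""
    max_len = max(len(seq) for seq in label_seqs)
    masked = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded = seq + [pad_id] * n_pad
        attention = [1] * len(seq) + [0] * n_pad
        # mask by attention value, not by token id: a real token may share the pad id
        masked.append(
            [tok if keep else IGNORE_INDEX for tok, keep in zip(padded, attention)]
        )
    return masked


def strip_leading_bos(rows, bos_id):
    """Drop the first column only when every row starts with the BOS id."""
    if all(row[0] == bos_id for row in rows):
        return [row[1:] for row in rows]
    return rows


# hypothetical token ids: 1 plays the role of BOS, 0 the role of padding
labels = pad_and_mask([[1, 5, 6, 7], [1, 8]], pad_id=0)
print(labels)                        # [[1, 5, 6, 7], [1, 8, -100, -100]]
print(strip_leading_bos(labels, 1))  # [[5, 6, 7], [8, -100, -100]]
```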

In [12]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

In [13]:
!pip install -q evaluate jiwer

In [14]:
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer


metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad token id so the labels can be decoded
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # decode predictions and labels back to text, dropping special tokens
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # orthographic WER: computed on the raw decoded text
    wer_ortho = metric.compute(predictions=pred_str, references=label_str)

    # normalized WER: apply the text normalizer to both sides first
    pred_str_norm = [normalizer(p) for p in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]

    # keep only the pairs whose normalized reference is non-empty
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    wer = metric.compute(predictions=pred_str_norm, references=label_str_norm)

    # both values are fractions (not multiplied by 100), as the exercise requires
    return {"wer_ortho": wer_ortho, "wer": wer}
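
The step in `compute_metrics` that drops pairs with an empty normalized reference can be tried in isolation. The sketch below uses a hypothetical `simple_normalize` function as a stand-in for `BasicTextNormalizer` (it only lowercases and removes ASCII punctuation, which is an assumption and not the library's behavior) and made-up strings.

```python
import string


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in normalizer: lowercase and drop ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


preds = ["Hello, world!", "uh"]
refs = ["Hello world", "..."]

pred_norm = [simple_normalize(p) for p in preds]
ref_norm = [simple_normalize(r) for r in refs]

# a reference that normalizes to the empty string has no words to score against,
# so the whole pair is dropped before the normalized WER is computed
pairs = [(p, r) for p, r in zip(pred_norm, ref_norm) if len(r) > 0]
print(pairs)  # [('hello world', 'hello world')]
```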

In [15]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

In [16]:
from functools import partial

# disable the cache during training since it is incompatible with gradient checkpointing
model.config.use_cache = False

# set the language and task for generation and re-enable the cache
model.generate = partial(
    model.generate, language="english", task="transcribe", use_cache=True
)
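
For readers unfamiliar with `functools.partial`, the sketch below shows the binding behavior this cell relies on. The `generate` function here is a hypothetical stand-in that only echoes its arguments; it is not the model's `generate` method.

```python
from functools import partial


def generate(inputs, language=None, task=None, use_cache=False):
    """Hypothetical stand-in for model.generate that echoes its arguments."""
    return {"inputs": inputs, "language": language, "task": task, "use_cache": use_cache}


# bind keyword defaults once, as the cell above does for the real method
generate_en = partial(generate, language="english", task="transcribe", use_cache=True)

print(generate_en("audio"))
# a keyword passed at call time overrides the value bound by partial
print(generate_en("audio", task="translate"))
```

Every later call, including the ones the trainer makes during evaluation, therefore picks up the bound language, task, and cache setting unless a caller passes its own value.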

In [17]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-en",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=200,
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=50,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)



In [18]:
from huggingface_hub import notebook_login

notebook_login()

In [19]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

max_steps is given, it will override any value given in num_train_epochs


In [20]:
trainer.train()



| Step | Training Loss | Validation Loss | Wer Ortho | Wer |
|------|---------------|-----------------|-----------|-----|
| 100 | 0.2419 | 0.486529 | 0.29423 | 0.286589 |
| 200 | 0.0048 | 0.594461 | 0.289791 | 0.289039 |


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

TrainOutput(global_step=200, training_loss=0.6235875126719475, metrics={'train_runtime': 1457.0057, 'train_samples_per_second': 4.393, 'train_steps_per_second': 0.137, 'total_flos': 1.5721620037632e+17, 'train_loss': 0.6235875126719475, 'epoch': 14.285714285714286})
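
The `epoch` and throughput figures in the output above can be reproduced from the training arguments and the 447 training examples left after filtering. The sketch assumes a single device and that one optimizer step consumes `gradient_accumulation_steps` batches; it is a reconstruction of the arithmetic, not a description of the trainer's internals.

```python
import math

# values taken from the cells and outputs above
num_train_examples = 447
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 200
train_runtime = 1457.0057  # seconds, from TrainOutput

# assumption: one device, so the effective batch is batch size times accumulation steps
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# batches per pass over the data, then optimizer steps per pass
batches_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps

epochs = max_steps / updates_per_epoch
samples_per_second = max_steps * effective_batch / train_runtime

print(effective_batch, batches_per_epoch, updates_per_epoch)  # 32 56 14
print(epochs)                        # about 14.29, in line with the reported 'epoch'
print(round(samples_per_second, 3))  # 4.393, in line with 'train_samples_per_second'
```

With roughly 14 passes over fewer than 450 examples, the gap between the falling training loss and the rising validation loss in the table above is consistent with overfitting on this small dataset.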

In [21]:
kwargs = {
    "dataset_tags": "PolyAI/minds14",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}

trainer.push_to_hub(**kwargs)

Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59, 60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918, 922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846, 3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938, 12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604, 18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464, 31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50358, 50359, 50360, 50361, 50362], 'begin_suppress_tokens': [220, 50257]}


CommitInfo(commit_url='https://huggingface.co/Leotrim/whisper-tiny-en/commit/fb8c546f0b68c6f9adb3d70a5beba825171d3a7b', commit_message='End of training', commit_description='', oid='fb8c546f0b68c6f9adb3d70a5beba825171d3a7b', pr_url=None, pr_revision=None, pr_num=None)