<a href="https://colab.research.google.com/github/AbdulxoliqMirzayev/stt_model/blob/main/stt_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We check that a GPU is available and install the required packages.
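A minimal sketch of such a GPU check, using only the standard library. It assumes an NVIDIA GPU whose driver ships the `nvidia-smi` command-line tool (as on Colab GPU runtimes); the helper name `gpu_summary` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the `nvidia-smi -L` listing, or None when no NVIDIA GPU is visible."""
    # Assumption: the GPU is an NVIDIA device and `nvidia-smi` is on PATH.
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "No GPU detected; in Colab, switch the runtime type to GPU.")
```

If nothing is listed, training with `fp16=True` later in the notebook is unlikely to work, so it is worth fixing the runtime before continuing.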

In [None]:
!pip install --upgrade --quiet pip
!pip install --upgrade --quiet datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio

In [None]:
from huggingface_hub import notebook_login

notebook_login()

### Load Dataset

As the first important step, we need to download the dataset. We use the DavronSherbaev/uzbekvoice-filtered dataset (Uzbek speech).

The dataset is quite large, so we use only part of it.
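The subset is selected with the slice syntax that `load_dataset` accepts in its `split` argument. The sketch below only builds those slice strings and counts the rows they request; the helper `split_slice` is hypothetical, and the row counts assume the `train` split really contains at least 50,000 rows.

```python
def split_slice(split, start, stop):
    """Build a slice expression such as 'train[35000:49000]' and its row count."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return f"{split}[{start}:{stop}]", stop - start


# The two ranges used in this notebook: both come from the original train split.
train_expr, train_rows = split_slice("train", 35000, 49000)
test_expr, test_rows = split_slice("train", 49000, 50000)

print(train_expr, train_rows)
print(test_expr, test_rows)
```

Because the two ranges do not overlap, no clip appears in both the training and the evaluation subset.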

In [None]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("DavronSherbaev/uzbekvoice-filtered", split="train[35000:49000]")
common_voice["test"] = load_dataset("DavronSherbaev/uzbekvoice-filtered", split="train[49000:50000]")

print(common_voice)

We keep only the columns we need.

In [None]:
common_voice = common_voice.remove_columns(['previous_text', 'id', 'client_id', 'duration', 'sentence', 'created_at', 'original_sentence_id', 'sentence_clips_count', 'upvotes_count', 'downvotes_count', 'reported_count', 'reported_reasons', 'skipped_clips', 'gender', 'accent_region', 'native_language', 'year_of_birth'])

print(common_voice)

In [None]:
common_voice["train"] = common_voice["train"].shuffle(seed=42)
common_voice["test"] = common_voice["test"].shuffle(seed=42)

## Prepare Feature Extractor, Tokenizer and Data


The ASR pipeline consists of three parts:

- A **feature extractor** that pre-processes the raw audio.
- A **model** that performs the sequence-to-sequence mapping.
- A **tokenizer** that converts the model outputs to text.

The Transformers library provides a dedicated WhisperFeatureExtractor and WhisperTokenizer for the Whisper model.

### Load WhisperFeatureExtractor

In [None]:
# import transformers
# transformers.utils.move_cache()

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

### Load WhisperTokenizer

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base", language="uzbek", task="transcribe")

### Combine To Create A WhisperProcessor

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base", language="uzbek", task="transcribe")

### Prepare Data

Next comes data preparation. Let's first look at one sample from the dataset.

In [None]:
print(common_voice["train"][0])

We use the `cast_column` method to change the audio `sampling_rate`, because the Whisper model expects 16 kHz input.

In [None]:
from datasets import Audio

common_voice = common_voice.cast_column("path", Audio(sampling_rate=16000))

Now let's look at the same sample again.

In [None]:
print(common_voice["train"][0])

Preparing the data for the model:

- The audio is loaded and resampled.
- Log-Mel spectrogram features are extracted.
- The transcriptions are encoded to label IDs.
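As a rough guide to what the feature extraction step produces, the sketch below computes the expected shape of one `input_features` entry. The constants are the defaults of `WhisperFeatureExtractor` for `openai/whisper-base` as far as I know (30-second chunks, hop length 160, 80 mel bins); treat them as assumptions and check `feature_extractor` itself if the shape matters.

```python
SAMPLING_RATE = 16_000  # Hz, the rate the audio is resampled to
CHUNK_SECONDS = 30      # assumed: every clip is padded or truncated to 30 s
HOP_LENGTH = 160        # assumed: samples between successive spectrogram frames
N_MELS = 80             # assumed: mel bins used by whisper-base


def feature_shape(sampling_rate=SAMPLING_RATE, chunk_seconds=CHUNK_SECONDS,
                  hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (mel bins, frames) for one padded clip."""
    n_samples = sampling_rate * chunk_seconds  # 480,000 samples per clip
    n_frames = n_samples // hop_length         # one frame every 10 ms
    return n_mels, n_frames


print(feature_shape())  # (80, 3000)
```

Every clip ends up with the same shape regardless of its real duration, which is why the audio side of a batch needs no variable-length padding later.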







In [None]:
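def prepare_dataset(batch):
    # the "path" column holds the decoded audio, already resampled to 16 kHz
    audio = batch["path"]
    # compute log-Mel input features from the raw waveform
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode the target transcription to label ids
    batch["labels"] = tokenizer(batch["text"]).input_ids

    return batch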

**The `num_proc` parameter is important: it enables multiprocessing. To choose it well, we find the number of CPU cores and set `num_proc` to that value.**

In [None]:
import os
print(os.cpu_count())
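One way to turn the core count into a `num_proc` value is sketched below. The helper `pick_num_proc` and its optional `max_workers` cap are hypothetical; the cap is only an illustration for machines where one worker per core would exhaust memory.

```python
import os


def pick_num_proc(max_workers=None):
    """Return a worker count based on the CPU core count, optionally capped."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    if max_workers is None:
        return cores
    return max(1, min(cores, max_workers))


num_proc = pick_num_proc()
print(num_proc)
```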

We prepare the data with the `.map` method.

This step takes some time.

In [None]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)

## Training and Evaluation

The data is now ready, so we move on to training.
The steps are:

- **Load a pre-trained checkpoint**: load the model correctly and prepare it for training.
- **Define a data collator**: turn the pre-processed data into PyTorch tensors.
- **Evaluation metrics**: evaluate the model with the WER (word error rate) metric.
- **Define the training configuration**: set the training schedule used by the Trainer.
- **Evaluate the fine-tuned model** on the test data.

## Load a Model

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")

To make the model always transcribe Uzbek at inference time, we set the language and task explicitly, which turns off Whisper's automatic language detection.



---


**Inference is the process of running a trained model on new data to obtain results. For example, feeding audio to the trained model and getting a transcription back (speech to text) is called inference.**

In [None]:
model.generation_config.language = "uzbek"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

### Define a Data Collator

The data collator handles `input_features` and `labels` separately: `input_features` are padded by the `feature extractor`, and `labels` by the `tokenizer`.
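The label side of the collator can be illustrated without PyTorch. The sketch below pads toy label sequences, replaces padded positions with -100 (the index that PyTorch's cross-entropy loss ignores by default), and strips a shared start token. It assumes right-side padding, and the token ids are made up for illustration; `pad_and_mask` is a hypothetical helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def pad_and_mask(label_seqs, pad_id, start_id):
    """Pad label sequences to equal length and mask the padding with -100."""
    max_len = max(len(seq) for seq in label_seqs)
    masked = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded = list(seq) + [pad_id] * n_pad      # right-pad with the pad id
        attention = [1] * len(seq) + [0] * n_pad   # 1 = real token, 0 = padding
        masked.append(
            [tok if keep else IGNORE_INDEX for tok, keep in zip(padded, attention)]
        )
    # If every sequence starts with the start token, drop it from the labels.
    if all(row and row[0] == start_id for row in masked):
        masked = [row[1:] for row in masked]
    return masked


# Toy ids: 1 = start token, 2 = end-of-text, which doubles as the pad id here.
labels = pad_and_mask([[1, 7, 8, 2], [1, 9, 2]], pad_id=2, start_id=1)
print(labels)  # [[7, 8, 2], [9, 2, -100]]
```

Masking through the attention mask rather than by comparing against the pad id matters when the pad and end-of-text ids coincide, as in the toy example: the genuine end-of-text token in the second sequence is kept, and only the added padding is ignored.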

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was prepended in the earlier tokenization step,
        # cut it here, since it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

### Evaluation Metrics

To evaluate the model we use `word error rate (WER)`, the standard metric for speech recognition.

In [None]:
import evaluate

metric = evaluate.load("wer")

The `compute_metrics` function replaces -100 with the `pad_token_id`, decodes the predicted and label IDs to text, and computes the WER.
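To make the metric concrete, here is a small pure-Python sketch of WER: the word-level edit distance (substitutions, insertions, deletions) summed over all sentences and divided by the total number of reference words. It is an illustration only, not the implementation behind `evaluate.load("wer")`, and the function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words."""
    edits = sum(edit_distance(r.split(), p.split())
                for r, p in zip(references, predictions))
    words = sum(len(r.split()) for r in references)
    return edits / words


wer_example = word_error_rate(["bu yaxshi kun"], ["bu yomon kun"])
print(round(100 * wer_example, 2))  # 33.33
```

One substituted word out of three gives a WER of about 33%, on the same 0 to 100 scale that `compute_metrics` reports. WER can exceed 100 when the prediction contains many inserted words.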

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

### Define the Training Configuration

Final step: we set the parameters needed for training.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-uz",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
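With `learning_rate=1e-5`, `warmup_steps=500` and `max_steps=2000`, the learning rate first ramps up and then decays. The sketch below assumes the default scheduler (`lr_scheduler_type="linear"`), since the configuration above does not set one; `linear_schedule_lr` is a hypothetical helper that mirrors that schedule's formula rather than calling the library.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=2000):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # warmup: rise linearly from 0 to the peak learning rate
        return peak_lr * step / warmup_steps
    # decay: fall linearly from the peak to 0 at max_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)


for step in (0, 250, 500, 1250, 2000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the model trains at the full 1e-5 only around step 500, and at roughly half of that at steps 250 and 1250.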

We pass the training arguments to the trainer together with the `model`, `dataset`, `data collator` and `compute_metrics` function:

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

In [None]:
import torch
torch.cuda.empty_cache()

Before starting training, it is a good idea to save the processor object. It does not change during training.

In [None]:
processor.save_pretrained(training_args.output_dir)

### Training

In [None]:
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


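| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|-----------|
| 500  | 0.5844        | 0.561094        | 48.880368 |
| 1000 | 0.2606        | 0.330639        | 32.453988 |
| 1500 | 0.2337        | 0.275608        | 27.990798 |
| 2000 | 0.1468        | 0.256811        | 26.165644 |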


You have passed task=transcribe, but also have set `forced_decoder_ids` to [[1, 50259], [2, 50359], [3, 50363]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of task=transcribe.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].


TrainOutput(global_step=2000, training_loss=0.5499511467218399, metrics={'train_runtime': 7371.1445, 'train_samples_per_second': 4.341, 'train_steps_per_second': 0.271, 'total_flos': 2.07551987712e+18, 'train_loss': 0.5499511467218399, 'epoch': 2.2857142857142856})
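The `epoch` and `train_samples_per_second` values in this output can be reproduced from the training configuration. The sketch assumes a single GPU and that the training slice holds the full 14,000 rows requested earlier; the variable names are for illustration only.

```python
per_device_batch = 8          # per_device_train_batch_size
grad_accum = 2                # gradient_accumulation_steps
max_steps = 2000              # max_steps
train_rows = 14_000           # rows in train[35000:49000]
train_runtime_s = 7371.1445   # train_runtime from the output above
num_devices = 1               # assumption: one GPU

effective_batch = per_device_batch * grad_accum * num_devices
samples_seen = effective_batch * max_steps
epochs = samples_seen / train_rows
throughput = samples_seen / train_runtime_s

print(effective_batch, samples_seen)              # 16 32000
print(round(epochs, 4), round(throughput, 3))     # 2.2857 4.341
```

Each optimizer step therefore covers 16 clips, and 2,000 steps pass over the training subset a little more than twice.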

Now we upload the model to the Hugging Face Hub.

In [None]:
kwargs = {
    "dataset_args": "config: uz, split: test",
    "language": "uz",
    "model_name": "Whisper base uz - AbdulxoliqMirzaev",
    "tasks": "automatic-speech-recognition",
}

We use the `push_to_hub` command:

In [None]:
trainer.push_to_hub(**kwargs)

**TESTING**