**Install packages:** Install the necessary packages for running the notebook.

In [None]:
pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /private/var/folders/0c/j4c53d0102lcr2b5rk247yrm0000gn/T/pip-req-build-0nxo8x6h
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /private/var/folders/0c/j4c53d0102lcr2b5rk247yrm0000gn/T/pip-req-build-0nxo8x6h
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install evaluate



In [None]:
pip install librosa

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install jiwer

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.29.2
    Uninstalling accelerate-0.29.2:
      Successfully uninstalled accelerate-0.29.2
Successfully installed accelerate-0.30.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install yt_dlp

Collecting yt_dlp
  Downloading yt_dlp-2024.4.9-py3-none-any.whl (3.1 MB)
Collecting brotli (from yt_dlp)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.0 MB)
Collecting mutagen (from yt_dlp)
  Downloading mutagen-1.47.0-py3-none-any.whl (194 kB)
Collecting pycryptodomex (from yt_dlp)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting websockets>=12.0 (from yt_dlp

**Setting up PyTorch Device:** Check if CUDA is available and set the device to GPU if possible; otherwise, fall back to CPU.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


**Dataset Preparation:** Import necessary libraries and define a function to create a dataset from a TSV file, which reads the file, adjusts the file paths, and converts the data into a Hugging Face dataset format.

In [None]:
from datasets import Dataset, DatasetDict
import pandas as pd
from pathlib import Path

base_path = Path('cv-corpus-16.1-2023-12-06/hy-AM')

def create_dataset_from_dataframe(split_name, base_path):
    df = pd.read_csv(base_path / f'{split_name}.tsv', sep='\t')
    clips_path = base_path / 'clips'
    df['path'] = df['path'].apply(lambda x: str(clips_path / x))
    dataset = Dataset.from_pandas(df)
    return dataset

my_common_voice = DatasetDict()

my_common_voice['train'] = create_dataset_from_dataframe('train', base_path)
my_common_voice['test'] = create_dataset_from_dataframe('test', base_path)
my_common_voice['validation'] = create_dataset_from_dataframe('dev', base_path)
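
To make the path adjustment concrete, the sketch below reproduces it with only the standard library on a made-up two-row TSV. The clip names and sentences are hypothetical, and real Common Voice TSV files contain more columns than shown here.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical two-row excerpt in the Common Voice TSV layout (illustrative values only).
tsv_text = "path\tsentence\nclip_0001.mp3\tfirst sentence\nclip_0002.mp3\tsecond sentence\n"

base_path = PurePosixPath("cv-corpus-16.1-2023-12-06/hy-AM")
clips_path = base_path / "clips"

def read_split(tsv_file, clips_dir):
    """Read a TSV split and prefix every clip name with the clips directory."""
    rows = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        row["path"] = str(clips_dir / row["path"])
        rows.append(row)
    return rows

rows = read_split(io.StringIO(tsv_text), clips_path)
for row in rows:
    print(row["path"], "|", row["sentence"])
```

Each row now carries a path relative to the working directory, which is what the audio loader needs later on.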

**Clean Dataset:** Remove columns that are not needed for training, such as 'client_id' and 'up_votes', from the dataset.

In [None]:
my_common_voice = my_common_voice.remove_columns(["client_id", "up_votes", "down_votes", "age", "gender", "accents", "variant", "locale", "segment"])

**Print Dataset Overview:** Display the structure of the dataset to confirm the changes and view the remaining columns.

In [None]:
print(my_common_voice)

DatasetDict({
    train: Dataset({
        features: ['path', 'sentence'],
        num_rows: 3794
    })
    test: Dataset({
        features: ['path', 'sentence'],
        num_rows: 2853
    })
    validation: Dataset({
        features: ['path', 'sentence'],
        num_rows: 2656
    })
})


**Load Feature Extractor and Tokenizer:** Load the Whisper feature extractor and tokenizer from Hugging Face's Transformers library.

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Armenian", task="transcribe")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


**Transcription Tokenization and Decoding:** Perform tokenization on a sample text, and decode it to ensure the process retains the original text without any special tokens.

In [None]:
input_str = my_common_voice["train"][0]["sentence"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

Input:                 Բամբակենու ծայրատումը կատարվում է նախքան զանգվածային ծաղկումը։
Decoded w/ special:    <|startoftranscript|><|hy|><|transcribe|><|notimestamps|>Բամբակենու ծայրատումը կատարվում է նախքան զանգվածային ծաղկումը։<|endoftext|>
Decoded w/out special: Բամբակենու ծայրատումը կատարվում է նախքան զանգվածային ծաղկումը։
Are equal:             True


**Print a Sample Entry:** Display a specific example from the training set to verify the data structure and contents.

In [None]:
print(my_common_voice["train"][0])

{'path': 'cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39459169.mp3', 'sentence': 'Բամբակենու ծայրատումը կատարվում է նախքան զանգվածային ծաղկումը։'}


**Define Audio Loading Helper:** Create a helper function that loads an audio file and resamples it to 16 kHz, the sampling rate expected by the Whisper feature extractor.

In [None]:
import librosa

def load_and_resample_audio(file_path, target_sr=16000):
    audio, _ = librosa.load(file_path, sr=target_sr)
    return {'array': audio, 'sampling_rate': target_sr}

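**Define and Apply Data Preparation:** Define a function that loads each clip, extracts input features with the Whisper feature extractor, and encodes the transcript into label ids. Then use the 'map' method to apply it across the dataset so the data is ready for input into the model.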

In [None]:
def prepare_dataset(batch):
    path = batch["path"]
    audio = load_and_resample_audio(path)
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
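
The shape of each input_features entry follows from a few constants. The sketch below works through the arithmetic, assuming the usual Whisper feature extractor defaults (16 kHz audio padded or truncated to 30 seconds, a hop length of 160 samples, and 80 mel bins for whisper-small). These values are not stated in this notebook, so inspect the loaded feature extractor to confirm them.

```python
# Assumed defaults of the Whisper feature extractor (verify against your checkpoint).
SAMPLING_RATE = 16_000   # Hz, matches target_sr in the loading helper
CHUNK_SECONDS = 30       # every clip is padded or truncated to this length
HOP_LENGTH = 160         # samples between successive spectrogram frames
N_MELS = 80              # mel bins for whisper-small

n_samples = SAMPLING_RATE * CHUNK_SECONDS   # samples per padded clip
n_frames = n_samples // HOP_LENGTH          # spectrogram frames per clip
encoder_positions = n_frames // 2           # the encoder's second convolution has stride 2

print(f"samples per clip:  {n_samples}")
print(f"feature shape:     ({N_MELS}, {n_frames})")
print(f"encoder positions: {encoder_positions}")
```

The 80 input channels and 1500 positions agree with the Conv1d(80, 768, ...) and Embedding(1500, 768) layers in the model printout.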

In [None]:
my_common_voice = my_common_voice.map(prepare_dataset, remove_columns=my_common_voice.column_names["train"])

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

**Define Data Collator Class:** Define a collator that batches examples of varying length. It pads the input features and the label sequences separately, replaces label padding with -100 so that those positions are ignored by the loss, and drops the leading start token when every sequence in the batch begins with it.

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
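
The label handling in the collator can be illustrated on plain Python lists. This is a simplified sketch, not the collator itself: it pads directly with -100 instead of padding with the tokenizer and masking afterwards, and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def pad_and_mask(label_seqs, bos_token_id):
    """Pad label sequences to equal length with IGNORE_INDEX, then strip a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Drop the first column only if every sequence starts with the BOS token,
    # mirroring the check in the collator above.
    if all(row[0] == bos_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded

# Hypothetical token ids; 1 stands in for the BOS token.
batch = pad_and_mask([[1, 7, 8, 9], [1, 5]], bos_token_id=1)
print(batch)  # [[7, 8, 9], [5, -100, -100]]
```

Padding positions carry -100, so they contribute nothing to the loss, and the shared first token is removed from the labels.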

**Initialize Processor and Data Collator:** Load the Whisper processor and instantiate the custom data collator class using this processor.

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Armenian", task="transcribe")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

**Set Up Evaluation Metrics:** Load the word error rate (WER) and character error rate (CER) metrics, then define a function that decodes predictions and labels and reports both metrics as percentages.

In [None]:
import evaluate

metric_wer = evaluate.load("wer")
metric_cer = evaluate.load("cer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric_wer.compute(predictions=pred_str, references=label_str)
    cer = 100 * metric_cer.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}
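
To show what these two numbers measure, here is a minimal pure-Python sketch of the standard definition: total edit distance divided by total reference length, pooled over all pairs. It assumes no text normalization and is not the implementation behind evaluate.load; the helper names and example sentences are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(references, predictions, unit="word"):
    """Total edits divided by total reference length, pooled over all pairs."""
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(p)) for r, p in zip(references, predictions))
    total = sum(len(split(r)) for r in references)
    return edits / total

refs = ["the cat sat"]
print(100 * error_rate(refs, ["the cat sat"]))                    # 0.0
print(100 * error_rate(refs, ["the bat sat"]))                    # one substitution in three words
print(100 * error_rate(refs, ["a the big cat sat down on it"]))   # five insertions: above 100
print(100 * error_rate(["abc"], ["abd"], unit="char"))            # character-level variant
```

Because insertions count as errors while the denominator is the reference length, a prediction much longer than its reference can score above 100.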

**Load Pretrained Model for Conditional Generation:** Import the WhisperForConditionalGeneration class from the transformers library and load a pretrained Whisper model configured for generating text in Armenian. The model configuration is also adjusted to set the generation language to Armenian.

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.generation_config.language = "armenian"

**Configure Model Decoding:** Set forced_decoder_ids to None and clear suppress_tokens so that generation neither forces specific decoder token IDs nor suppresses any tokens during decoding.

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

**Create Logging Directory:** Create the logging directory if it does not already exist. It stores the training logs used to monitor training and to track model performance over time.

In [None]:
import os

path = "logs/small_25_epoch_nohup_24_hours"
# Create the directory if needed; exist_ok avoids an error when it already exists.
os.makedirs(path, exist_ok=True)
logdir = path

**Define Training Arguments:** Set up training parameters such as batch sizes, learning rate, and the evaluation, saving, and logging strategies.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-hy-nohup-24-hours",
    per_device_train_batch_size=16,
    logging_dir=logdir,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=1,
    num_train_epochs=1,
    gradient_checkpointing=True,
    fp16=False,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_strategy = "epoch",
    evaluation_strategy = "epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
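
With an epoch-based schedule it helps to know how many optimizer steps a run takes. The sketch below works this out for the batch size above and the 3794 training rows printed earlier, assuming a single device and gradient_accumulation_steps=1. The function name is illustrative, and the 25-epoch case is an assumption based on the logging directory name.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, num_devices=1, num_epochs=1):
    """Optimizer updates for a run, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * num_epochs

# 3794 training rows and batch size 16 come from this notebook; one device is assumed.
print(optimizer_steps(3794, 16))                 # 238 steps for one epoch
print(optimizer_steps(3794, 16, num_epochs=25))  # 5950 steps for 25 epochs
```

The second figure is consistent with the checkpoint-5950 directory loaded later in this notebook, if that run used 25 epochs.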

**Prepare Model for Training:** Move the model to the designated computing device (GPU or CPU).

In [None]:
model.to(device)

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

**Initialize Training Setup:** Combine all necessary components like the model, datasets, data collator, and metrics function into the Seq2SeqTrainer for training.

In [None]:
from transformers import TrainerCallback

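class SaveLastModelCallback(TrainerCallback):
    # Save the model once training finishes. `trainer` is the global created in the
    # next cell; it is only looked up when the callback fires, so the order is safe.
    def on_train_end(self, args, state, control, **kwargs):
        trainer.save_model(output_dir=args.output_dir + "/last_epoch")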

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=my_common_voice["train"],
    eval_dataset=my_common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
    callbacks=[SaveLastModelCallback()],
)

**Training:** Execute the training process using the trainer configuration established in previous cells.

In [None]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


| Epoch | Training Loss | Validation Loss | Wer | Cer |
|---|---|---|---|---|
| 1 | 0.0 | 0.360475 | 41.666667 | 11.032028 |


Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}
There were missing keys in the checkpoint model loaded: ['proj_out.weight'].
Non-default generation parameters: {'max_length': 448, 'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}


TrainOutput(global_step=1, training_loss=4.760254887514748e-05, metrics={'train_runtime': 18.7099, 'train_samples_per_second': 0.534, 'train_steps_per_second': 0.053, 'total_flos': 2885854003200000.0, 'train_loss': 4.760254887514748e-05, 'epoch': 1.0})

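**Evaluate the Baseline Model:** Reload the original pretrained openai/whisper-small checkpoint, together with its tokenizer, feature extractor, and processor, to measure its performance on the test set as a point of comparison.

In [None]: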
from transformers import WhisperForConditionalGeneration, WhisperTokenizer, WhisperFeatureExtractor, WhisperProcessor
import evaluate
import librosa

metric_wer = evaluate.load("wer")
metric_cer = evaluate.load("cer")

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.generation_config.language = "armenian"
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Armenian", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Armenian", task="transcribe")
model.to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [None]:
def map_to_pred(batch):
    # Load the clip at 16 kHz and convert it to Whisper input features.
    audio, _ = librosa.load(batch["path"], sr=16000)
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch['sentence'])

    with torch.no_grad():
        # Use the device selected earlier instead of hard-coding "cuda".
        predicted_ids = model.generate(input_features.to(device))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch

result = my_common_voice["test"].map(map_to_pred)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
print("Test WER: {:.3f}".format(100 * metric_wer.compute(references=result["reference"], predictions=result["prediction"])))
print("Test CER: {:.3f}".format(100 * metric_cer.compute(references=result["reference"], predictions=result["prediction"])))

Test WER: 200.000
Test CER: 136.511


In [None]:
model.eval()

def transcribe_audio(audio_file):
    audio, sampling_rate = librosa.load(audio_file, sr=16000, mono=True)
    inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=sampling_rate)
    inputs = inputs.to(device)

    with torch.no_grad():
        predictions = model.generate(inputs.input_features)

    text = tokenizer.batch_decode(predictions, skip_special_tokens=True)[0]
    return text

In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39509489.mp3')
print(result)

 Մատմողն աշխատանք է ստանում գրադարանում


In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39517769.mp3')
print(result)

 այս պայջարով, որոշ ծրագրայի բագեր չյն հայդնավ էրվում միջև ծրագրիր չիտ էրվում.


In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39427421.mp3')
print(result)

 Նակոչ չերանութը պիտակամորդների դեմ բայկարել, մարկարեներ է չեին խոսումներ անունից.


In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39295677.mp3')
print(result)

 դեկլենց կսետցվալ ձամակաջքերով, պեծ այմբիցի աղեկտուր հոսքերով, որնույնի սկոր.


In [None]:
import yt_dlp
import os


def download_youtube_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': '%(id)s.%(ext)s',
        'quiet': False
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(url, download=True)
        filename = ydl.prepare_filename(info_dict)
        base, ext = os.path.splitext(filename)
        return base + '.mp3'

audio_file = download_youtube_audio("https://www.youtube.com/watch?v=-OhScYMGqvc")

def segment_audio(audio_file, segment_length=30, overlap=5, sample_rate=16000):
    audio, sr = librosa.load(audio_file, sr=sample_rate, mono=True)
    total_length = librosa.get_duration(y=audio, sr=sr)

    # Slide a window over the audio; the final window may be shorter than
    # segment_length so that the end of the recording is not dropped.
    start = 0
    while start < total_length:
        end = min(start + segment_length, total_length)
        yield audio[int(start * sr):int(end * sr)]
        if end >= total_length:
            break
        start += (segment_length - overlap)

def transcribe_segments(audio_file):
    texts = []
    previous_text = ""
    for segment in segment_audio(audio_file, overlap=5):
        inputs = feature_extractor(segment, return_tensors="pt", sampling_rate=16000)
        inputs = inputs.to(device)

        with torch.no_grad():
            predictions = model.generate(inputs.input_features)

        current_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)[0]

        if previous_text:
            overlap_index = current_text.find(previous_text.split()[-1])
            if overlap_index != -1:
                current_text = current_text[overlap_index + len(previous_text.split()[-1]):].strip()

        texts.append(current_text)
        previous_text = current_text

    return " ".join(texts)

result = transcribe_segments(audio_file)
print(result)

[youtube] Extracting URL: https://www.youtube.com/watch?v=-OhScYMGqvc
[youtube] -OhScYMGqvc: Downloading webpage
[youtube] -OhScYMGqvc: Downloading ios player API JSON
[youtube] -OhScYMGqvc: Downloading android player API JSON




[youtube] -OhScYMGqvc: Downloading m3u8 information
[info] -OhScYMGqvc: Downloading 1 format(s): 251
[download] Destination: -OhScYMGqvc.webm
[download] 100% of    3.10MiB in 00:00:01 at 2.98MiB/s   
[ExtractAudio] Destination: -OhScYMGqvc.mp3
Deleting original file -OhScYMGqvc.webm (pass -k to keep)
 Հառստակամ բանյգանչում խայլածի հոմոտլեր, չաց չիպիտի, սգացել էք վերչերսուն չկանիգ մոզում Հառստակամ բանյգանչուն մասին. բոխչույն, ես ուն արեն սգացել է, քե որը շել ենք ստախսել հաղով տաշատ, որ ինտածկում գոսյնք Հառստակամ բանյգանչուն վայս տանուն երագրաղոթյան մասին. Հառստակամ բանյգանչուն ես հատեր պատկեր այստում են այսպես. բեծնավ, վամ են դեպկում ինչ առեստակամ բանականություն, մի դոպը. Նես հածքելին է, որ արեստակամ բանականություն եմ վեր այսնալոյ, բալորմասնակիտություն ները կամ որ մատկանս կալի կայլ է ոսչինին էլո.  դեղ արստական բանականսուն նայսոր ամենուրը. բրջսկության մեջ բարդ վիր հատություն ներան աղտներ են ստահսվում. արստական բանական չամստահսվածնե կարնե դության անդեսներում են են կայցվում. արվոտներ ալ կա
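
The windowing behind long-form transcription is easier to follow on timestamps alone. The sketch below is a hypothetical helper, not part of the notebook's pipeline: it lists the (start, end) times of overlapping windows for a given duration, including a shorter final window so the tail of the recording is covered.

```python
def segment_bounds(total_length, segment_length=30, overlap=5):
    """Yield (start, end) times in seconds for overlapping windows covering the whole clip."""
    if overlap >= segment_length:
        raise ValueError("overlap must be smaller than segment_length")
    step = segment_length - overlap
    start = 0
    while start < total_length:
        end = min(start + segment_length, total_length)
        yield (start, end)
        if end >= total_length:  # the last window reached the end of the audio
            break
        start += step

# Hypothetical 70-second clip.
print(list(segment_bounds(70)))  # [(0, 30), (25, 55), (50, 70)]
```

Each window starts 25 seconds after the previous one, so consecutive windows share 5 seconds of audio, and a clip shorter than one window still yields a single segment.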

**Load Fine-Tuned Model:** Load the fine-tuned Whisper checkpoint (checkpoint-5950) for conditional generation, with the generation language set to Armenian, to compare it against the baseline on the same test data.

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperTokenizer, WhisperFeatureExtractor, WhisperProcessor
import evaluate
import librosa

metric_wer = evaluate.load("wer")
metric_cer = evaluate.load("cer")

model = WhisperForConditionalGeneration.from_pretrained("whisper-small-hy-nohup-24-hours/checkpoint-5950")
model.generation_config.language = "armenian"
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Armenian", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("whisper-small-hy-nohup-24-hours/checkpoint-5950")
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Armenian", task="transcribe")
model.to(device)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [None]:
def map_to_pred(batch):
    # Load the clip at 16 kHz and convert it to Whisper input features.
    audio, _ = librosa.load(batch["path"], sr=16000)
    input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch['sentence'])

    with torch.no_grad():
        # Use the device selected earlier instead of hard-coding "cuda".
        predicted_ids = model.generate(input_features.to(device))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch

result = my_common_voice["test"].map(map_to_pred)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
print("Test WER: {:.3f}".format(100 * metric_wer.compute(references=result["reference"], predictions=result["prediction"])))
print("Test CER: {:.3f}".format(100 * metric_cer.compute(references=result["reference"], predictions=result["prediction"])))

Test WER: 40.278
Test CER: 10.432


In [None]:
model.eval()

def transcribe_audio(audio_file):
    audio, sampling_rate = librosa.load(audio_file, sr=16000, mono=True)
    inputs = feature_extractor(audio, return_tensors="pt", sampling_rate=sampling_rate)
    inputs = inputs.to(device)

    with torch.no_grad():
        predictions = model.generate(inputs.input_features)

    text = tokenizer.batch_decode(predictions, skip_special_tokens=True)[0]
    return text

In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39509489.mp3')
print(result)

Պատմողն աշխատանք է ստանում գրադարանում։


In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39517769.mp3')
print(result)

Այս պատճառով, որոշ ծրագրային բագեր չեն հայտնաբերվում մինչև ծրագիրը չի տեստավորվում։


In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39427421.mp3')
print(result)

Նակոչ չէր անում սպիտակամորդների դեմ պայքարել՝ մարգարեները չէին խոսումները անունից։


In [None]:
result = transcribe_audio('cv-corpus-16.1-2023-12-06/hy-AM/clips/common_voice_hy-AM_39295677.mp3')
print(result)

Թեկլեն սկսետ զգալ ցամա կաչքերով, բայց այնպիսի աղեկտուռ խոսքերով, ուրնունի սկոր։


In [None]:
import yt_dlp
import os


def download_youtube_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': '%(id)s.%(ext)s',
        'quiet': False
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(url, download=True)
        filename = ydl.prepare_filename(info_dict)
        base, ext = os.path.splitext(filename)
        return base + '.mp3'

audio_file = download_youtube_audio("https://www.youtube.com/watch?v=-OhScYMGqvc")

def segment_audio(audio_file, segment_length=30, overlap=5, sample_rate=16000):
    audio, sr = librosa.load(audio_file, sr=sample_rate, mono=True)
    total_length = librosa.get_duration(y=audio, sr=sr)

    # Slide a window over the audio; the final window may be shorter than
    # segment_length so that the end of the recording is not dropped.
    start = 0
    while start < total_length:
        end = min(start + segment_length, total_length)
        yield audio[int(start * sr):int(end * sr)]
        if end >= total_length:
            break
        start += (segment_length - overlap)

def transcribe_segments(audio_file):
    texts = []
    previous_text = ""
    for segment in segment_audio(audio_file, overlap=5):
        inputs = feature_extractor(segment, return_tensors="pt", sampling_rate=16000)
        inputs = inputs.to(device)

        with torch.no_grad():
            predictions = model.generate(inputs.input_features)

        current_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)[0]

        if previous_text:
            overlap_index = current_text.find(previous_text.split()[-1])
            if overlap_index != -1:
                current_text = current_text[overlap_index + len(previous_text.split()[-1]):].strip()

        texts.append(current_text)
        previous_text = current_text

    return " ".join(texts)

result = transcribe_segments(audio_file)
print(result)

[youtube] Extracting URL: https://www.youtube.com/watch?v=-OhScYMGqvc
[youtube] -OhScYMGqvc: Downloading webpage
[youtube] -OhScYMGqvc: Downloading ios player API JSON
[youtube] -OhScYMGqvc: Downloading android player API JSON




[youtube] -OhScYMGqvc: Downloading m3u8 information
[info] -OhScYMGqvc: Downloading 1 format(s): 251
[download] Destination: -OhScYMGqvc.webm
[download] 100% of    3.10MiB in 00:00:00 at 5.40MiB/s   
[ExtractAudio] Destination: -OhScYMGqvc.mp3
Deleting original file -OhScYMGqvc.webm (pass -k to keep)
Արստական բանականձին, խայլացի ռոբոտներ, չա ճիփիթի։ զգացել եք վերջերից միջքան իմպոզում արստական բանականության մասին։ երսունարենը զգացելենք օրոշել ենք ստեղծել հաղուրդաշար, որի ընթացքում կոգոսենք արստական բանականության, վայեստանում նրա գրաղության մասին։Արստական բանականությունը շատերը պատկերաստումը նայսպես… Պեցնավ, ամեն դեպքում ինչը արեստական բանականությունը, մինուբը… լեսածկլիներ, որ արեստական բանականությունը վերասնալրե՝ բելորմասնագիտությունները, կամ որ մարմարդկանց կայլևս չինինելու… Դարեստական բանականությունները, կամ արհտկանց կառիկ այլևշ չիրինելու։ դե՛արհստական բանականությունն այսօրը ամենուր է, վժշկության մեջ բիրահատություններանով ռոբոտներ են ստեացվում, արհստական բանականությամբ ստեացվածնէկարներ
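
The step in transcribe_segments that joins overlapping segments is a simple heuristic: it looks for the last word of the previous segment's text inside the current text and discards everything up to and including it. The sketch below isolates that logic in a hypothetical helper with made-up English strings, so its behavior and its limits are easy to see.

```python
def trim_overlap(previous_text, current_text):
    """Drop everything in current_text up to and including the last word of previous_text."""
    words = previous_text.split()
    if not words:
        return current_text
    last_word = words[-1]
    index = current_text.find(last_word)
    if index == -1:
        return current_text
    return current_text[index + len(last_word):].strip()

# Hypothetical transcripts of two overlapping windows.
print(trim_overlap("the quick brown fox", "brown fox jumps over"))  # jumps over
print(trim_overlap("the quick brown fox", "a lazy dog"))            # a lazy dog
# find() matches substrings, so a short last word can cut into an unrelated word:
print(trim_overlap("it was a", "big banana"))                       # nana
```

The last case shows why the joined transcript can lose or mangle words at segment boundaries: the match is on raw characters, not on aligned words or timestamps.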