<figure>
<img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/whisper_architecture.svg" alt="Whisper architecture diagram" style="width:100%">
<figcaption align = "center"><b>Figure 1:</b> Whisper model. The architecture 
follows the standard Transformer-based encoder-decoder model. A 
log-Mel spectrogram is input to the encoder. The last encoder 
hidden states are input to the decoder via cross-attention mechanisms. The 
decoder autoregressively predicts text tokens, jointly conditioned on the 
encoder hidden states and previously predicted tokens.</figcaption>
</figure>

# Available models and languages
There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

| Size   | Parameters | Required VRAM | Relative speed |
|--------|------------|---------------|----------------|
| tiny   | 39 M       | ~1 GB         | ~32x           |
| base   | 74 M       | ~1 GB         | ~16x           |
| small  | 244 M      | ~2 GB         | ~6x            |
| medium | 769 M      | ~5 GB         | ~2x            |
| large  | 1550 M     | ~10 GB        | 1x             |

### We will use the <b>small</b> model for our needs
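
As a rough illustration of how the table can guide this choice, the sketch below picks the largest size whose approximate VRAM requirement fits a given budget. The helper `largest_model_for` is hypothetical, and the figures are simply copied from the table; they are approximate, and fine-tuning typically needs more memory than inference.

```python
# Values copied from the table above: (name, approx. VRAM in GB, relative speed).
WHISPER_SIZES = [
    ("tiny", 1, 32),
    ("base", 1, 16),
    ("small", 2, 6),
    ("medium", 5, 2),
    ("large", 10, 1),
]


def largest_model_for(vram_gb):
    """Return the largest size whose approximate VRAM need fits the budget.

    Hypothetical helper for illustration; returns None if nothing fits.
    """
    fitting = [name for name, need_gb, _speed in WHISPER_SIZES if need_gb <= vram_gb]
    return fitting[-1] if fitting else None


print(largest_model_for(4))
```

With a 4 GB budget the helper returns `small`, since `medium` is listed at about 5 GB.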

## Prepare Environment

In [None]:
#checking GPU stats
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
# Quote version specifiers so the shell does not treat ">" as output redirection
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.3.0"
!pip install jiwer
!pip install gradio

In [None]:
import torch
torch.cuda.is_available()  # must return True to train on the GPU

## Transforming Custom Datasets into Model Specification Format

In [2]:
from datasets import Dataset
import pandas as pd
from datasets import Audio
import gc
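# Load the (audio path, transcription) pairs and name the columns as used below
df = pd.read_csv('data.csv')
df.columns = ['audio', 'sentence']
# First 900 rows for training, the remaining rows for testing
train_df = df.iloc[:900, :]
test_df = df.iloc[900:, :]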
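
The split above takes the first 900 rows for training and the rest for testing. If the CSV happens to be ordered (by speaker or recording session, for example), a seeded shuffle before splitting may give a more representative test set. The following is a minimal sketch of that idea on row indices only; `shuffled_split` is a hypothetical helper, and the total of 1000 rows is an assumed figure, not taken from the data.

```python
import random


def shuffled_split(n_rows, n_train, seed=42):
    """Return (train_idx, test_idx): a seeded random partition of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, so the split is reproducible
    return indices[:n_train], indices[n_train:]


# n_rows=1000 is an assumption for the demo; use len(df) in practice.
train_idx, test_idx = shuffled_split(n_rows=1000, n_train=900)
print(len(train_idx), len(test_idx))
```

The resulting index lists could then be passed to `df.iloc[train_idx]` and `df.iloc[test_idx]` in place of the slices.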

In [3]:
test_df.head()

|     | audio                                | sentence                                         |
|-----|--------------------------------------|--------------------------------------------------|
| 900 | `C:\\Users\\test\\wav\\tel_0902.wav` | ప్రపంచ ప్రఖ్యాత వ్యవసాయ శాస్త్రవేత్త ఆయన |
| 901 | `C:\\Users\\test\\wav\\tel_0903.wav` | నమస్కారం మా ఊరి పేరు వికీలో లేదు |
| 902 | `C:\\Users\\test\\wav\\tel_0904.wav` | వచ్చే మూడు నాలుగు నెలలు నాకు పరీక్షలు ఉన్నాయి |
| 903 | `C:\\Users\\test\\wav\\tel_0905.wav` | దిద్దుబాటు చేసిన వెంటనే మార్పులు కనిపించడం లేదు |
| 904 | `C:\\Users\\test\\wav\\tel_0906.wav` | అన్నవరం శ్రీ సత్యనారాయణ స్వామివారి జయంతి |


In [None]:
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [None]:
train_dataset = train_dataset.cast_column('audio', Audio(sampling_rate=16000))
test_dataset = test_dataset.cast_column('audio', Audio(sampling_rate=16000))
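
Casting the column to `Audio(sampling_rate=16000)` makes the dataset resample each clip to 16 kHz when it is loaded. To see what the source files look like before that, the sketch below reads the native sampling rate and duration from a WAV header using only the standard library. `wav_info` is a hypothetical helper, the demo file is synthetic, and the `wave` module only reads PCM WAV files.

```python
import os
import tempfile
import wave


def wav_info(path):
    """Return (sampling_rate_hz, duration_seconds) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration


# Demo on a synthetic file: one second of 16-bit mono silence at 8 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
        out.setframerate(8000)
        out.writeframes(b"\x00\x00" * 8000)
    print(wav_info(demo_path))
```

A file reported at a rate other than 16000 Hz is still usable here, because the cast takes care of the resampling.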

In [None]:
train_dataset[0]

{'audio': {'path': 'C:\\\\Users\\\\test\\\\wav\\\\tel_0002.wav',
  'array': array([-0.00180054, -0.00201416, -0.00192261, ..., -0.00109863,
         -0.00125122, -0.00115967], dtype=float32),
  'sampling_rate': 16000},
 'sentence': 'ఈ గ్రామంలో ప్రజల ప్రధాన వృత్తి వ్యవసాయం '}

## Prepare Feature Extractor, Tokenizer and Data

#### The ASR pipeline can be decomposed into three stages:
1. A feature extractor, which pre-processes the raw audio inputs
2. The model, which performs the sequence-to-sequence mapping
3. A tokenizer, which post-processes the model outputs to text format

### *WhisperFeatureExtractor*

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

### *WhisperTokenizer*

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="telugu", task="transcribe")

### *WhisperProcessor*

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="telugu", task="transcribe")

Now we can write a function to prepare our data for the model:
1. Load and resample the audio data by calling `batch["audio"]`
2. Use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.
3. Encode the transcriptions to label ids through the use of the tokenizer.

In [None]:
def prepare_dataset(batch):
    # load the audio; the Audio cast above resamples it to 16 kHz on access
    audio = batch["audio"]
    # compute log-Mel input features from the 1-dimensional audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode the target transcription to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
train_dataset = train_dataset.map(prepare_dataset, num_proc=1)
test_dataset = test_dataset.map(prepare_dataset, num_proc=1)
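
For intuition about what `input_features` should look like after this step, the arithmetic below estimates the spectrogram shape. The constants are assumptions about the feature extractor's defaults for this checkpoint (30-second chunks, a hop of 160 samples, 80 Mel bins) rather than values read from the notebook; only the 16 kHz rate comes from the cast above.

```python
SAMPLING_RATE = 16_000  # Hz, as set in the Audio cast above
CHUNK_SECONDS = 30      # assumed: clips are padded or truncated to 30 s
HOP_LENGTH = 160        # assumed: samples between successive spectrogram frames
N_MELS = 80             # assumed: number of Mel bins for this checkpoint


def expected_feature_shape(sampling_rate=SAMPLING_RATE, chunk_seconds=CHUNK_SECONDS,
                           hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (n_mels, n_frames) for one padded or truncated clip."""
    n_frames = chunk_seconds * sampling_rate // hop_length
    return n_mels, n_frames


print(expected_feature_shape())  # (80, 3000)
```

If those assumptions hold, every example has the same feature shape regardless of the clip's original duration, which is why only the labels need variable-length padding later on.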

                                                             

## Training and Evaluation

### Define a Data Collator

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # if a bos token was added in the previous tokenization step,
        # cut it here as it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
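
To make the label handling in the collator easier to follow, here is a simplified pure-Python sketch of the same two ideas: pad every label sequence to the longest one using -100 (the default index that PyTorch's cross-entropy loss ignores), then drop a leading BOS token if every sequence starts with one. `collate_labels` is a hypothetical stand-in that pads with -100 directly instead of going through the tokenizer and attention mask, and the token ids are made up.

```python
def collate_labels(label_seqs, bos_id, ignore_index=-100):
    """Pad label sequences to equal length and strip a shared leading BOS token.

    Assumes at least one sequence and that no sequence is empty.
    """
    max_len = max(len(seq) for seq in label_seqs)
    # Pad on the right; padded positions are excluded from the loss.
    padded = [list(seq) + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]
    # If every row starts with BOS, remove that column.
    if all(row[0] == bos_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


# Toy ids: 1 stands in for BOS, 2 for EOS.
print(collate_labels([[1, 7, 8, 9, 2], [1, 5, 2]], bos_id=1))
# [[7, 8, 9, 2], [5, 2, -100, -100]]
```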

### Evaluation Metric: WER (Word Error Rate)

In [None]:
import evaluate

metric = evaluate.load("wer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
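
For readers unfamiliar with the metric, the sketch below computes WER for a single reference and hypothesis pair as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a plain-Python illustration, not the implementation behind `evaluate.load("wer")`, which aggregates errors over the whole evaluation set; `word_error_rate` is a hypothetical name and the sentences are toy examples.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(100 * wer, 2))  # 33.33 (one substitution and one deletion over six words)
```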

### Load a Pre-Trained Checkpoint

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

### Training Configuration

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-te",  # change to your specific folder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",  # recent transformers releases name this argument eval_strategy
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)
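
To relate these settings to the size of the dataset, the sketch below estimates the effective batch size, the number of optimizer steps per epoch, and how many passes over the data `max_steps` corresponds to. `training_plan` is a hypothetical helper, the steps-per-epoch figure is an approximation of what the trainer does, and the 900 examples come from the split used earlier.

```python
import math


def training_plan(n_examples, per_device_batch, grad_accum, max_steps, n_devices=1):
    """Return (effective_batch, approx_steps_per_epoch, approx_epochs)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    epochs = max_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs


effective, steps, epochs = training_plan(
    n_examples=900, per_device_batch=16, grad_accum=1, max_steps=4000
)
print(effective, steps, round(epochs, 1))
```

With 900 training examples this works out to roughly 70 passes over the data, so it may be worth lowering `max_steps` (and `warmup_steps` with it) for a dataset of this size.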

Note that the code above is version-specific. If it raises an error, try the installation commands below one at a time. If the error persists, look for other solutions in the [`Seq2SeqTraining`](https://github.com/huggingface/transformers/tree/main/examples/legacy/seq2seq) examples.


In [None]:
# !pip install --upgrade accelerate

In [None]:
# !pip uninstall -y transformers accelerate
# !pip install transformers accelerate

In [None]:
# !pip install pytorch-accelerated

### Training

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

In [None]:
processor.save_pretrained(training_args.output_dir)

In [None]:
trainer.train()

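In case of a CUDA out-of-memory error, try the following:

1. Reduce `per_device_train_batch_size`, and increase `gradient_accumulation_steps` by 2x for every 2x decrease in batch size.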
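
Following the comment in the training configuration (double `gradient_accumulation_steps` for every halving of the batch size), the sketch below lists the combinations that keep the effective batch size unchanged. `batch_size_options` is a hypothetical helper and assumes the effective batch size is a power of two.

```python
def batch_size_options(effective_batch=16):
    """List (per_device_train_batch_size, gradient_accumulation_steps) pairs.

    Each pair multiplies to effective_batch, assuming it is a power of two.
    """
    options = []
    per_device = effective_batch
    while per_device >= 1:
        options.append((per_device, effective_batch // per_device))
        per_device //= 2  # halve the batch size at each step
    return options


print(batch_size_options(16))
# [(16, 1), (8, 2), (4, 4), (2, 8), (1, 16)]
```

Moving down the list lowers peak memory use at the cost of more forward and backward passes per optimizer step.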

## Transcription using Gradio

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="--------------") #path to the model or checkpoint

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe, 
    # Gradio 4 and later expect sources=["upload"] in place of source="upload"
    inputs=gr.Audio(source="upload", type="filepath"), 
    outputs="text",
    title="Whisper Small Telugu",
    description="Demo for Telugu speech recognition using the Whisper small model.",
)

iface.launch()