<a href="https://colab.research.google.com/github/Korsholm22/M4_Group_Assignments/blob/main/Group_Assignment_4/NHN/Group_Assignment_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task

Build an exciting and perhaps also fun application using techniques learned in this module. Submission: GitHub repo as usual.

**Minimal requirements:**
- Relevant task solved
- Self-trained or fine-tuned transformer published on HF, but not a sentence transformer used for semantic search only (you are welcome to explore techniques beyond the scope of the course, e.g. on HF)
- Gradio (in-notebook) app or HF spaces


**Nice-to-have:**
- Streamlit app on Hub
- Optional use of API (HF Inference API, Cohere, be careful with OpenAI 💸)
- Optional more complex LLM setup with e.g. langchain, promptify, pinecone and other integrations etc.

# Fine-Tuning a Whisper Model for Automatic Speech Recognition

Following a tutorial by Sanchit Gandhi, we have fine-tuned a Whisper model on the Common Voice dataset from Mozilla using Hugging Face 🤗 Transformers. Whisper is a pre-trained model for automatic speech recognition (ASR) published in 2022 and trained on 680,000 hours of labelled audio-transcription data.

The model is currently available in five different sizes, which vary in numbers of layers, parameters, languages and so forth.

Due to time constraints and computational power available on Google Colab, we will work with the smallest model called "tiny" in the following.

# Environment

In [None]:
# First, we verify whether we have access to GPU on Google Colab or not
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
# Pip installing necessary packages
!pip install "datasets>=2.6.1" -q         # to load the Common Voice dataset from Hugging Face (quoted so the shell does not treat ">" as a redirect)
!pip install transformers -q              # to load transformer models from Hugging Face
!pip install librosa -q                   # to pre-process audio files
!pip install "evaluate>=0.3.0" -q         # to assess the performance of the model (quoted so the shell does not treat ">" as a redirect)
!pip install jiwer -q                     # to assess the performance of the model
!pip install gradio -q                    # to create demo interfaces
import os; os.environ["LC_ALL"] = "en_US.UTF-8"   # set in the kernel so later `!pip` calls inherit it; needed in order to pip install googletrans below
!pip install googletrans==4.0.0-rc1 -q    # to create a google translate API

In [None]:
# Logging into Hugging Face so we can upload model checkpoints while training and upload the trained model
from huggingface_hub import notebook_login

notebook_login()

# Data

In [None]:
# Importing Common Voice data from Hugging Face
# The model will be trained on Japanese, i.e. "ja"
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="train", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test", use_auth_token=True)

print(common_voice)

In [None]:
# As seen on Hugging Face, the dataset contains multiple columns with additional information.
# For fine-tuning, we remove all columns except the audio file and the transcribed text.
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

print(common_voice)

# Preprocessing

The pipeline for Automatic Speech Recognition can be divided into three stages:
- Pre-processing, feature extraction on audio inputs
- Sequence-to-sequence mapping
- Post-processing, tokenization of model output to text format

Whisper also provides a feature extractor and a tokenizer, both of which can be loaded from Hugging Face.

## Loading Processing Tools

In [None]:
# Loading the WhisperFeatureExtractor
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

In [None]:
# Loading the WhisperTokenizer setting the language to Japanese and the task to transcribe
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Japanese", task="transcribe")

In [None]:
# Loading WhisperProcessor to wrap both WhisperFeatureExtractor and WhisperTokenizer into a single class
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Japanese", task="transcribe")

## Data Preparation

In [None]:
# Checking the form of the data
print(common_voice["train"][0])

In [None]:
# As seen above, the input is sampled at 48 kHz. The WhisperFeatureExtractor requires a sampling rate of 16 kHz,
# so the data has to be resampled using Audio from Hugging Face Datasets
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
# Checking that the data has been resampled to 16 kHz
print(common_voice["train"][0])
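
As a rough sanity check on the resampling step, the sketch below computes how many samples a clip should contain after going from 48 kHz to 16 kHz. The helper names and the example clip length are our own assumptions for illustration. Real resampling also filters the signal, which this arithmetic does not model.

```python
def resampled_length(n_samples, orig_sr=48_000, target_sr=16_000):
    """Number of samples after resampling, rounded up (integer arithmetic)."""
    return (n_samples * target_sr + orig_sr - 1) // orig_sr

def duration_seconds(n_samples, sr):
    """Clip duration in seconds for a given sampling rate."""
    return n_samples / sr

n_in = 4 * 48_000 + 100          # hypothetical clip: a little over 4 s at 48 kHz
n_out = resampled_length(n_in)   # one third as many samples at 16 kHz

print(n_in, "->", n_out)
print(round(duration_seconds(n_in, 48_000), 3), "s vs", round(duration_seconds(n_out, 16_000), 3), "s")
```

The sample count shrinks by a factor of three while the duration stays essentially unchanged, which is what the second `print` of the dataset above should reflect.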

In [None]:
# Creating a function for data pre-processing, which resamples the data, extracts the features and tokenizes the transcriptions

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
# Pre-processing both the training and test splits using the function above
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
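
To make the shape of `input_features` more concrete, the following sketch mimics only the length handling of the feature extractor: every clip is zero-padded or cut to 30 s before the log-Mel spectrogram is computed. The constants (hop length of 160 samples, 80 Mel bins) are the defaults we assume for `whisper-tiny`, and the helper names are hypothetical, not part of 🤗 Transformers.

```python
SAMPLING_RATE = 16_000   # Hz, expected by the Whisper feature extractor
CHUNK_SECONDS = 30       # Whisper operates on fixed 30 s windows
HOP_LENGTH = 160         # assumed hop (in samples) between spectrogram frames
N_MELS = 80              # assumed number of Mel bins for whisper-tiny

def pad_or_truncate(samples, target_len=SAMPLING_RATE * CHUNK_SECONDS):
    """Zero-pad short clips and cut long clips to exactly target_len samples."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))

def feature_shape(n_samples):
    """Shape (Mel bins, frames) of the log-Mel features for n_samples of audio."""
    return (N_MELS, n_samples // HOP_LENGTH)

clip = [0.1] * (5 * SAMPLING_RATE)   # hypothetical 5 s clip
fixed = pad_or_truncate(clip)

print(len(fixed), feature_shape(len(fixed)))
```

Under these assumptions every example ends up with the same feature shape regardless of the original clip length, which is why the audio inputs need no further padding later on.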

# Training

The model is trained by using 🤗 Trainer. To do so we have to:

- Define a data collator: the data collator takes pre-processed data and prepares PyTorch tensors

- Evaluation metrics: we evaluate the model using the [word error rate (WER)](https://huggingface.co/metrics/wer) metric.

- Load a pre-trained checkpoint: we load a pre-trained checkpoint and configure it for training.

- Define the training configuration: this is used by the 🤗 Trainer to define the training schedule.

## Data Collator

The data collator for a sequence-to-sequence speech model is unique in the sense that it 
treats the `input_features` and `labels` independently: the  `input_features` must be 
handled by the feature extractor and the `labels` by the tokenizer.

The `input_features` are already padded to 30s and converted to a log-Mel spectrogram 
of fixed dimension by action of the feature extractor, so all we have to do is convert the `input_features`
to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors="pt"`.

The `labels` on the other hand are un-padded. We first pad the sequences
to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens 
are then replaced by `-100` so that these tokens are **not** taken into account when 
computing the loss. We then cut the BOS token from the start of the label sequence as we 
append it later during training.

We can leverage the `WhisperProcessor` we defined earlier to perform both the 
feature extractor and the tokenizer operations:

In [None]:
# Defining a class for the data collator
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here as it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
# Initialising the data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
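
The label handling in the collator can be illustrated on plain Python lists. This is a minimal sketch with made-up token ids (`pad_id=0`, `bos_id=1`), not the real Whisper vocabulary, and `collate_labels` is a hypothetical helper that mirrors the three steps: pad to the longest sequence, replace padded positions with `-100`, and drop a leading BOS token if every sequence starts with one.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def collate_labels(label_seqs, pad_id, bos_id):
    """Pad label sequences, mask the padding with -100 and strip a shared BOS token."""
    max_len = max(len(seq) for seq in label_seqs)

    padded, attention_mask = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)

    # Mask by attention mask rather than by token id, as the collator above does
    labels = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(row, mask_row)]
        for row, mask_row in zip(padded, attention_mask)
    ]

    # Drop the first column only if every sequence starts with the BOS token
    if all(row[0] == bos_id for row in labels):
        labels = [row[1:] for row in labels]
    return labels

batch = collate_labels([[1, 7, 8, 2], [1, 9, 2]], pad_id=0, bos_id=1)
print(batch)  # [[7, 8, 2], [9, 2, -100]]
```

Masking via the attention mask instead of comparing against `pad_id` keeps genuine tokens intact even if a tokenizer reuses the same id for padding and another special token.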

## Evaluation

We then simply have to define a function that takes our model 
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the 
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:

In [None]:
# Importing the evaluation metric "WER"
import evaluate

metric = evaluate.load("wer")

In [None]:
# Defining a function that returns the WER based on model predictions
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
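
For intuition about what the metric measures, here is a small pure-Python sketch of the word error rate: the number of word-level substitutions, deletions and insertions divided by the number of reference words. It is our own simplified illustration, not the implementation behind `evaluate.load("wer")`, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(references, predictions):
    """Total word-level edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total = sum(len(r.split()) for r in references)
    return errors / total

wer = word_error_rate(["the cat sat on the mat"], ["the cat sat on mat"])
print(round(100 * wer, 2))  # 16.67
```

Since `edit_distance` works on any sequence, passing `list(sentence)` instead of `sentence.split()` would give a character-level rate, which can be more informative for text written without spaces. The notebook itself reports WER.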

## Checkpoints

In [None]:
# Loading pre-trained checkpoints from the tiny Whisper model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

In [None]:
# Overriding generation arguments: no tokens are forced as decoder outputs or suppressed during generation
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

## Training Configuration

Note: `warmup_steps` and `max_steps` should be adjusted to the number of rows in the training data.
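
One way to relate the step counts to the dataset size is sketched below. The row count is a hypothetical placeholder (the real number is printed with the dataset above), the helper `training_plan` is our own, and the arithmetic assumes training on a single device.

```python
import math

def training_plan(n_rows, batch_size=16, grad_accum=1, max_steps=4000, warmup_steps=500):
    """Translate step-based settings into epochs over a dataset of n_rows examples."""
    samples_per_step = batch_size * grad_accum           # examples seen per optimizer step
    steps_per_epoch = math.ceil(n_rows / samples_per_step)
    return {
        "steps_per_epoch": steps_per_epoch,
        "epochs": max_steps / steps_per_epoch,            # passes over the data
        "warmup_fraction": warmup_steps / max_steps,      # share of training spent warming up
    }

plan = training_plan(n_rows=6_500)  # hypothetical number of training rows
print(plan)
```

If the resulting number of epochs looks too high or too low for the available Colab time, `max_steps` and `warmup_steps` can be scaled together so that the warm-up fraction stays roughly the same.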

In [None]:
# Defining parameters for training referring to the Seq2SeqTrainingArguments docs.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./Japanese_Fine_Tuned_Whisper_Model",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
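
The effect of `learning_rate`, `warmup_steps` and `max_steps` can be sketched as a function of the training step. This assumes the Trainer's default linear scheduler is in effect (no `lr_scheduler_type` is set above). The function `linear_schedule_lr` is a hypothetical helper written for illustration, not a library call.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, max_steps=4000):
    """Learning rate at a given step: linear warm-up followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return base_lr * remaining / (max_steps - warmup_steps)

for step in (0, 250, 500, 2250, 4000):
    print(step, linear_schedule_lr(step))
```

Under this assumption the learning rate peaks at `1e-5` once warm-up ends at step 500 and is back at half that value by step 2250.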

In [None]:
# Forwarding the training arguments, model, dataset, data collator and evaluation metric function to the 🤗 Trainer
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

In [None]:
# Saving the processor object
processor.save_pretrained(training_args.output_dir)

## Training

In [None]:
# Training the model
trainer.train()

# Pushing the model to Hub

In [None]:
# Uploading the model weights and config to a Hub repository with this name
model.push_to_hub('Japanese_Fine_Tuned_Whisper_Model')

In [None]:
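# Uploading the contents of the Trainer's output directory; the positional argument is used as the commit message
trainer.push_to_hub('Japanese_Fine_Tuned_Whisper_Model')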

In [None]:
# Metadata used for the model card that is created when pushing to the Hub
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",
    "dataset_args": "config: ja, split: test",
    "language": "ja",
    "model_name": "Japanese_Fine_Tuned_Whisper_Model",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)

# Gradio

## Whisper Gradio Demo

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model")  # change to "your-username/the-name-you-picked"

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface_transcribe = gr.Interface(
    fn=transcribe, 
    inputs=gr.Audio(source="microphone", type="filepath"), 
    outputs="text",
    title="Whisper Tiny Japanese",
    description="Realtime demo for Japanese speech recognition using a fine-tuned Whisper tiny model.",
)

iface_transcribe.launch()

## Google Translate API Demo

In [None]:
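# googletrans (installed above) provides the Translator client used below
from googletrans import Translator

translator = Translator()

def translate(text):
    # Translate the Japanese input to English
    result = translator.translate(text, dest='en')
    translation = result.text
    # Translate back to Japanese to obtain the romanised pronunciation
    pronunciation = translator.translate(translation, dest='ja').pronunciation
    return f"Pronunciation: {pronunciation}", f"Translation: {translation}"

iface_translate = gr.Interface(
    fn=translate,
    inputs="text",
    # Labels are set on the output components themselves
    outputs=[gr.Textbox(label="Pronunciation"), gr.Textbox(label="Translation")],
    title="Google Translate"
)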

iface_translate.launch()

## Whisper + Google Translate API Demo

In [None]:
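from transformers import pipeline
from googletrans import Translator
import gradio as gr

pipe = pipeline(model="NadiaHolmlund/Fined_Tuned_Whisper_Model")
translator = Translator()

def transcribe(audio):
    # Transcribe the recording with the fine-tuned Whisper model
    transcription = pipe(audio)["text"]
    # Translate the transcription to English
    result = translator.translate(transcription, dest='en')
    translation = result.text
    # Translate back to Japanese to obtain the romanised pronunciation
    pronunciation = translator.translate(translation, dest='ja').pronunciation
    return f"Transcription: {transcription}", f"Pronunciation: {pronunciation}", f"Translation: {translation}"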

iface_transcribe = gr.Interface(
    fn=transcribe, 
    inputs=gr.Audio(source="microphone", type="filepath"), 
    outputs=[gr.Textbox(label="Transcription"), gr.Textbox(label="Pronunciation"), gr.Textbox(label="Translation")],
    title="Whisper Tiny Japanese",
    description="Realtime demo for Japanese speech recognition using a fine-tuned Whisper tiny model.",
)

iface_transcribe.launch()