<a href="https://colab.research.google.com/github/NadiaHolmlund/BDS_M4_Exam/blob/main/Group_Assignment_4/Group_Assignment_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

***

What we have created:
- We have fine-tuned a Whisper model based on Japanese speech samples from the Common Voice Dataset
- We have then created a Demo where users can perform Automatic Speech Recognition in real time in Japanese
- We have also linked the model to the Google Translate API through the `googletrans` package from PyPI, which shows the pronunciation of the Japanese speech as well as its English translation

You can find the model and Gradio demos here:
- [Whisper Model](https://huggingface.co/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model)
- [Whisper Demo](https://huggingface.co/spaces/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model)
- [Google Translate API Demo](https://huggingface.co/spaces/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model_2)
- [Whisper + Google Translate API Demo](https://huggingface.co/spaces/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model_3)

***

# Task

Build an exciting and perhaps also fun application using techniques learned in this module. Submission: GitHub repo as usual.

**Minimal requirements:**
- Relevant task solved
- Self-trained or fine-tuned transformer, published on HF (not a sentence transformer used for semantic search only; you are welcome to explore techniques beyond the scope of the course, e.g. on HF)
- Gradio (in-notebook) app or HF spaces


**Nice-to-have:**
- Streamlit app on Hub
- Optional use of API (HF Inference API, Cohere, be careful with OpenAI 💸)
- Optional more complex LLM setup with e.g. langchain, promptify, pinecone and other integrations etc.

# Fine-Tuning a Whisper Model for Automatic Speech Recognition

Following a tutorial by Sanchit Gandhi, we have fine-tuned a Whisper model on the Common Voice dataset from Mozilla using Hugging Face 🤗 Transformers. Whisper is a pre-trained model for automatic speech recognition (ASR) published in 2022, and it was trained on 680,000 hours of labelled audio-transcription data.

The model is currently available in five different sizes, which vary in numbers of layers, parameters, languages and so forth.

Due to time constraints and computational power available on Google Colab, we will work with the smallest model called "tiny" in the following.

# Environment

In [None]:
# First, we verify whether we have access to GPU on Google Colab or not
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun Mar 12 12:22:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    26W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Pip installing necessary packages
!pip install "datasets>=2.6.1" -q         # to load the Common Voice dataset from Hugging Face (quoted so the shell does not treat ">" as a redirect)
!pip install transformers -q              # to load transformer models from Hugging Face
!pip install librosa -q                   # to pre-process audio files
!pip install "evaluate>=0.3.0" -q         # to assess the performance of the model (quoted so the shell does not treat ">" as a redirect)
!pip install jiwer -q                     # to assess the performance of the model
!pip install gradio -q                    # to create demo interfaces
!export LC_ALL=en_US.UTF-8                # needed to pip install googletrans below
!pip install googletrans==4.0.0-rc1 -q    # to create a google translate API

In [None]:
# Logging into Hugging Face so we can upload model checkpoints while training and upload the trained model
from huggingface_hub import notebook_login

notebook_login()


# Data

In [None]:
# Importing Common Voice data from Hugging Face
# The model will be trained on Japanese, i.e. "ja"
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="train", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "ja", split="test", use_auth_token=True)

print(common_voice)



DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 6505
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 4604
    })
})


In [None]:
# As seen on Hugging Face, the dataset contains several metadata columns. We remove all columns except the audio and the transcribed sentence, which are the only ones needed for fine-tuning
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 6505
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4604
    })
})


# Preprocessing

The pipeline for Automatic Speech Recognition can be divided into three stages:
- Pre-processing, feature extraction on audio inputs
- Sequence-2-Sequence mapping
- Post-processing, tokenization of model output to text format

Whisper provides both a feature extractor and a tokenizer, which can be loaded from Hugging Face.

## Loading Processing Tools

In [None]:
# Loading the WhisperFeatureExtractor
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

In [None]:
# Loading the WhisperTokenizer setting the language to Japanese and the task to transcribe
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Japanese", task="transcribe")

In [None]:
# Loading WhisperProcessor to wrap both WhisperFeatureExtractor and WhisperTokenizer into a single class
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Japanese", task="transcribe")

## Data Preparation

In [None]:
# Checking the form of the data
print(common_voice["train"][0])

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/0fb75edb8bd454df8bb23ae8bee20532bae3434c4b63d0b4e126d443f35c56e7/common_voice_ja_25861545.mp3', 'array': array([-1.0615939e-15, -7.5679541e-14,  1.2697682e-14, ...,
       -1.3306192e-07, -2.4940018e-08, -4.0501153e-08], dtype=float32), 'sampling_rate': 48000}, 'sentence': '別の話を持ちかけられた。'}


In [None]:
# As seen above, the input is sampled at 48kHz. The WhisperFeatureExtractor requires a sampling rate of 16kHz, so the data has to be resampled using Audio from Hugging Face Datasets
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
# Checking that the data has been resampled to 16 kHz
print(common_voice["train"][0])

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/0fb75edb8bd454df8bb23ae8bee20532bae3434c4b63d0b4e126d443f35c56e7/common_voice_ja_25861545.mp3', 'array': array([-6.2760500e-14,  2.4398964e-13, -7.9903315e-14, ...,
       -7.8295676e-07, -8.7345489e-07, -2.4102926e-07], dtype=float32), 'sampling_rate': 16000}, 'sentence': '別の話を持ちかけられた。'}
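
To make the effect of resampling concrete, here is a minimal sketch in plain Python of how the number of samples changes when a clip goes from 48 kHz to 16 kHz. The helper `resampled_length` and the 3-second clip are illustrative assumptions and not part of 🤗 Datasets, which performs the actual resampling (including filtering) internally.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate number of samples after resampling (hypothetical helper)."""
    # The duration of the clip is unchanged by resampling,
    # so the sample count scales with the ratio of the two rates.
    return round(num_samples * target_sr / orig_sr)


# A hypothetical 3-second clip recorded at 48 kHz
clip_48k = 3 * 48_000
clip_16k = resampled_length(clip_48k, orig_sr=48_000, target_sr=16_000)

print(clip_48k, clip_16k)  # one third of the samples remain for the same duration
```

The array printed above is therefore roughly three times shorter than the original one, while the sentence it represents is the same.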


In [None]:
# Creating a function for data pre-processing, which extracts the features and tokenizes the transcriptions

def prepare_dataset(batch):
    # load the audio data; accessing the column resamples it from 48 to 16kHz on the fly
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
# Pre-processing the training and test datasets
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)


# Training

The model is trained by using 🤗 Trainer. To do so we have to:

- Define a data collator: the data collator takes pre-processed data and prepares PyTorch tensors

- Define evaluation metrics: we evaluate the model using the [word error rate (WER)](https://huggingface.co/metrics/wer).

- Load a pre-trained checkpoint: we load a pre-trained checkpoint from Hugging Face and configure it for training.

- Define the training configuration: this is used by the 🤗 Trainer to define the training schedule.

## Data Collator

The data collator for a sequence-to-sequence speech model is unique in the sense that it 
treats the `input_features` and `labels` independently: the  `input_features` must be 
handled by the feature extractor and the `labels` by the tokenizer.

The `input_features` are already padded to 30s and converted to a log-Mel spectrogram 
of fixed dimension by the feature extractor, so all we have to do is convert the `input_features`
to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors="pt"`.
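
As a rough sketch of why the inputs have a fixed size: assuming the default Whisper feature-extractor settings (16 kHz audio, 30 s chunks, a hop of 160 samples and 80 Mel bins), every clip is padded or cut to the same number of samples before the spectrogram is computed. The function `pad_or_trim` below is a hypothetical plain-Python illustration, not the library's implementation.

```python
SAMPLING_RATE = 16_000   # Hz, as required by the Whisper feature extractor
CHUNK_SECONDS = 30       # assumed default chunk length
HOP_LENGTH = 160         # assumed default hop between spectrogram frames
N_MELS = 80              # assumed default number of Mel bins

N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS   # samples per 30 s chunk
N_FRAMES = N_SAMPLES // HOP_LENGTH          # spectrogram frames per chunk


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad short clips and cut long clips to exactly `length` samples."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLING_RATE)    # a hypothetical 5-second clip
padded = pad_or_trim(short_clip)

print(len(padded), (N_MELS, N_FRAMES))
```

Under these assumptions each example becomes an array of 80 Mel bins by 3000 frames, regardless of how long the original clip was.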

The `labels` on the other hand are un-padded. We first pad the sequences
to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens 
are then replaced by `-100` so that these tokens are **not** taken into account when 
computing the loss. We then cut the BOS token from the start of the label sequence as we 
append it later during training.
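
The label handling described above can be sketched in plain Python. This is a simplified illustration with made-up token ids and a hypothetical helper `pad_and_mask_labels`; it assumes non-empty sequences and skips the intermediate step of padding with the pad token and masking via the attention mask.

```python
IGNORE_INDEX = -100  # label value that is ignored when computing the loss


def pad_and_mask_labels(label_seqs, bos_token_id):
    """Pad label sequences to a common length, mask padding, drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)

    # Right-pad every sequence to the longest one in the batch,
    # writing IGNORE_INDEX directly into the padded positions.
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]

    # Only strip the first token if every sequence in the batch starts with BOS.
    if all(seq[0] == bos_token_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


# Hypothetical token ids: 1 plays the role of the BOS token here
batch = [[1, 7, 8, 9], [1, 5]]
print(pad_and_mask_labels(batch, bos_token_id=1))
# [[7, 8, 9], [5, -100, -100]]
```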

We can leverage the `WhisperProcessor` we defined earlier to perform both the 
feature extractor and the tokenizer operations:

In [None]:
# Defining a class for the data collator
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here as it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
# Initialising the data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

## Evaluation

We then simply have to define a function that takes our model 
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the 
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:

In [None]:
# Importing the evaluation metric "WER"
import evaluate

metric = evaluate.load("wer")


In [None]:
# Defining a function that returns the WER based on model predictions
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
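
For intuition about the metric, the sketch below computes a word error rate by hand as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. It is a simplified stand-in rather than the implementation behind `evaluate.load("wer")`: the function name `word_error_rate` is hypothetical, words are obtained by splitting on whitespace, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programme).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the dog sat"))        # one substitution in three words
print(word_error_rate("hello", "hello there my friend"))    # three insertions for one reference word
```

Because insertions are counted against the length of the reference, the ratio can exceed 1, i.e. more than 100 after the multiplication used in `compute_metrics`.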

## Checkpoints

In [None]:
# Loading pre-trained checkpoints from the tiny Whisper model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")


In [None]:
# Overriding generation arguments: no tokens are forced as decoder outputs or suppressed during generation
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

## Training Configuration

In [None]:
# Defining parameters for training referring to the Seq2SeqTrainingArguments docs.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./Japanese_Fine_Tuned_Whisper_Model",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True
    
)
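
To give a feel for these settings, the sketch below works out the effective batch size, the approximate number of passes over the 6,505 training examples, and the learning rate at a few steps. It assumes a single GPU and a linear warmup followed by linear decay, which is the scheduler used when none is specified; `lr_at_step` is a hypothetical helper, not a Transformers function.

```python
PER_DEVICE_BATCH = 16
GRAD_ACCUM = 1
NUM_GPUS = 1          # assumption: one Colab GPU
MAX_STEPS = 4000
WARMUP_STEPS = 500
PEAK_LR = 1e-5
TRAIN_ROWS = 6505     # size of the Japanese training split printed earlier

# Number of examples that contribute to each optimizer step
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Roughly how many times the training split is seen in total
approx_epochs = MAX_STEPS * effective_batch / TRAIN_ROWS


def lr_at_step(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))


print(effective_batch, round(approx_epochs, 2))
for step in (0, 250, 500, 2250, 4000):
    print(step, lr_at_step(step))
```

With these numbers the model sees the training split a little under ten times, and the learning rate peaks at step 500 before decaying.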

In [None]:
# Forwarding the training arguments, model, dataset, data collator and evaluation metric function to the 🤗 Trainer
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Cloning https://huggingface.co/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model into local empty directory.
max_steps is given, it will override any value given in num_train_epochs
Using cuda_amp half precision backend


In [None]:
# Saving the processor object
processor.save_pretrained(training_args.output_dir)

Feature extractor saved in ./Japanese_Fine_Tuned_Whisper_Model/preprocessor_config.json
tokenizer config file saved in ./Japanese_Fine_Tuned_Whisper_Model/tokenizer_config.json
Special tokens file saved in ./Japanese_Fine_Tuned_Whisper_Model/special_tokens_map.json
added tokens file saved in ./Japanese_Fine_Tuned_Whisper_Model/added_tokens.json


## Training

In [None]:
# Training the model
trainer.train()

![picture](https://raw.github.com/Korsholm22/M4_Group_Assignments/main/Group_Assignment_4/Illustrations/Training_Steps.png)

# Pushing the model to Hub

In [None]:
model.push_to_hub('Japanese_Fine_Tuned_Whisper_Model')

Configuration saved in Japanese_Fine_Tuned_Whisper_Model/config.json
Configuration saved in Japanese_Fine_Tuned_Whisper_Model/generation_config.json
Model weights saved in Japanese_Fine_Tuned_Whisper_Model/pytorch_model.bin
Uploading the following files to NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model: generation_config.json,pytorch_model.bin,config.json


CommitInfo(commit_url='https://huggingface.co/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model/commit/1617455d09dbecb4da5223b55b19494daae29b39', commit_message='Upload WhisperForConditionalGeneration', commit_description='', oid='1617455d09dbecb4da5223b55b19494daae29b39', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
trainer.push_to_hub('Japanese_Fine_Tuned_Whisper_Model')

Saving model checkpoint to ./Japanese_Fine_Tuned_Whisper_Model
Configuration saved in ./Japanese_Fine_Tuned_Whisper_Model/config.json
Configuration saved in ./Japanese_Fine_Tuned_Whisper_Model/generation_config.json
Model weights saved in ./Japanese_Fine_Tuned_Whisper_Model/pytorch_model.bin
Feature extractor saved in ./Japanese_Fine_Tuned_Whisper_Model/preprocessor_config.json
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Automatic Speech Recognition', 'type': 'automatic-speech-recognition'}, 'metrics': [{'name': 'Wer', 'type': 'wer', 'value': 301.6258400173423}]}


In [None]:
kwargs = {
    "dataset_tags": "mozilla-foundation/common_voice_11_0",
    "dataset": "Common Voice 11.0",
    "dataset_args": "config: ja, split: test",
    "language": "ja",
    "model_name": "Japanese_Fine_Tuned_Whisper_Model",
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
    "tags": "hf-asr-leaderboard",
}

trainer.push_to_hub(**kwargs)

Saving model checkpoint to ./Japanese_Fine_Tuned_Whisper_Model
Configuration saved in ./Japanese_Fine_Tuned_Whisper_Model/config.json
Configuration saved in ./Japanese_Fine_Tuned_Whisper_Model/generation_config.json
Model weights saved in ./Japanese_Fine_Tuned_Whisper_Model/pytorch_model.bin
Feature extractor saved in ./Japanese_Fine_Tuned_Whisper_Model/preprocessor_config.json
Several commits (4) will be pushed upstream.
The progress bars may be unreliable.
ref main:: Error in git rev-list --stdin --objects --not --remotes=origin --: exit status 128 fatal: bad object 1617455d09dbecb4da5223b55b19494daae29b39

error: failed to push some refs to 'https://user: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@huggingface.co/NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model'



Error pushing update to the model card. Please read logs and retry.
$ref main:

# Gradio

In [None]:
# Imports
from transformers import pipeline
import gradio as gr

## Whisper Demo

In [None]:
from transformers import pipeline
import gradio as gr

pipe = pipeline(model="NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model")

def transcribe(audio):
    text = pipe(audio)["text"]
    return text

iface = gr.Interface(
    fn=transcribe, 
    inputs=gr.Audio(source="microphone", type="filepath"), 
    outputs="text",
    title="Tiny Whisper Model Fine-Tuned on Japanese",
    description="Real-time demo for Japanese speech recognition using a tiny Whisper model fine-tuned on the Common Voice dataset.",
)

iface.launch()

![picture](https://raw.github.com/Korsholm22/M4_Group_Assignments/main/Group_Assignment_4/Illustrations/Whisper%20Demo.png)

## Google Translate API Demo

In [None]:
from googletrans import Translator
import gradio as gr

def translate(text):
    translator = Translator()
    result = translator.translate(text, dest='en')
    translation = result.text
    # Take the pronunciation from the original Japanese text rather than from a round-trip translation
    pronunciation = translator.translate(text, dest='ja').pronunciation
    return f"Pronunciation: {pronunciation}", f"Translation: {translation}"

iface = gr.Interface(
    fn=translate,
    inputs="text",
    outputs=[gr.Textbox(label="Pronunciation"), gr.Textbox(label="Translation")],
    title="Japanese to English Translator"
)

iface.launch()

![picture](https://raw.github.com/Korsholm22/M4_Group_Assignments/main/Group_Assignment_4/Illustrations/Google%20Translate%20API%20Demo.png)

## Whisper + Google Translate API Demo

In [None]:
import gradio as gr
from googletrans import Translator
from transformers import pipeline

pipe = pipeline(model="NadiaHolmlund/Japanese_Fine_Tuned_Whisper_Model")

def translate_and_transcribe(audio):
    translator = Translator()
    
    # Transcribe Japanese audio to text
    transcription = pipe(audio)["text"]  

    # Translate the transcription to English
    result = translator.translate(transcription, dest='en')
    translation = result.text

    # Get the pronunciation of the transcription in Japanese
    pronunciation = translator.translate(transcription, dest='ja').pronunciation

    return transcription, pronunciation, translation

# Using the top-level Gradio components, as in the Whisper demo above
input_audio = gr.Audio(label="Record your Japanese speech here. Try saying 'Kon'nichiwa', 'Arigatō' or perhaps 'Sayōnara'", source="microphone", type="filepath")
output_textbox1 = gr.Textbox(label="Transcription")
output_textbox2 = gr.Textbox(label="Pronunciation")
output_textbox3 = gr.Textbox(label="Translation")

iface = gr.Interface(
    fn=translate_and_transcribe, 
    inputs=input_audio, 
    outputs=[output_textbox1, output_textbox2, output_textbox3],
    title="Japanese Automatic Speech Recognition, Pronunciation and Translation",
    description="Record Japanese speech to get its transcription, its pronunciation and an English translation, using a fine-tuned version of the tiny Whisper model connected to the Google Translate API."
)

iface.launch()

![picture](https://raw.github.com/Korsholm22/M4_Group_Assignments/main/Group_Assignment_4/Illustrations/Whisper%20%2B%20Google%20Translate%20API%20Demo.png)