Kernel: PYTORCH 2.0.0 Python 3.10 CPU Kernel, ml.m5.large

# Customize Whisper to add support for new languages


In this workshop we will train a Whisper Tiny model on a new language, Gaeilge (Irish).

We will use a form of training known as transfer learning, where we pick a task
similar to the one we need to implement and use it as a starting point.
In today's workshop the starting point will be English: rather than teaching the
model a new language from scratch, we will tune English to become Gaeilge.
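
As a toy illustration of why a related starting point helps, the sketch below fits a one-parameter model `y = w * x` by gradient descent. This is not Whisper: the data, the learning rate, and all names (`train`, `mse`, `source_task`, `target_task`) are made up for the example. The "source" task stands in for English and the small "target" task stands in for Gaeilge.

```python
def train(w, data, lr=0.01, steps=20):
    """Plain gradient descent on the mean squared error of y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


# Hypothetical data: a plentiful source task and a scarce, related target task
source_task = [(x, 2.0 * x) for x in range(1, 6)]
target_task = [(x, 2.5 * x) for x in range(1, 4)]

# "Pretraining" on the source task, then a few fine-tuning steps on the target task
pretrained_w = train(0.0, source_task, steps=200)
finetuned_w = train(pretrained_w, target_task, steps=5)

# Same small training budget, but starting from scratch
scratch_w = train(0.0, target_task, steps=5)

print(f"fine-tuned error:   {mse(finetuned_w, target_task):.4f}")
print(f"from-scratch error: {mse(scratch_w, target_task):.4f}")
```

With the same five training steps, the model that starts from the related task ends up much closer to the target than the one that starts from zero, which is the effect we rely on when tuning Whisper.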

Whisper Tiny is a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

Whisper is implemented using PyTorch, a machine learning framework originally developed by Meta AI and now part of the Linux Foundation umbrella.

This is the first section of the workshop, in which we prepare the data needed for finetuning.

We will:

1. Download labelled (sound > transcript) Gaeilge sound samples from Mozilla Common Voice
2. Prepare the samples in the format that Whisper expects for training
3. Pick one sample and run it through the base model to establish a baseline
4. Upload samples to S3 for the next step, finetuning

In [None]:
# PREREQUISITE: these libraries are required to download the model from Hugging Face
!apt update
!apt install gnupg -y
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
!apt install -y git-lfs
!git lfs install

In [None]:
# CHECK: this step raises an error if PyTorch is not configured correctly in the notebook
import torch
# torch.cuda.get_device_name(0)  # uncomment only on a GPU kernel; it raises an error on a CPU-only one
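
A slightly more informative variant of this check is sketched below. It is only a suggestion and not part of the original notebook; the helper name `describe_torch_setup` is hypothetical. It reports the installed PyTorch version and whether a CUDA device is visible, without failing on a CPU-only kernel.

```python
import importlib.util


def describe_torch_setup() -> str:
    """Return a one-line summary of the PyTorch installation, if any."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    # torch.cuda.is_available() simply returns False on a CPU-only kernel
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"torch {torch.__version__} on {device}"


print(describe_torch_setup())
```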


In [None]:
# ffmpeg 4 is required to process the audio from common voice into data that can be used by pytorch
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt install -y ffmpeg


In [None]:
# These are the actual dependencies needed to a) download the dataset from Hugging Face and b) run the model locally to establish a baseline
# The version specifiers are quoted so that the shell does not treat ">" as an output redirection
!pip install --upgrade pip
!pip install -U "datasets>=2.6.1" git+https://github.com/huggingface/transformers "accelerate>=0.20.3" librosa "evaluate>=0.3.0" jiwer gradio soundfile tqdm
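
To confirm that the minimum versions named in the pip command were actually installed, a check along the following lines can be run afterwards. This is a minimal sketch using only the standard library; `REQUIRED`, `version_tuple` and `check_requirements` are names invented for this example, and the version comparison is deliberately simplistic (it only looks at the leading numeric components).

```python
import re
from importlib import metadata

# Minimum versions taken from the pip command; packages without a pin map to None
REQUIRED = {"datasets": "2.6.1", "accelerate": "0.20.3", "evaluate": "0.3.0", "jiwer": None}


def version_tuple(version):
    """Turn '2.14.0' into (2, 14, 0), stopping at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(required):
    """Return a dict mapping each package name to a short status string."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is not None and version_tuple(installed) < version_tuple(minimum):
            report[name] = f"too old ({installed} < {minimum})"
        else:
            report[name] = f"ok ({installed})"
    return report


for package, status in check_requirements(REQUIRED).items():
    print(f"{package:12s} {status}")
```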

In [None]:
# Download the Common Voice dataset (Gaeilge subset, "ga-IE")
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

train_dataset, validation_dataset, test_dataset = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "ga-IE",
    split=["train", "validation", "test"],
    use_auth_token=False,
)

print(train_dataset)


In [None]:
# Combine the separate splits into a train/validation/test structure for managed training
common_voice["train"] = train_dataset
common_voice["validation"] = validation_dataset
common_voice["test"] = test_dataset
print(common_voice)


In [None]:
# Save the dataset to disk so that we can reload it later without downloading it again.
common_voice.save_to_disk("ga-common-voice-original")

In [None]:
# Remove the fields not needed for training: we only use the audio and the transcript, not the metadata about the speaker
from datasets import load_from_disk, DatasetDict
common_voice = load_from_disk("ga-common-voice-original")
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", 
                                            "gender", "locale", "path", "segment", "up_votes"])
common_voice

In [None]:
# Let's download and initialize the components of the base model
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="en", task="transcribe")
tokenizer = processor.tokenizer

In [None]:
# Verify that the data is in a format that the model can encode and decode
input_str = common_voice["train"][0]["sentence"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")
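
For readers who are not running the cell, the sketch below mimics what the two decoded strings look like. It does not call the real tokenizer: it assumes the usual Whisper prefix for `language="en"` and `task="transcribe"`, and the sentence `Dia duit` and the helper `strip_special_tokens` are hypothetical stand-ins.

```python
import re

# Whisper special tokens are written as <|name|>
SPECIAL_TOKEN = re.compile(r"<\|[^|>]+\|>")


def strip_special_tokens(decoded: str) -> str:
    """Remove every <|...|> marker, similar in effect to skip_special_tokens=True."""
    return SPECIAL_TOKEN.sub("", decoded)


sentence = "Dia duit"  # hypothetical transcript, not taken from the dataset
with_special = (
    "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
    f"{sentence}<|endoftext|>"
)

print(with_special)
print(strip_special_tokens(with_special))
```

The special tokens tell the model which language and task to expect, so they are added to the labels but should not appear in the final text.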


In [None]:
# Let's view a sample
print(common_voice["train"][0])

In [None]:
# Downsample the samples: we need to match the sampling rate at which the model was trained (16 kHz)
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

# Apply it to the whole dataset. Give it some time, even after all the progress bars have turned green.
# If the process fails, try deleting the cache-xxxx Arrow files under ga-common-voice-original/train or in ~/.cache, then restart the kernel.
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=8)
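
The arithmetic behind this preprocessing step can be sketched as follows, assuming the default Whisper Tiny feature extractor settings (16 kHz audio, a 30-second window, a hop length of 160 samples and 80 mel bins). The function names and the 5-second clip are hypothetical and only serve the example.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_LENGTH_S = 30     # every clip is padded or truncated to 30 seconds
HOP_LENGTH = 160        # samples between consecutive spectrogram frames
N_MELS = 80             # mel frequency bins


def resampled_length(num_samples, source_rate, target_rate=SAMPLING_RATE):
    """Approximate number of samples after resampling a clip."""
    return round(num_samples * target_rate / source_rate)


def feature_shape():
    """Shape of one log-Mel input feature: (mel bins, frames)."""
    n_samples = SAMPLING_RATE * CHUNK_LENGTH_S
    n_frames = n_samples // HOP_LENGTH
    return (N_MELS, n_frames)


clip_48k = 5 * 48_000  # hypothetical 5-second clip recorded at 48 kHz
print(resampled_length(clip_48k, 48_000))  # 80000
print(feature_shape())                     # (80, 3000)
```

Because every clip is padded to the same window, each entry of `input_features` has the same shape regardless of how long the original recording was.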


In [None]:
# Save the processed dataset to disk
common_voice.save_to_disk("ga-common-voice-processed")
common_voice

In [None]:
# Display the expected transcription of a sample from the test set
input_features = torch.FloatTensor([common_voice['test'][0]['input_features']])
expected = tokenizer.decode(common_voice['test'][0]['labels'], skip_special_tokens=False)

expected

In [None]:
# Prepare the model for inference, run inference on the test sample, and inspect the transcription
# (the model output should be fairly different from the expected text)
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# load model and processor
tprocessor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
tmodel = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
tmodel.config.forced_decoder_ids = None


# generate token ids
predicted_ids = tmodel.generate(input_features)
# decode token ids to text
transcription = tprocessor.batch_decode(predicted_ids, skip_special_tokens=False)

transcription
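
One common way to put a number on this baseline is the word error rate (WER). The notebook itself only compares the two strings by eye, so the following is an optional sketch: a plain-Python WER based on word-level edit distance, with hypothetical function names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of tokens."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(word_error_rate("a b c d", "a x c"))  # 0.5
```

In the notebook this could be applied to the expected and predicted texts after decoding both with `skip_special_tokens=True`, so that the special tokens do not count as words. The `jiwer` package installed earlier offers `jiwer.wer` for the same purpose.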

In [None]:
# Pick the default bucket configuration and upload the data there for fine-tuning
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()
ROLE = get_execution_role()

BUCKET = sess.default_bucket()
PREFIX = "whisper/data/ga-common-voice-processed"
# Build the S3 URI explicitly: os.path.join would use the local OS path separator
s3uri = f"s3://{BUCKET}/{PREFIX}"
s3uri

In [None]:
# Use the AWS CLI (aws s3) to upload the processed dataset. You could also use the boto3 Python SDK to do the upload.
!aws s3 cp --recursive ga-common-voice-processed {s3uri}
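
For the boto3 route, a minimal sketch might look like the following. The helpers `iter_uploads` and `upload_directory` are hypothetical names; the only boto3 call assumed is the S3 client's `upload_file(Filename, Bucket, Key)`. The client is passed in as an argument so that the key mapping can be checked without AWS credentials.

```python
import os


def iter_uploads(local_dir, prefix):
    """Yield (local_path, s3_key) pairs for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes, whatever the local separator is
            key = "/".join([prefix.strip("/"), *relative.split(os.sep)])
            yield local_path, key


def upload_directory(client, local_dir, bucket, prefix):
    """Upload a directory tree with client.upload_file and return the keys written."""
    keys = []
    for local_path, key in iter_uploads(local_dir, prefix):
        client.upload_file(local_path, bucket, key)
        keys.append(key)
    return keys


# Usage in the notebook (requires AWS credentials):
# import boto3
# upload_directory(boto3.client("s3"), "ga-common-voice-processed", BUCKET, PREFIX)
```

Unlike `aws s3 cp --recursive`, this sketch uploads files one at a time, so the CLI is likely to remain the faster option for large datasets.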