### **Fine-tuning Whisper in a Google Colab - Feature Pipeline**

### Preparing Environment

In [None]:
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install accelerate -U

In [None]:
from huggingface_hub import notebook_login

notebook_login()

### Loading Data Set

We use the Polish subset of the Common Voice 11.0 dataset (`mozilla-foundation/common_voice_11_0`) to fine-tune the whisper-small model. Our training data combines the train and validation splits of Common Voice, and the test data is the separate test split.

In [None]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "pl", split="train+validation", token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "pl", split="test", token=True)

print(common_voice)

We remove the columns the model does not use, since we only need the audio and its transcription (the `sentence` column).

In [None]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

print(common_voice)

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 24833
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 8294
    })
})


### Acquiring Feature Extractor, Tokenizer and Processor

The Whisper model has an associated feature extractor and tokenizer, called WhisperFeatureExtractor and WhisperTokenizer respectively.

To simplify using the feature extractor and tokenizer, we can wrap both into a single WhisperProcessor class. This processor object combines the WhisperFeatureExtractor and the WhisperTokenizer and can be used on the audio inputs and model predictions as required. In doing so, we only need to keep track of two objects during training: the processor and the model.

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Polish", task="transcribe")

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Polish", task="transcribe")

### Sampling Rate

The Whisper model expects audio sampled at 16 kHz, whereas the Common Voice recordings are sampled at 48 kHz, so we downsample them using the `Audio` feature from `datasets`.

In [None]:
print(common_voice["train"][0])

In [None]:
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
print(common_voice["train"][0])
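
As a rough illustration of what the cast implies, the sketch below computes how many samples a clip holds before and after going from 48 kHz to 16 kHz. The helper name `resampled_length` is hypothetical, and the rounding is an assumption: the resampler that `datasets` actually uses may differ by a sample at the boundary.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate sample count after resampling (hypothetical helper)."""
    # The clip duration is preserved, so the sample count scales
    # with the ratio of the two sampling rates.
    return round(num_samples * target_sr / orig_sr)


# A 5-second clip recorded at 48 kHz
clip_48k = 5 * 48_000
clip_16k = resampled_length(clip_48k, orig_sr=48_000, target_sr=16_000)
print(clip_48k, clip_16k)  # 240000 80000
```

The resampled clip holds one third of the original samples, which is why the `array` field printed after the cast is shorter than the one printed before it.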

### Prepare Data for Model

Now we can write a function that prepares our data for the model:

We load and resample the audio data by calling batch["audio"]. Datasets performs any necessary resampling operations on the fly.

We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.

We encode the transcriptions to label ids through the use of the tokenizer.

Finally, we use the map method to apply this function to all our data.

In [None]:
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

In [None]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
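
To make the output of the feature extractor more concrete, the following sketch works out the expected shape of one `input_features` entry. The constants are assumptions based on the usual whisper-small configuration (30-second padding or truncation, a hop of 160 samples, 80 mel bins); they are not read from the notebook, so it is worth checking them against `feature_extractor` itself.

```python
SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects
CHUNK_SECONDS = 30      # assumed: clips are padded or truncated to 30 s
HOP_LENGTH = 160        # assumed: samples between spectrogram frames
N_MELS = 80             # assumed: mel bins for whisper-small


def input_feature_shape() -> tuple:
    """Expected (mel bins, frames) shape of one log-Mel spectrogram."""
    n_samples = SAMPLING_RATE * CHUNK_SECONDS  # samples per padded clip
    n_frames = n_samples // HOP_LENGTH         # one frame per hop
    return (N_MELS, n_frames)


print(input_feature_shape())  # (80, 3000)
```

Under these assumptions every example has the same feature shape regardless of the original clip length, while the `labels` lists vary in length with the transcription.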

### Saving Data

We save the prepared data to Google Drive so that we can load it in the training pipeline and train the model.

In [None]:
from google.colab import drive

drive.mount('/content/drive',force_remount=True)

In [None]:
from datasets import load_dataset, DatasetDict
import os

output_dir = "/content/drive/MyDrive/ML/common_voice"
os.makedirs(output_dir, exist_ok=True)

common_voice.save_to_disk(output_dir)
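
Before saving, it can help to estimate how much space the prepared data will take. The sketch below is a back-of-the-envelope calculation using the row counts printed earlier; it assumes each example stores an 80 x 3000 log-Mel array of 4-byte floats and ignores the labels and any file-format overhead, so the real size on disk may differ.

```python
ROWS = {"train": 24_833, "test": 8_294}  # row counts printed earlier
FEATURE_SHAPE = (80, 3000)               # assumed log-Mel shape per example
BYTES_PER_VALUE = 4                      # assumed float32 storage


def estimated_gigabytes(num_rows: int) -> float:
    """Rough size of the input features for num_rows examples, in GB."""
    values_per_example = FEATURE_SHAPE[0] * FEATURE_SHAPE[1]
    return num_rows * values_per_example * BYTES_PER_VALUE / 1e9


for split, rows in ROWS.items():
    print(f"{split}: ~{estimated_gigabytes(rows):.1f} GB")
```

If the estimate comes out in the tens of gigabytes, as it does under these assumptions, it is worth checking the available Drive space before calling `save_to_disk`.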