

# Testing Pipeline
This is the final testing pipeline for our best-performing model, Whisper Small.


## Preparing Environment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/ScalableMLDL/LAB2

/content/drive/.shortcut-targets-by-id/1_cIstNruukoo-q-nK4QpxdH0GRCHMaa6/LAB2


In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Dec  8 21:52:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [4]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


### Installing dependencies

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg
# quote version specifiers: an unquoted ">" is treated as a shell redirect
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.3.0"
!pip install jiwer

### Loading tokenizer and processor

In [6]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


The Whisper model outputs a sequence of _token ids_. The tokenizer maps each of these token ids to its corresponding text string.
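To make the id-to-text mapping concrete, here is a toy sketch. The vocabulary, the ids and the `toy_decode` helper are hypothetical; only the special-token names mirror Whisper's. The real `WhisperTokenizer` uses a byte-level BPE vocabulary and provides this behaviour through `decode`/`batch_decode` with `skip_special_tokens`.

```python
# Toy illustration of id -> text decoding. The vocabulary and ids are
# hypothetical; Whisper's real tokenizer has a ~50k-entry byte-level BPE vocabulary.
TOY_VOCAB = {
    0: "<|startoftranscript|>",
    1: "<|it|>",
    2: "<|transcribe|>",
    3: "<|notimestamps|>",
    4: "Ciao",
    5: " a",
    6: " tutti",
    7: "<|endoftext|>",
}
SPECIAL_IDS = {0, 1, 2, 3, 7}


def toy_decode(token_ids, skip_special_tokens=True):
    """Map each id to its text piece and concatenate the pieces."""
    pieces = []
    for token_id in token_ids:
        if skip_special_tokens and token_id in SPECIAL_IDS:
            continue  # drop language/task/control tokens
        pieces.append(TOY_VOCAB[token_id])
    return "".join(pieces)


ids = [0, 1, 2, 3, 4, 5, 6, 7]
print(toy_decode(ids))  # Ciao a tutti
print(toy_decode(ids, skip_special_tokens=False))
```

Skipping special tokens is what keeps the language, task and end-of-text markers out of the strings that are later scored.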

In [7]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Italian", task="transcribe")


The `WhisperProcessor` class
wraps a `WhisperFeatureExtractor` and a `WhisperTokenizer` into a single object,
which can be applied to the audio inputs and to the model predictions as required.
This way we only need to keep track of two objects in this notebook:
the `processor` and the `model`:

In [8]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Italian", task="transcribe")



### Define a Data Collator


The data collator for a sequence-to-sequence speech model is unique in the sense that it
treats the `input_features` and `labels` independently: the `input_features` must be
handled by the feature extractor and the `labels` by the tokenizer.

The `input_features` have already been padded to 30 s and converted to a log-Mel spectrogram
of fixed dimension by the feature extractor, so all we have to do is convert the `input_features`
to batched PyTorch tensors. We do this using the feature extractor's `.pad` method with `return_tensors="pt"`.
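The fixed dimension follows from simple arithmetic. The sketch below assumes Whisper's usual front-end settings (16 kHz audio, a hop of 160 samples, 80 Mel bins for whisper-small); these constants are assumptions here and should be checked against `processor.feature_extractor`.

```python
# Assumed Whisper front-end constants (check processor.feature_extractor
# for the values actually used by the checkpoint).
SAMPLING_RATE = 16_000  # samples per second
CHUNK_LENGTH_S = 30     # every input is padded/truncated to 30 s
HOP_LENGTH = 160        # samples between successive STFT frames (10 ms)
N_MELS = 80             # Mel bins for whisper-small

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S  # samples in one 30 s window
n_frames = n_samples // HOP_LENGTH          # spectrogram frames per window

# Each example's input_features is an (n_mels, n_frames) log-Mel matrix,
# so a batch stacks into a tensor of shape (batch_size, n_mels, n_frames).
feature_shape = (N_MELS, n_frames)
batch_size = 8  # matches per_device_eval_batch_size used for testing
print(n_samples, n_frames, (batch_size, *feature_shape))  # 480000 3000 (8, 80, 3000)
```

Because every example ends up with the same shape, "padding" the `input_features` amounts to stacking them into a single tensor.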

The `labels`, on the other hand, are unpadded. We first pad the sequences
to the maximum length in the batch using the tokenizer's `.pad` method. The padding tokens
are then replaced by `-100` so that these tokens are **not** taken into account when
computing the loss. Finally, if every label sequence starts with the BOS token, we strip it,
since it is prepended again when the decoder inputs are built from the labels during training.
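The label handling can be sketched in plain Python to show what the collator produces. The `collate_labels` helper and the `PAD_ID`/`BOS_ID` values are hypothetical stand-ins for the tokenizer's own ids; the logic mirrors the steps just described: pad, mask with `-100` through an attention mask, then drop a shared leading BOS.

```python
# Plain-Python sketch of the collator's label handling.
# PAD_ID and BOS_ID are hypothetical ids, not Whisper's real ones.
PAD_ID = 0
BOS_ID = 1
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def collate_labels(label_seqs):
    """Pad label sequences, mask padding with -100 and drop a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    labels = []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded = seq + [PAD_ID] * n_pad
        # attention mask: 1 for real tokens, 0 for padding
        mask = [1] * len(seq) + [0] * n_pad
        # keep real tokens, replace padded positions with -100
        labels.append([tok if m == 1 else IGNORE_INDEX for tok, m in zip(padded, mask)])
    # strip the first column only if every sequence starts with BOS
    if all(row[0] == BOS_ID for row in labels):
        labels = [row[1:] for row in labels]
    return labels


print(collate_labels([[1, 11, 12, 13], [1, 21]]))
# [[11, 12, 13], [21, -100, -100]]
```

Masking through the attention mask, rather than by comparing ids with `PAD_ID`, keeps real tokens that happen to share the padding id. This matters if, as we understand is the case for Whisper's tokenizer, the padding token is the same `<|endoftext|>` token that closes each label sequence.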

We can leverage the `WhisperProcessor` we defined earlier to perform both the 
feature extractor and the tokenizer operations:

In [9]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is prepended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch


In [10]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation Metrics


We'll use the word error rate (WER), the de facto metric for assessing
ASR systems. For more information, refer to the WER [docs](https://huggingface.co/metrics/wer).
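WER is the word-level edit distance between hypothesis and reference divided by the number of reference words, i.e. (S + D + I) / N for substitutions, deletions and insertions. The sketch below is an illustrative re-implementation assuming simple whitespace tokenization, not the `evaluate`/`jiwer` code; `word_errors` and `corpus_wer` are hypothetical helpers.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref_word
                curr[j - 1] + 1,     # insertion of hyp_word
                prev[j - 1] + cost,  # substitution (or match if cost == 0)
            )
        prev = curr
    return prev[-1]


def corpus_wer(predictions, references):
    """Total word errors divided by total reference words."""
    errors = sum(word_errors(r, p) for p, r in zip(predictions, references))
    n_words = sum(len(r.split()) for r in references)
    return errors / n_words


preds = ["il gatto dorme divano", "il gatto nero"]
refs = ["il gatto dorme sul divano", "il cane"]
print(100 * corpus_wer(preds, refs))  # 3 errors / 7 reference words
```

Like the `evaluate` metric, this aggregates errors over the whole corpus before dividing, rather than averaging per-sentence WERs, so long utterances weigh more than short ones.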

In [11]:
import evaluate

metric = evaluate.load("wer")


We then simply have to define a function that takes our model 
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the 
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:

In [12]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # decode ids to text, dropping special tokens such as the language and task tokens
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}


### Loading our trained model


In [13]:
from transformers import WhisperForConditionalGeneration

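# use_cache=False disables the decoder's key/value cache. The cache is incompatible with
# gradient checkpointing during training, but at evaluation time disabling it slows generation
model = WhisperForConditionalGeneration.from_pretrained("GIanlucaRub/whisper-small-it-3", use_cache=False)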



Override the generation arguments: no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)) and no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)):

In [14]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
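A simplified greedy decoding step shows what these two overrides control. In this sketch, which is an assumption for illustration and not the `transformers` implementation, `forced_decoder_ids` is a list of `(position, token_id)` pairs that fix the token at given steps, and `suppress_tokens` lists ids whose logits are set to minus infinity. The logits and ids are made up.

```python
import math


def greedy_step(logits, step, forced_decoder_ids=None, suppress_tokens=()):
    """Choose the next token id at `step` (simplified illustration)."""
    # forced_decoder_ids: list of (step, token_id) pairs that override the model
    forced = dict(forced_decoder_ids or [])
    if step in forced:
        return forced[step]
    # suppressed tokens can never win the argmax
    adjusted = list(logits)
    for token_id in suppress_tokens:
        adjusted[token_id] = -math.inf
    return max(range(len(adjusted)), key=adjusted.__getitem__)


logits = [0.1, 2.0, 0.5, 1.5]  # made-up scores for a 4-token vocabulary
print(greedy_step(logits, step=1))                               # 1
print(greedy_step(logits, step=1, forced_decoder_ids=[(1, 3)]))  # 3
print(greedy_step(logits, step=1, suppress_tokens=[1]))          # 3
```

With both settings cleared, as in the cell above, every step is an unconstrained choice, so the fine-tuned model is left to predict the language and task tokens itself.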

## Testing

We run the test with the trainer's `evaluate` method, which generates a transcription for every test example and computes the WER on the decoded text.

### Define the Training Configuration


In [15]:
from transformers import Seq2SeqTrainingArguments

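training_args = Seq2SeqTrainingArguments(
    # training-only arguments (learning rate, warmup, steps, ...) are left over
    # from the training run and do not affect trainer.evaluate()
    num_train_epochs=1,
    output_dir="./whisper-small-it-3",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,  # overrides num_train_epochs
    gradient_checkpointing=True,
    fp16=True,  # mixed precision (AMP) is also used during evaluation
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,  # test batch size
    predict_with_generate=True,  # decode with generate() so the WER is computed on text
    generation_max_length=225,  # max_length passed to generate()
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,  # clones the Hub repo into output_dir when the trainer is created
)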


### Load the test dataset

In [16]:
from datasets import DatasetDict

# load the preprocessed test split previously saved with save_to_disk
common_voice_test = DatasetDict.load_from_disk("common_voice_test")

### Instantiating the trainer

In [17]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    eval_dataset=common_voice_test["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Cloning https://huggingface.co/GIanlucaRub/whisper-small-it-3 into local empty directory.



max_steps is given, it will override any value given in num_train_epochs
Using cuda_amp half precision backend


### Computing the test WER

In [18]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 13503
  Batch size = 8


{'eval_loss': 0.2717597484588623,
 'eval_wer': 16.29595629731072,
 'eval_runtime': 12194.0055,
 'eval_samples_per_second': 1.107,
 'eval_steps_per_second': 0.138}
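The throughput figures can be cross-checked from the raw numbers. This sketch only re-derives values printed above, assuming a single GPU so that the number of evaluation steps equals the number of batches of 8.

```python
import math

# Values copied from the evaluation output above
num_examples = 13503
eval_batch_size = 8
eval_runtime_s = 12194.0055
eval_wer = 16.29595629731072  # already multiplied by 100 in compute_metrics

num_batches = math.ceil(num_examples / eval_batch_size)  # evaluation steps
samples_per_s = num_examples / eval_runtime_s
steps_per_s = num_batches / eval_runtime_s
runtime_h = eval_runtime_s / 3600

print(f"{num_batches} batches, {samples_per_s:.3f} samples/s, "
      f"{steps_per_s:.3f} steps/s, {runtime_h:.2f} h, WER {eval_wer:.2f}%")
# 1688 batches, 1.107 samples/s, 0.138 steps/s, 3.39 h, WER 16.30%
```

A test WER of about 16.3 means that, across the 13,503 test examples, roughly 16 word errors (substitutions, deletions or insertions) occur per 100 reference words; the full pass took close to 3.4 hours on the T4.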