# Master Thesis: First Version of a Model for Spontaneous Speech

**Author**: Karin Thommen

**Date**: April 2023


---

**Content of the Notebook**: Fine-tuning and training of the OpenAI Whisper ASR model

---
**References**:
- https://huggingface.co/blog/fine-tune-whisper
- https://github.com/vasistalodagala/whisper-finetune

## Step 1: Import and Setup

In [None]:
%%capture
!pip install datasets
!pip install transformers==4.28.0
!pip install librosa
!pip install "evaluate>=0.3.0"  # quoted so the shell does not treat ">" as a redirect
!pip install jiwer
!pip install gradio
!pip install audio-metadata
!pip install "dill<0.3.5"
!pip install git-lfs

In [None]:
# standard library
import json
import os
import pickle
import random
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

# third-party
import audio_metadata
import dill
import IPython.display as ipd
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import (
    Audio,
    ClassLabel,
    Dataset,
    DatasetDict,
    Sequence,
    list_datasets,
    load_dataset,
    load_from_disk,
    load_metric,
)
from datasets.fingerprint import Hasher
from huggingface_hub import notebook_login
from IPython.display import HTML, display
from transformers import (
    WhisperFeatureExtractor,
    WhisperProcessor,
    WhisperTokenizer,
    WhisperTokenizerFast,
)

# Colab only
from google.colab import drive

In [None]:
transformers.__version__

'4.28.0'

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

## Step 2: Load Data

In [None]:
# Build connection to data folder on GDrive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# log in to the Hugging Face account to access the data
notebook_login()


In [None]:
# load the dataset from the Hugging Face Hub (uploaded beforehand from a local machine)
dataset = load_dataset("karinthommen/schawinski")


In [None]:
# check if data loading worked
dataset["train"][0]

{'audio': {'path': 'Badran_Schawinski_13-05-2013_SPK0-Badran_Schawinski_13-05-2013-0004.wav',
  'array': array([ 0.00234985,  0.00328064,  0.01695251, ..., -0.00216675,
         -0.00085449,  0.00071716]),
  'sampling_rate': 44100},
 'transcription': '[music] [laughter] tu du maal ärklääre wär du bisch',
 'duration': 2.05}

In [None]:
dataset.shape

{'train': (3009, 3), 'test': (941, 3), 'validation': (753, 3)}

In [None]:
def preprocess(batch):
  # remove non-speech annotation tags (raw strings avoid invalid escape sequences)
  for tag in (r'\[music\]', r'\[breath_mouth_noise\]', r'\[laughter\]', r'\[speech-in-speech\]'):
    batch["transcription"] = re.sub(tag, '', batch["transcription"])
  # remove backslashes and asterisks
  batch["transcription"] = re.sub(r"\\", '', batch["transcription"])
  batch["transcription"] = re.sub(r'\*', '', batch["transcription"])
  return batch

In [None]:
dataset = dataset.filter(lambda example: not example["transcription"].startswith("[speech-in-speech]"))


In [None]:
dataset = dataset.map(preprocess, num_proc=1)


In [None]:
# raw string so that the backslashes reach the regex engine unchanged
chars_to_remove_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\'\$]'

def remove_special_characters(batch):
    batch["transcription"] = re.sub(chars_to_remove_regex, '', batch["transcription"]).lower()
    batch["transcription"] = batch["transcription"].strip()
    return batch

In [None]:
dataset = dataset.map(remove_special_characters, num_proc=1)


In [None]:
# check if preprocessing worked
dataset["train"][0]

{'audio': {'path': 'Badran_Schawinski_13-05-2013_SPK0-Badran_Schawinski_13-05-2013-0004.wav',
  'array': array([ 0.00234985,  0.00328064,  0.01695251, ..., -0.00216675,
         -0.00085449,  0.00071716]),
  'sampling_rate': 44100},
 'transcription': 'tu du maal ärklääre wär du bisch',
 'duration': 2.05}
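
The two cleaning steps can be tried on a single string without loading the dataset. The following is a minimal standalone sketch: `clean_transcription` is a hypothetical helper, and its punctuation class is an assumed subset of `chars_to_remove_regex` (ASCII punctuation plus curly quotes), merged with the backslash and asterisk removal from `preprocess`.

```python
import re

# Annotation tags removed by the notebook's preprocess step
TAG_PATTERN = re.compile(r"\[(?:music|breath_mouth_noise|laughter|speech-in-speech)\]")
# Assumed subset of the punctuation removed in the notebook, plus backslash and asterisk
PUNCT_PATTERN = re.compile(r"[,?.!\-;:\"“%‘”'$\\*]")


def clean_transcription(text: str) -> str:
    """Hypothetical helper: strip tags and punctuation, lowercase, trim."""
    text = TAG_PATTERN.sub("", text)
    text = PUNCT_PATTERN.sub("", text)
    return text.lower().strip()


raw = "[music] [laughter] tu du maal ärklääre wär du bisch"
cleaned = clean_transcription(raw)
print(cleaned)
```

Like the notebook code, this sketch only trims whitespace at the ends of the string; a tag removed from the middle of a sentence would leave a double space behind.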

In [None]:
dataset.shape

{'train': (1613, 3), 'test': (502, 3), 'validation': (410, 3)}
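
Comparing the two `dataset.shape` outputs shows how much data the `[speech-in-speech]` filter removes. The short sketch below only repeats the arithmetic on the numbers printed above; the variable names are illustrative.

```python
# Split sizes copied from the dataset.shape outputs before and after filtering
before = {"train": 3009, "test": 941, "validation": 753}
after = {"train": 1613, "test": 502, "validation": 410}

# Share of examples that survive the filter, per split
retained = {split: after[split] / before[split] for split in before}

for split, share in retained.items():
    print(f"{split}: kept {after[split]} of {before[split]} ({share:.1%})")
```

In every split a little more than half of the utterances remain, so the filter does not noticeably change the ratio between the splits.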

In [None]:
# load the tokenizer from the pretrained Whisper checkpoint
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", task="transcribe")


In [None]:
# load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")


In [None]:
# downsample dataset to a sampling rate of 16kHz for the model
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
# Check if audio loading worked with a random audio and sentence
rand_int = random.randint(0, len(dataset["train"])-1)
print(dataset["train"]["transcription"][rand_int])
ipd.Audio(data=dataset["train"][rand_int]["audio"]["array"], autoplay=True, rate=16000)

und ganz e gueti wuche


In [None]:
# Check sentence, input array shape and sampling rate
rand_int = random.randint(0, len(dataset["train"])-1)
sample = dataset["train"][rand_int]  # fetch and decode the example only once

print("Target text:", sample["transcription"])
print("Input array shape:", sample["audio"]["array"].shape)
print("Sampling rate:", sample["audio"]["sampling_rate"])

Target text: jez chasch nomale jez isch aa
Input array shape: (30400,)
Sampling rate: 16000
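
The printed array shape can be checked against the sampling rate: the number of samples divided by the rate gives the clip length in seconds. The helpers below are an illustrative sketch, and the resampled length is only an approximation because the exact sample count depends on the resampler.

```python
TARGET_RATE = 16_000  # Hz, the rate set via cast_column


def duration_seconds(num_samples: int, sampling_rate: int = TARGET_RATE) -> float:
    """Length of a clip in seconds."""
    return num_samples / sampling_rate


def resampled_length(num_samples: int, source_rate: int, target_rate: int = TARGET_RATE) -> int:
    """Approximate number of samples after resampling to target_rate."""
    return round(num_samples * target_rate / source_rate)


# The example above has 30400 samples at 16 kHz
print(duration_seconds(30_400))
# One second of 44.1 kHz audio becomes one second of 16 kHz audio
print(resampled_length(44_100, source_rate=44_100))
```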


In [None]:
# show a sentence decoded with and without the special tokens that Whisper expects
input_str = dataset["train"][0]["transcription"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")

Input:                 tu du maal ärklääre wär du bisch
Decoded w/ special:    <|startoftranscript|><|transcribe|><|notimestamps|>tu du maal ärklääre wär du bisch<|endoftext|>
Decoded w/out special: tu du maal ärklääre wär du bisch
Are equal:             True
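
The output shows that the tokenizer wraps each transcription in a fixed prefix and suffix of special tokens. The sketch below reproduces this at the string level only; the real tokenizer works on token ids, and `wrap` and `strip_special` are hypothetical helpers for illustration.

```python
import re

# Prefix and suffix copied from the decoded output above
PREFIX = "<|startoftranscript|><|transcribe|><|notimestamps|>"
SUFFIX = "<|endoftext|>"
# Matches any token of the form <|...|>
SPECIAL_TOKEN = re.compile(r"<\|[^|]+\|>")


def wrap(text: str) -> str:
    """Add the special tokens around a transcription."""
    return PREFIX + text + SUFFIX


def strip_special(text: str) -> str:
    """Remove every <|...|> token, similar in effect to skip_special_tokens=True."""
    return SPECIAL_TOKEN.sub("", text)


sentence = "tu du maal ärklääre wär du bisch"
wrapped = wrap(sentence)
print(wrapped)
print(strip_special(wrapped) == sentence)
```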


In [None]:
# load processor from Whisper Processor
processor = WhisperProcessor.from_pretrained("openai/whisper-small", task="transcribe")

In [None]:
def prepare_dataset(batch):
    # the audio column was already resampled from 44.1 kHz to 16 kHz via cast_column
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch
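
The feature extractor pads or truncates every clip to a fixed length before computing the log-Mel spectrogram, so all `input_features` have the same shape regardless of the clip duration. The sketch below derives that shape from what I understand to be the default `whisper-small` settings (30 s chunks, hop length 160, 80 Mel bins); these constants are assumptions and can be compared against the attributes of `feature_extractor`.

```python
SAMPLING_RATE = 16_000  # Hz
CHUNK_SECONDS = 30      # assumed: every clip is padded or truncated to 30 s
HOP_LENGTH = 160        # assumed: samples between frames (10 ms at 16 kHz)
N_MELS = 80             # assumed: number of Mel bins for whisper-small

n_samples = SAMPLING_RATE * CHUNK_SECONDS  # samples per padded clip
n_frames = n_samples // HOP_LENGTH         # spectrogram frames per clip
feature_shape = (N_MELS, n_frames)

print(feature_shape)
```

A 2 s utterance therefore occupies the same amount of memory as a 30 s one, which matters when choosing the batch size.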

In [None]:
dataset = dataset.map(prepare_dataset, num_proc=2)


In [None]:
dataset.push_to_hub("karinthommen/schawinski-features-no-vocab", private=True)

## Step 3: Fine-Tune & Train Model

In [None]:
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
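
The label handling in the collator can be illustrated without PyTorch. The sketch below works on plain Python lists and uses made-up token ids; it shows the same two ideas (fill padded positions with -100 so the loss ignores them, and drop a leading BOS token shared by all sequences), but it is not the tensor implementation used above.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask(label_seqs, bos_id):
    """Pad label sequences to equal length with IGNORE_INDEX and strip a shared BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [list(seq) + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Drop the first column only if every sequence starts with the BOS token
    if all(row[0] == bos_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


BOS, EOS = 1, 2  # hypothetical ids, for illustration only
batch = pad_and_mask([[BOS, 10, 11, EOS], [BOS, 12, EOS]], bos_id=BOS)
print(batch)
```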

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

In [None]:
import evaluate

metric = evaluate.load("wer")


In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
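
The word error rate counts substitutions, deletions and insertions at the word level and divides them by the number of reference words. The function below is a minimal pure-Python sketch for a single sentence pair with a non-empty reference; it is not the `evaluate` implementation, which aggregates over the whole evaluation set, and the hypothesis string is invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "tu du maal ärklääre wär du bisch"
hypothesis = "tu du mal erkläre wer du bisch"  # invented model output
score = word_error_rate(reference, hypothesis)
print(f"WER: {100 * score:.2f}%")
```

Three of the seven reference words differ here, so spelling variation alone already produces a high WER. This is relevant for dialect transcriptions without a standardized orthography.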

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)


In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./spontaneous-whisper-v1",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    hub_private_repo=True,
    push_to_hub=True,
)
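
With `warmup_steps=500` and `max_steps=4000`, the learning rate first rises and then decays. The sketch below assumes the default linear scheduler (the arguments above do not set `lr_scheduler_type`), and `linear_warmup_lr` is a hypothetical helper that mirrors that schedule, not a function from `transformers`.

```python
BASE_LR = 1e-5
WARMUP_STEPS = 500
MAX_STEPS = 4000


def linear_warmup_lr(step: int) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / max(1, MAX_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 2250, 4000):
    print(step, linear_warmup_lr(step))
```

The peak rate of 1e-5 is reached at step 500 only; at the first evaluation (step 1000) it has already started to decay.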

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

Cloning https://huggingface.co/karinthommen/spontaneous-whisper-v1 into local empty directory.


In [None]:
processor.save_pretrained(training_args.output_dir)

In [None]:
trainer.train()

`use_cache = True` is incompatible with gradient checkpointing. Setting `use_cache = False`...


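| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.0226 | 1.148224 | 63.173184 |
| 2000 | 0.0017 | 1.252167 | 52.335196 |
| 3000 | 0.0004 | 1.321579 | 51.977654 |
| 4000 | 0.0003 | 1.339919 | 51.441341 |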
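
Because `load_best_model_at_end=True` is combined with `metric_for_best_model="wer"` and `greater_is_better=False`, the checkpoint with the lowest WER is the one that is kept. The sketch below repeats that selection on the logged values; the tuple layout is chosen for illustration only.

```python
# (step, training loss, validation loss, WER in %) copied from the training log
rows = [
    (1000, 0.0226, 1.148224, 63.173184),
    (2000, 0.0017, 1.252167, 52.335196),
    (3000, 0.0004, 1.321579, 51.977654),
    (4000, 0.0003, 1.339919, 51.441341),
]

# Lowest WER wins because greater_is_better=False
best_step, _, best_val_loss, best_wer = min(rows, key=lambda row: row[3])
print(f"best checkpoint: step {best_step}, WER {best_wer:.2f}%")

# Check whether the validation loss increases at every evaluation
val_losses = [row[2] for row in rows]
val_loss_rising = all(a < b for a, b in zip(val_losses, val_losses[1:]))
print("validation loss rising:", val_loss_rising)
```

The WER still improves slightly up to step 4000, while the validation loss rises at every evaluation and the training loss approaches zero.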


TrainOutput(global_step=4000, training_loss=0.18595466940372718, metrics={'train_runtime': 10926.1791, 'train_samples_per_second': 5.857, 'train_steps_per_second': 0.366, 'total_flos': 1.843570112864256e+19, 'train_loss': 0.18595466940372718, 'epoch': 39.6})
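
The reported epoch count and throughput follow from the training arguments and the size of the filtered training split. The sketch below redoes the arithmetic under two assumptions: training ran on a single device, and the last, smaller batch of each epoch was not dropped.

```python
import math

TRAIN_EXAMPLES = 1613      # size of the filtered training split
BATCH_SIZE = 16            # per_device_train_batch_size
GRAD_ACCUM = 1             # gradient_accumulation_steps
MAX_STEPS = 4000
TRAIN_RUNTIME = 10926.1791 # seconds, from TrainOutput

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / (BATCH_SIZE * GRAD_ACCUM))
epochs = MAX_STEPS / steps_per_epoch
samples_seen = MAX_STEPS * BATCH_SIZE * GRAD_ACCUM
samples_per_second = samples_seen / TRAIN_RUNTIME

print(f"steps per epoch: {steps_per_epoch}")
print(f"epochs: {epochs:.1f}")
print(f"samples per second: {samples_per_second:.3f}")
```

Each of the 1613 training utterances is therefore seen about 40 times in roughly three hours of training.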