# Feature Pipeline for Swedish Whisper Finetuning
This notebook downloads and preprocesses the Swedish subset of the Common Voice dataset for fine-tuning the Whisper transformer. In addition, 6400 samples from the [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/) dataset are preprocessed and can optionally be added to the training pipeline.

The notebook is designed to run on Google Colab, whose fast internet connection helps when downloading and uploading all of the data involved. The code assumes that enough storage (approximately 35 GB) is available on Google Drive to host the data.
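Since roughly 35 GB of storage is assumed, it can be worth checking the free space before starting. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the default path are illustrative and not part of the original notebook. Treat the result as a rough check, since the figure reported for a mounted drive may not match the actual quota.

```python
import shutil

def has_enough_space(path=".", required_gb=35.0):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

# Hypothetical usage: on Colab, pass the mounted Drive path instead of "."
if not has_enough_space(".", required_gb=35.0):
    print("Warning: less than 35 GB free; the pipeline may run out of space.")
```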

In [None]:
# Install dependencies
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg

# Quote version specifiers so the shell does not treat ">" as an output redirect
!pip install "datasets>=2.6.1"
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install "evaluate>=0.30"
!pip install jiwer
!pip install gradio
!pip install hopsworks

## Connect to different platforms

Mount Google Drive, then log in to Hugging Face. Your Hugging Face authentication token can be found [here](https://huggingface.co/settings/tokens):

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Hopsworks is not needed for this notebook
#import hopsworks
#project = hopsworks.login()

## Load and preprocess dataset

In [None]:
from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "sv-SE", split="test", use_auth_token=True)

Add 6400 extra training samples from [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/). Note that the Common Voice training split loaded above has roughly 12000 samples.

In [4]:
# Download and unzip meta data
!mkdir -p extra_data/meta-files
!wget "https://www.nb.no/sbfil/talegjenkjenning/16kHz_2020/se_2020/ADB_SWE_0467.tar.gz" -O extra_data/meta-files/meta.tar.gz
!tar -xf extra_data/meta-files/meta.tar.gz --directory extra_data/meta-files/
!rm extra_data/meta-files/meta.tar.gz

# Download and unzip audio files
!mkdir -p extra_data/audio-files
!wget https://www.nb.no/sbfil/talegjenkjenning/16kHz_2020/se_2020/lydfiler_16_1.tar.gz -O extra_data/audio-files/audio.tar.gz
!tar -xf extra_data/audio-files/audio.tar.gz --directory extra_data/audio-files/
!rm extra_data/audio-files/audio.tar.gz

--2022-12-04 07:30:59--  https://www.nb.no/sbfil/talegjenkjenning/16kHz_2020/se_2020/ADB_SWE_0467.tar.gz
Resolving www.nb.no (www.nb.no)... 158.39.129.53
Connecting to www.nb.no (www.nb.no)|158.39.129.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15352363 (15M) [application/octet-stream]
Saving to: ‘extra_data/meta-files/meta.tar.gz’


2022-12-04 07:31:03 (5.09 MB/s) - ‘extra_data/meta-files/meta.tar.gz’ saved [15352363/15352363]

--2022-12-04 07:31:05--  https://www.nb.no/sbfil/talegjenkjenning/16kHz_2020/se_2020/lydfiler_16_1.tar.gz
Resolving www.nb.no (www.nb.no)... 158.39.129.53
Connecting to www.nb.no (www.nb.no)|158.39.129.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32463480213 (30G) [application/octet-stream]
Saving to: ‘extra_data/audio-files/audio.tar.gz’


2022-12-04 07:58:37 (7.92 MB/s) - Connection closed at byte 13705092120. Retrying.

--2022-12-04 07:58:38--  (try: 2)  https://www.nb.no/sbfil/talegjenkjenning

In [2]:
import os
import json
from datasets import Audio, Dataset, concatenate_datasets

def is_good_sample(fname, text):
    # Keep only utterances with 21 to 39 space-separated words; fname is currently unused
    n_words = len(text.split(" "))
    return 20 < n_words < 40

samples = []
for fname_meta in os.listdir(os.path.join("extra_data", "meta-files")):
    group = fname_meta.split("_")[0]
    fpath_meta = os.path.join("extra_data", "meta-files", fname_meta)
    with open(fpath_meta, "r", encoding="utf-8") as f:
        meta = json.load(f)
    
    if "val_recordings" not in meta:
        continue

    for full_sample in meta["val_recordings"]:
        if is_good_sample(full_sample["file"], full_sample["text"]):
            sample_fname = f"{group}_{full_sample['file'].replace('.wav', '-1.wav')}"
            full_path = os.path.join("extra_data", "audio-files", "se", group, sample_fname)
            if os.path.isfile(full_path):
                samples.append({
                    "fpath": full_path,
                    "text": full_sample["text"],
                    "group": group,
                })

print("Number of samples in additional dataset (total):", len(samples))

# Select only 6400 samples, which is a ~50% increase to the common_voice train set
samples = samples[:6400]
file_paths = [sample["fpath"] for sample in samples]
sentences = [sample["text"] for sample in samples]
print("Number of samples kept: ", len(file_paths))

# Process the dataset in slices, as Colab crashes otherwise
NST_slices = []
slice_size = 800
for start in range(0, len(file_paths), slice_size):
    NST_slices.append(
        Dataset.from_dict({
            "audio": file_paths[start:start + slice_size],
            "sentence": sentences[start:start + slice_size],
        }).cast_column("audio", Audio(sampling_rate=16000))
    )

Number of samples in additional dataset (total): 10152
Number of samples kept:  6400
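A sketch of the slicing idea on toy data, written so that the final slice may be shorter than the others when the number of samples is not a multiple of the slice size. The helper `iter_slices` and the toy file names are illustrative assumptions, not part of the original notebook.

```python
def iter_slices(items, slice_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), slice_size):
        yield items[start:start + slice_size]

# Toy data standing in for the list of NST file paths
toy_paths = [f"sample_{i}.wav" for i in range(10)]
slice_lengths = [len(chunk) for chunk in iter_slices(toy_paths, 4)]
print(slice_lengths)  # [4, 4, 2]
```

With 6400 samples and a slice size of 800, the same helper yields eight slices of 800 items each.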


In [3]:
# Define functions to preprocess
from datasets import Audio
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Swedish", task="transcribe")

# Preprocess
def prepare_dataset(batch):
    # the audio is decoded (and resampled to 16 kHz via cast_column) when this column is accessed
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

In [4]:
NST_preprocessed_slices = []
for sl in NST_slices:
  NST_preprocessed_slices.append(sl.map(prepare_dataset, remove_columns=sl.column_names, num_proc=1))
NST = concatenate_datasets(NST_preprocessed_slices)
NST.save_to_disk("/content/gdrive/My Drive/SML/lab2/NST")

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

  0%|          | 0/800 [00:00<?, ?ex/s]

In [None]:
# Remove columns that we do not need
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

# Give the audio column a consistent sampling rate of 16000 Hz,
# which is what Whisper expects. The data is resampled on the fly when accessed
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)

# Save both on the local machine and on Google Drive
common_voice.save_to_disk("common_voice")
common_voice.save_to_disk("/content/gdrive/My Drive/SML/lab2/common_voice")

In [None]:
import os
print(os.listdir("./common_voice/"))
print(os.listdir("./common_voice/train"))
print(os.listdir("./common_voice/test"))

['train', 'dataset_dict.json', 'test']
['dataset_info.json', 'dataset.arrow', 'state.json']
['dataset_info.json', 'dataset.arrow', 'state.json']
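The listing above can also be checked programmatically before relying on the saved copy. The following is a minimal sketch that only looks for the JSON files shown in the listing; the helper name `missing_dataset_files` is an assumption for illustration, and the Arrow data file is deliberately not checked because its name may differ between `datasets` versions.

```python
import os

def missing_dataset_files(root, splits=("train", "test")):
    """Return the expected JSON files that are absent from a saved DatasetDict directory."""
    expected = [os.path.join(root, "dataset_dict.json")]
    for split in splits:
        expected.append(os.path.join(root, split, "dataset_info.json"))
        expected.append(os.path.join(root, split, "state.json"))
    return [path for path in expected if not os.path.isfile(path)]

# An empty list means every expected file was found
print(missing_dataset_files("common_voice"))
```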


To keep the Colab session active and avoid an idle timeout, paste the following into the browser's developer console:

```javascript
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton, 60000);
```