# Automatic speech recognition tutorial

- This notebook studies the differences between speech-to-text models.

List of advanced models:
- Whisper
- Wav2Vec2.0
- HuBERT

## Evaluation metrics
- Word Error Rate (WER)
- Latency
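
Both metrics can be sketched with the standard library. The block below is a minimal illustration, not the implementation behind `evaluate.load('wer')` used later: WER is the word-level edit distance divided by the number of reference words, and latency is the wall-clock time of one call. The helper names `word_error_rate` and `timed` are made up for this sketch.

```python
import time

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# one deleted word ("would") and one substituted word ("bill" -> "bail")
score, seconds = timed(word_error_rate, "i would like to pay a bill", "i like to pay a bail")
print(f"WER={score:.3f}, latency={seconds * 1e3:.3f} ms")
```

For a model, `fn` would be the transcription call, and latency is usually averaged over many utterances.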


# Fine-tuning on the MInDS-14 dataset
https://huggingface.co/docs/transformers/main/tasks/asr
- MInDS-14 is a training and evaluation resource for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.
- While the dataset contains a lot of useful information, like lang_id and english_transcription, this notebook focuses on the audio and transcription columns.
- We will compare the performance of two models: Wav2Vec2.0 and Whisper.

## TODO
- Load data
- Preprocess data
- Load the pretrained models:
    - Wav2Vec2.0
    - Whisper
- Fine-tune the models
- Evaluate the models
- Compare the performance of the models
- Save the models
- Load the models
- Transcribe the audio files
- Compare the transcriptions
- Conclusion


In [121]:
from datasets import load_dataset, Audio, load_from_disk

# First run: download the first 100 en-US examples, then save them to disk.
# minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]")
# minds.save_to_disk("minds14")
# Later runs: reload the saved copy.
minds = load_from_disk("minds14")

In [122]:
# split train and test set
minds = minds.train_test_split(test_size=0.2)
minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 80
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 20
    })
})

In [123]:
from IPython.display import Audio as IPAudio

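# index the row first so only one audio file is decoded, and read the rate from the data
sample = minds["train"][0]
print(f'transcription: {sample["transcription"]}')
IPAudio(data=sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])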

transcription: I'd like to make a payment


In [124]:
minds["train"][0]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~PAY_BILL/602bae7bbb1e6d0fbce92264.wav',
 'audio': {'path': '602bae7bbb1e6d0fbce92264.wav',
  'array': array([-0.00024414,  0.        , -0.00024414, ..., -0.00024414,
          0.        ,  0.        ]),
  'sampling_rate': 8000},
 'transcription': "I'd like to make a payment",
 'english_transcription': "I'd like to make a payment",
 'intent_class': 13,
 'lang_id': 4}

In [125]:
# how many examples does each intent class have in each split?
from collections import Counter

# reading a single column avoids decoding the audio of every row
train_classes = minds['train']['intent_class']
test_classes = minds['test']['intent_class']

# the split above is random, so these counts change from run to run
print(f'train: {Counter(train_classes)}')
print(f'test: {Counter(test_classes)}')

train: Counter({11: 34, 4: 33, 13: 13})
test: Counter({11: 8, 4: 8, 13: 4})
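
The class counts differ between runs because the split is shuffled without a fixed seed. With `datasets`, passing `seed=` to `train_test_split` makes the split repeatable. The standard-library sketch below shows the same idea. `split_indices` is a hypothetical helper, and the label list only reproduces the class totals printed above.

```python
import random
from collections import Counter

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed; the first n_test shuffled indices form the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# stand-in for the intent_class column: 42 + 41 + 17 = 100 rows
labels = [11] * 42 + [4] * 41 + [13] * 17
train_idx, test_idx = split_indices(len(labels))

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Running the cell twice prints the same counts, which makes later WER numbers comparable across runs.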


In [126]:
# load and preprocess the data
# drop the columns we don't need: 'intent_class', 'lang_id' and 'english_transcription'
minds = minds.remove_columns(['intent_class', 'lang_id', 'english_transcription'])
train_set = minds['train']
test_set = minds['test']

In [127]:
test_set[0]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~JOINT_ACCOUNT/602ba3fe963e11ccd901cc7f.wav',
 'audio': {'path': '602ba3fe963e11ccd901cc7f.wav',
  'array': array([0.        , 0.        , 0.        , ..., 0.00024414, 0.00024414,
         0.00024414]),
  'sampling_rate': 8000},
 'transcription': 'I would like to set up a joint account can I do that in the app'}

In [128]:
# Show random elements
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # sample distinct row indices without replacement
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [20]:
minds['train']

Dataset({
    features: ['path', 'audio', 'transcription'],
    num_rows: 80
})

In [66]:
show_random_elements(minds["train"].remove_columns(['path', 'audio']))

Unnamed: 0,transcription
0,could you please tell me how to set up a joint account
1,how do I set up a joint account
2,I'd like to pay a bail
3,can you see my account balance
4,what is my checking account balance
5,can you tell me what my current account balances
6,yes or no I'm going to because I would like to set up a joint account with my wife
7,hello I'd like to set up a joint account was my partner how do I do that
8,how do I start a joint account
9,how do I set up a joint account


In [75]:
# ignore special characters for speech
import re
# raw string, so the backslashes are passed to the regex engine unchanged
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\…\–\—\(\)\[\]\{\}\<\>\=\+\@\#\$\&\*\^\~\_\`\’\/\’\‘\|]'

def remove_special_characters(batch, column_names='transcription'):
    batch[column_names] = re.sub(chars_to_ignore_regex, '', batch[column_names]).lower() + " "
    return batch

train_set = train_set.map(remove_special_characters, num_proc=4)
test_set = test_set.map(remove_special_characters, num_proc=4)

In [76]:
show_random_elements(train_set.remove_columns(['path', 'audio']))

Unnamed: 0,transcription
0,can i have an account with my sister i want to set up a joint account
1,hi i want you to tell me my account balance until i'm just see it and i need to be able to reconcile a my account
2,i would like to be showing my account balance please
3,like to see my account balance
4,show me my account balance please
5,hello i'm calling about my account balance
6,how much information about signing up for a joint account
7,so you spent the money i'd like to see my new account balance
8,yes i'd like to set up a joint account i'm allowed to anyone
9,what is my checking account balance
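
The effect of the cleaning step can be checked on a single string without the dataset. This is a small sketch with a shortened character class. Unlike the cell above, `normalize` (a name chosen here) also collapses repeated spaces and does not append a trailing space.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
PUNCTUATION = re.compile(r"[,?.!\-;:\"“”‘’%…–—()\[\]{}<>=+@#$&*^~_`/|]")

def normalize(text: str) -> str:
    """Drop punctuation, lowercase, and collapse repeated whitespace."""
    text = PUNCTUATION.sub("", text).lower()
    return " ".join(text.split())

print(normalize("Hello, I'd like to set up a joint account -- how do I do that?"))
# → hello i'd like to set up a joint account how do i do that
```

The straight apostrophe is kept on purpose, so contractions such as "i'd" survive, as in the table above.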


In [77]:
# build vocab
def extract_all_chars(batch, text="transcription"):
    all_text = " ".join(batch[text])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

vocab_train = train_set.map(extract_all_chars, batched=True, batch_size=1, keep_in_memory=True)
vocab_test = test_set.map(extract_all_chars, batched=True, batch_size=1, keep_in_memory=True)
# vocab_test = common_voice_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, num_proc=4)


In [82]:
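# The intersection keeps only characters that occur in both splits, so characters
# seen in a single split are dropped. A CTC vocabulary normally uses the union:
# vocab_dict = sorted(list(set(sum(vocab_train['vocab'], []))| set(sum(vocab_test['vocab'], []))))
vocab_dict = sorted(list(set(sum(vocab_train['vocab'], [])).intersection(set(sum(vocab_test['vocab'], [])))))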
vocab_dict = {v: k for k, v in enumerate(vocab_dict)}
vocab_dict

{' ': 0,
 "'": 1,
 'a': 2,
 'b': 3,
 'c': 4,
 'd': 5,
 'e': 6,
 'f': 7,
 'g': 8,
 'h': 9,
 'i': 10,
 'j': 11,
 'k': 12,
 'l': 13,
 'm': 14,
 'n': 15,
 'o': 16,
 'p': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'y': 24}
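
The difference between the union and the intersection of the two character sets shows up on a toy example. The sentences below are invented, not taken from MInDS-14. The `"|"`, `"[UNK]"` and `"[PAD]"` entries follow a common convention for CTC tokenizers and are an assumption here, since the notebook later uses the vocabulary shipped with the pretrained processor.

```python
def char_vocab(sentences):
    """Set of all characters that occur in a list of sentences."""
    return set("".join(sentences))

# hypothetical transcriptions
train_text = ["pay my bill", "check the balance"]
test_text = ["joint account", "quick balance"]

train_chars, test_chars = char_vocab(train_text), char_vocab(test_text)
union = sorted(train_chars | test_chars)
common = sorted(train_chars & test_chars)

vocab_dict = {ch: idx for idx, ch in enumerate(union)}
# make the word boundary visible and add tokens for unknown characters and padding
vocab_dict["|"] = vocab_dict.pop(" ")
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

print("only in one split:", sorted(set(union) - set(common)))
print(vocab_dict)
```

Any character listed under "only in one split" would be missing from an intersection-based vocabulary.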

In [42]:
len(vocab_train['vocab'])

80

In [43]:
len(train_set)

80

In [129]:
# preprocess the data
# the processor bundles a feature extractor (audio -> input values) and a tokenizer (text -> label ids)
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")



In [130]:
# The MInDS-14 dataset has a sampling rate of 8 kHz (8000 Hz; you can find this information in its dataset card),
# which means you’ll need to resample the dataset to 16 kHz (16000 Hz) to use the pretrained Wav2Vec2 model:
from datasets import Audio

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds['train'][12]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602ba046bb1e6d0fbce91fd3.wav',
 'audio': {'path': '602ba046bb1e6d0fbce91fd3.wav',
  'array': array([-1.63881659e-05, -5.43160438e-05,  1.55731900e-05, ...,
         -2.08235957e-04, -2.38026489e-04, -1.22843365e-04]),
  'sampling_rate': 16000},
 'transcription': 'what is my current bank balance'}
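
Resampling keeps the duration and changes the number of samples per second, so an 8 kHz clip has twice as many samples at 16 kHz. The sketch below uses plain linear interpolation to show the bookkeeping. It is an illustration only, and `cast_column` is not assumed to use this method.

```python
def resample_linear(samples, orig_rate, target_rate):
    """Resample by linear interpolation (illustration only, no anti-aliasing filter)."""
    n_out = int(len(samples) * target_rate / orig_rate)
    step = orig_rate / target_rate  # distance between output samples, in input samples
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

one_second_8k = [0.0] * 8000
upsampled = resample_linear(one_second_8k, 8_000, 16_000)
print(len(one_second_8k), "->", len(upsampled))  # → 8000 -> 16000
```

Upsampling adds no new information. It only matches the rate the model saw during pretraining.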

In [131]:
# The Wav2Vec2 tokenizer is only trained on uppercase characters 
# so you’ll need to make sure the text matches the tokenizer’s vocabulary:

def uppercase(example):
    return {"transcription": example["transcription"].upper()}

minds = minds.map(uppercase)
minds['train'][12]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602ba046bb1e6d0fbce91fd3.wav',
 'audio': {'path': '602ba046bb1e6d0fbce91fd3.wav',
  'array': array([-1.63881659e-05, -5.43160438e-05,  1.55731900e-05, ...,
         -2.08235957e-04, -2.38026489e-04, -1.22843365e-04]),
  'sampling_rate': 16000},
 'transcription': 'WHAT IS MY CURRENT BANK BALANCE'}

In [132]:
def prepare_dataset(batch):
    # accessing the audio column decodes the waveform at the sampling rate set above
    audio = batch["audio"]
    # the processor returns 'input_values' for the audio and 'labels' for the tokenized text
    batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"])
    batch["input_length"] = len(batch["input_values"][0])
    return batch

In [133]:
# apply the preprocessing function to the entire dataset by using Datasets map function
encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names['train'], num_proc=4)

In [134]:
minds['train'][12]

{'path': '/home/tslab/phusaeng/.cache/huggingface/datasets/downloads/extracted/1a2f7ed31ee9dea31314ca8b0f56280ce353e5e05d593431105d4848eba86946/en-US~BALANCE/602ba046bb1e6d0fbce91fd3.wav',
 'audio': {'path': '602ba046bb1e6d0fbce91fd3.wav',
  'array': array([-1.63881659e-05, -5.43160438e-05,  1.55731900e-05, ...,
         -2.08235957e-04, -2.38026489e-04, -1.22843365e-04]),
  'sampling_rate': 16000},
 'transcription': 'WHAT IS MY CURRENT BANK BALANCE'}

In [135]:
import torch

from dataclasses import dataclass
from typing import Dict, List

# the dataclass decorator generates __init__, __repr__, and __eq__ for the class
@dataclass
class DataCollatorCTCWithPadding:
    processor: AutoProcessor
    padding: bool|str = "longest"

    def __call__(self, features: List[Dict[str, List[int]|torch.Tensor]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"][0]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # replace padding with -100 so that padded label positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels
        return batch
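
What the collator does to a batch can be shown with plain lists: audio inputs and labels are padded separately to the longest item in the batch, and label padding is set to -100. The values below are invented, and `pad_batch` is a hypothetical helper, not part of `transformers`.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]

# hypothetical features: two utterances of different length
input_values = [[0.1, -0.2, 0.3, 0.0, 0.5], [0.2, 0.1]]
labels = [[7, 4, 11], [9]]

padded_inputs = pad_batch(input_values, pad_value=0.0)
padded_labels = pad_batch(labels, pad_value=-100)  # -100 marks positions the loss should skip

print(padded_inputs)
print(padded_labels)
```

Inputs and labels need separate padding because a waveform has tens of thousands of samples while its transcription has only a few dozen characters.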

In [136]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest")
print(data_collator)

DataCollatorCTCWithPadding(processor=Wav2Vec2Processor:
- feature_extractor: Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

- tokenizer: Wav2Vec2CTCTokenizer(name_or_path='facebook/wav2vec2-base', vocab_size=32, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True), padding='longest')


In [137]:
# call evaluation method
import evaluate

wer = evaluate.load('wer')

In [138]:
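# compute the WER between the predictions and the ground-truth transcriptions
import numpy as np

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # restore the padding token where the collator put -100
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # labels are not frame-level, so repeated characters must not be merged
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # named wer_score so the global `wer` metric is not shadowed inside the function,
    # which would raise UnboundLocalError
    wer_score = wer.compute(predictions=pred_str, references=label_str)

    return {"wer": wer_score}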

In [139]:
from transformers import AutoModelForCTC, TrainingArguments, Trainer

model = AutoModelForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)



Downloading pytorch_model.bin:   0%|          | 0.00/380M [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-base were not used when initializing Wav2Vec2ForCTC: ['project_hid.weight', 'quantizer.codevectors', 'project_hid.bias', 'project_q.weight', 'project_q.bias', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predicti

In [140]:
training_args = TrainingArguments(
    output_dir="my_awesome_asr_mind_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
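
These arguments imply a long schedule for an 80-example training set. The arithmetic below is a rough sketch that assumes a single GPU. `training_schedule` is a helper written for this note, not a `transformers` function.

```python
import math

def training_schedule(train_examples, per_device_batch, accumulation, max_steps, n_devices=1):
    """Derive the effective batch size and the number of epochs implied by max_steps."""
    effective_batch = per_device_batch * accumulation * n_devices
    updates_per_epoch = math.ceil(train_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "epochs": max_steps / updates_per_epoch,
    }

schedule = training_schedule(train_examples=80, per_device_batch=8, accumulation=2, max_steps=2000)
print(schedule)  # → {'effective_batch': 16, 'updates_per_epoch': 5, 'epochs': 400.0}
print(f"warmup share: {500 / 2000:.0%}")  # → warmup share: 25%
```

With only 5 optimizer updates per epoch, `max_steps=2000` corresponds to about 400 passes over the data, and evaluation runs only at steps 1000 and 2000.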

In [129]:
test_var = torch.Tensor(encoded_minds['test'][0]['input_values'][0])
test_shape = test_var.shape
test_var = test_var.unsqueeze(0)
test_var.shape

torch.Size([1, 87678])

In [133]:
rand_var = torch.randn((1, test_shape[0]))
device = torch.device("cuda")
model = model.to(device)
rand_var = rand_var.to(device)
test_var = test_var.to(device)

In [134]:
pred = model(rand_var)
pred

CausalLMOutput(loss=None, logits=tensor([[[ 0.2743,  0.3006,  0.4644,  ..., -0.3185, -0.0955,  0.2472],
         [ 0.2778,  0.2633,  0.4002,  ..., -0.2830, -0.1032,  0.2347],
         [ 0.2877,  0.2562,  0.3968,  ..., -0.2728, -0.1023,  0.2253],
         ...,
         [ 0.2781,  0.2277,  0.3745,  ..., -0.2655, -0.1093,  0.2180],
         [ 0.2705,  0.2418,  0.3888,  ..., -0.2801, -0.1057,  0.2275],
         [ 0.2671,  0.2798,  0.4356,  ..., -0.2933, -0.1016,  0.2378]]],
       device='cuda:0', grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [135]:
pred_var = model(test_var)
pred_var

CausalLMOutput(loss=None, logits=tensor([[[-0.0258,  0.2510,  0.4802,  ..., -0.0041, -0.1058,  0.1840],
         [ 0.0301,  0.2544,  0.4487,  ..., -0.0210, -0.1419,  0.1955],
         [ 0.0550,  0.2683,  0.4259,  ..., -0.0457, -0.1417,  0.1469],
         ...,
         [-0.0289,  0.2344,  0.4614,  ..., -0.1182, -0.3728,  0.0207],
         [ 0.0023,  0.1959,  0.4319,  ..., -0.1516, -0.3663,  0.0022],
         [-0.0344,  0.1948,  0.4796,  ..., -0.0717, -0.3801,  0.0596]]],
       device='cuda:0', grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [136]:
pred.logits.shape, pred_var.logits.shape

(torch.Size([1, 273, 32]), torch.Size([1, 273, 32]))
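
The 273 frames follow from the convolutional feature extractor, which shortens the 87,678-sample input layer by layer. The sketch below assumes the kernel sizes and strides of the default wav2vec 2.0 base configuration (seven layers). The authoritative values are in `model.config.conv_kernel` and `model.config.conv_stride`.

```python
# Assumed defaults of the wav2vec 2.0 base feature extractor (seven conv layers).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_frames(num_samples: int) -> int:
    """Output length of the conv stack for an unpadded input of num_samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

print(num_frames(87_678))  # → 273
```

The strides multiply to 320, so at 16 kHz each output frame covers a hop of 20 ms. The last dimension, 32, is the tokenizer's vocabulary size.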

In [139]:
processor

Wav2Vec2Processor:
- feature_extractor: Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

- tokenizer: Wav2Vec2CTCTokenizer(name_or_path='facebook/wav2vec2-base', vocab_size=32, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=True)

In [138]:
predicted_ids = torch.argmax(pred_var.logits, dim=-1)
print(predicted_ids)
transcription = processor.batch_decode(predicted_ids)
print(transcription)

tensor([[ 2,  2,  2,  2,  2, 14, 14, 12, 16, 14, 12, 12, 14, 12, 12, 12, 22, 12,
         14, 12, 14, 14, 12, 14, 14,  2, 14, 14, 12, 14,  2, 14,  2,  2,  2,  2,
          2,  2, 14,  2,  2, 14, 12, 12, 14,  2, 14,  2,  2, 14, 14,  2,  2,  2,
          2, 14, 14,  2,  2, 14,  2,  2,  2,  2,  2,  2,  2,  2, 14,  2,  2, 14,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2, 14,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          1, 22, 22, 28, 22, 24, 10, 14, 14, 14, 28, 22, 22, 22, 22, 14, 14, 14,
         12, 22, 22, 22, 22, 22, 22, 22, 22, 22, 24, 22, 22, 22, 22, 24, 24, 28,
         28, 22, 14, 14, 22, 22, 22, 22, 28, 28, 22, 14, 24, 24, 24, 24, 24, 22,
         22, 22, 22, 14, 22, 22, 24, 24, 19,  2, 24, 24, 24, 22, 28, 22,  1,  1,
         14, 24, 24, 27, 27,  1, 10, 10, 22, 22, 22, 28,  0, 24, 22, 22, 22,  5,
         14, 14, 24, 28,  1, 22, 22, 22, 28,  1,  1, 24, 22, 22, 22, 22, 22, 22,
         22,  2, 22, 22, 22,
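
The predicted ids contain long runs of the same id because CTC emits one prediction per frame. Greedy decoding collapses consecutive repeats and then drops the blank token, which is what `batch_decode` is expected to do by default (the Wav2Vec2 tokenizer uses the padding token as the blank). The vocabulary below is a toy one, not the real tokenizer's.

```python
from itertools import groupby

def ctc_greedy_decode(ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop the blank token."""
    collapsed = [token for token, _ in groupby(ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)

# hypothetical vocabulary: 0 is the CTC blank, "|" marks a word boundary
id_to_char = {0: "", 1: "|", 2: "a", 3: "l"}
frame_ids = [2, 2, 3, 0, 3, 3, 1, 2]

print(ctc_greedy_decode(frame_ids, id_to_char).replace("|", " "))  # → all a
```

The blank between the two runs of `3` is what allows a double letter to survive the collapse. Reference labels carry no such blanks, which is why they are decoded with `group_tokens=False`.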

In [142]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_minds["train"],
    eval_dataset=encoded_minds["test"],
    tokenizer=processor,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # required for metric_for_best_model="wer"
)

Cloning https://huggingface.co/Phurich/my_awesome_asr_mind_model into local empty directory.


In [143]:
trainer.train()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /home/tslab/phusaeng/.netrc


Step,Training Loss,Validation Loss


# TODO
- Create a script file for training each model
    - Wav2Vec2.0
    - Whisper
    - HuBERT
- Create a script file for evaluating each model
- Use this notebook to visualize and analyze the results.

In [1]:
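# placeholder for a custom trainer; a bare `raise` in the class body would fail
# as soon as the cell runs, so the error is raised on instantiation instead
class Trainer:
    def __init__(self, *args, **kwargs):
        raise NotImplementedError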

In [19]:
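# try to load the fairseq checkpoint and list its parameter names
import torch
import fairseq

# the file is a checkpoint dict, not a model object; load it on the CPU
checkpoint = torch.load('./weights/wav2vec_small_100h.pt', map_location='cpu')
checkpoint['model'].keys(), len(checkpoint['model'].keys())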

(odict_keys(['w2v_encoder.proj.weight', 'w2v_encoder.proj.bias', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.0.0.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.0.2.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.0.2.bias', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.1.0.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.2.0.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.3.0.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.4.0.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.5.0.weight', 'w2v_encoder.w2v_model.feature_extractor.conv_layers.6.0.weight', 'w2v_encoder.w2v_model.encoder.pos_conv.0.bias', 'w2v_encoder.w2v_model.encoder.pos_conv.0.weight_g', 'w2v_encoder.w2v_model.encoder.pos_conv.0.weight_v', 'w2v_encoder.w2v_model.encoder.layers.0.self_attn.k_proj.weight', 'w2v_encoder.w2v_model.encoder.layers.0.self_attn.k_proj.bias', 'w2v_encoder.w2v_model.encoder.layers.0.self_attn.v_proj.w

In [1]:
import soundfile as sf
from IPython.display import Audio

# sf.read returns the waveform and its sampling rate
data, sample_rate = sf.read('/net/papilio/storage6/phusaeng/fun/database/LibriSpeech/dev-clean/84/121123/84-121123-0000.flac')

Audio(data=data, rate=sample_rate)

In [105]:
from datasets import load_from_disk, DatasetDict

common_voice = DatasetDict()

# common_voice_train = load_dataset("common_voice", "th", split="train+validation")
# common_voice_test = load_dataset("common_voice", "th", split="test")
common_voice['train'] = load_from_disk('common_voice_train')
common_voice['test'] = load_from_disk('common_voice_test')

In [106]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 4839
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 2188
    })
})

In [107]:
common_voice = common_voice.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes"])

In [109]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 4839
    })
    test: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 2188
    })
})

In [110]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

In [116]:
feature_extractor

WhisperFeatureExtractor {
  "chunk_length": 30,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 80,
  "hop_length": 160,
  "n_fft": 400,
  "n_samples": 480000,
  "nb_max_frames": 3000,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "WhisperProcessor",
  "return_attention_mask": false,
  "sampling_rate": 16000
}
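
The fields in this config are tied together by simple arithmetic: 30 s of audio at 16 kHz is 480,000 samples, and one spectrogram frame every 160 samples gives 3,000 frames. The sketch below recomputes them. The final halving to 1,500 positions relies on the stride-2 convolution and the `Embedding(1500, 768)` that appear in the model summary further down. `whisper_input_shape` is a name chosen for this note.

```python
def whisper_input_shape(chunk_length=30, sampling_rate=16_000, hop_length=160, feature_size=80):
    """Recompute n_samples and the log-mel input shape from the config fields."""
    n_samples = chunk_length * sampling_rate
    nb_max_frames = n_samples // hop_length
    return n_samples, (feature_size, nb_max_frames)

n_samples, (n_mels, n_frames) = whisper_input_shape()
encoder_positions = n_frames // 2  # the encoder's second convolution has stride 2

print(n_samples, (n_mels, n_frames), encoder_positions)  # → 480000 (80, 3000) 1500
```

Shorter clips are padded to the full 30 s window, so every input to the encoder has the same shape.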

In [111]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="th", task="transcribe")

In [112]:
tokenizer

WhisperTokenizer(name_or_path='openai/whisper-small', vocab_size=50258, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|endoftext|>', '<|startoftranscript|>', '<|en|>', '<|zh|>', '<|de|>', '<|es|>', '<|ru|>', '<|ko|>', '<|fr|>', '<|ja|>', '<|pt|>', '<|tr|>', '<|pl|>', '<|ca|>', '<|nl|>', '<|ar|>', '<|sv|>', '<|it|>', '<|id|>', '<|hi|>', '<|fi|>', '<|vi|>', '<|he|>', '<|uk|>', '<|el|>', '<|ms|>', '<|cs|>', '<|ro|>', '<|da|>', '<|hu|>', '<|ta|>', '<|no|>', '<|th|>', '<|ur|>', '<|hr|>', '<|bg|>', '<|lt|>', '<|la|>', '<|mi|>', '<|ml|>', '<|cy|>', '<|sk|>', '<|te|>

In [113]:
# verify the tokenizer
input_str = common_voice["train"][0]["sentence"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)

print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")


Input:                 เงียบหน่อย เจ้าหนู
Decoded w/ special:    <|startoftranscript|><|th|><|transcribe|><|notimestamps|>เงียบหน่อย เจ้าหนู<|endoftext|>
Decoded w/out special: เงียบหน่อย เจ้าหนู
Are equal:             True


In [114]:
# To simplify using the feature extractor and tokenizer, we can use WhisperProcessor
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained('openai/whisper-small', language='th', task='transcribe')

In [115]:
processor

In [117]:
# Prepare data: Whisper expects audio sampled at 16 kHz, but Common Voice is stored at 48 kHz
print(common_voice["train"][0])

{'audio': {'path': 'common_voice_th_23654854.mp3', 'array': array([ 0.00000000e+00, -1.15516158e-13,  2.51645154e-14, ...,
       -2.67443284e-07,  9.94784386e-07, -1.33871993e-07]), 'sampling_rate': 48000}, 'sentence': 'เงียบหน่อย เจ้าหนู'}


In [119]:
# downsample from 48 kHz to 16 kHz
from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))
print(common_voice["train"][0])

{'audio': {'path': 'common_voice_th_23654854.mp3', 'array': array([-1.20055743e-13,  1.51213391e-13,  3.08460461e-14, ...,
        1.46906592e-07, -1.81063095e-06, -6.15896399e-07]), 'sampling_rate': 16000}, 'sentence': 'เงียบหน่อย เจ้าหนู'}


In [98]:
count_gender = {'male': 0, 'female': 0, 'wierd': 0}
female_id = []
for s in range(len(common_voice_train)): # takes around 34 s
    count_gender['male'] += 1 if common_voice_train[s]['gender'] == 'male' else 0
    count_gender['female'] += 1 if common_voice_train[s]['gender'] == 'female' else 0
    if common_voice_train[s]['gender'] == 'female':
        female_id.append(s)
    if common_voice_train[s]['sentence'] == 'รักนะจุ๊บจุ๊บ ฝันดีนะ ถ้าไม่คิดถึงกันนะจะโกรธเลย ง้อกี่ทีก็ไม่หายนะบอกก่อน':
        print(s)
        break

2278


In [99]:
count_gender, len(common_voice_train), female_id[0], female_id[-1]

({'male': 2080, 'female': 199, 'wierd': 0}, 4839, 1663, 2263)

In [100]:
# some_data = common_voice_train[2791:2916]
some_data = common_voice_train[2278:2279] # รักนะจุ๊บจุ๊บ ฝันดีนะ ถ้าไม่คิดถึงกันนะจะโกรธเลย ง้อกี่ทีก็ไม่หายนะบอกก่อน

In [102]:
id_test = 0
print(some_data['sentence'][id_test])
IPAudio(data=some_data['audio'][id_test]['array'], rate=some_data['audio'][id_test]['sampling_rate'])

รักนะจุ๊บจุ๊บ ฝันดีนะ ถ้าไม่คิดถึงกันนะจะโกรธเลย ง้อกี่ทีก็ไม่หายนะบอกก่อน


In [104]:
show_random_elements(common_voice_train.remove_columns(['path', 'audio', 'client_id', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment']), 10)

Unnamed: 0,sentence
0,ฉันยังไม่ได้กินข้าวเลย
1,ที่บ้านบอกมา
2,เธอไปที่ร้านขายยาเพื่อรับยา
3,นายกรัฐมนตรีเยี่ยมชมนิทรรศการผลงานวิจัยและนวัตกรรมในงานมหกรรมงานวิจัยแห่งชาติ
4,แต่เดี๋ยวก่อน
5,ตอนฉันเป็นเด็ก ฉันรักการซื้อไอศกรีมเชอร์เบทที่ร้านขนมหวาน
6,ฉันคิดว่าฉันสั่งเนื้อไป
7,แมวอาจมองไปที่ราชา อลิซพูด
8,ทำไมคุณต้องเล่าเรื่องราวของฉันให้กับคนหลายคนได้ฟังกัน?
9,วิธีการปั้นตุ๊กตาหิมะ


In [103]:
# ignore special characters for speech
import re
# raw string, so the backslashes are passed to the regex engine unchanged
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\…\–\—\(\)\[\]\{\}\<\>\=\+\@\#\$\&\*\^\~\_\`\’\/\’\‘\|]'

def remove_special_characters(batch, column_names='sentence'):
    batch[column_names] = re.sub(chars_to_ignore_regex, '', batch[column_names]).lower() + " "
    return batch

common_voice_train = common_voice_train.map(remove_special_characters, num_proc=4)

In [143]:
len(common_voice_train['sentence'])

4839

In [87]:
# build vocab
def extract_all_chars(batch):
    all_text = " ".join(batch["sentence"])
    vocab = list(set(all_text))
    print(f'vocab: {len(vocab)}')
    print(vocab)
    return {"vocab": [vocab], "all_text": [all_text]}

# vocab_train = common_voice_train.map(extract_all_chars, keep_in_memory=True, num_proc=4)
# vocab_test = common_voice_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, num_proc=4)


In [90]:
vocab_dict = list(set(vocab_train['vocab'][0])| set(vocab_test['vocab'][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_dict)}
# vocab_dict

In [98]:
common_voice_train[2278]['sentence'].lower()

'รักนะจุ๊บจุ๊บ ฝันดีนะ ถ้าไม่คิดถึงกันนะจะโกรธเลย ง้อกี่ทีก็ไม่หายนะบอกก่อน'

In [100]:
common_voice_train.column_names

['client_id',
 'path',
 'audio',
 'sentence',
 'up_votes',
 'down_votes',
 'age',
 'gender',
 'accent',
 'locale',
 'segment']

# Whisper 

In [1]:
# load a pre-trained checkpoint
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-small')

In [2]:
model

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (f

In [3]:
model.config.forced_decoder_ids # These token ids control the transcription language and task for zero-shot ASR.

[[1, 50259], [2, 50359], [3, 50363]]

In [31]:
import torch
num_params = 0 
for layer in model.parameters():
    num_params += layer.numel()
print(num_params * 1/1e6)

241.734912


In [35]:
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-large")

from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

In [36]:
import torch
num_params = 0 
for layer in model.parameters():
    num_params += layer.numel()
print(num_params * 1/1e6)

94.39632


In [37]:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large", 
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
)

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-large were not used when initializing Wav2Vec2ForCTC: ['project_q.bias', 'project_hid.bias', 'quantizer.weight_proj.bias', 'project_q.weight', 'quantizer.weight_proj.weight', 'quantizer.codevectors', 'project_hid.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large and are newly initialized: ['lm_head.bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predic

In [38]:
import torch
num_params = 0 
for layer in model.parameters():
    num_params += layer.numel()
print(num_params * 1/1e6)

315.461792
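
The parameter counts printed above translate directly into memory for the weights. The sketch below is a back-of-the-envelope estimate that covers weights only. Gradients, optimizer states and activations come on top during fine-tuning. The dictionary keys refer to the cells that printed each count.

```python
def weight_memory_gib(params_millions: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return params_millions * 1e6 * bytes_per_param / 2**30

# counts (in millions) as printed by the cells above
counts = {
    "In [31] (whisper-small)": 241.734912,
    "In [36]": 94.39632,
    "In [38] (wav2vec2-large)": 315.461792,
}

for name, millions in counts.items():
    fp32 = weight_memory_gib(millions, 4)  # 4 bytes per float32 parameter
    fp16 = weight_memory_gib(millions, 2)  # 2 bytes per float16 parameter
    print(f"{name}: {fp32:.2f} GiB in fp32, {fp16:.2f} GiB in fp16")
```

These figures only bound model size. Latency still has to be measured, since the CTC models decode in one forward pass while Whisper generates tokens one at a time.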
