## Setting Up

Before running this notebook, make sure you are on a GPU runtime (`Runtime` > `Change runtime type` > `GPU`). The following cell installs the [`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2) package and its dependencies.

In [2]:
!pip install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main
# !sudo apt-get install -y libsndfile1-dev # to be compiled on the terminal
!pip install -q SoundFile

In [21]:
import os

import tensorflow as tf
import tensorflow_hub as hub
from wav2vec2 import Wav2Vec2Config
from tensorflow import keras
import soundfile as sf

config = Wav2Vec2Config()

print("TF version:", tf.__version__)

TF version: 2.16.1


In [5]:
os.chdir(r"/mnt/f/IA/WOLOF")

# Utils

In [28]:
import pandas as pd
import os


def extract_all_chars(directory, file_name):
    all_text = ""
    with open(os.path.join(directory, file_name), 'rb') as f:
        batch = pd.read_csv(f)
    for k in range(len(batch)):
        all_text += " " + batch['transcription'][k]

    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

In [18]:
def read_wav_file(file_path, REQUIRED_SAMPLE_RATE = 16000):
  with open(file_path, "rb") as f:
      audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
      raise ValueError(
          f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
      )
  file_id = os.path.split(file_path)[-1][:-len(".wav")]
  return {file_id: audio}

In [19]:
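def text_preprocess(row):
    # Map the file ID (filename without ".wav") to its transcription.
    # NOTE: `s.split()[1:]` drops the first word of every line, and lines with
    # fewer than 3 words are skipped. This suits transcripts whose first token
    # is an utterance ID; verify that it is intended for this CSV.
    text = row.get('transcription')
    samples = text.split("\n")
    samples = {row.get('filename')[:-len(".wav")]: " ".join(s.split()[1:]) for s in samples if len(s.split()) > 2}
    return samples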

In [24]:
def fetch_sound_text_mapping(data_dir, text_file_df, AUDIO_MAXLEN):
  all_files = os.listdir(data_dir)

  wav_files = [os.path.join(data_dir, f) for f in all_files if f.endswith(".wav")]
  aux = text_file_df.apply(lambda row: text_preprocess(row), axis = 1)

  txt_samples = {}
  for (_, text_sample) in aux.items():
    txt_samples.update(text_sample)

  speech_samples = {}
  for f in wav_files:
    speech_samples.update(read_wav_file(f))

  assert len(txt_samples) == len(speech_samples)

  samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in speech_samples.keys() if len(speech_samples[file_id]) < AUDIO_MAXLEN]
  return samples

# Data loading & processing 

In [25]:
LABEL_MAXLEN = 256
BATCH_SIZE = 2
AUDIO_MAXLEN = 246000

In [7]:
data_dir = r"SPEECH_TO_TEXT/DATA/CLEANED/WOLOF_AUDIO_TRANS/alffa/audio"
all_files = os.listdir(data_dir)

wav_files = [f for f in all_files if f.endswith(".wav")]

In [8]:
import pandas as pd
text_file = pd.read_csv("SPEECH_TO_TEXT/DATA/CLEANED/WOLOF_AUDIO_TRANS/alffa/alffa_clean_df.csv")
text_file.head(3).T

|  | 0 | 1 | 2 |
|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 |
| id | 1 | 2 | 3 |
| transcription | jén fa nga ko jàppe soo ko fa sange mu rëcc | ngorsi kat masu maa jaar buroom di tataani | dañu ko logal moo tax déggatuloo ko |
| length | 5.968 | 3.946 | 5.608 |
| filename | isma_1_WOL.wav | isma_2_WOL.wav | isma_3_WOL.wav |


In [9]:
from IPython.display import Audio
import random

# Pick a random recording and show its transcription next to the audio player.
file_id = random.choice([f[:-len(".wav")] for f in wav_files])
wav_file_path = os.path.join(data_dir, f"{file_id}.wav")

print("Text Transcription:", text_file[text_file['filename'] == file_id + '.wav']['transcription'].iloc[0], "\nAudio:")
Audio(filename=wav_file_path)

Text Transcription:  dafay toog rekk di digle te du dugal loxoom ci liggéey bi 
Audio:


In [26]:
samples = fetch_sound_text_mapping(data_dir, text_file, AUDIO_MAXLEN)
samples[:5]

[(array([0.        , 0.        , 0.        , ..., 0.01473999, 0.01318359,
         0.01153564]),
  'defee ngàll ci robb bi dafa koy ñaawal'),
 (array([-3.05175781e-05,  9.15527344e-05, -2.13623047e-04, ...,
          1.46484375e-03,  2.10571289e-03,  2.59399414e-03]),
  'dëkk bi taseeg nag yi daldi wóññeeku'),
 (array([-0.00015259,  0.00021362, -0.00039673, ..., -0.00204468,
         -0.00628662,  0.00183105]),
  'ay noonam ak ay ñoñam'),
 (array([ 3.05175781e-05, -9.15527344e-05,  1.83105469e-04, ...,
          3.66210938e-04, -1.06811523e-03, -1.37329102e-03]),
  'na ñu bëggul ku leen gantu suba teel'),
 (array([0.        , 0.        , 0.        , ..., 0.0027771 , 0.00326538,
         0.00140381]),
  'taatu guy googu la jigéeni ajoor yi di jaaye sanqal')]

## Vocabulary & Tokenizer & Processor

In [29]:
directory = "SPEECH_TO_TEXT/DATA/CLEANED/WOLOF_AUDIO_TRANS/alffa"
file_name = "alffa_clean_df.csv"

vocab = extract_all_chars(directory, file_name)

In [30]:
# `extract_all_chars` already returns unique characters; one de-duplication pass is enough.
vocab_list = list(set(vocab["vocab"][0]))

In [31]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'é': 0,
 's': 1,
 'r': 2,
 'à': 3,
 'u': 4,
 'x': 5,
 'v': 6,
 'a': 7,
 'c': 8,
 'n': 9,
 'ë': 10,
 'l': 11,
 ' ': 12,
 'ç': 13,
 'N': 14,
 'ã': 15,
 'g': 16,
 'ó': 17,
 'q': 18,
 'o': 19,
 'y': 20,
 'b': 21,
 'k': 22,
 'ñ': 23,
 'e': 24,
 'j': 25,
 'w': 26,
 'm': 27,
 'd': 28,
 'f': 29,
 'i': 30,
 't': 31,
 'p': 32,
 'h': 33}

In [32]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [33]:
vocab_dict["<unk>"] = len(vocab_dict)
vocab_dict["<pad>"] = len(vocab_dict)
vocab_size = len(vocab_dict)
print(vocab_size)

36


In [34]:
import json

vocab_path = r"SPEECH_TO_TEXT/CODES/MODELS/WAV2VEC2/vocabs"

# Persist the character-to-ID mapping so it can be reused later.
with open(os.path.join(vocab_path, "vocab.json"), "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [35]:
from wav2vec2 import Wav2Vec2Processor
tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)

def preprocess_text(text):
  label = tokenizer(text)
  return tf.constant(label, dtype=tf.int32)

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  return processor(tf.transpose(audio))

In [36]:
def inputs_generator():
  for speech, text in samples:
    yield preprocess_speech(speech), preprocess_text(text)

## Setting up `tf.data.Dataset`

The following cell sets up a `tf.data.Dataset` object using its `.from_generator(...)` method. We will use the generator function defined in the cell above.

**Note:** For distributed training (especially on TPUs), `.from_generator(...)` doesn't work currently and it is recommended to train on data stored in `.tfrecord` format (Note: The TFRecords should ideally be stored inside a GCS Bucket in order for the TPUs to work to the fullest extent).

You can refer to [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/make_tfrecords.py) for more details on how to convert LibriSpeech data into tfrecords.

In [37]:
BUFFER_SIZE = len(wav_files)
SEED = 42

In [38]:
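# `(None,)` declares a 1-D tensor of unknown length; `(None)` is just `None`,
# which would leave the rank unknown.
output_signature = (
    tf.TensorSpec(shape=(None,), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.int32),
)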

dataset = tf.data.Dataset.from_generator(inputs_generator, output_signature=output_signature)

In [None]:
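# Keep the shuffled order fixed across iterations so that the take/skip split
# further down does not mix training and validation samples.
dataset = dataset.shuffle(BUFFER_SIZE, seed=SEED, reshuffle_each_iteration=False)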

We will feed the dataset to the model in batches, so let's prepare the batches in the following cell. All the sequences in a batch must be padded to a constant length. We will use the `.padded_batch(...)` method for that purpose.

In [None]:
dataset = dataset.padded_batch(BATCH_SIZE, padded_shapes=(AUDIO_MAXLEN, LABEL_MAXLEN), padding_values=(0.0, 0))
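
To make the padding step concrete, here is a minimal pure-Python sketch of what padding a batch to a fixed length amounts to. The helper `pad_to_length` and the toy constants are hypothetical; they only mimic the effect of `.padded_batch(...)` on small lists and are not how TensorFlow implements it.

```python
def pad_to_length(sequence, target_len, pad_value):
    """Right-pad `sequence` with `pad_value` until it has `target_len` items."""
    if len(sequence) > target_len:
        raise ValueError(f"sequence of length {len(sequence)} exceeds {target_len}")
    return list(sequence) + [pad_value] * (target_len - len(sequence))


# Toy stand-ins for (speech, label) pairs of different lengths.
toy_batch = [([0.1, -0.2, 0.3], [5, 7]), ([0.4], [1, 2, 3])]
TOY_AUDIO_MAXLEN, TOY_LABEL_MAXLEN = 4, 3  # hypothetical, far smaller than the real values

padded_speech = [pad_to_length(s, TOY_AUDIO_MAXLEN, 0.0) for s, _ in toy_batch]
padded_labels = [pad_to_length(l, TOY_LABEL_MAXLEN, 0) for _, l in toy_batch]

print(padded_speech)  # [[0.1, -0.2, 0.3, 0.0], [0.4, 0.0, 0.0, 0.0]]
print(padded_labels)  # [[5, 7, 0], [1, 2, 3]]
```

Every sequence must fit within its target length, which is why recordings of `AUDIO_MAXLEN` samples or more are filtered out in `fetch_sound_text_mapping`.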

Accelerators (like GPUs/TPUs) are very fast, so data loading and pre-processing, which run on CPUs, often become the bottleneck during training. This can increase training time significantly, especially when there is a lot of online pre-processing or when data is streamed from GCS buckets. To handle this, `tf.data.Dataset` offers the `.prefetch(...)` method, which prepares the next few batches in parallel (on CPUs) while the model is processing the current batch (on GPUs/TPUs).

In [None]:
dataset = dataset.prefetch(tf.data.AUTOTUNE)

In [None]:
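# The dataset is already batched, so count batches rather than files.
num_batches = len(samples) // BATCH_SIZE
num_val_batches = int(num_batches * 0.2)
num_train_batches = num_batches - num_val_batches

train_dataset = dataset.take(num_train_batches)
val_dataset = dataset.skip(num_train_batches).take(num_val_batches)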

# Setup Model

In [None]:
pretrained_layer = hub.KerasLayer("https://tfhub.dev/vasudevgupta7/wav2vec2/1", trainable=True)

In [None]:
# Prepare a directory to store all the checkpoints.
checkpoint_dir = "/kaggle/working/models/ckpt"
os.makedirs(checkpoint_dir, exist_ok=True)

In [None]:
def get_model(pretrained_layer, vocab_size, AUDIO_MAXLEN = 246000):
    inputs = tf.keras.Input(shape=(AUDIO_MAXLEN,))
    hidden_states = pretrained_layer(inputs)
    outputs = tf.keras.layers.Dense(vocab_size)(hidden_states)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
def make_or_restore_compiled_model(pretrained_layer, optimizer, loss_fn, AUDIO_MAXLEN):
    # Either restore the latest model, or create a fresh one
    # if there is no checkpoint available.
    checkpoints = [os.path.join(checkpoint_dir, name) for name in os.listdir(checkpoint_dir)]
    if checkpoints:
        latest_checkpoint = max(checkpoints, key=os.path.getctime)
        print("Restoring from", latest_checkpoint)
        return keras.models.load_model(latest_checkpoint)
    print("Creating a new model")

    # `vocab_size` is the global computed in the vocabulary section.
    model = get_model(pretrained_layer, vocab_size, AUDIO_MAXLEN=AUDIO_MAXLEN)
    # `compile` returns None, so compile first and then return the model itself.
    model.compile(optimizer, loss=loss_fn)
    return model

In [None]:
def run_training(pretrained_layer, optimizer, loss_fn, AUDIO_MAXLEN, epochs = 1):
    # Create a MirroredStrategy.
    strategy = tf.distribute.MirroredStrategy()

    # Open a strategy scope and create/restore the model
    with strategy.scope():
        model = make_or_restore_compiled_model(pretrained_layer, optimizer, loss_fn, AUDIO_MAXLEN)

    callbacks = [
        # This callback saves a SavedModel every epoch
        # We include the current epoch in the folder name.
        keras.callbacks.ModelCheckpoint(
            filepath=checkpoint_dir + "/ckpt-{epoch}", save_freq="epoch"
        )
    ]
    return model.fit(
        train_dataset,
        epochs = epochs,
        callbacks = callbacks,
        validation_data = val_dataset,
        verbose = 2,
    )

# Setup Training

In [None]:
# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

Now we need to define the `loss_fn` and the optimizer to train the model. The following cell does that. We will use the `Adam` optimizer for simplicity. `CTCLoss` is a common loss for tasks (like `ASR`) where sub-parts of the input cannot easily be aligned with sub-parts of the output. You can read more about CTC loss in this [blog post](https://distill.pub/2017/ctc/).


`CTCLoss` (from [`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2) package) accepts 3 arguments: `config`, `model_input_shape` & `division_factor`. If `division_factor=1`, then loss will simply get summed, so pass `division_factor` accordingly to get mean over batch.
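
The effect of `division_factor` described above can be illustrated with plain arithmetic. The sketch below is not the package's implementation: `reduce_loss` is a hypothetical helper and the per-sample loss values are made up. It only shows how dividing a summed loss by the batch size gives a batch mean.

```python
def reduce_loss(per_sample_losses, division_factor):
    """Sum the per-sample losses and divide the total by `division_factor`."""
    return sum(per_sample_losses) / division_factor


toy_losses = [12.0, 8.0]  # made-up per-sample losses for a batch of 2

print(reduce_loss(toy_losses, division_factor=1))                # 20.0 (plain sum)
print(reduce_loss(toy_losses, division_factor=len(toy_losses)))  # 10.0 (batch mean)
```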

In [None]:
from wav2vec2 import CTCLoss

LEARNING_RATE = 5e-5

loss_fn = CTCLoss(config, (BATCH_SIZE, AUDIO_MAXLEN), division_factor=BATCH_SIZE)
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)

In [None]:
model = make_or_restore_compiled_model(pretrained_layer, optimizer, loss_fn, AUDIO_MAXLEN)

In [None]:
model(tf.random.uniform(shape=(BATCH_SIZE, AUDIO_MAXLEN)))
model.summary()

# Model training

To train our model, we call the `.fit(...)` method directly after compiling the model with `.compile(...)`.

In [None]:
history = run_training(pretrained_layer, optimizer, loss_fn, AUDIO_MAXLEN, epochs = 1)
history.history

Let's save our model with the `.save(...)` method so we can run inference later. You can also export this SavedModel to TFHub by following the [TFHub documentation](https://www.tensorflow.org/hub/publish).

In [None]:
save_dir = "/kaggle/working/models/finetuned-wav2vec2"
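# `os._exists` checks names inside the `os` module, not paths on disk;
# `exist_ok=True` creates the directory only when it is missing.
os.makedirs(save_dir, exist_ok=True)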
model.save(save_dir, include_optimizer=False)

# Evaluation

Now we will compute the Word Error Rate over the validation dataset.

**Word error rate** (WER) is a common metric for measuring the performance of an automatic speech recognition system. The WER is derived from the Levenshtein distance, working at the word level. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C), where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N = S + D + C). This value is the number of word-level errors relative to the length of the reference; lower is better, and it can exceed 1 when the prediction contains many insertions.

You can refer to [this paper](https://www.isca-speech.org/archive_v0/interspeech_2004/i04_2765.html) to learn more about WER.
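
As an illustration of the formula above, here is a minimal pure-Python sketch that computes WER from a word-level Levenshtein distance. The function `word_error_rate` is a hypothetical helper written for this example, not the implementation used by the metric library, and the predictions are made up.

```python
def word_error_rate(reference, prediction):
    """Return (S + D + I) / N using a word-level edit distance."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of five reference words.
print(word_error_rate("ay noonam ak ay ñoñam", "ay noonam ay ñoñam"))  # 0.2
# Two insertions against a one-word reference: WER above 1.
print(word_error_rate("ay", "ay noonam ak"))  # 2.0
```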

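We will use the `load_metric(...)` function from the [HuggingFace datasets](https://huggingface.co/docs/datasets/) library. Let's first install the `datasets` library using `pip` and then define the `metric` object. Note that `load_metric` is deprecated in recent `datasets` releases in favor of the separate `evaluate` library, so an older `datasets` version may be needed for this cell.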

In [None]:
!pip3 install -q datasets

from datasets import load_metric
metric = load_metric("wer")

In [None]:
@tf.function(jit_compile=True)
def eval_fwd(batch):
  logits = model(batch, training=False)
  return tf.argmax(logits, axis=-1)

In [None]:
from tqdm.auto import tqdm

for speech, labels in tqdm(val_dataset, total=num_val_batches):
    predictions  = eval_fwd(speech)
    predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
    references = [tokenizer.decode(label, group_tokens=False) for label in labels.numpy().tolist()]
    metric.add_batch(references=references, predictions=predictions)

We are using the `tokenizer.decode(...)` method for decoding our predictions and labels back into the text and will add them to the metric for `WER` computation later.
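
To show why predictions are decoded with token grouping while labels use `group_tokens=False`, here is a minimal sketch of greedy CTC-style decoding. It is not the package's implementation: `ctc_greedy_decode`, the toy vocabulary, and the choice of `<pad>` as the blank token are assumptions made for illustration, with `|` used as the word delimiter as in the vocabulary built earlier.

```python
from itertools import groupby


def ctc_greedy_decode(token_ids, id_to_char, pad_id, group_tokens=True):
    """Collapse repeated IDs (optional), drop padding, and map IDs to text."""
    if group_tokens:
        # Consecutive duplicates are treated as one emitted token.
        token_ids = [token_id for token_id, _ in groupby(token_ids)]
    chars = [id_to_char[token_id] for token_id in token_ids if token_id != pad_id]
    return "".join(chars).replace("|", " ").strip()


toy_vocab = {0: "<pad>", 1: "a", 2: "y", 3: "|", 4: "k"}  # hypothetical mapping

# Frame-level predictions repeat tokens, so grouping is needed.
frame_ids = [1, 1, 0, 2, 2, 3, 3, 1, 0, 4, 4]
print(ctc_greedy_decode(frame_ids, toy_vocab, pad_id=0))  # ay ak

# Labels hold one ID per character, so grouping would wrongly merge "aa".
label_ids = [1, 1, 4, 0, 0]
print(ctc_greedy_decode(label_ids, toy_vocab, pad_id=0, group_tokens=False))  # aak
print(ctc_greedy_decode(label_ids, toy_vocab, pad_id=0, group_tokens=True))   # ak
```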

Now, let's calculate the metric value in the following cell:

In [None]:
metric.compute()

# Inference

Now that we are satisfied with the training process & have saved the model in `save_dir`, we will see how this model can be used for inference.

First, we will load our model using `tf.keras.models.load_model(...)`.

In [None]:
finetuned_model = tf.keras.models.load_model(save_dir)

Now, we will read the speech sample using `soundfile.read(...)` and pad it to `AUDIO_MAXLEN` to satisfy the model signature. Then we will normalize that speech sample using the `Wav2Vec2Processor` instance & will feed it into the model.

In [None]:
import numpy as np

speech, _ = sf.read("/kaggle/input/wolof-speech2text/alffa_git/alffa_git/audio/WOL_01_lect_0001.wav")
speech = np.pad(speech, (0, AUDIO_MAXLEN - len(speech)))
speech = tf.expand_dims(processor(tf.constant(speech)), 0)

outputs = finetuned_model(speech)
outputs

Let's decode the predicted token IDs back into text using the `tokenizer` instance defined above.

In [None]:
predictions = tf.argmax(outputs, axis=-1)
predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
predictions