## Setting Up

Before running this notebook, make sure you are on a GPU runtime (`Runtime` > `Change runtime type` > `GPU`). The following cell installs the [`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2) package and its dependencies.

In [3]:
# !pip install --upgrade pip
# !pip install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main
# !sudo apt-get install -y libsndfile1-dev # to be compiled on the terminal
# !pip install -q SoundFile
!pip install --upgrade tensorflow



In [4]:
import os

import tensorflow as tf
import tensorflow_hub as hub
from wav2vec2 import Wav2Vec2Config
from tensorflow import keras
import soundfile as sf
import tqdm as notebook_tqdm
config = Wav2Vec2Config()


print("TF version:", tf.__version__)

TF version: 2.16.1


In [11]:
os.chdir(r"E:\IA\WOLOF")

# Utils

In [5]:
import pandas as pd
import os


def extract_all_chars(directory, file_name):
    """Collect every character used in the `transcription` column of a CSV file."""
    batch = pd.read_csv(os.path.join(directory, file_name))
    # Join all transcriptions with spaces so the space character is part of the vocabulary.
    all_text = " " + " ".join(batch['transcription'])

    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

In [6]:
def read_wav_file(file_path, REQUIRED_SAMPLE_RATE = 16000):
  with open(file_path, "rb") as f:
      audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
      raise ValueError(
          f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
      )
  file_id = os.path.split(file_path)[-1][:-len(".wav")]
  return {file_id: audio}

In [7]:
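def text_preprocess(row):
    """Map a row's file id (filename without `.wav`) to its cleaned transcription."""
    text = row.get('transcription')
    samples = text.split("\n")
    # Caution: `s.split()[1:]` discards the first token of every line, and lines
    # with two words or fewer are skipped entirely.
    samples = {row.get('filename')[:-len(".wav")]: " ".join(s.split()[1:]) for s in samples if len(s.split()) > 2}
    return samples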

In [8]:
def fetch_sound_text_mapping(data_dir, text_file_df, AUDIO_MAXLEN):
  all_files = os.listdir(data_dir)

  wav_files = [os.path.join(data_dir, f) for f in all_files if f.endswith(".wav")]
  aux = text_file_df.apply(lambda row: text_preprocess(row), axis = 1)

  txt_samples = {}
  for (_, text_sample) in aux.items():
    txt_samples.update(text_sample)

  speech_samples = {}
  for f in wav_files:
    speech_samples.update(read_wav_file(f))

  assert len(txt_samples) == len(speech_samples)

  samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in speech_samples.keys() if len(speech_samples[file_id]) < AUDIO_MAXLEN]
  return samples

# Data loading & processing 

In [16]:
LABEL_MAXLEN = 256
BATCH_SIZE = 2
AUDIO_MAXLEN = 246000
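
As a quick sanity check on these limits, the sketch below converts `AUDIO_MAXLEN` from samples to seconds, assuming the 16 kHz sample rate that `read_wav_file` enforces. The helper name `samples_to_seconds` is hypothetical and only serves this illustration.

```python
SAMPLE_RATE = 16000      # Hz, the rate required by read_wav_file
AUDIO_MAXLEN = 246000    # maximum number of audio samples per clip


def samples_to_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a sample count into a duration in seconds."""
    return num_samples / sample_rate


max_seconds = samples_to_seconds(AUDIO_MAXLEN)
print(f"Clips of {max_seconds} s or longer are dropped")
```

With these values the cutoff is 15.375 seconds, so `fetch_sound_text_mapping` keeps only clips shorter than that.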

In [12]:
data_dir = r"SPEECH_TO_TEXT\DATA\CLEANED\WOLOF_AUDIO_TRANS\alffa\audio"
all_files = os.listdir(data_dir)

wav_files = [f for f in all_files if f.endswith(".wav")]

In [13]:
import pandas as pd
text_file = pd.read_csv(r"SPEECH_TO_TEXT\DATA\CLEANED\WOLOF_AUDIO_TRANS\alffa\alffa_clean_df.csv")
text_file.head(3).T

|  | 0 | 1 | 2 |
|---|---|---|---|
| Unnamed: 0 | 0 | 1 | 2 |
| id | 1 | 2 | 3 |
| transcription | jén fa nga ko jàppe soo ko fa sange mu rëcc | ngorsi kat masu maa jaar buroom di tataani | dañu ko logal moo tax déggatuloo ko |
| length | 5.968 | 3.946 | 5.608 |
| filename | isma_1_WOL.wav | isma_2_WOL.wav | isma_3_WOL.wav |


In [14]:
from IPython.display import Audio
import random

file_id = random.choice([f[:-len(".wav")] for f in wav_files])
wav_file_path = os.path.join(data_dir, f"{file_id}.wav")

print("Text Transcription:", text_file[text_file['filename'] == file_id + '.wav']['transcription'].iloc[0], "\nAudio:")
Audio(filename=wav_file_path)

Text Transcription:  ce ngoon sa ba yumaane liggéeyee ba sotaal mu woo ko ni ko 
Audio:


In [17]:
samples = fetch_sound_text_mapping(data_dir, text_file, AUDIO_MAXLEN)
samples[:5]

[(array([0.        , 0.        , 0.        , ..., 0.01473999, 0.01318359,
         0.01153564]),
  'defee ngàll ci robb bi dafa koy ñaawal'),
 (array([-3.05175781e-05,  9.15527344e-05, -2.13623047e-04, ...,
          1.46484375e-03,  2.10571289e-03,  2.59399414e-03]),
  'dëkk bi taseeg nag yi daldi wóññeeku'),
 (array([-0.00015259,  0.00021362, -0.00039673, ..., -0.00204468,
         -0.00628662,  0.00183105]),
  'ay noonam ak ay ñoñam'),
 (array([ 3.05175781e-05, -9.15527344e-05,  1.83105469e-04, ...,
          3.66210938e-04, -1.06811523e-03, -1.37329102e-03]),
  'na ñu bëggul ku leen gantu suba teel'),
 (array([0.        , 0.        , 0.        , ..., 0.0027771 , 0.00326538,
         0.00140381]),
  'taatu guy googu la jigéeni ajoor yi di jaaye sanqal')]

## Vocabulary & Tokenizer & Processor

In [60]:
directory = r"SPEECH_TO_TEXT\DATA\CLEANED\WOLOF_AUDIO_TRANS\alffa"
file_name = "alffa_clean_df.csv"

vocab = extract_all_chars(directory, file_name)

In [61]:
# De-duplicate once; the resulting order is arbitrary because it comes from a set.
vocab_list = list(set(vocab["vocab"][0]))

In [62]:
# vocab_dict = {v: tf.constant(k, dtype = tf.int32) for k, v in enumerate(vocab_list)}
vocab_dict = {v: int(k) for k, v in enumerate(vocab_list)}
vocab_dict

{'b': 0,
 'g': 1,
 'e': 2,
 'j': 3,
 'd': 4,
 'l': 5,
 'f': 6,
 'r': 7,
 'k': 8,
 'ç': 9,
 'u': 10,
 'v': 11,
 's': 12,
 'w': 13,
 'm': 14,
 'ë': 15,
 'i': 16,
 'o': 17,
 'y': 18,
 'h': 19,
 'N': 20,
 'é': 21,
 'a': 22,
 'p': 23,
 't': 24,
 'q': 25,
 'c': 26,
 'à': 27,
 'n': 28,
 'ñ': 29,
 'ó': 30,
 'x': 31,
 'ã': 32,
 ' ': 33}

In [63]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [64]:
vocab_dict["<unk>"] = len(vocab_dict)
vocab_dict["<pad>"] = len(vocab_dict)
vocab_size = len(vocab_dict)
print(vocab_size)

36
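
Because the character list above comes from a `set`, the id assigned to each character can change from one Python session to the next, which matters when a saved model is reloaded later. The following is a minimal sketch of a reproducible alternative that sorts the characters first. The function `build_vocab` and the two demo sentences are assumptions made for illustration; they are not part of `gsoc-wav2vec2`.

```python
def build_vocab(transcriptions):
    """Map every character in the corpus to a stable integer id."""
    # Sorting makes the character order (and therefore the ids) reproducible.
    chars = sorted(set(" ".join(transcriptions)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Represent the word separator as "|" instead of a literal space.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Special tokens go at the end, as in the cells above.
    vocab["<unk>"] = len(vocab)
    vocab["<pad>"] = len(vocab)
    return vocab


demo_vocab = build_vocab(["ay noonam", "ak ay ñoñam"])
print(demo_vocab)
```

Running the function twice on the same corpus yields the same mapping, so a `vocab.json` written in one session stays valid in the next.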


In [65]:
import json

# vocab_path = "/kaggle/working/models/vocabs"
vocab_path = r"SPEECH_TO_TEXT\CODES\MODELS\WAV2VEC2\vocabs"
os.makedirs(vocab_path, exist_ok=True)

# Persist the character-to-id mapping so it can be reloaded later.
with open(os.path.join(vocab_path, "vocab.json"), "w") as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [83]:
from wav2vec2 import CTCLoss
from wav2vec2 import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)

In [90]:
def preprocess_text(text):
  label = tokenizer(text)
  return tf.constant(label, dtype=tf.int32)

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  return processor(tf.transpose(audio))

In [91]:
def inputs_generator():
  # Yield one (normalized speech, token ids) pair per sample.
  for speech, text in samples:
    yield preprocess_speech(speech), preprocess_text(text)

## Setting up `tf.data.Dataset`

The following cell sets up a `tf.data.Dataset` object using its `.from_generator(...)` method. We will use the `inputs_generator` function defined in the cell above.

**Note:** For distributed training (especially on TPUs), `.from_generator(...)` doesn't work currently, and it is recommended to train on data stored in `.tfrecord` format. The TFRecords should ideally be stored inside a GCS bucket so that the TPUs can work to their fullest extent.

You can refer to [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/make_tfrecords.py) for more details on how to convert LibriSpeech data into tfrecords.

In [92]:
BUFFER_SIZE = len(wav_files)
SEED = 42

In [93]:
output_signature = (
    tf.TensorSpec(shape=(None,), dtype=tf.float32),  # 1-D speech signal of variable length
    tf.TensorSpec(shape=(None,), dtype=tf.int32),    # 1-D label sequence of variable length
)

dataset = tf.data.Dataset.from_generator(inputs_generator, output_signature=output_signature)

In [94]:
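# Shuffle once and keep that order, so that splitting with `.take(...)`/`.skip(...)`
# yields the same, non-overlapping subsets every time the dataset is iterated.
dataset = dataset.shuffle(BUFFER_SIZE, seed=SEED, reshuffle_each_iteration=False)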

We will split the dataset into batches, so let's prepare them in the following cell. All sequences in a batch must be padded to a constant length. We will use the `.padded_batch(...)` method for that purpose.

In [95]:
dataset = dataset.padded_batch(BATCH_SIZE, padded_shapes=(AUDIO_MAXLEN, LABEL_MAXLEN), padding_values=(0.0, 0))
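
To make the effect of padding concrete, here is a minimal pure-Python sketch that right-pads a batch of variable-length sequences to a fixed length. It only mirrors the idea behind `.padded_batch(...)` with fixed `padded_shapes`; the helper `pad_batch` and the toy label ids are hypothetical.

```python
def pad_batch(sequences, max_len, pad_value):
    """Right-pad every sequence to max_len so the batch has a fixed shape."""
    padded = []
    for seq in sequences:
        if len(seq) > max_len:
            # A sequence longer than the target shape cannot be padded.
            raise ValueError(f"sequence of length {len(seq)} exceeds max_len={max_len}")
        padded.append(list(seq) + [pad_value] * (max_len - len(seq)))
    return padded


labels = [[5, 2, 7], [3, 1]]
batch = pad_batch(labels, max_len=4, pad_value=0)
print(batch)  # [[5, 2, 7, 0], [3, 1, 0, 0]]
```

In the notebook, the same idea applies to both tensors: speech is padded with `0.0` up to `AUDIO_MAXLEN` and labels with `0` up to `LABEL_MAXLEN`.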

Accelerators such as GPUs and TPUs are very fast, so data loading and pre-processing, which run on CPUs, often become the bottleneck during training. This can increase training time significantly, especially when there is a lot of online pre-processing or when data is streamed from GCS buckets. To handle those issues, `tf.data.Dataset` offers the `.prefetch(...)` method, which prepares the next few batches in parallel (on CPUs) while the model processes the current batch (on GPUs/TPUs).

In [96]:
dataset = dataset.prefetch(tf.data.AUTOTUNE)

In [97]:
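# Full-size split. Note that `len(wav_files)` counts clips, while `.take(...)` below counts batches.
num_train_batches = len(wav_files)
num_val_batches = int(len(wav_files)*0.2)

# Small values for a quick smoke test; these override the full-size split above.
num_train_batches = 10
num_val_batches = 4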

train_dataset = dataset.take(num_train_batches)
val_dataset = dataset.skip(num_train_batches).take(num_val_batches)

# Setup Model

In [106]:
pretrained_layer = hub.KerasLayer("https://tfhub.dev/vasudevgupta7/wav2vec2/1", trainable=True)

In [107]:
pretrained_layer1 = tf.keras.models.load_model(pretrained_layer)

ValueError: File format not supported: filepath=<tensorflow_hub.keras_layer.KerasLayer object at 0x0000027DA75C4A10>. Keras 3 only supports V3 `.keras` files and legacy H5 format files (`.h5` extension). Note that the legacy SavedModel format is not supported by `load_model()` in Keras 3. In order to reload a TensorFlow SavedModel as an inference-only layer in Keras 3, use `keras.layers.TFSMLayer(<tensorflow_hub.keras_layer.KerasLayer object at 0x0000027DA75C4A10>, call_endpoint='serving_default')` (note that your `call_endpoint` might have a different name).

In [77]:
# Prepare a directory to store all the checkpoints.
checkpoint_dir = r"SPEECH_TO_TEXT\CODES\MODELS\WAV2VEC2\checkpoints"
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)

# Setup Training

In [99]:
class MyLayer(tf.keras.layers.Layer):
    def call(self, inputs):
        return pretrained_layer(inputs)
    
checkpoints = [checkpoint_dir + "/" + name for name in os.listdir(checkpoint_dir)]
if checkpoints:
    latest_checkpoint = max(checkpoints, key=os.path.getctime)
    print("Restoring from", latest_checkpoint)
    model =  keras.models.load_model(latest_checkpoint)
else:
    print("Creating a new model")
    inputs = tf.keras.Input(shape=(AUDIO_MAXLEN,))
    # hidden_states = pretrained_layer(inputs)
    hidden_states = MyLayer()(inputs)
    outputs = tf.keras.layers.Dense(vocab_size)(hidden_states)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

Creating a new model


In [100]:
model(tf.random.uniform(shape=(BATCH_SIZE, AUDIO_MAXLEN)))
model.summary()

In [101]:
from wav2vec2 import CTCLoss

LEARNING_RATE = 5e-5

loss_fn = CTCLoss(config, (BATCH_SIZE, AUDIO_MAXLEN), division_factor=BATCH_SIZE)
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)

In [102]:
model.compile(optimizer, loss = loss_fn)

In [103]:
history = model.fit(train_dataset, validation_data = val_dataset, epochs = 2)
history.history

Epoch 1/2


TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: uint8, int8, int32, int64

In [38]:
# LEARNING_RATE = 5e-5


# def run_training( 
#         vocab_size, 
#         AUDIO_MAXLEN, 
#         BATCH_SIZE, 
#         SEED,
#         epochs = 1
#     ):
    
#     loss_fn = CTCLoss(config, (2, 246000), division_factor=2)
#     optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)

#     model =  make_or_restore_compiled_model(vocab_size, 246000)
#     model.compile(optimizer, loss = loss_fn)
        
#     callbacks = [
#         # This callback saves a SavedModel every epoch
#         # We include the current epoch in the folder name.
#         tf.keras.callbacks.ModelCheckpoint(
#             filepath=checkpoint_dir + "/ckpt-{epoch}", save_freq="epoch"
#         )
#     ]
#     return model.fit(
#         train_dataset,
#         epochs = epochs,
#         callbacks = callbacks,
#         validation_data = val_dataset,
#         verbose = 2,
#     )

# Model training

To train the model, we compile it with `.compile(...)` and then call the `.fit(...)` method directly.

In [39]:
history = run_training(
        vocab_size, 
        AUDIO_MAXLEN, 
        BATCH_SIZE, 
        SEED,
        epochs = 1
    )
history.history

Creating a new model


TypeError: Exception encountered when calling layer 'keras_layer' (type KerasLayer).

Binding inputs to tf.function failed due to `A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.operations`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```
`. Received args: (<KerasTensor shape=(None, 246000), dtype=float32, sparse=None, name=keras_tensor>,) and kwargs: {} for signature: (args_0: TensorSpec(shape=(None, 246000), dtype=tf.float32, name='speech'), /).

Call arguments received by layer 'keras_layer' (type KerasLayer):
  • inputs=<KerasTensor shape=(None, 246000), dtype=float32, sparse=None, name=keras_tensor>
  • training=None

Let's save our model with the `.save(...)` method so that we can run inference later. You can also export this SavedModel to TFHub by following the [TFHub documentation](https://www.tensorflow.org/hub/publish).

In [None]:
save_dir = "/kaggle/working/models/finetuned-wav2vec2"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
model.save(save_dir, include_optimizer=False)

# Evaluation

Now we will compute the Word Error Rate over the validation dataset.

**Word error rate** (WER) is a common metric for measuring the performance of an automatic speech recognition system. The WER is derived from the Levenshtein distance, working at the word level. Word error rate can then be computed as: WER = (S + D + I) / N = (S + D + I) / (S + D + C), where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N = S + D + C). This value is the number of word-level errors relative to the length of the reference; because insertions are counted, it can exceed 1.

You can refer to [this paper](https://www.isca-speech.org/archive_v0/interspeech_2004/i04_2765.html) to learn more about WER.
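
The formula above can be sketched in a few lines of plain Python using a word-level Levenshtein distance. This is an illustrative implementation under the definition given above, not the code used by the metric library; the function name `word_error_rate` and the example sentences are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "ay noonam ak ay ñoñam"
hypothesis = "ay noonam ay ñoñam yi"
print(word_error_rate(reference, hypothesis))  # one deletion + one insertion over 5 words
```

Here the hypothesis drops "ak" and appends "yi", so two edits over five reference words give a WER of 0.4.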

We will use the `load(...)` function from the [HuggingFace evaluate](https://huggingface.co/docs/evaluate/) library, which replaces the `load_metric(...)` function that was removed from recent releases of `datasets`. Let's first install `evaluate` (together with `jiwer`, which the WER metric depends on) using `pip` and then define the `metric` object.

In [None]:
!pip3 install -q evaluate jiwer

import evaluate
metric = evaluate.load("wer")

In [None]:
@tf.function(jit_compile=True)
def eval_fwd(batch):
  logits = model(batch, training=False)
  return tf.argmax(logits, axis=-1)

In [None]:
from tqdm.auto import tqdm

for speech, labels in tqdm(val_dataset, total=num_val_batches):
    predictions  = eval_fwd(speech)
    predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
    references = [tokenizer.decode(label, group_tokens=False) for label in labels.numpy().tolist()]
    metric.add_batch(references=references, predictions=predictions)

We use the `tokenizer.decode(...)` method to decode both predictions and labels back into text, and add them to the metric so that the WER can be computed afterwards.
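
The evaluation loop decodes predictions with the default settings but passes `group_tokens=False` for the references. The sketch below illustrates why, using a generic greedy CTC decoding scheme: frame-level predictions repeat tokens and need collapsing, while reference labels contain genuine double letters that must be kept. The toy vocabulary and the function `ctc_greedy_decode` are assumptions for illustration and may differ from what `Wav2Vec2Processor.decode` does internally.

```python
from itertools import groupby

# Toy vocabulary (hypothetical): id 0 doubles as the padding / CTC blank token.
ID_TO_TOKEN = {0: "<pad>", 1: "|", 2: "j", 3: "a", 4: "r"}
PAD_ID = 0


def ctc_greedy_decode(ids, group_tokens=True):
    """Turn a sequence of token ids into text."""
    if group_tokens:
        # Collapse runs of identical ids produced by consecutive audio frames.
        ids = [key for key, _ in groupby(ids)]
    # Drop padding/blank tokens, then map "|" back to a space.
    tokens = [ID_TO_TOKEN[i] for i in ids if i != PAD_ID]
    return "".join(tokens).replace("|", " ").strip()


frame_ids = [2, 2, 0, 3, 3, 0, 3, 4, 4, 0, 0]   # frame-level model output
label_ids = [2, 3, 3, 4, 0, 0]                  # padded reference label

print(ctc_greedy_decode(frame_ids))                      # jaar
print(ctc_greedy_decode(label_ids, group_tokens=False))  # jaar
print(ctc_greedy_decode(label_ids))                      # jar (double "a" wrongly merged)
```

Grouping the reference would turn "jaar" into "jar", which is why labels are decoded without it.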

Now, let's calculate the metric value in the following cell:

In [None]:
metric.compute()

# Inference

Now that we are satisfied with the training process and have saved the model in `save_dir`, we will see how this model can be used for inference.

First, we will load our model using `tf.keras.models.load_model(...)`.

In [None]:
finetuned_model = tf.keras.models.load_model(save_dir)

Now, we will read the speech sample using `soundfile.read(...)` and pad it to `AUDIO_MAXLEN` to satisfy the model signature. Then we will normalize that speech sample using the `Wav2Vec2Processor` instance & will feed it into the model.

In [None]:
import numpy as np

speech, _ = sf.read("/kaggle/input/wolof-speech2text/alffa_git/alffa_git/audio/WOL_01_lect_0001.wav")
speech = np.pad(speech, (0, AUDIO_MAXLEN - len(speech)))
speech = tf.expand_dims(processor(tf.constant(speech)), 0)

outputs = finetuned_model(speech)
outputs

Let's decode the predicted ids back into a text sequence using the `tokenizer` instance we defined above.

In [None]:
predictions = tf.argmax(outputs, axis=-1)
predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
predictions