##### Copyright 2021 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [1]:
#@title Copyright 2021 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/hub/tutorials/wav2vec2_saved_model_finetuning"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/hub/tutorials/wav2vec2_saved_model_finetuning.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/hub/tutorials/wav2vec2_saved_model_finetuning.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/hub/tutorials/wav2vec2_saved_model_finetuning.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
  <td>
    <a href="https://tfhub.dev/vasudevgupta7/wav2vec2/1"><img src="https://www.tensorflow.org/images/hub_logo_32px.png" />See TF Hub model</a>
  </td>
</table>

# Fine-tuning Wav2Vec2 with an LM head

In this notebook, we will load the pre-trained wav2vec2 model from [TFHub](https://tfhub.dev) and fine-tune it on the [LibriSpeech dataset](https://huggingface.co/datasets/librispeech_asr) by adding a language modeling (LM) head on top of the pre-trained model. The underlying task is **Automatic Speech Recognition** (ASR): given some speech, the model should transcribe it into text.

## Setting Up

Before running this notebook, make sure you are on a GPU runtime (`Runtime` > `Change runtime type` > `GPU`). The following cell installs the [`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2) package and its dependencies.

In [2]:
!pip3 install -q git+https://github.com/vasudevgupta7/gsoc-wav2vec2@main
!sudo apt-get install -y libsndfile1-dev
!pip3 install -q SoundFile

  Preparing metadata (setup.py) ... done
  Building wheel for wav2vec2 (setup.py) ... done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libsndfile1-dev is already the newest version (1.0.31-2ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


## Model setup using `TFHub`

We will start by importing some libraries/modules.

In [3]:
import os

import tensorflow as tf
import tensorflow_hub as hub
from wav2vec2 import Wav2Vec2Config

config = Wav2Vec2Config()

print("TF version:", tf.__version__)

TF version: 2.18.0


In [4]:
import keras

In [5]:
# Use Keras 2.
version_fn = getattr(keras, "version", None)
if version_fn and version_fn().startswith("3."):
  import tf_keras as keras
else:
  keras = tf.keras

First, we will download the model from TFHub and wrap its signature with [`hub.KerasLayer`](https://www.tensorflow.org/hub/api_docs/python/hub/KerasLayer) so that it can be used like any other Keras layer. `hub.KerasLayer` does both in a single line.

**Note:** A model loaded with `hub.KerasLayer` is somewhat opaque. When you need finer control over the model, you can load it with `tf.keras.models.load_model(...)` instead.

In [6]:
pretrained_layer = hub.KerasLayer("https://tfhub.dev/vasudevgupta7/wav2vec2/1", trainable=True)

You can refer to this [script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/export2hub.py) if you are interested in how the model was exported. The `pretrained_layer` object is the frozen version of [`Wav2Vec2Model`](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/wav2vec2/modeling.py). These pre-trained weights were converted from the HuggingFace PyTorch [pre-trained weights](https://huggingface.co/facebook/wav2vec2-base) using [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/convert_torch_to_tf.py).

Originally, wav2vec2 was pre-trained with a masked language modelling approach, with the objective of identifying the true quantized latent speech representation for a masked time step. You can read more about the training objective in the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477).

Now, we will define a few constants and hyper-parameters that will be used in the next few cells. `AUDIO_MAXLEN` is intentionally set to `246000` because the model signature only accepts a static sequence length of `246000`.

In [7]:
AUDIO_MAXLEN = 246000
LABEL_MAXLEN = 256
BATCH_SIZE = 2

In the following cell, we will combine `pretrained_layer` and a dense layer (the LM head) using the [Keras Functional API](https://www.tensorflow.org/guide/keras/functional).

In [8]:
inputs = keras.Input(shape=(AUDIO_MAXLEN,))
hidden_states = pretrained_layer(inputs)
outputs = keras.layers.Dense(config.vocab_size)(hidden_states)

model = keras.Model(inputs=inputs, outputs=outputs)

The dense layer defined above has an output dimension of `vocab_size` because we want to predict a score (logit) for each token in the vocabulary at each time step.

## Setting up training state

In TensorFlow, model weights are built only when `model.call` or `model.build` is called for the first time, so the following cell builds the model weights by running a forward pass on random input. We will then run `model.summary()` to check the total number of trainable parameters.

In [9]:
model(tf.random.uniform(shape=(BATCH_SIZE, AUDIO_MAXLEN)))
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 246000)]          0         
                                                                 
 keras_layer (KerasLayer)    (None, 768, 768)          94371712  
                                                                 
 dense (Dense)               (None, 768, 32)           24608     
                                                                 
Total params: 94396320 (360.09 MB)
Trainable params: 94396320 (360.09 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
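The `768` in the time dimension of the output shape can be reproduced by hand. The sketch below assumes the convolutional feature encoder of the base wav2vec 2.0 configuration described in the paper (seven unpadded 1-D convolutions with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2). These values are an assumption here and are not read from the TFHub model itself, and the helper name `num_output_frames` is hypothetical.

```python
# Assumed feature-encoder layout of the base wav2vec 2.0 model (from the paper);
# not read from the TFHub SavedModel.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
  """Length of the sequence left after the stack of unpadded 1-D convolutions."""
  length = num_samples
  for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
    length = (length - kernel) // stride + 1
  return length


frames = num_output_frames(246000)  # AUDIO_MAXLEN
seconds = 246000 / 16000            # clip duration at a 16 kHz sampling rate
print(frames, seconds)              # 768 15.375
```

Under these assumptions, one input of `246000` samples (about 15.4 seconds of 16 kHz audio) is reduced to 768 frames, which matches the summary above.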


Now we need to define the loss function and the optimizer to train the model; the following cell does that. We will use the `Adam` optimizer for simplicity. CTC loss is commonly used for tasks like ASR, where sub-parts of the input can't be easily aligned with sub-parts of the output. You can read more about CTC loss in this [blog post](https://distill.pub/2017/ctc/).

`CTCLoss` (from the [`gsoc-wav2vec2`](https://github.com/vasudevgupta7/gsoc-wav2vec2) package) accepts 3 arguments: `config`, `model_input_shape` and `division_factor`. If `division_factor=1`, the loss is simply summed over the batch, so pass `division_factor=BATCH_SIZE` to get the mean over the batch. The `CustomLoss` subclass below only casts the labels to `tf.int32` before delegating to `CTCLoss`.

In [10]:
from wav2vec2 import CTCLoss

LEARNING_RATE = 5e-5

class CustomLoss(CTCLoss):
  """CTCLoss variant that casts labels to int32 before computing the loss."""

  def __init__(self, config, model_input_shape, division_factor=1):
    super().__init__(config, model_input_shape, division_factor)

  def call(self, y_true, y_pred):
    # Cast the labels to int32 before handing them to CTCLoss.
    y_true = tf.cast(y_true, tf.int32)
    return super().call(y_true, y_pred)

loss_fn = CustomLoss(config, (BATCH_SIZE, AUDIO_MAXLEN), division_factor=BATCH_SIZE)
optimizer = keras.optimizers.Adam(LEARNING_RATE)

#loss_fn = CTCLoss(config, (BATCH_SIZE, AUDIO_MAXLEN), division_factor=BATCH_SIZE)
#optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
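
To make the role of `division_factor` concrete, here is a minimal sketch of the sum-then-divide reduction described above, using made-up per-example loss values. The function `reduce_loss` is hypothetical and only mimics the behaviour stated for `CTCLoss`; it is not the package's implementation.

```python
def reduce_loss(per_example_losses, division_factor=1):
  """Sum the per-example losses, then divide by `division_factor`."""
  return sum(per_example_losses) / division_factor


batch_losses = [900.0, 1100.0]  # made-up CTC losses for a batch of 2

print(reduce_loss(batch_losses))                     # plain sum: 2000.0
print(reduce_loss(batch_losses, division_factor=2))  # mean over the batch: 1000.0
```

With `division_factor=BATCH_SIZE`, the reported loss no longer scales with the batch size, which makes values easier to compare between runs.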

## Loading & Pre-processing data

Let's now download the LibriSpeech dataset from the [official website](http://www.openslr.org/12) and set it up.

In [11]:
!wget https://www.openslr.org/resources/12/dev-clean.tar.gz -P ./data/train/
!tar -xf ./data/train/dev-clean.tar.gz -C ./data/train/

--2025-05-01 21:02:55--  https://www.openslr.org/resources/12/dev-clean.tar.gz
Resolving www.openslr.org (www.openslr.org)... 46.101.158.64
Connecting to www.openslr.org (www.openslr.org)|46.101.158.64|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://openslr.elda.org/resources/12/dev-clean.tar.gz [following]
--2025-05-01 21:02:56--  https://openslr.elda.org/resources/12/dev-clean.tar.gz
Resolving openslr.elda.org (openslr.elda.org)... 141.94.109.138, 2001:41d0:203:ad8a::
Connecting to openslr.elda.org (openslr.elda.org)|141.94.109.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 337926286 (322M) [application/x-gzip]
Saving to: ‘./data/train/dev-clean.tar.gz’


2025-05-01 21:03:12 (20.4 MB/s) - ‘./data/train/dev-clean.tar.gz’ saved [337926286/337926286]



**Note:** We are using the `dev-clean` subset because this notebook is for demonstration purposes only and needs just a small amount of data. The complete training data can be downloaded from the [LibriSpeech website](http://www.openslr.org/12).

In [12]:
ls ./data/train/

dev-clean.tar.gz  LibriSpeech/


Our dataset lies in the LibriSpeech directory. Let's explore these files.

In [13]:
data_dir = "./data/train/LibriSpeech/dev-clean/2428/83705/"
all_files = os.listdir(data_dir)

flac_files = [f for f in all_files if f.endswith(".flac")]
txt_files = [f for f in all_files if f.endswith(".txt")]

print("Transcription files:", txt_files, "\nSound files:", flac_files)

Transcription files: ['2428-83705.trans.txt'] 
Sound files: ['2428-83705-0016.flac', '2428-83705-0031.flac', '2428-83705-0037.flac', '2428-83705-0033.flac', '2428-83705-0005.flac', '2428-83705-0038.flac', '2428-83705-0034.flac', '2428-83705-0013.flac', '2428-83705-0011.flac', '2428-83705-0017.flac', '2428-83705-0029.flac', '2428-83705-0023.flac', '2428-83705-0002.flac', '2428-83705-0021.flac', '2428-83705-0024.flac', '2428-83705-0043.flac', '2428-83705-0015.flac', '2428-83705-0028.flac', '2428-83705-0027.flac', '2428-83705-0039.flac', '2428-83705-0008.flac', '2428-83705-0042.flac', '2428-83705-0020.flac', '2428-83705-0018.flac', '2428-83705-0022.flac', '2428-83705-0036.flac', '2428-83705-0025.flac', '2428-83705-0041.flac', '2428-83705-0004.flac', '2428-83705-0014.flac', '2428-83705-0006.flac', '2428-83705-0003.flac', '2428-83705-0007.flac', '2428-83705-0010.flac', '2428-83705-0012.flac', '2428-83705-0030.flac', '2428-83705-0026.flac', '2428-83705-0032.flac', '2428-83705-0000.flac', '24

Alright, so each sub-directory has many `.flac` files and a `.txt` file. The `.txt` file contains text transcriptions for all the speech samples (i.e. `.flac` files) present in that sub-directory.

We can load this text data as follows:

In [14]:
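def read_txt_file(file_path):
  # Each line looks like "<file_id> <transcription>".
  with open(file_path, "r") as f:
    lines = f.read().split("\n")
  # Keep only lines with more than two whitespace-separated tokens,
  # mapping the first token (file ID) to the remaining text.
  samples = {s.split()[0]: " ".join(s.split()[1:]) for s in lines if len(s.split()) > 2}
  return samples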
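
Each line of a `.trans.txt` file starts with the file ID, followed by the transcription. As an illustration of that layout, the sketch below parses an in-memory string with made-up IDs and text; `parse_transcripts` is a hypothetical helper and not part of the notebook.

```python
def parse_transcripts(text):
  """Map file IDs to transcriptions, one `<id> <transcription>` pair per line."""
  mapping = {}
  for line in text.splitlines():
    parts = line.split(maxsplit=1)  # split off the ID, keep the rest intact
    if len(parts) == 2:
      file_id, transcription = parts
      mapping[file_id] = transcription.strip()
  return mapping


# Made-up example content, only for illustration.
example = "0000-00000-0000 HELLO WORLD\n0000-00000-0001 GOOD MORNING TO YOU\n"
print(parse_transcripts(example))
```

Unlike `read_txt_file`, this variant also keeps transcriptions that consist of a single word.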

Similarly, we will define a function for loading a speech sample from a `.flac` file.

`REQUIRED_SAMPLE_RATE` is set to `16000` because wav2vec2 was pre-trained on audio sampled at 16 kHz, and it's recommended to fine-tune it without any major change in the data distribution caused by a different sampling rate.

In [15]:
import soundfile as sf

REQUIRED_SAMPLE_RATE = 16000

def read_flac_file(file_path):
  with open(file_path, "rb") as f:
    audio, sample_rate = sf.read(f)
  if sample_rate != REQUIRED_SAMPLE_RATE:
    raise ValueError(
        f"sample rate (={sample_rate}) of your files must be {REQUIRED_SAMPLE_RATE}"
    )
  file_id = os.path.split(file_path)[-1][:-len(".flac")]
  return {file_id: audio}

Now, we will pick a random sample, print its transcription and listen to the audio.

In [16]:
from IPython.display import Audio
import random

file_id = random.choice([f[:-len(".flac")] for f in flac_files])
flac_file_path, txt_file_path = os.path.join(data_dir, f"{file_id}.flac"), os.path.join(data_dir, "2428-83705.trans.txt")

print("Text Transcription:", read_txt_file(txt_file_path)[file_id], "\nAudio:")
Audio(filename=flac_file_path)

Text Transcription: BUT WHY ON THAT ACCOUNT THEY SHOULD PITY ME I ALTOGETHER FAIL TO UNDERSTAND 
Audio:


Next, we will define a function that pairs every speech sample with its transcription.

In [17]:
def fetch_sound_text_mapping(data_dir):
  all_files = os.listdir(data_dir)

  flac_files = [os.path.join(data_dir, f) for f in all_files if f.endswith(".flac")]
  txt_files = [os.path.join(data_dir, f) for f in all_files if f.endswith(".txt")]

  txt_samples = {}
  for f in txt_files:
    txt_samples.update(read_txt_file(f))

  speech_samples = {}
  for f in flac_files:
    speech_samples.update(read_flac_file(f))

  assert len(txt_samples) == len(speech_samples)

  samples = [(speech_samples[file_id], txt_samples[file_id]) for file_id in speech_samples.keys() if len(speech_samples[file_id]) < AUDIO_MAXLEN]
  return samples
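
`fetch_sound_text_mapping` handles a single directory. To cover the whole `dev-clean` subset you would need every directory that contains `.flac` files; a minimal sketch using `os.walk` follows. `find_audio_dirs` is a hypothetical helper, and the demo builds a throwaway directory tree so that it runs without the dataset.

```python
import os
import tempfile


def find_audio_dirs(root):
  """Return the sorted directories under `root` that contain `.flac` files."""
  audio_dirs = []
  for dirpath, _, filenames in os.walk(root):
    if any(name.endswith(".flac") for name in filenames):
      audio_dirs.append(dirpath)
  return sorted(audio_dirs)


# Demo on a temporary tree with nested sub-directories (made-up names).
with tempfile.TemporaryDirectory() as root:
  for outer, inner in [("1", "10"), ("1", "11"), ("2", "20")]:
    leaf_dir = os.path.join(root, outer, inner)
    os.makedirs(leaf_dir)
    open(os.path.join(leaf_dir, "dummy.flac"), "w").close()
  found = [os.path.relpath(d, root) for d in find_audio_dirs(root)]

print(found)
```

Calling `fetch_sound_text_mapping` on each returned directory and concatenating the results would then give samples for the full subset.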

It's time to have a look at a few samples ...

In [18]:
samples = fetch_sound_text_mapping(data_dir)
samples[:5]

[(array([-0.00036621, -0.00015259, -0.00012207, ..., -0.0005188 ,
         -0.00048828, -0.00048828]),
  'THERE WERE NO SIGNS OF FALTERING ABOUT HER FLOW OF LANGUAGE'),
 (array([ 0.00018311,  0.00021362,  0.00021362, ..., -0.00036621,
         -0.00036621, -0.00036621]),
  'SUCH IS THE SELFISHNESS OF HUMAN NATURE'),
 (array([-6.10351562e-05, -6.10351562e-05, -3.05175781e-05, ...,
         -2.13623047e-04, -9.15527344e-05, -3.05175781e-05]),
  'I CANNOT PRETEND TO EXPLAIN WHY EXCEPT ON THE SUPPOSITION THAT ROMANCE IS DEAD AT LEAST IN THAT CIRCLE OF SOCIETY IN WHICH THE SNELLINGS MOVE'),
 (array([ 2.74658203e-04,  2.74658203e-04,  1.22070312e-04, ...,
         -1.83105469e-04, -1.22070312e-04, -9.15527344e-05]),
  'WE HAVE ALL BEEN GIVING MARY ANN PRESENTS AND I SUPPOSE YOU MISTER WHITING HAVE BEEN GIVING HER SOMETHING TOO'),
 (array([-0.00201416, -0.0022583 , -0.00234985, ...,  0.00137329,
          0.0012207 ,  0.00109863]),
  'FOR INSTANCE LOOK AT THEIR BEHAVIOUR IN THE MATTER OF THE 

**Note:** We are loading this data into memory because we are working with a small amount of data in this notebook. For training on the complete dataset (~300 GB), you will have to load the data lazily. You can refer to [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/data_utils.py) to learn more about that.

Let's pre-process the data now!

We will first define the tokenizer and processor using the `gsoc-wav2vec2` package and then do some very simple pre-processing. `processor` normalizes the raw speech along the frames axis, and `tokenizer` converts text into token IDs for training and decodes model outputs back into strings (using the defined vocabulary), taking care of removing special tokens (depending on your tokenizer configuration).

In [19]:
from wav2vec2 import Wav2Vec2Processor
tokenizer = Wav2Vec2Processor(is_tokenizer=True)
processor = Wav2Vec2Processor(is_tokenizer=False)

def preprocess_text(text):
  label = tokenizer(text)
  return tf.constant(label, dtype=tf.int32)

def preprocess_speech(audio):
  audio = tf.constant(audio, dtype=tf.float32)
  return processor(tf.transpose(audio))

Downloading `vocab.json` from https://github.com/vasudevgupta7/gsoc-wav2vec2/raw/main/data/vocab.json ... DONE
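
As a rough picture of what the normalization step does, the sketch below standardizes a waveform to zero mean and unit variance along its only axis. This is an assumption about the general technique; the exact formula and epsilon used inside `Wav2Vec2Processor` may differ, and `normalize` is a hypothetical name.

```python
import math


def normalize(samples, eps=1e-5):
  """Scale a 1-D list of samples to zero mean and (approximately) unit variance."""
  mean = sum(samples) / len(samples)
  variance = sum((x - mean) ** 2 for x in samples) / len(samples)
  return [(x - mean) / math.sqrt(variance + eps) for x in samples]


normalized = normalize([0.5, -0.25, 0.75, 0.0])  # made-up samples
print(sum(normalized) / len(normalized))         # close to 0
```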


Now, we will define a Python generator that calls the preprocessing functions defined in the cells above.

In [20]:
def inputs_generator():
  for speech, text in samples:
    yield preprocess_speech(speech), preprocess_text(text)

## Setting up `tf.data.Dataset`

The following cell sets up a `tf.data.Dataset` object using its `.from_generator(...)` method. We will use the `inputs_generator` function defined in the cell above.

**Note:** For distributed training (especially on TPUs), `.from_generator(...)` doesn't currently work, and it is recommended to train on data stored in `.tfrecord` format. The TFRecords should ideally be stored in a GCS bucket so that the TPUs can work to the fullest extent.

You can refer to [this script](https://github.com/vasudevgupta7/gsoc-wav2vec2/blob/main/src/make_tfrecords.py) for more details on how to convert LibriSpeech data into TFRecords.

In [21]:
output_signature = (
    tf.TensorSpec(shape=(None,), dtype=tf.float32),  # 1-D speech of variable length
    tf.TensorSpec(shape=(None,), dtype=tf.int32),  # 1-D label IDs of variable length
)

dataset = tf.data.Dataset.from_generator(inputs_generator, output_signature=output_signature)

In [22]:
BUFFER_SIZE = len(flac_files)
SEED = 42

dataset = dataset.shuffle(BUFFER_SIZE, seed=SEED)

We will feed the dataset to the model in batches, so let's prepare them in the following cell. All sequences in a batch must be padded to a constant length; we will use the `.padded_batch(...)` method for that purpose.

In [23]:
dataset = dataset.padded_batch(BATCH_SIZE, padded_shapes=(AUDIO_MAXLEN, LABEL_MAXLEN), padding_values=(0.0, 0))
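
For intuition, here is a plain-Python sketch of what padding a batch to a fixed length means: every sequence is extended with a padding value until it reaches the target length. The helper `pad_to_length` and the toy sizes are hypothetical; the notebook itself relies on `.padded_batch(...)`.

```python
def pad_to_length(sequence, target_length, padding_value):
  """Right-pad `sequence` with `padding_value` up to `target_length`."""
  if len(sequence) > target_length:
    raise ValueError("sequence is longer than the target length")
  return list(sequence) + [padding_value] * (target_length - len(sequence))


# Toy batch of (speech, label) pairs whose lengths differ per example.
batch = [([0.1, 0.2, 0.3], [5, 6]), ([0.4], [7, 8, 9])]
speech_batch = [pad_to_length(speech, 4, 0.0) for speech, _ in batch]
label_batch = [pad_to_length(label, 3, 0) for _, label in batch]

print(speech_batch)  # [[0.1, 0.2, 0.3, 0.0], [0.4, 0.0, 0.0, 0.0]]
print(label_batch)   # [[5, 6, 0], [7, 8, 9]]
```

Padding can only lengthen a sequence, which is why speech samples with `AUDIO_MAXLEN` or more values were filtered out in `fetch_sound_text_mapping`.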

Accelerators (like GPUs/TPUs) are very fast, and data loading (and pre-processing) often becomes the bottleneck during training because it happens on CPUs. This can increase the training time significantly, especially when a lot of online pre-processing is involved or data is streamed from GCS buckets. To handle this, `tf.data.Dataset` offers the `.prefetch(...)` method, which prepares the next few batches in parallel (on CPUs) while the model is processing the current batch (on GPUs/TPUs).

In [24]:
dataset = dataset.prefetch(tf.data.AUTOTUNE)

Since this notebook is made for demonstration purposes, we will take the first `num_train_batches` batches and train only on those. You are encouraged to train on the whole dataset, though. Similarly, we will evaluate on only `num_val_batches` batches.

In [25]:
num_train_batches = 10
num_val_batches = 4

train_dataset = dataset.take(num_train_batches)
val_dataset = dataset.skip(num_train_batches).take(num_val_batches)

## Model training

To train the model, we will call the `.fit(...)` method directly after compiling the model with `.compile(...)`.

In [26]:
model.compile(optimizer, loss=loss_fn)

The above cell will set up our training state. Now we can initiate training with the `.fit(...)` method.

### Calculating carbon emissions during model training

In [None]:
!pip install codecarbon
from codecarbon import EmissionsTracker

In [28]:
# Start emissions tracking.
tracker = EmissionsTracker()
tracker.start()

# Train the model.
history = model.fit(train_dataset, validation_data=val_dataset, epochs=3)

# Stop emissions tracking; `stop()` returns the estimated emissions.
emissions: float = tracker.stop()
print(f"Emissions during training: {emissions} kg CO2")

history.history

[codecarbon INFO @ 21:03:25] [setup] RAM Tracking...
[codecarbon INFO @ 21:03:25] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 21:03:26] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.00GHz
[codecarbon INFO @ 21:03:26] [setup] GPU Tracking...
[codecarbon INFO @ 21:03:26] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 21:03:26] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 21:03:26] >>> Tracker's metadata:
[codecarbon INFO @ 21:03:26]   Platform system: Linux-6.1.123+-x86_64-with-glibc2.35
[codecarbon INFO @ 21:03:26]   Python version: 3.11.12
[codecarbon INFO @ 21:03:26]   CodeCarbon version: 3.0.1
[codecarbon INFO @ 21:03:26]   Available RAM : 12.674 GB
[codecarbon INF

Epoch 1/3


[codecarbon INFO @ 21:03:41] Energy consumed for RAM : 0.000042 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:03:41] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:03:41] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 21:03:41] Energy consumed for all GPUs : 0.000109 kWh. Total GPU Power : 26.107486684038413 W
[codecarbon INFO @ 21:03:41] 0.000328 kWh of electricity used since the beginning.
[codecarbon INFO @ 21:03:56] Energy consumed for RAM : 0.000083 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:03:56] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:03:56] Energy consumed for All CPU : 0.000354 kWh
[codecarbon INFO @ 21:03:56] Energy consumed for all GPUs : 0.000218 kWh. Total GPU Power : 26.249465245085208 W
[codecarbon INFO @ 21:03:56] 0.000656 kWh of electricity used since the beginning.
[codecarbon INFO @ 21:04:11] Energy consumed for RAM : 0.000125 kWh. RAM Power :

      7/Unknown - 59s 2s/step - loss: 1173.4866

[codecarbon INFO @ 21:04:26] Energy consumed for RAM : 0.000167 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:04:26] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:04:26] Energy consumed for All CPU : 0.000708 kWh
[codecarbon INFO @ 21:04:26] Energy consumed for all GPUs : 0.000532 kWh. Total GPU Power : 47.64901333130055 W
[codecarbon INFO @ 21:04:26] 0.001407 kWh of electricity used since the beginning.


     10/Unknown - 64s 2s/step - loss: 1206.2732

[codecarbon INFO @ 21:04:41] Energy consumed for RAM : 0.000208 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:04:41] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:04:41] Energy consumed for All CPU : 0.000885 kWh
[codecarbon INFO @ 21:04:41] Energy consumed for all GPUs : 0.000695 kWh. Total GPU Power : 39.261263020145535 W
[codecarbon INFO @ 21:04:41] 0.001789 kWh of electricity used since the beginning.


Epoch 2/3

[codecarbon INFO @ 21:04:56] Energy consumed for RAM : 0.000250 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:04:56] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:04:56] Energy consumed for All CPU : 0.001062 kWh
[codecarbon INFO @ 21:04:56] Energy consumed for all GPUs : 0.000898 kWh. Total GPU Power : 48.72690254817144 W
[codecarbon INFO @ 21:04:56] 0.002211 kWh of electricity used since the beginning.




[codecarbon INFO @ 21:05:11] Energy consumed for RAM : 0.000292 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:05:11] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:05:11] Energy consumed for All CPU : 0.001239 kWh
[codecarbon INFO @ 21:05:11] Energy consumed for all GPUs : 0.001065 kWh. Total GPU Power : 39.98756534535027 W
[codecarbon INFO @ 21:05:11] 0.002596 kWh of electricity used since the beginning.


Epoch 3/3
 1/10 [==>...........................] - ETA: 15s - loss: 807.0832

[codecarbon INFO @ 21:05:26] Energy consumed for RAM : 0.000333 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:05:26] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:05:26] Energy consumed for All CPU : 0.001416 kWh
[codecarbon INFO @ 21:05:26] Energy consumed for all GPUs : 0.001188 kWh. Total GPU Power : 29.579990231057984 W
[codecarbon INFO @ 21:05:26] 0.002938 kWh of electricity used since the beginning.
[codecarbon INFO @ 21:05:26] 0.008546 g.CO2eq/s mean an estimation of 269.4998336354521 kg.CO2eq/year




[codecarbon INFO @ 21:05:41] Energy consumed for RAM : 0.000375 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:05:41] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 21:05:41] Energy consumed for All CPU : 0.001593 kWh
[codecarbon INFO @ 21:05:41] Energy consumed for all GPUs : 0.001405 kWh. Total GPU Power : 52.035878885682536 W
[codecarbon INFO @ 21:05:41] 0.003373 kWh of electricity used since the beginning.




[codecarbon INFO @ 21:05:51] Energy consumed for RAM : 0.000402 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 21:05:51] Delta energy consumed for CPU with constant : 0.000115 kWh, power : 42.5 W
[codecarbon INFO @ 21:05:51] Energy consumed for All CPU : 0.001708 kWh
[codecarbon INFO @ 21:05:51] Energy consumed for all GPUs : 0.001505 kWh. Total GPU Power : 36.99660877414437 W
[codecarbon INFO @ 21:05:51] 0.003615 kWh of electricity used since the beginning.


Emissions during training: 0.0012625974020339925 kg CO2


{'loss': [1206.273193359375, 946.9650268554688, 505.2660217285156],
 'val_loss': [966.91357421875, 839.6303100585938, 307.30877685546875]}

In [29]:
emissions_data = tracker.final_emissions_data
print(f"Energy consumed: {emissions_data.energy_consumed:.6f} kWh")
print(f"Execution Time: {emissions_data.duration:.2f} seconds")
print(f"Emissions: {emissions_data.emissions} kg CO2")

Energy consumed: 0.003615 kWh
Execution Time: 144.81 seconds
Emissions: 0.0012625974020339925 kg CO2
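
The numbers reported above can be related to each other with a little arithmetic. The sketch below derives the average power draw and the carbon intensity implied by this particular run; it only reuses the values printed above and does not call any CodeCarbon API.

```python
energy_kwh = 0.003615                 # "Energy consumed" printed above
duration_s = 144.81                   # "Execution Time" printed above
emissions_kg = 0.0012625974020339925  # "Emissions" printed above

# Average power over the run: convert kWh to joules, then divide by seconds.
average_power_w = energy_kwh * 3.6e6 / duration_s

# Carbon intensity implied by the run, in kg CO2 per kWh.
implied_intensity = emissions_kg / energy_kwh

print(f"Average power: {average_power_w:.1f} W")
print(f"Implied carbon intensity: {implied_intensity:.3f} kg CO2 per kWh")
```

These figures describe this single short run only; a different runtime, hardware or region would likely give different values.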


Let's save our model with the `.save(...)` method so that we can perform inference later. You can also export this SavedModel to TFHub by following the [TFHub documentation](https://www.tensorflow.org/hub/publish).

In [30]:
save_dir = "finetuned-wav2vec2"
model.save(save_dir, include_optimizer=False)

Note: We are setting `include_optimizer=False` as we want to use this model for inference only.

## Evaluation

Now we will compute the Word Error Rate over the validation dataset.

**Word error rate** (WER) is a common metric for measuring the performance of an automatic speech recognition system. The WER is derived from the Levenshtein distance, working at the word level. It is computed as WER = (S + D + I) / N = (S + D + I) / (S + D + C), where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N = S + D + C). The value expresses the number of word-level errors relative to the length of the reference; lower is better, and it can exceed 1 when the prediction contains many insertions.

You can refer to [this paper](https://www.isca-speech.org/archive_v0/interspeech_2004/i04_2765.html) to learn more about WER.
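
To make the formula concrete, the sketch below computes WER with a word-level Levenshtein distance. It is a minimal illustration with a hypothetical function name (`word_error_rate`) and made-up sentences, and it assumes a non-empty reference; the notebook itself uses a ready-made metric.

```python
def word_error_rate(reference, prediction):
  """WER = (S + D + I) / N, from the word-level edit distance."""
  ref_words = reference.split()
  pred_words = prediction.split()
  # previous[j] holds the edit distance between the reference words seen so
  # far and the first j predicted words.
  previous = list(range(len(pred_words) + 1))
  for i, ref_word in enumerate(ref_words, start=1):
    current = [i]
    for j, pred_word in enumerate(pred_words, start=1):
      substitution = previous[j - 1] + (ref_word != pred_word)
      deletion = previous[j] + 1
      insertion = current[j - 1] + 1
      current.append(min(substitution, deletion, insertion))
    previous = current
  return previous[-1] / len(ref_words)


print(word_error_rate("THE CAT SAT DOWN", "THE CAT SAT DOWN"))  # 0.0
print(word_error_rate("THE CAT SAT DOWN", "THE HAT SAT"))       # 0.5
print(word_error_rate("THE CAT SAT DOWN", ""))                  # 1.0
```

An empty prediction yields a WER of 1.0, because every reference word counts as a deletion.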

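We will use the `evaluate.load(...)` function from the HuggingFace `evaluate` library to obtain the metric. The `load_metric(...)` function from the [HuggingFace datasets](https://huggingface.co/docs/datasets/) library, kept commented out in the first cell below, was used for this previously. Let's install the `evaluate` library using `pip` and then define the `metric` object.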

In [31]:
#!pip3 install -q datasets

#from datasets import load_metric
#metric = load_metric("wer")

In [32]:
!pip3 install evaluate
import evaluate
metric = evaluate.load("wer")

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.18-py311-none-any.whl.metadata (7.5 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec>=2021.05.0 (from fsspec[http]>=2021.05.0->evaluate)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

In [33]:
@tf.function(jit_compile=True)
def eval_fwd(batch):
  logits = model(batch, training=False)
  return tf.argmax(logits, axis=-1)

It's time to run the evaluation on validation data now.

In [34]:
from tqdm.auto import tqdm

for speech, labels in tqdm(val_dataset, total=num_val_batches):
  predictions = eval_fwd(speech)
  predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
  references = [tokenizer.decode(label, group_tokens=False) for label in labels.numpy().tolist()]
  metric.add_batch(references=references, predictions=predictions)

  0%|          | 0/4 [00:00<?, ?it/s]

We use the `tokenizer.decode(...)` method to decode our predictions and labels back into text, and add them to the metric for the `WER` computation later.

Now, let's calculate the metric value in the following cell:

In [35]:
metric.compute()

1.0

**Note:** The metric value is not meaningful here because the model was trained on very little data, and ASR-like tasks often require a large amount of data to learn a mapping from speech to text. You should train on much more data to get good results. This notebook gives you a template for fine-tuning a pre-trained speech model.

## Inference

Now that we are satisfied with the training process & have saved the model in `save_dir`, we will see how this model can be used for inference.

First, we will load our model using `tf.keras.models.load_model(...)`.

In [36]:
finetuned_model = keras.models.load_model(save_dir)



Let's download a speech sample for performing inference. You can also replace the following sample with your own speech sample.

In [37]:
!wget https://github.com/vasudevgupta7/gsoc-wav2vec2/raw/main/data/SA2.wav

--2025-05-01 21:07:10--  https://github.com/vasudevgupta7/gsoc-wav2vec2/raw/main/data/SA2.wav
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/thevasudevgupta/gsoc-wav2vec2/raw/main/data/SA2.wav [following]
--2025-05-01 21:07:10--  https://github.com/thevasudevgupta/gsoc-wav2vec2/raw/main/data/SA2.wav
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thevasudevgupta/gsoc-wav2vec2/main/data/SA2.wav [following]
--2025-05-01 21:07:10--  https://raw.githubusercontent.com/thevasudevgupta/gsoc-wav2vec2/main/data/SA2.wav
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HT

Now, we will read the speech sample using `soundfile.read(...)` and pad it to `AUDIO_MAXLEN` to satisfy the model signature. Then we will normalize that speech sample using the `Wav2Vec2Processor` instance & will feed it into the model.

In [38]:
import numpy as np

speech, _ = sf.read("SA2.wav")
speech = np.pad(speech, (0, AUDIO_MAXLEN - len(speech)))
speech = tf.expand_dims(processor(tf.constant(speech)), 0)

outputs = finetuned_model(speech)
outputs

<tf.Tensor: shape=(1, 768, 32), dtype=float32, numpy=
array([[[ 6.437108  , -0.43524763, -0.9372016 , ..., -0.02225186,
         -0.1516123 , -0.44300503],
        [ 6.4488845 , -0.46492705, -1.0301777 , ..., -0.01199191,
         -0.11294978, -0.5281335 ],
        [ 6.4118214 , -0.40753633, -1.0161344 , ..., -0.02898878,
         -0.18163836, -0.3958183 ],
        ...,
        [ 6.4599614 , -0.59491956, -1.0304283 , ...,  0.13192573,
         -0.25528634, -0.5739558 ],
        [ 6.465603  , -0.5894994 , -1.0314901 , ...,  0.12987547,
         -0.2556316 , -0.5858758 ],
        [ 6.449605  , -0.57871133, -1.0426371 , ...,  0.13068128,
         -0.26076162, -0.6023362 ]]], dtype=float32)>

Let's decode the numbers back into a text sequence using the `tokenizer` instance we defined above.

In [39]:
predictions = tf.argmax(outputs, axis=-1)
predictions = [tokenizer.decode(pred) for pred in predictions.numpy().tolist()]
predictions

['']
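
For intuition about the decoding step, the sketch below applies the usual greedy CTC rule to a sequence of per-frame token IDs: merge consecutive repeats, then drop the blank token. The vocabulary, the blank ID of `0`, and the function `ctc_greedy_decode` are all assumptions made for this illustration; the actual mapping lives in the package's `vocab.json` and `tokenizer.decode(...)`.

```python
from itertools import groupby

# Hypothetical toy vocabulary; ID 0 is assumed to be the CTC blank token.
TOY_VOCAB = {0: "", 1: "C", 2: "A", 3: "T", 4: " "}
BLANK_ID = 0


def ctc_greedy_decode(frame_ids):
  """Collapse repeated IDs, remove blanks, and map the rest to characters."""
  collapsed = [token_id for token_id, _ in groupby(frame_ids)]
  return "".join(TOY_VOCAB[i] for i in collapsed if i != BLANK_ID)


print(repr(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3])))  # 'CAT'
print(repr(ctc_greedy_decode([0, 0, 0, 0])))                 # ''
```

Under these assumptions, a model that predicts the blank token at every frame decodes to an empty string, which is one way to end up with an output like `['']`.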

This prediction is not meaningful because the model was trained on very little data in this notebook (which is not meant for complete training). You should get good predictions if you train this model on the complete LibriSpeech dataset.

We have reached the end of this notebook, but not the end of learning TensorFlow for speech-related tasks: this [repository](https://github.com/tulasiram58827/TTS_TFLite) contains some more tutorials. If you encounter any bug in this notebook, please create an issue [here](https://github.com/vasudevgupta7/gsoc-wav2vec2/issues).