# Import the necessary Libraries

In [1]:
import os
import random
from glob import glob
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers



# Define the Transformer Input Layer

In [2]:
class TokenEmbedding(layers.Layer):
    def __init__(self, num_vocab=1000, maxlen=100, num_hid=64):
        super().__init__()
        self.emb = tf.keras.layers.Embedding(num_vocab, num_hid)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=num_hid)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        x = self.emb(x)
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        return x + positions

1. **Class Definition (`TokenEmbedding`):**
   - A custom Keras layer is defined, named `TokenEmbedding`.
   - The layer is designed to create token embeddings for input sequences in a transformer model.

2. **`__init__` Method:**
   - The constructor method initializes the layer's attributes when an instance is created.
   - `num_vocab`: Number of unique tokens in the vocabulary. Default is 1000.
   - `maxlen`: Maximum sequence length. Default is 100.
   - `num_hid`: Number of dimensions for the token embeddings. Default is 64.

3. **Embedding Layers Initialization:**
   - Inside the constructor, two embedding layers are initialized:
     - `self.emb`: An embedding layer to create token embeddings based on the vocabulary size (`num_vocab`) and embedding dimensions (`num_hid`).
     - `self.pos_emb`: An embedding layer to create positional embeddings based on the sequence length (`maxlen`) and embedding dimensions (`num_hid`).

4. **`call` Method:**
   - The `call` method defines the forward pass of the layer, where the actual computations are performed.
   - `x` is the input tensor representing a sequence of token indices.

5. **Token Embeddings:**
   - `x = self.emb(x)`: The input token indices are passed through the embedding layer to create token embeddings. Each token index is mapped to a continuous vector representation.

6. **Positional Embeddings:**
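   - `maxlen = tf.shape(x)[-1]`: Reads the actual length of the input sequences from the last dimension of `x`. This local `maxlen` shadows the constructor argument and should not exceed it, because the positional embedding table has only that many rows.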
   - `positions = tf.range(start=0, limit=maxlen, delta=1)`: Creates a sequence of integers from 0 to `maxlen - 1` to represent positions in the sequence.
   - `positions = self.pos_emb(positions)`: Passes the position indices through the positional embedding layer to create positional embeddings.

7. **Combining Token and Positional Embeddings:**
   - `return x + positions`: Adds the token embeddings and positional embeddings element-wise. This is a common step in transformer models to inject positional information into the token embeddings.

The `TokenEmbedding` layer holds two embedding tables: one indexed by token and one indexed by position. The `call` method adds the two lookups so that each output vector carries both token-specific and positional information. This layer serves as the decoder-side input layer of the transformer model.
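To make the lookup-and-add step concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: the helper names (`make_table`, `embed`) and the random tables are illustrative assumptions, standing in for weights that a real layer would learn during training.

```python
import random


def make_table(num_rows, dim, seed):
    """Build a random lookup table standing in for learned embedding weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def embed(batch, tok_table, pos_table):
    """Return token embedding + positional embedding for every token in the batch."""
    out = []
    for seq in batch:
        out.append(
            [
                [t + p for t, p in zip(tok_table[tok], pos_table[pos])]
                for pos, tok in enumerate(seq)
            ]
        )
    return out


tok_table = make_table(num_rows=10, dim=4, seed=0)  # plays the role of num_vocab=10, num_hid=4
pos_table = make_table(num_rows=6, dim=4, seed=1)   # plays the role of maxlen=6, num_hid=4

batch = [[3, 3, 7], [1, 2, 3]]  # two sequences of three token indices
vectors = embed(batch, tok_table, pos_table)

# Result has shape (batch, seq_len, num_hid).
print(len(vectors), len(vectors[0]), len(vectors[0][0]))
# The same token (3) at positions 0 and 1 ends up with different vectors.
print(vectors[0][0] != vectors[0][1])
```

The position table is shared by every sequence in the batch, which mirrors how the Keras layer broadcasts `positions` over the batch dimension.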

In [3]:
class SpeechFeatureEmbedding(layers.Layer):
    def __init__(self, num_hid=64, maxlen=100):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv2 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )
        self.conv3 = tf.keras.layers.Conv1D(
            num_hid, 11, strides=2, padding="same", activation="relu"
        )

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return self.conv3(x)

1. **Class Definition (`SpeechFeatureEmbedding`):**
   - This is a custom Keras layer designed to process speech features and convert them into embeddings suitable for a transformer input.

2. **`__init__` Method:**
   - The constructor method initializes the layer's attributes when an instance is created.
   - `num_hid`: Number of filters or hidden units in each convolutional layer. Default is 64.
   - `maxlen`: Maximum sequence length (number of time steps in the speech features). Default is 100. The constructor accepts this argument but never uses it.

3. **Convolutional Layers Initialization:**
   - Three 1D convolutional layers (`conv1`, `conv2`, and `conv3`) are initialized:
     - Each convolutional layer performs a 1D convolution operation on the input data.
     - `num_hid` filters are used for each convolutional layer.
     - A filter size of 11 is specified for each convolutional layer.
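     - The `strides` parameter is set to 2, so the filter advances two time steps at a time and each layer halves the sequence length.
     - Padding is set to "same", which pads the input so that the output length depends only on the stride: with a stride of 2 it is the input length divided by 2, rounded up.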
     - The `activation` function is set to ReLU (Rectified Linear Unit), which introduces non-linearity.

4. **`call` Method:**
   - The `call` method defines the forward pass of the layer, where the actual computations are performed.
   - `x` is the input tensor representing speech features, often in the form of spectrogram frames.

5. **Convolutional Operations:**
   - `x = self.conv1(x)`: Applies the first convolutional layer to the input `x`.
   - `x = self.conv2(x)`: Applies the second convolutional layer to the result of the previous convolution.
   - All three layers use the same kernel size and stride. Stacking them widens the receptive field and shortens the sequence by roughly a factor of 8.

6. **Final Convolution and Output:**
   - `return self.conv3(x)`: Applies the third convolutional layer to the result of the previous convolutional operations. This produces the final output of the layer.

The `SpeechFeatureEmbedding` layer uses three strided 1D convolutions to turn speech features into a shorter sequence of embeddings that can be fed to a transformer encoder. Each successive layer sees a wider window of the original signal, so the stack captures patterns at increasing time scales.
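The effect of the three strided convolutions on the sequence length can be checked with a little arithmetic. The sketch below assumes the usual rule for "same" padding, where the output length is the input length divided by the stride, rounded up; the function names are illustrative and not part of the notebook.

```python
def conv1d_same_out_len(n_steps, stride):
    """Output length of a 1D convolution with padding="same": ceil(n_steps / stride)."""
    return -(-n_steps // stride)  # integer ceiling division


def speech_embedding_out_len(n_steps, num_layers=3, stride=2):
    """Sequence length after a stack of identical strided convolutions."""
    for _ in range(num_layers):
        n_steps = conv1d_same_out_len(n_steps, stride)
    return n_steps


# 2754 is the padded spectrogram length used later in this notebook.
print(speech_embedding_out_len(2754))  # 2754 -> 1377 -> 689 -> 345
```

The encoder therefore attends over a few hundred positions rather than several thousand, which keeps the attention matrices small.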

# Transformer Encoder Layer

In [4]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

1. **Class Definition (`TransformerEncoder`):**
   - This class represents a single layer of the Transformer encoder.

2. **`__init__` Method:**
   - The constructor initializes the layer's attributes when an instance is created.
   - `embed_dim`: The dimension of the input and output embeddings.
   - `num_heads`: The number of attention heads in the multi-head attention mechanism.
   - `feed_forward_dim`: The dimension of the feed-forward neural network within the layer.
   - `rate`: Dropout rate, which controls the amount of dropout regularization applied to the layer's outputs.

3. **Multi-Head Attention (`self.att`):**
   - Initializes a multi-head attention layer using the specified number of heads and embedding dimension (`embed_dim`).

4. **Feed-Forward Network (`self.ffn`):**
   - Initializes a feed-forward neural network (FFN) composed of two dense layers:
     - The first dense layer has `feed_forward_dim` units and uses ReLU activation.
     - The second dense layer has `embed_dim` units.

5. **Layer Normalization (`self.layernorm1` and `self.layernorm2`):**
   - Initializes two layer normalization layers, each with a small epsilon value to prevent division by zero.

6. **Dropout (`self.dropout1` and `self.dropout2`):**
   - Initializes two dropout layers with the specified dropout rate (`rate`).

7. **`call` Method:**
   - The `call` method defines the forward pass of the layer, where the actual computations are performed.
   - `inputs`: The input tensor.
   - `training`: A boolean flag indicating whether the model is in training mode.

8. **Multi-Head Attention Forward Pass:**
   - `attn_output = self.att(inputs, inputs)`: Applies the multi-head attention mechanism to the input.
   - `attn_output = self.dropout1(attn_output, training=training)`: Applies dropout to the attention output.

9. **Residual Connection and Layer Normalization (`out1`):**
   - `out1 = self.layernorm1(inputs + attn_output)`: Adds the attention output to the input and applies layer normalization.

10. **Feed-Forward Network Forward Pass:**
    - `ffn_output = self.ffn(out1)`: Passes the output of the attention layer through the feed-forward network.
    - `ffn_output = self.dropout2(ffn_output, training=training)`: Applies dropout to the FFN output.

11. **Residual Connection and Layer Normalization (`return` statement):**
    - `return self.layernorm2(out1 + ffn_output)`: Adds the FFN output to the previous output and applies layer normalization.
    - The final result of the layer represents the output of the TransformerEncoder.

The `TransformerEncoder` layer implements a single block of a Transformer encoder. It applies multi-head self-attention followed by a feed-forward network; each sub-layer is followed by dropout, a residual connection, and layer normalization. This layer can be stacked multiple times to form the complete Transformer encoder.
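The "add, then normalize" pattern that the encoder applies after each sub-layer can be sketched on a single feature vector. This is a simplified illustration: it omits dropout and the learned scale and offset that `LayerNormalization` also has, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize a feature vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def residual_block(vec, sublayer):
    """Compute layer_norm(vec + sublayer(vec)), the pattern used twice per encoder block."""
    sub_out = sublayer(vec)
    return layer_norm([a + b for a, b in zip(vec, sub_out)])


def toy_sublayer(vec):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v for v in vec]


x = [1.0, 2.0, 4.0, 8.0]
out = residual_block(x, toy_sublayer)
print(out)
```

The `eps` term plays the same role as `epsilon=1e-6` in the Keras layers: it keeps the division well defined when all features are equal.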

# Transformer Decoder Layer

In [5]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, dropout_rate=0.1):
        super().__init__()
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)
        self.self_att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.enc_att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.self_dropout = layers.Dropout(0.5)
        self.enc_dropout = layers.Dropout(0.1)
        self.ffn_dropout = layers.Dropout(0.1)
        self.ffn = keras.Sequential(
            [
                layers.Dense(feed_forward_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )

    def causal_attention_mask(self, batch_size, n_dest, n_src, dtype):
        """Masks the upper half of the dot product matrix in self attention.

        This prevents flow of information from future tokens to current token.
        1's in the lower triangle, counting from the lower right corner.
        """
        i = tf.range(n_dest)[:, None]
        j = tf.range(n_src)
        m = i >= j - n_src + n_dest
        mask = tf.cast(m, dtype)
        mask = tf.reshape(mask, [1, n_dest, n_src])
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
        )
        return tf.tile(mask, mult)

    def call(self, enc_out, target):
        input_shape = tf.shape(target)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = self.causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        target_att = self.self_att(target, target, attention_mask=causal_mask)
        target_norm = self.layernorm1(target + self.self_dropout(target_att))
        enc_out = self.enc_att(target_norm, enc_out)
        enc_out_norm = self.layernorm2(self.enc_dropout(enc_out) + target_norm)
        ffn_out = self.ffn(enc_out_norm)
        ffn_out_norm = self.layernorm3(enc_out_norm + self.ffn_dropout(ffn_out))
        return ffn_out_norm

1. **Class Definition (`TransformerDecoder`):**
   - This class represents a single layer of the Transformer decoder.

2. **`__init__` Method:**
   - Initializes the layer's attributes when an instance is created.
   - `embed_dim`: The dimension of the input and output embeddings.
   - `num_heads`: The number of attention heads in the multi-head attention mechanisms.
   - `feed_forward_dim`: The dimension of the feed-forward neural network within the layer.
   - `dropout_rate`: Intended dropout rate. As written, the constructor ignores it: the three dropout layers use hard-coded rates of 0.5, 0.1, and 0.1.

3. **Layer Normalization (`self.layernorm1`, `self.layernorm2`, `self.layernorm3`):**
   - Initializes three layer normalization layers, each with a small epsilon value to prevent division by zero.

4. **Multi-Head Self-Attention (`self.self_att`):**
   - Initializes a multi-head self-attention layer for the decoder's self-attention mechanism.

5. **Multi-Head Encoder-Decoder Attention (`self.enc_att`):**
   - Initializes a multi-head attention layer for the encoder-decoder attention mechanism.

6. **Dropout (`self.self_dropout`, `self.enc_dropout`, `self.ffn_dropout`):**
   - Initializes three dropout layers with fixed rates: 0.5 after self-attention, and 0.1 after both the encoder-decoder attention and the feed-forward network.

7. **Feed-Forward Network (`self.ffn`):**
   - Initializes a feed-forward neural network (FFN) composed of two dense layers:
     - The first dense layer has `feed_forward_dim` units and uses ReLU activation.
     - The second dense layer has `embed_dim` units.

8. **`causal_attention_mask` Method:**
   - Defines a function to create a causal attention mask that masks the upper half of the dot product matrix to prevent future information flow.

9. **`call` Method:**
   - Defines the forward pass of the layer where the actual computations are performed.
   - `enc_out`: The output of the encoder.
   - `target`: The input to the decoder.

10. **Creating Causal Attention Mask:**
    - `causal_mask = self.causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)`: Creates a causal attention mask.

11. **Self-Attention for Decoder Input (`target_att`):**
    - `target_att = self.self_att(target, target, attention_mask=causal_mask)`: Applies self-attention to the decoder input with the causal attention mask.
    - `target_norm = self.layernorm1(target + self.self_dropout(target_att))`: Applies dropout to the self-attention output, adds it to the decoder input (residual connection), and applies layer normalization to the sum.

12. **Encoder-Decoder Attention (`enc_out`):**
    - `enc_out = self.enc_att(target_norm, enc_out)`: Applies encoder-decoder attention, with the normalized self-attention output as the query and the encoder output as key and value.
    - `enc_out_norm = self.layernorm2(self.enc_dropout(enc_out) + target_norm)`: Applies dropout to the encoder-decoder attention output, adds it to `target_norm` (residual connection), and applies layer normalization to the sum.

13. **Feed-Forward Network (`ffn_out`):**
    - `ffn_out = self.ffn(enc_out_norm)`: Passes the output of the encoder-decoder attention through the feed-forward network.

14. **Final Layer Normalization (`ffn_out_norm`):**
    - `ffn_out_norm = self.layernorm3(enc_out_norm + self.ffn_dropout(ffn_out))`: Applies dropout to the feed-forward output, adds it to `enc_out_norm` (residual connection), and applies layer normalization to the sum.

15. **Returning the Output (`return ffn_out_norm`):**
    - The `call` method returns the final output of the decoder layer after passing it through the multi-head self-attention, encoder-decoder attention, feed-forward network, and applying normalization and dropout.

This `TransformerDecoder` layer is one component of the Transformer model's decoder stack, responsible for processing the decoder inputs and producing contextualized outputs based on self-attention and attention to the encoder outputs.
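The causal mask is easier to picture as a small matrix. The sketch below reproduces the comparison from `causal_attention_mask` (`i >= j - n_src + n_dest`) with plain Python lists, leaving out the batch dimension and the tiling; the function name `causal_mask` is an illustrative choice.

```python
def causal_mask(n_dest, n_src):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [i >= j - n_src + n_dest for j in range(n_src)]
        for i in range(n_dest)
    ]


for row in causal_mask(4, 4):
    print("".join("1" if allowed else "0" for allowed in row))
# 1000
# 1100
# 1110
# 1111
```

Each row is a query position: the first token can only see itself, while the last token sees the whole prefix. That is what prevents information flowing from future tokens during training.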

# Complete the Transformer model

In [6]:
class Transformer(keras.Model):
    def __init__(
        self,
        num_hid=64,
        num_head=2,
        num_feed_forward=128,
        source_maxlen=100,
        target_maxlen=100,
        num_layers_enc=4,
        num_layers_dec=1,
        num_classes=10,
    ):
        super().__init__()
        self.loss_metric = keras.metrics.Mean(name="loss")
        self.num_layers_enc = num_layers_enc
        self.num_layers_dec = num_layers_dec
        self.target_maxlen = target_maxlen
        self.num_classes = num_classes

        self.enc_input = SpeechFeatureEmbedding(num_hid=num_hid, maxlen=source_maxlen)
        self.dec_input = TokenEmbedding(
            num_vocab=num_classes, maxlen=target_maxlen, num_hid=num_hid
        )

        self.encoder = keras.Sequential(
            [self.enc_input]
            + [
                TransformerEncoder(num_hid, num_head, num_feed_forward)
                for _ in range(num_layers_enc)
            ]
        )

        for i in range(num_layers_dec):
            setattr(
                self,
                f"dec_layer_{i}",
                TransformerDecoder(num_hid, num_head, num_feed_forward),
            )

        self.classifier = layers.Dense(num_classes)

    def decode(self, enc_out, target):
        y = self.dec_input(target)
        for i in range(self.num_layers_dec):
            y = getattr(self, f"dec_layer_{i}")(enc_out, y)
        return y

    def call(self, inputs):
        source = inputs[0]
        target = inputs[1]
        x = self.encoder(source)
        y = self.decode(x, target)
        return self.classifier(y)

    @property
    def metrics(self):
        return [self.loss_metric]

    def train_step(self, batch):
        """Processes one batch inside model.fit()."""
        source = batch["source"]
        target = batch["target"]
        dec_input = target[:, :-1]
        dec_target = target[:, 1:]
        with tf.GradientTape() as tape:
            preds = self([source, dec_input])
            one_hot = tf.one_hot(dec_target, depth=self.num_classes)
            mask = tf.math.logical_not(tf.math.equal(dec_target, 0))
            loss = self.compiled_loss(one_hot, preds, sample_weight=mask)
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        self.loss_metric.update_state(loss)
        return {"loss": self.loss_metric.result()}

    def test_step(self, batch):
        source = batch["source"]
        target = batch["target"]
        dec_input = target[:, :-1]
        dec_target = target[:, 1:]
        preds = self([source, dec_input])
        one_hot = tf.one_hot(dec_target, depth=self.num_classes)
        mask = tf.math.logical_not(tf.math.equal(dec_target, 0))
        loss = self.compiled_loss(one_hot, preds, sample_weight=mask)
        self.loss_metric.update_state(loss)
        return {"loss": self.loss_metric.result()}

    def generate(self, source, target_start_token_idx):
        """Performs inference over one batch of inputs using greedy decoding."""
        bs = tf.shape(source)[0]
        enc = self.encoder(source)
        dec_input = tf.ones((bs, 1), dtype=tf.int32) * target_start_token_idx
        dec_logits = []
        for i in range(self.target_maxlen - 1):
            dec_out = self.decode(enc, dec_input)
            logits = self.classifier(dec_out)
            logits = tf.argmax(logits, axis=-1, output_type=tf.int32)
            last_logit = tf.expand_dims(logits[:, -1], axis=-1)
            dec_logits.append(last_logit)
            dec_input = tf.concat([dec_input, last_logit], axis=-1)
        return dec_input

1. **Class Definition (`Transformer`):**
   - The class `Transformer` inherits from `keras.Model`.
   - It initializes hyperparameters, metrics, and various components of the Transformer model.

2. **Initializer (`__init__`):**
   - Initializes the Transformer model with hyperparameters like `num_hid`, `num_head`, `num_feed_forward`, etc.
   - Initializes a mean metric `loss_metric` for tracking the loss during training.
   - Sets the number of encoder and decoder layers, maximum target length, and number of classes.
   - Creates input embedding layers for the encoder (`enc_input`) and decoder (`dec_input`).

3. **Encoder Initialization (`encoder`):**
   - Sets up the encoder using a sequence of layers, starting with the `enc_input`.
   - It stacks multiple `TransformerEncoder` layers based on the specified `num_layers_enc`.

4. **Decoder Initialization (`dec_layer_i`):**
   - For each decoder layer, initializes a `TransformerDecoder` layer and assigns it to an attribute named `dec_layer_i`.

5. **Classifier Layer (`classifier`):**
   - Initializes a dense layer (`classifier`) for predicting class labels.

6. **Decode Method (`decode`):**
   - Takes encoder outputs (`enc_out`) and target tokens (`target`) as input.
   - Passes the target tokens through the decoder layers sequentially to generate contextualized outputs (`y`).

7. **Call Method (`call`):**
   - Takes source and target inputs and processes them through the encoder and decoder.
   - Returns the output of the classifier layer.

8. **Metrics Property (`metrics`):**
   - Returns the list of metrics, which includes the `loss_metric`.

9. **Train Step Method (`train_step`):**
   - Implements a custom training step to handle one batch of data.
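   - Splits the target into a decoder input (all tokens except the last) and a decoder target (all tokens except the first).
   - Computes the loss with padding positions (index 0) masked out, computes the gradients, applies them with the optimizer, and updates the loss metric.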

10. **Test Step Method (`test_step`):**
    - Computes the same masked loss as `train_step` on an evaluation batch, but without computing or applying gradients.

11. **Generate Method (`generate`):**
    - Performs inference on a batch of source inputs using greedy decoding.
    - Initializes the decoder input with a start token index.
    - Iterates over each time step, decoding and generating the output sequence.

Overall, this code defines a custom Transformer model for sequence-to-sequence tasks. It includes encoder and decoder stacks, customized training and testing steps, and an inference method for generating sequences using greedy decoding.
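The shift between decoder input and decoder target in `train_step` and `test_step` is the core of teacher forcing, so here is a small list-based sketch of it. The token indices are hypothetical, and `teacher_forcing_split` is an illustrative helper rather than part of the model.

```python
PAD = 0  # padding index, matching the `tf.math.equal(dec_target, 0)` check


def teacher_forcing_split(target):
    """Split padded target sequences into decoder input, decoder target, and loss mask."""
    dec_input = [seq[:-1] for seq in target]   # everything except the last token
    dec_target = [seq[1:] for seq in target]   # everything except the first token
    mask = [[tok != PAD for tok in seq] for seq in dec_target]
    return dec_input, dec_target, mask


# One padded sequence: start token, two characters, end token, padding.
target = [[2, 11, 12, 3, 0, 0]]
dec_input, dec_target, mask = teacher_forcing_split(target)

print(dec_input)   # [[2, 11, 12, 3, 0]]
print(dec_target)  # [[11, 12, 3, 0, 0]]
print(mask)        # [[True, True, True, False, False]]
```

At position `t` the decoder sees the tokens up to `t` and is asked to predict token `t + 1`; the mask keeps padded positions from contributing to the loss.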

# Download the dataset

In [7]:
keras.utils.get_file(
    os.path.join(os.getcwd(), "data.tar.gz"),
    "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2",
    extract=True,
    archive_format="tar",
    cache_dir=".",
)


saveto = "./datasets/LJSpeech-1.1"
wavs = glob("{}/**/*.wav".format(saveto), recursive=True)

id_to_text = {}
with open(os.path.join(saveto, "metadata.csv"), encoding="utf-8") as f:
    for line in f:
        id = line.strip().split("|")[0]
        text = line.strip().split("|")[2]
        id_to_text[id] = text


def get_data(wavs, id_to_text, maxlen=50):
    """ returns mapping of audio paths and transcription texts """
    data = []
    for w in wavs:
        id = w.split("/")[-1].split(".")[0]
        if len(id_to_text[id]) < maxlen:
            data.append({"audio": w, "text": id_to_text[id]})
    return data

Downloading data from https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2



1. **Downloading and Extracting Data (`keras.utils.get_file`):**
   - Downloads a compressed data file from a URL and extracts its contents.
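   - The archive is downloaded from the provided URL, saved under the name `data.tar.gz`, and extracted using the "tar" archive format.
   - With `cache_dir="."`, the extracted files are expected under `./datasets` inside the current working directory, which is where `saveto` points.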

2. **File Paths and Metadata Extraction:**
   - Defines the `saveto` variable as the path to the extracted data directory.
   - Uses the `glob` function to recursively search for all ".wav" files within the `saveto` directory and its subdirectories.
   - Creates an empty dictionary `id_to_text` to store mappings between audio file IDs and their corresponding transcription texts.

3. **Parsing Metadata File (`metadata.csv`):**
   - Opens the `metadata.csv` file in the `saveto` directory for reading, assuming it contains metadata about the audio files.
   - Iterates through each line in the file and extracts the ID and transcription text from the pipe-separated format.
   - Builds the `id_to_text` dictionary, mapping audio file IDs to their corresponding transcription texts.

4. **Data Preparation (`get_data` Function):**
   - Defines a function `get_data` that takes a list of audio file paths, the `id_to_text` dictionary, and an optional `maxlen` parameter.
   - The function processes each audio file and its associated text transcription.
   - It extracts the ID from the audio file path and checks if the length of the corresponding text is less than the specified `maxlen`.
   - If the text length condition is met, the audio file path and text are added to the `data` list.
   - The function returns the `data` list containing audio file-text pairs that meet the criteria.
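The parsing and filtering steps can be tried without downloading the dataset. The sketch below uses two made-up metadata lines in the three-field, pipe-separated layout that the code above assumes; the clip IDs, texts, and helper names are all hypothetical.

```python
import os

# Hypothetical metadata lines: id | raw transcription | normalized transcription
lines = [
    "clip_0001|Dr. Smith arrived.|Doctor Smith arrived.\n",
    "clip_0002|Yes.|Yes.\n",
]


def parse_metadata(lines):
    """Map each clip ID (field 0) to its normalized transcription (field 2)."""
    id_to_text = {}
    for line in lines:
        fields = line.strip().split("|")
        id_to_text[fields[0]] = fields[2]
    return id_to_text


def filter_by_length(wav_paths, id_to_text, maxlen=50):
    """Keep only clips whose transcription is shorter than maxlen characters."""
    data = []
    for path in wav_paths:
        clip_id = os.path.splitext(os.path.basename(path))[0]
        text = id_to_text[clip_id]
        if len(text) < maxlen:
            data.append({"audio": path, "text": text})
    return data


id_to_text = parse_metadata(lines)
wav_paths = ["wavs/clip_0001.wav", "wavs/clip_0002.wav"]
print(filter_by_length(wav_paths, id_to_text, maxlen=10))
```

This variant derives the ID with `os.path.basename` and `os.path.splitext` instead of splitting on `/` and `.`, which is a design choice that also tolerates file names containing extra dots.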

# Preprocess the dataset

In [8]:
class VectorizeChar:
    def __init__(self, max_len=50):
        self.vocab = (
            ["-", "#", "<", ">"]
            + [chr(i + 96) for i in range(1, 27)]
            + [" ", ".", ",", "?"]
        )
        self.max_len = max_len
        self.char_to_idx = {}
        for i, ch in enumerate(self.vocab):
            self.char_to_idx[ch] = i

    def __call__(self, text):
        text = text.lower()
        text = text[: self.max_len - 2]
        text = "<" + text + ">"
        pad_len = self.max_len - len(text)
        return [self.char_to_idx.get(ch, 1) for ch in text] + [0] * pad_len

    def get_vocabulary(self):
        return self.vocab


max_target_len = 200  # all transcripts in our data are < 200 characters
data = get_data(wavs, id_to_text, max_target_len)
vectorizer = VectorizeChar(max_target_len)
print("vocab size", len(vectorizer.get_vocabulary()))


def create_text_ds(data):
    texts = [_["text"] for _ in data]
    text_ds = [vectorizer(t) for t in texts]
    text_ds = tf.data.Dataset.from_tensor_slices(text_ds)
    return text_ds


def path_to_audio(path):
    # spectrogram using stft
    audio = tf.io.read_file(path)
    audio, _ = tf.audio.decode_wav(audio, 1)
    audio = tf.squeeze(audio, axis=-1)
    stfts = tf.signal.stft(audio, frame_length=200, frame_step=80, fft_length=256)
    x = tf.math.pow(tf.abs(stfts), 0.5)
    # normalisation
    means = tf.math.reduce_mean(x, 1, keepdims=True)
    stddevs = tf.math.reduce_std(x, 1, keepdims=True)
    x = (x - means) / stddevs
    audio_len = tf.shape(x)[0]
    # padding to 10 seconds
    pad_len = 2754
    paddings = tf.constant([[0, pad_len], [0, 0]])
    x = tf.pad(x, paddings, "CONSTANT")[:pad_len, :]
    return x


def create_audio_ds(data):
    flist = [_["audio"] for _ in data]
    audio_ds = tf.data.Dataset.from_tensor_slices(flist)
    audio_ds = audio_ds.map(
        path_to_audio, num_parallel_calls=tf.data.AUTOTUNE
    )
    return audio_ds


def create_tf_dataset(data, bs=4):
    audio_ds = create_audio_ds(data)
    text_ds = create_text_ds(data)
    ds = tf.data.Dataset.zip((audio_ds, text_ds))
    ds = ds.map(lambda x, y: {"source": x, "target": y})
    ds = ds.batch(bs)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds


split = int(len(data) * 0.99)
train_data = data[:split]
test_data = data[split:]
ds = create_tf_dataset(train_data, bs=64)
val_ds = create_tf_dataset(test_data, bs=4)

vocab size 34


1. **`VectorizeChar` Class:**
   - Creates a character vectorization class.
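   - Initializes with a vocabulary of 34 entries: the special tokens `-` (padding, index 0), `#` (unknown character, index 1), `<` (start, index 2), and `>` (end, index 3), followed by the lowercase letters and the characters space, `.`, `,`, and `?`.
   - Defines the maximum text length (`max_len`) and a character-to-index mapping (`char_to_idx`) based on the vocabulary.
   - When called, lowercases the text, truncates it to `max_len - 2` characters, wraps it in `<` and `>`, maps each character to its index, and pads the result with zeros up to `max_len`.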

2. **Creating the `data` List:**
   - Calls the `get_data` function to obtain a list of audio and text data pairs.
   - The maximum target length (`max_target_len`) is set to 200 characters.

3. **Creating the Vectorizer and Vocabulary:**
   - Initializes a `VectorizeChar` instance with the `max_target_len`.
   - Prints the vocabulary size using the `get_vocabulary()` method of the vectorizer.

4. **Creating Text Dataset (`create_text_ds`):**
   - Extracts the texts from the `data` list.
   - Applies the vectorizer to each text to convert it into a sequence of character indices.
   - Creates a TensorFlow dataset from the vectorized texts.

5. **Creating Audio Dataset (`path_to_audio` and `create_audio_ds`):**
   - `create_audio_ds` extracts the audio file paths from the `data` list and maps `path_to_audio` over them in parallel.
   - `path_to_audio` reads and decodes each WAV file into a single-channel waveform.
   - It applies a Short-Time Fourier Transform (STFT) and takes the square root of the magnitudes to obtain a spectrogram.
   - It normalizes each time frame using the mean and standard deviation computed across its frequency bins.
   - It pads or truncates the spectrogram to a fixed length of 2754 time steps.

6. **Creating TensorFlow Dataset (`create_tf_dataset`):**
   - Creates audio and text datasets using the previously defined functions.
   - Zips the audio and text datasets together to create a dataset with pairs of audio and text data.
   - Maps a function that transforms the pairs into dictionary entries with keys "source" and "target".
   - Batches the dataset with a given batch size (`bs`).
   - Prefetches data to optimize data loading.

7. **Splitting Data and Creating Datasets for Training and Validation:**
   - Splits the `data` list into training and validation portions.
   - Calls `create_tf_dataset` to create TensorFlow datasets for training (`ds`) and validation (`val_ds`).
   - The training dataset has a batch size of 64, while the validation dataset has a batch size of 4.

In summary, this code prepares data for training a Transformer model. It creates vectorized text and audio datasets and forms batches suitable for training and validation. The text data is vectorized using the `VectorizeChar` class, and audio data is processed into spectrogram-like representations before being combined into TensorFlow datasets for further processing during model training.
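The constant `pad_len = 2754` and the comment "padding to 10 seconds" can be reconciled with a short calculation. The sketch below assumes that the STFT does not pad the end of the signal (the default behaviour of `tf.signal.stft`) and that the clips are sampled at 22,050 Hz; neither value is stated in the code above, and the helper names are illustrative.

```python
def stft_num_frames(num_samples, frame_length, frame_step):
    """Number of STFT frames when the signal is not padded at the end."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step


def stft_num_bins(fft_length):
    """Number of non-redundant frequency bins of a real-valued FFT."""
    return fft_length // 2 + 1


SAMPLE_RATE = 22050  # assumption: sampling rate of the audio clips in Hz

# Ten seconds of audio with the frame settings used in path_to_audio.
print(stft_num_frames(10 * SAMPLE_RATE, frame_length=200, frame_step=80))  # 2754
print(stft_num_bins(256))  # 129
```

Under these assumptions each clip becomes a matrix of 2754 frames by 129 frequency bins, which is the `source` tensor that the convolutional front end receives.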

# Callbacks to display predictions

In [9]:
class DisplayOutputs(keras.callbacks.Callback):
    def __init__(
        self, batch, idx_to_token, target_start_token_idx=27, target_end_token_idx=28
    ):
        """Displays a batch of outputs after every epoch

        Args:
            batch: A test batch containing the keys "source" and "target"
            idx_to_token: A List containing the vocabulary tokens corresponding to their indices
            target_start_token_idx: A start token index in the target vocabulary
            target_end_token_idx: An end token index in the target vocabulary
        """
        super().__init__()
        self.batch = batch
        self.target_start_token_idx = target_start_token_idx
        self.target_end_token_idx = target_end_token_idx
        self.idx_to_char = idx_to_token

    def on_epoch_end(self, epoch, logs=None):
        if epoch % 5 != 0:
            return
        source = self.batch["source"]
        target = self.batch["target"].numpy()
        bs = tf.shape(source)[0]
        preds = self.model.generate(source, self.target_start_token_idx)
        preds = preds.numpy()
        for i in range(bs):
            target_text = "".join([self.idx_to_char[_] for _ in target[i, :]])
            prediction = ""
            for idx in preds[i, :]:
                prediction += self.idx_to_char[idx]
                if idx == self.target_end_token_idx:
                    break
            print(f"target:     {target_text.replace('-','')}")
            print(f"prediction: {prediction}\n")

1. **`DisplayOutputs` Callback Class:**
   - This is a custom callback class that prints the model's predictions for a fixed batch every fifth epoch (epochs 0, 5, 10, and so on).
   - It takes the following arguments during initialization:
     - `batch`: A test batch containing the keys "source" and "target".
     - `idx_to_token`: A list containing the vocabulary tokens corresponding to their indices.
     - `target_start_token_idx`: Index of the start token in the target vocabulary.
     - `target_end_token_idx`: Index of the end token in the target vocabulary.

2. **`on_epoch_end` Method:**
   - This method is called at the end of each epoch during training.
   - It checks if the current epoch number is not divisible by 5 and returns early if so.
   - It extracts the source and target data from the provided batch.
   - Calls the model's `generate` method to generate predictions for the provided source data, using the provided start token index.
   - Iterates over each example in the batch and prints the target text and the generated prediction.
   - The target text is obtained by converting the target indices to characters using the `idx_to_char` mapping.
   - The prediction is obtained by iterating over the generated indices until the end token index is encountered.

This callback prints the target text and the model's prediction for a fixed test batch once every five epochs, starting with the first. Comparing the two side by side gives a quick visual check of how coherent the generated output is as training progresses.
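
The decoding step inside `on_epoch_end` can be sketched in plain Python, without a model. The helper name `decode_prediction` and the six-entry `toy_vocab` are hypothetical and exist only for illustration; the real vocabulary comes from `vectorizer.get_vocabulary()`.

```python
def decode_prediction(indices, idx_to_char, end_token_idx):
    """Join the characters for `indices`, stopping right after the end token."""
    chars = []
    for idx in indices:
        chars.append(idx_to_char[idx])
        if idx == end_token_idx:
            break  # everything after the end token is ignored
    return "".join(chars)


# Hypothetical toy vocabulary: "-" is padding, "<" is the start token, ">" is the end token.
toy_vocab = ["-", "#", "<", ">", "a", "t"]
toy_prediction = [2, 4, 5, 3, 0, 0]  # "<at>" followed by two padding indices

decoded = decode_prediction(toy_prediction, toy_vocab, end_token_idx=3)
print(decoded)  # <at>
```

The end token is appended before the loop breaks, which is why the printed predictions close with `>` whenever the model emits it.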

# Learning rate schedule

In [10]:
class CustomSchedule(keras.optimizers.schedules.LearningRateSchedule):
    def __init__(
        self,
        init_lr=0.00001,
        lr_after_warmup=0.001,
        final_lr=0.00001,
        warmup_epochs=15,
        decay_epochs=85,
        steps_per_epoch=203,
    ):
        super().__init__()
        self.init_lr = init_lr
        self.lr_after_warmup = lr_after_warmup
        self.final_lr = final_lr
        self.warmup_epochs = warmup_epochs
        self.decay_epochs = decay_epochs
        self.steps_per_epoch = steps_per_epoch

    def calculate_lr(self, epoch):
        """ linear warm up - linear decay """
        warmup_lr = (
            self.init_lr
            + ((self.lr_after_warmup - self.init_lr) / (self.warmup_epochs - 1)) * epoch
        )
        decay_lr = tf.math.maximum(
            self.final_lr,
            self.lr_after_warmup
            - (epoch - self.warmup_epochs)
            * (self.lr_after_warmup - self.final_lr)
            / self.decay_epochs,
        )
        return tf.math.minimum(warmup_lr, decay_lr)

    def __call__(self, step):
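        # Cast to float so the arithmetic in calculate_lr does not mix integer and float tensors
        epoch = tf.cast(step // self.steps_per_epoch, tf.float32)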
        return self.calculate_lr(epoch)

1. **`CustomSchedule` Class:**
   - This is a custom learning rate schedule class that inherits from `keras.optimizers.schedules.LearningRateSchedule`.
   - It's designed to define a learning rate schedule with linear warm-up followed by linear decay.

2. **Initialization:**
   - The constructor `__init__` takes the following arguments:
     - `init_lr`: The initial learning rate before warm-up.
     - `lr_after_warmup`: The learning rate after the warm-up period.
     - `final_lr`: The final learning rate after the decay period.
     - `warmup_epochs`: The number of epochs for the warm-up phase.
     - `decay_epochs`: The number of epochs for the decay phase.
     - `steps_per_epoch`: The number of steps (batches) in one epoch.

3. **`calculate_lr` Method:**
   - This method calculates the learning rate based on the current epoch number.
   - It implements a linear warm-up followed by linear decay.
   - For the warm-up phase, it calculates a linear interpolation between `init_lr` and `lr_after_warmup` based on the current epoch.
   - For the decay phase, it calculates a linear interpolation between `lr_after_warmup` and `final_lr` based on the current epoch after the warm-up phase.
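   - The learning rate returned is the minimum of the two values. During warm-up the rising warm-up line is the smaller one; afterwards the falling decay line is smaller. The decay value is also clamped so that it never drops below `final_lr`.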

4. **`__call__` Method:**
   - This method is called by the optimizer with the current training step, that is, the number of batches processed so far.
   - It converts the step into an epoch index by integer division by `steps_per_epoch`.
   - It then calls the `calculate_lr` method to obtain the corresponding learning rate for the current epoch.

The `CustomSchedule` class defines a learning rate schedule that raises the learning rate linearly during the warm-up phase and then lowers it linearly during the decay phase. Because the epoch index comes from an integer division, the rate changes once per epoch rather than at every step. This type of schedule can help stabilize the training process and improve convergence, especially in large-scale models like Transformers.
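
To see the shape of the schedule without running TensorFlow, the same arithmetic can be sketched in plain Python. The function names `schedule_lr` and `lr_at_step` are hypothetical; the defaults mirror the ones in `CustomSchedule`.

```python
def schedule_lr(epoch, init_lr=1e-5, lr_after_warmup=1e-3, final_lr=1e-5,
                warmup_epochs=15, decay_epochs=85):
    """Plain-Python mirror of CustomSchedule.calculate_lr."""
    # Rising line: reaches lr_after_warmup at epoch warmup_epochs - 1.
    warmup_lr = init_lr + (lr_after_warmup - init_lr) / (warmup_epochs - 1) * epoch
    # Falling line: starts at lr_after_warmup at epoch warmup_epochs, clamped at final_lr.
    decay_lr = max(
        final_lr,
        lr_after_warmup
        - (epoch - warmup_epochs) * (lr_after_warmup - final_lr) / decay_epochs,
    )
    return min(warmup_lr, decay_lr)


def lr_at_step(step, steps_per_epoch=203):
    """Mirror of CustomSchedule.__call__: every step in an epoch shares one rate."""
    return schedule_lr(step // steps_per_epoch)


for epoch in (0, 7, 14, 15, 57, 100):
    print(f"epoch {epoch:3d}: lr = {schedule_lr(epoch):.6f}")
```

With the default values the rate starts at `init_lr`, peaks at `lr_after_warmup` around epochs 14 and 15, and returns to `final_lr` by epoch 100, where it stays.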

# Create & train the end-to-end model

In [11]:
batch = next(iter(val_ds))

# The vocabulary to convert predicted indices into characters
idx_to_char = vectorizer.get_vocabulary()
display_cb = DisplayOutputs(
    batch, idx_to_char, target_start_token_idx=2, target_end_token_idx=3
)  # set the arguments as per vocabulary index for '<' and '>'

model = Transformer(
    num_hid=200,
    num_head=2,
    num_feed_forward=400,
    target_maxlen=max_target_len,
    num_layers_enc=4,
    num_layers_dec=1,
    num_classes=34,
)
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=0.1,
)

learning_rate = CustomSchedule(
    init_lr=0.00001,
    lr_after_warmup=0.001,
    final_lr=0.00001,
    warmup_epochs=15,
    decay_epochs=85,
    steps_per_epoch=len(ds)
)

# NOTE: this reassignment replaces the CustomSchedule above with a fixed learning rate
learning_rate = 0.0001
optimizer = keras.optimizers.Adam(learning_rate)

model.compile(optimizer=optimizer, loss=loss_fn)

history = model.fit(ds, validation_data=val_ds, callbacks=[display_cb], epochs=1)

prediction: <the athe the as the the the the the the athe the an an the the an an an the the the the the the athe the the the the are the an anere there the the hed the thed wathe thathond pre as thenthe hang in 

target:     <at the sound of the second shot>
prediction: <the as the an the the an and the the and and the the the an an and the and the and and the the an the the an the thend the the anthe angere the the hed the thed wathere tend anged as thenthe wang ind

target:     <a certain number of bedsteads were provided, and there was a slight increase in the ration of bread.>
prediction: <the athe the as the the the the the the athe the the the an an the the an an the the the the the the the the the the athe the the on athere the the he there the wathe the on anente hed anthe han t ar

target:     <their crimes follow in the lines of others already found, and often more than once, in the calendars.>
prediction: <the athe the as the the the the the the athe the the the an an the t

1. **Data Preparation:**
   - `batch = next(iter(val_ds))`: It fetches the first batch from the validation dataset (`val_ds`). This batch will be used for displaying outputs during training.

2. **Display Callback Setup:**
   - `idx_to_char = vectorizer.get_vocabulary()`: It retrieves the vocabulary of characters from the `VectorizeChar` instance created earlier.
   - `display_cb = DisplayOutputs(...)`: An instance of the `DisplayOutputs` callback is created. This callback prints predicted and target text every fifth epoch, starting with epoch 0. It takes the `batch`, `idx_to_char`, `target_start_token_idx`, and `target_end_token_idx` as arguments; the indices 2 and 3 are the vocabulary positions of `<` and `>`.

3. **Model Initialization:**
   - `model = Transformer(...)`: An instance of the `Transformer` model is created. This is a custom Transformer model designed for the specific task.
   - Various hyperparameters are passed to configure the model. For example, `num_hid` specifies the hidden dimension size, `num_head` is the number of attention heads, `num_feed_forward` is the feed-forward dimension, and so on.

4. **Loss Function Setup:**
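   - `loss_fn = tf.keras.losses.CategoricalCrossentropy(...)`: A categorical cross-entropy loss function is initialized. This is a common loss function for multi-class classification tasks. The argument `from_logits=True` tells the loss that the model outputs raw logits, so the softmax is applied inside the loss rather than in the model. The argument `label_smoothing=0.1` softens the targets so that the model is not pushed toward fully confident predictions.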

5. **Learning Rate Schedule Setup:**
   - `learning_rate = CustomSchedule(...)`: An instance of the `CustomSchedule` learning rate schedule is created. This custom schedule defines how the learning rate will change during training based on the specified parameters.
   - The next statement, `learning_rate = 0.0001`, reassigns the same variable to a fixed value. The schedule object is therefore discarded and never reaches the optimizer. To train with the schedule, remove this reassignment.

6. **Optimizer Setup:**
   - `optimizer = keras.optimizers.Adam(...)`: An Adam optimizer instance is created with the current value of `learning_rate`, which in this cell is the fixed value `0.0001`.

7. **Model Compilation:**
   - `model.compile(...)`: The model is compiled with the specified optimizer and loss function. This prepares the model for training.

8. **Training Loop:**
   - `history = model.fit(...)`: The training loop is executed using the `fit` method. The training data (`ds`) and validation data (`val_ds`) are provided. The `callbacks` parameter includes `display_cb`, which prints outputs every fifth epoch, starting with epoch 0.
   - The training is performed for a single epoch (`epochs=1`), which means the model goes through the entire training dataset once and the callback prints its output once, at the end of that epoch.

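Overall, the code sets up a Transformer-based model, defines the necessary components (loss function, optimizer, learning rate), and trains the model for one epoch, printing predicted and target text for one validation batch at the end of that epoch. After a single epoch the predictions shown above still consist mostly of repeated short words, so more epochs are needed before the output resembles the targets.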
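
The effect of `from_logits=True` and `label_smoothing=0.1` can be sketched in plain Python. This is an illustration of the documented Keras behaviour (non-target labels become `0.1 / num_classes`, the target label becomes `0.9 + 0.1 / num_classes`), not the library's implementation; the helper names `smooth_one_hot` and `cross_entropy_from_logits` are hypothetical.

```python
import math


def smooth_one_hot(class_idx, num_classes, label_smoothing):
    """One-hot target with `label_smoothing` spread evenly over all classes."""
    off_value = label_smoothing / num_classes
    target = [off_value] * num_classes
    target[class_idx] += 1.0 - label_smoothing
    return target


def cross_entropy_from_logits(target, logits):
    """Cross-entropy where the softmax is applied to the logits inside the loss."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return -sum(t * (z - log_sum_exp) for t, z in zip(target, logits))


num_classes = 34  # same value as num_classes in the Transformer above
smoothed = smooth_one_hot(class_idx=2, num_classes=num_classes, label_smoothing=0.1)
uniform_logits = [0.0] * num_classes  # a model that has learned nothing yet
loss = cross_entropy_from_logits(smoothed, uniform_logits)

print(f"target class weight: {smoothed[2]:.4f}")
print(f"other class weight:  {smoothed[0]:.4f}")
print(f"loss for uniform logits: {loss:.4f}")
```

For uniform logits the loss equals `log(num_classes)`, which is a useful reference point when reading the loss reported during the first epoch.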

**Inspiration:** [Automatic Speech Recognition with Transformer](https://keras.io/examples/audio/transformer_asr/)