## Introduction

In this example, we'll build a sequence-to-sequence Transformer model, which
we'll train on an English-to-Hinglish machine translation task.

Methodology

- Vectorize text using the Keras `TextVectorization` layer.
- Implement a `TransformerEncoder` layer, a `TransformerDecoder` layer,
and a `PositionalEmbedding` layer.
- Prepare data for training a sequence-to-sequence model.
- Use the trained model to generate translations of never-seen-before
input sentences (sequence-to-sequence inference).


## Setup

In [1]:
#!pip install datasets

In [2]:
import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from datasets import load_dataset

## Get data
The data comes from Hugging Face. The dataset features casual-tone Hinglish translations of English texts, sampled from various sources (including the Hinglish TOP Dataset).

In [3]:
dataset_name = "findnitai/english-to-hinglish"
dataset = load_dataset(dataset_name, split="train")

In [4]:
dataset_dict = dataset.to_dict()

In [5]:
english = [item['en'] for item in dataset_dict['translation']]
hinglish = [item['hi_ng'] for item in dataset_dict['translation']]

## Parsing the data

Each record contains an English sentence and its corresponding Hinglish sentence.
The English sentence is the *source sequence* and the Hinglish one is the *target sequence*.
After cleaning both sides, we prepend the token `"[start]"` and append the token `"[end]"` to the Hinglish sentence.

In [7]:

def clean_data(text):
  result=[]
  # regex for removing weird chars
  re_print = re.compile('[^%s]' % re.escape(string.printable))
  # regex for removing punctuation
  regex_punct = re.compile('[%s]' % re.escape(string.punctuation))
  for line in text:
    # split on whitespace so we can remove weird chars and punctuation
    line = line.split()
    # convert to lower case
    line = [word.lower() for word in line]
    # remove punctuation
    line = [regex_punct.sub('', word) for word in line]
    # remove weird chars
    line = [re_print.sub('', w) for w in line]
    # drop every token that is not purely alphabetic (numbers, mixed tokens, empty strings)
    line = [word for word in line if word.isalpha()]
    result.append(' '.join(line))
  return result

In [11]:
clean_eng = clean_data(english)
clean_hing = clean_data(hinglish)

In [17]:
# wrap each cleaned Hinglish sentence in [start] / [end] tokens and pair it with its English source
pairs = []
for i in range(len(clean_eng)):
    hing_pad = "[start] "+clean_hing[i]+" [end]"
    pairs.append((clean_eng[i], hing_pad))

Here's what our sentence pairs look like:

In [18]:
for _ in range(5):
    print(random.choice(pairs))

('play bruno mars on pandora', '[start] pandora par bruno mars ko bajao [end]')
('deduct minutes from my active timer', '[start] mere active timer se minutes deduct kare [end]')
('remove the alert on my calendar about doug s promotion', '[start] doug ke promotion ke baare me alert hata dein [end]')
('will i make my am appointment if i leave now', '[start] agar mai abhi nikalta hoon toh kya mai apna subah bajhe ka appointment pahunch sakta hoon [end]')
('how cold is it for shanghai in degrees f', '[start] shanghai ke liye f degree mein kitni thand hai [end]')


Now, let's split the sentence pairs into a training set, a validation set,
and a test set.

In [22]:
random.shuffle(pairs)
num_val_samples = int(0.15 * len(pairs))
num_train_samples = len(pairs) - 2 * num_val_samples
train_pairs = pairs[:num_train_samples]
val_pairs = pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = pairs[num_train_samples + num_val_samples :]

print(f"{len(pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

189102 total pairs
132372 training pairs
28365 validation pairs
28365 test pairs
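
As a quick check on the numbers above, the split sizes follow directly from the 15% validation fraction. The sketch below reproduces that arithmetic with the pair count reported in the output; the helper name `split_sizes` is hypothetical and not part of the notebook.

```python
def split_sizes(num_pairs, val_fraction=0.15):
    """Return (train, val, test) sizes using the same arithmetic as the cell above."""
    num_val = int(val_fraction * num_pairs)  # int() truncates the fractional part
    num_train = num_pairs - 2 * num_val      # what remains after val and test
    num_test = num_pairs - num_train - num_val
    return num_train, num_val, num_test


train, val, test = split_sizes(189102)
print(train, val, test)  # 132372 28365 28365
```

Because the validation size is truncated to an integer, the training set absorbs any remainder, so the three parts always sum to the total.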


## Vectorizing the text data

We'll use the `TextVectorization` layer to vectorize the text
data, that is to say, to turn the original strings into integer sequences
where each integer represents the index of a word in a vocabulary.
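
To make "integer sequences" concrete, here is a toy, plain-Python illustration of the idea. It is not the Keras implementation; the helpers `build_vocab` and `vectorize` are hypothetical, and the only conventions borrowed from `TextVectorization` are that index 0 is reserved for padding and index 1 for out-of-vocabulary words.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Toy vocabulary: index 0 is padding, index 1 is the unknown token."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, sequence_length):
    """Map words to vocabulary indices, then truncate or zero-pad."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["play bruno mars", "play the next song"]
vocab = build_vocab(texts, max_tokens=10)
print(vectorize("play bruno mars on pandora", vocab, sequence_length=8))
# [2, 3, 4, 1, 1, 0, 0, 0]
```

Words never seen while building the vocabulary ("on", "pandora") map to index 1, and the tail is padded with zeros up to the fixed sequence length.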


In [24]:
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

vocab_size = 15000
sequence_length = 20
batch_size = 64

eng_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
hing_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
)
train_eng_texts = [pair[0] for pair in train_pairs]
train_hing_texts = [pair[1] for pair in train_pairs]
eng_vectorization.adapt(train_eng_texts)
hing_vectorization.adapt(train_hing_texts)
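
One detail worth checking here: `TextVectorization` defaults to `standardize="lower_and_strip_punctuation"`, and the `strip_chars` string defined above is never passed to either layer. With the default, square brackets are stripped, so `"[start]"` and `"[end]"` would enter the vocabulary as `start` and `end`; this appears consistent with the bare `end` tokens in the sample translations at the end of this example. The plain-Python sketch below only illustrates the difference between stripping all punctuation and stripping everything except the brackets. The function names are hypothetical. To apply the idea in Keras, an equivalent function built from `tf.strings.lower` and `tf.strings.regex_replace` could be passed as the `standardize` argument of the target-side layer.

```python
import re
import string

# Same idea as the `strip_chars` variable above: all punctuation except "[" and "]".
strip_chars = string.punctuation.replace("[", "").replace("]", "")


def strip_all_punctuation(text):
    """Roughly what the default standardization does: lowercase, drop punctuation."""
    return re.sub("[%s]" % re.escape(string.punctuation), "", text.lower())


def keep_brackets(text):
    """Lowercase and drop punctuation, but keep the brackets of special tokens."""
    return re.sub("[%s]" % re.escape(strip_chars), "", text.lower())


sentence = "[start] kya kal baarish hogi [end]"
print(strip_all_punctuation(sentence))  # start kya kal baarish hogi end
print(keep_brackets(sentence))          # [start] kya kal baarish hogi [end]
```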

Next, we'll format our datasets.

At each training step, the model will seek to predict target words N+1 (and beyond)
using the source sentence and the target words 0 to N.

As such, the training dataset will yield a tuple `(inputs, targets)`, where:

- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
`encoder_inputs` is the vectorized source sentence and `decoder_inputs` is the target sentence "so far",
that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target sentence.
- `targets` is the target sentence offset by one step:
it provides the next words in the target sentence -- what the model will try to predict.
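
The one-step offset can be illustrated on a single toy sentence. This is a plain-Python sketch with made-up tokens rather than integer ids; it also shows why the target side is vectorized to `sequence_length + 1` positions, so that both slices end up with `sequence_length` steps.

```python
START, END, PAD = "[start]", "[end]", ""

# A padded target sequence of length sequence_length + 1 (here 7).
target = [START, "kya", "kal", "baarish", "hogi", END, PAD]

decoder_inputs = target[:-1]  # what the decoder sees: tokens 0..N
targets = target[1:]          # what it must predict: tokens 1..N+1

for seen, expected in zip(decoder_inputs, targets):
    print(f"{seen!r:>10} -> {expected!r}")
```

At every position, the expected output is simply the token that follows the one the decoder has just been given.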

In [25]:

def format_dataset(eng, hing):
    eng = eng_vectorization(eng)
    hing = hing_vectorization(hing)
    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": hing[:, :-1],
        },
        hing[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, hing_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    hing_texts = list(hing_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, hing_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 20 steps long):

In [26]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)



## Building the model

Our sequence-to-sequence Transformer consists of a `TransformerEncoder`
and a `TransformerDecoder` chained together. To make the model aware of word order,
we also use a `PositionalEmbedding` layer.

The source sequence will be passed to the `TransformerEncoder`,
which will produce a new representation of it.
This new representation will then be passed
to the `TransformerDecoder`, together with the target sequence so far (target words 0 to N).
The `TransformerDecoder` will then seek to predict the next words in the target sequence (N+1 and beyond).

A key detail that makes this possible is causal masking
(`use_causal_mask=True` in the first attention layer of the `TransformerDecoder`).
The `TransformerDecoder` sees the entire sequence at once, and thus we must make
sure that it only uses information from target tokens 0 to N when predicting token N+1
(otherwise, it could use information from the future, which would
result in a model that cannot be used at inference time).
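
A causal mask is easiest to picture as a lower-triangular matrix. The toy sketch below builds one in plain Python; it is only an illustration of the concept (Keras constructs its own mask internally when `use_causal_mask=True`), and the helper name `causal_mask` is hypothetical.

```python
def causal_mask(length):
    """mask[i][j] is 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row `i` lists the positions that token `i` is allowed to look at: itself and everything before it, never anything after.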

In [27]:

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=embed_dim
        )
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "sequence_length": self.sequence_length,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.add = layers.Add()  # instead of `+` to preserve mask
        self.supports_masking = True

    def call(self, inputs, encoder_outputs, mask=None):
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, use_causal_mask=True
        )
        out_1 = self.layernorm_1(self.add([inputs, attention_output_1]))

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
        )
        out_2 = self.layernorm_2(self.add([out_1, attention_output_2]))

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(self.add([out_2, proj_output]))

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config


Next, we assemble the end-to-end model.

In [28]:
embed_dim = 256
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoded_seq_inputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

decoder_outputs = decoder([decoder_inputs, encoder_outputs])
transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)

## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics, rather than accuracy.

In [29]:
epochs = 1  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 positional_embedding (Position  (None, None, 256)   3845120     ['encoder_inputs[0][0]']         
 alEmbedding)                                                                                     
                                                                                                  
 decoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 transformer_encoder (Transform  (None, None, 256)   3155456     ['positional_embedding[

<keras.callbacks.History at 0x2c342fe80>
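
The two parameter counts visible in the summary above can be reproduced by hand, which is a useful sanity check on the layer sizes. The sketch below assumes the Keras defaults (bias terms on every projection, and a scale plus an offset vector per `LayerNormalization`); the helper `dense_params` is hypothetical. Note that `key_dim=embed_dim` makes each of the 8 heads 256-dimensional, so the query, key, and value projections each map 256 features to 8 × 256.

```python
vocab_size, sequence_length = 15000, 20
embed_dim, dense_dim, num_heads = 256, 2048, 8
key_dim = embed_dim  # per-head size, as passed to MultiHeadAttention above


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected projection."""
    return n_in * n_out + n_out


# PositionalEmbedding: one token table and one position table.
embedding = vocab_size * embed_dim + sequence_length * embed_dim

# TransformerEncoder: attention projections, feed-forward block, two layer norms.
qkv = 3 * dense_params(embed_dim, num_heads * key_dim)
attn_out = dense_params(num_heads * key_dim, embed_dim)
feed_forward = dense_params(embed_dim, dense_dim) + dense_params(dense_dim, embed_dim)
layer_norms = 2 * (2 * embed_dim)
encoder = qkv + attn_out + feed_forward + layer_norms

print(embedding)  # 3845120
print(encoder)    # 3155456
```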

## Decoding test sentences

Finally, let's demonstrate how to translate brand new English sentences.
We simply feed the model the vectorized English sentence
together with the target token `"[start]"`, then repeatedly generate the next token until
we hit the token `"[end]"`.
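
The loop described above is a form of greedy decoding: at each step, keep the single most likely next token. Here is a minimal, model-free sketch of that control flow. The scoring function `toy_scores` is a hypothetical stand-in for the trained Transformer and simply walks through a fixed sentence.

```python
def greedy_decode(next_token_scores, max_length=20, start="[start]", end="[end]"):
    """Repeatedly append the highest-scoring next token until `end` or max_length."""
    tokens = [start]
    for _ in range(max_length):
        scores = next_token_scores(tokens)  # dict: candidate token -> score
        best = max(scores, key=scores.get)  # greedy choice (argmax)
        tokens.append(best)
        if best == end:
            break
    return " ".join(tokens)


# Hypothetical stand-in for the trained model.
reference = ["kya", "kal", "baarish", "hogi", "[end]"]


def toy_scores(tokens):
    position = len(tokens) - 1  # number of tokens generated so far
    return {word: (1.0 if i == position else 0.0) for i, word in enumerate(reference)}


print(greedy_decode(toy_scores))  # [start] kya kal baarish hogi [end]
```

The stopping test only works if the end token produced by the model is spelled exactly like the string it is compared against, so the vocabulary entry and the comparison string need to agree.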

In [34]:
hing_vocab = hing_vectorization.get_vocabulary()
hing_index_lookup = dict(zip(range(len(hing_vocab)), hing_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        # Re-vectorize the partial translation and drop the last position,
        # mirroring the decoder inputs used during training.
        tokenized_target_sentence = hing_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])

        # Greedy decoding: pick the most likely token at step i.
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = hing_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        if sampled_token == "[end]":
            break
    return decoded_sentence


test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(30):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequence(input_sentence)
    print("\ninput: ", input_sentence, "\ntranslated: ", translated)


input:  tell me the weather for the next days 
translated:  [start] mujhe agle din tak ke liye mausam batao end  karne end ke karne ko hai end hai end end

input:  i need an alarm for the next months at am 
translated:  [start] mujhe agle mahine subah baje ke liye alarm chahiye end  end karne ke ke hai end hai end kare

input:  how many miles is it to the grocery store 
translated:  [start] mere grocery store tak kitne miles hai end  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] end [UNK]

input:  will it rain tomorrow 
translated:  [start] kya kal baarish hogi end  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] end [UNK] [UNK] [UNK] [UNK] [UNK]

input:  please search for the forecast in kelowna canada 
translated:  [start] please toronto me forecast ko the forecast ko dikhaiye end  end end karne end ke kare end kare end

input:  set a reminder to call references for the new job candidate before the interview 
translated:  [start] new [UNK] ke liye new [UNK] ko call karne ke 

In [32]:
input_sentence

'drive me home on the route with the least cops on it'

In [33]:
translated

'[start] mujhe ghar ke liye sabse tez route par drive karke jaane me kitna samay lagega end  end karne end'