# <p style="padding: 15px; background-color: #3F384A; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Neural Machine Translation with Attention</p>


In [18]:
import os
import shutil
import subprocess
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras
from keras import layers
from colorama import Fore, Style
from IPython.core.display import HTML

warnings.filterwarnings("ignore")

K = keras.backend
ON_KAGGLE = os.getenv("KAGGLE_KERNEL_RUN_TYPE") is not None
FONT_COLOR = "#141B4D"
BACKGROUND_COLOR = "#F6F5F5"
CLR = (Style.BRIGHT + Fore.BLACK) if ON_KAGGLE else (Style.BRIGHT + Fore.WHITE)
RED = Style.BRIGHT + Fore.RED
BLUE = Style.BRIGHT + Fore.BLUE
CYAN = Style.BRIGHT + Fore.CYAN
RESET = Style.RESET_ALL
NOTEBOOK_PALETTE = {
    "DeepPlum": "#3F384A",
    "RubyRed": "#E04C5F",
    "SunburstOrange": "#FFB74D",
}


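def download_dataset_from_kaggle(user, dataset, directory):
    """Download and unpack a Kaggle dataset into `directory` unless it is already there."""
    archive_name = dataset + ".zip"
    filepath = directory / archive_name

    if not filepath.is_file():
        # Create the target directory before anything is moved into it.
        directory.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["kaggle", "datasets", "download", "-d", f"{user}/{dataset}"], check=True
        )
        # The CLI saves the archive in the working directory; unpack and move it.
        shutil.unpack_archive(archive_name, str(directory))
        shutil.move(archive_name, str(directory))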


HTML(
    """
<style>
code {
    background: rgba(42, 53, 125, 0.10) !important;
    border-radius: 4px !important;
}
</style>
"""
)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
">
    <b>Notebook Description</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    This notebook tackles one of the classic natural language processing (NLP) problems: <b>machine translation</b>. We translate English sentences into French, so we are dealing with a <b>sequence-to-sequence</b> (seq2seq) learning problem. We use two approaches: the encoder-decoder RNN architecture and the more recent Transformer architecture. To do that, we use two <b>English-French</b> datasets. The first part focuses on an easy dataset (around <b>175000 sentences</b>, <b>12 MB</b>), whereas the second part uses a much larger one (about <b>22.5 million sentences</b>, <b>8 GB</b>).
</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
">
    <b>This Notebook Covers</b> 📔
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
    margin-bottom: 20px;
">
    <li>Handling sequence-to-sequence translation problems using English-French datasets.</li>
    <li>Preparing efficient datasets with the <code>TensorFlow</code> data API.</li>
    <li>Preparing <code>TensorFlow</code> datasets from data that does not fit into memory.</li>
    <li>Building a bidirectional encoder-decoder RNN architecture.</li>
    <li>Building a transformer architecture.</li>
    <li>Translating example sentences with the above models.</li>
</ul>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
">
    <b>See Datasets Here</b> 📈
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    <a href="https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench" style="color:#FFB74D"><b>Easy English-French Dataset</b></a><br>
    <a href="https://www.kaggle.com/datasets/dhruvildave/en-fr-translation-dataset" style="color:#FFB74D"><b>Hard English-French Dataset</b></a>
</p>

# <p style="padding: 15px; background-color: #3F384A; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Tackling Easy Dataset</p>


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>In this section, we will focus on an easy English-French dataset.</li>
    <li>First, let's download it and see what we are dealing with.</li>
</ul>


In [19]:
easy_dataset_user = "devicharith"
easy_dataset = "language-translation-englishfrench"
data_dir = Path("data")

if not ON_KAGGLE:
    download_dataset_from_kaggle(easy_dataset_user, easy_dataset, data_dir)
    easy_dataset_path = data_dir / "eng_-french.csv"
else:
    easy_dataset_path = Path(
        "/kaggle/input/language-translation-englishfrench/eng_-french.csv"
    )


In [20]:
easy_dataset = pd.read_csv(easy_dataset_path, encoding="utf-8", engine="pyarrow")
easy_dataset = easy_dataset.sample(len(easy_dataset), random_state=42)
easy_dataset.head()


Unnamed: 0,English words/sentences,French words/sentences
2785,Take a seat.,Prends place !
29880,I wish Tom was here.,J'aimerais que Tom soit là.
53776,How did the audition go?,Comment s'est passée l'audition ?
154386,I've no friend to talk to about my problems.,Je n'ai pas d'ami avec lequel je puisse m'entr...
149823,I really like this skirt. Can I try it on?,"J'aime beaucoup cette jupe, puis-je l'essayer ?"


In [21]:
easy_dataset.info()


<class 'pandas.core.frame.DataFrame'>
Index: 175621 entries, 2785 to 121958
Data columns (total 2 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   English words/sentences  175621 non-null  object
 1   French words/sentences   175621 non-null  object
dtypes: object(2)
memory usage: 4.0+ MB


In [22]:
easy_dataset["English Words in Sentence"] = (
    easy_dataset["English words/sentences"].str.split().apply(len)
)
easy_dataset["French Words in Sentence"] = (
    easy_dataset["French words/sentences"].str.split().apply(len)
)

fig = px.histogram(
    easy_dataset,
    x=["English Words in Sentence", "French Words in Sentence"],
    color_discrete_sequence=["#3f384a", "#e04c5f"],
    labels={"variable": "Variable", "value": "Words in Sentence"},
    marginal="box",
    barmode="group",
    height=540,
    width=840,
    title="Easy Dataset - Words in Sentence",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    bargap=0.2,
    bargroupgap=0.1,
    legend=dict(orientation="h", yanchor="bottom", xanchor="right", y=1.02, x=1),
    yaxis_title="Count",
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>As you can see, most sentences contain only a few words and rarely more than $15$.</li>
    <li>In the raw file, sentences are sorted in ascending order of the number of words, which is why the data was shuffled right after loading.</li>
    <li>Let's prepare the training and validation datasets. We assign 10% of the data to validation.</li>
</ul>


In [6]:
sentences_en = easy_dataset["English words/sentences"].to_numpy()
sentences_fr = easy_dataset["French words/sentences"].to_numpy()

valid_fraction = 0.1
valid_len = int(valid_fraction * len(easy_dataset))

sentences_en_train = sentences_en[:-valid_len]
sentences_fr_train = sentences_fr[:-valid_len]

sentences_en_valid = sentences_en[-valid_len:]
sentences_fr_valid = sentences_fr[-valid_len:]


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>Now that the data is split, we can prepare it for the encoder-decoder RNN architecture. In general, we need two inputs and a target. The first input, i.e. the English sentences, is passed to the encoder, while the French sentences are passed to the decoder. However, the decoder must receive them shifted by one time step: at each step it sees the previous target word and predicts the next one. Therefore, we prepend a special token, <b>the start of a sequence (SOS)</b>, to the decoder input. It acts as a trigger signal for the decoder to start generating the translation. Likewise, the target ends with another special token, <b>the end of a sequence (EOS)</b>, which marks the completion of the translation: when the decoder generates the EOS token, the translation is finished.</li>
    <li>Now, we will write two short utility functions that create <code>TensorFlow</code> datasets for the encoder-decoder RNN.</li>
</ul>
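
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    The one-step offset between the decoder input and the target can be illustrated on a single sentence. The sketch below uses plain Python strings and a hypothetical helper, <code>make_decoder_pair</code>, which is not part of the notebook; it only mirrors the <code>startofseq</code>/<code>endofseq</code> convention described above.
</p>

```python
def make_decoder_pair(sentence_fr, sos="startofseq", eos="endofseq"):
    """Return (decoder_input_tokens, target_tokens) for one French sentence."""
    tokens = sentence_fr.split()
    decoder_input = [sos] + tokens  # What the decoder sees at each step.
    target = tokens + [eos]  # What the decoder should predict at each step.
    return decoder_input, target


decoder_input, target = make_decoder_pair("Prends place !")
for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: decoder sees {seen!r:14} -> should predict {expected!r}")
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    At every step the decoder input is the target word from the previous step, and the first input is the SOS token.
</p>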


In [7]:
def prepare_input_and_target(sentences_en, sentences_fr):
    """Return data in the format: `((encoder_input, decoder_input), target)`"""
    return (sentences_en, b"startofseq " + sentences_fr), sentences_fr + b" endofseq"


def from_sentences_dataset(
    sentences_en,
    sentences_fr,
    batch_size=32,
    cache=True,
    shuffle=False,
    shuffle_buffer_size=10_000,
    seed=None,
):
    """Creates `TensorFlow` dataset for encoder-decoder RNN from given sentences."""
    dataset = tf.data.Dataset.from_tensor_slices((sentences_en, sentences_fr))
    dataset = dataset.map(prepare_input_and_target, num_parallel_calls=tf.data.AUTOTUNE)
    if cache:
        dataset = dataset.cache()
    if shuffle:
        dataset = dataset.shuffle(shuffle_buffer_size, seed=seed)
    return dataset.batch(batch_size)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>Let's see how it works and measure performance.</li>
</ul>


In [8]:
benchmark_ds = from_sentences_dataset(sentences_en_train, sentences_fr_train)
benchmark_ds = benchmark_ds.prefetch(tf.data.AUTOTUNE)
bench_results = tfds.benchmark(benchmark_ds, batch_size=32)




************ Summary ************




  0%|          | 0/4940 [00:00<?, ?it/s]

Examples/sec (First included) 36532.10 ex/sec (total: 158112 ex, 4.33 sec)

Examples/sec (First only) 219.20 ex/sec (total: 32 ex, 0.15 sec)

Examples/sec (First excluded) 37799.73 ex/sec (total: 158080 ex, 4.18 sec)


In [9]:
example_ds = from_sentences_dataset(
    sentences_en_train, sentences_fr_train, batch_size=4
)
list(example_ds.take(1))[0]


((<tf.Tensor: shape=(4,), dtype=string, numpy=
  array([b'Take a seat.', b'I wish Tom was here.',
         b'How did the audition go?',
         b"I've no friend to talk to about my problems."], dtype=object)>,
  <tf.Tensor: shape=(4,), dtype=string, numpy=
  array([b'startofseq Prends place !',
         b"startofseq J'aimerais que Tom soit l\xc3\xa0.",
         b"startofseq Comment s'est pass\xc3\xa9e l'audition\xc2\xa0?",
         b"startofseq Je n'ai pas d'ami avec lequel je puisse m'entretenir de mes probl\xc3\xa8mes."],
        dtype=object)>),
 <tf.Tensor: shape=(4,), dtype=string, numpy=
 array([b'Prends place ! endofseq',
        b"J'aimerais que Tom soit l\xc3\xa0. endofseq",
        b"Comment s'est pass\xc3\xa9e l'audition\xc2\xa0? endofseq",
        b"Je n'ai pas d'ami avec lequel je puisse m'entretenir de mes probl\xc3\xa8mes. endofseq"],
       dtype=object)>)

In [10]:
example_ds.cardinality()  # Number of batches per epoch.


<tf.Tensor: shape=(), dtype=int64, numpy=39515>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>As you can see, everything works as intended. We got the output in the desired form, i.e. <code>((encoder_input, decoder_input), target)</code>.</li>
    <li>We need two more functions. The first one, <code>adapt_compile_and_fit()</code>, is responsible for the remaining dataset preparation, for adapting the model's text vectorization layers, and, finally, for the training process. The second one, <code>translate()</code>, is responsible for translating a sentence.</li>
    <li>Additionally, we will write a small callback, i.e. <code>ColoramaVerbose</code>, which slightly prettifies the training output.</li>
</ul>


In [11]:
class ColoramaVerbose(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print(
            f"{CLR}Epoch: {RED}{epoch + 1:02d}{CLR} -",
            f"{CLR}loss: {RED}{logs['loss']:.5f}{CLR} -",
            f"{CLR}accuracy: {RED}{logs['accuracy']:.5f}{CLR} -",
            f"{CLR}val_loss: {RED}{logs['val_loss']:.5f}{CLR} -",
            f"{CLR}val_accuracy: {RED}{logs['val_accuracy']:.5f}",
        )


In [12]:
def adapt_compile_and_fit(
    model,
    train_dataset,
    valid_dataset,
    n_epochs=25,
    n_patience=5,
    init_lr=0.001,
    lr_decay_rate=0.1,
    colorama_verbose=False,
):
    """Takes the model vectorization layers and adapts them to the training data.
    Then, it prepares the final datasets vectorizing targets and prefetching,
    and finally trains the given model. Additionally, provides learning rate scheduling
    (exponential decay), early stopping and colorama verbose."""

    model.vectorization_en.adapt(
        train_dataset.map(
            lambda sentences, target: sentences[0],  # English sentences.
            num_parallel_calls=tf.data.AUTOTUNE,
        )
    )
    model.vectorization_fr.adapt(
        train_dataset.map(
            lambda sentences, target: sentences[1] + b" endofseq",  # French sentences.
            num_parallel_calls=tf.data.AUTOTUNE,
        )
    )

    train_dataset_prepared = train_dataset.map(
        lambda sentences, target: (sentences, model.vectorization_fr(target)),
        num_parallel_calls=tf.data.AUTOTUNE,
    ).prefetch(tf.data.AUTOTUNE)

    valid_dataset_prepared = valid_dataset.map(
        lambda sentences, target: (sentences, model.vectorization_fr(target)),
        num_parallel_calls=tf.data.AUTOTUNE,
    ).prefetch(tf.data.AUTOTUNE)

    early_stopping_cb = keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=n_patience, restore_best_weights=True
    )
    
    # `train_dataset_prepared.cardinality()` doesn't work with multi-file
    # interleaving, so count the batches by iterating over the dataset once
    # (without keeping all batches in memory).
    n_decay_steps = n_epochs * sum(1 for _ in train_dataset_prepared)
    scheduled_lr = keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=init_lr,
        decay_steps=n_decay_steps,
        decay_rate=lr_decay_rate,
    )

    model_callbacks = [early_stopping_cb]
    verbose_level = 1
    if colorama_verbose:
        model_callbacks.append(ColoramaVerbose())
        verbose_level = 0

    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=keras.optimizers.RMSprop(learning_rate=scheduled_lr),
        metrics=["accuracy"],
    )

    return model.fit(
        train_dataset_prepared,
        epochs=n_epochs,
        validation_data=valid_dataset_prepared,
        callbacks=model_callbacks,
        verbose=verbose_level,
    )
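

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    The exponential-decay schedule used in <code>adapt_compile_and_fit()</code> can be written in a few lines. The sketch below is plain Python, not the Keras implementation, and assumes the continuous (non-staircase) form of the schedule; the step counts are illustrative values, not taken from the training run.
</p>

```python
def exponential_decay(step, init_lr=0.001, decay_rate=0.1, decay_steps=1000):
    """Learning rate after `step` optimizer updates (continuous exponential decay)."""
    return init_lr * decay_rate ** (step / decay_steps)


# Illustrative values: 1000 training steps in total.
for step in (0, 250, 500, 1000):
    print(f"step {step:4d}: learning rate = {exponential_decay(step):.6f}")
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    Because <code>decay_steps</code> is set to the total number of training steps, the learning rate reaches <code>init_lr * lr_decay_rate</code> at the end of the last epoch, provided early stopping does not end the training sooner.
</p>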


In [13]:
def translate(model, sentence_en):
    """Greedily translates one English sentence, one predicted word at a time."""
    vocabulary_fr = model.vectorization_fr.get_vocabulary()  # Built once, reused below.
    translation = ""
    for word_idx in range(model.max_sentence_len):
        X_encoder = np.array([sentence_en])
        X_decoder = np.array(["startofseq " + translation])
        # Probabilities for the word at position `word_idx`.
        y_proba = model.predict((X_encoder, X_decoder), verbose=0)[0, word_idx]
        predicted_word_id = np.argmax(y_proba)
        predicted_word = vocabulary_fr[predicted_word_id]
        if predicted_word == "endofseq":
            break
        translation += " " + predicted_word
    return translation.strip()


# <p style="padding: 15px; background-color: #3F384A; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Bidirectional Encoder-Decoder with Attention</p>


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>All utility functions and preprocessing steps are done, so we can move on to the implementation of an encoder-decoder RNN with an attention mechanism.</li>
    <li>In general, the encoder-decoder RNN with an attention mechanism is an extension of the basic encoder-decoder architecture for sequence-to-sequence tasks, such as machine translation. It addresses a limitation of the basic architecture, which often struggles with longer input sequences.</li>
    <li>In the basic architecture, the encoder compresses the input sequence into a fixed-length context vector, and the decoder generates the output sequence from that vector alone. The attention mechanism additionally lets the decoder look at all encoder outputs and focus on the most relevant parts of the input sequence at each step, allowing for better alignment and handling of long sentences.</li>
    <li>Roughly speaking, the encoder-decoder RNN consists of vectorization layers, embedding layers, recurrent layers (usually LSTM or GRU, which form the actual encoder and decoder), an attention layer and a final dense output layer.</li>
    <li>The last thing is the word "bidirectional". The point here is that the encoder is bidirectional, meaning the sequence is processed both from left to right and from right to left. The outputs of both directions are concatenated, so a bidirectional LSTM layer with, for example, $16$ units per direction produces $32$-dimensional outputs. This is why the encoder below uses half as many units per direction as the decoder. Such a mechanism helps to capture the sentence context.</li>
</ul>
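
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    To make the attention step more concrete, here is a minimal NumPy sketch of unscaled dot-product (Luong-style) attention, in which every decoder step forms a weighted average of all encoder outputs. It is only an illustration under simplifying assumptions: there is no batch dimension and no masking, the sizes are toy values, and the function name <code>dot_product_attention</code> is hypothetical.
</p>

```python
import numpy as np


def dot_product_attention(query, value):
    """Each query row attends over all value rows (keys are the values themselves).

    query: (n_decoder_steps, n_units), value: (n_encoder_steps, n_units).
    """
    scores = query @ value.T  # Similarity of every decoder step to every encoder step.
    scores -= scores.max(axis=-1, keepdims=True)  # For numerical stability.
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # Softmax over encoder steps.
    return weights @ value, weights


rng = np.random.default_rng(42)
encoder_output = rng.normal(size=(5, 8))  # 5 source tokens, 8 units (toy sizes).
decoder_output = rng.normal(size=(3, 8))  # 3 target tokens, 8 units.

context, weights = dot_product_attention(decoder_output, encoder_output)
print(context.shape, weights.shape)  # (3, 8) (3, 5)
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    Each row of <code>weights</code> sums to one and tells how strongly a given decoder step focuses on each source token.
</p>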


In [14]:
class BidirectionalEncoderDecoderWithAttention(keras.Model):
    def __init__(
        self,
        vocabulary_size=5000,
        max_sentence_len=50,
        embedding_size=256,
        n_units_lstm=512,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.max_sentence_len = max_sentence_len

        self.vectorization_en = layers.TextVectorization(
            vocabulary_size, output_sequence_length=max_sentence_len
        )
        self.vectorization_fr = layers.TextVectorization(
            vocabulary_size, output_sequence_length=max_sentence_len
        )

        self.encoder_embedding = layers.Embedding(
            vocabulary_size, embedding_size, mask_zero=True
        )
        self.decoder_embedding = layers.Embedding(
            vocabulary_size, embedding_size, mask_zero=True
        )

        self.encoder = layers.Bidirectional(
            layers.LSTM(n_units_lstm // 2, return_sequences=True, return_state=True)
        )
        self.decoder = layers.LSTM(n_units_lstm, return_sequences=True)
        self.attention = layers.Attention()
        self.output_layer = layers.Dense(vocabulary_size, activation="softmax")

    def call(self, inputs):
        encoder_inputs, decoder_inputs = inputs

        encoder_input_ids = self.vectorization_en(encoder_inputs)
        decoder_input_ids = self.vectorization_fr(decoder_inputs)

        encoder_embeddings = self.encoder_embedding(encoder_input_ids)
        decoder_embeddings = self.decoder_embedding(decoder_input_ids)

        # The final hidden state of the encoder, representing the entire
        # input sequence, is used to initialize the decoder.
        encoder_output, *encoder_state = self.encoder(encoder_embeddings)
        encoder_state = [
            tf.concat(encoder_state[0::2], axis=-1),  # Short-term state (0 & 2).
            tf.concat(encoder_state[1::2], axis=-1),  # Long-term state (1 & 3).
        ]
        decoder_output = self.decoder(decoder_embeddings, initial_state=encoder_state)
        attention_output = self.attention([decoder_output, encoder_output])

        return self.output_layer(attention_output)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>We are ready to train the model. As noted earlier, sentences are rarely longer than $15$ words, so we use this value as the maximum sentence length in the model.</li>
</ul>


In [15]:
K.clear_session()  # Resets all state generated by Keras.
tf.random.set_seed(42)  # Ensure reproducibility on CPU.

easy_train_ds = from_sentences_dataset(
    sentences_en_train, sentences_fr_train, shuffle=True, seed=42
)
easy_valid_ds = from_sentences_dataset(sentences_en_valid, sentences_fr_valid)

bidirect_encoder_decoder = BidirectionalEncoderDecoderWithAttention(max_sentence_len=15)
bidirect_history = adapt_compile_and_fit(
    bidirect_encoder_decoder,
    easy_train_ds,
    easy_valid_ds,
    init_lr=0.01,
    lr_decay_rate=0.01,
    colorama_verbose=True,
)


[1m[30mEpoch: [1m[31m01[1m[30m - [1m[30mloss: [1m[31m2.96179[1m[30m - [1m[30maccuracy: [1m[31m0.45073[1m[30m - [1m[30mval_loss: [1m[31m2.09467[1m[30m - [1m[30mval_accuracy: [1m[31m0.55874

[1m[30mEpoch: [1m[31m02[1m[30m - [1m[30mloss: [1m[31m1.82512[1m[30m - [1m[30maccuracy: [1m[31m0.59895[1m[30m - [1m[30mval_loss: [1m[31m1.73660[1m[30m - [1m[30mval_accuracy: [1m[31m0.61497

[1m[30mEpoch: [1m[31m03[1m[30m - [1m[30mloss: [1m[31m1.50049[1m[30m - [1m[30maccuracy: [1m[31m0.65149[1m[30m - [1m[30mval_loss: [1m[31m1.59598[1m[30m - [1m[30mval_accuracy: [1m[31m0.64259

[1m[30mEpoch: [1m[31m04[1m[30m - [1m[30mloss: [1m[31m1.29372[1m[30m - [1m[30maccuracy: [1m[31m0.68727[1m[30m - [1m[30mval_loss: [1m[31m1.50442[1m[30m - [1m[30mval_accuracy: [1m[31m0.66261

[1m[30mEpoch: [1m[31m05[1m[30m - [1m[30mloss: [1m[31m1.13093[1m[30m - [1m[30maccuracy: [1m[31m0.71746[1m[30m - [1

In [16]:
fig = px.line(
    bidirect_history.history,
    markers=True,
    height=540,
    width=840,
    symbol="variable",
    labels={"variable": "Variable", "value": "Value", "index": "Epoch"},
    title="Easy Dataset - Encoder-Decoder RNN Training Process",
    color_discrete_sequence=px.colors.diverging.balance_r,
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
)
fig.show()


In [17]:
translation1 = translate(bidirect_encoder_decoder, "Take a seat")
translation2 = translate(bidirect_encoder_decoder, "I wish Tom was here.")
translation3 = translate(bidirect_encoder_decoder, "She ordered him to do it.")

print(CLR + "Actual Possible Translations:")
print(BLUE + "Take a seat".ljust(25), RED + "-> ", BLUE + "Prends place !")
print(
    BLUE + "I wish Tom was here.".ljust(25),
    RED + "-> ",
    BLUE + "J'aimerais que Tom soit là.",
)
print(
    BLUE + "She ordered him to do it.".ljust(25),
    RED + "-> ",
    BLUE + "Elle lui a ordonné de le faire.",
)
print()
print(CLR + "Model Translations:")
print(BLUE + "Take a seat".ljust(25), RED + "-> ", BLUE + translation1)
print(BLUE + "I wish Tom was here.".ljust(25), RED + "-> ", BLUE + translation2)
print(BLUE + "She ordered him to do it.".ljust(25), RED + "-> ", BLUE + translation3)


[1m[30mActual Possible Translations:

[1m[34mTake a seat               [1m[31m->  [1m[34mPrends place !

[1m[34mI wish Tom was here.      [1m[31m->  [1m[34mJ'aimerais que Tom soit là.

[1m[34mShe ordered him to do it. [1m[31m->  [1m[34mElle lui a ordonné de le faire.



[1m[30mModel Translations:

[1m[34mTake a seat               [1m[31m->  [1m[34massiedstoi

[1m[34mI wish Tom was here.      [1m[31m->  [1m[34mjaimerais que tom soit là

[1m[34mShe ordered him to do it. [1m[31m->  [1m[34melle lui [UNK] de le faire


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
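    <li>The model handles short sentences quite well but struggles with longer ones, and some translations are far from ideal (note the <code>[UNK]</code> token above, which stands for a word outside the model's vocabulary). One possible way to get better translations is the so-called <b>beam search</b>, which keeps several candidate translations at each step instead of committing to the single most probable word. It is not implemented in this notebook, but the concept is well documented elsewhere.</li>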
</ul>
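
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    For readers curious about beam search, here is a minimal, generic sketch in plain Python. It is not wired to the model above: <code>TOY_MODEL</code> and <code>toy_next_token_probs</code> are hypothetical stand-ins with made-up probabilities for a real next-word distribution. Greedy decoding, as in <code>translate()</code>, corresponds to <code>beam_width=1</code>.
</p>

```python
import math

# Hypothetical next-word probabilities, keyed by the words generated so far.
TOY_MODEL = {
    (): {"je": 0.6, "j'aimerais": 0.4},
    ("je",): {"veux": 0.5, "suis": 0.5},
    ("j'aimerais",): {"venir": 0.9, "endofseq": 0.1},
}


def toy_next_token_probs(tokens):
    """Return {word: probability}; unknown prefixes simply end the sentence."""
    return TOY_MODEL.get(tuple(tokens), {"endofseq": 1.0})


def beam_search(next_token_probs, beam_width=2, max_len=5, eos="endofseq"):
    """Keep the `beam_width` most probable partial translations at every step."""
    beams = [([], 0.0)]  # Pairs of (tokens, cumulative log-probability).
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # Finished hypothesis, keep as is.
                continue
            for token, proba in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(proba)))
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams


greedy_tokens, greedy_score = beam_search(toy_next_token_probs, beam_width=1)[0]
best_tokens, best_score = beam_search(toy_next_token_probs, beam_width=2)[0]
print("greedy:", " ".join(greedy_tokens), f"(p = {math.exp(greedy_score):.2f})")
print("beam:  ", " ".join(best_tokens), f"(p = {math.exp(best_score):.2f})")
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    In this toy example, greedy decoding commits to the most probable first word and ends up with a less probable sentence overall, whereas a beam of width two recovers the more probable one.
</p>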


# <p style="padding: 15px; background-color: #3F384A; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Transformer Architecture</p>


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>The Transformer architecture has revolutionized language translation tasks. It was designed specifically for sequence-to-sequence tasks like language translation and replaces recurrent neural networks (RNNs) with self-attention mechanisms, which capture dependencies between all positions of a sequence. Since self-attention by itself is insensitive to word order, positional information is added separately.</li>
    <li>The encoder component processes the input sequence with a stack of identical encoder layers. Each encoder layer comprises a <b>multi-head self-attention mechanism</b> and a <b>position-wise feed-forward neural network</b>.</li>
    <li>The decoder component also consists of a stack of identical layers, but each of them uses <b>masked self-attention</b> and an additional <b>encoder-decoder attention mechanism</b>. The masked self-attention prevents the decoder from attending to future positions during training, ensuring the model generates outputs based only on the current and previously generated tokens. The encoder-decoder attention allows the decoder to attend to relevant parts of the encoded input sequence.</li>
    <li>Another new component is <b>positional encoding (PE)</b>. It adds positional information to the input embeddings so that the model can account for word order and differentiate between words based on their positions. We implement it using sine and cosine functions of different frequencies.</li>
    <li>You can find the groundbreaking paper that introduced the Transformer here: <a href="https://arxiv.org/abs/1706.03762" style="color:#FFB74D"><b>Attention Is All You Need</b></a>. I really encourage you to read it, since it makes the code below much easier to understand.<br><br>
    <figure>
        <center><img src="https://raw.githubusercontent.com/mateuszk098/kaggle_notebooks/master/mt_with_transformers/transformer_architecture.png" alt="Transformer"></center><br>
        <center><figcaption><b>Transformer Architecture. Source: <a href="https://arxiv.org/abs/1706.03762" style="color:#FFB74D">Attention Is All You Need</a>.</b></figcaption></center>
    </figure>
    </li>
</ul>
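
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    In the diagram, every sub-layer is followed by an "Add &amp; Norm" step, i.e. a residual connection and layer normalization. The NumPy sketch below shows this wrapping around a position-wise feed-forward network. It is a simplified illustration with toy sizes and random weights, without dropout and without the learned scale and offset of a real normalization layer; the function names are hypothetical.
</p>

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize every position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise network: the same two dense layers applied to every position."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU activation.
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 16  # Toy sizes.
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection first, then normalization.
out = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
print(out.shape)  # (5, 8)
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    The output has the same shape as the input, which is what allows identical layers to be stacked.
</p>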


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>The original paper implements positional encoding with sine and cosine functions:<br>
    \[PE_{(pos, 2i)}     = \sin\left(pos\big/10000^{2i\big/d_{model}}\right)\]
    \[PE_{(pos, 2i + 1)} = \cos\left(pos\big/10000^{2i\big/d_{model}}\right)\]
    where $pos$ is the position in the sequence, $i$ indexes the sine-cosine pairs of dimensions ($0 \le i < d_{model}/2$), and $d_{model}$ is the embedding size.</li>
    <li>You can find more about this in the original paper and here: <a href="https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/#:~:text=What%20Is%20Positional%20Encoding%3F,item's%20position%20in%20transformer%20models." style="color:#FFB74D"><b>A Gentle Introduction to Positional Encoding in Transformer Models</b></a>.</li>
</ul>
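
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    As a quick sanity check of the formulas above, the following plain-Python sketch evaluates them for a single position. The helper name <code>positional_encoding</code> and the tiny $d_{model} = 4$ are chosen only for illustration.
</p>

```python
import math


def positional_encoding(pos, d_model):
    """Return the d_model-dimensional positional encoding of position `pos`."""
    encoding = []
    for i in range(d_model // 2):
        angle = pos / 10_000 ** (2 * i / d_model)
        encoding += [math.sin(angle), math.cos(angle)]  # Dimensions 2i and 2i + 1.
    return encoding


print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print([round(value, 4) for value in positional_encoding(1, 4)])
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    Even dimensions hold sines and odd dimensions hold cosines, and the frequency decreases as $i$ grows, so later dimensions change more slowly with the position.
</p>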


In [18]:
class PositionalEncoding(layers.Layer):
    def __init__(
        self, max_sentence_len=50, embedding_size=256, dtype=tf.float32, **kwargs
    ):
        super().__init__(dtype=dtype, **kwargs)
        if embedding_size % 2 != 0:
            raise ValueError("The `embedding_size` must be even.")

        p, i = np.meshgrid(np.arange(max_sentence_len), np.arange(embedding_size // 2))
        pos_emb = np.empty((1, max_sentence_len, embedding_size))
        pos_emb[:, :, 0::2] = np.sin(p / 10_000 ** (2 * i / embedding_size)).T
        pos_emb[:, :, 1::2] = np.cos(p / 10_000 ** (2 * i / embedding_size)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))
        self.supports_masking = True

    def call(self, inputs):
        batch_max_length = tf.shape(inputs)[1]
        return inputs + self.positional_embedding[:, :batch_max_length]


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>To implement the encoder and decoder, we follow the diagram above. There is also one crucial thing we have to do in the decoder: provide an appropriate mask. In the original paper we read: <i>"We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i."</i></li>
</ul>
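
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    The mask described in the quote is a lower-triangular matrix. The NumPy sketch below builds one and applies it inside a softmax, using uniform scores so that the effect of the mask is easy to see. The helper names <code>causal_mask</code> and <code>masked_softmax</code> are hypothetical, and the sketch ignores the batch dimension and padding.
</p>

```python
import numpy as np


def causal_mask(seq_len):
    """Boolean mask where entry [i, j] is True if position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def masked_softmax(scores, mask):
    """Softmax over the last axis with masked-out positions given zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)  # Uniform scores for illustration.
print(mask.astype(int))
print(weights.round(2))
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 20px;
    margin-right: 20px;
">
    Position $i$ of the decoder input may attend to itself because the decoder input is shifted by one step: it holds the output at position $i - 1$, so the prediction for position $i$ still depends only on earlier outputs.
</p>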


In [19]:
class Encoder(layers.Layer):
    def __init__(
        self,
        embedding_size=256,
        n_attention_heads=8,
        n_units_dense=256,
        dropout_rate=0.2,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.multi_head_attention = layers.MultiHeadAttention(
            n_attention_heads, embedding_size, dropout=dropout_rate
        )
        self.feed_forward = keras.Sequential(
            [
                layers.Dense(
                    n_units_dense, activation="relu", kernel_initializer="he_normal"
                ),
                layers.Dense(embedding_size, kernel_initializer="he_normal"),
                layers.Dropout(dropout_rate),
            ]
        )
        self.add = layers.Add()
        self.normalization = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        Z = inputs
        skip_Z = Z
        Z = self.multi_head_attention(Z, value=Z, attention_mask=mask)
        Z = self.normalization(self.add([Z, skip_Z]))
        skip_Z = Z
        Z = self.feed_forward(Z)
        return self.normalization(self.add([Z, skip_Z]))


class Decoder(layers.Layer):
    def __init__(
        self,
        embedding_size=256,
        n_attention_heads=8,
        n_units_dense=256,
        dropout_rate=0.2,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.masked_multi_head_attention = layers.MultiHeadAttention(
            n_attention_heads, embedding_size, dropout=dropout_rate
        )
        self.multi_head_attention = layers.MultiHeadAttention(
            n_attention_heads, embedding_size, dropout=dropout_rate
        )
        self.feed_forward = keras.Sequential(
            [
                layers.Dense(
                    n_units_dense, activation="relu", kernel_initializer="he_normal"
                ),
                layers.Dense(embedding_size, kernel_initializer="he_normal"),
                layers.Dropout(dropout_rate),
            ]
        )
        self.add = layers.Add()
        self.normalization = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        decoder_mask, encoder_mask = mask  # type: ignore
        Z, encoder_output = inputs
        Z_skip = Z
        Z = self.masked_multi_head_attention(Z, value=Z, attention_mask=decoder_mask)
        Z = self.normalization(self.add([Z, Z_skip]))
        Z_skip = Z
        Z = self.multi_head_attention(
            Z, value=encoder_output, attention_mask=encoder_mask
        )
        Z = self.normalization(self.add([Z, Z_skip]))
        Z_skip = Z
        Z = self.feed_forward(Z)
        return self.normalization(self.add([Z, Z_skip]))
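
Both blocks above follow every sub-layer (attention or feed-forward) with the same "add, then normalize" step. The NumPy sketch below illustrates that step in isolation. It is a simplified stand-in, not the Keras implementation: it omits the learned scale and shift of `layers.LayerNormalization`, and `epsilon=1e-3` is assumed to match the Keras default. The helper names `layer_norm` and `add_and_norm` are made up for this illustration.

```python
import numpy as np


def layer_norm(x, epsilon=1e-3):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)


def add_and_norm(sublayer_output, skip):
    """Residual connection followed by layer normalization."""
    return layer_norm(sublayer_output + skip)


rng = np.random.default_rng(0)
skip = rng.normal(size=(2, 4, 8))             # (batch, sequence, embedding)
sublayer_output = rng.normal(size=(2, 4, 8))  # stand-in for attention or feed-forward output

out = add_and_norm(sublayer_output, skip)
print(out.shape)                        # shape is unchanged by the step
print(np.abs(out.mean(axis=-1)).max())  # per-position mean is close to zero
```

Because the output keeps the input shape, such blocks can be stacked freely, which is what `n_encoder_decoder_blocks` relies on in the next cell.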


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>Now that we have all the components, it's time to assemble the full Transformer architecture.</li>
</ul>


In [20]:
class Transformer(keras.Model):
    def __init__(
        self,
        vocabulary_size=5000,
        max_sentence_len=50,
        embedding_size=256,
        n_encoder_decoder_blocks=1,
        n_attention_heads=8,
        n_units_dense=256,
        dropout_rate=0.2,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.max_sentence_len = max_sentence_len

        self.vectorization_en = layers.TextVectorization(
            vocabulary_size, output_sequence_length=max_sentence_len
        )
        self.vectorization_fr = layers.TextVectorization(
            vocabulary_size, output_sequence_length=max_sentence_len
        )
        self.encoder_embedding = layers.Embedding(
            vocabulary_size, embedding_size, mask_zero=True
        )
        self.decoder_embedding = layers.Embedding(
            vocabulary_size, embedding_size, mask_zero=True
        )
        self.positional_encoding = PositionalEncoding(max_sentence_len, embedding_size)
        self.encoder_blocks = [
            Encoder(embedding_size, n_attention_heads, n_units_dense, dropout_rate)
            for _ in range(n_encoder_decoder_blocks)
        ]
        self.decoder_blocks = [
            Decoder(embedding_size, n_attention_heads, n_units_dense, dropout_rate)
            for _ in range(n_encoder_decoder_blocks)
        ]
        self.output_layer = layers.Dense(vocabulary_size, activation="softmax")

    def call(self, inputs):
        encoder_inputs, decoder_inputs = inputs

        encoder_input_ids = self.vectorization_en(encoder_inputs)
        decoder_input_ids = self.vectorization_fr(decoder_inputs)

        encoder_embeddings = self.encoder_embedding(encoder_input_ids)
        decoder_embeddings = self.decoder_embedding(decoder_input_ids)

        encoder_pos_embeddings = self.positional_encoding(encoder_embeddings)
        decoder_pos_embeddings = self.positional_encoding(decoder_embeddings)

        encoder_pad_mask = tf.math.not_equal(encoder_input_ids, 0)[:, tf.newaxis]
        decoder_pad_mask = tf.math.not_equal(decoder_input_ids, 0)[:, tf.newaxis]

        # From original paper: "This masking, combined with fact that the output
        # embeddings are offset by one position, ensures that the predictions for
        # position i can depend only on the known outputs at positions less than i."
        batch_max_len_decoder = tf.shape(decoder_embeddings)[1]
        decoder_causal_mask = tf.linalg.band_part(  # Lower triangular matrix.
            tf.ones((batch_max_len_decoder, batch_max_len_decoder), tf.bool), -1, 0
        )
        decoder_mask = decoder_causal_mask & decoder_pad_mask

        Z = encoder_pos_embeddings
        for encoder_block in self.encoder_blocks:
            Z = encoder_block(Z, mask=encoder_pad_mask)

        encoder_output = Z
        Z = decoder_pos_embeddings
        for decoder_block in self.decoder_blocks:
            Z = decoder_block(
                [Z, encoder_output], mask=[decoder_mask, encoder_pad_mask]
            )

        return self.output_layer(Z)
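
To make the mask construction in `call` more concrete, here is a small NumPy sketch of how the padding mask and the causal mask combine through broadcasting. The token ids are made up for the illustration, and `np.tril` stands in for `tf.linalg.band_part(..., -1, 0)`.

```python
import numpy as np

# Token ids for a batch of two decoder inputs; 0 is the padding id.
decoder_input_ids = np.array([
    [5, 8, 3, 0],
    [7, 2, 0, 0],
])

# Shape (batch, 1, seq): broadcast over the query axis, marks real (non-padding) keys.
pad_mask = (decoder_input_ids != 0)[:, np.newaxis]

# Shape (seq, seq): lower triangular, so query i may attend to keys 0..i only.
seq_len = decoder_input_ids.shape[1]
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Shape (batch, seq, seq) after broadcasting.
decoder_mask = causal_mask & pad_mask
print(decoder_mask[1].astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 0 0]
#  [1 1 0 0]]
```

In the second sentence only the first two positions hold real tokens, so no query can attend to positions 2 and 3, and the causal part additionally hides position 1 from query 0.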


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>Now we can train this model in the same way as before.</li>
</ul>


In [21]:
K.clear_session()
tf.random.set_seed(42)

transformer = Transformer(max_sentence_len=15)
transformer_history = adapt_compile_and_fit(
    transformer, easy_train_ds, easy_valid_ds, colorama_verbose=True
)


[1m[30mEpoch: [1m[31m01[1m[30m - [1m[30mloss: [1m[31m4.48622[1m[30m - [1m[30maccuracy: [1m[31m0.26696[1m[30m - [1m[30mval_loss: [1m[31m3.59927[1m[30m - [1m[30mval_accuracy: [1m[31m0.36447

[1m[30mEpoch: [1m[31m02[1m[30m - [1m[30mloss: [1m[31m3.29729[1m[30m - [1m[30maccuracy: [1m[31m0.41515[1m[30m - [1m[30mval_loss: [1m[31m2.83743[1m[30m - [1m[30mval_accuracy: [1m[31m0.48068

[1m[30mEpoch: [1m[31m03[1m[30m - [1m[30mloss: [1m[31m2.73478[1m[30m - [1m[30maccuracy: [1m[31m0.50163[1m[30m - [1m[30mval_loss: [1m[31m2.49233[1m[30m - [1m[30mval_accuracy: [1m[31m0.53718

[1m[30mEpoch: [1m[31m04[1m[30m - [1m[30mloss: [1m[31m2.45631[1m[30m - [1m[30maccuracy: [1m[31m0.54773[1m[30m - [1m[30mval_loss: [1m[31m2.21382[1m[30m - [1m[30mval_accuracy: [1m[31m0.58532

[1m[30mEpoch: [1m[31m05[1m[30m - [1m[30mloss: [1m[31m2.29639[1m[30m - [1m[30maccuracy: [1m[31m0.57418[1m[30m - [1

In [22]:
fig = px.line(
    transformer_history.history,
    markers=True,
    height=540,
    width=840,
    symbol="variable",
    labels={"variable": "Variable", "value": "Value", "index": "Epoch"},
    title="Easy Dataset - Transformer Training Process",
    color_discrete_sequence=px.colors.diverging.balance_r,
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
)
fig.show()


In [23]:
translation1 = translate(transformer, "Take a seat")
translation2 = translate(transformer, "I wish Tom was here.")
translation3 = translate(transformer, "She ordered him to do it.")

print(CLR + "Actual Possible Translations:")
print(BLUE + "Take a seat".ljust(25), RED + "-> ", BLUE + "Prends place !")
print(
    BLUE + "I wish Tom was here.".ljust(25),
    RED + "-> ",
    BLUE + "J'aimerais que Tom soit là.",
)
print(
    BLUE + "She ordered him to do it.".ljust(25),
    RED + "-> ",
    BLUE + "Elle lui a ordonné de le faire.",
)
print()
print(CLR + "Model Translations:")
print(BLUE + "Take a seat".ljust(25), RED + "-> ", BLUE + translation1)
print(BLUE + "I wish Tom was here.".ljust(25), RED + "-> ", BLUE + translation2)
print(BLUE + "She ordered him to do it.".ljust(25), RED + "-> ", BLUE + translation3)


[1m[30mActual Possible Translations:

[1m[34mTake a seat               [1m[31m->  [1m[34mPrends place !

[1m[34mI wish Tom was here.      [1m[31m->  [1m[34mJ'aimerais que Tom soit là.

[1m[34mShe ordered him to do it. [1m[31m->  [1m[34mElle lui a ordonné de le faire.



[1m[30mModel Translations:

[1m[34mTake a seat               [1m[31m->  [1m[34mprends une place

[1m[34mI wish Tom was here.      [1m[31m->  [1m[34mjaimerais que tom était ici

[1m[34mShe ordered him to do it. [1m[31m->  [1m[34melle la commandé de le faire


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>It's hard to judge translation quality from only three phrases. The model most likely handles some sentences better than others.</li>
    <li>Additionally, there are probably better learning rate and decay values to be found. Here I simply used the default values from the function.</li>
</ul>


# <p style="padding: 15px; background-color: #3F384A; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Tackling Hard Dataset</p>


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>So far, we haven't had to deal with a dataset that is too large to load comfortably. That changes now.</li>
    <li>The second English-French dataset contains over $22.5$ million sentences and takes up around $8$ GB. Loading it as a single file may be a problem. One solution is to split the large file into several smaller files and use the <code>TensorFlow</code> data API to load and prefetch data from them. This is what we will do in this section. That way, you will be able to train a model on the whole dataset (given enough time and resources, obviously).</li>
    <li>First, let's download the data as before.</li>
</ul>


In [24]:
hard_dataset_user = "dhruvildave"
hard_dataset = "en-fr-translation-dataset"
data_dir = Path("data")

if not ON_KAGGLE:
    download_dataset_from_kaggle(hard_dataset_user, hard_dataset, data_dir)
    hard_dataset_path = data_dir / "en-fr.csv"
else:
    hard_dataset_path = Path("/kaggle/input/en-fr-translation-dataset/en-fr.csv")


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>We will use <code>pandas</code> to split the dataset into multiple small files.</li>
</ul>


In [25]:
chunk_size = 100_000
chunks_dir = Path("data_chunks")

if not os.path.exists(chunks_dir):
    chunks_dir.mkdir(parents=True)
    chunks = pd.read_csv(hard_dataset_path, chunksize=chunk_size, encoding="utf-8")
    for i, chunk in enumerate(chunks):
        chunk_path = chunks_dir / f"en-fr-chunk-{i:03}.csv"
        chunk.to_csv(chunk_path, index=False, encoding="utf-8")


In [26]:
filepaths = [f"{chunks_dir}/{chunk_file}" for chunk_file in os.listdir(chunks_dir)]
filepaths[:10]


['data_chunks/en-fr-chunk-025.csv',
 'data_chunks/en-fr-chunk-168.csv',
 'data_chunks/en-fr-chunk-040.csv',
 'data_chunks/en-fr-chunk-165.csv',
 'data_chunks/en-fr-chunk-201.csv',
 'data_chunks/en-fr-chunk-016.csv',
 'data_chunks/en-fr-chunk-085.csv',
 'data_chunks/en-fr-chunk-174.csv',
 'data_chunks/en-fr-chunk-214.csv',
 'data_chunks/en-fr-chunk-094.csv']
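
The listing above is not in index order because `os.listdir` returns entries in arbitrary order, so `filepaths[0]` may point to a different chunk on another machine. If a reproducible order matters, sorting the paths is one option. The sketch below uses a temporary directory as a hypothetical stand-in for `chunks_dir`; adopting it in the notebook would change which chunks the later cells pick up, so the outputs shown here would no longer match.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)  # hypothetical stand-in for `chunks_dir`
    for i in (25, 3, 168):
        (demo_dir / f"en-fr-chunk-{i:03}.csv").write_text("en,fr\n", encoding="utf-8")

    # Zero-padded indices make lexicographic order match numeric order.
    sorted_names = [path.name for path in sorted(demo_dir.glob("en-fr-chunk-*.csv"))]

print(sorted_names)
# ['en-fr-chunk-003.csv', 'en-fr-chunk-025.csv', 'en-fr-chunk-168.csv']
```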

In [27]:
with open(filepaths[0], encoding="utf8") as f:
    for line in f.readlines()[:5]:
        print(line, end="")


en,fr

This system guides employees to ensure appropriate consideration of the environmental effects of a relevant decision.,Ce système guide les employés lorsqu’il s’agit de tenir adéquatement compte des effets d’une décision donnée sur l’environnement.

"Training and awareness sessions, consistent templates and tools, quality tracking, and monitoring are all part of DFO's commitments to SEA in this SDS.","Les séances de formation et de sensibilisation, des modèles et des outils harmonisés, un suivi de la qualité, ainsi que la surveillance, voilà des éléments qui font tous partie de l’engagement du MPO à l’égard de l’ÉES dans le cadre de cette SDS."

Activities Performance indicators Target date Raise awareness and support of the Strategic Environmental Assessment (SEA) process.,Activités Indicateurs de rendement Échéance Sensibiliser les gens et appuyer le processus d’évaluation environnementale stratégique.

10% increase in number of participants who have completed SEA training and 

In [28]:
with open(filepaths[-1], encoding="utf8") as f:
    for line in f.readlines()[:2]:
        print(line, end="")


en,fr

"(A medicinal ingredient previously evaluated within the last 3 years, to which reference is made is not required to be re-evaluated.)",(Il n'est pas nécessaire de réévaluer un ingrédient médicinal auquel il est fait référence dont la dernière évaluation remonte à moins de trois ans.)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>Let's look at the distribution of the number of words per sentence in an example chunk.</li>
</ul>


In [29]:
hard_dataset_chunk = pd.read_csv(filepaths[0], encoding="utf-8", engine="pyarrow")
hard_dataset_chunk.head()


Unnamed: 0,en,fr
0,This system guides employees to ensure appropr...,Ce système guide les employés lorsqu’il s’agit...
1,"Training and awareness sessions, consistent te...",Les séances de formation et de sensibilisation...
2,Activities Performance indicators Target date ...,Activités Indicateurs de rendement Échéance Se...
3,10% increase in number of participants who hav...,Augmentation de 10 % du nombre de participants...
4,Annual review of DFO SEA process as part of th...,Examen annuel du processus d’ÉES du MPO dans l...


In [30]:
hard_dataset_chunk["English Words in Sentence"] = (
    hard_dataset_chunk["en"].str.split().apply(len)
)
hard_dataset_chunk["French Words in Sentence"] = (
    hard_dataset_chunk["fr"].str.split().apply(len)
)

fig = px.histogram(
    hard_dataset_chunk,
    x=["English Words in Sentence", "French Words in Sentence"],
    color_discrete_sequence=["#3f384a", "#e04c5f"],
    labels={"variable": "Variable", "value": "Words in Sentence"},
    marginal="box",
    barmode="group",
    range_x=(-10, 100),
    nbins=500,
    height=540,
    width=840,
    title="Hard Dataset Random Chunk - Words in Sentence",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    bargap=0.2,
    bargroupgap=0.1,
    legend=dict(orientation="h", yanchor="bottom", xanchor="right", y=1.02, x=1),
    yaxis_title="Count",
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>I clipped the x-axis and reduced the number of bins to increase readability. Generally, the problem is more challenging here, since sentences can reach $50$ words. Moreover, some entries are long passages of, for example, $800$ words or more!</li>
    <li>Now we will write utility functions that load the data and do some preliminary preparation of the <code>TensorFlow</code> dataset. We proceed as before, but with two significant changes: we parse each <code>csv</code> line, and we interleave files. With interleaving, the pipeline reads lines from several files in turn until all of them are exhausted.</li>
</ul>
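
To give a feel for what interleaving does, the pure-Python sketch below takes one line at a time from a fixed number of open "files" in turn and opens the next file when one runs out. It is only an illustration under stated assumptions: the function `interleave_round_robin` and the toy file contents are made up, it fixes the number of concurrently read files at 2 (the notebook leaves that to `tf.data.AUTOTUNE`), and the exact element order produced by `tf.data` may differ in detail.

```python
from collections import deque
from itertools import islice


def interleave_round_robin(sources, cycle_length):
    """Yield one item at a time from up to `cycle_length` open sources in turn."""
    pending = deque(iter(source) for source in sources)
    active = deque()
    while pending or active:
        # Open new sources until `cycle_length` of them are active.
        while pending and len(active) < cycle_length:
            active.append(pending.popleft())
        current = active.popleft()
        try:
            item = next(current)
        except StopIteration:
            continue  # Source exhausted: drop it so a pending one can take its place.
        yield item
        active.append(current)


# Toy stand-ins for three chunk files (made-up contents); each starts with a header row.
files = {
    "chunk-000.csv": ["en,fr", "a1", "a2", "a3"],
    "chunk-001.csv": ["en,fr", "b1"],
    "chunk-002.csv": ["en,fr", "c1", "c2"],
}

# `islice(lines, 1, None)` plays the role of `.skip(1)` on each file.
sources = [islice(lines, 1, None) for lines in files.values()]
result = list(interleave_round_robin(sources, cycle_length=2))
print(result)  # ['a1', 'b1', 'a2', 'a3', 'c1', 'c2']
```

Note that no header row reaches the output and that the third file is only opened once the second one is exhausted.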


In [31]:
def parse_csv_line(line):
    """Decode a `csv` line and return a `(sentence_en, sentence_fr)` tuple."""
    defaults = 2 * [tf.constant("", dtype=tf.string)]
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    # Each field is already a scalar string tensor, so no stacking is needed.
    return fields[0], fields[1]


def prepare_input_and_target(sentences_en, sentences_fr):
    """Return data in the format: `((encoder_input, decoder_input), target)`"""
    return (sentences_en, b"startofseq " + sentences_fr), sentences_fr + b" endofseq"


def from_csv_files_dataset(
    filepaths,
    batch_size=32,
    cache=True,
    shuffle=False,
    shuffle_buffer_size=50_000,
    seed=None,
):
    """Creates `TensorFlow` dataset from multiple csv files."""
    dataset = tf.data.Dataset.list_files(filepaths, seed=seed)
    dataset = dataset.interleave(  # type: ignore
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),  # Skip header.
        cycle_length=tf.data.AUTOTUNE,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    dataset = dataset.map(parse_csv_line, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.map(prepare_input_and_target, num_parallel_calls=tf.data.AUTOTUNE)
    if cache:
        dataset = dataset.cache()
    if shuffle:
        dataset = dataset.shuffle(shuffle_buffer_size, seed=seed)
    return dataset.batch(batch_size)
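
Two details of the functions above are easy to miss: many lines in this dataset contain commas inside quoted fields, so a real CSV parser is needed, and the decoder input and the target are the same sentence offset by one token. The sketch below illustrates both with the standard library only. The example line is made up, and Python's `csv` module is used here merely as a stand-in for `tf.io.decode_csv`.

```python
import csv

line = '"Training, tools and monitoring.","Formation, outils et surveillance."'

# A plain split breaks on the commas inside the quoted fields.
naive_fields = line.split(",")

# A CSV parser honours the quotes and recovers exactly two fields.
sentence_en, sentence_fr = next(csv.reader([line]))

# Same layout as `prepare_input_and_target`, with `str` instead of byte tensors.
decoder_input = "startofseq " + sentence_fr
target = sentence_fr + " endofseq"

print(len(naive_fields))      # 4
print(decoder_input.split())  # ['startofseq', 'Formation,', 'outils', 'et', 'surveillance.']
print(target.split())         # ['Formation,', 'outils', 'et', 'surveillance.', 'endofseq']
```

At every position the decoder sees the tokens up to and including that position of `decoder_input` and is trained to predict the token at the same position of `target`, which is the next word of the sentence.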


In [32]:
example_ds = from_csv_files_dataset(filepaths[0:3], batch_size=2)
list(example_ds.take(1))[0]


((<tf.Tensor: shape=(2,), dtype=string, numpy=
  array([b'This system guides employees to ensure appropriate consideration of the environmental effects of a relevant decision.',
         b'Some have exceeded their useful business life and we are still paying to store them.'],
        dtype=object)>,
  <tf.Tensor: shape=(2,), dtype=string, numpy=
  array([b'startofseq Ce syst\xc3\xa8me guide les employ\xc3\xa9s lorsqu\xe2\x80\x99il s\xe2\x80\x99agit de tenir ad\xc3\xa9quatement compte des effets d\xe2\x80\x99une d\xc3\xa9cision donn\xc3\xa9e sur l\xe2\x80\x99environnement.',
         b'startofseq Quelques-uns ont une valeur archivistique et doivent \xc3\xaatre transf\xc3\xa9r\xc3\xa9s aux archives.'],
        dtype=object)>),
 <tf.Tensor: shape=(2,), dtype=string, numpy=
 array([b'Ce syst\xc3\xa8me guide les employ\xc3\xa9s lorsqu\xe2\x80\x99il s\xe2\x80\x99agit de tenir ad\xc3\xa9quatement compte des effets d\xe2\x80\x99une d\xc3\xa9cision donn\xc3\xa9e sur l\xe2\x80\x99environnement. 

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>Training a neural network on this dataset is time-consuming, so we will only check that the pipeline works.</li>
</ul>


In [33]:
hard_train_ds = from_csv_files_dataset(filepaths[0:2], shuffle=True, seed=42)
hard_valid_ds = from_csv_files_dataset(filepaths[2:3])


In [34]:
K.clear_session()
tf.random.set_seed(42)

bidirect_encoder_decoder = BidirectionalEncoderDecoderWithAttention()
bidirect_history = adapt_compile_and_fit(
    bidirect_encoder_decoder,
    hard_train_ds,
    hard_valid_ds,
    n_epochs=1,
    colorama_verbose=True,
)


[1m[30mEpoch: [1m[31m01[1m[30m - [1m[30mloss: [1m[31m5.23879[1m[30m - [1m[30maccuracy: [1m[31m0.17731[1m[30m - [1m[30mval_loss: [1m[31m4.81768[1m[30m - [1m[30mval_accuracy: [1m[31m0.22098


In [35]:
K.clear_session()
tf.random.set_seed(42)

transformer = Transformer()
transformer_history = adapt_compile_and_fit(
    transformer,
    hard_train_ds,
    hard_valid_ds,
    n_epochs=1,
    colorama_verbose=True,
)


[1m[30mEpoch: [1m[31m01[1m[30m - [1m[30mloss: [1m[31m4.94478[1m[30m - [1m[30maccuracy: [1m[31m0.18895[1m[30m - [1m[30mval_loss: [1m[31m4.37700[1m[30m - [1m[30mval_accuracy: [1m[31m0.25626


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    border-bottom: 3px solid #e04c5f;
">
    <b>Notes</b> 📜
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>It works, but the task is much more difficult here. If you would like to see what the two models discussed in this notebook are capable of, you can rerun these cells with more epochs. You can also experiment with the hyperparameters.</li>
</ul>


# <p style="padding: 15px; background-color: #3F384A; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Summary</p>


<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    margin-left: 8px;
    margin-right: 8px;
">
    <li>In this notebook, we tackled two different English-French datasets.</li>
    <li>We wrote utility functions to create efficient <code>TensorFlow</code> datasets.</li>
    <li>We implemented a bidirectional encoder-decoder RNN with attention to translate English sentences into French ones.</li>
    <li>Similarly, we wrote the Transformer architecture to perform the same task.</li>
    <li>It's hard to say which architecture is better, since I spent little time on hyperparameter search and learning rate scheduling. Even so, the Transformer looks more stable in these runs: compare the training and validation accuracy.</li>
    <li>If you want, you can explore which hyperparameters and learning rate schedules work well with these datasets.</li>
</ul>