
# Libraries

In [None]:
import tensorflow as tf
import string
import re
from tensorflow import keras
import random
import numpy as np

We are going to implement a sequence-to-sequence model for a machine translation task.

# Loading the dataset

First, we download an English-to-Spanish translation dataset:

In [None]:
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip

--2023-09-13 20:03:29--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.193.207, 173.194.194.207, 173.194.195.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.193.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’


2023-09-13 20:03:29 (206 MB/s) - ‘spa-eng.zip’ saved [2638744/2638744]



To complete the download, we unzip the .zip file:

In [None]:
!unzip -q spa-eng.zip

# Parse the text file

The text file contains one example per line:

**[an English sentence] [tab character] [corresponding Spanish sentence]**

Let's parse the .txt file:

In [None]:
text_file = "spa-eng/spa.txt"
# the file contains accented characters, so read it explicitly as UTF-8
with open(text_file, encoding="utf-8") as f:
  lines = f.read().split("\n")[:-1]

text_pairs = []

# iterate over the lines in the file
for line in lines:
  # each line contains an English phrase and its Spanish translation
  # a Tab separates them
  english, spanish = line.split("\t")
  # prepend [start] and append [end] to the Spanish sentence
  spanish = "[start] " + spanish + " [end]"
  text_pairs.append((english, spanish))
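
To make the line format concrete, here is a minimal sketch that applies the same parsing to two in-memory sample lines instead of the downloaded file. The sample sentences and the `parse_line` helper are illustrative assumptions and are not part of the notebook.

```python
# Hypothetical sample lines in the "english<TAB>spanish" format.
sample_lines = [
    "Go.\tVe.",
    "I'm sorry.\tLo siento.",
]

def parse_line(line):
    """Split one tab-separated line and wrap the Spanish side in [start]/[end]."""
    english, spanish = line.split("\t")
    return english, "[start] " + spanish + " [end]"

sample_pairs = [parse_line(line) for line in sample_lines]
for pair in sample_pairs:
    print(pair)
```

The [start] and [end] markers give the decoder an explicit first input token and an explicit signal for when to stop generating.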

Let's print a random sentence pair to see what it looks like:

In [None]:
random_example = random.choice(text_pairs)
print(random_example)

("I'm sorry I was rude to you.", '[start] Lo siento si fui grosero contigo. [end]')


# Prepare the dataset

Let's shuffle the dataset and split it into training (70%), validation (15%), and test (15%) sets:

In [None]:
# shuffle
random.shuffle(text_pairs)
# calculate number of validation samples
num_val_samples = int(0.15 * len(text_pairs))
# calculate number of training samples
num_train_samples = len(text_pairs) - 2 * num_val_samples
# training set
train_pairs = text_pairs[:num_train_samples]
# validation set
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
# test set
test_pairs = text_pairs[num_train_samples + num_val_samples:]
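
The split arithmetic can be checked on a toy list. This is a sketch under stated assumptions: `toy_pairs` is a hypothetical stand-in for `text_pairs`, and the seeded `random.Random` instance is an addition (the notebook shuffles without a seed) that makes the split repeatable.

```python
import random

# Hypothetical stand-in for text_pairs: 100 dummy (english, spanish) tuples.
toy_pairs = [(f"en {i}", f"[start] es {i} [end]") for i in range(100)]

# A dedicated, seeded generator makes the shuffle repeatable across runs.
rng = random.Random(42)
rng.shuffle(toy_pairs)

num_val = int(0.15 * len(toy_pairs))      # 15% for validation
num_train = len(toy_pairs) - 2 * num_val  # what remains after validation and test

train = toy_pairs[:num_train]
val = toy_pairs[num_train:num_train + num_val]
test = toy_pairs[num_train + num_val:]

print(len(train), len(val), len(test))
```

Because the three slices are contiguous and non-overlapping, every pair lands in exactly one set.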

## Vectorize the English and Spanish text pairs

We create two TextVectorization layers: one for English and one for Spanish.

The default standardization would strip the square brackets from the [start] and [end] tokens that we inserted previously, so the Spanish layer needs a custom standardization function that preserves them.
Keep in mind that punctuation differs between languages: if we strip punctuation in the Spanish TextVectorization layer, we also need to strip the character "¿", which is not part of string.punctuation. A non-toy translation model would usually keep punctuation as separate tokens, but for the sake of simplicity we strip it here.

In [None]:
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

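# a custom string standardization function
# for the Spanish TextVectorization layer:
# it preserves [ and ] but strips ¿ as well
# as the other characters from string.punctuation
# note: with its default encoding, tf.strings.lower only lowercases
# ASCII characters, so accented capitals such as "É" stay uppercase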
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")
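
To see what the regular expression does without running TensorFlow, here is a plain-Python approximation. The helper name `standardize_py` and the sample sentence are assumptions made for illustration.

```python
import re
import string

# Same character set as above: ASCII punctuation plus "¿", minus "[" and "]".
strip_chars = (string.punctuation + "¿").replace("[", "").replace("]", "")
pattern = f"[{re.escape(strip_chars)}]"

def standardize_py(text):
    """Plain-Python approximation of custom_standardization (illustrative only)."""
    return re.sub(pattern, "", text.lower())

print(standardize_py("[start] ¿Qué hora es? [end]"))
```

The brackets around start and end survive, while "¿" and "?" are removed. One difference to be aware of: Python's `str.lower` is Unicode-aware, whereas `tf.strings.lower` with its default `encoding=""` only lowercases ASCII characters, so the two can disagree on accented capitals.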

In [None]:
# to keep things simple, we'll only look at the
# top 15000 words in each language
vocab_size = 15000
# we'll also restrict sentences to 20 words
sequence_length = 20

# define the English TextVectorization layer
source_vectorization = keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
# define the Spanish TextVectorization layer
target_vectorization = keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    # one extra token: the Spanish sequence is later split into a decoder input
    # (all but the last token) and a target (all but the first token),
    # each of length sequence_length
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
# invoke adapt to learn the vocabulary of each language
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)
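
For intuition about what `adapt` and `output_mode="int"` do, the following is a simplified pure-Python stand-in, not the Keras implementation. It ranks tokens by frequency, reserves index 0 for padding and index 1 for out-of-vocabulary tokens (mirroring the `""` and `[UNK]` entries that TextVectorization places first), and pads or truncates to a fixed length. The helpers `build_vocab` and `encode` are hypothetical, and tie-breaking between equally frequent tokens may differ from Keras.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens; slots 0 and 1 are reserved."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab, length):
    """Map tokens to indices, then truncate or zero-pad to `length`."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.split()][:length]
    return ids + [0] * (length - len(ids))

texts = ["[start] hola [end]", "[start] hola mundo [end]"]
vocab = build_vocab(texts, max_tokens=5)
print(vocab)
print(encode("[start] adiós mundo [end]", vocab, length=6))
```

With `max_tokens=5` only three real tokens fit, so "mundo" and the unseen "adiós" are both mapped to index 1, which is the same mechanism that produces [UNK] in the model's translations.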

## Turn data into a tf.data pipeline

We want each dataset element to be a tuple (inputs, targets), where inputs is a dict with two keys, "english" (the vectorized English sentence, fed to the encoder) and "spanish" (the vectorized Spanish sentence, fed to the decoder), and targets is the Spanish sentence offset by one step ahead:


In [None]:
batch_size = 64

def format_dataset(eng, spa):
  eng = source_vectorization(eng)
  spa = target_vectorization(spa)
  return (
      {
          "english": eng,
          # the decoder input drops the last token
          # to keep inputs and targets at the same length
          "spanish": spa[:, :-1],
      },
      # the target is the same sentence shifted one step ahead;
      # both have the same length (20 tokens)
      spa[:, 1:],
  )

def make_dataset(pairs):
  eng_texts, spa_texts = zip(*pairs)
  eng_texts = list(eng_texts)
  spa_texts = list(spa_texts)
  dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
  dataset = dataset.batch(batch_size)
  dataset = dataset.map(format_dataset, num_parallel_calls=4)

  # cache the preprocessed batches in memory, then shuffle them
  # (shuffling after caching gives a new batch order every epoch)
  return dataset.cache().shuffle(2048).prefetch(16)

# create the Datasets for training and validation
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Let's check the shapes of one batch:

In [None]:
for inputs, targets in train_ds.take(1):
  print(f"inputs['english'].shape: {inputs['english'].shape}")
  print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
  print(f"targets.shape: {targets.shape}")

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)
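
The one-step offset between the decoder input and the target is easiest to see on a single toy sequence. The token ids below are hypothetical, and the sequence is padded to 6 positions rather than the notebook's 21 to keep the printout short.

```python
# Hypothetical token ids for "[start] hola mundo [end]", zero-padded to 6 positions.
spa = [2, 5, 7, 3, 0, 0]

decoder_input = spa[:-1]  # drop the last position: what the decoder sees
target = spa[1:]          # drop the first position: what it must predict

for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(f"step {step}: input token {seen} -> target token {expected}")
```

At every step the target is the token that follows the current input token, which is why the Spanish layer produces one token more than the 20 the model is trained on.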


# Create the RNN

## Create a GRU-based encoder

In [None]:
def encoder_gru(latent_dim, embed_dim):
  source = keras.Input(shape=(None,), dtype="int64", name="english")
  x = keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
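  # named `gru` so it does not shadow the enclosing function's name
  gru = keras.layers.GRU(latent_dim)
  # sum the final forward and backward states so the result has size latent_dim
  encoded_source = keras.layers.Bidirectional(gru, merge_mode="sum")(x)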

  return source, encoded_source
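
The choice of `merge_mode="sum"` matters because the encoder output is used as the decoder's initial state, which must have exactly `latent_dim` entries. The toy sketch below uses plain lists with made-up values in place of real GRU states to compare the resulting sizes.

```python
latent_dim = 4  # toy size; the notebook uses 1024

# Hypothetical final states of the forward and backward GRU passes.
forward_state = [0.1, 0.2, 0.3, 0.4]
backward_state = [0.5, 0.1, -0.3, 0.0]

# merge_mode="sum": element-wise addition keeps the size at latent_dim.
summed = [f + b for f, b in zip(forward_state, backward_state)]
# merge_mode="concat": the size doubles to 2 * latent_dim.
concatenated = forward_state + backward_state

print(len(summed), len(concatenated))
```

With concatenation, the decoder GRU would need `2 * latent_dim` units (or an extra projection layer) to accept the encoder state.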

## Create a GRU-based decoder

In [None]:
def decoder_gru(latent_dim):
  return keras.layers.GRU(latent_dim, return_sequences=True)

## Create the Seq-to-Seq RNN model

In [None]:
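def build_seq2seq_rnn(embed_dim=256, latent_dim=1024):
  # decoder input: the Spanish sentence so far
  past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
  x = keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
  # encoder: turns the English sentence into a single state vector
  source, encoded_source = encoder_gru(latent_dim, embed_dim)
  decoder = decoder_gru(latent_dim)
  # the encoded source sentence initializes the decoder state
  x = decoder(x, initial_state=encoded_source)
  x = keras.layers.Dropout(0.5)(x)
  # predict a probability distribution over the Spanish vocabulary at each step
  target_next_step = keras.layers.Dense(vocab_size, activation="softmax")(x)
  return keras.Model(inputs=[source, past_target], outputs=target_next_step, name="RNN")


In [None]:
# create the model; the builder has its own name so that
# re-running this cell does not try to call the model object
seq2seq_rnn = build_seq2seq_rnn()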

# Compile and train the RNN

In [None]:
# compile
seq2seq_rnn.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# train
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x7a511146a140>
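
The loss used above, sparse categorical cross-entropy, compares the predicted distribution at each position with a single integer target index: it is the negative log of the probability assigned to the correct token. The sketch below computes it for one position with made-up probabilities. The function is a hand-written illustration, not the Keras implementation, which also averages over positions and batches.

```python
import math

def sparse_categorical_crossentropy(probs, target_index):
    """Loss for one position: negative log-probability of the correct token."""
    return -math.log(probs[target_index])

# Hypothetical softmax output over a 3-token vocabulary.
probs = [0.1, 0.7, 0.2]

confident_and_right = sparse_categorical_crossentropy(probs, 1)
wrong = sparse_categorical_crossentropy(probs, 0)
print(confident_and_right, wrong)
```

The loss is small when the model puts most of its probability on the correct token and grows quickly as that probability approaches zero. "Sparse" refers to the targets being integer indices rather than one-hot vectors, which is what the dataset above yields.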

# Testing our Seq2Seq RNN
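
At inference time the Spanish sentence is not available, so it is generated one token at a time: the tokens produced so far are fed back into the model, and the most probable next token is kept (greedy decoding). The sketch below shows only this control flow. `fake_next_token_scores` is a hypothetical stand-in for the trained model that returns canned scores, and `toy_vocab` is made up.

```python
# Toy vocabulary; the tokens and their order are illustrative assumptions.
toy_vocab = ["", "[UNK]", "[start]", "[end]", "hola", "mundo"]

def fake_next_token_scores(tokens):
    """Hypothetical stand-in for the model: one score per vocabulary entry."""
    script = {1: "hola", 2: "mundo", 3: "[end]"}
    wanted = script.get(len(tokens), "[end]")
    return [1.0 if tok == wanted else 0.0 for tok in toy_vocab]

def greedy_decode(max_len=20):
    tokens = ["[start]"]  # seed token
    for _ in range(max_len):
        scores = fake_next_token_scores(tokens)
        # greedy choice: index of the highest score
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(toy_vocab[best])
        # stop early once the end marker has been produced
        if tokens[-1] == "[end]":
            break
    return " ".join(tokens)

print(greedy_decode())
```

Greedy decoding is simple but commits to each token without looking ahead. Alternatives such as beam search exist and are outside the scope of this notebook.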

In [None]:
# prepare a dict that maps token indices back to string tokens
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(enumerate(spa_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
  tokenized_input_sentence = source_vectorization([input_sentence])
  # seed token
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = target_vectorization([decoded_sentence])
    # predict the next token (greedy: pick the most probable one);
    # verbose=0 suppresses the progress bar that predict prints on every step
    next_token_predictions = seq2seq_rnn.predict(
        [tokenized_input_sentence, tokenized_target_sentence], verbose=0)
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])
    # convert the predicted index to a string and append it to the generated sentence
    sampled_token = spa_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token
    # exit condition: either hit max length or the [end] token
    if sampled_token == "[end]":
      break
  return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
  input_sentence = random.choice(test_eng_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence(input_sentence))

-
Try to fulfill your duty.
[start] intenta [UNK] tu nombre [end]
-
He is not stupid.
[start] Él no es estúpido [end]
-
He likes to travel by himself.
[start] le gusta viajar a tiempo [end]
-
It was unpardonable.
[start] fue [UNK] [end]
-
I don't care for the way he talks.
[start] no me importa por qué está hablando [end]
-
What is your blood type?
[start] cuál es tu grupo sanguíneo [end]
-
He doesn't want to live in the city.
[start] Él no quiere vivir en el campo [end]
-
I'm broke.
[start] soy [UNK] [end]
-
I want to see more.
[start] quiero verlo más [end]
-
I am baffled.
[start] estoy [UNK] [end]
-
What have you learned about Tom so far?
[start] qué has hecho de tom acerca de lo que estaba aquí [end]
-
Grab the bottom.
[start] [UNK] el suelo [end]
-
Tom is hitting Mary.
[start] tom está [UNK] a mary [end]
-
It's not too late.
[start] no es demasiado tarde [end]
-
I've seen that movie many times, but I'd like to see it again.
[start] he visto a esa película desde que he visto a ver 