Practical 11

**Aim:** Build a sequence-to-sequence (seq2seq) model using TensorFlow/Keras for a simple
machine translation task (e.g., translating English sentences to French). Use an LSTM or GRU for both
the encoder and decoder.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, GRU, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
# 0 Small toy parallel corpus

en_sentences = [
"i am a student",
"you are a teacher",
"he is a doctor",
"she is a nurse",
"we are friends",
"they are engineers",
"i like apples",
"you like oranges",
"we love music",
"they play football"
]

**Theory:**

* Sequence-to-Sequence (Seq2Seq) Model: This architecture is used for tasks where both the input and the output are sequences, such as machine translation. It consists of two main parts:
  * Encoder: Processes the input sequence and compresses it into a fixed-length context vector (here, the final LSTM or GRU state), which is intended to capture the meaning of the input.
  * Decoder: Takes the context vector from the encoder as its initial state and generates the output sequence one token at a time.
* Recurrent Neural Networks (RNNs): Both the encoder and decoder typically use RNN variants such as LSTMs or GRUs, which handle sequential data by maintaining an internal state that carries information forward from previous steps.
* Teacher Forcing: During training, the decoder is fed the correct previous target token, rather than its own previous prediction, as the input for predicting the next token. This helps the model learn the correct output sequence more quickly.
* Greedy Decoding: During inference (translation), the decoder selects the highest-probability token at each step and feeds it back as the next input. This strategy is simple, but because it never revisits an earlier choice, the resulting translation is not guaranteed to be the most probable sequence overall.
* Padding and Tokenization: The model needs numerical input, so text is converted into integer sequences. Tokenization splits sentences into words or sub-word units, and padding extends shorter sequences so that all sequences have the same length.
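
Teacher forcing amounts to shifting the target sentence by one position. The sketch below is only an illustration in plain Python: the helper name `teacher_forcing_pair` is hypothetical and not part of the practical's code. It builds the decoder input (prefixed with `<sos>`) and the decoder target (suffixed with `<eos>`) for one sentence.

```python
# Hypothetical helper for illustration; not part of the practical's code.
def teacher_forcing_pair(target_tokens, sos="<sos>", eos="<eos>"):
    """Return (decoder_input, decoder_target) for one target sentence."""
    decoder_input = [sos] + list(target_tokens)    # what the decoder reads
    decoder_target = list(target_tokens) + [eos]   # what it must predict
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("je suis un étudiant".split())

# At every step the decoder is given the correct previous token.
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(f"step {step}: given {given!r:12} -> predict {expected!r}")
```

Both lists have the same length, and the input at step t+1 is exactly the target at step t.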

In [None]:
fr_sentences = [
"je suis un étudiant",
"tu es un professeur",
"il est un médecin",
"elle est une infirmière",
"nous sommes des amis",
"ils sont des ingénieurs",
"j'aime les pommes",
"tu aimes les oranges",
"nous aimons la musique",
"ils jouent au football"
]

Tokenize the sentences and pad them to equal length.
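
As a rough picture of what tokenization and padding do, here is a minimal pure-Python sketch. The functions `build_vocab` and `encode_and_pad` are hypothetical stand-ins, not the Keras API. One assumption differs from Keras: ids are assigned in order of first appearance, whereas the Keras `Tokenizer` orders them by word frequency. Id 0 is reserved for padding in both cases.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer id, reserving 0 for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1   # ids start at 1
    return vocab

def encode_and_pad(sentences, vocab, pad_id=0):
    """Convert sentences to id lists and right-pad them to a common length."""
    encoded = [[vocab[w] for w in s.lower().split()] for s in sentences]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in encoded]

toy = ["i am a student", "we are friends"]
toy_vocab = build_vocab(toy)
toy_padded = encode_and_pad(toy, toy_vocab)
print(toy_vocab)
print(toy_padded)
```

The shorter sentence receives a trailing 0, which corresponds to `padding="post"` in `pad_sequences`.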


In [None]:
# Add start/end tokens on target side
fr_sentences_in = [f"<sos> {s}" for s in fr_sentences]
fr_sentences_out = [f"{s} <eos>" for s in fr_sentences]

In [None]:
# 1 Tokenize
def make_tokenizer(texts, oov_token="<unk>"):
  t = Tokenizer(oov_token=oov_token, filters="") # no filtering: keeps punctuation and the < > in <sos>/<eos>
  t.fit_on_texts(texts)
  return t

# Source (English)
src_tok = make_tokenizer(en_sentences)
src_vocab = len(src_tok.word_index) + 1

# Target (French) uses both input (with <sos>) and output (with <eos>)
tgt_tok = make_tokenizer(fr_sentences_in + fr_sentences_out)

tgt_vocab = len(tgt_tok.word_index) + 1

# Sequences
src_seq = src_tok.texts_to_sequences(en_sentences)
tgt_seq_in = tgt_tok.texts_to_sequences(fr_sentences_in)
tgt_seq_out = tgt_tok.texts_to_sequences(fr_sentences_out)

# Max lengths
max_src_len = max(len(s) for s in src_seq)
max_tgt_len = max(len(s) for s in tgt_seq_in) # in and out have same length

# Pad
src_seq = pad_sequences(src_seq, maxlen=max_src_len, padding="post")
tgt_seq_in = pad_sequences(tgt_seq_in, maxlen=max_tgt_len, padding="post")
tgt_seq_out = pad_sequences(tgt_seq_out, maxlen=max_tgt_len, padding="post")

Build the LSTM or GRU model

In [None]:
# 2 Build the model
EMB_SRC = 64
EMB_TGT = 64
HID = 128
USE_GRU = False # set True to switch to GRU

# Encoder
enc_inputs = Input(shape=(max_src_len,), name="encoder_input")
enc_emb = Embedding(input_dim=src_vocab, output_dim=EMB_SRC,
name="enc_embedding")(enc_inputs)

if USE_GRU:
  enc_rnn, enc_state = GRU(HID, return_state=True, name="encoder_gru")(enc_emb)
  enc_states = [enc_state]
else:
  enc_rnn, state_h, state_c = LSTM(HID, return_sequences=False, return_state=True,
name="encoder_lstm")(enc_emb)
  enc_states = [state_h, state_c]

# Decoder
dec_inputs = Input(shape=(max_tgt_len,), name="decoder_input")
dec_emb = Embedding(input_dim=tgt_vocab, output_dim=EMB_TGT,
name="dec_embedding")(dec_inputs)

if USE_GRU:
  dec_rnn = GRU(HID, return_sequences=True, return_state=True, name="decoder_gru")
  dec_outputs, _ = dec_rnn(dec_emb, initial_state=enc_states)
else:
  dec_rnn = LSTM(HID, return_sequences=True, return_state=True, name="decoder_lstm")
  dec_outputs, _, _ = dec_rnn(dec_emb, initial_state=enc_states)

dec_dense = Dense(tgt_vocab, activation="softmax", name="decoder_dense")
dec_probs = dec_dense(dec_outputs) # softmax probabilities: (batch, max_tgt_len, tgt_vocab)

# Train model (teacher forcing)
model = Model([enc_inputs, dec_inputs], dec_probs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

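# Sparse CE takes integer class ids as targets. The extra trailing axis
# (batch, time, 1) was needed by older Keras versions; recent tf.keras
# should also accept (batch, time), so this step is harmless either way.
tgt_out_expanded = np.expand_dims(tgt_seq_out, -1)

# Early stopping on the training loss (there is no validation split on this
# tiny dataset): stop once the loss has not improved for 8 epochs.
es = EarlyStopping(monitor="loss", patience=8, restore_best_weights=True)

history = model.fit(
    [src_seq, tgt_seq_in], tgt_out_expanded,
    batch_size=4,
    epochs=200,
    callbacks=[es],
    verbose=0 # change to 1 to watch training
)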

print(f"Trained for {len(history.history['loss'])} epochs.")

Trained for 200 epochs.
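
To make the loss used above concrete, the following NumPy sketch computes sparse categorical cross-entropy by hand on a toy batch. The function `sparse_ce` and the toy numbers are hypothetical. The optional `pad_id` argument illustrates what ignoring padded positions would look like; the model in this practical does not set `mask_zero` on its embeddings, so padded positions most likely count towards both its loss and its accuracy.

```python
import numpy as np

def sparse_ce(probs, targets, pad_id=None):
    """Mean negative log-likelihood of integer targets.

    probs:   (batch, time, vocab) softmax outputs
    targets: (batch, time) integer token ids
    pad_id:  if given, positions whose target equals pad_id are ignored
    """
    batch_idx, time_idx = np.indices(targets.shape)
    picked = probs[batch_idx, time_idx, targets]   # probability of the true token
    nll = -np.log(picked)
    if pad_id is None:
        return nll.mean()
    mask = targets != pad_id
    return nll[mask].mean()

# One sequence, three time steps, vocabulary of 3 (id 0 = padding).
toy_probs = np.array([[[0.1, 0.8, 0.1],
                       [0.2, 0.2, 0.6],
                       [0.9, 0.05, 0.05]]])
toy_targets = np.array([[1, 2, 0]])

loss_all = sparse_ce(toy_probs, toy_targets)               # padding included
loss_masked = sparse_ce(toy_probs, toy_targets, pad_id=0)  # padding ignored
print(loss_all, loss_masked)
```

In this toy case the padded position is easy to predict, so including it lowers the average loss. The same effect can make accuracy on padded batches look better than the accuracy on real tokens.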


In [None]:
# 3 Build inference models
# Encoder model (same as train, outputs states)
if USE_GRU:
  encoder_model = Model(enc_inputs, enc_states[0])
else:
  encoder_model = Model(enc_inputs, enc_states)

# Decoder inference: one step at a time
# Inputs: current token + previous state(s)
dec_token_input = Input(shape=(1,), name="dec_token_input")
dec_token_emb = model.get_layer("dec_embedding")(dec_token_input)

if USE_GRU:
  inf_state_in = Input(shape=(HID,), name="state_in")
  dec_out_step, state_out = dec_rnn(dec_token_emb, initial_state=[inf_state_in])
  dec_step_probs = dec_dense(dec_out_step)
  decoder_model = Model([dec_token_input, inf_state_in], [dec_step_probs, state_out])

else:
  inf_state_h = Input(shape=(HID,), name="state_h_in")
  inf_state_c = Input(shape=(HID,), name="state_c_in")
  dec_out_step, out_h, out_c = dec_rnn(dec_token_emb, initial_state=[inf_state_h, inf_state_c])
  dec_step_probs = dec_dense(dec_out_step)
  decoder_model = Model([dec_token_input, inf_state_h, inf_state_c],
[dec_step_probs, out_h, out_c])


In [None]:
# 4 Greedy decoding helper

sos_id = tgt_tok.word_index.get("<sos>")
eos_id = tgt_tok.word_index.get("<eos>")

index_to_tgt = {v: k for k, v in tgt_tok.word_index.items()}

def translate_sentence(sentence_en, max_len=20):
  # Encode the source sentence into the decoder's initial state(s)
  seq = src_tok.texts_to_sequences([sentence_en])
  seq = pad_sequences(seq, maxlen=max_src_len, padding="post")
  if USE_GRU:
    state = encoder_model.predict(seq, verbose=0)
  else:
    state_h, state_c = encoder_model.predict(seq, verbose=0)

  # Start decoding with <sos>
  target_token = np.array([[sos_id]], dtype="int32")
  output_tokens = []

  for _ in range(max_len):
    if USE_GRU:
      probs, state = decoder_model.predict([target_token, state], verbose=0)
    else:
      probs, state_h, state_c = decoder_model.predict([target_token, state_h, state_c], verbose=0)

    next_id = int(np.argmax(probs[0, 0])) # greedy: most probable token
    if next_id == eos_id or next_id == 0: # stop at <eos> or padding
      break
    output_tokens.append(index_to_tgt.get(next_id, "<unk>"))
    target_token = np.array([[next_id]], dtype="int32") # feed the prediction back in

  return " ".join(output_tokens)
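
Greedy decoding can miss the most probable sequence, and a tiny example shows why. The two-step "model" below is hypothetical: the probability tables are made up for illustration and are unrelated to the trained network. Greedy decoding commits to the best first token, while an exhaustive search scores every complete sequence.

```python
from itertools import product

# Hypothetical two-step model: P(first token) and P(second token | first).
first_step = {"A": 0.6, "B": 0.4}
second_step = {"A": {"x": 0.5, "y": 0.5},
               "B": {"x": 0.9, "y": 0.1}}

def greedy_decode():
    """Pick the locally best token at each step."""
    first = max(first_step, key=first_step.get)
    second = max(second_step[first], key=second_step[first].get)
    return (first, second), first_step[first] * second_step[first][second]

def exhaustive_decode():
    """Score every full sequence and return the most probable one."""
    scored = {(f, s): first_step[f] * second_step[f][s]
              for f, s in product(first_step, ["x", "y"])}
    best = max(scored, key=scored.get)
    return best, scored[best]

greedy_seq, greedy_p = greedy_decode()
best_seq, best_p = exhaustive_decode()
print("greedy:    ", greedy_seq, round(greedy_p, 2))
print("exhaustive:", best_seq, round(best_p, 2))
```

Greedy starts with "A" because 0.6 > 0.4, yet the best complete sequence starts with "B". Beam search sits between the two approaches by keeping several partial sequences alive at each step.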

In [None]:
# Testing the code

# Try a few translations
tests = [
"i am a student",
"you like oranges",
"we love music",
"they play football",
"she is a nurse"
]

for s in tests:
  print(f"EN: {s}")
  print(f"FR: {translate_sentence(s)}\n")

EN: i am a student
FR: je suis un étudiant

EN: you like oranges
FR: tu aimes les oranges

EN: we love music
FR: nous aimons la musique

EN: they play football
FR: ils jouent au football

EN: she is a nurse
FR: elle est une infirmière



**Observations:**

The model translates all five test sentences into the expected French.
All five test sentences are part of the ten-sentence training corpus, so this result shows that the model has memorized the training pairs; it does not measure performance on unseen sentences.
Training ran for the full 200 epochs, which is the configured maximum, so the early-stopping criterion (no improvement in training loss for 8 epochs) did not end training early.
The model summary shows the architecture: embedding layers, LSTM (or GRU) layers, and a dense output layer whose softmax activation produces a probability distribution over the target vocabulary.
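
The epoch count can be read against a simplified patience rule. The sketch below is a plain-Python approximation of the idea, not the Keras `EarlyStopping` implementation; the function `stopping_epoch` and both loss curves are hypothetical.

```python
def stopping_epoch(losses, patience):
    """Return the number of epochs run before a patience rule stops training."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0       # improvement: reset the counter
        else:
            wait += 1                  # no improvement this epoch
            if wait >= patience:
                return epoch           # stopped early
    return len(losses)                 # ran to the epoch limit

always_improving = [1.0 / e for e in range(1, 201)]   # loss keeps falling
plateau = [1.0, 0.5, 0.4] + [0.4] * 20                # loss stalls after epoch 3

print(stopping_epoch(always_improving, patience=8))
print(stopping_epoch(plateau, patience=8))
```

Under this rule, a run that reports the full epoch limit is one where the loss kept improving at least once in every window of `patience` epochs.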

**Conclusion:**

The seq2seq model, with LSTM (or GRU) layers for both the encoder and the decoder, performs simple machine translation on this small dataset. Training converged, and greedy decoding produced the correct translation for every test sentence. Because those sentences also appear in the training data, the result shows that the model has fit the corpus, not that it generalizes to unseen input. Practical machine translation would require much larger and more varied datasets, along with techniques such as attention mechanisms and beam search decoding.