**What problem does Seq2Seq solve?**

Many tasks map an input sequence to an output sequence of a different length:

- translation (“i love cats” → “ich liebe katzen”)
- summarization
- question answering

A single RNN/LSTM that outputs one label per input token can’t do this cleanly.
Seq2Seq uses two RNNs/LSTMs:

Encoder: reads the whole input and produces a summary (final hidden & cell state).

Decoder: starts from that summary and generates the output tokens one by one.

Think: the encoder writes a summary note; the decoder reads that note and writes the translation.

**Part 0 – Setup**


In [2]:
# Basic setup

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer #Tokenizer → converts words to integers
from tensorflow.keras.preprocessing.sequence import pad_sequences #pad_sequences → ensures all sentences have same length
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model #connects all layers in the encoder-decoder architecture
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Attention, MultiHeadAttention #Embedding → turns integer tokens into dense vectors, LSTM --> our main sequence model
import matplotlib.pyplot as plt


**Step 1: Create a Dataset**

In [3]:
# Create mapping: digits to words
digit_to_word = {
    "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine", "0": "zero"
}

# Generate input and target sequences
inputs = []
targets = []

for i in range(2000):   # 2000 examples
    num_seq = "".join(np.random.choice(list(digit_to_word.keys()), size=3))
    word_seq = " ".join([digit_to_word[d] for d in num_seq])
    inputs.append(" ".join(list(num_seq)))   # e.g. "1 2 3"
    targets.append(word_seq)                 # e.g. "one two three"

# Show a few examples
for i in range(5):
    print(f"{inputs[i]}  →  {targets[i]}")



3 8 5  →  three eight five
1 7 1  →  one seven one
9 5 9  →  nine five nine
9 1 0  →  nine one zero
1 7 3  →  one seven three



**Step 2: Tokenize the Words**

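The Tokenizer converts each token to an integer ID. IDs are assigned by frequency rather than by value, so “3 8 5” becomes [8, 10, 4] in the run below; ID 0 is reserved for padding.

pad_sequences pads the sequences so they all have the same length for batching.

We use separate vocabularies because the digits (“1”) and the words (“one”) are different token sets.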
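
As a rough illustration of what the tokenizing and padding steps do, here is a minimal pure-Python sketch. It assumes IDs are handed out by descending word frequency with 0 left free for padding, which is how the Keras `Tokenizer` behaves by default. The helpers `build_word_index`, `texts_to_ids`, and `pad_post` are hypothetical names for this example, not Keras functions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ID 0 is left free for padding
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_ids(texts, word_index):
    """Replace every known word with its ID; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def pad_post(sequences, maxlen, pad_id=0):
    """Right-pad each sequence to maxlen (assumes none is longer than maxlen)."""
    return [seq + [pad_id] * (maxlen - len(seq)) for seq in sequences]

sample = ["one two", "two three two", "three"]
index = build_word_index(sample)
ids = texts_to_ids(sample, index)
padded = pad_post(ids, maxlen=3)
print(index)   # {'two': 1, 'three': 2, 'one': 3}
print(padded)  # [[3, 1, 0], [1, 2, 1], [2, 0, 0]]
```

The most frequent word ("two") gets ID 1, which is why an ID generally does not match the digit it stands for.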

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenize inputs (digits)
input_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(inputs)
encoder_input = input_tokenizer.texts_to_sequences(inputs)

# Tokenize outputs (words)
target_tokenizer = Tokenizer()
target_tokenizer.fit_on_texts(targets)
decoder_target = target_tokenizer.texts_to_sequences(targets)

# Pad both to same length
max_encoder_len = max(len(s) for s in encoder_input)
max_decoder_len = max(len(s) for s in decoder_target)
encoder_input = pad_sequences(encoder_input, maxlen=max_encoder_len, padding='post')
decoder_target = pad_sequences(decoder_target, maxlen=max_decoder_len, padding='post')

# Vocabulary sizes
input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1

print("Input vocab:", input_vocab_size, "Target vocab:", target_vocab_size)
print("Encoder example:", encoder_input[0])




Input vocab: 11 Target vocab: 11
Encoder example: [ 8 10  4]


**Step 4: Prepare Decoder Input and Output**
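The decoder needs both an input sequence and an output sequence:
1. In a standard setup, the input starts with a <start> token followed by the target sentence.
2. The output is the same sentence shifted one step ahead, so the decoder learns to predict the next word.
3. For simplicity, the cell below skips both the <start> token and the shift: the decoder input is an unmodified copy of the target. This means the decoder can lower its loss simply by copying its input.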
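
The shifted input/output pairing described above can be sketched in NumPy as follows. This is a minimal sketch, not the notebook's own code: `make_teacher_forcing_pair` is a hypothetical helper, and reusing ID 0 as the <start> token is an assumption that only works here because every target has the same length, so 0 never appears as padding.

```python
import numpy as np

START_ID = 0  # assumption: reuse the unused padding ID as <start>

def make_teacher_forcing_pair(target_ids, start_id=START_ID):
    """Return (decoder_input, decoder_output) for a batch of target ID sequences."""
    target_ids = np.asarray(target_ids)
    decoder_in = np.full_like(target_ids, start_id)
    decoder_in[:, 1:] = target_ids[:, :-1]      # shift right by one step
    decoder_out = target_ids[..., np.newaxis]   # (batch, time, 1) for the sparse loss
    return decoder_in, decoder_out

targets_demo = np.array([[3, 8, 5], [1, 7, 1]])
dec_in, dec_out = make_teacher_forcing_pair(targets_demo)
print(dec_in)         # [[0 3 8]
                      #  [0 1 7]]
print(dec_out.shape)  # (2, 3, 1)
```

With this pairing the decoder at step t sees only the tokens before t, so it has to use the encoder state to predict the first word.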

In [9]:
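# Decoder input: an unshifted copy of the target IDs (no <start> token is prepended)
decoder_input = np.copy(decoder_target)
# Decoder output: the target IDs with a trailing axis, the shape the sparse loss expects
decoder_output = np.expand_dims(decoder_target, -1)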

print("Decoder input shape:", decoder_input.shape)
print("Decoder output shape:", decoder_output.shape)


Decoder input shape: (2000, 3)
Decoder output shape: (2000, 3, 1)


**Step 5: Build the Encoder**
1. The Embedding layer converts words → vector representations.
2. The LSTM processes the entire input sequence and returns:
   - state_h: the last hidden state
   - state_c: the last cell state
3. We store these as encoder_states; they become the initial memory for the decoder.
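
To make state_h and state_c more concrete, here is a minimal NumPy sketch of an LSTM forward pass that returns only the final hidden and cell states. The weight names (`W`, `U`, `b`), the gate ordering, and the random weights are assumptions made for illustration; this shows the general LSTM recurrence, not the exact internals of the Keras layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_state(inputs, W, U, b):
    """Run an LSTM over a (time, features) array and return the final (h, c)."""
    units = U.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x_t in inputs:
        z = x_t @ W + h @ U + b      # pre-activations of all four gates, shape (4 * units,)
        i, f, g, o = np.split(z, 4)  # input gate, forget gate, candidate, output gate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g            # keep part of the old memory, add new content
        h = o * np.tanh(c)           # expose a gated view of the memory
    return h, c

rng = np.random.default_rng(0)
time_steps, features, units = 3, 8, 4
W = rng.normal(scale=0.1, size=(features, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)
x = rng.normal(size=(time_steps, features))

state_h_demo, state_c_demo = lstm_final_state(x, W, U, b)
print(state_h_demo.shape, state_c_demo.shape)  # (4,) (4,)
```

Both states have one value per LSTM unit regardless of how long the input sequence is, which is what lets them serve as a fixed-size summary for the decoder.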

In [10]:
latent_dim = 64  # number of LSTM units (also used as the embedding size below)

# 1️⃣ Input layer for encoder
encoder_inputs = Input(shape=(max_encoder_len,), name="encoder_inputs")

# 2️⃣ Word embeddings (turn word IDs into vectors)
encoder_emb = Embedding(input_vocab_size, latent_dim, name="encoder_embedding")(encoder_inputs)

# 3️⃣ LSTM processes the sequence and keeps the final hidden & cell states
_, state_h, state_c = LSTM(latent_dim, return_state=True, name="encoder_lstm")(encoder_emb)

encoder_states = [state_h, state_c]  # pass to decoder


**Step 6: Build the Decoder**
1. The decoder also uses an LSTM.

2. Its initial state = the encoder’s final state (so it starts with the encoder’s “memory”).

3. The LSTM outputs a sequence of predictions (one per time step).

4. Finally, a Dense layer with softmax gives probabilities for each word in the vocabulary.
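
As a small illustration of the last step, the sketch below applies a dense projection followed by softmax to a fake LSTM output in NumPy. The array sizes and random values are assumptions chosen to keep the example small; only the shapes and the fact that each row sums to 1 matter.

```python
import numpy as np

def softmax(logits):
    """Softmax over the last axis, with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
lstm_out = rng.normal(size=(3, 4))   # (time steps, latent_dim)
kernel = rng.normal(size=(4, 11))    # (latent_dim, vocabulary size)
bias = np.zeros(11)

probs = softmax(lstm_out @ kernel + bias)
print(probs.shape)         # (3, 11): one distribution per time step
print(probs.sum(axis=-1))  # each row sums to 1 (up to rounding)
```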

In [11]:
# 1. Input for decoder
decoder_inputs = Input(shape=(max_decoder_len,), name="decoder_inputs")

# 2. Embedding for decoder side
decoder_emb = Embedding(target_vocab_size, latent_dim, name="decoder_embedding")(decoder_inputs)

# 3. LSTM that generates output using encoder states as its start
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name="decoder_lstm")
decoder_outputs, _, _ = decoder_lstm(decoder_emb, initial_state=encoder_states)

# 4. Dense layer converts LSTM output to word probabilities
decoder_dense = Dense(target_vocab_size, activation='softmax', name="decoder_output")
decoder_outputs = decoder_dense(decoder_outputs)


**Step 7: Combine Encoder + Decoder → Full Model**

Explanation:
1. The input is a pair: [encoder_input, decoder_input]
2. The output is what the decoder should predict (the digit sequence spelled out in words)
3. We use sparse_categorical_crossentropy because our outputs are integer word IDs (not one-hot vectors).
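
To show why integer IDs are enough, the following NumPy sketch computes cross-entropy once from integer labels and once from one-hot labels. The function names and the toy probabilities are made up for this example; it illustrates the idea behind the sparse loss rather than the Keras implementation.

```python
import numpy as np

def sparse_cross_entropy(probs, ids):
    """Mean negative log-probability of the correct ID at each position."""
    picked = np.take_along_axis(probs, ids[..., np.newaxis], axis=-1)
    return float(-np.log(picked).mean())

def one_hot_cross_entropy(probs, one_hot):
    """Same loss, computed from one-hot encoded labels."""
    return float(-(one_hot * np.log(probs)).sum(axis=-1).mean())

probs_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
ids_demo = np.array([0, 1])
one_hot_demo = np.eye(3)[ids_demo]

sparse_loss = sparse_cross_entropy(probs_demo, ids_demo)
dense_loss = one_hot_cross_entropy(probs_demo, one_hot_demo)
print(sparse_loss, dense_loss)  # the two values agree
```

The sparse form avoids building a one-hot array of shape (examples, time steps, vocabulary size), which matters more as the vocabulary grows.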

In [12]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()


**Step 8: Train the Model**


1. We feed both encoder & decoder inputs for each training example.
2. The model learns to map encoder_input → decoder_output.
3. Training is fast since the dataset is tiny.

In [13]:
history = model.fit(
    [encoder_input, decoder_input],
    decoder_output,
    batch_size=2,
    epochs=50,
    verbose=1
)


Epoch 1/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 2ms/step - loss: 0.3106
Epoch 2/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 0.0020
Epoch 3/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 6.5797e-04
Epoch 4/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 3.0085e-04
Epoch 5/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 1.5576e-04
Epoch 6/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 8.5773e-05
Epoch 7/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 4.8796e-05
Epoch 8/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 2.8284e-05
Epoch 9/50
[1m1000/1000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2ms/step - loss: 1.6595e-05
Epoch 10/50
[1m1000/1000[0m [32m━━━━━━━━━━

**Step 9: Test and Make Predictions**
1. We feed an input sequence through the encoder; the decoder input is filled with zeros because the target is unknown at prediction time.
2. We pick the most likely word at each timestep (argmax).
3. We decode the numbers back to readable text.

In [14]:
# Reverse lookup for words
rev_target_index = {v: k for k, v in target_tokenizer.word_index.items()}

def predict_sequence(input_seq):
    # The decoder input is all zeros here, not the previously generated tokens
    preds = model.predict([input_seq, np.zeros_like(decoder_input[:1])], verbose=0)
    pred_ids = np.argmax(preds[0], axis=1)  # most likely ID at each time step
    return " ".join([rev_target_index.get(i, '') for i in pred_ids if i > 0])

# Try first few
for i in range(5):
    inp = encoder_input[i:i+1]
    print("Input:", inputs[i])
    print("Target:", targets[i])
    print("Predicted:", predict_sequence(inp))
    print("-" * 40)



Input: 3 8 5
Target: three eight five
Predicted: one one one
----------------------------------------
Input: 1 7 1
Target: one seven one
Predicted: one one one
----------------------------------------
Input: 9 5 9
Target: nine five nine
Predicted: one one one
----------------------------------------
Input: 9 1 0
Target: nine one zero
Predicted: one one one
----------------------------------------
Input: 1 7 3
Target: one seven three
Predicted: one one one
----------------------------------------
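
One plausible reading of the identical predictions above: during training the decoder input was the target itself, so the network could reach a near-zero loss by copying its input, and at prediction time it receives all zeros, an input it never saw. The usual remedy is to train on a shifted decoder input and then decode step by step, feeding each predicted token back in. Below is a minimal sketch of that greedy loop. `greedy_decode`, `step_fn`, and `toy_step` are hypothetical names, and the toy decoder and its IDs are made up so the example runs without a trained model.

```python
import numpy as np

def greedy_decode(step_fn, init_state, start_id, max_len):
    """Generate max_len token IDs, feeding each prediction back as the next input."""
    token, state, output = start_id, init_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)  # next-token distribution and new state
        token = int(np.argmax(probs))         # greedy choice: most likely token
        output.append(token)
    return output

# Toy stand-in for a trained decoder: the "state" is simply the list of IDs
# still to be emitted, and each step puts all probability on the next one.
def toy_step(prev_token, state):
    probs = np.zeros(11)
    probs[state[0]] = 1.0
    return probs, state[1:]

rev_index_demo = {8: "three", 10: "eight", 4: "five"}  # made-up IDs for illustration
ids_out = greedy_decode(toy_step, init_state=[8, 10, 4], start_id=0, max_len=3)
print(" ".join(rev_index_demo[i] for i in ids_out))  # three eight five
```

With a Keras model, `step_fn` would typically wrap a separate inference-time decoder that takes the previous token and the LSTM states and returns the next-token probabilities and updated states; that model is not built in this notebook.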
