<a href="https://colab.research.google.com/github/KhotNoorin/Deep-Learning/blob/main/Encoder_Decoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoder Decoder:

The Encoder-Decoder architecture is a foundational framework used in many sequence-to-sequence (seq2seq) tasks such as machine translation, text summarization, and question answering. In Transformers, this architecture is implemented without any recurrent or convolutional networks. Instead, it uses self-attention and positional encodings to model dependencies between tokens.

Transformers, introduced in the paper **"Attention is All You Need" (Vaswani et al., 2017)**, improved over traditional encoder-decoder RNNs and LSTMs by enabling parallelization and better long-range dependency modeling.

---

## Architecture Components

<img src="https://miro.medium.com/v2/resize:fit:1200/1*uuRstKwN3cxzbzv6u0oUOg.png" width="600"/>

### 1. **Encoder**
- The encoder processes the input sequence and produces a sequence of continuous representations.
- Consists of **N identical layers** (typically 6 in the original Transformer).
- Each encoder layer contains:
  - **Multi-Head Self-Attention Mechanism**
  - **Position-wise Feed-Forward Network**
  - **Residual Connections and Layer Normalization**

**Self-Attention in Encoder:**
Each token attends to every other token in the input sentence, capturing contextual relationships.

**Positional Encoding:**
Since there is no recurrence or convolution, positional encodings are added to input embeddings to retain the order of words.

### 2. **Decoder**
- The decoder generates the output sequence one token at a time.
- Also composed of **N identical layers**.
- Each decoder layer contains:
  - **Masked Multi-Head Self-Attention Layer**
  - **Multi-Head Attention over Encoder Output**
  - **Feed-Forward Network**
  - **Residual Connections and Layer Normalization**

**Masked Self-Attention:**
Prevents positions from attending to future positions. This maintains the autoregressive property during training.

**Cross-Attention (Encoder-Decoder Attention):**
Allows each position in the decoder to attend over all positions in the encoder output. This helps the decoder focus on relevant parts of the input sentence.

---

## Workflow

1. **Input sequence** is passed to the encoder.
2. The **encoder outputs** a set of continuous representations.
3. During training, the **target sequence** is shifted and passed to the decoder.
4. The decoder uses both the **encoder output** and the previously generated tokens to predict the next token.
5. A **softmax layer** generates the probability distribution over the vocabulary.

---

## Advantages

- **Parallelization:** Unlike RNNs, Transformers allow parallel processing of sequence data.
- **Long-Range Dependencies:** Self-attention captures long-term dependencies better than RNNs or LSTMs.
- **Scalability:** Easily scales with data and compute, making it ideal for large pretraining like BERT, GPT, T5, etc.

---

## Summary

The Encoder-Decoder architecture in Transformers revolutionized NLP tasks by replacing recurrence with self-attention mechanisms. By separating the encoding of inputs and the generation of outputs, and enabling tokens to attend over each other globally, Transformers achieve state-of-the-art performance on various NLP benchmarks.


In [17]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [18]:
# Sample English-Hindi sentence pairs
english_sentences = ['hello', 'how are you', 'i am fine', 'thank you', 'good night']
hindi_sentences = ['नमस्ते', 'तुम कैसे हो', 'मैं ठीक हूँ', 'धन्यवाद', 'शुभ रात्रि']

In [19]:
# Add <start> and <end> tokens to Hindi
hindi_input_sentences = ['<start> ' + sent for sent in hindi_sentences]
hindi_target_sentences = [sent + ' <end>' for sent in hindi_sentences]

In [20]:
# Tokenize English
eng_tokenizer = Tokenizer()
eng_tokenizer.fit_on_texts(english_sentences)
eng_sequences = eng_tokenizer.texts_to_sequences(english_sentences)
eng_word_index = eng_tokenizer.word_index
max_eng_len = max(len(seq) for seq in eng_sequences)
encoder_input_data = pad_sequences(eng_sequences, maxlen=max_eng_len, padding='post')

In [21]:
# Tokenize Hindi
hin_tokenizer = Tokenizer(filters='')
hin_tokenizer.fit_on_texts(hindi_input_sentences + hindi_target_sentences)
hin_input_sequences = hin_tokenizer.texts_to_sequences(hindi_input_sentences)
hin_target_sequences = hin_tokenizer.texts_to_sequences(hindi_target_sentences)
hin_word_index = hin_tokenizer.word_index
num_hin_tokens = len(hin_word_index) + 1
max_hin_len = max(len(seq) for seq in hin_input_sequences)

decoder_input_data = pad_sequences(hin_input_sequences, maxlen=max_hin_len, padding='post')
decoder_target_data = pad_sequences(hin_target_sequences, maxlen=max_hin_len, padding='post')
decoder_target_onehot = tf.keras.utils.to_categorical(decoder_target_data, num_hin_tokens)

In [22]:
# Model parameters
latent_dim = 256

In [23]:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(len(eng_word_index) + 1, latent_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

In [24]:
# Decoder
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(num_hin_tokens, latent_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(num_hin_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [25]:
# Full model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

In [27]:
# Train the model
model.fit([encoder_input_data, decoder_input_data], decoder_target_onehot, batch_size=2, epochs=500, verbose=0)

<keras.src.callbacks.history.History at 0x7e2caf8a4bd0>

In [28]:
# Inference models
encoder_model = Model(encoder_inputs, encoder_states)

In [29]:
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

In [30]:
dec_emb2 = Embedding(num_hin_tokens, latent_dim)(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_outputs2 = decoder_dense(decoder_outputs2)
decoder_states2 = [state_h2, state_c2]

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2)

In [31]:
# Decode function
reverse_hin_index = {idx: word for word, idx in hin_word_index.items()}

In [32]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = hin_word_index.get('<start>', 1)

    stop_condition = False
    decoded_sentence = ''
    max_decoder_steps = max_hin_len + 5  # hard stop
    step_count = 0

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = reverse_hin_index.get(sampled_token_index, '')

        # Debug print (optional)
        # print(f"Predicted token: {sampled_token_index}, word: {sampled_word}")

        decoded_sentence += ' ' + sampled_word

        if (sampled_word == '<end>' or
            sampled_word == '' or
            step_count >= max_decoder_steps):
            stop_condition = True

        # Update the target sequence
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]
        step_count += 1

    return decoded_sentence.replace('<end>', '').strip()

In [33]:
# Test translation
for sent in english_sentences:
    input_seq = pad_sequences(eng_tokenizer.texts_to_sequences([sent]), maxlen=max_eng_len, padding='post')
    print(f"{sent} → {decode_sequence(input_seq)}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 210ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 218ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 40ms/step
hello → नमस्ते
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
how are you → तुम कैसे हो
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
i am fine → मैं ठीक ह