<a href="https://colab.research.google.com/github/AhmedGabl/sentiment-analysis-rnn/blob/main/notebooks/seq2seq_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoder-Decoder Architecture in Deep Learning

## Overview
The encoder-decoder architecture is a powerful neural network design used for sequence-to-sequence tasks, such as machine translation, text summarization, and speech recognition. It consists of two main components:

1. **Encoder**: The encoder reads the input sequence and compresses it into a fixed-size context vector.
2. **Decoder**: The decoder uses the context vector to generate the output sequence.

In this notebook, we will explore the theory behind the encoder-decoder architecture and implement a simple sequence-to-sequence model using TensorFlow/Keras.

---

## Part 1: Understanding the Encoder-Decoder Architecture

### 1.1 Encoder

The encoder is typically a recurrent neural network (RNN), often one of its gated variants: Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU). It processes the input sequence step by step and produces a context vector, usually its final hidden state, that summarizes the entire input sequence.

In a typical RNN, the hidden state at each time step `t` depends on the input at that time step and the previous hidden state:
\[ h_t = f(W \cdot x_t + U \cdot h_{t-1}) \]
Where:
- \( x_t \) is the input at time step `t`.
- \( h_t \) is the hidden state at time step `t`.
- \( f \) is an activation function (e.g., tanh or sigmoid).
- \( W \) and \( U \) are learnable weight matrices.
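
The recurrence can be sketched in a few lines of NumPy. This is a minimal illustration rather than the Keras implementation: the sizes, the random weights, and the helper name `rnn_forward` are assumptions made for the example, and the bias term is omitted to match the formula above.

```python
import numpy as np

def rnn_forward(inputs, W, U):
    """Apply h_t = tanh(W @ x_t + U @ h_{t-1}) over a sequence, starting from h_0 = 0."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5      # illustrative sizes
W = rng.normal(size=(hidden_dim, input_dim))
U = rng.normal(size=(hidden_dim, hidden_dim))
inputs = rng.normal(size=(seq_len, input_dim))

states = rnn_forward(inputs, W, U)
print(states.shape)              # one hidden state per time step
context_vector = states[-1]      # the final state is what a basic encoder passes on
```

The last row of `states` plays the role of the context vector described in this section.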

### 1.2 Decoder

The decoder is another RNN that generates the output sequence. Its hidden state is initialized from the encoder's context vector, and it then produces the output one token at a time in an autoregressive manner: each step is conditioned on the tokens generated so far.

### 1.3 Attention Mechanism

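The attention mechanism lets the decoder look at all of the encoder's hidden states instead of relying on a single context vector. At each decoding step it scores every input position against the current decoder state, normalizes the scores into weights, and uses the weighted sum of encoder states as that step's context. This helps with long sequences, where a single fixed-size vector tends to lose information.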
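
As a rough sketch of one attention step, the NumPy snippet below uses plain dot-product scoring. The function names (`softmax`, `attend`), the sizes, and the random states are assumptions for illustration; other scoring functions (additive, scaled dot-product) follow the same pattern.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoding step."""
    scores = encoder_states @ decoder_state    # one dot-product score per input position
    weights = softmax(scores)                  # normalize scores into a distribution
    context = weights @ encoder_states         # weighted sum of encoder states
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))   # 6 input positions, hidden size 4 (illustrative)
decoder_state = rng.normal(size=4)

weights, context = attend(decoder_state, encoder_states)
print(weights.round(3))
print(context.shape)
```

The weights change at every decoding step because the decoder state changes, which is what lets the model shift its focus across the input.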

---

## Part 2: Implementing Encoder-Decoder with Attention

We will now build a simple sequence-to-sequence model with attention. As an example we'll use a toy character-level task (reversing strings) in place of real translation. For simplicity, we will work with small sequences of text.

---

### Step 1: Installing Dependencies

```python
!pip install tensorflow
```

### Step 2: Importing Libraries

```python
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import string
```

### Step 3: Data Preparation

We'll create a simple dataset where the input sequence is a reversed string, and the output sequence is the original string. This is a basic example of a sequence-to-sequence task.

```python
# Sample data: reversing strings
def generate_data(num_samples=1000, max_length=10):
    chars = string.ascii_lowercase + " "  # Include space
    data_in = []
    data_out = []
    for _ in range(num_samples):
        length = np.random.randint(1, max_length+1)
        word = ''.join(np.random.choice(list(chars), length))
        data_in.append(word[::-1])  # Reversed string as input
        data_out.append(word)       # Original string as output
    return data_in, data_out

data_in, data_out = generate_data()
print(f"Input Example: {data_in[0]}, Output Example: {data_out[0]}")
```

### Step 4: Tokenizing the Data

```python
tokenizer_in = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer_out = tf.keras.preprocessing.text.Tokenizer(char_level=True)

# Fit tokenizers
tokenizer_in.fit_on_texts(data_in)
tokenizer_out.fit_on_texts(data_out)

input_seq = tokenizer_in.texts_to_sequences(data_in)
output_seq = tokenizer_out.texts_to_sequences(data_out)

max_input_len = max(len(seq) for seq in input_seq)
max_output_len = max(len(seq) for seq in output_seq)

input_seq = tf.keras.preprocessing.sequence.pad_sequences(input_seq, maxlen=max_input_len, padding='post')
output_seq = tf.keras.preprocessing.sequence.pad_sequences(output_seq, maxlen=max_output_len, padding='post')

print(f"Input shape: {input_seq.shape}, Output shape: {output_seq.shape}")
```

### Step 5: Building the Encoder-Decoder Model

Now, we'll build a simple encoder-decoder architecture with attention.

```python
def build_model(vocab_size_in, vocab_size_out, embedding_dim=256, hidden_units=256):
    # Encoder: keep the per-step outputs (for attention) and the final states
    encoder_input = layers.Input(shape=(None,))
    encoder_emb = layers.Embedding(vocab_size_in, embedding_dim)(encoder_input)
    # A unidirectional LSTM with return_state=True returns (outputs, state_h, state_c)
    encoder_outputs, state_h, state_c = layers.LSTM(hidden_units, return_state=True, return_sequences=True)(encoder_emb)
    encoder_states = [state_h, state_c]

    # Decoder: starts from the encoder's final states and emits one vector per step
    decoder_input = layers.Input(shape=(None,))
    decoder_emb = layers.Embedding(vocab_size_out, embedding_dim)(decoder_input)
    decoder_lstm = layers.LSTM(hidden_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_emb, initial_state=encoder_states)

    # Attention: Keras expects [query, value], so the decoder outputs query the encoder outputs
    context = layers.Attention()([decoder_outputs, encoder_outputs])
    decoder_combined = layers.Concatenate(axis=-1)([decoder_outputs, context])

    # Per-step distribution over the output vocabulary
    decoder_dense = layers.Dense(vocab_size_out, activation='softmax')
    decoder_output_final = decoder_dense(decoder_combined)

    model = tf.keras.Model([encoder_input, decoder_input], decoder_output_final)
    return model

vocab_size_in = len(tokenizer_in.word_index) + 1
vocab_size_out = len(tokenizer_out.word_index) + 1

model = build_model(vocab_size_in, vocab_size_out)
model.summary()
```

### Step 6: Compiling and Training the Model

```python
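# Teacher forcing: the decoder input is the target shifted right by one step.
# The vocabulary has no dedicated start token, so index 0 (the padding index)
# is assumed to double as the start marker.
decoder_input_seq = np.pad(output_seq, ((0, 0), (1, 0)))[:, :-1]

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit([input_seq, decoder_input_seq], np.expand_dims(output_seq, -1), epochs=10, batch_size=64)
```

### Step 7: Testing the Model

Now we can test the model by using it to reverse new sequences. Decoding is greedy: at each step the most likely character is chosen and fed back as the next decoder input.

```python
def predict_sequence(input_sequence):
    input_sequence = tokenizer_in.texts_to_sequences([input_sequence])
    input_sequence = tf.keras.preprocessing.sequence.pad_sequences(input_sequence, maxlen=max_input_len, padding='post')

    # Position 0 stays 0, the same start marker used during training
    decoder_input = np.zeros((1, max_output_len))
    output = []

    for t in range(max_output_len):
        prediction = model.predict([input_sequence, decoder_input], verbose=0)
        sampled_token = int(np.argmax(prediction[0, t, :]))
        if sampled_token == 0:  # padding index: the sequence has ended
            break
        output.append(tokenizer_out.index_word.get(sampled_token, ''))
        # The token predicted at step t is the decoder input for step t + 1
        if t + 1 < max_output_len:
            decoder_input[0, t + 1] = sampled_token

    return ''.join(output)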

input_example = "edcba"
predicted_output = predict_sequence(input_example)
print(f"Input: {input_example}, Predicted Output: {predicted_output}")
```

---

## Part 3: Conclusion

In this notebook, we implemented a basic encoder-decoder architecture with attention in TensorFlow/Keras. This architecture is widely used for sequence-to-sequence tasks like machine translation, text summarization, and more. By using attention, the model can focus on different parts of the input sequence while generating the output.

We demonstrated the process with a simple sequence reversal task. In real-world applications, more complex datasets and models would be used, but the basic principles outlined here are applicable to a variety of sequence generation problems.

---

## Next Steps

1. **Improving the Model**: You can try adding more layers to the encoder or decoder, or experimenting with different types of attention mechanisms.
2. **Real-World Tasks**: Use this architecture for machine translation (e.g., English to French), text summarization, or image captioning tasks.
3. **Fine-Tuning**: You can further fine-tune the model and experiment with hyperparameters.

---

Happy coding! 🎉

### 1. **Delayed Batching**

Delayed batching, as the term is used here, means postponing padding and batch assembly until sequences have been grouped by length, instead of padding the whole dataset to one global maximum up front. This reduces memory use and wasted computation when sequences vary in length. In seq2seq models, it helps manage varying input and output lengths more effectively.

- **Why delay batching?**
  - In tasks like machine translation, input sequences can vary greatly in length.
  - Padding shorter sequences to match the longest one in a batch can lead to inefficient training (wasting computation on padded values).
  - By batching the sequences based on their lengths or other criteria, you can reduce wasted computation and make better use of the model's capacity.

- **Implementation:**
  - Typically, sequences are sorted or grouped into length ranges ("buckets") before batching, and each batch is padded only to its own longest sequence. This is often called "bucketed" or "dynamic" batching, and it's common in natural language processing (NLP) tasks.
  - In TensorFlow/Keras, this can be implemented with the `tf.data.Dataset` API, for example with its `bucket_by_sequence_length` method, which batches elements based on sequence length.

Here's an example code for implementing delayed batching with `tf.data`:

```python
import tensorflow as tf
import numpy as np
import string

# Generate data function (like the reversed string task from before)
def generate_data(num_samples=1000, max_length=10):
    chars = string.ascii_lowercase + " "
    data_in = []
    data_out = []
    for _ in range(num_samples):
        length = np.random.randint(1, max_length+1)
        word = ''.join(np.random.choice(list(chars), length))
        data_in.append(word[::-1])  # Reversed string as input
        data_out.append(word)       # Original string as output
    return data_in, data_out

# Tokenize, but do not pad yet: padding is deferred until batches are formed
data_in, data_out = generate_data()
tokenizer_in = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer_out = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer_in.fit_on_texts(data_in)
tokenizer_out.fit_on_texts(data_out)

# Variable-length lists of token ids (no global padding)
input_ids = tokenizer_in.texts_to_sequences(data_in)
output_ids = tokenizer_out.texts_to_sequences(data_out)

# Create a length-bucketed dataset using tf.data
def create_dataset(input_data, output_data, batch_size=64, boundaries=(4, 7)):
    def pairs():
        yield from zip(input_data, output_data)

    spec = tf.TensorSpec(shape=(None,), dtype=tf.int32)
    dataset = tf.data.Dataset.from_generator(pairs, output_signature=(spec, spec))
    dataset = dataset.shuffle(buffer_size=1024)
    # Group sequences of similar length; each batch is padded (with 0)
    # only to the longest sequence it contains
    dataset = dataset.bucket_by_sequence_length(
        element_length_func=lambda x, y: tf.shape(x)[0],
        bucket_boundaries=list(boundaries),
        bucket_batch_sizes=[batch_size] * (len(boundaries) + 1),
    )
    return dataset

# Example usage: batch widths now depend on the bucket each batch came from
train_dataset = create_dataset(input_ids, output_ids)

# Check a few batches
for input_batch, output_batch in train_dataset.take(3):
    print(f"Input batch shape: {input_batch.shape}, Output batch shape: {output_batch.shape}")
```
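
To get a feel for how much padding length-based grouping can save, here is a rough NumPy sketch that only counts padded positions. It is independent of TensorFlow; the helper `padded_tokens` and the use of a full sort (instead of buckets) are simplifying assumptions for illustration.

```python
import numpy as np

def padded_tokens(lengths, batch_size):
    """Count padding positions when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += int(np.sum(np.max(batch) - batch))
    return total

rng = np.random.default_rng(0)
lengths = rng.integers(1, 11, size=1000)    # sequence lengths 1..10, as in the toy task

global_padding = int(np.sum(lengths.max() - lengths))          # pad everything to the global maximum
random_batches = padded_tokens(lengths, batch_size=64)          # batches in arbitrary order
sorted_batches = padded_tokens(np.sort(lengths), batch_size=64) # batches of similar length

print(global_padding, random_batches, sorted_batches)
```

Fully sorting removes most padding but also removes randomness from batch composition, which is why bucketing with shuffling inside buckets is a common compromise.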

### 2. **Teacher Forcing in Seq2Seq Models**

**Teacher forcing** is a technique used during training of sequence-to-sequence models where, at each time step, the true output from the training set is fed as the next input to the decoder, rather than using the model's own prediction from the previous time step. This helps the model converge more quickly and avoids error accumulation, which can happen if the model gets confused early on.

- **Why teacher forcing?**
  - It helps the model learn faster since it is always shown the correct output at each time step during training.
  - It can mitigate the issue of error propagation, where mistakes made early in the sequence cause the decoder to make progressively worse predictions.
  
- **Drawbacks**:
  - The model only ever sees correct prefixes during training, so at inference, when the true outputs are not available and it must consume its own predictions, early mistakes can compound. This train/inference mismatch is often called exposure bias.

Here’s an illustration of teacher forcing during training:

1. **With Teacher Forcing**: At each timestep during training, the true previous token is passed as input to the decoder.
2. **Without Teacher Forcing**: At each timestep, the model’s own previous output is passed as input to the decoder, as it is at inference time (**auto-regressive decoding**).

#### Example Code for Teacher Forcing:

```python
def train_with_teacher_forcing(model, input_seq, output_seq, batch_size=64, epochs=10):
    # Decoder input = target shifted right by one step. The vocabulary has no dedicated
    # start token, so index 0 (the padding index) is assumed to double as the start marker.
    output_seq_input = np.pad(output_seq, ((0, 0), (1, 0)))[:, :-1]
    output_seq_target = output_seq  # at step t the model must predict the true token t

    # Train the model using teacher forcing (true previous tokens as decoder input)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    history = model.fit(
        [input_seq, output_seq_input],  # Encoder input and shifted target sequence
        np.expand_dims(output_seq_target, -1),  # True outputs as target
        batch_size=batch_size,
        epochs=epochs
    )

    return history

# Example usage: Training with teacher forcing
history = train_with_teacher_forcing(model, input_seq, output_seq)
```

### 3. **Seq2Seq Details and Other Concepts**

Now, let's quickly run through some other essential details and techniques commonly used in seq2seq architectures:

#### a) **Bidirectional Encoder**:
In some seq2seq models, the encoder is **bidirectional**, meaning it processes the input sequence in both forward and backward directions. This allows the model to capture context from both past and future elements of the sequence.

```python
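embedding_dim, hidden_units = 256, 256  # same defaults as build_model
encoder_input = layers.Input(shape=(None,))
encoder_emb = layers.Embedding(vocab_size_in, embedding_dim)(encoder_input)
encoder_bi_lstm = layers.Bidirectional(layers.LSTM(hidden_units, return_state=True, return_sequences=True))
# Returns the concatenated forward/backward outputs (2 * hidden_units wide), then both directions' states
encoder_lstm_output, forward_h, forward_c, backward_h, backward_c = encoder_bi_lstm(encoder_emb)
# Keep only the forward states so they fit a decoder LSTM with hidden_units units;
# using both directions (e.g. concatenated) would need a decoder with 2 * hidden_units units
encoder_states = [forward_h, forward_c]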
```

#### b) **Attention Mechanism**:
As we discussed earlier, the **attention mechanism** allows the model to focus on different parts of the input sequence at each step of the decoding process. This is particularly useful when dealing with long sequences.

```python
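# Attention layer example (using Keras' Attention layer, which takes [query, value])
decoder_lstm = layers.LSTM(hidden_units, return_sequences=True, return_state=True)
decoder_output, _, _ = decoder_lstm(decoder_emb, initial_state=encoder_states)
# Dot-product attention needs equal feature sizes, so project the encoder outputs to hidden_units
encoder_proj = layers.Dense(hidden_units)(encoder_lstm_output)
context = layers.Attention()([decoder_output, encoder_proj])
decoder_combined = layers.Concatenate(axis=-1)([decoder_output, context])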
```

#### c) **Pointer-Generator Networks**:
In some seq2seq models (especially for tasks like text summarization), a **pointer-generator** mechanism is used to allow the model to both generate words from a fixed vocabulary and copy words from the input sequence (useful for handling out-of-vocabulary words).
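
The core of the mechanism is a mixture of two distributions: a generation probability `p_gen` weights the vocabulary distribution, and `1 - p_gen` weights the attention mass on each source token. The NumPy sketch below illustrates that mixture for a single decoding step; the function name, the tiny vocabulary, and the numbers are all made up for the example, and a real model would compute `p_gen` with a learned layer.

```python
import numpy as np

def pointer_generator_distribution(p_gen, vocab_dist, attention, source_ids, extended_size):
    """Mix generating from the vocabulary with copying from the source.

    final[w] = p_gen * vocab_dist[w] + (1 - p_gen) * (attention on source positions holding w)
    """
    final = np.zeros(extended_size)
    final[:len(vocab_dist)] = p_gen * vocab_dist
    # np.add.at accumulates correctly when a token id repeats in the source
    np.add.at(final, source_ids, (1.0 - p_gen) * attention)
    return final

vocab_dist = np.array([0.1, 0.6, 0.3])    # distribution over a 3-word vocabulary
attention = np.array([0.5, 0.2, 0.3])     # attention over 3 source positions
source_ids = np.array([1, 3, 1])          # id 3 is out-of-vocabulary (extended id)

final = pointer_generator_distribution(0.7, vocab_dist, attention, source_ids, extended_size=4)
print(final, final.sum())
```

The out-of-vocabulary token (id 3) receives probability only through the copy term, which is how the model can output words it could never generate from the fixed vocabulary.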

#### d) **Scheduled Sampling**:
Instead of using pure teacher forcing, **scheduled sampling** mixes true labels and model predictions as decoder inputs. Early in training the decoder is mostly fed true labels; as training progresses, the probability of feeding the model's own prediction is increased, so training conditions gradually approach those of real-world inference.
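
A minimal sketch of the idea, assuming a linear decay schedule (other schedules, such as exponential or inverse-sigmoid decay, are also used). The helper names and the `predicted_tokens` array are hypothetical stand-ins; in a real training loop the predictions would come from the decoder at the previous step.

```python
import numpy as np

def teacher_forcing_prob(step, decay_steps=1000, floor=0.0):
    """Linearly decay the probability of feeding the true token (illustrative schedule)."""
    return max(floor, 1.0 - step / decay_steps)

def choose_decoder_inputs(true_tokens, predicted_tokens, prob_true, rng):
    """Per position, keep the true token with probability prob_true, else use the prediction."""
    use_true = rng.random(len(true_tokens)) < prob_true
    return np.where(use_true, true_tokens, predicted_tokens)

rng = np.random.default_rng(0)
true_tokens = np.array([5, 2, 9, 4])
predicted_tokens = np.array([5, 7, 9, 1])    # stand-in for model predictions

for step in (0, 500, 1000):
    p = teacher_forcing_prob(step)
    print(step, p, choose_decoder_inputs(true_tokens, predicted_tokens, p, rng))
```

At step 0 this reduces to pure teacher forcing, and once the schedule reaches 0 the decoder is trained entirely on its own predictions.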

### Conclusion

- **Delayed Batching** helps efficiently handle variable-length sequences by batching sequences with similar lengths, improving memory usage and computation efficiency.
- **Teacher Forcing** accelerates training by preventing the model from diverging during early steps, though it may lead to poor performance during inference without true outputs.
- **Other Seq2Seq Details** include bidirectional encoders, attention mechanisms, and advanced techniques like scheduled sampling and pointer-generator networks to improve model performance in complex tasks like machine translation, summarization, and text generation.

You can experiment with these concepts in your seq2seq models to improve both training efficiency and model performance on real-world tasks.

Once you are comfortable with sequence-to-sequence (seq2seq) models, there are many projects you can build with this architecture. Seq2seq models suit tasks where both the input and the output are sequences, such as machine translation, text summarization, and text generation. Below are several project ideas, each with a basic implementation structure and a note on how its seq2seq implementation differs from the others.

### 1. **Machine Translation (e.g., English to French)**

#### Task Description:
Translate sentences from one language to another. A seq2seq model is perfect for this task because the input and output are both sequences of words (or subword units).

#### Basic Structure:
- **Encoder**: Processes the input sentence (in the source language).
  - Tokenize the input sentence (e.g., word-level or subword-level).
  - Use an embedding layer to convert words to vectors.
  - Pass through an RNN, LSTM, or GRU to encode the input sequence.
  
- **Decoder**: Generates the translated sentence (in the target language).
  - The decoder uses the context vector (final hidden states from the encoder) as an initial state.
  - It generates the output word by word using teacher forcing (during training) or auto-regressive decoding (during inference).
  
- **Key Differences in Implementation**:
  - Tokenization needs to be done separately for both the source and target languages (which may require different vocabularies).
  - You will likely need a large parallel corpus (input-output pairs in different languages).
  - You may also introduce an attention mechanism to better handle longer sentences.

#### Example Framework:
1. Use the **TensorFlow/Keras** framework to build the model with `Embedding`, `LSTM/GRU`, and `Dense` layers.
2. Train the model on parallel language datasets (e.g., English-French corpus).
3. Use beam search or greedy decoding during inference for translation.
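
The difference between greedy decoding and beam search can be sketched on a toy problem. The probability table below is a hypothetical stand-in for a trained decoder (it conditions only on the previous token), and the sketch omits length normalization, which practical implementations often add.

```python
import numpy as np

EOS, A, B, START = 0, 1, 2, 3

# Hypothetical next-token probabilities over [EOS, A, B], indexed by the previous token.
NEXT_PROBS = np.array([
    [0.98, 0.01, 0.01],   # after EOS (never expanded)
    [0.34, 0.33, 0.33],   # after A
    [0.90, 0.05, 0.05],   # after B
    [0.02, 0.58, 0.40],   # after START
])
LOG_PROBS = np.log(NEXT_PROBS)

def beam_search(beam_width, max_len=5):
    """Return (tokens, log_prob) of the best hypothesis; beam_width=1 is greedy decoding."""
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            last = tokens[-1] if tokens else START
            for tok in (EOS, A, B):
                candidates.append((tokens + [tok], score + LOG_PROBS[last, tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Hypotheses that emitted EOS are complete; the rest stay on the beam
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(beam_width=1)
beam_tokens, beam_score = beam_search(beam_width=2)
print(greedy_tokens, np.exp(greedy_score))
print(beam_tokens, np.exp(beam_score))
```

In this toy table, greedy decoding commits to `A` first and ends with a sequence of probability about 0.20, while a beam of width 2 keeps `B` alive and finds a sequence with probability 0.36.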

### 2. **Text Summarization (Extractive and Abstractive)**

#### Task Description:
Given a long document, generate a concise summary of it. There are two main types of summarization:
- **Extractive**: Selects sentences directly from the input text.
- **Abstractive**: Generates new sentences that paraphrase the original text.

#### Basic Structure:
- **Encoder**: Encodes the input text (the document).
  - Tokenize the document into words or subword tokens.
  - Embed the tokens and pass through an RNN/LSTM/GRU.
  
- **Decoder**: Generates the summary.
  - For abstractive summarization, the decoder generates a sequence of new words to summarize the document.
  - You can use attention mechanisms to focus on the most relevant parts of the input.
  
- **Key Differences in Implementation**:
  - **Abstractive Summarization** will require a more complex decoder (typically with an attention mechanism) to paraphrase content.
  - **Extractive Summarization** might use simpler methods like sentence embeddings to choose relevant sentences instead of generating text.

#### Example Framework:
1. **Preprocessing**: Tokenize the document and summary. You can use datasets like CNN/Daily Mail for summarization.
2. Use **LSTM** with **Attention** to improve the quality of the summary.
3. **Evaluation**: Use ROUGE scores to evaluate the quality of the generated summaries.
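
To make the metric concrete, here is a simplified ROUGE-1 (unigram overlap) computation in plain Python. It is a sketch under stated assumptions: whitespace tokenization, lowercasing, and no stemming, so its numbers may differ from those of established ROUGE packages. The function name `rouge_1` is chosen for this example.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # each word counts at most as often as in both texts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat"
candidate = "the cat lay on a mat"

precision, recall, f1 = rouge_1(candidate, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here four of the six candidate words ("the", "cat", "on", "mat") also appear in the reference, so precision and recall are both 4/6.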

### 3. **Text Generation (e.g., Poetry, Jokes, or Stories)**

#### Task Description:
Generate new text given a starting prompt. This is often used in creative applications like generating poetry, stories, or jokes.

#### Basic Structure:
- **Encoder**: In this case, you might not have a traditional encoder as in translation or summarization. The model might just generate text from a starting prompt (which can be a single word or sentence).
  
- **Decoder**: Generates text one token (e.g., word or character) at a time.
  - You can use teacher forcing during training and then auto-regressive decoding during inference.
  
- **Key Differences in Implementation**:
  - The model typically starts with an initial token (a start-of-sequence token) and then generates the next token based on the previous ones.
  - Depending on the application, you may need to add randomness (e.g., temperature-based sampling) to control creativity.
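
Temperature-based sampling can be sketched as follows: the logits are divided by a temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it. The logits here are made-up values standing in for a model's output, and the function name is an assumption for this example.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a token id from softmax(logits / temperature); also return the probabilities."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                    # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                      # hypothetical model output for 3 tokens

for temperature in (0.5, 1.0, 2.0):
    token, probs = sample_with_temperature(logits, temperature, rng)
    print(temperature, probs.round(3), token)
```

Lower temperatures make the output more predictable (closer to greedy decoding), while higher temperatures give rarer tokens a better chance and increase variety.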

#### Example Framework:
1. **Preprocessing**: Tokenize text and convert it into sequences of characters or words.
2. **Architecture**: Use an **LSTM** or **GRU** with a **softmax layer** for generating the next token.
3. **Evaluation**: Evaluate based on diversity or human subjective assessments.

### 4. **Speech Recognition**

#### Task Description:
Convert spoken language into text. Given an audio sequence, generate the corresponding transcription.

#### Basic Structure:
- **Encoder**: The encoder processes the input audio features (such as Mel spectrograms or MFCC features).
  - Use **Convolutional Neural Networks (CNNs)** to extract features from the spectrogram.
  - The output is passed through **RNNs/LSTMs** to capture the temporal dependencies in the audio.

- **Decoder**: Converts the encoded audio features into text sequences (words or phonemes).
  - Similar to other seq2seq tasks, you’ll use an RNN-based decoder that predicts the next character or word.

- **Key Differences in Implementation**:
  - **Audio Features**: The input to the model is audio, which must be preprocessed into a form that a neural network can understand (e.g., Mel spectrogram).
  - **Decoder Output**: Instead of generating words directly, the decoder might output phonemes or characters, which are then decoded into words.

#### Example Framework:
1. **Data**: You can use datasets like **LibriSpeech** or **CommonVoice** for training.
2. **Audio Preprocessing**: Extract features like Mel spectrograms or MFCC from the audio.
3. **Model Architecture**: Use a **CNN** followed by an **LSTM** or **GRU** network to handle the sequential nature of speech.

### 5. **Image Captioning**

#### Task Description:
Given an image, generate a natural language description of the image.

#### Basic Structure:
- **Encoder**: The encoder is a **Convolutional Neural Network (CNN)**, which extracts features from the image.
  - Use a pre-trained model like **InceptionV3** or **ResNet** to extract image features.
  
- **Decoder**: The decoder is an RNN (typically LSTM) that generates a caption word by word based on the image features provided by the encoder.
  - Use attention mechanisms to focus on different parts of the image while generating the caption.

- **Key Differences in Implementation**:
  - The input is an image, so you need a CNN-based feature extractor to process it.
  - The output is text, so the decoder is a standard seq2seq text decoder, typically with an attention mechanism over the image features.

#### Example Framework:
1. **Preprocessing**: Resize and normalize images. Extract features using a pre-trained CNN.
2. **Architecture**: Combine image features with textual features using an **LSTM decoder** with an **attention mechanism**.
3. **Evaluation**: Evaluate captions using metrics like BLEU or METEOR.

---

### Key Differences in Seq2Seq Implementation Across Tasks

- **Input Type**:
  - **Text-to-text** tasks (e.g., machine translation, summarization) require tokenization of the text input.
  - **Audio-to-text** (e.g., speech recognition) requires audio feature extraction (e.g., Mel spectrogram).
  - **Image-to-text** (e.g., image captioning) requires image feature extraction using CNNs.
  
- **Encoder Type**:
  - **Text-based tasks** generally use **LSTMs** or **GRUs** for the encoder.
  - **Image captioning** uses a **CNN** for the encoder to extract features.
  - **Speech recognition** uses CNNs for feature extraction and LSTMs for sequential processing.

- **Attention Mechanism**:
  - **Attention** is especially helpful for tasks like **machine translation**, **text summarization**, and **image captioning**, where the decoder needs to focus on different parts of the input at each output step.
  
- **Output**:
  - In **text generation**, you may generate a sequence of words or characters.
  - In **speech recognition**, the model outputs text (e.g., words or phonemes).
  - In **image captioning**, the output is a sequence of words that describe the image.

### Conclusion

After learning seq2seq models, these projects offer hands-on experience in working with various types of sequence data. While the general structure of the encoder-decoder remains the same, the input type (text, audio, image) and the output (text or phonemes) influence how you design and implement each project. As you experiment with different tasks, you will deepen your understanding of the flexibility and challenges of seq2seq architectures.