<a href="https://colab.research.google.com/github/AhmedGabl/sentiment-analysis-rnn/blob/main/Seq2Seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Introduction to Seq2Seq Models

**Seq2Seq (Sequence-to-Sequence)** models are a class of machine learning models designed for tasks where both input and output are sequences. This architecture is commonly used in applications like:

- **Machine Translation**: Translating sentences from one language to another.
- **Text Summarization**: Generating summaries of long text.
- **Speech-to-Text**: Converting speech into text.
- **Question Answering**: Generating answers to questions based on context.

A Seq2Seq model typically consists of two main components:

1. **Encoder**: Encodes the input sequence into a context vector (fixed-size vector, typically using RNNs, LSTMs, or GRUs).
2. **Decoder**: Decodes the context vector into the output sequence.

#### Architecture Overview:

1. **Encoder**:
    - Takes the input sequence (such as a sentence) and processes it step-by-step.
    - Each step updates a hidden state that captures the information from the sequence.

2. **Decoder**:
    - Uses the encoder's final hidden state as the context to generate the output sequence.
    - It predicts the next word or token in the sequence one at a time.

The Seq2Seq model can be trained using teacher forcing, where at each step during training the decoder receives the true previous token of the output sequence as input instead of its own prediction.
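To make the teacher-forcing setup concrete, here is a minimal sketch in plain Python. The token IDs (including `1` for a start marker and `2` for an end marker) are hypothetical and chosen only for illustration.

```python
# Hypothetical token IDs for one target sentence: 1 = <start>, 2 = <end>
target_tokens = [1, 7, 8, 9, 2]

# Under teacher forcing the decoder reads the true sequence without its last token...
decoder_input = target_tokens[:-1]
# ...and is trained to predict the same sequence shifted one step to the left
decoder_target = target_tokens[1:]

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(f"step {step}: decoder sees {seen}, should predict {expected}")
```

At every step the decoder is conditioned on the correct history, which is what distinguishes training from inference, where it has to rely on its own earlier predictions.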

---

### 2. Libraries & Setup

To implement the Seq2Seq model, we'll use **TensorFlow 2.x** and **Keras** in a Google Colab notebook. Let's start by installing and importing the necessary libraries:

```python
# Install TensorFlow
!pip install tensorflow

# Import required libraries
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
```

---

### 3. Dataset Preparation

For this notebook, we'll use a simple example of **English to French Translation** using a small dataset of parallel sentences.

We'll first preprocess the data by:
- Tokenizing text into sequences.
- Padding sequences to ensure uniform length.
- Creating training and testing datasets.

#### Example Dataset:

Here’s a small sample of parallel sentences for English and French:

```plaintext
English: "Hello"
French: "Bonjour"

English: "How are you?"
French: "Comment ça va?"
```

We will create the following functions for dataset preparation.

```python
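# Sample parallel sentences
english_sentences = ["Hello", "How are you?", "I am fine.", "Good morning", "Good night"]
french_sentences = ["Bonjour", "Comment ça va?", "Je vais bien.", "Bonjour", "Bonne nuit"]

# Wrap each target sentence in <start>/<end> markers so the decoder can learn
# where a translation begins and where it should stop
french_sentences = [f"<start> {sentence} <end>" for sentence in french_sentences]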

# Tokenization and Padding
def tokenize_and_pad(texts, tokenizer=None, max_length=10):
    if tokenizer is None:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
        tokenizer.fit_on_texts(texts)
    
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_length, padding='post')
    
    return padded_sequences, tokenizer

# Tokenizing and padding English and French texts
english_padded, english_tokenizer = tokenize_and_pad(english_sentences)
french_padded, french_tokenizer = tokenize_and_pad(french_sentences)
```

This function will tokenize the input and output sentences and ensure they are padded to the same length.
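As a rough illustration of what tokenization and post-padding produce, the following plain-Python sketch mimics the idea without Keras. The helpers `build_vocab` and `encode_and_pad` are hypothetical names, and the sketch is a simplification: it assigns IDs in order of first appearance, whereas the Keras `Tokenizer` orders words by frequency.

```python
def build_vocab(texts):
    """Map each distinct lowercase word to an integer ID, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode_and_pad(texts, vocab, max_length):
    """Convert texts to ID lists, then truncate or right-pad with 0 to max_length."""
    padded = []
    for text in texts:
        ids = [vocab[word] for word in text.lower().split()][:max_length]
        padded.append(ids + [0] * (max_length - len(ids)))
    return padded

toy_texts = ["Good morning", "Good night"]
toy_vocab = build_vocab(toy_texts)
toy_padded = encode_and_pad(toy_texts, toy_vocab, max_length=4)
print(toy_vocab)
print(toy_padded)
```

Reserving ID 0 for padding is the reason the vocabulary sizes in the model are computed as the number of words plus one.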

---

### 4. Building the Seq2Seq Model

Now that the data is preprocessed, let's build the Seq2Seq model. For simplicity, we'll use **LSTM** cells for both the encoder and decoder.

```python
# Hyperparameters
latent_dim = 256  # Latent dimension for LSTM
max_encoder_seq_length = max(len(sentence.split()) for sentence in english_sentences)
max_decoder_seq_length = max(len(sentence.split()) for sentence in french_sentences)
num_encoder_tokens = len(english_tokenizer.word_index) + 1
num_decoder_tokens = len(french_tokenizer.word_index) + 1

# Encoder Model
encoder_inputs = layers.Input(shape=(None,))
encoder_embedding = layers.Embedding(input_dim=num_encoder_tokens, output_dim=latent_dim)(encoder_inputs)
encoder_lstm, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder Model
decoder_inputs = layers.Input(shape=(None,))
decoder_embedding = layers.Embedding(input_dim=num_decoder_tokens, output_dim=latent_dim)(decoder_inputs)
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_lstm_output, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_lstm_output)

# Seq2Seq Model
model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
```

This is a basic Seq2Seq model:
- **Encoder**: An Embedding layer followed by an LSTM layer that processes the input and returns its final hidden and cell states.
- **Decoder**: Another Embedding and LSTM pair that processes the target sequence, initialized with the encoder's final states.
- **Dense Layer**: Outputs a probability distribution over the target vocabulary for each timestep in the sequence.

---

### 5. Training the Model

Now, let's train the model. Since we are using **teacher forcing**, we need to provide the target sequence shifted by one timestep.

```python
# Preparing data for training
decoder_input_data = french_padded[:, :-1]  # Remove last token
decoder_target_data = french_padded[:, 1:]  # Remove first token

# Train the model
model.fit([english_padded, decoder_input_data], np.expand_dims(decoder_target_data, -1),
          batch_size=32, epochs=100, validation_split=0.2)
```

The target data is shifted by one timestep to ensure that the model learns to predict the next word in the sequence.

---

### 6. Inference: Decoding

Once the model is trained, we need to decode new sentences by generating the output sequence one word at a time.

```python
# Inference models for prediction

# Define an inference model for the encoder
encoder_model = keras.Model(encoder_inputs, encoder_states)

# Define an inference model for the decoder
decoder_state_input_h = layers.Input(shape=(latent_dim,))
decoder_state_input_c = layers.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm_output, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_lstm_output)
decoder_model = keras.Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# Function to decode a sequence
def decode_sequence(input_seq):
    # Encode the input; the encoder's final states initialize the decoder
    states_value = list(encoder_model.predict(input_seq, verbose=0))

    # Start from the '<start>' token if the vocabulary has one, otherwise fall back to index 0
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = french_tokenizer.word_index.get('<start>', 0)
    decoded_tokens = []

    while True:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=0)

        # Get the most likely next token (greedy decoding)
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        # Index 0 is reserved for padding and has no entry in index_word
        sampled_token = french_tokenizer.index_word.get(sampled_token_index, '')

        # Exit condition: end token, padding, or maximum length (counted in tokens, not characters)
        if sampled_token in ('<end>', '') or len(decoded_tokens) >= max_decoder_seq_length:
            break
        decoded_tokens.append(sampled_token)

        # Update the target sequence and states
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states_value = [h, c]

    return ' '.join(decoded_tokens)

# Test decoding
test_sentence = "Hello"
test_sequence = english_tokenizer.texts_to_sequences([test_sentence])
# Pad to the same length that was used for the training inputs
test_sequence = tf.keras.preprocessing.sequence.pad_sequences(test_sequence, maxlen=english_padded.shape[1], padding='post')

decoded_sentence = decode_sequence(test_sequence)
print(f"Input: {test_sentence}")
print(f"Decoded: {decoded_sentence}")
```
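The decoding loop above picks the most likely token at every step, which is a greedy strategy. The sketch below isolates that control flow in plain Python; the `toy_scores` table is a hypothetical stand-in for the decoder's output probabilities and is not produced by the trained model.

```python
# Hypothetical next-token scores: for each current token, a score per candidate next token
toy_scores = {
    "<start>": {"bonne": 0.7, "nuit": 0.2, "<end>": 0.1},
    "bonne": {"bonne": 0.1, "nuit": 0.8, "<end>": 0.1},
    "nuit": {"bonne": 0.1, "nuit": 0.1, "<end>": 0.8},
}

def greedy_decode(scores, max_tokens=10):
    """Repeatedly pick the highest-scoring next token until <end> or max_tokens."""
    current = "<start>"
    decoded = []
    while len(decoded) < max_tokens:
        candidates = scores[current]
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        decoded.append(next_token)
        current = next_token
    return decoded

print(greedy_decode(toy_scores))
```

The `max_tokens` limit plays the same role as the maximum-length check in the notebook: it guarantees termination even if the end token is never predicted.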

---

### 7. Conclusion

This notebook provides a complete guide to building, training, and using a Seq2Seq model for sequence generation tasks. We've created a simple Seq2Seq architecture using LSTM units in TensorFlow/Keras and performed English-to-French translation on a toy dataset.

You can further enhance this model by:
- Increasing dataset size.
- Experimenting with different architectures such as **GRU** or **Bidirectional LSTMs**.
- Fine-tuning hyperparameters like latent dimension size and sequence length.

Happy coding!

### 12. More Advanced Seq2Seq Techniques

In this section, we will cover some additional aspects of Seq2Seq training and optimization, including **delayed batching**, **gradient clipping**, **bucketing**, **scheduling**, and more.

---

### 12.1. Delayed Batching

**Delayed batching** improves training efficiency by postponing padding until a batch of sequences has been collected, so the amount of padding depends on the sequences in that batch rather than on the whole dataset.

In Seq2Seq models, where sequence lengths often vary widely, this kind of **dynamic batching** can save memory and speed up training because less computation is spent on padding tokens.

For example, rather than padding all sequences to the maximum length and then creating a batch, you can collect smaller batches of sequences and then pad them to the maximum length within that batch.

#### Example of Delayed Batching:

```python
# Suppose we have sequences of varying lengths
sequences = [
    [1, 2, 3],     # Sequence of length 3
    [4, 5, 6, 7],  # Sequence of length 4
    [8, 9],        # Sequence of length 2
    [10, 11, 12, 13, 14],  # Sequence of length 5
]

# Delayed batching: collect batch_size sequences, then pad them to the longest one in that batch
def delayed_batch(sequences, batch_size):
    batches = []
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        max_length = max(len(seq) for seq in chunk)  # Max length within this batch only
        # Pad every sequence in the batch with zeros up to max_length
        batches.append(np.array([seq + [0] * (max_length - len(seq)) for seq in chunk]))
    return batches

# Collect batches of two sequences each; every batch gets its own padded length
for batch in delayed_batch(sequences, batch_size=2):
    print(batch)
```

In this example:
- We wait until we have a few sequences to form a batch.
- The sequences are then padded to the same length before forming the batch, which allows more efficient training by utilizing dynamic padding.
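To see how much padding per-batch padding can avoid, the following sketch counts padding tokens under both strategies. The helper `count_padding` and the toy data are hypothetical and only meant to illustrate the comparison.

```python
def count_padding(sequences, batch_size, pad_globally):
    """Count padding tokens when padding to the global max vs. each batch's own max."""
    global_max = max(len(seq) for seq in sequences)
    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        target = global_max if pad_globally else max(len(seq) for seq in batch)
        total += sum(target - len(seq) for seq in batch)
    return total

padding_demo = [[1, 2], [3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16]]
print(count_padding(padding_demo, batch_size=2, pad_globally=True))   # 8
print(count_padding(padding_demo, batch_size=2, pad_globally=False))  # 2
```

In this toy case, padding per batch needs 2 padding tokens instead of 8. The saving depends on how the sequences happen to be grouped.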

---

### 12.2. Gradient Clipping

**Gradient clipping** is a technique used to avoid the problem of **exploding gradients** during training. This is particularly important for Seq2Seq models, as they often involve long sequences where gradients can grow exponentially, leading to unstable training.

Gradient clipping works by setting a threshold for the gradients. Gradients that exceed the threshold are either clamped component by component or rescaled as a whole so that they stay within the allowed range.

#### Example of Gradient Clipping in Keras:

In Keras, gradient clipping can be easily applied by specifying the `clipvalue` or `clipnorm` parameter in the optimizer.

```python
# Using gradient clipping with Keras optimizer
optimizer = tf.keras.optimizers.RMSprop(clipvalue=5.0)  # Clip each gradient component to [-5.0, 5.0]

model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
```

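- **clipvalue**: Clips the gradients by value: every gradient component is clamped to the range `[-clipvalue, clipvalue]`, which can change the direction of the gradient.
- **clipnorm**: Clips the gradients by norm: the gradient of each weight is rescaled so that its norm does not exceed the threshold, which preserves its direction. Keras also offers `global_clipnorm`, which applies the same idea to the combined norm of all gradients.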

Using gradient clipping ensures that the gradients are always within a manageable range, which stabilizes the training process.
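The difference between the two clipping modes is easiest to see on a small vector. This is a plain-Python sketch of the arithmetic only, with hypothetical helper names; it is not how Keras implements clipping internally.

```python
import math

def clip_by_value(gradient, clip_value):
    """Clamp every component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in gradient]

def clip_by_norm(gradient, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm; direction is preserved."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        return list(gradient)
    return [g * clip_norm / norm for g in gradient]

toy_gradient = [3.0, -4.0]  # L2 norm is 5.0
print(clip_by_value(toy_gradient, 2.0))  # components clamped to +/-2.0
print(clip_by_norm(toy_gradient, 1.0))   # same direction, norm scaled down to 1.0
```

Clipping by value turns `[3.0, -4.0]` into `[2.0, -2.0]`, which points in a different direction, while clipping by norm keeps the 3:4 ratio between the components.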

---

### 12.3. Bucketing Sequences by Length

**Bucketing** is a technique used to group sequences of similar lengths together into batches. This helps in improving training efficiency by reducing padding. If we were to batch sequences of varying lengths, we would have to pad shorter sequences with zeros, leading to inefficient use of memory.

In bucketing, we group sequences into predefined buckets based on their length. Each bucket is padded to the maximum length within that bucket, which reduces unnecessary padding for shorter sequences.

#### Example of Bucketing Sequences:

```python
# Suppose we have the following sequences with varying lengths
sequences = [
    [1, 2, 3],    # Length 3
    [4, 5, 6, 7], # Length 4
    [8, 9],       # Length 2
    [10, 11, 12, 13, 14], # Length 5
]

# Define buckets (for example, sequences of length 1-3, 4-5, etc.)
buckets = [
    (1, 3),  # Sequences of length 1 to 3
    (4, 5),  # Sequences of length 4 to 5
]

# Function to bucket sequences
def bucket_sequences(sequences, buckets):
    binned_sequences = {bucket: [] for bucket in buckets}
    
    for seq in sequences:
        seq_length = len(seq)
        for bucket in buckets:
            if bucket[0] <= seq_length <= bucket[1]:
                binned_sequences[bucket].append(seq)
                break
    
    return binned_sequences

# Bucket the sequences
binned_sequences = bucket_sequences(sequences, buckets)
print(binned_sequences)
```

In this example:
- Sequences are grouped into buckets based on their length.
- The buckets ensure that sequences of similar lengths are processed together, which reduces the need for excessive padding.
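Once sequences are grouped, each bucket can be padded to its own longest sequence. The sketch below assumes a dictionary shaped like the bucketing output in this section; `pad_buckets` is a hypothetical helper name.

```python
def pad_buckets(binned):
    """Pad the sequences in each bucket to that bucket's longest sequence."""
    padded = {}
    for bucket, seqs in binned.items():
        if not seqs:
            continue  # skip empty buckets
        longest = max(len(seq) for seq in seqs)
        padded[bucket] = [seq + [0] * (longest - len(seq)) for seq in seqs]
    return padded

toy_binned = {
    (1, 3): [[1, 2, 3], [8, 9]],
    (4, 5): [[4, 5, 6, 7], [10, 11, 12, 13, 14]],
}
toy_padded_buckets = pad_buckets(toy_binned)
print(toy_padded_buckets)
```

Here only two padding tokens are needed in total, one per bucket, whereas padding all four sequences to length 5 would need six.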

---

### 12.4. Scheduled Sampling

**Scheduled Sampling** is a technique that helps in addressing the issue of **exposure bias** in Seq2Seq models. Exposure bias occurs when the model is trained using ground truth tokens (teacher forcing) during training, but during inference, it must generate tokens on its own. This discrepancy can lead to poor generalization.

Scheduled Sampling gradually transitions from using the ground truth to using the model's own predictions during training. Initially, the model is trained using the true previous tokens, but as training progresses, it uses more of its own predictions.

#### Example of Scheduled Sampling in Keras:

Scheduled Sampling can be implemented by modifying the decoder’s input at each timestep:

```python
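import random

# Function for scheduled sampling
def scheduled_sampling(inputs, outputs, current_step, max_steps):
    """
    inputs: ground truth decoder input sequence
    outputs: model-predicted sequence with the same shape as inputs
    current_step: current step in training
    max_steps: total number of steps in the training loop
    """
    # The probability of teacher forcing decays linearly from 1 towards 0
    use_ground_truth = random.random() < (max_steps - current_step) / max_steps
    return inputs if use_ground_truth else outputs

# Example usage in a training loop (reuses the model and arrays from the training section)
max_steps = 100
for step in range(max_steps):
    # Get the model's current next-token predictions
    probs = model.predict([english_padded, decoder_input_data], verbose=0)
    predicted = np.argmax(probs, axis=-1)
    # Shift predictions right by one step so they line up with the decoder input positions
    predicted_input = np.concatenate([decoder_input_data[:, :1], predicted[:, :-1]], axis=1)
    decoder_input = scheduled_sampling(decoder_input_data, predicted_input, step, max_steps)
    model.train_on_batch([english_padded, decoder_input], np.expand_dims(decoder_target_data, -1))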

In this approach:
- **Initially**, the model uses the true labels (ground truth) during training (teacher forcing).
- **Gradually**, the model is fed more of its own predictions during training, which reduces exposure bias.
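The transition described above can also be applied per token rather than per sequence. The following plain-Python sketch shows a linear decay schedule and a per-position mix of true and predicted tokens; the function names and token values are hypothetical.

```python
import random

def teacher_forcing_probability(current_step, max_steps):
    """Linear decay from 1.0 at step 0 toward 0.0 at the final step."""
    return max(0.0, (max_steps - current_step) / max_steps)

def mix_decoder_inputs(ground_truth, predictions, probability, rng):
    """Per position, keep the true token with the given probability, else the predicted one."""
    return [
        truth if rng.random() < probability else predicted
        for truth, predicted in zip(ground_truth, predictions)
    ]

sampler = random.Random(0)  # seeded so the sketch is repeatable
truth_tokens = [5, 6, 7, 8]
predicted_tokens = [5, 9, 7, 3]
for step in (0, 50, 100):
    p = teacher_forcing_probability(step, max_steps=100)
    print(step, p, mix_decoder_inputs(truth_tokens, predicted_tokens, p, sampler))
```

At step 0 the mix equals the ground truth, and at the final step it equals the model's predictions. Other decay shapes (for example exponential) are possible; linear decay is used here only because it matches the schedule in the example above.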

---

### 12.5. Curriculum Learning

**Curriculum learning** is a strategy where the model is first trained on simpler tasks or easier sequences before progressing to more complex ones. In the context of Seq2Seq models, this might mean training on shorter sequences before moving on to longer sequences.

This method can help the model learn faster and perform better by starting with simpler examples and gradually increasing difficulty.

#### Example of Curriculum Learning:

```python
# Function to implement curriculum learning
def curriculum_learning(sequences, length_threshold):
    easier_sequences = [seq for seq in sequences if len(seq) <= length_threshold]
    harder_sequences = [seq for seq in sequences if len(seq) > length_threshold]
    
    return easier_sequences, harder_sequences

# Curriculum learning example
# The longest sequence above has 5 tokens, so a threshold of 3 leaves a non-empty "harder" set
length_threshold = 3
easier_sequences, harder_sequences = curriculum_learning(sequences, length_threshold)
print("Easier sequences:", easier_sequences)
print("Harder sequences:", harder_sequences)
```

In this method:
- The model is trained first on **easier sequences** (shorter sequences) and progressively exposed to more difficult sequences as training progresses.
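A curriculum usually has more than two stages. As a sketch of how the progression might be organized, the generator below yields a growing subset of the data for each length threshold; `curriculum_stages` and the thresholds are hypothetical choices for illustration.

```python
def curriculum_stages(sequences, thresholds):
    """Yield (threshold, subset) pairs; each stage admits sequences up to a longer length."""
    for threshold in sorted(thresholds):
        yield threshold, [seq for seq in sequences if len(seq) <= threshold]

curriculum_data = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
for threshold, subset in curriculum_stages(curriculum_data, thresholds=[2, 3, 5]):
    # A real training loop would fit the model on `subset` at this point
    print(f"max length {threshold}: {len(subset)} sequences")
```

Each stage includes the sequences from the earlier stages, so the model keeps seeing the easy examples while harder ones are added.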

---

### 12.6. Masking and Attention for Variable-Length Sequences

When working with variable-length sequences, especially when using **attention mechanisms**, it’s crucial to handle padding tokens appropriately. Masking ensures that the padding tokens are ignored during attention calculations.

In Keras, setting `mask_zero=True` on the `Embedding` layer treats token ID 0 as padding and propagates a mask to downstream layers such as `LSTM`. The standalone `Masking` layer compares the values it receives against `mask_value`, so placing it after an `Embedding` layer does not detect padded positions, because the embedding of ID 0 is generally not a zero vector.

#### Example of Masking in Keras:

```python
# Mask padding tokens (ID 0) directly in the embedding layer
encoder_inputs = layers.Input(shape=(None,))
encoder_embedding = layers.Embedding(
    input_dim=num_encoder_tokens, output_dim=latent_dim, mask_zero=True
)(encoder_inputs)

# The LSTM receives the mask automatically and skips padded timesteps
encoder_output, state_h, state_c = layers.LSTM(
    latent_dim, return_state=True, return_sequences=True
)(encoder_embedding)
```

- **Masking** ensures that padding tokens do not update the recurrent state, and mask-aware layers can exclude them from attention and loss calculations, leading to more accurate models.
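To illustrate what excluding padding from the loss means numerically, here is a plain-Python sketch of a masked cross-entropy. The function name and the toy probabilities are hypothetical, and padding is assumed to use token ID 0 as elsewhere in this notebook.

```python
import math

def masked_cross_entropy(probabilities, targets, pad_id=0):
    """Average negative log-likelihood over non-padding positions only.

    probabilities: per timestep, a list of class probabilities
    targets: per timestep, the true token ID (pad_id marks padding)
    """
    losses = [
        -math.log(step_probs[target])
        for step_probs, target in zip(probabilities, targets)
        if target != pad_id
    ]
    return sum(losses) / len(losses) if losses else 0.0

toy_probs = [
    [0.1, 0.8, 0.1],    # timestep 0
    [0.2, 0.3, 0.5],    # timestep 1
    [0.9, 0.05, 0.05],  # timestep 2 (padding position)
]
toy_targets = [1, 2, 0]
print(masked_cross_entropy(toy_probs, toy_targets))
```

Only the first two timesteps contribute to the average. Without the mask, the confident prediction at the padded position would make the loss look better than the model's performance on real tokens.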

---

### Conclusion

In this extended explanation, we covered several advanced topics related to Seq2Seq models:

- **Delayed Batching**: Efficient batching by collecting sequences first and padding each batch only to its own longest sequence.
- **Gradient Clipping**: Preventing exploding gradients by clipping them during training.
- **Bucketing Sequences**: Grouping sequences of similar lengths to reduce padding.
- **Scheduled Sampling**: Gradually introducing the model's predictions during training to reduce exposure bias.
- **Curriculum Learning**: Starting with easier sequences and progressively increasing the difficulty.
- **Masking for Attention**: Ensuring padding tokens are ignored during attention-based processing.

These techniques are essential for improving the efficiency, stability, and performance of Seq2Seq models, especially in real-world tasks like machine translation and speech recognition. Experimenting with these methods will allow you to build more robust and scalable models.