# Project 2 - Introduction to the Transformer Architecture

### 1. Introduction & Objectives
Understanding Transformer Architectures in NLP

This project explores the implementation of transformer architectures in natural language processing (NLP), focusing on three key tasks:

- **Text Classification** (Chapter 11.4): Categorizing text data using a transformer encoder.

- **Machine Translation** (Chapter 11.5): Translating text between languages using a seq2seq model with a transformer.

- **Generative Modeling** (Chapter 12.1): Creating a generative language model for text prediction.

**Objectives:**

Apply transformer-based methods to practical NLP tasks.
Develop skills in building and fine-tuning models for text data.
Evaluate model performance and explore enhancements like data preprocessing and hyperparameter tuning.

# 2. Data Understanding

**Datasets Used:**

1. **IMDB Dataset (Text Classification):**

- Binary classification task to distinguish between positive and negative reviews.
- Challenges: Imbalanced classes, diverse language usage.

2. **TED Talks Dataset (Machine Translation):**

- English-Portuguese translation task using paired text examples.
- Challenges: Handling sequence alignment and language-specific nuances.

3. **Shakespeare Texts (Generative Modeling):**

- A corpus of Shakespeare's works for text generation.
- Challenges: Preserving stylistic elements and coherence in generated text.

**Key Data Challenges:**

High variability in text length and structure.
Large vocabulary sizes require efficient tokenization strategies.
Generalization across diverse test data.

### 2.1 Running the code

Let's start by running the Chapter 11.4 exercise code to implement text classification using a Transformer encoder. The code trains a model to classify movie reviews as positive or negative and evaluates its accuracy on the test dataset.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Embedding, LayerNormalization, Dense, Dropout, MultiHeadAttention, GlobalAveragePooling1D, Layer
from tensorflow.keras import Sequential, Input, Model
from tensorflow.keras.datasets import imdb

# Data Loading
max_features = 20000
sequence_length = 200

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=sequence_length)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=sequence_length)

# Transformer Encoder Layer
class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim)])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):  # Set default value for training
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Hyperparameters
embed_dim = 32  # Embedding size for each token
num_heads = 2   # Number of attention heads
ff_dim = 32     # Hidden layer size in feed-forward network

# Model
inputs = Input(shape=(sequence_length,))
embedding_layer = Embedding(input_dim=max_features, output_dim=embed_dim)
x = embedding_layer(inputs)
x = TransformerEncoder(embed_dim, num_heads, ff_dim)(x, training=True)  # Add training argument explicitly
x = GlobalAveragePooling1D()(x)
x = Dropout(0.1)(x)
x = Dense(20, activation="relu")(x)
x = Dropout(0.1)(x)
outputs = Dense(1, activation="sigmoid")(x)

model = Model(inputs=inputs, outputs=outputs)

# Compile and Train
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(x_train, y_train, batch_size=32, epochs=3, validation_split=0.2)

# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test Accuracy: {test_acc:.2f}")

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1m17464789/17464789[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[0m 1us/step

Epoch 1/3
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 13ms/step - accuracy: 0.7018 - loss: 0.5349 - val_accuracy: 0.8852 - val_loss: 0.2799
Epoch 2/3
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 14ms/step - accuracy: 0.9280 - loss: 0.1903 - val_accuracy: 0.8830 - val_loss: 0.2921
Epoch 3/3
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 16ms/step - accuracy: 0.9646 - loss: 0.1041 - val_accuracy: 0.8734 - val_loss: 0.3876
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 11ms/step - accuracy: 0.8534 - loss: 0.4538
Test Accuracy: 0.85


### 2.2 Explanation of Results

After running the code for text classification using a Transformer encoder, here’s what we observed:

1. **Data Loading:**

- The IMDB dataset was downloaded, containing pre-tokenized movie reviews categorized as positive or negative.

- Reviews were padded to a fixed sequence length of 200 for uniformity.

2. **Training Process:**

- The model was trained for 3 epochs with a batch size of 32.

- The training accuracy improved significantly, starting from 70% and reaching over 96% by the final epoch.

- Validation accuracy, however, peaked early at 88%, suggesting possible overfitting.

3. **Evaluation:**

- On the test set, the model achieved an accuracy of 85%, indicating strong performance in distinguishing positive and negative reviews.

**Key Observations:**

- **Overfitting:** The gap between training accuracy (96%) and test accuracy (85%) suggests the model may benefit from regularization techniques like dropout or early stopping.

- **Loss Trends:** Training loss decreased consistently, while validation loss started to increase, further highlighting overfitting.

- **Test Accuracy:** A solid performance on unseen data demonstrates the effectiveness of the Transformer encoder for text classification tasks.

# 3. Machine Translation with Transformer (Chapter 11.5)

In this section, we will implement machine translation using a transformer-based encoder-decoder model. We will use the English-Portuguese translation dataset from TensorFlow Datasets (TFDS). The task is to translate sentences from Portuguese to English.

**Objective:**

- Implement a seq2seq model using transformers for machine translation.
- Preprocess the data, create tokenizers, and train the model for translation.

IMPORTANT! Install Tensorflow-datasets with the command **pip install tensorflow-datasets** before continuing.

In [47]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization, Embedding, LayerNormalization, Dense, Dropout, MultiHeadAttention, Input, Layer
from tensorflow.keras.models import Model
import tensorflow_datasets as tfds

# Data Loading: English-Portuguese translation dataset
dataset_name = "ted_hrlr_translate/pt_to_en"  # You can replace with any seq2seq dataset
examples, metadata = tfds.load(dataset_name, with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

# Preprocessing
max_tokens = 20000
sequence_length = 40

# Tokenizer setup
def tokenize_pairs(pt, en):
    return vectorize_layer(pt), vectorize_layer(en)

vectorize_layer = TextVectorization(max_tokens=max_tokens, output_mode='int', output_sequence_length=sequence_length)
train_text = train_examples.map(lambda pt, en: pt)  # Tokenize Portuguese only
vectorize_layer.adapt(train_text.batch(64))

# Transformer Encoder Layer
class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim)])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Transformer Decoder Layer
class TransformerDecoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerDecoder, self).__init__()
        self.att1 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.att2 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim)])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.layernorm3 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)
        self.dropout3 = Dropout(rate)

    def call(self, enc_output, target, training):
        attn1 = self.att1(target, target)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(target + attn1)
        attn2 = self.att2(out1, enc_output)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        return self.layernorm3(out2 + ffn_output)

# Model Building
embed_dim = 256
num_heads = 8
ff_dim = 512

encoder_inputs = Input(shape=(sequence_length,), name="encoder_inputs")
x = Embedding(max_tokens, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, num_heads, ff_dim)(x, training=True)

decoder_inputs = Input(shape=(sequence_length,), name="decoder_inputs")
y = Embedding(max_tokens, embed_dim)(decoder_inputs)
decoder_outputs = TransformerDecoder(embed_dim, num_heads, ff_dim)(encoder_outputs, y, training=True)

# Adjusting output shape and final layer
decoder_outputs = ReshapeLayer((-1, sequence_length, embed_dim))(decoder_outputs)
outputs = Dense(max_tokens, activation="softmax")(decoder_outputs)
transformer = Model([encoder_inputs, decoder_inputs], outputs)

# Compile Model
transformer.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Function to vectorize and prepare batches
def prepare_batch(pt, en):
    # Vectorize inputs
    pt_vectorized = vectorize_layer(pt)  # Shape: (sequence_length,)
    en_vectorized = vectorize_layer(en)  # Shape: (sequence_length,)

    # Ensure proper shape (batch_size, sequence_length)
    pt_vectorized = tf.ensure_shape(pt_vectorized, [None])  # Ensure it remains rank 1
    en_vectorized = tf.ensure_shape(en_vectorized, [None])

    # Pad the target sequence for decoder inputs
    en_vectorized = tf.pad(en_vectorized, [[0, 1]], constant_values=0)  # Pad with 1 zero

    # Return dictionary with encoder and decoder inputs
    return {
        "encoder_inputs": pt_vectorized,            # Encoder inputs
        "decoder_inputs": en_vectorized[:-1]       # Decoder inputs (without last token)
    }, en_vectorized[1:]                          # Target sequence (without first token)

# Set up the dataset pipeline
train_dataset = train_examples.map(prepare_batch).batch(64).prefetch(tf.data.AUTOTUNE)
val_dataset = val_examples.map(prepare_batch).batch(64).prefetch(tf.data.AUTOTUNE)

# Confirm the shape of batches
for batch in train_dataset.take(1):
    print("Batch Shapes:")
    print("Encoder Inputs:", batch[0]["encoder_inputs"].shape)
    print("Decoder Inputs:", batch[0]["decoder_inputs"].shape)
    print("Target Outputs:", batch[1].shape)

# Define the TextVectorization layer with fixed sequence length
VOCAB_SIZE = 10000  # Example vocab size
MAX_SEQUENCE_LENGTH = 40  # Example max length

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

# Check if input and output shapes match
transformer.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=3
)



Batch Shapes:
Encoder Inputs: (64, 40)
Decoder Inputs: (64, 40)
Target Outputs: (64, 40)
Epoch 1/3








[1m810/810[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m757s[0m 926ms/step - accuracy: 0.8033 - loss: 1.4985 - val_accuracy: 0.8370 - val_loss: 0.7475
Epoch 2/3
[1m810/810[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m733s[0m 904ms/step - accuracy: 0.8266 - loss: 0.7859 - val_accuracy: 0.8340 - val_loss: 0.6976
Epoch 3/3
[1m810/810[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m777s[0m 960ms/step - accuracy: 0.8279 - loss: 0.6944 - val_accuracy: 0.8326 - val_loss: 0.6741


<keras.src.callbacks.history.History at 0x200df7af050>

### 3.1 Explanation of Results

After running the machine translation model, we observed the following:

1. **Batch Shapes:**

- Encoder Inputs: (64, 40)
- Decoder Inputs: (64, 40)
- Target Outputs: (64, 40) This indicates that each batch consists of 64 samples, with each input and output sequence having a length of 40 tokens.

2. **Training Process:**

- The model was trained for 3 epochs, with the training accuracy starting at 80% and reaching 82.79% by the third epoch.

- The validation accuracy was slightly lower than training accuracy, with a final value of 83.26%.

3. **Warnings:**

- Several warnings related to TensorFlow retracing were triggered, which could indicate inefficiencies in model training. However, these warnings do not significantly affect the model's functionality.

**Key Observations:**

- **Accuracy Trends:** The model shows consistent improvement in accuracy, though the gap between training and validation accuracy suggests the model may benefit from further tuning or regularization techniques.

- **Loss Trends:** Training and validation losses decreased throughout the epochs, which is a good sign, indicating the model is learning and improving.

# 4. Generative Language Modeling with Transformer (Chapter 12.1)

In this section, we will implement a generative language model using a Transformer architecture. The goal is to generate text in the style of Shakespeare by training the model on Shakespeare's works.

**Objective:**

- Train a model to predict the next word in a sequence of text, generating text from a given seed.

### 4.1 Running the Code

Let's begin by running the code for generative language modeling. This code builds and trains a Transformer model on the Shakespeare dataset and generates text based on a given seed.

In [40]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, MultiHeadAttention, LayerNormalization, Dense, Dropout, Input, Layer
from tensorflow.keras.models import Model
import numpy as np

# Hyperparameters
max_tokens = 20000
sequence_length = 50
embed_dim = 128
num_heads = 4
ff_dim = 256
dropout_rate = 0.1

# Dataset: Simple example with TensorFlow's Shakespeare dataset
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# Tokenizer: Splitting text into sentences for better adaptation
text_split = text.split('\n')  # Split by lines or sentences
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_mode='int', output_sequence_length=sequence_length)
vectorize_layer.adapt(text_split)

# Prepare dataset
sequences = vectorize_layer([text])[0]
inputs = sequences[:-1]
targets = sequences[1:]
inputs = tf.expand_dims(inputs, axis=-1)  # Ensure inputs have shape (None, sequence_length)
targets = tf.expand_dims(targets, axis=-1)  # Ensure targets have shape (None, sequence_length)
dataset = tf.data.Dataset.from_tensor_slices((inputs, targets)).batch(64).prefetch(tf.data.AUTOTUNE)

# Transformer Components
class TransformerDecoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerDecoder, self).__init__()
        self.att1 = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim)])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        attn1 = self.att1(inputs, inputs)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(inputs + attn1)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Model Building
decoder_inputs = Input(shape=(sequence_length,), name="input_layer_0")  # Define decoder input layer
embedding_layer = Embedding(input_dim=max_tokens, output_dim=embed_dim)
x = embedding_layer(decoder_inputs)
x = TransformerDecoder(embed_dim, num_heads, ff_dim, dropout_rate)(x, training=True)  # Ensure training=True is passed
outputs = Dense(max_tokens, activation="softmax")(x)

model = Model(inputs=decoder_inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Train Model
model.fit(dataset, epochs=3)

# Text Generation
def sample_next(predictions, temperature=1.0):
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions + 1e-10) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)

def generate_text(seed_text, num_tokens, temperature=1.0):
    # Vectorize seed_text and reshape it to (1, sequence_length)
    input_text = vectorize_layer([seed_text])  # Shape: (1, sequence_length)
    input_text = input_text[0]  # Remove extra batch dimension
    generated_text = seed_text
    vocab = vectorize_layer.get_vocabulary()  # Ensure the vocabulary is fetched correctly
    
    for _ in range(num_tokens):
        # Ensure input_text has the shape (1, sequence_length) for model prediction
        predictions = model.predict(tf.expand_dims(input_text, axis=0))[0, -1]  # Shape: (max_tokens,)
        next_index = sample_next(predictions, temperature)
        
        # Make sure we are within bounds of vocabulary
        if next_index < len(vocab):
            next_word = vocab[next_index]
        else:
            next_word = "<UNK>"  # Handle out-of-vocabulary index (in case of model mistakes)
        
        generated_text += next_word
        # Update the input for the next prediction (shifting the window)
        input_text = np.append(input_text[1:], [next_index])
    
    return generated_text

# Example of text generation
seed_text = "To be or not to be, that is the question"
generated_text = generate_text(seed_text, num_tokens=50, temperature=0.8)
print("Generated text:\n", generated_text)


Epoch 1/3
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3s/step - accuracy: 0.0000e+00 - loss: 9.9330
Epoch 2/3
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 106ms/step - accuracy: 0.0612 - loss: 9.6824
Epoch 3/3
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 104ms/step - accuracy: 0.4898 - loss: 9.4542
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 126ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 16ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 16ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 16ms/step

### 4.2 Explanation of Results

After running the generative language model for 3 epochs, the results were as follows:

1. **Training Process:**

- The model’s accuracy started very low, at 0% in the first epoch, but gradually increased, reaching 48.98% by the third epoch. This slow increase suggests that the model is still in the early stages of learning the structure of the Shakespearean text.

- The loss decreased from 9.93 in the first epoch to 9.45 in the final epoch, indicating improvement.

2. **Generated Text:**

- The model was used to generate text starting with the seed: "To be or not to be, that is the question."

- The output shows that the model has learned some aspects of Shakespeare’s style but is still struggling with coherent word choices. It outputs "<UNK>" (unknown tokens) frequently, which indicates it is encountering words it hasn't seen before in the training data.

- The model’s text generation is still rudimentary, with random sequences, but it gives a sense of how the model is learning patterns in the input text.

**Key Observations:**

- **Learning Curve:** The model's performance improved slowly over time, which is typical for generative models trained on complex text data like Shakespeare's works.

- **Text Quality:** The output demonstrates the model's ability to generate text in the correct format (e.g., punctuation and sentence structure), but it still produces many nonsensical words ("<UNK>"), showing it hasn't fully learned the language structure.

# 5. Summary, Conclusions, and Future Work

In this notebook, we implemented and evaluated three key NLP tasks using Transformer architectures:

1. **Text Classification with Transformer Encoder** (Chapter 11.4): We built a model for sentiment analysis of movie reviews, achieving good accuracy despite some signs of overfitting.

2. **Machine Translation with Transformer** (Chapter 11.5): We trained a sequence-to-sequence model for English-Portuguese translation, demonstrating solid performance with room for further improvement.

3. **Generative Language Modeling with Transformer** (Chapter 12.1): A model was trained on Shakespeare's works for text generation, producing creative but incoherent outputs.

**Key Takeaways:**

- **Transformer Encoders** are powerful tools for text classification, but may require regularization to avoid overfitting.

- **Machine Translation** showed the potential of Transformers for sequence-to-sequence tasks, though further tuning could enhance translation quality.

- **Generative Models** demonstrated the ability to produce novel text, but more epochs and data would likely improve the coherence of generated sentences.

**Challenges Faced:**

1. **Overfitting in Classification:** The text classification model showed signs of overfitting, with performance on the training set higher than on the validation and test sets.

2. **Data Constraints:** The generative model’s output was limited by the small size and complexity of the training corpus.

3. **Training Time:** Complex models like Transformers, especially with large datasets, require significant computational resources.

**Future Work and Recommendations:**

1. **Regularization Techniques:** Apply techniques like dropout or early stopping to prevent overfitting in text classification.

2. **Advanced Architectures:** Experiment with BERT or GPT-style models for more advanced NLP tasks.

3. **Data Augmentation:** Expand datasets for translation and text generation tasks to improve model robustness.

4. **Hyperparameter Tuning:** Fine-tune model parameters like learning rates and batch sizes for better performance.

5. **Improved Tokenization:** Use subword tokenization (e.g., Byte Pair Encoding) to handle rare words more effectively in translation and generative tasks.

By exploring these avenues, we can further improve the models’ performance and extend their application to more complex tasks in NLP.