# ============================
# Encoder-Decoder Architecture for Sentiment Analysis
# ============================

## Overview

This notebook implements an encoder-decoder architecture for sentiment analysis on the IMDB dataset. It covers importing libraries, preprocessing the data, defining the model, preparing decoder inputs, fitting the model, and evaluating its performance. The conclusion summarizes how effective this architecture is for sentiment analysis.

## Table of Contents
1. [Introduction](#introduction)
2. [Importing Necessary Libraries](#importing-necessary-libraries)
3. [Loading and Preprocessing the IMDB Dataset](#loading-and-preprocessing-the-imdb-dataset)
4. [Defining the Encoder-Decoder Model](#defining-the-encoder-decoder-model)
5. [Preparing Decoder Input Data](#preparing-decoder-input-data)
6. [Fitting the Model](#fitting-the-model)
7. [Evaluating the Model](#evaluating-the-model)
8. [Conclusion](#conclusion)

## Introduction

Encoder-decoder architectures are widely used in Natural Language Processing (NLP) tasks such as machine translation and text summarization, and they can also be adapted to classification tasks such as sentiment analysis. The architecture consists of two main components:

- **Encoder**: The encoder processes the input data (in our case, movie reviews) and compresses it into a fixed-size context vector. This vector summarizes the semantic meaning of the input.
  
- **Decoder**: The decoder uses the context vector to generate an output sequence (e.g., predicting sentiment). It can produce output one token at a time, incorporating the previously generated tokens for context.

In this notebook, we will implement an encoder-decoder model for sentiment analysis using the IMDB movie reviews dataset. The goal is to classify movie reviews as either positive or negative based on their content.

# ============================
# Encoder-Decoder Architecture
# ============================

## Introduction to Encoder-Decoder Models

Encoder-decoder models are a family of architectures used in natural language processing (NLP) tasks such as machine translation and text summarization. They consist of two main components: the encoder and the decoder.

### What is an Encoder?

The encoder processes the input sequence and converts it into a fixed-size context vector that captures the semantic information of the input. This vector summarizes the input data, allowing the decoder to generate the output sequence.

### What is a Decoder?

The decoder takes the context vector produced by the encoder and generates the output sequence one token at a time. At each step, it uses the context vector along with the previously generated tokens to predict the next token in the sequence.
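
During training, the "previously generated tokens" are commonly supplied from the known target sequence shifted right by one position, a technique often called teacher forcing. The following is a minimal sketch of that shift on toy data; the `targets` array and the use of `0` as a stand-in start token are assumptions made for illustration.

```python
import numpy as np

# Two toy target sequences of token ids (hypothetical data, already padded).
targets = np.array([[5, 8, 2, 9],
                    [4, 4, 7, 1]])

# The decoder input at step t is the target token from step t-1.
# Position 0 receives a 0, used here as a stand-in start token.
decoder_inputs = np.zeros_like(targets)
decoder_inputs[:, 1:] = targets[:, :-1]

print(decoder_inputs)
# [[0 5 8 2]
#  [0 4 4 7]]
```

Each row of `decoder_inputs` lags its row of `targets` by one step, so at every position the decoder sees only tokens that come earlier in the sequence.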

### How the Encoder-Decoder Works

1. **Encoding**: The input sequence is passed to the encoder, which processes it and produces a context vector.
2. **Decoding**: The decoder uses the context vector and generates the output sequence token by token.
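
The two steps above can be sketched with plain NumPy. This is an illustrative toy under stated assumptions, not the model built in this notebook: it uses a simple tanh recurrent step instead of an LSTM, the weights are random and untrained, and every name (`rnn_step`, `encode`, `decode`, the toy sizes) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, hidden_dim = 12, 4, 8  # toy sizes, chosen arbitrarily

# Randomly initialised (untrained) parameters, for illustration only.
embedding = rng.normal(size=(vocab_size, embed_dim))
W_enc = rng.normal(size=(embed_dim + hidden_dim, hidden_dim)) * 0.1
W_dec = rng.normal(size=(embed_dim + hidden_dim, hidden_dim)) * 0.1
W_out = rng.normal(size=(hidden_dim, vocab_size)) * 0.1


def rnn_step(token_id, hidden, weights):
    """One tanh RNN step: combine the token embedding with the previous state."""
    x = np.concatenate([embedding[token_id], hidden])
    return np.tanh(x @ weights)


def encode(token_ids):
    """Step 1 (encoding): fold the whole input into one fixed-size context vector."""
    hidden = np.zeros(hidden_dim)
    for token_id in token_ids:
        hidden = rnn_step(token_id, hidden, W_enc)
    return hidden


def decode(context, start_token=0, max_steps=5):
    """Step 2 (decoding): emit tokens one at a time, feeding each one back in."""
    hidden, token_id, output = context, start_token, []
    for _ in range(max_steps):
        hidden = rnn_step(token_id, hidden, W_dec)
        token_id = int(np.argmax(hidden @ W_out))  # greedy choice of the next token
        output.append(token_id)
    return output


short_context = encode([3, 7])
long_context = encode([3, 7, 1, 9, 4, 2])
generated = decode(long_context)

# Inputs of different lengths yield context vectors of the same size.
print(short_context.shape, long_context.shape)
print(generated)
```

Because the weights are untrained, the generated token ids carry no meaning. The point of the sketch is the data flow: a variable-length input is reduced to one vector of size `hidden_dim`, and the decoder starts from that vector and feeds each predicted token back in as its next input.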

### Use Cases

- **Machine Translation**: Translating text from one language to another.
- **Text Summarization**: Creating a summary of a longer piece of text.
- **Image Captioning**: Generating textual descriptions for images.

## Example: Encoder-Decoder for Sentiment Analysis using IMDB Dataset

Here, we will build an encoder-decoder model using TensorFlow and Keras for sentiment analysis on the IMDB movie reviews dataset.

# ============================
# Encoder-Decoder Model for Sentiment Analysis
# ============================

In [1]:
# Importing Necessary Libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb


In [None]:
# Dataset hyperparameters
max_words = 10000  # Limit vocabulary size to the 10,000 most frequent words
max_len = 100  # Maximum length (in tokens) of each input sequence

# Load the dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_words)

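# Pad or truncate every review to max_len tokens. By default, pad_sequences
# pads and truncates at the start, so long reviews keep their last max_len tokens.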
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# Print shapes of the training and testing data
print(f'x_train shape: {x_train.shape}, y_train shape: {y_train.shape}')
print(f'x_test shape: {x_test.shape}, y_test shape: {y_test.shape}')

# Define parameters for the model
embedding_dim = 128  # Dimension of each word embedding vector
latent_dim = 256  # Number of units in each LSTM layer (size of the hidden and cell states)

# ----------------------------
# Encoder
# ----------------------------
# Input layer for the encoder
encoder_inputs = Input(shape=(None,))
# Embedding layer to convert integer sequences to dense vectors
encoder_embedding = Embedding(input_dim=max_words, output_dim=embedding_dim)(encoder_inputs)
# LSTM layer for encoding
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embedding)
# Save the states to use them in the decoder
encoder_states = [state_h, state_c]

# ----------------------------
# Decoder
# ----------------------------
# Input layer for the decoder
decoder_inputs = Input(shape=(None,))
# Embedding layer for the decoder
decoder_embedding = Embedding(input_dim=max_words, output_dim=embedding_dim)(decoder_inputs)
# LSTM layer for decoding
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
# Passing the encoder states to the decoder
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
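# Dense layer for generating output predictions (sigmoid for binary classification).
# Note: because decoder_lstm uses return_sequences=True, this yields one
# prediction per time step, i.e. an output of shape (batch, timesteps, 1),
# whereas each review has a single sentiment label.
decoder_dense = Dense(1, activation='sigmoid')
# Final output from the decoder
decoder_outputs = decoder_dense(decoder_outputs)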

# ----------------------------
# Model Definition
# ----------------------------
# Define the model with encoder and decoder inputs
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model with Adam optimizer and binary cross-entropy loss
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display the model summary
model.summary()

# ----------------------------
# Prepare Decoder Input Data
# ----------------------------
# Prepare decoder input data: each review shifted right by one position,
# with 0 (the padding index) in the first position
decoder_input_data = np.zeros_like(x_train)  # Same shape as x_train, filled with zeros
decoder_input_data[:, 1:] = x_train[:, :-1]  # Shift inputs by one position

# ----------------------------
# Fit the Model
# ----------------------------
# Fit the model on the training data
model.fit([x_train, decoder_input_data], 
          np.expand_dims(y_train, axis=-1), 
          batch_size=64, epochs=10, validation_split=0.2)

# ----------------------------
# Evaluate the Model
# ----------------------------
# Prepare decoder input data for testing
test_decoder_input_data = np.zeros_like(x_test)
test_decoder_input_data[:, 1:] = x_test[:, :-1]  # Shift inputs for testing

# Evaluate the model on the test set
loss, accuracy = model.evaluate([x_test, test_decoder_input_data], 
                                 np.expand_dims(y_test, axis=-1))

# Print test results
print(f'Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}')

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
x_train shape: (25000, 100), y_train shape: (25000,)
x_test shape: (25000, 100), y_test shape: (25000,)
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 embedding (Embedding)       (None, None, 128)            1280000   ['input_1[0][0]']             
                                                                                                 