# Recurrent Neural Networks (RNNs) for Sequence Analysis with Keras and TensorFlow

This notebook demonstrates how to build and train Recurrent Neural Networks (RNNs) for sequence analysis tasks using Keras and TensorFlow. We will focus on using Long Short-Term Memory (LSTM) networks for sentiment analysis on text data.

## 1. Introduction to RNNs

### What are RNNs?
Recurrent Neural Networks (RNNs) are a type of neural network designed to work with sequential data. Unlike feedforward neural networks, RNNs have internal memory, allowing them to process sequences of inputs of arbitrary length.  This is achieved through recurrent connections, where the output of a neuron at time *t* is fed back as input to the same neuron at time *t+1*. This feedback loop allows the network to retain information about past inputs, making them suitable for tasks where context is important, such as natural language processing, time series analysis, and speech recognition.

**Key features of RNNs:**

- **Sequential Processing:** RNNs process input sequences one element at a time, maintaining a hidden state that is updated at each step based on the current input and the previous hidden state.
- **Memory:** The hidden state acts as a memory, allowing the network to retain information about past inputs and use it to influence the processing of future inputs.
- **Handling Variable Length Sequences:** RNNs can naturally handle input sequences of varying lengths, which is crucial for many real-world applications.
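
The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the function name `rnn_forward`, the weight names, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over one sequence and return the final hidden state.

    inputs: array of shape (time_steps, input_dim)
    """
    h = np.zeros(W_h.shape[0])           # initial hidden state
    for x_t in inputs:                   # process one sequence element at a time
        # The new state depends on the current input and the previous state
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4             # toy sizes, chosen only for illustration
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_seq = rng.normal(size=(2, input_dim))   # 2 time steps
long_seq = rng.normal(size=(7, input_dim))    # 7 time steps

# The same weights handle both lengths; only the number of loop steps differs
h_short = rnn_forward(short_seq, W_x, W_h, b)
h_long = rnn_forward(long_seq, W_x, W_h, b)
print(h_short.shape, h_long.shape)
```

Both calls return a hidden state of the same size, which is why a single set of weights can summarize sequences of different lengths.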

### Applications of RNNs
RNNs are used in a wide range of applications, including:

- **Natural Language Processing (NLP):** Sentiment analysis, machine translation, text generation, language modeling.
- **Time Series Analysis:** Stock price prediction, weather forecasting, anomaly detection.
- **Speech Recognition:** Converting spoken language into text.
- **Video Analysis:** Action recognition, video captioning.

### Why RNNs for Sequential Data?
RNNs are particularly useful for sequential data because they can capture dependencies between elements in a sequence.  For example, in a sentence, the meaning of a word often depends on the words that came before it.  RNNs can model these dependencies, whereas traditional feedforward networks treat each input independently.

### LSTM Architecture
In this notebook, we will implement a Long Short-Term Memory (LSTM) network, a type of RNN. LSTMs are designed to mitigate the vanishing gradient problem, which makes it difficult for traditional RNNs to learn long-range dependencies in sequences. They achieve this through a more complex cell structure that includes:

- **Cell State:**  Acts as a long-term memory, carrying information across many time steps.
- **Hidden State:**  Acts as a short-term memory, similar to the hidden state in a traditional RNN.
- **Gates:**  Control the flow of information into and out of the cell state. These include the input gate, forget gate, and output gate.
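
To make the roles of the cell state, hidden state, and gates concrete, here is a rough NumPy sketch of a single LSTM time step using the standard LSTM equations. The helper name `lstm_step`, the stacked weight layout, and the toy sizes are assumptions for illustration; Keras organizes its weights internally in its own way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Rows are stacked as: input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g              # updated cell state (long-term memory)
    h_t = o * np.tanh(c_t)                # updated hidden state (short-term memory)
    return h_t, c_t

rng = np.random.default_rng(1)
input_dim, units = 3, 5                   # toy sizes, chosen only for illustration
W = rng.normal(scale=0.3, size=(4 * units, input_dim))
U = rng.normal(scale=0.3, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):   # a sequence of 6 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

Note that the cell state is updated additively (`f * c_prev + i * g`), which is the property that helps gradients survive across many time steps.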

## 2. Dataset: IMDB Reviews for Sentiment Analysis

We will use the IMDB dataset, a collection of movie reviews from the Internet Movie Database (IMDB), each labeled as positive or negative. It is a widely used benchmark for binary sentiment classification and is readily available through Keras datasets. Our model's task is to predict whether a given review is positive or negative.

### Key Features
- 25,000 training reviews
- 25,000 test reviews
- Binary sentiment classification (positive/negative)
- Average review length: 200-300 words

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout, Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import train_test_split

# For evaluation: classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
# Load the IMDB dataset
vocab_size = 5000 # Consider only the top 5,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

print("Dataset loaded successfully!")
print(f"Training samples: {len(x_train)}")
print(f"Test samples: {len(x_test)}")

## 3. Data Preprocessing

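RNNs require input data in a specific format. For text, this typically involves tokenization, padding, and splitting into training, validation, and test sets. The Keras IMDB dataset is already tokenized into integer word indices and comes with a separate test set, so only padding and a validation split remain.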

**Steps:**

- **Padding Sequences:** Movie reviews have varying lengths, but batches require a fixed length. We pad shorter reviews with zeros at the front and truncate longer reviews at the end, so that every sequence has exactly 256 tokens.
- **Splitting into Training and Validation Sets:** We will split the original training set into training and validation sets to monitor the model's performance during training and prevent overfitting.
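
As a small illustration of what `padding='pre'` and `truncating='post'` do, the following pure-Python sketch mimics that behavior on toy data. The helper `pad_pre_truncate_post` and the toy reviews are made up for this example; the notebook itself uses Keras's `pad_sequences`.

```python
def pad_pre_truncate_post(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; cut long ones off at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # 'post' truncation: keep the first maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # 'pre' padding: fill values go in front
    return padded

toy_reviews = [[5, 8], [3, 9, 4, 7, 2, 6], [1, 2, 3, 4]]   # toy word-index sequences
result = pad_pre_truncate_post(toy_reviews, maxlen=4)
print(result)
# [[0, 0, 5, 8], [3, 9, 4, 7], [1, 2, 3, 4]]
```

With pre-padding, the real tokens sit at the end of the sequence, so they are the last inputs the forward LSTM sees before producing its final state.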

In [None]:
# Pad sequences so that each review is of the same length.
max_length = 256    # We will fix the maximum review length at 256 words
x_train_padded = pad_sequences(x_train, maxlen=max_length, padding='pre', truncating='post')
x_test_padded = pad_sequences(x_test, maxlen=max_length, padding='pre', truncating='post')

print("Training data shape after padding:", x_train_padded.shape)
print("Test data shape after padding:", x_test_padded.shape)

In [None]:
# Hold out 20% of the original training set as a validation set
x_train_final, x_val, y_train_final, y_val = train_test_split(x_train_padded, y_train, test_size=0.2, random_state=42)

print("Final training set shape:", x_train_final.shape, len(y_train_final))
print("Validation set shape:", x_val.shape, len(y_val))

## 4. Model Building

We will build an LSTM model for sentiment classification. The model architecture will include:

- **Embedding Layer:** Converts word indices into dense vector representations. This layer learns word embeddings during training.
- **Bidirectional LSTM Layer:** The core RNN layer that processes the sequences in both forward and backward directions, learning temporal dependencies.
- **Dropout Layer:** Regularization to prevent overfitting.
- **Dense Layers:** A fully connected hidden layer with 32 ReLU units, followed by a single-unit output layer with a sigmoid activation function for binary classification (positive or negative sentiment).

### Hyperparameter Choices and Rationale

- **Embedding Dimension (embedding_dim=256):** A dimension of 256 is a common starting point for word embeddings, providing a balance between expressiveness and computational efficiency. Higher dimensions can capture more semantic information but may lead to overfitting or increased training time.
- **LSTM Units (64):** Sufficient to capture temporal dependencies while keeping the model efficient.
- **Dropout Rates (0.2 and 0.3):** Dropout is used to prevent overfitting. The LSTM layer applies a dropout rate of 0.2 to its inputs, and a separate Dropout layer with a rate of 0.3 follows the LSTM to regularize the features passed to the dense layers.
- **Max Sequence Length (max_length=256):** This value covers the majority of review lengths in the IMDB dataset, ensuring most information is retained while keeping computation manageable.
- **Vocabulary Size (vocab_size=5000):** Limiting to the top 5,000 most frequent words reduces noise from rare words and keeps the embedding matrix size reasonable.
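
To see how these choices translate into model size, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four weight blocks, each with input weights, recurrent weights, and one bias vector) and that the Bidirectional wrapper concatenates the two directions; the helper function names are made up for this example.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate), each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embedding_dim, lstm_units = 5000, 256, 64

emb = embedding_params(vocab_size, embedding_dim)
bilstm = 2 * lstm_params(embedding_dim, lstm_units)   # forward + backward LSTM
dense_hidden = dense_params(2 * lstm_units, 32)        # input is both directions concatenated
dense_out = dense_params(32, 1)
total = emb + bilstm + dense_hidden + dense_out

print(f"Embedding: {emb:,}")
print(f"Bidirectional LSTM: {bilstm:,}")
print(f"Dense(32): {dense_hidden:,}")
print(f"Dense(1): {dense_out:,}")
print(f"Total: {total:,}")
```

Under these assumptions the embedding matrix accounts for 1,280,000 of roughly 1.45 million parameters, which is why `vocab_size` and `embedding_dim` dominate the model's size.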

In [None]:
def create_rnn_model(vocab_size, embedding_dim, max_length):
    """
    Build and return a Sequential bidirectional LSTM model for binary sentiment classification.

    Args:
        vocab_size (int): Size of the vocabulary (number of unique words to consider).
        embedding_dim (int): Dimension of the embedding vectors.
        max_length (int): Maximum length of input sequences.

    Returns:
        keras.Sequential: Uncompiled model; call `compile()` before training.
    """
    model = Sequential()
    # Embedding layer to learn word embeddings
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
    # LSTM layer with a Bidirectional wrapper
    model.add(Bidirectional(LSTM(units=64, dropout=0.2)))
    # Dropout layer to reduce overfitting
    model.add(Dropout(0.3))
    # Hidden dense layer with ReLU activation
    model.add(Dense(32, activation='relu'))
    # Final output layer for binary classification
    model.add(Dense(1, activation='sigmoid'))
    
    return model

# Create an RNN model
embedding_dim = 256
model = create_rnn_model(vocab_size, embedding_dim, max_length)
# Explicitly build the model to finalize shapes
model.build(input_shape=(None, max_length))
model.summary()

## 5. Model Training

We compile the model with an appropriate loss function, optimizer, and metrics, and then train it.

**Compilation:**

*   **Loss Function:** `binary_crossentropy` is used because this is a binary classification problem (positive or negative sentiment).
*   **Optimizer:** `Adam` optimizer with a learning rate of 0.0005.
*   **Metrics:** We track `accuracy` to measure the model's performance.
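
For intuition about the loss, binary cross-entropy can be computed by hand. The following is a minimal pure-Python sketch with made-up labels and probabilities; the clipping constant `eps` is an assumption added to avoid `log(0)`, not a value taken from this notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)    # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]                     # made-up sentiment labels
confident_right = [0.9, 0.1, 0.8, 0.2]    # probabilities close to the labels
confident_wrong = [0.1, 0.9, 0.2, 0.8]    # probabilities far from the labels

loss_right = binary_crossentropy(labels, confident_right)
loss_wrong = binary_crossentropy(labels, confident_wrong)
print(f"Loss when mostly right: {loss_right:.3f}")
print(f"Loss when mostly wrong: {loss_wrong:.3f}")
```

The loss grows sharply when the model assigns high probability to the wrong class, which is the behavior the optimizer exploits during training.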

**Training:**

*   We train the model using `model.fit()`, providing the training and validation data.
*   `epochs` specifies the maximum number of passes over the training set; early stopping may end training sooner.
*   `batch_size` defines the number of samples processed in each gradient update.
*   We use `EarlyStopping` and `ModelCheckpoint` callbacks for efficient training. `EarlyStopping` prevents overfitting by stopping training when the validation loss stops improving. `ModelCheckpoint` saves the best model based on validation accuracy.
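
The patience logic behind early stopping can be sketched in plain Python. This is a simplified illustration of the idea, not Keras's actual implementation; the function `simulate_early_stopping` and the validation losses are made up for this example.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both 0-based.

    stopped_epoch is None if the patience limit is never reached.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0                              # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses: improvement until epoch 3, then a slow rise
val_losses = [0.60, 0.45, 0.38, 0.36, 0.37, 0.39, 0.40, 0.41, 0.43, 0.44]
stopped, best = simulate_early_stopping(val_losses, patience=5)
print("Stopped at epoch:", stopped)   # 8
print("Best epoch:", best)            # 3
```

With `restore_best_weights=True`, the weights from the best epoch (epoch 3 in this toy run) would be the ones kept after training stops.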

In [None]:
# Compile the model
model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
    metrics=['accuracy']
)

In [None]:
# Define callbacks
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_rnn_model.h5', monitor='val_accuracy', save_best_only=True)

In [None]:
# Train the model
history = model.fit(
    x_train_final, y_train_final,
    epochs=20,
    batch_size=32,
    validation_data=(x_val, y_val),
    callbacks=[early_stop, model_checkpoint]
)

### Visualize Training and Validation Metrics
**Accuracy:**
The training and validation accuracy curves show how the model's performance changes over epochs. In the run reported here, both curves indicate steady learning; the gap between training accuracy (~0.93) and validation accuracy (~0.86) points to some overfitting, which early stopping helps to limit.

In [None]:
plt.figure(figsize=(8, 5))

# Plot training and validation accuracy on a single set of axes
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.tight_layout()
plt.show()

**Loss:**
The training and validation loss curves indicate how well the model is learning.

In [None]:
plt.figure(figsize=(8, 5))

# Plot training and validation loss on a single set of axes
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

## 6. Model Evaluation

In this section, we assess the performance of our trained LSTM model on unseen test data. 

- The first code cell loads the best saved model and computes its loss and accuracy on the padded test set. 
- Next, predictions are generated using the best saved model, and these output probabilities are thresholded to obtain binary classifications.  
- The subsequent code cell prints a detailed classification report—including precision, recall, and F1-score—offering insights into the model's performance on each class. 
- Finally, a confusion matrix is plotted as a heatmap to clearly visualize the model’s true positives, false negatives, and overall prediction accuracy.
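
The relationship between the confusion matrix and the metrics in the classification report can be sketched as follows. The counts below are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute metrics for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp)            # of predicted positives, how many are correct
    recall = tp / (tp + fn)               # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts in the order scikit-learn uses for cm.ravel(): tn, fp, fn, tp
metrics = binary_metrics(tn=40, fp=10, fn=5, tp=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a binary problem, `confusion_matrix(y_true, y_pred).ravel()` returns the four counts in the same `tn, fp, fn, tp` order used here.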

The trained model achieves the following performance on the IMDB test set:

- **Training Accuracy:** ~0.93
- **Validation Accuracy:** ~0.86
- **Test Accuracy:** ~0.86
- **Test Loss:** ~0.35
- **Precision/Recall/F1:** ~0.86 (balanced)

In [None]:
# Load the best model
best_model = tf.keras.models.load_model('best_rnn_model.h5')

# Evaluate on the test set
test_loss, test_accuracy = best_model.evaluate(x_test_padded, y_test, verbose=0)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

In [None]:
# Generate predictions
y_pred = best_model.predict(x_test_padded)
y_pred_binary = (y_pred > 0.5).astype(int).ravel()  # Threshold probabilities and flatten to a 1-D array of 0/1 labels

In [None]:
# Print per-class precision, recall, and F1-score (label 0 = negative, 1 = positive)
print("Classification Report:")
print(classification_report(y_test, y_pred_binary, target_names=['Negative', 'Positive']))

In [None]:
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred_binary)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.tight_layout()
plt.show()

## 7. Prediction
In this section, we select a random review from the test set and use our best saved model to predict its sentiment. A prediction probability above 0.5 indicates a positive review, while a probability at or below 0.5 indicates a negative review.
The code cell below randomly chooses a review from the padded test data, makes a prediction, and prints the review text, the predicted sentiment, and the associated probability.


In [None]:
# Build a reverse mapping from word indices back to words
def decode_review(encoded_review, word_index):
    # Indices 0-3 are reserved, so actual word indices are shifted by 3
    index_word = {v + 3: k for k, v in word_index.items()}
    index_word[0] = "<PAD>"
    index_word[1] = "<START>"
    index_word[2] = "<UNK>"
    index_word[3] = "<UNUSED>"
    return ' '.join([index_word.get(i, '?') for i in encoded_review])

# Load the word index
word_index = imdb.get_word_index()

# Select a random index from the padded test set (x_test_padded is available from preprocessing)
random_index = np.random.randint(0, len(x_test_padded))
print("Random review index:", random_index)

# Retrieve the corresponding padded review (ensuring shape is (1, max_length) for prediction)
random_review = x_test_padded[random_index:random_index+1]

# Use the previously loaded best model to predict the sentiment of the selected review
prediction_probability = best_model.predict(random_review)[0][0]
predicted_sentiment = "Positive" if prediction_probability > 0.5 else "Negative"

# Decode the original (unpadded) review to text
original_review = x_test[random_index]
decoded_review = decode_review(original_review, word_index)

# Display the review text, prediction probability, and the resulting sentiment classification
print("Review text:")
print(decoded_review)
print("\nPrediction probability: {:.4f}".format(prediction_probability))
print("Predicted Sentiment:", predicted_sentiment)
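
The `v + 3` offset used in `decode_review` can be checked on a toy vocabulary. In this sketch, `toy_word_index` and the encoded review are made up; the only assumption carried over from the dataset is that indices 0, 1, and 2 are reserved for padding, the start marker, and unknown words, with real words shifted up by 3.

```python
# Made-up word index in the same style as imdb.get_word_index(): word -> frequency rank
toy_word_index = {"the": 1, "movie": 2, "great": 3}

def build_index_to_word(word_index, index_from=3):
    """Invert the word index, shifting ranks past the reserved indices."""
    index_to_word = {rank + index_from: word for word, rank in word_index.items()}
    index_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})
    return index_to_word

index_to_word = build_index_to_word(toy_word_index)

encoded = [1, 4, 5, 2, 6]   # start marker, "the", "movie", an out-of-vocabulary word, "great"
decoded = " ".join(index_to_word.get(i, "?") for i in encoded)
print(decoded)
# <START> the movie <UNK> great
```

Words outside the top `vocab_size` entries show up as `<UNK>` in decoded reviews, since they were replaced by index 2 when the data was loaded.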

## 8. Conclusion

In this notebook, we built and trained a Bidirectional LSTM network for sentiment analysis using the IMDB dataset. The model reached a test accuracy of about 0.86, demonstrating the effectiveness of RNNs for sequence classification tasks and the importance of careful model design and training. We covered data loading, preprocessing, model building, training, evaluation, and prediction.

**Potential Improvements and Future Work:**

*   **Hyperparameter Tuning:** Experiment with different values for `vocab_size`, `max_length`, `embedding_dim`, the number of LSTM units, and the dropout rates to optimize model performance. Techniques like grid search or random search can be used.
*   **Different RNN Architectures:** Explore other RNN architectures, such as GRUs (Gated Recurrent Units), which are often faster to train than LSTMs and can achieve comparable performance.
*   **Pre-trained Word Embeddings:** Utilize pre-trained word embeddings like Word2Vec or GloVe instead of learning embeddings from scratch.  This can improve performance, especially with limited training data.
*   **Attention Mechanisms:** Incorporate attention mechanisms to allow the model to focus on the most relevant parts of the input sequence.
*   **Deeper Models:** Experiment with stacking multiple LSTM layers to create a deeper model, although this can increase training time and complexity.
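
As a rough illustration of the attention idea listed above, the NumPy sketch below pools a sequence of hidden states into one context vector using softmax weights. Everything here is an assumption made for the example: the function `attention_pool`, the dot-product scoring, and the random `query` vector, which in a real model would be learned during training.

```python
import numpy as np

def attention_pool(hidden_states, query):
    """Weight each time step by its similarity to `query` and return the weighted sum.

    hidden_states: (time_steps, hidden_dim), query: (hidden_dim,)
    """
    scores = hidden_states @ query                 # one relevance score per time step
    scores = scores - scores.max()                 # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
    context = weights @ hidden_states              # weighted average of the hidden states
    return context, weights

rng = np.random.default_rng(2)
hidden_states = rng.normal(size=(6, 8))   # toy outputs of an RNN over 6 time steps
query = rng.normal(size=8)                # stands in for a learned vector

context, weights = attention_pool(hidden_states, query)
print(context.shape, weights.round(3))
```

Instead of relying only on the final hidden state, a classifier built this way would see a summary in which the most relevant time steps carry the most weight.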

## 9. References

*   **Keras:** [https://keras.io/](https://keras.io/)
*   **TensorFlow:** [https://www.tensorflow.org/](https://www.tensorflow.org/)
*   **IMDB Dataset:** [https://keras.io/api/datasets/imdb/](https://keras.io/api/datasets/imdb/)
*   **scikit-learn:** [https://scikit-learn.org/stable/](https://scikit-learn.org/stable/)