# Understanding Bidirectional LSTMs (BiLSTMs) 🧠

This notebook explains the concept of Bidirectional Long Short-Term Memory (BiLSTM) networks and provides a practical implementation using TensorFlow and Keras.

## What is an LSTM?

A standard **Long Short-Term Memory (LSTM)** network is a type of Recurrent Neural Network (RNN) designed for sequential data such as text or time series. It processes a sequence one step at a time, carrying a hidden state and a gated cell state forward so that information from earlier steps can influence later ones.

**Limitation:** A standard LSTM only looks at the past. When reading a sentence, it processes words from left to right. At any given word, it only knows about the words that came *before* it.

## Why do we need Bidirectionality?

In many tasks, especially in Natural Language Processing (NLP), context from both the past **and** the future is crucial for understanding.

Consider the sentence: "The **bank** of the river was muddy."

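To understand that "bank" refers to a riverside and not a financial institution, you need to see the word "river", which comes *after* it. A left-to-right LSTM has not yet read "river" when it computes its representation of "bank", so that representation cannot use the disambiguating word.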

This is where **Bidirectional LSTMs** come in!

## How does a BiLSTM work?

A BiLSTM is simple in concept: it's two LSTMs working together.

1.  **Forward LSTM:** One LSTM processes the sequence from left-to-right (from beginning to end).
2.  **Backward LSTM:** A second, separate LSTM processes the sequence from right-to-left (from end to beginning).

At each time step (i.e., for each word), the outputs from both the forward and backward LSTMs are concatenated. This combined output gives the model a rich representation of each word, informed by both its preceding (past) and succeeding (future) context.

![BiLSTM Architecture](https://i.imgur.com/h532nL4.png)
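
The two-pass idea can be sketched in a few lines of NumPy. To keep the example short, this sketch swaps the LSTM cell for a plain `tanh` recurrent cell; the bidirectional bookkeeping (reverse, run, flip back, concatenate) is the same. The helper `run_rnn`, the random weights, and the sizes are all assumptions made for illustration.

```python
import numpy as np

def run_rnn(inputs, w_in, w_rec):
    """Run a plain tanh RNN over `inputs` (time x features); return every hidden state."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_in + hidden @ w_rec)
        states.append(hidden)
    return np.stack(states)  # shape: (time, units)

rng = np.random.default_rng(0)
time_steps, features, units = 5, 3, 4
sequence = rng.normal(size=(time_steps, features))

# Each direction has its own, independent weights.
fwd_w_in, fwd_w_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))
bwd_w_in, bwd_w_rec = rng.normal(size=(features, units)), rng.normal(size=(units, units))

forward_states = run_rnn(sequence, fwd_w_in, fwd_w_rec)
# Backward pass: feed the reversed sequence, then flip the states back so that
# row t of both arrays refers to the same input position.
backward_states = run_rnn(sequence[::-1], bwd_w_in, bwd_w_rec)[::-1]

combined = np.concatenate([forward_states, backward_states], axis=-1)
print(combined.shape)  # (5, 8)
```

Row `t` of `combined` holds a forward half that has seen positions `0..t` and a backward half that has seen positions `t..end`.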


---
## 1. Setup and Imports

Let's start by importing the necessary libraries from TensorFlow and Keras.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

---
## 2. Data Loading and Preprocessing

We will use the **IMDB movie review dataset**, a classic dataset for binary sentiment classification (positive/negative).

We will perform two key preprocessing steps:
1.  **Vocabulary Limit:** We'll only consider the top 10,000 most frequent words to keep our model's vocabulary manageable.
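2.  **Padding:** Batched training requires every input in a batch to have the same length. We will pad or truncate all movie reviews to exactly 200 tokens. By default, `pad_sequences` adds zeros at the start of short reviews and drops tokens from the start of long ones, so the end of each review is what gets kept.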
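
Both steps can be sketched in plain Python to make their effect concrete. The helpers `pad_or_truncate` and `cap_vocabulary` are hypothetical names written for this illustration; they imitate the default behaviour of `pad_sequences` (pad and truncate at the front) and of `imdb.load_data(num_words=...)` (out-of-range indices become the out-of-vocabulary index 2), and are not the library implementations.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Imitate pad_sequences defaults: keep the last `maxlen` items, left-pad with zeros."""
    trimmed = sequence[-maxlen:]                     # truncating='pre' drops items from the front
    padding = [pad_value] * (maxlen - len(trimmed))
    return padding + trimmed                         # padding='pre' puts zeros at the front

def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace indices outside the top `num_words` with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]

print(pad_or_truncate([5, 6, 7], maxlen=5))                   # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))          # [3, 4, 5, 6]
print(cap_vocabulary([1, 14, 20000, 9999], num_words=10000))  # [1, 14, 2, 9999]
```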

In [None]:
# --- Parameters ---
max_features = 10000  # Number of words to consider as features (vocabulary size)
maxlen = 200        # Max length of sequences (reviews)

# --- Load the data ---
print("Loading data...")
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

# --- Pad sequences ---
# This ensures all sequences in a list have the same length.
print("\nPad sequences (samples x time)")
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences

Pad sequences (samples x time)
x_train shape: (25000, 200)
x_test shape: (25000, 200)


---
## 3. Building the BiLSTM Model

Now, let's define our model architecture. It will have three main layers:

1.  `Embedding` Layer: This layer takes the integer-encoded vocabulary and maps each word index to a dense vector of a specified size (in our case, 128). This helps the model learn relationships between words.

2.  `Bidirectional(LSTM(...))` Layer: This is the core of our model. We wrap a standard `LSTM` layer inside a `Bidirectional` wrapper. Keras handles the creation of the forward and backward LSTMs automatically. Because `return_sequences` is left at its default of `False`, each direction returns only its final output, and the wrapper concatenates those two vectors into a single feature vector per review.

3.  `Dense` Layer: A standard fully connected neural network layer with a `sigmoid` activation function to output a probability between 0 (negative sentiment) and 1 (positive sentiment).
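
The output shape of the `Bidirectional` wrapper depends on two settings that the model in this section leaves at their defaults: `return_sequences` on the inner LSTM and `merge_mode` on the wrapper. A minimal sketch on random dummy data, with small layer sizes chosen only for illustration:

```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Bidirectional

# Dummy batch: (samples, time steps, features)
batch = tf.random.normal((2, 10, 8))

# Default: only the final output of each direction, concatenated.
last_only = Bidirectional(LSTM(4))(batch)
# One concatenated vector per time step.
per_step = Bidirectional(LSTM(4, return_sequences=True))(batch)
# Element-wise sum instead of concatenation keeps the width at 4.
summed = Bidirectional(LSTM(4, return_sequences=True), merge_mode="sum")(batch)

print(last_only.shape)  # (2, 8)
print(per_step.shape)   # (2, 10, 8)
print(summed.shape)     # (2, 10, 4)
```

Per-step outputs are what a sequence-labelling task would need; for whole-review classification, as here, the default single vector is sufficient.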

In [None]:
# --- Model Definition ---
model = Sequential()

# 1. Embedding Layer
# It turns positive integers (indexes) into dense vectors of fixed size.
model.add(Embedding(max_features, 128))

# 2. Bidirectional LSTM Layer
# We wrap our LSTM layer in a Bidirectional wrapper.
# The LSTM has 64 units (a hyperparameter you can tune).
model.add(Bidirectional(LSTM(64)))

# 3. Output Layer
# A Dense layer with 1 neuron and sigmoid activation for binary classification.
model.add(Dense(1, activation='sigmoid'))

# --- Compile the model ---
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# --- Print model summary ---
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         1280000   
                                                                 
 bidirectional (Bidirectiona  (None, 128)              98816     
 l)                                                              
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,378,945
Trainable params: 1,378,945
Non-trainable params: 0
_________________________________________________________________


Notice the output shape of the `bidirectional` layer is `(None, 128)`. This is because the forward LSTM outputs 64 features and the backward LSTM outputs 64 features. By default, the `Bidirectional` wrapper **concatenates** them, so $64 + 64 = 128$.
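
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four weight blocks (input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias; `lstm_param_count` is a helper name introduced here for illustration.

```python
def lstm_param_count(input_dim, units):
    """Parameters of one LSTM: 4 blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

vocab_size, embedding_dim, lstm_units = 10000, 128, 64

embedding_params = vocab_size * embedding_dim
bilstm_params = 2 * lstm_param_count(embedding_dim, lstm_units)  # forward + backward
dense_params = 2 * lstm_units * 1 + 1                            # 128 weights + 1 bias

print(embedding_params, bilstm_params, dense_params)    # 1280000 98816 129
print(embedding_params + bilstm_params + dense_params)  # 1378945
```

The bidirectional layer has exactly twice the parameters of a single `LSTM(64)`, since the two directions do not share weights.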

---
## 4. Training the Model

Let's train our model. We'll use a `batch_size` of 32 and train for 5 epochs. We'll also use 20% of our training data for validation during training to monitor performance on unseen data.
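
It helps to know what `validation_split=0.2` does to the data before reading the training log. Keras takes the validation samples from the end of the arrays, before any shuffling. The arithmetic below is a plain-Python sketch of that split for this dataset, not the Keras code itself.

```python
import math

num_samples, validation_split, batch_size = 25000, 0.2, 32

num_val = round(num_samples * validation_split)
num_train = num_samples - num_val

indices = list(range(num_samples))
# The validation set is the tail of the data; the head is used for training.
train_idx, val_idx = indices[:num_train], indices[num_train:]

steps_per_epoch = math.ceil(num_train / batch_size)
print(num_train, num_val, steps_per_epoch)  # 20000 5000 625
print(val_idx[0], val_idx[-1])              # 20000 24999
```

Because the split is positional, data that is sorted by label should be shuffled before calling `fit` with `validation_split`.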

In [None]:
print("Training model...")
history = model.fit(x_train,
                    y_train,
                    batch_size=32,
                    epochs=5,
                    validation_split=0.2)

Training model...
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


---
## 5. Evaluating the Model

Now that the model is trained, let's evaluate its performance on the hold-out test set.

In [None]:
loss, acc = model.evaluate(x_test, y_test, batch_size=32)
print(f"Test loss: {loss:.4f}")
print(f"Test accuracy: {acc:.4f}")

Test loss: 0.5750
Test accuracy: 0.8465
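
The first value returned by `model.evaluate` is the binary cross-entropy loss and the second is the accuracy metric. The sketch below shows how each is computed from predicted probabilities. The labels and probabilities are made-up toy values, not results from this model, and the function names are chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]           # toy values, not taken from the IMDB run
probs = [0.9, 0.2, 0.4, 0.01]

print(round(binary_crossentropy(labels, probs), 4))
print(binary_accuracy(labels, probs))  # 0.75
```

Accuracy only checks which side of 0.5 a prediction falls on, while the loss also penalises confident mistakes, so a model can combine good accuracy with a fairly high loss.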


---
## 6. Making Predictions on New Data

Let's write a simple function to see our model in action. This function will take a raw text sentence, preprocess it in the same way as our training data, and predict the sentiment.

In [None]:
# Get the word index from the IMDB dataset
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # Unknown
word_index["<UNUSED>"] = 3

def predict_sentiment(text):
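    # Preprocess the text
    # 1. Tokenize: Convert text to a sequence of integers
    words = text.lower().split()
    sequence = []
    for word in words:
        index = word_index.get(word, 2)  # Use 2 for unknown words
        # Words outside the top `max_features` were replaced by <UNK> during
        # training, and their indices would be out of range for the Embedding layer
        sequence.append(index if index < max_features else 2)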

    # 2. Pad the sequence
    padded_sequence = pad_sequences([sequence], maxlen=maxlen)

    # 3. Predict
    prediction = model.predict(padded_sequence)
    probability = prediction[0][0]

    print(f"\nReview: '{text}'")
    print(f"Positive Sentiment Probability: {probability:.4f}")

    if probability > 0.5:
        print("Result: Positive sentiment 😊")
    else:
        print("Result: Negative sentiment 😞")

# --- Test with some reviews ---
predict_sentiment("This was an absolutely fantastic film with brilliant acting and a great plot")
predict_sentiment("The movie was a complete waste of time boring and predictable")

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Review: 'This was an absolutely fantastic film with brilliant acting and a great plot'
Positive Sentiment Probability: 0.9818
Result: Positive sentiment 😊

Review: 'The movie was a complete waste of time boring and predictable'
Positive Sentiment Probability: 0.0076
Result: Negative sentiment 😞
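
To see how raw text maps to the integer format of the IMDB data, here is a self-contained sketch with a hypothetical five-word vocabulary standing in for the dictionary returned by `imdb.get_word_index()`. It uses the default offsets of `imdb.load_data` (start marker 1, unknown marker 2, word ranks shifted by 3). Stripping punctuation and prepending the start marker are assumptions of this sketch, and `encode` is a name introduced here.

```python
import string

# Hypothetical vocabulary, ranked by frequency (1 = most frequent).
toy_rank = {"the": 1, "movie": 2, "was": 3, "great": 4, "zeitgeist": 5}

INDEX_FROM, START, UNKNOWN = 3, 1, 2  # defaults used by imdb.load_data

def encode(text, rank, num_words):
    """Turn raw text into a list of integer word indices."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    encoded = [START]
    for word in cleaned.split():
        # Unseen words and words beyond the vocabulary cap both become UNKNOWN.
        if word in rank and rank[word] + INDEX_FROM < num_words:
            encoded.append(rank[word] + INDEX_FROM)
        else:
            encoded.append(UNKNOWN)
    return encoded

print(encode("The movie was great!", toy_rank, num_words=8))      # [1, 4, 5, 6, 7]
print(encode("The zeitgeist was awful.", toy_rank, num_words=8))  # [1, 4, 2, 6, 2]
```

In the second call, "zeitgeist" is in the vocabulary but falls outside the cap of 8, and "awful" is not in the vocabulary at all; both end up as the unknown index.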


---
## Conclusion

In this notebook, we learned that:

-   Standard LSTMs process sequences in one direction (past to present).
-   This limits their ability to capture context, as the meaning of a word often depends on words that come after it.
-   **Bidirectional LSTMs** solve this by using two LSTMs—one processing the sequence forward and one backward.
-   The outputs are concatenated, providing a much richer contextual representation for each element in the sequence.
-   Implementing a BiLSTM in Keras is straightforward using the `Bidirectional` layer wrapper.

For NLP tasks where the whole input is available before a prediction is made, such as sentiment analysis, named entity recognition, and the encoder side of machine translation, BiLSTMs often outperform their unidirectional counterparts.