# Recurrent Neural Networks for Sequence Labeling


You are a developer at a FinTech company that needs to automatically extract key pieces of information (like names, organizations, and dates) from financial news articles. This task, known as Named Entity Recognition (NER), requires the model to process a sequence of words and assign a specific label to each word.
Since sentences have varying lengths and meaning depends on context (the sequence), a standard Feed-Forward Network is insufficient. Your task is to implement and compare different types of Recurrent Neural Networks (RNNs, GRUs, LSTMs) to solve this sequence labeling problem.

## Tasks:

In [1]:
!pip install gensim nltk scikit-learn matplotlib

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [2]:
# word2Vec Model
import nltk
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
nltk.download('brown')  # Brown corpus for demo
nltk.download('punkt')  # Tokenizer

# Load and preprocess data
sentences = brown.sents()  # Get sentences from the Brown corpus
processed_sentences = [simple_preprocess(" ".join(sent)) for sent in sentences]

# printing
print(f"processed_sentences = {processed_sentences[:10]}, length = {len(processed_sentences)}")



# Train Word2Vec model
model = Word2Vec(
    sentences=processed_sentences,
    vector_size=100,  # Dimensionality of word vectors
    window=5,         # Context window size
    min_count=2,      # Minimum word frequency
    workers=4,        # Number of threads
    sg=0              # CBOW (0) or Skip-gram (1)
)

# Save and load the model if needed
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


processed_sentences = [['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', 'atlanta', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place'], ['the', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had', 'over', 'all', 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted'], ['the', 'september', 'october', 'term', 'jury', 'had', 'been', 'charged', 'by', 'fulton', 'superior', 'court', 'judge', 'durwood', 'pye', 'to', 'investigate', 'reports', 'of', 'possible', 'irregularities', 'in', 'the', 'hard', 'fought', 'primary', 'which', 'was', 'won', 'by', 'mayor', 'nominate', 'ivan', 'allen', 'jr'], ['only', 'relative', 'handful', 'of', 'such', 'reports', 'was', 'received', 'the', 'jury

In [3]:
# finding Similar words
similar_words = model.wv.most_similar("learning", topn=5)
print("Words similar to 'learning':", similar_words)

Words similar to 'learning': [('wisdom', 0.9754779934883118), ('conduct', 0.9709805846214294), ('enjoyment', 0.9703553318977356), ('principles', 0.9697201251983643), ('roles', 0.9685952663421631)]


In [4]:
# Export only the word vectors in the plain-text word2vec format
model.wv.save_word2vec_format("word2vec_vectors.txt", binary=False)

In [5]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)

print("Libraries imported. PyTorch Version:", torch.__version__)

Libraries imported. PyTorch Version: 2.9.0+cu128


### Task 1: Data Preparation and Embedding Layer

1. **Tokenization and Mapping:** Load the NER dataset and create two dictionaries: a `word-to-index` map for the tokens, and a `tag-to-index` map for the labels (B-PER, I-PER, O, etc.).
2. **Padding:** Since sentences have different lengths, implement a padding mechanism to ensure all input sequences in a batch have the same length.
3. **Embedding:** Create a torch.nn.Embedding layer that will map the integer indices (from step 1) to dense, continuous word vectors (e.g., dimension 100).

In [7]:
# Task-1
# Synthetic Data:
# Data Format: (Sentence, Tags)
# Tags: B- (Beginning), I- (Inside), O (Outside)
training_data = [
    ("Elon Musk is the CEO of Tesla".split(), ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG"]),
    ("Apple released a new iPhone on Monday".split(), ["B-ORG", "O", "O", "O", "O", "O", "B-DATE"]),
    ("JP Morgan Chase reported strong profits".split(), ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O"]),
    ("Stocks fell on Tuesday after the announcement".split(), ["O", "O", "O", "B-DATE", "O", "O", "O"]),
    ("Amazon plans to hire more engineers in 2024".split(), ["B-ORG", "O", "O", "O", "O", "O", "O", "B-DATE"]),
    ("Satya Nadella spoke about AI at Microsoft".split(), ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG"]),
]

# Print a sample to check structure
print(f"Sample Text: {training_data[0][0]}")
print(f"Sample Tags: {training_data[0][1]}")

Sample Text: ['Elon', 'Musk', 'is', 'the', 'CEO', 'of', 'Tesla']
Sample Tags: ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-ORG']


In [8]:
# Tokenization and Mapping
word_to_index = {"<PAD>": 0}
tags = {"<PAD>": 0}

for sentence, tag_sequence in training_data:
    for word in sentence:
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)
    for tag in tag_sequence:
        if tag not in tags:
            tags[tag] = len(tags)

tag_to_index = tags
index_to_tag = {idx: tag for tag, idx in tag_to_index.items()}

print(f"Vocabulary Size: {len(word_to_index)}")
print(f"Tags: {tag_to_index}")

Vocabulary Size: 41
Tags: {'<PAD>': 0, 'B-PER': 1, 'I-PER': 2, 'O': 3, 'B-ORG': 4, 'B-DATE': 5, 'I-ORG': 6}


In [10]:
# Padding:

from torch.nn.utils.rnn import pad_sequence
import torch

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Convert all data to tensors
X = [prepare_sequence(s[0], word_to_index) for s in training_data]
y = [prepare_sequence(s[1], tag_to_index) for s in training_data]

# Pad sequences to the max length in the batch
# batch_first=True makes the output shape (Batch_Size, Seq_Len)
X_padded = pad_sequence(X, batch_first=True, padding_value=word_to_index["<PAD>"])
y_padded = pad_sequence(y, batch_first=True, padding_value=tag_to_index["<PAD>"])

print("Shape of Input Tensor:", X_padded.shape)
print("Shape of Target Tensor:", y_padded.shape)

Shape of Input Tensor: torch.Size([6, 8])
Shape of Target Tensor: torch.Size([6, 8])
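
`prepare_sequence` above raises a `KeyError` for any word that was not seen while building `word_to_index`, which will happen as soon as the model meets new text. One common workaround, sketched below, reserves an extra `<UNK>` index for unknown words. The `<UNK>` token and the helper names `build_vocab` and `encode` are assumptions made for illustration; they are not part of the notebook.

```python
def build_vocab(sentences, pad_token="<PAD>", unk_token="<UNK>"):
    """Map every training word to an index, reserving 0 for padding and 1 for unknowns."""
    vocab = {pad_token: 0, unk_token: 1}
    for sentence in sentences:
        for word in sentence:
            # setdefault only assigns a new index the first time a word is seen
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sequence, vocab, unk_token="<UNK>"):
    """Convert words to indices, falling back to the <UNK> index for unseen words."""
    unk_index = vocab[unk_token]
    return [vocab.get(word, unk_index) for word in sequence]


vocab = build_vocab(["Elon Musk is the CEO of Tesla".split()])
print(encode(["Tesla", "hired", "Elon"], vocab))  # "hired" is unseen
```

Because every unseen word shares one index, the `<UNK>` embedding only becomes useful if some training words are also mapped to it (for example, words below a frequency threshold).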


In [11]:
# Embedding
# Hyperparameters
VOCAB_SIZE = len(word_to_index)
EMBEDDING_DIM = 100  # As requested

# 1. Define the Embedding Layer
embedding_layer = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM, padding_idx=word_to_index["<PAD>"])

# 2. Forward pass (lookup)
input_indices = X_padded # Use the padded input from the previous step
dense_vectors = embedding_layer(input_indices)

print(f"\nInput Indices Shape: {input_indices.shape}")
print(f"Dense Vectors Shape: {dense_vectors.shape}")

# 3. Verification
pad_vector = dense_vectors[0][-1]
print(f"\nVector for PAD token (Sum of values): {pad_vector.sum().item()}")


Input Indices Shape: torch.Size([6, 8])
Dense Vectors Shape: torch.Size([6, 8, 100])

Vector for PAD token (Sum of values): 0.0
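
The embedding layer above starts from random weights, while the Word2Vec model trained earlier is never used. A minimal sketch of how pretrained vectors could be copied into an `nn.Embedding` follows. The helper `build_embedding_matrix` and the toy data are hypothetical; `pretrained` is assumed to be any object supporting `in` and `[]` lookups (a plain dict here). Keys are lower-cased because `simple_preprocess` lower-cases the Brown corpus.

```python
import numpy as np
import torch
import torch.nn as nn


def build_embedding_matrix(word_to_index, pretrained, dim, pad_token="<PAD>"):
    """Return a (vocab_size, dim) weight tensor and the number of pretrained hits."""
    rng = np.random.default_rng(0)
    # Words without a pretrained vector keep a small random initialisation
    matrix = rng.normal(scale=0.1, size=(len(word_to_index), dim)).astype(np.float32)
    matrix[word_to_index[pad_token]] = 0.0  # keep the padding row at zero
    hits = 0
    for word, idx in word_to_index.items():
        key = word.lower()
        if key in pretrained:
            matrix[idx] = pretrained[key]
            hits += 1
    return torch.from_numpy(matrix), hits


# Toy stand-ins for word_to_index and for the pretrained vectors
toy_vocab = {"<PAD>": 0, "Apple": 1, "released": 2, "iPhone": 3}
toy_vectors = {
    "apple": np.ones(5, dtype=np.float32),
    "released": np.full(5, 2.0, dtype=np.float32),
}

weights, hits = build_embedding_matrix(toy_vocab, toy_vectors, dim=5)
# freeze=False lets the vectors be fine-tuned during NER training
embedding = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)
print(f"{hits} of {len(toy_vocab)} words initialised from pretrained vectors")
```

With a corpus as small as Brown, many financial terms and names will have no pretrained vector, so the hit count is worth checking before relying on this initialisation.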


### Task 2: Building the LSTM Sequence Labeler
Implement a PyTorch class LSTMNERSegmenter that utilizes an LSTM for the sequence labeling task.

1. `Initialization (__init__)`:
   * Define the nn.Embedding layer (from Task 1).
   * Define a Bidirectional LSTM layer (nn.LSTM):
     * Input Size: Embedding dimension.
     * Hidden Size: Choose a reasonable hidden dimension (e.g., 256).
     * bidirectional=True: Essential for sequence labeling.
   * Define a final Linear Layer (nn.Linear) that maps the output of the Bi-LSTM (which has `2 * hidden_dim` features per token, because the forward and backward hidden states are concatenated) to the number of unique tags/labels.

2. `Forward Pass (forward)`:
   * Pass the input indices through the embedding layer.
   * Pass the embeddings into the Bi-LSTM.
   * Pass the LSTM output through the final linear layer.

In [12]:
# Task-2
class LSTMNERSegmenter(nn.Module):
    def __init__(self, vocab_size, tagset_size, embedding_dim, hidden_dim=256):
        super(LSTMNERSegmenter, self).__init__()
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)

        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

    def forward(self, x):
        embeddings = self.embedding(x)

        lstm_out, _ = self.lstm(embeddings)

        tag_space = self.hidden2tag(lstm_out)

        return tag_space

In [13]:
# Instantiation

# Hyperparameters
VOCAB_SIZE = len(word_to_index)
TAGSET_SIZE = len(tag_to_index)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256

# Initialize the model
model = LSTMNERSegmenter(VOCAB_SIZE, TAGSET_SIZE, EMBEDDING_DIM, HIDDEN_DIM)

# Check architecture
print(model)


LSTMNERSegmenter(
  (embedding): Embedding(41, 100, padding_idx=0)
  (lstm): LSTM(100, 256, batch_first=True, bidirectional=True)
  (hidden2tag): Linear(in_features=512, out_features=7, bias=True)
)
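
The model above feeds padded positions through the LSTM like real tokens, so the backward direction reads the `<PAD>` embeddings before it reaches the actual words. One way to avoid this, sketched here under the assumption that index 0 is the padding index, is to pack the batch with `pack_padded_sequence`. The class name `PackedLSTMTagger` is hypothetical and this variant is not the notebook's model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class PackedLSTMTagger(nn.Module):
    """Bi-LSTM tagger that skips padded time steps (illustrative variant)."""

    def __init__(self, vocab_size, tagset_size, embedding_dim=100, hidden_dim=256, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

    def forward(self, x):
        # True sentence lengths, required on the CPU by pack_padded_sequence
        lengths = (x != self.pad_idx).sum(dim=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        packed_out, _ = self.lstm(packed)
        # Restore the (batch, seq_len, 2 * hidden_dim) layout; padded steps become zeros
        lstm_out, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=x.size(1))
        return self.hidden2tag(lstm_out)


torch.manual_seed(0)
tagger = PackedLSTMTagger(vocab_size=10, tagset_size=7)
batch = torch.tensor([[1, 2, 3, 4], [5, 6, 0, 0]])  # second sentence has two pads
with torch.no_grad():
    logits = tagger(batch)
    alone = tagger(torch.tensor([[5, 6]]))  # same sentence without padding
print(logits.shape)
```

With packing, the logits for the real tokens of a padded sentence should match those obtained when the sentence is run on its own. Note that a row consisting only of padding has length zero and would raise an error in `pack_padded_sequence`.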


### Task 3: Training and Evaluation

1. Define the `Loss Function` (e.g., nn.CrossEntropyLoss) and the Optimizer (e.g., optim.Adam).
2. **Implement the training loop:** iterate through batches of the dataset, perform the forward pass, calculate the loss, perform backpropagation (loss.backward()), and update the weights (optimizer.step()).
3. **Evaluation:** Calculate the `token-level` accuracy (the percentage of words correctly tagged). Briefly discuss why the more robust `F1-Score` is preferred over simple accuracy for the NER task, which involves class imbalance (many 'O' tags).
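
The training cell in this notebook passes all six sentences through the model as a single batch. For a larger dataset, the batching asked for in step 2 could be sketched with a `DataLoader` and a custom collate function that pads each mini-batch separately. The name `collate_batch`, the toy index tensors, and the assumption that index 0 is the padding index for both words and tags are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 0  # assumed padding index for both words and tags


def collate_batch(batch):
    """Pad every sentence in a mini-batch to that batch's longest sentence."""
    xs, ys = zip(*batch)
    x_padded = pad_sequence(xs, batch_first=True, padding_value=PAD_IDX)
    y_padded = pad_sequence(ys, batch_first=True, padding_value=PAD_IDX)
    return x_padded, y_padded


# Toy (word indices, tag indices) pairs standing in for the encoded sentences
toy_dataset = [
    (torch.tensor([1, 2, 3]), torch.tensor([1, 2, 3])),
    (torch.tensor([4, 5, 6, 7, 8]), torch.tensor([3, 3, 3, 3, 4])),
    (torch.tensor([9, 10]), torch.tensor([4, 3])),
]

loader = DataLoader(toy_dataset, batch_size=2, shuffle=False, collate_fn=collate_batch)
batch_shapes = [tuple(x_batch.shape) for x_batch, _ in loader]
print(batch_shapes)  # [(2, 5), (1, 2)]
```

Inside the training loop, each `(x_batch, y_batch)` pair from the loader would go through the usual `zero_grad`, forward, loss, `backward`, `step` sequence; setting `shuffle=True` is the usual choice for training.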


In [16]:
# Task-3
import torch.optim as optim
import torch.nn as nn

# 1. Hyperparameters
LEARNING_RATE = 0.01
EPOCHS = 10

# 2. Initialize Model, Loss, and Optimizer
model = LSTMNERSegmenter(VOCAB_SIZE, TAGSET_SIZE, EMBEDDING_DIM, HIDDEN_DIM)
PAD_IDX = tag_to_index["<PAD>"] # Get the index for the padding token
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX) # Ignore padding in loss calculation
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# 3. Helper Function for Accuracy
def categorical_accuracy(preds, y, tag_pad_idx=0):
    """Token-level accuracy, ignoring positions where the target is padding."""
    max_preds = preds.argmax(dim=2)   # (batch, seq_len) predicted tag indices
    mask = y != tag_pad_idx           # True for real (non-padding) tokens
    correct = (max_preds[mask] == y[mask]).sum()

    # Divide by the number of real tokens; both tensors live on y's device
    return correct.float() / mask.sum()

In [17]:
# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
X_padded = X_padded.to(device)
y_padded = y_padded.to(device)
criterion = criterion.to(device)

print(f"Training on {device}...")

for epoch in range(EPOCHS):
    model.train() # Set model to training mode
    optimizer.zero_grad() # Zero the gradients before running the backward pass

    # Forward pass: logits have shape (N, L, C) where N=batch, L=seq_len, C=num_classes
    logits = model(X_padded)

    # Flatten to (N * L, C) and (N * L), the shapes CrossEntropyLoss expects
    loss = criterion(logits.view(-1, logits.shape[-1]), y_padded.view(-1))

    # Token-level accuracy on the same logits (detached, so no second forward pass)
    acc = categorical_accuracy(logits.detach(), y_padded, PAD_IDX)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 2 == 0:
        print(f"Epoch: {epoch+1:02} | Loss: {loss.item():.4f} | Accuracy: {acc.item()*100:.2f}%")

Training on cuda...
Epoch: 02 | Loss: 1.0412 | Accuracy: 85.71%
Epoch: 04 | Loss: 0.2295 | Accuracy: 100.00%
Epoch: 06 | Loss: 0.1726 | Accuracy: 100.00%
Epoch: 08 | Loss: 0.0312 | Accuracy: 100.00%
Epoch: 10 | Loss: 0.0085 | Accuracy: 100.00%
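
Token-level accuracy can look strong even when no entity is found, because most tokens carry the `O` tag. The sketch below illustrates this with a tagger that predicts `O` for every word of one training sentence. The helper `entity_token_scores` is a hypothetical simplification that scores individual non-`O` tokens; NER is usually evaluated at the entity (span) level, which is stricter.

```python
def entity_token_scores(true_tags, pred_tags, outside_tag="O"):
    """Precision, recall and F1 over tokens whose tag is not the outside tag."""
    pairs = list(zip(true_tags, pred_tags))
    true_positive = sum(1 for t, p in pairs if t == p and t != outside_tag)
    predicted = sum(1 for _, p in pairs if p != outside_tag)
    actual = sum(1 for t, _ in pairs if t != outside_tag)
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Tags for "Stocks fell on Tuesday after the announcement"
true_tags = ["O", "O", "O", "B-DATE", "O", "O", "O"]
lazy_pred = ["O"] * len(true_tags)  # a tagger that never predicts an entity

accuracy = sum(t == p for t, p in zip(true_tags, lazy_pred)) / len(true_tags)
precision, recall, f1 = entity_token_scores(true_tags, lazy_pred)
print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")  # accuracy=0.86  f1=0.00
```

The lazy tagger is right on six of seven tokens yet has zero recall on the one token that matters, which is why F1 over entity classes is the more informative metric here. The 100% accuracy reported above is also measured on the training sentences themselves, so it says nothing about generalisation.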


### Task 4: Architecture Comparison (Theoretical)

Briefly modify your code (or simply discuss the required modifications) to compare:

* `Simple RNN vs. LSTM vs. GRU.` Explain the conceptual difference between the Cell State (in LSTM) and the simple Hidden State (in RNN/GRU).

### Task 4

#### Simple RNN vs. LSTM vs. GRU

*   **Simple RNN (Recurrent Neural Network):** The most basic recurrent architecture. It processes a sequence by passing information from one step to the next through a single hidden state. Because gradients are multiplied by the same recurrent weights at every time step, Simple RNNs suffer from the vanishing/exploding gradient problem, which makes them poor at capturing long-term dependencies in sequences.

*   **LSTM (Long Short-Term Memory):** LSTMs address the vanishing gradient problem through a more complex architecture that includes three 'gates' (input, forget, and output) and a 'cell state'. These gates regulate the flow of information into and out of the cell state, allowing the network to selectively remember or forget information over long sequences. This makes LSTMs very effective at learning long-term dependencies.

*   **GRU (Gated Recurrent Unit):** GRUs are a simpler variant of LSTMs, combining the forget and input gates into a single 'update gate' and merging the cell state and hidden state. They also have a 'reset gate'. GRUs are computationally less expensive than LSTMs and often perform similarly, especially on smaller datasets.

#### Cell State (LSTM) vs. Hidden State (RNN/GRU)

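*   **Cell State (LSTM):** The cell state in an LSTM acts as a 'memory highway' that runs through the entire chain of LSTM steps. At each step it is changed only by element-wise operations: the forget gate scales the previous cell state and the input gate adds new candidate values (`c_t = f_t * c_{t-1} + i_t * g_t`). Because this update is additive rather than a repeated matrix multiplication followed by a squashing nonlinearity, information and gradients can flow across many time steps without vanishing as quickly, which makes the cell state well suited to retaining long-term dependencies. The output gate then decides how much of the cell state is exposed as the hidden state.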

*   **Hidden State (RNN/GRU):** In Simple RNNs and GRUs, there is only a hidden state (in GRUs, it combines the functions of LSTM's hidden state and cell state). The hidden state is constantly updated at each time step based on the current input and the previous hidden state. While it carries information forward, in Simple RNNs, it is much more susceptible to vanishing gradients, leading to a loss of information about distant past inputs. GRUs mitigate this to a large extent with their gating mechanisms, making their hidden state more robust than a Simple RNN's, but it lacks the separate, less-interrupted memory pathway of LSTM's cell state.
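
To make the distinction concrete, the following sketch computes one LSTM step by hand and compares it with `nn.LSTMCell`. The function `manual_lstm_step` and the tiny dimensions are chosen for illustration; the gate order (input, forget, candidate, output) follows the layout of PyTorch's stacked weight matrices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=3)
x = torch.randn(2, 4)        # batch of 2 inputs
h_prev = torch.randn(2, 3)   # previous hidden state
c_prev = torch.randn(2, 3)   # previous cell state


def manual_lstm_step(cell, x, h_prev, c_prev):
    """One LSTM time step written out gate by gate."""
    gates = x @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_next = f * c_prev + i * g          # additive cell-state update
    h_next = o * torch.tanh(c_next)      # hidden state is a gated view of the cell state
    return h_next, c_next


with torch.no_grad():
    h_ref, c_ref = cell(x, (h_prev, c_prev))
    h_man, c_man = manual_lstm_step(cell, x, h_prev, c_prev)

print(torch.allclose(h_ref, h_man, atol=1e-6), torch.allclose(c_ref, c_man, atol=1e-6))
```

The line computing `c_next` is the 'memory highway': the previous cell state is only rescaled and added to, whereas a Simple RNN rewrites its entire hidden state through a matrix multiplication and `tanh` at every step.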

#### Bidirectional LSTM (Bi-LSTM) vs. Unidirectional LSTM for NER

A Bi-LSTM is inherently better suited than a unidirectional LSTM for sequence labeling tasks like Named Entity Recognition because NER often requires understanding context from *both* the past and the future words in a sentence to correctly classify a given word. A unidirectional LSTM processes a sequence only from left-to-right (or right-to-left), meaning its prediction for a word 'X' can only consider words that came before 'X' (or after, if processing reversed). In contrast, a Bi-LSTM consists of two LSTMs: one processing the sequence forward and another processing it backward. Their outputs are then concatenated. This allows the model to capture rich contextual information from both preceding and succeeding words, which is crucial for accurately identifying entities. For example, to classify 'Morgan' in 'JP Morgan Chase', knowing 'Chase' follows is as important as knowing 'JP' precedes it.
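
The claim can be checked empirically. In the sketch below (random inputs, tiny dimensions, all chosen for illustration), only the last token of a sequence is changed; the forward half of the Bi-LSTM output at an earlier position stays the same, while the backward half changes because it has already read the modified token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 4
lstm = nn.LSTM(input_size=3, hidden_size=hidden_dim, bidirectional=True, batch_first=True)

seq_a = torch.randn(1, 5, 3)      # one sequence of 5 "word vectors"
seq_b = seq_a.clone()
seq_b[0, 4] = torch.randn(3)      # change only the final token

with torch.no_grad():
    out_a, _ = lstm(seq_a)
    out_b, _ = lstm(seq_b)

t = 1  # inspect an early position
# The first hidden_dim features come from the forward LSTM, the rest from the backward LSTM
forward_same = torch.allclose(out_a[0, t, :hidden_dim], out_b[0, t, :hidden_dim])
backward_same = torch.allclose(out_a[0, t, hidden_dim:], out_b[0, t, hidden_dim:])
print(f"forward half unchanged: {forward_same}, backward half unchanged: {backward_same}")
```

A unidirectional LSTM has only the forward half, so its prediction for 'Morgan' could never react to whether 'Chase' or some other word follows.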

#### Code Modifications for Comparison:

To compare these architectures, modify the `LSTMNERSegmenter` class by replacing `nn.LSTM` with `nn.RNN` for a Simple RNN or `nn.GRU` for a Gated Recurrent Unit. The constructor arguments (input size, hidden size, `bidirectional`, `batch_first`) and the output shape stay the same; only the second return value differs: `nn.LSTM` returns `(output, (h_n, c_n))`, whereas `nn.RNN` and `nn.GRU` return `(output, h_n)` because they have no separate cell state.

For example, to use a GRU:
```python
class GRUNERSegmenter(nn.Module):
    def __init__(self, vocab_size, tagset_size, embedding_dim, hidden_dim=256):
        super(GRUNERSegmenter, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        # Replace LSTM with GRU
        self.gru = nn.GRU(embedding_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

    def forward(self, x):
        embeddings = self.embedding(x)
        # GRU returns output, h_n (no cell state)
        gru_out, _ = self.gru(embeddings)
        tag_space = self.hidden2tag(gru_out)
        return tag_space
```
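
Rather than writing one class per architecture, the recurrent layer can be selected by name. The sketch below is one possible arrangement; the class `RecurrentTagger`, the `RNN_LAYERS` lookup table, and `count_parameters` are hypothetical names, and the vocabulary and tag-set sizes are the ones printed earlier in this notebook. It also counts the parameters of each recurrent layer as a rough proxy for cost.

```python
import torch.nn as nn

RNN_LAYERS = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}


class RecurrentTagger(nn.Module):
    """Sequence tagger whose recurrent layer is chosen by name (illustrative)."""

    def __init__(self, cell_type, vocab_size, tagset_size, embedding_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.rnn = RNN_LAYERS[cell_type](
            embedding_dim, hidden_dim, bidirectional=True, batch_first=True
        )
        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)

    def forward(self, x):
        # The second return value (h_n, or (h_n, c_n) for LSTM) is not needed for tagging
        out, _ = self.rnn(self.embedding(x))
        return self.hidden2tag(out)


def count_parameters(module):
    return sum(p.numel() for p in module.parameters())


recurrent_params = {
    name: count_parameters(RecurrentTagger(name, vocab_size=41, tagset_size=7).rnn)
    for name in RNN_LAYERS
}
for name, n_params in recurrent_params.items():
    print(f"{name:<5} recurrent parameters: {n_params:,}")
```

For the same input and hidden sizes, the GRU layer holds three times and the LSTM layer four times as many parameters as the Simple RNN, one weight set per gate or candidate. Each variant can then be trained with the same loop as above and compared on loss and accuracy.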

In [18]:
# Output Snapshot: Displaying True vs. Predicted Labels

# Select a sample from the training data
sample_index = 0 # You can change this to see other samples
sample_sentence_words = training_data[sample_index][0]
sample_true_tags_str = training_data[sample_index][1]

print(f"Sample Sentence: {' '.join(sample_sentence_words)}")
print(f"True Tags: {sample_true_tags_str}")

# Prepare the sample sentence for the model
model.eval() # Set the model to evaluation mode
with torch.no_grad(): # Disable gradient calculations
    input_indices = prepare_sequence(sample_sentence_words, word_to_index).unsqueeze(0).to(device) # Add batch dimension and move to device

    # Get predictions
    raw_predictions = model(input_indices)

    # Get the predicted tag index for each word
    # raw_predictions is (batch_size, seq_len, tagset_size)
    # We need to take argmax along the last dimension
    predicted_tag_indices = raw_predictions.argmax(dim=2).squeeze(0) # Remove batch dimension

    # Convert predicted indices back to tag strings
    predicted_tags_str = [index_to_tag[idx.item()] for idx in predicted_tag_indices]

# Trim predictions to the original sentence length. The sample was encoded without
# padding, so this is a no-op here; it matters only when the input comes from a padded batch.
original_sentence_length = len(sample_sentence_words)
predicted_tags_trimmed = predicted_tags_str[:original_sentence_length]

print(f"Predicted Tags: {predicted_tags_trimmed}")

# Visual comparison
print("\n--- Visual Comparison ---")
for word, true_tag, pred_tag in zip(sample_sentence_words, sample_true_tags_str, predicted_tags_trimmed):
    print(f"Word: {word:<10} | True: {true_tag:<8} | Pred: {pred_tag:<8}")


Sample Sentence: Elon Musk is the CEO of Tesla
True Tags: ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-ORG']
Predicted Tags: ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-ORG']

--- Visual Comparison ---
Word: Elon       | True: B-PER    | Pred: B-PER   
Word: Musk       | True: I-PER    | Pred: I-PER   
Word: is         | True: O        | Pred: O       
Word: the        | True: O        | Pred: O       
Word: CEO        | True: O        | Pred: O       
Word: of         | True: O        | Pred: O       
Word: Tesla      | True: B-ORG    | Pred: B-ORG   


### Deliverables

* A fully commented Python Notebook (.ipynb) implementing the Bi-LSTM NER model (Tasks 1-3).
* Analysis: A short theoretical comparison (100 words) explaining why a Bi-LSTM is inherently better suited than a unidirectional LSTM for Sequence Labeling tasks like NER.
* Output Snapshot: Show a sample sentence from the test set, displaying the true labels and the model's predicted labels.