# Bidirectional RNN for Named Entity Recognition (NER)

### Introduction
In standard RNNs, information flows **only in one direction** (past → future).  
But in many NLP tasks (like NER), the meaning of a word depends on both its **left context** and **right context**.

**Example:**  
- In "Apple is looking at buying a U.K. startup",  
  - "Apple" is an organization, not a fruit — we know this by looking at the words *after* it.  

**Bidirectional RNNs** process the sequence **forward and backward**, then combine both outputs.  
This helps the model capture **full context** around each word.

---

### Dataset
We’ll use the **NER dataset**: [Kaggle NER Dataset](https://www.kaggle.com/datasets/namanj27/ner-dataset?).  
It contains:  
- `Word` → actual word in the sentence  
- `POS` → part-of-speech tag  
- `Tag` → NER label (e.g., `B-geo`, `B-org`, `O`).


In [13]:
# -------------------------------
# Step 1: Import libraries
# -------------------------------
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from sklearn.model_selection import train_test_split


In [14]:
# -------------------------------
# Step 2: Load small sample dataset
# -------------------------------
df = pd.read_csv("dataset/ner_datasetreference.csv", encoding="latin1").fillna(method="ffill")
df = df.sample(n=2000, random_state=42)  # take small sample for faster training

# Unique words and tags
words = list(set(df["Word"].values))
tags = list(set(df["Tag"].values))
n_words, n_tags = len(words), len(tags)

# Mapping
word2idx = {w:i+2 for i,w in enumerate(words)}
word2idx["PAD"], word2idx["UNK"] = 0, 1
tag2idx = {t:i for i,t in enumerate(tags)}


  df = pd.read_csv("dataset/ner_datasetreference.csv", encoding="latin1").fillna(method="ffill")


In [15]:
# -------------------------------
# Step 3: Prepare sequences
# -------------------------------
MAX_LEN = 50
sentences = df.groupby("Sentence #").apply(lambda s: [(w,t) for w,t in zip(s["Word"],s["Tag"])]).tolist()

X = [[word2idx.get(w, word2idx["UNK"]) for w,t in s] for s in sentences]
X = pad_sequences(X, maxlen=MAX_LEN, padding='post', value=word2idx["PAD"])

y = [[tag2idx[t] for w,t in s] for s in sentences]
y = pad_sequences(y, maxlen=MAX_LEN, padding='post', value=tag2idx["O"])
y = np.array([to_categorical(seq, num_classes=n_tags) for seq in y])

  sentences = df.groupby("Sentence #").apply(lambda s: [(w,t) for w,t in zip(s["Word"],s["Tag"])]).tolist()


In [17]:
# -------------------------------
# Step 4: Train/Test split
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# -------------------------------
# Step 5: Build Bidirectional LSTM
# -------------------------------
model = Sequential([
    Embedding(input_dim=n_words+2, output_dim=64, input_length=MAX_LEN, mask_zero=True),
    Bidirectional(LSTM(50, return_sequences=True)),
    TimeDistributed(Dense(n_tags, activation="softmax"))
])
model.build(input_shape=(None, MAX_LEN))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()


In [18]:
# -------------------------------
# Step 6: Train the model
# -------------------------------
history = model.fit(X_train, y_train, validation_split=0.1, batch_size=32, epochs=5, verbose=1)


Epoch 1/5
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 110ms/step - accuracy: 0.9761 - loss: 2.3806 - val_accuracy: 0.9968 - val_loss: 1.9962
Epoch 2/5
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 65ms/step - accuracy: 0.9967 - loss: 1.3368 - val_accuracy: 0.9973 - val_loss: 1.0578
Epoch 3/5
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 65ms/step - accuracy: 0.9969 - loss: 0.6613 - val_accuracy: 0.9975 - val_loss: 0.8204
Epoch 4/5
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 98ms/step - accuracy: 0.9972 - loss: 0.4849 - val_accuracy: 0.9974 - val_loss: 0.7626
Epoch 5/5
[1m50/50[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 73ms/step - accuracy: 0.9974 - loss: 0.4483 - val_accuracy: 0.9974 - val_loss: 0.7401


In [20]:
# Predict on test set
y_pred = model.predict(X_test)

# Convert predictions to tag names (only first 5 sentences for simplicity)
for i in range(5):
    pred_tags = [list(tag2idx.keys())[np.argmax(p)] for p in y_pred[i]]
    true_tags = [list(tag2idx.keys())[np.argmax(p)] for p in y_test[i]]
    # Only show non-PAD words
    sentence_words = [list(word2idx.keys())[X_test[i][j]] if X_test[i][j] in word2idx.values() else "UNK" for j in range(MAX_LEN)]
    print("Sentence:", sentence_words[:10])
    print("Predicted:", pred_tags[:10])
    print("Actual   :", true_tags[:10])
    print("-"*50)


[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 376ms/step
Sentence: ['Palestinian', 'month', 'month', 'month', 'month', 'month', 'month', 'month', 'month', 'month']
Predicted: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Actual   : ['B-per', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
--------------------------------------------------
Sentence: ['genocide', 'month', 'month', 'month', 'month', 'month', 'month', 'month', 'month', 'month']
Predicted: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Actual   : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
--------------------------------------------------
Sentence: ['Hungarian', 'month', 'month', 'month', 'month', 'month', 'month', 'month', 'month', 'month']
Predicted: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Actual   : ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
--------------------------------------------------
Sentence: ['advisors', 'month', 'month', 'month', 'month', 'month', 'month', 'mont