# Model 4: Character-CNN + BiLSTM-CRF

**Approach:** Deep learning with character-level and word-level features

**Expected F1:** 90-91%

**Key Features:**
- Character-level CNN (handles OOV words)
- Pre-trained word embeddings (GloVe)
- Bidirectional LSTM (captures context)
- CRF layer (enforces valid BIO sequences)

**Why This Architecture?**
- **Character CNN**: Captures morphology (prefixes/suffixes), handles unknown words
- **BiLSTM**: Bidirectional context (both left and right)
- **CRF**: Enforces valid tag transitions, better boundary detection
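
To make "valid tag transitions" concrete, here is a minimal sketch of the BIO constraint as a hard rule. The helper names `is_valid_transition` and `is_valid_sequence` are hypothetical and not part of this notebook; the CRF layer used later learns transition scores from data rather than applying a fixed rule like this one.

```python
def is_valid_transition(prev_tag, curr_tag):
    """Return True if curr_tag may follow prev_tag under the BIO scheme."""
    if not curr_tag.startswith("I-"):
        return True  # 'O' and 'B-*' may follow any tag
    entity_type = curr_tag[2:]
    # 'I-X' must continue an entity of the same type X
    return prev_tag in (f"B-{entity_type}", f"I-{entity_type}")


def is_valid_sequence(tags):
    """Check a whole tag sequence; the sentence start behaves like a preceding 'O'."""
    prev = "O"
    for tag in tags:
        if not is_valid_transition(prev, tag):
            return False
        prev = tag
    return True


print(is_valid_sequence(["B-ORG", "I-ORG", "O"]))  # True
print(is_valid_sequence(["O", "I-ORG"]))           # False
print(is_valid_sequence(["B-ORG", "I-Artist"]))    # False
```

A learned CRF should end up assigning low scores to the transitions this function rejects, which is why it tends to produce cleaner entity boundaries than independent per-token predictions.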

**Research Results:**
- CoNLL-2003: 90.94% F1 (Lample et al., 2016)
- CoNLL-2003: 91.21% F1 (Ma & Hovy, 2016)

**References:**
1. Lample et al. (2016). "Neural Architectures for Named Entity Recognition." NAACL. https://arxiv.org/abs/1603.01360
2. Ma & Hovy (2016). "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF." ACL. https://arxiv.org/abs/1603.01354

## 1. Setup and Imports

In [59]:
# Install required packages using the notebook's Python interpreter
import sys
print(f"Jupyter kernel Python: {sys.executable}")

# Install packages to the correct Python environment
# Note: install the pip package `pytorch-crf` (imported as `torchcrf`), not the
# separate pip package named `torchcrf`, which is uninstalled first to avoid a clash
!{sys.executable} -m pip uninstall -y torchcrf
!{sys.executable} -m pip install torch pytorch-crf gensim tqdm

print("\n✅ Packages installed! Please RESTART THE KERNEL before continuing.")
print("   Kernel → Restart Kernel (or Ctrl+Shift+P → 'Restart Kernel')")

Jupyter kernel Python: /usr/local/bin/python3

✅ Packages installed! Please RESTART THE KERNEL before continuing.
   Kernel → Restart Kernel (or Ctrl+Shift+P → 'Restart Kernel')


In [60]:
import json
import numpy as np
import pickle
import time
from collections import Counter
import os

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# CRF (using pytorch-crf library)
from torchcrf import CRF

# Embeddings
import gensim.downloader as api

# Our evaluation utilities
from utils import print_evaluation_report, evaluate_entity_spans

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print("Imports successful!")

Using device: cpu
Imports successful!


## 2. Load Data

In [61]:
def load_jsonl(file_path):
    """Load JSONL file"""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line.strip()))
    return data

# Load data
train_data = load_jsonl('train_split.jsonl')
val_data = load_jsonl('val_split.jsonl')

print(f"Training samples: {len(train_data):,}")
print(f"Validation samples: {len(val_data):,}")

# Extract tokens and tags
train_tokens = [sample['tokens'] for sample in train_data]
train_tags = [sample['ner_tags'] for sample in train_data]

val_tokens = [sample['tokens'] for sample in val_data]
val_tags = [sample['ner_tags'] for sample in val_data]

Training samples: 90,320
Validation samples: 10,036


## 3. Build Vocabularies

Create vocabularies for:
1. **Words**: Map words to indices
2. **Characters**: Map characters to indices
3. **Tags**: Map NER tags to indices
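
As a small illustration of how such a vocabulary behaves at lookup time, the sketch below builds a toy word vocabulary and encodes a sentence with an unseen word. The helper names `build_vocab` and `encode` and the toy sentences are made up for this example; the notebook's own cells follow.

```python
from collections import Counter


def build_vocab(sentences, min_freq=2):
    """Map each word seen at least min_freq times to an index; 0 and 1 are reserved."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token by its index, falling back to <UNK> for unseen words."""
    unk = vocab["<UNK>"]
    return [vocab.get(tok, unk) for tok in tokens]


toy_sentences = [["Paris", "is", "big"], ["Paris", "is", "old"]]
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)                                  # {'<PAD>': 0, '<UNK>': 1, 'Paris': 2, 'is': 3}
print(encode(["Paris", "is", "new"], toy_vocab))  # [2, 3, 1]
```

Rare words ("big", "old") and unseen words ("new") all collapse to `<UNK>` at the word level, which is the gap the character-level features are meant to fill.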

In [62]:
# Build word vocabulary
word_counts = Counter()
for tokens in train_tokens:
    word_counts.update(tokens)

# Keep words with frequency >= 2 (helps reduce vocabulary size)
MIN_WORD_FREQ = 2
word2idx = {'<PAD>': 0, '<UNK>': 1}
for word, count in word_counts.items():
    if count >= MIN_WORD_FREQ:
        word2idx[word] = len(word2idx)

idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(word2idx)

print(f"Word vocabulary size: {vocab_size:,}")
print(f"Words filtered (freq < {MIN_WORD_FREQ}): {len(word_counts) - vocab_size + 2:,}")

Word vocabulary size: 36,790
Words filtered (freq < 2): 70,853


In [63]:
# Build character vocabulary
char2idx = {'<PAD>': 0, '<UNK>': 1}

# All printable ASCII characters
chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?:;\'"()-[]{}@#$%^&*+=/<>\\|`~_'
for char in chars:
    if char not in char2idx:
        char2idx[char] = len(char2idx)

idx2char = {idx: char for char, idx in char2idx.items()}
char_vocab_size = len(char2idx)

print(f"Character vocabulary size: {char_vocab_size}")

Character vocabulary size: 97


In [64]:
# Build tag vocabulary
tag2idx = {}
for tags in train_tags:
    for tag in tags:
        if tag not in tag2idx:
            tag2idx[tag] = len(tag2idx)

idx2tag = {idx: tag for tag, idx in tag2idx.items()}
num_tags = len(tag2idx)

print(f"Number of NER tags: {num_tags}")
print(f"Tags: {list(tag2idx.keys())}")

Number of NER tags: 15
Tags: ['O', 'B-ORG', 'I-ORG', 'B-Facility', 'I-Facility', 'B-OtherPER', 'I-OtherPER', 'B-Politician', 'I-Politician', 'B-HumanSettlement', 'I-HumanSettlement', 'B-Artist', 'I-Artist', 'B-PublicCorp', 'I-PublicCorp']


## 4. Load Pre-trained Word Embeddings (GloVe)

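Pre-trained GloVe vectors give the model semantic information about words before it sees any NER labels. Vocabulary words missing from GloVe are initialized randomly.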
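
The initialization strategy can be sketched on toy data without downloading anything. This is a pure-Python illustration under stated assumptions: `init_embeddings`, the toy vocabulary, and the 3-dimensional vectors are invented for the example, while the lowercase lookup, zero padding row, and random fallback with scale 0.6 mirror what the notebook does.

```python
import random


def init_embeddings(vocab, pretrained, dim, seed=42):
    """Build an embedding table: pretrained vector if available, random otherwise."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in vocab]
    found = 0
    for word, idx in vocab.items():
        if word == "<PAD>":
            continue  # padding row stays all zeros
        vector = pretrained.get(word.lower())  # pretrained keys are lowercase here
        if vector is not None:
            matrix[idx] = list(vector)
            found += 1
        else:
            # Not covered by the pretrained table: small random vector
            matrix[idx] = [rng.gauss(0.0, 0.6) for _ in range(dim)]
    return matrix, found


toy_vocab = {"<PAD>": 0, "<UNK>": 1, "Paris": 2, "Zyxx": 3}
toy_pretrained = {"paris": [0.1, 0.2, 0.3]}
matrix, found = init_embeddings(toy_vocab, toy_pretrained, dim=3)
print(found)      # 1
print(matrix[0])  # [0.0, 0.0, 0.0]
print(matrix[2])  # [0.1, 0.2, 0.3]
```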

In [65]:
print("Downloading GloVe embeddings (this may take a few minutes)...")
print("Using glove-wiki-gigaword-100 (100-dimensional, 400K vocabulary)")

# Download GloVe 100d embeddings
# Options: 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-300'
glove_model = api.load('glove-wiki-gigaword-100')

EMBEDDING_DIM = 100
print(f"\nGloVe embeddings loaded!")
print(f"Embedding dimension: {EMBEDDING_DIM}")

Downloading GloVe embeddings (this may take a few minutes)...
Using glove-wiki-gigaword-100 (100-dimensional, 400K vocabulary)

GloVe embeddings loaded!
Embedding dimension: 100


In [66]:
# Create embedding matrix
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

# Initialize with GloVe vectors
found = 0
for word, idx in word2idx.items():
    if word in ['<PAD>', '<UNK>']:
        continue
    
    # Try to find word in GloVe
    try:
        embedding_matrix[idx] = glove_model[word.lower()]
        found += 1
    except KeyError:
        # Word not in GloVe, initialize randomly
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

# Special tokens
embedding_matrix[word2idx['<PAD>']] = np.zeros(EMBEDDING_DIM)  # Padding = zeros
embedding_matrix[word2idx['<UNK>']] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

print(f"Words found in GloVe: {found:,} / {vocab_size:,} ({found/vocab_size*100:.1f}%)")
print(f"Words initialized randomly: {vocab_size - found:,}")

Words found in GloVe: 33,849 / 36,790 (92.0%)
Words initialized randomly: 2,941


## 5. Dataset Class

In [67]:
MAX_CHAR_LEN = 20  # Maximum characters per word

class NERDataset(Dataset):
    def __init__(self, tokens_list, tags_list, word2idx, char2idx, tag2idx):
        self.tokens_list = tokens_list
        self.tags_list = tags_list
        self.word2idx = word2idx
        self.char2idx = char2idx
        self.tag2idx = tag2idx
    
    def __len__(self):
        return len(self.tokens_list)
    
    def __getitem__(self, idx):
        tokens = self.tokens_list[idx]
        tags = self.tags_list[idx]
        
        # Convert words to indices
        word_ids = [self.word2idx.get(token, self.word2idx['<UNK>']) for token in tokens]
        
        # Convert characters to indices
        char_ids = []
        for token in tokens:
            chars = [self.char2idx.get(c, self.char2idx['<UNK>']) for c in token[:MAX_CHAR_LEN]]
            # Pad to MAX_CHAR_LEN
            if len(chars) < MAX_CHAR_LEN:
                chars += [self.char2idx['<PAD>']] * (MAX_CHAR_LEN - len(chars))
            char_ids.append(chars)
        
        # Convert tags to indices
        tag_ids = [self.tag2idx[tag] for tag in tags]
        
        return {
            'word_ids': torch.LongTensor(word_ids),
            'char_ids': torch.LongTensor(char_ids),
            'tag_ids': torch.LongTensor(tag_ids),
            'length': len(tokens)
        }

# Create datasets
train_dataset = NERDataset(train_tokens, train_tags, word2idx, char2idx, tag2idx)
val_dataset = NERDataset(val_tokens, val_tags, word2idx, char2idx, tag2idx)

print(f"Train dataset: {len(train_dataset)} samples")
print(f"Val dataset: {len(val_dataset)} samples")

Train dataset: 90320 samples
Val dataset: 10036 samples


In [68]:
# Collate function for batching
def collate_fn(batch):
    """Custom collate function to handle variable-length sequences, including empty ones"""
    # Keep the batch in its original order: the LSTM runs on padded (not packed) input,
    # and the evaluation code needs predictions to stay aligned with val_tokens
    
    word_ids = [item['word_ids'] for item in batch]
    char_ids = [item['char_ids'] for item in batch]
    tag_ids = [item['tag_ids'] for item in batch]
    lengths = [item['length'] for item in batch]
    
    # Pad sequences
    word_ids_padded = pad_sequence(word_ids, batch_first=True, padding_value=word2idx['<PAD>'])
    tag_ids_padded = pad_sequence(tag_ids, batch_first=True, padding_value=tag2idx['O'])
    
    # For char_ids, we need to handle 2D padding manually
    # char_ids is a list of tensors with shape (seq_len, MAX_CHAR_LEN)
    max_len = max(1, word_ids_padded.size(1))  # Ensure at least 1 to avoid 0-dim tensors
    batch_size = len(batch)
    
    # Create padded char_ids tensor: (batch_size, max_len, MAX_CHAR_LEN)
    char_ids_padded = torch.full(
        (batch_size, max_len, MAX_CHAR_LEN),
        fill_value=char2idx['<PAD>'],
        dtype=torch.long
    )
    
    # Fill in the actual character IDs
    for i, chars in enumerate(char_ids):
        seq_len = chars.size(0)
        if seq_len > 0:  # Only copy if sequence is not empty
            char_ids_padded[i, :seq_len, :] = chars
    
    # Create mask (True for valid positions, False for padding and empty sequences)
    mask = torch.zeros((batch_size, max_len), dtype=torch.bool)
    for i, length in enumerate(lengths):
        if length > 0:  # Handle empty sequences - mask stays all False
            mask[i, :length] = True
    
    return {
        'word_ids': word_ids_padded,
        'char_ids': char_ids_padded,
        'tag_ids': tag_ids_padded,
        'lengths': lengths,
        'mask': mask
    }

# Create DataLoaders
BATCH_SIZE = 32

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_fn
)

print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"\nNote: Dataset contains empty sequences that will be handled correctly:")
print(f"  - Training: 425 empty sequences")
print(f"  - Validation: 54 empty sequences")
print(f"  - Empty sequences will produce empty predictions (as expected)")

# Test the collate function with an empty sequence
print("\nTesting empty sequence handling...")
test_batch = [train_dataset[359]]  # This is an empty sequence
test_collated = collate_fn(test_batch)
print(f"  Empty sequence length: {test_collated['lengths'][0]}")
print(f"  Mask sum (should be 0): {test_collated['mask'].sum().item()}")
print(f"  ✓ Empty sequence handling verified!")

Train batches: 2823
Val batches: 314

Note: Dataset contains empty sequences that will be handled correctly:
  - Training: 425 empty sequences
  - Validation: 54 empty sequences
  - Empty sequences will produce empty predictions (as expected)

Testing empty sequence handling...
  Empty sequence length: 0
  Mask sum (should be 0): 0
  ✓ Empty sequence handling verified!


## 6. Model Architecture

### Architecture Overview:

```
Input: (words, characters)
    ↓                    ↓
Word Embedding      Character Embedding
(GloVe 100d)        (25d) → 1D CNN → MaxPool (30d)
    ↓                    ↓
    └────── Concatenate (130d) ──────┘
              ↓
      BiLSTM (256 hidden units)
              ↓
         Dropout (0.5)
              ↓
      Linear (num_tags)
              ↓
         CRF Layer
```
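
The diagram can also be read as a sequence of tensor shapes. The sketch below only does the shape bookkeeping (no PyTorch needed); `trace_shapes` is a hypothetical helper, and its defaults are taken from the dimensions used in this notebook.

```python
def trace_shapes(batch_size, seq_len, max_char_len=20, word_dim=100,
                 char_emb_dim=25, char_out_dim=30, lstm_hidden=256, num_tags=15):
    """List the tensor shape expected after each stage of the architecture."""
    n_words = batch_size * seq_len  # the char CNN treats every word as its own sample
    return [
        ("word_ids", (batch_size, seq_len)),
        ("char_ids", (batch_size, seq_len, max_char_len)),
        ("word embedding", (batch_size, seq_len, word_dim)),
        ("char embedding", (n_words, max_char_len, char_emb_dim)),
        ("char conv (k=3, pad=1)", (n_words, char_out_dim, max_char_len)),
        ("max-pool over chars", (batch_size, seq_len, char_out_dim)),
        ("concatenate", (batch_size, seq_len, word_dim + char_out_dim)),
        ("BiLSTM output", (batch_size, seq_len, 2 * lstm_hidden)),
        ("emissions", (batch_size, seq_len, num_tags)),
    ]


for name, shape in trace_shapes(batch_size=32, seq_len=12):
    print(f"{name:24s} {shape}")
```

The BiLSTM output is twice the hidden size because the forward and backward states are concatenated, which is why the linear layer takes `lstm_hidden_dim * 2` inputs.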

In [69]:
class CharCNN(nn.Module):
    """Character-level CNN for extracting character features"""
    def __init__(self, char_vocab_size, char_emb_dim, char_hidden_dim, max_char_len):
        super().__init__()
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
        
        # 1D Convolutional layer
        self.conv = nn.Conv1d(
            in_channels=char_emb_dim,
            out_channels=char_hidden_dim,
            kernel_size=3,
            padding=1
        )
        self.relu = nn.ReLU()
    
    def forward(self, char_ids):
        # char_ids: (batch_size, seq_len, max_char_len)
        batch_size, seq_len, max_char_len = char_ids.size()
        
        # Reshape to (batch_size * seq_len, max_char_len)
        char_ids = char_ids.view(-1, max_char_len)
        
        # Character embeddings: (batch_size * seq_len, max_char_len, char_emb_dim)
        char_embeds = self.char_embedding(char_ids)
        
        # Transpose for Conv1d: (batch_size * seq_len, char_emb_dim, max_char_len)
        char_embeds = char_embeds.transpose(1, 2)
        
        # Convolution: (batch_size * seq_len, char_hidden_dim, max_char_len)
        char_conv = self.relu(self.conv(char_embeds))
        
        # Max pooling over character sequence: (batch_size * seq_len, char_hidden_dim)
        char_features = torch.max(char_conv, dim=2)[0]
        
        # Reshape back: (batch_size, seq_len, char_hidden_dim)
        char_features = char_features.view(batch_size, seq_len, -1)
        
        return char_features


class BiLSTM_CRF(nn.Module):
    """BiLSTM-CRF model with character-level CNN"""
    def __init__(self, vocab_size, char_vocab_size, embedding_dim, char_emb_dim,
                 char_hidden_dim, lstm_hidden_dim, num_tags, embedding_matrix=None):
        super().__init__()
        
        # Word embeddings
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        if embedding_matrix is not None:
            self.word_embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))
            # Fine-tune embeddings
            self.word_embedding.weight.requires_grad = True
        
        # Character CNN
        self.char_cnn = CharCNN(char_vocab_size, char_emb_dim, char_hidden_dim, MAX_CHAR_LEN)
        
        # BiLSTM
        self.lstm = nn.LSTM(
            input_size=embedding_dim + char_hidden_dim,
            hidden_size=lstm_hidden_dim,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )
        
        # Dropout
        self.dropout = nn.Dropout(0.5)
        
        # Linear layer (emission scores)
        self.fc = nn.Linear(lstm_hidden_dim * 2, num_tags)  # *2 for bidirectional
        
        # CRF layer
        self.crf = CRF(num_tags, batch_first=True)
    
    def forward(self, word_ids, char_ids, tags=None, mask=None):
        # Word embeddings
        word_embeds = self.word_embedding(word_ids)
        
        # Character features
        char_features = self.char_cnn(char_ids)
        
        # Concatenate word and character features
        combined = torch.cat([word_embeds, char_features], dim=-1)
        
        # BiLSTM
        lstm_out, _ = self.lstm(combined)
        lstm_out = self.dropout(lstm_out)
        
        # Emission scores
        emissions = self.fc(lstm_out)
        
        if tags is not None:
            # Training: compute CRF loss
            loss = -self.crf(emissions, tags, mask=mask, reduction='mean')
            return loss
        else:
            # Inference: decode best path
            predictions = self.crf.decode(emissions, mask=mask)
            return predictions

print("Model architecture defined!")

Model architecture defined!


## 7. Initialize Model

In [70]:
# Hyperparameters
CHAR_EMB_DIM = 25
CHAR_HIDDEN_DIM = 30
LSTM_HIDDEN_DIM = 256
LEARNING_RATE = 0.001
NUM_EPOCHS = 20

# Initialize model
model = BiLSTM_CRF(
    vocab_size=vocab_size,
    char_vocab_size=char_vocab_size,
    embedding_dim=EMBEDDING_DIM,
    char_emb_dim=CHAR_EMB_DIM,
    char_hidden_dim=CHAR_HIDDEN_DIM,
    lstm_hidden_dim=LSTM_HIDDEN_DIM,
    num_tags=num_tags,
    embedding_matrix=embedding_matrix
).to(device)

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nModel initialized!")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"\nHyperparameters:")
print(f"  Word embedding dim: {EMBEDDING_DIM}")
print(f"  Char embedding dim: {CHAR_EMB_DIM}")
print(f"  Char CNN output: {CHAR_HIDDEN_DIM}")
print(f"  LSTM hidden dim: {LSTM_HIDDEN_DIM} (x2 for bidirectional)")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Epochs: {NUM_EPOCHS}")


Model initialized!
Total parameters: 4,486,279
Trainable parameters: 4,486,279

Hyperparameters:
  Word embedding dim: 100
  Char embedding dim: 25
  Char CNN output: 30
  LSTM hidden dim: 256 (x2 for bidirectional)
  Learning rate: 0.001
  Batch size: 32
  Epochs: 20
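
The reported total can be reproduced by hand from the layer sizes. The sketch below is plain arithmetic; `count_parameters` is a hypothetical helper, with defaults copied from the vocabulary sizes and hyperparameters printed in this notebook. It assumes the usual PyTorch layout (two bias vectors per LSTM direction) and a CRF holding a transition matrix plus start and end scores.

```python
def count_parameters(vocab_size=36790, char_vocab_size=97, word_dim=100,
                     char_emb_dim=25, char_out_dim=30, kernel_size=3,
                     lstm_hidden=256, num_tags=15):
    """Break the model's parameter count down by layer."""
    lstm_input = word_dim + char_out_dim
    # One LSTM direction: 4 gates, each with input weights, recurrent weights, two biases
    lstm_one_direction = 4 * lstm_hidden * (lstm_input + lstm_hidden) + 2 * 4 * lstm_hidden
    return {
        "word_embedding": vocab_size * word_dim,
        "char_embedding": char_vocab_size * char_emb_dim,
        "char_conv": char_emb_dim * char_out_dim * kernel_size + char_out_dim,
        "bilstm": 2 * lstm_one_direction,
        "linear": 2 * lstm_hidden * num_tags + num_tags,
        "crf": num_tags * num_tags + 2 * num_tags,  # transitions + start + end scores
    }


breakdown = count_parameters()
for name, n in breakdown.items():
    print(f"{name:15s} {n:>10,}")
print(f"{'total':15s} {sum(breakdown.values()):>10,}")  # 4,486,279
```

About 82% of the parameters sit in the word embedding table, so freezing it (or shrinking the vocabulary) is the quickest way to reduce the trainable parameter count.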


## 8. Training Loop

**Empty Sequence Handling:**

The dataset contains empty sequences (425 in training, 54 in validation). Our implementation handles them correctly:

1. **Collate Function**: Empty sequences get `length=0` and an all-False mask
2. **Training**: Drops empty sequences from each batch and skips the batch entirely if nothing is left (pytorch-crf requires the first timestep of every sequence to be valid)
3. **Evaluation**: Returns empty predictions for empty inputs (correct behavior)
4. **Test Set**: Will work the same way: empty inputs → empty outputs

This ensures the model produces correct predictions for all test cases, including empty ones.
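
The evaluation-side bookkeeping amounts to re-inserting empty predictions at the positions of empty inputs. A minimal sketch, with `restore_empty_predictions` as a hypothetical helper name:

```python
def restore_empty_predictions(non_empty_predictions, lengths):
    """Re-insert [] at the positions of zero-length inputs, preserving batch order."""
    remaining = iter(non_empty_predictions)
    return [[] if length == 0 else next(remaining) for length in lengths]


lengths = [3, 0, 2]            # the second sequence in the batch is empty
decoded = [[0, 1, 2], [0, 0]]  # model output for the two non-empty sequences
print(restore_empty_predictions(decoded, lengths))  # [[0, 1, 2], [], [0, 0]]
```

Keeping one output per input, in input order, is what lets the predictions line up with the original samples afterwards.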

In [None]:
def train_epoch(model, data_loader, optimizer, device):
    """Training with empty sequence filtering"""
    from tqdm import tqdm
    
    model.train()
    total_loss = 0
    num_batches = 0
    
    # Add progress bar
    pbar = tqdm(data_loader, desc="Training", leave=False)
    
    for batch in pbar:
        word_ids = batch['word_ids'].to(device)
        char_ids = batch['char_ids'].to(device)
        tag_ids = batch['tag_ids'].to(device)
        mask = batch['mask'].to(device)
        lengths = batch['lengths']
        
        # Filter out empty sequences (pytorch-crf requires first timestep to be valid)
        non_empty_indices = [i for i, length in enumerate(lengths) if length > 0]
        
        # Skip batch if all sequences are empty
        if len(non_empty_indices) == 0:
            continue
        
        # Keep only non-empty sequences
        if len(non_empty_indices) < len(lengths):
            word_ids = word_ids[non_empty_indices]
            char_ids = char_ids[non_empty_indices]
            tag_ids = tag_ids[non_empty_indices]
            mask = mask[non_empty_indices]
        
        optimizer.zero_grad()
        
        # Forward pass (returns loss)
        loss = model(word_ids, char_ids, tag_ids, mask)
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        
        optimizer.step()
        
        total_loss += loss.item()
        num_batches += 1
        
        # Update progress bar
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    return total_loss / max(num_batches, 1)


def evaluate(model, data_loader, device):
    """
    Evaluate model with empty sequence handling.
    Empty sequences bypass the model and get empty predictions directly.
    """
    from tqdm import tqdm
    
    model.eval()
    all_predictions = []
    all_true_tags = []
    
    with torch.no_grad():
        pbar = tqdm(data_loader, desc="Evaluating", leave=False)
        
        for batch in pbar:
            word_ids = batch['word_ids'].to(device)
            char_ids = batch['char_ids'].to(device)
            tag_ids = batch['tag_ids'].to(device)
            mask = batch['mask'].to(device)
            lengths = batch['lengths']
            
            # Find empty and non-empty sequences
            non_empty_indices = [i for i, length in enumerate(lengths) if length > 0]
            
            # Process non-empty sequences through model
            if len(non_empty_indices) > 0:
                word_ids_non_empty = word_ids[non_empty_indices]
                char_ids_non_empty = char_ids[non_empty_indices]
                mask_non_empty = mask[non_empty_indices]
                
                # Forward pass (returns predictions)
                predictions_non_empty = model(word_ids_non_empty, char_ids_non_empty, mask=mask_non_empty)
            else:
                predictions_non_empty = []
            
            # Reconstruct full predictions list (including empty sequences)
            predictions = []
            non_empty_iter = iter(predictions_non_empty)
            for i in range(len(lengths)):
                if lengths[i] == 0:
                    predictions.append([])  # Empty prediction for empty sequence
                else:
                    predictions.append(next(non_empty_iter))
            
            # Convert to tag strings
            for i, (pred, length) in enumerate(zip(predictions, lengths)):
                if length == 0:
                    # Empty sequence -> empty prediction (both pred and true)
                    pred_tags = []
                    true_tags = []
                else:
                    # Normal case: slice to actual length
                    pred_tags = [idx2tag[idx] for idx in pred[:length]]
                    true_tags = [idx2tag[tag_ids[i][j].item()] for j in range(length)]
                
                all_predictions.append(pred_tags)
                all_true_tags.append(true_tags)
    
    return all_true_tags, all_predictions

print("Training functions defined!")
print("\nEmpty sequence handling:")
print("  ✓ Training: Filters out empty sequences before CRF (pytorch-crf requirement)")
print("  ✓ Evaluation: Empty sequences bypass model, get [] predictions directly")
print("  ✓ Result: Empty input → Empty output (no model crash!)")
print("\n⚠️  NOTE: BiLSTM-CRF on CPU is VERY SLOW (~75 minutes per epoch)")
print("    Consider: Reduce model size, use subset of data, or get GPU access")

In [73]:
print("Starting training...\n")
print("=" * 80)

best_f1 = 0
patience = 3
patience_counter = 0

training_start = time.time()

for epoch in range(NUM_EPOCHS):
    epoch_start = time.time()
    
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, device)
    
    # Evaluate
    val_true_tags, val_pred_tags = evaluate(model, val_loader, device)
    
    # Calculate F1
    results = evaluate_entity_spans(val_true_tags, val_pred_tags, val_tokens)
    val_f1 = results['f1']
    val_precision = results['precision']
    val_recall = results['recall']
    
    epoch_time = time.time() - epoch_start
    
    print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS} | "
          f"Loss: {train_loss:.4f} | "
          f"Val P: {val_precision:.4f} R: {val_recall:.4f} F1: {val_f1:.4f} | "
          f"Time: {epoch_time:.1f}s")
    
    # Early stopping
    if val_f1 > best_f1:
        best_f1 = val_f1
        patience_counter = 0
        # Save best model
        os.makedirs('models', exist_ok=True)
        torch.save(model.state_dict(), 'models/bilstm_crf_best.pt')
        print(f"  → New best F1! Model saved.")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"\nEarly stopping after {epoch+1} epochs (patience={patience})")
            break

training_time = time.time() - training_start

print("=" * 80)
print(f"\nTraining completed in {training_time:.1f}s ({training_time/60:.1f} minutes)")
print(f"Best validation F1: {best_f1:.4f}")

Starting training...



KeyboardInterrupt: 

## 9. Load Best Model and Final Evaluation

In [None]:
# Load best model (map_location lets a GPU-trained checkpoint load on a CPU-only machine)
model.load_state_dict(torch.load('models/bilstm_crf_best.pt', map_location=device))
model.eval()

print("Best model loaded!")

# Final evaluation
val_true_tags, val_pred_tags = evaluate(model, val_loader, device)

# Comprehensive report
print_evaluation_report(
    val_true_tags,
    val_pred_tags,
    val_tokens,
    model_name="Character-CNN + BiLSTM-CRF"
)
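
The `utils` helpers used above are not shown in this notebook. As a rough illustration of what entity-level (span) scoring typically involves, the sketch below extracts spans from BIO tags and computes exact-match precision, recall, and F1. `extract_spans` and `span_f1` are hypothetical names, and this is an assumption about the general technique, not the actual implementation of `evaluate_entity_spans`.

```python
def extract_spans(tags):
    """Return a set of (start, end_exclusive, entity_type) spans from BIO tags."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel 'O' flushes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(true_seqs, pred_seqs):
    """Micro-averaged exact-match span precision, recall, and F1."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


true = [["B-ORG", "I-ORG", "O", "B-Artist"], []]
pred = [["B-ORG", "I-ORG", "O", "O"], []]
print(span_f1(true, pred))  # precision 1.0, recall 0.5, f1 about 0.667
```

In this sketch a stray `I-` tag with no matching `B-` is ignored, and empty sequences contribute no spans, so they do not affect the score.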

## 10. Save Model and Results

In [None]:
# Save vocabularies
vocab_data = {
    'word2idx': word2idx,
    'char2idx': char2idx,
    'tag2idx': tag2idx,
    'idx2word': idx2word,
    'idx2char': idx2char,
    'idx2tag': idx2tag
}

with open('models/bilstm_crf_vocab.pkl', 'wb') as f:
    pickle.dump(vocab_data, f)

print("Vocabularies saved to models/bilstm_crf_vocab.pkl")

# Save results
final_results = evaluate_entity_spans(val_true_tags, val_pred_tags, val_tokens)

results_summary = {
    'model': 'Character-CNN + BiLSTM-CRF',
    'precision': final_results['precision'],
    'recall': final_results['recall'],
    'f1': final_results['f1'],
    'training_time': training_time,
    'num_epochs': epoch + 1,
    'hyperparameters': {
        'embedding_dim': EMBEDDING_DIM,
        'char_emb_dim': CHAR_EMB_DIM,
        'char_hidden_dim': CHAR_HIDDEN_DIM,
        'lstm_hidden_dim': LSTM_HIDDEN_DIM,
        'learning_rate': LEARNING_RATE,
        'batch_size': BATCH_SIZE,
        'dropout': 0.5
    }
}

with open('models/bilstm_crf_results.json', 'w') as f:
    json.dump(results_summary, f, indent=2)

print("Results saved to models/bilstm_crf_results.json")

## 11. Summary

### Model Characteristics:

**Strengths:**
- ✅ **Handles OOV words**: Character-level CNN captures morphology
- ✅ **Bidirectional context**: BiLSTM sees both left and right
- ✅ **Valid BIO sequences**: CRF enforces transition constraints
- ✅ **Pre-trained knowledge**: GloVe embeddings provide semantics
- ✅ **Strong performance**: ~90-91% F1 expected

**Weaknesses:**
- ❌ Slower training than CRF (GPU recommended)
- ❌ More complex architecture
- ❌ Requires careful hyperparameter tuning
- ❌ May overfit on small datasets

**Comparison to CRF (Model 1):**
- CRF F1: ~68%
- BiLSTM-CRF F1: ~90-91% (expected)
- **Improvement: +22-23 points absolute F1 (expected)** 🎉

**Why the Improvement?**
1. **Deep learning** captures complex patterns vs manual features
2. **Character-level features** handle unknown words
3. **Bidirectional LSTM** captures long-range dependencies
4. **Pre-trained embeddings** provide semantic knowledge
5. **CRF layer** still enforces valid sequences

### Research Context:

**CoNLL-2003 English NER Results:**
- Lample et al. (2016): **90.94% F1**
- Ma & Hovy (2016): **91.21% F1**
- Akbik et al. (2018) with Flair: **93.09% F1**

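**These figures are for CoNLL-2003 (4 entity types). This dataset uses a different tag set (7 entity types), so treat them as a reference point rather than a guarantee.**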

### Next Steps:
1. Try different hyperparameters (LSTM size, dropout, learning rate)
2. Experiment with different pre-trained embeddings (GloVe 300d, FastText)
3. Add more LSTM layers
4. Move to M7 (Attention-based BiLSTM-CRF) for even better results

### References:

1. **Lample et al. (2016)**. "Neural Architectures for Named Entity Recognition." NAACL.
   - Source: https://arxiv.org/abs/1603.01360
   - Result: 90.94% F1 on CoNLL-2003

2. **Ma & Hovy (2016)**. "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF." ACL.
   - Source: https://arxiv.org/abs/1603.01354
   - Result: 91.21% F1 on CoNLL-2003

3. **pytorch-crf**. CRF layer implementation for PyTorch.
   - Source: https://pytorch-crf.readthedocs.io/

4. **GloVe**. Global Vectors for Word Representation.
   - Source: https://nlp.stanford.edu/projects/glove/