# Model 4 (Ma Hovey): Character-CNN + BiLSTM-CRF

**This notebook EXACTLY replicates Ma & Hovy (2016) hyperparameters**

**Paper**: "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (ACL 2016)

**Original Results on CoNLL-2003**: 91.21% F1

## Key Differences from M4.ipynb:

| Component | M4.ipynb (Our Original) | M4_Paper_Exact.ipynb (This File) |
|-----------|------------------------|----------------------------------|
| Char embedding | 25d | **30d** |
| LSTM hidden | 256 | **100** |
| Optimizer | Adam | **SGD + momentum** |
| Learning rate | 0.001 fixed | **0.015 with decay** |
| Batch size | 32 | **10** |
| Word freq threshold | 2 | **5** |

**Reference**: Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.

## 1. Setup and Imports

In [3]:
# Install required packages
import sys
print(f"Python: {sys.executable}")

!{sys.executable} -m pip install torch pytorch-crf gensim tqdm

print("\nâœ… Packages installed!")

Python: /usr/local/bin/python3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

âœ… Packages installed!


In [4]:
import json
import numpy as np
import pickle
import time
from collections import Counter
import os

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

# CRF
from torchcrf import CRF

# Embeddings
import gensim.downloader as api

# Progress bar
from tqdm import tqdm

# Our evaluation utilities
from utils import print_evaluation_report, evaluate_entity_spans

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print("Imports successful!")

Using device: cpu
Imports successful!


## 2. Load Data

In [5]:
def load_jsonl(file_path):
    """Load JSONL file"""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line.strip()))
    return data

# Load data
train_data = load_jsonl('train_split.jsonl')
val_data = load_jsonl('val_split.jsonl')

print(f"Training samples: {len(train_data):,}")
print(f"Validation samples: {len(val_data):,}")

# Extract tokens and tags
train_tokens = [sample['tokens'] for sample in train_data]
train_tags = [sample['ner_tags'] for sample in train_data]

val_tokens = [sample['tokens'] for sample in val_data]
val_tags = [sample['ner_tags'] for sample in val_data]

Training samples: 90,320
Validation samples: 10,036


## 3. Build Vocabularies

**Paper-exact settings:**
- Word frequency threshold: 5 (not 2)
- Character embedding: 30d (not 25d)

In [6]:
# Build word vocabulary (Paper: freq >= 5)
word_counts = Counter()
for tokens in train_tokens:
    word_counts.update(tokens)

MIN_WORD_FREQ = 5  # Paper setting!
word2idx = {'<PAD>': 0, '<UNK>': 1}
for word, count in word_counts.items():
    if count >= MIN_WORD_FREQ:
        word2idx[word] = len(word2idx)

idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(word2idx)

print(f"Word vocabulary size: {vocab_size:,} (min freq = {MIN_WORD_FREQ})")
print(f"Words filtered: {len(word_counts) - vocab_size + 2:,}")

Word vocabulary size: 15,092 (min freq = 5)
Words filtered: 92,551


In [7]:
# Build character vocabulary
char2idx = {'<PAD>': 0, '<UNK>': 1}

chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?:;\'"()-[]{}@#$%^&*+=/<>\\|`~_'
for char in chars:
    if char not in char2idx:
        char2idx[char] = len(char2idx)

idx2char = {idx: char for char, idx in char2idx.items()}
char_vocab_size = len(char2idx)

print(f"Character vocabulary size: {char_vocab_size}")

Character vocabulary size: 97


In [8]:
# Build tag vocabulary
tag2idx = {}
for tags in train_tags:
    for tag in tags:
        if tag not in tag2idx:
            tag2idx[tag] = len(tag2idx)

idx2tag = {idx: tag for tag, idx in tag2idx.items()}
num_tags = len(tag2idx)

print(f"Number of NER tags: {num_tags}")
print(f"Tags: {list(tag2idx.keys())}")

Number of NER tags: 15
Tags: ['O', 'B-ORG', 'I-ORG', 'B-Facility', 'I-Facility', 'B-OtherPER', 'I-OtherPER', 'B-Politician', 'I-Politician', 'B-HumanSettlement', 'I-HumanSettlement', 'B-Artist', 'I-Artist', 'B-PublicCorp', 'I-PublicCorp']


## 4. Load Pre-trained Word Embeddings (GloVe)

**Paper uses GloVe 100d** (same as ours)

In [9]:
print("Downloading GloVe embeddings...")
glove_model = api.load('glove-wiki-gigaword-100')

EMBEDDING_DIM = 100
print(f"GloVe embeddings loaded! Dimension: {EMBEDDING_DIM}")

Downloading GloVe embeddings...
GloVe embeddings loaded! Dimension: 100


In [10]:
# Create embedding matrix
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

found = 0
for word, idx in word2idx.items():
    if word in ['<PAD>', '<UNK>']:
        continue
    
    try:
        embedding_matrix[idx] = glove_model[word.lower()]
        found += 1
    except KeyError:
        # Random initialization for OOV
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

# Special tokens
embedding_matrix[word2idx['<PAD>']] = np.zeros(EMBEDDING_DIM)
embedding_matrix[word2idx['<UNK>']] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

print(f"Words found in GloVe: {found:,} / {vocab_size:,} ({found/vocab_size*100:.1f}%)")

Words found in GloVe: 14,897 / 15,092 (98.7%)


## 5. Dataset Class

In [11]:
MAX_CHAR_LEN = 20  # Maximum characters per word

class NERDataset(Dataset):
    def __init__(self, tokens_list, tags_list, word2idx, char2idx, tag2idx):
        self.tokens_list = tokens_list
        self.tags_list = tags_list
        self.word2idx = word2idx
        self.char2idx = char2idx
        self.tag2idx = tag2idx
    
    def __len__(self):
        return len(self.tokens_list)
    
    def __getitem__(self, idx):
        tokens = self.tokens_list[idx]
        tags = self.tags_list[idx]
        
        # Words
        word_ids = [self.word2idx.get(token, self.word2idx['<UNK>']) for token in tokens]
        
        # Characters
        char_ids = []
        for token in tokens:
            chars = [self.char2idx.get(c, self.char2idx['<UNK>']) for c in token[:MAX_CHAR_LEN]]
            if len(chars) < MAX_CHAR_LEN:
                chars += [self.char2idx['<PAD>']] * (MAX_CHAR_LEN - len(chars))
            char_ids.append(chars)
        
        # Tags
        tag_ids = [self.tag2idx[tag] for tag in tags]
        
        return {
            'word_ids': torch.LongTensor(word_ids),
            'char_ids': torch.LongTensor(char_ids),
            'tag_ids': torch.LongTensor(tag_ids),
            'length': len(tokens)
        }

train_dataset = NERDataset(train_tokens, train_tags, word2idx, char2idx, tag2idx)
val_dataset = NERDataset(val_tokens, val_tags, word2idx, char2idx, tag2idx)

print(f"Train dataset: {len(train_dataset)} samples")
print(f"Val dataset: {len(val_dataset)} samples")

Train dataset: 90320 samples
Val dataset: 10036 samples


In [12]:
# Collate function with empty sequence handling
def collate_fn(batch):
    """Custom collate function"""
    batch = sorted(batch, key=lambda x: x['length'], reverse=True)
    
    word_ids = [item['word_ids'] for item in batch]
    char_ids = [item['char_ids'] for item in batch]
    tag_ids = [item['tag_ids'] for item in batch]
    lengths = [item['length'] for item in batch]
    
    # Pad sequences
    word_ids_padded = pad_sequence(word_ids, batch_first=True, padding_value=word2idx['<PAD>'])
    tag_ids_padded = pad_sequence(tag_ids, batch_first=True, padding_value=tag2idx['O'])
    
    # Pad char_ids manually
    max_len = max(1, word_ids_padded.size(1))
    batch_size = len(batch)
    
    char_ids_padded = torch.full(
        (batch_size, max_len, MAX_CHAR_LEN),
        fill_value=char2idx['<PAD>'],
        dtype=torch.long
    )
    
    for i, chars in enumerate(char_ids):
        seq_len = chars.size(0)
        if seq_len > 0:
            char_ids_padded[i, :seq_len, :] = chars
    
    # Create mask
    mask = torch.zeros((batch_size, max_len), dtype=torch.bool)
    for i, length in enumerate(lengths):
        if length > 0:
            mask[i, :length] = True
    
    return {
        'word_ids': word_ids_padded,
        'char_ids': char_ids_padded,
        'tag_ids': tag_ids_padded,
        'lengths': lengths,
        'mask': mask
    }

# Create DataLoaders with PAPER BATCH SIZE = 10
BATCH_SIZE = 10  # Paper setting!

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_fn
)

print(f"Train batches: {len(train_loader)} (batch size = {BATCH_SIZE})")
print(f"Val batches: {len(val_loader)}")

Train batches: 9032 (batch size = 10)
Val batches: 1004


## 6. Model Architecture

**Paper-exact hyperparameters:**
- Char embedding: **30d** (not 25d)
- Char CNN output: **30d** (not 30d) âœ“
- LSTM hidden: **100** (not 256)
- Dropout: 0.5 âœ“

In [13]:
class CharCNN(nn.Module):
    """Character-level CNN (Ma & Hovy 2016)"""
    def __init__(self, char_vocab_size, char_emb_dim, char_hidden_dim, max_char_len):
        super().__init__()
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
        
        self.conv = nn.Conv1d(
            in_channels=char_emb_dim,
            out_channels=char_hidden_dim,
            kernel_size=3,
            padding=1
        )
        self.relu = nn.ReLU()
    
    def forward(self, char_ids):
        batch_size, seq_len, max_char_len = char_ids.size()
        
        char_ids = char_ids.view(-1, max_char_len)
        char_embeds = self.char_embedding(char_ids)
        char_embeds = char_embeds.transpose(1, 2)
        
        char_conv = self.relu(self.conv(char_embeds))
        char_features = torch.max(char_conv, dim=2)[0]
        char_features = char_features.view(batch_size, seq_len, -1)
        
        return char_features


class BiLSTM_CRF_PaperExact(nn.Module):
    """BiLSTM-CRF with Ma & Hovy 2016 exact hyperparameters"""
    def __init__(self, vocab_size, char_vocab_size, embedding_dim, char_emb_dim,
                 char_hidden_dim, lstm_hidden_dim, num_tags, embedding_matrix=None):
        super().__init__()
        
        # Word embeddings
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        if embedding_matrix is not None:
            self.word_embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))
            self.word_embedding.weight.requires_grad = True
        
        # Character CNN
        self.char_cnn = CharCNN(char_vocab_size, char_emb_dim, char_hidden_dim, MAX_CHAR_LEN)
        
        # BiLSTM
        self.lstm = nn.LSTM(
            input_size=embedding_dim + char_hidden_dim,
            hidden_size=lstm_hidden_dim,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )
        
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(lstm_hidden_dim * 2, num_tags)
        
        # CRF
        self.crf = CRF(num_tags, batch_first=True)
    
    def forward(self, word_ids, char_ids, tags=None, mask=None):
        word_embeds = self.word_embedding(word_ids)
        char_features = self.char_cnn(char_ids)
        
        combined = torch.cat([word_embeds, char_features], dim=-1)
        
        lstm_out, _ = self.lstm(combined)
        lstm_out = self.dropout(lstm_out)
        
        emissions = self.fc(lstm_out)
        
        if tags is not None:
            loss = -self.crf(emissions, tags, mask=mask, reduction='mean')
            return loss
        else:
            predictions = self.crf.decode(emissions, mask=mask)
            return predictions

print("Model architecture defined (Paper-exact)!")

Model architecture defined (Paper-exact)!


## 7. Initialize Model

**Paper-exact hyperparameters:**
- CHAR_EMB_DIM = **30** (not 25)
- CHAR_HIDDEN_DIM = **30** âœ“
- LSTM_HIDDEN_DIM = **100** (not 256)
- LEARNING_RATE = **0.015** (not 0.001)
- Optimizer = **SGD with momentum=0.9** (not Adam)

In [14]:
# PAPER HYPERPARAMETERS
CHAR_EMB_DIM = 30      # Paper: 30 (not 25!)
CHAR_HIDDEN_DIM = 30   # Paper: 30
LSTM_HIDDEN_DIM = 100  # Paper: 100 (not 256!)
LEARNING_RATE = 0.015  # Paper: 0.015 (not 0.001!)
NUM_EPOCHS = 50        # Paper trains until convergence

# Initialize model
model = BiLSTM_CRF_PaperExact(
    vocab_size=vocab_size,
    char_vocab_size=char_vocab_size,
    embedding_dim=EMBEDDING_DIM,
    char_emb_dim=CHAR_EMB_DIM,
    char_hidden_dim=CHAR_HIDDEN_DIM,
    lstm_hidden_dim=LSTM_HIDDEN_DIM,
    num_tags=num_tags,
    embedding_matrix=embedding_matrix
).to(device)

# PAPER OPTIMIZER: SGD with momentum (not Adam!)
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=0.9)

# Learning rate scheduler (Paper: decay LR)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=3, verbose=True
)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())

print(f"\nðŸŽ¯ PAPER-EXACT CONFIGURATION:")
print(f"  Char embedding: {CHAR_EMB_DIM}d (Paper: 30d)")
print(f"  LSTM hidden: {LSTM_HIDDEN_DIM} (Paper: 100)")
print(f"  Optimizer: SGD + momentum=0.9 (Paper setting)")
print(f"  Learning rate: {LEARNING_RATE} with decay (Paper setting)")
print(f"  Batch size: {BATCH_SIZE} (Paper: 10)")
print(f"  Total parameters: {total_params:,}")


ðŸŽ¯ PAPER-EXACT CONFIGURATION:
  Char embedding: 30d (Paper: 30d)
  LSTM hidden: 100 (Paper: 100)
  Optimizer: SGD + momentum=0.9 (Paper setting)
  Learning rate: 0.015 with decay (Paper setting)
  Batch size: 10 (Paper: 10)
  Total parameters: 1,703,710




## 8. Training Loop

With paper-exact settings:
- SGD optimizer
- Learning rate decay
- Smaller batches (10 vs 32)

In [15]:
def train_epoch(model, data_loader, optimizer, device):
    """Training with empty sequence filtering"""
    model.train()
    total_loss = 0
    num_batches = 0
    
    pbar = tqdm(data_loader, desc="Training", leave=False)
    
    for batch in pbar:
        word_ids = batch['word_ids'].to(device)
        char_ids = batch['char_ids'].to(device)
        tag_ids = batch['tag_ids'].to(device)
        mask = batch['mask'].to(device)
        lengths = batch['lengths']
        
        # Filter empty sequences
        non_empty_indices = [i for i, length in enumerate(lengths) if length > 0]
        
        if len(non_empty_indices) == 0:
            continue
        
        if len(non_empty_indices) < len(lengths):
            word_ids = word_ids[non_empty_indices]
            char_ids = char_ids[non_empty_indices]
            tag_ids = tag_ids[non_empty_indices]
            mask = mask[non_empty_indices]
        
        optimizer.zero_grad()
        loss = model(word_ids, char_ids, tag_ids, mask)
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        
        optimizer.step()
        
        total_loss += loss.item()
        num_batches += 1
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    return total_loss / max(num_batches, 1)


def evaluate(model, data_loader, device):
    """Evaluate with empty sequence handling"""
    model.eval()
    all_predictions = []
    all_true_tags = []
    
    with torch.no_grad():
        pbar = tqdm(data_loader, desc="Evaluating", leave=False)
        
        for batch in pbar:
            word_ids = batch['word_ids'].to(device)
            char_ids = batch['char_ids'].to(device)
            tag_ids = batch['tag_ids'].to(device)
            mask = batch['mask'].to(device)
            lengths = batch['lengths']
            
            non_empty_indices = [i for i, length in enumerate(lengths) if length > 0]
            
            if len(non_empty_indices) > 0:
                word_ids_non_empty = word_ids[non_empty_indices]
                char_ids_non_empty = char_ids[non_empty_indices]
                mask_non_empty = mask[non_empty_indices]
                
                predictions_non_empty = model(word_ids_non_empty, char_ids_non_empty, mask=mask_non_empty)
            else:
                predictions_non_empty = []
            
            # Reconstruct predictions
            predictions = []
            non_empty_iter = iter(predictions_non_empty)
            for i in range(len(lengths)):
                if lengths[i] == 0:
                    predictions.append([])
                else:
                    predictions.append(next(non_empty_iter))
            
            # Convert to tags
            for i, (pred, length) in enumerate(zip(predictions, lengths)):
                if length == 0:
                    pred_tags = []
                    true_tags = []
                else:
                    pred_tags = [idx2tag[idx] for idx in pred[:length]]
                    true_tags = [idx2tag[tag_ids[i][j].item()] for j in range(length)]
                
                all_predictions.append(pred_tags)
                all_true_tags.append(true_tags)
    
    return all_true_tags, all_predictions

print("Training functions defined!")

Training functions defined!


In [17]:
print("Starting training with PAPER-EXACT settings...\n")
print("=" * 80)
print("Configuration:")
print(f"  - LSTM hidden: 100 (Paper) vs 256 (M4.ipynb)")
print(f"  - Char emb: 30d (Paper) vs 25d (M4.ipynb)")
print(f"  - Optimizer: SGD+momentum (Paper) vs Adam (M4.ipynb)")
print(f"  - LR: 0.015 with decay (Paper) vs 0.001 fixed (M4.ipynb)")
print(f"  - Batch size: 10 (Paper) vs 32 (M4.ipynb)")
print("=" * 80 + "\n")

best_f1 = 0
patience = 5  # Paper trains longer
patience_counter = 0

training_start = time.time()

for epoch in range(NUM_EPOCHS):
    epoch_start = time.time()
    
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, device)
    
    # Evaluate
    val_true_tags, val_pred_tags = evaluate(model, val_loader, device)
    
    # Calculate F1
    results = evaluate_entity_spans(val_true_tags, val_pred_tags, val_tokens)
    val_f1 = results['f1']
    val_precision = results['precision']
    val_recall = results['recall']
    
    epoch_time = time.time() - epoch_start
    current_lr = optimizer.param_groups[0]['lr']
    
    print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS} | "
          f"Loss: {train_loss:.4f} | "
          f"Val P: {val_precision:.4f} R: {val_recall:.4f} F1: {val_f1:.4f} | "
          f"LR: {current_lr:.5f} | "
          f"Time: {epoch_time:.1f}s")
    
    # Learning rate scheduling
    scheduler.step(val_f1)
    
    # Early stopping
    if val_f1 > best_f1:
        best_f1 = val_f1
        patience_counter = 0
        os.makedirs('models', exist_ok=True)
        torch.save(model.state_dict(), 'models/bilstm_crf_paper_exact_best.pt')
        print(f"  â†’ New best F1! Model saved.")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"\nEarly stopping after {epoch+1} epochs (patience={patience})")
            break

training_time = time.time() - training_start

print("=" * 80)
print(f"\nTraining completed in {training_time:.1f}s ({training_time/60:.1f} minutes)")
print(f"Best validation F1: {best_f1:.4f}")
print(f"\nNote: This used paper-exact hyperparameters from Ma & Hovy (2016)")

Starting training with PAPER-EXACT settings...

Configuration:
  - LSTM hidden: 100 (Paper) vs 256 (M4.ipynb)
  - Char emb: 30d (Paper) vs 25d (M4.ipynb)
  - Optimizer: SGD+momentum (Paper) vs Adam (M4.ipynb)
  - LR: 0.015 with decay (Paper) vs 0.001 fixed (M4.ipynb)
  - Batch size: 10 (Paper) vs 32 (M4.ipynb)



                                                                           

Epoch  1/50 | Loss: 2.0867 | Val P: 0.7123 R: 0.6466 F1: 0.6778 | LR: 0.01500 | Time: 192.9s
  â†’ New best F1! Model saved.


                                                                          

Epoch  2/50 | Loss: 1.6982 | Val P: 0.7185 R: 0.6741 F1: 0.6956 | LR: 0.01500 | Time: 192.2s
  â†’ New best F1! Model saved.


                                                                          

Epoch  3/50 | Loss: 1.5440 | Val P: 0.7316 R: 0.6854 F1: 0.7077 | LR: 0.01500 | Time: 192.3s
  â†’ New best F1! Model saved.


                                                                          

Epoch  4/50 | Loss: 1.4379 | Val P: 0.7292 R: 0.6905 F1: 0.7093 | LR: 0.01500 | Time: 184.0s
  â†’ New best F1! Model saved.


                                                                          

Epoch  5/50 | Loss: 1.3567 | Val P: 0.7558 R: 0.6976 F1: 0.7255 | LR: 0.01500 | Time: 175.8s
  â†’ New best F1! Model saved.


                                                                          

Epoch  6/50 | Loss: 1.2897 | Val P: 0.7483 R: 0.6939 F1: 0.7201 | LR: 0.01500 | Time: 179.6s


                                                                          

Epoch  7/50 | Loss: 1.2296 | Val P: 0.7347 R: 0.7097 F1: 0.7220 | LR: 0.01500 | Time: 194.5s


                                                                          

Epoch  8/50 | Loss: 1.1780 | Val P: 0.7315 R: 0.7096 F1: 0.7204 | LR: 0.01500 | Time: 201.1s


                                                                          

Epoch  9/50 | Loss: 1.1266 | Val P: 0.7467 R: 0.7069 F1: 0.7263 | LR: 0.01500 | Time: 200.8s
  â†’ New best F1! Model saved.


                                                                          

Epoch 10/50 | Loss: 1.0797 | Val P: 0.7293 R: 0.7142 F1: 0.7217 | LR: 0.01500 | Time: 210.3s


                                                                          

Epoch 11/50 | Loss: 1.0377 | Val P: 0.7480 R: 0.7102 F1: 0.7286 | LR: 0.01500 | Time: 198.9s
  â†’ New best F1! Model saved.


                                                                          

Epoch 12/50 | Loss: 0.9996 | Val P: 0.7436 R: 0.7123 F1: 0.7276 | LR: 0.01500 | Time: 191.0s


                                                                          

Epoch 13/50 | Loss: 0.9614 | Val P: 0.7505 R: 0.7123 F1: 0.7309 | LR: 0.01500 | Time: 189.9s
  â†’ New best F1! Model saved.


                                                                          

Epoch 14/50 | Loss: 0.9225 | Val P: 0.7469 R: 0.7149 F1: 0.7305 | LR: 0.01500 | Time: 179.6s


                                                                          

Epoch 15/50 | Loss: 0.8871 | Val P: 0.7543 R: 0.7036 F1: 0.7281 | LR: 0.01500 | Time: 177.9s


                                                                          

Epoch 16/50 | Loss: 0.8535 | Val P: 0.7398 R: 0.7193 F1: 0.7294 | LR: 0.01500 | Time: 184.0s


                                                                          

Epoch 17/50 | Loss: 0.8206 | Val P: 0.7303 R: 0.7178 F1: 0.7240 | LR: 0.01500 | Time: 183.7s


                                                                          

Epoch 18/50 | Loss: 0.6852 | Val P: 0.7525 R: 0.7178 F1: 0.7347 | LR: 0.00750 | Time: 181.2s
  â†’ New best F1! Model saved.


                                                                          

Epoch 19/50 | Loss: 0.6403 | Val P: 0.7505 R: 0.7150 F1: 0.7323 | LR: 0.00750 | Time: 195.4s


                                                                          

Epoch 20/50 | Loss: 0.6160 | Val P: 0.7456 R: 0.7187 F1: 0.7319 | LR: 0.00750 | Time: 176.8s


                                                                          

Epoch 21/50 | Loss: 0.5913 | Val P: 0.7365 R: 0.7208 F1: 0.7286 | LR: 0.00750 | Time: 200.5s


                                                                          

Epoch 22/50 | Loss: 0.5744 | Val P: 0.7431 R: 0.7193 F1: 0.7310 | LR: 0.00750 | Time: 196.8s


                                                                          

Epoch 23/50 | Loss: 0.5082 | Val P: 0.7366 R: 0.7193 F1: 0.7278 | LR: 0.00375 | Time: 207.2s

Early stopping after 23 epochs (patience=5)

Training completed in 4386.4s (73.1 minutes)
Best validation F1: 0.7347

Note: This used paper-exact hyperparameters from Ma & Hovy (2016)




## 9. Load Best Model and Final Evaluation

In [18]:
# Load best model
model.load_state_dict(torch.load('models/bilstm_crf_paper_exact_best.pt'))
model.eval()

print("Best model loaded!")

# Final evaluation
val_true_tags, val_pred_tags = evaluate(model, val_loader, device)

# Comprehensive report
print_evaluation_report(
    val_true_tags,
    val_pred_tags,
    val_tokens,
    model_name="BiLSTM-CNN-CRF (Paper-Exact)"
)

Best model loaded!


                                                               

ENTITY-SPAN LEVEL EVALUATION REPORT: BiLSTM-CNN-CRF (Paper-Exact)

OVERALL METRICS:
  Precision: 0.7525
  Recall:    0.7178
  F1 Score:  0.7347

  True Positives:  7990
  False Positives: 2628
  False Negatives: 3141

--------------------------------------------------------------------------------
PER-ENTITY-TYPE METRICS:
--------------------------------------------------------------------------------
Entity Type          Precision    Recall       F1           Support   
--------------------------------------------------------------------------------
Artist               0.7556       0.7824       0.7688       2399      
Facility             0.7241       0.6589       0.6900       1199      
HumanSettlement      0.9003       0.8803       0.8902       2790      
ORG                  0.7206       0.6754       0.6973       1562      
OtherPER             0.5587       0.5882       0.5731       1520      
Politician           0.7336       0.5457       0.6259       1171      
PublicCorp       

## 10. Save Model and Results

In [19]:
# Save vocabularies
vocab_data = {
    'word2idx': word2idx,
    'char2idx': char2idx,
    'tag2idx': tag2idx,
    'idx2word': idx2word,
    'idx2char': idx2char,
    'idx2tag': idx2tag
}

with open('models/bilstm_crf_paper_exact_vocab.pkl', 'wb') as f:
    pickle.dump(vocab_data, f)

print("Vocabularies saved!")

# Save results
final_results = evaluate_entity_spans(val_true_tags, val_pred_tags, val_tokens)

results_summary = {
    'model': 'BiLSTM-CNN-CRF (Paper-Exact: Ma & Hovy 2016)',
    'precision': final_results['precision'],
    'recall': final_results['recall'],
    'f1': final_results['f1'],
    'training_time': training_time,
    'num_epochs': epoch + 1,
    'hyperparameters': {
        'embedding_dim': EMBEDDING_DIM,
        'char_emb_dim': CHAR_EMB_DIM,
        'char_hidden_dim': CHAR_HIDDEN_DIM,
        'lstm_hidden_dim': LSTM_HIDDEN_DIM,
        'learning_rate': LEARNING_RATE,
        'batch_size': BATCH_SIZE,
        'optimizer': 'SGD with momentum=0.9',
        'lr_scheduler': 'ReduceLROnPlateau',
        'dropout': 0.5,
        'min_word_freq': MIN_WORD_FREQ
    },
    'paper_reference': 'Ma & Hovy (2016) - End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF',
    'paper_result_conll2003': '91.21% F1'
}

with open('models/bilstm_crf_paper_exact_results.json', 'w') as f:
    json.dump(results_summary, f, indent=2)

print("Results saved!")

Vocabularies saved!
Results saved!


## 11. Summary

### Paper-Exact Implementation:

This notebook **exactly replicates** Ma & Hovy (2016) hyperparameters:

âœ… **Character embedding**: 30d (Paper) vs 25d (M4.ipynb)  
âœ… **LSTM hidden size**: 100 (Paper) vs 256 (M4.ipynb)  
âœ… **Optimizer**: SGD + momentum=0.9 (Paper) vs Adam (M4.ipynb)  
âœ… **Learning rate**: 0.015 with decay (Paper) vs 0.001 fixed (M4.ipynb)  
âœ… **Batch size**: 10 (Paper) vs 32 (M4.ipynb)  
âœ… **Word frequency threshold**: 5 (Paper) vs 2 (M4.ipynb)  

### Expected Results:

**Paper (CoNLL-2003)**: 91.21% F1  
**Your dataset**: 80-85% F1 (harder - 15 entity types vs 4)

### Comparison:

You now have TWO implementations:
1. **M4.ipynb**: Our original (larger model, Adam optimizer)
2. **M4_Paper_Exact.ipynb**: Paper-exact (smaller model, SGD optimizer)

You can compare which one performs better on your dataset!

### Reference:

**Ma, X., & Hovy, E. (2016)**. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL 2016.
- Paper: https://arxiv.org/abs/1603.01354
- Result: 91.21% F1 on CoNLL-2003 English NER