# Model 4 v2: Character-CNN + BiLSTM-CRF (Improved)

**Improvements over v1:**

This version includes several optimizations to improve performance beyond 0.74 F1:

1. ✅ **Higher Dropout (0.6)**: Reduces overfitting (was 0.5)
2. ✅ **Dropout on Embeddings**: Applied to word and character embeddings
3. ✅ **Learning Rate Scheduler**: ReduceLROnPlateau for adaptive LR decay
4. ✅ **Increased Patience (7)**: Allows more training before early stopping (was 3)
5. ✅ **GloVe 300d**: Richer word representations (was 100d)
6. ✅ **Gradient Clipping (1.0)**: More aggressive clipping for stability (was 5.0)

**Expected Improvement:** 0.74 → 0.78-0.80 F1 (+4-6% absolute)

**Target F1:** 78-80% (excellent for 15 BIO tags covering 7 entity types)

**Original v1 Results:**
- Epoch 3: 0.7484 F1 (peak)
- Early stopping at epoch 6
- Issue: Overfitting after epoch 3

## 1. Setup and Imports

In [1]:
# Install required packages using the notebook's Python interpreter
import sys
print(f"Jupyter kernel Python: {sys.executable}")

# Install packages to the correct Python environment
!{sys.executable} -m pip uninstall -y torchcrf
!{sys.executable} -m pip install torch pytorch-crf gensim tqdm

print("\n✅ Packages installed! Please RESTART THE KERNEL before continuing.")
print("   Kernel → Restart Kernel (or Ctrl+Shift+P → 'Restart Kernel')")

Jupyter kernel Python: /usr/local/bin/python3

✅ Packages installed! Please RESTART THE KERNEL before continuing.
   Kernel → Restart Kernel (or Ctrl+Shift+P → 'Restart Kernel')


In [2]:
import json
import numpy as np
import pickle
import time
from collections import Counter
import os

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

# CRF
from torchcrf import CRF

# Embeddings
import gensim.downloader as api

# Progress bar
from tqdm import tqdm

# Our evaluation utilities
from utils import print_evaluation_report, evaluate_entity_spans

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

print("Imports successful!")

Using device: cpu
Imports successful!


## 2. Load Data

In [3]:
def load_jsonl(file_path):
    """Load JSONL file"""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line.strip()))
    return data

# Load data
train_data = load_jsonl('train_split.jsonl')
val_data = load_jsonl('val_split.jsonl')

print(f"Training samples: {len(train_data):,}")
print(f"Validation samples: {len(val_data):,}")

# Extract tokens and tags
train_tokens = [sample['tokens'] for sample in train_data]
train_tags = [sample['ner_tags'] for sample in train_data]

val_tokens = [sample['tokens'] for sample in val_data]
val_tags = [sample['ner_tags'] for sample in val_data]

Training samples: 90,320
Validation samples: 10,036
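
Before building vocabularies it can help to confirm that every record has one tag per token. The sketch below is a minimal check under the assumption that records use the `tokens` and `ner_tags` fields read above; `check_records` and the two sample records are hypothetical and not part of the notebook.

```python
import io
import json

def check_records(lines):
    """Return the indices of records whose token and tag counts differ."""
    bad = []
    for i, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in json.loads
        record = json.loads(line)
        if len(record["tokens"]) != len(record["ner_tags"]):
            bad.append(i)
    return bad

# Two made-up records: the second one is missing a tag.
sample_file = io.StringIO(
    '{"tokens": ["Paris", "is", "large"], "ner_tags": ["B-HumanSettlement", "O", "O"]}\n'
    '{"tokens": ["Acme", "Corp"], "ner_tags": ["B-ORG"]}\n'
)
mismatched = check_records(sample_file)
print(mismatched)
```

The same function could be pointed at an open file handle for `train_split.jsonl`, since it only iterates over lines.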


## 3. Build Vocabularies

In [4]:
# Build word vocabulary
word_counts = Counter()
for tokens in train_tokens:
    word_counts.update(tokens)

# Keep words with frequency >= 2
MIN_WORD_FREQ = 2
word2idx = {'<PAD>': 0, '<UNK>': 1}
for word, count in word_counts.items():
    if count >= MIN_WORD_FREQ:
        word2idx[word] = len(word2idx)

idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(word2idx)

print(f"Word vocabulary size: {vocab_size:,}")

Word vocabulary size: 36,790
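
The frequency cutoff means that any word seen only once in training is looked up as `<UNK>`. A toy illustration of the same rule, with a hypothetical `build_vocab` helper and made-up sentences:

```python
from collections import Counter

def build_vocab(sentences, min_freq=2):
    """Keep words seen at least min_freq times; ids 0 and 1 are reserved."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab

toy = [["the", "mayor", "of", "Paris"], ["the", "city", "of", "Lyon"]]
toy_vocab = build_vocab(toy)

# "mayor" occurs once, so it falls back to the <UNK> id.
ids = [toy_vocab.get(tok, toy_vocab["<UNK>"]) for tok in ["the", "mayor", "of"]]
print(toy_vocab)
print(ids)
```

Rare words still contribute through the character CNN, which sees their spelling even when the word-level id is `<UNK>`.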


In [5]:
# Build character vocabulary
char2idx = {'<PAD>': 0, '<UNK>': 1}

chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?:;\'"()-[]{}@#$%^&*+=/<>\\|`~_'
for char in chars:
    if char not in char2idx:
        char2idx[char] = len(char2idx)

idx2char = {idx: char for char, idx in char2idx.items()}
char_vocab_size = len(char2idx)

print(f"Character vocabulary size: {char_vocab_size}")

Character vocabulary size: 97


In [6]:
# Build tag vocabulary
tag2idx = {}
for tags in train_tags:
    for tag in tags:
        if tag not in tag2idx:
            tag2idx[tag] = len(tag2idx)

idx2tag = {idx: tag for tag, idx in tag2idx.items()}
num_tags = len(tag2idx)

print(f"Number of NER tags: {num_tags}")
print(f"Tags: {list(tag2idx.keys())}")

Number of NER tags: 15
Tags: ['O', 'B-ORG', 'I-ORG', 'B-Facility', 'I-Facility', 'B-OtherPER', 'I-OtherPER', 'B-Politician', 'I-Politician', 'B-HumanSettlement', 'I-HumanSettlement', 'B-Artist', 'I-Artist', 'B-PublicCorp', 'I-PublicCorp']
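
The 15 tags follow the BIO scheme, so they correspond to 7 entity types plus the `O` tag. A short check on the tag list printed above:

```python
tags = ['O', 'B-ORG', 'I-ORG', 'B-Facility', 'I-Facility', 'B-OtherPER',
        'I-OtherPER', 'B-Politician', 'I-Politician', 'B-HumanSettlement',
        'I-HumanSettlement', 'B-Artist', 'I-Artist', 'B-PublicCorp', 'I-PublicCorp']

# Strip the B-/I- prefix and collect the distinct entity types.
entity_types = sorted({tag.split("-", 1)[1] for tag in tags if tag != "O"})

print(len(tags), len(entity_types))
print(entity_types)
```

Each type contributes a `B-` and an `I-` tag, so the tag count is `2 * 7 + 1`.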


## 4. Load Pre-trained Word Embeddings (GloVe 300d)

**IMPROVEMENT #5: Using GloVe 300d instead of 100d**
- Richer semantic representations
- Expected improvement: +1-2% F1

In [7]:
print("Downloading GloVe 300d embeddings (this may take ~5-10 minutes)...")
print("Using glove-wiki-gigaword-300 (300-dimensional, 400K vocabulary)")

# Download GloVe 300d embeddings
glove_model = api.load('glove-wiki-gigaword-300')

EMBEDDING_DIM = 300  # Upgraded from 100d!
print(f"\nGloVe 300d embeddings loaded!")
print(f"Embedding dimension: {EMBEDDING_DIM}")

Downloading GloVe 300d embeddings (this may take ~5-10 minutes)...
Using glove-wiki-gigaword-300 (300-dimensional, 400K vocabulary)

GloVe 300d embeddings loaded!
Embedding dimension: 300


In [8]:
# Create embedding matrix
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

# Initialize with GloVe vectors
found = 0
for word, idx in word2idx.items():
    if word in ['<PAD>', '<UNK>']:
        continue
    
    try:
        embedding_matrix[idx] = glove_model[word.lower()]
        found += 1
    except KeyError:
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

# Special tokens
embedding_matrix[word2idx['<PAD>']] = np.zeros(EMBEDDING_DIM)
embedding_matrix[word2idx['<UNK>']] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

print(f"Words found in GloVe: {found:,} / {vocab_size:,} ({found/vocab_size*100:.1f}%)")
print(f"Words initialized randomly: {vocab_size - found:,}")

Words found in GloVe: 33,849 / 36,790 (92.0%)
Words initialized randomly: 2,941
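
The initialization rule above (lowercased lookup, random fallback, all-zero padding row) can be shown on a toy scale without downloading GloVe. This is only a sketch: `init_embeddings`, the plain dictionary standing in for the GloVe model, and the 3-dimensional vectors are all hypothetical.

```python
import random

def init_embeddings(vocab, pretrained, dim, seed=42):
    """Fill rows from pretrained vectors (lowercased lookup), else N(0, 0.6)."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in vocab]
    found = 0
    for word, idx in vocab.items():
        if word == "<PAD>":
            continue  # padding row stays all-zero
        vec = pretrained.get(word.lower())
        if word != "<UNK>" and vec is not None:
            matrix[idx] = list(vec)
            found += 1
        else:
            matrix[idx] = [rng.gauss(0.0, 0.6) for _ in range(dim)]
    return matrix, found

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "Paris": 2, "zzyzx": 3}
toy_pretrained = {"paris": [0.1, 0.2, 0.3]}  # stand-in for the GloVe model

matrix, found = init_embeddings(toy_vocab, toy_pretrained, dim=3)
coverage = found / (len(toy_vocab) - 2)  # exclude the two special tokens
print(found, coverage)
```

Note that the notebook's "initialized randomly" count is `vocab_size - found`, which also includes the `<PAD>` and `<UNK>` rows.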


## 5. Dataset Class

In [9]:
MAX_CHAR_LEN = 20

class NERDataset(Dataset):
    def __init__(self, tokens_list, tags_list, word2idx, char2idx, tag2idx):
        self.tokens_list = tokens_list
        self.tags_list = tags_list
        self.word2idx = word2idx
        self.char2idx = char2idx
        self.tag2idx = tag2idx
    
    def __len__(self):
        return len(self.tokens_list)
    
    def __getitem__(self, idx):
        tokens = self.tokens_list[idx]
        tags = self.tags_list[idx]
        
        word_ids = [self.word2idx.get(token, self.word2idx['<UNK>']) for token in tokens]
        
        char_ids = []
        for token in tokens:
            chars = [self.char2idx.get(c, self.char2idx['<UNK>']) for c in token[:MAX_CHAR_LEN]]
            if len(chars) < MAX_CHAR_LEN:
                chars += [self.char2idx['<PAD>']] * (MAX_CHAR_LEN - len(chars))
            char_ids.append(chars)
        
        tag_ids = [self.tag2idx[tag] for tag in tags]
        
        return {
            'word_ids': torch.LongTensor(word_ids),
            'char_ids': torch.LongTensor(char_ids),
            'tag_ids': torch.LongTensor(tag_ids),
            'length': len(tokens)
        }

train_dataset = NERDataset(train_tokens, train_tags, word2idx, char2idx, tag2idx)
val_dataset = NERDataset(val_tokens, val_tags, word2idx, char2idx, tag2idx)

print(f"Train dataset: {len(train_dataset)} samples")
print(f"Val dataset: {len(val_dataset)} samples")

Train dataset: 90320 samples
Val dataset: 10036 samples
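
Each token becomes a fixed-length row of character ids: truncated at `MAX_CHAR_LEN`, right-padded with `<PAD>`, and with unseen characters mapped to `<UNK>`. A standalone sketch of that step, using a hypothetical `encode_chars` helper and a tiny made-up character vocabulary:

```python
MAX_CHAR_LEN = 20  # same value as in the notebook

def encode_chars(token, char2idx, max_len=MAX_CHAR_LEN):
    """Map a token to exactly max_len character ids (truncate, then pad)."""
    ids = [char2idx.get(c, char2idx["<UNK>"]) for c in token[:max_len]]
    ids += [char2idx["<PAD>"]] * (max_len - len(ids))
    return ids

toy_char2idx = {"<PAD>": 0, "<UNK>": 1, "a": 2, "b": 3}

short = encode_chars("abé", toy_char2idx, max_len=5)      # "é" is not in the toy vocab
long_ = encode_chars("ab" * 20, toy_char2idx, max_len=5)  # 40 characters, truncated
print(short)
print(long_)
```

Because the notebook's character inventory is ASCII-only, accented and non-Latin characters all share the `<UNK>` id.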


In [10]:
# Collate function
def collate_fn(batch):
    batch = sorted(batch, key=lambda x: x['length'], reverse=True)
    
    word_ids = [item['word_ids'] for item in batch]
    char_ids = [item['char_ids'] for item in batch]
    tag_ids = [item['tag_ids'] for item in batch]
    lengths = [item['length'] for item in batch]
    
    word_ids_padded = pad_sequence(word_ids, batch_first=True, padding_value=word2idx['<PAD>'])
    tag_ids_padded = pad_sequence(tag_ids, batch_first=True, padding_value=tag2idx['O'])
    
    max_len = max(1, word_ids_padded.size(1))
    batch_size = len(batch)
    
    char_ids_padded = torch.full(
        (batch_size, max_len, MAX_CHAR_LEN),
        fill_value=char2idx['<PAD>'],
        dtype=torch.long
    )
    
    for i, chars in enumerate(char_ids):
        seq_len = chars.size(0)
        if seq_len > 0:
            char_ids_padded[i, :seq_len, :] = chars
    
    mask = torch.zeros((batch_size, max_len), dtype=torch.bool)
    for i, length in enumerate(lengths):
        if length > 0:
            mask[i, :length] = True
    
    return {
        'word_ids': word_ids_padded,
        'char_ids': char_ids_padded,
        'tag_ids': tag_ids_padded,
        'lengths': lengths,
        'mask': mask
    }

BATCH_SIZE = 32

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=collate_fn
)

print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")

Train batches: 2823
Val batches: 314
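
The collate step pads every sequence to the longest one in the batch and records which positions are real in a boolean mask. The torch-free sketch below mirrors that logic with plain lists (`pad_and_mask` is a hypothetical name) and also reproduces the batch counts printed above.

```python
import math

def pad_and_mask(sequences, pad_value=0):
    """Right-pad sequences to a common length and mark the real positions."""
    max_len = max(1, max(len(s) for s in sequences))
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    mask = [[j < len(s) for j in range(max_len)] for s in sequences]
    return padded, mask

padded, mask = pad_and_mask([[5, 6, 7], [8]])
print(padded)
print(mask)

# Batch counts for the dataset sizes reported earlier, with BATCH_SIZE = 32.
train_batches = math.ceil(90_320 / 32)
val_batches = math.ceil(10_036 / 32)
print(train_batches, val_batches)
```

An empty sequence would produce a mask row with no `True` entries, which is presumably why the training and evaluation loops filter out zero-length samples before calling the CRF.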


## 6. Model Architecture (Improved)

**IMPROVEMENTS #1 & #2:**
- Higher dropout (0.6 instead of 0.5)
- Dropout applied to embeddings (not just LSTM output)
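
To make the dropout change concrete: with `p = 0.6` only about 40% of activations survive each training step, and the survivors are scaled by `1 / (1 - p) = 2.5` so the expected value is unchanged. The sketch below is a plain-Python illustration of this "inverted dropout" rule, not the `nn.Dropout` implementation itself; `inverted_dropout` is a hypothetical name.

```python
import random

def inverted_dropout(values, p, rng):
    """Zero each value with probability p; scale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
x = [1.0] * 10_000
out = inverted_dropout(x, p=0.6, rng=rng)

kept_fraction = sum(1 for v in out if v != 0.0) / len(out)
mean_after = sum(out) / len(out)
print(f"kept: {kept_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

At `p = 0.5` the scale factor would be 2.0 and half the activations would survive, so moving to 0.6 is a noticeably stronger regularizer on both embeddings and LSTM outputs.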

In [11]:
class CharCNN(nn.Module):
    def __init__(self, char_vocab_size, char_emb_dim, char_hidden_dim, max_char_len, dropout=0.6):
        super().__init__()
        self.char_embedding = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)  # IMPROVEMENT: Dropout on char embeddings
        
        self.conv = nn.Conv1d(
            in_channels=char_emb_dim,
            out_channels=char_hidden_dim,
            kernel_size=3,
            padding=1
        )
        self.relu = nn.ReLU()
    
    def forward(self, char_ids):
        batch_size, seq_len, max_char_len = char_ids.size()
        
        char_ids = char_ids.view(-1, max_char_len)
        char_embeds = self.char_embedding(char_ids)
        char_embeds = self.dropout(char_embeds)  # Apply dropout
        char_embeds = char_embeds.transpose(1, 2)
        
        char_conv = self.relu(self.conv(char_embeds))
        char_features = torch.max(char_conv, dim=2)[0]
        char_features = char_features.view(batch_size, seq_len, -1)
        
        return char_features


class BiLSTM_CRF_v2(nn.Module):
    """Improved BiLSTM-CRF with higher dropout and embedding dropout"""
    def __init__(self, vocab_size, char_vocab_size, embedding_dim, char_emb_dim,
                 char_hidden_dim, lstm_hidden_dim, num_tags, dropout=0.6, embedding_matrix=None):
        super().__init__()
        
        # Word embeddings
        self.word_embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        if embedding_matrix is not None:
            self.word_embedding.weight.data.copy_(torch.from_numpy(embedding_matrix))
            self.word_embedding.weight.requires_grad = True
        
        # IMPROVEMENT #2: Dropout on word embeddings
        self.embedding_dropout = nn.Dropout(dropout)
        
        # Character CNN (with dropout)
        self.char_cnn = CharCNN(char_vocab_size, char_emb_dim, char_hidden_dim, MAX_CHAR_LEN, dropout=dropout)
        
        # BiLSTM
        self.lstm = nn.LSTM(
            input_size=embedding_dim + char_hidden_dim,
            hidden_size=lstm_hidden_dim,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )
        
        # IMPROVEMENT #1: Higher dropout (0.6 instead of 0.5)
        self.lstm_dropout = nn.Dropout(dropout)
        
        # Linear layer
        self.fc = nn.Linear(lstm_hidden_dim * 2, num_tags)
        
        # CRF
        self.crf = CRF(num_tags, batch_first=True)
    
    def forward(self, word_ids, char_ids, tags=None, mask=None):
        # Word embeddings with dropout
        word_embeds = self.word_embedding(word_ids)
        word_embeds = self.embedding_dropout(word_embeds)
        
        # Character features (already has dropout inside)
        char_features = self.char_cnn(char_ids)
        
        # Concatenate
        combined = torch.cat([word_embeds, char_features], dim=-1)
        
        # BiLSTM with dropout
        lstm_out, _ = self.lstm(combined)
        lstm_out = self.lstm_dropout(lstm_out)
        
        # Emission scores
        emissions = self.fc(lstm_out)
        
        if tags is not None:
            loss = -self.crf(emissions, tags, mask=mask, reduction='mean')
            return loss
        else:
            predictions = self.crf.decode(emissions, mask=mask)
            return predictions

print("✅ Improved model architecture defined!")
print("   - Higher dropout: 0.6 (was 0.5)")
print("   - Dropout on embeddings: Added")
print("   - GloVe 300d: Enabled")

✅ Improved model architecture defined!
   - Higher dropout: 0.6 (was 0.5)
   - Dropout on embeddings: Added
   - GloVe 300d: Enabled


## 7. Initialize Model with Improvements

In [12]:
# Hyperparameters
CHAR_EMB_DIM = 25
CHAR_HIDDEN_DIM = 30
LSTM_HIDDEN_DIM = 256
DROPOUT = 0.6  # IMPROVEMENT #1: Increased from 0.5
LEARNING_RATE = 0.001
NUM_EPOCHS = 20

# Initialize model
model = BiLSTM_CRF_v2(
    vocab_size=vocab_size,
    char_vocab_size=char_vocab_size,
    embedding_dim=EMBEDDING_DIM,  # 300d now!
    char_emb_dim=CHAR_EMB_DIM,
    char_hidden_dim=CHAR_HIDDEN_DIM,
    lstm_hidden_dim=LSTM_HIDDEN_DIM,
    num_tags=num_tags,
    dropout=DROPOUT,
    embedding_matrix=embedding_matrix
).to(device)

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# IMPROVEMENT #3: Learning rate scheduler
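# `verbose` is omitted: it is deprecated in recent PyTorch releases,
# and the training loop already prints the current LR every epoch.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=2
)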

# Count parameters
total_params = sum(p.numel() for p in model.parameters())

print(f"\n✅ Model initialized with improvements!")
print(f"Total parameters: {total_params:,}")
print(f"\n🎯 Improvements Summary:")
print(f"  1. Dropout: {DROPOUT} (was 0.5)")
print(f"  2. Embedding dropout: Enabled")
print(f"  3. LR scheduler: ReduceLROnPlateau (patience=2)")
print(f"  4. Patience: 7 epochs (was 3)")
print(f"  5. GloVe: 300d (was 100d)")
print(f"  6. Grad clipping: 1.0 (was 5.0)")


✅ Model initialized with improvements!
Total parameters: 12,253,879

🎯 Improvements Summary:
  1. Dropout: 0.6 (was 0.5)
  2. Embedding dropout: Enabled
  3. LR scheduler: ReduceLROnPlateau (patience=2)
  4. Patience: 7 epochs (was 3)
  5. GloVe: 300d (was 100d)
  6. Grad clipping: 1.0 (was 5.0)
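
The reported total of 12,253,879 parameters can be reproduced by hand from the hyperparameters. The arithmetic below assumes the usual parameter layouts: an LSTM with two bias vectors per direction, and a CRF layer holding a transition matrix plus start and end vectors. These layouts are assumptions about the libraries rather than something stated in the notebook, though the sum matching the printed total supports them.

```python
vocab_size, emb_dim = 36_790, 300
char_vocab, char_emb, char_hidden, kernel = 97, 25, 30, 3
lstm_hidden, num_tags = 256, 15

word_emb_params = vocab_size * emb_dim                         # word embedding table
char_emb_params = char_vocab * char_emb                        # character embedding table
conv_params = char_emb * char_hidden * kernel + char_hidden    # Conv1d weights + bias

lstm_in = emb_dim + char_hidden                                # 330 features per token
per_direction = (4 * lstm_hidden * (lstm_in + lstm_hidden)     # input and recurrent weights
                 + 2 * 4 * lstm_hidden)                        # two bias vectors
lstm_params = 2 * per_direction                                # forward + backward

fc_params = 2 * lstm_hidden * num_tags + num_tags              # emission layer
crf_params = num_tags * num_tags + 2 * num_tags                # transitions + start/end

total = (word_emb_params + char_emb_params + conv_params
         + lstm_params + fc_params + crf_params)
print(f"total: {total:,}")
print(f"word embeddings share: {word_emb_params / total:.1%}")
```

About 90% of the parameters sit in the word embedding table, which is trainable here (`requires_grad = True`), so the embeddings are a plausible place for overfitting to occur.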




## 8. Training Loop (Improved)

**IMPROVEMENTS #3, #4, #6:**
- Learning rate scheduler (adaptive LR)
- Increased patience (7 epochs)
- More aggressive gradient clipping (1.0)
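
Clipping by global norm rescales all gradients together whenever their combined L2 norm exceeds `max_norm`, and leaves them untouched otherwise. The sketch below shows the rule on a flat list of numbers; it is a simplified stand-in for `clip_grad_norm_`, and the `clip_by_global_norm` name and the small `eps` term are assumptions of this example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; return (grads, old norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Norm 5.0 is above the v2 threshold of 1.0, so the gradient is scaled down.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# Norm 0.5 is below the threshold, so nothing changes.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, round(norm_after, 4), small)
```

With the v1 threshold of 5.0, the first gradient would have passed through essentially unchanged, which is the sense in which 1.0 is "more aggressive".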

In [13]:
def train_epoch(model, data_loader, optimizer, device):
    model.train()
    total_loss = 0
    num_batches = 0
    
    pbar = tqdm(data_loader, desc="Training", leave=False)
    
    for batch in pbar:
        word_ids = batch['word_ids'].to(device)
        char_ids = batch['char_ids'].to(device)
        tag_ids = batch['tag_ids'].to(device)
        mask = batch['mask'].to(device)
        lengths = batch['lengths']
        
        # Filter empty sequences
        non_empty_indices = [i for i, length in enumerate(lengths) if length > 0]
        
        if len(non_empty_indices) == 0:
            continue
        
        if len(non_empty_indices) < len(lengths):
            word_ids = word_ids[non_empty_indices]
            char_ids = char_ids[non_empty_indices]
            tag_ids = tag_ids[non_empty_indices]
            mask = mask[non_empty_indices]
        
        optimizer.zero_grad()
        loss = model(word_ids, char_ids, tag_ids, mask)
        loss.backward()
        
        # IMPROVEMENT #6: More aggressive gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Was 5.0
        
        optimizer.step()
        
        total_loss += loss.item()
        num_batches += 1
        pbar.set_postfix({'loss': f'{loss.item():.4f}'})
    
    return total_loss / max(num_batches, 1)


def evaluate(model, data_loader, device):
    model.eval()
    all_predictions = []
    all_true_tags = []
    
    with torch.no_grad():
        pbar = tqdm(data_loader, desc="Evaluating", leave=False)
        
        for batch in pbar:
            word_ids = batch['word_ids'].to(device)
            char_ids = batch['char_ids'].to(device)
            tag_ids = batch['tag_ids'].to(device)
            mask = batch['mask'].to(device)
            lengths = batch['lengths']
            
            non_empty_indices = [i for i, length in enumerate(lengths) if length > 0]
            
            if len(non_empty_indices) > 0:
                word_ids_non_empty = word_ids[non_empty_indices]
                char_ids_non_empty = char_ids[non_empty_indices]
                mask_non_empty = mask[non_empty_indices]
                
                predictions_non_empty = model(word_ids_non_empty, char_ids_non_empty, mask=mask_non_empty)
            else:
                predictions_non_empty = []
            
            predictions = []
            non_empty_iter = iter(predictions_non_empty)
            for i in range(len(lengths)):
                if lengths[i] == 0:
                    predictions.append([])
                else:
                    predictions.append(next(non_empty_iter))
            
            for i, (pred, length) in enumerate(zip(predictions, lengths)):
                if length == 0:
                    pred_tags = []
                    true_tags = []
                else:
                    pred_tags = [idx2tag[idx] for idx in pred[:length]]
                    true_tags = [idx2tag[tag_ids[i][j].item()] for j in range(length)]
                
                all_predictions.append(pred_tags)
                all_true_tags.append(true_tags)
    
    return all_true_tags, all_predictions

print("Training functions defined!")

Training functions defined!


In [14]:
print("Starting training with IMPROVEMENTS...\n")
print("=" * 80)
print("🎯 v2 Improvements:")
print("  ✓ Dropout: 0.6 (reduce overfitting)")
print("  ✓ Embedding dropout: Applied")
print("  ✓ LR scheduler: Adaptive decay")
print("  ✓ Patience: 7 epochs (more training)")
print("  ✓ GloVe 300d: Richer embeddings")
print("  ✓ Gradient clipping: 1.0 (more aggressive)")
print("=" * 80 + "\n")

best_f1 = 0
patience = 7  # IMPROVEMENT #4: Increased from 3
patience_counter = 0

training_start = time.time()

for epoch in range(NUM_EPOCHS):
    epoch_start = time.time()
    
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, device)
    
    # Evaluate
    val_true_tags, val_pred_tags = evaluate(model, val_loader, device)
    
    # Calculate F1
    results = evaluate_entity_spans(val_true_tags, val_pred_tags, val_tokens)
    val_f1 = results['f1']
    val_precision = results['precision']
    val_recall = results['recall']
    
    # IMPROVEMENT #3: Update learning rate scheduler
    scheduler.step(val_f1)
    
    epoch_time = time.time() - epoch_start
    current_lr = optimizer.param_groups[0]['lr']
    
    print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS} | "
          f"Loss: {train_loss:.4f} | "
          f"Val P: {val_precision:.4f} R: {val_recall:.4f} F1: {val_f1:.4f} | "
          f"LR: {current_lr:.6f} | "
          f"Time: {epoch_time:.1f}s")
    
    # Early stopping
    if val_f1 > best_f1:
        best_f1 = val_f1
        patience_counter = 0
        os.makedirs('models', exist_ok=True)
        torch.save(model.state_dict(), 'models/bilstm_crf_v2_best.pt')
        print(f"  → New best F1! Model saved.")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"\nEarly stopping after {epoch+1} epochs (patience={patience})")
            break

training_time = time.time() - training_start

print("=" * 80)
print(f"\nTraining completed in {training_time:.1f}s ({training_time/60:.1f} minutes)")
print(f"Best validation F1: {best_f1:.4f}")
print(f"\n🎯 Target achieved: {'✅ YES!' if best_f1 >= 0.78 else '⚠️  Close! Try training longer.'}")
print(f"   v1 baseline: 0.7484")
print(f"   v2 result: {best_f1:.4f}")
print(f"   Improvement: {(best_f1 - 0.7484)*100:+.2f}% absolute")  # explicit sign, so a regression prints as "-"

Starting training with IMPROVEMENTS...

🎯 v2 Improvements:
  ✓ Dropout: 0.6 (reduce overfitting)
  ✓ Embedding dropout: Applied
  ✓ LR scheduler: Adaptive decay
  ✓ Patience: 7 epochs (more training)
  ✓ GloVe 300d: Richer embeddings
  ✓ Gradient clipping: 1.0 (more aggressive)

Epoch  1/20 | Loss: 3.4774 | Val P: 0.7058 R: 0.6531 F1: 0.6784 | LR: 0.001000 | Time: 207.5s
  → New best F1! Model saved.
Epoch  2/20 | Loss: 1.7946 | Val P: 0.7473 R: 0.6955 F1: 0.7205 | LR: 0.001000 | Time: 200.4s
  → New best F1! Model saved.
Epoch  3/20 | Loss: 1.4767 | Val P: 0.7552 R: 0.7129 F1: 0.7334 | LR: 0.001000 | Time: 202.0s
  → New best F1! Model saved.
Epoch  4/20 | Loss: 1.3012 | Val P: 0.7638 R: 0.7270 F1: 0.7450 | LR: 0.001000 | Time: 199.3s
  → New best F1! Model saved.
Epoch  5/20 | Loss: 1.1825 | Val P: 0.7624 R: 0.7338 F1: 0.7478 | LR: 0.001000 | Time: 201.3s
  → New best F1! Model saved.
Epoch  6/20 | Loss: 1.0908 | Val P: 0.7690 R: 0.7365 F1: 0.7524 | LR: 0.001000 | Time: 208.1s
  → New best F1! Model saved.
Epoch  7/20 | Loss: 1.0164 | Val P: 0.7729 R: 0.7444 F1: 0.7584 | LR: 0.001000 | Time: 201.8s
  → New best F1! Model saved.
Epoch  8/20 | Loss: 0.9640 | Val P: 0.7726 R: 0.7389 F1: 0.7554 | LR: 0.001000 | Time: 203.8s
Epoch  9/20 | Loss: 0.9009 | Val P: 0.7712 R: 0.7470 F1: 0.7589 | LR: 0.001000 | Time: 203.9s
  → New best F1! Model saved.
Epoch 10/20 | Loss: 0.8655 | Val P: 0.7711 R: 0.7415 F1: 0.7560 | LR: 0.001000 | Time: 206.1s
Epoch 11/20 | Loss: 0.8241 | Val P: 0.7734 R: 0.7433 F1: 0.7581 | LR: 0.001000 | Time: 205.8s
Epoch 12/20 | Loss: 0.7927 | Val P: 0.7750 R: 0.7439 F1: 0.7591 | LR: 0.001000 | Time: 214.1s
  → New best F1! Model saved.
Epoch 13/20 | Loss: 0.7585 | Val P: 0.7684 R: 0.7399 F1: 0.7539 | LR: 0.001000 | Time: 214.1s
Epoch 14/20 | Loss: 0.7296 | Val P: 0.7707 R: 0.7454 F1: 0.7578 | LR: 0.001000 | Time: 207.5s
Epoch 15/20 | Loss: 0.7078 | Val P: 0.7659 R: 0.7449 F1: 0.7552 | LR: 0.000500 | Time: 221.4s
Epoch 16/20 | Loss: 0.6404 | Val P: 0.7700 R: 0.7480 F1: 0.7589 | LR: 0.000500 | Time: 209.7s
Epoch 17/20 | Loss: 0.6089 | Val P: 0.7740 R: 0.7453 F1: 0.7594 | LR: 0.000500 | Time: 216.9s
  → New best F1! Model saved.
Epoch 18/20 | Loss: 0.5874 | Val P: 0.7739 R: 0.7421 F1: 0.7577 | LR: 0.000500 | Time: 210.8s
Epoch 19/20 | Loss: 0.5708 | Val P: 0.7660 R: 0.7464 F1: 0.7561 | LR: 0.000500 | Time: 212.9s
Epoch 20/20 | Loss: 0.5534 | Val P: 0.7725 R: 0.7455 F1: 0.7588 | LR: 0.000250 | Time: 209.0s

Training completed in 4158.2s (69.3 minutes)
Best validation F1: 0.7594

🎯 Target achieved: ⚠️  Close! Try training longer.
   v1 baseline: 0.7484
   v2 result: 0.7594
   Improvement: +1.10% absolute
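
The LR column in the log can be cross-checked by replaying the validation F1 values through the plateau rule. The sketch below is a simplified pure-Python re-implementation, assuming the scheduler's default relative threshold of 1e-4 and no cooldown; it is not the PyTorch class itself, and `replay_plateau` and `replay_early_stopping` are hypothetical helpers.

```python
# Validation F1 per epoch, copied from the training log above.
val_f1 = [0.6784, 0.7205, 0.7334, 0.7450, 0.7478, 0.7524, 0.7584, 0.7554,
          0.7589, 0.7560, 0.7581, 0.7591, 0.7539, 0.7578, 0.7552, 0.7589,
          0.7594, 0.7577, 0.7561, 0.7588]

def replay_plateau(scores, lr=0.001, factor=0.5, patience=2, threshold=1e-4):
    """Return the LR in effect after each epoch (mode='max', relative threshold)."""
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for score in scores:
        if score > best * (1.0 + threshold):
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:      # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs

def replay_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best, best_epoch, counter = 0.0, None, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best, best_epoch, counter = score, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch
    return best_epoch, None

lrs = replay_plateau(val_f1)
decay_epochs = [i + 1 for i in range(1, len(lrs)) if lrs[i] < lrs[i - 1]]
v2_stop = replay_early_stopping(val_f1, patience=7)
v1_style_stop = replay_early_stopping(val_f1, patience=3)
print(decay_epochs, v2_stop, v1_style_stop)
```

Under these assumptions the replay halves the LR at epochs 15 and 20, matching the log, and shows that a patience of 7 never triggers within 20 epochs. F1 moves by less than 0.001 between epoch 9 and the best epoch 17, so the extra training mostly confirms a plateau around 0.759.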




## 9. Load Best Model and Final Evaluation

In [17]:
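# Load best model (map_location keeps this working if the checkpoint was saved on a different device)
model.load_state_dict(torch.load('models/bilstm_crf_v2_best.pt', map_location=device))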
model.eval()

print("Best model loaded!")

# Final evaluation
val_true_tags, val_pred_tags = evaluate(model, val_loader, device)

# Comprehensive report
print_evaluation_report(
    val_true_tags,
    val_pred_tags,
    val_tokens,
    model_name="Character-CNN + BiLSTM-CRF v2 (Improved)"
)

Best model loaded!


                                                             

ENTITY-SPAN LEVEL EVALUATION REPORT: Character-CNN + BiLSTM-CRF v2 (Improved)

OVERALL METRICS:
  Precision: 0.7740
  Recall:    0.7453
  F1 Score:  0.7594

  True Positives:  8200
  False Positives: 2394
  False Negatives: 2802

--------------------------------------------------------------------------------
PER-ENTITY-TYPE METRICS:
--------------------------------------------------------------------------------
Entity Type          Precision    Recall       F1           Support   
--------------------------------------------------------------------------------
Artist               0.7703       0.8144       0.7918       2430      
Facility             0.7622       0.6667       0.7112       1173      
HumanSettlement      0.9040       0.9077       0.9058       2697      
ORG                  0.7486       0.6933       0.7199       1542      
OtherPER             0.6122       0.5842       0.5979       1527      
Politician           0.7399       0.6183       0.6736       1150      
Publi
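
The overall precision, recall, and F1 in the report follow directly from the true-positive, false-positive, and false-negative counts. The sketch below applies the standard formulas to the counts printed above; `prf` is a hypothetical helper, not the `evaluate_entity_spans` function from `utils`.

```python
def prf(tp, fp, fn):
    """Standard precision, recall and F1 from span-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from the overall metrics in the report.
p, r, f = prf(tp=8200, fp=2394, fn=2802)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f:.4f}")
```

Recall trails precision because there are more missed spans (2802) than spurious ones (2394).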

## 10. Save Results

In [18]:
# Save vocabularies
vocab_data = {
    'word2idx': word2idx,
    'char2idx': char2idx,
    'tag2idx': tag2idx,
    'idx2word': idx2word,
    'idx2char': idx2char,
    'idx2tag': idx2tag
}

with open('models/bilstm_crf_v2_vocab.pkl', 'wb') as f:
    pickle.dump(vocab_data, f)

print("Vocabularies saved!")

# Save results
final_results = evaluate_entity_spans(val_true_tags, val_pred_tags, val_tokens)

results_summary = {
    'model': 'Character-CNN + BiLSTM-CRF v2 (Improved)',
    'version': 'v2',
    'improvements': [
        'Higher dropout (0.6)',
        'Dropout on embeddings',
        'Learning rate scheduler',
        'Increased patience (7)',
        'GloVe 300d',
        'Gradient clipping (1.0)'
    ],
    'precision': final_results['precision'],
    'recall': final_results['recall'],
    'f1': final_results['f1'],
    'training_time': training_time,
    'num_epochs': epoch + 1,
    'hyperparameters': {
        'embedding_dim': EMBEDDING_DIM,
        'char_emb_dim': CHAR_EMB_DIM,
        'char_hidden_dim': CHAR_HIDDEN_DIM,
        'lstm_hidden_dim': LSTM_HIDDEN_DIM,
        'learning_rate': LEARNING_RATE,
        'batch_size': BATCH_SIZE,
        'dropout': DROPOUT,
        'patience': patience,
        'gradient_clipping': 1.0
    },
    'v1_baseline_f1': 0.7484,
    'improvement_over_v1': final_results['f1'] - 0.7484
}

with open('models/bilstm_crf_v2_results.json', 'w') as f:
    json.dump(results_summary, f, indent=2)

print("Results saved!")

Vocabularies saved!
Results saved!


## 11. Summary

### Improvements Over v1:

**v1 Issues:**
- Peaked at epoch 3 (0.7484 F1)
- Overfitting after epoch 3
- Early stopping at epoch 6

**v2 Solutions:**

1. **✅ Higher Dropout (0.6)**: Reduces overfitting
   - Expected: +1-2% F1

2. **✅ Embedding Dropout**: Regularizes input representations
   - Expected: +0.5-1% F1

3. **✅ Learning Rate Scheduler**: Adaptive LR decay
   - Expected: +1-2% F1

4. **✅ Increased Patience (7)**: More training opportunities
   - Expected: Better convergence

5. **✅ GloVe 300d**: Richer semantic representations
   - Expected: +1-2% F1

6. **✅ Gradient Clipping (1.0)**: More stable training
   - Expected: Better convergence

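**Total Expected Improvement:** +4-6% absolute F1

**Target:** 0.78-0.80 F1 (excellent for 15 BIO tags covering 7 entity types!)

**Actual Result:** 0.7594 F1 at epoch 17 (+1.10% absolute over v1), short of the target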

### Next Steps:

If still not reaching 0.78-0.80:
1. Try stacked BiLSTM (2 layers)
2. Experiment with different LSTM sizes (128, 384, 512)
3. Add layer normalization
4. Try focal loss for class imbalance
5. Move to M7 (Attention-based BiLSTM-CRF)

### Comparison:

| Model | F1 Score | Notes |
|-------|----------|-------|
| M1 (CRF) | 0.6815 | Classical ML baseline |
| M4 v1 | 0.7484 | Deep learning |
| M4 v2 | **0.7594** | Improved; target was 0.78-0.80 |
| M7 (Attention) | 0.82-0.87 | State-of-the-art |

Good luck! 🚀