# Entity-Span Level Evaluation Metrics

This notebook demonstrates how to use the evaluation functions from `utils.py` for entity-span level evaluation.

## Why Entity-Span Level?

The assignment requires **entity-span level evaluation**, not token-level:
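- A predicted entity is correct **ONLY** if the entire span matches: same start, same end, and same type
- Partial matches do NOT count: a prediction that overlaps a true entity but differs in either boundary is scored as one false positive plus one false negative
- This is "strict" (exact-match) evaluation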
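The strict matching rule can be sketched with plain set operations. This is a minimal illustration, not the code in `utils.py`: the name `strict_prf` and the `(start, end, type)` tuple layout are assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical span layout for this sketch: (start, end, entity_type), end inclusive.
Span = Tuple[int, int, str]


def strict_prf(true_spans: Set[Span], pred_spans: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over two sets of spans."""
    tp = len(true_spans & pred_spans)  # identical start, end, and type
    fp = len(pred_spans - true_spans)  # predicted, but not in the gold set
    fn = len(true_spans - pred_spans)  # in the gold set, but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


gold = {(0, 3, "Politician")}
too_short = {(0, 1, "Politician")}   # same type, wrong end boundary
wrong_type = {(0, 3, "Artist")}      # same boundaries, wrong type

print(strict_prf(gold, too_short))   # tp=0, fp=1, fn=1, so F1 = 0.0
print(strict_prf(gold, wrong_type))  # tp=0, fp=1, fn=1, so F1 = 0.0
```

Because spans are compared as whole tuples, a boundary error and a type error are penalised identically: the prediction is a false positive and the gold entity is a false negative.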

## What's in utils.py?

1. `extract_entities()` - Extract entity spans from BIO tags
2. `evaluate_entity_spans()` - Calculate overall P/R/F1
3. `evaluate_entity_spans_by_type()` - Calculate per-entity-type metrics
4. `print_evaluation_report()` - Pretty print results
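The source of `utils.py` is not shown in this notebook, so here is a rough sketch of how BIO tags can be decoded into spans. The function name `decode_bio` is hypothetical, and dropping an `I-` tag that has no matching open entity is one possible convention, not necessarily the one `extract_entities()` follows. The output tuple layout `(text, type, start, end)` mirrors what the test cells below print.

```python
def decode_bio(tokens, tags):
    """Decode BIO tags into (text, type, start, end) tuples, end index inclusive."""
    entities = []
    start, etype = None, None  # state of the currently open entity, if any

    def close(end):
        # Record the open entity covering tokens[start..end].
        entities.append((" ".join(tokens[start:end + 1]), etype, start, end))

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                close(i - 1)           # a new B- tag ends the previous entity
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                   # same entity keeps going
        else:
            # 'O', or an I- tag with no matching open entity (assumption: dropped).
            if start is not None:
                close(i - 1)
            start, etype = None, None
    if start is not None:
        close(len(tags) - 1)           # entity runs to the end of the sentence
    return entities


tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags = ["B-Politician", "I-Politician", "O", "B-HumanSettlement", "O"]
print(decode_bio(tokens, tags))
# [('Barack Obama', 'Politician', 0, 1), ('Paris', 'HumanSettlement', 3, 3)]
```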

In [1]:
# Import evaluation functions
from utils import (
    extract_entities,
    evaluate_entity_spans,
    evaluate_entity_spans_by_type,
    print_evaluation_report
)

import json
import numpy as np

## 1. Test Evaluation Functions

Let's test the evaluation functions with simple examples to understand how they work.

In [2]:
print("=" * 80)
print("TEST CASE 1: Perfect Prediction")
print("=" * 80)

tokens = [["Barack", "Obama", "visited", "Paris", "."]]
true_tags = [["B-Politician", "I-Politician", "O", "B-HumanSettlement", "O"]]
pred_tags = [["B-Politician", "I-Politician", "O", "B-HumanSettlement", "O"]]

print("\nTokens:", tokens[0])
print("True tags:", true_tags[0])
print("Pred tags:", pred_tags[0])

# Extract entities
true_entities = extract_entities(tokens[0], true_tags[0])
pred_entities = extract_entities(tokens[0], pred_tags[0])

print("\nTrue entities:", true_entities)
print("Pred entities:", pred_entities)

# Evaluate
result = evaluate_entity_spans(true_tags, pred_tags, tokens)
print(f"\nPrecision: {result['precision']:.4f}")
print(f"Recall:    {result['recall']:.4f}")
print(f"F1 Score:  {result['f1']:.4f}")
print("\n✅ Perfect match! F1 = 1.0")

TEST CASE 1: Perfect Prediction

Tokens: ['Barack', 'Obama', 'visited', 'Paris', '.']
True tags: ['B-Politician', 'I-Politician', 'O', 'B-HumanSettlement', 'O']
Pred tags: ['B-Politician', 'I-Politician', 'O', 'B-HumanSettlement', 'O']

True entities: [('Barack Obama', 'Politician', 0, 1), ('Paris', 'HumanSettlement', 3, 3)]
Pred entities: [('Barack Obama', 'Politician', 0, 1), ('Paris', 'HumanSettlement', 3, 3)]

Precision: 1.0000
Recall:    1.0000
F1 Score:  1.0000

✅ Perfect match! F1 = 1.0


In [3]:
print("=" * 80)
print("TEST CASE 2: Missed Entity (False Negative)")
print("=" * 80)

tokens = [["Barack", "Obama", "visited", "Paris", "."]]
true_tags = [["B-Politician", "I-Politician", "O", "B-HumanSettlement", "O"]]
pred_tags = [["B-Politician", "I-Politician", "O", "O", "O"]]  # Missed Paris

print("\nTokens:", tokens[0])
print("True tags:", true_tags[0])
print("Pred tags:", pred_tags[0])

# Extract entities
true_entities = extract_entities(tokens[0], true_tags[0])
pred_entities = extract_entities(tokens[0], pred_tags[0])

print("\nTrue entities:", true_entities)
print("Pred entities:", pred_entities)

# Evaluate
result = evaluate_entity_spans(true_tags, pred_tags, tokens)
print(f"\nPrecision: {result['precision']:.4f} (1 predicted, 1 correct)")
print(f"Recall:    {result['recall']:.4f} (2 true, 1 found)")
print(f"F1 Score:  {result['f1']:.4f}")
print("\n⚠️ Missed 'Paris', so recall drops to 0.5")

TEST CASE 2: Missed Entity (False Negative)

Tokens: ['Barack', 'Obama', 'visited', 'Paris', '.']
True tags: ['B-Politician', 'I-Politician', 'O', 'B-HumanSettlement', 'O']
Pred tags: ['B-Politician', 'I-Politician', 'O', 'O', 'O']

True entities: [('Barack Obama', 'Politician', 0, 1), ('Paris', 'HumanSettlement', 3, 3)]
Pred entities: [('Barack Obama', 'Politician', 0, 1)]

Precision: 1.0000 (1 predicted, 1 correct)
Recall:    0.5000 (2 true, 1 found)
F1 Score:  0.6667

⚠️ Missed 'Paris', so recall drops to 0.5
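The F1 of 0.6667 follows directly from the harmonic mean of the two values printed above. A quick arithmetic check, using only the counts from this test case:

```python
# Test Case 2: 1 predicted span, 1 of them correct; 2 gold spans, 1 of them found.
precision = 1 / 1
recall = 1 / 2

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.6667
```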


In [4]:
print("=" * 80)
print("TEST CASE 3: Wrong Span Boundary (Incorrect Prediction)")
print("=" * 80)

tokens = [["Barack", "Hussein", "Obama", "Jr", "."]]
true_tags = [["B-Politician", "I-Politician", "I-Politician", "I-Politician", "O"]]
pred_tags = [["B-Politician", "I-Politician", "O", "O", "O"]]  # Span too short!

print("\nTokens:", tokens[0])
print("True tags:", true_tags[0])
print("Pred tags:", pred_tags[0])

# Extract entities
true_entities = extract_entities(tokens[0], true_tags[0])
pred_entities = extract_entities(tokens[0], pred_tags[0])

print("\nTrue entities:", true_entities)
print("Pred entities:", pred_entities)

# Evaluate
result = evaluate_entity_spans(true_tags, pred_tags, tokens)
print(f"\nTrue Positives:  {result['true_positives']}")
print(f"False Positives: {result['false_positives']}")
print(f"False Negatives: {result['false_negatives']}")
print(f"\nPrecision: {result['precision']:.4f}")
print(f"Recall:    {result['recall']:.4f}")
print(f"F1 Score:  {result['f1']:.4f}")
print("\n❌ Span boundaries don't match! Counts as both FP and FN")

TEST CASE 3: Wrong Span Boundary (Incorrect Prediction)

Tokens: ['Barack', 'Hussein', 'Obama', 'Jr', '.']
True tags: ['B-Politician', 'I-Politician', 'I-Politician', 'I-Politician', 'O']
Pred tags: ['B-Politician', 'I-Politician', 'O', 'O', 'O']

True entities: [('Barack Hussein Obama Jr', 'Politician', 0, 3)]
Pred entities: [('Barack Hussein', 'Politician', 0, 1)]

True Positives:  0
False Positives: 1
False Negatives: 1

Precision: 0.0000
Recall:    0.0000
F1 Score:  0.0000

❌ Span boundaries don't match! Counts as both FP and FN


In [5]:
print("=" * 80)
print("TEST CASE 4: Wrong Entity Type")
print("=" * 80)

tokens = [["Barack", "Obama", "."]]
true_tags = [["B-Politician", "I-Politician", "O"]]
pred_tags = [["B-Artist", "I-Artist", "O"]]  # Wrong type!

print("\nTokens:", tokens[0])
print("True tags:", true_tags[0])
print("Pred tags:", pred_tags[0])

# Extract entities
true_entities = extract_entities(tokens[0], true_tags[0])
pred_entities = extract_entities(tokens[0], pred_tags[0])

print("\nTrue entities:", true_entities)
print("Pred entities:", pred_entities)

# Evaluate
result = evaluate_entity_spans(true_tags, pred_tags, tokens)
print(f"\nTrue Positives:  {result['true_positives']}")
print(f"False Positives: {result['false_positives']}")
print(f"False Negatives: {result['false_negatives']}")
print(f"\nF1 Score: {result['f1']:.4f}")
print("\n❌ Span correct but type wrong! Still counts as incorrect")

TEST CASE 4: Wrong Entity Type

Tokens: ['Barack', 'Obama', '.']
True tags: ['B-Politician', 'I-Politician', 'O']
Pred tags: ['B-Artist', 'I-Artist', 'O']

True entities: [('Barack Obama', 'Politician', 0, 1)]
Pred entities: [('Barack Obama', 'Artist', 0, 1)]

True Positives:  0
False Positives: 1
False Negatives: 1

F1 Score: 0.0000

❌ Span correct but type wrong! Still counts as incorrect


## 2. Load Real Data and Test

Now let's load the actual validation data and test the evaluation on a subset.

In [6]:
# Load validation data
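def load_jsonl(file_path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # json.loads('') raises, e.g. on a trailing blank line
                data.append(json.loads(line))
    return data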

val_data = load_jsonl('val_split.jsonl')
print(f"Loaded {len(val_data):,} validation samples")

# Extract tokens and tags
val_tokens = [sample['tokens'] for sample in val_data]
val_true_tags = [sample['ner_tags'] for sample in val_data]

print(f"\nExample validation sample:")
print(f"Tokens: {val_tokens[0][:10]}...")  # First 10 tokens
print(f"Tags:   {val_true_tags[0][:10]}...")

Loaded 10,036 validation samples

Example validation sample:
Tokens: ['in', '1933', 'phil', 'spitalny', 'directed', 'the', 'orchestra', 'for', 'the']...
Tags:   ['O', 'O', 'B-Artist', 'I-Artist', 'O', 'O', 'O', 'O', 'O']...


## 3. Simulate Model Predictions

Let's create some dummy predictions to test the full evaluation pipeline.

In [7]:
# Create dummy predictions (for testing)
# In reality, this would come from your trained model

print("Creating dummy predictions for testing...\n")

# Scenario 1: Perfect predictions (sanity check)
perfect_preds = val_true_tags.copy()

print("Evaluating perfect predictions (sanity check):")
result = evaluate_entity_spans(val_true_tags, perfect_preds, val_tokens)
print(f"  F1: {result['f1']:.4f} (should be 1.0)")
assert result['f1'] == 1.0, "Perfect predictions should have F1 = 1.0"
print("  ✅ Sanity check passed!\n")

Creating dummy predictions for testing...

Evaluating perfect predictions (sanity check):
  F1: 1.0000 (should be 1.0)
  ✅ Sanity check passed!



In [8]:
# Scenario 2: Trivial baseline (predict 'O' for every token)
all_o_preds = [['O'] * len(tags) for tags in val_true_tags]

print("Evaluating all-O baseline:")
result = evaluate_entity_spans(val_true_tags, all_o_preds, val_tokens)
print(f"  Precision: {result['precision']:.4f}")
print(f"  Recall:    {result['recall']:.4f}")
print(f"  F1:        {result['f1']:.4f}")
print("\n  (F1 should be 0.0 since we predict no entities)\n")

Evaluating all-O baseline:
  Precision: 0.0000
  Recall:    0.0000
  F1:        0.0000

  (F1 should be 0.0 since we predict no entities)



In [9]:
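# Scenario 3: Noisy predictions (each tag is kept with probability 0.8, otherwise replaced by 'O')
# Gold 'O' tags stay 'O' either way, so the real token accuracy is above 80%;
# the "80%" in the labels below is shorthand for the keep probability.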
import random
random.seed(42)

noisy_preds = []
for tags in val_true_tags:
    noisy_tags = []
    for tag in tags:
        # 80% chance of correct tag, 20% chance of O
        if random.random() < 0.8:
            noisy_tags.append(tag)
        else:
            noisy_tags.append('O')
    noisy_preds.append(noisy_tags)

print("Evaluating noisy predictions (80% token accuracy):")
result = evaluate_entity_spans(val_true_tags, noisy_preds, val_tokens)
print(f"  Precision: {result['precision']:.4f}")
print(f"  Recall:    {result['recall']:.4f}")
print(f"  F1:        {result['f1']:.4f}")
print("\n  (Note: 80% token accuracy != 80% span F1, because entire spans must match!)")

Evaluating noisy predictions (80% token accuracy):
  Precision: 0.6537
  Recall:    0.6387
  F1:        0.6461

  (Note: 80% token accuracy != 80% span F1, because entire spans must match!)
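The gap between the 0.8 keep probability and the span-level scores has a simple explanation: a gold span is only matched if every one of its tags survives the noise. Assuming tags are corrupted independently, as in the loop above, the chance that a span of a given length stays intact can be sketched as follows (the helper name `intact_probability` is made up for this illustration):

```python
def intact_probability(span_length: int, keep_prob: float = 0.8) -> float:
    """Probability that all tags of a span are kept when each is kept independently."""
    return keep_prob ** span_length


for n in range(1, 5):
    print(f"span length {n}: P(intact) = {intact_probability(n):.4f}")
# span length 1: P(intact) = 0.8000
# span length 2: P(intact) = 0.6400
# span length 3: P(intact) = 0.5120
# span length 4: P(intact) = 0.4096
```

The observed recall of 0.6387 lies between the values for one-token and three-token spans, which is what a mix of span lengths would produce. Precision suffers as well, because a damaged span can leave a shorter fragment behind (for example, a three-token entity whose last tag becomes `O`), and that fragment is a false positive.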


## 4. Using print_evaluation_report()

`print_evaluation_report()` prints a formatted report: overall precision, recall, and F1 with the raw TP/FP/FN counts, followed by a per-entity-type table.

In [10]:
# Print detailed evaluation report
print_evaluation_report(
    val_true_tags,
    noisy_preds,
    val_tokens,
    model_name="Noisy Baseline (80% token accuracy)"
)

ENTITY-SPAN LEVEL EVALUATION REPORT: Noisy Baseline (80% token accuracy)

OVERALL METRICS:
  Precision: 0.6537
  Recall:    0.6387
  F1 Score:  0.6461

  True Positives:  8607
  False Positives: 4559
  False Negatives: 4868

--------------------------------------------------------------------------------
PER-ENTITY-TYPE METRICS:
--------------------------------------------------------------------------------
Entity Type          Precision    Recall       F1           Support   
--------------------------------------------------------------------------------
Artist               0.6609       0.6423       0.6515       2849      
Facility             0.5382       0.5683       0.5528       1487      
HumanSettlement      0.8281       0.7207       0.7707       3476      
ORG                  0.5455       0.5800       0.5622       1893      
OtherPER             0.6029       0.6060       0.6044       1779      
Politician           0.5821       0.6020       0.5919       1402      
PublicCorp

## 5. Per-Entity-Type Analysis

Understanding which entity types your model struggles with is crucial for improvement.

In [11]:
# Get per-type metrics
by_type = evaluate_entity_spans_by_type(val_true_tags, noisy_preds, val_tokens)

print("Per-Entity-Type Performance:\n")
print(f"{'Entity Type':<20} {'F1':<10} {'Support':<10}")
print("-" * 40)

# Sort by F1 score
sorted_types = sorted(by_type.items(), key=lambda x: x[1]['f1'], reverse=True)

for entity_type, metrics in sorted_types:
    print(f"{entity_type:<20} {metrics['f1']:<10.4f} {metrics['support']:<10}")

print("\nInsights:")
print("- Entity types with low F1 might need more training data or better features")
print("- Entity types with low support might benefit from data augmentation")

Per-Entity-Type Performance:

Entity Type          F1         Support   
----------------------------------------
HumanSettlement      0.7707     3476      
PublicCorp           0.7140     589       
Artist               0.6515     2849      
OtherPER             0.6044     1779      
Politician           0.5919     1402      
ORG                  0.5622     1893      
Facility             0.5528     1487      

Insights:
- Entity types with low F1 might need more training data or better features
- Entity types with low support might benefit from data augmentation
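To read the per-type table correctly, it helps to see how errors get attributed to types. The sketch below is an assumption about how such a breakdown can be computed, not the code behind `evaluate_entity_spans_by_type()`; the name `per_type_counts` and the `(start, end, type)` tuples are hypothetical. With multiple sentences, a sentence index would need to be part of each tuple.

```python
from collections import defaultdict


def per_type_counts(true_spans, pred_spans):
    """Count TP/FP/FN per entity type from two sets of (start, end, type) spans."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred_spans:
        # A prediction is charged to the type it predicted.
        counts[span[2]]["tp" if span in true_spans else "fp"] += 1
    for span in true_spans - pred_spans:
        # A missed gold entity is charged to its gold type.
        counts[span[2]]["fn"] += 1
    return dict(counts)


gold = {(0, 1, "Politician"), (3, 3, "HumanSettlement")}
pred = {(0, 1, "Artist"), (3, 3, "HumanSettlement")}

for entity_type, c in sorted(per_type_counts(gold, pred).items()):
    print(entity_type, c)
# Artist {'tp': 0, 'fp': 1, 'fn': 0}
# HumanSettlement {'tp': 1, 'fp': 0, 'fn': 0}
# Politician {'tp': 0, 'fp': 0, 'fn': 1}
```

Under this attribution, a type confusion lowers precision for the predicted type and recall for the gold type, so a low score on one type can be caused by the model over-predicting another.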


## 6. How to Use in Your Model Notebooks

When you train a model, use these functions like this:

```python
# In your model notebook (e.g., 3_HMM.ipynb, 4_CRF.ipynb, etc.)
# `model`, `train_data`, `val_tokens`, and `val_true_tags` are placeholders for your own objects.

from utils import print_evaluation_report

# 1. Train your model
model.fit(train_data)

# 2. Make predictions on validation set
val_predictions = model.predict(val_tokens)

# 3. Evaluate with one line!
print_evaluation_report(
    val_true_tags,
    val_predictions,
    val_tokens,
    model_name="My Model Name"
)
```

## Summary

### Key Takeaways:

1. **Entity-span level evaluation is STRICT**:
   - Entire span must match (start, end, type)
   - Partial matches don't count
   - 80% token accuracy ≠ 80% span F1

2. **Functions available in utils.py**:
   - `extract_entities()` - Extract spans from BIO tags
   - `evaluate_entity_spans()` - Overall P/R/F1
   - `evaluate_entity_spans_by_type()` - Per-type metrics
   - `print_evaluation_report()` - Pretty print results

3. **How to use**:
   - Import functions from utils.py
   - Pass true tags, predicted tags, and tokens
   - Get comprehensive evaluation report

4. **For your report**:
   - Always report entity-span F1 (not token-level!)
   - Include per-entity-type breakdown
   - Analyze which entity types are challenging

### Next Steps:

Now you're ready to start building models! Each model notebook should:
1. Import evaluation functions from utils.py
2. Train the model
3. Predict on validation set
4. Evaluate using `print_evaluation_report()`
5. Save results for comparison
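
For step 5, one simple option is to write each model's metrics to its own JSON file. This is only a sketch: the helper `save_results`, the `results` directory, and the file naming are assumptions, and it expects a dict of JSON-serialisable values such as the one returned by `evaluate_entity_spans()`.

```python
import json
from pathlib import Path


def save_results(result, model_name, out_dir="results"):
    """Write one model's metrics to <out_dir>/<model_name>.json and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    file_path = out_path / f"{model_name}.json"
    with open(file_path, "w", encoding="utf-8") as f:
        # Store the model name alongside the metrics so files are self-describing.
        json.dump({"model": model_name, **result}, f, indent=2)
    return file_path
```

A call such as `save_results(result, "crf_baseline")` after evaluation would leave one file per model, which can later be loaded and compared side by side.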