In [1]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../Course_Tools/auto_update_introdl.py

✅ introdl v1.6.46 already up to date


# Named Entity Recognition

In this notebook we'll

* List some common applications of NER
* Give a brief history of NER
* Demonstrate how to setup and fine-tune a DistilBERT model for NER
* Discuss some of the issues with using an LLM for an NER task

After running that cell, you should restart the kernel.

## Applications of NER

I wasn't really familiar with Named Entity Recognition before building this course.  However, after studying it for a bit I realize it's very similar to object detection and instance segmentation in computer vision where we're trying to "tag" individual objects in an image.  Now we're doing it with text.  Now that I know more about it I realize that NER is everywhere:

- **Information Extraction from Text**
  - Identify names of people, places, organizations, and dates in news articles, legal documents, and academic papers.

- **Search and Question Answering**
  - Improve retrieval and understanding by recognizing key entities in queries and documents (e.g., “Where was Barack Obama born?”).

- **Social Media Monitoring**
  - Detect mentions of public figures, brands, products, and locations in tweets, posts, and comments for sentiment analysis or moderation.

- **Marketing and Trend Analysis**
  - Track mentions of brands, competitors, or topics over time to identify emerging trends and customer interests.

- **Content Recommendation**
  - Extract entities (e.g., movies, products, places) from reviews and user posts to personalize content or advertisements.

- **Customer Support Automation**
  - Identify product names, user accounts, and issue types in support chats and emails to assist routing and auto-response systems.

- **Financial and Business Intelligence**
  - Extract company names, stock tickers, monetary values, and events from reports or articles to support decision-making.

- **Medical and Clinical Text Analysis**
  - Identify diseases, medications, and procedures in clinical notes for tasks like anonymization, coding, or record analysis.

- **Legal and Compliance Monitoring**
  - Recognize case names, organizations, and laws in legal documents to support research, auditing, or compliance checks.

- **Resume and Job Post Parsing**
  - Extract structured information such as skills, education, job titles, and companies to streamline recruitment processes.


**Side Note:**  I'm using NER heavily right now to extract structured information from radiologist and pathologist findings in electronic health records. This feeds into model training for cancer diagnosis from breast ultrasound exams.

## An Analogy

* Image Classification - classify the entire image into a category
* Text Classification - classify the entire text into a category
* Image Segmentation - classify each pixel in an image into a category
* Named Entity Recognition - classify each word (token) into a category

So text classification is to image classification as named entity recognition is to image segmentation.

Or, more simply:  **Named Entity Recognition = Text Segmentation**

## **Chronology of State-of-the-Art Approaches for Named Entity Recognition (NER)**  

The evolution of NER closely parallels the evolution of algorithms for text classification.  Early approaches were based on statistical models, followed by word embeddings and recurrent neural networks, until transformer architectures revolutionized the field starting in 2017.  

Here's a timeline of some of the key advancements in NER:

---

### **Pre-2010s: Rule-Based Systems and Feature Engineering**  
Early NER systems used **hand-crafted rules**, lookup lists (called **gazetteers**), and basic statistical models like **Hidden Markov Models (HMMs)** and **Conditional Random Fields (CRFs)**.  
- **HMMs** modeled sequences by predicting the most likely tag (e.g., PERSON, LOCATION) for each word based on probabilities.
- **CRFs** improved on HMMs by allowing more flexible features and considering the entire sequence when making predictions.

These approaches required heavy manual feature engineering—like marking whether a word is capitalized, its part of speech, or its prefix/suffix.

- **1990s–2000s**: Rule-based systems and statistical models dominated tasks like newswire NER.
- **2003**: The CoNLL-2003 shared task standardized benchmarks and boosted interest in developing better NER models.

---

### **2010s: Word Embeddings and Neural Sequence Models**  
NER systems improved significantly with the introduction of **word embeddings** like **Word2Vec** and **GloVe**, which represented words in continuous vector space based on context. These embeddings replaced sparse, manual features.

- **2013–2015**: **Word2Vec** and **GloVe** made it easier to train neural models for NER.
- **2015–2016**: **BiLSTM-CRF** architectures became popular—combining bidirectional LSTMs (which read sentences both forward and backward) with a CRF layer to model dependencies between entity tags.
- **2015**: **spaCy** launched as a fast, practical NLP library with built-in NER support, making NER accessible for developers and educators.
- **2016–2017**: Character-level embeddings and CNNs were added to improve robustness to spelling variation and rare words.

---

### **Late 2010s: Contextual Embeddings and Transformers**  
NER took a major leap with **contextualized embeddings** from pretrained language models, first LSTM-based (ELMo, Flair) and then transformer-based (BERT).

- **2018**: **ELMo** introduced deep contextualized word representations that vary based on sentence context.
- **2018**: **BERT** achieved state-of-the-art NER results by treating NER as a token classification problem using bidirectional transformer layers.
- **2019**: **Flair** added character-level contextual embeddings to further improve performance on small or domain-specific datasets.

---

### **2020s: Prompting and Large Language Models (LLMs)**  
Recent NER approaches increasingly use **LLMs** like **GPT-4**, **Claude**, and **Gemini**, which can extract entities using **natural language prompts** instead of token-level supervision.

- **2020–2022**: Models like **RoBERTa**, **SpanBERT**, and **LUKE** fine-tuned transformer architectures for better span detection and entity-aware representations.
- **spaCy** added support for transformer-based pipelines (e.g., `en_core_web_trf`) to make state-of-the-art NER accessible for production use.
- **2023–2025**: Zero-shot NER models like **GLiNER** and instruction-tuned general-purpose LLMs now handle **zero-shot or few-shot NER**. GLiNER takes a list of entity type names, while LLMs respond to prompts like *"Find all organizations and people in this sentence."* These models reduce the need for annotated datasets and allow rapid prototyping for new entity types.

  While LLMs offer flexibility and ease of use, they may be less precise than traditional models. Hybrid systems often combine LLMs with structured postprocessing or constrained decoding to improve accuracy.

---

We'll focus on two of these tools.  We'll fine-tune a BERT model for NER and we'll look at some of the hurdles to using LLMs for NER.  You'll explore both of these topics further in the homework.

Here's our main import cell before we dive into the rest of the material.

In [2]:
from Lesson_10_Helpers import (display_ner_html, display_pipeline_ner_html, format_ner_eval_results, 
                                evaluate_ner, extract_gold_entities, predict_ner_tags)
from introdl import (config_paths_keys, wrap_print_text, llm_generate, Trainer)

from datasets import load_dataset
import evaluate # Hugging Face library for evaluation
from IPython.display import display
import numpy as np
import pandas as pd
import torch
import json

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, DataCollatorForTokenClassification,
                          pipeline)

print = wrap_print_text(print, width=120)

paths = config_paths_keys()
MODELS_PATH = paths['MODELS_PATH']
DATA_PATH = paths['DATA_PATH']

✅ Environment: Unknown Environment | Course root: /mnt/e/GDrive_baggett.jeff/Teaching/Classes_current/2025-2026_Fall_DS776/DS776
   Using workspace: <DS776_ROOT_DIR>/home_workspace

📂 Storage Configuration:
   DATA_PATH: <DS776_ROOT_DIR>/home_workspace/data
   MODELS_PATH: /home/jbaggett/DS776_new/Lessons/Lesson_10_Named_Entity_Recognition/Lesson_10_Models (local to this notebook)
   CACHE_PATH: <DS776_ROOT_DIR>/home_workspace/downloads
🔑 API keys: 9 loaded from home_workspace/api_keys.env
🔐 Available: ANTHROPIC_API_KEY, GEMINI_API_KEY, GOOGLE_API_KEY... (9 total)
✅ HuggingFace Hub: Logged in
✅ Loaded pricing for 347 OpenRouter models
✅ Cost tracking initialized ($9.15 credit remaining)
📦 introdl v1.6.46 ready



## The Dataset - CoNLL2003 for NER

For our examples, we'll use the [CoNLL2003 dataset](https://www.clips.uantwerpen.be/conll2003/ner/).  It is one of the first widely used benchmarks for Named Entity Recognition (NER). It was introduced as part of the [CoNLL-2003 shared task](https://aclanthology.org/W03-0419.pdf) and contains annotated text for four entity types: **PER** (person), **LOC** (location), **ORG** (organization), and **MISC** (miscellaneous). The dataset is derived from Reuters news articles and is structured in the BIO format, making it a standard for evaluating NER models.  BIO format marks the (B) beginning token, (I) inside tokens, and (O) outside tokens of an entity.  For example, 'B-PER' would be the first token in the name of a person, 'I-PER' would tag subsequent tokens in the name, while 'O' tags any tokens that are not part of an entity, similar to labeling background pixels in image segmentation.
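To make the BIO scheme concrete, here's a minimal sketch that turns word-level entity spans into BIO tags. The function name `spans_to_bio`, the `(start, end, type)` span format, and the example sentence are made up for illustration; they aren't part of the dataset or of any library.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end, entity_type) tuples indexing into `tokens`,
    with `end` exclusive. (Hypothetical format chosen for this sketch.)
    """
    tags = ["O"] * len(tokens)                # everything starts as background
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags


toy_tokens = ["Barack", "Obama", "visited", "New", "York", "."]
toy_spans = [(0, 2, "PER"), (3, 5, "LOC")]
toy_tags = spans_to_bio(toy_tokens, toy_spans)

for token, tag in zip(toy_tokens, toy_tags):
    print(f"{token:10s}{tag}")
```

Both two-word entities get one `B-` tag followed by one `I-` tag, and everything else is tagged `O`.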

Multiple versions of the dataset are available on Hugging Face.  We chose "tomaarsen/conll2003" because the NER tags are available in BIO format and because the list of possible labels is easy to extract.

In [3]:
# Load CoNLL2003 dataset (using parquet version for compatibility with datasets 4.0+)
dataset = load_dataset("tomaarsen/conll2003", revision="refs/convert/parquet")
BIO_tags_list = dataset["train"].features["ner_tags"].feature.names
print("Possible BIO tags", BIO_tags_list)

# delete the pos_tags and chunk_tags columns, as we don't need them
for split in dataset.keys():
    dataset[split] = dataset[split].remove_columns(["pos_tags", "chunk_tags"])


Possible BIO tags ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Each sample in the dataset consists of a single sentence or headline.  Here is how it's stored:

In [4]:
print(dataset["train"][12])

{'id': '12', 'document_id': 1, 'sentence_id': 12, 'tokens': ['Only', 'France', 'and', 'Britain', 'backed', 'Fischler',
"'s", 'proposal', '.'], 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0]}


Notice that the tokens are the words in the sentence split up by whitespace and punctuation.  The ner_tags correspond to indices of the entity tags in our list.  The next bit of code also shows you how to get the BIO tags corresponding to each token:

In [5]:
# Extract tokens and ner_tags from dataset["train"][12]
tokens = dataset["train"][12]["tokens"]
ner_tags = dataset["train"][12]["ner_tags"]

# Map ner_tags to their corresponding BIO tags using BIO_tags_list
bio_tags = [BIO_tags_list[tag] for tag in ner_tags]

# Create a DataFrame
df = pd.DataFrame({"Tokens": tokens, "NER Tags (IDs)": ner_tags, "BIO Tags": bio_tags})

# Display the DataFrame
display(df)

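|   | Tokens | NER Tags (IDs) | BIO Tags |
|---|---|---|---|
| 0 | Only | 0 | O |
| 1 | France | 5 | B-LOC |
| 2 | and | 0 | O |
| 3 | Britain | 5 | B-LOC |
| 4 | backed | 0 | O |
| 5 | Fischler | 1 | B-PER |
| 6 | 's | 0 | O |
| 7 | proposal | 0 | O |
| 8 | . | 0 | O |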


[spaCy is a whole ecosystem](https://spacy.io/) of tools for NLP that we won't really dive into much in this course, but it's worth a look if you're going to be working in this area.  They provide some great tools for visualization of tagged text.  We've used their package to make a little function called `display_ner_html` which takes lists of tokens, tag IDs, and the list of labels to produce HTML visualizations of the tags.  The function is in `Lesson_10_Helpers.py` if you're curious.  Here's how we can use it:

In [6]:
# tokens and ner_tags were defined in the previous code cell

display_ner_html(tokens, ner_tags, BIO_tags_list)

In [7]:
# here's another example
display_ner_html(dataset["train"][4]["tokens"], dataset["train"][4]["ner_tags"], BIO_tags_list)

## Fine-tune DistilBERT for CoNLL2003

**NOTE:** I've updated this section since Spring 2025 and will make a new video if time allows.  The main changes:
- Use of our introdl.Trainer class which enables pretend_train (otherwise it's just like transformers.Trainer)
- Now we're using a `pipeline` to make inferences on new texts.  This simplifies using a trained model greatly.  
- Provided a new "helper" function that takes the output of `pipeline` and produces dictionaries with the extracted entities.

#### L10_1_Fine-tune_BERT Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_fine-tune_bert/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_fine-tune_bert/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/EgBF1mreyjw" target="_blank">Open Descript version of video in new tab</a>


Now we want to fine-tune a BERT model so that it can provide similar tagging for new text.  First we'll load a model and its tokenizer.
`distilbert-base-cased` is a smaller, faster, and lighter version of BERT that retains 97% of its language understanding capabilities while being 40% smaller. It is case-sensitive, meaning it distinguishes between "Apple" and "apple", which is useful for NER tasks. It was trained by knowledge distillation from BERT (alongside a masked language modeling objective) on the same data as BERT, including English Wikipedia and BookCorpus, but with a reduced architecture to improve efficiency. 

In practice, you might choose a more recent variation on a BERT model for NER tasks.  For example, DeBERTa-v3 will typically beat BERT and RoBERTa for NER.  We focus on DistilBERT in the lessons so you can see how things work while using a smaller model for faster training.

Note that we make use of `AutoModelForTokenClassification` which adds a classification head to the backbone the same way we did for transfer learning applications in image classification.  The backbone uses pretrained weights while the classification head weights are randomly initialized and learned during fine-tuning.

In [8]:
# Load tokenizer and model
model_name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(BIO_tags_list))


One of the main issues we'll need to deal with is mapping the BIO tags onto the tokens produced by the tokenizer that comes with our selected BERT model.  That tokenizer will break some of our words into subwords.  We'll keep each word's tag on its first subword and give the remaining subwords (along with special tokens like `[CLS]` and `[SEP]`) a label ID of -100, which tells the loss function to ignore those positions.

We're using the `tokenize_and_align_labels` function from the [Hugging Face tutorial on NER](https://huggingface.co/learn/llm-course/en/chapter7/2) to align the BIO ID tags from the input sequence in the dataset to the output tokens in the tokenizer.  We've included some extra comments in the code if you want to study it, or you can use an AI to help you walk through the details. 

In [9]:
# Helper function to align labels with tokens
def tokenize_and_align_labels(examples):
    # Tokenize the input text (list of tokens) while keeping track of word-to-token alignment
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    
    # Initialize a list to store the aligned labels for each example
    labels = []
    
    # Iterate over each example in the batch
    for i, label in enumerate(examples["ner_tags"]):
        # Get the word-to-token mapping for the current example
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        
        # Initialize variables to track the previous word index and the label IDs
        previous_word_idx = None
        label_ids = []
        
        # Iterate over the word IDs corresponding to the tokens
        for word_idx in word_ids:
            if word_idx is None:
                # If the token is a special token (e.g., [CLS], [SEP]), ignore it by assigning -100
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # If the token corresponds to a new word, assign the label of that word
                label_ids.append(label[word_idx])
            else:
                # If the token is part of the same word (e.g., subword tokens), ignore it by assigning -100
                label_ids.append(-100)
            
            # Update the previous word index to the current one
            previous_word_idx = word_idx
        
        # Append the aligned label IDs for the current example
        labels.append(label_ids)
    
    # Add the aligned labels to the tokenized inputs
    tokenized_inputs["labels"] = labels
    
    # Return the tokenized inputs with aligned labels
    return tokenized_inputs

# Tokenize datasets
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


The next cell demonstrates how our tokenizer works with the alignment function to get the tokenization expected by the model and to introduce IDs of -100 for each of the extra subwords introduced by the tokenizer.  

In [10]:
# Get the example
example = dataset["train"][7]

# Wrap in a batch of one for compatibility with tokenize_and_align_labels
batch = {"tokens": [example["tokens"]], "ner_tags": [example["ner_tags"]]}

# Apply the tokenization and alignment function
tokenized = tokenize_and_align_labels(batch)

# Extract and display results
tokens = tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
labels = tokenized["labels"][0]

print(("Before model tokenization:\n"))
display_ner_html(dataset["train"][7]["tokens"], dataset["train"][7]["ner_tags"], BIO_tags_list)
print(("\nAfter model tokenization:\n"))
display_ner_html(tokens, labels, BIO_tags_list)


Before model tokenization:




After model tokenization:



You can see that the tokenizer divided some of the original words into subwords which get assigned an ID of -100 to be ignored by the model.  During training those tokens are ignored by the loss function and the outputs corresponding to those tokens are ignored during model evaluation.
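If you'd like to see the alignment rule in isolation, here's a small sketch that applies the same logic to a hand-written `word_ids` list, so no tokenizer is needed. The subword split of "Fischler" shown here is hypothetical (the real tokenizer may split it differently), and `align_labels` is just an illustrative name.

```python
def align_labels(word_ids, word_labels, ignore_id=-100):
    """Apply the same rule as tokenize_and_align_labels to one example."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:             # special token such as [CLS] or [SEP]
            aligned.append(ignore_id)
        elif word_id != previous:       # first subword of a new word keeps the tag
            aligned.append(word_labels[word_id])
        else:                           # later subword of the same word is ignored
            aligned.append(ignore_id)
        previous = word_id
    return aligned


# Hypothetical tokenization: "Fischler" split into three subwords
subwords = ["[CLS]", "Only", "France", "backed", "Fi", "##sch", "##ler", "[SEP]"]
toy_word_ids = [None, 0, 1, 2, 3, 3, 3, None]
toy_word_labels = [0, 5, 0, 1]          # O, B-LOC, O, B-PER

toy_aligned = align_labels(toy_word_ids, toy_word_labels)
for subword, label_id in zip(subwords, toy_aligned):
    print(f"{subword:8s}{label_id}")
```

Only `Fi` carries the `B-PER` label (ID 1); `##sch`, `##ler`, and the two special tokens all get -100.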

Before we fine-tune the model we define a custom metrics function that does two things:
1. Uses the `seqeval` package to evaluate entire entity spans (e.g., `B-LOC`, `I-LOC` forming `"New York"`) instead of evaluating individual labels as we'd do with the scikit-learn metrics.
2. Ignores the tokens with IDs of -100 for the evaluation metrics:

In [11]:
# Load seqeval metric
metric = evaluate.load("seqeval")

# Note if you have a different list of possible tags, you'll need to change the default value of label_list
def compute_metrics(p, label_list=BIO_tags_list):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_predictions, references=true_labels)
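To see why span-level evaluation is stricter than token-level evaluation, here's a rough sketch in plain Python. It is not how `seqeval` is implemented; `bio_to_spans` is a made-up helper that, as a simplification, ignores a stray `I-` tag with no matching `B-` tag before it.

```python
def bio_to_spans(tags):
    """Return a set of (entity_type, start, end) spans, with `end` exclusive."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == current
        if not inside:
            if current is not None:             # close the span that was open
                spans.add((current, start, i))
                start, current = None, None
            if tag.startswith("B-"):            # open a new span
                start, current = i, tag[2:]
    return spans


gold = ["B-LOC", "I-LOC", "O", "B-PER", "O"]
pred = ["B-LOC", "O", "O", "B-PER", "O"]        # missed the second token of the location

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
true_positives = len(gold_spans & pred_spans)   # spans must match exactly
span_precision = true_positives / len(pred_spans)
span_recall = true_positives / len(gold_spans)

print(token_accuracy, span_precision, span_recall)
```

Four of the five tokens are correct, but only one of the two entities is recovered exactly, so the span-level precision and recall are both 0.5.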

For the actual fine-tuning we use a similar setup to what we did for text classification:

In [12]:
# Training arguments
training_args = TrainingArguments(
    output_dir= MODELS_PATH / "distilbert-ner",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
    seed=42,
    disable_tqdm=False,
    save_total_limit=1,  # Only keep the best model
    load_best_model_at_end=True,  # Load best model at end of training
    metric_for_best_model="eval_overall_f1",  # Use F1 as the best model metric
)

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Trainer setup with pretend_train mode
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    pretend_train=True,  # Enable smart loading: local → train from scratch
)

# Train the model (or load if already trained)
trainer.train()

✓ Checking HuggingFace Hub: hobbes99/DS776-models/distilbert-ner
✓ Model cached locally to: Lesson_10_Models/distilbert-ner/best_model
Model already trained. Loading checkpoint...



Loading:   0%|          | 0/3 [00:00<?, ?it/s]


📊 Training History:


Unnamed: 0,Epoch,Training Loss,Validation Loss,Loc,Misc,Org,Per,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
0,1,,0.055805,"{'precision': 0.926777, 'recall': 0.943930, 'f...","{'precision': 0.835373, 'recall': 0.814534, 'f...","{'precision': 0.854328, 'recall': 0.905295, 'f...","{'precision': 0.972254, 'recall': 0.932139, 'f...",0.909182,0.911478,0.910329,0.984249
1,2,,0.046183,"{'precision': 0.960925, 'recall': 0.950463, 'f...","{'precision': 0.833161, 'recall': 0.872017, 'f...","{'precision': 0.902295, 'recall': 0.909023, 'f...","{'precision': 0.965761, 'recall': 0.964712, 'f...",0.928512,0.933356,0.930927,0.988299
2,3,,0.043658,"{'precision': 0.963247, 'recall': 0.955906, 'f...","{'precision': 0.840292, 'recall': 0.873102, 'f...","{'precision': 0.895849, 'recall': 0.917226, 'f...","{'precision': 0.965368, 'recall': 0.968512, 'f...",0.928857,0.938236,0.933523,0.989019



✓ Best model: Epoch 3 | Overall F1: 0.9335


TrainOutput(global_step=0, training_loss=0.0, metrics={})

### Making Predictions and Evaluating Performance

Of course, we have the metrics for the validation set shown in the training output above.  However, we can use the trainer's predict method to make predictions on any dataset, including the test set, and pass them to compute_metrics like this:

In [13]:

# Evaluate on test set using predict (compatible with pretend_train)
predictions = trainer.predict(tokenized_datasets["test"])

# Extract predictions and labels
pred_logits = predictions.predictions
pred_labels = np.argmax(pred_logits, axis=2)
true_labels = predictions.label_ids

# Compute metrics using the same compute_metrics function
results_BERT = compute_metrics((pred_logits, true_labels))

print("\nTest set evaluation results:")
print(results_BERT)



Test set evaluation results:
{'LOC': {'precision': 0.9195402298850575, 'recall': 0.9112709832134293, 'f1': 0.9153869316470943, 'number': 1668},
'MISC': {'precision': 0.7235142118863049, 'recall': 0.7977207977207977, 'f1': 0.7588075880758809, 'number': 702}, 'ORG':
{'precision': 0.8451352907311457, 'recall': 0.8838049367850692, 'f1': 0.8640376692171866, 'number': 1661}, 'PER':
{'precision': 0.961875, 'recall': 0.9517625231910947, 'f1': 0.9567920422754119, 'number': 1617}, 'overall_precision':
0.8825468424705066, 'overall_recall': 0.9006728045325779, 'overall_f1': 0.8915177006659657, 'overall_accuracy':
0.9780122752234306}


The dictionary of metrics certainly isn't very pretty, but all the metrics are there.  You can display it as a pandas DataFrame (note that the overall scores are scalars, so pandas repeats them in every row), or you could use our little utility `Lesson_10_Helpers.format_ner_eval_results` (or write your own).

In [14]:
# pd.DataFrame converts nested dict to a flat table for easier viewing
display(pd.DataFrame(results_BERT))

# or make it prettier with our helper function
display(format_ner_eval_results(results_BERT))

|   | LOC | MISC | ORG | PER | overall_precision | overall_recall | overall_f1 | overall_accuracy |
|---|---|---|---|---|---|---|---|---|
| precision | 0.91954 | 0.723514 | 0.845135 | 0.961875 | 0.882547 | 0.900673 | 0.891518 | 0.978012 |
| recall | 0.911271 | 0.797721 | 0.883805 | 0.951763 | 0.882547 | 0.900673 | 0.891518 | 0.978012 |
| f1 | 0.915387 | 0.758808 | 0.864038 | 0.956792 | 0.882547 | 0.900673 | 0.891518 | 0.978012 |
| number | 1668.0 | 702.0 | 1661.0 | 1617.0 | 0.882547 | 0.900673 | 0.891518 | 0.978012 |


|   | Entity | Precision | Recall | F1 | Number | Accuracy |
|---|---|---|---|---|---|---|
| 0 | LOC | 0.9195 | 0.9113 | 0.9154 | 1668.0 | |
| 1 | MISC | 0.7235 | 0.7977 | 0.7588 | 702.0 | |
| 2 | ORG | 0.8451 | 0.8838 | 0.864 | 1661.0 | |
| 3 | PER | 0.9619 | 0.9518 | 0.9568 | 1617.0 | |
| 4 | Overall | 0.8825 | 0.9007 | 0.8915 | | 0.978 |


Note that the overall accuracy is almost 98%, which seems amazing, but this includes all the background words in the text.  We could get a fairly accurate model just by classifying all words as background, so accuracy isn't very meaningful here, just as in image segmentation when we include background pixels.

If you're looking for one number to quantify the performance of an NER model, use F1.  F1 is the harmonic mean (an equal blend) of precision and recall.  Particularly for imbalanced datasets, it is much better than accuracy.  In image segmentation **F1 = Dice Score**!
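Here's a quick sketch of both points. The first part recomputes the overall F1 from the test-set precision and recall reported above. The second part uses a made-up toy sequence (not from the dataset) to show that predicting background everywhere can still give high accuracy.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall test-set precision and recall reported above
overall_f1 = f1_score(0.8825468424705066, 0.9006728045325779)
print(f"Overall F1: {overall_f1:.4f}")

# Toy example (made-up tags): 20 tokens, one two-token person name
toy_gold = ["O"] * 18 + ["B-PER", "I-PER"]
toy_pred = ["O"] * 20                       # "everything is background" baseline

baseline_accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)
entities_predicted = sum(tag.startswith("B-") for tag in toy_pred)
baseline_f1 = f1_score(0.0, 0.0)            # no entities found, so recall is 0

print(f"Baseline accuracy: {baseline_accuracy:.2f}, "
      f"entities predicted: {entities_predicted}, F1: {baseline_f1:.2f}")
```

The baseline reaches 90% accuracy on the toy sequence without finding a single entity, which is why we report F1 instead.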

Finally, we can retrieve the evaluation metrics from training (the training loss isn't saved) for display or plotting:

In [15]:
# Get training history as a DataFrame
history_df = trainer.get_training_history()
display(history_df)

Unnamed: 0,epoch,eval_loss,eval_LOC,eval_MISC,eval_ORG,eval_PER,eval_overall_precision,eval_overall_recall,eval_overall_f1,eval_overall_accuracy,eval_runtime,eval_samples_per_second,eval_steps_per_second
0,1,0.0558,"{'precision': 0.9267771245, 'recall': 0.943930...","{'precision': 0.8353726363, 'recall': 0.814533...","{'precision': 0.8543279381000001, 'recall': 0....","{'precision': 0.9722536806000001, 'recall': 0....",0.9092,0.9115,0.9103,0.9842,2.1542,1508.7,94.7
1,2,0.0462,"{'precision': 0.9609246010000001, 'recall': 0....","{'precision': 0.8331606218000001, 'recall': 0....","{'precision': 0.9022945966, 'recall': 0.909023...","{'precision': 0.9657608696000001, 'recall': 0....",0.9285,0.9334,0.9309,0.9883,2.1495,1511.989,94.906
2,3,0.0437,"{'precision': 0.9632473944000001, 'recall': 0....","{'precision': 0.8402922756000001, 'recall': 0....","{'precision': 0.8958485069000001, 'recall': 0....","{'precision': 0.9653679654, 'recall': 0.968512...",0.9289,0.9382,0.9335,0.989,2.1323,1524.208,95.673


## Using the Best Model for Inference

Now we'll use our fine-tuned model to make predictions on new text. Rather than using `trainer.predict`, we'll use `pipeline` like you've seen in recent lessons. The HuggingFace `pipeline` provides a simple, industry-standard interface for NER inference. We'll explore different configuration options and demonstrate how to extract entities in a structured format.

In [16]:
# Define path to our best trained model
best_model_path = MODELS_PATH / "distilbert-ner" / "best_model"

In [17]:
# Example text for demonstration
example_text = """
It's only been a day since ChatGPT's new AI image generator went live, 
and social media feeds are already flooded with AI-generated memes in the style of Studio Ghibli, 
the cult-favorite Japanese animation studio behind blockbuster films such as "My Neighbor Totoro" and "Spirited Away."

In the last 24 hours, we've seen AI-generated images representing Studio Ghibli versions of Elon Musk, 
"The Lord of the Rings", and President Donald Trump. OpenAI CEO Sam Altman even seems to have made his new 
profile picture a Studio Ghibli-style image, presumably made with GPT-4o's native image generator. Users seem to be 
uploading existing images and pictures into ChatGPT and asking the chatbot to re-create it in new styles.
"""

### Understanding Aggregation Strategies

The `pipeline` function supports different `aggregation_strategy` options that control how subword tokens are combined:

- **`"simple"`**: Groups consecutive tokens with the same entity type into spans, but a single word can still be split if its subword tokens get different labels
- **`"first"`**: Like `"simple"`, except that each word takes the label and score of its first subword token
- **`"average"`**: Like `"simple"`, except that scores are averaged across a word's subword tokens before the best label is chosen
- **`"max"`**: Like `"simple"`, except that each word takes the label of its most confident subword token
- **`None`**: Returns raw token-level predictions without grouping (every subword gets a separate prediction)
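The word-level strategies are easiest to compare on one word whose subwords disagree. The sketch below is a simplified reading of the three rules, not the library's actual code. The label set is cut down to three tags and the probabilities are hypothetical.

```python
LABELS = ["O", "B-ORG", "I-ORG"]            # reduced label set for this sketch

# Hypothetical probability vectors, one per subword of a single word
subword_scores = {
    "Cha":   [0.010, 0.985, 0.005],
    "##t":   [0.762, 0.100, 0.138],
    "##GPT": [0.006, 0.005, 0.989],
}


def best(scores):
    """Return (label, score) for the highest-scoring label."""
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[idx], scores[idx]


rows = list(subword_scores.values())

# "first": the word takes the prediction of its first subword
by_first = best(rows[0])

# "max": the word takes the prediction of its most confident subword
by_max = best(max(rows, key=max))

# "average": average the probability vectors, then pick the best label
mean_scores = [sum(column) / len(rows) for column in zip(*rows)]
by_average = best(mean_scores)

print("first:  ", by_first)
print("max:    ", by_max)
print("average:", by_average)
```

With these numbers "first" labels the word `B-ORG`, while "max" and "average" label it `I-ORG`, so the choice of strategy can change the label as well as the score.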

Using no strategy returns a label for every subtoken:

In [18]:
# Load pipeline with no aggregation (not recommended)
ner_pipeline = pipeline("token-classification", model=best_model_path, aggregation_strategy=None)

# Make predictions
results_raw = ner_pipeline(example_text)
print(f"Found {len(results_raw)} individual tokens")
print("\nFirst 5 tokens:")
print(results_raw[:5])

print("\nOr displayed as a table:")
display(pd.DataFrame(results_raw[:5]))

Found 183 individual tokens

First 5 tokens:
[{'entity': 'LABEL_0', 'score': 0.9997923, 'index': 1, 'word': 'It', 'start': 1, 'end': 3}, {'entity': 'LABEL_0',
'score': 0.9998636, 'index': 2, 'word': "'", 'start': 3, 'end': 4}, {'entity': 'LABEL_0', 'score': 0.999912, 'index': 3,
'word': 's', 'start': 4, 'end': 5}, {'entity': 'LABEL_0', 'score': 0.999918, 'index': 4, 'word': 'only', 'start': 6,
'end': 10}, {'entity': 'LABEL_0', 'score': 0.99992347, 'index': 5, 'word': 'been', 'start': 11, 'end': 15}]

Or displayed as a table:


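|   | entity | score | index | word | start | end |
|---|---|---|---|---|---|---|
| 0 | LABEL_0 | 0.999792 | 1 | It | 1 | 3 |
| 1 | LABEL_0 | 0.999864 | 2 | ' | 3 | 4 |
| 2 | LABEL_0 | 0.999912 | 3 | s | 4 | 5 |
| 3 | LABEL_0 | 0.999918 | 4 | only | 6 | 10 |
| 4 | LABEL_0 | 0.999923 | 5 | been | 11 | 15 |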


If we wanted to use that to extract the entities we'd have to do quite a bit of post-processing.  All of these methods will require us to map the entity labels like "LABEL_0" to the appropriate BIO tags.
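The label mapping itself is simple, because the number after `LABEL_` is just the index into our tag list. Here's a minimal sketch; `label_to_bio` is an illustrative name, and the tag list is copied from the dataset output earlier in the notebook so the snippet runs on its own.

```python
BIO_TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
            "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def label_to_bio(label, tags=BIO_TAGS):
    """Map a generic pipeline label such as 'LABEL_3' to its BIO tag."""
    index = int(label.rsplit("_", 1)[1])    # 'LABEL_3' -> 3
    return tags[index]


for generic_label in ["LABEL_0", "LABEL_3", "LABEL_7"]:
    print(generic_label, "->", label_to_bio(generic_label))
```

An alternative we don't use here is to pass `id2label` and `label2id` dictionaries to `from_pretrained` when the model is first loaded, so that the saved model config carries the tag names and the pipeline reports them directly.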

The "simple" aggregation strategy groups adjacent tokens that share a label, but it doesn't force the subwords of a single word to agree.  Here the pieces of "ChatGPT" were given different labels, so the word comes back as three separate fragments, "Cha", "##t", and "##GPT", which isn't so helpful.  Here's an example:

In [19]:
# Compare with simple aggregation (better, but still needs some post-processing)
ner_pipeline_simple = pipeline("token-classification", model=best_model_path, aggregation_strategy="simple")

results_simple = ner_pipeline_simple(example_text)
print(f"Found {len(results_simple)} entity spans")
print("\nFirst 5 results:")
print(results_simple[:5])

print("\nOr displayed as a table:")
display(pd.DataFrame(results_simple[:5]))

Found 54 entity spans

First 5 results:
[{'entity_group': 'LABEL_0', 'score': 0.9998968, 'word': "It ' s only been a day since", 'start': 1, 'end': 27},
{'entity_group': 'LABEL_3', 'score': 0.98500407, 'word': 'Cha', 'start': 28, 'end': 31}, {'entity_group': 'LABEL_0',
'score': 0.7621933, 'word': '##t', 'start': 31, 'end': 32}, {'entity_group': 'LABEL_4', 'score': 0.9892545, 'word':
'##GPT', 'start': 32, 'end': 35}, {'entity_group': 'LABEL_0', 'score': 0.99977636, 'word': "' s new", 'start': 35,
'end': 41}]

Or displayed as a table:


|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | LABEL_0 | 0.999897 | It ' s only been a day since | 1 | 27 |
| 1 | LABEL_3 | 0.985004 | Cha | 28 | 31 |
| 2 | LABEL_0 | 0.762193 | ##t | 31 | 32 |
| 3 | LABEL_4 | 0.989254 | ##GPT | 32 | 35 |
| 4 | LABEL_0 | 0.999776 | ' s new | 35 | 41 |
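One way to repair the fragments by hand is to use the `start` and `end` character offsets, which always point back into the original string. The sketch below glues `##` continuation pieces onto the preceding span and re-reads the text from the offsets. `merge_wordpieces` is a made-up helper, and keeping the label of the first piece is a design choice for this sketch.

```python
# First line of example_text (including the leading newline, so the offsets match)
text = "\nIt's only been a day since ChatGPT's new AI image generator went live, "

# First five spans from the "simple" output above (scores omitted)
simple_spans = [
    {"entity_group": "LABEL_0", "word": "It ' s only been a day since", "start": 1, "end": 27},
    {"entity_group": "LABEL_3", "word": "Cha", "start": 28, "end": 31},
    {"entity_group": "LABEL_0", "word": "##t", "start": 31, "end": 32},
    {"entity_group": "LABEL_4", "word": "##GPT", "start": 32, "end": 35},
    {"entity_group": "LABEL_0", "word": "' s new", "start": 35, "end": 41},
]


def merge_wordpieces(spans, text):
    """Glue '##' continuation pieces onto the preceding span.

    The merged span keeps the label of its first piece, and its text is
    re-read from the original string using the character offsets.
    """
    merged = []
    for span in spans:
        continues_word = span["word"].startswith("##")
        if continues_word and merged and merged[-1]["end"] == span["start"]:
            merged[-1]["end"] = span["end"]         # extend the previous span
        else:
            merged.append({k: span[k] for k in ("entity_group", "start", "end")})
    for span in merged:
        span["word"] = text[span["start"]:span["end"]]
    return merged


merged_spans = merge_wordpieces(simple_spans, text)
for span in merged_spans:
    print(span["entity_group"], repr(span["word"]))
```

Slicing the original text also restores the spacing that the tokenizer changed, so `It ' s` comes back as `It's`.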


We could process the output to construct the complete entities, but a far simpler way is to use the "first" aggregation strategy, which groups adjacent words that share a label and merges each word's subtokens back into a whole word:

In [20]:
# Load pipeline with 'first' aggregation (recommended for entity extraction)
ner_pipeline_first = pipeline("token-classification", model=best_model_path, aggregation_strategy="first")

# Make predictions
results_first = ner_pipeline_first(example_text)
print(f"Found {len(results_first)} entity spans")
print("\nFirst 5 results:")
print(results_first[:5])

print("\nOr displayed as a table:")
display(pd.DataFrame(results_first[:5]))

Found 45 entity spans

First 5 results:
[{'entity_group': 'LABEL_0', 'score': 0.9998968, 'word': "It ' s only been a day since", 'start': 1, 'end': 27},
{'entity_group': 'LABEL_3', 'score': 0.98500407, 'word': 'ChatGPT', 'start': 28, 'end': 35}, {'entity_group': 'LABEL_0',
'score': 0.99977636, 'word': "' s new", 'start': 35, 'end': 41}, {'entity_group': 'LABEL_7', 'score': 0.95365113,
'word': 'AI', 'start': 42, 'end': 44}, {'entity_group': 'LABEL_0', 'score': 0.9998733, 'word': 'image generator went
live, and social media feeds are already flooded with', 'start': 45, 'end': 120}]

Or displayed as a table:


Unnamed: 0,entity_group,score,word,start,end
0,LABEL_0,0.999897,It ' s only been a day since,1,27
1,LABEL_3,0.985004,ChatGPT,28,35
2,LABEL_0,0.999776,' s new,35,41
3,LABEL_7,0.953651,AI,42,44
4,LABEL_0,0.999873,"image generator went live, and social media fe...",45,120


Notice how `aggregation_strategy="first"` groups related tokens into entity spans and merges subtokens.  This makes it relatively easy to extract entities from the text because we usually need little or no post-processing.  The "first", "average", and "max" strategies all merge subtokens into whole words; they differ in how they pick a word's label and score when its subtokens disagree ("first" uses the first subtoken, "average" averages the scores across subtokens, and "max" uses the highest-scoring subtoken), so they usually, though not always, extract the same entities.  We'll use the "first" aggregation strategy results in the remainder of this section.
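If you're curious what a "first"-style merge is doing, here is a toy sketch of the idea. It is not the Hugging Face implementation: the function names `merge_first` and `group_adjacent` and the `(subtoken, label)` input format are made up for illustration, the words are joined with plain spaces, and the labels are copied from the "simple" results above.

```python
def merge_first(subtokens):
    """Merge WordPiece subtokens into words, labelling each word with its first subtoken's label."""
    words = []
    for piece, label in subtokens:
        if piece.startswith("##") and words:
            # Continuation piece: glue it onto the previous word and keep that word's label.
            text, first_label = words[-1]
            words[-1] = (text + piece[2:], first_label)
        else:
            words.append((piece, label))
    return words


def group_adjacent(words):
    """Join neighbouring words that share a label into a single span."""
    spans = []
    for text, label in words:
        if spans and spans[-1][1] == label:
            spans[-1] = (spans[-1][0] + " " + text, label)
        else:
            spans.append((text, label))
    return spans


# "Cha" / "##t" / "##GPT" received three different labels in the "simple" output above.
pieces = [
    ("since", "LABEL_0"),
    ("Cha", "LABEL_3"),
    ("##t", "LABEL_0"),
    ("##GPT", "LABEL_4"),
    ("image", "LABEL_0"),
    ("generator", "LABEL_0"),
]
merged_spans = group_adjacent(merge_first(pieces))
print(merged_spans)
# [('since', 'LABEL_0'), ('ChatGPT', 'LABEL_3'), ('image generator', 'LABEL_0')]
```

Because the whole word inherits the label of "Cha", the stray labels on "##t" and "##GPT" no longer split "ChatGPT" into pieces.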

In [21]:
# Visualize the tagged text with colors
display_pipeline_ner_html(example_text, results_first, BIO_tags_list)

Note that we'll still have to do a bit of processing to extract complete entities.  We'll have to merge each B-tag with the subsequent I-tags so that "Elon" + "Musk" becomes "Elon Musk", for example.

### Extracting Entities by Type

The pipeline returns results as a list with entity positions. For many applications, we want to extract entities organized by type (PER, ORG, LOC, MISC) as a dictionary, similar to what LLMs return naturally (we'll see this below).

Let's create a helper function to convert pipeline results to this format.  If you want to really understand this function, we encourage you to work through it, perhaps with the help of AI.  You can also import this function from Lesson_10_Helper to use in your homework.

In [22]:
def extract_entities_dict(pipeline_results, label_list):
    """
    Convert pipeline results to dictionary (or list of dictionaries) organized by entity type.
    
    This function works with HuggingFace token classification pipelines to extract named entities
    and organize them by type (PER, ORG, LOC, MISC). It properly merges multi-token entities
    (e.g., "Elon Musk") by combining consecutive B- and I- tags.
    
    Args:
        pipeline_results: Either:
            - Single result: List of dicts from pipeline (one text)
            - Batch results: List of lists of dicts from pipeline (multiple texts)
        label_list: List of BIO tags (e.g., ['O', 'B-PER', 'I-PER', 'B-ORG', ...])
        
    Returns:
        If single text input:
            dict: {'PER': ['Elon Musk', 'Sam Altman'], 'ORG': ['OpenAI'], ...}
        If batch input:
            list of dict: [{'PER': [...], 'ORG': [...]}, {'PER': [...], 'LOC': [...]}, ...]
    
    Example:
        >>> # Single text
        >>> results = pipeline("Elon Musk founded OpenAI.")
        >>> extract_entities_dict(results, BIO_tags_list)
        {'PER': ['Elon Musk'], 'ORG': ['OpenAI'], 'LOC': [], 'MISC': []}
        
        >>> # Batch of texts
        >>> results = pipeline(["Elon Musk lives in Texas.", "OpenAI is in San Francisco."])
        >>> extract_entities_dict(results, BIO_tags_list)
        [{'PER': ['Elon Musk'], 'LOC': ['Texas'], ...}, 
         {'ORG': ['OpenAI'], 'LOC': ['San Francisco'], ...}]
    """
    
    # Check if input is batched (list of lists) or single (list of dicts)
    # Batched: [[{result1}, {result2}], [{result3}, {result4}]]
    # Single:  [{result1}, {result2}, {result3}]
    is_batched = isinstance(pipeline_results[0], list) if pipeline_results else False
    
    # If batched, recursively process each text's results
    if is_batched:
        return [extract_entities_dict(single_result, label_list) 
                for single_result in pipeline_results]
    
    # ============================================================================
    # Single text processing starts here
    # ============================================================================
    
    # Step 1: Initialize the output dictionary with all entity types
    # This ensures every entity type has a key, even if no entities are found
    entities = {}
    for label in label_list:
        if label != 'O':  # Skip 'O' which means "Outside" (no entity)
            # Extract entity type from BIO tag: 'B-PER' -> 'PER'
            entity_type = label.split('-')[-1]
            if entity_type not in entities:
                entities[entity_type] = []
    
    # Step 2: Track the current entity being built across multiple tokens
    # Example: "Elon" (B-PER) + "Musk" (I-PER) = "Elon Musk" (complete entity)
    current_entity_tokens = []  # Accumulates tokens for current entity
    current_entity_type = None  # Tracks which entity type we're building
    
    # Step 3: Process each token from the pipeline results sequentially
    # The pipeline returns results in text order, which is crucial for merging
    for result in pipeline_results:
        # Pipeline outputs entity_group as 'LABEL_X' where X is the label index
        # Example: 'LABEL_3' means index 3 in label_list
        entity_label = result['entity_group']
        
        # Convert 'LABEL_X' to the actual BIO tag string
        if entity_label.startswith('LABEL_'):
            label_idx = int(entity_label.replace('LABEL_', ''))
            
            # Look up the BIO tag (e.g., 'B-PER', 'I-LOC', 'O')
            if label_idx < len(label_list):
                bio_label = label_list[label_idx]
                
                # ----------------------------------------------------------------
                # Handle 'O' (Outside) tags - marks end of entity
                # ----------------------------------------------------------------
                if bio_label == 'O':
                    # If we were building an entity, save it now
                    if current_entity_tokens:
                        entity_text = ' '.join(current_entity_tokens).strip()
                        # Avoid adding duplicates or empty strings
                        if entity_text and entity_text not in entities[current_entity_type]:
                            entities[current_entity_type].append(entity_text)
                        # Reset state for next entity
                        current_entity_tokens = []
                        current_entity_type = None
                    continue  # Move to next token
                
                # ----------------------------------------------------------------
                # Extract entity information from BIO tag
                # ----------------------------------------------------------------
                entity_type = bio_label.split('-')[-1]  # 'B-PER' -> 'PER'
                entity_text = result['word'].strip()    # Token text
                
                # ----------------------------------------------------------------
                # Handle 'B-' (Beginning) tags - starts new entity
                # ----------------------------------------------------------------
                if bio_label.startswith('B-'):
                    # Save the previous entity if we were building one
                    if current_entity_tokens:
                        complete_entity = ' '.join(current_entity_tokens).strip()
                        if complete_entity and complete_entity not in entities[current_entity_type]:
                            entities[current_entity_type].append(complete_entity)
                    
                    # Start building a new entity
                    current_entity_tokens = [entity_text]
                    current_entity_type = entity_type
                
                # ----------------------------------------------------------------
                # Handle 'I-' (Inside) tags - continues current entity
                # ----------------------------------------------------------------
                elif bio_label.startswith('I-'):
                    # Check if this I- tag matches the current entity type
                    if current_entity_type == entity_type:
                        # Add token to current entity
                        # Example: current=['Elon'], adding 'Musk' -> ['Elon', 'Musk']
                        current_entity_tokens.append(entity_text)
                    else:
                        # Mismatched I- tag (tagging error or special case)
                        # Treat it as starting a new entity
                        if current_entity_tokens:
                            complete_entity = ' '.join(current_entity_tokens).strip()
                            if complete_entity and complete_entity not in entities[current_entity_type]:
                                entities[current_entity_type].append(complete_entity)
                        # Start new entity with this token
                        current_entity_tokens = [entity_text]
                        current_entity_type = entity_type
    
    # Step 4: Don't forget the last entity if text ends while building one
    # Example: "... lives in New York" - need to save "New York" at the end
    if current_entity_tokens:
        complete_entity = ' '.join(current_entity_tokens).strip()
        if complete_entity and complete_entity not in entities[current_entity_type]:
            entities[current_entity_type].append(complete_entity)
    
    return entities

Now we'll apply our extraction function to extract the entities from our example text, based on the "first" aggregation strategy results.

In [23]:
# Extract entities from our example
entities_dict = extract_entities_dict(results_first, BIO_tags_list)

print("Extracted entities by type:")
print(json.dumps(entities_dict, indent=2)) # Pretty print the dictionary, json was imported earlier

Extracted entities by type:
{
  "PER": [
    "Elon Musk",
    "Donald Trump",
    "Sam Altman"
  ],
  "ORG": [
    "ChatGPT",
    "Studio Ghibli",
    "OpenAI"
  ],
  "LOC": [],
  "MISC": [
    "AI",
    "Japanese",
    "My Neighbor Totoro",
    "Spirited Away",
    "Studio Ghibli",
    "The Lord of the Rings",
    "GPT - 4o"
  ]
}


## NER by Zero-Shot LLM Prompting

#### L10_1_LLM_NER Video

<iframe 
    src="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_llm_ner/" 
    width="800" 
    height="450" 
    style="border: 5px solid cyan;"  
    allowfullscreen>
</iframe>
<br>
<a href="https://media.uwex.edu/content/ds/ds776/ds776_l10_1_llm_ner/" target="_blank">Open UWEX version of video in new tab</a>
<br>
<a href="https://share.descript.com/view/EgBF1mreyjw" target="_blank">Open Descript version of video in new tab</a>

In this section we'll explore using LLMs for NER.  LLMs can do this quite well, but there are some differences to be aware of.  LLMs are naturally better at extracting spans (the relevant words for each identified entity) or structured output than at token-level labeling, because:

* They process text holistically, not token-by-token.
* There's no inherent token alignment.
* They can hallucinate or skip tokens when generating lists.
* The extracted spans may not exactly match the strings in the text, e.g. "ChatGPT's" gets extracted as "ChatGPT"

When we use an LLM to extract entities, we'll get lists of spans of each type.  You'll need to prompt carefully:
* Try to get the LLM to extract the entities as they appear in the text.
* You may need to provide examples or explanations of the entity types.
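One cheap sanity check on an LLM's output is to confirm that every returned span really occurs in the input text. Here is a minimal sketch; `find_unmatched_spans` is a hypothetical helper (not part of the course package), and the `llm_output` dictionary is invented for illustration.

```python
def find_unmatched_spans(text, entities):
    """Return, per entity type, the spans that do not occur verbatim in `text`."""
    unmatched = {}
    for entity_type, spans in entities.items():
        missing = [span for span in spans if span not in text]
        if missing:
            unmatched[entity_type] = missing
    return unmatched


text = "OpenAI CEO Sam Altman even seems to have made his new profile picture a Studio Ghibli-style image"

# Invented LLM output: one span has been altered relative to the text.
llm_output = {
    "PER": ["Sam Altman"],
    "ORG": ["OpenAI", "Studio Ghibli's"],
    "LOC": [],
    "MISC": [],
}

unmatched = find_unmatched_spans(text, llm_output)
print(unmatched)  # {'ORG': ["Studio Ghibli's"]}
```

Spans flagged this way are candidates for a closer look: they may be harmless rewordings or outright hallucinations.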

When we evaluate the results, we won't be able to compare token by token as we did above for the output of our BERT model (that kind of evaluation is similar to evaluating semantic segmentation results, where we can compare every pixel in the image to every pixel in the mask).  Instead we can just determine whether the LLM found each entity and whether it assigned the correct entity type.  It will help to use "fuzzy" matching, which doesn't require exact string matches, to account for misspellings and different presentations of words.

**Note:**  It's possible to use an LLM to produce token-level tags for each token through a combination of careful prompting and post-processing, but we'll stick with the simpler problem of identifying entities without identifying their positions in the text which is adequate for many applications.
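To give a feel for the post-processing such a token-level approach would need, here is a rough sketch that maps extracted spans back onto whitespace tokens and assigns BIO tags. The function `spans_to_bio` is hypothetical, and it assumes the spans match runs of tokens exactly, which LLM output often will not.

```python
def spans_to_bio(tokens, entities):
    """Assign BIO tags to `tokens` by locating each extracted span as a run of tokens."""
    tags = ["O"] * len(tokens)
    for entity_type, spans in entities.items():
        for span in spans:
            span_tokens = span.split()
            n = len(span_tokens)
            if n == 0:
                continue
            for start in range(len(tokens) - n + 1):
                window = tokens[start:start + n]
                # Only tag an exact match whose tokens have not been claimed already.
                if window == span_tokens and all(t == "O" for t in tags[start:start + n]):
                    tags[start] = f"B-{entity_type}"
                    for i in range(start + 1, start + n):
                        tags[i] = f"I-{entity_type}"
    return tags


tokens = "West Indian all-rounder Phil Simmons took four for 38".split()
entities = {"PER": ["Phil Simmons"], "MISC": ["West Indian"]}

bio_tags = spans_to_bio(tokens, entities)
for token, tag in zip(tokens, bio_tags):
    print(f"{token:12s} {tag}")
```

Any span that was reworded by the LLM simply gets no tags here, which is one reason we stick with entity-level evaluation below.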

We'll use `llm_generate` as we've done previously.    Here's the list of models that are easy to use with `llm_generate`.  You can adjust the code below to use other models, or the Groq or Together.AI APIs.

In [27]:
# Use gemini-flash-lite as default model
model_name = "gemini-flash-lite"
#model_name = "mistral-medium"

# System instruction for the model
system_instruct = "You are a helpful assistant for named entity recognition. You return entity spans in JSON."

# Example Text
example_text = """It's only been a day since ChatGPT's new AI image generator went live, and social media feeds 
are already flooded with AI-generated memes in the style of Studio Ghibli, the cult-favorite 
Japanese animation studio behind blockbuster films such as "My Neighbor Totoro" and "Spirited Away."

In the last 24 hours, we've seen AI-generated images representing Studio Ghibli versions of Elon Musk, 
"The Lord of the Rings", and President Donald Trump. OpenAI CEO Sam Altman even seems to have made his 
new profile picture a Studio Ghibli-style image, presumably made with GPT-4o's native image generator. 
Users seem to be uploading existing images and pictures into ChatGPT and asking the chatbot to re-create 
it in new styles."""

# Prompt for CoNLL2003-style entity extraction
prompt = """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess. 
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}

Return only the JSON object, nothing else.

Text: """ + example_text + " \nThe Entities JSON:"

response = llm_generate(model_name, prompt, system_prompt=system_instruct, 
                       mode='json', temperature=0)

print("Extracted entities by type from LLM:")
print(json.dumps(response, indent=2))

Extracted entities by type from LLM:
{
  "PER": [
    "Elon Musk",
    "Donald Trump",
    "Sam Altman"
  ],
  "ORG": [
    "ChatGPT",
    "OpenAI"
  ],
  "LOC": [],
  "MISC": [
    "AI",
    "Studio Ghibli",
    "My Neighbor Totoro",
    "Spirited Away",
    "The Lord of the Rings",
    "GPT-4o"
  ]
}


In [29]:
print("Extracted entities by type by DistilBERT:")
print(json.dumps(entities_dict, indent=2))

Extracted entities by type by DistilBERT:
{
  "PER": [
    "Elon Musk",
    "Donald Trump",
    "Sam Altman"
  ],
  "ORG": [
    "ChatGPT",
    "Studio Ghibli",
    "OpenAI"
  ],
  "LOC": [],
  "MISC": [
    "AI",
    "Japanese",
    "My Neighbor Totoro",
    "Spirited Away",
    "Studio Ghibli",
    "The Lord of the Rings",
    "GPT - 4o"
  ]
}


There are some differences.  Also note that our DistilBERT model isn't perfect either: see how it tagged "Studio Ghibli" as both "ORG" and "MISC".  LLMs seem to tag a lot of things as "MISC".  This could probably be improved by giving the LLM better instructions about what is meant by "MISC".  Overall, though, the results are pretty impressive for a model that hasn't been explicitly trained for NER on this data.

### Using an LLM for NER - Streamlining the Process

Similar to the way we made `llm_text_classifier` for text classification, we'll put our pipeline together here in a single function that takes a list of texts to be tagged and returns a list of entity dictionaries.   You could also import this function from Lesson_10_Helpers for use in the homework.

If you have to do a lot of this sort of work you should explore [LangChain](https://www.langchain.com/) which is an ecosystem of tools for developing applications powered by LLMs.  If you're curious check out the [documentation here](https://python.langchain.com/docs/introduction/).  Look at the tutorial for text classification to see how it compares to what we did in Lesson 8.

In [30]:
def llm_ner_extractor(model_name,
                      texts,
                      system_prompt,
                      prompt_template,
                      temperature=0):
    """
    Extract named entities using a Large Language Model (LLM) in zero-shot fashion.

    Args:
        model_name (str): Name of the LLM model to use (e.g., 'gemini-flash-lite').
        texts (list of str): List of input texts to process.
        system_prompt (str): System prompt guiding the LLM behavior.
        prompt_template (str): Template to construct the user prompt for each text.
        temperature (float, optional): Temperature for generation (0 = deterministic). Defaults to 0.

    Returns:
        list of dict: List of JSON objects containing extracted entities for each input text.
    """

    # Step 1: Create user prompts by formatting the prompt template with each input text.
    # This ensures that each text is passed to the LLM with the same structure.
    user_prompts = [prompt_template.format(text=text) for text in texts]

    # Step 2: Generate json outputs from the LLM using the provided model name and prompts.
    # The `llm_generate` function sends the prompts to the LLM and retrieves the responses.
    json_outputs = llm_generate(model_name,
                               user_prompts,
                               system_prompt=system_prompt,
                               mode='json',
                               temperature=temperature)

    return json_outputs
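One detail in Step 1 of `llm_ner_extractor` is easy to trip over: `str.format` treats `{...}` as a placeholder, so any literal braces in the template (such as a JSON skeleton we want the LLM to imitate) must be doubled. The short template below is a stand-in written for illustration, not the full prompt used in this lesson.

```python
# Doubled braces survive .format() as single literal braces; {text} is the only placeholder.
template = """Return the result as a JSON object in the format:
{{
  "PER": [...],
  "LOC": [...]
}}

Text: {text}
"""

filled = template.format(text="Kent beat Surrey .")
print(filled)
```

With single braces around the JSON skeleton, `format` would try to interpret its contents as a replacement field and raise an error instead of filling in the text.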

Now we'll apply `llm_ner_extractor` to the first 100 texts in the validation set to extract the entity dictionaries. We'll use the `gemini-flash-lite` model which is fast and inexpensive.

In [31]:
model_name = 'gemini-flash-lite'

# Extract N examples from the validation split of CoNLL2003
N = 100
subset = dataset["validation"].select(range(N))

texts = [' '.join(tokens) for tokens in subset["tokens"]] # Convert tokens to text

# System instruction for the model
system_instruct = "You are a helpful assistant for named entity recognition. You return entity spans in JSON."

# Prompt template adapted for CoNLL2003-style entity extraction.  
# You must keep {text} in the template for the text to be inserted.
prompt_template = """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess. 
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: {text}
The Entities JSON:
"""

# Call the LLM-based NER extractor
predicted_entities = llm_ner_extractor(
    model_name,
    texts,
    system_instruct,
    prompt_template,
    temperature=0
)

# Display the first few predictions for inspection
for i, text in enumerate(texts[:10]):
    print(f"Text: {text}")
    print("The Entities JSON:")
    print(predicted_entities[i])
    print("\n")

Generating:   0%|          | 0/100 [00:00<?, ?prompt/s]

Text: CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
The Entities JSON:
{'PER': [], 'ORG': ['LEICESTERSHIRE'], 'LOC': [], 'MISC': ['CRICKET']}


Text: LONDON 1996-08-30
The Entities JSON:
{'PER': [], 'ORG': [], 'LOC': ['LONDON'], 'MISC': []}


Text: West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and
39 runs in two days to take over at the head of the county championship .
The Entities JSON:
{'PER': ['Phil Simmons'], 'ORG': ['Leicestershire', 'Somerset'], 'LOC': [], 'MISC': ['West Indian', 'county
championship']}


Text: Their stay on top , though , may be short-lived as title rivals Essex , Derbyshire and Surrey all closed in on
victory while Kent made up for lost time in their rain-affected match against Nottinghamshire .
The Entities JSON:
{'PER': [], 'ORG': ['Essex', 'Derbyshire', 'Surrey', 'Kent', 'Nottinghamshire'], 'LOC': [], 'MISC': ['title']}


Text: After bowling Somerset out for 83 on the openin

Let's look to see if there were any problems extracting JSON from the LLM output.  We can count the number of output dictionaries that include 'Error' as a key (this will depend on the LLM and your prompt):

In [32]:
error_count = sum(1 for prediction in predicted_entities if 'Error' in prediction)
print(f"Number of dictionaries with 'Error' as a key: {error_count}")

Number of dictionaries with 'Error' as a key: 0


Great.  We were able to successfully extract JSON from every response.  Let's now evaluate the performance.  Since we're not comparing tags token-by-token what we'll do is:

1.  Use the token-by-token tags in the dataset to compute an entity dictionary for each input text.

2.  Compare the predicted entity dictionary to the "gold" entity dictionary for each example using fuzzy matching (inexact string matches).  In the context of NER the ground-truth labels are sometimes called the "gold" labels!

You can learn more about fuzzy string matching and the RapidFuzz package in the [RapidFuzz Documentation](https://rapidfuzz.github.io/RapidFuzz/).
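To see what fuzzy matching buys us, here is a small sketch that uses `difflib` from Python's standard library as a stand-in for RapidFuzz, so it runs without extra installs. The `similarity` and `fuzzy_match` functions and the threshold of 80 are assumptions chosen for illustration; the course helper may score and threshold matches differently.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(predicted, gold_list, threshold=80):
    """Return the best-matching gold entity scoring at least `threshold`, else None."""
    best = max(gold_list, key=lambda g: similarity(predicted, g), default=None)
    if best is not None and similarity(predicted, best) >= threshold:
        return best
    return None


gold = ["GPT-4o", "Studio Ghibli"]

print(fuzzy_match("GPT - 4o", gold))   # GPT-4o  (spacing differs, still a match)
print(fuzzy_match("Japanese", gold))   # None    (nothing similar in the gold list)
```

An exact string comparison would have counted "GPT - 4o" as a miss, even though it is clearly the same entity as "GPT-4o".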

We built a helper function called `extract_gold_entities`, which takes an example from our dataset and extracts the "gold" dictionary.  For the third validation example (the Phil Simmons sentence printed above), here are the extracted gold (ground-truth) entities:

In [33]:
gold_entities = extract_gold_entities(subset[2], BIO_tags_list)
gold_entities

{'MISC': ['West Indian'],
 'PER': ['Phil Simmons'],
 'ORG': ['Leicestershire', 'Somerset']}

And here are the predicted entities from our LLM:

In [34]:
predicted_entities[2]

{'PER': ['Phil Simmons'],
 'ORG': ['Leicestershire', 'Somerset'],
 'LOC': [],
 'MISC': ['West Indian', 'county championship']}
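The code for `extract_gold_entities` lives in the helper module and isn't shown here. As a rough idea of how token-level BIO tags can be turned into a dictionary like the gold one above, here is a minimal sketch. The name `bio_to_entities`, the label ordering in `label_list`, and the shortened token list are assumptions for illustration, not the helper's actual code.

```python
def bio_to_entities(tokens, tag_ids, label_list):
    """Collect entity strings by type from token-level BIO tags."""
    entities = {}
    current_tokens, current_type = [], None

    def flush():
        # Store the entity being built (if any) and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.setdefault(current_type, []).append(" ".join(current_tokens))
        current_tokens, current_type = [], None

    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag == "O":
            flush()
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            flush()  # a new entity starts here
            current_type = entity_type
        current_tokens.append(token)

    flush()  # an entity may run to the end of the text
    return entities


# Assumed label ordering (index -> BIO tag)
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokens = ["West", "Indian", "all-rounder", "Phil", "Simmons", "took", "four", "for", "38"]
tag_ids = [7, 8, 0, 1, 2, 0, 0, 0, 0]

gold_sketch = bio_to_entities(tokens, tag_ids, label_list)
print(gold_sketch)  # {'MISC': ['West Indian'], 'PER': ['Phil Simmons']}
```

Like the gold dictionary above, this sketch only includes keys for entity types that actually occur in the example.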

### Computing the Performance Metrics

To evaluate Named Entity Recognition (NER), we compare the entities predicted by the model with the **gold (true)** entities from the dataset.

We compute the following metrics **for each entity type** (e.g., PER, LOC, ORG):

- **Precision** = Correct predictions / All predictions  
- **Recall** = Correct predictions / All gold (true) entities  
- **F1 score** = Harmonic mean of precision and recall  
- **Accuracy** = Correct predictions / (Correct + Wrong + Missed predictions)

The function `evaluate_ner` in `helpers.py`, imported above, does the computations.  The next cell shows how to use it with `subset` from above and the entity dictionaries in `predicted_entities` that our LLM produced for it.
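To make the formulas above concrete, here is a minimal sketch that applies them to the single Phil Simmons example we just looked at. It uses exact string matching and pools the counts over all entity types, so it is a simplification and not the `evaluate_ner` implementation; `entity_counts` and `ner_metrics` are hypothetical names.

```python
def entity_counts(predicted, gold, labels):
    """Count correct, wrong, and missed entities using exact string matches."""
    correct = wrong = missed = 0
    for label in labels:
        pred_set = set(predicted.get(label, []))
        gold_set = set(gold.get(label, []))
        correct += len(pred_set & gold_set)   # predicted and in gold
        wrong += len(pred_set - gold_set)     # predicted but not in gold
        missed += len(gold_set - pred_set)    # in gold but not predicted
    return correct, wrong, missed


def ner_metrics(correct, wrong, missed):
    """Precision, recall, F1, and accuracy as defined in the list above."""
    precision = correct / (correct + wrong) if correct + wrong else 0.0
    recall = correct / (correct + missed) if correct + missed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / (correct + wrong + missed) if correct + wrong + missed else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


gold = {"MISC": ["West Indian"], "PER": ["Phil Simmons"], "ORG": ["Leicestershire", "Somerset"]}
predicted = {
    "PER": ["Phil Simmons"],
    "ORG": ["Leicestershire", "Somerset"],
    "LOC": [],
    "MISC": ["West Indian", "county championship"],
}

counts = entity_counts(predicted, gold, ["PER", "ORG", "LOC", "MISC"])
metrics = ner_metrics(*counts)
print(counts)   # (4, 1, 0)
print(metrics)  # precision 0.8, recall 1.0, f1 about 0.889, accuracy 0.8
```

The one extra prediction ("county championship") costs precision and accuracy but not recall, since every gold entity was still found.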

In [35]:
# Extract gold entities
gold_entities = [extract_gold_entities(ex, BIO_tags_list) for ex in subset]

# Evaluate
results_llm = evaluate_ner(predicted_entities, gold_entities, labels = ["PER", "ORG", "LOC", "MISC"])

# Format the evaluation results
df_results_llm = format_ner_eval_results(results_llm)
display(df_results_llm)


Unnamed: 0,Entity,Precision,Recall,F1,Number,Accuracy
0,PER,0.9828,0.9828,0.9828,58.0,
1,ORG,0.7797,0.6571,0.7132,70.0,
2,LOC,0.7015,0.8103,0.752,58.0,
3,MISC,0.06,0.25,0.0968,12.0,
4,Overall,0.6538,0.7727,0.7083,,0.5484


The LLM NER results are terrific for people and pretty good for locations and organizations, but poor for MISC: the LLM finds only about 25% of the true MISC entities, and only about 6% of the entities it labels as MISC are correct.  Maybe you can get it to work better by providing examples of MISC entities and additional instructions in the prompt.
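As a starting point for that experiment, here is one way the prompt template might be extended with guidance about MISC. The wording of `misc_guidance` and the name `prompt_template_misc` are our own assumptions, and we haven't measured whether this version actually improves the scores.

```python
# Hypothetical extra guidance for MISC; the wording and examples are assumptions to experiment with.
misc_guidance = """MISC means named entities that are not people, organizations, or locations, such as:
- nationalities and other adjectives derived from names (e.g., "West Indian")
- named events, products, and titles of works (e.g., "Spirited Away")
Do not label generic nouns (e.g., "county championship") as MISC."""

prompt_template_misc = (
    """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

"""
    + misc_guidance
    + """

Return each entity exactly as it appears in the text.

Return only a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Text: {text}
The Entities JSON:
"""
)

# Quick check that the template still formats cleanly
filled_prompt = prompt_template_misc.format(text="LONDON 1996-08-30")
print(filled_prompt)
```

This template keeps the `{text}` placeholder, so it could be passed to `llm_ner_extractor` in place of `prompt_template` to compare results.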

Here are the results from the BERT model (applied to the whole test set) for comparison:

In [37]:
display(format_ner_eval_results(results_BERT))

Unnamed: 0,Entity,Precision,Recall,F1,Number,Accuracy
0,LOC,0.9195,0.9113,0.9154,1668.0,
1,MISC,0.7235,0.7977,0.7588,702.0,
2,ORG,0.8451,0.8838,0.864,1661.0,
3,PER,0.9619,0.9518,0.9568,1617.0,
4,Overall,0.8825,0.9007,0.8915,,0.978


**Note:** The accuracies are very different, in part because they're computed differently.  For the BERT model we are able to include all the tokens tagged 'O' (outside), which is most of the tokens.  This inflates the accuracy, just as it would for a segmentation model evaluated on images in which most of the pixels are background.
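A tiny made-up example shows how strong this effect can be. Below, a "model" that never predicts an entity still scores 90% token accuracy, simply because 18 of the 20 tokens are 'O'. The tag sequences are invented for illustration.

```python
# 20 tokens, only two of which belong to an entity
gold_tags = ["O"] * 18 + ["B-PER", "I-PER"]
pred_tags = ["O"] * 20  # a "model" that never predicts an entity

# Accuracy over all tokens, including the many 'O' tokens
token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

# Accuracy restricted to tokens that are part of a gold entity
entity_pairs = [(g, p) for g, p in zip(gold_tags, pred_tags) if g != "O"]
entity_token_accuracy = sum(g == p for g, p in entity_pairs) / len(entity_pairs)

print(token_accuracy)         # 0.9
print(entity_token_accuracy)  # 0.0
```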