# Named Entity Recognition

In this notebook we'll

* List some common applications of NER
* Give a brief history of NER
* Demonstrate how to set up and fine-tune a DistilBERT model for NER
* Discuss some of the issues with using an LLM for an NER task

First, make sure your course package is updated for this lesson and homework.  You need to do this once per server, but not once per notebook.  The exact path will depend on where this notebook is in relation to the folder /Lessons/Course_Tools.

In [1]:
!pip install ../Course_Tools/introdl

Processing c:\users\bagge\my drive\python_projects\ds776_develop_project\ds776\lessons\course_tools\introdl
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: introdl
  Building wheel for introdl (pyproject.toml): started
  Building wheel for introdl (pyproject.toml): finished with status 'done'
  Created wheel for introdl: filename=introdl-1.0-py3-none-any.whl size=46690 sha256=04667f3efcf3b8d1a4c3e7edc3884b460c3496c150a9e16083514387489faeba
  Stored in directory: C:\Users\bagge\AppData\Local\Temp\pip-ephem-wheel-cache-ykyii0_f\wheels\f5\d5\0f\11f1d5af64d00defb23fa33cf51b2946a0899888d73571e687
Successfully built introdl
Installing collected packages: introdl
  Attempting 

After running that cell, you should restart the kernel.

## Applications of NER

I wasn't really familiar with Named Entity Recognition before building this course.  After studying it for a bit, though, I realized it's very similar to object detection and instance segmentation in computer vision, where we're trying to "tag" individual objects in an image; here we're doing it with text.  Now that I know more about it, I see that NER is everywhere:

- **Information Extraction from Text**
  - Identify names of people, places, organizations, and dates in news articles, legal documents, and academic papers.

- **Search and Question Answering**
  - Improve retrieval and understanding by recognizing key entities in queries and documents (e.g., “Where was Barack Obama born?”).

- **Social Media Monitoring**
  - Detect mentions of public figures, brands, products, and locations in tweets, posts, and comments for sentiment analysis or moderation.

- **Marketing and Trend Analysis**
  - Track mentions of brands, competitors, or topics over time to identify emerging trends and customer interests.

- **Content Recommendation**
  - Extract entities (e.g., movies, products, places) from reviews and user posts to personalize content or advertisements.

- **Customer Support Automation**
  - Identify product names, user accounts, and issue types in support chats and emails to assist routing and auto-response systems.

- **Financial and Business Intelligence**
  - Extract company names, stock tickers, monetary values, and events from reports or articles to support decision-making.

- **Medical and Clinical Text Analysis**
  - Identify diseases, medications, and procedures in clinical notes for tasks like anonymization, coding, or record analysis.

- **Legal and Compliance Monitoring**
  - Recognize case names, organizations, and laws in legal documents to support research, auditing, or compliance checks.

- **Resume and Job Post Parsing**
  - Extract structured information such as skills, education, job titles, and companies to streamline recruitment processes.

## **Chronology of State-of-the-Art Approaches for Named Entity Recognition (NER)**  

The evolution of NER closely parallels the evolution of algorithms for text classification.  Early approaches were based on statistical models, followed by word embeddings and recurrent neural networks, until transformer architectures revolutionized the field starting in 2017.  

Here's a timeline of some of the key advancements in NER:

---

### **Pre-2010s: Rule-Based Systems and Feature Engineering**  
Early NER systems used **hand-crafted rules**, lookup lists (called **gazetteers**), and basic statistical models like **Hidden Markov Models (HMMs)** and **Conditional Random Fields (CRFs)**.  
- **HMMs** modeled sequences by predicting the most likely tag (e.g., PERSON, LOCATION) for each word based on probabilities.
- **CRFs** improved on HMMs by allowing more flexible features and considering the entire sequence when making predictions.

These approaches required heavy manual feature engineering—like marking whether a word is capitalized, its part of speech, or its prefix/suffix.

- **1990s–2000s**: Rule-based systems and statistical models dominated tasks like newswire NER.
- **2003**: The CoNLL-2003 shared task standardized benchmarks and boosted interest in developing better NER models.
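
To give a feel for what "manual feature engineering" meant in practice, here is a minimal sketch of the kind of per-token feature dictionary that was typically fed to a CRF. The function name `word_features` and the particular feature names are our own choices for illustration, and part-of-speech tags are left out because they would require a separate tagger.

```python
def word_features(tokens, i):
    """Hand-crafted features for the token at position i (illustrative only)."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        # Neighboring words give the model a little context
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Only", "France", "and", "Britain", "backed", "Fischler", "'s", "proposal", "."]
features = [word_features(tokens, i) for i in range(len(tokens))]
print(features[1])
```

Every one of these features had to be designed and tuned by hand, which is the burden that learned embeddings later removed.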

---

### **2010s: Word Embeddings and Neural Sequence Models**  
NER systems improved significantly with the introduction of **word embeddings** like **Word2Vec** and **GloVe**, which represented words in continuous vector space based on context. These embeddings replaced sparse, manual features.

- **2013–2015**: **Word2Vec** and **GloVe** made it easier to train neural models for NER.
- **2015–2016**: **BiLSTM-CRF** architectures became popular—combining bidirectional LSTMs (which read sentences both forward and backward) with a CRF layer to model dependencies between entity tags.
- **2015**: **spaCy** launched as a fast, practical NLP library with built-in NER support, making NER accessible for developers and educators.
- **2016–2017**: Character-level embeddings and CNNs were added to improve robustness to spelling variation and rare words.

---

### **Late 2010s: Contextual Embeddings and Transformers**  
NER took a major leap with **contextualized embeddings** from transformer-based models.

- **2018**: **ELMo** introduced deep contextualized word representations that vary based on sentence context.
- **2018**: **BERT** achieved state-of-the-art NER results by treating NER as a token classification problem using bidirectional transformer layers.
- **2019**: **Flair** added character-level contextual embeddings to further improve performance on small or domain-specific datasets.

---

### **2020s: Prompting and Large Language Models (LLMs)**  
Recent NER approaches increasingly use **LLMs** like **GPT-4**, **Claude**, and **Gemini**, which can extract entities using **natural language prompts** instead of token-level supervision.

- **2020–2022**: Models like **RoBERTa**, **SpanBERT**, and **LUKE** fine-tuned transformer architectures for better span detection and entity-aware representations.
- **spaCy** added support for transformer-based pipelines (e.g., `en_core_web_trf`) to make state-of-the-art NER accessible for production use.
- **2023–2025**: Compact zero-shot models like **GLiNER** and general-purpose instruction-tuned LLMs now handle **zero-shot or few-shot NER**, with the entity types supplied at inference time, for example through prompts like *"Find all organizations and people in this sentence."* These models reduce the need for annotated datasets and allow rapid prototyping for new entity types.

  While LLMs offer flexibility and ease of use, they may be less precise than traditional models. Hybrid systems often combine LLMs with structured postprocessing or constrained decoding to improve accuracy.

---

We'll focus on two of these tools.  We'll fine-tune a BERT model for NER and we'll look at some of the hurdles to using LLMs for NER.  You'll explore both of these topics further in the homework.

Here's our main import cell before we dive into the rest of the material.

In [1]:
from datasets import load_dataset
import evaluate # Hugging Face library for evaluation
from IPython.display import display
import json
import numpy as np
import pandas as pd
import torch
from transformers import (pipeline, AutoTokenizer, AutoModelForTokenClassification, 
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)

# local packages
from helpers import (display_ner_html, predict_ner_tags, format_ner_eval_results, 
                     match_entity_spans, json_extractor, spans_to_bio_tags) 
from introdl.utils import config_paths_keys, wrap_print_text
from introdl.nlp import llm_generate, llm_configure, llm_list_models

print = wrap_print_text(print, width=120) # you can specify the wrap width for all print statements

paths = config_paths_keys() # import paths and keys
MODELS_PATH = paths['MODELS_PATH']
DATA_PATH = paths['DATA_PATH']


MODELS_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\models
DATA_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\data
TORCH_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HUB_CACHE=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
Successfully logged in to Hugging Face Hub.


## The Dataset - CoNLL2003 for NER

For our examples, we'll use the CoNLL2003 dataset.  It is one of the first widely used benchmarks for Named Entity Recognition (NER). It was introduced as part of the CoNLL-2003 shared task and contains annotated text for four entity types: **PER** (person), **LOC** (location), **ORG** (organization), and **MISC** (miscellaneous). The dataset is derived from Reuters news articles and is structured in the BIO format, where `B-` marks the first token of an entity, `I-` marks a continuation token inside the same entity, and `O` marks tokens outside any entity.  This format makes it a standard for evaluating NER models.
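
To make the BIO convention concrete, here is a minimal sketch that groups BIO tags back into entity spans. The function `bio_to_spans` and the example sentence are made up for illustration and are not part of the course helpers. A stray `I-` tag with no preceding `B-` is treated leniently as the start of a new span, which is an assumption on our part.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end_exclusive, text) spans."""
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        # A new span starts at any "B-" tag, or at an "I-" tag whose type differs
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type)
        # The open span ends at "O" or wherever a new span starts
        if current_type is not None and (tag == "O" or starts_new):
            spans.append((current_type, start, i, " ".join(tokens[start:i])))
            start, current_type = None, None
        if starts_new:
            start, current_type = i, tag[2:]
    # Close a span that runs to the end of the sentence
    if current_type is not None:
        spans.append((current_type, start, len(tokens), " ".join(tokens[start:])))
    return spans

tokens = ["Sam", "Altman", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('PER', 0, 2, 'Sam Altman'), ('LOC', 3, 5, 'New York')]
```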

Multiple versions of the dataset are available in Hugging Face.  We chose "tomaarsen/conll2003" because the NER tags are available in BIO format and because the list of possible labels is easy to extract.

In [2]:

# Load the CoNLL2003 dataset (this is not the best-known version of the dataset,
# but it is the easiest one to load with the datasets library)
dataset = load_dataset("tomaarsen/conll2003")
BIO_tags_list = dataset["train"].features["ner_tags"].feature.names
print("Possible BIO tags", BIO_tags_list)

# delete the pos_tags and chunk_tags columns, as we don't need them
for split in dataset.keys():
    dataset[split] = dataset[split].remove_columns(["pos_tags", "chunk_tags"])


Possible BIO tags ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Each sample in the dataset consists of a single sentence or headline.  Here is how it's stored:

In [3]:
print(dataset["train"][12])

{'id': '12', 'document_id': 1, 'sentence_id': 12, 'tokens': ['Only', 'France', 'and', 'Britain', 'backed', 'Fischler',
"'s", 'proposal', '.'], 'ner_tags': [0, 5, 0, 5, 0, 1, 0, 0, 0]}


Notice that the tokens are the words in the sentence, split up by whitespace and punctuation.  The ner_tags are indices into our list of BIO tags.  The next bit of code also shows you how to get the BIO tag corresponding to each token:

In [4]:
# Extract tokens and ner_tags from dataset["train"][12]
tokens = dataset["train"][12]["tokens"]
ner_tags = dataset["train"][12]["ner_tags"]

# Map ner_tags to their corresponding BIO tags using BIO_tags_list
bio_tags = [BIO_tags_list[tag] for tag in ner_tags]

# Create a DataFrame
df = pd.DataFrame({"Tokens": tokens, "NER Tags (IDs)": ner_tags, "BIO Tags": bio_tags})

# Display the DataFrame
display(df)

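| | Tokens | NER Tags (IDs) | BIO Tags |
|---|---|---|---|
| 0 | Only | 0 | O |
| 1 | France | 5 | B-LOC |
| 2 | and | 0 | O |
| 3 | Britain | 5 | B-LOC |
| 4 | backed | 0 | O |
| 5 | Fischler | 1 | B-PER |
| 6 | 's | 0 | O |
| 7 | proposal | 0 | O |
| 8 | . | 0 | O |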


[spaCy is a whole ecosystem](https://spacy.io/) of tools for NLP that we won't really dive into much in this course, but it's worth a look if you're going to be working in this area.  It provides some great tools for visualizing tagged text.  We've used their package to make a little function called `display_ner_html`, which takes a list of tokens, a list of tag IDs, and the list of labels and produces an HTML visualization of the tags.  The function is in helpers.py if you're curious.  Here's how we can use it:

In [5]:
# tokens and ner_tags were defined in the previous code cell

display_ner_html(tokens, ner_tags, BIO_tags_list)

In [6]:
# here's another example
display_ner_html(dataset["train"][4]["tokens"], dataset["train"][4]["ner_tags"], BIO_tags_list)

## Fine-tune DistilBERT for CoNLL2003

Now we want to fine-tune a BERT model so that it can provide similar tagging for new text.  First we'll load a model and its tokenizer.
`distilbert-base-cased` is a smaller, faster, and lighter version of BERT that retains 97% of its language understanding capabilities while being 40% smaller. It is case-sensitive, meaning it distinguishes between "Apple" and "apple", which is useful for NER tasks. It was distilled from BERT, using masked language modeling on the same data as BERT (English Wikipedia and BookCorpus), but with a reduced architecture to improve efficiency.

Note that we make use of `AutoModelForTokenClassification` which adds a classification head to the backbone the same way we did for transfer learning applications in image classification.  The backbone uses pretrained weights while the classification head weights are randomly initialized and learned during fine-tuning.

In [7]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-cased", num_labels=len(BIO_tags_list))
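
One optional refinement, which this notebook does not use, is to also give the model the mapping between label IDs and tag names so that saved checkpoints carry readable labels. The sketch below only builds the two dictionaries; the commented-out call shows where they would go, using the `id2label` and `label2id` keyword arguments that `from_pretrained` forwards to the model config.

```python
# Tag list copied from the dataset output shown earlier
BIO_tags_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

id2label = dict(enumerate(BIO_tags_list))                     # {0: 'O', 1: 'B-PER', ...}
label2id = {label: idx for idx, label in id2label.items()}    # {'O': 0, 'B-PER': 1, ...}
print(id2label[5], label2id["B-PER"])

# These could then be passed when loading the model, for example:
# model = AutoModelForTokenClassification.from_pretrained(
#     "distilbert-base-cased", id2label=id2label, label2id=label2id)
```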


One of the main issues we'll need to deal with is mapping the BIO tags to the tokens produced by the tokenizer that comes with our selected BERT model.  That tokenizer will break some of our words into subwords.  Only the first subword of each word keeps the word's tag; the remaining subwords (and special tokens such as [CLS] and [SEP]) get a label ID of -100, which tells the loss function to ignore them.
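
Here is a toy version of that alignment rule with no tokenizer involved. The subword split and `word_ids` list below are hypothetical, written by hand to mimic what a tokenizer might return; a real tokenizer decides the actual pieces.

```python
# Hypothetical subword split for illustration only
subword_tokens = ["[CLS]", "Fi", "##sch", "##ler", "'", "s", "proposal", "[SEP]"]
word_ids       = [None,    0,    0,       0,       1,   1,   2,          None]
word_labels    = [1, 0, 0]   # B-PER, O, O for the words "Fischler", "'s", "proposal"

aligned, previous = [], None
for word_idx in word_ids:
    if word_idx is None or word_idx == previous:
        aligned.append(-100)                    # special token or continuation subword
    else:
        aligned.append(word_labels[word_idx])   # first subword keeps the word's label
    previous = word_idx

print(list(zip(subword_tokens, aligned)))
```

Only three of the eight positions end up with a real label, one per original word.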

The function `tokenize_and_align_labels` below takes care of aligning the tag IDs from the input sequence in the dataset with the tokens produced by the tokenizer.  We've included some comments in the code if you want to study it, or you can use an AI to help you walk through the details.

In [8]:
# Helper function to align labels with tokens
def tokenize_and_align_labels(examples):
    # Tokenize the input text (list of tokens) while keeping track of word-to-token alignment
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    
    # Initialize a list to store the aligned labels for each example
    labels = []
    
    # Iterate over each example in the batch
    for i, label in enumerate(examples["ner_tags"]):
        # Get the word-to-token mapping for the current example
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        
        # Initialize variables to track the previous word index and the label IDs
        previous_word_idx = None
        label_ids = []
        
        # Iterate over the word IDs corresponding to the tokens
        for word_idx in word_ids:
            if word_idx is None:
                # If the token is a special token (e.g., [CLS], [SEP]), ignore it by assigning -100
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # If the token corresponds to a new word, assign the label of that word
                label_ids.append(label[word_idx])
            else:
                # If the token is part of the same word (e.g., subword tokens), ignore it by assigning -100
                label_ids.append(-100)
            
            # Update the previous word index to the current one
            previous_word_idx = word_idx
        
        # Append the aligned label IDs for the current example
        labels.append(label_ids)
    
    # Add the aligned labels to the tokenized inputs
    tokenized_inputs["labels"] = labels
    
    # Return the tokenized inputs with aligned labels
    return tokenized_inputs

# Tokenize datasets
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)


The next cell demonstrates how our tokenizer works with the alignment function to produce the tokenization expected by the model and to assign IDs of -100 to the extra subwords introduced by the tokenizer.

In [9]:
# Get the example
example = dataset["train"][7]

# Wrap in a batch of one for compatibility with tokenize_and_align_labels
batch = {"tokens": [example["tokens"]], "ner_tags": [example["ner_tags"]]}

# Apply the tokenization and alignment function
tokenized = tokenize_and_align_labels(batch)

# Extract and display results
tokens = tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
labels = tokenized["labels"][0]

print(("Before model tokenization:\n"))
display_ner_html(dataset["train"][7]["tokens"], dataset["train"][7]["ner_tags"], BIO_tags_list)
print(("\nAfter model tokenization:\n"))
display_ner_html(tokens, labels, BIO_tags_list)


Before model tokenization:




After model tokenization:



You can see that the tokenizer divided some of the original words into subwords which get assigned an ID of -100 to be ignored by the model.  During training those tokens are ignored by the loss function and the outputs corresponding to those tokens are ignored during model evaluation.
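
To see what "ignored by the loss function" means, here is a small pure-Python sketch of a cross-entropy loss that skips positions labeled -100. It mirrors the `ignore_index=-100` convention of PyTorch's cross-entropy loss, but the function `masked_cross_entropy` and the logits are made up for illustration; this is not the code the `Trainer` actually runs.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Average cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored positions contribute nothing to the loss
        log_norm = math.log(sum(math.exp(z) for z in token_logits))
        total += log_norm - token_logits[label]   # negative log-probability of the true label
        count += 1
    return total / count if count else 0.0

# Three token positions, three possible labels each
logits = [[2.0, 0.1, 0.1], [0.2, 3.0, 0.1], [5.0, 0.0, 0.0]]
loss_all = masked_cross_entropy(logits, [0, 1, 2])        # third position is badly wrong
loss_masked = masked_cross_entropy(logits, [0, 1, -100])  # third position is ignored
print(loss_all, loss_masked)
```

Masking the third position removes its large penalty, so whatever the model outputs for an ignored subword has no effect on training.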

Before we fine-tune the model we define a custom metrics function that does two things:
1. Uses the `seqeval` package to evaluate entire entity spans (e.g., `B-LOC`, `I-LOC` forming `"New York"`) instead of evaluating individual labels as we'd do with the scikit-learn metrics.
2. Ignores the tokens with IDs of -100 for the evaluation metrics:

In [10]:
# Load seqeval metric
metric = evaluate.load("seqeval")

# Note if you have a different list of possible tags, you'll need to change the default value of label_list
def compute_metrics(p, label_list=BIO_tags_list):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_predictions, references=true_labels)
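
To illustrate why span-level scoring is stricter than token-level scoring, here is a simplified sketch of the idea. It is our own approximation for a single sentence, not `seqeval`'s implementation, and the helper names `extract_entities` and `span_prf` are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (type, start, end_exclusive) spans from a BIO tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel "O" closes a trailing span
        begins = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype)
        if etype is not None and (tag == "O" or begins):
            entities.add((etype, start, i))
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]
    return entities

def span_prf(true_tags, pred_tags):
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)   # a span counts only if type AND boundaries match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

true_tags = ["B-LOC", "I-LOC", "O", "B-PER"]
pred_tags = ["B-LOC", "O",     "O", "B-PER"]
print(span_prf(true_tags, pred_tags))   # (0.5, 0.5, 0.5)
```

Three of the four token labels are correct here, but the truncated location span earns no credit, so the span-level scores are only 0.5.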


For the actual fine-tuning we use a similar setup to what we did for text classification:

In [11]:

# Training arguments
training_args = TrainingArguments(
    output_dir= MODELS_PATH / "distilbert-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    report_to="none",
    seed=42,
)

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()




Epoch,Training Loss,Validation Loss,Loc,Misc,Org,Per,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1,0.0562,0.057373,"{'precision': 0.9320021586616298, 'recall': 0.9401197604790419, 'f1': 0.9360433604336044, 'number': 1837}","{'precision': 0.7926565874730022, 'recall': 0.7960954446854663, 'f1': 0.7943722943722944, 'number': 922}","{'precision': 0.8468531468531468, 'recall': 0.9030574198359433, 'f1': 0.8740526885600866, 'number': 1341}","{'precision': 0.9754846066134549, 'recall': 0.9288816503800217, 'f1': 0.9516129032258064, 'number': 1842}",0.902734,0.905924,0.904326,0.98386
2,0.0117,0.047674,"{'precision': 0.9544468546637744, 'recall': 0.9580838323353293, 'f1': 0.95626188535724, 'number': 1837}","{'precision': 0.83991462113127, 'recall': 0.8535791757049892, 'f1': 0.8466917697686928, 'number': 922}","{'precision': 0.8945022288261516, 'recall': 0.8978374347501864, 'f1': 0.8961667286937105, 'number': 1341}","{'precision': 0.9647314161692893, 'recall': 0.9652551574375678, 'f1': 0.9649932157394844, 'number': 1842}",0.926131,0.930495,0.928308,0.987851
3,0.0124,0.045236,"{'precision': 0.9591503267973857, 'recall': 0.9586281981491562, 'f1': 0.9588891913966784, 'number': 1837}","{'precision': 0.8459915611814346, 'recall': 0.8698481561822126, 'f1': 0.8577540106951872, 'number': 922}","{'precision': 0.9022222222222223, 'recall': 0.9082774049217002, 'f1': 0.9052396878483836, 'number': 1341}","{'precision': 0.965386695511087, 'recall': 0.9690553745928339, 'f1': 0.967217556217827, 'number': 1842}",0.930303,0.936722,0.933501,0.988688


TrainOutput(global_step=2634, training_loss=0.06490828401084506, metrics={'train_runtime': 82.5699, 'train_samples_per_second': 510.15, 'train_steps_per_second': 31.9, 'total_flos': 525319502290632.0, 'train_loss': 0.06490828401084506, 'epoch': 3.0})

In [12]:

# Evaluate on test set
results = trainer.evaluate(tokenized_datasets["test"])
print("\nTest set evaluation results:")
print(results)



Test set evaluation results:
{'eval_loss': 0.1198493242263794, 'eval_LOC': {'precision': 0.9148681055155875, 'recall': 0.9148681055155875, 'f1':
0.9148681055155875, 'number': 1668}, 'eval_MISC': {'precision': 0.7110266159695817, 'recall': 0.7991452991452992, 'f1':
0.7525150905432596, 'number': 702}, 'eval_ORG': {'precision': 0.8554710356933879, 'recall': 0.8801926550270921, 'f1':
0.8676557863501483, 'number': 1661}, 'eval_PER': {'precision': 0.960551033187226, 'recall': 0.9486703772418058, 'f1':
0.95457373988799, 'number': 1617}, 'eval_overall_precision': 0.8820058997050148, 'eval_overall_recall':
0.8999645892351275, 'eval_overall_f1': 0.8908947506791691, 'eval_overall_accuracy': 0.9785291267362981, 'eval_runtime':
1.7902, 'eval_samples_per_second': 1928.783, 'eval_steps_per_second': 120.654, 'epoch': 3.0}


That's some ugly output!  Let's put it in a data frame with some formatting:

In [None]:
df_results = format_ner_eval_results(results)
display(df_results)

| | Entity | Precision | Recall | F1 | Number | Accuracy |
|---|---|---|---|---|---|---|
| 0 | LOC | 0.9149 | 0.9149 | 0.9149 | 1668.0 | |
| 1 | MISC | 0.711 | 0.7991 | 0.7525 | 702.0 | |
| 2 | ORG | 0.8555 | 0.8802 | 0.8677 | 1661.0 | |
| 3 | PER | 0.9606 | 0.9487 | 0.9546 | 1617.0 | |
| 4 | Overall | 0.882 | 0.9 | 0.8909 | | 0.9785 |


That's better!  F1 is the harmonic mean of precision and recall, so it is high only when both are high.  We can see that the model does a great job of identifying people and is also good at identifying locations and organizations.  It doesn't do quite as well at identifying miscellaneous entities (but I'm not sure what those are supposed to be either ...).
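
As a quick check on how F1 relates to the other two columns, we can recompute it for the MISC row from the precision and recall reported in the raw test results above, using F1 = 2PR / (P + R):

```python
# Precision and recall for MISC, copied from the raw test-set results
precision, recall = 0.7110266159695817, 0.7991452991452992

f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
print(round(f1, 4))   # 0.7525
```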

You can also use the model to do inference by making predictions on new text.  The function below is also included in the helpers.py file, but we include it here so you can study it and see how it works:

In [27]:
def predict_ner_tags(text, model, tokenizer):
    """
    Tokenizes and predicts NER tags for the given text using a Hugging Face model.

    Args:
        text (str): Input sentence (e.g., "Barack Obama was born in Hawaii").
        model: A Hugging Face token classification model (e.g., DistilBERT).
        tokenizer: The tokenizer corresponding to the model.

    Returns:
        tokens (List[str]): Original word tokens from the input text.
        predicted_tag_ids (List[int]): One predicted tag index per word (subwords/specials skipped).
    """

    # Step 1: Split the input text into whitespace-separated words
    words = text.split()

    # Step 2: Tokenize the list of words and retain word alignment
    inputs = tokenizer(words, return_tensors="pt", is_split_into_words=True).to(model.device)

    # Step 3: Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Step 4: Convert logits to predicted label indices
    predictions = torch.argmax(outputs.logits, dim=2)[0].cpu().numpy()

    # Step 5: Get word IDs for each token
    word_ids = inputs.word_ids(batch_index=0)

    # Step 6: Extract one prediction per word (first subword only)
    predicted_tag_ids = []
    seen_words = set()
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is not None and word_idx not in seen_words:
            predicted_tag_ids.append(int(predictions[token_idx]))
            seen_words.add(word_idx)
        # skip subwords and special tokens

    # Step 7: Return the original words and corresponding predicted tags
    return words, predicted_tag_ids


Now we'll apply it to some example text we copied from the internet (about GPT-4o's new image generation capability):

In [28]:
example_text = """
It’s only been a day since ChatGPT’s new AI image generator went live, and social media feeds are already flooded with AI-generated memes in the style of Studio Ghibli, the cult-favorite Japanese animation studio behind blockbuster films such as “My Neighbor Totoro” and “Spirited Away.”

In the last 24 hours, we’ve seen AI-generated images representing Studio Ghibli versions of Elon Musk, “The Lord of the Rings“, and President Donald Trump. OpenAI CEO Sam Altman even seems to have made his new profile picture a Studio Ghibli-style image, presumably made with GPT-4o’s native image generator. Users seem to be uploading existing images and pictures into ChatGPT and asking the chatbot to re-create it in new styles.
"""

tokens, tags = predict_ner_tags(example_text, model, tokenizer)
print(tokens,tags)

['It’s', 'only', 'been', 'a', 'day', 'since', 'ChatGPT’s', 'new', 'AI', 'image', 'generator', 'went', 'live,', 'and',
'social', 'media', 'feeds', 'are', 'already', 'flooded', 'with', 'AI-generated', 'memes', 'in', 'the', 'style', 'of',
'Studio', 'Ghibli,', 'the', 'cult-favorite', 'Japanese', 'animation', 'studio', 'behind', 'blockbuster', 'films',
'such', 'as', '“My', 'Neighbor', 'Totoro”', 'and', '“Spirited', 'Away.”', 'In', 'the', 'last', '24', 'hours,', 'we’ve',
'seen', 'AI-generated', 'images', 'representing', 'Studio', 'Ghibli', 'versions', 'of', 'Elon', 'Musk,', '“The', 'Lord',
'of', 'the', 'Rings“,', 'and', 'President', 'Donald', 'Trump.', 'OpenAI', 'CEO', 'Sam', 'Altman', 'even', 'seems', 'to',
'have', 'made', 'his', 'new', 'profile', 'picture', 'a', 'Studio', 'Ghibli-style', 'image,', 'presumably', 'made',
'with', 'GPT-4o’s', 'native', 'image', 'generator.', 'Users', 'seem', 'to', 'be', 'uploading', 'existing', 'images',
'and', 'pictures', 'into', 'ChatGPT', 'and', 'asking', '

Of course, the raw output is kind of difficult to interpret, but we can easily visualize it with `display_ner_html`:

In [30]:
display_ner_html(tokens, tags, BIO_tags_list)


That seems pretty amazing for entity recognition on text the model has never seen!
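
One detail worth noticing in the token list printed above: `text.split()` leaves punctuation attached to words (`'live,'`, `'Ghibli,'`), whereas the CoNLL2003 tokens are split on both whitespace and punctuation. Below is a rough sketch of a regex tokenizer that gets closer to the CoNLL style. The function `simple_word_tokenize` is hypothetical, the pattern is only an approximation, and we have not tested whether it changes the model's predictions.

```python
import re

def simple_word_tokenize(text):
    """Split text into words and punctuation marks (rough approximation only)."""
    # possessive 's (straight or curly apostrophe) | run of word characters | single symbol
    return re.findall(r"['’]s\b|\w+|[^\w\s]", text)

print(simple_word_tokenize("ChatGPT’s new AI image generator went live, and"))
# ['ChatGPT', '’s', 'new', 'AI', 'image', 'generator', 'went', 'live', ',', 'and']
```

Using it with `predict_ner_tags` would mean replacing the `text.split()` line with a call to a tokenizer like this one.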

## NER by Zero-Shot LLM Prompting

In this section we'll explore using LLMs for NER.  LLMs can do this quite well, but there are some differences to be aware of.  LLMs are naturally better at extracting spans (the relevant words for each identified entity) or producing structured output than at token-level labeling, because:

* They process text holistically, not token-by-token.
* There's no inherent token alignment.
* They can hallucinate or skip tokens when generating lists.

When we use an LLM to extract entities, we'll get lists of strings (also called spans in this context) for each entity type.  You must prompt carefully to get the LLM to return only exact matching strings for each entity type.  You also may need to explain, or give examples of, each entity type since the LLM can't learn these meanings from a corpus of labeled text.  Finally, to evaluate metrics without having token level classification you'll have to tokenize the identified strings and compare these to tokens from the text to identify matches, then compare the LLM predicted entity to the BIO tags for that entity.  It may help to use "fuzzy" or inexact matching since the LLM may adjust spellings in its identified strings.
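
The matching step described above can be sketched with the standard library's `difflib`. This is a simplified illustration, not the implementation of the `match_entity_spans` or `spans_to_bio_tags` helpers imported earlier. The function name `spans_to_bio_sketch`, the 0.8 threshold, and the example sentence are all assumptions, and only the best-matching occurrence of each span gets tagged.

```python
from difflib import SequenceMatcher

def spans_to_bio_sketch(tokens, entities, threshold=0.8):
    """Assign BIO tags to tokens by fuzzy-matching each predicted span against token windows."""
    tags = ["O"] * len(tokens)
    for entity_type, spans in entities.items():
        for span in spans:
            n = len(span.split())
            if n == 0:
                continue  # nothing to match for an empty span
            best_score, best_start = 0.0, None
            # Slide a window of n tokens across the text and keep the best-scoring match
            for start in range(len(tokens) - n + 1):
                window = " ".join(tokens[start:start + n])
                score = SequenceMatcher(None, span.lower(), window.lower()).ratio()
                if score > best_score:
                    best_score, best_start = score, start
            if best_start is not None and best_score >= threshold:
                # Skip spans that would overwrite tokens that are already tagged
                if all(tag == "O" for tag in tags[best_start:best_start + n]):
                    tags[best_start] = f"B-{entity_type}"
                    for j in range(best_start + 1, best_start + n):
                        tags[j] = f"I-{entity_type}"
    return tags

tokens = "OpenAI CEO Sam Altman uses ChatGPT’s image generator .".split()
entities = {"PER": ["Sam Altman"], "ORG": ["OpenAI"], "MISC": ["ChatGPT"]}
print(list(zip(tokens, spans_to_bio_sketch(tokens, entities))))
```

The inexact match lets the span "ChatGPT" land on the token "ChatGPT’s" even though the strings differ, and the resulting tag list can be compared with gold BIO tags in the usual way.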

We'll use `llm_generate` as we've done previously.    Here's the list of models that are easy to use with `llm_generate`.  You can adjust the code below to use other models, or the Groq or Together.AI APIs.

In [7]:
llm_list_models()


Available models:
 llama-3p2-3B => HuggingFace: unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit
 llama-3p1-8B => HuggingFace: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
 mistral-7B => HuggingFace: unsloth/mistral-7b-instruct-v0.3-bnb-4bit
 qwen-2p5-3B => HuggingFace: unsloth/Qwen2.5-3B-Instruct-bnb-4bit
 qwen-2p5-7B => HuggingFace: unsloth/Qwen2.5-7B-Instruct-bnb-4bit
 gemini-flash-lite => needs GEMINI_API_KEY
 gemini-flash => needs GEMINI_API_KEY
 gpt-4o => needs OPENAI_API_KEY
 gpt-4o-mini => needs OPENAI_API_KEY


<zip at 0x19384c0d8c0>

### Using an LLM for NER - The Basics

We'll start by crafting a prompt and asking a local model to identify the CoNLL entities (PER, LOC, ORG, MISC) in the example text from the last section.  We're going to specify the entity types in the prompt and try to get the model to produce JSON output.  JSON is a text format whose objects look much like Python dictionaries.  Let's see what happens.

In [6]:
llm_config = llm_configure("llama-3p1-8B")

# System instruction for the model
system_instruct = "You are a helpful assistant for named entity recognition. You return entity spans in JSON."

# Example Text
example_text = """It’s only been a day since ChatGPT’s new AI image generator went live, and social media feeds 
are already flooded with AI-generated memes in the style of Studio Ghibli, the cult-favorite 
Japanese animation studio behind blockbuster films such as “My Neighbor Totoro” and “Spirited Away.”

In the last 24 hours, we’ve seen AI-generated images representing Studio Ghibli versions of Elon Musk, 
“The Lord of the Rings“, and President Donald Trump. OpenAI CEO Sam Altman even seems to have made his 
new profile picture a Studio Ghibli-style image, presumably made with GPT-4o’s native image generator. 
Users seem to be uploading existing images and pictures into ChatGPT and asking the chatbot to re-create 
it in new styles."""

# Prompt for CoNLL2003-style entity extraction
prompt = """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess. 
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: """ + example_text + " \nThe Entities JSON:"

response = llm_generate(llm_config, prompt, system_prompt = system_instruct, search_strategy='deterministic', remove_input_prompt=False)
print(response)

🚀 Loading model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit (this may take a while)...
🟢 Model unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit loaded successfully.

system

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant for named entity recognition. You return entity spans in JSON.user

Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess.
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: It’s only been a day since ChatGPT’s new AI image generator went live, and social media feeds
are already flooded with AI-generated memes in the style of Studio

Note some things about the output:
* `llm_generate` may fail to remove the prompt from the output. Above we deliberately kept the prompt by passing `remove_input_prompt=False`, but even with `remove_input_prompt=True` the cleaning algorithm sometimes fails to detect the input prompt in the output.  You should generally use `remove_input_prompt=True`, or just leave the argument out since it defaults to `True`.
* It misidentified "ChatGPT" as a person.
* It also returned the span as "ChatGPT" instead of "ChatGPT's" as it occurs in the text.

We can fix the first issue by passing a `split_string` to `llm_generate` which will delete all the text up to the string.  We might be able to fix the second issue by providing examples (few-shot prompting) or more careful instructions to the LLM.  The third issue is why we'll need to use some inexact matching to match predicted spans with the input text.  

First let's see how to get rid of that input prompt if necessary.

In [7]:
response = llm_generate(llm_config, prompt, system_prompt = system_instruct, 
                        search_strategy='deterministic', split_string='JSON:assistant')
print(response)

{"PER":["ChatGPT", "Elon Musk", "Sam Altman", "Donald Trump"], "ORG":["Studio Ghibli", "OpenAI"], "LOC":[], "MISC":[]}


That's better.  You may not need the split_string with some LLMs (particularly the API-based LLMs) or you may need to adjust it for different models.  
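
As a rough illustration of what a `split_string` does, the sketch below keeps only the text after the first occurrence of a marker. This is not the code inside `llm_generate` (its implementation isn't shown in this notebook); the function name `keep_text_after` and the sample output string are made up for the example.

```python
def keep_text_after(output: str, split_string: str) -> str:
    """Drop everything up to and including the first occurrence of split_string."""
    head, marker, tail = output.partition(split_string)
    # If the marker is missing, return the output unchanged rather than an empty string
    return tail.strip() if marker else output.strip()

# Hypothetical raw output: the echoed prompt, the chat-template role name, then the answer
raw = 'Text: some input text \nThe Entities JSON:assistant\n\n{"PER": [], "ORG": ["OpenAI"]}'
print(keep_text_after(raw, "JSON:assistant"))
```

The marker works here because the prompt ends with "The Entities JSON:" and the echoed chat template appears to append the role name "assistant" right after it, so a model with a different chat template would likely need a different marker.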

Finally, the output is still a string, but we'd like to load that string as an actual dictionary. We can use `json.loads` to load the JSON formatted string as a dictionary in Python.  Some LLMs, like Gemini, will return the output with Markdown formatting like this:

<pre>
```json
{"PER":["ChatGPT", "Elon Musk", "Sam Altman", "Donald Trump"], "ORG":["Studio Ghibli", "OpenAI"], "LOC":[], "MISC":[]}
```
</pre>

So we may need to strip those extra characters away before using `json.loads`.  Here's a little function to do both of those things.  It's also in helpers.py:

In [8]:
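import json

def json_extractor(text):
    # Extract the JSON object from the response
    try:
        # Remove an optional Markdown code fence. str.strip("```json") would strip any
        # of the characters `, j, s, o, n from both ends rather than the literal prefix,
        # so we use removeprefix/removesuffix (Python 3.9+) instead.
        text = text.strip().removeprefix("```json").removeprefix("```")
        text = text.removesuffix("```").strip()
        json_object = json.loads(text)
    except json.JSONDecodeError:
        json_object = {"error": "Could not parse JSON"}
    return json_object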

Now, to see it in action:

In [9]:
entities = json_extractor(response)
entities

{'PER': ['ChatGPT', 'Elon Musk', 'Sam Altman', 'Donald Trump'],
 'ORG': ['Studio Ghibli', 'OpenAI'],
 'LOC': [],
 'MISC': []}

### Finding the Predicted Entities in the Original Text

Now we still need to figure out where each of the entities identified by the LLM appears in the text.  In NER we call this finding the span of each entity, which means identifying the position of each entity in the text.  We'll use two helper functions (both are in `helpers.py`) to do this:

1. `clean_token` removes any punctuation from a token.
2. `match_entity_spans` searches for each entity identified in our JSON output and returns a tuple with (entity_type, start_index, end_index) where the indices are the positions in the list of tokens.  We make use of fuzzy text matching as implemented in the `rapidfuzz` package.  You can learn more about fuzzy string matching and the package in the [RapidFuzz Documentation](https://rapidfuzz.github.io/RapidFuzz/).

I encourage you to look at the `helpers.py` file to see how these functions are implemented and to study how they work.  We imported `match_entity_spans` near the beginning of this notebook.  

To better understand what `match_entity_spans` does, we'll split our example text into a list of tokens, then search for the spans for each entity identified by our LLM model.

In [10]:
tokens = example_text.split()
spans = match_entity_spans(entities, tokens)
spans

[('PER', 6, 7),
 ('PER', 104, 105),
 ('PER', 59, 61),
 ('PER', 72, 74),
 ('PER', 68, 70),
 ('ORG', 27, 29),
 ('ORG', 55, 57),
 ('ORG', 70, 71)]
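
To make the idea of fuzzy span matching concrete, here is a minimal sketch of a sliding-window search over the tokens. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for `rapidfuzz`, so it is not the implementation in `helpers.py`; the function names and the 0.85 threshold are assumptions chosen for this example.

```python
import string
from difflib import SequenceMatcher

def clean(token):
    # Lowercase and drop punctuation at the ends of the token
    return token.strip(string.punctuation).lower()

def find_spans(entity, tokens, threshold=0.85):
    """Return (start, end) token index pairs, end exclusive, that roughly match entity."""
    words = entity.split()
    n = len(words)
    if n == 0:
        return []
    target = " ".join(clean(w) for w in words)
    spans = []
    # Compare the entity against every window of n consecutive tokens
    for start in range(len(tokens) - n + 1):
        window = " ".join(clean(t) for t in tokens[start:start + n])
        if SequenceMatcher(None, target, window).ratio() >= threshold:
            spans.append((start, start + n))
    return spans

tokens = "It's only been a day since ChatGPT's new AI image generator went live".split()
print(find_spans("ChatGPT", tokens))  # [(6, 7)]
```

An exact comparison would miss this entity because the token is "ChatGPT's" while the LLM returned "ChatGPT"; the similarity score tolerates the extra possessive ending.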

We have another helper function, `spans_to_bio_tags`, that converts these spans into BIO tag IDs corresponding to the list of tokens, for visualization or evaluation.  Because we want the ID for each tag, we need to pass our list of possible tags.  Below we generate the BIO tag IDs (one for each token in our example text), then visualize the results:

In [11]:
tags = spans_to_bio_tags(tokens, spans, BIO_tags_list)
display_ner_html(tokens, tags, BIO_tags_list)

In [12]:
tags

[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 1,
 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 1, 2, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


The LLM misidentifies "ChatGPT" as a person, but it's still pretty impressive that it identifies most of the entities in the text despite not being trained for that task!
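
Few-shot prompting, mentioned earlier as a possible remedy for the "ChatGPT" mislabel, amounts to placing a few worked examples ahead of the real input. The sketch below shows one way to assemble such a prompt; the helper name, the example sentence, and its labels are all invented for illustration, and whether this actually fixes the mislabel would need to be tested.

```python
import json

# Hypothetical worked example showing a product name labeled as MISC rather than PER
few_shot_examples = [
    ("Google released Gemini in London.",
     {"PER": [], "ORG": ["Google"], "LOC": ["London"], "MISC": ["Gemini"]}),
]

def build_few_shot_prompt(instructions, examples, text):
    parts = [instructions.strip(), ""]
    for example_text, example_entities in examples:
        parts.append(f"Text: {example_text}")
        parts.append(f"The Entities JSON: {json.dumps(example_entities)}")
        parts.append("")
    # The real input goes last, ending where the model should start writing
    parts.append(f"Text: {text}")
    parts.append("The Entities JSON:")
    return "\n".join(parts)

few_shot_prompt = build_few_shot_prompt(
    "Extract PER, ORG, LOC, and MISC entities and return only a JSON object.",
    few_shot_examples,
    "ChatGPT's new image generator went live.",
)
print(few_shot_prompt)
```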

To summarize the steps we followed to generate BIO tags for an input text using an LLM:
1.  Prompt an LLM to identify the desired entities in JSON format.
2.  Strip the LLM output and load the JSON output.
3.  Find the spans of the identified entities among tokens from the original text (the tokens are based on whitespace).
4.  Use the spans to generate a BIO tag ID for each input token.
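
Step 4 can be sketched in a few lines. This is a simplified stand-in for `spans_to_bio_tags`, not the version in `helpers.py`; it assumes spans are `(entity_type, start, end)` tuples with an exclusive end index and that the tag list follows the CoNLL2003 ordering shown below.

```python
def spans_to_bio_ids(num_tokens, spans, bio_tags):
    tag_to_id = {tag: i for i, tag in enumerate(bio_tags)}
    ids = [tag_to_id["O"]] * num_tokens
    for entity_type, start, end in spans:
        # First token of the entity gets the B- tag, the rest get the I- tag
        ids[start] = tag_to_id[f"B-{entity_type}"]
        for i in range(start + 1, end):
            ids[i] = tag_to_id[f"I-{entity_type}"]
    return ids

# Assumed tag ordering (the CoNLL2003 convention)
bio_tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
example_spans = [("PER", 6, 7), ("ORG", 27, 29)]
ids = spans_to_bio_ids(30, example_spans, bio_tags)
print(ids[6], ids[27], ids[28])  # 1 3 4
```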

### Automating the LLM NER Pipeline

Here we build a function that takes a nested list of tokens as input and returns a nested list of BIO tag IDs.  The inputs will look like this:

```python
tokens = [ ["It's", "only", "been", "a", "day", "since", "ChatGPT's", "new", "AI"],
           ["In", "the", "last", "24", "hours"] ]
```

The outputs will be:
```python
tags = [ [0,0,0,0,0,0,1,0,0],[0,0,0,0,0]]
```

where 0 is the ID for the 'O' tag and 1 is the ID for the 'B-PER' tag.

The rest of the function relies on a prompt template where the input text formed from the tokens will be included to replace "{text}" in the template.  Other variables are similar to those we use in our previous text classification example.
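
One detail of the template is worth seeing in isolation: because the prompt is filled in with `str.format`, the literal braces of the JSON skeleton must be doubled (`{{` and `}}`) so that only `{text}` is treated as a placeholder. The shortened template below is made up for this example.

```python
template = """Return the result as a JSON object in the format:
{{
  "PER": [...]
}}

Text: {text}
The Entities JSON:"""

sentence_tokens = ["LONDON", "1996-08-30"]
filled_prompt = template.format(text=" ".join(sentence_tokens))
print(filled_prompt)
```

After formatting, the doubled braces collapse to single braces and `{text}` is replaced by the joined tokens. Braces inside the inserted text itself are left alone, since only the template is parsed for placeholders.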

In [None]:
def llm_ner_extractor(llm_config,
                      tokens,
                      system_prompt,
                      prompt_template,
                      labels_list,
                      batch_size=1,
                      estimate_cost=False,
                      rate_limit=None,
                      split_string=None,
                      return_raw = False):
    """
    Extract named entities using a Large Language Model (LLM) in zero-shot fashion.

    Args:
        llm_config (ModelConfig): Configuration for the LLM.
        tokens (list of list of str): Nested list of tokens for each text.
        system_prompt (str): System prompt guiding the LLM behavior.
        prompt_template (str): Template to construct the user prompt for each text.
        labels_list (list of str): List of possible BIO tags.
        batch_size (int, optional): Batch size for local LLMs. Defaults to 1.
        estimate_cost (bool, optional): Estimate LLM cost. Defaults to False.
        rate_limit (int, optional): Throttle requests for API models. Defaults to None.
        split_string (str, optional): String to split the LLM output. Defaults to None.
        return_raw (bool, optional): Whether to return raw LLM outputs. Defaults to False.

    Returns:
        list of list of int: Nested list of BIO tag IDs for each token in the input.
    """

    texts = [" ".join(token) for token in tokens]

    user_prompts = [prompt_template.format(text=text) for text in texts]

    raw_outputs = llm_generate(llm_config,
                               user_prompts,
                               system_prompt=system_prompt,
                               search_strategy='deterministic',
                               batch_size=batch_size,
                               estimate_cost=estimate_cost,
                               rate_limit=rate_limit,
                               split_string=split_string)
    
    if return_raw:
        return raw_outputs
    else:
        parsed_outputs = []
        for output in raw_outputs:
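            try:
                # remove an optional Markdown code fence; str.strip("```json") would strip
                # individual characters rather than the literal prefix (needs Python 3.9+)
                output = output.strip().removeprefix("```json").removeprefix("```")
                output = output.removesuffix("```").strip()
                parsed_outputs.append(json.loads(output))
            except json.JSONDecodeError:
                parsed_outputs.append({"error": "Could not parse JSON", "raw_output": output})

        # identify the spans of the entities, but only if the JSON was successfully parsed 
        # if the JSON was not parsed we should get an empty list of spans for that text
        # (the loop variables must not be named `tokens`, or they would overwrite the input)
        spans = []
        for entities, sentence_tokens in zip(parsed_outputs, tokens):
            if "error" in entities:
                spans.append([])
            else:
                spans.append(match_entity_spans(entities, sentence_tokens))
        # now map the spans to BIO tags
        tags = []
        for sentence_tokens, sentence_spans in zip(tokens, spans):
            tags.append(spans_to_bio_tags(sentence_tokens, sentence_spans, labels_list))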

        return tags


In [None]:
llm_config = llm_configure('gemini-flash-lite')

# Extract N examples from the validation split of CoNLL2003
N = 100
texts = [' '.join(tokens) for tokens in dataset["validation"]["tokens"][:N]]

# System instruction for the model
system_instruct = "You are a helpful assistant for named entity recognition. You return entity spans in JSON."

# Prompt template adapted for CoNLL2003-style entity extraction
prompt_template = """
Extract the following named entities from the text below, if they appear:
- PER (Person)
- ORG (Organization)
- LOC (Location)
- MISC (Miscellaneous)

Only include named entities that are explicitly mentioned in the text — do not infer or guess. 
Return each entity **exactly as it appears in the text**, preserving casing and punctuation.

Return the result as a JSON object in the format:
{{
  "PER": [...],
  "ORG": [...],
  "LOC": [...],
  "MISC": [...]
}}

Return only the JSON object, nothing else.

Text: {text}
The Entities JSON:
"""

# Used to split off the assistant output from the JSON (if needed)
split_string = "JSON:assistant"

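# Call the LLM-based NER extractor on the token lists (not the joined strings).
# With return_raw=True we get the raw LLM strings back, which we parse into
# entity dictionaries so we can inspect the predicted spans.
raw_outputs = llm_ner_extractor(
    llm_config,
    dataset["validation"]["tokens"][:N],
    system_instruct,
    prompt_template,
    BIO_tags_list,
    batch_size=10,
    estimate_cost=False,
    rate_limit=None,
    split_string=split_string,
    return_raw=True
)
predictions = [json_extractor(output) for output in raw_outputs]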

# Display the first few predictions for inspection
for i, text in enumerate(texts[:10]):
    print(f"Text: {text}")
    print("The Entities JSON:")
    print(predictions[i])
    print("\n")


Text: CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
The Entities JSON:
{'PER': [], 'ORG': ['LEICESTERSHIRE'], 'LOC': [], 'MISC': []}


Text: LONDON 1996-08-30
The Entities JSON:
{'PER': [], 'ORG': [], 'LOC': ['LONDON'], 'MISC': []}


Text: West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and
39 runs in two days to take over at the head of the county championship .
The Entities JSON:
{'PER': ['Phil Simmons'], 'ORG': ['Leicestershire', 'Somerset'], 'LOC': [], 'MISC': []}


Text: Their stay on top , though , may be short-lived as title rivals Essex , Derbyshire and Surrey all closed in on
victory while Kent made up for lost time in their rain-affected match against Nottinghamshire .
The Entities JSON:
{'PER': [], 'ORG': ['Essex', 'Derbyshire', 'Surrey', 'Kent', 'Nottinghamshire'], 'LOC': [], 'MISC': []}


Text: After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended th

In [None]:
from evaluate import load
import re

# ----------------------------------------
# Helper function to normalize text
# This removes punctuation and makes everything lowercase
def normalize(text):
    return re.sub(r'\W+', '', text.lower())

# ----------------------------------------
# Convert span-level entity predictions (from LLM) into token-level IOB tags
# Inputs:
#   - tokens: list of words in the sentence
#   - predicted_entities: dictionary from LLM output, e.g. {'TITLE': ['The Matrix']}
# Output:
#   - list of IOB tags aligned to the tokens, e.g. ['O', 'B-TITLE', 'I-TITLE', 'O']
def span_to_token_tags(tokens, predicted_entities):
    tags = ["O"] * len(tokens)  # Initialize all tokens with 'O' (outside any entity)
    lowered_tokens = [normalize(tok) for tok in tokens]  # Normalize tokens so matching ignores case and punctuation

    # Loop over each entity type (e.g. TITLE, ACTOR)
    for entity_type, spans in predicted_entities.items():
        for span in spans:
            # Normalize each word in the entity span for matching
            norm_span_tokens = [normalize(w) for w in str(span).split()]
            span_len = len(norm_span_tokens)
            if span_len == 0:
                continue  # Skip empty spans, which would otherwise match at position 0

            # Slide over the token sequence and look for a matching span
            for i in range(len(tokens) - span_len + 1):
                # If tokens match the span, assign IOB tags
                if lowered_tokens[i:i+span_len] == norm_span_tokens:
                    tags[i] = f"B-{entity_type}"  # Beginning of entity
                    for j in range(1, span_len):
                        tags[i+j] = f"I-{entity_type}"  # Inside of entity
                    break  # Stop searching after first match

    return tags

# ----------------------------------------
# Load the seqeval metric (used for NER evaluation)
seqeval = load("seqeval")

true_tags = []  # Ground-truth IOB tags for each example
pred_tags = []  # LLM-predicted IOB tags for each example

# Evaluate on all predictions
N = len(predictions)

for i in range(N):
    example = dataset["valid"][i]
    tokens = example["tokens"]
    pred_entities = predictions[i]  # LLM output (JSON dict with spans)

    # Convert integer tag IDs to IOB tag strings using label_list
    true_iob = [label_list[tag_id] for tag_id in example["ner_tags"]]

    # Convert LLM's predicted spans to IOB tag sequence
    pred_iob = span_to_token_tags(tokens, pred_entities)

    # Add to the overall evaluation list
    true_tags.append(true_iob)
    pred_tags.append(pred_iob)

# ----------------------------------------
# Compute entity-level precision, recall, F1 using seqeval
results = seqeval.compute(predictions=pred_tags, references=true_tags)

# Print the evaluation results
print(results)


{'ACTOR': {'precision': np.float64(1.0), 'recall': np.float64(0.918918918918919), 'f1': np.float64(0.9577464788732395), 'number': np.int64(37)}, 'CHARACTER': {'precision': np.float64(0.5), 'recall': np.float64(0.16666666666666666), 'f1': np.float64(0.25), 'number': np.int64(6)}, 'DIRECTOR': {'precision': np.float64(0.9411764705882353), 'recall': np.float64(0.8421052631578947), 'f1': np.float64(0.8888888888888888), 'number': np.int64(19)}, 'GENRE': {'precision': np.float64(0.8235294117647058), 'recall': np.float64(0.42424242424242425), 'f1': np.float64(0.5599999999999999), 'number': np.int64(33)}, 'TITLE': {'precision': np.float64(0.625), 'recall': np.float64(0.47619047619047616), 'f1': np.float64(0.5405405405405405), 'number': np.int64(21)}, 'YEAR': {'precision': np.float64(1.0), 'recall': np.float64(0.7916666666666666), 'f1': np.float64(0.8837209302325582), 'number': np.int64(24)}, 'overall_precision': np.float64(0.8952380952380953), 'overall_recall': np.float64(0.6714285714285714), '
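
The scores above are computed per entity rather than per token: a predicted entity counts as correct only if its type and both boundaries match the reference exactly. The sketch below is a simplified, standard-library approximation of that idea, not seqeval's actual code (for instance, it ignores a stray `I-` tag with no preceding `B-`, which seqeval may treat differently); the function names and toy sequences are made up.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one BIO sequence (end exclusive)."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if etype is not None and not continues:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_prf(true_seqs, pred_seqs):
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)  # exact matches on type and boundaries
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [["B-ACTOR", "I-ACTOR", "O", "B-YEAR"]]
toy_pred = [["B-ACTOR", "O", "O", "B-YEAR"]]
print(entity_prf(toy_true, toy_pred))  # (0.5, 0.5, 0.5)
```

In the toy example the predicted actor span covers only the first of two tokens, so it counts as both a false positive and a missed entity. This is why partially matched spans hurt precision and recall at the same time.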

In [None]:
import re
import string
from rapidfuzz import fuzz

def robust_normalize(text):
    # Lowercase and remove punctuation only from the beginning and end.
    return text.lower().strip(string.punctuation)

def improved_span_to_token_tags_fuzzy_adjusted(tokens, predicted_entities, threshold=90):
    """
    Convert span-level entity predictions into token-level IOB tags using fuzzy matching
    and additional adjustments to handle trailing "s" issues.
    
    Args:
        tokens (list of str): List of tokens in the sentence.
        predicted_entities (dict): Dictionary mapping entity types to a list of span strings.
        threshold (int): Fuzzy matching threshold (0-100). Default is 90.
    
    Returns:
        list of str: IOB tags aligned with the tokens.
    """
    tags = ["O"] * len(tokens)
    lowered_tokens = [robust_normalize(tok) for tok in tokens]
    
    # Process each entity type (e.g., 'TITLE', 'ACTOR', etc.)
    for entity_type, spans in predicted_entities.items():
        for span in spans:
            norm_span_tokens = [robust_normalize(w) for w in str(span).split()]
            span_len = len(norm_span_tokens)
            span_str = " ".join(norm_span_tokens)
            found_any = False
            
            # Slide over token windows of the same length as the span
            for i in range(len(tokens) - span_len + 1):
                window_tokens = lowered_tokens[i:i+span_len]
                window_str = " ".join(window_tokens)
                
                # Compute the base fuzzy matching score.
                score = fuzz.ratio(span_str, window_str)
                scores = [score]
                
                # Check alternative: if window ends with an "s", try removing it.
                if window_str.endswith("s"):
                    alt_window = window_str[:-1]
                    scores.append(fuzz.ratio(span_str, alt_window))
                
                # Check alternative: if span ends with an "s", try removing it.
                if span_str.endswith("s"):
                    alt_span = span_str[:-1]
                    scores.append(fuzz.ratio(alt_span, window_str))
                
                # Check alternative: if both end with an "s", remove from both.
                if window_str.endswith("s") and span_str.endswith("s"):
                    alt_window = window_str[:-1]
                    alt_span = span_str[:-1]
                    scores.append(fuzz.ratio(alt_span, alt_window))
                
                max_score = max(scores)
                
                if max_score >= threshold:
                    tags[i] = f"B-{entity_type}"
                    for j in range(1, span_len):
                        tags[i+j] = f"I-{entity_type}"
                    found_any = True
                    # Continue looping to allow multiple occurrences
            
            if not found_any:
                print(f"Warning: Could not align span '{span}' for entity type '{entity_type}' in tokens: {tokens}")
    
    return tags


def evaluate_ner_predictions(dataset, predictions, label_list):
    seqeval = load("seqeval")
    true_tags = []
    pred_tags = []

    for example, pred_entity_dict in zip(dataset, predictions):
        tokens = example["tokens"]
        true_iob = [label_list[i] for i in example["ner_tags"]]
        pred_iob = span_to_token_tags(tokens, pred_entity_dict)

        true_tags.append(true_iob)
        pred_tags.append(pred_iob)

    return seqeval.compute(predictions=pred_tags, references=true_tags)

results = evaluate_ner_predictions(dataset["valid"].select(range(N)), predictions, label_list)
print(results)

{'ACTOR': {'precision': np.float64(1.0), 'recall': np.float64(0.918918918918919), 'f1': np.float64(0.9577464788732395), 'number': np.int64(37)}, 'CHARACTER': {'precision': np.float64(0.5), 'recall': np.float64(0.16666666666666666), 'f1': np.float64(0.25), 'number': np.int64(6)}, 'DIRECTOR': {'precision': np.float64(0.9411764705882353), 'recall': np.float64(0.8421052631578947), 'f1': np.float64(0.8888888888888888), 'number': np.int64(19)}, 'GENRE': {'precision': np.float64(0.8235294117647058), 'recall': np.float64(0.42424242424242425), 'f1': np.float64(0.5599999999999999), 'number': np.int64(33)}, 'TITLE': {'precision': np.float64(0.625), 'recall': np.float64(0.47619047619047616), 'f1': np.float64(0.5405405405405405), 'number': np.int64(21)}, 'YEAR': {'precision': np.float64(1.0), 'recall': np.float64(0.7916666666666666), 'f1': np.float64(0.8837209302325582), 'number': np.int64(24)}, 'overall_precision': np.float64(0.8952380952380953), 'overall_recall': np.float64(0.6714285714285714), '

## Zero-Shot NER with GLiNER

In [None]:
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WINDOWS"] = "1"

from gliner import GLiNER

# Load the GLiNER model (you can also try gliner_small-v2.1 or gliner_large-v2.1)
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")





In [None]:

# Define your NER schema — use Title Case or lower case (recommended by GLiNER authors)
labels = ["Actor", "Character", "Director", "Genre", "Title", "Year"]

# Sample input texts (you can also loop over your dataset)
N = 100
texts = [' '.join(tokens) for tokens in dataset["valid"]["tokens"][0:N] ]
gliner_raw = model.batch_predict_entities(texts, labels)
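# GLiNER returns the Title Case labels we passed in; map them back to the dataset's
# upper-case entity types (e.g. "Actor" -> "ACTOR") so they line up with label_list
gliner_predictions = [{label.upper(): spans for label, spans in entity_dict.items()}
                      for entity_dict in gliner_to_entity_dicts(gliner_raw)]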
gliner_results = evaluate_ner_predictions(dataset["valid"], gliner_predictions, label_list)

In [None]:
gliner_results

{'ACTOR': {'precision': np.float64(0.8157894736842105),
  'recall': np.float64(0.8378378378378378),
  'f1': np.float64(0.8266666666666665),
  'number': np.int64(37)},
 'CHARACTER': {'precision': np.float64(0.2),
  'recall': np.float64(0.5),
  'f1': np.float64(0.28571428571428575),
  'number': np.int64(6)},
 'DIRECTOR': {'precision': np.float64(0.6666666666666666),
  'recall': np.float64(0.7368421052631579),
  'f1': np.float64(0.7),
  'number': np.int64(19)},
 'GENRE': {'precision': np.float64(0.56),
  'recall': np.float64(0.42424242424242425),
  'f1': np.float64(0.4827586206896552),
  'number': np.int64(33)},
 'TITLE': {'precision': np.float64(1.0),
  'recall': np.float64(0.23809523809523808),
  'f1': np.float64(0.3846153846153846),
  'number': np.int64(21)},
 'YEAR': {'precision': np.float64(0.8695652173913043),
  'recall': np.float64(0.8333333333333334),
  'f1': np.float64(0.851063829787234),
  'number': np.int64(24)},
 'overall_precision': np.float64(0.6850393700787402),
 'overall_r

In [None]:
import torch
from gliner import GLiNER, GLiNERConfig
from gliner.training import Trainer, TrainingArguments
from gliner.data_processing.collator import DataCollator
from gliner.data_processing import WordsSplitter, GLiNERDataset
from transformers import AutoTokenizer
from datasets import DatasetDict, load_dataset
import re

# --------------------------
# 1. Set up model + config
# --------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "urchade/gliner_small"
model = GLiNER.from_pretrained(model_id)
tokenizer = model.data_processor.transformer_tokenizer
config = model.config
words_splitter = WordsSplitter()

dataset = load_dataset("hobbes99/mit-movie-ner-simplified")
label_list = dataset["train"].features["ner_tags"].feature.names

# --------------------------
# 2. Extract entity types
# --------------------------
entity_types = sorted(set(label[2:] for label in label_list if label != "O"))





In [None]:
def convert_iob_to_gliner_format(example, label_list):
    tokens = example["tokens"]
    tags = example["ner_tags"]
    
    spans = []
    i = 0
    while i < len(tags):
        tag = tags[i]
        if tag == 0:  # "O"
            i += 1
            continue
        label = label_list[tag]
        if label.startswith("B-"):
            label_name = label[2:]
            start = i
            end = i
            i += 1
            while i < len(tags) and label_list[tags[i]] == f"I-{label_name}":
                end = i
                i += 1
            spans.append([start, end, label_name])
        else:
            i += 1  # skip stray I- just in case

    return {
        "tokenized_text": tokens,
        "ner": spans
    }

# Convert Hugging Face datasets to plain Python lists of dictionaries
train_dataset = [
    convert_iob_to_gliner_format(example, label_list)
    for example in dataset["train"]
]

valid_dataset = [
    convert_iob_to_gliner_format(example, label_list)
    for example in dataset["valid"]
]


In [None]:
from gliner import GLiNER, GLiNERConfig
from gliner.training import Trainer, TrainingArguments
from gliner.data_processing.collator import DataCollator
from gliner.data_processing import WordsSplitter
import torch

# 1. Device and Model Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "urchade/gliner_small"
model = GLiNER.from_pretrained(model_id)
tokenizer = model.data_processor.transformer_tokenizer
config = model.config
words_splitter = WordsSplitter()

# 2. Data Collator (for tokenization and batching)
data_collator = DataCollator(
    config=config,
    tokenizer=tokenizer,
    words_splitter=words_splitter,
    prepare_labels=True
)

# 3. Epoch Calculation
batch_size = 8
num_steps = 500
num_epochs = max(3, num_steps // max(1, len(train_dataset) // batch_size))

# 4. Training Arguments
training_args = TrainingArguments(
    output_dir="gliner-movies",
    learning_rate=5e-6,
    weight_decay=0.01,
    others_lr=1e-5,
    others_weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    focal_loss_alpha=0.75,
    focal_loss_gamma=2,
    num_train_epochs=num_epochs,
    eval_strategy="steps",  # updated from deprecated evaluation_strategy
    save_steps=100,
    save_total_limit=2,
    dataloader_num_workers=0,
    use_cpu=not torch.cuda.is_available(),
    report_to="none"
)

# 5. Trainer Initialization (removed deprecated tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    # no tokenizer needed anymore
)

# 6. Train
trainer.train()



Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


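| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.8882        | 8.95            |
| 1000 | 1.0204        | 7.190355        |
| 1500 | 0.8838        | 5.591519        |
| 2000 | 0.8931        | 4.77438         |
| 2500 | 0.8882        | 4.225507        |
| 3000 | 0.8015        | 4.288225        |
| 3500 | 0.8178        | 3.958858        |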


TrainOutput(global_step=3645, training_loss=1.150873778124732, metrics={'train_runtime': 573.9006, 'train_samples_per_second': 50.81, 'train_steps_per_second': 6.351, 'total_flos': 0.0, 'train_loss': 1.150873778124732, 'epoch': 3.0})

In [None]:
from collections import defaultdict

def gliner_to_entity_dicts(gliner_output):
    """
    Convert GLiNER's output format (list of {'text', 'label'}) per sample
    into a dict of label -> list of strings for each example.
    """
    all_entity_dicts = []
    for entities in gliner_output:
        ent_dict = defaultdict(list)
        for e in entities:
            ent_dict[e["label"]].append(e["text"])
        all_entity_dicts.append(dict(ent_dict))
    return all_entity_dicts
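
To see what this conversion does for a single text, here is a small self-contained variant applied to a hand-written sample. The sample mimics the shape of GLiNER's per-text output (a list of dictionaries with `text` and `label` keys, plus character offsets and a score), but the values are made up, and the optional upper-casing of labels is an addition for this sketch.

```python
from collections import defaultdict

def group_entities_by_label(entities, uppercase_labels=True):
    grouped = defaultdict(list)
    for entity in entities:
        # Optionally map labels such as "Actor" to the dataset's "ACTOR" style
        label = entity["label"].upper() if uppercase_labels else entity["label"]
        grouped[label].append(entity["text"])
    return dict(grouped)

# Hypothetical predictions for one sentence
sample_output = [
    {"start": 25, "end": 42, "text": "Christopher Nolan", "label": "Director", "score": 0.97},
    {"start": 53, "end": 67, "text": "Christian Bale", "label": "Actor", "score": 0.95},
]
grouped_sample = group_entities_by_label(sample_output)
print(grouped_sample)
```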


In [None]:
from pathlib import Path
from gliner import GLiNER

# 1. Load fine-tuned model WITHOUT trying to load tokenizer from disk
model_path = Path("gliner-movies/checkpoint-3645")
model = GLiNER.from_pretrained(str(model_path), load_tokenizer=False)

# 2. Reuse the tokenizer from the original base model
# This must match the one used during training
base_model_id = "urchade/gliner_small"
base_model = GLiNER.from_pretrained(base_model_id)
model.data_processor.transformer_tokenizer = base_model.data_processor.transformer_tokenizer

# 3. Reuse the words splitter too (optional, but best for consistency)
model.data_processor.words_splitter = base_model.data_processor.words_splitter

# 4. Define labels from your original `label_list`
labels = sorted(list(set(label[2:] for label in label_list if label != "O")))

# 5. Prepare input texts
N = 100
texts = [' '.join(tokens) for tokens in dataset["valid"]["tokens"][:N]]

# 6. Predict with batch_predict_entities
gliner_raw = model.batch_predict_entities(texts, labels)

# 7. Convert to standard format for evaluation
gliner_predictions = gliner_to_entity_dicts(gliner_raw)

# 8. Evaluate
from datasets import Dataset

# Convert sliced subset to a proper Dataset object
valid_subset = Dataset.from_dict(dataset["valid"][:N])

gliner_results = evaluate_ner_predictions(valid_subset, gliner_predictions, label_list)


# Optional: print results
from pprint import pprint
pprint(gliner_results)



Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'ACTOR': {'f1': np.float64(0.8641975308641975),
           'number': np.int64(37),
           'precision': np.float64(0.7954545454545454),
           'recall': np.float64(0.9459459459459459)},
 'CHARACTER': {'f1': np.float64(0.3157894736842105),
               'number': np.int64(6),
               'precision': np.float64(0.23076923076923078),
               'recall': np.float64(0.5)},
 'DIRECTOR': {'f1': np.float64(0.7142857142857143),
              'number': np.int64(19),
              'precision': np.float64(0.6521739130434783),
              'recall': np.float64(0.7894736842105263)},
 'GENRE': {'f1': np.float64(0.7222222222222221),
           'number': np.int64(33),
           'precision': np.float64(0.6666666666666666),
           'recall': np.float64(0.7878787878787878)},
 'TITLE': {'f1': np.float64(0.5500000000000002),
           'number': np.int64(21),
           'precision': np.float64(0.5789473684210527),
           'recall': np.float64(0.5238095238095238)},
 'YEAR': {'f1': n

In [None]:
gliner_predictions[0]

[{'start': 17, 'end': 29, 'label': 'ACTOR'}]


---

### 6. **Run Zero-Shot LLM Prompting (e.g., GPT-4 or Claude)**

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def ner_with_gpt(text, system_prompt=None):
    messages = [
        {"role": "system", "content": system_prompt or "You are an assistant that extracts named entities."},
        {"role": "user", "content": f"Extract all named entities (actor, character, director, genre, title) from the following sentence:\n\n{text}"}
    ]
    # Chat Completions call using the openai>=1.0 client interface
    response = client.chat.completions.create(
        model="gpt-4",  # Or "gpt-3.5-turbo"
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content
```

Test it on a few examples and show extracted entities. Let students compare accuracy/coverage with fine-tuned output.

---

### 7. **Run GLiNER (Zero/Few-shot)**

```python
from gliner import GLiNER

gliner_model = GLiNER.from_pretrained("urchade/gliner_index")
labels = ["actor", "character", "director", "genre", "title"]

gliner_model.predict_entities("The film was directed by Christopher Nolan and stars Christian Bale.", labels)
```

Let students vary the label set or add new entity types.

---





## 📌 Add Step: 7. Fine-Tune GLiNER

GLiNER expects each training example as a list of tokens plus a list of entity spans given as `[start, end, label]` token indices (the end index is inclusive), the same format used in the fine-tuning cells above:
```python
{
    "tokenized_text": ["directed", "by", "Quentin", "Tarantino"],
    "ner": [[2, 3, "director"]]
}
```

### 🔹 Prepare MIT Movie Data for GLiNER Fine-Tuning

```python
# Map ID to label name
label_names = dataset["train"].features["ner_tags"].feature.names

def convert_to_gliner_format(example):
    tokens = example["tokens"]
    tags = example["ner_tags"]
    spans = []  # [start, end, label] with inclusive token indices
    start, current_label = None, None

    for i, tag in enumerate(tags):
        tag_name = label_names[tag]
        # An I- tag of the same type continues the open entity
        if tag_name.startswith("I-") and tag_name[2:] == current_label:
            continue
        # Anything else closes the open entity, if there is one
        if current_label is not None:
            spans.append([start, i - 1, current_label])
            start, current_label = None, None
        # A B- tag opens a new entity
        if tag_name.startswith("B-"):
            start, current_label = i, tag_name[2:]

    if current_label is not None:
        spans.append([start, len(tags) - 1, current_label])

    return {"tokenized_text": tokens, "ner": spans}

# The GLiNER trainer works with plain Python lists of dictionaries
gliner_train = [convert_to_gliner_format(example) for example in dataset["train"]]
gliner_test = [convert_to_gliner_format(example) for example in dataset["test"]]
```

---

### 🔹 Fine-Tune GLiNER

```python
from gliner.training import Trainer, TrainingArguments
from gliner.data_processing.collator import DataCollator
from gliner.data_processing import WordsSplitter

gliner_model_ft = GLiNER.from_pretrained("urchade/gliner_index")

# Same collator and Trainer setup as in the fine-tuning cell earlier in this notebook
data_collator = DataCollator(
    config=gliner_model_ft.config,
    tokenizer=gliner_model_ft.data_processor.transformer_tokenizer,
    words_splitter=WordsSplitter(),
    prepare_labels=True,
)
training_args = TrainingArguments(
    output_dir="./gliner_movie_ft",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,
)
trainer = Trainer(
    model=gliner_model_ft,
    args=training_args,
    train_dataset=gliner_train,
    eval_dataset=gliner_test,
    data_collator=data_collator,
)

trainer.train()
```

---

### 🔹 Evaluate Fine-Tuned GLiNER

```python
results = trainer.evaluate()
print(results)
```

Or test on custom text:

```python
gliner_model_ft.predict_entities("The movie was directed by James Cameron and stars Sigourney Weaver.", labels)
```

---

## 📊 Step 8: Compare All Methods

Update your comparison table:

| Sentence | Fine-Tuned DistilBERT | GPT-4 | GLiNER Zero-shot | GLiNER Fine-tuned |
|----------|------------------------|-------|-------------------|-------------------|
| ...      | ...                    | ...   | ...               | ...               |

Encourage students to analyze:
- Which method catches more entities?
- Which is more accurate?
- How does GLiNER behave with new/changed label sets?
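
One way to make "catches more entities" and "more accurate" concrete is entity-level recall and precision. The sketch below is a simplified, set-based version that treats an entity as correct only when both its text and label match exactly. `entity_prf` is a hypothetical helper, and the gold and predicted lists are hand-written placeholders, not results from any of the models.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 over (text, label) pairs."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1

# Placeholder data for illustration only
gold = [("James Cameron", "director"), ("Sigourney Weaver", "actor")]
method_a = [("James Cameron", "director"), ("Sigourney Weaver", "actor"), ("movie", "genre")]
method_b = [("James Cameron", "director")]

for name, preds in [("method_a", method_a), ("method_b", method_b)]:
    p, r, f = entity_prf(gold, preds)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Here `method_a` finds every gold entity but adds a spurious one (high recall, lower precision), while `method_b` is precise but misses an entity. Because the comparison uses sets, repeated mentions of the same entity in one sentence are counted once.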

---

In [None]:
from transformers import AutoTokenizer
bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)


In [None]:
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()

print("Tokens from BERT tokenizer:")
print(bert_tokens)  
print("\nTokens from XLM-R tokenizer:")
print(xlmr_tokens)

Tokens from BERT tokenizer:
['[CLS]', 'Jack', 'Spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]']

Tokens from XLM-R tokenizer:
['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>']
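
The two tokenizers mark word boundaries differently: WordPiece (BERT) prefixes continuation pieces with `##`, while SentencePiece (XLM-R) prefixes word-initial pieces with `▁`. A minimal sketch of stitching the pieces back into words, using the token lists printed above, is shown here. The helper names and special-token tuples are assumptions for illustration; in practice the tokenizer's own `word_ids()` mapping is the more robust way to align labels.

```python
bert_tokens = ['[CLS]', 'Jack', 'Spa', '##rrow', 'loves', 'New', 'York', '!', '[SEP]']
xlmr_tokens = ['<s>', '▁Jack', '▁Spar', 'row', '▁love', 's', '▁New', '▁York', '!', '</s>']

def merge_wordpiece(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Rejoin WordPiece tokens: '##' marks a continuation of the previous word."""
    words = []
    for tok in tokens:
        if tok in special:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

def merge_sentencepiece(tokens, special=("<s>", "</s>", "<pad>")):
    """Rejoin SentencePiece tokens: '▁' marks the start of a new word."""
    words = []
    for tok in tokens:
        if tok in special:
            continue
        if tok.startswith("▁") or not words:
            words.append(tok[1:] if tok.startswith("▁") else tok)
        else:
            words[-1] += tok
    return words

print(merge_wordpiece(bert_tokens))
print(merge_sentencepiece(xlmr_tokens))
```

Note that BERT splits punctuation into its own word, while the SentencePiece pieces rejoin as `York!` because there was no space before the exclamation mark. Differences like this matter when aligning word-level NER tags to subword tokens.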
