# AI for Research: Customizing spaCy's Entity Recognition Models (Virtual)

**Welcome to this interactive notebook!**  
In this workshop, we'll walk through how Named Entity Recognition (NER) works, test pre-trained models, and learn how to customize them for research tasks, particularly useful in domains like hate speech detection, misinformation research, or media studies.


### Prerequisites
- Basic knowledge of Python
- No prior experience with NER or spaCy required

*This workshop was held on November 11, 2025, as part of the Research Computing and Data Services' **AI for Research** workshop series at Northwestern University, led by [Miriam Schirmer](https://miriamschirmer.github.io/).*

## **1. Introduction to Named Entity Recognition**

### What is Named Entity Recognition?

**NER** is a technique in Natural Language Processing (NLP) that identifies and classifies real-world entities in text.

This sentence, for example, contains the following entities:

*Dr. Jane Smith from the World Health Organization gave a talk in Geneva on July 15, 2021, about COVID-19.*

- **PERSON** – e.g., "Dr. Jane Smith"
- **ORG** – e.g., "World Health Organization"
- **GPE** – Geopolitical Entities, e.g., "Geneva"
- **DATE** – e.g., "July 15, 2021"
- **Others** – PRODUCT, EVENT, LAW, NORP (nationalities or religious or political groups), etc.
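
If a label abbreviation is unfamiliar, spaCy's built-in glossary can spell it out. The sketch below uses `spacy.explain`; the list of labels is only an illustrative selection, and the exact wording of the descriptions depends on the installed spaCy version.

```python
import spacy

# A handful of labels from the list above (an illustrative selection, not the full label set)
labels = ["PERSON", "ORG", "GPE", "DATE", "NORP"]

# spacy.explain() looks each label up in spaCy's glossary
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```

This runs without downloading a model, because the glossary ships with the library itself.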




### Why is NER Important?

NER helps to:
- Structure raw, unformatted text, e.g., to get an overview of commonly mentioned terms
- Enable information extraction from social media, news, legal texts, etc.
- Support other NLP tasks as an additional step (e.g., identifying who is targeted when training a model to detect hate speech)


### Prep: Import relevant libraries

In [None]:
# Import spaCy itself to build our pipeline: spaCy is the core NLP library we'll use.
# spaCy provides pre-built pipelines for tasks like NER (and many more!).
import spacy

# Import "Matcher", which lets us define custom patterns
from spacy.matcher import Matcher

# Import "EntityRuler", which allows us to add custom rules for entities
from spacy.pipeline import EntityRuler

### **Intro Example**

We'll start by loading **spaCy's small English model**, called `en_core_web_sm`.

- **`en_core_web_sm`** stands for *English (core) web-trained small model*.  
  It includes the basic components of spaCy's NLP pipeline: a tokenizer, part-of-speech tagger, dependency parser, and named entity recognizer.  
- The **small model** is lightweight and fast, which makes it ideal for demos and teaching.  
- For higher accuracy (at the cost of speed and memory), you can use:
  - `en_core_web_md` → *medium* model (includes word vectors)
  - `en_core_web_lg` → *large* model (larger word vectors, typically more accurate, higher memory use)
- You can also train your own model or use models for other languages.
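
Before loading a model, it can help to check that it is actually installed. The following is a minimal sketch using `spacy.util.is_package`; the variable names (`model_name`, `nlp_check`) are made up for this illustration.

```python
import spacy
from spacy.util import is_package

model_name = "en_core_web_sm"

# is_package() reports whether a pipeline package with this name is installed
model_installed = is_package(model_name)

if model_installed:
    nlp_check = spacy.load(model_name)
    # pipe_names lists the components that run after the tokenizer
    print(nlp_check.pipe_names)
else:
    print(f"{model_name} is not installed. Install it with:")
    print(f"python -m spacy download {model_name}")
```

The same check works for `en_core_web_md` and `en_core_web_lg` by changing `model_name`.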

In [None]:
# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")

In [None]:
# A simple text example
text = "Dr. Jane Smith from the World Health Organization gave a talk in Geneva on July 15, 2021, about COVID-19."

In [None]:
# Run the NLP pipeline on the text
doc = nlp(text)

In [None]:
# Print entities detected by the model, including start and end character
print("Entities Found:")
for ent in doc.ents:
    print(f"- {ent.text} ({ent.label_}) [Start: {ent.start_char}, End: {ent.end_char}]")
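
The character offsets printed above index into the original string, so slicing the text with them recovers the entity. The sketch below illustrates this with a stand-in pipeline (a blank English tokenizer plus one hand-written rule) so that it runs without a trained model; `nlp_demo`, `demo_ruler`, and the shortened sentence are assumptions made for this example.

```python
import spacy

# Stand-in pipeline: blank English tokenizer plus a single rule,
# used here only so the example does not depend on a downloaded model
nlp_demo = spacy.blank("en")
demo_ruler = nlp_demo.add_pipe("entity_ruler")
demo_ruler.add_patterns([{"label": "GPE", "pattern": "Geneva"}])

demo_text = "Dr. Jane Smith gave a talk in Geneva."
demo_doc = nlp_demo(demo_text)

for ent in demo_doc.ents:
    # Slicing the original string with the character offsets recovers the entity text
    recovered = demo_text[ent.start_char:ent.end_char]
    # ent.start and ent.end are token indices, a different coordinate system
    print(ent.text, ent.label_, (ent.start_char, ent.end_char), (ent.start, ent.end), recovered)
```

Character offsets are the ones to store when entities need to be mapped back onto the raw text, e.g. for annotation tools.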


Let's visualize this:

In [None]:
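# Visualize entities in the text
# (importing displacy explicitly avoids relying on it being loaded as a side effect of "import spacy")
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)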
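
Outside a notebook, the same visualization can be produced as an HTML string. The sketch below uses displaCy's manual mode with a hand-written annotation, so no model is needed; the sentence and offsets are made up for this example and are not model predictions.

```python
from spacy import displacy

# Hand-written entity annotation in displaCy's "manual" input format
sentence = "Dr. Jane Smith gave a talk in Geneva."
start = sentence.index("Geneva")
manual_doc = {
    "text": sentence,
    "ents": [{"start": start, "end": start + len("Geneva"), "label": "GPE"}],
    "title": None,
}

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser, which is handy when sharing results with collaborators who do not use notebooks.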

### **Exercise**: Use your own example and run this!

In [None]:
# A simple text example (enter a sentence between the quotation marks)
new_text = ""

# ✏️ TODO: Enter a sentence between the quotation marks above.

In [None]:
# Run the NLP pipeline on the text
new_doc =

# ✏️ TODO: Run the NLP pipeline on the new text and store it in the "new_doc" variable.

In [None]:
# Print entities detected by the model, including start and end character
print("Entities Found:")
for ent in new_doc.ents:
    print(f"- {ent.text} ({ent.label_}) [Start: {ent.start_char}, End: {ent.end_char}]")

# ✏️ TODO: No need to change this cell!

## **2. Applying NER to Real-World Data: Incel Forum Posts**

Now that we've seen a basic example, let's test how spaCy's off-the-shelf NER performs on **incel forum posts**.

### Background: What are "incel" forums?

The term **incel** stands for "involuntary celibate."  
It refers to online communities where people discuss frustrations about dating and relationships, often expressing **hateful language toward women** and **misogynistic ideologies**.  

Why are we using this data?
- The posts use **slang and community-specific terms** that differ from everyday language, but they also contain **clear references to people, groups, and institutions**.
- They are examples of **messy, real-world text** where standard NLP models may struggle.
- The data are publicly available and often used in research on online communities.

This is especially relevant for **hate speech detection research**, where:
- Extracting entities helps identify targeted individuals or groups.
- We may want to track mentions of public figures, communities, or ideologies.

⚠️ In this workshop, we use incel forum text **only as an example** to show how NER works on social science data.  
Our focus is on the **method (NER)**, not on the community or its views.

🚨 Content warning: Incel terminology often contains misogynistic expressions and may reference sexual or gender-based violence.



### 📂 Step 1: Load Dataset and Inspect `text` Column

We have a dataset (CSV file) of incel posts with a column called `text`. This column contains the raw text of each post made in a forum.

We'll load the data, inspect a few entries, and then apply spaCy's NER model to extract named entities.

🔍 This mimics a typical hate speech or social media dataset structure. Note that this is **raw, unprocessed data**. It's intentionally left messy to illustrate the kinds of challenges you might face when applying NER, and to show how to clean and prepare your data for this task.


In [None]:
# Import the pandas library for working with tables (dataframes)
import pandas as pd

In [None]:
# URL of the CSV file on GitHub to read it directly into a pandas DataFrame
url = "https://raw.githubusercontent.com/MiriamSchirmer/Intro-to-NER/refs/heads/main/incel_comments.csv"

# Load the dataset into a pandas DataFrame
# A DataFrame is like a table (rows = observations, columns = variables)

df = pd.read_csv(url)


In [None]:
# Set pandas option to display the full content of the 'text' column
pd.set_option('display.max_colwidth', None)

# Display the first 5 rows of the dataset to check what it looks like
df.head()

In [None]:
# Display the shape of the DataFrame to see how many rows (first number) and columns (second number) we have
print("Shape of the DataFrame:")
print(df.shape)
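
Since the data is raw, a light cleaning pass before running NER is often useful. The sketch below shows one possible approach on a made-up mini DataFrame (`toy_df`); the specific steps (filling missing values, collapsing whitespace, dropping empty rows) are assumptions about what this kind of data typically needs, not a description of the workshop dataset.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the forum posts
toy_df = pd.DataFrame({"text": ["  First   post\nwith a line break ", None, "", "Second post"]})

# Count missing values before cleaning
n_missing = int(toy_df["text"].isna().sum())

cleaned = (
    toy_df["text"]
    .fillna("")                            # replace missing values with empty strings
    .astype(str)                           # make sure every entry is a string
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace, including line breaks
    .str.strip()                           # drop leading and trailing whitespace
)

# Keep only rows that still contain text
toy_df = toy_df.assign(text=cleaned)
toy_df = toy_df[toy_df["text"] != ""].reset_index(drop=True)

print(f"Missing values found: {n_missing}")
print(toy_df)
```

Whether to drop empty rows or keep them as empty strings depends on whether row positions need to stay aligned with other columns.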

### 🏷 Step 2: Apply NER to the `text` Column

Now we apply the NER pipeline to each post in the dataset.  
We'll extract:
- The full list of entities
- Their labels (e.g., PERSON, ORG)
- Their frequency in the dataset

This helps us:
- Spot key actors and targets in hate speech
- Identify misclassifications (e.g., slang detected as ORG)

Extract Named Entities

In [None]:
# Define a function that takes in a piece of text and returns all named entities the model finds

def extract_ents(text, nlp):
    """
    Extract named entities from a given text using a spaCy pipeline.

    Args:
        text (str): The input text from which to extract entities.
        nlp (spacy.language.Language): A loaded spaCy language model (e.g., spacy.load("en_core_web_sm")).

    Returns:
        list of tuples: A list containing (entity_text, entity_label) pairs.
    """
    # Process the text through the provided spaCy pipeline
    doc = nlp(text)

    # Collect each entity as a (text, label) pair
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    return entities


In [None]:
# Define what our texts are
# (missing values are replaced with empty strings because nlp.pipe expects strings)
texts = df['text'].fillna("").astype(str)

# Process all texts efficiently in batches (this saves time because we are not calling the model separately for each row)
# You can adjust batch_size depending on text length and available memory
docs = list(nlp.pipe(texts, batch_size=10)) # a small batch size is enough for this demo; short posts would also allow larger batches

# Extract entities for each processed doc
df['entities'] = [[(ent.text, ent.label_) for ent in doc.ents] for doc in docs]


In [None]:
# Show 15 random rows with text and extracted entities
df[['text', 'entities']].sample(15, random_state=12)


Count Most Common Entities

In [None]:
# Import Counter, a helper tool that counts how often items appear in a list
from collections import Counter

In [None]:
# Flatten (i.e., "unnest") the list of entities across all posts
# - df['entities'] contains one list of entities per row
# - We loop over each row and then each entity inside it
all_entities = [ent for row in df['entities'] for ent in row]

In [None]:
# Count how often each (text, label) pair appears in the dataset
entity_counter = Counter(all_entities)

In [None]:
# Print the 15 most frequent named entities
print("Most Frequent Named Entities:")
for (text, label), count in entity_counter.most_common(15):
    print(f"{text} ({label}): {count}")
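
Besides counting (text, label) pairs, it can be informative to count labels alone to see which entity types dominate. A minimal sketch, assuming a list of pairs in the same format as `all_entities` (the `toy_entities` values are invented for illustration):

```python
from collections import Counter

# Hypothetical (text, label) pairs in the same format as all_entities
toy_entities = [
    ("Reddit", "ORG"), ("Chad", "PERSON"), ("today", "DATE"),
    ("Reddit", "ORG"), ("two", "CARDINAL"), ("Stacy", "PERSON"),
]

# Count labels only, ignoring the entity text
label_counter = Counter(label for _, label in toy_entities)

for label, count in label_counter.most_common():
    print(f"{label}: {count}")
```

A label distribution like this gives a quick first impression of whether number-like labels crowd out the entities of interest.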

Adapt this slightly to exclude the numeric labels (which we are not really interested in for our current use case).

In [None]:
exclude_labels = {"CARDINAL", "ORDINAL", "PERCENT"}

# Filter the entities to exclude those with labels in exclude_labels
filtered_entities = [ent for ent in all_entities if ent[1] not in exclude_labels]

# Count how often each (text, label) pair appears in the filtered list
entity_counter = Counter(filtered_entities)

# Print the 15 most frequent named entities from the filtered list
print("Most Frequent Named Entities (excluding numbers):")
for (text, label), count in entity_counter.most_common(15):
    print(f"{text} ({label}): {count}")
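
In messy text the same term often appears with different capitalization or receives different labels. The sketch below shows one way to surface this, again on invented data (`toy_entities`); lowercasing is an assumption that suits this use case but would merge genuinely distinct names in others.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) pairs, including spelling and label variants
toy_entities = [("Chad", "PERSON"), ("chad", "PERSON"), ("CHAD", "ORG"), ("Reddit", "ORG")]

# Lowercase the entity text so that spelling variants share one counter key
normalized_counter = Counter((text.lower(), label) for text, label in toy_entities)

# Collect every label a term received to spot inconsistent tagging
labels_per_term = defaultdict(set)
for text, label in toy_entities:
    labels_per_term[text.lower()].add(label)

ambiguous = {term: labels for term, labels in labels_per_term.items() if len(labels) > 1}

print(normalized_counter.most_common())
print("Terms with more than one label:", ambiguous)
```

Terms that show up in `ambiguous` are natural candidates for the rule-based fixes discussed later.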

### **Discussion**: What works well, what does not work well? How can we improve this?

### What We Can Learn from the Entity Counts

- Did the model mark any words as entities that **aren't actually entities**?  
- Are the **real people or names** we care about being tagged correctly?  
- Are there **important words** that the model missed?

These questions help us see what needs to be improved, either by:
- **Fixing** specific cases with simple rules (using the `EntityRuler`)
- **Teaching** the model new examples through training

Next, we'll look at how to **customize and improve** the model.


### 🧩 Step 3: Why Customize NER?

spaCy's default model doesn't recognize many **domain-specific concepts** in incel communities.

Examples:
- "Chad", "Stacy" → Often central figures, not recognized as people
- "Tinder", "Reddit" → Should be detected as platforms
- "Redpill", "Blackpill" → Ideologies
- "normie", "foid" → Community-specific terms

Let's start by using **spaCy's Matcher** and **EntityRuler** to inject these into the pipeline.


#### 💻 Customizing Option A: Rule-Based Matching with `Matcher`

In [None]:
# Create a Matcher object, which lets us define custom rules
# It needs the vocabulary (nlp.vocab) from the spaCy model
matcher = Matcher(nlp.vocab)

In [None]:
# Define a simple pattern for the word "chad"
# LOWER means: match the lowercase version of the token
pattern_chad = [{"LOWER": "chad"}]

# Define a pattern for the word "stacy"
pattern_stacy = [{"LOWER": "stacy"}]

In [None]:
# Add both patterns to the matcher under the same label "INCEL_PERSON"
# The first argument ("INCEL_PERSON") is the name we give this rule
# The second argument is a list of patterns we want to match
matcher.add("INCEL_PERSON", [pattern_chad, pattern_stacy])
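
Token patterns can do more than match a single lowercase word. The sketch below shows two common extensions, the `IN` operator for several spellings and a multi-token pattern. It uses a blank pipeline (`nlp_blank`) so it runs without a trained model; the plural forms, the "red pill" spelling, and the test sentence are assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is all that LOWER-based patterns need
nlp_blank = spacy.blank("en")
demo_matcher = Matcher(nlp_blank.vocab)

# One pattern covering several spellings via the IN operator
pattern_variants = [{"LOWER": {"IN": ["chad", "chads", "stacy", "stacys"]}}]

# A two-token pattern: "red" followed by "pill" (any capitalization)
pattern_red_pill = [{"LOWER": "red"}, {"LOWER": "pill"}]

demo_matcher.add("INCEL_PERSON", [pattern_variants])
demo_matcher.add("IDEOLOGY", [pattern_red_pill])

demo_doc = nlp_blank("Chads and Stacy talk about the Red Pill.")
found = [
    (demo_doc[start:end].text, nlp_blank.vocab.strings[match_id])
    for match_id, start, end in demo_matcher(demo_doc)
]
print(found)
```

Each dictionary in a pattern describes exactly one token, so a two-word phrase needs two dictionaries.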

In [None]:
# Store counts
match_counter = Counter()
total_matches = 0

# Loop through your dataframe
for doc in nlp.pipe(df["text"], batch_size=50):
    matches = matcher(doc)
    total_matches += len(matches)

    for match_id, start, end in matches:
        label = nlp.vocab.strings[match_id]  # e.g. "INCEL_PERSON"
        span_text = doc[start:end].text
        match_counter[(span_text, label)] += 1

# Summary
print(f"Total matches found: {total_matches}\n")

print("Most Frequent Matcher Hits:")
for (text, label), count in match_counter.most_common():
    print(f"{text} ({label}): {count}")
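
Because `LOWER` patterns only depend on tokenization, the counting loop does not strictly need the tagger, parser, or NER. The following sketch is a lighter variant under that assumption: it tokenizes with `nlp.tokenizer.pipe` and uses `as_spans=True` to get labeled spans directly. The pipeline, matcher, and `toy_posts` are stand-ins created for this example.

```python
from collections import Counter

import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
demo_matcher = Matcher(nlp_blank.vocab)
demo_matcher.add("INCEL_PERSON", [[{"LOWER": "chad"}], [{"LOWER": "stacy"}]])

# Hypothetical posts standing in for df["text"]
toy_posts = ["Chad gets all the attention.", "stacy ignored him. Chad did too."]

span_counter = Counter()
# tokenizer.pipe() only tokenizes; no other pipeline components are run
for toy_doc in nlp_blank.tokenizer.pipe(toy_posts):
    # as_spans=True returns Span objects whose label_ is the rule name
    for span in demo_matcher(toy_doc, as_spans=True):
        span_counter[(span.text.lower(), span.label_)] += 1

for (text, label), count in span_counter.most_common():
    print(f"{text} ({label}): {count}")
```

This shortcut stops working as soon as a pattern relies on attributes such as `POS` or `ENT_TYPE`, which are set by later pipeline components.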

Let's look at an example that contains "Stacy":

In [None]:
# Find a text entry that contains "Stacy"
stacy_text = df[df['text'].str.contains('Stacy', case=False, na=False)].iloc[0]['text']

# Print the text
print("Example text containing 'Stacy':")
print(stacy_text)

### **Exercise**: Choose a term you would like to add and run the Matcher on our dataset!

In [None]:
# Define your own pattern(s)

new_pattern = [{"LOWER": "____"}]
matcher.add("YOUR_LABEL_HERE", [new_pattern])

# ✏️ TODO: Replace the underscores above with your own term (in lowercase, because LOWER compares against the lowercased token)! Replace "YOUR_LABEL_HERE" with your label name.

In [None]:
# Count how often we find your new pattern! (No adjustments needed.)

# Store counts
match_counter = Counter()
total_matches = 0

# Loop through your dataframe
for doc in nlp.pipe(df["text"], batch_size=50):
    matches = matcher(doc)
    total_matches += len(matches)

    for match_id, start, end in matches:
        label = nlp.vocab.strings[match_id]  # e.g. "INCEL_PERSON"
        span_text = doc[start:end].text
        match_counter[(span_text, label)] += 1

# Summary
print(f"Total matches found: {total_matches}\n")

print("Most Frequent Matcher Hits:")
for (text, label), count in match_counter.most_common():
    print(f"{text} ({label}): {count}")

#### 💻 Customizing Option B: Insert Custom Entities with `EntityRuler`

The **EntityRuler** is similar to the Matcher, but with one key difference:

- **Matcher**: Finds patterns in text but does not automatically turn them into "entities".  
  → We had to manually print the matches.  

- **EntityRuler**: Lets us directly insert new *named entities* into spaCy's pipeline.  
  → The matches will appear alongside other entities (like PERSON, ORG, DATE) when we run `doc.ents`.

This makes the EntityRuler a better choice if we want our custom rules to behave just like the built-in NER model.

Reset the NLP Pipeline

In [None]:
# Start fresh to avoid lingering patterns/rulers
nlp = spacy.load("en_core_web_sm")

# Remove any existing entity_ruler(s)
for name in list(nlp.pipe_names):
    if name.startswith("entity_ruler"):
        nlp.remove_pipe(name)

In [None]:
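# Add a NEW entity_ruler that matches phrases case-insensitively ("phrase_matcher_attr": "LOWER").
# Because it runs before "ner", its entities are set first and the statistical model predicts around them.
# "overwrite_ents" lets the ruler replace entities that are already on the doc when spans overlap.
ruler = nlp.add_pipe(
    "entity_ruler",
    before="ner",
    config={"overwrite_ents": True, "phrase_matcher_attr": "LOWER"}
)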

Now we are adding our new labels:

In [None]:
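# Option one: Define precise patterns, here for platforms.
# Note: matching is case-insensitive (see the ruler config), so "X" will also match a lowercase "x" token.
platforms = ["Tinder", "Reddit", "YouTube", "Instagram", "TikTok", "Twitter", "X"]
patterns = [{"label": "PLATFORM", "pattern": p} for p in platforms]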

In [None]:
# Option two: Other custom entities
patterns += [
    {"label": "PERSON",   "pattern": "Chad"},
    {"label": "PERSON",   "pattern": "Stacy"},
    {"label": "IDEOLOGY", "pattern": "Redpill"},
    {"label": "IDEOLOGY", "pattern": "Blackpill"},
    {"label": "COMMUNITY","pattern": "normie"},
    {"label": "SLUR",     "pattern": "foid"}
]

In [None]:
# We add the patterns
ruler.add_patterns(patterns)
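
String patterns such as `"normie"` match that exact phrase only, so a plural like "normies" would be missed. The sketch below shows how a token pattern can cover such variants. It runs on a blank pipeline (`nlp_rules`) so no trained model is needed; the plural form and the test sentence are assumptions for illustration.

```python
import spacy

nlp_rules = spacy.blank("en")
demo_ruler = nlp_rules.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

demo_ruler.add_patterns([
    # String pattern: matched as an exact phrase (case-insensitive because of LOWER)
    {"label": "PLATFORM", "pattern": "Reddit"},
    # Token pattern: a list of token descriptions, here covering singular and plural
    {"label": "COMMUNITY", "pattern": [{"LOWER": {"IN": ["normie", "normies"]}}]},
])

demo_doc = nlp_rules("Normies use reddit.")
demo_ents = [(ent.text, ent.label_) for ent in demo_doc.ents]
print(demo_ents)
```

Both kinds of pattern can be mixed freely in one `add_patterns` call, and the resulting spans show up in `doc.ents` like any other entity.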

Let's look at the results:

In [None]:
# Count entities on your dataset
exclude_labels = {"CARDINAL", "ORDINAL", "PERCENT"}
entity_counter = Counter()

for doc in nlp.pipe(df["text"], batch_size=50):
    for ent in doc.ents:
        if ent.label_ not in exclude_labels:
            entity_counter[(ent.text, ent.label_)] += 1

print("Most Frequent Named Entities (EntityRuler + NER):")
for (text, label), count in entity_counter.most_common(20):
    print(f"{text} ({label}): {count}")


### **Exercise**: Choose a term you would like to add and run the EntityRuler on our dataset! Use the code above to add your examples.

## 📚 **3. Additional Material: Training a Custom NER Model (Simple Demo)**

So far, we've used:
- Pre-trained entities (PERSON, ORG, DATE, etc.)
- Rule-based customization (Matcher, EntityRuler)

Another option is to **train the model** to recognize new entity types.  
This requires **annotated data**, i.e., examples of text with entity spans labeled.

⚠️ This is just a toy demo to show the mechanics. Real training needs more data and time.


### Train the model from scratch

### **Understanding Model Training in spaCy**

Before training our own **Named Entity Recognition (NER)** model, here are the key ideas to understand:



##### **Key Concepts**

| 🏷️ **Concept** | 💡 **What it Means** | 🎯 **Why it Matters** |
|:----------------|:--------------------|:----------------------|
| **Annotated data** | Training needs examples where entities are *already labeled* in text, e.g. `"Redpill" → IDEOLOGY`. | The model can only learn from what it sees.<br><br>More examples and variety = better generalization. |
| **Empty model**<br/>`spacy.blank("en")` | Creates a model with **no prior knowledge** ("a clean slate"). | Ideal for demos or custom domains.<br><br>The model learns entirely from the input data. |
| **Pretrained model**<br/>`en_core_web_sm` | A model that already understands **general English** syntax and entities. | You can **fine-tune** it instead of training from scratch.<br><br>This saves time and requires less data. |
| **Adding an NER component** | spaCy pipelines are sequences like:<br/>`tokenizer → tagger → parser → NER`. | Adding an NER step lets the model detect and label entities (e.g., `PERSON`, `ORG`, or custom ones like `IDEOLOGY`). |
| **Token alignment & BILUO tags** | spaCy internally converts entity spans into the **BILUO** format:<br>**B**egin, **I**nside, **L**ast, **U**nit, **O**utside. | Ensures that entity spans match token boundaries.<br><br>This alignment is **essential for error-free training**. |
| **Epochs / iterations** | One "epoch" = one **full pass** through the dataset. Training repeats over multiple epochs. | Each pass helps the model refine its understanding.<br><br>More epochs → more learning (to a point). |
| **Updating model weights** | After every batch, spaCy adjusts internal **weights** based on the difference between predictions and correct labels. | These updates make the model gradually improve.<br><br>Over many updates, accuracy and stability increase. |




**Notes**


* This is a toy example. With only a few sentences, the model will overfit quickly; that's fine for demonstration.
* For deterministic terms (exact names), an EntityRuler is often a better choice. Use training for fuzzier/variable mentions.
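
To make the BILUO scheme from the table concrete, the sketch below converts character offsets into tags for one multi-token and one single-token entity. The sentences and the `ORG` / `PLATFORM` labels are chosen for illustration only.

```python
import spacy
from spacy.training import offsets_to_biluo_tags

blank_nlp = spacy.blank("en")

# Multi-token entity: expected to receive Begin / Inside / Last tags
sentence_multi = "She joined the World Health Organization."
start = sentence_multi.index("World")
end = start + len("World Health Organization")
doc_multi = blank_nlp.make_doc(sentence_multi)
tags_multi = offsets_to_biluo_tags(doc_multi, [(start, end, "ORG")])

# Single-token entity: expected to receive a Unit tag
sentence_single = "She spends time on Reddit."
start = sentence_single.index("Reddit")
doc_single = blank_nlp.make_doc(sentence_single)
tags_single = offsets_to_biluo_tags(doc_single, [(start, start + len("Reddit"), "PLATFORM")])

for token, tag in zip(doc_multi, tags_multi):
    print(f"{token.text:<15}{tag}")
print()
for token, tag in zip(doc_single, tags_single):
    print(f"{token.text:<15}{tag}")
```

Every token gets exactly one tag, which is why entity spans have to start and end on token boundaries.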

Let's start the training!

Here's a summary of what the following code does:


1. **Build a tiny training set** for a Named Entity Recognition (NER) task using a helper function that ensures entity spans align correctly to tokens (this prevents "misalignment" errors during training)
2. Create and **train** a completely blank English NER **model** from scratch on two custom labels: IDEOLOGY and PLATFORM
3. **Evaluate** the trained model on a new test sentence to see if it learned to recognize similar patterns


In [None]:
# Import libraries for text processing and model training
import re
import random
import spacy
from spacy.training import Example, offsets_to_biluo_tags

Define a helper function to build token-aligned entity spans. The `make_example()` function ensures that entity spans (start and end positions) line up exactly with token boundaries. This is required by spaCy for training.

If an entity span cuts through a token (e.g., due to punctuation or whitespace), it will raise a clear error so you can adjust the example.

In [None]:
def make_example(text, spans, nlp_for_tokenization=None):
    """
    text: str -> the input sentence
    spans: list of tuples -> [(substring, LABEL), ...]
           e.g. [("Blackpill", "IDEOLOGY")]
    Finds the FIRST occurrence of each substring in `text`,
    checks that it aligns to token boundaries, and returns
    a tuple in the format spaCy expects: (text, {"entities": [(start, end, LABEL), ...]})
    """
    nlp_tok = nlp_for_tokenization or spacy.blank("en")
    doc = nlp_tok.make_doc(text)
    ents = []
    for substr, label in spans:
        m = re.search(re.escape(substr), text)
        if not m:
            raise ValueError(f"Substring not found: {substr!r} in: {text!r}")
        start_char, end_char = m.start(), m.end()
        if doc.char_span(start_char, end_char) is None:
            # If this happens, the substring doesn't match full tokens.
            # You can fix this by adjusting the substring or the tokenizer.
            tokens = [t.text for t in doc]
            raise ValueError(
                f"Not token-aligned: {substr!r} -> ({start_char},{end_char}). "
                f"Tokens: {tokens}"
            )
        ents.append((start_char, end_char, label))
    return (text, {"entities": ents})
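
The alignment check in the helper relies on `Doc.char_span`, which returns `None` when the offsets cut through a token. The sketch below shows that behavior and one possible workaround, the `alignment_mode="expand"` option; the sentence is invented, and whether expanding a span is appropriate depends on the annotation guidelines.

```python
import spacy

tok = spacy.blank("en")
sentence = "He is Blackpilled."
doc_align = tok.make_doc(sentence)

# Offsets of the substring "Blackpill", which ends inside the token "Blackpilled"
start = sentence.index("Blackpill")
end = start + len("Blackpill")

# Default (strict) mode: misaligned offsets yield None
strict_span = doc_align.char_span(start, end, label="IDEOLOGY")

# "expand" mode snaps the span outward to the nearest token boundaries
expanded_span = doc_align.char_span(start, end, label="IDEOLOGY", alignment_mode="expand")

print("Strict:", strict_span)
print("Expanded:", expanded_span.text if expanded_span is not None else None)
```

Raising an error, as the helper does, is the more conservative choice because it forces a decision about each misaligned example.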

1) Build a tiny toy dataset

In [None]:
# Each entry is created with make_example() to ensure safe alignment.
# The data has two labels: "IDEOLOGY" (e.g., Blackpill, Redpill) and "PLATFORM" (e.g., Reddit, Tinder).

TRAIN_DATA = [
    make_example("He follows the Blackpill ideology.", [("Blackpill", "IDEOLOGY")]),
    make_example("Redpill beliefs are common on these forums.", [("Redpill", "IDEOLOGY")]),
    make_example("She spends time on Reddit.", [("Reddit", "PLATFORM")]),
    make_example("They met through Tinder.", [("Tinder", "PLATFORM")]),
    make_example("Many users argue about Blackpill ideas on Reddit.",
                 [("Blackpill", "IDEOLOGY"), ("Reddit", "PLATFORM")]),
    make_example("Tinder and Reddit are popular apps.",
                 [("Tinder", "PLATFORM"), ("Reddit", "PLATFORM")]),
]


# Optional: quick sanity check for alignment
# -----------------------------------------------------
# The function below visualizes tokenization and entity alignment
# by converting entities into the BILUO tagging scheme.
# BILUO = Begin, Inside, Last, Unit, Outside
# Misaligned entities will appear as '-' in the sequence.

def check_alignment(text, ents):
    doc = spacy.blank("en").make_doc(text)
    print(text)
    print("TOKENS:", [t.text for t in doc])
    print("BILUO:", offsets_to_biluo_tags(doc, ents), "\n")

for text, ann in TRAIN_DATA:
    check_alignment(text, ann["entities"])

2) Create a blank NER pipeline

In [None]:
# Start from an empty English pipeline and add the NER component.
# Register the custom labels so the model knows what to predict.

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("IDEOLOGY")
ner.add_label("PLATFORM")

# Initialize training parameters (weights and optimizer)
optimizer = nlp.initialize()

3) Train the model

In [None]:
# For demonstration purposes, we train for a small number of iterations on a very small dataset.
# This is NOT a realistic setup; it's just to show how the model learns to recognize the two entity types.

# Fix Python's random seed so the shuffling order below is reproducible
# (this alone does not seed the model's weight initialization)
random.seed(42)

# Number of training iterations (epochs)
n_iter = 15

for i in range(n_iter):
    # Shuffle training examples each epoch
    random.shuffle(TRAIN_DATA)
    losses = {}

    # Train on each text-annotation pair
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, losses=losses, drop=0.2)

    # Show progress every 5 epochs
    if i % 5 == 0:
        print(f"Iteration {i} | Losses: {losses}")
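
The loop above updates the model on one example at a time. With more data, a common variation is to update on small batches. The following is a minimal, self-contained sketch of that idea using `spacy.util.minibatch`; the `annotate` helper, `toy_data`, the batch size, and the epoch count are assumptions made for this example.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch


def annotate(text, substring, label):
    """Return (text, {"entities": [(start, end, label)]}) for the first occurrence of substring."""
    start = text.index(substring)
    return (text, {"entities": [(start, start + len(substring), label)]})


# Hypothetical toy data in the same format as TRAIN_DATA
toy_data = [
    annotate("She spends time on Reddit.", "Reddit", "PLATFORM"),
    annotate("They met through Tinder.", "Tinder", "PLATFORM"),
    annotate("He follows the Blackpill ideology.", "Blackpill", "IDEOLOGY"),
    annotate("Redpill beliefs are common on these forums.", "Redpill", "IDEOLOGY"),
]

nlp_toy = spacy.blank("en")
nlp_toy.add_pipe("ner")

# Convert the raw tuples to Example objects once, up front
examples = [Example.from_dict(nlp_toy.make_doc(text), ann) for text, ann in toy_data]

# Passing the examples to initialize() lets spaCy infer the labels from the data
toy_optimizer = nlp_toy.initialize(lambda: examples)

random.seed(42)
for epoch in range(5):
    random.shuffle(examples)
    toy_losses = {}
    # minibatch() yields lists of up to `size` examples
    for batch in minibatch(examples, size=2):
        nlp_toy.update(batch, sgd=toy_optimizer, losses=toy_losses, drop=0.2)
    print(f"Epoch {epoch} | Losses: {toy_losses}")
```

On a dataset this small the batch size makes little practical difference; the pattern matters once there are hundreds of examples or more.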

4) Test the trained model

In [None]:
# Try the model on a new sentence that combines both entity types
# to see if it generalizes beyond the training examples.

test_text = "People debate Redpill ideas on Reddit and meet on Tinder."
doc = nlp(test_text)
print("\nTest text:", test_text)
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])


## 🎯 What We Learned

- Pre-trained NER is a great *starting point*, but…
- Social media / hate speech data has **slang and unique entities** that default models miss.
- Rule-based methods (`Matcher`, `EntityRuler`) let us quickly adapt NER for research.
- Combining **default + rules + fine-tuning** often makes for the strongest pipelines.

For your projects: Think about which entities are most meaningful (people? platforms? ideologies?) and adapt NER accordingly.


## 📚 Further Resources: Named Entity Recognition (NER)

If you'd like to explore Named Entity Recognition further, especially in the context of customization, domain adaptation, or research use, here are some carefully selected resources:


### 🧪 Tutorials and Beginner-Friendly Guides

- **spaCy Course (Highly Recommended)**  
  https://course.spacy.io  
  Interactive tutorials on NER, rule-based matching, and building pipelines.

- **NLTK Book – Chapter 7: Information Extraction**  
  https://www.nltk.org/book/ch07.html  
  Classic introduction to NER using rule-based techniques.

### 🧠 Customizing and Training NER Models with spaCy

- **spaCy NER Docs**  
  https://spacy.io/usage/linguistic-features#named-entities  
  Overview of how NER works in spaCy and how to access entity labels.

- **spaCy Rule-Based Matching** (Matcher & EntityRuler)  
  https://spacy.io/usage/rule-based-matching  
  How to define token patterns and add custom entities.

- **Training a Custom NER Model in spaCy**  
  https://spacy.io/usage/training  
  End-to-end guide to creating training data and training your own model.

- **Using spaCy Projects for Training Pipelines**  
  https://spacy.io/usage/projects  
  Helps manage training configs, assets, and evaluation.

### 🤖 Alternative NER Frameworks

- **Hugging Face Transformers (for Fine-Tuned NER Models)**  
  https://huggingface.co/models?pipeline_tag=token-classification  
  Browse pre-trained NER models like `bert-base-cased-finetuned-conll03`.

- **Tutorial: Fine-Tuning BERT for NER (Hugging Face)**  
  https://huggingface.co/transformers/v4.6.1/custom_datasets.html#named-entity-recognition  
  Advanced tutorial using PyTorch and Hugging Face datasets.


### 📄 (Some) Key Papers and Benchmarks


- **Tjong Kim Sang & De Meulder (2003)**  
  [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://aclanthology.org/W03-0419/)  
  *Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003*.

- **Nadeau & Sekine (2007)**  
  [A Survey of Named Entity Recognition and Classification](https://www.jbe-platform.com/content/journals/10.1075/li.30.1.03nad)  
  *Lingvisticae Investigationes, 30(1), 3-26*

- **Li et al. (2022)**  
  [A Survey on Deep Learning for Named Entity Recognition](https://ieeexplore.ieee.org/abstract/document/9039685)  
  *IEEE Transactions on Knowledge and Data Engineering, 34(1), 50-70*




