## Named Entity Recognition (NER)

**Named Entity Recognition (NER)** is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into pre-defined categories. Common categories include person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

In this example, we will demonstrate how to use spaCy's named entity recognition capabilities to identify and classify entities in a text. We will cover:
*   Identifying standard entities.
*   Understanding Tokens, Spans, Entities, and Labels.
*   Customizing entities (adding or modifying them).

In [1]:
# Import the spaCy library for natural language processing
import spacy

In [2]:
# Load the pre-trained English language model with NER capabilities
nlp = spacy.load("en_core_web_sm")

In [5]:
# Define a helper function to display named entities from a document
def show_ents(doc):
    # Check if the document contains any named entities
    if not doc.ents:
        print("No named entities found.")
    else:
        # Iterate through each named entity and display it with its label
        for ent in doc.ents:
            print(f"{ent.text:<15} {ent.label_:<10} {spacy.explain(ent.label_)}")

In [6]:
# Process a simple text with the NER pipeline
doc = nlp("Hi, how are you?")

In [8]:
# Display the named entities extracted from the simple text
show_ents(doc)

No named entities found.


In [9]:
# Process a complex sentence with multiple named entities for NER demonstration
# This sentence contains organizations, people, locations, and dates
doc = nlp(
    "Apple Inc. was founded by Steve Jobs in Cupertino, California on April 1, 1976. The company is now led by Tim Cook."
)

In [10]:
# Display all the named entities extracted from the complex text
show_ents(doc)

Apple Inc.      ORG        Companies, agencies, institutions, etc.
Steve Jobs      PERSON     People, including fictional
Cupertino       GPE        Countries, cities, states
California      GPE        Countries, cities, states
April 1, 1976   DATE       Absolute or relative dates or periods
Tim Cook        PERSON     People, including fictional


In [34]:
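# Process a sentence in which "Tesla" appears both as a company and as a surname
doc = nlp("When was Tesla established by Nikola Tesla?")
# Slice tokens 5 and 6 (the end index is exclusive) to get the Span "Nikola Tesla"
doc[5:7]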

Nikola Tesla
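
The slice `doc[5:7]` works because a `Doc` is a sequence of tokens, each with an integer index, and slicing it returns a `Span`. The sketch below lists those indices for the same sentence. It uses a blank English pipeline (`spacy.blank("en")`), which is an assumption made here so the example runs without the trained model; it tokenizes but predicts no entities.

```python
import spacy

# A blank English pipeline has a tokenizer but no NER component,
# so this runs without downloading en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("When was Tesla established by Nikola Tesla?")

# Each token carries its integer position in the Doc as token.i
indexed_tokens = [(token.i, token.text) for token in doc]
print(indexed_tokens)

# Slicing a Doc returns a Span; the end index is exclusive
span = doc[5:7]
print(span.text, span.start, span.end)
```

An entity is simply a `Span` that also carries a label, which is why the same token indices are reused when building a labeled `Span` by hand.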

In [35]:
# Display the entities; here the model labels "Nikola Tesla" as ORG instead of PERSON
show_ents(doc)

Tesla           ORG        Companies, agencies, institutions, etc.
Nikola Tesla    ORG        Companies, agencies, institutions, etc.


In [36]:
from spacy.tokens import Span

In [37]:
# 1. Create the new Span with the desired "PERSON" label
# Tokens 5 and 6 are 'Nikola' and 'Tesla' (the end index 7 is exclusive)
new_ent = Span(doc, 5, 7, label="PERSON")

In [38]:
# 2. Create a list of all CURRENT entities, excluding the one we want to replace
# We check 'ent.start != new_ent.start' to filter out the old "Nikola Tesla" (ORG)
cleaned_ents = [ent for ent in doc.ents if ent.start != new_ent.start]
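
The filtering step matters because spaCy does not accept two entities that cover the same token. The sketch below shows what happens if the new span is appended without removing the old one. The entities are set by hand on a blank pipeline to stand in for the model's output (an assumption for illustration), and `overlap_rejected` is a hypothetical flag used only to record the outcome.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("When was Tesla established by Nikola Tesla?")

# Recreate the entities shown earlier by hand (stand-in for the model's output)
doc.set_ents([Span(doc, 2, 3, label="ORG"), Span(doc, 5, 7, label="ORG")])

new_ent = Span(doc, 5, 7, label="PERSON")

try:
    # Tokens 5 and 6 would now belong to two entities at once
    doc.set_ents(list(doc.ents) + [new_ent])
    overlap_rejected = False
except ValueError as err:
    overlap_rejected = True
    print(err)
```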

In [39]:
# 3. Combine the lists and SORT them by position
# Sorting by start token keeps the entities in document order
all_ents = cleaned_ents + [new_ent]
all_ents.sort(key=lambda span: span.start)

In [40]:
# 4. Assign the complete, sorted list back to the doc
doc.set_ents(all_ents)

In [41]:
# Verify that "Nikola Tesla" is now labeled PERSON
show_ents(doc)

Tesla           ORG        Companies, agencies, institutions, etc.
Nikola Tesla    PERSON     People, including fictional
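
As a possible shortcut, `Doc.set_ents` accepts a `default` argument that controls what happens to tokens not covered by the spans you pass. With `default="unmodified"`, only the tokens of the new span are rewritten and every other entity is left alone, so the filter-and-sort steps are not needed when the new span covers exactly the tokens of the entity it replaces. This is a minimal sketch on a blank pipeline with hand-set entities (an assumption so it runs without the trained model).

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("When was Tesla established by Nikola Tesla?")

# Stand-in for the model's prediction: both mentions labeled ORG
doc.set_ents([Span(doc, 2, 3, label="ORG"), Span(doc, 5, 7, label="ORG")])

# Overwrite only tokens 5-6; all other tokens keep their current annotation
doc.set_ents([Span(doc, 5, 7, label="PERSON")], default="unmodified")

result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

If the new span only partly covered an existing entity, the leftover tokens would keep their old annotation, so the explicit filtering approach is the safer choice in that case.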
