### Information Extraction: Named Entity Recognition (NER) and IOB Labelling

In this exercise, we focus on **Named Entity Recognition (NER)**, a core Information Extraction task that locates named entities (mostly proper nouns, plus expressions such as dates and monetary amounts) and classifies them into predefined categories such as Person, Organisation, or Location. We use the `spaCy` library to perform NER and then introduce the **IOB (Inside-Outside-Beginning)** tagging scheme, a standard way to represent entity boundaries and types at the token level. Understanding both the spaCy output and the IOB scheme is vital for building and evaluating custom NER systems. As in the last exercise, we use a small sample from our email summary dataset to explore the concepts in more depth.

#### What we will cover in this exercise

- **Named Entity Recognition (NER)**: Definition and spaCy application.

- **Entity Visualisation and Programmatic Access**: Displaying and extracting entities and labels.

- **IOB Tagging Scheme**: Introduction to B-, I-, and O-labels.

- **IOB Tag Extraction**: Demonstrating how to generate IOB tags from text.

- **Practical Application**: Illustrating structured entity and IOB labelling on examples.

#### What we expect to learn from this:

- **NER** identifies named entities (mostly proper nouns, plus dates, amounts, and similar expressions) and assigns each a type (e.g., ORG, GPE).

- `spaCy` provides pre-trained models for fast NER; accuracy depends on the model size and on how closely the text resembles the model's training data.

- Entities are accessible as spans in spaCy's `doc.ents` attribute.

- The **IOB (Inside-Outside-Beginning)** scheme is a widely used standard for marking entity boundaries, essential for training and evaluating sequence labelling models.

**Let's get started.**

#### Setup and Prerequisites

**Installation Commands**: We rely on the standard `spacy` library for NER.

In [None]:
# Install the necessary library
!pip install spacy
# Download the specific language model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 104.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [1]:
import spacy
from spacy import displacy

print(f"spaCy Version: {spacy.__version__}")

# Load the spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Error loading spaCy model. Ensure it's downloaded.")
    # Re-raise so later cells do not fail with a confusing NameError on `nlp`
    raise

spaCy Version: 3.8.7


**Sample Corpus**

We create a small corpus from our email dataset.


In [None]:
import os
import yaml

config_path = '/Users/aditikulkarni/Documents/Masters/AI-Projects/05-DL-NLP/nlp-semantic/'
# Build the path to environment.yml once (os.path.join avoids a doubled slash)
env_file = os.path.join(config_path, "configs", "environment.yml")
print(env_file)
# Load the environment.yml file
with open(env_file, "r") as f:
    config = yaml.safe_load(f)

# Choose environment (local or aws)
env = "local"   # or "aws"

base_path = config[env]["base_path"]
raw_data_path = base_path + config[env]["raw_data"]
processed_data_path = base_path + config[env]["processed_data"]
models_path = base_path + config[env]["models"]

print("Raw data path:", raw_data_path)
print("Processed data path:",  processed_data_path)
print("Models path:",  models_path)

In [None]:
# Loading the JSON data
import json

with open(raw_data_path + "/content/email_thread_details.json") as f:
    email_data = json.load(f)
with open(raw_data_path + "/content/email_thread_summaries.json") as f:
    email_summary = json.load(f)

In [21]:
import random

# No seed is set, so the sampled records (and the outputs below) differ between runs
sampled_keys = random.sample(range(len(email_summary)), 10)

In [22]:
# Build the corpus from the first sentence of each sampled summary.
# Note: splitting on ". " also cuts at abbreviations such as "Ft.".
CORPUS = []

for key in sampled_keys:
    CORPUS.append(email_summary[key]['summary'].split(". ")[0])
print(CORPUS)

['There is confusion regarding a deal with Columbia Gas of Ohio for January 2000', 'Art Avalos reached out to Kim Ward to inquire about comments on their Master Enfolio and to request a quote for a daily call option for Q3', 'There is a discussion about Enron Online (EOL) in this email thread', 'Carol North works in the Credit Group and will be working with Aparna Rajaram and Ken Curry', 'Enron issued a press release stating that they had entered into an energy services agreement with Hades worth an estimated $666 trillion', 'There is a reorientation session scheduled for Wednesday, March 27th from 9:00-10:30am in EB5C2', 'California state treasurer Phil Angelides expressed confidence that Southern California (SoCal) would go bankrupt due to their failure to ring-fence the parent company and the opposition from their equity holders', 'The email thread discusses the fuel specification requirements for purchasing fuel oil for the Ft', 'The sender is informing the recipient that they will

#### Named Entity Recognition (NER)

**Named Entity Recognition (NER)** is the task of locating and classifying named entities in text into predefined categories. These categories typically include names of people, organisations, locations, expressions of times, quantities, monetary values, and more.

**Applying spaCy NER**

We process the sample text to identify the entities.

In [None]:
SAMPLE_TEXT = CORPUS[8]

doc = nlp(SAMPLE_TEXT)

print("\n--- NER Visualisation with displaCy ---")
displacy.render(doc, style="ent", jupyter=True)


--- NER Visualisation with displaCy ---


spaCy's model identified several entities:
'Sarah' and 'Matty' as PERSON, 'Puerto Rico' as GPE (geopolitical entity), and 'Memorial Day' as DATE.

#### Entity Visualisation and Programmatic Access

In spaCy, entities are stored as `Span` objects in the `doc.ents` attribute, which is a tuple. We can iterate over it to access each entity's text, its label, and its token position.

In [26]:
print("--- Extracted Entities and Labels (Programmatic Access) ---")
print(f"{'Entity Text':20} | {'Start Token':12} | {'End Token':10} | {'Label':15} | {'Definition'}")
print("-" * 70)

for ent in doc.ents:
    # ent.text: The extracted entity string
    # ent.start: The index of the first token in the entity
    # ent.end: The index immediately after the last token in the entity
    # ent.label_: The standard entity type (e.g., PERSON, ORG)

    # Retrieve the human-readable definition of the label
    label_def = spacy.explain(ent.label_)

    print(f"{ent.text:20} | {ent.start:<12} | {ent.end:<10} | {ent.label_:15} | {label_def}")

--- Extracted Entities and Labels (Programmatic Access) ---
Entity Text          | Start Token  | End Token  | Label           | Definition
----------------------------------------------------------------------
Puerto Rico          | 12           | 14         | GPE             | Countries, cities, states
Sarah                | 15           | 16         | PERSON          | People, including fictional
Matty                | 17           | 18         | PERSON          | People, including fictional
Memorial Day         | 19           | 21         | DATE            | Absolute or relative dates or periods
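
Once entities are available as records, a common next step is to group them by label. The sketch below is only illustrative: it hard-codes the four rows from the table above instead of calling spaCy, and `group_by_label` is a hypothetical helper name. With a live `doc`, the same rows could be built as `[(e.text, e.start, e.end, e.label_) for e in doc.ents]`.

```python
from collections import defaultdict

# Rows copied from the table above: (text, start_token, end_token, label).
entities = [
    ("Puerto Rico", 12, 14, "GPE"),
    ("Sarah", 15, 16, "PERSON"),
    ("Matty", 17, 18, "PERSON"),
    ("Memorial Day", 19, 21, "DATE"),
]


def group_by_label(rows):
    """Map each label to the entity texts carrying it, in document order."""
    grouped = defaultdict(list)
    for text, _start, _end, label in rows:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
print(by_label)
```

Keeping the start and end indices in each row means the grouping can later be extended to carry positions as well as text.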


#### IOB Tagging Scheme

The **IOB (Inside-Outside-Beginning)** tagging scheme is a simple and widely adopted format for representing token-level entity boundaries when training and evaluating NER models. It assigns one of three kinds of tag to each token:

- B-: **Beginning** of an entity (e.g., B-PERSON).

- I-: **Inside** an entity (e.g., I-PERSON).

- O: **Outside** of any named entity.

Because every entity starts with a B- tag, multi-token entities are segmented unambiguously, and two adjacent entities of the same type remain distinguishable.

| Token | Entity | IOB Tag   |
|-------|--------|-----------|
| Elon  | PERSON | B-PERSON  |
| Musk  | PERSON | I-PERSON  |
| CEO   | O      | O         |
| of    | O      | O         |
| Tesla | ORG    | B-ORG     |


#### IOB Tag Extraction

**Demonstrating Token-Level Tagging**: `spaCy` offers the `token.ent_iob_` and `token.ent_type_` attributes, which together allow us to generate the full IOB tags.

In [None]:
print("--- Generating IOB Tags for Each Token ---")
print(f"{'Token':10} | {'IOB Tag':8} | {'Entity Type'}")
print("-" * 30)

iob_tags = []
for token in doc:
    # token.ent_iob_ gives B, I, or O
    # token.ent_type_ gives the entity label (e.g., PERSON)

    iob_prefix = token.ent_iob_
    entity_type = token.ent_type_

    # Combine prefix and type, or just use 'O'
    if iob_prefix == 'O':
        full_iob_tag = 'O'
    else:
        full_iob_tag = f"{iob_prefix}-{entity_type}"

    iob_tags.append((token.text, full_iob_tag))

    print(f"{token.text:10} | {iob_prefix:8} | {entity_type}")

# Display the final IOB sequence:
print("\nIOB Sequence (Token, Tag):")
print(iob_tags)

--- Generating IOB Tags for Each Token ---
Token      | IOB Tag  | Entity Type
------------------------------
The        | O        | 
sender     | O        | 
is         | O        | 
informing  | O        | 
the        | O        | 
recipient  | O        | 
that       | O        | 
they       | O        | 
will       | O        | 
be         | O        | 
going      | O        | 
to         | O        | 
Puerto     | B        | GPE
Rico       | I        | GPE
with       | O        | 
Sarah      | B        | PERSON
and        | O        | 
Matty      | B        | PERSON
during     | O        | 
Memorial   | B        | DATE
Day        | I        | DATE
weekend    | O        | 

IOB Sequence (Token, Tag):
[('The', 'O'), ('sender', 'O'), ('is', 'O'), ('informing', 'O'), ('the', 'O'), ('recipient', 'O'), ('that', 'O'), ('they', 'O'), ('will', 'O'), ('be', 'O'), ('going', 'O'), ('to', 'O'), ('Puerto', 'B-GPE'), ('Rico', 'I-GPE'), ('with', 'O'), ('Sarah', 'B-PERSON'), ('and', 'O'), ('Matty'
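
The reverse direction, recovering entity spans from an IOB sequence, can be sketched as follows. This is a hedged illustration rather than spaCy's own logic: `iob_to_spans` is a hypothetical helper, the token and tag lists reproduce the tail of the table above (from `to` onwards), and a stray `I-` tag with no open entity is leniently treated as a beginning.

```python
def iob_to_spans(tags):
    """Collect (start, end, label) spans from a list of IOB tags; `end` is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        if start is not None and not continues:
            # The open entity ends just before this token.
            spans.append((start, i, label))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # Lenient choice: an I- tag with no open entity starts a new one.
            start, label = i, tag_label
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["to", "Puerto", "Rico", "with", "Sarah", "and", "Matty",
          "during", "Memorial", "Day", "weekend"]
tags = ["O", "B-GPE", "I-GPE", "O", "B-PERSON", "O", "B-PERSON",
        "O", "B-DATE", "I-DATE", "O"]

spans = iob_to_spans(tags)
texts = [" ".join(tokens[s:e]) for s, e, _ in spans]
print(spans)
print(texts)
```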

#### Practical Application: Illustrating Entity Structuring

Let's use a sentence with a multi-token organisation name and demonstrate how the IOB scheme correctly segments it.

In [None]:
SENTENCE_2 = "Apple hired Maria Silva, a new director from The Coca-Cola Company."
doc_2 = nlp(SENTENCE_2)

print("\n--- Practical Example: Multi-token Entity ---")
displacy.render(doc_2, style="ent", jupyter=True)

print(f"{'Token':10} | {'Full IOB Tag'}")
print("-" * 25)

for token in doc_2:
    iob_prefix = token.ent_iob_
    entity_type = token.ent_type_

    if iob_prefix == 'O':
        full_iob_tag = 'O'
    else:
        full_iob_tag = f"{iob_prefix}-{entity_type}"

    print(f"{token.text:10} | {full_iob_tag}")

# Illustration of IOB Boundaries:
# - 'The' is B-ORG (Beginning of Organisation)
# - 'Coca', '-', 'Cola' and 'Company' are each I-ORG (Inside Organisation)
# spaCy splits 'Coca-Cola' into three tokens, so this sequence marks the
# five-token entity 'The Coca-Cola Company' without ambiguity.


--- Practical Example: Multi-token Entity ---


Token      | Full IOB Tag
-------------------------
Apple      | B-ORG
hired      | O
Maria      | B-PERSON
Silva      | I-PERSON
,          | O
a          | O
new        | O
director   | O
from       | O
The        | B-ORG
Coca       | I-ORG
-          | I-ORG
Cola       | I-ORG
Company    | I-ORG
.          | O


#### Conclusion

This exercise introduced **Named Entity Recognition** with `spaCy`, demonstrating how to extract and classify entities quickly, and then covered the **IOB (Inside-Outside-Beginning)** tagging scheme.

NER systems transform unstructured text into structured data by:

- Identifying the boundary of an entity (via B- and I- tags).

- Assigning a type to the entity (e.g., ORG, PERSON).

The IOB representation gives every token exactly one label, which is what allows sequence labelling models to be trained and evaluated effectively, making it a fundamental concept for anyone working with modern information extraction pipelines.
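
As a small illustration of the evaluation side, the sketch below scores predicted entity spans against gold spans using exact-match precision, recall, and F1. Everything here is assumed for illustration: `entity_prf` is a hypothetical helper, the gold spans follow the token positions of the Coca-Cola example above, and the prediction (which misses the leading 'The') is invented rather than produced by any model.

```python
def entity_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Gold spans for "Apple hired Maria Silva, a new director from The Coca-Cola Company."
gold = [(0, 1, "ORG"), (2, 4, "PERSON"), (9, 14, "ORG")]
# Hypothetical prediction whose last span starts one token late.
pred = [(0, 1, "ORG"), (2, 4, "PERSON"), (10, 14, "ORG")]

precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, a span with a wrong boundary counts as both a missed gold entity and a spurious prediction, which is why accurate B- and I- tags matter.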