

# 🏷️ Step 1: Introduction to Named Entity Recognition (NER)

Named Entity Recognition (**NER**) is a fundamental task in **Natural Language Processing (NLP)** that involves identifying and categorizing key information—or *"named entities"*—in text. These entities can be names of people, organizations, locations, dates, monetary values, and more.

For this task, we'll use **spaCy**, a powerful and efficient open-source library for advanced NLP in Python. It comes with pre-trained models that can recognize a wide variety of entities right out of the box.

### Why is NER useful? 💡

* **Information Extraction:** Automatically pull structured information (like company names and locations) from unstructured text like articles or reports.
* **Content Classification:** Help classify articles by identifying the key people or organizations mentioned.
* **Search and Question-Answering:** Improve the accuracy of search engines by understanding the type of information being sought (e.g., *"Who"* implies a **PERSON**, *"Where"* implies a location, which spaCy labels **GPE** or **LOC**).

---

# ⚙️ Step 2: Installation and Setup

First, we need to install **spaCy** and download one of its pre-trained models for English. We will use **`en_core_web_sm`**, which is a small, efficient model perfect for getting started.

If you are running this in a **Kaggle Notebook**, the internet connection needs to be turned on in the **settings panel** on the right.
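
Before downloading anything, it can help to check what is already present in the environment. The sketch below uses only the standard library and relies on the fact that spaCy models are installed as regular Python packages. The helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name is importable."""
    return importlib.util.find_spec(package_name) is not None


# Report on the library and on the model package used in this notebook.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If both lines report `found`, the install cell below can be skipped.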

---


In [8]:
# Install the spaCy library
!pip install -q spacy

# Download the small English model
!python -m spacy download en_core_web_sm -q

print("✅ spaCy and English model installed successfully.")

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.8.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.6 which is incompatible.
gensim 4.3.3 requires scipy<1.14.0,>=1.7.0, but you have scipy 1.15.3 which is incompatible.
mkl-umath 0.1.1 requires numpy<1.27.0,>=1.26.4, but you have numpy 2.2.6 which is incompatible.
mkl-random 1.2.4 requires numpy<1.27.0,>=1.26.4, but you have n

# 🚀 Step 3: Loading the Model and Performing NER

Now that **spaCy** is set up, we can load the model and use it to process some text.

When we pass a string of text to the loaded **`nlp`** object, **spaCy** returns a **Doc** object. This object is a rich container of linguistic annotations, including the named entities found in the text.

We can access these entities through the **`doc.ents`** attribute, which is a tuple of **`Span`** objects. Each entity has two key attributes:

* **`.text`**: The text of the entity itself (e.g., `"Apple"`).
* **`.label_`**: The label or category of the entity (e.g., `"ORG"` for organization).
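
Beyond `.text` and `.label_`, each entity `Span` also carries `.start_char` and `.end_char`, the character offsets of the entity in the original string. Here is a minimal sketch of how these fit together. As an assumption made only so the example runs without a model download, it uses a blank English pipeline with a rule-based `entity_ruler` in place of `en_core_web_sm`; the names `nlp_demo`, `demo_text`, and `demo_doc` are illustrative.

```python
import spacy

# Assumption: a blank English pipeline plus a rule-based entity_ruler stands in
# for en_core_web_sm so that this sketch runs without a model download.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Steve Jobs"},
    {"label": "GPE", "pattern": "California"},
])

demo_text = "Apple Inc., founded by Steve Jobs in California."
demo_doc = nlp_demo(demo_text)

for ent in demo_doc.ents:
    # start_char / end_char are character offsets into the original string,
    # so slicing the input with them gives back the entity text.
    assert demo_text[ent.start_char:ent.end_char] == ent.text
    print(f"{ent.text:<12} {ent.label_:<7} chars {ent.start_char}-{ent.end_char}")
```

The offsets are useful when entities need to be mapped back onto the source text, for example to highlight or redact them.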

---


In [9]:
import spacy

# Load the pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Sample text for our analysis
text = "Apple Inc., founded by Steve Jobs in California, is looking at buying a U.K. startup for $1 billion in 2025."

# Process the text with the spaCy pipeline
doc = nlp(text)

# Iterate through the entities and print them
print("--- Named Entities Found ---")
for ent in doc.ents:
    print(f"Entity: {ent.text:<20} | Label: {ent.label_}")

--- Named Entities Found ---
Entity: Apple Inc.           | Label: ORG
Entity: Steve Jobs           | Label: PERSON
Entity: California           | Label: GPE
Entity: U.K.                 | Label: GPE
Entity: $1 billion           | Label: MONEY
Entity: 2025                 | Label: DATE
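
A flat list like this is often easier to work with once it is grouped by label. The following sketch does that with a plain dictionary. To keep it self-contained, the `(text, label)` pairs are copied from the output above; the variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; in the notebook itself
# they could be built with: [(ent.text, ent.label_) for ent in doc.ents]
entity_pairs = [
    ("Apple Inc.", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("California", "GPE"),
    ("U.K.", "GPE"),
    ("$1 billion", "MONEY"),
    ("2025", "DATE"),
]

# Collect entity texts under their label, preserving the order of appearance.
entities_by_label = defaultdict(list)
for entity_text, label in entity_pairs:
    entities_by_label[label].append(entity_text)

for label, texts in entities_by_label.items():
    print(f"{label:<7} {texts}")
```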


# 🧐 Step 4: Understanding Entity Labels

The labels like **ORG**, **PERSON**, and **GPE** are abbreviations. **spaCy** provides a handy function, `spacy.explain()`, to get a clear definition for each label. This is extremely useful for understanding what the model is identifying.

Here are some of the most common labels:

* **PERSON**: People, including fictional characters.
* **ORG**: Organizations, companies, agencies, institutions.
* **GPE**: Geopolitical Entity (countries, cities, states).
* **DATE**: Absolute or relative dates or periods.
* **MONEY**: Monetary values, including unit.
* **PRODUCT**: Objects, vehicles, foods, etc. (not services).
* **LOC**: Non-GPE locations, mountain ranges, bodies of water.
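
`spacy.explain()` returns `None` when it does not know a term, so a small wrapper can be convenient when printing labels in bulk. This is only a sketch: the helper `describe_label` and the placeholder string are assumptions made for this example, not part of spaCy.

```python
import warnings

import spacy


def describe_label(label: str) -> str:
    """Return spaCy's glossary text for a label, or a placeholder if unknown."""
    with warnings.catch_warnings():
        # spacy.explain() may warn about terms missing from its glossary
        warnings.simplefilter("ignore")
        description = spacy.explain(label)
    return description or "(no description available)"


for label in ("PERSON", "LOC", "NOT_A_LABEL"):
    print(f"{label:<12} {describe_label(label)}")
```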

---


In [10]:
# Use spacy.explain() to get definitions for common labels
print("--- Label Explanations ---")
print(f"GPE: {spacy.explain('GPE')}")
print(f"ORG: {spacy.explain('ORG')}")
print(f"MONEY: {spacy.explain('MONEY')}")
print(f"DATE: {spacy.explain('DATE')}")

--- Label Explanations ---
GPE: Countries, cities, states
ORG: Companies, agencies, institutions, etc.
MONEY: Monetary values, including unit
DATE: Absolute or relative dates or periods


# 🎨 Step 5: Visualizing Entities with displaCy

Reading a list of entities works, but **visualizing** them directly in the text is better for analysis and presentation. **spaCy** includes a built-in visualizer called **displaCy**.

We can use **displaCy** to render the text with the named entities **highlighted and labeled**, making it very easy to see the model's output at a glance.
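
displaCy's entity view can also be customized through an `options` dictionary, and with `jupyter=False` it returns the markup as a string rather than displaying it. The sketch below shows both. As an assumption made so it runs without a model download, it builds its own small `Doc` with a rule-based `entity_ruler`; the color value and variable names are arbitrary choices for illustration.

```python
import spacy
from spacy import displacy

# Assumption: rule-based stand-in for the pretrained model, so no download is needed.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Steve Jobs"},
    {"label": "GPE", "pattern": "California"},
])
demo_doc = nlp_demo("Steve Jobs founded the company in California.")

# "ents" limits which labels are highlighted; "colors" overrides the default palette.
options = {"ents": ["PERSON"], "colors": {"PERSON": "#ffd966"}}

# With jupyter=False, render() returns the HTML markup as a string.
html = displacy.render(demo_doc, style="ent", options=options, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file when the visualization needs to live outside the notebook.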

---


In [11]:
from spacy import displacy

# Use displaCy to render the entities in the text
# jupyter=True forces inline notebook rendering instead of relying on auto-detection
print("--- Visualizing Entities ---")
displacy.render(doc, style="ent", jupyter=True)

--- Visualizing Entities ---


# 📰 Step 6: Real-World Example - Analyzing a News Snippet

Let's apply everything we've learned to a **more realistic piece of text**, like a snippet from a news article. This demonstrates how **NER** can be used to quickly **extract key information** from unstructured data.

We will process the text and then use **displaCy** to **visualize the result**.

---


In [12]:
# A more complex text from a hypothetical news article
news_snippet = """
Sundar Pichai, the CEO of Google, announced yesterday at a conference in Paris 
that the company's new AI model, Gemini, will be integrated into Android phones 
starting next Tuesday. This move is expected to cost Alphabet Inc. over $500 million 
but aims to compete with Microsoft's recent advancements. The first updates 
will roll out across North America.
"""

# Process the news snippet
news_doc = nlp(news_snippet)

# Visualize the entities found in the article
print("--- NER on News Snippet ---")
displacy.render(news_doc, style="ent", jupyter=True)

--- NER on News Snippet ---
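
One detail of the cell above: the triple-quoted string begins and ends with a newline and contains hard line breaks in the middle of sentences. spaCy keeps such whitespace in the `Doc`, so it can show up as extra line breaks in the rendered output. A minimal sketch of collapsing it first, using only standard string methods (the name `clean_snippet` is introduced here for illustration):

```python
news_snippet = """
Sundar Pichai, the CEO of Google, announced yesterday at a conference in Paris 
that the company's new AI model, Gemini, will be integrated into Android phones 
starting next Tuesday. This move is expected to cost Alphabet Inc. over $500 million 
but aims to compete with Microsoft's recent advancements. The first updates 
will roll out across North America.
"""

# split() with no arguments splits on any run of whitespace (spaces, newlines),
# so joining with a single space yields one clean line of text.
clean_snippet = " ".join(news_snippet.split())
print(clean_snippet)
```

The cleaned string can then be passed to `nlp(...)` exactly as before.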
