# Named Entity Recognition (NER) Pipeline with spaCy Assignment

## Introduction
**Named Entity Recognition (NER)** is a fundamental task in Natural Language Processing (NLP): identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, dates, and monetary values. It is a key step in information extraction, question answering, and many other NLP applications.

In this assignment, you will build and explore an NER pipeline using **spaCy**, an open-source NLP library for Python. spaCy ships pre-trained pipelines that perform NER out of the box, along with tools to customize and extend them.

---

## Learning Objectives
Upon completion of this assignment, you should be able to:
- Load and use pre-trained spaCy models for Named Entity Recognition.
- Extract and interpret entities (text, label, span) from a document.
- Visualize NER results using `displacy`.
- Understand common spaCy entity types.
- Implement rule-based custom NER using `EntityRuler`.
- Discuss the challenges and real-world applications of NER.

---

## Setup and Sample Data
We will start by loading the necessary spaCy model. Ensure you have installed spaCy (`pip install spacy`) and downloaded a model (e.g., `python -m spacy download en_core_web_sm`).

We'll use a few sample sentences for demonstration. Training a custom statistical NER model typically requires a large *annotated* dataset. That is beyond the scope of the hands-on part of this assignment, but Question 4 covers it in discussion form.

---

In [None]:
import spacy
from spacy import displacy

print(f"spaCy Version: {spacy.__version__}")

# Load a pre-trained English model
try:
    nlp = spacy.load('en_core_web_sm')
    print("\nspaCy model 'en_core_web_sm' loaded successfully.")
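except OSError:
    # Model not installed yet: download it once, then load it.
    # In some notebook environments the runtime may need a restart before the new package loads.
    print("\nDownloading spaCy model 'en_core_web_sm'...")
    spacy.cli.download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')
    print("spaCy model 'en_core_web_sm' downloaded and loaded.")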

# Sample texts for the assignment
sample_texts = [
    "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne on April 1, 1976, in Cupertino, California.",
    "Google's CEO Sundar Pichai announced a new AI initiative in Mountain View, California, promising billions in investment by 2025.",
    "Dr. Emily White, a senior researcher at Oxford University, presented her findings on quantum physics at the recent conference in London.",
    "Tesla is planning to open a new Gigafactory in Texas, creating thousands of jobs and boosting its production capacity by 50%.",
    "The United Nations will hold its next general assembly in New York City, discussing global climate action. France will attend.",
    "My new MacBook Pro, bought for $2500, has 16GB RAM. I love its speed."
]

print("\nSample texts loaded. Total texts:", len(sample_texts))
print("\nFirst sample text:\n", sample_texts[0])

---

## Assignment Questions

---

### Question 1: Basic NER with Pre-trained Model
Use the loaded `en_core_web_sm` model to perform NER on the first entry of `sample_texts`.

1.  **Process the Text:** Pass `sample_texts[0]` through the `nlp` pipeline to create a `Doc` object.
2.  **Extract Entities:** Iterate through `doc.ents` and print the `text`, `label_`, and `start_char`/`end_char` for each detected entity.
3.  **Visualize Entities:** Use `displacy.render()` with `style='ent'` to visualize the entities in the text; the default style renders the dependency parse instead. (Set `jupyter=True` to force inline rendering when running in Jupyter.)

---

### Question 2: Understanding spaCy Entity Types
spaCy's pre-trained models recognize a variety of entity types.

1.  **List Entity Types:** Print all the entity types (labels) that the `en_core_web_sm` model can recognize. You can typically find this in the model's documentation or by inspecting `nlp.get_pipe('ner').labels`.
2.  **Describe 3 Entity Types:** Choose any three entity types that were *not* present in the output of Question 1 and briefly describe what they represent (e.g., `PRODUCT`, `EVENT`, `LAW`). `spacy.explain(label)` returns a short description of a label.

---

### Question 3: Custom NER with `EntityRuler` (Rule-based)
Sometimes, pre-trained models might miss specific entities or you might need to recognize custom entity types not present in the standard labels. `EntityRuler` allows you to add rule-based entities to your pipeline.

Imagine you want to extract `SOFTWARE_PRODUCT` entities, like "MacBook Pro" or "Windows 11", and `JOB_TITLE` entities like "senior researcher" or "CEO" more reliably.

1.  **Add an `EntityRuler` to the Pipeline:** In spaCy v3 and later, the ruler is created by the pipeline rather than instantiated directly. Call `ruler = nlp.add_pipe('entity_ruler', before='ner')` to place it *before* the existing `ner` component; the call returns the new `EntityRuler`. With this ordering, spans matched by your rules are set first and the statistical `ner` component adjusts its predictions around them.
2.  **Define Patterns:** Create a list of patterns for at least 2 `SOFTWARE_PRODUCT` entities and 2 `JOB_TITLE` entities. Each pattern is a dictionary with `label` and `pattern` keys, where `pattern` is either an exact string or a list of token dictionaries.
    * Example for `SOFTWARE_PRODUCT`: `{'label': 'SOFTWARE_PRODUCT', 'pattern': [{'TEXT': 'MacBook'}, {'TEXT': 'Pro'}]}`
    * Example for `JOB_TITLE`: `{'label': 'JOB_TITLE', 'pattern': [{'LOWER': 'senior'}, {'LOWER': 'researcher'}]}`
3.  **Register the Patterns:** Pass the list to the ruler with `ruler.add_patterns(patterns)`.
4.  **Test and Visualize:** Process `sample_texts[5]` (the MacBook Pro example) and `sample_texts[1]` (the CEO example) with the *modified* pipeline. Print and visualize the detected entities using `displacy.render()`. Check if your new entity types are correctly identified.

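**Note:** If you add the ruler *after* `ner`, the statistical model may already have labeled some of your target spans, and by default the ruler does not overwrite existing entities (pass `config={'overwrite_ents': True}` to `nlp.add_pipe` to change this). Whether to place the ruler before or after `ner` depends on which component should win when the two disagree.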
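
The sketch below shows one way to wire the ruler up. It runs on a blank English pipeline (`spacy.blank("en")`) so that only the rules produce entities; this is a simplification for illustration, and in the notebook you would pass the loaded `en_core_web_sm` pipeline instead. The helper `add_custom_ruler` and the variable `nlp_rules` are hypothetical names.

```python
import spacy

patterns = [
    {"label": "SOFTWARE_PRODUCT", "pattern": [{"TEXT": "MacBook"}, {"TEXT": "Pro"}]},
    {"label": "SOFTWARE_PRODUCT", "pattern": [{"TEXT": "Windows"}, {"TEXT": "11"}]},
    {"label": "JOB_TITLE", "pattern": [{"LOWER": "senior"}, {"LOWER": "researcher"}]},
    {"label": "JOB_TITLE", "pattern": [{"LOWER": "ceo"}]},
]


def add_custom_ruler(nlp, patterns):
    """Hypothetical helper: attach an EntityRuler, ahead of 'ner' when it exists."""
    if "ner" in nlp.pipe_names:
        ruler = nlp.add_pipe("entity_ruler", before="ner")
    else:
        ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return ruler


# A blank pipeline has a tokenizer but no statistical components,
# so every entity below comes from the rules alone.
nlp_rules = spacy.blank("en")
add_custom_ruler(nlp_rules, patterns)

texts = [
    "My new MacBook Pro, bought for $2500, has 16GB RAM. I love its speed.",
    "Google's CEO Sundar Pichai announced a new AI initiative in Mountain View, "
    "California, promising billions in investment by 2025.",
]
results = [[(ent.text, ent.label_) for ent in nlp_rules(t).ents] for t in texts]
for row in results:
    print(row)
```

With the blank pipeline, only the rule matches appear: `MacBook Pro` as `SOFTWARE_PRODUCT` in the first text and `CEO` as `JOB_TITLE` in the second. Running the same helper on the pre-trained pipeline would add the statistical entities around these spans.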

---

### Question 4: Advanced NER Concepts (Discussion)
Training a full statistical custom NER model from scratch involves more steps than rule-based approaches.

1.  **Data Annotation:** Briefly describe why data annotation is critical for training a statistical NER model. What tools or approaches are commonly used for this?
2.  **Training Process (High-Level):** Outline the high-level steps involved in training a statistical custom NER model with spaCy (e.g., data format, training loop, evaluation).
3.  **Evaluation Metrics for NER:** Beyond simple accuracy, what specific metrics are commonly used to evaluate NER models (e.g., Precision, Recall, F1-score for entities)? Why are they important?
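
To make the entity-level metrics concrete, here is a small pure-Python sketch that scores predictions by exact match on `(start_char, end_char, label)`. The function name `prf_exact_match` is hypothetical, and both the gold annotations and the predictions are hand-written for illustration rather than produced by a model.

```python
def prf_exact_match(gold, predicted):
    """Entity-level precision, recall, and F1 over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


text = "Tesla is planning to open a new Gigafactory in Texas."
# Hand-written gold annotations as (start_char, end_char, label).
gold = [(0, 5, "ORG"), (47, 52, "GPE")]
# Invented predictions: "Tesla" is correct, "Gigafactory" is spurious, "Texas" is missed.
predicted = [(0, 5, "ORG"), (32, 43, "ORG")]

precision, recall, f1 = prf_exact_match(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.50 f1=0.50
```

Because most tokens in a text are not part of any entity, token-level accuracy can look high even when many entities are missed; counting whole entities, as above, avoids that.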

---

### Question 5: Applications of NER
Named Entity Recognition has numerous practical applications across various industries.

1.  **Information Extraction:** Describe how NER can be used for automated information extraction from unstructured text (e.g., resumes, legal documents).
2.  **Search and Recommendation:** How can NER enhance search functionality or recommendation systems?
3.  **Customer Support/Chatbots:** Provide an example of how NER can improve customer support or chatbot interactions.
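
As a starting point for the information extraction question, the sketch below turns a flat list of entities into a structured record keyed by label. The function name `entities_to_record` is hypothetical, and the `(text, label)` pairs are hand-written stand-ins for what iterating over `doc.ents` might yield; a real model's output may differ.

```python
from collections import defaultdict


def entities_to_record(entities):
    """Group (text, label) pairs by label, keeping first-seen order without duplicates."""
    record = defaultdict(list)
    for ent_text, label in entities:
        if ent_text not in record[label]:
            record[label].append(ent_text)
    return dict(record)


# Hand-written pairs for illustration, not actual model output.
entities = [
    ("Apple Inc.", "ORG"),
    ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"),
    ("Ronald Wayne", "PERSON"),
    ("April 1, 1976", "DATE"),
    ("Cupertino", "GPE"),
    ("California", "GPE"),
]
record = entities_to_record(entities)
print(record["PERSON"])
# prints: ['Steve Jobs', 'Steve Wozniak', 'Ronald Wayne']
```

A record like this could feed a database row, a search index keyed by entity, or the slots a chatbot needs to fill before answering.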

---

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_spacy_ner_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested in Markdown cells.
- Feel free to add additional code cells or markdown cells for clarity or experimentation.

---