## Using spaCy for NLP Tasks

This notebook demonstrates how to install and use spaCy to perform various Natural Language Processing (NLP) tasks.

In today’s lesson we will:

- Install spaCy and download its statistical models.
- Read and process a text file.
- Perform Named Entity Recognition (NER) to extract entities from text.
- Visualize entity counts.
- Explore and customize the spaCy pipeline (including using the EntityRuler).

### 1. Installation
To get started, you must install spaCy and the English language model.

**Instructions:**

1. Use `pip install spacy` to install the core library.
2. Download the small English pipeline (`en_core_web_sm`), which bundles the trained components for English (tagger, parser, named entity recognizer, and others).


In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

### 2. Loading the spaCy Model

Once downloaded, a model can be loaded in Python with `spacy.load('model_name')`. If the following code runs without errors, the model was downloaded successfully:

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')
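
If the model is missing, `spacy.load` raises an `OSError`. The sketch below is one possible way to handle that case. The helper name `load_pipeline` and the variable `nlp_checked` are made up for this illustration, and falling back to a blank English pipeline is an assumption, not part of the lesson.

```python
import spacy

def load_pipeline(name="en_core_web_sm"):
    """Return the named pipeline, or a blank English pipeline if it is not installed."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found
        print(f"Model '{name}' not found. Run: python -m spacy download {name}")
        return spacy.blank("en")  # assumption: a tokenizer-only fallback is acceptable

nlp_checked = load_pipeline()
print(nlp_checked.pipe_names)
```

A blank pipeline only tokenizes, so entity recognition in the later sections still needs the real model.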

### 3. Reading a Text File
Here, we read in a text file that contains a chapter from *The Fellowship of the Ring*. Make sure the file is in your working directory (or provide the full path).

**Key Steps:**
- Open the file using Python’s built-in `open()` function.
- Read the content into a variable.
- Adjust `nlp.max_length` to avoid errors when processing long texts.


In [7]:
# Define the file path (adjust this path if your file is stored elsewhere)
lotr_script = 'The Fellowship of the Ring_Ch1.txt'

# Read the file content
with open(lotr_script, 'r', encoding='utf-8') as f:
    text = f.read()

# Adjust the maximum allowed length for the NLP model to process the full text
nlp.max_length = len(text)

# Process the text with the spaCy model
doc = nlp(text)
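
spaCy's default `max_length` is 1,000,000 characters, so the limit only needs raising for texts longer than that. The sketch below shows a guarded version of the adjustment. It uses a blank pipeline and a short made-up string (`nlp_demo`, `sample_text`) so it runs without the chapter file.

```python
import spacy

nlp_demo = spacy.blank("en")  # tokenizer-only pipeline, no model download needed
sample_text = "Bilbo Baggins lived at Bag End. " * 3  # stand-in for the chapter text

# Raise the limit only when the text is longer than the current limit
if len(sample_text) > nlp_demo.max_length:
    nlp_demo.max_length = len(sample_text)

doc_demo = nlp_demo(sample_text)
print(len(sample_text), nlp_demo.max_length, len(doc_demo))
```

Guarding the assignment avoids lowering the limit when the text is shorter than the default.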

### 4. Named Entity Recognition (NER)

Named Entity Recognition identifies and classifies entities (like names of people, places, or organizations) in text.

**What we'll do:**
- Extract entities from the processed document.
- Create a Pandas DataFrame that shows the entity text and its corresponding label.

In [11]:
import pandas as pd

# Create a list to collect entity data
entities_data = []

# Extract each entity and its label from the document
for ent in doc.ents:
    entities_data.append({
        "text" : ent.text,
        "label" : ent.label_
    })

# Convert the list into a DataFrame for easier viewing
ent_df = pd.DataFrame(entities_data)
# Display the entities
ent_df

index,text,label
0,Chapter 1,LAW
1,Party,ORG
2,Bilbo Baggins,PERSON
3,Bag End,ORG
4,first,ORDINAL
...,...,...
520,Shire,PERSON
521,Gandalf,NORP
522,Frodo,ORG
523,Frodo,ORG


### 5. Analyzing Entity Data
Let's examine:
- **Text and Label Frequency:**  
Display the most common entity texts and their labels.
- **Entity Details:**  
Use spaCy's built-in explanation function to understand what a specific label (e.g., "GPE") means.

In [14]:
# Display the top 15 most common texts and labels
# (print is needed because a notebook cell only shows its last expression)
print(ent_df['text'].value_counts().head(15))
print(ent_df['label'].value_counts().head(15))
# Explain a specific entity label
spacy.explain("GPE")

'Countries, cities, states'
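
`spacy.explain` can also be called in a loop to look up several labels at once. The label list below is chosen for illustration. `WIZARD` is a custom label that is not in spaCy's glossary, so the lookup returns `None` for it (spaCy may also emit a warning).

```python
import spacy

# Look up the built-in description for each label
labels_to_check = ["PERSON", "ORG", "GPE", "FAC", "WIZARD"]
explanations = {label: spacy.explain(label) for label in labels_to_check}

for label, meaning in explanations.items():
    print(f"{label}: {meaning}")
```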

In [None]:
# Display combinations of text and label counts
ent_df[['text', 'label']].value_counts()[:20]


In [None]:
# List unique labels in the dataset
ent_df['label'].unique()


### 6. Exploring the NLP Pipeline
[NLP Pipeline Documentation](https://spacy.io/usage/processing-pipelines#processing)

You can inspect the components of the NLP pipeline with `nlp.pipeline`, a list of `(name, component)` pairs.

In [None]:
# spaCy's language model pipeline
nlp.pipeline
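
Because `nlp.pipeline` is a list of `(name, component)` tuples, it can be iterated directly. The sketch below uses a blank pipeline with a single `sentencizer` component (an arbitrary choice for the demo) so it runs without the downloaded model.

```python
import spacy

nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")  # a rule-based sentence splitter, used here only as an example

# Each entry is a (name, component) pair, in processing order
for name, component in nlp_demo.pipeline:
    print(name, "->", type(component).__name__)

print(nlp_demo.pipe_names)
```

`nlp.pipe_names` gives just the names, which is handy for membership checks such as `"ner" in nlp.pipe_names`.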


### 7. Visualizing Entities with DisplaCy
DisplaCy is spaCy’s visualization tool for rendering entities and dependencies in Jupyter notebooks.


In [21]:
from spacy import displacy

# Render entities in the processed document using DisplaCy.
# No IPython imports are needed here; importing `display` from
# `IPython.core.display` raises ImportError on recent IPython versions.
displacy.render(doc, style="ent", jupyter=True)
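
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below sets two entity spans by hand on a blank pipeline so it runs without the trained model. The sentence, the spans, and their labels are made up for the demo.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_demo = spacy.blank("en")
doc_demo = nlp_demo("Gandalf visited Bag End")

# Assign entities manually: token 0 is "Gandalf", tokens 2-3 are "Bag End"
doc_demo.ents = [
    Span(doc_demo, 0, 1, label="PERSON"),
    Span(doc_demo, 2, 4, label="FAC"),
]

# jupyter=False returns the HTML as a string; page=True wraps it in a full page
html = displacy.render(doc_demo, style="ent", jupyter=False, page=True)
print(html[:200])
```

The returned string can be written to an `.html` file and opened in a browser.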

### 8. Identifying Issues in the Named Entity Recognizer

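Use the [Lord of the Rings Wiki](https://lotr.fandom.com/wiki/Main_Page) if you need help.

Example issues from the output in section 4: `Shire` is labeled `PERSON`, `Gandalf` is labeled `NORP`, `Frodo` and `Bag End` are labeled `ORG`, and `Chapter 1` is labeled `LAW`.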

### 9. Creating Custom Entity Recognizers with the EntityRuler

Custom patterns can be added to the pipeline using spaCy's `EntityRuler`. This allows us to capture entities that might be missed by the statistical model.

**Steps:**
- Define custom entity patterns (as a list of dictionaries).
- Check if the "ner" component exists and add the EntityRuler accordingly.

[EntityRuler Documentation](https://spacy.io/api/entityruler#add_patterns)
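
The EntityRuler accepts two kinds of patterns: a plain string for an exact phrase match, and a list of token dictionaries for attribute-based matching. The sketch below shows both on a blank pipeline. The sentence and the `nlp_demo` name are made up for the demo.

```python
import spacy

nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")

ruler.add_patterns([
    # Phrase pattern: matches the exact text "Bag End"
    {"label": "GPE", "pattern": "Bag End"},
    # Token pattern: matches any single token whose lowercase form is "ring"
    {"label": "MAGIC_OBJECT", "pattern": [{"LOWER": "ring"}]},
])

doc_demo = nlp_demo("Bilbo left the Ring at Bag End.")
found = [(ent.text, ent.label_) for ent in doc_demo.ents]
print(found)
```

The token pattern matches "Ring", "ring", and "RING" alike, while the phrase pattern is case-sensitive by default.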

In [None]:
# Define custom entity patterns for names, locations, and other entities
entity_patterns = [
    {"label": "GPE", "pattern": "Shire"},
    {"label": "GPE", "pattern": "Bag End"},
    {"label": "WIZARD", "pattern": "Gandalf"},  # strictly speaking, Gandalf is one of the Maiar
    {"label": "HOBBIT", "pattern": "Bilbo"},
    {"label": "MAGIC_OBJECT", "pattern": [{"LOWER": "ring"}]},
]

# First run only: add the EntityRuler to the pipeline before "ner".
# Running add_pipe a second time raises the ValueError [E007] shown below.
# ruler = nlp.add_pipe("entity_ruler", before="ner")
# ruler.add_patterns(entity_patterns)

# On later runs: access the existing entity_ruler and add the custom patterns.
# ruler = nlp.get_pipe("entity_ruler")
# ruler.add_patterns(entity_patterns)

ValueError: [E007] 'entity_ruler' already exists in pipeline. Existing names: ['tok2vec', 'tagger', 'parser', 'senter', 'attribute_ruler', 'lemmatizer', 'entity_ruler', 'ner']

In [None]:
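# Check if the "ner" pipe exists. If it does, add the EntityRuler before it.
if "entity_ruler" in nlp.pipe_names:
    # If entity_ruler already exists, simply add patterns to it.
    ruler = nlp.get_pipe("entity_ruler")
elif "ner" in nlp.pipe_names:
    ruler = nlp.add_pipe("entity_ruler", before="ner")
else:
    # If the NER component does not exist, add both the EntityRuler and the NER component.
    ruler = nlp.add_pipe("entity_ruler")
    nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"))  # assumption: copy the trained ner
ruler.add_patterns(entity_patterns)

# Check updated pipeline labels
print(nlp.pipe_names, ruler.labels)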


### 10. Testing the Custom Entity Ruler
Run a sample sentence through the updated pipeline to check if your custom patterns are recognized.


In [None]:
from spacy import displacy

doc_2 = nlp("Gandalf went to Bilbo's house for his birthday because he was my precious which is the Ring.")
displacy.render(doc_2, style="ent", jupyter=True)

### 11. Re-Processing the Full Text
Now that we have updated the pipeline with our custom patterns, re-run the full text to see how the recognizer performs.


In [None]:
# Re-process the text with the updated pipeline
nlp.max_length = len(text)
doc = nlp(text)

displacy.render(doc, style="ent", jupyter=True)

### 12. Re-Analyzing Entity Data
Let's create the DataFrame again from the updated document to see whether our custom patterns improved entity extraction.

In [None]:
# Collect entities from the updated document
entities_data = []
for ent in doc.ents:
    entities_data.append({
        'text': ent.text,
        'label': ent.label_
    })

# Convert to DataFrame and display
ent_df = pd.DataFrame(entities_data)
ent_df

In [None]:
# Show value counts for text and label combinations after the update
print("Updated Text and Label Combinations (Top 20):")
print(ent_df[['text', 'label']].value_counts()[:20])
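
Calling `value_counts()` on two columns returns a Series indexed by `(text, label)` pairs, sorted from most to least frequent. The sketch below shows how that result can be turned back into a flat table. `sample_df` holds made-up rows for illustration, not the real entity data.

```python
import pandas as pd

# Hypothetical entity rows for the demo
sample_df = pd.DataFrame({
    "text": ["Frodo", "Frodo", "Bilbo", "Shire", "Frodo", "Bilbo"],
    "label": ["HOBBIT", "HOBBIT", "HOBBIT", "GPE", "HOBBIT", "HOBBIT"],
})

# Count each (text, label) pair, keep the two most common, and flatten the index
top = sample_df[["text", "label"]].value_counts().head(2).reset_index(name="counts")
print(top)
```

`reset_index(name="counts")` gives the count column an explicit name, which plotting functions can then reference.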

### 13. Visualizing Entity Data
Finally, we create a treemap to visualize the top 10 most common text and label combinations.

**Steps:**
- Use Pandas to compute counts.
- Plot the counts using Plotly Express.

**Tree Map**
- The `color` parameter is set to `label` so that each entity label gets a distinct color.

- The `path` parameter defines a hierarchy where entities are grouped by their `label` first, then by `text`.

- This interactive treemap allows students to easily see how different entity labels contribute to the overall counts.

In [None]:
import plotly.express as px

# Prepare the data for visualization: top 10 entity combinations
top_10_ents = ent_df[['text', 'label']].value_counts().head(10).reset_index(name='counts')

# Create a treemap using a hierarchical structure (first by label, then by text)
fig = px.treemap(top_10_ents,
                path=['label', 'text'],
                values='counts',
                title='Top 10 Text and Label Combinations (Treemap)',
                color='label')

fig.show()