# Named Entity Recognition (NER) Exercises

## Overview of Named Entity Recognition (NER)
Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP): it locates named entities in text and classifies each one into a predefined category, such as person, organization, location, or date.

NER helps systems to understand and extract important information from text, which is useful in tasks like document summarization, question answering, and knowledge extraction.

### Common Entity Types:
- **PERSON**: Names of people, e.g., "Albert Einstein"
- **LOCATION**: Geopolitical entities, cities, countries, etc., e.g., "Paris"
- **ORGANIZATION**: Company names, agencies, etc., e.g., "Google"
- **DATE**: Specific dates or time expressions, e.g., "July 4, 1776"
- **MISC**: Other categories like events, laws, etc., e.g., "World War II"
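
To make "identify and classify" concrete, here is a minimal sketch that tags entities with a plain dictionary lookup. The `GAZETTEER` table and the `lookup_entities` helper are hypothetical names introduced only for this illustration. Real NER models use context rather than a fixed list, so treat this as a toy, not as how NLTK or spaCy work.

```python
# Hypothetical gazetteer: a tiny lookup table mapping known names to entity types.
GAZETTEER = {
    "Albert Einstein": "PERSON",
    "Paris": "LOCATION",
    "Google": "ORGANIZATION",
    "World War II": "MISC",
}


def lookup_entities(text):
    """Return (entity, type) pairs for gazetteer names found in text, in text order."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)  # only the first occurrence is considered
        if start != -1:
            found.append((start, name, label))
    return [(name, label) for start, name, label in sorted(found)]


sentence = "Albert Einstein visited Paris before World War II."
print(lookup_entities(sentence))
# [('Albert Einstein', 'PERSON'), ('Paris', 'LOCATION'), ('World War II', 'MISC')]
```

A lookup like this cannot handle names it has never seen or ambiguous words such as "Paris" the person, which is why the exercises below rely on a trained model.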


## NER in NLTK

### 1. Importing Required Libraries
To start working with NER in NLTK, we need the `nltk` library and its pretrained models. 

First, make sure you have NLTK installed and download the necessary resources:

```bash
pip install nltk
```

In [1]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

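# Resources for tokenizing, POS tagging, and NE chunking
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # required by pos_tag
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Note: recent NLTK releases (3.9 and later) may instead ask for 'punkt_tab', 'averaged_perceptron_tagger_eng', and 'maxent_ne_chunker_tab'.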

[nltk_data] Downloading package punkt to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### 2. Tokenizing Text

Before we can apply NER, we need to tokenize the text into words. Implement a function that takes a text string as input and returns a list of words.

In [2]:
def tokenize_text(text):
    return word_tokenize(text)
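
To show the shape of the tokenizer's output without needing the NLTK data files, here is a rough sketch using a regular expression. `simple_tokenize` is a hypothetical stand-in, not NLTK's algorithm: `word_tokenize` treats contractions, abbreviations, and quotes differently.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical stand-in for word_tokenize, for illustration only.
    """
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Bill Gates founded Microsoft in the United States."))
# ['Bill', 'Gates', 'founded', 'Microsoft', 'in', 'the', 'United', 'States', '.']
```

Note that multi-word names such as "United States" are split into separate tokens at this stage. Grouping them back together is the job of the chunking step later on.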

### 3. Part-of-Speech Tagging

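Next, we perform part-of-speech (POS) tagging to mark each word with its grammatical category. NLTK's `ne_chunk` takes POS-tagged tokens as input, and tags such as proper noun give the NER model a strong signal about which words may belong to an entity.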

Implement a function that takes a list of words as input and returns a list of tuples where each tuple contains a word and its POS tag.

In [3]:
def pos_tag_text(words):
    return pos_tag(words)
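
The sketch below illustrates why POS tags help: runs of proper nouns are good candidates for entity names. The tagged list is hand-written in the `(word, tag)` shape that `pos_tag` returns, using Penn Treebank tags. It is an assumption for illustration, not actual tagger output, and `proper_noun_runs` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs; illustrative only, not actual pos_tag output.
tagged = [
    ("Bill", "NNP"), ("Gates", "NNP"), ("founded", "VBD"),
    ("Microsoft", "NNP"), ("in", "IN"), ("the", "DT"),
    ("United", "NNP"), ("States", "NNPS"), (".", "."),
]


def proper_noun_runs(tagged_words):
    """Group consecutive proper-noun tokens (tags starting with NNP) into candidate names."""
    runs, current = [], []
    for word, tag in tagged_words:
        if tag.startswith("NNP"):
            current.append(word)
        elif current:
            # A non-proper-noun token ends the current run.
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


print(proper_noun_runs(tagged))
# ['Bill Gates', 'Microsoft', 'United States']
```

This heuristic finds candidate spans but cannot say whether a span is a person, an organization, or a place. Assigning that category is what the NER step adds.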

### 4. Named Entity Recognition

Now, we apply NER which identifies named entities in the tokenized and tagged text.

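Implement a function that takes a list of word-POS tuples as input and returns the chunk tree produced by `ne_chunk`, in which each named entity appears as a labelled subtree.

In [4]:
def extract_named_entities(pos_tagged_text):
    # Returns an nltk.Tree, not a list; section 5 redefines this name
    # with a version that flattens the tree into (entity, type) tuples.
    return ne_chunk(pos_tagged_text)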

### 5. Extracting Named Entities from Text

Finally, we put all the steps together to extract named entities from a given text string.

Implement a function that takes a text string as input and returns a list of named entities found in the text.

The output should be a list of tuples where each tuple contains the entity name and its category (e.g., 'PERSON', 'ORGANIZATION', 'GPE'). Note that NLTK's chunker labels geopolitical entities such as countries and cities as 'GPE'.

**Hint:** The `ne_chunk` function returns a nested tree (`nltk.Tree`) that can be traversed to extract named entities. Each entity is a subtree, and `subtree.label()` gives its entity type.

In [5]:
def extract_named_entities(text):
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    entities = ne_chunk(tags)
    named_entities = []
    
    for subtree in entities:
        if isinstance(subtree, nltk.Tree):
            entity = " ".join([word for word, tag in subtree])
            entity_type = subtree.label()
            named_entities.append((entity, entity_type))
    
    return named_entities

# Test the function
text = "Bill Gates founded Microsoft in the United States."
print(extract_named_entities(text))


[('Bill', 'PERSON'), ('Gates', 'PERSON'), ('Microsoft', 'ORGANIZATION'), ('United States', 'GPE')]
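
The output above lists 'Bill' and 'Gates' as two separate PERSON entities. One possible post-processing step, sketched below, joins chunks that are directly adjacent and share a label. The `Chunk` class is a hypothetical minimal stand-in for `nltk.Tree` so the example runs without NLTK, and the sample tree is hand-built to mirror the output above. The function only relies on `label()` and iteration, so it should also accept a real `ne_chunk` result, though that is an assumption rather than something tested here.

```python
class Chunk(list):
    """Hypothetical minimal stand-in for nltk.Tree: a list of children with a label()."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def merge_adjacent_entities(tree):
    """Collect (entity, label) pairs, joining adjacent chunks that share a label."""
    merged = []
    previous_was_chunk = False
    for node in tree:
        if hasattr(node, "label"):  # labelled subtree: a named-entity chunk
            text = " ".join(word for word, tag in node)
            label = node.label()
            if previous_was_chunk and merged[-1][1] == label:
                merged[-1] = (merged[-1][0] + " " + text, label)
            else:
                merged.append((text, label))
            previous_was_chunk = True
        else:  # plain (word, tag) tuple outside any entity
            previous_was_chunk = False
    return merged


# Hand-built tree mirroring the output above (illustrative, not produced by ne_chunk).
tree = Chunk("S", [
    Chunk("PERSON", [("Bill", "NNP")]),
    Chunk("PERSON", [("Gates", "NNP")]),
    ("founded", "VBD"),
    Chunk("ORGANIZATION", [("Microsoft", "NNP")]),
    ("in", "IN"), ("the", "DT"),
    Chunk("GPE", [("United", "NNP"), ("States", "NNPS")]),
    (".", "."),
])
print(merge_adjacent_entities(tree))
# [('Bill Gates', 'PERSON'), ('Microsoft', 'ORGANIZATION'), ('United States', 'GPE')]
```

This is a heuristic: it would also join two different people whose names happen to appear side by side, so it is worth checking the results on your own data.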


### 6. Categorizing Named Entities

To make the output more readable, we can group the named entities by their categories.

Implement a function that takes text as input and returns a dictionary where the keys are entity categories and the values are lists of entities belonging to that category.

In [6]:
def categorize_named_entities(text):
    tokens = word_tokenize(text)
    tags = pos_tag(tokens)
    entities = ne_chunk(tags)
    categorized_entities = {}
    
    for subtree in entities:
        if isinstance(subtree, nltk.Tree):
            entity = " ".join([word for word, tag in subtree])
            entity_type = subtree.label()
            if entity_type in categorized_entities:
                categorized_entities[entity_type].append(entity)
            else:
                categorized_entities[entity_type] = [entity]
    
    return categorized_entities

# Test the function
text = "Bill Gates founded Microsoft in the United States."
print(categorize_named_entities(text))

{'PERSON': ['Bill', 'Gates'], 'ORGANIZATION': ['Microsoft'], 'GPE': ['United States']}
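
As an alternative sketch, the grouping step can be separated from the NLTK pipeline and written with `collections.defaultdict`, which removes the explicit key check. `group_by_category` is a hypothetical helper name. The input pairs below are copied from the output of the section 5 test.

```python
from collections import defaultdict


def group_by_category(named_entities):
    """Group (entity, label) pairs into a dict mapping label -> list of entities."""
    grouped = defaultdict(list)
    for entity, label in named_entities:
        grouped[label].append(entity)
    return dict(grouped)  # convert back to a plain dict for clean printing


pairs = [
    ("Bill", "PERSON"), ("Gates", "PERSON"),
    ("Microsoft", "ORGANIZATION"), ("United States", "GPE"),
]
print(group_by_category(pairs))
# {'PERSON': ['Bill', 'Gates'], 'ORGANIZATION': ['Microsoft'], 'GPE': ['United States']}
```

Keeping the grouping separate means it can be reused on the output of any extractor that yields `(entity, label)` pairs.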


### Visualizing Named Entities with spaCy

In [9]:
import spacy
from spacy import displacy

# Download the en_core_web_sm model once (uncomment on first run):
#!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

text = "Barack Obama gave a speech in New York on November 10, 2024."
doc = nlp(text)

# Visualize entities
displacy.render(doc, style="ent")
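
To give a rough idea of what an entity visualizer does, the sketch below wraps labelled character spans in `<mark>` tags using only the standard library. It is not displaCy's implementation. `highlight_entities` is a hypothetical helper, and the labels are assigned by hand rather than predicted by a model. With spaCy, comparable spans could be read from `doc.ents` via each entity's `start_char`, `end_char`, and `label_` attributes.

```python
from html import escape


def highlight_entities(text, spans):
    """Wrap each (start, end, label) character span in a <mark> tag.

    Spans are assumed not to overlap.
    """
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        parts.append(escape(text[cursor:start]))  # text before the entity
        parts.append(f"<mark>{escape(text[start:end])} <b>{escape(label)}</b></mark>")
        cursor = end
    parts.append(escape(text[cursor:]))  # remaining text after the last entity
    return "".join(parts)


text = "Barack Obama gave a speech in New York."
# Hand-assigned labels for illustration, not model output.
spans = []
for name, label in [("Barack Obama", "PERSON"), ("New York", "GPE")]:
    start = text.find(name)
    spans.append((start, start + len(name), label))

print(highlight_entities(text, spans))
# <mark>Barack Obama <b>PERSON</b></mark> gave a speech in <mark>New York <b>GPE</b></mark>.
```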

### Conclusion

NER is an essential tool for extracting structured information from unstructured text. By working through these exercises, you will gain a better understanding of how to build NER pipelines in Python with NLTK and how to visualize entities with spaCy.