# Named Entity Recognition (NER) Exercises

## Overview of Named Entity Recognition (NER)
Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as the names of people, organizations, locations, dates, etc.

NER helps systems to understand and extract important information from text, which is useful in tasks like document summarization, question answering, and knowledge extraction.

### Common Entity Types:
- **PERSON**: Names of people, e.g., "Albert Einstein"
- **LOCATION**: Geopolitical entities, cities, countries, etc., e.g., "Paris"
- **ORGANIZATION**: Company names, agencies, etc., e.g., "Google"
- **DATE**: Specific dates or time expressions, e.g., "July 4, 1776"
- **MISC**: Other categories like events, laws, etc., e.g., "World War II"


## NER in NLTK

### 1. Importing Required Libraries
To start working with NER in NLTK, we need the `nltk` library and its pretrained models. 

First, make sure you have NLTK installed and download the necessary resources:

```bash
pip install nltk
```

In [1]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

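nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag
nltk.download('maxent_ne_chunker')           # chunker model used by ne_chunk
nltk.download('words')                       # word list used by the chunker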

[nltk_data] Downloading package punkt to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/joao-
[nltk_data]     correia/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

### 2. Tokenizing Text

Before we can apply NER, we need to tokenize the text into words. Implement a function that takes a text string as input and returns a list of words.

In [2]:
# add code here

### 3. Part-of-Speech Tagging

Next, we perform part-of-speech (POS) tagging to label each word with its grammatical category (noun, verb, adjective, and so on). NLTK's `ne_chunk` takes POS-tagged tokens as input, so this step is required before entity recognition.

Implement a function that takes a list of words as input and returns a list of tuples where each tuple contains a word and its POS tag.

In [3]:
# add code here

### 4. Named Entity Recognition

Now we apply NER, which identifies named entities in the tokenized and tagged text.

Implement a function that takes a list of word-POS tuples as input and returns a list of named entities found in the text.

In [4]:
# add code here

### 5. Extracting Named Entities from Text

Finally, we put all the steps together to extract named entities from a given text string.

Implement a function that takes a text string as input and returns a list of named entities found in the text.

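The output should be a list of tuples where each tuple contains the entity name and its category (e.g., 'PERSON', 'ORGANIZATION', 'GPE'). Note that NLTK's pretrained chunker uses its own label set; cities and countries, for example, are typically labeled 'GPE' rather than 'LOCATION'.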

**Hint:** The `ne_chunk` function returns a nested tree (`nltk.Tree`) that can be traversed to extract named entities. Entity chunks appear as subtrees, and calling `label()` on a subtree gives the entity type.

In [None]:
# add code here
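
The traversal described in the hint can be sketched on a hand-built tree, which avoids needing any downloaded models. The function name `entities_from_tree` and the example tree are illustrative assumptions; they only mimic the shape of what `ne_chunk` returns.

```python
from nltk.tree import Tree


def entities_from_tree(tree):
    """Collect (entity text, entity label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Entity chunks are Tree subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built tree mimicking ne_chunk output (labels chosen for illustration).
example = Tree("S", [
    Tree("PERSON", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Paris", "NNP")]),
    (".", "."),
])
print(entities_from_tree(example))
# [('Barack Obama', 'PERSON'), ('Paris', 'GPE')]
```

For real text, the same function could be applied to `ne_chunk(pos_tag(word_tokenize(text)))`, provided the NLTK resources from step 1 are available.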

### 6. Categorizing Named Entities

To make the output more readable, we can group the named entities by their categories.

Implement a function that takes text as input and returns a dictionary where the keys are entity categories and the values are lists of entities belonging to that category.

In [None]:
# add code here
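
The grouping step on its own can be sketched in plain Python. This example assumes the (entity, category) pairs have already been extracted, so unlike the exercise it takes a list of pairs rather than raw text; the name `group_by_category` and the sample pairs are illustrative.

```python
from collections import defaultdict


def group_by_category(entities):
    """Group (entity text, category) pairs into a dict of category -> entity list."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    # Convert to a regular dict so missing keys raise KeyError as usual.
    return dict(grouped)


# Sample pairs in the format produced by the previous step (made-up input).
pairs = [
    ("Barack Obama", "PERSON"),
    ("Paris", "GPE"),
    ("Google", "ORGANIZATION"),
    ("Berlin", "GPE"),
]
print(group_by_category(pairs))
# {'PERSON': ['Barack Obama'], 'GPE': ['Paris', 'Berlin'], 'ORGANIZATION': ['Google']}
```

To match the exercise signature, a wrapper could first extract the pairs from the text and then pass them to this function.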

### Visualizing Named Entities with spaCy

spaCy is an open-source software library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and offers a fast, efficient, and easy-to-use interface for text processing tasks. spaCy provides pre-trained models for various languages, making it highly effective for a wide range of NLP tasks, including:

- Tokenization
- Part-of-speech tagging
- Named Entity Recognition (NER)
- Dependency parsing
- Text classification
- Text similarity

One of the key features of spaCy is its high performance. It is optimized for speed, making it suitable for processing large volumes of text. spaCy also provides integration with other powerful libraries like TensorFlow, PyTorch, and Scikit-learn for more advanced NLP tasks and machine learning.


In [None]:
import spacy
from spacy import displacy

# download en_core_web_sm model
#!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

text = "Barack Obama gave a speech in New York on November 10, 2024."
doc = nlp(text)

# Visualize entities
displacy.render(doc, style="ent")

### Conclusion

NER is an essential tool for extracting structured information from unstructured text. By working through these exercises, you will gain a better understanding of how an NER pipeline is built from tokenization, POS tagging, and chunking in NLTK, and how to visualize entities with spaCy.