## **Introduction**

### **Named Entity Recognition (NER)**
Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP): it locates the spans of text that name real-world entities, such as people, organizations, and locations, and assigns each span a category.

Here’s a step-by-step breakdown of one way to approach NER:

### **Tokenization and POS Tagging:**

- Tokenization: Splits the text into individual tokens (words and punctuation marks). It is the first step in preparing text for further analysis.

- POS Tagging: Part-of-Speech (POS) tagging assigns a grammatical category (tag) to each token, such as noun, verb, or adjective. These tags matter for NER because named entities often follow recognizable tag patterns; proper nouns, for example, are tagged `NNP`.

In [1]:
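import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# First run only: download the tokenizer and tagger data.
# Resource names depend on the NLTK version; recent releases use
# 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')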

# Sample text for demonstration
text = "Apple is planning to build a new store in London."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
pos_tags = pos_tag(tokens)

print("Tokenized Text:", tokens)
print("POS Tags:", pos_tags)


Tokenized Text: ['Apple', 'is', 'planning', 'to', 'build', 'a', 'new', 'store', 'in', 'London', '.']
POS Tags: [('Apple', 'NNP'), ('is', 'VBZ'), ('planning', 'VBG'), ('to', 'TO'), ('build', 'VB'), ('a', 'DT'), ('new', 'JJ'), ('store', 'NN'), ('in', 'IN'), ('London', 'NNP'), ('.', '.')]


### **Chunking:**

Chunking groups adjacent tokens into larger units (chunks), typically by matching patterns over their POS tags. In NER, chunking helps identify sequences of tokens that together form a single named entity.

In [2]:
from nltk.chunk import RegexpParser

# Define grammar for chunking
grammar = r"""
    NE: {<NNP>+}    # Chunk sequences of proper nouns
"""

# Create chunk parser with defined grammar
chunk_parser = RegexpParser(grammar)

# Apply chunking to POS tagged tokens
chunked_result = chunk_parser.parse(pos_tags)

print("Chunked Result:")
print(chunked_result)


Chunked Result:
(S
  (NE Apple/NNP)
  is/VBZ
  planning/VBG
  to/TO
  build/VB
  a/DT
  new/JJ
  store/NN
  in/IN
  (NE London/NNP)
  ./.)


- Explanation of the Grammar (regex pattern):

    `NE: {<NNP>+}`: This rule groups every run of one or more (`+`) consecutive proper nouns (`<NNP>`) into a single chunk labeled `NE`.

- Creating a Chunk Parser:

    `RegexpParser(grammar)`: This initializes a chunk parser using the grammar defined by the regex pattern.

- Applying Chunking:

    `chunk_parser.parse(pos_tags)`: This applies the rules held by `chunk_parser` to the POS-tagged tokens (`pos_tags`) obtained from the text and returns a tree in which each match appears as an `NE` subtree.
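To make the effect of the `{<NNP>+}` rule concrete, the sketch below reproduces the same grouping in plain Python, without NLTK. It is an illustration only, not how `RegexpParser` works internally. The helper name `proper_noun_runs` is hypothetical, and the tagged tokens are copied by hand from the tagging output above.

```python
from itertools import groupby

# POS-tagged tokens copied from the tagging output above
pos_tags = [
    ("Apple", "NNP"), ("is", "VBZ"), ("planning", "VBG"), ("to", "TO"),
    ("build", "VB"), ("a", "DT"), ("new", "JJ"), ("store", "NN"),
    ("in", "IN"), ("London", "NNP"), (".", "."),
]

def proper_noun_runs(tagged):
    """Return each maximal run of consecutive NNP tokens as one string."""
    runs = []
    # groupby splits the sequence wherever the "is this NNP?" answer changes
    for is_nnp, group in groupby(tagged, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            runs.append(" ".join(word for word, _ in group))
    return runs

candidates = proper_noun_runs(pos_tags)
print(candidates)  # ['Apple', 'London']
```

Note that this kind of chunking only proposes candidate spans. It does not say whether `Apple` is an organization or `London` a place, which is the part a trained NER model adds.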

## **Named Entity Recognition with NLTK or spaCy:**

- NLTK: Provides tools for tokenization, POS tagging, and basic chunking. It is versatile and well suited to learning and prototyping.

- spaCy: A more modern NLP library with efficient tokenization, POS tagging, and built-in NER. It ships pre-trained models for many languages.

In [3]:
!pip install spacy
!python -m spacy download en_core_web_sm


In [6]:
import spacy

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

# Process the text
doc = nlp(text)

# Print named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
London GPE


In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample text for demonstration
text = "In the bustling city of New York, Apple Inc. is set to unveil its latest iPhone model at a highly anticipated event. The European Union (EU) is facing economic challenges amid Brexit negotiations, while NATO continues to strengthen its defense initiatives in response to global security threats. The Amazon rainforest, a vital ecosystem in South America, is under threat from deforestation, prompting environmental activists to call for urgent conservation efforts. Meanwhile, Harvard University remains a beacon of academic excellence, attracting students and researchers from around the world to its historic campus in Cambridge."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
pos_tags = pos_tag(tokens)

print("Tokenized Text:", tokens)
print("POS Tags:", pos_tags)


Tokenized Text: ['In', 'the', 'bustling', 'city', 'of', 'New', 'York', ',', 'Apple', 'Inc.', 'is', 'set', 'to', 'unveil', 'its', 'latest', 'iPhone', 'model', 'at', 'a', 'highly', 'anticipated', 'event', '.', 'The', 'European', 'Union', '(', 'EU', ')', 'is', 'facing', 'economic', 'challenges', 'amid', 'Brexit', 'negotiations', ',', 'while', 'NATO', 'continues', 'to', 'strengthen', 'its', 'defense', 'initiatives', 'in', 'response', 'to', 'global', 'security', 'threats', '.', 'The', 'Amazon', 'rainforest', ',', 'a', 'vital', 'ecosystem', 'in', 'South', 'America', ',', 'is', 'under', 'threat', 'from', 'deforestation', ',', 'prompting', 'environmental', 'activists', 'to', 'call', 'for', 'urgent', 'conservation', 'efforts', '.', 'Meanwhile', ',', 'Harvard', 'University', 'remains', 'a', 'beacon', 'of', 'academic', 'excellence', ',', 'attracting', 'students', 'and', 'researchers', 'from', 'around', 'the', 'world', 'to', 'its', 'historic', 'campus', 'in', 'Cambridge', '.']
POS Tags: [('In', 'I

In [8]:
# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

# Process the text
doc = nlp(text)

# Print named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

New York GPE
Apple Inc. ORG
The European Union ORG
EU ORG
Brexit PERSON
NATO ORG
Amazon ORG
South America LOC
Harvard University ORG
Cambridge GPE
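
The results are mostly sensible, but two labels look wrong for this text: `Brexit` is tagged `PERSON`, and `Amazon` is tagged `ORG` even though the sentence is about the rainforest. Grouping the entities by label makes such outliers easier to spot. The sketch below is a minimal example that assumes the `(text, label)` pairs shown above. They are copied by hand here so the snippet runs without spaCy, and the helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above; in the notebook
# they could be built as [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("New York", "GPE"),
    ("Apple Inc.", "ORG"),
    ("The European Union", "ORG"),
    ("EU", "ORG"),
    ("Brexit", "PERSON"),
    ("NATO", "ORG"),
    ("Amazon", "ORG"),
    ("South America", "LOC"),
    ("Harvard University", "ORG"),
    ("Cambridge", "GPE"),
]

def group_by_label(pairs):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
for label, texts in by_label.items():
    print(f"{label}: {', '.join(texts)}")
```

With the entities grouped this way, a lone `PERSON` entry that is not a person stands out immediately, which is a cheap sanity check before relying on a small pre-trained model.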
