# Named Entity Recognition (NER)

Named Entity Recognition is a natural language processing task that identifies and classifies named entities in text into predefined categories such as:

- **PERSON**: Names of people
- **ORGANIZATION**: Companies, agencies, institutions
- **LOCATION**: Non-GPE locations such as mountain ranges and bodies of water
- **DATE**: Absolute or relative dates or periods
- **TIME**: Times smaller than a day
- **MONEY**: Monetary values
- **PERCENT**: Percentage values
- **FACILITY**: Buildings, airports, highways, bridges
- **GPE**: Geopolitical entities (countries, cities, states)
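Whatever library produces them, recognized entities are usually handled as labeled spans of the input text. The following is a minimal sketch of such a record, not the representation used by NLTK or spaCy: the `EntitySpan` class and `make_span` helper are hypothetical names introduced for illustration, and the labels are assigned by hand rather than predicted by a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """Hypothetical container for one recognized entity."""
    text: str    # surface form as it appears in the input
    label: str   # category such as PERSON or DATE
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity


def make_span(source: str, surface: str, label: str) -> EntitySpan:
    """Wrap the first occurrence of `surface` in `source` as a labeled span."""
    start = source.index(surface)
    return EntitySpan(surface, label, start, start + len(surface))


sentence = "Tim Cook took over from Steve Jobs in August 2011."

# Labels are hand-assigned here; a real NER system would predict them.
spans = [
    make_span(sentence, "Tim Cook", "PERSON"),
    make_span(sentence, "Steve Jobs", "PERSON"),
    make_span(sentence, "August 2011", "DATE"),
]

for span in spans:
    print(f"{span.text:12} {span.label:7} [{span.start}:{span.end}]")
```

Keeping character offsets alongside the label makes it possible to highlight entities in the original text or to detect when two extractors report overlapping spans.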

## Why is NER Important?
- Information extraction from unstructured text
- Building knowledge graphs
- Question answering systems
- Content recommendation
- Social media analysis

In [1]:
# Import required libraries
import nltk
import spacy
from nltk import ne_chunk, pos_tag, word_tokenize

# Download required NLTK data
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

In [2]:
# Sample text for Named Entity Recognition
text = """
Apple Inc. is an American multinational technology company headquartered in Cupertino, California. 
The company was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne on April 1, 1976. 
Tim Cook is the current CEO who took over from Steve Jobs in August 2011. 
Apple's headquarters, known as Apple Park, opened in April 2017 and cost approximately $5 billion to build.
The company's market capitalization reached $3 trillion in January 2022, making it the most valuable company in the world.
"""

print("Sample Text:")
print(text)

Sample Text:

Apple Inc. is an American multinational technology company headquartered in Cupertino, California. 
The company was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne on April 1, 1976. 
Tim Cook is the current CEO who took over from Steve Jobs in August 2011. 
Apple's headquarters, known as Apple Park, opened in April 2017 and cost approximately $5 billion to build.
The company's market capitalization reached $3 trillion in January 2022, making it the most valuable company in the world.



## Method 1: Using NLTK for Named Entity Recognition

NLTK provides a pre-trained named entity chunker through the `ne_chunk()` function, which takes POS-tagged tokens as input. The pipeline below:
1. Tokenizes the text into words
2. Performs Part-of-Speech tagging
3. Identifies named entities using a pre-trained model

In [3]:
# Newer NLTK releases may request this resource name instead of 'maxent_ne_chunker'
# nltk.download('maxent_ne_chunker_tab')

In [4]:
# NLTK Named Entity Recognition
def nltk_ner(text):
    # Step 1: Tokenize the text
    tokens = word_tokenize(text)
    
    # Step 2: Perform Part-of-Speech tagging
    pos_tags = pos_tag(tokens)
    
    # Step 3: Named Entity Recognition
    entities = ne_chunk(pos_tags)
    
    return entities

# Apply NLTK NER to our sample text
nltk_entities = nltk_ner(text)

print("NLTK Named Entity Recognition Results:")
print("="*50)

# Extract and display named entities
named_entities = []
for entity in nltk_entities:
    if hasattr(entity, 'label'):  # It's a named entity
        entity_name = ' '.join([child[0] for child in entity])
        entity_label = entity.label()
        named_entities.append((entity_name, entity_label))
        print(f"Entity: {entity_name:20} | Type: {entity_label}")

print(f"\nTotal named entities found: {len(named_entities)}")

NLTK Named Entity Recognition Results:
Entity: Apple                | Type: PERSON
Entity: Inc.                 | Type: ORGANIZATION
Entity: American             | Type: GPE
Entity: Cupertino            | Type: GPE
Entity: California           | Type: GPE
Entity: Steve Jobs           | Type: PERSON
Entity: Steve Wozniak        | Type: PERSON
Entity: Ronald Wayne         | Type: PERSON
Entity: Tim Cook             | Type: PERSON
Entity: Steve Jobs           | Type: PERSON
Entity: Apple                | Type: PERSON
Entity: Apple Park           | Type: PERSON

Total named entities found: 12
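The raw list above counts repeated mentions separately and includes some questionable labels (`Apple` and `Apple Park` tagged as PERSON, `Apple Inc.` split in two). A quick tally by label makes such patterns easier to spot. The sketch below uses only the standard library; the `nltk_results` list is copied by hand from the output above rather than recomputed.

```python
from collections import Counter

# (entity, label) pairs copied from the NLTK output above
nltk_results = [
    ("Apple", "PERSON"), ("Inc.", "ORGANIZATION"), ("American", "GPE"),
    ("Cupertino", "GPE"), ("California", "GPE"), ("Steve Jobs", "PERSON"),
    ("Steve Wozniak", "PERSON"), ("Ronald Wayne", "PERSON"),
    ("Tim Cook", "PERSON"), ("Steve Jobs", "PERSON"),
    ("Apple", "PERSON"), ("Apple Park", "PERSON"),
]

# Count how many mentions received each label
label_counts = Counter(label for _, label in nltk_results)

# Collapse repeated mentions such as the two "Steve Jobs" entries
unique_entities = sorted(set(nltk_results))

for label, count in label_counts.most_common():
    print(f"{label:14} {count}")
print(f"Unique (entity, label) pairs: {len(unique_entities)}")
```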


## Method 2: Using spaCy for Named Entity Recognition

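spaCy is an NLP library designed for production use. Compared with NLTK's chunker, its pre-trained English pipelines offer:
- More entity types (18, including DATE, MONEY, and NORP)
- Better accuracy on this sample (for example, `Apple Inc.` is tagged as a single ORG)
- Built-in visualization tools
- Industrial-strength performance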

**Note**: You need to install spaCy and download the English model:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```

In [5]:
# spaCy Named Entity Recognition
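try:
    # Import inside the try block so a missing installation reaches the ImportError handler
    import spacy

    # Load the English language model
    nlp = spacy.load("en_core_web_sm")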
    
    # Process the text
    doc = nlp(text)
    
    print("spaCy Named Entity Recognition Results:")
    print("="*60)
    
    # Extract and display named entities
    spacy_entities = []
    for ent in doc.ents:
        spacy_entities.append((ent.text, ent.label_))
        print(f"Entity: {ent.text:20} | Type: {ent.label_:12} | Description: {spacy.explain(ent.label_)}")
    
    print(f"\nTotal named entities found: {len(spacy_entities)}")
    
except OSError:
    print("spaCy English model not found!")
    print("Please install it using: python -m spacy download en_core_web_sm")
except ImportError:
    print("spaCy not installed!")
    print("Please install it using: pip install spacy")

spaCy Named Entity Recognition Results:
Entity: Apple Inc.           | Type: ORG          | Description: Companies, agencies, institutions, etc.
Entity: American             | Type: NORP         | Description: Nationalities or religious or political groups
Entity: Cupertino            | Type: GPE          | Description: Countries, cities, states
Entity: California           | Type: GPE          | Description: Countries, cities, states
Entity: Steve Jobs           | Type: PERSON       | Description: People, including fictional
Entity: Steve Wozniak        | Type: PERSON       | Description: People, including fictional
Entity: Ronald Wayne         | Type: PERSON       | Description: People, including fictional
Entity: April 1, 1976        | Type: DATE         | Description: Absolute or relative dates or periods
Entity: Tim Cook             | Type: PERSON       | Description: People, including fictional
Entity: Steve Jobs           | Type: PERSON       | Description: People, including fictional

## Method 3: Custom Entity Extraction

Entities with a predictable surface form, such as dates, monetary amounts, email addresses, and phone numbers, can also be extracted with regular expressions instead of a trained model:

In [6]:
import re

def extract_custom_entities(text):
    """
    Extract specific types of entities using regular expressions
    """
    entities = {
        'dates': [],
        'money': [],
        'percentages': [],
        'emails': [],
        'phone_numbers': []
    }
    
    # Extract dates (various formats)
    date_patterns = [
        r'\b\d{1,2}/\d{1,2}/\d{4}\b',  # MM/DD/YYYY
        r'\b\d{4}-\d{1,2}-\d{1,2}\b',  # YYYY-MM-DD
        r'\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}\b',  # Month DD, YYYY
        r'\b(?:Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},?\s+\d{4}\b'  # Abbreviated months ("May" is covered by the full-name pattern; listing it twice would duplicate matches)
    ]
    
    for pattern in date_patterns:
        entities['dates'].extend(re.findall(pattern, text, re.IGNORECASE))
    
    # Extract money amounts
    money_pattern = r'\$\d+(?:,\d{3})*(?:\.\d{2})?(?:\s+(?:billion|million|thousand|trillion))?'
    entities['money'].extend(re.findall(money_pattern, text, re.IGNORECASE))
    
    # Extract percentages
    percentage_pattern = r'\d+(?:\.\d+)?%'
    entities['percentages'].extend(re.findall(percentage_pattern, text))
    
    # Extract email addresses
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    entities['emails'].extend(re.findall(email_pattern, text))
    
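    # Extract phone numbers (US format)
    # Note: the leading \b cannot match directly before "(", so a parenthesized
    # area code such as "(555)" is captured without its opening parenthesis
    phone_pattern = r'\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b'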
    entities['phone_numbers'].extend(re.findall(phone_pattern, text))
    
    return entities

# Apply custom entity extraction
custom_entities = extract_custom_entities(text)

print("Custom Entity Extraction Results:")
print("="*40)

for entity_type, entity_list in custom_entities.items():
    if entity_list:
        print(f"\n{entity_type.upper()}:")
        for entity in entity_list:
            print(f"  - {entity}")
    else:
        print(f"\n{entity_type.upper()}: None found")

Custom Entity Extraction Results:

DATES:
  - April 1, 1976

MONEY:
  - $5 billion
  - $3 trillion

PERCENTAGES: None found

EMAILS: None found

PHONE_NUMBERS: None found
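Only `April 1, 1976` appears under DATES, although the sample text also mentions `August 2011`, `April 2017`, and `January 2022`. The date patterns all require a day number, so month-and-year dates are skipped. One possible adjustment, shown here as a sketch rather than a complete date parser, is to make the day optional. The names `MONTHS`, `month_date_pattern`, and `sample` are introduced for this example only.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# The day (with an optional trailing comma) is optional, so both
# "April 1, 1976" and "August 2011" can match.
month_date_pattern = re.compile(
    rf"\b(?:{MONTHS})\s+(?:\d{{1,2}},?\s+)?\d{{4}}\b"
)

sample = (
    "The company was founded on April 1, 1976. "
    "Tim Cook took over in August 2011. "
    "Apple Park opened in April 2017."
)

print(month_date_pattern.findall(sample))
```

A looser pattern also raises the chance of false positives, for example a month name followed by an unrelated four-digit number, so it is worth checking against real data before adopting it.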


## Entity Types Explanation

### NLTK Entity Types:
- **PERSON**: Names of people
- **ORGANIZATION**: Companies, agencies, institutions  
- **GPE**: Geopolitical entities (countries, cities, states)
- **LOCATION**: Non-GPE locations, mountain ranges, bodies of water
- **FACILITY**: Buildings, airports, highways, bridges, etc.
- **GSP**: Geo-socio-political groups

### spaCy Entity Types (More Comprehensive):
- **PERSON**: People, including fictional
- **NORP**: Nationalities or religious or political groups
- **FAC**: Buildings, airports, highways, bridges, etc.
- **ORG**: Companies, agencies, institutions, etc.
- **GPE**: Countries, cities, states
- **LOC**: Non-GPE locations, mountain ranges, bodies of water
- **PRODUCT**: Objects, vehicles, foods, etc. (not services)
- **EVENT**: Named hurricanes, battles, wars, sports events, etc.
- **WORK_OF_ART**: Titles of books, songs, etc.
- **LAW**: Named documents made into laws
- **LANGUAGE**: Any named language
- **DATE**: Absolute or relative dates or periods
- **TIME**: Times smaller than a day
- **PERCENT**: Percentage, including "%"
- **MONEY**: Monetary values, including unit
- **QUANTITY**: Measurements, as of weight or distance
- **ORDINAL**: "first", "second", etc.
- **CARDINAL**: Numerals that do not fall under another type
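Because the two libraries use different label names for similar categories, comparing their results requires a common vocabulary. The sketch below maps spaCy labels onto the closest NLTK label from the lists above. The `SPACY_TO_NLTK` mapping is a hypothetical helper, not part of either library, and the example pairs are copied from the spaCy output shown earlier.

```python
# Hypothetical mapping from spaCy labels to the closest NLTK label.
# spaCy labels with no counterpart in the NLTK list above are left out.
SPACY_TO_NLTK = {
    "PERSON": "PERSON",
    "ORG": "ORGANIZATION",
    "GPE": "GPE",
    "LOC": "LOCATION",
    "FAC": "FACILITY",
}

# A few (entity, label) pairs copied from the spaCy output earlier
spacy_results = [
    ("Apple Inc.", "ORG"),
    ("American", "NORP"),
    ("Cupertino", "GPE"),
    ("Steve Jobs", "PERSON"),
    ("April 1, 1976", "DATE"),
]

# Keep only entities whose label can be expressed in NLTK's vocabulary
comparable = [
    (entity, SPACY_TO_NLTK[label])
    for entity, label in spacy_results
    if label in SPACY_TO_NLTK
]

for entity, label in comparable:
    print(f"{entity:15} {label}")
```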

In [7]:
# Let's test with a more complex example
complex_text = """
Dr. John Smith, the CEO of TechCorp Inc., announced that their new headquarters in San Francisco 
will open on December 15, 2024. The project cost $250 million and is expected to increase 
productivity by 35%. You can contact him at john.smith@techcorp.com or call (555) 123-4567. 
The company, founded in 1995, has offices in New York, London, and Tokyo. Their latest product, 
the SuperWidget 3000, won the Innovation Award at the Tech Conference 2024.
"""

print("Complex Text Analysis:")
print("="*50)
print(complex_text)
print("\n" + "="*50)

# Apply NLTK NER
print("\nNLTK Results:")
nltk_complex = nltk_ner(complex_text)
for entity in nltk_complex:
    if hasattr(entity, 'label'):
        entity_name = ' '.join([child[0] for child in entity])
        print(f"  {entity_name} ({entity.label()})")

# Apply custom extraction
print("\nCustom Extraction Results:")
custom_complex = extract_custom_entities(complex_text)
for entity_type, entities in custom_complex.items():
    if entities:
        print(f"  {entity_type}: {entities}")

Complex Text Analysis:

Dr. John Smith, the CEO of TechCorp Inc., announced that their new headquarters in San Francisco 
will open on December 15, 2024. The project cost $250 million and is expected to increase 
productivity by 35%. You can contact him at john.smith@techcorp.com or call (555) 123-4567. 
The company, founded in 1995, has offices in New York, London, and Tokyo. Their latest product, 
the SuperWidget 3000, won the Innovation Award at the Tech Conference 2024.



NLTK Results:
  John Smith (PERSON)
  CEO (ORGANIZATION)
  TechCorp Inc. (ORGANIZATION)
  San Francisco (GPE)
  john.smith (ORGANIZATION)
  New York (GPE)
  London (GPE)
  Tokyo (GPE)
  SuperWidget (ORGANIZATION)
  Innovation Award (ORGANIZATION)
  Tech (ORGANIZATION)

Custom Extraction Results:
  dates: ['December 15, 2024']
  money: ['$250 million']
  percentages: ['35%']
  emails: ['john.smith@techcorp.com']
  phone_numbers: ['555) 123-4567']
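The phone number in the output above is missing its opening parenthesis. The pattern starts with `\b`, which only matches between a word character and a non-word character, and there is no such boundary between the space and `(`. A possible fix, sketched below, replaces the leading `\b` with a negative lookbehind. The variable names are introduced for this example only.

```python
import re

# Pattern as used in extract_custom_entities
original_phone_pattern = (
    r'\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b'
)

# Same pattern, but the leading \b becomes (?<!\w): "not preceded by a word
# character". This still blocks matches inside longer tokens while allowing
# the match to begin at "(" or "+".
fixed_phone_pattern = (
    r'(?<!\w)(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b'
)

sample = "You can contact him at john.smith@techcorp.com or call (555) 123-4567."

print("original:", re.findall(original_phone_pattern, sample))
print("fixed:   ", re.findall(fixed_phone_pattern, sample))
```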

## Key Takeaways and Applications

### When to use each method:

1. **NLTK NER**:
   - Good for basic NER tasks
   - Suitable for educational purposes
   - Limited entity types, with visible mislabeling on the samples above (for example, `Apple` tagged as PERSON)
   - Easy to set up

2. **spaCy NER**:
   - More accurate and comprehensive
   - Better for production applications
   - Supports more entity types
   - Industrial-strength performance

3. **Custom Extraction**:
   - Well suited to domain-specific entities with a predictable format
   - Useful when you need very specific patterns (emails, phone numbers, amounts)
   - Complement to other methods
   - Full control over extraction logic
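Since custom extraction complements the model-based methods, their results often need to be combined. The following is a minimal sketch of one way to do that, assuming each extractor reports character spans: pattern-based spans are trusted first, and model spans are added only when they do not overlap anything already kept. The `merge_spans` and `span_of` helpers are hypothetical, and the model spans are written by hand to mirror the NLTK output above (including `john.smith` labeled ORGANIZATION).

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), with end exclusive


def span_of(source: str, surface: str, label: str) -> Span:
    """Build a span for the first occurrence of `surface` in `source`."""
    start = source.index(surface)
    return (start, start + len(surface), label)


def merge_spans(primary: List[Span], secondary: List[Span]) -> List[Span]:
    """Keep every primary span; add secondary spans that overlap nothing kept so far."""
    merged = list(primary)
    for start, end, label in secondary:
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in merged)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = (
    "Contact john.smith@techcorp.com about the $250 million "
    "project in San Francisco."
)

# Spans a regex extractor would report
regex_spans = [
    span_of(text, "john.smith@techcorp.com", "EMAIL"),
    span_of(text, "$250 million", "MONEY"),
]

# Hand-written stand-ins for model output
model_spans = [
    span_of(text, "john.smith", "ORGANIZATION"),
    span_of(text, "San Francisco", "GPE"),
]

for start, end, label in merge_spans(regex_spans, model_spans):
    print(f"{text[start:end]:25} {label}")
```

Giving priority to the pattern-based spans is a design choice that fits formats such as emails, where a regular expression is more reliable than a general-purpose model. For other entity types the priority could be reversed.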

### Real-world Applications:
- **Information Extraction**: Extract structured data from unstructured text
- **Content Analysis**: Analyze news articles, social media posts
- **Customer Service**: Extract customer information from emails
- **Legal Documents**: Identify parties, dates, amounts in contracts
- **Medical Records**: Extract patient information, medications, conditions
- **Financial Analysis**: Extract company names, financial figures from reports