# Week 6, Day 3: Named Entity Recognition and Information Extraction

## Learning Objectives
- Understand NER concepts and techniques
- Learn information extraction methods
- Master entity recognition patterns
- Practice implementing NER systems

## Topics Covered
1. Named Entity Recognition
2. Information Extraction
3. Pattern Matching
4. Entity Linking

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import nltk
from nltk import ne_chunk
from nltk.tokenize import word_tokenize
import re

# Download required models
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load spaCy model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

## 1. Named Entity Recognition

In [None]:
def ner_example():
    # Sample text
    text = """
    Apple Inc. CEO Tim Cook announced new products at their headquarters in Cupertino, California.
    The event was also attended by Google's Sundar Pichai and Microsoft's Satya Nadella.
    The companies discussed their plans for AI development in 2024.
    """
    
    # Process with spaCy
    doc = nlp(text)
    
    # Extract entities
    entities = [
        (ent.text, ent.label_) for ent in doc.ents
    ]
    
    # Create DataFrame
    df = pd.DataFrame(entities, columns=['Entity', 'Type'])
    
    # Plot entity distribution
    plt.figure(figsize=(10, 5))
    sns.countplot(data=df, y='Type')
    plt.title('Entity Type Distribution')
    plt.show()
    
    # Print entities
    print("\nIdentified Entities:")
    for entity, type_ in entities:
        print(f"{entity}: {type_}")
    
    # Visualize entities in context
    from spacy import displacy
    displacy.render(doc, style='ent', jupyter=True)

ner_example()
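
NER models typically label individual tokens rather than whole spans, most often with the BIO scheme: `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity. spaCy exposes the same information per token through `token.ent_iob_` and `token.ent_type_`. The sketch below is a minimal, library-free illustration; the helper `spans_to_bio` and the hand-written token list are assumptions made for this example, not part of spaCy or NLTK.

```python
def spans_to_bio(tokens, spans):
    """Convert token-level entity spans to BIO tags.

    tokens: list of token strings.
    spans:  list of (start, end, label) tuples in token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)          # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the same entity
    return tags


# Hand-tokenized example sentence with two hypothetical gold spans
tokens = ["Tim", "Cook", "visited", "Cupertino", "."]
spans = [(0, 2, "PERSON"), (3, 4, "GPE")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token:<12} {tag}")
```

The sketch assumes spans do not overlap; overlapping or nested entities need a different annotation scheme.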

## 2. Information Extraction

In [None]:
def information_extraction_example():
    # Sample text
    text = """
    Contact: John Smith
    Email: john.smith@email.com
    Phone: (123) 456-7890
    Address: 123 Main St, New York, NY 10001
    Website: www.example.com
    """
    
    # Define patterns
    patterns = {
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'phone': r'\(\d{3}\)\s*\d{3}-\d{4}',
        'website': r'www\.[\w-]+\.[\w.]+'
    }
    
    # Extract information
    extracted_info = {}
    for key, pattern in patterns.items():
        matches = re.findall(pattern, text)
        extracted_info[key] = matches
    
    # Print results
    print("Extracted Information:")
    for key, values in extracted_info.items():
        print(f"\n{key.title()}:")
        for value in values:
            print(f"- {value}")
    
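    # Parse the text with spaCy to inspect its syntactic structure
    doc = nlp(text)
    
    # Print each token's dependency label and head, skipping whitespace tokens
    print("\nDependency Analysis:")
    for token in doc:
        if token.is_space:
            continue
        print(f"{token.text:<20} {token.dep_:<20} {token.head.text}")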

information_extraction_example()
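
The patterns above return flat strings. When a field has internal structure, such as the address line in the sample text, named capture groups can turn a single match into a dictionary. The following is a minimal sketch that assumes a US-style "street, city, ST zip" layout; `ADDRESS_PATTERN` and `parse_address` are illustrative names, and real addresses vary far more than this pattern allows.

```python
import re

# Assumed layout: "<number> <street>, <city>, <2-letter state> <5-digit zip>"
ADDRESS_PATTERN = re.compile(
    r"(?P<street>\d+\s+[^,]+),\s*"      # house number and street name
    r"(?P<city>[^,]+),\s*"              # city, up to the next comma
    r"(?P<state>[A-Z]{2})\s+"           # two-letter state code
    r"(?P<zip>\d{5}(?:-\d{4})?)"        # ZIP code, optional +4 suffix
)


def parse_address(text):
    """Return the address parts as a dict, or None if nothing matches."""
    match = ADDRESS_PATTERN.search(text)
    return match.groupdict() if match else None


address = parse_address("Address: 123 Main St, New York, NY 10001")
print(address)
```

Returning `None` on a failed match lets the caller distinguish "no address found" from an address with empty fields.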

## 3. Pattern Matching

In [None]:
def pattern_matching_example():
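    # Define patterns (the tokenizer may keep "Inc." as one token, so accept both forms)
    patterns = [
        {"label": "ORG", "pattern": [{"LOWER": "apple"}, {"LOWER": {"IN": ["inc", "inc."]}}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
        {"label": "EVENT", "pattern": [{"LOWER": "wwdc"}, {"IS_DIGIT": True}]}
    ]
    
    # Add the ruler before the statistical NER so its matches take priority;
    # reuse the existing ruler if this cell has already been run
    if "entity_ruler" in nlp.pipe_names:
        ruler = nlp.get_pipe("entity_ruler")
    else:
        ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns(patterns)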
    
    # Sample text
    text = """
    Apple Inc. announced the iPhone 15 at WWDC 2023.
    The new iPhone 14 and iPhone 13 will see price reductions.
    Developers are excited for WWDC 2024.
    """
    
    # Process text
    doc = nlp(text)
    
    # Extract matches
    matches = [
        (ent.text, ent.label_) for ent in doc.ents
    ]
    
    # Create DataFrame
    df = pd.DataFrame(matches, columns=['Entity', 'Type'])
    
    # Plot results
    plt.figure(figsize=(10, 5))
    sns.countplot(data=df, x='Type')
    plt.title('Pattern Matches by Type')
    plt.xticks(rotation=45)
    plt.show()
    
    # Print matches
    print("\nPattern Matches:")
    for entity, type_ in matches:
        print(f"{entity}: {type_}")

pattern_matching_example()
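
## 4. Entity Linking

Entity linking maps a recognized mention to a canonical entry in a knowledge base, so that different surface forms such as "Apple" and "Apple Inc." resolve to the same entity. The sketch below uses the simplest possible approach, an alias-table lookup. The table, the `KB:` identifiers, and the `link_entity` helper are all hypothetical and invented for illustration; a real system would also use context to disambiguate mentions that share a surface form.

```python
# Hypothetical alias table: normalized surface form -> made-up knowledge-base ID
ALIAS_TABLE = {
    "apple": "KB:apple_inc",
    "apple inc": "KB:apple_inc",
    "apple inc.": "KB:apple_inc",
    "wwdc": "KB:wwdc",
}


def link_entity(mention):
    """Return the knowledge-base ID for a mention, or None if it is unknown."""
    # Normalize case and collapse whitespace before the lookup
    key = " ".join(mention.lower().split())
    return ALIAS_TABLE.get(key)


mentions = ["Apple Inc.", "apple", "WWDC", "Cupertino"]
linked = {mention: link_entity(mention) for mention in mentions}

for mention, kb_id in linked.items():
    print(f"{mention:<12} -> {kb_id}")
```

Mentions missing from the table, such as "Cupertino" here, come back as `None`, which makes unlinked entities easy to flag for review.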

## Practical Exercises

In [None]:
# Exercise 1: Custom NER System

def custom_ner_exercise():
    # Sample text
    text = """
    The new MacBook Pro features an M2 chip and starts at $1299.
    Available colors include Space Gray and Silver.
    Pre-orders begin on March 15, 2024 at apple.com.
    """
    
    print("Task: Create a custom NER system")
    print("1. Define entity patterns")
    print("2. Implement pattern matching")
    print("3. Extract product information")
    print("4. Visualize results")
    
    # Your code here

custom_ner_exercise()

In [None]:
# Exercise 2: Information Extraction Pipeline

def information_extraction_exercise():
    # Sample text
    text = """
    Company: TechCorp Solutions
    CEO: Sarah Johnson
    Founded: 2015
    Location: San Francisco, CA
    Revenue: $50M (2023)
    Employees: 250
    """
    
    print("Task: Build an information extraction pipeline")
    print("1. Define extraction patterns")
    print("2. Extract structured data")
    print("3. Validate information")
    print("4. Format output")
    
    # Your code here

information_extraction_exercise()

## MCQ Quiz

1. What is Named Entity Recognition?
   - a) Text classification
   - b) Entity identification
   - c) Sentiment analysis
   - d) Machine translation

2. Which is NOT a common entity type?
   - a) Person
   - b) Organization
   - c) Adjective
   - d) Location

3. What is information extraction?
   - a) Data compression
   - b) Structured data extraction
   - c) Text generation
   - d) Language translation

4. What are regular expressions used for?
   - a) Machine learning
   - b) Pattern matching
   - c) Text translation
   - d) Data storage

5. What is entity linking?
   - a) Text formatting
   - b) Entity disambiguation
   - c) Data encryption
   - d) File compression

6. Which tool is commonly used for NER?
   - a) Pandas
   - b) spaCy
   - c) Matplotlib
   - d) NumPy

7. What is chunking in NER?
   - a) Data splitting
   - b) Phrase grouping
   - c) Text compression
   - d) File handling

8. Why use dependency parsing?
   - a) Data storage
   - b) Relationship analysis
   - c) Text generation
   - d) File compression

9. What is BIO tagging?
   - a) File format
   - b) Entity annotation
   - c) Data compression
   - d) Text encryption

10. What is coreference resolution?
    - a) Data compression
    - b) Entity reference linking
    - c) Text formatting
    - d) File handling

Answers: 1-b, 2-c, 3-b, 4-b, 5-b, 6-b, 7-b, 8-b, 9-b, 10-b