# Advanced Usage Examples for KeywordX

This notebook demonstrates advanced usage patterns and real-world examples for KeywordX, including:
1. Working with different entity types and custom weights
2. Processing job postings to extract skills and requirements
3. Analyzing news articles for locations and dates
4. Processing a CSV file with multiple records

First, let's set up our environment and import required libraries.

In [4]:
import sys
import os
sys.path.append(os.path.abspath(".."))

from src.keywordx.extractor import KeywordExtractor
import pandas as pd
import pprint
from datetime import datetime

## 1. Entity Types and Custom Weights

Let's explore how different entity types are handled and how custom weights affect matching.

In [None]:
# Initialize with custom weights for different entity types
ke = KeywordExtractor(entity_weights={
    'DATE': 1.5,      # Boost date matches
    'GPE': 1.2,      # Slightly boost location matches
    'MONEY': 0.8,     # Reduce priority of money mentions
    'CARDINAL': 0.9   # Lower priority for numbers
})

# Example text with multiple entity types
text = """
The conference will be held next Friday at the Hilton Hotel in New York City.
Early bird registration costs $599, and the regular price is $899.
Over 500 participants from London, Tokyo, and San Francisco are expected to attend.
"""

# Keywords to extract
keywords = ["date", "location", "cost", "attendance"]  # More specific keywords

# Extract and display results
results = ke.extract(text, keywords)
pprint.pprint(results)

{'entities': [{'span': (29, 40), 'text': 'next Friday', 'type': 'DATE'},
              {'span': (44, 60), 'text': 'the Hilton Hotel', 'type': 'FAC'},
              {'span': (64, 77), 'text': 'New York City', 'type': 'GPE'},
              {'span': (110, 113), 'text': '599', 'type': 'MONEY'},
              {'span': (141, 144), 'text': '899', 'type': 'MONEY'},
              {'span': (146, 154), 'text': 'Over 500', 'type': 'CARDINAL'},
              {'span': (173, 179), 'text': 'London', 'type': 'GPE'},
              {'span': (181, 186), 'text': 'Tokyo', 'type': 'GPE'},
              {'span': (192, 205), 'text': 'San Francisco', 'type': 'GPE'}],
 'semantic_matches': [{'keyword': 'date',
                       'match': 'registration',
                       'score': 0.9999996284747263},
                      {'keyword': 'place',
                       'match': 'New York City',
                       'score': 0.792362229039683},
                      {'keyword': 'money',
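
The result dictionary above holds an `entities` list whose items carry `span`, `text`, and `type` keys. As a minimal sketch, assuming that layout holds for every result, the entities can be grouped by type for easier inspection. `group_entities_by_type` is a hypothetical helper written for this notebook, not part of KeywordX, and the sample dictionary is an abbreviated copy of the output above.

```python
from collections import defaultdict


def group_entities_by_type(result):
    """Hypothetical helper (not part of KeywordX): map entity type -> list of texts."""
    grouped = defaultdict(list)
    for entity in result.get("entities", []):
        grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


# Abbreviated copy of the entities printed above
sample_result = {
    "entities": [
        {"span": (29, 40), "text": "next Friday", "type": "DATE"},
        {"span": (64, 77), "text": "New York City", "type": "GPE"},
        {"span": (110, 113), "text": "599", "type": "MONEY"},
        {"span": (173, 179), "text": "London", "type": "GPE"},
    ]
}

grouped = group_entities_by_type(sample_result)
print(grouped["GPE"])
```

Grouping preserves the order in which entities appear in the text, so the first item per type is the earliest mention.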
                    

## 2. Job Posting Analysis

This example shows how to extract key information from job postings.

In [7]:
# Initialize extractor with weights favoring organizations, products, and locations
job_extractor = KeywordExtractor(entity_weights={
    'ORG': 1.3,       # Boost company names
    'PRODUCT': 1.2,   # Boost product/technology mentions
    'GPE': 1.1        # Slightly boost locations
})

job_post = """
Senior Machine Learning Engineer
Location: San Francisco, CA (Hybrid)

TechCorp is seeking a Senior ML Engineer to join our AI team. 
The role offers a competitive salary of $150,000 to $200,000.

Required Skills:
- 5+ years experience with Python, TensorFlow, and PyTorch
- Previous experience at FAANG companies preferred
- Master's degree in Computer Science or related field

Start date: January 2026
"""

job_keywords = ["company", "location", "salary", "date", "degree", "technology"]

job_results = job_extractor.extract(job_post, job_keywords)
pprint.pprint(job_results)

{'entities': [{'span': (44, 57), 'text': 'San Francisco', 'type': 'GPE'},
              {'span': (72, 80), 'text': 'TechCorp', 'type': 'ORG'},
              {'span': (175, 195),
               'text': '$150,000 to $200,000',
               'type': 'MONEY'},
              {'span': (217, 225), 'text': '5+ years', 'type': 'DATE'},
              {'span': (242, 248), 'text': 'Python', 'type': 'ORG'},
              {'span': (250, 260), 'text': 'TensorFlow', 'type': 'ORG'},
              {'span': (266, 273), 'text': 'PyTorch', 'type': 'ORG'},
              {'span': (346, 362), 'text': 'Computer Science', 'type': 'ORG'},
              {'span': (393, 405), 'text': 'January 2026', 'type': 'DATE'}],
 'semantic_matches': [{'keyword': 'company',
                       'match': 'company',
                       'score': 0.9999996880793711},
                      {'keyword': 'location',
                       'match': '\nSenior Machine Learning Engineer\nLocation',
                       'score': 0.7
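
In the output above, Python, TensorFlow, and PyTorch are all tagged as `ORG` rather than as technologies. One possible workaround, sketched here as post-processing outside KeywordX, is to relabel entities whose text appears in a user-maintained allow-list. The `KNOWN_TECH` set, the `relabel_tech_entities` helper, and the choice of `PRODUCT` as the replacement label are all assumptions made for illustration.

```python
# Hypothetical allow-list; extend it for your own domain
KNOWN_TECH = {"python", "tensorflow", "pytorch"}


def relabel_tech_entities(entities, known_tech=KNOWN_TECH, new_type="PRODUCT"):
    """Return a copy of `entities` with allow-listed names retyped to `new_type`."""
    relabeled = []
    for entity in entities:
        updated = dict(entity)  # copy so the original result is not mutated
        if entity["text"].lower() in known_tech:
            updated["type"] = new_type
        relabeled.append(updated)
    return relabeled


# Entities copied from the output above (abbreviated)
entities = [
    {"span": (72, 80), "text": "TechCorp", "type": "ORG"},
    {"span": (242, 248), "text": "Python", "type": "ORG"},
    {"span": (250, 260), "text": "TensorFlow", "type": "ORG"},
    {"span": (266, 273), "text": "PyTorch", "type": "ORG"},
]

relabeled = relabel_tech_entities(entities)
for entity in relabeled:
    print(entity["text"], entity["type"])
```

An allow-list is deliberately simple: it only corrects names you already know about and leaves every other entity untouched.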

## 3. News Article Analysis

This example shows how to extract key information from news articles.

In [8]:
# Initialize with weights prioritizing dates, locations, and organizations
news_extractor = KeywordExtractor(entity_weights={
    'DATE': 1.4,
    'GPE': 1.3,
    'ORG': 1.2,
    'PERSON': 1.1
})

article = """
Tech Summit 2025 Announces Major AI Breakthroughs

SILICON VALLEY, October 15, 2025 - The annual Tech Innovation Summit,
hosted by Google and Microsoft, revealed groundbreaking developments in AI technology.
Dr. Sarah Chen, leading researcher at DeepMind, presented a new neural architecture
that achieved unprecedented results in natural language understanding.

The conference, attended by over 1,000 researchers from 50 countries,
will continue until October 20th at the Santa Clara Convention Center.
"""

news_keywords = ["event_date", "location", "company", "person", "number"]

news_results = news_extractor.extract(article, news_keywords)
pprint.pprint(news_results)

{'entities': [{'span': (68, 78), 'text': 'October 15', 'type': 'DATE'},
              {'span': (98, 120),
               'text': 'Tech Innovation Summit',
               'type': 'ORG'},
              {'span': (132, 138), 'text': 'Google', 'type': 'ORG'},
              {'span': (143, 152), 'text': 'Microsoft', 'type': 'ORG'},
              {'span': (213, 223), 'text': 'Sarah Chen', 'type': 'PERSON'},
              {'span': (247, 255), 'text': 'DeepMind', 'type': 'ORG'},
              {'span': (393, 403), 'text': 'over 1,000', 'type': 'CARDINAL'},
              {'span': (421, 423), 'text': '50', 'type': 'CARDINAL'},
              {'span': (455, 467), 'text': 'October 20th', 'type': 'DATE'},
              {'span': (471, 504),
               'text': 'the Santa Clara Convention Center',
               'type': 'FAC'}],
 'semantic_matches': [{'keyword': 'location',
                       'match': 'a new neural architecture',
                       'score': 0.6240192214751383},
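
The `span` tuples in the output above appear to be character offsets into the input string. That is an inference from the recorded output, not a documented guarantee. A small sanity check, sketched below with the hypothetical helpers `span_text` and `mismatched_spans`, slices the text by each span and reports any entity whose slice differs from its `text` field. The article excerpt and entities are copied from this section.

```python
# First lines of the article from the cell above, including the leading newline
article = """
Tech Summit 2025 Announces Major AI Breakthroughs

SILICON VALLEY, October 15, 2025 - The annual Tech Innovation Summit,
hosted by Google and Microsoft, revealed groundbreaking developments in AI technology.
"""

# Entities copied from the output above (abbreviated)
entities = [
    {"span": (68, 78), "text": "October 15", "type": "DATE"},
    {"span": (98, 120), "text": "Tech Innovation Summit", "type": "ORG"},
    {"span": (132, 138), "text": "Google", "type": "ORG"},
    {"span": (143, 152), "text": "Microsoft", "type": "ORG"},
]


def span_text(text, span):
    """Slice `text` using a (start, end) span, assumed to be character offsets."""
    start, end = span
    return text[start:end]


def mismatched_spans(text, entities):
    """Return the entities whose span does not reproduce their `text` field."""
    return [e for e in entities if span_text(text, e["span"]) != e["text"]]


print(mismatched_spans(article, entities))
```

An empty list means every span round-trips. The offsets are only valid against the exact string passed to `extract`, so stripping or reformatting the text first would invalidate them.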
               

## 4. CSV Processing Pipeline

This example demonstrates how to process a CSV file containing multiple records.

In [9]:
# Create a sample CSV with some data
sample_data = """
title,description,date_posted
Data Scientist Opening,"Looking for a Data Scientist in Boston. Offering $120k-150k. Start date: March 2026.",2025-10-01
Product Manager Role,"Senior PM needed in Seattle. Work with top tech companies. Comp: $140k-180k. Immediate start.",2025-10-02
Software Engineer,"Remote position available. $130k base + stock options. Team based in San Francisco.",2025-10-03
"""

# Save to CSV (drop the leading blank line left by the triple-quoted string)
with open('sample_jobs.csv', 'w', encoding='utf-8') as f:
    f.write(sample_data.lstrip())

# Read CSV
df = pd.read_csv('sample_jobs.csv')

# Initialize extractor
pipeline_extractor = KeywordExtractor(entity_weights={
    'GPE': 1.3,
    'MONEY': 1.2,
    'DATE': 1.1
})

# Keywords to extract from each job posting
pipeline_keywords = ["location", "salary", "date"]

# Process each row
def process_job_posting(row):
    # Combine title and description for analysis
    text = f"{row['title']}. {row['description']}"
    return pipeline_extractor.extract(text, pipeline_keywords)

# Apply processing to each row
df['extracted_info'] = df.apply(process_job_posting, axis=1)

# Display results
for idx, row in df.iterrows():
    print(f"\nJob {idx+1}: {row['title']}")
    pprint.pprint(row['extracted_info'])


Job 1: Data Scientist Opening
{'entities': [{'span': (56, 62), 'text': 'Boston', 'type': 'GPE'},
              {'span': (74, 83), 'text': '120k-150k', 'type': 'MONEY'},
              {'span': (97, 107), 'text': 'March 2026', 'type': 'DATE'}],
 'semantic_matches': [{'keyword': 'location',
                       'match': 'a Data Scientist',
                       'score': 0.6532986621165415},
                      {'keyword': 'salary',
                       'match': 'date',
                       'score': 0.6166546503806254},
                      {'keyword': 'date',
                       'match': 'date',
                       'score': 0.9999996284747263}]}

Job 2: Product Manager Role
{'entities': [{'span': (16, 20), 'text': 'Role', 'type': 'PERSON'},
              {'span': (42, 49), 'text': 'Seattle', 'type': 'GPE'},
              {'span': (81, 85), 'text': 'Comp', 'type': 'NORP'},
              {'span': (88, 97), 'text': '140k-180k', 'type': 'MONEY'}],
 'semantic_matches': [{'keyw
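
Some semantic matches in the output above are weak. For Job 1, the keyword `salary` is matched to the word `date` with a score near 0.62. A minimal sketch of filtering by score follows. The `strong_matches` helper and the 0.7 cutoff are assumptions chosen for illustration, not KeywordX defaults, and the sample values are copied from the Job 1 output.

```python
def strong_matches(result, min_score=0.7):
    """Hypothetical helper: keep semantic matches with score >= min_score."""
    return [
        match
        for match in result.get("semantic_matches", [])
        if match["score"] >= min_score
    ]


# Semantic matches copied from the Job 1 output above
job1_result = {
    "semantic_matches": [
        {"keyword": "location", "match": "a Data Scientist", "score": 0.6532986621165415},
        {"keyword": "salary", "match": "date", "score": 0.6166546503806254},
        {"keyword": "date", "match": "date", "score": 0.9999996284747263},
    ]
}

kept = strong_matches(job1_result)
print([match["keyword"] for match in kept])
```

In the pipeline above, the same helper could be applied per row with `df['extracted_info'].apply(strong_matches)`. A suitable threshold depends on the data, so it is worth inspecting a sample of scores before fixing a value.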

## Cleanup

Remove the sample CSV file we created.

In [10]:
import os
os.remove('sample_jobs.csv')
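
As an alternative to removing the file by hand, the sample CSV can be written inside a temporary directory that is deleted automatically. The sketch below uses only the standard library (`tempfile` and `csv`) to stay dependency-free. Inside the `with` block, `pd.read_csv(csv_path)` could be used instead of `csv.DictReader`. The data is a shortened copy of the sample from section 4.

```python
import csv
import os
import tempfile

# Shortened copy of the sample data from section 4
sample_data = """title,description,date_posted
Data Scientist Opening,"Looking for a Data Scientist in Boston. Offering $120k-150k. Start date: March 2026.",2025-10-01
Software Engineer,"Remote position available. $130k base + stock options. Team based in San Francisco.",2025-10-03
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample_jobs.csv")

    # Write the sample file inside the temporary directory
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        f.write(sample_data)

    # Read it back while the directory still exists
    with open(csv_path, encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))

# The directory and the file are gone once the with-block exits
file_removed = not os.path.exists(csv_path)
print(len(rows), file_removed)
```

This keeps the working directory clean even if a later cell raises an exception before the cleanup step runs.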