# 05 — Keyphrase Extraction

Keyphrase extraction serves two roles in SocraticPath: populating the concept map shown to the user, and generating search terms for Wikipedia context retrieval. KeyBERT (Grootendorst, 2020) is used because it requires no task-specific training: it embeds the document and its candidate n-grams with a pre-trained sentence transformer (`all-MiniLM-L6-v2`; Reimers & Gurevych, 2019) and ranks the candidates by cosine similarity to the document embedding. Maximal Marginal Relevance (MMR; Carbonell & Goldstein, 1998) is applied to balance relevance against diversity, preventing semantically redundant keyphrases from dominating the extracted set.
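
The ranking step can be illustrated without loading a model. The sketch below uses hand-written 3-dimensional vectors as stand-ins for sentence-transformer embeddings (the values are hypothetical and chosen only for illustration) and sorts candidate phrases by cosine similarity to the document vector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (hypothetical values).
doc_embedding = [1.0, 1.0, 0.0]
candidate_embeddings = {
    "climate change": [1.0, 0.9, 0.1],
    "fossil fuels": [0.9, 0.2, 0.3],
    "pressing issues": [0.1, 0.2, 1.0],
}

# Score every candidate against the document and sort, highest first.
ranked = sorted(
    ((phrase, cosine(vec, doc_embedding)) for phrase, vec in candidate_embeddings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for phrase, score in ranked:
    print(f"  {phrase}: {score:.4f}")
```

With real embeddings the vectors have several hundred dimensions, but the scoring and sorting logic is the same.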

In [None]:
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from collections import Counter

In [None]:
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
print("KeyBERT model loaded.")

In [None]:
sample_text = """
Climate change is one of the most pressing issues of our time. The scientific consensus 
is clear: human activities, particularly the burning of fossil fuels, are causing global 
temperatures to rise. This leads to more extreme weather events, rising sea levels, and 
threats to biodiversity. We need immediate action on renewable energy and carbon reduction.
"""

keywords = kw_model.extract_keywords(
    sample_text,
    keyphrase_ngram_range=(1, 2),
    stop_words='english',
    top_n=10
)

print("Extracted Keyphrases:")
for kw, score in keywords:
    print(f"  {kw}: {score:.4f}")

Maximal Marginal Relevance (Carbonell & Goldstein, 1998) selects keyphrases that are relevant to the document while being dissimilar to one another. The `diversity` parameter (0 = maximum relevance, 1 = maximum diversity) is set to 0.5 for concept map extraction and 0.7 for retrieval queries, where broader coverage is preferred.
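
The effect of `diversity` can be seen in a minimal sketch of greedy MMR selection. It follows the general formulation (relevance to the document minus similarity to phrases already chosen) and is not taken from KeyBERT's source, so implementation details there may differ. The helper `mmr_select` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr_select(doc_vec, candidates, top_n=2, diversity=0.5):
    """Greedy MMR: trade relevance to the document against redundancy with prior picks."""
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    # Start from the single most relevant candidate.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best_phrase, best_score = None, -math.inf
        for phrase, vec in candidates.items():
            if phrase in selected:
                continue
            # Redundancy = similarity to the closest already-selected phrase.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[phrase] - diversity * redundancy
            if score > best_score:
                best_phrase, best_score = phrase, score
        selected.append(best_phrase)
    return selected

# Toy vectors (hypothetical); the second phrase is a near-duplicate of the first.
doc = [1.0, 1.0, 0.0]
toy = {
    "climate change": [1.0, 0.9, 0.1],
    "global climate": [1.0, 0.8, 0.1],
    "renewable energy": [0.2, 0.9, 0.6],
}

low = mmr_select(doc, toy, top_n=2, diversity=0.2)
high = mmr_select(doc, toy, top_n=2, diversity=0.7)
print("diversity=0.2:", low)
print("diversity=0.7:", high)
```

At low diversity the near-duplicate is selected second because it is highly relevant; at 0.7 the redundancy penalty outweighs its relevance and the less similar phrase is chosen instead.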

In [None]:
keywords_diverse = kw_model.extract_keywords(
    sample_text,
    keyphrase_ngram_range=(1, 3),
    stop_words='english',
    top_n=10,
    use_mmr=True,
    diversity=0.7
)

print("Diverse Keyphrases (MMR):")
for kw, score in keywords_diverse:
    print(f"  {kw}: {score:.4f}")

In [None]:
def extract_keyphrases(
    text,
    model=None,
    top_n=5,
    ngram_range=(1, 2),
    diversity=0.5,
    use_mmr=True
):
    """
    Extract keyphrases from text using KeyBERT.
    
    Args:
        text: Input text
        model: KeyBERT model instance
        top_n: Number of keyphrases to extract
        ngram_range: Tuple of (min, max) n-gram sizes
        diversity: MMR diversity parameter (0=max relevance, 1=max diversity)
        use_mmr: Whether to use MMR for diversification
    
    Returns:
        List of (keyphrase, score) tuples
    """
    if model is None:
        model = kw_model
    
    if not text or len(text.strip()) < 10:
        return []
    
    keywords = model.extract_keywords(
        text,
        keyphrase_ngram_range=ngram_range,
        stop_words='english',
        top_n=top_n,
        use_mmr=use_mmr,
        diversity=diversity
    )
    
    return keywords

In [None]:
test_contexts = [
    "I believe that artificial intelligence will eventually replace most human jobs. The advances in machine learning and automation are accelerating, and companies are already using AI for tasks that humans used to do.",
    
    "Social media is destroying democracy. The spread of misinformation and the creation of echo chambers is making it impossible for people to agree on basic facts or have productive political discussions.",
    
    "I think everyone should be required to learn a programming language in school. Technology is everywhere now, and understanding code is as important as reading and writing."
]

print("Keyphrase Extraction Examples:")
print("=" * 60)

for i, context in enumerate(test_contexts, 1):
    keyphrases = extract_keyphrases(context, top_n=5)
    
    print(f"\nContext {i}: {context[:80]}...")
    print("Keyphrases:")
    for kw, score in keyphrases:
        print(f"  - {kw} ({score:.3f})")
    print("-" * 60)

For the concept map, keyphrases are extracted from the user input, the generated Socratic question, and, when available, the retrieved context. Results are deduplicated case-insensitively and ranked by the maximum score a phrase receives in any source; the sources in which each phrase appears are recorded alongside it.
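
The merge rule (deduplicate, keep the maximum score, remember the sources) can be checked on hand-written scores before running the model. The helper `merge_keyphrases` and the scores below are hypothetical and serve only as an illustration.

```python
def merge_keyphrases(per_source):
    """Deduplicate (phrase, score) pairs across sources, keeping the maximum score."""
    merged = {}
    for source, pairs in per_source.items():
        for phrase, score in pairs:
            key = phrase.lower()  # case-insensitive deduplication
            entry = merged.setdefault(key, {"phrase": phrase, "score": score, "sources": []})
            entry["score"] = max(entry["score"], score)
            if source not in entry["sources"]:
                entry["sources"].append(source)
    # Rank by score only, highest first.
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)

# Hand-written scores (hypothetical), with one phrase shared by both sources.
toy_scores = {
    "user_input": [("climate change", 0.62), ("weather", 0.41)],
    "generated_question": [("Climate Change", 0.55), ("historical variations", 0.66)],
}

combined = merge_keyphrases(toy_scores)
for entry in combined:
    print(f"  - {entry['phrase']} ({entry['score']:.2f}) - from: {', '.join(entry['sources'])}")
```

Under this rule a phrase found in two sources does not automatically outrank a higher-scoring single-source phrase. If multi-source concepts should come first, one option would be to sort on `(len(e["sources"]), e["score"])` instead.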

In [None]:
def extract_all_keyphrases(user_input, generated_question, retrieved_context=None, top_n=5):
    """
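    Extract keyphrases from all sources and combine them.
    
    Args:
        user_input: Text written by the user
        generated_question: Socratic question produced by the model
        retrieved_context: Optional retrieved passage; skipped when None or empty
        top_n: Keyphrases per source (the question is capped at 3)
    
    Returns:
        dict of (keyphrase, score) lists keyed by source, plus 'combined':
        up to top_n * 2 dicts with 'phrase', 'score', and 'sources'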
    """
    results = {
        'user_input': extract_keyphrases(user_input, top_n=top_n),
        'generated_question': extract_keyphrases(generated_question, top_n=3),
        'combined': []
    }
    
    if retrieved_context:
        results['retrieved_context'] = extract_keyphrases(retrieved_context, top_n=top_n)
    
    all_keyphrases = {}
    for source, kws in results.items():
        if source == 'combined':
            continue
        for kw, score in kws:
            kw_lower = kw.lower()
            if kw_lower not in all_keyphrases:
                all_keyphrases[kw_lower] = {'phrase': kw, 'score': score, 'sources': []}
            all_keyphrases[kw_lower]['sources'].append(source)
            all_keyphrases[kw_lower]['score'] = max(all_keyphrases[kw_lower]['score'], score)
    
    sorted_kws = sorted(all_keyphrases.values(), key=lambda x: x['score'], reverse=True)
    results['combined'] = sorted_kws[:top_n * 2]
    
    return results

In [None]:
example_input = "Climate change is not as serious as scientists claim because the weather has always changed throughout history."
example_question = "What specific evidence distinguishes current climate patterns from historical natural variations?"
example_context = "The current rate of warming is unprecedented in the geological record. Ice core data shows CO2 levels are higher than any point in the last 800,000 years."

all_kws = extract_all_keyphrases(
    example_input,
    example_question,
    example_context
)

print("Multi-Source Keyphrase Extraction:")
print("=" * 60)

print("\nFrom User Input:")
for kw, score in all_kws['user_input']:
    print(f"  - {kw} ({score:.3f})")

print("\nFrom Generated Question:")
for kw, score in all_kws['generated_question']:
    print(f"  - {kw} ({score:.3f})")

print("\nFrom Retrieved Context:")
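# .get() avoids a KeyError when extract_all_keyphrases was called without retrieved context
for kw, score in all_kws.get('retrieved_context', []):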
    print(f"  - {kw} ({score:.3f})")

print("\nCombined (Unique, Ranked):")
for item in all_kws['combined']:
    sources = ', '.join(item['sources'])
    print(f"  - {item['phrase']} ({item['score']:.3f}) - from: {sources}")

In [None]:
def generate_concept_nodes(keyphrases_result, central_topic=None):
    """
    Generate concept map nodes from keyphrases.
    
    Returns:
        list of node dictionaries for React Flow
    """
    nodes = []
    
    if central_topic:
        nodes.append({
            'id': 'central',
            'type': 'topic',
            'label': central_topic,
            'position': {'x': 0, 'y': 0},
            'data': {'score': 1.0, 'source': 'user'}
        })
    
    for i, item in enumerate(keyphrases_result.get('combined', [])):
        node_type = 'concept'
        if 'generated_question' in item.get('sources', []):
            node_type = 'question'
        
        nodes.append({
            'id': f'concept_{i}',
            'type': node_type,
            'label': item['phrase'],
            'position': {'x': 0, 'y': 0},
            'data': {
                'score': item['score'],
                'sources': item['sources']
            }
        })
    
    return nodes

In [None]:
nodes = generate_concept_nodes(all_kws, central_topic="Climate Change")

print("Generated Concept Nodes:")
for node in nodes:
    print(f"  [{node['type']}] {node['label']} (score: {node['data']['score']:.3f})")

In [None]:
DATA_PATH = Path("../datasets/processed")

if (DATA_PATH / "test_formatted.parquet").exists():
    test_df = pd.read_parquet(DATA_PATH / "test_formatted.parquet")
    sample_df = test_df.head(100)
    
    print("Processing keyphrases for sample data...")
    
    all_keyphrases = []
    for idx, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
        kws = extract_keyphrases(row['original_input'], top_n=5)
        all_keyphrases.extend([kw for kw, _ in kws])
    
    keyphrase_counts = Counter(all_keyphrases)
    top_keyphrases = keyphrase_counts.most_common(20)
    
    print("\nMost Common Keyphrases in Dataset:")
    for kw, count in top_keyphrases:
        print(f"  {kw}: {count}")
else:
    print("Test data not found. Run preprocessing notebook first.")

In [None]:
class KeyphraseExtractor:
    """
    Production-ready keyphrase extractor for SocraticPath.
    """
    
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = KeyBERT(model=model_name)
    
    def extract(
        self,
        text,
        top_n=5,
        ngram_range=(1, 2),
        diversity=0.5
    ):
        """Extract keyphrases from a single text."""
        if not text or len(text.strip()) < 10:
            return []
        
        keywords = self.model.extract_keywords(
            text,
            keyphrase_ngram_range=ngram_range,
            stop_words='english',
            top_n=top_n,
            use_mmr=True,
            diversity=diversity
        )
        
        return [{'phrase': kw, 'score': float(score)} for kw, score in keywords]
    
    def extract_for_retrieval(self, text, top_n=3):
        """Extract keyphrases optimized for API retrieval queries."""
        kws = self.extract(text, top_n=top_n, ngram_range=(1, 3), diversity=0.7)
        return [kw['phrase'] for kw in kws]

In [None]:
extractor = KeyphraseExtractor()

test_text = "The government should invest more in renewable energy sources like solar and wind power to combat climate change."

print("Standard Extraction:")
for kw in extractor.extract(test_text):
    print(f"  - {kw['phrase']} ({kw['score']:.3f})")

print("\nFor Retrieval Queries:")
print(f"  {extractor.extract_for_retrieval(test_text)}")

In [None]:
import json

OUTPUT_PATH = Path("../models/keybert_config")
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

config = {
    "model_name": "all-MiniLM-L6-v2",
    "default_top_n": 5,
    "default_ngram_range": [1, 2],
    "default_diversity": 0.5,
    "retrieval_top_n": 3,
    "retrieval_ngram_range": [1, 3],
    "retrieval_diversity": 0.7,
    "stop_words": "english"
}

with open(OUTPUT_PATH / "config.json", "w") as f:
    json.dump(config, f, indent=2)

print(f"Configuration saved to {OUTPUT_PATH / 'config.json'}")

The `KeyphraseExtractor` class is the interface used by the inference pipeline in notebook 06. Its two extraction modes reflect different coverage requirements: `extract()` serves concept maps (top_n=5, ngram_range=(1, 2), diversity=0.5), and `extract_for_retrieval()` serves Wikipedia queries (top_n=3, ngram_range=(1, 3), diversity=0.7). Configuration is persisted to `models/keybert_config/config.json` for reproducibility.
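
Reading the configuration back requires one small conversion: JSON has no tuple type, so the n-gram ranges are returned as lists. The sketch below shows one way to rebuild the keyword arguments for either mode. The helper `load_extraction_kwargs` is hypothetical and not part of the pipeline, and the demo writes to a temporary directory so that it runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_extraction_kwargs(config_path, mode="default"):
    """Read the saved config and build keyword arguments for one extraction mode."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "top_n": cfg[f"{mode}_top_n"],
        # JSON returns a list here; convert it back to a tuple.
        "ngram_range": tuple(cfg[f"{mode}_ngram_range"]),
        "diversity": cfg[f"{mode}_diversity"],
    }

# Same keys and values as the config saved above.
sample_config = {
    "model_name": "all-MiniLM-L6-v2",
    "default_top_n": 5,
    "default_ngram_range": [1, 2],
    "default_diversity": 0.5,
    "retrieval_top_n": 3,
    "retrieval_ngram_range": [1, 3],
    "retrieval_diversity": 0.7,
    "stop_words": "english",
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps(sample_config, indent=2))
    default_kwargs = load_extraction_kwargs(path)
    retrieval_kwargs = load_extraction_kwargs(path, mode="retrieval")

print(default_kwargs)
print(retrieval_kwargs)
```

The resulting dictionaries match the parameters of `KeyphraseExtractor.extract()`, so a call such as `extractor.extract(text, **retrieval_kwargs)` would reproduce the retrieval settings.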