# 05 - Keyphrase Extraction

This notebook implements keyphrase extraction for the concept map visualization and context retrieval.

## Objectives
- Implement TF-IDF based keyphrase extraction (baseline)
- Implement a KeyBERT-style embedding extractor and evaluate both extractors on KP20k (fine-tuning is optional and not covered here)
- Test extraction on sample contexts
- Integrate with the retrieval pipeline

## Why Keyphrase Extraction?
Keyphrases are used for:
1. **Concept Map Visualization** - Nodes in React Flow represent extracted concepts
2. **Wikipedia/Gemini Retrieval** - Query terms for context augmentation
3. **Question Focusing** - Generate questions about specific concepts

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from collections import Counter
import re
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

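nltk.download('punkt', quiet=True)
# Newer NLTK releases load the tokenizer from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)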

SEED = 42
np.random.seed(SEED)

OUTPUT_DIR = Path("../backend/keyphrase_artifacts")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

## 2. Load Sample Data

In [None]:
test_contexts = [
    "Climate change is causing significant shifts in global weather patterns. Rising temperatures lead to melting ice caps, rising sea levels, and more frequent extreme weather events. Scientists argue that human activities, particularly the burning of fossil fuels, are the primary drivers of these changes.",
    
    "Machine learning algorithms can be broadly categorized into supervised learning, unsupervised learning, and reinforcement learning. Neural networks, a subset of machine learning, have revolutionized fields like computer vision and natural language processing.",
    
    "The French Revolution of 1789 fundamentally transformed French society. The abolition of feudalism, the Declaration of the Rights of Man, and the rise of Napoleon Bonaparte marked significant turning points in European history.",
    
    "Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. This process occurs primarily in the chloroplasts of plant cells and is essential for life on Earth.",
    
    "Quantum computing leverages quantum mechanical phenomena like superposition and entanglement to perform computations. Unlike classical bits that are either 0 or 1, quantum bits or qubits can exist in multiple states simultaneously."
]

print(f"Loaded {len(test_contexts)} sample contexts")

## 3. TF-IDF Keyphrase Extractor (Baseline)

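A simple baseline that scores n-grams with TF-IDF. Note that the extractor below fits the vectorizer on one document at a time, so the IDF term is the same for every n-gram and the scores reduce to L2-normalized n-gram frequencies.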

In [None]:
class TFIDFKeyphraseExtractor:
    """Extract keyphrases using TF-IDF scoring."""
    
    def __init__(self, ngram_range=(1, 3), top_n=10):
        self.ngram_range = ngram_range
        self.top_n = top_n
        self.stop_words = set(stopwords.words('english'))
        self.vectorizer = TfidfVectorizer(
            ngram_range=ngram_range,
            stop_words='english',
            max_features=1000
        )
        self.stemmer = PorterStemmer()
    
    def preprocess(self, text):
        """Clean and normalize text."""
        text = text.lower()
        # Replace non-letters with a space so hyphenated words are split instead of fused
        text = re.sub(r'[^a-zA-Z\s]', ' ', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def extract(self, text, top_n=None):
        """Extract top keyphrases from text."""
        if top_n is None:
            top_n = self.top_n
        
        processed = self.preprocess(text)
        
        try:
            tfidf_matrix = self.vectorizer.fit_transform([processed])
            feature_names = self.vectorizer.get_feature_names_out()
            scores = tfidf_matrix.toarray()[0]
            
            scored_phrases = list(zip(feature_names, scores))
            scored_phrases.sort(key=lambda x: x[1], reverse=True)
            
            keyphrases = []
            seen_stems = set()
            
            for phrase, score in scored_phrases:
                stem = ' '.join([self.stemmer.stem(w) for w in phrase.split()])
                if stem not in seen_stems and score > 0:
                    keyphrases.append({'text': phrase, 'score': float(score)})
                    seen_stems.add(stem)
                if len(keyphrases) >= top_n:
                    break
            
            return keyphrases
        except Exception as e:
            print(f"Error extracting keyphrases: {e}")
            return []

In [None]:
tfidf_extractor = TFIDFKeyphraseExtractor(ngram_range=(1, 2), top_n=5)

print("TF-IDF Keyphrase Extraction Results:")
print("="*60)

for i, context in enumerate(test_contexts):
    keyphrases = tfidf_extractor.extract(context)
    print(f"\nContext {i+1}: {context[:80]}...")
    print("Keyphrases:")
    for kp in keyphrases:
        print(f"  - {kp['text']} (score: {kp['score']:.4f})")
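
Because `TFIDFKeyphraseExtractor.extract` calls `fit_transform` on a single document, every term ends up with the same IDF. The sketch below illustrates this with scikit-learn's default smoothed IDF and contrasts it with a vectorizer fitted on a small corpus. The three toy sentences are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only)
docs = [
    "neural networks learn representations from data",
    "plants convert sunlight into glucose",
    "quantum computers process data with qubits",
]

# Fitted on one document: every term appears in 1 of 1 documents,
# so the smoothed IDF is identical (1.0) for all terms
single = TfidfVectorizer().fit([docs[0]])
print("All IDF values equal to 1:", np.allclose(single.idf_, 1.0))

# Fitted on the corpus: terms shared across documents get a lower IDF
corpus = TfidfVectorizer().fit(docs)
idf_shared = corpus.idf_[corpus.vocabulary_["data"]]    # appears in 2 documents
idf_rare = corpus.idf_[corpus.vocabulary_["neural"]]    # appears in 1 document
print(f"idf('data')={idf_shared:.3f}, idf('neural')={idf_rare:.3f}")
```

If a reference corpus is available, fitting the vectorizer once on that corpus and calling `transform` per document would let the IDF term down-weight generic words.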

## 4. KeyBERT-based Extractor (Advanced)

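A KeyBERT-style extractor implemented directly on `sentence-transformers` (the `keybert` package itself is not used). Candidate n-grams are ranked by the cosine similarity between their embedding and the embedding of the whole document.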

In [None]:
from sentence_transformers import SentenceTransformer

class KeyBERTExtractor:
    """Extract keyphrases using sentence embeddings."""
    
    def __init__(self, model_name='all-MiniLM-L6-v2', ngram_range=(1, 2), top_n=5):
        self.model = SentenceTransformer(model_name)
        self.ngram_range = ngram_range
        self.top_n = top_n
        self.stop_words = set(stopwords.words('english'))
    
    def get_candidates(self, text):
        """Extract candidate phrases from text."""
        words = word_tokenize(text.lower())
        candidates = []
        
        for n in range(self.ngram_range[0], self.ngram_range[1] + 1):
            for i in range(len(words) - n + 1):
                phrase = ' '.join(words[i:i+n])
                phrase_words = phrase.split()
                if not any(w in self.stop_words for w in phrase_words):
                    if all(w.isalpha() for w in phrase_words):
                        candidates.append(phrase)
        
        return list(set(candidates))
    
    def extract(self, text, top_n=None):
        """Extract keyphrases using embedding similarity."""
        if top_n is None:
            top_n = self.top_n
        
        candidates = self.get_candidates(text)
        if not candidates:
            return []
        
        doc_embedding = self.model.encode([text])
        candidate_embeddings = self.model.encode(candidates)
        
        similarities = cosine_similarity(doc_embedding, candidate_embeddings)[0]
        
        scored = list(zip(candidates, similarities))
        scored.sort(key=lambda x: x[1], reverse=True)
        
        keyphrases = []
        for phrase, score in scored[:top_n]:
            keyphrases.append({'text': phrase, 'score': float(score)})
        
        return keyphrases

In [None]:
keybert_extractor = KeyBERTExtractor(ngram_range=(1, 2), top_n=5)

print("KeyBERT Keyphrase Extraction Results:")
print("="*60)

for i, context in enumerate(test_contexts):
    keyphrases = keybert_extractor.extract(context)
    print(f"\nContext {i+1}: {context[:80]}...")
    print("Keyphrases:")
    for kp in keyphrases:
        print(f"  - {kp['text']} (score: {kp['score']:.4f})")
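
The extractor above returns the top-scoring candidates as they are, so overlapping n-grams (a bigram and a longer phrase that contains it, for example) can crowd the list. One common remedy is maximal marginal relevance (MMR), which trades relevance against redundancy. The following is a minimal NumPy sketch and not part of the notebook's pipeline: `mmr_select` and its `diversity` parameter are hypothetical names, and the vectors in the usage lines are toy stand-ins for sentence embeddings. It assumes no embedding is the zero vector.

```python
import numpy as np

def mmr_select(doc_vec, cand_vecs, candidates, top_n=5, diversity=0.5):
    """Pick candidates that are relevant to the document but not redundant."""
    if len(candidates) == 0:
        return []
    doc_vec = np.asarray(doc_vec, dtype=float)
    cand_vecs = np.asarray(cand_vecs, dtype=float)

    # Unit-normalize so that dot products are cosine similarities
    doc_unit = doc_vec / np.linalg.norm(doc_vec)
    cand_unit = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)

    relevance = cand_unit @ doc_unit      # similarity of each candidate to the document
    pairwise = cand_unit @ cand_unit.T    # similarity between candidates

    # Start with the most relevant candidate
    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(len(candidates)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        # Redundancy: highest similarity to anything already selected
        redundancy = pairwise[np.ix_(remaining, selected)].max(axis=1)
        mmr = (1 - diversity) * relevance[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)

    return [{"text": candidates[i], "score": float(relevance[i])} for i in selected]

# Toy vectors: the first two candidates are near-duplicates of each other
toy_doc = [1.0, 1.0, 1.0]
toy_vecs = [[1.0, 1.0, 0.0], [1.0, 0.9, 0.0], [0.0, 0.0, 1.0]]
toy_names = ["machine learning", "machine learning algorithms", "neural networks"]
print([kp["text"] for kp in mmr_select(toy_doc, toy_vecs, toy_names, top_n=2)])
```

With `diversity=0` the function falls back to plain top-n ranking by similarity, which matches the behavior of `KeyBERTExtractor.extract`.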

## 5. Load and Explore KP20k Dataset

In [None]:
from datasets import load_dataset

kp20k = load_dataset("midas/kp20k", trust_remote_code=True)

print(f"KP20k dataset:")
print(f"  Train: {len(kp20k['train'])} samples")
print(f"  Validation: {len(kp20k['validation'])} samples")
print(f"  Test: {len(kp20k['test'])} samples")

In [None]:
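sample = kp20k['train'][0]
# `document` may be a list of tokens rather than a string; join it for display
doc = sample['document']
doc_text = ' '.join(doc) if isinstance(doc, (list, tuple)) else doc
print("\nSample entry:")
print(f"Document: {doc_text[:300]}...")
print(f"\nExtractive keyphrases: {sample['extractive_keyphrases']}")
print(f"Abstractive keyphrases: {sample['abstractive_keyphrases']}")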

## 6. Evaluate Extractors on KP20k

Compare both extractors against the ground-truth extractive keyphrases, using lowercased exact matching on the top 5 predictions.

In [None]:
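def evaluate_extraction(extractor, samples, k=5):
    """Compute macro-averaged exact-match precision, recall and F1 at k."""
    precisions = []
    recalls = []
    f1s = []

    for sample in tqdm(samples, desc="Evaluating"):
        doc = sample['document']
        # Join the document if the dataset stores it as a list of tokens
        if isinstance(doc, (list, tuple)):
            doc = ' '.join(doc)
        true_kps = {kp.lower() for kp in sample['extractive_keyphrases']}

        # Skip samples that have no text or no extractive gold keyphrases
        if not true_kps or not doc:
            continue

        predicted = extractor.extract(doc, top_n=k)
        pred_kps = {kp['text'].lower() for kp in predicted}

        # An empty prediction counts as zero instead of being dropped,
        # so an extractor is not rewarded for failing on hard documents
        overlap = len(pred_kps & true_kps)
        precision = overlap / len(pred_kps) if pred_kps else 0.0
        recall = overlap / len(true_kps)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    # Guard against an empty evaluation set (np.mean([]) would return nan)
    if not precisions:
        return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}

    # Cast to plain floats so the results serialize cleanly to JSON
    return {
        'precision': float(np.mean(precisions)),
        'recall': float(np.mean(recalls)),
        'f1': float(np.mean(f1s))
    }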
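
Exact string matching penalizes near misses such as "neural networks" versus "neural network". Keyphrase benchmarks are often scored after stemming both sides, so the sketch below shows that variant for a single document, using the Porter stemmer already imported in this notebook. `stem_phrase` and `prf_at_k` are hypothetical helper names and are not used elsewhere in the notebook.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def stem_phrase(phrase):
    """Lowercase a phrase and stem each whitespace-separated word."""
    return " ".join(_stemmer.stem(word) for word in phrase.lower().split())

def prf_at_k(predicted, gold, k=5):
    """Precision, recall and F1 for one document after stemming both sides."""
    pred = {stem_phrase(p) for p in predicted[:k]}
    true = {stem_phrase(g) for g in gold}
    if not pred or not true:
        return 0.0, 0.0, 0.0
    overlap = len(pred & true)
    precision = overlap / len(pred)
    recall = overlap / len(true)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# "neural networks" now matches "neural network"; "learning" still misses "deep learning"
print(prf_at_k(["neural networks", "learning"], ["neural network", "deep learning"]))
```

Scores computed this way are not directly comparable with the exact-match numbers from `evaluate_extraction`, so the matching scheme should be reported alongside any result.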

In [None]:
eval_samples = kp20k['test'].select(range(min(500, len(kp20k['test']))))

print("Evaluating TF-IDF extractor...")
tfidf_results = evaluate_extraction(tfidf_extractor, eval_samples, k=5)
print(f"TF-IDF Results:")
print(f"  Precision@5: {tfidf_results['precision']:.4f}")
print(f"  Recall@5: {tfidf_results['recall']:.4f}")
print(f"  F1@5: {tfidf_results['f1']:.4f}")

In [None]:
print("\nEvaluating KeyBERT extractor...")
keybert_results = evaluate_extraction(keybert_extractor, eval_samples, k=5)
print(f"KeyBERT Results:")
print(f"  Precision@5: {keybert_results['precision']:.4f}")
print(f"  Recall@5: {keybert_results['recall']:.4f}")
print(f"  F1@5: {keybert_results['f1']:.4f}")

## 7. Create Production Extractor Class

In [None]:
class KeyphraseExtractor:
    """Production keyphrase extractor with fallback."""
    
    def __init__(self, use_keybert=True, model_name='all-MiniLM-L6-v2'):
        self.use_keybert = use_keybert
        
        if use_keybert:
            try:
                self.extractor = KeyBERTExtractor(model_name=model_name)
                print("Using KeyBERT extractor")
            except Exception as e:
                print(f"KeyBERT failed, falling back to TF-IDF: {e}")
                self.extractor = TFIDFKeyphraseExtractor()
                self.use_keybert = False
        else:
            self.extractor = TFIDFKeyphraseExtractor()
            print("Using TF-IDF extractor")
    
    def extract(self, text, top_n=5):
        """Extract keyphrases from text."""
        try:
            return self.extractor.extract(text, top_n=top_n)
        except Exception as e:
            print(f"Extraction error: {e}")
            return []
    
    def batch_extract(self, texts, top_n=5):
        """Extract keyphrases from multiple texts."""
        results = []
        for text in tqdm(texts, desc="Extracting"):
            results.append(self.extract(text, top_n=top_n))
        return results

In [None]:
extractor = KeyphraseExtractor(use_keybert=True)

test_text = "Artificial intelligence and machine learning are transforming healthcare through improved diagnostics and personalized treatment."
keyphrases = extractor.extract(test_text, top_n=5)

print(f"\nTest extraction:")
print(f"Input: {test_text}")
print(f"Keyphrases: {[kp['text'] for kp in keyphrases]}")

## 8. Save Extractor for Backend Use

In [None]:
import json

extractor_config = {
    "use_keybert": True,
    "model_name": "all-MiniLM-L6-v2",
    "default_top_n": 5,
    "ngram_range": [1, 2],
    "tfidf_evaluation": tfidf_results,
    "keybert_evaluation": keybert_results
}

with open(OUTPUT_DIR / "extractor_config.json", "w") as f:
    json.dump(extractor_config, f, indent=2)

print(f"Configuration saved to {OUTPUT_DIR / 'extractor_config.json'}")
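
On the backend side, the saved JSON has to be read back. Here is a small sketch, assuming the backend consumes the same `extractor_config.json` layout written above. `load_extractor_config` is a hypothetical helper. JSON has no tuple type, so `ngram_range` comes back as a list and is converted to a tuple here, which is the form consumers such as scikit-learn's `TfidfVectorizer` expect.

```python
import json
import tempfile
from pathlib import Path

def load_extractor_config(path):
    """Read the extractor config and normalize types that JSON cannot represent."""
    with open(path) as f:
        config = json.load(f)
    # JSON stores tuples as lists; convert back for downstream consumers
    config["ngram_range"] = tuple(config.get("ngram_range", (1, 2)))
    config.setdefault("default_top_n", 5)
    return config

# Round-trip demo in a temporary directory (the values here are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "extractor_config.json"
    demo_path.write_text(json.dumps({"use_keybert": True, "ngram_range": [1, 2]}))
    demo_config = load_extractor_config(demo_path)

print(demo_config["ngram_range"])  # (1, 2)
```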

In [None]:
extractor_code = '''
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # needed by newer NLTK releases
nltk.download('stopwords', quiet=True)

class KeyphraseExtractor:
    def __init__(self, model_name='all-MiniLM-L6-v2', ngram_range=(1, 2), top_n=5):
        self.model = SentenceTransformer(model_name)
        self.ngram_range = ngram_range
        self.top_n = top_n
        self.stop_words = set(stopwords.words('english'))
    
    def get_candidates(self, text):
        words = word_tokenize(text.lower())
        candidates = []
        for n in range(self.ngram_range[0], self.ngram_range[1] + 1):
            for i in range(len(words) - n + 1):
                phrase = ' '.join(words[i:i+n])
                phrase_words = phrase.split()
                if not any(w in self.stop_words for w in phrase_words):
                    if all(w.isalpha() for w in phrase_words):
                        candidates.append(phrase)
        return list(set(candidates))
    
    def extract(self, text, top_n=None):
        if top_n is None:
            top_n = self.top_n
        candidates = self.get_candidates(text)
        if not candidates:
            return []
        doc_embedding = self.model.encode([text])
        candidate_embeddings = self.model.encode(candidates)
        similarities = cosine_similarity(doc_embedding, candidate_embeddings)[0]
        scored = sorted(zip(candidates, similarities), key=lambda x: x[1], reverse=True)
        return [{"text": phrase, "score": float(score)} for phrase, score in scored[:top_n]]
'''

with open(OUTPUT_DIR / "keyphrase_extractor.py", "w") as f:
    f.write(extractor_code)

print(f"Extractor module saved to {OUTPUT_DIR / 'keyphrase_extractor.py'}")
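
Since the module is written from a string, a typo in it would only surface when the backend imports it. A cheap guard, sketched here with the standard library only, is to parse the source and confirm that the expected class is defined. `defines_class` is a hypothetical helper, and the short source string below stands in for `extractor_code`.

```python
import ast

def defines_class(source, class_name):
    """Return True if `source` parses as Python and defines `class_name` at top level."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.ClassDef) and node.name == class_name
        for node in tree.body
    )

# Stand-in for the real module source
sample_source = "class KeyphraseExtractor:\n    pass\n"
print(defines_class(sample_source, "KeyphraseExtractor"))  # True
```

In the notebook, the same check could be run on `extractor_code` before writing the file. It only verifies syntax and the class name, not that the imports resolve in the backend environment.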

## 9. Integration Example

In [None]:
def prepare_for_concept_map(text, extractor, top_n=8):
    """Prepare keyphrase data for React Flow concept map."""
    keyphrases = extractor.extract(text, top_n=top_n)
    
    nodes = [
        {
            "id": "main",
            "type": "topic",
            "data": {"label": "Main Topic"},
            "position": {"x": 400, "y": 300}
        }
    ]
    edges = []
    
    for i, kp in enumerate(keyphrases):
        angle = (2 * np.pi * i) / len(keyphrases)
        radius = 200
        x = 400 + radius * np.cos(angle)
        y = 300 + radius * np.sin(angle)
        
        node_id = f"kp_{i}"
        nodes.append({
            "id": node_id,
            "type": "keyphrase",
            "data": {
                "label": kp['text'],
                "score": kp['score']
            },
            "position": {"x": x, "y": y}
        })
        
        edges.append({
            "id": f"edge_{i}",
            "source": "main",
            "target": node_id,
            "animated": kp['score'] > 0.5
        })
    
    return {"nodes": nodes, "edges": edges}

concept_data = prepare_for_concept_map(test_contexts[0], extractor)
print("Concept map data structure:")
print(json.dumps(concept_data, indent=2))
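
React Flow identifies nodes and edges by `id` and resolves each edge through its `source` and `target` node ids, so a quick structural check before sending the payload to the frontend can catch dangling edges early. The sketch below is one way to do that. `validate_concept_map` is a hypothetical helper and the small graph is a hand-written example in the same shape as the output above.

```python
def validate_concept_map(graph):
    """Return a list of problems found in a {"nodes": [...], "edges": [...]} dict."""
    problems = []

    node_ids = [node["id"] for node in graph["nodes"]]
    if len(node_ids) != len(set(node_ids)):
        problems.append("duplicate node ids")

    # Every edge endpoint must refer to an existing node
    known = set(node_ids)
    for edge in graph["edges"]:
        for end in ("source", "target"):
            if edge[end] not in known:
                problems.append(f"edge {edge['id']} has unknown {end} {edge[end]!r}")

    return problems

# Hand-written example: the second edge points at a node that does not exist
example_graph = {
    "nodes": [{"id": "main"}, {"id": "kp_0"}],
    "edges": [
        {"id": "edge_0", "source": "main", "target": "kp_0"},
        {"id": "edge_1", "source": "main", "target": "kp_1"},
    ],
}
print(validate_concept_map(example_graph))
```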

## 10. Summary

### Implemented
- TF-IDF based keyphrase extraction (baseline)
- KeyBERT-style embedding-based extraction (advanced)
- Evaluation on KP20k dataset
- Production-ready extractor class
- Concept map data preparation

### Performance Comparison
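Scores vary by run. They are printed in Section 6 and saved under `tfidf_evaluation` and `keybert_evaluation` in `extractor_config.json`.

| Method | Precision@5 | Recall@5 | F1@5 |
|--------|-------------|----------|------|
| TF-IDF | see Section 6 | see Section 6 | see Section 6 |
| KeyBERT | see Section 6 | see Section 6 | see Section 6 |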
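
The table above can be filled in directly from the result dictionaries returned by `evaluate_extraction`. A minimal sketch follows. `results_table` is a hypothetical helper that assumes each result dict has `precision`, `recall` and `f1` keys, as in Section 6.

```python
def results_table(results_by_method, k=5):
    """Render evaluation results as a markdown table, one row per method."""
    header = f"| Method | Precision@{k} | Recall@{k} | F1@{k} |"
    divider = "|--------|-------------|----------|------|"
    rows = [
        f"| {name} | {r['precision']:.4f} | {r['recall']:.4f} | {r['f1']:.4f} |"
        for name, r in results_by_method.items()
    ]
    return "\n".join([header, divider, *rows])

# In this notebook it would be called as:
# print(results_table({"TF-IDF": tfidf_results, "KeyBERT": keybert_results}))
```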

### Next Steps
1. **06_vector_store_setup.ipynb** - Build ChromaDB index with extracted keyphrases
2. Integrate extractor with FastAPI backend
3. Connect to React Flow visualization