# Documented K/S/A Enhancement Pipeline

This notebook wraps the existing pipeline **as-is**. The full script is included in the next cell without any modifications.

**Note:** Running the pipeline requires your local environment, data paths, and API keys (as the original script does).

# LLM K/S/A Enhancement Pipeline  
**Purpose:** Enhance LAiSER-extracted skills with knowledge requirements from AFSC documentation using LLMs (Claude / GPT-4).  

**Author:** Kyle Hall  
**Project:** AFSC to Civilian Skills Translation Capstone  
**Last Updated:** October 2025 

## Section 1: Imports and Dependencies
This section pulls in all the Python tools the pipeline needs.

**What it includes & why:**
- **pandas** – reads/writes CSV files and helps reshape data.
- **json, pathlib, os** – open/save files and handle paths in a cross-platform way.
- **re (regex)** – finds patterns in text (e.g., “3.1 Knowledge” sections).
- **hashlib** – makes short, stable hashes so we can cache results or build node IDs.
- **datetime, time** – timestamps for logging and gentle pauses to avoid rate limits.
- **typing** – clarifies function inputs/outputs (lists, dicts, etc.).
- **getpass** – prompts for API keys on the command line without echoing them.
- **anthropic, openai** – talks to the LLMs (Claude / OpenAI models).
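
Before running the import cell, it can help to confirm that the third-party packages are installed. The sketch below is an optional pre-flight check, not part of the original script; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# The three third-party packages the pipeline imports
missing = missing_packages(["pandas", "anthropic", "openai"])
if missing:
    print("Install before running: pip install " + " ".join(missing))
```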

In [5]:
import json
import pandas as pd
import re
import hashlib
import os
from pathlib import Path
from datetime import datetime
import time
from typing import List, Dict, Tuple, Optional
from getpass import getpass

# External dependencies (install with: pip install anthropic openai)
import anthropic
from openai import OpenAI

## Section 2: Configuration and Parameters
This is the “settings panel” for the pipeline. You can update paths and knobs here without touching the core logic.

**Key settings:**
- **Paths** – where to read inputs and save outputs:
  - `INPUT_FILE`: LAiSER CSV (skills/abilities already extracted).
  - `CORPUS_FILE`: AFSC source texts (JSONL, one AFSC per line).
  - `OUTPUT_DIR`: folder where all enhanced outputs go.
- **Model options** – which LLM to use and how deterministic the replies should be:
  - `ENHANCER_PROVIDER` (claude or openai), `CLAUDE_MODEL`, `OPENAI_MODEL`.
  - `TEMPERATURE` – lower values give more consistent answers.
  - `SEED` (best-effort determinism for OpenAI; ignored if not supported).
- **Quality thresholds** – help control precision vs coverage:
  - `MIN_EXPLICIT_CONFIDENCE` – confidence score assigned to document-sourced knowledge (0.82 by default).
  - `MIN_INFERRED_CONFIDENCE` – lower score assigned to skill-inferred knowledge (0.45 by default; inference is used sparingly).
  - `MAX_KNOWLEDGE_PER_AFSC` – cap to prevent any AFSC from flooding the dataset.
- **Cache toggle** – `USE_CACHE` avoids repeat LLM calls (saves time/money).

**Takeaway:** If something looks off in outputs, adjust thresholds or caps here first.
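
The model settings are read with `os.getenv`, so they can be overridden without editing the configuration cell. The sketch below shows the pattern in isolation; the override values are hypothetical and would need to be set before the configuration cell runs.

```python
import os

# Hypothetical overrides (could also be exported in the shell instead)
os.environ["ENHANCER_PROVIDER"] = "openai"
os.environ["LLM_TEMPERATURE"] = "0.0"
os.environ.pop("LLM_SEED", None)  # leave the seed unset so the default applies

# Same lookup pattern the configuration cell uses: env var first, then default
provider = os.getenv("ENHANCER_PROVIDER", "claude")
temperature = float(os.getenv("LLM_TEMPERATURE", "0.1"))
seed = int(os.getenv("LLM_SEED", "42"))

print(provider, temperature, seed)
```

Environment values always arrive as strings, which is why the numeric settings are wrapped in `float(...)` and `int(...)`.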

In [None]:
# Data paths - adjust these for your environment
DATA_DIR = Path(r"C:\Users\Kyle\OneDrive\Desktop\Capstone\fall-2025-group6\src\Data\Manual Extraction")
INPUT_FILE = DATA_DIR / "ksa_output_simple" / "ksa_extractions.csv"  # LAiSER output
CORPUS_FILE = DATA_DIR / "corpus_manual_dataset.jsonl"  # AFSC documentation
OUTPUT_DIR = DATA_DIR / "ksa_enhanced"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Model Configuration
# These can be overridden via environment variables for flexibility
ENHANCER_PROVIDER = os.getenv('ENHANCER_PROVIDER', 'claude')  # 'claude' or 'openai'
CLAUDE_MODEL = os.getenv('CLAUDE_MODEL', 'claude-opus-4-1-20250805')  # Default Claude model (update as newer versions ship)
OPENAI_MODEL = os.getenv('OPENAI_MODEL', 'gpt-4o-mini')  # Reliable OpenAI model
TEMPERATURE = float(os.getenv('LLM_TEMPERATURE', '0.1'))  # Low temp for consistency
SEED = int(os.getenv('LLM_SEED', '42'))  # For reproducibility (best-effort)

# Quality Control Parameters
"""
These thresholds control the quality/quantity tradeoff:
- MIN_EXPLICIT_CONFIDENCE: Higher = fewer but better explicit knowledge items
- MIN_INFERRED_CONFIDENCE: Lower since inference is less certain
- MAX_KNOWLEDGE_PER_AFSC: Prevents any single AFSC from dominating
"""
USE_CACHE = True  # Cache LLM responses to save API costs
MIN_INFERRED_CONFIDENCE = 0.45  # Lower bar for inferred knowledge
MIN_EXPLICIT_CONFIDENCE = 0.82  # High bar for document-extracted knowledge
MAX_KNOWLEDGE_PER_AFSC = 15  # Cap to prevent flooding


## Section 3: Provenance Enhancer Class
This class is the “brains” that talks to the LLMs and keeps receipts (provenance).

**What it does:**
1. **Initialize LLM clients**  
   Securely reads your API keys with `getpass` and sets up Anthropic/OpenAI clients based on `ENHANCER_PROVIDER`.

2. **Cache results**  
   Builds a small on-disk cache (JSON) keyed by AFSC + content hash so the same prompt/text doesn’t trigger another paid request.

3. **Extract explicit knowledge from documents**  
   - Uses regex to find likely “Knowledge” sections (e.g., “3.1 Knowledge …”).
   - Sends only that snippet to the LLM with a very strict prompt:
     - Output must be **short, noun-phrase knowledge items**.
     - No bullets, numbers, quotes, or meta text.
     - If nothing explicit is present, the model must return `NONE`.
   - For each item, it also finds a small **evidence snippet** from the original text.
   - Applies **filters** (length check, dedupe, per-AFSC cap) before returning; each kept item gets the `MIN_EXPLICIT_CONFIDENCE` score.

4. **Infer knowledge from skills (only when sparse)**  
   - If a given AFSC didn’t yield enough explicit knowledge, it looks at the **top skills** and asks the LLM: “What theory/knowledge is necessarily required to do these skills?”
   - Labels these as **`skill_inferred`** and assigns a **lower confidence**.

5. **Utility helpers**
   - `_find_evidence` tries to pull the shortest sentence with 2+ key terms from the source text (so reviewers can verify the claim quickly).
   - `_identify_section` figures out what heading we’re under (useful for provenance tags).
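
To make the regex step in item 3 concrete, here is a minimal sketch that runs the first section pattern against a short invented passage. The sample text and the helper name `find_knowledge_section` are assumptions for illustration; real AFOCD/AFECD formatting may differ.

```python
import re

# Same shape as the first pattern used by the enhancer
SECTION_PATTERN = r"3\.1\.?\s*Knowledge[.\s]+(.+?)(?=\n3\.\d|\n2\.|$)"

# Invented sample text, formatted like a specialty description
sample_text = (
    "3. Specialty Qualifications:\n"
    "3.1. Knowledge. Knowledge is mandatory of aircraft operating procedures "
    "and air navigation principles.\n"
    "3.2. Education. Completion of high school is desirable.\n"
)

def find_knowledge_section(text: str) -> str:
    """Return the body of the first '3.1 Knowledge' section, or '' if absent."""
    match = re.search(SECTION_PATTERN, text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

section = find_knowledge_section(sample_text)
print(section)
```

The lookahead stops the capture at the next `3.x` heading, so only the knowledge paragraph is sent to the LLM.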

**Why it matters:**  
Everything added by the LLM is traceable (source text or skill context), repeatable (cache plus low temperature and a best-effort seed), and controlled (filters + caps).
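
The caching idea can be sketched independently of any LLM client. In the example below, `make_cache_key`, `cached_call`, and `fake_llm` are hypothetical names; the stand-in function replaces a paid API call, and hashing the full content string is a choice made for this sketch.

```python
import hashlib

def make_cache_key(prefix: str, afsc: str, content: str) -> str:
    """Build a stable key from a prefix, the AFSC, and a short content hash."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{afsc}_{digest}"

cache = {}   # in-memory stand-in for the JSON cache file
calls = []   # records how often the "LLM" is actually invoked

def fake_llm(text: str) -> list:
    """Stand-in for a paid LLM request (assumption for illustration)."""
    calls.append(text)
    return [text.upper()]

def cached_call(prefix: str, afsc: str, content: str, compute) -> list:
    """Return the cached result if present; otherwise compute and store it."""
    key = make_cache_key(prefix, afsc, content)
    if key not in cache:
        cache[key] = compute(content)
    return cache[key]

first = cached_call("doc_knowledge", "21A", "air navigation principles", fake_llm)
second = cached_call("doc_knowledge", "21A", "air navigation principles", fake_llm)
print(len(calls))  # the second request is served from the cache
```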


In [7]:
class ProvenanceEnhancer:
    def __init__(self):
        """
        Initialize the enhancer with selected LLM provider.
        Prompts for API keys securely (never stored in code).
        Sets up cache for cost-efficient operation.
        """
        print("API Authentication Required")
        print("-" * 40)
        
        self.provider = ENHANCER_PROVIDER
        self.claude = None
        self.openai = None
        
        # Initialize only the selected provider to minimize API setup
        if self.provider == 'claude':
            claude_key = getpass("Enter your Claude API key: ")
            self.claude = anthropic.Anthropic(api_key=claude_key)
            print(f"✓ Claude ({CLAUDE_MODEL}) initialized\n")
        elif self.provider == 'openai':
            openai_key = getpass("Enter your OpenAI API key: ")
            self.openai = OpenAI(api_key=openai_key)
            print(f"✓ OpenAI ({OPENAI_MODEL}) initialized\n")
        else:
            # Any other value initializes both clients; note that the request code below then falls through to OpenAI
            claude_key = getpass("Enter your Claude API key: ")
            openai_key = getpass("Enter your OpenAI API key: ")
            self.claude = anthropic.Anthropic(api_key=claude_key)
            self.openai = OpenAI(api_key=openai_key)
            print("✓ Both providers initialized\n")
        
        self.cache = self.load_cache()
        
    def load_cache(self) -> dict:
        """
        Load cached LLM responses from disk.
        Using a stable filename (not date-based) to maximize cache reuse.
        """
        cache_file = OUTPUT_DIR / "llm_cache_v2.json"
        if cache_file.exists():
            with open(cache_file, 'r', encoding='utf-8') as f:
                return json.load(f)
        return {}
    
    def save_cache(self):
        """Save cache to disk after each operation to prevent loss."""
        cache_file = OUTPUT_DIR / "llm_cache_v2.json"
        with open(cache_file, 'w', encoding='utf-8') as f:
            json.dump(self.cache, f, indent=2)
    
    def _make_cache_key(self, prefix: str, afsc: str, content: str) -> str:
        """
        Create deterministic cache key based on content.
        This ensures we reuse cached results for identical inputs,
        even across different runs or days.
        """
        # Hash the full content so inputs that share only their first characters get distinct keys
        content_hash = hashlib.md5(content.encode('utf-8')).hexdigest()[:8]
        return f"{prefix}_{afsc}_{content_hash}"
    
    def extract_knowledge_from_document(self, afsc: str, doc: Dict) -> List[Dict]:
        """
        Extract explicit knowledge requirements from AFSC documentation.
        
        This method:
        1. Searches for knowledge sections using regex patterns
        2. Sends relevant text to LLM for structured extraction
        3. Finds evidence snippets for each knowledge item
        4. Applies quality filters and deduplication
        5. Returns up to MAX_KNOWLEDGE_PER_AFSC items
        
        Args:
            afsc: Air Force Specialty Code
            doc: Document dictionary with text and metadata
            
        Returns:
            List of knowledge items with full provenance
        """
        text = doc.get('text', '')
        doc_id = doc.get('doc_id', '')
        title = doc.get('title', '')
        
        # Check cache first to avoid redundant API calls
        cache_key = self._make_cache_key('doc_knowledge', afsc, text)
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        knowledge_items = []
        
        # Regex patterns to find knowledge sections in military documents
        # These patterns match common formats in AFOCD/AFECD documents
        patterns = [
            r"3\.1\.?\s*Knowledge[.\s]+(.+?)(?=\n3\.\d|\n2\.|$)",  # Section 3.1 Knowledge
            r"Knowledge[.\s]*(?:is mandatory of|includes?|requires?)[:\s]+(.+?)(?=\n\d\.|$)",  # Knowledge statements
            r"(?:principles?|theory|concepts?)\s+of[:\s]+(.+?)(?=\n|\.{2,}|$)"  # Theory/concepts
        ]
        
        for pattern in patterns:
            matches = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)
            for match in matches:
                knowledge_text = match[:1500]  # Limit length for LLM context
                
                # Track location in document for provenance
                match_start = text.find(match)
                section = self._identify_section(text, match_start)
                page = doc.get('page_start', 0)
                
                # Carefully crafted prompt to get clean, specific knowledge items
                prompt = f"""Extract ONLY explicit knowledge requirements from the text below.

Text:
{knowledge_text}

Return ONLY knowledge items as plain lines.

Rules:
- Each item must be directly stated in the text (no inference).
- 3-6 words per line, noun phrases only (no leading verbs).
- Each item must be a complete, specific concept.
- No numbers, bullets, markdown, quotes, or explanations.
- No single words or fragments (bad: "flight", "aircraft operating").
- Prefer specific phrases (good: "aircraft operating procedures").
- Use lowercase, no trailing punctuation.
- Maximum 10 lines. No duplicates.
- If none are explicit, return: NONE

Examples (format and specificity):
aircraft operating procedures
air navigation principles
aviation meteorology
weapons system capabilities
mission planning procedures

Return ONLY the knowledge phrases, nothing else."""

                # Call appropriate LLM based on configuration
                if self.provider == 'claude' and self.claude:
                    response = self.claude.messages.create(
                        model=CLAUDE_MODEL,
                        max_tokens=300,
                        temperature=TEMPERATURE,
                        messages=[{"role": "user", "content": prompt}]
                    )
                    content = response.content[0].text
                elif self.openai:
                    # Pass the seed for best-effort reproducibility; a client that does not
                    # accept the keyword raises TypeError, so retry without it
                    kwargs = {
                        'model': OPENAI_MODEL,
                        'messages': [{"role": "user", "content": prompt}],
                        'temperature': TEMPERATURE,
                        'max_tokens': 300
                    }
                    try:
                        response = self.openai.chat.completions.create(seed=SEED, **kwargs)
                    except TypeError:
                        response = self.openai.chat.completions.create(**kwargs)
                    content = response.choices[0].message.content or ""
                else:
                    continue
                
                # Parse LLM response and build knowledge items
                if "NONE" not in content.upper():
                    for line in content.strip().split('\n'):
                        if not line.strip() or line.startswith('Quote:'):
                            continue
                        
                        # Clean the extracted text
                        line = line.strip('- •*').strip()
                        
                        # Filter by length (too short = fragment, too long = sentence)
                        if len(line) > 5 and len(line) < 100:
                            # Find supporting evidence in original text
                            evidence = self._find_evidence(knowledge_text, line)
                            
                            # Create knowledge item with full provenance
                            knowledge_items.append({
                                'text': line,
                                'type': 'knowledge',
                                'confidence': MIN_EXPLICIT_CONFIDENCE,
                                'source_method': 'document_explicit',
                                'evidence_snippet': evidence,
                                'doc_id': doc_id,
                                'title': title,
                                'category': doc.get('category', ''),
                                'afsc': afsc,
                                'section': section,
                                'page': page if page > 0 else None
                            })
        
        # Deduplication: remove very similar items
        seen_texts = set()
        unique_items = []
        for item in knowledge_items:
            normalized = ' '.join(item['text'].lower().split())
            if normalized not in seen_texts:
                seen_texts.add(normalized)
                unique_items.append(item)
        
        # Sort by specificity (longer = more specific) and cap
        unique_items.sort(key=lambda x: len(x['text']), reverse=True)
        unique_items = unique_items[:MAX_KNOWLEDGE_PER_AFSC]
        
        # Cache results and return
        self.cache[cache_key] = unique_items
        self.save_cache()
        return unique_items
    
    def infer_knowledge_from_skills(self, afsc: str, skills_df: pd.DataFrame, doc_meta: Dict) -> List[Dict]:
        """
        Infer implicit knowledge requirements from extracted skills.
        
        This fallback method is used when explicit knowledge extraction
        yields few results. It analyzes the top skills to determine
        what theoretical knowledge would be required.
        
        Args:
            afsc: Air Force Specialty Code
            skills_df: DataFrame of skills for this AFSC
            doc_meta: Document metadata for provenance
            
        Returns:
            List of inferred knowledge items with lower confidence
        """
        if skills_df.empty:
            return []
        
        # Select top 5 skills by confidence for inference
        top_skills = skills_df.nlargest(5, 'confidence') if 'confidence' in skills_df else skills_df.head(5)
        skills_text = '\n'.join([f"- {row.get('text', row.get('Raw Skill', ''))}" for _, row in top_skills.iterrows()])
        
        # Include evidence from source skills for traceability
        skills_evidence = '\n'.join([f"- {row.get('evidence_snippet', '')}" for _, row in top_skills.iterrows() if row.get('evidence_snippet')])
        
        # Check cache
        cache_key = self._make_cache_key('inferred', afsc, skills_text)
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        # Conservative prompt for inference
        prompt = f"""Given these verified Air Force skills for AFSC {afsc}:
{skills_text}

Supporting evidence from documents:
{skills_evidence if skills_evidence else 'N/A'}

What knowledge is NECESSARILY required to perform these skills?
- Return ONLY knowledge that is clearly implied by the skills
- If knowledge cannot be reliably inferred, return "NONE"
- Format: Brief theoretical concepts (3-7 words each)
- Maximum 3 items"""

        # Call LLM for inference
        if self.provider == 'claude' and self.claude:
            response = self.claude.messages.create(
                model=CLAUDE_MODEL,
                max_tokens=200,
                temperature=TEMPERATURE,
                messages=[{"role": "user", "content": prompt}]
            )
            content = response.content[0].text
        elif self.openai:
            kwargs = {
                'model': OPENAI_MODEL,
                'messages': [{"role": "user", "content": prompt}],
                'temperature': TEMPERATURE,
                'max_tokens': 200
            }
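            # Pass the seed when the client accepts it; otherwise retry without it
            try:
                response = self.openai.chat.completions.create(seed=SEED, **kwargs)
            except TypeError:
                response = self.openai.chat.completions.create(**kwargs)
            content = response.choices[0].message.content or ""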
        else:
            return []
        
        # Parse response and create inferred items with lower confidence
        knowledge_items = []
        if "NONE" not in content.upper():
            for line in content.strip().split('\n'):
                line = line.strip('- •*').strip()
                if len(line) > 5 and len(line) < 100:
                    source_skills = [row.get('text', row.get('Raw Skill', '')) for _, row in top_skills.iterrows()]
                    
                    knowledge_items.append({
                        'text': line,
                        'type': 'knowledge',
                        'confidence': MIN_INFERRED_CONFIDENCE,  # Lower confidence
                        'source_method': 'skill_inferred',
                        'evidence_snippet': f"Inferred from skills: {', '.join(source_skills[:2])}",
                        'doc_id': doc_meta.get('doc_id', ''),
                        'title': doc_meta.get('title', ''),
                        'category': doc_meta.get('category', ''),
                        'afsc': afsc,
                        'section': 'inferred',
                        'page': None,
                        'parent_skills': source_skills  # Track lineage
                    })
        
        self.cache[cache_key] = knowledge_items
        self.save_cache()
        return knowledge_items
    
    def _find_evidence(self, text: str, item: str) -> str:
        """
        Find the best evidence snippet for a knowledge item.
        
        Prefers complete sentences containing multiple key terms
        from the knowledge item, selecting the shortest matching
        sentence for clarity.
        
        Args:
            text: Source text to search
            item: Knowledge item to find evidence for
            
        Returns:
            Evidence snippet (max 200 chars)
        """
        # Extract meaningful key terms (skip short words)
        key_terms = [w for w in item.lower().split() if len(w) > 3][:3]
        
        if not key_terms:
            return text[:150].strip() + "..."
        
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', text)
        best_sentence = None
        best_score = 0
        
        for sent in sentences:
            sent = sent.strip()
            if len(sent) < 20:  # Too short to be meaningful
                continue
                
            sent_lower = sent.lower()
            # Count matching key terms
            matches = sum(1 for term in key_terms if term in sent_lower)
            
            if matches >= 2:  # Prefer sentences with multiple matches
                # Score: more matches better, shorter better
                score = (matches * 1000) - len(sent)
                if score > best_score:
                    best_score = score
                    best_sentence = sent
        
        if best_sentence:
            return best_sentence[:200]  # Cap length but keep complete
        
        # Fallback: context around first key term
        for term in key_terms:
            if term in text.lower():
                idx = text.lower().find(term)
                start = max(0, idx - 50)
                end = min(len(text), idx + len(term) + 100)
                return "..." + text[start:end].strip() + "..."
        
        return text[:150].strip() + "..."
    
    def _identify_section(self, text: str, position: int) -> str:
        """
        Identify the document section number at a given position.
        Used for provenance tracking.
        """
        before = text[:position]
        section_match = re.findall(r'(\d+\.\d+\.?\s*[A-Z][^.]+)', before[-200:])
        if section_match:
            return section_match[-1][:20]
        return "3.1 Knowledge"  # Default for knowledge sections


## Section 4: Quality Control Functions
These are small helpers that harden the output.

**What they do:**
- **`cap_k_per_afsc`** – makes sure we keep at most `MAX_KNOWLEDGE_PER_AFSC` knowledge items per AFSC (prioritizing higher confidence). This keeps the dataset balanced across AFSCs.

**Why it matters:**  
Even with good prompts, LLMs can overshoot. Caps and sanity checks keep the graph readable and prevent any one AFSC from dominating.
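
The per-AFSC cap can be illustrated on a toy table. This sketch uses `groupby(...).head(k)` rather than the explicit loop in the cell below; the function name `cap_knowledge` and the sample rows are hypothetical.

```python
import pandas as pd

# Toy data: three knowledge items for 21A, one knowledge item and one skill for 11F
toy = pd.DataFrame({
    "afsc": ["21A", "21A", "21A", "11F", "11F"],
    "type": ["knowledge", "knowledge", "knowledge", "knowledge", "skill"],
    "text": ["a", "b", "c", "d", "e"],
    "confidence": [0.82, 0.45, 0.90, 0.82, 0.70],
})

def cap_knowledge(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep at most k knowledge rows per AFSC (highest confidence first)."""
    is_knowledge = df["type"] == "knowledge"
    top_k = (
        df[is_knowledge]
        .sort_values("confidence", ascending=False)
        .groupby("afsc", sort=False)
        .head(k)
    )
    # Non-knowledge rows are never capped
    return pd.concat([top_k, df[~is_knowledge]], ignore_index=True)

capped = cap_knowledge(toy, k=2)
print(capped)
```

With `k=2`, the lowest-confidence knowledge item for 21A (`"b"`) is dropped and every other row is kept.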

In [8]:
def cap_k_per_afsc(df: pd.DataFrame, k: int = MAX_KNOWLEDGE_PER_AFSC) -> pd.DataFrame:
    """
    Defensive cap to ensure no AFSC exceeds the knowledge limit.
    
    This function is applied after all processing to guarantee
    that no single AFSC dominates the dataset, even if inference
    added extra items.
    
    Args:
        df: DataFrame with all K/S/A items
        k: Maximum knowledge items per AFSC
        
    Returns:
        DataFrame with knowledge items capped per AFSC
    """
    blocks = []
    for afsc in df['afsc'].dropna().unique():
        sub = df[df['afsc'] == afsc]
        # Keep top k knowledge items by confidence
        keep_k = sub[sub['type'] == 'knowledge'].sort_values('confidence', ascending=False).head(k)
        # Keep all non-knowledge items
        others = sub[sub['type'] != 'knowledge']
        blocks.append(pd.concat([keep_k, others], ignore_index=True))
    return pd.concat(blocks, ignore_index=True) if blocks else df


## Section 5: Main Pipeline Function
This is the orchestration layer that calls everything in order.

**Step-by-step:**
1. **Load LAiSER extractions**  
   Reads the CSV produced by your earlier pipeline. Renames columns to a standard format and tags all rows with `source_method = 'laiser'`.

2. **Load the AFSC corpus**  
   Opens the JSONL file of AFSC source texts (the “ground truth” documents).

3. **Initialize the enhancer**  
   Sets up LLM clients and loads the cache.

4. **For each AFSC:**
   - Keep all original LAiSER rows (skills/abilities).
   - Try to extract **explicit knowledge** from the AFSC document.
   - If explicit is **sparse**, infer **minimal** knowledge from the top few skills.
   - Add gentle delays to avoid API rate limiting.

5. **Combine & normalize**  
   - Merge the original + new knowledge.
   - Ensure required columns exist (text, type, confidence, evidence, etc.).
   - Normalize types to `skill / knowledge / ability`.
   - Bound/clean confidence values to `[0, 1]`.
   - Add `review_status` (explicit/inferred → `pending`; original LAiSER → `reviewed`).
   - Dedupe by `(afsc, type, text)` and **apply per-AFSC cap**.

6. **Save outputs**  
   - `ksa_extractions_enhanced.csv` – the main, cleaned dataset.
   - `qc_candidates.csv` – only new knowledge items that need human review.
   - Calls the graph exporter to produce `graph_export_enhanced.json`.
   - Writes `enhancement_stats.json` with counts and averages.

7. **QC prints**  
   Prints quick checks (confidence within `[0, 1]`, no empty text, at least 3 items per AFSC) to catch structural problems early.
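
The "Combine & normalize" step can be sketched on a small invented table. The rows below are made up, and the `where`/`isin` fallback to `skill` is one possible way to enforce the three valid types, shown here as an assumption rather than the script's exact logic.

```python
import pandas as pd

# Invented rows with messy types, mixed confidence formats, and one duplicate
rows = pd.DataFrame({
    "afsc": ["21A", "21A", "21A", "21A"],
    "type": ["Skill", "KNOWLEDGE", "knowledge", None],
    "text": ["plan missions", "air navigation principles",
             "air navigation principles", "lead teams"],
    "confidence": ["0.9", 1.4, 0.82, "n/a"],
    "source_method": ["laiser", "document_explicit", "document_explicit", "laiser"],
})

# Normalize types; anything outside the valid set falls back to 'skill'
valid_types = {"skill", "knowledge", "ability"}
types = rows["type"].str.lower()
rows["type"] = types.where(types.isin(valid_types), "skill")

# Coerce to numbers, bound to [0, 1], and default unparseable values to 0.5
rows["confidence"] = (
    pd.to_numeric(rows["confidence"], errors="coerce").clip(0, 1).fillna(0.5)
)

# New LLM-derived items await review; LAiSER items count as reviewed
new_sources = ["document_explicit", "skill_inferred"]
rows["review_status"] = rows["source_method"].isin(new_sources).map(
    {True: "pending", False: "reviewed"}
)

# Dedupe on (afsc, type, text), keeping the first occurrence
rows = rows.drop_duplicates(subset=["afsc", "type", "text"]).reset_index(drop=True)
print(rows)
```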

**Outputs to expect:**
- A well-formed CSV you can analyze, visualize, or load to a database.
- A graph JSON ready for Neo4j or other graph tooling.
- A small QC file you can hand to reviewers.


In [9]:
def enhance_with_provenance():
    """
    Main pipeline function that orchestrates the enhancement process.
    
    Pipeline steps:
    1. Load LAiSER extractions (baseline K/S/A)
    2. Load AFSC corpus documents
    3. Initialize LLM enhancer
    4. For each AFSC:
       - Extract explicit knowledge from documents
       - Infer implicit knowledge if needed
    5. Combine, validate, deduplicate, and cap all items
    6. Generate outputs (CSV, JSON, statistics)
    7. Run quality checks
    8. Print summary
    """
    
    print("=" * 60)
    print("PROPOSAL-COMPLIANT K/S/A ENHANCEMENT")
    print("=" * 60)
    
    # ========== STEP 1: Load LAiSER Baseline ==========
    print("\n1. Loading LAiSER extractions...")
    base_df = pd.read_csv(INPUT_FILE)
    
    # Standardize column names for compatibility with rest of pipeline
    base_df = base_df.rename(columns={
        'Raw Skill': 'text',
        'ksa_type': 'type',
        'Correlation Coefficient': 'confidence'
    })
    
    # Track source of existing items
    base_df['source_method'] = 'laiser'
    
    print(f"   Loaded {len(base_df)} items")
    
    # ========== STEP 2: Load AFSC Corpus ==========
    print("\n2. Loading source corpus...")
    corpus = {}
    with open(CORPUS_FILE, 'r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # Skip blank lines in the JSONL file
            doc = json.loads(line)
            corpus[doc['afsc']] = doc
    print(f"   Loaded {len(corpus)} documents")
    
    # ========== STEP 3: Initialize Enhancer ==========
    print("\n3. Initializing enhancement engine...")
    enhancer = ProvenanceEnhancer()
    
    # ========== STEP 4: Process Each AFSC ==========
    all_items = []
    new_knowledge = []
    
    print("\n4. Extracting knowledge with provenance...")
    for afsc in sorted(base_df['afsc'].unique()):
        print(f"\n   Processing {afsc}...")
        
        # Keep existing LAiSER items
        afsc_base = base_df[base_df['afsc'] == afsc].copy()
        all_items.extend(afsc_base.to_dict('records'))
        
        # Extract explicit knowledge from document
        doc_knowledge = []  # Initialize to prevent UnboundLocalError
        if afsc in corpus:
            doc_knowledge = enhancer.extract_knowledge_from_document(afsc, corpus[afsc])
            new_knowledge.extend(doc_knowledge)
            if len(doc_knowledge) > 0:
                print(f"     Found {len(doc_knowledge)} explicit knowledge items")
            else:
                print(f"     No explicit knowledge found in document")
        else:
            print(f"     No corpus document found; skipping explicit knowledge extraction")
        
        # Infer knowledge from skills if explicit extraction was insufficient
        afsc_skills = afsc_base[afsc_base['type'] == 'skill'].copy()
        afsc_skills['confidence'] = pd.to_numeric(afsc_skills['confidence'], errors='coerce').fillna(0.0)
        
        if len(afsc_skills) > 0 and len(doc_knowledge) < 2:  # Conservative inference threshold
            doc_meta = corpus.get(afsc, {})
            inferred = enhancer.infer_knowledge_from_skills(afsc, afsc_skills, doc_meta)
            new_knowledge.extend(inferred)
            print(f"     Inferred {len(inferred)} knowledge items from skills")
        
        time.sleep(0.3)  # Rate limiting to avoid API throttling
    
    # ========== STEP 5: Combine and Validate ==========
    print(f"\n5. Combining and validating...")
    enhanced_df = pd.DataFrame(all_items + new_knowledge)
    
    # Ensure all required columns exist
    required_cols = ['text', 'type', 'confidence', 'source_method', 
                     'evidence_snippet', 'doc_id', 'title', 'category', 'afsc']
    
    for col in required_cols:
        if col not in enhanced_df.columns:
            enhanced_df[col] = ''
    
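    # Normalize types to lowercase; anything outside the three valid values
    # (including missing types) falls back to 'skill'
    valid_types = {'knowledge', 'skill', 'ability'}
    enhanced_df['type'] = enhanced_df['type'].astype(str).str.lower().str.strip()
    enhanced_df['type'] = enhanced_df['type'].where(enhanced_df['type'].isin(valid_types), 'skill')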
    
    # Bound confidence scores to [0, 1]
    enhanced_df['confidence'] = pd.to_numeric(enhanced_df['confidence'], errors='coerce').clip(0, 1).fillna(0.5)
    
    # Fill missing values
    enhanced_df['evidence_snippet'] = enhanced_df['evidence_snippet'].fillna('')
    enhanced_df['source_method'] = enhanced_df['source_method'].fillna('unknown')
    
    # Add review status for QC workflow
    enhanced_df['review_status'] = enhanced_df.apply(
        lambda x: 'pending' if x['source_method'] in ['skill_inferred', 'document_explicit'] else 'reviewed',
        axis=1
    )
    
    # Remove duplicates (considering AFSC, type, and text)
    enhanced_df = enhanced_df.drop_duplicates(subset=['afsc', 'type', 'text']).reset_index(drop=True)
    
    # Apply defensive cap per AFSC
    enhanced_df = cap_k_per_afsc(enhanced_df, MAX_KNOWLEDGE_PER_AFSC)
    
    # Sort for readability
    enhanced_df = enhanced_df.sort_values(
        ['afsc', 'type', 'confidence'],
        ascending=[True, True, False]
    )
    
    # ========== STEP 6: Generate Outputs ==========
    print(f"\n6. Saving outputs...")
    
    # Main enhanced dataset
    output_file = OUTPUT_DIR / "ksa_extractions_enhanced.csv"
    enhanced_df.to_csv(output_file, index=False)
    
    # QC candidates file (new knowledge items only)
    qc_candidates = pd.DataFrame(new_knowledge)
    if len(qc_candidates) > 0:
        qc_candidates['review_status'] = 'pending'
        qc_candidates['origin'] = 'enhancement'
    qc_file = OUTPUT_DIR / "qc_candidates.csv"
    qc_candidates.to_csv(qc_file, index=False)
    
    # Graph export for visualization
    graph_file = OUTPUT_DIR / "graph_export_enhanced.json"
    graph_data = export_graph(enhanced_df, graph_file)
    
    # Generate comprehensive statistics
    mask_explicit = (enhanced_df['source_method'] == 'document_explicit')
    mask_inferred = (enhanced_df['source_method'] == 'skill_inferred')
    mask_laiser = (enhanced_df['source_method'] == 'laiser')
    
    stats = {
        'enhancement_date': datetime.now().isoformat(),
        'provider': ENHANCER_PROVIDER,
        'model': CLAUDE_MODEL if ENHANCER_PROVIDER == 'claude' else OPENAI_MODEL,
        'original_items': len(base_df),
        'enhanced_items': len(enhanced_df),
        'knowledge_added': len(new_knowledge),
        'explicit_knowledge': sum(1 for k in new_knowledge if k.get('source_method') == 'document_explicit'),
        'inferred_knowledge': sum(1 for k in new_knowledge if k.get('source_method') == 'skill_inferred'),
        'type_distribution': enhanced_df['type'].value_counts().to_dict(),
        'source_distribution': enhanced_df['source_method'].value_counts().to_dict(),
        'avg_confidence': {
            'overall': float(enhanced_df['confidence'].mean()),
            'explicit': float(enhanced_df.loc[mask_explicit, 'confidence'].mean()) if mask_explicit.any() else 0.0,
            'inferred': float(enhanced_df.loc[mask_inferred, 'confidence'].mean()) if mask_inferred.any() else 0.0,
            'laiser': float(enhanced_df.loc[mask_laiser, 'confidence'].mean()) if mask_laiser.any() else 0.0
        },
        'graph_stats': {
            'nodes': len(graph_data['nodes']),
            'edges': len(graph_data['edges'])
        }
    }
    
    stats_file = OUTPUT_DIR / "enhancement_stats.json"
    with open(stats_file, 'w') as f:
        json.dump(stats, f, indent=2)
    
    # ========== STEP 7: Quality Checks ==========
    print("\n7. Running quality checks...")
    problems = []
    
    # Check confidence bounds
    bad_conf = enhanced_df[(enhanced_df['confidence'] < 0) | (enhanced_df['confidence'] > 1)]
    if not bad_conf.empty:
        problems.append(("confidence_out_of_bounds", len(bad_conf)))
    
    # Check for empty text (fill NaN first; astype(str) alone would turn it into "nan")
    empty_text = enhanced_df[enhanced_df['text'].fillna('').astype(str).str.strip() == ""]
    if not empty_text.empty:
        problems.append(("empty_text", len(empty_text)))
    
    # Check each AFSC has minimum coverage
    by_afsc = enhanced_df.groupby('afsc').size()
    low_afsc = by_afsc[by_afsc < 3]
    if not low_afsc.empty:
        problems.append(("afsc_with_lt3_items", low_afsc.to_dict()))
    
    if problems:
        print(f"   QC issues found: {problems}")
    else:
        print("   ✓ All quality checks passed")
    
    # ========== STEP 8: Print Summary ==========
    print("\n" + "=" * 60)
    print("ENHANCEMENT COMPLETE")
    print("=" * 60)
    print(f"Original items: {stats['original_items']}")
    print(f"Enhanced items: {stats['enhanced_items']}")
    print(f"Knowledge added: {stats['knowledge_added']}")
    print(f"  - Explicit: {stats['explicit_knowledge']}")
    print(f"  - Inferred: {stats['inferred_knowledge']}")
    print("\nType distribution:")
    for t, count in stats['type_distribution'].items():
        print(f"  {t}: {count} ({count/len(enhanced_df)*100:.1f}%)")
    print("\nFiles saved:")
    print(f"  - {output_file.name} (main dataset)")
    print(f"  - {qc_file.name} (QC candidates)")
    print(f"  - {graph_file.name} (graph export)")
    print(f"  - {stats_file.name} (statistics)")
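
The three Step 7 checks can also be tried in isolation. The sketch below is illustrative only and is not part of the pipeline: `run_quality_checks` and the sample rows are hypothetical. It works on plain row dicts (for example what `enhanced_df.to_dict('records')` would give) and assumes the same `afsc`, `text`, and `confidence` fields and the same threshold of three items per AFSC.

```python
from collections import Counter

def run_quality_checks(records, min_items=3):
    """Hypothetical helper: the Step 7 checks applied to a list of row dicts."""
    problems = []

    # Confidence must lie in [0, 1]
    bad_conf = [r for r in records if not 0 <= r['confidence'] <= 1]
    if bad_conf:
        problems.append(("confidence_out_of_bounds", len(bad_conf)))

    # Text must be non-empty after stripping whitespace (None counts as empty)
    empty_text = [r for r in records if not str(r.get('text') or '').strip()]
    if empty_text:
        problems.append(("empty_text", len(empty_text)))

    # Each AFSC needs at least `min_items` rows
    per_afsc = Counter(r['afsc'] for r in records)
    low_afsc = {afsc: n for afsc, n in per_afsc.items() if n < min_items}
    if low_afsc:
        problems.append((f"afsc_with_lt{min_items}_items", low_afsc))

    return problems

# Made-up sample rows for illustration
sample = [
    {'afsc': '1N4', 'text': 'intelligence analysis', 'confidence': 0.9},
    {'afsc': '1N4', 'text': '  ', 'confidence': 0.7},
    {'afsc': '1N4', 'text': 'threat assessment', 'confidence': 1.2},
    {'afsc': '21A', 'text': 'aircraft maintenance', 'confidence': 0.8},
]
problems = run_quality_checks(sample)
print(problems)
```

Each entry is a `(check_name, detail)` tuple, matching the shape of the `problems` list that Step 7 prints.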

## Section 6: Graph Export Function
Builds a Neo4j-friendly JSON graph from the enhanced dataframe.

**How it works:**
- **Nodes**
  - One node per AFSC: `{"id": "21A", "type": "AFSC", "properties": {"title": ..., "category": ...}}`
  - One node per K/S/A item, with a **stable hash ID** so the same text + AFSC doesn’t duplicate on reload.
  - Each K/S/A node stores: `text`, `confidence`, `source_method`, and any ESCO tag found.

- **Edges**
  - For every `(AFSC, KSA)` row, creates a `REQUIRES_{TYPE}` edge from the AFSC node to that KSA node.
  - Edge properties include `confidence`, `source_method`, and an **evidence snippet truncated to 200 characters**.

- **Metadata**
  - Total node and edge counts, plus node counts by type (AFSC, knowledge, skill, ability); handy for quick health checks.

**Why it matters:**  
This format is easy to import into Neo4j (via APOC or script) and lets you run queries like “show all knowledge required by AFSC 1N4”.
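
As a minimal sketch of the ID scheme and record shapes described above: the helper name `ksa_node_id` and the example row are hypothetical, while the hashing recipe (MD5 of `"{afsc}_{text}"`, first 12 hex characters, prefixed with the item type) follows `export_graph`.

```python
import hashlib

def ksa_node_id(afsc: str, ksa_type: str, text: str) -> str:
    """Hypothetical helper reproducing the K/S/A node ID recipe from export_graph."""
    text_for_hash = f"{afsc}_{text}"
    digest = hashlib.md5(text_for_hash.encode()).hexdigest()[:12]
    return f"{ksa_type}_{digest}"

# Made-up example row
afsc, ksa_type, text = "1N4", "knowledge", "intelligence analysis methods"
node_id = ksa_node_id(afsc, ksa_type, text)

# Node and edge records in the shape the export writes (values are illustrative)
node = {
    "id": node_id,
    "type": ksa_type.upper(),
    "properties": {"text": text, "confidence": 0.85,
                   "source_method": "document_explicit", "esco_tag": ""},
}
edge = {
    "source": afsc,
    "target": node_id,
    "relationship": f"REQUIRES_{ksa_type.upper()}",
    "properties": {"confidence": 0.85, "evidence": "",
                   "source_method": "document_explicit"},
}

print(node_id)
print(edge["relationship"])
```

Because the AFSC is part of the hashed string, the same requirement text under two different AFSCs yields two distinct nodes rather than one shared node.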


In [10]:
def export_graph(df: pd.DataFrame, output_path: Path) -> dict:
    """
    Export enhanced data as graph structure for visualization.
    
    Creates a graph with:
    - AFSC nodes (job codes)
    - K/S/A nodes (requirements)
    - REQUIRES_<TYPE> edges (AFSC -> K/S/A relationships)
    
    Each node and edge includes properties for filtering and analysis.
    K/S/A node IDs are deterministic (hash-based) for consistency across runs;
    AFSC nodes use the AFSC code itself as their ID.
    
    Args:
        df: Enhanced dataframe with all K/S/A items
        output_path: Path to save graph JSON
        
    Returns:
        Graph data dictionary with nodes, edges, and metadata
    """
    nodes = []
    edges = []
    seen_nodes = set()
    
    # Create AFSC nodes (one per unique AFSC)
    for afsc in df['afsc'].unique():
        if pd.notna(afsc):
            afsc_data = df[df['afsc'] == afsc].iloc[0]
            nodes.append({
                'id': str(afsc),
                'type': 'AFSC',
                'properties': {
                    'title': afsc_data.get('title', ''),
                    'category': afsc_data.get('category', '')
                }
            })
            seen_nodes.add(str(afsc))
    
    # Create K/S/A nodes and edges
    for _, row in df.iterrows():
        # Create stable node ID using hash (ensures consistency across runs)
        text_for_hash = f"{row['afsc']}_{row['text']}"
        node_id = f"{row['type']}_{hashlib.md5(text_for_hash.encode()).hexdigest()[:12]}"
        
        # Add node if not already created
        if node_id not in seen_nodes:
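            # Flexible ESCO column handling (different formats possible).
            # Take the first non-null, non-empty value; chaining `or` would
            # stop at NaN, which is truthy.
            esco_tag = next(
                (v for v in (row.get(c) for c in ('esco_tag', 'Skill Tag', 'ESCO', 'Esco Tag'))
                 if pd.notna(v) and v != ''),
                ''
            )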
            
            nodes.append({
                'id': node_id,
                'type': row['type'].upper(),
                'properties': {
                    'text': row['text'],
                    'confidence': float(row['confidence']),
                    'source_method': row.get('source_method', 'unknown'),
                    'esco_tag': esco_tag
                }
            })
            seen_nodes.add(node_id)
        
        # Create edge from AFSC to K/S/A
        edges.append({
            'source': str(row['afsc']),
            'target': node_id,
            'relationship': f"REQUIRES_{row['type'].upper()}",
            'properties': {
                'confidence': float(row['confidence']),
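                # NaN is a truthy float and cannot be sliced, so keep strings only
                'evidence': (row.get('evidence_snippet') if isinstance(row.get('evidence_snippet'), str) else '')[:200],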
                'source_method': row.get('source_method', 'unknown')
            }
        })
    
    # Create graph structure with metadata
    graph_data = {
        'nodes': nodes,
        'edges': edges,
        'metadata': {
            'created': datetime.now().isoformat(),
            'total_nodes': len(nodes),
            'total_edges': len(edges),
            'afsc_count': len([n for n in nodes if n['type'] == 'AFSC']),
            'knowledge_count': len([n for n in nodes if n['type'] == 'KNOWLEDGE']),
            'skill_count': len([n for n in nodes if n['type'] == 'SKILL']),
            'ability_count': len([n for n in nodes if n['type'] == 'ABILITY'])
        }
    }
    
    # Save to file
    with open(output_path, 'w') as f:
        json.dump(graph_data, f, indent=2)
    
    return graph_data
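
The returned dictionary can be queried directly, without a graph database. The sketch below answers a question like "show all knowledge required by AFSC 1N4" over a tiny hand-built graph in the same shape. The helper `ksa_for_afsc`, the node IDs, and the texts are all hypothetical. In practice `graph_data` would be the return value of `export_graph` or the result of `json.load` on the saved file.

```python
def ksa_for_afsc(graph_data: dict, afsc: str, ksa_type: str = "KNOWLEDGE") -> list:
    """Hypothetical helper: texts of K/S/A nodes linked to `afsc` by REQUIRES_<ksa_type> edges."""
    nodes_by_id = {n['id']: n for n in graph_data['nodes']}
    relationship = f"REQUIRES_{ksa_type.upper()}"
    return [
        nodes_by_id[e['target']]['properties']['text']
        for e in graph_data['edges']
        if e['source'] == afsc and e['relationship'] == relationship
    ]

# Tiny hand-built graph (IDs and texts are made up)
graph_data = {
    'nodes': [
        {'id': '1N4', 'type': 'AFSC', 'properties': {'title': '', 'category': ''}},
        {'id': 'knowledge_aaaaaaaaaaaa', 'type': 'KNOWLEDGE',
         'properties': {'text': 'intelligence analysis methods'}},
        {'id': 'skill_bbbbbbbbbbbb', 'type': 'SKILL',
         'properties': {'text': 'briefing preparation'}},
    ],
    'edges': [
        {'source': '1N4', 'target': 'knowledge_aaaaaaaaaaaa',
         'relationship': 'REQUIRES_KNOWLEDGE', 'properties': {}},
        {'source': '1N4', 'target': 'skill_bbbbbbbbbbbb',
         'relationship': 'REQUIRES_SKILL', 'properties': {}},
    ],
}

knowledge = ksa_for_afsc(graph_data, '1N4')
skills = ksa_for_afsc(graph_data, '1N4', ksa_type='skill')
print(knowledge)
print(skills)
```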


## Section 7: Entry Point
This is the standard Python “main” hook.  
When you run the file directly (`python documented_ksa_pipeline.py`), it calls `enhance_with_provenance()`.

**Tip (Jupyter users):**  
In a notebook, `__name__` is also `"__main__"`, so running the cell below starts the pipeline. The guard is optional there: you can instead run the cell that defines `enhance_with_provenance()` and call it from a separate cell.

In [11]:

if __name__ == "__main__":
    enhance_with_provenance()


PROPOSAL-COMPLIANT K/S/A ENHANCEMENT

1. Loading LAiSER extractions...
   Loaded 73 items

2. Loading source corpus...
   Loaded 12 documents

3. Initializing enhancement engine...
API Authentication Required
----------------------------------------
✓ Claude (claude-opus-4-1-20250805) initialized


4. Extracting knowledge with provenance...

   Processing 11F3...
     Found 7 explicit knowledge items

   Processing 12B...
     Found 7 explicit knowledge items

   Processing 14F...
     Found 8 explicit knowledge items

   Processing 14N...
     Found 11 explicit knowledge items

   Processing 1A3X1...
     Found 15 explicit knowledge items

   Processing 1C3...
     Found 14 explicit knowledge items

   Processing 1N0...
     Found 10 explicit knowledge items

   Processing 1N4...
     Found 10 explicit knowledge items

   Processing 21A...
     Found 14 explicit knowledge items

   Processing 21M...
     No explicit knowledge found in document
     Inferred 3 knowledge items from skil