# LAiSER K/S/A Extraction Pipeline for Military AFSCs

## Overview
This pipeline extracts Knowledge, Skills, and Abilities (K/S/A) from military Air Force Specialty Code (AFSC) descriptions using the LAiSER (Leveraging AI for Skill Extraction & Research) framework.

### Key Features:
- Processes 12 AFSC descriptions from Operations, Intelligence, and Maintenance categories
- Extracts skills mapped to the ESCO taxonomy via LAiSER
- Classifies extractions into Knowledge, Skills, or Abilities
- Generates evidence snippets for traceability
- Applies quality filters (confidence threshold, stoplist)
- Produces QC samples for validation
- Exports graph-ready structure for database import

### Pipeline Outputs:
1. `ksa_extractions.csv` - Filtered K/S/A items with metadata
2. `qc_sample.csv` - Stratified sample for quality control
3. `extraction_stats.json` - Summary statistics
4. `graph_export.json` - Graph database structure

## Section 1: Environment Setup and Configuration

This section imports required libraries and sets up the pipeline configuration.

### Key Components:
- **Path Configuration**: Sets input/output directories (hard-coded for one Windows machine)
- **Quality Thresholds**: MIN_CONFIDENCE (0.55) filters low-quality extractions
- **Generic Stoplist**: Removes irrelevant terms (navy/maritime references)
- **KSA Patterns**: Keywords for classifying extracted items into K/S/A categories

The configuration is deliberately minimal to maintain simplicity and reproducibility.
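The hard-coded Windows path ties the notebook to a single machine. As a minimal sketch of a more portable setup, the data directory could be read from an environment variable. The variable name `KSA_DATA_DIR` and the helper `resolve_data_dir` are assumptions made for this illustration and are not part of the original pipeline; only the two file names come from the configuration cell.

```python
import os
from pathlib import Path


def resolve_data_dir(env_var: str = "KSA_DATA_DIR") -> Path:
    """Return the data directory named by an environment variable.

    KSA_DATA_DIR is a hypothetical name chosen for this sketch.
    Falls back to the current working directory when it is unset.
    """
    raw = os.environ.get(env_var)
    return Path(raw).expanduser() if raw else Path.cwd()


data_dir = resolve_data_dir()
input_file = data_dir / "corpus_manual_dataset.jsonl"   # same name as INPUT_FILE
output_dir = data_dir / "ksa_output_simple"             # same name as OUTPUT_DIR

print(f"Input file: {input_file}")
print(f"Output directory: {output_dir}")
```

With this approach the notebook itself stays unchanged across machines; only the environment variable differs.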

In [14]:
import json
import re
from pathlib import Path
from getpass import getpass
from datetime import datetime

import pandas as pd
from laiser.skill_extractor import Skill_Extractor

# Configuration
DATA_DIR = Path(r"C:\Users\Kyle\OneDrive\Desktop\Capstone\fall-2025-group6\src\Data\Manual Extraction")
INPUT_FILE = DATA_DIR / "corpus_manual_dataset.jsonl"
OUTPUT_DIR = DATA_DIR / "ksa_output_simple"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Quality thresholds
MIN_CONFIDENCE = 0.55
GENERIC_STOPLIST = [
    "navy operations", "naval operations", "small vessel",
    "maritime", "marine operations", "ship operations"
]

# K/S/A classification patterns
KSA_PATTERNS = {
    "knowledge": ["knowledge", "understand", "theory", "principles", "concepts", 
                  "awareness", "familiarity"],
    "skill": ["operate", "perform", "execute", "conduct", "implement", 
              "maintain", "repair", "troubleshoot", "analyze"],
    "ability": ["able to", "capability", "capacity", "aptitude", "competence", 
                "proficiency", "adapt", "lead", "manage", "coordinate"]
}

print(f"Pipeline initialized at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Output directory: {OUTPUT_DIR}")

Pipeline initialized at 2025-09-30 09:38:03
Output directory: C:\Users\Kyle\OneDrive\Desktop\Capstone\fall-2025-group6\src\Data\Manual Extraction\ksa_output_simple


## Section 2: Data Loading and Preparation

### Functions:
- `load_jsonl()`: Loads the JSONL dataset containing AFSC descriptions
- Data preparation: Renames columns to match LAiSER's expected format
  - `doc_id` → `job_id` (unique identifier)
  - `text` → `description` (main content for extraction)

### Preprocessing:
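- Validates that the required columns are present before any renaming
- Minimal text cleaning: fills missing text with an empty string, collapses runs of whitespace, and trims the ends
- No further transformations, so the original military terminology is preserved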
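The renaming and whitespace cleanup can be illustrated on a single record without pandas. This is a plain-Python sketch of the same logic, not the notebook's code; the record contents and the helper name `normalize_whitespace` are made up for the example.

```python
import json
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text or "").strip()


# Hypothetical record, shaped like one line of the JSONL input
line = json.dumps({"doc_id": "demo-001", "text": "  Plans and\n conducts   missions.\t"})
record = json.loads(line)

# Rename keys the way the pipeline renames columns for LAiSER
prepared = {
    "job_id": record["doc_id"],
    "description": normalize_whitespace(record["text"]),
}
print(prepared)
```

Line breaks and tabs inside the description become single spaces, while the wording itself is left untouched.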

In [15]:
def load_jsonl(fp: Path) -> pd.DataFrame:
    """Load JSONL file into DataFrame"""
    rows = []
    with open(fp, "r", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))
    return pd.DataFrame(rows)

# Load dataset
df = load_jsonl(INPUT_FILE)
print(f"✓ Loaded {len(df)} AFSC records from {INPUT_FILE.name}")

# Validate columns
required_cols = {"doc_id", "text", "title", "afsc", "category"}
assert required_cols.issubset(df.columns), f"Missing columns: {required_cols - set(df.columns)}"

# Prepare for LAiSER
laiser_df = df[["doc_id", "text", "title", "afsc", "category"]].rename(
    columns={"doc_id": "job_id", "text": "description"}
)

# Minimal preprocessing
laiser_df["description"] = laiser_df["description"].fillna("").str.replace(r"\s+", " ", regex=True).str.strip()

print(f"✓ Data prepared for LAiSER extraction")
print(f"  Categories: {laiser_df['category'].unique()}")
print(f"  Sample AFSC: {laiser_df.iloc[0]['afsc']} - {laiser_df.iloc[0]['title']}")

✓ Loaded 12 AFSC records from corpus_manual_dataset.jsonl
✓ Data prepared for LAiSER extraction
  Categories: ['Operations' 'Intelligence' 'Maintenance']
  Sample AFSC: 11F3 - FIGHTER PILOT


## Section 3: LAiSER Initialization

### Model Configuration:
- **Model**: Microsoft Phi-2 (2.7B parameters, small enough to run on CPU)
- **Mode**: CPU-only (avoids vLLM/GPU dependencies on Windows)
- **Authentication**: Prompts for a Hugging Face token, which is passed to LAiSER for model access

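### Why Phi-2:
- Small enough to load and run without a GPU
- Chosen as a trade-off between speed and accuracy (not benchmarked in this notebook)
- In this run the extraction call in Section 4 returned in ~0.21 seconds for all 12 AFSCs; a time that short suggests the results may come mainly from LAiSER's SkillNer/FAISS matching rather than from Phi-2 text generation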

In [16]:
# Get credentials
HF_TOKEN = getpass("Enter your Hugging Face token: ")
MODEL_ID = "microsoft/phi-2"  # 2.7B-parameter model, small enough to run on CPU

print(f"\nInitializing LAiSER with {MODEL_ID}...")

# Initialize LAiSER
se = Skill_Extractor(
    AI_MODEL_ID=MODEL_ID,
    HF_TOKEN=HF_TOKEN,
    use_gpu=False  # CPU mode for Windows compatibility
)
print("✓ LAiSER initialized successfully")


Initializing LAiSER with microsoft/phi-2...
Loading ESCO skill taxonomy data...
Loading FAISS index for ESCO skills...
FAISS index for ESCO skills loaded successfully.
Found 'en_core_web_lg' model. Loading...
GPU is not available. Using CPU for SkillNer model initialization...
loading full_matcher ...
loading abv_matcher ...
loading full_uni_matcher ...
loading low_form_matcher ...
loading token_matcher ...
✓ LAiSER initialized successfully


## Section 4: Primary Extraction with LAiSER

### Process:
1. Calls LAiSER's `extractor()` method with the prepared DataFrame
2. Passes `batch_size=4`, a conservative value chosen for stability
3. Returns ~25 items per AFSC (300 total for 12 AFSCs)
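How LAiSER batches records internally is not shown in this notebook. Conceptually, a batch size of 4 over 12 records means three groups of four, which the following generic chunking sketch illustrates. The helper `batches` and the placeholder IDs are assumptions for illustration only and do not reflect LAiSER's implementation.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


job_ids = [f"afsc_{i:02d}" for i in range(12)]  # placeholder IDs, not real AFSCs
chunks = list(batches(job_ids, 4))

print(f"{len(chunks)} batches with sizes {[len(c) for c in chunks]}")
```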

### Output Columns from LAiSER:
- `Research ID` / `job_id`: Links to source AFSC
- `Raw Skill`: Extracted skill phrase
- `Correlation Coefficient`: Confidence score (0-1)
- `Skill Tag`: ESCO taxonomy reference
- `Description`: Original text (for evidence)

In [17]:
print("\nExtracting skills with LAiSER...")
import time
start_time = time.time()

# Extract skills
extractions = se.extractor(
    laiser_df,
    id_column="job_id",
    text_columns=["description"],
    batch_size=4  # Conservative batch size for stability
)

# Normalize column names
if "Research ID" in extractions.columns and "job_id" not in extractions.columns:
    extractions = extractions.rename(columns={"Research ID": "job_id"})

# Validate output
required_cols = {"job_id", "Raw Skill", "Correlation Coefficient"}
assert required_cols.issubset(extractions.columns), f"Missing columns: {required_cols - set(extractions.columns)}"

elapsed_time = time.time() - start_time
print(f"✓ Extraction complete in {elapsed_time:.2f} seconds")
print(f"  Raw items extracted: {len(extractions)}")
print(f"  Items per AFSC: ~{len(extractions) // len(laiser_df):.0f}")


Extracting skills with LAiSER...
✓ Extraction complete in 0.21 seconds
  Raw items extracted: 300
  Items per AFSC: ~25


## Section 5: K/S/A Classification

### Heuristic Approach:
Since LAiSER extracts "skills" only, we apply keyword-based classification:
- **Knowledge**: Contains words like "understand", "theory", "principles"
- **Skill**: Contains action verbs like "operate", "perform", "maintain"
- **Ability**: Contains capability words like "lead", "manage", "coordinate"

### Reality Check:
Most items classify as "skills" (88.3% of the 300 raw extractions in this run) because:
- Military descriptions emphasize tasks/duties
- ESCO taxonomy is action-oriented
- Knowledge is often implicit in military documentation
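A short worked example makes the heuristic's behavior concrete, including its two quiet defaults: ties go to whichever category comes first in the pattern dictionary, and phrases with no keyword at all fall back to "skill". The function and keyword lists are copied from this notebook; the four example phrases are invented for illustration.

```python
KSA_PATTERNS = {
    "knowledge": ["knowledge", "understand", "theory", "principles", "concepts",
                  "awareness", "familiarity"],
    "skill": ["operate", "perform", "execute", "conduct", "implement",
              "maintain", "repair", "troubleshoot", "analyze"],
    "ability": ["able to", "capability", "capacity", "aptitude", "competence",
                "proficiency", "adapt", "lead", "manage", "coordinate"],
}


def classify_ksa(text):
    """Count keyword hits per category; default to 'skill' when nothing matches."""
    t = (text or "").lower()
    scores = {k: sum(1 for kw in kws if kw in t) for k, kws in KSA_PATTERNS.items()}
    return max(scores, key=scores.get) if max(scores.values()) > 0 else "skill"


examples = [
    "principles of aerodynamics",     # one knowledge keyword
    "lead and coordinate teams",      # two ability keywords
    "manage and maintain equipment",  # tie: one skill hit, one ability hit
    "aircraft weapons systems",       # no keyword at all
]
labels = [classify_ksa(e) for e in examples]
for phrase, label in zip(examples, labels):
    print(f"{label:<10} {phrase}")
```

The last two cases both land in "skill", one by tie-breaking order and one by the fallback, which contributes to the skew toward skills described above.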

In [18]:
def classify_ksa(text: str) -> str:
    """Classify extracted text as Knowledge, Skill, or Ability"""
    t = (text or "").lower()
    scores = {k: sum(1 for kw in kws if kw in t) for k, kws in KSA_PATTERNS.items()}
    return max(scores, key=scores.get) if max(scores.values()) > 0 else "skill"

# Apply classification
extractions["ksa_type"] = extractions["Raw Skill"].astype(str).apply(classify_ksa)

# Show distribution
ksa_dist = extractions["ksa_type"].value_counts()
print("\nK/S/A Classification Results:")
for ksa_type, count in ksa_dist.items():
    print(f"  {ksa_type.capitalize()}: {count} ({count/len(extractions)*100:.1f}%)")


K/S/A Classification Results:
  Skill: 265 (88.3%)
  Ability: 33 (11.0%)
  Knowledge: 2 (0.7%)


## Section 6: Evidence Snippet Generation

### Purpose:
Provides traceability by capturing text around extracted items.

### Method:
1. Takes first 3 words of extracted skill as anchor
2. Searches for anchor in original text
3. Captures ±100 characters around match
4. Creates snippet for audit/validation

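This lightweight approach avoids complex NLP, but its coverage is low: in this run it found a matching span for only 3 of 300 items. A likely cause is that the extracted labels are taxonomy phrasings that rarely appear verbatim in the AFSC text, so most items carry no snippet.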
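The anchor search can be tried on a single sentence to see both outcomes. The function body follows the notebook's `evidence_snippet`; the sample sentence, the two phrases, and the small `window=10` are invented so the result is easy to inspect.

```python
import re


def evidence_snippet(full_text, phrase, window=100):
    """Return the text around the first match of the phrase's first three words."""
    if not isinstance(full_text, str) or not isinstance(phrase, str):
        return ""
    words = phrase.lower().split()[:3]
    if not words:
        return ""
    # Anchor: the first three words, separated by any whitespace
    pat = r"\b" + r"\s+".join(map(re.escape, words))
    m = re.search(pat, full_text.lower())
    if not m:
        return ""
    start = max(0, m.start() - window)
    end = min(len(full_text), m.end() + window)
    return "..." + full_text[start:end] + "..."


text = "Pilots plan and conduct combat missions over hostile territory."

# The first three words occur verbatim in the text, so a snippet is returned
hit = evidence_snippet(text, "conduct combat missions safely", window=10)
# This wording never occurs in the text, so the result is empty
miss = evidence_snippet(text, "manage flight operations", window=10)

print(repr(hit))
print(repr(miss))
```

Only the first three words need to match, so a trailing word such as "safely" does not prevent a hit. A paraphrased label, however, produces no snippet at all.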

In [19]:
# Merge with original data for evidence extraction
extractions = extractions.merge(
    laiser_df[["job_id", "description", "afsc", "title", "category"]],
    on="job_id",
    how="left"
)

def evidence_snippet(full_text: str, phrase: str, window: int = 100) -> str:
    """Extract text snippet around where skill was found"""
    if not isinstance(full_text, str) or not isinstance(phrase, str):
        return ""
    
    # Use first 3 words as anchor
    words = phrase.lower().split()[:3]
    if not words:
        return ""
    
    # Search for pattern
    pat = r"\b" + r"\s+".join(map(re.escape, words))
    m = re.search(pat, full_text.lower())
    
    if not m:
        return ""
    
    # Extract window
    start = max(0, m.start() - window)
    end = min(len(full_text), m.end() + window)
    return "..." + full_text[start:end] + "..."

# Generate evidence snippets
extractions["evidence_snippet"] = extractions.apply(
    lambda r: evidence_snippet(r["description"], r["Raw Skill"]), axis=1
)

print("\n✓ Evidence snippets generated")
print(f"  Snippets with evidence: {(extractions['evidence_snippet'] != '').sum()}/{len(extractions)}")


✓ Evidence snippets generated
  Snippets with evidence: 3/300


## Section 7: Quality Filters

### Three-Stage Filtering:
1. **Confidence Threshold**: Removes items below 0.55 correlation
2. **Generic Stoplist**: Removes navy/maritime terms (irrelevant to Air Force)
3. **Deduplication**: One instance per AFSC-skill combination

### Impact:
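- Reduces 300 raw extractions to 73 items: 74 pass the confidence threshold and 1 more is removed by the stoplist
- Raises average confidence from about 0.51 to 0.60
- Deduplication removed nothing in this run, since no AFSC-skill pair was repeated among the remaining items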
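The three stages can be traced on a handful of rows. This is a plain-Python sketch of the same logic rather than the pandas code used in the notebook; the rows, the `DEMO-A`/`DEMO-B` identifiers, and the shortened stoplist are made up for the example.

```python
MIN_CONFIDENCE = 0.55
GENERIC_STOPLIST = ["navy operations", "maritime"]  # shortened for the example

# Hypothetical rows: (afsc, raw skill, correlation coefficient)
rows = [
    ("DEMO-A", "operate aircraft systems", 0.62),
    ("DEMO-A", "operate aircraft systems", 0.62),   # duplicate within one AFSC
    ("DEMO-A", "navy operations", 0.58),            # stoplisted term
    ("DEMO-B", "analyze intelligence data", 0.49),  # below the threshold
    ("DEMO-B", "operate aircraft systems", 0.57),   # same skill, different AFSC
]

# Stage 1: confidence threshold
kept = [r for r in rows if r[2] >= MIN_CONFIDENCE]

# Stage 2: drop rows whose skill text contains a stoplist term
kept = [r for r in kept
        if not any(term in r[1].lower() for term in GENERIC_STOPLIST)]

# Stage 3: keep the first occurrence of each (afsc, skill) pair
seen = set()
final = []
for afsc, skill, score in kept:
    if (afsc, skill) not in seen:
        seen.add((afsc, skill))
        final.append((afsc, skill, score))

print(f"{len(rows)} rows in, {len(final)} rows out")
```

Deduplication is per AFSC, so the same skill is kept once for each AFSC that lists it.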

In [20]:
print("\nApplying quality filters...")

# Stage 1: Confidence threshold
initial_count = len(extractions)
filtered = extractions[extractions["Correlation Coefficient"] >= MIN_CONFIDENCE].copy()
print(f"  After confidence filter (≥{MIN_CONFIDENCE}): {len(filtered)}/{initial_count}")

# Stage 2: Remove generic/irrelevant terms
for term in GENERIC_STOPLIST:
    before = len(filtered)
    filtered = filtered[~filtered["Raw Skill"].str.contains(term, case=False, na=False)]
    if before > len(filtered):
        print(f"  Removed {before - len(filtered)} items containing '{term}'")

# Stage 3: Deduplicate per AFSC
filtered = filtered.drop_duplicates(subset=["afsc", "Raw Skill"]).reset_index(drop=True)

print(f"\n✓ Final count: {len(filtered)} high-quality items")
print(f"  Average confidence: {filtered['Correlation Coefficient'].mean():.3f}")


Applying quality filters...
  After confidence filter (≥0.55): 74/300
  Removed 1 items containing 'navy operations'

✓ Final count: 73 high-quality items
  Average confidence: 0.601


## Section 8: Quality Control Sample

### Stratified Sampling:
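- Creates a sample of up to 30 items (fewer if a K/S/A bucket has fewer than 10 items)
- Balanced across K/S/A types (10 each when possible); in this run no knowledge items survived filtering, so the sample has 18 items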
- Includes columns for manual review:
  - `reviewer_label`: Corrected K/S/A classification
  - `is_correct`: Validation flag
  - `notes`: Reviewer comments

This enables systematic validation of extraction quality.
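The per-bucket cap explains why a requested sample of 30 can come out smaller. The sketch below reproduces the sampling logic in plain Python on a toy pool with the same K/S/A counts as the filtered data in this run (65 skills, 8 abilities, no knowledge items). The helper name `stratified_sample` and the record layout are assumptions for illustration; the notebook itself uses pandas `sample`.

```python
import random


def stratified_sample(items, n=30, seed=42):
    """Pick up to n // 3 items from each K/S/A bucket."""
    rng = random.Random(seed)
    per_bucket = max(1, n // 3)
    sample = []
    for label in ("knowledge", "skill", "ability"):
        bucket = [it for it in items if it["ksa_type"] == label]
        # An empty or small bucket simply contributes fewer items
        sample.extend(rng.sample(bucket, min(len(bucket), per_bucket)))
    return sample


# Toy pool with the same bucket sizes as the filtered run
pool = ([{"ksa_type": "skill", "id": i} for i in range(65)]
        + [{"ksa_type": "ability", "id": i} for i in range(8)])

qc = stratified_sample(pool, n=30)
counts = {k: sum(1 for it in qc if it["ksa_type"] == k)
          for k in ("knowledge", "skill", "ability")}
print(len(qc), counts)
```

Unused slots from a short bucket are not redistributed to the other buckets, so the sample size depends on how many categories are populated.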

In [21]:
def make_qc_sample(df_in: pd.DataFrame, n: int = 30) -> pd.DataFrame:
    """Create stratified QC sample for manual review"""
    if df_in.empty:
        return df_in.assign(reviewer_label="", is_correct="", notes="")
    
    # Target equal distribution across K/S/A
    per_bucket = max(1, n // 3)
    parts = []
    
    for k in ["knowledge", "skill", "ability"]:
        sub = df_in[df_in["ksa_type"] == k]
        if not sub.empty:
            sample_size = min(len(sub), per_bucket)
            parts.append(sub.sample(sample_size, random_state=42))
    
    # Combine samples
    qc = pd.concat(parts, ignore_index=True) if parts else df_in.head(min(n, len(df_in))).copy()
    
    # Add review columns
    qc["reviewer_label"] = ""
    qc["is_correct"] = ""
    qc["notes"] = ""
    
    # Select relevant columns for review
    cols = ["afsc", "title", "category", "ksa_type", "Raw Skill", 
            "Correlation Coefficient", "evidence_snippet", 
            "reviewer_label", "is_correct", "notes"]
    return qc[cols]

# Generate QC sample
qc_sample = make_qc_sample(filtered, n=30)
print(f"\n✓ QC sample created: {len(qc_sample)} items")
print(f"  K/S/A distribution in sample:")
for ksa, count in qc_sample["ksa_type"].value_counts().items():
    print(f"    {ksa}: {count}")


✓ QC sample created: 18 items
  K/S/A distribution in sample:
    skill: 10
    ability: 8


## Section 9: Summary Statistics and Exports

### Statistics Captured:
- Total extractions and unique AFSCs
- Average confidence scores
- K/S/A distribution
- Per-AFSC breakdown

### Graph Export Structure:
- **Nodes**: AFSCs (12) + unique K/S/As (30 in this run, for 42 nodes in total)
- **Edges**: AFSC→K/S/A relationships with confidence weights
- Ready for Neo4j or other graph database import
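K/S/A node IDs are built from the K/S/A type plus a slug of the skill text, so the same skill found under several AFSCs maps to one shared node. That is why the graph can have fewer K/S/A nodes than edges. The sketch below uses the same slug expression as the notebook's `graph_export`; the helper name `ksa_node_id` and the three rows are invented for illustration.

```python
import re


def ksa_node_id(ksa_type, raw_skill):
    """Build a node ID: type prefix plus a lowercase slug capped at 50 characters."""
    slug = re.sub(r"[^a-z0-9]+", "_", raw_skill.lower())[:50]
    return f"{ksa_type}_{slug}"


# Hypothetical rows: (afsc, ksa_type, raw skill)
rows = [
    ("DEMO-A", "skill", "Operate aircraft systems"),
    ("DEMO-B", "skill", "operate aircraft-systems"),  # same slug as the row above
    ("DEMO-B", "ability", "Lead teams"),
]

# One node per AFSC, then one node per distinct K/S/A ID
nodes = {afsc: {"id": afsc, "type": "AFSC"} for afsc, _, _ in rows}
edges = []
for afsc, ksa_type, skill in rows:
    node_id = ksa_node_id(ksa_type, skill)
    nodes.setdefault(node_id, {"id": node_id, "type": ksa_type.upper()})
    edges.append({
        "source": afsc,
        "target": node_id,
        "relationship": f"REQUIRES_{ksa_type.upper()}",
    })

print(f"Nodes: {len(nodes)}, Edges: {len(edges)}")
```

Because slugging lowercases the text and collapses punctuation, near-identical phrasings merge into one node. Only the first occurrence's text and confidence are stored on that node, while each edge keeps its own confidence.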

In [22]:
# Generate summary statistics
stats = {
    "pipeline_run": datetime.now().isoformat(),
    "total_extractions": int(len(filtered)),
    "unique_afscs": int(filtered["afsc"].nunique()),
    "avg_confidence": float(filtered["Correlation Coefficient"].mean()),
    "min_confidence": float(filtered["Correlation Coefficient"].min()),
    "max_confidence": float(filtered["Correlation Coefficient"].max()),
    "ksa_distribution": filtered["ksa_type"].value_counts().to_dict(),
    "items_per_afsc": {}
}

# Per-AFSC breakdown
for af in filtered["afsc"].unique():
    sub = filtered[filtered["afsc"] == af]
    stats["items_per_afsc"][af] = {
        "total": int(len(sub)),
        "knowledge": int((sub["ksa_type"] == "knowledge").sum()),
        "skills": int((sub["ksa_type"] == "skill").sum()),
        "abilities": int((sub["ksa_type"] == "ability").sum()),
        "avg_confidence": float(sub["Correlation Coefficient"].mean())
    }

# Create graph export
def graph_export(df_in: pd.DataFrame) -> dict:
    """Generate graph database structure"""
    nodes = []
    edges = []
    
    # AFSC nodes
    for af in df_in["afsc"].dropna().unique():
        row = df_in[df_in["afsc"] == af].iloc[0]
        nodes.append({
            "id": str(af),
            "type": "AFSC",
            "properties": {
                "title": row.get("title", ""),
                "category": row.get("category", "")
            }
        })
    
    # K/S/A nodes and edges
    seen_node_ids = set()
    for _, r in df_in.iterrows():
        # Create unique node ID
        node_id = f"{r['ksa_type']}_{re.sub(r'[^a-z0-9]+', '_', r['Raw Skill'].lower())[:50]}"
        
        # Add node if new
        if node_id not in seen_node_ids:
            nodes.append({
                "id": node_id,
                "type": r["ksa_type"].upper(),
                "properties": {
                    "text": r["Raw Skill"],
                    "confidence": float(r["Correlation Coefficient"])
                }
            })
            seen_node_ids.add(node_id)
        
        # Add edge
        edges.append({
            "source": str(r["afsc"]),
            "target": node_id,
            "relationship": f"REQUIRES_{r['ksa_type'].upper()}",
            "properties": {
                "confidence": float(r["Correlation Coefficient"]),
                "evidence": (r.get("evidence_snippet", "") or "")[:200]
            }
        })
    
    return {"nodes": nodes, "edges": edges}

graph_data = graph_export(filtered)

print("\n📊 Summary Statistics:")
print(f"  Total K/S/A: {stats['total_extractions']}")
print(f"  Unique AFSCs: {stats['unique_afscs']}")
print(f"  Average confidence: {stats['avg_confidence']:.3f}")
print(f"  K/S/A distribution: {stats['ksa_distribution']}")

print(f"\n📈 Graph structure:")
print(f"  Nodes: {len(graph_data['nodes'])}")
print(f"  Edges: {len(graph_data['edges'])}")


📊 Summary Statistics:
  Total K/S/A: 73
  Unique AFSCs: 12
  Average confidence: 0.601
  K/S/A distribution: {'skill': 65, 'ability': 8}

📈 Graph structure:
  Nodes: 42
  Edges: 73


## Section 10: Save All Outputs

### Files Generated:
1. **ksa_extractions.csv**: Full filtered dataset with all K/S/A items
2. **qc_sample.csv**: Stratified sample for quality control review
3. **extraction_stats.json**: Performance metrics and statistics
4. **graph_export.json**: Graph database import structure

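All files are saved to the configured output directory under fixed names, so rerunning the pipeline overwrites earlier outputs. The only run timestamp is the `pipeline_run` field inside `extraction_stats.json`.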
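The save cell writes fixed file names. If per-run versioning is wanted, one option is to add a timestamp to each name. This is a minimal sketch of that idea and not something the notebook does; the helper `versioned_name` is hypothetical, and the timestamp is fixed here only to make the example reproducible.

```python
from datetime import datetime
from pathlib import Path


def versioned_name(filename, run_time):
    """Insert a YYYYMMDD_HHMMSS stamp between the file stem and its suffix."""
    p = Path(filename)
    return f"{p.stem}_{run_time.strftime('%Y%m%d_%H%M%S')}{p.suffix}"


# Fixed timestamp for a reproducible example; use datetime.now() in practice
run_time = datetime(2025, 9, 30, 9, 38, 3)

names = [versioned_name(n, run_time)
         for n in ("ksa_extractions.csv", "graph_export.json")]
for name in names:
    print(name)
```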

In [23]:
# Save all outputs
out_main = OUTPUT_DIR / "ksa_extractions.csv"
out_qc = OUTPUT_DIR / "qc_sample.csv"
out_stats = OUTPUT_DIR / "extraction_stats.json"
out_graph = OUTPUT_DIR / "graph_export.json"

# Write files
filtered.to_csv(out_main, index=False)
qc_sample.to_csv(out_qc, index=False)

with open(out_stats, "w") as f:
    json.dump(stats, f, indent=2)

with open(out_graph, "w") as f:
    json.dump(graph_data, f, indent=2)

# Final summary
print("\n" + "="*60)
print("PIPELINE COMPLETE")
print("="*60)
print(f"\n✓ Successfully processed {len(laiser_df)} AFSCs")
print(f"✓ Extracted and filtered {len(filtered)} K/S/A items")
print(f"✓ Generated {len(qc_sample)}-item QC sample")
print(f"✓ Created graph with {len(graph_data['nodes'])} nodes and {len(graph_data['edges'])} edges")

print(f"\nOutputs saved to: {OUTPUT_DIR}")
print(f"  - {out_main.name}")
print(f"  - {out_qc.name}")
print(f"  - {out_stats.name}")
print(f"  - {out_graph.name}")

print(f"\n🎯 Next steps:")
print("  1. Review QC sample for validation")
print("  2. Import graph structure to database")
print("  3. Analyze skill relationships across AFSCs")
print("  4. Consider domain-specific improvements")


PIPELINE COMPLETE

✓ Successfully processed 12 AFSCs
✓ Extracted and filtered 73 K/S/A items
✓ Generated 18-item QC sample
✓ Created graph with 42 nodes and 73 edges

Outputs saved to: C:\Users\Kyle\OneDrive\Desktop\Capstone\fall-2025-group6\src\Data\Manual Extraction\ksa_output_simple
  - ksa_extractions.csv
  - qc_sample.csv
  - extraction_stats.json
  - graph_export.json

🎯 Next steps:
  1. Review QC sample for validation
  2. Import graph structure to database
  3. Analyze skill relationships across AFSCs
  4. Consider domain-specific improvements
