# 05 · Train Auto-K Cardinality Model

## Purpose

Train a classifier that predicts how many controls (1–3) to emit per artifact from calibrated cross-encoder scores.

## Inputs

- `data/processed/pairs/train.jsonl` and `.../dev.jsonl` for score features.
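- Calibrated cross-encoder probabilities produced in Notebook 04; the cells below recompute them from `models/cross_encoder` and `models/calibration/cross_iso.pkl`.
- `data/processed/artifacts_with_split.csv` (partition, evidence type, gold controls) and `data/processed/controls_enhanced.csv` (control `index_text`), both loaded in Section 1.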

## Outputs

- `models/cardinality/model.pkl` containing the trained classifier.
- `models/cardinality/feature_spec.json` documenting feature engineering.

## Steps

1. Generate per-artifact features: top calibrated probabilities s1–s4 (zero-padded), deltas (s1-s2, s2-s3), score entropy, and evidence_type one-hot vectors.
2. Label each artifact with `min(3, |gold_controls|)` using the processed split file.
3. Train a multi-class classifier (e.g., logistic regression or gradient boosting) on train artifacts.
4. Evaluate on dev artifacts: report accuracy and the confusion matrix, and run calibration sanity checks.
5. Persist the fitted model and feature spec JSON for reproducible inference.
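
Step 4 mentions calibration sanity checks without naming one. One possible form, sketched here as an assumption rather than the notebook's method, is the expected calibration error (ECE) of the classifier's top-class confidence. The function name `expected_calibration_error` and the toy numbers are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0..n_bins-1; confidence == 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)

# Made-up confidences and hit/miss flags, for illustration only.
toy_conf = [0.95, 0.90, 0.60, 0.55, 0.80]
toy_hit = [1, 1, 0, 1, 1]
print(f"ECE on toy data: {expected_calibration_error(toy_conf, toy_hit):.3f}")
```

With the notebook's variables, the inputs would be `proba.max(axis=1)` for the confidence and `classifier.classes_[proba.argmax(axis=1)] == y_dev` for the hit flags, where `proba = classifier.predict_proba(X_dev_scaled)`.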

## Acceptance Checks

- Dev accuracy and the dev confusion matrix are reported.
- `models/cardinality/model.pkl` and `feature_spec.json` are written.

In [9]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
import pickle
from collections import defaultdict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
import torch
from sentence_transformers import CrossEncoder

# Set random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("✓ Imports complete")

✓ Imports complete


## 1. Load data and models

In [10]:
# Load artifacts with splits and gold labels
artifacts = pd.read_csv("../data/processed/artifacts_with_split.csv", dtype={"artifact_id": str})
print(f"✓ Loaded {len(artifacts)} artifacts")

# Load cross-encoder model and calibrator
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
cross_encoder = CrossEncoder("../models/cross_encoder", device=device)
print(f"✓ Loaded cross-encoder model on {device}")

# Load calibrator
with open("../models/calibration/cross_iso.pkl", "rb") as f:
    calibrator = pickle.load(f)
print(f"✓ Loaded calibrator")

# Load enhanced controls
controls = pd.read_csv("../data/processed/controls_enhanced.csv", dtype=str)
# index_text is already created in controls_enhanced.csv
print(f"✓ Loaded {len(controls)} enhanced controls")

✓ Loaded 2574 artifacts
✓ Loaded cross-encoder model on mps
✓ Loaded calibrator
✓ Loaded 34 enhanced controls


## 2. Generate calibrated scores for all artifacts

In [11]:
def get_calibrated_scores(artifact_text, controls_df, cross_encoder, calibrator, top_k=4):
    """
    Get top-K calibrated scores for an artifact against all controls.
    """
    # Create pairs
    pairs = [[artifact_text, ctrl_text] for ctrl_text in controls_df["index_text"]]
    
    # Get cross-encoder scores
    scores = cross_encoder.predict(pairs, convert_to_numpy=True, show_progress_bar=False)
    
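    # Convert to probabilities (sigmoid).
    # NOTE: depending on the model config, predict() may already return sigmoid
    # outputs; this step must match how the calibrator was fitted in Notebook 04.
    probs = 1 / (1 + np.exp(-scores))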
    
    # Calibrate
    calibrated_probs = calibrator.predict(probs)
    
    # Get top-K
    top_indices = np.argsort(calibrated_probs)[::-1][:top_k]
    top_scores = calibrated_probs[top_indices]
    
    return top_scores

# Generate scores for all artifacts (this will take a while)
print("Generating calibrated scores for all artifacts...")
print("This may take several minutes...")

artifact_scores = {}
for partition in ["train", "dev"]:
    partition_artifacts = artifacts[artifacts["partition"] == partition]
    print(f"\nProcessing {partition} partition ({len(partition_artifacts)} artifacts)...")
    
    scores_list = []
    for idx, row in partition_artifacts.iterrows():
        top_scores = get_calibrated_scores(row["text"], controls, cross_encoder, calibrator, top_k=4)
        scores_list.append({
            "artifact_id": row["artifact_id"],
            "scores": top_scores
        })
        
        if (len(scores_list) % 50 == 0):
            print(f"  Processed {len(scores_list)}/{len(partition_artifacts)}...")
    
    artifact_scores[partition] = scores_list
    print(f"✓ Completed {partition} partition")

print(f"\n✓ Generated scores for {sum(len(v) for v in artifact_scores.values())} artifacts")

Generating calibrated scores for all artifacts...
This may take several minutes...

Processing train partition (1855 artifacts)...
  Processed 50/1855...
  Processed 100/1855...
  ...
  Processed 18
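
The loop above calls `cross_encoder.predict` once per artifact. A possible alternative is to score all artifact-control pairs in one call and reshape the result. The sketch below assumes `predict_fn` and `calibrate_fn` as hypothetical stand-ins for `cross_encoder.predict` and `calibrator.predict`, and mirrors the sigmoid step of `get_calibrated_scores`.

```python
import numpy as np

def batch_top_k_scores(artifact_texts, control_texts, predict_fn, calibrate_fn, top_k=4):
    """Return an (n_artifacts, top_k) array of calibrated scores, descending per row."""
    # Build every (artifact, control) pair in row-major order.
    pairs = [[a, c] for a in artifact_texts for c in control_texts]
    raw = np.asarray(predict_fn(pairs), dtype=float)
    # Same post-processing as the per-artifact function: sigmoid, then calibration.
    probs = 1.0 / (1.0 + np.exp(-raw))
    calibrated = np.asarray(calibrate_fn(probs), dtype=float)
    grid = calibrated.reshape(len(artifact_texts), len(control_texts))
    # Sort each row in descending order and keep the first top_k columns.
    return -np.sort(-grid, axis=1)[:, :top_k]

# Stand-in scorer for illustration: the "logit" is the product of the text lengths.
def fake_predict(pairs):
    return [len(a) * len(c) for a, c in pairs]

demo = batch_top_k_scores(["a", "bb"], ["x", "yy", "zzz"], fake_predict, lambda p: p, top_k=2)
print(demo.shape)  # (2, 2)
```

Since `cross_encoder.predict` already batches internally through its `batch_size` argument, any speed-up would come from fewer Python-level calls; whether that matters for this dataset is untested here.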

## 3. Engineer features for cardinality prediction

In [12]:
def extract_features(scores, evidence_type, evidence_types=("config", "log", "ticket")):
    """
    Extract cardinality features from an artifact's top calibrated scores.
    
    Features:
    - s1, s2, s3, s4: Top 4 scores (zero-padded)
    - delta_12: s1 - s2
    - delta_23: s2 - s3
    - entropy: -sum(s * log(s)) over the positive scores (scores are not renormalized)
    - evidence_type one-hot encoding
    """
    features = {}
    
    # Top scores (zero-pad if less than 4)
    for i in range(4):
        features[f"s{i+1}"] = scores[i] if i < len(scores) else 0.0
    
    # Deltas
    features["delta_12"] = features["s1"] - features["s2"]
    features["delta_23"] = features["s2"] - features["s3"]
    
    # Entropy
    valid_scores = scores[scores > 0]
    if len(valid_scores) > 0:
        entropy = -np.sum(valid_scores * np.log(valid_scores + 1e-10))
    else:
        entropy = 0.0
    features["entropy"] = entropy
    
    # Evidence type one-hot
    for et in evidence_types:
        features[f"type_{et}"] = 1 if evidence_type == et else 0
    
    return features

# Create feature matrix for train and dev
def create_feature_matrix(artifacts_df, scores_dict, partition):
    """Create feature matrix and labels for a partition"""
    features_list = []
    labels = []
    artifact_ids = []
    
    # Get scores for this partition
    scores_data = {s["artifact_id"]: s["scores"] for s in scores_dict[partition]}
    
    for idx, row in artifacts_df[artifacts_df["partition"] == partition].iterrows():
        artifact_id = row["artifact_id"]
        
        # Get features
        if artifact_id in scores_data:
            feat = extract_features(scores_data[artifact_id], row["evidence_type"])
            features_list.append(feat)
            
            # Label: min(3, number of gold controls). A missing gold_controls value
            # would yield label 0; none occur in the train or dev partitions.
            n_gold = len(row["gold_controls"].split(";")) if pd.notna(row["gold_controls"]) else 0
            label = min(3, n_gold)
            labels.append(label)
            artifact_ids.append(artifact_id)
    
    # Convert to DataFrame
    X = pd.DataFrame(features_list)
    y = np.array(labels)
    
    return X, y, artifact_ids

# Create train and dev sets
X_train, y_train, train_ids = create_feature_matrix(artifacts, artifact_scores, "train")
X_dev, y_dev, dev_ids = create_feature_matrix(artifacts, artifact_scores, "dev")

print(f"✓ Created feature matrices:")
print(f"  Train: {X_train.shape}, labels: {len(y_train)}")
print(f"  Dev: {X_dev.shape}, labels: {len(y_dev)}")
print(f"\n  Feature columns: {list(X_train.columns)}")
print(f"\n  Label distribution (train): {np.bincount(y_train)}")
print(f"  Label distribution (dev): {np.bincount(y_dev)}")

✓ Created feature matrices:
  Train: (1855, 10), labels: 1855
  Dev: (365, 10), labels: 365

  Feature columns: ['s1', 's2', 's3', 's4', 'delta_12', 'delta_23', 'entropy', 'type_config', 'type_log', 'type_ticket']

  Label distribution (train): [  0 723 973 159]
  Label distribution (dev): [  0 156 202   7]
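
The `entropy` feature is computed on the raw calibrated scores, which need not sum to 1, so it is not the Shannon entropy of a probability distribution. The sketch below contrasts the two on a made-up score vector. `normalized_entropy` is a hypothetical variant for comparison, not what the notebook trains on.

```python
import numpy as np

def score_entropy(scores, eps=1e-10):
    """Entropy-style feature as computed in extract_features (no normalization)."""
    s = np.asarray(scores, dtype=float)
    s = s[s > 0]
    return float(-np.sum(s * np.log(s + eps))) if s.size else 0.0

def normalized_entropy(scores, eps=1e-10):
    """Shannon entropy after rescaling the positive scores to sum to 1 (hypothetical)."""
    s = np.asarray(scores, dtype=float)
    s = s[s > 0]
    if s.size == 0:
        return 0.0
    p = s / s.sum()
    return float(-np.sum(p * np.log(p + eps)))

toy = np.array([0.9, 0.6, 0.1, 0.0])  # made-up top-4 scores, zero-padded
print(f"unnormalized: {score_entropy(toy):.4f}")
print(f"normalized:   {normalized_entropy(toy):.4f}")
```

The unnormalized form goes to zero for scores near 0 and near 1, so it behaves more like a measure of how many scores sit in the uncertain middle range than like a measure of spread across controls.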


## 4. Train Auto-K classifier

In [13]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_dev_scaled = scaler.transform(X_dev)

print("✓ Standardized features")

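# Train logistic regression classifier.
# In scikit-learn 0.22 and later, the default lbfgs solver already fits a multinomial
# model for multi-class targets; the explicit multi_class argument is deprecated
# since scikit-learn 1.5, so it is omitted here.
classifier = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_SEED,
    class_weight="balanced"  # Handle class imbalance
)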

print("\nTraining Auto-K classifier...")
classifier.fit(X_train_scaled, y_train)
print("✓ Training complete")

# Predictions
y_train_pred = classifier.predict(X_train_scaled)
y_dev_pred = classifier.predict(X_dev_scaled)

# Metrics
train_acc = accuracy_score(y_train, y_train_pred)
dev_acc = accuracy_score(y_dev, y_dev_pred)

print(f"\n{'='*60}")
print(f"TRAINING RESULTS")
print(f"{'='*60}")
print(f"  Train accuracy: {train_acc:.4f}")
print(f"  Dev accuracy:   {dev_acc:.4f}")

✓ Standardized features

Training Auto-K classifier...
✓ Training complete

TRAINING RESULTS
  Train accuracy: 0.6954
  Dev accuracy:   0.6411
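
To make the effect of `class_weight="balanced"` concrete, the sketch below recomputes the per-class weights from the train label counts printed in Section 3. It uses scikit-learn's `compute_class_weight` helper next to the closed-form expression. The synthetic label vector `y_counts` is built only to reproduce those counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Train label counts for k=1, k=2, k=3, copied from the Section 3 output.
counts = np.array([723, 973, 159])
classes = np.array([1, 2, 3])
y_counts = np.repeat(classes, counts)  # label vector with exactly those counts

# "balanced" weights are n_samples / (n_classes * count_per_class).
manual = counts.sum() / (len(classes) * counts)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)

for k, w in zip(classes, weights):
    print(f"k={k}: weight {w:.3f}")
```

The k=3 class ends up weighted roughly six times as heavily as k=2, which may help explain why the classifier recovers most true k=3 artifacts on dev while also over-predicting that class.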




In [14]:
# Confusion matrix and classification report
print(f"\n{'='*60}")
print(f"CONFUSION MATRIX (Dev Set)")
print(f"{'='*60}")
cm = confusion_matrix(y_dev, y_dev_pred, labels=[1, 2, 3])  # fixed label order keeps the matrix 3x3
print(cm)
print(f"\nRows: True labels, Columns: Predicted labels")

print(f"\n{'='*60}")
print(f"CLASSIFICATION REPORT (Dev Set)")
print(f"{'='*60}")
print(classification_report(y_dev, y_dev_pred, labels=[1, 2, 3], target_names=["k=1", "k=2", "k=3"]))


CONFUSION MATRIX (Dev Set)
[[115  28  13]
 [ 48 113  41]
 [  0   1   6]]

Rows: True labels, Columns: Predicted labels

CLASSIFICATION REPORT (Dev Set)
              precision    recall  f1-score   support

         k=1       0.71      0.74      0.72       156
         k=2       0.80      0.56      0.66       202
         k=3       0.10      0.86      0.18         7

    accuracy                           0.64       365
   macro avg       0.53      0.72      0.52       365
weighted avg       0.74      0.64      0.68       365
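
The report's figures can be re-derived directly from the confusion matrix, which also shows the direction of the errors. The sketch below copies the dev matrix from the output above. The names `cm_dev`, `over`, and `under` are introduced here for illustration.

```python
import numpy as np

# Dev confusion matrix copied from the output above
# (rows: true k, columns: predicted k, for k = 1, 2, 3).
cm_dev = np.array([[115, 28, 13],
                   [48, 113, 41],
                   [0, 1, 6]])

accuracy = np.trace(cm_dev) / cm_dev.sum()
precision = np.diag(cm_dev) / cm_dev.sum(axis=0)  # correct / predicted, per class
recall = np.diag(cm_dev) / cm_dev.sum(axis=1)     # correct / true, per class

# Above the diagonal the model predicts more controls than the gold label,
# below the diagonal it predicts fewer.
over = np.triu(cm_dev, k=1).sum() / cm_dev.sum()
under = np.tril(cm_dev, k=-1).sum() / cm_dev.sum()

print(f"accuracy={accuracy:.4f}")
print("precision:", np.round(precision, 2), "recall:", np.round(recall, 2))
print(f"over-predicted={over:.3f}, under-predicted={under:.3f}")
```

For k=3, precision is 6 / 60 = 0.10 because 54 of the 60 artifacts predicted as k=3 actually have one or two gold controls, while recall is 6 / 7 on a support of only seven artifacts.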



## 5. Save model and feature specification

In [15]:
# Create output directory
output_dir = Path("../models/cardinality")
output_dir.mkdir(parents=True, exist_ok=True)

# Save model and scaler together
model_data = {
    "classifier": classifier,
    "scaler": scaler,
    "feature_columns": list(X_train.columns)
}

model_path = output_dir / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model_data, f)

print(f"✓ Saved model to {model_path}")

# Save feature specification
feature_spec = {
    "features": [
        {"name": "s1", "description": "Top 1 calibrated score"},
        {"name": "s2", "description": "Top 2 calibrated score"},
        {"name": "s3", "description": "Top 3 calibrated score"},
        {"name": "s4", "description": "Top 4 calibrated score"},
        {"name": "delta_12", "description": "s1 - s2 (score gap)"},
        {"name": "delta_23", "description": "s2 - s3 (score gap)"},
        {"name": "entropy", "description": "Score entropy"},
        {"name": "type_config", "description": "Evidence type: config (one-hot)"},
        {"name": "type_log", "description": "Evidence type: log (one-hot)"},
        {"name": "type_ticket", "description": "Evidence type: ticket (one-hot)"}
    ],
    "num_features": len(X_train.columns),
    "classes": [1, 2, 3],
    "class_names": ["k=1", "k=2", "k=3"]
}

feature_spec_path = output_dir / "feature_spec.json"
with open(feature_spec_path, "w") as f:
    json.dump(feature_spec, f, indent=2)

print(f"✓ Saved feature spec to {feature_spec_path}")

✓ Saved model to ../models/cardinality/model.pkl
✓ Saved feature spec to ../models/cardinality/feature_spec.json
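
For downstream inference, the saved bundle can be applied as sketched below. The helper names `load_bundle` and `predict_k` are hypothetical. The dictionary keys match the `model_data` saved above, and the tiny synthetic bundle exists only so the sketch runs without the real `model.pkl`.

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def load_bundle(path):
    """Load the pickled dict; only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)

def predict_k(bundle, feature_rows):
    """Predict the number of controls for a list of feature dicts."""
    # Reorder columns to the order the scaler and classifier were fitted on.
    X = pd.DataFrame(feature_rows)[bundle["feature_columns"]]
    X_scaled = bundle["scaler"].transform(X)
    return bundle["classifier"].predict(X_scaled)

# Synthetic two-feature bundle for illustration (the real one has 10 features).
cols = ["s1", "s2"]
X_toy = pd.DataFrame({"s1": [0.9, 0.8, 0.9, 0.8], "s2": [0.1, 0.2, 0.8, 0.9]})
y_toy = np.array([1, 1, 2, 2])
toy_scaler = StandardScaler().fit(X_toy)
toy_clf = LogisticRegression().fit(toy_scaler.transform(X_toy), y_toy)
toy_bundle = {"classifier": toy_clf, "scaler": toy_scaler, "feature_columns": cols}

# Key order in the dicts does not matter; predict_k restores the fitted order.
preds = predict_k(toy_bundle, [{"s2": 0.1, "s1": 0.9}, {"s2": 0.8, "s1": 0.9}])
print(preds)
```

In the real pipeline the feature dicts would come from `extract_features`, and the bundle from `load_bundle("../models/cardinality/model.pkl")`.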


## 6. Acceptance checks

In [16]:
print("="*60)
print("ACCEPTANCE CHECKS")
print("="*60)

# Check 1: Dev accuracy and confusion matrix reported (matrix must cover all three classes)
check1 = dev_acc > 0 and cm.shape == (3, 3)
print(f"\n✓ Check 1: Dev accuracy and confusion matrix reported")
print(f"  Dev accuracy: {dev_acc:.4f}")
print(f"  Confusion matrix shape: {cm.shape}")
print(f"  Result: {'PASS' if check1 else 'FAIL'}")

# Check 2: Model file saved
check2 = model_path.exists()
print(f"\n✓ Check 2: Model file saved")
print(f"  Path: {model_path}")
print(f"  Exists: {check2}")
print(f"  Size: {model_path.stat().st_size / 1024:.2f} KB" if check2 else "  Size: N/A")
print(f"  Result: {'PASS' if check2 else 'FAIL'}")

# Check 3: Feature spec saved
check3 = feature_spec_path.exists()
print(f"\n✓ Check 3: Feature spec saved")
print(f"  Path: {feature_spec_path}")
print(f"  Exists: {check3}")
print(f"  Num features: {feature_spec['num_features']}" if check3 else "  Num features: N/A")
print(f"  Result: {'PASS' if check3 else 'FAIL'}")

# Overall
all_checks_passed = check1 and check2 and check3
print("\n" + "="*60)
if all_checks_passed:
    print("✅ ALL ACCEPTANCE CHECKS PASSED")
else:
    print("❌ SOME ACCEPTANCE CHECKS FAILED")
print("="*60)

ACCEPTANCE CHECKS

✓ Check 1: Dev accuracy and confusion matrix reported
  Dev accuracy: 0.6411
  Confusion matrix shape: (3, 3)
  Result: PASS

✓ Check 2: Model file saved
  Path: ../models/cardinality/model.pkl
  Exists: True
  Size: 1.70 KB
  Result: PASS

✓ Check 3: Feature spec saved
  Path: ../models/cardinality/feature_spec.json
  Exists: True
  Num features: 10
  Result: PASS

✅ ALL ACCEPTANCE CHECKS PASSED
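
Check 1 passes for any non-zero accuracy. A stricter variant, offered here as a suggestion rather than a project requirement, compares dev accuracy with a majority-class baseline. The sketch uses the dev label counts and dev accuracy reported earlier in this notebook.

```python
import numpy as np

# Dev label counts for k=1, k=2, k=3, copied from the Section 3 output.
dev_counts = np.array([156, 202, 7])
dev_accuracy = 0.6411  # dev accuracy reported in Section 4

# Accuracy of always predicting the most frequent class (k=2 here).
majority_baseline = dev_counts.max() / dev_counts.sum()
lift = dev_accuracy - majority_baseline

print(f"Majority baseline: {majority_baseline:.4f}")
print(f"Lift over baseline: {lift:+.4f}")
```

A check such as `dev_acc > majority_baseline` would then fail if the classifier did no better than always emitting two controls.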
