# 02 · Build Pairs with Hard Negatives

## Purpose

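Construct supervised (artifact, control) pairs for training ranking models: gold controls serve as positives, and lexically retrieved (BM25 or TF-IDF) non-gold controls serve as hard negatives.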

## Inputs

- `data/processed/controls_enhanced.csv` with rationale-augmented control descriptions.
- `data/processed/artifacts_with_split.csv` produced by Notebook 01.

## Outputs

- `data/processed/pairs/train.jsonl`, `.../dev.jsonl`, and `.../test.jsonl` in the defined schema.
- Optional diagnostics on positive/negative counts per split.

## Steps

1. Use the control index strings (`index_text`), formed by concatenating title and summary ("title. summary"); this notebook reads them precomputed from `controls_enhanced.csv`.
2. Instantiate a lexical retriever (BM25 or TF-IDF) over control texts.
3. For each artifact, retrieve top-K candidate controls (e.g., 32) as hard-negative candidates.
4. Emit labeled pairs: all gold controls as positives plus non-gold retrieved controls as negatives.
5. Segment outputs by artifact partition and serialize to JSONL following the pair schema.
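Step 1 is not executed in this notebook because `index_text` arrives precomputed. As a minimal sketch of the "title. summary" concatenation, assuming a hypothetical helper `build_index_text` and illustrative field values that are not taken from the dataset:

```python
def build_index_text(title: str, summary: str) -> str:
    """Join a control's title and summary as "title. summary"."""
    return f"{title.strip()}. {summary.strip()}"


# Toy control record (hypothetical values, for illustration only).
control = {
    "title": "Example Control",
    "summary": "Describe what the control requires",
}
index_text = build_index_text(control["title"], control["summary"])
print(index_text)
```

If a summary already ends with a period, this simple join keeps it as-is; whether to normalize trailing punctuation is a design choice left open here.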

## Acceptance Checks

- No artifact from the test split appears in train or dev pair files.
- Every train artifact yields at least one positive pair (label == 1).

In [10]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
from rank_bm25 import BM25Okapi
import re

# Set random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

## 1. Load controls and artifacts

In [11]:
# Load enhanced controls
controls = pd.read_csv("../data/processed/controls_enhanced.csv", dtype=str)
print(f"✓ Loaded {len(controls)} enhanced controls")

# index_text is already created in controls_enhanced.csv
print(f"  Sample control text: {controls.iloc[0]['index_text'][:100]}...")

controls.head(3)

✓ Loaded 34 enhanced controls
  Sample control text: Account Management. Provision, review, and remove accounts; enforce least privilege and approvals.. ...


| | control_id | family | title | summary | index_text |
|---|---|---|---|---|---|
| 0 | AC-2 | AC | Account Management | Provision, review, and remove accounts; enforc... | Account Management. Provision, review, and rem... |
| 1 | AC-6 | AC | Least Privilege | Restrict privileges to the minimum necessary; ... | Least Privilege. Restrict privileges to the mi... |
| 2 | AC-7 | AC | Unsuccessful Logon Attempts | Enforce lockout thresholds and durations after... | Unsuccessful Logon Attempts. Enforce lockout t... |


In [12]:
# Load artifacts with splits
artifacts = pd.read_csv(
    "../data/processed/artifacts_with_split.csv",
    dtype={
        "artifact_id": str,
        "text": str,
        "evidence_type": str,
        "gold_controls": str,
        "partition": str,
    },
)

print(f"✓ Loaded {len(artifacts)} artifacts")
print(f"  Partition distribution:")
print(artifacts["partition"].value_counts().sort_index())

artifacts.head(3)

✓ Loaded 2574 artifacts
  Partition distribution:
partition
dev       365
test      354
train    1855
Name: count, dtype: int64


| | artifact_id | text | evidence_type | timestamp | gold_controls | gold_rationale | partition |
|---|---|---|---|---|---|---|---|
| 0 | 10001 | User 'svc-api' failed login 11 times in 2 minu... | log | 2025-09-15 08:15:00+00:00 | AC-7;AU-6 | Failed login threshold was met without a locko... | train |
| 1 | 10002 | Configuration scan shows object store bucket '... | config | 2025-09-15 09:22:10+00:00 | SC-28;SC-12 | Information at rest is not encrypted and crypt... | test |
| 2 | 10003 | CHG-9901: Emergency hotfix for API memory leak... | ticket | 2025-09-15 11:45:30+00:00 | CM-3;SA-11 | A configuration change was deployed without co... | train |


## 2. Build BM25 index over controls

In [13]:
def tokenize(text):
    """Simple whitespace + lowercase tokenizer"""
    return re.findall(r'\w+', text.lower())

# Tokenize all control texts
control_tokens = [tokenize(text) for text in controls["index_text"]]

# Build BM25 index
bm25 = BM25Okapi(control_tokens)
print(f"✓ Built BM25 index over {len(control_tokens)} controls")

# Test retrieval
test_query = "failed login attempts"
test_scores = bm25.get_scores(tokenize(test_query))
top_idx = np.argsort(test_scores)[::-1][:3]
print(f"\n  Test query: '{test_query}'")
print(f"  Top 3 controls:")
for i, idx in enumerate(top_idx):
    print(f"    {i+1}. {controls.iloc[idx]['control_id']}: {controls.iloc[idx]['title']} (score: {test_scores[idx]:.2f})")

✓ Built BM25 index over 34 controls

  Test query: 'failed login attempts'
  Top 3 controls:
    1. AC-7: Unsuccessful Logon Attempts (score: 8.56)
    2. AU-6: Audit Review, Analysis, and Reporting (score: 6.40)
    3. SC-7: Boundary Protection (score: 1.89)
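The steps allow either BM25 or TF-IDF as the lexical retriever, but only BM25 is used above. As a rough sketch of the TF-IDF alternative, the class below scores documents by cosine similarity over smoothed TF-IDF vectors using only the standard library. The class name `TinyTfidf`, its `get_scores` method (named to mirror the BM25 call used above), and the toy control texts are all assumptions made for illustration; they are not part of the notebook or of any library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of word characters."""
    return re.findall(r"\w+", text.lower())


class TinyTfidf:
    """Hypothetical minimal TF-IDF index with cosine-similarity scoring."""

    def __init__(self, docs):
        doc_tokens = [tokenize(d) for d in docs]
        n_docs = len(doc_tokens)
        # Document frequency: number of documents containing each term.
        df = Counter()
        for tokens in doc_tokens:
            df.update(set(tokens))
        # Smoothed inverse document frequency (always positive).
        self.idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
        self.doc_vecs = [self._vectorize(tokens) for tokens in doc_tokens]

    def _vectorize(self, tokens):
        """Return an L2-normalized sparse TF-IDF vector as a dict."""
        tf = Counter(t for t in tokens if t in self.idf)
        vec = {t: count * self.idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def get_scores(self, query_tokens):
        """Cosine similarity between the query and every indexed document."""
        q = self._vectorize(query_tokens)
        return [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in self.doc_vecs]


# Toy control texts (illustrative, not the dataset's actual index_text values).
toy_controls = [
    "Unsuccessful Logon Attempts. Lock the account after repeated failed login attempts.",
    "Least Privilege. Grant only the access that a role requires.",
    "Boundary Protection. Monitor traffic at external network interfaces.",
]
index = TinyTfidf(toy_controls)
scores = index.get_scores(tokenize("failed login attempts"))
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Because `get_scores` returns one score per indexed document, the same `np.argsort(scores)[::-1][:K]` pattern used with BM25 would apply to this index as well.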


In [14]:
# Save BM25 index and control data for later use
import pickle

bm25_dir = Path("../models/bm25")
bm25_dir.mkdir(parents=True, exist_ok=True)

# Save BM25 index and related data
bm25_data = {
    "bm25": bm25,
    "control_tokens": control_tokens,
    "control_ids": controls["control_id"].tolist(),
    "control_texts": controls["index_text"].tolist()
}

bm25_path = bm25_dir / "bm25_index.pkl"
with open(bm25_path, "wb") as f:
    pickle.dump(bm25_data, f)

print(f"\n✓ Saved BM25 index to {bm25_path}")
print(f"  Size: {bm25_path.stat().st_size / 1024:.2f} KB")


✓ Saved BM25 index to ../models/bm25/bm25_index.pkl
  Size: 56.08 KB


## 3. Generate pairs with hard negatives

In [15]:
TOP_K_CANDIDATES = 32  # Number of hard negative candidates to retrieve

def generate_pairs_for_artifact(row, bm25_index, controls_df):
    """
    Generate positive and hard negative pairs for a single artifact.
    
    Returns list of pair dicts with schema:
    {artifact_id, artifact_text, evidence_type, control_id, control_text, family, label}
    """
    pairs = []
    
    # Parse gold controls (semicolon-separated); strip whitespace and skip empty entries
    gold_controls = set()
    if pd.notna(row["gold_controls"]):
        gold_controls = {c.strip() for c in row["gold_controls"].split(";") if c.strip()}
    
    # Retrieve top-K candidates using BM25
    query_tokens = tokenize(row["text"])
    scores = bm25_index.get_scores(query_tokens)
    top_indices = np.argsort(scores)[::-1][:TOP_K_CANDIDATES]
    
    # Collect all controls to include (golds + top-K candidates)
    controls_to_include = set()
    
    # Add all gold controls as positives
    for control_id in gold_controls:
        controls_to_include.add(control_id)
    
    # Add top-K retrieved controls (will be negatives if not in gold)
    for idx in top_indices:
        control_id = controls_df.iloc[idx]["control_id"]
        controls_to_include.add(control_id)
    
    # Create pairs
    for control_id in controls_to_include:
        control_row = controls_df[controls_df["control_id"] == control_id].iloc[0]
        
        pair = {
            "artifact_id": row["artifact_id"],
            "artifact_text": row["text"],
            "evidence_type": row["evidence_type"],
            "control_id": control_id,
            "control_text": control_row["index_text"],
            "family": control_row["family"],
            "label": 1 if control_id in gold_controls else 0
        }
        pairs.append(pair)
    
    return pairs

# Smoke-test on a single artifact (the first row, which belongs to the train partition)
test_artifact = artifacts.iloc[0]
test_pairs = generate_pairs_for_artifact(test_artifact, bm25, controls)
print(f"✓ Generated {len(test_pairs)} pairs for sample artifact {test_artifact['artifact_id']}")
print(f"  Positives: {sum(p['label'] == 1 for p in test_pairs)}")
print(f"  Negatives: {sum(p['label'] == 0 for p in test_pairs)}")
print(f"\n  Sample positive pair:")
pos_pair = [p for p in test_pairs if p['label'] == 1][0]
print(f"    {pos_pair['control_id']}: {pos_pair['control_text'][:80]}...")

✓ Generated 32 pairs for sample artifact 10001
  Positives: 2
  Negatives: 30

  Sample positive pair:
    AU-6: Audit Review, Analysis, and Reporting. Regularly review and analyze audit logs; ...
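The pair schema is only stated in the function's docstring. A small validator along these lines could catch malformed records before they are written. The field list is taken from that docstring, while the helper name `validate_pair` and the example values are hypothetical.

```python
# Field names as listed in the generate_pairs_for_artifact docstring.
PAIR_FIELDS = (
    "artifact_id", "artifact_text", "evidence_type",
    "control_id", "control_text", "family", "label",
)


def validate_pair(pair: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the pair is valid."""
    problems = []
    missing = [k for k in PAIR_FIELDS if k not in pair]
    extra = [k for k in pair if k not in PAIR_FIELDS]
    if missing:
        problems.append(f"missing fields: {missing}")
    if extra:
        problems.append(f"unexpected fields: {extra}")
    label = pair.get("label")
    # Labels must be the integers 0 or 1 (booleans and floats are rejected).
    if type(label) is not int or label not in (0, 1):
        problems.append(f"label must be 0 or 1, got {label!r}")
    return problems


# Hypothetical example records, for illustration only.
good = {
    "artifact_id": "a1", "artifact_text": "example artifact text",
    "evidence_type": "log", "control_id": "C-1",
    "control_text": "Example Control. Example summary", "family": "C",
    "label": 1,
}
bad = {"artifact_id": "a2", "label": "yes"}
print(validate_pair(good))
print(len(validate_pair(bad)))
```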


## 4. Generate pairs for all partitions and save to JSONL

In [16]:
# Create output directory
output_dir = Path("../data/processed/pairs")
output_dir.mkdir(parents=True, exist_ok=True)

# Process each partition separately
partition_stats = {}

for partition in ["train", "dev", "test"]:
    print(f"\nProcessing {partition} partition...")
    
    # Filter artifacts for this partition
    partition_artifacts = artifacts[artifacts["partition"] == partition]
    print(f"  {len(partition_artifacts)} artifacts")
    
    # Generate pairs for all artifacts in this partition
    all_pairs = []
    for idx, row in partition_artifacts.iterrows():
        pairs = generate_pairs_for_artifact(row, bm25, controls)
        all_pairs.extend(pairs)
    
    # Save to JSONL
    output_path = output_dir / f"{partition}.jsonl"
    with open(output_path, "w") as f:
        for pair in all_pairs:
            f.write(json.dumps(pair) + "\n")
    
    # Collect statistics
    n_positives = sum(p["label"] == 1 for p in all_pairs)
    n_negatives = sum(p["label"] == 0 for p in all_pairs)
    
    partition_stats[partition] = {
        "n_artifacts": len(partition_artifacts),
        "n_pairs": len(all_pairs),
        "n_positives": n_positives,
        "n_negatives": n_negatives,
        "pos_ratio": n_positives / len(all_pairs) if all_pairs else 0
    }
    
    print(f"  ✓ Saved {len(all_pairs)} pairs to {output_path}")
    print(f"    Positives: {n_positives}, Negatives: {n_negatives}, Ratio: {partition_stats[partition]['pos_ratio']:.3f}")


Processing train partition...
  1855 artifacts
  ✓ Saved 59373 pairs to ../data/processed/pairs/train.jsonl
    Positives: 3146, Negatives: 56227, Ratio: 0.053

Processing dev partition...
  365 artifacts
  ✓ Saved 11683 pairs to ../data/processed/pairs/dev.jsonl
    Positives: 581, Negatives: 11102, Ratio: 0.050

Processing test partition...
  354 artifacts
  ✓ Saved 11332 pairs to ../data/processed/pairs/test.jsonl
    Positives: 565, Negatives: 10767, Ratio: 0.050
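For the optional per-split diagnostics, the written JSONL can be read back and tallied per artifact. The sketch below assumes a hypothetical helper `label_counts_per_artifact` and uses an in-memory sample in place of a real pair file; passing an open file handle for `train.jsonl` should work the same way.

```python
import io
import json
from collections import Counter, defaultdict


def label_counts_per_artifact(lines):
    """Count labels per artifact from an iterable of JSONL lines."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        pair = json.loads(line)
        counts[pair["artifact_id"]][pair["label"]] += 1
    return dict(counts)


# In-memory stand-in for a pair file (hypothetical records with only the needed fields).
sample = io.StringIO(
    '{"artifact_id": "a1", "control_id": "C-1", "label": 1}\n'
    '{"artifact_id": "a1", "control_id": "C-2", "label": 0}\n'
    '{"artifact_id": "a1", "control_id": "C-3", "label": 0}\n'
    '{"artifact_id": "a2", "control_id": "C-1", "label": 0}\n'
)
per_artifact = label_counts_per_artifact(sample)
for artifact_id, c in sorted(per_artifact.items()):
    print(artifact_id, "positives:", c[1], "negatives:", c[0])
```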


In [17]:
# Print summary table
print("\n" + "="*80)
print("PAIR GENERATION SUMMARY")
print("="*80)

summary_df = pd.DataFrame(partition_stats).T
summary_df.index.name = "partition"
print(summary_df)


PAIR GENERATION SUMMARY
           n_artifacts  n_pairs  n_positives  n_negatives  pos_ratio
partition                                                           
train           1855.0  59373.0       3146.0      56227.0   0.052987
dev              365.0  11683.0        581.0      11102.0   0.049730
test             354.0  11332.0        565.0      10767.0   0.049859


## 5. Acceptance checks

In [18]:
print("="*80)
print("ACCEPTANCE CHECKS")
print("="*80)

# Check 1: No test artifacts appear in train/dev pair files
print("\n✓ Check 1: No test artifact IDs in train/dev pair files")

# Load all pair files and collect artifact IDs
train_artifact_ids = set()
dev_artifact_ids = set()
test_artifact_ids = set()

for partition, artifact_set in [("train", train_artifact_ids), ("dev", dev_artifact_ids), ("test", test_artifact_ids)]:
    pair_file = output_dir / f"{partition}.jsonl"
    with open(pair_file, "r") as f:
        for line in f:
            pair = json.loads(line)
            artifact_set.add(pair["artifact_id"])

# Check for overlap
test_in_train = test_artifact_ids & train_artifact_ids
test_in_dev = test_artifact_ids & dev_artifact_ids

check1 = len(test_in_train) == 0 and len(test_in_dev) == 0
print(f"  Test artifacts in train: {len(test_in_train)}")
print(f"  Test artifacts in dev: {len(test_in_dev)}")
print(f"  Result: {'PASS' if check1 else 'FAIL'}")

# Check 2: Every train artifact has at least one positive pair
print("\n✓ Check 2: Every train artifact has at least one positive pair")

train_pairs = []
with open(output_dir / "train.jsonl", "r") as f:
    for line in f:
        train_pairs.append(json.loads(line))

# Group by artifact and check for positives
artifact_has_positive = {}
for pair in train_pairs:
    aid = pair["artifact_id"]
    if aid not in artifact_has_positive:
        artifact_has_positive[aid] = False
    if pair["label"] == 1:
        artifact_has_positive[aid] = True

artifacts_without_positives = [aid for aid, has_pos in artifact_has_positive.items() if not has_pos]
check2 = len(artifacts_without_positives) == 0

print(f"  Train artifacts: {len(artifact_has_positive)}")
print(f"  Artifacts without positives: {len(artifacts_without_positives)}")
print(f"  Result: {'PASS' if check2 else 'FAIL'}")

if not check2:
    print(f"  Example artifacts without positives: {artifacts_without_positives[:5]}")

# Overall
all_checks_passed = check1 and check2
print("\n" + "="*80)
if all_checks_passed:
    print("✅ ALL ACCEPTANCE CHECKS PASSED")
else:
    print("❌ SOME ACCEPTANCE CHECKS FAILED")
print("="*80)

ACCEPTANCE CHECKS

✓ Check 1: No test artifact IDs in train/dev pair files
  Test artifacts in train: 0
  Test artifacts in dev: 0
  Result: PASS

✓ Check 2: Every train artifact has at least one positive pair
  Train artifacts: 1855
  Artifacts without positives: 0
  Result: PASS

✅ ALL ACCEPTANCE CHECKS PASSED
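
Check 2 above iterates only over artifact IDs that occur in `train.jsonl`, so an artifact that produced no pairs at all would not be flagged. A possible stricter variant compares against the expected IDs from the artifacts table. The helper name `artifacts_missing_positive` and the toy data are assumptions for illustration.

```python
def artifacts_missing_positive(expected_ids, pairs):
    """Return expected artifact IDs that have no pair with label == 1.

    This also catches artifacts that are absent from the pairs entirely.
    """
    with_positive = {p["artifact_id"] for p in pairs if p["label"] == 1}
    return sorted(set(expected_ids) - with_positive)


# Toy data: a1 has a positive, a2 has only a negative, a3 has no pairs at all.
expected = ["a1", "a2", "a3"]
toy_pairs = [
    {"artifact_id": "a1", "label": 1},
    {"artifact_id": "a1", "label": 0},
    {"artifact_id": "a2", "label": 0},
]
missing = artifacts_missing_positive(expected, toy_pairs)
print(missing)
```

In the notebook's terms, `expected_ids` would be the `artifact_id` values of the train partition and `pairs` the records loaded from `train.jsonl`.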
