# CERT Framework: Complete Embedding Validation

This notebook runs the complete validation suite:
1. **Full STS-Benchmark** (2,879 pairs) - General text similarity
2. **Domain-Specific Tests** (Financial, Medical, Legal) - Hand-crafted terminology pairs; the full FinQA, MedQA, and LegalBench datasets require separate downloads

**Expected runtime:** 45-60 minutes

**Decision criteria** (applied to the average accuracy across the three domains):
- ≥85% accuracy: Ship embeddings as-is
- 75% to <85%: Consider domain-specific training
- <75%: Training recommended
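
The thresholds above can be sketched as a small helper. The function name `decide` and its return labels are illustrative and not part of cert-framework; only the cutoffs come from the list above.

```python
def decide(accuracy: float) -> str:
    """Map an accuracy in [0, 1] to one of the three decisions listed above."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError(f"accuracy must be between 0 and 1, got {accuracy}")
    if accuracy >= 0.85:
        return "ship"
    if accuracy >= 0.75:
        return "consider-training"
    return "train"


# Boundary values fall into the higher band because the checks use >=.
for acc in (0.91, 0.85, 0.80, 0.60):
    print(f"{acc:.0%} -> {decide(acc)}")
```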


## Setup: Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/Javihaus/cert-framework.git
%cd cert-framework/packages/python

In [None]:
# Install dependencies
!pip install -e .
!pip install pytest datasets

## Part 1: Full STS-Benchmark Validation

Tests on 2,879 sentence pairs with human similarity judgments.
- Dev set: 1,500 pairs
- Test set: 1,379 pairs

**This is the baseline for general text similarity.**

In [None]:
# Run full dev split validation (1,500 pairs)
!pytest -v -m slow tests/test_benchmark_validation.py::TestSTSBenchmarkValidation::test_full_dev_split

In [None]:
# Run full test split validation (1,379 pairs)
!pytest -v -m slow tests/test_benchmark_validation.py::TestSTSBenchmarkValidation::test_full_test_split

## Part 2: Domain-Specific Validation

Tests embeddings on domain-specific terminology to measure the "training gap."

### Quick Test (Hand-Crafted Examples)
60 examples across Financial, Medical, and Legal domains.

In [None]:
# Run quick domain-specific tests
!python tests/test_domain_specific_quick.py

### Full Domain-Specific Validation

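**Note:** The full datasets (FinQA, MedQA, LegalBench) require separate downloads and are large, so this notebook does not load them.

The cells below run extended hand-crafted tests for each domain instead (roughly 80 to 95 pairs each). They are most useful when the quick test shows borderline results.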
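
Each extended test below builds `EmbeddingComparator(threshold=0.80)` and reads `result.matched` and `result.confidence`. How the comparator computes these values is not shown in this notebook. A common approach, assumed here purely for illustration, is to threshold the cosine similarity of two embedding vectors. The toy vectors and the `is_match` helper are made up for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def is_match(u, v, threshold=0.80):
    """Hypothetical stand-in for a thresholded comparison: (matched, score)."""
    score = cosine_similarity(u, v)
    return score >= threshold, score


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
revenue = [0.9, 0.1, 0.3]
sales = [0.8, 0.2, 0.35]
expenses = [0.1, 0.9, 0.2]

print(is_match(revenue, sales))     # similar direction, so above the threshold
print(is_match(revenue, expenses))  # different direction, so below the threshold
```

With a fixed threshold, opposites that appear in similar contexts (such as "hypertension" and "hypotension") can still score high, which is what the negative cases in each test list probe.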

In [None]:
# Extended Financial Domain Test
# This tests more financial terminology than the quick test

from cert.embeddings import EmbeddingComparator

def test_financial_extended():
    """Extended financial terminology test: synonym, acronym, and opposite pairs."""
    comparator = EmbeddingComparator(threshold=0.80)
    
    test_cases = [
        # Revenue/Sales synonyms
        ("total revenue", "total sales", True),
        ("revenue growth", "sales growth", True),
        ("revenue increased", "sales grew", True),
        ("revenue declined", "sales decreased", True),
        ("top line revenue", "top line sales", True),
        
        # Profit/Earnings synonyms
        ("net income", "net earnings", True),
        ("net income", "bottom line", True),
        ("operating profit", "operating income", True),
        ("gross profit", "gross margin dollars", True),
        
        # Financial metrics - acronyms
        ("earnings before interest and taxes", "EBIT", True),
        ("earnings before interest, taxes, depreciation, and amortization", "EBITDA", True),
        ("return on equity", "ROE", True),
        ("return on assets", "ROA", True),
        ("return on investment", "ROI", True),
        ("earnings per share", "EPS", True),
        ("price to earnings ratio", "P/E ratio", True),
        ("price to book ratio", "P/B ratio", True),
        
        # Balance sheet items
        ("accounts receivable", "AR", True),
        ("accounts payable", "AP", True),
        ("property, plant, and equipment", "PP&E", True),
        ("research and development", "R&D", True),
        ("selling, general, and administrative", "SG&A", True),
        
        # Cash flow terms
        ("operating cash flow", "OCF", True),
        ("free cash flow", "FCF", True),
        ("cash flow from operations", "CFO", True),
        
        # Growth metrics
        ("year over year", "YoY", True),
        ("year over year", "y/y", True),
        ("quarter over quarter", "QoQ", True),
        ("quarter over quarter", "q/q", True),
        ("compound annual growth rate", "CAGR", True),
        
        # Time periods
        ("fiscal year", "FY", True),
        ("fiscal quarter", "Q1", True),
        ("fiscal year 2024", "FY24", True),
        ("first quarter", "Q1", True),
        
        # Capital structure
        ("capital expenditure", "CapEx", True),
        ("operating expenditure", "OpEx", True),
        ("total addressable market", "TAM", True),
        ("serviceable addressable market", "SAM", True),
        
        # Cost terms
        ("cost of goods sold", "COGS", True),
        ("cost of revenue", "COR", True),
        ("cost of sales", "COS", True),
        
        # Market terms
        ("market capitalization", "market cap", True),
        ("enterprise value", "EV", True),
        ("initial public offering", "IPO", True),
        ("mergers and acquisitions", "M&A", True),
        
        # Margin terms
        ("gross profit margin", "gross margin", True),
        ("operating profit margin", "operating margin", True),
        ("net profit margin", "net margin", True),
        
        # Working capital
        ("current assets minus current liabilities", "working capital", True),
        ("cash and cash equivalents", "cash", True),
        
        # Apple 10-K specific terms
        ("iPhone revenue", "iPhone sales", True),
        ("Services revenue", "Services sales", True),
        ("Mac revenue", "Mac sales", True),
        ("iPad revenue", "iPad sales", True),
        ("Wearables revenue", "Wearables sales", True),
        
        # Product categories
        ("smartphones", "phones", True),
        ("personal computers", "PCs", True),
        ("tablet computers", "tablets", True),
        ("wearable devices", "wearables", True),
        
        # Business operations
        ("designs, manufactures, and markets", "creates and sells", True),
        ("supply chain", "logistics", True),
        ("retail stores", "physical stores", True),
        ("online store", "e-commerce", True),
        
        # Negative cases (should NOT match)
        ("revenue", "expenses", False),
        ("revenue", "costs", False),
        ("profit", "loss", False),
        ("assets", "liabilities", False),
        ("cash inflow", "cash outflow", False),
        ("credit", "debit", False),
        ("increase", "decrease", False),
        ("growth", "decline", False),
        ("buy", "sell", False),
        ("investment", "divestment", False),
        ("appreciation", "depreciation", False),
        ("bull market", "bear market", False),
        ("dividend", "buyback", False),
        ("equity", "debt", False),
        ("current assets", "long-term liabilities", False),
    ]
    
    print("\n" + "="*60)
    print("EXTENDED FINANCIAL TERMINOLOGY TEST")
    print(f"Total test cases: {len(test_cases)}")
    print("="*60)
    
    results = []
    failures = []
    
    for i, (expected, actual, should_match) in enumerate(test_cases, 1):
        result = comparator.compare(expected, actual)
        matched = result.matched
        confidence = result.confidence
        
        is_correct = matched == should_match
        status = "✓" if is_correct else "✗"
        
        if not is_correct:
            failures.append((expected, actual, should_match, matched, confidence))
        
        # Print every 10th result to avoid too much output
        if i % 10 == 0 or not is_correct:
            print(f"{status} [{i:3d}] '{expected}' vs '{actual}': {matched} (conf: {confidence:.3f}) [expected: {should_match}]")
        
        results.append(is_correct)
    
    accuracy = sum(results) / len(results)
    
    print("\n" + "="*60)
    print(f"Financial Extended Accuracy: {accuracy:.1%} ({sum(results)}/{len(results)})")
    print("="*60)
    
    if failures:
        print(f"\nFailures ({len(failures)}):")
        for exp, act, should, got, conf in failures[:10]:  # Show first 10 failures
            print(f"  '{exp}' vs '{act}': expected {should}, got {got} (conf: {conf:.3f})")
        if len(failures) > 10:
            print(f"  ... and {len(failures) - 10} more")
    
    return accuracy

financial_accuracy = test_financial_extended()
print(f"\n✅ Financial domain accuracy: {financial_accuracy:.1%}")

In [None]:
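# Extended Medical Domain Test
# Tests medical terminology and abbreviations

from cert.embeddings import EmbeddingComparator  # repeated so this cell runs on its own

def test_medical_extended():
    """Extended medical terminology test: abbreviations, synonyms, and opposite pairs."""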
    comparator = EmbeddingComparator(threshold=0.80)
    
    test_cases = [
        # Cardiac conditions
        ("myocardial infarction", "MI", True),
        ("myocardial infarction", "heart attack", True),
        ("ST-elevation myocardial infarction", "STEMI", True),
        ("non-ST-elevation myocardial infarction", "NSTEMI", True),
        ("congestive heart failure", "CHF", True),
        ("atrial fibrillation", "AF", True),
        ("atrial fibrillation", "AFib", True),
        ("ventricular fibrillation", "VFib", True),
        ("coronary artery disease", "CAD", True),
        ("peripheral artery disease", "PAD", True),
        
        # Hypertension
        ("hypertension", "HTN", True),
        ("hypertension", "high blood pressure", True),
        ("essential hypertension", "primary hypertension", True),
        
        # Diabetes
        ("diabetes mellitus", "DM", True),
        ("type 1 diabetes", "T1DM", True),
        ("type 2 diabetes", "T2DM", True),
        ("diabetic ketoacidosis", "DKA", True),
        
        # Respiratory
        ("chronic obstructive pulmonary disease", "COPD", True),
        ("shortness of breath", "SOB", True),
        ("shortness of breath", "dyspnea", True),
        ("respiratory rate", "RR", True),
        ("pulmonary embolism", "PE", True),
        ("acute respiratory distress syndrome", "ARDS", True),
        
        # Neurological
        ("cerebrovascular accident", "CVA", True),
        ("cerebrovascular accident", "stroke", True),
        ("transient ischemic attack", "TIA", True),
        ("traumatic brain injury", "TBI", True),
        ("intracranial pressure", "ICP", True),
        ("level of consciousness", "LOC", True),
        ("Glasgow Coma Scale", "GCS", True),
        
        # Gastrointestinal
        ("gastroesophageal reflux disease", "GERD", True),
        ("inflammatory bowel disease", "IBD", True),
        ("irritable bowel syndrome", "IBS", True),
        ("nausea and vomiting", "N/V", True),
        
        # Renal
        ("acute kidney injury", "AKI", True),
        ("chronic kidney disease", "CKD", True),
        ("end-stage renal disease", "ESRD", True),
        ("urinary tract infection", "UTI", True),
        
        # Vital signs
        ("blood pressure", "BP", True),
        ("heart rate", "HR", True),
        ("temperature", "temp", True),
        ("oxygen saturation", "SpO2", True),
        # "respiratory rate" / "RR" is already covered under Respiratory above
        
        # Symptoms
        ("chest pain", "CP", True),
        ("abdominal pain", "abd pain", True),
        # "shortness of breath" / "SOB" is already covered under Respiratory above
        ("loss of consciousness", "LOC", True),
        ("fever", "pyrexia", True),
        ("headache", "HA", True),
        
        # Lab tests
        ("complete blood count", "CBC", True),
        ("basic metabolic panel", "BMP", True),
        ("comprehensive metabolic panel", "CMP", True),
        ("prothrombin time", "PT", True),
        ("partial thromboplastin time", "PTT", True),
        ("international normalized ratio", "INR", True),
        
        # Medications
        ("acetaminophen", "Tylenol", True),
        ("ibuprofen", "Advil", True),
        ("aspirin", "ASA", True),
        ("nitroglycerin", "NTG", True),
        
        # Procedures
        ("cardiopulmonary resuscitation", "CPR", True),
        ("electrocardiogram", "ECG", True),
        ("electrocardiogram", "EKG", True),
        ("magnetic resonance imaging", "MRI", True),
        ("computed tomography", "CT", True),
        ("chest x-ray", "CXR", True),
        
        # Routes of administration
        ("intravenous", "IV", True),
        ("intramuscular", "IM", True),
        ("subcutaneous", "SubQ", True),
        ("by mouth", "PO", True),
        ("as needed", "PRN", True),
        
        # Infections
        ("hospital-acquired infection", "HAI", True),
        ("healthcare-associated infection", "HAI", True),
        ("methicillin-resistant Staphylococcus aureus", "MRSA", True),
        ("community-acquired pneumonia", "CAP", True),
        
        # Negative cases (opposites or different conditions)
        ("hypertension", "hypotension", False),
        ("hyperglycemia", "hypoglycemia", False),
        ("tachycardia", "bradycardia", False),
        ("hyperthermia", "hypothermia", False),
        ("ischemic stroke", "hemorrhagic stroke", False),
        ("STEMI", "NSTEMI", False),
        ("systolic blood pressure", "diastolic blood pressure", False),
        ("inspiration", "expiration", False),
        ("arterial", "venous", False),
        ("acute", "chronic", False),
    ]
    
    print("\n" + "="*60)
    print("EXTENDED MEDICAL TERMINOLOGY TEST")
    print(f"Total test cases: {len(test_cases)}")
    print("="*60)
    
    results = []
    failures = []
    
    for i, (expected, actual, should_match) in enumerate(test_cases, 1):
        result = comparator.compare(expected, actual)
        matched = result.matched
        confidence = result.confidence
        
        is_correct = matched == should_match
        status = "✓" if is_correct else "✗"
        
        if not is_correct:
            failures.append((expected, actual, should_match, matched, confidence))
        
        if i % 10 == 0 or not is_correct:
            print(f"{status} [{i:3d}] '{expected}' vs '{actual}': {matched} (conf: {confidence:.3f}) [expected: {should_match}]")
        
        results.append(is_correct)
    
    accuracy = sum(results) / len(results)
    
    print("\n" + "="*60)
    print(f"Medical Extended Accuracy: {accuracy:.1%} ({sum(results)}/{len(results)})")
    print("="*60)
    
    if failures:
        print(f"\nFailures ({len(failures)}):")
        for exp, act, should, got, conf in failures[:10]:
            print(f"  '{exp}' vs '{act}': expected {should}, got {got} (conf: {conf:.3f})")
        if len(failures) > 10:
            print(f"  ... and {len(failures) - 10} more")
    
    return accuracy

medical_accuracy = test_medical_extended()
print(f"\n✅ Medical domain accuracy: {medical_accuracy:.1%}")

In [None]:
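# Extended Legal Domain Test
# Tests legal terminology, citations, and Latin phrases

from cert.embeddings import EmbeddingComparator  # repeated so this cell runs on its own

def test_legal_extended():
    """Extended legal terminology test: citations, Latin phrases, synonyms, and opposite pairs."""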
    comparator = EmbeddingComparator(threshold=0.80)
    
    test_cases = [
        # USC citations (various formats)
        ("42 USC § 1983", "Section 1983", True),
        ("42 USC § 1983", "42 U.S.C. 1983", True),
        ("42 USC § 1983", "42 United States Code Section 1983", True),
        ("Title VII", "Title 7", True),
        ("18 USC 1001", "18 U.S.C. § 1001", True),
        
        # Latin legal terms
        ("habeas corpus", "writ of habeas corpus", True),
        ("pro se", "self-represented", True),
        ("pro bono", "for the public good", True),
        ("voir dire", "jury selection", True),
        ("prima facie", "at first sight", True),
        ("res judicata", "matter adjudged", True),
        ("stare decisis", "precedent", True),
        ("amicus curiae", "friend of the court", True),
        ("in camera", "in private", True),
        ("ex parte", "one-sided", True),
        ("de novo", "anew", True),
        ("de facto", "in fact", True),
        ("de jure", "by law", True),
        ("per se", "by itself", True),
        ("sua sponte", "on its own motion", True),
        ("in loco parentis", "in place of a parent", True),
        ("mens rea", "guilty mind", True),
        ("actus reus", "guilty act", True),
        
        # Court terminology
        ("certiorari", "cert", True),
        ("writ of certiorari", "cert petition", True),
        ("summary judgment", "SJ", True),
        ("motion to dismiss", "MTD", True),
        ("preliminary injunction", "PI", True),
        ("temporary restraining order", "TRO", True),
        
        # Parties
        ("plaintiff", "complainant", True),
        ("plaintiff", "petitioner", True),
        ("defendant", "respondent", True),
        ("defendant", "accused", True),
        ("appellant", "petitioner", True),
        ("appellee", "respondent", True),
        
        # Civil law
        ("tort", "civil wrong", True),
        ("negligence", "breach of duty", True),
        ("breach of contract", "contract violation", True),
        ("damages", "monetary compensation", True),
        ("injunction", "court order", True),
        ("specific performance", "enforce contract", True),
        
        # Criminal law
        ("beyond a reasonable doubt", "criminal standard", True),
        ("preponderance of the evidence", "civil standard", True),
        ("probable cause", "reasonable grounds", True),
        ("Miranda rights", "right to remain silent", True),
        ("search warrant", "authorization to search", True),
        
        # Legal proceedings
        ("deposition", "sworn testimony", True),
        ("interrogatories", "written questions", True),
        ("discovery", "evidence gathering", True),
        ("subpoena", "court summons", True),
        ("affidavit", "sworn statement", True),
        ("stipulation", "agreement", True),
        
        # Constitutional law
        ("First Amendment", "freedom of speech", True),
        ("Fourth Amendment", "search and seizure", True),
        ("Fifth Amendment", "right against self-incrimination", True),
        ("Sixth Amendment", "right to counsel", True),
        ("due process", "fair treatment", True),
        ("equal protection", "equal treatment under law", True),
        
        # Burden of proof
        ("beyond a reasonable doubt", "highest standard", True),
        ("clear and convincing evidence", "intermediate standard", True),
        ("preponderance of the evidence", "more likely than not", True),
        
        # Verdicts and outcomes
        ("guilty verdict", "conviction", True),
        ("not guilty verdict", "acquittal", True),
        ("liability", "legal responsibility", True),
        ("judgment", "court decision", True),
        ("appeal", "review by higher court", True),
        ("remand", "send back to lower court", True),
        ("reverse", "overturn decision", True),
        ("affirm", "uphold decision", True),
        
        # Legal documents
        ("complaint", "initial pleading", True),
        ("answer", "response to complaint", True),
        ("brief", "legal argument", True),
        ("motion", "request to court", True),
        ("order", "court directive", True),
        ("opinion", "court's reasoning", True),
        
        # Property law
        ("real property", "real estate", True),
        ("personal property", "movable property", True),
        ("easement", "right of way", True),
        ("adverse possession", "squatter's rights", True),
        
        # Contract law
        ("consideration", "something of value", True),
        ("offer and acceptance", "meeting of minds", True),
        ("mutual assent", "agreement", True),
        ("statute of frauds", "writing requirement", True),
        
        # Negative cases (opposites or different concepts)
        ("plaintiff", "defendant", False),
        ("guilty", "not guilty", False),
        ("guilty", "innocent", False),
        ("civil", "criminal", False),
        ("felony", "misdemeanor", False),
        ("liable", "not liable", False),
        ("appeal", "original jurisdiction", False),
        ("affirm", "reverse", False),
        ("prosecution", "defense", False),
        ("tort", "contract", False),
    ]
    
    print("\n" + "="*60)
    print("EXTENDED LEGAL TERMINOLOGY TEST")
    print(f"Total test cases: {len(test_cases)}")
    print("="*60)
    
    results = []
    failures = []
    
    for i, (expected, actual, should_match) in enumerate(test_cases, 1):
        result = comparator.compare(expected, actual)
        matched = result.matched
        confidence = result.confidence
        
        is_correct = matched == should_match
        status = "✓" if is_correct else "✗"
        
        if not is_correct:
            failures.append((expected, actual, should_match, matched, confidence))
        
        if i % 10 == 0 or not is_correct:
            print(f"{status} [{i:3d}] '{expected}' vs '{actual}': {matched} (conf: {confidence:.3f}) [expected: {should_match}]")
        
        results.append(is_correct)
    
    accuracy = sum(results) / len(results)
    
    print("\n" + "="*60)
    print(f"Legal Extended Accuracy: {accuracy:.1%} ({sum(results)}/{len(results)})")
    print("="*60)
    
    if failures:
        print(f"\nFailures ({len(failures)}):")
        for exp, act, should, got, conf in failures[:10]:
            print(f"  '{exp}' vs '{act}': expected {should}, got {got} (conf: {conf:.3f})")
        if len(failures) > 10:
            print(f"  ... and {len(failures) - 10} more")
    
    return accuracy

legal_accuracy = test_legal_extended()
print(f"\n✅ Legal domain accuracy: {legal_accuracy:.1%}")

## Final Results and Decision

In [None]:
# Calculate overall accuracy
print("\n" + "="*60)
print("FINAL VALIDATION RESULTS")
print("="*60)

# Note: STS results printed above, domain results in variables
try:
    avg_domain = (financial_accuracy + medical_accuracy + legal_accuracy) / 3
    
    print("\nDomain-Specific Results:")
    print(f"  Financial: {financial_accuracy:.1%}")
    print(f"  Medical:   {medical_accuracy:.1%}")
    print(f"  Legal:     {legal_accuracy:.1%}")
    print(f"  Average:   {avg_domain:.1%}")
    
    print("\n" + "="*60)
    print("DECISION")
    print("="*60)
    
    if avg_domain >= 0.85:
        print("\n✅ SHIP IT: Domain accuracy >= 85%")
        print("   Embeddings handle domain terminology well.")
        print("   No fine-tuning needed.")
    elif avg_domain >= 0.75:
        print("\n⚠️  CONSIDER TRAINING: Domain accuracy 75-85%")
        print("   Embeddings work, but training could improve 5-10%.")
        print("   Decide based on production requirements.")
    else:
        print("\n❌ TRAIN: Domain accuracy < 75%")
        print("   Domain-specific fine-tuning recommended.")
        print("   Embeddings struggle with domain jargon.")
    
    print("\n" + "="*60)
except NameError:
    print("\n⚠️  Run all domain-specific cells above to see final decision")
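
The domain test sets are dominated by positive pairs: in the financial list, 63 of the 78 cases expect a match, so a comparator that always answers `True` would already score about 81%. A per-class breakdown makes this visible. The sketch below is not part of cert-framework; `balanced_accuracy` and the sample labels are illustrative.

```python
def balanced_accuracy(should_match, matched):
    """Mean of the accuracy on positive pairs and the accuracy on negative pairs."""
    if len(should_match) != len(matched):
        raise ValueError("label and prediction lists must have the same length")
    pos = [m for s, m in zip(should_match, matched) if s]
    neg = [m for s, m in zip(should_match, matched) if not s]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    pos_acc = sum(1 for m in pos if m) / len(pos)      # true matches found
    neg_acc = sum(1 for m in neg if not m) / len(neg)  # non-matches rejected
    return (pos_acc + neg_acc) / 2


# Same class mix as the financial list: 63 positive and 15 negative pairs.
labels = [True] * 63 + [False] * 15
always_match = [True] * len(labels)

plain = sum(p == s for p, s in zip(always_match, labels)) / len(labels)
print(f"plain accuracy:    {plain:.1%}")
print(f"balanced accuracy: {balanced_accuracy(labels, always_match):.1%}")
```

For the always-match predictor, plain accuracy is 63/78 while balanced accuracy drops to 50%, so reporting both alongside the decision helps separate real terminology handling from class imbalance.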
