# Investigation: Yield Rate Anomalies

This notebook digs into counterintuitive yield-rate results in the Ghidra join pipeline.



## The Big Picture: Survivorship Bias in Yield Rates

The single most important insight is that **the yield rate % is misleading in isolation** because both the numerator (the HC, i.e. high-confidence, count) and the denominator (the eligible count) change across optimization levels. Several test cases show counterintuitive yield *increases* at higher optimization, and in every such case examined here the cause is that **the optimizer eliminates the hard-to-match functions from the eligible pool**, leaving only the easy ones.
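
The effect can be flagged mechanically. The following is a minimal sketch that uses the eligible and HC counts reported in this notebook for t03, t06, and t12. The helper name `survivorship_steps` and the rule it applies (yield rises while the HC count does not) are illustrative assumptions, not part of the pipeline.

```python
OPTS = ["O0", "O1", "O2", "O3"]

# (eligible, hc) per opt level, copied from the tables in this notebook
COUNTS = {
    "t03_header_dominant": [(18, 4), (3, 1), (2, 1), (2, 1)],
    "t06_recursion_inline": [(12, 8), (5, 2), (1, 1), (1, 1)],
    "t12_state_machine": [(33, 33), (33, 18), (23, 15), (21, 12)],
}


def survivorship_steps(counts):
    """Return opt transitions where yield rises although the HC count does not."""
    flagged = []
    steps = zip(zip(OPTS, counts), zip(OPTS[1:], counts[1:]))
    for (opt_a, (elig_a, hc_a)), (opt_b, (elig_b, hc_b)) in steps:
        yield_a = hc_a / elig_a  # assumes a non-empty eligible pool
        yield_b = hc_b / elig_b
        if yield_b > yield_a and hc_b <= hc_a:
            flagged.append(f"{opt_a}->{opt_b}")
    return flagged


for name, counts in COUNTS.items():
    print(name, survivorship_steps(counts))
```

Every flagged transition is one where the rate improved only because the denominator shrank, which is the pattern the sections below walk through case by case.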

---

## t06_recursion_inline: 66.7% → 40% → 100% → 100%

**This is the clearest example of survivorship bias.**

| Opt | Eligible | HC | Yield | What happened |
|-----|-----------|-----|-------|---------------|
| O0 | 12 | 8 | 66.7% | `abs_val` (×2), `clamp`, `max2` fail (AMBIGUOUS, 3 candidates each) |
| O1 | 5 | 2 | 40.0% | `fibonacci`, `tree_sum`, `tree_depth` fail (qw≈0.2, 4 candidates) |
| O2 | **1** | 1 | 100% | Only `main` remains eligible |
| O3 | **1** | 1 | 100% | Only `main` remains eligible |

**What happens mechanically:**

1. **O0 failures:** `abs_val`, `clamp`, and `max2` are `static inline` one-line helpers from `recurse.h` (e.g., `return x < 0 ? -x : x`). The header is included in 3 translation units and the compiled code is identical across TUs, so the alignment finds 3 candidates with perfect overlap → **AMBIGUOUS**. Four eligible rows fail this way (`abs_val` appears twice). The pipeline correctly refuses to pick one.

2. **O1 failures:** At `-O1`, the helpers get inlined into their callers. The recursive functions (`fibonacci`, `tree_sum`, `tree_depth`) now have **scattered DWARF line info** because the inlined helper code mixes with the caller's code. The alignment finds 4 candidates with only ~76-83% overlap → qw = overlap/n_candidates ≈ 0.19-0.21 → too low for GOLD.

3. **O2/O3 "100%":** The compiler inlines much more aggressively. Most functions lose their standalone DWARF address ranges (they become `DW_TAG_inlined_subroutine` entries without standalone ranges → NO_RANGE → NON_TARGET; 18 of 26 rows at O2). Eight rows still join strongly, but seven of them land in the BRONZE tier (for example, `gcd` is AMBIGUOUS) and are not eligible. The eligible pool collapses to just `main`, which passes trivially. The 5-6 anonymous DWARF entries that appear are compiler-generated clones/outlines from optimization.

**Bottom line:** The 100% yield at O2/O3 represents **1 function**, not a quality improvement. The compiler effectively "solves" the alignment problem by eliminating everything that's hard to match.
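
The O1 quality weights can be reproduced from the relation quoted above, qw = overlap/n_candidates. This is a sketch of that relation only: the function name `quality_weight` is hypothetical, and the pipeline evidently applies further rules on top of it (the AMBIGUOUS rows at O0 report qw = 0 despite an overlap of 1.0).

```python
def quality_weight(overlap_ratio, n_candidates):
    """Overlap diluted by the number of competing candidates (assumed form)."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    return overlap_ratio / n_candidates


# Overlap ratios and candidate counts from the O1 failure listing for t06
o1_failures = {
    "fibonacci": (0.764706, 4),
    "tree_sum": (0.826087, 4),
    "tree_depth": (0.826087, 4),
}

for name, (overlap, n_cand) in o1_failures.items():
    print(f"{name}: qw={quality_weight(overlap, n_cand):.3f}")
```

The results round to 0.191 and 0.207, matching the `quality_weight` values in the notebook output for these three functions.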

---

## t04_static_dup_names: 100% → 25% → 25% → 25%

| Opt | Eligible | HC | Yield |
|-----|-----------|-----|-------|
| O0 | 17 | 17 | 100% |
| O1 | 4 | 1 | 25% |
| O2 | 4 | 1 | 25% |
| O3 | 4 | 1 | 25% |

This test case has 3 modules each defining `static validate()`, `static process()`, `static report()` with **different implementations but identical names**. At O0, all 17 functions (including the duplicates) pass; the oracle correctly distinguishes them by CU/declaration. At O1, the small static helpers get inlined, and only `main` + 3 `run_module_*` wrappers remain eligible. The `run_module_*` functions fail with **NO_MATCH** (n_cand=4-5, overlap≈0.27-0.56): each wrapper now carries source lines from the inlined static helpers, so the alignment sees several candidates and none of them dominates.

---

## t03_header_dominant: 22.2% → 33.3% → 50% → 50%

**Yield goes UP with optimization — another survivorship case.**

At O0, 18 functions are eligible but 14 are `vec_*` functions from a `static inline` header included in 3 TUs. All 14 fail as **AMBIGUOUS** (3 candidates each, perfect overlap). Only `main`, `linear_search`, `run_stack_tests`, `run_search_tests` pass = 4/18 = 22.2%.

At O1+, the `vec_*` functions get inlined and vanish from the eligible pool. The denominator shrinks from 18 → 3 → 2 → 2, and the yield % mechanically increases even though **the absolute HC count is dropping** (4 → 1 → 1 → 1).

---

## t12_state_machine: 100% → 54.5% → 65.2% → 57.1%

The non-monotonic O1→O2 increase is particularly interesting.

**O1 (18/33 = 54.5%)**: 15 functions fail despite having **perfect quality metrics** (qw=1.0, n_cand=1, MATCH):
- 7 `*_exit` functions (e.g., `idle_exit`, `conn_exit`) — These have **empty bodies** (`(void)ctx;`). At O1, the compiler optimizes them to no-ops but keeps them because their addresses are stored in function pointer tables. Result: `ghidra_match_kind = NO_MATCH`, `asm_insn_count = 0`, `cfg = nan`. Ghidra doesn't even detect them as functions.
- 7 `get_*_state` functions + `shut_handle` — These are 1-line accessors (2 ASM instructions!). They have `ghidra_match_kind = JOINED_WEAK`, meaning the DWARF range and Ghidra's function boundary overlap by less than 90%. For such tiny functions, **even 1 byte of boundary disagreement** drops them below the 90% threshold.
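
To see why tiny functions are so fragile, here is a rough sketch of a byte-overlap ratio between a DWARF range and a Ghidra function body. The definition (intersection divided by the DWARF range length), the function name `range_overlap`, the addresses, and the function sizes are all assumptions for illustration; only the 90% threshold comes from the text above.

```python
JOINED_STRONG_THRESHOLD = 0.90  # threshold quoted in the text


def range_overlap(dwarf, ghidra):
    """Fraction of the DWARF [start, end) range covered by the Ghidra range."""
    d_start, d_end = dwarf
    g_start, g_end = ghidra
    shared = max(0, min(d_end, g_end) - max(d_start, g_start))
    return shared / (d_end - d_start)


# Hypothetical 8-byte accessor: Ghidra's end boundary is off by one byte
tiny = range_overlap((0x1000, 0x1008), (0x1000, 0x1007))
# Hypothetical 200-byte function with the same one-byte disagreement
large = range_overlap((0x2000, 0x20C8), (0x2000, 0x20C7))

for label, ratio in [("tiny", tiny), ("large", large)]:
    kind = "JOINED_STRONG" if ratio >= JOINED_STRONG_THRESHOLD else "JOINED_WEAK"
    print(f"{label}: overlap={ratio:.3f} -> {kind}")
```

Under these assumptions the 8-byte function scores 7/8 = 87.5% and falls below the threshold, while the 200-byte function scores 99.5% with the same one-byte disagreement.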

**O2 (15/23 = 65.2%)**: The 7 `*_exit` functions get fully inlined/eliminated → no longer eligible. The eligible pool drops from 33 to 23, but only 7 of the 10 departures are guaranteed failures; the HC count also falls by 3 (18 → 15). The `get_*_state` functions still fail (still JOINED_WEAK), leaving 8 failures instead of 15. The denominator shrank by 10 while the numerator shrank by only 3, so the yield goes **up**.

**O3 (12/21 = 57.1%)**: One more function (`sm_run_sequence`) drops to NO_MATCH → yield decreases slightly.

---

## t13_goto_labels: 100% → 37.5% → 14.3% → 14.3%

This is one of the **worst-performing** test cases at high optimization (t02 drops further, to 0%), and the pattern is different: it is not survivorship bias but genuine quality degradation.

The functions (`acquire_resources`, `process_pipeline`, `multi_stage_init`, `release_resources`, `run_bytecode_safe`) all use `goto`-based cleanup patterns and computed-goto dispatch. They have:
- `n_candidates = 2` — the `static validate()` and `static log_action()` functions are duplicated across `resource.c` and `interpreter.c` with the same name. When inlined at O1+, their source lines appear in both TU candidates, creating a false second candidate for every function.
- `qw ≈ 0.35-0.49` — with 2 candidates and ~70-98% overlap, the quality weight stays below the GOLD threshold.

This is a real limitation: **cross-TU static function name collisions + inlining** pollutes the alignment signal. The `goto` patterns themselves aren't the problem — it's the shared static helper names.

---

## t02_shared_header_macros: 50% → 33.3% → 0% → 0%

At O0, 9/18 pass. All 9 failures are `static inline` header functions (`print_sep`, `square`, `is_even`, `safe_mod`, `safe_div`) duplicated across 3 TUs → AMBIGUOUS.

At O2/O3, only 2 functions remain eligible (`run_arith_tests`, `run_sort_tests`), and both fail with **NO_MATCH** because aggressive inlining pulled in source lines from many functions → n_candidates = 10-19, overlap ≈ 0.24-0.38. The 0% yield is genuine — these functions are too intermingled after inlining to reliably match.

---

## Summary of Root Causes

| Phenomenon | Root Cause | Affected Test Cases |
|---|---|---|
| Yield increases at higher opt | **Survivorship bias**: optimizer eliminates hard functions, shrinking the eligible pool | t03, t06, t12 |
| 100% yield at O2/O3 | Only `main` survives as eligible | t06 |
| AMBIGUOUS at O0 | `static inline` header functions compiled identically across TUs | t02, t03, t06 |
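| NO_MATCH alignment verdict at O1+ | Aggressive inlining scatters source lines across many candidates | t02, t04, t13 |
| Ghidra `NO_MATCH` at O1 | Empty-bodied functions are kept only for function pointer tables, and Ghidra does not detect them as functions | t12 (`*_exit`) |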
| JOINED_WEAK | Tiny functions (2 instructions) where DWARF/Ghidra boundaries disagree by 1+ bytes | t12 (get_*_state) |
| Cross-TU pollution at O1+ | Identically-named `static` functions inlined into callers create false candidates | t13 |
| Anonymous functions appearing | Compiler-generated function clones/outlines at O2/O3 | t06 |

The **yield % as a metric is most meaningful when the eligible pool is stable or large**. For small test cases where optimization reshapes the function landscape, the absolute HC count and the failure mechanisms matter more than the rate.


In [1]:
import sys
from pathlib import Path
import json
import pandas as pd

sys.path.insert(0, str(Path("../../..").resolve()))
from data.loader import load_ghidra_dataset

gds = load_ghidra_dataset(test_cases=None, opt_levels=["O0", "O1", "O2", "O3"])
df = gds.functions
dr = gds.reports
print(f"Loaded {len(df)} functions, {len(dr)} report cells")
print(f"Test cases: {gds.test_cases}")

Loaded 1792 functions, 60 report cells
Test cases: ['t01_crossfile_calls', 't02_shared_header_macros', 't03_header_dominant', 't04_static_dup_names', 't05_fptr_callbacks', 't06_recursion_inline', 't07_switch_parser', 't08_loop_heavy', 't09_string_format', 't10_math_libm', 't11_mixed_stress', 't12_state_machine', 't13_goto_labels', 't14_anon_aggregates', 't15_deep_nesting']


In [2]:
# Full summary table with yield rates per (test_case, opt)
summary = dr[["test_case", "opt", "n_dwarf_funcs",
              "excl_n_eligible_for_gold", "n_joined_strong",
              "hc_count", "hc_yield_rate"]].copy()
summary.columns = ["Test Case", "Opt", "DWARF", "Eligible", "Strong", "HC", "Yield"]
summary["Yield %"] = (summary["Yield"] * 100).round(1)

# Pivot to see all opts side by side
pivot = summary.pivot_table(index="Test Case", columns="Opt",
                            values=["DWARF", "Eligible", "HC", "Yield %"],
                            aggfunc="first")
# Show yield % pivot
yield_pivot = summary.pivot_table(index="Test Case", columns="Opt",
                                   values="Yield %", aggfunc="first")
yield_pivot = yield_pivot.reindex(columns=["O0", "O1", "O2", "O3"])
print("=== HC Yield Rate (%) per test case and opt ===")
display(yield_pivot)

# Also show eligible and HC counts
elig_pivot = summary.pivot_table(index="Test Case", columns="Opt",
                                  values="Eligible", aggfunc="first")
elig_pivot = elig_pivot.reindex(columns=["O0", "O1", "O2", "O3"])
print("\n=== Eligible for Gold ===")
display(elig_pivot)

hc_pivot = summary.pivot_table(index="Test Case", columns="Opt",
                                values="HC", aggfunc="first")
hc_pivot = hc_pivot.reindex(columns=["O0", "O1", "O2", "O3"])
print("\n=== HC Count ===")
display(hc_pivot)

dwarf_pivot = summary.pivot_table(index="Test Case", columns="Opt",
                                   values="DWARF", aggfunc="first")
dwarf_pivot = dwarf_pivot.reindex(columns=["O0", "O1", "O2", "O3"])
print("\n=== Total DWARF Functions ===")
display(dwarf_pivot)

=== HC Yield Rate (%) per test case and opt ===


| Test Case | O0 | O1 | O2 | O3 |
|---|---|---|---|---|
| t01_crossfile_calls | 84.6 | 77.8 | 75.0 | 75.0 |
| t02_shared_header_macros | 50.0 | 33.3 | 0.0 | 0.0 |
| t03_header_dominant | 22.2 | 33.3 | 50.0 | 50.0 |
| t04_static_dup_names | 100.0 | 25.0 | 25.0 | 25.0 |
| t05_fptr_callbacks | 100.0 | 100.0 | 100.0 | 100.0 |
| t06_recursion_inline | 66.7 | 40.0 | 100.0 | 100.0 |
| t07_switch_parser | 100.0 | 100.0 | 50.0 | 55.6 |
| t08_loop_heavy | 100.0 | 100.0 | 85.7 | 85.7 |
| t09_string_format | 100.0 | 80.0 | 66.7 | 66.7 |
| t10_math_libm | 100.0 | 90.9 | 80.0 | 77.8 |



=== Eligible for Gold ===


| Test Case | O0 | O1 | O2 | O3 |
|---|---|---|---|---|
| t01_crossfile_calls | 13 | 9 | 8 | 8 |
| t02_shared_header_macros | 18 | 3 | 2 | 2 |
| t03_header_dominant | 18 | 3 | 2 | 2 |
| t04_static_dup_names | 17 | 4 | 4 | 4 |
| t05_fptr_callbacks | 10 | 10 | 10 | 10 |
| t06_recursion_inline | 12 | 5 | 1 | 1 |
| t07_switch_parser | 12 | 12 | 10 | 9 |
| t08_loop_heavy | 16 | 16 | 14 | 14 |
| t09_string_format | 11 | 10 | 9 | 9 |
| t10_math_libm | 12 | 11 | 10 | 9 |



=== HC Count ===


| Test Case | O0 | O1 | O2 | O3 |
|---|---|---|---|---|
| t01_crossfile_calls | 11 | 7 | 6 | 6 |
| t02_shared_header_macros | 9 | 1 | 0 | 0 |
| t03_header_dominant | 4 | 1 | 1 | 1 |
| t04_static_dup_names | 17 | 1 | 1 | 1 |
| t05_fptr_callbacks | 10 | 10 | 10 | 10 |
| t06_recursion_inline | 8 | 2 | 1 | 1 |
| t07_switch_parser | 12 | 12 | 5 | 5 |
| t08_loop_heavy | 16 | 16 | 12 | 12 |
| t09_string_format | 11 | 8 | 6 | 6 |
| t10_math_libm | 12 | 10 | 8 | 7 |



=== Total DWARF Functions ===


| Test Case | O0 | O1 | O2 | O3 |
|---|---|---|---|---|
| t01_crossfile_calls | 23 | 23 | 25 | 25 |
| t02_shared_header_macros | 23 | 24 | 24 | 24 |
| t03_header_dominant | 26 | 31 | 31 | 32 |
| t04_static_dup_names | 24 | 29 | 29 | 29 |
| t05_fptr_callbacks | 21 | 23 | 23 | 23 |
| t06_recursion_inline | 20 | 21 | 26 | 27 |
| t07_switch_parser | 22 | 24 | 25 | 25 |
| t08_loop_heavy | 29 | 31 | 33 | 33 |
| t09_string_format | 26 | 28 | 28 | 28 |
| t10_math_libm | 32 | 37 | 38 | 39 |


In [3]:
# Deep dive: t06_recursion_inline
tc = "t06_recursion_inline"
tc_df = df[df["test_case"] == tc].copy()
print(f"=== {tc}: {len(tc_df)} total rows ===")

for opt in ["O0", "O1", "O2", "O3"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    print(f"\n--- {opt} ({len(opt_df)} functions) ---")
    print(f"  ghidra_match_kind: {dict(opt_df['ghidra_match_kind'].value_counts())}")
    print(f"  align_verdict: {dict(opt_df['align_verdict'].value_counts())}")
    print(f"  confidence_tier: {dict(opt_df['confidence_tier'].value_counts())}")
    print(f"  eligible_for_gold: {opt_df['eligible_for_gold'].sum()}")
    print(f"  is_high_confidence: {opt_df['is_high_confidence'].sum()}")
    
    # Show all function names and their status
    cols = ["dwarf_function_name", "dwarf_function_id", "ghidra_match_kind",
            "align_verdict", "confidence_tier", "is_high_confidence",
            "eligible_for_gold", "quality_weight", "exclusion_reason"]
    available_cols = [c for c in cols if c in opt_df.columns]
    display(opt_df[available_cols].sort_values("dwarf_function_name"))

=== t06_recursion_inline: 94 total rows ===

--- O0 (20 functions) ---
  ghidra_match_kind: {'JOINED_STRONG': np.int64(12), 'NO_RANGE': np.int64(8)}
  align_verdict: {'NON_TARGET': np.int64(8), 'MATCH': np.int64(8), 'AMBIGUOUS': np.int64(4)}
  confidence_tier: {'': np.int64(8), 'GOLD': np.int64(8), 'SILVER': np.int64(4)}
  eligible_for_gold: 12
  is_high_confidence: 8


| | dwarf_function_name | dwarf_function_id | ghidra_match_kind | align_verdict | confidence_tier | is_high_confidence | eligible_for_gold | quality_weight | exclusion_reason |
|---|---|---|---|---|---|---|---|---|---|
| 526 | abs_val | cu0x1be:die0x37d | JOINED_STRONG | AMBIGUOUS | SILVER | False | True | 0.0 | |
| 531 | abs_val | cu0x3aa:die0x590 | JOINED_STRONG | AMBIGUOUS | SILVER | False | True | 0.0 | |
| 515 | ackermann | cu0x0:die0x77 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 521 | ackermann | cu0x1be:die0x22b | JOINED_STRONG | MATCH | GOLD | True | True | 1.0 | |
| 525 | clamp | cu0x1be:die0x331 | JOINED_STRONG | AMBIGUOUS | SILVER | False | True | 0.0 | |
| 518 | fibonacci | cu0x0:die0xc4 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 524 | fibonacci | cu0x1be:die0x303 | JOINED_STRONG | MATCH | GOLD | True | True | 1.0 | |
| 517 | gcd | cu0x0:die0xa9 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 523 | gcd | cu0x1be:die0x2c6 | JOINED_STRONG | MATCH | GOLD | True | True | 1.0 | |
| 514 | main | cu0x0:die0x154 | JOINED_STRONG | MATCH | GOLD | True | True | 1.0 | |



--- O1 (21 functions) ---
  ghidra_match_kind: {'NO_RANGE': np.int64(13), 'JOINED_STRONG': np.int64(8)}
  align_verdict: {'NON_TARGET': np.int64(13), 'MATCH': np.int64(5), 'NO_MATCH': np.int64(3)}
  confidence_tier: {'': np.int64(13), 'BRONZE': np.int64(3), 'SILVER': np.int64(3), 'GOLD': np.int64(2)}
  eligible_for_gold: 5
  is_high_confidence: 2


| | dwarf_function_name | dwarf_function_id | ghidra_match_kind | align_verdict | confidence_tier | is_high_confidence | eligible_for_gold | quality_weight | exclusion_reason |
|---|---|---|---|---|---|---|---|---|---|
| 535 | `__builtin_puts` | cu0x0:die0x5eb | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 547 | abs_val | cu0x5f7:die0xa2d | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 552 | abs_val | cu0xa45:die0xe15 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 536 | ackermann | cu0x0:die0x77 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 542 | ackermann | cu0x5f7:die0x664 | JOINED_STRONG | NO_MATCH | BRONZE | False | False | 0.0 | |
| 546 | clamp | cu0x5f7:die0x9fe | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 539 | fibonacci | cu0x0:die0xc4 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 545 | fibonacci | cu0x5f7:die0x954 | JOINED_STRONG | MATCH | SILVER | False | True | 0.191176 | |
| 538 | gcd | cu0x0:die0xa9 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 544 | gcd | cu0x5f7:die0x85e | JOINED_STRONG | NO_MATCH | BRONZE | False | False | 0.0 | |



--- O2 (26 functions) ---
  ghidra_match_kind: {'NO_RANGE': np.int64(18), 'JOINED_STRONG': np.int64(8)}
  align_verdict: {'NON_TARGET': np.int64(18), 'MATCH': np.int64(6), 'AMBIGUOUS': np.int64(1), 'NO_MATCH': np.int64(1)}
  confidence_tier: {'': np.int64(18), 'BRONZE': np.int64(7), 'GOLD': np.int64(1)}
  eligible_for_gold: 1
  is_high_confidence: 1


| | dwarf_function_name | dwarf_function_id | ghidra_match_kind | align_verdict | confidence_tier | is_high_confidence | eligible_for_gold | quality_weight | exclusion_reason |
|---|---|---|---|---|---|---|---|---|---|
| 556 | `__builtin_puts` | cu0x0:die0x5db | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 568 | abs_val | cu0x5e7:die0x779 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 575 | abs_val | cu0xe60:die0x10dd | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 557 | ackermann | cu0x0:die0x73 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 563 | ackermann | cu0x5e7:die0x654 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 567 | clamp | cu0x5e7:die0x74a | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 560 | fibonacci | cu0x0:die0xc0 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 566 | fibonacci | cu0x5e7:die0x731 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 559 | gcd | cu0x0:die0xa5 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 565 | gcd | cu0x5e7:die0x6ac | JOINED_STRONG | AMBIGUOUS | BRONZE | False | False | 0.0 | |



--- O3 (27 functions) ---
  ghidra_match_kind: {'NO_RANGE': np.int64(19), 'JOINED_STRONG': np.int64(8)}
  align_verdict: {'NON_TARGET': np.int64(19), 'MATCH': np.int64(5), 'NO_MATCH': np.int64(2), 'AMBIGUOUS': np.int64(1)}
  confidence_tier: {'': np.int64(19), 'BRONZE': np.int64(7), 'GOLD': np.int64(1)}
  eligible_for_gold: 1
  is_high_confidence: 1


| | dwarf_function_name | dwarf_function_id | ghidra_match_kind | align_verdict | confidence_tier | is_high_confidence | eligible_for_gold | quality_weight | exclusion_reason |
|---|---|---|---|---|---|---|---|---|---|
| 582 | `__builtin_puts` | cu0x0:die0x5db | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 594 | abs_val | cu0x5e7:die0x779 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 600 | abs_val | cu0xf2d:die0x1096 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 583 | ackermann | cu0x0:die0x73 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 589 | ackermann | cu0x5e7:die0x654 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 593 | clamp | cu0x5e7:die0x74a | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 586 | fibonacci | cu0x0:die0xc0 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 592 | fibonacci | cu0x5e7:die0x731 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 585 | gcd | cu0x0:die0xa5 | NO_RANGE | NON_TARGET | | False | False | 0.0 | NO_RANGE |
| 591 | gcd | cu0x5e7:die0x6ac | JOINED_STRONG | AMBIGUOUS | BRONZE | False | False | 0.0 | |


In [7]:
# Focused: t06_recursion_inline failures
tc_name = "t06_recursion_inline"
tc_df = df[df["test_case"] == tc_name]

for opt in ["O0", "O1", "O2", "O3"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    failed = opt_df[(opt_df["eligible_for_gold"]) & (~opt_df["is_high_confidence"])]
    if len(failed) > 0:
        print(f"\n{opt} FAILED HC ({len(failed)}):")
        for _, r in failed.iterrows():
            name = r.get('dwarf_function_name', '?')
            av = r.get('align_verdict', '?')
            qw = r.get('quality_weight', None)
            nc = r.get('align_n_candidates', '?')
            olap = r.get('align_overlap_ratio', '?')
            cfg = r.get('cfg_completeness', '?')
            jw = r.get('join_warnings', [])
            ct = r.get('confidence_tier', '?')
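            # Guard the format spec: a missing or NaN weight would otherwise break or print 'nan'
            qw_str = f"{qw:.3f}" if qw is not None and not pd.isna(qw) else "n/a"
            print(f"    {name}: verdict={av}, qw={qw_str}, n_cand={nc}, overlap={olap}, cfg={cfg}, tier={ct}")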
            if jw:
                print(f"      warnings: {jw}")


O0 FAILED HC (4):
    clamp: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH, tier=SILVER
    abs_val: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH, tier=SILVER
    max2: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH, tier=SILVER
    abs_val: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH, tier=SILVER

O1 FAILED HC (3):
    fibonacci: verdict=MATCH, qw=0.191, n_cand=4.0, overlap=0.764706, cfg=HIGH, tier=SILVER
    tree_sum: verdict=MATCH, qw=0.207, n_cand=4.0, overlap=0.826087, cfg=HIGH, tier=SILVER
    tree_depth: verdict=MATCH, qw=0.207, n_cand=4.0, overlap=0.826087, cfg=HIGH, tier=SILVER


In [8]:
# For t06: what happens to functions at O2/O3? Why do they disappear?
tc_name = "t06_recursion_inline"
tc_df = df[df["test_case"] == tc_name]

for opt in ["O0", "O1", "O2", "O3"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    print(f"\n{'='*50}")
    print(f"{opt}: {len(opt_df)} total DWARF functions")
    
    # Group by exclusion reason
    if "exclusion_reason" in opt_df.columns:
        excl_counts = opt_df["exclusion_reason"].value_counts()
        print(f"  Exclusion breakdown: {dict(excl_counts)}")
    
    # Show which functions are NOT eligible and why
    not_elig = opt_df[~opt_df["eligible_for_gold"]]
    print(f"  NOT eligible ({len(not_elig)}):")
    for _, r in not_elig.iterrows():
        name = r.get("dwarf_function_name", None)
        name_str = name if name else f"<anon:{r['dwarf_function_id'][:20]}>"
        excl = r.get("exclusion_reason", "?")
        mk = r.get("ghidra_match_kind", "?")
        av = r.get("align_verdict", "?")
        print(f"    {name_str}: excl={excl}, match={mk}, verdict={av}")


O0: 20 total DWARF functions
  Exclusion breakdown: {'NO_RANGE': np.int64(8)}
  NOT eligible (8):
    tree_depth: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    printf: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    ackermann: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    power: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    gcd: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    fibonacci: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    tree_max: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    tree_sum: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET

O1: 21 total DWARF functions
  Exclusion breakdown: {'NO_RANGE': np.int64(13)}
  NOT eligible (16):
    tree_depth: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    printf: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    __builtin_puts: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    ackermann: excl=NO_RANGE, match=NO_RANGE, verdict=NON_TARGET
    power: excl=NO_RANGE, match

In [4]:
# For t06_recursion_inline: compare function names across opt levels
tc = "t06_recursion_inline"
tc_df = df[df["test_case"] == tc]
print(f"=== Function name universe for {tc} ===")

for opt in ["O0", "O1", "O2", "O3"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    names = set(opt_df["dwarf_function_name"].dropna().tolist())
    anon = opt_df["dwarf_function_name"].isna().sum()
    elig_names = set(opt_df[opt_df["eligible_for_gold"]]["dwarf_function_name"].dropna().tolist())
    hc_names = set(opt_df[opt_df["is_high_confidence"]]["dwarf_function_name"].dropna().tolist())
    print(f"\n{opt}:")
    print(f"  All names ({len(names)}): {sorted(names)}")
    print(f"  Anonymous: {anon}")
    print(f"  Eligible ({len(elig_names)}): {sorted(elig_names)}")
    print(f"  HC ({len(hc_names)}): {sorted(hc_names)}")
    print(f"  Failed = Eligible - HC: {sorted(elig_names - hc_names)}")

=== Function name universe for t06_recursion_inline ===

O0:
  All names (12): ['abs_val', 'ackermann', 'clamp', 'fibonacci', 'gcd', 'main', 'max2', 'power', 'printf', 'tree_depth', 'tree_max', 'tree_sum']
  Anonymous: 0
  Eligible (11): ['abs_val', 'ackermann', 'clamp', 'fibonacci', 'gcd', 'main', 'max2', 'power', 'tree_depth', 'tree_max', 'tree_sum']
  HC (8): ['ackermann', 'fibonacci', 'gcd', 'main', 'power', 'tree_depth', 'tree_max', 'tree_sum']
  Failed = Eligible - HC: ['abs_val', 'clamp', 'max2']

O1:
  All names (13): ['__builtin_puts', 'abs_val', 'ackermann', 'clamp', 'fibonacci', 'gcd', 'main', 'max2', 'power', 'printf', 'tree_depth', 'tree_max', 'tree_sum']
  Anonymous: 0
  Eligible (5): ['fibonacci', 'main', 'power', 'tree_depth', 'tree_sum']
  HC (2): ['main', 'power']
  Failed = Eligible - HC: ['fibonacci', 'tree_depth', 'tree_sum']

O2:
  All names (13): ['__builtin_puts', 'abs_val', 'ackermann', 'clamp', 'fibonacci', 'gcd', 'main', 'max2', 'power', 'printf', 'tree_depth

In [9]:
# t04_static_dup_names: O0=100% (17/17) → O1=25% (1/4)
# Why did eligible drop from 17 to 4 and then 3 fail?
tc_name = "t04_static_dup_names"
tc_df = df[df["test_case"] == tc_name]

for opt in ["O0", "O1"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    elig = opt_df[opt_df["eligible_for_gold"]]
    hc = opt_df[opt_df["is_high_confidence"]]
    failed = opt_df[(opt_df["eligible_for_gold"]) & (~opt_df["is_high_confidence"])]
    
    print(f"\n{'='*50}")
    print(f"{opt}: {len(opt_df)} DWARF, {len(elig)} eligible, {len(hc)} HC")
    print(f"  Eligible names: {sorted(elig['dwarf_function_name'].dropna().tolist())}")
    if len(failed) > 0:
        print(f"  Failed ({len(failed)}):")
        for _, r in failed.iterrows():
            name = r.get('dwarf_function_name', '?')
            av = r.get('align_verdict', '?')
            qw = r.get('quality_weight', None)
            nc = r.get('align_n_candidates', '?')
            olap = r.get('align_overlap_ratio', '?')
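            # Guard the format spec: a missing or NaN weight would otherwise break or print 'nan'
            qw_str = f"{qw:.3f}" if qw is not None and not pd.isna(qw) else "n/a"
            print(f"    {name}: verdict={av}, qw={qw_str}, n_cand={nc}, overlap={olap}")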


O0: 24 DWARF, 17 eligible, 17 HC
  Eligible names: ['accumulate', 'count_valid', 'main', 'process', 'process', 'process', 'reduce_max', 'reduce_min', 'report', 'report', 'report', 'run_module_a', 'run_module_b', 'run_module_c', 'validate', 'validate', 'validate']

O1: 29 DWARF, 4 eligible, 1 HC
  Eligible names: ['main', 'run_module_a', 'run_module_b', 'run_module_c']
  Failed (3):
    run_module_a: verdict=NO_MATCH, qw=0.000, n_cand=4.0, overlap=0.5625
    run_module_b: verdict=NO_MATCH, qw=0.000, n_cand=5.0, overlap=0.323944
    run_module_c: verdict=NO_MATCH, qw=0.000, n_cand=5.0, overlap=0.27027


In [10]:
# t03_header_dominant: yield INCREASES with opt: O0=22.2% → O3=50%
tc_name = "t03_header_dominant"
tc_df = df[df["test_case"] == tc_name]

for opt in ["O0", "O1", "O2", "O3"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    elig = opt_df[opt_df["eligible_for_gold"]]
    hc = opt_df[opt_df["is_high_confidence"]]
    failed = opt_df[(opt_df["eligible_for_gold"]) & (~opt_df["is_high_confidence"])]
    
    print(f"\n{opt}: {len(opt_df)} DWARF, {len(elig)} elig, {len(hc)} HC = {len(hc)}/{len(elig)}")
    print(f"  Eligible: {sorted(elig['dwarf_function_name'].dropna().tolist())}")
    print(f"  HC: {sorted(hc['dwarf_function_name'].dropna().tolist())}")
    if len(failed) > 0:
        print(f"  Failed ({len(failed)}):")
        for _, r in failed.iterrows():
            name = r.get('dwarf_function_name', '?')
            av = r.get('align_verdict', '?')
            qw = r.get('quality_weight', None)
            nc = r.get('align_n_candidates', '?')
            olap = r.get('align_overlap_ratio', '?')
            jw = r.get('join_warnings', [])
            cfg = r.get('cfg_completeness', '?')
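            # Guard the format spec: a missing or NaN weight would otherwise break or print 'nan'
            qw_str = f"{qw:.3f}" if qw is not None and not pd.isna(qw) else "n/a"
            print(f"    {name}: verdict={av}, qw={qw_str}, n_cand={nc}, overlap={olap}, cfg={cfg}")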


O0: 26 DWARF, 18 elig, 4 HC = 4/18
  Eligible: ['linear_search', 'main', 'run_search_tests', 'run_stack_tests', 'vec_find', 'vec_free', 'vec_free', 'vec_grow', 'vec_grow', 'vec_init', 'vec_init', 'vec_pop', 'vec_print', 'vec_print', 'vec_push', 'vec_push', 'vec_reverse', 'vec_sum']
  HC: ['linear_search', 'main', 'run_search_tests', 'run_stack_tests']
  Failed (14):
    vec_sum: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_print: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_pop: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_push: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_grow: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_free: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_init: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_reverse: verdict=AMBIGUOUS, qw=0.000, n_cand=3.0, overlap=1.0, cfg=HIGH
    vec_prin

In [11]:
# t12_state_machine: O0=100%, O1=54.5%, O2=65.2% (UP!), O3=57.1%
# Why does yield go UP from O1→O2?
tc_name = "t12_state_machine"
tc_df = df[df["test_case"] == tc_name]

for opt in ["O0", "O1", "O2", "O3"]:
    opt_df = tc_df[tc_df["opt"] == opt]
    elig = opt_df[opt_df["eligible_for_gold"]]
    hc = opt_df[opt_df["is_high_confidence"]]
    failed = opt_df[(opt_df["eligible_for_gold"]) & (~opt_df["is_high_confidence"])]
    
    print(f"\n{opt}: {len(elig)} elig, {len(hc)} HC = {len(hc)}/{len(elig)} = {len(hc)/max(len(elig),1)*100:.1f}%")
    if len(failed) > 0:
        print(f"  Failed ({len(failed)}):")
        for _, r in failed.iterrows():
            name = r.get('dwarf_function_name', '?')
            av = r.get('align_verdict', '?')
            qw = r.get('quality_weight', None)
            nc = r.get('align_n_candidates', '?')
            jw = r.get('join_warnings', [])
            cfg = r.get('cfg_completeness', '?')
            qw_str = f"{qw:.3f}" if qw is not None and not pd.isna(qw) else "n/a"
            print(f"    {name}: verdict={av}, qw={qw_str}, n_cand={nc}, cfg={cfg}")


O0: 33 elig, 33 HC = 33/33 = 100.0%

O1: 33 elig, 18 HC = 18/33 = 54.5%
  Failed (15):
    get_shutdown_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    get_error_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    get_processing_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    get_ready_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    get_authenticating_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    get_connecting_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    get_idle_state: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    shut_handle: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=HIGH
    shut_exit: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=nan
    err_exit: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=nan
    proc_exit: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=nan
    ready_exit: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=nan
    auth_exit: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=nan
    conn_exit: verdict=MATCH, qw=1.000, n_cand=1.0, cfg=nan

In [14]:
# t12: print only HC-relevant columns for failed functions at O2
tc_df = df[df["test_case"] == "t12_state_machine"]
opt_df = tc_df[tc_df["opt"] == "O2"]
failed = opt_df[(opt_df["eligible_for_gold"]) & (~opt_df["is_high_confidence"])]

key_cols = ["dwarf_function_name", "is_high_confidence", "confidence_tier",
            "align_verdict", "quality_weight", "align_n_candidates", "align_overlap_ratio",
            "ghidra_match_kind", "joined_strength", "is_noise_function",
            "cfg_completeness", "has_fatal_warnings", "join_warnings",
            "eligible_for_gold", "c_line_count", "asm_insn_count"]
avail = [c for c in key_cols if c in failed.columns]
for _, r in failed.head(3).iterrows():
    print(f"\n{r['dwarf_function_name']}:")
    for c in avail:
        print(f"  {c}: {r[c]}")


get_shutdown_state:
  dwarf_function_name: get_shutdown_state
  is_high_confidence: False
  confidence_tier: BRONZE
  align_verdict: MATCH
  quality_weight: 1.0
  align_n_candidates: 1.0
  align_overlap_ratio: 1.0
  ghidra_match_kind: JOINED_WEAK
  cfg_completeness: HIGH
  eligible_for_gold: True
  asm_insn_count: 2

get_error_state:
  dwarf_function_name: get_error_state
  is_high_confidence: False
  confidence_tier: BRONZE
  align_verdict: MATCH
  quality_weight: 1.0
  align_n_candidates: 1.0
  align_overlap_ratio: 1.0
  ghidra_match_kind: JOINED_WEAK
  cfg_completeness: HIGH
  eligible_for_gold: True
  asm_insn_count: 2

get_processing_state:
  dwarf_function_name: get_processing_state
  is_high_confidence: False
  confidence_tier: BRONZE
  align_verdict: MATCH
  quality_weight: 1.0
  align_n_candidates: 1.0
  align_overlap_ratio: 1.0
  ghidra_match_kind: JOINED_WEAK
  cfg_completeness: HIGH
  eligible_for_gold: True
  asm_insn_count: 2


In [15]:
# t12 O1: why do _exit functions (with cfg=nan) fail?
tc_df = df[df["test_case"] == "t12_state_machine"]
opt_df = tc_df[tc_df["opt"] == "O1"]
exit_funcs = opt_df[opt_df["dwarf_function_name"].str.contains("exit", na=False)]

key_cols = ["dwarf_function_name", "is_high_confidence", "confidence_tier",
            "align_verdict", "quality_weight", "ghidra_match_kind",
            "cfg_completeness", "asm_insn_count", "c_line_count"]
avail = [c for c in key_cols if c in exit_funcs.columns]
for _, r in exit_funcs.iterrows():
    vals = {c: r[c] for c in avail}
    print(vals)

# Also: which gate causes the failure?
# HC requires: MATCH, unique, ratio >= 0.95, JOINED_STRONG, not noise, CFG != LOW, no fatal warnings
# Note: opt_df is the O1 slice here; the O2 slice is checked separately below.
print("\n--- HC gate check for get_shutdown_state at O1 ---")
r = opt_df[opt_df["dwarf_function_name"] == "get_shutdown_state"]
if len(r) > 0:
    r = r.iloc[0]
    print(f"  MATCH: {r['align_verdict'] == 'MATCH'}")
    print(f"  unique (n_cand=1): {r['align_n_candidates'] == 1}")
    print(f"  ratio >= 0.95: {r['align_overlap_ratio'] >= 0.95}")
    print(f"  JOINED_STRONG: {r['ghidra_match_kind'] == 'JOINED_STRONG'} (actual: {r['ghidra_match_kind']})")
    print(f"  cfg != LOW: {r.get('cfg_completeness') != 'LOW'}")
    
print("\n--- At O2 ---")
opt_df2 = tc_df[tc_df["opt"] == "O2"]
r = opt_df2[opt_df2["dwarf_function_name"] == "get_shutdown_state"]
if len(r) > 0:
    r = r.iloc[0]
    print(f"  MATCH: {r['align_verdict'] == 'MATCH'}")
    print(f"  unique: {r['align_n_candidates'] == 1}")
    print(f"  JOINED_STRONG: {r['ghidra_match_kind'] == 'JOINED_STRONG'} (actual: {r['ghidra_match_kind']})")
    print(f"  asm_insn_count: {r['asm_insn_count']}")

{'dwarf_function_name': 'shut_exit', 'is_high_confidence': False, 'confidence_tier': '', 'align_verdict': 'MATCH', 'quality_weight': 1.0, 'ghidra_match_kind': 'NO_MATCH', 'cfg_completeness': nan, 'asm_insn_count': 0}
{'dwarf_function_name': 'err_exit', 'is_high_confidence': False, 'confidence_tier': '', 'align_verdict': 'MATCH', 'quality_weight': 1.0, 'ghidra_match_kind': 'NO_MATCH', 'cfg_completeness': nan, 'asm_insn_count': 0}
{'dwarf_function_name': 'proc_exit', 'is_high_confidence': False, 'confidence_tier': '', 'align_verdict': 'MATCH', 'quality_weight': 1.0, 'ghidra_match_kind': 'NO_MATCH', 'cfg_completeness': nan, 'asm_insn_count': 0}
{'dwarf_function_name': 'ready_exit', 'is_high_confidence': False, 'confidence_tier': '', 'align_verdict': 'MATCH', 'quality_weight': 1.0, 'ghidra_match_kind': 'NO_MATCH', 'cfg_completeness': nan, 'asm_insn_count': 0}
{'dwarf_function_name': 'auth_exit', 'is_high_confidence': False, 'confidence_tier': '', 'align_verdict': 'MATCH', 'quality_weight':
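
The per-gate prints in the cell above can be folded into one helper that reports every failing gate for a row. This is only a sketch: it assumes the HC gates are exactly the seven listed in the cell's comment (MATCH, unique, ratio >= 0.95, JOINED_STRONG, not noise, CFG != LOW, no fatal warnings), and the names `failed_hc_gates` and `_truthy` are hypothetical, not part of the pipeline. Missing or NaN flag columns are treated as "not set", which is also an assumption.

```python
import math


def _truthy(value):
    """Treat None/NaN as False so an absent flag column does not trip a gate."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    return bool(value)


def failed_hc_gates(row):
    """Return the names of the HC gates that `row` fails (empty list = all pass).

    `row` can be a dict or a pandas Series; both support `.get`.
    Gate definitions are assumed from the notebook comment, not from pipeline code.
    """
    ratio = row.get("align_overlap_ratio")
    # NaN compares False against any threshold, so a missing ratio fails the gate.
    ratio_ok = isinstance(ratio, (int, float)) and ratio >= 0.95
    gates = [
        ("MATCH", row.get("align_verdict") == "MATCH"),
        ("unique", row.get("align_n_candidates") == 1),
        ("ratio >= 0.95", ratio_ok),
        ("JOINED_STRONG", row.get("ghidra_match_kind") == "JOINED_STRONG"),
        ("not noise", not _truthy(row.get("is_noise_function"))),
        ("cfg != LOW", row.get("cfg_completeness") != "LOW"),
        ("no fatal warnings", not _truthy(row.get("has_fatal_warnings"))),
    ]
    return [name for name, ok in gates if not ok]


# Values copied from the get_shutdown_state row printed earlier in this notebook.
example_row = {
    "align_verdict": "MATCH",
    "align_n_candidates": 1.0,
    "align_overlap_ratio": 1.0,
    "ghidra_match_kind": "JOINED_WEAK",
    "cfg_completeness": "HIGH",
}
print(failed_hc_gates(example_row))  # ['JOINED_STRONG']
```

Applied row by row (for example with `opt_df.apply(failed_hc_gates, axis=1)`), this would show at a glance whether a whole group of functions is blocked by the same gate, as the `JOINED_WEAK` getters appear to be.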

In [17]:
# t13: Why do the goto functions get n_cand=2 and low qw?
# These are all defined in separate .c files but the static helpers
# (validate, log_action) are duplicated across TUs
tc_df = df[df["test_case"] == "t13_goto_labels"]
opt_df = tc_df[tc_df["opt"] == "O1"]

# Show ALL functions and their alignment details
print("t13_goto_labels at O1 - all functions:")
cols = ["dwarf_function_name", "align_verdict", "quality_weight", 
        "align_n_candidates", "align_overlap_ratio", "ghidra_match_kind",
        "eligible_for_gold", "is_high_confidence", "exclusion_reason"]
avail = [c for c in cols if c in opt_df.columns]
for _, r in opt_df[avail].sort_values("dwarf_function_name").iterrows():
    print(f"  {r['dwarf_function_name']}: verdict={r['align_verdict']}, "
          f"qw={r['quality_weight']:.3f}, n_cand={r['align_n_candidates']}, "
          f"mk={r['ghidra_match_kind']}, elig={r['eligible_for_gold']}")

t13_goto_labels at O1 - all functions:
  __builtin_putchar: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  __builtin_puts: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  __builtin_puts: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  acquire_resources: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  acquire_resources: verdict=MATCH, qw=0.417, n_cand=2.0, mk=JOINED_STRONG, elig=True
  free: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  free: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  log_action: verdict=MATCH, qw=1.000, n_cand=1.0, mk=JOINED_STRONG, elig=True
  log_action: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  main: verdict=MATCH, qw=1.000, n_cand=1.0, mk=JOINED_STRONG, elig=True
  malloc: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
  memset: verdict=NON_TARGET, qw=0.000, n_cand=nan, mk=NO_RANGE, elig=False
 