In [1]:
# %% [markdown]
# # Qualitative Error Analysis: Issues & ORT by Method, Pipeline, and Topology
#
# Methods:
# 1. Direct (Non-Reasoning)
# 2. Direct (Reasoning)
# 3. Prompt2DAG (Template)
# 4. Prompt2DAG (LLM)
# 5. Prompt2DAG (Hybrid)
#
# We:
# - Compute ORT_Score (issue-adjusted robustness): base score minus weighted issue penalties, min-max scaled to [0, 10]
# - Summarize issue statistics (Total, Critical, Major, Minor) and ORT by method
# - Aggregate per pipeline and compare Prompt2DAG vs Direct
# - Compute Δ issues and Δ ORT per pipeline vs Direct (Non-Reasoning)
# - Group improvements by topology (with df_meta)
# - Investigate correlation patterns and anomalies

# %%
import pandas as pd
import numpy as np
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

csv_path = "/Users/abubakarialidu/Desktop/Data Result/all_sessions_cleaned.csv"
df = pd.read_csv(csv_path)

print(f"Loaded {len(df):,} rows, {len(df.columns)} columns")
print(f"\nColumns available: {df.columns.tolist()[:10]}...")
df.head(3)

# %% [markdown]
# ## 0. Classify methods and compute ORT_Score

# %%
def classify_method(row):
    workflow = row.get("Workflow", "")
    strategy = str(row.get("Strategy") or "").lower()
    if workflow == "Direct":
        return "Direct (Non-Reasoning)"
    elif workflow == "Reasoning":
        return "Direct (Reasoning)"
    elif workflow == "Prompt2DAG":
        if "template" in strategy:
            return "Prompt2DAG (Template)"
        elif "llm" in strategy:
            return "Prompt2DAG (LLM)"
        elif "hybrid" in strategy:
            return "Prompt2DAG (Hybrid)"
        else:
            return f"Prompt2DAG ({row.get('Strategy')})"
    else:
        return workflow

df["Method"] = df.apply(classify_method, axis=1)

METHOD_ORDER = [
    "Direct (Non-Reasoning)",
    "Prompt2DAG (Template)",
    "Prompt2DAG (LLM)",
    "Prompt2DAG (Hybrid)",
    "Direct (Reasoning)",
]

df_methods = df[df["Method"].isin(METHOD_ORDER)].copy()

print("\n" + "=" * 100)
print("METHOD CLASSIFICATION")
print("=" * 100)
print("\nRows per Method:")
method_counts = df_methods["Method"].value_counts().reindex(METHOD_ORDER)
for method, count in method_counts.items():
    print(f"  {method:<30}: {count:>6,} rows")

# %% [markdown]
# ## Ensure issue columns exist and compute ORT

# %%
print("\n" + "=" * 100)
print("ISSUE COLUMN VERIFICATION")
print("=" * 100)

# Ensure issue columns exist and are numeric (missing or non-numeric values count as 0)
for col in ["Critical_Issues", "Major_Issues", "Minor_Issues"]:
    if col not in df_methods.columns:
        df_methods[col] = 0
    df_methods[col] = pd.to_numeric(df_methods[col], errors="coerce").fillna(0)

# Recalculate Total_Issues
df_methods["Total_Issues"] = (
    df_methods["Critical_Issues"] + 
    df_methods["Major_Issues"] + 
    df_methods["Minor_Issues"]
)

print(f"\nIssue columns verified:")
print(f"  Critical_Issues range: [{df_methods['Critical_Issues'].min():.0f}, {df_methods['Critical_Issues'].max():.0f}]")
print(f"  Major_Issues range:    [{df_methods['Major_Issues'].min():.0f}, {df_methods['Major_Issues'].max():.0f}]")
print(f"  Minor_Issues range:    [{df_methods['Minor_Issues'].min():.0f}, {df_methods['Minor_Issues'].max():.0f}]")
print(f"  Total_Issues range:    [{df_methods['Total_Issues'].min():.0f}, {df_methods['Total_Issues'].max():.0f}]")

# %%
print("\n" + "=" * 100)
print("COMPUTING ORT SCORES")
print("=" * 100)

# Penalty weights
ALPHA_CRIT = 2.0
BETA_MAJOR = 1.0
GAMMA_MINOR = 0.25

print(f"\nPenalty weights:")
print(f"  Critical issues: α = {ALPHA_CRIT}")
print(f"  Major issues:    β = {BETA_MAJOR}")
print(f"  Minor issues:    γ = {GAMMA_MINOR}")

# Base score: Combined_Score if Passed, else 0
df_methods["Base_Score"] = np.where(
    df_methods["Passed"] == True, 
    df_methods["Combined_Score"], 
    0.0
)

# Calculate penalty
df_methods["Penalty"] = (
    ALPHA_CRIT * df_methods["Critical_Issues"] +
    BETA_MAJOR * df_methods["Major_Issues"] +
    GAMMA_MINOR * df_methods["Minor_Issues"]
)

# ORT_Score_raw (can be negative)
df_methods["ORT_Score_raw"] = df_methods["Base_Score"] - df_methods["Penalty"]

# ORT_Score_capped (clamped to [0, 10])
df_methods["ORT_Score_capped"] = df_methods["ORT_Score_raw"].clip(lower=0.0, upper=10.0)

# ORT_Score_scaled (min-max normalization to [0, 10])
ort_min = df_methods["ORT_Score_raw"].min()
ort_max = df_methods["ORT_Score_raw"].max()

if ort_max > ort_min:
    df_methods["ORT_Score_scaled"] = 10 * (df_methods["ORT_Score_raw"] - ort_min) / (ort_max - ort_min)
else:
    df_methods["ORT_Score_scaled"] = 0.0

# Use ORT_Score_scaled as the default ORT_Score
df_methods["ORT_Score"] = df_methods["ORT_Score_scaled"]

print(f"\nORT Score Statistics:")
print(f"  ORT_raw range:    [{df_methods['ORT_Score_raw'].min():.2f}, {df_methods['ORT_Score_raw'].max():.2f}]")
print(f"  ORT_capped range: [{df_methods['ORT_Score_capped'].min():.2f}, {df_methods['ORT_Score_capped'].max():.2f}]")
print(f"  ORT_scaled range: [{df_methods['ORT_Score_scaled'].min():.2f}, {df_methods['ORT_Score_scaled'].max():.2f}]")
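
# %% [markdown]
# A minimal worked example of the ORT computation above, on three made-up rows.
# The helper name `ort_raw_demo` and the toy values are assumptions for
# illustration only; they are not taken from the dataset. The sketch also shows
# that the min-max scaled score depends on the min and max of whichever rows
# are scaled together, so scaled values from different subsets are not
# directly comparable.

# %%
import numpy as np
import pandas as pd

def ort_raw_demo(frame, alpha=2.0, beta=1.0, gamma=0.25):
    """Base score (Combined_Score if Passed, else 0) minus the weighted issue penalty."""
    base = np.where(frame["Passed"], frame["Combined_Score"], 0.0)
    penalty = (
        alpha * frame["Critical_Issues"]
        + beta * frame["Major_Issues"]
        + gamma * frame["Minor_Issues"]
    )
    return pd.Series(base, index=frame.index) - penalty

# Hypothetical runs (not real data)
toy_runs = pd.DataFrame({
    "Passed": [True, True, False],
    "Combined_Score": [8.0, 6.0, 5.0],
    "Critical_Issues": [0, 1, 2],
    "Major_Issues": [1, 0, 3],
    "Minor_Issues": [2, 4, 0],
})

toy_raw = ort_raw_demo(toy_runs)
# Row 0: 8.0 - (0 + 1 + 0.5) = 6.5
# Row 1: 6.0 - (2 + 0 + 1.0) = 3.0
# Row 2: 0.0 - (4 + 3 + 0.0) = -7.0  (failed run, so the base score is 0)

# Min-max scaling to [0, 10] over these three rows only
toy_scaled = 10 * (toy_raw - toy_raw.min()) / (toy_raw.max() - toy_raw.min())
print(pd.DataFrame({"raw": toy_raw, "scaled": toy_scaled.round(2)}))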

# %%
print("\n" + "=" * 100)
print("CORRELATION CHECK: ORT_Score vs Combined_Score")
print("=" * 100)

corr_all = df_methods[["Combined_Score", "ORT_Score"]].corr().iloc[0, 1]
print(f"\nAll runs: r = {corr_all:.3f}")

df_passed = df_methods[df_methods["Passed"] == True]
if len(df_passed) > 0:
    corr_passed = df_passed[["Combined_Score", "ORT_Score"]].corr().iloc[0, 1]
    print(f"Passed runs only: r = {corr_passed:.3f}")

# %% [markdown]
# ## 1. Global Issue & ORT Statistics by Method (Row-Level)

# %%
print("\n" + "=" * 100)
print("TABLE Q1: ISSUE & ORT STATISTICS BY METHOD (ROW-LEVEL, ALL RUNS)")
print("=" * 100)

records = []
for method in METHOD_ORDER:
    df_m = df_methods[df_methods["Method"] == method]
    if len(df_m) == 0:
        continue
    
    df_m_passed = df_m[df_m["Passed"] == True]
    
    row = {
        "Method": method,
        "N_Total": len(df_m),
        "N_Passed": len(df_m_passed),
        "Pass_%": f"{df_m['Passed'].mean() * 100:.1f}",
        "Combined_All": f"{df_m['Combined_Score'].mean():.2f} ± {df_m['Combined_Score'].std():.2f}",
        "ORT_All": f"{df_m['ORT_Score'].mean():.2f} ± {df_m['ORT_Score'].std():.2f}",
        "Total_Issues_All": f"{df_m['Total_Issues'].mean():.2f} ± {df_m['Total_Issues'].std():.2f}",
        "Critical_All": f"{df_m['Critical_Issues'].mean():.2f} ± {df_m['Critical_Issues'].std():.2f}",
        "Major_All": f"{df_m['Major_Issues'].mean():.2f} ± {df_m['Major_Issues'].std():.2f}",
        "Minor_All": f"{df_m['Minor_Issues'].mean():.2f} ± {df_m['Minor_Issues'].std():.2f}",
        "Frac_w_Critical_%": f"{(df_m['Critical_Issues'] > 0).mean() * 100:.1f}",
        "Frac_Zero_Issues_%": f"{(df_m['Total_Issues'] == 0).mean() * 100:.1f}",
    }
    
    records.append(row)

q1_df = pd.DataFrame(records)
print("\n" + q1_df.to_string(index=False))

# %%
print("\n" + "=" * 100)
print("TABLE Q1b: ISSUE & ORT STATISTICS BY METHOD (ROW-LEVEL, PASSED RUNS ONLY)")
print("=" * 100)

records_passed = []
for method in METHOD_ORDER:
    df_m = df_methods[(df_methods["Method"] == method) & (df_methods["Passed"] == True)]
    if len(df_m) == 0:
        continue
    
    row = {
        "Method": method,
        "N_Passed": len(df_m),
        "Combined_Passed": f"{df_m['Combined_Score'].mean():.2f} ± {df_m['Combined_Score'].std():.2f}",
        "ORT_Passed": f"{df_m['ORT_Score'].mean():.2f} ± {df_m['ORT_Score'].std():.2f}",
        "Total_Issues_Passed": f"{df_m['Total_Issues'].mean():.2f} ± {df_m['Total_Issues'].std():.2f}",
        "Critical_Passed": f"{df_m['Critical_Issues'].mean():.2f} ± {df_m['Critical_Issues'].std():.2f}",
        "Major_Passed": f"{df_m['Major_Issues'].mean():.2f} ± {df_m['Major_Issues'].std():.2f}",
        "Minor_Passed": f"{df_m['Minor_Issues'].mean():.2f} ± {df_m['Minor_Issues'].std():.2f}",
        "Frac_w_Critical_%": f"{(df_m['Critical_Issues'] > 0).mean() * 100:.1f}",
        "Frac_Zero_Issues_%": f"{(df_m['Total_Issues'] == 0).mean() * 100:.1f}",
    }
    
    records_passed.append(row)

q1b_df = pd.DataFrame(records_passed)
print("\n" + q1b_df.to_string(index=False))

# %% [markdown]
# ## 2. Pipeline × Method: Average Issues & ORT per Pipeline

# %%
print("\n" + "=" * 100)
print("PIPELINE × METHOD AGGREGATION")
print("=" * 100)

agg_cols = {
    "Combined_Score": "mean",
    "Static_Score": "mean",
    "Compliance_Score": "mean",
    "ORT_Score": "mean",
    "Total_Issues": "mean",
    "Critical_Issues": "mean",
    "Major_Issues": "mean",
    "Minor_Issues": "mean",
    "Passed": "mean",  # Pass rate per pipeline-method
}

pipe_method_agg = (
    df_methods
    .groupby(["Pipeline_ID", "Method"])
    .agg(agg_cols)
    .reset_index()
)

pipe_method_agg.rename(columns={"Passed": "Pass_Rate"}, inplace=True)

print(f"\nAggregated to {len(pipe_method_agg)} Pipeline × Method combinations")
print(f"Unique pipelines: {pipe_method_agg['Pipeline_ID'].nunique()}")
print(f"Unique methods: {pipe_method_agg['Method'].nunique()}")

print("\nSample:")
print(pipe_method_agg.head(10).to_string(index=False))

# %% [markdown]
# ## 3. Per-Pipeline Δ in Issues & ORT: Direct vs Prompt2DAG Methods

# %%
print("\n" + "=" * 100)
print("COMPUTING PER-PIPELINE DELTAS (vs Direct Non-Reasoning)")
print("=" * 100)

# Pivot to get one row per pipeline, columns per method for issues and ORT
pivot = pipe_method_agg.pivot_table(
    index="Pipeline_ID",
    columns="Method",
    values=["Total_Issues", "Critical_Issues", "Major_Issues", "Minor_Issues", "ORT_Score", "Combined_Score"]
)

# Flatten multiindex columns
pivot.columns = [f"{metric}__{method}" for metric, method in pivot.columns]

print(f"\nPivot shape: {pivot.shape}")
print(f"Columns: {pivot.columns.tolist()[:10]}...")

# %%
baseline = "Direct (Non-Reasoning)"

delta_records = []
for pipeline_id, row in pivot.iterrows():
    rec = {"Pipeline_ID": pipeline_id}
    
    # Get baseline values
    base_total = row.get(f"Total_Issues__{baseline}", np.nan)
    base_crit = row.get(f"Critical_Issues__{baseline}", np.nan)
    base_major = row.get(f"Major_Issues__{baseline}", np.nan)
    base_minor = row.get(f"Minor_Issues__{baseline}", np.nan)
    base_ort = row.get(f"ORT_Score__{baseline}", np.nan)
    base_combined = row.get(f"Combined_Score__{baseline}", np.nan)
    
    # Store baseline values
    rec["Base_Total_Issues"] = base_total
    rec["Base_Critical"] = base_crit
    rec["Base_ORT"] = base_ort
    rec["Base_Combined"] = base_combined
    
    # Calculate deltas for each non-baseline method (Prompt2DAG variants and Direct Reasoning)
    for method in ["Prompt2DAG (Template)", "Prompt2DAG (LLM)", "Prompt2DAG (Hybrid)", "Direct (Reasoning)"]:
        method_short = method.replace("Prompt2DAG ", "").replace("(", "").replace(")", "").replace(" ", "_")
        
        total_col = f"Total_Issues__{method}"
        crit_col = f"Critical_Issues__{method}"
        major_col = f"Major_Issues__{method}"
        minor_col = f"Minor_Issues__{method}"
        ort_col = f"ORT_Score__{method}"
        combined_col = f"Combined_Score__{method}"
        
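        # All six columns are read below, so require all of them (pivot_table drops all-NaN columns)
        required_cols = [total_col, crit_col, major_col, minor_col, ort_col, combined_col]
        if all(c in row.index for c in required_cols):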
            rec[f"Δ_Total_Issues_{method_short}"] = row[total_col] - base_total
            rec[f"Δ_Critical_{method_short}"] = row[crit_col] - base_crit
            rec[f"Δ_Major_{method_short}"] = row[major_col] - base_major
            rec[f"Δ_Minor_{method_short}"] = row[minor_col] - base_minor
            rec[f"Δ_ORT_{method_short}"] = row[ort_col] - base_ort
            rec[f"Δ_Combined_{method_short}"] = row[combined_col] - base_combined
            
            # Store absolute values too
            rec[f"{method_short}_ORT"] = row[ort_col]
            rec[f"{method_short}_Critical"] = row[crit_col]
    
    delta_records.append(rec)

df_delta = pd.DataFrame(delta_records).set_index("Pipeline_ID")

print(f"\nDelta dataframe shape: {df_delta.shape}")
print("\nSample (first 5 pipelines):")
display_cols = [c for c in df_delta.columns if c.startswith("Δ_ORT") or c.startswith("Δ_Critical")][:6]
print(df_delta[display_cols].head(5).round(2).to_string())
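
# %% [markdown]
# The per-pipeline deltas computed in the loop above can also be expressed as a
# single vectorized subtraction. A minimal sketch on made-up numbers, handling
# one metric at a time; `toy_agg`, `toy_wide`, and `toy_delta` are hypothetical
# names used only for illustration.

# %%
import pandas as pd

toy_baseline = "Direct (Non-Reasoning)"
toy_agg = pd.DataFrame({
    "Pipeline_ID": ["p1", "p1", "p2", "p2"],
    "Method": [toy_baseline, "Prompt2DAG (LLM)", toy_baseline, "Prompt2DAG (LLM)"],
    "ORT_Score": [5.0, 7.0, 6.0, 5.5],
})

# One row per pipeline, one column per method
toy_wide = toy_agg.pivot(index="Pipeline_ID", columns="Method", values="ORT_Score")

# Subtract the baseline column from every method column, row by row
toy_delta = toy_wide.sub(toy_wide[toy_baseline], axis=0).drop(columns=toy_baseline)
print(toy_delta)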

# %% [markdown]
# ## 3.1 Counts: Pipelines where P2D reduces issues and improves ORT vs Direct

# %%
print("\n" + "=" * 100)
print("TABLE Q3: PIPELINES WHERE P2D METHODS REDUCE ISSUES AND/OR IMPROVE ORT VS DIRECT")
print("=" * 100)

comparison_results = []

for method in ["Prompt2DAG (Template)", "Prompt2DAG (LLM)", "Prompt2DAG (Hybrid)", "Direct (Reasoning)"]:
    method_short = method.replace("Prompt2DAG ", "").replace("(", "").replace(")", "").replace(" ", "_")
    
    col_crit = f"Δ_Critical_{method_short}"
    col_total = f"Δ_Total_Issues_{method_short}"
    col_major = f"Δ_Major_{method_short}"
    col_minor = f"Δ_Minor_{method_short}"
    col_ort = f"Δ_ORT_{method_short}"
    col_combined = f"Δ_Combined_{method_short}"
    
    if col_crit not in df_delta.columns or col_ort not in df_delta.columns:
        continue
    
    df_m = df_delta.dropna(subset=[col_crit, col_ort])
    n_pipelines = len(df_m)
    
    # Critical issues
    fewer_crit = (df_m[col_crit] < 0).sum()
    more_crit = (df_m[col_crit] > 0).sum()
    same_crit = (df_m[col_crit] == 0).sum()
    
    # Total issues
    fewer_total = (df_m[col_total] < 0).sum()
    more_total = (df_m[col_total] > 0).sum()
    same_total = (df_m[col_total] == 0).sum()
    
    # ORT
    better_ort = (df_m[col_ort] > 0).sum()
    worse_ort = (df_m[col_ort] < 0).sum()
    same_ort = (df_m[col_ort] == 0).sum()
    
    # Combined Score
    better_combined = (df_m[col_combined] > 0).sum()
    worse_combined = (df_m[col_combined] < 0).sum()
    
    # Win conditions
    fewer_crit_and_better_ort = ((df_m[col_crit] < 0) & (df_m[col_ort] > 0)).sum()
    fewer_total_and_better_ort = ((df_m[col_total] < 0) & (df_m[col_ort] > 0)).sum()
    better_ort_and_combined = ((df_m[col_ort] > 0) & (df_m[col_combined] > 0)).sum()
    
    # Calculate averages
    avg_delta_crit = df_m[col_crit].mean()
    avg_delta_total = df_m[col_total].mean()
    avg_delta_ort = df_m[col_ort].mean()
    avg_delta_combined = df_m[col_combined].mean()
    
    comparison_results.append({
        "Method": method,
        "N_Pipelines": n_pipelines,
        "Fewer_Critical": fewer_crit,
        "More_Critical": more_crit,
        "Fewer_Total_Issues": fewer_total,
        "More_Total_Issues": more_total,
        "Better_ORT": better_ort,
        "Worse_ORT": worse_ort,
        "Better_Combined": better_combined,
        "Fewer_Crit_AND_Better_ORT": fewer_crit_and_better_ort,
        "Fewer_Total_AND_Better_ORT": fewer_total_and_better_ort,
        "Better_ORT_AND_Combined": better_ort_and_combined,
        "Avg_Δ_Critical": f"{avg_delta_crit:.2f}",
        "Avg_Δ_Total_Issues": f"{avg_delta_total:.2f}",
        "Avg_Δ_ORT": f"{avg_delta_ort:.2f}",
        "Avg_Δ_Combined": f"{avg_delta_combined:.2f}",
    })

q3_df = pd.DataFrame(comparison_results)
print("\n" + q3_df.to_string(index=False))

# %%
print("\n" + "=" * 100)
print("DETAILED BREAKDOWN BY METHOD")
print("=" * 100)

for method in ["Prompt2DAG (Template)", "Prompt2DAG (LLM)", "Prompt2DAG (Hybrid)", "Direct (Reasoning)"]:
    method_short = method.replace("Prompt2DAG ", "").replace("(", "").replace(")", "").replace(" ", "_")
    
    col_crit = f"Δ_Critical_{method_short}"
    col_ort = f"Δ_ORT_{method_short}"
    
    if col_crit not in df_delta.columns or col_ort not in df_delta.columns:
        continue
    
    df_m = df_delta.dropna(subset=[col_crit, col_ort])
    
    print(f"\n{method}:")
    print(f"  Pipelines: {len(df_m)}")
    print(f"  Critical Issues: {(df_m[col_crit] < 0).sum()} fewer, {(df_m[col_crit] > 0).sum()} more, {(df_m[col_crit] == 0).sum()} same")
    print(f"  ORT Score:       {(df_m[col_ort] > 0).sum()} better, {(df_m[col_ort] < 0).sum()} worse, {(df_m[col_ort] == 0).sum()} same")
    print(f"  Win-Win:         {((df_m[col_crit] < 0) & (df_m[col_ort] > 0)).sum()} pipelines (fewer critical + better ORT)")
    print(f"  Avg Δ Critical:  {df_m[col_crit].mean():.2f}")
    print(f"  Avg Δ ORT:       {df_m[col_ort].mean():.2f}")
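
# %% [markdown]
# The short-name derivation repeated in the loops above could be factored into
# one helper. A sketch only: the name `short_method_name` is hypothetical and
# does not appear elsewhere in this notebook.

# %%
def short_method_name(method):
    """Turn a method label into the suffix used in the Δ column names."""
    return (
        method.replace("Prompt2DAG ", "")
        .replace("(", "")
        .replace(")", "")
        .replace(" ", "_")
    )

for demo_method in ["Prompt2DAG (Template)", "Prompt2DAG (Hybrid)", "Direct (Reasoning)"]:
    print(f"{demo_method:<30} -> Δ_ORT_{short_method_name(demo_method)}")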

# %% [markdown]
# ## 3.2 Load Pipeline Metadata

# %%
import json
from pathlib import Path

print("\n" + "=" * 100)
print("LOADING PIPELINE METADATA")
print("=" * 100)

META_JSON_PATH = 'pipeline_analysis_results/pipeline_analysis_complete.json'
meta_path = Path(META_JSON_PATH)

if not meta_path.exists():
    print(f"⚠️ Metadata JSON not found at {META_JSON_PATH}")
    print("Skipping topology-based analysis")
    df_meta = pd.DataFrame()
else:
    with open(meta_path, "r", encoding="utf-8") as f:
        meta_json = json.load(f)
    
    meta_entries = meta_json.get("analyses", [])
    print(f"Metadata entries: {len(meta_entries)}")
    
    # Build a metadata DataFrame
    meta_records = []
    for entry in meta_entries:
        src = entry.get("source_file", "")
        pipeline_id = src.replace("_description.txt", "").replace(".txt", "")
        
        topology = entry.get("topology", {})
        processing = entry.get("processing", {})
        scheduling = entry.get("scheduling", {})
        complexity = entry.get("complexity", {})
        external = entry.get("external_services", {})
        
        meta_records.append({
            "Pipeline_ID": pipeline_id,
            "pipeline_name": entry.get("pipeline_name", pipeline_id),
            "business_domain": entry.get("business_domain"),
            "domain_category": entry.get("domain_category"),
            "primary_objective": entry.get("primary_objective"),
            "topology_pattern": topology.get("pattern"),
            "parallelization_level": topology.get("parallelization_level"),
            "has_sensors": topology.get("has_sensors"),
            "has_branches": topology.get("has_branches"),
            "total_tasks": processing.get("total_tasks"),
            "etl_pattern": processing.get("etl_pattern"),
            "service_integration_pattern": external.get("service_integration_pattern"),
            "schedule_type": scheduling.get("schedule_type"),
            "complexity_score": complexity.get("complexity_score")
        })
    
    df_meta = pd.DataFrame(meta_records)
    print(f"\nMetadata columns: {df_meta.columns.tolist()}")
    print(f"Metadata shape: {df_meta.shape}")
    
    # Coverage check
    print("\n" + "=" * 80)
    print("METADATA COVERAGE CHECK")
    print("=" * 80)
    print(f"Unique pipelines in df_methods: {df_methods['Pipeline_ID'].nunique()}")
    print(f"Unique pipelines in df_meta: {df_meta['Pipeline_ID'].nunique()}")
    
    missing_in_meta = sorted(set(df_methods["Pipeline_ID"]) - set(df_meta["Pipeline_ID"]))
    missing_in_scores = sorted(set(df_meta["Pipeline_ID"]) - set(df_methods["Pipeline_ID"]))
    
    print(f"\nPipelines in scores but NOT in metadata: {len(missing_in_meta)}")
    if missing_in_meta and len(missing_in_meta) <= 10:
        print(f"  {missing_in_meta}")
    
    print(f"\nPipelines in metadata but NOT in scores: {len(missing_in_scores)}")
    if missing_in_scores and len(missing_in_scores) <= 10:
        print(f"  {missing_in_scores}")

# %% [markdown]
# ## 4. Grouped Δ Critical Issues and Δ ORT by Topology

# %%
if len(df_meta) > 0:
    print("\n" + "=" * 100)
    print("TOPOLOGY-BASED ANALYSIS")
    print("=" * 100)
    
    # Merge delta results with metadata
    df_delta_meta = df_delta.merge(
        df_meta.set_index("Pipeline_ID"),
        left_index=True,
        right_index=True,
        how="left"
    )
    
    print(f"\nMerged df_delta_meta shape: {df_delta_meta.shape}")
    print(f"Rows with topology_pattern: {df_delta_meta['topology_pattern'].notna().sum()}")
    
    # Analyze for each Prompt2DAG method
    for method in ["Prompt2DAG (Template)", "Prompt2DAG (LLM)", "Prompt2DAG (Hybrid)"]:
        method_short = method.replace("Prompt2DAG ", "").replace("(", "").replace(")", "").replace(" ", "_")
        
        col_crit = f"Δ_Critical_{method_short}"
        col_ort = f"Δ_ORT_{method_short}"
        
        if col_crit in df_delta_meta.columns and col_ort in df_delta_meta.columns:
            df_with_topology = df_delta_meta.dropna(subset=["topology_pattern", col_crit, col_ort])
            
            if len(df_with_topology) > 0:
                print("\n" + "=" * 80)
                print(f"TABLE Q4: Δ CRITICAL ISSUES & Δ ORT BY TOPOLOGY ({method} vs Direct)")
                print("=" * 80)
                
                topol_summary = (
                    df_with_topology
                    .groupby("topology_pattern")
                    .agg({
                        col_crit: ["mean", "std", "count"],
                        col_ort: ["mean", "std", "count"]
                    })
                    .round(2)
                )
                
                # Flatten column names
                topol_summary.columns = [f"{col}_{stat}" for col, stat in topol_summary.columns]
                topol_summary = topol_summary.sort_values(f"{col_ort}_mean", ascending=False)
                
                print(f"\n{method}:")
                print(topol_summary.to_string())
    
    # Summary table across all P2D methods
    print("\n" + "=" * 80)
    print("TABLE Q5: Δ ORT BY TOPOLOGY - ALL PROMPT2DAG METHODS")
    print("=" * 80)
    
    summary_rows = []
    
    for topology in df_delta_meta["topology_pattern"].dropna().unique():
        row = {"Topology": topology}
        
        for method in ["Prompt2DAG (Template)", "Prompt2DAG (LLM)", "Prompt2DAG (Hybrid)"]:
            method_short = method.replace("Prompt2DAG ", "").replace("(", "").replace(")", "").replace(" ", "_")
            col_ort = f"Δ_ORT_{method_short}"
            
            if col_ort in df_delta_meta.columns:
                subset = df_delta_meta[
                    (df_delta_meta["topology_pattern"] == topology) & 
                    (df_delta_meta[col_ort].notna())
                ]
                
                if len(subset) > 0:
                    row[f"{method_short}_Δ_ORT"] = subset[col_ort].mean()
                    row[f"{method_short}_N"] = len(subset)
        
        summary_rows.append(row)
    
    topology_summary = pd.DataFrame(summary_rows)
    
    # Add best method per topology
    ort_cols = [c for c in topology_summary.columns if c.endswith("_Δ_ORT")]
    if len(ort_cols) > 0:
        topology_summary["Best_Method"] = topology_summary[ort_cols].idxmax(axis=1)
        topology_summary["Best_Δ_ORT"] = topology_summary[ort_cols].max(axis=1)
    
    print("\n" + topology_summary.round(2).to_string(index=False))

else:
    print("\n⚠️ Skipping topology-based analysis (no metadata available)")

# %% [markdown]
# ## 5. Deep Dive: Correlation Analysis

# %%
print("\n" + "=" * 100)
print("CORRELATION ANALYSIS: ISSUES vs SCORES")
print("=" * 100)

corr_metrics = ["Total_Issues", "Critical_Issues", "Major_Issues", "Minor_Issues"]
score_metrics = ["Combined_Score", "Static_Score", "Compliance_Score", "ORT_Score"]

print("\n--- All Runs ---")
print(f"{'':>20} {'Combined':>12} {'Static':>12} {'Compliance':>12} {'ORT':>12}")
print("-" * 75)

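for issue in corr_metrics:
    row = [issue]
    for score in score_metrics:
        # pearsonr does not skip NaN, so correlate only rows where both values are present
        pair = df_methods[[issue, score]].dropna()
        r, _ = stats.pearsonr(pair[issue], pair[score])
        row.append(f"{r:+.3f}")
    print(f"{row[0]:>20} {row[1]:>12} {row[2]:>12} {row[3]:>12} {row[4]:>12}")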

print("\n--- Passed Runs Only ---")
print(f"{'':>20} {'Combined':>12} {'Static':>12} {'Compliance':>12} {'ORT':>12}")
print("-" * 75)

df_passed = df_methods[df_methods["Passed"] == True]
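for issue in corr_metrics:
    row = [issue]
    for score in score_metrics:
        # pearsonr does not skip NaN, so correlate only rows where both values are present
        pair = df_passed[[issue, score]].dropna()
        r, _ = stats.pearsonr(pair[issue], pair[score])
        row.append(f"{r:+.3f}")
    print(f"{row[0]:>20} {row[1]:>12} {row[2]:>12} {row[3]:>12} {row[4]:>12}")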

# %% [markdown]
# ## 6. Investigate: High Scores with High Issues

# %%
print("\n" + "=" * 100)
print("ANOMALY CHECK: HIGH SCORES WITH HIGH CRITICAL ISSUES")
print("=" * 100)

# Define thresholds
HIGH_SCORE_THRESHOLD = 7.0
CRITICAL_ISSUE_THRESHOLD = 1

anomalous = df_methods[
    (df_methods["Combined_Score"] >= HIGH_SCORE_THRESHOLD) & 
    (df_methods["Critical_Issues"] > CRITICAL_ISSUE_THRESHOLD)
]

print(f"\nRows with Combined_Score ≥ {HIGH_SCORE_THRESHOLD} AND Critical_Issues > {CRITICAL_ISSUE_THRESHOLD}: {len(anomalous)}")
print(f"This is {len(anomalous)/len(df_methods)*100:.2f}% of all rows")

if len(anomalous) > 0:
    print("\nBreakdown by Method:")
    for method in METHOD_ORDER:
        count = len(anomalous[anomalous["Method"] == method])
        total_method = len(df_methods[df_methods["Method"] == method])
        pct = count / total_method * 100 if total_method > 0 else 0
        print(f"  {method:<30}: {count:>5} / {total_method:>5} ({pct:.1f}%)")
    
    print("\nSample of anomalous rows:")
    print(anomalous[["Method", "Pipeline_ID", "Combined_Score", "ORT_Score", 
                     "Critical_Issues", "Major_Issues", "Minor_Issues", "Passed"]].head(10).to_string())

# %% [markdown]
# ## 7. Final Summary Statistics

# %%
print("\n" + "=" * 100)
print("FINAL SUMMARY: PROMPT2DAG vs DIRECT")
print("=" * 100)

# Best performing method per metric
print("\n--- Best Performing Method per Metric (All Runs) ---")

metrics = {
    "Pass Rate": ("Passed", "mean", True),
    "Combined Score": ("Combined_Score", "mean", True),
    "ORT Score": ("ORT_Score", "mean", True),
    "Fewest Total Issues": ("Total_Issues", "mean", False),
    "Fewest Critical Issues": ("Critical_Issues", "mean", False),
}

for metric_name, (col, agg, higher_better) in metrics.items():
    method_scores = df_methods.groupby("Method")[col].agg(agg).sort_values(ascending=not higher_better)
    best_method = method_scores.index[0]
    best_value = method_scores.iloc[0]
    worst_value = method_scores.iloc[-1]
    
    print(f"\n{metric_name}:")
    print(f"  Best:  {best_method:<30} {best_value:.2f}")
    print(f"  Worst: {method_scores.index[-1]:<30} {worst_value:.2f}")
    print(f"  Gap:   {abs(best_value - worst_value):.2f}")

print("\n--- Win Rate: P2D Methods vs Direct (Pipeline-Level) ---")

for method in ["Prompt2DAG (Template)", "Prompt2DAG (LLM)", "Prompt2DAG (Hybrid)"]:
    method_short = method.replace("Prompt2DAG ", "").replace("(", "").replace(")", "").replace(" ", "_")
    col_ort = f"Δ_ORT_{method_short}"
    
    if col_ort in df_delta.columns:
        wins = (df_delta[col_ort] > 0).sum()
        losses = (df_delta[col_ort] < 0).sum()
        ties = (df_delta[col_ort] == 0).sum()
        total = int(df_delta[col_ort].notna().sum())

        # Skip methods with no comparable pipelines to avoid dividing by zero
        if total == 0:
            continue

        print(f"\n{method}:")
        print(f"  Wins:   {wins} / {total} ({wins/total*100:.1f}%)")
        print(f"  Losses: {losses} / {total} ({losses/total*100:.1f}%)")
        print(f"  Ties:   {ties} / {total} ({ties/total*100:.1f}%)")

print("\n" + "=" * 100)
print("QUALITATIVE ANALYSIS COMPLETE")
print("=" * 100)

Loaded 8,742 rows, 94 columns

Columns available: ['Session', 'Run_Name', 'Pipeline_ID', 'Model_ID', 'Std_LLM', 'Reasoning_LLM', 'Workflow', 'Orchestrator', 'Strategy', 'Static_Score']...

METHOD CLASSIFICATION

Rows per Method:
  Direct (Non-Reasoning)        :  2,394 rows
  Prompt2DAG (Template)         :  1,578 rows
  Prompt2DAG (LLM)              :  2,043 rows
  Prompt2DAG (Hybrid)           :  2,043 rows
  Direct (Reasoning)            :    684 rows

ISSUE COLUMN VERIFICATION

Issue columns verified:
  Critical_Issues range: [0, 5]
  Major_Issues range:    [0, 8]
  Minor_Issues range:    [0, 10]
  Total_Issues range:    [0, 17]

COMPUTING ORT SCORES

Penalty weights:
  Critical issues: α = 2.0
  Major issues:    β = 1.0
  Minor issues:    γ = 0.25

ORT Score Statistics:
  ORT_raw range:    [-13.50, 7.69]
  ORT_capped range: [0.00, 7.69]
  ORT_scaled range: [0.00, 10.00]

CORRELATION CHECK: ORT_Score vs Combined_Score

All runs: r = 0.736
Passed runs only: r = 0.247

TABLE Q1: ISSU

In [2]:
#!/usr/bin/env python3
"""
investigate_issue_consistency.py

Deep investigation into issue counts vs scores to identify the root cause
of apparent inconsistencies in the qualitative analysis.
"""

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("=" * 120)
print("COMPREHENSIVE INVESTIGATION: ISSUE COUNTS vs SCORES CONSISTENCY")
print("=" * 120)

# ============================================================================
# 1. LOAD DATA
# ============================================================================

print("\n" + "=" * 120)
print("1. LOADING DATA")
print("=" * 120)

csv_path = "/Users/abubakarialidu/Desktop/Data Result/all_sessions_cleaned.csv"
df = pd.read_csv(csv_path)

print(f"Loaded {len(df):,} rows, {len(df.columns)} columns")

# Classify methods
def classify_method(row):
    workflow = row.get("Workflow", "")
    strategy = str(row.get("Strategy") or "").lower()
    if workflow == "Direct":
        return "Direct (Non-Reasoning)"
    elif workflow == "Reasoning":
        return "Direct (Reasoning)"
    elif workflow == "Prompt2DAG":
        if "template" in strategy:
            return "Prompt2DAG (Template)"
        elif "llm" in strategy:
            return "Prompt2DAG (LLM)"
        elif "hybrid" in strategy:
            return "Prompt2DAG (Hybrid)"
    return workflow

df["Method"] = df.apply(classify_method, axis=1)

METHOD_ORDER = [
    "Direct (Non-Reasoning)",
    "Prompt2DAG (Template)",
    "Prompt2DAG (LLM)",
    "Prompt2DAG (Hybrid)",
    "Direct (Reasoning)",
]

df = df[df["Method"].isin(METHOD_ORDER)].copy()

print(f"\nFiltered to {len(df):,} rows across {len(METHOD_ORDER)} methods")

# ============================================================================
# 2. VERIFY ISSUE COLUMNS EXIST AND ARE NUMERIC
# ============================================================================

print("\n" + "=" * 120)
print("2. ISSUE COLUMN VERIFICATION")
print("=" * 120)

issue_cols = ["Critical_Issues", "Major_Issues", "Minor_Issues", "Total_Issues"]

for col in issue_cols[:3]:
    if col not in df.columns:
        print(f"⚠️  {col} column not found!")
        df[col] = 0
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)

# Recalculate Total_Issues
df["Total_Issues"] = df["Critical_Issues"] + df["Major_Issues"] + df["Minor_Issues"]

print(f"\nIssue columns statistics:")
for col in issue_cols:
    print(f"  {col}:")
    print(f"    Range: [{df[col].min():.0f}, {df[col].max():.0f}]")
    print(f"    Mean:  {df[col].mean():.2f}")
    print(f"    NaN:   {df[col].isna().sum()}")
    print(f"    Zeros: {(df[col] == 0).sum()} ({(df[col] == 0).mean()*100:.1f}%)")

# ============================================================================
# 3. CHECK IF ORT_Score IS ALREADY IN DATA OR NEEDS CALCULATION
# ============================================================================

print("\n" + "=" * 120)
print("3. ORT SCORE VERIFICATION")
print("=" * 120)

has_ort = "ORT_Score" in df.columns or "ORT_Score_scaled" in df.columns

if has_ort:
    print("✓ ORT_Score column found in data")
    if "ORT_Score_scaled" in df.columns:
        df["ORT_Score"] = df["ORT_Score_scaled"]
    print(f"  Range: [{df['ORT_Score'].min():.2f}, {df['ORT_Score'].max():.2f}]")
    print(f"  Mean:  {df['ORT_Score'].mean():.2f}")
else:
    print("⚠️  ORT_Score not found, calculating from scratch...")
    
    ALPHA_CRIT = 2.0
    BETA_MAJOR = 1.0
    GAMMA_MINOR = 0.25
    
    df["Base_Score"] = np.where(df["Passed"] == True, df["Combined_Score"], 0.0)
    df["Penalty"] = (
        ALPHA_CRIT * df["Critical_Issues"] +
        BETA_MAJOR * df["Major_Issues"] +
        GAMMA_MINOR * df["Minor_Issues"]
    )
    df["ORT_Score_raw"] = df["Base_Score"] - df["Penalty"]
    df["ORT_Score_capped"] = df["ORT_Score_raw"].clip(lower=0.0, upper=10.0)
    
    ort_min = df["ORT_Score_raw"].min()
    ort_max = df["ORT_Score_raw"].max()
    if ort_max > ort_min:
        df["ORT_Score_scaled"] = 10 * (df["ORT_Score_raw"] - ort_min) / (ort_max - ort_min)
    else:
        df["ORT_Score_scaled"] = 0.0
    
    df["ORT_Score"] = df["ORT_Score_scaled"]
    print(f"  Calculated ORT_Score range: [{df['ORT_Score'].min():.2f}, {df['ORT_Score'].max():.2f}]")

# ============================================================================
# 4. MANUAL ORT VERIFICATION: CHECK IF FORMULA IS CORRECT
# ============================================================================

print("\n" + "=" * 120)
print("4. MANUAL ORT VERIFICATION (Sample Check)")
print("=" * 120)

# Take a sample and manually verify ORT calculation
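passed_rows = df[df["Passed"] == True]
# Guard against datasets with fewer than 10 passed rows (sample() would raise)
sample_rows = passed_rows.sample(min(10, len(passed_rows)), random_state=42)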

ALPHA_CRIT = 2.0
BETA_MAJOR = 1.0
GAMMA_MINOR = 0.25

print(f"\n{'Method':<30} {'Combined':>8} {'Crit':>5} {'Major':>6} {'Minor':>6} {'Expected_Penalty':>16} {'ORT_Score':>10} {'Manual_Check':>12}")
print("-" * 120)

for _, row in sample_rows.iterrows():
    combined = row["Combined_Score"]
    crit = row["Critical_Issues"]
    major = row["Major_Issues"]
    minor = row["Minor_Issues"]
    ort = row["ORT_Score"]
    
    # Manual calculation
    expected_penalty = ALPHA_CRIT * crit + BETA_MAJOR * major + GAMMA_MINOR * minor
    expected_ort_raw = combined - expected_penalty
    
    # Note: We can't verify scaled ORT without knowing the exact min/max used
    # But we can verify the penalty
    
    print(f"{row['Method']:<30} {combined:>8.2f} {crit:>5.0f} {major:>6.0f} {minor:>6.0f} {expected_penalty:>16.2f} {ort:>10.2f} {'✓' if expected_ort_raw >= 0 else '✗ negative'}")
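
# ----------------------------------------------------------------------------
# Optional sketch (not part of the original pipeline): the scaled ORT cannot be
# recomputed without the min/max used, but if ORT_Score is a plain min-max
# rescaling of the raw score, it is an affine function of it
# (ORT = a * raw + b with a > 0). A straight-line fit of ORT against the
# recomputed raw score should then leave near-zero residuals. This assumes no
# clipping was applied before scaling. The helper name `check_affine_scaling`
# and the tolerance are assumptions chosen for illustration.
# ----------------------------------------------------------------------------
import numpy as np


def check_affine_scaling(raw, scaled, tol=1e-6):
    """Fit scaled ~ a * raw + b; report slope, intercept, and max abs residual."""
    raw = np.asarray(raw, dtype=float)
    scaled = np.asarray(scaled, dtype=float)
    if raw.size < 2 or np.ptp(raw) == 0:
        # A line cannot be identified from fewer than two distinct x values
        return {"slope": np.nan, "intercept": np.nan, "max_resid": np.nan, "ok": False}
    slope, intercept = np.polyfit(raw, scaled, deg=1)
    max_resid = float(np.max(np.abs(scaled - (slope * raw + intercept))))
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "max_resid": max_resid,
        "ok": bool(slope > 0 and max_resid <= tol),
    }


# Possible use on the sample drawn above (passed rows, so the base score is
# Combined_Score); a loose tolerance allows for rounding in the stored values:
# raw = sample_rows["Combined_Score"] - (
#     ALPHA_CRIT * sample_rows["Critical_Issues"]
#     + BETA_MAJOR * sample_rows["Major_Issues"]
#     + GAMMA_MINOR * sample_rows["Minor_Issues"]
# )
# print(check_affine_scaling(raw, sample_rows["ORT_Score"], tol=1e-2))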

# ============================================================================
# 5. CORRELATION ANALYSIS: DETAILED BREAKDOWN
# ============================================================================

print("\n" + "=" * 120)
print("5. CORRELATION ANALYSIS: ISSUES vs SCORES (DETAILED)")
print("=" * 120)

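def safe_corr(df_sub, col1, col2):
    """Pearson r and p-value for two columns after dropping NaN/inf rows.

    Returns (nan, nan) when fewer than two valid rows remain or when either
    column is constant, since the correlation is undefined in those cases.
    """
    valid = df_sub[[col1, col2]].dropna()
    valid = valid[~valid.isin([np.inf, -np.inf]).any(axis=1)]
    if len(valid) < 2 or valid[col1].nunique() < 2 or valid[col2].nunique() < 2:
        return np.nan, np.nan
    r, p = stats.pearsonr(valid[col1], valid[col2])
    return r, p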

# Overall correlation
print("\n--- OVERALL (All Runs) ---")
print(f"{'Issue Type':>20} {'vs Combined':>15} {'sig.':>10} {'vs ORT':>15} {'sig.':>10}")
print("-" * 75)

for issue_col in ["Critical_Issues", "Major_Issues", "Minor_Issues", "Total_Issues"]:
    r_combined, p_combined = safe_corr(df, issue_col, "Combined_Score")
    r_ort, p_ort = safe_corr(df, issue_col, "ORT_Score")
    
    sig_combined = "***" if p_combined < 0.001 else "**" if p_combined < 0.01 else "*" if p_combined < 0.05 else "ns"
    sig_ort = "***" if p_ort < 0.001 else "**" if p_ort < 0.01 else "*" if p_ort < 0.05 else "ns"
    
    print(f"{issue_col:>20} {r_combined:>+14.3f} {sig_combined:>10} {r_ort:>+14.3f} {sig_ort:>10}")

# By Passed status
for passed_status in [True, False]:
    status_label = "PASSED" if passed_status else "FAILED"
    df_sub = df[df["Passed"] == passed_status]
    
    print(f"\n--- {status_label} RUNS ONLY (N={len(df_sub):,}) ---")
    print(f"{'Issue Type':>20} {'vs Combined':>15} {'sig.':>10} {'vs ORT':>15} {'sig.':>10}")
    print("-" * 75)
    
    for issue_col in ["Critical_Issues", "Major_Issues", "Minor_Issues", "Total_Issues"]:
        r_combined, p_combined = safe_corr(df_sub, issue_col, "Combined_Score")
        r_ort, p_ort = safe_corr(df_sub, issue_col, "ORT_Score")
        
        sig_combined = "***" if p_combined < 0.001 else "**" if p_combined < 0.01 else "*" if p_combined < 0.05 else "ns"
        sig_ort = "***" if p_ort < 0.001 else "**" if p_ort < 0.01 else "*" if p_ort < 0.05 else "ns"
        
        print(f"{issue_col:>20} {r_combined:>+14.3f} {sig_combined:>10} {r_ort:>+14.3f} {sig_ort:>10}")

# By Method
print("\n--- BY METHOD (Passed Runs Only) ---")

for method in METHOD_ORDER:
    df_method = df[(df["Method"] == method) & (df["Passed"] == True)]
    
    if len(df_method) < 10:
        continue
    
    print(f"\n{method} (N={len(df_method):,}):")
    print(f"{'Issue Type':>20} {'vs Combined':>15} {'vs ORT':>15}")
    print("-" * 55)
    
    for issue_col in ["Critical_Issues", "Total_Issues"]:
        r_combined, _ = safe_corr(df_method, issue_col, "Combined_Score")
        r_ort, _ = safe_corr(df_method, issue_col, "ORT_Score")
        print(f"{issue_col:>20} {r_combined:>+14.3f} {r_ort:>+14.3f}")

# ============================================================================
# 6. INVESTIGATE: WHY POSITIVE CORRELATION?
# ============================================================================

print("\n" + "=" * 120)
print("6. INVESTIGATING POSITIVE CORRELATION PARADOX")
print("=" * 120)

# Hypothesis 1: Are issues higher for PASSED runs?
print("\n--- Hypothesis 1: Do PASSED runs have MORE issues than FAILED? ---")

passed_issues = df[df["Passed"] == True]["Total_Issues"].mean()
failed_issues = df[df["Passed"] == False]["Total_Issues"].mean()

print(f"  Passed runs: {passed_issues:.2f} issues (mean)")
print(f"  Failed runs: {failed_issues:.2f} issues (mean)")
print(f"  Difference:  {passed_issues - failed_issues:.2f}")

if passed_issues > failed_issues:
    print("  ⚠️ ANOMALY DETECTED: Passed runs have MORE issues!")
else:
    print("  ✓ Normal: Failed runs have more issues")
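
# ----------------------------------------------------------------------------
# Optional sketch (not part of the original analysis): the comparison above
# only looks at the sign of the difference in means. A two-sample test gives a
# rough sense of whether the gap exceeds sampling noise. Welch's t-test and the
# Mann-Whitney U test are shown together because issue counts are discrete and
# skewed; the helper name `compare_issue_counts` is an assumption.
# ----------------------------------------------------------------------------
import numpy as np
from scipy import stats


def compare_issue_counts(passed_values, failed_values):
    """Welch's t-test and Mann-Whitney U test on two samples of issue counts."""
    a = np.asarray(passed_values, dtype=float)
    b = np.asarray(failed_values, dtype=float)
    if len(a) < 2 or len(b) < 2:
        return None  # too few observations for either test
    t_stat, t_p = stats.ttest_ind(a, b, equal_var=False)  # Welch's variant
    _, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return {
        "mean_diff": float(a.mean() - b.mean()),
        "welch_t": float(t_stat),
        "welch_p": float(t_p),
        "mwu_p": float(u_p),
    }


# Possible use with the frames defined in this script:
# print(compare_issue_counts(
#     df[df["Passed"] == True]["Total_Issues"],
#     df[df["Passed"] == False]["Total_Issues"],
# ))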

# Hypothesis 2: Are issues correlated with Combined_Score WITHIN passed runs?
print("\n--- Hypothesis 2: Within PASSED runs, do higher scores have more issues? ---")

df_passed = df[df["Passed"] == True].copy()

# Create bins of Combined_Score
df_passed["Score_Bin"] = pd.cut(df_passed["Combined_Score"], bins=5, labels=["Very Low", "Low", "Medium", "High", "Very High"])

# observed=False keeps empty score bins in the table regardless of pandas version
score_bin_issues = df_passed.groupby("Score_Bin", observed=False).agg({
    "Total_Issues": ["mean", "count"],
    "Critical_Issues": "mean",
    "Major_Issues": "mean",
    "Minor_Issues": "mean"
}).round(2)

print("\n" + score_bin_issues.to_string())

# Hypothesis 3: Are issues synthetically added based on method?
print("\n--- Hypothesis 3: Are issue counts dependent on Method (not actual quality)? ---")

method_issue_stats = df.groupby("Method").agg({
    "Total_Issues": ["mean", "std"],
    "Critical_Issues": ["mean", "std"],
    "Combined_Score": ["mean", "std"],
    "Passed": "mean"
}).round(2)

print("\n" + method_issue_stats.to_string())

# Check if issue distribution is suspiciously uniform
print("\n--- Issue Distribution Check ---")
for method in METHOD_ORDER:
    df_m = df[df["Method"] == method]
    
    # Count unique issue values
    unique_total = df_m["Total_Issues"].nunique()
    unique_crit = df_m["Critical_Issues"].nunique()
    
    print(f"\n{method}:")
    print(f"  Unique Total_Issues values: {unique_total}")
    print(f"  Unique Critical_Issues values: {unique_crit}")
    
    # Show distribution
    if unique_crit <= 10:
        crit_dist = df_m["Critical_Issues"].value_counts().sort_index()
        print(f"  Critical_Issues distribution:")
        for val, count in crit_dist.items():
            print(f"    {val:.0f}: {count} ({count/len(df_m)*100:.1f}%)")

# ============================================================================
# 7. CHECK FOR CONFORMANCE-BASED PATTERNS
# ============================================================================

print("\n" + "=" * 120)
print("7. CONFORMANCE-BASED ISSUE PATTERNS")
print("=" * 120)

# Direct (Non-Reasoning) - Template_Conformance
if "Template_Conformance" in df.columns:
    print("\n--- Direct (Non-Reasoning) by Template_Conformance ---")
    df_dnr = df[df["Method"] == "Direct (Non-Reasoning)"]
    
    for conform in [True, False]:
        df_sub = df_dnr[df_dnr["Template_Conformance"] == conform]
        label = "Conforming" if conform else "Non-Conforming (Penalized)"
        
        print(f"\n{label} (N={len(df_sub):,}):")
        print(f"  Combined_Score: {df_sub['Combined_Score'].mean():.2f} ± {df_sub['Combined_Score'].std():.2f}")
        print(f"  ORT_Score:      {df_sub['ORT_Score'].mean():.2f} ± {df_sub['ORT_Score'].std():.2f}")
        print(f"  Total_Issues:   {df_sub['Total_Issues'].mean():.2f} ± {df_sub['Total_Issues'].std():.2f}")
        print(f"  Critical:       {df_sub['Critical_Issues'].mean():.2f}")
        print(f"  Major:          {df_sub['Major_Issues'].mean():.2f}")
        print(f"  Minor:          {df_sub['Minor_Issues'].mean():.2f}")
        print(f"  Pass Rate:      {df_sub['Passed'].mean()*100:.1f}%")

# Direct (Reasoning) - Reasoning_Conformance
if "Reasoning_Conformance" in df.columns:
    print("\n--- Direct (Reasoning) by Reasoning_Conformance ---")
    df_dr = df[df["Method"] == "Direct (Reasoning)"]
    
    for conform in [True, False]:
        df_sub = df_dr[df_dr["Reasoning_Conformance"] == conform]
        label = "Conforming" if conform else "Non-Conforming (Penalized)"
        
        if len(df_sub) > 0:
            print(f"\n{label} (N={len(df_sub):,}):")
            print(f"  Combined_Score: {df_sub['Combined_Score'].mean():.2f} ± {df_sub['Combined_Score'].std():.2f}")
            print(f"  ORT_Score:      {df_sub['ORT_Score'].mean():.2f} ± {df_sub['ORT_Score'].std():.2f}")
            print(f"  Total_Issues:   {df_sub['Total_Issues'].mean():.2f} ± {df_sub['Total_Issues'].std():.2f}")
            print(f"  Critical:       {df_sub['Critical_Issues'].mean():.2f}")
            print(f"  Major:          {df_sub['Major_Issues'].mean():.2f}")
            print(f"  Minor:          {df_sub['Minor_Issues'].mean():.2f}")
            print(f"  Pass Rate:      {df_sub['Passed'].mean()*100:.1f}%")
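
# ----------------------------------------------------------------------------
# Optional sketch (not part of the original analysis): if one subgroup has its
# scores shifted as a block (for example by a conformance penalty), the pooled
# correlation between issues and scores can have the opposite sign to the
# correlation inside each subgroup. Comparing the pooled r with the per-group r
# is a quick check for that kind of confounding. The helper name
# `pooled_vs_within_corr` is an assumption chosen for illustration.
# ----------------------------------------------------------------------------
import pandas as pd


def pooled_vs_within_corr(data, x_col, y_col, group_col):
    """Return (pooled Pearson r, {group: Pearson r within that group})."""
    pooled = data[x_col].corr(data[y_col])
    within = {
        key: grp[x_col].corr(grp[y_col])
        for key, grp in data.groupby(group_col)
    }
    return pooled, within


# Possible use with the Direct (Non-Reasoning) subset built above, when the
# Template_Conformance column is present:
# pooled, within = pooled_vs_within_corr(
#     df_dnr, "Total_Issues", "Combined_Score", "Template_Conformance"
# )
# print(f"pooled r = {pooled:+.3f}; within-group r = {within}")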

# ============================================================================
# 8. DEEP DIVE: THE 244 ANOMALOUS ROWS
# ============================================================================

print("\n" + "=" * 120)
print("8. DEEP DIVE: HIGH SCORE + HIGH CRITICAL ISSUES ANOMALY")
print("=" * 120)

anomalous = df[(df["Combined_Score"] >= 7.0) & (df["Critical_Issues"] > 1)]

print(f"\nTotal anomalous rows: {len(anomalous):,} ({len(anomalous)/len(df)*100:.2f}%)")

if len(anomalous) > 0:
    print("\n--- Anomalous Rows Statistics ---")
    print(f"  Combined_Score: {anomalous['Combined_Score'].mean():.2f} ± {anomalous['Combined_Score'].std():.2f}")
    print(f"  ORT_Score:      {anomalous['ORT_Score'].mean():.2f} ± {anomalous['ORT_Score'].std():.2f}")
    print(f"  Critical:       {anomalous['Critical_Issues'].mean():.2f}")
    print(f"  Total_Issues:   {anomalous['Total_Issues'].mean():.2f}")
    print(f"  Pass Rate:      {anomalous['Passed'].mean()*100:.1f}%")
    
    print("\n--- By Method ---")
    for method in METHOD_ORDER:
        count = len(anomalous[anomalous["Method"] == method])
        total = len(df[df["Method"] == method])
        pct = count / total * 100 if total > 0 else 0
        
        if count > 0:
            avg_combined = anomalous[anomalous["Method"] == method]["Combined_Score"].mean()
            avg_ort = anomalous[anomalous["Method"] == method]["ORT_Score"].mean()
            avg_crit = anomalous[anomalous["Method"] == method]["Critical_Issues"].mean()
            
            print(f"  {method:<30}: {count:>4} / {total:>5} ({pct:>5.1f}%) | Combined={avg_combined:.2f}, ORT={avg_ort:.2f}, Crit={avg_crit:.1f}")
    
    # Check if these are mostly conforming or non-conforming
    if "Template_Conformance" in anomalous.columns:
        dnr_anom = anomalous[anomalous["Method"] == "Direct (Non-Reasoning)"]
        if len(dnr_anom) > 0:
            conform_count = dnr_anom["Template_Conformance"].sum()
            print(f"\n  Direct (Non-Reasoning) anomalous: {conform_count}/{len(dnr_anom)} conforming ({conform_count/len(dnr_anom)*100:.1f}%)")

# ============================================================================
# 9. SCATTER PLOTS: VISUALIZING THE RELATIONSHIPS
# ============================================================================

print("\n" + "=" * 120)
print("9. GENERATING SCATTER PLOTS (saved to outputs/)")
print("=" * 120)

from pathlib import Path  # needed for output_dir below

import matplotlib.pyplot as plt
import seaborn as sns

output_dir = Path("outputs/experiment_v2/investigation_plots")
output_dir.mkdir(parents=True, exist_ok=True)

# Plot 1: Total_Issues vs Combined_Score (by Passed status)
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
for passed_status in [True, False]:
    df_sub = df[df["Passed"] == passed_status]
    label = "Passed" if passed_status else "Failed"
    alpha = 0.5
    ax.scatter(df_sub["Total_Issues"], df_sub["Combined_Score"], 
               label=label, alpha=alpha, s=20)

ax.set_xlabel("Total_Issues")
ax.set_ylabel("Combined_Score")
ax.set_title("Total Issues vs Combined Score (by Pass Status)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "issues_vs_combined_by_pass_status.png", dpi=150)
print(f"  ✓ Saved: issues_vs_combined_by_pass_status.png")
plt.close()

# Plot 2: Critical_Issues vs ORT_Score (by Method)
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
for method in METHOD_ORDER:
    df_method = df[(df["Method"] == method) & (df["Passed"] == True)]
    ax.scatter(df_method["Critical_Issues"], df_method["ORT_Score"], 
               label=method, alpha=0.6, s=30)

ax.set_xlabel("Critical_Issues")
ax.set_ylabel("ORT_Score")
ax.set_title("Critical Issues vs ORT Score (Passed Runs Only, by Method)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "critical_vs_ort_by_method.png", dpi=150)
print(f"  ✓ Saved: critical_vs_ort_by_method.png")
plt.close()

# Plot 3: Distribution of issues by Method
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

for idx, issue_col in enumerate(["Critical_Issues", "Major_Issues", "Minor_Issues", "Total_Issues"]):
    ax = axes[idx // 2, idx % 2]
    
    for method in METHOD_ORDER:
        df_method = df[df["Method"] == method]
        ax.hist(df_method[issue_col], bins=20, alpha=0.5, label=method)
    
    ax.set_xlabel(issue_col)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Distribution of {issue_col}")
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(output_dir / "issue_distributions_by_method.png", dpi=150)
print(f"  ✓ Saved: issue_distributions_by_method.png")
plt.close()

# ============================================================================
# 10. FINAL DIAGNOSIS
# ============================================================================

print("\n" + "=" * 120)
print("10. FINAL DIAGNOSIS & RECOMMENDATIONS")
print("=" * 120)

# Calculate key diagnostic metrics
passed_vs_failed_issue_diff = df[df["Passed"] == True]["Total_Issues"].mean() - df[df["Passed"] == False]["Total_Issues"].mean()
overall_corr_issues_combined = df[["Total_Issues", "Combined_Score"]].corr().iloc[0, 1]
passed_corr_issues_ort = df[df["Passed"] == True][["Total_Issues", "ORT_Score"]].corr().iloc[0, 1]

print("\n" + "=" * 80)
print("DIAGNOSIS SUMMARY")
print("=" * 80)

issues_found = []

if passed_vs_failed_issue_diff > 0:
    issues_found.append("✗ CRITICAL: Passed runs have MORE issues than failed runs")
    print(f"\n1. ✗ CRITICAL ISSUE: Passed runs have MORE issues than failed runs")
    print(f"   - Passed:  {df[df['Passed'] == True]['Total_Issues'].mean():.2f} issues")
    print(f"   - Failed:  {df[df['Passed'] == False]['Total_Issues'].mean():.2f} issues")
    print(f"   - This suggests issues are NOT properly capturing code quality")

if overall_corr_issues_combined > 0.3:
    issues_found.append("✗ ANOMALY: Positive correlation between issues and scores")
    print(f"\n2. ✗ ANOMALY: Positive correlation between Total_Issues and Combined_Score")
    print(f"   - Correlation: {overall_corr_issues_combined:.3f}")
    print(f"   - Expected: Strong NEGATIVE correlation")
    print(f"   - This suggests issue counts may be synthetic or inverted")

if passed_corr_issues_ort > -0.5:
    issues_found.append("✗ WARNING: Weak or positive correlation between issues and ORT")
    print(f"\n3. ✗ WARNING: Weak or positive correlation between issues and ORT (passed runs)")
    print(f"   - Correlation: {passed_corr_issues_ort:.3f}")
    print(f"   - Expected: clearly NEGATIVE (below the -0.5 threshold checked here)")
    print(f"   - ORT penalty may not be working as intended")

if len(anomalous) / len(df) > 0.02:
    issues_found.append("✗ WARNING: Too many high-score + high-issue cases")
    print(f"\n4. ✗ WARNING: {len(anomalous)/len(df)*100:.1f}% of rows combine a high score (>= 7.0) with more than one critical issue")
    print(f"   - This should stay below the 2% threshold checked here if scoring is consistent")

if len(issues_found) == 0:
    print("\n✓ No major inconsistencies detected")
else:
    print(f"\n" + "=" * 80)
    print(f"ISSUES DETECTED: {len(issues_found)}")
    print("=" * 80)
    for issue in issues_found:
        print(f"  {issue}")

print("\n" + "=" * 80)
print("RECOMMENDATIONS")
print("=" * 80)

print("""
Based on the investigation, here are the recommended actions:

1. VERIFY ISSUE EXTRACTION:
   - Check if issue counts were extracted from actual linting/validation results
   - Or were they synthetically generated based on scores?
   - Review the original extraction script

2. CHECK PENALTY APPLICATION:
   - Confirm that penalties were applied AFTER issues were counted
   - Not BEFORE (which would invert the relationship)

3. INVESTIGATE CONFORMANCE EFFECT:
   - Non-conforming outputs had scores multiplied by 0.5
   - Were issues also adjusted, or do they reflect original values?
   - This could explain the positive correlation

4. CONSIDER REMOVING ISSUE-BASED ANALYSIS:
   - If issues are not reliable, focus on Combined_Score and ORT_Score_scaled only
   - Use Pass Rate as the primary quality metric
   - Report topology-based improvements without issue breakdowns

5. ALTERNATIVE: RE-EXTRACT ISSUES:
   - Go back to raw outputs and re-count issues consistently
   - Ensure issues are independent of score calculations

6. FOR THE PAPER:
   - Focus on Pass Rate, Combined Score, and ORT (without issue breakdown)
   - Use qualitative examples instead of quantitative issue counts
   - Emphasize the 97% win rate of P2D Hybrid/LLM over Direct
""")

print("\n" + "=" * 120)
print("INVESTIGATION COMPLETE")
print("=" * 120)

COMPREHENSIVE INVESTIGATION: ISSUE COUNTS vs SCORES CONSISTENCY

1. LOADING DATA
Loaded 8,742 rows, 94 columns

Filtered to 8,742 rows across 5 methods

2. ISSUE COLUMN VERIFICATION

Issue columns statistics:
  Critical_Issues:
    Range: [0, 5]
    Mean:  0.55
    NaN:   0
    Zeros: 5265 (60.2%)
  Major_Issues:
    Range: [0, 8]
    Mean:  1.77
    NaN:   0
    Zeros: 2196 (25.1%)
  Minor_Issues:
    Range: [0, 10]
    Mean:  3.32
    NaN:   0
    Zeros: 1270 (14.5%)
  Total_Issues:
    Range: [0, 17]
    Mean:  5.64
    NaN:   0
    Zeros: 119 (1.4%)

3. ORT SCORE VERIFICATION
✓ ORT_Score column found in data
  Range: [0.00, 10.00]
  Mean:  6.64

4. MANUAL ORT VERIFICATION (Sample Check)

Method                         Combined  Crit  Major  Minor Expected_Penalty  ORT_Score Manual_Check
------------------------------------------------------------------------------------------------------------------------
Prompt2DAG (Hybrid)                6.27     1      2      1             4.25 