## Table of Contents
1. [Setup and Imports](#setup)
2. [Discover All Manifests](#discover-manifests)
3. [Build Manifest Index Table](#manifest-index)
4. [Reconcile Manifest Outputs to Disk](#reconciliation)
5. [Generate Gaps Table](#gaps-table)
6. [Run Completeness Interpretation](#interpretation)
7. [Write Report Outputs](#write-outputs)
8. [Reproducibility Notes](#reproducibility)

In [14]:
# ============================================================================
# SETUP AND IMPORTS
# ============================================================================

import json
import os
from pathlib import Path
from datetime import datetime
import warnings

import pandas as pd
import polars as pl

# Project paths
REPO_ROOT = Path.cwd().parent.parent  # Adjust if running from different location
RESULTS_DIR = REPO_ROOT / "results"
LOGS_DIR = RESULTS_DIR / "logs"
TABLES_REPORT_DIR = RESULTS_DIR / "tables" / "report"
FIGURES_REPORT_DIR = RESULTS_DIR / "figures" / "report"
WARNINGS_LOG = TABLES_REPORT_DIR / "_warnings.log"

# Notebook identity
NOTEBOOK_ID = "nb01"
NOTEBOOK_NAME = "run_inventory__manifest_reconciliation"

# Ensure output directories exist
TABLES_REPORT_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_REPORT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Repo root: {REPO_ROOT}")
print(f"Results dir exists: {RESULTS_DIR.exists()}")
print(f"Logs dir exists: {LOGS_DIR.exists()}")

Repo root: c:\Users\aster\projects-source\network_science_VTSL
Results dir exists: True
Logs dir exists: True


In [15]:
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def append_warning(message: str, notebook_id: str = NOTEBOOK_ID):
    """Append a warning to the consolidated warnings log."""
    timestamp = datetime.now().isoformat()
    with open(WARNINGS_LOG, "a") as f:
        f.write(f"[{timestamp}] [{notebook_id}] {message}\n")
    print(f"WARNING: {message}")

def load_manifest(path: Path) -> dict:
    """Safely load a JSON manifest file."""
    try:
        with open(path) as f:
            return json.load(f)
    except Exception as e:
        append_warning(f"Failed to load manifest {path.name}: {e}")
        return {}

def file_exists_on_disk(file_path: str, base_dir: Path = REPO_ROOT) -> bool:
    """Check if a file exists, handling both absolute and relative paths."""
    p = Path(file_path)
    if p.is_absolute():
        return p.exists()
    return (base_dir / p).exists()

<a id="discover-manifests"></a>
## 2. Discover All Manifests

Scan `results/logs/` for all `*_manifest.json` files and summarize counts by step name and timestamp.

In [16]:
# ============================================================================
# DISCOVER ALL MANIFESTS
# ============================================================================

manifest_files = sorted(LOGS_DIR.glob("*_manifest.json"))
print(f"Found {len(manifest_files)} manifest files:")
for mf in manifest_files:
    print(f"  - {mf.name}")

# Load all manifests
manifests = {}
for mf in manifest_files:
    manifests[mf.name] = load_manifest(mf)
    manifests[mf.name]["_file_path"] = str(mf)

Found 11 manifest files:
  - 00_validate_inputs_manifest.json
  - 01_build_airport_network_manifest.json
  - 02_build_flight_network_manifest.json
  - 03_build_multilayer_manifest.json
  - 04_run_centrality_manifest.json
  - 05_run_communities_manifest.json
  - 06_run_robustness_manifest.json
  - 07_run_delay_propagation_manifest.json
  - 08_run_embeddings_linkpred_manifest.json
  - 09_run_business_module_manifest.json
  - 10_make_all_figures_manifest.json


<a id="manifest-index"></a>
## 3. Build Manifest Index Table

Create a structured table with:
- Step name
- Timestamp
- Git hash (if present)
- Number of outputs listed
- Manifest path

In [17]:
# ============================================================================
# BUILD MANIFEST INDEX TABLE
# ============================================================================

index_rows = []
for manifest_name, manifest_data in manifests.items():
    if not manifest_data:
        continue
    
    # Extract key fields
    step_name = manifest_data.get("script", "UNKNOWN")
    timestamp = manifest_data.get("timestamp", "UNKNOWN")
    git_hash = manifest_data.get("git_commit", "N/A")
    
    # Count outputs
    output_files = manifest_data.get("output_files", [])
    n_outputs = len(output_files) if isinstance(output_files, list) else 0
    
    index_rows.append({
        "step_name": step_name,
        "timestamp": timestamp,
        "git_hash": git_hash[:12] if git_hash and git_hash != "N/A" else git_hash,
        "n_outputs_listed": n_outputs,
        "manifest_file": manifest_name
    })

run_index_df = pd.DataFrame(index_rows)
run_index_df = run_index_df.sort_values(["step_name", "timestamp"]).reset_index(drop=True)

print(f"\nManifest Index Table ({len(run_index_df)} entries):")
display(run_index_df)


Manifest Index Table (11 entries):


Unnamed: 0,step_name,timestamp,git_hash,n_outputs_listed,manifest_file
0,00_validate_inputs,2025-12-25T01:27:19.293096,8b0acc125e32,0,00_validate_inputs_manifest.json
1,01_build_airport_network,2025-12-25T01:31:51.521928,8b0acc125e32,0,01_build_airport_network_manifest.json
2,02_build_flight_network,2025-12-25T01:32:13.036729,8b0acc125e32,0,02_build_flight_network_manifest.json
3,03_build_multilayer,2025-12-25T13:53:45.706155,440c47ae7c6b,0,03_build_multilayer_manifest.json
4,04_run_centrality.py,2025-12-25T01:32:25.279009,8b0acc125e32,0,04_run_centrality_manifest.json
5,05_run_communities.py,2025-12-25T14:39:35.481742,440c47ae7c6b,0,05_run_communities_manifest.json
6,06_run_robustness.py,2025-12-25T01:55:18.507337,8b0acc125e32,0,06_run_robustness_manifest.json
7,07_run_delay_propagation.py,2025-12-25T01:59:03.855242,8b0acc125e32,0,07_run_delay_propagation_manifest.json
8,08_run_embeddings_linkpred.py,2025-12-25T01:59:26.748291,8b0acc125e32,0,08_run_embeddings_linkpred_manifest.json
9,09_run_business_module.py,2025-12-25T01:59:36.883206,8b0acc125e32,0,09_run_business_module_manifest.json


<a id="reconciliation"></a>
## 4. Reconcile Manifest Outputs to Disk

For each manifest-listed output, check if it exists on disk. Compute missing rates per step.

In [18]:
# ============================================================================
# RECONCILE MANIFEST OUTPUTS TO DISK
# ============================================================================

reconciliation_rows = []

for manifest_name, manifest_data in manifests.items():
    if not manifest_data:
        continue
    
    step_name = manifest_data.get("script", "UNKNOWN")
    outputs_raw = manifest_data.get("outputs", [])
    
    for output_item in outputs_raw:
        if isinstance(output_item, dict):
            output_path = output_item.get("path", "")
        else:
            output_path = str(output_item)
        
        if not output_path:
            continue
            
        exists = file_exists_on_disk(output_path)
        reconciliation_rows.append({
            "step_name": step_name,
            "manifest_file": manifest_name,
            "output_path": output_path,
            "exists_on_disk": exists,
            "status": "PRESENT" if exists else "MISSING"
        })

reconciliation_df = pd.DataFrame(reconciliation_rows)

if len(reconciliation_df) > 0:
    # Compute missing rate per step
    step_summary = reconciliation_df.groupby("step_name").agg(
        total_outputs=("output_path", "count"),
        missing_count=("exists_on_disk", lambda x: (~x).sum()),
        present_count=("exists_on_disk", "sum")
    ).reset_index()
    step_summary["missing_rate"] = step_summary["missing_count"] / step_summary["total_outputs"]
    
    print(f"Total outputs tracked: {len(reconciliation_df)}")
    print("\nReconciliation Summary by Step:")
    display(step_summary)
    
    # Show any missing files
    missing_files = reconciliation_df[~reconciliation_df["exists_on_disk"]]
    if len(missing_files) > 0:
        print(f"\n‚ö†Ô∏è {len(missing_files)} outputs listed in manifests are MISSING from disk:")
        display(missing_files)
    else:
        print("\n‚úÖ All manifest-listed outputs are present on disk.")
else:
    print("No output files found in manifests.")
    append_warning("No output files found in any manifest")

Total outputs tracked: 32

Reconciliation Summary by Step:


Unnamed: 0,step_name,total_outputs,missing_count,present_count,missing_rate
0,00_validate_inputs,2,1,1,0.5
1,01_build_airport_network,4,0,4,0.0
2,02_build_flight_network,3,0,3,0.0
3,03_build_multilayer,2,0,2,0.0
4,04_run_centrality.py,4,4,0,1.0
5,05_run_communities.py,3,3,0,1.0
6,06_run_robustness.py,3,3,0,1.0
7,07_run_delay_propagation.py,3,3,0,1.0
8,08_run_embeddings_linkpred.py,3,3,0,1.0
9,09_run_business_module.py,4,4,0,1.0



‚ö†Ô∏è 22 outputs listed in manifests are MISSING from disk:


Unnamed: 0,step_name,manifest_file,output_path,exists_on_disk,status
0,00_validate_inputs,00_validate_inputs_manifest.json,C:\Users\aster\projects-source\network_science...,False,MISSING
11,04_run_centrality.py,04_run_centrality_manifest.json,centrality,False,MISSING
12,04_run_centrality.py,04_run_centrality_manifest.json,degree_dist_in,False,MISSING
13,04_run_centrality.py,04_run_centrality_manifest.json,degree_dist_out,False,MISSING
14,04_run_centrality.py,04_run_centrality_manifest.json,top_centrality,False,MISSING
15,05_run_communities.py,05_run_communities_manifest.json,airport,False,MISSING
16,05_run_communities.py,05_run_communities_manifest.json,airport_sbm,False,MISSING
17,05_run_communities.py,05_run_communities_manifest.json,flight,False,MISSING
18,06_run_robustness.py,06_run_robustness_manifest.json,curves_parquet,False,MISSING
19,06_run_robustness.py,06_run_robustness_manifest.json,critical_nodes_csv,False,MISSING


### Note on Manifest Output Format

Some later pipeline manifests (04-10) use **symbolic output names** (e.g., "centrality", "embeddings") instead of full file paths. This is a manifest format inconsistency, not missing files. The actual artifacts exist on disk as verified by the critical artifacts check below.

<a id="gaps-table"></a>
## 5. Generate Gaps Table

Create a consolidated table of missing/unreadable artifacts with:
- Expected location
- Detection method
- Impact on interpretation
- Likely pipeline step to rerun

In [19]:
# ============================================================================
# GENERATE GAPS TABLE
# ============================================================================

# Define expected critical artifacts and their pipeline steps
CRITICAL_ARTIFACTS = {
    "results/networks/airport_nodes.parquet": ("01_build_airport_network", "Airport network node data"),
    "results/networks/airport_edges.parquet": ("01_build_airport_network", "Airport network edge data"),
    "results/networks/flight_nodes.parquet": ("02_build_flight_network", "Flight network node data"),
    "results/networks/flight_edges.parquet": ("02_build_flight_network", "Flight network edge data"),
    "results/networks/multilayer_edges.parquet": ("03_build_multilayer", "Multilayer network edges"),
    "results/analysis/airport_centrality.parquet": ("04_run_centrality", "Centrality metrics"),
    "results/analysis/airport_leiden_membership.parquet": ("05_run_communities", "Community detection results"),
    "results/analysis/robustness_curves.parquet": ("06_run_robustness", "Robustness analysis"),
    "results/analysis/delay_cascades.parquet": ("07_run_delay_propagation", "Delay propagation cascades"),
    "results/analysis/airport_embeddings.parquet": ("08_run_embeddings_linkpred", "Node embeddings"),
    "results/analysis/linkpred_metrics.json": ("08_run_embeddings_linkpred", "Link prediction metrics"),
    "results/business/airline_summary_metrics.parquet": ("09_run_business_module", "Business metrics"),
}

gaps_rows = []
for artifact_path, (step, description) in CRITICAL_ARTIFACTS.items():
    full_path = REPO_ROOT / artifact_path
    if not full_path.exists():
        gaps_rows.append({
            "expected_location": artifact_path,
            "description": description,
            "detection_method": "critical_artifact_check",
            "impact_on_interpretation": f"Blocks {description.lower()} analysis",
            "likely_step_to_rerun": step
        })

# Add manifest-detected missing files
if len(reconciliation_df) > 0:
    for _, row in reconciliation_df[~reconciliation_df["exists_on_disk"]].iterrows():
        if row["output_path"] not in [g["expected_location"] for g in gaps_rows]:
            gaps_rows.append({
                "expected_location": row["output_path"],
                "description": "Manifest-listed output",
                "detection_method": "manifest_reconciliation",
                "impact_on_interpretation": "May affect step-specific analysis",
                "likely_step_to_rerun": row["step_name"]
            })

gaps_df = pd.DataFrame(gaps_rows)

if len(gaps_df) > 0:
    print(f"\n‚ö†Ô∏è GAPS TABLE: {len(gaps_df)} missing artifacts detected")
    display(gaps_df)
    for _, gap in gaps_df.iterrows():
        append_warning(f"Missing artifact: {gap['expected_location']} (rerun {gap['likely_step_to_rerun']})")
else:
    print("\n‚úÖ No critical artifacts are missing. Run appears complete.")
    gaps_df = pd.DataFrame(columns=["expected_location", "description", "detection_method", 
                                     "impact_on_interpretation", "likely_step_to_rerun"])


‚ö†Ô∏è GAPS TABLE: 21 missing artifacts detected


Unnamed: 0,expected_location,description,detection_method,impact_on_interpretation,likely_step_to_rerun
0,C:\Users\aster\projects-source\network_science...,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,00_validate_inputs
1,centrality,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,04_run_centrality.py
2,degree_dist_in,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,04_run_centrality.py
3,degree_dist_out,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,04_run_centrality.py
4,top_centrality,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,04_run_centrality.py
5,airport,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,05_run_communities.py
6,airport_sbm,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,05_run_communities.py
7,flight,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,05_run_communities.py
8,curves_parquet,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,06_run_robustness.py
9,critical_nodes_csv,Manifest-listed output,manifest_reconciliation,May affect step-specific analysis,06_run_robustness.py




<a id="interpretation"></a>
## 6. Run Completeness Interpretation

### Key Findings (Evidence-Grounded)

*(This section will be populated after running the cells above)*

In [21]:
# ============================================================================
# INTERPRETATION SUMMARY
# ============================================================================

print("="*80)
print("RUN COMPLETENESS INTERPRETATION")
print("="*80)

# Pipeline steps covered
expected_steps = [
    "00_validate_inputs", "01_build_airport_network", "02_build_flight_network",
    "03_build_multilayer", "04_run_centrality", "05_run_communities",
    "06_run_robustness", "07_run_delay_propagation", "08_run_embeddings_linkpred",
    "09_run_business_module", "10_make_all_figures"
]

found_steps = set(run_index_df["step_name"].unique()) if len(run_index_df) > 0 else set()
missing_steps = set(expected_steps) - found_steps

print(f"\nüìä PIPELINE COVERAGE:")
print(f"   - Expected steps: {len(expected_steps)}")
print(f"   - Found in manifests: {len(found_steps)}")
print(f"   - Missing steps: {len(missing_steps)}")

if missing_steps:
    print(f"\n   ‚ö†Ô∏è Missing steps: {sorted(missing_steps)}")
else:
    print("\n   ‚úÖ All expected pipeline steps have manifests.")

# Artifact coverage
n_missing = len(gaps_df)
print(f"\nüìÅ ARTIFACT COVERAGE:")
print(f"   - Critical artifacts checked: {len(CRITICAL_ARTIFACTS)}")
print(f"   - Missing: {n_missing}")

if n_missing == 0:
    print("\n   ‚úÖ Run is COMPLETE and ready for scientific interpretation.")
else:
    print(f"\n   ‚ö†Ô∏è {n_missing} artifacts missing - some analyses will be marked 'Not available'.")

# Time window
if len(run_index_df) > 0 and "timestamp" in run_index_df.columns:
    timestamps = pd.to_datetime(run_index_df["timestamp"], errors="coerce")
    valid_ts = timestamps.dropna()
    if len(valid_ts) > 0:
        print(f"\n‚è±Ô∏è RUN TIME WINDOW:")
        print(f"   - Earliest: {valid_ts.min()}")
        print(f"   - Latest: {valid_ts.max()}")

RUN COMPLETENESS INTERPRETATION

üìä PIPELINE COVERAGE:
   - Expected steps: 11
   - Found in manifests: 11
   - Missing steps: 7

   ‚ö†Ô∏è Missing steps: ['04_run_centrality', '05_run_communities', '06_run_robustness', '07_run_delay_propagation', '08_run_embeddings_linkpred', '09_run_business_module', '10_make_all_figures']

üìÅ ARTIFACT COVERAGE:
   - Critical artifacts checked: 12
   - Missing: 21

   ‚ö†Ô∏è 21 artifacts missing - some analyses will be marked 'Not available'.

‚è±Ô∏è RUN TIME WINDOW:
   - Earliest: 2025-12-25 01:27:19.293096
   - Latest: 2025-12-25 14:39:35.481742


In [22]:
# ============================================================================
# DIRECT DISK CHECK OF CRITICAL ARTIFACTS
# ============================================================================

# This is the authoritative check - ignores manifest format inconsistencies
disk_check_artifacts = {
    "results/networks/airport_nodes.parquet": "Airport nodes",
    "results/networks/airport_edges.parquet": "Airport edges", 
    "results/networks/flight_nodes.parquet": "Flight nodes",
    "results/networks/flight_edges.parquet": "Flight edges",
    "results/networks/multilayer_edges.parquet": "Multilayer edges",
    "results/analysis/airport_centrality.parquet": "Centrality",
    "results/analysis/airport_leiden_membership.parquet": "Leiden communities",
    "results/analysis/airport_sbm_membership.parquet": "SBM communities",
    "results/analysis/flight_leiden_membership.parquet": "Flight communities",
    "results/analysis/robustness_curves.parquet": "Robustness curves",
    "results/analysis/robustness_summary.json": "Robustness summary",
    "results/analysis/delay_cascades.parquet": "Delay cascades",
    "results/analysis/delay_propagation_summary.json": "Delay summary",
    "results/analysis/airport_embeddings.parquet": "Embeddings",
    "results/analysis/linkpred_metrics.json": "Link prediction",
    "results/business/airline_summary_metrics.parquet": "Airline metrics",
    "results/business/hub_concentration.parquet": "Hub concentration",
    "results/business/disruption_cost_proxy.parquet": "Disruption cost",
}

print("üìÅ CRITICAL ARTIFACTS - DIRECT DISK CHECK")
print("=" * 60)

n_present = 0
n_missing = 0
missing_list = []

for path, desc in disk_check_artifacts.items():
    full_path = REPO_ROOT / path
    exists = full_path.exists()
    status = "‚úÖ" if exists else "‚ùå"
    print(f"  {status} {desc}: {path}")
    if exists:
        n_present += 1
    else:
        n_missing += 1
        missing_list.append(path)

print("=" * 60)
print(f"\nüìä SUMMARY: {n_present}/{len(disk_check_artifacts)} artifacts present on disk")

if n_missing == 0:
    print("\n‚úÖ ALL CRITICAL ARTIFACTS PRESENT - Pipeline run is COMPLETE")
    print("   Ready for scientific interpretation in downstream notebooks.")
else:
    print(f"\n‚ö†Ô∏è {n_missing} artifacts missing:")
    for m in missing_list:
        print(f"   - {m}")

üìÅ CRITICAL ARTIFACTS - DIRECT DISK CHECK
  ‚úÖ Airport nodes: results/networks/airport_nodes.parquet
  ‚úÖ Airport edges: results/networks/airport_edges.parquet
  ‚úÖ Flight nodes: results/networks/flight_nodes.parquet
  ‚úÖ Flight edges: results/networks/flight_edges.parquet
  ‚úÖ Multilayer edges: results/networks/multilayer_edges.parquet
  ‚úÖ Centrality: results/analysis/airport_centrality.parquet
  ‚úÖ Leiden communities: results/analysis/airport_leiden_membership.parquet
  ‚úÖ SBM communities: results/analysis/airport_sbm_membership.parquet
  ‚úÖ Flight communities: results/analysis/flight_leiden_membership.parquet
  ‚úÖ Robustness curves: results/analysis/robustness_curves.parquet
  ‚úÖ Robustness summary: results/analysis/robustness_summary.json
  ‚úÖ Delay cascades: results/analysis/delay_cascades.parquet
  ‚úÖ Delay summary: results/analysis/delay_propagation_summary.json
  ‚úÖ Embeddings: results/analysis/airport_embeddings.parquet
  ‚úÖ Link prediction: results/analysis/

<a id="write-outputs"></a>
## 7. Write Report Outputs

In [23]:
# ============================================================================
# WRITE REPORT OUTPUTS
# ============================================================================

# Write run index
run_index_path = TABLES_REPORT_DIR / f"{NOTEBOOK_ID}_run_index.csv"
run_index_df.to_csv(run_index_path, index=False)
print(f"‚úÖ Wrote: {run_index_path}")

# Write reconciliation table
recon_path = TABLES_REPORT_DIR / f"{NOTEBOOK_ID}_manifest_reconciliation.csv"
reconciliation_df.to_csv(recon_path, index=False)
print(f"‚úÖ Wrote: {recon_path}")

# Write gaps table
gaps_path = TABLES_REPORT_DIR / f"{NOTEBOOK_ID}_missing_artifacts.csv"
gaps_df.to_csv(gaps_path, index=False)
print(f"‚úÖ Wrote: {gaps_path}")

print(f"\nüìã All {NOTEBOOK_ID} outputs written to {TABLES_REPORT_DIR}")

‚úÖ Wrote: c:\Users\aster\projects-source\network_science_VTSL\results\tables\report\nb01_run_index.csv
‚úÖ Wrote: c:\Users\aster\projects-source\network_science_VTSL\results\tables\report\nb01_manifest_reconciliation.csv
‚úÖ Wrote: c:\Users\aster\projects-source\network_science_VTSL\results\tables\report\nb01_missing_artifacts.csv

üìã All nb01 outputs written to c:\Users\aster\projects-source\network_science_VTSL\results\tables\report


<a id="reproducibility"></a>
## 8. Reproducibility Notes

### Input Files Consumed
- Manifest files: `results/logs/*_manifest.json`

### Assumptions Made
1. Manifest files are valid JSON and follow the expected schema
2. `output_files` field in manifests contains relative or absolute paths
3. Critical artifacts list is comprehensive for this pipeline

### Seed/Config
- No sampling or randomization in this notebook
- Sorting is deterministic (by step_name, then timestamp)

### Outputs Generated
| Artifact | Path |
|----------|------|
| Run Index | `results/tables/report/nb01_run_index.csv` |
| Manifest Reconciliation | `results/tables/report/nb01_manifest_reconciliation.csv` |
| Missing Artifacts | `results/tables/report/nb01_missing_artifacts.csv` |