## Table of Contents
1. [Setup and Imports](#setup)
2. [Discover Business Artifacts](#discover)
3. [Load and Inspect Business Data](#load)
4. [Airline KPI Summary](#kpi-summary)
5. [Hub Concentration Analysis](#hub-concentration)
6. [Disruption Cost Proxy](#disruption-cost)
7. [Strategy-Resilience Trade-off](#tradeoff)
8. [Interpretation](#interpretation)
9. [Write Report Outputs](#write-outputs)
10. [Reproducibility Notes](#reproducibility)

In [None]:
# ============================================================================
# SETUP AND IMPORTS
# ============================================================================

import json
from pathlib import Path
from datetime import datetime
import warnings

import pandas as pd
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Project paths
REPO_ROOT = Path.cwd().parent.parent
RESULTS_DIR = REPO_ROOT / "results"
BUSINESS_DIR = RESULTS_DIR / "business"
TABLES_REPORT_DIR = RESULTS_DIR / "tables" / "report"
FIGURES_REPORT_DIR = RESULTS_DIR / "figures" / "report"
WARNINGS_LOG = TABLES_REPORT_DIR / "_warnings.log"

# Notebook identity
NOTEBOOK_ID = "nb08"
NOTEBOOK_NAME = "business__hub_strategy_and_resilience"

# Plotting settings
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")

# Ensure output directories exist
TABLES_REPORT_DIR.mkdir(parents=True, exist_ok=True)
FIGURES_REPORT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Business dir exists: {BUSINESS_DIR.exists()}")

In [None]:
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def append_warning(message: str, notebook_id: str = NOTEBOOK_ID):
    """Append a warning to the consolidated warnings log."""
    timestamp = datetime.now().isoformat()
    with open(WARNINGS_LOG, "a") as f:
        f.write(f"[{timestamp}] [{notebook_id}] {message}\n")
    print(f"WARNING: {message}")

def safe_load_parquet(path: Path) -> pl.DataFrame | None:
    """Safely load a parquet file, returning None if it fails."""
    try:
        return pl.read_parquet(path)
    except Exception as e:
        append_warning(f"Failed to load {path.name}: {e}")
        return None

def plot_top15_bar(df: pd.DataFrame, metric: str, airline_col: str, title: str, output_path: Path):
    """Create horizontal bar chart for top 15 airlines by a metric."""
    top15 = df.nlargest(15, metric)
    
    fig, ax = plt.subplots(figsize=(12, 8))
    colors = sns.color_palette("viridis", len(top15))
    bars = ax.barh(range(len(top15)), top15[metric], color=colors)
    ax.set_yticks(range(len(top15)))
    ax.set_yticklabels(top15[airline_col])
    ax.invert_yaxis()
    ax.set_xlabel(metric.replace("_", " ").title())
    ax.set_title(title)
    
    plt.tight_layout()
    plt.savefig(output_path, dpi=150)
    plt.show()
    print(f"✅ Saved: {output_path.name}")

<a id="discover"></a>
## 2. Discover Business Artifacts

In [None]:
# ============================================================================
# DISCOVER BUSINESS ARTIFACTS
# ============================================================================

business_files = list(BUSINESS_DIR.glob("*.parquet")) + list(BUSINESS_DIR.glob("*.csv")) + list(BUSINESS_DIR.glob("*.json"))

print(f"Found {len(business_files)} business artifacts:")
for bf in sorted(business_files):
    print(f"  - {bf.name}")

# Primary files
airline_summary_file = BUSINESS_DIR / "airline_summary_metrics.parquet"
hub_concentration_file = BUSINESS_DIR / "hub_concentration.parquet"
disruption_cost_file = BUSINESS_DIR / "disruption_cost_proxy.parquet"

print(f"\nAirline summary exists: {airline_summary_file.exists()}")
print(f"Hub concentration exists: {hub_concentration_file.exists()}")
print(f"Disruption cost exists: {disruption_cost_file.exists()}")

<a id="load"></a>
## 3. Load and Inspect Business Data

In [None]:
# ============================================================================
# LOAD AND INSPECT BUSINESS DATA
# ============================================================================

business_dfs = {}

for name, path in [
    ("airline_summary", airline_summary_file),
    ("hub_concentration", hub_concentration_file),
    ("disruption_cost", disruption_cost_file)
]:
    if path.exists():
        df = safe_load_parquet(path)
        if df is not None:
            business_dfs[name] = df
            print(f"\n{'='*60}")
            print(f"{name.upper()}")
            print(f"{'='*60}")
            print(f"Shape: {df.shape}")
            print(f"Columns: {df.columns}")
            display(df.head(10).to_pandas())
    else:
        append_warning(f"{path.name} not found")

if len(business_dfs) == 0:
    append_warning("No business data could be loaded")

<a id="kpi-summary"></a>
## 4. Airline KPI Summary

Create a consolidated view of airline-level key performance indicators.

In [None]:
# ============================================================================
# AIRLINE KPI SUMMARY
# ============================================================================

airline_kpi = None

if "airline_summary" in business_dfs:
    airline_summary = business_dfs["airline_summary"].to_pandas()
    
    # Identify airline column
    airline_col = next((c for c in ["carrier", "airline", "OP_UNIQUE_CARRIER"] 
                        if c in airline_summary.columns), None)
    
    if airline_col:
        print(f"Airline column: {airline_col}")
        print(f"\nNumber of airlines: {airline_summary[airline_col].nunique()}")
        
        # Identify numeric KPI columns
        # select_dtypes("number") also catches unsigned and narrow integer dtypes
        # (e.g. uint32 counts coming from polars) and excludes booleans
        kpi_cols = [c for c in airline_summary.select_dtypes(include="number").columns
                   if c != airline_col]
        
        print(f"\nKPI columns: {kpi_cols}")
        
        airline_kpi = airline_summary[[airline_col] + kpi_cols].copy()
        display(airline_kpi.describe())
    else:
        append_warning("Could not identify airline column in summary")
else:
    print("Not available: airline summary not loaded")

In [None]:
# ============================================================================
# PLOT TOP-15 FOR KEY METRICS
# ============================================================================

if airline_kpi is not None and airline_col:
    # Select key metrics to visualize
    priority_metrics = ["total_flights", "mean_dep_delay", "mean_arr_delay", 
                       "cancellation_rate", "total_cost", "delay_cost"]
    
    metrics_to_plot = [m for m in priority_metrics if m in kpi_cols]
    if len(metrics_to_plot) == 0:
        metrics_to_plot = kpi_cols[:4]  # Fallback to first 4
    
    for metric in metrics_to_plot:
        fig_path = FIGURES_REPORT_DIR / f"{NOTEBOOK_ID}_airline_kpi_top15__{metric}.png"
        plot_top15_bar(
            airline_kpi, 
            metric, 
            airline_col,
            f"Top 15 Airlines by {metric.replace('_', ' ').title()}",
            fig_path
        )

<a id="hub-concentration"></a>
## 5. Hub Concentration Analysis

Examine airline hub concentration patterns.

In [None]:
# ============================================================================
# HUB CONCENTRATION ANALYSIS
# ============================================================================

if "hub_concentration" in business_dfs:
    hub_conc = business_dfs["hub_concentration"].to_pandas()
    
    # Identify columns
    airline_col_hub = next((c for c in ["carrier", "airline", "OP_UNIQUE_CARRIER"] 
                            if c in hub_conc.columns), None)
    
    print(f"Hub concentration columns: {list(hub_conc.columns)}")
    display(hub_conc.head(15))
    
    # Look for top-1 and top-3 concentration metrics
    top1_col = next((c for c in hub_conc.columns if "top1" in c.lower() or "hub_1" in c.lower()), None)
    top3_col = next((c for c in hub_conc.columns if "top3" in c.lower() or "hub_3" in c.lower()), None)
    
    if airline_col_hub and (top1_col or top3_col):
        conc_col = top1_col or top3_col
        
        # Sort by concentration
        hub_sorted = hub_conc.sort_values(conc_col, ascending=False)
        
        fig, ax = plt.subplots(figsize=(12, 8))
        colors = sns.color_palette("RdYlBu_r", len(hub_sorted))
        bars = ax.barh(range(len(hub_sorted)), hub_sorted[conc_col], color=colors)
        ax.set_yticks(range(len(hub_sorted)))
        ax.set_yticklabels(hub_sorted[airline_col_hub])
        ax.invert_yaxis()
        ax.set_xlabel(conc_col.replace("_", " ").title())
        ax.set_title("Hub Concentration by Airline")
        
        plt.tight_layout()
        fig_path = FIGURES_REPORT_DIR / f"{NOTEBOOK_ID}_hub_concentration.png"
        plt.savefig(fig_path, dpi=150)
        plt.show()
        print(f"✅ Saved: {fig_path.name}")
else:
    print("Not available: hub concentration data not loaded")

<a id="disruption-cost"></a>
## 6. Disruption Cost Proxy

Examine estimated disruption costs by airline.

In [None]:
# ============================================================================
# DISRUPTION COST PROXY
# ============================================================================

if "disruption_cost" in business_dfs:
    disruption = business_dfs["disruption_cost"].to_pandas()
    
    print(f"Disruption cost columns: {list(disruption.columns)}")
    display(disruption.head(15))
    
    # Identify columns
    airline_col_dis = next((c for c in ["carrier", "airline", "OP_UNIQUE_CARRIER"] 
                            if c in disruption.columns), None)
    cost_col = next((c for c in disruption.columns if "cost" in c.lower() and "total" in c.lower()), 
                   next((c for c in disruption.columns if "cost" in c.lower()), None))
    
    if airline_col_dis and cost_col:
        fig_path = FIGURES_REPORT_DIR / f"{NOTEBOOK_ID}_disruption_cost_proxy.png"
        plot_top15_bar(
            disruption,
            cost_col,
            airline_col_dis,
            f"Top 15 Airlines by {cost_col.replace('_', ' ').title()}",
            fig_path
        )
else:
    print("Not available: disruption cost data not loaded")

<a id="tradeoff"></a>
## 7. Strategy-Resilience Trade-off

Explore the relationship between hub concentration and disruption vulnerability.

In [None]:
# ============================================================================
# STRATEGY-RESILIENCE TRADE-OFF
# ============================================================================

if "hub_concentration" in business_dfs and "disruption_cost" in business_dfs:
    hub_conc = business_dfs["hub_concentration"].to_pandas()
    disruption = business_dfs["disruption_cost"].to_pandas()
    
    # Identify common airline column
    hub_airline = next((c for c in ["carrier", "airline", "OP_UNIQUE_CARRIER"] if c in hub_conc.columns), None)
    dis_airline = next((c for c in ["carrier", "airline", "OP_UNIQUE_CARRIER"] if c in disruption.columns), None)
    
    if hub_airline and dis_airline:
        # Merge datasets
        merged = hub_conc.merge(disruption, left_on=hub_airline, right_on=dis_airline, how="inner")
        
        # Find concentration and cost columns
        conc_col = next((c for c in merged.columns if "top1" in c.lower() or "concentration" in c.lower()), None)
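        # Prefer a total-cost column; otherwise fall back to the first cost column
        cost_col = next((c for c in merged.columns if "cost" in c.lower() and "total" in c.lower()),
                        next((c for c in merged.columns if "cost" in c.lower()), None))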
        
        if conc_col and cost_col:
            fig, ax = plt.subplots(figsize=(10, 8))
            
            ax.scatter(merged[conc_col], merged[cost_col], s=100, alpha=0.7)
            
            # Label points
            for _, row in merged.iterrows():
                ax.annotate(row[hub_airline], (row[conc_col], row[cost_col]), 
                           fontsize=8, alpha=0.8)
            
            ax.set_xlabel(conc_col.replace("_", " ").title())
            ax.set_ylabel(cost_col.replace("_", " ").title())
            ax.set_title("Hub Concentration vs Disruption Cost")
            
            # Compute correlation
            corr = merged[conc_col].corr(merged[cost_col])
            ax.text(0.05, 0.95, f"Correlation: {corr:.3f}", transform=ax.transAxes,
                   fontsize=12, verticalalignment="top",
                   bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5))
            
            plt.tight_layout()
            fig_path = FIGURES_REPORT_DIR / f"{NOTEBOOK_ID}_concentration_vs_cost.png"
            plt.savefig(fig_path, dpi=150)
            plt.show()
            print(f"✅ Saved: {fig_path.name}")
            
            print(f"\n📊 TRADE-OFF ANALYSIS:")
            print(f"   Correlation between {conc_col} and {cost_col}: {corr:.3f}")
            if corr > 0.3:
                print("   ⚠️ Positive correlation: higher hub concentration is associated with higher disruption cost")
            elif corr < -0.3:
                print("   ✅ Negative correlation: higher hub concentration is associated with lower disruption cost")
            else:
                print("   ↔️ Weak correlation - relationship is unclear")
        else:
            print("Could not identify concentration or cost columns for trade-off analysis")
else:
    print("Not available: need both hub concentration and disruption cost data")

<a id="interpretation"></a>
## 8. Interpretation

### Key Findings (Evidence-Grounded)

*(Populated after running cells above)*

### Mechanistic Explanation

- **Hub concentration**: Airlines with high top-1 share route most traffic through a single hub
- **Disruption cost proxy**: Estimated operational cost from delays and cancellations
- **Trade-off hypothesis**: Concentrated hubs may be efficient but vulnerable
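
As a rough illustration of the top-N share idea above, the sketch below computes top-1 and top-3 shares from per-airport flight counts. The carrier codes, airport codes, column names, and counts are all hypothetical; the actual `hub_concentration.parquet` file may have been produced differently.

```python
import pandas as pd

# Hypothetical per-airport flight counts for two illustrative carriers.
flights = pd.DataFrame({
    "carrier": ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"],
    "airport": ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"],
    "n_flights": [700, 100, 100, 100, 250, 250, 250, 250],
})

def top_n_share(counts: pd.DataFrame, n: int) -> pd.Series:
    """Share of each carrier's flights handled by its n busiest airports."""
    def share(group: pd.Series) -> float:
        return group.nlargest(n).sum() / group.sum()
    return counts.groupby("carrier")["n_flights"].apply(share)

top1_share = top_n_share(flights, 1)
top3_share = top_n_share(flights, 3)
print(pd.DataFrame({"top1_share": top1_share, "top3_share": top3_share}))
```

In this toy data, carrier `C1` is concentrated (most flights through one airport) while `C2` spreads its flights evenly, which is the contrast the trade-off hypothesis is about.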

### Caveats (Important)
1. **Ecological fallacy**: Airline-level patterns may not reflect individual route behavior
2. **Cost proxies**: Based on parameter assumptions, not actual financial data
3. **Correlation ≠ causation**: Hub strategy may correlate with other confounders (airline size, routes served)
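
One way to probe the airline-size confounder in caveat 3 is to normalize cost by flight volume before correlating. The sketch below uses made-up numbers and assumed column names (`top1_share`, `total_flights`, `total_cost`); it only shows how the sign of a correlation can depend on the normalization and says nothing about the real data.

```python
import pandas as pd

# Hypothetical airline-level table; all names and values are illustrative.
df = pd.DataFrame({
    "carrier": ["C1", "C2", "C3", "C4", "C5"],
    "top1_share": [0.60, 0.45, 0.30, 0.25, 0.15],
    "total_flights": [1000, 2000, 4000, 8000, 16000],
    "total_cost": [50_000, 90_000, 160_000, 280_000, 480_000],
})

# Total cost grows with airline size, so compare cost per flight as well.
df["cost_per_flight"] = df["total_cost"] / df["total_flights"]

raw_corr = df["top1_share"].corr(df["total_cost"])
per_flight_corr = df["top1_share"].corr(df["cost_per_flight"])

# Rank-based (Spearman-style) correlation: Pearson correlation of the ranks.
rank_corr = df["top1_share"].rank().corr(df["cost_per_flight"].rank())

print(f"Raw total cost:      {raw_corr:.3f}")
print(f"Cost per flight:     {per_flight_corr:.3f}")
print(f"Rank, cost/flight:   {rank_corr:.3f}")
```

With only a handful of airlines, any of these coefficients is a weak signal, so it seems safer to report the normalized and rank-based values alongside the raw one rather than rely on a single number.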

### Evidence Links
- Table: `results/tables/report/nb08_airline_kpi_summary.csv`
- Figures: `results/figures/report/nb08_*.png`

<a id="write-outputs"></a>
## 9. Write Report Outputs

In [None]:
# ============================================================================
# WRITE REPORT OUTPUTS
# ============================================================================

# Write airline KPI summary
if airline_kpi is not None:
    kpi_path = TABLES_REPORT_DIR / f"{NOTEBOOK_ID}_airline_kpi_summary.csv"
    airline_kpi.to_csv(kpi_path, index=False)
    print(f"✅ Wrote: {kpi_path}")

print(f"\n📋 All {NOTEBOOK_ID} outputs written.")

<a id="reproducibility"></a>
## 10. Reproducibility Notes

### Input Files Consumed
- `results/business/airline_summary_metrics.parquet`
- `results/business/hub_concentration.parquet`
- `results/business/disruption_cost_proxy.parquet`

### Assumptions Made
1. Cost proxies use parameters from config.yaml
2. Hub concentration = share of flights through top-N airports
3. Aggregation is by operating carrier

### Aggregation Semantics
| Metric Type | Aggregation Method |
|-------------|-------------------|
| Volume metrics (flights, passengers) | Sum |
| Rate metrics (delay rate, cancellation rate) | Weighted mean by volume |
| Cost metrics | Sum |
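
The aggregation rules in the table above can be sketched as follows. The route-level rows, column names, and values are hypothetical; the point is only that rate metrics are recombined from volume-weighted counts instead of being averaged directly.

```python
import pandas as pd

# Hypothetical route-level rows; names and values are illustrative only.
routes = pd.DataFrame({
    "carrier": ["C1", "C1", "C2", "C2"],
    "flights": [900, 100, 500, 500],
    "cancellation_rate": [0.01, 0.11, 0.02, 0.04],
    "delay_cost": [9000.0, 3000.0, 5000.0, 7000.0],
})

# Convert the rate back to a count so that busy routes carry more weight.
routes["cancelled"] = routes["cancellation_rate"] * routes["flights"]

by_carrier = routes.groupby("carrier").agg(
    flights=("flights", "sum"),        # volume metric: sum
    cancelled=("cancelled", "sum"),    # helper for the weighted mean
    delay_cost=("delay_cost", "sum"),  # cost metric: sum
)

# Rate metric: weighted mean by volume = total cancelled / total flights.
by_carrier["cancellation_rate"] = by_carrier["cancelled"] / by_carrier["flights"]
by_carrier = by_carrier.drop(columns="cancelled")
print(by_carrier)
```

For carrier `C1`, a plain mean of the two route rates would give 0.06, while the volume-weighted rate is 0.02, because the high-cancellation route carries only a tenth of the flights.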

### Outputs Generated
| Artifact | Path |
|----------|------|
| Airline KPI Summary | `results/tables/report/nb08_airline_kpi_summary.csv` |
| KPI Top-15 Figures | `results/figures/report/nb08_airline_kpi_top15__*.png` |
| Hub Concentration | `results/figures/report/nb08_hub_concentration.png` |
| Disruption Cost | `results/figures/report/nb08_disruption_cost_proxy.png` |
| Concentration vs Cost | `results/figures/report/nb08_concentration_vs_cost.png` |