# Swissmetro LLM-based Synthetic Data Generation

## Overall Approach

### Three LLM Generation Routes

| Route | Output variable | LLM output | Sampling method | Key parameters |
|-------|-----------------|------------|-----------------|----------------|
| **A. Two-stage CHOICE** | `syn_final` | CHOICE (1/2/3) | Use the value directly, then repair | low_prob_threshold=0.01 |
| **B. Utility + Softmax** | `syn_u` | U_TRAIN/U_SM/U_CAR | Sample from softmax(U/tau) | tau ∈ [0.2, 0.5, 0.8, 1.0, 1.2] |
| **C. MNL + Residual** | `syn_r` | dU ∈ [-1,1] | logits + λ*dU → softmax | λ ∈ [0.1, 0.2, 0.3, 0.5] |

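### Role of the Baseline MNL
- The baseline **is** the Step2 (main) MNL (with MALE/INCOME/AGE/PURPOSE interaction terms, 42 parameters)
- Used for: computing P_CHOSEN, screening `bad_idx`, the `evaluate_one` metrics, and the parameter stability comparison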
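
As a rough illustration of how baseline probabilities turn into the metrics reported later, the sketch below summarizes an N x 3 probability matrix against observed choices. The function name `summarize_choice_probs` and its output keys are hypothetical; they mirror the `ds_*` metrics printed in this notebook but are not taken from `downstream_metrics`.

```python
import numpy as np

def summarize_choice_probs(P, choice, low_prob_threshold=0.01):
    """Summarize how well an N x 3 probability matrix explains the choices.

    P      : array (N, 3), rows are probabilities for TRAIN, SM, CAR.
    choice : array (N,), coded 1/2/3 as in the CHOICE column.
    """
    P = np.asarray(P, dtype=float)
    idx = np.asarray(choice, dtype=int) - 1        # 1/2/3 -> 0/1/2
    p_chosen = P[np.arange(len(idx)), idx]         # probability of the chosen alternative
    return {
        "avg_P_chosen": float(p_chosen.mean()),
        "accuracy": float((P.argmax(axis=1) == idx).mean()),
        "low_prob_rate": float((p_chosen < low_prob_threshold).mean()),
    }

# Toy probabilities, for illustration only (not Swissmetro estimates).
P_demo = np.array([[0.700, 0.250, 0.050],
                   [0.005, 0.900, 0.095],
                   [0.200, 0.300, 0.500]])
print(summarize_choice_probs(P_demo, choice=[1, 1, 3]))
```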

## 0. Setup

In [None]:
# Add parent directory to path if running from notebooks folder
import sys
sys.path.insert(0, '..')

import os
import numpy as np
import pandas as pd

# Import swissmetro_llm modules
from swissmetro_llm import config
from swissmetro_llm.data import load_swissmetro, build_matrices, split_train_test_by_id
from swissmetro_llm.data.preprocessing import compute_scales_from_train, apply_scales
from swissmetro_llm.models import (
    fit_mnl_step1, predict_step1, fit_mnl_step2, predict_step2,
    build_X_ind, cluster_bootstrap_thetas, compute_bootstrap_se,
    fit_step2_main, get_step2_main_param_names
)
from swissmetro_llm.models.utils import accuracy, neg_loglike_from_P, softmax_rows
from swissmetro_llm.evaluation import (
    evaluate_one, feasibility_min, diversity_report,
    score_with_baseline_mnl, downstream_metrics
)
from swissmetro_llm.generation import (
    create_templates, generate_from_utilities_batch,
    generate_from_residual_batch, generate_two_stage
)
from swissmetro_llm.stability import (
    make_augmented, stability_analysis, format_stability_for_report
)

print("Package loaded successfully!")

## 1. Data Loading

In [None]:
# Set your data path here
DATA_PATH = "../swissmetro.dat"  # UPDATE THIS PATH if needed

# Load data
df = load_swissmetro(DATA_PATH)
print(f"Loaded {len(df)} rows, {df['ID'].nunique()} respondents")
df.head()

In [None]:
# Split into train/test by respondent ID
df_train, df_test, train_ids, test_ids = split_train_test_by_id(df, test_size=0.2, seed=42)

print(f"Train: {len(df_train)} rows, {len(train_ids)} respondents")
print(f"Test: {len(df_test)} rows, {len(test_ids)} respondents")

## 2. Build Matrices and Compute Scales

In [None]:
# Build matrices for MNL estimation
train_data = build_matrices(df_train)
test_data = build_matrices(df_test)

# Compute scales from training data only
tt_scale, co_scale, he_scale = compute_scales_from_train(train_data)

print(f"TT scale: {tt_scale}")
print(f"CO scale: {co_scale}")
print(f"HE scale: {he_scale}")

# Apply scales
train_data = apply_scales(train_data, tt_scale, co_scale, he_scale)
test_data = apply_scales(test_data, tt_scale, co_scale, he_scale)

## 3. MNL Model Estimation (Step1 + Step2)

In [None]:
# Step 1: Basic MNL with level-of-service attributes (6 parameters)
print("=" * 50)
print("Step 1: Basic MNL (6 parameters)")
print("=" * 50)

res1 = fit_mnl_step1(train_data)

print(f"Converged: {res1.success}")
print(f"Neg log-likelihood: {res1.fun:.2f}")

theta1 = res1.x
param_names1 = ["B_TT", "B_CO", "B_HE", "B_SEATS", "ASC_SM", "ASC_CAR"]
print("\nParameters:")
for name, val in zip(param_names1, theta1):
    print(f"  {name:12s}: {val:8.4f}")

In [None]:
# Step 2: Extended MNL with individual characteristics (42 parameters)
print("\n" + "=" * 50)
print("Step 2: Extended MNL with interactions (42 parameters)")
print("=" * 50)

X_train = build_X_ind(train_data)
X_test = build_X_ind(test_data)

res2 = fit_mnl_step2(train_data, X_train, theta1=theta1, maxiter=20000, maxfun=20000)

print(f"Converged: {res2.success}")
print(f"Neg log-likelihood: {res2.fun:.2f}")
print(f"Number of parameters: {len(res2.x)}")

theta2 = res2.x

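# Note: the optimizer may report res2.success == False (e.g. after reaching maxiter/maxfun).
# The estimates can still be usable; check the train/test accuracy and log-likelihood below before relying on them.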

In [None]:
# Evaluate on train and test
P_train = predict_step2(theta2, train_data, X_train)
P_test = predict_step2(theta2, test_data, X_test)

print(f"Train accuracy: {accuracy(P_train, train_data['y']):.3f}")
print(f"Test accuracy: {accuracy(P_test, test_data['y']):.3f}")
print(f"Train LL: {-neg_loglike_from_P(P_train, train_data['y']):.2f}")
print(f"Test LL: {-neg_loglike_from_P(P_test, test_data['y']):.2f}")

## 4. Bootstrap Inference

In [None]:
# Run clustered bootstrap
# Set B=30 for quick test, B=200 for publication
B = 30

print(f"Running {B} bootstrap iterations...")
print("(This may take several minutes)")

thetas_boot = cluster_bootstrap_thetas(
    df_train, 
    train_ids,
    tt_scale, co_scale, he_scale,
    theta_init=theta2,
    B=B,
    seed=2025,
    maxiter=20000,
    maxfun=20000,
    verbose_every=5
)

print(f"\nBootstrap completed: {thetas_boot.shape[0]} / {B} successful")

In [None]:
# Compute bootstrap standard errors and confidence intervals
from swissmetro_llm.models.bootstrap import compute_bootstrap_ci

se_boot = compute_bootstrap_se(thetas_boot)
ci_lower, ci_upper = compute_bootstrap_ci(thetas_boot, alpha=0.05)

print("First 6 parameters with bootstrap inference:")
print("-" * 60)
print(f"{'Param':<12} {'Coef':>10} {'SE':>10} {'z':>8} {'CI_low':>10} {'CI_up':>10}")
print("-" * 60)
for i, name in enumerate(param_names1):
    val = theta2[i]
    se = se_boot[i]
    z = val / se if se > 0 else 0
    print(f"{name:<12} {val:>10.4f} {se:>10.4f} {z:>8.2f} {ci_lower[i]:>10.4f} {ci_upper[i]:>10.4f}")

## 5. Evaluation Framework Setup

In [None]:
# Prepare beta dictionary and scales for evaluation
param_names = get_step2_main_param_names(K=18)
beta = dict(zip(param_names, theta2))

scales = {
    "tt_scale": tt_scale,
    "co_scale": co_scale,
    "he_scale": he_scale,
    "train_ids": train_ids,
}

print(f"Beta dictionary has {len(beta)} parameters")
print(f"First 6 params: {list(beta.keys())[:6]}")

In [None]:
# Evaluate real train/test data as baseline
train_scored = score_with_baseline_mnl(df_train, beta, scales)
test_scored = score_with_baseline_mnl(df_test, beta, scales)

train_metrics = downstream_metrics(train_scored)
test_metrics = downstream_metrics(test_scored)

print("Real Train Metrics:")
for k, v in train_metrics.items():
    print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")

print("\nReal Test Metrics:")
for k, v in test_metrics.items():
    print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")

## 6. Create Templates for Synthetic Generation

In [None]:
# Create templates for synthetic data generation
N = 2000  # Number of synthetic samples
p_unseen = 0.2  # Probability of sampling unseen demographic combinations

templates = create_templates(
    real_train=df_train,
    real_test=df_test,
    N=N,
    p_unseen=p_unseen,
    seed=123
)

print(f"Created {len(templates)} templates")
print(f"Columns: {list(templates.columns)[:10]} ...")

## 7. Route A: Two-stage CHOICE Generation

The LLM outputs CHOICE (1/2/3) directly; the baseline MNL then screens for low-probability samples, which are repaired in a second stage.
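
The screen-then-repair control flow can be sketched as follows. This is a minimal illustration, not the logic inside `generate_two_stage`: the name `two_stage_repair` is hypothetical, and `repair_fn` is a placeholder standing in for the second LLM call.

```python
import numpy as np

def two_stage_repair(choice_stage1, P, repair_fn, low_prob_threshold=0.01):
    """Keep stage-1 choices unless the baseline model finds them implausible.

    choice_stage1 : array (N,), stage-1 choices coded 1/2/3.
    P             : array (N, 3), baseline MNL probabilities per row.
    repair_fn     : callable(bad_idx) -> replacement choices (1/2/3);
                    a stand-in for the second-stage LLM call.
    """
    choice = np.asarray(choice_stage1, dtype=int).copy()
    P = np.asarray(P, dtype=float)
    p_chosen = P[np.arange(len(choice)), choice - 1]
    bad_idx = np.flatnonzero(p_chosen < low_prob_threshold)
    if bad_idx.size:
        choice[bad_idx] = repair_fn(bad_idx)       # only flagged rows are touched
    return choice, bad_idx

# Toy probabilities, for illustration only.
P_demo = np.array([[0.700, 0.250, 0.050],
                   [0.005, 0.900, 0.095],
                   [0.200, 0.300, 0.500]])

def repair_with_argmax(bad_idx):
    """Placeholder repair: fall back to the most probable alternative."""
    return P_demo[bad_idx].argmax(axis=1) + 1

final, bad = two_stage_repair([1, 1, 3], P_demo, repair_with_argmax)
print(final, bad)
```

Only the rows whose chosen alternative falls below `low_prob_threshold` are sent to the second stage, so the stage-1 output is preserved wherever the baseline model considers it plausible.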

In [None]:
# Route A: Two-stage CHOICE generation
# Requires OpenAI API key

if os.environ.get("OPENAI_API_KEY"):
    print("API key found. Running two-stage generation...")
    
    # Define score function for bad_idx detection
    def score_fn(df):
        return score_with_baseline_mnl(df, beta, scales)
    
    syn_final = generate_two_stage(
        templates,
        score_fn=score_fn,
        model_stage1="gpt-4o-mini",
        model_stage2="gpt-4o-mini",
        low_prob_threshold=0.01,
        jsonl_dir="./",
        seed=123,
        cot_stage1=False,
        cot_stage2=True
    )
    
    print(f"\nGenerated {len(syn_final)} synthetic samples")
    print("Choice distribution:")
    print(syn_final["CHOICE"].value_counts(normalize=True).sort_index())
    
    # Evaluate
    eval_twostage = evaluate_one(syn_final, beta, scales, df_train, df_test, "two_stage_choice")
    print("\nEvaluation:")
    for k, v in eval_twostage.items():
        if 'ds_' in k:
            print(f"  {k}: {v:.4f}" if isinstance(v, float) else f"  {k}: {v}")
else:
    print("OPENAI_API_KEY not set. Skipping LLM generation.")
    print("Set: os.environ['OPENAI_API_KEY'] = 'your-key'")

## 8. Route B: Utility + Softmax Generation

The LLM outputs three utility values (U_TRAIN, U_SM, U_CAR); choices are then sampled from softmax(U/tau).
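
The effect of the temperature tau can be seen in a small NumPy sketch. The function name `sample_choice_from_utilities` is hypothetical and the utilities are made up; this is one plausible way to implement softmax(U/tau) sampling, not necessarily what `generate_from_utilities_batch` does internally.

```python
import numpy as np

def sample_choice_from_utilities(U, tau=1.0, seed=123):
    """Sample CHOICE (1/2/3) from softmax(U / tau), row by row.

    U   : array (N, 3) of utilities for TRAIN, SM, CAR.
    tau : temperature; smaller values concentrate mass on the best alternative.
    """
    z = np.asarray(U, dtype=float) / tau
    z = z - z.max(axis=1, keepdims=True)           # stabilize exp()
    P = np.exp(z)
    P /= P.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(seed)
    u = rng.random(len(P))
    # Inverse-CDF sampling: count how many cumulative bins lie below u.
    choice = 1 + (u[:, None] > P.cumsum(axis=1)).sum(axis=1)
    choice = np.minimum(choice, P.shape[1])        # guard against cumsum rounding
    return choice, P

U_demo = np.array([[1.0, 2.0, 0.5]])
for tau in (0.2, 1.0, 1.5):
    _, P = sample_choice_from_utilities(U_demo, tau=tau)
    print(tau, np.round(P[0], 3))
```

Lower tau pushes the sampled choices toward the highest-utility alternative, while higher tau flattens the distribution toward uniform.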

In [None]:
# Route B: Utility + Softmax with tau sweep

if os.environ.get("OPENAI_API_KEY"):
    taus = [0.5, 1.0, 1.5]  # Adjust as needed; the route table at the top lists [0.2, 0.5, 0.8, 1.0, 1.2]
    results_u = []
    
    for tau in taus:
        print(f"\n--- Utility generation with tau={tau} ---")
        
        syn_u = generate_from_utilities_batch(
            templates,
            model="gpt-4o-mini",
            tau=tau,
            jsonl_path=f"./util_tau{tau}.jsonl",
            out_path=f"./util_tau{tau}_out.jsonl",
            seed=123,
            cot=False
        )
        
        # Choice distribution
        share = syn_u["CHOICE"].value_counts(normalize=True).reindex([1,2,3]).fillna(0)
        print(f"Choice share: TRAIN={share[1]:.3f}, SM={share[2]:.3f}, CAR={share[3]:.3f}")
        
        # Evaluate
        ev = evaluate_one(syn_u, beta, scales, df_train, df_test, f"util_tau{tau}")
        ev["tau"] = tau
        results_u.append(ev)
        
        print(f"ds_avg_P_chosen: {ev['ds_avg_P_chosen']:.4f}")
        print(f"ds_accuracy: {ev['ds_accuracy']:.4f}")
    
    df_results_u = pd.DataFrame(results_u)
    print("\nSummary:")
    print(df_results_u[["label", "tau", "ds_avg_P_chosen", "ds_accuracy", "ds_low_prob_rate(<0.01)"]])
else:
    print("OPENAI_API_KEY not set. Skipping.")

## 9. Route C: MNL + Residual Generation

Keeps the systematic part of the baseline MNL and has the LLM generate a residual dU ∈ [-1,1]; choices are sampled from softmax(logits + λ*dU).
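
A minimal sketch of this combination step, assuming the baseline logits and the LLM residuals are already available as N x 3 arrays. The function name `sample_with_residual` is hypothetical, and clipping dU to [-1, 1] is an assumption about how out-of-range LLM output would be handled. Because softmax is invariant to a per-row shift, log-probabilities from the baseline MNL can stand in for the systematic utilities.

```python
import numpy as np

def sample_with_residual(logits, dU, lam=0.2, seed=123):
    """Sample CHOICE (1/2/3) from softmax(logits + lam * dU).

    logits : array (N, 3), baseline MNL utilities (or log-probabilities).
    dU     : array (N, 3), LLM residuals, clipped here to [-1, 1].
    lam    : weight on the residual; lam = 0 recovers the baseline MNL.
    """
    logits = np.asarray(logits, dtype=float)
    dU = np.clip(np.asarray(dU, dtype=float), -1.0, 1.0)
    z = logits + lam * dU
    z = z - z.max(axis=1, keepdims=True)           # stabilize exp()
    P = np.exp(z)
    P /= P.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(seed)
    u = rng.random(len(P))
    choice = 1 + (u[:, None] > P.cumsum(axis=1)).sum(axis=1)
    choice = np.minimum(choice, P.shape[1])        # guard against cumsum rounding
    return choice, P

# Toy baseline probabilities and residuals, for illustration only.
P_base = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3]])
dU_demo = np.array([[0.0, 0.0, 1.0],
                    [-1.0, 0.5, 0.0]])
for lam in (0.0, 0.2, 0.5):
    _, P = sample_with_residual(np.log(P_base), dU_demo, lam=lam)
    print(lam, np.round(P, 3))
```

Since |λ*dU| is bounded by λ, the odds between any two alternatives can shift by at most a factor of exp(2λ) relative to the baseline, which limits how far the LLM can push a sample into low-probability territory.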

In [None]:
# Route C: MNL + Residual with lambda sweep

if os.environ.get("OPENAI_API_KEY"):
    lambdas = [0.0, 0.2, 0.3, 0.5]  # 0.0 = pure MNL baseline; the route table at the top lists [0.1, 0.2, 0.3, 0.5]
    results_r = []
    
    def score_fn(df):
        return score_with_baseline_mnl(df, beta, scales)
    
    for lam in lambdas:
        print(f"\n--- Residual generation with lambda={lam} ---")
        
        syn_r = generate_from_residual_batch(
            templates,
            score_fn=score_fn,
            model="gpt-4o-mini",
            lam=lam,
            jsonl_path=f"./resid_lam{lam}.jsonl",
            out_path=f"./resid_lam{lam}_out.jsonl",
            seed=123,
            cot=False
        )
        
        # Choice distribution
        share = syn_r["CHOICE"].value_counts(normalize=True).reindex([1,2,3]).fillna(0)
        print(f"Choice share: TRAIN={share[1]:.3f}, SM={share[2]:.3f}, CAR={share[3]:.3f}")
        
        # Evaluate
        ev = evaluate_one(syn_r, beta, scales, df_train, df_test, f"mnl_plus_resid_lam{lam}")
        ev["lambda"] = lam
        results_r.append(ev)
        
        print(f"ds_avg_P_chosen: {ev['ds_avg_P_chosen']:.4f}")
        print(f"ds_low_prob_rate: {ev['ds_low_prob_rate(<0.01)']:.4f}")
    
    df_results_r = pd.DataFrame(results_r)
    print("\nSummary:")
    print(df_results_r[["label", "lambda", "ds_avg_P_chosen", "ds_accuracy", "ds_low_prob_rate(<0.01)"]])
else:
    print("OPENAI_API_KEY not set. Skipping.")

## 10. MNL Baseline (No LLM)

As a control: sample choices directly from the baseline MNL probabilities.

In [None]:
# Generate baseline synthetic data using MNL probabilities (no LLM)
def generate_mnl_baseline(templates, beta, scales, seed=123):
    """Generate synthetic choices using baseline MNL probabilities."""
    scored = score_with_baseline_mnl(templates.copy(), beta, scales)
    P = scored[["P_TRAIN", "P_SM", "P_CAR"]].to_numpy()
    
    rng = np.random.default_rng(seed)
    choices = 1 + np.array([rng.choice(3, p=p) for p in P])
    
    syn = templates.copy()
    syn["CHOICE"] = choices
    return syn

syn_mnl = generate_mnl_baseline(templates, beta, scales, seed=123)

print("MNL Baseline (no LLM):")
print("Choice distribution:")
print(syn_mnl["CHOICE"].value_counts(normalize=True).sort_index())

# Evaluate
eval_mnl = evaluate_one(syn_mnl, beta, scales, df_train, df_test, "MNL_baseline")
print(f"\nds_avg_P_chosen: {eval_mnl['ds_avg_P_chosen']:.4f}")
print(f"ds_accuracy: {eval_mnl['ds_accuracy']:.4f}")
print(f"ds_low_prob_rate: {eval_mnl['ds_low_prob_rate(<0.01)']:.4f}")

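## 11. Parameter Stability Analysis (Real vs Real+Synthetic)

Compares the parameters estimated from real data only against those estimated from real plus synthetic data.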
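
The comparison itself can be sketched without refitting any model. The keys `diff`, `abs_diff`, and `z_diff_vs_bootse` follow the column names used in the cells below, but their definitions here (augmented minus base, scaled by the bootstrap SE) are assumptions, and `compare_coefficients` is a hypothetical helper, not part of `swissmetro_llm.stability`.

```python
import numpy as np

def compare_coefficients(names, coef_base, coef_new, boot_se=None):
    """Tabulate coefficient shifts, largest absolute shift first.

    coef_base : estimates from real data only.
    coef_new  : estimates from real + synthetic data.
    boot_se   : optional bootstrap standard errors of the base estimates.
    """
    coef_base = np.asarray(coef_base, dtype=float)
    coef_new = np.asarray(coef_new, dtype=float)
    diff = coef_new - coef_base                    # assumed sign convention
    rows = []
    for i, name in enumerate(names):
        row = {
            "param": name,
            "coef_base": float(coef_base[i]),
            "coef_new": float(coef_new[i]),
            "diff": float(diff[i]),
            "abs_diff": float(abs(diff[i])),
        }
        if boot_se is not None:
            se = float(boot_se[i])
            # Shift measured in units of sampling uncertainty.
            row["z_diff_vs_bootse"] = row["diff"] / se if se > 0 else float("nan")
        rows.append(row)
    return sorted(rows, key=lambda r: r["abs_diff"], reverse=True)

# Made-up coefficients, for illustration only.
table = compare_coefficients(
    ["B_TT", "B_CO", "ASC_SM"],
    coef_base=[-1.20, -0.80, 0.30],
    coef_new=[-1.10, -0.85, 0.60],
    boot_se=[0.10, 0.05, 0.20],
)
for row in table:
    print(row)
```

Scaling each shift by the bootstrap SE helps separate shifts that fall within ordinary sampling noise from shifts that suggest the synthetic data is pulling the estimates.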

In [None]:
# Parameter stability analysis
# Use the MNL baseline synthetic data for demonstration
# In practice, use your best LLM-generated synthetic data (e.g., mnl_plus_resid_lam0.3)

print("=" * 60)
print("Parameter Stability Analysis: Real vs Real+Synthetic")
print("=" * 60)

# Run stability analysis
stab_df, metadata = stability_analysis(
    real_train=df_train,
    syn_df=syn_mnl,  # Replace with your LLM synthetic data
    tt_scale=tt_scale,
    co_scale=co_scale,
    he_scale=he_scale,
    theta_init=theta2,
    boot_se=se_boot if 'se_boot' in dir() else None,
    ci_lower=ci_lower if 'ci_lower' in dir() else None,
    ci_upper=ci_upper if 'ci_upper' in dir() else None,
    ratio=1.0,  # 1:1 ratio of synthetic to real
    seed=123,
    verbose=True
)

In [None]:
# Format and display results
summary = format_stability_for_report(
    stab_df,
    metadata,
    output_path="param_stability_real_vs_aug.csv",
    top_n=15
)
print(summary)

In [None]:
# Display full table
print("\nFull stability table (top 20 by |diff|):")
display_cols = ["param", "coef_base", "coef_new", "diff", "converged_base", "converged_new"]
if "boot_se" in stab_df.columns:
    display_cols.extend(["boot_se", "z_diff_vs_bootse"])
stab_df[display_cols].head(20)

## 12. Summary and Export

In [None]:
# Summary statistics
print("=" * 60)
print("Summary Statistics")
print("=" * 60)

print(f"\nTotal parameters: {len(stab_df)}")
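# Look up the largest shift explicitly instead of assuming stab_df is sorted by abs_diff
top_row = stab_df.loc[stab_df['abs_diff'].idxmax()]
print(f"Max |diff|: {top_row['abs_diff']:.4f} ({top_row['param']})")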
print(f"Mean |diff|: {stab_df['abs_diff'].mean():.4f}")
print(f"Median |diff|: {stab_df['abs_diff'].median():.4f}")

print(f"\nConvergence:")
print(f"  Base (real only): {metadata['converged_base']}")
print(f"  Augmented (real+syn): {metadata['converged_new']}")

print(f"\nLog-likelihood:")
print(f"  Base: {metadata['ll_base']:.2f}")
print(f"  Augmented: {metadata['ll_new']:.2f}")

In [None]:
# Export key results
print("\nExported files:")
print("  - param_stability_real_vs_aug.csv")

# Export synthetic data if generated
if 'syn_final' in dir():
    syn_final.to_csv("synthetic_llm_two_stage.csv", index=False)
    print("  - synthetic_llm_two_stage.csv")

syn_mnl.to_csv("synthetic_mnl_baseline.csv", index=False)
print("  - synthetic_mnl_baseline.csv")

## Done!

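### What Was Completed

1. **Data loading and preprocessing** - `load_swissmetro()`, `build_matrices()`
2. **MNL estimation** - Step1 (6 parameters) + Step2 (42 parameters)
3. **Bootstrap inference** - clustered bootstrap standard errors and confidence intervals
4. **Evaluation framework** - `evaluate_one()`, `downstream_metrics()`
5. **Three LLM generation routes**:
   - A. Two-stage CHOICE
   - B. Utility + Softmax
   - C. MNL + Residual
6. **Parameter stability analysis** - Real vs Real+Synthetic

### Recommended Main Result

Given its stability (`ds_low_prob_rate=0`), we recommend **MNL + Residual (λ=0.2 or 0.3)** as the main-text result, with the other routes reported in the appendix.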