# GlycoForge Simulation Examples

Core functionality: **Simulate data → Add biological factors → Inject batch effects**

Two modes:
- **Simplified Mode**: Generate synthetic data from scratch, then inject batch effects.
- **Hybrid Mode**: Extract baseline samples from real-world data, simulate biological factors on top of them, then inject batch effects.

## Configuration
The simulation parameters are controlled via YAML configuration files located in `sample_config/`.
- `simlified_mode_config.yaml`: Configuration for Simplified Mode.
- `hybrid_mode_config.yaml`: Configuration for Hybrid Mode.

You can modify these files to adjust parameters like `n_glycans`, `n_batches`, `bio_strength`, etc.
Supplying a list for certain parameters (e.g., `kappa_mu: [0.5, 1.0]`) automatically triggers a grid search over the listed values.
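
To make the grid-search behaviour concrete, the sketch below shows one way list-valued parameters could expand into one configuration per combination. It is an illustration only, not GlycoForge's actual implementation: `expand_grid` is a hypothetical helper, the choice of `kappa_mu` and `var_b` as sweep axes follows the comment in Example 1, and the `var_b` and `n_glycans` values are made up for the demonstration.

```python
from itertools import product


def expand_grid(config, grid_keys=("kappa_mu", "var_b")):
    """Hypothetical helper: expand list-valued grid parameters into
    one config dict per combination."""
    # Only keys named in grid_keys are treated as sweep axes, so other
    # list-valued settings (e.g. random_seeds) are left untouched.
    axes = {k: config[k] for k in grid_keys if isinstance(config.get(k), list)}
    if not axes:
        return [dict(config)]
    names = list(axes)
    return [
        {**config, **dict(zip(names, values))}
        for values in product(*(axes[name] for name in names))
    ]


# Illustrative values only; kappa_mu mirrors the example in the text.
example = {
    "n_glycans": 50,
    "kappa_mu": [0.5, 1.0],
    "var_b": [0.1, 0.2, 0.3],
    "random_seeds": [1, 2],
}
configs = expand_grid(example)
print(len(configs))  # 6 combinations (2 x 3)
```

Each expanded dict carries scalar values for the swept parameters, which is presumably why a grid search returns one result entry per combination.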

## Example 1: Simplified Mode (Pure Simulated Data)

Fully synthetic simulation without real data dependency. This mode is ideal for controlled experiments where you want to test batch correction methods under known ground truth with varying batch effect strengths.

- **Biological Signal**: Simulated
- **Batch Effect**: Simulated

In [None]:
# Load configuration from YAML
from glycoforge import simulate
import yaml
import os

# Auto-detect config path: try installed package first, fall back to local repo.
# Building a path never raises, so check that the directory actually exists.
import glycoforge
config_dir = os.path.join(os.path.dirname(glycoforge.__file__), 'sample_config')
if os.path.isdir(config_dir):
    print("✓ Using config from installed package")
else:
    # Fallback: assume running from repo root
    config_dir = 'sample_config'
    print("✓ Using config from local repository")

config_path = os.path.join(config_dir, 'simlified_mode_config.yaml')

with open(config_path, 'r') as f:
    config = yaml.safe_load(f)

# Print configuration info
print(f"\nConfiguration: {os.path.basename(config_path)}")
print(f"Output directory: {config.get('output_dir')}")
print(f"Seeds to run: {config.get('random_seeds')}")
print(f"Data source: {config.get('data_source')}")
print(f"Number of glycans: {config.get('n_glycans')}")
print(f"Samples: {config.get('n_H')} healthy + {config.get('n_U')} unhealthy")

# Run pipeline with config
# Note: If config contains lists for kappa_mu or var_b, this will trigger a grid search
result = simulate(**config)

print('\n=== Pipeline Results ===')
if isinstance(result, dict) and 'metadata' not in result:
    # Grid search result
    print(f"Grid search completed. Total combinations: {len(result)}")
    print("Keys:", list(result.keys()))
    
    # Show example from first combination
    first_key = list(result.keys())[0]
    print(f"\nExample result from '{first_key}':")
    print('Number of runs:', len(result[first_key]['metadata']))
    print('Output directory:', result[first_key]['config']['output_dir'])
else:
    # Single run result
    print('Number of runs:', len(result['metadata']))
    print('Output directory:', result['config']['output_dir'])

## Example 2: Hybrid Mode (Simulate Based on Real Data)

Start from real glycomics data to preserve biological signal structure. This mode keeps the virtual cohort faithful to real biological signal geometry while letting you systematically vary signal strength (`bio_strength`), concentration (`k_dir`), variance (`variance_ratio`), and batch effects for realistic batch correction benchmarking.

- **Data Source**: Real-world glycomics data (CSV)
- **Biological Signal**: Injected into the real data (or using real effect sizes)
- **Batch Effect**: Simulated 
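
As a rough intuition for the concentration knob, the sketch below assumes `k_dir` behaves like a Dirichlet concentration scaling a baseline composition. That reading is an assumption (the text above only calls it "concentration"), and `sample_cohort` and the baseline values are hypothetical, not part of GlycoForge.

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = np.array([0.5, 0.3, 0.2])  # hypothetical mean glycan composition


def sample_cohort(baseline, k, n_samples, rng):
    """Draw compositions whose mean is `baseline` (illustrative only)."""
    # A Dirichlet with parameters k * baseline has mean `baseline`;
    # larger k concentrates the draws more tightly around that mean.
    return rng.dirichlet(k * baseline, size=n_samples)


loose = sample_cohort(baseline, k=10, n_samples=2000, rng=rng)
tight = sample_cohort(baseline, k=1000, n_samples=2000, rng=rng)

print("std with k=10:  ", loose.std(axis=0))
print("std with k=1000:", tight.std(axis=0))
```

Under this assumption, raising the concentration keeps simulated samples closer to the real baseline, while lowering it adds sample-to-sample variability.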

In [None]:
# Load configuration for Hybrid Mode
from glycoforge import simulate
import yaml
import os

# Auto-detect config path: try installed package first, fall back to local repo.
# Building a path never raises, so check that the directory actually exists.
import glycoforge
config_dir = os.path.join(os.path.dirname(glycoforge.__file__), 'sample_config')
if os.path.isdir(config_dir):
    print("✓ Using config from installed package")
else:
    # Fallback: assume running from repo root
    config_dir = 'sample_config'
    print("✓ Using config from local repository")

config_path = os.path.join(config_dir, 'hybrid_mode_config.yaml')

with open(config_path, 'r') as f:
    hybrid_config = yaml.safe_load(f)

# Print configuration info
print(f"\nConfiguration: {os.path.basename(config_path)}")
print(f"Output directory: {hybrid_config.get('output_dir')}")
print(f"Seeds to run: {hybrid_config.get('random_seeds')}")
print(f"Data source: {hybrid_config.get('data_source')}")
print(f"Data file: {hybrid_config.get('data_file')}")
print(f"Samples: {hybrid_config.get('n_H')} healthy + {hybrid_config.get('n_U')} unhealthy")

# Run pipeline with hybrid config
# This will use the real data file specified in the config
print("\nRunning Hybrid Mode Simulation...")
hybrid_result = simulate(**hybrid_config)

print('\n=== Hybrid Pipeline Results ===')
if isinstance(hybrid_result, dict) and 'metadata' not in hybrid_result:
    # Grid search result
    print(f"Grid search completed. Total combinations: {len(hybrid_result)}")
    print("Keys:", list(hybrid_result.keys()))
    
    # Show example from first combination
    first_key = list(hybrid_result.keys())[0]
    print(f"\nExample result from '{first_key}':")
    print('Number of runs:', len(hybrid_result[first_key]['metadata']))
    print('Output directory:', hybrid_result[first_key]['config']['output_dir'])
else:
    # Single run result
    print('Number of runs:', len(hybrid_result['metadata']))
    print('Output directory:', hybrid_result['config']['output_dir'])
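
Both examples branch on the shape of the return value in the same way. A small helper can fold the two cases into one loop. The sketch below relies only on the structure the cells above already use (a `metadata` key and `config['output_dir']` per result); `iter_results` and `summarize` are hypothetical names, and the mock dictionaries stand in for real `simulate()` output.

```python
def iter_results(result):
    """Yield (label, single_result) pairs for either return shape."""
    if isinstance(result, dict) and "metadata" not in result:
        # Grid search: one entry per parameter combination
        yield from result.items()
    else:
        # Single run: wrap it so callers can loop uniformly
        yield "single_run", result


def summarize(result):
    """Return (label, number of runs, output directory) for each result."""
    return [
        (label, len(res["metadata"]), res["config"]["output_dir"])
        for label, res in iter_results(result)
    ]


# Mock objects shaped like the results printed above (not real output)
single = {"metadata": [{}, {}], "config": {"output_dir": "out/single"}}
grid = {
    "combo_a": {"metadata": [{}], "config": {"output_dir": "out/a"}},
    "combo_b": {"metadata": [{}, {}, {}], "config": {"output_dir": "out/b"}},
}

for row in summarize(single) + summarize(grid):
    print(row)
```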

## Example 3: Quality Check Functions

Use `check_batch_effect()` and `check_bio_effect()` to evaluate simulation quality.
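
The cell below applies `clr` before the biological-effect check. For readers unfamiliar with the centered log-ratio transform, here is a minimal NumPy sketch of its standard definition. It is not GlycoForge's `clr`, which may handle zeros or axes differently; `clr_sketch` and the `pseudocount` argument are assumptions made for illustration.

```python
import numpy as np


def clr_sketch(x, pseudocount=1e-6):
    """Centered log-ratio transform, row-wise (rows = samples)."""
    # A small pseudocount avoids log(0) for absent glycans (an assumption;
    # other zero-handling strategies exist).
    x = np.asarray(x, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting each row's mean log equals dividing by its geometric mean
    return log_x - log_x.mean(axis=1, keepdims=True)


# Two hypothetical samples with three glycan abundances each
samples = np.array([[10.0, 30.0, 60.0],
                    [25.0, 25.0, 50.0]])
z = clr_sketch(samples)
print(z.sum(axis=1))  # each row sums to approximately 0
```

Because every CLR-transformed sample sums to zero, the values describe each glycan relative to the sample's geometric mean rather than its absolute abundance.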

In [None]:
# Example: Check batch effect and biological effect
from glycoforge import check_batch_effect, check_bio_effect, clr, load_data_from_glycowork, stratified_batches_from_columns
import pandas as pd
import numpy as np

# Load real glycomics data
data_file = "glycomics_human_leukemia_O_PMID34646384.csv"
df = load_data_from_glycowork(data_file)

# Get sample columns (R7=healthy, BM=unhealthy)
r7_cols = [c for c in df.columns if c.startswith('R7')]
bm_cols = [c for c in df.columns if c.startswith('BM')]
sample_cols = r7_cols + bm_cols

print(f"Loaded {len(sample_cols)} samples: {len(r7_cols)} healthy (R7), {len(bm_cols)} unhealthy (BM)")

# Extract numeric data (same as pipeline.py)
df_numeric = df[sample_cols]  # Shape: (glycans, samples)

# Apply CLR transformation (same as pipeline.py)
data_values = df_numeric.values.T  # (samples, glycans)
data_clr = clr(data_values).T      # CLR + transpose back → (glycans, samples)

# Convert to DataFrame (required format for check functions)
data_clr_df = pd.DataFrame(data_clr, columns=sample_cols)
print(f"data_clr_df.shape: {data_clr_df.shape} (glycans × samples)")

# Create bio labels (0=healthy, 1=unhealthy)
bio_labels = [0] * len(r7_cols) + [1] * len(bm_cols)

# Create batch labels using stratified assignment
# First, rename columns to match expected format (healthy_xxx, unhealthy_xxx)
renamed_cols = [f"healthy_{c}" if c.startswith('R7') else f"unhealthy_{c}" for c in sample_cols]


print(f"\n=== Stratified Batch Assignment ===")
batch_groups, batch_labels_array = stratified_batches_from_columns(
    renamed_cols, n_batches=3, seed=42, verbose=True
)
batch_labels = [f"Batch{b+1}" for b in batch_labels_array]  # Convert to string labels

print(f"\nBatch assignment: {batch_labels}")
print(f"Bio labels: {bio_labels}")

# Check biological effect
print("\n=== Biological Effect Check ===")
bio_metrics, pc = check_bio_effect(
    data_clr=data_clr_df,
    bio_labels=bio_labels,
    stage_name="Example Data",
    verbose=True
)

# Check batch effect
print("\n=== Batch Effect Check ===")
batch_metrics = check_batch_effect(
    data=df_numeric,  # Original data (glycans × samples)
    batch_labels=batch_labels,
    bio_groups=bio_labels,
    verbose=True
)

print("\n✓ Quality checks completed")
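
The example above relies on stratified batch assignment so that each batch contains a similar mix of healthy and unhealthy samples. The sketch below illustrates the general idea with a shuffled round-robin deal within each group. It is not the implementation of `stratified_batches_from_columns`; `stratified_batches_sketch` is a hypothetical function written only to show the principle.

```python
import numpy as np


def stratified_batches_sketch(groups, n_batches, seed=0):
    """Assign a batch index to each sample, balancing groups across batches."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    batch = np.empty(len(groups), dtype=int)
    offset = 0
    for g in np.unique(groups):
        # Shuffle the members of this group, then deal them round-robin
        idx = rng.permutation(np.flatnonzero(groups == g))
        # Continue from where the previous group stopped so that leftover
        # samples do not always land in the first batches
        batch[idx] = (np.arange(len(idx)) + offset) % n_batches
        offset += len(idx)
    return batch


# Hypothetical cohort: 6 healthy (0) and 6 unhealthy (1) samples
labels = [0] * 6 + [1] * 6
batches = stratified_batches_sketch(labels, n_batches=3, seed=42)

for b in range(3):
    members = np.asarray(labels)[batches == b]
    print(f"Batch{b + 1}: {np.sum(members == 0)} healthy, {np.sum(members == 1)} unhealthy")
```

Balancing groups across batches keeps batch labels from being confounded with biological labels, which is what makes the batch-effect and biological-effect checks separately interpretable.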
