# GlycoForge Simulation Examples

Core functionality: **Simulate data → Add biological factors → Inject batch effects**

Two modes:
- **Simplified Mode**: Generate synthetic data from scratch, then inject batch effects.
- **Hybrid Mode**: Extract a baseline sample from real-world data, simulate biological factors on top of it, then inject batch effects.

## Configuration
The simulation parameters are controlled via YAML configuration files located in `sample_config/`.
- `simlified_mode_config.yaml`: Configuration for Simplified Mode.
- `hybrid_mode_config.yaml`: Configuration for Hybrid Mode.

Edit these files to adjust parameters such as `n_glycans`, `n_batches`, and `bio_strength`.
Certain parameters also accept a list (e.g., `kappa_mu: [0.5, 1.0]`); supplying a list automatically triggers a grid search over the listed values.
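
To make the grid-search behaviour concrete, here is a minimal sketch of how list-valued parameters could be expanded into one configuration per combination. It is not GlycoForge's actual implementation: the `expand_grid` helper is hypothetical, the choice of `kappa_mu` and `var_b` as the swept parameters is taken from the comments in the examples in this notebook, and the numeric values are made up for illustration.

```python
from itertools import product


def expand_grid(config, grid_keys=("kappa_mu", "var_b")):
    """Expand list-valued grid parameters into one config per combination.

    Hypothetical helper: only keys named in `grid_keys` AND given as lists
    are swept; every other entry is copied unchanged into each run.
    """
    swept = {k: config[k] for k in grid_keys if isinstance(config.get(k), list)}
    if not swept:
        # Nothing to sweep: a single run with the config as given
        return [dict(config)]

    names = list(swept)
    runs = []
    for values in product(*(swept[name] for name in names)):
        run = dict(config)               # shallow copy of the base config
        run.update(zip(names, values))   # replace each list with one scalar
        runs.append(run)
    return runs


# Illustrative values only, not taken from the sample config files
example = {
    "n_glycans": 50,
    "n_batches": 3,
    "kappa_mu": [0.5, 1.0],
    "var_b": [0.1, 0.5, 1.0],
}
runs = expand_grid(example)
print(len(runs))                               # 2 x 3 = 6 combinations
print(runs[0]["kappa_mu"], runs[0]["var_b"])   # first combination: 0.5 0.1
```

The point to take away is that the number of runs grows multiplicatively with each list-valued parameter, so keep the lists short when each run is expensive.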

In [None]:
from glycoforge import simulate
import yaml

## Example 1: Simplified Mode (Pure Simulated Data)

Fully synthetic simulation with no dependency on real data. This mode is ideal for controlled experiments in which you test batch correction methods against a known ground truth while varying the batch effect strength.

- **Biological Signal**: Simulated
- **Batch Effect**: Simulated

In [None]:
# Load configuration from YAML
with open('sample_config/simlified_mode_config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Run pipeline with config
# Note: If config contains lists for kappa_mu or var_b, this will trigger a grid search
result = simulate(**config)

print('\n=== Pipeline Results ===')
if isinstance(result, dict) and 'metadata' not in result:
    # Grid search result
    print(f"Grid search completed. Total combinations: {len(result)}")
    print("Keys:", list(result.keys()))
    
    # Show example from first combination
    first_key = next(iter(result))
    print(f"\nExample result from '{first_key}':")
    print('Number of runs:', len(result[first_key]['metadata']))
    print('Output directory:', result[first_key]['config']['output_dir'])
else:
    # Single run result
    print('Number of runs:', len(result['metadata']))
    print('Output directory:', result['config']['output_dir'])
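
The single-run versus grid-search branching above can be factored into a small helper so the same reporting logic works for any result. The sketch below is hypothetical and assumes only the structure the code above already relies on: a single-run result is a dict with `'metadata'` and `'config'` keys, and a grid-search result is a dict mapping combination keys to such single-run dicts. Stand-in dictionaries are used so the snippet runs without calling `simulate`.

```python
def summarize(result):
    """Return (label, n_runs, output_dir) rows for a single run or a grid search.

    Assumption: a grid-search result has no top-level 'metadata' key and
    maps each combination key to a single-run result.
    """
    if isinstance(result, dict) and "metadata" not in result:
        items = list(result.items())        # grid search: one row per combination
    else:
        items = [("single_run", result)]    # single run: exactly one row
    return [
        (label, len(run["metadata"]), run["config"]["output_dir"])
        for label, run in items
    ]


# Stand-in results with made-up keys and paths, for illustration only
single = {"metadata": [{}, {}], "config": {"output_dir": "out/single"}}
grid = {
    "combo_a": single,
    "combo_b": {"metadata": [{}], "config": {"output_dir": "out/combo_b"}},
}

for row in summarize(grid):
    print(row)
```

Calling `summarize(result)` on the output of `simulate(**config)` would then give one row per run, provided the result follows the structure assumed here.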

## Example 2: Hybrid Mode (Simulate Based on Real Data)

Start from real glycomics data to preserve biological signal structure. This mode keeps the virtual cohort faithful to real biological signal geometry while letting you systematically vary signal strength (`bio_strength`), concentration (`k_dir`), variance (`variance_ratio`), and batch effects for realistic batch correction benchmarking.

- **Data Source**: Real-world glycomics data (CSV)
- **Biological Signal**: Injected into the real data (or using real effect sizes)
- **Batch Effect**: Simulated 
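
To vary a parameter such as `bio_strength` systematically without editing the YAML file each time, one option is to derive variants of the loaded configuration in Python. The sketch below is an illustration under assumptions: `with_overrides` is a hypothetical helper, the parameters are assumed to be top-level keys of the config, and the base values shown are made up rather than taken from `hybrid_mode_config.yaml`.

```python
import copy


def with_overrides(config, **overrides):
    """Return a deep copy of `config` with the given top-level keys replaced."""
    new_config = copy.deepcopy(config)  # leave the loaded config untouched
    new_config.update(overrides)
    return new_config


# Made-up base values; in practice this dict would come from yaml.safe_load
base_config = {"bio_strength": 1.0, "k_dir": 100, "variance_ratio": 1.0}

# One config variant per signal strength
variants = [with_overrides(base_config, bio_strength=s) for s in (0.5, 1.0, 2.0)]

for cfg in variants:
    print(cfg["bio_strength"], cfg["k_dir"], cfg["variance_ratio"])
```

Each variant could then be passed to `simulate(**cfg)` in a loop, assuming `simulate` accepts these keys as keyword arguments.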

In [None]:
# Load configuration for Hybrid Mode
with open('sample_config/hybrid_mode_config.yaml', 'r') as f:
    hybrid_config = yaml.safe_load(f)

# Run pipeline with hybrid config
# This will use the real data file specified in the config
print("Running Hybrid Mode Simulation...")
hybrid_result = simulate(**hybrid_config)

print('\n=== Hybrid Pipeline Results ===')
if isinstance(hybrid_result, dict) and 'metadata' not in hybrid_result:
    # Grid search result
    print(f"Grid search completed. Total combinations: {len(hybrid_result)}")
    print("Keys:", list(hybrid_result.keys()))
    
    # Show example from first combination
    first_key = next(iter(hybrid_result))
    print(f"\nExample result from '{first_key}':")
    print('Number of runs:', len(hybrid_result[first_key]['metadata']))
    print('Output directory:', hybrid_result[first_key]['config']['output_dir'])
else:
    # Single run result
    print('Number of runs:', len(hybrid_result['metadata']))
    print('Output directory:', hybrid_result['config']['output_dir'])