# Baseline Data Loading (Direct S3 Loader)

This notebook loads baseline virtual-model data directly from S3, without importing `baseline_generator`.

**S3 Structure**: {save_result_path}/data/{model_name}_baseline_virtual_models/
├── baseline_features.pkl      # DataFrame: n_samples × n_features
├── baseline_targets.pkl       # DataFrame: n_samples × 1 (outcome)
├── baseline_parameters.pkl    # DataFrame: n_samples × n_parameters
├── baseline_timecourses.pkl   # Timecourse data
└── baseline_metadata.pkl      # Generation metadata
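
The layout above maps each component name to one pickle file under a per-model folder. As a minimal sketch of that mapping, the snippet below builds the expected S3 key for every component. The helper name `baseline_component_paths` and the constant `COMPONENT_FILES` are hypothetical and are not part of the project code; the example path values are taken from the output shown later in this notebook.

```python
# Hypothetical helper (not part of the project): builds the expected S3 key
# for each baseline component from the folder layout described above.
COMPONENT_FILES = {
    "features": "baseline_features.pkl",
    "targets": "baseline_targets.pkl",
    "parameters": "baseline_parameters.pkl",
    "timecourses": "baseline_timecourses.pkl",
    "metadata": "baseline_metadata.pkl",
}


def baseline_component_paths(save_result_path, model_name):
    """Return {component_name: s3_key} for one model."""
    base = f"{save_result_path}/data/{model_name}_baseline_virtual_models"
    return {name: f"{base}/{fname}" for name, fname in COMPONENT_FILES.items()}


paths = baseline_component_paths("new-peak-project/experiments/ch5-paper", "sy_simple")
for name, key in paths.items():
    print(f"{name:12s} {key}")
```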

## Configuration

In [1]:
# Configuration variables
MODEL_NAME = "sy_simple"  # Can be string: "sy_simple" or list: ["sy_simple", "model_v2"]
COMPONENTS_TO_LOAD = ["features", "targets", "parameters", "metadata"]  # Which components to load
SAMPLE_SIZE = None  # None to load all, or integer to sample

## Initialization

In [2]:
import os
import sys
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

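# Set up project path - same pattern as other notebooks
path = os.getcwd()
# Find the index of the first occurrence of 'project' in the path
index_project = path.find("project")
# str.find returns -1 when the substring is absent; fail early in that case
if index_project == -1:
    raise RuntimeError(f"'project' not found in working directory: {path}")
# Keep the path from its start through the end of 'project' (7 characters)
project_path = path[: index_project + 7]
print(f"✅ Project path: {project_path}")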

# Add src to Python path for imports
sys.path.insert(0, os.path.join(project_path, "src"))

# Import S3ConfigManager
from models.utils.s3_config_manager import S3ConfigManager

# Initialize S3 manager
s3_manager = S3ConfigManager()
print(f"✅ S3 connection established. Save result path: {s3_manager.save_result_path}")

✅ Project path: c:\Github\new-peak-project
✅ S3 connection successful. Bucket: bio-data
✅ S3 connection established. Save result path: new-peak-project/experiments/ch5-paper
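
The project-path step in the cell above slices the working directory at the first occurrence of the substring `project`. The sketch below wraps that logic in a function so it can be checked in isolation. The function name `find_project_root` and its `marker` parameter are hypothetical; the behavior is assumed to match the notebook's pattern, with an explicit error when the marker is missing.

```python
def find_project_root(path, marker="project"):
    """Return `path` truncated just after the first occurrence of `marker`.

    Hypothetical helper mirroring the notebook's slicing pattern.
    Raises ValueError when `marker` does not occur in `path`.
    """
    idx = path.find(marker)
    if idx == -1:
        raise ValueError(f"{marker!r} not found in {path!r}")
    return path[: idx + len(marker)]


# The working directory may be any subfolder of the project.
root = find_project_root(r"c:\Github\new-peak-project\notebooks\ch5")
print(root)
```

Because the match is on a substring, a folder such as `my-projects-archive` earlier in the path would also match, so this pattern assumes the first `project` in the path belongs to the project root.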


## Direct S3 Loading Functions

In [3]:
def process_model_config(model_config):
    """
    Convert MODEL_NAME config to list of model names for processing.
    
    Args:
        model_config: Can be string (single model) or list (multiple models)
    
    Returns:
        List of model names
    """
    if isinstance(model_config, str):
        return [model_config]
    elif isinstance(model_config, list):
        return model_config
    else:
        raise ValueError(f"MODEL_NAME must be str or list, got {type(model_config)}")


def load_baseline_direct_single(model_name, components, s3_manager):
    """
    Load baseline components for a single model directly from S3.
    
    Args:
        model_name: Name of the model (string)
        components: List of components to load
        s3_manager: S3ConfigManager instance
        
    Returns:
        Dictionary of loaded components
    """
    gen_path = s3_manager.save_result_path
    folder_name = f"{model_name}_baseline_virtual_models"
    base_path = f"{gen_path}/data/{folder_name}"
    
    print(f"📊 Loading baseline data for model: {model_name}")
    print(f"   S3 path: {base_path}")
    
    # Map component names to file names
    file_mapping = {
        'features': 'baseline_features.pkl',
        'targets': 'baseline_targets.pkl',
        'parameters': 'baseline_parameters.pkl',
        'timecourses': 'baseline_timecourses.pkl',
        'metadata': 'baseline_metadata.pkl'
    }
    
    result = {}
    
    for component in components:
        if component in file_mapping:
            filename = file_mapping[component]
            s3_path = f"{base_path}/{filename}"
            
            try:
                data = s3_manager.load_data_from_path(s3_path, data_format="pkl")
                result[component] = data
                print(f"  ✅ Loaded {component}: {type(data).__name__}")
                if isinstance(data, pd.DataFrame):
                    print(f"    Shape: {data.shape}")
            except Exception as e:
                print(f"  ❌ Error loading {component} from {s3_path}: {e}")
                result[component] = None
        else:
            print(f"  ⚠️ Unknown component: {component}")
            result[component] = None
    
    return result


def load_baseline_direct(model_config, components, s3_manager):
    """
    Load baseline components directly from S3, handling both single and multiple models.
    
    Args:
        model_config: Can be string (single model) or list (multiple models)
        components: List of components to load (e.g., ["features", "targets", "parameters"])
        s3_manager: S3ConfigManager instance
        
    Returns:
        If single model: Dictionary of loaded components
        If multiple models: Dictionary {model_name: component_dict}
    """
    model_names = process_model_config(model_config)
    
    if len(model_names) == 1:
        return load_baseline_direct_single(model_names[0], components, s3_manager)
    else:
        return {model_name: load_baseline_direct_single(model_name, components, s3_manager) for model_name in model_names}
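

`load_baseline_direct` returns a flat component dictionary for one model and a nested `{model_name: component_dict}` dictionary for several, so downstream code has to handle two shapes. One possible way to avoid branching is to normalize the result to the nested shape. The helper `as_model_dict` below is a hypothetical sketch, not part of the notebook, and the toy dictionaries stand in for loaded data.

```python
def as_model_dict(model_config, loaded):
    """Return `loaded` in the nested {model_name: component_dict} shape.

    Hypothetical helper. Assumes `loaded` came from load_baseline_direct
    called with the same `model_config` (a str or a list of str).
    """
    names = [model_config] if isinstance(model_config, str) else list(model_config)
    if len(names) == 1:
        # Single model (string or one-element list): result is a flat dict.
        return {names[0]: loaded}
    return loaded


# Toy stand-ins for loaded results
single = {"features": "F", "targets": "T"}
multi = {"sy_simple": {"features": "F1"}, "model_v2": {"features": "F2"}}

print(as_model_dict("sy_simple", single))
print(as_model_dict(["sy_simple", "model_v2"], multi))
```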

In [4]:
# Load the baseline data
baseline_data = load_baseline_direct(MODEL_NAME, COMPONENTS_TO_LOAD, s3_manager)

📊 Loading baseline data for model: sy_simple
   S3 path: new-peak-project/experiments/ch5-paper/data/sy_simple_baseline_virtual_models

Downloading baseline_features.pkl: 100%|██████████| 145k/145k [00:00<00:00, 3.71MB/s]

  ✅ Loaded features: DataFrame
    Shape: (2000, 9)

Downloading baseline_targets.pkl: 100%|██████████| 16.7k/16.7k [00:00<00:00, 417kB/s]

  ✅ Loaded targets: DataFrame
    Shape: (2000, 1)

Downloading baseline_parameters.pkl: 100%|██████████| 817k/817k [00:00<00:00, 8.01MB/s]

  ✅ Loaded parameters: DataFrame
    Shape: (2000, 51)

Downloading baseline_metadata.pkl: 100%|██████████| 148/148 [00:00<00:00, 9.25kB/s]

  ✅ Loaded metadata: dict

In [5]:
# Optional down-sampling. Note: this assumes the single-model result (a flat
# dict of components); with multiple models the values are nested dicts and
# are passed through unsampled.
if baseline_data and SAMPLE_SIZE:
    print(f"\n🔍 Creating sampled version ({SAMPLE_SIZE} samples)")
    sampled_data = {}
    
    for component, data in baseline_data.items():
        if data is not None and isinstance(data, pd.DataFrame):
            if len(data) > SAMPLE_SIZE:
                sampled_data[component] = data.sample(n=SAMPLE_SIZE, random_state=42).reset_index(drop=True)
                print(f"  ✅ Sampled {component}: {sampled_data[component].shape}")
            else:
                sampled_data[component] = data
                print(f"  ⚠️ {component} has only {len(data)} samples, using all")
        else:
            sampled_data[component] = data
    
    # Update baseline_data with sampled version
    baseline_data = sampled_data
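
The sampling cell above draws from each DataFrame separately and appears to rely on the shared `random_state` to pick the same rows in every component. A more explicit alternative is to draw the row positions once and apply them to all components. The sketch below shows that idea on toy data; the function name `sample_aligned` is hypothetical, and it assumes all DataFrames have the same number of rows in the same order.

```python
import numpy as np
import pandas as pd


def sample_aligned(components, n, seed=42):
    """Sample the same row positions from every DataFrame in `components`.

    Hypothetical helper. Non-DataFrame values (e.g. metadata) pass through.
    """
    frames = {k: v for k, v in components.items() if isinstance(v, pd.DataFrame)}
    lengths = {len(df) for df in frames.values()}
    if len(lengths) > 1:
        raise ValueError(f"DataFrames differ in length: {sorted(lengths)}")
    if not frames or n >= next(iter(lengths)):
        return dict(components)  # nothing to sample

    # Draw the positions once, then reuse them for every component.
    rng = np.random.default_rng(seed)
    positions = np.sort(rng.choice(next(iter(lengths)), size=n, replace=False))
    return {
        k: v.iloc[positions].reset_index(drop=True) if isinstance(v, pd.DataFrame) else v
        for k, v in components.items()
    }


# Toy data standing in for the loaded components
toy = {
    "features": pd.DataFrame({"R1": range(10)}),
    "targets": pd.DataFrame({"Oa": [2 * x for x in range(10)]}),
    "metadata": {"note": "toy"},
}
sampled = sample_aligned(toy, n=3)
print(sampled["features"].join(sampled["targets"]))
```

In the toy data each target equals twice the feature value, so the joined output makes it easy to confirm that rows stayed paired after sampling.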

## Data Statistics (Optional)

Basic statistics for the loaded data components.

In [6]:
SHOW_STATISTICS = True  # Set to False to skip statistics

if baseline_data and SHOW_STATISTICS:
    print("\n📈 Basic Statistics:")
    print("=" * 50)
    
    for component, data in baseline_data.items():
        if data is not None and isinstance(data, pd.DataFrame):
            print(f"\n{component.upper()} Statistics:")
            print(f"  Number of samples: {len(data)}")
            print(f"  Number of features: {len(data.columns)}")
            
            # Basic numeric statistics
            numeric_cols = data.select_dtypes(include=[np.number]).columns
            if len(numeric_cols) > 0:
                print(f"  Numeric columns: {len(numeric_cols)}")
                # Show statistics for first few columns
                for col in numeric_cols[:3]:
                    print(f"    {col}: mean={data[col].mean():.4f}, std={data[col].std():.4f}, range=[{data[col].min():.4f}, {data[col].max():.4f}]")


📈 Basic Statistics:

FEATURES Statistics:
  Number of samples: 2000
  Number of features: 9
  Numeric columns: 9
    R1: mean=708.5194, std=398.7111, range=[108.9803, 4565.9002]
    R2: mean=341.4824, std=190.3874, range=[59.0742, 1698.4141]
    R3: mean=892.6998, std=466.4331, range=[88.4579, 4425.0961]

TARGETS Statistics:
  Number of samples: 2000
  Number of features: 1
  Numeric columns: 1
    Oa: mean=66.2690, std=66.4840, range=[2.2325, 314.9745]

PARAMETERS Statistics:
  Number of samples: 2000
  Number of features: 51
  Numeric columns: 51
    Km_J0: mean=98.7157, std=19.7167, range=[41.2681, 204.5239]
    Vmax_J0: mean=12.9255, std=2.6432, range=[6.7159, 24.6935]
    Km_J1: mean=118.5120, std=24.0326, range=[62.0340, 229.1479]
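
The per-column loop above prints the mean, standard deviation, and range for the first three numeric columns. The same numbers can be collected into a single table with `DataFrame.describe`. The sketch below uses toy data, and the helper name `summarize_numeric` is hypothetical; it assumes the DataFrame has at least one numeric column.

```python
import numpy as np
import pandas as pd


def summarize_numeric(df, max_cols=3):
    """Return mean/std/min/max for the first `max_cols` numeric columns."""
    numeric = df.select_dtypes(include=[np.number])
    return numeric.iloc[:, :max_cols].describe().T[["mean", "std", "min", "max"]]


# Toy data standing in for a loaded component
toy = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "label": ["x", "y", "z"]}
)
summary = summarize_numeric(toy)
print(summary.round(4))
```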
