# PycrucibleBatch Tutorial: Hierarchical Sample Management

This tutorial demonstrates how to use the pycrucible client to manage batches of samples using a hierarchical approach where:
- A "batch" is created as a parent sample
- Individual samples are created as children of the batch
- Datasets can be associated at both batch and individual sample levels
- Data can be retrieved and downloaded by batch

## Prerequisites
- pycrucible package installed, plus numpy, Pillow, and requests (used by the example cells)
- Valid Crucible API credentials
- Sample data files (we'll create some for demonstration)

In [None]:
import os
import json
from datetime import datetime
from pycrucible import CrucibleClient
import uuid
from typing import List, Dict

# Configuration - Update these with your credentials
API_URL = "https://your-crucible-api.com"  # Replace with your API URL
API_KEY = "your-api-key-here"  # Replace with your API key

# Initialize the client
client = CrucibleClient(API_URL, API_KEY)
print("Crucible client initialized successfully!")
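
Hard-coded credentials are easy to commit by accident. As a minimal sketch, the URL and key could instead be read from environment variables. The variable names `CRUCIBLE_API_URL` and `CRUCIBLE_API_KEY` and the helper `load_crucible_config` are assumptions made for this example; pycrucible does not define them.

```python
import os


def load_crucible_config(env=None):
    """Return (api_url, api_key) read from environment variables.

    CRUCIBLE_API_URL and CRUCIBLE_API_KEY are names chosen for this
    sketch; they are not defined by pycrucible.
    """
    env = os.environ if env is None else env
    names = ("CRUCIBLE_API_URL", "CRUCIBLE_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Strip a trailing slash so f"{api_url}/samples" never yields "//samples"
    return env["CRUCIBLE_API_URL"].rstrip("/"), env["CRUCIBLE_API_KEY"]
```

With both variables exported in the shell, `API_URL, API_KEY = load_crucible_config()` can replace the two hard-coded assignments before the client is created.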

## Step 1: Create Sample Data Files

First, let's create some sample files to work with in our tutorial.

In [None]:
# Create a directory for our tutorial files
os.makedirs("tutorial_data", exist_ok=True)

# Create a batch-level data file (e.g., experimental protocol)
batch_data = {
    "experiment_name": "Protein Expression Analysis",
    "protocol_version": "2.1",
    "date": datetime.now().isoformat(),
    "conditions": {
        "temperature": "37°C",
        "pH": 7.4,
        "buffer": "PBS"
    }
}

with open("tutorial_data/batch_protocol.json", "w") as f:
    json.dump(batch_data, f, indent=2)

# Create individual sample image files (simulated microscopy images)
import numpy as np
from PIL import Image

# Generate 5 sample "microscopy" images
sample_names = ["Sample_A1", "Sample_A2", "Sample_B1", "Sample_B2", "Sample_C1"]

for i, sample_name in enumerate(sample_names):
    # Create a simple pattern image (simulating microscopy data)
    # randint's upper bound is exclusive, so 256 covers the full 0-255 range
    data = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    # Add some structure to make it look more realistic
    center = (128, 128)
    y, x = np.ogrid[:256, :256]
    mask = (x - center[0])**2 + (y - center[1])**2 < 60**2
    data[mask] = [100 + i*30, 150, 200]  # Different colors for each sample
    
    img = Image.fromarray(data)
    img.save(f"tutorial_data/{sample_name}_microscopy.png")

print("Created tutorial data files:")
for file in os.listdir("tutorial_data"):
    print(f"  - {file}")
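
Because the same files are downloaded again in Step 8, it can help to record a checksum for each one now and compare after the round trip. The helpers below are a sketch using only the standard library; `sha256_of_file` and `checksum_directory` are names invented for this example.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_directory(directory):
    """Map each regular file name in `directory` to its SHA-256 digest."""
    checksums = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            checksums[name] = sha256_of_file(path)
    return checksums
```

Calling `checksum_directory("tutorial_data")` here and again on the download directory later gives two dictionaries that can be compared file by file.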

## Step 2: Create a Batch (Parent Sample)

We'll create a parent sample that represents our batch. This sample will contain metadata about the entire batch.

In [None]:
# Generate a unique batch ID
batch_id = f"BATCH_{datetime.now().strftime('%Y%m%d_%H%M%S')}_{str(uuid.uuid4())[:8]}"
print(f"Creating batch with ID: {batch_id}")

# Create the batch sample
batch_sample_response = client.add_sample(
    sample_name=batch_id,
    sample_description=f"Batch sample for protein expression analysis experiment. Contains {len(sample_names)} individual samples.",
    sample_creation_date=datetime.now().isoformat(),
    sample_owner_orcid="0000-0000-0000-0000"  # Replace with actual ORCID
)

print("Batch sample response:")
print(f"Status Code: {batch_sample_response.status_code}")
if batch_sample_response.status_code == 201:
    batch_sample = batch_sample_response.json()
    batch_sample_id = batch_sample['id']
    print(f"Batch Sample ID: {batch_sample_id}")
    print(f"Batch Sample: {json.dumps(batch_sample, indent=2)}")
else:
    # Stop here: every later step needs batch_sample_id
    raise RuntimeError(f"Batch sample creation failed: {batch_sample_response.text}")
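
The ORCID placeholder is easy to forget to replace. ORCID iDs end in an ISO 7064 MOD 11-2 check digit, so a small client-side check can catch the placeholder and most typos before any request is sent. This is a sketch; `is_valid_orcid` is a helper written for this tutorial, not part of pycrucible, and whether the Crucible API validates ORCIDs itself is not something this tutorial establishes.

```python
import re

# Four hyphen-separated groups of four; the last character may be X
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def is_valid_orcid(orcid):
    """Check the format and the ISO 7064 MOD 11-2 check digit of an ORCID iD."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for char in digits[:-1]:
        total = (total + int(char)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected
```

The placeholder `0000-0000-0000-0000` fails this check because its computed check digit is 1, so a guard such as `assert is_valid_orcid(owner_orcid)` would flag an unreplaced value.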

## Step 3: Create Individual Samples (Children of Batch)

Now we'll create individual samples that are linked to our batch through a custom parent_sample_id field.

In [None]:
import requests

# Helper that posts to the /samples endpoint directly so we can include
# a custom parent_sample_id field alongside the standard sample fields
def add_sample_with_parent(client, sample_name, sample_description, parent_sample_id=None, 
                          sample_creation_date=None, sample_owner_orcid=None, owner_id=None):
    """Create a sample, optionally linked to a parent via parent_sample_id"""
    sample_info = {
        "sample_name": sample_name, 
        "owner_orcid": sample_owner_orcid,
        "owner_user_id": owner_id,
        "description": sample_description,
        "date_created": sample_creation_date
    }

    # Add parent relationship if specified (compare to None so an ID of 0 is kept)
    if parent_sample_id is not None:
        sample_info["parent_sample_id"] = parent_sample_id

    response = requests.post(f"{client.api_url}/samples", headers=client.headers, json=sample_info)
    return response

# Create individual samples linked to our batch
individual_samples = []

for sample_name in sample_names:
    print(f"Creating sample: {sample_name}")
    
    sample_response = add_sample_with_parent(
        client,
        sample_name=sample_name,
        sample_description=f"Individual sample from batch {batch_id}. Microscopy analysis of protein expression.",
        parent_sample_id=batch_sample_id,
        sample_creation_date=datetime.now().isoformat(),
        sample_owner_orcid="0000-0000-0000-0000"  # Replace with actual ORCID
    )
    
    if sample_response.status_code == 201:
        sample_data = sample_response.json()
        individual_samples.append(sample_data)
        print(f"  ✓ Created: {sample_data['sample_name']} (ID: {sample_data['id']})")
    else:
        print(f"  ✗ Failed: {sample_response.status_code} - {sample_response.text}")

print(f"\nCreated {len(individual_samples)} individual samples linked to batch {batch_id}")
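
To inspect or unit-test the request body without touching the network, the payload construction can be separated from the HTTP call. The sketch below reuses the field names from `add_sample_with_parent`; `build_sample_payload` is a hypothetical helper. Unlike the cell above, it omits every field that is `None`, and whether the API treats an omitted field the same as a null one is an assumption.

```python
def build_sample_payload(sample_name, description, parent_sample_id=None,
                         date_created=None, owner_orcid=None, owner_user_id=None):
    """Build the JSON body for a sample, omitting fields that are None."""
    payload = {
        "sample_name": sample_name,
        "description": description,
        "date_created": date_created,
        "owner_orcid": owner_orcid,
        "owner_user_id": owner_user_id,
        # Assumption carried over from the tutorial: the API accepts this field
        "parent_sample_id": parent_sample_id,
    }
    return {key: value for key, value in payload.items() if value is not None}
```

Printing `build_sample_payload("Sample_A1", "demo", parent_sample_id=batch_sample_id)` before posting makes it easy to confirm that the parent link is present in the body.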

## Step 4: Add Batch-Level Dataset

Upload the experimental protocol file as a dataset associated with the entire batch.

In [None]:
# Create a dataset for the batch-level protocol file
batch_dataset = client.create_dataset(
    dataset_name=f"{batch_id}_Protocol",
    unique_id=f"{batch_id}_protocol_{str(uuid.uuid4())[:8]}",
    public=False,
    owner_orcid="0000-0000-0000-0000",  # Replace with actual ORCID
    measurement="experimental_protocol",
    session_name=batch_id,
    data_format="json",
    scientific_metadata={
        "experiment_type": "protein_expression",
        "batch_id": batch_id,
        "protocol_version": "2.1",
        "sample_count": len(sample_names)
    },
    keywords=["protocol", "batch", "protein_expression"]
)

print(f"Batch dataset created: {batch_dataset['unique_id']}")

# Upload the protocol file
upload_result = client.upload_dataset(batch_dataset['unique_id'], "tutorial_data/batch_protocol.json")
print(f"Protocol file uploaded: {upload_result}")

# Associate the batch dataset with the batch sample
batch_link_response = client.add_dataset_to_sample(batch_sample_id, batch_dataset['unique_id'])
print(f"Batch dataset linked to batch sample: {batch_link_response.status_code}")
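
The cell above prints the link status but does not act on it, so a failed link would go unnoticed. A small fail-fast helper is sketched below. It works with any object exposing `status_code` and `text`, as `requests.Response` does. The default set of accepted codes is an assumption: the tutorial only shows 201 for sample creation and does not state which code the link endpoint returns.

```python
def require_status(response, expected=(200, 201), action="request"):
    """Return the response unchanged, or raise if its status is unexpected."""
    if response.status_code not in expected:
        raise RuntimeError(
            f"{action} failed with HTTP {response.status_code}: {response.text}"
        )
    return response
```

For example, `require_status(batch_link_response, action="link batch dataset")` stops the notebook at the first failed link.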

## Step 5: Create Individual Datasets for Each Sample

Upload microscopy images as individual datasets for each sample in the batch.

In [None]:
individual_datasets = []

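for i, sample_data in enumerate(individual_samples):
    # Take the name from the created record so that a failed creation in
    # Step 3 cannot shift samples and names out of alignment
    sample_name = sample_data['sample_name']
    print(f"Creating dataset for {sample_name}")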
    
    # Create dataset for the microscopy image
    dataset = client.create_dataset(
        dataset_name=f"{sample_name}_Microscopy",
        unique_id=f"{sample_name}_microscopy_{str(uuid.uuid4())[:8]}",
        public=False,
        owner_orcid="0000-0000-0000-0000",  # Replace with actual ORCID
        measurement="microscopy",
        session_name=batch_id,
        data_format="png",
        scientific_metadata={
            "experiment_type": "protein_expression",
            "batch_id": batch_id,
            "sample_name": sample_name,
            "sample_position": f"Position_{i+1}",
            "imaging_modality": "fluorescence_microscopy",
            "magnification": "40x",
            "exposure_time_ms": 100
        },
        keywords=["microscopy", "protein_expression", batch_id, sample_name]
    )
    
    # Upload the microscopy image
    image_file = f"tutorial_data/{sample_name}_microscopy.png"
    upload_result = client.upload_dataset(dataset['unique_id'], image_file)
    print(f"  Uploaded: {image_file}")
    
    # Associate the dataset with the individual sample
    link_response = client.add_dataset_to_sample(sample_data['id'], dataset['unique_id'])
    print(f"  Linked to sample: {link_response.status_code}")
    
    individual_datasets.append(dataset)
    print(f"  ✓ Dataset created: {dataset['unique_id']}")

print(f"\nCreated {len(individual_datasets)} individual datasets")
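
Steps 4 and 5 repeat the same three calls: create a dataset, upload a file, link the dataset to a sample. One way to factor that out is sketched below. `register_dataset` is a hypothetical helper; it relies only on the three client methods already used in this tutorial, called with the same arguments, and assumes as above that `create_dataset` returns a dict containing `unique_id`.

```python
def register_dataset(client, sample_id, file_path, **dataset_fields):
    """Create a dataset, upload one file to it, and link it to a sample.

    dataset_fields are passed through to client.create_dataset unchanged.
    Returns the created dataset and the response from the link call.
    """
    dataset = client.create_dataset(**dataset_fields)
    unique_id = dataset["unique_id"]
    client.upload_dataset(unique_id, file_path)
    link_response = client.add_dataset_to_sample(sample_id, unique_id)
    return dataset, link_response
```

The loop body above would then reduce to one call per sample, passing `sample_data['id']`, the image path, and the same keyword arguments given to `create_dataset`.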

## Step 6: Add Additional Metadata to Samples

Samples do not support scientific metadata directly, so this step attaches custom per-sample metadata to each sample's dataset instead.

In [None]:
# Add scientific metadata to individual datasets (since samples don't support metadata directly)
sample_metadata_examples = [
    {"treatment": "control", "cell_density": 1.2e6, "viability": 95.2},
    {"treatment": "treatment_A", "cell_density": 1.1e6, "viability": 92.8},
    {"treatment": "treatment_B", "cell_density": 1.3e6, "viability": 88.5},
    {"treatment": "control", "cell_density": 1.2e6, "viability": 96.1},
    {"treatment": "treatment_C", "cell_density": 1.0e6, "viability": 85.3}
]

for dataset, metadata in zip(individual_datasets, sample_metadata_examples):
    # Get current metadata (treating a missing value as empty) and add our custom fields
    current_metadata = client.get_scientific_metadata(dataset['unique_id']) or {}
    updated_metadata = {**current_metadata, **metadata}
    
    # Update the scientific metadata
    update_result = client.update_scientific_metadata(dataset['unique_id'], updated_metadata)
    print(f"Updated metadata for {dataset['dataset_name']}: {metadata}")

print("\nAll metadata updates submitted.")
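
Merging with `{**current_metadata, **metadata}` silently overwrites any existing key that the new metadata also defines. If that matters for your data, a stricter merge can report such collisions first. The function below is a sketch and `merge_metadata` is a name invented for this example.

```python
def merge_metadata(current, updates, overwrite=False):
    """Merge `updates` into a copy of `current`, reporting key conflicts.

    A conflict is a key present in both dicts with different values.
    With overwrite=False a conflict raises ValueError; with overwrite=True
    the value from `updates` wins, as in a plain dict merge.
    """
    conflicts = sorted(
        key for key in updates if key in current and current[key] != updates[key]
    )
    if conflicts and not overwrite:
        raise ValueError(f"Conflicting metadata keys: {conflicts}")
    return {**current, **updates}
```

Neither input dict is modified, so the original metadata remains available for comparison after the merge.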

## Step 7: Query and Retrieve Batch Data

Demonstrate how to find all data associated with a batch using our hierarchical structure.

In [None]:
def get_batch_info(client, batch_id):
    """Retrieve all information for a batch"""
    
    # Find the batch sample
    batch_samples = client.list_samples(sample_name=batch_id)
    if batch_samples.status_code != 200 or not batch_samples.json():
        print(f"Batch {batch_id} not found")
        return None
    
    batch_sample = batch_samples.json()[0]
    batch_sample_id = batch_sample['id']
    
    print(f"Found batch sample: {batch_sample['sample_name']} (ID: {batch_sample_id})")
    
    # Find all child samples (this would need API support for parent_sample_id queries)
    # For now, we'll use our stored data
    child_samples = individual_samples  # In practice, you'd query by parent_sample_id
    
    print(f"\nChild samples ({len(child_samples)}):")
    for sample in child_samples:
        print(f"  - {sample['sample_name']} (ID: {sample['id']})")
    
    # Get datasets for the batch and all child samples.
    # Note: this would require API endpoints to get datasets by sample ID.
    # For now, we'll use the dataset information stored in earlier cells
    batch_datasets = [batch_dataset] + individual_datasets
    
    print(f"\nAssociated datasets ({len(batch_datasets)}):")
    for dataset in batch_datasets:
        print(f"  - {dataset['dataset_name']} ({dataset['unique_id']})")
    
    return {
        'batch_sample': batch_sample,
        'child_samples': child_samples,
        'datasets': batch_datasets
    }

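# Test the batch retrieval
batch_info = get_batch_info(client, batch_id)
if batch_info is None:
    # get_batch_info returns None when the batch sample is not found
    raise RuntimeError(f"Batch {batch_id} could not be retrieved")
print(f"\nBatch {batch_id} contains:")
print("  - 1 batch sample")
print(f"  - {len(batch_info['child_samples'])} individual samples")
print(f"  - {len(batch_info['datasets'])} datasets")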
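
Until the API can query by `parent_sample_id`, children could be found client-side by filtering a list of sample records. The sketch below assumes that each record returned by the API is a dict that echoes back the `parent_sample_id` it was created with. This tutorial does not verify that, and both function names are hypothetical.

```python
from collections import defaultdict


def group_by_parent(samples):
    """Group sample records by parent_sample_id (None for top-level samples)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample.get("parent_sample_id")].append(sample)
    return dict(groups)


def children_of(samples, parent_sample_id):
    """Return the records whose parent_sample_id equals the given ID."""
    return group_by_parent(samples).get(parent_sample_id, [])
```

If the assumption holds, `children_of(all_samples, batch_sample_id)` could replace the stored `individual_samples` list inside `get_batch_info`, where `all_samples` is whatever list of sample records the API returns.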

## Step 8: Download All Batch Data

Download all datasets associated with a batch.

In [None]:
def download_batch_data(client, batch_info, download_dir="batch_downloads"):
    """Download all datasets for a batch"""
    
    batch_id = batch_info['batch_sample']['sample_name']
    batch_download_dir = os.path.join(download_dir, batch_id)
    os.makedirs(batch_download_dir, exist_ok=True)
    
    print(f"Downloading batch data to: {batch_download_dir}")
    
    downloaded_files = []
    
    for dataset in batch_info['datasets']:
        dataset_id = dataset['unique_id']
        dataset_name = dataset['dataset_name']
        
        print(f"\nProcessing dataset: {dataset_name}")
        
        # Get dataset details to find file information
        dataset_details = client.get_dataset(dataset_id)
        
        # This is a simplified approach - in practice you'd need to:
        # 1. Get list of files in the dataset
        # 2. Download each file
        
        # For our tutorial, we know the file structure
        if "protocol" in dataset_name.lower():
            filename = "batch_protocol.json"
        else:
            # Extract sample name from dataset name
            sample_name = dataset_name.split('_Microscopy')[0]
            filename = f"{sample_name}_microscopy.png"
        
        output_path = os.path.join(batch_download_dir, filename)
        
        try:
            # Note: download_dataset requires knowing the exact filename
            # In practice, you'd first list files in the dataset
            client.download_dataset(dataset_id, filename, output_path)
            downloaded_files.append(output_path)
            print(f"  ✓ Downloaded: {filename}")
        except Exception as e:
            print(f"  ✗ Failed to download {filename}: {e}")
    
    # Create a manifest file with batch information
    manifest = {
        "batch_id": batch_id,
        "download_date": datetime.now().isoformat(),
        "batch_sample": batch_info['batch_sample'],
        "child_samples": batch_info['child_samples'],
        "datasets": batch_info['datasets'],
        "downloaded_files": downloaded_files
    }
    
    manifest_path = os.path.join(batch_download_dir, "batch_manifest.json")
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f, indent=2)
    
    print(f"\nDownload complete! Files saved to: {batch_download_dir}")
    print(f"Manifest created: {manifest_path}")
    
    return batch_download_dir

# Download all batch data
download_dir = download_batch_data(client, batch_info)

# List downloaded files
print(f"\nDownloaded files:")
for file in os.listdir(download_dir):
    file_path = os.path.join(download_dir, file)
    size = os.path.getsize(file_path)
    print(f"  - {file} ({size} bytes)")
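
The manifest written by `download_batch_data` lists every file it believes was downloaded, which makes a later integrity check straightforward. The sketch below reads the `downloaded_files` key from the manifest and reports entries that are missing or empty on disk; `missing_downloads` is a name chosen for this example.

```python
import json
import os


def missing_downloads(manifest_path):
    """Return paths listed in the manifest that are absent or empty on disk."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return [
        path
        for path in manifest.get("downloaded_files", [])
        if not os.path.isfile(path) or os.path.getsize(path) == 0
    ]
```

The manifest stores the paths exactly as they were built during the download, so with the default relative `download_dir` this check needs to run from the same working directory.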

## Summary

This tutorial demonstrated a complete workflow for batch sample management using pycrucible:

1. **✅ Created a batch of samples** using a hierarchical parent-child structure
2. **✅ Added a batch-level dataset** (experimental protocol) linked to the batch sample
3. **✅ Created individual datasets** (microscopy images) for each sample in the batch
4. **✅ Added custom metadata** to individual samples via their datasets
5. **✅ Retrieved and downloaded** all data from the batch, looking up the batch sample by its ID and reusing the sample and dataset records kept in memory

### Key Implementation Notes:

- **Hierarchical Structure**: Used `parent_sample_id` to link individual samples to a batch sample
- **Metadata Storage**: Since samples don't support scientific metadata directly, we stored custom metadata in the associated datasets
- **Batch Queries**: Created helper functions to find and manage batch-related data
- **Download Strategy**: Implemented batch downloading with manifest files for data provenance

### Future Enhancements:

To make this workflow more robust, consider extending the pycrucible client with:
- Native batch sample creation methods
- Query methods to find samples by parent_sample_id
- Bulk dataset association methods
- Enhanced download methods that auto-discover files in datasets
- Sample-level metadata support

This approach provides a solid foundation for managing related samples and their associated data in a structured, queryable way.