# DDACS Dataset - Complete Tutorial

This notebook demonstrates all features of the Deep Drawing and Cutting Simulations (DDACS) Dataset package.

## Installation

If you haven't installed the package yet, run:
```bash
pip install "git+https://github.com/BaumSebastian/Deep-Drawing-and-Cutting-Simulations-Dataset.git[examples]"
```

## Topics Covered

1. Dataset Setup
2. Dataset Loading and Exploration
3. Access Pattern Comparison (PyTorch, Iterator, Generator)
4. PyTorch Integration
5. Data Visualization
6. Summary and Recommendations

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import h5py
import time
from pathlib import Path

# Import DDACS package
from ddacs import SimulationDataset, SimulationIterator, iter_simulations

# PyTorch imports (optional)
try:
    import torch
    from torch.utils.data import DataLoader
    TORCH_AVAILABLE = True
    print(f"PyTorch version: {torch.__version__}")
except ImportError:
    TORCH_AVAILABLE = False
    print("PyTorch not available - some examples will be skipped")

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("All imports successful!")

## 1. Dataset Setup

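First, make sure you have downloaded the dataset with the `darus-download` command. The cell below checks that the expected files are present and prints the download command if they are not: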

In [None]:
# Check if dataset exists
data_dir = Path("./data")
h5_dir = data_dir / "h5"
metadata_file = data_dir / "metadata.csv"

if not data_dir.exists():
    print("❌ Dataset not found!")
    print("Please download the dataset first:")
    print('darus-download --url "https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/DARUS-4801" --path "./data"')
elif not h5_dir.exists() or not metadata_file.exists():
    print("❌ Dataset incomplete!")
    print(f"Missing: {h5_dir if not h5_dir.exists() else metadata_file}")
else:
    n_h5_files = len(list(h5_dir.glob("*.h5")))
    dataset_size = sum(f.stat().st_size for f in data_dir.rglob("*") if f.is_file())
    print("✅ Dataset found!")
    print(f"   H5 files: {n_h5_files}")
    print(f"   Total size: {dataset_size / (1024**3):.2f} GB")

## 2. Dataset Loading and Exploration

Let's explore the dataset using different access methods.

In [None]:
# Load dataset with PyTorch-compatible class
try:
    dataset = SimulationDataset(data_dir, "h5")
    print(dataset)
    print(f"\nDataset length: {len(dataset)}")
except FileNotFoundError as e:
    print(f"Error loading dataset: {e}")

In [None]:
# Load and explore metadata
if metadata_file.exists():
    metadata = pd.read_csv(metadata_file)
    print("Metadata shape:", metadata.shape)
    print("\nColumns:", list(metadata.columns))
    print("\nFirst few rows:")
    display(metadata.head())
    
    print("\nMetadata statistics:")
    display(metadata.describe())

## 3. Access Pattern Comparison

Compare the three ways the package offers to access the dataset.

In [None]:
# Method 1: PyTorch Dataset (for ML training)
print("=== PyTorch Dataset ===")
try:
    dataset = SimulationDataset(data_dir, "h5")
    sim_id, metadata_vals, h5_path = dataset[0]
    print(f"Sample ID: {sim_id}")
    print(f"Metadata shape: {metadata_vals.shape}")
    print(f"H5 file: {h5_path.name}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Method 2: Lightweight Iterator (no PyTorch dependency)
print("=== Lightweight Iterator ===")
try:
    iterator = SimulationIterator(data_dir, "h5")
    print(iterator)
    
    # Get first simulation
    sim_id, metadata_vals, h5_path = next(iter(iterator))
    print(f"\nSample ID: {sim_id}")
    print(f"Metadata shape: {metadata_vals.shape}")
    
    # Sample random simulations
    print("\nRandom samples:")
    for i, (sim_id, metadata_vals, h5_path) in enumerate(iterator.sample(3)):
        print(f"  Sample {i+1}: ID={sim_id}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Method 3: Fast Generator (for streaming)
print("=== Fast Generator ===")
try:
    count = 0
    for sim_id, metadata_vals, h5_path in iter_simulations(data_dir, "h5"):
        print(f"ID: {sim_id}, Metadata shape: {metadata_vals.shape}")
        count += 1
        if count >= 5:  # Only show first 5
            break
    print("\nGenerator can efficiently stream through all simulations")
except Exception as e:
    print(f"Error: {e}")
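
For intuition, a streaming generator of this kind can be sketched with the standard library alone. This is not the `ddacs` implementation: the function name `stream_simulations`, the assumption that the first CSV column holds the ID, the `<ID>.h5` file naming, and the demo values are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def stream_simulations(data_dir, h5_subdir="h5"):
    """Yield (sim_id, values, h5_path) one CSV row at a time.

    Hypothetical layout: <data_dir>/metadata.csv with the ID in the first
    column, and one file <data_dir>/<h5_subdir>/<ID>.h5 per simulation.
    """
    data_dir = Path(data_dir)
    with open(data_dir / "metadata.csv", newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            sim_id, values = row[0], [float(v) for v in row[1:]]
            h5_path = data_dir / h5_subdir / f"{sim_id}.h5"
            if h5_path.exists():  # skip rows without a simulation file
                yield sim_id, values, h5_path


# Demo on a throwaway directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "h5").mkdir()
    (root / "metadata.csv").write_text("ID,p1,p2\n1,0.5,2.0\n2,0.7,3.0\n3,0.9,4.0\n")
    for name in ("1.h5", "3.h5"):  # simulation 2 is deliberately missing
        (root / "h5" / name).touch()
    streamed = [(sim_id, values) for sim_id, values, _ in stream_simulations(root)]

print(streamed)
```

Because rows are read lazily, only one row of metadata is held in memory at a time, which is the property that makes a generator attractive for a single pass over many simulations.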

## 4. PyTorch Integration

Demonstrate how to use the dataset with PyTorch for machine learning.

In [None]:
if TORCH_AVAILABLE:
    print("=== PyTorch DataLoader Example ===")
    
    try:
        # Create dataset
        dataset = SimulationDataset(data_dir, "h5")
        
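        # Default collation cannot batch pathlib.Path objects, so keep IDs and
        # paths as plain lists and stack only the metadata into a tensor.
        # Assumes each item's metadata is a numeric array of equal length.
        def collate_keep_paths(batch):
            sim_ids, metadata_vals, h5_paths = zip(*batch)
            metadata_batch = torch.as_tensor(np.stack(metadata_vals))
            return list(sim_ids), metadata_batch, list(h5_paths)

        # Create DataLoader
        dataloader = DataLoader(
            dataset,
            batch_size=4,
            shuffle=True,
            num_workers=0,  # Set to 0 to avoid multiprocessing issues in Jupyter
            collate_fn=collate_keep_paths,
        )

        print("DataLoader created with batch_size=4")
        print(f"Number of batches: {len(dataloader)}")

        # Iterate through first batch
        for batch_idx, (sim_ids, metadata_batch, h5_paths) in enumerate(dataloader):
            print(f"\nBatch {batch_idx + 1}:")
            print(f"  Simulation IDs: {sim_ids}")
            print(f"  Metadata batch shape: {metadata_batch.shape}")
            print(f"  H5 paths: {[Path(p).name for p in h5_paths]}")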
            
            # Only show first batch
            break
            
    except Exception as e:
        print(f"Error with PyTorch DataLoader: {e}")
        
else:
    print("PyTorch not available - install with: pip install torch")

## 5. Data Visualization

Let's visualize the dataset characteristics and sample simulation data.

In [None]:
# Visualize metadata distributions
if metadata_file.exists():
    metadata = pd.read_csv(metadata_file)
    
    # Select numeric columns (exclude ID)
    numeric_cols = metadata.select_dtypes(include=[np.number]).columns.tolist()
    if 'ID' in numeric_cols:
        numeric_cols.remove('ID')
    
    if len(numeric_cols) > 0:
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        axes = axes.flatten()
        
        for i, col in enumerate(numeric_cols[:4]):  # Plot first 4 columns
            if i < len(axes):
                sns.histplot(metadata[col], ax=axes[i], kde=True)
                axes[i].set_title(f'Distribution of {col}')
        
        # Hide unused subplots
        for i in range(len(numeric_cols), len(axes)):
            axes[i].set_visible(False)
        
        plt.tight_layout()
        plt.suptitle('Metadata Parameter Distributions', y=1.02, fontsize=16)
        plt.show()
    else:
        print("No numeric columns found in metadata for visualization")
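
The cell above plots only the first four numeric columns on a fixed 2x2 grid. To show every column, the grid size can be derived from the column count. The helper below is a minimal sketch; `grid_shape` and its `max_cols` parameter are hypothetical names, not part of the `ddacs` package.

```python
import math


def grid_shape(n_plots, max_cols=3):
    """Return (rows, cols) of the smallest grid that holds n_plots panels."""
    if n_plots < 1:
        raise ValueError("need at least one plot")
    cols = min(n_plots, max_cols)
    rows = math.ceil(n_plots / cols)
    return rows, cols


shapes = {n: grid_shape(n) for n in (1, 4, 7)}
print(shapes)
```

The result can be passed to `plt.subplots(rows, cols, squeeze=False)` so that `axes.flatten()` works for any number of columns, including one.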

In [None]:
# Load and visualize sample simulation data
try:
    # Get first simulation
    dataset = SimulationDataset(data_dir, "h5")
    sim_id, metadata_vals, h5_path = dataset[0]
    
    print(f"Loading simulation data from: {h5_path.name}")
    
    # Load H5 data
    with h5py.File(h5_path, 'r') as f:
        print("H5 file structure:")
        def print_structure(name, obj):
            print(f"  {name}: {type(obj).__name__}")
        f.visititems(print_structure)
        
        # Try to load some sample data
        try:
            # This path might need adjustment based on actual H5 structure
            data = np.array(f["OP10"]["blank"]["node_displacement"])
            print(f"\nLoaded data shape: {data.shape}")
            print(f"Data type: {data.dtype}")
            print(f"Data range: [{data.min():.6f}, {data.max():.6f}]")
            
            # Visualize if data is reasonable size
            if data.size < 10000:  # Only plot if not too large
                plt.figure(figsize=(12, 4))
                
                plt.subplot(1, 2, 1)
                if len(data.shape) == 2:
                    plt.imshow(data, aspect='auto', cmap='viridis')
                    plt.colorbar()
                    plt.title('2D Data Heatmap')
                else:
                    plt.plot(data.flatten()[:1000])  # Plot first 1000 points
                    plt.title('Data Values (first 1000 points)')
                
                plt.subplot(1, 2, 2)
                plt.hist(data.flatten(), bins=50, alpha=0.7)
                plt.title('Data Value Distribution')
                plt.xlabel('Value')
                plt.ylabel('Frequency')
                
                plt.tight_layout()
                plt.show()
            else:
                print("Data too large for visualization (>10k elements)")
                
        except KeyError as e:
            print(f"Could not access expected data path: {e}")
            print("Please check the H5 file structure above and adjust the data path")
            
except Exception as e:
    print(f"Error loading simulation data: {e}")
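
When adjusting the data path, it helps to list every dataset together with its shape before loading anything. The sketch below walks any nested mapping and is demonstrated on plain dictionaries. The helper `iter_datasets`, the `FakeDataset` stand-in, the field name `another_field`, and all shapes are made up for illustration.

```python
from collections.abc import Mapping


def iter_datasets(node, prefix=""):
    """Yield (path, shape) for every leaf below a nested mapping."""
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if isinstance(child, Mapping):  # group-like: recurse into it
            yield from iter_datasets(child, path)
        else:  # dataset-like: report its shape if it has one
            yield path, getattr(child, "shape", None)


class FakeDataset:
    """Stand-in for an HDF5 dataset; it only carries a shape."""

    def __init__(self, shape):
        self.shape = shape


# Made-up structure and shapes, for illustration only.
fake_file = {
    "OP10": {
        "blank": {
            "node_displacement": FakeDataset((4, 100, 3)),
            "another_field": FakeDataset((100, 3)),
        }
    }
}
summary = dict(iter_datasets(fake_file))
for path, shape in summary.items():
    print(path, shape)
```

Assuming `h5py` groups behave as mappings, calling `iter_datasets(f)` inside the `with h5py.File(...)` block should produce the same kind of listing for a real file without reading any array data.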

## 6. Summary and Recommendations

### When to use each access method:

1. **SimulationDataset (PyTorch)**:
   - ✅ Machine learning training with PyTorch
   - ✅ Random access to samples
   - ✅ Batch processing with DataLoader
   - ❌ Requires PyTorch

2. **SimulationIterator (Lightweight)**:
   - ✅ No PyTorch dependency
   - ✅ Memory-efficient streaming
   - ✅ Random sampling via `sample()`
   - ❌ Otherwise sequential access only

3. **iter_simulations (Generator)**:
   - ✅ Fast streaming with little overhead
   - ✅ Minimal memory footprint
   - ✅ Simple function interface
   - ❌ No random access or sampling

### Next Steps:

- Explore the actual H5 file structure for your specific use case
- Implement custom data loading for your ML targets
- Consider data preprocessing and augmentation strategies
- Set up proper train/validation/test splits
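
As a starting point for the last item, simulation IDs can be split reproducibly with the standard library. This is a minimal sketch: `split_ids`, the 80/10/10 fractions, and the seed are illustrative choices, and the IDs here are just `range(100)` rather than real dataset IDs.

```python
import random


def split_ids(sim_ids, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle IDs reproducibly and cut them into train/val/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = sorted(sim_ids)             # fixed starting order
    random.Random(seed).shuffle(ids)  # private RNG, global state untouched
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train_ids, val_ids, test_ids = split_ids(range(100))
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting at the level of whole simulations, rather than individual nodes or time steps, keeps data from one simulation out of more than one subset.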