# Supernova Dataset Generation Example

This notebook demonstrates how to use the YAML-based pipeline configuration system to generate mission-specific supernova training datasets.

## Overview

The pipeline consists of 5 stages:
1. **Query**: Search MAST archive for observations
2. **Filter**: Keep reference/science pairs observed by the same mission
3. **Download**: Fetch FITS files, capping the observations and products fetched per supernova
4. **Organize**: Structure files for training
5. **Differencing**: Generate difference images

All stages are configured via YAML files for reproducibility.

In [1]:
import sys
from pathlib import Path
import json
import yaml
from IPython.display import display

# Add project root to path
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(project_root))

from src.pipeline.config import PipelineConfig

print(f"Project root: {project_root}")
print(f"Python version: {sys.version}")

Project root: /mnt/astrid/AstrID
Python version: 3.12.3 (main, Jan  8 2026, 11:30:50) [GCC 13.3.0]


## 1. Load and Inspect Configuration

Let's load a configuration file and inspect its parameters.

In [2]:
# Load SWIFT UV configuration
config_path = project_root / "configs" / "swift_uv_dataset.yaml"
config = PipelineConfig.from_yaml(config_path)

print(f"Dataset: {config.dataset_name}")
print(f"Description: {config.description}")
print(f"\nMissions: {config.query.missions}")
print(f"Filters: {config.query.filters}")
print(f"Year range: {config.query.min_year}-{config.query.max_year}")
print("\nTemporal windows:")
print(f"  Reference: {config.query.days_before} days before discovery")
print(f"  Science: {config.query.days_after} days after discovery")

Dataset: swift_uv_supernovae
Description: SWIFT UVOT UV-band supernova training set with matched filter pairs

Missions: ['SWIFT']
Filters: ['uuu', 'uvw1', 'uvm2', 'uvw2']
Year range: 2005-2020

Temporal windows:
  Reference: 1095 days before discovery
  Science: 730 days after discovery
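
The attribute access used above (`config.query.missions`, `config.query.days_before`) suggests that the YAML sections are mapped onto nested objects. The internals of `PipelineConfig` are not shown in this notebook, so the following is only a minimal sketch of that pattern using dataclasses. `QuerySketch`, `ConfigSketch`, and `from_dict` are hypothetical names, and the `raw` dict stands in for what a YAML loader would return.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuerySketch:
    """Hypothetical stand-in for the `query` section of a config file."""
    missions: List[str]
    min_year: int
    max_year: Optional[int]
    days_before: int
    days_after: int
    filters: Optional[List[str]] = None


@dataclass
class ConfigSketch:
    """Hypothetical top-level config holding one nested section."""
    dataset_name: str
    description: str
    query: QuerySketch

    @classmethod
    def from_dict(cls, raw: dict) -> "ConfigSketch":
        # Unknown keys in raw["query"] raise TypeError, which surfaces typos early.
        return cls(
            dataset_name=raw["dataset_name"],
            description=raw.get("description", ""),
            query=QuerySketch(**raw["query"]),
        )


# Stand-in for the parsed YAML; values copied from the output above.
raw = {
    "dataset_name": "swift_uv_supernovae",
    "description": "SWIFT UVOT UV-band supernova training set with matched filter pairs",
    "query": {
        "missions": ["SWIFT"],
        "filters": ["uuu", "uvw1", "uvm2", "uvw2"],
        "min_year": 2005,
        "max_year": 2020,
        "days_before": 1095,
        "days_after": 730,
    },
}

cfg = ConfigSketch.from_dict(raw)
print(cfg.dataset_name, cfg.query.missions, cfg.query.days_before)
```

Typed sections like this make a misspelled key fail at load time rather than partway through a long download.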


## 2. Validate Configuration

Check for any configuration warnings or issues.

In [3]:
warnings = config.validate()

if warnings:
    print("⚠️  Configuration warnings:")
    for warning in warnings:
        print(f"  - {warning}")
else:
    print("✅ Configuration is valid!")

✅ Configuration is valid!
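
The checks behind `config.validate()` are not shown in this notebook. As an illustration only, a validator that returns its findings as a list of warning strings (so that an empty list means "valid") might look like the sketch below. `validate_query` and its specific rules are assumptions, and it works on a plain dict with the same keys as the `query` section.

```python
def validate_query(query: dict) -> list:
    """Return human-readable warnings for a query section (hypothetical rules)."""
    warnings = []

    min_year = query.get("min_year")
    max_year = query.get("max_year")
    if min_year is not None and max_year is not None and min_year > max_year:
        warnings.append(f"min_year ({min_year}) is after max_year ({max_year})")

    # Both temporal windows are expected to be positive day counts.
    for key in ("days_before", "days_after"):
        value = query.get(key)
        if value is not None and value <= 0:
            warnings.append(f"{key} must be positive, got {value}")

    if not query.get("missions"):
        warnings.append("no missions specified")

    return warnings


sample_query = {
    "missions": ["SWIFT"],
    "min_year": 2005,
    "max_year": 2020,
    "days_before": 1095,
    "days_after": 730,
}

print(validate_query(sample_query))                        # no warnings
print(validate_query({**sample_query, "min_year": 2021}))  # one warning
```

Returning warnings instead of raising lets the caller decide whether a questionable setting should block the run.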


## 3. Customize Configuration (Optional)

You can modify configuration parameters programmatically.

In [4]:
# Example: Create a test configuration with limited scope
test_config = PipelineConfig.from_yaml(config_path)

# Limit to 10 SNe for testing
test_config.query.limit = 10
test_config.dataset_name = "swift_uv_test"

# Update output paths
base_dir = project_root / "output" / "datasets" / test_config.dataset_name
test_config.output.query_results = base_dir / "queries.json"
test_config.output.fits_downloads = base_dir / "fits_downloads"
test_config.output.fits_training = base_dir / "fits_training"
test_config.output.difference_images = base_dir / "difference_images"

print(f"Test configuration: {test_config.dataset_name}")
print(f"Will process: {test_config.query.limit} SNe")
print(f"Output directory: {base_dir}")

Test configuration: swift_uv_test
Will process: 10 SNe
Output directory: /mnt/astrid/AstrID/output/datasets/swift_uv_test


## 4. Save Custom Configuration

Save the customized configuration to a new YAML file.

In [5]:
# Convert config to dictionary
config_dict = test_config.to_dict()

# Save to YAML
custom_config_path = project_root / "configs" / "swift_uv_test.yaml"
with open(custom_config_path, 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False, sort_keys=False)

print(f"✅ Saved custom configuration to: {custom_config_path}")

# Display the YAML content
with open(custom_config_path) as f:
    print("\nConfiguration content:")
    print(f.read())

✅ Saved custom configuration to: /mnt/astrid/AstrID/configs/swift_uv_test.yaml

Configuration content:
dataset_name: swift_uv_test
description: SWIFT UVOT UV-band supernova training set with matched filter pairs
query:
  missions:
  - SWIFT
  filters:
  - uuu
  - uvw1
  - uvm2
  - uvw2
  archives: null
  min_year: 2005
  max_year: 2020
  days_before: 1095
  days_after: 730
  radius_deg: 0.1
  chunk_size: 250
  start_index: 0
  limit: 10
download:
  max_obs_per_type: 5
  max_products_per_obs: 3
  include_auxiliary: false
  require_same_mission: true
  verify_fits: true
  skip_reference: false
  skip_science: false
quality:
  min_overlap_fraction: 0.85
  max_file_size_mb: 500
  verify_wcs: true
output:
  query_results: /mnt/astrid/AstrID/output/datasets/swift_uv_test/queries.json
  fits_downloads: /mnt/astrid/AstrID/output/datasets/swift_uv_test/fits_downloads
  fits_training: /mnt/astrid/AstrID/output/datasets/swift_uv_test/fits_training
  difference_images: /mnt/astrid/AstrID/output/da

## 5. Run Pipeline (Dry Run)

Preview what the pipeline would execute without actually running it.

In [6]:
# Run dry-run to see commands
!python {project_root}/scripts/run_pipeline_from_config.py \
    --config {custom_config_path} \
    --dry-run

2026-01-16 05:43:55,826 - src.pipeline.config - INFO - Loading configuration from /mnt/astrid/AstrID/configs/swift_uv_test.yaml
2026-01-16 05:43:55,832 - __main__ - INFO - Loaded configuration: swift_uv_test
2026-01-16 05:43:55,833 - __main__ - INFO - Description: SWIFT UVOT UV-band supernova training set with matched filter pairs
2026-01-16 05:43:55,833 - __main__ - INFO - 
2026-01-16 05:43:55,834 - __main__ - INFO - Stage: QUERY
2026-01-16 05:43:55,834 - __main__ - INFO - Command: /home/chris/AstrID/.venv/bin/python /mnt/astrid/AstrID/scripts/query_sn_fits_chunked.py --catalog /mnt/astrid/AstrID/resources/sncat_compiled.txt --output /mnt/astrid/AstrID/output/datasets/swift_uv_test/queries.json --chunk-size 250 --checkpoint output/datasets/swift_uv/checkpoint.json --chunk-dir output/datasets/swift_uv/chunks --days-before 1095 --days-after 730 --radius 0.1 --missions SWIFT --min-year 2005 --max-year 2020 --limit 10 --reset-checkpoint
2026-01-16 05:43:55,834 - __main__ - INFO - [DRY RUN
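
The general idea of a dry run is to build each stage's command exactly as it would be executed, log it, and then skip the execution step. The runner's implementation is not shown here, so the following is only a sketch of that pattern. `run_stage` and its messages are hypothetical and do not reproduce the script's actual log format.

```python
import shlex
import subprocess
import sys


def run_stage(name, command, dry_run=True):
    """Log a stage's command; execute it only when dry_run is False."""
    print(f"Stage: {name.upper()}")
    # shlex.join quotes arguments so the printed line can be pasted into a shell.
    print(f"Command: {shlex.join(command)}")
    if dry_run:
        print("(dry run: command not executed)")
        return None
    # check=True raises CalledProcessError if the stage exits non-zero.
    return subprocess.run(command, check=True)


# A harmless placeholder command stands in for a real stage script.
placeholder = [sys.executable, "-c", "print('querying')"]
result = run_stage("query", placeholder, dry_run=True)
print(result)
```

Because the command list is identical in both modes, what the dry run prints is what a real run would execute.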

## 6. Run Individual Stages

You can run stages individually for testing or debugging.

### Stage 1: Query MAST Archive

Search for observations matching the configuration.

In [7]:
# Run query stage only
# Uncomment to execute:
# !python {project_root}/scripts/run_pipeline_from_config.py \
#     --config {custom_config_path} \
#     --stage query

### Stage 2: Filter Same-Mission Pairs

In [8]:
# Run filter stage only
# Uncomment to execute:
# !python {project_root}/scripts/run_pipeline_from_config.py \
#     --config {custom_config_path} \
#     --stage filter
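
The `!python ...` shell escapes above only work inside IPython. To launch a single stage from a plain Python script, one option is to build the argument list explicitly and pass it to `subprocess.run`. The sketch below relies only on the `--config` and `--stage` flags shown in this notebook. `build_stage_command` is a hypothetical helper, not part of the project.

```python
import subprocess
import sys
from pathlib import Path


def build_stage_command(project_root: Path, config_path: Path, stage: str) -> list:
    """Assemble the argv for running one pipeline stage (hypothetical helper)."""
    script = project_root / "scripts" / "run_pipeline_from_config.py"
    return [sys.executable, str(script), "--config", str(config_path), "--stage", stage]


root = Path("/mnt/astrid/AstrID")
cmd = build_stage_command(root, root / "configs" / "swift_uv_test.yaml", "query")
print(cmd[1:])

# Uncomment to execute; check=True raises CalledProcessError on a non-zero exit code.
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when a path contains spaces.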

## 7. Inspect Results

After running the query stage, inspect its results.

In [9]:
# Load query results if they exist
query_results_path = test_config.output.query_results

if query_results_path.exists():
    with open(query_results_path) as f:
        query_data = json.load(f)
    
    print(f"Total SNe queried: {len(query_data)}")
    
    # Count viable pairs
    viable = sum(
        1 for entry in query_data 
        if entry.get('reference_observations') and entry.get('science_observations')
    )
    print(f"Viable pairs (both ref & sci): {viable}")
    
    # Show first entry
    if query_data:
        print("\nFirst entry:")
        print(json.dumps(query_data[0], indent=2, default=str)[:500] + "...")
else:
    print("Query results not found. Run the query stage first.")

Query results not found. Run the query stage first.
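
The viability test in the cell above relies on truthiness, so an entry counts as non-viable both when an observation list is empty and when the key is missing entirely. The sketch below demonstrates this on a small in-memory list. The three entries and their `sn_name` and `obs_id` fields are invented for illustration and are not real query results.

```python
def count_viable(entries):
    """Count entries with at least one reference AND one science observation."""
    return sum(
        1 for entry in entries
        if entry.get("reference_observations") and entry.get("science_observations")
    )


# Invented sample entries (not real query output).
sample_entries = [
    # Both lists populated: viable.
    {"sn_name": "SN-A",
     "reference_observations": [{"obs_id": "r1"}],
     "science_observations": [{"obs_id": "s1"}]},
    # Empty reference list: not viable.
    {"sn_name": "SN-B",
     "reference_observations": [],
     "science_observations": [{"obs_id": "s2"}]},
    # Missing reference key: .get() returns None, so not viable.
    {"sn_name": "SN-C",
     "science_observations": [{"obs_id": "s3"}]},
]

print(f"Viable pairs: {count_viable(sample_entries)} of {len(sample_entries)}")
```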


## 8. Visualize Pipeline Progress

Check which stages have completed.

In [10]:
# Marker file written by each stage (the Filter stage is not tracked here)
stages = [
    ("Query", test_config.output.query_results),
    ("Download", test_config.output.fits_downloads / "download_results.json"),
    ("Organize", test_config.output.fits_training / "training_manifest.json"),
    ("Differencing", test_config.output.difference_images / "processing_summary.json"),
]

print("Pipeline Stage Status:")
print("=" * 50)
for stage_name, path in stages:
    status = "✅ Complete" if path.exists() else "⏳ Pending"
    print(f"{stage_name:15} {status}")
    if path.exists():
        size_kb = path.stat().st_size / 1024  # bytes -> KB
        print(f"{'':15} File size: {size_kb:.1f} KB")

Pipeline Stage Status:
==================================================
Query           ⏳ Pending
Download        ⏳ Pending
Organize        ⏳ Pending
Differencing    ⏳ Pending


## 9. Compare Configurations

Compare parameters across different mission configurations.

In [11]:
import pandas as pd

# Load all configurations
configs_dir = project_root / "configs"
config_files = [
    "swift_uv_dataset.yaml",
    "ps1_optical_dataset.yaml",
    "galex_uv_dataset.yaml",
]

comparison_data = []
for config_file in config_files:
    cfg = PipelineConfig.from_yaml(configs_dir / config_file)
    comparison_data.append({
        "Dataset": cfg.dataset_name,
        "Missions": ", ".join(cfg.query.missions),
        "Filters": ", ".join(cfg.query.filters) if cfg.query.filters else "All",
        "Year Range": f"{cfg.query.min_year}-{cfg.query.max_year or 'present'}",
        "Days Before": cfg.query.days_before,
        "Days After": cfg.query.days_after,
        "Max Obs": cfg.download.max_obs_per_type,
    })

df = pd.DataFrame(comparison_data)
display(df)

| | Dataset | Missions | Filters | Year Range | Days Before | Days After | Max Obs |
|---|---|---|---|---|---|---|---|
| 0 | swift_uv_supernovae | SWIFT | uuu, uvw1, uvm2, uvw2 | 2005-2020 | 1095 | 730 | 5 |
| 1 | ps1_optical_supernovae | PS1 | g, r, i, z, y | 2010-2020 | 1095 | 730 | 5 |
| 2 | galex_uv_supernovae | GALEX | fuv, nuv | 2005-2020 | 1095 | 730 | 5 |


## 10. Run Complete Pipeline

Execute all stages in sequence.

In [12]:
# Run complete pipeline
# WARNING: This will take several hours for a full dataset
# Uncomment to execute:

# !python {project_root}/scripts/run_pipeline_from_config.py \
#     --config {custom_config_path} \
#     --visualize

## 11. Analyze Results

After pipeline completion, analyze the generated dataset.

In [13]:
# Load processing summary
summary_path = test_config.output.difference_images / "processing_summary.json"

if summary_path.exists():
    with open(summary_path) as f:
        summary = json.load(f)
    
    print(f"Pipeline version: {summary.get('pipeline_version')}")
    print(f"SNe processed: {summary.get('n_processed')}")
    
    # Analyze results
    results = summary.get('results', [])
    if results:
        missions = {}
        filters = {}
        overlaps = []
        
        for r in results:
            mission = r.get('mission_name', 'Unknown')
            missions[mission] = missions.get(mission, 0) + 1
            
            filt = r.get('filter_name', 'Unknown')
            filters[filt] = filters.get(filt, 0) + 1
            
            overlaps.append(r.get('overlap_fraction', 0))
        
        print("\nMission breakdown:")
        for mission, count in sorted(missions.items()):
            print(f"  {mission}: {count} pairs")
        
        print("\nFilter breakdown:")
        for filt, count in sorted(filters.items()):
            print(f"  {filt}: {count} pairs")
        
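        # overlap_fraction is assumed to be a 0-1 fraction (cf. min_overlap_fraction: 0.85);
        # the "%" format type multiplies by 100 before appending the percent sign.
        print("\nOverlap statistics:")
        print(f"  Mean: {sum(overlaps)/len(overlaps):.1%}")
        print(f"  Min: {min(overlaps):.1%}")
        print(f"  Max: {max(overlaps):.1%}")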
else:
    print("Processing summary not found. Run the complete pipeline first.")

Processing summary not found. Run the complete pipeline first.
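
The per-mission and per-filter tallies in the cell above can also be written with `collections.Counter`. The sketch below uses the same record keys (`mission_name`, `filter_name`, `overlap_fraction`) on three invented records, and assumes `overlap_fraction` is stored as a value between 0 and 1.

```python
from collections import Counter
from statistics import mean

# Invented sample records (not real pipeline output).
sample_results = [
    {"mission_name": "SWIFT", "filter_name": "uvw1", "overlap_fraction": 0.92},
    {"mission_name": "SWIFT", "filter_name": "uvw1", "overlap_fraction": 0.88},
    {"mission_name": "SWIFT", "filter_name": "uvm2", "overlap_fraction": 0.97},
]

# Counter replaces the manual dict.get(key, 0) + 1 bookkeeping.
missions = Counter(r.get("mission_name", "Unknown") for r in sample_results)
filters = Counter(r.get("filter_name", "Unknown") for r in sample_results)

# Skip records without an overlap value instead of counting them as 0.
overlaps = [r["overlap_fraction"] for r in sample_results if "overlap_fraction" in r]

for filt, count in sorted(filters.items()):
    print(f"  {filt}: {count} pairs")

if overlaps:
    # The "%" format type multiplies the fraction by 100.
    print(f"  Mean overlap: {mean(overlaps):.1%}")
    print(f"  Min overlap:  {min(overlaps):.1%}")
    print(f"  Max overlap:  {max(overlaps):.1%}")
```

Excluding records that lack an overlap value keeps missing data from dragging the mean and minimum toward zero.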


## 12. Export Training Manifest

Flatten the training manifest into one reference/science pair per entry for ML training.

In [14]:
# Load training manifest
manifest_path = test_config.output.fits_training / "training_manifest.json"

if manifest_path.exists():
    with open(manifest_path) as f:
        manifest = json.load(f)
    
    print(f"Total SNe in training set: {len(manifest)}")
    
    # Count files
    total_ref = sum(len(entry.get('reference_files', [])) for entry in manifest)
    total_sci = sum(len(entry.get('science_files', [])) for entry in manifest)
    
    print(f"Total reference files: {total_ref}")
    print(f"Total science files: {total_sci}")
    print(f"Total files: {total_ref + total_sci}")
    
    # Create simplified manifest for ML training
    ml_manifest = []
    for entry in manifest:
        sn_name = entry['sn_name']
        for ref_file in entry.get('reference_files', []):
            for sci_file in entry.get('science_files', []):
                ml_manifest.append({
                    'sn_name': sn_name,
                    'reference': str(test_config.output.fits_training / ref_file),
                    'science': str(test_config.output.fits_training / sci_file),
                })
    
    # Save ML manifest
    ml_manifest_path = test_config.output.fits_training / "ml_training_manifest.json"
    with open(ml_manifest_path, 'w') as f:
        json.dump(ml_manifest, f, indent=2)
    
    print(f"\n✅ ML training manifest saved to: {ml_manifest_path}")
    print(f"   Total training pairs: {len(ml_manifest)}")
else:
    print("Training manifest not found. Run the organize stage first.")

Training manifest not found. Run the organize stage first.
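
The nested loops in the cell above pair every reference file with every science file of the same supernova, so the number of training pairs per supernova is the product of the two file counts. The sketch below expresses this with `itertools.product` on an invented manifest. `build_pairs` is a hypothetical helper and the file names are placeholders.

```python
from itertools import product
from pathlib import Path


def build_pairs(manifest, root):
    """Expand each manifest entry into all (reference, science) file pairs."""
    pairs = []
    for entry in manifest:
        refs = entry.get("reference_files", [])
        scis = entry.get("science_files", [])
        # product() yields nothing if either list is empty.
        for ref_file, sci_file in product(refs, scis):
            pairs.append({
                "sn_name": entry["sn_name"],
                "reference": str(root / ref_file),
                "science": str(root / sci_file),
            })
    return pairs


# Invented sample manifest (placeholder names, not real files).
sample_manifest = [
    {"sn_name": "SN-A",
     "reference_files": ["SN-A/ref1.fits", "SN-A/ref2.fits"],
     "science_files": ["SN-A/sci1.fits", "SN-A/sci2.fits", "SN-A/sci3.fits"]},
    {"sn_name": "SN-B",
     "reference_files": ["SN-B/ref1.fits"],
     "science_files": []},
]

pairs = build_pairs(sample_manifest, Path("fits_training"))
print(f"Total training pairs: {len(pairs)}")  # 2 * 3 + 1 * 0 = 6
```

Since the pair count grows multiplicatively, a supernova with many observations can dominate the training set unless pairs are capped or sampled per supernova.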


## Next Steps

After generating your dataset:

1. **Quality Check**: Inspect a few difference images to verify quality
2. **Training Data Preparation**: Generate image triplets (reference, science, difference)
3. **Labeling**: Create ground truth labels at known SN positions
4. **Model Training**: Train CNN classifier on the dataset
5. **Evaluation**: Test on held-out data

See the main project documentation for more details on each step.

## Resources

- [Data Pipeline Documentation](../docs/research/DATA_PIPELINE.md)
- [Configuration Files](../configs/)
- [Pipeline Scripts](../scripts/)
- [Midterm Notes](../docs/research/MIDTERM_NOTES.md)