# NeurIPS 2024 Ariel Data Challenge — HuggingFace Dataset Upload

**Purpose**: Preprocess every planet in the Ariel competition dataset and push  
the results to `Smooth-Cactus0/ariel-exoplanet-2024` on the HuggingFace Hub.

**Data format**: The competition data consists of **parquet files** organized per planet:  
`{split}/{planet_id}/AIRS-CH0_signal.parquet`, `{split}/{planet_id}/FGS1_signal.parquet`, and one calibration directory per instrument.

**Outputs**:
- `data/preprocessed/{train,test}/{planet_id}.npz` — one compressed NumPy archive per planet
- HuggingFace dataset repository with `ariel_dataset.py` loading script

> **Note**: This notebook is Kaggle-ready and requires the `ariel-data-challenge-2024` competition dataset to be attached as an input.

## 1. Setup

In [None]:
# Install / verify required packages
import subprocess, sys

REQUIRED_PACKAGES = [
    "pyarrow",
    "tqdm",
    "huggingface_hub",
    "datasets",
]

for pkg in REQUIRED_PACKAGES:
    try:
        __import__(pkg.replace("-", "_"))
        print(f"{pkg}: already installed")
    except ImportError:
        print(f"{pkg}: not found — installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg, "-q"])
        print(f"{pkg}: installed")

print("[Done] Package check complete.")

In [None]:
import os
import subprocess
import sys
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow.parquet as pq
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

# ---------------------------------------------------------------------------
# Clone the repo on Kaggle and add to sys.path
# ---------------------------------------------------------------------------
REPO_DIR = "/kaggle/working/ariel-exoplanet-ml"
PROJECT_DIR = REPO_DIR + "/Kaggle competition/ARIEL neurIPS"

if not Path(REPO_DIR).exists():
    subprocess.run(
        ["git", "clone",
         "https://github.com/Smooth-Cactus0/ariel-exoplanet-ml.git",
         REPO_DIR],
        check=True,
    )
    print(f"Cloned repo to {REPO_DIR}")
else:
    print(f"Repo already exists at {REPO_DIR}")

sys.path.insert(0, PROJECT_DIR)

# ---------------------------------------------------------------------------
# HuggingFace token: paste your token below, or set the HF_TOKEN environment
# variable beforehand (setdefault leaves an existing value untouched)
# ---------------------------------------------------------------------------
os.environ.setdefault("HF_TOKEN", "")  # Set your HuggingFace token here

# ---------------------------------------------------------------------------
# Directories
# ---------------------------------------------------------------------------
DATA_ROOT = Path("/kaggle/input/ariel-data-challenge-2024")
OUT_DIR   = Path("/kaggle/working/preprocessed")
OUT_DIR.mkdir(parents=True, exist_ok=True)
(OUT_DIR / "train").mkdir(exist_ok=True)
(OUT_DIR / "test").mkdir(exist_ok=True)

# Figure output directory
FIG_DIR = Path("/kaggle/working/figures_hf_upload")
FIG_DIR.mkdir(parents=True, exist_ok=True)

# Plot style
plt.rcParams.update({
    "savefig.dpi": 150,
    "savefig.facecolor": "white",
})

# ---------------------------------------------------------------------------
# Instrument geometry constants (confirmed from competition data)
# ---------------------------------------------------------------------------
AIRS_N_ROWS = 32    # spatial rows in AIRS-CH0 detector
AIRS_N_COLS = 356   # spectral channels in AIRS-CH0
FGS1_N_ROWS = 32    # FGS1 detector rows
FGS1_N_COLS = 32    # FGS1 detector columns
FGS1_RATIO  = 12    # FGS1 cadence is 12x AIRS cadence

# ---------------------------------------------------------------------------
# Preprocessing hyper-parameters (must match training config)
# ---------------------------------------------------------------------------
INGRESS  = 0.20
EGRESS   = 0.80
BIN_SIZE = 5

print(f"DATA_ROOT       : {DATA_ROOT}  (exists={DATA_ROOT.exists()})")
print(f"OUT_DIR         : {OUT_DIR}")
print(f"FIG_DIR         : {FIG_DIR}")
print(f"AIRS geometry   : {AIRS_N_ROWS} rows x {AIRS_N_COLS} spectral channels")
print(f"FGS1 geometry   : {FGS1_N_ROWS} x {FGS1_N_COLS}, cadence ratio={FGS1_RATIO}")
print(f"Ingress / Egress / BinSize : {INGRESS} / {EGRESS} / {BIN_SIZE}")

print("[Done] Setup complete.")
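
Hard-coding a token in a notebook cell risks leaking it when the notebook is shared. A minimal sketch of an alternative, assuming a Kaggle secret named `HF_TOKEN` has been added to the notebook; the secret name and the helper `resolve_hf_token` are illustrative and not part of the original notebook:

```python
import os


def resolve_hf_token(secret_name: str = "HF_TOKEN") -> str:
    """Return a HuggingFace token from the environment or Kaggle Secrets.

    Hypothetical helper: checks the environment variable first, then falls
    back to Kaggle's secrets store when running inside a Kaggle kernel.
    Returns an empty string if neither source provides a token.
    """
    token = os.environ.get(secret_name, "")
    if token:
        return token
    try:
        # kaggle_secrets is only importable inside Kaggle kernels.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return ""
    try:
        return UserSecretsClient().get_secret(secret_name) or ""
    except Exception:
        # Secret not defined, or not attached to this notebook.
        return ""


hf_token = resolve_hf_token()
print("HF token found" if hf_token else "HF token missing")
```

The result could then be written to `os.environ["HF_TOKEN"]` so that the upload section picks it up unchanged.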

## 2. Preprocess All Training Planets

In [None]:
# Load ADC info table and labels
adc_path    = DATA_ROOT / "train_adc_info.csv"
labels_path = DATA_ROOT / "train_labels.csv"

df_adc = pd.read_csv(adc_path)
print(f"train_adc_info  : {df_adc.shape[0]} planets, {df_adc.shape[1]} columns")
print(f"  Columns: {list(df_adc.columns)}")

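# Build a lookup dict: planet_id -> row of 5 ADC features.
# Built column-wise: iterrows() upcasts mixed int/float rows to float, which
# would turn an integer planet_id into a key like "123.0".
adc_feature_cols = [c for c in df_adc.columns if c != "planet_id"]
adc_lookup = dict(zip(
    df_adc["planet_id"].astype(str),
    df_adc[adc_feature_cols].to_numpy(dtype=np.float32),
))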

# Labels: train_labels.csv with columns planet_id, wl_1, ..., wl_283
if labels_path.exists():
    df_labels = pd.read_csv(labels_path)
    labelled_ids = set(df_labels["planet_id"].astype(str))
    # Build lookup: planet_id -> np.array of shape (283,)
    # (column-wise, so integer planet IDs are not upcast to float by iterrows())
    wl_cols = [c for c in df_labels.columns if c.startswith("wl_")]
    labels_lookup = dict(zip(
        df_labels["planet_id"].astype(str),
        df_labels[wl_cols].to_numpy(dtype=np.float32),
    ))
    print(f"train_labels    : {len(labelled_ids)} labelled planets, {len(wl_cols)} wavelength bins")
else:
    df_labels     = None
    labelled_ids  = set()
    labels_lookup = {}
    print("WARNING: train_labels.csv not found — no label extraction possible.")

print(f"\n[Done] Tables loaded. {len(labelled_ids)} of {df_adc.shape[0]} planets are labelled.")
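
When `planet_id` is parsed as an integer column next to float features, `DataFrame.iterrows()` upcasts each row to float, so `str(row["planet_id"])` produces keys such as `"101.0"` rather than `"101"`, and lookups by directory name silently miss. A minimal sketch on a made-up toy table (all values are invented for illustration) showing the effect and a column-wise construction that keeps the IDs intact:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_adc_info.csv; the numbers are made up.
df_toy = pd.DataFrame({
    "planet_id": [101, 102],
    "FGS1_adc_offset": [-0.5, -0.4],
    "FGS1_adc_gain": [0.9, 1.1],
    "AIRS-CH0_adc_offset": [-0.3, -0.2],
    "AIRS-CH0_adc_gain": [1.0, 1.2],
    "star": [0, 1],
})

# Row-wise iteration: the mixed int/float row becomes a float64 Series.
iterrows_keys = [str(row["planet_id"]) for _, row in df_toy.iterrows()]
print("iterrows keys   :", iterrows_keys)

# Column-wise construction: IDs are converted to str before any upcasting.
feature_cols = [c for c in df_toy.columns if c != "planet_id"]
features = df_toy[feature_cols].to_numpy(dtype=np.float32)  # (n_planets, 5)
toy_lookup = dict(zip(df_toy["planet_id"].astype(str), features))
print("column-wise keys:", list(toy_lookup))
print("feature row     :", toy_lookup["101"].shape, toy_lookup["101"].dtype)
```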

In [None]:
def parse_label_row(labels_lookup: dict, planet_id: str):
    """
    Extract target_mean from the labels lookup for a single planet.

    Returns (target_mean,) where target_mean is a float32 array of shape (283,),
    or (None,) if the planet has no labels.

    Note: there is no target_std in the new label format — sigma is model-predicted.
    """
    if planet_id not in labels_lookup:
        return (None,)
    return (labels_lookup[planet_id],)


def calibrate(signal_raw, dark, flat, dead, gain, offset):
    """
    Apply detector calibration to raw ADC counts.

    Parameters
    ----------
    signal_raw : np.ndarray, uint16 — raw detector signal
    dark       : np.ndarray, float  — dark current frame
    flat       : np.ndarray, float  — flat field frame
    dead       : np.ndarray, float  — dead pixel mask (1 = dead)
    gain       : float              — ADC gain
    offset     : float              — ADC offset

    Returns
    -------
    np.ndarray, float32 — calibrated signal
    """
    # Convert from ADC counts to physical units
    signal = signal_raw.astype(np.float32) * gain + offset
    # Subtract dark current
    signal = signal - dark
    # Apply flat field correction (avoid division by zero)
    flat_safe = np.where(flat == 0, 1.0, flat)
    signal = signal / flat_safe
    # Mask dead pixels (set to NaN, will be handled downstream)
    signal = np.where(dead > 0.5, np.nan, signal)
    return signal


def load_calibration(planet_dir, instrument):
    """
    Load calibration frames for a given instrument from parquet files.

    Returns dict with keys: dark, flat, dead, read, linear_corr.
    A value is None when the corresponding parquet file is missing.
    """
    cal_dir = planet_dir / f"{instrument}_calibration"
    cal = {}
    for name in ["dark", "flat", "dead", "read", "linear_corr"]:
        fpath = cal_dir / f"{name}.parquet"
        if fpath.exists():
            df = pd.read_parquet(fpath)
            cal[name] = df.values.astype(np.float32)
        else:
            cal[name] = None
    return cal


def load_planet_parquet(planet_dir, adc_info):
    """
    Load and calibrate AIRS-CH0 and FGS1 signals from parquet files.

    Parameters
    ----------
    planet_dir : Path — directory for one planet (contains signal + calibration parquets)
    adc_info   : np.ndarray, shape (5,) — [FGS1_adc_offset, FGS1_adc_gain,
                                            AIRS-CH0_adc_offset, AIRS-CH0_adc_gain, star]

    Returns
    -------
    airs_signal : np.ndarray, shape (n_time, 356), float32
    fgs1_signal : np.ndarray, shape (n_time_fgs1 // FGS1_RATIO,), float32,
                  downsampled to the AIRS cadence
    """
    fgs1_offset, fgs1_gain, airs_offset, airs_gain, _star = adc_info

    # --- AIRS-CH0 ---------------------------------------------------------
    airs_path = planet_dir / "AIRS-CH0_signal.parquet"
    airs_flat = pd.read_parquet(airs_path).values  # (n_time, 32*356) uint16
    n_time_airs = airs_flat.shape[0]
    airs_3d = airs_flat.reshape(n_time_airs, AIRS_N_ROWS, AIRS_N_COLS)  # (t, 32, 356)

    airs_cal = load_calibration(planet_dir, "AIRS-CH0")
    if airs_cal["dark"] is not None:
        dark_2d = airs_cal["dark"].reshape(AIRS_N_ROWS, AIRS_N_COLS)
        flat_2d = airs_cal["flat"].reshape(AIRS_N_ROWS, AIRS_N_COLS) if airs_cal["flat"] is not None else np.ones((AIRS_N_ROWS, AIRS_N_COLS), dtype=np.float32)
        dead_2d = airs_cal["dead"].reshape(AIRS_N_ROWS, AIRS_N_COLS) if airs_cal["dead"] is not None else np.zeros((AIRS_N_ROWS, AIRS_N_COLS), dtype=np.float32)
        airs_calibrated = calibrate(airs_3d, dark_2d, flat_2d, dead_2d, airs_gain, airs_offset)
    else:
        # No calibration available — just convert from ADC
        airs_calibrated = airs_3d.astype(np.float32) * airs_gain + airs_offset

    # Sum over spatial rows -> (n_time, 356)
    airs_signal = np.nansum(airs_calibrated, axis=1)  # (n_time, 356)

    # --- FGS1 -------------------------------------------------------------
    fgs1_path = planet_dir / "FGS1_signal.parquet"
    fgs1_flat = pd.read_parquet(fgs1_path).values  # (n_time_fgs1, 32*32) uint16
    n_time_fgs1 = fgs1_flat.shape[0]
    fgs1_3d = fgs1_flat.reshape(n_time_fgs1, FGS1_N_ROWS, FGS1_N_COLS)  # (t, 32, 32)

    fgs1_cal = load_calibration(planet_dir, "FGS1")
    if fgs1_cal["dark"] is not None:
        dark_f = fgs1_cal["dark"].reshape(FGS1_N_ROWS, FGS1_N_COLS)
        flat_f = fgs1_cal["flat"].reshape(FGS1_N_ROWS, FGS1_N_COLS) if fgs1_cal["flat"] is not None else np.ones((FGS1_N_ROWS, FGS1_N_COLS), dtype=np.float32)
        dead_f = fgs1_cal["dead"].reshape(FGS1_N_ROWS, FGS1_N_COLS) if fgs1_cal["dead"] is not None else np.zeros((FGS1_N_ROWS, FGS1_N_COLS), dtype=np.float32)
        fgs1_calibrated = calibrate(fgs1_3d, dark_f, flat_f, dead_f, fgs1_gain, fgs1_offset)
    else:
        fgs1_calibrated = fgs1_3d.astype(np.float32) * fgs1_gain + fgs1_offset

    # Sum over spatial dims -> (n_time_fgs1,), then downsample 12:1 to match AIRS cadence
    fgs1_summed = np.nansum(fgs1_calibrated, axis=(1, 2))  # (n_time_fgs1,)

    # Downsample FGS1 by averaging groups of FGS1_RATIO frames
    n_trim = (n_time_fgs1 // FGS1_RATIO) * FGS1_RATIO
    fgs1_downsampled = fgs1_summed[:n_trim].reshape(-1, FGS1_RATIO).mean(axis=1)  # (n_time_airs,)

    return airs_signal, fgs1_downsampled


print("[Done] Helper functions defined: parse_label_row, calibrate, load_calibration, load_planet_parquet.")
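
The shape handling in `calibrate` and `load_planet_parquet` depends on NumPy broadcasting: the 2D calibration frames apply to every time step of the 3D signal cube. A small sketch on synthetic data, using toy detector sizes and made-up calibration values rather than the real AIRS/FGS1 geometry, that follows the same steps and the same 12:1 downsampling:

```python
import numpy as np

rng = np.random.default_rng(0)
n_time, n_rows, n_cols, ratio = 24, 4, 6, 12  # toy sizes, not the real detector

raw = rng.integers(100, 200, size=(n_time, n_rows, n_cols)).astype(np.uint16)
dark = np.full((n_rows, n_cols), 2.0, dtype=np.float32)
flat = np.ones((n_rows, n_cols), dtype=np.float32)
dead = np.zeros((n_rows, n_cols), dtype=np.float32)
dead[0, 0] = 1.0  # mark one pixel as dead

# Same steps as calibrate(): ADC conversion, dark subtraction,
# flat-field division, dead-pixel masking.
gain, offset = 1.0, 0.0
signal = raw.astype(np.float32) * gain + offset
signal = (signal - dark) / np.where(flat == 0, 1.0, flat)
signal = np.where(dead > 0.5, np.nan, signal)  # (24, 4, 6); mask broadcasts over time

# nansum treats the masked (NaN) pixels as zero contribution.
per_channel = np.nansum(signal, axis=1)       # (24, 6), like the AIRS row sum
total_flux = np.nansum(signal, axis=(1, 2))   # (24,),   like the FGS1 frame sum

# Average consecutive groups of `ratio` frames, dropping any remainder.
n_trim = (n_time // ratio) * ratio
downsampled = total_flux[:n_trim].reshape(-1, ratio).mean(axis=1)  # (2,)

print(per_channel.shape, total_flux.shape, downsampled.shape)
```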

In [None]:
# Import the preprocessing function from src/preprocessing.py
try:
    from src.preprocessing import preprocess_planet
    print("Imported preprocess_planet from src.preprocessing")
except ImportError as exc:
    print(f"WARNING: Could not import preprocess_planet ({exc}).")
    print("Ensure REPO_DIR is set correctly and the repo has been cloned.")
    raise

# ---------------------------------------------------------------------------
# Preprocess training planets
# ---------------------------------------------------------------------------
TRAIN_DIR = DATA_ROOT / "train"
train_out_dir = OUT_DIR / "train"

# Discover planet directories
all_train_dirs = sorted([d for d in TRAIN_DIR.iterdir() if d.is_dir()])
all_train_ids  = [d.name for d in all_train_dirs]
n_total   = len(all_train_ids)
n_success = 0
n_labelled_saved = 0
n_failed  = 0
total_bytes = 0

print(f"Found {n_total} planet directories in {TRAIN_DIR}")

with tqdm(total=n_total, desc="Train planets", unit="planet") as pbar:
    for planet_dir in all_train_dirs:
        pid = planet_dir.name
        out_path = train_out_dir / f"{pid}.npz"

        # Skip if already done (allows resuming after interruption)
        if out_path.exists():
            total_bytes += out_path.stat().st_size  # keep size statistics correct
            n_success += 1
            pbar.update(1)
            pbar.set_postfix(skipped="(cached)")
            continue

        # --- Auxiliary features (ADC info) --------------------------------
        aux_row = adc_lookup.get(pid, np.zeros(len(adc_feature_cols), dtype=np.float32))
        if pid not in adc_lookup:
            warnings.warn(f"ADC info not found for planet {pid} — using zeros.")

        # --- Load and calibrate from parquet ------------------------------
        try:
            airs_raw, fgs1_raw = load_planet_parquet(planet_dir, aux_row)
        except FileNotFoundError as exc:
            warnings.warn(f"Missing parquet for planet {pid}: {exc} — skipping.")
            n_failed += 1
            pbar.update(1)
            continue
        except Exception as exc:
            warnings.warn(f"Error loading planet {pid}: {exc} — skipping.")
            n_failed += 1
            pbar.update(1)
            continue

        # --- Preprocessing pipeline ---------------------------------------
        try:
            result = preprocess_planet(
                airs_raw, fgs1_raw,
                ingress=INGRESS, egress=EGRESS, bin_size=BIN_SIZE,
            )
        except Exception as exc:
            warnings.warn(f"Preprocessing failed for planet {pid}: {exc} — skipping.")
            n_failed += 1
            pbar.update(1)
            continue

        # --- Label extraction (optional) ----------------------------------
        (target_mean,) = parse_label_row(labels_lookup, pid)

        # --- Save to .npz ------------------------------------------------
        save_kwargs = dict(
            airs_norm        = result["airs_norm"],          # (time_binned, 356)
            fgs1_norm        = result["fgs1_norm"],          # (time_binned,)
            aux              = aux_row,                       # (5,)
            transit_depth    = result["transit_depth"],      # (356,)
            transit_depth_err= result["transit_depth_err"],  # (356,)
            mask_oot         = result["mask_oot"],           # (time_binned,)
        )
        if target_mean is not None:
            save_kwargs["target_mean"] = target_mean  # (283,)
            n_labelled_saved += 1

        np.savez_compressed(str(out_path), **save_kwargs)
        total_bytes += out_path.stat().st_size
        n_success += 1
        pbar.update(1)
        pbar.set_postfix(
            success=n_success, labelled=n_labelled_saved, failed=n_failed
        )

# Summary
avg_kb = (total_bytes / max(n_success, 1)) / 1024
print("\n" + "=" * 60)
print("Train preprocessing summary")
print("=" * 60)
print(f"  Total planets     : {n_total}")
print(f"  Successfully saved: {n_success}")
print(f"  Labelled planets  : {n_labelled_saved}")
print(f"  Failed / skipped  : {n_failed}")
print(f"  Total size on disk: {total_bytes / 1_048_576:.1f} MB")
print(f"  Avg file size     : {avg_kb:.1f} KB")
print("[Done] Train preprocessing complete.")
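
`preprocess_planet` is imported from the repository and its implementation is not shown in this notebook. As a rough illustration of how the `INGRESS`, `EGRESS`, and `BIN_SIZE` parameters could be used, the sketch below bins a light curve in time, treats the frames before the ingress fraction and after the egress fraction as out-of-transit, and measures the depth relative to that baseline. The function `toy_transit_depth` is hypothetical and the real pipeline may differ in every step:

```python
import numpy as np


def toy_transit_depth(flux, ingress=0.20, egress=0.80, bin_size=5):
    """Hypothetical illustration; NOT the repo's preprocess_planet.

    flux : (n_time, n_channels) light curves.
    Returns (depth, depth_err, mask_oot) computed after temporal binning.
    """
    # Temporal binning: average groups of bin_size frames, dropping any remainder.
    n_time = (flux.shape[0] // bin_size) * bin_size
    binned = flux[:n_time].reshape(-1, bin_size, flux.shape[1]).mean(axis=1)

    # Assumption: ingress/egress are fractions of the observation duration.
    phase = np.linspace(0.0, 1.0, binned.shape[0])
    mask_oot = (phase < ingress) | (phase > egress)  # out-of-transit frames

    baseline = binned[mask_oot].mean(axis=0)
    norm = binned / baseline                          # out-of-transit level = 1
    in_transit = norm[~mask_oot]
    depth = 1.0 - in_transit.mean(axis=0)
    depth_err = in_transit.std(axis=0, ddof=1) / np.sqrt(in_transit.shape[0])
    return depth, depth_err, mask_oot


# Synthetic box-shaped transit with a 1% depth in every channel.
n_time, n_channels = 500, 8
flux = np.ones((n_time, n_channels))
flux[100:400] -= 0.01
depth, depth_err, mask_oot = toy_transit_depth(flux)
print(np.round(depth, 4))
```

Because these settings shape the extracted features, they are stored in `preprocessing_summary.json` alongside the data.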

## 3. Preprocess All Test Planets

In [None]:
# ---------------------------------------------------------------------------
# Preprocess test planets (parquet directory layout)
# ---------------------------------------------------------------------------
TEST_DIR = DATA_ROOT / "test"
test_out_dir = OUT_DIR / "test"

# Load test ADC info if available
test_adc_path = DATA_ROOT / "test_adc_info.csv"
if test_adc_path.exists():
    df_adc_test = pd.read_csv(test_adc_path)
    test_adc_feature_cols = [c for c in df_adc_test.columns if c != "planet_id"]
    # Built column-wise so integer planet IDs are not upcast to float by iterrows()
    test_adc_lookup = dict(zip(
        df_adc_test["planet_id"].astype(str),
        df_adc_test[test_adc_feature_cols].to_numpy(dtype=np.float32),
    ))
    print(f"test_adc_info   : {df_adc_test.shape[0]} planets, {df_adc_test.shape[1]} columns")
else:
    # Fallback: use train ADC lookup (planets may overlap) or zeros
    test_adc_lookup = adc_lookup.copy()
    test_adc_feature_cols = adc_feature_cols
    print("WARNING: test_adc_info.csv not found — falling back to train ADC lookup.")

if not TEST_DIR.exists():
    print(f"WARNING: Test directory {TEST_DIR} does not exist — skipping test split.")
    all_test_dirs = []
else:
    all_test_dirs = sorted([d for d in TEST_DIR.iterdir() if d.is_dir()])

all_test_ids     = [d.name for d in all_test_dirs]
n_total_test     = len(all_test_ids)
n_success_test   = 0
n_failed_test    = 0
total_bytes_test = 0

print(f"Test planets to process : {n_total_test}")

if n_total_test > 0:
    with tqdm(total=n_total_test, desc="Test planets", unit="planet") as pbar:
        for planet_dir in all_test_dirs:
            pid = planet_dir.name
            out_path = test_out_dir / f"{pid}.npz"

            if out_path.exists():
                total_bytes_test += out_path.stat().st_size  # keep size statistics correct
                n_success_test += 1
                pbar.update(1)
                pbar.set_postfix(skipped="(cached)")
                continue

            # Auxiliary features
            aux_row = test_adc_lookup.get(pid, np.zeros(len(test_adc_feature_cols), dtype=np.float32))
            if pid not in test_adc_lookup:
                warnings.warn(f"ADC info not found for test planet {pid} — using zeros.")

            try:
                airs_raw, fgs1_raw = load_planet_parquet(planet_dir, aux_row)
            except FileNotFoundError as exc:
                warnings.warn(f"Missing parquet for test planet {pid}: {exc} — skipping.")
                n_failed_test += 1
                pbar.update(1)
                continue
            except Exception as exc:
                warnings.warn(f"Error loading test planet {pid}: {exc} — skipping.")
                n_failed_test += 1
                pbar.update(1)
                continue

            try:
                result = preprocess_planet(
                    airs_raw, fgs1_raw,
                    ingress=INGRESS, egress=EGRESS, bin_size=BIN_SIZE,
                )
            except Exception as exc:
                warnings.warn(f"Preprocessing failed for test planet {pid}: {exc}.")
                n_failed_test += 1
                pbar.update(1)
                continue

            np.savez_compressed(
                str(out_path),
                airs_norm        = result["airs_norm"],
                fgs1_norm        = result["fgs1_norm"],
                aux              = aux_row,
                transit_depth    = result["transit_depth"],
                transit_depth_err= result["transit_depth_err"],
                mask_oot         = result["mask_oot"],
            )
            total_bytes_test += out_path.stat().st_size
            n_success_test += 1
            pbar.update(1)
            pbar.set_postfix(success=n_success_test, failed=n_failed_test)

avg_kb_test = (total_bytes_test / max(n_success_test, 1)) / 1024
print("\n" + "=" * 60)
print("Test preprocessing summary")
print("=" * 60)
print(f"  Total planets     : {n_total_test}")
print(f"  Successfully saved: {n_success_test}")
print(f"  Failed / skipped  : {n_failed_test}")
print(f"  Total size on disk: {total_bytes_test / 1_048_576:.1f} MB")
print(f"  Avg file size     : {avg_kb_test:.1f} KB")
print("[Done] Test preprocessing complete.")
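
Both loops skip a planet when its `.npz` already exists, so a file left half-written by an interrupted run would be treated as finished. One possible safeguard, sketched here with a hypothetical helper `save_npz_atomic` that is not part of the original notebook, is to write to a temporary file and rename it only after the archive is complete:

```python
import os
import tempfile
from pathlib import Path

import numpy as np


def save_npz_atomic(out_path, **arrays) -> None:
    """Write a compressed .npz so that out_path only appears once complete."""
    out_path = Path(out_path)
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            # Passing a file object stops NumPy from appending ".npz" to the name.
            np.savez_compressed(fh, **arrays)
        os.replace(tmp_name, out_path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the partial temporary file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


demo_dir = Path(tempfile.mkdtemp())
save_npz_atomic(demo_dir / "planet_0.npz", aux=np.zeros(5, dtype=np.float32))
print(sorted(p.name for p in demo_dir.iterdir()))
```

Leftover `.tmp` files do not match the `*.npz` glob used later, so they would not be counted during validation.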

## 4. Validate One Sample

In [None]:
import json

# Load back one .npz from the train split to sanity-check contents and shapes
train_npz_files = sorted((OUT_DIR / "train").glob("*.npz"))

if not train_npz_files:
    print("WARNING: No .npz files found in train output dir — cannot validate.")
else:
    sample_path = train_npz_files[0]
    sample = np.load(sample_path, allow_pickle=False)

    planet_id = sample_path.stem
    print(f"Sample planet ID  : {planet_id}")
    print(f"NPZ file path     : {sample_path}")
    print(f"File size         : {sample_path.stat().st_size / 1024:.1f} KB")
    print()
    print(f"{'Key':<22} {'Shape':<25} {'dtype':<10} {'min':>10}  {'max':>10}")
    print("-" * 80)
    for key in sorted(sample.files):
        arr = sample[key]
        vmin = float(arr.min()) if arr.size > 0 else float('nan')
        vmax = float(arr.max()) if arr.size > 0 else float('nan')
        print(f"  {key:<20} {str(arr.shape):<25} {str(arr.dtype):<10} {vmin:>10.4f}  {vmax:>10.4f}")

    # Validate expected shapes
    assert sample["aux"].shape == (len(adc_feature_cols),), \
        f"Expected aux shape ({len(adc_feature_cols)},), got {sample['aux'].shape}"
    print(f"\n  aux shape confirmed: ({len(adc_feature_cols)},) — "
          f"columns: {adc_feature_cols}")

    if "target_mean" in sample.files:
        print(f"  target_mean shape : {sample['target_mean'].shape} (labelled planet)")
        assert "target_std" not in sample.files, \
            "target_std should NOT be present — sigma is model-predicted."
    else:
        print("  (unlabelled planet — no target_mean)")

    # --- Plot transit depth spectrum ----------------------------------------
    transit_depth = sample["transit_depth"]
    transit_depth_err = sample["transit_depth_err"]
    n_channels = len(transit_depth)
    wl_idx = np.arange(n_channels)

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(wl_idx, transit_depth, lw=0.9, color="steelblue", label="Transit depth")
    ax.fill_between(
        wl_idx,
        transit_depth - transit_depth_err,
        transit_depth + transit_depth_err,
        alpha=0.25, color="steelblue", label="±1 sigma",
    )

    # Overlay target_mean if available
    if "target_mean" in sample.files and len(sample["target_mean"]) == 283:
        target_mean = sample["target_mean"]
        target_wl   = np.linspace(0, n_channels - 1, 283)
        ax.plot(target_wl, target_mean, lw=1.2, color="darkorange",
                linestyle="--", label="Target mean (283 channels)")

    ax.set_xlabel("AIRS-CH0 channel index")
    ax.set_ylabel("Transit depth (fractional)")
    ax.set_title(f"Planet {planet_id} — extracted transit depth spectrum ({n_channels} channels)")
    ax.legend()
    plt.tight_layout()
    fig.savefig(FIG_DIR / "hf_validation_spectrum.png", bbox_inches="tight")
    plt.show()

    # --- Save preprocessing summary as JSON --------------------------------
    n_train_npz = len(train_npz_files)
    n_test_npz  = len(list((OUT_DIR / "test").glob("*.npz")))
    preprocess_summary = {
        "train_planets_preprocessed": n_train_npz,
        "test_planets_preprocessed": n_test_npz,
        "sample_planet_id": planet_id,
        "sample_keys": sorted(sample.files),
        "airs_norm_shape": list(sample["airs_norm"].shape),
        "transit_depth_channels": n_channels,
        "ingress": INGRESS,
        "egress": EGRESS,
        "bin_size": BIN_SIZE,
    }
    summary_path = FIG_DIR / "preprocessing_summary.json"
    with open(summary_path, "w") as f:
        json.dump(preprocess_summary, f, indent=2)
    print(f"\nPreprocessing summary saved to {summary_path}")

    print(f"[Done] Validation complete for planet {planet_id}.")
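
The check above inspects only the first archive. A minimal sketch of a per-file check that could be run over every `.npz` before uploading is given below; the helper `check_npz` and the demo files are illustrative, and the expected key set is taken from the arrays saved in the preprocessing loops:

```python
import tempfile
from pathlib import Path

import numpy as np

EXPECTED_KEYS = {"airs_norm", "fgs1_norm", "aux",
                 "transit_depth", "transit_depth_err", "mask_oot"}


def check_npz(path):
    """Return a list of problems found in one archive (empty list = OK)."""
    problems = []
    with np.load(path, allow_pickle=False) as data:
        missing = EXPECTED_KEYS - set(data.files)
        if missing:
            problems.append(f"missing keys: {sorted(missing)}")
        for key in data.files:
            arr = data[key]
            # Only floating-point arrays can hold NaN or inf.
            if np.issubdtype(arr.dtype, np.floating) and not np.isfinite(arr).all():
                problems.append(f"{key}: contains NaN or inf")
    return problems


# Demo on two synthetic archives: one complete, one deliberately broken.
demo_dir = Path(tempfile.mkdtemp())
good = {k: np.ones(3, dtype=np.float32) for k in EXPECTED_KEYS}
np.savez_compressed(demo_dir / "good.npz", **good)
bad = dict(good, transit_depth=np.array([np.nan, 1.0], dtype=np.float32))
del bad["aux"]
np.savez_compressed(demo_dir / "bad.npz", **bad)

for path in sorted(demo_dir.glob("*.npz")):
    print(path.name, check_npz(path) or "OK")
```

In the notebook, the same loop could iterate over `(OUT_DIR / "train").glob("*.npz")` and the test directory.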

## 5. Push to HuggingFace Hub

In [None]:
from huggingface_hub import HfApi, login

HF_TOKEN = os.environ.get("HF_TOKEN", "")
if not HF_TOKEN:
    raise ValueError(
        "HF_TOKEN is empty. Set os.environ['HF_TOKEN'] in the Setup cell "
        "before running this section."
    )

login(token=HF_TOKEN, add_to_git_credential=False)
api = HfApi()

REPO_ID = "Smooth-Cactus0/ariel-exoplanet-2024"

print(f"[Done] Logged in to HuggingFace Hub. Target repo: {REPO_ID}")

In [None]:
# Create the dataset repository (no-op if it already exists)
api.create_repo(
    repo_id  = REPO_ID,
    repo_type= "dataset",
    exist_ok = True,
    private  = False,
)
print(f"Repository ready: https://huggingface.co/datasets/{REPO_ID}")
print("[Done] Repository created (or already exists).")

In [None]:
# Upload the entire preprocessed directory tree to data/preprocessed/ in the repo
print(f"Uploading {OUT_DIR} → {REPO_ID}:data/preprocessed/ ...")
print("(This may take several minutes depending on dataset size.)")

api.upload_folder(
    folder_path  = str(OUT_DIR),
    repo_id      = REPO_ID,
    repo_type    = "dataset",
    path_in_repo = "data/preprocessed",
    commit_message = "Upload preprocessed .npz files (train + test)",
)

print("[Done] Preprocessed data uploaded.")
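
`upload_folder` sends everything under the given folder, so it can help to preview what will be uploaded first. A small sketch, using a hypothetical helper `summarize_upload` and a throwaway demo directory, that counts matching files and bytes per split:

```python
import tempfile
from pathlib import Path


def summarize_upload(folder, pattern="*.npz"):
    """Count files and bytes per top-level subdirectory that match the pattern."""
    folder = Path(folder)
    summary = {}
    for path in sorted(folder.rglob(pattern)):
        split = path.relative_to(folder).parts[0]
        count, size = summary.get(split, (0, 0))
        summary[split] = (count + 1, size + path.stat().st_size)
    return summary


# Demo directory mimicking the train/test layout, plus one stray file.
demo_root = Path(tempfile.mkdtemp())
for rel_path, payload in [("train/a.npz", b"1234"), ("train/b.npz", b"12"),
                          ("test/c.npz", b"1"), ("train/leftover.tmp", b"xx")]:
    target = demo_root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)

for split, (count, size) in sorted(summarize_upload(demo_root).items()):
    print(f"{split}: {count} files, {size} bytes")
```

`HfApi.upload_folder` also accepts an `allow_patterns` argument, which could be set to `"*.npz"` to keep stray files out of the repository.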

In [None]:
# Upload the HuggingFace Datasets loading script
LOADING_SCRIPT = os.path.join(PROJECT_DIR, "hf_dataset", "ariel_dataset.py")

if not os.path.exists(LOADING_SCRIPT):
    print(f"WARNING: Loading script not found at {LOADING_SCRIPT}.")
    print("Ensure the repository is cloned at REPO_DIR.")
else:
    api.upload_file(
        path_or_fileobj= LOADING_SCRIPT,
        path_in_repo   = "ariel_dataset.py",
        repo_id        = REPO_ID,
        repo_type      = "dataset",
        commit_message = "Add HuggingFace Datasets loading script",
    )
    print(f"Uploaded loading script: {LOADING_SCRIPT}")

print("[Done] Upload complete!")
print(f"Dataset URL: https://huggingface.co/datasets/{REPO_ID}")

## 6. Verify Load from Hub

In [None]:
from datasets import load_dataset

print(f"Loading dataset from Hub: {REPO_ID}")
print("(First load will download and cache the data — may take a while.)")

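# Note: this relies on the `ariel_dataset.py` loading script. Depending on the
# installed `datasets` version, script-based datasets may require passing
# trust_remote_code=True, and recent releases may not run loading scripts at all.
ds = load_dataset(REPO_ID, split="train")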

print("\nDataset info:")
print(ds)

print("\nFirst example keys:")
sample_hub = ds[0]
print(list(sample_hub.keys()))

print("\nFirst example shapes / lengths:")
for key, val in sample_hub.items():
    if hasattr(val, '__len__'):
        if hasattr(val, 'shape'):
            print(f"  {key:<22}: shape={val.shape}")
        else:
            print(f"  {key:<22}: len={len(val)}")
    else:
        print(f"  {key:<22}: {val}")

print("[Done] Dataset verified from Hub.")

## 7. Push Preprocessing Summary to GitHub

Push the validation spectrum plot and preprocessing summary JSON to the repo.

In [None]:
import shutil
import subprocess
from pathlib import Path

# ── Repo paths ────────────────────────────────────────────────────────────
repo_dir    = Path(REPO_DIR)
project_dir = repo_dir / "Kaggle competition" / "ARIEL neurIPS"

# ── Ensure repo is up-to-date ─────────────────────────────────────────────
if not repo_dir.exists():
    subprocess.run(
        ["git", "clone", "https://github.com/Smooth-Cactus0/ariel-exoplanet-ml.git",
         str(repo_dir)],
        check=True,
    )
else:
    subprocess.run(["git", "-C", str(repo_dir), "pull", "--ff-only"], check=False)

# ── Configure git identity (required on Kaggle kernels) ───────────────────
subprocess.run(["git", "-C", str(repo_dir), "config", "user.email", "alexy.louis@kaggle-notebook.local"], check=True)
subprocess.run(["git", "-C", str(repo_dir), "config", "user.name", "Alexy Louis (Kaggle)"], check=True)

# ── Copy artifacts to repo ─────────────────────────────────────────────────
repo_results_dir = project_dir / "results"
repo_results_dir.mkdir(parents=True, exist_ok=True)
repo_fig_dir = project_dir / "figures"
repo_fig_dir.mkdir(parents=True, exist_ok=True)

# Copy preprocessing_summary.json → results/
summary_json = FIG_DIR / "preprocessing_summary.json"
if summary_json.exists():
    shutil.copy2(summary_json, repo_results_dir / "preprocessing_summary.json")
    print("  preprocessing_summary.json -> results/preprocessing_summary.json")

# Copy validation figure → figures/
val_fig = FIG_DIR / "hf_validation_spectrum.png"
if val_fig.exists():
    shutil.copy2(val_fig, repo_fig_dir / "hf_validation_spectrum.png")
    print("  hf_validation_spectrum.png -> figures/hf_validation_spectrum.png")

# ── Git add, commit, push ─────────────────────────────────────────────────
subprocess.run(
    ["git", "-C", str(repo_dir), "add",
     "Kaggle competition/ARIEL neurIPS/results/",
     "Kaggle competition/ARIEL neurIPS/figures/"],
    check=True,
)

status = subprocess.run(
    ["git", "-C", str(repo_dir), "diff", "--cached", "--quiet"],
    capture_output=True,
)
if status.returncode != 0:
    subprocess.run(
        ["git", "-C", str(repo_dir), "commit", "-m",
         "data: update HF preprocessing summary from Kaggle notebook run"],
        check=True,
    )
    subprocess.run(
        ["git", "-C", str(repo_dir), "push", "origin", "master"],
        check=True,
    )
    print("\n[Done] Preprocessing summary pushed to GitHub.")
else:
    print("\n[Done] No changes to push (summary already up-to-date).")

## 8. Summary

### What was uploaded

- **Train split**: one `.npz` file per planet under `data/preprocessed/train/`  
  Each file contains: `airs_norm`, `fgs1_norm`, `aux`, `transit_depth`, `transit_depth_err`, `mask_oot`.  
  Labelled planets additionally contain: `target_mean` (283 wavelength bins).  
  Note: `target_std` is **not** included — sigma is model-predicted.

- **Test split**: one `.npz` file per planet under `data/preprocessed/test/`  
  Same structure, without label arrays.

- **Loading script**: `ariel_dataset.py` — a `datasets.GeneratorBasedBuilder` enabling  
  one-line loading via `load_dataset()`.

### Data format

Raw data is stored as **parquet files** (not HDF5) in a nested directory layout:
```
{split}/{planet_id}/
    AIRS-CH0_signal.parquet      (n_time, 32*356) uint16
    FGS1_signal.parquet          (n_time_fgs1, 32*32) uint16
    AIRS-CH0_calibration/{dark,flat,dead,read,linear_corr}.parquet
    FGS1_calibration/{dark,flat,dead,read,linear_corr}.parquet
```

Auxiliary features come from `train_adc_info.csv` (`planet_id` plus 5 feature columns):
`planet_id | FGS1_adc_offset | FGS1_adc_gain | AIRS-CH0_adc_offset | AIRS-CH0_adc_gain | star`

Labels come from `train_labels.csv` (283 wavelength bins):
`planet_id | wl_1 | ... | wl_283`

### How to use the dataset

```python
from datasets import load_dataset

# Load full train split
ds_train = load_dataset("Smooth-Cactus0/ariel-exoplanet-2024", split="train")

# Load full test split
ds_test = load_dataset("Smooth-Cactus0/ariel-exoplanet-2024", split="test")

# Access a single planet
planet = ds_train[0]
print(planet.keys())
# dict_keys: planet_id, airs_norm, fgs1_norm, aux,
#            transit_depth, transit_depth_err,
#            target_mean (labelled only)

import numpy as np
airs = np.array(planet["airs_norm"])   # (time_binned, 356)
fgs1 = np.array(planet["fgs1_norm"])   # (time_binned,)
aux  = np.array(planet["aux"])         # (5,)
td   = np.array(planet["transit_depth"])  # (356,)
```
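
Since the repository stores plain `.npz` archives under `data/preprocessed/{split}/`, a single planet can presumably also be fetched without the loading script. A minimal sketch using `hf_hub_download`; the helper names and the planet ID `"12345"` are placeholders, and the download function needs network access and an existing file at that path:

```python
from pathlib import PurePosixPath

import numpy as np

REPO_ID = "Smooth-Cactus0/ariel-exoplanet-2024"


def npz_repo_path(split: str, planet_id: str) -> str:
    """Path of one planet's archive inside the dataset repo."""
    return str(PurePosixPath("data/preprocessed") / split / f"{planet_id}.npz")


def load_planet_npz(split: str, planet_id: str) -> dict:
    """Download one archive from the Hub and return its arrays as a dict."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=npz_repo_path(split, planet_id),
        repo_type="dataset",
    )
    with np.load(local_path, allow_pickle=False) as data:
        return {key: data[key] for key in data.files}


# "12345" is a placeholder ID used only to show the path layout.
print(npz_repo_path("train", "12345"))
```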

### Links

- HuggingFace Dataset: https://huggingface.co/datasets/Smooth-Cactus0/ariel-exoplanet-2024
- Kaggle Competition: https://www.kaggle.com/competitions/ariel-data-challenge-2024
- GitHub Repository: https://github.com/Smooth-Cactus0/ariel-exoplanet-ml