
# GZ10 dataset — quick downloader & inspector

This notebook helps you:
- Install dependencies
- Discover available configs & splits
- Download the dataset (optionally with streaming first)
- Inspect schema & a few rows
- Save each split to disk (`datasets` Arrow format) and optionally Parquet/CSV

> Dataset: `MultimodalUniverse/gz10`
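
The "discover available configs & splits" step can be sketched as below. This is a minimal illustration, not part of the original notebook: `discover_splits` and `select_splits` are hypothetical helper names, and only `get_dataset_config_names` and `get_dataset_split_names` come from the `datasets` library. The Hub query needs network access, so it is wrapped in a function and not called here.

```python
def discover_splits(ds_name):
    """Map each config name to its split names, as reported by the Hub.

    Needs network access; `datasets` is imported lazily so that
    `select_splits` below stays usable offline.
    """
    from datasets import get_dataset_config_names, get_dataset_split_names

    return {
        cfg: get_dataset_split_names(ds_name, cfg)
        for cfg in get_dataset_config_names(ds_name)
    }


def select_splits(wanted, available):
    """Keep only the requested splits that actually exist, preserving order."""
    return [sp for sp in wanted if sp in available]


# Example using the split list this dataset reports (only 'train'):
print(select_splits(["train", "validation", "test"], ["train"]))  # ['train']
```

With network access, `discover_splits("MultimodalUniverse/gz10")` should return the configs and their splits; passing one of those split lists to `select_splits` avoids requesting splits that do not exist.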


In [2]:

# If running locally, uncomment the next line.
# %pip install -U datasets huggingface_hub pandas pyarrow


In [3]:
from datasets import load_dataset

ds_name = "MultimodalUniverse/gz10"

# 1) Short streaming preview (optional)
try:
    stream = load_dataset(ds_name, split="train", streaming=True)
    for i, row in enumerate(stream):
        print(row)
        if i == 2:
            break
except Exception as e:
    print("Streaming unavailable:", e)

# 2) Full download of current splits
splits = ["train", "validation", "test"]
loaded = {}
for sp in splits:
    try:
        loaded[sp] = load_dataset(ds_name, split=sp)  # cached automatically
        print(loaded[sp])
    except Exception as e:
        print(f"Split '{sp}' unavailable:", e)

# 3) Local backup (Arrow Datasets format)
from pathlib import Path
base = Path("data_gz10")
base.mkdir(exist_ok=True, parents=True)
for sp, d in loaded.items():
    d.save_to_disk(str(base / f"{sp}_arrow"))
    try:
        d.to_parquet(str(base / f"{sp}.parquet"))
        d.to_csv(str(base / f"{sp}.csv"))
    except Exception as e:
        print(f"Parquet/CSV export failed for split '{sp}':", e)


{'gz10_label': 0, 'redshift': 0.07258749008178711, 'object_id': '904', 'rgb_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=256x256 at 0x7F5CA8A04D40>, 'rgb_pixel_scale': 0.2619999945163727}
{'gz10_label': 1, 'redshift': 0.13457761704921722, 'object_id': '1558', 'rgb_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=256x256 at 0x7F5CA8A906B0>, 'rgb_pixel_scale': 0.2619999945163727}
{'gz10_label': 1, 'redshift': 0.12955686450004578, 'object_id': '1768', 'rgb_image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=256x256 at 0x7F5CA8A90710>, 'rgb_pixel_scale': 0.2619999945163727}
Dataset({
    features: ['gz10_label', 'redshift', 'object_id', 'rgb_image', 'rgb_pixel_scale'],
    num_rows: 17736
})
Split 'validation' unavailable: Unknown split "validation". Should be one of ['train'].
Split 'test' unavailable: Unknown split "test". Should be one of ['train'].


Saving the dataset (6/6 shards): 100%|██████████| 17736/17736 [00:11<00:00, 1606.19 examples/s]
Creating parquet from Arrow format: 100%|██████████| 26/26 [00:06<00:00,  3.76ba/s]
Creating CSV from Arrow format: 100%|██████████| 18/18 [02:36<00:00,  8.70s/ba]
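
To inspect the schema and a few rows without flooding the output with image reprs, something like the following could be used. It is a sketch under assumptions: `summarize_row` and `inspect_split` are hypothetical helpers, not part of the original notebook, and `inspect_split` expects a `datasets.Dataset` such as `loaded["train"]`.

```python
def summarize_row(row, max_len=40):
    """Return a printable copy of a row.

    Scalars are kept (strings truncated to `max_len` characters);
    any other object, such as a PIL image, is shown by its type name.
    """
    out = {}
    for key, value in row.items():
        if isinstance(value, str):
            out[key] = value[:max_len]
        elif value is None or isinstance(value, (bool, int, float)):
            out[key] = value
        else:
            out[key] = f"<{type(value).__name__}>"
    return out


def inspect_split(dataset, n=3):
    """Print the schema, the row count, and the first `n` rows of a split."""
    print(dataset.features)   # column names and feature types
    print(dataset.num_rows)   # number of examples in the split
    for i in range(min(n, dataset.num_rows)):
        print(summarize_row(dataset[i]))


# Offline example with a plain dict standing in for a dataset row:
print(summarize_row({"gz10_label": 0, "object_id": "904", "rgb_image": object()}))
```

For example, `inspect_split(loaded["train"])` prints the schema of the downloaded split. A saved copy can be reopened first with `datasets.load_from_disk("data_gz10/train_arrow")`, which is the path written by the backup step above.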
