# No-Set-Up Dataset Downloading

CMB-ML has been set up as a large, interlinked system. Perhaps you don't want that. In this notebook, I assume you simply want to download the dataset (or a portion of it). Afterwards, you can iterate through the dataset as you see fit. 

If you build on this work, I ask that you release your source code. If your results are to be included alongside ours, reproducibility is required.

This notebook requires the CMB-ML framework, installed by following the instructions in the README. (TODO: Revisit this. Would it be better to include the required utilities here? How much is required?)

The next cell is the only one that needs editing.

- `dataset_dir_str`: The target directory for the dataset
- `temp_dir_str`: A temp directory
    - The directory in which the downloaded archive will be stored, temporarily
    - The archive will be deleted automatically after extraction
- `dataset_name`: The dataset to download
    - Options are 'CMB-ML_128_1450' or 'CMB-ML_512_1450'
    - This determines which upload records file is used
    - It also affects the target directory (change that below, where noted)
- `split`: The portion of the dataset to download.
    - Options are "Test", "Valid", and "Train"
- `n_sims`: The number of simulations to download. 
    - If $3$, then `sim0000`, `sim0001`, and `sim0002` will be downloaded.
    - This behavior should be easy to change, below.
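
Before running the rest of the notebook, it can help to check the settings against the options listed above. The following is a minimal sketch: `check_settings` is a hypothetical helper, not part of the CMB-ML framework, and the allowed values are only those documented in this list.

```python
# Hypothetical helper (not part of CMB-ML): checks the notebook settings
# against the options documented in the list above.
VALID_DATASETS = ("CMB-ML_128_1450", "CMB-ML_512_1450")
VALID_SPLITS = ("Test", "Valid", "Train")


def check_settings(dataset_name, split, n_sims):
    """Raise ValueError if a setting falls outside the documented options."""
    if dataset_name not in VALID_DATASETS:
        raise ValueError(
            f"dataset_name must be one of {VALID_DATASETS}, got {dataset_name!r}"
        )
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    if not isinstance(n_sims, int) or n_sims < 1:
        raise ValueError(f"n_sims must be a positive integer, got {n_sims!r}")


check_settings("CMB-ML_512_1450", "Test", 3)  # passes silently
```

Note that the split names are case-sensitive here, so `"test"` would be rejected.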

In [1]:
dataset_dir_str = '/data/jim/CMB_Data/Datasets'
temp_dir_str = '/data/jim/CMB_Data/Temp'
dataset_name = 'CMB-ML_512_1450'

split = "Test"
n_sims = 3

In [2]:
import sys
import os

repo_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.insert(0, repo_root)

In [3]:
import yaml
from tqdm import tqdm

from pathlib import Path
from cmbml.get_data.utils.get_from_shared_link import download_shared_link_info, download_file

Confirm that the directories printed below are where you want the output to go. Optionally, edit the first line of the next cell so that the dataset directory does not include the dataset name.

In [4]:
dataset_directory = Path(dataset_dir_str) / dataset_name
temp_directory = Path(temp_dir_str)

print(dataset_directory.absolute())
print(temp_directory.absolute())

/data/jim/CMB_Data/Datasets/CMB-ML_512_1450
/data/jim/CMB_Data/Temp


In [5]:
# Create directories
dataset_directory.mkdir(parents=True, exist_ok=True)
temp_directory.mkdir(parents=True, exist_ok=True)

# Shared links from the file included in the repo
json_path = Path("./assets/CMB-ML") / f"upload_records_{dataset_name}.json"

if not json_path.exists():
    print(f"Downloading shared links for {dataset_name} from the CMB-ML repo")
    json_path.parent.mkdir(parents=True, exist_ok=True)
    
    download_file(f"https://raw.githubusercontent.com/CMB-ML/cmb-ml/refs/heads/main/assets/CMB-ML/upload_records_{dataset_name}.json",
                  destination=json_path,
                  filesize=340000,
                  tqdm_position=0)

Downloading shared links for CMB-ML_512_1450 from the CMB-ML repo

In [6]:
# Load the shared links (the file is JSON; yaml.safe_load reads it because YAML is largely a superset of JSON)
with open(json_path, 'r') as f:
    all_shared_links = yaml.safe_load(f)
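
If `n_sims` exceeds the number of simulations recorded for a split, the download loop would stop with a `KeyError` partway through. A minimal sketch of a pre-check follows. `missing_keys` is a hypothetical helper, and it assumes the records map keys of the form `{split}_sim{i:04d}` to links, as the download loop in this notebook does.

```python
def missing_keys(all_shared_links, split, n_sims):
    """Return the requested record keys that are absent from the loaded records."""
    wanted = [f"{split}_sim{i:04d}" for i in range(n_sims)]
    return [key for key in wanted if key not in all_shared_links]


# Toy records standing in for the loaded file (the link values are placeholders)
example_links = {"Test_sim0000": "link-a", "Test_sim0001": "link-b"}
print(missing_keys(example_links, "Test", 3))  # ['Test_sim0002']
```

In the notebook itself, the call would be `missing_keys(all_shared_links, split, n_sims)`; an empty list means every requested simulation has a record.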

In [7]:
# Iterate over the number of simulations and download the data
# tqdm is simply used to show a progress bar
for i in tqdm(range(n_sims)):
    key = f"{split}_sim{i:04d}"
    shared_link = all_shared_links[key]
    download_shared_link_info(shared_link, temp_directory, dataset_directory)

100%|██████████| 3/3 [00:36<00:00, 12.02s/it]
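
The loop above always takes the first `n_sims` simulations. To download an arbitrary selection instead, one option is to build the record keys from an explicit list of indices. This is only a sketch: `keys_for` and `sim_indices` are hypothetical names, and the key format is the one used in the loop above.

```python
def keys_for(split, sim_indices):
    """Build record keys such as 'Test_sim0005' from simulation indices."""
    return [f"{split}_sim{i:04d}" for i in sim_indices]


sim_indices = [0, 5, 12]  # hypothetical selection
print(keys_for("Test", sim_indices))
# ['Test_sim0000', 'Test_sim0005', 'Test_sim0012']

# In the notebook, the download loop would then become:
# for key in tqdm(keys_for(split, sim_indices)):
#     download_shared_link_info(all_shared_links[key], temp_directory, dataset_directory)
```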


At this point, you will have acquired files in the following structure (assuming the above settings):

```
└─ /path/to/Datasets
      └─ CMB-ML_512_1450
           ├─ Simulation
           |    └─ Test
           |         ├─ sim0000
           |         |    ├─ cmb_map_fid.fits
           |         |    ├─ obs_30_map.fits
           |         |    ├─ obs_44_map.fits
           |         |    ├─ obs_70_map.fits
           |         |    ├─ obs_100_map.fits
           |         |    ├─ obs_143_map.fits
           |         |    ├─ obs_217_map.fits
           |         |    ├─ obs_353_map.fits
           |         |    ├─ obs_545_map.fits
           |         |    └─ obs_857_map.fits
           |         ├─ sim0001
           |         |   ...
           |         └─ sim0002
           |             ...
           └─ Simulation_Working
                ├─ Simulation_C_Configs
                |    └─ Test
                |         ├─ sim0000
                |         |    └─ wmap_params.yaml
                |         ├─ sim0001
                |         |    └─ wmap_params.yaml
                |         └─ sim0002
                |              └─ wmap_params.yaml
                └─ Simulation_CMB_Power_Spectra
                     └─ Test
                          ├─ sim0000
                          |    └─ cmb_ps_fid.txt
                          ├─ sim0001
                          |    └─ cmb_ps_fid.txt
                          └─ sim0002
                               └─ cmb_ps_fid.txt
```
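
With the files in place, iterating over the dataset needs only the standard library. The sketch below walks the `Simulation/<split>` directory and lists the map files of each simulation. `iter_simulations` is a hypothetical helper, not part of the CMB-ML framework, and it assumes the layout matches the tree above. Reading the FITS maps themselves requires an additional package (for example `healpy` or `astropy`), which is not shown.

```python
from pathlib import Path


def iter_simulations(dataset_directory, split):
    """Yield (sim_name, sorted list of .fits paths) for each simulation in a split."""
    split_dir = Path(dataset_directory) / "Simulation" / split
    for sim_dir in sorted(split_dir.glob("sim*")):
        if sim_dir.is_dir():
            yield sim_dir.name, sorted(sim_dir.glob("*.fits"))


# Example usage with the settings from this notebook:
# for sim_name, map_paths in iter_simulations(dataset_directory, "Test"):
#     print(sim_name, [p.name for p in map_paths])
```

If the split directory does not exist, the generator simply yields nothing, so an empty result is a hint that the download target or the split name is not what was expected.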

Please reach out to us if you have issues.