# Setting Up Your Local System

## Introduction

CMB-ML manages a complex pipeline that processes data across multiple stages. Each stage produces outputs that need to be tracked, reused, and processed in later stages. Without a clear framework, this can lead to disorganized code, redundant logic, and errors.

The CMB-ML library provides a set of tools to manage the pipeline in a modular and scalable way. 

This notebook guides you through the essential setup required to run CMB-ML locally. File locations need to be set, and critical external assets are needed. Without completing these steps, the pipeline -- and subsequent demos -- will encounter missing dependencies or misaligned file paths.

The previous demonstration, [describing Hydra](./A_hydra_tutorial.ipynb), explained the configurations generally. This notebook is a practical checklist, getting everything in place. It can be skipped, but only if you aren't running code yet.

## Contents

View this notebook with [nbviewer](https://nbviewer.org/github/CMB-ML/cmb-ml/tree/main/demonstrations/C_setting_up_local.ipynb#Introduction) to enable these links.

- [Setting up the configuration](#Setting-Configurations)
- [Setting up PyILC](#Setting-up-PyILC)
- [Download external science assets](#Getting-science-assets)
- [Next steps](#Next-steps)

# Setting Configurations

## Your local system

First you'll create a configuration file to say where files should be stored.

I suggest using [this example](../cfg/local_system/generic_lab.yaml) as reference. Open that file and take a look. It has two keys.

- `datasets_root`: This is where datasets are stored. At first, a dataset's folder will contain only the simulation and the logs generated during simulation. As the pipeline runs, additional subfolders will be created.

- `assets_dir`: This is where external science assets are stored (e.g., maps used for noise, instrument parameters, cosmological parameter distributions). The CMB-ML code will only read from this location, so it can be shared across users.

Set these according to the needs of your local system, in a new YAML file in the `cfg/local_system` folder.

If more granularity of file storage is needed (e.g., storing data according to faster or slower drives), this can be further customized in the pipeline yamls. Contact us and we can help to configure it.
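
As a rough sketch of the file described above, the helper below writes a two-key YAML into the `local_system` folder. The function name `write_local_system_yaml`, the file name `my_workstation`, and the example paths are illustrative placeholders, not part of CMB-ML. It assumes the file holds only the two keys, so compare the result against `generic_lab.yaml` before relying on it.

```python
from pathlib import Path


def write_local_system_yaml(cfg_dir, name, datasets_root, assets_dir):
    """Write <cfg_dir>/local_system/<name>.yaml with the two keys described above.

    Hypothetical helper for illustration; paths containing special YAML
    characters would need quoting, which this sketch does not handle.
    """
    target = Path(cfg_dir) / "local_system" / f"{name}.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        f"datasets_root: {datasets_root}\n"
        f"assets_dir: {assets_dir}\n"
    )
    return target


# Example call (placeholder paths; adjust to your system):
# write_local_system_yaml("../cfg", "my_workstation",
#                         "/data/me/Datasets/", "/data/shared/Assets/")
```

Writing the file by hand in a text editor works just as well; the point is only that the file needs both keys.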

## Specifying your local system

You also need to let your system know where the YAML configuration file is located. This information is specified in the **top-level configurations**, such as [config_setup.yaml](../cfg/config_setup.yaml), which look like:

```yaml
defaults:
  - local_system: ${oc.env:CMB_ML_LOCAL_SYSTEM}
  - file_system : common_fs
  - override hydra/job_logging: custom_log
  - _self_
```

In this setup, "`${oc.env:CMB_ML_LOCAL_SYSTEM}`" tells Hydra to get the variable from your system's environment variables.

There are two options to specify the path. 

### Option 1: Change this value directly in top-level configurations

You can change `local_system: ${oc.env:CMB_ML_LOCAL_SYSTEM}` to your particular location, such as

```yaml
local_system: my_workstation
```

However, **if you do this, you will need to do this wherever there is a reference to local_system**.

### Option 2: Set the environment variable

A more flexible option is to set the environment variable `CMB_ML_LOCAL_SYSTEM`, which Hydra will read automatically. How to do this depends on your setup:
- **In the terminal:** Run 
```bash
export CMB_ML_LOCAL_SYSTEM=generic_lab.yaml
```
before calling a script. To make this automatic, add the command to your shell startup script (e.g., `.bashrc` or `.zshrc`).
- **In Jupyter notebooks**: Use the `os` library (as this notebook does below):
```python
import os
os.environ["CMB_ML_LOCAL_SYSTEM"] = "generic_lab.yaml"
```
- **In VS Code (for debugging)**: Add the environment variable to your `launch.json`:
```json
"env": {
  "CMB_ML_LOCAL_SYSTEM": "generic_lab.yaml"
}
```
Choose the option(s) that best fit your workflow. The environment variable approach is recommended for flexibility and maintainability.
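
If you use the environment variable, a quick check along these lines can confirm that it points at a real file before Hydra is involved. This is a minimal sketch: `find_local_system_yaml` is a hypothetical helper, not a CMB-ML function, and it assumes the YAML files live in `cfg/local_system`.

```python
import os
from pathlib import Path


def find_local_system_yaml(cfg_dir, env=None):
    """Return the YAML file that CMB_ML_LOCAL_SYSTEM points to, or None.

    Hypothetical helper: `cfg_dir` is the repository's cfg folder and
    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    name = env.get("CMB_ML_LOCAL_SYSTEM")
    if not name:
        return None
    # Accept the value with or without a ".yaml" suffix
    stem = name[:-5] if name.endswith(".yaml") else name
    candidate = Path(cfg_dir) / "local_system" / f"{stem}.yaml"
    return candidate if candidate.is_file() else None


# Example call from the demonstrations folder:
# print(find_local_system_yaml("../cfg"))
```

A result of `None` means either the variable is unset or no matching file exists in `cfg/local_system`.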

## Checking the configuration

Set this up now for both your local system configuration and [config_setup.yaml](../cfg/config_setup.yaml). Let's see how it looks. If you aren't using an environment variable, you can remove that line.

```python
import os
from pathlib import Path
import hydra
from hydra import compose, initialize
from omegaconf import OmegaConf

# Set the environment variable, only effective for this notebook.
# Remove the next line if preferred
os.environ['CMB_ML_LOCAL_SYSTEM'] = 'generic_lab'
```

```python
# Clear any previous hydra instance to prevent conflicts
hydra.core.global_hydra.GlobalHydra.instance().clear()

# Initialize hydra with the configuration directory
with initialize(version_base=None, config_path="../cfg"):
    cfg = compose(config_name='config_setup.yaml',
                  overrides=["~file_system"])

print(OmegaConf.to_yaml(cfg))

print()
ds_root = Path(cfg.local_system.datasets_root)
assets_root = Path(cfg.local_system.assets_dir)
print(f"{ds_root} exists: {ds_root.exists()}")
print(f"{assets_root} exists: {assets_root.exists()}")
```

```
local_system:
  datasets_root: /data/jim/CMB_Data/Datasets2/
  assets_dir: /data/jim/CMB_Data/Assets2/


/data/jim/CMB_Data/Datasets2 exists: True
/data/jim/CMB_Data/Assets2 exists: True
```


If everything is set up correctly, the printed paths should match what you expect. If not, double-check your YAML and environment variables.

If the directories do not exist, they should be created.

```python
# Uncomment the next lines to create those directories
# ds_root.mkdir(parents=True, exist_ok=True)
# assets_root.mkdir(parents=True, exist_ok=True)
```

# Moving CMB-ML Assets

Some files are included in this repository. One (`cmb-ml_deltabandpass.tbl`) describes the simplified instrumentation we model in our simulations. The README gives a description of it. The others, beginning with `upload_records_`, have the information needed to download the available datasets.

They need to be moved to the directory specified in your `local_system` for assets. 

1. Navigate to the **assets** folder in the base directory of this repository. 
2. Copy the **CMB-ML** folder to the **assets directory** specified in `local_system`.

After moving the files, the structure of that folder should be:

```
└─ Assets
   └─ CMB-ML
      ├─ cmb-ml_deltabandpass.tbl
      ├─ README.txt
      ├─ upload_records_I_128_1450.json
      └─ upload_records_I_512_1450.json
```

Later in this notebook, folders for **Planck** and **WMAP** will be created and added.
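
The copy step above can also be done from Python. The sketch below is one possible way to do it, assuming the source folder is `assets/CMB-ML` at the repository root as described in step 1. `copy_cmbml_assets` and `EXPECTED_FILES` are hypothetical names for illustration, and the file list is taken from the tree shown above.

```python
import shutil
from pathlib import Path

# File names taken from the expected folder structure shown above
EXPECTED_FILES = [
    "cmb-ml_deltabandpass.tbl",
    "README.txt",
    "upload_records_I_128_1450.json",
    "upload_records_I_512_1450.json",
]


def copy_cmbml_assets(repo_assets_dir, assets_dir):
    """Copy <repo_assets_dir>/CMB-ML into <assets_dir>/CMB-ML.

    Returns the destination folder and a list of expected files that
    are still missing after the copy (empty if all are present).
    """
    src = Path(repo_assets_dir) / "CMB-ML"
    dst = Path(assets_dir) / "CMB-ML"
    # dirs_exist_ok (Python 3.8+) lets the copy be re-run safely
    shutil.copytree(src, dst, dirs_exist_ok=True)
    missing = [name for name in EXPECTED_FILES if not (dst / name).is_file()]
    return dst, missing


# Example call, reusing assets_root from the configuration check above:
# dst, missing = copy_cmbml_assets("../assets", assets_root)
# print(dst, "missing:", missing)
```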

# Setting up PyILC

PyILC is used as a baseline method for cleaning observation maps. Please check out their work at [PyILC on GitHub](https://github.com/jcolinhill/pyilc). As of December 2024, PyILC isn't structured as an installable library. I've settled on a workaround: a CMB-ML module imports the necessary elements, so PyILC can be used without modification or unnecessary duplication of effort.

This may not be ideal and I'm open to feedback.

## Step 1: Clone PyILC
Skip ahead if you're comfortable with this.

Navigate to where you want to put the PyILC code. Assuming your file structure is something like
```
└─ home
   └─ code
      ├─ cmb-ml
      │  └─ all this stuff
      └─ other-repo
```
I suggest:
```bash
cd /home/code
git clone https://github.com/jcolinhill/pyilc.git 
```
which will place it alongside `other-repo`. I do not suggest cloning it inside `home/code/cmb-ml`, since nesting one Git repository inside another causes confusion with `.git`.

## Step 2: Edit the redirection module

Within the CMB-ML repository, open [`cmbml/pyilc_redir/__init__.py`](../cmbml/pyilc_redir/__init__.py). Edit the path to point to the directory in your PyILC clone that contains `input.py` and `wavelets.py`.

After the edits (assuming the example above), it should look like:
```python
import sys
sys.path.append('/home/code/pyilc/pyilc')

from input import ILCInfo
from wavelets import Wavelets, wavelet_ILC, harmonic_ILC
```

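Note the double "pyilc" at the end: the cloned repository folder contains a subfolder of the same name, and that subfolder holds the modules.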
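
Before editing the redirection module, it may help to confirm that the path really contains the two modules it imports. The following is a small sketch under that assumption; `missing_pyilc_modules` is a hypothetical helper, not part of CMB-ML or PyILC.

```python
from pathlib import Path

# The two modules the redirection module imports from
REQUIRED_MODULES = ("input.py", "wavelets.py")


def missing_pyilc_modules(pyilc_dir):
    """Return the required PyILC module files not found in pyilc_dir."""
    base = Path(pyilc_dir)
    return [name for name in REQUIRED_MODULES if not (base / name).is_file()]


# Example call, using the path from the example above:
# print(missing_pyilc_modules("/home/code/pyilc/pyilc"))
```

An empty list means both files were found. If both are reported missing, the most likely cause is pointing at the outer `pyilc` folder instead of the inner one.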

## Why?

This isn't ideal, and I don't recommend this practice in general as it may cause security vulnerabilities or path conflicts. However, for now, it's the most practical way to integrate PyILC into CMB-ML. If you have suggestions for a better approach, I'd love to hear them.

# Getting science assets

<!-- We now need to get either:
- All science assets for running simulations
- Just the asset containing the mask used for analysis -->

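CMB-ML needs external science assets, such as the maps used for noise, instrument parameters, and cosmological parameter distributions.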

## All Science Assets

The simplest method is to run the [get_data/get_assets.py](../get_data/get_assets.py) script. It downloads files from ESA's Planck Legacy Archive and NASA's LAMBDA archive. Downloads may be slow.

<!-- There is also a CMB-ML data mirror for these files, but links are not currently available. Please contact us through the GitHub repository and they will be re-enabled. -->

## Assorted Assets

If you prefer not to use the script, individual files are available from their sources.


Planck files should go into the `Assets/Planck` folder (or whatever is specified in your `local_system` YAML). Similarly, WMAP files should go into `Assets/WMAP`.

- Planck Maps
    - [Planck Collaboration Observation at 30 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/LFI_SkyMap_030-BPassCorrected_1024_R3.00_full.fits)
    - [Planck Collaboration Observation at 44 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/LFI_SkyMap_044-BPassCorrected_1024_R3.00_full.fits)
    - [Planck Collaboration Observation at 70 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/LFI_SkyMap_070-BPassCorrected_1024_R3.00_full.fits)
    - [Planck Collaboration Observation at 100 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/HFI_SkyMap_100_2048_R3.01_full.fits)
    - [Planck Collaboration Observation at 143 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/HFI_SkyMap_143_2048_R3.01_full.fits)
    - [Planck Collaboration Observation at 217 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/HFI_SkyMap_217_2048_R3.01_full.fits)
    - [Planck Collaboration Observation at 353 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/HFI_SkyMap_353-psb_2048_R3.01_full.fits)
    - [Planck Collaboration Observation at 545 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/HFI_SkyMap_545_2048_R3.01_full.fits)
    - [Planck Collaboration Observation at 857 GHz](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/HFI_SkyMap_857_2048_R3.01_full.fits)
    - [Planck Collaboration NILC-cleaned Map](https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/maps/component-maps/cmb/COM_CMB_IQU-nilc_2048_R3.00_full.fits)
- Others
    - [WMAP9 Chains, direct download](https://lambda.gsfc.nasa.gov/data/map/dr5/dcp/chains/wmap_lcdm_mnu_wmap9_chains_v5.tar.gz)
    - [Planck delta bandpass table, from Simons Observatory](https://github.com/galsci/mapsims/raw/main/mapsims/data/planck_deltabandpass/planck_deltabandpass.tbl)
    - [Original delta bandpass table, from Simons Observatory](assets/delta_bandpasses/CMB-ML/cmb-ml_deltabandpass.tbl)
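
For a manual download, a sketch like the following places a file from one of the links above into the matching subfolder. It is not how `get_assets.py` works; `destination_for` and `download_asset` are hypothetical helpers, and the sketch assumes files sit directly in `Planck` or `WMAP` under your assets directory and keep the file name from the URL.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def destination_for(url, assets_dir, subfolder):
    """Build the local path for a URL, keeping the file name from the URL."""
    filename = PurePosixPath(urlparse(url).path).name
    return Path(assets_dir) / subfolder / filename


def download_asset(url, assets_dir, subfolder):
    """Download url into <assets_dir>/<subfolder>, skipping existing files."""
    dest = destination_for(url, assets_dir, subfolder)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))  # large files; this may take a while
    return dest


# Example call, reusing assets_root from the configuration check above:
# url = ("https://irsa.ipac.caltech.edu/data/Planck/release_3/all-sky-maps/"
#        "maps/LFI_SkyMap_030-BPassCorrected_1024_R3.00_full.fits")
# download_asset(url, assets_root, "Planck")
```

Skipping files that already exist makes the download safe to re-run, but it does not detect partial downloads; delete an incomplete file before retrying.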




# Next steps

Your system is now set up to use CMB-ML.

Next, we'll look at a couple of simulations to better understand the data, in [the next demonstration notebook](./D_first_look_at_sims.ipynb).

When you're ready, either [download simulations](../get_data/get_dataset.py) or [create simulations](../main_sims.py).

There are also optional demonstration notebooks if you intend to write code using CMB-ML, starting with [a description of the CMB-ML framework](./E_CMB_ML_framework.ipynb).