# Useful Utilities

**Authors:** Olivia Lynn

**Last Run Successfully:** September 20, 2023

This notebook collects various utilities that are useful when working with RAIL.

## Setting Things Up

In [1]:
import rail

### Listing imported stages (1/2)

Let's list out our currently imported stages. Right now, this will only be what we get by importing `rail` and `rail.stages`.

In [2]:
import rail.stages
for val in rail.core.RailStage.pipeline_stages.values():
    print(val[0])

<class 'rail.core.utilStages.ColumnMapper'>
<class 'rail.core.utilStages.RowSelector'>
<class 'rail.core.utilStages.TableConverter'>
<class 'rail.estimation.estimator.CatEstimator'>
<class 'rail.estimation.algos.naive_stack.NaiveStackSummarizer'>
<class 'rail.estimation.algos.random_gauss.RandomGaussEstimator'>
<class 'rail.estimation.algos.point_est_hist.PointEstHistSummarizer'>
<class 'rail.estimation.algos.train_z.TrainZInformer'>
<class 'rail.estimation.algos.train_z.TrainZEstimator'>
<class 'rail.estimation.algos.var_inf.VarInfStackSummarizer'>
<class 'rail.estimation.algos.uniform_binning.UniformBinningClassifier'>
<class 'rail.estimation.algos.equal_count.EqualCountClassifier'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelection'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelection_GAMA'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelection_BOSS'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelectio

### Import and attach all

Using `rail.stages.import_and_attach_all()` lets you import all packages within the RAIL ecosystem at once. 

This kind of blanket import is a useful shortcut; however, it will be slower than specific imports, as you will import things you'll never need. 

As such, `import_and_attach_all` is recommended for new users and those who wish to do rapid exploration with notebooks; pipelines designed to be run at scale would generally prefer lightweight, specific imports.


In [3]:
import rail
import rail.stages
rail.stages.import_and_attach_all()


Imported rail.hub
Imported rail.astro_tools
Imported rail.core
Imported rail.stages
Imported rail.bpz
Imported rail.cmnn
Imported rail.delight
Failed to import rail.dsps because: You need to have the SPS_HOME environment variable
Imported rail.flexzboost
Imported rail.gpz
Imported rail.pipelines
Failed to import rail.pzflow because: No module named 'rail.estimation.algos.pzflow'
Imported rail.sklearn
Imported rail.som
Attached 12 base classes and 53 fully formed stages to rail.stages
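
Notice that the log above reports two failed imports without stopping the whole import. To get a feel for that pattern, here is a minimal sketch of a tolerant import loop. This is not RAIL's implementation: `try_import_all` is a hypothetical helper, and it is demonstrated on standard-library modules plus one deliberately missing name.

```python
import importlib


def try_import_all(module_names):
    """Import each module, collecting failures instead of raising."""
    imported, failed = {}, {}
    for name in module_names:
        try:
            imported[name] = importlib.import_module(name)
        except Exception as exc:  # broad on purpose: report any import-time failure
            failed[name] = exc
    return imported, failed


# "not_a_real_module_xyz" is a made-up name used to trigger a failure
imported, failed = try_import_all(["json", "math", "not_a_real_module_xyz"])
for name in imported:
    print(f"Imported {name}")
for name, exc in failed.items():
    print(f"Failed to import {name} because: {exc}")
```

Collecting failures this way means one missing optional dependency doesn't block everything else, at the cost of having to read the log to know what is actually available.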


Now that we've attached all available stages to `rail.stages`, we can use `from rail.stages import *` to let us omit prefixes.

To see this in action:

In [4]:
# with prefix

print(rail.core.utilStages.ColumnMapper)

<class 'rail.core.utilStages.ColumnMapper'>


In [5]:
# without prefix

try:
    print(ColumnMapper)
except Exception as e: 
    print(e)

name 'ColumnMapper' is not defined


In [6]:
from rail.stages import *

In [7]:
print(ColumnMapper)

<class 'rail.core.utilStages.ColumnMapper'>


### Listing imported stages (2/2)

Now, let's try listing imported stages again.

Note that we can now just call `RailStage` instead of `rail.core.RailStage`.

In [8]:
for val in RailStage.pipeline_stages.values():
    print(val[0])

<class 'rail.core.utilStages.ColumnMapper'>
<class 'rail.core.utilStages.RowSelector'>
<class 'rail.core.utilStages.TableConverter'>
<class 'rail.estimation.estimator.CatEstimator'>
<class 'rail.estimation.algos.naive_stack.NaiveStackSummarizer'>
<class 'rail.estimation.algos.random_gauss.RandomGaussEstimator'>
<class 'rail.estimation.algos.point_est_hist.PointEstHistSummarizer'>
<class 'rail.estimation.algos.train_z.TrainZInformer'>
<class 'rail.estimation.algos.train_z.TrainZEstimator'>
<class 'rail.estimation.algos.var_inf.VarInfStackSummarizer'>
<class 'rail.estimation.algos.uniform_binning.UniformBinningClassifier'>
<class 'rail.estimation.algos.equal_count.EqualCountClassifier'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelection'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelection_GAMA'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelection_BOSS'>
<class 'rail.creation.degradation.spectroscopic_selections.SpecSelectio

We can use this list of imported stages to browse for specifics, such as looking through our available estimators.

**Note:** this will only filter through what you've imported, so if you haven't imported everything above, this will not be a complete list of all estimators available in RAIL.

In [9]:
for val in RailStage.pipeline_stages.values():
    if issubclass(val[0], rail.estimation.estimator.CatEstimator):
        print(val[0])

<class 'rail.estimation.estimator.CatEstimator'>
<class 'rail.estimation.algos.random_gauss.RandomGaussEstimator'>
<class 'rail.estimation.algos.train_z.TrainZEstimator'>
<class 'rail.estimation.algos.bpz_lite.BPZliteEstimator'>
<class 'rail.estimation.algos.cmnn.CMNNPDF'>
<class 'rail.estimation.algos.flexzboost.FlexZBoostEstimator'>
<class 'rail.estimation.algos.gpz.GPzEstimator'>
<class 'rail.estimation.algos.k_nearneigh.KNearNeighEstimator'>
<class 'rail.estimation.algos.sklearn_neurnet.SklNeurNetEstimator'>
<class 'rail.estimation.algos.nz_dir.NZDirSummarizer'>
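
If you find yourself filtering like this often, the loop above can be wrapped in a small helper. The sketch below assumes only what the cell above relies on: a dictionary whose values hold the stage class as their first element. `stages_of_type` and the stand-in classes are hypothetical, so the example runs without RAIL installed.

```python
def stages_of_type(pipeline_stages, base_class):
    """Return the registered classes that are subclasses of base_class."""
    return [
        entry[0]
        for entry in pipeline_stages.values()
        if issubclass(entry[0], base_class)
    ]


# Stand-in classes and registry (hypothetical), shaped like {name: (class, ...)}
class Stage:
    pass


class Estimator(Stage):
    pass


class MyEstimator(Estimator):
    pass


class Summarizer(Stage):
    pass


registry = {
    "Estimator": (Estimator, "demo.module"),
    "MyEstimator": (MyEstimator, "demo.module"),
    "Summarizer": (Summarizer, "demo.module"),
}

estimators = stages_of_type(registry, Estimator)
print([cls.__name__ for cls in estimators])
```

With RAIL imported, the equivalent call would be `stages_of_type(RailStage.pipeline_stages, rail.estimation.estimator.CatEstimator)`. Note that `issubclass` counts a class as a subclass of itself, which is why the base class appears in the results.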


### Listing keys in the Data Store (1/2)

Let's list out the keys in the Data Store to see what data we have stored.

First, we must set up the Data Store:

In [10]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

Because we've only just created the store, it is empty, as you may have guessed.

We'll come back to this in a bit.

In [11]:
DS.keys()

dict_keys([])
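
The `DS.__class__.allow_overwrite = True` line above sets the flag on the class rather than on the instance, so it applies to every store of that class. Here is a minimal sketch of that idea using a plain dictionary subclass. `GuardedStore` is a hypothetical stand-in; RAIL's actual Data Store may be implemented differently.

```python
class GuardedStore(dict):
    """Dict-like store that refuses to overwrite keys unless allowed."""

    allow_overwrite = False  # class attribute: shared by every instance

    def __setitem__(self, key, value):
        if key in self and not self.allow_overwrite:
            raise ValueError(f"{key!r} is already in the store")
        super().__setitem__(key, value)


store = GuardedStore()
store["model"] = "first.pkl"
try:
    store["model"] = "second.pkl"  # rejected: overwriting is off by default
except ValueError as exc:
    print(exc)

# Setting the flag on the class changes the behavior of all instances
store.__class__.allow_overwrite = True
store["model"] = "second.pkl"
print(store["model"])
```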

### Finding data files with find_rail_file

We need to define the flow file that we'll use in our pipeline.

If we already know its path, we can just point directly to the file (relative to the directory that holds our `rail/` directory):

In [12]:
import os
from rail.core.utils import RAILDIR  # assumed location of RAILDIR

flow_file = os.path.join(
    RAILDIR, "rail/examples_data/goldenspike_data/data/pretrained_flow.pkl"
)

But if we aren't sure where our file is (or we're just feeling lazy) we can use `find_rail_file`.

This is especially helpful in cases where our installation is spread out, and some rail modules are located separately from others.

In [13]:
from rail.core.utils import find_rail_file
flow_file = find_rail_file('examples_data/goldenspike_data/data/pretrained_flow.pkl')
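
Conceptually, a lookup like this tries a relative path against several candidate root directories and returns the first match. The sketch below illustrates that idea only. It is not how `find_rail_file` is implemented, and `find_in_roots` and the temporary directories are hypothetical.

```python
import tempfile
from pathlib import Path


def find_in_roots(relative_path, roots):
    """Return the first existing root/relative_path, or raise ValueError."""
    for root in roots:
        candidate = Path(root) / relative_path
        if candidate.exists():
            return str(candidate)
    raise ValueError(f"Could not find {relative_path!r} under any of {list(roots)}")


# Two throwaway roots; only the second one contains the file
with tempfile.TemporaryDirectory() as root_a, tempfile.TemporaryDirectory() as root_b:
    target = Path(root_b) / "examples_data" / "demo.pkl"
    target.parent.mkdir(parents=True)
    target.touch()
    found = find_in_roots("examples_data/demo.pkl", [root_a, root_b])
    found_matches_target = found == str(target)

print(found_matches_target)
```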



We can set our FLOWDIR based on the location of our flow file, too.

In [14]:
os.environ['FLOWDIR'] = os.path.dirname(flow_file)

In [15]:
import numpy as np

# Now, we have to set up some other variables for our pipeline:

bands = ["u", "g", "r", "i", "z", "y"]
band_dict = {band: f"mag_{band}_lsst" for band in bands}
rename_dict = {f"mag_{band}_lsst_err": f"mag_err_{band}_lsst" for band in bands}
post_grid = [float(x) for x in np.linspace(0.0, 5, 21)]


## Creating the Pipeline

In [16]:
import ceci

In [17]:
# Make some stages

flow_engine_test = FlowCreator.make_stage(
    name="flow_engine_test", model=flow_file, n_samples=50
)
col_remapper_test = ColumnMapper.make_stage(
    name="col_remapper_test", hdf5_groupname="", columns=rename_dict
)
#flow_engine_test.sample(6, seed=0).data

Inserting handle into data store.  model: /Users/orl/code/DESC-RAIL/rail_base/src/rail/examples_data/goldenspike_data/data/pretrained_flow.pkl, flow_engine_test


In [18]:
# Add the stages to the pipeline

pipe = ceci.Pipeline.interactive()
stages = [flow_engine_test, col_remapper_test]
for stage in stages:
    pipe.add_stage(stage)

In [19]:
# Connect stages

col_remapper_test.connect_input(flow_engine_test)

Inserting handle into data store.  output_flow_engine_test: inprogress_output_flow_engine_test.pq, flow_engine_test


## Introspecting the Pipeline

### Listing keys in the Data Store (2/2)

Now that we have some data in the Data Store, let's take another look at it.

In [20]:
DS.keys()

dict_keys(['model', 'output_flow_engine_test'])

### Getting names of stages in the pipeline

In [21]:
pipe.stage_names

['flow_engine_test', 'col_remapper_test']

### Getting the configuration of a particular stage

Let's take a look at the config of the first stage we just listed above.

In [22]:
pipe.flow_engine_test.config

StageConfig{output_mode:default,n_samples:50,seed:12345,name:flow_engine_test,model:/Users/orl/code/DESC-RAIL/rail_base/src/rail/examples_data/goldenspike_data/data/pretrained_flow.pkl,config:None,aliases:{'output': 'output_flow_engine_test'},}

### Updating a configuration value
 
We can update config values even after the stage has been created. Let's give it a try.

In [23]:
pipe.flow_engine_test.config.update(seed=42)

pipe.flow_engine_test.config

StageConfig{output_mode:default,n_samples:50,seed:42,name:flow_engine_test,model:/Users/orl/code/DESC-RAIL/rail_base/src/rail/examples_data/goldenspike_data/data/pretrained_flow.pkl,config:None,aliases:{'output': 'output_flow_engine_test'},}

### Listing stage outputs (as both tags and aliased tags)

Let's get the list of outputs as 'tags'.

These are how the stage thinks of the outputs, as a list of names associated with DataHandle types.

In [24]:
pipe.flow_engine_test.outputs

[('output', rail.core.data.PqHandle)]

We can also get the list of outputs as 'aliased tags'.

These are how the pipeline thinks of the outputs, as a unique key that points to a particular file.

In [25]:
pipe.flow_engine_test._outputs

{'output_flow_engine_test': 'output_flow_engine_test.pq'}
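
To tie the two views together: a tag maps to an aliased tag, and the aliased tag maps to a file. The sketch below follows that chain with plain dictionaries, using the values copied from the outputs above. `resolve_output` is a hypothetical helper, not part of RAIL or ceci.

```python
# Values copied from the config and output cells above
aliases = {"output": "output_flow_engine_test"}
aliased_files = {"output_flow_engine_test": "output_flow_engine_test.pq"}


def resolve_output(tag, aliases, aliased_files):
    """Follow tag -> aliased tag -> file path."""
    # Assumption: a tag without an alias is used as its own key
    aliased_tag = aliases.get(tag, tag)
    return aliased_files[aliased_tag]


print(resolve_output("output", aliases, aliased_files))
```

The aliasing is what lets two stages both call their output `output` while the pipeline still tracks two distinct files.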

### Listing all pipeline methods and parameters that can be set

If you'd like to take a closer look at what you can do with a pipeline, use `dir(pipe)` to list out available methods and parameters.

In [26]:
for item in dir(pipe):
    if '__' not in item:
        print(item)

add_stage
build_config
build_dag
build_stage
callback
create
enqueue_job
find_all_outputs
get_stage_aliases
global_config
initialize
initiate_run
interactive
launcher_config
make_flow_chart
modules
ordered_stages
overall_inputs
pipeline_files
pipeline_outputs
print_stages
read
remove_stage
run
run_config
run_info
run_jobs
save
should_skip_stage
sleep
stage_config_data
stage_execution_config
stage_names
stages
stages_config
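
The list above mixes methods and plain attributes. If you'd like to tell them apart, a small helper built on `callable` can split the names. This is a generic Python sketch: `public_members` and `DemoPipeline` are hypothetical, and the helper skips every name starting with an underscore, which is slightly stricter than the `'__' not in item` filter used above.

```python
def public_members(obj):
    """Split an object's public names into (methods, attributes)."""
    methods, attributes = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder names
        if callable(getattr(obj, name)):
            methods.append(name)
        else:
            attributes.append(name)
    return methods, attributes


class DemoPipeline:
    """Stand-in object so the example runs without a real pipeline."""

    def __init__(self):
        self.stage_names = ["a", "b"]

    def run(self):
        return None

    def _private(self):
        return None


methods, attributes = public_members(DemoPipeline())
print("methods:", methods)
print("attributes:", attributes)
```

With a real pipeline you would call `public_members(pipe)` instead. Keep in mind that `getattr` evaluates properties, so this can run code on objects with expensive or failing properties.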


## Initializing the Pipeline

### Toggling resume mode

We can turn 'resume mode' on when initializing a pipeline.

Resume mode lets us skip stages that already have output files, so we don't have to rerun the same stages as we iterate on a pipeline.

To enable it, pass `resume=True` in the second dictionary given to `pipe.initialize`.

In [27]:
pipe.initialize(
    dict(model=flow_file), dict(output_dir=".", log_dir=".", resume=True), None
)

Skipping stage flow_engine_test because its outputs exist already
Skipping stage col_remapper_test because its outputs exist already


(({}, []), {'output_dir': '.', 'log_dir': '.', 'resume': True})
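
The skipping reported above boils down to checking whether each stage's output files already exist. Here is a minimal sketch of that decision. It is not ceci's implementation: `stages_to_run`, the stage names, and the file names are hypothetical.

```python
import os
import tempfile


def stages_to_run(stage_outputs, resume=False):
    """Given {stage_name: [output paths]}, return the stages that still need to run."""
    to_run = []
    for name, paths in stage_outputs.items():
        if resume and paths and all(os.path.exists(p) for p in paths):
            print(f"Skipping stage {name} because its outputs exist already")
        else:
            to_run.append(name)
    return to_run


with tempfile.TemporaryDirectory() as tmp:
    done = os.path.join(tmp, "output_stage_a.pq")
    open(done, "w").close()  # pretend stage_a already ran
    outputs = {
        "stage_a": [done],
        "stage_b": [os.path.join(tmp, "output_stage_b.pq")],
    }
    with_resume = stages_to_run(outputs, resume=True)
    without_resume = stages_to_run(outputs, resume=False)

print(with_resume)
print(without_resume)
```

One consequence of an existence check like this is that stale outputs are also skipped, so it is worth deleting old files when a stage's configuration changes.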

Inspecting `pipe.stages` should show all the stages this pipeline will run, in order.

In [28]:
pipe.stages

[<rail.creation.engines.flowEngine.FlowCreator at 0x165d2ac90>,
 Stage that applies remaps the following column names in a pandas DataFrame:
 f{str(self.config.columns)}]

## Managing notebooks with git

_(thank you to https://stackoverflow.com/a/58004619)_

You can modify your git settings to run a filter over certain files before they are added to git. This will leave the original file on disk as-is, but commit the "cleaned" version.
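
To make "cleaned" concrete, here is a standard-library sketch of what such a filter does to a notebook: it empties the outputs and execution counts of code cells and leaves everything else alone. This is an illustration only, not the command from the linked answer, and `strip_outputs` is a hypothetical helper.

```python
import json


def strip_outputs(notebook_text):
    """Return notebook JSON text with code-cell outputs and counts cleared."""
    notebook = json.loads(notebook_text)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(notebook, indent=1)


# A tiny made-up notebook with one markdown cell and one executed code cell
raw = json.dumps({
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

cleaned = json.loads(strip_outputs(raw))
print(cleaned["cells"][1]["outputs"])
print(cleaned["cells"][1]["execution_count"])
```

A git clean filter applies this kind of transformation to the file contents it receives on standard input and writes the result to standard output, which is why the copy on disk keeps its outputs.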


First, define the filter in your local `.git/config` file (or global `~/.gitconfig`); the linked answer gives the exact entry.

Then, create a `.gitattributes` file in your directory with notebooks and add a line that applies that filter to your notebook files.