# 1: Caching and Reporting

This notebook demonstrates some of the features that the components from the previous notebook enable. Two major ones are caching objects (to short-circuit computation of already-computed values) and quickly adding graphs and other "reportables" to a Jupyter display or a generated HTML experiment run report.

In [1]:
%cd ..

/home/81n/lab/curifactory/examples


We create an artifact manager, a parameter class, and some parameter sets, like in the previous notebook:

In [2]:
from dataclasses import dataclass
import curifactory as cf

manager = cf.ArtifactManager("notebook_example_1")

@dataclass
class Params(cf.ExperimentParameters):
    my_parameter: int = 1
        
default_params = Params(name="default")
doubled_params = Params(name="doubled", my_parameter=2)

## Caching

Caching is done at each stage, by listing a `curifactory.Cacheable` subclass for each output. After the stage runs, each cacher will save the returned object in the data cache path. The cached filename includes the name of the experiment (the string passed to `ArtifactManager`, "notebook_example_1" in this case), the hash string of the parameters, the name of the stage doing the caching, and the name of the output itself.
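
As a rough illustration of how those four pieces could combine into a filename, here is a minimal standard-library sketch. The `ExampleParams` class, the `cache_filename` helper, and the choice of SHA-256 over the JSON form of the parameters are all assumptions made for this example; curifactory's actual hashing scheme may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class ExampleParams:
    # Hypothetical stand-in for a parameter class; not a curifactory type.
    my_parameter: int = 1


def cache_filename(experiment, params, stage, output):
    """Join experiment name, parameter hash, stage name, and output name."""
    # Assumption: hash the parameter values via their sorted JSON form.
    encoded = json.dumps(asdict(params), sort_keys=True).encode("utf-8")
    params_hash = hashlib.sha256(encoded).hexdigest()
    return f"{experiment}_{params_hash}_{stage}_{output}.json"


default_name = cache_filename(
    "notebook_example_1", ExampleParams(), "long_compute_step", "long-compute-data"
)
doubled_name = cache_filename(
    "notebook_example_1", ExampleParams(my_parameter=2), "long_compute_step", "long-compute-data"
)
print(default_name)
print(doubled_name)
```

Because the parameter hash is part of the name, two parameter sets with different values map to different files, while the same values always map to the same file.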

On any subsequent run of that stage, the cachers all check to see if their file has already been created, and if it has, they directly load the object from file and return it rather than running the stage code.
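
The check-then-load behavior described above can be sketched with the standard library alone. The `run_with_cache` helper below is hypothetical and only mimics the idea; it is not how curifactory implements its cachers.

```python
import json
import os
import tempfile


def run_with_cache(path, compute):
    """Return the cached JSON value at `path` if present, else compute and save it."""
    if os.path.exists(path):
        # Cache hit: load from file and skip the computation entirely.
        with open(path) as infile:
            return json.load(infile)
    value = compute()
    with open(path, "w") as outfile:
        json.dump(value, outfile)
    return value


calls = []


def expensive():
    calls.append(1)  # record that the "stage body" actually ran
    return {"my_value": 1, "magic_value": 42}


with tempfile.TemporaryDirectory() as cache_dir:
    cache_path = os.path.join(cache_dir, "example.json")
    first = run_with_cache(cache_path, expensive)   # computes and writes the file
    second = run_with_cache(cache_path, expensive)  # loads the file instead

print(first == second, len(calls))
```

The second call returns the same value without running `expensive` again, which is the short-circuit the timing cells in this notebook demonstrate.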

The `@stage` decorator has a `cachers` parameter which should be given a list of cachers to use for the associated outputs list. Curifactory comes with a set of default cachers you can use, including `JsonCacher`, `PandasCSVCacher`, `PandasJsonCacher`, and `PickleCacher`.

In the example below, we define a "long-running compute" stage to demonstrate cachers short-circuiting computation.

In [3]:
from time import sleep
from curifactory.caching import JsonCacher

@cf.stage(inputs=None, outputs=["long-compute-data"], cachers=[JsonCacher])
def long_compute_step(record):
    some_data = {
        "my_value": record.params.my_parameter, 
        "magic_value": 42
    }
    sleep(5)  # making dictionaries is hard work
    return some_data

In [4]:
import os
import shutil

# to demonstrate cache files, we first clear our cache path
for file in os.listdir("data/cache"):
    path = os.path.join("data/cache", file)
    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.remove(path)
os.listdir("data/cache")

[]

We run a record through our long-running stage, and as expected it takes about 5 seconds.

In [5]:
%%time
r0 = cf.Record(manager, default_params)
r0 = long_compute_step(r0)

CPU times: user 31.4 ms, sys: 1.43 ms, total: 32.8 ms
Wall time: 5.03 s


Inspecting our cache path now, there's a new JSON entry for our output (alongside a `_metadata.json` file), which we can load to confirm it is the output from our stage:

In [6]:
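import json

cache_files = os.listdir("data/cache")
print(cache_files)
print()
# os.listdir order is arbitrary, so select the data file rather than relying on index 0
data_file = next(f for f in cache_files if not f.endswith("_metadata.json"))
with open(os.path.join("data/cache", data_file), 'r') as infile:
    print(json.load(infile))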

['notebook_example_1_ec3f8f4559d33912cdfa01467994ce46_long_compute_step_long-compute-data.json', 'notebook_example_1_ec3f8f4559d33912cdfa01467994ce46_long_compute_step_long-compute-data_metadata.json']

{'my_value': 1, 'magic_value': 42}


If we run the stage again with a record using the same parameter set as the previous time, it finds the correct cached output and returns before running the stage code:

In [7]:
%%time
r1 = cf.Record(manager, default_params)
r1 = long_compute_step(r1)

CPU times: user 15 ms, sys: 0 ns, total: 15 ms
Wall time: 13.1 ms


Using a different parameter set results in a different cache path, so computations with different parameters won't conflict:

In [8]:
r2 = cf.Record(manager, doubled_params)
r2 = long_compute_step(r2)

In [9]:
os.listdir("data/cache")

['notebook_example_1_ec3f8f4559d33912cdfa01467994ce46_long_compute_step_long-compute-data.json',
 'notebook_example_1_981bc72d0bdde272e010d9c2202b4ecf_long_compute_step_long-compute-data.json',
 'notebook_example_1_981bc72d0bdde272e010d9c2202b4ecf_long_compute_step_long-compute-data_metadata.json',
 'notebook_example_1_ec3f8f4559d33912cdfa01467994ce46_long_compute_step_long-compute-data_metadata.json']

## Lazy Loading

One potential pitfall with caching is that a cached object is always loaded into memory, even if it is never used. Projects with very large data objects can run into memory problems as a result. Curifactory includes a `Lazy` class that can wrap a stage output's string name. When the output is first computed, the cacher saves it and the object is removed from memory (replaced in the record state with a `Lazy` instance). When the lazy object is later accessed, it is reloaded into memory from the cache at that point.

This means that in a sequence of stages where all values are cached, earlier stage outputs may never need to load into memory at all.

In [10]:
from curifactory.caching import Lazy
import sys

@cf.stage(inputs=None, outputs=[Lazy("very-large-object")], cachers=[JsonCacher])
def make_mega_big_object(record):
    mega_big = [1]*1024*1024
    print(sys.getsizeof(mega_big))
    return mega_big

r3 = cf.Record(manager, default_params)
r3 = make_mega_big_object(r3)

8388664


In [11]:
r3.state.resolve = False
print(type(r3.state['very-large-object']))
print(sys.getsizeof(r3.state['very-large-object']))

<class 'curifactory.caching.Lazy'>
48


Note that `Record.state` is actually a custom subclass of `dict`, and by default it automatically resolves lazy objects whenever they are accessed on the state. The above cell turns this functionality off (with `state.resolve = False`) to show that what's actually in memory before a resolved access is just the lazy object, which is significantly smaller.

When the record's state `resolve` is at its default value of `True`:

In [12]:
r3.state.resolve = True
print(type(r3.state['very-large-object']))
print(sys.getsizeof(r3.state['very-large-object']))

<class 'list'>
8448728
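
The resolve-on-access behavior shown in the two cells above can be approximated in a few lines. This is only a sketch of the pattern: `LazyRef` and `ResolvingState` are hypothetical names invented for this example, not curifactory classes.

```python
import json
import os
import tempfile


class LazyRef:
    """Hypothetical placeholder that only remembers where a value was cached."""

    def __init__(self, path):
        self.path = path

    def load(self):
        with open(self.path) as infile:
            return json.load(infile)


class ResolvingState(dict):
    """Hypothetical dict subclass that loads LazyRef values on access."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.resolve = True

    def __getitem__(self, key):
        value = super().__getitem__(key)
        if self.resolve and isinstance(value, LazyRef):
            return value.load()  # read from disk only when the key is accessed
        return value


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "very-large-object.json")
    with open(path, "w") as outfile:
        json.dump([1] * 1024, outfile)

    state = ResolvingState()
    state["very-large-object"] = LazyRef(path)

    state.resolve = False  # see what is really stored
    unresolved_type = type(state["very-large-object"]).__name__
    state.resolve = True   # default: access loads the real object
    resolved = state["very-large-object"]

print(unresolved_type, type(resolved).__name__, len(resolved))
```

Only the small placeholder lives in the dictionary; the full list exists in memory only for as long as the caller holds on to the value returned by the access.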


## Reporting

The ability to present results and pretty graphs is a major part of debugging, understanding, and publishing experiments! Keeping these organized can be a challenge, as one tries to manage folders for matplotlib graph images, result tables, and so on. Curifactory provides shortcuts to easily create `Reportable` items from inside stages. The artifact manager can then display them inside an experiment run report, stored in its own uniquely named run folder, which contains all of the information about the run, all of the created reportables, and a map of the stages that were run. Many of these report components can be rendered inside a notebook as well.

Every record has a `report` function that takes a `Reportable` subclass. Curifactory includes multiple default reporters, such as `DFReporter`, `FigureReporter`, `HTMLReporter`, `JsonReporter`, and `LinePlotReporter`.

In [13]:
from curifactory.reporting import LinePlotReporter

@cf.stage(inputs=None, outputs=["line_history"])
def make_pretty_graphs(record):
    multiplier = record.params.my_parameter
    
    # here we just make a bunch of example arrays of data to plot
    line_0 = [1 * multiplier, 2 * multiplier, 3 * multiplier]
    line_1 = [3 * multiplier, 2 * multiplier, 1 * multiplier]
    line_2 = [4, 0, 3]
    
    # a LinePlotReporter makes a nicely formatted matplotlib graph
    record.report(LinePlotReporter(line_0, name="single_line_plot"))
    record.report(LinePlotReporter(
        y={
            "ascending": line_0,
            "descending": line_1,
            "static": line_2
        },
        name="multi_line_plot"
    ))
    return [line_0, line_1, line_2]

The example stage above adds a couple of simple line plots to any record that is run through it.

In [14]:
r4 = cf.Record(manager, default_params)
r5 = cf.Record(manager, doubled_params)

r4 = make_pretty_graphs(r4)
r5 = make_pretty_graphs(r5)

When inside a Jupyter notebook or JupyterLab, the manager includes several display functions that let you render portions of the report directly in the notebook.

A few of these are:
* `display_info()` - renders the top block of the report, containing metadata about the run
* `display_all_reportables()` - renders all reportables in the manager
* `display_record_reportables(record)` - renders only the reportables associated with the passed record
* `display_stage_graph()` - renders a diagram of all the records, state objects, and stages. Note that graphviz must be installed for the diagram to generate correctly.

In [15]:
manager.display_info()

In [16]:
manager.display_all_reportables()
# note that reportables may not display in GitHub's live notebook render,
# due to pathing problems. Running this notebook locally should correctly
# display the saved matplotlib images.

In [17]:
manager.display_record_reportables(r4)

In [18]:
manager.display_stage_graph()  # this obviously looks a lot more interesting in more complicated stage setups

Finally, a full HTML report can be produced with the `generate_report()` function. This will create a run-specific folder to contain the report and all rendered reportables, inside the reports path. Additionally, every time a report is generated, an overall project report index is put directly in the reports path, which lists and links to all of the individual reports.

In [19]:
manager.generate_report()

In [20]:
os.listdir("reports")

['style.css',
 'notebook_example_1_1_2023-07-27-T140513',
 'index.html',
 'iris_1_2023-07-27-T140421',
 '_latest']
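
The layout listed above (one folder per run plus a top-level `index.html`) can be imitated with a short standard-library sketch. The `write_report` helper, the run names, and the assumption that each run folder holds its own `index.html` are illustrative only and do not describe curifactory's internals.

```python
import os
import tempfile


def write_report(reports_dir, run_name, body_html):
    """Write one run's report into its own folder, then rebuild the index."""
    run_dir = os.path.join(reports_dir, run_name)
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "index.html"), "w") as outfile:
        outfile.write(f"<html><body>{body_html}</body></html>")

    # Rebuild the top-level index from whatever run folders exist now.
    runs = sorted(
        name for name in os.listdir(reports_dir)
        if os.path.isdir(os.path.join(reports_dir, name))
    )
    links = "\n".join(
        f'<li><a href="{name}/index.html">{name}</a></li>' for name in runs
    )
    with open(os.path.join(reports_dir, "index.html"), "w") as outfile:
        outfile.write(f"<html><body><ul>\n{links}\n</ul></body></html>")
    return runs


with tempfile.TemporaryDirectory() as reports_dir:
    write_report(reports_dir, "run_a", "<p>first run</p>")      # hypothetical run name
    runs = write_report(reports_dir, "run_b", "<p>second run</p>")
    with open(os.path.join(reports_dir, "index.html")) as infile:
        index_html = infile.read()

print(runs)
```

Regenerating the index on every call keeps it in step with the run folders without tracking any separate state.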