
Question about Stage API design #90

Closed
deniederhut opened this issue Oct 3, 2023 · 3 comments

@deniederhut

In this example from the documentation

@stage(inputs=["initial_value"], outputs=["final_value"])
def multiply_again(record, initial_value):
    return initial_value * record.params.some_scalar_multiplier

the stage accepts two arguments: the record, and the initial value. The initial value is referenced directly by name, but other things the function needs -- specifically the params object -- are accessed as an attribute on the record.

Is there a reason that inputs aren't passed as attributes on the record?

Somewhat related -- later in the getting started docs is this example of using an aggregation stage

@aggregate(inputs=["final_value"], outputs=["all_final_values", "maximum_value"])
def find_maximum_final_value(record, records, final_value: dict[Record, float]):
    all_vals = {}
    for r, value in final_value.items():
        all_vals[r.params.name] = value

    maximum = max(all_vals.values())
    return all_vals, maximum

Here we are accepting a singular record, followed by an iterable of records. I feel like I must be misunderstanding something about the record objects because I don't see why the singular record is required. Could we just have the iterable?

@WarmCyan
Collaborator

WarmCyan commented Oct 3, 2023

the stage accepts two arguments -- the record, and the initial value. The initial value is referenced directly by name, but other things the function needs -- specifically the params object -- are accessed as an attribute on the record.

Is there a reason that inputs aren't passed as attributes on the record?

I think part of it is the explicitness, to make stage functions look more semantically/syntactically like normal non-cf functions. Since a stage is defined as a function that takes a set of artifact inputs and returns a set of artifact outputs, I think it's more reflective of that definition if those artifacts are actual function arguments.
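The idea of artifacts appearing as real function arguments could be sketched roughly like this. Note this is a hypothetical illustration with stand-in `Record` and `stage` definitions, not curifactory's actual implementation:

```python
from functools import wraps
from types import SimpleNamespace

# Minimal stand-in for illustration only -- NOT curifactory's real Record class.
class Record:
    def __init__(self, params):
        self.params = params  # semi-global, record-wide configuration
        self.state = {}       # named artifacts produced by stages

def stage(inputs=None, outputs=None):
    """Hypothetical sketch: pull each requested artifact out of record.state
    by name and pass it as a real keyword argument, then store the return
    values back under the declared output names."""
    inputs = inputs or []
    outputs = outputs or []

    def decorator(func):
        @wraps(func)
        def wrapper(record):
            kwargs = {name: record.state[name] for name in inputs}
            result = func(record, **kwargs)
            results = (result,) if len(outputs) == 1 else result
            for name, value in zip(outputs, results):
                record.state[name] = value
            return record
        return wrapper
    return decorator

@stage(inputs=["initial_value"], outputs=["final_value"])
def multiply_again(record, initial_value):
    return initial_value * record.params.some_scalar_multiplier

record = Record(SimpleNamespace(some_scalar_multiplier=3))
record.state["initial_value"] = 7
multiply_again(record)  # record.state["final_value"] is now 21
```

This is why the stage body reads like a normal function: the artifacts it cares about are injected as arguments, while configuration stays accessible through `record.params`.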

State artifacts are also a bit more dynamic than parameters. Parameters are pretty well pre-defined and expected not to change mid-run; they're a record-wide ("global") configuration, so to me it makes sense that they're just an object you can get attributes off of. Artifacts, on the other hand, are a bit fuzzier, so to me it makes sense to treat them as variables being passed in, rather than making you manually get them off the record yourself.

That's maybe a bit of a waffly answer. All of this is to say there's not necessarily a technical reason we don't; it comes down to what makes intuitive sense to me: a semi-global configuration object, with the actual data I'm working with appearing as discrete, first-class variables within the stage. They're the things I really care about and would use in regular functions, and I just tangentially want access to configuration values without having to pass in every single parameter individually.

Here we are accepting a singular record, followed by an iterable of records. I feel like I must be misunderstanding something about the record objects because I don't see why the singular record is required. Could we just have the iterable?

This one I can more concretely support. Philosophically, a record is supposed to represent a "single chain of computation" with some associated parameter set, and artifacts produced within a record are cached uniquely according to a hash based on that record/parameter set. An aggregate stage is a way to merge several chains of computation into one, to allow comparing/operating across them. But once they've merged, this now represents a new chain of computation, potentially with regular stages following the aggregate, and this new chain needs to cache artifacts under a different hash than any of the individual chains that went into it. (More specifically, to avoid inappropriate cache hits and misses, the new hash needs to be representative of all of the chains that went into it. For example, an aggregate stage producing a table comparing output results from model_x, model_y, and model_z should be stored under a different hash than one comparing the output results from model_x and model_y.)
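The hashing requirement described above could be sketched roughly like this. The function names and hashing scheme here are assumptions for illustration, not curifactory's actual cache implementation:

```python
import hashlib

def record_hash(param_dict):
    """Illustrative only: hash a single record's parameter set so its
    artifacts cache under a stable, parameter-derived key."""
    canonical = repr(sorted(param_dict.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

def aggregate_hash(record_hashes):
    """The merged chain's hash must reflect *every* input chain, so an
    aggregate over (model_x, model_y, model_z) caches separately from one
    over just (model_x, model_y). Sorting keeps it order-independent."""
    combined = "|".join(sorted(record_hashes))
    return hashlib.sha256(combined.encode()).hexdigest()

h_x = record_hash({"name": "model_x"})
h_y = record_hash({"name": "model_y"})
h_z = record_hash({"name": "model_z"})

h_xy = aggregate_hash([h_x, h_y])
h_xyz = aggregate_hash([h_x, h_y, h_z])
# h_xy != h_xyz, so the two aggregates can never collide in the cache
```

The key design point is that the aggregate's hash is a function of the full set of merged chains, which is exactly what prevents the inappropriate cache hits mentioned above.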

So the singular record that an aggregate gets is the store (and parameter set) for tracking this new chain of computation; the aggregate acts like a normal stage that has been extended to allow retrieving artifacts from the states of all the records being merged, rather than only its own. Stages that come after an aggregate stage then act like normal: they can retrieve any artifacts output by the aggregate (which are stored in that singular record), but won't directly have access to the individual other records. So the intended point of the aggregate stage is literally just to run whatever aggregation operations are necessary to turn output artifacts from multiple records (the records iterable) into one or more outputs in a single record (its own). (It's perhaps more accurate to call it a "reduce" stage tbh, but I think that would be more confusing to people who aren't used to map-reduce and similar programming models.)

In practice it might look like this diagram (a much more detailed version than the one currently in the docs):

[diagram: better_aggregate.drawio]
In the above, we have a couple of records (0 and 1) that each test how good a particular model is. Those records are passed into an aggregate (the iterable), which makes a combined table of each test's result and outputs it into the new chain's record (2); some regular stages then operate on that record to do the final processing for the output results.
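The flow in that diagram could be sketched like this, using the `find_maximum_final_value` example from the docs. The `Record` class here is a stand-in for illustration, not curifactory's real API, and artifact lookup/storage is done by hand rather than by the `@aggregate` decorator:

```python
from types import SimpleNamespace

# Stand-in record for illustration only -- NOT curifactory's real class.
class Record:
    def __init__(self, params):
        self.params = params
        self.state = {}

def find_maximum_final_value(record, records):
    """Reduce-style aggregate: gather final_value from every merged record,
    write the combined outputs into the *new* chain's record."""
    final_value = {r: r.state["final_value"] for r in records}
    all_vals = {r.params.name: value for r, value in final_value.items()}
    maximum = max(all_vals.values())
    # outputs land in the singular record, cached under the new chain's hash
    record.state["all_final_values"] = all_vals
    record.state["maximum_value"] = maximum
    return record

# records 0 and 1: two independent chains of computation
r0 = Record(SimpleNamespace(name="model_x"))
r0.state["final_value"] = 0.82
r1 = Record(SimpleNamespace(name="model_y"))
r1.state["final_value"] = 0.91

# record 2: the new chain that begins at the merge point
r2 = find_maximum_final_value(Record(SimpleNamespace(name="combined")), [r0, r1])
```

After the aggregate, downstream stages only see `r2`; the per-model records 0 and 1 have ended their chains at the merge.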

To rephrase all of that with substantially fewer words: each individual chain of computation ends once it merges, and a new chain of computation begins at that merge/reduction point, so the latter chain needs its own record to represent it and to ensure caching/provenance is tracked based on the correct set of computations.

If all of that was explaining something you already understood, and you're instead asking why the record couldn't just be implicitly handled for aggregates and then passed around as normal in those later stages: I think it's important for the aggregate to have direct access to its own record, (1) for consistency with how normal stages work, and (2) to give the aggregate stage access to the handy helper functions on record objects, e.g. access to cache paths and current cachers, the ability to add reportables directly from the aggregate stage, etc.

@deniederhut
Author

So the singular record that an aggregate gets is the store (and parameter set) for tracking this new chain of computation, acting like a normal stage that has been extended to allow retrieving the artifacts from the states of all records being merged, rather than its own.

💡 ahhh thank you, this is the piece that I was missing. I'm still not sure that I really grok the record classes, but at the surface level at least it makes sense.

That's maybe a bit of a waffly answer

I dunno, I think a lot of API design comes down to what feels intuitive, and I think your point about matching the affordances of an ordinary function is a good one.

But speaking of intuition, it does feel odd to me to have that function get the stage object, based on my current understanding (which might still be wrong 🙃). Maybe you answered this already in your last paragraph, but passing in the params instance directly seems like it would make the wrapped function a little more like something that just accepts an experiment configuration plus dynamic inputs.

In any case this is all just my uninformed opinion and I don't need more explanations or changes. Thanks for letting me test out curifactory!

@WarmCyan
Collaborator

Thanks for all the feedback and the in-depth review!
