Parallelizable cannot aggregate or return multiple Collects #742
Hey! Thanks, this is a known limitation (see point 5 here) -- #301. That said, there's an easy workaround -- you can group them before running:

```python
def all_metrics(sub_metric_1: ANALYSIS_RES, sub_metric_2: ANALYSIS_RES) -> ANALYSIS_RES:
    return ...  # join the two dicts in whatever way you want

def all_agg(all_metrics: Collect[ANALYSIS_RES]) -> pd.DataFrame:
    return ...  # join them all into a dataframe
```

While it's not ideal, it should just be adding one extra function! Note that this works in the above case where they're operating over the same partitions. In the case that they aren't, you'll want two separate parallelizations. Going to reference this issue from the other one -- this is a good one to keep around/I can spend some time scoping out a fix.
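To make the grouping workaround concrete, here is a hypothetical, framework-free sketch of the join logic. Plain dicts and a list of rows stand in for `ANALYSIS_RES` and the final DataFrame, and all names are illustrative, not from the original thread:

```python
# Hypothetical concrete version of the grouping workaround: merge the two
# per-partition metric dicts into one node, so a single Collect suffices.
def all_metrics(sub_metric_1: dict, sub_metric_2: dict) -> dict:
    # Join the two dicts; assumes the metric keys don't collide.
    return {**sub_metric_1, **sub_metric_2}


def all_agg(collected: list[dict]) -> list[dict]:
    # Stand-in for assembling a DataFrame: just materialize the rows.
    return list(collected)


rows = all_agg([
    all_metrics({"metric_1": 1.0}, {"metric_2": 2.0}),
    all_metrics({"metric_1": 3.0}, {"metric_2": 4.0}),
])
```

In the real DAG, `all_metrics` runs once per partition inside the parallel block, and `all_agg` receives `Collect[ANALYSIS_RES]`, so only one collect point is needed.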
And if you want the flexibility to only compute one of them, you can utilize
@elijahbenizzy, @skrawcz Thank you both for the help! This is where I'm taking the workaround, and in case it's useful for you or anyone else looking at the ticket, this is successful:

```python
@resolve(
    when=ResolveAt.CONFIG_AVAILABLE,
    decorate_with=lambda metric_names: inject(sub_metrics=group(*[source(x) for x in metric_names])),
)
def all_metrics(sub_metrics: list[ANALYSIS_RES], columns: list[str]) -> pd.DataFrame:
    frames = []
    for a in sub_metrics:
        frames.append(_to_frame(a, columns))
    return pd.concat(frames)
```

Don't forget:

```python
from hamilton import settings

_config = {settings.ENABLE_POWER_USER_MODE: True}
_config["metric_names"] = ["sub_metric_1", "sub_metric_2"]

# Then in the driver building:
.with_config(_config)
```
Closing this issue for now.
Current behavior

A DAG with one `Parallelizable` and two `Collect` statements cannot return results from both `Collect` nodes.

Stack Traces

```
KeyError: 'Key metric_2 not found in cache'
```
Steps to replicate behavior
Library & System Information
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:40:50) [MSC v.1937 64 bit (AMD64)] on win32
Expected behavior
I would expect to be able to retrieve any collections done. I can request one at a time for `metric_1` and `metric_2` and have it succeed.

Thank you in advance for your help!