500 encountered from metaflow service #7
@russellbrooks reported the following issue
For context, this was encountered as part of a hyperparameter tuning framework (each flow run is a model training evaluation) after ~6hrs with 125 runs successfully completed. Everything is executed on Batch with 12 variants being trained in parallel, then feeding those results back to a Bayesian optimizer to generate the next set of variants.
The cloudwatch logs from Batch indicate that the step completed successfully, and the Metaflow service error was encountered on a separate on-demand EC2 instance that's running the optimizer and executes the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry but this error seems to have short-circuited that logic and resulted in a hard-stop.
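For anyone trying to reproduce the setup, the launch side is roughly this shape (a simplified sketch, not the actual tuning code; the flow file name, parameter names, and CLI options are assumptions):

```python
import asyncio


async def launch_run(params):
    # Each candidate is launched as its own `run` subprocess; hypothetical
    # flow file and parameters for illustration only.
    cmd = "python train_flow.py run --with batch --with retry " + " ".join(
        f"--{name} {value}" for name, value in params.items()
    )
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # Capture the runtime's stdout/stderr so failures like this 500 stay
    # visible even when the underlying Batch job succeeds.
    stdout, stderr = await proc.communicate()
    return proc.returncode, stdout.decode(), stderr.decode()


async def main():
    # 12 variants trained in parallel, as in the setup above.
    variants = [{"learning_rate": 0.1 / (i + 1)} for i in range(12)]
    results = await asyncio.gather(*(launch_run(p) for p in variants))
    for returncode, out, err in results:
        print(returncode)


if __name__ == "__main__":
    asyncio.run(main())
```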
@russellbrooks can you try using the provided Python client for the metadata service to access the task for the run you reported issues with?
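Inspecting the run with the client might look roughly like this (a sketch only; the flow name and run id are placeholders for the run that hit the 500):

```python
from metaflow import Run, namespace

namespace(None)  # look across user namespaces if necessary

run = Run("TrainingFlow/1234")   # placeholder flow/run id
task = run["start"].task

print(task.successful)     # whether the step registered as completed
print(task.metadata_dict)  # task metadata recorded by the service
print(task.data)           # artifacts stored for the task
print(task.stdout)         # captured stdout for the task
print(task.stderr)         # captured stderr for the task
```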
Hey @ferras appreciate the help!
gives the start step as expected
For what it's worth – I'm able to load all of the task artifacts from within
The task metadata is accessible:
The task stdout/stderr seems to point towards the culprit:
Looking at the Batch job, it shows successful completion on the first attempt without getting kicked off its spot instance (managed, spot compute environment).
Interesting since it seems like it had already finished the Batch job for the first task and died while starting the next task.
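As a side note, the single-attempt completion can also be confirmed from the Batch API directly (a quick sketch with a placeholder job id):

```python
import boto3

batch = boto3.client("batch")
resp = batch.describe_jobs(jobs=["<batch-job-id>"])  # placeholder job id
for job in resp["jobs"]:
    print(job["jobName"], job["status"])
    for attempt in job.get("attempts", []):
        # exit code and reason live under each attempt's container section
        container = attempt.get("container", {})
        print(container.get("exitCode"), attempt.get("statusReason"))
```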
Re-running the flow using
I do think you're right that this is more likely related to either metaflow's runtime or potentially how these flows are being executed as subprocesses in
Another interesting observation after repeating the process, this time getting through 244 flow runs in ~12hrs before an error occurred – the end step was submitted and finished successfully on Batch but the runtime bombed after starting it.
From cloudwatch logs, captured stdout from the subprocess flow execution:
From cloudwatch logs, captured stderr from the subprocess flow execution:
For comparison, here's stdout from metaflow for the step
stderr from metaflow for
What's really odd is that the Batch job for the end step was successfully submitted without showing up in the captured stdout logs and ran to completion. That's what is making me think something is going awry with the metaflow runtime (maybe due to being executed via
Curious what you all think, and again, really appreciate the help.
@ferras I have yet to see the behavior outside of running on batch.
Right now I have a couple of tests running to try to narrow things down a bit more:
I've got all three of these running overnight and will follow up soon.
Hey guys, I've still been digging into this on my end and here's what I've found so far after letting these run for a while:
When not using batch (with or without the mocked optimizer) and just running sleep statements, it works without errors for thousands of runs. FWIW this still uses S3 as a datastore and the metadata service.
When using batch for the task executions (still just sleep statements; roughly sketched at the end of this comment), it eventually hits:
The error seems to eventually occur as part of starting the end task from within the flow execution.
Like before, it does actually kick off the batch job and it finishes successfully despite the flow execution tanking.
While doing these tests I also managed to hit the default service limit of CloudWatch Logs'
All of that seems fine though, and within the batch job/task it'll retry 5 times, independent of boto's retrying, before failing the task (still allowing batch retries on top from the flow's
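For reference, the Batch variant of the sleep test flow is shaped roughly like this (a simplified sketch; step names, resource values, and retry counts are illustrative, not the exact test code):

```python
import time

from metaflow import FlowSpec, batch, retry, step


class SleepFlow(FlowSpec):

    @retry(times=3)
    @batch(cpu=1, memory=1000)
    @step
    def start(self):
        # Stand-in for a model training evaluation.
        time.sleep(60)
        self.next(self.end)

    @retry(times=3)
    @batch(cpu=1, memory=1000)
    @step
    def end(self):
        time.sleep(60)


if __name__ == "__main__":
    SleepFlow()
```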