500 encountered from metaflow service #7
Comments
@russellbrooks can you try using the provided Python client for the metadata service to access the task for the run you reported issues with? Something like the sketch below.
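The following is a guess at the shape of that lookup; the flow name, run ID, and task ID are placeholders for the failing run:

```python
from metaflow import Task

# Pathspec format: FlowName/run_id/step_name/task_id
task = Task('MyFlow/1234/start/5678')
print(task)             # confirms the task is reachable via the metadata service
print(task.successful)  # True if the task completed cleanly
```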
Hey @ferras, appreciate the help! Running a query like that gives the start step as expected. For what it's worth, I'm able to load all of the task artifacts from within the client as well.
Can you access the metadata associated with the task? You can also access the logs via the client.
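Presumably along these lines (placeholder IDs again):

```python
from metaflow import Task

task = Task('MyFlow/1234/start/5678')
print(task.metadata_dict)  # metadata registered with the service for this task
print(task.stdout)         # stdout captured from the task
print(task.stderr)         # stderr captured from the task
```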
The task metadata is accessible, and the task stdout/stderr seems to point towards the culprit. Looking at the Batch job, it shows successful completion on the first attempt without getting kicked off its spot instance (managed, spot compute environment).
So resume should help here: https://docs.metaflow.org/metaflow/debugging#how-to-use-the-resume-command
Interesting, since it seems like it had already finished the Batch job for the first task and died while starting the next task, so re-running the flow with resume should cover that. I do think you're right that this is more likely related to either Metaflow's runtime or potentially how these flows are being executed as subprocesses via asyncio.

Another interesting observation after repeating the process, this time getting through 244 flow runs in ~12hrs before an error occurred: the end step was submitted and finished successfully on Batch, but the runtime bombed after starting it. Comparing the captured stdout and stderr from the subprocess flow execution in CloudWatch against Metaflow's own stdout and stderr for the step, what's really odd is that the Batch job for the end step was successfully submitted without showing up in the captured stdout logs, and it ran till completion. That's what makes me think something is going awry with the Metaflow runtime (maybe due to being executed via asyncio subprocesses).

Curious what your thoughts are and, again, really appreciate the help.
@russellbrooks do you see the same behavior when not running on Batch? @savingoyal do you see any potential problems on the Batch side?
@ferras I have yet to see the behavior outside of running on Batch. Right now I have a few tests running to try to narrow things down a bit more. I've got all three of these running overnight and will follow up soon.
Hey guys, still been digging into this on my end, and here's what I've found so far after letting these run for a while:

When not using Batch (with or without the mocked optimizer) and just running sleep statements, it works without errors for thousands of runs. FWIW, this still uses S3 as the datastore and the metadata service.

When using Batch for the task executions (still just sleep statements), it eventually hits the 500 error. The error seems to occur as part of starting the end task from within the flow execution.
Like before, it does actually kick off the Batch job, and the job finishes successfully despite the flow execution tanking.

Misc

While doing these tests I also managed to hit the default service limit of CloudWatch Logs, which showed up in the captured stderr. All of that seems fine though: within the Batch job/task it'll retry 5 times, independent of boto's retrying, before failing the task (still allowing Batch retries on top from the flow's retry settings).
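For reference, botocore's own retry budget is configurable; here is a minimal sketch (illustrative only, not Metaflow's actual internal configuration) of raising it for CloudWatch Logs calls:

```python
import boto3
from botocore.config import Config

# Illustrative: raise botocore's built-in retry budget for CloudWatch Logs
# calls, on top of any application-level retrying.
config = Config(retries={"max_attempts": 10, "mode": "standard"})
logs = boto3.client("logs", config=config)
```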
Thanks @russellbrooks. I have a few ideas that I will try out over the next few days to isolate the issue. This is indeed very helpful.
@savingoyal This appears to be a similar error to what I am experiencing on AWS Batch, which I communicated recently in a Gitter thread. The main difference is the specific error that this issue is reporting.
This issue also arises when running many concurrent flow runs.
@jaklinger Thanks for commenting on this issue. That is exactly where I am experiencing the issue as well: many concurrent flow runs.
Thanks for the extra data points. I am actively looking into this issue.
@dpatschke Can you help me out with the version of Metaflow (and the service) that you are using? Also, does this always happen with AWS Batch only? If so, it could be an issue with the AWS Batch calls and not the service itself.
@dpatschke @jaklinger Do you run into the same issue when you use
I created Netflix/metaflow#507 to give this issue wider visibility.
Hi @savingoyal, thanks for looking into this, much appreciated! My compute environment is (I believe) the standard one prescribed by Metaflow via CloudFormation.
Regarding running

Extra info on my tasks

My Batch tasks produce very little output, and will run for up to 8 minutes. According to the relevant lines in the Metaflow source, one solution would be to avoid clashes between unique instance calls.
@jaklinger As of Metaflow
I worked with @dpatschke to triage this issue (many thanks!). It looks like AWS Batch refuses to send the correct response to the
@russellbrooks reported the following issue: a 500 encountered from the Metaflow service.

For context, this was encountered as part of a hyperparameter tuning framework (each flow run is a model training evaluation) after ~6hrs with 125 runs successfully completed. Everything is executed on Batch, with 12 variants being trained in parallel, then feeding those results back to a Bayesian optimizer to generate the next set.

The CloudWatch logs from Batch indicate that the step completed successfully, and the Metaflow service error was encountered on a separate on-demand EC2 instance that runs the optimizer and executes the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable, and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry, but this error seems to have short-circuited that logic and resulted in a hard stop.
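To make the execution setup concrete, here is a minimal sketch of that launch pattern under stated assumptions: the flow file name and the --alpha parameter are hypothetical, and `--with retry` stands in for however retries were actually enabled:

```python
import asyncio


async def run_flow(variant: int) -> int:
    # Each model-training evaluation is one Metaflow run in its own subprocess.
    # "training_flow.py" and "--alpha" are placeholder names for illustration.
    proc = await asyncio.create_subprocess_shell(
        f"python training_flow.py run --with retry --alpha {variant}",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return proc.returncode


async def main() -> None:
    # 12 variants trained in parallel, as described above.
    codes = await asyncio.gather(*(run_flow(v) for v in range(12)))
    print(codes)


asyncio.run(main())
```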