500 encountered from metaflow service #7

Closed
ferras opened this issue Feb 10, 2020 · 20 comments


ferras commented Feb 10, 2020

@russellbrooks reported the following issue

Metaflow service error:
Metadata request (/flows/TuningXGB/runs/572/steps/start/tasks/1821/metadata) failed (code 500): {"message":"Internal server error"}

For context, this was encountered as part of a hyperparameter tuning framework (each flow run is a model training evaluation) after ~6hrs with 125 runs successfully completed. Everything is executed on Batch with 12 variants being trained in parallel, then feeding those results back to a Bayesian optimizer to generate the next batch of variants.

The CloudWatch logs from Batch indicate that the step completed successfully, and the Metaflow service error was encountered on a separate on-demand EC2 instance that's running the optimizer and executes the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry, but this error seems to have short-circuited that logic and resulted in a hard stop.
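For reference, the orchestrator launches each run roughly like this (a hypothetical sketch; the script name, parameter flags, and decorators are assumptions, not the actual framework code):

import asyncio

async def run_flow(params: str) -> int:
    # Launch one Metaflow run as a subprocess and wait for it to finish.
    # The flow file name and parameter flags here are placeholders.
    proc = await asyncio.create_subprocess_shell(
        f"python tuning_xgb.py run --with batch --with retry {params}",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    await proc.communicate()
    return proc.returncode

async def main():
    # 12 variants trained in parallel, as described above.
    variants = [f"--learning-rate {0.01 * (i + 1)}" for i in range(12)]
    codes = await asyncio.gather(*(run_flow(v) for v in variants))
    print(codes)

asyncio.run(main())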


ferras commented Feb 10, 2020

@russellbrooks can you try using the provided Python client for the metadata service to access the task for the run you reported issues with?

Something like

from metaflow import namespace, Run
namespace(None)
run = Run('TuningXGB/572')

for step in run:
    print(step)

@russellbrooks

Hey @ferras, appreciate the help!

Running:

from metaflow import namespace, Run
namespace(None)
run = Run('TuningXGB/572')

for step in run:
    print(step)

gives the start step as expected: Step('TuningXGB/572/start'). The flow terminated with the 500 service error before starting the next (end) step.

For what it's worth, I'm able to load all of the task artifacts from within step.task.data.

from metaflow import namespace, Run
namespace(None)
run = Run('TuningXGB/572')

for step in run:
    print(step)

    # Just testing that we can actually load the artifacts from S3.
    for artifact in step.task.data._artifacts:
        print(f"{artifact}: {getattr(step.task.data, artifact)}")


ferras commented Feb 10, 2020

can you access the metadata associated with the task?

Step('TuningXGB/572/start').task.metadata

you can also access the logs via

Step('TuningXGB/572/start').task.stdout
Step('TuningXGB/572/start').task.stderr

@russellbrooks

The task metadata is accessible:

from metaflow import Step
Step('TuningXGB/572/start').task.metadata
[Metadata(name='attempt', value='0', created_at=1581380003965, type='attempt', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='attempt-done', value='0', created_at=1581380003965, type='attempt-done', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='code-package', value='{"ds_type": "s3", "sha": "9efdd934eb15e79f0da12b1f1e116df0fa5c7fc9", "location": "s3://......./TuningXGB/data/9e/9efdd934eb15e79f0da12b1f1e116df0fa5c7fc9"}', created_at=1581380003965, type='code-package', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='log_location_stderr', value='{"ds_type": "s3", "location": "s3://......./TuningXGB/572/start/1821/0.stderr.log", "attempt": 0}', created_at=1581380003965, type='log_path', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='log_location_stdout', value='{"ds_type": "s3", "location": "s3://......./TuningXGB/572/start/1821/0.stdout.log", "attempt": 0}', created_at=1581380003965, type='log_path', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='origin-run-id', value='None', created_at=1581380003965, type='origin-run-id', task=Task('TuningXGB/572/start/1821'))]

The task stdout/stderr seems to point towards the culprit:

Step('TuningXGB/572/start').task.stdout

[KILLED BY ORCHESTRATOR]

Step('TuningXGB/572/start').task.stderr

[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting (status RUNNABLE)...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting (status STARTING)...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting (status RUNNING)...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Setting up task environment.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Downloading code package.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Code package downloaded.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Using default logging configs, no overrides found.
... normal logs ...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task finished with exit code 0.
[KILLED BY ORCHESTRATOR]

Looking at the Batch job, it shows successful completion on the first attempt without getting kicked off its spot instance (managed, spot compute environment).


ferras commented Feb 11, 2020

So [KILLED BY ORCHESTRATOR] implies that the connection was lost while your job was running; the failure doesn't seem to have been a result of the metadata service. I would suggest re-running your flow using resume:

https://docs.metaflow.org/metaflow/debugging#how-to-use-the-resume-command
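For example, something like the following (assuming the flow script is named tuning_xgb.py; substitute your own file name and run id):

python tuning_xgb.py resume --origin-run-id 572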

@russellbrooks

Interesting, since it seems like it had already finished the Batch job for the first task and died while starting the next one.

Re-running the flow using resume --origin-run-id 572 does work as expected.

I do think you're right that this is more likely related to either Metaflow's runtime or potentially how these flows are being executed as subprocesses in asyncio.

Another interesting observation after repeating the process (this time getting through 244 flow runs in ~12hrs before an error occurred): the end step was submitted and finished successfully on Batch, but the runtime bombed after starting it.

From CloudWatch logs, the captured stdout from the subprocess flow execution:

2020-02-11 07:02:01.889 Workflow starting (run-id 877):
2020-02-11 07:02:02.045 [877/start/3028 (pid 15173)] Task is starting.
2020-02-11 07:02:03.473 [877/start/3028 (pid 15173)] [d2820876-216b-4005-81f8-c7bfd452fb8a] Task is starting (status SUBMITTED)...
... first task logs ...
2020-02-11 07:03:14.694 [877/start/3028 (pid 15173)] [d2820876-216b-4005-81f8-c7bfd452fb8a] Task finished with exit code 0.
2020-02-11 07:03:15.456 [877/start/3028 (pid 15173)] Task finished successfully.
2020-02-11 07:03:15.797 [877/end/3031 (pid 15236)] Task is starting.
2020-02-11 07:03:17.366 [877/end/3031 (pid 15236)] list index out of range
2020-02-11 07:03:17.548 [877/end/3031 (pid 15236)] Task failed.
2020-02-11 07:03:17.548 This failed task will not be retried.

From CloudWatch logs, the captured stderr from the subprocess flow execution:

Internal error:
The end step was not successful by the end of flow.

For comparison, here's stdout from Metaflow for the end step (Step('TuningXGB/877/end').task.stdout):

list index out of range

stderr from Metaflow (Step('TuningXGB/877/end').task.stderr) is empty.

What's really odd is that the Batch job for the end step was successfully submitted without showing up in the captured stdout logs, and it ran to completion. That's what is making me think something is going awry with the Metaflow runtime (maybe due to the flows being executed via asyncio.create_subprocess_shell 🤷‍♂).

Curious what you all think, and again, really appreciate the help.


ferras commented Feb 12, 2020

@russellbrooks do you see the same behavior when not running on batch?

@savingoyal do you see any potential problems on the batch side?

@russellbrooks

@ferras I have yet to see the behavior outside of running on batch.

Right now I have a couple of tests running to try to narrow things down a bit more:

  1. (everything local) Trimmed the flow's steps to just sleep(randint()) (see the sketch below). The Bayesian optimizer still runs when any of the flows finish, but I'm just giving it a random number to simulate the number crunching that normally occurs synchronously. The asyncio scheduler maintains 10 concurrent flow executions. Currently 652 flow runs in.
  2. (everything local) Same as (1), but removing the optimizer completely and just using await asyncio.sleep(100) between making the new flow executions to simulate a somewhat short-running, synchronous process (FWIW it's typically <30s). Currently 108 flow runs in.
  3. (flow runs on batch, optimizer + asyncio scheduler local): How it's been run previously, just giving it another go. Currently 147 flow runs in.

I've got all three of these running overnight and will follow up soon.
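For reference, the trimmed flow used in (1) and (2) is roughly of this shape (a minimal sketch; the class name and sleep bounds are placeholders rather than the exact flow):

from random import randint
from time import sleep

from metaflow import FlowSpec, step

class SleepFlow(FlowSpec):
    # Stand-in for the tuning flow: each step just sleeps for a few seconds.

    @step
    def start(self):
        sleep(randint(1, 10))  # simulate the training step
        self.next(self.end)

    @step
    def end(self):
        sleep(randint(1, 10))  # simulate the final step the runtime fails to start

if __name__ == '__main__':
    SleepFlow()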

ferras assigned ferras and savingoyal and unassigned ferras on Feb 13, 2020
@russellbrooks

Hey guys, I've still been digging into this on my end and here's what I've found so far after letting these run for a while:

When not using Batch (with or without the mocked optimizer) and just running sleep statements, it works without errors for thousands of runs. FWIW this still uses S3 as the datastore along with the metadata service.

When using batch for the task executions (still just sleep statements), it eventually hits:
list index out of range

The error seems to eventually occur as part of starting the end task from within the flow execution.

2020-02-13 07:25:19.045 [4346/start/14142 (pid 26351)] Task is starting.
2020-02-13 07:25:19.633 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting (status RUNNABLE)...
2020-02-13 07:25:23.803 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting (status STARTING)...
2020-02-13 07:25:26.935 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting (status RUNNING)...
2020-02-13 07:25:33.279 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Setting up task environment.
2020-02-13 07:25:36.493 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Downloading code package.
2020-02-13 07:25:36.493 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Code package downloaded.
2020-02-13 07:25:47.057 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting.
2020-02-13 07:25:47.057 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Sleeping for 8 sec.
2020-02-13 07:25:47.057 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task finished with exit code 0.
2020-02-13 07:25:47.535 [4346/start/14142 (pid 26351)] Task finished successfully.
2020-02-13 07:25:47.726 [4346/end/14146 (pid 26445)] Task is starting.
2020-02-13 07:25:48.199 [4346/end/14146 (pid 26445)] list index out of range
2020-02-13 07:25:48.475 [4346/end/14146 (pid 26445)] Task failed.
2020-02-13 07:25:48.475 This failed task will not be retried.

Like before, it does actually kick off the Batch job, which finishes successfully despite the flow execution tanking.

Misc

While doing these tests I also managed to hit the default service limit for CloudWatch Logs' GetLogEvents (10 req/s) by executing tons of these parallel flow executions on Batch 😅.
stdout

2020-02-13 06:47:26.538 [3911/end/12842 (pid 1470)] ClientError('An error occurred (ThrottlingException) when calling the GetLogEvents operation (reached max retries: 4): Rate exceeded')    Batch error:
2020-02-13 06:47:26.538 [3911/end/12842 (pid 1470)] Task failed!
2020-02-13 06:47:26.851 [3911/end/12842 (pid 1470)] 
2020-02-13 06:47:27.038 [3911/end/12842 (pid 1470)] Task failed.
2020-02-13 06:47:27.038 Workflow failed.

stderr

Step failure:
    Step end (task-id 12842) failed.

All of that seems fine though: within the Batch job/task it'll retry 5 times, independent of boto's retrying, before failing the task (still allowing Batch retries on top from the flow's --with retry).
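As an aside, boto's own retry ceiling can also be raised client-side (a hedged sketch of the general botocore mechanism, not something Metaflow configures for you; the log group and stream names are placeholders):

import boto3
from botocore.config import Config

# Absorb more transient ThrottlingException responses inside botocore itself.
config = Config(retries={'max_attempts': 10, 'mode': 'standard'})
logs = boto3.client('logs', config=config)

events = logs.get_log_events(
    logGroupName='/aws/batch/job',     # default AWS Batch log group
    logStreamName='example-stream',    # placeholder stream name
    startFromHead=True,
)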

@savingoyal

Thanks @russellbrooks. I have a few ideas that I will try out over the next few days to isolate the issue. This is indeed very helpful.

@dpatschke

@savingoyal This appears to be a similar error to what I am experiencing on AWS Batch and which I communicated recently on the following Gitter thread:
https://gitter.im/metaflow_org/community?at=6088af4bda3e853b33d9096b

The main difference is that this issue is reporting the list index out of range error at the end step rather than in the middle of a 'fan-out' step (which is where I am experiencing the problem).

@jaklinger

This issue also arises when running many concurrent foreach branches on Batch. We didn't use to get this error with --max-workers=100, but for the last couple of months we have, so we reduced to --max-workers=20, which "resolves" our situation for now (but I would really prefer to run --max-workers=100 😄).

@dpatschke

@jaklinger Thanks for commenting on this issue. That is exactly where I am experiencing the issue... concurrent foreach branches being run on AWS Batch. I also didn't see it previously. In my case, I'm just running 16 workers at a time, but I'm probably running about 60 total within a foreach fan-out.

@savingoyal This issue appeared again for me last night. It's happening about once every three or four days now. Below is the status update from shortly after the list index error appeared in the logs:

2021-05-04 23:58:31.723 16 tasks are running: e.g. ....
2021-05-04 23:58:31.723 11 tasks are waiting in the queue.
2021-05-04 23:58:31.723 0 steps are pending: e.g. ....

@savingoyal

Thanks for the extra data points. I am actively looking into this issue.

@savingoyal

@dpatschke Can you help me out with the version of Metaflow (and the service) that you are using? Also, does this always happen with AWS Batch only? If so, it could be an issue with the AWS Batch calls and not the service itself.

@savingoyal

@dpatschke @jaklinger Do you run into the same issue when you use METAFLOW_DEFAULT_DATASTORE=local python flow.py run --with batch? I have had a few canaries bombarding the Metaflow service and AWS Batch since last evening and have yet to run into the issue 😞 Would you mind sharing what your Batch compute environment looks like and whether you have any non-standard AWS API call limits in place?

@savingoyal

I created Netflix/metaflow#507 to provide wider visibility into this issue.


jaklinger commented May 6, 2021

Hi @savingoyal - thanks for looking into this, much appreciated!

My compute environment is (I believe) the standard one prescribed by Metaflow's CloudFormation template:

Provisioning model: EC2
Minimum vCPUs: 0
Desired vCPUs: 0
Maximum vCPUs: 64
Spot fleet role: --
EC2 keypair: ***
Instance types m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, r5d.4xlarge
Instance role arn:aws:iam::******:instance-profile/nesta-metaflow-ECSInstanceProfile-LS7YJ38EBANG
Maximum % on-demand price: --
Allocation strategy: --

Regarding running METAFLOW_DEFAULT_DATASTORE=local python flow.py run --with batch: I've found that it produces the same outcome with --max-workers=100.

Extra info on my tasks

My Batch tasks produce very little output and will run for up to 8 minutes. According to these lines in batch_client, producing zero output leads to persistent polling of CloudWatch Logs. Split over my 100 max-workers, that leads to 100 polls per second, which is probably what causes the GetLogEvents rate limit error, but I would assume this issue wouldn't surface as quickly for tasks which are a) shorter and b) more verbose.

One solution would be to avoid clashes between the per-instance calls to batch_client.BatchLogs.get_events, ensuring that max_workers / _sleep_on_no_data ~< 10 (10 req/s being the global log-request limit), or otherwise to dynamically set the polling interval _sleep_on_no_data so that the limit is respected (i.e. _sleep_on_no_data = max_workers/10 + random_jitter()). This could lead to fairly long polling intervals (e.g. every 10 seconds), but in this case that's unavoidable due to AWS's hard limit. Other quick/static alternatives could be to increase the default _sleep_on_no_data from 1 to 10 (or more), or to impose a maximum on max_workers when using Batch.
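A minimal sketch of that dynamic polling idea (hypothetical code; _sleep_on_no_data and the 10 req/s figure come from the discussion above, everything else is made up for illustration):

import random

GET_LOG_EVENTS_LIMIT = 10  # requests per second (AWS default account limit)

def sleep_on_no_data(max_workers: int) -> float:
    # Seconds each worker should wait between empty polls so that the fleet
    # stays at or below ~10 GetLogEvents calls per second in aggregate.
    jitter = random.uniform(0.0, 1.0)
    return max(1.0, max_workers / GET_LOG_EVENTS_LIMIT) + jitter

# With --max-workers=100 each worker polls roughly every 10-11 seconds.
print(sleep_on_no_data(100))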

@savingoyal

@jaklinger As of Metaflow 2.2.10, we no longer rely on AWS CloudWatch for log fetching. I suspect that if the issue still persists with newer versions of Metaflow, we might be running into throttling issues with other AWS Batch API calls. I will put out a debugging PR with more verbose logging to collect more information.

@savingoyal

I worked with @dpatschke to triage this issue (many thanks!). It looks like, for some requests, AWS Batch refuses to send the correct response to the describe_jobs API call we make to ascertain job status (the response code is still 200). Upgrading to Metaflow 2.3.0 should address this issue: Netflix/metaflow#543.
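For illustration, the failure mode is consistent with a describe_jobs response whose jobs list is empty even though the HTTP status is 200; indexing it blindly would raise the list index out of range error seen above. A hedged, hypothetical sketch of the defensive pattern (not the actual patch in Netflix/metaflow#543):

import boto3

batch = boto3.client('batch')

def job_status(job_id: str):
    # Return the Batch job status, or None if the response omits the job.
    resp = batch.describe_jobs(jobs=[job_id])
    jobs = resp.get('jobs', [])
    if not jobs:
        # Degenerate 200 response: report "unknown" and let the caller retry
        # instead of crashing on resp['jobs'][0].
        return None
    return jobs[0]['status']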
