
500 encountered from metaflow service #7

Open · ferras opened this issue Feb 10, 2020 · 10 comments

@ferras (Contributor) commented Feb 10, 2020

@russellbrooks reported the following issue:

Metaflow service error:
Metadata request (/flows/TuningXGB/runs/572/steps/start/tasks/1821/metadata) failed (code 500): {"message":"Internal server error"}

For context, this was encountered as part of a hyperparameter tuning framework (each flow run is a model-training evaluation) after ~6hrs with 125 runs successfully completed. Everything is executed on Batch, with 12 variants trained in parallel, and those results are then fed back to a Bayesian optimizer to generate the next set.

The CloudWatch logs from Batch indicate that the step completed successfully, and the Metaflow service error was encountered on a separate on-demand EC2 instance that runs the optimizer and executes the flows using asyncio.create_subprocess_shell. Looking at API Gateway, the request rates seemed reasonable, and it's configured without throttling. RDS showed plenty of CPU credits and barely seemed fazed throughout the workload. Each run was executed with --retry, but this error seems to have short-circuited that logic and resulted in a hard stop.
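For reference, a minimal sketch of how a flow might be launched this way with asyncio (the flow file name tuning_flow.py and the --learning-rate parameter are placeholders, not taken from the actual framework):

import asyncio

async def run_flow(params: dict) -> int:
    # Launch one Metaflow run as a shell subprocess. The command line is
    # hypothetical; the real framework builds its own invocation.
    cmd = (
        "python tuning_flow.py run --with retry "
        f"--learning-rate {params['learning_rate']}"
    )
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    # A non-zero return code is how the orchestrator would notice a failed run,
    # e.g. the hard stop described above.
    return proc.returncode

async def train_variants(param_sets: list) -> list:
    # Train the 12 variants in parallel and collect their return codes.
    return await asyncio.gather(*(run_flow(p) for p in param_sets))

# Example: return_codes = asyncio.run(train_variants(next_param_sets))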

@ferras (Contributor, Author) commented Feb 10, 2020

@russellbrooks can you try using the provided Python client for the metadata service to access the task for the run you reported issues with?

Something like

from metaflow import namespace, Run
namespace(None)
run = Run('TuningXGB/572')

for step in run:
    print(step)
@russellbrooks commented Feb 10, 2020

Hey @ferras, appreciate the help!

Running:

from metaflow import namespace, Run
namespace(None)
run = Run('TuningXGB/572')

for step in run:
    print(step)

gives the start step as expected, Step('TuningXGB/572/start'). The flow terminated with the 500 service error before starting the next (end) step.

For what it's worth – I'm able to load all of the task artifacts from within step.task.data.

from metaflow import namespace, Run
namespace(None)
run = Run('TuningXGB/572')

for step in run:
    print(step)

    # Just testing that we can actually load the artifacts from S3.
    for artifact in step.task.data._artifacts:
        print(f"{artifact}: {step.task.data.__getattr__(artifact)}")
@ferras (Contributor, Author) commented Feb 10, 2020

Can you access the metadata associated with the task?

from metaflow import Step

Step('TuningXGB/572/start').task.metadata

You can also access the logs via

Step('TuningXGB/572/start').task.stdout
Step('TuningXGB/572/start').task.stderr
@russellbrooks commented Feb 11, 2020

The task metadata is accessible:

from metaflow import Step
Step('TuningXGB/572/start').task.metadata
[Metadata(name='attempt', value='0', created_at=1581380003965, type='attempt', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='attempt-done', value='0', created_at=1581380003965, type='attempt-done', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='code-package', value='{"ds_type": "s3", "sha": "9efdd934eb15e79f0da12b1f1e116df0fa5c7fc9", "location": "s3://......./TuningXGB/data/9e/9efdd934eb15e79f0da12b1f1e116df0fa5c7fc9"}', created_at=1581380003965, type='code-package', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='log_location_stderr', value='{"ds_type": "s3", "location": "s3://......./TuningXGB/572/start/1821/0.stderr.log", "attempt": 0}', created_at=1581380003965, type='log_path', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='log_location_stdout', value='{"ds_type": "s3", "location": "s3://......./TuningXGB/572/start/1821/0.stdout.log", "attempt": 0}', created_at=1581380003965, type='log_path', task=Task('TuningXGB/572/start/1821')),
 Metadata(name='origin-run-id', value='None', created_at=1581380003965, type='origin-run-id', task=Task('TuningXGB/572/start/1821'))]

The task stdout/stderr seem to point toward the culprit:

Step('TuningXGB/572/start').task.stdout

[KILLED BY ORCHESTRATOR]

Step('TuningXGB/572/start').task.stderr

[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting (status RUNNABLE)...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting (status STARTING)...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting (status RUNNING)...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Setting up task environment.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Downloading code package.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Code package downloaded.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task is starting.
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Using default logging configs, no overrides found.
... normal logs ...
[19da02fe-4dd6-4f72-88fe-460f61f1d217] Task finished with exit code 0.
[KILLED BY ORCHESTRATOR]

Looking at the Batch job, it shows successful completion on the first attempt without getting kicked off its spot instance (managed, spot compute environment).

@ferras (Contributor, Author) commented Feb 11, 2020

So [KILLED BY ORCHESTRATOR] implies that you lost the connection while your job was running, so the failure doesn't seem to have been a result of the metadata service. I would suggest re-running your flow using resume:

https://docs.metaflow.org/metaflow/debugging#how-to-use-the-resume-command

@russellbrooks commented Feb 11, 2020

Interesting, since it seems like it had already finished the Batch job for the first task and died while starting the next one.

Re-running the flow using resume --origin-run-id 572 does work as expected.

I do think you're right that this is more likely related either to Metaflow's runtime or to how these flows are being executed as subprocesses in asyncio.

Another interesting observation after repeating the process, this time getting through 244 flow runs in ~12hrs before an error occurred: the end step was submitted and finished successfully on Batch, but the runtime bombed after starting it.

From the CloudWatch logs, the captured stdout from the subprocess flow execution:

2020-02-11 07:02:01.889 Workflow starting (run-id 877):
2020-02-11 07:02:02.045 [877/start/3028 (pid 15173)] Task is starting.
2020-02-11 07:02:03.473 [877/start/3028 (pid 15173)] [d2820876-216b-4005-81f8-c7bfd452fb8a] Task is starting (status SUBMITTED)...
... first task logs ...
2020-02-11 07:03:14.694 [877/start/3028 (pid 15173)] [d2820876-216b-4005-81f8-c7bfd452fb8a] Task finished with exit code 0.
2020-02-11 07:03:15.456 [877/start/3028 (pid 15173)] Task finished successfully.
2020-02-11 07:03:15.797 [877/end/3031 (pid 15236)] Task is starting.
2020-02-11 07:03:17.366 [877/end/3031 (pid 15236)] list index out of range
2020-02-11 07:03:17.548 [877/end/3031 (pid 15236)] Task failed.
2020-02-11 07:03:17.548 This failed task will not be retried.

From the CloudWatch logs, the captured stderr from the subprocess flow execution:

Internal error:
The end step was not successful by the end of flow.

For comparison, here's the stdout from Metaflow for the end step (Step('TuningXGB/877/end').task.stdout):

list index out of range

The stderr from Metaflow (Step('TuningXGB/877/end').task.stderr) is empty.

What's really odd is that the Batch job for the end step was successfully submitted, without showing up in the captured stdout logs, and ran to completion. That's what makes me think something is going awry with the Metaflow runtime (maybe due to being executed via asyncio.create_subprocess_shell 🤷‍♂).

Curious what you all think and, again, I really appreciate the help.

@ferras (Contributor, Author) commented Feb 12, 2020

@russellbrooks do you see the same behavior when not running on Batch?

@savingoyal do you see any potential problems on the Batch side?

@russellbrooks commented Feb 12, 2020

@ferras I have yet to see the behavior outside of running on Batch.

Right now I have a couple of tests running to try to narrow things down a bit more:

  1. (everything local) Trimmed the flow's steps to just sleep(randint()). The Bayesian optimizer still runs when any of the flows finish, but I'm just giving it a random number to simulate the synchronous number crunching it normally does. The asyncio scheduler maintains 10 concurrent flow executions (roughly as sketched below). Currently 652 flow runs in.
  2. (everything local) Same as (1), but removing the optimizer completely and just using await asyncio.sleep(100) between launching new flow executions, to simulate a somewhat short-running, synchronous process (FWIW it's typically <30s). Currently 108 flow runs in.
  3. (flow runs on Batch, optimizer + asyncio scheduler local) How it's been run previously; just giving it another go. Currently 147 flow runs in.

I've got all three of these running overnight and will follow up soon.
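For reference, a minimal sketch of the kind of concurrency-limited asyncio scheduler described in (1) (the flow file sleep_flow.py and the command line are placeholders, not the actual test harness):

import asyncio

MAX_CONCURRENT_RUNS = 10  # test (1) keeps 10 flow executions in flight at a time

async def run_one(sem: asyncio.Semaphore) -> int:
    # Placeholder command; the real harness builds its own flow invocation.
    cmd = "python sleep_flow.py run --with retry"
    async with sem:
        proc = await asyncio.create_subprocess_shell(cmd)
        return await proc.wait()

async def scheduler(total_runs: int):
    sem = asyncio.Semaphore(MAX_CONCURRENT_RUNS)
    # Launch all runs; the semaphore caps how many subprocesses are alive at once.
    return await asyncio.gather(*(run_one(sem) for _ in range(total_runs)))

# Example: return_codes = asyncio.run(scheduler(652))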

ferras assigned ferras and savingoyal and unassigned ferras on Feb 13, 2020
@russellbrooks commented Feb 14, 2020

Hey guys, I've still been digging into this on my end, and here's what I've found so far after letting these run for a while:

When not using Batch (with or without the mocked optimizer) and just running sleep statements, it works without errors for thousands of runs. FWIW, this still uses S3 as the datastore and the metadata service.

When using Batch for the task executions (still just sleep statements), it eventually hits:
list index out of range

The error seems to eventually occur while starting the end task from within the flow execution.

2020-02-13 07:25:19.045 [4346/start/14142 (pid 26351)] Task is starting.
2020-02-13 07:25:19.633 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting (status RUNNABLE)...
2020-02-13 07:25:23.803 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting (status STARTING)...
2020-02-13 07:25:26.935 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting (status RUNNING)...
2020-02-13 07:25:33.279 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Setting up task environment.
2020-02-13 07:25:36.493 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Downloading code package.
2020-02-13 07:25:36.493 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Code package downloaded.
2020-02-13 07:25:47.057 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task is starting.
2020-02-13 07:25:47.057 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Sleeping for 8 sec.
2020-02-13 07:25:47.057 [4346/start/14142 (pid 26351)] [faf09017-46c7-45b3-bce3-4f5b656c05b6] Task finished with exit code 0.
2020-02-13 07:25:47.535 [4346/start/14142 (pid 26351)] Task finished successfully.
2020-02-13 07:25:47.726 [4346/end/14146 (pid 26445)] Task is starting.
2020-02-13 07:25:48.199 [4346/end/14146 (pid 26445)] list index out of range
2020-02-13 07:25:48.475 [4346/end/14146 (pid 26445)] Task failed.
2020-02-13 07:25:48.475 This failed task will not be retried.

Like before, it does actually kick off the Batch job, which finishes successfully despite the flow execution tanking.

Misc

While doing these tests, I also managed to hit the default service limit for CloudWatch Logs' GetLogEvents (10 req/s) from executing tons of these parallel flow executions on Batch 😅
stdout

2020-02-13 06:47:26.538 [3911/end/12842 (pid 1470)] ClientError('An error occurred (ThrottlingException) when calling the GetLogEvents operation (reached max retries: 4): Rate exceeded')    Batch error:
2020-02-13 06:47:26.538 [3911/end/12842 (pid 1470)] Task failed!
2020-02-13 06:47:26.851 [3911/end/12842 (pid 1470)] 
2020-02-13 06:47:27.038 [3911/end/12842 (pid 1470)] Task failed.
2020-02-13 06:47:27.038 Workflow failed.

stderr

Step failure:
    Step end (task-id 12842) failed.

All of that seems fine though; within the Batch job/task it'll retry 5 times, independent of boto's retrying, before failing the task (still allowing Batch retries on top from the flow's --with retry).

@savingoyal (Member) commented Feb 14, 2020

Thanks @russellbrooks. I have a few ideas that I will try out over the next few days to isolate the issue. This is indeed very helpful.
