
Fixing directory issue of local metadata provider with AWS Batch. #141

Merged: 7 commits into Netflix:master on Mar 4, 2020

Conversation

valayDave (Collaborator):

There is an issue with syncing metadata for the local metadata provider when using AWS Batch.

I am using the local metadata provider with execution on Batch and the S3 datastore. I was able to complete the flow successfully on Batch and was trying to analyze the results in a notebook, but when I access properties of the run such as run.data, it throws an error.

I am using https://github.com/valayDave/mnist-experiments-with-metaflow/blob/without_conda_test/hello_mnist.py as the flow file and running the following command: python hello_mnist.py --with batch:cpu=2,memory=4000,image=tensorflow/tensorflow:latest-py3 run --num_training_examples 1000.

The odd thing is that there are two nested .metaflow folders (.metaflow/.metaflow) in my directory, and both of them hold the same flows and runs. When I moved the data under .metaflow/.metaflow/FLOWNAME/RUN_ID/STEP_NAME/TASK_ID/_meta to .metaflow/FLOWNAME/RUN_ID/STEP_NAME/TASK_ID/_meta and executed my notebook, it worked perfectly fine. I am using version 2.0.2.

I investigated and found that there is a data sync after each step, which pulls the metadata from S3 back to the client and performs a copytree operation.
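For illustration, here is a minimal sketch of what that sync amounts to; the function name, tarball path, and temp directory below are hypothetical, not Metaflow's actual code:

```python
import shutil
import tarfile

def sync_metadata_from_batch(tarball_path, local_root=".metaflow"):
    # Hypothetical sketch only: unpack the metadata tarball that was
    # fetched from S3 after the step finished ...
    with tarfile.open(tarball_path) as tar:
        tar.extractall("/tmp/metaflow_metadata_sync")
    # ... and copy its tree into the local metadata root. Whatever
    # directory layout existed on the Batch instance is reproduced
    # verbatim on the local machine, which is why an extra nesting
    # level created on Batch shows up locally too.
    shutil.copytree("/tmp/metaflow_metadata_sync", local_root,
                    dirs_exist_ok=True)  # requires Python 3.8+
```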

On Batch, because METAFLOW_DATASTORE_SYSROOT_LOCAL is set to DATASTORE_LOCAL_DIR (which is DATASTORE_LOCAL_DIR = '.metaflow' in metaflow_config), the flow-related files are created under .metaflow/.metaflow. So when the data is tarred, untarred back on the local client, and copied with copytree, it creates a new .metaflow under the .metaflow folder on the local machine.

Removing METAFLOW_DATASTORE_SYSROOT_LOCAL fixed the problem.
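To make the nesting concrete, here is a minimal sketch of the root-resolution behaviour described above; local_datastore_root is a hypothetical helper, not Metaflow's actual resolution logic:

```python
import os

DATASTORE_LOCAL_DIR = ".metaflow"  # default in metaflow_config

def local_datastore_root(sysroot_local=None):
    # Hypothetical helper: when METAFLOW_DATASTORE_SYSROOT_LOCAL is set,
    # DATASTORE_LOCAL_DIR is placed underneath it; otherwise
    # DATASTORE_LOCAL_DIR is used directly.
    if sysroot_local:
        return os.path.join(sysroot_local, DATASTORE_LOCAL_DIR)
    return DATASTORE_LOCAL_DIR

# On the Batch instance the variable was set to '.metaflow', so metadata
# was written to the nested path that later gets copied back locally:
print(local_datastore_root(".metaflow"))  # .metaflow/.metaflow
# With the variable removed, metadata lands where the local client
# expects it:
print(local_datastore_root())             # .metaflow
```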

@@ -145,7 +145,6 @@ def launch_job(
.environment_variable('METAFLOW_USER', attrs['metaflow.user']) \
.environment_variable('METAFLOW_SERVICE_URL', BATCH_METADATA_SERVICE_URL) \
.environment_variable('METAFLOW_SERVICE_HEADERS', json.dumps(BATCH_METADATA_SERVICE_HEADERS)) \
.environment_variable('METAFLOW_DATASTORE_SYSROOT_LOCAL', DATASTORE_LOCAL_DIR) \
Collaborator:

Can you add a comment to the code outlining why METAFLOW_DATASTORE_SYSROOT_LOCAL isn't needed here, so that it doesn't get added inadvertently later on? cc - @romain-intel.

valayDave (Collaborator, Author):

Yes, I can update the PR by adding a comment on top.

metaflow/plugins/aws/batch/batch.py (outdated review thread)
# metadata (provided in task_finished in batch_decorator.py and in _sync_metadata in batch_cli.py) assumes that DATASTORE_LOCAL_DIR is where the metadata is stored on the remote batch instance.
# This is the default value if METAFLOW_DATASTORE_SYSROOT_LOCAL is NOT set;
# if it is, the resulting path is different (see get_datastore_root_from_config in datastore/local.py)
"""
valayDave (Collaborator, Author):

@savingoyal Done.

Collaborator:

@valayDave Just made a few minor changes to the comment. We are good to push this PR now.

savingoyal merged commit 927f552 into Netflix:master on Mar 4, 2020.