
Fixing directory issue of local metadataprovider with aws batch. #141

Merged · 7 commits · Mar 4, 2020

Conversation

@valayDave

valayDave commented Feb 25, 2020

There is an issue with syncing metadata for the local metadata provider when using AWS Batch.

I am using the local metadata provider, executing on Batch with the S3 datastore. I was able to complete the flow successfully on Batch and was trying to analyze the results in a notebook, but when I access properties of the run like run.data, it throws an error.

I am using https://github.com/valayDave/mnist-experiments-with-metaflow/blob/without_conda_test/hello_mnist.py as the file and running the following command: python hello_mnist.py --with batch:cpu=2,memory=4000,image=tensorflow/tensorflow:latest-py3 run --num_training_examples 1000.

The odd thing is that there are two nested .metaflow/.metaflow folders in my directory, both holding the same flows and runs. When I moved the data under .metaflow/.metaflow/FLOWNAME/RUN_ID/STEP_NAME/TASK_ID/_meta to .metaflow/FLOWNAME/RUN_ID/STEP_NAME/TASK_ID/_meta and re-ran my notebook, it worked perfectly fine. I am using version 2.0.2.

I found that after each step there is a data sync from S3 that brings the metadata back to the client and performs a copy-tree operation.

On Batch, because METAFLOW_DATASTORE_SYSROOT_LOCAL is set to DATASTORE_LOCAL_DIR (which is DATASTORE_LOCAL_DIR = '.metaflow' in metaflow_config), the flow-related files are created under .metaflow/.metaflow. So when the data is tarred on Batch, untarred back on the local client, and copied via copy tree, a new .metaflow is created under the .metaflow folder on the local machine.

Removing METAFLOW_DATASTORE_SYSROOT_LOCAL fixed the problem.
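To make the symptom concrete, here is a small, self-contained reproduction of the nesting effect described above. This is an illustrative sketch, not Metaflow's actual sync code: the directory layout, the `simulate_sync` helper, and the tar/untar shape are all assumptions chosen to mirror the reported behavior.

```python
import os
import tarfile
import tempfile

# Hypothetical sketch; names and sync shape are assumptions, not Metaflow code.
DATASTORE_LOCAL_DIR = '.metaflow'

def simulate_sync(sysroot_set):
    """Simulate tar-on-Batch / untar-on-client and return where (relative to
    the client root) the flow's metadata ends up."""
    with tempfile.TemporaryDirectory() as remote, \
         tempfile.TemporaryDirectory() as local:
        # With METAFLOW_DATASTORE_SYSROOT_LOCAL='.metaflow', the task on Batch
        # writes metadata one directory level deeper than the client expects.
        parts = [remote, DATASTORE_LOCAL_DIR]
        if sysroot_set:
            parts.append(DATASTORE_LOCAL_DIR)
        os.makedirs(os.path.join(*parts, 'FLOWNAME'))

        # The sync tars the remote '.metaflow' directory as-is ...
        tar_path = os.path.join(remote, 'mf.tgz')
        with tarfile.open(tar_path, 'w:gz') as tar:
            tar.add(os.path.join(remote, DATASTORE_LOCAL_DIR),
                    arcname=DATASTORE_LOCAL_DIR)

        # ... and the client untars it into its own working directory,
        # faithfully reproducing any nesting present on the remote side.
        with tarfile.open(tar_path) as tar:
            tar.extractall(local)

        for root, dirs, _ in os.walk(local):
            if 'FLOWNAME' in dirs:
                return os.path.relpath(os.path.join(root, 'FLOWNAME'), local)

print(simulate_sync(sysroot_set=True))   # .metaflow/.metaflow/FLOWNAME
print(simulate_sync(sysroot_set=False))  # .metaflow/FLOWNAME
```

The sketch shows why moving the files up one level (as described above) makes the notebook work: the client reads from `.metaflow/FLOWNAME/...`, while the sync delivered them to `.metaflow/.metaflow/FLOWNAME/...`.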

@@ -145,7 +145,6 @@ def launch_job(
.environment_variable('METAFLOW_USER', attrs['metaflow.user']) \
.environment_variable('METAFLOW_SERVICE_URL', BATCH_METADATA_SERVICE_URL) \
.environment_variable('METAFLOW_SERVICE_HEADERS', json.dumps(BATCH_METADATA_SERVICE_HEADERS)) \
.environment_variable('METAFLOW_DATASTORE_SYSROOT_LOCAL', DATASTORE_LOCAL_DIR) \

@savingoyal (Member) commented Mar 2, 2020
Can you add a comment to the code outlining why METAFLOW_DATASTORE_SYSROOT_LOCAL isn't needed here, so that it doesn't get added inadvertently later on? cc - @romain-intel.

@valayDave (Author) commented Mar 3, 2020
Yes, I can update the PR by putting a comment on top.

valayDave added 3 commits Mar 3, 2020

metaflow/plugins/aws/batch/batch.py:
# The metadata sync between the local datastore instance and the remote Batch instance
# (performed in task_finished in batch_decorator.py and in _sync_metadata in batch_cli.py)
# assumes that DATASTORE_LOCAL_DIR is where the metadata is stored on the remote Batch
# instance. This is the default value if METAFLOW_DATASTORE_SYSROOT_LOCAL is NOT set;
# if it is set, the resulting path is different (see get_datastore_root_from_config
# in datastore/local.py).
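A minimal sketch of the resolution behavior the comment describes. The `datastore_root` helper below is hypothetical (the real logic lives in `get_datastore_root_from_config` in datastore/local.py); it only illustrates how a sysroot of '.metaflow' can produce the nested path.

```python
import os

DATASTORE_LOCAL_DIR = '.metaflow'  # from metaflow_config

def datastore_root(env):
    """Hypothetical sketch of where metadata ends up on the Batch instance."""
    sysroot = env.get('METAFLOW_DATASTORE_SYSROOT_LOCAL')
    if sysroot is None:
        # Default: metadata goes directly into '.metaflow'.
        return DATASTORE_LOCAL_DIR
    # If the sysroot is set, metadata lands under <sysroot>/.metaflow, so
    # passing '.metaflow' as the sysroot yields the nested '.metaflow/.metaflow'
    # that the sync then copies back verbatim to the local client.
    return os.path.join(sysroot, DATASTORE_LOCAL_DIR)

print(datastore_root({}))
print(datastore_root({'METAFLOW_DATASTORE_SYSROOT_LOCAL': '.metaflow'}))
```

This is why dropping the environment variable in launch_job is sufficient: the remote task then falls back to the same default layout the client-side sync expects.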

@savingoyal (Member) commented Mar 4, 2020

@valayDave Just made a few minor changes to the comment. We are good to push this PR now.

@savingoyal savingoyal merged commit 927f552 into Netflix:master Mar 4, 2020
2 checks passed:
Test on ubuntu-latest
Test on macos-latest