
Task Factory File/Blob Enumeration is Slow #303

Closed
JadenLy opened this issue Aug 29, 2019 · 14 comments

@JadenLy

JadenLy commented Aug 29, 2019

Problem Description

I have been using Batch Shipyard for a while. One issue I have noticed is the upload speed: it takes a long time to complete the job submission process.

Batch Shipyard Version

3.7.0

Steps to Reproduce

For my job, I used a jobs.yaml similar to the following:

job_specifications:
- id: eval
  default_working_dir: container
  auto_complete: false
  infiniband: true
  priority: 0
  environment_variables:
    PYTHONPATH: '/opt/models/research:/opt/models/research/slim:/home/caffe/caffe:/home/caffe/caffe/LoggingInfra/AppLog'
    LANG: 'en_US.UTF-8'
    LC_ALL: 'en_US.UTF-8'
  gpu: true
  data_volumes:
  - contdatavol_data
  - contdatavol_caffe
  allow_run_on_missing_image: true
  remove_container_after_exit: true
  merge_task:
    id: null
    docker_image: clobotics.azurecr.io/mlman:gpu-0ed4b4e

    command: bash -c "cd /home/caffe/caffe/; bash examples/eval/combine_eval_pgus.sh"

    resource_files:
    - blob_source: some file in blob
      file_path: local path for the file

  tasks:
  - id: null
    docker_image: clobotics.azurecr.io/mlman:gpu-0ed4b4e
    task_factory:
      file:
        azure_storage:
          storage_account_settings: mystorageaccount
          # remote_path should only use the container, as azure blob
          # doesn't really have a concept of directory; 
          remote_path: projects
          # filter the files using the entire path
          include:
          - 'shipyard_eval/pgus/sku_eval_file/*.csv' # iterate the files under the blob location
          is_file_share: false
        # this is required!!!
        task_filepath: file_name

    command: bash -c "cd /home/caffe/caffe/; bash examples/eval/eval.sh"

    resource_files:
    - blob_source: some file in blob
      file_path: local path for the file

Submitting the pool is pretty quick. Submitting a job is usually also quick when it does not iterate over files in blob storage, but in this case it takes a long time.

Expected Results

I would expect the submission to complete within about a minute.

Actual Results

It takes about 20 minutes or more to complete the process.


Additional Comments

Let me know if there is any additional information I can provide to help.

@alfpark
Collaborator

alfpark commented Aug 29, 2019

This is not unexpected if there are lots of blobs to enumerate. You can optimize your file-based task factory by pointing remote_path directly at the blob virtual directory, so the prefix filter is applied on the server side and most of the client-side filtering is eliminated:

    task_factory:
      file:
        azure_storage:
          storage_account_settings: mystorageaccount
          remote_path: projects/shipyard_eval/pgus/sku_eval_file
          include:
          - '*.csv'
          is_file_share: false
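
To make the difference concrete, here is a minimal sketch of the two listing strategies using the azure-storage-blob v12 SDK. This is for illustration only; it is not the exact code path Batch Shipyard uses internally, and the connection string is a placeholder.

# Illustrative sketch only (azure-storage-blob v12 SDK; not necessarily what
# Batch Shipyard uses internally). Shows why a server-side prefix beats a
# client-side glob when the container holds many blobs.
import fnmatch
from azure.storage.blob import ContainerClient

conn_str = "<storage account connection string>"  # placeholder
container = ContainerClient.from_connection_string(conn_str, container_name="projects")

# Client-side filtering: every blob in the container is enumerated,
# then matched against the glob locally.
matched_slow = [
    b.name for b in container.list_blobs()
    if fnmatch.fnmatch(b.name, "shipyard_eval/pgus/sku_eval_file/*.csv")
]

# Server-side prefix: only blobs under the virtual directory are listed,
# so the glob only has to match the trailing file name.
matched_fast = [
    b.name for b in container.list_blobs(name_starts_with="shipyard_eval/pgus/sku_eval_file/")
    if fnmatch.fnmatch(b.name.rsplit("/", 1)[-1], "*.csv")
]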

@alfpark
Collaborator

alfpark commented Aug 29, 2019

The above should be the expected behavior; however, after reviewing the code, there is a defect preventing the prefix match from being applied as a server-side filter.
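
To sketch what the fix needs to do conceptually: derive the literal listing prefix from the configured remote_path so that only that subtree of the container is enumerated. The helper below is hypothetical and purely illustrative; it is not the actual Batch Shipyard code.

# Hypothetical helper, not the actual Batch Shipyard code: split a blob
# remote_path into the container name and a server-side listing prefix.
def split_remote_path(remote_path):
    container, _, virtual_dir = remote_path.strip("/").partition("/")
    prefix = virtual_dir + "/" if virtual_dir else None
    return container, prefix

container, prefix = split_remote_path("projects/shipyard_eval/pgus/sku_eval_file")
# container == "projects", prefix == "shipyard_eval/pgus/sku_eval_file/"
# prefix is handed to the blob listing call; the include glob ('*.csv')
# then only has to match the remaining file name on the client side.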

@alfpark alfpark added defect and removed question labels Aug 29, 2019
@alfpark alfpark self-assigned this Aug 29, 2019
@JadenLy
Author

JadenLy commented Aug 29, 2019

@alfpark Thanks for your suggestion. In my case, I only have around 10 files under that blob path. Would it still take such a long time to enumerate them?

@alfpark
Collaborator

alfpark commented Aug 29, 2019

The entire container only has 10 blobs total?

@JadenLy
Author

JadenLy commented Aug 29, 2019

@alfpark I mean I only have around 10 files to enumerate, which would create 10 tasks. The whole container contains around ten thousand files.

@alfpark
Collaborator

alfpark commented Aug 29, 2019

That is expected behavior in this case (with the defect).

I'll fix the prefix filter not being applied; once that's in, task generation with my suggested yaml above should be quick.

@JadenLy
Author

JadenLy commented Aug 29, 2019

@alfpark I tried your suggested yaml format. However, it looks like it is still enumerating the whole container (which is named projects), while I only want the csv files under projects/shipyard_eval/pgus/sku_eval_file. It eventually raised an error when generating the merge task for such a large number of tasks.

@alfpark
Collaborator

alfpark commented Aug 29, 2019

It's not fixed yet; the fix is coming right now.

@alfpark
Collaborator

alfpark commented Aug 29, 2019

Please check the devops build for the commit above; you can use the develop-cli Docker image to test once the build completes. Alternatively, check out the develop branch and resubmit your pool and jobs to test.

@alfpark alfpark changed the title Jobs Upload Speed Task Factory File/Blob Enumeration is Slow Aug 29, 2019
@JadenLy
Author

JadenLy commented Aug 29, 2019

Thanks @alfpark. I tried to submit the pool from the develop branch, but there is an error:

2019-08-29 18:03:01.337 INFO - creating container: shipyardgr-clobotics-sku-eval-1
2019-08-29 18:03:01.875 INFO - creating table: shipyardimages
2019-08-29 18:03:02.346 INFO - creating table: shipyardslurm
2019-08-29 18:03:02.498 INFO - creating table: shipyardgr
2019-08-29 18:03:02.652 INFO - creating container: shipyardrf-clobotics-sku-eval-1
2019-08-29 18:03:02.806 INFO - creating table: shipyarddht
2019-08-29 18:03:02.960 INFO - deleting blobs: shipyardgr-clobotics-sku-eval-1
2019-08-29 18:03:03.912 DEBUG - clearing table (pk=clobotics$sku-eval-1): shipyardimages
2019-08-29 18:03:04.539 DEBUG - clearing table (pk=clobotics$sku-eval-1): shipyardgr
2019-08-29 18:03:04.881 INFO - deleting blobs: shipyardrf-clobotics-sku-eval-1
2019-08-29 18:03:05.443 DEBUG - clearing table (pk=clobotics$sku-eval-1): shipyardperf
2019-08-29 18:03:05.522 DEBUG - clearing table (pk=clobotics$sku-eval-1): shipyarddht
2019-08-29 18:03:05.598 DEBUG - autoscale enabled: True
2019-08-29 18:03:05.599 DEBUG - no virtual network settings specified
2019-08-29 18:03:05.599 DEBUG - no public ips settings specified
Traceback (most recent call last):
  File "/home/caffe/batch-shipyard/shipyard.py", line 3134, in <module>
    cli()
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/caffe/batch-shipyard/.shipyard/lib/python3.5/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/caffe/batch-shipyard/shipyard.py", line 1544, in pool_add
    ctx.table_client, ctx.keyvault_client, ctx.config, recreate)
  File "/home/caffe/batch-shipyard/convoy/fleet.py", line 3356, in action_pool_add
    batch_client, blob_client, keyvault_client, config
  File "/home/caffe/batch-shipyard/convoy/fleet.py", line 1804, in _add_pool
    batch_client, blob_client, keyvault_client, config)
  File "/home/caffe/batch-shipyard/convoy/fleet.py", line 1329, in _construct_pool_object
    batch_client, config, pool_settings)
  File "/home/caffe/batch-shipyard/convoy/fleet.py", line 909, in _pick_node_agent_for_vm
    batch_client, config, publisher, offer, sku)
  File "/home/caffe/batch-shipyard/convoy/batch.py", line 352, in get_node_agent_for_image
    images = batch_client.account.list_supported_images()
AttributeError: 'AccountOperations' object has no attribute 'list_supported_images'

My pool.yaml is the following:

pool_specification:
  id: sku-eval-1
  vm_configuration:
    platform_image:
      offer: UbuntuServer
      publisher: Canonical
      sku: 16.04-LTS
      native: true
  vm_size: STANDARD_NC6
  vm_count:
    dedicated: 0
    low_priority: 0
  max_tasks_per_node: 1
  autoscale:
    evaluation_interval: 00:05:00
    scenario:
      name: pending_tasks
      maximum_vm_count:
        dedicated: 0
        low_priority: 30
      node_deallocation_option: taskcompletion
  ssh:
    username: caffe
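
Possibly relevant (my own guess, not verified): list_supported_images only exists in newer releases of the azure-batch Python SDK, so this error may simply mean the SDK inside the .shipyard virtualenv is older than what the develop branch expects. A quick way to check the installed version:

# Hedged diagnostic sketch, not maintainer-confirmed: print the azure-batch
# SDK version installed in the shipyard virtualenv. If it is older than the
# version pinned by the develop branch's requirements.txt, reinstalling the
# requirements should clear the AttributeError.
import pkg_resources

print(pkg_resources.get_distribution("azure-batch").version)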

@alfpark
Collaborator

alfpark commented Aug 29, 2019

@JadenLy
Author

JadenLy commented Aug 29, 2019

@alfpark Thanks, I have fixed this issue. However, my task submission still gave me the following output:

2019-08-29 18:34:14.071 DEBUG - 10000 tasks collated so far

My config for the task factory looks like:

task_factory:
  file:
    azure_storage:
      storage_account_settings: mystorageaccount
      # remote_path should only use the container, as azure blob
      # doesn't really have a concept of directory
      remote_path: projects/shipyard_eval/yizi/sku_eval_file/
      # filter the files using the entire path
      include:
      - '*.csv'
      is_file_share: false
    # this is required!!!
    task_filepath: file_name

@JadenLy
Author

JadenLy commented Aug 29, 2019

Sorry, I forgot to switch back to the develop branch after making the modification. The submission is really quick now. Thanks! When will this fix land in the main branch?

@alfpark
Collaborator

alfpark commented Aug 29, 2019

@JadenLy, thanks for testing! It'll be rolled up into the next hotfix release.
