
Refactored Dockerfiles #625

Closed
wants to merge 6 commits

Conversation

@borisfom (Collaborator) commented Sep 16, 2021

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

Description

Refactored the Dockerfiles into a single one with a build arg for the base image.
Added a script that builds both the .tar.gz and the wheel distributions; it is used in the Dockerfile.
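
A rough sketch of how such a parameterized build could be invoked; the build-arg name BASE, the base image tag, and the output tag are illustrative assumptions, not taken from this PR:

# Build against a chosen NGC PyTorch base by overriding a single build arg;
# "BASE", the 21.07 tag, and "trtorch:latest" are assumed names for illustration.
docker build -f docker/Dockerfile \
    --build-arg BASE=nvcr.io/nvidia/pytorch:21.07-py3 \
    -t trtorch:latest .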

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@borisfom (Collaborator, Author) commented:

Also, setup.py required pytorch < 1.10, which would make it try to downgrade pytorch in the container; fixed that.

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@peri044 (Collaborator) left a comment:

Please update the README.md with updated instructions on how to use this Dockerfile, the supported versions, etc. I tested the 21.07 container and hit one intermediate error:

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/archive/torchhub.zip" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/ssd_pyt_ckpt_amp/versions/20.06.0/files/nvidia_ssdpyt_amp_200703.pt
Traceback (most recent call last):
  File "./hub.py", line 60, in <module>
    "model": torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', model_math="fp32"),
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 364, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 393, in _load_local
    model = entry(*args, **kwargs)
  File "/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Detection/SSD/src/entrypoints.py", line 187, in nvidia_ssd
    ckpt = torch.load(ckpt_file)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

but the rest is fine.
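
For reference, the condition behind this error can be checked directly with the call named in the message; the one-liner below is illustrative:

# Prints False during a `docker build` step, where no GPU is mounted,
# and True in a container started with GPU access.
python -c "import torch; print(torch.cuda.is_available())"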

py/requirements.txt (review thread resolved)
docker/Dockerfile (outdated; review thread resolved)
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@borisfom (Collaborator, Author) commented:

> Please update the README.md with updated instructions on how to use this Dockerfile, versions, etc. Tested the 21.07 container; I hit one intermediate error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. [...] but the rest is fine.

Moved the hub.py call to the test script; CUDA is not available during the Docker build.
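
A minimal sketch of where that call now runs, assuming the test is executed in a started container rather than during the image build; the image tag and script path are illustrative, not taken from this PR:

# hub.py needs a visible GPU, so it is exercised at container runtime,
# not during `docker build`; tag and path below are assumptions.
docker run --gpus all --rm trtorch:latest python hub.py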

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@@ -234,7 +234,7 @@ def run(self):
    long_description=long_description,
    ext_modules=ext_modules,
    install_requires=[
-        'torch>=1.9.0+cu111,<1.10.0',
+        'torch>=1.9.0+cu111,<1.11.0',
Collaborator commented:

I don't think we want this change in master until we retarget the mainline codebase to PyTorch 1.10 (post PyTorch's 1.10 release). @andi4191, this might be a good time to start working through the DLFW release workflow. We can retarget this branch to be merged into a DLFW-21.10 branch (let's all agree on a convention for these branch names @borisfom @andi4191 @ptrblck) and then cherry-pick any changes we want in the mainline at a later date (like the single unified Dockerfile, which I think will be super useful).

Collaborator (Author) commented:

Why is this critical? It just allows TRTorch from the master branch to be used in the 21.07, 21.08, and 21.09 containers, and it breaks nothing.

Collaborator commented:

Because in the typical case (unlike the current state of the repo) we only guarantee compatibility with the latest released PyTorch, since we depend on internal APIs. There have been cases in the past where changes needed to support pytorch-next broke support for the previous version. We aren't checking against PyTorch ToT, and if some user has a newer build installed, it might not work even though setup.py says the dependency is compatible. I'd prefer that we only make changes specifically targeted at DLFW in their own branches, so that we can keep the guarantees we give to users simple to understand.

@borisfom (Collaborator, Author) commented Sep 24, 2021:

Well, the concern is understood. However, the effect of a user trying to install the current ToT TRTorch in a 21.07+ container would not be a non-working TRTorch: it would mess up the user's container (it would uninstall PyTorch 1.10a and try to install 1.9). So I believe we'll get fewer frustrated users if we relax the requirement, especially since I have checked both 21.07 and 21.08 and it at least mostly works.

Contributor commented:

@borisfom: I agree with @narendasan on this. We don't have control over the end-user use cases here. It is safer to proceed with what we can guarantee works.

Also, can you re-target this PR to release/ngc/21.10 instead of DLFW-21.10?

docker/Dockerfile (review thread resolved)
RUN pip install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org
RUN jupyter nbextension enable --py widgetsnbextension

RUN mkdir -p /opt/trtorch && tar xvf libtrtorch.tar.gz --strip-components 2 -C /opt/trtorch --exclude=LICENSE && pip3 install *.whl && rm -fr /workspace/trtorch/dist/*
Collaborator commented:

Do we want this library to be in /usr/ or /opt?

Collaborator (Author) commented:

This is up to you to decide. Since various parts of TRTorch are going to be pulled into other containers (Triton, Riva SM), it is easier to handle if it lives under its own root, but I am fine with either way you choose.

Collaborator commented:

If there is a standing convention in DLFW containers to put the libraries we add in /opt, then I am fine with /opt. Otherwise my preference is /usr, since it seems more conventional and easier to link against.

Collaborator (Author) commented:

Yes, we did put things in /opt, like Triton and such. Let's use that for now.

ENV LD_LIBRARY_PATH /opt/conda/lib/python3.8/site-packages/torch/lib:/opt/trtorch/lib:${LD_LIBRARY_PATH}
ENV PATH /opt/trtorch/bin:${PATH}

WORKDIR /workspace/trtorch/
Collaborator commented:

Do we need to do any sort of cleanup for size reduction? Do we conventionally leave the source in the container?

Collaborator commented:

For example, tools like Bazel are probably not needed in the container itself after testing and installation.

Collaborator (Author) commented:

Bazel is only installed in the trtorch-builder image, not in the target container. I do leave the source in the TRTorch container; I won't in the PyTorch MR.

Collaborator commented:

ok

docker/dist-build.sh (outdated; review thread resolved)
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@narendasan (Collaborator) commented:

This got upstreamed during the rebrand.

@narendasan narendasan closed this Nov 30, 2021