
Refactored Dockerfiles #625

Closed
wants to merge 6 commits

Conversation

@borisfom (Collaborator) commented Sep 16, 2021

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

Description

Refactored the Dockerfiles into a single one with a build arg for the base image.
Added a script that builds both the .tar.gz and the wheel distributions; it is used in the Dockerfile.
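
A rough sketch of how such a parameterized build could be invoked; the build-arg name BASE, the base image tag, and the output tag are illustrative assumptions, not taken from this PR:

# Build against a chosen NGC PyTorch base by overriding a single build arg;
# "BASE", the 21.07 tag, and "trtorch:latest" are assumed names for illustration.
docker build -f docker/Dockerfile \
    --build-arg BASE=nvcr.io/nvidia/pytorch:21.07-py3 \
    -t trtorch:latest .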

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@borisfom (Collaborator, Author) commented:

Also, setup.py required pytorch < 1.10, which would make it try to downgrade pytorch in the container; fixed that.

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@peri044 (Collaborator) left a comment:

Please update the README.md with updated instructions on how to use this Dockerfile, the supported versions, etc. I tested the 21.07 container and hit one intermediate error:

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/archive/torchhub.zip" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/ssd_pyt_ckpt_amp/versions/20.06.0/files/nvidia_ssdpyt_amp_200703.pt
Traceback (most recent call last):
  File "./hub.py", line 60, in <module>
    "model": torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', model_math="fp32"),
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 364, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 393, in _load_local
    model = entry(*args, **kwargs)
  File "/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Detection/SSD/src/entrypoints.py", line 187, in nvidia_ssd
    ckpt = torch.load(ckpt_file)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

but the rest is fine.
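
For reference, the condition behind this error can be checked directly with the call named in the message; the one-liner below is illustrative:

# Prints False during a `docker build` step, where no GPU is mounted,
# and True in a container started with GPU access.
python -c "import torch; print(torch.cuda.is_available())"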

py/requirements.txt (review thread resolved)
docker/Dockerfile (outdated; review thread resolved)
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@borisfom (Collaborator, Author) commented:

> Please update the README.md with updated instructions on how to use this Dockerfile, versions, etc. Tested the 21.07 container; I hit one intermediate error: RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. [...] but the rest is fine.

Moved the hub.py call to the test script; CUDA is not available during the Docker build.
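
A minimal sketch of where that call now runs, assuming the test is executed in a started container rather than during the image build; the image tag and script path are illustrative, not taken from this PR:

# hub.py needs a visible GPU, so it is exercised at container runtime,
# not during `docker build`; tag and path below are assumptions.
docker run --gpus all --rm trtorch:latest python hub.py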

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@@ -234,7 +234,7 @@ def run(self):
    long_description=long_description,
    ext_modules=ext_modules,
    install_requires=[
-        'torch>=1.9.0+cu111,<1.10.0',
+        'torch>=1.9.0+cu111,<1.11.0',
Collaborator commented:

I don't think we want this change in master until we retarget the mainline codebase to PyTorch 1.10 (post PyTorch's 1.10 release). @andi4191, this might be a good time to start working through the DLFW release workflow. We can retarget this branch to be merged into a DLFW-21.10 branch (let's all agree on a convention for these branch names @borisfom @andi4191 @ptrblck) and then cherry-pick any changes we want in the mainline at a later date (like the single unified Dockerfile, which I think will be super useful).

Collaborator (Author) commented:

Why is this critical? It just allows TRTorch from the master branch to be used in the 21.07, 21.08, and 21.09 containers, and it breaks nothing.

Collaborator commented:

Because in the typical case (unlike the current state of the repo) we only guarantee compatibility with the latest released PyTorch, since we depend on internal APIs. There have been cases in the past where changes needed to support pytorch-next broke support for the previous version. We aren't checking against PyTorch ToT, and if some user has a newer build installed, it might not work even though setup.py says the dependency is compatible. I'd prefer that we only make changes specifically targeted at DLFW in their own branches, so that we can keep the guarantees we give to users simple to understand.

@borisfom (Collaborator, Author) commented Sep 24, 2021:

Well, the concern is understood. However, the effect of a user trying to install the current ToT TRTorch in a 21.07+ container would not be a non-working TRTorch: it would mess up the user's container (it would uninstall PyTorch 1.10a and try to install 1.9). So I believe we'll get fewer frustrated users if we relax the requirement, especially since I have checked both 21.07 and 21.08 and it at least mostly works.

Contributor commented:

@borisfom: I agree with @narendasan on this. We don't have control over the end-user use cases here. It is safer to proceed with what we can guarantee works.

Also, can you re-target this PR to release/ngc/21.10 instead of DLFW-21.10?

docker/Dockerfile (review thread resolved)
RUN pip install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org
RUN jupyter nbextension enable --py widgetsnbextension

RUN mkdir -p /opt/trtorch && tar xvf libtrtorch.tar.gz --strip-components 2 -C /opt/trtorch --exclude=LICENSE && pip3 install *.whl && rm -fr /workspace/trtorch/dist/*
Collaborator commented:

Do we want this library to be in /usr/ or /opt?

Collaborator (Author) commented:

This is up to you to decide. Since various parts of TRTorch are going to be pulled into other containers (Triton, Riva SM), it is easier to handle if it lives under its own root, but I am fine with either way you choose.

Collaborator commented:

If there is a standing convention in DLFW containers to put the libraries we add in /opt, then I am fine with /opt. Otherwise my preference is /usr, since it seems more conventional and easier to link against.

Collaborator (Author) commented:

Yes, we did put things in /opt, like Triton and such. Let's use that for now.

ENV LD_LIBRARY_PATH /opt/conda/lib/python3.8/site-packages/torch/lib:/opt/trtorch/lib:${LD_LIBRARY_PATH}
ENV PATH /opt/trtorch/bin:${PATH}

WORKDIR /workspace/trtorch/
Collaborator commented:

Do we need to do any sort of cleanup for size reduction? Do we conventionally leave the source in the container?

Collaborator commented:

For example, tools like Bazel are probably not needed in the container itself after testing and installation.

Collaborator (Author) commented:

Bazel is only installed in the trtorch-builder image, not in the target container. I do leave the source in the TRTorch container; I won't in the PyTorch MR.

Collaborator commented:

ok

docker/dist-build.sh (outdated; review thread resolved)
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@narendasan (Collaborator) commented:

This got upstreamed during the rebrand.

@narendasan narendasan closed this Nov 30, 2021