Refactored Dockerfiles #625
Conversation
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Also, setup.py wanted pytorch < 1.10, which would make it try to downgrade PyTorch in the container - fixed that.
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Please update the README.md with updated instructions on how to use this Dockerfile, the supported versions, etc. I tested the 21.07 container and hit one error at an intermediate step:
Downloading: "https://github.com/NVIDIA/DeepLearningExamples/archive/torchhub.zip" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/ssd_pyt_ckpt_amp/versions/20.06.0/files/nvidia_ssdpyt_amp_200703.pt
Traceback (most recent call last):
File "./hub.py", line 60, in <module>
"model": torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', model_math="fp32"),
File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 364, in load
model = _load_local(repo_or_dir, model, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/hub.py", line 393, in _load_local
model = entry(*args, **kwargs)
File "/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Detection/SSD/src/entrypoints.py", line 187, in nvidia_ssd
ckpt = torch.load(ckpt_file)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
device = validate_cuda_device(location)
File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
but the rest is fine.
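For reference, the workaround the error message itself suggests - remapping GPU-saved tensors to the CPU when no CUDA device is visible - would look roughly like the sketch below. This is illustrative only (the checkpoint filename is taken from the log above), not the fix this PR adopts:

import torch

# During `docker build` no CUDA device is exposed, so a checkpoint that
# was saved on a GPU must be remapped onto the CPU at deserialization.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt = torch.load("nvidia_ssdpyt_amp_200703.pt", map_location=device)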
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
Moved the hub.py call to the test script - CUDA is not available during Docker build.
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
@@ -234,7 +234,7 @@ def run(self):
     long_description=long_description,
     ext_modules=ext_modules,
     install_requires=[
-        'torch>=1.9.0+cu111,<1.10.0',
+        'torch>=1.9.0+cu111,<1.11.0',
I don't think we want this change in master until we retarget the mainline codebase to PyTorch 1.10 (post PyTorch's 1.10 release). @andi4191 this might be a good time to start working through the DLFW release workflow. We can retarget this branch to be merged into a DLFW-21.10 branch (let's all agree on a convention for these branch names @borisfom @andi4191 @ptrblck) and then cherry-pick any changes we want in the mainline at a later date (like the single unified Dockerfile, which I think will be super useful).
Why is this critical? It just allows TRTorch from the master branch to be used in the 21.07, 21.08, and 21.09 containers, and it breaks nothing?
Because in the typical case (unlike the state of the repo as it is right now), we only guarantee compatibility with the latest released PyTorch, since we depend on internal APIs. There have been cases in the past where changes needed to support pytorch-next broke support for the previous version. We aren't checking against PyTorch ToT, and if a user has a newer build installed, it might not work even though setup.py says the dependency is compatible. I'd prefer that we only make changes specifically targeted at DLFW in their own branches, so that we can keep the guarantees we give to users simple to understand.
Well, the concern is understood. However, the effect of a user trying to install the current ToT TRTorch in a 21.07+ container would not just be a non-working TRTorch - it would mess up the user's container (it will uninstall PyTorch 1.10a and try to install 1.9). So I believe we'll frustrate fewer users if we relax the requirement, especially since I have checked both 21.07 and 21.08 and it at least mostly works.
@borisfom: I agree with @narendasan on this. We don't have control over the end-user use cases here. It is safer to proceed with what we can guarantee works.
Also, can you re-target this PR to release/ngc/21.10 instead of DLFW-21.10?
RUN pip install ipywidgets --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host=files.pythonhosted.org
RUN jupyter nbextension enable --py widgetsnbextension

RUN mkdir -p /opt/trtorch && tar xvf libtrtorch.tar.gz --strip-components 2 -C /opt/trtorch --exclude=LICENSE && pip3 install *.whl && rm -fr /workspace/trtorch/dist/*
Do we want this library to be in /usr/ or /opt?
This is up to you to decide. Since various parts of TRTorch are going to be pulled into other containers (Triton, Riva SM), it's easier to handle if it's under its own root - but I am fine with whichever you choose.
If there is a standing convention in DLFW containers to put libraries we add in /opt, then I am fine with /opt. Otherwise my preference is /usr, since it seems more conventional and easier to link against.
Yes, we did put things in /opt, like Triton and such. Let's use that for now.
ENV LD_LIBRARY_PATH /opt/conda/lib/python3.8/site-packages/torch/lib:/opt/trtorch/lib:${LD_LIBRARY_PATH}
ENV PATH /opt/trtorch/bin:${PATH}

WORKDIR /workspace/trtorch/
Do we need to do any sort of cleanup for size reduction? Do we conventionally leave the source in the container?
For instance, things like Bazel are probably not needed in the container itself after testing and install.
Bazel is only installed in the trtorch-builder image, not in the target container. I do leave the source in the TRTorch container - I won't in the PyTorch MR.
ok
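For context, the builder/target split described above is the standard multi-stage Docker pattern, sketched below under assumed names (the base image tag, paths, and elided build steps are illustrative; only the trtorch-builder stage name is taken from the comment above):

# Builder stage: carries Bazel and the full source tree; everything here
# is discarded once the artifacts are copied out, keeping the target slim.
FROM nvcr.io/nvidia/pytorch:21.07-py3 AS trtorch-builder
COPY . /workspace/trtorch
# (Bazel installation and the actual build/packaging steps go here.)

# Target stage: only the built distribution crosses the stage boundary,
# so Bazel never appears in the shipped container.
FROM nvcr.io/nvidia/pytorch:21.07-py3
COPY --from=trtorch-builder /workspace/trtorch/dist/ /workspace/trtorch/dist/
RUN pip3 install /workspace/trtorch/dist/*.whl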
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>
This got upstreamed during the rebrand.
Signed-off-by: Boris Fomitchev bfomitchev@nvidia.com
Description
Refactored the Dockerfiles into a single one with a build arg for the base image.
Added a script to build both the .tar.gz and wheel distributions; it is used in the Dockerfile.
Fixes # (issue)