update dockerfile for cuda130#1087
Code Review
This pull request introduces a new Dockerfile for a CUDA 13.0 environment, integrating several deep learning and kernel optimization libraries. The review identified several issues that are likely to break the build: conflicting curl output flags during the Miniforge installation, pinned PyTorch versions and a wheel index that may not be published, and environment variables for build parallelism that never reach the build command. Additionally, the CUDA architecture list for the SpargeAttn-Fix library should include Blackwell (12.0) so that its kernels are built for the target hardware.

libsoup2.4-dev libnice-dev libopus-dev libvpx-dev libx264-dev libsrtp2-dev libglib2.0-dev libdrm-dev libjpeg-dev libpng-dev \
&& apt-get clean && rm -rf /var/lib/apt/lists/* && git lfs install

RUN curl -fsSL -v -o /app/miniconda.sh -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"
Passing -O (--remote-name) together with -o /app/miniconda.sh gives curl two output options for a single URL. Only one of them can apply, and it is not obvious from the command which one does. If the file were saved under its remote name (Miniforge3-Linux-x86_64.sh) in the current directory, the subsequent chmod and bash commands on lines 16-17 would fail, because they expect the file at /app/miniconda.sh. Dropping -O removes the ambiguity:
RUN curl -fsSL -v -o /app/miniconda.sh "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"

RUN pip install --no-cache-dir packaging ninja cmake scikit-build-core uv meson ruff pre-commit fastapi uvicorn requests -U

RUN pip install torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0 --index-url https://download.pytorch.org/whl/cu130
The pinned versions torch==2.11.0, torchvision==0.26.0, and torchaudio==2.11.0 could not be confirmed in the official PyTorch repositories at review time (the reviewer's reference point for the latest stable release was 2.5.x), and neither could the cu130 wheel index on download.pytorch.org. Unless these are published there or refer to a private or internal repository, pip will fail to find the packages during the build.

COPY lightx2v_kernel /app/lightx2v_kernel

RUN git clone https://github.com/NVIDIA/cutlass.git --depth 1 && cd /app/lightx2v_kernel && MAX_JOBS=32 && CMAKE_BUILD_PARALLEL_LEVEL=4 \
MAX_JOBS=32 is followed by &&, which makes the assignment a separate shell command. That sets a variable in the RUN shell but does not export it, so the subsequent uv build command never sees it. To make the build respect the limit, prefix both assignments directly to the command, without the && separator:
RUN git clone https://github.com/NVIDIA/cutlass.git --depth 1 && cd /app/lightx2v_kernel && MAX_JOBS=32 CMAKE_BUILD_PARALLEL_LEVEL=4 \
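The scoping difference can be reproduced outside Docker. The sketch below is illustrative only: it uses a child `sh -c` process as a stand-in for the build command, and the variable names `with_and` and `prefixed` are hypothetical helpers, not part of the Dockerfile.

```shell
#!/bin/sh
# Start from a clean state so MAX_JOBS is neither set nor exported.
unset MAX_JOBS

# Form 1: 'VAR=value && cmd' assigns a plain shell variable.
# It is not exported, so the child process does not see it.
with_and=$(MAX_JOBS=32 && sh -c 'echo "${MAX_JOBS:-unset}"')

# Form 2: 'VAR=value cmd' places the variable in the environment of cmd only.
prefixed=$(MAX_JOBS=32 sh -c 'echo "${MAX_JOBS:-unset}"')

echo "with &&:  MAX_JOBS=$with_and"
echo "prefixed: MAX_JOBS=$prefixed"
```

Run under a POSIX shell, the first form reports the variable as unset in the child process and the second reports 32, which is the behavior the review comment describes.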

RUN cd q8_kernels && git submodule init && git submodule update && python setup.py install && rm -rf build

RUN git clone https://github.com/ModelTC/SpargeAttn-Fix.git --depth 1
RUN cd SpargeAttn-Fix && TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0" pip install --no-cache-dir --no-build-isolation -v -e .
TORCH_CUDA_ARCH_LIST is missing the Blackwell architecture (12.0 or 10.0), which appears to be the primary target of this CUDA 13.0 Dockerfile and is consistent with the architectures listed on line 51. Without it, the SpargeAttn-Fix kernels will not be built specifically for the intended hardware. A version that adds 12.0:
RUN cd SpargeAttn-Fix && TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0;12.0" pip install --no-cache-dir --no-build-isolation -v -e .
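One way to catch an incomplete architecture list before a long kernel build is a small guard in the RUN step. The following is a minimal sketch, not part of the PR: `arch_list_has`, `ARCH_LIST`, and `REQUIRED_ARCH` are hypothetical names, and 12.0 is assumed to be the required target based on the review comment.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the semicolon-separated list $1
# contains the architecture $2, with or without a "+PTX" suffix.
arch_list_has() {
  case ";$1;" in
    *";$2;"* | *";$2+PTX;"*) return 0 ;;
    *) return 1 ;;
  esac
}

ARCH_LIST="8.0;8.6;8.9;9.0;12.0"   # value taken from the suggested change
REQUIRED_ARCH="12.0"               # assumption: the Blackwell target named in the review

if arch_list_has "$ARCH_LIST" "$REQUIRED_ARCH"; then
  echo "ok: $REQUIRED_ARCH is in TORCH_CUDA_ARCH_LIST"
else
  echo "error: $REQUIRED_ARCH missing from TORCH_CUDA_ARCH_LIST" >&2
  exit 1
fi
```

Wrapping the list in leading and trailing semicolons before matching avoids false positives on partial entries, so a value such as 112.0 would not be mistaken for 12.0.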
No description provided.