segmentation fault with cuda 11.1 #2506

Closed
dk-hong opened this issue Nov 30, 2020 · 8 comments
Labels
bug Something isn't working

Comments

dk-hong commented Nov 30, 2020

I want to use DALI in a Docker container with the following environment:
Framework: pytorch 1.7.0
CUDA version: 11.1
python version: 3.8.0
DALI version: nvidia-dali-cuda110 0.27.0

The code worked when I ran it with CUDA 10.2 outside Docker, but it does not work in the environment above.

The only output is "segmentation fault (core dumped)"; there are no other error messages that hint at the cause.

All I know is that something goes wrong in pipe.build().

My code is right below.

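from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class TrainPipe(Pipeline):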
    def __init__(self, batch_size, n_workers, device_id, data_dir, crop, shard_id, num_shards, dali_cpu=False):
        super(TrainPipe, self).__init__(batch_size, n_workers, device_id, seed=device_id)
        self.input = ops.FileReader(file_root=data_dir, shard_id=shard_id, num_shards=num_shards,
                                    random_shuffle=True, pad_last_batch=True)
        
        dali_device = 'cpu' if dali_cpu else 'gpu'
        decoder_device = 'cpu' if dali_cpu else 'mixed'
        
        device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
        host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
        self.decode = ops.ImageDecoderRandomCrop(device=decoder_device, output_type=types.RGB,
                                                 device_memory_padding=device_memory_padding,
                                                 host_memory_padding=host_memory_padding,
                                                 random_aspect_ratio=[0.8, 1.25],
                                                 random_area=[0.1, 1.0],
                                                 num_attempts=100)
        
        self.res = ops.Resize(device=dali_device,
                              resize_x=crop,
                              resize_y=crop,
                              interp_type=types.INTERP_TRIANGULAR)

        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(crop, crop),
                                            mean=[0.485 * 255,0.456 * 255,0.406 * 255],
                                            std=[0.229 * 255,0.224 * 255,0.225 * 255])
        self.coin = ops.CoinFlip(probability=0.5)

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        output = self.cmnp(images.gpu(), mirror=rng)
        return [output, self.labels]

data_dir = "/path/to/dataset"  # placeholder: replace with the dataset root
pipe = TrainPipe(batch_size=64, n_workers=3, device_id=0, data_dir=data_dir,
                 crop=224, dali_cpu=False, shard_id=0, num_shards=1)

pipe.build()

If I set dali_cpu=True, the process does not crash.

How can I solve this problem?
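
For readers hitting the same silent crash: a minimal sketch of one way to get at least a Python-level traceback when the process segfaults, using the standard-library faulthandler module. The helper name enable_crash_traces and the log path are hypothetical. The dump only shows which Python call was active (for example pipe.build()), not the native frames inside DALI or nvJPEG.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Dump Python tracebacks to log_path if the process receives a fatal signal."""
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical log location; any writable path works.
    crash_log = enable_crash_traces(
        os.path.join(tempfile.gettempdir(), "dali_crash_trace.log")
    )
    # ... construct the pipeline and call pipe.build() here ...
```

Running the script with `python -X faulthandler script.py` enables the same handler and writes to stderr instead of a file.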

JanuszL added the question and bug labels, then removed the question label, on Nov 30, 2020
Contributor

JanuszL commented Nov 30, 2020

Hi,
If you are running on bare metal, you are probably using a CUDA 10.0 based DALI build. There is a known issue with CUDA 10.0 nvJPEG that can lead to such a crash for some images. You can read more in the #2475 thread.

JanuszL closed this as completed on Dec 2, 2020

ben0it8 commented Dec 16, 2020

Hi,

I also get a segmentation fault whenever I try to use ImageDecoder(device="mixed") with the following setup:

  • base image: nvidia/cuda:11.1-devel-ubuntu20.04
  • Python 3.8
  • nvidia-dali-cuda110==0.28.0

Any idea what causes this? I'm running RTX 3090s on bare metal, so CUDA 11.1 is needed.

Cheers
Oliver

Contributor

JanuszL commented Dec 16, 2020

Hi,
There is a problem with nvJPEG from CUDA 11.0 and 11.1 that causes a crash when libnvcuvid.so.1 is not available.
Please add --gpus '"capabilities=compute,utility,video"' to the docker invocation command. This issue will be fixed once the CUDA 11.2 build is available; please check the nightly build that follows the merge of #2553.
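
To confirm whether the container is affected, a minimal sketch that checks if the dynamic loader can open libnvcuvid.so.1 from inside the container. The helper name can_load_shared_library is hypothetical. The check only reports whether the library is loadable; it does not prove that the crash comes from this cause.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # libnvcuvid.so.1 is the library named in the comment above.
    if can_load_shared_library("libnvcuvid.so.1"):
        print("libnvcuvid.so.1 is loadable")
    else:
        print("libnvcuvid.so.1 is not loadable; check the container's GPU capabilities")
```

If the check fails, restarting the container with the video capability added as described above should make the library visible.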


ben0it8 commented Dec 17, 2020

Thanks, Janusz. I will try that and wait for the new build.


Bycqg commented Jun 12, 2024

> Hi, There is a problem with nvJPEG from CUDA 11.0 and 11.1 that will cause a crash when libnvcuvid.so.1 is not available. Please add --gpus '"capabilities=compute,utility,video"' to the docker invocation command. This issue will be fixed soon when CUDA 11.2 build is available - please check the nightly build that follows the merge of #2553.

@JanuszL If I must use CUDA 11.1, how should I fix this issue? Can I simply download the NVIDIA Video Codec SDK and copy the .so files to /usr/local/cuda/lib64? (I am using DALI 1.10.)

Contributor

JanuszL commented Jun 12, 2024

Hi @Bycqg,

DALI relies on CUDA minor version compatibility, so it provides a cuda110 build that should cover the whole family of CUDA 11 compatible drivers and uses the latest CUDA from that family (11.8). The cuda110 build also links statically to all the libraries, so it should not have the problem mentioned above. Please run DALI and let us know if it doesn't work.
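
As a rough illustration of the "same family" idea, a minimal sketch that compares only the CUDA major version of a build against a target environment. The helper names are hypothetical, and this is a simplification: real minor version compatibility also depends on the installed driver version, which the sketch does not check.

```python
from typing import Tuple


def parse_cuda_version(version: str) -> Tuple[int, int]:
    """Split a 'major.minor[.patch]' CUDA version string into (major, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def in_same_cuda_family(build_version: str, target_version: str) -> bool:
    """Return True when both versions share a CUDA major version."""
    build_major = parse_cuda_version(build_version)[0]
    target_major = parse_cuda_version(target_version)[0]
    return build_major == target_major


if __name__ == "__main__":
    # A build against CUDA 11.8 and a CUDA 11.1 environment share major version 11.
    print(in_same_cuda_family("11.8", "11.1"))  # True
    # CUDA 10.2 belongs to a different major family.
    print(in_same_cuda_family("11.8", "10.2"))  # False
```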


Bycqg commented Jun 13, 2024

Hi @JanuszL

I need to compile DALI from source in an environment without internet access, so using Docker for compilation is not an option. Currently, I am using the nvidia/cuda:11.1.1-devel-ubuntu18.04 image for compilation.

I then plan to call DALI from C++ for image preprocessing on the GPU, ahead of TensorRT inference. By the way, is there a C++ demo or tutorial available? Most of the demos on the official website are in Python.

Could you please recommend a DALI release version (I am using CUDA version 11.1 and TensorRT version 7.2.2.3), or can any version, whether DALI 1.38 or DALI 1.12, be compiled on CUDA 11.8 and subsequently used on CUDA 11.1?

Contributor

JanuszL commented Jun 13, 2024

Hi @Bycqg,

> Could you please recommend a DALI release version (I am using CUDA version 11.1 and TensorRT version 7.2.2.3), or can any version, whether DALI 1.38 or DALI 1.12, be compiled on CUDA 11.8 and subsequently used on CUDA 11.1?

This should work and I would recommend this path.

4 participants