segmentation fault with cuda 11.1 #2506

dk-hong · 2020-11-30T05:41:17Z

I want to use DALI in a Docker container with the following environment:
Framework: pytorch 1.7.0
CUDA version: 11.1
python version: 3.8.0
DALI version: nvidia-dali-cuda110 0.27.0

The code worked well with CUDA 10.2 outside Docker, but it no longer works in the environment above.

There is no error message that hints at a fix, only "Segmentation fault (core dumped)".

All I know is that something goes wrong in pipe.build().

My code is below.

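from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

class TrainPipe(Pipeline):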
    def __init__(self, batch_size, n_workers, device_id, data_dir, crop, shard_id, num_shards, dali_cpu=False):
        super(TrainPipe, self).__init__(batch_size, n_workers, device_id, seed=device_id)
        self.input = ops.FileReader(file_root=data_dir, shard_id=shard_id, num_shards=num_shards,
                                    random_shuffle=True, pad_last_batch=True)
        
        dali_device = 'cpu' if dali_cpu else 'gpu'
        decoder_device = 'cpu' if dali_cpu else 'mixed'
        
        device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
        host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
        self.decode = ops.ImageDecoderRandomCrop(device=decoder_device, output_type=types.RGB,
                                                 device_memory_padding=device_memory_padding,
                                                 host_memory_padding=host_memory_padding,
                                                 random_aspect_ratio=[0.8, 1.25],
                                                 random_area=[0.1, 1.0],
                                                 num_attempts=100)
        
        self.res = ops.Resize(device=dali_device,
                              resize_x=crop,
                              resize_y=crop,
                              interp_type=types.INTERP_TRIANGULAR)

        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            dtype=types.FLOAT,
                                            output_layout=types.NCHW,
                                            crop=(crop, crop),
                                            mean=[0.485 * 255,0.456 * 255,0.406 * 255],
                                            std=[0.229 * 255,0.224 * 255,0.225 * 255])
        self.coin = ops.CoinFlip(probability=0.5)

    def define_graph(self):
        rng = self.coin()
        self.jpegs, self.labels = self.input(name="Reader")
        images = self.decode(self.jpegs)
        images = self.res(images)
        output = self.cmnp(images.gpu(), mirror=rng)
        return [output, self.labels]

# The class defined above is TrainPipe; data_dir is a placeholder for the dataset path.
pipe = TrainPipe(batch_size=64, n_workers=3, device_id=0, data_dir="{path to dataset}",
                 crop=224, dali_cpu=False, shard_id=0, num_shards=1)

pipe.build()

If I set dali_cpu=True, the process does not crash.

How can I solve this problem?
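Because the crash leaves no Python traceback, one generic way to narrow it down is the standard-library faulthandler module, which dumps the Python stack when the process receives a fatal signal such as SIGSEGV. The following is a minimal sketch and is not DALI-specific; the helper name enable_crash_traceback and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path="dali_crash.log"):
    """Dump the Python traceback of all threads to log_path on a fatal signal.

    Hypothetical helper. It returns the open file, which must stay open
    for as long as faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback()
    # Build the pipeline after this point, e.g. pipe.build(). If it
    # segfaults, the log file records which Python line was executing.
```

This only confirms which Python call triggers the crash; the native cause still has to be found in the underlying libraries.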


JanuszL · 2020-11-30T09:08:48Z

Hi,
If you are running bare metal, you are probably using the CUDA 10.0 based DALI build. There is a known issue with CUDA 10.0 nvJPEG that can lead to such a crash for some images. You can read more in the #2475 thread.

ben0it8 · 2020-12-16T16:46:55Z

Hi,

I also get Segmentation fault whenever trying to use ImageDecoder(device="mixed") with the following setup:

base image: nvidia/cuda:11.1-devel-ubuntu20.04
Python 3.8
nvidia-dali-cuda110==0.28.0

Any idea what causes this? I'm running bare metal RTX 3090s, hence cuda 11.1 is needed.

Cheers
Oliver

JanuszL · 2020-12-16T22:10:45Z

Hi,
There is a problem with nvJPEG from CUDA 11.0 and 11.1 that will cause a crash when libnvcuvid.so.1 is not available.
Please add --gpus '"capabilities=compute,utility,video"' to the docker invocation command. This issue will be fixed soon when CUDA 11.2 build is available - please check the nightly build that follows the merge of #2553.
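To check whether a container is affected, one option is to test from Python whether the dynamic loader can open libnvcuvid.so.1 before building the pipeline. This is a rough sketch using only the standard ctypes module; the helper name can_load is hypothetical, and a successful load does not by itself guarantee that the decoder works.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # False here suggests the container was started without the
    # "video" capability mentioned above.
    print("libnvcuvid.so.1 loadable:", can_load("libnvcuvid.so.1"))
```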

ben0it8 · 2020-12-17T08:56:59Z

thanks Janusz, will try that & wait for the new build.

Bycqg · 2024-06-12T11:59:08Z

> Hi, There is a problem with nvJPEG from CUDA 11.0 and 11.1 that will cause a crash when libnvcuvid.so.1 is not available. Please add --gpus '"capabilities=compute,utility,video"' to the docker invocation command. This issue will be fixed soon when CUDA 11.2 build is available - please check the nightly build that follows the merge of #2553.

@JanuszL If I must use CUDA 11.1, how should I fix this issue? Can I simply download the NVIDIA Video Codec SDK and then copy the .so files to /usr/local/cuda/lib64? (I am using DALI 1.10.)

JanuszL · 2024-06-12T20:50:06Z

Hi @Bycqg,

DALI relies on CUDA minor version compatibility, so it provides a cuda110 build that should cover the whole family of CUDA 11 compatible drivers and uses the latest CUDA from that family (11.8). Also, the cuda110 build links statically to all the libraries, so it should not have the mentioned problem. Please run DALI and let us know if it doesn't work.
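To see which CUDA version the installed driver reports, and therefore which family it belongs to, the driver API function cuDriverGetVersion can be queried through ctypes. This is a sketch under assumptions: the library name libcuda.so.1 applies to Linux, and the helper names decode_cuda_version and driver_cuda_version are hypothetical.

```python
import ctypes


def decode_cuda_version(raw):
    """Split CUDA's integer encoding (1000*major + 10*minor) into a tuple."""
    return raw // 1000, (raw % 1000) // 10


def driver_cuda_version():
    """Return (major, minor) of the newest CUDA version the driver supports,
    or None if libcuda cannot be loaded or the query fails."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")  # assumed Linux library name
    except OSError:
        return None
    raw = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(raw)) != 0:  # 0 == CUDA_SUCCESS
        return None
    return decode_cuda_version(raw.value)


if __name__ == "__main__":
    print("Driver supports CUDA:", driver_cuda_version())
```

A result with major version 11 would indicate a driver from the CUDA 11 family that the reply refers to.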

Bycqg · 2024-06-13T01:10:24Z

Hi @JanuszL

I need to compile DALI from source in an environment without internet access, so using Docker for compilation is not an option. Currently, I am using the nvidia/cuda:11.1.1-devel-ubuntu18.04 image for compilation.

The subsequent use is to call DALI in C++ for image preprocessing on the GPU, facilitating TensorRT inference later (by the way, is there any C++ demo tutorial available? Most of the demos on the official website are in Python).

Could you please recommend a DALI release version (I am using CUDA version 11.1 and TensorRT version 7.2.2.3), or can any version, whether DALI 1.38 or DALI 1.12, be compiled on CUDA 11.8 and subsequently used on CUDA 11.1?

JanuszL · 2024-06-13T21:35:09Z

Hi @Bycqg,

> Could you please recommend a DALI release version (I am using CUDA version 11.1 and TensorRT version 7.2.2.3), or can any version, whether DALI 1.38 or DALI 1.12, be compiled on CUDA 11.8 and subsequently used on CUDA 11.1?

This should work and I would recommend this path.

JanuszL added the question (Further information is requested) and bug (Something isn't working) labels and removed the question (Further information is requested) label on Nov 30, 2020

JanuszL closed this as completed Dec 2, 2020
