
Multi-GPU PyTorch example freezes docker containers #1010

Closed
Dubrzr opened this issue Jul 12, 2019 · 8 comments

Comments

@Dubrzr

Dubrzr commented Jul 12, 2019

1. Issue or feature description

We have 8 P40 GPUs, all mounted inside multiple Docker containers running JupyterLab via nvidia-docker2.
When one person tries to use multiple GPUs for machine learning, all Docker containers on the machine freeze, and the affected containers cannot be restarted. I don't know whether this only concerns containers that mount NVIDIA GPUs, but I believe so.
We tried to restart the Docker daemon, but it cannot exit.
The only way to recover is to reboot the machine.

2. Steps to reproduce the issue

Inside a Docker container with multiple GPUs mounted (here several P40s), run the following Python 3 code:

import os

# Tutorial from https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
# No error with only 1 GPU:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# To reproduce the error, allow multiple GPUs instead:
# os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

print(torch.cuda.device_count())

# Parameters and DataLoaders
input_size = 5000   # increased input size (works with 500 on multi-GPU)
output_size = 2000  # increased output size (works with 200 on multi-GPU)

batch_size = 300
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class RandomDataset(Dataset):
    """A dataset of random tensors, as in the PyTorch DataParallel tutorial."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)


class Model(nn.Module):
    """Our model: a single fully connected layer."""

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        return output


model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0: [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

for i in range(10000):
    for data in rand_loader:
        input = data.to(device)
        output = model(input)
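
For reference, the container itself can be launched with the nvidia-docker2 runtime roughly as follows (a sketch, not copied verbatim from our setup; the image tag and GPU list are placeholders):

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 -it pytorch/pytorch:latest bash

The NVIDIA_VISIBLE_DEVICES environment variable controls which GPUs the nvidia runtime exposes to the container.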

3. Information to attach (optional if deemed irrelevant)

@Dubrzr
Author

Dubrzr commented Jul 18, 2019

Hello @RenaudWasTaken, long time no see! Waiting for your awesomeness ahah ;=)

@RenaudWasTaken
Contributor

Seems to be working with the latest PyTorch image.
I'm trying to build your Dockerfile on my machine.

Can you maybe provide an strace log?
It could also very well be a PyTorch issue.

@Dubrzr
Author

Dubrzr commented Jul 18, 2019

Hi @RenaudWasTaken, I hope you are doing well :)
What exactly would you like me to strace? The execution of the Python code inside the Docker container?
Thanks!

@RenaudWasTaken
Contributor

Your Dockerfile is also incorrect; when I try to get it running, torch is not installed:

ImportError: No module named torch

Can you provide a container that reproduces the issue?
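
For illustration, a minimal container definition along these lines would avoid the missing-torch problem (a sketch only, not the reporter's actual Dockerfile; the image tag is a placeholder):

FROM pytorch/pytorch:latest
# JupyterLab layered on top of the official PyTorch image, which already ships torch.
RUN pip install jupyterlab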

@RenaudWasTaken
Contributor

"What exactly would you like me to strace? The execution of the Python code inside the Docker container?"

Yes.
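
For reference, one way to capture such a log (a sketch; repro.py and strace.log are placeholder names for the script above and the output file):

strace -f -tt -o strace.log python3 repro.py

The -f flag follows child threads and processes, which matters here because DataParallel spawns worker threads.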

@2bestnick

@Dubrzr Hello, sir. I am having the same problem. Did you fix it? I have been searching Google for two days for a solution, but no luck so far.

@Dubrzr
Author

Dubrzr commented Aug 9, 2019

@itsnickyang I reproduced the error outside of Docker containers: the process freezes and is not killable. I created an issue in the PyTorch repo, pytorch/pytorch#24081, but I don't know whether this is a PyTorch or an NVIDIA problem.
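
A process that ignores SIGKILL is typically stuck in uninterruptible sleep inside a kernel or driver call, which would also explain why the containers and the Docker daemon cannot be stopped. To help narrow things down, here is a minimal sketch (not from the original report) that exercises a plain cross-GPU copy with no DataParallel involved; if this also hangs, the problem is likely below PyTorch:

import torch

# Assumes at least two visible GPUs.
a = torch.randn(5000, 5000, device="cuda:0")
b = a.to("cuda:1")        # direct GPU-to-GPU copy, no DataParallel
torch.cuda.synchronize()  # wait for the copy to finish
print("copy ok:", b.sum().item())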

@RenaudWasTaken
Contributor

So I'm closing this, as it's not a container issue :D
