
Multi-GPU PyTorch example freezes docker containers #1010

Closed
Dubrzr opened this issue Jul 12, 2019 · 8 comments

Comments

@Dubrzr

Dubrzr commented Jul 12, 2019

1. Issue or feature description

We have 8 P40 GPUs, all mounted inside multiple Docker containers running JupyterLab via nvidia-docker2.
When one person tries to use multiple GPUs for machine learning, all Docker containers on the machine freeze, and the affected containers cannot be restarted. I don't know whether this only concerns containers that mount NVIDIA GPUs, but I believe so.
We tried to restart the Docker daemon, but it cannot exit.
The only way to recover is to reboot the machine.

2. Steps to reproduce the issue

Inside a Docker container with multiple GPUs mounted (here several P40s), run the following Python 3 code:

import os

# Tutorial from https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
# No error with only 1 GPU:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# To reproduce the error, allow multiple GPUs instead:
# os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

print(torch.cuda.device_count())

# Parameters and DataLoaders
input_size = 5000   # increased input size (works with 500 on multi-GPU)
output_size = 2000  # increased output size (works with 200 on multi-GPU)

batch_size = 300
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class RandomDataset(Dataset):
    """A dataset of random tensors, as in the PyTorch DataParallel tutorial."""

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)


class Model(nn.Module):
    """Our model: a single fully connected layer."""

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        return output


model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0: [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

for i in range(10000):
    for data in rand_loader:
        input = data.to(device)
        output = model(input)
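
For reference, the container itself can be launched with the nvidia-docker2 runtime roughly as follows (a sketch, not copied verbatim from our setup; the image tag and GPU list are placeholders):

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 -it pytorch/pytorch:latest bash

The NVIDIA_VISIBLE_DEVICES environment variable controls which GPUs the nvidia runtime exposes to the container.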

3. Information to attach (optional if deemed irrelevant)

@Dubrzr
Author

Dubrzr commented Jul 18, 2019

Hello @RenaudWasTaken, long time no see! Waiting for your awesomeness ahah ;=)

@RenaudWasTaken
Contributor

Seems to be working with the latest PyTorch image.
I'm trying to build your Dockerfile on my machine.

Can you maybe provide an strace log?
It could also very well be a PyTorch issue.

@Dubrzr
Author

Dubrzr commented Jul 18, 2019

Hi @RenaudWasTaken, I hope you are doing well :)
What exactly would you like me to strace? The execution of the Python code inside the Docker container?
Thanks!

@RenaudWasTaken
Contributor

Your Dockerfile is also incorrect; when I try to get it running, torch is not installed:

ImportError: No module named torch

Can you provide a container that reproduces the issue?
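
For illustration, a minimal container definition along these lines would avoid the missing-torch problem (a sketch only, not the reporter's actual Dockerfile; the image tag is a placeholder):

FROM pytorch/pytorch:latest
# JupyterLab layered on top of the official PyTorch image, which already ships torch.
RUN pip install jupyterlab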

@RenaudWasTaken
Contributor

"What exactly would you like me to strace? The execution of the Python code inside the Docker container?"

Yes.
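
For reference, one way to capture such a log (a sketch; repro.py and strace.log are placeholder names for the script above and the output file):

strace -f -tt -o strace.log python3 repro.py

The -f flag follows child threads and processes, which matters here because DataParallel spawns worker threads.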

@2bestnick

@Dubrzr Hello, sir. I am having the same problem. Did you fix it? I have been searching Google for two days for a solution, but no luck so far.

@Dubrzr
Author

Dubrzr commented Aug 9, 2019

@itsnickyang I reproduced the error outside of Docker containers: the process freezes and is not killable. I created an issue in the PyTorch repo, pytorch/pytorch#24081, but I don't know whether this is a PyTorch or an NVIDIA problem.
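
A process that ignores SIGKILL is typically stuck in uninterruptible sleep inside a kernel or driver call, which would also explain why the containers and the Docker daemon cannot be stopped. To help narrow things down, here is a minimal sketch (not from the original report) that exercises a plain cross-GPU copy with no DataParallel involved; if this also hangs, the problem is likely below PyTorch:

import torch

# Assumes at least two visible GPUs.
a = torch.randn(5000, 5000, device="cuda:0")
b = a.to("cuda:1")        # direct GPU-to-GPU copy, no DataParallel
torch.cuda.synchronize()  # wait for the copy to finish
print("copy ok:", b.sum().item())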

@RenaudWasTaken
Contributor

So I'm closing this, as it's not a container issue :D
