
AttributeError on using multi gpu even on using ddp #2515

Closed

hadarishav opened this issue Jul 5, 2020 · 8 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments


hadarishav commented Jul 5, 2020

🐛 Bug

I am getting an AttributeError when using multiple GPUs. The code works fine on a single GPU, and I am using DDP as suggested. Here's the traceback:

Traceback (most recent call last):
  File "/home/sohigre/STL/stl_bert_trial_lightning.py", line 245, in <module>
    trainer.fit(model)
  File "/home/sohigre/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 952, in fit
    self.ddp_train(task, model)
  File "/home/sohigre/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 500, in ddp_train
    self.optimizers, self.lr_schedulers, self.optimizer_frequencies = self.init_optimizers(model)
  File "/home/sohigre/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/optimizers.py", line 18, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/home/sohigre/STL/stl_bert_trial_lightning.py", line 178, in configure_optimizers
    total_steps = len(self.train_dataloader()) * self.max_epochs
  File "/home/sohigre/STL/stl_bert_trial_lightning.py", line 109, in train_dataloader
    return DataLoader(self.ds_train, batch_size=self.batch_size,num_workers=4)
  File "/home/sohigre/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'Abuse_lightning' object has no attribute 'ds_train'

Code sample

  def prepare_data(self):
    # download only (not called on every GPU, just the root GPU per node)
    df = pd.read_csv(self.filename)
    self.df_train, self.df_test = train_test_split(df, test_size=0.2, random_state=self.RANDOM_SEED)
    self.df_val, self.df_test = train_test_split(self.df_test, test_size=0.5, random_state=self.RANDOM_SEED)
    self.tokenizer = BertTokenizer.from_pretrained(self.PRE_TRAINED_MODEL_NAME)
    self.ds_train = AbuseDataset(reviews=self.df_train.comment.to_numpy(), targets=self.df_train.Score.to_numpy(),
                          tokenizer=self.tokenizer,max_len=self.max_len)
    self.ds_val = AbuseDataset(reviews=self.df_val.comment.to_numpy(), targets=self.df_val.Score.to_numpy(),
                          tokenizer=self.tokenizer,max_len=self.max_len)
    self.ds_test = AbuseDataset(reviews=self.df_test.comment.to_numpy(), targets=self.df_test.Score.to_numpy(),
                          tokenizer=self.tokenizer,max_len=self.max_len)

  @pl.data_loader
  def train_dataloader(self):
    
    return DataLoader(self.ds_train, batch_size=self.batch_size,num_workers=4)

Environment

  • CUDA:
    • GPU:
      • GeForce GTX 1080 Ti
      • GeForce GTX 1080 Ti
      • GeForce GTX 1080 Ti
      • GeForce GTX 1080 Ti
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.5.1
    • pytorch-lightning: 0.8.4
    • tensorboard: 2.2.2
    • tqdm: 4.42.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor:
    • python: 3.7.6
    • version: #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07)
hadarishav added the bug and help wanted labels on Jul 5, 2020

github-actions bot commented Jul 5, 2020

Hi! Thanks for your contribution, great first issue!


rohitgr7 commented Jul 6, 2020

@awaelchli shouldn't the dataset be initialized in prepare_data only? If the dataset's init involves some time-consuming processing, it would be repeated separately on every device, which I think is unnecessary: AFAIK the sampler handles distributing batches across devices, so the dataset should be initialized only once, and that should happen in prepare_data.


awaelchli commented Jul 6, 2020

The poster is getting the AttributeError because they assign attributes to self, but since prepare_data is only called once per node (or optionally once across all nodes), some of the subprocesses end up with a model that does not have these attributes, which leads to the error. Therefore I suggest assigning these attributes in setup instead.

Whether or not this is the right split for the OP's use case is of course a different question.
I interpret the docs as follows: split your preprocessing pipeline into

  • prepare_data: download and preprocess the dataset (once per node, but not per GPU)
  • setup: read the necessary attributes from disk, get file handles, etc., and assign them to self (per process/GPU); a sketch of this split follows below

If these cases do not apply to you, you can always preprocess the data outside the LightningModule and pass your dataloaders as arguments to Trainer.fit().
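A minimal sketch of that split, reusing the names from the OP's snippet above (AbuseDataset, self.filename, self.RANDOM_SEED, etc. are the OP's; the setup hook and its stage argument may vary slightly across Lightning versions):

  def prepare_data(self):
    # one-time work per node: downloads, caching, etc.
    # do NOT assign state to self here, because not every process runs this
    BertTokenizer.from_pretrained(self.PRE_TRAINED_MODEL_NAME)  # warm the download cache

  def setup(self, stage=None):
    # runs in every process/GPU, so attributes assigned here exist everywhere
    df = pd.read_csv(self.filename)
    self.df_train, self.df_test = train_test_split(df, test_size=0.2, random_state=self.RANDOM_SEED)
    self.df_val, self.df_test = train_test_split(self.df_test, test_size=0.5, random_state=self.RANDOM_SEED)
    self.tokenizer = BertTokenizer.from_pretrained(self.PRE_TRAINED_MODEL_NAME)
    self.ds_train = AbuseDataset(reviews=self.df_train.comment.to_numpy(), targets=self.df_train.Score.to_numpy(),
                                 tokenizer=self.tokenizer, max_len=self.max_len)
    self.ds_val = AbuseDataset(reviews=self.df_val.comment.to_numpy(), targets=self.df_val.Score.to_numpy(),
                               tokenizer=self.tokenizer, max_len=self.max_len)
    self.ds_test = AbuseDataset(reviews=self.df_test.comment.to_numpy(), targets=self.df_test.Score.to_numpy(),
                                tokenizer=self.tokenizer, max_len=self.max_len)

For the Trainer.fit() route, the call would look roughly like trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader), with the loaders built outside the LightningModule.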


hadarishav commented Jul 6, 2020

Thanks @awaelchli. This seems to have solved the problem. I'm now getting a new error, "RuntimeError: Tensors must be CUDA and dense". Not sure if they are related. Let me know if you have any suggestions!


rohitgr7 commented Jul 7, 2020

@awaelchli

> since prepare_data is only called once per node

> then some of the subprocesses have a model without these attributes, leading to the error.

If an attribute is attached to the model in prepare_data within each node and we then just copy this model to all devices (DDP), why wouldn't this attribute be copied too? I tried the same thing with TPU / 8 cores (which works similarly to DDP) and it doesn't throw any AttributeError.

@awaelchli

I think prepare_data is simply only executed on rank 0. The models don't get copied; in DDP, the script gets launched multiple times.
This is how I understand it.
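A toy illustration of that point (hypothetical code, not Lightning internals; LOCAL_RANK is just a stand-in for however the launcher identifies each process):

import os

class Model:
  def prepare(self):
    # mimics prepare_data: only the rank-0 process does the work,
    # so only that process ends up with the attribute on its copy of the object
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
      self.ds_train = [1, 2, 3]

m = Model()
m.prepare()
print(hasattr(m, "ds_train"))  # True in the rank-0 process, False in the others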

@awaelchli

For TPU, I don't know how it is. Let's ask @williamFalcon
