
[Targeting 2024 Q2] Dataloader crashes after enabling persistent_workers=True #48964

Open
Wong4j opened this issue Dec 9, 2022 · 11 comments
Labels: NVIDIA, PFCC (Paddle Framework Contributor Club, https://github.com/PaddlePaddle/community/tree/master/pfcc), status/following-up, type/bug-report

Comments

@Wong4j
Collaborator

Wong4j commented Dec 9, 2022

Describe the Bug

For benchmarking, I only need to train a few steps per epoch, so I add a break in the loop. For example:

train_dataloader = paddle.io.DataLoader(dataset, batch_size=16, num_workers=4, persistent_workers=True)
bench_epochs = 3
bench_steps = 10
for epoch in range(bench_epochs):
    for i, batch in enumerate(train_dataloader):
        if i > bench_steps:
            break
        do_training_process()

It works fine when persistent_workers=False, but after setting persistent_workers=True I get this error:

$ python test_dataloader.py 
W1209 00:24:39.929682 3970167 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.8, Runtime API Version: 11.7
W1209 00:24:39.943722 3970167 gpu_resources.cc:91] device: 0, cuDNN Version: 8.7.
Epoch 0 batch 0: loss = 2.582632303237915
Epoch 0 batch 1: loss = 2.553558588027954
Epoch 0 batch 2: loss = 2.5804834365844727
Epoch 0 batch 3: loss = 2.531757354736328
Epoch 0 batch 4: loss = 2.3217196464538574
Epoch 0 batch 5: loss = 2.3962247371673584
Epoch 0 batch 6: loss = 2.3609089851379395
Epoch 0 batch 7: loss = 2.398348808288574
Epoch 0 batch 8: loss = 2.594115734100342
Epoch 0 batch 9: loss = 2.648672342300415
Epoch 0 batch 10: loss = 2.4073853492736816
Traceback (most recent call last):
  File "test_dataloader.py", line 51, in <module>
    for i, (image, label) in enumerate(loader()):
  File "/usr/local/lib/python3.8/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 746, in __next__
    data = _restore_batch(data, self._structure_infos.pop(0))
IndexError: pop from empty list

Here is the complete code to reproduce:

# cat test_dataloader.py 
import numpy as np

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset, BatchSampler, DataLoader

BATCH_NUM = 20
BATCH_SIZE = 16
EPOCH_NUM = 100
STEPS_PER_EPOCH = 10

IMAGE_SIZE = 784
CLASS_NUM = 10

USE_GPU = False  # whether to use the GPU to run the model

# define a random dataset
class RandomDataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([IMAGE_SIZE]).astype('float32')
        label = np.random.randint(0, CLASS_NUM - 1, (1, )).astype('int64')
        return image, label

    def __len__(self):
        return self.num_samples

dataset = RandomDataset(BATCH_NUM * BATCH_SIZE)

class SimpleNet(nn.Layer):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(IMAGE_SIZE, CLASS_NUM)

    def forward(self, image, label=None):
        return self.fc(image)

simple_net = SimpleNet()
opt = paddle.optimizer.SGD(learning_rate=1e-3,
                          parameters=simple_net.parameters())

loader = DataLoader(dataset,
                    batch_size=16,
                    num_workers=4,
                    persistent_workers=True)

for e in range(EPOCH_NUM):
    for i, (image, label) in enumerate(loader()):
        if i > STEPS_PER_EPOCH:
            break
        out = simple_net(image)
        loss = F.cross_entropy(out, label)
        avg_loss = paddle.mean(loss)
        avg_loss.backward()
        opt.minimize(avg_loss)
        simple_net.clear_gradients()
        print("Epoch {} batch {}: loss = {}".format(e, i, np.mean(loss.numpy())))

Additional Supplementary Information

No response

@paddle-bot

paddle-bot bot commented Dec 9, 2022

Hi! We've received your issue and will arrange for technicians to answer it as soon as possible; please be patient. Please also double-check that you have provided a clear problem description, reproduction code, environment & version information, and the error messages. In the meantime, you can look for an answer in the official API documentation, the FAQ, historical GitHub issues, and the AI community. Have a nice day!

@Wong4j
Collaborator Author

Wong4j commented Dec 9, 2022

There's a related issue that was opened last year and remains unresolved. (#32927)
This bug can be reproduced with my code by setting EPOCH_NUM = 100 and persistent_workers=False.

@heavengate
Contributor

persistent_workers=True is not yet very stable, and we are still improving this logic. Given your use case, if you only need to train a fixed number of steps, you can try setting the dataset's __len__ to steps * batch_size so that training stops naturally. Breaking out of the loop mid-epoch with the current DataLoader may leave resources unreleased.
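
A minimal sketch of this workaround, assuming it reuses RandomDataset, the constants, and the training step from the reproduction script above (TruncatedRandomDataset is only an illustrative name, not a Paddle API):

# Cap the dataset length so each epoch ends after exactly the batches the
# benchmark needs; no break is required inside the loop.
class TruncatedRandomDataset(RandomDataset):
    def __len__(self):
        return min(self.num_samples, STEPS_PER_EPOCH * BATCH_SIZE)

bench_dataset = TruncatedRandomDataset(BATCH_NUM * BATCH_SIZE)
bench_loader = DataLoader(bench_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          persistent_workers=True)

for e in range(EPOCH_NUM):
    # The iterator is exhausted after STEPS_PER_EPOCH batches, so the
    # mid-epoch break (and the unreleased-worker path) is never taken.
    for i, (image, label) in enumerate(bench_loader):
        out = simple_net(image)
        loss = F.cross_entropy(out, label)
        avg_loss = paddle.mean(loss)
        avg_loss.backward()
        opt.minimize(avg_loss)
        simple_net.clear_gradients()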

@heavengate
Contributor

This has been scheduled and will be fixed within Q2.

@Wong4j
Collaborator Author

Wong4j commented Jul 10, 2023

This has been scheduled and will be fixed within Q2.

Is there any update on this?

@onecatcn
Contributor

onecatcn commented Aug 8, 2023

@heavengate said this task will be targeted in 23 Q3

@Wong4j Wong4j changed the title Dataloader crashes after enabling persistent_workers=True [Targeting Q3] Dataloader crashes after enabling persistent_workers=True Sep 7, 2023
@tiandou-tangdou

@heavengate said this task will be targeted in 23 Q3

@onecatcn done?

@paddle-bot paddle-bot bot added the PFCC (Paddle Framework Contributor Club, https://github.com/PaddlePaddle/community/tree/master/pfcc) label Sep 21, 2023
@onecatcn
Contributor

@heavengate said this task will be targeted in 23 Q3

@onecatcn done?

not yet

@onecatcn
Contributor

@xysheng-baidu sheng will investigate the issue

@Wong4j Wong4j changed the title [Targeting Q3] Dataloader crashes after enabling persistent_workers=True [Targeting Q4] Dataloader crashes after enabling persistent_workers=True Nov 23, 2023
@HydrogenSulfate
Contributor

Same here

@xysheng-baidu
Contributor

Same here

The problem has not been solved yet; we will deal with it as soon as possible.

@Wong4j Wong4j changed the title [Targeting Q4] Dataloader crashes after enabling persistent_workers=True [Targeting 2024 Q1] Dataloader crashes after enabling persistent_workers=True Feb 29, 2024
@Wong4j Wong4j changed the title [Targeting 2024 Q1] Dataloader crashes after enabling persistent_workers=True [Targeting 2024 Q2] Dataloader crashes after enabling persistent_workers=True Apr 18, 2024
Projects: None yet
Development: No branches or pull requests
7 participants