
The code works; it only slightly alleviates the imbalance, and memory usage is still unbalanced. #13

Open
leileilin opened this issue Jul 7, 2021 · 5 comments

@leileilin

bsz: 158
num_dev: 6
gpu0_bsz: 1
bsz_unit: 31
chunk_sizes: [1, 32, 32, 31, 31, 31]
len(inputs): 6
self.device_ids[:len(inputs)] [0, 1, 2, 3, 4, 5]
replicas: 6
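
For reference, the printed chunk_sizes follow directly from bsz, num_dev, and gpu0_bsz. Below is a minimal sketch of that split (an illustrative reconstruction, not this repo's actual code; `balanced_chunk_sizes` is a hypothetical helper):

```python
# Illustrative reconstruction of the chunk-size split, not the repo's code:
# GPU 0 gets gpu0_bsz samples, the rest is divided evenly over the other
# GPUs, and any remainder is handed out one sample at a time after GPU 0.
def balanced_chunk_sizes(bsz, num_dev, gpu0_bsz):
    bsz_unit = (bsz - gpu0_bsz) // (num_dev - 1)   # even share for GPUs 1..n
    sizes = [gpu0_bsz] + [bsz_unit] * (num_dev - 1)
    for i in range(bsz - sum(sizes)):              # distribute the remainder
        sizes[1 + i] += 1
    return sizes

print(balanced_chunk_sizes(158, 6, 1))  # [1, 32, 32, 31, 31, 31], as logged above
```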

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     13347      C   python                                    10505MiB  |
|    1     13347      C   python                                     4991MiB  |
|    4     13347      C   python                                     4991MiB  |
|    5     13347      C   python                                     4925MiB  |
|    6     13347      C   python                                     4925MiB  |
|    7     13347      C   python                                     4925MiB  |
+-----------------------------------------------------------------------------+

@Link-Li
Owner

Link-Li commented Jul 7, 2021

Then you could check whether the issue is that your model itself is fairly large. This also depends on the PyTorch version; very recent PyTorch releases haven't been tested with this code yet.

@leileilin
Author

leileilin commented Jul 7, 2021 via email

@Link-Li
Owner

Link-Li commented Jul 8, 2021

A 30M model shouldn't run into this; it is much smaller than bert-base. The version I tested was around PyTorch 1.3, though I no longer remember exactly which one.
If it isn't a PyTorch version problem, my guess is that the gradients are computed on GPU 0, and since your batch size is very large, that step may need a lot of memory.
To rule things out, you can try the official DataParallel and see what the memory usage looks like.
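
(For that comparison, a minimal sketch using the stock PyTorch wrapper; `MyModel` and the device list are placeholders for the actual setup:)

```python
import torch

model = MyModel()  # placeholder for the actual ~30M-parameter model
# Stock PyTorch wrapper for comparison; it splits the batch evenly across GPUs.
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3, 4, 5]).cuda()
```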

@leileilin
Author

leileilin commented Jul 8, 2021

> A 30M model shouldn't run into this; it is much smaller than bert-base. The version I tested was around PyTorch 1.3, though I no longer remember exactly which one.
> If it isn't a PyTorch version problem, my guess is that the gradients are computed on GPU 0, and since your batch size is very large, that step may need a lot of memory.
> To rule things out, you can try the official DataParallel and see what the memory usage looks like.

My model is a transformer-base with hidden size 256, the PyTorch version is 1.6.0, and the batch size is 120.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24562      C   python                                     9737MiB  |
|    1     24562      C   python                                     3549MiB  |
|    4     24562      C   python                                     3549MiB  |
|    5     24562      C   python                                     3549MiB  |
|    6     24562      C   python                                     3549MiB  |
|    7     24562      C   python                                     3549MiB  |
+-----------------------------------------------------------------------------+
The table above shows the GPU memory usage with the official DataParallel.

The earlier batch size was around 150.
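
(For context, the balanced run in the opening comment would be set up roughly like this. This is a sketch assuming the constructor order shown in this repo's README, BalancedDataParallel(gpu0_bsz, model, dim=0); `MyTransformer` is a placeholder:)

```python
from data_parallel import BalancedDataParallel  # this repo's wrapper

model = MyTransformer()  # placeholder: transformer-base, hidden size 256
# gpu0_bsz=1 caps GPU 0 at one sample per forward pass, as in the log above.
model = BalancedDataParallel(1, model, dim=0).cuda()
```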

@birdwcp

birdwcp commented Jul 24, 2023

With this code the GPU memory usage is balanced and a larger batch size fits, but the accuracy of the trained model decreased; it is unclear where the problem lies.
