
The machine has multiple GPUs — how do I pin model training and inference to specific GPUs? #6725

Closed
yinfeigl opened this issue Dec 19, 2017 · 14 comments
Labels
User 用于标记用户问题 (label for user questions)

Comments

yinfeigl commented Dec 19, 2017

Caffe lets you select GPUs with the --gpu flag, as below. How do I do the same in Paddle?
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt # uses the default GPU 0
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt --gpu 2
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt --gpu 0,1,2
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt --gpu all

pkuyym (Contributor) commented Dec 19, 2017

You can set CUDA_VISIBLE_DEVICES, for example:
CUDA_VISIBLE_DEVICES=0,1,2 python train.py
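The same restriction can also be applied from inside a script, as long as the variable is set before the framework initializes CUDA. A minimal sketch (the helper name is made up for illustration):

```python
import os

def set_visible_gpus(physical_ids):
    """Hypothetical helper: restrict this process to the given physical GPUs.

    CUDA_VISIBLE_DEVICES must be set before any CUDA initialization, so call
    this before importing the framework (or export it on the command line).
    Inside the process the chosen GPUs are renumbered 0..n-1, so logical
    device 0 then refers to the first id in physical_ids.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    # Map each in-process (logical) index back to its physical GPU id.
    return {logical: phys for logical, phys in enumerate(physical_ids)}

mapping = set_visible_gpus([1, 2])
print(mapping)  # {0: 1, 1: 2}
```

Note the renumbering: after restricting visibility to GPUs 1 and 2, the framework sees them as devices 0 and 1.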

@peterzhang2029 peterzhang2029 added the User 用于标记用户问题 label Dec 19, 2017
peterzhang2029 (Contributor) commented

Closing due to low activity. Feel free to reopen it.

rulai-huiyingl commented

@pkuyym This doesn't seem to work when using nvidia-docker:

$ export CUDA_VISIBLE_DEVICES=0                                             
$ nvidia-docker run -it -v ~/test:/work paddlepaddle/paddle:latest-gpu python /work/fit_a_line.py

GPU status from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:05:00.0 Off |                  N/A |
| 39%   81C    P2    96W / 250W |   7263MiB / 12205MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:06:00.0 Off |                  N/A |
| 22%   59C    P8    17W / 250W |    206MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:09:00.0 Off |                  N/A |
| 22%   54C    P8    17W / 250W |    206MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 22%   49C    P8    16W / 250W |    206MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     19691    C   python                                        7252MiB |
|    1     19691    C   python                                         195MiB |
|    2     19691    C   python                                         195MiB |
|    3     19691    C   python                                         195MiB |
+-----------------------------------------------------------------------------+

Any other program that then tries to use a GPU fails with an out-of-memory error.
How do I pin a specific GPU when using nvidia-docker?
Thanks!

rulai-huiyingl commented Jan 12, 2018

Found the answer. Per https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation-(version-1.0), specify the GPU at launch like this:

$ export CUDA_VISIBLE_DEVICES=2
$ export NV_GPU=2
$ nvidia-docker run -ti -v ~/test:/work paddlepaddle/paddle:latest-gpu python /work/fit_a_line.py

This way other programs can use the remaining GPUs.
If there is a simpler way, it would be great to add it to the documentation (I couldn't find one in the current docs). If nvidia-docker is the recommended setup, this tip makes things much easier. Thanks!
@luotao1
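The launch recipe above can be wrapped in a small helper. A sketch assuming nvidia-docker v1 semantics (NV_GPU selects which devices are mounted into the container) and the image and paths from this thread; the function name is made up:

```python
import os

def paddle_container_cmd(gpu_id, workdir, script):
    """Build the environment and argv for running a script in the Paddle
    image pinned to one physical GPU (nvidia-docker v1)."""
    env = dict(os.environ)
    env["NV_GPU"] = str(gpu_id)                # devices mounted into container
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # mirrors the commands above
    cmd = ["nvidia-docker", "run", "-ti",
           "-v", "{}:/work".format(workdir),
           "paddlepaddle/paddle:latest-gpu",
           "python", "/work/{}".format(script)]
    # Pass both to e.g. subprocess.call(cmd, env=env)
    return env, cmd

env, cmd = paddle_container_cmd(2, "/home/user/test", "fit_a_line.py")
print(" ".join(cmd))
```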

linrio commented May 14, 2018

@luotao1 How should this nvidia-docker issue be resolved?

luotao1 (Contributor) commented May 14, 2018

@linrio Hi, regarding the two questions you raised in mlcommons/training#40, could you open a separate issue for each? We will answer them in the new issues.

linrio commented May 15, 2018

Okay!

linrio commented Jun 26, 2018

@luotao1
Code:

    # Setup place and executor for runtime
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)
    feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

I have 4 GPUs, but this only uses GPU 0. How do I set fluid.CUDAPlace() so that all 4 GPUs (or just 2) are used?

luotao1 (Contributor) commented Jun 27, 2018

@linrio You can use ParallelExecutor: http://paddlepaddle.org/docs/develop/api/fluid/en/fluid.html#permalink-30-parallelexecutor

You only need to set CUDA_VISIBLE_DEVICES; ParallelExecutor will copy the data to the GPUs.
If the batch size is 16 and there are four cards 0,1,2,3, ParallelExecutor's run method splits the data into four parts and sends one to each card, so each card sees a batch size of 4.
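The sharding described above can be illustrated in plain Python (a sketch of the arithmetic only, not Paddle's actual implementation):

```python
def shard_batch(batch, num_cards):
    """Split one mini-batch evenly across num_cards devices, mirroring how
    a data-parallel executor distributes samples to the visible GPUs."""
    per_card = len(batch) // num_cards
    assert per_card * num_cards == len(batch), "batch size must divide evenly"
    return [batch[i * per_card:(i + 1) * per_card] for i in range(num_cards)]

shards = shard_batch(list(range(16)), 4)
print([len(s) for s in shards])  # [4, 4, 4, 4]
```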

linrio commented Jun 27, 2018

@luotao1 I changed the code as you suggested:

    exe = fluid.ParallelExecutor(use_cuda=True)
    feeder = fluid.DataFeeder(feed_list=[data, label], place)

But how should the place argument of fluid.DataFeeder() be set?

luotao1 (Contributor) commented Jun 27, 2018

place is the same as when using the plain Executor, i.e. place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() works.

linrio commented Jun 28, 2018

@luotao1 I modified the code as you suggested:

place = fluid.CUDAPlace(3) if use_cuda else fluid.CPUPlace()
exe = fluid.ParallelExecutor(use_cuda=True)
feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

and changed

                cost_val, acc_val = exe.run(main_program,
                                            feed=feeder.feed(data),
                                            fetch_list=[cost, acc_out])

to:

cost_val, acc_val = exe.run(fetch_list=[cost, acc_out],feed_dict=feeder.feed(data))

but it raised this error:

Traceback (most recent call last):
  File "train.py", line 232, in <module>
    save_dirname="understand_sentiment_conv.inference.model")
  File "train.py", line 204, in main
    save_dirname=save_dirname)
  File "train.py", line 188, in train
    train_loop(fluid.default_main_program())
  File "train.py", line 159, in train_loop
    cost_val, acc_val = exe.run(fetch_list=[cost, acc_out],feed_dict=feeder.feed(data))
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 145, in run
    self.executor.run(fetch_list, fetch_var_name, feed_tensor_dict)
TypeError: run(): incompatible function arguments. The following argument types are supported:
    1. (self: paddle.fluid.core.ParallelExecutor, arg0: List[unicode], arg1: unicode, arg2: Dict[unicode, paddle.fluid.core.LoDTensor]) -> None

Invoked with: <paddle.fluid.core.ParallelExecutor object at 0x7f317dddd7b0>

For reference,

                print(feeder.feed(data))
                print([cost, acc_out])

print the following, respectively:

{'words': <paddle.fluid.core.LoDTensor object at 0x7f19bbf09d50>, 'label': <paddle.fluid.core.LoDTensor object at 0x7f19bbf09d80>}
[name: "mean_0.tmp_0"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: 1
    }
  }
}
persistable: false
, name: "accuracy_0.tmp_2"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: 1
    }
    lod_level: 0
  }
}
persistable: false
]

I checked the run() method in /paddle/fluid/executor.py, and its parameters match what I passed:


    def run(self,
            program=None,
            feed=None,
            fetch_list=None,
            feed_var_name='feed',
            fetch_var_name='fetch',
            scope=None,
            return_numpy=True,
            use_program_cache=False):

Where am I passing the arguments incorrectly?
Also, if I keep place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace(), doesn't that still use only GPU 0?
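For what it's worth, the TypeError above says the underlying binding expects fetch_list as List[unicode], i.e. variable names, whereas [cost, acc_out] are Variable objects. A pure-Python sketch of the normalization the error message suggests (the Var class and helper are made up for illustration and not verified against this Paddle version):

```python
class Var(object):
    """Stand-in for a framework Variable exposing only a .name attribute."""
    def __init__(self, name):
        self.name = name

def to_fetch_names(fetch_list):
    # The pybind signature in the traceback expects List[unicode]
    # (variable names), so convert Variable-like objects to their names.
    return [v if isinstance(v, str) else v.name for v in fetch_list]

names = to_fetch_names([Var("mean_0.tmp_0"), Var("accuracy_0.tmp_2")])
print(names)  # ['mean_0.tmp_0', 'accuracy_0.tmp_2']
```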

luotao1 (Contributor) commented Jun 28, 2018

dagelailege commented

Has this run() problem been solved? I'm hitting the same issue.
