
The machine has multiple GPUs — how do I pin model training and inference to specific GPUs? #6725

Closed
yinfeigl opened this issue Dec 19, 2017 · 14 comments
Labels
User 用于标记用户问题 (label for user questions)

Comments

yinfeigl commented Dec 19, 2017

Caffe lets you select GPUs with the --gpu flag, as below. How do I do the same in Paddle?
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt # uses the default GPU 0
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt --gpu 2
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt --gpu 0,1,2
./build/tools/caffe train --solver=examples/testXXX/solver.prototxt --gpu all

pkuyym (Contributor) commented Dec 19, 2017

You can set CUDA_VISIBLE_DEVICES, for example:
CUDA_VISIBLE_DEVICES=0,1,2 python train.py
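The same restriction can also be applied from inside a script, as long as the variable is set before the framework initializes CUDA. A minimal sketch (the helper name is made up for illustration):

```python
import os

def set_visible_gpus(physical_ids):
    """Hypothetical helper: restrict this process to the given physical GPUs.

    CUDA_VISIBLE_DEVICES must be set before any CUDA initialization, so call
    this before importing the framework (or export it on the command line).
    Inside the process the chosen GPUs are renumbered 0..n-1, so logical
    device 0 then refers to the first id in physical_ids.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    # Map each in-process (logical) index back to its physical GPU id.
    return {logical: phys for logical, phys in enumerate(physical_ids)}

mapping = set_visible_gpus([1, 2])
print(mapping)  # {0: 1, 1: 2}
```

Note the renumbering: after restricting visibility to GPUs 1 and 2, the framework sees them as devices 0 and 1.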

@peterzhang2029 peterzhang2029 added the User 用于标记用户问题 label Dec 19, 2017
peterzhang2029 (Contributor) commented

Closing due to low activity. Feel free to reopen it.

rulai-huiyingl commented

@pkuyym This doesn't seem to work when using nvidia-docker:

$ export CUDA_VISIBLE_DEVICES=0                                             
$ nvidia-docker run -it -v ~/test:/work paddlepaddle/paddle:latest-gpu python /work/fit_a_line.py

GPU status from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.66                 Driver Version: 384.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:05:00.0 Off |                  N/A |
| 39%   81C    P2    96W / 250W |   7263MiB / 12205MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:06:00.0 Off |                  N/A |
| 22%   59C    P8    17W / 250W |    206MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:09:00.0 Off |                  N/A |
| 22%   54C    P8    17W / 250W |    206MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 22%   49C    P8    16W / 250W |    206MiB / 12207MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     19691    C   python                                        7252MiB |
|    1     19691    C   python                                         195MiB |
|    2     19691    C   python                                         195MiB |
|    3     19691    C   python                                         195MiB |
+-----------------------------------------------------------------------------+

Any other program that then tries to use a GPU fails with an out-of-memory error.
How do I pin a specific GPU when using nvidia-docker?
Thanks!

rulai-huiyingl commented Jan 12, 2018

Found the answer. Per https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation-(version-1.0), specify the GPU at launch like this:

$ export CUDA_VISIBLE_DEVICES=2
$ export NV_GPU=2
$ nvidia-docker run -ti -v ~/test:/work paddlepaddle/paddle:latest-gpu python /work/fit_a_line.py

This way other programs can use the remaining GPUs.
If there is a simpler way, it would be great to add it to the documentation (I couldn't find one in the current docs). If nvidia-docker is the recommended setup, this tip makes things much easier. Thanks!
@luotao1
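The launch recipe above can be wrapped in a small helper. A sketch assuming nvidia-docker v1 semantics (NV_GPU selects which devices are mounted into the container) and the image and paths from this thread; the function name is made up:

```python
import os

def paddle_container_cmd(gpu_id, workdir, script):
    """Build the environment and argv for running a script in the Paddle
    image pinned to one physical GPU (nvidia-docker v1)."""
    env = dict(os.environ)
    env["NV_GPU"] = str(gpu_id)                # devices mounted into container
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # mirrors the commands above
    cmd = ["nvidia-docker", "run", "-ti",
           "-v", "{}:/work".format(workdir),
           "paddlepaddle/paddle:latest-gpu",
           "python", "/work/{}".format(script)]
    # Pass both to e.g. subprocess.call(cmd, env=env)
    return env, cmd

env, cmd = paddle_container_cmd(2, "/home/user/test", "fit_a_line.py")
print(" ".join(cmd))
```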

linrio commented May 14, 2018

@luotao1 How should this nvidia-docker issue be resolved?

luotao1 (Contributor) commented May 14, 2018

@linrio Hi, regarding the two questions you raised in mlcommons/training#40, could you open a separate issue for each? We will answer them in the new issues.

linrio commented May 15, 2018

Okay!

linrio commented Jun 26, 2018

@luotao1
Code:

    # Setup place and executor for runtime
    place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
    exe = fluid.Executor(place)
    feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

I have 4 GPUs, but this only uses GPU 0. How do I set fluid.CUDAPlace() so that all 4 GPUs (or just 2) are used?

luotao1 (Contributor) commented Jun 27, 2018

@linrio You can use ParallelExecutor: http://paddlepaddle.org/docs/develop/api/fluid/en/fluid.html#permalink-30-parallelexecutor

You only need to set CUDA_VISIBLE_DEVICES; ParallelExecutor will copy the data to the GPUs.
If the batch size is 16 and there are four cards 0,1,2,3, ParallelExecutor's run method splits the data into four parts and sends one to each card, so each card sees a batch size of 4.
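The sharding described above can be illustrated in plain Python (a sketch of the arithmetic only, not Paddle's actual implementation):

```python
def shard_batch(batch, num_cards):
    """Split one mini-batch evenly across num_cards devices, mirroring how
    a data-parallel executor distributes samples to the visible GPUs."""
    per_card = len(batch) // num_cards
    assert per_card * num_cards == len(batch), "batch size must divide evenly"
    return [batch[i * per_card:(i + 1) * per_card] for i in range(num_cards)]

shards = shard_batch(list(range(16)), 4)
print([len(s) for s in shards])  # [4, 4, 4, 4]
```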

linrio commented Jun 27, 2018

@luotao1 I changed the code as you suggested:

    exe = fluid.ParallelExecutor(use_cuda=True)
    feeder = fluid.DataFeeder(feed_list=[data, label], place)

But how should the place argument of fluid.DataFeeder() be set?

luotao1 (Contributor) commented Jun 27, 2018

place is the same as when using the plain Executor, i.e. place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace() works.

linrio commented Jun 28, 2018

@luotao1 I modified the code as you suggested:

place = fluid.CUDAPlace(3) if use_cuda else fluid.CPUPlace()
exe = fluid.ParallelExecutor(use_cuda=True)
feeder = fluid.DataFeeder(feed_list=[data, label], place=place)

and changed

                cost_val, acc_val = exe.run(main_program,
                                            feed=feeder.feed(data),
                                            fetch_list=[cost, acc_out])

to:

cost_val, acc_val = exe.run(fetch_list=[cost, acc_out],feed_dict=feeder.feed(data))

but it raised this error:

Traceback (most recent call last):
  File "train.py", line 232, in <module>
    save_dirname="understand_sentiment_conv.inference.model")
  File "train.py", line 204, in main
    save_dirname=save_dirname)
  File "train.py", line 188, in train
    train_loop(fluid.default_main_program())
  File "train.py", line 159, in train_loop
    cost_val, acc_val = exe.run(fetch_list=[cost, acc_out],feed_dict=feeder.feed(data))
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/parallel_executor.py", line 145, in run
    self.executor.run(fetch_list, fetch_var_name, feed_tensor_dict)
TypeError: run(): incompatible function arguments. The following argument types are supported:
    1. (self: paddle.fluid.core.ParallelExecutor, arg0: List[unicode], arg1: unicode, arg2: Dict[unicode, paddle.fluid.core.LoDTensor]) -> None

Invoked with: <paddle.fluid.core.ParallelExecutor object at 0x7f317dddd7b0>

For reference,

                print(feeder.feed(data))
                print([cost, acc_out])

print the following, respectively:

{'words': <paddle.fluid.core.LoDTensor object at 0x7f19bbf09d50>, 'label': <paddle.fluid.core.LoDTensor object at 0x7f19bbf09d80>}
[name: "mean_0.tmp_0"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: 1
    }
  }
}
persistable: false
, name: "accuracy_0.tmp_2"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: 1
    }
    lod_level: 0
  }
}
persistable: false
]

I checked the run() method in /paddle/fluid/executor.py, and its parameters match what I passed:


    def run(self,
            program=None,
            feed=None,
            fetch_list=None,
            feed_var_name='feed',
            fetch_var_name='fetch',
            scope=None,
            return_numpy=True,
            use_program_cache=False):

Where am I passing the arguments incorrectly?
Also, if I keep place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace(), doesn't that still use only GPU 0?
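For what it's worth, the TypeError above says the underlying binding expects fetch_list as List[unicode], i.e. variable names, whereas [cost, acc_out] are Variable objects. A pure-Python sketch of the normalization the error message suggests (the Var class and helper are made up for illustration and not verified against this Paddle version):

```python
class Var(object):
    """Stand-in for a framework Variable exposing only a .name attribute."""
    def __init__(self, name):
        self.name = name

def to_fetch_names(fetch_list):
    # The pybind signature in the traceback expects List[unicode]
    # (variable names), so convert Variable-like objects to their names.
    return [v if isinstance(v, str) else v.name for v in fetch_list]

names = to_fetch_names([Var("mean_0.tmp_0"), Var("accuracy_0.tmp_2")])
print(names)  # ['mean_0.tmp_0', 'accuracy_0.tmp_2']
```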

luotao1 (Contributor) commented Jun 28, 2018

dagelailege commented

Has this run() problem been solved? I'm hitting the same issue.
