How to train a model using multiple machines? #59

Closed
yefeng-zheng opened this issue Jan 29, 2016 · 16 comments · 10 participants

yefeng-zheng commented Jan 29, 2016

The main selling point of CNTK (compared to other deep learning packages) is that it supports training a large model on a compute cluster. However, I couldn't find any information online or in the book on how to set up training across multiple computers. Can anybody help?


such87 commented Jan 29, 2016

Hi,
Can you try it like this?

mpiexec -np 2 cntk configFile=../Config/01_OneHidden.config parallelTrain=true deviceId=0


Contributor

amitaga commented Jan 29, 2016

Parallel training in CNTK needs to be launched using MPI. Refer to the example "Examples/Other/Simple2d/Config/Multigpu.config", which illustrates the CNTK config options needed for parallel training.

For example, to run parallel training using 2 workers on the same machine:

cd /Examples/Other/Simple2d/Data
mpiexec -np 2 cntk configFile=../Config/Multigpu.config

To run across multiple machines, an MPI hosts file needs to be passed to the mpiexec command to specify the hosts where the CNTK parallel training workers will be launched. Please refer to the documentation of the MPI implementation you are using for details on launching an MPI job spanning multiple machines.
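For illustration, a multi-machine launch might look like the sketch below. The host names are hypothetical, and the exact hosts-file flag varies by MPI implementation (e.g. -machinefile for MS-MPI/MPICH, --hostfile for Open MPI), so check your MPI documentation:

# hosts.txt -- one host per line (hypothetical machine names)
server01
server02

mpiexec -machinefile hosts.txt -np 2 cntk configFile=../Config/Multigpu.config parallelTrain=true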


Contributor

yqwangustc commented Jan 29, 2016

Just to add some comments on top of Amit's answer: to use multiple GPUs in training, it is better to set deviceId=auto; otherwise, e.g. if we set deviceId=0, two individual MPI workers will compete for the 0th GPU when they are launched on the same machine.

We may need to reset deviceId to auto once we detect parallelTrain=true.
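As a concrete sketch, a two-worker launch on one multi-GPU machine would then be:

mpiexec -np 2 cntk configFile=../Config/Multigpu.config parallelTrain=true deviceId=auto

With deviceId=auto, each worker locks a different free GPU instead of both fighting over GPU 0.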


yefeng-zheng commented Jan 30, 2016

Thanks for all the answers. I have tested it on my computer, which has only one GPU. I found that if I run with "mpiexec -np 2", one process takes the GPU and the other process runs on the CPU (using all available cores). This is very smart. Next week, I will test on our compute cloud. Hopefully everything will go smoothly.


rahulbhalerao001 commented Jan 31, 2016

Can a model be trained on multiple CPU-only machines? Or is it the case that, for the multi-machine examples, GPUs are required on all the machines?


Member

frankseide commented Jan 31, 2016

CPU and GPU are equivalent, with very few image-related exceptions where we rely on cuDNN and lack CPU implementations.

The CPU code already leverages multiple cores, so you may need to experiment a little with how many CPU threads vs. MPI processes you want to use. E.g. start with one MPI process per server, and then compare with using 2 while limiting numCPUThreads to half the number of cores.

Let us know if you run into problems (we normally do not run parallelized across CPU-only machines).
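As a sketch of that comparison (assuming a 16-core server, and assuming deviceId=-1 selects the CPU):

# one MPI process per server; OpenMP uses all cores
mpiexec -np 1 cntk configFile=../Config/Multigpu.config parallelTrain=true deviceId=-1

# two MPI processes per server, each limited to half the cores
mpiexec -np 2 cntk configFile=../Config/Multigpu.config parallelTrain=true deviceId=-1 numCPUThreads=8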



rahulbhalerao001 commented Jan 31, 2016

Thank you for the prompt response. Could you please let me know whether any of the provided examples can be run this way in a multiple CPU-only machine setting? I am new to MPI, so pointers on getting started would be very helpful.


Member

frankseide commented Feb 1, 2016

In reply to weixing.mei, who asked: "Hi all, do you have problems when setting deviceId=auto? I'm running CNTK on Linux; according to the code, setting deviceId=auto will create a lock file in /var/lock/."

Yes, it will. Is that causing a problem for you?

We are seeing that some environments do not have this directory, or do not have it write-enabled for users. It is on our TODO list to find a more universal solution.
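A quick way to check whether the lock directory is usable on your machine (generic Linux commands, not CNTK-specific):

ls -ld /var/lock                                      # often a symlink to /run/lock
touch /var/lock/cntk.test && rm /var/lock/cntk.test   # succeeds only if it is writable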


Sandy4321 commented Feb 1, 2016

Please help me understand how I can use this with several CPUs. For example, I have a PC with 4 CPUs; can I train on all 4 CPUs?



Contributor

dongyu888 commented Feb 1, 2016

The BLAS libraries will automatically use all the CPU cores you have on your computer. If you run on a single box, you can run CNTK directly to exploit them, without using MPI.
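In other words, on a single CPU-only box a plain, non-MPI launch is enough; a sketch, again assuming deviceId=-1 selects the CPU:

cntk configFile=../Config/Multigpu.config deviceId=-1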


Member

frankseide commented Feb 1, 2016

Yes, you can. First of all, by default it will already use all cores on your machine through OpenMP. If you do nothing, you should see a CPU utilization >> 1 core. If not, please let us know, and try setting the global parameter numCPUThreads to the number of cores in your system.

However, this may or may not be optimal, depending on your specific hardware configuration, model dimensions, and the BLAS library (which would be ACML unless you explicitly switched to MKL). The two options you have are:

· single-process, using OpenMP to parallelize matrix operations across multiple threads. You can set the parameter numCPUThreads to select how many CPU cores OpenMP may use. The default is all cores (although in some cases we artificially cap this for operations where we found more threads are actually slower).

· multi-process data parallelism (1-bit SGD or model averaging). If you choose this on a single machine, you probably need to set numCPUThreads to limit the number of cores each process can use. E.g. if you have 12 cores and use 3-way data parallelism, you probably need to set numCPUThreads=4.

I cannot predict which will work better. We have seen that some BLAS libraries perform worse once you span a NUMA "socket". E.g. if you have 3 CPU chips with 4 cores each, it may or may not be better to run 3-way data parallelism with 4-core OpenMP parallelism, compared to 12-core OpenMP parallelism. I would just try different combinations.
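A command-line sketch of the 12-core example above (the config file name is hypothetical):

# 3-way data parallelism, 4 OpenMP threads per worker, CPU only (assuming deviceId=-1 selects the CPU)
mpiexec -np 3 cntk configFile=my_model.config parallelTrain=true deviceId=-1 numCPUThreads=4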



yefeng-zheng commented Feb 1, 2016

I tested on a computer (Windows Server 2012 R2) with 4 Titan X GPUs. Unfortunately, I didn't see any speed-up in training time. When I ran with mpiexec -np 4, I confirmed that all 4 GPUs were used, at 20-30% utilization. If I ran with a single GPU, I also confirmed that only one GPU was used, but its utilization was higher (40-50%). However, in the end, training with 4 GPUs was actually slower than with a single GPU.

I evaluated on simple2d and MNIST, and they may not be good examples. Do you have any example that shows the benefit of training with multiple GPUs? Thank you very much!


Member

frankseide commented Feb 1, 2016

The minibatch size is too small. We are working on updating the documentation and the sample.

Thanks!
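For context, the minibatch size is set in the SGD section of the CNTK config; a sketch (the values are only illustrative, and the right minibatch size depends on the model and the number of workers):

SGD = [
    epochSize = 0             # 0 = one full pass over the data
    minibatchSize = 256       # larger minibatches amortize the per-sync overhead of parallel training
    learningRatesPerMB = 0.1
    maxEpochs = 10
]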



vikingMei commented Feb 1, 2016

Yeah, on my computer there is no write permission on /var/lock, which is a soft link to /run/lock. I have changed the lock directory used by CrossProcessMutex to the current directory; so far, everything seems OK.
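If patching the source is not an option, another possible workaround (generic Linux, requires root, and makes the directory world-writable, which may not be acceptable in your environment) would be:

sudo chmod 1777 /run/lock    # /var/lock is usually a symlink to this directory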



Member

frankseide commented Feb 1, 2016

Added Issue #62 on /var/lock and #73 on better documentation/samples for multi-GPU training. I will close this one instead.

