Using Multiple GPUs

Frédéric Bastien edited this page Mar 8, 2016 · 9 revisions

You probably want to use platoon for data parallelism or the new gpu back-end for model parallelism.

This page describes how to use Python's multiprocessing module to drive multiple GPUs, by spawning one child process per GPU from a common parent. The example provided uses two GPUs to fit logistic regression models, and demonstrates how to pack up and communicate both common and private arguments from the parent to the children.

The key is to initialize theano specifying the CPU as the target device, and to later override the assigned device to a specified GPU. According to a discussion on theano-users, this re-binding of devices can occur only once per process. So this example would (probably) not work with threads, nor could it be used to do GPU switching.

Here is a generic .theanorc file, similar to the file I use:

  [global]
  floatX = float32
  device = cpu
  openmp = True
  base_compiledir = /path/to/base/dir

  [nvcc]
  fastmath = True

  [blas]
  ldflags = -L/path/to/blas/libs -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lm

  [cuda]
  root = /path/to/cuda/

At a high level, the procedure looks like:

  1. Set up your arguments.
  2. Launch each of your sub-processes.
  3. In the child process, import theano.sandbox.cuda and bind theano in the child process to a specific GPU.

In the example below, f is the function containing the work which will be carried out by the child process, shared_args provides shared arguments from the parent, and private_args holds the name of the gpu device to use for this child.

def f(shared_args,private_args): 
    # At this point, no theano import statements have been processed, and so the device is unbound
    # Import sandbox.cuda to bind the specified GPU to this subprocess
    # then import the remaining theano and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])
    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams

That's it! After calling theano.sandbox.cuda.use(private_args['gpu']), proceed as you normally would in any theano script.

The example below does not take into account use cases that include communication between sub-processes, nor does it perform any post-processing on the output of each sub-process. If you need to perform inter-process communication, the Manager (declared in the parent, see below) can provide a safe and easy way to do this.

""" Test script that uses two GPUs, one per sub-process,
via the Python multiprocessing module.  Each GPU fits a logistic regression model. """

# These imports will not trigger any theano GPU binding
from multiprocessing import Process, Manager
import numpy as np
import os

def f(shared_args,private_args): 
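    """ Build and fit a logistic regression model.  Adapted from the
    logistic regression example in the Theano tutorial. """
    # Import sandbox.cuda to bind the specified GPU to this subprocess
    # then import the remaining theano and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])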
    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams
    rng = np.random    
    # Pull the size of the matrices from the shared arguments
    shared_args_dict = shared_args[0]
    N = shared_args_dict['N']
    feats = shared_args_dict['n_features']
    D = (rng.randn(N, feats), rng.randint(size=N,low=0, high=2))
    training_steps = shared_args_dict['n_steps']
    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()
    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))   # Probability that target = 1
    prediction = p_1 > 0.5                    # The prediction thresholded
    xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w ** 2).sum()# The cost to minimize
    gw,gb = T.grad(cost, [w, b])              # Compute the gradient of the cost
                                              # (we shall return to this in a
                                              # following section of this tutorial)
    # Compile.  allow_input_downcast reassures the compiler that we are ok using
    # 64 bit floating point numbers on the cpu, but only 32 bit floats on the gpu.
    train = theano.function(
              inputs=[x,y],
              outputs=[prediction, xent],
              updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)), allow_input_downcast=True)
    predict = theano.function(inputs=[x], outputs=prediction, allow_input_downcast=True)
    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])           

if __name__ == '__main__':
    # Construct a dict to hold arguments that can be shared by both processes
    # The Manager class is a convenient way to implement this
    # See the multiprocessing documentation on managers.
    # Important: managers store information in mutable *proxy* data structures
    # but any mutation of those proxy vars must be explicitly written back to the manager.
    manager = Manager()

    args = manager.list()
    args.append({})
    shared_args = args[0]
    shared_args['N'] = 400
    shared_args['n_features'] = 784
    shared_args['n_steps'] = 10000
    args[0] = shared_args       
    # Construct the specific args for each of the two processes
    p_args = {}
    q_args = {}
    p_args['gpu'] = 'gpu0'
    q_args['gpu'] = 'gpu1'

    # Run both sub-processes
    p = Process(target=f, args=(args,p_args,))
    q = Process(target=f, args=(args,q_args,))
    p.start()
    q.start()
    # Wait for both children to finish
    p.join()
    q.join()
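
The example above discards whatever each child computes. A minimal sketch of collecting one result per child through a Manager dict, for any number of devices, might look like the following. It is written for Python 3, and worker, run and the result strings are hypothetical stand-ins for f and its output; no theano call is made, so it runs on a machine without GPUs.

```python
from multiprocessing import Manager, Process


def worker(results, private_args):
    """Stand-in for f: store a placeholder result under this child's device name."""
    device = private_args['gpu']
    # In the real script, theano.sandbox.cuda.use(device) and the model
    # fitting would go here; this sketch only records a string.
    results[device] = "finished on " + device


def run(devices):
    """Start one child per device name and return their results as a plain dict."""
    with Manager() as manager:
        results = manager.dict()  # proxy that every child can write to
        procs = [Process(target=worker, args=(results, {'gpu': d}))
                 for d in devices]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy out of the proxy before the manager shuts down.
        return dict(results)


if __name__ == '__main__':
    print(run(['gpu0', 'gpu1']))
```

Assigning to a key of the manager dict goes through the proxy, so no explicit write-back is needed here, unlike the nested dict stored inside the manager list in the example above.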

You can also use a Queue or a duplex Pipe to transfer data between processes. One user creates batches of data on the CPU and processes them on the GPU. A sketch of this is:

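import multiprocessing, time
def batchBuilderThread(q):
    while True: ##Create batches of data to be processed,
        time.sleep(10) ##Assume this step is slow.
        batch = (1, 2) ##Placeholder batch; build the real data here
        q.put(batch) ##Put batches into a queue

batchQueue=multiprocessing.Queue(20) ##Avoids running out of RAM
##One builder process per CPU core
threads=[multiprocessing.Process(target=batchBuilderThread, args=(batchQueue,))
         for i in xrange(multiprocessing.cpu_count())]
for thread in threads:
    thread.start()  ## Create batches in parallel

while True:
    batch = batchQueue.get() ##Blocks until a batch is available
    print batch[0]+batch[1]  ##Process batches one at a time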
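
The duplex Pipe mentioned above can carry data in both directions over a single connection. The following is a minimal sketch for Python 3 in which the parent sends batches and the child sends one result back per batch. The names process_batch, consumer and run are hypothetical, the None sentinel is just one possible way to end the loop, and summing stands in for the GPU work.

```python
from multiprocessing import Pipe, Process


def process_batch(batch):
    """Placeholder for the GPU-side work: here, just sum the batch."""
    return sum(batch)


def consumer(conn):
    """Child: receive batches until the None sentinel, reply with one result each."""
    while True:
        batch = conn.recv()
        if batch is None:
            break
        conn.send(process_batch(batch))
    conn.close()


def run(batches):
    """Parent: send each batch down the pipe and read its result back."""
    parent_conn, child_conn = Pipe(duplex=True)
    child = Process(target=consumer, args=(child_conn,))
    child.start()
    child_conn.close()  # the parent only uses its own end
    results = []
    for batch in batches:
        parent_conn.send(batch)
        results.append(parent_conn.recv())
    parent_conn.send(None)  # tell the child to stop
    child.join()
    return results


if __name__ == '__main__':
    print(run([(1, 2), (3, 4)]))
```

A Pipe connects exactly two endpoints, so the Queue version above remains the better fit when several builder processes feed one consumer.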

Note that on Unix, multiprocessing.Process in Python 2 and in Python 3 before 3.4 is always a forked process (just fork(), without a following exec()). Several libraries have problems with that, and in general any library is much less tested in a forked process, so it is more likely to misbehave there. E.g. Numpy can have this problem, and maybe even CUDA (thus Theano) itself.

The solution is to start a fresh Python interpreter in the sub-process (fork() + exec()). Python 3.4 and later support this through the spawn start method of multiprocessing.
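
A minimal sketch of selecting the spawn start method, assuming Python 3.4 or later: multiprocessing.get_context("spawn") returns a context whose Process class starts each child in a fresh interpreter. The names work and launch are hypothetical, and the child only prints instead of importing theano.

```python
import multiprocessing as mp


def work(device):
    # Runs in a fresh interpreter: theano has not been imported here yet,
    # so the device could still be bound at this point.
    print("child for", device, "started")


def launch(devices, start_method="spawn"):
    """Start one child per device with the given start method; return exit codes."""
    ctx = mp.get_context(start_method)
    procs = [ctx.Process(target=work, args=(d,)) for d in devices]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


if __name__ == '__main__':
    print(launch(["gpu0", "gpu1"]))
```

Calling multiprocessing.set_start_method('spawn') once at the top of the main block is the global alternative. The context object keeps the choice local to the code that uses it.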


For earlier Python versions, you can use this code (TaskSystem module). It provides the class ExecingProcess, which is modeled after multiprocessing.Process but does a fork+exec. It also provides AsyncTask, which adds a duplex pipe on top of this and works with both ExecingProcess and the standard multiprocessing.Process. For details and an example, see the linked code. Example for using one specific GPU:

AsyncTask(func=process, name="GPU0 proc", env_update={"THEANO_FLAGS": "device=gpu0"}, mustExec=True)
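
With Python 3.4 or later, a similar effect can probably be had from the standard library alone. The sketch below assumes that Theano reads THEANO_FLAGS when it is first imported, so the variable is set in a spawned child before any theano import. The names theano_env and child are hypothetical, and the theano import itself is left as a comment so the sketch runs without Theano installed.

```python
import multiprocessing
import os


def theano_env(device):
    """Environment update that selects one device, e.g. 'gpu0'."""
    return {"THEANO_FLAGS": "device=" + device}


def child(device):
    # Fresh interpreter (spawn): nothing has imported theano yet.
    os.environ.update(theano_env(device))
    # import theano  # assumption: would now pick up THEANO_FLAGS
    print("THEANO_FLAGS =", os.environ["THEANO_FLAGS"])


if __name__ == '__main__':
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=child, args=("gpu0",))
    p.start()
    p.join()
```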

On Windows, the multiprocessing module always does this anyway, because Windows has no fork.
