Deep neural network framework for multiple GPUs


Deep neural network framework for GPU clusters:

  • supports NVIDIA GPUDirect RDMA

  • easy distributed computation:

    Matrix C = dot(A,B); //uses one GPU
    Matrix C = dotMPI(A,B); //uses all available GPUs on the board or in the network

  • no delay between batches due to asynchronous memory copies to the GPU:
    gpu.init_batch_allocator(X, y, 128);
    for(int i = 0; i < gpu.m_total_batches; i++)
    {
      gpu.allocate_next_batch_async(); //loads the next batch while you do computations
      result = dot(/* current batch */, w1); //do your computations here
      gpu.replace_current_batch_with_next(); //get the next batch which is already loaded
    }

  • distributed weights which are larger than the memory of a single GPU:

    ClusterNet gpus = ClusterNet(argc,argv,12346);
    Matrix *batch = gpus.rand(128,100000);//48 MB
    Matrix *out1 = empty(128,40000);//19 MB
    Matrix *out2 = empty(128,20000);//9 MB
    Matrix *W1 = gpus.distributed_uniformSqrtWeight(100000,40000);//15258 MB
    Matrix *W2 = gpus.distributed_uniformSqrtWeight(40000,20000);//3051 MB
    gpus.tick("Time taken");
    gpus.dotMPI(batch,W1,out1);
    gpus.dotMPI(out1,W2,out2);
    gpus.tock("Time taken");
    >>>Time taken: 117.704285 ms.
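
The batch loop above overlaps loading with computation. The same double-buffering idea can be sketched in standard C++ without a GPU. This is only an illustration under stated assumptions: `load_batch` and `sum_with_prefetch` are hypothetical names, the "copy" is an ordinary host-side vector copy, and `std::async` stands in for whatever asynchronous copy mechanism the framework actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a host-to-device copy: returns batch `index` of `data`.
std::vector<float> load_batch(const std::vector<float>& data,
                              std::size_t index, std::size_t batch_size) {
    auto first = data.begin() + static_cast<std::ptrdiff_t>(index * batch_size);
    return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(batch_size));
}

// Sums the data batch by batch. While batch i is being summed,
// batch i + 1 is already loading on another thread.
float sum_with_prefetch(const std::vector<float>& data, std::size_t batch_size) {
    assert(batch_size > 0);
    const std::size_t total_batches = data.size() / batch_size; // remainder is ignored
    if (total_batches == 0) return 0.0f;

    float total = 0.0f;
    std::vector<float> current = load_batch(data, 0, batch_size);
    for (std::size_t i = 0; i < total_batches; i++) {
        std::future<std::vector<float>> next;
        if (i + 1 < total_batches) {
            // Start loading the next batch in the background.
            next = std::async(std::launch::async, [&data, i, batch_size] {
                return load_batch(data, i + 1, batch_size);
            });
        }
        // Do the computation on the current batch in the meantime.
        total += std::accumulate(current.begin(), current.end(), 0.0f);
        // Swap in the batch that was loading during the computation.
        if (next.valid()) current = next.get();
    }
    return total;
}
```

The computation only waits at `next.get()`, so if loading a batch is faster than processing one, the loop never stalls between batches.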
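
The size comments on the weight matrices can be checked with a few lines of arithmetic. The sketch below assumes 4-byte floats and reads "MB" as 2^20 bytes, rounded down; `matrix_mib` is a hypothetical helper, not part of the framework.

```cpp
#include <cstdint>

// Size of a dense rows x cols matrix in MiB (2^20 bytes), rounded down.
// Assumes 4-byte (single-precision) elements unless told otherwise.
constexpr std::uint64_t matrix_mib(std::uint64_t rows, std::uint64_t cols,
                                   std::uint64_t bytes_per_element = 4) {
    return rows * cols * bytes_per_element / (1024ULL * 1024ULL);
}

// W1 (100000 x 40000) and W2 (40000 x 20000) from the example above.
static_assert(matrix_mib(100000, 40000) == 15258, "W1 size");
static_assert(matrix_mib(40000, 20000) == 3051, "W2 size");
```

Under these assumptions W1 alone needs roughly 15 GB, which is why it has to be distributed across several GPUs rather than held on one.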
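
One way a product with distributed weights can work is to give each worker a block of columns of the weight matrix, let it compute its part of the output, and then gather the parts. The following CPU-only sketch simulates that. It assumes a column-wise split and runs the "workers" sequentially in one process; `Mat`, `column_shard` and `dot_sharded` are hypothetical names, and the framework's actual partitioning and communication may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a hypothetical stand-in, not the framework's Matrix type.
struct Mat {
    std::size_t rows, cols;
    std::vector<float> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Plain single-device product C = A * B.
Mat dot(const Mat& A, const Mat& B) {
    assert(A.cols == B.rows);
    Mat C(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; i++)
        for (std::size_t k = 0; k < A.cols; k++)
            for (std::size_t j = 0; j < B.cols; j++)
                C.at(i, j) += A.at(i, k) * B.at(k, j);
    return C;
}

// Columns [begin, end) of W: the shard that one worker would hold.
Mat column_shard(const Mat& W, std::size_t begin, std::size_t end) {
    Mat S(W.rows, end - begin);
    for (std::size_t r = 0; r < W.rows; r++)
        for (std::size_t c = begin; c < end; c++)
            S.at(r, c - begin) = W.at(r, c);
    return S;
}

// Simulated model-parallel product: every worker multiplies the full input A
// by its own column shard of W, and the partial outputs are concatenated.
Mat dot_sharded(const Mat& A, const Mat& W, std::size_t workers) {
    assert(workers > 0);
    Mat out(A.rows, W.cols);
    for (std::size_t w = 0; w < workers; w++) {
        const std::size_t begin = W.cols * w / workers;
        const std::size_t end = W.cols * (w + 1) / workers;
        const Mat partial = dot(A, column_shard(W, begin, end));
        // Gather step: place this worker's columns into the full output.
        for (std::size_t r = 0; r < partial.rows; r++)
            for (std::size_t c = 0; c < partial.cols; c++)
                out.at(r, begin + c) = partial.at(r, c);
    }
    return out;
}
```

With this split no worker ever stores more than its share of W, and only the much smaller input and output matrices need to be exchanged between workers.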