Skip to content

Latest commit

 

History

History
35 lines (30 loc) · 1.24 KB

README.md

File metadata and controls

35 lines (30 loc) · 1.24 KB

clusterNet

Deep neural network framework for GPU clusters:

  • supports NVIDIA GPUDirect RDMA

  • easy distributed computation:

    Matrix C = dot(A,B); //uses one GPU
    Matrix C = dotMPI(A,B); //uses all available GPUs on the board or in the network

  • no delay between batches due to asynchronous memory copies to the GPU:
    gpu.init_batch_allocator(X, y, 128);
    for(int i = 0; i < gpu.m_total_batches; i++)
    {
    gpu.allocate_next_batch_async(); //loads the next batch while you do computations
    result = gpu.dot(gpu.m_current_batch_X,w1); //do your computations here
    gpu.replace_current_batch_with_next(); //get the next batch which is already loaded
    }

- distributed weights which are larger than a single GPU memory: ClusterNet gpus = ClusterNet(argc,argv,12346); Matrix *batch = gpus.rand(128,100000);//34 MB Matrix *out1 = empty(128,40000);//19 MB Matrix *out2 = empty(128,20000);//9 MB Matrix *W1 = gpus.distributed_uniformSqrtWeight(100000,40000);//15258 MB Matrix *W2 = gpus.distributed_uniformSqrtWeight(40000,20000);//3051 MB gpus.tick("Time taken"); gpus.dotMPI(batch,W1,out1); gpus.dotMPI(out1,W2,out2); gpus.tock("Time taken"); >>>Time taken: 117.704285 ms.