Kernels for computation and communication overlap

What is this?

This package contains a set of applications that implement a set of kernels using four different implementation strategies for overlapping GPU computation and CPU-GPU communication. It also contains the benchmark application that is used to determine the system parameters needed for estimating PCIe transfer times.

These applications have been used in the evaluation of the performance models in:
"Performance models for CPU-GPU data transfers"
B. van Werkhoven, J. Maassen, F.J. Seinstra, and H.E. Bal
In proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014)

This document contains a short description of the implementations, but for more information please read the above paper.
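To give an idea of what "system parameters" can mean when estimating PCIe transfer times, here is a minimal sketch of a first-order model in which a transfer costs a fixed latency plus a size-dependent term. This is a simplifying assumption for illustration and not the model from the paper; the names `TransferParams`, `fit_params`, and `estimate_transfer_time` are hypothetical and do not appear in this package.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical parameters of a simple linear transfer model (assumption,
// not the model evaluated in the paper).
struct TransferParams {
    double latency_s;              // fixed cost per transfer, in seconds
    double bandwidth_bytes_per_s;  // sustained throughput, in bytes/second
};

// Derive both parameters from two timed transfers of different sizes,
// such as a benchmark could measure.
TransferParams fit_params(std::size_t bytes_small, double time_small_s,
                          std::size_t bytes_large, double time_large_s) {
    assert(bytes_large > bytes_small);
    assert(time_large_s > time_small_s);
    const double bandwidth =
        static_cast<double>(bytes_large - bytes_small) /
        (time_large_s - time_small_s);
    const double latency =
        time_small_s - static_cast<double>(bytes_small) / bandwidth;
    return TransferParams{latency, bandwidth};
}

// Estimate the time of a single transfer of the given size.
double estimate_transfer_time(const TransferParams& p, std::size_t bytes) {
    return p.latency_s + static_cast<double>(bytes) / p.bandwidth_bytes_per_s;
}
```

With parameters fitted once per system, `estimate_transfer_time` can be evaluated for any transfer size without rerunning a benchmark. Please refer to the paper for the models that were actually evaluated.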

What are the different implementations?

Each kernel is implemented according to four different implementation strategies. These implementations differ in the way they overlap CPU-GPU communication and GPU computation.

Explicit is the default implementation, which does not achieve any overlap between PCIe transfers and kernel execution.

Implicit uses device-mapped host memory to allow fine-grained overlap between PCIe transfers and kernel execution.

Streams uses explicit memory copy statements in combination with CUDA streams to overlap communication in one stream with computation and/or transfers in other streams.

Hybrid uses CUDA streams and explicit memory copies to overlap transfers from host to device with GPU computation, and uses device-mapped host memory for transferring the output back to main memory.
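The Streams and Hybrid strategies rely on dividing the work so that one part can be transferred while another part is being computed. How the applications in this package divide their data is not described in this document; the following host-side sketch only shows one common way to split `n` elements into contiguous per-stream chunks. The names `Chunk` and `split_into_chunks` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous range of elements assigned to one stream (hypothetical type).
struct Chunk {
    std::size_t offset;  // index of the first element in the chunk
    std::size_t count;   // number of elements in the chunk
};

// Split n elements into num_streams contiguous chunks whose sizes differ
// by at most one element. In a CUDA program, each chunk would typically be
// copied with cudaMemcpyAsync and processed in its own stream, so that the
// transfer of one chunk can overlap with computation on another.
std::vector<Chunk> split_into_chunks(std::size_t n, std::size_t num_streams) {
    assert(num_streams > 0);
    std::vector<Chunk> chunks;
    chunks.reserve(num_streams);
    const std::size_t base = n / num_streams;
    const std::size_t remainder = n % num_streams;
    std::size_t offset = 0;
    for (std::size_t s = 0; s < num_streams; ++s) {
        // The first `remainder` chunks take one extra element.
        const std::size_t count = base + (s < remainder ? 1 : 0);
        chunks.push_back(Chunk{offset, count});
        offset += count;
    }
    return chunks;
}
```

For example, splitting 10 elements over 4 streams gives chunks of 3, 3, 2, and 2 elements starting at offsets 0, 3, 6, and 8.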

Which kernels are included?

State

Kernel taken from the Parallel Ocean Program that computes water densities in the ocean based on temperature and salinity, and optionally also outputs the derivatives of the density with respect to temperature or salinity. Our implementation is adapted from the Parallel Ocean Program http://climate.lanl.gov/Models/POP/

Buoydiff

Another kernel from the Parallel Ocean Program that computes the buoyancy differences between different vertical levels in the ocean. Our implementation is adapted from the Parallel Ocean Program http://climate.lanl.gov/Models/POP/

2D Convolution

Kernel from the image analysis domain that, for each pixel in an image, computes a weighted average of the neighborhood of that pixel using the weights stored in a convolution filter. We use an optimized kernel implementation as described in:

Optimizing convolution operations on GPUs using adaptive tiling
B. van Werkhoven, J. Maassen, F.J. Seinstra, H.E. Bal
Future Generation Computer Systems, Volume 30, 2014
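
As a point of reference, the computation performed by this kernel can be sketched as a plain sequential C++ function. This is not the optimized GPU kernel from the paper above. The function name `convolve2d`, the row-major layout, and the choice to compute output only for pixels with a complete neighborhood are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference, not the tuned GPU kernel.
// input:  in_h x in_w image, row-major
// filter: f_h x f_w weights, row-major
// result: (in_h - f_h + 1) x (in_w - f_w + 1) image, row-major
std::vector<float> convolve2d(const std::vector<float>& input,
                              std::size_t in_h, std::size_t in_w,
                              const std::vector<float>& filter,
                              std::size_t f_h, std::size_t f_w) {
    assert(input.size() == in_h * in_w);
    assert(filter.size() == f_h * f_w);
    assert(f_h >= 1 && f_w >= 1 && in_h >= f_h && in_w >= f_w);

    const std::size_t out_h = in_h - f_h + 1;
    const std::size_t out_w = in_w - f_w + 1;
    std::vector<float> output(out_h * out_w, 0.0f);

    for (std::size_t y = 0; y < out_h; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            // Weighted sum over the f_h x f_w neighborhood whose top-left
            // corner is at (y, x); the filter is applied without flipping.
            float sum = 0.0f;
            for (std::size_t i = 0; i < f_h; ++i) {
                for (std::size_t j = 0; j < f_w; ++j) {
                    sum += input[(y + i) * in_w + (x + j)] *
                           filter[i * f_w + j];
                }
            }
            output[y * out_w + x] = sum;
        }
    }
    return output;
}
```

A function like this is mainly useful for checking the output of a GPU implementation on small inputs.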

Matrix multiplication

A well-known kernel that computes the product of two matrices. We use an implementation that is optimized and tuned towards each GPU in our testbed according to the directions given in:

Better performance at lower occupancy
V. Volkov
GPU Technology Conference. GTC 2010. Nvidia 2010.
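
For comparison, the operation itself can be written as a short sequential C++ function. This sketch contains none of the GPU-specific optimizations referred to above; the function name `matmul` and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference: C (m x n) = A (m x k) * B (k x n),
// with all matrices stored row-major.
std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            // Reuse a[i][p] across a whole row of B and C.
            const float a_ip = a[i * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    return c;
}
```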

Sparse matrix vector multiplication

A well-known kernel with an irregular data access pattern that computes the multiplication of a sparse matrix in CSR representation and an input vector.
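
The CSR representation and the resulting irregular access pattern can be illustrated with a sequential C++ sketch. The struct `CsrMatrix`, its field names, and the function `spmv_csr` are hypothetical and are not taken from the code in this package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR (compressed sparse row) container.
struct CsrMatrix {
    std::size_t rows;                  // number of rows
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into col_idx/values
    std::vector<std::size_t> col_idx;  // column index of each nonzero
    std::vector<float> values;         // value of each nonzero
};

// Compute y = M * x for a CSR matrix M and a dense vector x.
std::vector<float> spmv_csr(const CsrMatrix& m, const std::vector<float>& x) {
    assert(m.row_ptr.size() == m.rows + 1);
    assert(m.col_idx.size() == m.values.size());
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        float sum = 0.0f;
        // The nonzeros of row r are stored contiguously, but the reads
        // from x depend on col_idx, which makes the access irregular.
        for (std::size_t k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k) {
            assert(m.col_idx[k] < x.size());
            sum += m.values[k] * x[m.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

For the 3x3 matrix with rows (1, 0, 2), (0, 3, 0), and (4, 0, 5), `row_ptr` is {0, 2, 3, 5}, `col_idx` is {0, 2, 1, 0, 2}, and `values` is {1, 2, 3, 4, 5}.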

Dependencies

You need Fortran 90, C, and C++ compilers and the Nvidia CUDA compiler. A Makefile is provided, but you must edit the path to the CUDA runtime in the Makefile before you can use it.

The Latest Version

The latest version of this software can be found here

Citation

If you use this software or a modified version of it, please cite the most relevant among the following papers:

"Performance models for CPU-GPU data transfers"
B. van Werkhoven, J. Maassen, F.J. Seinstra, and H.E. Bal
In proceedings of the 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014)

@inproceedings{van2014performance,
  title={Performance models for CPU-GPU data transfers},
  author={Van Werkhoven, Ben and Maassen, Jason and Seinstra, Frank J and Bal, Henri E},
  booktitle={Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on},
  pages={11--20},
  year={2014},
  organization={IEEE}
}

"A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)"
B. van Werkhoven, J. Maassen, M. Kliphuis, H.A. Dijkstra, S.E. Brunnabend, M. van Meersbergen, F.J. Seinstra, and H.E. Bal
Geoscientific Model Development, Volume 7, Issue 1, Pages 267-281, February 2014.

@article{vanwerkhoven2014distributed,
  title={A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)},
  author={van Werkhoven, Ben and Maassen, Jason and Kliphuis, M and Dijkstra,
          HA and Brunnabend, SE and Van Meersbergen, M and Seinstra, FJ and Bal, HE},
  journal={Geoscientific Model Development},
  volume={7},
  number={1},
  pages={267--281},
  year={2014},
  publisher={Copernicus GmbH}
}

"Optimizing convolution operations on GPUs using adaptive tiling"
B. van Werkhoven, J. Maassen, F.J. Seinstra, H.E. Bal
Future Generation Computer Systems, Volume 30, 2014

@article{vanwerkhoven2014optimizing,
  title={Optimizing convolution operations on GPUs using adaptive tiling},
  author={Van Werkhoven, Ben and Maassen, Jason and Bal, Henri E and Seinstra, Frank J},
  journal={Future Generation Computer Systems},
  volume={30},
  pages={14--26},
  year={2014},
  publisher={Elsevier}
}

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
