APE on CUDA

This project is an APE implementation for NVIDIA GPUs built on the cuBLAS backend.

APE is a method for emulating high-bitwidth computation with low-bitwidth data types. For example, APE can combine $3$ or $6$ low-bitwidth Tensor Core operations to emulate one FP32 operation, with up to a $5.3\times$ theoretical speedup (a sketch of the splitting idea follows the list below). This project provides the following:

  • GEMM implementations using Tensor Cores with FP32 precision and various representation ranges.
  • Auto-adapted algorithm selection that guarantees end-to-end correctness.
  • INT16 GEMM implementation using Tensor Cores.
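
To make the emulation concrete, here is a minimal sketch (not from the repository) of the FP16 splitting behind FP32F-style emulation: each FP32 operand is split into a high FP16 part plus a low FP16 residual, and three of the four cross products are accumulated in FP32, which mirrors what FP16 Tensor Cores with FP32 accumulators compute. It requires the CUDA toolkit's cuda_fp16.h; compile with nvcc.

#include <cstdio>
#include <cuda_fp16.h>

int main() {
    float a = 3.14159265f, b = 2.71828182f;

    // Split each FP32 operand into a high FP16 part plus a low FP16 residual.
    __half a_hi = __float2half(a);
    __half a_lo = __float2half(a - __half2float(a_hi));
    __half b_hi = __float2half(b);
    __half b_lo = __float2half(b - __half2float(b_hi));

    // Recombine with three low-bitwidth products, accumulating in FP32.
    // Each FP16 product is exactly representable in FP32, so this models
    // FP16-multiply/FP32-accumulate Tensor Core arithmetic.
    float emulated = __half2float(a_hi) * __half2float(b_hi)
                   + __half2float(a_hi) * __half2float(b_lo)
                   + __half2float(a_lo) * __half2float(b_hi);

    printf("exact:    %.9f\n", a * b);
    printf("emulated: %.9f\n", emulated);
    return 0;
}

Dropping the a_lo * b_lo term is consistent with the 1-bit precision loss noted for APE_ALGO_FP32F below.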

For more details, please see our paper.

Usage

Build

mkdir build && cd build
cmake ..
make -j

API

APE provides a BLAS-like API; users only need to include ape.h to accelerate FP32 applications directly. A usage sketch follows the algorithm list below.

void apeGemmFP32(ApeTrans transa, ApeTrans transb, int m, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc, const ApeAlgo algo = APE_ALGO_AUTO);

FP32 GEMM supports $5$ algorithms:

  • APE_ALGO_AUTO: Select the fastest algorithm that avoids overflow.

  • APE_ALGO_AUTO_STRICT: Select the fastest algorithm that avoids both overflow and underflow.

  • APE_ALGO_FP32F: Use FP16 to emulate FP32. (1-bit precision loss, narrow representation range; overflow may occur.)

  • APE_ALGO_FP32B: Use BF16 to emulate FP32. (No precision loss, large representation range; overflow does not occur.)

  • APE_ALGO_FP32T: Use TF32 to emulate FP32. (1-bit precision loss, large representation range; overflow does not occur.)
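
A minimal usage sketch (not from the repository's examples) is shown below. It assumes cuBLAS conventions carry over: column-major layout, device pointers for the matrices, host pointers for alpha and beta, and an ApeTrans enumerator spelled APE_TRANS_N for non-transposed inputs; check ape.h for the actual names.

#include <cuda_runtime.h>
#include "ape.h"

int main() {
    const int m = 1024, n = 1024, k = 1024;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);
    // ... fill dA and dB with input data ...

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * op(A) * op(B) + beta * C; APE_ALGO_AUTO picks the fastest
    // emulation scheme that avoids overflow for these inputs.
    apeGemmFP32(APE_TRANS_N, APE_TRANS_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m, APE_ALGO_AUTO);

    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}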

void apeGemmINT16(ApeTrans transa, ApeTrans transb, int m, int n, int k, const int16_t *alpha, const int16_t *A, int lda, const int16_t *B, int ldb, const int32_t *beta, int32_t *C, int ldc, ApeAlgo algo = APE_ALGO_AUTO);

INT16 GEMM supports $2$ algorithms:

  • APE_ALGO_AUTO: Select an algorithm that avoids overflow.

  • APE_ALGO_INT16: Use INT8 to emulate INT16. (The representable upper bound is $32639$, versus native INT16's $32767$; overflow may occur. A sketch of the underlying arithmetic follows this list.)
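
One plausible reading of the $32639$ bound (an illustration of the general idea, not necessarily the repository's exact scheme): each INT16 value is split into two INT8 limbs, $v = 256 \cdot hi + lo$, and the largest positive value whose limbs both fit in INT8 is $127 \cdot 256 + 127 = 32639$. The recombination itself is exact; only the range shrinks.

#include <cmath>
#include <cstdint>
#include <cstdio>

// Split an INT16 value into two INT8 limbs: v = 256*hi + lo.
// The floor-rounded high limb keeps the low limb in [-128, 127].
static void split(int16_t v, int8_t &hi, int8_t &lo) {
    int h = (int)std::floor((v + 128) / 256.0);
    hi = (int8_t)h;
    lo = (int8_t)(v - 256 * h);
}

int main() {
    int16_t a = 12345, b = -7;  // magnitudes must stay within the emulated range

    int8_t ah, al, bh, bl;
    split(a, ah, al);
    split(b, bh, bl);

    // Recombine with INT8 products accumulated in INT32, as INT8 Tensor
    // Cores with INT32 accumulators would do across a whole GEMM tile.
    int32_t emulated = 65536 * (int32_t)ah * (int32_t)bh
                     + 256 * ((int32_t)ah * (int32_t)bl + (int32_t)al * (int32_t)bh)
                     + (int32_t)al * (int32_t)bl;

    printf("exact:    %d\n", (int32_t)a * (int32_t)b);
    printf("emulated: %d\n", emulated);
    return 0;
}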

Authors

Citation

Ma, Zixuan, et al. "Efficiently emulating high-bitwidth computation with low-bitwidth hardware." Proceedings of the 36th ACM International Conference on Supercomputing. 2022.

If you find this work useful in your research, please cite it using the following BibTeX:

@inproceedings{ma2022efficiently,
  author = {Ma, Zixuan and Wang, Haojie and Feng, Guanyu and Zhang, Chen and Xie, Lei and He, Jiaao and Chen, Shengqi and Zhai, Jidong},
  title = {Efficiently Emulating High-Bitwidth Computation with Low-Bitwidth Hardware},
  year = {2022},
  isbn = {9781450392815},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3524059.3532377},
  doi = {10.1145/3524059.3532377},
  booktitle = {Proceedings of the 36th ACM International Conference on Supercomputing},
  articleno = {5},
  numpages = {12},
  keywords = {emulation, tensor core, domain specific accelerator},
  location = {Virtual Event},
  series = {ICS '22}
}
