
Where can I see examples of WMMA GEMM usage for INT1 (bit 1)? #34

Closed
AlexeyAB opened this issue Nov 9, 2018 · 4 comments


AlexeyAB commented Nov 9, 2018

  • Does the CUTLASS 1.2 library really support INT1 (1-bit) GEMM using Tensor Cores, so that we can use it for XNOR neural networks?

  • Does it perform XNOR (!(a ^ b)) operations instead of multiplication?

  • Does it perform C[j][i] = popcnt( A_i_row[x] XNOR B_j_col[x] )? (A reference sketch of this computation follows this list.)

  • Should we pack each 32 bits into a uint32_t (A along rows, B along columns), in the same manner as in cuDNN, where we should use CUDNN_DATA_INT8x32 and CUDNN_TENSOR_NCHW_VECT_C to use INT8 on Tensor Cores with CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM? https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor-ops-speedup-tips

  • Where can I read more about this, and where can I see examples of Warp-Level Matrix Operations (WMMA) GEMM usage for INT1 (1 bit)?
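
For concreteness, here is a minimal host-side reference sketch of the packing and popcnt(XNOR) product asked about above. It is not CUTLASS code; the function and variable names are illustrative only.

#include <cstdint>
#include <vector>

// Popcount of the XNOR of two 32-bit packed words (32 binary values each).
// __builtin_popcount is a GCC/Clang builtin.
inline int xnor_popcount(uint32_t a, uint32_t b) {
  return __builtin_popcount(~(a ^ b));
}

// C[i][j] = sum_k popcnt( A_row_i[k] XNOR B_col_j[k] ), with A packed row-major,
// B packed column-major, 32 bits per uint32_t, and k_words = K / 32.
void binary_gemm_reference(const std::vector<uint32_t>& A,  // M x k_words
                           const std::vector<uint32_t>& B,  // N x k_words
                           std::vector<int>& C,             // M x N
                           int M, int N, int k_words) {
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      int acc = 0;
      for (int k = 0; k < k_words; ++k) {
        acc += xnor_popcount(A[i * k_words + k], B[j * k_words + k]);
      }
      C[i * N + j] = acc;
    }
  }
}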

I can see only tests for INT8 and INT4: https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu


As written here, we can achieve 2088 TOPS for INT1 (1 bit) on a GeForce RTX 2080 Ti (TU102): http://on-demand.gputechconf.com/gtc-il/2018/pdf/sil8140-optimizing-cuda-applications-for-the-volta-turing-gpu-architecture.pdf

https://github.com/NVIDIA/cutlass#whats-new-in-cutlass-11

WMMA GEMM targeting TensorCores - INT8, INT4, 1-bit https://github.com/NVIDIA/cutlass/blob/master/tools/test/unit/gemm/wmma_integer_gemm.cu

From the last newsletter:

CUTLASS 1.2, the latest version of the CUDA template library for linear algebra subroutines, includes the following key updates:

  • Support for Turing Tensor Cores that significantly speedup matrix computations for deep learning inference
  • Tensor Core optimized WMMA GEMMs for the new INT8, INT4, and INT1 precision modes introduced in Turing
  • Support for batched strided GEMMs, parallelized GEMM-K reductions, enhanced utilities, and samples

d-k-b (Collaborator) commented Nov 9, 2018

You can see an example in the perf tests at https://github.com/NVIDIA/cutlass/blob/master/tools/test/perf/gemm/wmma_binary_gemm.cu.

d-k-b (Collaborator) commented Nov 9, 2018

The implementation is modeled here:

// Model of the 1-bit WMMA inner product: XOR the packed bits of a and b and
// accumulate the resulting population count into the integer accumulator.
int inner_product<Vector<bin1_t, 32>, Vector<bin1_t, 32>, int>(
    Vector<bin1_t, 32> a,
    Vector<bin1_t, 32> b,
    int c) {
  int accum = 0;
  for (int bit = 0; bit < 32; bit++) {
    accum += a[bit] ^ b[bit];
  }
  return accum + c;
}
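
CUDA (10.0 and later) exposes this model at warp level through the experimental 1-bit WMMA fragments and bmma_sync. Below is a minimal sketch of a single 8x8x128 binary MMA step; the leading-dimension units and pointer conventions are assumptions that should be verified against the CUDA Programming Guide and the linked perf test.

#include <mma.h>
using namespace nvcuda;

// Sketch of one warp-level 8x8x128 binary MMA (experimental sub-byte WMMA API).
// Launch with a single warp (32 threads).
// A: 8 x 128 bits, row-major, packed 32 bits per unsigned word (pointer type assumed).
// B: 128 x 8 bits, column-major, packed the same way.
// C: 8 x 8 int accumulators, row-major.
__global__ void bmma_8x8x128(const unsigned* A, const unsigned* B, int* C) {
  wmma::fragment<wmma::matrix_a, 8, 8, 128,
                 wmma::experimental::precision::b1, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 8, 8, 128,
                 wmma::experimental::precision::b1, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 8, 8, 128, int> c_frag;

  wmma::fill_fragment(c_frag, 0);
  wmma::load_matrix_sync(a_frag, A, 128);  // ldm assumed to be in elements (bits)
  wmma::load_matrix_sync(b_frag, B, 128);
  // Performs the XOR + popcount accumulation modeled by inner_product above.
  wmma::bmma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(C, c_frag, 8, wmma::mem_row_major);
}

Since the accumulation is popcnt(a XOR b), an XNOR count over K bits can be recovered afterwards as K - xor_count (here K = 128 per bmma step), which matters when porting XNOR-net layers.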

AlexeyAB (Author) commented

If anyone is interested, I implemented a neural network for object detection, an XNOR-Yolo model (1-bit precision), on the Darknet framework with Tensor Cores: AlexeyAB/darknet#2365 (comment)

| Model | RTX 2070 CUDNN_HALF=0, ms | RTX 2070 CUDNN_HALF=1, ms | Speedup, X times |
| --- | --- | --- | --- |
| yolov3-spp.cfg 608x608, Float-32/16 bit precision | 40.9 | 27.2 (Tensor Cores for floats) | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC7.5 (Tensor Cores for XNOR), Bit-1 precision | 13.5 | 13.2 | 1.0x |
| Speedup, X times | 3.0x | 2.0x | - |

XNOR-net training process:
[training chart image: chart_yolov3-spp_xnor_obj]

AlexeyAB (Author) commented

@d-k-b Hi,

Are there any approximate dates for when a device-wide bin1_t GEMM function that uses Tensor Cores will appear in CUTLASS?
