
INT8 version of GEMM? #202

Open

spedagadi opened this issue Oct 17, 2017 · 9 comments

@spedagadi

spedagadi commented Oct 17, 2017

Hi

I am looking for an INT8 version of GEMM in OpenCL. If I am correct, CLBlast does not yet support it. Please correct me if I am wrong, and comment on the usage (perhaps a sample app, etc.).

Supposing an INT8 variant is not yet present in CLBlast, have you come across any other work that you could recommend? I did run into this repo https://github.com/strin/gemm-android and ARM's compute library https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/CL/cl_kernels/gemm.cl

My goal is to extend my project https://github.com/sat8/YoloOCLInference to support INT8 models during inference. I have gathered a few initial details on how to go about quantization from TensorFlow https://www.tensorflow.org/performance/quantization and would like to implement it in my project, but I am in need of an INT8 version of GEMM. TensorFlow refers to https://github.com/google/gemmlowp, which is a CPU-only, NEON-optimized GEMM library.

Any thoughts or comments would be appreciated.

@CNugteren
Owner

I haven't done the research on INT8 yet, so I don't know of any other GEMM implementations with INT8.

Nevertheless, I think INT8 is an interesting topic for CLBlast. Having tackled FP16 already, I'd be willing to spend time on implementing such a feature, but I don't think it's easy: many things will have to change on both the host and device side when going from floating-point to fixed-point. Also, what kind of hardware would you run this on? Hardware with native INT8 support? Does ARM Mali support this (given that it's in ARM's compute library)? Or do they pack 4 values together in a 32-bit integer? I'll have to read up on this topic a bit more in order to give you a proper answer.

@spedagadi
Author

spedagadi commented Oct 18, 2017

Thanks for the response.

Or do they pack 4 values together in a 32-bit integer?
I think this may be true. You may want to check out https://github.com/google/gemmlowp/blob/master/doc/quantization.md and the reference code https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

In the TensorFlow documentation, they highlight the range for mapping float to unsigned char based on experimentation.

If my understanding is correct, INT8 is not a special datatype; rather, it's just an unsigned char value.
Of course, with multiplication and other math ops, a bit depth of more than 8 may be required for the output. For example, a GEMM in INT8 may produce a 32-bit unsigned int.
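
To make the mapping concrete, here is a minimal C sketch of the gemmlowp-style affine scheme; the function names and the per-tensor scale/zero-point parameters are illustrative assumptions rather than any library's actual API. A real value x is represented as roughly scale * (q - zero_point), and a dot product of quantized values is accumulated in 32 bits.

```c
/* Minimal sketch of affine (scale + zero-point) quantization, assuming a
 * gemmlowp-style scheme; names and signatures are illustrative only. */
#include <stdint.h>
#include <math.h>

/* Map a float to an unsigned 8-bit value: x ~= scale * (q - zero_point). */
static uint8_t quantize(float x, float scale, int32_t zero_point) {
    int32_t q = (int32_t)lroundf(x / scale) + zero_point;
    if (q < 0)   q = 0;     /* clamp to the uint8 range */
    if (q > 255) q = 255;
    return (uint8_t)q;
}

/* Map a quantized value back to float. */
static float dequantize(uint8_t q, float scale, int32_t zero_point) {
    return scale * (float)((int32_t)q - zero_point);
}

/* Dot product of two quantized vectors: each 8-bit x 8-bit product fits in
 * 16 bits, and summing K of them needs roughly 16 + log2(K) bits, hence the
 * 32-bit accumulator mentioned above. */
static int32_t dot_u8(int K, const uint8_t *a, int32_t a_zero,
                      const uint8_t *b, int32_t b_zero) {
    int32_t acc = 0;
    for (int k = 0; k < K; ++k)
        acc += ((int32_t)a[k] - a_zero) * ((int32_t)b[k] - b_zero);
    return acc;
}
```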

Also, what kind of hardware would you run this on?
I am thinking of using low-precision GEMM on an Asus Tinker Board that has a Mali™-T764 GPU, an AMD RX 580, and a GTX 1080 Ti. At this point, I am not sure of the speedup factor that INT8-based inference could produce over pure floating-point math, but I would like to validate it to know it better.

Hardware with native INT8 support?
NVIDIA cards do seem to have instructions such as dp4a which could generate some speedup, but I am unsure where such instructions are exposed in OpenCL on any hardware. For now, I am aiming to compare FP32 vs. INT8 deep learning inference, supposing GEMM is in INT8 and I optimize my inference kernels using byte data. I would think doing so would certainly generate a speedup, as is widely claimed by almost all hardware vendors. Any hardware-native INT8 optimizations could come later in my project.
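
As a side note, standard OpenCL C has no dp4a built-in, but the packed-8-bit idea can be emulated by loading four values at a time as a char4 and widening before the multiply. The naive kernel below is only a rough sketch under that assumption (untuned, with B stored transposed so both matrices are read along K); whether the compiler maps it to a native dot-product instruction depends entirely on the hardware.

```c
// Rough OpenCL C sketch: emulate a dp4a-style operation on packed 8-bit data.
// Each char4 holds four int8 values in one 32-bit slot; products are widened
// to 32 bits and accumulated in an int.
inline int dot4_s8(char4 a, char4 b, int acc) {
    int4 prod = convert_int4(a) * convert_int4(b);
    return acc + prod.x + prod.y + prod.z + prod.w;
}

// Naive s8 GEMM: C (M x N, int32) = A (M x K) * B^T (N x K), with K4 = K / 4.
__kernel void gemm_s8_naive(const int M, const int N, const int K4,
                            __global const char4 *A,   // M x K4, row-major
                            __global const char4 *B,   // N x K4, row-major (B transposed)
                            __global int *C) {
    const int m = get_global_id(0);
    const int n = get_global_id(1);
    if (m >= M || n >= N) return;
    int acc = 0;
    for (int k = 0; k < K4; ++k)
        acc = dot4_s8(A[m * K4 + k], B[n * K4 + k], acc);
    C[m * N + n] = acc;
}
```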

@naibaf7

naibaf7 commented May 30, 2018

@SAT8
https://github.com/naibaf7/caffe
has experimental int8 kernels for both CUDA and OpenCL, if you're still interested in playing with this.

Edit: I have to mention that you won't have the greatest time performance-wise.
It turns out that int8 FMAD is probably going to end up being int32 FMAD on AMD cards, and the additional computations for quantization have a cost as well, especially in shared memory and registers. I haven't seen a DP4A equivalent on either AMD or Mali.

@J0hnn4J1ang

Do you have any update on this issue now, or a roadmap? Thanks.

@CNugteren
Owner

No, not really. I'm not sure if I will ever work on this; other things have priority. But contributors are free to work on this, of course.

What hardware would you run it on? What use-case do you have?

@J0hnn4J1ang

J0hnn4J1ang commented Jul 8, 2018

Hi, I am working on a kind of mining algorithm that needs batches of 256-by-256 int8-to-int16 matrix multiplications. For NVIDIA CUDA this is already done, but for AMD OpenCL there doesn't seem to be a solution yet.
As you don't have a plan for this, I think I will work it out myself.

@CNugteren
Owner

Well, you could try naibaf7's implementation as mentioned above. But as he says, there is not much support for INT8 multiplications in hardware, so you probably won't gain much (or may actually lose) compared to FP32.

@J0hnn4J1ang

@CNugteren Thanks for the info. I really appreciate it.

@engineer1109
Contributor

INT8 GEMM is usually done as s8s8s32,
like int c = (int8_t)a * (int8_t)b;
where the inputs use int8_t and the result uses int.
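
For reference, a plain s8s8s32 GEMM can be sketched in C as below; this is just an illustration of the widening, not CLBlast's interface.

```c
/* Sketch of an s8s8s32 GEMM: int8_t inputs, int32_t output/accumulator.
 * A is M x K and B is K x N, both row-major. */
#include <stdint.h>

void gemm_s8s8s32(int M, int N, int K,
                  const int8_t *A, const int8_t *B, int32_t *C) {
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (int k = 0; k < K; ++k) {
                /* widen to int before multiplying, as in
                   int c = (int8_t)a * (int8_t)b; */
                acc += (int32_t)A[m * K + k] * (int32_t)B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}
```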
