
How to implement INT4/INT8 quantization and optimal way to use AVX instructions? #11

Open
alph4b3th opened this issue Apr 16, 2023 · 10 comments

Comments

@alph4b3th

In reply to @gotzmann:

Implementing INT4/INT8 quantization and using AVX instructions can be challenging, mainly due to the limitations of INT8 multiplication instructions. However, here are some ideas to help you get started:

Quantization:

  • For INT8, normalize the float32 data to the range [-128, 127] and round to integers. Remember to store the scale factors so you can convert back to float32 during dequantization.
  • For INT4, normalize the float32 data to the range [-8, 7] and round to integers. As with INT8, store the scale factors for dequantization (a Go sketch of both steps follows this list).
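
A minimal Go sketch of this scheme, as referenced above (the names quantizeInt8 and dequantizeInt8 are illustrative, not an existing llama.go API, and ggml actually quantizes per block of weights rather than per tensor):

package quant

import "math"

// quantizeInt8 maps float32 values to int8 using one symmetric
// per-tensor scale factor. For INT4 the same idea applies with the
// [-8, 7] range and two values packed per byte.
func quantizeInt8(src []float32) (dst []int8, scale float32) {
	var maxAbs float64
	for _, v := range src {
		if a := math.Abs(float64(v)); a > maxAbs {
			maxAbs = a
		}
	}
	dst = make([]int8, len(src))
	if maxAbs == 0 {
		return dst, 0
	}
	scale = float32(maxAbs / 127.0)
	for i, v := range src {
		q := math.Round(float64(v) / float64(scale))
		if q > 127 { // clamp to the int8 range
			q = 127
		} else if q < -128 {
			q = -128
		}
		dst[i] = int8(q)
	}
	return dst, scale
}

// dequantizeInt8 recovers approximate float32 values.
func dequantizeInt8(src []int8, scale float32) []float32 {
	dst := make([]float32, len(src))
	for i, q := range src {
		dst[i] = float32(q) * scale
	}
	return dst
}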

AVX Instructions:

  • AVX-512 instructions can be used to accelerate operations on INT8 and INT4 data arrays. You can use instructions like _mm512_maddubs_epi16() (which multiplies byte pairs and adds adjacent products into INT16 lanes) and _mm512_add_epi16() for accumulation.

To deal with the lack of specific INT8 multiplication instructions, you can try converting the INT8 data to INT16 before performing the multiplication. Here is a basic example of how you can do this:

#include <immintrin.h>

__m512i int8_mul(__m512i a, __m512i b) {
  // Widen the INT8 vectors to INT16
  __m512i a_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(a));
  __m512i a_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(a, 1));
  __m512i b_lo = _mm512_cvtepi8_epi16(_mm512_castsi512_si256(b));
  __m512i b_hi = _mm512_cvtepi8_epi16(_mm512_extracti64x4_epi64(b, 1));

  // Multiply the INT16 vectors (keeps the low 16 bits of each product)
  __m512i product_lo = _mm512_mullo_epi16(a_lo, b_lo);
  __m512i product_hi = _mm512_mullo_epi16(a_hi, b_hi);

  // Pack the results back into an INT8 vector with signed saturation;
  // note that _mm512_packs_epi16 interleaves per 128-bit lane, so the
  // output bytes end up in a permuted order
  __m512i result = _mm512_packs_epi16(product_lo, product_hi);

  return result;
}

Please note that this example is simplified and may not be the most efficient. _mm512_mullo_epi16() keeps only the low 16 bits of each product, the saturating pack clamps anything outside [-128, 127], and the per-lane behaviour of _mm512_packs_epi16() leaves the output bytes in a permuted order, so you'll need to tweak the code for your overflow and ordering requirements.

These ideas should help you get started with INT4/INT8 quantization and AVX instructions. Keep in mind that performance optimization is an iterative process, and you may need to experiment with various approaches to find the most efficient solution for your specific case. If you need more information, don't hesitate to ask; I don't know C++ as well as Go, but I'm at your disposal.
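
To show why the scale factors matter, here is one more hedged scalar sketch (the function name is illustrative): a quantized dot product accumulates in int32 and only converts back to float at the end, which is exactly the inner loop the SIMD examples in this thread try to vectorize:

package quant

// dotInt8 computes the dot product of two INT8-quantized vectors.
// Accumulating in int32 avoids overflow; the two per-vector scale
// factors convert the integer sum back to float32.
func dotInt8(a, b []int8, scaleA, scaleB float32) float32 {
	var acc int32
	for i := range a {
		acc += int32(a[i]) * int32(b[i])
	}
	return float32(acc) * scaleA * scaleB
}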

@alph4b3th
Author

In Go, projects such as the "github.com/minio/simdjson-go" library show how SIMD and AVX can be put to work, but Go does not natively support SIMD intrinsics like C or C++. Therefore, you need to write assembly functions to work directly with AVX instructions. Here is a basic example of how you can do INT8 multiplication using Go assembly and AVX-512 instructions:

main.go

package main

import (
	"fmt"
)

//go:generate go run asm.go
//go:noescape
func vecMulINT8(a, b, result *int8)

func main() {
	a := make([]int8, 64)
	b := make([]int8, 64)
	result := make([]int8, 64)

	// Fill the example vectors
	for i := 0; i < 64; i++ {
		a[i] = int8(i)
		b[i] = int8(i + 1)
	}

	vecMulINT8(&a[0], &b[0], &result[0])

	fmt.Println("Resultado da multiplicação INT8:")
	for i := 0; i < 64; i++ {
		fmt.Printf("%d * %d = %d\n", a[i], b[i], result[i])
	}
}

asm.go (the generator that emits the Go assembly)

// +build ignore

package main

import (
	"log"
	"os"
	"text/template"
)

const tmpl = `
#include "textflag.h"

// func vecMulINT8(a, b, result *int8)
// Multiplies 64 INT8 values: widen to INT16, multiply,
// then narrow back to INT8 with signed saturation.
TEXT ·vecMulINT8(SB), NOSPLIT, $0-24
	MOVQ a+0(FP), DI
	MOVQ b+8(FP), SI
	MOVQ result+16(FP), DX

	// First 32 elements: sign-extend INT8 -> INT16 into ZMM registers
	VMOVDQU (DI), Y1
	VMOVDQU (SI), Y2
	VPMOVSXBW Y1, Z1
	VPMOVSXBW Y2, Z2
	VPMULLW Z2, Z1, Z3
	VPMOVSWB Z3, Y3 // narrow with signed saturation
	VMOVDQU Y3, (DX)

	// Second 32 elements
	VMOVDQU 32(DI), Y1
	VMOVDQU 32(SI), Y2
	VPMOVSXBW Y1, Z1
	VPMOVSXBW Y2, Z2
	VPMULLW Z2, Z1, Z3
	VPMOVSWB Z3, Y3
	VMOVDQU Y3, 32(DX)

	VZEROUPPER
	RET
`

func main() {
	t := template.Must(template.New("").Parse(tmpl))

	f, err := os.Create("vec_mul_amd64.s")
	if err != nil {
		log.Fatalf("Failed to create vec_mul_amd64.s: %v", err)
	}
	defer f.Close()

	err = t.Execute(f, nil)
	if err != nil {
		log.Fatalf("Failed to execute template: %v", err)
	}
}

This example defines a function vecMulINT8 that accepts three pointers to int8 arrays. The assembly performs the INT8 multiplication using AVX-512 instructions, while the main function fills the example arrays and calls vecMulINT8 to perform the multiplication.

Be aware that this example is simplified and may not be the most efficient. VPMULLW keeps only the low 16 bits of each product and VPMOVSWB saturates anything outside [-128, 127], so adjust the code if you need different overflow behaviour; it also assumes AVX-512BW support and inputs of exactly 64 elements.

@umarrudi

I am aware that my question has nothing to do with the topic of this issue, but I just want to ask: is this https://github.com/gotzmann/llama.go/blob/main/pkg/ml/ml.go an exact port of https://github.com/ggerganov/ggml/blob/master/src/ggml.c (i.e. a tensor program that runs exactly like ggml, in Go)?

I am just getting started in ML and have little experience in C/C++ and Go, but I want to leverage the Go part. So I want to know: could I run another model (e.g. MNIST) which Georgi has already provided, with your ml.go?

Thank you.

@gotzmann
Owner

I've started grokking NEON and AVX2:

https://github.com/gotzmann/llama.go/tree/avx-neon

After looking into the topic, it seems the easiest way to start is to use the MinIO tooling, as advanced by gorse:

https://gorse.io/posts/avx512-in-golang.html

After long hours of segfaults on both my Mac and PC, I finally managed to fix the gotchas and build a version that is much easier on CPU load. There is no big speed improvement yet, and I suppose RAM becomes the actual bottleneck once the matrix math moves from the main CPU cores to the SIMD vector units.

@gotzmann
Owner

AVX-512 instructions can be used to accelerate operations on INT8 and INT4 data arrays.

Unfortunately, AVX-512 support is fragmentary across Intel processors. It was recently disabled even on CPUs that were capable of it:

https://www.igorslab.de/en/intel-deactivated-avx-512-on-alder-lake-but-fully-questionable-interpretation-of-efficiency-news-editorial/

So my idea is to support only AVX2, which is the de-facto standard across generations of Intel / AMD processors, and eventually introduce AVX-512 later if it makes sense. From what I see, I'm 99% sure that after AVX2, RAM speed will become the main bottleneck, not CPU performance itself.

@gotzmann
Owner

I don't really understand c++ as much as Go, but I'm at your disposal.

Yeah, thanks! The most annoying things here:

  • Go has no clever vector intrinsics like C++ has :(
  • Go uses the Plan9 assembler, which no one outside the Go world uses :)
  • Go programs usually do not implement low-level hardware optimisations, so there are not many sources to learn from

So one needs to either go deep down the rabbit hole of learning both how AVX/NEON work and the Plan9 exotics, or lean on the C/C++ code bases and convert the needed parts from there.

@gotzmann
Owner

is this https://github.com/gotzmann/llama.go/blob/main/pkg/ml/ml.go an exact port of https://github.com/ggerganov/ggml/blob/master/src/ggml.c?

@umarrudi - exactly :)

could I run another model (e.g. MNIST) which Georgi has already provided, with your ml.go?

Basically yes, but there is still a chance that some matrix operations are not yet implemented within llama.go.

I've looked into the code, and it seems we have not converted ggml_graph_dump_dot yet. But that should be easy; if you need it, I can help implement it within days.

@alph4b3th
Author

To use AVX2 instructions in Go, you can write the kernel in Go assembly. Here is an example of how to perform INT8 vector multiplication using AVX2 instructions in Go:

Create a file called vecmul_avx2_amd64.s for the assembly code:

// +build amd64,!noasm

#include "textflag.h"

// func vecMulInt8AVX2(a, b, result *int8, length int)
// length must be a positive multiple of 16.
TEXT ·vecMulInt8AVX2(SB), NOSPLIT, $0-32
    MOVQ a+0(FP), AX
    MOVQ b+8(FP), BX
    MOVQ result+16(FP), CX
    MOVQ length+24(FP), DX

    XORQ R8, R8
loop:
    // Load 16 INT8 values from a and b
    VMOVDQU (AX)(R8*1), X0
    VMOVDQU (BX)(R8*1), X1

    // Sign-extend INT8 to INT16 (16 lanes per YMM register)
    VPMOVSXBW X0, Y0
    VPMOVSXBW X1, Y1

    // Multiply, keeping the low 16 bits of each product
    VPMULLW Y1, Y0, Y0

    // Pack back to INT8 with signed saturation, preserving element order
    VEXTRACTI128 $1, Y0, X1
    VPACKSSWB X1, X0, X0
    VMOVDQU X0, (CX)(R8*1)

    ADDQ $16, R8
    SUBQ $16, DX
    JGT loop

    VZEROUPPER
    RET

Next, create a Go file called vecmul.go to use the assembly function:

// +build amd64,!noasm

package main

import (
	"fmt"
)

//go:noescape
func vecMulInt8AVX2(a, b, result *int8, length int)

func main() {
	length := 32
	a := make([]int8, length)
	b := make([]int8, length)
	result := make([]int8, length)

	// Fill the vectors a and b with example values
	for i := 0; i < length; i++ {
		a[i] = int8(i)
		b[i] = int8(i + 1)
	}

	vecMulInt8AVX2(&a[0], &b[0], &result[0], length)

	for i := 0; i < length; i++ {
		fmt.Printf("a[%d] * b[%d] = %d\n", i, i, result[i])
	}
}

In this example, the vecmul_avx2_amd64.s file contains the assembly code that implements the vecMulInt8AVX2 function using AVX2 instructions, and vecmul.go calls it to perform the INT8 vector multiplication. Note that products outside [-128, 127] are saturated by the pack step (e.g. 11 * 12 = 132 comes out as 127). To compile and run this code, you need a processor that supports AVX2 instructions.

Please note that SIMD support in Go is not as extensive as in languages like C or Rust.
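
If you need to guard the fast path at runtime, one hedged option is to check CPU features with the golang.org/x/sys/cpu package (the fallback function here is illustrative, not part of the examples above):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// mulInt8Generic is a portable fallback for CPUs without AVX2.
// Note it wraps on overflow, unlike the saturating SIMD path.
func mulInt8Generic(a, b, result []int8) {
	for i := range a {
		result[i] = a[i] * b[i]
	}
}

func main() {
	fmt.Println("AVX2 supported:", cpu.X86.HasAVX2)
	// Dispatch to vecMulInt8AVX2 only when cpu.X86.HasAVX2 is true;
	// otherwise fall back to mulInt8Generic.
}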

@umarrudi

@gotzmann

I've looked into the code, and it seems we have not converted ggml_graph_dump_dot yet. But that should be easy; if you need it, I can help implement it within days.

I think ggml_graph_dump_dot is not as important as the core tensor operations, but at this point all learning-related material is useful and helpful to me.

Thank you.

@gotzmann
Owner

@BrunoIsaac27 To use AVX2 instructions in Go, you can write the kernel in Go assembly.

Having lost some days to debugging sessions on my Mac and PC, I've finally managed to ship AVX2 and NEON optimisations with the v1.2 release :) It really helped offload the CPU and boosted performance by ~2x-4x, depending on how fast your memory is.

I'm going to dig into AVX2 more to support memory-aligned tensors and get even better performance with slight code changes here and there.

@kelindar

kelindar commented May 2, 2023

I'm a bit late to the party, but I thought I might share.

I've been dabbling with C-to-Go assembly for a while, but the tooling is generally very poor. I had some free time this weekend and came up with this small utility to generate Go assembly; it's based on the gorse and MinIO work, but I had to rewrite most of it.

The main idea is still the same, though: use clang and llvm-objdump and stitch things together. Today I tried it on my bitmap package, and it works fine: a 2x improvement with NEON instructions on my Apple Silicon, as expected, and an 8x improvement on my Intel machine. I haven't tested on Graviton or other ARM Linux hardware, but it should also work.
