# FP4 Quantization

## High-Level Overview of FP4 Quantization for LLMs
FP4 (4-bit floating-point) quantization is a technique used to significantly reduce the memory footprint and computational cost of Large Language Models (LLMs). The core idea is to represent the high-precision floating-point numbers (like FP16 or FP32) that make up a model's weights and activations with a much smaller 4-bit floating-point format. This allows you to store a massive model in a fraction of the memory and perform computations faster.

Here's a high-level breakdown of how it works:

- The Challenge: LLMs are huge. A model like LLaMA-7B has 7 billion parameters, each typically stored as a 16-bit floating-point number. This requires roughly 14 GB of VRAM. This is a lot, and it limits who can run these models. The goal of quantization is to reduce this number.

- The Idea: Instead of using 16 bits to represent each number, we'll use only 4 bits. A 4-bit floating-point number has a much smaller range of values it can represent. This is a trade-off: we save memory and compute, but we lose some precision. The challenge is to do this in a way that the model's performance doesn't degrade too much.
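The memory arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `weight_memory_gb` is a hypothetical helper, and it counts only the weights, ignoring activations and any per-block scale factors.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7_000_000_000  # a 7B-parameter model

fp16_gb = weight_memory_gb(params, 16)  # 14.0
fp4_gb = weight_memory_gb(params, 4)    # 3.5
print(fp16_gb, fp4_gb)
```

Going from 16 to 4 bits divides the weight storage by four, before accounting for the small overhead of the scale factors.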

The Core Process: Quantization and De-Quantization:

- Quantization: When a model is loaded, its high-precision weights are converted to the 4-bit format. This involves a scaling factor and a data type conversion. The key is to find the right scaling factor that minimizes the loss of information.

- De-Quantization (on-the-fly): During a forward pass (inference), the 4-bit weights are loaded from memory. However, to perform the actual matrix multiplication (the core operation in a Transformer's linear layer), the GPU's hardware often requires higher precision (e.g., FP16). So, the 4-bit weights are de-quantized back to a higher precision on the fly. The matrix multiplication is then performed in this higher precision, and the result is stored.

- Handling Outliers: A major issue with quantizing LLMs is the presence of "outliers": a few values in the weight or activation tensors that are much larger than the rest. A naive scheme with a single scale for the whole tensor is dominated by these outliers, so the remaining values lose most of their precision. bitsandbytes' FP4 and NF4 limit this effect with blockwise quantization: the tensor is split into small blocks, each with its own scale, so an outlier only degrades the block it belongs to. (Keeping outliers in a separate high-precision path is the mixed-precision approach of bitsandbytes' 8-bit LLM.int8() scheme, not of its 4-bit formats.)
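To see why a single outlier hurts absmax scaling, and how computing the scale over smaller blocks limits the damage, here is a minimal sketch. It uses a uniform grid with 7 steps per side instead of FP4 to keep the code short; `absmax_quantize` and the block size of 32 are assumptions made for this illustration.

```python
import torch

def absmax_quantize(x: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Round-trip x through a symmetric uniform grid with `levels` steps per side."""
    scale = x.abs().max() / levels
    if scale == 0:
        return x.clone()
    return torch.round(x / scale) * scale

torch.manual_seed(0)
x = torch.randn(128) * 0.1  # typical small values
x[5] = 20.0                 # one large outlier

# One scale for the whole tensor: the outlier sets the step size for everyone.
per_tensor = absmax_quantize(x)

# One scale per block of 32 values: only the outlier's block is affected.
per_block = torch.cat([absmax_quantize(block) for block in x.split(32)])

err_tensor = (per_tensor - x).abs().mean()
err_block = (per_block - x).abs().mean()
print(err_tensor.item(), err_block.item())
```

With a single scale, every small value rounds to zero. With per-block scales, the three blocks that do not contain the outlier keep a much finer step size.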

The key components of a bitsandbytes-like implementation are:

## 4-bit Floating-Point Data Type
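bitsandbytes defines NF4 (NormalFloat4). For our purposes, let's start with the standard FP4.
In the FP4 format each value is laid out as:

|sign|exp|exp|mantissa| (E2M1)

and a normal number decodes as:

$$(-1)^{s} \times \left(1 + \frac{m}{2}\right) \times 2^{\,e - \text{bias}}$$

- sign $s$: 0 (positive) or 1 (negative)
- mantissa $m$: 0 or 1
- exponent $e$: 00, 01, 10, 11 (with bias = 1)

The bias matters because it allows negative exponents and subnormal numbers, which are important in deep learning where many values are close to zero. Note that the class below decodes the code with exponent 00 and mantissa 1 as a normal number (0.75); in the standard E2M1 encoding that code is a subnormal with value 0.5.

So the total representable range is:

$$[-(1 + 0.5) \times 2^{2},\; +(1 + 0.5) \times 2^{2}] = [-6, 6]$$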



In [1]:
class FP4_E2M1:
  '''
  Enumerates the distinct values representable in the E2M1 format (bias = 1).
  +0 and -0 collapse into a single 0, which leaves 15 values.
  '''
  def __init__(self):
    self.values = []
    for sign in [0,1]:
      for exp in range(2**2):
        for mantissa in range(2):
          if exp==0 and mantissa == 0:
            value = 0
          else:
            exp_val = exp-1
            mantissa_val = 1+mantissa*0.5
            value = (1 if sign==0 else -1) * mantissa_val * (2**(exp_val))

          if value not in self.values:
            self.values.append(value)
    self.values = sorted(self.values)

In [2]:
# In case of E2M1
fp4_range = FP4_E2M1()
fp4_range.values

[-6.0,
 -4.0,
 -3.0,
 -2.0,
 -1.5,
 -1.0,
 -0.75,
 0,
 0.75,
 1.0,
 1.5,
 2.0,
 3.0,
 4.0,
 6.0]

In [3]:
import torch
import torch.nn as nn

In [4]:
fp4_range = torch.tensor(FP4_E2M1().values)
fp4_range

tensor([-6.0000, -4.0000, -3.0000, -2.0000, -1.5000, -1.0000, -0.7500,  0.0000,
         0.7500,  1.0000,  1.5000,  2.0000,  3.0000,  4.0000,  6.0000])
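For comparison, here is a sketch of a bit-level decoder that follows the E2M1 convention used by, for example, the OCP Microscaling formats, where exponent 00 encodes zero and a subnormal instead of a normal number. `decode_e2m1` is a hypothetical helper written for this note, not part of any library.

```python
def decode_e2m1(code: int, bias: int = 1) -> float:
    """Decode a 4-bit code laid out as |sign|exp|exp|mantissa|."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at (1 - bias)
        return sign * (man / 2) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1
    return sign * (1 + man / 2) * 2.0 ** (exp - bias)

# +0 and -0 compare equal, so the set keeps a single zero
values = sorted({decode_e2m1(code) for code in range(16)})
print(values)
```

The only difference from the grid above is the smallest non-zero magnitude: 0.5 instead of 0.75.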

## Educational FP4 Quantization

This first version keeps the algorithm as simple as possible, for educational purposes:

- The input tensor is taken and flattened to one dimension (flatten operation)

- The scale value is computed as absmax()

- The entire tensor is scaled

- For each value in the tensor (using broadcasting), the closest of the allowed 4-bit values is selected

- Finally, the quantized data is returned
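The nearest-value step relies on broadcasting. The sketch below shows it in isolation on a toy grid of five values (not the FP4 grid), so the shapes are easy to follow.

```python
import torch

grid = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])  # toy grid, for illustration only
x = torch.tensor([0.9, -0.2, 0.3])

# x.unsqueeze(1) has shape (3, 1); subtracting the (5,) grid broadcasts to (3, 5):
# one row per input value, one column per grid value.
distances = torch.abs(x.unsqueeze(1) - grid)

# For each row, the column with the smallest distance is the nearest grid value.
idx = torch.argmin(distances, dim=1)
nearest = grid[idx]

print(idx)      # tensor([4, 2, 3])
print(nearest)  # tensor([1.0000, 0.0000, 0.5000])
```

Note that the intermediate `distances` tensor has one entry per (input value, grid value) pair, which is fine for a demonstration but costly for large tensors.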

In [5]:
class FP4_Quantizer():
  def __init__(self):
    self.fp4_values = torch.tensor(FP4_E2M1().values)
  def quantize(self, input_tensor):
    block = input_tensor.view(-1) # Flatten
    scale = block.abs().max() # Absmax of the whole tensor is the scale
    if scale == 0:
      return torch.zeros_like(block), scale
    # NOTE: block/scale lies in [-1, 1] while the FP4 grid spans [-6, 6],
    # so only the grid values -1, -0.75, 0, 0.75 and 1 can ever be selected.
    scaled_block = block/scale # Scale the tensor

    # Broadcast (N, 1) against the grid and pick the nearest grid value per element
    indices = torch.argmin(torch.abs(scaled_block.unsqueeze(1)-self.fp4_values),dim=1)
    quantized_data = self.fp4_values[indices]
    return quantized_data, scale

  def dequantize(self,quantized_tensor,scale,original_shape):
    t = quantized_tensor*scale
    t = t.reshape(original_shape)
    return t

In [6]:
quantizer = FP4_Quantizer()

In [7]:
input_tensor = torch.randn((1,512))
input_tensor.shape, input_tensor

(torch.Size([1, 512]),
 tensor([[ 1.1179, -1.1281,  0.8373, -1.1588,  0.8053, -0.6481, -1.6126, -0.0629,
          -0.2260, -1.3027,  0.4230,  0.1722, -0.7198, -1.8198, -0.5111, -0.5895,
           0.4420, -0.3890,  1.2333, -0.1605,  1.6057, -0.1552,  1.6179, -0.3557,
           0.0984,  0.5940, -0.8442, -0.6008,  1.6831, -1.2821,  0.1665,  0.4709,
          -1.3341,  0.4199,  0.0672,  0.4475, -1.5926,  0.4505,  0.9490,  0.8768,
           0.4062,  1.8352,  1.0064,  1.4580, -1.5108, -0.2318, -2.1338,  1.5809,
           1.2413, -0.9667, -1.6481, -1.2794, -0.5184,  1.8801,  0.3109, -0.2874,
          -0.1665, -1.0976, -0.6915,  0.6209, -0.4276, -0.5437, -0.0457, -1.4656,
          -1.6254,  0.3077,  2.5372, -0.8167,  0.6427, -1.0736, -0.5634,  0.6819,
           0.6419, -0.5678,  0.9447,  0.6653,  0.4357, -0.9539,  1.2157, -0.4664,
          -0.0565,  0.4519, -0.2334, -0.4839, -0.4435, -0.4420,  0.6545,  0.8647,
          -1.8317,  0.0663,  0.4506,  2.0528, -1.1950, -0.2536, -0.0069,  1

In [8]:
quantized_tensor, scale = quantizer.quantize(input_tensor=input_tensor)
quantized_tensor, scale

(tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.7500,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.7500,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.7500,  0.0000,  0.7500,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.7500,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000, -0.7500,  0.0000,  0.0000,  0.0000,
          0.0000,  0.7500,  0.0000,  0.7500, -0.7500,  0.0000, -0.7500,  0.7500,
          0.0000,  0.0000, -0.7500,  0.0000,  0.0000,  0.7500,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.7500,
         -0.7500,  0.0000,  0.7500,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         -0.7500,  0.0000,  0.0000,  0.7500,  0.0000,  0.0000,  0.0000,  0.0000,
         -0.7500,  0.0000,  

In [9]:
dequantized_tensor = quantizer.dequantize(quantized_tensor,scale, input_tensor.shape)

In [10]:
dequantized_tensor-input_tensor

tensor([[-1.1179,  1.1281, -0.8373,  1.1588, -0.8053,  0.6481, -1.1146,  0.0629,
          0.2260,  1.3027, -0.4230, -0.1722,  0.7198, -0.9075,  0.5111,  0.5895,
         -0.4420,  0.3890, -1.2333,  0.1605,  1.1215,  0.1552,  1.1093,  0.3557,
         -0.0984, -0.5940,  0.8442,  0.6008,  1.0441,  1.2821, -0.1665, -0.4709,
          1.3341, -0.4199, -0.0672, -0.4475, -1.1347, -0.4505, -0.9490, -0.8768,
         -0.4062,  0.8920, -1.0064,  1.2693, -1.2165,  0.2318, -0.5935,  1.1463,
         -1.2413,  0.9667, -1.0792,  1.2794,  0.5184,  0.8471, -0.3109,  0.2874,
          0.1665,  1.0976,  0.6915, -0.6209,  0.4276,  0.5437,  0.0457, -1.2617,
         -1.1019, -0.3077,  0.1900,  0.8167, -0.6427,  1.0736,  0.5634, -0.6819,
         -0.6419,  0.5678, -0.9447, -0.6653, -0.4357,  0.9539, -1.2157,  0.4664,
          0.0565, -0.4519,  0.2334,  0.4839,  0.4435,  0.4420, -0.6545, -0.8647,
         -0.8956, -0.0663, -0.4506,  0.6745,  1.1950,  0.2536,  0.0069, -1.0724,
         -1.2039, -1.2744,  

In [11]:
torch.mean(dequantized_tensor-input_tensor)

tensor(0.0371)

As you can see, we have quantized and dequantized the original tensor, and the conversion is lossy. The mean signed error looks small in this example, but only because positive and negative errors cancel out: the element-wise differences above are often as large as the original values.

The differences would be even worse in the presence of bigger outliers.
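Part of this error comes from the scaling itself: after dividing by the absmax the values lie in [-1, 1], while the FP4 grid spans [-6, 6], so most grid points are never used. The sketch below compares that scaling with a variant that maps the absmax onto the largest grid value. `roundtrip` and `FP4_GRID` are names introduced for this illustration, and the variant is one possible fix, not necessarily what any particular library does.

```python
import torch

# Grid produced by the FP4_E2M1 class in this notebook
FP4_GRID = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.75, 0.0,
                         0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def roundtrip(x: torch.Tensor, grid: torch.Tensor, grid_max: float) -> torch.Tensor:
    """Quantize x to `grid` and dequantize, mapping absmax(x) onto `grid_max`."""
    flat = x.reshape(-1)
    absmax = flat.abs().max()
    if absmax == 0:
        return torch.zeros_like(x)
    scale = absmax / grid_max
    idx = torch.argmin(torch.abs((flat / scale).unsqueeze(1) - grid), dim=1)
    return (grid[idx] * scale).reshape(x.shape)

torch.manual_seed(0)
x = torch.randn(1, 512)

# grid_max=1.0 reproduces the scaling used in this notebook
err_unit = (roundtrip(x, FP4_GRID, 1.0) - x).abs().mean()
# grid_max=6.0 stretches the data over the whole grid
err_full = (roundtrip(x, FP4_GRID, 6.0) - x).abs().mean()
print(err_unit.item(), err_full.item())
```

The mean absolute error is used here instead of the signed mean so that positive and negative errors do not cancel.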

#### NB: This is a basic educational implementation that does not optimize the storage of the quantized values. It only illustrates the algorithm.

# Blockwise Quantization

This approach is simple, but real libraries use a finer-grained one: they compute one scale factor per block of values.

Let's change our code to do that.

In [12]:
input_tensor = torch.randn((1,512))
input_tensor.shape, input_tensor

(torch.Size([1, 512]),
 tensor([[-9.5547e-02, -1.3130e+00, -1.7122e-01, -3.0108e-02, -1.1359e-01,
           2.7123e-01,  7.6356e-01, -3.8492e-01,  5.7544e-01, -1.2327e+00,
          -4.5623e-01,  7.2639e-01,  6.8328e-01, -6.8503e-01, -1.4650e+00,
           2.0754e-01,  6.3340e-01,  1.1460e+00, -2.4873e-01,  2.1266e+00,
          -8.3623e-01, -4.1115e-01, -9.1674e-02,  1.3308e+00, -1.1053e-01,
           2.7257e-01, -9.7877e-01,  8.8152e-01, -9.5816e-01, -3.7615e-01,
          -6.4455e-01, -1.1638e+00, -3.0966e-01, -4.9553e-01, -1.2040e+00,
           8.1451e-01,  2.3513e-01, -5.2868e-01, -1.1573e+00,  5.8821e-01,
          -1.4447e+00, -2.7392e-02,  1.0972e+00, -1.4447e+00, -3.3108e-01,
          -7.6031e-01, -1.1028e+00, -5.5641e-01,  1.2619e+00,  3.5575e-01,
          -9.6245e-03, -3.6474e-01,  2.9966e-01,  9.1289e-02,  2.3315e-01,
          -1.4713e+00,  1.3958e+00,  6.4976e-01, -9.4165e-01, -2.5871e-01,
          -3.4844e-01,  1.6903e+00,  7.8741e-01, -2.3817e+00, -4.4762e-01,
  

In [13]:
fp4_range

tensor([-6.0000, -4.0000, -3.0000, -2.0000, -1.5000, -1.0000, -0.7500,  0.0000,
         0.7500,  1.0000,  1.5000,  2.0000,  3.0000,  4.0000,  6.0000])

In [14]:
class FP4_Quantizer_Blockwise():
  def __init__(self,block_size=8):
    self.fp4_values = torch.tensor(FP4_E2M1().values)
    self.block_size = block_size

  def quantize(self,input_tensor):
    data_flat = input_tensor.view(-1) # Flatten
    num_blocks = (data_flat.numel()+ self.block_size -1) // self.block_size
    quantized_data = torch.zeros(num_blocks * (self.block_size//2), dtype=torch.uint8) # Two 4-bit indices are packed into each uint8 (assumes an even block_size)
    scales = torch.zeros(num_blocks)

    for i in range(num_blocks):
      start = i*self.block_size
      end = min((i+1)*self.block_size,data_flat.numel())
      block = data_flat[start:end]
      scale = block.abs().max() # Get the max value of the block for the scale
      if scale == 0:
        scale = 1.0
      scales[i] = scale # Saving the scale factor for the block

      scaled_block = block/scale # Scale the block into [-1, 1]
      indices = torch.argmin(torch.abs(scaled_block.unsqueeze(1)-self.fp4_values),dim=1) # Find the nearest value
      # Pack two 4-bit indices into one uint8 value.
      # The indices are grouped in pairs [[i0, i1], [i2, i3], ...];
      # the first index of each pair goes in the low 4 bits, the second is shifted left by 4 bits into the high 4 bits.
      # For example, index 5 (0101) shifted left becomes 0101 0000 (80)
      if indices.numel() % 2 != 0:
        # Pad with a dummy value to make the number of elements even
        indices = torch.cat((indices, torch.tensor([0], dtype=indices.dtype)))

      packed_indices = indices.view(-1,2)
      packed_values = packed_indices[:, 0] | (packed_indices[:, 1] << 4)
      quantized_data[i * (self.block_size // 2) : i * (self.block_size // 2) + packed_values.numel()] = packed_values
    return quantized_data, scales

  def dequantize(self,quantized_tensor,scales, original_shape):
    num_elements = torch.prod(torch.tensor(original_shape))
    dequantized_flat = torch.zeros(num_elements, dtype=torch.float32)

    num_blocks = scales.numel()
    current_index = 0
    for i in range(num_blocks):
      start = i * self.block_size
      end = min((i+1)*self.block_size, num_elements)
      current_block_size = end-start
      # How many 8-bit values to unpack for the current block
      packed_block_size = (current_block_size + 1) // 2
      packed_values = quantized_tensor[current_index:current_index+packed_block_size]
      # Unpack with bitwise operations: the 4 least significant bits hold the first index,
      # the 4 most significant bits hold the second index

      index_1 = packed_values & 0x0F
      index_2 = (packed_values >> 4) & 0x0F
      indices_unpacked =torch.stack([index_1,index_2], dim=1).view(-1)
      indices_unpacked = indices_unpacked[:current_block_size]
      fp4_block_value = self.fp4_values[indices_unpacked.long()]
      dequantized_flat[start:end] = fp4_block_value * scales[i]
      current_index += packed_block_size
    return dequantized_flat.view(original_shape)
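The packing and unpacking steps used in the class above can be tried in isolation. This is a minimal sketch on four hand-picked indices; the variable names are chosen for this example only.

```python
import torch

indices = torch.tensor([5, 7, 14, 0])  # four 4-bit indices (each in 0..15)

# Pack: group in pairs, first index in the low nibble, second in the high nibble
pairs = indices.view(-1, 2)
packed = (pairs[:, 0] | (pairs[:, 1] << 4)).to(torch.uint8)
print(packed)  # tensor([117,  14], dtype=torch.uint8)

# Unpack: mask the low nibble, shift down the high nibble, then interleave
low = packed & 0x0F
high = (packed >> 4) & 0x0F
unpacked = torch.stack([low, high], dim=1).reshape(-1)
print(unpacked)  # tensor([ 5,  7, 14,  0], dtype=torch.uint8)
```

Here 117 is 0111 0101 in binary: index 7 in the high four bits and index 5 in the low four bits.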

In [15]:
quantizer = FP4_Quantizer_Blockwise(block_size=64)

In [16]:
quantized_data, scales = quantizer.quantize(input_tensor)

In [17]:
quantized_data, len(scales)

(tensor([103, 119, 119, 119, 103, 119, 119, 118, 135, 151, 119, 135, 119, 118,
         118, 103, 119, 118, 119, 118, 118, 104, 119, 118, 120, 119, 119, 103,
         120, 118, 135,  87, 135, 119, 120, 103, 119, 104, 120, 104, 118, 120,
         104, 118, 103, 103, 117, 135, 135, 120,  87, 135, 102, 102, 119, 134,
         104, 119, 103, 118, 103,  88, 118, 135, 119, 119, 118, 135, 119, 103,
         119, 135, 104, 120, 119, 119, 120, 102, 135, 119, 119, 150, 135, 119,
         119, 135, 135, 103, 118, 119, 118,  86, 119, 119, 119, 117, 120, 119,
         103, 119, 119, 135, 119, 134, 136, 103, 119, 119, 118, 134, 135, 118,
         120, 102, 118, 102, 135, 135, 119, 120, 101, 103, 118, 135, 120, 103,
         119, 134, 103, 103, 119, 119, 118, 120, 120, 134, 119, 119, 103, 119,
         119, 120, 103, 119, 119, 119, 119, 119, 118, 119, 119,  87, 135, 150,
         119, 118, 104, 119, 120, 118, 135, 119, 119, 104, 118, 103, 119, 118,
         119, 119, 119, 119, 120, 119, 119, 118, 102

We get N scale factors, one per block, with N = ceil(number of elements / block_size); here 512 / 64 = 8.
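The scale factors are not free: each one has to be stored alongside the packed indices. A quick sketch of the bookkeeping, assuming float32 scales as in the code above and full blocks:

```python
import math

numel, block_size = 512, 64
scale_bits = 32  # the scales tensor above is float32

num_blocks = math.ceil(numel / block_size)           # 8
bits_per_weight = 4 + scale_bits / block_size        # 4.5
total_bytes = numel * 4 // 8 + num_blocks * scale_bits // 8  # 256 + 32 = 288

print(num_blocks, bits_per_weight, total_bytes)
```

Smaller blocks track local ranges more closely but raise the effective number of bits per weight, so the block size is a trade-off between accuracy and storage.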

In [18]:
dequantized_tensor = quantizer.dequantize(quantized_data,scales, input_tensor.shape)

In [22]:
(dequantized_tensor- input_tensor).abs().mean()

tensor(0.4305)

# Matmul operations

Let's see what happens when the FP4 data type is used in a matrix multiplication.

In [21]:
BLOCK_SIZE = 64

In [23]:
class matmul():
  def __init__(self,quantizer):
    self.quantizer = quantizer
  def __call__(self, input_tensor, weights, scales=None, weights_quantized=False, shape=None):
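    if weights_quantized:
      if shape is None or scales is None:
        raise ValueError("'shape' and 'scales' are required when weights_quantized=True")
      # Dequantize on the fly, then cast to the compute dtype
      weights = self.quantizer.dequantize(weights, scales, shape)
      weights = weights.to(torch.bfloat16)
    # Weights are stored as (out_features, in_features), hence the transpose
    output = torch.matmul(input_tensor, weights.T)
    return output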

In [24]:
quantizer = FP4_Quantizer_Blockwise(block_size=BLOCK_SIZE)

In [25]:
in_features, out_features = 1024, 512
weights = torch.randn(out_features, in_features).to(torch.bfloat16)
input_tensor = torch.randn(1, in_features).to(torch.bfloat16)

In [26]:
matmul_operation = matmul(quantizer = quantizer)

In [27]:
base_matmul_result = matmul_operation(input_tensor, weights, weights_quantized=False)

In [28]:
quantized_weight, scales = quantizer.quantize(weights)
dequantized_matmul_result = matmul_operation(input_tensor, quantized_weight, weights_quantized=True,scales=scales, shape=weights.shape)

In [29]:
dequantized_matmul_result - base_matmul_result

tensor([[ 1.4812e+01, -1.0625e+01, -2.1000e+01,  4.6000e+01,  9.2500e+00,
          8.9375e+00,  1.6125e+01,  1.0500e+01,  2.6250e+00, -2.7188e+00,
         -1.5625e+01,  2.3000e+01, -2.1719e+00, -8.7500e-01, -2.3000e+01,
          0.0000e+00, -4.6875e+00, -4.0750e+01,  2.5000e+00, -2.4000e+01,
          2.8750e+01, -2.1375e+01, -2.4000e+01,  5.1250e+01,  3.9500e+01,
         -1.5500e+01, -7.3750e+00, -7.9688e+00, -5.8438e+00,  5.2500e+00,
         -4.3750e-01, -2.7250e+01,  5.4375e+00, -4.9000e+01,  3.1250e+00,
         -1.7250e+01,  6.5000e+00, -1.2500e+00,  1.2125e+01, -7.7500e+00,
         -1.5875e+01, -9.5000e+00,  1.8000e+01,  8.7500e-01, -2.1750e+01,
          2.0000e+00, -6.0000e+00, -2.3875e+01,  1.8375e+01,  2.6375e+01,
          2.0250e+01,  4.5000e+00, -9.5000e+00, -8.9375e+00, -1.7625e+01,
          4.0750e+01,  8.6250e+00,  1.7000e+01, -4.5000e+00, -1.0812e+01,
         -2.8750e+00, -1.6250e+01,  5.5000e+00, -2.8875e+01,  1.2688e+01,
         -7.6875e+00,  2.7500e+01,  6.

In [30]:
(dequantized_matmul_result - base_matmul_result).abs().mean()

tensor(14.3750, dtype=torch.bfloat16)

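The errors are huge, and they would badly degrade a model's outputs. Part of the error comes from the simplified scaling used here: after dividing by the absmax, the values lie in [-1, 1], so only the FP4 grid points in that interval are ever used. Libraries like bitsandbytes do support FP4, but for LLM weights they also provide a dedicated data type called NF4.

NF4 works well for LLMs because its levels are matched to the distribution of their weights.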
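An absolute error is hard to judge without knowing the magnitude of the reference output. One common way to put it in context is a relative error based on vector norms. The helper below is a hypothetical sketch written for this note, not part of the notebook.

```python
import torch

def relative_error(approx: torch.Tensor, reference: torch.Tensor) -> float:
    """||approx - reference|| / ||reference||, computed in float32."""
    approx, reference = approx.float(), reference.float()
    return ((approx - reference).norm() / reference.norm()).item()

# Toy check: the difference has norm 1 and the reference has norm 5
reference = torch.tensor([3.0, 4.0])
approx = torch.tensor([3.0, 5.0])
toy_error = relative_error(approx, reference)
print(toy_error)  # ≈ 0.2
```

In this notebook it could be called as `relative_error(dequantized_matmul_result, base_matmul_result)` to compare quantizers on the same footing.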

# NF4 - Normal Float 4

The weights in large neural networks, including LLMs, tend to follow a zero-centered normal distribution. This means most weights are clustered around zero, with fewer weights at the extremes. NF4 takes advantage of this by creating a quantization scheme where the "bins" or discrete values are not equally spaced. Instead, there are more bins around zero to capture the fine-grained details of the majority of the weights, and fewer, wider bins for the less common outlier weights.


This non-uniform spacing is designed to be information-theoretically optimal for normally distributed data: the levels follow the quantiles of a normal distribution, so each bin is expected to hold roughly the same number of weights, which lowers the quantization error where most of the weights are.
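The levels can be derived from the quantiles of a standard normal distribution. The sketch below is one way to do it, under stated assumptions: 8 positive and 7 negative quantiles plus an exact zero, an `offset` of 0.9677083 for the outermost quantile, and a final normalization to [-1, 1]. `normal_float_levels` is a hypothetical name and the offset is an assumption of this sketch, so treat it as an illustration and not as a reference implementation.

```python
import torch

def normal_float_levels(offset: float = 0.9677083) -> torch.Tensor:
    """Build 16 levels from normal quantiles: 7 negative, zero, 8 positive."""
    normal = torch.distributions.Normal(0.0, 1.0)
    # 8 positive quantiles, from `offset` down towards 0.5 (0.5 itself excluded)
    positive = normal.icdf(torch.linspace(offset, 0.5, 9)[:-1])
    # 7 negative quantiles, mirrored from the positive side
    negative = -normal.icdf(torch.linspace(offset, 0.5, 8)[:-1])
    levels = torch.cat([negative, torch.zeros(1), positive]).sort().values
    # Normalize so the extreme levels are exactly -1 and +1
    return levels / levels.abs().max()

levels = normal_float_levels()
print(levels)
```

Because one side has an extra level, the grid is slightly asymmetric, which leaves room for an exact zero among the 16 codes.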

In [31]:
class NF4_Quantizer_Blockwise():
  def __init__(self,block_size=8):
    self.nf4_values = torch.tensor([
        -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
         0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7229,  1.0000
    ], dtype=torch.float32) # Precomputed
    self.block_size = block_size

  def quantize(self,input_tensor):
    data_flat = input_tensor.view(-1) # Flatten
    num_blocks = (data_flat.numel()+ self.block_size -1) // self.block_size
    quantized_data = torch.zeros(num_blocks * (self.block_size//2), dtype=torch.uint8) # Two 4-bit indices are packed into each uint8 (assumes an even block_size)
    scales = torch.zeros(num_blocks)

    for i in range(num_blocks):
      start = i*self.block_size
      end = min((i+1)*self.block_size,data_flat.numel())
      block = data_flat[start:end]
      scale = block.abs().max() # Get the max value of the block for the scale
      if scale == 0:
        scale = 1.0
      scales[i] = scale # Saving the scale factor for the block

      scaled_block = block/scale # Scale the block into [-1, 1], the range covered by the NF4 levels
      indices = torch.argmin(torch.abs(scaled_block.unsqueeze(1)-self.nf4_values),dim=1) # Find the nearest value
      # Pack two 4-bit indices into one uint8 value.
      # The indices are grouped in pairs [[i0, i1], [i2, i3], ...];
      # the first index of each pair goes in the low 4 bits, the second is shifted left by 4 bits into the high 4 bits.
      # For example, index 5 (0101) shifted left becomes 0101 0000 (80)
      if indices.numel() % 2 != 0:
        # Pad with a dummy value to make the number of elements even
        indices = torch.cat((indices, torch.tensor([0], dtype=indices.dtype)))

      packed_indices = indices.view(-1,2)
      packed_values = packed_indices[:, 0] | (packed_indices[:, 1] << 4)
      quantized_data[i * (self.block_size // 2) : i * (self.block_size // 2) + packed_values.numel()] = packed_values
    return quantized_data, scales

  def dequantize(self,quantized_tensor,scales, original_shape):
    num_elements = torch.prod(torch.tensor(original_shape))
    dequantized_flat = torch.zeros(num_elements, dtype=torch.float32)

    num_blocks = scales.numel()
    current_index = 0
    for i in range(num_blocks):
      start = i * self.block_size
      end = min((i+1)*self.block_size, num_elements)
      current_block_size = end-start
      # How many 8-bit values to unpack for the current block
      packed_block_size = (current_block_size + 1) // 2
      packed_values = quantized_tensor[current_index:current_index+packed_block_size]
      # Unpack with bitwise operations: the 4 least significant bits hold the first index,
      # the 4 most significant bits hold the second index

      index_1 = packed_values & 0x0F
      index_2 = (packed_values >> 4) & 0x0F
      indices_unpacked =torch.stack([index_1,index_2], dim=1).view(-1)
      indices_unpacked = indices_unpacked[:current_block_size]
      nf4_block_value = self.nf4_values[indices_unpacked.long()]
      dequantized_flat[start:end] = nf4_block_value * scales[i]
      current_index += packed_block_size
    return dequantized_flat.view(original_shape)

In [32]:
quantizer = NF4_Quantizer_Blockwise(block_size=BLOCK_SIZE)

In [33]:
matmul_operation = matmul(quantizer=quantizer)

In [34]:
quantized_weight, scales = quantizer.quantize(weights)

In [35]:
base_matmul_result = matmul_operation(input_tensor, weights, weights_quantized=False)

In [36]:
quantized_weight, scales = quantizer.quantize(weights)
dequantized_matmul_result = matmul_operation(input_tensor, quantized_weight, weights_quantized=True,scales=scales, shape=weights.shape)

In [37]:
dequantized_matmul_result - base_matmul_result

tensor([[ 2.5625, -0.1875,  4.6250,  0.5000,  1.7500, -2.0312,  3.0000,  3.0000,
          2.7500, -5.9688,  0.8750,  5.2500, -1.2344, -0.2500,  0.5000,  1.0000,
         -2.5625, -3.2500,  2.6250, -0.5000, -4.1250, -0.4375, -0.5000, -0.5000,
         -1.2500, -3.2500,  0.3750, -2.3125, -2.0938, -2.8750,  3.2812,  3.8750,
         -0.8125, -0.1250,  0.6250,  3.2500, -1.0000, -1.2500, -0.7500, -2.7500,
         -0.5000,  7.8750,  2.0000, -1.5000,  3.1250, -4.0000, -3.8750,  1.8750,
          1.8750, -2.3594, -4.4375, -1.0000, -1.7500, -1.6250, -0.6250,  0.2500,
         -2.8750,  2.0000, -0.5000, -0.9375, -2.8750,  4.7500, -3.8125, -0.1250,
          0.9375,  2.0000,  3.5000,  0.1172, -2.4375, -2.7500,  4.1250,  0.3750,
         -0.6250,  2.3750, -0.6250,  2.1250,  3.5000, -2.6250, -6.0000, -2.6250,
          2.3750, -1.7500,  2.5000, -0.3125,  2.2500, -1.3125,  2.1250,  1.5000,
         -1.9375,  2.5000, -2.9219,  5.1250, -4.4062, -5.2500,  1.2500,  2.0000,
         -1.3750,  3.0000, -

In [38]:
(dequantized_matmul_result - base_matmul_result).abs().mean()

tensor(2.3594, dtype=torch.bfloat16)

The errors are now more manageable.

This is the scheme bitsandbytes uses for QLoRA fine-tuning, and more generally when you quantize a model to 4 bits to reduce its memory requirements.

Of course, the main performance advantage comes from custom, optimized CUDA kernels that perform the dequantization and the matmul in a single step.