# Int8 Quantization for Needle

#### Group 11: Tyler Ho (tylerho), Andrew Zhang (az4)
#### DL Systems Final Project, Fall 2025

## Abstract

This report documents the design and implementation of post-training int8 quantization in the Needle deep learning library. The goal of the work is to shrink memory footprint and improve inference efficiency while retaining the educational clarity of the project codebase. We outline the motivation for the project, summarize the background literature, describe the software architecture spanning Python and C++/CUDA components, and present the experiments we ran to validate correctness, accuracy, speed, and memory behavior.

The project follows our initial proposal of mapping float32 weights to uint8 or int8 representations using scale factors. We extend that idea with per-channel scaling, zero-points for asymmetric quantization, and integration hooks that let Needle modules toggle quantized execution at inference time. The accompanying code lives entirely in-tree, requires no external dependencies beyond the existing build, and exposes an API that mirrors the ergonomics of mainstream libraries. We also include benchmarking and visualization utilities so that the impact of quantization can be inspected end to end.


## 1. Problem Motivation and Objectives

Inference for modern models is often bottlenecked by memory bandwidth and cache locality rather than by arithmetic throughput. Even modest multi-layer perceptrons can spend more time moving weights than computing with them, and convolutional or transformer models amplify the issue. Int8 quantization offers a pragmatic trade-off: reduce every parameter from four bytes to one while keeping arithmetic within acceptable error bounds. Our motivating use case mirrors production systems that store int8 weights and dequantize on the fly during inference. We targeted three concrete objectives:

1. Enable post-training int8 weight storage with minimal code changes for model authors. Training stays in float32; quantization is opt-in at inference time.
2. Provide both CPU and CUDA backends so that the end-to-end flow works across common environments and matches the rest of Needle.
3. Demonstrate correctness and performance through unit tests, microbenchmarks, and an MNIST end-to-end example that compares float32 and int8 executions.

These objectives informed the scope of the implementation. We deliberately kept training untouched, avoided mixed-precision training complexity, and focused on a crisp API for quantizing existing models. The design also emphasizes debuggability: quantized tensors cache their dequantized views, and the system includes clear fallbacks when the optimized int8 kernel is unavailable.



## 2. Background and Related Work

Quantization has become a staple optimization for edge and data-center inference. Typical post-training schemes compute a scale `s` such that float values `x` map to integers `q = round(x / s)`, then recover approximate floats via `x ≈ s * q`. Symmetric quantization sets zero-points to zero and is common for weights; asymmetric variants introduce zero-points to better cover activation ranges. Per-channel quantization reduces error by allowing each output channel (or column) to pick its own scale, a technique used in frameworks like PyTorch, TensorRT, and ONNX Runtime.
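
As an illustration of the formulas above, the following is a minimal NumPy sketch of per-tensor symmetric quantization. The function names are ours for this example and are not the Needle API; the choice of `s = max|x| / 127` is the usual symmetric convention and is assumed here.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Per-tensor symmetric quantization: q = round(x / s), s = max|x| / qmax."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q, scale):
    """Recover approximate floats via x ~= s * q."""
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.4, 0.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_symmetric(x)
x_hat = dequantize(q, s)
print(q)                          # int8 codes
print(np.max(np.abs(x - x_hat)))  # bounded by roughly s / 2
```

Because every value lies inside the clipping range, the round-trip error per element is at most about half a quantization step.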

LLM.int8() ([Dettmers et al., 2022](https://arxiv.org/abs/2208.07339)) popularized mixed-precision transformer inference by showing that carefully calibrated 8-bit matrix multiplication can preserve accuracy for large language models. The key idea is to convert 32-bit floating-point weights to 8-bit integers, typically by computing a scale factor that maps the weight range onto the 8-bit range (0 to 255 for uint8, or -127 to 127 for symmetric int8). During inference, weights are dequantized on the fly by multiplying the stored integers by the scale factor, which lets models run with roughly 4x less weight memory while maintaining reasonable accuracy, provided the quantization parameters are calibrated carefully.


## 3. System Architecture

The Needle stack layers a minimal autograd engine on top of interchangeable backends. Computation graphs live in Python, tensors lazily realize data on either the CPU or CUDA backend, and operations dispatch to C++ or CUDA kernels via pybind11 bindings. Our quantization additions follow this split architecture:

### 3.1 Python API (`python/needle/quantization.py`)

The core abstraction is the `QuantizedTensor` dataclass, which stores:
- `data`: The quantized int8 data (numpy array).
- `scale`: Float32 scaling factor.
- `zero_point`: Int8 zero point (for asymmetric quantization).
- `cached_dequantized`: Cached float32 `Tensor` to avoid re-dequantization overhead during inference.

Key functions include:
- `quantize_int8(arr, axis=None, symmetric=True)`: Quantizes a float array/tensor to int8. Supports both symmetric (scale only) and asymmetric (scale + zero_point) schemes.
- `dequantize_int8(q_tensor, device=None)`: Converts a `QuantizedTensor` back to a float32 `Tensor` on the specified device.
- `quantized_matmul_int8(lhs, rhs, bias)`: Performs a quantized matrix multiplication.
    - **Activations (`lhs`)**: Quantized on-the-fly using symmetric quantization.
    - **Weights (`rhs`)**: Expected to be a pre-quantized `QuantizedTensor`.
    - **Computation**: Calls the backend's `matmul_int8` function.
    - **Output**: Returns a float32 `Tensor` (dequantized result).
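
The flow of `quantized_matmul_int8` described above can be approximated in plain NumPy. This is a sketch under assumptions, not the Needle implementation: it uses a single per-tensor scale for both operands, the helper names are hypothetical, and an int32 NumPy matmul stands in for the backend's `matmul_int8`.

```python
import numpy as np

def quantize_per_tensor(x):
    """Symmetric per-tensor int8 quantization (hypothetical helper)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantized_matmul_sketch(lhs, rhs_q, rhs_scale, bias=None):
    """float32 activations x pre-quantized int8 weights -> float32 output."""
    lhs_q, lhs_scale = quantize_per_tensor(lhs)           # activations: on the fly
    acc = lhs_q.astype(np.int32) @ rhs_q.astype(np.int32)  # int32 accumulation
    out = acc.astype(np.float32) * np.float32(lhs_scale * rhs_scale)
    if bias is not None:
        out = out + bias
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(np.float32)
w = rng.standard_normal((64, 16)).astype(np.float32)
w_q, w_scale = quantize_per_tensor(w)                      # weights: quantized once

out = quantized_matmul_sketch(x, w_q, w_scale)
ref = x @ w
rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
print(f"relative error vs float32: {rel_err:.4f}")
```

The two scales factor out of the sum, so a single float multiply per output element is enough to return to float32.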

### 3.2 NN Module Integration (`python/needle/nn/nn_basic.py`)

The `Linear` layer supports post-training quantization directly:
- `enable_quantization(axis=1, symmetric=True)`: Converts the layer's weights to int8 and enables quantized inference.
- `forward()`: Automatically handles quantized execution.
    - By default, it dequantizes weights and runs float32 matmul (simulated quantization).
    - If `NEEDLE_USE_INT8_MATMUL=1` is set, it attempts to use the optimized int8 kernel (`quantized_matmul_int8`).

### 3.3 Backend Implementation

The actual heavy lifting is done in the backend extensions, which expose a `matmul_int8` function.

#### 3.3.1 CPU Backend (`src/ndarray_backend_cpu.cc`)
The CPU backend implements an optimized int8 matrix multiplication:
- **Input**: Takes two int8 numpy arrays and their corresponding float scales.
- **Optimization**:
    - **Transposition**: The second matrix (B) is transposed to ensure contiguous memory access during the inner loop.
    - **Accumulation**: Intermediate results are accumulated in `int32` to prevent overflow before being scaled back to `float32`.
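
To see why `int32` accumulation is sufficient, a short back-of-the-envelope calculation helps. The snippet below is our own arithmetic, not code from the backend: it computes how many worst-case int8 products can be summed before a signed 32-bit accumulator could overflow.

```python
INT32_MAX = 2**31 - 1
WORST_PRODUCT = (-128) * (-128)  # largest magnitude of an int8 x int8 product

def max_safe_inner_dim(worst_product=WORST_PRODUCT):
    """Largest inner dimension K for which K worst-case products fit in int32."""
    return INT32_MAX // worst_product

safe_k = max_safe_inner_dim()                      # full int8 range [-128, 127]
safe_k_symmetric = max_safe_inner_dim(127 * 127)   # symmetric range [-127, 127]
print(safe_k, safe_k_symmetric)
```

With the full int8 range the bound is about 131 thousand terms, far above the inner dimensions used in our experiments, whereas an int8 or int16 accumulator would overflow after only a handful of products.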

#### 3.3.2 CUDA Backend (`src/ndarray_backend_cuda.cu`)
The CUDA backend provides GPU acceleration for quantized operations:
- **Kernel**: `MatmulInt8Kernel` performs the matrix multiplication on the GPU.
- **Memory Management**: Handles allocation and data transfer (Host <-> Device) for inputs and outputs.
- **Computation**: Similar to the CPU, it accumulates products in `int32` and scales the final result to `float32`.



## 4. Implementation Details

### 4.1 Quantization API and Module Integration

The heart of the Python layer is the `QuantizedTensor` structure in `python/needle/quantization.py`. It stores three numpy arrays: `data` (int8 payload), `scale` (float32), and `zero_point` (int8). The constructor asserts dtype correctness and shape consistency between scale and zero-point. A cached dequantized `Tensor` avoids recomputing float views on every call, which matters during inference loops.
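
A minimal sketch of such a container is shown below. The class name `QuantizedTensorSketch` and the exact checks are illustrative assumptions based on the description above; the real `QuantizedTensor` may validate more or differently.

```python
from dataclasses import dataclass
from typing import Any, Optional

import numpy as np

@dataclass
class QuantizedTensorSketch:
    data: np.ndarray                          # int8 payload
    scale: np.ndarray                         # float32 scale(s)
    zero_point: np.ndarray                    # int8 zero-point(s)
    cached_dequantized: Optional[Any] = None  # float32 view, filled lazily

    def __post_init__(self):
        # dtype checks, mirroring the constructor assertions described above
        if self.data.dtype != np.int8:
            raise TypeError("data must be int8")
        if self.scale.dtype != np.float32:
            raise TypeError("scale must be float32")
        if self.zero_point.dtype != np.int8:
            raise TypeError("zero_point must be int8")
        # scale and zero_point must describe the same set of channels
        if self.scale.shape != self.zero_point.shape:
            raise ValueError("scale and zero_point shapes must match")

qt = QuantizedTensorSketch(
    data=np.zeros((2, 3), dtype=np.int8),
    scale=np.ones((1, 3), dtype=np.float32),
    zero_point=np.zeros((1, 3), dtype=np.int8),
)
print(qt.data.shape, qt.scale.shape)
```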

Quantization begins by converting arbitrary array-likes or Needle `Tensor` instances into float32 numpy arrays. The helper `_reduction_axes` computes which dimensions to reduce when applying per-axis quantization. Symmetric scaling finds the maximum absolute value along the chosen axes, divides by 127, and sets zero-points to zero. Asymmetric scaling instead tracks min and max, computes a span, and derives zero-points to center the representable range around the data. Both paths guard against degenerate ranges by replacing zeros with ones before dividing.
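
The scale computation can be sketched as follows. Several details here are assumptions made for illustration: `reduction_axes_sketch` is a stand-in for `_reduction_axes` whose real signature we do not reproduce, `axis` is taken to mean the channel axis that keeps its own scale, and the asymmetric zero-point formula maps the observed minimum to -128, which is one common convention.

```python
import numpy as np

def reduction_axes_sketch(ndim, axis):
    """Axes to reduce over: all of them (per-tensor) or all but `axis`."""
    if axis is None:
        return tuple(range(ndim))
    axis = axis % ndim
    return tuple(d for d in range(ndim) if d != axis)

def symmetric_params(x, axis=None):
    axes = reduction_axes_sketch(x.ndim, axis)
    max_abs = np.max(np.abs(x), axis=axes, keepdims=True)
    scale = max_abs / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)  # degenerate range
    zero_point = np.zeros_like(scale, dtype=np.int8)
    return scale, zero_point

def asymmetric_params(x, axis=None):
    axes = reduction_axes_sketch(x.ndim, axis)
    lo = np.min(x, axis=axes, keepdims=True)
    hi = np.max(x, axis=axes, keepdims=True)
    scale = (hi - lo) / 255.0                  # span spread over 256 levels
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    # choose the zero-point so that `lo` lands on -128
    zero_point = np.clip(np.round(-128 - lo / scale), -128, 127).astype(np.int8)
    return scale, zero_point

# Weight laid out as (in_features, out_features); axis=1 gives one scale per column.
W = np.array([[1.0, -10.0], [-2.0, 5.0], [0.5, 0.0]], dtype=np.float32)
s_sym, zp_sym = symmetric_params(W, axis=1)
s_asym, zp_asym = asymmetric_params(W, axis=1)
print(s_sym.ravel(), s_asym.ravel(), zp_asym.ravel())
```

In this example the second column has a five times larger range than the first, so a shared per-tensor scale would waste most of the int8 levels on the first column.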

`quantize_int8` orchestrates these steps: it computes scales, rounds and clips to int8, and returns a `QuantizedTensor`. Dequantization reverses the process by casting to int32, subtracting zero-points, and multiplying by scales before wrapping the result in a Needle `Tensor`. Because the cache is device-aware, repeated calls during evaluation reuse prior results when possible.
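
A round trip through an asymmetric scheme makes the dequantization steps concrete. The sketch below uses per-tensor parameters and hypothetical helper names; it returns a NumPy array rather than a Needle `Tensor` and omits the device-aware cache.

```python
import numpy as np

def quantize_asymmetric(x):
    """Per-tensor asymmetric int8 quantization (illustrative convention)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0           # fall back to 1.0 on a flat input
    zero_point = int(np.clip(round(-128 - lo / scale), -128, 127))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, np.float32(scale), np.int8(zero_point)

def dequantize(q, scale, zero_point):
    # cast to int32 first so that subtracting the zero-point cannot overflow int8
    shifted = q.astype(np.int32) - np.int32(zero_point)
    return shifted.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, size=1000).astype(np.float32)  # one-sided, activation-like
q, s, zp = quantize_asymmetric(x)
x_hat = dequantize(q, s, zp)
max_err = float(np.max(np.abs(x - x_hat)))
print(f"scale={float(s):.5f} zero_point={int(zp)} max_err={max_err:.5f}")
```

The int32 cast matters: with a zero-point near -128, `q - zero_point` can reach 255, which does not fit in int8.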

`Linear.enable_quantization()` calls `quantize_int8` on weights, caches the dequantized float tensor, and marks the module to use quantized execution. The forward path distinguishes between training and eval. During training we leave weights in float32 to keep gradients intact. During eval, we attempt int8 matmul when the environment variable is set; otherwise we dequantize weights and run the standard float kernel. This mirrors the ergonomics of PyTorch’s `prepare`/`convert` flow but remains much lighter-weight.
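
The dispatch logic can be illustrated with a toy NumPy layer. `ToyQuantLinear` is a hypothetical stand-in written for this report, not Needle's `Linear`; only the environment variable name `NEEDLE_USE_INT8_MATMUL` is taken from our implementation.

```python
import os
import numpy as np

class ToyQuantLinear:
    """Toy layer mimicking the train / eval / int8 dispatch described above."""

    def __init__(self, weight):
        self.weight = weight.astype(np.float32)   # (in_features, out_features)
        self.training = True
        self.quantized = False

    def enable_quantization(self):
        # one symmetric scale per output column
        max_abs = np.max(np.abs(self.weight), axis=0, keepdims=True)
        self.w_scale = np.where(max_abs == 0, 1.0, max_abs / 127.0).astype(np.float32)
        self.w_q = np.clip(np.round(self.weight / self.w_scale), -127, 127).astype(np.int8)
        self.w_deq = self.w_q.astype(np.float32) * self.w_scale  # cached float view
        self.quantized = True

    def _int8_matmul(self, x):
        max_abs = float(np.max(np.abs(x)))
        x_scale = max_abs / 127.0 if max_abs > 0 else 1.0
        x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
        acc = x_q.astype(np.int32) @ self.w_q.astype(np.int32)
        return acc.astype(np.float32) * (np.float32(x_scale) * self.w_scale)

    def forward(self, x):
        if self.training or not self.quantized:
            return x @ self.weight                 # float32 path keeps gradients intact
        if os.environ.get("NEEDLE_USE_INT8_MATMUL") == "1":
            return self._int8_matmul(x)            # optimized int8 path
        return x @ self.w_deq                      # fallback: simulated quantization

rng = np.random.default_rng(0)
layer = ToyQuantLinear(rng.standard_normal((32, 8)))
layer.enable_quantization()
layer.training = False
y = layer.forward(rng.standard_normal((4, 32)).astype(np.float32))
print(y.shape)
```

Keeping the fallback path means the same model code runs whether or not the int8 kernel has been built.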

### 4.2 Backend Int8 Matrix Multiplication

The CPU backend (`src/ndarray_backend_cpu.cc`) implements `MatmulInt8` and exposes it to Python through pybind11. Inputs are checked for 2D shape and matching inner dimensions. To improve cache locality, the code explicitly transposes matrix B so that each column of B becomes a contiguous row and the inner loop reads both operands sequentially. The kernel then iterates over rows of A and columns of B, accumulating products into an `int32_t` accumulator before scaling by `a_scale` and `b_scale` to produce float32 outputs. The method returns a numpy array, keeping integration simple for both the numpy-based backend and the higher-level Tensor wrapper.
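
The loop structure is easier to follow in a slow reference version. The following Python sketch mirrors the steps described above (shape checks, transposing B, integer accumulation, final scaling); it is written for readability and is not the C++ source.

```python
import numpy as np

def matmul_int8_reference(a, b, a_scale, b_scale):
    """Triple-loop int8 matmul with a transposed right operand."""
    if a.ndim != 2 or b.ndim != 2:
        raise ValueError("inputs must be 2D")
    if a.shape[1] != b.shape[0]:
        raise ValueError("inner dimensions must match")
    m, k = a.shape
    n = b.shape[1]
    bt = np.ascontiguousarray(b.T)       # row j of bt is column j of b
    out = np.empty((m, n), dtype=np.float32)
    for i in range(m):
        for j in range(n):
            acc = 0                       # Python int stands in for int32_t
            for p in range(k):            # both reads walk contiguous memory
                acc += int(a[i, p]) * int(bt[j, p])
            out[i, j] = acc * a_scale * b_scale
    return out

rng = np.random.default_rng(0)
a = rng.integers(-127, 128, size=(3, 5), dtype=np.int8)
b = rng.integers(-127, 128, size=(5, 4), dtype=np.int8)
out = matmul_int8_reference(a, b, a_scale=0.02, b_scale=0.01)
print(out.shape)
```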

The CUDA backend (`src/ndarray_backend_cuda.cu`) mirrors the interface with a tiled kernel. `MatmulInt8Kernel` loads tiles of A and B into shared memory, uses per-thread registers to compute an outer product over the tile, and writes results as float32 after applying scales. Grid dimensions cover the M×N output matrix with tiles sized by `MATMUL_L` and vector width `MATMUL_V`. Device memory is allocated for inputs and outputs, populated via `cudaMemcpy`, and freed after the kernel finishes. Accumulation again happens in int32 to prevent overflow from multiplying int8 operands.
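
The tiling scheme can be emulated on the CPU to check the indexing logic. The NumPy sketch below is an emulation only: each `(i0, j0)` block plays the role of one thread block, the sliced tiles stand in for shared memory, and the `tile` parameter is a hypothetical stand-in for the kernel's tile size rather than the actual `MATMUL_L` or `MATMUL_V` values.

```python
import numpy as np

def tiled_matmul_int8(a, b, a_scale, b_scale, tile=4):
    """Blocked int8 matmul: outer-product accumulation over K-tiles."""
    m, k = a.shape
    n = b.shape[1]
    out = np.zeros((m, n), dtype=np.float32)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = min(tile, m - i0)
            cols = min(tile, n - j0)
            acc = np.zeros((rows, cols), dtype=np.int32)   # per-block accumulator
            for p0 in range(0, k, tile):
                # "load" one tile of A and one tile of B
                a_tile = a[i0:i0 + tile, p0:p0 + tile].astype(np.int32)
                b_tile = b[p0:p0 + tile, j0:j0 + tile].astype(np.int32)
                # rank-1 update per step along the shared K slice
                for p in range(a_tile.shape[1]):
                    acc += np.outer(a_tile[:, p], b_tile[p, :])
            # scales are applied once, when the block is written out
            out[i0:i0 + rows, j0:j0 + cols] = acc.astype(np.float32) * np.float32(a_scale * b_scale)
    return out

rng = np.random.default_rng(0)
a = rng.integers(-127, 128, size=(5, 7), dtype=np.int8)
b = rng.integers(-127, 128, size=(7, 6), dtype=np.int8)
tiled = tiled_matmul_int8(a, b, a_scale=0.02, b_scale=0.01)
print(tiled.shape)
```

The shapes in the example are deliberately not multiples of the tile size, which exercises the boundary handling that the real kernel also needs.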


## 5. Testing, Tooling, and Developer Workflow

We validated the quantization path through unit tests, notebooks, and command-line demos. The `tests/test_quantization.py` suite exercises quantize/dequantize round-trips, per-axis handling, and the quantized Linear module against float32 references. These tests run quickly on CPU and catch regressions in numerical correctness or caching behavior.

`project.ipynb` serves as the interactive hub for experiments. It walks through building the C++ and CUDA backends, toggling `NEEDLE_USE_INT8_MATMUL`, and running quantization checks that compare int8 outputs to float baselines within tolerances appropriate for 8-bit arithmetic. The notebook also benchmarks forward passes for single and stacked Linear layers, varying shapes to stress cache and memory bandwidth.

For end-to-end validation, `train_mnist_quant.py` trains a simple MLP on MNIST, reports float32 accuracy, quantizes the trained weights, and evaluates the int8 model. The script keeps training hyperparameters modest so that experiments finish quickly. It also exposes command-line flags for batch size, learning rate, device selection, and hidden dimension so that we can probe performance on both CPU and CUDA. Throughout, we leaned on clear console logging and simple data loaders (`needle.data`) to keep reproducibility high.



## 6. Experiments & Results

Our experiments focus on four aspects: numerical fidelity, execution speed, memory footprint, and usability. The unit tests and the correctness cells in `project.ipynb` confirmed that per-layer quantization error stays within the expected bounds for int8 math, with dequantized outputs closely matching float baselines. Because accumulation uses int32 and scales are retained in float32, we avoid catastrophic error amplification even on moderately deep linear stacks.

Performance microbenchmarks compare float32 and int8 weight paths for representative matrix sizes. On CPU we observe the expected reduction in data movement, and the int8 kernel’s explicit B transposition reduces cache misses for column access. On CUDA, the tiled kernel benefits from improved shared memory reuse. The precise numbers depend on hardware, but the pattern consistently favors int8 when computation is bandwidth-bound and matrices are sufficiently large to amortize kernel launch overhead.

Memory savings are illustrated by the plots below, generated by the memory benchmarking script invoked from `project.ipynb`. Storing weights in int8 shrinks parameter memory roughly fourfold relative to float32. The figures cover CPU and CUDA cases and include per-layer measurements across increasing hidden sizes.

![Memory comparison on CPU](memory_plot.png)

![Memory comparison on CUDA](memory_plot_cuda.png)
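
The roughly fourfold figure can be checked with simple arithmetic. The sketch below counts weight bytes for a two-layer MLP on MNIST-sized inputs (784 inputs, 10 classes); the hidden sizes are illustrative, biases are ignored, and we assume one float32 scale and one int8 zero-point per output channel. These are not the measured values from the plots.

```python
def weight_bytes_fp32(layer_shapes):
    """Bytes needed to store float32 weights for (in, out) layer shapes."""
    return sum(i * o * 4 for i, o in layer_shapes)

def weight_bytes_int8(layer_shapes):
    """Bytes for int8 weights plus per-output-channel quantization parameters."""
    total = 0
    for i, o in layer_shapes:
        total += i * o      # int8 payload, one byte per weight
        total += o * 4      # float32 scale per output channel
        total += o          # int8 zero-point per output channel
    return total

ratios = {}
for hidden in (64, 256, 1024):
    shapes = [(784, hidden), (hidden, 10)]
    fp32 = weight_bytes_fp32(shapes)
    int8 = weight_bytes_int8(shapes)
    ratios[hidden] = fp32 / int8
    print(f"hidden={hidden:5d}  fp32={fp32:9d} B  int8={int8:9d} B  ratio={ratios[hidden]:.3f}")
```

The ratio stays slightly below 4 because the scales and zero-points add a small fixed overhead per output channel.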

Usability-wise, the `model.quantize(axis=1)` call provides a one-liner to switch existing models into quantized inference mode. When the optimized kernel is unavailable, the fallback dequantized path guarantees correctness at the cost of speed, making the feature safe to enable in portable code. These qualities align with our objective of minimal surface area for users while still offering explicit control over performance-sensitive toggles.



## 7. Limitations, Lessons, and Future Work

While the current implementation meets the project goals, several limitations remain. Activations are quantized on the fly with a single symmetric scale per tensor, which can leave headroom on distributions with large channel variance. Extending per-channel activation scales or adopting percentile-based calibration would reduce error at the cost of extra bookkeeping. We also rely on explicit memory copies in the CUDA path rather than using persistent device buffers or fusion to reduce transfers; a production-ready system would integrate memory pooling and kernel fusion to cut overhead.
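
To make the calibration trade-off concrete, the sketch below compares an abs-max scale with a percentile-based scale on synthetic activations containing one large outlier. This is an illustration of the idea only: the 99.9th percentile, the synthetic data, and the helper name are assumptions, and none of this is implemented in the project.

```python
import numpy as np

def fake_quantize(x, clip_value):
    """Quantize to int8 with scale = clip_value / 127, then dequantize."""
    scale = clip_value / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
acts = rng.standard_normal(10_000).astype(np.float32)
acts[0] = 50.0                                   # a single large outlier

absmax_clip = float(np.max(np.abs(acts)))        # current scheme: cover everything
pct_clip = float(np.percentile(np.abs(acts), 99.9))  # calibrated: ignore the tail

deq_absmax = fake_quantize(acts, absmax_clip)
deq_pct = fake_quantize(acts, pct_clip)

mae_absmax = float(np.mean(np.abs(acts - deq_absmax)))
mae_pct = float(np.mean(np.abs(acts - deq_pct)))
print(f"abs-max    clip={absmax_clip:.2f}  mean abs error={mae_absmax:.4f}")
print(f"percentile clip={pct_clip:.2f}  mean abs error={mae_pct:.4f}")
print(f"outlier error: abs-max={abs(acts[0] - deq_absmax[0]):.2f}  percentile={abs(acts[0] - deq_pct[0]):.2f}")
```

The percentile scale gives a much finer grid for the bulk of the values, but the clipped outlier is reconstructed poorly, which is the bookkeeping and accuracy trade-off a calibration pass would have to manage.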

Another limitation is the absence of quantized operators beyond matrix multiplication. Convolutions, layer normalization, and activation functions remain in float32, so end-to-end models cannot run fully in int8. We intentionally scoped the work to matmul to keep the kernel surface manageable, but future extensions could generalize the approach. Support for quantization-aware training is also out of scope; incorporating fake quantization nodes during training would likely improve downstream accuracy for aggressive quantization schemes.

Despite these constraints, the project demonstrates that a small, transparent codebase can host meaningful systems work. By implementing both CPU and CUDA kernels, we surfaced cross-device considerations such as scale handling, shared memory tiling, and API stability.

In future iterations we would like to benchmark against more realistic transformer-style shapes, add vectorized packing strategies (such as int8×int8→int32 GEMM using tensor cores), and explore calibration heuristics inspired by LLM.int8() and related literature.

## 8. Conclusion

We added post-training int8 quantization to Needle with a concise API, backend kernels for CPU and CUDA, and tooling to validate and benchmark the feature. The system stores weights in int8, quantizes activations on the fly, accumulates in int32, and returns float32 outputs for downstream compatibility. Integration into the `Linear` module and the MNIST demo shows that quantization can be enabled with minimal code changes while preserving accuracy and reducing memory use.

The project draws on ideas from LLM.int8() and mainstream frameworks, and we explored the full quantization pipeline: scale computation, zero-point handling, kernel tiling, and performance trade-offs. The accompanying experiments and memory visualizations provide evidence that the approach works across devices and align with the original proposal to shrink model footprint without sacrificing usability.



## 9. Links to Code