Kernel-Align

Extreme Infrastructure for GRPO & Large-Scale Reinforcement Learning.

Kernel-Align is a high-performance, memory-efficient infrastructure for Reinforcement Learning (RL) post-training. It eliminates the memory and latency bottlenecks in Large Language Model (LLM) alignment, This project targets AI infrastructure engineers, algorithm researchers, and enterprise-level large model alignment scenarios, providing specialized kernels for algorithms like GRPO, PPO, and DPO.

Performance Benchmarks: Breaking the Memory Wall

Kernel-Align is designed to solve the $O(G \cdot L \cdot V)$ memory explosion in DeepSeek-style GRPO training. A typical scenario is as follows:

1. Logprob Computation (Training Stability)

By implementing Pre-allocated Chunking, Kernel-Align maintains constant additional VRAM overhead regardless of the group size ($G$).

Testbed: NVIDIA A100 80GB | Model: Llama-3-8B | Vocab: 128,256 | SeqLen: 512

Group Size ($G$)	TRL (Standard)	PyTorch Native	Kernel-Align (Ours)	Status
G = 64	OOM	15.66 GB	16.15 GB	Success
G = 128	OOM	31.31 GB	31.80 GB	Success
G = 256	FAILED (OOM)	62.63 GB	63.12 GB	Optimized

Note: Kernel-Align is the only solution that successfully scales G=256 on a single A100 by keeping extra VRAM usage to a constant ~0.5GB.

2. Sampling Latency (Rollout Speed)

Integrating FlashInfer fused kernels to accelerate the bottleneck of RL training: the sampling phase.

Batch Size ($G$)	Native PyTorch	Kernel-Align (Fused)	Speedup
64	219.4 ms	0.55 ms	399x
128	14.08 ms	0.67 ms	21x
256	25.49 ms	1.15 ms	22x

Key Features

Zero-Growth Memory Pool: Uses pre-allocated buffers and micro-chunking to prevent VRAM spikes during advantage calculation.
Fused Sampling Pipeline: Direct integration with FlashInfer and vLLM backends for sub-1ms sampling latency.
Universal Backend Abstraction: Unified API supporting both NVIDIA (CUDA/FlashInfer) and AMD (ROCm/AITER).
Post-Training Ready: Drop-in replacement for standard sampling and logprob operators in TRL or DeepSpeed-Chat.

Architecture

Kernel-Align sits between high-level alignment libraries and low-level GPU kernels, ensuring maximum throughput without sacrificing flexibility.

Quick Start

Installation

# Clone the repository
git clone https://github.com/Flink-ddd/Kernel-Align.git
cd Kernel-Align

# Install core dependencies (CUDA 12.4+ recommended)
pip install -e .

Contributions

Inspired by the kernel designs of vLLM and DeepSpeed. As an active contributor to the AI Infrastructure ecosystem, Kernel-Align aims to push the boundaries of RL efficiency.

Target: Building the most efficient RLHF toolchain for the open-source community.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
benchmarks		benchmarks
csrc		csrc
docs		docs
examples		examples
rl_engine		rl_engine
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kernel-Align

Performance Benchmarks: Breaking the Memory Wall

1. Logprob Computation (Training Stability)

2. Sampling Latency (Rollout Speed)

Key Features

Architecture

Quick Start

Installation

Contributions

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Kernel-Align

Performance Benchmarks: Breaking the Memory Wall

1. Logprob Computation (Training Stability)

2. Sampling Latency (Rollout Speed)

Key Features

Architecture

Quick Start

Installation

Contributions

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages