Skip to content

Flink-ddd/Kernel-Align

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kernel-Align Logo

Kernel-Align

Extreme Infrastructure for GRPO & Large-Scale Reinforcement Learning.

License Hardware

Kernel-Align is a high-performance, memory-efficient infrastructure for Reinforcement Learning (RL) post-training. It eliminates the memory and latency bottlenecks in Large Language Model (LLM) alignment, This project targets AI infrastructure engineers, algorithm researchers, and enterprise-level large model alignment scenarios, providing specialized kernels for algorithms like GRPO, PPO, and DPO.


Performance Benchmarks: Breaking the Memory Wall

Kernel-Align is designed to solve the $O(G \cdot L \cdot V)$ memory explosion in DeepSeek-style GRPO training. A typical scenario is as follows:

1. Logprob Computation (Training Stability)

By implementing Pre-allocated Chunking, Kernel-Align maintains constant additional VRAM overhead regardless of the group size ($G$).

Testbed: NVIDIA A100 80GB | Model: Llama-3-8B | Vocab: 128,256 | SeqLen: 512

Group Size ($G$) TRL (Standard) PyTorch Native Kernel-Align (Ours) Status
G = 64 OOM 15.66 GB 16.15 GB Success
G = 128 OOM 31.31 GB 31.80 GB Success
G = 256 FAILED (OOM) 62.63 GB 63.12 GB Optimized

Note: Kernel-Align is the only solution that successfully scales G=256 on a single A100 by keeping extra VRAM usage to a constant ~0.5GB.

2. Sampling Latency (Rollout Speed)

Integrating FlashInfer fused kernels to accelerate the bottleneck of RL training: the sampling phase.

Batch Size ($G$) Native PyTorch Kernel-Align (Fused) Speedup
64 219.4 ms 0.55 ms 399x
128 14.08 ms 0.67 ms 21x
256 25.49 ms 1.15 ms 22x

Key Features

  • Zero-Growth Memory Pool: Uses pre-allocated buffers and micro-chunking to prevent VRAM spikes during advantage calculation.
  • Fused Sampling Pipeline: Direct integration with FlashInfer and vLLM backends for sub-1ms sampling latency.
  • Universal Backend Abstraction: Unified API supporting both NVIDIA (CUDA/FlashInfer) and AMD (ROCm/AITER).
  • Post-Training Ready: Drop-in replacement for standard sampling and logprob operators in TRL or DeepSpeed-Chat.

Architecture

Kernel-Align sits between high-level alignment libraries and low-level GPU kernels, ensuring maximum throughput without sacrificing flexibility.


Quick Start

Installation

# Clone the repository
git clone https://github.com/Flink-ddd/Kernel-Align.git
cd Kernel-Align

# Install core dependencies (CUDA 12.4+ recommended)
pip install -e .

Contributions

Inspired by the kernel designs of vLLM and DeepSpeed. As an active contributor to the AI Infrastructure ecosystem, Kernel-Align aims to push the boundaries of RL efficiency.

Target: Building the most efficient RLHF toolchain for the open-source community.

About

Modern RL Post-training Infrastructure: Optimized for NVIDIA/AMD GPUs with a focus on vLLM integration, Triton kernels, and transparent hardware-aware scaling.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors