
Starred repositories
Vector (and Scalar) Quantization, in Pytorch
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
Awesome LLM compression research papers and tools.
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (a sketch of the underlying asymmetric quantization idea follows the list)
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (a sketch of activation-aware scaling also follows the list)
A fast inference library for running LLMs locally on modern consumer-class GPUs
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
A Python package that extends official PyTorch to unlock additional performance on Intel platforms
Mixed-precision inference with TensorRT-LLM
Machine Learning Engineering Open Book
PyTorch domain library for recommendation systems
NVIDIA curated collection of educational resources related to general purpose GPU programming.
A Datacenter Scale Distributed Inference Serving Framework
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version (forked from deepseek-ai/DeepGEMM)
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Puzzles for learning Triton; play them with minimal environment configuration!
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
My learning notes and code for ML systems (MLSys).
A highly optimized LLM inference acceleration engine for Llama and its variants.
MTEB: Massive Text Embedding Benchmark
How to optimize some algorithms in CUDA.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
FlashInfer: Kernel Library for LLM Serving
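
To make the KIVI entry above concrete, here is a minimal PyTorch sketch of asymmetric (zero-point) uniform quantization, the general technique that 2-bit KV-cache quantization builds on. The function names, the 2-bit default, and the per-dimension min/max grouping are illustrative assumptions, not the repo's actual API; KIVI itself additionally quantizes keys per-channel and values per-token and ships fused kernels.

import torch

def asymmetric_quantize(x: torch.Tensor, n_bits: int = 2, dim: int = -1):
    # Asymmetric uniform quantization along one dimension (illustrative
    # sketch, not KIVI's API): map [min, max] onto {0, ..., 2^n_bits - 1}.
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    zero_point = x_min
    q = ((x - zero_point) / scale).round().clamp(0, qmax)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Reconstruct an approximation of the original tensor.
    return q * scale + zero_point

# Usage: quantize a toy KV tensor to 2 bits and check the error.
kv = torch.randn(4, 8, 16)
q, scale, zp = asymmetric_quantize(kv, n_bits=2, dim=-1)
kv_hat = dequantize(q, scale, zp)
print((kv - kv_hat).abs().mean())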
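
Similarly, for the AWQ entries, a toy sketch of activation-aware scaling under stated assumptions: scale up the weight columns of input channels with large average activation magnitude before group-wise 4-bit quantization, and fold the inverse scale into the preceding op. The function name, alpha, and group_size here are hypothetical; the actual repos use a calibrated scale search and fused INT4 kernels.

import torch

def awq_style_quantize(w, act_mean_abs, alpha=0.5, n_bits=4, group_size=64):
    # w: [out_features, in_features]; act_mean_abs: [in_features] mean |x|
    # per input channel, assumed to come from a calibration pass.
    # Assumes in_features is divisible by group_size.
    s = act_mean_abs.clamp(min=1e-8).pow(alpha)
    s = s / s.mean()                      # per-channel scale, mean ~ 1
    w_scaled = w * s                      # protect salient input channels

    # Symmetric group-wise quantization of the scaled weights.
    qmax = 2 ** (n_bits - 1) - 1
    out_f, in_f = w_scaled.shape
    wg = w_scaled.reshape(out_f, in_f // group_size, group_size)
    step = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = (wg / step).round().clamp(-qmax - 1, qmax)
    w_deq = (q * step).reshape(out_f, in_f)
    return w_deq, s                       # fold 1/s into the prior layer

# Usage: the effective forward pass becomes (x / s) @ w_deq.T ~ x @ w.T,
# since the column scaling and the input rescaling cancel exactly.
w = torch.randn(128, 256)
act = torch.rand(256) + 0.1
w_deq, s = awq_style_quantize(w, act)
x = torch.randn(4, 256)
print(((x / s) @ w_deq.T - x @ w.T).abs().mean())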