Organizations

@milvus-io

Starred repositories

cuML - RAPIDS Machine Learning Library

C++ 4,572 564 Updated Mar 26, 2025

Vector (and Scalar) Quantization, in PyTorch

Python 3,072 246 Updated Mar 24, 2025

[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference

Python 39 9 Updated Feb 3, 2025

Awesome LLM compression research papers and tools.

1,445 93 Updated Mar 28, 2025

[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Python 211 15 Updated Jan 11, 2025

[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Python 283 29 Updated Jan 19, 2025

Materials for learning SGLang

357 24 Updated Mar 22, 2025

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

Python 2,042 256 Updated Mar 6, 2025

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,891 241 Updated Mar 25, 2025

A fast inference library for running LLMs locally on modern consumer-class GPUs

Python 4,077 305 Updated Mar 15, 2025

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Python 36,270 6,162 Updated Mar 29, 2025

A Python package that extends official PyTorch to more easily obtain performance on Intel platforms

Python 1,803 264 Updated Mar 24, 2025

Mixed-precision inference with TensorRT-LLM

C++ 79 20 Updated Oct 23, 2024

Machine Learning Engineering Open Book

Python 13,256 803 Updated Mar 9, 2025

PyTorch domain library for recommendation systems

Python 2,073 488 Updated Mar 28, 2025

NVIDIA curated collection of educational resources related to general purpose GPU programming.

Jupyter Notebook 369 67 Updated Mar 19, 2025

A Datacenter Scale Distributed Inference Serving Framework

Rust 3,353 229 Updated Mar 28, 2025

DeeperGEMM: crazy optimized version

Cuda 63 Updated Mar 16, 2025

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Python 13,483 2,758 Updated Mar 28, 2025

Puzzles for learning Triton; play with minimal environment configuration!

Python 267 25 Updated Dec 3, 2024
Python 858 142 Updated Nov 29, 2023
Python 8,711 743 Updated Mar 27, 2025

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Python 1,048 69 Updated Mar 28, 2025

My learning notes and code for ML systems.

Python 1,603 91 Updated Mar 28, 2025

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 881 103 Updated Mar 24, 2025

MTEB: Massive Text Embedding Benchmark

Jupyter Notebook 2,357 355 Updated Mar 28, 2025

How to optimize some algorithms in CUDA.

Cuda 2,052 183 Updated Mar 26, 2025

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

C++ 7,648 834 Updated Mar 27, 2025

Material for gpu-mode lectures

Jupyter Notebook 4,138 418 Updated Feb 9, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 2,511 261 Updated Mar 27, 2025