View yzhaiustc's full-sized avatar


A Datacenter Scale Distributed Inference Serving Framework

Rust 3,384 233 Updated Mar 28, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,112 536 Updated Mar 28, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 7,332 685 Updated Mar 28, 2025

Fully open reproduction of DeepSeek-R1

Python 23,495 2,141 Updated Mar 30, 2025

Puzzles for learning Triton, playable with minimal environment configuration!

Python 267 25 Updated Dec 3, 2024

Development repository for the Triton language and compiler

MLIR 15,022 1,890 Updated Mar 30, 2025

A simple pip-installable Python tool to generate your own HTML citation world map from your Google Scholar ID.

Python 516 43 Updated Feb 25, 2025

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ 620 42 Updated Mar 6, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 815 53 Updated Mar 19, 2025

You like pytorch? You like micrograd? You love tinygrad! ❤️

Python 28,475 3,266 Updated Mar 30, 2025

Grok open release

Python 50,250 8,360 Updated Aug 30, 2024

A PyTorch Native LLM Training Framework

Python 763 42 Updated Dec 27, 2024

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.

Python 783 63 Updated Sep 4, 2024

A compiler for homomorphic encryption

C++ 413 72 Updated Mar 30, 2025

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 10,041 1,272 Updated Mar 29, 2025

A retargetable MLIR-based machine learning compiler and runtime toolkit.

C++ 3,075 679 Updated Mar 30, 2025

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Cuda 203 16 Updated Sep 24, 2023

Standalone Flash Attention v2 kernel without libtorch dependency

C++ 108 15 Updated Sep 10, 2024

Fast and memory-efficient exact attention

Python 16,615 1,574 Updated Mar 29, 2025
C++ 61 20 Updated Dec 18, 2024

100 Days of RTL

SystemVerilog 357 105 Updated Aug 15, 2024

CUDA on non-NVIDIA GPUs

Rust 11,053 708 Updated Mar 17, 2025

Making large AI models cheaper, faster and more accessible

Python 40,695 4,488 Updated Mar 28, 2025

Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.

C++ 28 7 Updated Jun 26, 2024

SLATE is a distributed, GPU-accelerated, dense linear algebra library targeting current and upcoming high-performance computing (HPC) systems. It is developed as part of the U.S. Department of Ene…

C++ 110 23 Updated Jan 11, 2025
PostScript 3 Updated Apr 5, 2023

Source code for Twitter's Recommendation Algorithm

Scala 63,058 12,167 Updated Jul 10, 2024

C++ and Python support for the CUDA Quantum programming model for heterogeneous quantum-classical workflows

C++ 657 225 Updated Mar 28, 2025