Starred repositories
verl: Volcano Engine Reinforcement Learning for LLMs
Fully open reproduction of DeepSeek-R1
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
DeepEP: an efficient expert-parallel communication library
Accessible large language models via k-bit quantization for PyTorch.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilizatio…
Zero Bubble Pipeline Parallelism
Use PEFT or full-parameter training to finetune 500+ LLMs (Qwen2.5, InternLM3, GLM4, Llama3.3, Mistral, Yi1.5, Baichuan2, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2-Audio, Llama3.2-Vision, Llava, I…
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
Prometheus exporter that mines /proc to report on selected processes
The official repo of Pai-Megatron-Patch for LLM & VLM large scale training developed by Alibaba Cloud.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Beijing Telecom IPTV playlist bj-telecom-iptv.m3u
Official implementation of "Towards Efficient Visual Adaption via Structural Re-parameterization".
A tool for bandwidth measurements on NVIDIA GPUs.
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
Optimized primitives for collective multi-GPU communication
A GPU performance profiling tool for PyTorch models
Example models using DeepSpeed
Chinese-LLaMA 1&2 and Chinese-Falcon base models; ChatFlow Chinese dialogue model; Chinese OpenLLaMA model; NLP pre-training / instruction fine-tuning datasets
Code and documentation to train Stanford's Alpaca models, and generate the data.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Ongoing research training transformer language models at scale, including: BERT & GPT-2