A lightweight Inference Engine implementation built from scratch.
- 🚀 Fast offline inference - Fast MoE engine for Mixtral.
- 📖 Readable codebase - Clean architecture.
- ⚡ Optimization Suite - Prefix caching, tensor parallelism, a custom kernel for the MoE layer, and more.
Strictly follow the installation instructions below. If the container has no network access, install the required packages on the login node first, then run `pip install -e .` on the compute node.

```bash
conda create -n electrock python=3.10
conda activate electrock
cd electrock-infer-xdb
pip install pip==24.0
pip install https://download.sourcefind.cn:65024/directlink/4/pytorch/DAS1.0/torch-2.1.0+das1.0+git00661e0.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install https://download.sourcefind.cn:65024/directlink/4/triton/DAS1.0/triton-2.1.0+das1.0+git3841f975.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install -e .
```
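As a quick sanity check after installation (a minimal sketch; it assumes the DAS/dtk PyTorch build exposes devices through the standard `torch.cuda` interface, as ROCm-style builds typically do):

```python
import torch
import triton

# Confirm the pinned DAS/dtk wheels are the active installs.
print(torch.__version__)   # expected: 2.1.0+das1.0+...
print(triton.__version__)  # expected: 2.1.0+das1.0+...

# Device visibility; assumes the dtk build routes devices via torch.cuda.
print(torch.cuda.is_available(), torch.cuda.device_count())
```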
See `example.py` for usage:

```python
import os
from electrock_infer import LLM, SamplingParams
from transformers import AutoTokenizer
from electrock_infer.flash_infer.engine_core import EngineCore
path = os.path.expanduser("/work/share/data/XDZS2025/Mixtral-8x7B-v0.1")
llm = LLM(path, enforce_eager=True, tensor_parallel_size=2, gpu_memory_utilization=0.9)
# llm = EngineCore(path, enforce_eager=True, tensor_parallel_size=2, gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.9, max_tokens=256)
prompts = [
    "Hello my name is"
]
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("\n")
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
```

PagedAttention is enabled by default; it generally provides higher throughput. If you want lower multi-batch latency, you can enable the continuous KV cache by importing `EngineCore` via `from electrock_infer.flash_infer.engine_core import EngineCore` and using it in place of `LLM` (see the commented-out line above).
See `eval3.py` for the benchmark:

```python
import time

from transformers import AutoTokenizer

from electrock_infer.engine.llm_engine import LLMEngine
from electrock_infer.sampling_params import SamplingParams

# sentences, model_path, EVAL_SENTENCE_COUNT, and USE_tqdm are defined
# earlier in eval3.py; the function below is the core of its benchmark.
def eval_latency():
    prompts = sentences[:EVAL_SENTENCE_COUNT]
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    prompt_token_ids = [tokenizer.encode(prompt) for prompt in prompts]
    sampling_params = [
        SamplingParams(temperature=1, ignore_eos=False, max_tokens=512, max_total_tokens=512)
        for _ in range(EVAL_SENTENCE_COUNT)
    ]
    engine = LLMEngine(model_path, enforce_eager=True, max_model_len=512, tensor_parallel_size=2)

    # Warm up so one-time initialization does not skew the timing.
    print("Warmup begin")
    engine.generate(["Benchmark: "], SamplingParams(), use_tqdm=USE_tqdm)
    print("Warmup done")

    print("Beginning evaluation...")
    t = time.time()
    engine.generate(prompt_token_ids, sampling_params, use_tqdm=USE_tqdm)
    t = time.time() - t
    engine.exit()

    latency_per_seq = t / EVAL_SENTENCE_COUNT
    return latency_per_seq
```

Test Configuration:
- Hardware: 2× K100-AI (64 GB)
- Model: Mixtral-8x7B-Instruct-v0.1
- Total Requests: 100 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–256 tokens
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) | Requests (req/s) |
|---|---|---|---|---|
| baseline | N/A | N/A | N/A | N/A |
| Elect-Rock-Infer | N/A | N/A | N/A | N/A |
| ERI (flash_attn v2.6.1) | 133,966 | 111.79 | 1,198.42 | 2.29 |
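Throughput is total output tokens divided by end-to-end wall time: 133,966 tokens / 111.79 s ≈ 1,198 tokens/s, matching the reported figure.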