A lightweight Inference Engine implementation built from scratch.
- 🚀 Fast offline inference - Fast MoE engine for Mixtral.
- 📖 Readable codebase - Clean architecture.
- ⚡ Optimization Suite - Prefix caching, tensor parallelism, a custom kernel for the MoE layer, and more.
Strictly follow the installation instructions below. If the container has no network access, install the required packages on the login node first, then run `pip install -e .` on the compute node.

```bash
conda create -n electrock python=3.10
conda activate electrock
cd electrock-infer-xdb
pip install pip==24.0
pip install https://download.sourcefind.cn:65024/directlink/4/pytorch/DAS1.0/torch-2.1.0+das1.0+git00661e0.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install https://download.sourcefind.cn:65024/directlink/4/triton/DAS1.0/triton-2.1.0+das1.0+git3841f975.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install -e .
```
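As a quick sanity check after installation (a minimal sketch; it assumes the DAS/dtk PyTorch build exposes devices through the standard `torch.cuda` interface, as ROCm-style builds typically do):

```python
import torch
import triton

# Confirm the pinned DAS/dtk wheels are the active installs.
print(torch.__version__)   # expected: 2.1.0+das1.0+...
print(triton.__version__)  # expected: 2.1.0+das1.0+...

# Device visibility; assumes the dtk build routes devices via torch.cuda.
print(torch.cuda.is_available(), torch.cuda.device_count())
```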
See `example.py` for usage:

```python
import os
from electrock_infer import LLM, SamplingParams
from transformers import AutoTokenizer
from electrock_infer.flash_infer.engine_core import EngineCore
path = os.path.expanduser("/work/share/data/XDZS2025/Mixtral-8x7B-v0.1")
llm = LLM(path, enforce_eager=True, tensor_parallel_size=2, gpu_memory_utilization=0.9)
# llm = EngineCore(path, enforce_eager=True, tensor_parallel_size=2, gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.9, max_tokens=256)
prompts = [
    "Hello my name is"
]
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("\n")
    print(f"Prompt: {prompt!r}")
    print(f"Completion: {output['text']!r}")
```

PagedAttention is enabled by default; it generally provides higher throughput. If you want lower multi-batch latency, you can enable the continuous KV cache by importing `EngineCore` via `from electrock_infer.flash_infer.engine_core import EngineCore` and using it in place of `LLM` (see the commented-out line above).
See `eval3.py` for the benchmark:

```python
import time

from transformers import AutoTokenizer

from electrock_infer.engine.llm_engine import LLMEngine
from electrock_infer.sampling_params import SamplingParams

# sentences, model_path, EVAL_SENTENCE_COUNT, and USE_tqdm are defined
# earlier in eval3.py; the function below is the core of its benchmark.
def eval_latency():
    prompts = sentences[:EVAL_SENTENCE_COUNT]
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    prompt_token_ids = [tokenizer.encode(prompt) for prompt in prompts]
    sampling_params = [
        SamplingParams(temperature=1, ignore_eos=False, max_tokens=512, max_total_tokens=512)
        for _ in range(EVAL_SENTENCE_COUNT)
    ]
    engine = LLMEngine(model_path, enforce_eager=True, max_model_len=512, tensor_parallel_size=2)

    # Warm up so one-time initialization does not skew the timing.
    print("Warmup begin")
    engine.generate(["Benchmark: "], SamplingParams(), use_tqdm=USE_tqdm)
    print("Warmup done")

    print("Beginning evaluation...")
    t = time.time()
    engine.generate(prompt_token_ids, sampling_params, use_tqdm=USE_tqdm)
    t = time.time() - t
    engine.exit()

    latency_per_seq = t / EVAL_SENTENCE_COUNT
    return latency_per_seq
```

Test Configuration:
- Hardware: 2× K100-AI (64 GB)
- Model: Mixtral-8x7B-Instruct-v0.1
- Total Requests: 100 sequences
- Input Length: Randomly sampled between 100–1024 tokens
- Output Length: Randomly sampled between 100–256 tokens
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) | Requests (req/s) |
|---|---|---|---|---|
| baseline | N/A | N/A | N/A | N/A |
| Elect-Rock-Infer | N/A | N/A | N/A | N/A |
| ERI (flash_attn v2.6.1) | 133,966 | 111.79 | 1,198.42 | 2.29 |
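Throughput is total output tokens divided by end-to-end wall time: 133,966 tokens / 111.79 s ≈ 1,198 tokens/s, matching the reported figure.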