Systematic LLM Inference Performance Evaluation
Build Performance Matrices • Compare Hardware Configs • Optimize Deployments
InferMatrix is a unified framework for systematic evaluation of Large Language Model (LLM) inference performance across multiple backends and hardware configurations. It helps you build comprehensive performance matrices to identify optimal deployment strategies.
- 🔄 Multi-Backend Support: Seamlessly test Ollama, vLLM, and LMStudio
- 🌐 Flexible Deployment: Local Windows, WSL, and remote servers via SSH
- 📊 Matrix-Based Evaluation: Cross-dimensional performance comparison
- ⚡ Comprehensive Metrics: TTFT, TPOT, Throughput, and Token Count
- 🎛️ JSON-Driven Configuration: Simple setup, powerful capabilities
- 📈 Auto Visualization: Beautiful performance reports out of the box
InferMatrix generates systematic performance comparisons like this:
| Backend | RTX 4090 | RTX 3090 | A100 |
|---|---|---|---|
| Ollama | 137 tok/s | 89 tok/s | 156 tok/s |
| vLLM | 156 tok/s | 102 tok/s | 198 tok/s |
| LMStudio | 125 tok/s | 84 tok/s | 145 tok/s |
Evaluate new models, compare frameworks, and optimize hardware choices systematically.
```bash
# Clone the repository
git clone https://github.com/infermatrix/infermatrix.git
cd infermatrix

# Install dependencies
pip install -r requirements.txt
```

```bash
# Run a test with your configuration
python run_tests.py --config configs/example.json

# List all available tests
python run_tests.py --list

# Generate performance report
python run_tests.py --generate-report
```

Create a simple configuration file `my_test.json`:

```json
{
"ssh": {
"local_mode": true,
"hostname": "127.0.0.1"
},
"prompts": ["What is the capital of France?"],
"max_tokens": 512,
"result_dir": "results",
"tests": [
{
"name": "ollama-test",
"model": "llama2:7b",
"backend": "ollama",
"port": 11434
}
]
}
```

Run the test:

```bash
python run_tests.py --config my_test.json
```

View the results in the `results/` directory, which contains JSON files and visualizations.
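If you want to post-process the raw numbers yourself, a small standalone script like the sketch below can do it. The result field names used here (`name`, `ttft`, `throughput`) are assumptions for illustration, not a documented schema; check the actual JSON files produced by `run_tests.py` and adjust accordingly.

```python
# Sketch: summarize raw result files from the results/ directory.
# The field names below are assumed for illustration -- adapt them to the
# actual JSON schema written by run_tests.py.
import json
from pathlib import Path

def summarize(result_dir: str = "results") -> None:
    for path in sorted(Path(result_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        name = data.get("name", path.stem)        # assumed key
        ttft = data.get("ttft")                   # assumed key, seconds
        throughput = data.get("throughput")       # assumed key, tokens/s
        print(f"{name}: TTFT={ttft} s, throughput={throughput} tok/s")

if __name__ == "__main__":
    summarize()
```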
Test models running directly on your Windows machine (e.g., Ollama).
- Windows with Ollama installed and running
- Model downloaded via `ollama pull <model_name>`
Use the template `configs/local_ollama_example.json`:

```json
{
"ssh": {
"local_mode": true,
"hostname": "127.0.0.1"
},
"prompts": [
"Explain neural networks in simple terms.",
"Write a Python function to calculate factorial."
],
"max_tokens": 512,
"result_dir": "results",
"tests": [
{
"name": "ollama-local-test",
"model": "deepseek-r1:1.5b",
"backend": "ollama",
"local_mode": true,
"port": 11434
}
]
}
```

```bash
python run_tests.py --config configs/local_ollama_example.json
```

Test models deployed in Windows Subsystem for Linux (e.g., vLLM).
- WSL (Ubuntu recommended) installed and configured
- vLLM and its dependencies (CUDA) installed in WSL
- Model files accessible in WSL (e.g., `/mnt/e/models/...`)
- Network connectivity between Windows and WSL
Step 1: Get WSL IP address
```bash
# In WSL terminal
hostname -I
# Example output: 172.28.144.1
```

Step 2: Create the configuration `wsl_test.json`:

```json
{
"ssh": {
"local_mode": false,
"hostname": "172.28.144.1",
"username": "your_wsl_username",
"password": "your_password",
"port": 22
},
"prompts": ["What is machine learning?"],
"max_tokens": 512,
"result_dir": "results",
"tests": [
{
"name": "wsl-vllm-test",
"backend": "vllm",
"port": 8000,
"backend_config": {
"model_path": "/mnt/e/models/deepseek-r1-distill-qwen-1.5b",
"wsl-venv": true,
"args": {
"host": "0.0.0.0",
"port": 8000,
"tensor-parallel-size": 1,
"gpu-memory-utilization": 0.9,
"max-model-len": 4096
}
}
}
]
}
```

Note: If `wsl-venv` is `true`, ensure your WSL environment activates the Python virtual environment in non-interactive SSH sessions (configure this in `~/.bashrc`).
```bash
python run_tests.py --config wsl_test.json
```

How it works: InferMatrix connects to WSL via SSH, launches the vLLM service, runs the performance tests, and cleanly shuts the service down.
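For readers who want to picture the mechanics, the paramiko-based sketch below shows the same launch/measure/teardown cycle in miniature. It is an illustration under assumptions, not InferMatrix's actual `ssh_manager`/`backend_deployer` code: the vLLM launch command, endpoint, model name, and fixed sleep are placeholders you would adapt to your setup.

```python
# Illustrative sketch of an SSH-driven test cycle (not InferMatrix's own code).
# Assumes paramiko and requests are installed and that the remote host serves a
# vLLM OpenAI-compatible API on port 8000 once launched.
import time

import paramiko
import requests

HOST, USER, PASSWORD = "172.28.144.1", "your_wsl_username", "your_password"
LAUNCH_CMD = (
    "nohup python -m vllm.entrypoints.openai.api_server "
    "--model /mnt/e/models/deepseek-r1-distill-qwen-1.5b --port 8000 "
    "> vllm.log 2>&1 &"
)

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(HOST, port=22, username=USER, password=PASSWORD)
try:
    client.exec_command(LAUNCH_CMD)   # 1. launch the backend service
    time.sleep(60)                    #    crude wait; a real tool polls the port

    start = time.time()               # 2. run one timed request
    resp = requests.post(
        f"http://{HOST}:8000/v1/completions",
        json={
            "model": "/mnt/e/models/deepseek-r1-distill-qwen-1.5b",
            "prompt": "What is machine learning?",
            "max_tokens": 512,
        },
        timeout=300,
    )
    print(f"Status {resp.status_code}, total latency {time.time() - start:.2f}s")
finally:
    client.exec_command("pkill -f vllm.entrypoints.openai.api_server")  # 3. teardown
    client.close()
```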
Test models on remote Linux servers (e.g., vLLM, Text Generation Inference).
- SSH access to remote server (username/password or SSH key)
- Model service deployed on the server (vLLM, TGI, etc.)
Create `remote_test.json`:

```json
{
"ssh": {
"local_mode": false,
"hostname": "192.168.1.100",
"username": "your_username",
"password": "your_password",
"key_path": null,
"port": 22
},
"prompts": [
"Explain the transformer architecture.",
"What are the benefits of quantization?"
],
"max_tokens": 1024,
"result_dir": "results",
"tests": [
{
"name": "remote-vllm-test",
"backend": "vllm",
"port": 8000,
"backend_config": {
"model_path": "/data/models/llama-2-7b",
"wsl-venv": false,
"args": {
"host": "0.0.0.0",
"port": 8000,
"tensor-parallel-size": 2,
"gpu-memory-utilization": 0.85,
"max-model-len": 8192
}
}
}
]
}
```

Before running full tests, verify server connectivity:
```bash
python demo_run_server_test.py --config remote_test.json
```

Then run the full test suite:

```bash
python run_tests.py --config remote_test.json
```

| Field | Type | Description |
|---|---|---|
| `ssh.local_mode` | bool | `true`: test local services; `false`: test via SSH |
| `ssh.hostname` | string | Server IP or hostname (use `127.0.0.1` for local) |
| `ssh.username` | string | SSH username |
| `ssh.password` | string | SSH password (leave empty if using a key) |
| `ssh.key_path` | string | Path to SSH private key |
| `ssh.port` | int | SSH port (default: 22) |
| `prompts` | list[string] | List of test prompts |
| `max_tokens` | int | Maximum tokens to generate |
| `result_dir` | string | Output directory for results |
| `tests` | list[object] | List of test configurations |
| `tests[].name` | string | Test identifier |
| `tests[].model` | string | Model name (Ollama only) |
| `tests[].backend` | string | Backend type: `ollama`, `vllm`, or `lmstudio` |
| `tests[].port` | int | Service port |
| `tests[].backend_config` | object | Backend-specific configuration |
| `backend_config.model_path` | string | Model file path on the server |
| `backend_config.wsl-venv` | bool | Activate the Python venv in WSL |
| `backend_config.args` | object | Backend startup arguments |
All example configurations are available in the configs/ directory:
- `local_ollama_example.json` - Local Windows Ollama testing
- `wsl_vllm_example.json` - WSL vLLM testing via SSH
- `remote_server_example.json` - Remote server testing
InferMatrix measures the following key performance indicators:
| Metric | Description | Unit |
|---|---|---|
| TTFT | Time To First Token: latency from request to first token | seconds |
| TPOT | Time Per Output Token: average time per generated token | milliseconds |
| Throughput | Overall generation speed (total tokens per second) | tokens/s |
| Token Count | Total tokens in the response | tokens |
| Prefill Speed | Input processing speed | tokens/s |
```
Request Timeline:
├─ TTFT ─────┤ (Prefill Phase)
             └─ TPOT ─┬─ TPOT ─┬─ TPOT ─┤ (Decode Phase)
               Token 1   Token 2   Token N

Total Latency = TTFT + (TPOT × N tokens)
Throughput    = N tokens / Total Latency
```
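In code, these definitions reduce to a few lines. The sketch below assumes you already have the request timestamp, first-token timestamp, completion timestamp, and generated-token count from a streaming response; the variable names are illustrative.

```python
# Sketch of the metric definitions above (illustrative variable names).
def compute_metrics(t_request: float, t_first_token: float,
                    t_done: float, n_tokens: int) -> dict:
    ttft = t_first_token - t_request           # seconds (prefill phase)
    total_latency = t_done - t_request         # seconds
    # Matches the formula above: Total Latency = TTFT + TPOT x N tokens
    tpot_ms = (t_done - t_first_token) / max(n_tokens, 1) * 1000.0
    throughput = n_tokens / total_latency      # tokens per second
    return {"ttft_s": ttft, "tpot_ms": tpot_ms,
            "throughput_tok_s": throughput, "token_count": n_tokens}

# Example: first token after 0.25 s, 512 tokens finished at 8.15 s.
print(compute_metrics(0.0, 0.25, 8.15, 512))
```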
Compare multiple models across different backends:
```json
{
"prompts": ["Calculate 1+1"],
"tests": [
{
"name": "ollama-deepseek",
"model": "deepseek-r1:1.5b",
"backend": "ollama",
"port": 11434
},
{
"name": "vllm-deepseek",
"backend": "vllm",
"backend_config": {
"model_path": "/models/deepseek-r1-1.5b"
}
},
{
"name": "lmstudio-deepseek",
"backend": "lmstudio",
"port": 8000
}
]
}
```

The results are automatically compiled into a comparison matrix:

```
Framework Comparison - DeepSeek R1 1.5B
┌─────────────┬────────────┬───────────┬──────────────┐
│ Backend │ TTFT (s) │ TPOT (ms) │ Throughput │
├─────────────┼────────────┼───────────┼──────────────┤
│ Ollama │ 4.87 │ 7.29 │ 137.09 tok/s │
│ vLLM │ 0.25 │ 15.80 │ 63.29 tok/s │
│ LMStudio │ 2.33 │ 9.81 │ 101.89 tok/s │
└─────────────┴────────────┴───────────┴──────────────┘
```
Test the same model on different GPUs:
```bash
# Test on RTX 4090
python run_tests.py --config configs/rtx4090_config.json
# Test on RTX 3090
python run_tests.py --config configs/rtx3090_config.json
# Compare results
python generate_comparison_report.py \
  --results results/rtx4090_results.json results/rtx3090_results.json
```

Symptoms: `Connection refused` or `Permission denied`
Solutions:
- ✅ Verify `hostname`, `username`, `password`/`key_path`, and `port`
- ✅ Ensure the SSH service is running: `sudo systemctl status ssh`
- ✅ Test connectivity: `ssh username@hostname`
- ✅ For WSL: check Windows Defender Firewall settings
Symptoms: `Model <name> not found` or `404 Not Found`
Solutions:
- Ollama: verify the model name matches `ollama list` exactly
- vLLM: check that `model_path` points to a directory containing `config.json` and the weight files
- WSL: ensure the Windows drive mount path is correct (e.g., `/mnt/e/...`)
Symptoms: `Address already in use` or `Port <N> is occupied`
Solutions:
- ✅ Check whether the port is in use: `lsof -i :8000` (Linux) or `netstat -ano | findstr :8000` (Windows)
- ✅ Kill the existing process or choose a different port
- ✅ Update the `port` field in the configuration
Symptoms: `Permission denied` when accessing files or starting services
Solutions:
- ✅ Ensure user has read/write permissions for config and result directories
- ✅ Verify SSH user has permissions to access model files
- ✅ Check execution permissions: `chmod +x run_tests.py`
Symptoms: vLLM fails to initialize or detect GPU
Solutions:
- ✅ Verify the CUDA installation: run `nvidia-smi` in WSL
- ✅ Check that vLLM can detect the GPU: `python -c "import torch; print(torch.cuda.is_available())"`
- ✅ If `wsl-venv` is `true`, verify the activation command in the script (default: `source ~/venv/bin/activate`)
- ✅ Ensure GPU drivers are properly passed through to WSL
Symptoms: `Request timeout` or `Connection timeout`
Solutions:
- ✅ Increase timeout values in code if testing very large models
- ✅ Check network stability
- ✅ Verify server resources are sufficient (CPU, GPU, RAM)
InferMatrix uses a modular architecture for flexibility and extensibility:
```
infermatrix/
├── run_tests.py # Main entry point
├── test_orchestrator.py # Test orchestration and workflow
├── llm_tester.py # Core testing logic
├── ssh_manager.py # SSH connection management
├── backend_deployer.py # Backend service deployment
├── configs/ # Configuration examples
│ ├── local_ollama_example.json
│ ├── wsl_vllm_example.json
│ └── remote_server_example.json
├── results/ # Test results output
│ ├── *.json # Raw performance data
│ └── *.png # Visualization charts
└── utils/ # Utility functions
├── metrics.py # Performance metrics calculation
    └── visualizer.py      # Chart generation
```
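As a rough mental model of how these modules interact for a single test, here is a hypothetical sketch; the class and method names are invented to mirror the file names above and do not reflect the project's real API.

```python
# Hypothetical orchestration flow mirroring the module layout above.
# Protocols and method names are illustrative, not InferMatrix's actual API.
from typing import Protocol

class Deployer(Protocol):
    def start(self, test: dict) -> None: ...
    def stop(self, test: dict) -> None: ...

class Tester(Protocol):
    def run(self) -> dict: ...

def run_single_test(deployer: Deployer, tester: Tester, test: dict) -> dict:
    """Launch the backend (backend_deployer role), measure (llm_tester role), tear down."""
    deployer.start(test)      # e.g. start ollama/vllm/lmstudio, locally or via ssh_manager
    try:
        return tester.run()   # collect TTFT, TPOT, throughput for each prompt
    finally:
        deployer.stop(test)   # always shut the service down, even on failure
```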
We welcome contributions! Please see our Contributing Guide for details.
- 🐛 Report bugs via GitHub Issues
- 💡 Suggest features or improvements
- 📝 Improve documentation
- 🔧 Submit pull requests
- Support for additional backends (TensorRT-LLM, Text-Generation-Inference)
- Enhanced visualization options
- Batch testing capabilities
- Multi-GPU performance analysis
- CI/CD pipeline
This project is licensed under the MIT License - see the LICENSE file for details.
If you use InferMatrix in your research, please cite:
```bibtex
@software{infermatrix2025,
title={InferMatrix: Systematic LLM Inference Performance Evaluation},
author={Anonymous},
year={2025},
url={https://github.com/infermatrix/infermatrix}
}
```

- vLLM - High-throughput LLM serving
- Ollama - Run LLMs locally
- LMStudio - Desktop LLM application
- llm-perf - LLM performance benchmarking
If you find InferMatrix useful, please consider giving it a star ⭐
Built with ❤️ for the LLM community