
InferMatrix


Systematic LLM Inference Performance Evaluation

License: MIT • Python 3.8+ • Code style: black

Build Performance Matrices • Compare Hardware Configs • Optimize Deployments

Quick Start • Documentation • Examples • Contributing


🎯 What is InferMatrix?

InferMatrix is a unified framework for systematic evaluation of Large Language Model (LLM) inference performance across multiple backends and hardware configurations. It helps you build comprehensive performance matrices to identify optimal deployment strategies.

Why InferMatrix?

  • 🔄 Multi-Backend Support: Seamlessly test Ollama, vLLM, and LMStudio
  • 🌐 Flexible Deployment: Local Windows, WSL, and remote servers via SSH
  • 📊 Matrix-Based Evaluation: Cross-dimensional performance comparison
  • 📏 Comprehensive Metrics: TTFT, TPOT, Throughput, and Token Count
  • 🎛️ JSON-Driven Configuration: Simple setup, powerful capabilities
  • 📈 Auto Visualization: Beautiful performance reports out of the box

Performance Matrix Example

InferMatrix generates systematic performance comparisons like this:

┌──────────┬────────────┬────────────┬────────────┐
│ Backend  │ RTX 4090   │ RTX 3090   │ A100       │
├──────────┼────────────┼────────────┼────────────┤
│ Ollama   │ 137 tok/s  │ 89 tok/s   │ 156 tok/s  │
│ vLLM     │ 156 tok/s  │ 102 tok/s  │ 198 tok/s  │
│ LMStudio │ 125 tok/s  │ 84 tok/s   │ 145 tok/s  │
└──────────┴────────────┴────────────┴────────────┘

Evaluate new models, compare frameworks, and optimize hardware choices systematically.


🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/infermatrix/infermatrix.git
cd infermatrix

# Install dependencies
pip install -r requirements.txt

Basic Usage

# Run a test with your configuration
python run_tests.py --config configs/example.json

# List all available tests
python run_tests.py --list

# Generate performance report
python run_tests.py --generate-report

Your First Test

Create a simple configuration file my_test.json:

{
  "ssh": {
    "local_mode": true,
    "hostname": "127.0.0.1"
  },
  "prompts": ["What is the capital of France?"],
  "max_tokens": 512,
  "result_dir": "results",
  "tests": [
    {
      "name": "ollama-test",
      "model": "llama2:7b",
      "backend": "ollama",
      "port": 11434
    }
  ]
}

Run the test:

python run_tests.py --config my_test.json

Results are written to the results/ directory as JSON files along with visualization charts.
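To spot-check the numbers without opening the charts, a few lines of Python are enough. This is only a sketch: the result-file layout and field names (name, ttft, tpot, throughput) are assumptions, so adjust them to the keys in the JSON files InferMatrix actually writes.

# Sketch: print a quick summary of every result file in results/.
# NOTE: the JSON structure and field names below are assumptions,
# not InferMatrix's documented schema.
import json
from pathlib import Path

for path in Path("results").glob("*.json"):
    data = json.loads(path.read_text())
    entries = data if isinstance(data, list) else [data]
    for entry in entries:
        print(path.name, entry.get("name"), entry.get("ttft"),
              entry.get("tpot"), entry.get("throughput"))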


📖 Documentation

Table of Contents

  • 🎬 Testing Scenarios
  • ⚙️ Configuration Reference
  • 📊 Performance Metrics
  • 🎨 Examples
  • 🔧 Troubleshooting
  • 🏗️ Architecture
  • 🤝 Contributing
  • 📄 License


🎬 Testing Scenarios

1. Local Windows Testing

Test models running directly on your Windows machine (e.g., Ollama).

Prerequisites

  • Windows with Ollama installed and running
  • Model downloaded via ollama pull <model_name>

Configuration

Use the template configs/local_ollama_example.json:

{
  "ssh": {
    "local_mode": true,
    "hostname": "127.0.0.1"
  },
  "prompts": [
    "Explain neural networks in simple terms.",
    "Write a Python function to calculate factorial."
  ],
  "max_tokens": 512,
  "result_dir": "results",
  "tests": [
    {
      "name": "ollama-local-test",
      "model": "deepseek-r1:1.5b",
      "backend": "ollama",
      "local_mode": true,
      "port": 11434
    }
  ]
}

Run Test

python run_tests.py --config configs/local_ollama_example.json
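Before launching the test, it can help to confirm that the local Ollama service is reachable and that the model from the config has been pulled. The pre-flight check below queries Ollama's /api/tags endpoint (which lists locally available models) on the default port; the model name is taken from the example config above.

# Pre-flight check: is Ollama up on 127.0.0.1:11434 and is the model pulled?
# Uses Ollama's /api/tags endpoint, which lists local models.
import requests

resp = requests.get("http://127.0.0.1:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama is running; local models:", models)
if "deepseek-r1:1.5b" not in models:
    print("Model not found - run: ollama pull deepseek-r1:1.5b")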

2. WSL Testing via SSH

Test models deployed in Windows Subsystem for Linux (e.g., vLLM).

Prerequisites

  • WSL (Ubuntu recommended) installed and configured
  • vLLM and dependencies (CUDA) installed in WSL
  • Model files accessible in WSL (e.g., /mnt/e/models/...)
  • Network connectivity between Windows and WSL

Configuration

Step 1: Get WSL IP address

# In WSL terminal
hostname -I
# Example output: 172.28.144.1

Step 2: Create configuration wsl_test.json:

{
  "ssh": {
    "local_mode": false,
    "hostname": "172.28.144.1",
    "username": "your_wsl_username",
    "password": "your_password",
    "port": 22
  },
  "prompts": ["What is machine learning?"],
  "max_tokens": 512,
  "result_dir": "results",
  "tests": [
    {
      "name": "wsl-vllm-test",
      "backend": "vllm",
      "port": 8000,
      "backend_config": {
        "model_path": "/mnt/e/models/deepseek-r1-distill-qwen-1.5b",
        "wsl-venv": true,
        "args": {
          "host": "0.0.0.0",
          "port": 8000,
          "tensor-parallel-size": 1,
          "gpu-memory-utilization": 0.9,
          "max-model-len": 4096
        }
      }
    }
  ]
}

Note: If wsl-venv: true, ensure your WSL environment activates the Python virtual environment in non-interactive SSH sessions (configure in ~/.bashrc).

Run Test

python run_tests.py --config wsl_test.json

How it works: InferMatrix connects to WSL over SSH, launches the vLLM service, runs the performance tests, and then shuts the service down cleanly.
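For illustration only, the sketch below mirrors that sequence with paramiko and requests. It is not InferMatrix's actual implementation; the hostname, credentials, launch command, and shutdown step are assumptions taken from the example config above.

# Conceptual sketch of the SSH workflow described above, not InferMatrix's code.
# Hostname, credentials, and the vLLM launch command are placeholders.
import time
import paramiko
import requests

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("172.28.144.1", username="your_wsl_username", password="your_password")

# 1. Launch vLLM in the background on the WSL side.
client.exec_command(
    "nohup python -m vllm.entrypoints.openai.api_server "
    "--model /mnt/e/models/deepseek-r1-distill-qwen-1.5b "
    "--host 0.0.0.0 --port 8000 > vllm.log 2>&1 &"
)

# 2. Wait until the OpenAI-compatible endpoint answers.
for _ in range(60):
    try:
        if requests.get("http://172.28.144.1:8000/v1/models", timeout=2).ok:
            break
    except requests.RequestException:
        time.sleep(5)

# 3. ... run the benchmark requests here ...

# 4. Shut the service down and close the connection.
client.exec_command("pkill -f vllm.entrypoints.openai.api_server")
client.close()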


3. Remote Server Testing

Test models on remote Linux servers (e.g., vLLM, Text Generation Inference).

Prerequisites

  • SSH access to remote server (username/password or SSH key)
  • Model service deployed on the server (vLLM, TGI, etc.)

Configuration

Create remote_test.json:

{
  "ssh": {
    "local_mode": false,
    "hostname": "192.168.1.100",
    "username": "your_username",
    "password": "your_password",
    "key_path": null,
    "port": 22
  },
  "prompts": [
    "Explain the transformer architecture.",
    "What are the benefits of quantization?"
  ],
  "max_tokens": 1024,
  "result_dir": "results",
  "tests": [
    {
      "name": "remote-vllm-test",
      "backend": "vllm",
      "port": 8000,
      "backend_config": {
        "model_path": "/data/models/llama-2-7b",
        "wsl-venv": false,
        "args": {
          "host": "0.0.0.0",
          "port": 8000,
          "tensor-parallel-size": 2,
          "gpu-memory-utilization": 0.85,
          "max-model-len": 8192
        }
      }
    }
  ]
}

Test Connectivity (Recommended)

Before running full tests, verify server connectivity:

python demo_run_server_test.py --config remote_test.json
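For an even quicker sanity check than the demo script, the snippet below simply opens an SSH session with paramiko and runs nvidia-smi to confirm that the login works and a GPU is visible. The hostname and credentials are the placeholders from remote_test.json.

# Minimal SSH + GPU sanity check (placeholder credentials from remote_test.json).
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("192.168.1.100", username="your_username", password="your_password", port=22)

stdin, stdout, stderr = client.exec_command(
    "nvidia-smi --query-gpu=name,memory.total --format=csv")
print(stdout.read().decode() or stderr.read().decode())
client.close()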

Run Performance Test

python run_tests.py --config remote_test.json

⚙️ Configuration Reference

Configuration Schema

Field                        Type           Description
ssh.local_mode               bool           true: test local services; false: test via SSH
ssh.hostname                 string         Server IP or hostname (use 127.0.0.1 for local)
ssh.username                 string         SSH username
ssh.password                 string         SSH password (leave empty if using a key)
ssh.key_path                 string         Path to SSH private key
ssh.port                     int            SSH port (default: 22)
prompts                      list[string]   List of test prompts
max_tokens                   int            Maximum tokens to generate
result_dir                   string         Output directory for results
tests                        list[object]   List of test configurations
tests[].name                 string         Test identifier
tests[].model                string         Model name (Ollama only)
tests[].backend              string         Backend type: ollama, vllm, lmstudio
tests[].port                 int            Service port
tests[].backend_config       object         Backend-specific configuration
backend_config.model_path    string         Model file path on the server
backend_config.wsl-venv      bool           Activate Python venv in WSL
backend_config.args          object         Backend startup arguments
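A malformed configuration is the most common cause of a run failing at startup. The checker below is a hypothetical helper (it does not ship with InferMatrix) that verifies the top-level keys and per-test keys from the table above before you start a run, e.g. python check_config.py my_test.json.

# Hypothetical config checker based on the schema table above - not part of InferMatrix.
import json
import sys

REQUIRED_TOP_LEVEL = ["ssh", "prompts", "max_tokens", "result_dir", "tests"]
REQUIRED_PER_TEST = ["name", "backend"]  # model/port/backend_config depend on the backend

def check_config(path):
    with open(path) as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in cfg]
    for i, test in enumerate(cfg.get("tests", [])):
        missing += [f"tests[{i}].{key}" for key in REQUIRED_PER_TEST if key not in test]
    return missing

if __name__ == "__main__":
    problems = check_config(sys.argv[1])
    print("Config looks OK" if not problems else f"Missing fields: {problems}")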

Example Configurations

All example configurations are available in the configs/ directory:

  • local_ollama_example.json - Local Windows Ollama testing
  • wsl_vllm_example.json - WSL vLLM testing via SSH
  • remote_server_example.json - Remote server testing

📊 Performance Metrics

InferMatrix measures the following key performance indicators:

Metric          Description                                                 Unit
TTFT            Time To First Token: latency from request to first token   seconds
TPOT            Time Per Output Token: average time per generated token    milliseconds
Throughput      Total tokens per second: overall generation speed          tokens/s
Token Count     Total tokens in the response                                tokens
Prefill Speed   Input processing speed                                      tokens/s

Understanding the Metrics

Request Timeline:
├─ TTFT ─────┤ (Prefill Phase)
             └─ TPOT ─┬─ TPOT ─┬─ TPOT ─┤ (Decode Phase)
                      Token 1  Token 2  Token N

Total Latency = TTFT + (TPOT × N tokens)
Throughput = N tokens / Total Latency
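As a worked example of these formulas, the function below derives the three core metrics from three timestamps (request sent, first token received, last token received) and the output token count. This is a simplification of what InferMatrix records, but it follows the same definitions as above.

# Worked example of the formulas above (timestamps in seconds are illustrative).
def compute_metrics(request_time, first_token_time, last_token_time, n_tokens):
    ttft = first_token_time - request_time            # seconds
    decode_time = last_token_time - first_token_time  # seconds
    tpot_ms = decode_time / n_tokens * 1000           # ms per output token
    total_latency = last_token_time - request_time    # TTFT + TPOT * N
    throughput = n_tokens / total_latency             # tokens/s
    return ttft, tpot_ms, throughput

# 512 tokens, first token after 0.25 s, generation finished at 8.25 s:
print(compute_metrics(0.0, 0.25, 8.25, 512))
# -> TTFT = 0.25 s, TPOT ~ 15.6 ms, throughput ~ 62 tok/s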

🎨 Examples

Building a Performance Matrix

Compare multiple models across different backends:

{
  "prompts": ["Calculate 1+1"],
  "tests": [
    {
      "name": "ollama-deepseek",
      "model": "deepseek-r1:1.5b",
      "backend": "ollama",
      "port": 11434
    },
    {
      "name": "vllm-deepseek",
      "backend": "vllm",
      "backend_config": {
        "model_path": "/models/deepseek-r1-1.5b"
      }
    },
    {
      "name": "lmstudio-deepseek",
      "backend": "lmstudio",
      "port": 8000
    }
  ]
}

Results automatically generate a comparison matrix:

Framework Comparison - DeepSeek R1 1.5B
┌─────────────┬────────────┬───────────┬──────────────┐
│ Backend     │ TTFT (s)   │ TPOT (ms) │ Throughput   │
├─────────────┼────────────┼───────────┼──────────────┤
│ Ollama      │ 4.87       │ 7.29      │ 137.09 tok/s │
│ vLLM        │ 0.25       │ 15.80     │ 63.29 tok/s  │
│ LMStudio    │ 2.33       │ 9.81      │ 101.89 tok/s │
└─────────────┴────────────┴───────────┴──────────────┘

Hardware Configuration Testing

Test the same model on different GPUs:

# Test on RTX 4090
python run_tests.py --config configs/rtx4090_config.json

# Test on RTX 3090
python run_tests.py --config configs/rtx3090_config.json

# Compare results
python generate_comparison_report.py \
  --results results/rtx4090_results.json results/rtx3090_results.json
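generate_comparison_report.py is the supported way to produce the comparison; the sketch below only illustrates the idea of merging two result files into a side-by-side view. The file paths and the throughput field name are assumptions about the result JSON, so adjust them to the real output.

# Illustrative only: side-by-side throughput comparison of two result files.
# The JSON layout and the "throughput" field are assumptions.
import json

def load(path):
    with open(path) as f:
        data = json.load(f)
    return {entry["name"]: entry for entry in (data if isinstance(data, list) else [data])}

rtx4090 = load("results/rtx4090_results.json")
rtx3090 = load("results/rtx3090_results.json")

print(f"{'Test':<20} {'RTX 4090':>12} {'RTX 3090':>12}")
for name in sorted(set(rtx4090) & set(rtx3090)):
    print(f"{name:<20} {rtx4090[name]['throughput']:>12.1f} {rtx3090[name]['throughput']:>12.1f}")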

🔧 Troubleshooting

Common Issues and Solutions

SSH Connection Failed

Symptoms: Connection refused or Permission denied

Solutions:

  • ✅ Verify hostname, username, password/key_path, and port
  • ✅ Ensure SSH service is running: sudo systemctl status ssh
  • ✅ Test connectivity: ssh username@hostname
  • ✅ For WSL: Check Windows Defender Firewall settings

Model Not Found

Symptoms: Model <name> not found or 404 Not Found

Solutions:

  • ✅ Ollama: Verify the model name matches ollama list exactly
  • ✅ vLLM: Check that model_path points to a directory containing config.json and the weight files
  • ✅ WSL: Ensure the Windows drive mount path is correct (e.g., /mnt/e/...)

Port Already in Use

Symptoms: Address already in use or Port <N> is occupied

Solutions:

  • ✅ Check if port is in use: lsof -i :8000 (Linux) or netstat -ano | findstr :8000 (Windows)
  • ✅ Kill existing process or choose a different port
  • ✅ Update port field in configuration

Permission Denied

Symptoms: Permission denied when accessing files or starting services

Solutions:

  • ✅ Ensure user has read/write permissions for config and result directories
  • ✅ Verify SSH user has permissions to access model files
  • ✅ Check execution permissions: chmod +x run_tests.py

vLLM Startup Failed in WSL

Symptoms: vLLM fails to initialize or detect GPU

Solutions:

  • ✅ Verify CUDA installation: nvidia-smi in WSL
  • ✅ Check vLLM can detect GPU: python -c "import torch; print(torch.cuda.is_available())"
  • ✅ If wsl-venv: true, verify activation command in script (default: source ~/venv/bin/activate)
  • ✅ Ensure GPU drivers are properly passed through to WSL

Timeout Errors

Symptoms: Request timeout or Connection timeout

Solutions:

  • ✅ Increase the timeout values in the code when testing very large models
  • ✅ Check network stability
  • ✅ Verify server resources are sufficient (CPU, GPU, RAM)

🏗️ Architecture

InferMatrix uses a modular architecture for flexibility and extensibility:

infermatrix/
├── run_tests.py              # Main entry point
├── test_orchestrator.py      # Test orchestration and workflow
├── llm_tester.py             # Core testing logic
├── ssh_manager.py            # SSH connection management
├── backend_deployer.py       # Backend service deployment
├── configs/                  # Configuration examples
│   ├── local_ollama_example.json
│   ├── wsl_vllm_example.json
│   └── remote_server_example.json
├── results/                  # Test results output
│   ├── *.json               # Raw performance data
│   └── *.png                # Visualization charts
└── utils/                    # Utility functions
    ├── metrics.py           # Performance metrics calculation
    └── visualizer.py        # Chart generation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Ways to Contribute

  • 🐛 Report bugs via GitHub Issues
  • 💡 Suggest features or improvements
  • 📝 Improve documentation
  • 🔧 Submit pull requests

Priority Areas

  • Support for additional backends (TensorRT-LLM, Text-Generation-Inference)
  • Enhanced visualization options
  • Batch testing capabilities
  • Multi-GPU performance analysis
  • CI/CD pipeline

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


📚 Citation

If you use InferMatrix in your research, please cite:

@software{infermatrix2025,
  title={InferMatrix: Systematic LLM Inference Performance Evaluation},
  author={Anonymous},
  year={2025},
  url={https://github.com/infermatrix/infermatrix}
}

🔗 Related Projects

  • vLLM - High-throughput LLM serving
  • Ollama - Run LLMs locally
  • LMStudio - Desktop LLM application
  • llm-perf - LLM performance benchmarking

🌟 Star History

If you find InferMatrix useful, please consider giving it a star ⭐


Built with ❤️ for the LLM community

Documentation • Issues • Discussions
