litmus-lab

A local CLI profiler, with no cloud dependencies, for benchmarking large language models across precision formats (native FP16, INT8, and INT4) directly on your GPU.

litmus-lab measures:

GPU VRAM consumption
Tokens per second (TPS)
Time to first token (TTFT)
Linguistic degradation (Perplexity)

After profiling, an offline mathematical heuristic engine automatically recommends the best deployment precision for your hardware.

No cloud APIs. No subscriptions. No hallucinated advice.

Features

Multi-Precision Benchmarking

Profile and compare:

Native FP16
INT8 Quantization
INT4 Quantization (NF4)

on the exact same prompt and architecture.
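
As a rough guide to what each format should cost before measuring anything, the weight footprint can be estimated from the parameter count alone. This is a minimal sketch, not part of litmus-lab: `weight_memory_mb` and the 3.8-billion-parameter figure are illustrative assumptions, and measured VRAM will be higher because of activations, the KV cache, and layers that stay in higher precision.

```python
# Rough weight-only memory estimate per precision format.
# Assumption: every parameter is stored at the nominal width; real quantized
# checkpoints keep some layers in higher precision and add scaling constants.
BITS_PER_PARAM = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Return the approximate weight footprint in MB (1 MB = 1024 * 1024 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return num_params * bits / 8 / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical 3.8-billion-parameter model, chosen only for illustration.
    for precision in BITS_PER_PARAM:
        print(f"{precision}: {weight_memory_mb(3_800_000_000, precision):,.0f} MB")
```

Comparing this estimate with the measured VRAM column shows how much of the footprint comes from overhead rather than weights.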

Offline Recommendation Engine

litmus-lab contains a local rule-based mathematical evaluation engine.

It automatically determines:

whether quantization is worth it
whether the VRAM savings justify the quality loss
whether lower precision actually hurts latency or throughput
whether perplexity degradation exceeds a safe level

Example:

Small models may not benefit from quantization
Some architectures become unstable in 4-bit
Certain GPUs gain VRAM savings but lose TPS

The engine evaluates all of this locally and outputs a deployment verdict.

VRAM Isolation & Cleanup

Each profiling worker is completely isolated.

Between benchmark passes, litmus-lab aggressively performs:

CUDA cache cleanup
Python garbage collection
IPC memory clearing
model unloads
allocator flushes

This prevents hidden memory leaks and false VRAM readings during sequential quantization tests.
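
A cleanup pass of this kind might look like the sketch below. It is an assumption about how such a step could be written with standard PyTorch calls, not litmus-lab's actual code; `release_gpu_memory` is a hypothetical helper name, and the caller is expected to have dropped its own references to the model (for example with `del model`) first.

```python
import gc


def release_gpu_memory() -> int:
    """Free cached GPU memory between passes and return bytes still allocated.

    Returns 0 when PyTorch or a CUDA device is unavailable.
    """
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    torch.cuda.synchronize()              # wait for pending kernels to finish
    torch.cuda.empty_cache()              # release unused cached allocator blocks
    torch.cuda.ipc_collect()              # reclaim memory released by CUDA IPC
    torch.cuda.reset_peak_memory_stats()  # next pass starts with a clean peak counter
    return torch.cuda.memory_allocated()
```

A non-zero return value after unloading a model suggests that some tensor is still referenced, which would inflate the next pass's VRAM reading.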

🛡 Context-Length Protection

Some older transformer architectures crash if generation exceeds positional embedding limits.

litmus-lab automatically:

reads max_position_embeddings
scales test sequence lengths safely
avoids index out-of-bound runtime failures
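
The clamping step can be sketched as follows. This is a minimal illustration under stated assumptions: `safe_generation_length` is a hypothetical helper, and `config` stands in for a Hugging Face model config object that exposes `max_position_embeddings`.

```python
from types import SimpleNamespace


def safe_generation_length(config, prompt_tokens: int, requested_new_tokens: int) -> int:
    """Clamp the number of new tokens so prompt + output fits the position limit."""
    limit = getattr(config, "max_position_embeddings", None)
    if limit is None:
        # No declared limit: leave the request unchanged.
        return requested_new_tokens
    remaining = limit - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens but the model supports only {limit}"
        )
    return min(requested_new_tokens, remaining)


if __name__ == "__main__":
    # Stand-in for a model config; 2048 is an illustrative limit.
    config = SimpleNamespace(max_position_embeddings=2048)
    print(safe_generation_length(config, prompt_tokens=2000, requested_new_tokens=256))  # 48
```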

Beautiful Terminal Dashboard

All benchmark data is rendered using rich terminal tables for clean visualization directly inside the CLI.

Example:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Quantization ┃ VRAM (MB) ┃ Tokens/sec(TPS) ┃ Time to first token(TTFT) ┃ Perplexity ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Native       │ 7540.12   │ 54.2180         │ 0.0120 sec                │ 12.42      │
│ INT8         │ 4210.45   │ 18.9412         │ 0.0540 sec                │ 12.45      │
│ INT4         │ 2840.88   │ 22.1054         │ 0.0610 sec                │ 12.68      │
└──────────────┴───────────┴─────────────────┴───────────────────────────┴────────────┘

Installation

Install from PyPI:

pip install litmus-lab

Quick Start

Run a full benchmark pass:

litmus-lab --model microsoft/Phi-3-mini-4k-instruct --prompt "Write a short poem on a wall"

CLI Options

Flag	Description
`--model`	Hugging Face model repository path
`--prompt`	Prompt text sent to the inference runner
`--token`	Optional Hugging Face token for gated models
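
For readers scripting around the tool, the three flags above map onto a parser along these lines. This is a sketch with `argparse`, not the project's actual entry point; treating `--model` and `--prompt` as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (required-ness is assumed)."""
    parser = argparse.ArgumentParser(prog="litmus-lab")
    parser.add_argument("--model", required=True,
                        help="Hugging Face model repository path")
    parser.add_argument("--prompt", required=True,
                        help="Prompt text sent to the inference runner")
    parser.add_argument("--token", default=None,
                        help="Optional Hugging Face token for gated models")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model, args.prompt, "token set" if args.token else "no token")
```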

Supported Models

Most Hugging Face causal language models are supported, including:

Phi
Qwen
Gemma
Mistral
Llama
OPT
Falcon
TinyLlama
DeepSeek

Examples:

litmus-lab --model Qwen/Qwen2.5-7B-Instruct --prompt "Explain quantum gravity"

litmus-lab --model google/gemma-2-2b-it --prompt "Write a Linux shell script"

litmus-lab --model meta-llama/Llama-3.1-8B-Instruct --token YOUR_TOKEN --prompt "Explain TCP congestion control"

Example System Evaluation Report

==========================================================================================
SYSTEM EVALUATION REPORT FOR microsoft/Phi-3-mini-4k-instruct:

• Recommendation: Deploy INT4 (NF4 format).

• Reason:
  Reclaims a significant 4699.24 MB of GPU VRAM compared to Native FP16 execution.

  The perplexity delta remains tightly controlled (+0.26 PPL), making INT4 the
  most hardware-efficient deployment format for this architecture size.

==========================================================================================

How Recommendations Are Calculated

The heuristic engine evaluates:

VRAM reclaimed
TPS throughput changes
TTFT latency penalties
Perplexity degradation
architecture stability
quantization efficiency ratios

The recommendation engine is completely offline and deterministic.

No LLM APIs are used.
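
To make the idea of a deterministic rule-based verdict concrete, here is a minimal sketch. It is not litmus-lab's real engine: the `Result` type, the `recommend` function, and both thresholds are hypothetical, and the sketch gates only on VRAM and perplexity, leaving TPS and TTFT out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    vram_mb: float
    tps: float
    ttft_s: float
    perplexity: float


def recommend(native: Result, candidates: list[Result],
              max_ppl_increase: float = 0.5,
              min_vram_saving: float = 0.20) -> Result:
    """Pick the candidate with the lowest VRAM that passes both gates.

    Falls back to the native result when no candidate qualifies. Both
    thresholds are illustrative defaults, not litmus-lab's real values.
    """
    best = native
    for c in candidates:
        ppl_delta = c.perplexity - native.perplexity
        saving = (native.vram_mb - c.vram_mb) / native.vram_mb
        if ppl_delta > max_ppl_increase:   # quality loss too large
            continue
        if saving < min_vram_saving:       # not enough memory reclaimed
            continue
        if c.vram_mb < best.vram_mb:
            best = c
    return best


if __name__ == "__main__":
    # Figures taken from the example dashboard table in this README.
    native = Result("Native", 7540.12, 54.2180, 0.0120, 12.42)
    int8 = Result("INT8", 4210.45, 18.9412, 0.0540, 12.45)
    int4 = Result("INT4", 2840.88, 22.1054, 0.0610, 12.68)
    print(recommend(native, [int8, int4]).name)  # INT4
```

Because the rules are plain comparisons on measured numbers, the same inputs always yield the same verdict.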

Why This Exists

Most quantization tooling tells you:

"INT4 uses less memory."

But memory reduction alone does not determine deployment quality.

Some quantized models:

become slower
lose coherence
spike TTFT
destabilize logits
produce negligible VRAM savings

litmus-lab exists to mathematically determine whether quantization is actually worth deploying on YOUR hardware.

Example Workflow

# Benchmark a 7B instruct model
litmus-lab \
  --model Qwen/Qwen2.5-7B-Instruct \
  --prompt "Explain transformers in simple terms"

# Benchmark a gated Llama model
litmus-lab \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --token hf_xxxxxxxxx \
  --prompt "Write a memory allocator in C"

Performance Metrics Explained

VRAM (MB)

Peak GPU memory allocated during inference.

Tokens/sec (TPS)

Measures generation throughput: the number of tokens produced per second.

Higher is better.

Time To First Token (TTFT)

Measures inference latency before the first generated token appears.

Lower is better.
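
Both timing metrics can be taken from a single pass over a token stream. The sketch below is an assumption about one way to do it, not litmus-lab's implementation: `measure_stream` is a hypothetical helper that accepts any iterable yielding tokens as they are generated, and it counts the wait for the first token as part of the TPS window.

```python
import time
from typing import Iterable, Tuple


def measure_stream(tokens: Iterable[object]) -> Tuple[float, float, int]:
    """Consume a token stream and return (ttft_seconds, tokens_per_second, count)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token has arrived
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("the stream produced no tokens")
    ttft = first_token_at - start
    elapsed = end - start
    tps = count / elapsed if elapsed > 0 else float("inf")
    return ttft, tps, count


if __name__ == "__main__":
    def fake_stream(n: int = 20):
        # Stand-in for a model: sleeps briefly before yielding each token.
        for i in range(n):
            time.sleep(0.01)
            yield i

    ttft, tps, n = measure_stream(fake_stream())
    print(f"TTFT={ttft:.4f} s, TPS={tps:.1f}, tokens={n}")
```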

Perplexity (PPL)

Measures how uncertain the model is when predicting each token; an increase over the native baseline indicates degraded language quality.

Lower is better.

Small increases are acceptable.

Large jumps indicate quantization damage.
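
Perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of natural-log token probabilities; the `perplexity` helper is illustrative, and how litmus-lab obtains those log-probabilities is not specified here.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 8))
```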

Architecture

litmus-lab internally uses:

PyTorch
Transformers
bitsandbytes
CUDA memory instrumentation
isolated worker runners
rich terminal rendering

while exposing a single lightweight CLI.

Roadmap

Planned future features:

ONNX Runtime benchmarking
GGUF profiling
AWQ/GPTQ support
AMD ROCm backend
CPU-only profiling
tensor parallel profiling
JSON/CSV export mode
benchmark history tracking
automated regression detection

License

MIT License

Free to use, modify, distribute, and integrate into commercial tooling.

Contributing

Pull requests, issue reports, architecture improvements, and benchmark contributions are welcome.

Disclaimer

Benchmark results vary depending on:

GPU architecture
CUDA version
driver versions
kernel scheduler state
model architecture
tokenizer implementation
quantization backend

Always validate production deployments independently.
