This project uses Docker and llama.cpp to simulate a resource-constrained (low-CPU, low-RAM) mobile device environment on your laptop. It lets you benchmark the CPU-only inference performance of various GGUF-quantized language models.
- `run_test.sh` / `run_test.bat`: Your main configuration script. You edit this file to set the simulated CPU/RAM limits.
- `Dockerfile`: Defines the container, installing Python and `llama.cpp`.
- `benchmark.py`: The Python script that runs the actual tests (RAM usage, TTFT, TPS).
- `README.md`: This file.
The workflow is simple: you only need to run one command.
- Install Docker: You must have Docker Desktop (or Docker Engine on Linux) installed and running.
- Create Model Folder: Create a folder named `llm-models` in the same directory as these other files.
- Download GGUF Models: Go to Hugging Face and download quantized GGUF models (e.g., `phi-3-mini-4k-instruct-q4_K_M.gguf`). Place them inside your `llm-models` folder.
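As a quick sanity check that your models are in place before building the container, a small helper like the following can list the GGUF files in the `llm-models` folder. This function is illustrative only; it is not part of the repo's scripts:

```python
from pathlib import Path


def list_gguf_models(model_dir: str = "llm-models") -> list[str]:
    """Return the sorted names of all .gguf files in the model folder."""
    folder = Path(model_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Model folder not found: {folder}")
    return sorted(p.name for p in folder.glob("*.gguf"))
```

If this returns an empty list, the Docker container will start but `benchmark.py` will fail to load a model.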
You only need to edit two files to set up a new test:
- Configure Hardware (in `run_test.sh` or `run_test.bat`)

  Open your `run_test` script and edit the variables at the top:

  ```sh
  # Set the simulated hardware limits
  SIM_CPU_CORES="2.0"   # Set to "4.0" for a mid-range phone
  SIM_MEMORY="3g"       # Set to "6g" for a mid-range phone
  ```
- Configure Model (in `benchmark.py`)

  Open `benchmark.py` and edit the variables at the top:

  ```python
  # Set the model file you want to test
  MODEL_PATH = "/models/phi-3-mini-4k-instruct-q4_K_M.gguf"
  # ...
  # Set n_threads to match your SIM_CPU_CORES (e.g., "2.0" -> 2)
  LLM_PARAMS = {
      "n_threads": 2,
      "n_gpu_layers": 0,
  }
  ```
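The two latency metrics the benchmark reports, TTFT and TPS, can be computed from timestamps taken while consuming a streamed token iterator. A minimal sketch of that calculation follows; the function and generator names are illustrative, not the actual code in `benchmark.py`:

```python
import time


def measure_stream(token_stream):
    """Consume a token iterator and compute TTFT and TPS.

    TTFT (time to first token): seconds from the start of the request
    until the first token arrives.
    TPS (tokens per second): tokens generated after the first one,
    divided by the time spent generating them.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # first token has arrived
        count += 1
    end = time.perf_counter()

    ttft = (first_token_at - start) if first_token_at is not None else None
    gen_time = (end - first_token_at) if first_token_at is not None else 0.0
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return {"ttft_s": ttft, "tps": tps, "tokens": count}
```

With llama-cpp-python, the iterator would come from calling the model with `stream=True`; RAM usage is measured separately (e.g., by sampling process memory with `psutil`).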
Now, just run the single script from your terminal.
On Linux/macOS, first make the script executable (you only need to do this once):

```sh
chmod +x run_test.sh
```
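Once the container is running, you can confirm that the simulated limits actually took effect by reading the cgroup files from inside it. A hedged sketch, assuming cgroup v2 (the file path differs under cgroup v1, and the helper name is illustrative):

```python
def read_memory_limit(path: str = "/sys/fs/cgroup/memory.max") -> str:
    """Return the container's memory limit in GiB, or 'unlimited'.

    Under cgroup v2, Docker writes the --memory limit in bytes to
    memory.max; the literal string "max" means no limit was set.
    """
    with open(path) as f:
        raw = f.read().strip()
    if raw == "max":
        return "unlimited"
    return f"{int(raw) / 2**30:.1f} GiB"
```

For a `SIM_MEMORY="3g"` run, this should report roughly `3.0 GiB`; if it reports `unlimited`, the limit flag never reached Docker.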