An evaluation of 43 open source language models across four distinct tasks: creative writing, logical reasoning, counterfactual causality, and programming. This benchmark aims to provide practical insights into current open source LLM capabilities and performance characteristics.
This benchmark examines where open source language models currently stand, looking not just at raw generation speed (tokens/second) but at actual task completion effectiveness. A key insight from this work is that the fastest-generating model is not necessarily the fastest at reaching correct answers: models that over-reason or are overly verbose can take longer overall despite higher token generation rates.
- Hardware: Apple M4 Max with 128GB unified memory
- Software: Ollama 0.9.0 (`ollama serve`)
- Quantization: All models tested in q4_K_M quantization (4-bit) unless specified
- Context Length: Extended context window (`OLLAMA_CONTEXT_LENGTH=100000`) to handle complex reasoning tasks
- Special Case: Qwen3-235B-A22B tested with MLX 3-bit quantization due to memory constraints
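For reference, the serving setup can be reproduced roughly as follows. This is a minimal sketch rather than the exact harness used for the benchmark, and it assumes the `ollama` binary is installed and on the PATH.

```python
# Sketch of the serving setup described above (not the exact benchmark harness):
# launch Ollama with the extended context window, then run prompts against the
# local API (localhost:11434 by default).
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_CONTEXT_LENGTH"] = "100000"  # extended context for complex reasoning tasks

server = subprocess.Popen(["ollama", "serve"], env=env)  # runs in the background
```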
- Cogito: 3B, 8B, 14B, 32B, 70B (5 models)
- Gemma3: 1B, 4B, 12B, 27B (4 models)
- Granite3.3: 2B, 8B (2 models)
- Qwen3: 0.6B, 1.7B, 8B, 32B, 30B-A3B, 235B-A22B (6 models)
- Llama3/4: 3.2:1B, 3.2:3B, 3.1:8B, 3.3:70B, 4:17B-Scout (5 models)
- Coding Models: Qwen2.5-Coder (0.5B-32B), CodeGemma, Codestral, Devstral, DeepCoder (8 models)
- Reasoning Models: DeepSeek-R1 (1.5B, 8B, 32B variants), Phi4 (Mini, Reasoning variants) (9 models)
- Other: Mistral 7B (1 model)
1. Creative Writing (Part 1)
Task: Generate a 5-sentence short story
Evaluation: Coherence, creativity, adherence to length requirement, narrative structure
2. Logic Puzzle (Part 2)
Task: Solve a deceptive riddle requiring careful logical reasoning
Evaluation: Correct answer identification, reasoning quality, avoidance of common logical traps
3. Counterfactual Causality (Part 3)
Task: Analyze a scenario involving counterfactual reasoning about causation
Evaluation: Understanding of causal relationships, ability to reason about hypothetical scenarios
4. Python Programming (Part 4)
Task: Generate a complete 3D physics simulation (bouncing ball with gravity)
Evaluation: Code correctness, execution success, physics accuracy, code quality
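To make "physics accuracy" concrete, the following is a minimal sketch (my own illustration, not any model's output) of the physics core a correct Part 4 solution needs: constant gravity, fixed-step integration, and a floor bounce with energy loss. Rendering is omitted.

```python
# Minimal physics core for a bouncing ball under gravity (illustration only).
GRAVITY = -9.81      # m/s^2, acting on the z axis
RESTITUTION = 0.8    # fraction of speed kept after each bounce
DT = 0.01            # integration time step in seconds

def step(position, velocity):
    """Advance the ball one time step using explicit Euler integration."""
    x, y, z = position
    vx, vy, vz = velocity
    vz += GRAVITY * DT                     # gravity changes vertical velocity
    x, y, z = x + vx * DT, y + vy * DT, z + vz * DT
    if z <= 0.0 and vz < 0.0:              # hit the floor while moving down
        z = 0.0
        vz = -vz * RESTITUTION             # bounce with energy loss
    return (x, y, z), (vx, vy, vz)

position, velocity = (0.0, 0.0, 5.0), (1.0, 0.5, 0.0)
for _ in range(1000):
    position, velocity = step(position, velocity)
```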
Each test captures detailed performance metrics:
| Metric | Description |
|---|---|
| Total Duration | Complete operation time (loading + processing + generation) |
| Load Duration | Model initialization time |
| Prompt Eval Count/Rate | Input processing tokens and speed |
| Eval Count/Rate | Output generation tokens and speed |
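These metrics map directly onto the fields returned by Ollama's `/api/generate` endpoint, where durations are reported in nanoseconds. The sketch below shows one way to pull them per run; the model tag and prompt are placeholders, not the exact benchmark inputs.

```python
# Sketch: collect the metrics above from Ollama's /api/generate endpoint.
# Assumes `ollama serve` is running locally; all durations arrive in nanoseconds.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:8b",                          # placeholder model tag
        "prompt": "Write a 5-sentence short story.",  # placeholder prompt
        "stream": False,
    },
).json()

total_s = resp["total_duration"] / 1e9
load_s = resp["load_duration"] / 1e9
prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
eval_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"total {total_s:.1f}s | load {load_s:.1f}s | "
      f"prompt {prompt_rate:.1f} tok/s | eval {eval_rate:.1f} tok/s")
```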
- Task Completion: Did the model fulfill the specific requirements?
- Accuracy: Was the response factually/logically correct?
- Efficiency: How quickly did the model arrive at a correct solution?
- Quality: Overall response quality and coherence
- Larger models don't always perform better on specific tasks
- Parameter count correlates weakly with task-specific performance
- Specialized models often outperform general-purpose models in their domain
- Token Generation Speed ≠ Task Completion Speed
- Models with extensive reasoning chains can be slower despite high tok/s rates
- Verbose models may appear productive but take longer to reach conclusions
- Concise, accurate responses often indicate better practical performance
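A back-of-the-envelope illustration of this point, using hypothetical numbers rather than measured results:

```python
# Hypothetical numbers: a verbose reasoner vs. a concise responder.
verbose_tok_per_s, verbose_tokens = 60.0, 2400   # long chain-of-thought
concise_tok_per_s, concise_tokens = 35.0, 350    # direct answer

print(verbose_tokens / verbose_tok_per_s)   # 40.0 s to reach a conclusion
print(concise_tokens / concise_tok_per_s)   # 10.0 s despite a lower tok/s rate
```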
- Multi-modal models may have different performance characteristics due to diverse training data
- Reasoning-specialized models show improved performance on logical tasks but may over-analyze simple problems
- Code-specialized models excel at programming but may struggle with general reasoning
- Task-Specific Performance matters more than general benchmarks
- Multi-Agent Systems using different specialized models may be optimal
- Context Requirements significantly impact performance and should be considered
- Resource Constraints (memory, inference time) are practical limiting factors
- Quality doesn't always correlate with response length
- Different architectures excel at different task types
- Quantization impacts should be evaluated per use case
- Local deployment considerations (hardware, memory) affect model choice
├── README.md # This file
├── LICENSE # MIT License
├── benchmark.md # Summary and methodology
├── benchmark-part1.md # Creative writing results
├── benchmark-part2-light.md # Logic puzzle results
├── benchmark-part3.md # Counterfactual reasoning results
├── benchmark-part4-light.md # Programming task results
├── mlx_chat.py # MLX model testing utility
└── codes/ # Generated code samples
├── [model-name].py # Programming task outputs
└── performance_data.py # Performance analysis
For models that couldn't run with Ollama due to memory constraints (specifically Qwen3-235B-A22B), I provide a specialized MLX testing utility. This Python script enables testing of MLX-optimized models on Apple Silicon:
Features:
- Interactive chat interface for MLX models
- Configurable thinking mode and token budgets
- Support for large models (tested with 235B parameters)
- Real-time response streaming
- Conversation history management
Usage:
# Test with default Qwen3-30B-A3B model
python mlx_chat.py
# Test with larger model (requires ~128GB RAM)
python mlx_chat.py --model mlx-community/Qwen3-235B-A22B-3bit
# Disable thinking mode for direct responses
python mlx_chat.py --thinking-budget 0
Requirements:
- Apple Silicon Mac with sufficient memory
- MLX framework:
pip install mlx-lm
- For 235B model: 128GB unified memory recommended
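Internally, the utility builds on mlx-lm's load/generate interface. The following is a minimal sketch of that pattern, not the full script; the model tag is illustrative.

```python
# Minimal sketch of loading and querying an MLX-quantized model with mlx-lm.
# Assumes `pip install mlx-lm` on an Apple Silicon Mac with enough memory.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")  # illustrative model tag
prompt = "Write a 5-sentence short story."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(response)
```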
Note: MLX models are optimized for Apple Silicon and ran roughly 20% faster than their Ollama equivalents in this setup, so keep this offset in mind when comparing performance metrics across the two frameworks.
- Context Window Size: Critical for complex reasoning tasks—many models perform significantly better with extended context
- Quantization Effects: 4-bit quantization used throughout for consistency, but performance may vary with different quantization levels
- Hardware Specificity: Results obtained on Apple Silicon; performance may differ on other architectures
- Model Versions: Specific model versions and quantizations tested—results may not generalize to other versions
This benchmark was made possible by the incredible work of the open source community:
I extend my gratitude to all the organizations and researchers who have made their language models freely available:
- Alibaba (Qwen series)
- Google (Gemma series)
- Meta (Llama series)
- Microsoft (Phi series)
- IBM (Granite series)
- Mistral AI (Mistral, Codestral, Devstral)
- DeepSeek (DeepSeek-R1 series)
- DeepCogito (Cogito series)
- And all other contributors to the open source LLM ecosystem
- Ollama - For providing an excellent local LLM serving platform that made testing 40+ models seamless
- Apple MLX - For the MLX framework enabling efficient inference on Apple Silicon
- MLX-LM - For the high-level MLX interface used in our custom testing utility
- Hugging Face - For hosting and distributing the quantized models
- Apple - For the M4 Max chip and unified memory architecture that enabled testing of large models locally
- The open source AI community's collaborative spirit makes research like this possible. These benchmarks aim to contribute back to the community by providing practical performance insights.
- Claude 4 Sonnet, for its valuable assistance in analyzing benchmark results, identifying data inconsistencies, and helping to structure comprehensive documentation
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this benchmark in your research or analysis, please reference this repository and note the specific testing conditions (hardware, software versions, quantization levels) as they significantly impact results.
This benchmark provides a snapshot of open source LLM capabilities as of the testing date. The rapidly evolving nature of this field means results should be interpreted within their temporal and technical context.