A Kotlin/Native port of llama.cpp - Inference of Meta's LLaMA model (and others) in pure Kotlin
Maintained by Sydney (The Solace Project)
This project is a Kotlin/Native port of the original llama.cpp created by Georgi Gerganov and the amazing open source community. We extend our deepest gratitude to all the original contributors who made the foundational work possible. Please see the AUTHORS file for the complete list of contributors to the original project.
Original llama.cpp contributors deserve special recognition for their groundbreaking work in making LLM inference efficient and accessible.
This is a Kotlin/Native implementation of llama.cpp, designed to bring the power of large language model inference to the Kotlin ecosystem with a focus on:
- Optimized CPU backend built on deterministic klang primitives
- Idiomatic Kotlin API while maintaining compatibility with original concepts
- Memory-efficient tensor operations adapted for Kotlin/Native's memory model
- Comprehensive quantization support (Q8_0, Q4_0, Q4_1, BitNet 1.58, and K-Quant types Q2_K, Q3_K, Q4_K, Q5_K, Q8_K; Q6_K in progress)
- Automatic differentiation for training and fine-tuning capabilities
- kcoro interop is now gated behind `expect`/`actual` wrappers: non-Apple targets receive inert stubs, while macOS/arm64 keeps the native bindings. `./gradlew build -x macosArm64Test` succeeds across JVM/JS/native targets. Running `./gradlew macosArm64Test` currently fails: legacy llama.cpp-style tensor/model tests (BitNet 1.58, GGML integration, inference pipeline) expect archived fixtures and need to be reworked for the Kotlin port. The kcoro benchmark itself is opt-in: set `ENABLE_KCORO_BENCH=1` to exercise it on Apple Silicon.
- The legacy `archive/` and `examples/` trees from the original llama.cpp have been removed. Kotlin sources now depend solely on the vendored `external/kcoro` and `external/kcoro_cpp` runtimes and the rewritten Kotlin/Native code.
- macOS-specific kcoro benchmarks were moved to `src/macosArm64Test`. Other target test suites execute normally (JS/JVM/Native metadata), while kcoro tests are skipped when interop is unavailable.
The project is actively under development with substantial progress across multiple phases:
- Memory Management: Advanced tensor allocation with `GGMLGraphAllocator` and `GGMLDynTensorAllocator`
- Primary ByteArray buffer with dynamic allocation within reserved space
- Inplace tensor allocation and memory reuse logic for optimization
- Tensor usage tracking and automatic memory freeing
- Graph-level memory planning with comprehensive allocation strategies
- Tensor Data Access: Comprehensive accessor methods for all supported data types
- F32, F16, I32, I16 data accessors with stride information
- Efficient ByteArray-based data storage and retrieval
- Multi-dimensional tensor indexing with proper stride calculations (see the accessor sketch below)
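To make the stride-based indexing concrete, here is a minimal sketch of how an F32 value can be read out of a ByteArray-backed tensor. The `FloatTensorView`, `ne`, and `nb` names are illustrative placeholders following ggml conventions, not the actual accessors in `GGMLTypes.kt`.

```kotlin
// Illustrative sketch only: `ne` (element counts) and `nb` (byte strides) follow
// ggml conventions; the real accessor API in the port may differ.
class FloatTensorView(
    val data: ByteArray,
    val ne: LongArray,   // element counts per dimension
    val nb: LongArray,   // byte strides per dimension (nb[0] == 4 for F32)
) {
    fun getF32(i0: Int, i1: Int = 0, i2: Int = 0, i3: Int = 0): Float {
        // Byte offset is the stride-weighted sum of the indices.
        val offset = (i0 * nb[0] + i1 * nb[1] + i2 * nb[2] + i3 * nb[3]).toInt()
        // Reassemble a little-endian Float from four bytes.
        val bits = (data[offset].toInt() and 0xFF) or
                   ((data[offset + 1].toInt() and 0xFF) shl 8) or
                   ((data[offset + 2].toInt() and 0xFF) shl 16) or
                   ((data[offset + 3].toInt() and 0xFF) shl 24)
        return Float.fromBits(bits)
    }
}
```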
- Advanced Quantization Support: Comprehensive quantization ecosystem implemented
- Q8_0: F16 scale + 32xI8 weights (34 bytes per block)
- Q4_0: F16 scale + 32x4-bit packed weights (18 bytes per block)
- Q4_1: 2x F16 scale/min + 32x4-bit packed weights (20 bytes per block)
- BitNet 1.58: Ternary quantization with F16 scale + base-3 packing (10 bytes per block)
- Optimized dot product routines for all quantized operations
- Direct quantized-to-quantized operations (Q×Q) avoiding expensive dequantization (block-size arithmetic sketched below)
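The per-block byte counts listed above fall directly out of the layouts (32 weights per block, 2-byte F16 scale); a quick sanity check:

```kotlin
// Back-of-the-envelope check of the block sizes listed above (32 weights per block).
fun main() {
    val f16Bytes = 2
    val q8_0 = f16Bytes + 32 * 1         // scale + 32 x int8         = 34 bytes
    val q4_0 = f16Bytes + 32 / 2         // scale + 32 x 4-bit packed = 18 bytes
    val q4_1 = 2 * f16Bytes + 32 / 2     // scale + min + 32 x 4-bit  = 20 bytes
    println("Q8_0=$q8_0  Q4_0=$q4_0  Q4_1=$q4_1") // 34, 18, 20
}
```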
- Matrix Multiplication Optimizations: Complete optimization coverage for all quantization combinations
- Symmetric F32×Q_type optimizations (2-3x speedup)
- Direct Q_type×Q_type operations (3-5x speedup)
- Mixed quantized operations (Q8_0×Q4_0, etc.)
- Performance validation with comprehensive benchmarking suite
- Core Tensor Operations: Element-wise and matrix operations with multi-type support
- ADD, MUL, SUB, DIV, NEG, SQR, SQRT, MatMul for F32/F16 and quantized types
- Activation functions: RELU, GELU, SILU, RMSNorm (RMSNorm sketched after this list)
- Broadcasting and dimension validation with proper error handling
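For reference, RMSNorm scales each element by the reciprocal root-mean-square of the vector. A minimal FloatArray sketch is shown below; the actual kernel in `GGMLComputeOps.kt` operates on tensor data with strides rather than a plain array, and the epsilon value here is an illustrative default.

```kotlin
import kotlin.math.sqrt

// Reference-style RMSNorm on a plain FloatArray: y[i] = x[i] / sqrt(mean(x^2) + eps)
fun rmsNorm(x: FloatArray, eps: Float = 1e-5f): FloatArray {
    val meanSquare = x.fold(0.0) { acc, v -> acc + v * v } / x.size
    val scale = (1.0 / sqrt(meanSquare + eps)).toFloat()
    return FloatArray(x.size) { i -> x[i] * scale }
}
```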
- GGUF File Format Support: Complete model loading pipeline
- Binary GGUF file parsing with metadata extraction
- Tensor loading and integration with existing tensor system
- Model validation through forward pass testing
- Support for all standard GGUF data types and structures (fixed header layout sketched below)
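As orientation for the binary parsing step, the sketch below reads only the fixed-size GGUF header (little-endian magic, version, tensor count, metadata key/value count). The port's actual loader also walks the metadata section and tensor descriptors; the names here are illustrative, not the real API.

```kotlin
// Minimal sketch of the fixed GGUF header; everything after these 24 bytes
// (metadata key/value pairs, tensor infos) is handled by the real parser.
data class GgufHeader(val version: UInt, val tensorCount: ULong, val metadataKvCount: ULong)

fun parseGgufHeader(bytes: ByteArray): GgufHeader {
    fun u32(at: Int): UInt = (0 until 4).fold(0u) { acc, i ->
        acc or ((bytes[at + i].toUInt() and 0xFFu) shl (8 * i))
    }
    fun u64(at: Int): ULong = (0 until 8).fold(0uL) { acc, i ->
        acc or ((bytes[at + i].toULong() and 0xFFuL) shl (8 * i))
    }
    require(bytes.decodeToString(0, 4) == "GGUF") { "Not a GGUF file" }
    return GgufHeader(version = u32(4), tensorCount = u64(8), metadataKvCount = u64(16))
}
```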
- Automatic Differentiation: Backward pass implementation for core operations
- ADD, SUB, MUL, NEG, DIV, SQR, SQRT operations
- RELU, GELU activation functions
- MUL_MAT (matrix multiplication)
- SUM, MEAN, REPEAT operations (element-wise backward rules sketched below)
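The backward rules for the element-wise cases are plain chain-rule accumulation. The sketch below uses FloatArrays and hypothetical function names purely for illustration; the port applies the same rules to tensors during the graph's backward pass.

```kotlin
// For c = a + b: dL/da += dL/dc and dL/db += dL/dc.
fun addBackward(gradOut: FloatArray, gradA: FloatArray, gradB: FloatArray) {
    for (i in gradOut.indices) {
        gradA[i] += gradOut[i]
        gradB[i] += gradOut[i]
    }
}

// For c = a * b (element-wise): dL/da += dL/dc * b and dL/db += dL/dc * a.
fun mulBackward(a: FloatArray, b: FloatArray, gradOut: FloatArray, gradA: FloatArray, gradB: FloatArray) {
    for (i in gradOut.indices) {
        gradA[i] += gradOut[i] * b[i]
        gradB[i] += gradOut[i] * a[i]
    }
}
```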
- Compute Operations Architecture: Major refactor to destination-based operations
- All compute functions write directly into pre-allocated destination tensors
- Function signatures changed from `computeAdd(...): GGMLTensor` to `computeAdd(..., dst: GGMLTensor)`
- Eliminated memory allocation within compute operations for improved efficiency
- Aligned with GGML architecture patterns for memory reuse and graph optimization
- Additional Quantization: K-Quant types (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K)
- CPU Backend Formalization: Structured CPU backend with multi-threading support
- Model Architecture Implementation: LLaMA model structures and inference pipeline
- Unit Tests: Complete coverage for core operations under `src/commonTest/kotlin`
- Element-wise operations (ADD, MUL, SUB, DIV, NEG, SQR, SQRT)
- Matrix operations with all quantization combinations
- Activation functions (RELU, GELU, SILU, RMSNorm)
- Error handling and edge cases (scalar tensors, dimension mismatches)
- Quantization Accuracy Tests: Rigorous validation with standardized metrics
- MSE, RMSE, MAD, SNR analysis for all quantization types (metric definitions sketched after this list)
- Reference-aligned error thresholds based on upstream llama.cpp
- Synthetic, random, and edge case test vectors
- Cross-quantization performance validation
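A minimal sketch of the four metrics, computed between the original F32 data and its dequantized round trip; the pass/fail thresholds themselves live in the test suite and follow upstream llama.cpp. The helper name and data class are hypothetical.

```kotlin
import kotlin.math.abs
import kotlin.math.log10
import kotlin.math.sqrt

data class QuantError(val mse: Double, val rmse: Double, val mad: Double, val snrDb: Double)

// Compares original values against their quantize -> dequantize round trip.
fun quantError(original: FloatArray, dequantized: FloatArray): QuantError {
    var sqErr = 0.0; var absErr = 0.0; var signal = 0.0
    for (i in original.indices) {
        val d = (original[i] - dequantized[i]).toDouble()
        sqErr += d * d
        absErr += abs(d)
        signal += original[i].toDouble() * original[i]
    }
    val mse = sqErr / original.size
    // SNR in dB: 10 * log10(signal power / noise power)
    return QuantError(mse, sqrt(mse), absErr / original.size, 10 * log10(signal / sqErr))
}
```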
- Performance Benchmarking: Comprehensive performance validation suite
- Throughput (MB/s) and operations-per-second metrics
- Matrix multiplication optimization validation (2-5x speedups achieved)
- Memory allocation efficiency testing
- Scalability analysis across data sizes
- Integration Tests: End-to-end workflow validation
- Complex computation chains and mathematical expressions
- Mixed precision workflows (F32 ↔ F16)
- Memory stress testing with large-scale operations
- GGUF model loading and forward pass validation
- Reference Validation Framework: Mathematical accuracy assurance
- Analytical reference validation against known results
- Numerical stability and boundary condition testing
- Cross-precision consistency validation
- Regression baseline prevention system
- Destination-based Operations Test Suite: Comprehensive validation of architecture refactor
- In-place computation interface validation (`GGMLComputeOpsDestinationTest.kt`)
- Dimension and type mismatch error handling
- Direct integration with graph allocator memory management
For detailed development information, see:
- KOTLIN_PORT_CHECKLIST.md - Detailed development roadmap with current progress
- KOTLIN_PORT_STATUS.md - Overall project status and completion overview
- COMPUTE_OPERATIONS_REFACTOR_SUMMARY.md - Major architectural refactor details
- MATMUL_OPTIMIZATION_SUMMARY.md - Comprehensive matrix multiplication optimizations
- GGUF_IMPLEMENTATION.md - GGUF file format support implementation
- GGML_TESTING_SUMMARY.md - Comprehensive testing infrastructure documentation
- GGML_COMPUTE_OPS_DESIGN.md - Technical design for computation operations
- TENSOR_OPERATIONS_DESIGN.md - Design patterns for tensor operations
- AGENTS.md - Project instructions for contributors and agents
This project uses Kotlin Multiplatform with Gradle as the build system.
- Kotlin/Native compiler and tools
- Gradle 8.13 or later
- macOS (current target platform - x64 and arm64)
```
./gradlew build            # Build the project
./gradlew allTests         # Run all tests for available targets

# Or run tests for a specific target:
./gradlew linuxX64Test     # Linux tests
./gradlew macosX64Test     # macOS tests (when on macOS)
./gradlew kcoroBench       # macOS arm64 only
```

Note: Legacy llama.cpp C/C++ and Python examples were removed in October 2025; only files required by the Kotlin/Native port remain. Expect kcoro and ggml headers to live under `external/`, `spm-headers/`, and `src/nativeInterop/`.
- Source Directory: `src/nativeMain/kotlin/ai/solace/llamakotlin`
- Package Structure: `ai.solace.llamakotlin.*`
- Core Modules:
  - `core/GGMLTypes.kt` - Core tensor data structures
  - `core/GGMLAlloc.kt` - Memory management
  - `core/GGMLOps.kt` - High-level tensor operations
  - `core/GGMLComputeOps.kt` - Low-level computation kernels
  - `core/GGMLGraph.kt` - Computation graph execution
  - `core/GGMLQuants.kt` - Quantization implementations
- macOS (x64 and arm64) - Primary target for Kotlin/Native builds
- Linux (x64) - Supported for CPU backend development and testing
- Windows (mingw-x64) - Basic support for development
- Other Kotlin/Native targets - Planned for future releases
As the Kotlin port develops, we plan to support the same range of models as the original llama.cpp:
Large Language Models:
- LLaMA / LLaMA 2 / LLaMA 3 🦙
- Mistral 7B / Mixtral MoE
- GPT-2, Phi models, Gemma
- And many more from the original llama.cpp ecosystem
Multimodal Models (Future):
- LLaVA models
- Other vision-language models
Model support will be added progressively as the core Kotlin implementation matures.
We welcome contributions to the Kotlin port! Here's how you can help:
- Core tensor operations and optimization
- Quantization methods implementation
- CPU backend development and optimization
- Future accelerator hooks exposed via backend registry
- Testing and validation against original implementations
- Documentation and examples
- Kotlin Style: Use idiomatic Kotlin with descriptive names and comprehensive KDoc comments
- Modular Design: Separate tensor creation logic from compute kernels
- Memory Efficiency: Use ByteArray-based storage with accessor methods
- Type Safety: Leverage Kotlin's type system for compile-time safety
- Testing: Include comprehensive tests for new functionality
- Check the current progress in KOTLIN_PORT_CHECKLIST.md
- Review the AGENTS.md file for detailed development guidance
- Look at existing implementations in `src/nativeMain/kotlin/ai/solace/llamakotlin/`
- Add tests in `src/commonTest/kotlin/`
This project maintains conceptual compatibility with the original llama.cpp while providing a Kotlin/Native implementation.
- Same core concepts: Tensor operations, quantization methods, and model architectures
- Compatible file formats: Plans to support GGUF and other formats from the original
- Independent development: Optimized for Kotlin/Native's strengths and constraints
- Complementary goals: Different language ecosystem, same inference capabilities
Beyond the core llama.cpp port, this project explores cutting-edge hybrid architectures:
- Liquid Neural Networks Integration: Actor-based computation with adaptive time constants
- Quantum-Classical Hybrid Framework: Meta-Cognitive Temporal Architecture (MCTA) research
- Memory Cube Architecture: Advanced caching and inference optimization strategies
- Actor-Coroutine Integration: Leveraging Kotlin's concurrency model for neural computation
These are research directions for future exploration after completing the core llama.cpp port.
The following examples show the current capabilities of the Kotlin port:
```kotlin
// Basic tensor creation and operations
val tensorA = ggmlNewTensor2D(ctx, GGMLType.GGML_TYPE_F32, 4, 4)
val tensorB = ggmlNewTensor2D(ctx, GGMLType.GGML_TYPE_F32, 4, 4)
val result = ggmlAdd(ctx, tensorA, tensorB)
```

```kotlin
// Use GGMLGraphAllocator for efficient memory planning
val allocator = GGMLGraphAllocator()
allocator.reserve(graph, bufferSize)
val tensor = allocator.allocateTensor(type, dimensions) // Automatically uses inplace when possible
```

```kotlin
// New efficient destination-based compute operations
val src0 = allocator.allocateTensor(GGMLType.F32, longArrayOf(4, 4))
val src1 = allocator.allocateTensor(GGMLType.F32, longArrayOf(4, 4))
val dst = allocator.allocateTensor(GGMLType.F32, longArrayOf(4, 4))

// Operations write directly into pre-allocated destination
computeAdd(allocator, context, src0, src1, dst) // No return value - writes to dst
```

```kotlin
// Q4_0 quantization example with optimized operations
val originalTensor = allocator.allocateTensor(GGMLType.F32, longArrayOf(32, 128))
val quantizedTensor = allocator.allocateTensor(GGMLType.Q4_0, longArrayOf(32, 128))
quantizeTensor(originalTensor, quantizedTensor) // Direct quantization

// Optimized matrix multiplication with quantized tensors (2-5x speedup)
val weights = allocator.allocateTensor(GGMLType.Q4_0, longArrayOf(128, 512))
val input = allocator.allocateTensor(GGMLType.F32, longArrayOf(32, 128))
val output = allocator.allocateTensor(GGMLType.F32, longArrayOf(32, 512))
computeMatMul(allocator, context, input, weights, output) // Uses optimized Q4_0×F32 kernel
```

```kotlin
// Load model from GGUF file
val loader = ModelLoader()
val model = loader.loadFromFile("model.gguf")

// Access loaded tensors
val embeddings = model.getTensor("token_embeddings.weight")
val attention = model.getTensor("layers.0.attention.wq.weight")

// Validate model with forward pass
val success = model.validateForwardPass()
println("Model validation: ${if (success) "PASSED" else "FAILED"}")
```

More comprehensive examples will be available as the implementation progresses.
Note: This is an active Kotlin port with substantial core infrastructure complete. GGUF model loading, comprehensive quantization support, advanced memory management, and performance optimizations are functional.
- Clone the repository: `git clone https://github.com/SolaceHarmony/llama.kotlin.git`, then `cd llama.kotlin`
- Build the project: `./gradlew build`
- Run tests: `./gradlew allTests`
Note: Network access may be required to download Gradle and Kotlin dependencies during the first build.
The project follows a structured development approach across multiple phases:
- Phase 1: Project Setup and Analysis - Complete
- Phase 2: Core Library Translation (ggml) - Complete
- Phase 3: CPU Backend Implementation - In Progress
- Phase 4: Backend Extensibility & Runtime Experiments - Planned
- Phase 5: LLaMA Model Implementation - Starting
- Phase 6: Model Loading and File Format Support - Largely Complete (GGUF)
- Phase 7: API and Applications
- Phase 8: Testing and Validation - Comprehensive Infrastructure Complete
- Phase 9: Documentation and Distribution
- Phase 10: Performance Optimization - Major Optimizations Complete
For detailed progress tracking, see KOTLIN_PORT_CHECKLIST.md.
This project is licensed under the MIT License - same as the original llama.cpp.
- Original llama.cpp: https://github.com/ggerganov/llama.cpp
- Project Documentation: AGENTS.md, KOTLIN_PORT_CHECKLIST.md
- Design Documents: GGML_COMPUTE_OPS_DESIGN.md, TENSOR_OPERATIONS_DESIGN.md
- Issues and Discussions: GitHub Issues
This project is maintained by Sydney at The Solace Project. We are grateful to the original llama.cpp community for their foundational work that makes this Kotlin port possible.
