High-performance, private AI on the edge. Run world-class models like Google DeepMind's Gemma 4 locally on your own Android hardware. Powered by Google's LiteRT with OpenAI-compatible API access.
Demo Video: Local LLM Server Gemma 4 E2B on Samsung Galaxy S22+ with OpenChat Client (click image to play)
Transform your Android phone into a high-performance, private AI server. This project was born from a mission to give a more than four-year-old Samsung Galaxy S22+ a "second life" by repurposing its powerful GPU for dedicated local inference.
Outsource your AI workload to a dedicated mobile server. Instead of competing for VRAM on your primary workstation or paying for cloud subscriptions, you can leverage the high-efficiency silicon in flagship smartphones.
- Re-use old flagships: Older high-end devices make incredible AI servers. Even devices with cracked or damaged displays are perfect for this role.
- Superior cost & power efficiency: Running models on a smartphone is significantly more power-efficient than using a desktop GPU. While an NVIDIA card might draw 300-450 W, a smartphone performs the same role at a fraction of that power (typically under 15 W), making it much cheaper to run 24/7.
- Easy hardware access: High-end GPUs can be expensive and hard to source. In contrast, used flagship smartphones are affordable and widely available on the second-hand market.
- Absolute Privacy: Chat with state-of-the-art models like Google DeepMind's Gemma 4 entirely on-device. Your data never leaves your local network.
- AI Sovereignty: Own your intelligence. By running models locally on your own hardware, you achieve complete independence from cloud provider terms of service, arbitrary pricing changes, or unexpected outages.
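To put the efficiency point in perspective, here is an illustrative back-of-envelope comparison. The wattages (15 W phone, 400 W desktop GPU) are the rough sustained-draw figures from the list above, not measurements:

```python
def daily_kwh(watts: float) -> float:
    """Energy consumed per day (kWh) at a constant power draw."""
    return watts * 24 / 1000

# Rough sustained draws: smartphone vs. desktop GPU running 24/7
phone = daily_kwh(15)   # 0.36 kWh/day
desktop = daily_kwh(400)  # 9.6 kWh/day
print(f"phone: {phone:.2f} kWh/day, desktop GPU: {desktop:.1f} kWh/day "
      f"(~{desktop / phone:.0f}x)")
```

At typical residential electricity rates, that difference adds up quickly for an always-on server.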
Access the server via an OpenAI-compatible API from any device on your local WiFi network.
No cloud. No subscriptions. Your data never leaves your device.
- Smartphone with a modern GPU
- 4GB+ RAM (8GB recommended)
- 3-5GB free storage per model
- Android 16+ (API 36+)
- WiFi connection
Download the APK from Releases and install it on your device.
Note the server URL (e.g., http://10.0.2.15:8080).
From any device on the same WiFi network:
```bash
# Replace 10.0.2.15 with your device's IP
curl http://10.0.2.15:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Python (OpenAI SDK):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://10.0.2.15:8080/v1",  # Your device IP
    api_key="not-needed"
)
response = client.chat.completions.create(
    model="gemma-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

| Model | Size | RAM Needed | Speed |
|---|---|---|---|
| Google DeepMind Gemma 4 E2B | 2.4 GB | 4GB+ | Fast (recommended) |
| Google DeepMind Gemma 4 E4B | 3.4 GB | 8GB+ | Slower, better quality |
For a quick overview of what's possible with Gemma 4, check out this introductory video.
Running LLMs on Android presents unique challenges compared to desktop/server environments due to limited power, thermal constraints, and memory bandwidth.
Total time to execute api/examples.sh against the Local LLM Server on:
| Device | Chipset | CPU | GPU | Total Time |
|---|---|---|---|---|
| Samsung Galaxy S26 Ultra | Qualcomm SM8850-1-AD Snapdragon 8 Elite Gen 5 (3 nm) | Octa-core (2x4.74 GHz Oryon V3 Phoenix L + 6x3.62 GHz Oryon V3 Phoenix M) | Adreno 840 (1.3GHz) | 1:21.40 |
| Samsung Galaxy Tab S10+ (SM-X820) | Mediatek Dimensity 9300+ (4 nm) | Octa-core (1x3.4 GHz Cortex-X4 & 3x2.8 GHz Cortex-X4 & 4x2.0 GHz Cortex-A720) | Immortalis-G720 MC12 | 2:17.49 |
| Samsung Galaxy S22+ (Europe) | Exynos 2200 (4 nm) | Octa-core (1x2.8 GHz Cortex-X2 & 3x2.50 GHz Cortex-A710 & 4x1.8 GHz Cortex-A510) | Xclipse 920 | 3:16.91 |
| Platform (Device) | Backend | Prefill | Decode | TTFT | Peak Memory |
|---|---|---|---|---|---|
| Android (S26 Ultra) | GPU | 3,808 tk/s | 52 tk/s | 0.3s | 676 MB |
| macOS (MacBook Pro M4) | GPU | 7,835 tk/s | 160 tk/s | 0.1s | 1,623 MB |
| Linux (RTX 4090) | GPU | 11,234 tk/s | 143 tk/s | 0.1s | 913 MB |
TTFT: Time to First Token. Benchmarks based on the 2.58 GB Gemma-4-E2B model (Source).
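As a back-of-envelope check against the table above, end-to-end generation time splits into a prefill phase (processing the prompt) and a decode phase (emitting tokens). This sketch ignores fixed TTFT overhead; the token counts are hypothetical:

```python
def total_latency(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Rough end-to-end latency in seconds: prefill time + decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# S26 Ultra figures from the table: 3,808 tk/s prefill, 52 tk/s decode
t = total_latency(prompt_tokens=500, output_tokens=200,
                  prefill_tps=3808, decode_tps=52)
print(f"{t:.1f} s")  # ≈ 4.0 s
```

Note that decode speed dominates for long outputs, which is why the slower decode rate on mobile matters more than the prefill gap.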
- Screen stays on (prevents GPU throttling)
- CPU always runs at full speed
- Battery drains 2-5x faster than normal
- Device gets warm/hot during use
Recommendation: Keep device plugged in during use, but disable Fast Charging. High CPU/GPU usage combined with rapid charging generates significant heat, which may lead to thermal throttling or overheating.
Endpoint: http://<device-ip>:8080/v1/chat/completions
Compatible with: OpenAI Python SDK, LangChain, any OpenAI-compatible client
Authentication: None (local network only)
Streaming: Supported (set "stream": true)
Sampling Config (Fixed):
- Temperature: 1.0
- Top-P: 0.95
- Top-K: 64
The model always generates complete responses. Parameters such as max_tokens and temperature from the OpenAI API are ignored for consistency and performance.
Explore the Detailed API Documentation & Examples for comprehensive endpoint specifications and advanced integration guides.
- Ensure the server is running (check the dashboard)
- Use the correct IP shown on the dashboard
- Connect from a device on the same WiFi network
- Check that the model is downloaded
- Ensure device has 4GB+ free RAM
- Close other apps to free memory
- Budget devices are naturally slower
- Use smaller model (Gemma 4 E2B)
- Ensure battery optimization is disabled
This is expected. The app runs at maximum performance. Keep the device plugged in.
Local network only. The server binds exclusively to private IP addresses (192.168.x.x, 10.x.x.x) and cannot be accessed from the internet.
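The binding rule above can be sanity-checked with Python's standard ipaddress module (illustrative only; the app itself implements this in Kotlin):

```python
import ipaddress

def is_private(ip: str) -> bool:
    """True for private-range addresses (RFC 1918) the server would bind to."""
    return ipaddress.ip_address(ip).is_private

print(is_private("192.168.1.42"))  # True  -> server reachable here
print(is_private("10.0.2.15"))     # True
print(is_private("8.8.8.8"))       # False -> public, never bound
```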
No authentication. Protected by network isolation. Secure your WiFi.
No limits. No rate limiting, prompt length limits, or token caps. Your hardware determines limits naturally.
Built entirely in Kotlin, this project leverages Jetpack Compose for the UI, Ktor for the HTTP server, and LiteRT for the inference engine. Notably, AI assistance was used heavily in the development of this project, assisting in everything from architectural decisions to implementation.
See CONTRIBUTING.md for development setup.
See ARCHITECTURE.md for technical details.
This project is licensed under the Apache License 2.0.
- Download Badge: Generated using graphics from Logowik.
- Social Preview: Created with Nano Banana using source imagery from this YouTube video.
This project is an independent Open Source initiative and is not affiliated with, endorsed by, or associated with Google, DeepMind, Hugging Face, or OpenAI. All trademarks and registered trademarks are the property of their respective owners.



