Offline first-aid intelligence for a commodity Android phone in a disaster zone.
A fine-tuned Gemma 4 E2B (5.1B parameters, Q4_0) runs entirely on-device via llama.cpp. No internet. No cloud. No telemetry. The build target is a sub-$200 Android phone — Moto G54 5G class (ARMv8.2 Cortex-A78+A55, 4–8 GB RAM), with the same APK expected to run on commodity hardware down to a Samsung A-series, Xiaomi Redmi, Tecno Spark, Infinix Hot, or Itel A-series. Built for refugees, conflict zones, and disaster contexts where infrastructure is gone and the only responder is a layperson with the phone in their pocket.
- Track: Google Gemma 4 Good Hackathon (llama.cpp Special Tech Prize)
- Submission deadline: 2026-05-18
- Repo: github.com/ApocalypseTech00/apocalypse-aid
- Model: DestinyApocalypse/apocalypse-aid-gemma4-e2b
Google's reference platforms for Gemma 4 — Pixel 9 Pro, Snapdragon 8 Gen 3 flagships — are not the phones in a Sudanese refugee camp, a South American cartel-zone village, or a post-earthquake aid distribution line. The actual device floor in those contexts is a commodity Android: a Cortex-A55/A78 SoC, 4 GB of LPDDR4X, and a SIM with no signal. Running a 5.1B-parameter LLM on that hardware, fast enough to answer a first-aid question in under a minute, is not a matter of porting the official llama.cpp Android example — that example requires API 33 and assumes a flagship. It is a matter of three or four very specific engineering moves below the framework level.
The "~4 min → ~19 s" win is two complementary fixes, both worth understanding distinctly. A llama.cpp expert reading the code will see both; we credit both honestly.
(a) CMAKE_BUILD_TYPE=Release for the native JNI library (compile-time). Android Gradle Plugin's assembleDebug ships native code at -O0 — no SIMD, no auto-vectorization. For a SIMD-heavy LLM kernel (ggml + KleidiAI + llama.cpp) the default is catastrophic. Measured on Moto G54 (2026-05-16): the same model that llama-bench runs at 7.5 tok/s prefill at RelWithDebInfo ran at 0.22 tok/s out of a Debug-built JNI. A ~30× slowdown that quietly makes developers blame Gemma 4 for "just not fitting" on commodity Android. The Kotlin side stays Debug (breakpoints, no R8); only the inference path runs at full -O3:

```kotlin
// app/build.gradle.kts
// Placement assumed: inside android { defaultConfig { ... } } or a build type block.
externalNativeBuild {
    cmake {
        // Compile the native JNI library optimized even when the Kotlin side is a debug build.
        arguments += listOf("-DCMAKE_BUILD_TYPE=Release")
    }
}
```

(b) llama-quantize --pure Q4_0 (dispatch correctness). The first GGUF we shipped had token_embd stored as Q6_K alongside the rest as Q4_0, a default that silently dropped out of KleidiAI's optimized matmul path and forced a scalar fallback for that tensor. Re-quantizing with --pure Q4_0 so every weight tensor matches the dispatch path produced a ~100× speedup on the affected matmuls. File size also dropped from 3.35 GB to 2.5 GB as a side effect.
Either fix in isolation is partial. Together they bring sustained inference from "minutes per query, unusable" to "under a minute per first-aid query" on a commodity ARM phone.
The Q4_0 GGUF is memory-mapped. On cold start the first decoded token triggers a cascade of page faults pulling every weight page from UFS storage into RAM. On Moto G54, the first inference after model load took 351 s for an 82-token prime — almost entirely disk I/O.
Solution. Before llama.cpp opens the file, the JNI bridge does its own mmap(MAP_SHARED | MAP_POPULATE) over the GGUF and immediately munmaps. MAP_POPULATE forces every page resident; the pages stay warm in the Linux page cache so llama.cpp's own mmap reads from RAM, not flash. Cost: ~5–10 s of sequential read at app launch. Pays for itself on the very first chat query.
Source: app/src/main/cpp/llm_bridge.cpp:108–149.
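The pre-warm step can be sketched roughly as follows. This is a minimal illustration, not the code at the source location cited above: the function name `prewarm_file` and its return convention are assumptions made for the example. `MAP_POPULATE` is Linux-specific, so the fallback define only keeps the sketch compiling on other systems, where nothing would be prefaulted.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MAP_POPULATE
#define MAP_POPULATE 0  // non-Linux fallback: the mapping succeeds but is not prefaulted
#endif

// Hypothetical helper: map the whole file with MAP_POPULATE so the kernel
// reads every page into the page cache, then unmap immediately.
// Returns the number of bytes mapped, or -1 on failure.
long long prewarm_file(const char* path) {
    assert(path != nullptr);

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    const std::size_t len = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    // Unmapping releases the address range; the pages themselves remain in
    // the page cache until memory pressure evicts them.
    munmap(p, len);
    return static_cast<long long>(len);
}
```

In this sketch the call would run once, before the model loader opens the same path. Whether the pages are still resident when the loader maps the file depends on memory pressure in between, so it is a best-effort warm-up rather than a guarantee.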
Every chat query re-tokenises and re-decodes the same system prompt — ~38 tokens of "You are a first-aid action card. Reply with numbered steps only…" On a Cortex-A55 at ~7.5 tok/s prefill that's 3–5 seconds of wasted prompt-eval per query. Most llama.cpp Android wrappers eat this cost silently because they treat each generate() as stateless.
Solution. After model load, primeSystemPrompt() tokenises + decodes the system prompt once and snapshots the resulting KV state via llama_state_seq_get_data(ctx, buf, size, /*seq_id=*/0). On every subsequent generate(), if the incoming prompt begins with the cached prefix, the bridge restores the KV state with llama_state_seq_set_data(...) and decodes only the tail — the RAG chunks plus the user's question. System-prompt prefill is paid once at app start, never again.
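The prefix check that decides between restoring the snapshot and decoding from scratch can be sketched on plain token vectors. Everything named here (`PrefixCache`, `DecodePlan`, `plan_decode`) is hypothetical and chosen for illustration; the actual llama.cpp calls are left out so the example stays self-contained. The sketch also assumes at least one tail token must remain, so that a fresh decode produces logits to sample from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;  // llama.cpp token ids are 32-bit integers

// Hypothetical snapshot of the primed system prompt: its tokens plus the
// serialized KV bytes (in a real bridge these would be filled by
// llama_state_seq_get_data; here the buffer is only carried along).
struct PrefixCache {
    std::vector<Token>        tokens;
    std::vector<std::uint8_t> kv_state;
};

struct DecodePlan {
    bool        used_kv_cache;  // true if the snapshot can be restored
    std::size_t tail_start;     // index of the first token that still needs decoding
    std::size_t n_tail_tokens;  // number of tokens left to prefill
};

DecodePlan plan_decode(const PrefixCache& cache, const std::vector<Token>& prompt) {
    const std::size_t n = cache.tokens.size();

    // A hit needs a non-empty cached prefix and at least one tail token.
    bool hit = n > 0 && prompt.size() > n;
    for (std::size_t i = 0; hit && i < n; ++i) {
        hit = (prompt[i] == cache.tokens[i]);
    }

    DecodePlan plan = hit ? DecodePlan{true, n, prompt.size() - n}
                          : DecodePlan{false, 0, prompt.size()};
    assert(plan.tail_start + plan.n_tail_tokens == prompt.size());
    return plan;
}
```

On a hit, a caller would restore `kv_state` and decode only `prompt[tail_start..]`; on a miss it would clear the sequence and decode the whole prompt. The two fields `used_kv_cache` and `n_tail_tokens` are the kind of values a per-query audit log line could report.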
The same architecture also enables: cooperative cancellation (atomic flag polled between tokens for a 60-second coroutine timeout), and a (used_kv_cache, n_tail_tokens) log line every query for cost auditing.
Source: app/src/main/cpp/llm_bridge.cpp:48–69, 254–301, 463–547.
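The cooperative cancellation mentioned above follows a common pattern: an atomic flag set from any thread and polled once per token. A minimal sketch, assuming a hypothetical `Generator` type and a `next_piece` callback that stands in for "decode one token and return its text" (neither is taken from the project's bridge):

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for the bridge's generation state.
struct Generator {
    std::atomic<bool> cancel_requested{false};

    // Safe to call from any thread, e.g. after a timeout fires.
    void cancel() { cancel_requested.store(true, std::memory_order_release); }

    // next_piece returns the text of one decoded token; an empty string
    // means end of generation. The flag is reset on entry, so a cancel
    // that arrives before generate() starts is ignored in this sketch.
    std::string generate(int max_tokens,
                         const std::function<std::string()>& next_piece) {
        assert(max_tokens >= 0);
        cancel_requested.store(false, std::memory_order_relaxed);

        std::string out;
        for (int i = 0; i < max_tokens; ++i) {
            // Poll between tokens: bail out within one token boundary.
            if (cancel_requested.load(std::memory_order_acquire)) break;
            std::string piece = next_piece();
            if (piece.empty()) break;
            out += piece;
        }
        return out;  // whatever was generated so far
    }
};
```

Because the loop returns normally instead of being killed, any mutex held around `generate()` is released on the usual path and the partial answer can still be shown.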
A first-aid Q&A assistant grounded in peer-reviewed primary sources (WHO IMCI/mhGAP, MSF clinical guidelines, IFRC/Red Cross, AHA 2025, ERC 2025, TCCC 2024, BMJ Open, Lancet OA, PubMed Central OA, and military medicine field manuals). Every answer cites its source. The model fits in ~3.5 GB of working memory and runs without ever opening a network socket — INTERNET is not declared in the manifest.
Architecture in one paragraph. A user types a question. The query first hits a hand-curated safety router (DoseLookup) with 79 entries / ~470 patterns covering the life-critical first-aid surface (CPR ratios, choking algorithm, anaphylaxis dose, naloxone protocol, suicide-crisis routing, paediatric weight-based dosing). On a match, the router returns a pre-vetted answer and the LLM is never invoked — the model does not get to roll dice on life-critical doses. On a miss, the query goes to a hybrid retriever (sentence-transformers MiniLM-L6-v2 dense embeddings + BM25, fused with weighted Reciprocal Rank Fusion) over a 25,173-chunk corpus memory-mapped from APK assets. The top chunk is injected into a Gemma 4 chat prompt that ends with a "1. " assistant-turn prefill (forcing the model past its trained "I'm a first-aid reference, …" preamble bias). Output passes through a defence-in-depth filter chain (dose-leak guard, repetition-loop guard, surgical sentence-level scrub for external-referral hallucinations) before reaching the user. Total round-trip on Moto G54: under 20 s for typical first-aid queries.
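The fusion step of the hybrid retriever can be illustrated with a small weighted Reciprocal Rank Fusion sketch. The function name `fuse_rrf`, the integer chunk ids, and the weights passed in are assumptions for the example; `k = 60` is the conventional RRF constant, and the source does not state the values the app uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using ChunkId = int;

// Weighted RRF: each ranked list contributes weight / (k + rank) to a
// chunk's score, with rank starting at 1. Returns chunk ids ordered by
// descending fused score (ties broken by ascending id).
std::vector<ChunkId> fuse_rrf(const std::vector<ChunkId>& dense_ranked,
                              const std::vector<ChunkId>& bm25_ranked,
                              double w_dense, double w_bm25,
                              double k = 60.0) {
    assert(k > 0.0);
    std::unordered_map<ChunkId, double> score;

    for (std::size_t i = 0; i < dense_ranked.size(); ++i)
        score[dense_ranked[i]] += w_dense / (k + static_cast<double>(i + 1));
    for (std::size_t i = 0; i < bm25_ranked.size(); ++i)
        score[bm25_ranked[i]] += w_bm25 / (k + static_cast<double>(i + 1));

    std::vector<std::pair<ChunkId, double>> items(score.begin(), score.end());
    std::sort(items.begin(), items.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;
        return a.first < b.first;  // deterministic tie-break
    });

    std::vector<ChunkId> fused;
    fused.reserve(items.size());
    for (const auto& item : items) fused.push_back(item.first);
    return fused;
}
```

With equal weights, a chunk that appears near the top of both lists outranks one that tops only a single list. The first element of the result would be the chunk injected into the prompt.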
Other engineering notes (not the headliners, but worth pointing at):
- Custom JNI for `minSdk 26` (Android 8.0). The official llama.cpp Android example requires API 33; the V1 floor includes phones from 2018. `app/src/main/cpp/llm_bridge.cpp` is a clean-room ~1,000-line bridge that exposes only what the app needs.
- AAPT compression-free assets (`noCompress += listOf("gguf", "bin", "bm25-stats")`). Without this, AAPT runs zlib-deflate over the corpus + index: install time on commodity flash inflates by 30–90 s, and worse, llama.cpp's `mmap()` returns garbage on a deflate-stream-backed asset.
- Action-verb-only system prompt + "1. " assistant-turn prefill mechanically forces the model past instruction-tuned preamble bias ("I'm an AI…", "As a first-aid reference…") into the middle of a numbered step on the first sampled token.
- A78 / A55 thread placement was the wrong fix. Pinning to the big cluster halved effective core count; the Linux EAS scheduler at `n_threads=4` was empirically faster. LPDDR4X bandwidth, not core count, is the ceiling on this class of hardware.
- `x86_64` stripped from the release AAB: every shipping target is `arm64-v8a`. Halves CMake build time and the AAB.
- Defense-in-depth Unicode sanitiser: NFKC normalisation + zero-width / bidi-override stripping + IPA small-caps homoglyph fold-map applied symmetrically in both the UI sanitiser and the safety router. Closes the homoglyph / RLO / ZWSP class of prompt-injection attacks.
- Pinned `n_ubatch = n_batch` in the llama.cpp context to avoid a `ggml_abort` SIGABRT on the chunked-prompt decode path. Default `n_ubatch = 512`; on memory-constrained phones we ship `n_batch = 768` and a smaller phone might force `n_batch < 512`, at which point the first chunked-batch path tripped an internal invariant. One-line pin, not documented in llama.cpp upstream.
- File-size shrink that's actually a RAM-budget fix. The `--pure Q4_0` requantize dropped the GGUF from 3.35 GB to 2.5 GB. On a 4 GB-RAM phone (Samsung A06, Tecno Spark 20C) where Android + OEM bloat consume ~1.5 GB and JVM/Compose another ~200 MB, that 850 MB delta is the difference between "swap thrashing + LMK reap" and "headroom for a second app to be open." Not a storage win; a working-set win.
- Co-resident embedder model in one JNI process. `app/src/main/cpp/llm_bridge.cpp` holds Gemma 4 E2B (~2.5 GB mmap) and a sentence-transformers MiniLM-L6-v2 (~20 MB) behind separate mutexes and separate `llama_context` instances. The on-device RAG retriever calls the embedder; the chat path calls Gemma. Most Android LLM apps either pre-embed corpus offline (no live query embedding) or ship two JNI bridges; one shared process saves the JNI startup cost and the duplicate `llama_backend_init`.
- Cooperative cancellation via `std::atomic<bool>` polled per-token. Kotlin's `withTimeout` can't pre-empt a thread-blocking JNI call, so a 60 s coroutine timeout would otherwise leak the generation thread forever. The flag is set by `cancelGenerate()` from any thread, the decode loop polls it between tokens with `memory_order_acquire`, and bails within ~one token boundary (≤1 s) returning whatever was generated so far. Without this, a query that hits the 60 s wall would still hold the JNI mutex while the next user message waited.
- Battery-tier `maxResponseTokens` scaling in `HardwareProfiler.kt`: 96 / 80 / 64 tokens by battery percentage. Software trade against thermal throttle on commodity SoCs: a deep-discharge phone gets a shorter (still complete) answer at lower SoC voltage, instead of a half-finished long answer the user can't act on.
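The battery-tier scaling in the last note amounts to a small step function. The sketch below is written in C++ to match the other native examples, although the project keeps this logic in Kotlin. The three token budgets come from the note above, while the two percentage thresholds are placeholders, because the source does not state where the tiers split.

```cpp
#include <cassert>

// Hypothetical thresholds: the token budgets (96 / 80 / 64) are from the
// text, but the battery percentages separating the tiers are assumed.
constexpr int kHighBatteryPct = 50;
constexpr int kLowBatteryPct  = 20;

// Maps a battery percentage (0-100) to a response-token budget.
int max_response_tokens(int battery_pct) {
    assert(battery_pct >= 0 && battery_pct <= 100);
    if (battery_pct > kHighBatteryPct) return 96;
    if (battery_pct > kLowBatteryPct)  return 80;
    return 64;
}
```

Keeping the tiers as named constants makes it straightforward to retune them per device class without touching the generation path.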
Honesty here is more useful than marketing.
- No clinician sign-off. The corpus and the 79 safety-routed canned responses were reviewed by internal multi-agent panels (CTO / LLM / Security / Clinical-agent / Pharmacology-agent), not by a licensed clinician. Deployment beyond the hackathon needs that review, especially on the paediatric weight-based dose-bearing chunks. The Pharmacology review pass exists because a public medical training dataset shipped to us with a pharmacologically impossible "10 g oxytocin IM" entry (the real dose is 10 IU); agent panels are good at catching that class of error, but a real paediatrician is the bar for production.
- English only in V1. The safety filter chain reads 10 languages defensively (so non-English model-output leaks are still caught), but generation is English-only.
- No voice input in V1. Voice is critical for the disaster audience (injured hands, in water, holding someone) — it's the V1.1 priority, not V1.
- No image input. Gemma 4 E2B is text-only. MedGemma 4B does image+text but targets a different audience (trained CHWs) and a different hardware floor.
- Sustained inference will thermal-throttle on 4 GB phones. First-query latency on a Helio G85 / 4 GB chassis (Samsung A06, Tecno Spark 20C, Redmi 12) is comparable to Moto G54, but 2–3 queries back-to-back will hit thermal throttling and slow ~30–40% until the phone cools. Real-world disaster-zone use is more like "one question, walk away, come back" than continuous chat, so this is acceptable for V1 — but it's not a benchmark we'd publish without the disclaimer.
- Not a substitute for emergency services where they exist. This is a first-aid reference for the case when they don't.
- Dose-critical questions are intentionally narrow. Anything the curated router doesn't already have a pre-vetted answer for gets either a generic refusal or a model-generated answer with RAG grounding — we'd rather refuse than hallucinate a dose.
| Project | Type | Offline? | Hardware target | Differentiator vs. Apocalypse-Aid |
|---|---|---|---|---|
| Apocalypse-Aid | Android app | Yes | Sub-$200 commodity Android (Moto G54 / Tecno Spark / Samsung A) | This entry |
| Project N.O.M.A.D. | Linux basecamp server | Yes | Mini-PC at a basecamp | Complementary — Apocalypse-Aid is what you carry when you leave the basecamp |
| survive-ai | Android app | Yes | Unspecified | Older/smaller Gemma 2B, keyword search, no dose safety route, 2 GitHub stars |
| medical-gemma-3n (ericrisco) | Model only | n/a | n/a | Re-trained the model; we kept Gemma 4 general and added a curated RAG library + safety router |
| MedGemma Uganda triage | Android app | Yes | Mid-tier Android | Different audience (trained CHWs); image+text MedGemma 4B |
| Signpost AI (IRC/Mercy Corps) | Cloud chatbot | No — needs internet | n/a | Stops working when the internet does |
| PocketPal AI / SmolChat / Google AI Edge Gallery | Generic AI runners | Yes | Any | No medical content, no safety routes, no curated sources |
To our knowledge, Apocalypse-Aid is the only shipped offline-on-device LLM survival app combining (a) the newer/bigger Gemma 4 E2B, (b) semantic + BM25 hybrid search over a peer-reviewed medical library, (c) a hand-curated pre-LLM router for life-critical questions, and (d) a target floor of commodity sub-$200 Android. See docs/competitor-landscape.md for the full survey.
Safety routing (149-case JVM suite): 137/149 pass (~92%). None of the 12 failures routes to a dangerous answer; all are routing-test failures where the test expected a different DL entry or a model fall-through. Categories at 100% pass: dl-routing (the 20 core queries), pediatric, drug-name, garbage. Run with ./gradlew :app:testDebugUnitTest --tests "com.apocalypseaid.qa.QaSuiteTest".
Model + safety-layer eval (102-probe holdout, prior session): the v2 fine-tune + AxiomScrub configuration scored 81.1% refusal accuracy / 3.9% false-refusal / 96.1% citation / 81.4% adversarial / 0 user-visible axiom violations. These headline numbers describe the model configuration the project shipped; the QA suite above covers the routing layer that sits in front of it. Full methodology in MODEL_CARD.md.

```shell
git clone --recursive https://github.com/ApocalypseTech00/apocalypse-aid.git
cd apocalypse-aid

# Java 17 + Android NDK required (Android Studio installs both)
# The JAVA_HOME below is a macOS Homebrew path; adjust it for your system
export JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home

./gradlew :app:assembleDebug        # debug APK
./gradlew :app:testDebugUnitTest    # JVM unit tests + QA routing suite
./gradlew :app:assembleRelease      # signed release (needs local.properties; see docs/PROTOCOL.md §7)
```

The Q4_0 GGUF (~2.5 GB) is not bundled in the APK: Android Gradle Plugin caps single-asset size at 2 GB. It is hosted on Hugging Face and imported on first launch via Android's Storage Access Framework picker, with SHA-256 verification on import.
Install on a phone: Download app-release.apk and the GGUF from the release page over Wi-Fi, tap to install, point the app at the downloaded .gguf. After first launch nothing leaves the phone.
| Path | Purpose |
|---|---|
| `app/src/main/cpp/llm_bridge.cpp` | JNI bridge to llama.cpp: minSdk 26, MAP_POPULATE pre-warm, KV-cache prefix priming, cooperative cancellation, embedder context |
| `app/src/main/java/com/apocalypseaid/ai/` | AI inference, hardware profiler, RAG retriever, safety layer |
| `app/src/main/java/com/apocalypseaid/ai/safety/` | DoseLookup router (79 entries / ~470 patterns), AxiomScrub, DoseFilter, RepetitionCheck |
| `app/src/main/java/com/apocalypseaid/ui/` | Jetpack Compose UI (apocalypse-tech pink #F918D0) |
| `app/src/main/assets/` | Bundled corpus + embedder + safety patterns (model is NOT bundled) |
| `app/src/test/` | JVM unit tests + 149-case QA routing suite |
| `app/build.gradle.kts` | The CMAKE_BUILD_TYPE=Release fix lives here |
| `ai-training/` | Training pipeline, evaluation harnesses, Python tooling |
| `MODEL_CARD.md` | Training methodology, eval results, install architecture |
| `docs/PROTOCOL.md` | Development protocol (mandatory reading for contributors) |
| `docs/competitor-landscape.md` | Plain-English survey of the offline-on-device LLM medical space |
GPLv3 — see LICENSE. Trained on permissive medical sources only (PubMed Central OA Commercial + government-published clinical guidelines). No commercial-licensed content is bundled in the model weights or in the runtime corpus.
This software is first-aid REFERENCE information, never medical advice. It is designed for scenarios where infrastructure is gone and a layperson is the only responder available.
- Google Gemma team for releasing Gemma 4 E2B with permissive terms that allow on-device fine-tuning and redistribution under GPLv3.
- ggml-org / llama.cpp for the inference engine, the KleidiAI ARM kernels, the GGUF format, and the `llama_state_seq_*` APIs that make KV-cache prefix priming a one-line call.
- WHO, MSF, IFRC/Red Cross, AHA, ERC, TCCC, BMJ Open, Lancet OA, PubMed Central for the open clinical sources the corpus is built from.
- Project N.O.M.A.D. for proving the basecamp half of the offline-infrastructure stack, and being the reason this app can stay focused on the phone half.