APOCALYPSE-AID

Offline first-aid intelligence for a commodity Android phone in a disaster zone.

A fine-tuned Gemma 4 E2B (5.1B parameters, Q4_0) runs entirely on-device via llama.cpp. No internet. No cloud. No telemetry. The build target is a sub-$200 Android phone — Moto G54 5G class (ARMv8.2 Cortex-A78+A55, 4–8 GB RAM), with the same APK expected to run on commodity hardware down to a Samsung A-series, Xiaomi Redmi, Tecno Spark, Infinix Hot, or Itel A-series. Built for refugees, conflict zones, and disaster contexts where infrastructure is gone and the only responder is a layperson with the phone in their pocket.

Track: Google Gemma 4 Good Hackathon (llama.cpp Special Tech Prize)
Submission deadline: 2026-05-18
Repo: github.com/ApocalypseTech00/apocalypse-aid
Model: DestinyApocalypse/apocalypse-aid-gemma4-e2b

The problem

Google's reference platforms for Gemma 4 — Pixel 9 Pro, Snapdragon 8 Gen 3 flagships — are not the phones in a Sudanese refugee camp, a South American cartel-zone village, or a post-earthquake aid distribution line. The actual device floor in those contexts is a commodity Android: a Cortex-A55/A78 SoC, 4 GB of LPDDR4X, and a SIM with no signal. Running a 5.1B-parameter LLM on that hardware, fast enough to answer a first-aid question in under a minute, is not a matter of porting the official llama.cpp Android example — that example requires API 33 and assumes a flagship. It is a matter of three or four very specific engineering moves below the framework level.

Top engineering innovations

1. The two-line speedup: build-flag fix + pure-Q4_0 requantize

The "~4 min → ~19 s" win comes from two complementary fixes, each worth understanding on its own. A llama.cpp expert reading the code will see both; we credit both honestly.

(a) CMAKE_BUILD_TYPE=Release for the native JNI library (compile-time). Android Gradle Plugin's assembleDebug builds native code with CMAKE_BUILD_TYPE=Debug, i.e. at -O0: no inlining and no auto-vectorization, so even hand-written SIMD intrinsics compile to slow, unoptimized code. For a SIMD-heavy LLM kernel (ggml + KleidiAI + llama.cpp) that default is catastrophic. Measured on Moto G54 (2026-05-16): the same model that llama-bench runs at 7.5 tok/s prefill at RelWithDebInfo ran at 0.22 tok/s out of a Debug-built JNI. That is a ~34× slowdown, and it quietly leads developers to blame Gemma 4 for "just not fitting" on commodity Android. The Kotlin side stays Debug (breakpoints, no R8); only the inference path runs at full -O3:

// app/build.gradle.kts
externalNativeBuild {
    cmake {
        arguments += listOf("-DCMAKE_BUILD_TYPE=Release")
    }
}

(b) llama-quantize --pure Q4_0 (dispatch correctness). The first GGUF we shipped had token_embd stored as Q6_K alongside the rest as Q4_0 — a default that silently dropped out of KleidiAI's optimized matmul path and forced a scalar fallback for that tensor. Re-quantizing with --pure Q4_0 so every weight tensor matches the dispatch path produced a ~100× speedup on the affected matmuls. File size also dropped from 3.35 GB to 2.5 GB as a side effect.

Either fix in isolation is partial. Together they bring sustained inference from "minutes per query, unusable" to "under a minute per first-aid query" on a commodity ARM phone.

2. `MAP_POPULATE` pre-warm — eliminates 60–200 s of cold-start page-fault I/O

The Q4_0 GGUF is memory-mapped. On cold start the first decoded token triggers a cascade of page faults pulling every weight page from UFS storage into RAM. On Moto G54, the first inference after model load took 351 s for an 82-token prime — almost entirely disk I/O.

Solution. Before llama.cpp opens the file, the JNI bridge does its own mmap(MAP_SHARED | MAP_POPULATE) over the GGUF and immediately munmaps. MAP_POPULATE pre-faults the whole mapping, so every weight page is read from flash in one sequential pass; after the munmap the pages stay warm in the Linux page cache (until memory pressure evicts them), so llama.cpp's own mmap reads from RAM, not flash. Cost: ~5–10 s of sequential read at app launch. Pays for itself on the very first chat query.

Source: app/src/main/cpp/llm_bridge.cpp:108–149.
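
A minimal sketch of the pre-warm step, assuming a Linux or Android target where MAP_POPULATE is available. The helper name prewarm_file and its return convention are hypothetical; this is not the code in llm_bridge.cpp, only an illustration of the map-then-unmap pattern described above.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not the project's actual code): pre-fault a file into
// the page cache. Returns the number of bytes mapped, or -1 on any failure.
long long prewarm_file(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    const size_t len = static_cast<size_t>(st.st_size);

    // MAP_POPULATE asks the kernel to read the file in up front instead of
    // faulting pages in one at a time during the first decode.
    void* addr = mmap(nullptr, len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    // Unmapping does not drop the cached pages; a later mmap of the same file
    // can be served from the page cache for as long as they stay resident.
    munmap(addr, len);
    return static_cast<long long>(len);
}
```

A caller would typically run this once on a background thread at app launch, before handing the same path to the model loader. Treat the pre-warm as best-effort: if it returns -1, loading can still proceed, just with the cold-start cost.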

3. KV-cache prefix priming via `llama_state_seq_get_data` / `set_data`

Every chat query re-tokenises and re-decodes the same system prompt: ~38 tokens of "You are a first-aid action card. Reply with numbered steps only…". On a Cortex-A55 at ~7.5 tok/s prefill that is roughly 5 seconds (38 ÷ 7.5) of wasted prompt-eval per query. Most llama.cpp Android wrappers eat this cost silently because they treat each generate() as stateless.

Solution. After model load, primeSystemPrompt() tokenises + decodes the system prompt once and snapshots the resulting KV state via llama_state_seq_get_data(ctx, buf, size, /*seq_id=*/0). On every subsequent generate(), if the incoming prompt begins with the cached prefix, the bridge restores the KV state with llama_state_seq_set_data(...) and decodes only the tail — the RAG chunks plus the user's question. System-prompt prefill is paid once at app start, never again.

The same architecture also enables: cooperative cancellation (atomic flag polled between tokens for a 60-second coroutine timeout), and a (used_kv_cache, n_tail_tokens) log line every query for cost auditing.

Source: app/src/main/cpp/llm_bridge.cpp:48–69, 254–301, 463–547.
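
The prefix check at the heart of this scheme can be sketched without llama.cpp at all. The following is an illustration under assumptions: DecodePlan and plan_decode are hypothetical names, Token stands in for llama_token, and the comments only mark where the real bridge would call llama_state_seq_set_data. It also assumes that a prompt exactly equal to the cached prefix is treated as a miss, so that at least one token is always decoded.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = int32_t;  // stand-in for llama_token

// Hypothetical decision record, mirroring the (used_kv_cache, n_tail_tokens)
// log line described above.
struct DecodePlan {
    bool used_kv_cache;    // true if the cached system-prompt state can be restored
    size_t n_tail_tokens;  // tokens that still have to be decoded
    size_t tail_offset;    // index in the prompt where decoding starts
};

DecodePlan plan_decode(const std::vector<Token>& cached_prefix,
                       const std::vector<Token>& prompt) {
    // A hit requires the prompt to start with the cached prefix and to carry
    // at least one extra token after it.
    const bool hit = !cached_prefix.empty() &&
                     prompt.size() > cached_prefix.size() &&
                     std::equal(cached_prefix.begin(), cached_prefix.end(),
                                prompt.begin());
    if (hit) {
        // In a real bridge: restore the snapshot with llama_state_seq_set_data
        // here, then decode only prompt[tail_offset..].
        return {true, prompt.size() - cached_prefix.size(), cached_prefix.size()};
    }
    // Miss: clear the sequence and decode the whole prompt from position 0.
    return {false, prompt.size(), 0};
}
```

For a three-token cached prefix and a five-token prompt that starts with it, the plan reports a cache hit with two tail tokens; any mismatch inside the prefix falls back to a full decode.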

What it does

A first-aid Q&A assistant grounded in peer-reviewed primary sources (WHO IMCI/mhGAP, MSF clinical guidelines, IFRC/Red Cross, AHA 2025, ERC 2025, TCCC 2024, BMJ Open, Lancet OA, PubMed Central OA, and military medicine field manuals). Every answer cites its source. The model fits in ~3.5 GB of working memory and runs without ever opening a network socket — INTERNET is not declared in the manifest.

Architecture in one paragraph. A user types a question. The query first hits a hand-curated safety router (DoseLookup) with 79 entries / ~470 patterns covering the life-critical first-aid surface (CPR ratios, choking algorithm, anaphylaxis dose, naloxone protocol, suicide-crisis routing, paediatric weight-based dosing). On a match, the router returns a pre-vetted answer and the LLM is never invoked — the model does not get to roll dice on life-critical doses. On a miss, the query goes to a hybrid retriever (sentence-transformers MiniLM-L6-v2 dense embeddings + BM25, fused with weighted Reciprocal Rank Fusion) over a 25,173-chunk corpus memory-mapped from APK assets. The top chunk is injected into a Gemma 4 chat prompt that ends with a "1. " assistant-turn prefill (forcing the model past its trained "I'm a first-aid reference, …" preamble bias). Output passes through a defence-in-depth filter chain (dose-leak guard, repetition-loop guard, surgical sentence-level scrub for external-referral hallucinations) before reaching the user. Total round-trip on Moto G54: under 20 s for typical first-aid queries.
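
The fusion step of the hybrid retriever can be illustrated with a short sketch of weighted Reciprocal Rank Fusion. This is not the project's retriever: the function name weighted_rrf, the chunk-id strings, the weights, and the constant k = 60 (a commonly used default) are all assumptions made for the example. Each ranker contributes w / (k + rank) for every chunk it returns, with rank counted from 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Scored = std::pair<std::string, double>;  // (chunk id, fused score)

// Hypothetical sketch of weighted RRF over two ranked lists of chunk ids.
// score(d) = w_dense / (k + rank_dense(d)) + w_bm25 / (k + rank_bm25(d)),
// where a chunk missing from a list contributes nothing for that list.
std::vector<Scored> weighted_rrf(const std::vector<std::string>& dense_ranked,
                                 const std::vector<std::string>& bm25_ranked,
                                 double w_dense, double w_bm25,
                                 double k = 60.0) {
    std::unordered_map<std::string, double> score;
    for (size_t i = 0; i < dense_ranked.size(); ++i)
        score[dense_ranked[i]] += w_dense / (k + static_cast<double>(i + 1));
    for (size_t i = 0; i < bm25_ranked.size(); ++i)
        score[bm25_ranked[i]] += w_bm25 / (k + static_cast<double>(i + 1));

    std::vector<Scored> fused(score.begin(), score.end());
    std::sort(fused.begin(), fused.end(), [](const Scored& a, const Scored& b) {
        if (a.second != b.second) return a.second > b.second;  // best first
        return a.first < b.first;  // deterministic tie-break on chunk id
    });
    return fused;
}
```

The first element of the returned list is the chunk that would be injected into the prompt. Because only ranks enter the formula, the dense cosine scores and the BM25 scores never need to be normalised against each other, which is the usual reason for choosing RRF in a hybrid setup.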

Other engineering notes (not the headliners, but worth pointing at):

Custom JNI for minSdk 26 (Android 8.0). The official llama.cpp Android example requires API 33; the V1 floor includes phones from 2018. app/src/main/cpp/llm_bridge.cpp is a clean-room ~1,000-line bridge that exposes only what the app needs.
AAPT compression-free assets (noCompress += listOf("gguf", "bin", "bm25-stats")). Without this, AAPT runs zlib-deflate over the corpus + index — install time on commodity flash inflates by 30–90 s, and worse, llama.cpp's mmap() returns garbage on a deflate-stream-backed asset.
Action-verb-only system prompt + "1. " assistant-turn prefill mechanically forces the model past instruction-tuned preamble bias ("I'm an AI…", "As a first-aid reference…") into the middle of a numbered step on the first sampled token.
A78 / A55 thread placement was the wrong fix. Pinning to the big cluster halved effective core count; the Linux EAS scheduler at n_threads=4 was empirically faster. LPDDR4X bandwidth, not core count, is the ceiling on this class of hardware.
x86_64 stripped from the release AAB: every shipping target is arm64-v8a. Halves both the CMake build time and the AAB size.
Defence-in-depth Unicode sanitiser: NFKC normalisation + zero-width / bidi-override stripping + IPA small-caps homoglyph fold-map, applied symmetrically in both the UI sanitiser and the safety router. Closes the homoglyph / RLO / ZWSP class of prompt-injection attacks.
Pinned n_ubatch = n_batch in the llama.cpp context to avoid a ggml_abort SIGABRT on the chunked-prompt decode path. llama.cpp's default n_ubatch is 512. We ship n_batch = 768, and a smaller, memory-constrained phone might force n_batch < 512, leaving n_ubatch larger than n_batch; at that point the first chunked-batch decode tripped an internal invariant. One-line pin, not documented in llama.cpp upstream.
File-size shrink that's actually a RAM-budget fix. The --pure Q4_0 requantize dropped the GGUF from 3.35 GB to 2.5 GB. On a 4 GB-RAM phone (Samsung A06, Tecno Spark 20C) where Android + OEM bloat consume ~1.5 GB and JVM/Compose another ~200 MB, that 850 MB delta is the difference between "swap thrashing + LMK reap" and "headroom for a second app to be open." Not a storage win — a working-set win.
Co-resident embedder model in one JNI process. app/src/main/cpp/llm_bridge.cpp holds Gemma 4 E2B (~2.5 GB mmap) and a sentence-transformers MiniLM-L6-v2 (~20 MB) behind separate mutexes and separate llama_context instances. The on-device RAG retriever calls the embedder; the chat path calls Gemma. Most Android LLM apps either pre-embed corpus offline (no live query embedding) or ship two JNI bridges; one shared process saves the JNI startup cost and the duplicate llama_backend_init.
Cooperative cancellation via std::atomic<bool> polled per-token. Kotlin's withTimeout can't pre-empt a thread-blocking JNI call, so a 60 s coroutine timeout would otherwise leak the generation thread forever. The flag is set by cancelGenerate() from any thread, the decode loop polls it between tokens with memory_order_acquire, and bails within ~one token boundary (≤1 s) returning whatever was generated so far. Without this, a query that hits the 60 s wall would still hold the JNI mutex while the next user message waited.
Battery-tier maxResponseTokens scaling in HardwareProfiler.kt — 96 / 80 / 64 tokens by battery percentage. Software trade against thermal throttle on commodity SoCs: a deep-discharge phone gets a shorter (still complete) answer at lower SoC voltage, instead of a half-finished long answer the user can't act on.
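
The cooperative-cancellation note above can be sketched in a few lines. This is an illustration, not the bridge's code: g_cancel, cancel_generate, generate, and the next_token callback are hypothetical names, and the callback stands in for one llama.cpp decode-and-sample step. The sketch resets the flag at the start of each call, which assumes cancellation requests are only issued while a generation is running.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the bridge's cancellation flag.
static std::atomic<bool> g_cancel{false};

// Safe to call from any thread, e.g. when a Kotlin coroutine timeout fires.
void cancel_generate() { g_cancel.store(true, std::memory_order_release); }

// next_token(i) is a placeholder for "decode step i and return its text".
std::string generate(size_t max_tokens,
                     const std::function<std::string(size_t)>& next_token) {
    g_cancel.store(false, std::memory_order_release);  // arm for this call
    std::string out;
    for (size_t i = 0; i < max_tokens; ++i) {
        // Poll between tokens: a cancel request takes effect at the next
        // token boundary, never in the middle of a decode step.
        if (g_cancel.load(std::memory_order_acquire)) break;
        out += next_token(i);
    }
    return out;  // whatever was generated before the flag was seen
}
```

Because the loop returns normally with a partial string, any mutex held around the call is released through the usual path, so the next request is not left waiting behind a timed-out generation.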

What it does NOT do

Honesty here is more useful than marketing.

No clinician sign-off. The corpus and the 79 safety-routed canned responses were reviewed by internal multi-agent panels (CTO / LLM / Security / Clinical-agent / Pharmacology-agent), not by a licensed clinician. Deployment beyond the hackathon needs that review, especially on the paediatric weight-based dose-bearing chunks. The Pharmacology review pass exists because a public medical training dataset shipped to us with a pharmacologically impossible "10 g oxytocin IM" entry (the real dose is 10 IU); agent panels are good at catching that class of error, but a real paediatrician is the bar for production.
English only in V1. The safety filter chain reads 10 languages defensively (so non-English model-output leaks are still caught), but generation is English-only.
No voice input in V1. Voice is critical for the disaster audience (injured hands, in water, holding someone) — it's the V1.1 priority, not V1.
No image input. Gemma 4 E2B is text-only. MedGemma 4B does image+text but targets a different audience (trained CHWs) and a different hardware floor.
Sustained inference will thermal-throttle on 4 GB phones. First-query latency on a Helio G85 / 4 GB chassis (Samsung A06, Tecno Spark 20C, Redmi 12) is comparable to Moto G54, but 2–3 queries back-to-back will hit thermal throttling and slow ~30–40% until the phone cools. Real-world disaster-zone use is more like "one question, walk away, come back" than continuous chat, so this is acceptable for V1 — but it's not a benchmark we'd publish without the disclaimer.
Not a substitute for emergency services where they exist. This is a first-aid reference for the case when they don't.
Dose-critical questions are intentionally narrow. Anything the curated router doesn't already have a pre-vetted answer for gets either a generic refusal or a model-generated answer with RAG grounding — we'd rather refuse than hallucinate a dose.

How it compares

| Project | Type | Offline? | Hardware target | Differentiator vs. Apocalypse-Aid |
| --- | --- | --- | --- | --- |
| Apocalypse-Aid | Android app | Yes | Sub-$200 commodity Android (Moto G54 / Tecno Spark / Samsung A) | This entry |
| Project N.O.M.A.D. | Linux basecamp server | Yes | Mini-PC at a basecamp | Complementary: Apocalypse-Aid is what you carry when you leave the basecamp |
| survive-ai | Android app | Yes | Unspecified | Older/smaller Gemma 2B, keyword search, no dose safety route, 2 GitHub stars |
| medical-gemma-3n (ericrisco) | Model only | n/a | n/a | Re-trained the model; we kept Gemma 4 general and added a curated RAG library + safety router |
| MedGemma Uganda triage | Android app | Yes | Mid-tier Android | Different audience (trained CHWs); image+text MedGemma 4B |
| Signpost AI (IRC/Mercy Corps) | Cloud chatbot | No (needs internet) | n/a | Stops working when the internet does |
| PocketPal AI / SmolChat / Google AI Edge Gallery | Generic AI runners | Yes | Any | No medical content, no safety routes, no curated sources |

To our knowledge, Apocalypse-Aid is the only shipped offline-on-device LLM survival app combining (a) the newer/bigger Gemma 4 E2B, (b) semantic + BM25 hybrid search over a peer-reviewed medical library, (c) a hand-curated pre-LLM router for life-critical questions, and (d) a target floor of commodity sub-$200 Android. See docs/competitor-landscape.md for the full survey.

Test state (2026-05-17)

Safety routing (149-case JVM suite): 137/149 pass (~92%). None of the 12 failures routes to a dangerous answer; all are routing-test failures where the test expected a different DL entry or a model fall-through. Categories at 100% pass: dl-routing (the 20 core queries), pediatric, drug-name, garbage. Run with ./gradlew :app:testDebugUnitTest --tests "com.apocalypseaid.qa.QaSuiteTest".

Model + safety-layer eval (102-probe holdout, prior session): the v2 fine-tune + AxiomScrub configuration scored 81.1% refusal accuracy / 3.9% false-refusal / 96.1% citation / 81.4% adversarial / 0 user-visible axiom violations. These numbers describe the model configuration the project shipped with; the QA suite above covers the routing layer that sits in front of it. Full methodology in MODEL_CARD.md.

Building from source

git clone --recursive https://github.com/ApocalypseTech00/apocalypse-aid.git
cd apocalypse-aid

# Java 17 + Android NDK required (Android Studio installs both)
# Example path for Homebrew on Apple Silicon macOS; adjust JAVA_HOME for your system
export JAVA_HOME=/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home

./gradlew :app:assembleDebug                              # debug APK
./gradlew :app:testDebugUnitTest                          # JVM unit tests + QA routing suite
./gradlew :app:assembleRelease                            # signed release (needs local.properties — see docs/PROTOCOL.md §7)

The Q4_0 GGUF (~2.5 GB) is not bundled in the APK — Android Gradle Plugin caps single-asset size at 2 GB. It is hosted on Hugging Face and imported on first launch via Android's Storage Access Framework picker, with SHA-256 verification on import.

Install on a phone: Download app-release.apk and the GGUF from the release page over Wi-Fi, tap to install, point the app at the downloaded .gguf. After first launch nothing leaves the phone.

Repo layout

| Path | Purpose |
| --- | --- |
| `app/src/main/cpp/llm_bridge.cpp` | JNI bridge to llama.cpp: `minSdk 26`, MAP_POPULATE pre-warm, KV-cache prefix priming, cooperative cancellation, embedder context |
| `app/src/main/java/com/apocalypseaid/ai/` | AI inference, hardware profiler, RAG retriever, safety layer |
| `app/src/main/java/com/apocalypseaid/ai/safety/` | DoseLookup router (79 entries / ~470 patterns), AxiomScrub, DoseFilter, RepetitionCheck |
| `app/src/main/java/com/apocalypseaid/ui/` | Jetpack Compose UI (apocalypse-tech pink `#F918D0`) |
| `app/src/main/assets/` | Bundled corpus + embedder + safety patterns (model is NOT bundled) |
| `app/src/test/` | JVM unit tests + 149-case QA routing suite |
| `app/build.gradle.kts` | The `CMAKE_BUILD_TYPE=Release` fix lives here |
| `ai-training/` | Training pipeline, evaluation harnesses, Python tooling |
| `MODEL_CARD.md` | Training methodology, eval results, install architecture |
| `docs/PROTOCOL.md` | Development protocol (mandatory reading for contributors) |
| `docs/competitor-landscape.md` | Plain-English survey of the offline-on-device LLM medical space |

License

GPLv3 — see LICENSE. Trained on permissive medical sources only (PubMed Central OA Commercial + government-published clinical guidelines). No commercial-licensed content is bundled in the model weights or in the runtime corpus.

This software is first-aid REFERENCE information, never medical advice. It is designed for scenarios where infrastructure is gone and a layperson is the only responder available.

Acknowledgements

Google Gemma team for releasing Gemma 4 E2B with permissive terms that allow on-device fine-tuning and redistribution under GPLv3.
ggml-org / llama.cpp for the inference engine, the KleidiAI ARM kernels, the GGUF format, and the llama_state_seq_* APIs that make KV-cache prefix priming a one-line call.
WHO, MSF, IFRC/Red Cross, AHA, ERC, TCCC, BMJ Open, Lancet OA, PubMed Central for the open clinical sources the corpus is built from.
Project N.O.M.A.D. for proving the basecamp half of the offline-infrastructure stack, and being the reason this app can stay focused on the phone half.

