A fully offline, privacy-first AI assistant with long-lived conversational memory, real-time voice interaction, and semantic understanding — no cloud required.
Nura is a six-engine memory architecture designed for long-horizon conversational AI with persistent, adaptive memory capabilities. The system runs entirely offline on consumer hardware, prioritizing privacy, low latency, and architectural discipline.
Key Differentiators:
- 100% Offline — No cloud APIs, no data leaves your device
- Semantic Understanding — ML-based comprehension, not regex/keyword matching
- Persistent Memory — Remembers facts, preferences, and conversations across sessions
- Real-Time Voice — Sub-second speech-to-speech latency (~800ms warm)
- Privacy-First — Your conversations stay on your machine
from sdk import Kenotic
k = Kenotic(user_id=0)
k.ingest(text="I adopted a golden retriever named Kobe. He loves swimming.", speaker="Sam")
result = k.retrieve(query="What is Sam's dog's name?")
print(result.text)  # "a golden retriever named Kobe"

See docs/sdk-quickstart.md for the full guide and docs/sdk-reference.md for the API reference.
One server, two protocols. Claude/Cursor get MCP. Everything else gets REST.
# Start the server
python -m mcp.http_server --port 7130
# Store via REST
curl -X POST http://localhost:7130/api/v1/store \
-H "Content-Type: application/json" \
-d '{"text": "Sam lives in Detroit", "speaker": "Sam"}'
# Retrieve via REST
curl -X POST http://localhost:7130/api/v1/retrieve \
-H "Content-Type: application/json" \
-d '{"query": "Where does Sam live?"}'

See mcp/README.md for the per-platform quickstart (Claude Desktop, ChatGPT, Cursor) and docs/api-quickstart.md for full REST examples.
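The same two calls can be made from Python using only the standard library. This is a minimal sketch, assuming the server from the example above is listening on localhost:7130 and that both endpoints accept and return JSON. The helper names `build_request` and `post_json` are illustrative and not part of the project, and the response schema is not documented here, so the reply is returned as parsed JSON without interpretation.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7130/api/v1"  # host and port taken from the curl examples


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as "store" or "retrieve"."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(endpoint: str, payload: dict, timeout: float = 5.0):
    """Send the request and parse the reply as JSON (reply shape is an assumption)."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server running:
#   post_json("store", {"text": "Sam lives in Detroit", "speaker": "Sam"})
#   post_json("retrieve", {"query": "Where does Sam live?"})
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running server.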
| Doc | For |
|---|---|
| mcp/README.md | Connect any AI client in 2 minutes |
| docs/sdk-quickstart.md | Python SDK getting started |
| docs/sdk-reference.md | Full API reference |
| docs/architecture.md | DTCM architecture for investors/researchers |
| docs/api-quickstart.md | REST API curl examples |
| docs/benchmark-methodology.md | ATANT benchmark methodology |
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Memory    │  │  Retrieval  │  │  Temporal   │
│   Engine    │  │   Engine    │  │   Engine    │
│             │  │             │  │             │
│  Storage &  │  │  Semantic   │  │    Time     │
│    Facts    │  │   Search    │  │  Reasoning  │
└─────────────┘  └─────────────┘  └─────────────┘
       │                │                │
       └────────────────┼────────────────┘
                        │
              ┌─────────┴─────────┐
              │   Orchestrator    │
              └─────────┬─────────┘
                        │
       ┌────────────────┼────────────────┐
       │                │                │
┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ Adaptation  │  │  Proactive  │  │  Semantic   │
│   Engine    │  │   Engine    │  │   Router    │
│             │  │             │  │             │
│  Behavior   │  │  Reminders  │  │     NLU     │
│  Learning   │  │  & Nudges   │  │Understanding│
└─────────────┘  └─────────────┘  └─────────────┘
- Memory Engine — Event ingestion, semantic classification, fact extraction, persistent storage
- Retrieval Engine — FAISS-accelerated semantic search, temporal-aware ranking
- Temporal Engine — Time phrase parsing, temporal context generation, deadline tracking
- Adaptation Engine — User profile evolution, warmth/formality tuning, behavioral adaptation
- Proactive Engine — Reminder scheduling, follow-up nudges, narrative boundary detection
- Semantic Router — ML-based intent/emotion/importance detection (replaces all regex)
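How the Retrieval Engine blends similarity with time is not spelled out in this README. One common approach, shown here purely as an assumption, scales each semantic similarity score by an exponential recency decay. The `Memory` dataclass, the `rank` function, and the `half_life_days` parameter are all hypothetical, and the two candidate memories are toy data.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    similarity: float  # semantic score in [0, 1], e.g. cosine similarity
    age_days: float    # time elapsed since the memory was stored


def recency_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def rank(memories, half_life_days: float = 30.0):
    """Sort memories by similarity scaled by recency, best first."""
    return sorted(
        memories,
        key=lambda m: m.similarity * recency_weight(m.age_days, half_life_days),
        reverse=True,
    )


# Toy data: an older, slightly closer match versus a recent one.
candidates = [
    Memory("Sam lived in Chicago", similarity=0.80, age_days=90),
    Memory("Sam lives in Detroit", similarity=0.75, age_days=3),
]
for memory in rank(candidates):
    print(memory.text)
```

With a 30-day half-life, the 90-day-old memory keeps only one eighth of its similarity score, so the recent memory ranks first despite a slightly lower raw similarity.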
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ TEN VAD  │───►│ Whisper  │───►│Orchestr- │───►│  Local   │───►│  Piper   │
│  (50ms)  │    │   STT    │    │   ator   │    │   LLM    │    │   TTS    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
   Voice           Speech          Memory         Response         Speech
  Activity        to Text        + Context       Generation        Output
 Detection                       Injection
Target Latency: <500ms end-to-end (achieved: ~806ms warm, ~1200ms cold)
Nura uses ML-based embeddings for all natural language understanding:
┌─────────────────────────────────────────────────────────────┐
│                       SEMANTIC ROUTER                       │
│                                                             │
│  User Input: "my dog name is Shiro"                         │
│                              │                              │
│                              ▼                              │
│                       ┌──────────────┐                      │
│                       │  EMBED ONCE  │  (all-MiniLM-L6-v2)  │
│                       └──────┬───────┘                      │
│                              │                              │
│              ┌───────────────┼───────────────┐              │
│              ▼               ▼               ▼              │
│          ┌────────┐     ┌──────────┐   ┌────────────┐       │
│          │ Intent │     │   Fact   │   │ Importance │       │
│          │  95%   │     │ dog_name │   │    HIGH    │       │
│          │PERSONAL│     │   90%    │   │    85%     │       │
│          └────────┘     └──────────┘   └────────────┘       │
└─────────────────────────────────────────────────────────────┘
Concept Domains:
- intent_concepts.py — 7 intent types (personal_state, question, greeting, etc.)
- temporal_concepts.py — 30 temporal concepts (tomorrow, next week, etc.)
- fact_concepts.py — 20 fact types (name, age, pet, location, etc.)
- query_concepts.py — 16 query types (recall, search, compare, etc.)
- emotion_concepts.py — 20 emotion states (happy, stressed, anxious, etc.)
- importance_concepts.py — 16 importance levels (urgent, trivial, etc.)
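The "embed once" idea can be sketched as follows: the input is embedded a single time, and that one vector is compared against cached concept vectors in every domain. To stay runnable without model downloads, this sketch swaps in a toy bag-of-words embedding in place of all-MiniLM-L6-v2, so the scores are not comparable to the percentages in the diagram. The concept phrases, `toy_embed`, and `route` are illustrative assumptions and do not reflect the contents of the real concept files.

```python
import math
from collections import Counter


def toy_embed(text: str) -> dict:
    """Toy stand-in for a sentence embedding: unit-length bag-of-words counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict, b: dict) -> float:
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


# Hypothetical concept phrases; the real concept files are not reproduced here.
CONCEPT_PHRASES = {
    "intent": {"personal_state": "my name is i am", "question": "what where who is"},
    "fact": {"pet": "my dog cat pet name", "location": "i live in city"},
}

# Concept vectors are computed once at startup and reused for every input.
CONCEPT_VECTORS = {
    domain: {label: toy_embed(phrase) for label, phrase in concepts.items()}
    for domain, concepts in CONCEPT_PHRASES.items()
}


def route(text: str) -> dict:
    """Embed the input once, then pick the best-matching concept per domain."""
    query = toy_embed(text)  # the single embedding call for this input
    decisions = {}
    for domain, concepts in CONCEPT_VECTORS.items():
        scores = {label: cosine(query, vec) for label, vec in concepts.items()}
        best = max(scores, key=scores.get)
        decisions[domain] = (best, round(scores[best], 2))
    return decisions


print(route("my dog name is Shiro"))
```

Because every domain reads the same query vector, adding a new concept domain costs only extra dot products, not another pass through the embedding model.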
- Offline-First — Every component runs locally; no network required
- Semantic Over Regex — ML embeddings understand meaning, not patterns
- Embed Once, Understand Everywhere — Single embedding serves all engines
- Strict Separation of Concerns — Each engine has non-overlapping responsibilities
- Privacy by Architecture — No telemetry, no cloud, no data collection
- ✅ Architectural boundary enforcement
- ✅ Cross-engine orchestration
- ✅ Dead code resolution
- ✅ Protocol interfaces
- ✅ Testing & validation (92.9% pass rate)
Objective: Replace development implementations with production-grade components.
- ✅ 6.1 Sentence Transformers — all-MiniLM-L6-v2 for semantic embeddings
- ✅ 6.2 FAISS Integration — Exact inner-product vector search with IndexFlatIP (brute force, O(N) per query)
- ✅ 6.3 Database Optimizations — WAL mode, connection pooling, indexes
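IndexFlatIP is the exhaustive FAISS index: it scores the query against every stored vector by inner product and returns the top results. The sketch below mirrors that behavior in pure Python so it runs without FAISS installed. The class name `FlatInnerProductIndex` is made up for illustration. Unlike FAISS, which scores raw inner products, this sketch normalizes vectors on insert so that the inner product equals cosine similarity.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


class FlatInnerProductIndex:
    """Exhaustive inner-product search over all stored vectors (O(N) per query)."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = []

    def add(self, vector) -> None:
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(normalize(vector))

    def search(self, query, k: int):
        """Return up to k (score, position) pairs, highest score first."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), position)
            for position, v in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = FlatInnerProductIndex(dim=2)
for vec in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    index.add(vec)

results = index.search([1.0, 0.1], k=2)
print([position for _, position in results])
```

Exhaustive search gives exact results and is usually fast enough for a single user's memory store; an approximate index only becomes worth considering at much larger scales.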
Objective: Real-time speech-to-speech interaction.
- ✅ 7.1 TEN VAD — 50ms latency voice activity detection
- ✅ 7.2 Whisper STT — Local speech recognition (faster-whisper)
- ✅ 7.3 Piper TTS — Neural text-to-speech (Jenny voice)
- ✅ 7.4 Streaming Pipeline — Token-by-token TTS for low latency
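Streaming TTS lowers perceived latency because speech can start before the LLM has finished generating. The exact chunking Nura uses is not stated here. The sketch below assumes one common strategy: buffer tokens and release a chunk whenever a sentence ends or the buffer grows too long. The `speakable_chunks` function and the `max_chars` parameter are hypothetical.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def speakable_chunks(tokens: Iterable[str], max_chars: int = 120) -> Iterator[str]:
    """Group streamed LLM tokens into chunks a TTS engine can start speaking."""
    buffer = ""
    for token in tokens:
        buffer += token
        at_boundary = buffer.rstrip().endswith(SENTENCE_END)
        if at_boundary or len(buffer) >= max_chars:
            chunk = buffer.strip()
            if chunk:
                yield chunk
            buffer = ""
    # Flush whatever remains when the token stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Your", " dog", " is", " named", " Kobe", ".", " He", " loves", " swimming", "."]
for chunk in speakable_chunks(tokens):
    print(chunk)  # each chunk would be handed to the TTS engine here
```

With this input the first sentence is available for synthesis after six tokens, while the remaining tokens are still being generated.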
Achieved Latency:
| Component | Time |
|---|---|
| VAD | ~50ms |
| STT | ~150-200ms |
| Semantic Analysis | ~15-50ms |
| LLM Inference | ~400-600ms |
| TTS | ~100-150ms |
| Total (warm) | ~806ms |
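As a quick sanity check on the table, the per-stage ranges can be summed and compared with the measured warm total and the 500ms target. The figures below are copied from the table; the variable names are illustrative. Summing assumes the stages run strictly one after another, which is a simplification since the streaming pipeline may overlap some of them.

```python
# (low, high) latency per stage in milliseconds, copied from the table above.
STAGES_MS = {
    "VAD": (50, 50),
    "STT": (150, 200),
    "Semantic Analysis": (15, 50),
    "LLM Inference": (400, 600),
    "TTS": (100, 150),
}
TARGET_MS = 500
MEASURED_WARM_MS = 806

low = sum(stage_low for stage_low, _ in STAGES_MS.values())
high = sum(stage_high for _, stage_high in STAGES_MS.values())
slowest = max(STAGES_MS, key=lambda name: sum(STAGES_MS[name]))

print(f"summed range: {low}-{high} ms")  # summed range: 715-1050 ms
print(f"measured warm total inside range: {low <= MEASURED_WARM_MS <= high}")
print(f"slowest stage: {slowest}")
print(f"gap to target: {MEASURED_WARM_MS - TARGET_MS} ms")
```

The measured ~806ms sits inside the 715 to 1050ms summed range, and LLM inference alone accounts for 400 to 600ms, so it is the stage that matters most for closing the gap to the target.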
Objective: Replace all regex/keyword matching with ML-based semantic understanding.
- ✅ 8.1 Intent Classification — Semantic embeddings replace regex patterns
- ✅ 8.2 Temporal Parsing — Semantic concepts replace time regex
- ✅ 8.3 Fact Extraction — Semantic detection of personal facts
- ✅ 8.4 Memory Classification — Semantic importance replaces CSV keywords
- ✅ 8.5 STT Prompting — initial_prompt generated from semantic concepts
Migration Summary:
| Component | Before | After |
|---|---|---|
| Intent Detection | 50+ regex patterns | Semantic embeddings |
| Temporal Parsing | 100+ time patterns | 30 temporal concepts |
| Fact Extraction | Hardcoded patterns | 20 fact type concepts |
| Memory Classification | CSV trigger words | Semantic importance |
| Query Detection | Keyword lists | 16 query concepts |
| Emotion Detection | Word lists | 20 emotion concepts |
Objective: Autonomous reminders and follow-ups without user prompting.
- ✅ 9.1 Reminder Scheduling — "Remind me tomorrow" creates scheduled nudges
- ✅ 9.2 Follow-up Detection — Detects unresolved commitments
- ✅ 9.3 Narrative Boundaries — Understands event conclusions
- ✅ 9.4 Cooldown System — Prevents reminder spam
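A cooldown like the one in 9.4 can be sketched as a per-reminder timestamp check. This is an assumption about the mechanism, not a description of the Proactive Engine's actual code: the `ReminderCooldown` class, its `allow` method, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time


class ReminderCooldown:
    """Suppress repeat nudges for the same reminder within a cooldown window."""

    def __init__(self, cooldown_seconds: float, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock      # injectable so tests can control time
        self._last_sent = {}     # reminder id -> time of the last allowed nudge

    def allow(self, reminder_id: str) -> bool:
        """Return True and record the time if the reminder may fire now."""
        now = self._clock()
        last = self._last_sent.get(reminder_id)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[reminder_id] = now
        return True


cooldown = ReminderCooldown(cooldown_seconds=3600)
print(cooldown.allow("call-dentist"))  # first nudge is allowed
print(cooldown.allow("call-dentist"))  # immediate repeat is suppressed
```

Tracking the cooldown per reminder id means one noisy reminder cannot block unrelated ones.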
Objective: Custom personality and response style.
- ✅ 10.1 Base Model Selection — Qwen 2.5 3B Instruct
- ✅ 10.2 LoRA Training — Identity injection, memory awareness
- ✅ 10.3 GGUF Export — Quantized for CPU inference (Q4_K_M)
- ✅ 10.4 Personality Embedding — Warm, supportive, memory-aware responses
Objective: Installer, error handling, edge cases.
- ✅ 11.1 Windows Installer — NSIS-based one-click setup
- ✅ 11.2 Model Downloads — Automatic first-run model fetching
- 11.3 Error Recovery — Graceful degradation on component failure
- 11.4 Multi-user Support — User profile switching
nura/
├── app/
│   ├── orchestrator/  # Central coordination
│   │   ├── orchestrator.py  # Main engine coordinator
│   │   └── engine_policy.py  # Engine activation rules
│   │
│   ├── semantic/  # ML-based understanding
│   │   ├── semantic_router.py  # Unified NLU entry point
│   │   ├── concept_store.py  # Embedding cache
│   │   └── concepts/  # Domain-specific concepts
│   │       ├── intent_concepts.py
│   │       ├── temporal_concepts.py
│   │       ├── fact_concepts.py
│   │       ├── query_concepts.py
│   │       ├── emotion_concepts.py
│   │       └── importance_concepts.py
│   │
│   ├── memory/  # Persistent storage
│   │   ├── memory_engine.py  # Event ingestion
│   │   ├── memory_store.py  # SQLite operations
│   │   ├── memory_classifier.py  # Semantic classification
│   │   └── memory_summarizer.py  # Session compression
│   │
│   ├── retrieval/  # Semantic search
│   │   ├── retrieval_engine.py  # Search orchestration
│   │   ├── ranker.py  # Relevance scoring
│   │   └── query_parser.py  # Query understanding
│   │
│   ├── temporal/  # Time reasoning
│   │   ├── temporal_engine.py  # Time awareness
│   │   └── temporal_patterns.py  # Pattern detection
│   │
│   ├── adaptation/  # User modeling
│   │   └── adaptation_engine.py  # Profile evolution
│   │
│   ├── proactive/  # Autonomous actions
│   │   └── proactive_engine.py  # Reminder scheduling
│   │
│   ├── services/  # External interfaces
│   │   ├── realtime_stt.py  # Whisper + TEN VAD
│   │   ├── streaming_tts.py  # Piper neural TTS
│   │   ├── nura_llm_interface.py  # Local LLM inference
│   │   └── wake_word_listener.py  # "Hey Nura" detection
│   │
│   ├── vector/  # Embeddings & search
│   │   ├── embedding_service.py  # all-MiniLM-L6-v2
│   │   └── vector_index.py  # FAISS index
│   │
│   ├── guards/  # Safety & limits
│   │   ├── safety_layer.py  # Content filtering
│   │   └── token_budget.py  # Context management
│   │
│   ├── db/  # Database
│   │   └── session.py  # SQLite connection pool
│   │
│   └── api/  # REST endpoints
│       └── memory_routes.py  # Memory CRUD
│
├── config/
│   ├── settings.py  # Global configuration
│   ├── thresholds.py  # Tunable parameters
│   └── model_paths.py  # Model file locations
│
├── models/  # Downloaded models
│   ├── nura-v3-q4_k_m.gguf  # Fine-tuned LLM
│   ├── all-MiniLM-L6-v2/  # Embedding model
│   └── jenny_piper/  # TTS voice
│
├── Training/  # Fine-tuning scripts
│   ├── train_lora.py
│   └── export_gguf.py
│
└── Docs/  # Documentation
    ├── SketchArchitecture.md
    └── NURA_DEVELOPMENT_STATUS.md
| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| LLM | Qwen 2.5 3B (LoRA fine-tuned, Q4_K_M quantized) |
| LLM Runtime | llama-cpp-python |
| Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Vector Search | FAISS (IndexFlatIP) |
| STT | faster-whisper (small.en) |
| VAD | TEN VAD (50ms latency) |
| TTS | Piper (Jenny neural voice) |
| Database | SQLite (WAL mode) |
| API | FastAPI |
| Testing | pytest |
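For readers unfamiliar with the SQLite settings named above, here is a minimal sketch of opening a database in WAL mode and adding an index, using only the standard library. The `memories` table, its columns, and the `open_memory_db` helper are hypothetical and do not reflect Nura's actual schema. WAL mode needs a file-backed database, so the demo writes to a temporary directory.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_memory_db(path: Path) -> sqlite3.Connection:
    """Open a SQLite database in WAL mode with an illustrative schema."""
    conn = sqlite3.connect(str(path))
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS memories (
            id INTEGER PRIMARY KEY,
            speaker TEXT NOT NULL,
            text TEXT NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    # Index the column used in the lookup below.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_memories_speaker ON memories (speaker)")
    conn.commit()
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_memory_db(Path(tmp) / "nura_demo.db")
    conn.execute(
        "INSERT INTO memories (speaker, text) VALUES (?, ?)",
        ("Sam", "Sam lives in Detroit"),
    )
    conn.commit()
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    rows = conn.execute("SELECT text FROM memories WHERE speaker = ?", ("Sam",)).fetchall()
    conn.close()

print(journal_mode, rows)
```

The journal mode is stored in the database file, so it only needs to be set once; later connections to the same file pick it up automatically.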
| Tier | RAM | Storage | Performance |
|---|---|---|---|
| Minimum | 8GB | 5GB | ~2s latency |
| Recommended | 16GB | 6GB | ~800ms latency |
| Optimal | 32GB + GPU | 8GB | ~400ms latency |
# Download and run the installer
Nura_Setup.exe
# Or manual installation
git clone https://github.com/Talknura/Nura.git
cd Nura
pip install -r requirements.txt
python first_run_setup.py # Downloads models
python run_ultra.py # Start Nura

Say: "Hey Nura" # Wake word
Say: "My name is Sam" # Nura remembers
Say: "What's my name?" # Nura recalls: "Sam"
Say: "Bye Nura" # Session ends, memories summarized
- No Cloud — All processing happens locally
- No Telemetry — No usage data collected
- No Network — Works in airplane mode
- Local Storage — SQLite database in user directory
- Your Data — Stays on your device, always
- Phase 1–5: Core Architecture
- Phase 6: Scale Preparation (FAISS, Embeddings)
- Phase 7: Voice Pipeline
- Phase 8: Semantic Engine Migration
- Phase 9: Proactive Intelligence
- Phase 10: LLM Fine-Tuning
- Phase 11: Production Hardening
- Phase 12: Mobile Companion App
- Phase 13: Multi-modal (Vision)
- Phase 14: Edge Deployment (NVIDIA Jetson Orin Nano)
Nura's six-engine architecture is model-agnostic — designed to work with any LLM, not locked to a single provider.
- Current: Phi-3.5 3B (local, offline)
- Next: NVIDIA PersonaPlex 7B (full-duplex speech-to-speech for demo)
- Future: Custom Nura model (in development)
The engines (Memory, Retrieval, Temporal, Adaptation, Proactive, Semantic Router) plug into ANY model. As better models emerge, Nura evolves — same engines, upgraded brain. That's the business model.
┌─────────────────────────────────────────────────┐
│                NURA ENGINE LAYER                │
│   Memory | Temporal | Proactive | Adaptation    │
│      Retrieval | Semantic Router | Safety       │
└───────────────────────┬─────────────────────────┘
                        │ Context Injection
                        ▼
┌─────────────────────────────────────────────────┐
│             MODEL LAYER (Swappable)             │
├─────────────────────────────────────────────────┤
│  Today:  Phi-3.5 3B + Whisper + Kokoro TTS      │
│  Next:   PersonaPlex 7B (full-duplex voice)     │
│  Future: Custom Nura Model                      │
└─────────────────────────────────────────────────┘
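The split between engine layer and model layer can be illustrated with a structural interface: anything that exposes a `generate` method can sit in the model layer, and the engine layer only prepares the injected context. This is a sketch under assumptions, not Nura's actual interface. `LanguageModel`, `EchoModel`, `build_prompt`, and `respond` are hypothetical names, and the prompt format is invented for the example.

```python
from typing import Protocol


class LanguageModel(Protocol):
    """Anything with a generate(prompt) method can serve as the model layer."""

    def generate(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend used only to make the sketch runnable."""

    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt.splitlines()[-1]}"


def build_prompt(memories: list[str], user_text: str) -> str:
    """Context injection: prepend retrieved memories to the user's message."""
    context = "\n".join(f"- {memory}" for memory in memories)
    return f"Known facts:\n{context}\nUser: {user_text}"


def respond(model: LanguageModel, memories: list[str], user_text: str) -> str:
    """Engine-layer entry point; it depends only on the LanguageModel protocol."""
    return model.generate(build_prompt(memories, user_text))


print(respond(EchoModel(), ["Sam lives in Detroit"], "Where do I live?"))
# [echo] User: Where do I live?
```

Swapping the backend then means providing another class with the same `generate` signature; `respond` and `build_prompt` stay unchanged.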
This project explores:
- Offline-first AI — Bringing cloud-level capabilities to local devices
- Semantic memory architectures — Long-horizon conversational persistence
- Privacy-preserving AI — No compromise between capability and privacy
- Model-agnostic design — Engine layer decoupled from model layer
Samuel Sameer Tanguturi
Master of Science in Information Systems
Central Michigan University

Contact: Tangu1s@cmich.edu
LinkedIn: linkedin.com/in/tanguturi-sameer
Project Started: October 2025
Proprietary — All rights reserved. This is a private research project.
Nura proves that truly private AI assistants are possible. No cloud required. No compromises on capability. Your memories, your device, your control.