Digital Persona: Data Ingestion Architecture

Eric Hackathorn edited this page Jul 20, 2025 · 1 revision

This document outlines the architecture of the Digital Persona’s data ingestion system. It is designed to flexibly import a wide range of personal data sources while respecting the project’s privacy-first principles.

🎯 Goals

  • Import personal content (emails, health logs, chat messages, calendar events, etc.)
  • Normalize into semantically structured formats (e.g. JSON-LD, ActivityStreams)
  • Store in user-owned, locally hosted memory vaults
  • Expose to the rest of the system through optional MCP interfaces

🔗 Supported Data Sources

  • Email: via IMAP, Gmail APIs, or Huginn
  • Calendar: Google Calendar, Apple Calendar (ICS exports, CalDAV)
  • Health & Fitness: MyFitnessPal, Apple Health, Fitbit (via API or scraper)
  • Chat Logs: Discord, SMS exports, WhatsApp (manual export), Limitless AI
  • Journaling/Writing: Obsidian markdown vaults, Google Docs, Notion
  • Media Metadata: EXIF from photos, YouTube watch history, Spotify playback

⚙️ Ingestion Pipeline

Each connector typically includes:

  • Fetcher: A script or agent that downloads raw data (via API, scraper, or sync)
  • Parser: Normalizes input to a memory object with metadata
  • Serializer: Converts memory object to JSON-LD/ActivityStreams
  • Storage Layer: Writes to persona_memory/<domain>/<source>/<timestamp>.json
  • Log Handler: Logs success/failure per run
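
The stages above can be sketched roughly as follows. This is a minimal illustration, not the project's actual connector code: the `MemoryObject` class, the raw record keys (`title`, `body`, `when`, `tags`), and the function names are all hypothetical. The fetcher is left out because it is source-specific; `run` simply takes whatever list of raw records a fetcher returned.

```python
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path

log = logging.getLogger("ingest")


@dataclass
class MemoryObject:
    """Hypothetical intermediate form produced by the parser."""
    name: str
    content: str
    published: datetime
    tags: list = field(default_factory=list)


def parse(raw: dict) -> MemoryObject:
    """Parser: normalize one raw record (assumed key names) into a memory object."""
    return MemoryObject(
        name=raw["title"],
        content=raw["body"],
        published=datetime.fromisoformat(raw["when"]),
        tags=list(raw.get("tags", [])),
    )


def serialize(memory: MemoryObject) -> dict:
    """Serializer: convert the memory object to an ActivityStreams-style dict."""
    published = memory.published.astimezone(timezone.utc)
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "name": memory.name,
        "content": memory.content,
        "published": published.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tag": memory.tags,
    }


def store(entry: dict, vault: Path, domain: str, source: str) -> Path:
    """Storage layer: write to <vault>/<domain>/<source>/<timestamp>.json."""
    stamp = entry["published"].replace(":", "")  # colons are not portable in filenames
    target = vault / domain / source / f"{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return target


def run(raw_records: list, vault: Path, domain: str, source: str) -> int:
    """Log handler: report per-record failures and a per-run summary."""
    written = 0
    for raw in raw_records:
        try:
            store(serialize(parse(raw)), vault, domain, source)
            written += 1
        except (KeyError, ValueError, OSError):
            log.exception("Failed to ingest a record from %s", source)
    log.info("%s/%s: wrote %d of %d records", domain, source, written, len(raw_records))
    return written
```

Keeping each stage as a separate function means a new connector would only need its own fetcher and parser; the serializer and storage layer could be shared.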

Example: MyFitnessPal Connector

  1. Authenticates using a browser cookie
  2. Scrapes daily entries
  3. Converts to structured JSON
  4. Writes to persona_memory/health/myfitnesspal/YYYY-MM-DD.json
  5. Exposes via local MCP server at /mcp/health/mfp/today
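
Steps 3 and 4 might look something like the sketch below. Authentication and scraping are omitted because they depend on the site's current markup; the function names and the shape of `entries` are assumptions made for illustration only.

```python
import json
from datetime import date
from pathlib import Path


def daily_path(vault: Path, day: date) -> Path:
    """Build <vault>/health/myfitnesspal/YYYY-MM-DD.json for the given day."""
    return vault / "health" / "myfitnesspal" / f"{day.isoformat()}.json"


def write_daily(vault: Path, day: date, entries: list) -> Path:
    """Write one day's structured entries to the vault, replacing any earlier file."""
    target = daily_path(vault, day)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return target
```

Because the file name is derived from the date alone, re-running the connector for the same day overwrites that day's file instead of creating duplicates.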

🗺️ Mermaid Diagram: Ingestion Flow

flowchart TD
    A[Personal Data Source] --> B[Connector Script]
    B --> C[Fetch Raw Data]
    C --> D[Parse and Normalize]
    D --> E[Semantic JSON Transformation]
    E --> F[Write to Local Memory Vault]
    F --> G{Expose via MCP?}
    G -->|Yes| H[Run MCP Server Endpoint]
    G -->|No| I[Archive Only]
    H --> J[Queried by Persona Core or RAG]

🧠 Memory Format

All ingested data is transformed into a semantic memory entry, such as:

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "name": "Weight entry",
  "content": "Weight: 171.2 lbs",
  "published": "2025-07-11T08:45:00Z",
  "tag": ["weight", "health", "myfitnesspal"]
}
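
As a rough illustration, an entry like the one above could be produced by a small helper. The `weight_note` function and its parameters are hypothetical; only the field names and values come from the sample entry.

```python
import json
from datetime import datetime, timezone


def weight_note(pounds: float, when: datetime, source: str) -> dict:
    """Build an ActivityStreams-style Note for a single weight reading."""
    published = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "name": "Weight entry",
        "content": f"Weight: {pounds:.1f} lbs",
        "published": published,
        "tag": ["weight", "health", source],
    }


entry = weight_note(
    171.2,
    datetime(2025, 7, 11, 8, 45, tzinfo=timezone.utc),
    "myfitnesspal",
)
print(json.dumps(entry, indent=2))
```

Normalizing the timestamp to UTC before formatting keeps the `published` values comparable across sources that report local time.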

🛡️ Privacy Enforcement

  • All data is pulled and stored locally by default
  • No cloud upload unless the user explicitly opts in
  • API connections use HTTPS or another encrypted transport
  • The persona_memory directory can optionally be encrypted
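
Two of these rules lend themselves to simple guard functions. The sketch below is one possible shape, assuming a hypothetical `PrivacySettings` object; none of these names come from the project itself.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PrivacySettings:
    """Hypothetical settings object; cloud upload is off unless opted in."""
    allow_cloud_upload: bool = False


def check_api_url(url: str) -> str:
    """Reject API endpoints that are not served over HTTPS."""
    if urlparse(url).scheme != "https":
        raise ValueError(f"Refusing non-HTTPS endpoint: {url}")
    return url


def check_upload(settings: PrivacySettings) -> None:
    """Raise unless the user has explicitly opted in to cloud upload."""
    if not settings.allow_cloud_upload:
        raise PermissionError("Cloud upload is disabled; the user has not opted in")
```

A connector could call `check_api_url` before every fetch, so a misconfigured plain-HTTP endpoint fails before any request is sent.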

🌐 Optional: MCP Exposure

  • For each memory domain (health, calendar, etc.), run a lightweight MCP server (FastAPI)

  • Each endpoint serves filtered and structured data

  • Examples:

    • /mcp/calendar/today
    • /mcp/health/latest
    • /mcp/logs/errors
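
The lookup behind an endpoint such as /mcp/health/latest could be as small as the function below, which a FastAPI route handler might then call. This is a sketch under two assumptions: the vault follows the persona_memory/<domain>/<source>/<timestamp>.json layout, and file names sort chronologically. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def latest_entry(vault: Path, domain: str) -> Optional[dict]:
    """Return the newest stored entry for a domain, or None if there is none."""
    domain_dir = vault / domain
    if not domain_dir.is_dir():
        return None
    # One level of source directories, skipping hidden files such as index files.
    candidates = [
        path for path in domain_dir.glob("*/*.json")
        if not path.name.startswith(".")
    ]
    if not candidates:
        return None
    # Timestamped names like 2025-07-11.json sort in chronological order.
    newest = max(candidates, key=lambda path: path.name)
    return json.loads(newest.read_text(encoding="utf-8"))
```

Keeping the lookup separate from the web framework makes it easy to test without starting a server.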

🧩 Extending Ingestion

To add a new source:

  1. Create a new connector script
  2. Follow the fetch → parse/normalize → serialize → store pattern
  3. Optionally expose data via MCP
  4. Register ingestion metadata in a .index.json file per domain for lookup
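
Step 4 might be handled by a helper along these lines. The index schema shown here (a `description` and a `registered` timestamp per source) is an assumption made for illustration; the page does not specify what .index.json contains.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def register_source(domain_dir: Path, source: str, description: str) -> dict:
    """Add or update one source in <domain_dir>/.index.json and return the index."""
    index_path = domain_dir / ".index.json"
    index = {}
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    index[source] = {
        "description": description,  # hypothetical field
        "registered": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    domain_dir.mkdir(parents=True, exist_ok=True)
    index_path.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")
    return index
```

Reading the existing index before writing means registering a new source does not discard entries for sources that were registered earlier.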

Next file: memory.md – will cover short-term and long-term memory mechanics.
