Digital Persona: Data Ingestion Architecture

This document outlines the architecture of the Digital Persona’s data ingestion system. It is designed to flexibly import a wide range of personal data sources while respecting the project’s privacy-first principles.

🎯 Goals

Import personal content (emails, health logs, chat messages, calendar events, etc.)
Normalize into semantically structured formats (e.g. JSON-LD, ActivityStreams)
Store in user-owned, locally hosted memory vaults
Expose to the rest of the system through optional MCP interfaces

🔗 Supported Data Sources

Email: via IMAP, Gmail APIs, or Huginn
Calendar: Google Calendar, Apple Calendar (ICS exports, CalDAV)
Health & Fitness: MyFitnessPal, Apple Health, Fitbit (via API or scraper)
Chat Logs: Discord, SMS exports, WhatsApp (manual export), Limitless AI
Journaling/Writing: Obsidian markdown vaults, Google Docs, Notion
Media Metadata: EXIF from photos, YouTube watch history, Spotify playback

⚙️ Ingestion Pipeline

Each connector typically includes:

Fetcher: A script or agent that downloads raw data (via API, scraper, or sync)
Parser: Normalizes raw input into a memory object with metadata
Serializer: Converts the memory object to JSON-LD/ActivityStreams
Storage Layer: Writes to persona_memory/<domain>/<source>/<timestamp>.json
Log Handler: Logs success/failure per run
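
The five components above can be wired together in a few lines. The following is a minimal sketch, not the project's actual code: the `run_connector` function, its parameters, and the convention that the parser attaches a `datetime` under a `timestamp` key are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime
from pathlib import Path
from typing import Callable, Iterable

logger = logging.getLogger("persona.ingestion")  # hypothetical logger name


def run_connector(
    fetch: Callable[[], Iterable[dict]],
    parse: Callable[[dict], dict],
    serialize: Callable[[dict], dict],
    vault_root: Path,
    domain: str,
    source: str,
) -> list[Path]:
    """Run one fetch -> parse -> serialize -> store pass and log the result."""
    written: list[Path] = []
    try:
        for raw in fetch():
            memory = parse(raw)           # normalized memory object
            document = serialize(memory)  # JSON-LD / ActivityStreams dict
            # Assumption: the parser attaches a datetime under "timestamp".
            stamp: datetime = memory["timestamp"]
            target = vault_root / domain / source / f"{stamp:%Y%m%dT%H%M%SZ}.json"
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(json.dumps(document, indent=2), encoding="utf-8")
            written.append(target)
    except Exception:
        # Log handler: record the failure for this run, then re-raise.
        logger.exception("ingestion failed: %s/%s", domain, source)
        raise
    logger.info("ingestion ok: %s/%s (%d entries)", domain, source, len(written))
    return written
```

Passing the fetcher, parser, and serializer in as plain callables keeps each connector small and lets the storage and logging steps be shared across sources.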

Example: MyFitnessPal Connector

Authenticates using a browser cookie
Scrapes daily entries
Converts to structured JSON
Writes to persona_memory/health/myfitnesspal/YYYY-MM-DD.json
Exposes via local MCP server at /mcp/health/mfp/today
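
The convert-and-write steps of this connector might look roughly like the sketch below. Authentication and scraping are left out. The shape of a scraped row (`name` and `calories` keys) and the function names are assumptions for illustration only; the output path follows the layout given above.

```python
import json
from datetime import date
from pathlib import Path


def daily_path(vault_root: Path, day: date) -> Path:
    """Return <vault_root>/health/myfitnesspal/YYYY-MM-DD.json."""
    return vault_root / "health" / "myfitnesspal" / f"{day.isoformat()}.json"


def write_daily_entries(vault_root: Path, day: date, scraped: list[dict]) -> Path:
    """Convert scraped rows to structured JSON and write one file for the day."""
    # Assumption: each scraped row carries "name" and "calories" as strings.
    entries = [
        {"name": row["name"].strip(), "calories": int(row["calories"])}
        for row in scraped
    ]
    target = daily_path(vault_root, day)
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = {"date": day.isoformat(), "entries": entries}
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```

Writing one file per day means a re-run for the same date overwrites that day's file rather than creating duplicates.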

🗺️ Mermaid Diagram: Ingestion Flow

flowchart TD
    A[Personal Data Source] --> B[Connector Script]
    B --> C[Fetch Raw Data]
    C --> D[Parse and Normalize]
    D --> E[Semantic JSON Transformation]
    E --> F[Write to Local Memory Vault]
    F --> G{Expose via MCP?}
    G -->|Yes| H[Run MCP Server Endpoint]
    G -->|No| I[Archive Only]
    H --> J[Queried by Persona Core or RAG]

🧠 Memory Format

All ingested data is transformed into a semantic memory entry, such as:

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "name": "Weight entry",
  "content": "Weight: 171.2 lbs",
  "published": "2025-07-11T08:45:00Z",
  "tag": ["weight", "health", "myfitnesspal"]
}
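
An entry of this shape can be produced by a small helper. This is a sketch under assumptions: the `make_note` function is hypothetical, and it only covers the fields shown in the example above.

```python
import json
from datetime import datetime, timezone


def make_note(name: str, content: str, published: datetime, tags: list[str]) -> dict:
    """Build an ActivityStreams-style Note shaped like the example entry."""
    if published.tzinfo is None:
        raise ValueError("published must be timezone-aware")
    # Normalize to UTC and use the same "Z"-suffixed format as the example.
    stamp = published.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "name": name,
        "content": content,
        "published": stamp,
        "tag": list(tags),
    }


entry = make_note(
    "Weight entry",
    "Weight: 171.2 lbs",
    datetime(2025, 7, 11, 8, 45, tzinfo=timezone.utc),
    ["weight", "health", "myfitnesspal"],
)
print(json.dumps(entry, indent=2))
```

Rejecting naive datetimes up front avoids silently storing local times as if they were UTC.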

🛡️ Privacy Enforcement

All data is pulled and stored locally by default
No cloud upload happens unless the user explicitly opts in
API connections use HTTPS or another encrypted transport
The persona_memory directory can optionally be encrypted
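
Two of these rules can be enforced with simple guards in code. The sketch below is illustrative: `PrivacyPolicy`, `require_https`, and `check_upload` are hypothetical names, and encrypting the persona_memory directory is not shown.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PrivacyPolicy:
    """User-controlled privacy settings (hypothetical structure)."""
    allow_cloud_upload: bool = False  # off unless the user opts in


def require_https(url: str) -> str:
    """Return the URL unchanged if it uses HTTPS, otherwise refuse it."""
    if urlparse(url).scheme != "https":
        raise ValueError(f"refusing non-HTTPS endpoint: {url}")
    return url


def check_upload(policy: PrivacyPolicy) -> None:
    """Raise unless the user has explicitly enabled cloud upload."""
    if not policy.allow_cloud_upload:
        raise PermissionError("cloud upload requires explicit opt-in")
```

A fetcher would call `require_https` on every API URL before connecting, and any sync step would call `check_upload` first, so the default configuration stays local-only.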

🌐 Optional: MCP Exposure

For each memory domain (health, calendar, etc.), run a lightweight MCP server (FastAPI)
Each endpoint serves filtered and structured data
Examples:
- /mcp/calendar/today
- /mcp/health/latest
- /mcp/logs/errors
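
A rough sketch of one such endpoint, /mcp/health/latest, is shown below. It only serves JSON over HTTP at that path and does not attempt any further protocol details. The `latest_entry` and `create_app` names are hypothetical, as is the assumption that file names under a domain sort chronologically.

```python
import json
from pathlib import Path
from typing import Optional

VAULT_ROOT = Path("persona_memory")


def latest_entry(domain_dir: Path) -> Optional[dict]:
    """Return the most recent JSON entry under a domain directory, if any."""
    if not domain_dir.is_dir():
        return None
    # Assumption: file names (timestamps or YYYY-MM-DD) sort chronologically.
    # Hidden files such as .index.json are skipped.
    files = sorted(
        (p for p in domain_dir.rglob("*.json") if not p.name.startswith(".")),
        key=lambda p: p.name,
    )
    if not files:
        return None
    return json.loads(files[-1].read_text(encoding="utf-8"))


def create_app(vault_root: Path = VAULT_ROOT):
    """Build the FastAPI app for the health domain."""
    # Imported lazily so latest_entry stays usable without FastAPI installed.
    from fastapi import FastAPI, HTTPException

    app = FastAPI(title="persona health endpoint")

    @app.get("/mcp/health/latest")
    def health_latest() -> dict:
        entry = latest_entry(vault_root / "health")
        if entry is None:
            raise HTTPException(status_code=404, detail="no health entries yet")
        return entry

    return app
```

The app returned by `create_app()` can be served by any ASGI server, such as uvicorn. Keeping the lookup in a plain function makes it testable without starting a server.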

🧩 Extending Ingestion

To add a new source:

Create a new connector script
Follow the fetch → parse → normalize → store pattern
Optionally expose data via MCP
Register ingestion metadata in a .index.json file per domain for lookup
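
The registration step could be as small as the sketch below. The layout of .index.json is not specified here, so the structure used (one record per source, keyed by source name) and the `register_source` helper are assumptions.

```python
import json
from pathlib import Path


def register_source(vault_root: Path, domain: str, source: str, metadata: dict) -> dict:
    """Add or update one source's record in <vault_root>/<domain>/.index.json."""
    index_path = vault_root / domain / ".index.json"
    index_path.parent.mkdir(parents=True, exist_ok=True)
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    index[source] = metadata  # assumed layout: one record per source name
    index_path.write_text(
        json.dumps(index, indent=2, sort_keys=True), encoding="utf-8"
    )
    return index
```

For example, `register_source(Path("persona_memory"), "health", "myfitnesspal", {"format": "daily"})` would create or update persona_memory/health/.index.json; the `format` key is purely illustrative.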

Next file: memory.md – will cover short-term and long-term memory mechanics.
