Your data is exposed. Your identity is mapped. Take it back.
Features • Architecture • Tech Stack • Getting Started • API Docs • Screenshots
DataReaper is a full-stack autonomous AI platform that hunts down your personal data across the web and forces its deletion, without you lifting a finger.
Data brokers silently collect, package, and sell your personal information. DataReaper fights back by combining advanced OSINT reconnaissance, AI-powered identity resolution, and automated multi-jurisdiction legal compliance to systematically eliminate your digital footprint.
Seed Input → OSINT Discovery → Identity Graph → Broker Detection → Legal Dispatch → Autonomous Follow-up
Intelligent OSINT Discovery
- Multi-platform account discovery seeded from email, phone, or username
- Username enumeration via Maigret across 3,000+ sites
- Anti-detection profile scraping with Playwright headless browser
- Web content extraction via Trafilatura and BeautifulSoup4
- Configurable probe depth: platform candidates, Maigret top-sites, max connections
- DuckDuckGo fallback, paste-site search, and search probe layers (feature-flagged)
AI-Powered Identity Resolution
- LLM-driven cross-platform data correlation (Groq / llama-3.3-70b-versatile)
- Force-directed interactive identity graph with node-edge visualization
- Nodes: seeds, discovered accounts, usernames, aliases, resolved attributes
- Edges: pivoted_to, discovered_username, found_on_broker, and more
- Real-time graph updates as new data surfaces during scanning
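The node and edge vocabulary above can be pictured with a small in-memory structure. This is a minimal sketch, not the project's actual model: the class names `Node`, `Edge`, and `IdentityGraph` and their fields are assumptions made for illustration; only the node kinds and edge relations come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    id: str
    kind: str   # e.g. "seed", "account", "username", "alias", "attribute"
    label: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str  # e.g. "pivoted_to", "discovered_username", "found_on_broker"


@dataclass
class IdentityGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # Keep the first version of a node so repeated discoveries
        # during a scan do not create duplicates.
        self.nodes.setdefault(node.id, node)

    def add_edge(self, edge: Edge) -> None:
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        if edge not in self.edges:
            self.edges.append(edge)

    def neighbors(self, node_id: str, relation: Optional[str] = None) -> List[Node]:
        # Follow outgoing edges, optionally filtered by relation name.
        return [
            self.nodes[e.target]
            for e in self.edges
            if e.source == node_id and (relation is None or e.relation == relation)
        ]
```

Deduplicating on insert is one way to keep the graph stable while new data streams in; a real-time view could then re-render only when `add_node` or `add_edge` actually changes something.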
Broker Detection & Verification
- Automated scanning against a catalog of 100+ data brokers
- Confidence-scored listing verification per broker
- Opt-out rule engine with broker-specific workflows (email / form / phone)
- Contact point discovery and validation
- YAML-driven broker catalog, easily extensible
Legal Automation Engine
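To illustrate how confidence-scored listings might feed the opt-out rule engine, here is a hedged sketch. The catalog entries are plain dictionaries standing in for parsed YAML; every field name, the example brokers, and the `0.7` threshold are assumptions, not values taken from DataReaper's catalog.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical stand-in for entries loaded from the YAML broker catalog.
CATALOG: List[Dict] = [
    {"name": "ExampleBroker", "opt_out": {"method": "email", "contact": "privacy@examplebroker.test"}},
    {"name": "SampleSearch", "opt_out": {"method": "form", "contact": "https://samplesearch.test/opt-out"}},
]


@dataclass
class Listing:
    broker: str
    confidence: float  # 0.0 to 1.0, how sure we are the listing is the user


def plan_opt_outs(
    listings: List[Listing],
    catalog: List[Dict],
    min_confidence: float = 0.7,  # assumed threshold, tune per deployment
) -> List[Tuple[str, str, str]]:
    """Return (broker, method, contact) for listings worth acting on."""
    by_name = {entry["name"]: entry for entry in catalog}
    plan = []
    for listing in listings:
        entry = by_name.get(listing.broker)
        if entry is None or listing.confidence < min_confidence:
            continue  # unknown broker or weak match: skip rather than misfire
        rule = entry["opt_out"]
        plan.append((listing.broker, rule["method"], rule["contact"]))
    return plan
```

Skipping low-confidence matches avoids sending a deletion request about someone who merely shares a name with the user.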
- Multi-jurisdiction compliance: GDPR, CCPA, DPDP Act (India)
- Automated legal notice generation via AI Legal Agent
- Escalation workflows for non-compliant brokers
- Full audit trail and compliance tracking
- Configurable default jurisdiction per deployment
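A configurable default jurisdiction could be resolved along these lines. This is a sketch under stated assumptions: the function name `resolve_rules_file` and the fallback behaviour are hypothetical; the rule file names mirror those listed under `data/legal/` in the project structure.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# File names as listed under data/legal/ in the project layout.
RULE_FILES = {
    "GDPR": "gdpr_rules.yaml",
    "CCPA": "ccpa_rules.yaml",
    "DPDP": "dpdp_rules.yaml",
}


def resolve_rules_file(
    requested: Optional[str],
    default: str = "DPDP",
    base: str = "data/legal",
) -> Tuple[str, str]:
    """Pick the jurisdiction to apply and the rules file that backs it."""
    default_key = default.strip().upper()
    if default_key not in RULE_FILES:
        raise ValueError(f"unsupported default jurisdiction: {default!r}")
    key = (requested or default_key).strip().upper()
    if key not in RULE_FILES:
        key = default_key  # assumed policy: unknown requests fall back to the default
    return key, str(PurePosixPath(base) / RULE_FILES[key])
```

Falling back silently is only one possible policy; a stricter deployment might prefer to reject unknown jurisdictions outright.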
Autonomous Email Warfare
- Gmail OAuth 2.0 integration for sending and receiving
- AI intent classification for incoming broker emails
- Context-aware reply generation with objection handling
- Thread continuity and conversation memory
- Attachment handling for ID verification requests
- Periodic inbox sync (every 5 minutes) via background scheduler
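Thread continuity in email generally relies on the `In-Reply-To` and `References` headers. The sketch below shows that mechanism with Python's standard `email` package; the helper name `build_reply` is an assumption, and how DataReaper's Gmail client actually threads messages is not specified here.

```python
from email.message import EmailMessage


def build_reply(original: EmailMessage, body: str, sender: str) -> EmailMessage:
    """Build a reply that mail clients will group with the original thread."""
    reply = EmailMessage()
    reply["From"] = sender
    reply["To"] = str(original["Reply-To"] or original["From"])

    subject = str(original["Subject"] or "")
    reply["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"

    message_id = original["Message-ID"]
    if message_id:
        # In-Reply-To points at the message being answered; References
        # carries the whole chain so the thread survives long exchanges.
        reply["In-Reply-To"] = str(message_id)
        references = original["References"]
        reply["References"] = (
            f"{references} {message_id}" if references else str(message_id)
        )

    reply.set_content(body)
    return reply
```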
Tripwire Chrome Extension
- Downloadable Chrome extension (datareaper-tripwire.zip) served directly from the API
- Real-time malicious site detection and threat logging
- Password field interception monitoring (block / allow tracking)
- Heartbeat-based session linking via short-lived Redis tokens
- Shield Logs dashboard: per-hostname threat events and password attempt history
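The short-lived token idea behind heartbeat-based session linking can be sketched without Redis. The class below is an in-memory stand-in, not the project's implementation: the name `ShieldTokenStore`, the 60-second TTL, and the choice to extend a token on each heartbeat are all assumptions. In Redis the same effect would typically come from a key with an expiry.

```python
import secrets
import time
from typing import Callable, Dict, Optional, Tuple


class ShieldTokenStore:
    """In-memory stand-in for short-lived tokens that would live in Redis."""

    def __init__(self, ttl_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens: Dict[str, Tuple[str, float]] = {}  # token -> (session_id, expires_at)

    def issue(self, session_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (session_id, self._clock() + self._ttl)
        return token

    def heartbeat(self, token: str) -> Optional[str]:
        """Return the linked session id and extend the token, or None if expired."""
        entry = self._tokens.get(token)
        if entry is None:
            return None
        session_id, expires_at = entry
        now = self._clock()
        if now >= expires_at:
            del self._tokens[token]  # expired: forget it, the extension must re-link
            return None
        self._tokens[token] = (session_id, now + self._ttl)
        return session_id
```

Injecting the clock keeps the expiry logic testable without sleeping.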
Shadow Browser
- Decoy persona engine: AI-generated fake identities browse in the background
- Randomizable personas with age, occupation, and interests
- Decoy session simulation to pollute data broker profiles
- Per-persona browsing history viewer with search and date grouping
- Toggle on/off from the dashboard; communicates with the Tripwire extension via postMessage
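As a rough picture of a randomizable persona with age, occupation, and interests, consider the sketch below. It deliberately swaps the AI generation described above for plain random sampling, so it shows only the data shape; the `Persona` class, the value pools, and the age range are invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Illustrative pools only; a real engine would generate these with an LLM.
OCCUPATIONS = ["Retired teacher", "Bridge player", "Line cook", "Surveyor", "Florist"]
INTERESTS = ["gardening", "chess", "birdwatching", "vintage cars", "knitting", "sailing"]


@dataclass(frozen=True)
class Persona:
    age: int
    occupation: str
    interests: Tuple[str, ...]


def random_persona(rng: random.Random) -> Persona:
    """Draw a decoy persona; passing a seeded rng makes the draw repeatable."""
    return Persona(
        age=rng.randint(18, 90),
        occupation=rng.choice(OCCUPATIONS),
        interests=tuple(rng.sample(INTERESTS, k=3)),
    )
```

Calling `random_persona(random.Random())` again would correspond to pressing a randomize button.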
Access Mirror
- Google OAuth 2.0 connect flow with PKCE
- Live scope analysis: maps granted scopes to risk levels (LOW / MEDIUM / HIGH)
- Per-app grant revocation with audit log
- Data export parser: upload Google Takeout (and Instagram, LinkedIn, Amazon, Spotify, Uber) archives up to 200 MB
- Extracts authorized OAuth apps from Google Takeout exports
- Persists reports to PostgreSQL with in-memory fallback
Real-Time Command Center
- Exposure metrics dashboard with live scan progress
- Activity timeline with chronological event feed
- Threat level assessment and prioritization
- WebSocket-powered live updates
- TanStack Query for optimistic UI and background refetching
Privacy War Room
- Centralized broker case management
- One-click deletion request dispatch
- Email thread viewer with AI-generated legal notices
- Escalation management for non-compliant brokers
- Batch operations across multiple broker cases
- Compliance deadline tracking and response time analytics
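Deadline tracking and response-time analytics reduce to simple date arithmetic. The following is a sketch with assumed names (`BrokerCase`, `overdue_cases`, `average_response_days`); the 30-day window is a placeholder parameter, not a statement of any statute's actual deadline.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class BrokerCase:
    broker: str
    sent_on: date
    responded_on: Optional[date] = None


def deadline(case: BrokerCase, window_days: int) -> date:
    return case.sent_on + timedelta(days=window_days)


def overdue_cases(cases: List[BrokerCase], today: date, window_days: int = 30) -> List[BrokerCase]:
    """Cases with no response after the window has passed (escalation candidates)."""
    return [
        c for c in cases
        if c.responded_on is None and today > deadline(c, window_days)
    ]


def average_response_days(cases: List[BrokerCase]) -> Optional[float]:
    """Mean days between dispatch and response, ignoring unanswered cases."""
    durations = [(c.responded_on - c.sent_on).days for c in cases if c.responded_on]
    return sum(durations) / len(durations) if durations else None
```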
┌──────────────────────────────────────────────────────────────────────┐
│                      Frontend (React 18 + Vite)                      │
│                                                                      │
│   Landing │ Onboarding │ Command Center │ Identity Graph             │
│   War Room │ Inbox │ Shield Logs │ Shadow Browser                    │
│   Access Mirror │ Google Auth Callback                               │
│                                                                      │
│   TanStack Query · React Router 7 · Radix UI · Motion                │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │ REST API + WebSocket
┌──────────────────────────────────▼───────────────────────────────────┐
│                       Backend (FastAPI 0.116+)                       │
│                                                                      │
│   /api/onboarding  /api/scans       /api/dashboard   /api/recon      │
│   /api/targets     /api/war-room    /api/inbox       /api/reports    │
│   /api/events      /api/shield      /api/access-mirror               │
│   /v1/content      /ws/*                                             │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                      Agent Orchestration                       │  │
│  │  Sleuth Agent · Legal Agent · Communications Agent             │  │
│  │  Prompt Manager · Agent Registry · Base Agent                  │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                         Core Services                          │  │
│  │  OSINT Engine · Broker Discovery · Email Sync                  │  │
│  │  Legal Compliance · Identity Resolution · Scraper              │  │
│  │  Access Mirror Parser · Report Builder                         │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │                Background Workers (ARQ + Redis)                │  │
│  │  run_osint_pipeline · discover_targets                         │  │
│  │  send_legal_requests · sync_inbox                              │  │
│  │  continue_battles · build_report_snapshot (cron)               │  │
│  │  cleanup_old_events (cron) · sync_active_scan_inboxes          │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
         PostgreSQL              Redis             Playwright
        (SQLAlchemy)          (ARQ Queue)      (Headless Browser)
1. Validate Seed → email / phone / username
2. Discover Accounts → Maigret + platform probes
3. Extract Usernames → cross-platform pivoting
4. Scrape Profiles → Playwright + Trafilatura
5. Resolve Identity → LLM correlation
6. Discover Targets → broker catalog matching
7. Generate Notices → Legal Agent (GDPR/CCPA/DPDP)
8. Send Requests → Gmail OAuth dispatch
9. Monitor Responses → inbox sync + AI triage
10. Escalate / Follow-up → Communications Agent
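Step 1 of the pipeline, seed validation, can be illustrated with a small classifier. This is a sketch only: the regular expressions are deliberately loose, the function name `classify_seed` is hypothetical, and the real validator may apply different rules.

```python
import re
from typing import Tuple

# Loose patterns for illustration; they are not full RFC or E.164 validators.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?\d{7,15}$")
USERNAME_RE = re.compile(r"^[A-Za-z0-9_.-]{3,30}$")


def classify_seed(raw: str) -> Tuple[str, str]:
    """Return (kind, normalized_value) for an email, phone, or username seed."""
    value = raw.strip()
    if EMAIL_RE.match(value):
        return "email", value.lower()
    # Drop common phone punctuation before testing for digits.
    digits = re.sub(r"[\s().-]", "", value)
    if PHONE_RE.match(digits):
        return "phone", digits
    if USERNAME_RE.match(value):
        return "username", value
    raise ValueError(f"unrecognized seed: {raw!r}")
```

Note the ordering: an all-digit handle of seven or more characters is treated as a phone number here, which is a trade-off a production validator might resolve by asking the user for the seed type.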
| Technology | Version | Purpose |
|---|---|---|
| React | 18.3.1 | UI framework |
| TypeScript | 5.0+ | Type safety |
| Vite | 6.4+ | Build tool |
| TanStack Query | 5.99+ | Server state |
| React Router | 7.13+ | Routing |
| Radix UI | Latest | Accessible primitives (30+ components) |
| Tailwind CSS | 4.1+ | Utility-first styling |
| Motion | 12.23+ | Animations |
| Recharts | 2.15+ | Data visualization |
| Axios | 1.15+ | HTTP client |
| Sonner | 2.0+ | Toast notifications |
| React Hook Form | 7.55+ | Form management |
| React DnD | 16.0.1 | Drag and drop |
| Technology | Version | Purpose |
|---|---|---|
| Python | 3.11+ | Core language |
| FastAPI | 0.116+ | Web framework |
| SQLAlchemy | 2.0+ | Async ORM |
| Pydantic | 2.11+ | Data validation |
| Alembic | 1.14+ | DB migrations |
| Uvicorn | 0.34+ | ASGI server |
| asyncpg | 0.30+ | Async PostgreSQL |
| ARQ | 0.26+ | Async task queue |
| APScheduler | 3.11+ | Job scheduling |
| Structlog | 25.3+ | Structured logging |
| Technology | Purpose |
|---|---|
| Groq (llama-3.3-70b) | LLM inference for agents |
| Playwright | Anti-detection browser automation |
| Maigret | Username OSINT (3,000+ sites) |
| BeautifulSoup4 | HTML parsing |
| Trafilatura | Web content extraction |
| curl-cffi | Anti-detection HTTP client |
| Technology | Purpose |
|---|---|
| PostgreSQL | Primary database |
| Redis | Cache + task queue backend |
| Gmail API | OAuth email send/receive |
| Google OAuth 2.0 | User auth + Access Mirror |
| Docker | Containerization |
7 Alembic migrations covering 20+ tables:
users scan_jobs seeds
discovered_accounts graph_nodes graph_edges
brokers broker_listings broker_cases
email_threads email_messages attachments
legal_requests audit_logs consent
activity_events report_snapshots agent_runs
scan_stages access_mirror_reports google_oauth_connections
- Python 3.11+
- Node.js 18+ and pnpm
- PostgreSQL database
- Redis server
- Groq API key
- Gmail API credentials (for email features)
# 1. Clone and enter backend
git clone https://github.com/yourusername/datareaper.git
cd datareaper/backend
# 2. Install dependencies (uv recommended)
pip install uv
uv sync
# 3. Configure environment
cp .env.example .env
Edit .env:
# Database
DATABASE_URL=postgresql+asyncpg://user:pass@localhost/datareaper
SYNC_DATABASE_URL=postgresql+psycopg://user:pass@localhost/datareaper
# Redis
REDIS_URL=redis://127.0.0.1:6379/0
# AI
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile
# Google Sign-In + Access Mirror
GOOGLE_CLIENT_ID=your_client_id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your_client_secret
# Gmail API (inbox/send features)
GMAIL_CLIENT_ID=your_gmail_client_id
GMAIL_CLIENT_SECRET=your_gmail_client_secret
GMAIL_SENDER_EMAIL=your_sender@gmail.com
GMAIL_SENDER_CLIENT_ID=your_sender_client_id
GMAIL_SENDER_CLIENT_SECRET=your_sender_client_secret
GMAIL_SENDER_REFRESH_TOKEN=your_sender_refresh_token
# App
APP_DEBUG=true
FRONTEND_URL=http://localhost:5173
DEFAULT_JURISDICTION=DPDP
# 4. Run migrations
alembic upgrade head
# 5. Seed data (optional)
python scripts/import_broker_catalog.py
python scripts/import_platform_catalog.py
python scripts/seed_demo_data.py
# 6. Start API + worker (Windows)
.\scripts\start_stack.ps1
# Or manually
uvicorn datareaper.main:app --reload --app-dir src --port 8000
arq datareaper.workers.scheduler.WorkerSettings
API available at http://localhost:8000
cd frontend
pnpm install
pnpm dev
Frontend available at http://localhost:5173
cd backend
python scripts/seed_demo_data.py
Populates sample scans, an identity graph, broker cases, and email threads for exploration.
With the backend running:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Health: http://localhost:8000/api/health
POST /api/onboarding/start Start a new privacy scan
GET /api/scans/{scan_id} Scan status and results
GET /api/dashboard/overview Exposure metrics
GET /api/recon/graph/{scan_id} Identity graph data
GET /api/targets/{scan_id} Discovered broker targets
GET /api/war-room/cases All broker cases
POST /api/war-room/cases/{id}/dispatch Send deletion request
GET /api/inbox/threads Email threads
POST /api/inbox/sync Sync Gmail inbox
GET /api/shield/status Tripwire extension status
POST /api/shield/token Issue shield token
GET /api/shield/download Download Tripwire extension
GET /api/access-mirror/google/config Google OAuth config
POST /api/access-mirror/google/connect Connect Google account
GET /api/access-mirror/google/grants View OAuth grants
POST /api/access-mirror/google/revoke Revoke app access
POST /api/access-mirror/parse Parse data export archive
GET /api/reports/{scan_id} Privacy report
WS /ws/scans/{scan_id} Real-time scan events
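A client for the endpoints above could start from a tiny request builder like the one below. It is a sketch using only the standard library: the request body fields (`seed`, `jurisdiction`) and the example scan id are guesses for illustration, so check the Swagger UI for the real schemas before relying on them.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def build_request(
    method: str,
    path: str,
    payload: Optional[Dict[str, Any]] = None,
    base_url: str = BASE_URL,
) -> Request:
    """Prepare (but do not send) a JSON request against the DataReaper API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(f"{base_url}{path}", data=data, headers=headers, method=method)


# Hypothetical payload: field names are assumptions, not the documented schema.
start_request = build_request(
    "POST", "/api/onboarding/start",
    {"seed": "jane@example.com", "jurisdiction": "GDPR"},
)
status_request = build_request("GET", "/api/scans/scan_123")
# With the backend running, urllib.request.urlopen(start_request) would send it.
```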
Hero section, problem statement, three-pillar feature showcase (Scan → Identify → Terminate), process flow visualization, and CTA.
Seed input (email / phone / username), jurisdiction selection (GDPR / CCPA / DPDP), privacy preferences, and scan initialization.
Live exposure metrics, active scan monitoring, threat assessment, quick actions, and chronological activity timeline.
Interactive force-directed graph showing seeds, discovered accounts, usernames, aliases, resolved attributes, and broker exposures, with real-time updates.
Broker target inventory, deletion campaign status, AI-generated legal notice viewer, escalation management, and batch operations.
Decoy persona engine running in the background: AI-generated fake identities browse the web so data brokers see someone else. Shows the currently active persona (Mabel Thornton, 82 · Bridge Player), their interests, a randomize button, the full decoy visit history with timestamps and favicons, simulated account sessions, and a searchable per-persona history panel.
Your data footprint, laid bare. The Google Hub connects your Google account and surfaces every OAuth grant with risk levels (HIGH / LOW) and a per-app Revoke button. The Universal Data Drop accepts Takeout exports from Google, Instagram, LinkedIn, Amazon, Spotify, Uber, and more; DataReaper parses the archive and shows exactly what each platform has built on you.
Tripwire threat log pulled live from the Chrome extension. Shows all malicious hostnames detected during browsing; select one to see Tripwire / malicious URL event counts, password interception stats (blocked vs allowed), the full malicious URL log with timestamps, and a per-attempt password field breakdown.
datareaper/
├── backend/
│   ├── src/datareaper/
│   │   ├── agents/          # Sleuth, Legal, Communications agents
│   │   ├── api/             # FastAPI routes (15 route modules)
│   │   ├── brokers/         # Broker catalog, discovery, opt-out rules
│   │   ├── comms/           # Gmail client, OAuth, intent classifier
│   │   ├── compliance/      # GDPR / CCPA / DPDP legal engine
│   │   ├── core/            # Config, logging, constants, IDs
│   │   ├── db/              # Models, repositories, session
│   │   ├── identity/        # Identity resolution
│   │   ├── integrations/    # Groq LLM, Playwright browser
│   │   ├── osint/           # Discovery pipeline
│   │   ├── scraper/         # Web scraping orchestration
│   │   ├── services/        # Business logic layer
│   │   └── workers/         # ARQ jobs + scheduler
│   ├── data/
│   │   ├── brokers/         # broker_catalog.yaml, opt_out_rules.yaml
│   │   ├── legal/           # gdpr_rules.yaml, ccpa_rules.yaml, dpdp_rules.yaml
│   │   ├── platforms/       # platform_selectors.yaml, probe catalogs
│   │   └── prompts/         # LLM prompt templates (10 prompts)
│   ├── migrations/          # 7 Alembic migrations
│   └── scripts/             # Import, seed, smoke test, replay utilities
├── frontend/
│   ├── src/
│   │   ├── components/      # 18 components including AnimatedDataReaperLogo
│   │   ├── pages/           # 9 pages
│   │   ├── lib/             # API client, hooks, WebSocket, session manager
│   │   └── styles/          # Tailwind, theme, fonts, cursor
│   └── public/
└── .codex_tmp_accessmirror/ # Access Mirror component staging blocks
- Google OAuth 2.0 for user authentication and Gmail access
- JWT-based session management with realtime tokens for WebSocket auth
- CORS configured for frontend origin + Chrome extension ID pattern
- Pydantic v2 input validation on all API endpoints
- SQLAlchemy ORM for SQL injection protection
- Audit trail for all compliance operations
- Consent tracking per user
- No third-party analytics: your data stays in your database
- Responsible use: scan only your own personal information
# Backend
cd backend
pytest
pytest --cov=datareaper --cov-report=html
# Frontend
cd frontend
pnpm test
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Commit: git commit -m 'Add your feature'
- Push: git push origin feature/your-feature
- Open a Pull Request
Before committing:
# Backend
ruff check .
mypy src/
# Frontend
pnpm lint
MIT License. See LICENSE for details.
Maigret · Playwright · FastAPI · React · Radix UI · Tailwind CSS · Groq







