NA0XY/Datareaper

DataReaper


Python FastAPI React TypeScript PostgreSQL Redis License


Your data is exposed. Your identity is mapped. Take it back.


Features • Architecture • Tech Stack • Getting Started • API Docs • Screenshots


⚡ What is DataReaper?

DataReaper is a full-stack autonomous AI platform that hunts down your personal data across the web and forces its deletion, without you lifting a finger.

Data brokers silently collect, package, and sell your personal information. DataReaper fights back by combining advanced OSINT reconnaissance, AI-powered identity resolution, and automated multi-jurisdiction legal compliance to systematically eliminate your digital footprint.

Seed Input  →  OSINT Discovery  →  Identity Graph  →  Broker Detection  →  Legal Dispatch  →  Autonomous Follow-up

✨ Features

🔍 Intelligent OSINT Discovery
  • Multi-platform account discovery seeded from email, phone, or username
  • Username enumeration via Maigret across 3,000+ sites
  • Anti-detection profile scraping with Playwright headless browser
  • Web content extraction via Trafilatura and BeautifulSoup4
  • Configurable probe depth: platform candidates, Maigret top-sites, max connections
  • DuckDuckGo fallback, paste-site search, and search probe layers (feature-flagged)
🧠 AI-Powered Identity Resolution
  • LLM-driven cross-platform data correlation (Groq / llama-3.3-70b-versatile)
  • Force-directed interactive identity graph with node-edge visualization
  • Nodes: seeds, discovered accounts, usernames, aliases, resolved attributes
  • Edges: pivoted_to, discovered_username, found_on_broker, and more
  • Real-time graph updates as new data surfaces during scanning
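
The node and edge vocabulary above could be modeled roughly as follows. This is a minimal sketch: the `GraphNode`, `GraphEdge`, and `IdentityGraph` names and their fields are assumptions made for illustration, not DataReaper's actual models. Only the edge label `discovered_username` and the node kinds are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GraphNode:
    node_id: str
    kind: str   # e.g. "seed", "account", "username", "alias", "attribute"
    label: str


@dataclass(frozen=True)
class GraphEdge:
    source: str
    target: str
    relation: str  # e.g. "pivoted_to", "discovered_username", "found_on_broker"


@dataclass
class IdentityGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> GraphNode
    edges: list = field(default_factory=list)   # list of GraphEdge

    def add_node(self, node: GraphNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, relation: str) -> None:
        # Refuse edges that point at unknown nodes so the graph stays consistent.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(GraphEdge(source, target, relation))

    def neighbors(self, node_id: str) -> list:
        # Follow outgoing edges only.
        return [self.nodes[e.target] for e in self.edges if e.source == node_id]


graph = IdentityGraph()
graph.add_node(GraphNode("n1", "seed", "alice@example.com"))
graph.add_node(GraphNode("n2", "username", "alice_dev"))
graph.add_edge("n1", "n2", "discovered_username")
```

A force-directed frontend would typically consume the same data as two flat lists (nodes and edges), which is why the sketch keeps them separate rather than nesting edges inside nodes.
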
🎯 Broker Detection & Verification
  • Automated scanning against a catalog of 100+ data brokers
  • Confidence-scored listing verification per broker
  • Opt-out rule engine with broker-specific workflows (email / form / phone)
  • Contact point discovery and validation
  • YAML-driven broker catalog, easily extensible
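
One way to picture confidence-scored listing verification is a weighted attribute match. The sketch below is an assumption, not the project's scoring logic: the `BrokerListing` fields, the `listing_confidence` helper, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BrokerListing:
    broker: str
    name: str
    city: str
    email: str


def listing_confidence(listing: BrokerListing, known: dict) -> float:
    """Return the weighted share of known attributes that the listing matches."""
    # Hypothetical weights: an email match is stronger evidence than a city match.
    weights = {"name": 0.4, "city": 0.2, "email": 0.4}
    score = 0.0
    for attr, weight in weights.items():
        expected = known.get(attr, "").strip().lower()
        found = getattr(listing, attr).strip().lower()
        if expected and expected == found:
            score += weight
    return round(score, 2)


listing = BrokerListing("ExampleBroker", "Alice Doe", "Pune", "other@example.com")
known = {"name": "alice doe", "city": "Pune", "email": "alice@example.com"}
confidence = listing_confidence(listing, known)  # name and city match, email does not
```

A case would then be opened only when the score clears a threshold chosen per deployment.
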
⚖️ Legal Automation Engine
  • Multi-jurisdiction compliance: GDPR, CCPA, DPDP Act (India)
  • Automated legal notice generation via AI Legal Agent
  • Escalation workflows for non-compliant brokers
  • Full audit trail and compliance tracking
  • Configurable default jurisdiction per deployment
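
As an illustration of per-jurisdiction notice generation, the sketch below picks a statute reference from a jurisdiction code and fills a plain template. The template wording and the `render_notice` helper are hypothetical. DataReaper generates notices through its AI Legal Agent, which this does not reproduce.

```python
from string import Template

# Erasure provisions per jurisdiction code. The codes mirror the README;
# the template wording below is illustrative only.
STATUTES = {
    "GDPR": "Article 17 of the General Data Protection Regulation",
    "CCPA": "Section 1798.105 of the California Civil Code (CCPA)",
    "DPDP": "Section 12 of the Digital Personal Data Protection Act, 2023",
}

NOTICE = Template(
    "To $broker,\n\n"
    "Under $statute, I request erasure of all personal data you hold "
    "about $subject.\n"
)


def render_notice(broker: str, subject: str, jurisdiction: str = "DPDP") -> str:
    try:
        statute = STATUTES[jurisdiction]
    except KeyError:
        raise ValueError(f"unsupported jurisdiction: {jurisdiction}") from None
    return NOTICE.substitute(broker=broker, statute=statute, subject=subject)


notice = render_notice("ExampleBroker", "Alice Doe", "GDPR")
```
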
📧 Autonomous Email Warfare
  • Gmail OAuth 2.0 integration for sending and receiving
  • AI intent classification for incoming broker emails
  • Context-aware reply generation with objection handling
  • Thread continuity and conversation memory
  • Attachment handling for ID verification requests
  • Periodic inbox sync (every 5 minutes) via background scheduler
🛡️ Tripwire Chrome Extension
  • Downloadable Chrome extension (datareaper-tripwire.zip) served directly from the API
  • Real-time malicious site detection and threat logging
  • Password field interception monitoring (block / allow tracking)
  • Heartbeat-based session linking via short-lived Redis tokens
  • Shield Logs dashboard: per-hostname threat events and password attempt history
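
Session linking with short-lived tokens can be sketched without Redis. The class below is an in-memory stand-in for keys written with an expiry; the `ShortLivedTokens` name, the single-use behavior, and the default TTL are assumptions, not the project's implementation.

```python
import secrets
import time
from typing import Optional


class ShortLivedTokens:
    """In-memory stand-in for Redis keys written with an expiry."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # token -> (session_id, expires_at)

    def issue(self, session_id: str) -> str:
        token = secrets.token_urlsafe(16)
        self._store[token] = (session_id, self._clock() + self._ttl)
        return token

    def redeem(self, token: str) -> Optional[str]:
        """Return the linked session ID once, or None if unknown or expired."""
        entry = self._store.pop(token, None)
        if entry is None:
            return None
        session_id, expires_at = entry
        return session_id if self._clock() < expires_at else None
```

In this sketch the dashboard would call `issue` for the logged-in session and the extension would present the token on its next heartbeat. Injecting `clock` makes expiry testable without sleeping.
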
👻 Shadow Browser
  • Decoy persona engine: AI-generated fake identities browse in the background
  • Randomizable personas with age, occupation, and interests
  • Decoy session simulation to pollute data broker profiles
  • Per-persona browsing history viewer with search and date grouping
  • Toggle on/off from the dashboard; communicates with the Tripwire extension via postMessage
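
A randomizable persona with age, occupation, and interests might look like the sketch below. The value pools and the `random_persona` helper are hypothetical; the README describes AI-generated personas, and a seeded random generator is swapped in here so the example stays reproducible.

```python
import random
from dataclasses import dataclass

# Hypothetical pools; a real deployment would draw from larger lists or an LLM.
OCCUPATIONS = ["Bridge Player", "Florist", "Retired Teacher", "Locksmith"]
INTERESTS = ["gardening", "chess", "birdwatching", "baking", "sailing", "jazz"]


@dataclass(frozen=True)
class Persona:
    age: int
    occupation: str
    interests: tuple


def random_persona(rng: random.Random) -> Persona:
    return Persona(
        age=rng.randint(18, 90),
        occupation=rng.choice(OCCUPATIONS),
        interests=tuple(rng.sample(INTERESTS, k=3)),  # three distinct interests
    )


persona = random_persona(random.Random(42))
```
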
🪞 Access Mirror
  • Google OAuth 2.0 connect flow with PKCE
  • Live scope analysis: maps granted scopes to risk levels (LOW / MEDIUM / HIGH)
  • Per-app grant revocation with audit log
  • Data export parser: upload Google Takeout (and Instagram, LinkedIn, Amazon, Spotify, Uber) archives up to 200 MB
  • Extracts authorized OAuth apps from Google Takeout exports
  • Persists reports to PostgreSQL with in-memory fallback
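
Scope analysis can be pictured as a lookup from granted scope to risk level, with a grant rated by its riskiest scope. The scope strings below are real Google OAuth scopes, but the risk assigned to each, the MEDIUM default for unknown scopes, and the `grant_risk` helper are assumptions for this sketch.

```python
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}

# Hypothetical risk assignments for a few Google OAuth scopes.
SCOPE_RISK = {
    "openid": "LOW",
    "https://www.googleapis.com/auth/userinfo.email": "LOW",
    "https://www.googleapis.com/auth/calendar.readonly": "MEDIUM",
    "https://www.googleapis.com/auth/gmail.readonly": "HIGH",
    "https://www.googleapis.com/auth/drive": "HIGH",
}


def grant_risk(scopes: list) -> str:
    """Rate a grant by its riskiest scope; unknown scopes default to MEDIUM."""
    levels = [SCOPE_RISK.get(scope, "MEDIUM") for scope in scopes]
    return max(levels, key=RISK_ORDER.__getitem__, default="LOW")


risk = grant_risk(["openid", "https://www.googleapis.com/auth/gmail.readonly"])
```
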
📊 Real-Time Command Center
  • Exposure metrics dashboard with live scan progress
  • Activity timeline with chronological event feed
  • Threat level assessment and prioritization
  • WebSocket-powered live updates
  • TanStack Query for optimistic UI and background refetching
🎖️ Privacy War Room
  • Centralized broker case management
  • One-click deletion request dispatch
  • Email thread viewer with AI-generated legal notices
  • Escalation management for non-compliant brokers
  • Batch operations across multiple broker cases
  • Compliance deadline tracking and response time analytics
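
Compliance deadline tracking reduces to date arithmetic once a response window is chosen per jurisdiction. The windows below are placeholder values for this sketch, and the helper names are hypothetical; confirm the statutory periods before relying on them.

```python
from datetime import date, timedelta

# Illustrative response windows in days; assumptions for this sketch only.
RESPONSE_WINDOW_DAYS = {"GDPR": 30, "CCPA": 45, "DPDP": 30}


def response_deadline(sent_on: date, jurisdiction: str) -> date:
    return sent_on + timedelta(days=RESPONSE_WINDOW_DAYS[jurisdiction])


def days_remaining(sent_on: date, jurisdiction: str, today: date) -> int:
    """Negative values mean the broker is overdue and the case could be escalated."""
    return (response_deadline(sent_on, jurisdiction) - today).days


remaining = days_remaining(date(2025, 1, 1), "CCPA", today=date(2025, 1, 31))
```
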

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        Frontend  (React 18 + Vite)                   │
│                                                                      │
│  Landing  │  Onboarding  │  Command Center  │  Identity Graph        │
│  War Room │  Inbox       │  Shield Logs     │  Shadow Browser        │
│  Access Mirror           │  Google Auth Callback                     │
│                                                                      │
│  TanStack Query  ·  React Router 7  ·  Radix UI  ·  Motion           │
└────────────────────────────┬─────────────────────────────────────────┘
                             │  REST API  +  WebSocket
┌────────────────────────────▼─────────────────────────────────────────┐
│                       Backend  (FastAPI 0.116+)                      │
│                                                                      │
│  /api/onboarding   /api/scans      /api/dashboard   /api/recon       │
│  /api/targets      /api/war-room   /api/inbox       /api/reports     │
│  /api/events       /api/shield     /api/access-mirror                │
│  /v1/content       /ws/*                                             │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │                     Agent Orchestration                      │    │
│  │   Sleuth Agent  ·  Legal Agent  ·  Communications Agent      │    │
│  │   Prompt Manager  ·  Agent Registry  ·  Base Agent           │    │
│  └──────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │                        Core Services                         │    │
│  │  OSINT Engine  ·  Broker Discovery  ·  Email Sync            │    │
│  │  Legal Compliance  ·  Identity Resolution  ·  Scraper        │    │
│  │  Access Mirror Parser  ·  Report Builder                     │    │
│  └──────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │              Background Workers  (ARQ + Redis)               │    │
│  │  run_osint_pipeline  ·  discover_targets                     │    │
│  │  send_legal_requests  ·  sync_inbox                          │    │
│  │  continue_battles  ·  build_report_snapshot (cron)           │    │
│  │  cleanup_old_events (cron)  ·  sync_active_scan_inboxes      │    │
│  └──────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          ▼                  ▼                  ▼
    PostgreSQL            Redis            Playwright
    (SQLAlchemy)       (ARQ Queue)       (Headless Browser)

Multi-Stage Scan Pipeline

1. Validate Seed          →  email / phone / username
2. Discover Accounts      →  Maigret + platform probes
3. Extract Usernames      →  cross-platform pivoting
4. Scrape Profiles        →  Playwright + Trafilatura
5. Resolve Identity       →  LLM correlation
6. Discover Targets       →  broker catalog matching
7. Generate Notices       →  Legal Agent (GDPR/CCPA/DPDP)
8. Send Requests          →  Gmail OAuth dispatch
9. Monitor Responses      →  inbox sync + AI triage
10. Escalate / Follow-up  →  Communications Agent
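
The stages above run in order, each building on what the previous one found. A minimal sketch of that shape, assuming each stage is an async function that takes and returns a context dict, could look like this. The stage bodies are placeholders and `run_pipeline` is a hypothetical name; the real pipeline runs as ARQ background jobs.

```python
import asyncio
from typing import Awaitable, Callable

Stage = Callable[[dict], Awaitable[dict]]


async def validate_seed(ctx: dict) -> dict:
    if "@" not in ctx["seed"]:
        raise ValueError("this sketch only accepts email seeds")
    return {**ctx, "seed_type": "email"}


async def discover_accounts(ctx: dict) -> dict:
    # Placeholder: a real stage would call Maigret and the platform probes.
    username = ctx["seed"].split("@")[0]
    return {**ctx, "accounts": [f"example.com/{username}"]}


async def run_pipeline(seed: str, stages: list) -> dict:
    ctx = {"seed": seed, "completed": []}
    for stage in stages:
        ctx = await stage(ctx)  # a failing stage stops the run
        ctx["completed"] = [*ctx["completed"], stage.__name__]
    return ctx


result = asyncio.run(
    run_pipeline("alice@example.com", [validate_seed, discover_accounts])
)
```

Recording completed stage names in the context is one simple way to feed a live progress display.
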

🛠️ Tech Stack

Frontend

| Technology | Version | Purpose |
| --- | --- | --- |
| ⚛️ React | 18.3.1 | UI framework |
| 🔷 TypeScript | 5.0+ | Type safety |
| ⚡ Vite | 6.4+ | Build tool |
| 🔄 TanStack Query | 5.99+ | Server state |
| 🛣️ React Router | 7.13+ | Routing |
| 🎨 Radix UI | Latest | Accessible primitives (30+ components) |
| 💨 Tailwind CSS | 4.1+ | Utility-first styling |
| 🎬 Motion | 12.23+ | Animations |
| 📊 Recharts | 2.15+ | Data visualization |
| 🌐 Axios | 1.15+ | HTTP client |
| 🔔 Sonner | 2.0+ | Toast notifications |
| 📋 React Hook Form | 7.55+ | Form management |
| 🖱️ React DnD | 16.0.1 | Drag and drop |

Backend

| Technology | Version | Purpose |
| --- | --- | --- |
| 🐍 Python | 3.11+ | Core language |
| 🚀 FastAPI | 0.116+ | Web framework |
| 🗄️ SQLAlchemy | 2.0+ | Async ORM |
| ✅ Pydantic | 2.11+ | Data validation |
| 🔄 Alembic | 1.14+ | DB migrations |
| 🌐 Uvicorn | 0.34+ | ASGI server |
| 🐘 asyncpg | 0.30+ | Async PostgreSQL |
| 📬 ARQ | 0.26+ | Async task queue |
| ⏰ APScheduler | 3.11+ | Job scheduling |
| 📝 Structlog | 25.3+ | Structured logging |

AI & Automation

| Technology | Purpose |
| --- | --- |
| 🤖 Groq (llama-3.3-70b) | LLM inference for agents |
| 🎭 Playwright | Anti-detection browser automation |
| 🔍 Maigret | Username OSINT (3,000+ sites) |
| 🌿 BeautifulSoup4 | HTML parsing |
| 📄 Trafilatura | Web content extraction |
| 🔒 curl-cffi | Anti-detection HTTP client |

Infrastructure

| Technology | Purpose |
| --- | --- |
| 🐘 PostgreSQL | Primary database |
| 🔴 Redis | Cache + task queue backend |
| 📧 Gmail API | OAuth email send/receive |
| 🔑 Google OAuth 2.0 | User auth + Access Mirror |
| 🐳 Docker | Containerization |

🗄️ Database Schema

7 Alembic migrations covering 20+ tables:

users                  scan_jobs              seeds
discovered_accounts    graph_nodes            graph_edges
brokers                broker_listings        broker_cases
email_threads          email_messages         attachments
legal_requests         audit_logs             consent
activity_events        report_snapshots       agent_runs
scan_stages            access_mirror_reports  google_oauth_connections

🚀 Getting Started

Prerequisites

  • Python 3.11+
  • Node.js 18+ and pnpm
  • PostgreSQL database
  • Redis server
  • Groq API key
  • Gmail API credentials (for email features)

Backend Setup

# 1. Clone and enter backend
git clone https://github.com/NA0XY/Datareaper.git
cd Datareaper/backend

# 2. Install dependencies (uv recommended)
pip install uv
uv sync

# 3. Configure environment
cp .env.example .env

Edit .env:

# Database
DATABASE_URL=postgresql+asyncpg://user:pass@localhost/datareaper
SYNC_DATABASE_URL=postgresql+psycopg://user:pass@localhost/datareaper

# Redis
REDIS_URL=redis://127.0.0.1:6379/0

# AI
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile

# Google Sign-In + Access Mirror
GOOGLE_CLIENT_ID=your_client_id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your_client_secret

# Gmail API (inbox/send features)
GMAIL_CLIENT_ID=your_gmail_client_id
GMAIL_CLIENT_SECRET=your_gmail_client_secret
GMAIL_SENDER_EMAIL=your_sender@gmail.com
GMAIL_SENDER_CLIENT_ID=your_sender_client_id
GMAIL_SENDER_CLIENT_SECRET=your_sender_client_secret
GMAIL_SENDER_REFRESH_TOKEN=your_sender_refresh_token

# App
APP_DEBUG=true
FRONTEND_URL=http://localhost:5173
DEFAULT_JURISDICTION=DPDP
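
A quick way to catch a misconfigured `.env` before starting the API is to validate the variables up front. The sketch below uses only the standard library. The `Settings` class, the `load_settings` helper, and the choice of which variables count as required are assumptions; the project's own configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed minimum set for this sketch; the real app may require more.
REQUIRED = ("DATABASE_URL", "REDIS_URL", "GROQ_API_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    groq_api_key: str
    groq_model: str
    app_debug: bool
    default_jurisdiction: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        groq_api_key=env["GROQ_API_KEY"],
        groq_model=env.get("GROQ_MODEL", "llama-3.3-70b-versatile"),
        app_debug=env.get("APP_DEBUG", "false").lower() == "true",
        default_jurisdiction=env.get("DEFAULT_JURISDICTION", "DPDP"),
    )
```
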

# 4. Run migrations
alembic upgrade head

# 5. Seed data (optional)
python scripts/import_broker_catalog.py
python scripts/import_platform_catalog.py
python scripts/seed_demo_data.py

# 6. Start API + worker (Windows)
.\scripts\start_stack.ps1

# Or manually
uvicorn datareaper.main:app --reload --app-dir src --port 8000
arq datareaper.workers.scheduler.WorkerSettings

API available at http://localhost:8000

Frontend Setup

cd frontend
pnpm install
pnpm dev

Frontend available at http://localhost:5173

Demo Mode

cd backend
python scripts/seed_demo_data.py

Populates sample scans, identity graph, broker cases, and email threads for exploration.


📚 API Documentation

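With the backend running, FastAPI serves interactive API docs by default at http://localhost:8000/docs (Swagger UI) and http://localhost:8000/redoc (ReDoc), unless the app overrides those paths.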

Key Endpoints

POST   /api/onboarding/start              Start a new privacy scan
GET    /api/scans/{scan_id}               Scan status and results
GET    /api/dashboard/overview            Exposure metrics
GET    /api/recon/graph/{scan_id}         Identity graph data
GET    /api/targets/{scan_id}             Discovered broker targets
GET    /api/war-room/cases                All broker cases
POST   /api/war-room/cases/{id}/dispatch  Send deletion request
GET    /api/inbox/threads                 Email threads
POST   /api/inbox/sync                    Sync Gmail inbox
GET    /api/shield/status                 Tripwire extension status
POST   /api/shield/token                  Issue shield token
GET    /api/shield/download               Download Tripwire extension
GET    /api/access-mirror/google/config   Google OAuth config
POST   /api/access-mirror/google/connect  Connect Google account
GET    /api/access-mirror/google/grants   View OAuth grants
POST   /api/access-mirror/google/revoke   Revoke app access
POST   /api/access-mirror/parse           Parse data export archive
GET    /api/reports/{scan_id}             Privacy report
WS     /ws/scans/{scan_id}               Real-time scan events
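
As a rough illustration of calling these endpoints, the sketch below builds requests with the standard library. The endpoint paths come from the table above, but the JSON field names (`seed`, `jurisdiction`) and the helper names are hypothetical; check the API schema for the actual request body.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_start_scan_request(seed: str, jurisdiction: str) -> urllib.request.Request:
    # The JSON field names here are assumptions, not the documented schema.
    body = json.dumps({"seed": seed, "jurisdiction": jurisdiction}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/onboarding/start",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_scan_status_request(scan_id: str) -> urllib.request.Request:
    return urllib.request.Request(f"{BASE_URL}/api/scans/{scan_id}", method="GET")


request = build_start_scan_request("alice@example.com", "GDPR")
# Sending requires a running backend:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```
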

🖼️ Screenshots

Landing Page

Hero section, problem statement, three-pillar feature showcase (Scan → Identify → Terminate), process flow visualization, and CTA.


Onboarding

Seed input (email / phone / username), jurisdiction selection (GDPR / CCPA / DPDP), privacy preferences, and scan initialization.


Command Center

Live exposure metrics, active scan monitoring, threat assessment, quick actions, and chronological activity timeline.


Identity Graph

Interactive force-directed graph showing seeds, discovered accounts, usernames, aliases, resolved attributes, and broker exposures, with real-time updates.


War Room

Broker target inventory, deletion campaign status, AI-generated legal notice viewer, escalation management, and batch operations.


Shadow Browser

Decoy persona engine running in the background: AI-generated fake identities browse the web so data brokers see someone else. Shows the currently active persona (Mabel Thornton, 82 · Bridge Player), their interests, a randomize button, the full decoy visit history with timestamps and favicons, simulated account sessions, and a searchable per-persona history panel.


Access Mirror

Your data footprint, laid bare. The Google Hub connects your Google account and surfaces every OAuth grant with risk levels (HIGH / LOW) and a per-app Revoke button. The Universal Data Drop accepts Takeout exports from Google, Instagram, LinkedIn, Amazon, Spotify, Uber, and more; DataReaper parses the archive and shows exactly what each platform has built on you.


Shield Logs

Tripwire threat log pulled live from the Chrome extension. Shows all malicious hostnames detected during browsing; select one to see Tripwire / malicious URL event counts, password interception stats (blocked vs allowed), the full malicious URL log with timestamps, and a per-attempt password field breakdown.


📁 Project Structure

datareaper/
├── backend/
│   ├── src/datareaper/
│   │   ├── agents/          # Sleuth, Legal, Communications agents
│   │   ├── api/             # FastAPI routes (15 route modules)
│   │   ├── brokers/         # Broker catalog, discovery, opt-out rules
│   │   ├── comms/           # Gmail client, OAuth, intent classifier
│   │   ├── compliance/      # GDPR / CCPA / DPDP legal engine
│   │   ├── core/            # Config, logging, constants, IDs
│   │   ├── db/              # Models, repositories, session
│   │   ├── identity/        # Identity resolution
│   │   ├── integrations/    # Groq LLM, Playwright browser
│   │   ├── osint/           # Discovery pipeline
│   │   ├── scraper/         # Web scraping orchestration
│   │   ├── services/        # Business logic layer
│   │   └── workers/         # ARQ jobs + scheduler
│   ├── data/
│   │   ├── brokers/         # broker_catalog.yaml, opt_out_rules.yaml
│   │   ├── legal/           # gdpr_rules.yaml, ccpa_rules.yaml, dpdp_rules.yaml
│   │   ├── platforms/       # platform_selectors.yaml, probe catalogs
│   │   └── prompts/         # LLM prompt templates (10 prompts)
│   ├── migrations/          # 7 Alembic migrations
│   └── scripts/             # Import, seed, smoke test, replay utilities
├── frontend/
│   ├── src/
│   │   ├── components/      # 18 components including AnimatedDataReaperLogo
│   │   ├── pages/           # 9 pages
│   │   ├── lib/             # API client, hooks, WebSocket, session manager
│   │   └── styles/          # Tailwind, theme, fonts, cursor
│   └── public/
└── .codex_tmp_accessmirror/ # Access Mirror component staging blocks

🔒 Security & Privacy

  • Google OAuth 2.0 for user authentication and Gmail access
  • JWT-based session management with realtime tokens for WebSocket auth
  • CORS configured for frontend origin + Chrome extension ID pattern
  • Pydantic v2 input validation on all API endpoints
  • SQLAlchemy ORM for SQL injection protection
  • Audit trail for all compliance operations
  • Consent tracking per user
  • No third-party analytics: your data stays in your database
  • Responsible use: scan only your own personal information

🧪 Testing

# Backend
cd backend
pytest
pytest --cov=datareaper --cov-report=html

# Frontend
cd frontend
pnpm test

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit: git commit -m 'Add your feature'
  4. Push: git push origin feature/your-feature
  5. Open a Pull Request

Before committing:

# Backend
ruff check .
mypy src/

# Frontend
pnpm lint

📄 License

MIT License. See LICENSE for details.


🙏 Acknowledgments

Maigret · Playwright · FastAPI · React · Radix UI · Tailwind CSS · Groq


About

DataReaper is an autonomous, multi-agent AI system designed to be a personal privacy "Search & Destroy" unit
