🔍 AI-powered document analyzer that transforms PDFs, emails & scans into interactive knowledge graphs. Built with Claude AI for investigative journalism.

MrRemit/Intel-doc-analyzer

🔍 Intelligence Document Analyzer

🆓 100% FREE AI-Powered Document Analysis - No API Costs!

Transform thousands of unstructured documents (PDFs, emails, scans) into interactive, explorable knowledge graphs using FREE local AI (spaCy). No API keys, no costs, completely open-source. Built for investigative journalists, researchers, and anyone dealing with massive FOIA releases, government document dumps, or corporate leak archives.


🎯 The Problem

When governments and organizations release document dumps (Epstein files, JFK files, corporate leaks), they arrive as thousands of unstructured PDFs with:

  • ❌ No organization or categorization
  • ❌ No way to find connections between entities
  • ❌ Months of manual reading required
  • ❌ Easy to miss critical relationships

Examples: Epstein files, JFK assassination docs, Panama Papers, WikiLeaks cables, corporate FOIA releases


✨ The Solution

Intelligence Document Analyzer automatically:

  1. 📥 Ingests massive document collections (PDF, DOCX, email, scans)
  2. 🤖 Extracts entities using FREE local AI (spaCy):
    • People, organizations, locations
    • Events, dates, phone numbers, emails
    • Financial amounts, legal references
    • NO API costs, runs completely offline!
  3. 🔗 Maps relationships between entities:
    • Who knows whom
    • Who works where
    • Who attended what event
    • Document co-mentions
  4. 📊 Visualizes as interactive network graphs (Maltego-style)
  5. 🔍 Enables exploration:
    • Click any person → see all connections
    • Find shortest path between two entities (6 degrees of separation)
    • Filter by date, confidence, entity type
    • Export findings as reports
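
The "document co-mentions" idea in step 3 can be sketched in plain Python. This is a minimal illustration under assumptions, not the project's actual implementation: the `co_mention_edges` helper and the `doc_entities` mapping are hypothetical names, and the entities are assumed to have been extracted already.

```python
from collections import Counter
from itertools import combinations


def co_mention_edges(doc_entities):
    """Count how often each pair of entities appears in the same document.

    doc_entities: mapping of document id -> iterable of entity names.
    Returns a Counter keyed by alphabetically sorted (entity_a, entity_b) pairs.
    """
    edges = Counter()
    for entities in doc_entities.values():
        # De-duplicate within a document so repeated mentions count once.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical input: entity names already extracted per document.
doc_entities = {
    "memo_01.pdf": ["John Smith", "ACME Corp", "John Smith"],
    "memo_02.pdf": ["John Smith", "ACME Corp", "Jane Doe"],
}
edges = co_mention_edges(doc_entities)
```

Each pair's count can then serve as an edge weight: pairs that co-occur in many documents are stronger candidates for a real connection than pairs seen together once.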

💰 Cost Comparison:

  • This Tool (spaCy): $0 forever 🆓
  • Commercial alternatives: $50-200+ per 1,000 documents 💸
  • Optional Claude AI: Available if you want premium accuracy (requires API key)

🚀 Quick Start

Installation (100% FREE - No API Keys!)

# Clone the repository
git clone https://github.com/MrRemit/intel-doc-analyzer.git
cd intel-doc-analyzer

# Install Python dependencies (includes FREE spaCy)
pip install -r requirements.txt

# Download the small English spaCy model (one-time download)
python -m spacy download en_core_web_sm

# That's it! No API keys needed! 🎉

Basic Usage (FREE Mode)

# Analyze documents with FREE local AI (no costs!)
python src/cli.py analyze data/examples/sample_document.txt --output my_analysis

# That's it! Extracted entities, built graph, created visualization
# Total cost: $0.00 💰

Output:

  • data/graphs/my_analysis.json - Knowledge graph
  • data/graphs/my_analysis.png - Network visualization
  • Extracted entities & relationships saved

Query the Graph

# Find connections between entities
python src/cli.py query data/graphs/my_analysis.json "John Smith" "ACME Corp"

# Find most important entities
python src/cli.py centrality data/graphs/my_analysis.json --top 20

# Detect communities
python src/cli.py communities data/graphs/my_analysis.json
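
The README does not say which centrality measure the `centrality` command uses. As a rough sketch of the idea, the block below computes degree centrality (the share of other nodes each node is directly connected to) in plain Python. The `degree_centrality` function and the sample edges are illustrative assumptions, not the tool's code.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical edges, as they might come out of a small knowledge graph.
edges = [
    ("John Smith", "ACME Corp"),
    ("John Smith", "Jane Doe"),
    ("John Smith", "Tech Startup Inc"),
    ("Jane Doe", "ACME Corp"),
]
scores = degree_centrality(edges)
# Highest-scoring entities first, similar in spirit to "--top N".
top = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```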

📖 Usage Examples

Example 1: Analyze a FOIA Document Dump

# Process 1,000 PDFs from government release
python src/cli.py analyze \
    --input data/raw/foia_release/ \
    --output data/graphs/foia_analysis \
    --entity-types PERSON,ORGANIZATION,LOCATION,EVENT \
    --confidence-threshold 0.75

# Results:
# ✓ Extracted 5,432 entities
# ✓ Found 12,874 relationships
# ✓ Identified 23 communities
# ✓ Processing time: 2h 15m
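
To make the `--entity-types` and `--confidence-threshold` options concrete, here is a minimal sketch of the filtering they imply. The `Entity` dataclass and `filter_entities` function are hypothetical, and treating the threshold as inclusive is an assumption, since the README does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: str
    confidence: float


def filter_entities(entities, allowed_types, threshold):
    """Keep entities of an allowed type whose confidence meets the threshold."""
    # Parse a comma-separated list such as "PERSON,ORGANIZATION".
    allowed = {t.strip().upper() for t in allowed_types.split(",")}
    return [
        e for e in entities
        if e.entity_type in allowed and e.confidence >= threshold
    ]


# Hypothetical extraction results.
entities = [
    Entity("John Smith", "PERSON", 0.92),
    Entity("ACME Corp", "ORGANIZATION", 0.71),
    Entity("$2,000,000", "MONEY", 0.99),
]
kept = filter_entities(entities, "PERSON,ORGANIZATION,LOCATION,EVENT", 0.75)
```

In this sample, "ACME Corp" is dropped for low confidence and the MONEY entity is dropped because its type was not requested.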

Example 2: Find Connections Between Two People

from src.graph import GraphAnalyzer

graph = GraphAnalyzer.load("data/graphs/foia_analysis.graphml")

# Find shortest path between two entities
path = graph.shortest_path("John Smith", "ACME Corporation")
print(path)
# Output:
# John Smith → works_at → Tech Startup Inc → acquired_by → ACME Corporation
# (2 degrees of separation)
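
For readers curious what a shortest-path query does under the hood, the sketch below runs a breadth-first search over a plain adjacency dictionary. It is an illustration only: the standalone `shortest_path` function and the `adjacency` data are assumptions and are not the `GraphAnalyzer` implementation. Here the number of hops is the number of edges on the path.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if source not in adjacency or target not in adjacency:
        return None
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical undirected graph stored as adjacency lists.
adjacency = {
    "John Smith": ["Tech Startup Inc"],
    "Tech Startup Inc": ["John Smith", "ACME Corporation"],
    "ACME Corporation": ["Tech Startup Inc"],
    "Jane Doe": [],
}
path = shortest_path(adjacency, "John Smith", "ACME Corporation")
hops = len(path) - 1
```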

Example 3: Python API

from src.ingestion import DocumentProcessor
from src.extraction import EntityExtractor
from src.graph import GraphBuilder

# 1. Process PDF
processor = DocumentProcessor()
chunks = processor.process_pdf("data/raw/document.pdf")

# 2. Extract entities with Claude AI
extractor = EntityExtractor(api_key="your_key")
entities, relationships = extractor.extract(chunks)

# 3. Build graph
builder = GraphBuilder()
builder.add_entities(entities)
builder.add_relationships(relationships)
builder.save("data/graphs/my_graph.graphml")

🏗️ Architecture

┌─────────────────────┐
│  Document Dumps     │
│  (PDF, Email, Scan) │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  1. INGESTION       │
│  - PDF parsing      │
│  - Email parsing    │
│  - OCR (scans)      │
│  - Text chunking    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  2. AI EXTRACTION   │
│  - Claude API       │
│  - Entity NER       │
│  - Relationships    │
│  - Confidence       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  3. GRAPH DATABASE  │
│  - NetworkX (MVP)   │
│  - Neo4j (Prod)     │
│  - Deduplication    │
│  - Merging          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  4. VISUALIZATION   │
│  - Interactive web  │
│  - Cytoscape.js     │
│  - Filtering        │
│  - Export reports   │
└─────────────────────┘
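
The "Text chunking" step in the ingestion stage can be pictured with a small sketch. It assumes fixed-size character windows with a configurable overlap. The README does not describe the real chunking strategy, so `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into chunks of at most max_chars, overlapping by `overlap`."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a chunk has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


# 250 characters of sample text, split into 100-character windows.
chunks = chunk_text("abcdefghij" * 25, max_chars=100, overlap=20)
```

The overlap means an entity name that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text that the deduplication stage would need to handle.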

🎨 Features

Current Features (MVP)

  • ✅ PDF text extraction
  • ✅ Claude AI entity extraction (PERSON, ORGANIZATION, LOCATION)
  • ✅ Relationship mapping
  • ✅ NetworkX graph storage
  • ✅ Basic web visualization
  • ✅ CLI interface

Coming Soon (v1.0)

  • 🔜 All document formats (DOCX, email, MBOX)
  • 🔜 OCR for scanned documents
  • 🔜 Neo4j production database
  • 🔜 Advanced filtering & search
  • 🔜 Timeline view (temporal analysis)
  • 🔜 Export to PDF reports
  • 🔜 Docker deployment

Future (v2.0+)

  • 💡 Coreference resolution (merge "John Smith" and "J. Smith")
  • 💡 Entity disambiguation (link to Wikipedia, Wikidata)
  • 💡 Multi-language support
  • 💡 Collaborative annotation
  • 💡 Machine learning entity ranking
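
To give a feel for the "John Smith" / "J. Smith" merging problem, here is a deliberately naive alias-grouping heuristic. It is not coreference resolution and is not part of the project: `name_key` and `merge_aliases` are hypothetical helpers that group names sharing a first initial and a last name.

```python
def name_key(name):
    """Reduce a personal name to (first initial, last name), lower-cased."""
    parts = name.replace(".", " ").split()
    if not parts:
        return None
    return (parts[0][0].lower(), parts[-1].lower())


def merge_aliases(names):
    """Group names that share a first initial and last name."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return list(groups.values())


groups = merge_aliases(["John Smith", "J. Smith", "Jane Doe", "john smith"])
```

The heuristic would also merge "John Smith" with "Jane Smith", which is exactly why real coreference resolution needs document context and not just string matching.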

🛠️ Technology Stack

Backend (Python 3.9+)

  • AI/NLP: spaCy (default, local entity extraction); Anthropic Claude API (optional)
  • Document Processing: PyMuPDF, pdfplumber, python-docx
  • OCR: Tesseract, pytesseract
  • Graph: NetworkX (MVP), Neo4j (production)
  • API: FastAPI + uvicorn
  • Data Validation: Pydantic

Frontend (Web)

  • Framework: React + TypeScript
  • Visualization: Cytoscape.js (interactive network graphs)
  • UI: Tailwind CSS
  • State Management: Zustand

Infrastructure

  • Containerization: Docker
  • Database: SQLite (metadata), PostgreSQL (optional)
  • Cache: Redis (optional)

📊 Entity Types Supported

Core Entities

  • PERSON: Individuals mentioned in documents
  • ORGANIZATION: Companies, agencies, groups, NGOs
  • LOCATION: Cities, countries, addresses, buildings
  • EVENT: Meetings, transactions, incidents, conferences
  • DOCUMENT: Referenced documents, files, reports
  • DATE: Specific dates or date ranges
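
spaCy's English models emit OntoNotes-style labels such as `ORG` and `GPE`, not the type names listed above, so some translation layer is needed. The mapping below is an assumed sketch of how that could look. `SPACY_TO_README` and `normalize_label` are hypothetical, and the README's DOCUMENT type has no direct spaCy label, so it is left out.

```python
# Assumed mapping from spaCy entity labels to the core types in this README.
SPACY_TO_README = {
    "PERSON": "PERSON",
    "ORG": "ORGANIZATION",
    "GPE": "LOCATION",   # countries, cities, states
    "LOC": "LOCATION",   # non-GPE locations such as mountain ranges
    "FAC": "LOCATION",   # buildings, airports, bridges
    "EVENT": "EVENT",
    "DATE": "DATE",
}


def normalize_label(spacy_label):
    """Return the README entity type for a spaCy label, or None to skip it."""
    return SPACY_TO_README.get(spacy_label)


# (text, label) pairs shaped like spaCy's (ent.text, ent.label_).
raw = [
    ("John Smith", "PERSON"),
    ("ACME Corp", "ORG"),
    ("Paris", "GPE"),
    ("three", "CARDINAL"),
]
typed = [
    (text, normalize_label(label))
    for text, label in raw
    if normalize_label(label) is not None
]
```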

Additional Entities

  • PHONE: Phone numbers
  • EMAIL: Email addresses
  • MONEY: Financial amounts (USD, EUR, etc.)
  • LEGAL: Case numbers, statutes, regulations
  • VEHICLE: License plates, aircraft tail numbers
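
Pattern-like entities such as EMAIL and PHONE are often caught with regular expressions instead of a statistical model. The sketch below assumes that approach. The patterns are intentionally narrow (the phone pattern only covers numbers written like 555-123-4567), and `extract_contacts` is a hypothetical helper, not the project's extractor.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Deliberately narrow: three digits, three digits, four digits.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def extract_contacts(text):
    """Return (type, value) pairs for e-mail addresses and phone numbers."""
    found = [("EMAIL", match) for match in EMAIL_RE.findall(text)]
    found += [("PHONE", match) for match in PHONE_RE.findall(text)]
    return found


sample = "Contact j.smith@example.org or call 555-123-4567 before Friday."
contacts = extract_contacts(sample)
```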

Relationship Types

  • works_at, employed_by
  • located_in, based_in
  • attended, participated_in
  • mentioned_in (document co-occurrence)
  • associated_with (general connection)
  • owns, controls
  • transacted_with

🔐 Security & Privacy

Data Protection

  • ✅ In the default spaCy mode, all document processing is local (documents never leave your machine)
  • ✅ In the optional Claude mode, only extracted text chunks are sent to the Claude API, not the original files
  • ✅ .gitignore protects sensitive files
  • ✅ Optional PII anonymization before visualization
  • ✅ Audit logging for access tracking

Best Practices

  • 🔒 Never commit original documents to git
  • 🔒 Use .env for API keys (never hardcode)
  • 🔒 Sanitize data before sharing visualizations
  • 🔒 Enable authentication for web interface (production)
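
As a sketch of the "never hardcode" advice, the snippet below reads the key from the environment and falls back to the local engine when none is set. `ANTHROPIC_API_KEY` is the variable name the Anthropic SDK looks for by default. Whether this project reads the same variable is an assumption, and `load_api_key` is a hypothetical helper. Values in a `.env` file would first need to be loaded into the environment, for example with a tool such as python-dotenv.

```python
import os


def load_api_key(env=None, var_name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or None if it is not set."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    return key or None


# Fall back to the local spaCy engine when no key is configured.
# An empty mapping stands in for an environment without the variable.
engine = "claude" if load_api_key(env={}) else "spacy"
```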

📚 Documentation


🧪 Testing

# Run all tests
pytest tests/

# Run with coverage
pytest --cov=src tests/

# Run specific test module
pytest tests/test_entity_extraction.py

🤝 Contributing

Contributions welcome! This tool is built to help journalists, researchers, and transparency advocates.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

MIT License - See LICENSE for details

Security Notice: This software is intended for educational, research, and authorized investigative journalism purposes only. Users must comply with all applicable laws. Unauthorized access to confidential documents is illegal and unethical.

The authors are not responsible for misuse of this software.


🙏 Acknowledgments

  • Anthropic Claude - AI-powered entity extraction
  • Cytoscape.js - Network graph visualization
  • PyMuPDF - Fast PDF processing
  • NetworkX - Python graph library
  • Maltego - Inspiration for visualization UX

📞 Contact

Author: MrR3m1t
GitHub: @MrRemit
Repository: intel-doc-analyzer

Issues: Report bugs or request features via GitHub Issues


🎯 Use Cases

Investigative Journalism

  • Analyze leaked documents (Panama Papers style)
  • Map corporate/political networks
  • Find hidden connections in FOIA releases

Academic Research

  • Historical document analysis (declassified archives)
  • Social network studies
  • Computational journalism

Legal & Compliance

  • eDiscovery (find relevant entities in case files)
  • Regulatory compliance investigations
  • Fraud detection networks

Intelligence Analysis

  • OSINT (Open Source Intelligence)
  • Threat actor mapping
  • Attribution analysis

📈 Roadmap

Phase 1: MVP (Current)

  • Project structure
  • PDF ingestion
  • Claude API entity extraction
  • Basic graph storage
  • Simple web visualization
  • CLI interface

Phase 2: v1.0 (Q1 2025)

  • All document formats
  • Neo4j integration
  • Advanced filtering
  • Timeline view
  • PDF export
  • Docker deployment

Phase 3: v2.0 (Q2 2025)

  • OCR for scans
  • Coreference resolution
  • Entity disambiguation
  • Multi-language support
  • Collaborative features

Made with ❤️ for transparency and accountability


💡 FREE vs Premium Extraction

FREE Mode (Default - spaCy)

# Uses local spaCy AI - completely FREE
python src/cli.py analyze documents/ --engine spacy

✅ $0 cost
✅ Works offline
✅ No API keys
✅ 75-85% accuracy
✅ Fast processing

Premium Mode (Optional - Claude AI)

# Uses Claude API - costs money but higher accuracy
python src/cli.py analyze documents/ --engine claude --api-key sk-ant-...

💰 ~$0.10 per document
🌐 Requires internet
🔑 Needs API key
✅ 95%+ accuracy
✅ Better relationship extraction

Recommendation: Start with FREE mode. Only use Claude for critical documents where you need maximum accuracy.
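
To see what that recommendation means in dollars, here is a small worked estimate using the approximate $0.10 per document figure quoted above. The `estimate_claude_cost` helper and the document counts are hypothetical, and real API spend will vary with document length.

```python
def estimate_claude_cost(num_documents, cost_per_document=0.10):
    """Rough API spend in dollars at a flat per-document rate."""
    if num_documents < 0:
        raise ValueError("num_documents must be non-negative")
    return round(num_documents * cost_per_document, 2)


total_documents = 1000
critical_documents = 50  # hypothetical subset sent to Claude

# Hybrid: spaCy for everything, Claude only for the critical subset.
hybrid_cost = estimate_claude_cost(critical_documents)
# Sending every document through Claude instead.
all_claude_cost = estimate_claude_cost(total_documents)
```

Under these assumptions the hybrid approach costs about $5 versus about $100 for running the whole collection through Claude.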
