100% FREE AI-Powered Document Analysis - No API Costs!
Transform thousands of unstructured documents (PDFs, emails, scans) into interactive, explorable knowledge graphs using FREE local AI (spaCy). No API keys, no costs, completely open-source. Built for investigative journalists, researchers, and anyone dealing with massive FOIA releases, government document dumps, or corporate leak archives.
When governments and organizations release document dumps (Epstein files, JFK files, corporate leaks), they arrive as thousands of unstructured PDFs with:
- No organization or categorization
- No way to find connections between entities
- Months of manual reading required
- Easy to miss critical relationships
Examples: Epstein files, JFK assassination docs, Panama Papers, Wikileaks cables, corporate FOIA releases
Intelligence Document Analyzer automatically:
- Ingests massive document collections (PDF, DOCX, email, scans)
- Extracts entities using FREE local AI (spaCy; see the sketch after this list):
  - People, organizations, locations
  - Events, dates, phone numbers, emails
  - Financial amounts, legal references
  - NO API costs, runs completely offline!
- Maps relationships between entities:
  - Who knows whom
  - Who works where
  - Who attended what event
  - Document co-mentions
- Visualizes as interactive network graphs (Maltego-style)
- Enables exploration:
  - Click any person → see all connections
  - Find the shortest path between two entities (6 degrees of separation)
  - Filter by date, confidence, entity type
  - Export findings as reports
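To give a concrete feel for what the free engine does, here is a minimal spaCy NER sketch (illustrative only; the project's actual extraction pipeline lives in `src/extraction` and may differ):

```python
# Minimal local entity extraction with spaCy (illustrative sketch,
# not the project's actual extraction code).
import spacy

# The free ~40MB English model installed in the Quick Start below.
nlp = spacy.load("en_core_web_sm")

text = "John Smith met executives from ACME Corp in Geneva on 12 March 2019."
doc = nlp(text)

# spaCy tags people as PERSON, companies/agencies as ORG, places as GPE, and so on.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Everything above runs offline once the model is downloaded, which is why the default mode costs nothing.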
Cost comparison:

- This tool (spaCy): $0, forever
- Commercial alternatives: $50-200+ per 1,000 documents
- Optional Claude AI: available if you want premium accuracy (requires API key)
```bash
# Clone the repository
git clone https://github.com/MrRemit/intel-doc-analyzer.git
cd intel-doc-analyzer
# Install Python dependencies (includes FREE spaCy)
pip install -r requirements.txt
# Download spaCy language model (one-time, ~40MB)
python -m spacy download en_core_web_sm
# That's it! No API keys needed!
```
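A quick, optional sanity check that the model downloaded correctly (just a snippet, not part of the project's CLI):

```python
# Confirm the freshly downloaded model loads and includes the NER component.
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline components:", nlp.pipe_names)  # expect "ner" to be listed
```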
```bash
# Analyze documents with FREE local AI (no costs!)
python src/cli.py analyze data/examples/sample_document.txt --output my_analysis
# That's it! Extracted entities, built graph, created visualization
# Total cost: $0.00
```

Output:

- `data/graphs/my_analysis.json` - Knowledge graph
- `data/graphs/my_analysis.png` - Network visualization
- Extracted entities & relationships saved
```bash
# Find connections between entities
python src/cli.py query data/graphs/my_analysis.json "John Smith" "ACME Corp"
# Find most important entities
python src/cli.py centrality data/graphs/my_analysis.json --top 20
# Detect communities
python src/cli.py communities data/graphs/my_analysis.json
```
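For readers curious what the `centrality` and `communities` queries compute, here is a rough NetworkX sketch over a toy entity graph (illustrative only; the CLI's real implementation may differ):

```python
# Rough NetworkX sketch of the centrality/communities queries
# over a toy entity graph (not the CLI's actual code).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("John Smith", "ACME Corp"),
    ("John Smith", "Jane Doe"),
    ("Jane Doe", "ACME Corp"),
    ("ACME Corp", "Geneva"),
])

# "Most important entities": rank nodes by degree centrality.
ranked = sorted(nx.degree_centrality(G).items(), key=lambda kv: kv[1], reverse=True)
for node, score in ranked[:20]:
    print(f"{node}: {score:.2f}")

# "Communities": clusters of densely connected entities.
for i, members in enumerate(greedy_modularity_communities(G)):
    print(f"Community {i}: {sorted(members)}")
```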
```bash
# Process 1,000 PDFs from government release
python src/cli.py analyze \
--input data/raw/foia_release/ \
--output data/graphs/foia_analysis \
--entity-types PERSON,ORGANIZATION,LOCATION,EVENT \
--confidence-threshold 0.75
# Results:
# Extracted 5,432 entities
# Found 12,874 relationships
# Identified 23 communities
# Processing time: 2h 15m
```

```python
from src.graph import GraphAnalyzer
graph = GraphAnalyzer.load("data/graphs/foia_analysis.graphml")
# Find shortest path between two entities
path = graph.shortest_path("John Smith", "ACME Corporation")
print(path)
# Output:
# John Smith → works_at → Tech Startup Inc → acquired_by → ACME Corporation
# (3 degrees of separation)
```

```python
from src.ingestion import DocumentProcessor
from src.extraction import EntityExtractor
from src.graph import GraphBuilder
# 1. Process PDF
processor = DocumentProcessor()
chunks = processor.process_pdf("data/raw/document.pdf")
# 2. Extract entities with Claude AI
extractor = EntityExtractor(api_key="your_key")
entities, relationships = extractor.extract(chunks)
# 3. Build graph
builder = GraphBuilder()
builder.add_entities(entities)
builder.add_relationships(relationships)
builder.save("data/graphs/my_graph.graphml")
```
```
┌─────────────────────┐
│  Document Dumps     │
│  (PDF, Email, Scan) │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  1. INGESTION       │
│  - PDF parsing      │
│  - Email parsing    │
│  - OCR (scans)      │
│  - Text chunking    │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  2. AI EXTRACTION   │
│  - Claude API       │
│  - Entity NER       │
│  - Relationships    │
│  - Confidence       │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  3. GRAPH DATABASE  │
│  - NetworkX (MVP)   │
│  - Neo4j (Prod)     │
│  - Deduplication    │
│  - Merging          │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  4. VISUALIZATION   │
│  - Interactive web  │
│  - Cytoscape.js     │
│  - Filtering        │
│  - Export reports   │
└─────────────────────┘
```
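The "text chunking" step in the ingestion stage splits long documents into pieces small enough for entity extraction. A minimal sketch of one common approach, fixed-size chunks with overlap (the sizes here are illustrative assumptions, not the project's defaults):

```python
# Fixed-size chunking with overlap (illustrative; src/ingestion may use
# different sizes or a smarter, sentence-aware strategy).
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Overlap keeps entities that straddle a chunk boundary intact.
        start += chunk_size - overlap
    return chunks

sample = "John Smith attended the Geneva meeting. " * 200  # stand-in for extracted PDF text
print(len(chunk_text(sample)), "chunks")
```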
Implemented:
- PDF text extraction
- Claude AI entity extraction (PERSON, ORGANIZATION, LOCATION)
- Relationship mapping
- NetworkX graph storage
- Basic web visualization
- CLI interface

Planned:
- All document formats (DOCX, email, MBOX)
- OCR for scanned documents
- Neo4j production database
- Advanced filtering & search
- Timeline view (temporal analysis)
- Export to PDF reports
- Docker deployment

Future ideas:
- Coreference resolution (merge "John Smith" and "J. Smith")
- Entity disambiguation (link to Wikipedia, Wikidata)
- Multi-language support
- Collaborative annotation
- Machine learning entity ranking
Backend:
- AI/NLP: spaCy (free, local) and Anthropic Claude API (optional entity extraction)
- Document Processing: PyMuPDF, pdfplumber, python-docx
- OCR: Tesseract, pytesseract
- Graph: NetworkX (MVP), Neo4j (production)
- API: FastAPI + uvicorn
- Data Validation: Pydantic

Frontend:
- Framework: React + TypeScript
- Visualization: Cytoscape.js (interactive network graphs)
- UI: Tailwind CSS
- State Management: Zustand

Infrastructure:
- Containerization: Docker
- Database: SQLite (metadata), PostgreSQL (optional)
- Cache: Redis (optional)
- PERSON: Individuals mentioned in documents
- ORGANIZATION: Companies, agencies, groups, NGOs
- LOCATION: Cities, countries, addresses, buildings
- EVENT: Meetings, transactions, incidents, conferences
- DOCUMENT: Referenced documents, files, reports
- DATE: Specific dates or date ranges
- PHONE: Phone numbers
- EMAIL: Email addresses
- MONEY: Financial amounts (USD, EUR, etc.)
- LEGAL: Case numbers, statutes, regulations
- VEHICLE: License plates, aircraft tail numbers
- `works_at`, `employed_by`
- `located_in`, `based_in`
- `attended`, `participated_in`
- `mentioned_in` (document co-occurrence)
- `associated_with` (general connection)
- `owns`, `controls`
- `transacted_with`
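Since the stack lists Pydantic for data validation, here is a sketch of how entity and relationship records like these could be modelled (the field names are illustrative assumptions, not the project's actual schema):

```python
# Hypothetical Pydantic models for the entity/relationship schema above
# (field names are assumptions for illustration, not the project's schema).
from pydantic import BaseModel, Field

class Entity(BaseModel):
    name: str
    type: str                          # e.g. PERSON, ORGANIZATION, LOCATION
    confidence: float = Field(ge=0.0, le=1.0)
    source_document: str

class Relationship(BaseModel):
    source: str                        # entity name
    target: str
    relation: str                      # e.g. works_at, mentioned_in
    confidence: float = Field(ge=0.0, le=1.0)

rel = Relationship(source="John Smith", target="ACME Corp",
                   relation="works_at", confidence=0.82)
print(rel)
```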
- All document processing is local (documents never leave your machine)
- Claude API only receives text chunks, not full documents
- `.gitignore` protects sensitive files
- Optional PII anonymization before visualization (see the sketch below)
- Audit logging for access tracking

Best practices:

- Never commit original documents to git
- Use `.env` for API keys (never hardcode)
- Sanitize data before sharing visualizations
- Enable authentication for the web interface (production)
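One possible way to implement the PII anonymization mentioned above is to replace person names with stable pseudonyms before a visualization is shared; a small sketch (not necessarily how the project does it):

```python
# Replace person names with stable pseudonyms via a salted hash
# (one possible approach, not necessarily the project's implementation).
import hashlib

def pseudonymize(name: str, salt: str = "choose-your-own-salt") -> str:
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()[:8]
    return f"PERSON_{digest}"

names = ["John Smith", "Jane Doe", "John Smith"]
print([pseudonymize(n) for n in names])
# The same name always maps to the same pseudonym, so graph structure is preserved.
```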
- User Guide - Complete usage documentation
- API Reference - Python API docs
- Architecture - System design details
- Development - Contributing guide
- CLAUDE.md - AI context (full project specification)
```bash
# Run all tests
pytest tests/
# Run with coverage
pytest --cov=src tests/
# Run specific test module
pytest tests/test_entity_extraction.py
```

Contributions welcome! This tool is built to help journalists, researchers, and transparency advocates.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MIT License - See LICENSE for details
Security Notice: This software is intended for educational, research, and authorized investigative journalism purposes only. Users must comply with all applicable laws. Unauthorized access to confidential documents is illegal and unethical.
The authors are not responsible for misuse of this software.
- Anthropic Claude - AI-powered entity extraction
- Cytoscape.js - Network graph visualization
- PyMuPDF - Fast PDF processing
- NetworkX - Python graph library
- Maltego - Inspiration for visualization UX
Author: MrR3m1t
GitHub: @MrRemit
Repository: intel-doc-analyzer
Issues: Report bugs or request features via GitHub Issues
- Analyze leaked documents (Panama Papers style)
- Map corporate/political networks
- Find hidden connections in FOIA releases
- Historical document analysis (declassified archives)
- Social network studies
- Computational journalism
- eDiscovery (find relevant entities in case files)
- Regulatory compliance investigations
- Fraud detection networks
- OSINT (Open Source Intelligence)
- Threat actor mapping
- Attribution analysis
Phase 1: MVP (Current)
- Project structure
- PDF ingestion
- Claude API entity extraction
- Basic graph storage
- Simple web visualization
- CLI interface
Phase 2: v1.0 (Q1 2025)
- All document formats
- Neo4j integration
- Advanced filtering
- Timeline view
- PDF export
- Docker deployment
Phase 3: v2.0 (Q2 2025)
- OCR for scans
- Coreference resolution
- Entity disambiguation
- Multi-language support
- Collaborative features
Made with ❤️ for transparency and accountability
Free mode (spaCy):

```bash
# Uses local spaCy AI - completely FREE
python src/cli.py analyze documents/ --engine spacy
```

- $0 cost
- Works offline
- No API keys
- 75-85% accuracy
- Fast processing

Premium mode (Claude):

```bash
# Uses Claude API - costs money but higher accuracy
python src/cli.py analyze documents/ --engine claude --api-key sk-ant-...
```

- ~$0.10 per document
- Requires internet
- Needs API key
- 95%+ accuracy
- Better relationship extraction
Recommendation: Start with FREE mode. Only use Claude for critical documents where you need maximum accuracy.
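A minimal sketch of that "free first, premium when needed" pattern as a wrapper script (the helper name, paths, and environment variable are illustrative assumptions; the `--engine` and `--api-key` flags are the ones shown above):

```python
# Route routine documents to the free spaCy engine and only send
# critical ones to Claude when an API key is available (illustrative).
import os
import subprocess

def analyze(path: str, critical: bool = False) -> None:
    api_key = os.environ.get("ANTHROPIC_API_KEY")  # assumed variable name
    if critical and api_key:
        cmd = ["python", "src/cli.py", "analyze", path,
               "--engine", "claude", "--api-key", api_key]
    else:
        cmd = ["python", "src/cli.py", "analyze", path, "--engine", "spacy"]
    subprocess.run(cmd, check=True)

analyze("documents/routine_release/")               # free, local spaCy
analyze("documents/key_exhibits/", critical=True)   # Claude, if a key is set
```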