A hackathon-built AI-powered document analysis tool that transforms messy document dumps into an interactive detective-style investigation board.
Upload documents β extract people/places/organizations β build evidence-backed connections β explore them through a timeline-driven graph interface.
- Python 3.8+
- pip
-
Create and activate a virtual environment (recommended):
# Create virtual environment python3 -m venv venv # Activate it # On macOS/Linux: source venv/bin/activate # On Windows: # venv\Scripts\activate
-
Install dependencies:
pip install -r requirements.txt
-
Set up OpenAI API key (see section below)
-
Test the PDF parser:
# With OpenAI (default, requires API key): python scripts/parse_pdf.py sample_data/EFTA00011414.pdf -o output.json # Without LLM (simple extraction only): python scripts/parse_pdf.py sample_data/EFTA00011414.pdf -o output.json --no-llm
The PDF parser uses OpenAI's API for generating summaries and extracting dates. To use this feature:
-
Get an OpenAI API key:
- Sign up at https://platform.openai.com/
- Create an API key in your account settings
-
Set the API key as an environment variable:
# macOS/Linux: export OPENAI_API_KEY="sk-your-key-here" # Windows (PowerShell): $env:OPENAI_API_KEY="sk-your-key-here" # Or add to your ~/.zshrc or ~/.bashrc for persistence: echo 'export OPENAI_API_KEY="sk-your-key-here"' >> ~/.zshrc
-
Test the parser with OpenAI:
python scripts/parse_pdf.py sample_data/EFTA00011414.pdf -o output.json
The script will automatically use OpenAI if
OPENAI_API_KEYis set. By default, it usesgpt-4o-minifor cost efficiency.
The frontend is a Next.js application located in the frontend/ directory.
-
Navigate to the frontend directory:
cd frontend -
Install dependencies:
npm install
-
Run the development server:
npm run dev
-
Open your browser: Navigate to http://localhost:3000
The frontend will automatically read output.json from the project root after you've parsed some PDFs.
Backend:
pdfplumberfor PDF text extractionpython-dateutilfor date parsingopenaifor LLM-powered summarization and date extractiontransformersandtorch(optional, for local models)
Frontend:
- Next.js 14 with TypeScript
- React Flow for graph visualization
- Tailwind CSS for styling
Large investigative datasets (court filings, logs, reports, leaked archives, etc.) are difficult to navigate manually.
We wanted to build a system that helps users answer questions like:
- Who appears most often across documents?
- Which people are mentioned together?
- How do relationships change over time?
- What connections emerge when you zoom into a specific date range?
Inspired by investigative link analysis tools, we created a lightweight MVP in 24 hours.
This project takes in a collection of documents and automatically:
- People
- Organizations
- Locations
- Dates
- Entities mentioned in the same sentence are connected
- Each edge is backed by a source citation snippet
- Timeline scrubber filters connections over time
- Detective board graph updates dynamically
- Clicking an edge reveals the exact evidence text
- Clicking a node shows all mentions across documents
Documents are uploaded and converted into clean text.
- PDFs are parsed using lightweight extraction tools
- Each document is assigned a timestamp (from metadata or filename)
We run NLP-based entity extraction using spaCy:
PERSONORGGPE(locations)DATE
Each entity mention is stored with:
- Document ID
- Page number
- Exact text snippet
- Timestamp
Instead of guessing intent, we build only safe, source-grounded links:
- If two entities appear in the same sentence β connect them
- Edge weight increases with repeated co-occurrence
- Every edge stores the best supporting snippets
This produces an investigation graph like:
Person A ββ mentioned_with ββ Person B
(evidence sentence attached)
The key feature is time-based navigation:
- User selects a date window
- Graph updates to show only nodes/edges active in that period
- Relationships evolve as the timeline moves
flowchart LR
A[DOJ Website] -->|Scrape Documents| B[Document Scraper]
B -->|Raw PDFs/Documents| C[Raw Data Storage<br/>S3 / Local FS / Postgres]
style A fill:#e1f5ff,color:#000000
style C fill:#fff4e1,color:#000000
flowchart LR
D[PDF Parser<br/>pdfplumber] -->|Clean Text| E[NER Processing<br/>spaCy]
E -->|Extract Entities<br/>PERSON, ORG, GPE, DATE| F[Entity Mentions<br/>+ Evidence Snippets]
F -->|Build Relationships| G[Graph Construction<br/>Co-mention Analysis]
style E fill:#ffe1f5,color:#000000
flowchart LR
H[Graph Database<br/>Neo4j / Postgres + pgvector] -->|Query API| I[FastAPI Backend<br/>REST Endpoints]
I -->|Graph Data + Evidence| J[Frontend<br/>Next.js + React Flow]
J -->|Interactive Visualization| K[Detective Board UI<br/>Timeline + Graph]
style H fill:#e1ffe1,color:#000000
style J fill:#f5e1ff,color:#000000
style K fill:#ffe1e1,color:#000000
{
"doc_id": "doc_000123",
"source_type": "pdf|email|image|web",
"source_uri": "s3://bucket/000123.pdf",
"scraped_at": "2024-01-15T10:30:00Z",
"metadata": {
"title": "Document Title",
"date": "2004-04-12",
"author": null,
"url": "https://www.justice.gov/...",
"file_size": 245678,
"mime_type": "application/pdf"
},
"raw_content": "<binary or raw text>"
}{
"doc_id": "doc_000123",
"source_type": "pdf|email|image|web",
"source_uri": "s3://bucket/000123.pdf",
"page_id": "doc_000123#p04",
"text": "On April 12, 2004, John Doe met Jane Smith in Palm Beach...",
"page_number": 4,
"extracted_at": "2024-01-15T10:35:00Z",
"document_timestamp": "2004-04-12T00:00:00Z"
}{
"doc_id": "doc_000123",
"page_id": "doc_000123#p04",
"entities": [
{
"entity_id": "ent_001",
"text": "John Doe",
"type": "PERSON",
"start_char": 20,
"end_char": 28,
"confidence": 0.95,
"sentence": "On April 12, 2004, John Doe met Jane Smith in Palm Beach.",
"sentence_start": 0,
"sentence_end": 65
},
{
"entity_id": "ent_002",
"text": "Jane Smith",
"type": "PERSON",
"start_char": 33,
"end_char": 43,
"confidence": 0.94,
"sentence": "On April 12, 2004, John Doe met Jane Smith in Palm Beach.",
"sentence_start": 0,
"sentence_end": 65
},
{
"entity_id": "ent_003",
"text": "Palm Beach",
"type": "GPE",
"start_char": 47,
"end_char": 57,
"confidence": 0.92,
"sentence": "On April 12, 2004, John Doe met Jane Smith in Palm Beach.",
"sentence_start": 0,
"sentence_end": 65
},
{
"entity_id": "ent_004",
"text": "April 12, 2004",
"type": "DATE",
"start_char": 3,
"end_char": 17,
"confidence": 0.98,
"normalized_date": "2004-04-12",
"sentence": "On April 12, 2004, John Doe met Jane Smith in Palm Beach.",
"sentence_start": 0,
"sentence_end": 65
}
],
"processed_at": "2024-01-15T10:36:00Z"
}{
"id": "ent_001",
"name": "John Doe",
"type": "PERSON",
"canonical_id": "person_john_doe_001",
"first_seen": "2004-04-12T00:00:00Z",
"last_seen": "2004-08-15T00:00:00Z",
"mention_count": 47,
"documents": ["doc_000123", "doc_000456"],
"aliases": ["J. Doe", "John D."]
}{
"edge_id": "edge_001",
"src_entity_id": "ent_001",
"dst_entity_id": "ent_002",
"relationship_type": "mentioned_with",
"weight": 12,
"first_seen": "2004-04-12T00:00:00Z",
"last_seen": "2004-08-15T00:00:00Z",
"evidence": [
{
"doc_id": "doc_000123",
"page_id": "doc_000123#p04",
"snippet": "On April 12, 2004, John Doe met Jane Smith in Palm Beach.",
"timestamp": "2004-04-12T00:00:00Z",
"sentence_start": 0,
"sentence_end": 65
},
{
"doc_id": "doc_000456",
"page_id": "doc_000456#p12",
"snippet": "John Doe and Jane Smith attended the meeting together.",
"timestamp": "2004-05-20T00:00:00Z",
"sentence_start": 0,
"sentence_end": 52
}
],
"evidence_count": 12
}{
"nodes": [
{
"id": "ent_001",
"label": "John Doe",
"type": "PERSON",
"data": {
"mention_count": 47,
"first_seen": "2004-04-12T00:00:00Z",
"last_seen": "2004-08-15T00:00:00Z",
"documents": ["doc_000123", "doc_000456"]
}
},
{
"id": "ent_002",
"label": "Jane Smith",
"type": "PERSON",
"data": {
"mention_count": 32,
"first_seen": "2004-04-12T00:00:00Z",
"last_seen": "2004-07-10T00:00:00Z",
"documents": ["doc_000123", "doc_000789"]
}
}
],
"edges": [
{
"id": "edge_001",
"source": "ent_001",
"target": "ent_002",
"label": "mentioned_with",
"weight": 12,
"data": {
"evidence_count": 12,
"first_seen": "2004-04-12T00:00:00Z",
"last_seen": "2004-08-15T00:00:00Z"
}
}
],
"timeline_range": {
"start": "2004-04-12T00:00:00Z",
"end": "2004-08-15T00:00:00Z"
},
"filter_applied": {
"date_start": "2004-04-01T00:00:00Z",
"date_end": "2004-06-30T00:00:00Z"
}
}{
"edge_id": "edge_001",
"src_entity": {
"id": "ent_001",
"name": "John Doe",
"type": "PERSON"
},
"dst_entity": {
"id": "ent_002",
"name": "Jane Smith",
"type": "PERSON"
},
"evidence": [
{
"doc_id": "doc_000123",
"page_id": "doc_000123#p04",
"page_number": 4,
"snippet": "On April 12, 2004, John Doe met Jane Smith in Palm Beach.",
"timestamp": "2004-04-12T00:00:00Z",
"source_uri": "s3://bucket/000123.pdf",
"context_before": "...previous sentence...",
"context_after": "...next sentence..."
}
],
"total_evidence_count": 12
}- FastAPI β REST API for graph queries
- Python β NLP + data processing
- spaCy β Named Entity Recognition
- pdfplumber β PDF text extraction
- SQLite/Postgres β storage for entities + edges
- Next.js + React
- React Flow / Cytoscape.js β interactive graph visualization
- Timeline slider + evidence side panel
idnametypefirst_seen,last_seen
doc_idpagesnippettimestamp
src_entity_iddst_entity_idweightevidence[]
- Upload documents
- Entity extraction
- Co-mention relationship graph
- Timeline filtering
- Clickable evidence panel
- Investigation-board visualization
This tool is designed as a document navigation + evidence explorer, not an accusation engine.
Key safeguards:
- Connections represent only textual co-occurrence
- Every link is backed by an explicit citation snippet
- No suspicion scoring or automated conclusions
Appearance in a document β wrongdoing.
- Event extraction (meetings, flights, filings)
- Entity resolution + alias merging
- Semantic relationship extraction beyond co-mentions
- Scalable graph search (Neo4j / pgvector)
- Better support for OCR-scanned documents
- Collaboration + investigation workspaces
- Upload a document pack
- Entities are extracted automatically
- Timeline scrub filters the investigation board
- Click edges β view evidence
- Explore evolving connections over time
Built in 24 hours as a hackathon project to explore:
- NLP + entity extraction
- Graph-based document intelligence
- Timeline-driven investigation UX
MIT License β feel free to build on top of this project.
β¨ Turning unstructured documents into structured investigations.