Russian documentation: `README_RU.md`
This repository contains a graph-first retrieval-augmented generation (RAG) system for archive and legal-style documents. The backend combines LLM-based extraction, Neo4j graph retrieval, and Milvus vector search. The frontend provides a chat interface, archive card browsing, and graph visualization.
This English README is intended for public repository usage and reflects the current code state in this project.
Key points:

- retrieval is centered around Neo4j graph facts;
- responses and extraction pipelines use the OpenAI Responses API;
- archive card pipeline: file or text input -> extraction -> SQLite (`document_cards`) -> Neo4j triplets;
- duplicate protection exists for repeated uploads;
- API startup performs automatic initialization only for uninitialized files in `documents/`;
- manual archive reinitialization via `/v1/gallery/initialize` defaults to `mode=backward`.
## Table of Contents

- What the System Does
- UI Screenshots
- Current Architecture
- Components and Folder Structure
- Requirements
- Quick Start
- Environment Variables
- Running the API
- HTTP API: Full Overview
- Gallery Upload and Deduplication
- Graph and Archive Initialization
- Smoke Checks and Quality Validation
- Telegram Bot
- Data Stores
- Troubleshooting
- Known Limitations
## What the System Does

ArchiveGPT covers three major capabilities:
- Question answering over archival and legal-style materials.
- Fact extraction and graph retrieval over Neo4j triplets.
- Archive card lifecycle management through file/text ingestion.
Supported scenarios:
- plain text query;
- query over a local document;
- query over an image scan;
- streaming response over SSE;
- gallery card upload via multipart/form-data;
- archive card creation from raw text.
## UI Screenshots

- `Chat`, `Archive`, and `Map` now share one header component;
- header navigation is centered relative to the full page width;
- archive reinitialization is intentionally de-emphasized and moved into a compact overflow menu (`...`) with one action: `Reinitialize archive`.
## Current Architecture

```mermaid
flowchart LR
    U[Client] --> API[FastAPI api/app.py]
    API --> S[ArchiveGPTSearch]
    S --> LLM[LLMHelper / OpenAI Responses]
    S --> G[ArchiveGraphSearcher / Neo4j]
    S --> M[Milvus abstract template index]
    API --> GR[GalleryRepository / SQLite]
    API --> G
    D[Local docs/images] --> LLM
```
```mermaid
flowchart TD
    Q[User query] --> EX1[LLM: extract abstract templates]
    Q --> EX2[LLM: extract named entities]
    EX1 --> MT[Milvus: abstract template ANN search]
    MT --> GM[ArchiveGraphSearcher: template-based match]
    EX2 --> GM
    GM --> H{Concrete triplets found?}
    H -- Yes --> CTX[Build retrieval context]
    H -- No --> FB[Fallback: graph keyword search]
    FB --> CTX
    CTX --> FINAL[LLM final response]
```
For `type=llm`, the system follows graph-first retrieval:

- LLM extracts abstract relation templates from the query.
- LLM extracts named entities from the query.
- Abstract templates are matched against a pre-vectorized Milvus index.
- `ArchiveGraphSearcher` resolves matched templates to concrete relation edges in Neo4j.
- If template matching is empty, fallback keyword graph search is used.
- LLM generates the final answer from retrieval context.
The final prompt includes:

- relevant concrete triplets from retrieval;
- full text for top-ranked source documents (configured by `RAG_FULL_SOURCE_DOCS_K`);
- source-aware and global chunk retrieval from Milvus.
This keeps context focused while preserving factual grounding.
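The graph-first flow with its keyword fallback fits in a few lines. A minimal sketch follows; every helper name here (`extract_templates`, `match_templates`, `keyword_search`) is a hypothetical stand-in, not this repository's actual API:

```python
# Sketch of the graph-first retrieval flow with keyword fallback.
# Helper callables are injected so the example runs standalone;
# their names are illustrative, not the project's real interfaces.

def build_retrieval_context(query, extract_templates, extract_entities,
                            match_templates, keyword_search):
    """Return concrete triplets for the final LLM prompt."""
    templates = extract_templates(query)   # LLM: abstract relation templates
    entities = extract_entities(query)     # LLM: named entities
    # Milvus ANN search + Neo4j resolution, collapsed into one callable:
    triplets = match_templates(templates, entities)
    if not triplets:                       # empty match -> keyword fallback
        triplets = keyword_search(query)
    return triplets


# Usage with stub callables: template matching finds nothing,
# so the keyword fallback supplies the context.
ctx = build_retrieval_context(
    "Why was X arrested?",
    extract_templates=lambda q: [],
    extract_entities=lambda q: ["X"],
    match_templates=lambda t, e: [],
    keyword_search=lambda q: [("X", "ARRESTED_FOR", "reason")],
)
print(ctx)  # [('X', 'ARRESTED_FOR', 'reason')]
```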
```mermaid
flowchart LR
    T[Concrete triplets] --> TOP[Top source docs full text]
    T --> RANK[Ranked sources]
    RANK --> WIN[Sources rank 6..10]
    Q[Search queries] --> VS[Milvus chunk search]
    WIN --> SA[Source-aware chunks]
    VS --> SA
    VS --> GL[Independent best chunks]
    T --> P[Final LLM prompt]
    TOP --> P
    SA --> P
    GL --> P
```
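The rank partitioning above can be sketched as plain list slicing. The `2` and `6..10` boundaries mirror the defaults `RAG_FULL_SOURCE_DOCS_K=2`, `RAG_VECTOR_SOURCE_RANK_START=6`, and `RAG_VECTOR_SOURCE_RANK_END=10`; the function name is illustrative:

```python
# Illustrative split of triplet-ranked sources into the two context
# tiers from the flowchart: full-text docs and the source-aware window.

def split_ranked_sources(ranked_docs, full_k=2, window_start=6, window_end=10):
    """ranked_docs is ordered best-first (rank 1 == index 0)."""
    full_text_docs = ranked_docs[:full_k]                   # full-text tier
    window_docs = ranked_docs[window_start - 1:window_end]  # ranks 6..10
    return full_text_docs, window_docs

docs = [f"doc{i}" for i in range(1, 13)]
full, window = split_ranked_sources(docs)
print(full)    # ['doc1', 'doc2']
print(window)  # ['doc6', 'doc7', 'doc8', 'doc9', 'doc10']
```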
If Neo4j settings are missing or the connection fails, `ArchiveGraphSearcher` falls back to in-memory behavior.
This is useful for local smoke/dev sessions, but production should use stable Neo4j connectivity.
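The degrade-to-memory pattern looks roughly like this. The real `ArchiveGraphSearcher` is more involved; here `connect` is injected so the sketch runs without a Neo4j driver installed:

```python
# Hedged sketch of falling back to an in-memory store when a Neo4j
# connection cannot be opened. Names and structure are illustrative.

class GraphStore:
    def __init__(self, connect):
        try:
            self._driver = connect()   # e.g. neo4j.GraphDatabase.driver(...)
            self._memory = None
        except Exception:
            self._driver = None
            self._memory = []          # in-memory fallback list of triplets

    @property
    def in_memory(self):
        return self._driver is None

    def upsert(self, triplet):
        if self._memory is not None:
            self._memory.append(triplet)
        else:
            ...  # real path would issue a Cypher MERGE via self._driver


def failing_connect():
    raise ConnectionError("Neo4j unreachable")

store = GraphStore(failing_connect)
store.upsert(("A", "RELATED_TO", "B"))
print(store.in_memory)  # True
```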
## Components and Folder Structure

Core modules:

- `api/app.py`: FastAPI endpoints, query handling, gallery APIs, startup archive sync.
- `searchers/search.py`: retrieval orchestration and LLM-facing query flow.
- `aitools/llm.py`: OpenAI helper, extraction routines, triplet logic.
- `databases/graph_db.py`: Neo4j retrieval and upsert layer.
- `databases/graph_init.py`: deterministic graph/chunk initialization from local documents.
- `databases/milvus_db.py`: Milvus indexes for abstract templates and chunks.
- `databases/milvus_init.py`: abstract template synchronization into Milvus.
- `databases/gallery_db.py`: SQLite repository for archive cards.
- `bot/*`: Telegram bot runtime and handlers.
- `frontend/*`: Next.js frontend.
Simplified top-level tree:

```text
.
├── api/
│   └── app.py
├── aitools/
│   ├── embedder.py
│   └── llm.py
├── bot/
│   ├── bot.py
│   ├── handlers.py
│   ├── keyboards.py
│   ├── messages.py
│   └── states.py
├── confs/
│   └── config.py
├── databases/
│   ├── db.py
│   ├── graph_db.py
│   ├── graph_init.py
│   ├── gallery_db.py
│   ├── milvus_db.py
│   └── milvus_init.py
├── documents/
├── frontend/
├── searchers/
│   └── search.py
├── tools/
│   └── embedding_model_zip.py
├── requirements.txt
├── run_api.py
├── run_bot.py
└── run_service.py
```
## Requirements

- Python 3.10+
- pip
- OpenAI API key
- Neo4j 5.x (strongly recommended for complete retrieval quality)
- Node.js 20+ and npm (frontend)
- MySQL (only if running Telegram bot user/balance storage)
## Quick Start

From repository root:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

Use your own environment values (there is no mandatory public `.env.example` template in this repository snapshot).
Minimum recommended values:

```bash
OPENAI_API_KEY=...
OPENAI_MODEL_NANO=gpt-4.1-mini
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
NEO4J_DATABASE=neo4j
```

Optional: local embedding model ZIP workflow (to avoid online model pulls):

```bash
python3.10 -m tools.embedding_model_zip pack \
  --model google/embeddinggemma-300m \
  --output ./artifacts/embeddinggemma-300m.zip
```

Then add:

```bash
EMBEDDING_MODEL_ZIP_PATH=./artifacts/embeddinggemma-300m.zip
EMBEDDING_MODEL_EXTRACT_DIR=~/.cache/archive-gpt/embedding-models
```

Initialize the graph from local documents:

```bash
python -m databases.graph_init --documents-dir documents --smoke-check
```

Note: `databases.graph_init` currently defaults to `test_data/documents`, which is not present in this workspace. Pass `--documents-dir documents` explicitly.
Start the API:

```bash
uvicorn api.app:app --host 127.0.0.1 --port 8000
```

Swagger UI: http://127.0.0.1:8000/docs

Or use the service launcher:

```bash
python run_service.py
```

Production mode:

```bash
python run_service.py --mode prod
```

## Environment Variables

The table below summarizes key variables from `confs/config.py` and runtime entrypoints.
| Variable | Required | Default | Purpose |
|---|---|---|---|
| `OPENAI_API_KEY` | Yes | - | OpenAI key for Responses API |
| `OPENAI_MODEL_NANO` | No | `gpt-4.1-mini` | Model for routing/extraction/answers |
| `OPENAI_BASE_URL` | No | empty | Custom OpenAI-compatible gateway URL |
| `NEO4J_URI` | Recommended | - | Neo4j address (`bolt://...`) |
| `NEO4J_USER` | Recommended | - | Neo4j username |
| `NEO4J_PASSWORD` | Recommended | - | Neo4j password |
| `NEO4J_DATABASE` | No | `neo4j` (if set) | Neo4j database/namespace |
| `MILVUS_DB_PATH` | No | `milvus_archive_gpt.db` | Milvus local DB file |
| `MILVUS_ABSTRACT_COLLECTION` | No | `abstract_triplet_templates` | Collection for abstract triplets |
| `MILVUS_CHUNK_COLLECTION` | No | `document_chunks` | Collection for document chunks |
| `EMBEDDING_MODEL_NAME` | No | `google/embeddinggemma-300m` | Embedding model source |
| `EMBEDDING_MODEL_ZIP_PATH` | No | empty | Local ZIP model path |
| `EMBEDDING_MODEL_EXTRACT_DIR` | No | `~/.cache/archive-gpt/embedding-models` | ZIP extraction path |
| `RAG_FULL_SOURCE_DOCS_K` | No | `2` | Top source docs included as full text |
| `RAG_CONCRETE_TRIPLETS_LIMIT` | No | `0` | Triplet limit in context (0 = no limit) |
| `RAG_VECTOR_SEARCH_TOP_K` | No | `80` | Chunk candidates fetched from Milvus |
| `RAG_VECTOR_SOURCE_RANK_START` | No | `6` | Source-rank window start |
| `RAG_VECTOR_SOURCE_RANK_END` | No | `10` | Source-rank window end |
| `RAG_VECTOR_SOURCE_CHUNKS_K` | No | `5` | Source-aware chunks count |
| `RAG_VECTOR_GLOBAL_CHUNKS_K` | No | `5` | Global chunks count |
| `RAG_VECTOR_CHUNK_WORDS` | No | `50` | Word chunk size |
| `RAG_VECTOR_CHUNK_OVERLAP` | No | `10` | Word overlap between adjacent chunks |
| `RAG_VECTOR_MAX_CHUNKS_PER_DOC` | No | `120` | Max vector chunks per document |
| `GALLERY_DB_PATH` | No | `databases/gallery.db` | SQLite path for archive cards |
| `TELEGRAM_BOT_TOKEN` | Bot only | - | Telegram bot token |
| `DB_HOST`/`DB_PORT`/`DB_USER`/`DB_PASSWORD`/`DB_NAME` | Bot only | - | MySQL for bot users |
| `API_HOST`/`API_PORT`/`API_WORKERS` | No | script defaults | API launcher options |
| `FRONTEND_PORT` | No | `3000` | Frontend port |
| `ARCHIVE_GPT_API_URL` | No | `http://localhost:8000` | Backend URL used by frontend |
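The `RAG_VECTOR_CHUNK_WORDS=50` / `RAG_VECTOR_CHUNK_OVERLAP=10` pair implies a sliding word window with a 40-word stride. A minimal sketch of that chunking, assuming the defaults above (this is not the repository's actual implementation):

```python
# Sliding-window word chunking implied by RAG_VECTOR_CHUNK_WORDS,
# RAG_VECTOR_CHUNK_OVERLAP, and RAG_VECTOR_MAX_CHUNKS_PER_DOC.

def word_chunks(text, size=50, overlap=10, max_chunks=120):
    words = text.split()
    step = size - overlap                 # 40-word stride by default
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if len(chunks) >= max_chunks:     # per-document chunk cap
            break
        if start + size >= len(words):    # last window reached end of doc
            break
    return chunks

demo = " ".join(str(i) for i in range(120))
chunks = word_chunks(demo)
print(len(chunks))  # 3 chunks for a 120-word document
```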
Note: legacy Azure-related variables may still appear in environments. The current runtime in this repository is centered on `OPENAI_*`.
## Running the API

Run from project root:

```bash
uvicorn api.app:app --host 0.0.0.0 --port 8000
```

or:

```bash
python run_api.py --host 0.0.0.0 --port 8000 --reload
```

Running from legacy working directories or old PYTHONPATH values can produce import failures such as:

```text
ModuleNotFoundError: No module named 'datascience'
```

Use project-root execution and the `api.app:app` module path.
## HTTP API: Full Overview

Base URL in examples: http://127.0.0.1:8000
```mermaid
sequenceDiagram
    participant C as Client
    participant API as FastAPI /v1/query
    participant S as ArchiveGPTSearch
    participant G as Neo4j Graph
    participant M as Milvus
    participant O as OpenAI Responses
    C->>API: POST /v1/query
    API->>S: route(query, type, lang)
    S->>O: extract templates/entities
    S->>M: template ANN search
    S->>G: concrete triplets retrieval
    S->>M: optional chunk retrieval
    S->>O: final answer generation
    O-->>S: answer text
    S-->>API: text + response_id
    API-->>C: JSON response
```
### GET /health

```bash
curl -sS http://127.0.0.1:8000/health
```

### GET /v1/graph/visualization?limit=500&offset=0&search=&entity_name=&relation_type=
Purpose:
- returns graph edges in triplet-style format;
- returns node list for direct frontend graph rendering;
- supports pagination and filters.
```bash
curl -sS "http://127.0.0.1:8000/v1/graph/visualization?limit=200&offset=0&search=baytemirov"
```

### POST /v1/query
Body fields:
- `query`: user question;
- `type`: `llm` or `search`;
- `lang`: `ru` or `kg`;
- `previous_response_id`: optional response continuation id.
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Why was Baytemirov arrested?",
    "type": "llm",
    "lang": "ru"
  }'
```

### POST /v1/query/stream
```bash
curl -N -X POST http://127.0.0.1:8000/v1/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Provide a short case summary",
    "type": "llm",
    "lang": "ru"
  }'
```

SSE event types: `delta`, `done`, `error`.
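A client consuming these events needs only standard SSE line framing. The parser below works on a captured stream; the exact JSON payload shape (`{"text": ...}`) is an assumption for illustration, not the API's documented schema:

```python
# Minimal parser for delta/done/error SSE events over 'event:'/'data:'
# line pairs. A real client would iterate lines from the HTTP response.

import json

def parse_sse(lines):
    """Yield (event, data) pairs from SSE-framed lines."""
    event = "message"                      # SSE default event name
    for line in lines:
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            yield event, json.loads(line.split(":", 1)[1].strip())
            event = "message"              # reset after dispatch

stream = [
    'event: delta', 'data: {"text": "Hel"}',
    'event: delta', 'data: {"text": "lo"}',
    'event: done', 'data: {}',
]
answer = "".join(d.get("text", "") for e, d in parse_sse(stream) if e == "delta")
print(answer)  # Hello
```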
### POST /v1/query/doc

Body fields:

- `query`
- `file_url` (local file path or URL)
- `type` (`llm`)
- `lang`
- `previous_response_id` (optional)
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/query/doc \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Extract key facts",
    "file_url": "documents/delo_baytemirova.txt",
    "type": "llm",
    "lang": "ru"
  }'
```

Commonly supported local formats: `.txt`, `.md`, `.markdown`, `.json`, `.csv`, `.tsv`, `.log`, `.rst`, `.yaml`, `.yml`, `.xml`, `.html`, `.htm`, `.docx`, `.pdf`.
### POST /v1/query/doc/stream

Same body as `/v1/query/doc`, with streamed output.
### POST /v1/query/image

Body fields:

- `query`
- `image_url` (public URL or local path)
- `type` (`llm`)
- `lang`
- `previous_response_id` (optional)
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/query/image \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is written in the scan?",
    "image_url": "https://example.com/scan.jpg",
    "type": "llm",
    "lang": "ru"
  }'
```

### POST /v1/query/image/stream

### GET /v1/gallery/cards?limit=100&offset=0&search=
Parameters:

- `limit`: 1..500 (a service-level upper clamp may apply)
- `offset`: >= 0
- `search`: optional text filter

```bash
curl -sS "http://127.0.0.1:8000/v1/gallery/cards?limit=20&offset=0"
```

### POST /v1/gallery/cards
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/gallery/cards \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Sample archive card",
    "short_description": "Manual card for validation",
    "source": "manual-entry",
    "status": "pending"
  }'
```

### PUT /v1/gallery/cards/{card_id}
```bash
curl -sS -X PUT http://127.0.0.1:8000/v1/gallery/cards/1 \
  -H "Content-Type: application/json" \
  -d '{
    "status": "verified"
  }'
```

### POST /v1/gallery/cards/upload?lang=ru
Content-Type: `multipart/form-data`

Required form field: `file`

Allowed upload suffixes for this endpoint: `.txt`, `.md`, `.markdown`, `.pdf`, `.docx`.
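The extension gate amounts to a case-insensitive suffix check. A sketch with the set above (function name illustrative, not the endpoint's real validator):

```python
# Case-insensitive suffix check mirroring the allowed upload list.

from pathlib import Path

ALLOWED_SUFFIXES = {".txt", ".md", ".markdown", ".pdf", ".docx"}

def is_allowed_upload(filename: str) -> bool:
    return Path(filename).suffix.lower() in ALLOWED_SUFFIXES

print(is_allowed_upload("delo_baytemirova.txt"))  # True
print(is_allowed_upload("scan.jpg"))              # False
```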
```bash
curl -sS -X POST "http://127.0.0.1:8000/v1/gallery/cards/upload?lang=ru" \
  -F "file=@documents/delo_baytemirova.txt;type=text/plain"
```

### POST /v1/gallery/cards/text
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/gallery/cards/text \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Text-ingested card",
    "text": "Raw archival text fragment goes here.",
    "source": "inline-text",
    "lang": "ru"
  }'
```

### GET /v1/gallery/cards/{card_id}

```bash
curl -sS http://127.0.0.1:8000/v1/gallery/cards/1
```

### GET /v1/gallery/cards/{card_id}/details?relation_limit=2000

```bash
curl -sS "http://127.0.0.1:8000/v1/gallery/cards/1/details?relation_limit=1000"
```

### POST /v1/gallery/initialize
Body fields:

- `mode`: `forward`, `backward`, or `forward_backward` (default: `backward`);
- `lang`: `ru` or `kg`.
```bash
curl -sS -X POST http://127.0.0.1:8000/v1/gallery/initialize \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "backward",
    "lang": "ru"
  }'
```

Important:

- API startup automatically initializes only new documents from `documents/`;
- the manual endpoint is used for full reinitialization runs.
## Gallery Upload and Deduplication

Pipeline for `/v1/gallery/cards/upload`:

- API accepts the file and validates its extension.
- `sha256` is computed from the raw file bytes.
- Fast duplicate check by `content_hash` in SQLite.
- If not duplicated: text extraction via `LLMHelper.get_doc_data`.
- Structured card extraction via `LLMHelper.extract_document_card`.
- Duplicate check by `title + source`.
- If an existing record is missing `content_hash`, the hash is backfilled.
- If no duplicates: a new card is inserted into SQLite.
- Document text is converted to triplets and upserted into Neo4j.
```mermaid
flowchart TD
    U[Upload file] --> H[Compute sha256]
    H --> D1{Duplicate by content_hash?}
    D1 -- Yes --> R1[Return existing card]
    D1 -- No --> TX[Extract text]
    TX --> EC[Extract document card]
    EC --> D2{Duplicate by title + source?}
    D2 -- Yes --> R2[Return existing card]
    D2 -- No --> SQL[Insert card into SQLite]
    SQL --> NEO[Upsert triplets to Neo4j]
    NEO --> DONE[Return created card]
```
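The content-hash short-circuit from this pipeline, in miniature: sha256 over the raw bytes, then a lookup before insert. An in-memory dict stands in for the SQLite unique index, and the triplet count is a made-up placeholder:

```python
# Hash-first dedup sketch: a repeated upload returns the stored card
# and reports zero triplets upserted, matching the behavior above.

import hashlib

cards_by_hash = {}

def upload(raw: bytes, title: str):
    content_hash = hashlib.sha256(raw).hexdigest()
    existing = cards_by_hash.get(content_hash)
    if existing is not None:
        return existing, 0                 # duplicate: nothing upserted
    card = {"title": title, "content_hash": content_hash}
    cards_by_hash[content_hash] = card
    return card, 42                        # placeholder triplet count

first, n1 = upload(b"archival text", "Card A")
again, n2 = upload(b"archival text", "Card A (re-upload)")
print(again is first, n2)  # True 0
```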
Duplicate behavior:

- API returns the already existing card;
- `graph_triplets_upserted` is `0`;
- no duplicate row is created in SQLite.

SQLite table used by this flow: `document_cards`.

Main card fields: `title`, `short_description`, `source`, `content_hash`, `status` (`verified`, `pending`, `draft`).
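A hypothetical `sqlite3` sketch of this table with a unique `content_hash` index; the column set follows the fields above, but the exact DDL in `databases/gallery_db.py` may differ:

```python
# Illustrative document_cards schema: the unique index makes a second
# insert with the same content_hash fail at the database layer.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS document_cards (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        short_description TEXT,
        source TEXT,
        content_hash TEXT,
        status TEXT DEFAULT 'pending'
    )
""")
conn.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS idx_cards_hash "
    "ON document_cards(content_hash)"
)
conn.execute(
    "INSERT INTO document_cards (title, content_hash) VALUES (?, ?)",
    ("Card A", "abc123"),
)
try:
    conn.execute(
        "INSERT INTO document_cards (title, content_hash) VALUES (?, ?)",
        ("Card B", "abc123"),
    )
except sqlite3.IntegrityError as exc:
    print("duplicate rejected:", exc)
```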
## Graph and Archive Initialization

On FastAPI startup, `documents/` is synchronized automatically with these rules:

- pass mode: `forward`;
- only uninitialized documents are processed;
- existing cards are not recreated.
This behavior is implemented in the startup hook in `api/app.py`. For forced re-sync runs, call `/v1/gallery/initialize`. In the current archive UI, this action is exposed via the overflow menu (`...`) as `Reinitialize archive` and sends `mode=backward`.
`databases/graph_init.py` can preload deterministic triplets and document chunks:

```mermaid
flowchart LR
    DOC[documents dir] --> PARSE[Parse documents]
    PARSE --> TRIP[Triplets from docs]
    PARSE --> CH[Word chunks 50/10]
    TRIP --> GR
    GR --> N[(Neo4j)]
    CH --> EMB[EmbeddingGemma Retrieval-document]
    EMB --> MC[(Milvus document_chunks)]
    AT[Abstract templates] --> MINIT[milvus_init sync]
    MINIT --> MA[(Milvus abstract templates)]
```
Quick bootstrap from `documents/`:

```bash
python -m databases.graph_init --documents-dir documents
```

With smoke check:

```bash
python -m databases.graph_init --documents-dir documents --smoke-check
```

Parameterized bootstrap:

```bash
python -m databases.graph_init \
  --documents-dir documents \
  --max-chunks-per-doc 18 \
  --chunk-words 50 \
  --chunk-overlap-words 10 \
  --max-vector-chunks-per-doc 120
```

Sync pre-vectorized abstract templates into Milvus:

```bash
python -m databases.milvus_init --verify-search --limit 4000
```

## Smoke Checks and Quality Validation

This repository currently does not include a dedicated end-to-end smoke script in the root. A practical validation checklist is:
- API health check.
- Graph visualization endpoint sanity.
- One text query (`/v1/query`).
- One SSE text stream (`/v1/query/stream`).
- One gallery file upload (`/v1/gallery/cards/upload`).
- One manual archive initialization call (`/v1/gallery/initialize`).
Example quick checks:

```bash
curl -sS http://127.0.0.1:8000/health
curl -sS "http://127.0.0.1:8000/v1/graph/visualization?limit=20&offset=0"
curl -sS -X POST http://127.0.0.1:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"What is this archive about?","type":"llm","lang":"ru"}'
```

## Telegram Bot

To run bot mode:
- set `TELEGRAM_BOT_TOKEN`;
- start MySQL and apply `databases/init.sql`;
- run:

```bash
python run_bot.py
```

## Data Stores

### Neo4j

- stores `Entity` nodes and `RELATED_TO` relations;
- relation payload typically carries `relation_type`, `evidence`, `confidence`, `sources`.
### Milvus

- stores pre-vectorized abstract templates (`object|relation_type|subject`);
- used by abstract relation matching for top-k template selection;
- stores document chunks (word chunking 50/10) for vector retrieval;
- supports hybrid context construction in `llm` mode: top full docs by triplet-ranked sources, source-aware chunks (ranks 6..10), and independent best chunks.
### SQLite

- path is controlled by `GALLERY_DB_PATH`;
- `document_cards` is created automatically;
- a unique index by `content_hash` is enabled.
### MySQL (bot only)

- stores bot users and related runtime metadata (balance/language/mode/history).
## Troubleshooting

### ModuleNotFoundError on startup

Symptom:

```text
uvicorn api.app:app ...
ModuleNotFoundError: No module named 'datascience'
```

Cause: legacy import path from old scripts or incorrect working directory.

Fix:

```bash
uvicorn api.app:app --host 127.0.0.1 --port 8000
```

Always run from the repository root.
### Upload rejected

Cause: the file extension is outside the allowed list for the upload endpoint.

Fix: use `.txt`, `.md`, `.markdown`, `.pdf`, or `.docx`.
### Document query returns no content

Check:

- the file is not empty;
- the file type is supported;
- the API process can read the file content correctly.
### Search returns empty results

Check:

- the graph has data (run `graph_init`);
- Neo4j is reachable via `NEO4J_*`;
- the query is not too broad; try `type=llm` first.
### graph_init cannot find documents

Cause: the `databases.graph_init` default path still points to `test_data/documents`.

Fix:

```bash
python -m databases.graph_init --documents-dir documents
```

### Historical duplicates in SQLite

Current upload logic prevents new duplicates. Historical duplicates that were already in SQLite before the dedup rollout should be cleaned with a dedicated one-time script.
## Known Limitations

- The document query endpoint is primarily used with local file paths in current workflows.
- Gallery upload format is intentionally limited to a fixed extension list.
- Extraction quality depends on OCR/text quality and document structure.
- Retrieval remains graph-first: abstract template ranking is Milvus-backed with fallback scoring; chunk retrieval augments context but does not replace graph grounding.
If you need deeper split documentation, recommended next docs are:

- `README_API.md` with full request/response schemas;
- `README_DEPLOY.md` with Docker/systemd/nginx deployment scenarios;
- `README_GALLERY.md` with moderation and card normalization policy.


