Hanbiike/graph-rag-system

ArchiveGPT System (Graph-First RAG)

Russian documentation: README_RU.md

This repository contains a graph-first retrieval-augmented generation (RAG) system for archive and legal-style documents. The backend combines LLM-based extraction, Neo4j graph retrieval, and Milvus vector search. The frontend provides a chat interface, archive card browsing, and graph visualization.

This English README is intended for public repository usage and reflects the current state of the code in this repository.

Key points:

  • retrieval is centered around Neo4j graph facts;
  • responses and extraction pipelines use the OpenAI Responses API;
  • archive card pipeline: file or text input -> extraction -> SQLite (document_cards) -> Neo4j triplets;
  • duplicate protection exists for repeated uploads;
  • API startup performs automatic initialization only for uninitialized files in documents/;
  • manual archive reinitialization via /v1/gallery/initialize defaults to mode=backward.

Table of Contents

  1. What the System Does
  2. UI Screenshots
  3. Current Architecture
  4. Components and Folder Structure
  5. Requirements
  6. Quick Start
  7. Environment Variables
  8. Running the API
  9. HTTP API: Full Overview
  10. Gallery Upload and Deduplication
  11. Graph and Archive Initialization
  12. Smoke Checks and Quality Validation
  13. Telegram Bot
  14. Data Stores
  15. Troubleshooting
  16. Known Limitations

What the System Does

ArchiveGPT covers three major capabilities:

  1. Question answering over archival and legal-style materials.
  2. Fact extraction and graph retrieval over Neo4j triplets.
  3. Archive card lifecycle management through file/text ingestion.

Supported scenarios:

  • plain text query;
  • query over a local document;
  • query over an image scan;
  • streaming response over SSE;
  • gallery card upload via multipart/form-data;
  • archive card creation from raw text.

UI Screenshots

Chat (screenshot)

Archive (screenshot)

Map (screenshot)

Recent UI Updates

  • Chat, Archive, and Map now share one header component;
  • header navigation is centered relative to the full page width;
  • archive reinitialization is intentionally de-emphasized and moved into a compact overflow menu (...) with one action: Reinitialize archive.

Current Architecture

High-Level Flow

flowchart LR
    U[Client] --> API[FastAPI api/app.py]
    API --> S[ArchiveGPTSearch]
    S --> LLM[LLMHelper / OpenAI Responses]
    S --> G[ArchiveGraphSearcher / Neo4j]
    S --> M[Milvus abstract template index]

    API --> GR[GalleryRepository / SQLite]
    API --> G

    D[Local docs/images] --> LLM

Detailed Retrieval Pipeline

flowchart TD
    Q[User query] --> EX1[LLM: extract abstract templates]
    Q --> EX2[LLM: extract named entities]

    EX1 --> MT[Milvus: abstract template ANN search]
    MT --> GM[ArchiveGraphSearcher: template-based match]
    EX2 --> GM

    GM --> H{Concrete triplets found?}
    H -- Yes --> CTX[Build retrieval context]
    H -- No --> FB[Fallback: graph keyword search]
    FB --> CTX

    CTX --> FINAL[LLM final response]

How Retrieval Works

For type=llm, the system follows graph-first retrieval:

  1. LLM extracts abstract relation templates from the query.
  2. LLM extracts named entities from the query.
  3. Abstract templates are matched against a pre-vectorized Milvus index.
  4. ArchiveGraphSearcher resolves matched templates to concrete relation edges in Neo4j.
  5. If template matching is empty, fallback keyword graph search is used.
  6. LLM generates the final answer from retrieval context.
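The six steps above can be sketched as a single orchestration function. This is an illustrative outline only: the method names (extract_templates, search_templates, resolve, keyword_search, answer) are assumptions, not the repository's actual API.

```python
# Hypothetical sketch of the graph-first retrieval flow; names are illustrative.
def graph_first_retrieve(query, llm, milvus, graph):
    templates = llm.extract_templates(query)      # step 1: abstract relation templates
    entities = llm.extract_entities(query)        # step 2: named entities
    matches = milvus.search_templates(templates)  # step 3: ANN match in Milvus
    triplets = graph.resolve(matches, entities)   # step 4: concrete edges in Neo4j
    if not triplets:                              # step 5: keyword fallback
        triplets = graph.keyword_search(query)
    return llm.answer(query, context=triplets)    # step 6: final answer
```

The key design point is that the vector index only ranks abstract templates; concrete facts always come from the graph, with keyword search as the fallback path.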

Final Answer Context Composition

The final prompt includes:

  • relevant concrete triplets from retrieval;
  • full text for top-ranked source documents (configured by RAG_FULL_SOURCE_DOCS_K);
  • source-aware and global chunk retrieval from Milvus.

This keeps context focused while preserving factual grounding.

flowchart LR
  T[Concrete triplets] --> TOP[Top source docs full text]
  T --> RANK[Ranked sources]
  RANK --> WIN[Sources rank 6..10]

  Q[Search queries] --> VS[Milvus chunk search]
  WIN --> SA[Source-aware chunks]
  VS --> SA
  VS --> GL[Independent best chunks]

  T --> P[Final LLM prompt]
  TOP --> P
  SA --> P
  GL --> P

Behavior When Neo4j Is Unavailable

If Neo4j settings are missing or the connection fails, ArchiveGraphSearcher falls back to in-memory behavior.

This is useful for local smoke/dev sessions, but production should use stable Neo4j connectivity.

Components and Folder Structure

Core modules:

  • api/app.py: FastAPI endpoints, query handling, gallery APIs, startup archive sync.
  • searchers/search.py: retrieval orchestration and LLM-facing query flow.
  • aitools/llm.py: OpenAI helper, extraction routines, triplet logic.
  • databases/graph_db.py: Neo4j retrieval and upsert layer.
  • databases/graph_init.py: deterministic graph/chunk initialization from local documents.
  • databases/milvus_db.py: Milvus indexes for abstract templates and chunks.
  • databases/milvus_init.py: abstract template synchronization into Milvus.
  • databases/gallery_db.py: SQLite repository for archive cards.
  • bot/*: Telegram bot runtime and handlers.
  • frontend/*: Next.js frontend.

Simplified top-level tree:

.
├── api/
│   └── app.py
├── aitools/
│   ├── embedder.py
│   └── llm.py
├── bot/
│   ├── bot.py
│   ├── handlers.py
│   ├── keyboards.py
│   ├── messages.py
│   └── states.py
├── confs/
│   └── config.py
├── databases/
│   ├── db.py
│   ├── graph_db.py
│   ├── graph_init.py
│   ├── gallery_db.py
│   ├── milvus_db.py
│   └── milvus_init.py
├── documents/
├── frontend/
├── searchers/
│   └── search.py
├── tools/
│   └── embedding_model_zip.py
├── requirements.txt
├── run_api.py
├── run_bot.py
└── run_service.py

Requirements

  • Python 3.10+
  • pip
  • OpenAI API key
  • Neo4j 5.x (strongly recommended for complete retrieval quality)
  • Node.js 20+ and npm (frontend)
  • MySQL (only if running Telegram bot user/balance storage)

Quick Start

1) Install backend dependencies

From repository root:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

2) Create .env in repository root

Use your own environment values; this repository snapshot does not ship a public .env.example template.

Minimum recommended values:

OPENAI_API_KEY=...
OPENAI_MODEL_NANO=gpt-4.1-mini

NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
NEO4J_DATABASE=neo4j

Optional: local embedding model ZIP workflow (to avoid online model pulls):

python3.10 -m tools.embedding_model_zip pack \
  --model google/embeddinggemma-300m \
  --output ./artifacts/embeddinggemma-300m.zip

Then add:

EMBEDDING_MODEL_ZIP_PATH=./artifacts/embeddinggemma-300m.zip
EMBEDDING_MODEL_EXTRACT_DIR=~/.cache/archive-gpt/embedding-models

3) Optional: preload graph/chunks from local documents

python -m databases.graph_init --documents-dir documents --smoke-check

Note: databases.graph_init currently defaults to test_data/documents, which is not present in this workspace. Pass --documents-dir documents explicitly.

4) Run backend API

uvicorn api.app:app --host 127.0.0.1 --port 8000

Swagger UI: http://127.0.0.1:8000/docs

5) Optional: run backend + frontend together

python run_service.py

Production mode:

python run_service.py --mode prod

Environment Variables

The table below summarizes key variables from confs/config.py and runtime entrypoints.

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| OPENAI_API_KEY | Yes | - | OpenAI key for the Responses API |
| OPENAI_MODEL_NANO | No | gpt-4.1-mini | Model for routing/extraction/answers |
| OPENAI_BASE_URL | No | empty | Custom OpenAI-compatible gateway URL |
| NEO4J_URI | Recommended | - | Neo4j address (bolt://...) |
| NEO4J_USER | Recommended | - | Neo4j username |
| NEO4J_PASSWORD | Recommended | - | Neo4j password |
| NEO4J_DATABASE | No | neo4j (if set) | Neo4j database/namespace |
| MILVUS_DB_PATH | No | milvus_archive_gpt.db | Milvus local DB file |
| MILVUS_ABSTRACT_COLLECTION | No | abstract_triplet_templates | Collection for abstract triplets |
| MILVUS_CHUNK_COLLECTION | No | document_chunks | Collection for document chunks |
| EMBEDDING_MODEL_NAME | No | google/embeddinggemma-300m | Embedding model source |
| EMBEDDING_MODEL_ZIP_PATH | No | empty | Local ZIP model path |
| EMBEDDING_MODEL_EXTRACT_DIR | No | ~/.cache/archive-gpt/embedding-models | ZIP extraction path |
| RAG_FULL_SOURCE_DOCS_K | No | 2 | Top source docs included as full text |
| RAG_CONCRETE_TRIPLETS_LIMIT | No | 0 | Triplet limit in context (0 = no limit) |
| RAG_VECTOR_SEARCH_TOP_K | No | 80 | Chunk candidates fetched from Milvus |
| RAG_VECTOR_SOURCE_RANK_START | No | 6 | Source-rank window start |
| RAG_VECTOR_SOURCE_RANK_END | No | 10 | Source-rank window end |
| RAG_VECTOR_SOURCE_CHUNKS_K | No | 5 | Source-aware chunk count |
| RAG_VECTOR_GLOBAL_CHUNKS_K | No | 5 | Global chunk count |
| RAG_VECTOR_CHUNK_WORDS | No | 50 | Words per chunk |
| RAG_VECTOR_CHUNK_OVERLAP | No | 10 | Word overlap between adjacent chunks |
| RAG_VECTOR_MAX_CHUNKS_PER_DOC | No | 120 | Max vector chunks per document |
| GALLERY_DB_PATH | No | databases/gallery.db | SQLite path for archive cards |
| TELEGRAM_BOT_TOKEN | Bot only | - | Telegram bot token |
| DB_HOST/DB_PORT/DB_USER/DB_PASSWORD/DB_NAME | Bot only | - | MySQL for bot users |
| API_HOST/API_PORT/API_WORKERS | No | script defaults | API launcher options |
| FRONTEND_PORT | No | 3000 | Frontend port |
| ARCHIVE_GPT_API_URL | No | http://localhost:8000 | Backend URL used by the frontend |

Note: legacy Azure-related variables may still appear in environments. Current runtime in this repository is centered on OPENAI_*.
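The three chunking variables (RAG_VECTOR_CHUNK_WORDS, RAG_VECTOR_CHUNK_OVERLAP, RAG_VECTOR_MAX_CHUNKS_PER_DOC) interact as a sliding word window. The sketch below illustrates the documented 50/10/120 defaults; it is not the repository's exact implementation.

```python
# Illustrative 50-word chunking with a 10-word overlap (defaults from the table).
def word_chunks(text, chunk_words=50, overlap=10, max_chunks=120):
    words = text.split()
    step = chunk_words - overlap  # each chunk starts 40 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_words]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if len(chunks) >= max_chunks or start + chunk_words >= len(words):
            break
    return chunks
```

With the defaults, a 120-word document yields three chunks whose boundaries overlap by 10 words, which keeps sentence fragments from being lost at chunk edges.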

Running the API

Recommended

Run from project root:

uvicorn api.app:app --host 0.0.0.0 --port 8000

Alternative launcher script

python run_api.py --host 0.0.0.0 --port 8000 --reload

Important import-path note

Running from legacy working directories or old PYTHONPATH values can produce import failures such as:

ModuleNotFoundError: No module named 'datascience'

Run from the project root and use the api.app:app module path.

HTTP API: Full Overview

Base URL in examples: http://127.0.0.1:8000

Text Query Request Flow

sequenceDiagram
  participant C as Client
  participant API as FastAPI /v1/query
  participant S as ArchiveGPTSearch
  participant G as Neo4j Graph
  participant M as Milvus
  participant O as OpenAI Responses

  C->>API: POST /v1/query
  API->>S: route(query, type, lang)
  S->>O: extract templates/entities
  S->>M: template ANN search
  S->>G: concrete triplets retrieval
  S->>M: optional chunk retrieval
  S->>O: final answer generation
  O-->>S: answer text
  S-->>API: text + response_id
  API-->>C: JSON response

1) Health

GET /health

curl -sS http://127.0.0.1:8000/health

2) Graph data for visualization

GET /v1/graph/visualization?limit=500&offset=0&search=&entity_name=&relation_type=

Purpose:

  • returns graph edges in triplet-style format;
  • returns node list for direct frontend graph rendering;
  • supports pagination and filters.
curl -sS "http://127.0.0.1:8000/v1/graph/visualization?limit=200&offset=0&search=baytemirov"

3) Text query

POST /v1/query

Body fields:

  • query: user question;
  • type: llm or search;
  • lang: ru or kg;
  • previous_response_id: optional response continuation id.
curl -sS -X POST http://127.0.0.1:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Why was Baytemirov arrested?",
    "type": "llm",
    "lang": "ru"
  }'

4) Text streaming (SSE)

POST /v1/query/stream

curl -N -X POST http://127.0.0.1:8000/v1/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Provide a short case summary",
    "type": "llm",
    "lang": "ru"
  }'

SSE event types:

  • delta
  • done
  • error
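A client consumes this stream by splitting on blank lines and reading event/data fields. The parser below is a minimal client-side sketch of the standard SSE wire format; the exact payload contents of each event are an assumption, not taken from the repository.

```python
# Minimal SSE parser: yields (event, data) pairs from raw stream text.
def parse_sse(raw: str):
    event, data = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # a blank line terminates one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

# Hypothetical stream: two delta events followed by done.
stream = "event: delta\ndata: Hel\n\nevent: delta\ndata: lo\n\nevent: done\ndata: [DONE]\n\n"
events = list(parse_sse(stream))
```

Concatenating the data of the delta events reconstructs the answer text; done signals completion and error carries a failure payload.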

5) Query over a document

POST /v1/query/doc

Body fields:

  • query
  • file_url (local file path or URL)
  • type (llm)
  • lang
  • previous_response_id (optional)
curl -sS -X POST http://127.0.0.1:8000/v1/query/doc \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Extract key facts",
    "file_url": "documents/delo_baytemirova.txt",
    "type": "llm",
    "lang": "ru"
  }'

Commonly supported local formats:

.txt, .md, .markdown, .json, .csv, .tsv, .log, .rst, .yaml, .yml, .xml, .html, .htm, .docx, .pdf.

6) Document streaming (SSE)

POST /v1/query/doc/stream

Same body as /v1/query/doc with streamed output.

7) Query over an image

POST /v1/query/image

Body fields:

  • query
  • image_url (public URL or local path)
  • type (llm)
  • lang
  • previous_response_id (optional)
curl -sS -X POST http://127.0.0.1:8000/v1/query/image \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is written in the scan?",
    "image_url": "https://example.com/scan.jpg",
    "type": "llm",
    "lang": "ru"
  }'

8) Image streaming (SSE)

POST /v1/query/image/stream

9) Gallery card list

GET /v1/gallery/cards?limit=100&offset=0&search=

Parameters:

  • limit: 1..500 (service-level upper clamp may apply)
  • offset: >= 0
  • search: optional text filter
curl -sS "http://127.0.0.1:8000/v1/gallery/cards?limit=20&offset=0"

10) Create gallery card manually (JSON)

POST /v1/gallery/cards

curl -sS -X POST http://127.0.0.1:8000/v1/gallery/cards \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Sample archive card",
    "short_description": "Manual card for validation",
    "source": "manual-entry",
    "status": "pending"
  }'

11) Update gallery card

PUT /v1/gallery/cards/{card_id}

curl -sS -X PUT http://127.0.0.1:8000/v1/gallery/cards/1 \
  -H "Content-Type: application/json" \
  -d '{
    "status": "verified"
  }'

12) Upload gallery card from file

POST /v1/gallery/cards/upload?lang=ru

Content-Type: multipart/form-data

Required form field:

  • file

Allowed upload suffixes for this endpoint:

  • .txt
  • .md
  • .markdown
  • .pdf
  • .docx
curl -sS -X POST "http://127.0.0.1:8000/v1/gallery/cards/upload?lang=ru" \
  -F "file=@documents/delo_baytemirova.txt;type=text/plain"
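The extension check behind this endpoint can be sketched as follows, assuming the allowed-suffix list above; the function name and exception type are illustrative, not the actual server code.

```python
# Illustrative suffix validation for the gallery upload endpoint.
from pathlib import Path

ALLOWED_SUFFIXES = {".txt", ".md", ".markdown", ".pdf", ".docx"}

def validate_upload_filename(filename: str) -> None:
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        # Corresponds to the documented 415 Unsupported Media Type response
        raise ValueError("Unsupported file type: " + (suffix or "(none)"))
```

Note the lowercase comparison: a file named scan.PDF would pass, while scan.jpg is rejected.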

13) Upload gallery card from raw text

POST /v1/gallery/cards/text

curl -sS -X POST http://127.0.0.1:8000/v1/gallery/cards/text \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Text-ingested card",
    "text": "Raw archival text fragment goes here.",
    "source": "inline-text",
    "lang": "ru"
  }'

14) Get gallery card by id

GET /v1/gallery/cards/{card_id}

curl -sS http://127.0.0.1:8000/v1/gallery/cards/1

15) Get gallery card details with graph relations

GET /v1/gallery/cards/{card_id}/details?relation_limit=2000

curl -sS "http://127.0.0.1:8000/v1/gallery/cards/1/details?relation_limit=1000"

16) Manual archive reinitialization from documents/

POST /v1/gallery/initialize

Body fields:

  • mode: forward, backward, or forward_backward (default: backward);
  • lang: ru or kg.
curl -sS -X POST http://127.0.0.1:8000/v1/gallery/initialize \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "backward",
    "lang": "ru"
  }'

Important:

  • API startup automatically initializes only new documents from documents/;
  • the manual endpoint is used for full reinitialization runs.

Gallery Upload and Deduplication

Pipeline for /v1/gallery/cards/upload:

  1. API accepts file and validates extension.
  2. sha256 is computed from raw file bytes.
  3. Fast duplicate check by content_hash in SQLite.
  4. If not duplicated: text extraction via LLMHelper.get_doc_data.
  5. Structured card extraction via LLMHelper.extract_document_card.
  6. Duplicate check by title + source.
  7. If an existing record misses content_hash, hash is backfilled.
  8. If no duplicates: new card is inserted into SQLite.
  9. Document text is converted to triplets and upserted into Neo4j.
flowchart TD
  U[Upload file] --> H[Compute sha256]
  H --> D1{Duplicate by content_hash?}
  D1 -- Yes --> R1[Return existing card]
  D1 -- No --> TX[Extract text]

  TX --> EC[Extract document card]
  EC --> D2{Duplicate by title + source?}
  D2 -- Yes --> R2[Return existing card]
  D2 -- No --> SQL[Insert card into SQLite]
  SQL --> NEO[Upsert triplets to Neo4j]
  NEO --> DONE[Return created card]
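Steps 2-3 of this pipeline (hash the raw bytes, then a fast lookup by content_hash) can be sketched with the standard library. Table and column names follow the README (document_cards, content_hash); everything else here is illustrative.

```python
# Illustrative fast-duplicate check by sha256 content hash.
import hashlib
import sqlite3

def find_duplicate(conn: sqlite3.Connection, file_bytes: bytes):
    content_hash = hashlib.sha256(file_bytes).hexdigest()
    row = conn.execute(
        "SELECT id, title FROM document_cards WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    return content_hash, row  # row is None when the upload is new
```

Because the hash is computed over raw bytes before any LLM call, duplicate uploads short-circuit cheaply without spending extraction tokens.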

Duplicate behavior:

  • API returns the already existing card;
  • graph_triplets_upserted is 0;
  • no duplicate row is created in SQLite.

SQLite table used by this flow: document_cards.

Main card fields:

  • title
  • short_description
  • source
  • content_hash
  • status (verified, pending, draft)

Graph and Archive Initialization

Automatic archive initialization on API startup

On FastAPI startup, documents/ is synchronized automatically with these rules:

  • pass mode: forward;
  • only uninitialized documents are processed;
  • existing cards are not recreated.

This behavior is implemented in the startup hook in api/app.py.
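The "only uninitialized documents" rule could look like the sketch below: compare files in documents/ against hashes already recorded for existing cards and return only the new ones. The function name and the hash-set interface are assumptions, not the actual startup hook.

```python
# Illustrative filter for the startup sync: keep only files whose content
# hash is not yet recorded (i.e., not yet initialized as a card).
import hashlib
from pathlib import Path

def uninitialized_documents(documents_dir, known_hashes):
    new_files = []
    for path in sorted(Path(documents_dir).glob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in known_hashes:  # skip already-initialized documents
            new_files.append(path)
    return new_files
```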

Manual archive reinitialization

For forced re-sync runs, call /v1/gallery/initialize. In the current archive UI, this action is exposed via overflow menu (...) as Reinitialize archive and sends mode=backward.

Graph/chunk bootstrap script

databases/graph_init.py can preload deterministic triplets and document chunks:

flowchart LR
  DOC[documents dir] --> PARSE[Parse documents]

  PARSE --> TRIP[Triplets from docs]
  PARSE --> CH[Word chunks 50/10]

  TRIP --> GR
  GR --> N[(Neo4j)]

  CH --> EMB[EmbeddingGemma Retrieval-document]
  EMB --> MC[(Milvus document_chunks)]

  AT[Abstract templates] --> MINIT[milvus_init sync]
  MINIT --> MA[(Milvus abstract templates)]

Quick bootstrap from documents/:

python -m databases.graph_init --documents-dir documents

With smoke check:

python -m databases.graph_init --documents-dir documents --smoke-check

Parameterized bootstrap:

python -m databases.graph_init \
  --documents-dir documents \
  --max-chunks-per-doc 18 \
  --chunk-words 50 \
  --chunk-overlap-words 10 \
  --max-vector-chunks-per-doc 120

Sync pre-vectorized abstract templates into Milvus:

python -m databases.milvus_init --verify-search --limit 4000

Smoke Checks and Quality Validation

This repository does not currently include a dedicated end-to-end smoke script at the repository root. A practical validation checklist is:

  1. API health check.
  2. Graph visualization endpoint sanity.
  3. One text query (/v1/query).
  4. One SSE text stream (/v1/query/stream).
  5. One gallery file upload (/v1/gallery/cards/upload).
  6. One manual archive initialization call (/v1/gallery/initialize).

Example quick checks:

curl -sS http://127.0.0.1:8000/health
curl -sS "http://127.0.0.1:8000/v1/graph/visualization?limit=20&offset=0"
curl -sS -X POST http://127.0.0.1:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"What is this archive about?","type":"llm","lang":"ru"}'

Telegram Bot

To run bot mode:

  1. set TELEGRAM_BOT_TOKEN;
  2. start MySQL and apply databases/init.sql;
  3. run:
python run_bot.py

Data Stores

Neo4j

  • stores Entity nodes and RELATED_TO relations;
  • relation payload typically carries relation_type, evidence, confidence, sources.
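A triplet upsert against this schema could look like the Cypher below. This is a hedged sketch inferred from the README's description (Entity nodes, RELATED_TO relations with relation_type/evidence/confidence/sources), not the repository's actual query.

```python
# Illustrative Cypher for upserting one triplet into the described schema.
UPSERT_TRIPLET_CYPHER = """
MERGE (s:Entity {name: $subject})
MERGE (o:Entity {name: $object})
MERGE (s)-[r:RELATED_TO {relation_type: $relation_type}]->(o)
SET r.evidence = $evidence,
    r.confidence = $confidence,
    r.sources = $sources
"""

def upsert_params(subject, relation_type, obj, evidence, confidence, sources):
    return {
        "subject": subject,
        "relation_type": relation_type,
        "object": obj,
        "evidence": evidence,
        "confidence": confidence,
        "sources": sources,
    }
```

With the official neo4j Python driver, such a query would run as `session.run(UPSERT_TRIPLET_CYPHER, upsert_params(...))` against the database named in NEO4J_DATABASE; MERGE makes the operation idempotent on repeated ingestion.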

Milvus

  • stores pre-vectorized abstract templates object|relation_type|subject;
  • used by abstract relation matching for top-k template selection;
  • stores document chunks (word chunking 50/10) for vector retrieval;
  • supports hybrid context construction in llm mode: top full docs from triplet-ranked sources, source-aware chunks (ranks 6..10), and independent best chunks.

SQLite (gallery)

  • path is controlled by GALLERY_DB_PATH;
  • document_cards is created automatically;
  • unique index by content_hash is enabled.

MySQL (bot)

  • stores bot users and related runtime metadata (balance/language/mode/history).

Troubleshooting

Error: ModuleNotFoundError: datascience

Symptom:

uvicorn api.app:app ...
ModuleNotFoundError: No module named 'datascience'

Cause: legacy import path from old scripts or incorrect working directory.

Fix:

uvicorn api.app:app --host 127.0.0.1 --port 8000

Always run from repository root.

Error: 415 Unsupported Media Type on gallery upload

Cause: file extension is outside the allowed list for upload endpoint.

Fix: use .txt, .md, .markdown, .pdf, or .docx.

Error: 422 Failed to extract text from uploaded file

Check:

  • file is not empty;
  • file type is supported;
  • API process can read the file content correctly.

Error: empty retrieval results

Check:

  • graph has data (run graph_init);
  • Neo4j is reachable via NEO4J_*;
  • query is not too broad; try type=llm first.

Error: Documents directory not found: test_data/documents

Cause: databases.graph_init default path still points to test_data/documents.

Fix:

python -m databases.graph_init --documents-dir documents

Existing duplicates in an old gallery database

Current upload logic prevents new duplicates. Historical duplicates already in SQLite before dedup rollout should be cleaned with a dedicated one-time script.

Known Limitations

  1. Document query endpoint is primarily used with local file paths in current workflows.
  2. Gallery upload format is intentionally limited to a fixed extension list.
  3. Extraction quality depends on OCR/text quality and document structure.
  4. Retrieval remains graph-first: abstract template ranking is Milvus-backed with fallback scoring; chunk retrieval augments context but does not replace graph grounding.

If you need deeper split documentation, recommended next docs are:

  • README_API.md with full request/response schemas;
  • README_DEPLOY.md with Docker/systemd/nginx deployment scenarios;
  • README_GALLERY.md with moderation and card normalization policy.
