Hanbiike/graph-rag-system

ArchiveGPT System (Graph-First RAG)

Russian documentation: README_RU.md

This repository contains a graph-first retrieval-augmented generation (RAG) system for archive and legal-style documents. The backend combines LLM-based extraction, Neo4j graph retrieval, and Milvus vector search. The frontend provides a chat interface, archive card browsing, and graph visualization.

This English README is intended for public repository usage and reflects the current state of the code in this repository.

Key points:

  • retrieval is centered around Neo4j graph facts;
  • responses and extraction pipelines use the OpenAI Responses API;
  • archive card pipeline: file or text input -> extraction -> SQLite (document_cards) -> Neo4j triplets;
  • duplicate protection exists for repeated uploads;
  • API startup performs automatic initialization only for uninitialized files in documents/;
  • manual archive reinitialization via /v1/gallery/initialize defaults to mode=backward.

Table of Contents

  1. What the System Does
  2. UI Screenshots
  3. Current Architecture
  4. Components and Folder Structure
  5. Requirements
  6. Quick Start
  7. Environment Variables
  8. Running the API
  9. HTTP API: Full Overview
  10. Gallery Upload and Deduplication
  11. Graph and Archive Initialization
  12. Smoke Checks and Quality Validation
  13. Telegram Bot
  14. Data Stores
  15. Troubleshooting
  16. Known Limitations

What the System Does

ArchiveGPT covers three major capabilities:

  1. Question answering over archival and legal-style materials.
  2. Fact extraction and graph retrieval over Neo4j triplets.
  3. Archive card lifecycle management through file/text ingestion.

Supported scenarios:

  • plain text query;
  • query over a local document;
  • query over an image scan;
  • streaming response over SSE;
  • gallery card upload via multipart/form-data;
  • archive card creation from raw text.

UI Screenshots

Chat (screenshot)

Archive (screenshot)

Map (screenshot)

Recent UI Updates

  • Chat, Archive, and Map now share one header component;
  • header navigation is centered relative to the full page width;
  • archive reinitialization is intentionally de-emphasized and moved into a compact overflow menu (...) with one action: Reinitialize archive.

Current Architecture

High-Level Flow

flowchart LR
    U[Client] --> API[FastAPI api/app.py]
    API --> S[ArchiveGPTSearch]
    S --> LLM[LLMHelper / OpenAI Responses]
    S --> G[ArchiveGraphSearcher / Neo4j]
    S --> M[Milvus abstract template index]

    API --> GR[GalleryRepository / SQLite]
    API --> G

    D[Local docs/images] --> LLM

Detailed Retrieval Pipeline

flowchart TD
    Q[User query] --> EX1[LLM: extract abstract templates]
    Q --> EX2[LLM: extract named entities]

    EX1 --> MT[Milvus: abstract template ANN search]
    MT --> GM[ArchiveGraphSearcher: template-based match]
    EX2 --> GM

    GM --> H{Concrete triplets found?}
    H -- Yes --> CTX[Build retrieval context]
    H -- No --> FB[Fallback: graph keyword search]
    FB --> CTX

    CTX --> FINAL[LLM final response]

How Retrieval Works

For type=llm, the system follows graph-first retrieval:

  1. LLM extracts abstract relation templates from the query.
  2. LLM extracts named entities from the query.
  3. Abstract templates are matched against a pre-vectorized Milvus index.
  4. ArchiveGraphSearcher resolves matched templates to concrete relation edges in Neo4j.
  5. If template matching is empty, fallback keyword graph search is used.
  6. LLM generates the final answer from retrieval context.
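The six steps above can be sketched as a single orchestration function. This is an illustrative outline only: the method names (extract_templates, search_templates, resolve, keyword_search, answer) are assumptions, not the repository's actual API.

```python
# Hypothetical sketch of the graph-first retrieval flow; names are illustrative.
def graph_first_retrieve(query, llm, milvus, graph):
    templates = llm.extract_templates(query)      # step 1: abstract relation templates
    entities = llm.extract_entities(query)        # step 2: named entities
    matches = milvus.search_templates(templates)  # step 3: ANN match in Milvus
    triplets = graph.resolve(matches, entities)   # step 4: concrete edges in Neo4j
    if not triplets:                              # step 5: keyword fallback
        triplets = graph.keyword_search(query)
    return llm.answer(query, context=triplets)    # step 6: final answer
```

The key design point is that the vector index only ranks abstract templates; concrete facts always come from the graph, with keyword search as the fallback path.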

Final Answer Context Composition

The final prompt includes:

  • relevant concrete triplets from retrieval;
  • full text for top-ranked source documents (configured by RAG_FULL_SOURCE_DOCS_K);
  • source-aware and global chunk retrieval from Milvus.

This keeps context focused while preserving factual grounding.

flowchart LR
  T[Concrete triplets] --> TOP[Top source docs full text]
  T --> RANK[Ranked sources]
  RANK --> WIN[Sources rank 6..10]

  Q[Search queries] --> VS[Milvus chunk search]
  WIN --> SA[Source-aware chunks]
  VS --> SA
  VS --> GL[Independent best chunks]

  T --> P[Final LLM prompt]
  TOP --> P
  SA --> P
  GL --> P

Behavior When Neo4j Is Unavailable

If Neo4j settings are missing or the connection fails, ArchiveGraphSearcher falls back to in-memory behavior.

This is useful for local smoke/dev sessions, but production should use stable Neo4j connectivity.

Components and Folder Structure

Core modules:

  • api/app.py: FastAPI endpoints, query handling, gallery APIs, startup archive sync.
  • searchers/search.py: retrieval orchestration and LLM-facing query flow.
  • aitools/llm.py: OpenAI helper, extraction routines, triplet logic.
  • databases/graph_db.py: Neo4j retrieval and upsert layer.
  • databases/graph_init.py: deterministic graph/chunk initialization from local documents.
  • databases/milvus_db.py: Milvus indexes for abstract templates and chunks.
  • databases/milvus_init.py: abstract template synchronization into Milvus.
  • databases/gallery_db.py: SQLite repository for archive cards.
  • bot/*: Telegram bot runtime and handlers.
  • frontend/*: Next.js frontend.

Simplified top-level tree:

.
├── api/
│   └── app.py
├── aitools/
│   ├── embedder.py
│   └── llm.py
├── bot/
│   ├── bot.py
│   ├── handlers.py
│   ├── keyboards.py
│   ├── messages.py
│   └── states.py
├── confs/
│   └── config.py
├── databases/
│   ├── db.py
│   ├── graph_db.py
│   ├── graph_init.py
│   ├── gallery_db.py
│   ├── milvus_db.py
│   └── milvus_init.py
├── documents/
├── frontend/
├── searchers/
│   └── search.py
├── tools/
│   └── embedding_model_zip.py
├── requirements.txt
├── run_api.py
├── run_bot.py
└── run_service.py

Requirements

  • Python 3.10+
  • pip
  • OpenAI API key
  • Neo4j 5.x (strongly recommended for complete retrieval quality)
  • Node.js 20+ and npm (frontend)
  • MySQL (only if running Telegram bot user/balance storage)

Quick Start

1) Install backend dependencies

From repository root:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

2) Create .env in repository root

Use your own environment values; this repository snapshot does not ship a public .env.example template.

Minimum recommended values:

OPENAI_API_KEY=...
OPENAI_MODEL_NANO=gpt-4.1-mini

NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=...
NEO4J_DATABASE=neo4j

Optional: local embedding model ZIP workflow (to avoid online model pulls):

python3.10 -m tools.embedding_model_zip pack \
  --model google/embeddinggemma-300m \
  --output ./artifacts/embeddinggemma-300m.zip

Then add:

EMBEDDING_MODEL_ZIP_PATH=./artifacts/embeddinggemma-300m.zip
EMBEDDING_MODEL_EXTRACT_DIR=~/.cache/archive-gpt/embedding-models

3) Optional: preload graph/chunks from local documents

python -m databases.graph_init --documents-dir documents --smoke-check

Note: databases.graph_init currently defaults to test_data/documents, which is not present in this workspace. Pass --documents-dir documents explicitly.

4) Run backend API

uvicorn api.app:app --host 127.0.0.1 --port 8000

Swagger UI: http://127.0.0.1:8000/docs

5) Optional: run backend + frontend together

python run_service.py

Production mode:

python run_service.py --mode prod

Environment Variables

The table below summarizes key variables from confs/config.py and runtime entrypoints.

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| OPENAI_API_KEY | Yes | - | OpenAI key for the Responses API |
| OPENAI_MODEL_NANO | No | gpt-4.1-mini | Model for routing/extraction/answers |
| OPENAI_BASE_URL | No | empty | Custom OpenAI-compatible gateway URL |
| NEO4J_URI | Recommended | - | Neo4j address (bolt://...) |
| NEO4J_USER | Recommended | - | Neo4j username |
| NEO4J_PASSWORD | Recommended | - | Neo4j password |
| NEO4J_DATABASE | No | neo4j (if set) | Neo4j database/namespace |
| MILVUS_DB_PATH | No | milvus_archive_gpt.db | Milvus local DB file |
| MILVUS_ABSTRACT_COLLECTION | No | abstract_triplet_templates | Collection for abstract triplets |
| MILVUS_CHUNK_COLLECTION | No | document_chunks | Collection for document chunks |
| EMBEDDING_MODEL_NAME | No | google/embeddinggemma-300m | Embedding model source |
| EMBEDDING_MODEL_ZIP_PATH | No | empty | Local ZIP model path |
| EMBEDDING_MODEL_EXTRACT_DIR | No | ~/.cache/archive-gpt/embedding-models | ZIP extraction path |
| RAG_FULL_SOURCE_DOCS_K | No | 2 | Top source docs included as full text |
| RAG_CONCRETE_TRIPLETS_LIMIT | No | 0 | Triplet limit in context (0 = no limit) |
| RAG_VECTOR_SEARCH_TOP_K | No | 80 | Chunk candidates fetched from Milvus |
| RAG_VECTOR_SOURCE_RANK_START | No | 6 | Source-rank window start |
| RAG_VECTOR_SOURCE_RANK_END | No | 10 | Source-rank window end |
| RAG_VECTOR_SOURCE_CHUNKS_K | No | 5 | Source-aware chunk count |
| RAG_VECTOR_GLOBAL_CHUNKS_K | No | 5 | Global chunk count |
| RAG_VECTOR_CHUNK_WORDS | No | 50 | Words per chunk |
| RAG_VECTOR_CHUNK_OVERLAP | No | 10 | Word overlap between adjacent chunks |
| RAG_VECTOR_MAX_CHUNKS_PER_DOC | No | 120 | Max vector chunks per document |
| GALLERY_DB_PATH | No | databases/gallery.db | SQLite path for archive cards |
| TELEGRAM_BOT_TOKEN | Bot only | - | Telegram bot token |
| DB_HOST/DB_PORT/DB_USER/DB_PASSWORD/DB_NAME | Bot only | - | MySQL for bot users |
| API_HOST/API_PORT/API_WORKERS | No | script defaults | API launcher options |
| FRONTEND_PORT | No | 3000 | Frontend port |
| ARCHIVE_GPT_API_URL | No | http://localhost:8000 | Backend URL used by the frontend |

Note: legacy Azure-related variables may still appear in environments. Current runtime in this repository is centered on OPENAI_*.
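The three chunking variables (RAG_VECTOR_CHUNK_WORDS, RAG_VECTOR_CHUNK_OVERLAP, RAG_VECTOR_MAX_CHUNKS_PER_DOC) interact as a sliding word window. The sketch below illustrates the documented 50/10/120 defaults; it is not the repository's exact implementation.

```python
# Illustrative 50-word chunking with a 10-word overlap (defaults from the table).
def word_chunks(text, chunk_words=50, overlap=10, max_chunks=120):
    words = text.split()
    step = chunk_words - overlap  # each chunk starts 40 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_words]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if len(chunks) >= max_chunks or start + chunk_words >= len(words):
            break
    return chunks
```

With the defaults, a 120-word document yields three chunks whose boundaries overlap by 10 words, which keeps sentence fragments from being lost at chunk edges.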

Running the API

Recommended

Run from project root:

uvicorn api.app:app --host 0.0.0.0 --port 8000

Alternative launcher script

python run_api.py --host 0.0.0.0 --port 8000 --reload

Important import-path note

Running from legacy working directories or old PYTHONPATH values can produce import failures such as:

ModuleNotFoundError: No module named 'datascience'

Run from the project root and use the api.app:app module path.

HTTP API: Full Overview

Base URL in examples: http://127.0.0.1:8000

Text Query Request Flow

sequenceDiagram
  participant C as Client
  participant API as FastAPI /v1/query
  participant S as ArchiveGPTSearch
  participant G as Neo4j Graph
  participant M as Milvus
  participant O as OpenAI Responses

  C->>API: POST /v1/query
  API->>S: route(query, type, lang)
  S->>O: extract templates/entities
  S->>M: template ANN search
  S->>G: concrete triplets retrieval
  S->>M: optional chunk retrieval
  S->>O: final answer generation
  O-->>S: answer text
  S-->>API: text + response_id
  API-->>C: JSON response

1) Health

GET /health

curl -sS http://127.0.0.1:8000/health

2) Graph data for visualization

GET /v1/graph/visualization?limit=500&offset=0&search=&entity_name=&relation_type=

Purpose:

  • returns graph edges in triplet-style format;
  • returns node list for direct frontend graph rendering;
  • supports pagination and filters.
curl -sS "http://127.0.0.1:8000/v1/graph/visualization?limit=200&offset=0&search=baytemirov"

3) Text query

POST /v1/query

Body fields:

  • query: user question;
  • type: llm or search;
  • lang: ru or kg;
  • previous_response_id: optional response continuation id.
curl -sS -X POST http://127.0.0.1:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Why was Baytemirov arrested?",
    "type": "llm",
    "lang": "ru"
  }'

4) Text streaming (SSE)

POST /v1/query/stream

curl -N -X POST http://127.0.0.1:8000/v1/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Provide a short case summary",
    "type": "llm",
    "lang": "ru"
  }'

SSE event types:

  • delta
  • done
  • error
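A client consumes this stream by splitting on blank lines and reading event/data fields. The parser below is a minimal client-side sketch of the standard SSE wire format; the exact payload contents of each event are an assumption, not taken from the repository.

```python
# Minimal SSE parser: yields (event, data) pairs from raw stream text.
def parse_sse(raw: str):
    event, data = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # a blank line terminates one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

# Hypothetical stream: two delta events followed by done.
stream = "event: delta\ndata: Hel\n\nevent: delta\ndata: lo\n\nevent: done\ndata: [DONE]\n\n"
events = list(parse_sse(stream))
```

Concatenating the data of the delta events reconstructs the answer text; done signals completion and error carries a failure payload.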

5) Query over a document

POST /v1/query/doc

Body fields:

  • query
  • file_url (local file path or URL)
  • type (llm)
  • lang
  • previous_response_id (optional)
curl -sS -X POST http://127.0.0.1:8000/v1/query/doc \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Extract key facts",
    "file_url": "documents/delo_baytemirova.txt",
    "type": "llm",
    "lang": "ru"
  }'

Commonly supported local formats:

.txt, .md, .markdown, .json, .csv, .tsv, .log, .rst, .yaml, .yml, .xml, .html, .htm, .docx, .pdf.

6) Document streaming (SSE)

POST /v1/query/doc/stream

Same body as /v1/query/doc with streamed output.

7) Query over an image

POST /v1/query/image

Body fields:

  • query
  • image_url (public URL or local path)
  • type (llm)
  • lang
  • previous_response_id (optional)
curl -sS -X POST http://127.0.0.1:8000/v1/query/image \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is written in the scan?",
    "image_url": "https://example.com/scan.jpg",
    "type": "llm",
    "lang": "ru"
  }'

8) Image streaming (SSE)

POST /v1/query/image/stream

9) Gallery card list

GET /v1/gallery/cards?limit=100&offset=0&search=

Parameters:

  • limit: 1..500 (service-level upper clamp may apply)
  • offset: >= 0
  • search: optional text filter
curl -sS "http://127.0.0.1:8000/v1/gallery/cards?limit=20&offset=0"

10) Create gallery card manually (JSON)

POST /v1/gallery/cards

curl -sS -X POST http://127.0.0.1:8000/v1/gallery/cards \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Sample archive card",
    "short_description": "Manual card for validation",
    "source": "manual-entry",
    "status": "pending"
  }'

11) Update gallery card

PUT /v1/gallery/cards/{card_id}

curl -sS -X PUT http://127.0.0.1:8000/v1/gallery/cards/1 \
  -H "Content-Type: application/json" \
  -d '{
    "status": "verified"
  }'

12) Upload gallery card from file

POST /v1/gallery/cards/upload?lang=ru

Content-Type: multipart/form-data

Required form field:

  • file

Allowed upload suffixes for this endpoint:

  • .txt
  • .md
  • .markdown
  • .pdf
  • .docx
curl -sS -X POST "http://127.0.0.1:8000/v1/gallery/cards/upload?lang=ru" \
  -F "file=@documents/delo_baytemirova.txt;type=text/plain"
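The extension check behind this endpoint can be sketched as follows, assuming the allowed-suffix list above; the function name and exception type are illustrative, not the actual server code.

```python
# Illustrative suffix validation for the gallery upload endpoint.
from pathlib import Path

ALLOWED_SUFFIXES = {".txt", ".md", ".markdown", ".pdf", ".docx"}

def validate_upload_filename(filename: str) -> None:
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        # Corresponds to the documented 415 Unsupported Media Type response
        raise ValueError("Unsupported file type: " + (suffix or "(none)"))
```

Note the lowercase comparison: a file named scan.PDF would pass, while scan.jpg is rejected.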

13) Upload gallery card from raw text

POST /v1/gallery/cards/text

curl -sS -X POST http://127.0.0.1:8000/v1/gallery/cards/text \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Text-ingested card",
    "text": "Raw archival text fragment goes here.",
    "source": "inline-text",
    "lang": "ru"
  }'

14) Get gallery card by id

GET /v1/gallery/cards/{card_id}

curl -sS http://127.0.0.1:8000/v1/gallery/cards/1

15) Get gallery card details with graph relations

GET /v1/gallery/cards/{card_id}/details?relation_limit=2000

curl -sS "http://127.0.0.1:8000/v1/gallery/cards/1/details?relation_limit=1000"

16) Manual archive reinitialization from documents/

POST /v1/gallery/initialize

Body fields:

  • mode: forward, backward, or forward_backward (default: backward);
  • lang: ru or kg.
curl -sS -X POST http://127.0.0.1:8000/v1/gallery/initialize \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "backward",
    "lang": "ru"
  }'

Important:

  • API startup automatically initializes only new documents from documents/;
  • the manual endpoint is used for full reinitialization runs.

Gallery Upload and Deduplication

Pipeline for /v1/gallery/cards/upload:

  1. API accepts file and validates extension.
  2. sha256 is computed from raw file bytes.
  3. Fast duplicate check by content_hash in SQLite.
  4. If not duplicated: text extraction via LLMHelper.get_doc_data.
  5. Structured card extraction via LLMHelper.extract_document_card.
  6. Duplicate check by title + source.
  7. If an existing record misses content_hash, hash is backfilled.
  8. If no duplicates: new card is inserted into SQLite.
  9. Document text is converted to triplets and upserted into Neo4j.
flowchart TD
  U[Upload file] --> H[Compute sha256]
  H --> D1{Duplicate by content_hash?}
  D1 -- Yes --> R1[Return existing card]
  D1 -- No --> TX[Extract text]

  TX --> EC[Extract document card]
  EC --> D2{Duplicate by title + source?}
  D2 -- Yes --> R2[Return existing card]
  D2 -- No --> SQL[Insert card into SQLite]
  SQL --> NEO[Upsert triplets to Neo4j]
  NEO --> DONE[Return created card]
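Steps 2-3 of this pipeline (hash the raw bytes, then a fast lookup by content_hash) can be sketched with the standard library. Table and column names follow the README (document_cards, content_hash); everything else here is illustrative.

```python
# Illustrative fast-duplicate check by sha256 content hash.
import hashlib
import sqlite3

def find_duplicate(conn: sqlite3.Connection, file_bytes: bytes):
    content_hash = hashlib.sha256(file_bytes).hexdigest()
    row = conn.execute(
        "SELECT id, title FROM document_cards WHERE content_hash = ?",
        (content_hash,),
    ).fetchone()
    return content_hash, row  # row is None when the upload is new
```

Because the hash is computed over raw bytes before any LLM call, duplicate uploads short-circuit cheaply without spending extraction tokens.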

Duplicate behavior:

  • API returns the already existing card;
  • graph_triplets_upserted is 0;
  • no duplicate row is created in SQLite.

SQLite table used by this flow: document_cards.

Main card fields:

  • title
  • short_description
  • source
  • content_hash
  • status (verified, pending, draft)

Graph and Archive Initialization

Automatic archive initialization on API startup

On FastAPI startup, documents/ is synchronized automatically with these rules:

  • pass mode: forward;
  • only uninitialized documents are processed;
  • existing cards are not recreated.

This behavior is implemented in the startup hook in api/app.py.
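The "only uninitialized documents" rule could look like the sketch below: compare files in documents/ against hashes already recorded for existing cards and return only the new ones. The function name and the hash-set interface are assumptions, not the actual startup hook.

```python
# Illustrative filter for the startup sync: keep only files whose content
# hash is not yet recorded (i.e., not yet initialized as a card).
import hashlib
from pathlib import Path

def uninitialized_documents(documents_dir, known_hashes):
    new_files = []
    for path in sorted(Path(documents_dir).glob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in known_hashes:  # skip already-initialized documents
            new_files.append(path)
    return new_files
```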

Manual archive reinitialization

For forced re-sync runs, call /v1/gallery/initialize. In the current archive UI, this action is exposed via overflow menu (...) as Reinitialize archive and sends mode=backward.

Graph/chunk bootstrap script

databases/graph_init.py can preload deterministic triplets and document chunks:

flowchart LR
  DOC[documents dir] --> PARSE[Parse documents]

  PARSE --> TRIP[Triplets from docs]
  PARSE --> CH[Word chunks 50/10]

  TRIP --> GR
  GR --> N[(Neo4j)]

  CH --> EMB[EmbeddingGemma Retrieval-document]
  EMB --> MC[(Milvus document_chunks)]

  AT[Abstract templates] --> MINIT[milvus_init sync]
  MINIT --> MA[(Milvus abstract templates)]

Quick bootstrap from documents/:

python -m databases.graph_init --documents-dir documents

With smoke check:

python -m databases.graph_init --documents-dir documents --smoke-check

Parameterized bootstrap:

python -m databases.graph_init \
  --documents-dir documents \
  --max-chunks-per-doc 18 \
  --chunk-words 50 \
  --chunk-overlap-words 10 \
  --max-vector-chunks-per-doc 120

Sync pre-vectorized abstract templates into Milvus:

python -m databases.milvus_init --verify-search --limit 4000

Smoke Checks and Quality Validation

This repository does not currently include a dedicated end-to-end smoke script at the repository root. A practical validation checklist is:

  1. API health check.
  2. Graph visualization endpoint sanity.
  3. One text query (/v1/query).
  4. One SSE text stream (/v1/query/stream).
  5. One gallery file upload (/v1/gallery/cards/upload).
  6. One manual archive initialization call (/v1/gallery/initialize).

Example quick checks:

curl -sS http://127.0.0.1:8000/health
curl -sS "http://127.0.0.1:8000/v1/graph/visualization?limit=20&offset=0"
curl -sS -X POST http://127.0.0.1:8000/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"What is this archive about?","type":"llm","lang":"ru"}'

Telegram Bot

To run bot mode:

  1. set TELEGRAM_BOT_TOKEN;
  2. start MySQL and apply databases/init.sql;
  3. run:
python run_bot.py

Data Stores

Neo4j

  • stores Entity nodes and RELATED_TO relations;
  • relation payload typically carries relation_type, evidence, confidence, sources.
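A triplet upsert against this schema could look like the Cypher below. This is a hedged sketch inferred from the README's description (Entity nodes, RELATED_TO relations with relation_type/evidence/confidence/sources), not the repository's actual query.

```python
# Illustrative Cypher for upserting one triplet into the described schema.
UPSERT_TRIPLET_CYPHER = """
MERGE (s:Entity {name: $subject})
MERGE (o:Entity {name: $object})
MERGE (s)-[r:RELATED_TO {relation_type: $relation_type}]->(o)
SET r.evidence = $evidence,
    r.confidence = $confidence,
    r.sources = $sources
"""

def upsert_params(subject, relation_type, obj, evidence, confidence, sources):
    return {
        "subject": subject,
        "relation_type": relation_type,
        "object": obj,
        "evidence": evidence,
        "confidence": confidence,
        "sources": sources,
    }
```

With the official neo4j Python driver, such a query would run as `session.run(UPSERT_TRIPLET_CYPHER, upsert_params(...))` against the database named in NEO4J_DATABASE; MERGE makes the operation idempotent on repeated ingestion.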

Milvus

  • stores pre-vectorized abstract templates object|relation_type|subject;
  • used by abstract relation matching for top-k template selection;
  • stores document chunks (word chunking 50/10) for vector retrieval;
  • supports hybrid context construction in llm mode: top full docs from triplet-ranked sources, source-aware chunks (ranks 6..10), and independent best chunks.

SQLite (gallery)

  • path is controlled by GALLERY_DB_PATH;
  • document_cards is created automatically;
  • unique index by content_hash is enabled.

MySQL (bot)

  • stores bot users and related runtime metadata (balance/language/mode/history).

Troubleshooting

Error: ModuleNotFoundError: datascience

Symptom:

uvicorn api.app:app ...
ModuleNotFoundError: No module named 'datascience'

Cause: legacy import path from old scripts or incorrect working directory.

Fix:

uvicorn api.app:app --host 127.0.0.1 --port 8000

Always run from repository root.

Error: 415 Unsupported Media Type on gallery upload

Cause: file extension is outside the allowed list for upload endpoint.

Fix: use .txt, .md, .markdown, .pdf, or .docx.

Error: 422 Failed to extract text from uploaded file

Check:

  • file is not empty;
  • file type is supported;
  • API process can read the file content correctly.

Error: empty retrieval results

Check:

  • graph has data (run graph_init);
  • Neo4j is reachable via NEO4J_*;
  • query is not too broad; try type=llm first.

Error: Documents directory not found: test_data/documents

Cause: databases.graph_init default path still points to test_data/documents.

Fix:

python -m databases.graph_init --documents-dir documents

Existing duplicates in an old gallery database

Current upload logic prevents new duplicates. Historical duplicates already in SQLite before dedup rollout should be cleaned with a dedicated one-time script.

Known Limitations

  1. Document query endpoint is primarily used with local file paths in current workflows.
  2. Gallery upload format is intentionally limited to a fixed extension list.
  3. Extraction quality depends on OCR/text quality and document structure.
  4. Retrieval remains graph-first: abstract template ranking is Milvus-backed with fallback scoring; chunk retrieval augments context but does not replace graph grounding.

If you need deeper split documentation, recommended next docs are:

  • README_API.md with full request/response schemas;
  • README_DEPLOY.md with Docker/systemd/nginx deployment scenarios;
  • README_GALLERY.md with moderation and card normalization policy.
