SinoRAG

SinoRAG is a local-first research tool for Chinese Buddhist and classical Chinese corpora.

It ingests TEI/XML and corpus exchange files into a searchable passage database, then builds compact indexes for exact phrase search, TF-IDF similarity, catalog browsing, and LLM-agent research workflows.

The goal is simple:

Let an LLM agent research Chinese source texts with tools instead of guessing from memory.

SinoRAG is built for evidence-backed workflows: exact Chinese citations, passage IDs, source metadata, search results, similarity candidates, and generated research artifacts that can be reviewed later.

What it does

Ingests CBETA-style TEI/XML into Parquet passage data
Supports optional Kanripo and custom corpus ingestion
Provides exact phrase search over normalized Chinese text
Builds a document table for stable doc_id <-> passage_id mapping
Builds catalog indexes for corpus/work navigation
Builds TF-IDF indexes for similarity and reuse discovery
Supports memory-mapped phrase index experiments for large corpora
Exposes tools through MCP for LLM agents
Produces structured JSON output for reports, graph generation, and downstream apps

Why this exists

General LLMs are useful, but they should not be trusted to “remember” the Buddhist canon or infer textual history without evidence.

SinoRAG gives agents a local research backend:

user question
  -> agent calls SinoRAG tools
  -> SinoRAG searches the corpus
  -> agent receives exact passages + metadata
  -> report cites the source text

This is especially useful for questions like:

“Where is the first loaded-corpus occurrence of this phrase?”
“What passages are textually similar to this one?”
“Where does this Chan/Zen saying appear?”
“Which works cite or reuse this formula?”
“What source passages support this graph edge?”

Data model

SinoRAG separates source text from derived indexes:

passages.parquet      canonical passage database
doc_table.bin         stable doc_id / passage_id mapping
catalog.index         corpus/work/outline navigation
phrase_v2.index       exact phrase candidate index
tfidf.index           similarity index
registry.sqlite       optional mutable research state

The intended long-term shape is:

one canonical text store
many compact reference indexes
no repeated full-text copies

Supported corpora

CBETA TEI/XML

CBETA-style TEI/XML is the primary target.

sinoragd ingest \
  --corpus /path/to/cbeta/xml-p5 \
  --out data/passages.parquet

Kanripo

Kanripo support is experimental and can require a large amount of disk space.

sinoragd ingest \
  --kanripo-input /path/to/kanripo \
  --out data/passages.parquet

Custom corpora

Custom Chinese corpora can be converted into the SinoRAG Corpus Exchange Format.

sinoragd cef-init --out my-corpus
# edit corpus.toml, works.jsonl, passages.jsonl

sinoragd ingest-cef \
  --input my-corpus \
  --out data/passages.parquet

Minimum custom format:

corpus.toml
works.jsonl
passages.jsonl

Basic workflow

1. Build passage database

sinoragd ingest \
  --corpus /path/to/cbeta/xml-p5 \
  --out data/passages.parquet

2. Build document table

sinoragd doc-table-build \
  --parquet data/passages.parquet \
  --out data/doc_table.bin

3. Build catalog index

sinoragd catalog-index-build \
  --parquet data/passages.parquet \
  --out derived/catalog.index

4. Build phrase index

sinoragd phrase-index-build-v2 \
  --parquet data/passages.parquet \
  --doc-table data/doc_table.bin \
  --out derived/phrase_v2.index \
  --gram-len 4 \
  --buckets 2048 \
  --temp-dir /path/to/large/temp

Use a real disk-backed temp directory for large builds. Avoid /tmp if it is RAM-backed.

5. Build TF-IDF index

sinoragd tfidf-build \
  --parquet data/passages.parquet \
  --doc-table data/doc_table.bin \
  --out derived/tfidf.index \
  --min-ngram 5 \
  --max-ngram 8 \
  --min-df 5 \
  --max-features 200000

For very large corpora, prefer sharded builds.

Search examples

Exact phrase search

sinoragd phrase-index-search \
  --phrase "如是我聞" \
  --parquet data/passages.parquet \
  --doc-table data/doc_table.bin \
  --phrase-index derived/phrase_v2.index

SQL-style passage search

sinoragd search \
  --parquet data/passages.parquet \
  --phrase "平常心是道" \
  --limit 20

Similar passages

sinoragd similar \
  --parquet data/passages.parquet \
  --index derived/tfidf.index \
  --seed "T/T48/T48n2005.xml#p001" \
  --limit 20

MCP server

SinoRAG includes an MCP server so LLM agents can call research tools directly.

sinoragd mcp \
  --parquet data/passages.parquet \
  --doc-table data/doc_table.bin \
  --phrase-index derived/phrase_v2.index \
  --tfidf-index derived/tfidf.index \
  --catalog-index derived/catalog.index

The intended agent pattern is:

agent receives user question
agent calls SinoRAG tools
SinoRAG returns structured evidence
agent writes answer/report with citations

LLM-assisted research

SinoRAG is not meant to replace a scholar or translator. It is meant to make an agent’s work inspectable.

Generated reports should preserve:

exact Chinese source text
passage IDs
source work metadata
search method
confidence / review state
candidate graph edges
rejected or weak evidence where relevant

ReadZen integration idea

SinoRAG can generate research artifacts that other tools organize.

A clean division is:

SinoRAG generates evidence.
ReadZen organizes and displays it beside source texts.

Possible outputs:

research_bundle/
  artifact_index.jsonl
  source_attachments.jsonl
  reports/
  graphs/
  dossiers/

This allows generated reports, diagrams, phrase histories, and lineage outputs to be attached to the CBETA documents and passages they cite.

Performance notes

Large corpora need disk-aware indexing.

Recommendations:

Use release builds.
Put temp files on a fast SSD.
Avoid RAM-backed /tmp for phrase or TF-IDF builds.
Use sharded TF-IDF for very large corpora.
Prefer doc_id references over repeated passage ID strings.
Memory-mapped indexes are preferred for large read-only retrieval structures.

Build with:

cargo build --release

Install

From source:

git clone https://github.com/yourname/sinoragd
cd sinoragd
cargo install --path .

Or run directly:

cargo run --release -- --help

Status

SinoRAG is experimental but usable.

Current focus:

scalable phrase index v2
memory-conscious TF-IDF indexing
stable corpus exchange format
MCP agent integration
ReadZen-compatible research bundles
better translation/glossing workflows for Classical Chinese

Expect schema and command names to change while the project stabilizes.

License

AGPL-3.0

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
cbeta-pdf-creator		cbeta-pdf-creator
docs		docs
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SinoRAG

What it does

Why this exists

Data model

Supported corpora

CBETA TEI/XML

Kanripo

Custom corpora

Basic workflow

1. Build passage database

2. Build document table

3. Build catalog index

4. Build phrase index

5. Build TF-IDF index

Search examples

Exact phrase search

SQL-style passage search

Similar passages

MCP server

LLM-assisted research

ReadZen integration idea

Performance notes

Install

Status

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SinoRAG

What it does

Why this exists

Data model

Supported corpora

CBETA TEI/XML

Kanripo

Custom corpora

Basic workflow

1. Build passage database

2. Build document table

3. Build catalog index

4. Build phrase index

5. Build TF-IDF index

Search examples

Exact phrase search

SQL-style passage search

Similar passages

MCP server

LLM-assisted research

ReadZen integration idea

Performance notes

Install

Status

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages