TopoRAG 🕳️

Find what your RAG system is missing — before your LLM makes things up.

TopoRAG is a Python library that uses Topological Data Analysis (TDA) and Persistent Homology to detect conceptual gaps and blind spots in text documents. It's specifically designed for AI Engineers who want to validate that retrieved context is complete before feeding it to an LLM.

🧠 The Problem It Solves

When you build a RAG system, your vector database retrieves the nearest chunks — but "nearest" doesn't mean "complete". There can be a topological hole in the middle of your retrieved context: a concept that makes the logical sequence coherent that happened to not be in your database.

When you send that holey context to an LLM, it halluccinates the missing part.

TopoRAG detects those holes before the LLM ever sees the context.

🚀 Installation

pip install toporag

Set your OpenAI API key:

export OPENAI_API_KEY="sk-..."

⚡ Quick Start

import asyncio
from toporag import TopoAnalyzer

async def main():
    analyzer = TopoAnalyzer()  # Picks up OPENAI_API_KEY automatically

    # Your RAG-retrieved context chunks
    chunks = [
        "The user signed up and verified their email.",
        "They upgraded their subscription to Pro.",
        # <- There's a gap here. They never actually used the product!
        "They exported the final report as a PDF.",
        "They logged out."
    ]

    gaps = await analyzer.analyze_segments(chunks)
    for gap in gaps:
        print(f"⚠️  Missing: {gap['topic_label']}")
        print(f"   Why: {gap['explanation']}")
        print(f"   Fix: {gap['suggested_content']}")

asyncio.run(main())

Output:

⚠️  Missing: Product Usage & Core Workflow
   Why: The context jumps from billing to exporting a report, completely skipping how the user actually used the platform.
   Fix: The user uploaded their dataset, configured the analysis settings, and ran the processing pipeline.

📚 Examples

The examples/ folder contains ready-to-run scripts:

File	Description
`basic_usage.py`	Minimal example — analyze a raw text string
`ai_engineer_test.py`	Real-world scenario: an AI engineer validating RAG context chunks before LLM inference

To run an example:

git clone https://github.com/MuLIAICHI/toporag_lib.git
cd toporag_lib
pip install -e .
export OPENAI_API_KEY="sk-..."
python examples/basic_usage.py

🛠️ API Reference

`TopoAnalyzer.analyze_text(text, threshold=0.15, max_holes=5, generate_suggestions=True)`

Splits raw text into segments, embeds them, and finds topological gaps.

`TopoAnalyzer.analyze_url(url, ...)`

Scrapes a URL for readable text and runs the full analysis.

`TopoAnalyzer.analyze_segments(segments, ...)`

Runs the analysis on a pre-chunked List[str]. Best for validating documents already retrieved by a RAG pipeline.

Parameters:

threshold (float, default=0.15): Minimum persistence for a hole to be reported. Lower = more sensitive. For small document sets (< 10 chunks), try 0.05.
max_holes (int, default=5): Maximum number of gaps to return.
generate_suggestions (bool, default=True): Whether to ask the LLM to suggest content that would fill the gap.

Returns: List[Dict] with fields: topic_label, explanation, suggested_content, persistence, rank.

🏗️ How It Works

Raw Text / URL / Chunks
       ↓
   split_text()         ← paragraph-aware chunking
       ↓
  OpenAI Embeddings     ← text-embedding-3-small
       ↓
   PCA Reduction        ← reduce to min(10, n-1) dims
       ↓
   Ripser (TDA)         ← persistent homology, H1 holes
       ↓  (fallback)
  LOF Outlier Detect   ← if Ripser fails, LOF finds sparse regions
       ↓
  GPT-4o-mini Labels   ← names and explains each hole using nearby context
       ↓
  List of Gaps 🕳️

🗺️ Roadmap

Support for HuggingFace local embedding models (no OpenAI key needed)
Async streaming progress callbacks for large document sets
LangChain & LlamaIndex integration wrappers
CLI: toporag analyze --file report.txt
Richer persistence diagrams as output options

Have an idea? Open an issue or vote on existing ones!

🤝 Contributing

Contributions of any kind are welcome — bug fixes, new embedding model adapters, better TDA heuristics, or examples!

Fork the repo
Create a branch: git checkout -b feature/my-feature
Make your changes and commit: git commit -m 'feat: add my feature'
Push: git push origin feature/my-feature
Open a Pull Request 🎉

Found a bug? Open an issue. Have a question? Start a Discussion.

⚖️ License

MIT — see LICENSE. Use this freely in personal or commercial projects.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
examples		examples
src/toporag		src/toporag
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TopoRAG 🕳️

🧠 The Problem It Solves

🚀 Installation

⚡ Quick Start

📚 Examples

🛠️ API Reference

`TopoAnalyzer.analyze_text(text, threshold=0.15, max_holes=5, generate_suggestions=True)`

`TopoAnalyzer.analyze_url(url, ...)`

`TopoAnalyzer.analyze_segments(segments, ...)`

🏗️ How It Works

🗺️ Roadmap

🤝 Contributing

⚖️ License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TopoRAG 🕳️

🧠 The Problem It Solves

🚀 Installation

⚡ Quick Start

📚 Examples

🛠️ API Reference

TopoAnalyzer.analyze_text(text, threshold=0.15, max_holes=5, generate_suggestions=True)

TopoAnalyzer.analyze_url(url, ...)

TopoAnalyzer.analyze_segments(segments, ...)

🏗️ How It Works

🗺️ Roadmap

🤝 Contributing

⚖️ License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`TopoAnalyzer.analyze_text(text, threshold=0.15, max_holes=5, generate_suggestions=True)`

`TopoAnalyzer.analyze_url(url, ...)`

`TopoAnalyzer.analyze_segments(segments, ...)`

Packages