Find what your RAG system is missing — before your LLM makes things up.
TopoRAG is a Python library that uses Topological Data Analysis (TDA) and Persistent Homology to detect conceptual gaps and blind spots in text documents. It's specifically designed for AI Engineers who want to validate that retrieved context is complete before feeding it to an LLM.
When you build a RAG system, your vector database retrieves the nearest chunks, but "nearest" doesn't mean "complete". There can be a topological hole in the middle of your retrieved context: a concept the logical sequence needs in order to be coherent, but that was never in your database.
When you send that holey context to an LLM, it hallucinates the missing part.
TopoRAG detects those holes before the LLM ever sees the context.
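The intuition can be shown without any TDA machinery: if consecutive chunks are embedded, a missing concept shows up as an unusually large semantic jump between neighbors. Below is a toy NumPy sketch with hand-made 2-D "embeddings" (real embeddings come from a model, and this consecutive-similarity check is a simplification for illustration, not TopoRAG's persistent-homology method):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 2-D "embeddings" for four consecutive chunks; the jump
# between chunk 1 and chunk 2 stands in for a missing concept.
chunks = np.array([[1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [0.2, 0.9]])

sims = [cosine(chunks[i], chunks[i + 1]) for i in range(len(chunks) - 1)]
gap_at = int(np.argmin(sims))  # the weakest link in the chain
print(gap_at)  # 1 -> the hole sits between chunk 1 and chunk 2
```

Persistent homology generalizes this idea: instead of only checking consecutive pairs, it looks at the global shape of the point cloud and finds loops (H1 features) that signal a missing region.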
```bash
pip install toporag
```

Set your OpenAI API key:

```bash
export OPENAI_API_KEY="sk-..."
```

```python
import asyncio
from toporag import TopoAnalyzer

async def main():
    analyzer = TopoAnalyzer()  # Picks up OPENAI_API_KEY automatically

    # Your RAG-retrieved context chunks
    chunks = [
        "The user signed up and verified their email.",
        "They upgraded their subscription to Pro.",
        # <- There's a gap here. They never actually used the product!
        "They exported the final report as a PDF.",
        "They logged out.",
    ]

    gaps = await analyzer.analyze_segments(chunks)
    for gap in gaps:
        print(f"⚠️ Missing: {gap['topic_label']}")
        print(f"   Why: {gap['explanation']}")
        print(f"   Fix: {gap['suggested_content']}")

asyncio.run(main())
```

Output:
```
⚠️ Missing: Product Usage & Core Workflow
   Why: The context jumps from billing to exporting a report, completely skipping how the user actually used the platform.
   Fix: The user uploaded their dataset, configured the analysis settings, and ran the processing pipeline.
```
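In a pipeline, a gap report like this can gate whether retrieved context is safe to send to the LLM at all. Here is a minimal, library-free sketch using mock gap dicts shaped like TopoRAG's output; the gating cutoff and the mock persistence values are illustrative, not library behavior:

```python
def context_is_complete(gaps, max_persistence=0.15):
    """Return True if no detected gap is persistent enough to matter.

    `gaps` is a list of dicts shaped like TopoRAG's output.
    The 0.15 cutoff here is an illustrative choice, not a library default.
    """
    return all(g["persistence"] < max_persistence for g in gaps)

# Mock gaps, shaped like the library's return value (values fabricated)
gaps = [
    {"topic_label": "Product Usage & Core Workflow",
     "explanation": "Context jumps from billing to exporting.",
     "suggested_content": "The user ran the processing pipeline.",
     "persistence": 0.42, "rank": 1},
]

if context_is_complete(gaps):
    print("Context OK - send to LLM")
else:
    print("Gap detected - retrieve more chunks first")  # this branch runs
```

A natural follow-up is to use each gap's `suggested_content` as a new retrieval query and re-run the analysis on the augmented context.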
The examples/ folder contains ready-to-run scripts:
| File | Description |
|---|---|
| `basic_usage.py` | Minimal example: analyze a raw text string |
| `ai_engineer_test.py` | Real-world scenario: an AI engineer validating RAG context chunks before LLM inference |
To run an example:
```bash
git clone https://github.com/MuLIAICHI/toporag_lib.git
cd toporag_lib
pip install -e .
export OPENAI_API_KEY="sk-..."
python examples/basic_usage.py
```

The analyzer accepts three kinds of input:

- Raw text: splits the text into segments, embeds them, and finds topological gaps.
- A URL: scrapes the page for readable text and runs the full analysis.
- `analyze_segments()`: runs the analysis on a pre-chunked `List[str]`. Best for validating documents already retrieved by a RAG pipeline.
Parameters:
- `threshold` (float, default=0.15): Minimum persistence for a hole to be reported. Lower = more sensitive. For small document sets (< 10 chunks), try `0.05`.
- `max_holes` (int, default=5): Maximum number of gaps to return.
- `generate_suggestions` (bool, default=True): Whether to ask the LLM to suggest content that would fill the gap.

Returns: `List[Dict]` with fields: `topic_label`, `explanation`, `suggested_content`, `persistence`, `rank`.
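How the two main knobs interact can be seen with mock gap dicts shaped like the documented return value (persistence values fabricated): only holes at or above `threshold` survive, and at most `max_holes` are returned, ranked by persistence. This is a sketch of the semantics, not the library's internal code:

```python
def select_gaps(holes, threshold=0.15, max_holes=5):
    """Mimic the documented filtering: keep holes with persistence >= threshold,
    rank by persistence (highest first), and return at most max_holes."""
    kept = sorted((h for h in holes if h["persistence"] >= threshold),
                  key=lambda h: h["persistence"], reverse=True)
    return [dict(h, rank=i + 1) for i, h in enumerate(kept[:max_holes])]

holes = [{"topic_label": "A", "persistence": 0.30},
         {"topic_label": "B", "persistence": 0.08},  # below default threshold
         {"topic_label": "C", "persistence": 0.21}]

print(select_gaps(holes))         # A (rank 1) and C (rank 2); B filtered out
print(select_gaps(holes, 0.05))   # all three survive the lower threshold
```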
```
Raw Text / URL / Chunks
         ↓
split_text()          ← paragraph-aware chunking
         ↓
OpenAI Embeddings     ← text-embedding-3-small
         ↓
PCA Reduction         ← reduce to min(10, n-1) dims
         ↓
Ripser (TDA)          ← persistent homology, H1 holes
         ↓ (fallback)
LOF Outlier Detect    ← if Ripser fails, LOF finds sparse regions
         ↓
GPT-4o-mini Labels    ← names and explains each hole using nearby context
         ↓
List of Gaps 🕳️
```
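The PCA reduction step above can be sketched in plain NumPy via SVD (a simplified stand-in for the library's internals, using random vectors in place of real OpenAI embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 8, 1536                        # 8 chunks, embedding size of text-embedding-3-small
embeddings = rng.normal(size=(n, dim))  # stand-in for real embeddings

k = min(10, n - 1)                      # reduce to min(10, n-1) dims, as in the pipeline
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:k].T           # project onto the top-k principal directions

print(reduced.shape)  # (8, 7)
```

Reducing to at most `n - 1` dimensions matters because `n` points span at most an `(n - 1)`-dimensional subspace, and Ripser's runtime grows quickly with ambient dimension.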
- Support for HuggingFace local embedding models (no OpenAI key needed)
- Async streaming progress callbacks for large document sets
- LangChain & LlamaIndex integration wrappers
- CLI: `toporag analyze --file report.txt`
- Richer persistence diagrams as output options
Have an idea? Open an issue or vote on existing ones!
Contributions of any kind are welcome — bug fixes, new embedding model adapters, better TDA heuristics, or examples!
- Fork the repo
- Create a branch: `git checkout -b feature/my-feature`
- Make your changes and commit: `git commit -m 'feat: add my feature'`
- Push: `git push origin feature/my-feature`
- Open a Pull Request 🎉
Found a bug? Open an issue. Have a question? Start a Discussion.
MIT — see LICENSE. Use this freely in personal or commercial projects.