Skip to content

MuLIAICHI/toporag_lib

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TopoRAG 🕳️

PyPI version License: MIT Python 3.9+ PRs Welcome

Find what your RAG system is missing — before your LLM makes things up.

TopoRAG is a Python library that uses Topological Data Analysis (TDA) and Persistent Homology to detect conceptual gaps and blind spots in text documents. It's specifically designed for AI Engineers who want to validate that retrieved context is complete before feeding it to an LLM.


🧠 The Problem It Solves

When you build a RAG system, your vector database retrieves the nearest chunks — but "nearest" doesn't mean "complete". There can be a topological hole in the middle of your retrieved context: a concept that makes the logical sequence coherent that happened to not be in your database.

When you send that holey context to an LLM, it halluccinates the missing part.

TopoRAG detects those holes before the LLM ever sees the context.


🚀 Installation

pip install toporag

Set your OpenAI API key:

export OPENAI_API_KEY="sk-..."

⚡ Quick Start

import asyncio
from toporag import TopoAnalyzer

async def main():
    analyzer = TopoAnalyzer()  # Picks up OPENAI_API_KEY automatically

    # Your RAG-retrieved context chunks
    chunks = [
        "The user signed up and verified their email.",
        "They upgraded their subscription to Pro.",
        # <- There's a gap here. They never actually used the product!
        "They exported the final report as a PDF.",
        "They logged out."
    ]

    gaps = await analyzer.analyze_segments(chunks)
    for gap in gaps:
        print(f"⚠️  Missing: {gap['topic_label']}")
        print(f"   Why: {gap['explanation']}")
        print(f"   Fix: {gap['suggested_content']}")

asyncio.run(main())

Output:

⚠️  Missing: Product Usage & Core Workflow
   Why: The context jumps from billing to exporting a report, completely skipping how the user actually used the platform.
   Fix: The user uploaded their dataset, configured the analysis settings, and ran the processing pipeline.

📚 Examples

The examples/ folder contains ready-to-run scripts:

File Description
basic_usage.py Minimal example — analyze a raw text string
ai_engineer_test.py Real-world scenario: an AI engineer validating RAG context chunks before LLM inference

To run an example:

git clone https://github.com/MuLIAICHI/toporag_lib.git
cd toporag_lib
pip install -e .
export OPENAI_API_KEY="sk-..."
python examples/basic_usage.py

🛠️ API Reference

TopoAnalyzer.analyze_text(text, threshold=0.15, max_holes=5, generate_suggestions=True)

Splits raw text into segments, embeds them, and finds topological gaps.

TopoAnalyzer.analyze_url(url, ...)

Scrapes a URL for readable text and runs the full analysis.

TopoAnalyzer.analyze_segments(segments, ...)

Runs the analysis on a pre-chunked List[str]. Best for validating documents already retrieved by a RAG pipeline.

Parameters:

  • threshold (float, default=0.15): Minimum persistence for a hole to be reported. Lower = more sensitive. For small document sets (< 10 chunks), try 0.05.
  • max_holes (int, default=5): Maximum number of gaps to return.
  • generate_suggestions (bool, default=True): Whether to ask the LLM to suggest content that would fill the gap.

Returns: List[Dict] with fields: topic_label, explanation, suggested_content, persistence, rank.


🏗️ How It Works

Raw Text / URL / Chunks
       ↓
   split_text()         ← paragraph-aware chunking
       ↓
  OpenAI Embeddings     ← text-embedding-3-small
       ↓
   PCA Reduction        ← reduce to min(10, n-1) dims
       ↓
   Ripser (TDA)         ← persistent homology, H1 holes
       ↓  (fallback)
  LOF Outlier Detect   ← if Ripser fails, LOF finds sparse regions
       ↓
  GPT-4o-mini Labels   ← names and explains each hole using nearby context
       ↓
  List of Gaps 🕳️

🗺️ Roadmap

  • Support for HuggingFace local embedding models (no OpenAI key needed)
  • Async streaming progress callbacks for large document sets
  • LangChain & LlamaIndex integration wrappers
  • CLI: toporag analyze --file report.txt
  • Richer persistence diagrams as output options

Have an idea? Open an issue or vote on existing ones!


🤝 Contributing

Contributions of any kind are welcome — bug fixes, new embedding model adapters, better TDA heuristics, or examples!

  1. Fork the repo
  2. Create a branch: git checkout -b feature/my-feature
  3. Make your changes and commit: git commit -m 'feat: add my feature'
  4. Push: git push origin feature/my-feature
  5. Open a Pull Request 🎉

Found a bug? Open an issue. Have a question? Start a Discussion.


⚖️ License

MIT — see LICENSE. Use this freely in personal or commercial projects.

About

TopoRAG is a Topological Data Analysis (TDA) library designed to detect conceptual gaps and blind spots in text documents, specifically useful for evaluating the context retrieved in Retrieval-Augmented Generation (RAG) pipelines.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages