An interactive Streamlit-based playground for experimenting with multiple text chunking strategies for RAG (Retrieval-Augmented Generation) pipelines.
This tool lets you upload documents (TXT, PDF, DOCX, Markdown) and analyze how different chunking strategies split the text, making it easier to compare, visualize, and optimize chunking for your use case.
- Fixed-size (with overlap)
- Recursive (paragraphs → sentences → fallback)
- Document-based (structure-aware: headers, sections)
- Semantic (sentence embeddings + cosine similarity)
- LLM-based (Gemini-powered intelligent chunking)
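For orientation, the simplest of these strategies, fixed-size with overlap, reduces to a few lines. This is an illustrative sketch, not the app's exact implementation:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size characters,
    where each chunk repeats the last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap gives a retriever some shared context at chunk boundaries, at the cost of storing duplicated text.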
- Upload `.txt`, `.pdf`, `.docx`, or `.md` files
- Adjust parameters like `chunk_size`, `overlap`, and semantic threshold
- Compare strategies side by side
- Preview original text
- Expandable chunk results
- First 20 chunks displayed for quick inspection
- Sentence-Transformer model is loaded once
- Gemini client cached for efficiency
```text
chunking-playground/
│── app.py            # Streamlit main app
│── requirements.txt  # Python dependencies
│── README.md         # Project documentation
│── .env.example      # Example env file
```
1️⃣ Clone the repository

```bash
git clone https://github.com/MuhammadAbdullah95/rag-chunking-playground.git
cd rag-chunking-playground
```

2️⃣ Create a virtual environment (with uv)

```bash
uv venv
.venv\Scripts\activate     # Windows
source .venv/bin/activate  # Linux / Mac
```

3️⃣ Install dependencies

```bash
uv sync
```

4️⃣ Set up environment variables

Create a `.env` file in the project root:

```bash
GOOGLE_API_KEY=<your_gemini_api_key>
```

5️⃣ Run the app

```bash
streamlit run app.py
```

Then open your browser at http://localhost:8501.
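Inside `app.py`, the key is presumably read via `python-dotenv`'s `load_dotenv()`. For illustration, here is a dependency-free sketch of the same behaviour (`load_env` is a hypothetical stand-in, not the app's actual code):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv():
    read KEY=VALUE lines and export them to os.environ."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        # skip blanks, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_env()
if not os.getenv("GOOGLE_API_KEY"):
    print("GOOGLE_API_KEY missing; LLM-based chunking will be unavailable")
```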
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Equal-sized chunks with optional overlap | Simple, consistent processing |
| Recursive | Hierarchical splitting at natural boundaries | General-purpose text |
| Document-based | Structure-aware (headers, sections) | Structured documents (reports, papers) |
| Semantic | Uses embeddings & similarity thresholds | Context preservation |
| LLM-based | Gemini generates coherent chunks | Complex, nuanced content |
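As a rough illustration of the semantic strategy: sentences are embedded and a new chunk starts whenever similarity to the previous sentence falls below the threshold. The app uses `sentence-transformers` embeddings; here `embed` is a stand-in for a real model, and the function names are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Group consecutive sentences; start a new chunk whenever
    similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```

With a real embedding model, `embed` would be something like `model.encode(sent)`; the threshold slider in the UI maps onto the `threshold` parameter here.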
- Upload a `.pdf` research paper.
- Select Semantic Chunking with threshold `0.7`.
- Compare with Fixed-size Chunking (`chunk_size=500`, `overlap=50`).
- Expand results to inspect chunk boundaries.
Main libraries used:

- `streamlit`
- `python-dotenv`
- `PyPDF2`
- `sentence-transformers`
- `scikit-learn`
- `google-generativeai`
- `docx2txt`
- Caching is enabled for:
  - Sentence Transformer model (`@st.cache_resource`)
  - Gemini client (`@st.cache_resource`)
  - Semantic & LLM chunking (`@st.cache_data`)
- Fallbacks ensure:
  - If a strategy fails, it falls back to fixed-size chunking.
  - Document-based chunking defaults to fixed-size if no headers are found.
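The fallback behaviour can be sketched as a simple wrapper; the names here are illustrative, not the app's actual functions:

```python
def chunk_with_fallback(text: str, strategy, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Try the selected chunking strategy; on any failure
    (or an empty result), fall back to fixed-size chunking."""
    def fixed_size(t: str) -> list[str]:
        step = chunk_size - overlap
        return [t[i:i + chunk_size] for i in range(0, len(t), step)]
    try:
        chunks = strategy(text)
        return chunks if chunks else fixed_size(text)
    except Exception:
        return fixed_size(text)
```

This keeps the UI responsive: a flaky LLM call or a document with no headers still produces usable chunks instead of an error.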
This project is open-source under the MIT License.