
📑 Text Chunking Playground for RAG

An interactive Streamlit-based playground for experimenting with multiple text chunking strategies for RAG (Retrieval-Augmented Generation) pipelines.

This tool allows you to upload documents (TXT, PDF, DOCX, Markdown) and analyze how different chunking strategies split text into chunks, making it easier to compare, visualize, and optimize for your use case.

🚀 Features

Multiple Chunking Strategies

  • Fixed-size (with overlap)
  • Recursive (paragraphs → sentences → fallback)
  • Document-based (structure-aware: headers, sections)
  • Semantic (sentence embeddings + cosine similarity)
  • LLM-based (Gemini-powered intelligent chunking)
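
Fixed-size chunking is also the strategy the others fall back to (see Development Notes), so a minimal sketch of it helps anchor the rest; the function name and defaults below are illustrative assumptions, not necessarily how app.py implements it:

def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Cut text into equal-sized character windows, each overlapping the previous by `overlap`."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# e.g. 500-character chunks that share 50 characters with their neighbour
chunks = fixed_size_chunks(open("sample.txt", encoding="utf-8").read())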

Interactive Playground

  • Upload .txt, .pdf, .docx, or .md files
  • Adjust parameters like chunk_size, overlap, and semantic threshold
  • Compare strategies side by side

User-Friendly Interface

  • Preview original text
  • Expandable chunk results
  • First 20 chunks displayed for quick inspection

Caching for Speed

  • Sentence-Transformer model is loaded once
  • Gemini client cached for efficiency

📂 Project Structure

rag-chunking-playground/
├── app.py              # Streamlit main app
├── requirements.txt    # Python dependencies
├── README.md           # Project documentation
└── .env.example        # Example env file

⚡ Quickstart

1️⃣ Clone the repository

git clone https://github.com/MuhammadAbdullah95/rag-chunking-playground.git
cd rag-chunking-playground

2️⃣ Create a virtual environment (with uv)

uv venv
.venv\Scripts\activate   # Windows
source .venv/bin/activate   # Linux / Mac

3️⃣ Install dependencies

uv sync

4️⃣ Set up environment variables

Create a .env file in the project root:

GOOGLE_API_KEY=<your_gemini_api_key>
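
app.py reads this key at startup. A minimal sketch of how the key is typically loaded with python-dotenv and handed to google-generativeai (the model name is an assumption; the actual client setup in app.py may differ):

import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # picks up GOOGLE_API_KEY from the .env file in the project root
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # model name is illustrative only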

▶️ Run the App

streamlit run app.py

Then open your browser at http://localhost:8501


📊 Chunking Strategies

Strategy         Description                                    Best For
Fixed-size       Equal-sized chunks with optional overlap       Simple, consistent processing
Recursive        Hierarchical splitting at natural boundaries   General-purpose text
Document-based   Structure-aware (headers, sections)            Structured documents (reports, papers)
Semantic         Uses embeddings & similarity thresholds        Context preservation
LLM-based        Gemini generates coherent chunks               Complex, nuanced content
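
As a rough illustration of the semantic strategy, here is a sketch built on sentence-transformers and scikit-learn (both in requirements.txt); the embedding model, the pre-split sentence input, and the grouping rule are simplifying assumptions and may differ from app.py:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunks(sentences: list[str], threshold: float = 0.7) -> list[str]:
    """Start a new chunk whenever two consecutive sentences fall below the similarity threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(sentences)             # one vector per sentence
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
        if similarity >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks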

🔧 Example Usage

  1. Upload a .pdf research paper.
  2. Select Semantic Chunking with threshold 0.7.
  3. Compare with Fixed-size Chunking (chunk_size=500, overlap=50).
  4. Expand results to inspect chunk boundaries.
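
Behind step 1, the text extraction that feeds the chunkers typically looks like this with PyPDF2 (listed under Requirements); the file name is a placeholder and app.py may handle pages or encodings differently:

from PyPDF2 import PdfReader

reader = PdfReader("paper.pdf")  # placeholder file name
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(reader.pages)} pages, {len(text)} characters")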

📦 Requirements

Main libraries used:

  • streamlit
  • python-dotenv
  • PyPDF2
  • sentence-transformers
  • scikit-learn
  • google-generativeai
  • docx2txt

🛠 Development Notes

  • Caching is enabled for the following (sketched below):
    • Sentence-Transformer model (@st.cache_resource)
    • Gemini client (@st.cache_resource)
    • Semantic & LLM chunking (@st.cache_data)
  • Fallbacks ensure:
    • If a strategy fails, it falls back to fixed-size chunking.
    • Document-based chunking defaults to fixed-size if no headers are found.
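
A sketch of how these decorators are typically applied (the function and model names are illustrative; app.py's actual helpers may differ, and the Gemini client assumes genai.configure() was already called):

import streamlit as st
import google.generativeai as genai
from sentence_transformers import SentenceTransformer

@st.cache_resource  # loaded once per server process, reused across Streamlit reruns
def load_embedding_model():
    return SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name

@st.cache_resource  # one Gemini client shared by every rerun
def load_gemini_model():
    return genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

@st.cache_data  # expensive chunking results memoized per unique input and parameters
def cached_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

The fallback behaviour usually amounts to a try/except around the selected strategy that returns fixed-size chunks whenever the strategy raises.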

📜 License

This project is open-source under the MIT License.
