GitHub - 07subhadip/ResearchPilot: Production RAG system for ML research papers

𝙍𝙚𝙨𝙚𝙖𝙧𝙘𝙝𝙋𝙞𝙡𝙤𝙩 𝙞𝙨 𝙖𝙣 𝙖𝙙𝙫𝙖𝙣𝙘𝙚𝙙 𝙍𝙚𝙩𝙧𝙞𝙚𝙫𝙖𝙡-𝘼𝙪𝙜𝙢𝙚𝙣𝙩𝙚𝙙 𝙂𝙚𝙣𝙚𝙧𝙖𝙩𝙞𝙤𝙣 (𝙍𝘼𝙂) 𝙚𝙣𝙜𝙞𝙣𝙚 𝙙𝙚𝙨𝙞𝙜𝙣𝙚𝙙 𝙨𝙥𝙚𝙘𝙞𝙛𝙞𝙘𝙖𝙡𝙡𝙮 𝙛𝙤𝙧 𝙨𝙮𝙣𝙩𝙝𝙚𝙨𝙞𝙯𝙞𝙣𝙜 𝙖𝙣𝙙 𝙦𝙪𝙚𝙧𝙮𝙞𝙣𝙜 𝙈𝙖𝙘𝙝𝙞𝙣𝙚 𝙇𝙚𝙖𝙧𝙣𝙞𝙣𝙜 𝙡𝙞𝙩𝙚𝙧𝙖𝙩𝙪𝙧𝙚. 𝙄𝙩 𝙞𝙣𝙜𝙚𝙨𝙩𝙨 𝙩𝙝𝙤𝙪𝙨𝙖𝙣𝙙𝙨 𝙤𝙛 𝙨𝙘𝙞𝙚𝙣𝙩𝙞𝙛𝙞𝙘 𝙥𝙖𝙥𝙚𝙧𝙨 𝙛𝙧𝙤𝙢 𝘼𝙧𝙓𝙞𝙫 𝙖𝙣𝙙 𝙖𝙥𝙥𝙡𝙞𝙚𝙨 𝙨𝙩𝙖𝙩𝙚-𝙤𝙛-𝙩𝙝𝙚-𝙖𝙧𝙩 𝙝𝙮𝙗𝙧𝙞𝙙 𝙫𝙚𝙘𝙩𝙤𝙧 𝙨𝙚𝙖𝙧𝙘𝙝𝙞𝙣𝙜 𝙖𝙣𝙙 𝙘𝙧𝙤𝙨𝙨-𝙚𝙣𝙘𝙤𝙙𝙚𝙧 𝙧𝙚𝙧𝙖𝙣𝙠𝙞𝙣𝙜 𝙩𝙤 𝙙𝙚𝙡𝙞𝙫𝙚𝙧 𝙥𝙧𝙚𝙘𝙞𝙨𝙚, 𝙘𝙞𝙩𝙖𝙩𝙞𝙤𝙣-𝙗𝙖𝙘𝙠𝙚𝙙 𝙖𝙣𝙨𝙬𝙚𝙧𝙨. 𝙏𝙝𝙚 𝙨𝙮𝙨𝙩𝙚𝙢 𝙞𝙨 𝙗𝙪𝙞𝙡𝙩 𝙩𝙤 𝙨𝙘𝙖𝙡𝙚 𝙧𝙚𝙡𝙞𝙖𝙗𝙡𝙮 𝙬𝙞𝙩𝙝 𝙖𝙣 𝙞𝙣𝙩𝙚𝙡𝙡𝙞𝙜𝙚𝙣𝙩 𝙢𝙪𝙡𝙩𝙞-𝙢𝙤𝙙𝙚𝙡 𝙇𝙇𝙈 𝙛𝙖𝙡𝙡𝙗𝙖𝙘𝙠 𝙘𝙝𝙖𝙞𝙣.

Demo

The ResearchPilot user interface demonstrating conversation memory, LaTeX mathematical rendering, inline citations, and streaming responses from the AI.

Why ResearchPilot

Feature	Naive RAG	ResearchPilot
Retrieval method	Standard Dense Vectors	✅ Hybrid BM25 + Dense Vectors (RRF Fusion)
Re-ranking	❌ None	✅ Cross-Encoder Re-ranking (`ms-marco-MiniLM`)
Chunking strategy	Fixed-size character splitting	✅ Semantic chunking at topic boundaries
LLM fallback	❌ None (Static Model)	✅ 4-Model Fallback Chain for failover
Conversation memory	❌ Single turn	✅ Multi-turn with Query Rewriting
Math rendering	Broken/Raw LaTeX	✅ Full KaTeX equation rendering
Citation sourcing	Missing or Hallucinated	✅ Inline ArXiv citation badges

Architecture

User Query 
   │
   ▼
Query Rewriter (Follow-up expansion)
   │
   ▼
Retrieval Layer
   ├── BM25 Sparse Index (Term matching)
   └── Qdrant Dense Index (Semantic matching)
   │
   ▼
Reciprocal Rank Fusion (RRF)
   │
   ▼
Cross-Encoder Reranker
   │
   ▼
Context Compressor
   │
   ▼
LLM Fabric (GLM-5.1 → Qwen-3.5 → LLaMA-3.3 → Qwen-2.5-Coder)
   │
   ▼
Streaming Response
   │
   ▼
Frontend (Next.js + KaTeX)

Component	Technology	Purpose
Frontend	Next.js & React	High-performance user interface with App Router.
Backend	FastAPI & Pydantic	Asynchronous, highly concurrent API server.
Embeddings	BAAI/bge-base-en-v1.5	State-of-the-art 768-dimensional textual embeddings.
Vector DB	Qdrant (local, CPU)	Fast, scalable persistent storage for deep semantic vectors.
Sparse Index	rank-bm25	Token-based matching to locate exact keywords and terms.
Re-ranking	MiniLM-L-6-v2	High-fidelity scoring of retrieved contexts against the query.
Generation	Multiple LLMs	Dynamic routing via HuggingFace Inference to ensure high availability.

Tech Stack

Backend

Component	Technology	Purpose
Language	Python 3.10+	Core language for ML applications and data processing
Framework	FastAPI & Uvicorn	High performance API and async web serving
Embeddings	SentenceTransformers	Dense vector generation (BGE-base-en-v1.5)
Vector Store	Qdrant	Vector indexing and semantic retrieval
Search	rank-bm25	Sparse lexical retrieval
Re-ranker	Cross-Encoder	Secondary ranking of retrieved contexts

Frontend

Component	Technology	Purpose
Framework	Next.js (App Router)	React application framework
Styling	Tailwind CSS	Utility-first responsive design
Animations	Framer Motion	Smooth UI rendering and layout transitions
Markdown	react-markdown	Rendering LLM text outputs securely
Mathematics	KaTeX & rehype-katex	Native rendering for complex mathematical equations

Project Structure

researchpilot/
├── src/
│   ├── ingestion/      # ArXiv API fetcher, PDF downloader
│   ├── processing/     # PDF extractor, text cleaner, chunker
│   ├── embeddings/     # BGE embedding model, cache, pipeline
│   ├── vectorstore/    # Qdrant store, BM25 store, indexer
│   ├── retrieval/      # Hybrid retriever, reranker, pipeline
│   ├── rag/            # RAG pipeline, LLM client, prompts
│   └── api/            # FastAPI app, routes, schemas
├── frontend-next/      # Next.js frontend
├── config/             # Settings, environment config
├── data/               # Chunks, embeddings, Qdrant DB (gitignored)
├── logs/               # Application logs
├── Dockerfile          # HuggingFace Spaces deployment
└── run_*.py            # Pipeline runner scripts

Local Setup

Prerequisites

Python 3.10+
Node.js 18+
Git

Installation Steps

Clone the repository:

git clone https://github.com/Subhadip007/ResearchPilot.git
cd ResearchPilot

Create and activate a Python virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

Install backend dependencies:
```
pip install -r requirements.txt
```
Configure environment variables: Copy .env.example to .env and fill in your API keys:
```
cp .env.example .env
```
Add your GROQ_API_KEY and HF_API_KEY.
Run the data ingestion and indexing pipeline: (Note: The full pipeline takes several hours on the first run. For skipping this, see the HF Dataset note below).
```
python run_ingestion.py
python run_chunking.py
python run_embedding.py
python run_indexing.py
```
Start the FastAPI backend:
```
python run_api.py
```
Start the Next.js frontend (in a new terminal):
```
cd frontend-next
npm install
npm run dev
```
Open the application: Navigate to http://localhost:3000

Note: Pre-built vectorized data can be downloaded directly from the Hugging Face Dataset Subhadip007/researchpilot-data to skip step 5.

Pipeline Scripts

Script	Purpose	Runtime (approx)
`run_ingestion.py`	Fetch papers from ArXiv + download PDFs	~10 min / 100 papers
`run_chunking.py`	Semantic chunking with BGE embeddings	~45 min to 1 hr (Kaggle GPU) / 3000 papers
`run_embedding.py`	Generate 768-dim embeddings for all chunks	~1.5 hrs (Kaggle GPU)
`run_indexing.py`	Index embeddings into Qdrant	~2 min
`run_api.py`	Start FastAPI server	Instant

API Reference

POST /query
Body: 
{ 
  "question": "string", 
  "history": [], 
  "top_k": 5,
  "filter_category": "cs.LG", 
  "filter_year_gte": 2024 
}
Returns: 
{ 
  "answer": "string", 
  "citations": ["1234.56789"], 
  "total_time_ms": 1245.5 
}

GET /health
Returns: 
{ 
  "status": "healthy", 
  "vector_db_size": 15664 
}

POST /feedback
Body: 
{ 
  "query": "string", 
  "rating": 5, 
  "comment": "string" 
}
Returns:
{
  "status": "ok"
}

How Retrieval Works

Hybrid search: BM25 finds exact keyword matches, while BGE finds deep semantic textual matches.
RRF fusion: Ranks from sparse and dense retrievers are combined using the 1/(k+rank) formulation, rather than raw score addition.
Two-stage retrieval: The top-40 candidates are retrieved cheaply via vector database, then the top-10 are strictly re-scored by a powerful cross-encoder.
Semantic chunking: Documents are split intelligently at topic boundaries by tracking sentence similarity drops, preventing concept shearing.
Query rewriting: Conversational follow-up questions are automatically expanded using history to ensure the vector database always receives full context.

Deployment

Backend Server

The backend is Dockerized and deployed continuously to Hugging Face Spaces. It runs on internal port 7860. To stay within storage limits, pre-processed vector chunks are automatically restored from a Hugging Face Dataset during component startup.

Frontend Application

The Next.js frontend is deployed via Vercel. It relies on the environment variable NEXT_PUBLIC_API_URL to route requests to the Hugging Face Space.

Note: Hugging Face Spaces on the free tier automatically go to sleep after 48 hours of inactivity. The machine will cold-start temporarily upon the next interaction, but active usage keeps the hardware fully awake and responsive.

Roadmap

Contributing

Contributions and Pull Requests are welcome. However, please open an issue first to discuss any major architectural changes before submitting code.

License

✨ Built with ❤️ by Subhadip Hensh ✨

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.vscode		.vscode
config		config
docs		docs
frontend-next		frontend-next
src		src
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
diagnose_payload.py		diagnose_payload.py
fix_categories.py		fix_categories.py
fix_chunk_categories.py		fix_chunk_categories.py
fix_processed_categories.py		fix_processed_categories.py
merge_embeddings.py		merge_embeddings.py
output.txt		output.txt
reindex_light.py		reindex_light.py
requirements.txt		requirements.txt
run_api.py		run_api.py
run_chunking.py		run_chunking.py
run_embedding.py		run_embedding.py
run_indexing.py		run_indexing.py
run_ingestion.py		run_ingestion.py
test_api.py		test_api.py
test_chunk_quality.py		test_chunk_quality.py
test_chunking.py		test_chunking.py
test_embedding.py		test_embedding.py
test_fetch.py		test_fetch.py
test_fetch_2.py		test_fetch_2.py
test_processing.py		test_processing.py
test_rag.py		test_rag.py
test_retrieval.py		test_retrieval.py
test_search.py		test_search.py
test_stream.py		test_stream.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Demo

Why ResearchPilot

Architecture

Tech Stack

Backend

Frontend

Project Structure

Local Setup

Prerequisites

Installation Steps

Pipeline Scripts

API Reference

How Retrieval Works

Deployment

Backend Server

Frontend Application

Roadmap

Contributing

License

✨ Built with ❤️ by Subhadip Hensh ✨

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Demo

Why ResearchPilot

Architecture

Tech Stack

Backend

Frontend

Project Structure

Local Setup

Prerequisites

Installation Steps

Pipeline Scripts

API Reference

How Retrieval Works

Deployment

Backend Server

Frontend Application

Roadmap

Contributing

License

✨ Built with ❤️ by Subhadip Hensh ✨

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages