Research Copilot is a portfolio-grade MVP: an AI document assistant that combines ingestion, chunking, OpenAI embeddings, Chroma vector search, and LangChain-orchestrated RAG with grounded answers, structured citations, and lightweight workflow routing (summaries, key concepts, comparison, study notes). A FastAPI backend and Streamlit UI are separated so you can swap the frontend or call the API directly.
Internship and research applications benefit from artifacts that show you can move ideas into a working system: reliable PDF/text ingestion, metadata-aware chunking, retrieval you can inspect, answers that cite sources, and a thin agent-style layer that routes user intent to the right tool—without pretending to be a fully autonomous agent. This repo is intentionally readable and extensible (pluggable LLM/embeddings factories, service layer, evaluation hooks) rather than a single-script demo.
```mermaid
flowchart LR
    UI[Streamlit UI] --> API[FastAPI]
    API --> SVC[Services]
    SVC --> RAG[RAG layer]
    RAG --> CHR[Chroma]
    RAG --> OAI[OpenAI LLM / Embeddings]
    SVC --> LOG[JSONL logs]
    API --> UP[data/uploads]
    CHR --> VS[data/vectorstore]
```
| Area | Role |
|---|---|
| `app/api/routes` | HTTP: documents, chat (with session memory), workflows (stateless), evaluation retrieval |
| `app/services` | Ingestion, QA retrieval helpers, workflow router, session memory, evaluation logging |
| `app/rag` | Embeddings/LLM factories, chunking, Chroma wrapper, retriever helpers, LangChain-style prompts |
| `app/models` | Pydantic schemas |
| `app/core` | Settings (pydantic-settings, `.env`) |
| `data/uploads` | Original files + `manifest.json` |
| `data/vectorstore` | Persistent Chroma data |
| `data/logs` | `qa_log.jsonl` (question, context, answer) |
Routing: The “agent” is a deterministic router (`WorkflowMode` → retrieval strategy → prompt). There is no open-ended tool loop, which keeps the flow easy to reason about and extend.
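A deterministic router of this shape is essentially a dispatch table from mode to handler. The sketch below is illustrative only: the enum values and handler bodies are assumptions, not the repo's actual `run_workflow` implementation.

```python
from enum import Enum


class WorkflowMode(str, Enum):
    """Illustrative subset of workflow modes."""
    QA_OVER_DOCS = "qa_over_docs"
    SUMMARIZE_DOC = "summarize_doc"


# Each handler would pair a retrieval strategy with a prompt;
# here they just tag the question so the routing is visible.
def qa_handler(question: str) -> str:
    return f"qa:{question}"


def summarize_handler(question: str) -> str:
    return f"summary:{question}"


ROUTES = {
    WorkflowMode.QA_OVER_DOCS: qa_handler,
    WorkflowMode.SUMMARIZE_DOC: summarize_handler,
}


def run_workflow(mode: WorkflowMode, question: str) -> str:
    """Route user intent to the matching handler; fail loudly on unknown modes."""
    try:
        handler = ROUTES[mode]
    except KeyError:
        raise ValueError(f"unsupported mode: {mode}") from None
    return handler(question)
```

Because the table is a plain dict, adding a mode is one enum member plus one entry, and there is no hidden control flow to audit.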
- Python 3.11+
- FastAPI, Uvicorn, Pydantic v2, python-dotenv
- LangChain (LCEL-style chains / chat prompts in `app/rag/chains.py`)
- `langchain-openai` (chat + embeddings)
- Chroma (persistent vector store)
- PyPDF-based PDF loading, text loaders for `.txt`/`.md`
- Streamlit + httpx for the UI
- Clone / copy the project and enter the directory.
- Create a virtual environment (recommended) and install dependencies:

  ```bash
  python3.11 -m venv .venv
  source .venv/bin/activate  # Windows: .venv\Scripts\activate
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Edit .env and set OPENAI_API_KEY
  ```
See `.env.example` for placeholders. Key variables:

| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | Required for embeddings and chat |
| `OPENAI_CHAT_MODEL` | Chat model name (default `gpt-4o-mini`) |
| `OPENAI_EMBEDDING_MODEL` | Embeddings model (default `text-embedding-3-small`) |
| `CHUNK_SIZE` / `CHUNK_OVERLAP` | Text splitter settings |
| `RETRIEVER_TOP_K` | Default `k` for similarity search |
| `CHROMA_COLLECTION_NAME` | Collection name on disk |
| `RESEARCH_COPILOT_API` | Optional Streamlit override for the API base URL |
Terminal 1 — API:

```bash
cd /path/to/ResearchAI
source .venv/bin/activate
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Terminal 2 — UI:

```bash
source .venv/bin/activate
streamlit run streamlit_app.py
```

Open the Streamlit URL (usually http://localhost:8501). The app is chat-first: use the bottom input like a normal chat; library upload, mode, and scope live in the sidebar. Set the API base to http://127.0.0.1:8000 if needed.
If you see `ModuleNotFoundError` (e.g. `langchain_community`), the dependencies are installed in `.venv` but `uvicorn` is running under a different Python (e.g. a pyenv global). From the project folder, run `source .venv/bin/activate`, then `which uvicorn` — it should show `.../ResearchAI/.venv/bin/uvicorn`. Alternatively, run `pip install -r requirements.txt` with the same interpreter you use for `uvicorn` (check with `which python`).
Health check:

```bash
curl -s http://127.0.0.1:8000/health | python3 -m json.tool
```

Upload a document:

```bash
curl -s -X POST "http://127.0.0.1:8000/api/documents/upload" \
  -F "file=@./samples/example.txt"
```

List ingested documents:

```bash
curl -s "http://127.0.0.1:8000/api/documents" | python3 -m json.tool
```

Chat (QA with citations + optional retrieval debug):

```bash
curl -s -X POST "http://127.0.0.1:8000/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session-1",
    "question": "What is the main contribution?",
    "mode": "qa_over_docs",
    "file_id": null,
    "debug_retrieval": true
  }' | python3 -m json.tool
```

Stateless workflow (no chat memory):

```bash
curl -s -X POST "http://127.0.0.1:8000/api/workflows/run" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "wf-1",
    "mode": "summarize_doc",
    "question": "",
    "file_id": "YOUR_FILE_ID"
  }' | python3 -m json.tool
```

Compare two documents:

```bash
curl -s -X POST "http://127.0.0.1:8000/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session-2",
    "question": "Contrast methodology and conclusions.",
    "mode": "compare_docs",
    "compare_file_id_a": "FILE_ID_A",
    "compare_file_id_b": "FILE_ID_B",
    "debug_retrieval": true
  }' | python3 -m json.tool
```

Evaluation: retrieval-only (top-k chunks):

```bash
curl -s -X POST "http://127.0.0.1:8000/api/eval/retrieval" \
  -H "Content-Type: application/json" \
  -d '{"question": "neural architecture", "top_k": 5, "file_id": null}' | python3 -m json.tool
```

Logs append to `data/logs/qa_log.jsonl` (question, retrieved context blob, answer, mode, timestamp).
- Multi-file upload and ingest
- Document list with chunk counts
- Mode selector: QA, summarize, compare, key concepts, study notes
- Optional file scope (single doc vs library-wide)
- Answer panel + structured citations
- Retrieval debug (from chat flag or dedicated eval section)
- Chroma similarity scores in debug (interpretation depends on distance metric; lower is often “closer” for L2)
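When reading those debug scores, the ordering convention matters: for a distance metric like L2, smaller is closer, so results should be ranked ascending. A tiny illustrative helper (not from the repo):

```python
def rank_by_distance(results: list[tuple[str, float]]) -> list[str]:
    """Order (doc, score) pairs for a *distance* metric, where lower
    means closer. For a similarity score you would sort descending."""
    return [doc for doc, dist in sorted(results, key=lambda pair: pair[1])]
```

E.g. `rank_by_distance([("a", 0.9), ("b", 0.1)])` puts `"b"` first, since its distance is smaller.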
```bash
pytest tests/ -q
```

Smoke tests hit `/health` and `/api/documents` without calling OpenAI.
- New LLM provider: implement `LLMProvider` in `app/rag/llm.py` and `LLMFactory.register(...)`.
- New embeddings: implement `EmbeddingsProvider` in `app/rag/embeddings.py` and register.
- FAISS instead of Chroma: implement a small adapter in `app/rag/vector_store.py` matching the same call sites (`add_documents`, `similarity_search_with_score`, `as_retriever`).
- Stronger agents: replace `run_workflow` with a LangGraph / tool-calling loop; keep ingestion and retriever services as-is.
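The factory pattern behind the first two bullets can be sketched as a small registry. The class name `LLMFactory` and method `register` echo the repo's identifiers, but the body and the `EchoLLM` stand-in are assumptions, not the actual code:

```python
from typing import Callable, Dict


class LLMFactory:
    """Minimal registry sketch: map provider names to constructors."""
    _registry: Dict[str, Callable[..., object]] = {}

    @classmethod
    def register(cls, name: str, constructor: Callable[..., object]) -> None:
        cls._registry[name] = constructor

    @classmethod
    def create(cls, name: str, **kwargs) -> object:
        if name not in cls._registry:
            raise KeyError(f"unknown provider: {name}")
        return cls._registry[name](**kwargs)


class EchoLLM:
    """Hypothetical provider used only to demonstrate registration."""
    def invoke(self, prompt: str) -> str:
        return prompt.upper()


LLMFactory.register("echo", EchoLLM)
```

A new provider then only needs a constructor and one `register` call; nothing else in the service layer changes.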
- Async ingestion queue and progress callbacks
- OCR for scanned PDFs
- Hybrid retrieval (BM25 + vectors)
- User authentication and multi-tenant stores
- Automated eval sets (RAGAS-style) wired to `data/logs` traces
- Export of conversations and citation audit trails
- Retrieval precision@k on labeled QA pairs from your corpus
- Citation faithfulness: human or model check that each claim maps to a retrieved span
- Ablations: chunk size, top‑k, embedding model, chat model
- Compare workflow stress tests: two papers with overlapping keywords but different claims
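The first metric above is straightforward to compute once you have labeled QA pairs. A stdlib sketch of precision@k, assuming each retrieved chunk carries a stable ID you can match against the labels:

```python
def precision_at_k(retrieved_ids: list[str],
                   relevant_ids: set[str],
                   k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)
```

For instance, if 2 of the top 4 retrieved chunks are relevant, precision@4 is 0.5. Sweeping `k` alongside chunk size gives the ablation grid described above.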
Built as a complete, runnable MVP — configure `OPENAI_API_KEY`, run Uvicorn + Streamlit, upload a PDF, and iterate.