Perfect29/ResearchAI
Research Copilot

Research Copilot is a portfolio-grade MVP: an AI document assistant that combines ingestion, chunking, OpenAI embeddings, Chroma vector search, and LangChain-orchestrated RAG with grounded answers, structured citations, and lightweight workflow routing (summaries, key concepts, comparison, study notes). A FastAPI backend and Streamlit UI are separated so you can swap the frontend or call the API directly.

Why this project matters

Internship and research applications benefit from artifacts that show you can move ideas into a working system: reliable PDF/text ingestion, metadata-aware chunking, retrieval you can inspect, answers that cite sources, and a thin agent-style layer that routes user intent to the right tool—without pretending to be a fully autonomous agent. This repo is intentionally readable and extensible (pluggable LLM/embeddings factories, service layer, evaluation hooks) rather than a single-script demo.

Architecture

```mermaid
flowchart LR
  UI[Streamlit UI] --> API[FastAPI]
  API --> SVC[Services]
  SVC --> RAG[RAG layer]
  RAG --> CHR[Chroma]
  RAG --> OAI[OpenAI LLM / Embeddings]
  SVC --> LOG[JSONL logs]
  API --> UP[data/uploads]
  CHR --> VS[data/vectorstore]
```
| Area | Role |
| --- | --- |
| app/api/routes | HTTP: documents, chat (with session memory), workflows (stateless), evaluation retrieval |
| app/services | Ingestion, QA retrieval helpers, workflow router, session memory, evaluation logging |
| app/rag | Embeddings/LLM factories, chunking, Chroma wrapper, retriever helpers, LangChain-style prompts |
| app/models | Pydantic schemas |
| app/core | Settings (pydantic-settings, .env) |
| data/uploads | Original files + manifest.json |
| data/vectorstore | Persistent Chroma data |
| data/logs | qa_log.jsonl (question, context, answer) |

Routing: The “agent” is a deterministic router (WorkflowMode → retrieval strategy → prompt). There is no open-ended tool loop, which keeps the behavior easy to reason about and extend.
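That router can be sketched as a plain lookup table; the mode values come from the API examples below, but the strategy and prompt names here are illustrative, not the repo's actual identifiers:

```python
from enum import Enum


class WorkflowMode(str, Enum):
    QA_OVER_DOCS = "qa_over_docs"
    SUMMARIZE_DOC = "summarize_doc"
    COMPARE_DOCS = "compare_docs"
    KEY_CONCEPTS = "key_concepts"
    STUDY_NOTES = "study_notes"


# Each mode maps to a (retrieval strategy, prompt template) pair.
# A fixed table like this is trivially testable, unlike an open-ended tool loop.
ROUTES: dict[WorkflowMode, tuple[str, str]] = {
    WorkflowMode.QA_OVER_DOCS: ("similarity_top_k", "qa_prompt"),
    WorkflowMode.SUMMARIZE_DOC: ("full_doc_chunks", "summary_prompt"),
    WorkflowMode.COMPARE_DOCS: ("per_doc_top_k", "compare_prompt"),
    WorkflowMode.KEY_CONCEPTS: ("similarity_top_k", "concepts_prompt"),
    WorkflowMode.STUDY_NOTES: ("full_doc_chunks", "notes_prompt"),
}


def route(mode: WorkflowMode) -> tuple[str, str]:
    """Resolve a workflow mode to its retrieval strategy and prompt."""
    return ROUTES[mode]
```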

Stack

  • Python 3.11+
  • FastAPI, Uvicorn, Pydantic v2, python-dotenv
  • LangChain (LCEL-style chains / chat prompts in app/rag/chains.py)
  • langchain-openai (chat + embeddings)
  • Chroma (persistent vector store)
  • PyPDF-based PDF loading, text loaders for .txt / .md
  • Streamlit + httpx for the UI

Setup

  1. Clone / copy the project and enter the directory.

  2. Create a virtual environment (recommended):

    python3.11 -m venv .venv
    source .venv/bin/activate   # Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  3. Configure environment:

    cp .env.example .env
    # Edit .env and set OPENAI_API_KEY

Environment variables

See .env.example for placeholders. Key variables:

| Variable | Purpose |
| --- | --- |
| OPENAI_API_KEY | Required for embeddings and chat |
| OPENAI_CHAT_MODEL | Chat model name (default gpt-4o-mini) |
| OPENAI_EMBEDDING_MODEL | Embeddings model (default text-embedding-3-small) |
| CHUNK_SIZE / CHUNK_OVERLAP | Text splitter settings |
| RETRIEVER_TOP_K | Default k for similarity search |
| CHROMA_COLLECTION_NAME | Collection name on disk |
| RESEARCH_COPILOT_API | Optional Streamlit override for the API base URL |
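A pydantic-settings class covering these variables might look like the sketch below. The chunking and top-k defaults are assumptions for illustration; the real Settings class in app/core may use different defaults and field names.

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Values load from the environment or a local .env file;
    # env var matching is case-insensitive (OPENAI_API_KEY -> openai_api_key).
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str = ""
    openai_chat_model: str = "gpt-4o-mini"
    openai_embedding_model: str = "text-embedding-3-small"
    chunk_size: int = 1000        # illustrative default
    chunk_overlap: int = 200      # illustrative default
    retriever_top_k: int = 4      # illustrative default
    chroma_collection_name: str = "research_copilot"


settings = Settings()
```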

How to run

Terminal 1 — API

cd /path/to/ResearchAI
source .venv/bin/activate
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Terminal 2 — UI

source .venv/bin/activate
streamlit run streamlit_app.py

Open the Streamlit URL (usually http://localhost:8501). The app is chat-first: use the bottom input like a normal chat; library upload, mode, and scope live in the sidebar. Set the API base to http://127.0.0.1:8000 if needed.

If you see ModuleNotFoundError (e.g. langchain_community): dependencies are installed in .venv, but uvicorn is running with another Python (e.g. pyenv global). From the project folder run source .venv/bin/activate then which uvicorn — it should show .../ResearchAI/.venv/bin/uvicorn. Alternatively: pip install -r requirements.txt into the same interpreter you use for uvicorn (check with which python).

Health check

curl -s http://127.0.0.1:8000/health | python3 -m json.tool

Example API usage

Upload a document

curl -s -X POST "http://127.0.0.1:8000/api/documents/upload" \
  -F "file=@./samples/example.txt"

List ingested documents

curl -s "http://127.0.0.1:8000/api/documents" | python3 -m json.tool

Chat (QA with citations + optional retrieval debug)

curl -s -X POST "http://127.0.0.1:8000/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session-1",
    "question": "What is the main contribution?",
    "mode": "qa_over_docs",
    "file_id": null,
    "debug_retrieval": true
  }' | python3 -m json.tool

Stateless workflow (no chat memory)

curl -s -X POST "http://127.0.0.1:8000/api/workflows/run" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "wf-1",
    "mode": "summarize_doc",
    "question": "",
    "file_id": "YOUR_FILE_ID"
  }' | python3 -m json.tool

Compare two documents

curl -s -X POST "http://127.0.0.1:8000/api/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "demo-session-2",
    "question": "Contrast methodology and conclusions.",
    "mode": "compare_docs",
    "compare_file_id_a": "FILE_ID_A",
    "compare_file_id_b": "FILE_ID_B",
    "debug_retrieval": true
  }' | python3 -m json.tool

Evaluation: retrieval-only (top‑k chunks)

curl -s -X POST "http://127.0.0.1:8000/api/eval/retrieval" \
  -H "Content-Type: application/json" \
  -d '{"question": "neural architecture", "top_k": 5, "file_id": null}' | python3 -m json.tool

Logs append to data/logs/qa_log.jsonl (question, retrieved context blob, answer, mode, timestamp).
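Because each log line is a single JSON object, inspecting traces takes only a few lines of Python. This helper assumes nothing beyond the JSONL layout described above:

```python
import json
from pathlib import Path


def load_traces(path: str = "data/logs/qa_log.jsonl") -> list[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```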

UI features

  • Multi-file upload and ingest
  • Document list with chunk counts
  • Mode selector: QA, summarize, compare, key concepts, study notes
  • Optional file scope (single doc vs library-wide)
  • Answer panel + structured citations
  • Retrieval debug (from chat flag or dedicated eval section)
  • Chroma similarity scores in debug (interpretation depends on distance metric; lower is often “closer” for L2)

Tests

pytest tests/ -q

Smoke tests hit /health and /api/documents without calling OpenAI.

Extending the system

  • New LLM provider: implement LLMProvider in app/rag/llm.py and LLMFactory.register(...).
  • New embeddings: implement EmbeddingsProvider in app/rag/embeddings.py and register.
  • FAISS instead of Chroma: implement a small adapter in app/rag/vector_store.py matching the same call sites (add_documents, similarity_search_with_score, as_retriever).
  • Stronger agents: replace run_workflow with a LangGraph / tool-calling loop; keep ingestion and retriever services as-is.
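A registry of the kind LLMFactory.register implies could look like this minimal sketch; the Protocol and method names are illustrative, so check app/rag/llm.py for the real interface before registering a provider.

```python
from typing import Callable, Protocol


class LLMProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


class LLMFactory:
    _registry: dict[str, Callable[[], LLMProvider]] = {}

    @classmethod
    def register(cls, name: str, builder: Callable[[], LLMProvider]) -> None:
        """Map a provider name to a zero-arg builder."""
        cls._registry[name] = builder

    @classmethod
    def create(cls, name: str) -> LLMProvider:
        return cls._registry[name]()


class EchoLLM:
    """Toy provider that returns the prompt unchanged; handy in tests."""
    def complete(self, prompt: str) -> str:
        return prompt


LLMFactory.register("echo", EchoLLM)
```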

Future improvements

  • Async ingestion queue and progress callbacks
  • OCR for scanned PDFs
  • Hybrid retrieval (BM25 + vectors)
  • User authentication and multi-tenant stores
  • Automated eval sets (RAGAS-style) wired to data/logs traces
  • Export of conversations and citation audit trails

Evaluation ideas (research narrative)

  • Retrieval precision@k on labeled QA pairs from your corpus
  • Citation faithfulness: human or model check that each claim maps to a retrieved span
  • Ablations: chunk size, top‑k, embedding model, chat model
  • Compare workflow stress tests: two papers with overlapping keywords but different claims
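Precision@k from the first bullet is simple to compute once each question has a labeled set of relevant chunk IDs; this is a generic sketch, not wired to the repo's evaluation hooks:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks labeled relevant.

    Uses the standard definition (divide by k, not by how many were
    retrieved); assumes k >= 1.
    """
    top = retrieved_ids[:k]
    return sum(1 for cid in top if cid in relevant_ids) / k
```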

Built as a complete, runnable MVP—configure OPENAI_API_KEY, run Uvicorn + Streamlit, upload a PDF, and iterate.
