PaperPal converts unstructured manuscripts (DOCX/PDF) into a citation-safe LaTeX project and compiles a preview PDF.
For each run, PaperPal materializes:
- build/main.tex
- build/references.bib
- build/figures/...
- build/output.pdf
- run artifacts like extracted.ir.json, citation_map.json, compile_result.json, explainability.report.json
PaperPal is primarily run with Docker Compose in this repo.
- Docker Engine / Docker Desktop
- Docker Compose plugin
cp .env.example .env
cp apps/web/.env.example apps/web/.env.local
Set at least:
- OPENAI_API_KEY
- OPENAI_BASE_URL=https://api.openai.com/v1
docker compose -f infra/docker-compose.yml up -d --build
- Web: http://localhost:5173
- API: http://localhost:8000
- API docs: http://localhost:8000/docs
- Python >=3.10 (3.11 recommended)
- Node.js >=18 (20 recommended)
- PostgreSQL (local instance)
- Redis (local instance)
- Pandoc (required for DOCX extraction)
- For compile stage:
  - either Docker Engine/Desktop (PAPERPAL_COMPILE_USE_DOCKER=1), or
  - local latexmk + TeX Live (PAPERPAL_COMPILE_USE_DOCKER=0)
python -m venv .venv
# PowerShell
.\.venv\Scripts\Activate.ps1
pip install -e .[dev]
cd apps/web
npm install
cd ../..
Copy env templates:
cp .env.example .env
cp apps/web/.env.example apps/web/.env.local
Minimum required variables in .env for full pipeline:
- DATABASE_URL (Postgres DSN)
- REDIS_URL
- OPENAI_API_KEY (for live model calls)
Commonly adjusted variables:
- ENABLE_PDF_PHASE2 (true to allow PDF ingestion)
- PAPERPAL_COMPILE_USE_DOCKER (1 uses sandbox container; 0 uses local latexmk)
- OPENAI_BASE_URL (keep https://api.openai.com/v1 unless using a compatible proxy)
- CORS_ORIGINS (JSON list, e.g. ["http://localhost:5173"])
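The value formats described above can be illustrated with a small parser. This is only a sketch: the Settings class, the load_settings function, and the fallback defaults are hypothetical and are not taken from PaperPal's code; only the variable names and their documented formats come from this README.

```python
import json
import os
from dataclasses import dataclass, field


@dataclass
class Settings:
    """Hypothetical container for the commonly adjusted variables."""
    enable_pdf_phase2: bool = False
    compile_use_docker: bool = True
    openai_base_url: str = "https://api.openai.com/v1"
    cors_origins: list = field(default_factory=list)


def load_settings(env=None):
    """Parse the variables from a mapping (defaults to os.environ).

    The fallback values used when a variable is unset are assumptions.
    """
    env = os.environ if env is None else env
    # CORS_ORIGINS is documented as a JSON list, so decode and validate it.
    origins = json.loads(env.get("CORS_ORIGINS", "[]"))
    if not isinstance(origins, list) or not all(isinstance(o, str) for o in origins):
        raise ValueError("CORS_ORIGINS must be a JSON list of strings")
    return Settings(
        enable_pdf_phase2=env.get("ENABLE_PDF_PHASE2", "false").strip().lower() == "true",
        compile_use_docker=env.get("PAPERPAL_COMPILE_USE_DOCKER", "1").strip() == "1",
        openai_base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        cors_origins=origins,
    )
```

Passing a plain dict instead of os.environ makes it easy to check a candidate .env combination before starting the services.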
alembic upgrade head
API:
uvicorn paperpal_api.main:app --reload --host 0.0.0.0 --port 8000
Pipeline worker:
rq worker pipeline --url redis://localhost:6379/0
Compile worker:
rq worker compile --url redis://localhost:6379/0
Frontend:
cd apps/web
npm run dev
Open:
- Web: http://localhost:5173
- API docs: http://localhost:8000/docs
Authoritative template: .env.example.
The template is grouped by:
- API/service settings
- DB/Redis
- storage/journals paths
- queue names
- pipeline + compile behavior
- LLM/provider config
- model-step overrides
- compile sandbox limits
- PDF ingestion adapter settings
Frontend env template: apps/web/.env.example
High-level run graph:
- extract_document (DOCX via Pandoc AST, PDF via PDF extractor when enabled)
- extract_references
- parse_references
- emit_bibtex
- extract_citation_mentions
- link_citations
- validate_citation_coverage
- load_guidelines
- analyze_scope
- build_block_plan
- generate_blocks
- postprocess_citations
- generate_figures_tables
- assemble_project
- optional polish stages (full_manuscript_polish, secondary_manuscript_polish)
- precompile_validate
- compile (latexmk + deterministic/LLM repair loop)
- explainability
Compile loop is bounded and compile-safe:
- deterministic fixes first
- then limited LLM patch retries
- citation invariant enforced before compile (\cite{key} must exist in references.bib)
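The citation invariant and the bounded repair order can be sketched as follows. This is an illustration under stated assumptions, not PaperPal's implementation: missing_citation_keys, bounded_compile, the regular expressions, and the callback signatures are all hypothetical, and the regex-based BibTeX scan is a simplification of real parsing.

```python
import re

# Matches \cite, \citep, \citet, ... with optional [..] arguments;
# group 1 holds the comma-separated key list.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the key of a BibTeX entry such as "@article{smith2020,".
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def missing_citation_keys(tex_source, bib_source):
    """Return the sorted cited keys that have no entry in the .bib text."""
    cited = set()
    for group in CITE_RE.findall(tex_source):
        cited.update(k.strip() for k in group.split(",") if k.strip())
    defined = set(BIB_KEY_RE.findall(bib_source))
    return sorted(cited - defined)


def bounded_compile(compile_once, deterministic_fixes, llm_patch, max_llm_retries=2):
    """Run a bounded repair loop and return True if compilation succeeds.

    compile_once() returns (ok, log). Each fix and llm_patch take the log
    and return True if they changed the project. All names are hypothetical.
    """
    ok, log = compile_once()
    # Phase 1: deterministic fixes, each tried at most once.
    for fix in deterministic_fixes:
        if ok:
            return True
        if fix(log):
            ok, log = compile_once()
    # Phase 2: a limited number of LLM patch attempts.
    for _ in range(max_llm_retries):
        if ok:
            return True
        if not llm_patch(log):
            break  # no patch was produced, so retrying cannot help
        ok, log = compile_once()
    return ok
```

In this sketch a non-empty result from missing_citation_keys would block the compile step, and the retry cap guarantees the loop terminates even when no patch fixes the build.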
- POST /projects
- POST /projects/{id}/generate
- POST /projects/{id}/compile
- GET /projects/{id}/status?run_id=...
- GET /projects/{id}/files
- GET /projects/{id}/file?path=...
- PUT /projects/{id}/file
- GET /projects/{id}/pdf
- GET /projects/{id}/explainability?run_id=...
- GET /journals
- GET /journals/{id}/guidelines-bundle
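As a rough illustration of how the query-string endpoints above are addressed, the helpers below build URLs for the status and file routes. The helper names are hypothetical, the base URL assumes the local API address given earlier, and the example file path is an assumption about how paths are expressed; request and response bodies are not documented here, so none are shown.

```python
from urllib.parse import quote, urlencode

# Assumes the API is served at the local address listed in this README.
BASE_URL = "http://localhost:8000"


def status_url(project_id, run_id, base_url=BASE_URL):
    """Build the URL for GET /projects/{id}/status?run_id=..."""
    pid = quote(str(project_id), safe="")
    return f"{base_url}/projects/{pid}/status?{urlencode({'run_id': run_id})}"


def file_url(project_id, path, base_url=BASE_URL):
    """Build the URL for GET /projects/{id}/file?path=..."""
    pid = quote(str(project_id), safe="")
    return f"{base_url}/projects/{pid}/file?{urlencode({'path': path})}"
```

Encoding the path as a query value keeps slashes in file names from being mistaken for route segments; the resulting URLs can be passed to any HTTP client.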
- If OPENAI_API_KEY is empty, model steps fall back to deterministic defaults (quality may degrade, but the pipeline stays compile-safe).
- For PDF support, keep ENABLE_PDF_PHASE2=true and ensure extraction dependencies are available.
- Runtime artifacts are under storage/projects/ (ignored by git except .gitkeep).