Drop a .docx. Get compilable LaTeX. Open in Overleaf in one click.
Quick Start · Architecture · Features · API · Deploy · Roadmap
Academic writers draft in Word. Conferences demand LaTeX. The gap between them eats hours.
CoreTex is a compiler-style pipeline that converts Microsoft Word documents into compilable LaTeX with academic-grade fidelity. It targets IEEE, ACM, and Springer conference submissions, but works for any LaTeX writing.
| | Hand re-typing | Pandoc CLI | mammoth.js | CoreTex |
|---|---|---|---|---|
| Equations (OMML) | β | β | β (batched) | |
| Unicode math (β β XΜ Ξ±ββ) β math mode | β | β | β | β (~140 glyphs) |
| Tables + alignment | β | β | β | |
| Citations detected | β | β | β | |
| IEEE / ACM / Springer templates | β | β | β | β |
| i18n list detection (FR/DE/ES/IT/PT) | β | β | β | |
| Compile check + error line | n/a | β | β | β |
| Overleaf one-click | β | β | β | β |
| Decompression-bomb hardening | n/a | β | β | β |
| Web UI | β | β | β | β |
| Time to convert 20-page paper | ~3 hours | ~10 min¹ | ~10 min¹ | ~3 seconds |
¹ Plus manual cleanup, structural fixes, template porting, etc.
CoreTex follows a strict compiler-style Intermediate Representation (IR) pattern. The parser never produces LaTeX strings; the renderer never reads Word XML. The IR is the only shared contract, which makes each layer independently testable.
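The separation described above can be illustrated with a minimal sketch. It uses frozen standard-library dataclasses as a stand-in for the project's Pydantic models, and the names `HeadingNode`, `ParagraphNode`, `parse`, and `render` are hypothetical, chosen only for illustration. The point is that the parser returns typed nodes and the renderer consumes nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadingNode:  # hypothetical node type, for illustration only
    level: int
    text: str


@dataclass(frozen=True)
class ParagraphNode:  # hypothetical node type, for illustration only
    text: str


@dataclass(frozen=True)
class IRDocument:
    nodes: tuple = ()


def parse(raw_blocks):
    """Stand-in parser: turns (style, text) pairs into IR nodes, never LaTeX."""
    nodes = []
    for style, text in raw_blocks:
        if style.startswith("Heading"):
            nodes.append(HeadingNode(level=int(style.split()[-1]), text=text))
        else:
            nodes.append(ParagraphNode(text=text))
    return IRDocument(nodes=tuple(nodes))


SECTION_COMMANDS = {1: "section", 2: "subsection", 3: "subsubsection"}


def render(doc):
    """Stand-in renderer: sees only IR nodes, never Word XML."""
    lines = []
    for node in doc.nodes:
        if isinstance(node, HeadingNode):
            lines.append("\\%s{%s}" % (SECTION_COMMANDS[node.level], node.text))
        else:
            lines.append(node.text)
    return "\n\n".join(lines)
```

Because each function touches only the IR, either side can be unit-tested with hand-built input and no .docx file.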
flowchart LR
subgraph Frontend["Frontend (Vercel)"]
UI[React + CodeMirror]
end
subgraph API["FastAPI Service (Railway)"]
R[Routes /convert /status /download /temp]
end
subgraph Queue["RQ Worker (Railway)"]
direction TB
P[Parser<br/>OOXML → IR]
H1[Equation Handler<br/>Pandoc subprocess]
H2[Image Handler<br/>Pillow compress]
H3[Table Handler<br/>column-spec]
RD[Renderer<br/>IR → LaTeX]
CC[Compile Check<br/>pdflatex]
P --> H1 --> H2 --> H3 --> RD --> CC
end
subgraph Cache["Redis"]
Jobs[Job state + result]
Temp[5-min temp URLs<br/>for Overleaf snip_uri]
Figs[Compressed figures]
end
UI -->|POST .docx| R
R -->|enqueue| Jobs
Jobs -->|dequeue| P
CC -->|ConversionResult| Jobs
Jobs --> R
R -->|.tex or .zip| UI
R -.->|cache .tex| Temp
UI -.->|snip_uri| Overleaf[(Overleaf)]
Overleaf -.->|fetch| Temp
classDef frontend fill:#22d3ee,stroke:#0e7490,color:#06141d,stroke-width:2px
classDef backend fill:#10b981,stroke:#047857,color:#06141d,stroke-width:2px
classDef worker fill:#f59e0b,stroke:#b45309,color:#06141d,stroke-width:2px
classDef cache fill:#ef4444,stroke:#991b1b,color:#fff,stroke-width:2px
class UI frontend
class R backend
class P,H1,H2,H3,RD,CC worker
class Jobs,Temp,Figs cache
| # | Layer | Purpose |
|---|---|---|
| 1 | Ingestion (FastAPI) | MIME magic-byte validation, 20 MB cap, BytesIO only, enqueue RQ job |
| 2 | Surgical Parser (python-docx + lxml) | Walks OOXML tree → typed IR nodes |
| 3 | IR Schema (Pydantic v2) | FROZEN contract: 10 node types |
| 4 | Specialist Handlers | Equations (Pandoc), images (Pillow), tables (column alignment) |
| 5 | IR Renderer (pure Python + Jinja2) | IR → LaTeX with smart preamble injection |
| 6 | Bibliography | CitationNode → display text + [CITATION] warning |
| 7 | Packager | .tex or .zip + Overleaf temp URL + pdflatex check |
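The ingestion checks in layer 1 could look roughly like the sketch below. It is an assumption-laden illustration, not the project's actual route code: the function name `validate_docx_upload` and the exception `UploadRejected` are hypothetical, and the check for `word/document.xml` is an extra step added here to tell a .docx apart from an arbitrary ZIP archive.

```python
import io
import zipfile

MAX_FILE_SIZE_MB = 20          # mirrors the 20 MB cap in the table above
ZIP_MAGIC = b"PK\x03\x04"      # local-file-header signature that opens a .docx


class UploadRejected(ValueError):
    """Hypothetical error type for uploads that fail validation."""


def validate_docx_upload(data, max_mb=MAX_FILE_SIZE_MB):
    """Return an in-memory buffer for a plausible .docx, or raise UploadRejected."""
    if len(data) > max_mb * 1024 * 1024:
        raise UploadRejected("file exceeds size cap")
    if not data.startswith(ZIP_MAGIC):
        raise UploadRejected("not a ZIP container")
    buffer = io.BytesIO(data)  # BytesIO only: nothing is written to disk
    try:
        with zipfile.ZipFile(buffer) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile as exc:
        raise UploadRejected("corrupt ZIP container") from exc
    if "word/document.xml" not in names:
        raise UploadRejected("ZIP is not a Word document")
    buffer.seek(0)
    return buffer
```

Checking the size first keeps the ZIP parser away from oversized payloads entirely.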
sequenceDiagram
autonumber
participant U as Browser
participant A as FastAPI
participant R as Redis
participant W as RQ Worker
participant P as pdflatex
U->>A: POST /convert (.docx)
A->>A: Magic-byte check + 20 MB cap
A->>R: enqueue(job_id)
A-->>U: {job_id, status: queued}
loop every 2s
U->>A: GET /status/{job_id}
A->>R: fetch job state
R-->>A: status
A-->>U: {status, result_summary?}
end
R-->>W: dequeue
W->>W: parse_docx → IRDocument
W->>W: hydrate equations (1 batched Pandoc call)
W->>W: compress images (Pillow, bomb-capped, EXIF preserved)
W->>W: render → LaTeX (smart preamble, ~140 unicode math glyphs)
W->>P: pdflatex (-no-shell-escape, openin_any=p, 30s)
P-->>W: (ok, error_line)
W->>R: store ConversionResult + per-file figure keys
U->>A: GET /download/{job_id}
A->>R: fetch result + assemble zip from figure keys
A->>R: SETEX temp:{id} TTL=5min
A-->>U: .tex or .zip + X-Overleaf-Temp-URL
U->>U: Open in Overleaf
Note over U: overleaf.com/docs?snip_uri=https://your-api/temp/{id}
The renderer walks the IR and emits only the packages the document actually uses.
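As a rough sketch of that usage-driven preamble, the function below collects packages from the kinds of nodes it sees. The mapping from node kind to package and the name `build_preamble` are assumptions made for illustration; the project's real mapping is not shown in this README.

```python
# Hypothetical mapping from IR node kind to the LaTeX packages it needs.
PACKAGES_BY_NODE_KIND = {
    "equation": ["amsmath"],
    "image": ["graphicx"],
    "table": ["array"],
}


def build_preamble(node_kinds, document_class="article"):
    """Emit \\usepackage lines only for node kinds present in the document."""
    packages = []
    for kind in node_kinds:
        for package in PACKAGES_BY_NODE_KIND.get(kind, []):
            if package not in packages:  # keep first-seen order, no duplicates
                packages.append(package)
    lines = ["\\documentclass{%s}" % document_class]
    lines += ["\\usepackage{%s}" % name for name in packages]
    return "\n".join(lines)
```

A text-only document therefore gets a preamble containing just the `\documentclass` line.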
Prerequisites: Docker Desktop + Node.js 20+
# 1. Clone
git clone https://github.com/TheClazer/CoreTex.git
cd CoreTex
# 2. Backend (Redis + API + RQ worker + TeX Live, all in Docker)
docker compose up --build
# 3. Frontend (new terminal)
cd frontend
npm install
npm run dev
Open http://localhost:5173, drop a .docx, pick a template, and hit Convert.
💡 First Docker build pulls ~1.5 GB of TeX Live for the compile check. Skip it with docker compose build --build-arg INSTALL_TEXLIVE=0 for fast iteration.
# Backend: 60 unit + integration tests (escape, parser, renderer,
# golden-doc regression on 16 .docx fixtures, HTTP integration)
pytest tests/ -v
# Frontend: Vitest + tsc
cd frontend && npm test
Base URL: http://localhost:8000 (dev) · your Railway domain (prod)
| Method | Path | Purpose |
|---|---|---|
| POST | /convert?template=<article\|ieee\|acm\|springer> | Upload .docx, returns {job_id, status: "queued"} |
| GET | /status/{job_id} | Polled every 2 s. Returns {status, result_summary?} with citation/warning counts, compile error line. |
| GET | /download/{job_id} | .tex (text/plain) or .zip (with figures/). Adds X-Overleaf-Temp-URL header. |
| GET | /temp/{job_id}[.tex\|.zip] | Public 5-min snip URL, Overleaf's snip_uri target. Suffix lets Overleaf detect the type from the URL. |
Example: full conversion flow
# 1. Submit
JOB=$(curl -s -X POST 'http://localhost:8000/convert?template=ieee' \
-F 'file=@paper.docx' | jq -r .job_id)
# 2. Poll
while true; do
STATUS=$(curl -s "http://localhost:8000/status/$JOB" | jq -r .status)
echo "$STATUS"
[[ "$STATUS" == "finished" || "$STATUS" == "failed" ]] && break
sleep 2
done
# 3. Download
curl -OJ "http://localhost:8000/download/$JOB"
CoreTex/
├── app/
│   ├── main.py                      FastAPI entry
│   ├── config.py                    Pydantic Settings
│   ├── api/routes.py                4 endpoints
│   ├── queue/worker.py              RQ orchestrator
│   ├── converter/
│   │   ├── ir_schema.py             FROZEN Pydantic IR (10 nodes)
│   │   ├── parser.py                OOXML → IRDocument
│   │   ├── renderer.py              IRDocument → LaTeX
│   │   ├── escape.py                10 reserved + Unicode
│   │   ├── compile_check.py         pdflatex + error parsing
│   │   └── handlers/
│   │       ├── equation_handler.py  OMML → LaTeX via Pandoc
│   │       ├── image_handler.py     Pillow compression
│   │       └── table_handler.py     column-spec
│   └── templates/                   article / ieee / acm / springer
├── frontend/                        React + Vite + TS
│   └── src/
│       ├── App.tsx
│       ├── hooks/useConversion.ts   Upload → poll → download
│       └── components/              UploadZone, LatexEditor, …
├── tests/
│   ├── test_escape.py               17 unit tests
│   ├── test_parser.py               9 unit + regression
│   ├── test_renderer.py             16 unit + regression
│   ├── test_integration.py          6 HTTP tests
│   └── golden/                      15 synthetic .docx + your real docs
├── .github/workflows/ci.yml         pytest + ruff + vitest + tsc
├── doc/                             Phase-by-phase architecture docs
├── Dockerfile + docker-compose.yml
├── railway.toml                     Backend deploy
├── frontend/vercel.json             Frontend deploy
└── DEPLOY.md                        Step-by-step deploy walkthrough
CoreTex is built for Railway (backend + worker + Redis) + Vercel (frontend).
Follow the step-by-step walkthrough in DEPLOY.md. Every manual step is marked "You:" so you know what's automatic and what needs your attention.
💡 Overleaf integration requires public deployment. When running locally, Overleaf's servers can't reach your localhost, so the "Open in Overleaf" button needs the backend to be deployed. Use the download button locally; the Overleaf button activates once you're on Railway.
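For reference, the link behind the Overleaf button can be assembled as in the sketch below. It follows the `overleaf.com/docs?snip_uri=…` shape shown in the sequence diagram; the helper name `overleaf_open_url` and the example host are hypothetical.

```python
from urllib.parse import urlencode


def overleaf_open_url(api_base, job_id, suffix=".tex"):
    """Build an Overleaf link whose snip_uri points at the public temp endpoint."""
    temp_url = "%s/temp/%s%s" % (api_base.rstrip("/"), job_id, suffix)
    # urlencode percent-escapes the nested URL so it survives as one query value.
    return "https://www.overleaf.com/docs?" + urlencode({"snip_uri": temp_url})
```

Calling it with a `localhost` base produces a syntactically valid link that Overleaf still cannot fetch, which is why the button only works after deployment.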
Why a custom renderer instead of Pandoc end-to-end?
Pandoc adds its own structural commands (\tightlist, custom list macros, heading style resets) that collide with the IR renderer's output, producing duplicate or contradictory formatting that frequently fails to compile.
Pandoc is invoked only in the equation handler, on individual OMML fragments. Document structure is 100% custom-rendered.
Why python-docx + lxml instead of mammoth.js?
mammoth.js is a Node.js library built for Word → HTML conversion. It exposes no access to OMML equations, paragraph properties, citation XML fields, or run structure. Using it would require running a Node subprocess from Python, and we'd still lack equation support. python-docx + lxml gives full OOXML access in-process.
Why is the IR schema frozen?
The parser writes IR; the renderer reads IR. If anyone renames a field mid-project without coordination, the renderer breaks silently. Freezing the schema after Week 1, and requiring an explicit schema/<change> branch plus team sign-off, ensures Pydantic surfaces drift as a ValidationError rather than as silent corruption.
Why not back-map LaTeX errors to the source Word paragraph?
Mapping pdflatex errors back through the IR to the original Word paragraph is extremely complex (LaTeX line numbers don't correspond to IR node indices). Practical alternative: surface the LaTeX line number in the warnings panel and have CodeMirror scroll to + highlight it. Users can fix there or open the document in Overleaf.
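Extracting that line number from a pdflatex log can be sketched as follows. This assumes the default log format, where an error message starts with `!` and the offending source line is reported on a following line beginning with `l.<number>`. The function name `first_error_line` is hypothetical and not taken from compile_check.py.

```python
import re

_ERROR_START = re.compile(r"^! ", re.MULTILINE)
_LINE_MARKER = re.compile(r"^l\.(\d+)", re.MULTILINE)


def first_error_line(log_text):
    """Return the .tex line number of the first reported error, or None."""
    error = _ERROR_START.search(log_text)
    if error is None:
        return None  # no "! ..." message: the compile produced no error
    # Only look for the "l.<n>" marker after the error message itself.
    marker = _LINE_MARKER.search(log_text, error.end())
    return int(marker.group(1)) if marker else None
```

The returned integer is what an editor would need in order to scroll to and highlight the failing line.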
Why convert Unicode math glyphs at the escape layer instead of in math mode upstream?
Authors paste Unicode math characters (for example ℝ or α) into normal text runs without ever opening Word's equation editor. Those characters reach the parser as paragraph text, not as <m:oMath>, so the equation handler never sees them. Catching them at the escape layer means a single pass over every text string is enough; consecutive glyphs merge into a single $…$ region with ^a^b^c → ^{abc} grouping so pdflatex doesn't raise "double superscript".
The renderer also detects when any text run contains a math glyph and auto-injects \usepackage{amssymb} so \mathbb{R} resolves. Documents without math glyphs don't pull in the package.
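The merging and grouping behaviour can be illustrated with a small sketch. The glyph tables here are a tiny hypothetical subset, not the project's ~140-glyph map, and `escape_math_glyphs` is an illustrative name rather than the function in escape.py.

```python
import re

# Tiny illustrative subsets; the real tables are much larger.
MATH_GLYPHS = {"ℝ": r"\mathbb{R}", "∈": r"\in", "α": r"\alpha", "≤": r"\leq"}
SUPERSCRIPTS = {"¹": "1", "²": "2", "³": "3", "ⁿ": "n"}

_SUPER_CLASS = "[%s]+" % re.escape("".join(SUPERSCRIPTS))
_MATH_RUN = re.compile("[%s]+" % re.escape("".join(MATH_GLYPHS) + "".join(SUPERSCRIPTS)))
_SUPER_RUN = re.compile(_SUPER_CLASS)


def _render_run(run):
    """Render one run of consecutive math glyphs as a single $...$ region."""
    pieces = []
    pos = 0
    for match in _SUPER_RUN.finditer(run):
        pieces.extend(MATH_GLYPHS[ch] for ch in run[pos:match.start()])
        # Group adjacent superscripts so LaTeX sees ^{23}, never ^2^3.
        grouped = "^{%s}" % "".join(SUPERSCRIPTS[ch] for ch in match.group())
        if pieces:
            pieces[-1] += grouped
        else:
            pieces.append(grouped)
        pos = match.end()
    pieces.extend(MATH_GLYPHS[ch] for ch in run[pos:])
    return "$" + " ".join(pieces) + "$"


def escape_math_glyphs(text):
    """Single pass over a text run, leaving non-math characters untouched."""
    return _MATH_RUN.sub(lambda m: _render_run(m.group()), text)
```

In this sketch, glyphs separated by ordinary spaces form separate math regions; only directly adjacent glyphs merge.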
Why batch all equations through a single Pandoc call?
Pandoc's startup cost is ~400 ms on a warm machine. A paper with 150 equations would block the worker for ~60 seconds if each fired its own subprocess. We instead synthesize a single .docx with N paragraphs (one per equation), each sandwiched between unique ASCII sentinels, then call Pandoc once and split the output back into per-equation chunks by regex. A 150-equation paper converts in ~0.6 s.
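The sentinel bookkeeping around that single call might look like the sketch below. It covers only the wrapping and splitting steps; building the synthetic .docx and invoking Pandoc are left out. The sentinel format (`EQSTART0001` / `EQEND0001`) and the function names are assumptions for illustration.

```python
import re


def wrap_fragments(fragments):
    """One paragraph per equation, each sandwiched between unique ASCII sentinels."""
    return [
        "EQSTART%04d %s EQEND%04d" % (index, fragment, index)
        for index, fragment in enumerate(fragments)
    ]


# The backreference \1 ties each end sentinel to its matching start sentinel.
_CHUNK = re.compile(r"EQSTART(\d{4})(.*?)EQEND\1", re.DOTALL)


def split_output(converted, count):
    """Map converted output back to per-equation chunks; None marks a lost equation."""
    results = [None] * count
    for match in _CHUNK.finditer(converted):
        index = int(match.group(1))
        if index < count:
            results[index] = match.group(2).strip()
    return results
```

An entry left as `None` corresponds to the degraded case, where the caller would emit a placeholder and a warning instead of LaTeX.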
Why raw bytes in Redis instead of pickled artefacts?
pickle.loads on untrusted Redis data is an RCE primitive. Even though Redis is internal, treating that boundary as trusted is the same mistake that broke a thousand other systems. Each figure is stored as a separate raw-byte key (figures:{job_id}:f:{name}) alongside a newline-delimited manifest of filenames. No Python-specific wrapper, no deserialisation risk, and individual figures can be looked up without loading the whole dict.
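A minimal sketch of that storage layout follows. An in-memory class stands in for Redis so the example runs on its own; the manifest key name `figures:{job_id}:manifest` and the helper names are assumptions, while the per-figure key pattern comes from the paragraph above.

```python
class InMemoryStore:
    """Stand-in for a Redis client exposing only set/get on raw bytes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = bytes(value)

    def get(self, key):
        return self._data.get(key)


def store_figures(store, job_id, figures):
    """Write each figure under its own key plus a newline-delimited manifest."""
    names = sorted(figures)
    for name in names:
        if "\n" in name:
            raise ValueError("figure names must not contain newlines")
        store.set("figures:%s:f:%s" % (job_id, name), figures[name])
    # Hypothetical manifest key; plain UTF-8 text, no pickle involved.
    store.set("figures:%s:manifest" % job_id, "\n".join(names).encode("utf-8"))


def load_figures(store, job_id):
    """Rebuild the name-to-bytes mapping from the manifest and raw-byte keys."""
    manifest = store.get("figures:%s:manifest" % job_id)
    if not manifest:
        return {}
    names = manifest.decode("utf-8").split("\n")
    return {name: store.get("figures:%s:f:%s" % (job_id, name)) for name in names}
```

A single figure can also be fetched directly by its key without reading the manifest at all.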
| Area | Limitation | Tracked in |
|---|---|---|
| Citations | Plain-text fallback, no .bib generation | v2 roadmap |
| Tables | Merged-cell rendering uses \multicolumn only (no row spans) | v2 roadmap |
| Equations | Requires Pandoc on PATH; failures degrade to a placeholder + warning | bible §6 |
| Compile errors | LaTeX line number is surfaced; no back-mapping to original Word paragraph | bible §6 |
| Tracked changes | Revision markup stripped; final text only | bible §9 |
| Resume-style layouts | Per-word formatting + tabs produce verbose output | bible §9 |
| Area | Trade-off | Upgrade path |
|---|---|---|
| Upload size | Hard 20 MB cap | Set MAX_FILE_SIZE_MB; bump Railway RAM |
| Worker count | Single RQ worker; head-of-line blocking under load | Railway Pro + worker replicas |
| Figure storage | Staged in Redis (5-min TTL, ~50 MB ceiling) | Swap for S3 / Cloudflare R2 (~30 lines) |
| Upload memory | ~5× duplication at peak (HTTP → buffer → RQ → Redis → worker) | Presigned PUT to S3, pass key through RQ |
| TeX Live image | 1.5 GB Docker layer; ~30 s cold start | Pre-warm with min-replicas, or INSTALL_TEXLIVE=0 |
See the Scaling constraints section of DEPLOY.md for the full upgrade-path discussion.
Full feature scope (v1 vs v2) lives in word_latex_bible.pdf §9.
gantt
title CoreTex roadmap
dateFormat YYYY-MM-DD
axisFormat %b
section v1 (shipped)
IR schema freeze :done, 2026-01-01, 7d
Parser + Renderer :done, 2026-01-08, 14d
Handlers + Templates :done, 2026-01-22, 14d
Compile check + UI :done, 2026-02-05, 14d
Deploy + Docs :done, 2026-02-19, 7d
section v1.1 (shipped - hardening)
Unicode math escape (~140) :done, 2026-03-01, 5d
Combining diacritics :done, 2026-03-06, 2d
Batched Pandoc :done, 2026-03-08, 2d
Pillow bomb cap + EXIF :done, 2026-03-10, 1d
pdflatex sandbox flags :done, 2026-03-11, 1d
i18n list detection (6 locales) :done, 2026-03-12, 2d
Redis raw-byte figures :done, 2026-03-14, 1d
section v2 (planned)
Full BibTeX extraction :2026-04-01, 21d
Beamer slides template :2026-04-22, 14d
Direct OMML parser :2026-05-06, 21d
Run-merger optimisation :2026-05-27, 7d
Style mapping config :2026-06-03, 14d
S3 / R2 figure storage :2026-06-17, 7d
Worker autoscale on Railway :2026-06-24, 5d
PRs welcome! Please follow the bible's branch conventions:
- feature/<area>-<short-desc>: new functionality
- fix/<area>-<short-desc>: bug fixes
- schema/<change-desc>: IR schema changes (requires team sign-off)
Every PR must:
- Keep all 48 backend tests passing (pytest tests/)
- Keep frontend tests passing (npm test)
- Pass ruff check + tsc --noEmit
- Not regress on any golden doc in tests/golden/
CI enforces all four; see .github/workflows/ci.yml.
Distributed under the MIT License. See LICENSE for full text.
Copyright (c) 2026 Rayyan Shaikh and CoreTex contributors
Built around the principles laid out in word_latex_bible.pdf v1.0: a compiler-style IR pipeline, schema-first design, and per-layer ownership.
Standing on the shoulders of: FastAPI · python-docx · Pandoc · CodeMirror · Pillow · Overleaf · TeX Live
Star this repo if CoreTex saved you from re-typing your paper.
Made with \textbf{♥} by @TheClazer