docx-extractor

A small Rust CLI that converts a .docx file into structured JSON — paragraphs, headings, lists, tables, footnotes, headers/footers, comments, tracked changes, and base64-encoded images. Designed to be consumed programmatically (e.g. by a Claude skill) without needing Microsoft Office.

Use from Claude Desktop (MCP)

This is the easiest route for end users. Add the following to your claude_desktop_config.json (Settings → Developer → Edit Config) and restart Claude Desktop; the extract_docx tool is then available:

{
  "mcpServers": {
    "docx-extractor": {
      "command": "npx",
      "args": ["-y", "docx-extractor-mcp"]
    }
  }
}

The MCP wrapper auto-downloads the matching platform binary on first call and caches it in ~/.cache/docx-extractor-mcp/<version>/. Requires Node.js 18+. See mcp/README.md for details.

Install

Download the latest binary for your platform from the Releases page:

| Platform | Asset |
| --- | --- |
| Linux x86-64 | `docx-extractor-linux-x86_64` |
| macOS Intel | `docx-extractor-macos-x86_64` |
| macOS Apple Silicon | `docx-extractor-macos-aarch64` |
| Windows x86-64 | `docx-extractor-windows-x86_64.exe` |

On Linux and macOS, make it executable (chmod +x docx-extractor-linux-x86_64) and put it on your PATH, or invoke it by full path. The Windows asset is named docx-extractor-windows-x86_64.exe for clarity on the Releases page; rename it to docx-extractor.exe once it's on your PATH.

Or build from source:

cargo build --release
# binary at target/release/docx-extractor

Usage

docx-extractor path/to/file.docx                       # JSON to stdout
docx-extractor path/to/file.docx --pretty              # pretty-printed
docx-extractor path/to/file.docx --output out.json     # write to file
docx-extractor path/to/file.docx --no-images           # skip base64 image bytes
docx-extractor path/to/file.docx --max-image-bytes 1048576   # cap each image at 1 MiB (1,048,576 bytes)
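
Since the tool is meant to be consumed programmatically, here is a minimal Python sketch of calling it from a script. It uses only the flags listed above. The helper names (build_command, extract) are hypothetical, and the sketch assumes the binary is on your PATH under the name docx-extractor and that it exits with a non-zero status on failure.

```python
import json
import subprocess
from typing import Optional


def build_command(
    docx_path: str,
    binary: str = "docx-extractor",  # assumption: the binary is on PATH
    no_images: bool = False,
    max_image_bytes: Optional[int] = None,
) -> list[str]:
    """Assemble the argument list using only the flags shown in Usage."""
    cmd = [binary, docx_path]
    if no_images:
        cmd.append("--no-images")
    if max_image_bytes is not None:
        cmd += ["--max-image-bytes", str(max_image_bytes)]
    return cmd


def extract(docx_path: str, **options) -> dict:
    """Run the CLI and parse the JSON it writes to stdout."""
    result = subprocess.run(
        build_command(docx_path, **options),
        capture_output=True,  # stdout is captured as bytes
        check=True,           # raises CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)  # json.loads accepts UTF-8 bytes
```

For example, extract("report.docx", no_images=True) would return the parsed document as a dict without the image payloads. Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.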

Output

{
  "source": "report.docx",
  "metadata": { "title": "...", "author": "...", "created": "..." },
  "sections": [
    { "type": "heading",   "level": 1, "text": "Introduction" },
    { "type": "paragraph", "text": "See [^1].", "footnote_refs": [1] },
    { "type": "list_item", "level": 0, "text": "First bullet" },
    { "type": "paragraph", "text": "Figure caption", "images": ["image1.png"] },
    { "type": "table", "rows": [[{ "text": "A" }, { "text": "B" }]] }
  ],
  "headers":   [{ "type": "default", "sections": [/* ... */] }],
  "footers":   [{ "type": "default", "sections": [/* ... */] }],
  "footnotes": [{ "id": 1, "sections": [{ "type": "paragraph", "text": "..." }] }],
  "comments":  [{ "id": 0, "author": "Jane", "anchor": { "section_index": 2, "char_start": 4, "char_end": 15 },
                  "sections": [{ "type": "paragraph", "text": "Looks good." }] }],
  "revisions": [{ "kind": "delete", "author": "Jane", "text": "removed words",
                  "anchor": { "section_index": 1, "char_start": 5, "char_end": 5 } }],
  "images":    [{ "id": "image1.png", "mime_type": "image/png", "base64": "..." }]
}

Empty arrays and null fields are omitted. See CLAUDE.md for the full schema and known limitations.
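
Because empty arrays and null fields are omitted, a consumer should not assume that any key is present. The following Python sketch shows one way to read the output defensively. The function names are hypothetical, and the guess that image entries may lack a base64 field (for example when --no-images is used) is an assumption, not documented behavior.

```python
import base64


def outline(doc: dict) -> list[str]:
    """Return one indented line per heading found in the body sections."""
    lines = []
    for section in doc.get("sections", []):  # key is absent when empty
        if section.get("type") == "heading":
            level = section.get("level", 1)
            lines.append("  " * (level - 1) + section.get("text", ""))
    return lines


def decode_images(doc: dict) -> dict[str, bytes]:
    """Map each image id to its decoded bytes, skipping entries without data."""
    decoded = {}
    for image in doc.get("images", []):
        payload = image.get("base64")
        if payload is not None:  # assumption: the field may be omitted
            decoded[image["id"]] = base64.b64decode(payload)
    return decoded
```

Using .get with a default throughout means the same code works on a document with no headings, no images, or neither.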

What is extracted

Paragraphs, headings (Heading1..9 or outlineLvl), list items with level
Tables (cells are {text, images}; nested tables are flattened into the outer cell as joined text)
Tracked insertions and deletions — author, date, anchor (deletions are removed from body text but surfaced in revisions[])
Comments — author, date, body, and anchor into the body section it wraps
Footnotes and endnotes — full nested sections + inline footnote_refs / endnote_refs
Headers and footers — full nested sections, typed default / first / even
Hyperlinks — inlined as markdown [text](url)
Images — PNG / JPEG / GIF / BMP / TIFF / WebP, base64-encoded; per-section images[filename] ties them to their location
Core metadata — title, author, last-modified-by, created, modified
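
Comments and revisions carry an anchor, and paragraphs carry footnote_refs, so related pieces can be joined after extraction. The Python sketch below illustrates the idea. It assumes that section_index indexes the top-level sections array and that char_start / char_end are half-open offsets counted in Unicode code points; neither is confirmed here, so check CLAUDE.md for the actual semantics. The function names are hypothetical.

```python
def anchored_text(doc: dict, anchor: dict) -> str:
    """Return the slice of body text that an anchor points at.

    Assumption: offsets are half-open and counted in code points,
    which is how Python slices str values.
    """
    section = doc["sections"][anchor["section_index"]]
    return section.get("text", "")[anchor["char_start"]:anchor["char_end"]]


def footnote_text(doc: dict, ref: int) -> str:
    """Join the text of every section in the footnote whose id equals ref."""
    for note in doc.get("footnotes", []):
        if note.get("id") == ref:
            return " ".join(s.get("text", "") for s in note.get("sections", []))
    return ""  # no footnote with that id
```

A deletion anchor with char_start equal to char_end yields an empty slice under these assumptions, which matches the note above that deleted text is removed from the body and surfaced only in revisions[].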

License

MIT
