DITA Converter

Converts text-based PDF and DOCX files into valid DITA 2.0 XML via a rule-based pipeline with a Streamlit web UI. Calibrated against Gilbarco technical manuals.

DITA Version: 2.0
Style Reference: Gilbarco Technical Manuals
Source Analysis: Passport V24.04 — 42 docs, 64 elements, 1,559 instances

Quick Start

Requirements: Python 3.11, pip

```bash
git clone https://github.com/Lockdndream/dita-converter.git
cd dita-converter
pip install -r requirements.txt
streamlit run ui/app.py
```

Open `http://localhost:8501`, then upload a PDF or DOCX and download your DITA output.

Pipeline

```
Upload PDF or DOCX
        │
        ▼
┌─────────────┐   pdfplumber / python-docx
│  Extractor  │ → Content Tree (block dicts with Y-position ordering)
└─────────────┘   ROW_SHOW table detection · TM sentinel encoding · bold detection
        │
        ▼
┌─────────────┐   config/mapping_rules.yaml
│   Mapper    │ → Annotated Content Tree (dita_element per block)
└─────────────┘   Hazard statement classification · UI path detection
        │
        ▼
┌─────────────┐   lxml
│  Generator  │ → DITA 2.0 XML + .ditamap or .bookmap
└─────────────┘   Per-topic type detection · Ditabase composite output
        │
        ▼
┌─────────────┐   lxml well-formedness
│  Validator  │ → ValidationResult + report per topic
└─────────────┘
        │
        ▼
Map view → select topics → download .dita · scoped ZIP · .ditamap · .bookmap
```
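The four stages can be read as a chain of functions over a list of block dicts. The following is a minimal sketch, not the code in `agents/`: the block keys (`page`, `y`, `text`, `style`), the `RULES` table, and all function names are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hypothetical style-to-element rules; the real ones live in config/mapping_rules.yaml.
RULES = {"Heading1": "title", "Body": "p"}


def extract(raw_blocks):
    """Order blocks top to bottom: by page, then by Y position (assumed to grow downward)."""
    return sorted(raw_blocks, key=lambda b: (b["page"], b["y"]))


def map_blocks(tree, rules=RULES):
    """Annotate each block with a dita_element, falling back to 'p' for unknown styles."""
    return [{**b, "dita_element": rules.get(b["style"], "p")} for b in tree]


def generate(annotated, topic_id="t1"):
    """Build one <topic>: the first title block fills <title>, all other blocks become <p>."""
    topic = ET.Element("topic", id=topic_id)
    title = ET.SubElement(topic, "title")
    body = ET.SubElement(topic, "body")
    for b in annotated:
        if b["dita_element"] == "title" and title.text is None:
            title.text = b["text"]
        else:
            ET.SubElement(body, "p").text = b["text"]
    return ET.tostring(topic, encoding="unicode")


def validate(xml_text):
    """Well-formedness check only: True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def convert(raw_blocks):
    """Run all four stages and return (xml_text, is_well_formed)."""
    xml_text = generate(map_blocks(extract(raw_blocks)))
    return xml_text, validate(xml_text)
```

Keeping each stage a plain function over dicts means a stage can be tested on its own with hand-written blocks, without a PDF or DOCX on disk.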

Features

Feature	Detail
Multi-topic output	Each H1 → separate `.dita` file
Ditabase composite	Non-intro topics use `` (ditabase) root with typed child topics
Topic type detection	`task` · `concept` · `reference` · `topic` — per chunk, per H2/H3
Introduction topics	H2/H3 → ` ` with `<title>`
Appendix topics	Title starts with "Appendix" → always ``
ROW_SHOW table detection	FrameMaker-style borderless tables: 2pt header rules, no vertical lines
Multi-row thead	Straddle/spanning header rows with `namest`/`nameend` attributes
Hazard statements	`IMPORTANT INFORMATION` / `WARNING` / `CAUTION` / `DANGER` → `` per ANSI Z535
Bold retention	Bold text in paragraphs and table cells → `` (stripped in `
Trademark markup	Superscript ® ™ ℠ → `<tm tmtype="reg
Map output	`.ditamap` (kit documents) or `.bookmap` (book documents)
Page range	Extract specific pages: `1-5, 8, 12-15`
Blank page detection	"Intentionally left blank" pages skipped automatically
Selective export	Check topics → download one `.dita` or scoped ZIP
Image support (DOCX)	Provide extracted `media/` folder path
Particle UI	Animated magnetic field particle background
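
The page range syntax in the table (`1-5, 8, 12-15`) can be expanded into individual page numbers with a few lines of Python. This is a sketch of one plausible parser, assuming 1-based page numbers and inclusive ranges; `parse_page_range` is a hypothetical name, not a function from this repository.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "1-5, 8, 12-15" into sorted, unique 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_page_range("1-5, 8, 12-15"))
# → [1, 2, 3, 4, 5, 8, 12, 13, 14, 15]
```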

Topic Structure Rules

Topic	Root element	H2/H3 handling
Introduction	``	` <title>`
Appendix	``	` <title>`
Everything else	`` (ditabase)	Sibling typed topics inside ``

Ditabase child type detection (priority order):

Steps detected → ``
Majority tables/dl → ``
Prose paragraphs → ``
Default → ``
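
The priority order above can be expressed as a chain of checks that returns at the first match. In this sketch the result names `task`, `reference`, `concept`, and `topic` come from the topic types listed under Features, but pairing each name with a rule is an assumption, as are the block `kind` values and the reading of "majority" as more than half of the blocks.

```python
def detect_topic_type(blocks: list[dict]) -> str:
    """Return a topic type for one chunk, checking rules in priority order."""
    kinds = [b["kind"] for b in blocks]  # 'kind' is a hypothetical block key

    # 1. Any step wins outright (assumed to map to a task topic).
    if "step" in kinds:
        return "task"

    # 2. Tables and definition lists make up more than half of the chunk.
    structured = sum(k in ("table", "dl") for k in kinds)
    if kinds and structured > len(kinds) / 2:
        return "reference"

    # 3. Otherwise, any prose paragraph.
    if "p" in kinds:
        return "concept"

    # 4. Nothing matched.
    return "topic"
```

Because the checks run in order, a chunk that contains both steps and tables is still classified by the steps rule.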

Project Structure

``` dita-converter/ ├── agents/ │ ├── extractor.py # PDF/DOCX → Content Tree │ ├── mapper.py # Content Tree + YAML → Annotated Tree │ ├── generator.py # Annotated Tree → DITA 2.0 XML + maps │ └── validator.py # XML validation + report ├── config/ │ └── mapping_rules.yaml # Style mapping rules (editable) ├── ui/ │ └── app.py # Streamlit web UI ├── build/ │ ├── launcher.py # Windows exe entry point │ ├── build.py # PyInstaller build script │ ├── dita_converter.spec │ └── IT_Certificate_Guide.md ├── runtime.txt ├── requirements.txt ├── COMMANDS.md # Developer command reference ├── CLAUDE.md # AI assistant context └── README.md ```

DOCX Image Extraction

Copy your `.docx` → rename to `.zip` → extract
Navigate to extracted folder → `word/` → `media/`
Paste the full path to `media/` in the UI image folder field
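
The manual steps above work because a `.docx` file is a ZIP archive, so the same extraction can be scripted with the standard library. This is a sketch only and not part of the converter; `extract_docx_media` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_docx_media(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir; return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name  # flatten: keep only the file name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

The folder passed as `out_dir` is then the path to paste into the UI image folder field.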

Mapping Rules

Edit `config/mapping_rules.yaml` to adapt to different document styles — no code changes required. Based on Passport V24.04 analysis (42 documents, 64 element types).

Key configurable sections: `topic_type_signals`, `note_map`, `table_map`, `task_section_map`, `drop_patterns`.
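
As an illustration of rule-driven behaviour, here is how a section such as `drop_patterns` could be applied to extracted blocks. The schema of `mapping_rules.yaml` is not documented in this README, so the sketch assumes `drop_patterns` is a list of regular expressions matched case-insensitively against block text. The pattern shown is illustrative and is defined inline rather than loaded from the YAML file.

```python
import re

# Stand-in for the drop_patterns section; the pattern itself is an assumption.
DROP_PATTERNS = [r"^\s*(this page is\s+)?intentionally left blank\.?\s*$"]


def drop_blocks(blocks, patterns=DROP_PATTERNS):
    """Return only the blocks whose text matches none of the drop patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [b for b in blocks if not any(rx.search(b["text"]) for rx in compiled)]
```

Keeping the patterns as data is what lets a new document style be supported by editing the rules file alone.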

Windows Executable

Build a standalone `.exe` (no Python install required):

```cmd
py -3.11 build\build.py
```

Output: `dist\DITAConverter.exe` — double-click to launch. See `build\IT_Certificate_Guide.md` for enterprise signing via GPO.

Dependencies

Library	Version	Purpose
pdfplumber	0.10.x	PDF text, table, and geometry extraction
python-docx	1.1.x	DOCX parsing
PyYAML	6.x	Mapping rules config
lxml	5.x	XML generation and validation
streamlit	1.35.x	Web UI
protobuf	3.20.3	Pinned for Python 3.11 exe compatibility

Total runtime cost: $0.00 — no API keys, no cloud services.

Roadmap

Version	Focus
v1.1	✅ DITA 2.0 · Ditabase · ROW_SHOW tables · Hazard statements · TM markup · Bold · Bookmap · Particle UI
v2.0	Batch conversion · Auto DOCX image extraction · Full DTD validation
v2.1	DITA map editor · Drag-to-reorder topics
v3.0	LLM-assisted mapping for ambiguous content

License

MIT — see `LICENSE` for details.
