Docx Renderer Extension

Vladimir Schneider edited this page Apr 19, 2020 · 11 revisions

flexmark-java Docx-Renderer extension

Overview

Renders the parsed Markdown AST to docx format using the docx4j library.

See the DocxConverterCommonMark Sample for code and Customizing Docx Rendering for an overview and information on customizing the styles.

The Pegdown version can be found in the DocxConverterPegdown Sample.

⚠️ The Emoji extension on Java 7 will not load GitHub-provided images. Use Java 8+, or do not set EmojiExtension.USE_SHORTCUT_TYPE to EmojiShortcutType.GITHUB or EmojiShortcutType.ANY_GITHUB_PREFERRED, either of which causes GitHub-provided images to be used.

Syntax

Renders the AST generated by the flexmark-java parser. No special syntax is implemented by this extension.

Limited Attributes Node Handling

  • .className on paragraph elements sets the docx styleId to className if that style id is found. This allows specific style ids to be used to change paragraph formatting. The special classes pagebreak and tab are excluded.
  • page break via {.pagebreak} attributes
  • tab via {.tab} attributes
  • Use {style=""} to set attributes on text or block elements. Only the following are processed:
    • color - text color
    • background-color - shade fill color, pattern always solid.
    • font-family - not implemented
    • font-size - in pt, rounded to the nearest 1/2 pt. The pt unit is optional.
    • font-weight - set/clear bold (if using numeric weights then >= 550 sets bold, less clears it)
    • font-style - set/clear italic
  • inline image alignment with {align=}:
    • left - left align, wrap text to right
    • right - right align, wrap text to left
    • center - center align, wrap text to left and right
    • any other value - no text wrapping around the image; the image is inserted inline into the text
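
The font-size and font-weight rules above can be sketched in plain Java. This is a minimal illustration, not the extension's own code: the class StyleAttributeSketch and its methods are hypothetical, and treating the keywords bold and normal as set/clear is an assumption. The result is returned in half-points because docx stores run font sizes in half-point units.

```java
import java.util.Locale;

// Hypothetical helper, not part of flexmark-java: models the documented
// font-size and font-weight rules for {style=""} attributes.
final class StyleAttributeSketch {
    private StyleAttributeSketch() {}

    /** Converts a font-size value such as "11.3pt" or "12" to half-points. */
    static int toHalfPoints(String fontSize) {
        String s = fontSize.trim().toLowerCase(Locale.ROOT);
        if (s.endsWith("pt")) {
            // The pt unit is optional, so strip it when present.
            s = s.substring(0, s.length() - 2).trim();
        }
        double points = Double.parseDouble(s);
        // Rounding to the nearest 1/2 pt is the same as rounding the
        // half-point count to the nearest integer.
        return (int) Math.round(points * 2);
    }

    /** Returns true when the font-weight value should set bold. */
    static boolean isBold(String fontWeight) {
        String s = fontWeight.trim().toLowerCase(Locale.ROOT);
        if (s.equals("bold")) return true;    // assumption: keyword sets bold
        if (s.equals("normal")) return false; // assumption: keyword clears bold
        // Documented rule: numeric weights >= 550 set bold, lower values clear it.
        return Double.parseDouble(s) >= 550;
    }
}
```

For example, toHalfPoints("11.3pt") gives 23 half-points (11.5 pt), and isBold("400") gives false.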

Parsing Details

artifact: flexmark-docx-converter

The following options are available:

Defined in DocxRenderer class:

  • CODE_HIGHLIGHT_SHADING default "" , when non-empty this color is used as the code highlight and NO_CHARACTER_STYLES is forced to true, see NOTE on Highlight Colors.
  • CUSTOM_PROPERTIES default Collections.emptyMap() , set to a Map<String, String> of property name to property value for custom properties to be set in the document. Needed in some cases for post processing.
  • DEFAULT_LINK_RESOLVER default true , use default link resolver, which uses the DOC_RELATIVE_URL and DOC_ROOT_URL options
  • DEFAULT_TEMPLATE_RESOURCE default "/empty.xml" , default template resource path
  • DOC_EMOJI_IMAGE_VERT_OFFSET default -0.10 , vertical offset of emoji image as a factor of line height at point of insertion. The final value is rounded to nearest pt so jumps of 1 pt for small changes of this value can occur.
  • DOC_EMOJI_IMAGE_VERT_SIZE default 1.05 , size of emoji image as a factor of line height at point of insertion.
  • DOC_RELATIVE_URL default "" , the prefix to use for all relative URLs: not starting with protocol or /
  • DOC_ROOT_URL default "" , the prefix to use for all absolute URLs: ones starting with /
  • ERROR_SOURCE_FILE default "" , name of source file to use in error logs
  • ERRORS_TO_STDERR default false , log errors to stderr
  • FORM_CONTROLS default "" , set to name of form control reference to generate form controls with name given by this key [name]{.type attributes}
  • LINEBREAK_ON_INLINE_HTML_BR default true , convert inline HTML <br> to line break in the docx
  • LOCAL_HYPERLINK_MISSING_FORMAT default "Missing target id: #%s" , when non-empty uses String.format() on the given string with the missing ref anchor as the argument to generate a tooltip for unresolved hyperlinks
  • LOCAL_HYPERLINK_MISSING_HIGHLIGHT default "red" , when non-empty will highlight unresolved hyperlinks local to the document with this color. See NOTE on Highlight Colors.
  • LOCAL_HYPERLINK_SUFFIX default "" , appends this suffix to in document hyperlink anchor reference. Needed in some cases for post processing.
  • LOG_IMAGE_PROCESSING default false , log image processing errors
  • MAX_IMAGE_WIDTH default 0 , max image width, 0 no max
  • NO_CHARACTER_STYLES default false , when true will not set character style but explicitly set the run values from the style
  • NUMBERING_XML default getResourceString("/numbering.xml") , default numbering section if missing in wordprocessing package
  • PREFIX_WWW_LINKS default true , controls whether links starting with www. will be prefixed with https://
  • RENDER_BODY_ONLY default false , when rendering to string will only output the body of the document part. Used for tests.
  • STYLES_XML default getResourceString("/styles.xml") , default styles section if missing in wordprocessing package
  • TABLE_CAPTION_BEFORE_TABLE default false , insert caption before table
  • TABLE_CAPTION_TO_PARAGRAPH default true , convert table captions to paragraphs, styled with TableCaption style id
  • TABLE_LEFT_INDENT default 120 , table left indent in twips
  • TABLE_PREFERRED_WIDTH_PCT default 0 , preferred table width
  • TABLE_STYLE default "" , table font style
  • TOC_GENERATE default false , whether to generate TOC, even if no TOC Markdown element is present in the file
  • TOC_INSTRUCTION default "TOC \\o \"1-3\" \\h \\z \\u " , defines the instruction string used for the TOC element
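
The URL-related options above (DOC_RELATIVE_URL, DOC_ROOT_URL and PREFIX_WWW_LINKS) can be read as a small set of prefixing rules. The sketch below models them in plain Java under stated assumptions: LinkPrefixSketch is a hypothetical class, not the extension's link resolver; prefixes are joined by simple concatenation; and in-document anchors starting with # are left unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical model of the documented URL prefix options; the real
// default link resolver may differ in details such as slash handling.
final class LinkPrefixSketch {
    // A URL "starts with a protocol" if it begins with a scheme such as https: or mailto:
    private static final Pattern PROTOCOL = Pattern.compile("^[a-zA-Z][a-zA-Z0-9+.-]*:");

    private LinkPrefixSketch() {}

    static String resolve(String url, String docRelativeUrl, String docRootUrl, boolean prefixWwwLinks) {
        if (url.startsWith("www.")) {
            // PREFIX_WWW_LINKS: links starting with www. get an https:// prefix.
            return prefixWwwLinks ? "https://" + url : url;
        }
        if (url.startsWith("#") || PROTOCOL.matcher(url).find()) {
            // Assumption: anchors and URLs with a protocol are used as-is.
            return url;
        }
        if (url.startsWith("/")) {
            // DOC_ROOT_URL: prefix for absolute URLs, the ones starting with /
            return docRootUrl + url;
        }
        // DOC_RELATIVE_URL: prefix for everything else (relative URLs).
        return docRelativeUrl + url;
    }
}
```

With docRelativeUrl set to "https://example.com/docs/" and docRootUrl set to "https://example.com" (both made-up values), "img/a.png" resolves to "https://example.com/docs/img/a.png" and "/img/a.png" resolves to "https://example.com/img/a.png".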

NOTE on Highlight Colors

Docx format requires a named color. Any color provided that does not match a named color will be converted to the closest named color.

When CODE_HIGHLIGHT_SHADING is set to "shade", the closest named color to the SourceText style's shade fill color is used, if available.
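
To make "closest named color" concrete, here is a minimal sketch of one way such a match could work. It is not the extension's algorithm: HighlightColorSketch is a hypothetical class, the distance measure (squared Euclidean distance in RGB) is an assumption, and the RGB values given for the WordprocessingML highlight names are the commonly used ones and should be treated as approximate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical nearest-named-color lookup; the metric and RGB table are assumptions.
final class HighlightColorSketch {
    private static final Map<String, Integer> NAMED = new LinkedHashMap<>();

    static {
        NAMED.put("black", 0x000000);
        NAMED.put("blue", 0x0000FF);
        NAMED.put("cyan", 0x00FFFF);
        NAMED.put("green", 0x00FF00);
        NAMED.put("magenta", 0xFF00FF);
        NAMED.put("red", 0xFF0000);
        NAMED.put("yellow", 0xFFFF00);
        NAMED.put("white", 0xFFFFFF);
        NAMED.put("darkBlue", 0x000080);
        NAMED.put("darkCyan", 0x008080);
        NAMED.put("darkGreen", 0x008000);
        NAMED.put("darkMagenta", 0x800080);
        NAMED.put("darkRed", 0x800000);
        NAMED.put("darkYellow", 0x808000);
        NAMED.put("darkGray", 0x808080);
        NAMED.put("lightGray", 0xC0C0C0);
    }

    private HighlightColorSketch() {}

    /** Returns the named color nearest to a hex color such as "#FE0102" or "F0F0F0". */
    static String closest(String hex) {
        int rgb = Integer.parseInt(hex.startsWith("#") ? hex.substring(1) : hex, 16);
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, Integer> e : NAMED.entrySet()) {
            int c = e.getValue();
            long dr = r - ((c >> 16) & 0xFF);
            long dg = g - ((c >> 8) & 0xFF);
            long db = b - (c & 0xFF);
            long distance = dr * dr + dg * dg + db * db;
            if (distance < bestDistance) { // first match wins on ties
                bestDistance = distance;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Under these assumptions, closest("#FE0102") returns "red" and closest("7F7F7F") returns "darkGray".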

Style Names used for rendering various markdown elements

Element styles:

  • ASIDE_BLOCK_STYLE default "AsideBlock", style to use for aside blocks
  • BLOCK_QUOTE_STYLE default "Quotations", style to use for block quotes
  • BOLD_STYLE default "StrongEmphasis", style to use for bold (strong emphasis) text
  • BULLET_LIST_STYLE default "BulletList", numbering list style to use for bullet list item paragraph
  • DEFAULT_STYLE default "Normal", style to use for the markdown element
  • ENDNOTE_ANCHOR_STYLE default "EndnoteReference", style to use for the markdown element
  • FOOTER default "Footer", style to use for the markdown element
  • FOOTNOTE_ANCHOR_STYLE default "FootnoteReference", style to use for the markdown element
  • FOOTNOTE_STYLE default "Footnote", style to use for footnote text
  • FOOTNOTE_TEXT default "FootnoteText", style to use for the markdown element
  • HEADER default "Header", style to use for the markdown element
  • HEADING_1 default "Heading1", style to use for level 1 headings
  • HEADING_2 default "Heading2", style to use for level 2 headings
  • HEADING_3 default "Heading3", style to use for level 3 headings
  • HEADING_4 default "Heading4", style to use for level 4 headings
  • HEADING_5 default "Heading5", style to use for level 5 headings
  • HEADING_6 default "Heading6", style to use for level 6 headings
  • HORIZONTAL_LINE_STYLE default "HorizontalLine", style to use for thematic breaks
  • HYPERLINK_STYLE default "Hyperlink", style to use for hyperlinks
  • INLINE_CODE_STYLE default "SourceText", style to use for inline code
  • INS_STYLE default "Underlined", style to use for inserted (ins) text
  • ITALIC_STYLE default "Emphasis", style to use for italic (emphasis) text
  • LOOSE_PARAGRAPH_STYLE default "ParagraphTextBody", style to use for loose list type items
  • NUMBERED_LIST_STYLE default "NumberedList", numbering list style to use for numbered list item paragraph
  • PARAGRAPH_BULLET_LIST_STYLE default "ListBullet", style to use for tight list type items
  • PARAGRAPH_NUMBERED_LIST_STYLE default "ListNumber", style to use for tight list type items
  • PREFORMATTED_TEXT_STYLE default "PreformattedText", style to use for fenced code and indented code
  • STRIKE_THROUGH_STYLE default "Strikethrough", style to use for strikethrough text
  • SUBSCRIPT_STYLE default "Subscript", style to use for subscript text
  • SUPERSCRIPT_STYLE default "Superscript", style to use for superscript text
  • TABLE_CAPTION default "TableCaption", style to use for table captions
  • TABLE_CONTENTS default "TableContents", style to use for table bodies
  • TABLE_GRID default "TableGrid", style to use for the markdown element
  • TABLE_HEADING default "TableHeading", style to use for table headings
  • TIGHT_PARAGRAPH_STYLE default "BodyText", style to use for tight list type items

List Element Styles

Unordered lists use numbering list style named BulletList while ordered lists use NumberedList. If these are not present then default numbering style (id = 2) is used for unordered lists and default numbering style (id = 3) is used for ordered lists.
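
The fallback rule above amounts to a lookup with a default. The sketch below is a hypothetical model only: ListNumberingSketch and the map of numbering style name to numbering id are illustrative stand-ins, not docx4j or flexmark-java types.

```java
import java.util.Map;

// Hypothetical model of the list numbering fallback described above.
final class ListNumberingSketch {
    static final int DEFAULT_BULLET_NUMBERING_ID = 2;  // documented fallback for unordered lists
    static final int DEFAULT_ORDERED_NUMBERING_ID = 3; // documented fallback for ordered lists

    private ListNumberingSketch() {}

    /**
     * @param numberingStyles numbering style name to numbering id, as found in the document (assumed shape)
     * @param ordered true for ordered lists, false for unordered lists
     */
    static int resolveNumberingId(Map<String, Integer> numberingStyles, boolean ordered) {
        String styleName = ordered ? "NumberedList" : "BulletList";
        Integer id = numberingStyles.get(styleName);
        if (id != null) {
            return id; // the named numbering list style is present
        }
        return ordered ? DEFAULT_ORDERED_NUMBERING_ID : DEFAULT_BULLET_NUMBERING_ID;
    }
}
```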

The following are equivalent to the Renderer properties of the same name and are included in DocxRenderer for convenience.

For the TOC_INSTRUCTION string, see Docx4j GettingStarted under the heading TOC Content Control.

NOTE: Word does not handle inserted HTML very well. Any HTML that is not suppressed will be escaped, i.e. it will render into the document as text. The exception is the <br> tag, which, if enabled, will be rendered as a line break.
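
The note can be summarized as a three-way decision for each piece of inline HTML. The following sketch is a hypothetical model, not the renderer's code: InlineHtmlSketch and its Outcome enum are invented for illustration, the two flags stand in for a suppress setting and for LINEBREAK_ON_INLINE_HTML_BR, and checking suppression before the <br> exception is an assumption about precedence.

```java
// Hypothetical decision model for inline HTML handling.
final class InlineHtmlSketch {
    enum Outcome { SUPPRESSED, LINE_BREAK, LITERAL_TEXT }

    private InlineHtmlSketch() {}

    static Outcome classify(String html, boolean suppressInlineHtml, boolean linebreakOnBr) {
        if (suppressInlineHtml) {
            return Outcome.SUPPRESSED; // assumption: suppression takes precedence
        }
        // Matches <br>, <br/>, <br /> in any letter case.
        if (linebreakOnBr && html.trim().matches("(?i)<br\\s*/?>")) {
            return Outcome.LINE_BREAK;
        }
        // Everything else is escaped, so it appears in the document as text.
        return Outcome.LITERAL_TEXT;
    }
}
```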

Html rendering options available in DocxRenderer for convenience:

  • ESCAPE_HTML_BLOCKS default value of ESCAPE_HTML, escape html blocks found in the document
  • ESCAPE_HTML_COMMENT_BLOCKS default value of ESCAPE_HTML_BLOCKS, escape html comment blocks found in the document.
  • ESCAPE_HTML default false, escape all html found in the document
  • ESCAPE_INLINE_HTML_COMMENTS default value of ESCAPE_HTML_BLOCKS, escape inline html comments found in the document
  • ESCAPE_INLINE_HTML default value of ESCAPE_HTML, escape inline html found in the document
  • PERCENT_ENCODE_URLS default false, percent encode urls
  • RECHECK_UNDEFINED_REFERENCES default false, recheck the existence of references in Parser.REFERENCES for link and image refs marked undefined. Used when new references are added after parsing
  • SUPPRESS_HTML_BLOCKS default value of SUPPRESS_HTML, suppress html output for html blocks
  • SUPPRESS_HTML_COMMENT_BLOCKS default value of SUPPRESS_HTML_BLOCKS, suppress html output for html comment blocks
  • SUPPRESS_HTML default false, suppress html output for all html
  • SUPPRESS_INLINE_HTML_COMMENTS default value of SUPPRESS_INLINE_HTML, suppress html output for inline html comments
  • SUPPRESS_INLINE_HTML default value of SUPPRESS_HTML, suppress html output for inline html
  • HEADER_ID_GENERATOR_NO_DUPED_DASHES default false, When true duplicate - in id will be replaced by a single -
  • HEADER_ID_GENERATOR_RESOLVE_DUPES default true, When true will add an incrementing integer to duplicate ids to make them unique
  • HEADER_ID_GENERATOR_TO_DASH_CHARS default "_", set of characters to convert to - in text used to generate id, non-alpha numeric chars not in set will be removed
  • HEADER_ID_GENERATOR_NON_ASCII_TO_LOWERCASE, default true. When set to false changes the default header id generator to not convert non-ascii alphabetic characters to lowercase. Needed for GitHub id compatibility.
  • HEADER_ID_REF_TEXT_TRIM_LEADING_SPACES, default true. When set to false, leading spaces in link reference text in a heading are not trimmed from the text used to generate the id.
  • HEADER_ID_REF_TEXT_TRIM_TRAILING_SPACES, default true. When set to false, trailing spaces in link reference text in a heading are not trimmed from the text used to generate the id.
  • HEADER_ID_ADD_EMOJI_SHORTCUT, default false. When set to true, emoji shortcut nodes add the shortcut to collected text used to generate heading id.
  • RENDER_HEADER_ID default false, Render a header id attribute for headers using the configured HtmlIdGenerator
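
The HEADER_ID_GENERATOR_* options above describe a simple text-to-id transformation. The sketch below models that description in plain Java. It is not the library's generator: HeaderIdSketch is a hypothetical class, the "-1", "-2" suffix format for duplicates is an assumption, and the sketch lowercases all letters (matching the default of HEADER_ID_GENERATOR_NON_ASCII_TO_LOWERCASE).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical header id generator following the option descriptions above.
final class HeaderIdSketch {
    private final String toDashChars;    // characters converted to '-'
    private final boolean noDupedDashes; // collapse runs of '-' into one
    private final boolean resolveDupes;  // make repeated ids unique
    private final Map<String, Integer> seen = new HashMap<>();

    HeaderIdSketch(String toDashChars, boolean noDupedDashes, boolean resolveDupes) {
        this.toDashChars = toDashChars;
        this.noDupedDashes = noDupedDashes;
        this.resolveDupes = resolveDupes;
    }

    String generate(String headingText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < headingText.length(); i++) {
            char c = headingText.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            } else if (toDashChars.indexOf(c) >= 0) {
                boolean lastIsDash = sb.length() > 0 && sb.charAt(sb.length() - 1) == '-';
                if (!(noDupedDashes && lastIsDash)) {
                    sb.append('-');
                }
            }
            // Other non-alphanumeric characters are removed.
        }
        String base = sb.toString();
        if (!resolveDupes) {
            return base;
        }
        Integer count = seen.get(base);
        if (count == null) {
            seen.put(base, 0);
            return base; // first occurrence keeps the plain id
        }
        int next = count + 1;
        seen.put(base, next);
        return base + "-" + next; // assumed suffix format for duplicates
    }
}
```

With a dash set of " -_" passed in explicitly and both flags true, generating an id for "Hello  World!" twice yields "hello-world" and then "hello-world-1".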