feat: implement html-to-md-swift converter #13

@PsychQuantClaw

Summary

Add a new Layer 3 converter package, html-to-md-swift, that converts HTML documents into Markdown using macdoc's streaming architecture.

Why

  • docs/modular-architecture.md already identifies html-to-md-swift as a planned future converter.
  • HTML → Markdown fits the existing DocumentConverter + StreamingOutput protocol cleanly.
  • It is the most direct next converter after word-to-md-swift because the target-format writer (markdown-swift) already exists.

Package / architecture

Layer 3 package

  • packages/html-to-md-swift
  • Swift Package product: HTMLToMDSwift
  • Module: HTMLToMDSwift

Dependencies

  • Layer 2: doc-converter-swift
  • Layer 1 target writer: markdown-swift
  • HTML parsing: SwiftSoup

This package should not import other converters.
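
A manifest for this layout might look roughly like the sketch below. The sibling paths and the product names DocConverterSwift and MarkdownSwift are assumptions made for illustration and should be checked against the actual packages. The SwiftSoup URL is the upstream repository, and the version floor is a placeholder.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "html-to-md-swift",
    products: [
        .library(name: "HTMLToMDSwift", targets: ["HTMLToMDSwift"]),
    ],
    dependencies: [
        // Assumed: Layer 1 and Layer 2 packages are siblings under packages/.
        .package(path: "../doc-converter-swift"),
        .package(path: "../markdown-swift"),
        // Version floor is a placeholder, not a tested requirement.
        .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0"),
    ],
    targets: [
        .target(
            name: "HTMLToMDSwift",
            dependencies: [
                // Product names below are hypothetical.
                .product(name: "DocConverterSwift", package: "doc-converter-swift"),
                .product(name: "MarkdownSwift", package: "markdown-swift"),
                .product(name: "SwiftSoup", package: "SwiftSoup"),
            ]
        ),
        .testTarget(name: "HTMLToMDSwiftTests", dependencies: ["HTMLToMDSwift"]),
    ]
)
```

No other converter package appears in the dependency list, which keeps the "should not import other converters" rule enforceable at the manifest level.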

DocumentConverter shape

Implement:

  • HTMLConverter: DocumentConverter
  • static let sourceFormat = "html"
  • convert(input:output:options:)
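
As a rough illustration of the conformance, the sketch below defines stand-in versions of the protocols. The real DocumentConverter and StreamingOutput live in doc-converter-swift and their signatures may differ. ConversionOptions, StringOutput, the URL input type, and the write(_:) method are all hypothetical names chosen for this example.

```swift
import Foundation

// Hypothetical stand-ins; the real protocols in doc-converter-swift may differ.
protocol StreamingOutput: AnyObject {
    func write(_ chunk: String) throws
}

struct ConversionOptions {
    var preserveLineBreaks = false  // illustrative option only
}

protocol DocumentConverter {
    static var sourceFormat: String { get }
    func convert(input: URL, output: any StreamingOutput, options: ConversionOptions) throws
}

// In-memory sink, handy for tests.
final class StringOutput: StreamingOutput {
    private(set) var text = ""
    func write(_ chunk: String) throws { text += chunk }
}

struct HTMLConverter: DocumentConverter {
    static let sourceFormat = "html"

    func convert(input: URL, output: any StreamingOutput, options: ConversionOptions) throws {
        let html = try String(contentsOf: input, encoding: .utf8)
        // Placeholder body: parsing and the DOM walk would go here.
        // It only demonstrates that output is written incrementally to the sink.
        try output.write("<!-- \(html.utf8.count) bytes of HTML read -->\n")
    }
}
```

The point of the shape is that convert never returns a String; everything goes through the output sink, so callers can target a file, stdout, or an in-memory buffer.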

Implementation approach:

  • parse HTML with SwiftSoup
  • walk the DOM in document order
  • project HTML semantics directly into Markdown-aware block / inline emission
  • stream Markdown to StreamingOutput instead of building a Markdown AST
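
The walk-and-emit idea can be sketched without any parser dependency. The Node enum below is a hypothetical stand-in for the parsed DOM (SwiftSoup nodes would play this role in the package), and the emit closure stands in for StreamingOutput. Only headings, paragraphs, and three inline tags are handled, so this is an outline of the approach rather than the proposed implementation.

```swift
import Foundation

// Hypothetical stand-in for a parsed DOM tree.
enum Node {
    case text(String)
    case element(tag: String, children: [Node])
}

/// Emits inline content in document order, wrapping known tags in Markdown markers.
func emitInline(_ nodes: [Node], to emit: (String) -> Void) {
    for node in nodes {
        switch node {
        case .text(let s):
            emit(s)
        case .element(let tag, let children):
            let marker: String
            switch tag {
            case "strong", "b": marker = "**"
            case "em", "i": marker = "*"
            case "del", "s": marker = "~~"
            default: marker = ""  // unknown inline tag: keep its content only
            }
            emit(marker)
            emitInline(children, to: emit)
            emit(marker)
        }
    }
}

/// Emits block content as each block is visited; no Markdown AST is built.
func emitBlocks(_ nodes: [Node], to emit: (String) -> Void) {
    for node in nodes {
        // Stray text between blocks is skipped in this sketch.
        guard case .element(let tag, let children) = node else { continue }
        if tag.count == 2, tag.hasPrefix("h"),
           let level = Int(tag.dropFirst()), (1...6).contains(level) {
            emit(String(repeating: "#", count: level) + " ")
            emitInline(children, to: emit)
            emit("\n\n")
        } else if tag == "p" {
            emitInline(children, to: emit)
            emit("\n\n")
        } else {
            emitBlocks(children, to: emit)  // containers such as div or body: descend
        }
    }
}

let doc: [Node] = [
    .element(tag: "h2", children: [.text("Title")]),
    .element(tag: "p", children: [
        .text("Some "),
        .element(tag: "strong", children: [.text("bold")]),
        .text(" text."),
    ]),
]
var out = ""
emitBlocks(doc) { out += $0 }
// out is "## Title\n\nSome **bold** text.\n\n"
```

Because each chunk is handed to emit as soon as it is known, memory use stays proportional to the DOM rather than to the DOM plus an intermediate Markdown tree.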

Initial supported mappings

Block-level

  • headings h1...h6
  • paragraphs p
  • unordered / ordered lists ul / ol / li
  • blockquote
  • fenced code blocks from pre > code
  • horizontal rule hr
  • tables table / tr / th / td
  • line breaks br
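
Tables are the one block mapping with no core Markdown equivalent, so a small sketch may help. It assumes GFM-style pipe tables as the output dialect (the issue does not specify one) and assumes the cell text has already been extracted from th / td elements. markdownTable is a hypothetical helper name.

```swift
import Foundation

/// Renders rows of cell text as a GFM-style pipe table.
/// The first row is treated as the header; shorter rows are padded, longer rows truncated.
func markdownTable(_ rows: [[String]]) -> String {
    guard let header = rows.first, !header.isEmpty else { return "" }
    let columns = header.count

    func line(_ cells: [String]) -> String {
        let padding = Array(repeating: "", count: max(0, columns - cells.count))
        let sized = (cells + padding).prefix(columns)
        // A literal pipe inside a cell would otherwise end the cell early.
        let escaped = sized.map { $0.replacingOccurrences(of: "|", with: "\\|") }
        return "| " + escaped.joined(separator: " | ") + " |"
    }

    var lines = [line(header)]
    lines.append("| " + Array(repeating: "---", count: columns).joined(separator: " | ") + " |")
    lines.append(contentsOf: rows.dropFirst().map(line))
    return lines.joined(separator: "\n") + "\n"
}
```

Note that a table needs its column count before the delimiter row can be written, so this is one place where a small amount of buffering (at least the header row) is hard to avoid even in a streaming design.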

Inline

  • strong / bold
  • emphasis / italic
  • strikethrough (del, s)
  • inline code (code outside pre)
  • links (a[href])
  • images (img[src])
  • raw text with entity decoding and whitespace normalization
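
Whitespace normalization for raw text could look like the sketch below. It assumes entity decoding has already happened during parsing, and it follows the HTML convention of collapsing only ASCII whitespace, so a decoded non-breaking space is left alone. collapseWhitespace is a hypothetical helper name.

```swift
import Foundation

/// Collapses each run of ASCII whitespace (space, tab, LF, CR, form feed)
/// into a single space. Leading and trailing runs are collapsed, not trimmed,
/// because a space at the edge of a text node can separate it from a sibling element.
func collapseWhitespace(_ text: String) -> String {
    var result = ""
    var previousWasSpace = false
    // Iterate over scalars so that "\r\n" is seen as two whitespace scalars
    // rather than a single grapheme cluster.
    for scalar in text.unicodeScalars {
        switch scalar {
        case " ", "\t", "\n", "\r", "\u{0C}":
            if !previousWasSpace { result.unicodeScalars.append(" ") }
            previousWasSpace = true
        default:
            result.unicodeScalars.append(scalar)
            previousWasSpace = false
        }
    }
    return result
}
```

Text inside pre should bypass this step, since fenced code blocks need their original line breaks and indentation.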

CLI integration

Add a new macdoc subcommand group:

  • macdoc html input.html -o output.md

Optional follow-up subcommands can come later, but the first pass should mirror the current word UX.

Testing strategy

  • package-level unit tests (80%+ coverage target)
  • focused fixtures for headings, emphasis, links, lists, code, tables, blockquote, and nested structures
  • end-to-end conversion from temporary .html files to Markdown strings
  • whitespace normalization regression tests to avoid noisy output

Notes

If this lands, md-to-html-swift should be promoted in the conversion matrix as the reverse-path follow-up.
