RFC: Two possibilities for integrating Markdown into TSDoc #12

Open
octogonz opened this issue Mar 23, 2018 · 12 comments
Labels
request for comments A proposed addition to the TSDoc spec

Comments

@octogonz
Collaborator

octogonz commented Mar 23, 2018

Problem Statement

There are numerous incompatible Markdown flavors. For this discussion, let's assume "Markdown" means strict CommonMark unless otherwise specified.

Many people expect to use Markdown notations inside their JSDoc. Writing a Markdown parser is already somewhat tricky, since the grammar is highly sensitive to context (compared to other rich-text formats such as HTML). Extending it with JSDoc tags causes some interesting collisions and ambiguities. Some motivating examples:

1. Code fences that span tags

/**
 * I can use backticks to create a `code fence` that gets highlighted in a typewriter font.
 * 
 * This `@tag` is not a TSDoc tag, since it's inside a code fence.
 * 
 * This {@link MyClass | hyperlink to the `MyClass` base class} should get highlighting
 * in its target text.
 * 
 * This {@link MyClass | example of backtick (`)} has an unbalanced backtick (`)
 * inside a tag.
 */

Intuitively we'd expect it to be rendered like this:


I can use backticks to create a code fence that gets highlighted in a typewriter font.

This @tag is not a TSDoc tag, since it's inside a code fence.

This hyperlink to the MyClass base class should get highlighting in its target text.

This example of backtick (`) has an unbalanced backtick (`) inside a tag.


2. Stars

Stars have the same problems as backticks, but with even more special cases:

/**
 * Markdown would treat these as
 * * bullet
 * * items.
 *
 * Inside code comments, the left margin is sometimes ambiguous:
 ** bullet
 ** items?
 *
 * Markdown confusingly *allows a * inside an emphasis*.
 * Does a *{@link MyClass | * tag}* participate in this?
 */

Intuitively we'd expect it to be rendered like this:


Markdown would treat these as

  • bullet
  • items.

Inside code comments, the left margin is sometimes ambiguous:

  • bullet
  • items?

Markdown confusingly allows a * inside an emphasis.
Does a * tag participate in this?


3. Whitespace

Markdown assigns special meanings to whitespace indentation. For example, indenting 4 spaces is equivalent to a ``` block. Newlines also have lots of special meanings.

This could be fairly confusing inside a code comment, particularly with weird cases like this:

/**     Is this indented? */

/** some junk
    Is this indented? */

/**
Is this okay at all? */

/**
Is this star part of the comment?
 * mystery
Or is it a Markdown bullet? 
 */

Perhaps TSDoc should issue warnings about malformed comment framing.

Perhaps we should try to disable some of Markdown's indentation rules. For example, the TSDoc parser could trim whitespace from the start of each line.
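
The two ideas above (warn about malformed framing, trim leading whitespace) could be sketched roughly as follows. This is only an illustration under stated assumptions: `extractCommentLines` and `ExtractedComment` are hypothetical names, not part of any TSDoc API, and the choice to drop all indentation after the star is just one possible policy.

```typescript
interface ExtractedComment {
  lines: string[];    // comment text with the /** */ framing and leading stars removed
  warnings: string[]; // notes about malformed framing
}

// Hypothetical helper: strips comment framing and trims each line.
function extractCommentLines(comment: string): ExtractedComment {
  const trimmed = comment.trim();
  const framed =
    trimmed.length >= 5 && trimmed.slice(0, 3) === "/**" && trimmed.slice(-2) === "*/";
  if (!framed) {
    return { lines: [], warnings: ["not a /** ... */ comment"] };
  }
  const rawLines = trimmed.slice(3, trimmed.length - 2).split(/\r?\n/);
  const lines: string[] = [];
  const warnings: string[] = [];
  for (let i = 0; i < rawLines.length; i++) {
    const raw = rawLines[i];
    const text = raw.trim();
    if (i === 0) {
      // Text sharing a line with "/**" has no star of its own.
      if (text.length > 0) lines.push(text);
      continue;
    }
    const starred = /^\s*\*\s*(.*)$/.exec(raw);
    if (starred) {
      // Drop the star and ALL indentation after it, so that Markdown's
      // "4 spaces means code" rule can never trigger.
      lines.push(starred[1].replace(/\s+$/, ""));
    } else if (text.length > 0 || i < rawLines.length - 1) {
      // A line without a leading star is kept, but reported.
      warnings.push("line " + (i + 1) + " does not start with '*'");
      lines.push(text);
    }
  }
  return { lines, warnings };
}
```

With this policy, the "Is this okay at all?" example would still be accepted, but it would produce a warning instead of silently guessing at the author's intent.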

4. Markdown Links

Markdown supports these constructs:

[Regular Link](http://example.com)

[Cross-reference Link][1]
. . .
[1]: http://b.org

![Image Link](http://example.com/a.png)

Autolinks are handy:  http://example.com
However if you want an accurate URL-detector, it turns out to be a fairly big library dependency.

The Markdown link functionality partially overlaps with JSDoc's {@link} tag. But it's missing support for API item references.

5. Markdown Tables

Markdown tables have a ton of limitations. Many constructs aren't supported inside table cells. You can't even put a newline inside a table cell. CommonMark had a long discussion about this, but so far does not support the pipes-and-dashes table syntax at all. Instead it uses HTML tables. This seems pretty wise.

6. HTML elements

Most Markdown flavors allow HTML mixed into your content. The CommonMark spec has an entire section about this. This is convenient, although HTML is an entire separate grammar with its own complexities. For example, HTML has a completely distinct escaping mechanism from Markdown.

Here are a few interesting cases that show some interactions:

/**
 * Here's a <!-- @remarks --> tag inside an HTML comment.
 *
 * Here's a TSDoc tag that {@link MyClass | <!-- } seemingly starts --> an HTML comment.
 *
 * The `@remarks` tag normally separates two major TSDoc blocks.  Is it okay for that
 * to appear inside a table?
 *
 * <table><tr><td>
 * @remarks
 * </td></tr></table>
 */

Two Possible Solutions

Option 1: Extend an existing CommonMark library

The most natural approach would be for the TSDoc parser to include an integrated CommonMark parser. The two grammars would be mixed together. We definitely don't want to write a CommonMark parser from scratch, so instead the TSDoc library would need to extend an existing library. Markdown-it and Flavormark are possible choices that are both oriented towards custom extensions.

Possible downsides:

  • Incorporating full Markdown into the TSDoc AST nodes implies that our doc comment emitter would need to be a full Markdown emitter. (In my experience, correctly emitting Markdown is every bit as tricky as parsing Markdown.)
  • To support an entrenched backend with its own opinionated Markdown flavor, this approach wouldn't pass through Markdown content from doc comments; instead the backend would have to parse AST nodes that were emitted back to Markdown. This can be good (if you're rigorous and writing a proper translator) or bad (if you're taking the naive route).
  • This approach couples our API contract (e.g. the AST structure) to an external project
  • Possibly increases friction for tools that are contemplating taking a dependency on @microsoft/tsdoc

Option 2: Treat full Markdown as a postprocess

A possible shortcut would be to say that TSDoc operates as a first pass that snips out the structures we care about, and returns everything else as plain text. We don't want to get tripped up by backticks, so we make a small list of core constructs that can easily screw up parsing:

  • code fences (backticks)
  • links
  • CommonMark escapes
  • HTML elements (but only as tokens, ignoring nesting)
  • HTML comments (?)

Anything else is treated as plain text for TSDoc, and gets passed through (to be possibly reinterpreted by another layer of the documentation pipeline).

/**
 * This is *bold*. Here's a {@link MyClass | link to `MyClass`}. <div>
 * @remarks
 * Here's some more stuff. </bad>
 */

Here's some pseudocode for a corresponding AST:

[
  {
    "nodeKind": "textNode",
    "content": "This is *bold*. Here's a "  // <-- we ignore the Markdown stars
  },
  {
    "nodeKind": "linkNode",
    "apiItemReference": {
      "itemPath": "MyClass"
    },
    "linkText": [
      {
        "nodeKind": "textNode",
        "content": "link to "
      },
      {
        "nodeKind": "codeFenceNode",  // <-- we parse the backticks though
        "children": [
          {
            "nodeKind": "textNode",
            "content": "MyClass"
          }
        ]
      }
    ]
  },
  {
    "nodeKind": "textNode",
    "content": ". "
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "div"
  },
  {
    "nodeKind": "customTagNode",
    "tag": "@remarks"
  },
  {
    "nodeKind": "textNode",
    "content": "Here's some more stuff."
  },
  {
    "nodeKind": "htmlElementNode",
    "elementName": "bad", // <-- we care about HTML delimiters, but not HTML structure
    "isEndTag": true
  }
]
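
For readers who prefer types to JSON, the node kinds in the pseudocode above could be modeled along these lines. This is a sketch only: `DocNode` and `emitPassthrough` are hypothetical names invented for illustration, and the emitter shows just one way a backend might hand the unparsed text (stars included) to a later Markdown stage.

```typescript
// Node kinds taken from the pseudocode; the type itself is an assumption.
type DocNode =
  | { nodeKind: "textNode"; content: string }
  | { nodeKind: "codeFenceNode"; children: DocNode[] }
  | { nodeKind: "linkNode"; apiItemReference: { itemPath: string }; linkText: DocNode[] }
  | { nodeKind: "htmlElementNode"; elementName: string; isEndTag?: boolean }
  | { nodeKind: "customTagNode"; tag: string };

// Re-emit the nodes as text for a downstream Markdown renderer.
function emitPassthrough(nodes: DocNode[]): string {
  let out = "";
  for (const node of nodes) {
    switch (node.nodeKind) {
      case "textNode":
        out += node.content; // Markdown stars etc. pass through untouched
        break;
      case "codeFenceNode":
        out += "`" + emitPassthrough(node.children) + "`";
        break;
      case "linkNode":
        out +=
          "{@link " + node.apiItemReference.itemPath + " | " +
          emitPassthrough(node.linkText) + "}";
        break;
      case "htmlElementNode":
        out += (node.isEndTag ? "</" : "<") + node.elementName + ">";
        break;
      case "customTagNode":
        out += "\n" + node.tag + "\n"; // block tags get their own line
        break;
    }
  }
  return out;
}
```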

Possible downsides:

  • The resulting syntax would be fairly counterintuitive for people who assume they're writing real Markdown. All the weird little Markdown edge cases would be handled oddly.
  • This model invites a documentation pipeline to do nontrivial syntactic postprocessing. For content authors, the language wouldn't have a unified specification. (This isn't like a templating library that supports proprietary HTML tags. Instead, it's more like if one tool defined HTML without attributes, and then another tried to retrofit attributes on top of it.)
  • We might end up having to code a small CommonMark parser (although it would be a subset of the work involved for a parser that handles the full grammar)
  • How will the second stage Markdown parser accurately report line numbers for errors?

What do you think? Originally I was leaning towards #1 above, but now I'm wondering if #2 might be a better option.

@octogonz octogonz added the request for comments A proposed addition to the TSDoc spec label Mar 23, 2018
@aciccarello

aciccarello commented Mar 24, 2018

TypeDoc takes the second approach. There are some special case situations where parsing needs to be markdown aware (e.g. code blocks) but most of the parsing can be passed to a true markdown parser.

When testing this out I noticed some bugs with how TypeDoc handles links and markdown. You can see how TypeDoc renders some of the examples above as well as how it handles the following links.

/**

 * TypeDoc handles a square bracket syntax to link to [[MyClass]]
 * with [[MyClass|pipe labeled links]] and [[MyClass space labeled links]]
 * 
 * TypeDoc handles basic links to {@link MyClass}
 * but {@link MyClass | labeled links} are broken.
 * 
 * Code fences can expose parsed links `{@link MyClass}`
 * 
 * ```
 * As are code blocks with {@link MyClass} text
 * ```
 */
export class MyClass {

[Screenshot: TypeDoc rendering of links and code fences]

For #2 and #3, TypeDoc is very forgiving. Generally it removes the first asterisk of a line. Whitespace is usually retained. If there is an empty (except the asterisk) line in between lines, a break is generated.

[Screenshot: TypeDoc whitespace and list parsing]

[Screenshot: generated HTML showing whitespace being retained]

HTML is also supported, causing some interesting results.

[Screenshot: TypeDoc rendering of comment with HTML elements]

@yume-chan

Currently, JSDoc tags split into two categories: block tags and inline tags. However, there are only two inline tags ({@link} and {@tutorial}), both of which represent links.

I'd like to know whether any other inline tags are in use today, and how likely it is that TSDoc will add more inline tags in the future. If both answers are no, can we just throw the whole "inline tags" concept away and extend Markdown links to replace the two existing tags?

By doing this, for question 1, we don't need to worry about the collision between inline JSDoc tags and inline Markdown syntaxes; for question 4, we will have a single (instead of two) but powerful syntax to express links, and we can just rely on the Markdown parser to parse comments.

Then the only remaining piece is the block tags. As the name suggests, I expect them to be the first non-whitespace token on their lines, to start a block. Anything between two tags (or between a tag and the end of the comment) belongs to the tag above it. This also answers #13.

For questions 2 and 3, personally I want to enforce well-formed comments (from the second line on, every line starts with exactly one star and exactly one whitespace character) rather than tolerate arbitrarily formatted lines that gain us nothing.
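
The block-tag rule described above (a tag must be the first non-whitespace token on its line, and everything up to the next tag belongs to it) can be sketched in a few lines. The names `splitBlockTags`, `TagBlock` and `SplitComment` are hypothetical, and the input is assumed to be comment lines whose `/** */` framing and leading stars have already been stripped.

```typescript
interface TagBlock {
  tag: string;     // e.g. "@remarks"
  lines: string[]; // everything up to the next block tag
}

interface SplitComment {
  description: string[]; // text before the first block tag
  blocks: TagBlock[];
}

function splitBlockTags(lines: string[]): SplitComment {
  const result: SplitComment = { description: [], blocks: [] };
  let current: TagBlock | undefined;
  for (const line of lines) {
    // A block tag is the first non-whitespace token on its line.
    const m = /^\s*(@[A-Za-z][A-Za-z0-9]*)\s*(.*)$/.exec(line);
    if (m) {
      current = { tag: m[1], lines: m[2].length > 0 ? [m[2]] : [] };
      result.blocks.push(current);
    } else if (current) {
      current.lines.push(line); // belongs to the tag above
    } else {
      result.description.push(line);
    }
  }
  return result;
}
```

Note that an inline `{@link ...}` never starts a block here, because its line would begin with `{` or ordinary text rather than `@`.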

@aciccarello

@yume-chan It would be great to simplify the comment parsing. However, I think it may be surprising to JavaScript developers familiar with JSDoc to not support {@link} tags. I would expect TSDoc should support the text of existing doc comments as projects convert over to TypeScript. Additionally, markdown links don't seem well suited for the API links discussed in #9.

@octogonz
Collaborator Author

Getting back in the saddle, today we merged the PR that sets up the initial tsdoc parser library project. I'm starting with Option 2 and we'll see how that goes. I've been experimenting with different approaches for the tokenizer strategy and will follow up.

@octogonz
Collaborator Author

I'd like to know whether any other inline tags are in use today, and how likely it is that TSDoc will add more inline tags in the future. If both answers are no, can we just throw the whole "inline tags" concept away and extend Markdown links to replace the two existing tags?

The @include tag is another potential inline tag. For custom tags, I believe {@tagname parameters...} is the only JSDoc-flavored way to allow custom tags with arbitrary parameters. The block tags (e.g. @tagname) can't have parameters because we don't know where their content ends.

Markdown links don't provide a generalized pattern for parameterized tags. Also their link target is somewhat vague ("zero or more characters") which might make it difficult to detect the rather elaborate reference syntax that @MartynasZilinskas was working on.

@octogonz
Collaborator Author

So, the big architecture of the parser would be like this:

  1. Extract comment lines from their "/** */" blocks (see this PR)
  2. Tokenize the content of the lines (e.g. "<" symbol, chunk-of-text, etc.)
  3. Parse the lines into pre-TSDoc AST nodes (e.g. CommonMark code fences, HTML elements, HTML comments, etc).
  4. For the non-escaped text, parse TSDoc block tags and inline tags
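
As a rough illustration of step 2, a tokenizer might split the content into single-character delimiter tokens and chunks of ordinary text. This is a sketch under assumptions: the token kinds and the names `tokenize`, `Token` and `DELIMITERS` are made up for this example and are not taken from the actual parser.

```typescript
type TokenKind =
  | "text" | "lessThan" | "greaterThan" | "backtick" | "backslash"
  | "atSign" | "leftBrace" | "rightBrace" | "newline";

interface Token {
  kind: TokenKind;
  text: string;
}

// Characters that later stages may need to treat as delimiters.
const DELIMITERS: { [ch: string]: TokenKind } = {
  "<": "lessThan",
  ">": "greaterThan",
  "`": "backtick",
  "\\": "backslash",
  "@": "atSign",
  "{": "leftBrace",
  "}": "rightBrace",
  "\n": "newline"
};

function tokenize(content: string): Token[] {
  const tokens: Token[] = [];
  let chunk = ""; // accumulates a run of ordinary characters
  for (let i = 0; i < content.length; i++) {
    const ch = content.charAt(i);
    const isDelimiter = Object.prototype.hasOwnProperty.call(DELIMITERS, ch);
    if (!isDelimiter) {
      chunk += ch;
      continue;
    }
    if (chunk.length > 0) {
      tokens.push({ kind: "text", text: chunk });
      chunk = "";
    }
    tokens.push({ kind: DELIMITERS[ch], text: ch });
  }
  if (chunk.length > 0) tokens.push({ kind: "text", text: chunk });
  return tokens;
}
```

Concatenating the `text` fields of the tokens reproduces the input exactly, which makes it straightforward to report positions back against the original comment.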

By pre-TSDoc I mean a conservative subset of the minimal CommonMark-compatible constructs that the TSDoc stage needs to understand in order to avoid accidentally parsing something like these examples:

/** 
 * Not a TSDoc tag: \@tag
 * Not a TSDoc tag: `@tag`
 * Not a TSDoc tag: <!-- @tag -->
 * Not a TSDoc tag:
 * ```
 * @tag
 * ```
 * Not a TSDoc tag:  <div data-text="@tag" />
 */

So the basic pre-TSDoc constructs would be:

  • CommonMark escaped characters (backslash)
  • CommonMark code fences/spans (backticks)
  • HTML elements/attributes, but not inner text e.g. <table><tr><td>@tag</td></tr></table> does contain a transformable TSDoc tag
  • HTML comments
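
To make the list above concrete, here is a deliberately simplified scanner that skips those four constructs and reports whatever `@tag` candidates remain. It is a sketch, not the proposed algorithm: `findTsdocTags` and `skipHtmlElement` are hypothetical names, the input is assumed to have its comment framing already removed, and none of CommonMark's subtler rules (blockquote prefixes, list bullets inside code spans, exact backtick-run matching) are handled.

```typescript
// Returns the index just past the closing ">" of an HTML element that starts
// at `start`, or -1 if it never closes. Quoted attribute values are skipped.
function skipHtmlElement(text: string, start: number): number {
  let quote = "";
  for (let j = start + 1; j < text.length; j++) {
    const c = text.charAt(j);
    if (quote !== "") {
      if (c === quote) quote = "";
    } else if (c === '"' || c === "'") {
      quote = c;
    } else if (c === ">") {
      return j + 1;
    }
  }
  return -1;
}

function findTsdocTags(text: string): string[] {
  const tags: string[] = [];
  let i = 0;
  while (i < text.length) {
    const ch = text.charAt(i);
    if (ch === "\\") {
      i += 2; // backslash escape: skip the escaped character too
      continue;
    }
    if (ch === "`") {
      // Code span or fence: skip to the next run of the same backticks.
      let run = 1;
      while (text.charAt(i + run) === "`") run++;
      const close = text.indexOf(text.slice(i, i + run), i + run);
      i = close === -1 ? i + run : close + run; // unclosed run: plain text
      continue;
    }
    if (text.slice(i, i + 4) === "<!--") {
      const close = text.indexOf("-->", i + 4);
      i = close === -1 ? i + 4 : close + 3;
      continue;
    }
    if (ch === "<" && /[A-Za-z\/]/.test(text.charAt(i + 1))) {
      // Skip the element and its attributes, but not its inner text.
      const end = skipHtmlElement(text, i);
      if (end !== -1) {
        i = end;
        continue;
      }
    }
    if (ch === "@") {
      const m = /^@[A-Za-z][A-Za-z0-9]*/.exec(text.slice(i));
      if (m) {
        tags.push(m[0]);
        i += m[0].length;
        continue;
      }
    }
    i++;
  }
  return tags;
}
```

Run against the "Not a TSDoc tag" examples above, this finds no tags, while the `@tag` inside the table cell is still reported.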

These are questionable:

  • CommonMark ATX headings (# Heading)
  • CommonMark links including image links
  • CommonMark emphasis characters (e.g. **bold** or _italics_)
  • autolinks (e.g. http://blarg/@tag)

These I'm proposing to NOT consider in the TSDoc stage (but a documentation tool's backend Markdown renderer is free to process them):

  • CommonMark lists, blockquotes, breaks, etc.
  • HTML escapes (e.g. &amp;, <![CDATA[, etc)
  • HTML directives (<?, <!)
  • CommonMark escaped end-of-line (hard line break)

Thoughts? Did we miss any important CommonMark constructs (that would trip up the TSDoc parser)?

@octogonz
Collaborator Author

octogonz commented Jun 27, 2018

Important Update: Friends, I am losing my mind trying to pick out an internally-consistent subset of CommonMark syntax! Recall that the planned compromise was to allow ambiguity about how documentation gets rendered, while instead focusing on (1) being very rigorous about whether an @ is a TSDoc tag or not, and (2) producing a grammar that's reasonably intuitive and faithful to Markdown. Unfortunately the Markdown grammar is just really crazy.

Some sinister edge cases...

According to the CommonMark spec, code spans can be split across multiple lines, but not if there's a list bullet in there. Example:

Quoted code: `@one +
@two`

Parseable TSDoc tags: `@one
+ @two`

Blockquotes can stick ">" characters in the middle of any construct. Example:

> This is a complete HTML tag and not TSDoc:
> <tag attr1="@one"
> attr2="@two" />

> `Similar to how this code span
> does not contain a ">"`

Whitespace can really matter. Example:

These are all invalid HTML tags, and thus contain valid TSDoc tags:

< tag attr1="@one"
attr2="@two" />

<tag attr1="@one"

attr2="@two" />

    <tag attr1="indented code block"
attr2="@two" />

Concerns

These are just a few examples. There are endless weird edge cases like this. It raises a couple concerns:

  1. A "core" syntax doesn't exist. It's starting to seem unrealistic that we can find an internally-consistent "pre-TSDoc" subset of Markdown syntax. Instead, the TSDoc library would need to include a nearly complete Markdown parser. Given our goal of parsing content "in place" in a TypeScript source file and very fast, we probably cannot use an existing library directly; it would need to be a fork or new implementation. The CommonMark reference parser is pretty small and readable though, so reimplementing it seems less daunting than before. However...

  2. Authoring experience. I am very troubled by all these "gotchas" in the syntax. I showed the samples above to some coworkers who had used Markdown many times before, and they were very surprised by the output. These issues generally go unnoticed in real life, since Markdown editors always have an interactive "preview" tab so you can fiddle with the spaces/symbols until your output looks right. But this may not be the case for TypeScript authoring pipelines. In my current setup at work, developers don't get to see the rendered output until after their PR is merged (unless they made a special effort to manually invoke the doc pipeline). Even if we built an awesome VS Code plugin to give live previews of TSDoc, there is no guarantee a particular doc engine will perfectly mimic all the CommonMark edge cases as seen in the VS Code preview.

I originally got involved with TSDoc because API Extractor was having problems where computer symbols were constantly getting rendered wrongly on the web site, due to lack of formalism about the doc comment syntax. I really want to make sure all our work here actually solves that problem.

A new direction

So, I'm proposing a different approach to the Markdown integration:

  • We will not attempt to guarantee that CommonMark and TSDoc will generally interpret a string the same way
  • Instead, TSDoc will incorporate a complete parser for a proprietary flavor of Markdown
  • This "TSDoc-flavored-Markdown" (TSFM) will be a core set of essential CommonMark notations, with as little extra baggage as possible
  • The primary goal of TSFM will be "no gotchas": Enable a layperson to memorize all the syntax rules, such that they can easily predict exactly how their input will get rendered.

Ideally, we would design it such that every interesting markup sequence has a normalized form that will be interpreted identically by TSDoc and CommonMark. And then in strict mode, the TSDoc library could report warnings for TSFM constructs that would be mishandled by CommonMark. (I need to do a little more research to confirm whether this is possible.)

Feedback?

Does this new direction make sense? Does anyone see a potential problem with it?

The main consequence I'm aware of is more work for documentation engines that render markdown as an output, since they cannot naively pass through the TSDoc text.

@octogonz
Collaborator Author

I opened a separate issue #29 to provide some concrete details about this "TSFM" idea.

@typhonrt

Does this new direction make sense? Does anyone see a potential problem with it?

For the sake of a quicker first release of the effort, is it necessary to parse Markdown at all?

At least w/ ESDoc and hence my significantly updated fork TJSDoc comment blocks are parsed such that all text above the first tag parsed from a new line is considered as the description / @desc tag. Once the first tag from start of a newline including whitespace with a leading @ is seen then everything else is then parsed just as a block of tags. IE you can't intermingle between tags and description. In this manner the description tag / automatic leading top portion in publishing the docs can be treated as wholesale markdown if the doc tooling at hand decides to do so.

In most doc tooling pipelines it can be assumed that plugins are supported. Further processing of tags for Markdown can be handled via a plugin, just like the JSDoc markdown plugin.

The core value to me of TSDoc is defining a standard set of tags and providing a tag parser that treats a block of text only parsing tags and nothing else. At least if this was split out as a separate utility method / option to invoke then I'd be happy. If desired also offer the full AST breakdown of comments w/ interspersed tags and markdown / custom TSFM approach. Seems like a whole lot of work though out of the gate at least.

TSFM may face the challenge of being too opinionated regarding wide adoption.

@octogonz
Collaborator Author

octogonz commented Jun 30, 2018

At least w/ ESDoc and hence my significantly updated fork TJSDoc comment blocks are parsed such that all text above the first tag parsed from a new line is considered as the description / @desc tag. Once the first tag from start of a newline including whitespace with a leading @ is seen then everything else is then parsed just as a block of tags. IE you can't intermingle between tags and description. In this manner the description tag / automatic leading top portion in publishing the docs can be treated as wholesale markdown if the doc tooling at hand decides to do so.

I don't think this would be sufficient for my own use case. For example we need {@link} inline with the documentation text. And our users very much are asking for Markdown-like rich text formatting -- but it would be a major problem if a misparsed backtick caused @beta to get ignored accidentally.

The core value to me of TSDoc is defining a standard set of tags and providing a tag parser that treats a block of text only parsing tags and nothing else. At least if this was split out as a separate utility method / option to invoke then I'd be happy. If desired also offer the full AST breakdown of comments w/ interspersed tags and markdown / custom TSFM approach.

I believe the latest design will handle both cases pretty easily. Currently we are planning "strict" and "lax" modes for the parser. Right now it's looking like the same parser algorithm will handle both modes. The main difference will be that in "lax" mode a consumer would ignore the error nodes (and simply render them as plain text), and in "strict" mode there will be additional validations/checks that can produce errors/warnings that are ignored in "lax" mode. If your specific style of documentation can be seen as a subset of the full TSDoc grammar, then it could be modeled as an additional mode with slightly different validations. It will also be relatively easy to turn parser features on/off. So e.g. if someone wanted to say "I want backticks to always be treated as plain backticks" we can provide a switch to turn off the code span parsing.
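
To illustrate the idea (this is a toy, not the real design): one parser emits error nodes, a "lax" consumer renders them as plain text, a "strict" consumer reports them, and a switch disables code span parsing entirely. All names here (`parseLine`, `ParserSwitches`, `renderPlain`, `diagnostics`) are hypothetical.

```typescript
type ParsedNode =
  | { kind: "text"; text: string }
  | { kind: "codeSpan"; code: string }
  | { kind: "error"; text: string; message: string };

interface ParserSwitches {
  parseCodeSpans: boolean; // false: backticks are always plain backticks
}

function parseLine(line: string, switches: ParserSwitches): ParsedNode[] {
  if (!switches.parseCodeSpans) return [{ kind: "text", text: line }];
  const nodes: ParsedNode[] = [];
  let pos = 0;
  while (pos < line.length) {
    const open = line.indexOf("`", pos);
    if (open === -1) {
      nodes.push({ kind: "text", text: line.slice(pos) });
      break;
    }
    if (open > pos) nodes.push({ kind: "text", text: line.slice(pos, open) });
    const close = line.indexOf("`", open + 1);
    if (close === -1) {
      // Keep the raw text so a lax consumer can still render it.
      nodes.push({ kind: "error", text: line.slice(open), message: "unbalanced backtick" });
      break;
    }
    nodes.push({ kind: "codeSpan", code: line.slice(open + 1, close) });
    pos = close + 1;
  }
  return nodes;
}

// Both modes render the same way: error nodes fall back to their raw text.
function renderPlain(nodes: ParsedNode[]): string {
  let out = "";
  for (const node of nodes) {
    if (node.kind === "codeSpan") out += "`" + node.code + "`";
    else out += node.text;
  }
  return out;
}

// Only strict mode surfaces the error nodes as diagnostics.
function diagnostics(nodes: ParsedNode[], mode: "strict" | "lax"): string[] {
  const messages: string[] = [];
  if (mode === "lax") return messages;
  for (const node of nodes) {
    if (node.kind === "error") messages.push(node.message);
  }
  return messages;
}
```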

Seems like a whole lot of work though out of the gate at least.

This week I actually made a bunch of progress on an algorithm that handles the ideas proposed in #29. It's moving along very fast. (Having the dev design sorted out really makes a big difference heheh.) I expect to have something I can publish next week for people to provide feedback on.

@sharwell
Member

sharwell commented Dec 3, 2018

Extending it with JSDoc tags causes some interesting collisions and ambiguities.

The most straightforward way for me to think about this is to treat {@tag ...} as an indicator of inline content, and @tag at the start of a line as an indicator of block content. The precedence rules cover this.

The actual handling of inline content could be very similar to the way **Bold** interacts with `Inline code`. The only difference is that the beginning of the inline is identified by {@ (or possibly {@<identifier>), and the end of the inline is identified by }.

Are there cases which do not work with this in a straightforward manner?
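
A minimal sketch of that delimiter rule, assuming the simplest possible reading (the inline starts at `{@<identifier>` and ends at the first `}`). The names `splitInlineTags` and `Segment` are hypothetical, and the sketch does not consider backticks or escapes, so the unbalanced-backtick and `<!--` examples from the problem statement are exactly the cases it would not resolve.

```typescript
type Segment =
  | { kind: "text"; text: string }
  | { kind: "inlineTag"; tagName: string; content: string };

function splitInlineTags(text: string): Segment[] {
  const segments: Segment[] = [];
  // "{@" + identifier opens the inline; the first "}" closes it.
  const pattern = /\{(@[A-Za-z][A-Za-z0-9]*)\s*([^}]*)\}/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    if (m.index > last) {
      segments.push({ kind: "text", text: text.slice(last, m.index) });
    }
    segments.push({ kind: "inlineTag", tagName: m[1], content: m[2].trim() });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ kind: "text", text: text.slice(last) });
  }
  return segments;
}
```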

@vassudanagunta

The Markdown link functionality partially overlaps with JSDoc's {@link} tag. But it's missing support for API item references.

Easy peasy:

This method is part of the [Statistics subsystem]({@link core-library#Statistics}).

See #70 (comment)
