fix: strip HTML entities like &gt; #27

jedwards1211 · 2020-02-07T06:39:03Z

I discovered in my repo https://github.com/vscodeshift/material-ui-snippets that HTML entities like &gt; get stripped out of slugs. Right now in this and other libs, &lt;div&gt; gets converted to ltdivgt, but the actual GitHub slug is just div.

wooorm · 2020-02-07T08:47:18Z

This can (and should) be solved on your side, by first decoding html entities: &lt;div&gt; -> <div>, then passing it to this library, which will give just div.

I personally maintain parse-entities that does this, but there are many more libraries that do that too, which you could use.

This library focusses on slug generation, not the Markdown or HTML parsing. So I’d say it’s out of scope.
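The "decode first, then slug" order suggested above can be sketched as follows. This is only an illustration: `decodeBasicEntities` and `simpleSlug` are hypothetical helpers written for this example, not the API of parse-entities or of this library, and `simpleSlug` is a simplified stand-in (lowercase, drop most punctuation, turn whitespace into hyphens) rather than the library's actual algorithm.

```javascript
// Tiny entity table for illustration only; a real decoder covers far more.
const NAMED_ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', apos: "'" };

// Hypothetical helper: decodes a few named entities plus numeric references.
// Entities without a trailing `;` are left untouched by this sketch.
function decodeBasicEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references as they were.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Simplified stand-in for a GitHub-style slugger (an assumption, not the real one).
function simpleSlug(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s_-]/gu, '') // drop punctuation such as & ; < >
    .replace(/\s/g, '-');               // whitespace becomes hyphens
}

const heading = '&lt;div&gt; test';
const rawSlug = simpleSlug(heading);                          // entities still encoded
const decodedSlug = simpleSlug(decodeBasicEntities(heading)); // decoded first

console.log(rawSlug, decodedSlug);
```

Slugging the encoded string keeps the letters of the entity names, while decoding first lets the `<` and `>` characters be dropped as punctuation.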

jedwards1211 · 2020-02-07T10:27:22Z

Doesn't make sense...

The overall goal of this package is to emulate the way GitHub handles generating markdown heading anchors as close as possible.

This implies that one should be able to pass the exact text of a markdown heading to this lib

jedwards1211 · 2020-02-07T10:28:27Z

That said, I was unaware that there could be entities that don't end with ;

jedwards1211 · 2020-02-07T10:34:31Z

I walk my comment back I guess. But I'm a bit confused what would count as "parsed" input to pass to this library. For instance if there is a header with a link and backticks like

# Usage with [`react`](https://reactjs.org)

What should the "parsed" input passed to this library look like?

wooorm · 2020-02-07T10:59:48Z

Parsed input in that case is Usage with react.

If you want to parse markdown, see for example remark, or other libraries.

jedwards1211 · 2020-02-07T16:37:24Z

I would expect the output of that parser to be a syntax tree, even within the header node I would expect the link to be a subnode? Not just simple string with all the control characters stripped away? Hence my confusion

wooorm · 2020-02-07T17:10:32Z

Wait, I don’t get it, which parser should be a syntax tree but isn’t? Where are links stripped?

jedwards1211 · 2020-02-07T17:19:16Z

No I mean the parser does output a syntax tree, as expected.

remark-parse — Parse Markdown documents to syntax trees

https://github.com/syntax-tree/mdast#heading

It's really not obvious how I would get a simple string like Usage with react out of this.

wooorm · 2020-02-07T18:16:32Z

Look, sorry, but I have no clue what you want to do or how to help you, but a) this project takes text and turns it into a slug, b) remark is an advanced system of hundreds of projects to work with markdown.

remark may be too advanced for your use case; I don’t know what your use case is.

jedwards1211 · 2020-02-07T19:35:13Z

No worries...My use case is just that I'm autogenerating markdown and was looking into using this lib to generate slugs instead of my own custom code.
I just wound up confused what this library is really intended for since there's little clarity on whether anything exists that processes the header into the basic text input you're expecting in this lib. Adding to the confusion is the fact that some special characters inside the header like do get sanitized by this lib, by accident I guess, which gave the impression it was meant to take the raw header.

I had also used markdown-toc to generate some tables of contents but I'm not sure it does full blown parsing either because it doesn't strip HTML entities out of the slugs it generates either.

I guess basically the text we pass to this lib should be the same text we would get if we select the header and copy in compiled HTML? Out of curiosity do you know if there's anything within remark that actually spits out that text?

jedwards1211 · 2020-02-07T20:05:43Z

@wooorm okay I discovered that Facebook's docusaurus library, which uses github-slugger, doesn't generate the correct slugs for headings with entities either, so sounds like we should better document the form of input this lib expects, since it's popular.

I added this header to one of their test fixtures:

# &lt;div&gt; test

Here in the failing test you can see that the entities don't get removed:

    - Snapshot
    + Received

    @@ -100,9 +100,9 @@
          "rawContent": "bar",
        },
        Object {
          "children": Array [],
          "content": "&lt;div&gt; test",
    -     "hashLink": "div-test",
    +     "hashLink": "ltdivgt-test",
          "rawContent": "&lt;div&gt; test",
        },
      ]

If even a major Facebook project has the same misconceptions I did about how to use github-slugger, it's probably not obvious enough.

wooorm · 2020-02-17T08:39:34Z

> It's […] not obvious how I would get a […] string […] out of this.

First, find a node, then serialize it with mdast-util-to-string
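As a rough illustration of what serializing a heading node means, the sketch below flattens a hand-written mdast-style tree for the heading ``# Usage with [`react`](https://reactjs.org)`` into plain text. The tree is typed out by hand rather than produced by remark, and `nodeToText` is a hypothetical stand-in that only approximates what a utility like mdast-util-to-string does.

```javascript
// Hand-written approximation of the mdast tree for:
//   # Usage with [`react`](https://reactjs.org)
const heading = {
  type: 'heading',
  depth: 1,
  children: [
    { type: 'text', value: 'Usage with ' },
    {
      type: 'link',
      url: 'https://reactjs.org',
      children: [{ type: 'inlineCode', value: 'react' }],
    },
  ],
};

// Hypothetical helper: concatenates the text of a node and its descendants.
// Literal nodes (text, inlineCode) carry `value`; parents carry `children`.
function nodeToText(node) {
  if (typeof node.value === 'string') return node.value;
  if (Array.isArray(node.children)) {
    return node.children.map(nodeToText).join('');
  }
  return '';
}

const text = nodeToText(heading);
console.log(text); // Usage with react
```

The link URL and the backticks never appear in the result because they live in node properties and node types, not in any `value` field.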

> My use case is […] generating markdown

The whole unified/remark/mdast/unist is really good at that

> I […] wound up confused what this library is […] whether anything exists that processes the header into […] text […]

To clarify: no, it does not do that; as the docs do not mention it, I would assume it does not exist.
Feel free to suggest how your question could be clarified in the readme, PRs are welcome.

> […] some special characters inside the header […] do get sanitized by this lib […]

That’s not an accident, that’s the exact behavior of GitHub’s slugging mechanism and the goal of this project.

> I […] used markdown-toc to generate some tables of contents but […] it doesn't strip HTML entities out of the slugs it generates either.

rehype-slug and rehype-toc do

> […] we pass […] the same text we would get if we select the header and copy in compiled HTML?

It’s closer to node.textContent

> Out of curiosity do you know if there's anything within remark that actually spits out that text?

Could you clarify what you want to do?

> If even a major Facebook project has the same misconceptions I did about how to use github-slugger, it's probably not obvious enough.

Please raise an issue over there. It may be fixed in docusaurus@2 already. Btw, docusaurus maintainers and unified maintainers do talk about stuff like this.

jedwards1211 · 2020-02-17T20:57:53Z

> Could you clarify what you want to do?

Convert the raw markdown header into what string will guarantee github-slugger outputs the same slug as GitHub (assuming we're fully aware of GitHub's behavior). Is "the text the user would see on screen" an accurate term for the input format?

I'm just trying to figure out how we can document this clearly so that other people won't have the same confusion I did. It's clear to you where github-slugger fits in this ecosystem of markdown tools because you work on a lot of them, but for someone writing a one-off script to output text to their README.md and thinking "lemme see if there's a quick and easy way to generate GitHub slugs for these headers", github-slugger looks like more of a one-stop shop than it is.

And also when I realized that HTML entities cause problems in downstream libraries (docusaurus) and also unrelated libraries (markdown-toc) I got the impression that there isn't enough awareness of the fact that tools should deal with HTML entities. For me the most practical solution was to just write some quick and dirty regex processing my build script, but I figured it would be helpful to raise awareness of this across the ecosystem so that other people don't waste time trying to figure out how to generate the correct slugs.

So I'm thinking the README should say something like

> slug() will not necessarily work correctly on raw markdown text; instead you should pass the text the user would see on screen in GitHub (one way to get this is parse the markdown, find the header node, and serialize it with mdast-util-to-string)

But I imagine specific instructions might be necessary to fully process GitHub markdown?
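To make the proposed README wording concrete, here is a small sketch comparing a slug built from raw markdown with one built from the visible text, using the heading from earlier in this thread. `simpleSlug` is a hypothetical, simplified stand-in for a GitHub-style slugger (an assumption for illustration, not this library's implementation).

```javascript
// Simplified stand-in slugger: lowercase, drop most punctuation,
// replace whitespace with hyphens. Not the real github-slugger algorithm.
function simpleSlug(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s_-]/gu, '')
    .replace(/\s/g, '-');
}

// Raw markdown content of the heading (without the leading "# ").
const rawMarkdown = 'Usage with [`react`](https://reactjs.org)';
// The text a reader would see on screen for the same heading.
const visibleText = 'Usage with react';

const slugFromRaw = simpleSlug(rawMarkdown);
const slugFromVisible = simpleSlug(visibleText);

console.log(slugFromRaw);     // usage-with-reacthttpsreactjsorg
console.log(slugFromVisible); // usage-with-react
```

With raw markdown, the link URL leaks into the slug once its punctuation is removed, which is why the visible text is the safer input.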

fix: strip HTML entities like &gt;

2cc0a4b

jedwards1211 force-pushed the strip-html-entities branch from 65206e7 to 2cc0a4b Compare February 7, 2020 06:46

wooorm closed this Feb 7, 2020
