Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: strip HTML entities like > #27

Closed
wants to merge 1 commit into from

Conversation

jedwards1211
Copy link

@jedwards1211 jedwards1211 commented Feb 7, 2020

I discovered in my repo https://github.com/vscodeshift/material-ui-snippets that HTML entities like > get stripped out of slugs. Right now in this and other libs, <div> gets converted to ltdivgt, but the actual GitHub slug is just div.

@wooorm
Copy link
Collaborator

wooorm commented Feb 7, 2020

This can (and should) be solved on your side, by first decoding html entities: &lt;div&gt; -> <div>, then passing it to this library, which will give just div.

I personally maintain parse-entities that does this, but there are many more libraries that do that too, which you could use.

This library focusses on slug generation, not the Markdown or HTML parsing. So I’d say it’s out of scope.

@wooorm wooorm closed this Feb 7, 2020
@jedwards1211
Copy link
Author

Doesn't make sense...

The overall goal of this package is to emulate the way GitHub handles generating markdown heading anchors as close as possible.

This implies that one should be able to pass the exact text of a markdown heading to this lib

@jedwards1211
Copy link
Author

That said, I was unaware that there could be entities that don't end with ;

@jedwards1211
Copy link
Author

jedwards1211 commented Feb 7, 2020

I walk my comment back I guess. But I'm a bit confused what would count as "parsed" input to pass to this library. For instance if there is a header with a link and backticks like

# Usage with [`react`](https://reactjs.org)

What should the "parsed" input passed to this library look like?

@wooorm
Copy link
Collaborator

wooorm commented Feb 7, 2020

Parsed input in that case is Usage with react.

If you want to parse markdown, see for example remark, or other libraries.

@jedwards1211
Copy link
Author

jedwards1211 commented Feb 7, 2020

I would expect the output of that parser to be a syntax tree, even within the header node I would expect the link to be a subnode? Not just simple string with all the control characters stripped away? Hence my confusion

@wooorm
Copy link
Collaborator

wooorm commented Feb 7, 2020

Wait, I don’t get it, which parser should be a syntax tree but isn’t? Where are links stripped?

@jedwards1211
Copy link
Author

jedwards1211 commented Feb 7, 2020

No I mean the parser does output a syntax tree, as expected.

remark-parse — Parse Markdown documents to syntax trees

https://github.com/syntax-tree/mdast#heading

It's really not obvious how I would get a simple string like Usage with react out of this.

@wooorm
Copy link
Collaborator

wooorm commented Feb 7, 2020

Look, sorry, but I have no clue what you want to do or how to help you, but a) this project takes text and turns it into a slug, b) remark is an advanced system of hundreds of projects to work with markdown.

remark may be to advanced for your use case; I don’t know what your use case is.

@jedwards1211
Copy link
Author

jedwards1211 commented Feb 7, 2020

No worries...My use case is just that I'm autogenerating markdown and was looking into using this lib to generate slugs instead of my own custom code.
I just wound up confused what this library is really intended for since there's little clarity on whether anything exists that processes the header into the basic text input you're expecting in this lib. Adding to the confusion is the fact that some special characters inside the header like do get sanitized by this lib, by accident I guess, which gave the impression it was meant to take the raw header.

I had also used markdown-toc to generate some tables of contents but I'm not sure it does full blown parsing either because it doesn't strip HTML entities out of the slugs it generates either.

I guess basically the text we pass to this lib should be the same text we would get if we select the header and copy in compiled HTML? Out of curiosity do you know if there's anything within remark that actually spits out that text?

@jedwards1211
Copy link
Author

jedwards1211 commented Feb 7, 2020

@wooorm okay I discovered that Facebook's docusaurus library, which uses github-slugger, doesn't generate the correct slugs for headings with entities either, so sounds like we should better document the form of input this lib expects, since it's popular.

I added this header to one of their test fixtures:

# &lt;div&gt; test

Here in the failing test you can see that the entities don't get removed:

   - Snapshot
    + Received

    @@ -100,9 +100,9 @@
          "rawContent": "bar",
        },
        Object {
          "children": Array [],
          "content": "&lt;div&gt; test",
    -     "hashLink": "div-test",
    +     "hashLink": "ltdivgt-test",
          "rawContent": "&lt;div&gt; test",
        },
      ]

If even a major Facebook project has the same misconceptions I did about how to use github-slugger, it's probably not obvious enough.

@wooorm
Copy link
Collaborator

wooorm commented Feb 17, 2020

It's […] not obvious how I would get a […] string […] out of this.

First, find a node, then serialize it with mdast-util-to-string

My use case is […] generating markdown

The whole unified/remark/mdast/unist is really good at that

I […] wound up confused what this library is […] whether anything exists that processes the header into […] text […]

To clarify: no, it does not do that; as the docs do not mention it, I would assume it does not exist.
Feel free to suggest how your question could be clarified in the readme, PRs are welcome.

[…] some special characters inside the header […] do get sanitized by this lib […]

That’s not an accident, that’s the exact behavior of GitHub’s slugging mechanism and the goal of this project.

I […] used markdown-toc to generate some tables of contents but […] it doesn't strip HTML entities out of the slugs it generates either.

rehype-slug and rehype-toc do

[…] we pass […] the same text we would get if we select the header and copy in compiled HTML?

It’s closer to node.textContent

Out of curiosity do you know if there's anything within remark that actually spits out that text?

Could you clarify what you want to do?

If even a major Facebook project has the same misconceptions I did about how to use github-slugger, it's probably not obvious enough.

Please raise an issue over there. It may be fixed in docusaurus@2 already. Btw, docusaurus maintainers and unified maintainers do talk about stuff like this.

@jedwards1211
Copy link
Author

jedwards1211 commented Feb 17, 2020

Could you clarify what you want to do?

Convert the raw markdown header into what string will guarantee github-slugger outputs the same slug as GitHub (assuming we're fully aware of GitHub's behavior). Is "the text the user would see on screen" an accurate term for the input format?

I'm just trying to figure out how we can document this clearly so that other people won't have the same confusion I did. It's clear to you where github-slugger fits in this ecosystem of markdown tools because you work on a lot of them, but for someone writing a one-off script to output text to their README.md and thinking "lemme see if there's a quick and easy way to generate GitHub slugs for these headers", github-slugger looks like more of a one-stop shop than it is.

And also when I realized that HTML entities cause problems in downstream libraries (docusaurus) and also unrelated libraries (markdown-toc) I got the impression that there isn't enough awareness of the fact that tools should deal with HTML entities. For me the most practical solution was to just write some quick and dirty regex processing my build script, but I figured it would be helpful to raise awareness of this across the ecosystem so that other people don't waste time trying to figure out how to generate the correct slugs.

So I'm thinking the README should say something like

slug() will not necessarily work correctly on raw markdown text; instead you should pass the text the user would see on screen in GitHub (one way to get this is parse the markdown, find the header node, and serialize it with mdast-util-to-string)

But I imagine specific instructions might be necessary to fully process GitHub markdown?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants