Fix to match GitHub’s algorithm on unicode #38

Merged (1 commit) on Aug 24, 2021

@wooorm (Collaborator) commented Aug 22, 2021

I reverse-engineered GitHub’s slugging algorithm.
This is somewhat based on #25 and #35.

To do that, I created two scripts:

  • `generate-fixtures.mjs`, which generates a markdown file (in part from manual fixtures and in part from the Unicode General Categories), creates a gist from it, crawls the gist, removes it, and saves fixtures annotated with the expected result from GitHub
  • `generate-regex.mjs`, which generates the regex that GitHub uses for characters to ignore.
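To illustrate the idea behind the second script, here is a minimal sketch of how a list of ignored code points could be collapsed into a compact character class. This is not the actual `generate-regex.mjs`: the helper names (`toRanges`, `escapeCodePoint`, `toCharacterClass`) are hypothetical, the input list is made up, and the sketch uses the `u` flag for brevity even though a build that has to run everywhere might avoid it.

```javascript
// Hypothetical helper: collapse code points into inclusive [start, end] ranges.
function toRanges(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b)
  const ranges = []

  for (const point of sorted) {
    const last = ranges[ranges.length - 1]
    if (last && point === last[1] + 1) {
      last[1] = point // Extend the current range.
    } else {
      ranges.push([point, point]) // Start a new range.
    }
  }

  return ranges
}

// Hypothetical helper: write a code point as a `\u{...}` escape.
function escapeCodePoint(point) {
  return '\\u{' + point.toString(16).toUpperCase() + '}'
}

// Hypothetical helper: build a global regex matching any of the code points.
function toCharacterClass(codePoints) {
  const body = toRanges(codePoints)
    .map(([start, end]) =>
      start === end
        ? escapeCodePoint(start)
        : escapeCodePoint(start) + '-' + escapeCodePoint(end)
    )
    .join('')

  return new RegExp('[' + body + ']', 'gu')
}

// Made-up input: `!`, `"`, `#`, and the lozenge `◊` (U+25CA).
const ignored = toCharacterClass([0x21, 0x22, 0x23, 0x25ca])
```

With that made-up input, `'a!"#◊b'.replace(ignored, '')` leaves only the two letters. Storing ranges rather than individual characters is what keeps a generated regex like this small.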

The regex is about 2.5kb minzipped, which increases the file size of this project a bit, but matching GitHub is worth it in my opinion. I also investigated `\p{}` property classes in `/u` regexes. They work mostly fine, with two caveats: a) they don’t work everywhere, so using them would require a major release, and b) GitHub does not implement the same Unicode version as browsers: I tested with Unicode 13 and 14, and they include characters that GitHub handles differently. In the end, GitHub’s algorithm is mostly fine: strip non-alphanumeric characters, allow `-`, and turn ` ` (space) into `-`.
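The three steps in that last sentence can be sketched with the `\p{}` classes mentioned above. This is only an approximation under stated assumptions, not the regex this PR generates: the choice of property classes (`L`, `M`, `N`, `Pc`) and the lowercasing step are my assumptions, the name `slugSketch` is hypothetical, and the result depends on the Unicode version of the JavaScript engine, so it will not match GitHub for every character.

```javascript
// Assumption: "alphanumerical" is approximated as letters, marks, numbers,
// and connector punctuation (such as `_`). `-` and space are kept as well.
const notAllowed = /[^\p{L}\p{M}\p{N}\p{Pc}\- ]/gu

// Hypothetical name; approximates the algorithm described above.
function slugSketch(value) {
  return value
    .toLowerCase() // Assumption: lowercase first.
    .replace(notAllowed, '') // Strip everything that is not allowed.
    .replace(/ /g, '-') // Turn each space into a dash.
}
```

For example, `slugSketch('Hello, World!')` gives `'hello-world'`. Every space becomes its own dash, so stripping a character that sits between two spaces leaves two consecutive dashes.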

Finally, I removed the trim functionality, because GitHub does not implement it. To verify this, make a heading in a readme whose only content is a space written as a character reference (for example, `# &#x20;`). Because the space is encoded, markdown does not treat it as the whitespace between the `#` and the content; it becomes the content itself, and GitHub creates a slug of `-` for it.
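The difference that removing the trim makes can be shown with a small sketch. Both function names are hypothetical, and the stripping of ignored characters is left out so that only the whitespace handling is visible.

```javascript
// Hypothetical: slug without trimming, as described for GitHub above.
function slugWithoutTrim(value) {
  return value.toLowerCase().replace(/ /g, '-')
}

// Hypothetical: the same steps with trimming applied first.
function slugWithTrim(value) {
  return slugWithoutTrim(value.trim())
}
```

For heading content that is a single space, `slugWithoutTrim(' ')` gives `'-'`, matching what GitHub produces, while `slugWithTrim(' ')` gives an empty string.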

Further work: I think it would be nice to release this as is. Then, afterwards, I’d like to modernize the project, add GH Actions to generate the build, add types, and move to ESM.

/cc @Flet @jablko

Closes GH-22.
Closes GH-25.
Closes GH-35.

Co-authored-by: Dan Flettre <flettre@gmail.com>
Co-authored-by: Jack Bates <jack@nottheoilrig.com>

Linked issue that merging this pull request may close: “Lozenge ◊ not stripped?”