optimization: tokenize HTML or process textually entirely #15

Open
stapelberg opened this issue Jan 15, 2017 · 8 comments

Comments

@stapelberg
Contributor

Tokenizing the HTML (instead of fully parsing it) shaves about 1 minute off a 6-minute rendering of Debian unstable.

The code is not entirely straightforward to port because of the HTML-tag-agnostic cross-reference detection (e.g. for <i>crontab</i>(5)), which requires us to keep state after all.

If we could improve mandoc’s cross-reference detection and id generation, we could probably get away with processing the HTML textually, which has the potential to shave off another 30 seconds.

@stapelberg
Contributor Author

stapelberg commented Jan 15, 2017

Another interesting measurement: with -concurrency_render=20, peak memory usage during conversion is reduced by about 150 MB when using HTML tokenization instead of HTML parsing.

@stapelberg
Contributor Author

I decided to not work on this for the time being, unless it becomes a blocker for anything. Help welcome :).

@lahwaacz

The post-processing for cross-reference detection is necessary only for man pages written in the old man(7) language, which is not semantic: references are usually written with the .BR or .IR macros. I think it should really be improved in mandoc itself, also as a way of working on #56.

Until then, you can probably detect if the manual is written in man(7) or mdoc(7) and post-process only the first case 😉
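
One way to make that distinction, as a minimal Go sketch: mdoc(7) pages open with a .Dd macro and man(7) pages with .TH, so looking at the first such macro is assumed to be enough here. The function name detectMacroPackage is hypothetical and not part of the project.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// detectMacroPackage returns "mdoc", "man" or "" (unknown) for roff source.
// Assumption: the first .Dd or .TH macro decides the macro package.
func detectMacroPackage(src string) string {
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // blank line
		}
		switch fields[0] {
		case ".Dd":
			return "mdoc"
		case ".TH":
			return "man"
		}
		// Comments and other requests are skipped.
	}
	return ""
}

func main() {
	fmt.Println(detectMacroPackage(".\\\" a comment\n.TH CRONTAB 5\n.SH NAME\n"))
	fmt.Println(detectMacroPackage(".Dd January 15, 2017\n.Dt MANDOC 1\n.Os\n"))
}
```

Only pages for which this returns "man" would then need the cross-reference post-processing.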

@stapelberg
Contributor Author

That’s orthogonal to the issue in this ticket, I think: we do our own cross-referencing for internationalization.

@lahwaacz

lahwaacz commented Aug 27, 2017

Could you elaborate on what else is necessary to post-process besides the example <i>crontab</i>(5) above? And where does the internationalization part come in?

@stapelberg
Contributor Author

Have a look at

func postprocess(resolve func(ref string) string, n *html.Node, toc *[]string) error {

Post-processing consists of 3 steps:

  1. We strip <html>, <head> and <body> tags because we’re inserting the resulting HTML into an existing document.
  2. We set IDs for each heading. I know that mandoc ≥ 1.14.2 does this as well, but unfortunately with a slightly different algorithm than we use, so we need to keep ours in order to not break existing links.
  3. We find cross-references and URLs and turn them into links.

Notably, step 3 finds cross-references even if they include formatting directives (such as the italic tag in the example).
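
To illustrate what a purely textual version of the cross-reference step could look like, here is a rough Go sketch using a regular expression instead of an HTML tree. It is not the project's implementation: the pattern, the function name linkXrefs and the "name(section)" form passed to resolve are assumptions made for this example, and only the resolve signature is taken from the postprocess function above.

```go
package main

import (
	"fmt"
	"regexp"
)

// xrefRe matches name(section), optionally with <i> or <b> around the name.
// Go's regexp has no backreferences, so the opening and closing tags are
// not checked against each other.
var xrefRe = regexp.MustCompile(
	`(?:<[ib]>)?([A-Za-z0-9][A-Za-z0-9_.:+-]*)(?:</[ib]>)?\(([1-9][a-z0-9]*)\)`)

// linkXrefs wraps every resolvable cross-reference in an <a> element.
// resolve returns "" when no target page exists (an assumption).
func linkXrefs(fragment string, resolve func(ref string) string) string {
	return xrefRe.ReplaceAllStringFunc(fragment, func(m string) string {
		sub := xrefRe.FindStringSubmatch(m)
		href := resolve(sub[1] + "(" + sub[2] + ")")
		if href == "" {
			return m // leave unknown references untouched
		}
		return `<a href="` + href + `">` + m + `</a>`
	})
}

func main() {
	// Hypothetical resolver that knows two pages.
	resolve := func(ref string) string {
		known := map[string]string{
			"crontab(5)": "/crontab.5.html",
			"cron(8)":    "/cron.8.html",
		}
		return known[ref]
	}
	fmt.Println(linkXrefs("see <i>crontab</i>(5) and cron(8)", resolve))
}
```

A sketch like this would also match text inside existing links or attributes, which is part of why keeping state during tokenization is harder to avoid than it first appears.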

Internationalization in this context means linking to the best language match for the target, as viewed from the source. For example, if the user is browsing manpages in Danish, but the target is only available in Norwegian and English, then we link to the Norwegian version. However, if the target is only available in, say, Italian and English, we’d link to the English version.

mandoc doesn’t know which manpages are available in which language (at least in the way we’re invoking it), so doing language matching when cross-referencing is out of scope for mandoc, I think.
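
The language matching described above can be sketched in Go as follows. This is only an illustration: the fallbacks table, the language codes and the function name bestLanguage are hypothetical, and a real implementation might rely on a library matcher such as golang.org/x/text/language instead.

```go
package main

import "fmt"

// fallbacks is a hypothetical preference table, not data from the project:
// for each source language, related languages to try before English.
var fallbacks = map[string][]string{
	"da": {"nb", "sv"},
	"nb": {"da", "sv"},
}

// bestLanguage picks the language to link to for a target page that is
// available in the given languages. It returns "" if nothing fits.
func bestLanguage(source string, available []string) string {
	have := make(map[string]bool, len(available))
	for _, l := range available {
		have[l] = true
	}
	if have[source] {
		return source // exact match wins
	}
	for _, l := range fallbacks[source] {
		if have[l] {
			return l // closely related language
		}
	}
	if have["en"] {
		return "en" // last resort
	}
	return ""
}

func main() {
	fmt.Println(bestLanguage("da", []string{"nb", "en"})) // Danish reader, Norwegian target
	fmt.Println(bestLanguage("da", []string{"it", "en"})) // Danish reader, English target
}
```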

@lahwaacz

lahwaacz commented Aug 27, 2017

  1. You're running mandoc with -Ofragment, so stripping <html>, <head> and <body> again should be useless:
    cmd := exec.Command("mandoc", "-Ofragment", "-Thtml")
  2. I was actually running an older version, thanks for pointing this out!
  3. I admit that post-processing is indeed necessary to get cross-language links on a static site, thanks for the description. Though if mandoc were improved to handle <i>crontab</i>(5) etc., you could pass -O man=/something/definitely/unique/%N.%S.html to mandoc and do just a (probably much simpler) replacement on the <a href="..."> tags.
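
The replacement suggested in point 3 might look roughly like this in Go. It is a sketch under assumptions: the function name rewriteHrefs is hypothetical, the "name(section)" form passed to resolve is a guess, and it assumes mandoc writes the expanded -O man= template into the href attribute unescaped.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// prefix is the placeholder passed to mandoc via -O man=...
const prefix = "/something/definitely/unique/"

// hrefRe captures the name (%N) and section (%S) from a placeholder href.
var hrefRe = regexp.MustCompile(
	`href="` + regexp.QuoteMeta(prefix) + `([^"]+)\.([^."]+)\.html"`)

// rewriteHrefs swaps placeholder hrefs for the resolved target.
// resolve returns "" when no target page exists (an assumption).
func rewriteHrefs(fragment string, resolve func(ref string) string) string {
	return hrefRe.ReplaceAllStringFunc(fragment, func(m string) string {
		sub := hrefRe.FindStringSubmatch(m)
		target := resolve(sub[1] + "(" + sub[2] + ")")
		if target == "" {
			return m // unresolved: a real tool would drop the link instead
		}
		return `href="` + html.EscapeString(target) + `"`
	})
}

func main() {
	// Hypothetical resolver that sends a Danish reader to a Danish page.
	resolve := func(ref string) string {
		if ref == "crontab(5)" {
			return "/man/da/crontab.5.html"
		}
		return ""
	}
	in := `<a class="Xr" href="` + prefix + `crontab.5.html">crontab(5)</a>`
	fmt.Println(rewriteHrefs(in, resolve))
}
```

Because the placeholder prefix is unique, the replacement needs no HTML parsing at all, which is the appeal of this approach.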

@stapelberg
Contributor Author

Fair point, the stripping must be a remnant of when we didn’t use -Ofragment. We should remove it eventually (pull requests welcome!)
