Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API docs are not being properly indexed for search #1432

Open
waylan opened this issue Jan 18, 2024 · 18 comments
Open

API docs are not being properly indexed for search #1432

waylan opened this issue Jan 18, 2024 · 18 comments
Labels
docs Related to the project documentation.

Comments

@waylan
Copy link
Member

waylan commented Jan 18, 2024

The default behavior is for search to use the heading of the relevant section as the title of the search result. However, the code spans within headings of the API docs are being HTML escaped in the results and many of the obvious headings are not being returned as results. I'm assuming the second issue is related to the first in that the correct text is not getting indexed. In other words, if only the plain text content of a heading was being indexed, then that would result in better search results.

Consider the following example. ESCAPED_CHARS is an instance attribute of the Markdown class. One might expect that its relevant section (ESCAPED_CHARS) would be returned in a search for the string ESCAPED_CHARS, but it is not. However, the method markdown.inlinepatterns.EscapeInlineProcessor.handleMatch is in the results because it mentions ESCAPED_CHARS in the body of its documentation. Yet, in the search result, the title is the text <code class="doc-symbol doc-symbol-toc doc-symbol-method"></code>&nbsp;handleMatch not handleMatch or markdown.inlinepatterns.EscapeInlineProcessor.handleMatch as one might expect. Frustratingly, the search term handleMatch does not return that result at all. Yet, it is in the results for the search term doc-symbol-method, which shouldn't even be indexed, as it is an HTML class assigned to the code span, not text.

@pawamoy do you have any insight into this? I have not yet looked at the code and am not sure how the mkdocsstrings extension passes its generated pages to search for indexing.

@waylan waylan added the docs Related to the project documentation. label Jan 18, 2024
@pawamoy
Copy link
Contributor

pawamoy commented Jan 18, 2024

We recently reached our first funding goal, and therefore released the show_symbol_type_heading option. By releasing it, it was activated in Python-Markdown's docs, because the option was already set to true. Previously it had no effect.

While we investigate this issue, you can set show_symbol_type_heading: false in mkdocs.yml 🙂

@waylan
Copy link
Member Author

waylan commented Jan 18, 2024

Okay, so the symbol type code span is being included in the index. I failed to recognize that as the body text of the code span wasn't included in the search result. But now I realize that the body text is added by CSS so it isn't in the HTML. Looks like the code element and the non-breaking space that follows it need to be removed before indexing then.

@pawamoy
Copy link
Contributor

pawamoy commented Jan 18, 2024

We could also try to somehow unescape the title before adding it to the page. IIUC escaping is done in the format_result function the search plugin's main.js script:

function formatResult (location, title, summary) {
  return '<article><h3><a href="' + joinUrl(base_url, location) + '">'+ escapeHtml(title) + '</a></h3><p>' + escapeHtml(summary) +'</p></article>';
}

That would at least fix the rendering issue in the search page.

@waylan
Copy link
Member Author

waylan commented Jan 18, 2024

While we investigate this issue, you can set show_symbol_type_heading: false in mkdocs.yml

Actually, this setting makes no difference. To be clear, changing the setting does remove the symbol type from the documentation, but it does not remove it from the search results. I expect that should be a clue as to where the problem is.

While testing this, I also noticed that any search result which returns the module (top heading in the page which included the module level docstring) also has the entire header (name of module) in escaped code span (for example, <code>markdown</code> for a search result pointing at the markdown module. There are no symbol types there and in fact this issue existed prior to the release of 1.8.

@pawamoy
Copy link
Contributor

pawamoy commented Jan 18, 2024

Oh, yes, when we don't separate signatures from the headings, then headings are the signatures, and they're wrapped in code tags. I suppose we could wrap them in spans instead (as you suggested in previous discussions), with the right CSS styling, but they would be escaped as <span ...>Markdown</span> once again?

About symbols still appearing in the search results, try also setting show_symbol_type_toc: false 🤷

@waylan
Copy link
Member Author

waylan commented Jan 19, 2024

About symbols still appearing in the search results, try also setting show_symbol_type_toc: false

That fixes the issue. The show_symbol_type_heading option has no bearing on the issue either way, only show_symbol_type_toc. And that makes sense. MkDocs uses the toc title of a page for the search index. Regardless, these code spans should not be making it into the search index regardless of the show_symbol_type_toc setting.

However, we still have the issue with the module titles. I should point out that all of the headings (for functions, classes, attributes, methods, etc) have code spans in them, which get stripped. But only the module titles do not get their tags stripped. The offending code span here is not the symbol type, but the code span which contains the module name.

@waylan
Copy link
Member Author

waylan commented Jan 22, 2024

I just deployed the docs with the changes in #1434 and search results are better (they are actually usable). However, we still have the issue with the module names not being striped of tags before being indexed. Therefore, this issue is still unresolved.

@waylan
Copy link
Member Author

waylan commented Jan 24, 2024

However, we still have the issue with the module names not being striped of tags before being indexed. Therefore, this issue is still unresolved.

Upon further investigation, it appears that this issue is a symptom of mkdocs/mkdocs#3357. MkDocs' search plugin gets its entries from 2 sources: (1) a page and (2) the sections of a page. The sections are obtained by iterating over the TOC for a page. The TOC already sanitizes section headings and therefore we don't have an issue with them (for the most part). However, until mkdocs/mkdocs#3357 is resolved, there is no guarantee that a page title is sanitized. Unfortunately, the search plugin assumes that it is. I could submit a change to mkdocs which removes that assumption (by explicitly sanitizing the title within the search plugin before indexing), but it would be more sensible to just fix mkdocs/mkdocs#3357 directly.

That said, there is still the issue with symbol types in the TOC. They should not be getting through the TOC sanitation. I wonder if there is a change to mkdocsstrings which could work around the issue entirely. That needs further investigation. It is also possible that the TOC extension itself needs better sanitation. I'm leaving this open until we can get definitive answers on those matters.

@waylan
Copy link
Member Author

waylan commented Jan 24, 2024

It is also possible that the TOC extension itself needs better sanitation.

After I posted that I remembered that the type symbols are added by mkdocsstrings after Markdown is done with the TOC. Therefore, changes to the TOC extension would have no effect. I wonder if perhaps mkdocsstrings needs to take a different approach here so as not to dirty the TOC and page title. And that is what I want to investigate.

@waylan
Copy link
Member Author

waylan commented Jan 26, 2024

we still have the issue with the module names not being striped of tags before being indexed.

After some investigation, it is now clear that the as-yet unresolved problem lies with MkDocs itself. See mkdocs/mkdocs#3357, which is specifically about ensuring page titles do not contain markup. When that is resolved, markup will not be passed to the search index for whole-page entries.

Also of note is mkdocs/mkdocs#3560. If that (or something similar) gets merged, then we can turn show_symbol_type_toc back on without ill effects. I am leaving this open as a reminder about that.

@pawamoy
Copy link
Contributor

pawamoy commented Jan 26, 2024

Great investigation and PR explanation @waylan 👌

I wonder if perhaps mkdocsstrings needs to take a different approach here so as not to dirty the TOC and page title.

I'm open to any suggestion! However IIUC this will be solved thanks to the two PRs mentioned above?

However, we still have the issue with the module names not being striped of tags before being indexed.

The <code> tag around module names (H1 heading in each generated file) is set by the gen_ref_nav.py script:

nav_parts = [f"<code>{part}</code>" for part in parts]

You can try to remove the code tag from there, but it will also impact page headings.

@waylan
Copy link
Member Author

waylan commented Jan 29, 2024

I wonder if perhaps mkdocsstrings needs to take a different approach here so as not to dirty the TOC and page title.

I'm open to any suggestion! However IIUC this will be solved thanks to the two PRs mentioned above?

There are a few different reasons why I say that.

First of all, we don't know yet what the fix for mkdocs/mkdocs#3357 will be. It could be complex and allow preservation of markup in a page title or it could simply strip all markup from the page title. If it is the later, then those code tags will get removed even in places where we want them to stay.

And that leads me to the bigger issue. Where is the proper location to be wrapping content in tags? I think ideally the answer should be the HTML templates (I don't mean mkdocstrings-python's templates as they are rendering Markdown, not HTML). To be clear, the page body is understood to contain HTML when passed to a template. However, everything else is assumed to be plain text. In other words, when a page template is rendering the page's TOC, it is iterating over a collection of plain text strings and rendering each into a TOC item. This is where the various tags and symbols should get inserted (by the template), They should not be part of the text passed to the template as they are now.

Of course, the problem is how to get a plugin to do that. Yes, plugins have access to and can modify the Jinja env, which would allow a plugin to override/replace/inherit a theme's templates. But that falls down when the user provides their own template overrides. And even if the user-templates-issue was resolved, a plugin would presumably need to provide a different set of templates for every supported theme out there, which is not practical.

So, in the end, I don't know what the solution should be. I'm just pointing out that this will continue to be an issue to some degree or another so long as Python objects continue to get markup inserted into them by plugins.

The <code> tag around module names (H1 heading in each generated file) is set by the gen_ref_nav.py script:

nav_parts = [f"<code>{part}</code>" for part in parts]

You can try to remove the code tag from there, but it will also impact page headings.

Yes, that's the problem. I want the code tags where I want them, but not where I don't. The question is how to preserve them where we want them, but to exclude them where we don't. It all depends on what solution mkdocs takes.

@oprypin
Copy link
Contributor

oprypin commented Jan 31, 2024

According to the direction that the issue is moving to...

Putting html into the nav value may become unsupported

https://github.com/mkdocs/regressions/actions/runs/7733622178/job/21087001738

But I'll need to keep thinking on this

@oprypin
Copy link
Contributor

oprypin commented Feb 1, 2024

No actually the nav option currently allows HTML, so that will have to stay

@oprypin
Copy link
Contributor

oprypin commented Feb 2, 2024

This is off-topic here but I don't know any other thread that talks about this f"<code blah>{part}</code>" usage.

It is currently dumping garbage such as the following into the HTML when using mkdocs-material:

 <a aria-label='Previous: &lt;code class="doc-symbol doc-symbol-nav doc-symbol-module"&gt;&lt;/code&gt; inventory' class="md-footer__link md-footer__link--prev" href="../inventory/">

And with upcoming changes to MkDocs it will dump slightly different garbage into there :D

@oprypin
Copy link
Contributor

oprypin commented Feb 2, 2024

Made a further post about it mkdocs/mkdocs#3357 (comment)

@pawamoy
Copy link
Contributor

pawamoy commented Feb 2, 2024

I think we had a discussion about this with @squidfunk when I was implementing the feature in mkdocstrings. I think it led to the use of |striptags in the Material for MkDocs templates responsible for rendering the ToC (because my code symbols caused visible issues there), but I didn't think about the footer (and didn't notice anything about it).

@waylan
Copy link
Member Author

waylan commented Feb 2, 2024

I don't know any other thread that talks about this f"<code blah>{part}</code>" usage.

@oprypin, for context, this is all about Python-Markdown using mkdocstrings for its API documentation. The <code>{part}</code is in a mkdocstrings template which mkdocstrings uses to render its output as generated MkDocs Page objects. Those Page objects then then passed on to MkDocs templating system. For reference, see our site map, under API Reference you will see links to various pages for the different submodules of the library, The label on each link comes from the page title (the page map is generated from this template). In the mkdocstrings template, each title is generated with <code>{{ modulename }}</code> and that is assigned to the Page.title of the MkDocs object . Note that that includes the markup, which I absolutely insist needs to remain. Neither stripping tags nor escaping would be appropriate here (in the sitemap). However, on a specific page, in the <title> tag (among other locations), then the page title absolutely needs to have tags stripped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Related to the project documentation.
Projects
None yet
Development

No branches or pull requests

3 participants