Canonical urls for deduplication of google results in rustdoc #9461

Closed
Seldaek opened this issue Sep 24, 2013 · 24 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@Seldaek
Contributor

Seldaek commented Sep 24, 2013

When multiple versions of the documentation are available, it tends to pollute google results. As a way to prevent that, it would be good to always have the latest stable release available under /current/, and have all previous versions + the master docs contain canonical links to the current docs like:

<link rel="canonical" href="http://.../current/..." />

That way it consolidates all results under the current URL which will always be correct, and it also encourages people linking to docs in blog posts and such to use links that will not rot.
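As a rough sketch of the rewrite being proposed: the helper below maps a versioned page path onto the same page under /current/ and wraps it in a canonical link tag. The function name, the `/<version>/<page>` path layout, and the example base URL are assumptions for illustration only, not anything rustdoc does today, and the inputs are assumed to need no HTML escaping.

```rust
/// Build a `<link rel="canonical">` tag that points a versioned docs page
/// at the same page under `/current/`.
///
/// Hypothetical helper: `base` and the `/<version>/<page path>` layout are
/// assumptions made for this sketch.
fn canonical_link(base: &str, versioned_path: &str) -> Option<String> {
    // Drop the leading slash, then split off the first segment (the version).
    let trimmed = versioned_path.trim_start_matches('/');
    let (_version, page) = trimmed.split_once('/')?;
    Some(format!(
        "<link rel=\"canonical\" href=\"{}/current/{}\" />",
        base.trim_end_matches('/'),
        page
    ))
}
```

A path with no version segment yields `None`, so no tag would be emitted for it.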

/cc @alexcrichton

@alexcrichton
Member

Do you know how search engines handle situations where pages go away or pages are just created? In theory old documentation could refer to a canonical location which no longer exists (if the module were removed), and new documentation could refer to canonical locations which do not yet exist (because they're newly added modules).

Do you know of special attributes to handle these cases?

@thestinger
Contributor

If a module is removed, a 404 is correct. In theory it would be better to redirect them on renames but it's not going to be possible because it's not tracked.

The point of a canonical URL is to say that the page is only a non-canonical version of another URL and shouldn't show up separately in search. When we eventually have supported versions, the newest release (or master) can be given as the canonical one so the older pages won't clutter search results but will be available via a drop-down menu.

Of course, if the newer version does not have the module, you would have to omit stating it is the canonical URL - meaning you need to regenerate the old documentation every time you do the new ones. I don't think it's worth the complexity.
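To make the cost concrete, here is a minimal sketch of the check this would require: the old docs can only claim a canonical URL for pages that still exist in the newest docs, so the set of current pages has to be known each time the old docs are generated. The function name and the idea of passing the page set as a `HashSet` are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Return the canonical URL for `page` only if that page still exists in
/// the newest docs; otherwise return `None` so no canonical link is emitted.
///
/// Hypothetical helper: `current_pages` stands in for whatever listing of
/// the newest docs a real build would have to produce.
fn canonical_href(current_pages: &HashSet<String>, base: &str, page: &str) -> Option<String> {
    if current_pages.contains(page) {
        Some(format!("{}/{}", base.trim_end_matches('/'), page))
    } else {
        // The module was removed or renamed: claiming a canonical URL here
        // would point at a 404.
        None
    }
}
```

Because `current_pages` changes with every release, every older docs tree would need to be rebuilt against it, which is the complexity being objected to.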

@thestinger
Contributor

FWIW I think we should only have documentation on the site for releases we still support. Until we get to 1.0, we can make an exception for the last 0.x snapshot :).

@chris-morgan
Member

When a module is removed, 404 is indeed correct, but just remember that that's not the end of the story, as I wrote about recently at http://chrismorgan.info/blog/github-links-case-study.html.

What the Django docs do is worth considering: https://docs.djangoproject.com/. It makes it easy to switch between versions and shows a warning banner for the development build suggesting you may want to look at the latest stable instead. They don't, however, have a banner reminding you "this isn't the latest stable version" for old versions, which continues to surprise me a little. I reckon old versions (though not before 1.0 after a while) should stay in existence but with a banner at the top indicating that this is an unsupported release, and docs for the latest version, X.Y, are available in such-and-such a place. Of course, these things become much more directly applicable once we get to 1.0 and beyond.

@alexcrichton I guess in the no-longer-exists case you'd need to either implement something so that you can conveniently reprocess the old docs, or do a little bit of post-processing to fix the "errors". For the doesn't-yet-exist case, checking online or comparing crates (which sounds risky) would be the only real ways, I suppose.

@steveklabnik
Member

Triage: no change.

@steveklabnik
Member

Triage: no changes

@SamWhited
Contributor

SamWhited commented Jan 22, 2017

(sorry for the duplicate; moving relevant link here)

See also: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Choosing_between_www_and_non-www_URLs#Using_%3Clink_relcanonical%3E

This page from Google's help center [1] appears to suggest that they only use the canonical URL as a hint. While this doesn't explicitly say it, this seems to concur with behavior I've seen in the past where if the page pointed to by the canonical URL is a 404, Google simply uses the original URL (which I suspect is what we want in this case since it makes it easy: point canonical URLs to the /stable path and if the module is deleted it doesn't really matter).

@sanmai-NL

sanmai-NL commented Mar 2, 2017

I would like to re-open discussion on this issue (@steveklabnik, I think you would be the one to ping).

With https://docs.rs/ in place, I think all rustdoc documentation for public crates should provide a canonical link to the appropriate documentation there. Why? Search for any popular crate on Google and you get a litter of confusing, often outdated self-hosted versions of the docs. This may lead the programmer to accidentally study outdated publications of documentation (e.g. when not depending on a specific version of a library), and use and perhaps even bookmark publications that may be partially broken or are not as continuously available as on https://docs.rs/.

Google's former SEO representative has indicated that Google may disregard canonical links that result in 404 HTTP response codes. Since Google is by far the most used search engine, and it addressed this issue in a sensible way already, I personally take little issue with the possibility of 404 canonical links.

Here's my logic:

  • Only published open-source code has a need for public, search engine accessible documentation of the current kind rustdoc produces.
  • The only form of ‘publication’ of Rust crate source code that the Rust project supports is via https://crates.io and https://docs.rs/.
  • Therefore, if a crate is to be published, it should be published on https://crates.io/ and its docs should be published on https://docs.rs/ and this should be the reference/canonical publication.
  • The reference/canonical publication of a crate should suppress other publications (e.g. self-hosted) on search engine results pages for the reasons I gave in my first paragraph.
  • Thus, docs for crates can be expected to be published on https://docs.rs, adding a canonical link that claims so is appropriate, and if not, that isn't a blocking issue for the rustdoc maintainer(s).

@steveklabnik
Member

> I would like to re-open discussion on this issue (@steveklabnik, I think you would be the one to ping).

No need to re-open anything 😄 It's an open issue.

> With https://docs.rs/ in place, I think all rustdoc documentation for public crates should provide a canonical link to the appropriate documentation there

That would be nice, but without some improvements to docs.rs, it's not feasible. There are several people who do extra things to make their docs nicer and explicitly don't want their docs hosted on docs.rs at all.

@sanmai-NL

Interesting, could you point me to examples of what those people do?

@steveklabnik
Member

I believe @briansmith, @retep998 , and @bluss are three of those people?

@retep998
Member

retep998 commented Mar 2, 2017

I'm perfectly fine with my docs being hosted on docs.rs. I just haven't actually published a new winapi since docs.rs gained Windows support. There are a few features I'm waiting for, like the ability to specify the default target to show docs for and which cargo features to enable. But once that's all set, I'd much rather use docs.rs than have to deal with rustdoc generating a hundred thousand files and then committing them to git and pushing them (which is a really slow process).

There are a few copies of winapi documentation floating around the internet from other people's personal project documentation being published, and I really wish they didn't exist because they interfere with search results. Sometimes I'll look up some obscure Windows function and the only results will be someone's rustdoc-generated documentation that happens to include winapi.

@bluss
Member

bluss commented Mar 4, 2017

@sanmai-NL A crate needs to be compiled to generate its docs, and the dependencies might not be present on docs.rs's builders, nor is there yet any way to indicate what dependencies to use.

My crates should all have migrated their docs to docs.rs, except ndarray. ndarray has lots of optional crate features and I want their items to be visible in the docs (and such items are marked in their doc string). It's not a big thing, but ndarray's docs are therefore technically superior outside docs.rs. It also has blue boxes for example code, which is obviously nicer to the eye 😉

And by the way, here's a group of crates where an author has done an amazing job with non-docs.rs docs http://nalgebra.org/

@sanmai-NL

Thanks for the comments. I deduce two extra issues. First, https://docs.rs should provide complete documentation and it should combine well with features and optional dependencies, and it seems not to. Secondly, sometimes https://docs.rs docs should not be the canonical variant anyway. IMO, it should be canonical by default, and this may be overridden with some configuration setting coming with the source tree.

@bluss
Member

bluss commented Mar 6, 2017

onur/docs.rs/pull/73 can fix some of these issues

@sanmai-NL

sanmai-NL commented Mar 7, 2017

The optional configuration setting may be a string that is a URL to the canonical API docs.
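A minimal sketch of that default-plus-override behaviour, assuming the setting is read from somewhere and arrives as an optional string. The function name, the parameter names, and the choice of `https://docs.rs/<crate>` as the default target are illustrative assumptions; no such setting exists in rustdoc or Cargo as far as this thread states.

```rust
/// Pick the canonical docs URL for a crate: an explicit override wins,
/// otherwise fall back to the crate's page on docs.rs.
///
/// Hypothetical helper: `override_url` stands in for the proposed
/// configuration setting, which is only an idea in this discussion.
fn canonical_docs_url(crate_name: &str, override_url: Option<&str>) -> String {
    match override_url {
        // The crate author declared where the canonical docs live.
        Some(url) => url.to_string(),
        // No setting present: treat docs.rs as canonical by default.
        None => format!("https://docs.rs/{}", crate_name),
    }
}
```

With this shape, a project that hosts its own docs would set the override once, and everything else would fall through to docs.rs.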

@briansmith
Contributor

briansmith commented Mar 19, 2017

A while back, I filed https://github.com/onur/docs.rs/issues/74 to have docs.rs include the canonical link, and @onur committed at least one change towards making that happen.

https://github.com/onur/docs.rs/issues/73 will help a lot with the current main concrete problem with docs.rs. In the meantime I added a note to my documentation: “IMPORTANT: If you are reading this on docs.rs or another third-party site, you may not be seeing the complete documentation due to their limitations. Read it at https://briansmith.org/rustdoc/ring/signature/ instead.”

@skade
Contributor

skade commented May 10, 2017

I see the problem that projects may want their official doc pages as the canonical page. Making docs.rs the canonical URL by default might give credit where no credit is due.

A stopgap would be a noindex tag for dependencies (#41882).
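For illustration, such a stopgap could look roughly like the sketch below: pages for the crate being documented are left alone, while pages generated for its dependencies get a robots meta tag asking search engines not to index them. The function name and the way a "dependency" is detected (by comparing crate names) are assumptions for this sketch; what #41882 actually proposes may differ.

```rust
/// Return a `noindex` robots meta tag for pages that belong to a
/// dependency rather than to the crate whose docs are being published.
///
/// Hypothetical helper: comparing names against `root_crate` is a
/// simplification chosen for this sketch.
fn robots_meta(crate_name: &str, root_crate: &str) -> Option<&'static str> {
    if crate_name != root_crate {
        // Dependency docs: ask search engines to skip this copy.
        Some("<meta name=\"robots\" content=\"noindex\">")
    } else {
        // The project's own docs stay indexable.
        None
    }
}
```

This would hide third-party copies of, say, winapi docs from search results without having to decide which copy is canonical.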

@steveklabnik steveklabnik added T-dev-tools Relevant to the dev-tools subteam, which will review and decide on the PR/issue. and removed T-tools labels May 18, 2017
@Mark-Simulacrum Mark-Simulacrum added C-feature-request Category: A feature request, i.e: not implemented / a PR. and removed C-feature-request Category: A feature request, i.e: not implemented / a PR. labels Jul 19, 2017
@luser
Contributor

luser commented Mar 1, 2018

The docs.rs stuff feels like a separate discussion that could have its own issue. I think fixing the "every release version on doc.rust-lang.org shows up in Google search results" is a specific thing that's important to fix, and using <link rel="canonical"> to point at the stable docs sounds like the simplest fix.

@luser
Contributor

luser commented Mar 1, 2018

After poking around the rustdoc sources a little bit I have a concrete proposal. rustdoc already supports several options on the #[doc] attribute to control HTML output, such as html_favicon_url:
https://doc.rust-lang.org/rustdoc/the-doc-attribute.html#at-the-crate-level

We should add support for a html_canonical_base_url option, and add it to the crates that wind up as part of the std documentation like:
#![doc(html_canonical_base_url = "https://doc.rust-lang.org/stable/")]

It would be picked up and stored into the SharedContext along with the other attributes here:

// Crawl the crate attributes looking for attributes which control how we're

Callers of render would need to pass down the relative URL from the root for a page, possibly as a member of Page itself:

pub struct Page<'a> {

This seems to mostly be useful for Context::item, which constructs on-disk paths:

fn item<F>(&mut self, item: clean::Item, mut f: F) -> Result<(), Error> where

which calls Context::render_item:

fn render_item(&self,

which calls layout::render. This might require some variation of format::href, which currently generates URLs relative to the current URL in order to generate URLs relative to the base URL:

pub fn href(did: DefId) -> Option<(String, ItemType, Vec<String>)> {

Then finally, layout::render could join the base canonical URL, if present, with the relative URL to the page and include a <link rel="canonical" href="{canonical_url}">.
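The final step could be sketched as follows, assuming the base URL from the proposed html_canonical_base_url option arrives as an `Option<&str>` and the page's path relative to the docs root is already known. The function and parameter names are hypothetical and are not taken from rustdoc's sources; the inputs are assumed to need no HTML escaping.

```rust
/// Join the optional canonical base URL with a page's root-relative path
/// and render the resulting `<link rel="canonical">` tag.
///
/// Hypothetical helper: returns an empty string when no base URL was
/// configured, so the page template can include the result unconditionally.
fn render_canonical_tag(canonical_base: Option<&str>, relative_path: &str) -> String {
    match canonical_base {
        Some(base) => {
            // Normalize slashes so the join produces exactly one separator.
            let url = format!(
                "{}/{}",
                base.trim_end_matches('/'),
                relative_path.trim_start_matches('/')
            );
            format!("<link rel=\"canonical\" href=\"{}\">", url)
        }
        None => String::new(),
    }
}
```

Crates that never set the option would produce the same HTML as before, which keeps the change opt-in.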

@steveklabnik
Member

Triage: there has been some small movement; by now, this issue is getting larger and larger, and is affecting more and more people. I hope to have a plan sometime in the near-ish future; we'll see.

@pietroalbini
Member

By the way, the issue with doc.rust-lang.org has been fixed, as we now have a robots.txt in place.

@jsha
Contributor

jsha commented Nov 30, 2021

I propose closing as a duplicate of rust-lang/docs.rs#1438.

@ehuss ehuss removed the T-dev-tools Relevant to the dev-tools subteam, which will review and decide on the PR/issue. label Jan 18, 2022
@workingjubilee
Member

Closing as this seems fixed/taken on by docs.rs
