Make dataverse pages more discoverable by search engines #5605

Open
landreev opened this issue Mar 6, 2019 · 12 comments

@landreev
Contributor

landreev commented Mar 6, 2019

Changed the title of the issue, to indicate that the goal is to make individual dataverses more discoverable in google and other search engines.
Embedding structured metadata into the dataverse page is not necessarily the best way to achieve that.
It appears to be more practical to focus on improving the crawl rules (specifically, discouraging the bots from crawling the facets and paginated search results on dataverse pages), in combination with a sitemap that points the bots to all the datasets and dataverses directly.

(end update)

We already go to admirable lengths embedding some structured metadata (DC, schema.org) into our dataset pages, making individual datasets more discoverable.
It would benefit our dataverse pages to have similarly easily indexable metadata as well.

@landreev
Contributor Author

landreev commented Mar 6, 2019

As of now, many of our dataverse pages appear in google search results like this:
[Screenshot, 2019-03-06 1:52:10 pm]

@landreev
Contributor Author

landreev commented Mar 6, 2019

This problem (empty google search record) appears to be more common for the dataverse urls of the "/dataverse/NAME" format, than for the "/dataverse.xhtml?alias=NAME" one:
[Screenshot, 2019-03-06 1:55:07 pm]

This may, or may not be related to #3130 (trailing slash in the dataverse URL resulting in a 404). But even if this were the case, adding structured metadata to the dataverse page would still be very useful, and would make the process of getting indexed in the search engines more efficient.

@pdurbin
Member

pdurbin commented Mar 6, 2019

Huh. I'm surprised that the description of the dataverse isn't indexed. Ever since pull request #4879 was merged (a fix for #4468 which has a screenshot from Google search results), dataverses look much better when you link to them in Slack.

@landreev any thoughts on file landing pages? Would helping Google index them better be worth investigating, perhaps in a separate issue?

@landreev
Contributor Author

landreev commented Mar 6, 2019

As I said, this may simply be the result of that trailing slash issue.
File pages could be something to investigate separately in the future, yes. (as of now, we are telling the bots to stay away from file pages completely).

@landreev
Contributor Author

landreev commented Mar 6, 2019

Hmm, the "worldfish" dataverse has an empty record with the "alias=" URL:
[Screenshot, 2019-03-06 2:11:28 pm]

Then of course this may be a search result cached from before #4879 was merged. (I'm seeing that the bot has finally re-crawled this dataverse in the last few hours; so hopefully the updated entry will start appearing in searches shortly)

@pameyer
Contributor

pameyer commented Mar 6, 2019

It might be worth having only one URL resolve to a dataverse page. If I'm remembering correctly, at least some search engines recommend that if there are multiple URLs with the same content, then one should be marked as preferred with a rel=canonical link tag. Having /dataverse.xhtml?alias=foo redirect to /dataverse/foo (and removing generated links to /dataverse.xhtml?alias=foo) might help with this.
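
As a rough illustration of the redirect idea above, here is a minimal Python sketch that maps either URL form to the single /dataverse/<alias> path. The function name `canonical_dataverse_path` and the host `example.org` are made up for this example; this is not how Dataverse itself routes requests, only a sketch of the mapping under those assumptions.

```python
from typing import Optional
from urllib.parse import parse_qs, quote, urlsplit


def canonical_dataverse_path(url: str) -> Optional[str]:
    """Return the /dataverse/<alias> path for a dataverse URL, or None.

    Hypothetical helper: accepts both the "dataverse.xhtml?alias=NAME"
    form and the "/dataverse/NAME" form (with or without a trailing slash).
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path == "/dataverse.xhtml":
        aliases = parse_qs(parts.query).get("alias")
        if not aliases:
            return None
        alias = aliases[0]
    elif path.startswith("/dataverse/"):
        alias = path[len("/dataverse/"):]
    else:
        return None
    # An alias is assumed to be a single, non-empty path segment.
    if not alias or "/" in alias:
        return None
    return "/dataverse/" + quote(alias, safe="")


# Both forms of the same dataverse map to one redirect target.
print(canonical_dataverse_path("https://example.org/dataverse.xhtml?alias=foo"))
print(canonical_dataverse_path("https://example.org/dataverse/foo/"))
```

A server could issue a redirect whenever the requested path differs from the returned value, or emit the returned value in a rel=canonical link on the page.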

@landreev
Contributor Author

landreev commented Mar 6, 2019

It may be a good idea to exclude the "dataverse.xhtml?alias=..." format from crawling, via robots. And to completely exclude ALL the forms of the dataverse page urls except for the canonical "/dataverse/name", without any extra (search) arguments. As of now, we allow/encourage the bot to crawl through all the facets, and through the paginated search results. This is inefficient, and does not result in anything useful being indexed.
So the solution should be to only allow one form of the dataverse page, and one form of the dataset page. And only expose them to the bots via sitemap, without relying on crawling at all.
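
To make the "one form only, exposed via sitemap" idea concrete, here is a minimal Python sketch. It keeps only URLs of the exact form /dataverse/<name> with no query string and writes them into a sitemap document. The helper names and `example.org` are assumptions for illustration; dataset pages would need an analogous rule, and the real Dataverse sitemap generation may work differently.

```python
import xml.etree.ElementTree as ET
from typing import Iterable
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def is_canonical_dataverse_url(url: str) -> bool:
    """True only for /dataverse/<name> with no query string or fragment."""
    parts = urlsplit(url)
    if parts.query or parts.fragment:
        return False
    segments = parts.path.split("/")
    # "/dataverse/foo" splits into ["", "dataverse", "foo"].
    return (
        len(segments) == 3
        and segments[0] == ""
        and segments[1] == "dataverse"
        and segments[2] != ""
    )


def build_sitemap(urls: Iterable[str]) -> str:
    """Build a sitemap listing only the canonical dataverse URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if is_canonical_dataverse_url(url):
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


candidates = [
    "https://example.org/dataverse/foo",
    "https://example.org/dataverse.xhtml?alias=foo",
    "https://example.org/dataverse/foo?q=&types=datasets&page=2",
]
print(build_sitemap(candidates))
```

Only the first candidate ends up in the sitemap; the alias form and the faceted/paginated form are dropped.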

@mheppler
Contributor

mheppler commented Mar 6, 2019

I would be remiss in my obligations as issue author if I didn't point out "Dataset - PrettyFaces URL Format" (#2486) fitting in both the "dataverse URL forward slash forwarding" and the "dataverse content indexing" story. Especially if we are making changes to block /dataverse.xhtml?alias=... indexing.

It would make more sense to me, and maybe even to a search engine robot, if we had a URL formatting structure that matched the dataverse > dataset > file hierarchy of our app. Something like:

- /dataverse/example
    - /dataverse/example/datasets/doi:10.0000/DVN/XXXXXX
    - /dataverse/example/datasets/doi:10.0000/DVN/YYYYYY
    - /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ
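
For illustration only, the proposed hierarchy could be parsed along these lines. This is a Python sketch of the URL scheme suggested in this comment, which is a proposal and not something Dataverse implements; `DatasetRef` and `parse_dataset_path` are hypothetical names. Note that the persistent identifier itself contains slashes, so it is taken as everything after "/datasets/".

```python
from typing import NamedTuple, Optional


class DatasetRef(NamedTuple):
    dataverse_alias: str
    persistent_id: str


def parse_dataset_path(path: str) -> Optional[DatasetRef]:
    """Split /dataverse/<alias>/datasets/<persistent id> into its parts."""
    prefix = "/dataverse/"
    if not path.startswith(prefix):
        return None
    alias, sep, pid = path[len(prefix):].partition("/datasets/")
    # Require the "/datasets/" separator, a single-segment alias, and an id.
    if not sep or not alias or "/" in alias or not pid:
        return None
    return DatasetRef(alias, pid)


print(parse_dataset_path("/dataverse/example/datasets/doi:10.0000/DVN/XXXXXX"))
```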

Maybe this is a bigger ask than I realize, but there is value to improving what we have now. I would very much like to improve the navigation experience of our app. The format of our URLs is a big part of this. Another part, which might be a conversation for another day, is the use of editMode for these pages. I was reminded of this when I found that the Add Dataset page was indexed in Google.

[Screenshot, 2019-03-06 5:31:20 pm]

@pameyer
Contributor

pameyer commented Mar 6, 2019

@mheppler I think improving the format of the URLs would be a great idea; /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ, /dataset/doi:10.0000/DVN/ZZZZZZ or dataset/doi/10.0000/DVN/ZZZZZZ seem like better matches to the conceptual model. I'm not sure about the implementation complexity, but my impression is that it's enough that it should have its own issue.

@landreev landreev changed the title Add more indexable metadata to dataverse pages, to make them more discoverable Make dataverse pages more discoverable Mar 7, 2019
@djbrooke
Contributor

@landreev is going to update this issue and discuss with @jggautier and @mheppler.

@landreev
Contributor Author

@jggautier @mheppler
Here's the promised update:
Please disregard most of the info I entered earlier... because it's all lies! Seriously, most of it is no longer relevant.

  1. Improved crawler rules have been/are being addressed in "Add documentation on new and improved process of making Dataverse indexed by Google and other search engines" (#5639);
  2. "empty google search cards" (when google finds a dataverse, but shows an empty card with the text "No information is available"): Whatever was causing this, it was not a problem with the page itself. Probably a result of a failed crawl. Once a dataverse page is successfully recrawled and reindexed, it no longer shows as empty on Google. For the dataverse in the example above (note that the dataverse doesn't have any user-entered description!) this is what the search result is looking like now:
    [Screenshot, 2019-03-11 11:06:42 AM]

So, since there's no description, it's just showing whatever is at the top of the page. Which happens to be the number of datasets in the dataverse plus part of the description of the first dataset.

This card looks ok to me (definitely not as bad as the "no information is available..." before). But I guess the remaining question is - is there anything at all that we can do to make it any better/any more useful, if the owner of the dataverse hasn't provided any description? One thing that was suggested (by Mike), maybe we could extract some summary of the data in the dataverse from the facets on the page - since they list all the subjects/authors/categories, etc.?

So the way it would work, we could embed a "DC.description" metadata fragment into the html of the page, similarly to what we do on the dataset page. If the dataverse has a description, we use that to populate it. If not, we generate some description on the fly: "This dataverse contains datasets on the subjects of ... by the authors ..." (for example - ?)
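
A minimal Python sketch of that fallback logic, assuming the subjects and authors have already been collected from the facets. The function name, the `name="DC.description"` attribute layout, and the exact wording of the generated sentence are illustrative assumptions, not the markup Dataverse emits.

```python
from html import escape
from typing import Optional, Sequence


def dc_description_meta(
    description: Optional[str],
    subjects: Sequence[str],
    authors: Sequence[str],
) -> str:
    """Return a DC.description meta tag, or "" if there is nothing to say.

    Uses the owner-entered description when present; otherwise generates
    a summary sentence from the facet values.
    """
    if description and description.strip():
        text = description.strip()
    else:
        parts = []
        if subjects:
            parts.append("on the subjects of " + ", ".join(subjects))
        if authors:
            parts.append("by the authors " + ", ".join(authors))
        if not parts:
            return ""
        text = "This dataverse contains datasets " + " ".join(parts) + "."
    # Escape quotes and angle brackets so the value is safe in an attribute.
    return f'<meta name="DC.description" content="{escape(text, quote=True)}" />'


print(dc_description_meta(None, ["Physics", "Chemistry"], ["A. Smith"]))
```

With an owner-entered description the facets are ignored, which keeps the generated sentence as a last resort only.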

Also, an alternative is not to bother with any of this - and instead encourage the dataverse owners to enter meaningful description text, to make their data more discoverable.

@landreev
Contributor Author

landreev commented Mar 15, 2019

Also,
3. The issue of both the "dataverse.xhtml?alias=<name>" and "/dataverse/<name>" pages being indexed for the same dataverse has also been addressed by the improved robots + sitemap approach in #5639.
