Make dataverse pages more discoverable by search engines #5605

landreev · 2019-03-06T18:51:47Z

Changed the title of the issue to indicate that the goal is to make individual dataverses more discoverable in Google and other search engines.
Embedding structured metadata into the dataverse page is not necessarily the best way to achieve that.
It appears to be more practical to focus on improving the crawl rules (specifically, discouraging the bots from crawling the facets and paginated search results on dataverse pages), in combination with a sitemap that points the bots to all the datasets and dataverses directly.

(end update)

We already go to admirable lengths embedding some structured metadata (DC, schema.org) into our dataset pages, making individual datasets more discoverable.
It would benefit our dataverse pages to have similarly easily indexable metadata as well.

landreev · 2019-03-06T18:53:46Z

As of now, many of our dataverse pages appear in google search results like this:

landreev · 2019-03-06T19:00:40Z

This problem (empty google search record) appears to be more common for the dataverse urls of the "/dataverse/NAME" format, than for the "/dataverse.xhtml?alias=NAME" one:

This may or may not be related to #3130 (trailing slash in the dataverse URL resulting in a 404). But even if this were the case, adding structured metadata to the dataverse page would still be very useful, and would make the process of getting indexed in the search engines more efficient.

pdurbin · 2019-03-06T19:06:40Z

Huh. I'm surprised that the description of the dataverse isn't indexed. Ever since pull request #4879 was merged (a fix for #4468 which has a screenshot from Google search results), dataverses look much better when you link to them in Slack.

@landreev any thoughts on file landing pages? Would helping Google index them better be worth investigating, perhaps in a separate issue?

landreev · 2019-03-06T19:09:21Z

As I said, this may simply be the result of that trailing slash issue.
File pages could be something to investigate separately in the future, yes. (As of now, we are telling the bots to stay away from file pages completely.)

landreev · 2019-03-06T19:14:56Z

Hmm, the "worldfish" dataverse has an empty record with the "alias=" URL:

Then of course this may be a search result cached from before #4879 was merged. (I'm seeing that the bot has finally re-crawled this dataverse in the last few hours; so hopefully the updated entry will start appearing in searches shortly)

pameyer · 2019-03-06T19:21:50Z

It might be worth having only one url to resolve to a dataverse page. If I'm remembering correctly, at least some search engines recommend that if there are multiple URLs with the same content, then one should have the rel=canonical meta tag. Having /dataverse.xhtml?alias=foo redirect to /dataverse/foo (and removing generated links to /dataverse.xhtml?alias=foo) might help with this.
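
The redirect idea above can be sketched as a small URL mapping. This is only an illustration, not Dataverse's actual implementation: the function name `canonical_dataverse_path` is hypothetical, and it assumes the alias is always carried in the `alias` query parameter.

```python
from urllib.parse import urlsplit, parse_qs, quote


def canonical_dataverse_path(url):
    """Map /dataverse.xhtml?alias=<name> to /dataverse/<name>.

    Returns None when the URL is not in the alias form (hypothetical helper).
    """
    parts = urlsplit(url)
    if parts.path != "/dataverse.xhtml":
        return None
    # parse_qs drops blank values by default, so "alias=" yields no alias.
    aliases = parse_qs(parts.query).get("alias")
    if not aliases:
        return None
    # Percent-encode the alias so it is safe as a single path segment.
    return "/dataverse/" + quote(aliases[0], safe="")
```

A server-side filter could answer with a permanent redirect to the returned path whenever it is not None, so that only one URL form is ever served for a dataverse.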

landreev · 2019-03-06T21:04:17Z

It may be a good idea to exclude the "dataverse.xhtml?alias=..." format from crawling, via robots. And to completely exclude ALL the forms of the dataverse page urls except for the canonical "/dataverse/name", without any extra (search) arguments. As of now, we allow/encourage the bot to crawl through all the facets, and through the paginated search results. This is inefficient, and does not result in anything useful being indexed.
So the solution should be to only allow one form of the dataverse page, and one form of the dataset page. And only expose them to the bots via sitemap, without relying on crawling at all.
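
A minimal sketch of the "one canonical form, exposed via sitemap" idea, covering dataverse pages only (the canonical form of the dataset page is not spelled out here, so it is left out). The helper names `is_canonical_dataverse_url` and `build_sitemap` are hypothetical; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def is_canonical_dataverse_url(url):
    """True only for .../dataverse/<name> with no query string or fragment."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expect exactly ["", "dataverse", "<name>"]; a trailing slash adds a
    # fourth, empty segment and is therefore rejected.
    return (
        len(segments) == 3
        and segments[0] == ""
        and segments[1] == "dataverse"
        and segments[2] != ""
        and not parts.query
        and not parts.fragment
    )


def build_sitemap(urls):
    """Return sitemap XML listing only the canonical dataverse URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if is_canonical_dataverse_url(url):
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Faceted, paginated, and "alias=" URLs all carry a query string or a different path, so they never reach the sitemap under this rule.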

mheppler · 2019-03-06T22:32:33Z

I would be remiss in my obligations as issue author if I didn't point out Dataset - PrettyFaces URL Format #2486 fitting in both the "dataverse URL forward slash forwarding" and the "dataverse content indexing" story. Especially if we are making changes to block /dataverse.xhtml?alias=... indexing.

It would make more sense to me, and maybe even to a search engine robot, if we had a URL formatting structure that matched the dataverse > dataset > file hierarchy of our app. Something like:

- /dataverse/example
    - /dataverse/example/datasets/doi:10.0000/DVN/XXXXXX
    - /dataverse/example/datasets/doi:10.0000/DVN/YYYYYY
    - /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ
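
To make the proposed hierarchy concrete, here is a rough sketch of how such a path could be split into a dataverse alias and a dataset persistent identifier. The function name `parse_hierarchical_path` is hypothetical, and the format is only the proposal above, not something the application supports today.

```python
def parse_hierarchical_path(path):
    """Split /dataverse/<alias>[/datasets/<pid>] into (alias, pid).

    pid is None for a plain dataverse path; returns None if the path
    does not follow the proposed format.
    """
    prefix, sep, pid = path.partition("/datasets/")
    segments = prefix.split("/")
    if len(segments) != 3 or segments[:2] != ["", "dataverse"] or not segments[2]:
        return None
    if sep and not pid:
        # "/dataverse/<alias>/datasets/" with nothing after it
        return None
    return (segments[2], pid if sep else None)
```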

Maybe this is a bigger ask than I realize, but there is value to improving what we have now. I would very much like to improve the navigation experience of our app. The format of our URLs is a big part of this. Another part, which might be a conversation for another day, is the use of editMode for these pages. I was reminded of this when I found that the Add Dataset page was indexed in Google.

pameyer · 2019-03-06T22:50:47Z

@mheppler I think improving the format of the URLs would be a great idea; /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ , /dataset/doi:10.0000/DVN/ZZZZZZ or dataset/doi/10.0000/DVN/ZZZZZZ seem like better matches to the conceptual model. I'm not sure about the implementation complexity, but my impression is that it's enough that it should have its own issue.

djbrooke · 2019-03-13T18:49:55Z

@landreev is going to update this issue and discuss with @jggautier and @mheppler.

landreev · 2019-03-15T20:19:09Z

@jggautier @mheppler
Here's the promised update:
Please disregard most of the info I entered earlier... because it's all lies! Seriously, most of it is no longer relevant.

1. Improved crawler rules have been/are being addressed in "Add documentation on new and improved process of making Dataverse indexed by Google and other search engines" #5639.
2. "Empty Google search cards" (when Google finds a dataverse, but shows an empty card with the text "No information is available"): Whatever was causing this, it was not a problem with the page itself. Probably a result of a failed crawl. Once a dataverse page is successfully recrawled and reindexed, it no longer shows as empty on Google. For the dataverse in the example above (note that the dataverse doesn't have any user-entered description!) this is what the search result is looking like now:

So, since there's no description, it's just showing whatever is at the top of the page. Which happens to be the number of datasets in the dataverse plus part of the description of the first dataset.

This card looks ok to me (definitely not as bad as the "no information is available..." before). But I guess the remaining question is: is there anything at all that we can do to make it any better/any more useful, if the owner of the dataverse hasn't provided any description? One thing that was suggested (by Mike): maybe we could extract some summary of the data in the dataverse from the facets on the page, since they list all the subjects/authors/categories, etc.?

So the way it would work, we could embed a "DC.description" metadata fragment into the html of the page, similarly to what we do on the dataset page. If the dataverse has a description, we use that to populate it. If not, we generate some description on the fly: "This dataverse contains datasets on the subjects of ... by the authors ..." (for example - ?)
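
The fallback logic could look roughly like the sketch below. It is an illustration under assumptions: the function name `dc_description_meta` is hypothetical, the subject and author lists stand in for whatever the facets provide, and the exact `<meta name="DC.description" ...>` form is assumed rather than taken from the dataset page.

```python
from html import escape


def dc_description_meta(description, subjects, authors):
    """Build a DC.description meta tag, generating text if none was entered."""
    text = (description or "").strip()
    if not text:
        # No user-entered description: summarize from facet values instead.
        parts = []
        if subjects:
            parts.append("on the subjects of " + ", ".join(subjects))
        if authors:
            parts.append("by the authors " + ", ".join(authors))
        if parts:
            text = "This dataverse contains datasets " + " ".join(parts)
    if not text:
        return ""  # nothing useful to embed
    # Escape quotes and ampersands so the text is safe inside an attribute.
    return '<meta name="DC.description" content="%s"/>' % escape(text, quote=True)
```

In practice the facet lists would probably need to be capped at a few values each, since search engines show only a short snippet.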

Also, an alternative is not to bother with any of this - and instead encourage the dataverse owners to enter meaningful description text, to make their data more discoverable.

landreev · 2019-03-15T20:23:31Z

Also,
3. The issue of both the "dataverse.xhtml?alias=<name>" and "/dataverse/<name>" pages being indexed for the same dataverse has also been addressed by the improved robots + sitemap approach in #5639.

landreev added the Feature: Metadata label Mar 6, 2019

djbrooke added Status: Ready labels Mar 6, 2019

landreev changed the title ~~Add more indexable metadata to dataverse pages, to make them more discoverable~~ Make dataverse pages more discoverable Mar 7, 2019

djbrooke assigned landreev Mar 13, 2019

djbrooke added Status: This/Next Sprint and removed Status: Ready labels Mar 13, 2019

djbrooke assigned jggautier Mar 13, 2019

djbrooke added Status: Ready and removed ready for estimation labels Mar 13, 2019

jggautier mentioned this issue Mar 18, 2019

As a data curator, I need other discovery systems (search engines, social media platforms) to display description metadata for my datasets and files so that the data is easier to find #4894

Closed

djbrooke removed the Status: Ready label Mar 20, 2019

djbrooke unassigned landreev and jggautier Mar 20, 2019

djbrooke added this to Inbox 🗄 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) May 8, 2019

TaniaSchlatter changed the title ~~Make dataverse pages more discoverable~~ Make dataverse pages more discoverable by search engines Sep 5, 2019

pdurbin added the Type: Feature a feature request label Nov 16, 2023
