Spike: Make datasets discoverable by Google #2654

dafeder · 2018-09-07T14:45:55Z

While this would have been a good thing to do already, Google's announcement of a Dataset Search based on schema.org's Dataset definition makes support for this data standard much more imperative. It also means that the schema.org vocabulary is likely to grow in importance/dominance in the open data space.

There are two ways to approach this:

1. Use the schema.org module and attempt to implement inline markup for schema.org Dataset. We once had a similar solution to add inline RDFa markup for DCAT (which schema.org's Dataset is based on) but moved toward offering a complete RDF endpoint to allow more control over the metadata exposed. The model of datasets and resources on separate pages especially makes the inline path difficult.
2. Use [Open Data Schema Map] to expose the full dataset metadata in schema.org JSON-LD, and embed this in the dataset page. This requires less direct intervention in the markup but could have performance implications given ODSM's reliance on the Token system.

jensr · 2018-09-17T11:38:06Z

I was checking whether any work is progressing on this. Testing our DKAN site with Google's structured data testing tool, the main issues seem to be:

Using data.json - Google does not recognise dcat:catalog as a type, and ignores everything else
Using catalog.json - it identifies all datasets, but the feed does not define the type, e.g. @type: not defined

Would it be possible as a really simple solution to include type definitions in the catalog.json, or would that violate anything in DCAT-AP?
E.g. simply set type to "dataset"
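
To make the suggestion above concrete, here is a minimal sketch of adding a type to every entry of a catalog feed. It assumes the feed is a plain list of dataset objects; the field names and sample entries are hypothetical, not DKAN's actual catalog.json output, and the sketch says nothing about whether a crawler would discover or index such a feed.

```python
import json


def add_dataset_type(catalog):
    """Return a copy of the catalog with "@type" set on entries that lack it."""
    typed = []
    for entry in catalog:
        item = dict(entry)  # shallow copy so the input list is left untouched
        item.setdefault("@type", "Dataset")
        typed.append(item)
    return typed


# Hypothetical entries, for illustration only.
sample_catalog = [
    {"title": "Air quality 2017", "identifier": "abc-123"},
    {"title": "Road closures", "identifier": "def-456"},
]

typed_catalog = add_dataset_type(sample_catalog)
print(json.dumps(typed_catalog, indent=2))
```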

dafeder · 2018-09-17T14:11:52Z

@jensr great find! Can you tell if it's discovering the catalog.json file though? It seemed to me that it needed either inline semantic markup or for the metadata to be defined inside a <script> tag to be discoverable.

jensr · 2018-09-17T14:34:59Z

I don't think it is discovering it.
But I've tried registering both through Google Search Console as a sitemap.
(For some reason you can only see them in the old version of Search Console.)
The catalog.json is being picked up error-free and categorised as a data feed (still awaiting indexing from what I can see).
The data.json gives me 2 errors: the type dcat:catalog is not recognised, and the context is not recognised (see image).

I hope this helps - happy to try out other tests if necessary. I've tested the individual dataset JSON pages with the validation tool. Again, it picks up the structured content, but because the type is missing, my guess is that it will not get indexed for the dataset search. (From what I've read, they must have RDF type "dataset" to be included.)

chrisgorgo · 2019-07-14T17:03:20Z

It would be great if DKAN made datasets discoverable through Google Dataset Search! I would love to help you with that (disclaimer: I work on the Dataset Search team).

I think there might be some misunderstanding about how the metadata is ingested by Googlebot. Instead of a single dedicated JSON file with metadata for the whole catalog, the crawlers expect metadata embedded in the HTML code of the landing pages for individual datasets. This metadata can be in Schema.org or DCAT.

Here's an example of a dataset from Kaggle with Schema.org annotation: https://www.kaggle.com/matheusfreitag/gas-prices-in-brazil. If you look at the code you will find the schema.org annotation inside the <script type='application/ld+json'> tags.
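
For a quick local check of what a landing page embeds, a rough sketch like the following pulls the JSON-LD blocks out of an HTML string using only the Python standard library. The sample HTML and the dataset in it are made up for illustration, and this is not how Googlebot itself parses pages.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical landing page, for illustration only.
sample_html = """
<html><head>
<script type='application/ld+json'>
{"@context": "https://schema.org/", "@type": "Dataset",
 "name": "Example dataset", "description": "A made-up dataset."}
</script>
</head><body><h1>Example dataset</h1></body></html>
"""

found = extract_jsonld(sample_html)
print(found[0]["@type"], "-", found[0]["name"])
```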

To see how this metadata is parsed you can use our Structured Data Testing Tool: https://search.google.com/structured-data/testing-tool?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset#url=https%3A%2F%2Fwww.kaggle.com%2Fmatheusfreitag%2Fgas-prices-in-brazil

You can read more about the different metadata fields here: https://developers.google.com/search/docs/data-types/dataset.

In terms of implementation I would recommend starting small - exposing just the name and description will get the datasets indexed.
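
As a sketch of that "start small" approach, the snippet below builds a script tag carrying only the name and description of a schema.org Dataset. The function name and sample values are hypothetical, and how the resulting tag gets into a DKAN page template is left open.

```python
import json


def dataset_jsonld_script(name, description):
    """Build a JSON-LD <script> tag describing a schema.org Dataset."""
    metadata = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    # Escape "</" so text inside the metadata cannot close the script tag early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical dataset, for illustration only.
snippet = dataset_jsonld_script(
    "Road closures",
    "Planned and unplanned road closures, updated daily.",
)
print(snippet)
```

The output is meant to sit in the HTML of the individual dataset's landing page, which is where the comment above says the crawlers look.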

janette · 2019-07-18T05:30:03Z

thanks @chrisgorgo that is great info, I'm thinking we could add name and description components to https://www.npmjs.com/package/react-structured-data for our dkan2 sites

chrisgorgo · 2019-07-18T05:51:09Z

Sounds like a plan!

chrisgorgo · 2019-09-21T03:24:38Z

Any progress on adding schema.org to dkan? Many government data repositories would benefit from it!

jensr · 2020-02-05T11:53:36Z

Hi there - same as @chrisgorgo - just checking back if any progress is being made on this? It would be hugely beneficial for our DKAN repo to have this implemented.

erogray · 2020-02-05T20:29:47Z

To be done in DKAN 2 via GetDKAN/dkan2#307

dafeder self-assigned this Sep 17, 2018

dafeder added the help wanted label Sep 17, 2018

erogray removed the help wanted label Sep 27, 2018

dafeder removed their assignment Oct 10, 2018

dafeder added the To Do label Oct 10, 2018

erogray closed this as completed Feb 5, 2020
