Spike: Make datasets discoverable by Google #2654

Closed
dafeder opened this issue Sep 7, 2018 · 9 comments

@dafeder
Member

dafeder commented Sep 7, 2018

While this would have been a good thing to do already, Google's announcement of a Dataset Search based on schema.org's Dataset definition makes support for this data standard much more imperative. It also means that the schema.org vocabulary is likely to grow in importance/dominance in the open data space.

There are two ways to approach this:

  1. Use the schema.org module and attempt to implement inline markup for schema.org Dataset. We once had a similar solution to add inline RDFa markup for DCAT (which schema.org's Dataset is based on) but moved toward offering a complete RDF endpoint to allow more control over the metadata exposed. The model of datasets and resources on separate pages especially makes the inline path difficult.
  2. Use [Open Data Schema Map] to expose the full dataset metadata in schema.org JSON-LD, and embed this in the dataset page. This requires less intervention in the markup directly but could have performance implications given ODSM's reliance on the Token system.
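
As a rough illustration of what the second approach would need to produce, the sketch below maps a data.json-style dataset record to a schema.org `Dataset` object. This is not how Open Data Schema Map works internally; the function name `to_schema_org` and the sample record are hypothetical, and the input field names assume the Project Open Data layout (`title`, `description`, `distribution[].downloadURL`).

```python
import json


def to_schema_org(record: dict) -> dict:
    """Map a data.json-style record to a schema.org Dataset (illustrative only)."""
    dataset = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
    }
    if "identifier" in record:
        dataset["identifier"] = record["identifier"]
    if "keyword" in record:
        dataset["keywords"] = list(record["keyword"])

    # Resources (distributions) become schema.org DataDownload entries.
    downloads = []
    for dist in record.get("distribution", []):
        if "downloadURL" not in dist:
            continue
        download = {"@type": "DataDownload", "contentUrl": dist["downloadURL"]}
        if "mediaType" in dist:
            download["encodingFormat"] = dist["mediaType"]
        downloads.append(download)
    if downloads:
        dataset["distribution"] = downloads
    return dataset


# Hypothetical sample record; not taken from a real DKAN site.
sample = {
    "title": "Example dataset",
    "description": "A made-up record used only to exercise the mapping.",
    "identifier": "example-1",
    "keyword": ["demo"],
    "distribution": [
        {"downloadURL": "https://example.org/data.csv", "mediaType": "text/csv"}
    ],
}

print(json.dumps(to_schema_org(sample), indent=2))
```

Folding the resources into a `distribution` list on the dataset object is one way around the separate-pages problem mentioned in the first approach, since everything ends up in a single JSON-LD document on the dataset page.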
@jensr

jensr commented Sep 17, 2018

I was checking whether there has been any progress on this. Testing our DKAN site with Google's structured data testing tool, the main issues seem to be:

  • Using data.json - Google does not recognise dcat:catalog as a type, and ignores everything else
  • Using catalog.json - Google identifies all datasets, but the feed does not define the type, i.e. @type is not defined

Would it be possible, as a really simple solution, to include type definitions in the catalog.json, or would that violate anything in DCAT-AP?
E.g. simply set type to "dataset"
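
A minimal sketch of that idea, assuming the feed is a JSON list of dataset objects (the real catalog.json layout may differ, and `add_dataset_type` is a hypothetical helper). It uses the capitalised spelling `Dataset` from schema.org and DCAT; without a JSON-LD `@context` that resolves the term, a bare type string may still go unrecognised.

```python
import copy


def add_dataset_type(catalog: list) -> list:
    """Return a copy of the feed where every entry carries an @type."""
    typed = copy.deepcopy(catalog)  # leave the caller's data untouched
    for entry in typed:
        # Only fill in the type when the entry does not declare one already.
        entry.setdefault("@type", "Dataset")
    return typed


# Hypothetical feed entries, used only for illustration.
feed = [{"title": "Example A"}, {"title": "Example B", "@type": "dcat:Dataset"}]
print(add_dataset_type(feed))
```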

@dafeder dafeder self-assigned this Sep 17, 2018
@dafeder
Member Author

dafeder commented Sep 17, 2018

@jensr great find! Can you tell if it's discovering the catalog.json file though? It seemed to me that it needed either inline semantic markup or for the metadata to be defined inside a <script> tag to be discoverable.

@jensr

jensr commented Sep 17, 2018

I don't think it is discovering it.
However, I've tried registering both through Google Search Console as sitemaps.
(For some reason you can only see them in the old version of Search Console.)
The catalog.json is being picked up error-free and categorised as a data feed (still awaiting indexing, from what I can see).
The data.json gives me 2 errors: the type dcat:catalog is not recognised, and the context is not recognised (see image).

image

I hope this helps - happy to try out other tests if necessary. I've tested the individual dataset JSON pages with the validation tool. Again, it picks up the structured content, but because the type is missing, my guess is that it will not get indexed for Dataset Search. (From what I've read, they must have RDF type "dataset" to be included.)

@dafeder dafeder removed their assignment Oct 10, 2018
@dafeder dafeder added the To Do label Oct 10, 2018
@chrisgorgo

It would be great if DKAN made datasets discoverable through Google Dataset Search! I would love to help you with that (disclaimer: I work on the Dataset Search team).

I think there might be some misunderstanding about how the metadata is ingested by Googlebot. Instead of a single dedicated JSON file with metadata for the whole catalog, the crawlers expect metadata embedded in the HTML code of the landing pages for individual datasets. This metadata can be in Schema.org or DCAT.

Here's an example of a dataset from Kaggle with Schema.org annotation: https://www.kaggle.com/matheusfreitag/gas-prices-in-brazil. If you look at the code you will find the schema.org annotation inside the <script type='application/ld+json'> tags.

To see how this metadata is parsed, you can use our Structured Data Testing Tool: https://search.google.com/structured-data/testing-tool?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset#url=https%3A%2F%2Fwww.kaggle.com%2Fmatheusfreitag%2Fgas-prices-in-brazil

You can read more about the different metadata fields here: https://developers.google.com/search/docs/data-types/dataset.

In terms of implementation, I would recommend starting small: exposing just the name and description will get the datasets indexed.
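
A minimal sketch of such a starting point, assuming the landing page template can call a helper that returns the markup. The function name `dataset_jsonld_script` is hypothetical and this is not DKAN code; it only shows the shape of a `<script type="application/ld+json">` block carrying a schema.org `Dataset` with `name` and `description`.

```python
import json


def dataset_jsonld_script(name: str, description: str) -> str:
    """Build an embeddable JSON-LD <script> block for one dataset page."""
    payload = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    # Escape "<" so a value containing "</script>" cannot close the tag early.
    body = json.dumps(payload, indent=2).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical values, used only for illustration.
print(dataset_jsonld_script("Example dataset", "A short description of the data."))
```

The output would go into the HTML of each individual dataset landing page, where it can be checked with the testing tool linked above.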

@janette
Member

janette commented Jul 18, 2019

Thanks @chrisgorgo, that is great info. I'm thinking we could add name and description components to https://www.npmjs.com/package/react-structured-data for our dkan2 sites.

@chrisgorgo

Sounds like a plan!

@chrisgorgo

Any progress on adding schema.org to DKAN? Many government data repositories would benefit from it!

@jensr

jensr commented Feb 5, 2020

Hi there - same as @chrisgorgo - just checking back on whether any progress is being made on this. It would be hugely beneficial for our DKAN repo to have this implemented.

@erogray

erogray commented Feb 5, 2020

To be done in DKAN 2 via GetDKAN/dkan2#307

@erogray erogray closed this as completed Feb 5, 2020