
Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's #5029

Open
chewsw opened this issue Sep 7, 2018 · 25 comments


@chewsw

chewsw commented Sep 7, 2018

Author names are missing from the dataset records that Google Dataset Search indexed from Dataverse.

The JSON-LD schema for authors should be updated to "@type": "Person" or other appropriate types to differentiate between organizations and individuals as authors.

@pdurbin
Member

pdurbin commented Sep 7, 2018

@chewsw thanks for opening this issue. Here's the thread from the dataverse-users mailing list: https://groups.google.com/d/msg/dataverse-community/TlQPNI3Ip2E/srLf29aSBAAJ

Originally we had "@type": "Person" in the JSON-LD output (in development, before release) but in Dataverse it's possible to have organizations as authors ("Gallup Organization", "Geological Survey (U.S.)", etc.) so we took it out. Please see discussion in these two places:

@kcondon kcondon changed the title Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's records Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's Sep 10, 2018
@djbrooke djbrooke added Status: This/Next Sprint bklog: isReady This issue fits our definition of Ready for a sprint and removed Status: This/Next Sprint labels Sep 12, 2018
@jggautier
Contributor

jggautier commented Sep 14, 2018

Maybe outlining some more details would help with estimation:

  • This will be required metadata that the author can change from the UI or API, indicating in some way whether the author names she enters are people or organizations
  • We need a plan for how installations can add author types to existing datasets

@jggautier
Contributor

jggautier commented Sep 26, 2018

I'm starting to think that Google Dataset Search prefers the creator property (as opposed to the author property that Dataverse uses). For every dataset landing page I've found in Google Dataset Search where Google shows the author, the creator property is used instead of the author property.

Unless someone finds something different, I'd propose that Dataverse use the creator property instead of the author property; it can use the same sub-properties:

"creator": [
    {
      "affiliation": "affiliation",
      "@type": "Person",
      "name": "Lname, Fname"
    },
    {
      "@type": "Organization",
      "name": "Org name"
    }
]

(Google doesn't like affiliation when the @type organization is used with the author or creator properties.)

Also, Google's Structured Data Testing Tool is no longer showing errors when author or creator types are missing (and it defaults to "Thing" instead of Person or Organization), although I still agree that Dataverse's schema.org metadata should say whether dataset authors are people or organizations.
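For concreteness, here's a minimal Python sketch (a hypothetical helper, not Dataverse code) that emits a creator array along these lines, including the rule of omitting affiliation when the type is Organization:

```python
import json

def creator_entry(name, is_person, affiliation=None):
    """Build one schema.org creator entry with an explicit @type.

    Google reportedly rejects affiliation on Organization entries,
    so affiliation is only included for Person entries.
    """
    entry = {"@type": "Person" if is_person else "Organization", "name": name}
    if is_person and affiliation:
        entry["affiliation"] = affiliation
    return entry

creators = [
    creator_entry("Lname, Fname", is_person=True, affiliation="affiliation"),
    creator_entry("Org name", is_person=False, affiliation="ignored"),
]
print(json.dumps({"creator": creators}, indent=2))
```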

@pdurbin
Member

pdurbin commented Oct 4, 2018

Today during sprint planning @jggautier explained his hunch on how switching from author to creator might help. This was while discussing #4371.

@jggautier
Contributor

jggautier commented Oct 5, 2018

I asked for clarification in the structured data section of Google's webmaster forum.

"Creator" seems like the more used property, but DataCite is using "author" with an @type. On this dataset on Google Dataset Search, authors are displayed even though the "author" property is used, so maybe Google does want to see the specified @type.

@chewsw
Author

chewsw commented Oct 8, 2018

@jggautier
Contributor

jggautier commented Jan 11, 2019

Author names are showing up on some but not all Google Dataset Search pages for datasets in Dataverse repositories, like this page for a dataset from the Texas Data Repository (TDR), and this page for a dataset from Harvard Dataverse. But those pages also say metadata is coming from DataCite, which publishes its own schema.org metadata and uses only the "author" property, but includes its guessed @type (e.g. this schema.org metadata from DataCite for that TDR dataset). From what I can tell so far, every Google Dataset Search page for datasets from a Dataverse repository includes author names only when the "dataset provided by" includes DataCite.

Harvard Dataverse upgraded to Dataverse 4.10.1 two days ago (Jan 8), which includes adding the "creator" property to the schema.org metadata. Once Google starts indexing more recently published datasets, we can see if authors are displayed on Google Dataset Search pages (especially when Google isn't also using DataCite's schema.org metadata).

@jggautier
Contributor

jggautier commented Jan 22, 2019

Just an update: Datasets published in Harvard Dataverse after Jan 7, with the updated Schema.org metadata, are showing up in Google Dataset Search without the author names (like this one and this one). I think we can rule out any preference for "author" versus "creator" elements.

@pdurbin
Member

pdurbin commented Jan 22, 2019

@jggautier bummer. Does that mean we should try adding "@type": "Person"? As indicated above, the Dataverse UI/API would need to allow dataset authors to choose between a person and an organization.

@jggautier
Contributor

jggautier commented Jan 22, 2019

That or we could do what @mfenner wrote in #2243 that DataCite does, which is basically guess (with >90% accuracy).

@Venki18

Venki18 commented Jun 5, 2019

@pdurbin and @jggautier thank you for looking into this problem. I thought I'd add my findings here in case they help with making changes in the next version.

Using the Structured Data Testing Tool, I discovered that for NTU datasets the author fields show as Thing. If I change "@type" to "Person" for one of the authors, the tool doesn't show any errors. I think we must include "@type": "Person" for all the authors in the ld+json script. Then the Google Dataset Search results page will show the author names under Person.


I see from an example GBIF (the Global Biodiversity Information Facility) dataset record that the Google Dataset Search results page displays author names when "@type": "Person" is used.



I think we need to add "@type": "Person" in Dataverse to show the author names on the Google Dataset Search page.

@pdurbin
Member

pdurbin commented Jun 5, 2019

@Venki18 hi! Yes, the "every person and organization is a Thing" problem is well known to us, unfortunately.

I've been hoping we can use some new code added by @fcadili in pull request #4664 to pass in a string that could be either a person or an organization, and the code will tell us which it is.

I haven't studied the code yet but here's a test he wrote that shows the code figuring out if a string is for an organization or a person, for example:

https://github.com/IQSS/dataverse/blob/v4.14/src/test/java/edu/harvard/iq/dataverse/export/OrganizationsTest.java


@Venki18

Venki18 commented Jun 5, 2019

@pdurbin thank you for the quick reply. May I know how the export for TermsOfUse works? We have been using CC-BY-NC instead of CC0 and have changed the necessary text in the Bundle.properties file, but we are still using the CC0 code as-is, so when you export to ld+json format it is exported as CC0. For all our datasets with waiver terms the code exports what is entered in the additional text box, but the default CC0 is taken as-is.
Is there any way to show CC-BY-NC?

@pdurbin
Member

pdurbin commented Jun 5, 2019

@Venki18 I'm not sure but let me at least give you and @Thanh-Thanh and others some pointers to the code:

https://github.com/IQSS/dataverse/blob/v4.14/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java#L1777

It looks like if CC0 isn't specified, the code will output the free-form text the user entered as an alternative to CC0.
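The described behavior sounds roughly like the following sketch (Python, hypothetical names; the actual implementation is the Java linked above):

```python
# Hypothetical illustration of the license-selection logic described above,
# not the actual Dataverse code.
CC0_URI = "https://creativecommons.org/publicdomain/zero/1.0/"

def license_for_json_ld(is_cc0, terms_of_use_text=None):
    """Return the schema.org 'license' value: the CC0 URI when the dataset
    is marked CC0, otherwise the free-form terms text the depositor entered."""
    if is_cc0:
        return CC0_URI
    return terms_of_use_text or ""

# An installation that changed the Bundle.properties text but left the CC0
# flag in place would still export CC0:
print(license_for_json_ld(True, "CC-BY-NC terms..."))   # the CC0 URI
print(license_for_json_ld(False, "CC-BY-NC terms..."))  # the free-form text
```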

This is somewhat off topic for this issue, of course, but I hope this helps! 😄 Please feel free to create as many issues as we need!

@pdurbin
Member

pdurbin commented Jun 5, 2019

@Venki18 also, if you're interested @rigelk and I are talking about Schema.org JSON-LD, especially in relation to ActivityPub (#5883) in chat. You can catch up on the conversation at http://irclog.iq.harvard.edu/dataverse/2019-06-05

@jggautier jggautier changed the title Improving Dataverse's JSON-LD schema to enable author names display in Google Dataset Search's Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's Oct 22, 2020
@pdurbin pdurbin added this to Watching in pdurbin Oct 10, 2022
@adam3smith
Contributor

adam3smith commented Oct 12, 2022

Just to register our interest in fixing this, using the Datacite strategy of guessing. Would you take a fix along those lines as a PR? Otherwise we'd address it locally -- not having authors in the Google Dataset search is a bummer.

@jggautier
Contributor

jggautier commented Oct 12, 2022

I think we were worried about how well the Datacite strategy of guessing the name type (#2243 (comment)), used in the Dataverse software's OpenAIRE metadata export, would work for all Dataverse installations.

These are some or hopefully all of the next steps proposed already in another GitHub issue (which I can't find right now), some by @qqmyers:

  • Test how well the algorithm for guessing author name types works for different installations, especially those with different types of names, and improve the algorithm if needed before using it in other metadata exports like the Schema.org export
  • Let installations decide whether or not to use the algorithm for any of their metadata exports
  • Use the algorithm only for datasets already created in an installation and ask depositors to indicate the type of author name being entered

I think a combination of these things could be done, like testing the algorithm and adding a way for installations to say that they don't want to use it. In this case I'd advocate for providing installations with guidance about how to evaluate the accuracy of the algorithm. If GUI work could be done to let depositors choose the name type when they enter names, installations could also decide not to use the algorithm to add "type" metadata to the author/creator names of datasets already created in their installations.

But one blocker might be a lack of resources (people and time) to review the algorithm and design and test GUI changes. Maybe a review of the algorithm's accuracy would be less resource intensive if it was distributed among a representative sample of Dataverse installations, like those with names from different cultures in their metadata. Each installation could test the accuracy of their own metadata, report back, and the community could make a decision about what to do then.

The algorithm is being used to determine the name type for names entered in other fields, like the Contact Person field, but I'm also assuming that we would prioritize identifying the types of names in the author/creator fields, because Google Dataset Search usually displays those names but doesn't when there's no name type in the Schema.org metadata.

@adam3smith
Contributor

adam3smith commented Oct 12, 2022

Thanks for responding so quickly!

I think we were worried about how well [the Datacite strategy of guessing the name type] (#2243 (comment)) (used in the Dataverse software's OpenAIRE metadata export) would work for all Dataverse installations.

I understand this, but I'm wondering whether this is the right trade-off:

  • With the current behavior, we're generating invalid JSON-LD and having our data presented at Google without proper attribution to their creators
  • With a worst-case scenario of implementing the fix and "guessing" working for only half the data (a highly unlikely scenario), we'd fix the invalid JSON-LD and have our holdings show up properly in Google, with the main downside that some people would be categorized as organizations and vice versa. The main effect of this would be, I think, that some auto-generated citations would look a bit off.

Am I overlooking a massive cost to incorrectly labelling a creator?
In other words, even if the algo would work poorly in some installations -- is there any scenario where it's worse than the status quo? And if it isn't, is there any way in which implementing a "guessing" fix would create a form of technology lock-in that we want to avoid? Otherwise, I don't really see the case against this.

@jggautier
Contributor

jggautier commented Oct 12, 2022

Am I overlooking a massive cost to incorrectly labelling a creator?

I agree. Maybe making sure author names are displayed in Google Dataset Search outweighs any adverse effects of calling a person an organization or vice versa. And right now the only adverse effect we can think of is that the citations would look a bit off.

I don't think there's any technical lock-in, if I get your meaning. I think it would be easy to improve this over time, right? For example, repositories could run an improved algorithm and create or overwrite dataset versions with improved metadata, however that metadata is corrected.

The lock-in might be more behavioral and I'm probably suggesting that we front-load research because I always worry that the effectiveness of the changes that make it into each Dataverse release won't be evaluated soon enough, if at all. By the time we see that something needs to be fixed, the problem could be much more difficult to fix.

For example, I'm thinking of the problems I wrote about in #5920. The scale and consequences of those problems haven't been explored, but I think it wouldn't be great for discoverability and access if datasets that aren't really "closed access" are labelled as such.

Maybe it's enough that there's this record of a conversation about "what could go wrong with this decision" and an agreement that we don't need to more proactively evaluate the changes if fixing any problems won't become harder over time, in this case as more and more datasets are created.

@adam3smith
Contributor

adam3smith commented Oct 12, 2022

I don't think there's any technical lock-in, if I get your meaning. I think it would be easy to improve this over time, right?

Right, that's what I mean -- I don't think this would make implementing either better algorithmic solutions or a manual UI for type selection in the future harder.

The lock-in might be more behavioral and I'm probably suggesting that we front-load research because I always worry that the effectiveness of the changes that make it into each Dataverse release won't be evaluated soon enough, if at all. By the time we see that something needs to be fixed, the problem could be much more difficult to fix.

That's fair. I guess beyond better algos (which I expect would be hard) the relevant question is whether, and to what extent, there is the need to allow author-type selection in the GUI, which would obviously be more precise in theory, but is also another UI feature and itself prone to user error, especially for large, self-curated repositories like Harvard DV.
How would you do the research on this? Just sample and look manually? At QDR we can obviously look at our entire holdings but I also know our metadata well enough to be pretty certain that even a naive algo without first-name matching would get us to 100%.

@jggautier
Contributor

jggautier commented Oct 13, 2022

Yes the only way I could think of is to have people look at a sample of the guesses that the algorithm makes.

But looking at the discussion in this issue some more I remember that Martin Fenner said the algorithm works over 90% of the time, and I think DataCite would have tested it on a much greater number and variety of names than we could. So maybe that's enough to be confident that the algorithm would be accurate enough, and instead we could consider here or later on why someone would want to correct a guess that the algorithm made and how they might be able to.

@adam3smith
Contributor

adam3smith commented Oct 14, 2022

So maybe that's enough to be confident that the algorithm would be accurate enough, and instead we could consider here or later on why someone would want to correct a guess that the algorithm made and how they might be able to.

Great, yes, let's make that the plan

@jggautier
Contributor

jggautier commented Oct 19, 2022

@adam3smith, about planning to review why someone would want to correct a guess that the algorithm made and how they might be able to, I wanted to clarify that I think this should be done before the algorithm is used to add metadata to other exports, like the Schema.org export.

What do you think?

Very early in the Schema.org conversation and development work, I asked a liaison at Google why Google Dataset Search would insist on a nameType. I thought the answer was vague and left a lot of room for speculation (like, their knowledge graph could take advantage of knowing more about dataset authors), which at the time made it easy for us to feel okay with ignoring the nameType error we kept seeing when checking the Schema.org exports, until we started noticing that if a dataset didn't have a person or organization nameType, the author names wouldn't show up in Google Dataset Search.

But since you mentioned that the algorithm guessing wrong might result in an off-looking citation, and Martin Fenner wrote earlier that the algorithm is necessary for generating correct citations, maybe we could start by assuming that the only way that a depositor or curator would notice the wrong guess is if they saw a citation of their dataset that was off. They could always just correct the autogenerated citation themselves, right? I remember having to do this a lot when managing citations in Zotero. But what are the chances that this becomes a big inconvenience and someone wants a curator or repository manager to correct the generated citations? In your experience, is the risk of that happening so low that we don't need to consider how a depositor or curator could have the guess corrected? You mentioned that this won't happen in QDR, but might it for other Dataverse repositories?

@adam3smith
Copy link
Contributor

adam3smith commented Oct 19, 2022

But what are the chances that this becomes a big inconvenience and someone wants a curator or repository manager to correct the generated citations?

I think for personal authors this is a real issue and it comes up with poor metadata in Zotero a fair amount, but:
Given the way the algo works, this is easy to do for person authors and we don't need any additional functionality for this (it's also why I think it's irrelevant at QDR): As long as you add your name as "Lastname, Firstname" or you have an ORCID in the metadata, you're always going to be a person. Lastname, Firstname is, of course, already explicitly recommended by Dataverse as the mode of entry for just this reason.

The only scenario for which we'd therefore need new functionality is an organization author that gets misclassified by the algo, either because they have a comma in their name or because they have a first word that looks like a first name. This, I think, is going to be incredibly rare and I'd argue not worth considering. The chances of it happening could be further reduced if the last step of the algorithm is skipped, i.e. we treat all names without a comma as organizations, but I don't think that's a good idea given the trade-offs of missing a large number of 'incorrectly' added persons.
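The heuristic described above could be sketched like this (a rough approximation of the DataCite-style guess, not the actual DataCite or Dataverse code; the first-name list is a stand-in for whatever name dictionary the real algorithm uses):

```python
def guess_name_type(name, has_orcid=False, known_first_names=frozenset()):
    """Guess whether a creator name denotes a Person or an Organization.

    Rough approximation of the DataCite-style heuristic discussed above:
    an ORCID or a "Lastname, Firstname" comma form implies a person;
    otherwise a first word that looks like a known first name implies
    a person; everything else is treated as an organization.
    """
    if has_orcid:
        return "Person"
    if "," in name:
        return "Person"
    words = name.split()
    if words and words[0] in known_first_names:
        return "Person"
    return "Organization"

first_names = {"Jane", "John"}
print(guess_name_type("Smith, Jane"))                                # Person
print(guess_name_type("Jane Smith", known_first_names=first_names))  # Person
print(guess_name_type("Gallup Organization"))                        # Organization
```

As noted above, the failure mode is an organization with a comma in its name or a leading word that looks like a first name; skipping the first-name step would reduce that risk at the cost of misclassifying many persons entered without a comma.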

@jggautier
Contributor

jggautier commented Oct 20, 2022

Great points! Thanks @adam3smith.

@adam3smith adam3smith moved this from Larger issues to Implemented at QDR in Qualitative Data Repository Nov 1, 2022