Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search results #5029
Comments
@chewsw thanks for opening this issue. Here's the thread from the dataverse-users mailing list: https://groups.google.com/d/msg/dataverse-community/TlQPNI3Ip2E/srLf29aSBAAJ Originally we had …
Maybe outlining some more details would help with estimation: …
I'm starting to think that Google Dataset Search prefers the creator property. Unless someone finds something different, I'd propose that instead of the author property Dataverse uses the creator property.
(Google doesn't like affiliation when the author isn't typed as a Person, since schema.org defines affiliation only for Person.) Also, Google's Structured Data Testing Tool is no longer showing errors when author or creator types are missing (and it defaults to "Thing" instead of Person or Organization), although I still agree that Dataverse's schema.org metadata should say whether dataset authors are people or organizations.
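To make the proposal concrete, here's a minimal sketch of what the creator property could look like in the exported JSON-LD, with an explicit type and the affiliation nested under the Person (illustrative names; the exact shape of Dataverse's export may differ):

```json
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "name": "Example Dataset",
  "creator": [
    {
      "@type": "Person",
      "name": "Jane Smith",
      "affiliation": {
        "@type": "Organization",
        "name": "Example University"
      }
    }
  ]
}
```

Since schema.org defines affiliation only for Person, nesting it under a typed Person entry should also avoid the affiliation complaints mentioned above.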
Today during sprint planning @jggautier explained his hunch on how switching from the author property to the creator property might fix this.
I asked for clarification in the structured data section of Google's webmaster forum. "Creator" seems like the more widely used property, but DataCite is using "author" with an explicit type.
Hi Julian,
Thanks very much for following up on this. Changing the "author" property to "creator" seems like a good idea. Let's see how Google will respond to this!
Author names are showing up on some but not all Google Dataset Search pages for datasets in Dataverse repositories, like this page for a dataset from the Texas Data Repository (TDR) and this page for a dataset from Harvard Dataverse. But those pages also say the metadata is coming from DataCite, which publishes its own schema.org metadata and uses only the "author" property, but includes its guessed name types. Harvard Dataverse upgraded to Dataverse 4.10.1 two days ago (Jan 8), which included adding the "creator" property to the schema.org metadata. Once Google starts indexing more recently published datasets, we can see if authors are displayed on Google Dataset Search pages (especially when Google isn't also using DataCite's schema.org metadata).
@jggautier bummer. Does that mean we should try adding …?
That, or we could do what @mfenner wrote in #2243 that DataCite does, which is basically to guess (with >90% accuracy).
@pdurbin and @jggautier thank you for looking into this problem. I thought I would add my findings here if that will help with making the changes in the next version. What I discovered using the Structured Data Testing Tool is that for NTU datasets the author fields show as Thing. If I change the "@type" to "Person" for one of the authors, the tool doesn't show any error. I think we must include "@type": "Person" for all the authors in the ld+json in the script; then the Google Dataset Search results page will show the author names under Person. I see from an example of a GBIF (the Global Biodiversity Information Facility) dataset record that the Google Dataset Search results page displays author names typed as "@type": "Person"; please refer to the screenshot below. I think we need to add "@type": "Person" in Dataverse to show the author names on the Google Dataset Search page.
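Spelled out as markup, the fix being suggested is roughly the difference between these two author entries (hypothetical name; the first, untyped form is what the testing tool interprets as Thing):

```json
{ "author": [ { "name": "Jane Smith" } ] }
```

versus:

```json
{ "author": [ { "@type": "Person", "name": "Jane Smith" } ] }
```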
@Venki18 hi! Yes, the "every person and organization is a Thing" problem is well known to us, unfortunately. I've been hoping we can use some new code added by @fcadili in pull request #4664 to pass in a string that could be either a person or an organization, and the code will tell us which it is. I haven't studied the code yet, but here's a test he wrote that shows the code figuring out if a string is for an organization or a person, for example: …
@pdurbin thank you for the quick reply. May I ask how the export for TermsOfUse works? We have been using CC-BY-NC instead of CC0 and we have changed the necessary text in the Bundle.properties file, but we are using the CC0 code as it is. Hence when you export to the ld+json format, it is exported as CC0. For all our datasets with waiver terms, the code exports what is entered in the additional text box; the default CC0 is taken as it is.
@Venki18 I'm not sure, but let me at least give you and @Thanh-Thanh and others some pointers to the code: it looks like if CC0 isn't specified, the code will put in the free-form text the user entered as an alternative to CC0. This is somewhat off topic for this issue, of course, but I hope this helps! 😄 Please feel free to create as many issues as we need!
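For reference, schema.org's license property accepts either a URL or a CreativeWork, so the two cases being discussed could look roughly like this in the export (a sketch under that assumption, not necessarily Dataverse's exact output):

```json
{ "license": "https://creativecommons.org/publicdomain/zero/1.0/" }
```

for CC0, versus something like:

```json
{
  "license": {
    "@type": "CreativeWork",
    "name": "Custom Dataset Terms",
    "text": "Free-form terms of use entered by the depositor."
  }
}
```

for the free-form waiver text.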
@Venki18 also, if you're interested, @rigelk and I are talking about Schema.org JSON-LD, especially in relation to ActivityPub (#5883), in chat. You can catch up on the conversation at http://irclog.iq.harvard.edu/dataverse/2019-06-05
Just to register our interest in fixing this, using the DataCite strategy of guessing: would you take a fix along those lines as a PR? Otherwise we'd address it locally -- not having authors in Google Dataset Search is a bummer.
I think we were worried about how well the DataCite strategy of guessing the name type (#2243 (comment)), used in the Dataverse software's OpenAIRE metadata export, would work for all Dataverse installations. These are some or hopefully all of the next steps proposed already in another GitHub issue (which I can't find right now), some by @qqmyers: …
I think a combination of these things could be done, like testing the algorithm and adding a way for installations to say that they don't want to use it. In this case I'd advocate for providing installations with guidance about how to evaluate the accuracy of the algorithm. If GUI work could be done to let depositors choose the name type when they enter names, installations could also decide not to use the algorithm to add "type" metadata to the author/creator names of datasets already created in their installations.

But one blocker might be a lack of resources (people and time) to review the algorithm and design and test GUI changes. Maybe a review of the algorithm's accuracy would be less resource intensive if it was distributed among a representative sample of Dataverse installations, like those with names from different cultures in their metadata. Each installation could test the accuracy of their own metadata and report back, and the community could make a decision about what to do then.

The algorithm is being used to determine the name type for names entered in other fields, like the Contact Person field, but I'm also assuming that we would prioritize identifying the types of names in the author/creator fields, because Google Dataset Search usually displays those names but isn't doing so when there's no name type in the Schema.org metadata.
Thanks for responding so quickly!
I understand this, but I'm wondering whether this is the right trade-off: am I overlooking a massive cost to incorrectly labelling a creator?
I agree. Maybe making sure author names are displayed in Google Dataset Search outweighs any adverse effects of calling a person an organization or vice versa, and right now the only adverse effect we can think of is that the citations would look a bit off. I don't think there's any technical lock-in, if I get your meaning; I think it would be easy to improve this over time, right? For example, repositories could run an improved algorithm and create or overwrite dataset versions with improved metadata, however that metadata is corrected.

The lock-in might be more behavioral, and I'm probably suggesting that we front-load research because I always worry that the effectiveness of the changes that make it into each Dataverse release won't be evaluated soon enough, if at all. By the time we see that something needs to be fixed, the problem could be much more difficult to fix. For example, I'm thinking of the problems I wrote about in #5920. The scale and consequences of those problems haven't been explored, but I think it wouldn't be great for discoverability and access if datasets that aren't really "closed access" are labelled as such.

Maybe it's enough that there's this record of a conversation about "what could go wrong with this decision" and an agreement that we don't need to more proactively evaluate the changes if fixing any problems won't become harder over time, in this case as more and more datasets are created.
Right, that's what I mean -- I don't think this would make implementing either better algorithmic solutions or a manual UI for type selection in the future harder.
That's fair. I guess beyond better algos (which I expect would be hard) the relevant question is whether, and to what extent, there is the need to allow author-type selection in the GUI, which would obviously be more precise in theory, but is also another UI feature and itself prone to user error, especially for large, self-curated repositories like Harvard DV.
Yes, the only way I could think of is to have people look at a sample of the guesses that the algorithm makes. But looking at the discussion in this issue some more, I remember that Martin Fenner said the algorithm works over 90% of the time, and I think DataCite would have tested it on a much greater number and variety of names than we could. So maybe that's enough to be confident that the algorithm would be accurate enough, and instead we could consider, here or later on, why someone would want to correct a guess that the algorithm made and how they might be able to.
Great, yes, let's make that the plan!
@adam3smith, about planning to review why someone would want to correct a guess that the algorithm made and how they might be able to, I wanted to clarify that I think this should be done before the algorithm is used to add metadata to other exports, like the Schema.org export. What do you think?

Very early in the Schema.org conversation and development work, I asked a liaison at Google why Google Dataset Search would insist on a nameType, but I thought the answer was vague and left a lot of room for speculation (like, their knowledge graph could take advantage of knowing more about dataset authors). At the time I think that made it easy for us to feel okay with ignoring the nameType error we kept seeing when checking the Schema.org exports (until we started noticing that if the dataset didn't have a person or organization nameType, the author names wouldn't show up in Google Dataset Search).

But since you mentioned that the algorithm guessing wrong might result in an off-looking citation, and Martin Fenner wrote earlier that the algorithm is necessary for generating correct citations, maybe we could start by assuming that the only way a depositor or curator would notice a wrong guess is if they saw a citation of their dataset that was off. They could always just correct the autogenerated citation themselves, right? I remember having to do this a lot when managing citations in Zotero. But what are the chances that this becomes a big inconvenience and someone wants a curator or repository manager to correct the generated citations? In your experience, is the risk of that happening so low that we don't need to consider how a depositor or curator could have the guess corrected? You mentioned that this won't happen in QDR, but might it in other Dataverse repositories?
I think for personal authors this is a real issue, and it comes up with poor metadata in Zotero a fair amount. But the only scenario for which we'd therefore need new functionality is an organization author that gets misclassified by the algorithm, either because it has a comma in its name or because it has a first word that looks like a first name. This, I think, is going to be incredibly rare, and I'd argue it's not worth considering. The chances of it happening could be further reduced if the last step of the algorithm were skipped, i.e. if we treated all names without a comma as organizations, but I don't think that's a good idea given the trade-off of mislabelling a large number of 'incorrectly' added persons.
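To make the trade-off concrete, the guessing heuristic described above would classify names roughly like this in the export (illustrative names only, not DataCite's or Dataverse's exact logic):

```json
{
  "creator": [
    { "@type": "Person", "name": "Smith, Jane" },
    { "@type": "Organization", "name": "Institute for Social Research" },
    { "@type": "Person", "name": "Mary Lyon Centre" }
  ]
}
```

The first is typed Person because of the comma; the second is typed Organization because it has no comma and its first word doesn't look like a given name; the third illustrates the rare misclassification case, an organization guessed to be a Person because its first word looks like a given name.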
Great points! Thanks @adam3smith.
This PR is related, at least: … Might fix it? Or go a long way? I'm not sure! 😄
It probably does close it, but there's a lot of discussion above. Perhaps whatever's left, if anything, can be a new issue?
Thanks @qqmyers for always helping make sure things don't fall through the cracks! The two things I see in this issue that aren't addressed in PR #9089 are: …
Author names are missing from the dataset records that Google Dataset Search indexed from Dataverse.
The JSON-LD schema for authors should be updated to "@type": "Person" or other appropriate types, to differentiate between organizations as authors and individuals as authors.