Integrate paper metadata handling with Wikidata #126

Open
Daniel-Mietchen opened this issue Oct 10, 2016 · 3 comments

@Daniel-Mietchen

commented Oct 10, 2016

This post is inspired by the BibTeX from Wikidata functionality described in https://larsgw.blogspot.de/2016/09/citationjs-on-command-line.html .

Some thoughts on how to integrate ContentMine's paper metadata handling with Wikidata:

  • if a ContentMine pipeline (or any reference file in BibTeX or similar format, for that matter) touches bibliographic metadata of scholarly articles, check whether Wikidata items for these articles already exist (e.g. via P932, P356, P698).
    • if yes, it might simply trigger an integrity check of these metadata, perhaps identify the main topic (P921) or do nothing for the moment
    • if no, it should start the missing items with at least some basic properties (e.g. P31:Q13442814 and the respective value for a persistent identifier). If this would leave the items incomplete with respect to Wikidata's data model for scholarly articles, the missing pieces could be handled by the mostly existing pipelines around constraint violations.
  • in addition to the existing ContentMine pipelines to search by dictionaries, it might be interesting to have some functionality to search the literature (across all or selected dictionaries) by contributions from particular authors, institutions, journals, dates or some such, with which Wikidata could help
  • What about running ContentMine over Wikipedia dumps to identify facts?
    • if these facts are referenced on Wikipedia to scholarly sources, ContentMine could check whether the indicated sources actually support the statement, and flag cases where that's not clear
    • if the Wikipedia statements lack scholarly references, ContentMine might be able to find some
    • as above, the metadata of the scholarly references would go to Wikidata, from where it might be pulled into the respective Wikipedia article by way of some variant of Module:Cite.
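
The existence check and the "start the missing items" step from the first bullet can be sketched as follows. This is a minimal illustration, not an existing ContentMine component: the function names are hypothetical, the normalization rules (uppercase DOIs, PMCIDs stored without the `PMC` prefix) are assumptions about Wikidata's conventions, and the output of `build_create_commands` assumes the tab-separated QuickStatements (version 1) command syntax. The query string is meant for the Wikidata Query Service; nothing here touches the network.

```python
import re

# Wikidata property IDs named in the list above.
ID_PROPERTIES = {"doi": "P356", "pmid": "P698", "pmcid": "P932"}
SCHOLARLY_ARTICLE = "Q13442814"  # value used with P31 (instance of)


def normalize_identifier(kind, value):
    """Bring an identifier into the form assumed to be stored on Wikidata."""
    value = value.strip()
    if kind == "doi":
        # Assumption: DOIs are stored bare and in uppercase.
        value = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", value, flags=re.I)
        return value.upper()
    if kind == "pmcid":
        # Assumption: PMCIDs are stored without the "PMC" prefix.
        return re.sub(r"^PMC", "", value, flags=re.I)
    return value


def build_lookup_query(identifiers):
    """Return a SPARQL query matching items that carry any of the identifiers."""
    if not identifiers:
        raise ValueError("at least one identifier is required")
    clauses = []
    for kind, value in identifiers.items():
        literal = normalize_identifier(kind, value)
        # Escape backslashes and quotes so the literal stays well-formed.
        literal = literal.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append('{ ?item wdt:%s "%s" }' % (ID_PROPERTIES[kind], literal))
    return "SELECT DISTINCT ?item WHERE { %s } LIMIT 10" % " UNION ".join(clauses)


def build_create_commands(identifiers):
    """Return QuickStatements-style commands for a minimal new item."""
    lines = ["CREATE", "LAST\tP31\t" + SCHOLARLY_ARTICLE]
    for kind, value in identifiers.items():
        lines.append(
            'LAST\t%s\t"%s"' % (ID_PROPERTIES[kind], normalize_identifier(kind, value))
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder identifiers, not a real article.
    ids = {"doi": "https://doi.org/10.1000/abc", "pmid": "123"}
    print(build_lookup_query(ids))
    print(build_create_commands(ids))
```

A pipeline would run the query first and only emit the create commands when no item comes back; any further properties would be left to the constraint-violation workflows mentioned above.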
@tarrow

Contributor

commented Oct 10, 2016

These are definitely interesting ideas.

The first idea seems to me the most readily actionable at the moment. What sort of workflow do you envisage for this? I hope this doesn't come across as a barrage of questions; I just want to check that I've understood it all correctly.

Should we build a bot and seek approval from the community to add these items? In the call we mainly discussed interacting via the primary-sources tool, although we mostly talked about 'facts' rather than paper metadata. Perhaps, given that this data is already 'curated' by either the NCBI or CrossRef, this isn't such a problem.

Which metadata should we consider adding? If we are looking at all the new publications on a given day, should these all be added to Wikidata? My impression from WikiCite was that this kind of blanket addition, where there is neither a structural need nor real notability for a given publication, should be avoided (and perhaps such items should go into Librarybase instead?).

One approach I started with for Librarybase was to import only works that are referenced on enwiki (though we could choose any or all wikis).

@Daniel-Mietchen

Author

commented Oct 10, 2016

I think having Wikidata entries for all works cited from enwiki is reasonable, and expansion to all works cited by any Wikimedia project (at least from their content namespaces) should come as soon as possible thereafter.

Blanket addition beyond that may cause problems but still makes sense in the long run, as that would contribute to the goal of turning Wikidata into an open citation graph. So anything cited anywhere from "within scope" would eventually get a Wikidata item, and after some time, I could well imagine encouraging people to upload their BibTeX files to some tool on Wikimedia labs that would then check these files against the Wikidata corpus and add info / flag inconsistencies as needed.
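
The "check these files against the Wikidata corpus and add info / flag inconsistencies" step could look roughly like the sketch below. It is only an illustration under stated assumptions: both records are taken to be already parsed into plain dictionaries with shared field names (`title`, `year`, `journal`, `doi` are hypothetical choices), and the comparison is a simple normalized string match rather than anything Wikidata-specific.

```python
def normalize_text(value):
    """Strip BibTeX braces, collapse whitespace and ignore case."""
    text = str(value).replace("{", "").replace("}", "")
    return " ".join(text.split()).casefold()


def compare_records(bibtex_entry, wikidata_entry,
                    fields=("title", "year", "journal", "doi")):
    """Sort each field into match / conflict / missing on one side."""
    report = {
        "match": [],
        "conflict": [],
        "missing_in_wikidata": [],  # candidates for "add info"
        "missing_in_bibtex": [],
    }
    for field in fields:
        local = bibtex_entry.get(field)
        remote = wikidata_entry.get(field)
        if local and not remote:
            report["missing_in_wikidata"].append(field)
        elif remote and not local:
            report["missing_in_bibtex"].append(field)
        elif local and remote:
            same = normalize_text(local) == normalize_text(remote)
            report["match" if same else "conflict"].append(field)
        # Fields absent on both sides are skipped.
    return report


if __name__ == "__main__":
    # Made-up example records, not real metadata.
    bib = {"title": "{Zika} virus  genomes", "year": "2016", "doi": "10.1000/abc"}
    wd = {"title": "Zika virus genomes", "year": "2015", "journal": "Some Journal"}
    print(compare_records(bib, wd))
```

Fields listed under `missing_in_wikidata` are candidates for addition, while `conflict` entries are the ones to flag for human review.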

Perhaps we can start by sharing, in a standard fashion, the publications that CM has read on a given day, possibly along with things mined from them? We could then go over that feed and hopefully become more specific about the respective workflows for Wikidata/Librarybase, and demo things with Zika.

@petermr

Member

commented Oct 10, 2016

I think these are all in scope for the WikiFactMine project - it will
depend on details.

On Mon, Oct 10, 2016 at 4:46 AM, Daniel Mietchen notifications@github.com
wrote:

This post is inspired by the BibTeX from Wikidata functionality described
in https://larsgw.blogspot.de/2016/09/citationjs-on-command-line.html .

Yes - Lars has done a super job.

Some thoughts on how to integrate ContentMine's paper metadata handling
with Wikidata:

  • if a ContentMine pipeline (or any reference file in BibTeX or
    similar format, for that matter) touches bibliographic metadata of
    scholarly articles, check whether Wikidata items for these articles already
    exist (e.g. via P932, P356, P698).
    • if yes, it might simply trigger an integrity check of these
      metadata, perhaps identify the main topic (P921) or do nothing for the
      moment
    • if no, it should start the missing items with at least some basic
      properties (e.g. P31:Q13442814 and the respective value for a persistent
      identifier). If this would leave the items incomplete with respect to
      Wikidata's data model for scholarly articles, the missing pieces could be
      handled by the mostly existing pipelines around constraint violations.

I think this is a great place to start learning about Wikidata and
normalized metadata. Because the scholarly literature is not pr

    • in addition to the existing ContentMine pipelines to search by
      dictionaries, it might be interesting to have some functionality to search
      the literature (across all or selected dictionaries) by contributions from
      particular authors, institutions, journals, dates or some such, with which
      Wikidata could help

Yes - bibliography is starting to emerge as critical and I think we can
and should address it. We can't do all of it, but it needs to be integrated
into the facts.

  • What about running ContentMine over Wikipedia dumps to identify
    facts?
    • if these facts are referenced on Wikipedia to scholarly sources,
      ContentMine could check whether the indicated sources actually support the
      statement, and flag cases where that's not clear
    • if the Wikipedia statements lack scholarly references,
      ContentMine might be able to find some
    • as above, the metadata of the scholarly references would go to
      Wikidata, from where it might be pulled into the respective Wikipedia
      article by way of some variant of Module:Cite.

I'll bounce this around with Magnus


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
