getpapers has many fewer hits than EUPMC interface #140

Open
petermr opened this issue Nov 21, 2016 · 7 comments
@petermr
Member

petermr commented Nov 21, 2016

from a correspondent:

I installed ContentMine on my Mac laptop and tried to apply it to my research topic, “Postdoc career outcome”. I was able to get 78 open-access full-text papers. See the “getpapers” log output below:

$ getpapers -q "postdoc career outcome" -o PDcareer -x

info: Searching using eupmc API

info: Found 78 open access results

Retrieving results [==============================] 100% (eta 0.0s)

info: Done collecting results

info: Saving result metadata

info: Full EUPMC result metadata written to eupmc_results.json

info: Individual EUPMC result metadata records written

info: Extracting fulltext HTML URL list (may not be available for all articles)

info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt

info: Got XML URLs for 78 out of 78 results

info: Downloading fulltext XML files

Downloading files [=======================] 100% (78/78) [4.2s elapsed, eta 0.0]

info: All downloads succeeded!

I did the same search through the “Europe PMC” web interface and got 297 results in total, of which 296 are full-text articles and 172 are open access. See the screenshot below.

My questions are:

  1.   Why did “getpapers” retrieve far fewer papers than “EUPMC” provides, 78 vs. 172 (or 296)? Is this caused by the limited coverage of journal scrapers?

  2.   Not all of the retrieved papers are relevant to my research topic, so manual filtering may be needed. Is it possible to give “getpapers” a list of PMC IDs for paper retrieval?

  3.   For my research topic, I really need to get researchers' names, affiliations, contributions, and bibliometrics (citation count, h-index, journal impact factor) from journal papers. This cannot be done with the standard ContentMine tools, which extract information about sequences, genes, species, and word counts. How do I develop my own “ami2” plugins to extract the facts that I am interested in?
    

Thank you so much for developing this great open-source software! I’m looking forward to hearing from you soon.
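
On question 2 above: one possible workaround, sketched here as an assumption rather than a documented getpapers feature, is to turn the list of PMC IDs into a single query string built from the Europe PMC `PMCID:` search field and pass that to `-q`. The helper names and the example IDs below are hypothetical placeholders.

```python
def normalise_pmcid(raw):
    """Return an ID in the form 'PMC1234567', accepting '1234567' or 'pmc1234567'."""
    text = str(raw).strip().upper()
    digits = text[3:] if text.startswith("PMC") else text
    if not digits.isdigit():
        raise ValueError(f"not a PMC ID: {raw!r}")
    return f"PMC{digits}"


def pmcid_query(ids):
    """Join IDs into one OR query, dropping duplicates but keeping their order."""
    unique = list(dict.fromkeys(normalise_pmcid(i) for i in ids))
    if not unique:
        raise ValueError("no PMC IDs given")
    return " OR ".join(f"PMCID:{pmcid}" for pmcid in unique)


# Placeholder IDs for illustration only; substitute the manually filtered list.
print(pmcid_query(["PMC1234567", "7654321", "pmc1234567"]))
```

The printed string could then be used in place of the free-text query, for example `getpapers -q '<printed query>' -o PDcareer -x`. Very long ID lists may need to be split into several smaller queries.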

@grabear

grabear commented Apr 12, 2017

I'm having a similar problem. I'm only getting around 10 papers when using an EUPMC 'AUTH:' query, when I know I should be getting nearly 60 papers.

@blahah
Member

blahah commented Apr 12, 2017

We get our results directly from the EUPMC API, so this sounds like an API bug. @tarrow can you follow up with EUPMC?
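
One way to check this independently of getpapers is to ask the Europe PMC REST search endpoint for its hit count directly. The following is a minimal sketch, not getpapers' own code: the function names are hypothetical, and the `OPEN_ACCESS:y` clause is an assumption meant to mirror the "open access results" restriction shown in the getpapers log.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Europe PMC REST search endpoint; check the current API docs before relying on it.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, open_access_only=True):
    """Return a search URL that asks for JSON and a single result row."""
    if open_access_only:
        # Assumption: mirrors the open-access restriction reported by getpapers.
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "resultType": "idlist", "pageSize": 1}
    return f"{SEARCH_URL}?{urlencode(params)}"


def parse_hit_count(response_text):
    """Extract the total hit count from a JSON search response."""
    return int(json.loads(response_text)["hitCount"])


def fetch_hit_count(query, open_access_only=True):
    """Send the request and return the hit count (needs network access)."""
    with urlopen(build_search_url(query, open_access_only), timeout=30) as resp:
        return parse_hit_count(resp.read().decode("utf-8"))
```

Calling `fetch_hit_count("postdoc career outcome")` with and without `open_access_only` gives two numbers to set against the 78 from getpapers and the 172 and 297 from the web interface.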

@sedimentation-fault

@blahah, @tarrow,

as scientists and as developers it is imperative to keep an open mind. Pure logic suggests that only the following can be inferred with certainty:

  • Either the EUPMC API has a bug,
  • or the EUPMC online interface has a bug.

Being in favour of either inference, without checking the facts, is detrimental to both research and development.

I say this because I strongly suspect that it is the EUPMC online interface, rather than the API, that has a bug - or a feature, depending on interpretation. :-)

Let's do an EUPMC search with an undocumented search feature (undocumented in the API, but fully working both online and in the API): search by SUBJECT.

NOTE: Subject areas, as they are called by PLOS, are documented here: Search PLOS by Subject Areas

Online

Open

http://journals.plos.org/plosone/search

Make sure you click on 'Advanced Search', then click on 'All Fields' and select 'Subject'. Type 'Algebra' as the subject. I have checked that this is a valid 'subarea' by looking at the HTML code of http://journals.plos.org/plosone/browse/mathematics?resultView=list (if you are the 'GUI type', you can instead click on the unintuitive 'down arrow', which looks like an inverted ^, on the black navigation bar). You get a page saying

"1,208 results for subject:algebra"

(notice the notation - that's undocumented!), all neatly listed in 81 pages of 15 results each:

http://journals.plos.org/plosone/search?q=subject%3AAlgebra&page=1
up to
http://journals.plos.org/plosone/search?q=subject%3AAlgebra&page=81

EUPMC API (undocumented)

Gathering hope from the above, let's try the undocumented

SUBJECT: XXX

method in the API using getpapers:

getpapers --query 'SUBJECT:"algebra" JOURNAL:"PLOS ONE"' --outdir algebra -p -n

info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 197 open access results

197 results for algebra ONLY? This cannot be explained solely on the grounds of the open/closed access dichotomy - so who is right?

Well, I did NOT check ALL 1000+ results after result no. 197 in the online output - but the few I checked convinced me that everything after roughly result 200 in the online output is unrelated to algebra, even in the loosest sense of "related"! It's all biology - and no algebra at all.

The right results, the results that seem to implement what is said in Search PLOS by Subject Areas, are the ones served by the API!

My explanation for this discrepancy is that the online interface adds results from a different source to the ones returned by the exact query, possibly in an attempt to increase what search specialists call recall - losing what they call precision...
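
To make the recall/precision trade-off concrete, here is a small worked example using the two counts from this comment. It rests on a strong assumption that has not been verified: that exactly the 197 API results are relevant and that all of them also appear among the 1,208 online results.

```python
def precision_recall(retrieved, relevant_retrieved, relevant_total):
    """Return (precision, recall) for one result list.

    precision = relevant results retrieved / all results retrieved
    recall    = relevant results retrieved / all relevant results that exist
    """
    if retrieved <= 0 or relevant_total <= 0:
        raise ValueError("counts must be positive")
    return relevant_retrieved / retrieved, relevant_retrieved / relevant_total


# Counts from the thread; treating 197 as the full relevant set is an assumption.
ONLINE_RESULTS = 1208
API_RESULTS = 197
ASSUMED_RELEVANT = 197

online_precision, online_recall = precision_recall(ONLINE_RESULTS, ASSUMED_RELEVANT, ASSUMED_RELEVANT)
api_precision, api_recall = precision_recall(API_RESULTS, ASSUMED_RELEVANT, ASSUMED_RELEVANT)

print(f"online: precision={online_precision:.3f} recall={online_recall:.3f}")
print(f"api:    precision={api_precision:.3f} recall={api_recall:.3f}")
```

Under that assumption the online list has a precision of 197/1208, about 0.163, while the API list has a precision of 1.0, and both have the same recall. If the online interface does add relevant papers that the API misses, the API's recall would drop below 1.0, and that is the part that still needs checking.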

So what do we learn from this? Three things:

  1. This issue may not be a bug, after all. Of course, more inquiry is needed.
  2. How to search by subject using the EUPMC API - use SUBJECT:"your subject" inside your query string.
  3. To keep an open mind. :-)

@sedimentation-fault

For the sake of completeness: this issue is a duplicate of #95

@sedimentation-fault

@petermr ,

I am starting to realize that, if the method I found for searching by (sub)category/(sub)subject area is really new, it opens up many new possibilities. This is because these categories/subject areas implement a kind of semantic ontology - see Search PLOS by Subject Areas for how they were created. You might want to experiment with SUBJECT:XXX and blog about it to spread the word. ;-)

@sedimentation-fault

Searching for subject 'mathematical economics' gives a different picture that seems to contradict my above observations.

With getpapers:

getpapers --query 'SUBJECT:"mathematical economics" JOURNAL:"PLOS ONE"' --outdir "mathematical-economics" -p -n

info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 1 open access results

I get exactly one result:

Reliability of MR-Based Volumetric 3-D Analysis of Pelvic Muscles among Subjects with Low Back with Leg Pain...

which is definitely NOT a mathematical economics paper.

On the other hand, trying the online interface at

http://journals.plos.org/plosone/browse/mathematical_economics

brings up 31 results which DO look like mathematical economics papers!

So - again - who is right? The API, or the online interface?

At this point, I have only theories to offer - theories that need experimental evidence:

  • Maybe the online interface enriches the API results with results from some other query that happens to be much more effective (read: precise). In this case, the API is correct, but imprecise.
  • Maybe getpapers uses the API correctly but sends the wrong query. That is, querying for subject "mathematical economics" (a quoted string that contains blanks and is supposed to search for the exact phrase) through getpapers does not search for what one would expect. For example, instead of the phrase "mathematical economics", it might search for "mathematical AND economics" which, for subject queries, may be subtly different.
  • Maybe getpapers has a bug in its EUPMC API processing after all. In this case it should be easier to reproduce it, since you have a query that produces 1 vs. 31 results.
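
The second theory above could be tested by running several spellings of the same search and comparing the counts. The sketch below only builds the query strings. The function name is hypothetical, and the `SUBJECT:` field is the undocumented one discussed in this thread, so its exact behaviour is an assumption.

```python
def subject_query_variants(subject, journal):
    """Build alternative query strings for one subject within one journal."""
    words = subject.split()
    journal_clause = f'JOURNAL:"{journal}"'
    all_words = " AND ".join(f"SUBJECT:{word}" for word in words)
    return {
        # Same form as the query used earlier in this thread.
        "exact phrase": f'SUBJECT:"{subject}" {journal_clause}',
        # Every word must match the subject field, in any order.
        "all words": f"({all_words}) {journal_clause}",
        # No field restriction: the phrase may occur anywhere in the record.
        "free text": f'"{subject}" {journal_clause}',
    }


for label, query in subject_query_variants("mathematical economics", "PLOS ONE").items():
    print(f"{label}: {query}")
```

Each printed query can be passed to `getpapers --query '...' --outdir <dir> -p -n` as above. If "exact phrase" and "all words" return the same count, the phrase is probably being split into words somewhere along the way.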

@sedimentation-fault

It seems that Europe PMC (EUPMC) has listened to complaints about sudden API changes and has modified its procedures.

I have just stumbled upon the EUPMC SOAP Web Service Reference Guide. There, in the Introduction (p. 6 of the document, p. 7 of the PDF file), it says:

From January 2016 a new web service release procedure has been introduced. This allows two versions of the web service to be simultaneously available. This approach to release management will allow users to prepare for a new version, rather than having to immediately respond to a version change. The details of web service releases will be communicated to all known users. A mailing list of users is compiled from those that have supplied an email address in the ‘Email’ parameters of the various methods available.

You can thus

  • use two versions of the API at any given time, the new and the last stable one, and
  • add yourself to a mailing list and keep up with the API changes by passing your email address to one of the API methods.
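
As an illustration of the second point, here is a hedged sketch of attaching a contact address to a search request. The quoted guide describes the SOAP service; this example assumes the REST search endpoint accepts an equivalent optional `email` parameter, which should be confirmed against the current REST documentation. The function name and the example address are placeholders.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint; verify against the current API docs.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def search_url_with_contact(query, email):
    """Build a search URL that also carries a contact email address."""
    if "@" not in email:
        raise ValueError(f"not an email address: {email!r}")
    # Assumption: 'email' is accepted as an optional query parameter.
    params = {"query": query, "format": "json", "email": email}
    return f"{SEARCH_URL}?{urlencode(params)}"


print(search_url_with_contact('JOURNAL:"PLOS ONE"', "someone@example.org"))
```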
