Lens reader for an NLM XML by URL #6

Closed
MikeTaylor opened this issue Jun 6, 2013 · 62 comments

Comments

@MikeTaylor

It would be great to be able to read ANY valid NLM XML file using Lens just by specifying the URL. Something like
http://lens.elifesciences.org/xml-url/https://peerj.com/articles/36/

@michael
Member

michael commented Jun 6, 2013

Not sure if on the fly conversion performs well enough. I thought about making Lens aware of different content repositories though, which means that content providers could make their repository available by exposing a service conforming to the JSON format Lens can read.

E.g.
http://lens.elifesciences.org#peerj would let you browse within the available articles of that repository.

More work needs to go into the specification of the Lens Document Format as well as documenting the conversion process and providing tools that help with the conversion process.

@MikeTaylor
Author

This seems like a good route to take.

Although on-the-fly conversion would surely be a useful tool for testing
and debugging. Maybe you could consider making it available for beta
partners so they can report to you where they find problems in the
conversion or display?

@michael
Member

michael commented Jun 6, 2013

Our converter will be open sourced soon, we just haven't had the time to review the codebase and get the documentation right. It would be great to have support from other Open Access publishers with regards to accessing fresh content. Lens could also be a useful tool outside of the science community, e.g. for viewing software documentation etc... But then again... one step at a time. :)

@ivangrub
Contributor

The converter is currently built to support the XML standard described at http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html . Some edge-case errors might still come up, but they should be debugged very quickly.

I will look into building a more robust API for the converter so that it is easier to plug into different workflows. The main issue with on-the-fly client side conversions is that each publisher would have to provide definitions for their figure URLs that would either have to be included in the converter, or added as a post processing step in that publisher's workflow. I think a static repository supported by each publisher would probably be the ideal way to move forward as well.

@MikeTaylor
Author

Why does the NLM->JSON converter need to know the individual publishers' figure URL conventions? (I'm not saying it doesn't, I'm just curious as to what could invoke such a requirement.)

@ivangrub
Contributor

Unfortunately, the XML tags do not provide a src attribute. Through a little hacking it is possible to figure out how to stitch together the URL, but that is a tedious and error-prone process.

The image and video nodes in the JSON contain a url property which needs to point to the image or video to be displayed. If that property is empty, then the article will render fine, except you will be missing all of the media bits.

@MikeTaylor
Author

Wait ... NLM, the universally used canonical format for representing scholarly articles ... HAS NO WAY TO LINK TO THE FIGURES? Did I understand you right?

@ivangrub
Contributor

Yes. The figure tags contain a graphic-id attribute that gives the figure's name, which can be stitched together with an image-style extension. This depends on having local storage of all the article's media (images, video, supplementary material, source code, etc.) in the same path as the article's XML though. Universal access to the figures is not available at the moment. Or at least I have yet to see it. If you know of a way to do this, I would be more than happy to hear about it and quickly implement it into the converter.

@MikeTaylor
Author

Holy poo. Well, I am astonished at such designed-in brain-damage. Sorry I can't help -- I am no NLM expert, and certainly don't know a better way. So, yes, your converter script will need, or need access to, a set of per-publisher or per-journal recipes for turning graphic-IDs into URLs.
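
To make the idea of per-publisher recipes concrete, here is a minimal Python sketch. It is only an illustration, not how the Lens converter works: the publisher keys, the `FIGURE_URL_RECIPES` table, the `figure_url` helper and the example.org templates are all hypothetical placeholders, not real publisher URL conventions.

```python
# Hypothetical recipes: publisher key -> URL template.
# The templates are placeholders for illustration only.
FIGURE_URL_RECIPES = {
    "publisher-a": "https://cdn.example.org/{article_id}/{graphic_id}.jpg",
    "publisher-b": "https://example.org/articles/{article_id}/figures/{graphic_id}",
}


def figure_url(publisher, article_id, graphic_id):
    """Return the figure URL for a graphic ID, or None if no recipe is known."""
    template = FIGURE_URL_RECIPES.get(publisher)
    if template is None:
        # No recipe: a converter would have to leave the url property empty.
        return None
    return template.format(article_id=article_id, graphic_id=graphic_id)
```

A converter could call something like `figure_url("publisher-a", "36", "fig-1")` for each figure and fall back to an empty url when it gets `None`.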

@rdmpage

rdmpage commented Jun 10, 2013

Mike, as an example, here's a fragment from ZooKeys:

<graphic xlink:href="ZooKeys-285-089-g001.jpg" position="float" orientation="portrait" xlink:type="simple"></graphic>

To render this you need to figure out what the full path is to the image. You don't get a URL :(

@rdmpage

rdmpage commented Jun 10, 2013

Of course, if you have a DOI for the figure you may have better luck...

@ivangrub
Contributor

It is quite unfortunate. I think part of the reason for this is that the URL path for each publisher is subject to change over time so instead of having to update the XML each time with the new URL, PubMed decided to just push for having local storage of all of the figures and associated files. That is why I think a local repository of the converted JSONs, per publisher, would be the best. We can work with the interested parties to add the converter to their workflows and then link these repositories with Lens.

Even having the figure DOI is not ideal though. In that case you would have to do an http.request to get the HTML of the DOI page, and then scrape it for the image src URL. In async workflows like node.js, you could do this, but it would take much longer than investing a little time up front in an organized workflow that publishes all of the associated files of an article's XML to Amazon or another cloud service where the URLs will not change.

@MikeTaylor
Author

I see from the ZooKeys example that the attribute in question is "href", which at least suggests that it names an HTTP-addressable resource relative to the address of the document that contains it. Doesn't it follow that if you download a ZooKeys XML from http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml , then an xlink:href of "ZooKeys-307-001-g001.jpg" refers to
http://www.pensoft.net/J_FILES/1/articles/5334/ZooKeys-307-001-g001.jpg ?

Answer 1 (from the W3C's XLink spec at http://www.w3.org/TR/xlink/#link-locators ) is that, yes, the href " must be a URI reference as defined in [IETF RFC 2396]". Answer 2 (from simple testing) is that, no, this URL doesn't work. Darn.
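
The resolution rule in Answer 1 can be tried out with Python's standard `urljoin`, which implements relative-reference resolution. This is just a sketch of the expectation, using the two URLs quoted in this comment; the variable names are illustrative.

```python
from urllib.parse import urljoin

# URL the XML was downloaded from, and the relative href found in it.
document_url = "http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml"
href = "ZooKeys-307-001-g001.jpg"

# With no xml:base, the document's own URL acts as the base.
resolved = urljoin(document_url, href)
print(resolved)
```

This yields the URL that one would expect to work; whether the publisher actually serves the image there is a separate matter, as Answer 2 shows.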

@rdmpage

rdmpage commented Jun 10, 2013

So, for each journal you need to figure out how they serve images, and everyone does it differently.

@MikeTaylor
Author

Surely in this case the ZooKeys document is flatly invalid.

@ivangrub
Contributor

@MikeTaylor, ideally that is exactly how this would end up working. Realistically, aside from having to hack each publisher with tweaks like what @rdmpage just noticed by adding 'export.php_files' to the URL path, we would have publishers get on board with reorganizing their URLs in a manner that actually makes sense.

Anyone want to start leaning on the publishers to have a standard URL structure?

@MikeTaylor
Author

Let's try some other publishers ... PeerJ next, as they're my favourites ...

@MikeTaylor
Author

My PeerJ article is https://peerj.com/articles/36.xml and has no "xml:base" attribute.

The first figure is expressed as:

which means that the URL of the actual figure should be
https://peerj.com/articles/fig-1.png

To my enormous disappointment, the figure is not there. Distressingly, the full-size version seems to be at https://dfzljdn9uc3pi.cloudfront.net/2013/36/1/fig-1-full.png

I expected better from PeerJ. We should probably get in touch with them.

@ivangrub
Contributor

Even if the URL path were something like this: https://peerj.com/articles/ARTICLE_ID/ that would be fine. Your figure would then be https://peerj.com/articles/36/fig-1-full.png and the XML would be at https://peerj.com/articles/36/36.xml.

If you manage to get in touch with PeerJ, you can point them to this issue thread and we can figure out how to best fix this problem.

@gnott
Member

gnott commented Jun 10, 2013

Tagging Figures, Graphics
http://dtd.nlm.nih.gov/publishing/tag-library/n-wq32.html

External Link
http://dtd.nlm.nih.gov/publishing/tag-library/n-hya0.html

In eLife XML it looks like the DOI URL is included in a <ext-link> tag, though following that URL will not necessarily bring you to the image file itself.

Another wrinkle is providing the same graphic in multiple sizes / resolutions, and multiple file formats.

@IanMulvany
Contributor

It's all rather unfortunate. It would be interesting to cross-post the issue to the JATS list. In the interim we now have content negotiation on eLife: http://www.elifesciences.org/elife-now-supports-content-negotiation/, but to mirror your comment, Mike: poo.

@hubgit

hubgit commented Jun 10, 2013

In JATS XML, I think the image paths are designed to be relative to the XML file when they're bundled together into a ZIP file for archiving. I'm not sure if this is how it's implemented by all publishers, though.

For PeerJ, an article at https://peerj.com/articles/36/ has an XML file at https://peerj.com/articles/36.xml and figures at https://peerj.com/articles/36/fig-1.png (for example).

@MikeTaylor
Author

Excellent. Then it seems that just adding the relevant xml:base attribute to PeerJ's XML will fix this.

@hubgit

hubgit commented Jun 10, 2013

In theory, yes, but it looks like articles with an xml:base attribute don't validate against the JATS DTD.

I've only dealt with images in cross-publisher articles when served from PMC, where they're all at predictable URLs.

@ivangrub
Contributor

The PMC URLs are easier to predict because they use the xlink:href in the graphics tag to point to the figure. The only issue there is getting the PMC ID for the article which is not in each publisher's XML. The best way to do it in that case is to probably do on-the-fly conversions by requesting the XML from PMC and pulling the PMC ID from their version of the XML.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3539393/bin/elife00170f001.jpg

which would translate to http://www.ncbi.nlm.nih.gov/pmc/articles/PMC_ID/bin/fig_xlink:href.jpg
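
As a sketch of that translation (in Python, purely for illustration), the pattern above can be written as a small helper. The function name `pmc_figure_url` and the default `.jpg` extension are assumptions taken from the single example URL here, not a documented PMC API.

```python
def pmc_figure_url(pmc_id, href, extension="jpg"):
    """Build a figure URL following the PMC pattern quoted above.

    pmc_id is the article's PMC ID (e.g. "PMC3539393") and href is the
    value of xlink:href on the graphic tag. The pattern is inferred from
    one example and may not hold for every article.
    """
    return f"http://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/bin/{href}.{extension}"


print(pmc_figure_url("PMC3539393", "elife00170f001"))
```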

@Daniel-Mietchen

Two pointers to related projects:

Apart from that, there are a number of problems in the XML delivered to PMC with regard to the signaling of licensing and MIME types (cf. http://chrismaloney.org/notes/OAMI%20JatsCon%20Submission,%202013 and http://outreach.wikimedia.org/wiki/GLAM/Newsletter/November_2012/Contents/Open_Access_report ), and these problems will also be mentioned in a breakout session on the reuse of OA materials in Wikimedia contexts at OAI8 (cf.
https://indico.cern.ch/contributionDisplay.py?sessionId=10&contribId=67&confId=211600 ).

@Klortho

Klortho commented Jun 11, 2013

@MikeTaylor wrote, "Surely in this case the ZooKeys document is flatly invalid." No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link.

I think the disconnect comes from two facts: JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want; and that the XML, if it is served at all, is usually served independent of the figures and other media (and, as Alf pointed out, it doesn't even allow an xml:base attribute). It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website.

@kaveh1000

JATS is full of surprising tags, e.g. , allowing the creation of valid documents with zero structure in the references, e.g. here.

A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue...

@MikeTaylor
Author

@Klortho wrote:

"'Surely in this case the ZooKeys document is flatly invalid.' No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link."

Yes, that is a much more precise way to articulate the problem. I shouldn't have used the word "invalid" which of course in the XML world means "not conforming to the schema".

"JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want."

Really? Seems like flagrant misuse of the XLink attributes to me. "xlink:href" has a specific meaning, detailed at http://www.w3.org/TR/xlink/#link-locators and dependent on the "xml:base" attribute for its interpretation. As the spec. says, "If the URI reference is relative, its absolute version MUST be computed by the method of [XML Base] before use".

"It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website."

... which is precisely why no-one does that.

@kaveh1000 writes:

"A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue..."

Except, bizarrely, the "xml:base" attribute that's needed to make this stuff work in a sane way. As this conversation progresses, it looks increasingly as though that is the underlying bug here, no?

At present, we have a worldwide standard representation of academic articles which does not contain the information of how to obtain figures. That is just crazy.

@ivangrub
Contributor

@hubgit I agree that it is not difficult to hack together a single publisher's URLs. Getting the figures via PMC is only easy, though, if your starting XML files are from PMC too (the URL depends on the PMC ID). In my opinion, there is really no need to have a unique identifier for each article other than the DOI.

Leveraging the 600,000+ open access articles on PubMed would be great, but it is difficult due to their terms of use, and I do not see much purpose in reinventing the wheel a few times over to provide the exact same service. We will be open sourcing the converter soon. At that point every publisher that is interested in using Lens can make their appropriate magic soup to pull out the figure URLs.

I spent a little time playing around with the XML to figure hacks for PLOS and others, but it is not as simple as @hubgit wrote:

"If an xml:base attribute is not present, the base URL of an XML document is the URL from which it's served, and links are built relative to that."

@hubgit

hubgit commented Jun 11, 2013

@ivangrub I agree with you entirely - I was just pointing out how xml:base works :-)

@Klortho

Klortho commented Jun 13, 2013

I added this comment to the NISO JATS spec comments list. Sadly, it's not a forum -- there's no discussion or any kind of back-and-forth. If I got anything wrong, or if you want to add your own point of view, the thing to do would be to add your own comment.

@hubgit

hubgit commented Jun 13, 2013

I've fixed PeerJ's article XML files (e.g. http://peerj.com/articles/36.xml) so they now use absolute URLs, rather than relative URLs, for the xlink:href attributes (i.e. they link to the actual image URLs).

These are the full-size PNG files though, which can be rather large; still looking for best-practices for marking up alternate formats.

@MikeTaylor
Author

Nice, thanks!

I have nearly learned better than to ask this kind of question, but not
quite, so here goes: SURELY the NLM/JATS schema has a way to express this
obvious and important concept?

@hubgit

hubgit commented Jun 13, 2013

Yes, wrapping the different formats in <alternatives> already allows alternative formats for the same element (e.g. MathML or graphic versions of a formula).

What I'm not sure about is if the attributes on the <graphic> element are expressive enough to allow a client to know which one to choose: the mimetype and mime-subtype attributes will allow a client to distinguish between formats, but there isn't a width, height or file size attribute, as far as I can tell.
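
A rough Python sketch of how a client might choose between variants using only those two attributes. The `SAMPLE` fragment and the `pick_graphic` helper are made up for illustration; the file names are hypothetical and this is not taken from any publisher's XML.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical fragment: one figure offered in two formats.
SAMPLE = """
<alternatives xmlns:xlink="http://www.w3.org/1999/xlink">
  <graphic mimetype="image" mime-subtype="tiff" xlink:href="fig-1.tif"/>
  <graphic mimetype="image" mime-subtype="png" xlink:href="fig-1.png"/>
</alternatives>
""".strip()


def pick_graphic(alternatives, preferred_subtypes):
    """Return the href of the first graphic matching the preference order."""
    graphics = alternatives.findall("graphic")
    for subtype in preferred_subtypes:
        for graphic in graphics:
            if (graphic.get("mimetype") == "image"
                    and graphic.get("mime-subtype") == subtype):
                return graphic.get(XLINK_HREF)
    return None  # nothing the client can display


alternatives = ET.fromstring(SAMPLE)
print(pick_graphic(alternatives, ["png", "jpeg"]))
```

Note that this can only distinguish formats. Two PNGs of different sizes would look identical to the client, which is exactly the gap described above.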

@MikeTaylor
Author

So you're saying that only MIME type serves to distinguish the
variants? Hmm. Can you hack the MIME types, so you use
image/png+full
image/png+medium
image/png+full
?

Better still, has someone already standardised such extended MIME types?

@hubgit

hubgit commented Jun 13, 2013

I don't think that particular standard exists, as "full" is ambiguous (a client still doesn't know what size that is; if it knew the sizes it could assume that the largest is the "full" image).

There's a proposal for a HTML <picture> element, but that assumes that the client has control over the HTML and knows what sizes they want to display the image at, rather than defining the actual sizes of the source files.

@MikeTaylor
Author

So we need TWO extensions to the NLM schema: xml:base attribute, and
height/width attributes on image links.

@Klortho

Klortho commented Jun 26, 2013

This NISO announcement might be of interest to people on this thread; http://www.niso.org/news/pr/view?item_key=095ead17653aacf2db53445611417084f1d052dc

@MikeTaylor
Author

It's of interest, yes; but not necessarily in a good way. I would much
rather they just fixed the obvious bugs in NLM than throw it all out and
start again. Poor NISO -- I suppose they feel they have to have SOMETHING
to do. See also
http://svpow.com/2013/06/22/why-a-niso-effort-to-standardise-altmetrics/

@Klortho

Klortho commented Jun 26, 2013

Packaging was never in the scope of JATS, and for better or worse, I think, never will be. I can't see how this effort is throwing anything out. And, unlike altmetrics, there's a lot of prior work regarding packaging that could be drawn upon. I'm not the biggest fan of NISO, but maybe this effort will help.

@ivangrub
Contributor

Hey everyone,

We have open sourced refract (the NLM XML to Lens JSON converter).

https://github.com/elifesciences/refract

Please have a look and start playing around with it. To make it easiest to help out with issues, make development branches for each publisher type and I can help with the necessary tweaks.

Thanks!

@IanMulvany
Contributor

This is not a mailing list, this is a feature request. If people wish to keep discussing, I would recommend that we port over to a mailing list.

@michael
Member

michael commented Sep 25, 2013

You can now drag+drop any NLM file into Lens. You can self-host a Lens article by checking out

https://github.com/elifesciences/lens/tree/0.2.x/dist

and adjusting the index.html to your needs.

@michael closed this as completed Sep 25, 2013

@MikeTaylor
Author

"You can now drag+drop any NLM file into Lens."

That sounds awesome. But I can't figure out how to do it. I went to http://lens.elifesciences.org/ and dragged a link to https://peerj.com/articles/36.xml from another window into the Lens one, but it just loaded the XML itself in place of Lens. How do I make this work?

@michael
Member

michael commented Sep 25, 2013

Use http://lens.substance.io for now.

@MikeTaylor
Author

How? I can't see a space on that site to drop XML onto, or an upload area. Sorry if I'm being dense. I am keen to blog the heck out of this once I get it going.

@ivangrub
Contributor

Once you load lens.substance.io it will open to the About page. Just drag
and drop the XML anywhere on the page. I just tested PeerJ article 172 and
it worked fine for me.

If you are dragging the XML from the Chrome download bar, it might not
work. Best to take it from the folder that is saved on your hard drive and
drop it into the browser.

@michael
Member

michael commented Sep 25, 2013

Try downloading the XML so you have it locally and then drag the XML file somewhere on the page (it doesn't matter where you drop it). We haven't had time to add a visual cue (like a drop area). It's more of a secret feature for now as we haven't optimised and tested against journals other than eLife/LandesBioscience/PLOS. Just tested with the file you mentioned. It looks good; please excuse the cat images, they're used as a default fallback when there's no image resolver in place.

@MikeTaylor
Author

Working for me now, when I drag the XML from a local file rather than an online link. Many thanks.

BTW, it does NOT work for me in Firefox (v14.01.1) -- just displays the XML. Works in Chrome.

@MikeTaylor
Author

Should I hold off from telling the world about this?

@michael
Member

michael commented Sep 25, 2013

Will look into it when I get some time. Btw, we've added a configuration for PeerJ, so the images should show up now.

@MikeTaylor
Author

They do! Great work, folks.

@michael
Member

michael commented Sep 25, 2013

\o/ Credits(tm) to @ivangrub.

And go tell everybody. Just add a "work-in-progress" disclaimer of sorts. Thanks! :)

@Klortho

Klortho commented Nov 27, 2013

Update: In case anyone is interested, it looks like xml:base will be allowed for every element in the JATS tag suite (in future revisions): http://www.niso.org/apps/group_public/view_comment.php?comment_id=275

@MikeTaylor
Author

Excellent!

@michael
Member

michael commented Dec 10, 2013

Thanks @Klortho for suggesting that change. It makes a lot of sense. I've added support for xml:base on the root <article> element. Now we have a deterministic way of resolving image URLs (given that the relative URLs also contain the file extension).

elifesciences/lens-converter@e381eb8
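
For anyone who wants to see the resolution rule in isolation, here is a minimal Python sketch. It is not the code from the commit above (the converter itself is JavaScript); the `SAMPLE` document and the `resolve_graphics` helper are hypothetical, and only an xml:base on the root element is considered.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical article with xml:base on the root <article> element.
SAMPLE = """
<article xmlns:xlink="http://www.w3.org/1999/xlink"
         xml:base="https://peerj.com/articles/36/">
  <body><fig><graphic xlink:href="fig-1.png"/></fig></body>
</article>
""".strip()


def resolve_graphics(xml_text, document_url):
    """Resolve every graphic href against xml:base, else the document URL."""
    root = ET.fromstring(xml_text)
    # An absent xml:base leaves the document URL as the base.
    base = urljoin(document_url, root.get(XML_BASE, ""))
    return [
        urljoin(base, graphic.get(XLINK_HREF))
        for graphic in root.iter("graphic")
        if graphic.get(XLINK_HREF)
    ]


print(resolve_graphics(SAMPLE, "https://peerj.com/articles/36.xml"))
```

Without the xml:base attribute the same href would resolve against the document URL instead, giving https://peerj.com/articles/fig-1.png.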

@michael
Member

michael commented Dec 10, 2013

@MikeTaylor We do support direct URL input now (in the current dev branch). However, it requires that the server hosting the XML supports CORS, or that you link to a proxy that handles CORS for you.

http://quasipartikel.at/lens/?url=https://peerj.com/articles/36.xml#figures/all
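
As a rough illustration of the CORS requirement, a browser will only hand the fetched XML to the page if the server's response carries a suitable Access-Control-Allow-Origin header. The Python sketch below is a simplified version of that check: the helper name `allows_cross_origin` is made up, and it ignores credentials and preflight requests.

```python
def allows_cross_origin(response_headers, origin):
    """Simplified check: would a browser expose this response to `origin`?

    response_headers is a plain dict of header name -> value.
    """
    allowed = None
    for name, value in response_headers.items():
        # Header names are case-insensitive.
        if name.lower() == "access-control-allow-origin":
            allowed = value.strip()
    return allowed == "*" or allowed == origin


print(allows_cross_origin({"Access-Control-Allow-Origin": "*"},
                          "http://quasipartikel.at"))
```

If the check fails for a given XML server, the proxy route mentioned above is the fallback.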


10 participants