Lens reader for an NLM XML by URL #6

MikeTaylor · 2013-06-06T22:02:55Z

It would be great to be able to read ANY valid NLM XML file using Lens just by specifying the URL. Something like
http://lens.elifesciences.org/xml-url/https://peerj.com/articles/36/

michael · 2013-06-06T22:44:58Z

Not sure if on-the-fly conversion performs well enough. I thought about making Lens aware of different content repositories, though, which means that content providers could make their repository available by exposing a service conforming to the JSON format Lens can read.

E.g.
http://lens.elifesciences.org#peerj would let you browse within the available articles of that repository.

More work needs to go into the specification of the Lens Document Format as well as documenting the conversion process and providing tools that help with the conversion process.

MikeTaylor · 2013-06-06T23:12:36Z

This seems like a good route to take.

Although on-the-fly conversion would surely be a useful tool for testing
and debugging. Maybe you could consider making it available for beta
partners so they can report to you where they find problems in the
conversion or display?

michael · 2013-06-06T23:23:13Z

Our converter will be open sourced soon; we just haven't had the time to review the codebase and get the documentation right. It would be great to have support from other Open Access publishers with regard to accessing fresh content. Lens could also be a useful tool outside of the science community, e.g. for viewing software documentation. But then again, one step at a time. :)

ivangrub · 2013-06-10T16:40:31Z

The converter is currently built to support the XML standard described at http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html . There might still be some edge-case errors that come up, but they should be debugged very quickly.

I will look into building a more robust API for the converter so that it is easier to plug into different workflows. The main issue with on-the-fly client side conversions is that each publisher would have to provide definitions for their figure URLs that would either have to be included in the converter, or added as a post processing step in that publisher's workflow. I think a static repository supported by each publisher would probably be the ideal way to move forward as well.

MikeTaylor · 2013-06-10T16:43:10Z

Why does the NLM->JSON converter need to know the individual publishers' figure URL conventions? (I'm not saying it doesn't; I'm just curious as to what could impose such a requirement.)

ivangrub · 2013-06-10T16:48:41Z

Unfortunately, the XML tags do not provide a src attribute. Through a little hacking it is possible to figure out how to stitch together the URL, but that is a tedious and error-prone process.

The image and video nodes in the JSON contain a url property, which needs to point to the image or video to be displayed. If that property is empty, the article will still render fine, except that all of the media will be missing.

MikeTaylor · 2013-06-10T16:50:38Z

Wait ... NLM, the universally used canonical format for representing scholarly articles ... HAS NO WAY TO LINK TO THE FIGURES? Did I understand you right?

ivangrub · 2013-06-10T16:54:30Z

Yes. The figure tags contain a graphic-id attribute that gives the figure's name, which can be stitched together with an image-style extension. This depends on having local storage of all the article's media (images, video, supplementary material, source code, etc.) in the same path as the article's XML, though. Universal access to the figures is not available at the moment, or at least I have yet to see it. If you know of a way to do this, I would be more than happy to hear about it and quickly implement it in the converter.

MikeTaylor · 2013-06-10T16:56:40Z

Holy poo. Well, I am astonished at such designed-in brain-damage. Sorry I can't help -- I am no NLM expert, and certainly don't know a better way. So, yes, your converter script will need, or need access to, a set of per-publisher or per-journal recipes for turning graphic-IDs into URLs.

rdmpage · 2013-06-10T16:56:46Z

Mike, as an example, here's a fragment from ZooKeys:

To render this you need to figure out what the full path is to the image. You don't get a URL :(

rdmpage · 2013-06-10T16:57:40Z

Of course, if you have a DOI for the figure you may have better luck...

ivangrub · 2013-06-10T17:04:47Z

It is quite unfortunate. I think part of the reason for this is that the URL path for each publisher is subject to change over time, so instead of having to update the XML each time with the new URL, PubMed decided to just push for having local storage of all of the figures and associated files. That is why I think a local repository of the converted JSONs, per publisher, would be best. We can work with the interested parties to add the converter to their workflows and then link these repositories with Lens.

Even having the figure DOI is not ideal though. In that case you would have to do a http.request to get the html of the DOI page, and then scrape it for the image src URL. In async workflows like node.js, you could do this, but it would take much longer than investing a little time up front to have an organized workflow that publishes all of the associated files of an article's XML to Amazon or another cloud service where the URLs will not change.

MikeTaylor · 2013-06-10T17:21:34Z

I see from the ZooKeys example that the attribute in question is "href", which is at least suggestive that it's the name of an HTTP-addressable resource relative to the address of the document that contains it. Doesn't it follow that if you download a ZooKeys XML from http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml , then the href refers to
http://www.pensoft.net/J_FILES/1/articles/5334/ZooKeys-307-001-g001.jpg ?

Answer 1 (from the W3C's XLink spec at http://www.w3.org/TR/xlink/#link-locators ) is that, yes, the href " must be a URI reference as defined in [IETF RFC 2396]". Answer 2 (from simple testing) is that, no, this URL doesn't work. Darn.
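For what it's worth, the resolution rule being tested here can be reproduced with Python's standard `urljoin`. This is only a sketch of ordinary relative-URL resolution; the href value is inferred from the figure URL quoted above, and `resolve_href` is a made-up helper name.

```python
from urllib.parse import urljoin


def resolve_href(document_url: str, href: str) -> str:
    """Resolve a (possibly relative) href against the URL the XML was served from."""
    return urljoin(document_url, href)


ZOOKEYS_XML = "http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml"

# The relative href replaces the last path segment of the document URL.
expected_figure = resolve_href(ZOOKEYS_XML, "ZooKeys-307-001-g001.jpg")
print(expected_figure)
```

The computed address is the one that simple testing shows does not exist, so the rule is easy to apply; the files are just not served there.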

rdmpage · 2013-06-10T17:24:08Z

Almost, http://www.pensoft.net/J_FILES/1/articles/5334/export.php_files/ZooKeys-307-001-g001.jpg

rdmpage · 2013-06-10T17:26:00Z

So, for each journal you need to figure out how they serve images, and everyone does it differently.
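A per-journal recipe of this kind could be as small as a lookup table. The sketch below is hypothetical: `HOST_PREFIXES` and `resolve_figure` are invented names, and only the `export.php_files/` entry comes from the ZooKeys example above. It is not how any existing converter does it.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical table: extra path segment each host needs between the
# XML's folder and the href. Only the pensoft.net entry is from this thread.
HOST_PREFIXES = {
    "www.pensoft.net": "export.php_files/",
}


def resolve_figure(document_url: str, href: str) -> str:
    """Resolve href relative to the XML, applying a per-host prefix if one is known."""
    prefix = HOST_PREFIXES.get(urlparse(document_url).netloc, "")
    return urljoin(document_url, prefix + href)


zookeys_figure = resolve_figure(
    "http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml",
    "ZooKeys-307-001-g001.jpg",
)
print(zookeys_figure)
```

Hosts without an entry fall back to plain relative resolution, which is the behaviour one would hope every publisher supported.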

MikeTaylor · 2013-06-10T17:26:46Z

Surely in this case the ZooKeys document is flatly invalid.

ivangrub · 2013-06-10T17:26:49Z

@MikeTaylor, ideally that is exactly how this would end up working. Realistically, aside from having to hack each publisher with tweaks like what @rdmpage just noticed by adding 'export.php_files' to the URL path, we would have publishers get on board with reorganizing their URLs in a manner that actually makes sense.

Anyone want to start leaning on the publishers to have a standard URL structure?

MikeTaylor · 2013-06-10T17:27:04Z

Let's try some other publishers ... PeerJ next, as they're my favourites ...

MikeTaylor · 2013-06-10T17:31:33Z

My PeerJ article is https://peerj.com/articles/36.xml and has no "xml:base" attribute.

The first figure is expressed as:

which means that the URL of the actual figure should be
https://peerj.com/articles/fig-1.png

To my enormous disappointment, the figure is not there. Distressingly, the full-size version seems to be at https://dfzljdn9uc3pi.cloudfront.net/2013/36/1/fig-1-full.png

I expected better from PeerJ. We should probably get in touch with them.

ivangrub · 2013-06-10T17:38:19Z

Even if the URL path were something like https://peerj.com/articles/ARTICLE_ID/ that would be fine. Your figure would then be https://peerj.com/articles/36/fig-1-full.png and the XML would be at https://peerj.com/articles/36/36.xml.

If you manage to get in touch with PeerJ, you can point them to this issue thread and we can figure out how to best fix this problem.
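To make the difference between the two layouts concrete, here is a small check using plain relative-URL resolution. The href `fig-1-full.png` is an assumption for illustration, not what the PeerJ XML necessarily contains.

```python
from urllib.parse import urljoin

HREF = "fig-1-full.png"  # assumed relative href, for illustration only

# Current layout: the XML sits directly under /articles/.
current = urljoin("https://peerj.com/articles/36.xml", HREF)

# Suggested layout: one folder per article holding both XML and figures.
suggested = urljoin("https://peerj.com/articles/36/36.xml", HREF)

print(current)
print(suggested)
```

With a per-article folder, the same relative href lands inside that article's own path, which is all a converter needs.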

gnott · 2013-06-10T18:10:28Z

Tagging Figures, Graphics
http://dtd.nlm.nih.gov/publishing/tag-library/n-wq32.html

External Link
http://dtd.nlm.nih.gov/publishing/tag-library/n-hya0.html

In eLife XML it looks like the DOI URL is included in a <ext-link> tag, though following that URL will not necessarily bring you to the image file itself.

Another wrinkle is providing the same graphic in multiple sizes / resolutions, and multiple file formats.

IanMulvany · 2013-06-10T19:32:28Z

It's all rather unfortunate. It would be interesting to cross-post the issue to the JATS list. In the interim we now have content negotiation on eLife: http://www.elifesciences.org/elife-now-supports-content-negotiation/, but to mirror your comment Mike - poo.

hubgit · 2013-06-10T20:27:08Z

In JATS XML, I think the image paths are designed to be relative to the XML file when they're bundled together into a ZIP file for archiving. I'm not sure if this is how it's implemented by all publishers, though.

For PeerJ, an article at https://peerj.com/articles/36/ has an XML file at https://peerj.com/articles/36.xml and figures at https://peerj.com/articles/36/fig-1.png (for example).
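If the paths really are meant to be read relative to the XML inside a ZIP bundle, resolving them is straightforward. The sketch below assumes a made-up bundle layout (`article-36/36.xml` next to `article-36/fig-1.png`); JATS itself does not prescribe one, and `read_bundled_file` is an invented helper.

```python
import io
import posixpath
import zipfile


def read_bundled_file(archive: zipfile.ZipFile, xml_name: str, href: str) -> bytes:
    """Read the archive member that href points to, relative to the XML's folder."""
    folder = posixpath.dirname(xml_name)
    member = posixpath.normpath(posixpath.join(folder, href))
    return archive.read(member)


# Build a tiny in-memory bundle with placeholder contents to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as bundle:
    bundle.writestr("article-36/36.xml", "<article/>")
    bundle.writestr("article-36/fig-1.png", b"fake image bytes")

with zipfile.ZipFile(buffer) as bundle:
    figure_bytes = read_bundled_file(bundle, "article-36/36.xml", "fig-1.png")

print(len(figure_bytes), "bytes read")
```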

MikeTaylor · 2013-06-10T20:30:54Z

Excellent. Then it seems that just adding the relevant xml:base attribute to PeerJ's XML will fix this.

hubgit · 2013-06-10T20:57:00Z

In theory, yes, but it looks like articles with an xml:base attribute don't validate against the JATS DTD.

I've only dealt with images in cross-publisher articles when served from PMC, where they're all at predictable URLs.

ivangrub · 2013-06-10T21:09:57Z

The PMC URLs are easier to predict because they use the xlink:href in the graphics tag to point to the figure. The only issue there is getting the PMC ID for the article which is not in each publisher's XML. The best way to do it in that case is to probably do on-the-fly conversions by requesting the XML from PMC and pulling the PMC ID from their version of the XML.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3539393/bin/elife00170f001.jpg

which would translate to http://www.ncbi.nlm.nih.gov/pmc/articles/PMC_ID/bin/fig_xlink:href.jpg
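As a sketch, that pattern can be written as a one-line template. The function name and the default `jpg` extension are assumptions based only on the single example URL above; other articles may use different extensions.

```python
def pmc_figure_url(pmc_id: str, href: str, extension: str = "jpg") -> str:
    """Build a PMC figure URL from the PMC ID and the graphic's xlink:href value."""
    return f"http://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/bin/{href}.{extension}"


# Reproduces the example URL given above.
example = pmc_figure_url("PMC3539393", "elife00170f001")
print(example)
```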

Daniel-Mietchen · 2013-06-11T03:54:14Z

Two pointers to related projects:

https://github.com/konrad/JATS-to-Mediawiki
converts JATS from PMC into MediaWiki markup. We are planning to deploy it on Wikisource this summer.
https://github.com/erlehmann/open-access-media-importer
uploads audio and video files from PMC to Wikimedia Commons. We have thought of taking the publishers' XML directly (cf. "Version of bot for import from publisher if article not indexed (yet) by PMC", wpoa/open-access-media-importer#45) rather than going through PMC (in order to bridge the gap of weeks to months between the XML being delivered to and publicly exposed at PMC), but have not pursued this further, precisely because it is not obvious where to locate the media files (in PLOS XML, for instance, they do not even have file names).

Apart from that, there are a number of problems in the XML delivered to PMC with regard to the signaling of licensing and MIME types (cf. http://chrismaloney.org/notes/OAMI%20JatsCon%20Submission,%202013 and http://outreach.wikimedia.org/wiki/GLAM/Newsletter/November_2012/Contents/Open_Access_report ), and these problems will also be mentioned in a breakout session on the reuse of OA materials in Wikimedia contexts at OAI8 (cf.
https://indico.cern.ch/contributionDisplay.py?sessionId=10&contribId=67&confId=211600 ).

Klortho · 2013-06-11T04:15:47Z

@MikeTaylor wrote, "Surely in this case the ZooKeys document is flatly invalid." No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link.

I think the disconnect comes from two facts: JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want; and that the XML, if it is served at all, is usually served independent of the figures and other media (and, as Alf pointed out, it doesn't even allow an xml:base attribute). It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website.

kaveh1000 · 2013-06-11T05:12:07Z

JATS is full of surprising tags, e.g. , allowing the creation of valid documents with zero structure in the references, e.g. here.

A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue...

MikeTaylor · 2013-06-11T06:58:02Z

@Klortho wrote:

"'Surely in this case the ZooKeys document is flatly invalid.' No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link."

Yes, that is a much more precise way to articulate the problem. I shouldn't have used the word "invalid" which of course in the XML world means "not conforming to the schema".

"JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want."

Really? Seems like flagrant misuse of the XLink attributes to me. "xlink:href" has a specific meaning, detailed at http://www.w3.org/TR/xlink/#link-locators and dependent on the "xml:base" attribute for its interpretation. As the spec. says, "If the URI reference is relative, its absolute version MUST be computed by the method of [XML Base] before use".

"It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website."

... which is precisely why no-one does that.

@kaveh1000 writes:

"A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue..."

Except, bizarrely, the "xml:base" attribute that's needed to make this stuff work in a sane way. As this conversation progresses, it looks increasingly as though that is the underlying bug here, no?

At present, we have a worldwide standard representation of academic articles which does not contain the information of how to obtain figures. That is just crazy.

ivangrub · 2013-06-11T18:16:00Z

@hubgit I agree that it is not difficult to hack together a single publisher's URLs. Getting the figures via PMC is only easy, though, if your starting XML files are from PMC too (the URL depends on the PMC ID). In my opinion, there is really no need to have a unique identifier for each article other than the DOI.

Leveraging the 600,000+ open access articles on PubMed would be great, but it is difficult due to their terms of use, and I do not see much purpose in reinventing the wheel a few times over to provide the exact same service. We will be open sourcing the converter soon. At that point every publisher that is interested in using Lens can make their appropriate magic soup to pull out the figure URLs.

I spent a little time playing around with the XML to figure hacks for PLOS and others, but it is not as simple as @hubgit wrote:

"If an xml:base attribute is not present, the base URL of an XML document is the URL from which it's served, and links are built relative to that."

hubgit · 2013-06-11T18:18:30Z

@ivangrub I agree with you entirely - I was just pointing out how xml:base works :-)

Klortho · 2013-06-13T15:15:16Z

I added this comment to the NISO JATS spec comments list. Sadly, it's not a forum -- there's no discussion or any kind of back-and-forth. If I got anything wrong, or somebody wants to add their own point of view, the thing to do would be to add your own comment.

hubgit · 2013-06-13T17:02:28Z

I've fixed PeerJ's article XML files (e.g. http://peerj.com/articles/36.xml) so they now use absolute URLs, rather than relative URLs, for the xlink:href attributes (i.e. they link to the actual image URLs).

These are the full-size PNG files though, which can be rather large; still looking for best-practices for marking up alternate formats.

MikeTaylor · 2013-06-13T17:39:48Z

Nice, thanks!

I have nearly learned better than to ask this kind of question, but not
quite, so here goes: SURELY the NLM/JATS schema has a way to express this
obvious and important concept?

hubgit · 2013-06-13T17:44:03Z

Yes, wrapping the different formats in <alternatives> already allows alternative formats for the same element (e.g. MathML or graphic versions of a formula).

What I'm not sure about is if the attributes on the <graphic> element are expressive enough to allow a client to know which one to choose: the mimetype and mime-subtype attributes will allow a client to distinguish between formats, but there isn't a width, height or file size attribute, as far as I can tell.
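To illustrate what a client can and cannot do with those two attributes, here is a sketch that picks a graphic from an `<alternatives>` wrapper by MIME type only. The sample markup and file names are invented, and `pick_graphic` is a hypothetical helper; nothing here tells the client how large each file is.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

SAMPLE = """<alternatives xmlns:xlink="http://www.w3.org/1999/xlink">
  <graphic mimetype="image" mime-subtype="png" xlink:href="fig-1.png"/>
  <graphic mimetype="image" mime-subtype="jpeg" xlink:href="fig-1.jpg"/>
</alternatives>"""


def pick_graphic(alternatives_xml: str, preferred: List[str]) -> Optional[str]:
    """Return the href of the first graphic whose MIME type appears in preferred."""
    root = ET.fromstring(alternatives_xml)
    by_type = {}
    for graphic in root.iter("graphic"):
        mime = f'{graphic.get("mimetype")}/{graphic.get("mime-subtype")}'
        # Keep the first graphic seen for each MIME type.
        by_type.setdefault(mime, graphic.get(XLINK_HREF))
    for mime in preferred:
        if mime in by_type:
            return by_type[mime]
    return None


chosen = pick_graphic(SAMPLE, ["image/jpeg", "image/png"])
print(chosen)
```

Two PNGs of different dimensions would be indistinguishable to this code, which is exactly the gap described above.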

MikeTaylor · 2013-06-13T17:55:06Z

So you're saying that only MIME type serves to distinguish the
variants? Hmm. Can you hack the MIME types, so you use
image/png+full
image/png+medium
image/png+full
?

Better still, has someone already standardised such extended MIME types?

hubgit · 2013-06-13T18:01:30Z

I don't think that particular standard exists, as "full" is ambiguous (a client still doesn't know what size that is; if it knew the sizes it could assume that the largest is the "full" image).

There's a proposal for an HTML <picture> element, but that assumes that the client has control over the HTML and knows what sizes they want to display the image at, rather than defining the actual sizes of the source files.

MikeTaylor · 2013-06-13T18:07:42Z

So we need TWO extensions to the NLM schema: xml:base attribute, and
height/width attributes on image links.

Klortho · 2013-06-26T15:20:37Z

This NISO announcement might be of interest to people on this thread; http://www.niso.org/news/pr/view?item_key=095ead17653aacf2db53445611417084f1d052dc

MikeTaylor · 2013-06-26T15:27:57Z

It's of interest, yes; but not necessarily in a good way. I would much
rather they just fixed the obvious bugs in NLM than throw it all out and
start again. Poor NISO -- I suppose they feel they have to have SOMETHING
to do. See also
http://svpow.com/2013/06/22/why-a-niso-effort-to-standardise-altmetrics/

Klortho · 2013-06-26T15:44:26Z

Packaging was never in the scope of JATS, and for better or worse, I think, never will be. I can't see how this effort is throwing anything out. And, unlike altmetrics, there's a lot of prior work regarding packaging that could be drawn upon. I'm not the biggest fan of NISO, but maybe this effort will help.

ivangrub · 2013-06-27T18:19:50Z

Hey everyone,

We have open sourced refract (the NLM XML to Lens JSON converter).

https://github.com/elifesciences/refract

Please have a look and start playing around with it. To make it easiest to help out with issues, make development branches for each publisher type and I can help with the necessary tweaks.

Thanks!

IanMulvany · 2013-06-27T18:51:07Z

This is not a mailing list, this is a feature request. If people wish to keep discussing, I would recommend that we port over to a mailing list.

michael · 2013-09-25T15:18:36Z

You can now drag+drop any NLM file into Lens. You can self-host a Lens article by checking out

https://github.com/elifesciences/lens/tree/0.2.x/dist

and adjusting the index.html to your needs.

MikeTaylor · 2013-09-25T19:14:01Z

"You can now drag+drop any NLM file into Lens."

That sounds awesome. But I can't figure out how to do it. I went to http://lens.elifesciences.org/ and dragged a link to https://peerj.com/articles/36.xml from another window into the Lens one, but it just loaded the XML itself in place of Lens. How do I make this work?

michael · 2013-09-25T19:41:42Z

Use http://lens.substance.io for now.

MikeTaylor · 2013-09-25T19:44:50Z

How? I can't see a space on that site to drop XML onto, or an upload area. Sorry if I'm being dense. I am keen to blog the heck out of this once I get it going.

ivangrub · 2013-09-25T19:50:37Z

Once you load lens.substance.io it will open to the About page. Just drag
and drop the XML anywhere on the page. I just tested PeerJ article 172 and
it worked fine for me.

If you are dragging the XML from the Chrome download bar, it might not
work. Best to take it from the folder that is saved on your hard drive and
drop it into the browser.

michael · 2013-09-25T19:51:06Z

Try downloading the XML so you have it locally and then drag the XML file somewhere on the page (it doesn't matter where you drop it). We haven't had time to add a visual cue (like a drop area). It's more of a secret feature for now, as we haven't optimised and tested against journals other than eLife/LandesBioscience/PLOS. Just tested with the file you mentioned. It looks good; please excuse the cat images, they're used as a default fallback when there's no image resolver in place.

MikeTaylor · 2013-09-25T21:10:45Z

Working for me now, when I drag the XML from a local file rather than an online link. Many thanks.

BTW, it does NOT work for me in Firefox (v14.01.1) -- it just displays the XML. It works in Chrome.

MikeTaylor · 2013-09-25T21:12:04Z

Should I hold off from telling the world about this?

michael · 2013-09-25T21:12:06Z

Will look into it when I get some time. Btw, we've added a configuration for PeerJ, so the images should show up now.

MikeTaylor · 2013-09-25T21:13:18Z

They do! Great work, folks.

michael · 2013-09-25T21:14:45Z

\o/ Credits(tm) to @ivangrub.

And go tell everybody. Just add a "work-in-progress" disclaimer of sorts. Thanks! :)

Klortho · 2013-11-27T19:58:36Z

Update: In case anyone is interested, it looks like xml:base will be allowed for every element in the JATS tag suite (in future revisions): http://www.niso.org/apps/group_public/view_comment.php?comment_id=275

MikeTaylor · 2013-11-27T20:41:39Z

Excellent!

michael · 2013-12-10T13:05:19Z

Thanks @Klortho for suggesting that change. It makes a lot of sense. I've added support for xml:base on the root <article> element. Now we have a deterministic way of resolving image URLs (given that the relative URLs also contain the file extension).

elifesciences/lens-converter@e381eb8
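For readers who want to try the idea outside Lens, the resolution step can be sketched in a few lines of Python. This is not the code from the commit above; the sample XML, the `graphic_urls` helper, and the choice of `https://peerj.com/articles/36/` as the base are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink"
         xml:base="https://peerj.com/articles/36/">
  <body><fig id="fig-1"><graphic xlink:href="fig-1.png"/></fig></body>
</article>"""


def graphic_urls(xml_text: str, document_url: str) -> List[str]:
    """Resolve every graphic href against xml:base on the root, if present."""
    root = ET.fromstring(xml_text)
    # Without xml:base, the base is simply the URL the XML was served from.
    base = urljoin(document_url, root.get(XML_BASE, ""))
    return [
        urljoin(base, graphic.get(XLINK_HREF))
        for graphic in root.iter("graphic")
        if graphic.get(XLINK_HREF)
    ]


with_base = graphic_urls(SAMPLE, "https://peerj.com/articles/36.xml")
print(with_base)
```

Only xml:base on the root element is honoured here, matching the scope described in the comment; nested xml:base attributes would need a tree walk.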

michael · 2013-12-10T13:07:52Z

@MikeTaylor We do support direct URL input now (in the current dev branch). However, it requires that the server hosting the XML supports CORS, or that you link to a proxy that handles CORS for you.

http://quasipartikel.at/lens/?url=https://peerj.com/articles/36.xml#figures/all
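A proxy of that kind only has to fetch the XML and add one response header. The following is a minimal WSGI sketch under stated assumptions: `make_cors_proxy` and its `fetch` parameter are invented for illustration, the `?url=` parameter merely mirrors the link above, and this is not how the proxy behind that link is implemented.

```python
from typing import Callable, Optional
from urllib.parse import parse_qs
from urllib.request import urlopen


def make_cors_proxy(fetch: Optional[Callable[[str], bytes]] = None):
    """Return a WSGI app that relays ?url=... and adds a permissive CORS header."""

    def default_fetch(url: str) -> bytes:
        with urlopen(url) as response:
            return response.read()

    fetcher = fetch or default_fetch

    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        target = query.get("url", [""])[0]
        # Refuse anything that is not a plain http(s) URL.
        if not target.startswith(("http://", "https://")):
            start_response("400 Bad Request", [("Content-Type", "text/plain")])
            return [b"missing or unsupported url parameter"]
        body = fetcher(target)
        start_response("200 OK", [
            ("Content-Type", "application/xml"),
            ("Access-Control-Allow-Origin", "*"),
        ])
        return [body]

    return app
```

The app can be served locally with the standard library's `wsgiref.simple_server.make_server`. An open relay like this should not be exposed publicly without restricting which hosts it will fetch from.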

michael closed this as completed Sep 25, 2013
