Lens reader for an NLM XML by URL #6
Comments
Not sure if on the fly conversion performs well enough. I thought about making Lens aware of different content repositories though, which means that content providers could make their repository available by exposing a service conforming to the JSON format Lens can read. E.g. More work needs to go into the specification of the Lens Document Format as well as documenting the conversion process and providing tools that help with the conversion process. |
This seems like a good route to take, although on-the-fly conversion would surely be a useful tool for testing. |
Our converter will be open sourced soon, we just haven't had the time to review the codebase and get the documentation right. It would be great to have support from other Open Access publishers with regards to accessing fresh content. Lens could also be a useful tool outside of the science community, e.g. for viewing software documentation etc... But then again... one step at a time. :) |
The converter is currently built to support the XML standard described at http://www.ncbi.nlm.nih.gov/pmc/pmcdoc/tagging-guidelines/article/style.html . There might still be some edge-case errors that will come up, but they should be debugged very quickly. I will look into building a more robust API for the converter so that it is easier to plug into different workflows. The main issue with on-the-fly client-side conversions is that each publisher would have to provide definitions for their figure URLs, which would either have to be included in the converter or added as a post-processing step in that publisher's workflow. I think a static repository supported by each publisher would probably be the ideal way to move forward as well. |
Why does the NLM->JSON converter need to know the individual publishers' figure URL conventions? (I'm not saying it doesn't just curious as to what could invoke such a requirement.) |
Unfortunately, the XML tags do not provide a src attribute. Through a little hacking it is possible to figure out how to stitch together the URL, but that is a tedious and error-prone process. The image and video nodes in the JSON contain a url property which needs to point to the image or video to be displayed. If that property is empty, then the article will render fine, except you will be missing all of the media bits. |
Wait ... NLM, the universally used canonical format for representing scholarly articles ... HAS NO WAY TO LINK TO THE FIGURES? Did I understand you right? |
Yes. The figure tags contain a graphic-id attribute that gives the figure's name, which can be stitched together with an image-style extension. This depends on having local storage of all the article's media (images, video, supplementary material, source code, etc.) in the same path as the article's XML though. Universal access to the figures is not available at the moment. Or at least I have yet to see it. If you know of a way to do this, I would be more than happy to hear about it and quickly implement it in the converter. |
Holy poo. Well, I am astonished at such designed-in brain-damage. Sorry I can't help -- I am no NLM expert, and certainly don't know a better way. So, yes, your converter script will need, or need access to, a set of per-publisher or per-journal recipes for turning graphic-IDs into URLs. |
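A per-publisher recipe table of the kind described above might look roughly like this in Python. This is only a sketch under stated assumptions: `RECIPES`, `figure_url` and the `peerj` key are hypothetical names, and the one PeerJ template is an assumption based on the figure URL pattern reported elsewhere in this thread, not a documented API.

```python
from typing import Optional

# Hypothetical recipe table: publisher key -> URL template.
# The PeerJ pattern is an assumption taken from URLs quoted in this thread.
RECIPES = {
    "peerj": "https://peerj.com/articles/{article_id}/{href}",
}


def figure_url(publisher: str, article_id: str, href: str) -> Optional[str]:
    """Turn a graphic reference into a URL, or None if no recipe is known."""
    template = RECIPES.get(publisher)
    if template is None:
        # No recipe: the caller can leave the url property empty,
        # in which case the article renders without its media.
        return None
    return template.format(article_id=article_id, href=href)
```

Returning `None` for an unknown publisher mirrors the behaviour described earlier: the article still renders, just without its figures.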
Mike, as an example, here's a fragment from ZooKeys: <graphic xlink:href="ZooKeys-285-089-g001.jpg" position="float" orientation="portrait" xlink:type="simple"></graphic> To render this you need to figure out what the full path is to the image. You don't get a URL :( |
Of course, if you have a DOI for the figure you may have better luck... |
It is quite unfortunate. I think part of the reason for this is that the URL path for each publisher is subject to change over time, so instead of having to update the XML each time with the new URL, PubMed decided to just push for having local storage of all of the figures and associated files. That is why I think a local repository of the converted JSONs, per publisher, would be the best. We can work with the interested parties to add the converter to their workflows and then link these repositories with Lens. Even having the figure DOI is not ideal though. In that case you would have to do an http.request to get the HTML of the DOI page, and then scrape it for the image src URL. In async workflows like node.js, you could do this, but it would take much longer than investing a little time up front to have an organized workflow that publishes all of the associated files of an article's XML to Amazon or another cloud service where the URLs will not change. |
I see from the ZooKeys example that the attribute in question is "href" which is at least suggestive that it's the name of an HTTP-addressable resource relative to the address of the document that contains it. Doesn't it follow that if you download a ZooKeys XML from http://www.pensoft.net/J_FILES/1/articles/5334/5334-G-2-layout.xml , then refers to Answer 1 (from the W3C's XLink spec at http://www.w3.org/TR/xlink/#link-locators ) is that, yes, the href " must be a URI reference as defined in [IETF RFC 2396]". Answer 2 (from simple testing) is that, no, this URL doesn't work. Darn. |
So, for each journal you need to figure out how they serve images, and everyone does it differently. |
Surely in this case the ZooKeys document is flatly invalid. |
@MikeTaylor, ideally that is exactly how this would end up working. Realistically, aside from having to hack each publisher with tweaks like what @rdmpage just noticed by adding 'export.php_files' to the URL path, we would have publishers get on board with reorganizing their URLs in a manner that actually makes sense. Anyone want to start leaning on the publishers to have a standard URL structure? |
Let's try some other publishers ... PeerJ next, as they're my favourites ... |
My PeerJ article is https://peerj.com/articles/36.xml and has no "xml:base" element. The first figure is expressed as: To my enormous disappointment, the figure is not there. Distressingly, the full-size version seems to be at https://dfzljdn9uc3pi.cloudfront.net/2013/36/1/fig-1-full.png I expected better from PeerJ. We should probably get in touch with them. |
Even if the URL path were something like this: https://peerj.com/articles/ARTICLE_ID/ that would be fine. Your figure would then be https://peerj.com/articles/36/fig-1-full.png and the XML would be at https://peerj.com/articles/36/36.xml. If you manage to get in touch with PeerJ, you can point them to this issue thread and we can figure out how to best fix this problem. |
Tagging Figures, Graphics External Link In eLife XML it looks like the DOI URL is included in a <ext-link> tag, though following that URL will not necessarily bring you to the image file itself. Another wrinkle is providing the same graphic in multiple sizes / resolutions, and multiple file formats. |
It's all rather unfortunate. It would be interesting to cross-post the issue to the JATS list. In the interim, we now have content negotiation on eLife: http://www.elifesciences.org/elife-now-supports-content-negotiation/ . But to mirror your comment, Mike - poo. |
In JATS XML, I think the image paths are designed to be relative to the XML file when they're bundled together into a ZIP file for archiving. I'm not sure if this is how it's implemented by all publishers, though. For PeerJ, an article at https://peerj.com/articles/36/ has an XML file at https://peerj.com/articles/36.xml and figures at https://peerj.com/articles/36/fig-1.png (for example). |
Excellent. Then it seems that just adding the relevant xml:base attribute to PeerJ's XML will fix this. |
In theory, yes, but it looks like articles with an xml:base attribute don't validate against the JATS DTD. I've only dealt with images in cross-publisher articles when served from PMC, where they're all at predictable URLs. |
The PMC URLs are easier to predict because they use the xlink:href in the graphics tag to point to the figure. The only issue there is getting the PMC ID for the article, which is not in each publisher's XML. The best way to do it in that case is probably to do on-the-fly conversions by requesting the XML from PMC and pulling the PMC ID from their version of the XML. For example, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3539393/bin/elife00170f001.jpg follows the pattern http://www.ncbi.nlm.nih.gov/pmc/articles/PMC_ID/bin/fig_xlink:href.jpg
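As a rough illustration of that PMC pattern, here is a minimal Python sketch. The function name `pmc_figure_url` and its parameters are hypothetical, and the template is taken only from the single example URL quoted in this comment, so it may well not hold for every article or file type.

```python
def pmc_figure_url(pmc_id: str, href: str, extension: str = "jpg") -> str:
    """Build a PMC figure URL from a PMC ID and a graphic's xlink:href value.

    Assumes the /bin/ layout seen in the one example above; not an official API.
    """
    return (
        "http://www.ncbi.nlm.nih.gov/pmc/articles/"
        f"{pmc_id}/bin/{href}.{extension}"
    )
```

Calling `pmc_figure_url("PMC3539393", "elife00170f001")` reproduces the example URL quoted above.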
Two pointers to related projects:
Apart from that, there is a number of problems in the XML delivered to PMC with regard to the signaling of licensing and MIME types (cf. http://chrismaloney.org/notes/OAMI%20JatsCon%20Submission,%202013 and http://outreach.wikimedia.org/wiki/GLAM/Newsletter/November_2012/Contents/Open_Access_report ), and these problems will also be mentioned in a breakout session on the reuse of OA materials in Wikimedia contexts at OAI8 (cf. |
@MikeTaylor wrote, "Surely in this case the ZooKeys document is flatly invalid." No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link. I think the disconnect comes from two facts: JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want; and that the XML, if it is served at all, is usually served independent of the figures and other media (and, as Alf pointed out, it doesn't even allow an xml:base attribute). It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website. |
JATS is full of surprising tags, e.g. , allowing the creation of valid documents with zero structure in the references, e.g. here. A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue... |
@Klortho wrote: "'Surely in this case the ZooKeys document is flatly invalid.' No, the value in the xlink:href attribute is a valid (relative) URL, but it doesn't point to anything. It's the equivalent of a broken link." Yes, that is a much more precise way to articulate the problem. I shouldn't have used the word "invalid" which of course in the XML world means "not conforming to the schema". "JATS doesn't include a standard packaging format for articles, so publishers are free to define how to resolve these relative URLs however they want." Really? Seems like flagrant misuse of the XLink attributes to me. "xlink:href" has a specific meaning, detailed at http://www.w3.org/TR/xlink/#link-locators and dependent on the "xml:base" attribute for its interpretation. As the spec. says, "If the URI reference is relative, its absolute version MUST be computed by the method of [XML Base] before use". "It's the same thing that would happen if you emailed somebody a raw HTML file (and only the HTML file) that you pulled off of a random website." ... which is precisely why no-one does that. @kaveh1000 writes: "A central policy of NIH seems to be to give publishers any tag that they want or need. This allows all the bad practices in publishing to continue..." Except, bizarrely, the "xml:base" attribute that's needed to make this stuff work in a sane way. As this conversation progresses, it looks increasingly as though that is the underlying bug here, no? At present, we have a worldwide standard representation of academic articles which does not contain the information of how to obtain figures. That is just crazy. |
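The resolution rule being discussed can be sketched in Python with the standard library. This is a simplified illustration, not a full XML Base implementation: it only looks at an xml:base attribute on the root element, and `resolve_graphics` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

# Attribute names as ElementTree exposes them (namespace URI in braces).
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolve_graphics(xml_text: str, document_url: str) -> List[str]:
    """Return absolute URLs for every <graphic xlink:href="..."> in the document.

    Simplification: only a root-level xml:base is honoured. Without one,
    the URL the document was served from acts as the base.
    """
    root = ET.fromstring(xml_text)
    # xml:base may itself be relative, so resolve it against the document URL.
    base = urljoin(document_url, root.get(XML_BASE, ""))
    return [
        urljoin(base, graphic.get(XLINK_HREF))
        for graphic in root.iter("graphic")
        if graphic.get(XLINK_HREF)
    ]
```

For a document served from https://peerj.com/articles/36.xml containing a graphic with href "fig-1.png", this yields https://peerj.com/articles/fig-1.png when there is no xml:base, and https://peerj.com/articles/36/fig-1.png when the root carries xml:base="https://peerj.com/articles/36/", which is the difference the thread is arguing about.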
@hubgit I agree that it is not difficult to hack together a single publisher's URLs. Getting the figures via PMC is only easy, though, if your starting XML files come from PMC too (the URL depends on the PMC ID). In my opinion, there is really no need to have a unique identifier for each article other than the DOI. Leveraging the 600,000+ open access articles on PubMed would be great, but it is difficult due to their terms of use, and I do not see much purpose in reinventing the wheel a few times over to provide the exact same service. We will be open sourcing the converter soon. At that point every publisher that is interested in using Lens can make their appropriate magic soup to pull out the figure URLs. I spent a little time playing around with the XML to figure out hacks for PLOS and others, but it is not as simple as @hubgit wrote: "If an xml:base attribute is not present, the base URL of an XML document is the URL from which it's served, and links are built relative to that." |
@ivangrub I agree with you entirely - I was just pointing out how xml:base works :-) |
I added this comment to the NISO JATS spec comments list. Sadly, it's not a forum -- there's no discussion or any kind of back-and-forth. If I got anything wrong, or somebody wants to add your own point of view, the thing to do would be to add your own comment. |
I've fixed PeerJ's article XML files (e.g. http://peerj.com/articles/36.xml) so they now use absolute URLs, rather than relative URLs, for the xlink:href attributes (i.e. they link to the actual image URLs). These are the full-size PNG files though, which can be rather large; still looking for best-practices for marking up alternate formats. |
Nice, thanks! I have nearly learned better than to ask this kind of question, but not |
Yes, wrapping the different formats in What I'm not sure about is if the attributes on the |
So you're saying that only MIME type serves to distinguish the Better still, has someone already standardised such extended MIME types? |
I don't think that particular standard exists, as "full" is ambiguous (a client still doesn't know what size that is; if it knew the sizes it could assume that the largest is the "full" image). There's a proposal for a HTML |
So we need TWO extensions to the NLM schema: xml:base attribute, and |
This NISO announcement might be of interest to people on this thread; http://www.niso.org/news/pr/view?item_key=095ead17653aacf2db53445611417084f1d052dc |
It's of interest, yes; but not necessarily in a good way. I would much |
Packaging was never in the scope of JATS, and for better or worse, I think, never will be. I can't see how this effort is throwing anything out. And, unlike altmetrics, there's a lot of prior work regarding packaging that could be drawn upon. I'm not the biggest fan of NISO, but maybe this effort will help. |
Hey everyone, We have open sourced refract (the NLM XML to Lens JSON converter). https://github.com/elifesciences/refract Please have a look and start playing around with it. To make it easiest to help out with issues, make development branches for each publisher type and I can help with the necessary tweaks. Thanks! |
This is not a mailing list, this is a feature request. If people wish to keep discussing, I would recommend that we port over to a mailing list. |
You can now drag+drop any NLM file into Lens. You can self-host a Lens article by checking out https://github.com/elifesciences/lens/tree/0.2.x/dist and adjusting the index.html to your needs. |
"You can now drag+drop any NLM file into Lens." That sounds awesome. But I can't figure out how to do it. I went to http://lens.elifesciences.org/ and dragged a link to https://peerj.com/articles/36.xml from another window into the Lens one, but it just loaded the XML itself in place of Lens. How do I make this work? |
Use http://lens.substance.io for now. |
How? I can't see a space on that site to drop XML onto, or an upload area. Sorry if I'm being dense. I am keen to blog the heck out of this once I get it going. |
Once you load lens.substance.io it will open to the About page. Just drag If you are dragging the XML from the Chrome download bar, it might not |
Try downloading the XML so you have it locally and then drag the XML file somewhere on the page (it doesn't matter where you drop it) We haven't had time to add some visual clue (like a drop area). It's more of a secret feature for now as we haven't optimised and tested against journals other than eLife/LandesBioscience/PLOS. Just tested with the file you mentioned. It looks good, please excuse the cat images, they're used as a default fallback when there's no image resolver in place. |
Working for me now, when I drag the XML from a local file rather than an online link. Many thanks. BTW, it does NOT work for me in Firefox (v14.01.1) -- just displays the XML. Works in Chrome. |
Should I hold off from telling the world about this? |
Will look into it when I get some time. Btw, we've added a configuration for PeerJ, so the images should show up now. |
They do! Great work, folks. |
\o/ Credits(tm) to @ivangrub. And go tell everybody. Just add a "work-in-progress" disclaimer of sorts. Thanks! :) |
Update: In case anyone is interested, it looks like xml:base will be allowed for every element in the JATS tag suite (in future revisions): http://www.niso.org/apps/group_public/view_comment.php?comment_id=275 |
Excellent! |
Thanks @Klortho for suggesting that change. It makes a lot of sense. I've added support for xml:base on the root |
@MikeTaylor We do support direct URL input now (in the current dev branch). However, it requires that the server hosting the XML supports CORS, or that you link to a proxy that handles CORS for you. http://quasipartikel.at/lens/?url=https://peerj.com/articles/36.xml#figures/all
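To make the CORS requirement concrete, here is a small Python sketch of the check a browser effectively applies to a simple cross-origin GET without credentials. `cors_allows` is a hypothetical helper, and the rule is deliberately simplified: it ignores preflight requests and credentialed requests.

```python
from typing import Mapping


def cors_allows(response_headers: Mapping[str, str], origin: str) -> bool:
    """Return True if the response headers permit `origin` to read the response.

    Simplified rule: Access-Control-Allow-Origin must be "*" or match the
    requesting origin exactly.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value.strip() for name, value in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    return allowed == "*" or allowed == origin
```

If a publisher's server does not send such a header for its XML, the browser blocks the viewer from reading the response, which is why a CORS-handling proxy is suggested as the alternative.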
It would be great to be able to read ANY valid NLM XML file using Lens just by specifying the URL. Something like
http://lens.elifesciences.org/xml-url/https://peerj.com/articles/36/