BibTeXML vs. bibteXMP #938

Closed
koppor opened this issue Mar 11, 2016 · 20 comments

@koppor
Member

koppor commented Mar 11, 2016

JabRef 3.2

It seems that JabRef offers a second kind of XML serialization of BibTeX:

xmlns:bibtex='http://jabref.sourceforge.net/bibteXMP/'

IMHO, it is not worth keeping two different XML schemas for an XML serialization of BibTeX. AFAIK, there isn't even a schema for JabRef's XML. Therefore, I propose that we use BibTeXML only and migrate old XMP metadata to the BibTeXML format.

XMP examples can be found at:

return "<rdf:Description rdf:about='' xmlns:bibtex='http://jabref.sourceforge.net/bibteXMP/' "

@koppor
Member Author

koppor commented Mar 11, 2016

Refs #898

@oscargus
Contributor

I'm not really following the argument. One may argue about different export formats, but how is it relevant that they are both XML? Isn't the real question whether it is a relevant format in itself?

@koppor
Member Author

koppor commented Mar 11, 2016

Context: The format is used for storing BibTeX data as XML inside PDF files using the XMP functionality (see net.sf.jabref.logic.xmp.XMPSchemaBibtex). This PDF metadata is used by other people to exchange PDFs with the correct bibliographic data without being forced to send the bib entry along with the PDF as a second file.

I am arguing that JabRef uses a proprietary format which is not used elsewhere. Thus, our XMP data cannot be processed by other software. I see the point that the last commit to the current BibTeXML repository is from 2011. Nevertheless, I vote for joining forces. These formats are too similar to go in different directions.

I see the following alternatives:

  1. Replace JabRef's bibteXMP by the canonical bibtex representation
  2. Completely use RDF. There seem to be multiple BibTeX2RDF converters available: https://www.w3.org/wiki/ConverterToRdf#BibTex
  3. Maybe OWL is also an option: http://zeitkunst.org/bibtex/0.1/
  4. Move to BibTeXML (as outlined in the original issue)
  5. Use MODS
  6. Keep everything as is

Somehow, the current code seems to use "Dublin Core", which reads well. Maybe that code can just be used and the other serialization using {http://jabref.sourceforge.net/bibteXMP/}bibtex can be removed completely. This needs to be investigated further.

In case everything is replaced by Dublin Core, one can update PDFBox - see #1096.
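
For illustration, a Dublin Core-only serialization could look roughly like the sketch below. It is not JabRef's code: the class name, the choice of the three fields (`title`, `author`, `year`), and their mapping to `dc:title`, `dc:creator`, and `dc:date` are assumptions made for the example. It only uses the JDK and prints a single `rdf:Description` element, using the containers XMP expects for these properties (`rdf:Alt` for the title, `rdf:Seq` for creators and dates).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this class is hypothetical and not part of JabRef.
final class DublinCoreSketch {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Escape the characters that are unsafe in XML element content.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Serialize title, author and year (assumed field names) as Dublin Core properties.
    static String toRdfDescription(Map<String, String> entry) {
        StringBuilder sb = new StringBuilder();
        sb.append("<rdf:Description rdf:about='' xmlns:rdf='").append(RDF_NS)
          .append("' xmlns:dc='").append(DC_NS).append("'>\n");

        String title = entry.get("title");
        if (title != null) {
            // dc:title is a language alternative in XMP.
            sb.append("  <dc:title><rdf:Alt><rdf:li xml:lang='x-default'>")
              .append(escapeXml(title))
              .append("</rdf:li></rdf:Alt></dc:title>\n");
        }

        String author = entry.get("author");
        if (author != null) {
            // BibTeX separates authors with " and "; dc:creator is an ordered list.
            sb.append("  <dc:creator><rdf:Seq>");
            for (String name : author.split("\\s+and\\s+")) {
                sb.append("<rdf:li>").append(escapeXml(name.trim())).append("</rdf:li>");
            }
            sb.append("</rdf:Seq></dc:creator>\n");
        }

        String year = entry.get("year");
        if (year != null) {
            sb.append("  <dc:date><rdf:Seq><rdf:li>")
              .append(escapeXml(year))
              .append("</rdf:li></rdf:Seq></dc:date>\n");
        }

        sb.append("</rdf:Description>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("author", "Jane Doe and John Smith"); // made-up sample data
        entry.put("title", "An Example Title");
        entry.put("year", "2016");
        System.out.println(toRdfDescription(entry));
    }
}
```

BibTeX fields without an obvious Dublin Core counterpart (for example `journal` or `pages`) are deliberately left out here; how to carry them is exactly the open question of this issue.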

@oscargus
Contributor

oscargus commented Mar 11, 2016 via email

@Siedlerchr
Member

The question would be: How many people actually use the XMP feature?
From my point of view, I would suggest supporting the BibTeXML format and maybe adding the RDF/OWL stuff as an extra.
Interestingly, there is also a paper about BibTeXML:
https://www.researchgate.net/publication/2564256_BIBTEXML_An_XML_Representation_of_BIBTEX

From a quick look at the code you referenced, I saw that it uses rdf tags... :confused:

@koppor
Member Author

koppor commented Mar 12, 2016

The XMP feature is the central tool to distribute PDFs with bibliographic information. I learned it from Adrian Daerr (possibly @adriandaerr?).

I am also confused by the code and also had a strange feeling about nesting JabRef's bibtexml into rdf tags. Therefore, I proposed to focus on Dublin Core (see above).

@dret

dret commented Mar 13, 2016

thanks for inviting me to the discussion! the BibTeXML we developed and implemented (http://dret.net/netdret/publications#wil01e) is a different one than the sourceforge repo. the paper is from 15 years ago, and while we used the language in a later project (http://dret.net/projects/sharef/), the software produced by that project is not really used anywhere, as far as i can tell. i did hand the sources to some people who liked it and wanted to have a bibtex-xml converter, but i don't think anybody ever made their versions public. i think our XML schema was pretty well-designed, but it's something i haven't looked at in quite a while.

@Lenchik

Lenchik commented Apr 4, 2016

Whichever format you prefer to embed in the PDF, it would be great if it were compatible with PDF/A compliance checks.
The metadata embedded by JabRef 2.x caused errors like:

XMP metadata property used, which is not predefined in the XMP specification of January 2004. There is no XMP extension schema present in the PDF defining the use and contents of this property. Some PDF-based ISO standards require that all XMP metadata properties are either predefined or defined in an embedded extension schema.

If it is a format like BibTeXML that can be exported as XML, it would also be great to have a minimal example of how to embed it correctly through LaTeX with the xmpincl or hyperxmp packages. Use case: compiling a thesis with embedded metadata prepared with JabRef.

@lenhard
Member

lenhard commented Apr 8, 2016

After dealing with this in #1096, I think the most portable solution would be to drop the JabRef bibteXMP and to encode everything in Dublin Core (which we already do on top of our custom serialization).

That is, if we do not decide to drop the XMP functionality completely.

@LSinev

LSinev commented Apr 22, 2016

Some information about correctly storing XMP inside a PDF (to be compatible with PDF/A, for example) can be found, with samples, at http://www.pdflib.com/knowledge-base/xmp-metadata/xmp-in-pdfa/
There is also a free XMP validator: http://www.pdflib.com/knowledge-base/xmp-metadata/free-xmp-validator/
Some Java code samples can also be found on that website.

@koppor
Member Author

koppor commented Jul 14, 2016

Idea (as discussed with @hummelriegel): Add the BibTeX entries of cited works to the PDF. This is especially useful for a self-written paper.

koppor mentioned this issue Dec 12, 2016
@koppor
Member Author

koppor commented Dec 13, 2016

Further options include BibTeXML and MODS. I think Dublin Core is still the way to go as it is standards-based. We should go in this direction.

@koppor
Member Author

koppor commented Dec 15, 2016

Refs koppor#6

@DesBw

DesBw commented Apr 6, 2017

Hi guys. I am not a developer, just another user. I really hope that you maintain the XMP feature. It is one of the most important unique features of JabRef and keeps me coming back time and time again (after using great reference managers like Bookends). XMP is useful not just for sharing PDF files. Embedding the information into the PDF is very useful for powerful search tools like DEVONthink [Mac], Spotlight [Mac], and dtSearch [Windows]. With the embedded data, it is possible to search PDF files by their author, title, and similar data. In addition, regenerating the JabRef library from the PDF files (in case the library is corrupted or deleted) is possible with the embedded data. I had a couple of cases where my PDF files got dissociated from their references. I dragged them back in. Voilà, I had the whole reference again. This is just so great.

@lenhard
Member

lenhard commented Apr 7, 2017

Hi dellu. Thanks for the praise! And no worries, we have no intentions of removing support for this feature. Quite the contrary, we would like to update and improve it. Unfortunately, this has so far failed due to issues in the libraries that we use for this functionality. As a result, I assume that there will be no significant changes here in the near future.

@DesBw

DesBw commented Apr 20, 2017

Thank you @lenhard. I am glad you are going to keep the feature.

What do you guys think of this?
They also write the metadata into the file using ExifTool. They use the standard BibTeX tags. Standard BibTeX is nice.

@lenhard
Member

lenhard commented Apr 20, 2017

@Dellu

Interesting link, thanks! Unfortunately, it will not be easy to interact with that tool or with ExifTool. The former is written in C++ and the latter in Perl, whereas JabRef is written in Java. There is always a way around the language differences, but from my point of view we should stick to the Java ecosystem and build a JabRef where everything is closely integrated and free of language-related friction.

Other developers might have a different opinion, though.

@koppor
Member Author

koppor commented Apr 27, 2017

Together with @snisnisniksonah I am investigating whether we can use Dublin Core.

Current steps:

  1. Read/write PDF annotations as Dublin Core using PDFBox 2.x (refs Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox #1096)
  2. Extract a command line tool to convert old PDF annotations to the new format (Refs Removed a number of warnings, added copyright etc #266 (comment)) -> XMPUtil will be released separately.

Results:

  • JabRef 4.x depending on PDFBox 2.x
  • XMPUtils depending on PDFBox 1.x
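
As a rough illustration of the reading half of step 1: once the XMP packet has been pulled out of the PDF (that part would be PDFBox's job and is not shown), the Dublin Core values can be read with a namespace-aware XML parser. The sketch below uses only the JDK's DOM parser; the class and method names are hypothetical and the sample packet is made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only; this class is hypothetical and not part of JabRef or PDFBox.
final class DublinCoreReaderSketch {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Return all rdf:li values found below the given Dublin Core element, e.g. "creator".
    static List<String> readValues(String xmp, String dcElement) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xmp)));
            List<String> values = new ArrayList<>();
            NodeList properties = doc.getElementsByTagNameNS(DC_NS, dcElement);
            for (int i = 0; i < properties.getLength(); i++) {
                NodeList items = ((Element) properties.item(i))
                        .getElementsByTagNameNS(RDF_NS, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    values.add(items.item(j).getTextContent().trim());
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample packet containing a single Dublin Core description.
        String xmp = "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:dc='" + DC_NS + "'>"
                + "<rdf:Description rdf:about=''>"
                + "<dc:title><rdf:Alt><rdf:li xml:lang='x-default'>An Example Title</rdf:li></rdf:Alt></dc:title>"
                + "<dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li><rdf:li>John Smith</rdf:li></rdf:Seq></dc:creator>"
                + "</rdf:Description></rdf:RDF>";
        System.out.println(readValues(xmp, "title"));
        System.out.println(readValues(xmp, "creator"));
    }
}
```

Because the lookup goes by namespace URI rather than by prefix, the same code works regardless of which prefix a writing tool chose for Dublin Core.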

@tobiasdiez
Member

Nice! I think the XMPUtil is not that important, since in most cases you can just write the information to the PDF again using Dublin Core, thus overwriting / "converting" the old XMP data.

@koppor
Member Author

koppor commented Feb 7, 2018

Note to self: Do not forget #938 (comment). pdflatex can easily do that: authorarchive. Check the example PDF.

johannes-manner added a commit to johannes-manner/jabref that referenced this issue Feb 7, 2018
johannes-manner mentioned this issue Feb 7, 2018
koppor pushed a commit that referenced this issue Feb 20, 2018
This fixes #938

- Reading and writing multiple DublinCore entries works: XMPUtilWriter supports multiple metadata entries in DublinCore and a single entry in the PDDocumentInformation. If you want to test the reading of multiple entries, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore for testing locally.
- Removed too much code when refactoring the XMPUtil. Non-XMP metadata are also relevant when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation
- Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox. See pull #1096.
- Refactor extraction from DublinCoreSchema
- The tests cover the most important use cases, which include reading and writing metadata from PDF files. Both formats, DublinCore and PDMetadata (which is not XMP metadata), are tested.
- Separated XMPUtils into a reader and a writer utility class.
- add meaningful names in DublinCoreExtractor and use StringUtils.isNullOrEmpty
- Log exception in XMPUtilShared