Description
As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or just as text on the title page (author, title, DOI, ..., possibly included BibTeX: https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.
Currently, a self-written extraction routine is used. It works acceptably for LNCS and IEEE papers, but not for other publishers.
Solution Sketch
A GROBID integration is already in place. It should be used here. Check Apache Tika, too.
Steps:
- Check if the first PDF page contains `@article` or other direct BibTeX data (created by https://www.ctan.org/pkg/coverpage). If yes -> use that. Stop. Else continue.
- Check if a `.bib` file is embedded in the PDF. If yes, use this one. (created by https://ctan.org/pkg/authorarchive?lang=de)
- Check if XMP data is available. If yes -> use that. Stop. Else continue.
- Look for a DOI on the first page. If present -> use that. Stop. Else continue.
- Use Apache Tika/GROBID to extract metadata from the PDF. Use that data.
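The order of the steps above can be sketched as a small decision function. This is only an illustration, not JabRef code: the class `PdfMetadataChain`, the enum `Source`, and all method names are hypothetical, the DOI pattern is a simplified heuristic, and the checks for an embedded `.bib` file and for XMP data are passed in as booleans because they would need a PDF library.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the fallback order described in the steps; not JabRef's API. */
final class PdfMetadataChain {

    /** Possible metadata sources, listed in the order in which the steps try them. */
    enum Source { FIRST_PAGE_BIBTEX, EMBEDDED_BIB, XMP, DOI, GROBID_OR_TIKA }

    // Simplified DOI heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix.
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/[^\\s\"<>]+");

    // Start of a BibTeX entry such as "@article{" or "@InProceedings {".
    private static final Pattern BIBTEX_START = Pattern.compile("@[A-Za-z]+\\s*\\{");

    /** Returns the first DOI-like string in the text, without trailing punctuation. */
    static Optional<String> findDoi(String firstPageText) {
        Matcher matcher = DOI.matcher(firstPageText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // A DOI in running text is often followed by ".", "," or ")".
        return Optional.of(matcher.group().replaceAll("[.,;)]+$", ""));
    }

    /** True if the text contains something that looks like the start of a BibTeX entry. */
    static boolean containsBibtex(String firstPageText) {
        return BIBTEX_START.matcher(firstPageText).find();
    }

    /**
     * Picks the source to use, following the order of the steps.
     * hasEmbeddedBib and hasXmp are assumed to be computed elsewhere with a PDF library.
     */
    static Source chooseSource(String firstPageText, boolean hasEmbeddedBib, boolean hasXmp) {
        if (containsBibtex(firstPageText)) {
            return Source.FIRST_PAGE_BIBTEX;
        }
        if (hasEmbeddedBib) {
            return Source.EMBEDDED_BIB;
        }
        if (hasXmp) {
            return Source.XMP;
        }
        if (findDoi(firstPageText).isPresent()) {
            return Source.DOI;
        }
        return Source.GROBID_OR_TIKA; // last resort: full PDF scraping
    }
}
```

For example, `PdfMetadataChain.chooseSource(text, false, false)` would fall through to the DOI check and then to GROBID/Tika if the first page contains neither BibTeX data nor a DOI.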
Improvement possibility
Offer a merge dialog that combines the results of the different options (e.g., XMP + PDF scraping via GROBID).
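As a rough idea of what such a dialog would need as input, the sketch below compares the fields from two sources and reports which ones conflict. It is an assumption-laden illustration: `FieldMerge` and its methods are hypothetical names, and entries are modeled as plain field-to-value maps rather than JabRef's own entry type.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; entries are modeled as simple field -> value maps. */
final class FieldMerge {

    /** Fields present in both sources with different values; these need a user decision. */
    static Set<String> conflicts(Map<String, String> first, Map<String, String> second) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, String> field : first.entrySet()) {
            String other = second.get(field.getKey());
            if (other != null && !other.equals(field.getValue())) {
                result.add(field.getKey());
            }
        }
        return result;
    }

    /** Union of both sources; where both define a field, the preferred source wins. */
    static Map<String, String> merge(Map<String, String> preferred, Map<String, String> fallback) {
        Map<String, String> merged = new LinkedHashMap<>(fallback);
        merged.putAll(preferred);
        return merged;
    }
}
```

A dialog could pre-fill the non-conflicting fields from `merge` and ask the user only about the fields returned by `conflicts`.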
Challenges
- Cover different cases (BibTeX text on the first page, `.bib` embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...))
- Good test cases
- Create test PDFs
Side notes
Check the current drag'n'drop behavior. In 3.8.2, the user was asked whether to create a new entry or link the PDF.
In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.
Refs JabRef#7209