Replace PDFContentImporter by another library #169

Closed
JabRef/jabref
#8001
@koppor

Description

As a user, I want to import a PDF into JabRef. Each PDF contains bibliographic information, either as embedded XMP data or as plain text on the title page (author, title, DOI, ..., possibly included BibTeX: https://www.ctan.org/pkg/coverpage or https://ctan.org/pkg/authorarchive?lang=de). This information should be extracted from the PDF.

Currently, the self-written PDFContentImporter is used. It works OK for LNCS and IEEE papers, but not for other publishers.

Solution Sketch

GROBID is already in place and should be used. Check Apache Tika, too.

Steps:

  1. Check if the first PDF page contains @article or other direct BibTeX data (created by https://www.ctan.org/pkg/coverpage). If yes -> use that. Stop. Else continue.
  2. Check if a .bib file is embedded in the PDF (created by https://ctan.org/pkg/authorarchive?lang=de). If yes -> use that. Stop. Else continue.
  3. Check if XMP data is available. If yes -> use that. Stop. Else continue.
  4. Look for DOI in the first page. If present -> use that. Stop. Else continue.
  5. Use Apache Tika/GROBID to extract the data from the PDF. Use that data.
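
The steps above amount to a chain of extractors that stops at the first hit. A minimal sketch of that control flow, under stated assumptions: PdfImportChain, PdfData and Result are hypothetical names (not JabRef classes), PdfData stands in for whatever a PDF library would deliver, the DOI pattern is deliberately simplified, and the Tika/GROBID step is passed in as a function because its real API is not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names exist in JabRef.
class PdfImportChain {

    // Stand-in for what a PDF library would deliver for one file (assumption).
    record PdfData(String firstPageText, Optional<String> embeddedBib, Map<String, String> xmpFields) {}

    // Which step succeeded, plus the raw payload it found.
    record Result(String source, String payload) {}

    // Start of a BibTeX entry such as "@article{".
    private static final Pattern BIBTEX_ENTRY = Pattern.compile("@[A-Za-z]+\\s*\\{");
    // Simplified DOI pattern; it may also capture trailing punctuation.
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/\\S+");

    // Step 1: BibTeX printed on the first page.
    static Optional<Result> bibtexOnFirstPage(PdfData pdf) {
        Matcher m = BIBTEX_ENTRY.matcher(pdf.firstPageText());
        return m.find()
                ? Optional.of(new Result("first-page BibTeX", pdf.firstPageText().substring(m.start())))
                : Optional.empty();
    }

    // Step 2: .bib file embedded in the PDF.
    static Optional<Result> embeddedBibFile(PdfData pdf) {
        return pdf.embeddedBib().map(bib -> new Result("embedded .bib", bib));
    }

    // Step 3: XMP metadata.
    static Optional<Result> xmp(PdfData pdf) {
        return pdf.xmpFields().isEmpty()
                ? Optional.empty()
                : Optional.of(new Result("XMP", pdf.xmpFields().toString()));
    }

    // Step 4: DOI on the first page.
    static Optional<Result> doiOnFirstPage(PdfData pdf) {
        Matcher m = DOI.matcher(pdf.firstPageText());
        return m.find() ? Optional.of(new Result("DOI", m.group())) : Optional.empty();
    }

    // Runs steps 1-4 in order, then the supplied content extractor (step 5).
    static Optional<Result> importPdf(PdfData pdf, Function<PdfData, Optional<Result>> contentExtractor) {
        List<Function<PdfData, Optional<Result>>> steps = List.of(
                PdfImportChain::bibtexOnFirstPage,
                PdfImportChain::embeddedBibFile,
                PdfImportChain::xmp,
                PdfImportChain::doiOnFirstPage,
                contentExtractor);
        for (Function<PdfData, Optional<Result>> step : steps) {
            Optional<Result> result = step.apply(pdf);
            if (result.isPresent()) {
                return result; // "Stop" in the step list
            }
        }
        return Optional.empty();
    }
}
```

Keeping the source name in the result would make it easy to tell the user where an entry came from, and the ordered list makes it cheap to reorder or add steps later.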

Improvement possibility

Offer merge dialog from the different options (e.g., XMP + PDF scraping via GROBID)
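
A merge dialog would first need to know where the sources disagree. A rough sketch of that preparation step, assuming each source (e.g., XMP, GROBID) yields a plain field-to-value map; MergeCandidates and its methods are hypothetical names, not existing JabRef code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper; not part of JabRef.
class MergeCandidates {

    // Groups the distinct values each field received across all sources.
    static Map<String, Set<String>> valuesPerField(Map<String, Map<String, String>> fieldsBySource) {
        Map<String, Set<String>> values = new TreeMap<>();
        for (Map<String, String> fields : fieldsBySource.values()) {
            fields.forEach((field, value) ->
                    values.computeIfAbsent(field, f -> new LinkedHashSet<>()).add(value));
        }
        return values;
    }

    // Fields on which the sources disagree; only these need a user decision.
    static List<String> conflictingFields(Map<String, Map<String, String>> fieldsBySource) {
        List<String> conflicts = new ArrayList<>();
        valuesPerField(fieldsBySource).forEach((field, vals) -> {
            if (vals.size() > 1) {
                conflicts.add(field);
            }
        });
        return conflicts;
    }
}
```

Fields with a single distinct value could be accepted silently, so the dialog would only show the conflicting ones.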

Challenges

  • Cover different cases: BibTeX text on the first page, .bib embedded, DOI on first page, different publisher PDFs (LNCS, LNI, IEEE, ACM, ...)
  • Good test cases
    • Create test PDFs

Side notes

Check the current drag'n'drop behavior. In 3.8.2, the user was asked whether they want to create a new entry or link the PDF.

In http://discourse.jabref.org/t/more-control-on-the-duplicate-finder/120/4?u=koppor the tool https://github.com/CrossRef/pdfextract was recommended. At first sight, it can fully replace our PDFContentImporter.

Refs JabRef#7209

Labels: component: xmp (Issues concerning the XMP PDF metadata)
