Implement more pdf importers #7947

Merged on Aug 18, 2021 (51 commits)

@btut (Contributor) commented Jul 30, 2021

This PR aims to implement more PDF importers.
Currently, PDFs can be imported using the PdfContentImporter, which is tailored to some IEEE and Springer formats. We want to add:

  • PdfGrobidImporter.java: Queries the Grobid instance at grobid.jabref.org (commits cherry-picked from #7929, "Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...)")
  • PdfEmbeddedBibFileImporter.java: Support for PDFs that have a BibTeX file embedded in them (e.g. generated using the authorarchive package)
  • PdfVerbatimBibTextImporter.java: Support for PDFs that have their own BibTeX entry on the first page (e.g. generated using the coverpage package)
  • PdfMergeMetadataImporter.java: An importer that:
    • Calls a list of other PDF importers
    • Merges the results (importers with higher priority have higher credibility; fields from lower-priority importers are used only if no higher-priority importer deduced that field)
    • If any of the used importers found a DOI or ISBN, calls the DOI / ISBN fetcher to improve the entry further
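
The merge rule described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: entries are modeled as plain field-to-value maps instead of JabRef's entry classes, `PriorityMergeSketch` and `merge` are hypothetical names, and the DOI value is a made-up placeholder. The follow-up DOI / ISBN fetcher call is not shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: entries are plain field -> value maps, not JabRef classes.
class PriorityMergeSketch {

    /**
     * Merges candidate entries given in descending priority order.
     * A field from a lower-priority candidate is used only if no
     * higher-priority candidate has already supplied that field.
     */
    static Map<String, String> merge(List<Map<String, String>> candidatesByPriority) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> candidate : candidatesByPriority) {
            for (Map.Entry<String, String> field : candidate.entrySet()) {
                // putIfAbsent keeps the value set by an earlier (higher-priority) candidate
                merged.putIfAbsent(field.getKey(), field.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only
        Map<String, String> higher = Map.of("title", "Exact Title", "year", "2021");
        Map<String, String> lower = Map.of("title", "Exact Titel", "doi", "10.1000/example");
        System.out.println(merge(List.of(higher, lower)));
    }
}
```

In this sketch the higher-priority title wins, while the DOI, which only the lower-priority candidate found, is still carried over into the merged entry.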

The PdfMergeMetadataImporter will be used when users import PDFs into JabRef.

    • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
    • Tests created for changes (if applicable)
    • Manually tested changed features in running JabRef (always required)
    • Screenshots added in PR description (for UI changes)
    • Checked documentation: Is the information available and up to date? If not, created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

btut added 7 commits July 30, 2021 13:46
Implemented an importer that queries Grobid for the metadata of a PDF.
The necessary Grobid functionality (retrieving BibTeX for a PDF) is not yet
available in Grobid, but we opened a PR that implements it
(kermitt2/grobid#800).

It's no longer necessary to set the POST data as bytes, as we use JSoup
for that.

Users can perform a PDF import on already imported PDFs to improve the
quality of the entry.

When importing, first try importers that can tell whether they are suitable
for a certain file format.
Some importers only check whether a file is present, not whether it is in the
correct format (isRecognizedFormat is always true if an existing file is given).
They are used last.

The List of importers now reflects that prioritization. It is not sorted
by importer names anymore.
The getter-methods getImportFormats and getImportFormatList still sort
the List by name for the View.
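
A minimal sketch of the ordering described in these commit messages, assuming a hypothetical `Importer` stand-in with a `checksContent` flag (neither is taken from the JabRef code base): importers whose format check actually inspects the file come first, importers that accept any existing file come last, and a separate name-sorted copy is produced for the view.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; Importer here is a stand-in, not JabRef's Importer class.
class ImporterOrderSketch {

    static final class Importer {
        final String name;
        // true if the format check inspects the file content,
        // false if it only checks that the file exists
        final boolean checksContent;

        Importer(String name, boolean checksContent) {
            this.name = name;
            this.checksContent = checksContent;
        }
    }

    /** Content-checking importers first; List.sort is stable, so ties keep their order. */
    static List<Importer> byPriority(List<Importer> importers) {
        List<Importer> ordered = new ArrayList<>(importers);
        // Boolean.FALSE sorts before Boolean.TRUE, so negate the flag
        ordered.sort(Comparator.comparing((Importer i) -> !i.checksContent));
        return ordered;
    }

    /** Name-sorted copy for display; the priority list itself is left untouched. */
    static List<Importer> byNameForView(List<Importer> importers) {
        List<Importer> sorted = new ArrayList<>(importers);
        sorted.sort(Comparator.comparing((Importer i) -> i.name));
        return sorted;
    }
}
```

Keeping the two orderings in separate copies lets the import logic rely on priority order while the view still shows an alphabetical list.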
@btut btut self-assigned this Aug 16, 2021
@Siedlerchr (Member) commented

you seem to have an error:

/home/runner/work/jabref/jabref/src/test/java/org/jabref/logic/importer/WebFetchersTest.java:65: error: method getEntryBasedFetchers in class WebFetchers cannot be applied to given types;
Set idFetchers = WebFetchers.getEntryBasedFetchers(importFormatPreferences);
^
required: ImportFormatPreferences,FilePreferences,BibDatabaseContext,Charset
found: ImportFormatPreferences
reason: actual and formal argument lists differ in length

@btut (Contributor, Author) commented Aug 18, 2021

One last thing before we merge: The verbatim importer currently checks for bibtex on the first page. As discussed with @calixtus and @koppor, there are cases where a verbatim bibtex code is present, but on a later page. Shall we increase that limit? What would be a good number to use?
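
To make the question concrete, here is a rough sketch of a page-limited scan. It is not the importer's actual code: it assumes the page texts have already been extracted into strings, and `VerbatimBibTexScanSketch`, `MAX_PAGES_TO_SCAN`, and the regular expression are hypothetical illustrations. Raising the limit amounts to changing one constant.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch; page texts are assumed to be extracted beforehand.
class VerbatimBibTexScanSketch {

    // Assumed limit; a value of 1 corresponds to checking only the first page.
    static final int MAX_PAGES_TO_SCAN = 1;

    // Rough heuristic for the start of a BibTeX entry: "@type{key,"
    private static final Pattern ENTRY_START =
            Pattern.compile("@\\w+\\s*\\{\\s*[^,\\s]+\\s*,");

    /** Returns the 1-based number of the first scanned page containing an entry start. */
    static Optional<Integer> firstPageWithEntry(List<String> pageTexts, int maxPages) {
        int limit = Math.min(maxPages, pageTexts.size());
        for (int page = 0; page < limit; page++) {
            if (ENTRY_START.matcher(pageTexts.get(page)).find()) {
                return Optional.of(page + 1);
            }
        }
        return Optional.empty();
    }
}
```

With a limit of 1, an entry printed on the second page is missed; with a limit of 2 or more, the same call finds it.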

@calixtus (Member) commented

Wouldn't make this PR more complex now than it is. We can discuss this probably on JabCon. Let's just merge it and see after that.
Just restarted the fetcher tests, let's see, if Grobid turns green. :-)

@btut (Contributor, Author) commented Aug 18, 2021

> Wouldn't make this PR more complex now than it is.

It's just changing a number here. Nothing complex about it. We just need to decide if we want that behaviour or not.

> Just restarted the fetcher tests, let's see, if Grobid turns green. :-)

Adapted them to the new Grobid output. Should be fine now.

@Siedlerchr Siedlerchr merged commit 0b02dd4 into JabRef:main Aug 18, 2021
@calixtus calixtus removed the "status: ready-for-review" and "status: depends-on-external" labels Aug 18, 2021
@btut btut deleted the improvement/morePdfImporters branch August 19, 2021 07:05
Siedlerchr added a commit that referenced this pull request Aug 20, 2021
* upstream/main: (110 commits)
  Extract PushTo names into model (#8005)
  Refactor processCitation in GrobidService to match processPdf (#8003)
  Improved progress indication for fulltext-index operations (#7981)
  Reordered Pdf-Importer priorities (#8001)
  Implement more pdf importers (#7947)
  Adding icon picker for group dialog issue#6142 (#7776)
  Fix possible NPE in exporter with empty charset (#7979)
  Fix icon color (#7994)
  Bump slf4j-api from 2.0.0-alpha2 to 2.0.0-alpha4 (#7991)
  Bump classgraph from 4.8.112 to 4.8.114 (#7990)
  Bump mariadb-java-client from 2.7.3 to 2.7.4 (#7992)
  Bump jsoup from 1.14.1 to 1.14.2 (#7993)
  New yaml issue template (#7983)
  [Bot] Update CSL styles (#7985)
  Reordered items in main table right-click menu (#7952)
  Fulltext Index: Only index local pdf files (#7980)
  Bump WyriHaximus/github-action-wait-for-status from 1.3 to 1.4 (#7973)
  Bump byte-buddy-parent from 1.11.9 to 1.11.12 (#7974)
  Bump classgraph from 4.8.110 to 4.8.112 (#7975)
  Bump checkstyle from 8.45 to 8.45.1 (#7978)
  ...

# Conflicts:
#	src/main/java/module-info.java