Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with writing metadata to pdfs #8332

Merged
merged 4 commits into from
Dec 19, 2021
Merged

Fix issues with writing metadata to pdfs #8332

merged 4 commits into from
Dec 19, 2021

Conversation

btut
Copy link
Contributor

@btut btut commented Dec 13, 2021

There are still some issues when writing metadata to pdfs.

  • Writing metadata multiple times (so overwriting already existing metadata) results in an error.
  • Metadata export fails for some files (assumption: files with write protection)
  • Improve wording of error messages

Fixes #8278

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@btut
Copy link
Contributor Author

btut commented Dec 13, 2021

@ThiloteE would you be so kind to try the exports again? I have to admit you provided a lot of information and I feel a little lost - I don't want to miss anything.

What I purposefully left out for now, is the error message when importing XMP. I think the existing error is fine. If a user knows what XMP is and explicitly tries to import XMP, they will be able to handle such a message I think ;)

@btut btut self-assigned this Dec 13, 2021
@ThiloteE
Copy link
Member

I am on it. Just downloaded the pull from https://builds.jabref.org/pull/8332/merge/. Hope this is the correct one

@Siedlerchr
Copy link
Member

@ThiloteE you have to wait until the Deployment workflow runs through

@ThiloteE
Copy link
Member

oh

@ThiloteE
Copy link
Member

First test results are in. Great success!

the errormessage that emerged when repeatedly trying to update the file (ii.) is gone. I can update an entry successfully without an errormessage appearing.

'citationkey'
  OK.

Finished writing metadata for 1 file (0 skipped, 0 errors).

This status message appears when i

  • write new metadata
  • no new metadata is found on the file, because the metadata i try to write is the same that is already attached to the file.

here the metadata extracted via exiftool:

metadata test 3 after second run.txt
metadata test 3 after first run.txt

This is great! Not sure if unintended repercussions will emerge someday in some niche cases, but this is definitely a fix for the problem i encountered! A great leap forward.

While doing this test though, i noticed that the keywords field is not exported. (keywords = {water},) 😥

@ThiloteE
Copy link
Member

second test results are in.

Old behavior:

  • When more than two entries that are linked to the same file are queued to export metadata an error message appeared, but metadata was written

New behavior:

  • When more than two entries that are linked to the same file are queued to export metadata an error message does NOT appear and metadata was written successfully, albeit the second entry overwrites part of the first.

    This is ... uh i guess one step forward and one step backwards. Good is that the user now knows that metadata was written, but it is worse in that the user does not know that both entries link to the same file and the second entry overwrites metadata from the first entry.

I would desire something like this:

  • i."Error message should remain, but rephrased to inform user that "During the export process, metadata from two (or more) separate entries were written to the same file".

Here the test data:

Status message:

1111tel
  OK.
2222tel
  OK.

Finished writing metadata for 2 file (0 skipped, 0 errors).

metadata test before any run (fresh pdf).txt
metadata test after first run and corresponding status message.txt

@ThiloteE
Copy link
Member

Third (and last) test results are in and correspond to iii in #8278 (comment):

Old and current behavior are similar!

How i did the test:

When trying to write metadata to Améry (1973) Wider den Strukturalismus. Das Beispiel des Michel Foucault.pdf (one of the pdf files that fail to be written with metadata) the following error message emerges:

Amery1973WdS
  Error while writing 'E:\server-t-150\FAU\3. Semester\AER Demokratische Transformation in Asien\00 Hausarbeit Buente 3\Améry (1973) Wider den Strukturalismus. Das Beispiel des Michel Foucault.pdf':
    null

Finished writing metadata for 0 file (0 skipped, 1 errors).

Desired new behavior is something like:

  • Error message should remain, but rephrased to something like: "Metadata could not be exported/written"
  • Optional: Include cause for why metadata could not be exported/written.

@btut
Copy link
Contributor Author

btut commented Dec 13, 2021

Not sure if unintended repercussions will emerge someday in some niche cases, but this is definitely a fix for the problem i encountered!

For now, I just implemented it so that old metadata is always overwritten. This is something to discuss about.

When more than two entries that are linked to the same file are queued to export metadata an error message appeared, but metadata was written

That was not intentional behavior for files linked twice, but because after the first write the embedded-bibtex export failed because there was already metadata. So the XMP contained the second metadata (because it always did overwrite any existing metadata) while the embedded bibtex contained the first metadata because the second write failed.

When more than two entries that are linked to the same file are queued to export metadata an error message does NOT appear and metadata was written successfully, albeit the second entry overwrites part of the first.

True, this is unfortunate. I think this would mostly be the case for journals and such, right? I think the correct thing to do here would be to write the XMP information of the collection (like journal-name and year if all entries are from the same journal) and embedd the bibtex of all entries that are linked. Would be lots of work though that I personally don't have the time for right now.

@ThiloteE
Copy link
Member

When more than two entries that are linked to the same file are queued to export metadata an error message does NOT appear and metadata was written successfully, albeit the second entry overwrites part of the first.

True, this is unfortunate. I think this would mostly be the case for journals and such, right? I think the correct thing to do here would be to write the XMP information of the collection (like journal-name and year if all entries are from the same journal) and embedd the bibtex of all entries that are linked. Would be lots of work though that I personally don't have the time for right now.

I think it also is relevant if you have a file that is a whole book, but then you have InBook references (because you only want to cite a few chapters of the book and not the whole book). People will have chapter entries in their library and may want to make use of the 'crossref' field.

As it stands, right now, i think having multiple entries that link to the same file is not advised to have in your library :D I personally will try to avoid this.

  • Workaround 1: duplicate the book file and change the name.
  • Workaround 2: Divide the book into various chapters (This can be done e.g. via pdf24), but this takes a lot of time. Workaround 1 is prefered

@btut
Copy link
Contributor Author

btut commented Dec 13, 2021

I think it also is relevant if you have a file that is a whole book, but then you have InBook references

Good Point! First idea: look for all entries for a pdf, only keep fields that are the same for all entries, write those entries.

@btut
Copy link
Contributor Author

btut commented Dec 13, 2021

Third (and last) test results are in and correspond to iii in #8278 (comment):

I can reproduce, but not explain (yet). Need some more investigation.

@ThiloteE
Copy link
Member

I think it also is relevant if you have a file that is a whole book, but then you have InBook references

Good Point! First idea: look for all entries for a pdf, only keep fields that are the same for all entries, write those entries.

Oh, yes that would be good!

  • Some metadata would be missing, BUT
  • one the plus side, it completely minimizes the risk of creating wrong metadata.

@ThiloteE
Copy link
Member

Unless you find some nitpicks during the codereview, the tests i did would suggest that what btut here did is definitely better than before. It fixes ii. of #8278 (comment), which is currently the thing that bothers the most.

I recommend merging for 5.4.

(Although i still think that i. should be fixed somehow at one point at least! There will be people complaining why the metadata they export is not the metadata they expect, but at least we know why and what the user can do to workaround it. iii. seems a tough nut to crack, so maybe one day ...)

@Siedlerchr Siedlerchr marked this pull request as ready for review December 19, 2021 13:41
@Siedlerchr
Copy link
Member

Okay, wer merge this now and create a follow up issue

@Siedlerchr Siedlerchr merged commit 6f247f4 into main Dec 19, 2021
@Siedlerchr Siedlerchr deleted the fixMetadataExport branch December 19, 2021 19:03
Siedlerchr added a commit that referenced this pull request Dec 20, 2021
* upstream/main: (46 commits)
  New Crowdin updates (#8349)
  Bump pdfbox from 2.0.24 to 2.0.25 (#8345)
  Bump fontbox from 2.0.24 to 2.0.25 (#8348)
  Bump xmlunit-core from 2.8.3 to 2.8.4 (#8347)
  Bump tika-core from 2.1.0 to 2.2.0 (#8346)
  Added missing executable bindings to various commands (#8342)
  Update Gradle Wrapper from 7.3.1 to 7.3.2. (#8343)
  Fix issues with writing metadata to pdfs (#8332)
  add tinylog test (#8339)
  Tinylog (#8226)
  Don't register any database changes to the indexer while dropping a file (#8334)
  Fix ACM fetcher (#8338)
  Squashed 'buildres/csl/csl-styles/' changes from 3bb4b5f..60bf7d5
  Exception shouldn't happen when pasting an entry with a publication date-range of form yyyy-yyyy (#8247)
  Refactor Sidepane logic (#8202)
  New translations JabRef_en.properties (Japanese) (#8331)
  Bump bcprov-jdk15on from 1.69 to 1.70 (#8333)
  Update Controlsfx to 11.1.1 (#8330)
  Update citeproc (#8329)
  Bump classgraph from 4.8.137 to 4.8.138 (#8327)
  ...

# Conflicts:
#	build.gradle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing XMP metadata to PDFs skips my linked pdf file. Improve error messages
3 participants