Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Add encoding to file writing function #405

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

claell
Copy link
Contributor

@claell claell commented Sep 21, 2023

Fixes #403

@claell
Copy link
Contributor Author

claell commented Sep 21, 2023

Right now, this is just adapted from #395. Not tested, just quickly done in GitHub editor.

Copy link
Collaborator

@MiWeiss MiWeiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Same as in #395, we would need some tests, primarily to prevent regressions. After those are added, the PR can be merged.

@MiWeiss MiWeiss changed the title Add encoding to file writing function 🐛 Add encoding to file writing function Sep 21, 2023
@claell
Copy link
Contributor Author

claell commented Oct 17, 2023

Sorry, am rather busy right now, so not sure when (or if) I'll have time for that.

@MiWeiss
Copy link
Collaborator

MiWeiss commented Nov 2, 2023

Sorry, am rather busy right now, so not sure when (or if) I'll have time for that.

Thanks for the comment. I'll mark the PR to be ready for anyone to continue working on this.

If anyone is willing to do so, or if @claell resumes his work: Please comment here to avoid having two people working on the same thing at the same time.

p.s. Continuing to work on the PR can be done as follows: Add the fork of @claell as additional git remote, pull the branch, and then push it to your own fork and open a new PR, stating that that one is supposed to replace this one.

@agriyakhetarpal
Copy link

agriyakhetarpal commented Jan 12, 2024

Hi, we're looking to add python-bibtexparser as a dependency for the citations workflow of our Python package soon – I thought it would be nice to contribute as well, perhaps by unblocking a few PRs; such as this one and suchlike.

I am happy to take over from @claell and write a few tests, though the issue is that I fail to reproduce the original issue (#403) at this time:

Using the core functionality, i.e., without any customizations to the formatting options for writing, I am able to write an entry:

import bibtexparser
from bibtexparser import *
bib_library = bibtexparser.Library()
fields = []
fields.append(bibtexparser.model.Field("author", "ö"))
entry = bibtexparser.model.Entry("ARTICLE", "test", fields)
bib_library.add(entry)
bibtexparser.write_file("my_new_file.bib", bib_library)

i.e., the same MWE as previously reported in #403 (comment), and parsing my_new_file.bib seems to work without issues.

my_new_file.bib

@ARTICLE{test,
	author = {ö}
}

and reading this programmatically:

import bibtexparser
library = bibtexparser.parse_file("my_new_file.bib")

Therefore, library.entries_dict returns

{'test': Entry(entry_type=`article`, key=`test`, fields=`[Field(key=`author`, value=`ö`, start_line=1)]`, start_line=0)}

which I am able to write to a new .bib file via the bibtexparser.write_file() method without any loss of data or missing umlauts symbols. Has this been fixed, or cannot be reproduced, or there is a different method of looking at the error that I may have missed in oversight?

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jan 14, 2024

Hi @agriyakhetarpal

Great, thanks a lot for taking over!

Strange to hear that the problem cannot be reproduced, I am not aware of any changes we merged recently (although I have not checked ;-) ). As we're talking about encoding, I would not be surprised if the behavior was somewhat system-dependent, which could explain the issue at hand.

In either case, I think we should still merge this PR - to keep the interface consistent and generally applicable. While it's not ideal to do so without reproducing the problem above, I'd suggest implementing tests similar to #395 - this should, at the very least, protect us from some regressions. Do you agree?

I thought it would be nice to contribute as well

That would be amazing. I'll be happy to help you along the way wherever I can (feedback, reviews, ...)

@agriyakhetarpal
Copy link

agriyakhetarpal commented Jan 14, 2024

As we're talking about encoding, I would not be surprised if the behavior was somewhat system-dependent, which could explain the issue at hand.

In either case, I think we should still merge this PR - to keep the interface consistent and generally applicable. While it's not ideal to do so without reproducing the problem above, I'd suggest implementing tests similar to #395 - this should, at the very least, protect us from some regressions. Do you agree?

I'll be happy to help you along the way wherever I can (feedback, reviews, ...)

I could not agree more. I am aware systems like Windows choose CP1252 by default for writing to files if UTF-8 isn't specified – maybe the reason I'm not seeing this error is because I'm on macOS?

Is it fine if we can discuss things here itself before I proceed to write a PR? I couldn't wrap my head around the source code for the middlewares and customisers for now—very custom engineering TBF—but I did manage to write a simple test for test_writer.py, as follows.

def test_write_article_with_umlauts():
    entry_block = Entry(
        entry_type="article",
        key="myKey",
        fields=[
            Field(key="title", value='"myTitle"'),
            Field(key="author", value='"Müller, Max"'),
        ],
    )
    library = Library(blocks=[entry_block])
    string = writer.write(library)
    assert string == '@article{myKey,\n\ttitle = "myTitle",\n\tauthor = "Müller, Max"\n}\n'

I would appreciate feedback on this, afterwards we should be able to pytest-parameterize this with a few other characters (ö, ä, ë, and other diacritics you can think of!)

Edit: removed some redundant print statements, was just debugging something to stdout

@MiWeiss
Copy link
Collaborator

MiWeiss commented Jan 18, 2024

Hi @agriyakhetarpal

Your suggested test, as I read it, would not actually test the method targeted in this PR.

Instead, you would have to test the method bibtexparser.write_file and included the newly added encoding parameter. This will write a new bibtex file. You could then check if this file contains the expected content in the expected encoding.

@agriyakhetarpal
Copy link

Instead, you would have to test the method bibtexparser.write_file and included the newly added encoding parameter. This will write a new bibtex file. You could then check if this file contains the expected content in the expected encoding.

Makes sense. I improved the test such that it parses a given BibTeX string with umlauts symbols, writes to a temporary file, and reads from it; as follows:

def test_write_file_with_umlauts():
    bibtex_str = """@article{umlauts,
    author = {Müller, Hans},
    title = {A title},
    year = {2014},
    journal = {A Journal}
    }"""
    library = parse_string(bibtex_str)
    with tempfile.NamedTemporaryFile(mode="w", encoding="utf-8") as f:
        write_file(f, library, encoding="utf-8")
        f.seek(0)
        library = parse_file(f.name, encoding="utf-8")
    assert library.entries[0]["author"] == "Müller, Hans"
    assert library.entries[0]["title"] == "A title"
    assert library.entries[0]["year"] == "2014"
    assert library.entries[0]["journal"] == "A Journal"

I opted to use temporary files so as to not create any clutter. Does this look fair enough? I can then write a few more tests or parameterize this as needed, perhaps with a few more characters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing doesn't work for umlauts (likely UTF-8 formatting problem)
3 participants