Require metadata CSV to be UTF-8 encoded #485
Conversation
@msmolens Can you rebase this onto master, so tests can pass? It makes sense that without … However, are you aware of any instances of CSVs with Unicode characters which were not encoded in UTF-8, or is the …
Force-pushed from bb4c124 to 56c1e30
Codecov Report
@@ Coverage Diff @@
## master #485 +/- ##
==========================================
- Coverage 62.78% 62.73% -0.06%
==========================================
Files 32 32
Lines 2913 2914 +1
==========================================
- Hits 1829 1828 -1
- Misses 1084 1086 +2
Not exactly. The first reference link in the commit message explains PyMongo's behavior. In particular:
The original bug occurred when the regular strings we input to PyMongo are validated and those strings aren't valid UTF-8.
Yes, the CSVs on the live site that demonstrated the bug are not valid UTF-8. While we could use heuristics to guess the encoding, it seems far better to require a single standard encoding. Some education or additional documentation might be helpful for dataset contributors. I tested re-saving the problem CSVs using Google Sheets, which saves CSV files in UTF-8, and following this change the updated CSVs work. The combination of …
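The failure mode discussed above can be reproduced without MongoDB. BSON strings must be valid UTF-8, so a byte string carrying Latin-1 data fails the same decode check that PyMongo's string validation performs. A minimal sketch (Python 3 used for illustration; variable names are our own):

```python
# Bytes that are valid Latin-1 but not valid UTF-8 fail the same
# decode check that BSON string validation performs.
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'

try:
    latin1_bytes.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError as exc:
    is_valid_utf8 = False
    # exc.start is the offset of the first offending byte (0xe9 at index 3)
    print("invalid UTF-8 at byte offset %d: %s" % (exc.start, exc.reason))

assert not is_valid_utf8
```

Re-saving the file in a UTF-8-producing tool (as with Google Sheets above) removes exactly this class of bytes.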
Ah, that makes sense. I forgot that PyMongo was able to take a … So:
Is this correct?
requirements.txt
Outdated
@@ -1,3 +1,4 @@
backports.csv==1.0.5
Actually, can we make this a more abstract version specifier?
Changed to >= with the same version number.
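For reference, the abstract specifier requested above pins only a minimum version in requirements.txt (rather than an exact pin with ==):

```
backports.csv>=1.0.5
```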
Yes, the two bullet points above describe the behavior as I understand it. With the old behavior, a …
Force-pushed from 56c1e30 to 84b3733
Require metadata CSV files to be UTF-8 encoded. Validation of CSV files
that aren't UTF-8 encoded will indicate an error that provides details
of the first encoding problem encountered in the file.
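A minimal sketch of this kind of validation (the helper name is hypothetical, not the PR's actual code): decode the file's raw bytes as UTF-8 and surface the offset and reason of the first failure.

```python
def first_utf8_error(raw_bytes):
    """Return None if raw_bytes is valid UTF-8, otherwise a description
    of the first encoding problem (hypothetical helper)."""
    try:
        raw_bytes.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return "byte 0x%02x at offset %d: %s" % (
            raw_bytes[exc.start], exc.start, exc.reason)

# A UTF-8 encoded file passes; a Latin-1 encoded file is rejected
# with details of the first bad byte.
assert first_utf8_error("café".encode("utf-8")) is None
print(first_utf8_error("café".encode("latin-1")))
```

Reporting only the first problem keeps the error message short while still pointing the contributor at the exact byte to fix.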
Because MongoDB stores data in BSON format and BSON strings are UTF-8
encoded, strings stored to the database must contain valid UTF-8 data
[1]. Validating that the CSV file is UTF-8 encoded avoids errors when
saving strings to the database later in the workflow. For example:
InvalidStringData: strings in documents must be valid UTF-8
The implementation replaces the csv module with a backport of Python 3's
csv module [2]. This ensures that csv.DictReader properly reads UTF-8
CSV files and returns Unicode strings. Additionally, using this backport
instead of managing UTF-8 conversions manually should make a future
transition to Python 3 easier.
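Under Python 3, the standard csv module (whose API backports.csv mirrors for Python 2) shows the behavior this change relies on: DictReader consumes decoded text and yields Unicode strings directly, with no manual UTF-8 conversion. A small illustration with made-up data:

```python
import csv
import io

# DictReader operates on decoded text, so non-ASCII fields come back
# as Unicode strings with no manual encoding handling.
data = io.StringIO("name,city\nJosé,São Paulo\n")
rows = list(csv.DictReader(data))

assert rows[0]["name"] == "José"
assert rows[0]["city"] == "São Paulo"
```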
Note that the implementation currently isn't compatible with Python 3
because of the use of the unicode() function.
[1] http://api.mongodb.com/python/current/tutorial.html#a-note-on-unicode-strings
[2] https://github.com/ryanhiebert/backports.csv
Fixes #473