Require metadata CSV to be UTF-8 encoded #485

Merged
brianhelba merged 1 commit into master from 473-require-utf8-metadata on Nov 2, 2017

Conversation

@msmolens (Contributor) commented Nov 1, 2017

Require metadata CSV files to be UTF-8 encoded. Validation of CSV files
that aren't UTF-8 encoded will indicate an error that provides details
of the first encoding problem encountered in the file.

Because MongoDB stores data in BSON format and BSON strings are UTF-8
encoded, strings stored to the database must contain valid UTF-8 data
[1]. Validating that the CSV file is UTF-8 encoded avoids errors when
saving strings to the database later in the workflow. For example:

InvalidStringData: strings in documents must be valid UTF-8

The implementation replaces the csv module with a backport of Python 3's
csv module [2]. This ensures that csv.DictReader properly reads UTF-8
CSV files and returns Unicode strings. Additionally, using this backport
instead of managing UTF-8 conversions manually should make a future
transition to Python 3 easier.

Note that the implementation currently isn't compatible with Python 3
because of the use of the unicode() function.

[1] http://api.mongodb.com/python/current/tutorial.html#a-note-on-unicode-strings
[2] https://github.com/ryanhiebert/backports.csv

Fixes #473
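
A minimal Python 2 sketch of the flow described above (the helper name and error handling are illustrative, not the actual diff):

    # -*- coding: utf-8 -*-
    import io

    from backports import csv  # backport of Python 3's csv module [2]

    def read_metadata_csv(path):
        # io.open decodes the stream as UTF-8, so a file that isn't valid
        # UTF-8 raises UnicodeDecodeError describing the first bad byte.
        with io.open(path, 'r', encoding='utf-8', newline='') as stream:
            try:
                return list(csv.DictReader(stream))  # rows of unicode strings
            except UnicodeDecodeError as e:
                raise ValueError('CSV file must be UTF-8 encoded: %s' % e)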

@brianhelba (Member) commented:

@msmolens Can you rebase this onto master, so tests can pass?

It makes sense that without backports.csv, a UTF-8-encoded CSV would not yield unicode strings (explaining the original bug).

However, are you aware of any instances of CSVs with Unicode characters which were not encoded in UTF-8, or is the except UnicodeDecodeError as e: just there as a proactive guard?

brianhelba mentioned this pull request on Nov 2, 2017
codecov bot commented Nov 2, 2017

Codecov Report

Merging #485 into master will decrease coverage by 0.05%.
The diff coverage is 60%.


@@            Coverage Diff             @@
##           master     #485      +/-   ##
==========================================
- Coverage   62.78%   62.73%   -0.06%     
==========================================
  Files          32       32              
  Lines        2913     2914       +1     
==========================================
- Hits         1829     1828       -1     
- Misses       1084     1086       +2
Impacted Files              Coverage Δ
server/utility/__init__.py  100% <100%> (ø) ⬆️
server/models/dataset.py    80% <33.33%> (-0.89%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@msmolens (Contributor, Author) commented Nov 2, 2017

> without backports.csv, a UTF-8-encoded CSV would not yield unicode strings (explaining the original bug).

Not exactly. The first reference link in the commit message explains PyMongo's behavior. In particular:

> PyMongo must ensure that any strings it stores contain only valid UTF-8 data. Regular strings (<type ‘str’>) are validated and stored unaltered. Unicode strings (<type ‘unicode’>) are encoded UTF-8 first

The original bug occurred when regular strings that we passed to PyMongo failed that validation because they weren't valid UTF-8.
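
A rough Python 2 illustration of that distinction (not code from this PR):

    latin1_str = 'caf\xe9'    # str holding a Latin-1 byte; not valid UTF-8
    utf8_str = 'caf\xc3\xa9'  # str already holding UTF-8 bytes; passes validation
    text = u'caf\xe9'         # unicode; PyMongo encodes it to UTF-8 itself

    latin1_str.decode('utf-8')  # raises UnicodeDecodeError; the same bytes
                                # make PyMongo raise InvalidStringData on save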

> However, are you aware of any instances of CSVs with Unicode characters which were not encoded in UTF-8, or is the except UnicodeDecodeError as e: just there as a proactive guard?

Yes, the CSVs on the live site that demonstrated the bug are not valid UTF-8. While we could use heuristics to guess the encoding, it seems far better to require a single standard encoding. Some education or additional documentation might be helpful for dataset contributors. I tested re-saving the problem CSVs using Google Sheets, which saves CSV files as UTF-8, and the updated CSVs work after this change.

The combination of backports.csv and the except UnicodeDecodeError effectively moves the UTF-8 validation earlier in the workflow, to the point where the CSV stream is processed. Additionally, using backports.csv enables reading UTF-8 CSV files that start with a BOM. With the Python 2.7 csv.DictReader, the BOM is erroneously read as part of the first field name.
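
The stdlib BOM problem can be demonstrated like this (Python 2; the file path is illustrative):

    import csv
    import io

    with io.open('/tmp/bom.csv', 'w', encoding='utf-8-sig') as f:
        f.write(u'name,age\nalice,3\n')  # utf-8-sig prepends a BOM

    with open('/tmp/bom.csv', 'rb') as f:
        print csv.DictReader(f).fieldnames  # ['\xef\xbb\xbfname', 'age']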

@brianhelba (Member) commented:

Ah, that makes sense. I forgot that PyMongo was able to take a str object that already contained UTF-8-encoded characters, in addition to str objects with plain ASCII characters (the ordinary case).

So:

  • with the old behavior, a UTF-8 CSV would just go directly into a str, and through to PyMongo and the DB
  • with the new behavior, a UTF-8 CSV will be deserialized into unicode (validating it in the process), then be serialized back into a UTF-8 str by PyMongo

Is this correct?
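
In code form (illustrative), the two paths amount to:

    raw = 'caf\xc3\xa9'                 # UTF-8 bytes read from the CSV
    # old path: raw goes to PyMongo as-is, and PyMongo only validates it
    # new path: decoded up front (validating), re-encoded by PyMongo on save
    text = raw.decode('utf-8')          # u'caf\xe9'
    assert text.encode('utf-8') == raw  # round-trips losslessly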

requirements.txt (Outdated)

    @@ -1,3 +1,4 @@
    +backports.csv==1.0.5
@brianhelba (Member) commented:

Actually, can we make this a more abstract version specifier?

@msmolens (Contributor, Author) replied:

Changed to >= with the same version number.
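
i.e., the requirements line presumably became:

    backports.csv>=1.0.5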

@msmolens (Contributor, Author) commented Nov 2, 2017

Yes, the two bullet points above describe the behavior as I understand it. With the old behavior, a str that's not valid UTF-8 results in the InvalidStringData: strings in documents must be valid UTF-8 error when writing to the database.

brianhelba merged commit 1c44a43 into master on Nov 2, 2017
brianhelba deleted the 473-require-utf8-metadata branch on November 2, 2017 at 14:25