Require metadata CSV to be UTF-8 encoded #485

Merged
brianhelba merged 1 commit into master from 473-require-utf8-metadata on Nov 2, 2017

Conversation

@msmolens (Contributor) commented Nov 1, 2017

Require metadata CSV files to be UTF-8 encoded. Validation of CSV files
that aren't UTF-8 encoded will indicate an error that provides details
of the first encoding problem encountered in the file.

Because MongoDB stores data in BSON format and BSON strings are UTF-8
encoded, strings stored to the database must contain valid UTF-8 data
[1]. Validating that the CSV file is UTF-8 encoded avoids errors when
saving strings to the database later in the workflow. For example:

InvalidStringData: strings in documents must be valid UTF-8

The implementation replaces the csv module with a backport of Python 3's
csv module [2]. This ensures that csv.DictReader properly reads UTF-8
CSV files and returns Unicode strings. Additionally, using this backport
instead of managing UTF-8 conversions manually should make a future
transition to Python 3 easier.

Note that the implementation currently isn't compatible with Python 3
because of the use of the unicode() function.

[1] http://api.mongodb.com/python/current/tutorial.html#a-note-on-unicode-strings
[2] https://github.com/ryanhiebert/backports.csv

Fixes #473
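
A minimal Python 2 sketch of the flow described above (the helper name and error handling are illustrative, not the actual diff):

    # -*- coding: utf-8 -*-
    import io

    from backports import csv  # backport of Python 3's csv module [2]

    def read_metadata_csv(path):
        # io.open decodes the stream as UTF-8, so a file that isn't valid
        # UTF-8 raises UnicodeDecodeError describing the first bad byte.
        with io.open(path, 'r', encoding='utf-8', newline='') as stream:
            try:
                return list(csv.DictReader(stream))  # rows of unicode strings
            except UnicodeDecodeError as e:
                raise ValueError('CSV file must be UTF-8 encoded: %s' % e)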

@brianhelba (Member) commented:

@msmolens Can you rebase this onto master, so tests can pass?

It makes sense that without backports.csv, a UTF-8-encoded CSV would not yield unicode strings (explaining the original bug).

However, are you aware of any instances of CSVs with Unicode characters which were not encoded in UTF-8, or is the except UnicodeDecodeError as e: just there as a proactive guard?

brianhelba mentioned this pull request on Nov 2, 2017
codecov bot commented Nov 2, 2017

Codecov Report

Merging #485 into master will decrease coverage by 0.05%.
The diff coverage is 60%.


@@            Coverage Diff             @@
##           master     #485      +/-   ##
==========================================
- Coverage   62.78%   62.73%   -0.06%     
==========================================
  Files          32       32              
  Lines        2913     2914       +1     
==========================================
- Hits         1829     1828       -1     
- Misses       1084     1086       +2
Impacted Files              Coverage Δ
server/utility/__init__.py  100% <100%> (ø) ⬆️
server/models/dataset.py    80% <33.33%> (-0.89%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@msmolens (Contributor, Author) commented Nov 2, 2017

> without backports.csv, a UTF-8-encoded CSV would not yield unicode strings (explaining the original bug).

Not exactly. The first reference link in the commit message explains PyMongo's behavior. In particular:

> PyMongo must ensure that any strings it stores contain only valid UTF-8 data. Regular strings (<type ‘str’>) are validated and stored unaltered. Unicode strings (<type ‘unicode’>) are encoded UTF-8 first

The original bug occurred when regular strings that we passed to PyMongo failed that validation because they weren't valid UTF-8.
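
A rough Python 2 illustration of that distinction (not code from this PR):

    latin1_str = 'caf\xe9'    # str holding a Latin-1 byte; not valid UTF-8
    utf8_str = 'caf\xc3\xa9'  # str already holding UTF-8 bytes; passes validation
    text = u'caf\xe9'         # unicode; PyMongo encodes it to UTF-8 itself

    latin1_str.decode('utf-8')  # raises UnicodeDecodeError; the same bytes
                                # make PyMongo raise InvalidStringData on save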

> However, are you aware of any instances of CSVs with Unicode characters which were not encoded in UTF-8, or is the except UnicodeDecodeError as e: just there as a proactive guard?

Yes, the CSVs on the live site that demonstrated the bug are not valid UTF-8. While we could use heuristics to guess the encoding, it seems far better to require a single standard encoding. Some education or additional documentation might be helpful for dataset contributors. I tested re-saving the problem CSVs using Google Sheets, which saves CSV files as UTF-8, and the updated CSVs work after this change.

The combination of backports.csv and the except UnicodeDecodeError effectively moves the UTF-8 validation earlier in the workflow, to the point where the CSV stream is processed. Additionally, using backports.csv enables reading UTF-8 CSV files that start with a BOM. With the Python 2.7 csv.DictReader, the BOM is erroneously read as part of the first field name.
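
The stdlib BOM problem can be demonstrated like this (Python 2; the file path is illustrative):

    import csv
    import io

    with io.open('/tmp/bom.csv', 'w', encoding='utf-8-sig') as f:
        f.write(u'name,age\nalice,3\n')  # utf-8-sig prepends a BOM

    with open('/tmp/bom.csv', 'rb') as f:
        print csv.DictReader(f).fieldnames  # ['\xef\xbb\xbfname', 'age']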

@brianhelba (Member) commented:

Ah, that makes sense. I forgot that PyMongo was able to take a str object that already contained UTF-8-encoded characters, in addition to str objects with plain ASCII characters (the ordinary case).

So:

  • with the old behavior, a UTF-8 CSV would just go directly into a str, and through to PyMongo and the DB
  • with the new behavior, a UTF-8 CSV will be deserialized into unicode (validating it in the process), then be serialized back into a UTF-8 str by PyMongo

Is this correct?
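
In code form (illustrative), the two paths amount to:

    raw = 'caf\xc3\xa9'                 # UTF-8 bytes read from the CSV
    # old path: raw goes to PyMongo as-is, and PyMongo only validates it
    # new path: decoded up front (validating), re-encoded by PyMongo on save
    text = raw.decode('utf-8')          # u'caf\xe9'
    assert text.encode('utf-8') == raw  # round-trips losslessly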

requirements.txt (Outdated)

    @@ -1,3 +1,4 @@
    +backports.csv==1.0.5
@brianhelba (Member) commented:

Actually, can we make this a more abstract version specifier?

@msmolens (Contributor, Author) replied:

Changed to >= with the same version number.
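
i.e., the requirements line presumably became:

    backports.csv>=1.0.5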

@msmolens (Contributor, Author) commented Nov 2, 2017

Yes, the two bullet points above describe the behavior as I understand it. With the old behavior, a str that's not valid UTF-8 results in the InvalidStringData: strings in documents must be valid UTF-8 error when writing to the database.

brianhelba merged commit 1c44a43 into master on Nov 2, 2017
brianhelba deleted the 473-require-utf8-metadata branch on November 2, 2017 at 14:25