Skip to content

Conversation

@gravesm
Copy link
Collaborator

@gravesm gravesm commented Feb 2, 2018

This cleans up a few parts of the file processing to ensure we're handling bytes/strings appropriately.

Mike Graves added 2 commits February 2, 2018 15:42
The first 1024 bytes should be enough for magic to determine file type,
at least for plain text and docx.
Iterating through byte chunks may split in the middle of words, or
worse, the middle of multi-byte characters. Django's file object allows
iterating by line which should be safer. Additionally, this explicitly
decodes from UTF-8. This is probably fine for now, but we'll want to add
encoding detection at a later point.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 38.208% when pulling 4b36524 on upload-refactor into 0a6b5df on master.

for chunk in doc.chunks():
bag_of_words.extend(str(chunk).strip().split(' '))
for line in doc:
bag_of_words.extend(line.decode('utf-8').strip().split())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any particular guarantee that this is UTF-8?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since they are uploading whatever, it could be whatever. We'd need to use chardet or something to have any confidence in our encoding assumptions. We should probably a new ticket for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, in the absence of any other information, UTF-8 is as good a guess as any.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new ticket to confirm encoding is 🌈 . I used UnicodeDammit for that in solenoid and it was great.

Copy link
Owner

@thatandromeda thatandromeda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the opening of that ticket, LGTM.

@gravesm gravesm merged commit 21b9cac into master Feb 5, 2018
@gravesm gravesm deleted the upload-refactor branch February 5, 2018 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants