Refactor upload #9
The first 1024 bytes should be enough for magic to determine file type, at least for plain text and docx.
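A minimal sketch of reading only the head of an upload for type sniffing, then rewinding so later processing still sees the whole file. The helper name `read_head` and the `HEAD_SIZE` constant are hypothetical, and the sketch assumes the upload behaves like a seekable binary file object:

```python
import io

HEAD_SIZE = 1024  # bytes handed to the type detector (illustrative constant)


def read_head(fileobj, size=HEAD_SIZE):
    """Return the first `size` bytes of a seekable binary file object.

    The original position is restored afterwards so that a later
    `for line in fileobj` still starts where it would have.
    """
    position = fileobj.tell()
    fileobj.seek(0)
    head = fileobj.read(size)
    fileobj.seek(position)
    return head


if __name__ == "__main__":
    upload = io.BytesIO(b"plain text upload\n" * 200)
    head = read_head(upload)
    print(len(head), upload.tell())
```

If the project uses python-magic, the returned bytes could be passed to `magic.from_buffer(head, mime=True)`.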
Iterating through byte chunks may split in the middle of words, or worse, the middle of multi-byte characters. Django's file object allows iterating by line which should be safer. Additionally, this explicitly decodes from UTF-8. This is probably fine for now, but we'll want to add encoding detection at a later point.
-    for chunk in doc.chunks():
-        bag_of_words.extend(str(chunk).strip().split(' '))
+    for line in doc:
+        bag_of_words.extend(line.decode('utf-8').strip().split())
Do we have any particular guarantee that this is UTF-8?
No, since they are uploading whatever, it could be whatever. We'd need to use chardet or something to have any confidence in our encoding assumptions. We should probably open a new ticket for this.
Though, in the absence of any other information, UTF-8 is as good a guess as any.
A new ticket to confirm encoding is 🌈 . I used UnicodeDammit for that in solenoid and it was great.
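Until that ticket lands, a stdlib-only fallback might look like the sketch below. It is a stand-in for real detection with chardet or UnicodeDammit, not a replacement. The helper name `decode_best_effort` and the choice of `cp1252` as second guess are assumptions made for illustration:

```python
def decode_best_effort(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used).

    If every candidate fails, decode as UTF-8 with replacement characters
    so the caller always gets a string back.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "utf-8"


if __name__ == "__main__":
    print(decode_best_effort("café".encode("utf-8")))
    print(decode_best_effort("café".encode("cp1252")))
```

Trying UTF-8 first matches the reasoning above: it is the best default guess, and invalid UTF-8 fails loudly rather than silently producing mojibake.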
thatandromeda left a comment
Given the opening of that ticket, LGTM.
This cleans up a few parts of the file processing to ensure we're handling bytes/strings appropriately.