Refactor upload #9

gravesm · 2018-02-02T20:59:29Z

This cleans up a few parts of the file processing to ensure we're handling bytes/strings appropriately.

The first 1024 bytes should be enough for magic to determine file type, at least for plain text and docx.
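As a rough illustration of the head-only read described above, the sketch below reads the first 1024 bytes of a file-like object and rewinds it so later processing still sees the whole file. The helper names `read_head` and `looks_like_zip_container` are hypothetical and not taken from this PR. The signature check is a stdlib stand-in for the call to magic, relying only on the fact that docx files are ZIP containers.

```python
import io

HEAD_SIZE = 1024  # number of leading bytes inspected for type detection


def read_head(fileobj, size=HEAD_SIZE):
    """Return the first `size` bytes, then rewind the file object."""
    fileobj.seek(0)
    head = fileobj.read(size)
    fileobj.seek(0)  # leave the file ready for the real processing pass
    return head


def looks_like_zip_container(head):
    """Stand-in for magic: docx is a ZIP archive, which starts with PK\\x03\\x04."""
    return head.startswith(b"PK\x03\x04")


# Hypothetical uploads, modelled as in-memory byte streams.
fake_docx = io.BytesIO(b"PK\x03\x04" + b"\x00" * 4096)
fake_text = io.BytesIO(b"plain text thesis\n" * 200)

docx_head = read_head(fake_docx)
text_head = read_head(fake_text)
```

In the real view the head bytes would be passed to magic rather than to a hand-written signature check.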

Iterating through byte chunks may split in the middle of words, or worse, the middle of multi-byte characters. Django's file object allows iterating by line which should be safer. Additionally, this explicitly decodes from UTF-8. This is probably fine for now, but we'll want to add encoding detection at a later point.
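To make the chunk-versus-line concern concrete, here is a small stdlib-only sketch. It uses `io.BytesIO` as a stand-in for the uploaded file and a deliberately tiny chunk size so the problem shows up on a short sample. The function names and sample text are illustrative assumptions, not code from this PR.

```python
import io

sample_text = "naïve café résumé\nsecond line of words\n"
sample_bytes = sample_text.encode("utf-8")


def words_by_chunk(raw, chunk_size):
    """Split fixed-size byte chunks into words (the fragile approach)."""
    words = []
    for start in range(0, len(raw), chunk_size):
        chunk = raw[start:start + chunk_size]
        # errors="replace" avoids an exception but turns a split
        # multi-byte character into U+FFFD replacement characters.
        words.extend(chunk.decode("utf-8", errors="replace").split())
    return words


def words_by_line(fileobj):
    """Split each line into words (the approach this PR moves to)."""
    words = []
    for line in fileobj:  # binary file objects yield one line of bytes at a time
        words.extend(line.decode("utf-8").split())
    return words


by_chunk = words_by_chunk(sample_bytes, 3)
by_line = words_by_line(io.BytesIO(sample_bytes))
```

Line iteration is safe for UTF-8 specifically because the newline byte never occurs inside a multi-byte UTF-8 sequence. That guarantee does not extend to encodings such as UTF-16.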

coveralls · 2018-02-02T21:01:49Z

Coverage increased (+0.2%) to 38.208% when pulling 4b36524 on upload-refactor into 0a6b5df on master.

thatandromeda · 2018-02-02T21:09:30Z

hamlet/theses/views.py

-        for chunk in doc.chunks():
-            bag_of_words.extend(str(chunk).strip().split(' '))
+        for line in doc:
+            bag_of_words.extend(line.decode('utf-8').strip().split())


Do we have any particular guarantee that this is UTF-8?

No, since they are uploading whatever, it could be whatever. We'd need to use chardet or something to have any confidence in our encoding assumptions. We should probably open a new ticket for this.

Though, in the absence of any other information, UTF-8 is as good a guess as any.

A new ticket to confirm encoding is 🌈. I used UnicodeDammit for that in solenoid and it was great.
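Until that ticket lands, one possible stopgap is to try UTF-8 first and fall back to other encodings. The sketch below is stdlib-only and is not how chardet or UnicodeDammit work internally. The function name `decode_best_effort` and the candidate list are assumptions chosen for illustration.

```python
def decode_best_effort(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value to a code point, so this never fails.
    # The result may still be wrong; it only guarantees we get a string.
    return raw.decode("latin-1"), "latin-1"


utf8_result = decode_best_effort("café".encode("utf-8"))
cp1252_result = decode_best_effort("café".encode("cp1252"))
fallback_result = decode_best_effort(b"\x81")
```

A statistical detector is still preferable for real uploads, since a fixed fallback order can silently pick the wrong encoding for text that happens to decode without error.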

thatandromeda

Given the opening of that ticket, LGTM.

Mike Graves added 2 commits February 2, 2018 15:42

Read first 1024 bytes for file type checking

63391e2

The first 1024 bytes should be enough for magic to determine file type, at least for plain text and docx.

gravesm requested a review from thatandromeda February 2, 2018 20:59

gravesm deployed to mitlibraries-hamlet-stagi-pr-9 February 2, 2018 20:59

thatandromeda reviewed Feb 2, 2018

thatandromeda approved these changes Feb 2, 2018

gravesm merged commit 21b9cac into master Feb 5, 2018

gravesm deleted the upload-refactor branch February 5, 2018 17:44
