Skip to content

Conversation

@gravesm
Copy link
Collaborator

@gravesm gravesm commented Feb 5, 2018

This adds support for reading uploaded docx files. A document module is
added with a factory function for creating a document object from a
Django UploadedFile object. The created document object has a words
property which contains a list of words in the document. In addition,
this uses chardet to determine the encoding of plain text files.

@gravesm gravesm requested a review from thatandromeda February 5, 2018 21:21
@gravesm gravesm temporarily deployed to mitlibraries-hamlet-stag-pr-11 February 5, 2018 21:21 Inactive
@coveralls
Copy link

coveralls commented Feb 6, 2018

Coverage Status

Coverage increased (+4.3%) to 92.936% when pulling edf0f00 on docx-uploads into bfcace5 on master.

"""Document object for representing a DOCX document."""
def __init__(self, doc):
self._words = []
self.doc = docx.Document(doc)
Copy link
Owner

@thatandromeda thatandromeda Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it docx handles all encoding issues internally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, by the time we get text out of it, it's already unicode. I will say that if we decide we don't like how the library I'm using (it's the only one I could find) is performing it probably wouldn't be totally onerous to roll our own. Our needs for pulling text out of the document are very minimal.

@@ -0,0 +1,3 @@
Since the dawn of time, humans have been using hatsopoulos microfluids to accomplish some amazing things. Today, we stand upon the precipice of a new age where hatsopoulos microfluids will dictate all aspects of our lives.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 😆

allowed_filetypes = ['text/plain']
allowed_extensions = ['txt', 'docx']
allowed_mimetypes = ['text/plain',
'application/vnd.openxmlformats-officedocument.wordprocessingml.document']
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need a file generated from an actual Windows product to test this on? I can easily supply you with an OS X Office file, and I can get a Windows one also.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably wouldn't hurt, though I'd guess with what we're doing, any differences in implementation of the standard won't have much of an impact. It's no problem to throw a few more files at it during testing, so why not?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've sent you a Mac-generated one, and Matt will be sending you a Windows-generated one shortly.

@gravesm gravesm temporarily deployed to mitlibraries-hamlet-stag-pr-11 February 6, 2018 18:03 Inactive
@thatandromeda
Copy link
Owner

File upload wasn't working for me, but it turned out this was due to incompatibilities between gensim versions - I've pinned the version number in the Pipfile and now it works. I'll open a ticket for the version incompatibility issue (but it will be very low priority; rewriting the gensim code is thinky work and there are more important ways to spend our time).

Copy link
Owner

@thatandromeda thatandromeda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but tell me if you want to do anything with respect to the questions above.

@gravesm
Copy link
Collaborator Author

gravesm commented Feb 6, 2018

I think I'd like to at least look into updating the tests to go through the actual view, based on the document you sent me before merging.

@thatandromeda
Copy link
Owner

Integration tests ftw.

@gravesm gravesm temporarily deployed to mitlibraries-hamlet-stag-pr-11 February 6, 2018 22:08 Inactive
@thatandromeda thatandromeda force-pushed the docx-uploads branch 2 times, most recently from 75a2d75 to 66ea0c5 Compare February 15, 2018 18:01
`hamlet.settings.local` defaults to using the test model, since it is checked
into version control. If you have a different model you want to use:
* put it in `hamlet/model/hamlet.model` (along with any other numpy files it needs to work);
* set `DJANGO_USE_LIVE_MODEL=True` in `.env`.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we added a new setting called something like HAMLET_MODEL that specified the path to an alternate model? This would have the advantage of being more flexible in where it lives and what it's called, as well as being more explicit about what's getting used. If the setting doesn't exist we can fall back to the testmodel.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumed to be the relative path from the project root? Sure, that works. Will do.

This adds support for reading uploaded docx files. A document module is
added with a factory function for creating a document object from a
Django UploadedFile object. The created document object has a words
property which contains a list of words in the document. In addition,
this uses chardet to determine the encoding of plain text files.
gensim 3.3 has a backwards-incompatible change that broke our existing
file upload handling. I'll open a ticket to make this 3.3-compatible,
but for now this is the easy fix.
This adds tests for the upload recommender view's ability to return
matching documents using both .txt and .docx uploads.
@thatandromeda thatandromeda merged commit 7ebfbf1 into master Feb 15, 2018
@thatandromeda thatandromeda deleted the docx-uploads branch February 15, 2018 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants