Add support for .docx files #11
| """Document object for representing a DOCX document.""" | ||
| def __init__(self, doc): | ||
| self._words = [] | ||
| self.doc = docx.Document(doc) |
I take it docx handles all encoding issues internally?
Yeah, by the time we get text out of it, it's already Unicode. That said, if we decide we don't like how the library I'm using performs (it's the only one I could find), it probably wouldn't be too onerous to roll our own. Our needs for pulling text out of the document are very minimal.
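For a sense of what "rolling our own" might involve, here is a minimal sketch that uses only the standard library. It relies on the fact that a .docx file is a zip archive whose main body lives in `word/document.xml`, with text stored in `w:t` elements. The function names `extract_docx_text` and `extract_docx_words` are hypothetical and not part of this PR, and the sketch ignores headers, footnotes, and text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(fileobj):
    """Return the text of each paragraph in a .docx file as a list of str.

    Hypothetical helper: `fileobj` is a path or a seekable binary file object.
    """
    with zipfile.ZipFile(fileobj) as archive:
        xml_bytes = archive.read("word/document.xml")
    # Parsing from bytes lets the XML parser honor the declared encoding,
    # so the resulting strings are already Unicode.
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs


def extract_docx_words(fileobj):
    """Return a flat list of whitespace-separated words in the document."""
    words = []
    for paragraph in extract_docx_text(fileobj):
        words.extend(paragraph.split())
    return words
```

Splitting on whitespace is the simplest possible tokenization. Whether it is good enough depends on what the model expects as input.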
| @@ -0,0 +1,3 @@ | |||
| Since the dawn of time, humans have been using hatsopoulos microfluids to accomplish some amazing things. Today, we stand upon the precipice of a new age where hatsopoulos microfluids will dictate all aspects of our lives. | |||
🎉 😆
| allowed_filetypes = ['text/plain'] | ||
| allowed_extensions = ['txt', 'docx'] | ||
| allowed_mimetypes = ['text/plain', | ||
| 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'] |
Do you need a file generated from an actual Windows product to test this on? I can easily supply you with an OS X Office file, and I can get a Windows one also.
Probably wouldn't hurt, though I'd guess with what we're doing, any differences in implementation of the standard won't have much of an impact. It's no problem to throw a few more files at it during testing, so why not?
I've sent you a Mac-generated one, and Matt will be sending you a Windows-generated one shortly.
File upload wasn't working for me, but it turned out this was due to incompatibilities between gensim versions - I've pinned the version number in the Pipfile and now it works. I'll open a ticket for the version incompatibility issue (but it will be very low priority; rewriting the gensim code is thinky work and there are more important ways to spend our time).
thatandromeda
left a comment
LGTM but tell me if you want to do anything with respect to the questions above.
I think I'd like to at least look into updating the tests to go through the actual view, based on the document you sent me before merging.
Integration tests ftw.
75a2d75 to
66ea0c5
docs/developer.md
Outdated
| `hamlet.settings.local` defaults to using the test model, since it is checked | ||
| into version control. If you have a different model you want to use: | ||
| * put it in `hamlet/model/hamlet.model` (along with any other numpy files it needs to work); | ||
| * set `DJANGO_USE_LIVE_MODEL=True` in `.env`. |
What if we added a new setting called something like HAMLET_MODEL that specified the path to an alternate model? This would have the advantage of being more flexible in where it lives and what it's called, as well as being more explicit about what's getting used. If the setting doesn't exist we can fall back to the testmodel.
Presumed to be the relative path from the project root? Sure, that works. Will do.
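A rough sketch of how the proposed `HAMLET_MODEL` lookup could behave, assuming the value comes from the environment and relative paths are resolved against the project root. The function name `resolve_model_path` and its parameters are hypothetical, and the caller supplies the test model location because it is not spelled out in this thread.

```python
import os
from pathlib import Path


def resolve_model_path(project_root, test_model, environ=None):
    """Pick the model file: HAMLET_MODEL if set, otherwise the test model.

    Hypothetical helper. `test_model` is the test model's path relative to
    `project_root`; `environ` defaults to the process environment.
    """
    environ = os.environ if environ is None else environ
    configured = environ.get("HAMLET_MODEL", "").strip()
    if not configured:
        # Setting is missing or empty: fall back to the checked-in test model.
        return Path(project_root) / test_model
    path = Path(configured)
    # Relative values are interpreted relative to the project root.
    return path if path.is_absolute() else Path(project_root) / path
```

Accepting absolute paths as well as relative ones is an extra assumption here. It keeps the "flexible in where it lives" property without changing the relative-path behavior discussed above.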
e7ff6bd to
c7412a7
This adds support for reading uploaded docx files. A document module is added with a factory function for creating a document object from a Django UploadedFile object. The created document object has a words property which contains a list of words in the document. In addition, this uses chardet to determine the encoding of plain text files.
gensim 3.3 has a backwards-incompatible change that broke our existing file upload handling. I'll open a ticket to make this 3.3-compatible, but for now this is the easy fix.
This adds tests for the upload recommender view's ability to return matching documents using both .txt and .docx uploads.
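The factory described in the first commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the names `TextDocument`, `DOCUMENT_TYPES`, and `make_document` are hypothetical, only the plain-text branch is shown, and a UTF-8-then-Latin-1 fallback stands in for chardet so the example has no third-party dependencies. It relies only on the upload object having `read()` and a `content_type` attribute, as Django's `UploadedFile` does.

```python
class TextDocument:
    """Plain-text document exposing its words as a list."""

    def __init__(self, fileobj):
        raw = fileobj.read()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Stand-in for chardet-based detection (an assumption of this
            # sketch): Latin-1 maps every byte, so this decode cannot fail.
            text = raw.decode("latin-1")
        self._words = text.split()

    @property
    def words(self):
        return self._words


# Maps a MIME type to the class that knows how to read it. A DOCX class
# would be registered here under its own MIME type.
DOCUMENT_TYPES = {"text/plain": TextDocument}


def make_document(uploaded_file):
    """Create a document object for an uploaded file based on its MIME type."""
    content_type = getattr(uploaded_file, "content_type", None)
    try:
        doc_class = DOCUMENT_TYPES[content_type]
    except KeyError:
        raise ValueError(f"Unsupported content type: {content_type!r}")
    return doc_class(uploaded_file)
```

Callers only depend on the `words` property, so supporting another format means adding one class and one registry entry.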
c7412a7 to
edf0f00