-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1387: Add TextDirectoryCorpus
that yields one doc per file recursi…
#1388
Conversation
…e recursively read from directory.
…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.
…or `TextDirectoryCorpus`.
@piskvorky I know we had discussed refactoring the TextCorpus variants in general; do you think it's best to merge this and then open up another issue for the more general architectural changes we were discussing, or should I add that sort of stuff to this PR? Thanks! |
…hanges [MRG] Updated IPython notebook for scikit-learn wrappers
Fixed incorrect link in notebook
Fix numpy/scipy version & disable nnz code from nose (temporary option)
* update topic coherence tutorial notebook * update topic coherence movies benchmark notebook to reflect the recent coherence optimizations * a few minor updates to the text of the topic coherence benchmark on the movies dataset * add new notebook demonstrating use of the CoherenceModel for model selection
gensim/corpora/textcorpus.py
Outdated
|
||
""" | ||
def __init__(self, input=None): | ||
def __init__(self, input=None, metadata=False, character_filters=None, tokenizer=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add documentation for a new parameters in google-docstring format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if input is not None: | ||
self.dictionary.add_documents(self.get_texts()) | ||
else: | ||
logger.warning("No input document stream provided; assuming " | ||
"dictionary will be initialized some other way.") | ||
|
||
def __iter__(self): | ||
""" | ||
The function that defines a corpus. | ||
"""The function that defines a corpus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use google-docstring format everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
gensim/corpora/textcorpus.py
Outdated
""" | ||
|
||
def __init__(self, input, metadata=False, min_depth=0, max_depth=None, pattern=None, | ||
exclude_pattern=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add docstring with parameter description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
gensim/corpora/textcorpus.py
Outdated
|
||
class TextDirectoryCorpus(TextCorpus): | ||
"""Read documents recursively from a directory, | ||
where each file is interpreted as a plain text document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe iterate by line (not by file), what do you think about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could add an option for that, but the corpora this was designed for (20 newsgroups, Wiki-Movie from coherence papers) are one document per file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a new argument lines_are_documents
along with a test that shows usage
Can you add short example of usage of this feature with non-trivial tree (as "integration" test)
|
* Create local random generator for sample_text & add lenght * Fix typos
…one. Fix piskvorky#1294 (piskvorky#1321) * added any2sparse_clipped() function * changed full2sparse_clipped to any2sparse_clipped in __getitem__ * added missing whitespace * return topn from any2sparse_clipped() * efficient any2sparse_clipped implementation * added unit test for any2sparse_clipped * function call corrected * removed any2sparse_clipped and added scipy2scipy_clipped * added new code path for maintain_sparsity * added unit tests for new function and issue * fixed flake8 errors * fixed matrix_indptr * added requested changes * replaced hasattr with getattr * call abs() once for entire matrix in scipy2scipy_clipped * removed matrix.sort_indices and removed indptr while calling argsort
…dd `lines_are_documents` option and test coverage for it, and add test for non-trivial directory structure. Make sampling more efficient by not preprocessing discarded samples. Consolidate TextCorpus tests in `test_corpora`.
…e recursively read from directory.
…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.
…or `TextDirectoryCorpus`.
…dd `lines_are_documents` option and test coverage for it, and add test for non-trivial directory structure. Make sampling more efficient by not preprocessing discarded samples. Consolidate TextCorpus tests in `test_corpora`.
…ranch: moving new `TextCorpus` sampling method tests into `test_corpora`.
…_directory_corpus # Conflicts: # gensim/corpora/textcorpus.py # gensim/test/test_corpora.py
@menshikh-iv I've addressed your PR comments; thank you for the review! Sorry about the big blob of commits since your review -- I had to rebase and resolve several conflicts with gensim upstream. The first commit, d41b2c4, contains my changes in response to your review. |
@menshikh-iv @piskvorky Looks like I botched up this PR with the rebase I did to resolve conflicts recently. I've opened up PR #1459 to replace it. |
…vely read from directory. Resolves #1387.