Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#1387: Add TextDirectoryCorpus that yields one doc per file recursi… #1388

Closed
wants to merge 69 commits into from

Conversation

macks22
Copy link
Contributor

@macks22 macks22 commented Jun 2, 2017

…vely read from directory. Resolves #1387.

@macks22
Copy link
Contributor Author

macks22 commented Jun 8, 2017

@piskvorky I know we had discussed refactoring the TextCorpus variants in general; do you think it's best to merge this and then open up another issue for the more general architectural changes we were discussing, or should I add that sort of stuff to this PR? Thanks!

aneesh-joshi and others added 9 commits June 21, 2017 10:49
…hanges

[MRG] Updated IPython notebook for scikit-learn wrappers
Fix numpy/scipy version & disable nnz code from nose (temporary option)
* update topic coherence tutorial notebook

* update topic coherence movies benchmark notebook to reflect the recent coherence optimizations

* a few minor updates to the text of the topic coherence benchmark on the movies dataset

* add new notebook demonstrating use of the CoherenceModel for model selection

"""
def __init__(self, input=None):
def __init__(self, input=None, metadata=False, character_filters=None, tokenizer=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, add documentation for a new parameters in google-docstring format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if input is not None:
self.dictionary.add_documents(self.get_texts())
else:
logger.warning("No input document stream provided; assuming "
"dictionary will be initialized some other way.")

def __iter__(self):
"""
The function that defines a corpus.
"""The function that defines a corpus.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use google-docstring format everywhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""

def __init__(self, input, metadata=False, min_depth=0, max_depth=None, pattern=None,
exclude_pattern=None, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add docstring with parameter description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


class TextDirectoryCorpus(TextCorpus):
"""Read documents recursively from a directory,
where each file is interpreted as a plain text document.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe iterate by line (not by file), what do you think about it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could add an option for that, but the corpora this was designed for (20 newsgroups, Wiki-Movie from coherence papers) are one document per file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a new argument lines_are_documents along with a test that shows usage

@menshikh-iv
Copy link
Contributor

Can you add short example of usage of this feature with non-trivial tree (as "integration" test)
Example of tree:

.
├── a_folder/
│        ├── 0.txt
├── b_folder/
│       ├── 1.txt
│       ├── 2.txt
│       ├── c_folder/
│               ├── 3.txt

fsonntag and others added 12 commits June 22, 2017 12:24
* Create local random generator for sample_text & add lenght

* Fix typos
…one. Fix piskvorky#1294 (piskvorky#1321)

* added any2sparse_clipped() function

* changed full2sparse_clipped to any2sparse_clipped in __getitem__

* added missing whitespace

* return topn from any2sparse_clipped()

* efficient any2sparse_clipped implementation

* added unit test for any2sparse_clipped

* function call corrected

* removed any2sparse_clipped and added scipy2scipy_clipped

* added new code path for maintain_sparsity

* added unit tests for new function and issue

* fixed flake8 errors

* fixed matrix_indptr

* added requested changes

* replaced hasattr with getattr

* call abs() once for entire matrix in scipy2scipy_clipped

* removed matrix.sort_indices and removed indptr while calling argsort
…dd `lines_are_documents` option and test coverage for it, and add test for non-trivial directory structure. Make sampling more efficient by not preprocessing discarded samples. Consolidate TextCorpus tests in `test_corpora`.
…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.
…dd `lines_are_documents` option and test coverage for it, and add test for non-trivial directory structure. Make sampling more efficient by not preprocessing discarded samples. Consolidate TextCorpus tests in `test_corpora`.
…ranch: moving new `TextCorpus` sampling method tests into `test_corpora`.
…_directory_corpus

# Conflicts:
#	gensim/corpora/textcorpus.py
#	gensim/test/test_corpora.py
@macks22
Copy link
Contributor Author

macks22 commented Jun 23, 2017

@menshikh-iv I've addressed your PR comments; thank you for the review!

Sorry about the big blob of commits since your review -- I had to rebase and resolve several conflicts with gensim upstream. The first commit, d41b2c4, contains my changes in response to your review.

@macks22
Copy link
Contributor Author

macks22 commented Jul 3, 2017

@menshikh-iv @piskvorky Looks like I botched up this PR with the rebase I did to resolve conflicts recently. I've opened up PR #1459 to replace it.

@macks22 macks22 closed this Jul 3, 2017
@macks22 macks22 deleted the text_directory_corpus branch July 3, 2017 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants