#1387: Add `TextDirectoryCorpus` that yields one doc per file recursi… #1388

macks22 · 2017-06-02T16:06:54Z

…vely read from directory. Resolves #1387.

…e recursively read from directory.

…rpus` change.

…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.
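The pipeline described in that commit message (zero or more character filters, then a tokenizer, then zero or more token filters) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR; `preprocess`, the helper filters, and the `token_filters` argument name are all hypothetical.

```python
import re


def lowercase(text):
    """Character filter: lowercase the raw text."""
    return text.lower()


def collapse_whitespace(text):
    """Character filter: squeeze runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_on_whitespace(text):
    """Tokenizer: turn one string into a list of tokens."""
    return text.split()


def remove_short(tokens, min_len=3):
    """Token filter: drop tokens shorter than min_len characters."""
    return [t for t in tokens if len(t) >= min_len]


def preprocess(text, character_filters=(), tokenizer=split_on_whitespace, token_filters=()):
    """Hypothetical pipeline: str -> str filters, then str -> list, then list -> list filters."""
    for char_filter in character_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = preprocess(
    "The  Quick brown fox\n is ok",
    character_filters=[lowercase, collapse_whitespace],
    token_filters=[remove_short],
)
print(tokens)  # ['the', 'quick', 'brown', 'fox']
```

Keeping each stage a plain callable means a caller can swap in a different tokenizer or drop a filter without subclassing anything.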

…or `TextDirectoryCorpus`.

macks22 · 2017-06-08T15:05:17Z

@piskvorky I know we had discussed refactoring the TextCorpus variants in general; do you think it's best to merge this and then open up another issue for the more general architectural changes we were discussing, or should I add that sort of stuff to this PR? Thanks!

…hanges [MRG] Updated IPython notebook for scikit-learn wrappers

Fixed incorrect link in notebook

Fix numpy/scipy version & disable nnz code from nose (temporary option)

* update topic coherence tutorial notebook
* update topic coherence movies benchmark notebook to reflect the recent coherence optimizations
* a few minor updates to the text of the topic coherence benchmark on the movies dataset
* add new notebook demonstrating use of the CoherenceModel for model selection

menshikh-iv · 2017-06-22T05:21:03Z

gensim/corpora/textcorpus.py


    """
-    def __init__(self, input=None):
+    def __init__(self, input=None, metadata=False, character_filters=None, tokenizer=None,


Please add documentation for the new parameters in Google docstring format.
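For reference, a Google-style docstring for a constructor like this one might look like the sketch below. `ExampleCorpus` is a hypothetical stand-in, only the parameters visible in the truncated diff line are listed, and the parameter descriptions are assumptions rather than the wording that ended up in gensim.

```python
class ExampleCorpus(object):
    """Hypothetical stand-in used only to illustrate the docstring layout."""

    def __init__(self, input=None, metadata=False, character_filters=None, tokenizer=None):
        """Store the corpus configuration.

        Args:
            input (str, optional): Path to the source of documents (assumed meaning).
            metadata (bool): Whether to yield metadata alongside each document
                (assumed meaning).
            character_filters (list of callable, optional): Functions applied to the
                raw text, in order, before tokenization (assumed meaning).
            tokenizer (callable, optional): Function that splits the filtered text
                into a list of tokens (assumed meaning).
        """
        self.input = input
        self.metadata = metadata
        self.character_filters = character_filters or []
        self.tokenizer = tokenizer
```

The `Args:` section lists one entry per parameter as `name (type): description`, with continuation lines indented one extra level.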

menshikh-iv · 2017-06-22T05:22:42Z

gensim/corpora/textcorpus.py

        if input is not None:
            self.dictionary.add_documents(self.get_texts())
        else:
            logger.warning("No input document stream provided; assuming "
                           "dictionary will be initialized some other way.")

    def __iter__(self):
-        """
-        The function that defines a corpus.
+        """The function that defines a corpus.


Please use Google docstring format everywhere.

menshikh-iv · 2017-06-22T05:24:38Z

gensim/corpora/textcorpus.py

+    """
+
+    def __init__(self, input, metadata=False, min_depth=0, max_depth=None, pattern=None,
+                 exclude_pattern=None, **kwargs):


Add a docstring with parameter descriptions.
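To make the requested parameter descriptions concrete, here is one plausible reading of `min_depth`, `max_depth`, `pattern`, and `exclude_pattern`, sketched with `os.walk`. The function name `iter_filepaths` is hypothetical, and the semantics (depth 0 means files directly under the root; patterns are regexes matched against file names) are assumptions, not confirmed by this thread.

```python
import os
import re


def iter_filepaths(root, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None):
    """Yield file paths under root, filtered by directory depth and file name.

    Assumed semantics: files directly inside root are at depth 0; pattern and
    exclude_pattern are regular expressions matched against the file name.
    """
    include = re.compile(pattern) if pattern is not None else None
    exclude = re.compile(exclude_pattern) if exclude_pattern is not None else None
    root = os.path.abspath(root)

    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1

        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # prune: do not descend below max_depth
        if depth < min_depth:
            continue

        for name in sorted(filenames):
            if include is not None and not include.match(name):
                continue
            if exclude is not None and exclude.match(name):
                continue
            yield os.path.join(dirpath, name)
```

Clearing `dirnames` in place is the documented way to stop `os.walk` from descending further, so directories deeper than `max_depth` are never visited.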

menshikh-iv · 2017-06-22T05:26:04Z

gensim/corpora/textcorpus.py

+
+class TextDirectoryCorpus(TextCorpus):
+    """Read documents recursively from a directory,
+    where each file is interpreted as a plain text document.


Maybe iterate by line (not by file)? What do you think?

I think we could add an option for that, but the corpora this was designed for (20 newsgroups, Wiki-Movie from coherence papers) are one document per file.

Added a new argument, `lines_are_documents`, along with a test that shows its usage.
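The difference between the two modes can be sketched as below. This is an illustration of the idea only: `iter_documents` is a hypothetical helper, and skipping blank lines in line mode is an assumption rather than documented behavior of the PR.

```python
import io


def iter_documents(filepaths, lines_are_documents=False):
    """Yield one document per file, or one per non-empty line if requested."""
    for path in filepaths:
        with io.open(path, encoding="utf8") as infile:
            if lines_are_documents:
                for line in infile:
                    line = line.strip()
                    if line:  # assumption: blank lines are not documents
                        yield line
            else:
                yield infile.read().strip()
```

With the default, a 20 newsgroups style layout (one message per file) yields one document per file; with `lines_are_documents=True`, a file holding one document per line yields each line separately.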

menshikh-iv · 2017-06-22T05:35:01Z

Can you add a short usage example of this feature with a non-trivial directory tree (as an "integration" test)?
Example tree:

.
├── a_folder/
│   └── 0.txt
└── b_folder/
    ├── 1.txt
    ├── 2.txt
    └── c_folder/
        └── 3.txt
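A test along these lines could build that tree in a temporary directory and check that every file is found. The sketch below uses only the standard library and a plain `os.walk` reader as a stand-in for the corpus class; `build_tree`, `read_all`, and the file contents are hypothetical.

```python
import os
import shutil
import tempfile

# Relative path -> file contents, mirroring the example tree.
TREE = {
    os.path.join("a_folder", "0.txt"): "doc zero",
    os.path.join("b_folder", "1.txt"): "doc one",
    os.path.join("b_folder", "2.txt"): "doc two",
    os.path.join("b_folder", "c_folder", "3.txt"): "doc three",
}


def build_tree(root, tree):
    """Create every file in tree under root, making parent directories as needed."""
    for relpath, text in tree.items():
        path = os.path.join(root, relpath)
        dirname = os.path.dirname(path)
        if not os.path.isdir(dirname):
            os.makedirs(dirname)
        with open(path, "w") as outfile:
            outfile.write(text)


def read_all(root):
    """Stand-in reader: map each file name found under root to its contents."""
    docs = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as infile:
                docs[name] = infile.read()
    return docs


root = tempfile.mkdtemp()
try:
    build_tree(root, TREE)
    docs = read_all(root)
finally:
    shutil.rmtree(root)  # always clean up the temporary tree

assert sorted(docs) == ["0.txt", "1.txt", "2.txt", "3.txt"]
```

In a real test the `read_all` call would be replaced by iterating the corpus under test, with further assertions for each depth or pattern setting.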

* Create local random generator for sample_text & add lenght
* Fix typos

…one. Fix piskvorky#1294 (piskvorky#1321)
* added any2sparse_clipped() function
* changed full2sparse_clipped to any2sparse_clipped in __getitem__
* added missing whitespace
* return topn from any2sparse_clipped()
* efficient any2sparse_clipped implementation
* added unit test for any2sparse_clipped
* function call corrected
* removed any2sparse_clipped and added scipy2scipy_clipped
* added new code path for maintain_sparsity
* added unit tests for new function and issue
* fixed flake8 errors
* fixed matrix_indptr
* added requested changes
* replaced hasattr with getattr
* call abs() once for entire matrix in scipy2scipy_clipped
* removed matrix.sort_indices and removed indptr while calling argsort

…dd `lines_are_documents` option and test coverage for it, and add test for non-trivial directory structure. Make sampling more efficient by not preprocessing discarded samples. Consolidate TextCorpus tests in `test_corpora`.

…ranch: moving new `TextCorpus` sampling method tests into `test_corpora`.

…_directory_corpus # Conflicts: # gensim/corpora/textcorpus.py # gensim/test/test_corpora.py

macks22 · 2017-06-23T21:46:09Z

@menshikh-iv I've addressed your PR comments; thank you for the review!

Sorry about the big blob of commits since your review -- I had to rebase and resolve several conflicts with gensim upstream. The first commit, d41b2c4, contains my changes in response to your review.

…ings.

macks22 · 2017-07-03T11:56:52Z

@menshikh-iv @piskvorky Looks like I botched up this PR with the rebase I did to resolve conflicts recently. I've opened up PR #1459 to replace it.

Sweeney, Mack and others added 14 commits June 2, 2017 12:06

piskvorky#1387: Add TextDirectoryCorpus that yields one doc per fil…

47c0a46

…e recursively read from directory.

piskvorky#1387: Fix test failures in test_miislita based on `TextCo…

53a9623

…rpus` change.

piskvorky#1387: switch from basic preprocess_text method to a prepr…

c3381c5

…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.

created new file for rpmodel_sklearn_wrapper

0c5bcb0

updated get_params, set_params functions

0810428

correction in calling init function

d67f047

added fit, transform, partial_fit function

a9ce401

added tests for Rp model's sklearn wrapper

05ad743

minor correction in docstring in LDA and LSI models

f1b9c4a

added newline before class definition (PEP8)

8696e54

removed 'corpus' from 'init' and set 'corpus' in 'fit'

fe2f947

updated docstring for 'fit' function

7317173

piskvorky#1387: Update docs for TextCorpus and fix length caching f…

c71f670

…or `TextDirectoryCorpus`.

piskvorky#1387: Remove whitespace in lines in TextCorpus docstring.

910cbf9

piskvorky assigned menshikh-iv Jun 9, 2017

chinmayapancholi13 added 14 commits June 13, 2017 02:15

refactored code to use 'self.model'

692be88

code style changes

a2ec746

refactored wrapper and tests

954715e

removed 'self.corpus' attribute and refactored slightly

6c3b819

updated 'self.__model' to 'self.gensim_model'

aee04ff

updated test data

a73dacc

updated 'fit' and 'transform' methods

da602d9

updated 'testTransform' test

c1087ac

PEP8 change

00f5336

updated 'testTransform' test

376959d

added 'NotFittedError' in 'transform' function

9c888d6

added 'testPersistence' and 'testModelNotFitted' tests

373c36c

added input 'docs' description in 'transform' function

f3c3601

added 'testPipeline' test

ab90b68

aneesh-joshi and others added 9 commits June 21, 2017 10:49

removed .ipynb checkpoint

16998b1

Merge pull request piskvorky#1428 from chinmayapancholi13/skl_ipynb_c…

b4da23c

…hanges [MRG] Updated IPython notebook for scikit-learn wrappers

Merge pull request piskvorky#1426 from aneesh-joshi/develop

a2b6b32

Fixed incorrect link in notebook

Partial fix windows issue (piskvorky#1438)

c14b138

Fix numpy/scipy version & disable nnz code from nose (temporary option)

update changelog to 2.2.0

50b3f2b

bump version to 2.2.0

9c23419

Merge branch 'release-2.2.0' into develop

a3965e9

Add juju.com to the list of adopters (piskvorky#1436)

aa620db

menshikh-iv suggested changes Jun 22, 2017


fsonntag and others added 12 commits June 22, 2017 12:24

Add word ngram parameter to fasttext (piskvorky#1432)

dfb66f1

Add seed and lenght for sample_text (piskvorky#1422)

0d47a6f

* Create local random generator for sample_text & add lenght
* Fix typos

piskvorky#1387: Add TextDirectoryCorpus that yields one doc per fil…

68d08e2

…e recursively read from directory.

piskvorky#1387: Fix test failures in test_miislita based on `TextCo…

0425bef

…rpus` change.

piskvorky#1387: switch from basic preprocess_text method to a prepr…

6f9a4f9

…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.

piskvorky#1387: Update docs for TextCorpus and fix length caching f…

d881fba

…or `TextDirectoryCorpus`.

piskvorky#1387: Remove whitespace in lines in TextCorpus docstring.

1c002f7

piskvorky#1387: Resolve conflicts with rebasing on upstream develop b…

72b6706

…ranch: moving new `TextCorpus` sampling method tests into `test_corpora`.

Merge remote-tracking branch 'origin/text_directory_corpus' into text…

4312a63

…_directory_corpus # Conflicts: # gensim/corpora/textcorpus.py # gensim/test/test_corpora.py

Sweeney, Mack added 2 commits June 24, 2017 13:43

piskvorky#1387: fix flake8 formatting issue and add a few more docstr…

1e919c7

…ings.

piskvorky#1387: Fix corpora deaccent tests for Python 3.

66b535a

macks22 mentioned this pull request Jul 3, 2017

#1387: Add TextDirectoryCorpus and refactor TextCorpus #1459

Merged

macks22 closed this Jul 3, 2017

macks22 deleted the text_directory_corpus branch July 3, 2017 12:03
