Add functionality in TextCorpus to convert document text to index vectors #1720

roopalgarg · 2017-11-16T08:29:40Z

TextCorpus doesn't provide a way to convert document text to an index vector, as needed for, say, deep-learning NLP models.
This PR adds a 'doc2idx' method to the Dictionary object and creates modes in TextCorpus to leverage this functionality.
References issue #1634.
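
To illustrate the gap described here: a bag-of-words representation discards token order, while an index vector keeps one id per token in sequence. The following is a minimal plain-Python sketch, assuming a hypothetical `token2id` mapping and helper names; it is not gensim's implementation, and the placeholder -1 for unknown words is an arbitrary choice for the sketch.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer ids (illustration only).
token2id = {"human": 0, "computer": 1, "interface": 2}


def to_bow(document, token2id):
    """Bag-of-words: sorted (id, count) pairs; token order is lost."""
    counts = Counter(token2id[w] for w in document if w in token2id)
    return sorted(counts.items())


def to_index_vector(document, token2id, unknown_word_index=-1):
    """Index vector: one id per token, in document order."""
    return [token2id.get(w, unknown_word_index) for w in document]


doc = ["computer", "human", "computer", "survey"]
bow = to_bow(doc, token2id)
idx = to_index_vector(doc, token2id)
print(bow)  # [(0, 1), (1, 2)]
print(idx)  # [1, 0, 1, -1]
```

Sequence models typically need the second form, because the position of each token matters.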

roopalgarg · 2017-11-17T00:21:48Z

@menshikh-iv how does this look?

menshikh-iv

I understand the change to Dictionary, and it's OK, but I don't understand why these changes are needed for TextCorpus (the main question is why only for that class).

menshikh-iv · 2017-11-17T06:42:27Z

gensim/corpora/dictionary.py

@@ -173,6 +173,37 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
        else:
            return result

+    def doc2idx(self, document, unk_wrd_idx=0):
+        """
+        Convert `document` (a list of words) into a list of indexes = list


Please use numpy-style docstrings.

menshikh-iv · 2017-11-17T06:54:43Z

gensim/corpora/dictionary.py

+        if isinstance(document, string_types):
+            raise TypeError("doc2idx expects an array of unicode tokens on input, not a single string")
+
+        token2id = self.token2id


you can use self.token2id directly

I was just following the convention from 'doc2bow', but I will make the change as you pointed out.

menshikh-iv · 2017-11-17T06:56:31Z

gensim/corpora/dictionary.py

@@ -173,6 +173,37 @@ def doc2bow(self, document, allow_update=False, return_missing=False):
        else:
            return result

+    def doc2idx(self, document, unk_wrd_idx=0):


Dictionary always starts numbering from 0, so index 0 is always occupied by some word; -1 is a significantly better default value.

Also, please rename unk_wrd_idx to unknown_word_index (here and everywhere).
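
The collision behind this comment can be shown with a small sketch. The `token2id` mapping and the `lookup` helper are hypothetical stand-ins, not gensim code.

```python
# Hypothetical vocabulary; ids start at 0, as the review comment describes.
token2id = {"human": 0, "computer": 1}


def lookup(document, token2id, unknown_word_index):
    """Map each token to its id, or to the placeholder if it is unknown."""
    return [token2id.get(word, unknown_word_index) for word in document]


doc = ["human", "survey"]  # "survey" is not in the vocabulary

with_zero = lookup(doc, token2id, unknown_word_index=0)
with_minus_one = lookup(doc, token2id, unknown_word_index=-1)

print(with_zero)       # [0, 0]  -> "survey" is indistinguishable from "human"
print(with_minus_one)  # [0, -1] -> the unknown word stays identifiable
```

One caveat when consuming such vectors: -1 is a valid Python and NumPy index (the last element), so unknown positions should be masked or remapped before they are used to index an embedding table.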

menshikh-iv · 2017-11-17T06:59:07Z

gensim/corpora/dictionary.py

+
+        token2id = self.token2id
+
+        list_word_idx = list()


document = [word if isinstance(word, unicode) else unicode(word, 'utf-8') for word in document]
return [self.token2id.get(word, unknown_word_index) for word in document]
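
The suggestion above relies on the Python 2 `unicode` type. A rough Python 3 equivalent might look like the sketch below; the function name `doc2idx_sketch` and the explicit `token2id` argument are assumptions made so the example runs on its own, and it is not the code that was merged.

```python
def doc2idx_sketch(document, token2id, unknown_word_index=-1):
    """Standalone Python 3 sketch of the suggested two-line body."""
    # Reject a bare string: iterating it would yield characters, not tokens.
    if isinstance(document, (str, bytes)):
        raise TypeError("expected a list of tokens, not a single string")
    # Decode any bytes tokens as UTF-8 so that lookups use text keys.
    document = [w if isinstance(w, str) else str(w, "utf-8") for w in document]
    # One dictionary lookup per token; document order is preserved.
    return [token2id.get(w, unknown_word_index) for w in document]


vocab = {"human": 0, "computer": 1}
result = doc2idx_sketch(["computer", b"human", "survey"], vocab)
print(result)  # [1, 0, -1]
```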

menshikh-iv · 2017-11-17T07:01:59Z

gensim/corpora/textcorpus.py

@@ -112,7 +114,10 @@ class TextCorpus(interfaces.CorpusABC):
    6.  remove stopwords; see `gensim.parsing.preprocessing` for the list of stopwords

    """
-    def __init__(self, input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None):
+    def __init__(


Please use vertical indent here (only for method/function definitions; in all other cases, use a hanging indent).
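
For readers unfamiliar with the two layouts named in this comment, here is a short sketch. The function and argument names are hypothetical; only the line-continuation style matters.

```python
# All names here are hypothetical; only the layout matters.

# Vertical alignment: continuation lines line up under the first parameter.
# The review asks for this style in function and method definitions.
def build_corpus(source=None, dictionary=None, metadata=False,
                 character_filters=None, tokenizer=None, token_filters=None):
    return {"source": source, "metadata": metadata}


# Hanging indent: nothing follows the opening parenthesis, and the
# arguments are indented one level. The review asks for this elsewhere.
corpus = build_corpus(
    source="corpus.txt",
    metadata=True,
)
print(corpus)
```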

roopalgarg · 2017-11-17T08:46:31Z

@menshikh-iv the idea was that, since we are adding the functionality to the __iter__ of TextCorpus, we can process a corpus to get its vocabulary indexes. Also, since WikiCorpus inherits from TextCorpus and doesn't override __iter__, the functionality gets added to WikiCorpus as well.
I agree that the WikiCorpus constructor doesn't explicitly take parameters for this new functionality, but we can set them on the instance after creating it.

So, a couple of things here:

Should WikiCorpus also take these parameters explicitly in its constructor?
Where else do you think this functionality should be added, since you were asking why only TextCorpus?
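
The inheritance argument can be sketched as follows. The class names, the `as_indexes` flag, and the constructor signatures are all hypothetical and are not the TextCorpus or WikiCorpus API; the sketch only shows how a subclass picks up a mode implemented in an inherited `__iter__`.

```python
class BaseCorpus:
    """Hypothetical stand-in for a corpus whose __iter__ honours an output mode."""

    def __init__(self, texts, token2id, as_indexes=False):
        self.texts = texts
        self.token2id = token2id
        self.as_indexes = as_indexes  # hypothetical flag name

    def __iter__(self):
        for tokens in self.texts:
            if self.as_indexes:
                # Index-vector mode: one id per token, -1 for unknown words.
                yield [self.token2id.get(t, -1) for t in tokens]
            else:
                yield tokens


class DerivedCorpus(BaseCorpus):
    """The constructor does not expose the flag, but __iter__ is inherited."""

    def __init__(self, texts, token2id):
        super().__init__(texts, token2id)


corpus = DerivedCorpus([["a", "b"], ["b", "c"]], {"a": 0, "b": 1})
before = list(corpus)
corpus.as_indexes = True  # set the attribute after construction
after = list(corpus)
print(before)  # [['a', 'b'], ['b', 'c']]
print(after)   # [[0, 1], [1, -1]]
```

Setting the attribute after construction works, but it leaves the option undiscoverable from the subclass signature, which is the concern raised in the first question.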

menshikh-iv · 2017-11-17T09:07:43Z

@roopalgarg the current problem is more "global"; let me describe it:

We have many different corpus classes, and we would need to add this to all of them (I don't think that is a good idea, although having the same interface everywhere would be nice).
How do we choose the concrete classes that should support this Dictionary feature? (Maybe we don't need to add it to any corpus at all, only a new method on Dictionary.)
How do we pass the arguments (if we want to add it in many places)?

I think the corpus classes need a global refactoring (bringing everything to the same interface and simplifying, i.e. a minimum of functionality). @roopalgarg this isn't your problem, sorry, but you reminded me of an old and important issue.

@roopalgarg I'm ready to merge only the new Dictionary method right now.

@piskvorky wdyt? The corpus classes are generally very chaotic; do you have any ideas on how to rework them?
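
One way to read the "only a new method for Dictionary" option: the conversion can be applied to whatever a corpus yields, without any corpus class knowing about it. A minimal sketch, assuming a hypothetical `token2id` mapping and helper name in place of a real Dictionary:

```python
def iter_index_vectors(tokenized_docs, token2id, unknown_word_index=-1):
    """Lazily convert any iterable of tokenized documents to index vectors."""
    for tokens in tokenized_docs:
        yield [token2id.get(t, unknown_word_index) for t in tokens]


def stream_docs():
    """Hypothetical stand-in for any corpus that yields lists of tokens."""
    yield ["human", "computer"]
    yield ["computer", "survey"]


vocab = {"human": 0, "computer": 1}
vectors = list(iter_index_vectors(stream_docs(), vocab))
print(vectors)  # [[0, 1], [1, -1]]
```

Because the helper only needs an iterable of token lists, it works the same way for lists, generators, or any corpus object that iterates over tokenized documents.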

roopalgarg · 2017-11-17T09:13:48Z

@menshikh-iv I see your point. For now, just adding doc2idx to the Dictionary class sounds good to me.
Do I need to revert my changes to the other 2 files (or is there a way you can discard the changes to them)?

menshikh-iv · 2017-11-17T09:28:50Z

gensim/corpora/dictionary.py

+
+        Notes
+        -----
+            This function is `const`, aka read-only


No indentation needed here.

menshikh-iv · 2017-11-17T09:29:07Z

gensim/corpora/dictionary.py

+
+        Parameters
+        ----------
+        document : list


list of str

menshikh-iv · 2017-11-17T09:29:23Z

gensim/corpora/dictionary.py

+        Parameters
+        ----------
+        document : list
+            List of words tokenized, normalized and preprocessed.


No need to mention type twice

menshikh-iv · 2017-11-17T09:29:42Z

gensim/corpora/dictionary.py

+
+        Returns
+        -------
+        list


list of int

menshikh-iv · 2017-11-17T09:30:59Z

gensim/corpora/dictionary.py

+        Returns
+        -------
+        list
+            List of indexes in the dictionary for words in the `document`


No need to mention the type twice; also add that the order is preserved.

menshikh-iv · 2017-11-17T09:31:24Z

gensim/corpora/dictionary.py

+        -------
+        list
+            List of indexes in the dictionary for words in the `document`
+


Please add an Examples section (a simple working example that shows how to apply this method).
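
Taken together, the docstring comments in this review (typed `list of str` / `list of int`, no repeated type in the description, no extra indentation under section headers, a note that order is preserved, and an Examples section) point at something like the sketch below. It is a standalone function with a hypothetical `token2id` argument, written only to illustrate the numpy docstring layout; it is not the docstring that was merged.

```python
def doc2idx_sketch(document, token2id, unknown_word_index=-1):
    """Convert a tokenized document into a list of vocabulary indexes.

    Parameters
    ----------
    document : list of str
        Tokenized, normalized and preprocessed words.
    token2id : dict of (str, int)
        Hypothetical mapping from token to index (illustration only).
    unknown_word_index : int, optional
        Index used for words missing from `token2id`.

    Returns
    -------
    list of int
        Indexes for the words in `document`, in the same order.

    Examples
    --------
    >>> doc2idx_sketch(["a", "c", "b"], {"a": 0, "b": 1})
    [0, -1, 1]

    """
    return [token2id.get(word, unknown_word_index) for word in document]


print(doc2idx_sketch(["a", "c", "b"], {"a": 0, "b": 1}))  # [0, -1, 1]
```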

menshikh-iv · 2017-11-17T09:32:09Z

@roopalgarg yeah, please revert 2 files, fix docstring and that's all 👍

reverting changes to TextCorpus as discussed

roopalgarg · 2017-11-17T10:07:50Z

@menshikh-iv a little new to numpy style docstrings so not fully aware of best practices. learnt something new today :)

roopalgarg · 2017-11-18T05:04:33Z

@menshikh-iv good to merge ?

menshikh-iv · 2017-11-20T04:43:44Z

@roopalgarg yeah, thanks for your contribution 👍

roopalgarg · 2017-11-20T09:02:49Z

@menshikh-iv awesome! thanks

…iskvorky#1720)
* define doc2idx to convert a document to a vector of indexes per the dictionary
* update documentation
* changes to textcorpus to add a mode for index vector format output. adding test case for the changes
* fixing doc string
* fix doc string
* fix doc string
* removing trailing white spaces
* removing trailing white spaces
* changes as per review
* change as per review. reverting changes to TextCorpus as discussed

roopalgarg added 8 commits November 15, 2017 23:14

define doc2idx to convert a document to a vector of indexes per the d…

91ab157

…ictionary

update documentation

ddf9e04

changes to textcorpus to add a mode for index vector format output. a…

4ba3714

…dding test case for the changes

fixing doc string

2a08ce9

fix doc string

7e7fef7

fix doc string

0e6f793

removing trailing white spaces

8e10138

removing trailing white spaces

655cd04

menshikh-iv suggested changes Nov 17, 2017

changes as per review

4e6957f

menshikh-iv suggested changes Nov 17, 2017

change as per review.

dec80e8

reverting changes to TextCorpus as discussed

menshikh-iv merged commit db3b881 into piskvorky:develop Nov 20, 2017
