Sparse2Corpus: update __getitem__ to work on slices, lists and ellipsis #3247
Fancy slicing & indexing is an interesting idea, thanks. Can you post some supporting motivation? What made you look at this functionality? What's your use-case? I'm definitely -1 on breaking existing primary interfaces / API contracts of Gensim though, such as the expectation that a transformed document is a list / plain sequence of |
In https://github.com/biolab/orange3-text we have a corpus of documents which gets a Sparse2Corpus object attached by Bag of Words. At some point, the user can decide to slice the corpus (take a subset of the documents), and we then need to slice the Sparse2Corpus accordingly. Currently, we must slice the underlying CSC matrix and create a new Sparse2Corpus:
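A minimal sketch of that kind of workaround, assuming a SciPy CSC matrix with one document per column. The toy matrix and the variable names are illustrative and not taken from orange3-text:

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy term-document matrix: rows are terms, columns are documents.
dense = np.array([
    [1, 0, 0, 4],
    [0, 3, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 5],
])
bow = csc_matrix(dense)

# Select documents by slicing or fancy-indexing the columns.
subset = bow[:, 1:3]     # documents 1 and 2
picked = bow[:, [3, 0]]  # documents 3 and 0, in that order

# Each result is again a sparse matrix, so it could be wrapped in a fresh
# corpus object, e.g. gensim.matutils.Sparse2Corpus(subset).
print(subset.shape, picked.shape)
```

The extra step of rebuilding the corpus after every selection is what native slicing support would remove.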
It is not such a big problem for us, but I thought it would be a nice enhancement for Sparse2Corpus to support that natively.
You are right. We should not break this expectation/current behaviour. My current implementation would break it since it always yields a Sparse2Corpus, even for one document. I am thinking of another solution:
What is the current behaviour of Gensim in this case? What happens now if you pass a slice/list/ellipsis? (concrete result / error)
It results in errors, since the current implementation expects an integer index (not a slice, ...).
Then your last suggestion should be 100% backward compatible, right? If that's the case then it's fine to implement it, thanks.
Yes, it should be backward compatible. OK, I will make this PR ready soon.
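One backward-compatible shape for this can be sketched in plain Python, under the assumption that integer keys keep returning a single document while slices, lists and ellipsis return a new corpus. `ToyColumnCorpus` is a hypothetical stand-in that mimics CSC-style arrays; it is not Gensim's Sparse2Corpus and not the code from this PR:

```python
class ToyColumnCorpus:
    """Hypothetical CSC-like corpus: one document per column."""

    def __init__(self, indptr, indices, data):
        self.indptr = list(indptr)    # column offsets, length n_docs + 1
        self.indices = list(indices)  # term ids of the stored values
        self.data = list(data)        # stored values

    def __len__(self):
        return len(self.indptr) - 1

    def _doc(self, i):
        # One document as a list of (term_id, value) pairs.
        start, stop = self.indptr[i], self.indptr[i + 1]
        return list(zip(self.indices[start:stop], self.data[start:stop]))

    def __iter__(self):
        return (self._doc(i) for i in range(len(self)))

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer key: same contract as before, a single document.
            if key < 0:
                key += len(self)
            if not 0 <= key < len(self):
                raise IndexError("document index out of range")
            return self._doc(key)
        if key is Ellipsis:
            columns = range(len(self))
        elif isinstance(key, slice):
            columns = range(*key.indices(len(self)))
        elif isinstance(key, list):
            columns = key
        else:
            raise TypeError(f"unsupported key type: {type(key).__name__}")
        # Any other supported key: build a new corpus from the chosen columns.
        indptr, indices, data = [0], [], []
        for i in columns:
            for term_id, value in self[i]:
                indices.append(term_id)
                data.append(value)
            indptr.append(len(indices))
        return ToyColumnCorpus(indptr, indices, data)


corpus = ToyColumnCorpus(
    indptr=[0, 2, 3, 3, 5],
    indices=[0, 2, 1, 0, 3],
    data=[1.0, 2.0, 3.0, 4.0, 5.0],
)
single = corpus[1]       # a plain list of (term_id, value) pairs
window = corpus[1:3]     # a new corpus holding documents 1 and 2
chosen = corpus[[3, 0]]  # a new corpus holding documents 3 and 0
```

Because the integer branch is untouched, callers that rely on a single document being a plain list keep working, and only key types that previously raised errors get the new behaviour.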
Codecov Report
@@ Coverage Diff @@
## develop #3247 +/- ##
===========================================
- Coverage 79.32% 78.91% -0.41%
===========================================
Files 68 68
Lines 11777 11776 -1
===========================================
- Hits 9342 9293 -49
- Misses 2435 2483 +48
It is ready for review now.
Merging. Thank you for your effort and your patience!
No worries. Thank you for considering my PR.
Currently, Sparse2Corpus's __getitem__ only works on integer indexes and returns a sparse representation of a single document.
Since the underlying structure is a CSC matrix, which allows more options in __getitem__, I suggest upgrading the current implementation with the options supported by the CSC matrix. I suggest that __getitem__ returns a Sparse2Corpus object with the selected subset of data. Supported subsets are slice, list, ellipsis, int.
My change includes:
If we agree we should also discuss how to deprecate old behaviour and introduce the new one.
This PR is not final. When we agree that the solution is OK I will polish the code and add some tests.