Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse2Corpus: update __getitem__ to work on slices, lists and ellipsis #3247

Merged
merged 4 commits into from Dec 4, 2021

Conversation

PrimozGodec
Copy link
Contributor

Currently, Sparse2Corpus's __getitems__ only work on integer indexes and return a sparse representation for only one document.

Since the underlying structure is a CSC matrix which allows more options in __getitems__ I suggest upgrading the current implementation with options supported for the CSC matrix. I suggest that __getitems__ returns the Sparse2Corpus object with a selected subset of data. Supported subsets are slice, list, ellipsis, int.

My change includes:

  1. More options for selecting part of the data
  2. Returning Sparse2Corpus object instead of a manmade list. I can support that decision with the fact that the underlying object (CSC matrix) have the same behaviour (returns the same type). The same holds for NumPy for example.

If we agree we should also discuss how to deprecate old behaviour and introduce the new one.

This PR is not final. When we agree that the solution is OK I will polish the code and add some tests.

@piskvorky
Copy link
Owner

I suggest that getitems returns the Sparse2Corpus object with a selected subset of data. Supported subsets are slice, list, ellipsis, int.

Fancy slicing & indexing is an interesting idea, thanks. Can you post some supporting motivation? What made you look at this functionality? What's your use-case?

I'm definitely -1 on breaking existing primary interfaces / API contracts of Gensim though, such as the expectation that a transformed document is a list / plain sequence of (feature_id, weight) pairs. Or is your change backward compatible?

@PrimozGodec
Copy link
Contributor Author

Fancy slicing & indexing is an interesting idea, thanks. Can you post some supporting motivation? What made you look at this functionality? What's your use case?

In https://github.com/biolab/orange3-text we have a corpus of documents with gets Sparse2Corpus object attached by Bag of Words. At some point, the user can decide to do slicing (take part of documents) on the corpus and we also need to slice Sparse2Corpus accordingly. Currently, we must slice the underlying CSC matrix and create new Sparse2Corpus:

sparse = s2c.sparse.__getitem__((slice(None, None, None), key))
Sparse2Corpus(sparse)

It is not such a big problem for us, but I taught it would be a nice enhancement to Sparse2Corpus to support that natively.

I'm definitely -1 on breaking existing primary interfaces / API contracts of Gensim though, such as the expectation that a transformed document is a list / plain sequence of (feature_id, weight) pairs. Or is your change backward compatible?

You are right. We should not break this expectation/current behaviour. My current implementation would break it since it always yields a Sparse2Corpus, even for one document. I am thinking of another solution:

  1. returns a list / plain sequence of (feature_id, weight) pairs when only one document requested: c2s[i]
  2. returns Sparse2Corpus object when multiple documents are requested (with slicing, list, ellipsis).

@piskvorky
Copy link
Owner

piskvorky commented Oct 13, 2021

returns Sparse2Corpus object when multiple documents are requested (with slicing, list, ellipsis).

What is the current behaviour of Gensim in this case? What happens now if you pass a slice/list/ellipsis? (concrete result / error)

@PrimozGodec
Copy link
Contributor Author

What is the current behavior of Gensim in this case? What happens now if you pass a slice/list/ellipsis? (concrete result / error)

It results in errors since the current implementation expects index (not slice, ...).

@piskvorky
Copy link
Owner

Then your last suggestion should be 100% backward compatible, right? If that's the case then it's fine to implement it, thanks.

@PrimozGodec
Copy link
Contributor Author

Then your last suggestion should be 100% backward compatible, right? If that's the case then it's fine to implement it, thanks.

Yes. It should be backward compatible. Ok, I will make this PR ready then soon.

@mpenkov mpenkov marked this pull request as draft October 24, 2021 03:27
@PrimozGodec PrimozGodec marked this pull request as ready for review October 27, 2021 14:29
@codecov
Copy link

codecov bot commented Oct 27, 2021

Codecov Report

Merging #3247 (fa5e392) into develop (fa2d1b1) will decrease coverage by 0.40%.
The diff coverage is 100.00%.

❗ Current head fa5e392 differs from pull request most recent head c46241b. Consider uploading reports for the commit c46241b to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3247      +/-   ##
===========================================
- Coverage    79.32%   78.91%   -0.41%     
===========================================
  Files           68       68              
  Lines        11777    11776       -1     
===========================================
- Hits          9342     9293      -49     
- Misses        2435     2483      +48     
Impacted Files Coverage Δ
gensim/matutils.py 77.23% <100.00%> (+0.83%) ⬆️
gensim/scripts/segment_wiki.py 71.18% <0.00%> (-21.19%) ⬇️
gensim/corpora/wikicorpus.py 82.21% <0.00%> (-11.54%) ⬇️
gensim/utils.py 71.05% <0.00%> (-0.49%) ⬇️
gensim/models/word2vec.py 89.77% <0.00%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa2d1b1...c46241b. Read the comment docs.

@PrimozGodec
Copy link
Contributor Author

It is ready for review now.

gensim/matutils.py Outdated Show resolved Hide resolved
gensim/matutils.py Show resolved Hide resolved
gensim/matutils.py Outdated Show resolved Hide resolved
@mpenkov mpenkov merged commit 2f182d7 into piskvorky:develop Dec 4, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented Dec 4, 2021

Merging. Thank you for your effort and your patience!

@PrimozGodec PrimozGodec deleted the sparse2corpus-getitem branch December 5, 2021 11:25
@PrimozGodec
Copy link
Contributor Author

No worries. Thank you for considering my PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants