Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to SplitMatrix #91

Closed
wants to merge 11 commits into from

Conversation

MarcAntoineSchmidtQC
Copy link
Member

  • Allow SplitMatrix to be constructed from another SplitMatrix.
  • Allow inputs of SplitMatrix to be 1-d
  • Implement __getitem__ for column subset
  • Also had to implement column subsetting for CategoricalMatrix
  • __repr__ uses the __repr__ method of components instead of str()

ToDo:

  • FIX BUG WITH _split_col_subsets (first confirm that it's a bug)
  • Add testing for new features

Checklist

  • Added a CHANGELOG.rst entry

- Allow SplitMatrix to be constructed from another SplitMatrix.
- Allow inputs of SplitMatrix to be 1-d
- Implement __getitem__ for column subset
- Also had to implement column subsetting for CategoricalMatrix
- __repr__ uses the __repr__ function of components instead of str()
Copy link
Member

@lbittarello lbittarello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

src/quantcore/matrix/split_matrix.py Outdated Show resolved Hide resolved
colmap[idx] = [i, j]
return colmap

def _split_col_subsets_unordered(self, cols):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems occasionally to return empty lists when one indexes columns with a list, which causes is_sorted to throw an error later.

MRE:

import pandas as pd
import quantcore.matrix as mx

df = pd.DataFrame({"u": ["a", "b"], "v": ["a", "b"]})
X = mx.from_pandas(df, cat_threshold=False, object_as_cat=True)
X[:, [0, 1]]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't replicate this on macOS. I assume you tried this on Windows. Is that correct? Can you give me some info on your setup?

Also, you said "occasionally". Does it always throw an error when you try your MRE or not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it always throw an error when you try your MRE or not?

Yes. But it doesn't throw an error if we reduce the number of rows from two to one, for example.

I assume you tried this on Windows. Is that correct?

Yes. I'm not embarrassed.

Can you give me some info on your setup?

python           : 3.8.3.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
byteorder        : little
pandas           : 1.3.0
numpy            : 1.20.3

Is this helpful? Do you need something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we perhaps add unit tests for indexing? The Windows CI could come in handy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unit testing will be very valuable. I'll temporarily add the Windows CI to this PR for all the pushes. And yes, this is helpful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on indexing unit tests.

Copy link
Collaborator

@tbenthompson tbenthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge!! Thanks Marc for making a really big improvement to SplitMatrix. Let me know if there are any pieces that you would like to hand off. It feels like you started what seemed like a small project and it's creeped outwards a couple times now. So, don't feel like you're committed and stuck finishing this if you have other important stuff going on.

return CategoricalMatrix(self.cat[row])
else:
# return a SparseMatrix if we subset columns
return SparseMatrix(self.tocsr()[row, col], dtype=self.dtype)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite inefficient because we construct the full sparse matrix with self.tocsr() before subsetting it. I'm fine leaving it like this for now, but I think it'd be good to at least leave a TODO comment or add an issue mentioning this performance bug. Fixing this for the single element case is quite easy. For the [:, cols] case, I guess we need to construct a sparse matrix element by element and I'm guessing that it'll be easiest to do that in Cython.

src/quantcore/matrix/split_matrix.py Show resolved Hide resolved
colmap[idx] = [i, j]
return colmap

def _split_col_subsets_unordered(self, cols):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on indexing unit tests.

MarcAntoineSchmidtQC and others added 7 commits July 26, 2021 12:09
added docstring

Co-authored-by: Ben Thompson <t.ben.thompson@gmail.com>
This is a big commit with many changes:
- partial support of integer indexing
- removed advanced indexing for densematrix and sparsematrix
- ensure that indexing of splitmatrix generates basematrix type
- partial fix of standardizedmatrix indexing
 - added indexing tests (currently fails)
@waclawkusnierczykqc
Copy link

Quick thought: it looks like several loosely related updates within one PR.
Perhaps worth considering pushing them in a few more focused PRs?

@MarcAntoineSchmidtQC
Copy link
Member Author

Quick thought: it looks like several loosely related updates within one PR.
Perhaps worth considering pushing them in a few more focused PRs?

Great idea. There are finished features that I would like to merge soon so I'll create more focused PRs with them.

@waclawkusnierczykqc
Copy link

Quick thought: it looks like several loosely related updates within one PR.
Perhaps worth considering pushing them in a few more focused PRs?

Great idea. There are finished features that I would like to merge soon so I'll create more focused PRs with them.

Yes, being able to push some of the changes independently of others is one advantage.
Among the others, there is clear focus, easier reviewing, and easier reverting if need be.

@MarcAntoineSchmidtQC
Copy link
Member Author

This PR has been separated into chunks. See PR #109, PR #110, and PR #111.

What remains is the big mess that is column indexing with splitmatrix.

@MarcAntoineSchmidtQC
Copy link
Member Author

closing. Most changes were implemented in other PRs and we will clearly take another approach to dealing with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants