[WIP] Implement BlockManager
backend
#104
lgtm!
Codecov Report
Coverage diff:

| | dev | #104 | +/- |
|---|---|---|---|
| Coverage | 65.08% | 73.74% | +8.65% |
| Files | 58 | 51 | -7 |
| Lines | 3340 | 3135 | -205 |
| Branches | 590 | 532 | -58 |
| Hits | 2174 | 2312 | +138 |
| Misses | 1012 | 671 | -341 |
| Partials | 154 | 152 | -2 |
lgtm
closes #98
Closes #109 The BlockManager (see #104) introduces a need for more robust DataPanel testing that tests DataPanels with a diverse set of columns. As we add more columns, we don't want to have to update the DataPanel tests for each new column. Instead, we should specify a TestBed for each column that plugs in to the DataPanel tests. Started this for NumpyArrayColumn with #108. Co-authored-by: Priya <priyamis@cse.iitk.ac.in>
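The per-column testbed idea described above could look roughly like the sketch below. Everything here is hypothetical: the class names `NumpyArrayTestBed` and `ListTestBed`, the `column` / `expected_rows` methods, and the `check_slicing` helper are illustrative stand-ins, not the interface used in #108. The point is only that the shared test body never mentions a concrete column type, so a new column joins the suite by registering its testbed.

```python
from typing import Any, List

import numpy as np


class NumpyArrayTestBed:
    """Hypothetical testbed: owns the sample data and the expected results."""

    def __init__(self, length: int = 8):
        self.data = np.arange(length, dtype=float)

    def column(self) -> np.ndarray:
        # A real testbed would wrap the data in NumpyArrayColumn here.
        return self.data.copy()

    def expected_rows(self, index: slice) -> List[Any]:
        return self.data[index].tolist()


class ListTestBed:
    """Second hypothetical testbed, standing in for a different column type."""

    def __init__(self, length: int = 8):
        self.data = [f"row-{i}" for i in range(length)]

    def column(self) -> List[str]:
        return list(self.data)

    def expected_rows(self, index: slice) -> List[Any]:
        return self.data[index]


# Adding a column type to the suite means appending its testbed here.
TESTBEDS = [NumpyArrayTestBed, ListTestBed]


def check_slicing(testbed) -> bool:
    """Shared test body: knows nothing about the concrete column type."""
    index = slice(2, 5)
    return list(testbed.column()[index]) == testbed.expected_rows(index)


results = {cls.__name__: check_slicing(cls()) for cls in TESTBEDS}
```

In a pytest suite the `TESTBEDS` list would more naturally feed `pytest.mark.parametrize`, so that each column type shows up as its own test case.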
Overhaul the internals of the Meerkat `DataPanel`. The changes seek to enable:

1. Vectorized row-wise operations (*e.g.* slicing, reduction)
2. Simplified I/O and improved latency
3. Clarified view vs. copy behavior
   - We introduce a [new spec](https://www.notion.so/meerkat-working-doc-40d70d094ac0495684d3fd8ddc809343#2b9460b744a04cbca912c8d017e5887c) detailing when users should expect to get views vs. copies (similar to [this resource](https://numpy.org/doc/stable/reference/arrays.indexing.html) for NumPy). I'm working on enforcing this spec throughout the codebase.

The new internals are based primarily on the `BlockManager` class, a dict-like object meant to replace the dictionary we previously stored the DataPanel's columns in. The `BlockManager` manages links between a DataPanel's columns and the data blocks (`AbstractBlock`, `NumpyBlock`) where the data is actually stored. It implements `consolidate`, which takes columns of similar type in a DataPanel and stores their data together in a block, and `apply`, which applies row-wise operations (e.g. `__getitem__`) to the blocks in a vectorized fashion.

Other important classes:

- `BlockRef` objects link a block with the `BlockManager`. These are critical to the functioning of the `BlockManager` and are the primary type of object passed between the blocks and the block manager. They consist of two things:
  1. A reference to the block (`self.block`)
  2. A set of columns in the `BlockManager` whose data live in the `Block`
- `BlockableMixin` - a mixin used with `AbstractColumn` that holds references to a column's block and the column's index in the block
- `BlockView` - a simple dataclass holding a block and an index into the block. New columns are typically created from a `BlockView`.

Note: I marked this as a WIP because there are still a few more things to be done on this front.

1. Make `concat` BlockManager aware

Other major changes:

- Removed `visible_rows` from `AbstractColumn`
- Removed `_cloneable_kwargs` in favor of a unified `_clone`, `_copy`, and `_view` module (cloneable.py)
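To make the relationship between columns, blocks, and `BlockRef` objects concrete, here is a minimal sketch of the idea. It is not the code in this PR: `ToyBlockManager`, the field layout of the toy `BlockRef`, and the use of a plain 2-D NumPy array as the block are all assumptions made for illustration. It shows `consolidate` grouping same-dtype columns into one block, and `apply` indexing each block once rather than each column separately.

```python
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class BlockRef:
    """Toy version: one block plus the columns whose data live in it."""
    block: np.ndarray        # shape (n_rows, n_columns_in_block)
    columns: Dict[str, int]  # column name -> column position inside the block


class ToyBlockManager:
    """Hypothetical stand-in for the BlockManager; 1-D NumPy columns only."""

    def __init__(self, columns: Dict[str, np.ndarray]):
        # Before consolidation, every column sits in its own one-column block.
        self.refs: List[BlockRef] = [
            BlockRef(block=np.asarray(data).reshape(-1, 1), columns={name: 0})
            for name, data in columns.items()
        ]

    def consolidate(self) -> None:
        # Group columns by dtype, then stack each group into a single 2-D block.
        groups: Dict[np.dtype, Dict[str, np.ndarray]] = {}
        for ref in self.refs:
            for name, pos in ref.columns.items():
                groups.setdefault(ref.block.dtype, {})[name] = ref.block[:, pos]
        self.refs = [
            BlockRef(
                block=np.stack(list(cols.values()), axis=1),
                columns={name: i for i, name in enumerate(cols)},
            )
            for cols in groups.values()
        ]

    def apply(self, index) -> "ToyBlockManager":
        # Row-wise indexing (slice or index array) runs once per block.
        new = ToyBlockManager({})
        new.refs = [BlockRef(ref.block[index], dict(ref.columns)) for ref in self.refs]
        return new

    def __getitem__(self, name: str) -> np.ndarray:
        for ref in self.refs:
            if name in ref.columns:
                return ref.block[:, ref.columns[name]]
        raise KeyError(name)


mgr = ToyBlockManager(
    {"a": np.arange(4), "b": np.arange(4) * 10, "c": np.linspace(0.0, 1.0, 4)}
)
mgr.consolidate()                       # "a" and "b" share a block; "c" has its own
sliced = mgr.apply(slice(1, 3))         # basic slicing: NumPy returns views
indexed = mgr.apply(np.array([0, 2]))   # integer-array indexing: NumPy returns copies
```

The last two lines also illustrate why a view vs. copy spec matters: in this toy, `sliced` shares memory with the original blocks while `indexed` does not, purely because of NumPy's indexing rules. Whether Meerkat's spec follows the same rules is defined by the linked document, not by this sketch.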