Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Implement BlockManager backend #104

Merged
merged 20 commits into from
Jul 30, 2021
Merged

[WIP] Implement BlockManager backend #104

merged 20 commits into from
Jul 30, 2021

Conversation

seyuboglu
Copy link
Collaborator

@seyuboglu seyuboglu commented Jul 28, 2021

Overhaul the internals of the Meerkat DataPanel. The changes seek to enable:

  1. Vectorized row-wise operations (e.g. slicing, reduction)
  2. Simplified I/O and improved latency
  3. Clarified view vs. copy behavior
    • We introduce a new spec detailing when users should expect to get views vs. copies (similar to this resource for NumPy) – I'm working on enforcing this spec throughout the codebase.

The new internals are based primarily off the BlockManager class, a dict-like object meant to replace the dictionary we were storing the DataPanel's columns in before. The BlockManager manages links between a DataPanel's columns and data blocks (AbstractBlock, NumpyBlock) where the data is actually stored. It implements consolidate, which takes columns of similar type in a DataPanel and stores their data together in a block, and apply which applies row-wise operations (e.g. getitem) to the blocks in a vectorized fashion. Other important classes:

  • BlockRef objects link a block with the BlockManager. These are critical to the functioning of the BlockManager and are the primary type of object passed between the blocks and the block manager. They consists of two things:
    1. A reference to the block (self.block)
    2. A set of columns in the BlockManager whose data live in the Block
  • BlockableMixin - a mixin used with AbstractColumn that holds references to a column's block and the columns index in the block
  • BlockView - a simple DataClass holding a block and an index into the block. It is typical for new columns to be created from BlockView

Note: I marked this is a WIP because there are still a few more things to be done on this front.

  1. Make concat BlockManager aware

Other major changes:

  • Removed visible_rows from AbstractColumn,
  • Removed _cloneable_kwargs in favor of a unified _clone, _copy, and _view module (cloneable.py)

@seyuboglu seyuboglu requested review from krandiash and ad12 July 28, 2021 05:06
@krandiash
Copy link
Contributor

lgtm!

@codecov-commenter
Copy link

codecov-commenter commented Jul 29, 2021

Codecov Report

Merging #104 (ed79348) into dev (efebaf6) will increase coverage by 8.65%.
The diff coverage is 83.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #104      +/-   ##
==========================================
+ Coverage   65.08%   73.74%   +8.65%     
==========================================
  Files          58       51       -7     
  Lines        3340     3135     -205     
  Branches      590      532      -58     
==========================================
+ Hits         2174     2312     +138     
+ Misses       1012      671     -341     
+ Partials      154      152       -2     
Flag Coverage Δ
unittests 73.74% <83.84%> (+8.65%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
meerkat/columns/list_column.py 70.00% <28.57%> (-14.62%) ⬇️
meerkat/columns/lambda_column.py 65.88% <50.00%> (-17.17%) ⬇️
meerkat/writers/concat_writer.py 83.33% <50.00%> (+1.51%) ⬆️
meerkat/datapanel.py 72.86% <69.56%> (-0.23%) ⬇️
meerkat/columns/numpy_column.py 77.66% <74.28%> (-2.16%) ⬇️
meerkat/columns/image_column.py 92.30% <75.00%> (-0.72%) ⬇️
meerkat/columns/tensor_column.py 81.90% <75.00%> (-2.65%) ⬇️
meerkat/block/ref.py 76.66% <76.66%> (ø)
meerkat/mixins/io.py 77.58% <77.58%> (ø)
meerkat/columns/prediction_column.py 76.41% <82.35%> (+0.34%) ⬆️
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efebaf6...ed79348. Read the comment docs.

Copy link
Collaborator

@ad12 ad12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@seyuboglu seyuboglu merged commit 383dcd5 into dev Jul 30, 2021
@seyuboglu seyuboglu linked an issue Jul 30, 2021 that may be closed by this pull request
@seyuboglu
Copy link
Collaborator Author

closes #98

seyuboglu added a commit that referenced this pull request Aug 9, 2021
Closes #109
The BlockManager (see #104) introduces a need for more robust DataPanel testing that tests DataPanels with a diverse set of columns. As we add more columns, we don't want to have to update the DataPanel tests for each new column. Instead, we should specify a TestBed for each column that plugs in to the DataPanel tests.

Started this for NumpyArrayColumn with #108.

Co-authored-by: Priya <priyamis@cse.iitk.ac.in>
krandiash pushed a commit that referenced this pull request Aug 14, 2021
Overhaul the internals of the Meerkat `DataPanel`. The changes seek to enable:
1. Vectorized row-wise operations (*e.g.* slicing, reduction)
2. Simplified I/O and improved latency
3. Clarified view vs. copy behavior
    - We introduce a [new spec](https://www.notion.so/meerkat-working-doc-40d70d094ac0495684d3fd8ddc809343#2b9460b744a04cbca912c8d017e5887c) detailing when users should expect to get views vs. copies (similar to [this resource](https://numpy.org/doc/stable/reference/arrays.indexing.html) for NumPy) – I'm working on enforcing this spec throughout the codebase. 

The new internals are based primarily off the `BlockManager` class, a dict-like object meant to replace the dictionary we were storing the DataPanel's columns in before. The  `BlockManager` manages links between a DataPanel's columns and data blocks (`AbstractBlock`, `NumpyBlock`) where the data is actually stored. It implements `consolidate`, which takes columns of similar type in a DataPanel and stores their data together in a block, and `apply` which applies row-wise operations (e.g. __getitem__) to the blocks in a vectorized fashion. Other important classes:  
- `BlockRef` objects link a block with the  `BlockManager`. These are critical to the functioning of the BlockManager and are the primary type of object passed between the blocks and the block manager. They consists of two things:
  1. A reference to the block (`self.block`)
  2. A set of columns in the `BlockManager` whose data live in the `Block`
- `BlockableMixin` - a mixin used with `AbstractColumn` that holds references to a column's block and the columns index in the block
- `BlockView` - a simple DataClass holding a block and an index into the block. It is typical for new columns to be created from `BlockView`

Note: I marked this is a WIP because there are still a few more things to be done on this front. 
1. Make `concat` BlockManager aware

Other major changes:
- Removed `visible_rows` from `AbstractColumn`, 
- Removed `_cloneable_kwargs` in favor of a unified `_clone`, `_copy`, and `_view` module (cloneable.py)
krandiash pushed a commit that referenced this pull request Aug 14, 2021
Closes #109
The BlockManager (see #104) introduces a need for more robust DataPanel testing that tests DataPanels with a diverse set of columns. As we add more columns, we don't want to have to update the DataPanel tests for each new column. Instead, we should specify a TestBed for each column that plugs in to the DataPanel tests.

Started this for NumpyArrayColumn with #108.

Co-authored-by: Priya <priyamis@cse.iitk.ac.in>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Improve I/O space and time efficiency
4 participants