[FEAT] [Tensor] Add support for Tensor and FixedShapeTensor types. #1073

Merged: clarkzinzow merged 8 commits into main from clark/tensor-type on Jul 13, 2023

Conversation

Contributor

@clarkzinzow clarkzinzow commented Jun 21, 2023

This PR adds basic support for Tensor and FixedShapeTensor types, allowing us to represent arbitrary n-dimensional arrays in dataframe columns.

Heterogeneous/ragged tensor type

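    import numpy as np
    from daft.datatype import DataType
    from daft.series import Series

    # Two tensors with different shapes (and ranks), plus a null entry.
    data = [
        np.arange(8).reshape((2, 2, 2)),
        np.arange(8, 32).reshape((2, 2, 3, 2)),
        None,
    ]
    s = Series.from_pylist(data, pyobj="force")
    s = s.cast(DataType.tensor(DataType.int64()))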

Homogeneous/fixed-shape tensor type

    np_dtype = np.int64  # NumPy element type; chosen to match the Int64 tensor dtype below
    shape = (3, 2, 2)
    data = [
        np.arange(12, dtype=np_dtype).reshape(shape),
        np.arange(12, 24, dtype=np_dtype).reshape(shape),
        None,
    ]
    s = Series.from_pylist(data, pyobj="force")
    s = s.cast(DataType.tensor(DataType.int64(), shape))
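
To illustrate why a fixed shape is useful, here is a minimal pure-Python sketch of how fixed-shape tensors can be stored in one flat buffer: every row occupies exactly `prod(shape)` slots, so no per-row shape or offset needs to be stored. This is not Daft's implementation; `pack_fixed_shape`, `get_row`, the `fill` placeholder, and the `validity` list are hypothetical names chosen for the example.

```python
from math import prod
from typing import List, Optional, Sequence, Tuple


def pack_fixed_shape(
    rows: Sequence[Optional[Sequence[int]]],
    shape: Tuple[int, ...],
    fill: int = 0,
) -> Tuple[List[int], List[bool]]:
    """Pack rows (flat row-major values, or None) into one flat buffer.

    Hypothetical helper: null rows still occupy prod(shape) placeholder
    slots so that row i always starts at i * prod(shape).
    """
    size = prod(shape)
    data: List[int] = []
    validity: List[bool] = []
    for row in rows:
        if row is None:
            data.extend([fill] * size)
            validity.append(False)
        else:
            if len(row) != size:
                raise ValueError(f"expected {size} values, got {len(row)}")
            data.extend(row)
            validity.append(True)
    return data, validity


def get_row(
    data: List[int], validity: List[bool], shape: Tuple[int, ...], i: int
) -> Optional[List[int]]:
    """Return the flat values of row i, or None if the row is null."""
    if not validity[i]:
        return None
    size = prod(shape)
    return data[i * size : (i + 1) * size]


# Same values as the example above, written as flat row-major lists.
shape = (3, 2, 2)
rows = [list(range(12)), list(range(12, 24)), None]
data, validity = pack_fixed_shape(rows, shape)
```

Because the shape lives in the type rather than in the data, locating a row is a single multiplication, which is the main practical difference from the ragged case.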

TODOs

  • DataFrame-level test coverage.
  • Add repr for tensor types.
  • Convert NumPy ndarrays to Tensor type on Python ingress (Series.from_pylist()).
  • Convert Tensor type to NumPy ndarrays on Python egress (Series.to_pylist()).
  • Casting between tensor types.
  • Casting to/from image types.
  • Convert pyarrow fixed_shape_tensor extension type to FixedShapeTensor type on Arrow ingress.
  • Convert FixedShapeTensor type to pyarrow fixed_shape_tensor extension type on Arrow egress.
  • Casting to/from embedding type.
  • Add some basic kernels that delegate to the ndarray crate.

@github-actions github-actions bot added the enhancement label Jun 21, 2023

codecov bot commented Jun 21, 2023

Codecov Report

Merging #1073 (765bc53) into main (03a6646) will increase coverage by 0.03%.
The diff coverage is 92.72%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1073      +/-   ##
==========================================
+ Coverage   88.39%   88.42%   +0.03%     
==========================================
  Files          54       54              
  Lines        5488     5555      +67     
==========================================
+ Hits         4851     4912      +61     
- Misses        637      643       +6     
Impacted Files                  Coverage Δ
daft/datatype.py                92.01% <86.66%> (-0.71%) ⬇️
daft/series.py                  92.09% <89.74%> (-0.44%) ⬇️
daft/dataframe/dataframe.py     88.84% <100.00%> (ø)
daft/runners/partitioning.py    82.35% <100.00%> (ø)
daft/runners/ray_runner.py      91.66% <100.00%> (ø)
daft/table/table.py             88.46% <100.00%> (+1.01%) ⬆️
daft/udf.py                     94.11% <100.00%> (ø)
daft/utils.py                   86.04% <100.00%> (+1.83%) ⬆️

@clarkzinzow clarkzinzow force-pushed the clark/tensor-type branch 2 times, most recently from 53d67b6 to e6b8388 Compare July 10, 2023 20:16
@clarkzinzow clarkzinzow marked this pull request as ready for review July 10, 2023 20:16
Member

@samster25 samster25 left a comment

LGTM! I'm looking forward to refactoring our monolithic casting logic when we get a chance.

    - type ArrayPayload<Tgt> = (Vec<Tgt>, Option<Vec<i64>>, Option<Vec<Vec<u64>>>);
    + type ArrayPayload<Tgt> = (
    +     Vec<Tgt>,
    +     Option<Vec<i64>>,
Member

Should these match?

Contributor Author

Hmm, what do you mean by "match", and what are you referring to by "these"? The first two fields are the data array and the corresponding offsets array, where the former has a native templated type and the latter should always have an i64 type.

The last two fields are the shape array and the corresponding offsets array.
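
The four-field layout described above can be sketched in pure Python. This is only an illustration under stated assumptions, not the Rust code under review: `pack_ragged`, `unpack_row`, and the separate `validity` list are hypothetical, and each input tensor is given as a pair of flat row-major values and a shape.

```python
from math import prod
from typing import List, Optional, Sequence, Tuple

# Hypothetical input form: (flat row-major values, shape), or None for null.
Tensor = Tuple[Sequence[int], Tuple[int, ...]]


def pack_ragged(tensors: Sequence[Optional[Tensor]]):
    """Pack ragged tensors into data + offsets and shapes + shape offsets."""
    data: List[int] = []
    offsets: List[int] = [0]        # row i's values are data[offsets[i]:offsets[i+1]]
    shapes: List[int] = []
    shape_offsets: List[int] = [0]  # row i's shape is shapes[shape_offsets[i]:shape_offsets[i+1]]
    validity: List[bool] = []
    for tensor in tensors:
        if tensor is None:
            validity.append(False)  # null row: zero-length data and shape ranges
        else:
            values, shape = tensor
            if len(values) != prod(shape):
                raise ValueError("number of values does not match shape")
            data.extend(values)
            shapes.extend(shape)
            validity.append(True)
        offsets.append(len(data))
        shape_offsets.append(len(shapes))
    return data, offsets, shapes, shape_offsets, validity


def unpack_row(packed, i: int) -> Optional[Tuple[List[int], Tuple[int, ...]]]:
    """Recover (flat values, shape) for row i, or None if the row is null."""
    data, offsets, shapes, shape_offsets, validity = packed
    if not validity[i]:
        return None
    values = data[offsets[i] : offsets[i + 1]]
    shape = tuple(shapes[shape_offsets[i] : shape_offsets[i + 1]])
    return values, shape


# The ragged example from the PR description, in flat form.
packed = pack_ragged([
    (list(range(8)), (2, 2, 2)),
    (list(range(8, 32)), (2, 2, 3, 2)),
    None,
])
data, offsets, shapes, shape_offsets, validity = packed
```

For this input, `offsets` is `[0, 8, 32, 32]` and `shape_offsets` is `[0, 3, 7, 7]`: the shape array needs its own offsets because rows may differ in rank as well as in extent.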

@clarkzinzow clarkzinzow merged commit 9345e7f into main Jul 13, 2023
18 checks passed
@clarkzinzow clarkzinzow deleted the clark/tensor-type branch July 13, 2023 22:23
3 participants