Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue: migrate Arrow/Parquet to the official implementation. #555

Closed
40 of 52 tasks
waynexia opened this issue Nov 17, 2022 · 13 comments
Closed
40 of 52 tasks
Assignees
Labels
C-enhancement Category Enhancements tracking-issue A tracking issue for a feature.
Milestone

Comments

@waynexia
Copy link
Member

waynexia commented Nov 17, 2022

What type of enhancement is this?

Tech debt reduction

What does the enhancement do?

As discussed in #388, we decide to migrate the arrow/parquet implementation to the official version. This issue tracks the progress of this work.

Implementation challenges

Major API changes

Works before migration

Migrating

Working on branch replace-arrow2

Target versions:

  • DataFusion: 14.0.0
  • Arrow: 26.0.0
  • Parquet: 26.0.0

Migrating datatypes

Branch datatypes2 is working on migrating our vectors to arrow

TODO

Migrating functions

TODO

@waynexia waynexia added C-enhancement Category Enhancements tracking-issue A tracking issue for a feature. labels Nov 17, 2022
@evenyag
Copy link
Contributor

evenyag commented Nov 17, 2022

What about Field and Schema, I remember that arrow2 and arrow have different APIs.

@v0y4g3r v0y4g3r pinned this issue Nov 18, 2022
@evenyag
Copy link
Contributor

evenyag commented Nov 18, 2022

add proxy method like get_validity for ArrayRef

Only a few places use get_validity(), I think we could skip this step.

BTW, we also use methods in arrow::compute

Ok(arrow::compute::cast::cast(
arithmetics::div(tmp.as_ref(), arr).as_ref(),
&DataType::Int64,
cast::CastOptions {
wrapped: false,
partial: true,
},
)

They might have different semantic

@waynexia
Copy link
Member Author

They might have different semantic

* [arrow2 CastOptions](https://docs.rs/arrow2/latest/arrow2/compute/cast/struct.CastOptions.html)

* [arrow CastOptions](https://docs.rs/arrow/latest/arrow/compute/struct.CastOptions.html)

Yes, so many API changes. For this particular case, I think we can set safe to true in the new option.

Gladly (or maybe sadly 🥲) it looks like we don't have this kind of boundary-case tests. I almost see the future that we've paid lots of care to review the change (I hope so 🤪) but still ignore some critical changes that will take us a day to trace it down.

@v0y4g3r
Copy link
Contributor

v0y4g3r commented Nov 21, 2022

We may also need to refactor Vector to separate vector's logical type from it's internal representation like discussed in #203

@waynexia
Copy link
Member Author

We may also need to refactor Vector to separate vector's logical type from it's internal representation like discussed in #203

Looks #203 is suggesting a non-trivial change. We can dive into it after this migration.

@killme2008 killme2008 unpinned this issue Nov 22, 2022
@killme2008 killme2008 pinned this issue Nov 22, 2022
@evenyag
Copy link
Contributor

evenyag commented Nov 23, 2022

I'm working on the new vectors based on the official arrow in branch datatypes2. Now I implement the boolean and binary vector. The next step is porting the primitive vector. @waynexia @v0y4g3r

@evenyag
Copy link
Contributor

evenyag commented Dec 9, 2022

  • test test_clip_fn_* may fail with SIGSEVG

I reproduced this BUG in another repo and find out that it is highly related to the usage of the with_match_primitive_type_id macro.

The AddressSanitizer also says that the stack overflowed when running the tests.

=================================================================
==71205==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x00016d433f60 at pc 0x000102ca8210 bp 0x00016d3caf90 sp 0x00016d3caf88
WRITE of size 8 at 0x00016d433f60 thread T2
    #0 0x102ca820c in debug_datatypes::macros::tests::eval_i64::h9032765d623f5419 macros.rs:178

The define_eval expands to about 11 * 11 * 11 = 1331 (there are 10 primitive types and an unreachable branch) match arms. The overflow should be caused by rust-lang/rust#34283. Thanks for providing this helpful hint @discord9.

@evenyag
Copy link
Contributor

evenyag commented Dec 14, 2022

arrow only provides a generic version of arithmetic operation, but arrow2 supports passing scalar dynamically

arrow:

pub fn add_scalar_dyn<T>(array: &dyn Array, scalar: T::Native) -> Result<ArrayRef>
where
    T: ArrowNumericType,
    T::Native: ArrowNativeTypeOp,
{}

arrow2:

pub fn add_scalar(lhs: &dyn Array, rhs: &dyn Scalar) -> Box<dyn Array> {}

@DiamondMofeng
Copy link
Contributor

Is this error relavate to this issue?
This function is not necessary, but it is useful in time serise query.

图片

@DiamondMofeng
Copy link
Contributor

Sorry, I didn't see there is an available now() function, which is an alternative.
图片

@evenyag
Copy link
Contributor

evenyag commented Feb 3, 2023

This function is not necessary, but it is useful in time serise query

Unfortunately, we don't support Time64 data type now. Note that the Timestamp type is different from the Time type.

@v0y4g3r v0y4g3r modified the milestones: v0.3, v0.1 Feb 15, 2023
@killme2008
Copy link
Contributor

What's the status of this tracking issue right now? Can we close it or keep it? @evenyag @v0y4g3r

@v0y4g3r
Copy link
Contributor

v0y4g3r commented Jun 20, 2023

What's the status of this tracking issue right now? Can we close it or keep it? @evenyag @v0y4g3r

I think it's sufficient to close this tracking issue. The unresolved todos mainly relate to data type refactoring, it's better to open another issue for that topic.

@v0y4g3r v0y4g3r closed this as completed Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category Enhancements tracking-issue A tracking issue for a feature.
Projects
None yet
Development

No branches or pull requests

5 participants