Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generics example #70

Open
aldanor opened this issue Oct 7, 2022 · 15 comments
Open

Generics example #70

aldanor opened this issue Oct 7, 2022 · 15 comments

Comments

@aldanor
Copy link

aldanor commented Oct 7, 2022

@ncpenke Hey - I'm in the process of adding generics support (to structs, as a first step), which is a pretty painful process to say the least 🤣

As a matter of fact, most of it compiles, except a few weird quirks. It would really help if you could provide a hand-written example of how it's supposed to work so I wouldn't be guessing blindly.

First, deserialization. There's this bound in the ArrowDeserialize trait that seems to be causing problems:

pub trait ArrowDeserialize: ArrowField + Sized
where
    Self::ArrayType: ArrowArray,
    for<'a> &'a Self::ArrayType: IntoIterator, // <----
{ ... }

Basically, if for some struct Foo<A, B> for all implementations we require

impl OneOfTheTraits for OneOfTheStructs<A, B>
where 
    A: ArrowDeserialize, 
    B: ArrowDeserialize,
{ ... }

this leads to

error[E0277]: `&'a <A as ArrowDeserialize>::ArrayType` is not an iterator
   = help: the trait `for<'a> Iterator` is not implemented for `&'a <A as ArrowDeserialize>::ArrayType`
   = note: required because of the requirements on the impl of `for<'a> IntoIterator` for `&'a <A as ArrowDeserialize>::ArrayType`

However, if we require

impl OneOfTheTraits for OneOfTheStructs<A, B>
where
    A: ArrowDeserialize,
    B: ArrowDeserialize,
    for<'a> &'a <A as ArrowDeserialize>::ArrayType: IntoIterator,
    for<'a> &'a <B as ArrowDeserialize>::ArrayType: IntoIterator,
{ ... }

this results in

overflow evaluating the requirement `for<'_a> &'_a FooArray<_, _>: IntoIterator`

Wonder if you'd have any ideas on this, or maybe provide a trivial working example? 🤔

(I have a feeling that those trait bounds together and all the for<'_> are messing things up big time; perhaps this could be rewritten once GATs are stabilized in less than a month from now in 1.65?... idk)

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

The funniest part is that the requirement overflow happens in the impl that's marked as unimplemented!() and doesn't matter anyway...

(so maybe those IntoIterator bounds don't really belong to the trait itself, but rather to some impls that actually use them?)

@ncpenke
Copy link
Collaborator

ncpenke commented Oct 8, 2022

@aldanor This is exciting, thanks for working on this!

I ran into a lot of those overflow errors when developing this crate, and had to be strategic about the trait bounds and you're absolutely right GATs in 1.65 will result in a massive rewrite and simplification of this crate, and I'm personally looking forward to it.

Having said that, I'm surprised it's not working in its current from. Please give me a few days and I'll come up with a hand written example (or at least an explanation of where I get stuck), in case you want to keep working on this (though might be strategic to wait until GATs come in and see how it impacts this crate).

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

I've posted two examples here, maybe it will save you some time: https://gist.github.com/aldanor/de7accf7bb40e83bec6e1e51f08f1190 (one without for<'_> bounds which complains about IntoIterator and one with, which leads to requirement overflows)

Yea, with GATs things should probably get rewritten, but all the generics boilerplate in proc-macros will stay largely the same regardless, so it needs to be done... (there are also various minor bugs which I've fixed along the way - e.g., if you have a field called "data_type" or "validity" in your struct, compilation will fail, etc).

So maybe we can manage getting generics (at least struct generics) in first anyway. And with GATs, we don't have to wait till 1.65 - it should be fine on the beta toolchain, so it could land soon after 1.65 is released because why not :)

// I'm still a bit confused re: the unimplemented!() part for struct array implementation, that feels a bit wrong... but that's a separate story, I guess. Maybe something to think about if redesigning things for GATs?

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

Just to mention it, there's another problem I have (which shows up in one of the gists above, the one with bounds enabled) - in try_push() and return_next() implementations there's a bit of confusion between Self and <Self as ArrowField>::Type, seems I've mixed those up but I'm not sure I fully understand what's what 😄

I have a suspicion that this wasn't properly working in the non-generic version as well, but I may be wrong. Would appreciate if you could clarify that as well.

E.g.:

115 | impl<__T: std::borrow::Borrow<Foo<A, B>>, A, B> arrow2::array::TryPush<Option<__T>>
    |                                           - this type parameter
...
132 |                 <A as arrow2_convert::serialize::ArrowSerialize>::arrow_serialize(
    |                 ----------------------------------------------------------------- arguments to this function are incorrect
133 |                     i.a.borrow(),
    |                     ^^^^^^^^^^^^ expected associated type, found type parameter `A`
    |
    = note: expected reference `&<A as ArrowField>::Type`
               found reference `&A`

@ncpenke
Copy link
Collaborator

ncpenke commented Oct 8, 2022

Thanks for the thoughts and questions @aldanor.

there are also various minor bugs which I've fixed along the way - e.g., if you have a field called "data_type" or "validity" in your struct, compilation will fail, etc

If you want to check these in as part of another issue/PR please feel free to create one and we can get it reviewed/merged.

I've posted two examples here, maybe it will save you some time: https://gist.github.com/aldanor/de7accf7bb40e83bec6e1e51f08f1190 (one without for<'_> bounds which complains about IntoIterator and one with, which leads to requirement overflows)

I'm working on an updated example for you @aldanor but essentially A and B (or any generic types for that matter) don't need to implement the arrow2_convert traits. For example consider the following:

struct StructWithCustomAllocators<A, B, CustomAlloc> 
where CustomAlloc: Allocator
{
    a: Vec<A, CustomAlloc>,
    b: Vec<B, CustomAlloc>,
}

In this case CustomAlloc should not implement any of the traits (and as an aside this reminds me in order to fix this we need a supporting change to support allocators as well).

With regards to the errors + readability, as we discussed above GATs will clean up the code. Part of the reason the code got a bit convoluted is to supported nested type conversions, specifically Vec<T>. This requires both serialization, and deserialization sources and targets to be iterators rather than single elements. Once we have GATs, I think we should be able to do something like Type = Iterator<Item = String>, instead of having to indirectly achieve the same thing now.

I'll post an example shortly. If that doesn't help clarify then happy to discuss this further and offer as much clarity as needed.

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

I'm working on an updated example for you @aldanor but essentially A and B (or any generic types for that matter) don't need to implement the arrow2_convert traits. For example consider the following:

struct StructWithCustomAllocators<A, B, CustomAlloc> 
where CustomAlloc: Allocator
{
    a: Vec<A, CustomAlloc>,
    b: Vec<B, CustomAlloc>,
}

This is hard to achieve automatically - note, for example, that even standard libraries traits like Debug, when derived, will require impl Debug for all of the types even if it's not technically needed (like in your example above).

One solution to this is to add an attr (container attr), marking this type as "foreign" / "ignored". Serde also has custom bounds for scenarios like this (because it's impossible to figure it out automatically in some cases) - https://serde.rs/container-attrs.html#bound.

E.g. (this can be done differently, this is just one possible way):

#[derive(ArrowField)]
#[arrow_field(ignore_bound = "CustomAlloc")]
struct StructWithCustomAllocators<A, B, CustomAlloc> 
    where CustomAlloc: Allocator
{
    a: Vec<A, CustomAlloc>,
    b: Vec<B, CustomAlloc>,
}

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

Re: Debug, here's what #[derive(Debug)] expands to:

#[automatically_derived]
impl<A: ::core::fmt::Debug, B: ::core::fmt::Debug,
    CustomAlloc: ::core::fmt::Debug> ::core::fmt::Debug for
    StructWithCustomAllocators<A, B, CustomAlloc> where CustomAlloc: Allocator
    {
    fn fmt(&self, f: &mut ::core::fmt::Formatter) -> ::core::fmt::Result {
        ::core::fmt::Formatter::debug_struct_field2_finish(f,
            "StructWithCustomAllocators", "a", &&self.a, "b", &&self.b)
    }
}

Does it make sense? Not really... but that's how it is (in fact, as you can see in the gist I shared, I had to implement Debug manually for a few types exactly because of this problem).

@ncpenke
Copy link
Collaborator

ncpenke commented Oct 8, 2022

@aldanor I'm able to reproduce the overflow error from your example, I'll need some time to think on it. The IntoIter for the generated struct arrays is unimplemented since we don't use it for the deserialize path. It was more of a corner cutting, in theory it should be possible to properly implement it as well. I don't fully follow the circular dependency that's causing the overflow. I just have a few personal things to take care of urgently and can then focus and will definitely able to help progress this along, early next week.

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

@ncpenke Sounds good.

I've tried to simplify it further, here's a playground that fails: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3577126d5818fd7678a9343bd29f6569

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

And here's an example of the same thing (I think?) that compiles on 1.65 beta: https://play.rust-lang.org/?version=beta&mode=debug&edition=2021&gist=a3fb2e3e7e8248b7a6c9c721a57fc351

Wonder if that's something like what you had in mind with GATs? :) (note that we no longer have to carry around those stupid for<'a> bounds every time we mention ArrowDeserialize, however there several new Self: 'a bounds in a few places but they are not too invasive)

@ncpenke
Copy link
Collaborator

ncpenke commented Oct 8, 2022

Wonder if that's something like what you had in mind with GATs? :)

I just skimmed through it, and it's along the lines of how I was hoping it would work. Thanks for taking the time to post it.

Honestly, it was pretty painful getting the crate this far without GATs. There are various points at which the overflow error creeps up (sometimes for legitimate use-cases where there are no circular dependencies).

My proposal to you is we'll create a feature branch for the next release, which will merge to main after rust 1.65 is stable. I'll move all development except hot fixes to this feature branch.

You can merge your generics changes there. If you also want to take on the refactoring of the crate to use GATs please let me know, otherwise I should be able to get to it next week.

I think that will be far more practical/time-efficient than trying to get this working without GATs. I'm sure we could do it if we really tried, but with the release of 1.65 at most a few weeks away it doesn't seem worthwhile?

@aldanor
Copy link
Author

aldanor commented Oct 8, 2022

Yea, I agree. A few weeks before GATs land spending much time on making the code work that we'll throw out anyway doesn't seem too reasonable.

I could certainly try to take on it (I'll have to stash my current generics branch then, try and do the GATs refactor, and then get back to generics again) - this is why I've posted the IterRef sketch to see what you think about it. Wonder if there's any other places in particular where GATs could be used for refactoring, or is it the only weak spot? (i.e. ArrowDeserialize/IntoIterator)

@ncpenke
Copy link
Collaborator

ncpenke commented Oct 8, 2022

Awesome, I just created https://github.com/DataEngineeringLabs/arrow2-convert/tree/feature/v0.4.0. The IterRef sketch looked good to me. I remember when I was thinking through this that the ArrowSerialize, and ArrowField could be cleaned up as well with GATs

@aldanor
Copy link
Author

aldanor commented Feb 9, 2023

@ncpenke Hey, apologies for falling out for a bit, was way too busy with life and work, but finally got some time to join back if it suits :)

Wonder what's the latest story, and is it still the plan to continue on with generics + GATs? (will take me a little while to read through all of my own notes and sketch code to get back into context though)

@ncpenke
Copy link
Collaborator

ncpenke commented Feb 9, 2023

@aldanor Great to hear from you. No problem at all, I've been busy as well. Yes the plan is still to continue with generics + GATs. The project had a few contributions come in, so it was worth doing a release. Once I get some breathing room I'll rebase and wrap up my generic changes, so that you can make the GAT changes. I'm sorry this has stretched so long but let's get it done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants