Arrow Support #78

Lundez · 2021-12-21T14:31:41Z

Hi, I can't find any indication that DataFrame supports Arrow as its internal serialization format / backend.

Is this something which you're working on?

nikitinas · 2021-12-22T22:46:05Z

Hi, Lundez!

Currently DataFrame doesn't use Arrow as a backend, but it's on the roadmap.

Until now we were mostly focused on the frontend part: a typesafe Kotlin API, code generation, schema inference and other tricks that provide a great experience when you work with data in Kotlin. But now the API and overall model are getting stable, so it's time to do more work on performance tuning and scalability, including Arrow support as a backend.

Currently the project has only two active contributors, so any help will be very much appreciated!

Lundez · 2021-12-25T01:54:42Z

Hi, do you have any pointers on how to start?

Do you think the Java Arrow API can work with your "typing" (or whatever the typing used in data frames is called)? 😊

I think adding Arrow would give this project a big boost.
Adding a query optimizer as a follow-up would also be a huge bonus, like pola.rs / Spark. Optimizing columns and other things makes a lot of sense when using Arrow! 😄

jimexist · 2022-03-06T05:07:43Z

I have some experience with Arrow (as an Arrow committer), so let me try to set this up.

The current plan is to split the work into two parts:

1. Arrow schema reading
2. Arrow file / data loading and off-heap memory management

Subsequent features can take more tangible form once reading is done, e.g. Arrow file writing, streaming, predicate push down, etc.

Lundez · 2022-03-06T10:28:48Z

@jimexist incredibly excited to hear this!

Kopilov · 2022-05-20T07:48:11Z

> Currently the project has only two active contributors, so any help will be very much appreciated!

Hello @nikitinas, what do you think about my latest PRs?

I have also written some code for writing to Arrow, but it does not cover all DataFrame-supported column types (it was originally made for Krangl).

Kopilov · 2022-07-06T07:49:27Z

Hello again.
I am working on a more complex unit test for Arrow reading and will open a PR a little later.
For now, you can look at the example data and the code it was generated with here

Kopilov · 2022-07-11T11:00:31Z

@koperagen, @nikitinas, I'd like your opinion on the following detail.

In an Arrow schema we have a nullable flag, but its value does not depend on the column content. So we may get a column that is marked as not nullable but actually contains null values. Here is an example.

So, we can:

1. Ignore the nullable flag in the file, read all the data and set the nullable flag in the DataFrame schema if and only if there are null values in the column;
2. Look at the nullable flag and always copy it to the DataFrame schema; thus reading data like the above will produce an error;
3. Look at the nullable flag, copy it to the DataFrame schema by default and then change not nullable to nullable if there are null values.

Which behavior is best, and should we support more than one of them, in your view?

Lundez · 2022-07-11T13:35:48Z

> @koperagen, @nikitinas, I want your opinion about the next detail.
>
> In Arrow schema we have nullable flag but its value does not depend on column content. And we may get a column that is marked as not nullable but actually contains null values. Here is an example.
>
> So, we can:
>
> 1. Ignore nullable flag in the file, read all data and set nullable flag in DataFrame schema if and only if there are null values in the column;
> 2. Look at nullable flag and always copy it to DataFrame schema; thus reading data like above will produce an error;
> 3. Look at nullable flag, copy it to DataFrame schema by default and then change not nullable to nullable if there are null values.
>
> What behavior is the best and should we support different of them, in your point of view?

Could we support different read modes? Defaulting to the first or third option makes sense, but a strict mode (the second option), enabled through a flag or read mode, would be great IMO.

koperagen · 2022-07-11T18:52:01Z

> @koperagen, @nikitinas, I want your opinion about the next detail.
>
> In Arrow schema we have nullable flag but its value does not depend on column content. And we may get a column that is marked as not nullable but actually contains null values. Here is an example.
>
> So, we can:
>
> 1. Ignore nullable flag in the file, read all data and set nullable flag in DataFrame schema if and only if there are null values in the column;
> 2. Look at nullable flag and always copy it to DataFrame schema; thus reading data like above will produce an error;
> 3. Look at nullable flag, copy it to DataFrame schema by default and then change not nullable to nullable if there are null values.
>
> What behavior is the best and should we support different of them, in your point of view?

Hm, I would prefer 1 as the default, because in the REPL it can help avoid unnecessary null handling when there are no nulls. But we also need 3 for the Gradle plugin, which generates a schema declaration from a data sample.

Do I understand the second option correctly? Would something like this be possible?

    val df = DataFrame.readArrow()

    df.notNullableColumn.map { it / 2 } // null pointer exception

I think we shouldn't have this mode unless there is very strong evidence that it is very useful for someone :)

Or do you mean this?

    val df = DataFrame.readArrow() // Exception: notNullableColumn marked not nullable in schema, but has nulls

All of this reminds me of "Infer", which is used as a flag for some operations.
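The two readings of the second option can be contrasted in plain Kotlin, without DataFrame or Arrow. This is only a sketch: `readNotNullable` and its `checkOnRead` parameter are hypothetical names made up for illustration, and a `List<Int?>` stands in for a column that the file schema declares as not nullable.

```kotlin
// Hypothetical helper, not part of DataFrame: returns the values of a column
// that the file schema declares as not nullable.
@Suppress("UNCHECKED_CAST")
fun readNotNullable(name: String, values: List<Int?>, checkOnRead: Boolean): List<Int> {
    if (checkOnRead) {
        // Fail fast: reject the data as soon as it contradicts the schema.
        require(values.none { it == null }) {
            "$name marked not nullable in schema, but has nulls"
        }
    }
    // Unchecked cast: generics are erased at runtime, so nulls slip through here.
    return values as List<Int>
}

fun main() {
    val raw = listOf(2, null, 6)

    // First reading: the read succeeds and the failure is deferred to the first use.
    val column = readNotNullable("notNullableColumn", raw, checkOnRead = false)
    val deferred = runCatching { column.map { it / 2 } }
    println(deferred.exceptionOrNull())

    // Second reading: the read itself reports the mismatch.
    val eager = runCatching { readNotNullable("notNullableColumn", raw, checkOnRead = true) }
    println(eager.exceptionOrNull()?.message)
}
```

With the check disabled, the error surfaces far from its cause, which is the main argument against that variant.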

Kopilov · 2022-07-12T07:50:37Z

Thank you for pointing out the Infer enum. It can probably be used as a parameter.

> Hm, i would prefer 1 as a default

OK, thanks for sharing.
About 2, I expected something like

    val df = DataFrame.readArrow() // Exception: notNullableColumn marked not nullable in schema, but has nulls

when calling

    DataColumn.createValueColumn(field.name, listWithNulls, typeNotNullable, Infer.None)

but what we actually have now is

    val df = DataFrame.readArrow()
    df.notNullableColumn.map { it / 2 } // null pointer exception

I will fix that.

Where can I read more about the Gradle plugin? How do you use it?

Kopilov · 2022-07-12T09:17:48Z

I suggest the following mapping if we use Infer as a parameter:

Infer.Nulls: set the nullable flag in the DataFrame schema if and only if there are null values in the column; make this the default;
Infer.None: copy the Arrow schema to DataFrame and throw an exception like "notNullableColumn marked not nullable in schema, but has nulls";
Infer.Type: copy the Arrow schema to DataFrame and change not nullable to nullable if there are null values. Or it would actually be the same as Infer.Nulls (a single type is already guaranteed by Arrow).

koperagen · 2022-07-12T11:44:44Z

> Where can I read more about the Gradle plugin? How do you use it?

https://kotlin.github.io/dataframe/gradle.html

> I suggest next mapping if use Infer as a parameter:

I'm not sure about it anymore, because Infer.Type does a different thing in other operations. Infer.Nulls means "actual data nullability" == "schema nullability", whereas in our case "set nullable flag in DataFrame schema if and only if there are null values in the column" is "narrow nullability if possible", and the third option is "widen nullability if needed".
What do you think about a new enum, something like SchemaVerification? It would describe the variants of this operation:
actual nullability (from data) + schema nullability (from file) -> nullability | error
Maybe some other name, idk.

Edit: colleagues suggested NullabilityOptions, NullabilityTransformOptions, NullabilityOperatorOptions, NullabilityCompositionOptions.
As for the enum variants, they could be WIDENING, NARROWING, CHECKING.
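The operation "actual nullability + schema nullability -> nullability | error" can be modelled as a small pure function. The following is a minimal sketch, not the library's API: `NullabilityMode` and `resolveNullability` are hypothetical names, and NARROWING is interpreted here as option 1 from the thread (follow the data and ignore the file's flag).

```kotlin
// Hypothetical names: a model of the proposal, not the library's actual API.
enum class NullabilityMode { WIDENING, NARROWING, CHECKING }

/**
 * Combines the nullable flag declared in the file schema with what the data
 * actually contains and returns the nullability for the resulting schema.
 */
fun resolveNullability(
    mode: NullabilityMode,
    schemaNullable: Boolean,
    dataHasNulls: Boolean,
    columnName: String = "column",
): Boolean = when (mode) {
    // Trust the file, but widen to nullable if nulls are actually present.
    NullabilityMode.WIDENING -> schemaNullable || dataHasNulls
    // Assumption: follow the data and ignore the file's flag (option 1).
    NullabilityMode.NARROWING -> dataHasNulls
    // Trust the file and fail when the data contradicts it.
    NullabilityMode.CHECKING -> {
        require(schemaNullable || !dataHasNulls) {
            "$columnName marked not nullable in schema, but has nulls"
        }
        schemaNullable
    }
}

fun main() {
    val values: List<Int?> = listOf(1, null, 3)
    val hasNulls = values.any { it == null }

    println(resolveNullability(NullabilityMode.WIDENING, schemaNullable = false, dataHasNulls = hasNulls))
    println(resolveNullability(NullabilityMode.NARROWING, schemaNullable = true, dataHasNulls = false))
    println(runCatching {
        resolveNullability(NullabilityMode.CHECKING, schemaNullable = false, dataHasNulls = hasNulls)
    }.exceptionOrNull()?.message)
}
```

Keeping the decision in one function makes it easy to pick a different default per caller, for example one mode for the REPL and another for the Gradle plugin.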

Kopilov · 2022-07-15T10:38:41Z

Implemented in #129
Narrowing was renamed to Keeping because, when the file schema is ignored, a column marked nullable may turn out to contain no nulls, and a column marked not nullable may turn out to contain some.

nikitinas mentioned this issue Dec 22, 2021

Add primitive arrays column wrappers #30

Open

nikitinas added the enhancement label Dec 22, 2021

jimexist mentioned this issue Mar 6, 2022

Support reading Arrow .feather file using Apache Arrow #93

Merged

zaleslaw added the performance and research labels and removed the enhancement label Apr 25, 2023

zaleslaw added this to the Backlog milestone Apr 25, 2023

koperagen mentioned this issue Aug 18, 2023

Expected hasNulls behavior for #428 #429

Open
