Skip to content

Commit

Permalink
Towards #11 Add support for enums (#37)
Browse files Browse the repository at this point in the history
- enum slices don't work - need scalar support
  • Loading branch information
ncpenke committed Jun 28, 2022
1 parent 7d9e132 commit f4c3e76
Show file tree
Hide file tree
Showing 13 changed files with 1,270 additions and 253 deletions.
134 changes: 71 additions & 63 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,37 +2,81 @@

Provides an API on top of [`arrow2`](https://github.com/jorgecarleitao/arrow2) to convert between rust types and Arrow.

The Arrow ecosystem provides many ways to convert between Arrow and other popular formats across several languages. This project aims to serve the need for rust-centric data pipelines to easily convert to/from Arrow via an auto-generated compile-time schema.
The Arrow ecosystem provides many ways to convert between Arrow and other popular formats across several languages. This project aims to serve the need for rust-centric data pipelines to easily convert to/from Arrow with strong typing and arbitrary nesting.

## Example

The example below performs a round trip conversion of a struct with a single field.

Please see the [complex_example.rs](./arrow2_convert/tests/complex_example.rs) for usage of the full functionality.

```rust
/// Simple example

use arrow2::array::Array;
use arrow2_convert::{deserialize::TryIntoCollection, serialize::TryIntoArrow, ArrowField};

#[derive(Debug, Clone, PartialEq, ArrowField)]
pub struct Foo {
name: String,
}

#[test]
fn test_simple_roundtrip() {
// an item
let original_array = [
Foo { name: "hello".to_string() },
Foo { name: "one more".to_string() },
Foo { name: "good bye".to_string() },
];

// serialize to an arrow array. try_into_arrow() is enabled by the TryIntoArrow trait
let arrow_array: Box<dyn Array> = original_array.try_into_arrow().unwrap();

// which can be cast to an Arrow StructArray and be used for all kinds of IPC, FFI, etc.
// supported by `arrow2`
let struct_array= arrow_array.as_any().downcast_ref::<arrow2::array::StructArray>().unwrap();
assert_eq!(struct_array.len(), 3);

// deserialize back to our original vector via TryIntoCollection trait.
let round_trip_array: Vec<Foo> = arrow_array.try_into_collection().unwrap();
assert_eq!(round_trip_array, original_array);
}
```

## API

Types that implement the `ArrowField`, `ArrowSerialize` and `ArrowDeserialize` traits can be converted to/from Arrow. The `ArrowField` implementation for a type defines the Arrow schema. The `ArrowSerialize` and `ArrowDeserialize` implementations provide the conversion logic via arrow2's data structures.
Types that implement the `ArrowField`, `ArrowSerialize` and `ArrowDeserialize` traits can be converted to/from Arrow via the `try_into_arrow` and the `try_into_collection` methods.

The `ArrowField` derive macro can be used to generate implementations of these traits for structs and enums. Custom implementations can also be defined for any type that needs to convert to/from Arrow by manually implementing the traits.

For serializing to arrow, `TryIntoArrow::try_into_arrow` can be used to serialize any iterable into an `arrow2::Array` or a `arrow2::Chunk`. `arrow2::Array` represents the in-memory Arrow layout. `arrow2::Chunk` represents a column group and can be used with `arrow2` API for other functionality such converting to parquet and arrow flight RPC.

For deserializing from arrow, the `TryIntoCollection::try_into_collection` can be used to deserialize from an `arrow2::Array` representation into any container that implements `FromIterator`.

## Features

- A derive macro, `ArrowField`, can generate implementations of the above traits for structures. Support for enums is in progress.
- Implementations are provided for Arrow primitives
- Numeric types
- [`u8`], [`u16`], [`u32`], [`u64`], [`i8`], [`i16`], [`i32`], [`i64`], [`f32`], [`f64`]
- [`i128`] is supported via the `override` attribute. Please see the [i128 section](#i128) for more details.
- Other types:
- [`bool`], [`String`], [`Binary`]
- Temporal types:
- [`chrono::NaiveDate`], [`chrono::NaiveDateTime`]
- Blanket implementations are provided for types that implement the above traits:
- Option<T>
- Vec<T>
- Large Arrow types [`LargeBinary`], [`LargeString`], [`LargeList`] are supported via the `override` attribute. Please see the [complex_example.rs](./arrow2_convert/tests/complex_example.rs) for usage.
### Default implementations

Default implementations of the above traits are provided for the following:

- Numeric types
- [`u8`], [`u16`], [`u32`], [`u64`], [`i8`], [`i16`], [`i32`], [`i64`], [`f32`], [`f64`]
- [`i128`] is supported via the `type` attribute. Please see the [i128 section](#i128) for more details.
- Other types:
- [`bool`], [`String`], [`Binary`]
- Temporal types:
- [`chrono::NaiveDate`], [`chrono::NaiveDateTime`]
- Option<T> if T implements `ArrowField`
- Vec<T> if T implements `ArrowField`
- Large Arrow types [`LargeBinary`], [`LargeString`], [`LargeList`] are supported via the `type` attribute. Please see the [complex_example.rs](./arrow2_convert/tests/complex_example.rs) for usage.
- Fixed size types [`FixedSizeBinary`], [`FixedSizeList`] are supported via the `FixedSizeVec` type override.
- Note: nesting of [`FixedSizeList`] is not supported.
- Scalars and enums are in progress
- Support for generics, slices and reference is currently missing.

This is not an exhaustive list. Please open an issue if you need a feature.
### Enums

Enums are still an experimental feature and need to be integrated tested. Rust enum arrays are converted to a `Arrow::UnionArray`. Some additional notes on enums:

- Rust unit variants are represented using as the `bool` data type.
- Enum slices currently [don't deserialize correctly](https://github.com/DataEngineeringLabs/arrow2-convert/issues/53).

### i128

Expand All @@ -46,7 +90,7 @@ use arrow2_convert::ArrowField;

#[derive(Debug, ArrowField)]
struct S {
#[arrow_field(override = "I128<32, 32>")]
#[arrow_field(type = "I128<32, 32>")]
field: i128,
}
```
Expand All @@ -72,51 +116,16 @@ fn convert_i128() {
```
### Nested Option Types

Since the Arrow format only supports one level of validity, nested option types such as `Option<Option<T>>` after serialization to Arrow will lose intermediate nesting of None values. For example, `Some(None)` will be serialized to `None`,

## Memory

Pass-thru conversions perform a single memory copy. Deserialization performs a copy from arrow2 to the destination. Serialization performs a copy from the source to arrow2. In-place deserialization is theoretically possible but currently not supported.

## Example

Below is a bare-bones example that does a round trip conversion of a struct with a single field.

Please see the [complex_example.rs](./arrow2_convert/tests/complex_example.rs) for usage of the full functionality.
Since the Arrow format only supports one level of validity, nested option types such as `Option<Option<T>>`, after serialization to Arrow, will lose any intermediate nesting of None values. For example, `Some(None)` will be serialized to `None`,

```rust
/// Simple example
### Missing Features

use arrow2::array::Array;
use arrow2_convert::{deserialize::TryIntoCollection, serialize::TryIntoArrow, ArrowField};

#[derive(Debug, Clone, PartialEq, ArrowField)]
pub struct Foo {
name: String,
}

#[test]
fn test_simple_roundtrip() {
// an item
let original_array = [
Foo { name: "hello".to_string() },
Foo { name: "one more".to_string() },
Foo { name: "good bye".to_string() },
];

// serialize to an arrow array. try_into_arrow() is enabled by the TryIntoArrow trait
let arrow_array: Box<dyn Array> = original_array.try_into_arrow().unwrap();
- Support for generics, slices and reference is currently missing.

// which can be cast to an Arrow StructArray and be used for all kinds of IPC, FFI, etc.
// supported by `arrow2`
let struct_array= arrow_array.as_any().downcast_ref::<arrow2::array::StructArray>().unwrap();
assert_eq!(struct_array.len(), 3);
This is not an exhaustive list. Please open an issue if you need a feature.
## Memory

// deserialize back to our original vector via TryIntoCollection trait.
let round_trip_array: Vec<Foo> = arrow_array.try_into_collection().unwrap();
assert_eq!(round_trip_array, original_array);
}
```
Pass-thru conversions perform a single memory copy. Deserialization performs a copy from arrow2 to the destination. Serialization performs a copy from the source to arrow2. In-place deserialization is theoretically possible but currently not supported.

## Internals

Expand All @@ -128,7 +137,6 @@ However unlike serde's traits provide an exhaustive and flexible mapping to the

Specifically, the `ArrowSerialize` trait provides the logic to serialize a type to the corresponding `arrow2::array::MutableArray`. The `ArrowDeserialize` trait deserializes a type from the corresponding `arrow2::array::ArrowArray`.


### Workarounds

Features such as partial implementation specialization and generic associated types (currently only available in nightly builds) can greatly simplify the underlying implementation.
Expand Down
11 changes: 5 additions & 6 deletions arrow2_convert/tests/complex_example.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
use arrow2::array::*;
use arrow2_convert::deserialize::{arrow_array_deserialize_iterator, TryIntoCollection};
use arrow2_convert::field::{FixedSizeBinary, FixedSizeVec, LargeBinary, LargeString, LargeVec};
use arrow2_convert::serialize::TryIntoArrow;
/// Complex example that uses the following features:
///
Expand Down Expand Up @@ -40,19 +39,19 @@ pub struct Root {
// int 32 array
int32_array: Vec<i32>,
// large binary
#[arrow_field(override = "LargeBinary")]
#[arrow_field(type = "arrow2_convert::field::LargeBinary")]
large_binary: Vec<u8>,
// fixed size binary
#[arrow_field(override = "FixedSizeBinary<3>")]
#[arrow_field(type = "arrow2_convert::field::FixedSizeBinary<3>")]
fixed_size_binary: Vec<u8>,
// large string
#[arrow_field(override = "LargeString")]
#[arrow_field(type = "arrow2_convert::field::LargeString")]
large_string: String,
// large vec
#[arrow_field(override = "LargeVec<i64>")]
#[arrow_field(type = "arrow2_convert::field::LargeVec<i64>")]
large_vec: Vec<i64>,
// fixed size vec
#[arrow_field(override = "FixedSizeVec<i64, 3>")]
#[arrow_field(type = "arrow2_convert::field::FixedSizeVec<i64, 3>")]
fixed_size_vec: Vec<i64>,
}

Expand Down
3 changes: 1 addition & 2 deletions arrow2_convert/tests/test_deserialize.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
use arrow2::array::*;
use arrow2::error::Result;
use arrow2_convert::deserialize::*;
use arrow2_convert::field::LargeString;
use arrow2_convert::serialize::*;
use arrow2_convert::ArrowField;

Expand Down Expand Up @@ -63,7 +62,7 @@ fn test_deserialize_large_types_schema_mismatch_error() {
}
#[derive(Debug, Clone, PartialEq, ArrowField)]
struct S2 {
#[arrow_field(override = "LargeString")]
#[arrow_field(type = "arrow2_convert::field::LargeString")]
a: String,
}

Expand Down
130 changes: 130 additions & 0 deletions arrow2_convert/tests/test_enum.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
use arrow2::array::*;
use arrow2_convert::{deserialize::TryIntoCollection, serialize::TryIntoArrow, ArrowField};

#[test]
fn test_dense_enum_unit_variant() {
#[derive(Debug, PartialEq, ArrowField)]
#[arrow_field(type = "dense")]
enum TestEnum {
VAL1,
VAL2,
VAL3,
VAL4,
}

let enums = vec![
TestEnum::VAL1,
TestEnum::VAL2,
TestEnum::VAL3,
TestEnum::VAL4,
];
let b: Box<dyn Array> = enums.try_into_arrow().unwrap();
let round_trip: Vec<TestEnum> = b.try_into_collection().unwrap();
assert_eq!(round_trip, enums);
}

#[test]
fn test_sparse_enum_unit_variant() {
#[derive(Debug, PartialEq, ArrowField)]
#[arrow_field(type = "sparse")]
enum TestEnum {
VAL1,
VAL2,
VAL3,
VAL4,
}

let enums = vec![
TestEnum::VAL1,
TestEnum::VAL2,
TestEnum::VAL3,
TestEnum::VAL4,
];
let b: Box<dyn Array> = enums.try_into_arrow().unwrap();
let round_trip: Vec<TestEnum> = b.try_into_collection().unwrap();
assert_eq!(round_trip, enums);
}

#[test]
fn test_nested_unit_variant() {
#[derive(Debug, PartialEq, ArrowField)]
struct TestStruct {
a1: i64,
}

#[derive(Debug, PartialEq, ArrowField)]
#[arrow_field(type = "dense")]
enum TestEnum {
VAL1,
VAL2(i32),
VAL3(f64),
VAL4(TestStruct),
VAL5(ChildEnum),
}

#[derive(Debug, PartialEq, ArrowField)]
#[arrow_field(type = "sparse")]
enum ChildEnum {
VAL1,
VAL2(i32),
VAL3(f64),
VAL4(TestStruct),
}

let enums = vec![
TestEnum::VAL1,
TestEnum::VAL2(2),
TestEnum::VAL3(1.2),
TestEnum::VAL4(TestStruct { a1: 10 }),
];

let b: Box<dyn Array> = enums.try_into_arrow().unwrap();
let round_trip: Vec<TestEnum> = b.try_into_collection().unwrap();
assert_eq!(round_trip, enums);
}

// TODO: reenable this test once slices for enums is fixed.
//#[test]
#[allow(unused)]
fn test_slice() {
#[derive(Debug, PartialEq, ArrowField)]
struct TestStruct {
a1: i64,
}

#[derive(Debug, PartialEq, ArrowField)]
#[arrow_field(type = "dense")]
enum TestEnum {
VAL1,
VAL2(i32),
VAL3(f64),
VAL4(TestStruct),
VAL5(ChildEnum),
}

#[derive(Debug, PartialEq, ArrowField)]
#[arrow_field(type = "sparse")]
enum ChildEnum {
VAL1,
VAL2(i32),
VAL3(f64),
VAL4(TestStruct),
}

let enums = vec![
TestEnum::VAL4(TestStruct { a1: 11 }),
TestEnum::VAL1,
TestEnum::VAL2(2),
TestEnum::VAL3(1.2),
TestEnum::VAL4(TestStruct { a1: 10 }),
];

let b: Box<dyn Array> = enums.try_into_arrow().unwrap();

for i in 0..enums.len() {
let arrow_slice = b.slice(i, enums.len() - i);
let original_slice = &enums[i..enums.len()];
let round_trip: Vec<TestEnum> = arrow_slice.try_into_collection().unwrap();
assert_eq!(round_trip, original_slice);
}
}
15 changes: 6 additions & 9 deletions arrow2_convert/tests/test_schema.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,4 @@
use arrow2::datatypes::*;
use arrow2_convert::field::{
FixedSizeBinary, FixedSizeVec, LargeBinary, LargeString, LargeVec, I128,
};
use arrow2_convert::ArrowField;

#[test]
Expand All @@ -21,7 +18,7 @@ fn test_schema_types() {
// timestamp(ns, None)
a6: Option<chrono::NaiveDateTime>,
// i128(precision, scale)
#[arrow_field(override = "I128<32, 32>")]
#[arrow_field(type = "arrow2_convert::field::I128<32, 32>")]
a7: i128,
// array of date times
date_time_list: Vec<chrono::NaiveDateTime>,
Expand All @@ -40,19 +37,19 @@ fn test_schema_types() {
// int 32 array
int32_array: Vec<i32>,
// large binary
#[arrow_field(override = "LargeBinary")]
#[arrow_field(type = "arrow2_convert::field::LargeBinary")]
large_binary: Vec<u8>,
// fixed size binary
#[arrow_field(override = "FixedSizeBinary<3>")]
#[arrow_field(type = "arrow2_convert::field::FixedSizeBinary<3>")]
fixed_size_binary: Vec<u8>,
// large string
#[arrow_field(override = "LargeString")]
#[arrow_field(type = "arrow2_convert::field::LargeString")]
large_string: String,
// large vec
#[arrow_field(override = "LargeVec<i64>")]
#[arrow_field(type = "arrow2_convert::field::LargeVec<i64>")]
large_vec: Vec<i64>,
// fixed size vec
#[arrow_field(override = "FixedSizeVec<i64, 3>")]
#[arrow_field(type = "arrow2_convert::field::FixedSizeVec<i64, 3>")]
fixed_size_vec: Vec<i64>,
}

Expand Down
Loading

0 comments on commit f4c3e76

Please sign in to comment.