- Data Types and In-Memory Data Model: https://arrow.apache.org/docs/python/data.html


In [30]:
import pyarrow as pa
import pandas as pd

## Type Metadata

Apache Arrow defines language-agnostic, column-oriented data structures for array data. These include:

- Fixed-length primitive types: ``numbers``, ``booleans``, ``date and times``, ``fixed size binary``, ``decimals``, and other values that fit into a fixed number of bits
- Variable-length primitive types: ``binary``, ``string``
- Nested types: ``list``, ``map``, ``struct``, and ``union``
- Dictionary type: An encoded categorical type (more on this later)


In [31]:
t1 = pa.int32()
t1

DataType(int32)

In [32]:
t2 = pa.string()
t2

DataType(string)

In [33]:
t3 = pa.binary()
t3

DataType(binary)

In [34]:
t4 = pa.binary(10)
t4

FixedSizeBinaryType(fixed_size_binary[10])

In [35]:
# ref: https://arrow.apache.org/docs/python/generated/pyarrow.timestamp.html
t5 = pa.timestamp("ms")  # unit (precision): one of 's' (second), 'ms' (millisecond), 'us' (microsecond), or 'ns' (nanosecond)
t5

TimestampType(timestamp[ms])

In [36]:
# The Field type is a type plus a name and optional user-defined metadata:
f0 = pa.field('int32_field', t1)
f0

pyarrow.Field<int32_field: int32>

In [37]:
f0.name

'int32_field'

In [38]:
f0.type

DataType(int32)

In [39]:
# Arrow supports nested value types like list, map, struct, and union. 
# When creating these, you must pass types or fields to indicate 
# the data types of the types’ children. 
# For example, we can define a list of int32 values with:
t6 = pa.list_(t1)
t6

ListType(list<item: int32>)

In [40]:
fields = [
    pa.field("s0", t1),
    pa.field("s1", t2),
    pa.field("s2", t4),
    pa.field("s3", t6),
]
t7 = pa.struct(fields)
t7

StructType(struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>)

In [41]:
# For convenience, you can pass (name, type) tuples directly instead of Field instances:
t8 = pa.struct([('s0', t1), ('s1', t2), ('s2', t4), ('s3', t6)])
t8

StructType(struct<s0: int32, s1: string, s2: fixed_size_binary[10], s3: list<item: int32>>)

In [42]:
# Types can be compared for equality:
t8 == t7

True

## Schemas

The ``Schema`` type is similar to the ``struct`` array type; it defines the column names and types in a record batch or table data structure. The ``pyarrow.schema()`` factory function makes new Schema objects in Python:

In [43]:
my_schema = pa.schema([
    ("field0", t1),
    ("field1", t2),
    ("field2", t4),
    ("field3", t6),
])
my_schema

field0: int32
field1: string
field2: fixed_size_binary[10]
field3: list<item: int32>
  child 0, item: int32

## Arrays

For each data type, there is an accompanying array data structure that holds the memory buffers defining a single contiguous chunk of columnar array data. When you are using PyArrow, this data may come from IPC tools, though it can also be created from various types of Python sequences (lists, NumPy arrays, pandas data).

The cells below walk through Arrays in pyarrow.

In [44]:
# A simple way to create arrays is with pyarrow.array, 
# which is similar to the numpy.array function. 
# By default PyArrow will infer the data type for you:
arr = pa.array([1, 2, None, 3])
arr

<pyarrow.lib.Int64Array object at 0x7fdc5e7dec48>
[
  1,
  2,
  null,
  3
]

In [45]:
# But you may also pass a specific data type to override type inference:

In [46]:
pa.array([1, 2], type=pa.uint16())

<pyarrow.lib.UInt16Array object at 0x7fdc5e7de648>
[
  1,
  2
]

In [47]:
# The array’s type attribute is the corresponding piece of type metadata:
arr.type

DataType(int64)

In [48]:
# Each in-memory array has a known length and null count (which will be 0 if there are no null values):
len(arr)

4

In [49]:
arr.null_count

1

In [50]:
# Scalar values can be selected with normal indexing. 
# pyarrow.array converts None values to Arrow nulls; 
# indexing a null returns a typed scalar whose value is None:
arr[0]

<pyarrow.Int64Scalar: 1>

In [51]:
arr[2]

<pyarrow.Int64Scalar: None>

### List Array

A List Array is an Array whose elements are themselves Arrays.

In [52]:
nested_arr = pa.array([[], None, [1, 2], [None, 1]])
nested_arr.type

ListType(list<item: int64>)

### Struct Array

A Struct Array is an Array whose elements are Structs.

In [53]:
pa.array([{"x": 1, "y": True}, {"z": 3.4, "x": 4}])

<pyarrow.lib.StructArray object at 0x7fdc5e7dea68>
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    4
  ]
-- child 1 type: bool
  [
    true,
    null
  ]
-- child 2 type: double
  [
    null,
    3.4
  ]

In [54]:
ty = pa.struct([
    ("x", pa.int8()),
    ("y", pa.bool_()),
])
pa.array([{"x": 1, "y": True}, {"x": 2, "y": False}], type=ty)

<pyarrow.lib.StructArray object at 0x7fdc5e7deac8>
-- is_valid: all not null
-- child 0 type: int8
  [
    1,
    2
  ]
-- child 1 type: bool
  [
    true,
    false
  ]

In [56]:
# If the type is specified, you can initialize from tuples of values alone; the keys are not needed
pa.array([(3, True), (4, False)], type=ty)

<pyarrow.lib.StructArray object at 0x7fdc5e7dee88>
-- is_valid: all not null
-- child 0 type: int8
  [
    3,
    4
  ]
-- child 1 type: bool
  [
    true,
    false
  ]

In [57]:
# When initializing a struct array, nulls are allowed both at the struct level and at the individual field level.
# If initializing from a sequence of Python dicts, a missing dict key is handled as a null value:
pa.array([{"x": 1}, None, {"y": None}], type=ty)

<pyarrow.lib.StructArray object at 0x7fdc5e7def48>
-- is_valid:
  [
    true,
    false,
    true
  ]
-- child 0 type: int8
  [
    1,
    0,
    null
  ]
-- child 1 type: bool
  [
    null,
    false,
    null
  ]

In [59]:
# You can also construct a struct array from existing arrays for each of the struct’s components. 
# In this case, data storage will be shared with the individual arrays, and no copy is involved:
xs = pa.array([5, 6, 7], type=pa.int16())
ys = pa.array([False, True, True])
arr = pa.StructArray.from_arrays((xs, ys), names=("x", "y"))
arr.type

StructType(struct<x: int16, y: bool>)

In [60]:
arr

<pyarrow.lib.StructArray object at 0x7fdc5e7deca8>
-- is_valid: all not null
-- child 0 type: int16
  [
    5,
    6,
    7
  ]
-- child 1 type: bool
  [
    false,
    true,
    true
  ]

## Record Batches

A Record Batch in Apache Arrow is a collection of equal-length array instances. It is essentially a miniature DataFrame: it contains several equal-length Arrays, and each Array is treated as a column.

Ref: https://arrow.apache.org/docs/python/data.html#record-batches

Let's consider a collection of arrays:

In [67]:
# Initialization method 1: no headers, just a list of data; each element of the list is an array that represents one column
data = [
    pa.array([1, 2, 3, 4]),
    pa.array(['foo', 'bar', 'baz', None]),
    pa.array([True, None, False, True])
]
data

[<pyarrow.lib.Int64Array object at 0x7fdc5e7de6a8>
 [
   1,
   2,
   3,
   4
 ],
 <pyarrow.lib.StringArray object at 0x7fdc5e7de408>
 [
   "foo",
   "bar",
   "baz",
   null
 ],
 <pyarrow.lib.BooleanArray object at 0x7fdc5e7de3a8>
 [
   true,
   null,
   false,
   true
 ]]

In [69]:
# Initialization method 2: provide the headers (column names) as well as the list of data
# A record batch can be created from this list of arrays using ``RecordBatch.from_arrays``:
batch = pa.RecordBatch.from_arrays(data, ["f0", "f1", "f2"])
batch

pyarrow.RecordBatch
f0: int64
f1: string
f2: bool

In [70]:
batch.num_columns

3

In [63]:
batch.num_rows

4

In [64]:
batch.schema

f0: int64
f1: string
f2: bool

In [65]:
batch[1]

<pyarrow.lib.StringArray object at 0x7fdc5e7de7c8>
[
  "foo",
  "bar",
  "baz",
  null
]

## Tables

The PyArrow Table type is not part of the Apache Arrow specification, but is rather a tool to help with wrangling multiple record batches and array pieces as a single logical dataset. As a relevant example, we may receive multiple small record batches in a socket stream, then need to concatenate them into contiguous memory for use in NumPy or pandas. The Table object makes this efficient without requiring additional memory copying.

In other words, ``pyarrow.Table`` is not one of the Apache Arrow standards; it is an abstraction that pyarrow provides to suit Python programming. Like ``pandas.DataFrame``, it represents a two-dimensional table.

Considering the record batch we created above, we can create a Table containing one or more copies of the batch using ``Table.from_batches``:



In [71]:
batches = [batch] * 5
table = pa.Table.from_batches(batches)
table

pyarrow.Table
f0: int64
f1: string
f2: bool
----
f0: [[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]
f1: [["foo","bar","baz",null],["foo","bar","baz",null],["foo","bar","baz",null],["foo","bar","baz",null],["foo","bar","baz",null]]
f2: [[true,null,false,true],[true,null,false,true],[true,null,false,true],[true,null,false,true],[true,null,false,true]]