# Compute Functions

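Vectorized operations over a large column of same-typed values are far more efficient than processing the values one at a time in a Python for loop. pyarrow provides many high-level compute functions, including aggregation, arithmetic, and string operations. These functions call into the underlying C++ implementation, so they are much faster than plain Python functions or UDFs (user-defined functions). Whenever possible, prefer the functions in ``pyarrow.compute``. If you really do need a UDF, ``pyarrow`` does not offer a way to compile Python to C; you have to convert the ``pyarrow.Array`` to a ``numpy.array``, apply the UDF, and then convert the result back.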

- Compute Functions: https://arrow.apache.org/docs/python/compute.html
- Compute Functions API Reference: https://arrow.apache.org/docs/python/api/compute.html

In [1]:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

In [2]:
a = pa.array([1, 1, 2, 3])
b = pa.array([4, 1, 2, 8])
pc.equal(a, b)

<pyarrow.lib.BooleanArray object at 0x7fc65aa69888>
[
  false,
  true,
  true,
  false
]

In [3]:
x, y = pa.scalar(7.8), pa.scalar(9.3)
pc.multiply(x, y)

<pyarrow.DoubleScalar: 72.54>

In [4]:
t = pa.table({"x": [1, 2, 3], "y": [3, 2, 1]})
# Row indices that would sort the table by column "y" in ascending order
i = pc.sort_indices(t, sort_keys=[("y", "ascending")])
i

<pyarrow.lib.UInt64Array object at 0x7fc65aa69c48>
[
  2,
  1,
  0
]

## Associative Transforms

Ref:

- https://arrow.apache.org/docs/python/api/compute.html#associative-transforms

In [5]:
arr = pa.array(list("abbacdcdaacacab"))

In [6]:
pc.unique(arr)

<pyarrow.lib.StringArray object at 0x7fc65aa69f48>
[
  "a",
  "b",
  "c",
  "d"
]

In [7]:
pc.value_counts(arr)

<pyarrow.lib.StructArray object at 0x7fc6599df168>
-- is_valid: all not null
-- child 0 type: string
  [
    "a",
    "b",
    "c",
    "d"
  ]
-- child 1 type: int64
  [
    6,
    3,
    4,
    2
  ]

In [8]:
pc.dictionary_encode(arr)

<pyarrow.lib.DictionaryArray object at 0x7fc6fe172f98>

-- dictionary:
  [
    "a",
    "b",
    "c",
    "d"
  ]
-- indices:
  [
    0,
    1,
    1,
    0,
    2,
    3,
    2,
    3,
    0,
    0,
    2,
    0,
    2,
    0,
    1
  ]