# PyArrow Tutorial

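In the big-data world, columnar storage formats are the industry standard for storing large volumes of data while still supporting high-performance queries. Two formats are especially popular: [Apache ORC](https://orc.apache.org/) and [Apache Parquet](https://parquet.apache.org/), with Parquet being the more widely used of the two. [Apache Arrow](https://arrow.apache.org/docs/index.html) is an in-memory analytics platform that brings IO, transformation, and other operations on these popular formats together in a single project.

In the Python community, the mainstream library for data analysis is [Pandas](https://pandas.pydata.org/). **PyArrow** is the Python API for Apache Arrow. The same API can read and write the Parquet and ORC formats, and it also provides an interface for interoperating with pandas.

For small data, pandas is usually enough. For big data, learning pyarrow is well worth the effort.

---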

In [26]:
%pip install rich
%pip install smart_open
%pip install s3pathlib
%pip install pathlib_mate
%pip install pandas
%pip install pyarrow

Note: you may need to restart the kernel to use updated packages.


In [100]:
import random
import string
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from rich import print as rprint
from s3pathlib import S3Path
from pathlib_mate import Path

dir_here = Path.cwd()

In [101]:
def rnd_str(length=8):
    """
    Generate a random lowercase string of the given length.
    """
    return "".join([random.choice(string.ascii_lowercase) for _ in range(length)])

rnd_str(8)

'prwjekzy'

In [102]:
# Create a pandas DataFrame
n_rows = 1000
df = pd.DataFrame()
df["id"] = range(1, n_rows+1)
df["name"] = [rnd_str(8) for _ in range(n_rows)]
df.head(3)

,id,name
0,1,awxfmamm
1,2,wwozwymi
2,3,jniehyby


In [103]:
# Convert the pandas.DataFrame to a pyarrow.Table
# Ref: https://arrow.apache.org/docs/python/pandas.html
table = pa.Table.from_pandas(df)

In [104]:
# method1. use Table.schema
print("--- method 1 ---")
print(table.schema)

# method2. read schema from pandas DataFrame
print("--- method 2 ---")
schema = pa.Schema.from_pandas(df)
print(schema)

--- method 1 ---
id: int64
name: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 470
--- method 2 ---
id: int64
name: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 470


In [105]:
# Write the pyarrow.Table to disk
# Ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
pq.write_table(table, Path(dir_here, "users.parquet").abspath)

In [106]:
# Read the pyarrow.Table back from disk
# Ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
table = pq.read_table(Path(dir_here, "users.parquet").abspath)
df = table.to_pandas()
df.head(3)

,id,name
0,1,awxfmamm
1,2,wwozwymi
2,3,jniehyby


In [107]:
# Write the pyarrow.Table to disk, this time through the file object API
# Ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
p = Path(dir_here, "users.parquet")
with p.open("wb") as f:
    pq.write_table(table, f)

In [108]:
# Read the pyarrow.Table back from disk, this time through the file object API
# Ref: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
with p.open("rb") as f:
    table = pq.read_table(f)
df = table.to_pandas()
df.head(3)

,id,name
0,1,awxfmamm
1,2,wwozwymi
2,3,jniehyby


In [109]:
n_rows = 1000000
event_type_list = ["event_type_1", "event_type_2", "event_type_3"]
df = pd.DataFrame()
df["event_id"] = range(1, n_rows+1)
df["event_type"] = np.random.choice(event_type_list, n_rows)
df["time"] = pd.date_range("2000-01-01 00:00:00", "2000-01-31 23:59:59", periods=n_rows)
df["time"] = df["time"].astype("datetime64[ms]")
df["value"] = np.random.randint(1, 100, n_rows)
df.head(3)

,event_id,event_type,time,value
0,1,event_type_3,2000-01-01 00:00:00.000,86
1,2,event_type_1,2000-01-01 00:00:02.678,18
2,3,event_type_1,2000-01-01 00:00:05.356,63


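## Dictionary Encoding

``Dictionary encoding`` replaces the values of a column that has few distinct values with integer codes. For example, a column with no more than 256 distinct values can be encoded with an 8-bit integer. Since the original values are often strings, this can save a great deal of storage space.

The ``use_dictionary`` parameter controls whether dictionary encoding is applied automatically or only to the specified columns. ``pyarrow`` uses dictionary encoding by default; turning it off makes the resulting file somewhat larger. One caveat: a dictionary-encoded column can only be used for equality and inequality comparisons, and range queries cannot be optimized with that column's min/max information.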

Ref:

- [pyarrow.parquet.write_table](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html)

In [110]:
p = Path(dir_here, "events1.parquet")
with p.open("wb") as f:
    pq.write_table(
        pa.Table.from_pandas(df), 
        f,
        use_dictionary=True, # default is True 
    )
print(f"{p.basename} file size = {p.size_in_text}")

p = Path(dir_here, "events2.parquet")
with p.open("wb") as f:
    pq.write_table(
        pa.Table.from_pandas(df), 
        f,
        use_dictionary=False,
    )
print(f"{p.basename} file size = {p.size_in_text}")

events1.parquet file size = 11.31 MB
events2.parquet file size = 13.63 MB


## Row Group

``row_group_size`` sets the maximum number of rows written to each row group.

### More row groups take more disk space, but make scans faster

In [119]:
print(f"Total rows: {df.shape[0]}")
# More row groups take more disk space, but make it more likely that a scan can skip
# irrelevant data blocks. The number of row groups should strike a balance between the two.
p = Path(dir_here, "events1.parquet")
row_group_size = 1000
n_row_groups = int(df.shape[0] / row_group_size)
with p.open("wb") as f:
    pq.write_table(
        pa.Table.from_pandas(df),
        f,
        row_group_size=row_group_size,
    )
print(f"{n_row_groups} row group(s) in total, {p.basename} file size = {p.size_in_text}")

p = Path(dir_here, "events2.parquet")
row_group_size = 1000000
n_row_groups = int(df.shape[0] / row_group_size)
with p.open("wb") as f:
    pq.write_table(
        pa.Table.from_pandas(df),
        f,
        row_group_size=row_group_size,
    )
print(f"{n_row_groups} row group(s) in total, {p.basename} file size = {p.size_in_text}")

Total rows: 1000000
1000 row group(s) in total, events1.parquet file size = 14.75 MB
1 row group(s) in total, events2.parquet file size = 11.31 MB


In [121]:
import os
print(os.cpu_count())

2


In [112]:
p = Path(dir_here, "events.parquet")
with p.open("wb") as f:
    pq.write_table(
        pa.Table.from_pandas(df), 
        f,
        row_group_size=1000,
    )

# columns: ["event_id", "event_type", "time", "value"]
pq_file = pq.ParquetFile(p.abspath)
for ith_row_group in range(pq_file.metadata.num_row_groups):
    rg_meta = pq_file.metadata.row_group(ith_row_group)
    print(rg_meta.column(2).statistics)
    rg = pq_file.read_row_group(ith_row_group)
    print(rg)
    print(rg_meta.column(0))
    print(rg_meta.column(0).statistics.min)
    break

<pyarrow._parquet.Statistics object at 0x7f842a4685e8>
  has_min_max: True
  min: 2000-01-01 00:00:00
  max: 2000-01-01 00:44:35.723000
  null_count: 0
  distinct_count: 0
  num_values: 1000
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE
pyarrow.Table
event_id: int64
event_type: string
time: timestamp[us]
value: int64
----
event_id: [[1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000]]
event_type: [["event_type_3","event_type_1","event_type_1","event_type_2","event_type_1","event_type_2","event_type_1","event_type_2","event_type_2","event_type_2",...,"event_type_3","event_type_3","event_type_1","event_type_2","event_type_3","event_type_2","event_type_2","event_type_3","event_type_1","event_type_3"]]
time: [[2000-01-01 00:00:00.000000,2000-01-01 00:00:02.678000,2000-01-01 00:00:05.356000,2000-01-01 00:00:08.035000,2000-01-01 00:00: