# ``pandas`` Integration

Ref: 
    
- https://arrow.apache.org/docs/python/pandas.html

In [8]:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

## DataFrames

[EN]

The Arrow equivalent of a pandas DataFrame is a Table. Both consist of a set of named columns of equal length. pandas supports only flat columns, whereas a Table can also hold nested columns, so it can represent more data than a DataFrame and a full conversion is not always possible.

Conversion from a Table to a DataFrame is done by calling ``pyarrow.Table.to_pandas()``. The inverse is then achieved by using ``pyarrow.Table.from_pandas()``.

By default ``pyarrow`` tries to preserve and restore the ``.index`` data as accurately as possible. See the section below for more about this, and how to disable this logic.

[CN]

In [2]:
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Cathy"],
    }
)
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)

# Convert back to pandas
df_new = table.to_pandas()

Unnamed: 0,id,name
0,1,Alice
1,2,Bob
2,3,Cathy


In [3]:
table

pyarrow.Table
id: int64
name: string
----
id: [[1,2,3]]
name: [["Alice","Bob","Cathy"]]

In [4]:
df_new

Unnamed: 0,id,name
0,1,Alice
1,2,Bob
2,3,Cathy


In [5]:
# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)
schema

id: int64
name: string
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 467

## Handling pandas Indexes

[EN]

Methods like ``pyarrow.Table.from_pandas()`` have a ``preserve_index`` option that controls whether the data in the ``index`` member of the corresponding pandas object is stored. This data is tracked using schema-level metadata in the internal ``arrow::Schema`` object.

The default of ``preserve_index`` is ``None``, which behaves as follows:

- ``RangeIndex`` is stored as metadata-only, not requiring any extra storage.
- Other index types are stored as one or more physical data columns in the resulting ``Table``.

To not store the index at all, pass ``preserve_index=False``. Storing a ``RangeIndex`` as metadata can cause issues in some limited scenarios (such as storing multiple DataFrame objects in a Parquet file); to force all index data to be serialized as columns in the resulting table, pass ``preserve_index=True``.

[CN]
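A pandas ``DataFrame`` always has a row index, but Arrow is primarily a columnar format and in most cases has no need for one.

By default, converting a ``DataFrame`` to a ``Table`` uses ``preserve_index=None``, which behaves as follows:

- If the ``DataFrame`` uses a ``RangeIndex`` (the automatically generated index running from 0 to n-1, where n is the number of rows), the ``Table`` only records in its metadata that a ``RangeIndex`` is used. The range parameters are enough to rebuild it, so no index data has to be kept in memory.
- Any other pandas index type, such as a time series index, is stored as a real physical column in the ``Table``, and its data is copied.

When should you use ``preserve_index=True`` and when ``preserve_index=False``?

- If all your ``DataFrame`` objects use the automatically generated ``RangeIndex`` and you do not care about the index during processing, ``False`` is recommended. In some edge cases, such as writing multiple ``pandas.DataFrame`` objects into one Parquet file, the stored ``RangeIndex`` metadata can cause problems.
- If you want to serialize a ``DataFrame`` to Parquet and still be able to restore the index when reading it back, ``True`` is recommended.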

In [14]:
df = pd.DataFrame(
    {"value": np.arange(3)},
    index=pd.date_range("2000-01-01", periods=3, freq="1d")
)
df

Unnamed: 0,value
2000-01-01,0
2000-01-02,1
2000-01-03,2


In [21]:
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "df_preserve_index_false.parquet")
df_loaded = pq.read_table("df_preserve_index_false.parquet").to_pandas()
df_loaded

Unnamed: 0,value
0,0
1,1
2,2


In [22]:
table = pa.Table.from_pandas(df, preserve_index=True)
pq.write_table(table, "df_preserve_index_true.parquet")
df_loaded = pq.read_table("df_preserve_index_true.parquet").to_pandas()
df_loaded

Unnamed: 0,value
2000-01-01,0
2000-01-02,1
2000-01-03,2


In [23]:
table = pa.Table.from_pandas(df, preserve_index=None)
pq.write_table(table, "df_preserve_index_none.parquet")
df_loaded = pq.read_table("df_preserve_index_none.parquet").to_pandas()
df_loaded

Unnamed: 0,value
2000-01-01,0
2000-01-02,1
2000-01-03,2


## Zero Copy Series Conversions

[EN]

Zero copy conversions from Array or ChunkedArray to NumPy arrays or pandas Series are possible in certain narrow cases:

- The Arrow data is stored in an integer (signed or unsigned ``int8`` through ``int64``) or floating point type (``float16`` through ``float64``). This includes many numeric types as well as timestamps.
- The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).
- For ``ChunkedArray``, the data consists of a single chunk, i.e. ``arr.num_chunks == 1``. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.

In these scenarios, ``to_pandas`` or ``to_numpy`` will be zero copy. In all other scenarios, a copy will be required.

[CN]

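``.to_pandas()`` is zero copy in the following cases:

- The Arrow data is numeric: signed or unsigned integers (int8 / 16 / 32 / 64) or floats (float16 / 32 / 64). This covers most numeric types, including timestamps (which are stored as integers under the hood).
- The Arrow data has no null values. Arrow tracks nulls in a validity bitmap, a representation that pandas does not support.
- For a ``ChunkedArray``, the data consists of a single chunk (for a Table column, this is the case when the Table was built from a single RecordBatch). Otherwise a copy is always required.

When a Table has many columns, some that can be converted zero copy and some that cannot, converting it to a DataFrame copies only the columns that cannot, for example string columns. Note that by default ``to_pandas()`` may still copy same-typed columns while consolidating them into pandas blocks; passing ``split_blocks=True`` skips that step.

**In short: to avoid copying data, make sure the columns contain no null values, for example by filling in default values.**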

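Trivia:

- 2 ** 32 = 4,294,967,296
- 2 ** 64 = 18,446,744,073,709,551,616
- Counting from 1970-01-01 00:00:00 with 1-second precision, an unsigned 32-bit integer reaches 2106-02-07 06:28:16.
- Counting from 1970-01-01 00:00:00 with 1-nanosecond precision (1e9 ns = 1 s), an unsigned 64-bit integer reaches July 2554. Arrow timestamps are signed 64-bit integers, so nanosecond timestamps reach 2262-04-11, which is still plenty.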
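
The range arithmetic can be checked with the standard library alone. This sketch truncates the nanosecond counts to whole seconds and adds a signed 64-bit case for comparison.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
NS_PER_SECOND = 10**9  # 1e9 ns = 1 s

# Seconds since the epoch in an unsigned 32-bit counter.
limit_u32_seconds = EPOCH + timedelta(seconds=2**32)

# Nanoseconds since the epoch, truncated to whole seconds.
limit_u64_nanos = EPOCH + timedelta(seconds=2**64 // NS_PER_SECOND)
limit_i64_nanos = EPOCH + timedelta(seconds=2**63 // NS_PER_SECOND)

print(limit_u32_seconds)  # 2106-02-07 06:28:16
print(limit_u64_nanos)    # 2554-07-21 23:34:33
print(limit_i64_nanos)    # 2262-04-11 23:47:16
```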

In [31]:
table = pa.table(
    {
        "id": pa.array([1, 2, 3], type=pa.int16()),
        "name": pa.array(["alice", "bob", "cathy"]),
    }
)
table.to_pandas()

Unnamed: 0,id,name
0,1,alice
1,2,bob
2,3,cathy
