Goykhman/numbarrow
numbarrow

Numba adapters for PyArrow and PySpark.

numbarrow lets you work with Apache Arrow arrays directly inside Numba @njit-compiled functions. It converts PyArrow arrays into NumPy views (zero-copy where possible) and extracts validity bitmaps for null handling, bridging PySpark's Arrow-based batch processing with high-performance JIT-compiled code.
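Arrow stores validity as a packed bitmap, one bit per element, least-significant bit first. As a plain-NumPy sketch of that convention (not numbarrow's own code), here is the check for the array `[10, None, 30, 40]`:

```python
import numpy as np

# Validity bitmap for [10, None, 30, 40]: bit i (LSB-first) is 1 when
# element i is valid. Element 1 is null, so the bits are 1,0,1,1 -> 0b1101.
bitmap = np.array([0b00001101], dtype=np.uint8)

def is_valid(i, bitmap):
    # Arrow convention: byte i // 8, bit i % 8, least-significant bit first
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

print([is_valid(i, bitmap) for i in range(4)])  # [True, False, True, True]
```

numbarrow's `is_null` helper (used in Quick Start below) tests the same bitmap, inverted.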

Installation

pip install numbarrow

Optional dependencies for PySpark and pandas support:

pip install numbarrow[test]       # adds pyspark
pip install numbarrow[mapinarrow] # adds pandas

Quick Start

import pyarrow as pa
from numba import njit
from numbarrow.core.adapters import arrow_array_adapter
from numbarrow.core.is_null import is_null

# Convert a PyArrow array to NumPy for use in @njit
arrow_array = pa.array([10, None, 30, 40], type=pa.int32())
bitmap, data = arrow_array_adapter(arrow_array)

@njit
def sum_non_null(data, bitmap):
    total = 0
    for i in range(len(data)):
        if bitmap is None or not is_null(i, bitmap):
            total += data[i]
    return total

result = sum_non_null(data, bitmap)  # 80

Supported Types

| PyArrow type                        | NumPy result                                  | Copy?               |
|-------------------------------------|-----------------------------------------------|---------------------|
| Int32Array, Int64Array, DoubleArray | matching dtype                                | No (view)           |
| BooleanArray                        | bool_                                         | Yes (bit-unpacking) |
| Date32Array                         | datetime64[D]                                 | Yes (int32 → int64) |
| Date64Array                         | datetime64[ms]                                | No (view)           |
| TimestampArray                      | datetime64[unit]                              | No (view)           |
| StringArray                         | fixed-width Unicode (bitmap not returned)     | Yes (repacking)     |
| StructArray                         | tuple of two dicts: (bitmaps, data) per field | Per-field           |
| ListArray (of structs)              | tuple of two dicts: (bitmaps, data) per field | Per-field           |
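The "bit-unpacking" copy for BooleanArray refers to Arrow's packed bit representation. As an illustration in plain NumPy (not numbarrow internals), `np.unpackbits` with `bitorder="little"` performs exactly this kind of unpacking:

```python
import numpy as np

# Arrow packs booleans 8 per byte, LSB-first:
# [True, False, True, True] becomes the single byte 0b00001101.
packed = np.array([0b00001101], dtype=np.uint8)

# count=4 discards the padding bits beyond the logical array length
unpacked = np.unpackbits(packed, count=4, bitorder="little").astype(np.bool_)
print(unpacked)  # [ True False  True  True]
```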

PySpark Integration

Use make_mapinarrow_func to create functions compatible with PySpark's mapInArrow:

from numbarrow.core.mapinarrow_factory import make_mapinarrow_func

def compute(data_dict, bitmap_dict, broadcasts):
    # data_dict: {col_name: np.ndarray}
    # bitmap_dict: {col_name: uint8 bitmap array}
    result = data_dict["value"] * broadcasts["scale"]
    return {"output": result}

udf = make_mapinarrow_func(compute, broadcasts={"scale": 2.0})
df_out = df_in.mapInArrow(udf, output_schema)

See test/demo_map_in_arrow.py for a complete runnable example.

Compatibility

| Dependency | Versions            |
|------------|---------------------|
| Python     | 3.10 – 3.12         |
| numba      | 0.60 – 0.63         |
| pyarrow    | 14 – 18             |
| pyspark    | 3.3 – 3.x (optional)|
| pandas     | 1.5+ (optional)     |

Documentation

Full API documentation: numbarrow docs

License

See LICENSE.
