
Pyarrow ops

Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, implemented in numpy & Cython. For convenience, function naming and behavior try to replicate the Pandas API. Join / Groupby performance is slightly slower than that of pandas, especially on multi-column joins.

Current use cases:

  • Data operations like joins, groupby (aggregations), filters & drop_duplicates
  • (Very fast) reusable pre-processing for ML applications

Installation

Use the package manager pip to install pyarrow_ops.

pip install pyarrow_ops

Usage

See the test_*.py files for runnable examples.

Data operations:

import pyarrow as pa 
from pyarrow_ops import join, filters, groupby, head, drop_duplicates

# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()

# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')

# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)

# Group by aggregate functions
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})

# Filter using a list of predicate tuples: (column, operation, value)
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])

# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
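
For intuition, the inner-join semantics of `join` above can be sketched in plain Python as a hash join. The `inner_join` helper below is hypothetical and operates on rows-as-dicts; pyarrow_ops implements this in numpy/Cython over pa.Table columns instead.

```python
# Hypothetical pure-Python sketch of an inner hash join (not the actual
# pyarrow_ops implementation).
def inner_join(left, right, on):
    # Build a hash index on the join key(s) of the right table
    index = {}
    for i, row in enumerate(right):
        index.setdefault(tuple(row[k] for k in on), []).append(i)
    # Probe with each left row; emit one merged row per match
    out = []
    for row in left:
        for i in index.get(tuple(row[k] for k in on), []):
            out.append({**row, **right[i]})
    return out

left = [{'Animal': 'Falcon', 'Max Speed': 380.0},
        {'Animal': 'Parrot', 'Max Speed': 24.0}]
right = [{'Animal': 'Falcon', 'Age': 10}]
print(inner_join(left, right, on=['Animal']))
# -> [{'Animal': 'Falcon', 'Max Speed': 380.0, 'Age': 10}]
```

Rows of `left` with no matching key in `right` are dropped, which is what makes this an inner join.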

ML Preprocessing (note: personal tests showed a ~5x speed-up compared to pandas on large datasets):

import pyarrow as pa 
from pyarrow_ops import head, TableCleaner

# Training data
t1 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., None, 26., 24.],
    'Value': [2000, 1500, 10, 30, 20],
})

# Create TableCleaner & register columns to be processed
cleaner = TableCleaner()
cleaner.register_numeric('Max Speed', impute='min', clip=True)
cleaner.register_label('Animal', categories=['Goose', 'Falcon'])
cleaner.register_one_hot('Animal')

# Clean table and split into train/test
X, y = cleaner.clean_table(t1, label='Value')
X_train, X_test, y_train, y_test = cleaner.split(X, y)

# Train a model + Save cleaner settings
cleaner_dict = cleaner.to_dict()

# Prediction data
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Goose', 'Parrot', 'Parrot'],
    'Max Speed': [380., 10., None, 26.]
})
new_cleaner = TableCleaner().from_dict(cleaner_dict)
X_pred = new_cleaner.clean_table(t2)
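
The numeric cleaning step can be pictured with a small stdlib-only sketch. Note that the semantics below (impute missing values with the training minimum, clip to the training range) are an assumed reading of `impute='min', clip=True`; check the TableCleaner source for the exact behavior.

```python
# Hypothetical sketch of register_numeric('Max Speed', impute='min', clip=True);
# the real TableCleaner may differ in details.
def clean_numeric(train, new=None):
    observed = [v for v in train if v is not None]
    lo, hi = min(observed), max(observed)   # stats learned from training data
    out = []
    for v in (new if new is not None else train):
        v = lo if v is None else v          # impute missing with the training min
        out.append(min(max(v, lo), hi))     # clip into the training range
    return out

print(clean_numeric([380., 370., None, 26., 24.]))                      # fit + transform train
print(clean_numeric([380., 370., None, 26., 24.], [500., None, 10.]))   # transform new data
```

The key point is that the statistics are learned once from the training table and then re-applied to prediction data, which is what `to_dict()` / `from_dict()` make portable.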

To Do's

  • Improve groupby speed by not creating copies of the table
  • Add ML cleaning class
  • Improve speed of groupby by avoiding for loops
  • Improve join speed by moving code to C
  • Add unit tests using pytest
  • Add window functions on groupby
  • Add more join options (left, right, outer, full, cross)
  • Allow for functions to be classmethods of pa.Table* (t.groupby(...))

*One of the main difficulties is that the pyarrow classes are written in C and do not have a __dict__, which hinders inheritance and attaching methods to them.
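
A quick stdlib illustration of the problem, using `__slots__` to mimic a C-implemented type without a `__dict__` (the class here is a stand-in, not an actual pyarrow type):

```python
# Types without a __dict__ (like pyarrow's C-implemented classes) reject
# runtime attribute assignment, so t.groupby = ... cannot simply be patched in.
class TableLike:
    __slots__ = ('columns',)  # no __dict__ -> no dynamic attributes

t = TableLike()
try:
    t.groupby = lambda cols: None
except AttributeError as err:
    print('cannot patch:', err)
```

This is why pyarrow_ops exposes `groupby(t, ...)` as free functions instead of `t.groupby(...)` methods.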

Relation to pyarrow

In the future many of these functions might be obsolete by enhancements in the pyarrow package, but for now it is a convenient alternative to switching back and forth between pyarrow and pandas.

Contributing

Pull requests are very welcome. However, I believe in getting 80% of the utility from 20% of the code; I personally get lost reading the depths of the pandas source code. If you would like to seriously improve this work, please let me know!
