dataframe-benchmarking

Benchmarking a few Python dataframe frameworks.

This repo was mostly inspired by the claim that FireDucks is a faster drop-in replacement for Pandas, so I wrote some fairly simple test cases to check that claim. For more complex or long-running tasks, I usually resort to Polars, which requires meaningfully different syntax and usage but is known for being generally faster.

The goal was a realistic comparison between those frameworks: a use case that is light in the number of manipulations (a heavier pipeline could favor lazily evaluated frameworks) but that runs on a realistic amount of data. I think a developer should always build something like this in-house before proposing a tool migration. You are looking for the best tool for your problem, not the best tool for a problem you've never faced yourself.

This README does not say which tool performed best: the repo is designed to run against the latest versions of the libraries on the latest Python version each time, so conclusions are expected to change over time.

Usage

With uv installed, run the command bash run.sh.

A GitHub Action runs the benchmarks and publishes artifacts for analysis, including pyinstrument reports for a more detailed breakdown.

Other notes:

I suspected that __pycache__ or some just-in-time (JIT) compilation artifact could affect performance. Removing the __pycache__ folder could help rule out the former, but other than repeating operations, I don't see a proper way of testing the latter.
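As a rough illustration of the "repeating operations" idea, the sketch below times the same callable several times and separates the first (cold) run from the later (warm) ones. It is not the code used in this repo: time_repeats, summarize and workload are hypothetical names, and the workload is a stand-in for a real dataframe operation.

```python
import statistics
import time


def time_repeats(func, repeats=5):
    """Call func `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Separate the first run from the median of the remaining runs (needs >= 2 runs)."""
    first, rest = durations[0], durations[1:]
    return {"first": first, "median_rest": statistics.median(rest)}


def workload():
    # Hypothetical stand-in for a dataframe operation under test.
    return sum(i * i for i in range(100_000))


durations = time_repeats(workload, repeats=5)
summary = summarize(durations)
print(summary)
```

If the first run is consistently much slower than the median of the rest, some warm-up effect (caching, lazy imports, compilation) is probably present, although this alone cannot tell which one.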

Getting exactly the same results (sometimes without even allowing for numerical discrepancies) has been important to me in the past. Hence I wanted to test what it takes to migrate without any change in behavior. This introduced a few quirks in the code, but it shows that, for now, a few behavioral differences can surface between FireDucks and Pandas. From what I noted, they are minor.
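To make the difference between "exactly the same" and "the same up to numerical discrepancies" concrete, here is a minimal sketch using pandas.testing.assert_frame_equal. It is only an illustration, not the comparison script from this repo: compare_frames is a hypothetical helper, and the tolerance is an arbitrary choice.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def compare_frames(left, right, rtol=1e-9):
    """Classify two frames as 'exact', 'close' (within rtol), or 'different'."""
    try:
        # Strict check: values must match bit for bit.
        assert_frame_equal(left, right, check_exact=True)
        return "exact"
    except AssertionError:
        pass
    try:
        # Relaxed check: allow a small relative error, no absolute slack.
        assert_frame_equal(left, right, check_exact=False, rtol=rtol, atol=0.0)
        return "close"
    except AssertionError:
        return "different"


a = pd.DataFrame({"x": [0.1 + 0.2, 1.0]})  # 0.1 + 0.2 is not exactly 0.3 in floating point
b = pd.DataFrame({"x": [0.3, 1.0]})
c = pd.DataFrame({"x": [0.4, 1.0]})

print(compare_frames(a, a), compare_frames(a, b), compare_frames(a, c))
```

A migration that only needs the "close" level is much easier to validate than one that needs "exact", which is why the distinction is worth deciding up front.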

Because the comparison with Polars is not so straightforward, results are sometimes converted to Pandas to avoid leaving gaps in the comparisons. On a naive reading this makes the comparison unfair to Polars, but given the current results (not reported here), it makes little practical difference.

Contributing

Pull requests and issues are welcome. Note that the scripts are designed to crash on error, so that problems get fixed rather than silently skewing the statistical comparisons.
