RianKoja/dataframe-benchmarking

dataframe-benchmarking

Benchmarking a few dataframe Python frameworks.

This repo was mostly inspired by the claim that FireDucks is a faster drop-in replacement for Pandas, so I wrote some not-too-complex test cases based on that claim. For more complex or long-running tasks, I usually resort to Polars, which requires meaningfully different syntax and usage but is known for being generally faster.

The goal was to build tests that allow a realistic comparison between those frameworks: a use case that is not so heavy in the number of manipulations (a long chain of manipulations could favor lazy-evaluated frameworks) but that works on a realistic amount of data. I think a developer should always build something like this in house before proposing a tool migration. You are looking for the best tool for your problem, not the best tool for a problem you've never faced yourself.

This readme does not mention which tool performed best, because the repo is designed to run against the latest version of each library on the latest Python version every time, so conclusions are expected to change over time.

Usage

With uv installed, run the command bash run.sh.

A GitHub Action executes the benchmarks and provides artifacts, including pyinstrument reports for detailed analysis.

Other notes:

I suspected that __pycache__ or some Just-in-Time (JIT) compilation artifact could have an impact on performance. Removing the __pycache__ folder could help with the former, but other than repeating operations, I don't see a proper way of testing the latter.
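
As a rough illustration of the "repeating operations" idea, the sketch below times the same callable several times and reports the first run separately from the later ones, so that any warm-up cost shows up as a gap between the two. It is not code from this repo: the helper names `time_repeated` and `summarize`, and the stand-in `workload`, are hypothetical, and only the standard library is used.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call func `repeats` times and return each wall-clock duration in seconds."""
    if repeats < 2:
        raise ValueError("need at least two repeats to compare first vs. later runs")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Separate the first (possibly cold) run from the remaining (warm) runs."""
    first, rest = durations[0], durations[1:]
    return {"first_run": first, "warm_median": statistics.median(rest)}


def workload():
    # Hypothetical stand-in for a dataframe operation under test.
    return sum(i * i for i in range(200_000))


durations = time_repeated(workload, repeats=5)
summary = summarize(durations)
print(summary)
```

A first run that is consistently much slower than the warm median hints at caching or compilation effects, although this alone cannot tell which mechanism is responsible.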

Getting exactly the same results (sometimes without even allowing for numerical discrepancies) is something that was important for me in the past. Hence I wanted to test what is needed to migrate without any change in behavior. This introduced a few quirks in the code, but it showcases that, for now, a few behavioral differences can surface between FireDucks and Pandas. From what I noted, those are minor.
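
To make the distinction between exact and tolerant equality concrete, here is a minimal sketch using `pandas.testing.assert_frame_equal`. It is not the check used in this repo: the helper `frames_match` and the sample frames are hypothetical, and the tolerance value is an arbitrary assumption.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right, exact=True, rtol=1e-9):
    """Return True if both frames are equal, bit-for-bit or within tolerance."""
    try:
        # check_exact=True compares floats exactly; otherwise rtol applies.
        assert_frame_equal(left, right, check_exact=exact, rtol=rtol)
    except AssertionError:
        return False
    return True


# Hypothetical outputs of the same computation from two frameworks.
reference = pd.DataFrame({"key": ["a", "b"], "value": [0.1 + 0.2, 1.0]})
candidate = pd.DataFrame({"key": ["a", "b"], "value": [0.3, 1.0]})

exact_match = frames_match(reference, candidate, exact=True)
tolerant_match = frames_match(reference, candidate, exact=False)
print(exact_match, tolerant_match)
```

The two frames differ only by floating-point rounding in one cell, so the exact comparison rejects them while the tolerant one accepts them. Which of the two is the right bar depends on how strict the migration needs to be.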

Because the comparison with Polars is not so straightforward, conversion to Pandas is sometimes used to avoid leaving gaps in the comparisons. On a naive interpretation this makes the comparison unfair to Polars, but given the current (unmentioned) results, it makes little difference for any practical purpose.

Contributing

Pull requests and issues are welcome. Note that the scripts are designed to crash on error, so that problems get fixed instead of silently skewing the statistical comparisons.
