Draft
1 change: 1 addition & 0 deletions data.generated.js

13 changes: 8 additions & 5 deletions datafusion-partitioned/README.md
@@ -1,6 +1,9 @@
# DataFusion
# Apache DataFusion

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>
[Apache DataFusion] is an extensible query execution framework, written in Rust, that uses [Apache Arrow] as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>

[Apache DataFusion]: https://arrow.apache.org/datafusion/
[Apache Arrow]: https://arrow.apache.org/

We use Parquet data here: we create an external table for it and then execute the queries.

@@ -10,7 +13,7 @@ The benchmark should be completed in under an hour. On-demand pricing is $0.6 pe

1. Manually start an AWS EC2 instance
- `c6a.4xlarge`
- Ubuntu 22.04 or later
- Ubuntu 24.04 or later
- Root 500GB gp2 SSD
- no EBS optimized
- no instance store
@@ -20,16 +23,16 @@ The benchmark should be completed in under an hour. On-demand pricing is $0.6 pe
1. `vi benchmark.sh` and modify the following line to target the desired DataFusion version

```bash
git checkout 46.0.0
git checkout 51.0.0
```

1. `bash benchmark.sh`
1. `./save-result.sh c6a.4xlarge`
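
The manual version edit in the steps above could also be scripted. The following is a minimal sketch, assuming `benchmark.sh` contains a single `git checkout <major>.<minor>.<patch>` line as shown in this diff. The function name `set_datafusion_version` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not part of this repo):
# rewrite the "git checkout X.Y.Z" line of a benchmark script in place.
set_datafusion_version() {
  local version="$1"
  local script="${2:-benchmark.sh}"
  # -i.bak keeps a backup copy and works with both GNU and BSD sed.
  sed -i.bak -E "s/^git checkout [0-9]+\.[0-9]+\.[0-9]+\$/git checkout ${version}/" "$script"
}

# Example call (left commented out so that sourcing this file changes nothing):
# set_datafusion_version 51.0.0 benchmark.sh
```

The pattern only matches a line that consists of `git checkout` followed by a three-part version number, so other lines of the script are left unchanged.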

### Known Issues

1. Importing Parquet with `datafusion-cli` does not support a schema, so queries.sql needs some casts (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`).
2. Importing Parquet with `datafusion-cli` makes column names case-sensitive, so all column names in queries.sql are written as double-quoted identifiers (e.g. `EventTime` -> `"EventTime"`).
3. Comparing binary with UTF-8 and `GROUP BY` on binary columns do not work on macOS; queries that involve binary data fail with errors there (apache/arrow-datafusion#3050).
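
As an illustration of the first two issues, the sketch below writes a small query file that combines the cast and the double-quoted column name. The table name `hits` and the file name `example-query.sql` are assumptions made for this example; the real queries live in queries.sql.

```bash
#!/usr/bin/env bash
# Sketch only: "hits" and "example-query.sql" are assumed names.
# The quoted heredoc delimiter keeps the double quotes around "EventTime" intact.
cat > example-query.sql <<'EOF'
SELECT to_timestamp_seconds("EventTime") AS event_time
FROM hits
LIMIT 5;
EOF

# Once datafusion-cli is built and the external table exists, the file
# could be run with:
#   datafusion-cli -f example-query.sql
```

Without the double quotes, the identifier would be folded to lower case and would no longer match the case-sensitive column name from the Parquet file.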

## Generate full human readable results (for debugging)

6 changes: 3 additions & 3 deletions datafusion-partitioned/benchmark.sh
@@ -11,9 +11,9 @@ sudo apt-get update -y
sudo apt-get install -y gcc

echo "Install DataFusion main branch"
git clone https://github.com/apache/arrow-datafusion.git
cd arrow-datafusion/
git checkout 47.0.0
git clone https://github.com/apache/datafusion.git
cd datafusion/
git checkout 51.0.0
CARGO_PROFILE_RELEASE_LTO=true RUSTFLAGS="-C codegen-units=1" cargo build --release --package datafusion-cli --bin datafusion-cli
export PATH="`pwd`/target/release:$PATH"
cd ..
56 changes: 56 additions & 0 deletions datafusion-partitioned/results/47-c6a.4xlarge.json
@@ -0,0 +1,56 @@
{
"system": "DataFusion 47 (Parquet, partitioned)",
"date": "2025-11-24",
"machine": "47-c6a.4xlarge",
"cluster_size": 1,
"proprietary": "no",
"tuned": "no",
"tags": ["Rust","column-oriented","embedded","stateless"],
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.061,0.019,0.017],
[0.120,0.036,0.035],
[0.214,0.085,0.085],
[0.443,0.090,0.086],
[1.017,0.817,0.837],
[0.961,0.780,0.782],
[0.090,0.025,0.026],
[0.128,0.041,0.038],
[1.050,0.888,0.905],
[1.367,1.007,1.019],
[0.552,0.243,0.234],
[0.697,0.276,0.264],
[1.083,0.828,0.876],
[2.654,1.369,1.430],
[1.130,0.824,0.825],
[1.080,0.951,0.946],
[2.634,1.680,1.691],
[2.591,1.624,1.620],
[5.272,3.377,3.387],
[0.522,0.081,0.074],
[9.761,1.073,1.052],
[11.401,1.293,1.302],
[22.146,2.584,2.588],
[55.505,10.246,10.275],
[2.836,0.431,0.450],
[0.854,0.340,0.343],
[2.847,0.513,0.513],
[9.739,1.521,1.549],
[9.775,9.431,9.480],
[0.535,0.415,0.421],
[2.451,0.766,0.763],
[6.158,0.915,0.913],
[4.622,3.361,3.383],
[10.150,3.631,3.656],
[10.174,3.659,3.687],
[1.294,1.180,1.183],
[0.294,0.114,0.123],
[0.173,0.050,0.052],
[0.280,0.118,0.114],
[0.423,0.163,0.172],
[0.166,0.041,0.041],
[0.165,0.041,0.043],
[0.150,0.036,0.039]
]
}
56 changes: 56 additions & 0 deletions datafusion-partitioned/results/48-c6a.4xlarge.json
@@ -0,0 +1,56 @@
{
"system": "DataFusion 48.0.0 (Parquet, partitioned)",
"date": "2025-11-24",
"machine": "48-c6a.4xlarge",
"cluster_size": 1,
"proprietary": "no",
"tuned": "no",
"tags": ["Rust","column-oriented","embedded","stateless"],
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.070,0.016,0.018],
[0.124,0.029,0.030],
[0.199,0.069,0.070],
[0.453,0.088,0.083],
[1.168,0.725,0.739],
[0.977,0.777,0.776],
[0.090,0.022,0.021],
[0.123,0.030,0.031],
[1.023,0.905,0.901],
[1.388,0.999,0.988],
[0.560,0.240,0.233],
[0.680,0.263,0.274],
[1.084,0.861,0.877],
[2.688,1.217,1.339],
[1.142,0.834,0.822],
[0.995,0.858,0.857],
[2.688,1.675,1.700],
[2.586,1.613,1.624],
[5.197,3.328,3.352],
[0.360,0.079,0.078],
[9.973,1.075,1.025],
[11.396,1.302,1.279],
[22.070,2.500,2.535],
[55.536,10.283,10.124],
[2.835,0.447,0.435],
[0.865,0.353,0.331],
[2.847,0.517,0.518],
[9.706,1.472,1.535],
[9.666,9.526,9.477],
[0.574,0.426,0.432],
[2.491,0.759,0.723],
[6.162,0.924,0.907],
[4.649,3.361,3.393],
[10.168,3.640,3.652],
[10.098,3.657,3.672],
[1.360,1.160,1.193],
[0.295,0.109,0.105],
[0.172,0.049,0.049],
[0.286,0.096,0.113],
[0.430,0.159,0.162],
[0.182,0.045,0.040],
[0.171,0.038,0.043],
[0.154,0.034,0.037]
]
}
56 changes: 56 additions & 0 deletions datafusion-partitioned/results/49-c6a.4xlarge.json
@@ -0,0 +1,56 @@
{
"system": "DataFusion 49.0.0 (Parquet, partitioned)",
"date": "2025-11-24",
"machine": "49-c6a.4xlarge",
"cluster_size": 1,
"proprietary": "no",
"tuned": "no",
"tags": ["Rust","column-oriented","embedded","stateless"],
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.116,0.050,0.050],
[0.173,0.076,0.079],
[0.259,0.118,0.121],
[0.499,0.137,0.132],
[0.915,0.794,0.786],
[1.061,0.863,0.867],
[0.109,0.049,0.051],
[0.178,0.077,0.078],
[1.132,0.983,0.937],
[1.442,1.040,1.052],
[0.614,0.286,0.288],
[0.720,0.300,0.295],
[1.171,0.909,0.896],
[2.659,1.415,1.362],
[1.161,0.871,0.866],
[1.017,0.892,0.879],
[2.708,1.685,1.690],
[2.654,1.681,1.670],
[5.280,3.286,3.282],
[0.394,0.126,0.126],
[9.853,1.139,1.135],
[11.475,1.335,1.363],
[22.124,2.602,2.585],
[55.427,9.969,9.878],
[2.894,0.478,0.482],
[0.896,0.314,0.318],
[2.887,0.468,0.456],
[9.817,1.576,1.539],
[9.624,8.898,8.896],
[0.588,0.475,0.471],
[2.515,0.793,0.794],
[6.177,0.961,0.986],
[4.612,3.313,3.315],
[10.257,3.641,3.658],
[10.212,3.661,3.642],
[1.407,1.194,1.215],
[0.369,0.155,0.138],
[0.223,0.095,0.091],
[0.356,0.138,0.139],
[0.520,0.206,0.200],
[0.229,0.081,0.080],
[0.217,0.079,0.081],
[0.198,0.075,0.077]
]
}
56 changes: 56 additions & 0 deletions datafusion-partitioned/results/50-c6a.4xlarge.json
@@ -0,0 +1,56 @@
{
"system": "DataFusion 50.0.0 (Parquet, partitioned)",
"date": "2025-11-24",
"machine": "50-c6a.4xlarge",
"cluster_size": 1,
"proprietary": "no",
"tuned": "no",
"tags": ["Rust","column-oriented","embedded","stateless"],
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.106,0.027,0.027],
[0.167,0.048,0.049],
[0.275,0.089,0.088],
[0.493,0.107,0.103],
[0.972,0.755,0.778],
[1.039,0.850,0.816],
[0.099,0.027,0.027],
[0.166,0.051,0.050],
[1.094,0.944,0.880],
[1.406,1.006,0.946],
[0.599,0.230,0.232],
[0.700,0.258,0.246],
[1.115,0.843,0.847],
[2.840,1.365,1.349],
[1.170,0.817,0.840],
[1.010,0.873,0.850],
[2.713,1.633,1.648],
[2.626,1.632,1.651],
[5.101,3.230,3.270],
[0.370,0.106,0.098],
[9.907,1.099,1.101],
[11.448,1.386,1.380],
[22.140,2.564,2.553],
[52.780,9.008,8.951],
[0.395,0.151,0.155],
[0.936,0.276,0.275],
[0.399,0.152,0.158],
[9.806,1.541,1.559],
[9.669,8.939,9.169],
[0.556,0.414,0.419],
[2.517,0.760,0.758],
[6.189,0.942,0.920],
[4.573,3.275,3.246],
[10.247,3.616,3.662],
[10.224,3.584,3.625],
[1.348,1.244,1.186],
[0.335,0.115,0.129],
[0.222,0.068,0.071],
[0.325,0.135,0.132],
[0.498,0.193,0.172],
[0.209,0.057,0.057],
[0.205,0.059,0.056],
[0.183,0.052,0.052]
]
}
56 changes: 56 additions & 0 deletions datafusion-partitioned/results/51-c6a.4xlarge.json
@@ -0,0 +1,56 @@
{
"system": "DataFusion 51 (Parquet, partitioned)",
"date": "2025-11-24",
"machine": "51-c6a.4xlarge",
"cluster_size": 1,
"proprietary": "no",
"tuned": "no",
"tags": ["Rust","column-oriented","embedded","stateless"],
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.098,0.032,0.031],
[0.144,0.054,0.053],
[0.260,0.095,0.096],
[0.644,0.111,0.111],
[1.174,0.803,0.790],
[1.139,0.835,0.828],
[0.108,0.031,0.032],
[0.164,0.056,0.055],
[1.093,0.909,0.951],
[1.756,1.004,1.020],
[0.673,0.236,0.236],
[0.840,0.256,0.251],
[1.300,0.842,0.843],
[2.702,1.346,1.356],
[1.213,0.809,0.824],
[1.063,0.882,0.872],
[2.765,1.684,1.687],
[2.743,1.675,1.671],
[5.557,3.341,3.359],
[0.353,0.103,0.100],
[10.171,1.099,1.123],
[11.557,1.368,1.334],
[22.327,2.612,2.599],
[52.202,9.208,9.074],
[0.374,0.157,0.156],
[1.120,0.251,0.252],
[0.768,0.160,0.161],
[10.076,1.479,1.506],
[9.603,8.859,9.045],
[0.569,0.444,0.424],
[3.228,0.800,0.772],
[6.972,0.984,0.960],
[5.114,3.492,3.501],
[10.275,3.631,3.636],
[10.212,3.625,3.617],
[1.397,1.227,1.196],
[0.339,0.139,0.133],
[0.212,0.086,0.074],
[0.353,0.139,0.137],
[0.512,0.208,0.213],
[0.199,0.069,0.070],
[0.189,0.066,0.065],
[0.173,0.057,0.058]
]
}
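
To compare the result files added above across versions, one option is to reduce each file to a single number. The sketch below is not part of the PR. It assumes that every row of `result` holds three timings in seconds for one query, with the first run cold, and it takes the faster of the second and third runs as the hot time. Both the function name `hot_total` and that choice of metric are assumptions of this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical helper (assumption, not part of the PR): sum the faster of
# runs 2 and 3 over all rows of one result file.
hot_total() {
  awk '
    # Result rows look like "[0.061,0.019,0.017]," and start with "[" plus a digit.
    /^[ \t]*\[[0-9]/ {
      gsub(/[^0-9.,]/, "")        # drop brackets and whitespace
      split($0, t, ",")
      second = t[2] + 0
      third = t[3] + 0
      total += (second < third) ? second : third
    }
    END { printf "%.3f\n", total }
  ' "$1"
}

# Example call (commented out because the files may not be present):
# for v in 47 48 49 50 51; do hot_total "results/${v}-c6a.4xlarge.json"; done
```

The line filter relies on the one-row-per-line layout of the files in this diff; a JSON-aware tool would be a more robust choice if the layout changes.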