diff --git a/README.md b/README.md
index 31adc2d2..6ef12dca 100644
--- a/README.md
+++ b/README.md
@@ -96,6 +96,9 @@ This part of the readme is also covered by autotests, so these code are always u
In any unclear situation, look into it first.
+
+  CLICK HERE to see the most complete description of ALL features!
+
```yml
# It's a complete example of the CSV schema file in YAML format.
@@ -649,6 +652,7 @@ columns:
```
+
### Extra checks

@@ -907,14 +911,13 @@ Optional format `text` with highlited keywords:
Of course, you'll want to know how fast it works. The thing is, it depends very-very-very much on the following factors:

* **The file size** - Width and height of the CSV file. The larger the dataset, the longer it will take to go through
-  it.
-  The dependence is linear and strongly depends on the speed of your hardware (CPU, SSD).
+  it. The dependence is linear and strongly depends on the speed of your hardware (CPU, SSD).
* **Number of rules used** - Obviously, the more of them there are for one column, the more iterations you will have to
-  make.
-  Also remember that they do not depend on each other.
+  make. Also remember that they do not depend on each other. That is, executing one rule will not optimize or slow down
+  another rule in any way; their time and memory costs simply add up.
* Some validation rules are very time or memory intensive. For the most part you won't notice this, but there are some
  that are dramatically slow. For example, `interquartile_mean` processes about 4k lines per second, while the rest of
-  the rules are about 0.3-1 million lines per second.
+  the rules are about 30+ million lines per second.

However, to get a rough picture, you can check out the table below.

@@ -927,7 +930,7 @@ However, to get a rough picture, you can check out the table below.
* Software: Latest Ubuntu + Docker. Also [see detail about GA hardware](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-private-repositories).
* The main metric is the number of lines per second. Please note that the table is in thousands of lines per second
-  (`100 K` = `100,000 lines per second`).
+  (`100K` = `100,000 lines per second`).
* An additional metric is the peak RAM consumption over the entire time of the test case.

Since usage profiles can vary, I've prepared a few profiles to cover most cases.

@@ -937,17 +940,20 @@ Since usage profiles can vary, I've prepared a few profiles to cover most cases.
* **[Minimum](tests/Benchmarks/bench_1_mini_combo.yml)** - Normal rules with average performance, but 2 of each.
* **[Realistic](tests/Benchmarks/bench_2_realistic_combo.yml)** - A mix of rules that are most likely to be used in real life.
-* **[All aggregations at once](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
+* **[All aggregations](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
  worst-case scenario.

Also, there is an additional division into

-* `Cell rules` - only rules applicable for each row/cell, 1000 lines per second.
-* `Agg rules` - only rules applicable for the whole column, 1000 lines per second.
-* `Cell + Agg` - a simultaneous combination of the previous two, 1000 lines per second.
-* `Peak Memory` - the maximum memory consumption during the test case, megabytes. **Important note:** This value is
-  only for the aggregation case. Since if you don't have aggregations, the peak memory usage will always be
-  no more than a couple megabytes.
+* `Cell rules` - only rules applicable for each row/cell.
+* `Agg rules` - only rules applicable for the whole column.
+* `Cell + Agg` - a simultaneous combination of the previous two.
+* `Peak Memory` - the maximum memory consumption during the test case.
+
+**Important note:** The `Peak Memory` value applies only to the aggregation case. If you don't have aggregations,
+the peak memory usage will always be no more than 2-4 megabytes. No memory leaks!
+It doesn't depend on the number of rules or the size of the CSV file.
+
@@ -960,91 +966,91 @@ Also, there is an additional division into
- + - + - + - +
All aggregations
Columns: 1
Size: 8.48 MB


Columns: 1
Size: ~8 MB


Cell rules
Agg rules
Cell + Agg
Peak Memory
-586K, 3.4 sec
-802K, 2.5 sec
-474K, 4.2 sec
+586K,  3.4 sec
+802K,  2.5 sec
+474K,  4.2 sec
52 MB
-320K, 6.3 sec
-755K, 2.6 sec
-274K, 7.3 sec
+320K,  6.3 sec
+755K,  2.6 sec
+274K,  7.3 sec
68 MB
171K, 11.7 sec
-532K, 3.8 sec
+532K,  3.8 sec
142K, 14.1 sec
208 MB
-794K, 2.5 sec
+794K,  2.5 sec
142K, 14.1 sec
121K, 16.5 sec
272 MB
Columns: 5
Size: 64.04 MB


Columns: 5
Size: 64 MB


Cell rules
Agg rules
Cell + Agg
Peak Memory
-443K, 4.5 sec
-559K, 3.6 sec
-375K, 5.3 sec
+443K,  4.5 sec
+559K,  3.6 sec
+375K,  5.3 sec
52 MB
-274K, 7.3 sec
-526K, 3.8 sec
-239K, 8.4 sec
+274K,  7.3 sec
+526K,  3.8 sec
+239K,  8.4 sec
68 MB
156K, 12.8 sec
-406K, 4.9 sec
+406K,  4.9 sec
131K, 15.3 sec
208 MB
-553K, 3.6 sec
+553K,  3.6 sec
139K, 14.4 sec
-111K, 18 sec
+111K, 18.0 sec
272 MB
Columns: 10
Size: 220.02 MB


Columns: 10
Size: 220 MB


Cell rules
Agg rules
Cell + Agg
Peak Memory
-276K, 7.2 sec
-314K, 6.4 sec
-247K, 8.1 sec
+276K,  7.2 sec
+314K,  6.4 sec
+247K,  8.1 sec
52 MB
197K, 10.2 sec
-308K, 6.5 sec
+308K,  6.5 sec
178K, 11.2 sec
68 MB
129K, 15.5 sec
-262K, 7.6 sec
-111K, 18 sec
+262K,  7.6 sec
+111K, 18.0 sec
208 MB
-311K, 6.4 sec
+311K,  6.4 sec
142K, 14.1 sec
97K, 20.6 sec
272 MB
Columns: 20
Size: 1.18 GB


Columns: 20
Size: 1.2 GB


Cell rules
Agg rules
Cell + Agg
Peak Memory
102K, 19.6 sec
@@ -1065,7 +1071,7 @@ Also, there is an additional division into
208 MB
-105K, 19 sec
+105K, 19.0 sec
144K, 13.9 sec
61K, 32.8 sec
272 MB
@@ -1074,28 +1080,39 @@ Also, there is an additional division into
+Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On a MacBook 2019
+Intel 2.4GHz they are about the same as on GitHub Actions. So I think the table can be considered average
+(though far from the best) hardware for a regular engineer.
+

### Brief conclusions

* Cell rules are very CPU demanding, but use almost no RAM (always about 1-2 MB at peak). The more of them there are,
  the longer it will take to validate a column, as they are additional actions per(!) value.
* Aggregation rules - work lightning fast (from 10 million to billions of rows per second), but require a lot of RAM.
-  On the other hand, if you add 20 different aggregation rules, the amount of memory consumed will not increase.
+  On the other hand, if you add 100+ different aggregation rules, the amount of memory consumed will not increase too
+  much.
+
+* Unfortunately, not all PHP array functions can work by reference (`&$var`).
+  This is very specific to each algorithm.
+  So if the dataset in a column takes 20 MB, it is sometimes copied and the peak usage becomes 40 MB (this is just an
+  example - see the short sketch below).
+  That's why reference optimization doesn't help most of the time.
* In fact, if you are willing to wait 30-60 seconds for a 1 GB file, and you have 200-500 MB of RAM, I don't see the
  point in thinking about it at all.
* No memory leaks have been detected.

-Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On MacBook 2019 Intel
-2.4Gz about the same as on GitHub Actions. So I think the table can be considered an average (but too far from the best)
-hardware at the regular engineer.

### Examples of CSV files

Below you will find examples of CSV files that were used for the benchmarks. They were created
with [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) (the first 2000 lines) and then
-copied [1000 times into themselves](tests/Benchmarks/create-csv.sh).
+copied [1000 times into themselves](tests/Benchmarks/create-csv.sh). So we can create really huge random files in
+seconds.
+
+The basic principle is that the more columns there are, the longer the values in them, i.e. something like exponential
+growth.
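Before the CSV examples, a quick aside on the `&$var` note in the brief conclusions above. The snippet below is a standalone sketch (not taken from the project's code) of PHP's copy-on-write behaviour: a column-sized array costs almost nothing to assign around, but the peak memory jumps by roughly the size of the array the moment the "copy" is actually written to.

```php
<?php
declare(strict_types=1);

// Standalone illustration of copy-on-write (not project code).
$column = \range(1, 1_000_000); // a column-like dataset: one million integers

$before = \memory_get_peak_usage(true);

$copy = $column; // no physical copy yet: both variables point to the same array
$copy[0] = 42;   // the first write separates them, so the real copy is allocated here

$after = \memory_get_peak_usage(true);

// Roughly the size of the column shows up on top of the previous peak,
// which is how a 20 MB dataset can briefly cost ~40 MB.
echo 'Extra peak memory: ' . \round(($after - $before) / 1024 / 1024) . ' MB' . \PHP_EOL;
```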
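Back to the CSV fixtures: here is a rough, hypothetical sketch of the generation idea described above. The real generator is the linked [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) command plus [create-csv.sh](tests/Benchmarks/create-csv.sh); the code below only shows the principle (seed a couple of thousand random lines, then append that block to the file many times) and assumes `fakerphp/faker` is installed via Composer.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch only - see tests/Benchmarks/Commands/CreateCsv.php for the real generator.
require __DIR__ . '/vendor/autoload.php';

$faker = \Faker\Factory::create();
$path  = '/tmp/demo.csv';

// 1. Seed ~2000 random lines (a small subset of the benchmark columns).
$rows = "id,bool_int,number,email\n";
for ($i = 1; $i <= 2000; $i++) {
    $rows .= \implode(',', [$i, (int) $faker->boolean(), $faker->numberBetween(1, 1_000_000), $faker->email()]) . "\n";
}
\file_put_contents($path, $rows);

// 2. Append the same data block many more times - a multi-million-line random file
//    appears in seconds because no new values have to be generated.
$data = \substr($rows, \strpos($rows, "\n") + 1); // data rows without the header
for ($pass = 0; $pass < 1000; $pass++) {
    \file_put_contents($path, $data, \FILE_APPEND);
}
```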
Columns: 1, Size: 8.48 MB @@ -1110,7 +1127,7 @@ id
- Columns: 5, Size: 64.04 MB + Columns: 5, Size: 64 MB ```csv id,bool_int,bool_str,number,float @@ -1122,7 +1139,7 @@ id,bool_int,bool_str,number,float
- Columns: 10, Size: 220.02 MB + Columns: 10, Size: 220 MB ```csv id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4 @@ -1134,7 +1151,7 @@ id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4
- Columns: 20, Size: 1.18 GB + Columns: 20, Size: 1.2 GB ```csv id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,postcode,latitude,longitude,ip6,sentence_tiny,sentence_small,sentence_medium,sentence_huge @@ -1144,7 +1161,7 @@
-### Run the benchmark locally
+### Run benchmark locally

Make sure you have PHP 8.1+ and Docker installed.

@@ -1274,7 +1291,7 @@ I'm not sure if I will implement all of them. But I will try to do my best.
## Contributing
If you have any ideas or suggestions, feel free to open an issue or create a pull request.

-```sh
+```shell
# Fork the repo and build project
git clone git@github.com:jbzoo/csv-blueprint.git ./jbzoo-csv-blueprint
cd ./jbzoo-csv-blueprint

diff --git a/tests/ReadmeTest.php b/tests/ReadmeTest.php
index ee37c300..1c21f497 100644
--- a/tests/ReadmeTest.php
+++ b/tests/ReadmeTest.php
@@ -139,6 +139,13 @@ public function testAdditionalValidationRules(): void
public function testBenchmarkTable(): void
{
    $nbsp = static fn (string $text): string => \str_replace(' ', ' ', $text);
+   $timeFormat = static fn (float $time): string => \str_pad(
+       \number_format($time, 1) . ' sec',
+       8,
+       ' ',
+       \STR_PAD_LEFT,
+   );
+
    $numberOfLines = 2_000_000;

    $columns = [
@@ -149,25 +156,25 @@ public function testBenchmarkTable(): void
    ];

    $table = [
-       'Columns: 1
Size: 8.48 MB' => [
+       'Columns: 1
Size: ~8 MB' => [
            [586, 802, 474, 52],
            [320, 755, 274, 68],
            [171, 532, 142, 208],
            [794, 142, 121, 272],
        ],
-       'Columns: 5
Size: 64.04 MB' => [
+       'Columns: 5
Size: 64 MB' => [
            [443, 559, 375, 52],
            [274, 526, 239, 68],
            [156, 406, 131, 208],
            [553, 139, 111, 272],
        ],
-       'Columns: 10
Size: 220.02 MB' => [
+       'Columns: 10
Size: 220 MB' => [
            [276, 314, 247, 52],
            [197, 308, 178, 68],
            [129, 262, 111, 208],
            [311, 142, 97, 272],
        ],
-       'Columns: 20
Size: 1.18 GB' => [
+       'Columns: 20
Size: 1.2 GB' => [
            [102, 106, 95, 52],
            [88, 103, 83, 68],
            [70, 97, 65, 208],
@@ -199,8 +206,8 @@ public function testBenchmarkTable(): void
            if ($key === 3) {
                $testRes = $value . ' MB';
            } else {
-               $execTime = \round($numberOfLines / ($value * 1000), 1);
-               $testRes = $nbsp("{$value}K, {$execTime} sec
");
+               $execTime = $timeFormat($numberOfLines / ($value * 1000));
+               $testRes = $nbsp("{$value}K, {$execTime}
");
            }
            $output[] = $testRes;