diff --git a/README.md b/README.md
index 31adc2d2..6ef12dca 100644
--- a/README.md
+++ b/README.md
@@ -96,6 +96,9 @@ This part of the readme is also covered by autotests, so these code are always u
In any unclear situation, look into it first.
+
+ CLICK HERE to see the most complete description of ALL features!
+
```yml
# It's a complete example of the CSV schema file in YAML format.
@@ -649,6 +652,7 @@ columns:
```
+
### Extra checks
@@ -907,14 +911,13 @@ Optional format `text` with highlited keywords:
Of course, you'll want to know how fast it works. The thing is, it depends very-very-very much on the following factors:
* **The file size** - Width and height of the CSV file. The larger the dataset, the longer it will take to go through
- it.
- The dependence is linear and strongly depends on the speed of your hardware (CPU, SSD).
+  it. The relationship is linear and also depends heavily on the speed of your hardware (CPU, SSD).
* **Number of rules used** - Obviously, the more of them there are for one column, the more iterations you will have to
- make.
- Also remember that they do not depend on each other.
+  make. Also remember that the rules are independent of each other: executing one rule will neither optimize nor slow
+  down another in any way, so their time and memory costs simply add up.
* Some validation rules are very time or memory intensive. For the most part you won't notice this, but there are some
that are dramatically slow. For example, `interquartile_mean` processes about 4k lines per second, while the rest of
- the rules are about 0.3-1 million lines per second.
+  the rules process 30+ million lines per second.
However, to get a rough picture, you can check out the table below.
@@ -927,7 +930,7 @@ However, to get a rough picture, you can check out the table below.
* Software: Latest Ubuntu + Docker.
Also [see detail about GA hardware](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-private-repositories).
* The main metric is the number of lines per second. Please note that the table is thousands of lines per second
- (`100 K` = `100,000 lines per second`).
+ (`100K` = `100,000 lines per second`).
* An additional metric is the peak RAM consumption over the entire time of the test case.
Since usage profiles can vary, I've prepared a few profiles to cover most cases.
@@ -937,17 +940,20 @@ Since usage profiles can vary, I've prepared a few profiles to cover most cases.
* **[Minimum](tests/Benchmarks/bench_1_mini_combo.yml)** - Normal rules with average performance, but 2 of each.
* **[Realistic](tests/Benchmarks/bench_2_realistic_combo.yml)** - A mix of rules that are most likely to be used in real
life.
-* **[All aggregations at once](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
+* **[All aggregations](tests/Benchmarks/bench_3_all_agg.yml)** - All aggregation rules at once. This is the
worst-case scenario.
Also, there is an additional division into
-* `Cell rules` - only rules applicable for each row/cell, 1000 lines per second.
-* `Agg rules` - only rules applicable for the whole column, 1000 lines per second.
-* `Cell + Agg` - a simultaneous combination of the previous two, 1000 lines per second.
-* `Peak Memory` - the maximum memory consumption during the test case, megabytes. **Important note:** This value is
- only for the aggregation case. Since if you don't have aggregations, the peak memory usage will always be
- no more than a couple megabytes.
+* `Cell rules` - only rules applicable for each row/cell.
+* `Agg rules` - only rules applicable for the whole column.
+* `Cell + Agg` - a simultaneous combination of the previous two.
+* `Peak Memory` - the maximum memory consumption during the test case.
+
+**Important note:** The `Peak Memory` value applies only to the aggregation cases. If you don't use aggregations,
+peak memory usage always stays within 2-4 megabytes, regardless of the number of rules or the size of the CSV file.
+No memory leaks!
+
@@ -960,91 +966,91 @@ Also, there is an additional division into
All aggregations |
- Columns: 1 Size: 8.48 MB
|
+ Columns: 1 Size: ~8 MB
|
Cell rules Agg rules Cell + Agg Peak Memory |
-586K, 3.4 sec
-802K, 2.5 sec
-474K, 4.2 sec
+586K,  3.4 sec
+802K,  2.5 sec
+474K,  4.2 sec
52 MB
|
-320K, 6.3 sec
-755K, 2.6 sec
-274K, 7.3 sec
+320K,  6.3 sec
+755K,  2.6 sec
+274K,  7.3 sec
68 MB
|
171K, 11.7 sec
-532K, 3.8 sec
+532K,  3.8 sec
142K, 14.1 sec
208 MB
|
-794K, 2.5 sec
+794K,  2.5 sec
142K, 14.1 sec
121K, 16.5 sec
272 MB
|
- Columns: 5 Size: 64.04 MB
|
+ Columns: 5 Size: 64 MB
|
Cell rules Agg rules Cell + Agg Peak Memory |
-443K, 4.5 sec
-559K, 3.6 sec
-375K, 5.3 sec
+443K,  4.5 sec
+559K,  3.6 sec
+375K,  5.3 sec
52 MB
|
-274K, 7.3 sec
-526K, 3.8 sec
-239K, 8.4 sec
+274K,  7.3 sec
+526K,  3.8 sec
+239K,  8.4 sec
68 MB
|
156K, 12.8 sec
-406K, 4.9 sec
+406K,  4.9 sec
131K, 15.3 sec
208 MB
|
-553K, 3.6 sec
+553K,  3.6 sec
139K, 14.4 sec
-111K, 18 sec
+111K, 18.0 sec
272 MB
|
- Columns: 10 Size: 220.02 MB
|
+ Columns: 10 Size: 220 MB
|
Cell rules Agg rules Cell + Agg Peak Memory |
-276K, 7.2 sec
-314K, 6.4 sec
-247K, 8.1 sec
+276K,  7.2 sec
+314K,  6.4 sec
+247K,  8.1 sec
52 MB
|
197K, 10.2 sec
-308K, 6.5 sec
+308K,  6.5 sec
178K, 11.2 sec
68 MB
|
129K, 15.5 sec
-262K, 7.6 sec
-111K, 18 sec
+262K,  7.6 sec
+111K, 18.0 sec
208 MB
|
-311K, 6.4 sec
+311K,  6.4 sec
142K, 14.1 sec
97K, 20.6 sec
272 MB
|
- Columns: 20 Size: 1.18 GB
|
+ Columns: 20 Size: 1.2 GB
|
Cell rules Agg rules Cell + Agg Peak Memory |
102K, 19.6 sec
@@ -1065,7 +1071,7 @@ Also, there is an additional division into
208 MB
|
-105K, 19 sec
+105K, 19.0 sec
144K, 13.9 sec
61K, 32.8 sec
272 MB
@@ -1074,28 +1080,39 @@ Also, there is an additional division into
|
+Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On a MacBook 2019
+Intel 2.4 GHz they are about the same as on GitHub Actions. So the table can be considered representative of average
+(but far from the best) hardware of a regular engineer.
+
### Brief conclusions
* Cell rules are very CPU demanding, but use almost no RAM (always about 1-2 MB at peak).
The more of them there are, the longer it will take to validate a column, as they are additional actions per(!) value.
* Aggregation rules - work lightning fast (from 10 millions to billions of rows per second), but require a lot of RAM.
- On the other hand, if you add 20 different aggregation rules, the amount of memory consumed will not increase.
+  On the other hand, even if you add 100+ different aggregation rules, memory consumption will not increase
+  significantly.
+
+* Unfortunately, not all PHP array functions can work by reference (`&$var`).
+  Whether that is possible depends entirely on the algorithm of a particular rule.
+  So if a column's dataset takes 20 MB, it is sometimes copied and the peak usage becomes 40 MB (just an example).
+  That's why optimization by reference doesn't help most of the time. See the sketch after this list.
* In fact, if you are willing to wait 30-60 seconds for a 1 GB file, and you have 200-500 MB of RAM,
I don't see the point in thinking about it at all.
* No memory leaks have been detected.
-Btw, if you run the same tests on a MacBook 14" M2 Max 2023, the results are ~2 times better. On MacBook 2019 Intel
-2.4Gz about the same as on GitHub Actions. So I think the table can be considered an average (but too far from the best)
-hardware at the regular engineer.
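+
+Below is a minimal, self-contained sketch (not taken from this library; the array, the closure and the sizes are
+invented purely for illustration) of why a column's data may get duplicated: many PHP array functions return a new
+array instead of mutating the input, so the source and the result briefly live in memory at the same time.
+
+```php
+<?php
+declare(strict_types=1);
+
+$column = \range(1, 1_000_000); // pretend this is one parsed CSV column
+
+$before = \memory_get_peak_usage(true);
+
+// array_map() builds a brand-new array, so both copies exist side by side
+// and the peak memory usage roughly doubles.
+$mapped = \array_map(static fn (int $v): int => $v * 2, $column);
+
+echo 'Peak grew by ~', (int)((\memory_get_peak_usage(true) - $before) / 1024 / 1024), " MB\n";
+
+// A foreach by reference (&$var) mutates the array in place and avoids the
+// second copy, but not every aggregation algorithm can be rewritten this way.
+foreach ($column as &$value) {
+    $value *= 2;
+}
+unset($value);
+```
+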
### Examples of CSV files
Below you will find examples of CSV files that were used for the benchmarks. They were created
with [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) (the first 2000 lines) and then
-copied [1000 times into themselves](tests/Benchmarks/create-csv.sh).
+copied [1000 times into themselves](tests/Benchmarks/create-csv.sh). So we can create really huge random files in
+seconds (a rough sketch of the idea is shown below).
+
+The basic principle is that the more columns a file has, the longer the values in them. So the file size grows
+much faster than linearly, something like exponential growth.
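+
+The real generator lives in the linked shell script; the snippet below is only a rough PHP-flavored sketch of the
+"append the small seed many times" idea (the file names and the loop are invented for illustration, not copied from
+create-csv.sh).
+
+```php
+<?php
+declare(strict_types=1);
+
+// Hypothetical seed: the first 2000 Faker-generated lines (a few hundred KB at most).
+$seed = \file_get_contents('seed.csv');
+
+// Appending the small seed ~1000 times produces a huge random file in seconds.
+$out = \fopen('huge.csv', 'wb');
+for ($i = 0; $i < 1000; $i++) {
+    \fwrite($out, $seed);
+}
+\fclose($out);
+```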