From 9f2883fbbd118987b1b4de61193605026ba9aaeb Mon Sep 17 00:00:00 2001
From: SmetDenis
Date: Fri, 12 Apr 2024 00:48:07 +0400
Subject: [PATCH] Add parallel processing for CSV file validation

A new parallel processing feature has been added that enables the user to validate multiple CSV files
concurrently. Introduced via the `--parallel` flag, this functionality enhances performance by effectively
utilizing CPU resources. Accompanying documentation for this experimental feature, including usage and
considerations, has also been updated in the README. Be aware that to use this new feature, the
'ext-parallel' PHP extension is required.
---
 README.md  | 34 ++++++++++++++++++++++++++++++----
 action.yml |  9 +++++----
 2 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 8d2d2da8..9fa5964e 100644
--- a/README.md
+++ b/README.md
@@ -29,6 +29,7 @@ specifications, making it invaluable in scenarios where data quality and consist
 - [Usage](#usage)
 - [Schema definition](#schema-definition)
 - [Presets and reusable schemas](#presets-and-reusable-schemas)
+- [Parallel processing](#parallel-processing)
 - [Complete CLI help message](#complete-cli-help-message)
 - [Report examples](#report-examples)
 - [Benchmarks](#benchmarks)
@@ -160,11 +161,12 @@ You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv
  # Extra options for the CSV Blueprint. Only for debbuging and profiling.
  # Available options:
- # ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
- # Verbosity level: Available options: `-v`, `-vv`, `-vvv`.
- # Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
- # Add flag `--debug` if you want to see more really deep details.
+ # Add flag `--parallel` if you want to validate CSV files in parallel.
  # Add flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
+ # Add flag `--debug` if you want to see more really deep details.
+ # Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
+ # Verbosity level: Available options: `-v`, `-vv`, `-vvv`
+ # ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
  # Default value: 'options: --ansi'
  # You can skip it.
  extra: 'options: --ansi'
@@ -1412,6 +1414,30 @@ columns:
 These are intended solely for demonstration and to illustrate potential configurations and features.
+## Parallel processing
+
+The `--parallel` option is available for speeding up the validation of CSV files by utilizing more CPU resources
+effectively.
+
+### Key Points
+
+- **Experimental Feature:** This feature is currently experimental and requires further debugging and testing. Although
+ it performs well in synthetic autotests and benchmarks. More practical use cases are needed to validate its stability.
+- **Use Case:** This option is beneficial if you are processing dozens of CSV files, with each file taking 1 second or
+ more to process.
+- **Default Behavior:** If you use `--parallel` without specifying a value, it defaults to using the maximum number of
+ available CPU cores.
+- **Thread Pool Size:** You can set a specific number of threads for the pool. For example, `--parallel=10` will set the
+ thread pool size to 10. It doesn't make much sense to specify more than the number of logical cores in your CPU.
+- **Disabling Parallelism:** Using `--parallel=1` disables parallel processing, which is the default setting if the
+ option is not specified.
+- **Implementation:** The feature relies on the `ext-parallel` PHP extension, which enables the creation of lightweight
+ threads rather than processes. This extension is already included in our Docker image. Ensure that you have
+ the `ext-parallel` extension installed if you are not using our Docker image. This extension is crucial for the
+ operation of the parallel processing feature. The application always runs in single-threaded mode if the extension is
+ not installed.
+
+
 ## Complete CLI help message
 This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its
diff --git a/action.yml b/action.yml
index 8dfe230a..9f971197 100644
--- a/action.yml
+++ b/action.yml
@@ -58,11 +58,12 @@ inputs:
  description: |
  Extra options for the CSV Blueprint. Only for debbuging and profiling.
  Available options:
- ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
- Verbosity level: Available options: `-v`, `-vv`, `-vvv`.
- Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
- Add flag `--debug` if you want to see more really deep details.
+ Add flag `--parallel` if you want to validate CSV files in parallel.
  Add flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
+ Add flag `--debug` if you want to see more really deep details.
+ Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
+ Verbosity level: Available options: `-v`, `-vv`, `-vvv`
+ ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
  default: 'options: --ansi'
 runs: