Add parallel processing for CSV file validation
A new parallel processing feature has been added that enables the user to validate multiple CSV files concurrently. Introduced via the `--parallel` flag, this functionality enhances performance by effectively utilizing CPU resources. Accompanying documentation for this experimental feature, including usage and considerations, has also been updated in the README. Be aware that to use this new feature, the 'ext-parallel' PHP extension is required.
SmetDenis committed Apr 11, 2024
1 parent c45df91 commit 9f2883f
Showing 2 changed files with 35 additions and 8 deletions.
34 changes: 30 additions & 4 deletions README.md
@@ -29,6 +29,7 @@ specifications, making it invaluable in scenarios where data quality and consist
- [Usage](#usage)
- [Schema definition](#schema-definition)
- [Presets and reusable schemas](#presets-and-reusable-schemas)
- [Parallel processing](#parallel-processing)
- [Complete CLI help message](#complete-cli-help-message)
- [Report examples](#report-examples)
- [Benchmarks](#benchmarks)
@@ -160,11 +161,12 @@ You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv

# Extra options for the CSV Blueprint. Only for debugging and profiling.
# Available options:
# ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
# Verbosity level: Available options: `-v`, `-vv`, `-vvv`.
# Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
# Add flag `--debug` if you want to see more really deep details.
# Add flag `--parallel` if you want to validate CSV files in parallel.
# Add flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
# Add flag `--debug` if you want to see more really deep details.
# Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
# Verbosity level: Available options: `-v`, `-vv`, `-vvv`
# ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
# Default value: 'options: --ansi'
# You can skip it.
extra: 'options: --ansi'
@@ -1412,6 +1414,30 @@ columns:
These are intended solely for demonstration and to illustrate potential configurations and features.


## Parallel processing

The `--parallel` option is available for speeding up the validation of CSV files by utilizing more CPU resources
effectively.

### Key Points

- **Experimental Feature:** This feature is currently experimental and requires further debugging and testing. Although
it performs well in synthetic autotests and benchmarks, more practical use cases are needed to validate its stability.
- **Use Case:** This option is beneficial if you are processing dozens of CSV files, with each file taking 1 second or
more to process.
- **Default Behavior:** If you use `--parallel` without specifying a value, it defaults to using the maximum number of
available CPU cores.
- **Thread Pool Size:** You can set a specific number of threads for the pool. For example, `--parallel=10` will set the
thread pool size to 10. It doesn't make much sense to specify more than the number of logical cores in your CPU.
- **Disabling Parallelism:** Using `--parallel=1` disables parallel processing, which is the default setting if the
option is not specified.
- **Implementation:** The feature relies on the `ext-parallel` PHP extension, which creates lightweight threads rather
than processes. The extension is already included in our Docker image. If you are not using our Docker image, make
sure `ext-parallel` is installed. Without it, the application always runs in single-threaded mode.
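
The points above can be sketched as a small shell helper that picks a pool size and only emits the flag when threads
are actually usable. This is an illustration under stated assumptions, not part of CSV Blueprint: the function names
(`logical_cores`, `pool_size`, `has_ext_parallel`, `parallel_flag`) are hypothetical, and the extension check assumes
that `php -m` lists the extension under the name `parallel`.

```shell
#!/bin/sh
# Hypothetical helper functions; none of these ship with CSV Blueprint.

# Print the number of logical CPU cores, falling back to 1 if it cannot be detected.
logical_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif command -v getconf >/dev/null 2>&1; then
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    else
        echo 1
    fi
}

# Clamp a requested pool size to the range [1, logical cores].
pool_size() {
    requested=$1
    cores=$(logical_cores)
    if [ "$requested" -lt 1 ]; then
        echo 1
    elif [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Succeed only if PHP is on PATH and lists an extension named "parallel" (assumed name).
has_ext_parallel() {
    command -v php >/dev/null 2>&1 && php -m | grep -qi '^parallel$'
}

# Print "--parallel=N" when more than one thread is usable; print nothing otherwise.
parallel_flag() {
    size=$(pool_size "$1")
    if [ "$size" -gt 1 ] && has_ext_parallel; then
        echo "--parallel=$size"
    fi
}
```

For example, `$(parallel_flag 8)` could be appended to the validation command you already run. It expands to an empty
string when the extension is missing or only one core is available, which matches the single-threaded default
described above.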


## Complete CLI help message

This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its
9 changes: 5 additions & 4 deletions action.yml
@@ -58,11 +58,12 @@ inputs:
description: |
Extra options for the CSV Blueprint. Only for debugging and profiling.
Available options:
ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
Verbosity level: Available options: `-v`, `-vv`, `-vvv`.
Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
Add flag `--debug` if you want to see more really deep details.
Add flag `--parallel` if you want to validate CSV files in parallel.
Add flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
Add flag `--debug` if you want to see more really deep details.
Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
Verbosity level: Available options: `-v`, `-vv`, `-vvv`
ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
default: 'options: --ansi'

runs: