Implement parallel processing functionality based on ext-parallel #121

Merged on Apr 11, 2024 (57 commits)

SmetDenis (Member) commented Apr 1, 2024

This commit adds parallel processing to speed up CSV file validation. The workflow file and Dockerfile now use a thread-safe build of PHP with the parallel extension. A new dependency, `hds-solutions/parallel-sdk`, was added to `composer.json`, along with related changes in other files.
coveralls commented Apr 1, 2024

Pull Request Test Coverage Report for Build 8653613758

Details

  • 140 of 206 (67.96%) changed or added relevant lines in 9 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.8%) to 95.973%

Changes Missing Coverage

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/Commands/ValidateCsv.php | 44 | 45 | 97.78% |
| src/Workers/Tasks/ValidationSchemaTask.php | 7 | 9 | 77.78% |
| src/Workers/Worker.php | 7 | 11 | 63.64% |
| src/Commands/AbstractValidate.php | 27 | 34 | 79.41% |
| src/Utils.php | 6 | 27 | 22.22% |
| src/Workers/WorkerPool.php | 16 | 47 | 34.04% |

Totals (Coverage Status)

| Metric | Value |
| --- | --- |
| Change from base Build 8636375558 | -0.8% |
| Covered Lines | 3265 |
| Relevant Lines | 3402 |

💛 - Coveralls

…formation commands from the Github workflow's main.yml file, and also updates the 'phpts' environment variable from 'ts' to 'zts'. The changes aim at optimizing the workflow runs and reducing the noise in workflow logs.
The code changes include adding the requirement for the parallel PHP extension in the composer.json and composer.lock files. The updates also include bootstrap for the parallel extension in autoload.php and csv-blueprint.php. These modifications ensure better multi-threading
This commit corrects the bootstrap function call under the parallel extension in both "tests/autoload.php" and "csv-blueprint.php" files to ensure proper functioning in PHP applications. Additionally, it improves the README file's readability by fixing the indentations in a section explaining the usage of "csv.header" in two different cases.
The version of the "amphp/parallel" library has been updated from "^1.4" to "^1.4.3" in the composer.json file. Moreover, the changes also include some formatting adjustments in the indentation of the composer.json and composer.lock files to improve readability.
@SmetDenis SmetDenis marked this pull request as draft April 1, 2024 19:12
@SmetDenis SmetDenis added the WIP Work in progress label Apr 2, 2024
# Conflicts:
#	.phan.php
#	composer.json
#	tests/ReadmeTest.php
This update adds the "amphp/parallel" library as a new dependency in composer.json and updates the composer.lock file accordingly. The addition of "amphp/parallel" helps in achieving improved performance by enabling better management of concurrent PHP processes.
This commit introduces an experimental feature for parallel schema validation improving performance on multi-core CPUs. It adds a new command-line option 'parallel' and implements the parallel validation logic in 'ValidateSchema.php'. Additionally, it introduces a 'SchemaValidationTask' class executing individual validation tasks in worker threads.
# Conflicts:
#	composer.json
#	composer.lock
The ValidateSchema command has been refactored by replacing the previously used parallel task execution method with a Scheduler from the HDSSolutions\Console\Parallel package. This change includes replacing Amp\Parallel\Worker and Amp\Future use statements with Scheduler. It also changes the mechanism of handling tasks, introducing more structured control of task processing.
Added a step that sets up PHP 8.3 in the benchmark workflow, so the correct PHP version is in place before the benchmarks run. The change was made in both the active and the commented-out sections of the GitHub workflow.
This commit enhances the parallelization feature, altering the "parallel" option to allow specifying the number of threads. Moreover, it improves task handling by introducing a TaskRunner system, separating task logic into distinct Task classes. The system now also uses more efficient parallel execution when available. Additionally, the command now prints diagnostic messages about the number of threads in use during parallel executions. Lastly, the SchemaValidationTask class was moved and the AbstractTask class was introduced to improve the organization of the Task structure.
This commit improves the experimental parallelization feature by allowing a specific number of threads to be specified. It improves error handling in the task pool and updates the code in the WorkerPool class to streamline task processing. The commit also increases the pool maintenance delay to prevent CPU overload. PHP.ini, Dockerfile and README.md files have been adjusted accordingly.
WorkerPool has been refactored to simplify its code, particularly in the task dispatching section. Parallelization testing in TaskRunnerTest has been improved with better validity checks, and the overall error handling has been strengthened. Some extraneous code elements have also been removed for tidiness.
Unnecessary white space was removed in Schema.php for cleaner code. The 'schema' path in the ValidateSchemaTest was modified as well, ensuring tests access the correct schema files.
This commit consists of code refactoring, particularly the removal of unnecessary space in Schema.php and the enhancement of Schema validation tests. The schema path in ValidateSchemaTest was updated to ensure tests are accessing correct schema files. Other notable changes include the amendment of exception and validation handling, and adjustments to worker-related codes for CSV validation. The composer.lock was updated correspondingly due to these changes.
# Conflicts:
#	Dockerfile
# Conflicts:
#	csv-blueprint.php
The PHP_SAPI checks have been reorganized to avoid redundancy. By consolidating the code, the script's instructions have been streamlined and readability has been improved.
# Conflicts:
#	csv-blueprint.php
Additional security settings have been added in php.ini to prohibit inclusion of URLs in scripts. Runtime exception handling across the codebase has also been simplified. For improving maintainability, redundant code has been eliminated in csv-blueprint.php, particularly where PHP_SAPI is checked.
This commit updates multiple classes throughout the project to be declared as final thereby restricting them from further inheritance. This adjustment enhances the security and stability of the code, and helps prevent potential design issues in future developments.
This commit overhauls the CSV validation script to enhance performance and streamline procedures for Docker. The parallel file checking has been updated for improved efficiency and a script for generating random CSV data has been added. Environmental adjustments have been made in the Dockerfile, including git removal, cache clearing, and script permissions management. Additionally, the Docker PHP INI configurations have been updated for better memory consumption and error reporting.
Optimizes both sequential and parallel CSV validation. Also changes the Docker configuration, mainly the PHP INI settings and the preparation steps (cache warmup and script permissions). The Dockerfile now generates random CSV data for testing and cleans up the environment by removing git files and clearing the Composer cache.
This addition allows the benchmarking process to measure performance in both single and multi-thread modes. Each mode is run on Docker with dedicated title identifiers, supporting systematic evaluation of CSV validations in different settings.
The code has been refactored to move the handling of the debug mode to a new utility class. This change allows us to set and get the debug mode via static methods, improving code clarity and encapsulation. The debug mode state is now part of the Utils class helping to maintain a cleaner global namespace.
The debug method in the Utils class has been refactored to only accept string messages. Additionally, a try-catch block has been added to output the message with strip_tags function in the event of a Throwable exception, improving error-handling.
The file opening error check has been removed from the random-csv.php script in the Docker directory. This change simplifies the code, relying on the inbuilt error handling to manage the case where the file cannot be successfully opened.
The worker pool maintenance delay has been reduced from 10,000 to 1,000 units in WorkerPool.php for smoother operation. In Utils.php, the check for debug mode has been changed from a defined constant to a variable, enhancing maintainability and effectiveness of debug mode functionality.
The updated workflow file now includes steps to run validation on both valid and invalid CSV files in parallel. This change is intended to increase overall efficiency of the validation process by speeding it up through simultaneous execution.
Added a var_dump within Utils.php to help debug issues with regex matching in the code. Also simplified the formatting in phpunit.xml.dist for readability, compacting multi-line report details into single lines.
Removed a debug var_dump in Utils.php which was previously added for regex matching debug purposes. Additionally, the filename pattern in demo_invalid.yml is updated to be more concise, eliminating unnecessary enumeration.
Updated code construction and optimized scripts in random-csv.php and csv-blueprint.php. Removed redundant try-catch in Utils.php and established a new utility function that sets up the basic environment for script execution. Updated Dockerfile for better cache warmup, and README.md to reflect changes in command options.
The SonarCloud scan step in GitHub workflow is now allowed to continue on error. Updated error message in Utils.php for clarity and precision. Also, modified the Docker image source in action.yml to use the local Dockerfile. This change provides more flexibility and control over the docker environment for the actions.
The order of the GitHub workflow has been adjusted to prioritize code quality check. Furthermore, error handling in Utils.php has been improved to provide more precise error severity descriptions. This enhances troubleshooting by providing more accurate error context.
@SmetDenis SmetDenis changed the title from "Implement parallel processing functionality for CSV file validation" to "Implement parallel processing functionality based on ext-parallel" Apr 11, 2024
The GitHub workflow and Makefile have been updated for a better benchmark setup. A separate step for multi-threaded benchmarking has been added and the method of generating random integers in the CSV file has been altered. The adjustments provide a clearer understanding of the process and better use of resources.
The WorkerPool bootstrap has been simplified by removing conditional file_exists check and directly setting the 'vendor/autoload.php' as the bootstrap. This will make the script setup less convoluted and easier to understand.
The WorkerPool bootstrap setup has been modified to accommodate a Docker environment. Now, the script uses 'docker/preload.php' as the bootstrap in Docker environment, and falls back to 'vendor/autoload.php' if the Docker file does not exist. This adjustment helps to maintain a flexible setup that can suit different runtime contexts.
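The fallback described in this commit message can be sketched as follows. This is a language-neutral illustration in Python, not the project's PHP code: the function name `pick_bootstrap` and the `project_root` parameter are hypothetical, and only the two file paths come from the commit message.

```python
from pathlib import Path


def pick_bootstrap(project_root: Path) -> Path:
    """Choose the bootstrap file for worker threads.

    Assumption: the Docker image ships docker/preload.php, so its presence
    is treated as the signal that we are running inside Docker.
    """
    docker_preload = project_root / "docker" / "preload.php"
    if docker_preload.is_file():
        return docker_preload
    # Outside Docker, fall back to Composer's autoloader.
    return project_root / "vendor" / "autoload.php"
```

Checking for the file itself, rather than for an environment flag, keeps the same code path working in both runtime contexts.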
A new parallel processing feature has been added that enables the user to validate multiple CSV files concurrently. Introduced via the `--parallel` flag, this functionality enhances performance by effectively utilizing CPU resources. Accompanying documentation for this experimental feature, including usage and considerations, has also been updated in the README. Be aware that to use this new feature, the 'ext-parallel' PHP extension is required.
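As a rough illustration of validating several CSV files concurrently, here is a minimal sketch in Python. It is not the PR's implementation, which is PHP built on ext-parallel: the functions `validate_csv` and `validate_all`, the `workers` parameter, and the single rule checked (header and column count) are all hypothetical. Python threads also do not give real CPU parallelism for this kind of work, so the sketch shows only the shape of the task dispatch.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def validate_csv(name, text, expected_header):
    """Return a list of error strings for one CSV document (hypothetical rules)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [f"{name}: file is empty"]
    errors = []
    if rows[0] != expected_header:
        errors.append(f"{name}: unexpected header {rows[0]}")
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected_header):
            errors.append(
                f"{name}:{line_no}: expected {len(expected_header)} columns, got {len(row)}"
            )
    return errors


def validate_all(files, expected_header, workers=1):
    """Validate every file as its own task and collect errors per file name."""
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        futures = {
            name: pool.submit(validate_csv, name, text, expected_header)
            for name, text in files.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Each file is an independent task, so results can be collected in any order and reported per file once all tasks finish.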
Enhanced error handling in filename pattern validation to identify regex related exceptions. Updated PHP settings for opcache to ensure thread-safety and prevent segmentation faults during parallel execution. Included modifications to basic configurations and addition of experimental elements in the php.ini file.
Updated the documentation on the use and impact of multithreading when validating CSV files in parallel by columns. It also explains that allocating more threads than there are CPU cores can reduce performance because of system overhead.
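One way to act on that advice is to cap the requested thread count at the number of CPU cores. The sketch below is in Python and rests on assumptions: the function `effective_threads` is hypothetical, and treating a non-positive request as "use all cores" is an illustrative convention, not documented behaviour of the `--parallel` option.

```python
import os
from typing import Optional


def effective_threads(requested: int, cpu_count: Optional[int] = None) -> int:
    """Return the number of threads to start for a requested count.

    Assumption: requested <= 0 means "auto" (one thread per core).
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if requested <= 0:
        return cpus
    # More threads than cores only adds scheduling overhead.
    return min(requested, cpus)
```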
sonarcloud bot commented Apr 11, 2024

@SmetDenis SmetDenis marked this pull request as ready for review April 11, 2024 21:45
@SmetDenis SmetDenis merged commit 821fc85 into master Apr 11, 2024
12 checks passed
@SmetDenis SmetDenis deleted the parallel-exec branch April 11, 2024 21:45
Labels: WIP (Work in progress)
2 participants