Enhance README badges (#59)
SmetDenis committed Mar 19, 2024
1 parent 9958cd7 commit 51d9341
Showing 5 changed files with 119 additions and 82 deletions.
132 changes: 65 additions & 67 deletions README.md
@@ -4,13 +4,13 @@
[![Stable Version](https://poser.pugx.org/jbzoo/csv-blueprint/version)](https://packagist.org/packages/jbzoo/csv-blueprint/) [![Total Downloads](https://poser.pugx.org/jbzoo/csv-blueprint/downloads)](https://packagist.org/packages/jbzoo/csv-blueprint/stats) [![Docker Pulls](https://img.shields.io/docker/pulls/jbzoo/csv-blueprint.svg)](https://hub.docker.com/r/jbzoo/csv-blueprint) [![Dependents](https://poser.pugx.org/jbzoo/csv-blueprint/dependents)](https://packagist.org/packages/jbzoo/csv-blueprint/dependents?order_by=downloads) [![GitHub License](https://img.shields.io/github/license/jbzoo/csv-blueprint)](https://github.com/JBZoo/Csv-Blueprint/blob/master/LICENSE)

<!-- rules-counter -->
![Static Badge](https://img.shields.io/badge/Rules-100-green?label=Total%20Number%20of%20Rules&labelColor=blue&color=gray) ![Static Badge](https://img.shields.io/badge/Rules-55-green?label=Cell%20Rules&labelColor=blue&color=gray) ![Static Badge](https://img.shields.io/badge/Rules-45-green?label=Aggregate%20Rules&labelColor=blue&color=gray)
[![Static Badge](https://img.shields.io/badge/Rules-102-green?label=Total%20Number%20of%20Rules&labelColor=blue&color=gray)](schema-examples/full.yml) [![Static Badge](https://img.shields.io/badge/Rules-55-green?label=Cell%20Rules&labelColor=blue&color=gray)](src/Rules/Cell) [![Static Badge](https://img.shields.io/badge/Rules-45-green?label=Aggregate%20Rules&labelColor=blue&color=gray)](src/Rules/Aggregate) [![Static Badge](https://img.shields.io/badge/Rules-2-green?label=Extra%20Checks&labelColor=blue&color=gray)](schema-examples/full.yml)
<!-- /rules-counter -->

## Introduction

The CSV Blueprint tool is a powerful and flexible utility designed for validating CSV files against
a predefined schema specified in YAML format. With the capability to run both locally and in Docker environments,
a pre-defined schema specified in YAML format. With the capability to run both locally and in Docker environments,
CSV Blueprint is an ideal choice for integrating into CI/CD pipelines, such as GitHub Actions,
to ensure the integrity of CSV data in your projects.

@@ -35,7 +35,7 @@ Integrating CSV validation into CI processes promotes higher data integrity, rel
* **Comprehensive Rule Set**: Includes a broad set of validation rules, such as non-empty fields, exact values, regular expressions, numeric constraints, date formats, and more, catering to various data validation needs.
* **Docker Support**: Easily integrate into any workflow with Docker, providing a seamless experience for development, testing, and production environments.
* **GitHub Actions Integration**: Automate CSV validation in your CI/CD pipeline, enhancing the quality control of your data in pull requests and deployments.
* **Various ways to report:** issues that can be easily integrated with GithHub, Gitlab, TeamCity, etc. The default output is a human-readable table. [See Live Demo](https://github.com/JBZoo/Csv-Blueprint-Demo).
* **Various ways to report:** issues that can be easily integrated with GitHub, GitLab, TeamCity, etc. The default output is a human-readable table. [See Live Demo](https://github.com/JBZoo/Csv-Blueprint-Demo).


## Live Demo
@@ -51,8 +51,8 @@ Integrating CSV validation into CI processes promotes higher data integrity, rel
* [demo.csv](tests/fixtures/demo.csv)


### Schema Definition
Define your CSV validation schema in a [YAML](schema-examples/full.yml). Other formats are also available: , [JSON](schema-examples/full.json), [PHP](schema-examples/full.php).
### Schema definition
Define your CSV validation schema in a [YAML](schema-examples/full.yml). Other formats are also available: [JSON](schema-examples/full.json), [PHP](schema-examples/full.php).

This example defines a simple schema for a CSV file with a header row, specifying that the `id` column must not be empty and must contain integer values.
Also, it checks that the `name` column has a minimum length of 3 characters.
@@ -74,7 +74,7 @@ columns:
```


### Full description of the scheme
### Full description of the schema

In the [example Yml file](schema-examples/full.yml) you can find a detailed description of all features.
It's also covered by tests, so it's always up-to-date.
@@ -351,7 +351,7 @@ columns:
You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv-Blueprint/actions/workflows/demo.yml).


### As GitHub Action
### GitHub Action

<!-- github-actions-yml -->
```yml
@@ -383,7 +383,7 @@ You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv
```
<!-- /github-actions-yml -->

**Note**. Report format for GitHub Actions is `github` by default. See [GitHub Actions friendly](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-a-warning-message) and [PR as a live demo](https://github.com/JBZoo/Csv-Blueprint-Demo/pull/1/files).
**Note**. GitHub Actions report format is `github` by default. See [GitHub Actions friendly](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-a-warning-message) and [PR as a live demo](https://github.com/JBZoo/Csv-Blueprint-Demo/pull/1/files).

This allows you to see bugs in the GitHub interface at the PR level.
That is, the error will be shown in a specific place in the CSV file right in diff of your Pull Requests! [See example](https://github.com/JBZoo/Csv-Blueprint-Demo/pull/1/files).
@@ -398,7 +398,7 @@ That is, the error will be shown in a specific place in the CSV file right in di
</details>


### As Docker container
### Docker container
Ensure you have Docker installed on your machine.

```sh
@@ -408,15 +408,15 @@ docker pull jbzoo/csv-blueprint
# Run the tool inside Docker
docker run --rm \
--workdir=/parent-host \
-v .:/parent-host \
-v $(pwd):/parent-host \
jbzoo/csv-blueprint \
validate:csv \
--csv=./tests/fixtures/demo.csv \
--schema=./tests/schemas/demo_invalid.yml
```


### As PHP binary
### PHP binary
Ensure you have PHP installed on your machine.

**Status: WIP**. It's not released yet. But you can build it from source. See manual above and `./build/csv-blueprint.phar` file.
@@ -430,7 +430,7 @@ chmod +x ./csv-blueprint.phar
```


### As PHP project
### PHP project
Ensure you have PHP installed on your machine.
Then, you can use the following commands to build from source and run the tool.

@@ -444,7 +444,7 @@ make build
```


### CLI Help Message
### Complete CLI Help Message

Here you can see all available options and commands. The tool uses the [JBZoo/Cli](https://github.com/JBZoo/Cli) package for the CLI interface.
So there are options here for all occasions.
@@ -559,55 +559,54 @@ Optional format `text` with highlighted keywords:

It's random ideas and plans. No orderings and deadlines. <u>But batch processing is the priority #1</u>.

**Batch processing**
* If option `--csv` is not specified, then the STDIN is used. To build a pipeline in Unix-like systems.
* Discovering CSV files by `filename_pattern` in the schema file. In case you have a lot of schemas and a lot of CSV files and want to automate the process as one command.
* Flag to ignore file name pattern. It's useful when you have a lot of files, and you don't want to validate the file name.

**Validation**
* [More aggregate rules](https://github.com/markrogoyski/math-php#statistics---descriptive).
* [More cell rules](https://github.com/Respect/Validation).
* `required` flag for the column.
* Custom cell rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
* Custom agregate rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
* Configurable keyword for null/empty values. By default, it's an empty string. But you will use `null`, `nil`, `none`, `empty`, etc. Overridable on the column level.
* Handle empty files and files with only a header row, or only with one line of data. One column wthout header is also possible.
* Using multiple schemas for one csv file.
* Inheritance of schemas, rules and columns. Define parent schema and override some rules in the child schemas. Make it DRY and easy to maintain.
* Validate syntax and options in the schema file. It's important to know if the schema file is valid and can be used for validation.
* If option `--schema` is not specified, then validate only super base level things (like "is it a CSV file?").
* Complex rules (like "if field `A` is not empty, then field `B` should be not empty too").
* Extending with custom rules and custom report formats. Plugins?
* Input encoding detection + `BOM` (right now it's experimental). It works but not so accurate... UTF-8/16/32 is the best choice for now.

**Release workflow**
* Build and release Docker image [via GitHub Actions, tags and labels](https://docs.docker.com/build/ci/github-actions/manage-tags-labels/). Review it.
* Build phar file and release via GitHub Actions.
* Auto insert tool version into the Docker image and phar file. It's important to know the version of the tool you are using.
* Show version as part of output.

**Performance and optimization**
* Benchmarks as part of the CI(?) and Readme. It's important to know how much time the validation process takes.
* Optimazation on `php.ini` level to start it faster. JIT, opcache, preloading, etc.
* Parallel validation of really-really large files (1GB+ ?). I know you have them and not so much memory.
* Parallel validation of multiple files at once.

**Mock data generation**
* Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
* Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.

**Reporting**
* More report formats (like JSON, XML, etc). Any ideas?
* Gitlab and JUnit reports must be as one structure. It's not so easy to implement. But it's a good idea.
* Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to see all errors in one place. Especially for GitLab and JUnit reports.

**Misc**
* Use it as PHP SDK. Examples in Readme.
* Warnings about deprecated options and features.
* Warnings about invalid schema files.
* Move const:HELP to PHP annotations. Canonic way to describe the command.
* S3 Storage support. Validate files in the S3 bucket?
* More examples and documentation.
* **Batch processing**
  * If option `--csv` is not specified, then the STDIN is used. To build a pipeline in Unix-like systems.
  * Discovering CSV files by `filename_pattern` in the schema file. In case you have a lot of schemas and a lot of CSV files and want to automate the process as one command.
  * Flag to ignore file name pattern. It's useful when you have a lot of files, and you don't want to validate the file name.

* **Validation**
  * [More aggregate rules](https://github.com/markrogoyski/math-php#statistics---descriptive).
  * [More cell rules](https://github.com/Respect/Validation).
  * `required` flag for the column.
  * Custom cell rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
  * Custom aggregate rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
  * Configurable keyword for null/empty values. By default, it's an empty string. But you will use `null`, `nil`, `none`, `empty`, etc. Overridable on the column level.
  * Handle empty files and files with only a header row, or only with one line of data. One column without header is also possible.
  * Using multiple schemas for one CSV file.
  * Inheritance of schemas, rules and columns. Define parent schema and override some rules in the child schemas. Make it DRY and easy to maintain.
  * If option `--schema` is not specified, then validate only super base level things (like "is it a CSV file?").
  * Complex rules (like "if field `A` is not empty, then field `B` should be not empty too").
  * Extending with custom rules and custom report formats. Plugins?
  * Input encoding detection + `BOM` (right now it's experimental). It works but not so accurate... UTF-8/16/32 is the best choice for now.

* **Release workflow**
  * Build and release Docker image [via GitHub Actions, tags and labels](https://docs.docker.com/build/ci/github-actions/manage-tags-labels/). Review it.
  * Build phar file and release via GitHub Actions.
  * Auto insert tool version into the Docker image and phar file. It's important to know the version of the tool you are using.
  * Show version as part of output.

* **Performance and optimization**
  * Benchmarks as part of the CI(?) and Readme. It's important to know how much time the validation process takes.
  * Optimization on `php.ini` level to start it faster. JIT, opcache, preloading, etc.
  * Parallel validation of really-really large files (1GB+ ?). I know you have them and not so much memory.
  * Parallel validation of multiple files at once.

* **Mock data generation**
  * Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
  * Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.

* **Reporting**
  * More report formats (like JSON, XML, etc). Any ideas?
  * GitLab and JUnit reports must be as one structure. It's not so easy to implement. But it's a good idea.
  * Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to see all errors in one place. Especially for GitLab and JUnit reports.

* **Misc**
  * Use it as PHP SDK. Examples in Readme.
  * Warnings about deprecated options and features.
  * Warnings about invalid schema files.
  * Move const:HELP to PHP annotations. Canonic way to describe the command.
  * S3 Storage support. Validate files in the S3 bucket?
  * More examples and documentation.


PS. [There is a file](tests/schemas/todo.yml) with my ideas and imagination. It's not valid schema file, just a draft.
@@ -616,12 +615,12 @@ I'm not sure if I will implement all of them. But I will try to do my best.

## Disadvantages?

There is a perception that PHP is a slow language. I don't agree with that. You just need to know how to prepare it.
It is perceived that PHP is a slow language. I don't agree with that. You just need to know how to prepare it.
See [Processing One Billion CSV rows in PHP!](https://dev.to/realflowcontrol/processing-one-billion-rows-in-php-3eg0).
That is, if you do everything right, you can read, aggregate and calculate data from CSV at **~15 million lines per second**!

* Yeah-yeah. I know it's not the fastest tool in the world. But it's not the slowest either. See link above.
* Yeah-yeah. I know it's PHP (not Python, Go, Pyspark...). PHP is not the best language for such tasks.
* Yeah-yeah. I know it's PHP (not Python, Go, PySpark...). PHP is not the best language for such tasks.
* Yeah-yeah. It looks like a standalone binary. Right. Just use it, don't think about how it works.
* Yeah-yeah. I know you can't use it as a Python SDK as part of a pipeline.

@@ -637,10 +636,9 @@ So... as strictly as possible in today's PHP world. I think it works as expected

## Interesting fact

I think I've set a personal record.
The first version was written from scratch in about 3 days (with really frequent breaks to take care of 4 month baby).
I've set a personal record. The first version was written from scratch in about 3 days (with really frequent breaks to take care of a 4-month-old baby).
I'm looking at the first commit and the very first git tag. I'd say over the weekend, in my spare time on my personal laptop.
Well... AI I only used for this Readme file because I'm not very good at English. 🤔
Well... AI was only used for this Readme file because I'm not very good at English. 🤔

I seem to be typing fast and I had really great inspiration. I hope my wife doesn't divorce me. 😅

21 changes: 14 additions & 7 deletions tests/ReadmeTest.php
@@ -64,19 +64,26 @@ public function testBadgeOfRules(): void
    {
        $cellRules = \count(yml(Tools::SCHEMA_FULL_YML)->findArray('columns.0.rules'));
        $aggRules = \count(yml(Tools::SCHEMA_FULL_YML)->findArray('columns.0.aggregate_rules'));
        $totalRules = $cellRules + $aggRules;
        $extraRules = 1 + 1; // filename_pattern, schema validation
        $totalRules = $cellRules + $aggRules + $extraRules;

        $badge = static function (string $label, int $count): string {
        $badge = static function (string $label, int $count, string $url = ''): string {
            $label = \str_replace(' ', '%20', $label);

            return "![Static Badge](https://img.shields.io/badge/Rules-{$count}-green" .
            $badge = "![Static Badge](https://img.shields.io/badge/Rules-{$count}-green" .
                "?label={$label}&labelColor=blue&color=gray)";

            if ($url) {
                return "[{$badge}]({$url})";
            }

            return $badge;
        };

        $text = \implode(' ', [
            $badge('Total Number of Rules', $totalRules),
            $badge('Cell Rules', $cellRules),
            $badge('Aggregate Rules', $aggRules),
            $badge('Total Number of Rules', $totalRules, 'schema-examples/full.yml'),
            $badge('Cell Rules', $cellRules, 'src/Rules/Cell'),
            $badge('Aggregate Rules', $aggRules, 'src/Rules/Aggregate'),
            $badge('Extra Checks', $extraRules, 'schema-examples/full.yml'),
        ]);

        Tools::insertInReadme('rules-counter', $text);
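
To see what the updated closure produces, here is a minimal standalone sketch. `buildBadge` is a hypothetical free function that mirrors the new version of the `$badge` closure above; it is not part of the repository, and the counts passed in are just the values shown in the README badges.

```php
<?php

declare(strict_types=1);

// Hypothetical standalone helper mirroring the closure in testBadgeOfRules().
function buildBadge(string $label, int $count, string $url = ''): string
{
    // Spaces in the label are percent-encoded for the badge URL.
    $label = \str_replace(' ', '%20', $label);

    $badge = "![Static Badge](https://img.shields.io/badge/Rules-{$count}-green"
        . "?label={$label}&labelColor=blue&color=gray)";

    // Wrap the image in a Markdown link only when a target is given.
    return $url !== '' ? "[{$badge}]({$url})" : $badge;
}

echo buildBadge('Cell Rules', 55, 'src/Rules/Cell'), \PHP_EOL;
echo buildBadge('Extra Checks', 2), \PHP_EOL;
```

With a URL the badge becomes a clickable link, which is the form used in the new `rules-counter` line of the README; without one, the output is the plain image syntax from before this commit.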
11 changes: 11 additions & 0 deletions tests/SchemaTest.php
@@ -325,4 +325,15 @@ public function testMatchTypes(): void

        isSame([], $invalidPairs);
    }

    public function testTodoList(): void
    {
        isSame(
            [],
            Tools::findKeysToRemove(
                yml(Tools::SCHEMA_FULL_YML)->getArrayCopy(),
                yml(Tools::SCHEMA_TODO)->getArrayCopy(),
            ),
        );
    }
}
24 changes: 24 additions & 0 deletions tests/Tools.php
@@ -41,6 +41,8 @@ final class Tools
    public const SCHEMA_FULL_PHP = './schema-examples/full.php';
    public const SCHEMA_INVALID = './tests/schemas/invalid_schema.yml';

    public const SCHEMA_TODO = './tests/schemas/todo.yml';

    public const DEMO_YML_VALID = './tests/schemas/demo_valid.yml';
    public const DEMO_YML_INVALID = './tests/schemas/demo_invalid.yml';
    public const DEMO_CSV = './tests/fixtures/demo.csv';
@@ -117,4 +119,26 @@ public static function insertInReadme(string $code, string $content): void

isFileContains($result, self::README);
}

public static function findKeysToRemove(array $current, array $todo, $path = ''): array
{
$keysToRemove = [];

foreach ($todo as $key => $value) {
$currentPath = $path === '' ? $key : $path . '.' . $key;

if (\array_key_exists($key, $current)) {
if (\is_array($value) && \is_array($current[$key])) {
$keysToRemove = \array_merge(
$keysToRemove,
self::findKeysToRemove($current[$key], $value, $currentPath),
);
} else {
$keysToRemove[] = $currentPath;
}
}
}

return $keysToRemove;
}
}
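
A small worked example may help show what `findKeysToRemove` reports. The sketch below copies the same recursion into a free function so it runs on its own; the two arrays are made-up stand-ins, not the real contents of `full.yml` or `todo.yml`, and the key names in them are hypothetical.

```php
<?php

declare(strict_types=1);

// Standalone copy of the recursive comparison, for illustration only.
function findKeysToRemove(array $current, array $todo, string $path = ''): array
{
    $keysToRemove = [];

    foreach ($todo as $key => $value) {
        $currentPath = $path === '' ? (string)$key : $path . '.' . $key;

        if (\array_key_exists($key, $current)) {
            if (\is_array($value) && \is_array($current[$key])) {
                // Both sides are arrays: descend and collect nested overlaps.
                $keysToRemove = \array_merge(
                    $keysToRemove,
                    findKeysToRemove($current[$key], $value, $currentPath),
                );
            } else {
                // The key from the todo draft already exists in the current schema.
                $keysToRemove[] = $currentPath;
            }
        }
    }

    return $keysToRemove;
}

// Hypothetical data: NOT the real full.yml or todo.yml.
$current = ['filename_pattern' => '/demo\.csv$/', 'csv' => ['delimiter' => ',']];
$todo    = ['csv' => ['delimiter' => ';', 'auto_detect' => true], 'includes' => []];

\print_r(findKeysToRemove($current, $todo));
```

Here only `csv.delimiter` is reported, because it is the one leaf key that appears in both arrays. `testTodoList` expects an empty result for the real files, so the test fails as soon as a key listed in the todo draft also shows up in the full schema example.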
