Improve CSV schema validation with filename patterns (#22)
This commit adds filename pattern validation to the CSV schema: a file
whose name does not match `filename_pattern` is now reported as invalid.
It also adds a second column to the schema examples and passes the
quick-stop flag to the file and header checks, so validation can return
as soon as the first error is found.
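
A minimal sketch of the filename check described above, assuming a standalone helper rather than the library's own classes. The function name `filenameMatchesPattern` is hypothetical; the pattern and paths follow the README output shown in this commit.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper for illustration only; not part of csv-blueprint.
 * Returns true when no pattern is configured or the filename matches it.
 */
function filenameMatchesPattern(?string $pattern, string $filename): bool
{
    // An unset or empty pattern means "no filename check".
    if ($pattern === null || $pattern === '') {
        return true;
    }

    // Mirrors the diff: only a return value of 0 (no match) counts as a mismatch.
    return \preg_match($pattern, $filename) !== 0;
}

// The pattern from the README output rejects demo-3.csv ...
var_dump(filenameMatchesPattern('/demo-[12].csv$/i', './tests/fixtures/batch/sub/demo-3.csv')); // bool(false)

// ... accepts demo-1.csv ...
var_dump(filenameMatchesPattern('/demo-[12].csv$/i', './tests/fixtures/batch/demo-1.csv')); // bool(true)

// ... and a missing pattern disables the check.
var_dump(filenameMatchesPattern(null, './tests/fixtures/batch/sub/demo-3.csv')); // bool(true)
```

Because the pattern is matched against the whole path, not just the base name, it can also constrain parent directories.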
SmetDenis committed Mar 13, 2024
1 parent ad046c0 commit 7f71221
Showing 16 changed files with 165 additions and 52 deletions.
9 changes: 3 additions & 6 deletions .github/workflows/demo.yml
@@ -105,13 +105,12 @@ jobs:
- name: 👎 Invalid CSV file
run: |
docker run \
! docker run \
-v `pwd`:/parent-host \
--rm jbzoo/csv-blueprint \
validate:csv \
--csv=/parent-host/tests/fixtures/batch/*.csv \
--schema=/parent-host/tests/schemas/demo_invalid.yml
continue-on-error: true
phar:
@@ -138,11 +137,10 @@ jobs:
- name: 👎 Invalid CSV file
run: |
./build/csv-blueprint.phar \
! ./build/csv-blueprint.phar \
validate:csv \
--csv=./tests/fixtures/batch/*.csv \
--schema=./tests/schemas/demo_invalid.yml
continue-on-error: true
php:
@@ -169,8 +167,7 @@ jobs:
- name: 👎 Invalid CSV file
run: |
./csv-blueprint \
! ./csv-blueprint \
validate:csv \
--csv=./tests/fixtures/batch/*.csv \
--schema=./tests/schemas/demo_invalid.yml
continue-on-error: true
34 changes: 28 additions & 6 deletions README.md
@@ -254,9 +254,16 @@ Found CSV files: 3
| 7 | 0:Name | min_length | Value "Lois" (length: 4) is too short. Min length is 5 |
+------+------------+------------+----- demo-2.csv ---------------------------------------+
(3/3) OK: ./tests/fixtures/batch/sub/demo-3.csv
(3/3) Invalid file: ./tests/fixtures/batch/sub/demo-3.csv
+------+-----------+------------------+---- demo-3.csv -------------------------------------------+
| Line | id:Column | Rule | Message |
+------+-----------+------------------+-----------------------------------------------------------+
| 0 | | filename_pattern | Filename "./tests/fixtures/batch/sub/demo-3.csv" does not |
| | | | match pattern: "/demo-[12].csv$/i" |
+------+-----------+------------------+---- demo-3.csv -------------------------------------------+
Found 7 issues in 2 out of 3 CSV files.
Found 8 issues in 3 out of 3 CSV files.
```

@@ -307,6 +314,11 @@ This gives you great flexibility when validating CSV files.
```yml
# It's a full example of the CSV schema file in YAML format.

# Regular expression to match the file name. If not set, then no pattern check
# This way you can validate the file name before the validation process.
# Feel free to check parent directories as well.
filename_pattern: /demo(-\d+)?\.csv$/i

csv: # Here are default values. You can skip this section if you don't need to override the default values
header: true # If the first row is a header. If true, name of each column is required
delimiter: , # Delimiter character in CSV file
@@ -362,6 +374,8 @@ columns:
cardinal_direction: true # Valid cardinal direction. Examples: "N", "S", "NE", "SE", "none", ""
usa_market_name: true # Check if the value is a valid USA market name. Example: "New York, NY"

- name: "another_column"

```


@@ -370,15 +384,16 @@ columns:

```json
{
"csv" : {
"filename_pattern" : "/demo(-\\d+)?\\.csv$/i",
"csv" : {
"header" : true,
"delimiter" : ",",
"quote_char" : "\\",
"enclosure" : "\"",
"encoding" : "utf-8",
"bom" : false
},
"columns" : [
"columns" : [
{
"name" : "csv_header_name",
"description" : "Lorem ipsum",
@@ -412,7 +427,8 @@ columns:
"cardinal_direction" : true,
"usa_market_name" : true
}
}
},
{"name" : "another_column"}
]
}

@@ -422,6 +438,7 @@ columns:




<details>
<summary>Click to see: PHP Format</summary>

@@ -430,6 +447,8 @@ columns:
declare(strict_types=1);

return [
'filename_pattern' => '/demo(-\\d+)?\\.csv$/i',

'csv' => [
'header' => true,
'delimiter' => ',',
@@ -438,6 +457,7 @@ return [
'encoding' => 'utf-8',
'bom' => false,
],

'columns' => [
[
'name' => 'csv_header_name',
@@ -473,6 +493,7 @@ return [
'usa_market_name' => true,
],
],
['name' => 'another_column'],
],
];

@@ -481,6 +502,7 @@ return [
</details>



## Coming soon

It's random ideas and plans. No orderings and deadlines. <u>But batch processing is the priority #1</u>.
@@ -494,7 +516,7 @@ Batch processing
* [ ] Discovering CSV files by `filename_pattern` in the schema file. In case you have a lot of schemas and a lot of CSV files and want to automate the process as one command.

Validation
* [ ] `filename_pattern` validation with regex (like "all files in the folder should be in the format `/^[\d]{4}-[\d]{2}-[\d]{2}\.csv$/`").
* [x] ~~`filename_pattern` validation with regex (like "all files in the folder should be in the format `/^[\d]{4}-[\d]{2}-[\d]{2}\.csv$/`").~~
* [ ] Agregate rules (like "at least one of the fields should be not empty" or "all values must be unique").
* [ ] Handle empty files and files with only a header row, or only with one line of data. One column wthout header is also possible.
* [ ] Using multiple schemas for one csv file.
8 changes: 5 additions & 3 deletions schema-examples/full.json
@@ -1,13 +1,14 @@
{
"csv" : {
"filename_pattern" : "/demo(-\\d+)?\\.csv$/i",
"csv" : {
"header" : true,
"delimiter" : ",",
"quote_char" : "\\",
"enclosure" : "\"",
"encoding" : "utf-8",
"bom" : false
},
"columns" : [
"columns" : [
{
"name" : "csv_header_name",
"description" : "Lorem ipsum",
@@ -41,6 +42,7 @@
"cardinal_direction" : true,
"usa_market_name" : true
}
}
},
{"name" : "another_column"}
]
}
4 changes: 4 additions & 0 deletions schema-examples/full.php
@@ -15,6 +15,8 @@
declare(strict_types=1);

return [
'filename_pattern' => '/demo(-\\d+)?\\.csv$/i',

'csv' => [
'header' => true,
'delimiter' => ',',
@@ -23,6 +25,7 @@
'encoding' => 'utf-8',
'bom' => false,
],

'columns' => [
[
'name' => 'csv_header_name',
@@ -58,5 +61,6 @@
'usa_market_name' => true,
],
],
['name' => 'another_column'],
],
];
7 changes: 7 additions & 0 deletions schema-examples/full.yml
@@ -12,6 +12,11 @@

# It's a full example of the CSV schema file in YAML format.

# Regular expression to match the file name. If not set, then no pattern check
# This way you can validate the file name before the validation process.
# Feel free to check parent directories as well.
filename_pattern: /demo(-\d+)?\.csv$/i

csv: # Here are default values. You can skip this section if you don't need to override the default values
header: true # If the first row is a header. If true, name of each column is required
delimiter: , # Delimiter character in CSV file
@@ -66,3 +71,5 @@ columns:
is_longitude: true # Can be integer or float. Example: -89.123456
cardinal_direction: true # Valid cardinal direction. Examples: "N", "S", "NE", "SE", "none", ""
usa_market_name: true # Check if the value is a valid USA market name. Example: "New York, NY"

- name: "another_column"
39 changes: 37 additions & 2 deletions src/Csv/CsvFile.php
@@ -17,6 +17,7 @@
namespace JBZoo\CsvBlueprint\Csv;

use JBZoo\CsvBlueprint\Schema;
use JBZoo\CsvBlueprint\Utils;
use JBZoo\CsvBlueprint\Validators\Error;
use JBZoo\CsvBlueprint\Validators\ErrorSuite;
use League\Csv\Reader as LeagueReader;
@@ -82,7 +83,9 @@ public function validate(bool $quickStop = false): ErrorSuite
{
$errors = new ErrorSuite($this->getCsvFilename());

$errors->addErrorSuit($this->validateHeader())
$errors
->addErrorSuit($this->validateFile($quickStop))
->addErrorSuit($this->validateHeader($quickStop))
->addErrorSuit($this->validateEachCell($quickStop))
->addErrorSuit(self::validateAggregateRules($quickStop));

@@ -106,7 +109,7 @@ private function prepareReader(): LeagueReader
return $reader;
}

private function validateHeader(): ErrorSuite
private function validateHeader(bool $quickStop = false): ErrorSuite
{
$errors = new ErrorSuite();

@@ -125,6 +128,10 @@ private function validateHeader(): ErrorSuite

$errors->addError($error);
}

if ($quickStop && $errors->count() > 0) {
return $errors;
}
}

return $errors;
@@ -152,6 +159,34 @@ private function validateEachCell(bool $quickStop = false): ErrorSuite
return $errors;
}

private function validateFile(bool $quickStop = false): ErrorSuite
{
$errors = new ErrorSuite();

$filenamePattern = $this->schema->getFilenamePattern();
if (
$filenamePattern !== null
&& $filenamePattern !== ''
&& \preg_match($filenamePattern, $this->csvFilename) === 0
) {
$error = new Error(
'filename_pattern',
'Filename "<c>' . Utils::cutPath($this->csvFilename) .
"</c>\" does not match pattern: \"<c>{$filenamePattern}</c>\"",
'',
0,
);

$errors->addError($error);

if ($quickStop && $errors->count() > 0) {
return $errors;
}
}

return $errors;
}

private static function validateAggregateRules(bool $quickStop = false): ErrorSuite
{
$errors = new ErrorSuite();
4 changes: 2 additions & 2 deletions src/Schema.php
@@ -114,9 +114,9 @@ public function getColumn(int|string $columNameOrId): ?Column
return $column;
}

public function getFinenamePattern(): ?string
public function getFilenamePattern(): ?string
{
return $this->data->getStringNull('finename_pattern');
return Utils::prepareRegex($this->data->getStringNull('filename_pattern'));
}

public function getIncludes(): array
2 changes: 1 addition & 1 deletion src/Utils.php
@@ -47,7 +47,7 @@ public static function prepareRegex(?string $pattern, string $addDelimiter = '/'
}
}

return $addDelimiter . $pattern . $addDelimiter . 'u';
return $addDelimiter . $pattern . $addDelimiter;
}

/**
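
The effect of no longer appending the `u` modifier by default can be sketched with plain `preg_match` calls. This illustrates standard PCRE behaviour in PHP and is not code from the library; the sample value is an assumption, chosen because it contains a multi-byte character.

```php
<?php

declare(strict_types=1);

// "é" is one character but two bytes in UTF-8 (assumed sample value).
$value = "\u{00E9}";

// Without the "u" modifier the subject is treated as raw bytes,
// so "." consumes one byte and "^.$" cannot match a two-byte string.
$withoutUnicode = \preg_match('/^.$/', $value);

// With the "u" modifier the subject is treated as UTF-8,
// so "." consumes the whole character.
$withUnicode = \preg_match('/^.$/u', $value);

var_dump($withoutUnicode, $withUnicode); // int(0), int(1)
```

A schema that relies on Unicode-aware matching would presumably need to spell the modifier out, for example `/^.$/u`, since the `#.*#u` case in `testPrepareRegex` indicates that patterns which already carry delimiters and modifiers are returned unchanged.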
2 changes: 1 addition & 1 deletion tests/Blueprint/MiscTest.php
@@ -47,7 +47,7 @@ public function testPrepareRegex(): void
{
isSame(null, Utils::prepareRegex(null));
isSame(null, Utils::prepareRegex(''));
isSame('/.*/u', Utils::prepareRegex('.*'));
isSame('/.*/', Utils::prepareRegex('.*'));
isSame('#.*#u', Utils::prepareRegex('#.*#u'));
isSame('/.*/', Utils::prepareRegex('/.*/'));
isSame('/.*/ius', Utils::prepareRegex('/.*/ius'));
2 changes: 1 addition & 1 deletion tests/Blueprint/RulesTest.php
@@ -591,7 +591,7 @@ public function testRegex(): void
isSame(null, $rule->validate('aaa'));
isSame(null, $rule->validate('a'));
isSame(
'"regex" at line 0, column "prop". Value "1bc" does not match the pattern "/^a/u".',
'"regex" at line 0, column "prop". Value "1bc" does not match the pattern "/^a/".',
\strip_tags((string)$rule->validate('1bc')),
);
}
4 changes: 2 additions & 2 deletions tests/Blueprint/SchemaTest.php
@@ -44,10 +44,10 @@ public function testFilename(): void
public function testGetFinenamePattern(): void
{
$schemaEmpty = new Schema(self::SCHEMA_EXAMPLE_EMPTY);
isSame(null, $schemaEmpty->getFinenamePattern());
isSame(null, $schemaEmpty->getFilenamePattern());

$schemaFull = new Schema(self::SCHEMA_EXAMPLE_FULL);
isSame('^example\.csv$', $schemaFull->getFinenamePattern());
isSame('/^example\.csv$/', $schemaFull->getFilenamePattern());
}

public function testScvStruture(): void
