Add 'charset' validation (#132)
Introduced a "charset" property in the CSV schema examples for validating
whether a string is in a specific charset. Created a new Charset rule class
and a corresponding test to handle this validation. The README and schema
examples are updated to reflect this new feature.
SmetDenis committed Apr 3, 2024
1 parent 9c50b12 commit 014cbd2
Showing 9 changed files with 153 additions and 9 deletions.
16 changes: 13 additions & 3 deletions README.md
@@ -11,11 +11,11 @@
<!-- auto-update:/top-badges -->

<!-- auto-update:rules-counter -->
-[![Static Badge](https://img.shields.io/badge/Rules-382-green?label=Total%20number%20of%20rules&labelColor=darkgreen&color=gray)](schema-examples/full.yml)
-[![Static Badge](https://img.shields.io/badge/Rules-168-green?label=Cell%20rules&labelColor=blue&color=gray)](src/Rules/Cell)
+[![Static Badge](https://img.shields.io/badge/Rules-383-green?label=Total%20number%20of%20rules&labelColor=darkgreen&color=gray)](schema-examples/full.yml)
+[![Static Badge](https://img.shields.io/badge/Rules-169-green?label=Cell%20rules&labelColor=blue&color=gray)](src/Rules/Cell)
[![Static Badge](https://img.shields.io/badge/Rules-206-green?label=Aggregate%20rules&labelColor=blue&color=gray)](src/Rules/Aggregate)
[![Static Badge](https://img.shields.io/badge/Rules-8-green?label=Extra%20checks&labelColor=blue&color=gray)](#extra-checks)
-[![Static Badge](https://img.shields.io/badge/Rules-27/54/9-green?label=Plan%20to%20add&labelColor=gray&color=gray)](tests/schemas/todo.yml)
+[![Static Badge](https://img.shields.io/badge/Rules-26/54/9-green?label=Plan%20to%20add&labelColor=gray&color=gray)](tests/schemas/todo.yml)
<!-- auto-update:/rules-counter -->

A console utility designed for validating CSV files against a strictly defined schema and validation rules outlined
@@ -504,6 +504,16 @@ columns:
# - haval128,4, haval160,4, haval192,4, haval224,4, haval256,4, haval128,5, haval160,5, haval192,5, haval224,5, haval256,5
hash: set_algo # Example: "1234567890abcdef".

+# Check if a string is in a specific charset. Available charsets:
+# - 7bit, 8bit, ASCII, ArmSCII-8, BASE64, BIG-5, CP850, CP866, CP932, CP936
+# - CP950, CP50220, CP50221, CP50222, CP51932, EUC-CN, EUC-JP, EUC-JP-2004, EUC-KR, EUC-TW
+# - GB18030, HTML-ENTITIES, HZ, ISO-2022-JP, ISO-2022-JP-2004, ISO-2022-JP-MOBILE#KDDI, ISO-2022-JP-MS, ISO-2022-KR, ISO-8859-1, ISO-8859-2
+# - ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14
+# - ISO-8859-15, ISO-8859-16, JIS, KOI8-R, KOI8-U, Quoted-Printable, SJIS, SJIS-2004, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI
+# - SJIS-Mobile#SOFTBANK, SJIS-mac, SJIS-win, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UHC
+# - UTF-7, UTF-8, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, UTF-16, UTF-16BE, UTF-16LE, UTF-32
+# - UTF-32BE, UTF-32LE, UTF7-IMAP, UUENCODE, Windows-1251, Windows-1252, Windows-1254, eucJP-win
+charset: charset_code # Validates if a string is in a specific charset. Example: "UTF-8".

####################################################################################################################
# Data validation for the entire(!) column using different data aggregation methods.
3 changes: 2 additions & 1 deletion schema-examples/full.json
@@ -153,7 +153,8 @@
"is_consonant" : true,
"is_alnum" : true,
"is_alpha" : true,
-"hash" : "set_algo"
+"hash" : "set_algo",
+"charset" : "charset_code"
},
"aggregate_rules" : {
"is_unique" : true,
3 changes: 2 additions & 1 deletion schema-examples/full.php
@@ -174,7 +174,8 @@
'is_alnum' => true,
'is_alpha' => true,

-'hash' => 'set_algo',
+'hash' => 'set_algo',
+'charset' => 'charset_code',
],

'aggregate_rules' => [
10 changes: 10 additions & 0 deletions schema-examples/full.yml
@@ -245,6 +245,16 @@ columns:
# - haval128,4, haval160,4, haval192,4, haval224,4, haval256,4, haval128,5, haval160,5, haval192,5, haval224,5, haval256,5
hash: set_algo # Example: "1234567890abcdef".

+# Check if a string is in a specific charset. Available charsets:
+# - 7bit, 8bit, ASCII, ArmSCII-8, BASE64, BIG-5, CP850, CP866, CP932, CP936
+# - CP950, CP50220, CP50221, CP50222, CP51932, EUC-CN, EUC-JP, EUC-JP-2004, EUC-KR, EUC-TW
+# - GB18030, HTML-ENTITIES, HZ, ISO-2022-JP, ISO-2022-JP-2004, ISO-2022-JP-MOBILE#KDDI, ISO-2022-JP-MS, ISO-2022-KR, ISO-8859-1, ISO-8859-2
+# - ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14
+# - ISO-8859-15, ISO-8859-16, JIS, KOI8-R, KOI8-U, Quoted-Printable, SJIS, SJIS-2004, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI
+# - SJIS-Mobile#SOFTBANK, SJIS-mac, SJIS-win, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UHC
+# - UTF-7, UTF-8, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, UTF-16, UTF-16BE, UTF-16LE, UTF-32
+# - UTF-32BE, UTF-32LE, UTF7-IMAP, UUENCODE, Windows-1251, Windows-1252, Windows-1254, eucJP-win
+charset: charset_code # Validates if a string is in a specific charset. Example: "UTF-8".

####################################################################################################################
# Data validation for the entire(!) column using different data aggregation methods.
2 changes: 2 additions & 0 deletions schema-examples/full_clean.yml
@@ -184,6 +184,8 @@ columns:

hash: set_algo

+charset: charset_code

aggregate_rules:
is_unique: true
sorted: [ asc, natural ]
68 changes: 68 additions & 0 deletions src/Rules/Cell/Charset.php
@@ -0,0 +1,68 @@
<?php

/**
 * JBZoo Toolbox - Csv-Blueprint.
 *
 * This file is part of the JBZoo Toolbox project.
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
 *
 * @license MIT
 * @copyright Copyright (C) JBZoo.com, All rights reserved.
 * @see https://github.com/JBZoo/Csv-Blueprint
 */

declare(strict_types=1);

namespace JBZoo\CsvBlueprint\Rules\Cell;

use Respect\Validation\Validator;

final class Charset extends AbstractCellRule
{
    public function getHelpMeta(): array
    {
        return [
            self::getHelpTitle(),
            [
                self::DEFAULT => [
                    'charset_code',
                    'Validates if a string is in a specific charset. Example: "UTF-8".',
                ],
            ],
        ];
    }

    public function validateRule(string $cellValue): ?string
    {
        if ($cellValue === '') {
            return null;
        }

        $charset = $this->getOptionAsString();
        if ($charset === '') {
            return 'The charset is not specified.';
        }

        if (!Validator::charset($charset)->validate($cellValue)) {
            return "The value \"<c>{$cellValue}</c>\" is not in the charset \"{$charset}\".";
        }

        return null;
    }

    private static function getHelpTitle(): array
    {
        $maxOnLine = 10;
        $list = \mb_list_encodings();
        \sort($list, \SORT_NATURAL);
        $lines = \array_chunk($list, $maxOnLine);

        $result = ['Check if a string is in a specific charset. Available charsets:'];
        foreach ($lines as $line) {
            $result[] = ' - ' . \implode(', ', $line);
        }

        return $result;
    }
}
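
The help text built by `getHelpTitle()` above comes down to a natural-order sort followed by chunking into lines of ten names. A minimal standalone sketch of that step is shown here with a fixed input list, so the result does not depend on the installed mbstring build. `formatEncodingList` is a hypothetical helper name used only for illustration; it is not part of Csv-Blueprint.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Csv-Blueprint): sorts names in natural
 * order and joins them into " - a, b, c" lines of at most $perLine items.
 *
 * @param list<string> $names
 * @return list<string>
 */
function formatEncodingList(array $names, int $perLine = 10): array
{
    \sort($names, \SORT_NATURAL);

    $lines = [];
    foreach (\array_chunk($names, $perLine) as $chunk) {
        $lines[] = ' - ' . \implode(', ', $chunk);
    }

    return $lines;
}

// Example with a fixed list; the rule itself passes \mb_list_encodings().
foreach (formatEncodingList(['UTF-8', 'ASCII', '7bit', 'CP850'], 2) as $line) {
    echo $line, \PHP_EOL;
}
```

Because the real rule feeds in `\mb_list_encodings()`, the generated help text (and therefore the README block) can vary with the PHP version and mbstring build.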
53 changes: 53 additions & 0 deletions tests/Rules/Cell/CharsetTest.php
@@ -0,0 +1,53 @@
<?php

/**
 * JBZoo Toolbox - Csv-Blueprint.
 *
 * This file is part of the JBZoo Toolbox project.
 * For the full copyright and license information, please view the LICENSE
 * file that was distributed with this source code.
 *
 * @license MIT
 * @copyright Copyright (C) JBZoo.com, All rights reserved.
 * @see https://github.com/JBZoo/Csv-Blueprint
 */

declare(strict_types=1);

namespace JBZoo\PHPUnit\Rules\Cell;

use JBZoo\CsvBlueprint\Rules\Cell\Charset;
use JBZoo\PHPUnit\Rules\TestAbstractCellRule;

use function JBZoo\PHPUnit\isSame;

final class CharsetTest extends TestAbstractCellRule
{
    protected string $ruleClass = Charset::class;

    public function testPositive(): void
    {
        $rule = $this->create('UTF-8');
        isSame('', $rule->test(''));
        isSame('', $rule->test('USD'));
        isSame('', $rule->test('EUR'));

        $rule = $this->create('ASCII');
        isSame('', $rule->test(\mb_convert_encoding('strawberry', 'ASCII')));
    }

    public function testNegative(): void
    {
        $rule = $this->create('ASCII');
        isSame(
            'The value "日本国" is not in the charset "ASCII".',
            $rule->test('日本国'),
        );

        $rule = $this->create('');
        isSame(
            'The charset is not specified.',
            $rule->test('日本国'),
        );
    }
}
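
The cases covered by this test can be approximated outside the project with PHP's mbstring extension. The sketch below is a stand-in under stated assumptions: it uses `mb_check_encoding()` instead of the `Respect\Validation` charset validator that the rule actually calls, so results may differ for some encodings, and `checkCharset` is a hypothetical function name. It mirrors the rule's control flow: empty cells pass, an empty charset is reported, and a mismatch produces an error message.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for Charset::validateRule().
 * Returns null when the value passes, or an error message otherwise.
 * Assumption: mb_check_encoding() replaces the Respect\Validation validator.
 */
function checkCharset(string $cellValue, string $charset): ?string
{
    if ($cellValue === '') {
        return null; // Empty cells are skipped, as in the rule.
    }

    if ($charset === '') {
        return 'The charset is not specified.';
    }

    if (!\mb_check_encoding($cellValue, $charset)) {
        return "The value \"{$cellValue}\" is not in the charset \"{$charset}\".";
    }

    return null;
}

\var_dump(checkCharset('USD', 'UTF-8'));
\var_dump(checkCharset('日本国', 'ASCII'));
\var_dump(checkCharset('日本国', ''));
```

Note that on PHP 8, `mb_check_encoding()` throws a `ValueError` for an encoding name it does not recognise, so a production version would likely need to guard against unknown charset codes.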
6 changes: 3 additions & 3 deletions tests/Tools.php
@@ -127,12 +127,12 @@ public static function insertInReadme(string $code, string $content): void
\file_get_contents(self::README),
);

-$sizeBefore = \filesize(self::README);
+$hashBefore = \hash_file('md5', self::README);
\clearstatcache(true, self::README);
isTrue(\file_put_contents(self::README, $result) > 0);
-$sizeAfter = \filesize(self::README);
+$hashAfter = \hash_file('md5', self::README);

-isSame($sizeAfter, $sizeBefore, "README.md was not updated. Code: {$code}");
+isSame($hashAfter, $hashBefore, "README.md was not updated. Code: {$code}");
isFileContains($result, self::README);
}
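
The switch from `filesize()` to `hash_file()` in this hunk matters because two different README contents can have the same byte length, for example when a badge counter changes from 382 to 383 as it does in this commit. A size comparison would treat such a rewrite as "no change", while a hash comparison would not. A minimal sketch of that situation, assuming a writable temporary directory; `sizeAndHashChanged` is a hypothetical helper, not part of `tests/Tools.php`:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: writes $before and then $after to a temp file and
 * reports whether the file size and the MD5 hash changed between the writes.
 *
 * @return array{sizeChanged: bool, hashChanged: bool}
 */
function sizeAndHashChanged(string $before, string $after): array
{
    $path = \tempnam(\sys_get_temp_dir(), 'readme_');
    if ($path === false) {
        throw new \RuntimeException('Could not create a temporary file.');
    }

    try {
        \file_put_contents($path, $before);
        \clearstatcache(true, $path);
        $sizeBefore = \filesize($path);
        $hashBefore = \hash_file('md5', $path);

        \file_put_contents($path, $after);
        \clearstatcache(true, $path); // filesize() results are cached per path.
        $sizeAfter = \filesize($path);
        $hashAfter = \hash_file('md5', $path);
    } finally {
        \unlink($path);
    }

    return [
        'sizeChanged' => $sizeBefore !== $sizeAfter,
        'hashChanged' => $hashBefore !== $hashAfter,
    ];
}

// Same length, different content: only the hash comparison notices.
\var_dump(sizeAndHashChanged('Rules-382', 'Rules-383'));
```

MD5 is used here only as a cheap change detector, as in the commit; nothing in this check depends on cryptographic strength.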

1 change: 0 additions & 1 deletion tests/schemas/todo.yml
@@ -75,7 +75,6 @@ columns:
is_card_number: true

# Strings
-is_charset: true
is_hex_rgb_color: true
no_whitespace: true
