GitHub - 2shortplanks/Test-utf8: Handy utf-8 tests for Perl

NAME
    Test::utf8 - handy utf8 tests

SYNOPSIS
      # check the string is good
      is_valid_string($string);   # check the string is valid
      is_sane_utf8($string);      # check not double encoded

      # check the string has certain attributes
      is_flagged_utf8($string1);   # has utf8 flag set
      is_within_ascii($string2);   # only has ascii chars in it
      isnt_within_ascii($string3); # has chars outside the ascii range
      is_within_latin_1($string4); # only has latin-1 chars in it
      isnt_within_latin_1($string5); # has chars outside the latin-1 range

DESCRIPTION
    This module is a collection of tests useful for dealing with utf8
    strings in Perl.

    This module has two types of tests: The validity tests check if a string
    is valid and not corrupt, whereas the characteristics tests will check
    that string has a given set of characteristics.

  Validity Tests
    is_valid_string($string, $testname)
        Checks if the string is "valid", i.e. this passes and returns true
        unless the internal utf8 flag has been set on a scalar that isn't
        made up of a valid utf-8 byte sequence.

        This should *never* happen and, in theory, this test should always
        pass. Unless you (or a module you use) go monkeying around inside
        a scalar using Encode's private functions or XS code, you shouldn't
        ever end up in a situation where you've got a corrupt scalar. But if
        you do, then this function should help you detect the problem.

        To be clear, here's an example of the error case this can detect:

          my $mark = "Mark";
          my $leon = "L\x{e9}on";
          is_valid_string($mark);  # passes, not utf-8
          is_valid_string($leon);  # passes, not utf-8

          my $iloveny = "I \x{2665} NY";
          is_valid_string($iloveny);      # passes, proper utf-8

          my $acme = "L\x{c3}\x{a9}on";
          Encode::_utf8_on($acme);      # (please don't do things like this)
          is_valid_string($acme);       # passes, proper utf-8 byte sequence upgraded

          Encode::_utf8_on($leon);      # (this is why you don't do things like this)
          is_valid_string($leon);       # fails! the byte \x{e9} isn't valid utf-8
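
        The corrupt state shown above can also be observed with core Perl
        alone. The following is a minimal sketch, not how Test::utf8 is
        implemented: it assumes the core utf8::valid function (documented
        as internal) as the consistency check, and reuses the $leon string
        from the example.

```perl
use strict;
use warnings;
use Encode ();

my $leon = "L\x{e9}on";    # a latin-1 string, stored as bytes, flag off

# Consistent: the flag is off, so any byte sequence is acceptable.
my $before = utf8::valid($leon) ? "consistent" : "corrupt";

# Force the flag on without re-encoding (please don't do this in real code).
Encode::_utf8_on($leon);

# Inconsistent: the flag claims utf-8, but a lone \x{e9} byte is not utf-8.
my $after = utf8::valid($leon) ? "consistent" : "corrupt";

print "before: $before\n";
print "after:  $after\n";
```

        The string's bytes never change here; only the flag does, which is
        why the corruption is invisible until something inspects the
        scalar's internals.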

    is_sane_utf8($string, $name)
        This test fails if the string contains something that looks like it
        might be dodgy utf8, i.e. containing something that looks like the
        multi-byte sequence for a latin-1 character but perl hasn't been
        instructed to treat as such. Strings that are not utf8 always
        automatically pass.

        Some examples may help:

          # This will pass as it's a normal latin-1 string
          is_sane_utf8("Hello L\x{e9}on");

          # this will fail because the \x{c3}\x{a9} looks like the
          # utf8 byte sequence for e-acute
          my $string = "Hello L\x{c3}\x{a9}on";
          is_sane_utf8($string);

          # this will pass because the utf8 is correctly interpreted as utf8
          Encode::_utf8_on($string);
          is_sane_utf8($string);

        Obviously this isn't a hundred percent reliable. The edge case where
        this will fail is where you have "\x{c2}" (which is "LATIN CAPITAL
        LETTER A WITH CIRCUMFLEX") or "\x{c3}" (which is "LATIN CAPITAL
        LETTER A WITH TILDE") followed by one of the latin-1 punctuation
        symbols.

          # a capital letter A with circumflex surrounded by smart quotes
          # this will fail because it'll see the "\x{c2}\x{94}" and think
          # it's actually the utf8 sequence for the end smart quote
          is_sane_utf8("\x{93}\x{c2}\x{94}");

        However, since this rarely comes up, this test is reasonably
        reliable in most cases. Still, take care where dynamic data is
        placed next to latin-1 punctuation, to avoid spurious failures.

        There are two situations that cause this test to fail: either the
        string contains utf8 byte sequences but hasn't been flagged as utf8
        (this normally means that you got it from an external source like a
        C library; when Perl needs to store a string internally as utf8 it
        does its own encoding and flagging transparently), or a utf8
        flagged string contains byte sequences that, when translated to
        characters, themselves look like a utf8 byte sequence. The test
        diagnostics tell you which is the case.
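
        The heuristic and its edge case can be sketched with a single
        regular expression. This is a simplified illustration under stated
        assumptions, not the module's implementation: looks_double_encoded
        is a hypothetical helper that only looks for a \x{c2} or \x{c3}
        character followed by a character in the range \x{80}-\x{bf},
        which is how utf-8 encodes the code points U+0080 to U+00FF.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Test::utf8): true if the string contains
# characters that look like the utf-8 encoding of U+0080..U+00FF.
sub looks_double_encoded {
    my ($string) = @_;
    return $string =~ /[\x{c2}\x{c3}][\x{80}-\x{bf}]/ ? 1 : 0;
}

my $latin1  = "Hello L\x{e9}on";         # plain latin-1 text
my $raw     = "Hello L\x{c3}\x{a9}on";   # utf-8 bytes that were never decoded
my $decoded = decode('UTF-8', $raw);     # the same text decoded to characters
my $quoted  = "\x{93}\x{c2}\x{94}";      # the edge case described above

for my $case (
    [ latin1  => $latin1  ],
    [ raw     => $raw     ],
    [ decoded => $decoded ],
    [ quoted  => $quoted  ],
) {
    my ($label, $string) = @$case;
    printf "%-8s %s\n", $label,
        looks_double_encoded($string) ? "suspicious" : "ok";
}
```

        The quoted case is reported as suspicious even though it is
        legitimate text, which is the unreliability the section warns
        about.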

  String Characteristic Tests
    These routines allow you to check the range of characters in a string.
    Note that these routines are blind to the actual encoding perl
    internally uses to store the characters; they just check if the string
    contains only characters that can be represented in the named encoding:

    is_within_ascii
        Tests that a string only contains characters that are in the ASCII
        character set.

    is_within_latin_1
        Tests that a string only contains characters that are in latin-1.
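
    To illustrate what "blind to the actual encoding" means, here is a
    minimal sketch using character-class checks. The helpers within_ascii
    and within_latin_1 are hypothetical stand-ins written for this example,
    not the module's own code; utf8::upgrade and utf8::is_utf8 are core
    Perl functions.

```perl
use strict;
use warnings;

# Hypothetical helpers: true when no character falls outside the range.
sub within_ascii   { my ($s) = @_; return $s !~ /[^\x{00}-\x{7f}]/ ? 1 : 0 }
sub within_latin_1 { my ($s) = @_; return $s !~ /[^\x{00}-\x{ff}]/ ? 1 : 0 }

my $leon     = "L\x{e9}on";       # latin-1 characters, stored as bytes
my $upgraded = $leon;
utf8::upgrade($upgraded);         # same characters, now stored as utf-8
my $iloveny  = "I \x{2665} NY";   # U+2665 lies outside latin-1

for my $s ($leon, $upgraded, $iloveny) {
    printf "ascii=%d latin-1=%d flagged=%d\n",
        within_ascii($s), within_latin_1($s), utf8::is_utf8($s) ? 1 : 0;
}
```

    The first two strings give the same range results even though only one
    of them is flagged, because the checks operate on characters rather
    than on the stored bytes.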

    The following tests simply check whether a scalar is or isn't flagged
    as utf8 by perl's internals:

    is_flagged_utf8($string, $name)
        Passes if the string is flagged by perl's internals as utf8, fails
        if it's not.

    isnt_flagged_utf8($string, $name)
        The opposite of "is_flagged_utf8", passes if and only if the string
        isn't flagged as utf8 by perl's internals.

        Note: you can refer to this function as "isn't_flagged_utf8" if you
        really want to.
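
    The flag itself can be inspected with the core utf8::is_utf8 function.
    The sketch below is an illustration of what the flag tests look at, not
    the module's implementation, and the variable names are chosen for this
    example only.

```perl
use strict;
use warnings;

my $bytes = "L\x{e9}on";        # every character fits in one byte: flag off
my $wide  = "I \x{2665} NY";    # contains U+2665, so perl stores it as utf-8

my $copy = $bytes;
utf8::upgrade($copy);           # switch the internal storage to utf-8

printf "bytes flagged: %s\n", utf8::is_utf8($bytes) ? "yes" : "no";
printf "wide flagged:  %s\n", utf8::is_utf8($wide)  ? "yes" : "no";
printf "copy flagged:  %s\n", utf8::is_utf8($copy)  ? "yes" : "no";
printf "copy eq bytes: %s\n", $copy eq $bytes       ? "yes" : "no";
```

    The flag describes storage rather than content: the upgraded copy is
    flagged but still compares equal to the unflagged original.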

AUTHOR
    Written by Mark Fowler mark@twoshortplanks.com

COPYRIGHT
    Copyright Mark Fowler 2004,2012. All rights reserved.

    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

BUGS
    None known. Please report any to me via the CPAN RT system. See
    http://rt.cpan.org/ for more details.

SEE ALSO
    Test::DoubleEncodedEntities for testing for double encoded HTML
    entities.