Skip to content

doherty/Text-Unidecode

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NAME
    Text::Unidecode - US-ASCII transliterations of Unicode text

VERSION
    version 0.05

SYNOPSIS
        use utf8;
        use Text::Unidecode;
        print unidecode("\x{5317}\x{4EB0}\n"); # Those are the Chinese characters for Beijing
        # Prints 'Bei Jing'

DESCRIPTION
    It often happens that you have non-Roman text data in Unicode, but you
    can't display it -- usually because you're trying to show it to a user
    via an application that doesn't support Unicode, or because the fonts
    you need aren't accessible. You could represent the Unicode characters
    as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
    user who actually wants to read what the text says.

    What Text::Unidecode provides is a function, "unidecode(...)" that takes
    Unicode data and tries to represent it in US-ASCII characters (i.e., the
    universally displayable characters between 0x00 and 0x7F). The
    representation is almost always an attempt at *transliteration* -- i.e.,
    conveying, in Roman letters, the pronunciation expressed by the text in
    some other writing system. (See the example in the synopsis.)

    Text::Unidecode's ability to transliterate is limited by two factors:

    *   The amount and quality of data in the original

        So if you have Hebrew data that has no vowel points in it, then
        Unidecode cannot guess what vowels should appear in a
        pronounciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th
        tpt. (This is a specific application of the general principle of
        "Garbage In, Garbage Out".)

    *   Basic limitations in the Unidecode design

        Writing a real and clever transliteration algorithm for any single
        language usually requires a lot of time, and at least a passable
        knowledge of the language involved. But Unicode text can convey more
        languages than I could possibly learn (much less create a
        transliterator for) in the entire rest of my lifetime. So I put a
        cap on how intelligent Unidecode could be, by insisting that it
        support only context-*in*sensitive transliteration. That means
        missing the finer details of any given writing system, while still
        hopefully being useful.

    Unidecode, in other words, is quick and dirty. Sometimes the output is
    not so dirty at all: Russian and Greek seem to work passably; and while
    Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
    system, setting up a mapping from it to Roman letters seems to work
    pretty well. But sometimes the output is *very dirty*: Unidecode does
    quite badly on Japanese and Thai.

    If you want a smarter transliteration for a particular language than
    Unidecode provides, then you should look for (or write) a
    transliteration algorithm specific to that language, and apply it
    instead of (or at least before) applying Unidecode.

    In other words, Unidecode's approach is broad (knowing about dozens of
    writing systems), but shallow (not being meticulous about any of them).

FUNCTIONS
    Text::Unidecode provides one function, "unidecode()", which is exported
    by default.

  unidecode
    "unidecode()" can be used in a variety of calling contexts:

    "$out = unidecode($in);" # scalar context
        This returns a copy of $in, transliterated.

    "$out = unidecode(@in);" # scalar context
        This is the same as "$out = unidecode(join '', @in);"

    "@out = unidecode(@in);" # list context
        This returns a list consisting of copies of @in, each
        transliterated. This is the same as "@out = map
        scalar(unidecode($_)), @in;"

    "unidecode(@items);" # void context
    "unidecode(@bar, $foo, @baz);" # void context
        Each item on input is replaced with its transliteration. This is the
        same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"

    You should make a minimum of assumptions about the output of
    "unidecode(...)". For example, if you assume an all-alphabetic (Unicode)
    string passed to "unidecode(...)" will return an all-alphabetic string,
    you're wrong -- some alphabetic Unicode characters are transliterated as
    strings containing punctuation (e.g., the Armenian letter at 0x0539
    currently transliterates as "T`".

    However, these are the assumptions you *can* make:

    *   Each character 0x0000 - 0x007F transliterates as itself. That is,
        "unidecode(...)" is 7-bit pure.

    *   The output of "unidecode(...)" always consists entirely of US-ASCII
        characters -- i.e., characters 0x0000 - 0x007F.

    *   All Unicode characters translate to a sequence of (any number of)
        characters that are newline ("\n") or in the range 0x0020-0x007E.
        That is, no Unicode character translates to "\x01", for example.
        (Although if you have a "\x01" in input, you'll get a "\x01" in
        output.)

    *   Yes, some transliterations produce a "\n" -- but just a few, and
        only with good reason. Note that the value of newline ("\n") varies
        from platform to platform -- see perlport.

    *   Some Unicode characters may transliterate to nothing (i.e., empty
        string).

    *   Very many Unicode characters transliterate to multi-character
        sequences. E.g., Han character 0x5317 transliterates as the
        four-character string "Bei ".

    *   Within these constraints, I may change the transliteration of
        characters in future versions. For example, if someone convinces me
        that the Armenian letter at 0x0539, currently transliterated as
        "T`", would be better transliterated as "D", I may well make that
        change.

  remap
    This remaps a transliteration temporarily. Give it one character, and
    the desired transliteration:

        my %remap = (
            'A' => 'omg',
            'B' => 'really?!',
        );
        while (my ($char, $remap) = each %remap) {
            remap($char, $remap);
        }
        print unidecode('ABCDE'); # see your alternate transliterations

  unidecode_to_charset
    This attempts to "unidecode()" to any arbitrary charset, rather than to
    US-ASCII. The function is not exported by default.

    Pass the target charset as the first parameter (default is ascii), and
    the text to transliterate as the second.

DESIGN GOALS AND CONSTRAINTS
    Text::Unidecode is meant to be a transliterator-of-last resort, to be
    used once you've decided that you can't just display the Unicode data as
    is, and once you've decided you don't have a more clever,
    language-specific transliterator available. It transliterates
    context-insensitively -- that is, a given character is replaced with the
    same US-ASCII (7-bit ASCII) character or characters, no matter what the
    surrounding character are.

    The main reason I'm making Text::Unidecode work with only
    context-insensitive substitution is that it's fast, dumb, and
    straightforward enough to be feasable. It doesn't tax my (quite limited)
    knowledge of world languages. It doesn't require me writing a hundred
    lines of code to get the Thai syllabification right (and never knowing
    whether I've gotten it wrong, because I don't know Thai), or spending a
    year trying to get Text::Unidecode to use the ChaSen algorithm for
    Japanese, or trying to write heuristics for telling the difference
    between Japanese, Chinese, or Korean, so it knows how to transliterate
    any given Uni-Han glyph. And moreover, context-insensitive substitution
    is still mostly useful, but still clearly couldn't be mistaken for
    authoritative.

    Text::Unidecode is an example of the 80/20 rule in action -- you get 80%
    of the usefulness using just 20% of a "real" solution.

    A "real" approach to transliteration for any given language can involve
    such increasingly tricky contextual factors as these

    The previous / preceding character(s)
        What a given symbol "X" means, could depend on whether it's followed
        by a consonant, or by vowel, or by some diacritic character.

    Syllables
        A character "X" at end of a syllable could mean something different
        from when it's at the start -- which is especially problematic when
        the language involved doesn't explicitly mark where one syllable
        stops and the next starts.

    Parts of speech
        What "X" sounds like at the end of a word, depends on whether that
        word is a noun, or a verb, or what.

    Meaning
        By semantic context, you can tell that this ideogram "X" means
        "shoe" (pronounced one way) and not "time" (pronounced another), and
        that's how you know to transliterate it one way instead of the
        other.

    Origin of the word
        "X" means one thing in loanwords and/or placenames (and derivatives
        thereof), and another in native words.

    "It's just that way"
        "X" normally makes the /X/ sound, except for this list of seventy
        exceptions (and words based on them, sometimes indirectly). Or: you
        never can tell which of the three ways to pronounce "X" this word
        actually uses; you just have to know which it is, so keep a
        dictionary on hand!

    Language
        The character "X" is actually used in several different languages,
        and you have to figure out which you're looking at before you can
        determine how to transliterate it.

    Out of a desire to avoid being mired in *any* of these kinds of
    contextual factors, I chose to exclude *all of them* and just stick with
    context-insensitive replacement.

MOTTO
    The Text::Unidecode motto is:

      It's better than nothing!

    ...in both meanings: 1) seeing the output of "unidecode(...)" is better
    than just having all font-unavailable Unicode characters replaced with
    "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
    nothing that Text::Unidecode's algorithm is better than.

CAVEATS
    If you get really implausible nonsense out of "unidecode(...)", make
    sure that the input data really is a utf8 string. See perlunicode.

THANKS
    Thanks to Harald Tveit Alvestrand, Abhijit Menon-Sen, and Mark-Jason
    Dominus.

SEE ALSO
    *   Unicode Consortium: <http://www.unicode.org>

    *   Geoffrey Sampson. 1990. *Writing Systems: A Linguistic
        Introduction.* ISBN: 0804717567

    *   Randall K. Barry (editor). 1997. *ALA-LC Romanization Tables:
        Transliteration Schemes for Non-Roman Scripts.* ISBN: 0844409405
        [ALA is the American Library Association; LC is the Library of
        Congress.]

    *   Rupert Snell. 2000. *Beginner's Hindi Script (Teach Yourself
        Books).* ISBN: 0658009109

AVAILABILITY
    The latest version of this module is available from the Comprehensive
    Perl Archive Network (CPAN). Visit <http://www.perl.com/CPAN/> to find a
    CPAN site near you, or see
    <http://search.cpan.org/dist/Text-Unidecode/>.

    The development version lives at
    <http://github.com/doherty/Text-Unidecode> and may be cloned from
    <git://github.com/doherty/Text-Unidecode.git>. Instead of sending
    patches, please fork this project using the standard git and github
    infrastructure.

SOURCE
    The development version is on github at
    <http://github.com/doherty/Text-Unidecode> and may be cloned from
    <git://github.com/doherty/Text-Unidecode.git>

BUGS AND LIMITATIONS
    No bugs have been reported.

    Please report any bugs or feature requests through the web interface at
    <http://github.com/doherty/Text-Unidecode/issues>.

AUTHOR
    Sean M. Burke <sburke@cpan.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2001 by Sean M. Burke.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.

About

US-ASCII transliterations of Unicode text (unmaintained fork)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages