Types and Traits for Character Sets, Encodings, and Character Set Encodings

CharSetEncodings


Architecture

This package provides the basic types and methods for working with character sets, encodings, and character set encodings (CSEs).

Types

Currently, there are the following types:

  • CodeUnitTypes: a Union of the three code unit types (UInt8, UInt16, UInt32), for convenience
  • CharSet: a struct type, parameterized by the name of the character set and the type needed to represent a code point
  • Encoding: a struct type, parameterized by the name of the encoding

Built-in Character Sets / Character Set Encodings

  • Binary: for storing non-textual data as a sequence of bytes, 0-0xff

  • ASCII: ASCII (Unicode subset, 0-0x7f)

  • Latin: Latin-1 (ISO-8859-1) (Unicode subset, 0-0xff)

  • UCS2: UCS-2 (Unicode subset, 0-0xd7ff, 0xe000-0xffff, BMP only, no surrogates)

  • UTF32: UTF-32 (full Unicode, 0-0xd7ff, 0xe000-0x10ffff)

  • UniPlus: unvalidated Unicode (i.e. like String, can contain invalid code points)

  • Text1: unknown 1-byte character set

  • Text2: unknown 2-byte character set

  • Text4: unknown 4-byte character set
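
The code point ranges listed above can be read as a simple membership test. The following is a minimal Python sketch of that idea, not the package's Julia API: the `RANGES` table and the `in_charset` function are hypothetical names introduced only for illustration, and the ranges are copied from the list above.

```python
# Code point ranges copied from the list above. The table and the function
# name are illustrative only; they are not part of CharSetEncodings.
RANGES = {
    "Binary": [(0x00, 0xFF)],
    "ASCII": [(0x00, 0x7F)],
    "Latin": [(0x00, 0xFF)],
    "UCS2": [(0x0000, 0xD7FF), (0xE000, 0xFFFF)],
    "UTF32": [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)],
}


def in_charset(name: str, code_point: int) -> bool:
    """Return True if code_point lies in one of the ranges listed for name."""
    return any(lo <= code_point <= hi for lo, hi in RANGES[name])
```

For example, `in_charset("UCS2", 0xD800)` is false because surrogates are excluded, and `in_charset("UCS2", 0x1F600)` is false because that code point lies outside the BMP, while `in_charset("UTF32", 0x1F600)` is true.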

Built-in Encodings

  • UTF8Encoding
  • Native1Byte
  • Native2Byte
  • Native4Byte
  • NativeUTF16
  • Swapped4Byte
  • Swapped2Byte
  • SwappedUTF16
  • LE2
  • BE2
  • LE4
  • BE4
  • UTF16LE
  • UTF16BE
  • 2Byte
  • 4Byte
  • UTF16
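
Several of these names differ only in byte order. Assuming the names mean what they suggest (LE and BE for little- and big-endian, Native and Swapped relative to the host machine), the difference for a single 16-bit code unit can be sketched in Python with the standard `struct` and `sys` modules. This is an illustration of byte order only and does not use the package itself.

```python
import struct
import sys

code_unit = 0x20AC  # EURO SIGN, which fits in one 16-bit code unit

little = struct.pack("<H", code_unit)  # bytes AC 20 (little-endian)
big = struct.pack(">H", code_unit)     # bytes 20 AC (big-endian)
native = struct.pack("=H", code_unit)  # whichever order the host uses

# "Swapped" is simply the order the host does not use.
native_is_little = sys.byteorder == "little"
swapped = big if native_is_little else little
```

On a little-endian host, `native` equals `little` and `swapped` equals `big`; on a big-endian host the roles are reversed.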

Built-in CSEs

  • BinaryCSE, Text1CSE, ASCIICSE, LatinCSE

  • Text2CSE, UCS2CSE

  • Text4CSE, UTF32CSE

  • UTF8CSE: UTF32CharSet, all valid, using UTF8Encoding; conforms to the Unicode Standard, i.e. no overlong encodings, surrogates, or invalid bytes.

  • RawUTF8CSE: UniPlusCharSet, not validated, using UTF8Encoding; may contain invalid sequences and overlong encodings, and may encode surrogates and characters up to 0x7fffffff.

  • UTF16CSE: UTF32CharSet, all valid, using the UTF16 encoding (native order); conforms to the Unicode Standard, i.e. no out-of-order or isolated surrogates.
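
The distinction between validated and raw byte sequences can be illustrated with Python's strict decoders, which reject the same kinds of input that the validated CSEs above exclude. This is a sketch of the validity rules only, not of how the package validates; `is_strict_utf8` and `is_strict_utf16le` are hypothetical helper names, and little-endian order is fixed here for reproducibility, whereas UTF16CSE uses native order.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 (no overlong forms, no surrogates)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_strict_utf16le(data: bytes) -> bool:
    """True if data is well-formed little-endian UTF-16 (surrogates paired)."""
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True
```

Under these rules, the overlong sequence `b"\xc0\x80"` and the encoded surrogate `b"\xed\xa0\x80"` are rejected as UTF-8, and a low surrogate placed before a high surrogate is rejected as UTF-16. Bytes like these fall outside UTF8CSE and UTF16CSE, and the UTF-8 cases are the kind of data that RawUTF8CSE is described as tolerating.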

Internal Unicode subset types

  • _LatinCSE: indicates that the string has at least one character > 0x7f, and all characters are <= 0xff
  • _UCS2CSE: indicates that the string has at least one character > 0xff, and all characters are <= 0xffff
  • _UTF32CSE: indicates that the string has at least one non-BMP character
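
The thresholds above amount to classifying a string by its largest code point. Here is a minimal Python sketch of that rule, assuming the input is already valid Unicode. `narrowest_subset` is a hypothetical name, the "ASCIICSE" label for all-ASCII input is an assumption, and how the package itself makes this decision is not shown in this README.

```python
def narrowest_subset(text: str) -> str:
    """Label text by its largest code point, using the thresholds above."""
    top = max(map(ord, text), default=0)  # an empty string counts as ASCII
    if top <= 0x7F:
        return "ASCIICSE"   # assumption: all-ASCII text needs no subset type
    if top <= 0xFF:
        return "_LatinCSE"  # at least one character > 0x7f, all <= 0xff
    if top <= 0xFFFF:
        return "_UCS2CSE"   # at least one character > 0xff, all <= 0xffff
    return "_UTF32CSE"      # at least one non-BMP character
```

For instance, `narrowest_subset("caf\u00e9")` gives `"_LatinCSE"`, `narrowest_subset("\u20ac")` gives `"_UCS2CSE"`, and `narrowest_subset("\U0001F600")` gives `"_UTF32CSE"`.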

API

  • The cse function returns the character set encoding for a string type or string. It returns RawUTF8CSE as a fallback for AbstractString (i.e. the same as for String).
  • The charset function returns the character set for a string type, string, character type, or character.
  • The encoding function returns the encoding for a type or string.
  • The codeunit function returns the code unit type used for a character set encoding.
  • The cs"..." string macro creates a CharSet type with that name.
  • The enc"..." string macro creates an Encoding type with that name.
  • The @cse(cs, enc) macro creates a character set encoding with the given character set and encoding.

The package also exports the constant Bool flags BIG_ENDIAN and LITTLE_ENDIAN.
