Convert contiguous ranges of Unicode codepoints to UTF-8 byte ranges.

DEPRECATED: This crate has been folded into the regex-syntax crate and is now deprecated.

utf8-ranges

This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte ranges. This is useful when constructing byte-based automata from Unicode. Stated differently, this lets one embed UTF-8 decoding as part of one's automaton.

Dual-licensed under MIT or the UNLICENSE.

Documentation

https://docs.rs/utf8-ranges

Example

This shows how to convert a scalar value range (e.g., the Basic Multilingual Plane) to a sequence of byte-based character classes.

extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}

The output:

[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
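
To see where these boundaries come from, it can help to encode the endpoints of each scalar value range by hand. The following is a minimal sketch that uses only the standard library's `char::encode_utf8`; the helper `utf8_bytes` is a hypothetical name for illustration and is not part of this crate. The scalar value ranges listed in `main` are an assumption about how the six output lines partition U+0000 to U+FFFF, and they skip the surrogate block U+D800 to U+DFFF, which contains no scalar values.

```rust
/// Hypothetical helper (not part of utf8-ranges): the UTF-8 bytes of `c`.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // One (start, end) pair per output line. Surrogates (U+D800..=U+DFFF)
    // are not scalar values, which is why the three-byte range is split
    // around the lead byte ED.
    let ranges = [
        ('\u{0}', '\u{7F}'),
        ('\u{80}', '\u{7FF}'),
        ('\u{800}', '\u{FFF}'),
        ('\u{1000}', '\u{CFFF}'),
        ('\u{D000}', '\u{D7FF}'),
        ('\u{E000}', '\u{FFFF}'),
    ];
    for &(start, end) in ranges.iter() {
        // Print the encoded endpoints in hex for comparison with the output above.
        println!(
            "U+{:04X}..=U+{:04X}: {:02X?} ..= {:02X?}",
            start as u32,
            end as u32,
            utf8_bytes(start),
            utf8_bytes(end)
        );
    }
}
```

Comparing the first and last encoding of each range byte by byte gives the per-position byte ranges shown in the output.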

These ranges can then be used to build an automaton, because they guarantee the following:

  1. Every arbitrary sequence of bytes matches exactly one of the sequences of ranges or none of them.
  2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous encodings of surrogate codepoints in UTF-8 cannot match any of the byte ranges above.)
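
The two properties can be spot-checked without the crate. The sketch below transcribes the six sequences from the output above into a constant by hand and matches byte slices against them; `ByteRange`, `BMP_SEQUENCES`, and `matching_sequences` are hypothetical names for illustration, not this crate's API. It is a linear scan rather than a real automaton, so treat it as a way to test the claims, not as a suggested implementation.

```rust
/// An inclusive byte range (start, end), e.g. (0x80, 0xBF).
type ByteRange = (u8, u8);

/// The six sequences printed above for U+0000..=U+FFFF, transcribed by hand.
const BMP_SEQUENCES: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// Hypothetical helper: indices of every sequence that `bytes` matches in full.
fn matching_sequences(bytes: &[u8]) -> Vec<usize> {
    BMP_SEQUENCES
        .iter()
        .enumerate()
        .filter(|(_, seq)| {
            seq.len() == bytes.len()
                && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Every scalar value in U+0000..=U+FFFF encodes to bytes that match
    // exactly one sequence (from_u32 skips the surrogate code points).
    for c in (0u32..=0xFFFF).filter_map(std::char::from_u32) {
        let mut buf = [0u8; 4];
        let encoded: &str = c.encode_utf8(&mut buf);
        assert_eq!(matching_sequences(encoded.as_bytes()).len(), 1);
    }

    // Every three-byte input led by ED either matches nothing or is valid
    // UTF-8; this covers the would-be surrogate encodings ED A0 80..=ED BF BF.
    for b1 in 0..=255u8 {
        for b2 in 0..=255u8 {
            let bytes = [0xED, b1, b2];
            if !matching_sequences(&bytes).is_empty() {
                assert!(std::str::from_utf8(&bytes).is_ok());
            }
        }
    }
    println!("all checks passed");
}
```

Because the first byte ranges of the six sequences do not overlap, a real automaton can branch on the lead byte alone and never needs to backtrack.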