Pure-Rust, `#![no_std]` internationalization primitives: a long-term analog of
ICU (collation, number formatting, normalization, transliteration, …).
See `ROADMAP.md` for the plan toward ICU feature parity.

The foundational layer, available today, is the `unicode` module: Unicode
rune analysis driven by the official Unicode Character Database (UCD), with
character properties compiled directly into Rust `match` dispatch by an offline
code generator, so every lookup is a `const fn`, allocates nothing, and needs
no runtime initialization.

- `no_std`, no alloc: usable in embedded, kernel, and WASM contexts.
- Tables as code: the UCD is converted into a two-level paged `match` ("switch/case") index, not parsed at runtime.
- Feature-selectable ranges: compile only the slice of the codepoint space you need. Anything outside the compiled range resolves to the neutral default (`Unassigned` / `false`), so every lookup is total.
- Targets Unicode 17.0.0.
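To make the "tables as code" idea concrete, here is a minimal sketch of a two-level paged `match` lookup. The `Category` enum, the split into 256-codepoint pages, and the `page_00` helper are assumptions made for illustration; the generated tables in the crate may be laid out differently, and this toy page only knows ASCII letters and digits.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    UppercaseLetter,
    LowercaseLetter,
    DecimalNumber,
    Unassigned,
}

/// Second level: offsets within page 0x00 (toy data, ASCII letters and digits only).
const fn page_00(low: u8) -> Category {
    match low {
        0x30..=0x39 => Category::DecimalNumber,
        0x41..=0x5A => Category::UppercaseLetter,
        0x61..=0x7A => Category::LowercaseLetter,
        _ => Category::Unassigned,
    }
}

/// First level: dispatch on the page number (codepoint >> 8).
/// Pages that are not compiled in fall through to the neutral default,
/// so the function is total over every `char`.
pub const fn category(c: char) -> Category {
    let cp = c as u32;
    match cp >> 8 {
        0x00 => page_00((cp & 0xFF) as u8),
        _ => Category::Unassigned,
    }
}

/// Because the lookup is a `const fn`, it can be evaluated at compile time.
pub const UPPER_A: Category = category('A');
```

Nothing here touches the heap or needs initialization: the whole table is ordinary control flow that the compiler is free to lower to jump tables or range checks.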

```toml
[dependencies]
intl = "0.1"
```

```rust
use intl::unicode::{general_category, GeneralCategory, CharExt};

assert_eq!(general_category('A'), GeneralCategory::UppercaseLetter);
assert_eq!(general_category('中'), GeneralCategory::OtherLetter);
assert!('A'.is_uppercase());
assert!('٣'.is_numeric()); // Arabic-Indic digit three
assert!(' '.is_whitespace());
assert!(!'\u{0378}'.is_assigned()); // a reserved codepoint
```

Every predicate exists both as a free `const fn` taking a `char`
(`intl::unicode::is_uppercase('A')`) and as a method via the `CharExt` trait
(`'A'.is_uppercase()`).
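The free-function-plus-extension-trait pairing can be sketched as follows. The predicate `is_ascii_vowel` and the trait `VowelExt` are hypothetical names invented for this example, not part of `intl`. The sketch deliberately uses a predicate name that `char` does not already define, because inherent methods take precedence over trait methods in method-call syntax.

```rust
/// Free-function form: a `const fn`, so it is usable in `const` contexts.
pub const fn is_ascii_vowel(c: char) -> bool {
    matches!(c, 'a' | 'e' | 'i' | 'o' | 'u' | 'A' | 'E' | 'I' | 'O' | 'U')
}

/// Method form: an extension trait implemented for `char`.
pub trait VowelExt {
    fn is_ascii_vowel(self) -> bool;
}

impl VowelExt for char {
    #[inline]
    fn is_ascii_vowel(self) -> bool {
        // Plain path call: this resolves to the free function above,
        // not to the trait method being defined.
        is_ascii_vowel(self)
    }
}

/// The free function can initialize a constant; the trait method is not `const`.
pub const E_IS_VOWEL: bool = is_ascii_vowel('e');
```

With the trait in scope, `'E'.is_ascii_vowel()` and `is_ascii_vowel('E')` give the same answer; the method is only a forwarding convenience.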

Normalization and collation (the latter behind the `alloc` feature):

```rust
use intl::unicode::{nfc, nfd};

assert_eq!(nfc("e\u{0301}".chars()).collect::<String>(), "é");
assert_eq!(nfd("é".chars()).collect::<String>(), "e\u{0301}");

// With the `alloc` feature:
use intl::unicode::collate::compare;
use std::cmp::Ordering;

assert_eq!(compare("café", "cafz"), Ordering::Less); // é (≈ e) sorts before z
```

Beyond the `unicode` module:
- `intl::locale` (`alloc`) parses and canonicalizes BCP-47 language tags (`Locale::parse("zh-hant-hk")` → `"zh-Hant-HK"`), adds and removes likely subtags (`Locale::maximize`: `en` → `en-Latn-US`; `Locale::minimize`: `zh-Hans-CN` → `zh`), and negotiates a best match between a user's requested locales and what's available (`negotiate`).
- `intl::plural` (`no_std`, no alloc) selects the CLDR `PluralCategory` for a number in a language: `plural_category` (cardinal) and `ordinal_category` ("1st"/"2nd"/"3rd"), with rules compiled from CLDR into a `match`. `plural_category("pl", &PluralOperands::from_int(5))` → `Many`. Validated against the CLDR sample data (cardinal + ordinal).
- `intl::number` (`alloc`) formats numbers in a locale's conventions: `format_decimal("de", 1234.5)` → `"1.234,5"`, `format_decimal("hi", 1234567.0)` → `"12,34,567"` (Indian grouping), `format_percent("en", 0.5)` → `"50%"`, `format_currency("en", 1234.5, "USD")` → `"$1,234.50"`, `format_scientific` (`"1.2345E4"`), `format_compact` (`"1.5K"`, `"2.3M"`), and `parse_decimal` back to an `f64` (`parse_decimal("de", "1.234,5")` → `1234.5`), plus native digit systems (`to_numbering_system("2024", "arab")` → `"٢٠٢٤"`) and ordinals (`format_ordinal("en", 21)` → `"21st"`).
- `intl::list` (`alloc`) joins items with locale connectors: `format_list("en", &["a","b","c"], ListStyle::And)` → `"a, b, and c"`.
- `intl::relative` (`alloc`) formats relative times: `format_relative("en", -2, RelativeUnit::Hour)` → `"2 hours ago"`, `format_relative("en", -1, RelativeUnit::Day)` → `"yesterday"` (plural- and number-aware).
- `intl::display` (`no_std`, no alloc) gives locale display names: `language_name("fr", "de")` → `Some("allemand")`, `region_name("en", "JP")` → `Some("Japan")`.
- `intl::unit` (`alloc`) formats measurement units, e.g. `format_unit("en", 5.0, Unit::Kilometer, UnitWidth::Long)` → `"5 kilometers"` (plural- and number-aware, long/short widths), and durations: `format_duration("en", 3661, UnitWidth::Long)` → `"1 hour 1 minute 1 second"`.
- `intl::message` (`alloc`) is a subset of ICU MessageFormat: `{arg}` substitution, `plural` / `selectordinal` (with `=N` and `#`), and `select`, composing the plural rules and number formatting.
- `intl::datetime` (`alloc`) formats Gregorian dates/times: `format_date("en", &dt, DateStyle::Long)` → `"June 4, 2026"`, `format_date("de", &dt, DateStyle::Long)` → `"4. Juni 2026"` (CLDR patterns, month/weekday names, am/pm; weekday via Sakamoto's algorithm). It also offers `format_skeleton("en", &dt, "yMMMd")` → `"Jun 4, 2026"` (flexible field-set formatting), and renders Islamic (Hijri) and Persian dates with localized month names (`format_islamic_date("en", 1445, 9, 1, DateStyle::Long)` → `"Ramadan 1, 1445 AH"`; `format_persian_date` likewise).
- `intl::spellout` (`alloc`) spells integers out in words via the CLDR RBNF rules (locale-driven): `spell_cardinal("en", 1234)` → `"one thousand two hundred thirty-four"`, `spell_cardinal("fr", 80)` → `"quatre-vingts"`.
- `intl::timezone` parses a POSIX `TZ` string (`"PST8PDT,M3.2.0,M11.1.0/2"`) and computes the UTC offset / DST state for any date. With the `iana-tz` feature it also loads the full IANA tz database (via the embedded `timezone-data` crate): `load_zone("America/New_York")`, then `offset_at` / `abbrev_at` / `is_dst_at` / `to_local` for any instant, with historical transitions. (`iana-tz` raises the MSRV to 1.86; the rest of the crate is 1.70.)
- `intl::calendar` (`no_std`, no alloc) converts dates between the Gregorian, civil (tabular) Islamic, Persian (Solar Hijri), Hebrew, and Chinese (lunisolar, 1900–2099 via an embedded lunar table) calendars through the Julian Day Number, and gives the Japanese era/year plus ISO-8601 week dates and day-of-week, all in pure integer arithmetic. `DateTime` also does ISO-8601 timestamp parse/format and date arithmetic (`add_seconds` / `add_days` / `weekday`, leap- and carry-aware), and `format_gmt_offset` renders a localized UTC offset (`GMT+05:30`, `UTC−08:00`).
- `intl::translit` (`alloc`) transliterates: `latin_ascii` (`"café"` → `"cafe"`, `"Straße"` → `"Strasse"`), `remove_diacritics`, `cyrillic_to_latin` (ISO 9), `greek_to_latin` (ELOT/ISO 843), and `any_ascii` for best-effort mixed-script ASCII (`"Москва café Αθήνα"` → `"Moskva cafe Athina"`).
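As an illustration of plural rules "compiled into a `match`", the sketch below hand-writes the CLDR cardinal rule for Polish, restricted to non-negative integers. The function name `polish_cardinal` and the local `PluralCategory` enum are assumptions for this example; the crate's generated code and its `PluralOperands` type (which also carries fraction digits) are not reproduced here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PluralCategory {
    One,
    Few,
    Many,
    /// In Polish, `Other` applies only to fractional numbers, which this sketch does not model.
    Other,
}

/// Polish cardinal plural category for an integer `n` (no visible fraction digits).
///
/// CLDR rule, integer case:
///   one:  n = 1
///   few:  n % 10 in 2..4 and n % 100 not in 12..14
///   many: every other integer
pub const fn polish_cardinal(n: u64) -> PluralCategory {
    match (n, n % 10, n % 100) {
        (1, _, _) => PluralCategory::One,
        (_, 2..=4, 12..=14) => PluralCategory::Many,
        (_, 2..=4, _) => PluralCategory::Few,
        _ => PluralCategory::Many,
    }
}
```

For example, `polish_cardinal(5)` is `Many`, which agrees with the `plural_category("pl", …)` result quoted in the list above, while `polish_cardinal(22)` is `Few` and `polish_cardinal(12)` is `Many`.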

These build out the CLDR/locale layer toward full ICU-style formatting. The
locale data is compiled by the offline codegen into flat binary blobs committed
under `src/cldr/` and embedded with `include_bytes!`, so the table layer is
`no_std` (no `alloc` dependency); only the formatting functions need `alloc`.
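A rough sketch of how an embedded flat blob can be queried without allocating is shown below. Everything about the layout is an assumption made for this example (sorted 3-byte records: a 2-byte language code followed by a 1-byte decimal separator), and a byte-string literal stands in for `include_bytes!`; the crate's real blob format is not documented here.

```rust
use core::cmp::Ordering;

/// Stand-in for an `include_bytes!` blob. Hypothetical layout: sorted records of
/// [2-byte language code][1-byte decimal separator].
static DECIMAL_SEPARATORS: &[u8] = b"de,en.fr,ja.";

const RECORD_LEN: usize = 3;

/// Binary-searches the blob in place: no parsing pass and no allocation.
pub fn decimal_separator(lang: &str) -> Option<char> {
    let key = lang.as_bytes();
    if key.len() != 2 {
        return None;
    }
    let (mut lo, mut hi) = (0, DECIMAL_SEPARATORS.len() / RECORD_LEN);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let rec = &DECIMAL_SEPARATORS[mid * RECORD_LEN..(mid + 1) * RECORD_LEN];
        match Ord::cmp(&rec[..2], key) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return Some(rec[2] as char),
        }
    }
    None
}
```

Because the data is a `&'static [u8]` and the search only slices it, a lookup like `decimal_separator("de")` works in a `no_std`, no-`alloc` build.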

The default feature set is `default = ["bmp"]`. Range tiers are `ascii` ⊂ `latin1` ⊂ `bmp` ⊂ `full` (below). The
`alloc` feature (still `no_std`) enables the allocating APIs
(`unicode::collate`, `unicode::spoof`, `unicode::idna`, `intl::locale`, …); it
implies `full`.

Cargo features select how much of the codepoint space is compiled in, trading coverage for binary size. The tiers are nested (each implies the smaller ones):

| feature | codepoints compiled |
|---|---|
| `ascii` | `U+0000..=U+007F` |
| `latin1` | `U+0000..=U+00FF` |
| `bmp` | `U+0000..=U+FFFF` (default) |
| `full` | `U+0000..=U+10FFFF` |

```toml
# Latin-1 only, no default BMP tables:
intl = { version = "0.1", default-features = false, features = ["latin1"] }

# Everything, including supplementary planes:
intl = { version = "0.1", default-features = false, features = ["full"] }
```

A codepoint outside the compiled tier reports `GeneralCategory::Unassigned`
(and `false` for every boolean predicate), exactly as a genuinely unassigned
codepoint would.
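One way such tiering could be wired up is sketched below, using `cfg!` to pick the highest compiled codepoint. The constant `MAX_COMPILED`, the toy `table_is_uppercase` helper, and the fallback to the ASCII bound when no tier feature is set are assumptions of this sketch; the crate may gate its generated tables differently (for instance with `#[cfg]` on whole modules).

```rust
/// Highest codepoint compiled in; the largest enabled tier wins.
pub const MAX_COMPILED: u32 = if cfg!(feature = "full") {
    0x10FFFF
} else if cfg!(feature = "bmp") {
    0xFFFF
} else if cfg!(feature = "latin1") {
    0xFF
} else {
    0x7F // `ascii`; also this sketch's fallback when no tier feature is set
};

/// Toy stand-in for a generated table (hypothetical): ASCII uppercase only.
const fn table_is_uppercase(cp: u32) -> bool {
    cp >= 0x41 && cp <= 0x5A
}

/// Total lookup: anything outside the compiled tier gets the neutral default `false`.
pub const fn is_uppercase_u32(cp: u32) -> bool {
    if cp > MAX_COMPILED {
        false
    } else {
        table_is_uppercase(cp)
    }
}
```

The range check means callers never need to know which tier was compiled: an out-of-tier codepoint and a genuinely unassigned one are indistinguishable at the API level.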

- `General_Category` (the 29 UAX #44 categories) and their major `Group`s, via `general_category` / `general_category_u32`.
- Boolean predicates: `is_alphabetic`, `is_uppercase`, `is_lowercase`, `is_whitespace` (from the derived Unicode properties), plus the category-derived `is_letter`, `is_mark`, `is_numeric`, `is_decimal_digit`, `is_punctuation`, `is_symbol`, `is_separator`, `is_control`, `is_format`, and `is_assigned`; plus the property predicates `is_math`, `is_dash`, `is_diacritic`, `is_hex_digit`, `is_quotation_mark`, `is_join_control`, and `is_default_ignorable`.
- Segmentation (UAX #29): extended grapheme cluster, word, and sentence boundary iteration via `graphemes(&str)`, `words(&str)`, and `sentences(&str)` (each yielding `&str`, allocation-free). Grapheme breaking handles combining marks, Hangul, Indic conjuncts, regional-indicator flags, and emoji ZWJ sequences; word and sentence breaking implement the full WB / SB rule sets. All three are validated against the official `GraphemeBreakTest` / `WordBreakTest` / `SentenceBreakTest` suites.
- Line breaking (UAX #14): `line_breaks(&str)` yielding break opportunities (mandatory vs allowed). ~99.98% conformant against `LineBreakTest` (a few CJK quotation/East-Asian-Width edge cases remain).
- Collation (UTS #10): DUCET root collation via `collate::compare` / `collate::Collator` (and `sort_key`), with non-ignorable or shifted variable handling, strength levels (`with_strength`: accent-/case-insensitive), numeric ordering (`with_numeric`: `file2 < file10`), and locale tailoring (`Tailoring::parse("&z < å < ä < ö")` / `Tailoring::for_locale("sv")` for primary reordering). Validated against the full official `CollationTest` suite (both modes). Requires the `alloc` feature.
- Normalization (UAX #15): `nfd`, `nfc`, `nfkd`, `nfkc` as streaming, allocation-free iterator adaptors over `Iterator<Item = char>`; quick-check helpers `is_nfc` / `is_nfd` / `is_nfkc` / `is_nfkd` (and tri-state `quick_check_*` → `IsNormalized`); plus `canonical_combining_class`. Validated against the full official `NormalizationTest.txt` conformance suite.
- Full, unconditional case mapping: per-`char` `to_uppercase`, `to_lowercase`, `to_titlecase`, `case_fold` (each a `CaseMapIter`, 1–3 chars, e.g. `ß` → `SS`), plus whole-stream adaptors `uppercase` / `lowercase` / `fold` over `Iterator<Item = char>` (e.g. `uppercase("Weiß".chars())`; no allocation). `fold` gives caseless comparison.
- `Script` and `Script_Extensions` (UAX #24) via `script` / `script_u32` and `script_extensions` / `script_extensions_u32` (`Script` enum with `.long_name()`; `ScriptExtensions` with `.contains()` / `.iter()`).
- `East_Asian_Width` (UAX #11) via `east_asian_width` / `east_asian_width_u32` (`EastAsianWidth` enum, with `.is_wide()`).
- Bidirectional text (UAX #9): `bidi_class` (the `BidiClass` enum), `base_direction(&str)` (rules P2–P3), and (with `alloc`) the full reordering algorithm `bidi::process(&str, …) -> BidiInfo` (embedding levels + visual order). ~99.996% conformant against `BidiCharacterTest`.
- Identifiers (UAX #31): `is_xid_start`, `is_xid_continue`, and `is_identifier(&str)` for default identifier validation.
- Confusables / spoof detection (UTS #39): `spoof::skeleton`, `spoof::confusable`, and `spoof::is_single_script` (mixed-script detection). Requires `alloc`.
- IDNA / Punycode (UTS #46 / RFC 3492): `idna::to_ascii` / `idna::to_unicode` for domain names (mapping + NFC + Punycode). The mapping/Punycode core passes every clean-success line of `IdnaTestV2`; the contextual validity rules (`CheckBidi` / `CheckJoiners`) are not yet enforced. Requires `alloc`.
- `Numeric_Type` and exact `Numeric_Value` via `numeric_type` and `numeric_value` / `numeric_value_u32` (`NumericValue` is a rational `numerator / denominator`, with `.to_i64()` / `.as_f64()`).
- `UNICODE_VERSION` of the embedded tables.
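The allocation-free, one-to-many case mapping described in the list can be pictured with a small fixed-capacity iterator. This is only a sketch: `CaseMapSketch`, `to_uppercase_sketch`, and `uppercase_sketch` are hypothetical names, and the toy mapping covers just ASCII letters and `ß`, whereas the crate's `CaseMapIter` is driven by the full UCD data.

```rust
/// Up to three chars stored inline, so iterating a mapping never allocates.
#[derive(Debug, Clone)]
pub struct CaseMapSketch {
    buf: [char; 3],
    len: u8,
    pos: u8,
}

impl Iterator for CaseMapSketch {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pos < self.len {
            let c = self.buf[self.pos as usize];
            self.pos += 1;
            Some(c)
        } else {
            None
        }
    }
}

/// Toy per-char uppercase mapping: ASCII letters plus the one-to-many case ß → SS.
pub fn to_uppercase_sketch(c: char) -> CaseMapSketch {
    let (buf, len) = match c {
        'ß' => (['S', 'S', '\0'], 2),
        _ => ([c.to_ascii_uppercase(), '\0', '\0'], 1),
    };
    CaseMapSketch { buf, len, pos: 0 }
}

/// Whole-stream adaptor: lazily flattens the per-char mappings.
pub fn uppercase_sketch<I: Iterator<Item = char>>(chars: I) -> impl Iterator<Item = char> {
    chars.flat_map(to_uppercase_sketch)
}
```

Collecting `uppercase_sketch("Weiß".chars())` into a `String` gives `"WEISS"`; the `String` is only a convenience for the demonstration, since the adaptor itself yields chars one at a time.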

The committed files under `src/unicode/generated/` are produced from the
vendored UCD text files in `data/ucd/<version>/` by the codegen tool. It is a
packaging-time tool, run only when updating the data or the Unicode version:
the published crate never builds or invokes it, and `codegen/` is a standalone
package (not a workspace member and not part of `intl`).

```sh
cargo run --manifest-path codegen/Cargo.toml
```

Output is deterministic and rustfmt-clean, so regeneration with the same data
yields no diff. To update the Unicode version, drop the new UCD files into
`data/ucd/<version>/`, bump the version in `codegen`, and re-run.
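For intuition about what deterministic generation involves, here is a small sketch of emitting a `match` body from range rows. The row shape, the `emit_lookup` function, and the emitted `Category::…` paths are invented for this example and are not taken from the actual codegen tool; the point is only that sorting the input before emitting makes the output independent of input order.

```rust
use std::fmt::Write;

/// Emits the source text of a `match` over inclusive codepoint ranges.
/// Each row is (start, end, variant name); all three are hypothetical inputs.
pub fn emit_lookup(mut rows: Vec<(u32, u32, &str)>) -> String {
    // Sort first so the emitted text does not depend on the order rows were read in.
    rows.sort();
    let mut out = String::from("match cp {\n");
    for (start, end, variant) in rows {
        writeln!(out, "    {start:#06X}..={end:#06X} => Category::{variant},").unwrap();
    }
    out.push_str("    _ => Category::Unassigned,\n}\n");
    out
}
```

Feeding the same rows in a different order yields byte-identical text, which is the property that lets a regeneration run produce no diff.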
MIT — see LICENSE.