Skip to content

KarpelesLab/intlrs

Repository files navigation

intl

crates.io docs.rs CI License: MIT

Pure-Rust, #![no_std] internationalization primitives — a long-term, pure-Rust analog of ICU (collation, number formatting, normalization, transliteration, …). See ROADMAP.md for the plan toward ICU feature parity.

The foundational layer, available today, is the unicode module: Unicode rune analysis driven by the official Unicode Character Database (UCD), with character properties compiled directly into Rust match dispatch by an offline code generator — so every lookup is a const fn, allocates nothing, and needs no runtime initialization.

  • no_std, no alloc — usable in embedded, kernel, and WASM contexts.
  • Tables as code — the UCD is converted into a two-level paged match ("switch/case") index, not parsed at runtime.
  • Feature-selectable ranges — compile only the slice of the codepoint space you need. Anything outside the compiled range resolves to the neutral default (Unassigned / false), so every lookup is total.
  • Targets Unicode 17.0.0.

Usage

[dependencies]
intl = "0.1"
use intl::unicode::{general_category, GeneralCategory, CharExt};

assert_eq!(general_category('A'), GeneralCategory::UppercaseLetter);
assert_eq!(general_category('中'), GeneralCategory::OtherLetter);

assert!('A'.is_uppercase());
assert!('٣'.is_numeric());          // Arabic-Indic digit three
assert!(' '.is_whitespace());
assert!(!'\u{0378}'.is_assigned()); // a reserved codepoint

Every predicate exists both as a free const fn taking a char (intl::unicode::is_uppercase('A')) and as a method via the CharExt trait ('A'.is_uppercase()).

Normalization and collation (the latter behind the alloc feature):

use intl::unicode::{nfc, nfd};
assert_eq!(nfc("e\u{0301}".chars()).collect::<String>(), "é");
assert_eq!(nfd("é".chars()).collect::<String>(), "e\u{0301}");

// With the `alloc` feature:
use intl::unicode::collate::compare;
use std::cmp::Ordering;
assert_eq!(compare("café", "cafz"), Ordering::Less); // é (≈ e) sorts before z

Beyond the unicode module:

  • intl::locale (alloc) parses and canonicalizes BCP-47 language tags (Locale::parse("zh-hant-hk")"zh-Hant-HK"), and adds/removes likely subtags (Locale::maximize: enen-Latn-US; Locale::minimize: zh-Hans-CNzh), and negotiates a best match between a user's requested locales and what's available (negotiate).

  • intl::plural (no_std, no alloc) selects the CLDR PluralCategory for a number in a language — plural_category (cardinal) and ordinal_category ("1st"/"2nd"/"3rd"), rules compiled from CLDR into a match. plural_category("pl", &PluralOperands::from_int(5))Many. Validated against the CLDR sample data (cardinal + ordinal).

  • intl::number (alloc) formats numbers in a locale's conventions — format_decimal("de", 1234.5)"1.234,5", format_decimal("hi", 1234567.0)"12,34,567" (Indian grouping), format_percent("en", 0.5)"50%", format_currency("en", 1234.5, "USD")"$1,234.50", format_scientific ("1.2345E4"), format_compact ("1.5K", "2.3M"), and parse_decimal back to an f64 (parse_decimal("de", "1.234,5")1234.5), plus native digit systems (to_numbering_system("2024", "arab")"٢٠٢٤") and ordinals (format_ordinal("en", 21)"21st").

  • intl::list (alloc) joins items with locale connectors — format_list("en", &["a","b","c"], ListStyle::And)"a, b, and c".

  • intl::relative (alloc) formats relative times — format_relative("en", -2, RelativeUnit::Hour)"2 hours ago", format_relative("en", -1, RelativeUnit::Day)"yesterday" (plural- and number-aware).

  • intl::display (no_std, no alloc) gives locale display names — language_name("fr", "de")Some("allemand"), region_name("en", "JP")Some("Japan").

  • intl::unit (alloc) formats measurement units — format_unit("en", 5.0, Unit::Kilometer, UnitWidth::Long)"5 kilometers" (plural- and number-aware, long/short widths) — and durations: format_duration("en", 3661, UnitWidth::Long)"1 hour 1 minute 1 second".

  • intl::message (alloc) is a subset of ICU MessageFormat — {arg} substitution, plural/selectordinal (with =N and #), and select, composing the plural rules and number formatting.

  • intl::datetime (alloc) formats Gregorian dates/times — format_date("en", &dt, DateStyle::Long)"June 4, 2026", format_date("de", &dt, DateStyle::Long)"4. Juni 2026" (CLDR patterns, month/weekday names, am/pm; weekday via Sakamoto's algorithm). Also format_skeleton("en", &dt, "yMMMd")"Jun 4, 2026" (flexible field-set formatting), and renders Islamic (Hijri) and Persian dates with localized month names (format_islamic_date("en", 1445, 9, 1, DateStyle::Long)"Ramadan 1, 1445 AH"; format_persian_date likewise).

  • intl::spellout spells integers out in words via the CLDR RBNF rules (locale-driven) — spell_cardinal("en", 1234)"one thousand two hundred thirty-four", spell_cardinal("fr", 80)"quatre-vingts". (alloc)

  • intl::timezone parses a POSIX TZ string ("PST8PDT,M3.2.0,M11.1.0/2") and computes the UTC offset / DST state for any date. With the iana-tz feature it also loads the full IANA tz database (via the embedded timezone-data crate): load_zone("America/New_York") then offset_at / abbrev_at / is_dst_at / to_local for any instant, with historical transitions. (iana-tz raises the MSRV to 1.86; the rest of the crate is 1.70.)

  • intl::calendar (no_std, no alloc) converts dates between the Gregorian, civil (tabular) Islamic, Persian (Solar Hijri), Hebrew, and Chinese (lunisolar, 1900–2099 via an embedded lunar table) calendars through the Julian Day Number, gives the Japanese era/year, plus ISO-8601 week dates and day-of-week — pure integer arithmetic. DateTime also does ISO-8601 timestamp parse/format, date arithmetic (add_seconds/add_days/ weekday, leap- and carry-aware), and format_gmt_offset renders a localized UTC offset (GMT+05:30, UTC−08:00).

  • intl::translit (alloc) transliterates: latin_ascii ("café"→"cafe", "Straße"→"Strasse"), remove_diacritics, cyrillic_to_latin (ISO 9), greek_to_latin (ELOT/ISO 843), and any_ascii for best-effort mixed-script ASCII ("Москва café Αθήνα"→"Moskva cafe Athina").

These build out the CLDR/locale layer toward full ICU-style formatting. The locale data is compiled by the offline codegen into flat binary blobs committed under src/cldr/ and embedded with include_bytes!, so the table layer is no_std (no alloc dependency); only the formatting functions need alloc.

Features

default = ["bmp"]. Range tiers are ascii ⊂ latin1 ⊂ bmp ⊂ full (below). The alloc feature (still no_std) enables the allocating APIs (unicode::collate, unicode::spoof, unicode::idna, intl::locale, …); it implies full.

Range tiers

Cargo features select how much of the codepoint space is compiled in, trading coverage for binary size. The tiers are nested (each implies the smaller ones):

feature codepoints compiled
ascii U+0000..=U+007F
latin1 U+0000..=U+00FF
bmp U+0000..=U+FFFF (default)
full U+0000..=U+10FFFF
# Latin-1 only, no default BMP tables:
intl = { version = "0.1", default-features = false, features = ["latin1"] }
# Everything, including supplementary planes:
intl = { version = "0.1", default-features = false, features = ["full"] }

A codepoint outside the compiled tier reports GeneralCategory::Unassigned (and false for every boolean predicate) — exactly as a genuinely unassigned codepoint would.

What the unicode module covers

  • General_Category (the 29 UAX #44 categories) and their major Groups, via general_category / general_category_u32.
  • Boolean predicates: is_alphabetic, is_uppercase, is_lowercase, is_whitespace (from the derived Unicode properties), plus the category-derived is_letter, is_mark, is_numeric, is_decimal_digit, is_punctuation, is_symbol, is_separator, is_control, is_format, and is_assigned; plus the property predicates is_math, is_dash, is_diacritic, is_hex_digit, is_quotation_mark, is_join_control, and is_default_ignorable.
  • Segmentation (UAX #29) — extended grapheme cluster, word, and sentence boundary iteration via graphemes(&str), words(&str), and sentences(&str) (each yielding &str, allocation-free). Grapheme breaking handles combining marks, Hangul, Indic conjuncts, regional-indicator flags, and emoji ZWJ sequences; word and sentence breaking implement the full WB / SB rule sets. All three validated against the official GraphemeBreakTest / WordBreakTest / SentenceBreakTest suites.
  • Line breaking (UAX #14) — line_breaks(&str) yielding break opportunities (mandatory vs allowed). ~99.98% conformant against LineBreakTest (a few CJK quotation/East-Asian-Width edge cases remain).
  • Collation (UTS #10) — DUCET root collation via collate::compare / collate::Collator (and sort_key), with non-ignorable or shifted variable handling, strength levels (with_strength: accent-/case-insensitive), numeric ordering (with_numeric: file2 < file10), and locale tailoring (Tailoring::parse("&z < å < ä < ö") / Tailoring::for_locale("sv") for primary reordering). Validated against the full official CollationTest suite (both modes). Requires the alloc feature.
  • Normalization (UAX #15) — nfd, nfc, nfkd, nfkc as streaming, allocation-free iterator adaptors over Iterator<Item = char>; quick-check helpers is_nfc/is_nfd/is_nfkc/is_nfkd (and tri-state quick_check_*IsNormalized); plus canonical_combining_class. Validated against the full official NormalizationTest.txt conformance suite.
  • Full, unconditional case mapping — per-char to_uppercase, to_lowercase, to_titlecase, case_fold (each a CaseMapIter, 1–3 chars, e.g. ßSS), plus whole-stream adaptors uppercase / lowercase / fold over Iterator<Item = char> (e.g. uppercase("Weiß".chars()); no allocation). fold gives caseless comparison.
  • Script and Script_Extensions (UAX #24) via script / script_u32 and script_extensions / script_extensions_u32 (Script enum with .long_name(); ScriptExtensions with .contains() / .iter()).
  • East_Asian_Width (UAX #11) via east_asian_width / east_asian_width_u32 (EastAsianWidth enum, with .is_wide()).
  • Bidirectional text (UAX #9) — bidi_class (the BidiClass enum), base_direction(&str) (rules P2–P3), and (with alloc) the full reordering algorithm bidi::process(&str, …) -> BidiInfo (embedding levels + visual order). ~99.996% conformant against BidiCharacterTest.
  • Identifiers (UAX #31) — is_xid_start, is_xid_continue, and is_identifier(&str) for default identifier validation.
  • Confusables / spoof detection (UTS #39) — spoof::skeleton, spoof::confusable, and spoof::is_single_script (mixed-script detection). Requires alloc.
  • IDNA / Punycode (UTS #46 / RFC 3492) — idna::to_ascii / idna::to_unicode for domain names (mapping + NFC + Punycode). The mapping/Punycode core passes every clean-success line of IdnaTestV2; the contextual validity rules (CheckBidi/CheckJoiners) are not yet enforced. Requires alloc.
  • Numeric_Type and exact Numeric_Value via numeric_type and numeric_value / numeric_value_u32 (NumericValue is a rational numerator / denominator, with .to_i64() / .as_f64()).
  • UNICODE_VERSION of the embedded tables.

Regenerating the tables

The committed files under src/unicode/generated/ are produced from the vendored UCD text files in data/ucd/<version>/ by the codegen tool. It is a packaging-time tool run only when updating the data or the Unicode version — the published crate never builds or invokes it, and codegen/ is a standalone package (not a workspace member and not part of intl).

cargo run --manifest-path codegen/Cargo.toml

Output is deterministic and rustfmt-clean, so regeneration with the same data yields no diff. To update the Unicode version, drop the new UCD files into data/ucd/<version>/, bump the version in codegen, and re-run.

License

MIT — see LICENSE.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages