Mytherin/minimal-icu-collation

Minimal ICU Collation & Time Zones

This project contains an easy-to-use version of the collation/timezone part of the ICU library. The entire library is contained in two files (icu-collate.cpp and icu-collate.hpp). This includes all the data necessary to make it work. If you want to use this in your own project, simply copy those two files there.

The compiled size of the project is around 6 MB. Most of this is the inlined ICU data that is required to properly support collation for all included locales. The Reducing Data Size section below describes how to strip out certain locales to make the included data smaller.

Usage

Here are a few snippets that show common use cases.

List Supported Collation Locales

int32_t count;
auto locales = Collator::getAvailableLocales(count);
fprintf(stdout, "Available collation locales: [");
for(int32_t i = 0; i < count; i++) {
    if (i > 0) {
        fprintf(stdout, ", ");
    }
    if (string(locales[i].getCountry()).empty()) {
        // language only
        fprintf(stdout, "%s", locales[i].getLanguage());
    } else {
        // language + country
        fprintf(stdout, "%s_%s", locales[i].getLanguage(), locales[i].getCountry());
    }
}
fprintf(stdout, "]\n");
// output:
// Available collation locales: [af, am, ar, as, az, be, bg, bn, bo, bs, bs, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US, eo, es, et, fa, fa_AF, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, om, or, pa, pa, pa_IN, pl, ps, pt, ro, ru, se, si, sk, sl, smn, sq, sr, sr, sr_BA, sr_ME, sr_RS, sr, sr_BA, sr_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh, zh_CN, zh_SG, zh, zh_HK, zh_MO, zh_TW, zu]
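The output above contains repeated names such as bs, en_US and zh. These most likely come from locales that differ only in script or variant (for example bs and bs_Cyrl), which the loop does not print. If a list of distinct names is needed, the repeats can be dropped after formatting. The following is a minimal sketch under that assumption; uniqueLocaleNames is a hypothetical helper that is not part of ICU or of this project, and it takes plain (language, country) string pairs so that it can be tried without ICU.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ICU or this project): builds "lang" or
// "lang_COUNTRY" names and drops repeats while keeping first-seen order.
std::vector<std::string> uniqueLocaleNames(
    const std::vector<std::pair<std::string, std::string>> &locales) {
    std::vector<std::string> names;
    std::unordered_set<std::string> seen;
    for (const auto &locale : locales) {
        const std::string &language = locale.first;
        const std::string &country = locale.second;
        // language only, or language + country
        std::string name = country.empty() ? language : language + "_" + country;
        // insert().second is true only the first time a name is seen
        if (seen.insert(name).second) {
            names.push_back(name);
        }
    }
    return names;
}
```

With ICU, the pairs would be filled from locales[i].getLanguage() and locales[i].getCountry() as in the loop above.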

Create a Collator and sort a vector of strings

We can use collators to perform locale-based string ordering using the getSortKey method. This method takes a Unicode string as input and returns a byte-orderable sort key for the collator's locale. That is, we can compare the sort keys with standard byte-wise string comparison, and the resulting order is the correct order for that locale.

In this example, we order several German names using the German locale.

// createSortKey creates a byte-orderable string according to the collation locale
static string createSortKey(icu::Collator &collator, const string &str) {
    // convert the UTF-8 input to a UnicodeString once
    UnicodeString unicode_str = UnicodeString::fromUTF8(StringPiece(str));
    // run getSortKey with an empty buffer to get the required size
    int32_t size = collator.getSortKey(unicode_str, nullptr, 0);
    // allocate a buffer to hold the sort key
    auto buf = std::unique_ptr<uint8_t[]>(new uint8_t[size]);
    // generate the sort key from the string
    collator.getSortKey(unicode_str, buf.get(), size);
    return string((char*) buf.get(), size);
}

vector<string> strings = {"Göbel", "Goethe", "Goldmann", "Göthe", "Götz", "Gabel"};

UErrorCode status = U_ZERO_ERROR;
// create the collator for the german locale
auto de_collator = std::unique_ptr<icu::Collator>(Collator::createInstance(Locale("de"), status));
if (U_FAILURE(status)) {
	// something went wrong
	exit(1);
}
// order the strings using the createSortKey method
std::sort(strings.begin(), strings.end(), [&](const string &a, const string &b) {
	return createSortKey(*de_collator, a) < createSortKey(*de_collator, b);
});
for(auto &str : strings) {
	fprintf(stdout, "%s\n", str.c_str());
}
// output:
// Gabel
// Göbel
// Goethe
// Goldmann
// Göthe
// Götz

// compare this result with the (incorrect) binary ordering:
std::sort(strings.begin(), strings.end());
for(auto &str : strings) {
	fprintf(stdout, "%s\n", str.c_str());
}
// Gabel
// Goethe
// Goldmann
// Göbel
// Göthe
// Götz
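The comparator in the example above calls createSortKey twice per comparison, so each key is recomputed many times during the sort. For larger inputs it may be worth computing every key once and sorting on the stored keys. The following is a minimal sketch of that idea; sortByKey is a hypothetical helper that is not part of ICU or of this project, and makeKey stands in for any key function such as createSortKey bound to a collator.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: sorts `strings` in place by precomputed keys.
// `makeKey` is a placeholder for a function such as createSortKey.
void sortByKey(std::vector<std::string> &strings,
               const std::function<std::string(const std::string &)> &makeKey) {
    using KeyedString = std::pair<std::string, std::string>;
    // compute each key exactly once and keep it next to its string
    std::vector<KeyedString> keyed;
    keyed.reserve(strings.size());
    for (auto &str : strings) {
        std::string key = makeKey(str);
        keyed.emplace_back(std::move(key), std::move(str));
    }
    // order by key only; stable_sort keeps equal keys in input order
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const KeyedString &a, const KeyedString &b) {
                         return a.first < b.first;
                     });
    // move the strings back in sorted order
    for (std::size_t i = 0; i < keyed.size(); i++) {
        strings[i] = std::move(keyed[i].second);
    }
}
```

With the collator from the example, a call could look like sortByKey(strings, [&](const string &s) { return createSortKey(*de_collator, s); }). The trade-off is extra memory for the stored keys in exchange for one key computation per string.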

Listing Available Timezones

UErrorCode status = U_ZERO_ERROR;
auto timeZoneIds = std::unique_ptr<StringEnumeration>(TimeZone::createEnumeration());
const UnicodeString *zoneId = timeZoneIds->snext(status);
if (U_FAILURE(status)) {
	// something went wrong
	exit(1);
}

int32_t count = 0;
fprintf(stdout, "Available timezones: [");
while (zoneId != NULL && U_SUCCESS(status)) {
	std::string zoneIdString;
	zoneId->toUTF8String(zoneIdString);
	// separate entries with a comma, starting from the second entry
	if (count > 0) {
		fprintf(stdout, ", ");
	}
	fprintf(stdout, "%s", zoneIdString.c_str());
	count++;
	zoneId = timeZoneIds->snext(status);
}
fprintf(stdout, "]\n");
// output:
// Available timezones: [ACT, AET, AGT, ART, AST, Africa/Abidjan, Africa/Accra, Africa/Addis_Ababa, Africa/Algiers, Africa/Asmara, Africa/Asmera, Africa/Bamako, Africa/Bangui, Africa/Banjul, Africa/Bissau, Africa/Blantyre, Africa/Brazzaville, Africa/Bujumbura, Africa/Cairo, Africa/Casablanca, Africa/Ceuta, Africa/Conakry, Africa/Dakar, Africa/Dar_es_Salaam, Africa/Djibouti, Africa/Douala, Africa/El_Aaiun, Africa/Freetown, Africa/Gaborone, Africa/Harare, Africa/Johannesburg, Africa/Juba, Africa/Kampala, Africa/Khartoum, Africa/Kigali, Africa/Kinshasa, Africa/Lagos, Africa/Libreville, Africa/Lome, Africa/Luanda, Africa/Lubumbashi, Africa/Lusaka, Africa/Malabo, Africa/Maputo, Africa/Maseru, Africa/Mbabane, Africa/Mogadishu, Africa/Monrovia, Africa/Nairobi, Africa/Ndjamena, Africa/Niamey, Africa/Nouakchott, Africa/Ouagadougou, Africa/Porto-Novo, Africa/Sao_Tome, Africa/Timbuktu, Africa/Tripoli, Africa/Tunis, Africa/Windhoek, America/Adak, America/Anchorage, America/Anguilla, America/Antigua, America/Araguaina, America/Argentina/Buenos_Aires, America/Argentina/Catamarca, America/Argentina/ComodRivadavia, America/Argentina/Cordoba, America/Argentina/Jujuy, America/Argentina/La_Rioja, America/Argentina/Mendoza, America/Argentina/Rio_Gallegos, America/Argentina/Salta, America/Argentina/San_Juan, America/Argentina/San_Luis, America/Argentina/Tucuman, America/Argentina/Ushuaia, America/Aruba, America/Asuncion, America/Atikokan, America/Atka, America/Bahia, America/Bahia_Banderas, America/Barbados, America/Belem, America/Belize, America/Blanc-Sablon, America/Boa_Vista, America/Bogota, America/Boise, America/Buenos_Aires, America/Cambridge_Bay, America/Campo_Grande, America/Cancun, America/Caracas, America/Catamarca, America/Cayenne, America/Cayman, America/Chicago, America/Chihuahua, America/Coral_Harbour, America/Cordoba, America/Costa_Rica, America/Creston, America/Cuiaba, America/Curacao, America/Danmarkshavn, America/Dawson, America/Dawson_Creek, America/Denver, 
America/Detroit, America/Dominica, America/Edmonton, America/Eirunepe, America/El_Salvador, America/Ensenada, America/Fort_Nelson, America/Fort_Wayne, America/Fortaleza, America/Glace_Bay, America/Godthab, America/Goose_Bay, America/Grand_Turk, America/Grenada, America/Guadeloupe, America/Guatemala, America/Guayaquil, America/Guyana, America/Halifax, America/Havana, America/Hermosillo, America/Indiana/Indianapolis, America/Indiana/Knox, America/Indiana/Marengo, America/Indiana/Petersburg, America/Indiana/Tell_City, America/Indiana/Vevay, America/Indiana/Vincennes, America/Indiana/Winamac, America/Indianapolis, America/Inuvik, America/Iqaluit, America/Jamaica, America/Jujuy, America/Juneau, America/Kentucky/Louisville, America/Kentucky/Monticello, America/Knox_IN, America/Kralendijk, America/La_Paz, America/Lima, America/Los_Angeles, America/Louisville, America/Lower_Princes, America/Maceio, America/Managua, America/Manaus, America/Marigot, America/Martinique, America/Matamoros, America/Mazatlan, America/Mendoza, America/Menominee, America/Merida, America/Metlakatla, America/Mexico_City, America/Miquelon, America/Moncton, America/Monterrey, America/Montevideo, America/Montreal, America/Montserrat, America/Nassau, America/New_York, America/Nipigon, America/Nome, America/Noronha, America/North_Dakota/Beulah, America/North_Dakota/Center, America/North_Dakota/New_Salem, America/Ojinaga, America/Panama, America/Pangnirtung, America/Paramaribo, America/Phoenix, America/Port-au-Prince, America/Port_of_Spain, America/Porto_Acre, America/Porto_Velho, America/Puerto_Rico, America/Punta_Arenas, America/Rainy_River, America/Rankin_Inlet, America/Recife, America/Regina, America/Resolute, America/Rio_Branco, America/Rosario, America/Santa_Isabel, America/Santarem, America/Santiago, America/Santo_Domingo, America/Sao_Paulo, America/Scoresbysund, America/Shiprock, America/Sitka, America/St_Barthelemy, America/St_Johns, America/St_Kitts, America/St_Lucia, America/St_Thomas, 
America/St_Vincent, America/Swift_Current, America/Tegucigalpa, America/Thule, America/Thunder_Bay, America/Tijuana, America/Toronto, America/Tortola, America/Vancouver, America/Virgin, America/Whitehorse, America/Winnipeg, America/Yakutat, America/Yellowknife, Antarctica/Casey, Antarctica/Davis, Antarctica/DumontDUrville, Antarctica/Macquarie, Antarctica/Mawson, Antarctica/McMurdo, Antarctica/Palmer, Antarctica/Rothera, Antarctica/South_Pole, Antarctica/Syowa, Antarctica/Troll, Antarctica/Vostok, Arctic/Longyearbyen, Asia/Aden, Asia/Almaty, Asia/Amman, Asia/Anadyr, Asia/Aqtau, Asia/Aqtobe, Asia/Ashgabat, Asia/Ashkhabad, Asia/Atyrau, Asia/Baghdad, Asia/Bahrain, Asia/Baku, Asia/Bangkok, Asia/Barnaul, Asia/Beirut, Asia/Bishkek, Asia/Brunei, Asia/Calcutta, Asia/Chita, Asia/Choibalsan, Asia/Chongqing, Asia/Chungking, Asia/Colombo, Asia/Dacca, Asia/Damascus, Asia/Dhaka, Asia/Dili, Asia/Dubai, Asia/Dushanbe, Asia/Famagusta, Asia/Gaza, Asia/Harbin, Asia/Hebron, Asia/Ho_Chi_Minh, Asia/Hong_Kong, Asia/Hovd, Asia/Irkutsk, Asia/Istanbul, Asia/Jakarta, Asia/Jayapura, Asia/Jerusalem, Asia/Kabul, Asia/Kamchatka, Asia/Karachi, Asia/Kashgar, Asia/Kathmandu, Asia/Katmandu, Asia/Khandyga, Asia/Kolkata, Asia/Krasnoyarsk, Asia/Kuala_Lumpur, Asia/Kuching, Asia/Kuwait, Asia/Macao, Asia/Macau, Asia/Magadan, Asia/Makassar, Asia/Manila, Asia/Muscat, Asia/Nicosia, Asia/Novokuznetsk, Asia/Novosibirsk, Asia/Omsk, Asia/Oral, Asia/Phnom_Penh, Asia/Pontianak, Asia/Pyongyang, Asia/Qatar, Asia/Qostanay, Asia/Qyzylorda, Asia/Rangoon, Asia/Riyadh, Asia/Saigon, Asia/Sakhalin, Asia/Samarkand, Asia/Seoul, Asia/Shanghai, Asia/Singapore, Asia/Srednekolymsk, Asia/Taipei, Asia/Tashkent, Asia/Tbilisi, Asia/Tehran, Asia/Tel_Aviv, Asia/Thimbu, Asia/Thimphu, Asia/Tokyo, Asia/Tomsk, Asia/Ujung_Pandang, Asia/Ulaanbaatar, Asia/Ulan_Bator, Asia/Urumqi, Asia/Ust-Nera, Asia/Vientiane, Asia/Vladivostok, Asia/Yakutsk, Asia/Yangon, Asia/Yekaterinburg, Asia/Yerevan, Atlantic/Azores, Atlantic/Bermuda, Atlantic/Canary, 
Atlantic/Cape_Verde, Atlantic/Faeroe, Atlantic/Faroe, Atlantic/Jan_Mayen, Atlantic/Madeira, Atlantic/Reykjavik, Atlantic/South_Georgia, Atlantic/St_Helena, Atlantic/Stanley, Australia/ACT, Australia/Adelaide, Australia/Brisbane, Australia/Broken_Hill, Australia/Canberra, Australia/Currie, Australia/Darwin, Australia/Eucla, Australia/Hobart, Australia/LHI, Australia/Lindeman, Australia/Lord_Howe, Australia/Melbourne, Australia/NSW, Australia/North, Australia/Perth, Australia/Queensland, Australia/South, Australia/Sydney, Australia/Tasmania, Australia/Victoria, Australia/West, Australia/Yancowinna, BET, BST, Brazil/Acre, Brazil/DeNoronha, Brazil/East, Brazil/West, CAT, CET, CNT, CST, CST6CDT, CTT, Canada/Atlantic, Canada/Central, Canada/East-Saskatchewan, Canada/Eastern, Canada/Mountain, Canada/Newfoundland, Canada/Pacific, Canada/Saskatchewan, Canada/Yukon, Chile/Continental, Chile/EasterIsland, Cuba, EAT, ECT, EET, EST, EST5EDT, Egypt, Eire, Etc/GMT, Etc/GMT+0, Etc/GMT+1, Etc/GMT+10, Etc/GMT+11, Etc/GMT+12, Etc/GMT+2, Etc/GMT+3, Etc/GMT+4, Etc/GMT+5, Etc/GMT+6, Etc/GMT+7, Etc/GMT+8, Etc/GMT+9, Etc/GMT-0, Etc/GMT-1, Etc/GMT-10, Etc/GMT-11, Etc/GMT-12, Etc/GMT-13, Etc/GMT-14, Etc/GMT-2, Etc/GMT-3, Etc/GMT-4, Etc/GMT-5, Etc/GMT-6, Etc/GMT-7, Etc/GMT-8, Etc/GMT-9, Etc/GMT0, Etc/Greenwich, Etc/UCT, Etc/UTC, Etc/Universal, Etc/Zulu, Europe/Amsterdam, Europe/Andorra, Europe/Astrakhan, Europe/Athens, Europe/Belfast, Europe/Belgrade, Europe/Berlin, Europe/Bratislava, Europe/Brussels, Europe/Bucharest, Europe/Budapest, Europe/Busingen, Europe/Chisinau, Europe/Copenhagen, Europe/Dublin, Europe/Gibraltar, Europe/Guernsey, Europe/Helsinki, Europe/Isle_of_Man, Europe/Istanbul, Europe/Jersey, Europe/Kaliningrad, Europe/Kiev, Europe/Kirov, Europe/Lisbon, Europe/Ljubljana, Europe/London, Europe/Luxembourg, Europe/Madrid, Europe/Malta, Europe/Mariehamn, Europe/Minsk, Europe/Monaco, Europe/Moscow, Europe/Nicosia, Europe/Oslo, Europe/Paris, Europe/Podgorica, Europe/Prague, 
Europe/Riga, Europe/Rome, Europe/Samara, Europe/San_Marino, Europe/Sarajevo, Europe/Saratov, Europe/Simferopol, Europe/Skopje, Europe/Sofia, Europe/Stockholm, Europe/Tallinn, Europe/Tirane, Europe/Tiraspol, Europe/Ulyanovsk, Europe/Uzhgorod, Europe/Vaduz, Europe/Vatican, Europe/Vienna, Europe/Vilnius, Europe/Volgograd, Europe/Warsaw, Europe/Zagreb, Europe/Zaporozhye, Europe/Zurich, Factory, GB, GB-Eire, GMT, GMT+0, GMT-0, GMT0, Greenwich, HST, Hongkong, IET, IST, Iceland, Indian/Antananarivo, Indian/Chagos, Indian/Christmas, Indian/Cocos, Indian/Comoro, Indian/Kerguelen, Indian/Mahe, Indian/Maldives, Indian/Mauritius, Indian/Mayotte, Indian/Reunion, Iran, Israel, JST, Jamaica, Japan, Kwajalein, Libya, MET, MIT, MST, MST7MDT, Mexico/BajaNorte, Mexico/BajaSur, Mexico/General, NET, NST, NZ, NZ-CHAT, Navajo, PLT, PNT, PRC, PRT, PST, PST8PDT, Pacific/Apia, Pacific/Auckland, Pacific/Bougainville, Pacific/Chatham, Pacific/Chuuk, Pacific/Easter, Pacific/Efate, Pacific/Enderbury, Pacific/Fakaofo, Pacific/Fiji, Pacific/Funafuti, Pacific/Galapagos, Pacific/Gambier, Pacific/Guadalcanal, Pacific/Guam, Pacific/Honolulu, Pacific/Johnston, Pacific/Kiritimati, Pacific/Kosrae, Pacific/Kwajalein, Pacific/Majuro, Pacific/Marquesas, Pacific/Midway, Pacific/Nauru, Pacific/Niue, Pacific/Norfolk, Pacific/Noumea, Pacific/Pago_Pago, Pacific/Palau, Pacific/Pitcairn, Pacific/Pohnpei, Pacific/Ponape, Pacific/Port_Moresby, Pacific/Rarotonga, Pacific/Saipan, Pacific/Samoa, Pacific/Tahiti, Pacific/Tarawa, Pacific/Tongatapu, Pacific/Truk, Pacific/Wake, Pacific/Wallis, Pacific/Yap, Poland, Portugal, ROC, ROK, SST, Singapore, SystemV/AST4, SystemV/AST4ADT, SystemV/CST6, SystemV/CST6CDT, SystemV/EST5, SystemV/EST5EDT, SystemV/HST10, SystemV/MST7, SystemV/MST7MDT, SystemV/PST8, SystemV/PST8PDT, SystemV/YST9, SystemV/YST9YDT, Turkey, UCT, US/Alaska, US/Aleutian, US/Arizona, US/Central, US/East-Indiana, US/Eastern, US/Hawaii, US/Indiana-Starke, US/Michigan, US/Mountain, US/Pacific, US/Pacific-New, 
US/Samoa, UTC, Universal, VST, W-SU, WET, Zulu]

Get TimeZone Offset

UErrorCode success = U_ZERO_ERROR;
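// create the time zone; createTimeZone returns a pointer that the caller owns
auto tz_us = std::unique_ptr<TimeZone>(TimeZone::createTimeZone(UnicodeString::fromUTF8("America/Los_Angeles")));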

auto calendar = std::unique_ptr<Calendar>(Calendar::createInstance(success));
if (U_FAILURE(success)) {
	// something went wrong
	exit(1);
}

auto curDate = calendar->getNow();

// Use getOffset to get the stdOffset and dstOffset for the given time
int32_t stdOffset, dstOffset;
tz_us->getOffset(curDate, false, stdOffset, dstOffset, success);
if (U_FAILURE(success)) {
	// something went wrong
	exit(1);
}
printf("%s: %d\n","US Time Zone STD offset",stdOffset/(1000*60*60));
printf("%s: %d\n","US Time Zone DST offset",dstOffset/(1000*60*60));
// the calendar was created with the default time zone, so switch it to
// the US time zone before asking whether the current time is in DST
calendar->setTimeZone(*tz_us);
printf("%s: ", "US date/time is in daylight saving time");
printf("%s\n", (calendar->inDaylightTime(success))?"Yes":"No");

// output:
// US Time Zone STD offset: -8
// US Time Zone DST offset: 1
// US date/time is in daylight saving time: Yes
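getOffset reports both offsets in milliseconds, and the snippet above divides them down to whole hours. That truncates the minutes for zones whose offset is not a whole number of hours, such as Asia/Kolkata at +5:30. The following is a minimal sketch that formats the combined offset (stdOffset + dstOffset) with hours and minutes; formatUtcOffset is a hypothetical helper that is not part of ICU or of this project.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a total UTC offset given in milliseconds
// (for example stdOffset + dstOffset) as "UTC+HH:MM" or "UTC-HH:MM".
std::string formatUtcOffset(int32_t totalOffsetMs) {
    const char sign = totalOffsetMs < 0 ? '-' : '+';
    // work on the absolute value so hours and minutes are both non-negative
    long long absMs = totalOffsetMs;
    if (absMs < 0) {
        absMs = -absMs;
    }
    const long long totalMinutes = absMs / (1000 * 60);
    char buf[16];
    std::snprintf(buf, sizeof(buf), "UTC%c%02d:%02d", sign,
                  static_cast<int>(totalMinutes / 60),
                  static_cast<int>(totalMinutes % 60));
    return std::string(buf);
}
```

For the values in the example, formatUtcOffset(stdOffset + dstOffset) gives UTC-07:00 while daylight saving time is active (-8 hours plus 1 hour).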

Reducing Data Size

The inlined data is present in data/icudt66l.dat. It is compiled from the ICU library as described here, with the following filters set:

filters.json
{
    "featureFilters": {
        "brkitr_rules": "exclude",
        "brkitr_dictionaries": "exclude",
        "brkitr_tree": "exclude",
        "conversion_mappings": "exclude",
        "confusables": "exclude",
        "curr_supplemental": "exclude",
        "curr_tree": "exclude",
        "lang_tree": "exclude",
        "normalization": "exclude",
        "region_tree": "exclude",
        "rbnf_tree": "exclude",
        "stringprep": "exclude",
        "zone_tree": "exclude",
        "translit": "exclude",
        "unames": "exclude",
        "ulayout": "exclude",
        "unit_tree": "exclude",
        "cnvalias": "exclude",
        "locales_tree": "exclude"
    }
}

The following command can then be run in the ICU source directory to create a packaged data file:

ICU_DATA_FILTER_FILE=filters.json ./runConfigureICU Linux --with-data-packaging=archive
make
# result data file: data/out/icudt66l.dat

In the default configuration, only "misc", "coll_tree" and "coll_ucadata" are included, which are the parts required for collation and basic time zone support. However, all locales are included. The size of the data can be reduced significantly by stripping out locales you do not need. The linked page describes how to do that. After re-packaging the data, run scripts/inline-data.py to inline the smaller data file.

Generating new data

As this ICU version is based on ICU 66, it is recommended to generate the data using ICU 66. However, since you might still want to use the data of newer ICU versions, you can generate updated data by copying the data sources over from a newer ICU repository.

Here are the commands we used to generate a new data file with data from ICU 71:

# download ICU 66
wget https://github.com/unicode-org/icu/archive/refs/tags/release-66-1.zip
unzip release-66-1.zip

# download ICU 71 (replace with latest version)
wget https://github.com/unicode-org/icu/archive/refs/tags/release-71-1.zip
unzip release-71-1.zip

# copy over the data
find icu-release-71-1/icu4c/source/data -type f ! -iname "*.txt" -delete
cp -r icu-release-71-1/icu4c/source/data icu-release-66-1/icu4c/source

# build the data, make sure to create "filters.json" first, see above
cd icu-release-66-1/icu4c/source
ICU_DATA_FILTER_FILE=filters.json ./runConfigureICU Linux --with-data-packaging=archive
make

# the data file will be located in icu-release-66-1/icu4c/source/data/out/icudt66l.dat
# copy over the data to the minimal-icu-collation data repository
# then run the following two commands:
python3 scripts/inline-data.py
python3 scripts/amalgamation.py

About

Simple wrapper of Unicode Collation based on ICU
