This project contains an easy-to-use version of the collation and time zone parts of the ICU library. The entire library, including all the data it needs to work, is contained in two files (icu-collate.cpp and icu-collate.hpp). To use it in your own project, simply copy those two files into it.
The compiled size of the project is around 6MB. Most of this is the inlined ICU data that is required to properly support collation for all included locales. The "Reducing Data Size" section below describes how to strip out certain locales to make the included data smaller.
Here are a few usage snippets.
int32_t count;
auto locales = Collator::getAvailableLocales(count);
fprintf(stdout, "Available collation locales: [");
for(int32_t i = 0; i < count; i++) {
if (i > 0) {
fprintf(stdout, ", ");
}
if (string(locales[i].getCountry()).empty()) {
// language only
fprintf(stdout, "%s", locales[i].getLanguage());
} else {
// language + country
fprintf(stdout, "%s_%s", locales[i].getLanguage(), locales[i].getCountry());
}
}
fprintf(stdout, "]\n");
// output:
// Available collation locales: [af, am, ar, as, az, be, bg, bn, bo, bs, bs, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US, eo, es, et, fa, fa_AF, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, om, or, pa, pa, pa_IN, pl, ps, pt, ro, ru, se, si, sk, sl, smn, sq, sr, sr, sr_BA, sr_ME, sr_RS, sr, sr_BA, sr_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh, zh_CN, zh_SG, zh, zh_HK, zh_MO, zh_TW, zu]
We can use collators to perform locale-based string ordering using the getSortKey method. This method takes a Unicode string as input and returns a byte-orderable sort key based on the locale. That is, we can order the sort keys with standard string comparisons and end up with the correct ordering for that locale.
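Because sort keys are plain byte strings, each key only needs to be computed once per input string and can then be compared as often as the sort requires. The following is a minimal sketch of that idea, not part of this library: sortByKey and toyKey are hypothetical names, and toyKey (ASCII lowercasing) is only a stand-in so the sketch compiles without ICU.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Sorts `values` by the byte order of keyFn(value).
// Each key is computed exactly once, then compared with plain string comparison.
template <class KeyFn>
std::vector<std::string> sortByKey(const std::vector<std::string> &values, KeyFn keyFn) {
    using Entry = std::pair<std::string, std::string>; // (key, original value)
    std::vector<Entry> keyed;
    keyed.reserve(values.size());
    for (const auto &value : values) {
        keyed.emplace_back(keyFn(value), value);
    }
    // byte-wise comparison of the precomputed keys
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Entry &a, const Entry &b) { return a.first < b.first; });
    std::vector<std::string> result;
    result.reserve(keyed.size());
    for (auto &entry : keyed) {
        result.push_back(std::move(entry.second));
    }
    return result;
}

// Hypothetical stand-in for a collation key: lowercases ASCII characters only.
// A real key would come from Collator::getSortKey.
std::string toyKey(const std::string &s) {
    std::string key = s;
    for (auto &c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return key;
}
```

Calling sortByKey({"B", "a", "C"}, toyKey) orders the values by their lowercase keys, whereas a plain std::sort would place the uppercase letters first. With a real collator, keyFn would instead be a lambda that wraps a getSortKey-based helper such as the createSortKey function in this README.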
In the following example, we order several German names using the German locale.
// createSortKey creates a byte-orderable string according to the collation locale
static string createSortKey(icu::Collator &collator, const string &str) {
// convert the UTF-8 input to a UnicodeString once
UnicodeString unicode_str = UnicodeString::fromUTF8(StringPiece(str));
// run getSortKey once to get the size
int32_t size = collator.getSortKey(unicode_str, nullptr, 0);
// allocate a buffer to hold the sort key
auto buf = std::unique_ptr<uint8_t[]>(new uint8_t[size]);
// generate the sort key from the string
collator.getSortKey(unicode_str, buf.get(), size);
return string((char*) buf.get(), size);
}
vector<string> strings = {"Göbel", "Goethe", "Goldmann", "Göthe", "Götz", "Gabel"};
UErrorCode status = U_ZERO_ERROR;
// create the collator for the german locale
auto de_collator = std::unique_ptr<icu::Collator>(Collator::createInstance(Locale("de"), status));
if (U_FAILURE(status)) {
// something went wrong
exit(1);
}
// order the strings using the createSortKey method
std::sort(strings.begin(), strings.end(), [&](const string &a, const string &b) {
return createSortKey(*de_collator, a) < createSortKey(*de_collator, b);
});
for(auto &str : strings) {
fprintf(stdout, "%s\n", str.c_str());
}
// output:
// Gabel
// Göbel
// Goethe
// Goldmann
// Göthe
// Götz
// compare this result with the (incorrect) binary ordering:
std::sort(strings.begin(), strings.end());
for(auto &str : strings) {
fprintf(stdout, "%s\n", str.c_str());
}
// Gabel
// Goethe
// Goldmann
// Göbel
// Göthe
// Götz
UErrorCode status = U_ZERO_ERROR;
auto timeZoneIds = std::unique_ptr<StringEnumeration>(TimeZone::createEnumeration());
const UnicodeString *zoneId = timeZoneIds->snext(status);
if (U_FAILURE(status)) {
// something went wrong
exit(1);
}
int32_t count = 0;
fprintf(stdout, "Available timezones: [");
while (zoneId != NULL && status == U_ZERO_ERROR) {
std::string zoneIdString;
zoneId->toUTF8String(zoneIdString);
if (count > 0) {
fprintf(stdout, ", ");
}
fprintf(stdout, "%s", zoneIdString.c_str());
count++;
zoneId = timeZoneIds->snext(status);
}
fprintf(stdout, "]\n");
// out:
// Available timezones: [ACT, AET, AGT, ART, AST, Africa/Abidjan, Africa/Accra, Africa/Addis_Ababa, Africa/Algiers, Africa/Asmara, Africa/Asmera, Africa/Bamako, Africa/Bangui, Africa/Banjul, Africa/Bissau, Africa/Blantyre, Africa/Brazzaville, Africa/Bujumbura, Africa/Cairo, Africa/Casablanca, Africa/Ceuta, Africa/Conakry, Africa/Dakar, Africa/Dar_es_Salaam, Africa/Djibouti, Africa/Douala, Africa/El_Aaiun, Africa/Freetown, Africa/Gaborone, Africa/Harare, Africa/Johannesburg, Africa/Juba, Africa/Kampala, Africa/Khartoum, Africa/Kigali, Africa/Kinshasa, Africa/Lagos, Africa/Libreville, Africa/Lome, Africa/Luanda, Africa/Lubumbashi, Africa/Lusaka, Africa/Malabo, Africa/Maputo, Africa/Maseru, Africa/Mbabane, Africa/Mogadishu, Africa/Monrovia, Africa/Nairobi, Africa/Ndjamena, Africa/Niamey, Africa/Nouakchott, Africa/Ouagadougou, Africa/Porto-Novo, Africa/Sao_Tome, Africa/Timbuktu, Africa/Tripoli, Africa/Tunis, Africa/Windhoek, America/Adak, America/Anchorage, America/Anguilla, America/Antigua, America/Araguaina, America/Argentina/Buenos_Aires, America/Argentina/Catamarca, America/Argentina/ComodRivadavia, America/Argentina/Cordoba, America/Argentina/Jujuy, America/Argentina/La_Rioja, America/Argentina/Mendoza, America/Argentina/Rio_Gallegos, America/Argentina/Salta, America/Argentina/San_Juan, America/Argentina/San_Luis, America/Argentina/Tucuman, America/Argentina/Ushuaia, America/Aruba, America/Asuncion, America/Atikokan, America/Atka, America/Bahia, America/Bahia_Banderas, America/Barbados, America/Belem, America/Belize, America/Blanc-Sablon, America/Boa_Vista, America/Bogota, America/Boise, America/Buenos_Aires, America/Cambridge_Bay, America/Campo_Grande, America/Cancun, America/Caracas, America/Catamarca, America/Cayenne, America/Cayman, America/Chicago, America/Chihuahua, America/Coral_Harbour, America/Cordoba, America/Costa_Rica, America/Creston, America/Cuiaba, America/Curacao, America/Danmarkshavn, America/Dawson, America/Dawson_Creek, America/Denver, 
America/Detroit, America/Dominica, America/Edmonton, America/Eirunepe, America/El_Salvador, America/Ensenada, America/Fort_Nelson, America/Fort_Wayne, America/Fortaleza, America/Glace_Bay, America/Godthab, America/Goose_Bay, America/Grand_Turk, America/Grenada, America/Guadeloupe, America/Guatemala, America/Guayaquil, America/Guyana, America/Halifax, America/Havana, America/Hermosillo, America/Indiana/Indianapolis, America/Indiana/Knox, America/Indiana/Marengo, America/Indiana/Petersburg, America/Indiana/Tell_City, America/Indiana/Vevay, America/Indiana/Vincennes, America/Indiana/Winamac, America/Indianapolis, America/Inuvik, America/Iqaluit, America/Jamaica, America/Jujuy, America/Juneau, America/Kentucky/Louisville, America/Kentucky/Monticello, America/Knox_IN, America/Kralendijk, America/La_Paz, America/Lima, America/Los_Angeles, America/Louisville, America/Lower_Princes, America/Maceio, America/Managua, America/Manaus, America/Marigot, America/Martinique, America/Matamoros, America/Mazatlan, America/Mendoza, America/Menominee, America/Merida, America/Metlakatla, America/Mexico_City, America/Miquelon, America/Moncton, America/Monterrey, America/Montevideo, America/Montreal, America/Montserrat, America/Nassau, America/New_York, America/Nipigon, America/Nome, America/Noronha, America/North_Dakota/Beulah, America/North_Dakota/Center, America/North_Dakota/New_Salem, America/Ojinaga, America/Panama, America/Pangnirtung, America/Paramaribo, America/Phoenix, America/Port-au-Prince, America/Port_of_Spain, America/Porto_Acre, America/Porto_Velho, America/Puerto_Rico, America/Punta_Arenas, America/Rainy_River, America/Rankin_Inlet, America/Recife, America/Regina, America/Resolute, America/Rio_Branco, America/Rosario, America/Santa_Isabel, America/Santarem, America/Santiago, America/Santo_Domingo, America/Sao_Paulo, America/Scoresbysund, America/Shiprock, America/Sitka, America/St_Barthelemy, America/St_Johns, America/St_Kitts, America/St_Lucia, America/St_Thomas, 
America/St_Vincent, America/Swift_Current, America/Tegucigalpa, America/Thule, America/Thunder_Bay, America/Tijuana, America/Toronto, America/Tortola, America/Vancouver, America/Virgin, America/Whitehorse, America/Winnipeg, America/Yakutat, America/Yellowknife, Antarctica/Casey, Antarctica/Davis, Antarctica/DumontDUrville, Antarctica/Macquarie, Antarctica/Mawson, Antarctica/McMurdo, Antarctica/Palmer, Antarctica/Rothera, Antarctica/South_Pole, Antarctica/Syowa, Antarctica/Troll, Antarctica/Vostok, Arctic/Longyearbyen, Asia/Aden, Asia/Almaty, Asia/Amman, Asia/Anadyr, Asia/Aqtau, Asia/Aqtobe, Asia/Ashgabat, Asia/Ashkhabad, Asia/Atyrau, Asia/Baghdad, Asia/Bahrain, Asia/Baku, Asia/Bangkok, Asia/Barnaul, Asia/Beirut, Asia/Bishkek, Asia/Brunei, Asia/Calcutta, Asia/Chita, Asia/Choibalsan, Asia/Chongqing, Asia/Chungking, Asia/Colombo, Asia/Dacca, Asia/Damascus, Asia/Dhaka, Asia/Dili, Asia/Dubai, Asia/Dushanbe, Asia/Famagusta, Asia/Gaza, Asia/Harbin, Asia/Hebron, Asia/Ho_Chi_Minh, Asia/Hong_Kong, Asia/Hovd, Asia/Irkutsk, Asia/Istanbul, Asia/Jakarta, Asia/Jayapura, Asia/Jerusalem, Asia/Kabul, Asia/Kamchatka, Asia/Karachi, Asia/Kashgar, Asia/Kathmandu, Asia/Katmandu, Asia/Khandyga, Asia/Kolkata, Asia/Krasnoyarsk, Asia/Kuala_Lumpur, Asia/Kuching, Asia/Kuwait, Asia/Macao, Asia/Macau, Asia/Magadan, Asia/Makassar, Asia/Manila, Asia/Muscat, Asia/Nicosia, Asia/Novokuznetsk, Asia/Novosibirsk, Asia/Omsk, Asia/Oral, Asia/Phnom_Penh, Asia/Pontianak, Asia/Pyongyang, Asia/Qatar, Asia/Qostanay, Asia/Qyzylorda, Asia/Rangoon, Asia/Riyadh, Asia/Saigon, Asia/Sakhalin, Asia/Samarkand, Asia/Seoul, Asia/Shanghai, Asia/Singapore, Asia/Srednekolymsk, Asia/Taipei, Asia/Tashkent, Asia/Tbilisi, Asia/Tehran, Asia/Tel_Aviv, Asia/Thimbu, Asia/Thimphu, Asia/Tokyo, Asia/Tomsk, Asia/Ujung_Pandang, Asia/Ulaanbaatar, Asia/Ulan_Bator, Asia/Urumqi, Asia/Ust-Nera, Asia/Vientiane, Asia/Vladivostok, Asia/Yakutsk, Asia/Yangon, Asia/Yekaterinburg, Asia/Yerevan, Atlantic/Azores, Atlantic/Bermuda, Atlantic/Canary, 
Atlantic/Cape_Verde, Atlantic/Faeroe, Atlantic/Faroe, Atlantic/Jan_Mayen, Atlantic/Madeira, Atlantic/Reykjavik, Atlantic/South_Georgia, Atlantic/St_Helena, Atlantic/Stanley, Australia/ACT, Australia/Adelaide, Australia/Brisbane, Australia/Broken_Hill, Australia/Canberra, Australia/Currie, Australia/Darwin, Australia/Eucla, Australia/Hobart, Australia/LHI, Australia/Lindeman, Australia/Lord_Howe, Australia/Melbourne, Australia/NSW, Australia/North, Australia/Perth, Australia/Queensland, Australia/South, Australia/Sydney, Australia/Tasmania, Australia/Victoria, Australia/West, Australia/Yancowinna, BET, BST, Brazil/Acre, Brazil/DeNoronha, Brazil/East, Brazil/West, CAT, CET, CNT, CST, CST6CDT, CTT, Canada/Atlantic, Canada/Central, Canada/East-Saskatchewan, Canada/Eastern, Canada/Mountain, Canada/Newfoundland, Canada/Pacific, Canada/Saskatchewan, Canada/Yukon, Chile/Continental, Chile/EasterIsland, Cuba, EAT, ECT, EET, EST, EST5EDT, Egypt, Eire, Etc/GMT, Etc/GMT+0, Etc/GMT+1, Etc/GMT+10, Etc/GMT+11, Etc/GMT+12, Etc/GMT+2, Etc/GMT+3, Etc/GMT+4, Etc/GMT+5, Etc/GMT+6, Etc/GMT+7, Etc/GMT+8, Etc/GMT+9, Etc/GMT-0, Etc/GMT-1, Etc/GMT-10, Etc/GMT-11, Etc/GMT-12, Etc/GMT-13, Etc/GMT-14, Etc/GMT-2, Etc/GMT-3, Etc/GMT-4, Etc/GMT-5, Etc/GMT-6, Etc/GMT-7, Etc/GMT-8, Etc/GMT-9, Etc/GMT0, Etc/Greenwich, Etc/UCT, Etc/UTC, Etc/Universal, Etc/Zulu, Europe/Amsterdam, Europe/Andorra, Europe/Astrakhan, Europe/Athens, Europe/Belfast, Europe/Belgrade, Europe/Berlin, Europe/Bratislava, Europe/Brussels, Europe/Bucharest, Europe/Budapest, Europe/Busingen, Europe/Chisinau, Europe/Copenhagen, Europe/Dublin, Europe/Gibraltar, Europe/Guernsey, Europe/Helsinki, Europe/Isle_of_Man, Europe/Istanbul, Europe/Jersey, Europe/Kaliningrad, Europe/Kiev, Europe/Kirov, Europe/Lisbon, Europe/Ljubljana, Europe/London, Europe/Luxembourg, Europe/Madrid, Europe/Malta, Europe/Mariehamn, Europe/Minsk, Europe/Monaco, Europe/Moscow, Europe/Nicosia, Europe/Oslo, Europe/Paris, Europe/Podgorica, Europe/Prague, 
Europe/Riga, Europe/Rome, Europe/Samara, Europe/San_Marino, Europe/Sarajevo, Europe/Saratov, Europe/Simferopol, Europe/Skopje, Europe/Sofia, Europe/Stockholm, Europe/Tallinn, Europe/Tirane, Europe/Tiraspol, Europe/Ulyanovsk, Europe/Uzhgorod, Europe/Vaduz, Europe/Vatican, Europe/Vienna, Europe/Vilnius, Europe/Volgograd, Europe/Warsaw, Europe/Zagreb, Europe/Zaporozhye, Europe/Zurich, Factory, GB, GB-Eire, GMT, GMT+0, GMT-0, GMT0, Greenwich, HST, Hongkong, IET, IST, Iceland, Indian/Antananarivo, Indian/Chagos, Indian/Christmas, Indian/Cocos, Indian/Comoro, Indian/Kerguelen, Indian/Mahe, Indian/Maldives, Indian/Mauritius, Indian/Mayotte, Indian/Reunion, Iran, Israel, JST, Jamaica, Japan, Kwajalein, Libya, MET, MIT, MST, MST7MDT, Mexico/BajaNorte, Mexico/BajaSur, Mexico/General, NET, NST, NZ, NZ-CHAT, Navajo, PLT, PNT, PRC, PRT, PST, PST8PDT, Pacific/Apia, Pacific/Auckland, Pacific/Bougainville, Pacific/Chatham, Pacific/Chuuk, Pacific/Easter, Pacific/Efate, Pacific/Enderbury, Pacific/Fakaofo, Pacific/Fiji, Pacific/Funafuti, Pacific/Galapagos, Pacific/Gambier, Pacific/Guadalcanal, Pacific/Guam, Pacific/Honolulu, Pacific/Johnston, Pacific/Kiritimati, Pacific/Kosrae, Pacific/Kwajalein, Pacific/Majuro, Pacific/Marquesas, Pacific/Midway, Pacific/Nauru, Pacific/Niue, Pacific/Norfolk, Pacific/Noumea, Pacific/Pago_Pago, Pacific/Palau, Pacific/Pitcairn, Pacific/Pohnpei, Pacific/Ponape, Pacific/Port_Moresby, Pacific/Rarotonga, Pacific/Saipan, Pacific/Samoa, Pacific/Tahiti, Pacific/Tarawa, Pacific/Tongatapu, Pacific/Truk, Pacific/Wake, Pacific/Wallis, Pacific/Yap, Poland, Portugal, ROC, ROK, SST, Singapore, SystemV/AST4, SystemV/AST4ADT, SystemV/CST6, SystemV/CST6CDT, SystemV/EST5, SystemV/EST5EDT, SystemV/HST10, SystemV/MST7, SystemV/MST7MDT, SystemV/PST8, SystemV/PST8PDT, SystemV/YST9, SystemV/YST9YDT, Turkey, UCT, US/Alaska, US/Aleutian, US/Arizona, US/Central, US/East-Indiana, US/Eastern, US/Hawaii, US/Indiana-Starke, US/Michigan, US/Mountain, US/Pacific, US/Pacific-New, 
US/Samoa, UTC, Universal, VST, W-SU, WET, Zulu]
UErrorCode success = U_ZERO_ERROR;
auto tz_us = CreateTimezone("America/Los_Angeles");
auto calendar = std::unique_ptr<Calendar>(Calendar::createInstance(success));
if (U_FAILURE(success)) {
// something went wrong
exit(1);
}
auto curDate = calendar->getNow();
// Use getOffset to get the stdOffset and dstOffset for the given time
int32_t stdOffset, dstOffset;
tz_us->getOffset(curDate, false, stdOffset, dstOffset, success);
if (U_FAILURE(success)) {
// something went wrong
exit(1);
}
printf("%s: %d\n","US Time Zone STD offset",stdOffset/(1000*60*60));
printf("%s: %d\n","US Time Zone DST offset",dstOffset/(1000*60*60));
printf("%s: ", "US date/time is in daylight savings time");
printf("%s\n", (calendar->inDaylightTime(success))?"Yes":"No");
// out:
// US Time Zone STD offset: -8
// US Time Zone DST offset: 1
// US date/time is in daylight savings time: Yes
Reducing Data Size
The inlined data is present in data/icudt66l.dat. It is compiled from the ICU library as described here, with the following filters set:
filters.json
{
"featureFilters": {
"brkitr_rules" : "exclude",
"brkitr_dictionaries" : "exclude",
"brkitr_tree" : "exclude",
"conversion_mappings" : "exclude",
"confusables" : "exclude",
"curr_supplemental" : "exclude",
"curr_tree" : "exclude",
"lang_tree" : "exclude",
"normalization" : "exclude",
"region_tree" : "exclude",
"rbnf_tree" : "exclude",
"stringprep" : "exclude",
"zone_tree" : "exclude",
"translit" : "exclude",
"unames" : "exclude",
"ulayout" : "exclude",
"unit_tree" : "exclude",
"cnvalias" : "exclude",
"locales_tree" : "exclude"
}
}
The following command can then be run in the ICU source directory to create a packaged data file:
ICU_DATA_FILTER_FILE=filters.json ./runConfigureICU Linux --with-data-packaging=archive
make
# result data file: data/out/icudt66l.dat
In the default configuration, only "misc", "coll_tree" and "coll_ucadata" are included, which are the parts required for collation and basic time zone support. However, all locales are included. The size of the data can be significantly reduced by stripping certain locales. The linked page describes how to do that. After re-packaging the data, you can run scripts/inline-data.py to inline a smaller segment of the data.
As this ICU version is based on ICU 66, it is recommended to generate the data using ICU 66. If you want to use the data of a newer ICU version instead, you can copy the data sources of the newer release into the ICU 66 source tree and build the data file there.
Here are the commands we used to generate a new data file with data from ICU 71:
# download ICU 66
wget https://github.com/unicode-org/icu/archive/refs/tags/release-66-1.zip
unzip release-66-1.zip
# download ICU 71 (replace with latest version)
wget https://github.com/unicode-org/icu/archive/refs/tags/release-71-1.zip
unzip release-71-1.zip
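# keep only the .txt data sources of ICU 71 (all other files are deleted)
find icu-release-71-1/icu4c/source/data -type f ! -iname "*.txt" -delete
# copy the ICU 71 data sources over the ICU 66 data directory
cp -r icu-release-71-1/icu4c/source/data icu-release-66-1/icu4c/source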
# build the data, make sure to create "filters.json" first, see above
cd icu-release-66-1/icu4c/source
ICU_DATA_FILTER_FILE=filters.json ./runConfigureICU Linux --with-data-packaging=archive
make
# the data file will be located in icu-release-66-1/icu4c/source/data/out/icudt66l.dat
# copy the data file to the data directory of the minimal-icu-collation repository
# then run the following two commands:
python3 scripts/inline-data.py
python3 scripts/amalgamation.py