Implement P1885R12: <text_encoding> header #141312

Draft: wants to merge 61 commits into main

smallp-o-p
Contributor

@smallp-o-p smallp-o-p commented May 24, 2025

Resolve #105373 and consequently resolve #118371

First crack at <text_encoding>. Implementation is pretty similar to libstdc++.

@smallp-o-p smallp-o-p requested a review from a team as a code owner May 24, 2025 02:44
@llvmbot llvmbot added the libc++ label (libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.) May 24, 2025
@llvmbot
Member

llvmbot commented May 24, 2025

@llvm/pr-subscribers-libcxx

Author: William Tran-Viet (smallp-o-p)

Changes

Resolve #105373 and consequently #118371

First crack at <text_encoding>. Implementation is pretty similar to libstdc++.


Patch is 118.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141312.diff

32 Files Affected:

  • (modified) libcxx/docs/FeatureTestMacroTable.rst (+1-1)
  • (modified) libcxx/docs/Status/Cxx2cPapers.csv (+2-2)
  • (modified) libcxx/include/CMakeLists.txt (+2)
  • (modified) libcxx/include/__locale (+9)
  • (added) libcxx/include/__text_encoding/text_encoding.h (+1483)
  • (modified) libcxx/include/module.modulemap.in (+7)
  • (added) libcxx/include/text_encoding (+68)
  • (modified) libcxx/include/version (+1-1)
  • (modified) libcxx/modules/std.compat.cppm.in (-3)
  • (modified) libcxx/modules/std.cppm.in (+3-3)
  • (modified) libcxx/modules/std/text_encoding.inc (+3-6)
  • (modified) libcxx/src/CMakeLists.txt (+1)
  • (modified) libcxx/src/locale.cpp (+13)
  • (added) libcxx/src/text_encoding.cpp (+49)
  • (modified) libcxx/test/libcxx/transitive_includes/cxx26.csv (+15)
  • (added) libcxx/test/std/language.support/support.limits/support.limits.general/text_encoding.version.compile.pass.cpp (+63)
  • (modified) libcxx/test/std/language.support/support.limits/support.limits.general/version.version.compile.pass.cpp (+5-11)
  • (added) libcxx/test/std/localization/locales/locale/locale.members/encoding.pass.cpp (+56)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.ctor/default.pass.cpp (+39)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.ctor/id.pass.cpp (+56)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.ctor/string_view.pass.cpp (+73)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.eq/equal.id.pass.cpp (+69)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.eq/equal.pass.cpp (+66)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.members/aliases.pass.cpp (+37)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.members/environment.pass.cpp (+83)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.members/literal.pass.cpp (+49)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.members/text_encoding.aliases_view/begin.pass.cpp (+66)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.members/text_encoding.aliases_view/empty.pass.cpp (+64)
  • (added) libcxx/test/std/utilities/text_encoding/text_encoding.members/text_encoding.aliases_view/front.pass.cpp (+66)
  • (added) libcxx/test/support/test_text_encoding.h (+1173)
  • (modified) libcxx/utils/generate_feature_test_macro_components.py (-1)
  • (modified) libcxx/utils/libcxx/header_information.py (+2-1)
diff --git a/libcxx/docs/FeatureTestMacroTable.rst b/libcxx/docs/FeatureTestMacroTable.rst
index 9b57b7c8eeb52..93308e4078075 100644
--- a/libcxx/docs/FeatureTestMacroTable.rst
+++ b/libcxx/docs/FeatureTestMacroTable.rst
@@ -500,7 +500,7 @@ Status
     ---------------------------------------------------------- -----------------
     ``__cpp_lib_submdspan``                                    *unimplemented*
     ---------------------------------------------------------- -----------------
-    ``__cpp_lib_text_encoding``                                *unimplemented*
+    ``__cpp_lib_text_encoding``                                ``202306L``
     ---------------------------------------------------------- -----------------
     ``__cpp_lib_to_chars``                                     *unimplemented*
     ---------------------------------------------------------- -----------------
diff --git a/libcxx/docs/Status/Cxx2cPapers.csv b/libcxx/docs/Status/Cxx2cPapers.csv
index 3809446a57896..a7dfa75df7c87 100644
--- a/libcxx/docs/Status/Cxx2cPapers.csv
+++ b/libcxx/docs/Status/Cxx2cPapers.csv
@@ -13,7 +13,7 @@
 "`P2013R5 <https://wg21.link/P2013R5>`__","Freestanding Language: Optional ``::operator new``","2023-06 (Varna)","","",""
 "`P2363R5 <https://wg21.link/P2363R5>`__","Extending associative containers with the remaining heterogeneous overloads","2023-06 (Varna)","","",""
 "`P1901R2 <https://wg21.link/P1901R2>`__","Enabling the Use of ``weak_ptr`` as Keys in Unordered Associative Containers","2023-06 (Varna)","","",""
-"`P1885R12 <https://wg21.link/P1885R12>`__","Naming Text Encodings to Demystify Them","2023-06 (Varna)","","",""
+"`P1885R12 <https://wg21.link/P1885R12>`__","Naming Text Encodings to Demystify Them","2023-06 (Varna)","|Complete|","21",""
 "`P0792R14 <https://wg21.link/P0792R14>`__","``function_ref``: a type-erased callable reference","2023-06 (Varna)","","",""
 "`P2874R2 <https://wg21.link/P2874R2>`__","P2874R2: Mandating Annex D Require No More","2023-06 (Varna)","|Complete|","12",""
 "`P2757R3 <https://wg21.link/P2757R3>`__","Type-checking format args","2023-06 (Varna)","","",""
@@ -79,7 +79,7 @@
 "`P3136R1 <https://wg21.link/P3136R1>`__","Retiring niebloids","2024-11 (Wrocław)","|Complete|","14",""
 "`P3138R5 <https://wg21.link/P3138R5>`__","``views::cache_latest``","2024-11 (Wrocław)","","",""
 "`P3379R0 <https://wg21.link/P3379R0>`__","Constrain ``std::expected`` equality operators","2024-11 (Wrocław)","|Complete|","21",""
-"`P2862R1 <https://wg21.link/P2862R1>`__","``text_encoding::name()`` should never return null values","2024-11 (Wrocław)","","",""
+"`P2862R1 <https://wg21.link/P2862R1>`__","``text_encoding::name()`` should never return null values","2024-11 (Wrocław)","|Complete|","21",""
 "`P2897R7 <https://wg21.link/P2897R7>`__","``aligned_accessor``: An ``mdspan`` accessor expressing pointer over-alignment","2024-11 (Wrocław)","|Complete|","21",""
 "`P3355R1 <https://wg21.link/P3355R1>`__","Fix ``submdspan`` for C++26","2024-11 (Wrocław)","","",""
 "`P3222R0 <https://wg21.link/P3222R0>`__","Fix C++26 by adding transposed special cases for P2642 layouts","2024-11 (Wrocław)","","",""
diff --git a/libcxx/include/CMakeLists.txt b/libcxx/include/CMakeLists.txt
index 43cefd5600646..ba61ee7c11e35 100644
--- a/libcxx/include/CMakeLists.txt
+++ b/libcxx/include/CMakeLists.txt
@@ -751,6 +751,7 @@ set(files
   __system_error/error_condition.h
   __system_error/system_error.h
   __system_error/throw_system_error.h
+  __text_encoding/text_encoding.h
   __thread/formatter.h
   __thread/id.h
   __thread/jthread.h
@@ -1062,6 +1063,7 @@ set(files
   strstream
   syncstream
   system_error
+  text_encoding
   tgmath.h
   thread
   tuple
diff --git a/libcxx/include/__locale b/libcxx/include/__locale
index d6c6ef19627ff..4da3f38ac408f 100644
--- a/libcxx/include/__locale
+++ b/libcxx/include/__locale
@@ -31,6 +31,10 @@
 #  include <cstddef>
 #  include <cstring>
 
+#  if _LIBCPP_STD_VER >= 26
+#    include <__text_encoding/text_encoding.h>
+#  endif
+
 #  if _LIBCPP_HAS_WIDE_CHARACTERS
 #    include <cwchar>
 #  else
@@ -99,6 +103,11 @@ public:
 
   // locale operations:
   string name() const;
+  
+#  if _LIBCPP_STD_VER >= 26 && __CHAR_BIT__ == 8
+  text_encoding encoding() const; 
+#  endif // _LIBCPP_STD_VER >= 26
+
   bool operator==(const locale&) const;
 #  if _LIBCPP_STD_VER <= 17
   _LIBCPP_HIDE_FROM_ABI bool operator!=(const locale& __y) const { return !(*this == __y); }
diff --git a/libcxx/include/__text_encoding/text_encoding.h b/libcxx/include/__text_encoding/text_encoding.h
new file mode 100644
index 0000000000000..93d0ae2ab6b89
--- /dev/null
+++ b/libcxx/include/__text_encoding/text_encoding.h
@@ -0,0 +1,1483 @@
+// -*- C++ -*-
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef _LIBCPP___TEXT_ENCODING_TEXT_ENCODING_H
+#define _LIBCPP___TEXT_ENCODING_TEXT_ENCODING_H
+
+#include <__config>
+
+#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
+#  pragma GCC system_header
+#endif
+
+#if _LIBCPP_HAS_LOCALIZATION
+
+#include <__algorithm/copy_n.h>
+#include <__algorithm/lower_bound.h>
+#include <__algorithm/min.h>
+#include <__functional/hash.h>
+#include <__iterator/iterator_traits.h>
+#include <__locale_dir/locale_base_api.h>
+#include <__ranges/view_interface.h>
+#include <__string/char_traits.h>
+#include <__utility/unreachable.h>
+#include <cstdint>
+#include <string_view>
+
+_LIBCPP_PUSH_MACROS
+#include <__undef_macros>
+
+#if _LIBCPP_STD_VER >= 26
+_LIBCPP_BEGIN_NAMESPACE_STD
+
+struct _LIBCPP_EXPORTED_FROM_ABI text_encoding {
+  static constexpr size_t max_name_length = 63;
+
+private:
+  struct __encoding_data {
+    using __id_rep _LIBCPP_NODEBUG = int_least32_t;
+    __id_rep __mib_rep;
+    const char* __name;
+
+    friend constexpr bool operator==(const __encoding_data& __e, const __encoding_data& __other) _NOEXCEPT {
+      return __e.__mib_rep == __other.__mib_rep || __comp_name(__e.__name, __other.__name);
+    }
+
+    friend constexpr bool operator<(const __encoding_data& __e, const __id_rep __i) _NOEXCEPT {
+      return __e.__mib_rep < __i;
+    }
+  };
+
+public:
+  enum class id : __encoding_data::__id_rep {
+    other                   = 1,
+    unknown                 = 2,
+    ASCII                   = 3,
+    ISOLatin1               = 4,
+    ISOLatin2               = 5,
+    ISOLatin3               = 6,
+    ISOLatin4               = 7,
+    ISOLatinCyrillic        = 8,
+    ISOLatinArabic          = 9,
+    ISOLatinGreek           = 10,
+    ISOLatinHebrew          = 11,
+    ISOLatin5               = 12,
+    ISOLatin6               = 13,
+    ISOTextComm             = 14,
+    HalfWidthKatakana       = 15,
+    JISEncoding             = 16,
+    ShiftJIS                = 17,
+    EUCPkdFmtJapanese       = 18,
+    EUCFixWidJapanese       = 19,
+    ISO4UnitedKingdom       = 20,
+    ISO11SwedishForNames    = 21,
+    ISO15Italian            = 22,
+    ISO17Spanish            = 23,
+    ISO21German             = 24,
+    ISO60DanishNorwegian    = 25,
+    ISO69French             = 26,
+    ISO10646UTF1            = 27,
+    ISO646basic1983         = 28,
+    INVARIANT               = 29,
+    ISO2IntlRefVersion      = 30,
+    NATSSEFI                = 31,
+    NATSSEFIADD             = 32,
+    NATSDANO                = 33,
+    NATSDANOADD             = 34,
+    ISO10Swedish            = 35,
+    KSC56011987             = 36,
+    ISO2022KR               = 37,
+    EUCKR                   = 38,
+    ISO2022JP               = 39,
+    ISO2022JP2              = 40,
+    ISO13JISC6220jp         = 41,
+    ISO14JISC6220ro         = 42,
+    ISO16Portuguese         = 43,
+    ISO18Greek7Old          = 44,
+    ISO19LatinGreek         = 45,
+    ISO25French             = 46,
+    ISO27LatinGreek1        = 47,
+    ISO5427Cyrillic         = 48,
+    ISO42JISC62261978       = 49,
+    ISO47BSViewdata         = 50,
+    ISO49INIS               = 51,
+    ISO50INIS8              = 52,
+    ISO51INISCyrillic       = 53,
+    ISO54271981             = 54,
+    ISO5428Greek            = 55,
+    ISO57GB1988             = 56,
+    ISO58GB231280           = 57,
+    ISO61Norwegian2         = 58,
+    ISO70VideotexSupp1      = 59,
+    ISO84Portuguese2        = 60,
+    ISO85Spanish2           = 61,
+    ISO86Hungarian          = 62,
+    ISO87JISX0208           = 63,
+    ISO88Greek7             = 64,
+    ISO89ASMO449            = 65,
+    ISO90                   = 66,
+    ISO91JISC62291984a      = 67,
+    ISO92JISC62991984b      = 68,
+    ISO93JIS62291984badd    = 69,
+    ISO94JIS62291984hand    = 70,
+    ISO95JIS62291984handadd = 71,
+    ISO96JISC62291984kana   = 72,
+    ISO2033                 = 73,
+    ISO99NAPLPS             = 74,
+    ISO102T617bit           = 75,
+    ISO103T618bit           = 76,
+    ISO111ECMACyrillic      = 77,
+    ISO121Canadian1         = 78,
+    ISO122Canadian2         = 79,
+    ISO123CSAZ24341985gr    = 80,
+    ISO88596E               = 81,
+    ISO88596I               = 82,
+    ISO128T101G2            = 83,
+    ISO88598E               = 84,
+    ISO88598I               = 85,
+    ISO139CSN369103         = 86,
+    ISO141JUSIB1002         = 87,
+    ISO143IECP271           = 88,
+    ISO146Serbian           = 89,
+    ISO147Macedonian        = 90,
+    ISO150                  = 91,
+    ISO151Cuba              = 92,
+    ISO6937Add              = 93,
+    ISO153GOST1976874       = 94,
+    ISO8859Supp             = 95,
+    ISO10367Box             = 96,
+    ISO158Lap               = 97,
+    ISO159JISX02121990      = 98,
+    ISO646Danish            = 99,
+    USDK                    = 100,
+    DKUS                    = 101,
+    KSC5636                 = 102,
+    Unicode11UTF7           = 103,
+    ISO2022CN               = 104,
+    ISO2022CNEXT            = 105,
+    UTF8                    = 106,
+    ISO885913               = 109,
+    ISO885914               = 110,
+    ISO885915               = 111,
+    ISO885916               = 112,
+    GBK                     = 113,
+    GB18030                 = 114,
+    OSDEBCDICDF0415         = 115,
+    OSDEBCDICDF03IRV        = 116,
+    OSDEBCDICDF041          = 117,
+    ISO115481               = 118,
+    KZ1048                  = 119,
+    UCS2                    = 1000,
+    UCS4                    = 1001,
+    UnicodeASCII            = 1002,
+    UnicodeLatin1           = 1003,
+    UnicodeJapanese         = 1004,
+    UnicodeIBM1261          = 1005,
+    UnicodeIBM1268          = 1006,
+    UnicodeIBM1276          = 1007,
+    UnicodeIBM1264          = 1008,
+    UnicodeIBM1265          = 1009,
+    Unicode11               = 1010,
+    SCSU                    = 1011,
+    UTF7                    = 1012,
+    UTF16BE                 = 1013,
+    UTF16LE                 = 1014,
+    UTF16                   = 1015,
+    CESU8                   = 1016,
+    UTF32                   = 1017,
+    UTF32BE                 = 1018,
+    UTF32LE                 = 1019,
+    BOCU1                   = 1020,
+    UTF7IMAP                = 1021,
+    Windows30Latin1         = 2000,
+    Windows31Latin1         = 2001,
+    Windows31Latin2         = 2002,
+    Windows31Latin5         = 2003,
+    HPRoman8                = 2004,
+    AdobeStandardEncoding   = 2005,
+    VenturaUS               = 2006,
+    VenturaInternational    = 2007,
+    DECMCS                  = 2008,
+    PC850Multilingual       = 2009,
+    PC8DanishNorwegian      = 2012,
+    PC862LatinHebrew        = 2013,
+    PC8Turkish              = 2014,
+    IBMSymbols              = 2015,
+    IBMThai                 = 2016,
+    HPLegal                 = 2017,
+    HPPiFont                = 2018,
+    HPMath8                 = 2019,
+    HPPSMath                = 2020,
+    HPDesktop               = 2021,
+    VenturaMath             = 2022,
+    MicrosoftPublishing     = 2023,
+    Windows31J              = 2024,
+    GB2312                  = 2025,
+    Big5                    = 2026,
+    Macintosh               = 2027,
+    IBM037                  = 2028,
+    IBM038                  = 2029,
+    IBM273                  = 2030,
+    IBM274                  = 2031,
+    IBM275                  = 2032,
+    IBM277                  = 2033,
+    IBM278                  = 2034,
+    IBM280                  = 2035,
+    IBM281                  = 2036,
+    IBM284                  = 2037,
+    IBM285                  = 2038,
+    IBM290                  = 2039,
+    IBM297                  = 2040,
+    IBM420                  = 2041,
+    IBM423                  = 2042,
+    IBM424                  = 2043,
+    PC8CodePage437          = 2011,
+    IBM500                  = 2044,
+    IBM851                  = 2045,
+    PCp852                  = 2010,
+    IBM855                  = 2046,
+    IBM857                  = 2047,
+    IBM860                  = 2048,
+    IBM861                  = 2049,
+    IBM863                  = 2050,
+    IBM864                  = 2051,
+    IBM865                  = 2052,
+    IBM868                  = 2053,
+    IBM869                  = 2054,
+    IBM870                  = 2055,
+    IBM871                  = 2056,
+    IBM880                  = 2057,
+    IBM891                  = 2058,
+    IBM903                  = 2059,
+    IBBM904                 = 2060,
+    IBM905                  = 2061,
+    IBM918                  = 2062,
+    IBM1026                 = 2063,
+    IBMEBCDICATDE           = 2064,
+    EBCDICATDEA             = 2065,
+    EBCDICCAFR              = 2066,
+    EBCDICDKNO              = 2067,
+    EBCDICDKNOA             = 2068,
+    EBCDICFISE              = 2069,
+    EBCDICFISEA             = 2070,
+    EBCDICFR                = 2071,
+    EBCDICIT                = 2072,
+    EBCDICPT                = 2073,
+    EBCDICES                = 2074,
+    EBCDICESA               = 2075,
+    EBCDICESS               = 2076,
+    EBCDICUK                = 2077,
+    EBCDICUS                = 2078,
+    Unknown8BiT             = 2079,
+    Mnemonic                = 2080,
+    Mnem                    = 2081,
+    VISCII                  = 2082,
+    VIQR                    = 2083,
+    KOI8R                   = 2084,
+    HZGB2312                = 2085,
+    IBM866                  = 2086,
+    PC775Baltic             = 2087,
+    KOI8U                   = 2088,
+    IBM00858                = 2089,
+    IBM00924                = 2090,
+    IBM01140                = 2091,
+    IBM01141                = 2092,
+    IBM01142                = 2093,
+    IBM01143                = 2094,
+    IBM01144                = 2095,
+    IBM01145                = 2096,
+    IBM01146                = 2097,
+    IBM01147                = 2098,
+    IBM01148                = 2099,
+    IBM01149                = 2100,
+    Big5HKSCS               = 2101,
+    IBM1047                 = 2102,
+    PTCP154                 = 2103,
+    Amiga1251               = 2104,
+    KOI7switched            = 2105,
+    BRF                     = 2106,
+    TSCII                   = 2107,
+    CP51932                 = 2108,
+    windows874              = 2109,
+    windows1250             = 2250,
+    windows1251             = 2251,
+    windows1252             = 2252,
+    windows1253             = 2253,
+    windows1254             = 2254,
+    windows1255             = 2255,
+    windows1256             = 2256,
+    windows1257             = 2257,
+    windows1258             = 2258,
+    TIS620                  = 2259,
+    CP50220                 = 2260,
+    reserved                = 3000
+  };
+
+  using enum id;
+
+  _LIBCPP_HIDE_FROM_ABI constexpr text_encoding() = default;
+  _LIBCPP_HIDE_FROM_ABI constexpr explicit text_encoding(string_view __enc) _NOEXCEPT
+      : __encoding_rep_(__find_encoding_data(__enc)) {
+    __enc.copy(__name_, max_name_length, 0);
+  }
+  _LIBCPP_HIDE_FROM_ABI constexpr text_encoding(id __i) _NOEXCEPT : __encoding_rep_(__find_encoding_data_by_id(__i)) {
+    if (__encoding_rep_->__name[0] != '\0')
+      std::copy_n(__encoding_rep_->__name, std::char_traits<char>::length(__encoding_rep_->__name), __name_);
+  }
+
+  [[nodiscard]] _LIBCPP_HIDE_FROM_ABI constexpr id mib() const _NOEXCEPT { return id(__encoding_rep_->__mib_rep); }
+  [[nodiscard]] _LIBCPP_HIDE_FROM_ABI constexpr const char* name() const _NOEXCEPT { return __name_; }
+
+  // [text.encoding.aliases], class text_encoding::aliases_view
+  struct aliases_view : ranges::view_interface<aliases_view> {
+    constexpr aliases_view() = default;
+    constexpr aliases_view(const __encoding_data* __d) : __view_data_(__d) {}
+    struct __end_sentinel {};
+    struct __iterator {
+      using value_type        = const char*;
+      using reference         = const char*;
+      using difference_type   = ptrdiff_t;
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator() noexcept = default; 
+      
+      _LIBCPP_HIDE_FROM_ABI constexpr value_type operator*() const {
+        if (__can_dereference())
+          return __data_->__name;
+        std::unreachable();
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr value_type operator[](difference_type __n) const {
+        auto __it = *this;
+        return *(__it + __n);
+      }
+
+      _LIBCPP_HIDE_FROM_ABI friend constexpr __iterator operator+(__iterator __it, difference_type __n) {
+        __it += __n;
+        return __it;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI friend constexpr __iterator operator+(difference_type __n, __iterator __it) {
+        __it += __n;
+        return __it;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI friend constexpr __iterator operator-(__iterator __it, difference_type __n) {
+        __it -= __n; 
+        return __it;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr difference_type operator-(const __iterator& __other) const 
+      {
+        if(__other.__mib_rep_ == __mib_rep_)
+          return __mib_rep_ - __other.__mib_rep_;
+        std::unreachable();
+      }
+
+      _LIBCPP_HIDE_FROM_ABI friend constexpr __iterator operator-(difference_type __n, __iterator& __it) {
+        __it -= __n; 
+        return __it;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator& operator++() {
+        __data_++;
+        return *this;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator operator++(int) {
+        auto __old = *this;
+        __data_++;
+        return __old;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator& operator--() {
+        __data_--;
+        return *this;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator operator--(int) {
+        auto __old = *this;
+        __data_--;
+        return __old;
+      }
+
+      // Check if going past the encoding data list array and if the new index has the same id, if not then
+      // replace it with a sentinel "out-of-bounds" iterator.
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator& operator+=(difference_type __n) {
+        if (__data_) [[__likely__]] {
+          if (__n > 0) {
+            if ((__data_ + __n) < std::end(__text_encoding_data) && __data_[__n - 1].__mib_rep == __mib_rep_)
+              __data_ += __n;
+            else
+              *this = __iterator{};
+          } else if (__n < 0) {
+            if ((__data_ + __n) > __text_encoding_data && __data_[__n].__mib_rep == __mib_rep_)
+              __data_ += __n;
+            else
+              *this = __iterator{};
+          }
+        }
+        return *this;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator& operator-=(difference_type __n) { return operator+=(-__n); }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr bool operator==(const __iterator& __it) const {
+        return __data_ == __it.__data_ && __it.__mib_rep_ == __mib_rep_;
+      }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr bool operator==(__end_sentinel) const { return !__can_dereference(); }
+
+      _LIBCPP_HIDE_FROM_ABI constexpr auto operator<=>(__iterator __it) const { return __data_ <=> __it.__data_; }
+
+    private:
+      friend struct text_encoding;
+
+      _LIBCPP_HIDE_FROM_ABI constexpr __iterator(const __encoding_data* __enc_d) noexcept
+         ...
[truncated]

github-actions bot commented May 24, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,,inc,cpp -- libcxx/include/__text_encoding/get_locale_encoding.h libcxx/include/__text_encoding/text_encoding.h libcxx/include/text_encoding libcxx/src/text_encoding.cpp libcxx/test/std/language.support/support.limits/support.limits.general/text_encoding.version.compile.pass.cpp libcxx/test/std/localization/locales/locale/locale.members/encoding.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.ctor/default.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.ctor/id.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.ctor/string_view.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.eq/equal.id.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.eq/equal.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/aliases.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/environment.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/literal.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/nodiscard.verify.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/text_encoding.aliases_view/begin.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/text_encoding.aliases_view/empty.pass.cpp libcxx/test/std/utilities/text_encoding/text_encoding.members/text_encoding.aliases_view/front.pass.cpp libcxx/test/support/test_text_encoding.h libcxx/include/__locale libcxx/include/__locale_dir/locale_base_api.h libcxx/include/__locale_dir/locale_base_api/ibm.h libcxx/include/__locale_dir/support/bsd_like.h libcxx/include/__locale_dir/support/fuchsia.h libcxx/include/__locale_dir/support/linux.h libcxx/include/version libcxx/modules/std/text_encoding.inc libcxx/test/std/language.support/support.limits/support.limits.general/version.version.compile.pass.cpp 
libcxx/test/std/localization/locale.categories/category.monetary/locale.money.get/locale.money.get.members/get_long_double_fr_FR.pass.cpp libcxx/test/std/localization/locale.categories/category.monetary/locale.money.put/locale.money.put.members/put_long_double_fr_FR.pass.cpp
View the diff from clang-format here.
diff --git a/libcxx/test/std/language.support/support.limits/support.limits.general/text_encoding.version.compile.pass.cpp b/libcxx/test/std/language.support/support.limits/support.limits.general/text_encoding.version.compile.pass.cpp
index 817b0f0d6..1678e8840 100644
--- a/libcxx/test/std/language.support/support.limits/support.limits.general/text_encoding.version.compile.pass.cpp
+++ b/libcxx/test/std/language.support/support.limits/support.limits.general/text_encoding.version.compile.pass.cpp
@@ -60,4 +60,3 @@
 #endif // TEST_STD_VER > 23
 
 // clang-format on
-
diff --git a/libcxx/test/std/localization/locale.categories/category.monetary/locale.money.get/locale.money.get.members/get_long_double_fr_FR.pass.cpp b/libcxx/test/std/localization/locale.categories/category.monetary/locale.money.get/locale.money.get.members/get_long_double_fr_FR.pass.cpp
index a87fd19c1..341f05943 100644
--- a/libcxx/test/std/localization/locale.categories/category.monetary/locale.money.get/locale.money.get.members/get_long_double_fr_FR.pass.cpp
+++ b/libcxx/test/std/localization/locale.categories/category.monetary/locale.money.get/locale.money.get.members/get_long_double_fr_FR.pass.cpp
@@ -543,7 +543,8 @@ int main(int, char**)
           std::noshowbase(ios);
         }
         {   // negative, showbase
-          std::wstring v = convert_thousands_sep(L"-1" THOUSANDS_SEP_ "234" THOUSANDS_SEP_ "567,89 \u20ac"); // EURO SIGN
+          std::wstring v =
+              convert_thousands_sep(L"-1" THOUSANDS_SEP_ "234" THOUSANDS_SEP_ "567,89 \u20ac"); // EURO SIGN
           std::showbase(ios);
           typedef cpp17_input_iterator<const wchar_t*> I;
           long double ex;

@smallp-o-p smallp-o-p marked this pull request as draft May 24, 2025 02:47
@frederick-vs-ja
Contributor

frederick-vs-ja commented May 24, 2025

Thanks! I've edited the PR description to associate this PR with both issues.

Contributor

@cor3ntin cor3ntin left a comment

Exciting to see progress on this

I'm not a library maintainer, so take my comments for what they are worth :)

What's the state of Windows support in libc++ these days? environment() is POSIX-specific at the moment.

auto __make_locale = [](const char* __name) {
  text_encoding __enc{};
  if (auto __loc = __locale::__newlocale(LC_CTYPE_MASK, __name, static_cast<locale_t>(0))) {
    if (const char* __codeset = nl_langinfo_l(CODESET, __loc)) {
Contributor

You might want to check that you are in a POSIX environment here. nl_langinfo_l is not going to be available on Windows, for example.

Contributor

This should probably be in the locale base API, since this is platform-specific and locale related.
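
A rough sketch of the guard discussed in this thread, assuming that the presence of <langinfo.h> is a good enough signal for a POSIX-like platform. The names `codeset_for_locale` and `SKETCH_HAS_NL_LANGINFO_L` are made up for illustration; they are not part of the patch or of libc++'s locale base API, which would wrap these calls differently.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Feature-detect the POSIX header instead of assuming it exists.
#if __has_include(<langinfo.h>)
#  include <langinfo.h>
#  include <locale.h>
#  if defined(__APPLE__)
#    include <xlocale.h> // Apple declares the *_l functions here
#  endif
#  define SKETCH_HAS_NL_LANGINFO_L 1
#else
#  define SKETCH_HAS_NL_LANGINFO_L 0
#endif

// Hypothetical helper: returns the codeset name of the named locale,
// or nullopt if the locale cannot be created or the platform lacks
// nl_langinfo_l (for example Windows).
inline std::optional<std::string> codeset_for_locale(const char* name) {
  assert(name != nullptr);
#if SKETCH_HAS_NL_LANGINFO_L
  locale_t loc = newlocale(LC_CTYPE_MASK, name, static_cast<locale_t>(0));
  if (!loc)
    return std::nullopt;
  std::optional<std::string> result;
  if (const char* codeset = nl_langinfo_l(CODESET, loc))
    result = codeset;
  freelocale(loc); // the string was copied, so the locale can be released
  return result;
#else
  (void)name;
  return std::nullopt; // no POSIX locale API available
#endif
}
```

Keeping the platform check in one helper like this is what moving the call into the locale base API would achieve: the text_encoding code itself would contain no platform conditionals.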

  }
  _LIBCPP_HIDE_FROM_ABI constexpr text_encoding(id __i) _NOEXCEPT : __encoding_rep_(__find_encoding_data_by_id(__i)) {
    if (__encoding_rep_->__name[0] != '\0')
      std::copy_n(__encoding_rep_->__name, std::char_traits<char>::length(__encoding_rep_->__name), __name_);
Contributor

I would just use strncpy here, but I don't know what libc++ folks prefer.

Contributor

I'd rather save the size and avoid this call entirely. I'm pretty sure we can get away with not even increasing the size of the struct, since it's at most 63.

Comment on lines +495 to +498
template <id __i>
[[nodiscard]] _LIBCPP_HIDE_FROM_ABI static bool environment_is() {
  return environment() == __i;
}
Contributor

That function was intended to be optimizable, e.g. for UTF-8, so that it would not access / odr-use the data table. But this implementation is fine, especially as a first pass.
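
One possible shape of such a table-free check, as a sketch only: compare the environment's codeset name against the UTF-8 names directly, using the loose charset-alias matching rules from Unicode Technical Standard #22 (drop non-alphanumerics, ignore case, drop zeros not preceded by a digit). `normalize_name` and `is_utf8_codeset` are hypothetical names, and the code assumes an ASCII-compatible execution character set; it is not what the patch does.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: reduce a charset name to its loose-matching form.
inline std::string normalize_name(std::string_view name) {
  std::string out;
  bool prev_digit = false; // was the last *kept* character a digit?
  for (char ch : name) {
    if (ch >= 'A' && ch <= 'Z')
      ch = static_cast<char>(ch - 'A' + 'a'); // fold to lowercase
    const bool is_digit = ch >= '0' && ch <= '9';
    const bool is_alpha = ch >= 'a' && ch <= 'z';
    if (!is_digit && !is_alpha)
      continue; // drop everything except ASCII letters and digits
    if (ch == '0' && !prev_digit)
      continue; // drop a zero that is not preceded by a digit
    out.push_back(ch);
    prev_digit = is_digit;
  }
  return out;
}

// Hypothetical fast path: no lookup in the encoding table is needed
// to decide whether a codeset name denotes UTF-8.
inline bool is_utf8_codeset(std::string_view codeset) {
  const std::string n = normalize_name(codeset);
  return n == "utf8" || n == "csutf8";
}
```

With this, "UTF-8", "utf8" and "u.t.f-008" all normalize to the same string, so a UTF-8 specialization of environment_is could answer from the codeset name alone.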

Comment on lines 567 to 568
{1, ""},
{2, ""},
Contributor

Are these useful?

@cor3ntin cor3ntin requested a review from EricWF May 24, 2025 08:34
Comment on lines +1458 to +1459
const __encoding_data* __encoding_rep_ = __text_encoding_data + 1;
char __name_[max_name_length + 1] = {0};
Contributor

To decrease the size at least somewhat, we could instead have a union of these two and set the last byte to a non-zero value if we store a pointer. The __name_ would be the same as __encoding_rep_ in that case IIUC.

Contributor Author

@smallp-o-p smallp-o-p May 25, 2025

I don't think this is feasible because of 2.2: when a match is found for the name passed in enc, we'd have to copy the name into the buffer and somehow still be able to retrieve the id without the pointer.

Edit: We could still avoid the call to copy_n, though, and change name() to check whether the first character is a null terminator.
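A minimal sketch of the idea in the edit above, assuming both the table pointer and the name buffer are kept: an object built from an id stores only the pointer and leaves the buffer empty, and `name()` falls back to the table entry when the first character is null. The types `encoding_data` and `encoding_name` and the entry `utf8_entry` are hypothetical stand-ins, not the PR's definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical table entry.
struct encoding_data {
  int mib;
  const char* name;
};
inline constexpr encoding_data utf8_entry{106, "UTF-8"};

class encoding_name {
public:
  static constexpr std::size_t max_name_length = 63;

  // Built from an id: keep only the table pointer, copy nothing.
  explicit encoding_name(const encoding_data* rep) noexcept : rep_(rep) {}

  // Built from a user-supplied name that matched `rep`: keep the user's spelling too.
  encoding_name(std::string_view spelled, const encoding_data* rep) noexcept : rep_(rep) {
    assert(!spelled.empty() && spelled.size() <= max_name_length);
    std::memcpy(name_, spelled.data(), spelled.size());
  }

  int mib() const noexcept { return rep_->mib; }

  // An empty buffer means "use the table's name".
  const char* name() const noexcept { return name_[0] != '\0' ? name_ : rep_->name; }

private:
  const encoding_data* rep_;
  char name_[max_name_length + 1] = {};
};
```

This keeps the id reachable through the pointer in both cases, which is the constraint raised in the first paragraph of the comment.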

Comment on lines 45 to 46
__id_rep __mib_rep;
const char* __name;
Contributor

Should we store the size of the name as well? We already have space between __mib_rep and __name anyways. Also, they should be __mib_rep_ and __name_.

# pragma GCC system_header
#endif

#if _LIBCPP_HAS_LOCALIZATION
Contributor

Why do we need to guard this?

Contributor Author

I would think a user who is building libc++ without localization wouldn't have much use for a text encoding database, hence the _LIBCPP_HAS_LOCALIZATION guard. However, I'm not an expert. In the standard, localization and text_encoding are different subcategories of the text processing library, so it's more likely that the guard should be removed. Perhaps @cor3ntin could weigh in?

Contributor

Locale and encoding are unrelated, for the most part, so I would not disable the feature if locale is not available.
Maybe environment/environment_is would be considered to require locale?

Contributor

If it's only one or two functions that actually require locale we should guard only those. It may not make a ton of sense to use this without localization, but _LIBCPP_HAS_LOCALIZATION really means "has libc locale support". Unless there is actually a platform requirement for a feature, we shouldn't guard that feature.
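As a sketch of what guarding only the locale-dependent members could look like: the macro `MYLIB_HAS_LOCALIZATION` below is a hypothetical stand-in for _LIBCPP_HAS_LOCALIZATION, and `text_encoding_sketch` is a heavily reduced illustration, not the class from the PR.

```cpp
// Hypothetical stand-in for _LIBCPP_HAS_LOCALIZATION; defaults to "no locale support".
#ifndef MYLIB_HAS_LOCALIZATION
#  define MYLIB_HAS_LOCALIZATION 0
#endif

class text_encoding_sketch {
public:
  enum class id : int { other = 1, unknown = 2, UTF8 = 106 };

  // Always available: nothing here depends on the platform's locale support.
  constexpr text_encoding_sketch() = default;
  constexpr explicit text_encoding_sketch(id i) noexcept : mib_(i) {}
  constexpr id mib() const noexcept { return mib_; }

#if MYLIB_HAS_LOCALIZATION
  // Only these members need the locale machinery, so only they are guarded.
  static text_encoding_sketch environment(); // defined in a separately compiled file
  template <id I>
  static bool environment_is() {
    return environment().mib() == I;
  }
#endif

private:
  id mib_ = id::unknown;
};
```

With the guard narrowed this way, the class itself stays usable on configurations without locale support.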

auto __make_locale = [](const char* __name) {
text_encoding __enc{};
if (auto __loc = __locale::__newlocale(LC_CTYPE_MASK, __name, static_cast<locale_t>(0))) {
if (const char* __codeset = nl_langinfo_l(CODESET, __loc)) {
Contributor

This should probably be in the locale base API, since this is platform-specific and locale related.

@@ -0,0 +1,56 @@

Contributor

Suggested change

# include <langinfo.h>
#endif

#if _LIBCPP_STD_VER >= 26
Contributor

We should never have version checks in src/.

Contributor Author

Pardon my unfamiliarity with the codebase, but wouldn't this prevent the library from being built, since it's built with -std=c++23? In that case this PR would probably have to be parked (once all the requested changes are made and approvals are given) until the library is built with -std=c++26.

Contributor

The library needs to be built with -std=c++26 (or in an even later mode) as long as text_encoding::environment is separately compiled. So it doesn't make sense to guard this with _LIBCPP_STD_VER >= 26.

Alternatively, it's possible to provide an internal function achieving this functionality that's separately compiled and available in old modes. But it still makes no sense to guard the function definition in src/.

Comment on lines +66 to +68
#if _LIBCPP_STD_VER >= 26
# include <__text_encoding/text_encoding.h>
#endif // _LIBCPP_STD_VER >= 26
Contributor

Why did you put the implementation in an internal header? I don't see a reason for that. IMO you should move the implementation here instead. The granularization of headers has a special purpose.

Contributor

IMO this is fine. We've granularized most headers by now, so this keeps the pattern, and this avoids having to move code around in the future.

Contributor

I'm not arguing. IMO, without anything granularized, it's just unnecessary overhead. I wondered why you didn't comment on it.

Comment on lines 12 to 14
#include "test_macros.h"
#include <cstdint>

Contributor

Suggested change
- #include "test_macros.h"
- #include <cstdint>
+ #include <cstdint>
+ #include "test_macros.h"

Do you really need this header in the support directory? Perhaps it can be put in the test directory?

@Zingam
Contributor

Zingam commented May 24, 2025

@smallp-o-p BTW You should use proper GitHub syntax to link the PR to the issues: https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue


// [text.encoding.aliases], class text_encoding::aliases_view
struct aliases_view : ranges::view_interface<aliases_view> {
constexpr aliases_view() = default;
Contributor

Do we need this default constructor? It's clarified in https://eel.is/c++draft/text.encoding.aliases#note-1 that aliases_view can be non-default_initializable.

// [text.encoding.aliases], class text_encoding::aliases_view
struct aliases_view : ranges::view_interface<aliases_view> {
constexpr aliases_view() = default;
constexpr aliases_view(const __encoding_data* __d) : __view_data_(__d) {}
Contributor

This constructor seems too permissive (ditto __iterator's), e.g. one can write aliases_view(nullptr). I guess it's better to make it hard or even impossible to call the constructor without mentioning any reserved identifier.
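One possible way to address this, sketched under the assumption that only the enclosing class ever needs to build the view: make the constructor private and befriend the enclosing class. This also drops the default constructor, in line with the earlier remark that aliases_view need not be default-initializable. The names `encoding_data`, `text_encoding_sketch`, and `front()` are hypothetical simplifications.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical table entry.
struct encoding_data {
  int mib;
  const char* name;
};

class text_encoding_sketch {
public:
  class aliases_view {
  public:
    // Simplified accessor standing in for the real range interface.
    const char* front() const noexcept {
      assert(data_ != nullptr);
      return data_->name;
    }

  private:
    friend class text_encoding_sketch; // the only code allowed to construct a view
    constexpr explicit aliases_view(const encoding_data* d) noexcept : data_(d) {}
    const encoding_data* data_;
  };

  aliases_view aliases() const noexcept { return aliases_view(rep_); }

private:
  static constexpr encoding_data utf8_{106, "UTF-8"};
  const encoding_data* rep_ = &utf8_;
};

// User code can no longer write aliases_view(nullptr) or aliases_view().
static_assert(!std::is_default_constructible_v<text_encoding_sketch::aliases_view>);
static_assert(!std::is_constructible_v<text_encoding_sketch::aliases_view, const encoding_data*>);
static_assert(!std::is_constructible_v<text_encoding_sketch::aliases_view, std::nullptr_t>);
```

Another option in the same spirit would be a public constructor that takes a tag type with a reserved name; the sketch above picks the private-constructor route only because it is shorter.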

@frederick-vs-ja
Contributor

frederick-vs-ja commented May 28, 2025

The issue with text_encoding::environment() is that the text_encoding class isn't visible in C++23, and I don't see how it would be possible to make it visible prior to C++26 other than:

  1. Backport <text_encoding> to C++23
  2. Wait until libcxx is built with -std=c++26 by default

I'll have to dig in more with the availability macros, how <filesystem> was implemented seems promising.

I think there's another approach. Since text_encoding is a trivially copyable class (although this is not yet guaranteed by the standard!), we can define

  • a __text_encoding_rep class which has the same layout as text_encoding, and
  • a __text_encoding_environment_rep function returning it, and then
  • use std::bit_cast<text_encoding>(std::__text_encoding_environment_rep()) in text_encoding::environment.

The helper class and function can be available in old modes without exposing text_encoding. This approach also keeps text_encoding::environment off the ABI boundary, which potentially makes it easier to evolve.
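The three bullets above might look roughly like the following sketch. Everything in it is an assumption made for illustration: the member layout (an int plus a 64-byte name), the unprefixed names `text_encoding_rep`, `text_encoding_environment_rep`, and `text_encoding_sketch`, and the hard-coded UTF-8 result standing in for a real platform query. A memcpy fallback is included only so the sketch also compiles where std::bit_cast is unavailable.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>
#if __has_include(<bit>)
#  include <bit>
#endif

// Plain struct that could be declared in every language mode.
struct text_encoding_rep {
  int mib;
  char name[64];
};

// Stand-in for the separately compiled helper; a real one would query the platform.
inline text_encoding_rep text_encoding_environment_rep() noexcept {
  text_encoding_rep rep{};
  rep.mib = 106; // IANA MIBenum of UTF-8, hard-coded for the sketch
  std::memcpy(rep.name, "UTF-8", 6);
  return rep;
}

// Reduced stand-in for text_encoding, assumed to share the helper's layout.
class text_encoding_sketch {
public:
  text_encoding_sketch() = default;
  int mib() const noexcept { return mib_; }
  const char* name() const noexcept { return name_; }

  // Header-only, so this function itself is not part of the ABI boundary.
  static text_encoding_sketch environment() noexcept {
    const text_encoding_rep rep = text_encoding_environment_rep();
#if defined(__cpp_lib_bit_cast)
    return std::bit_cast<text_encoding_sketch>(rep);
#else
    text_encoding_sketch result;
    std::memcpy(static_cast<void*>(&result), &rep, sizeof(result));
    return result;
#endif
  }

private:
  int mib_ = 2; // "unknown"
  char name_[64] = {};
};

// These checks cover size and trivial copyability; matching member order is assumed.
static_assert(sizeof(text_encoding_sketch) == sizeof(text_encoding_rep));
static_assert(std::is_trivially_copyable_v<text_encoding_sketch>);
static_assert(std::is_trivially_copyable_v<text_encoding_rep>);
```

Only the helper function would need to live in the built library, so the class can change later as long as the representation struct stays stable.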

@cor3ntin
Contributor

cor3ntin commented May 29, 2025

I have no idea what libc++ policies are in terms of backporting, but given that this feature is most useful for legacy systems, it might be reasonable to make it available in C++20 (we need consteval) - assuming there are appropriate warnings (note that gcc does not do that though).

@smallp-o-p
Contributor Author

smallp-o-p commented May 29, 2025

A couple notes based on the recent build failures:

  • AIX does not seem to have <langinfo.h>, but looking at documentation nl_langinfo is available, so I may have forgotten to put it in somewhere. Perhaps __locale_base_dir/locale_base_api/ibm.h?
  • Android 5.0 and 13.0 don't have nl_langinfo_l, but do have <langinfo.h> (on further research, it looks like that shouldn't even be possible, <langinfo.h> is not part of the Android NDK... I've found some bits of source code of the NDK that have it, some that don't. Will have to investigate what NDK the CI uses. From the bits I've found, it's guarded by !__LP64__...)
  • There is a macro collision from <langinfo.h> in two of the tests

It may just be better to use the draft implementation for text_encoding::environment(), or find a way to guard environment() behind the availability of nl_langinfo_l

@Zingam
Contributor

Zingam commented May 29, 2025

  • Android 5.0 and 13.0 don't have nl_langinfo_l, but do have <langinfo.h> (on further research, it looks like that shouldn't even be possible, <langinfo.h> is not part of the Android NDK... I've found some bits of source code of the NDK that have it, some that don't. Will have to investigate what NDK the CI uses. From the bits I've found, it's guarded by !__LP64__...)
  • There is a macro collision from <langinfo.h> in two of the tests

I believe the CI uses the latest released NDK.

Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

P2862R1: text_encoding::name() should never return null values
P1885R12: Naming Text Encodings to Demystify Them
6 participants