Implement StringConverter class #38

mrylov · 2022-08-03T10:45:40Z

No description provided.

stefanuhrig · 2022-08-03T11:04:48Z

src/odbc/StringConverter.cpp

+enum class endian
+{
+#ifdef BYTE_ORDER
+    little = LITTLE_ENDIAN,
+    big = BIG_ENDIAN,
+    native = BYTE_ORDER
+#elif defined(_M_IX86) || defined(_M_AMD64)
+    little = 0,
+    big = 1,
+    native = little
+#else
+#error "Cannot determine endianness"
+#endif
+};


I don't think we need any endianness information.

stefanuhrig · 2022-08-03T11:06:16Z

src/odbc/StringConverter.h

+public:
+    StringConverter() = delete;
+
+    static std::u16string utf8ToUtf16(const char* src, std::size_t srcLength);


We should also add an overload just taking a const char* for null-terminated strings.
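A minimal sketch of how such an overload could delegate to the length-taking function. The free-function form, the ASCII-only stand-in body, and the use of `std::invalid_argument` in place of the project's `ODBC_FAIL` are assumptions made only to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the length-taking function from the diff.
// It only widens ASCII bytes so that the sketch compiles on its own.
std::u16string utf8ToUtf16(const char* src, std::size_t srcLength)
{
    if (src == nullptr)
        throw std::invalid_argument("Input string must not be nullptr.");
    std::u16string result;
    result.reserve(srcLength);
    for (std::size_t i = 0; i < srcLength; ++i)
        result.push_back(static_cast<char16_t>(static_cast<unsigned char>(src[i])));
    return result;
}

// Overload for null-terminated input: check for nullptr before calling
// strlen (which must not receive a null pointer), then delegate.
std::u16string utf8ToUtf16(const char* src)
{
    if (src == nullptr)
        throw std::invalid_argument("Input string must not be nullptr.");
    return utf8ToUtf16(src, std::strlen(src));
}
```

The null check has to come first in the new overload because the length computation itself dereferences the pointer.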

stefanuhrig · 2022-08-03T11:06:54Z

src/odbc/StringConverter.h

+    static std::pair<int, char32_t> utf8ToCodePoint(
+        const char* curr,
+        const char* end);
+
+    static std::size_t utf8ToUtf16Length(const char* src,
+                                         std::size_t srcLength);
+};


These methods can be moved to the anonymous namespace in the .cpp.

This is not possible as the Exception class can only be used by its friends.

stefanuhrig · 2022-08-03T11:08:07Z

src/odbc/StringConverter.cpp

+u16string StringConverter::utf8ToUtf16(const char* src, size_t srcLength)
+{
+    if (src == nullptr)
+        ODBC_FAIL("Input string cannot be nullptr.");


"must not" instead of "cannot".

stefanuhrig · 2022-08-03T11:35:12Z

src/odbc/internal/charset/Utf16.h

+/**
+ * Checks if a 16-bit code unit is a high surrogate (starting a surrogate pair).
+ *
+ * @param c  The 16-bit code unit to check.
+ * @return   True if the code unit is a high surrogate, false otherwise.
+ */
+inline bool isHighSurrogate(char16_t c)
+{
+    return (c >= 0xD800 && c <= 0xDBFF);
+}
+//------------------------------------------------------------------------------
+/**
+ * Checks if a 16-bit code unit is a low surrogate (ending a surrogate pair).
+ *
+ * @param c  The 16-bit code unit to check.
+ * @return   True if the code unit is a low surrogate, false otherwise.
+ */
+inline bool isLowSurrogate(char16_t c)
+{
+    return (c >= 0xDC00 && c <= 0xDFFF);
+}
+//------------------------------------------------------------------------------
+/**
+ * Checks if the 16-bit code unit is either a high or low surrogate.
+ *
+ * @param c  The 16-bit code unit to check
+ * @return   True if the code unit is a low or high surrogate, false otherwise.
+ */
+inline bool isSurrogatePart(char16_t c)
+{
+    return (c >= 0xD800 && c <= 0xDFFF);
+}


Only required for decoding. Can be removed.

stefanuhrig · 2022-08-03T11:35:35Z

src/odbc/internal/charset/Utf16.h

+/**
+ * Encodes a 16-bit code unit as little endian.
+ *
+ * @param c       The 16-bit code unit to encode.
+ * @param target  The target buffer. target[0] and target[1] must be writable.
+ */
+inline void encodeSingleLE(char16_t c, char* target)
+{
+    target[0] = (char)(c & 0xFF);
+    target[1] = (char)(c >> 8);
+}
+//------------------------------------------------------------------------------
+/**
+ * Encodes a 16-bit code unit as big endian.
+ *
+ * @param c       The 16-bit code unit to encode.
+ * @param target  The target buffer. target[0] and target[1] must be writable.
+ */
+inline void encodeSingleBE(char16_t c, char* target)
+{
+    target[0] = (char)(c >> 8);
+    target[1] = (char)(c & 0xFF);
+}


Not required if we work on character level only.

stefanuhrig · 2022-08-03T11:36:27Z

src/odbc/internal/charset/Utf16.h

+/**
+ * Encodes a code point as UTF-16 little endian.
+ *
+ * This method automatically encodes as single 16-bit code point or a surrogate
+ * pair depending on the code point. Therefore the first 4 bytes of the target
+ * buffer must be accessible.
+ *
+ * @param c       The code point to encode.
+ * @param target  The target buffer. The first four bytes must be accessible.
+ * @return        The number of bytes written to the target buffer.
+ */
+inline int encodeLE(char32_t c, char* target)
+{
+    ODBC_ASSERT(
+        isRepresentable(c), "Codepoint " << (uint32_t)c << " is invalid");
+    if (!needsSurrogatePair(c))
+    {
+        encodeSingleLE((char16_t)c, target);
+        return 2;
+    }
+    else
+    {
+        std::pair<char16_t, char16_t> sp = encodeSurrogatePair(c);
+        encodeSingleLE(sp.first, target);
+        encodeSingleLE(sp.second, target + 2);
+        return 4;
+    }
+}
+//------------------------------------------------------------------------------
+/**
+ * Encodes a code point as UTF-16 big endian.
+ *
+ * This method automatically encodes as single 16-bit code point or a surrogate
+ * pair depending on the code point. Therefore the first 4 bytes of the target
+ * buffer must be accessible.
+ *
+ * @param c       The code point to encode.
+ * @param target  The target buffer. The first four bytes must be accessible.
+ * @return        The number of bytes written to the target buffer.
+ */
+inline int encodeBE(char32_t c, char* target)
+{
+    ODBC_ASSERT(
+        isRepresentable(c), "Codepoint " << (uint32_t)c << " is invalid");
+    if (!needsSurrogatePair(c))
+    {
+        encodeSingleBE((char16_t)c, target);
+        return 2;
+    }
+    else
+    {
+        std::pair<char16_t, char16_t> sp = encodeSurrogatePair(c);
+        encodeSingleBE(sp.first, target);
+        encodeSingleBE(sp.second, target + 2);
+        return 4;
+    }
+}


Not required if we work on character level only.

stefanuhrig · 2022-08-03T11:40:46Z

test/StringConverterTest.cpp

+{
+    const char* src;
+    size_t srcLength;
+    const char* dst;


We should use const char16_t* here. dst is kind of a misleading name. It's the expected outcome.

stefanuhrig · 2022-08-03T11:42:29Z

test/StringConverterTest.cpp

+{
+    "\x48",
+    0,
+    "",


I'd use UTF-16 string literals here, e.g. u"Oststraße".
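A sketch of what the test parameters might look like with both suggestions applied: the expected value typed as `const char16_t*` and written as a UTF-16 literal. The struct name, field names, and sample cases are hypothetical. The non-ASCII character is written as an escape here so the example does not depend on the source file encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical test parameter; names are illustrative only.
struct ConversionCase
{
    const char* src;           // UTF-8 input bytes
    std::size_t srcLength;     // number of input bytes to convert
    const char16_t* expected;  // expected UTF-16 result
};

const ConversionCase kCases[] = {
    {"\x48", 0, u""},
    {"\x48", 1, u"H"},
    // "Oststraße": 'ß' is two bytes in UTF-8 (C3 9F), one code unit in UTF-16.
    // The narrow literal is split so that 'e' is not read as a hex digit.
    {"Oststra\xC3\x9F" "e", 10, u"Oststra\u00DFe"},
};
```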

stefanuhrig · 2022-08-04T10:43:05Z

src/odbc/StringConverter.h

+    StringConverter() = delete;
+
+    /**
+     * Converts a null-terminated UTF-8 string to a UTF16 string.


UTF-16 instead of UTF16

stefanuhrig · 2022-08-04T10:47:36Z

src/odbc/internal/Macros.h

+#define ODBC_TERMINATE(msg) std::terminate()
+//------------------------------------------------------------------------------
+#define ODBC_TERMINATE_CHECK(cond, expr)                                       \
+    do                                                                         \
+    {                                                                          \
+        if (!(cond))                                                           \
+        {                                                                      \
+            ODBC_TERMINATE(expr << "; Condition '" << #cond << "' failed.");   \
+        }                                                                      \
+    } while (false)
+//------------------------------------------------------------------------------
+#define ODBC_TERMINATE_CHECK_0(cond)                                           \
+    do                                                                         \
+    {                                                                          \
+        if (!(cond))                                                           \
+        {                                                                      \
+            ODBC_TERMINATE("Condition '" << #cond << "' failed.");             \
+        }                                                                      \
+    } while (false)
+//------------------------------------------------------------------------------
+// Asserts are executed in debug mode only
+#ifdef ODBC_DBG
+#define ODBC_ASSERT(cond, expr) ODBC_TERMINATE_CHECK(cond, expr)
+#define ODBC_ASSERT_0(cond) ODBC_TERMINATE_CHECK_0(cond)
+#else
+#define ODBC_ASSERT(cond, expr)
+#define ODBC_ASSERT_0(cond)
+#endif
+//------------------------------------------------------------------------------
+#if defined(__GNUC__) || defined(__clang__)
+#define ODBC_BUILTIN_UNREACHABLE __builtin_unreachable()
+#elif defined(_MSC_VER)
+#define ODBC_BUILTIN_UNREACHABLE __assume(0)
+#endif
+//------------------------------------------------------------------------------
+// Use these macros at code locations that should be unreachable.
+#define ODBC_TERMINATE_CHECK_UNREACHABLE                                       \
+    ODBC_TERMINATE("Reached unreachable code location")
+//------------------------------------------------------------------------------
+// Use this macro at code locations that are unreachable.
+#ifdef ODBC_DBG
+#define ODBC_ASSERT_UNREACHABLE ODBC_TERMINATE_CHECK_UNREACHABLE
+#else
+#define ODBC_ASSERT_UNREACHABLE ODBC_BUILTIN_UNREACHABLE
+#endif
+//------------------------------------------------------------------------------


I would not add all that stuff because certain functionality is platform dependent. At other locations, we just use assert, which is good enough from my point of view. The macros don't do anything with the message anyway.

stefanuhrig · 2022-08-04T10:57:23Z

src/odbc/StringConverter.cpp

+    u16string str(dstLength, 0);
+
+    const char* curr = begin;
+    size_t i = 0;


To get rid of i, we could just call reserve on str (instead of creating it with a given size) and then use push_back. That way we also wouldn't do the unnecessary initialization with 0 anymore.
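The suggestion could look roughly like the following. This is a sketch, not the code from the patch: `appendUtf16` and `encodeAll` are hypothetical names, and the input is taken as already decoded code points to keep the example short.

```cpp
#include <cstddef>
#include <string>

// Appends one code point as one or two UTF-16 code units.
// Assumes cp is at most 0x10FFFF and not itself a surrogate.
void appendUtf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}

std::u16string encodeAll(const std::u32string& codePoints, std::size_t dstLength)
{
    std::u16string str;
    str.reserve(dstLength);  // allocates capacity without zero-filling
    for (char32_t cp : codePoints)
        appendUtf16(str, cp);  // no index variable to keep in sync
    return str;
}
```

With `reserve`, the string starts empty and grows by `push_back`, so the write position is tracked by the string itself.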

stefanuhrig · 2022-08-04T10:58:57Z

src/odbc/StringConverter.cpp

+        curr += cp.first;
+
+        ODBC_CHECK(utf16::isRepresentable(cp.second),
+                   "Codepoint " << (uint32_t)cp.second << " is invalid");


Well, it's valid, but cannot be represented. Maybe something like: The UTF-8 string contains codepoint U+xxxxxx, which cannot be represented in UTF-16. The codepoint should also be encoded in hexadecimal, because that's the convention.
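One way to build such a message with standard stream manipulators, as a sketch. The function name is hypothetical, and the zero padding to at least four uppercase hex digits follows the usual U+ notation rather than anything stated in this thread.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Formats the proposed error message with the code point in U+XXXX form.
std::string unrepresentableMessage(char32_t cp)
{
    std::ostringstream out;
    out << "The UTF-8 string contains codepoint U+"
        << std::hex << std::uppercase << std::setw(4) << std::setfill('0')
        << static_cast<std::uint32_t>(cp)  // cast so the value prints as a number
        << ", which cannot be represented in UTF-16.";
    return out.str();
}
```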

stefanuhrig · 2022-08-04T10:59:42Z

src/odbc/StringConverter.cpp

+            str[i] = static_cast<char16_t>(sp.first);
+            str[i + 1] = static_cast<char16_t>(sp.second);


static_casts are not needed here.

stefanuhrig · 2022-08-04T11:05:03Z

src/odbc/StringConverter.cpp

+    // We have to make sure that the sequence does not contain a terminating
+    // zero and the following byte-sequence is valid.


The comment about the terminating zero does not apply here.

stefanuhrig · 2022-08-04T11:12:04Z

test/StringConverterTest.cpp

+    u16string actual = p.srcLength >= 0 ?
+                StringConverter::utf8ToUtf16(p.src, p.srcLength) :
+                StringConverter::utf8ToUtf16(p.src);
+    ASSERT_EQ(p.expected.length(), actual.length());


I think ASSERT_EQ(actual, p.expected) should work, because there is a comparison operator for u16string and const char16_t*.
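The comparison the assertion would rely on can be checked in isolation. A small sketch with a hypothetical helper name:

```cpp
#include <string>

// std::basic_string provides operator== against a null-terminated
// character array of the same character type, so no explicit
// std::u16string needs to be constructed from the expected value.
bool matchesExpected(const std::u16string& actual, const char16_t* expected)
{
    return actual == expected;
}
```

Comparing the strings directly also covers the length, so a separate length assertion would be redundant.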

stefanuhrig · 2022-08-04T11:13:11Z

test/StringConverterTest.cpp

+{
+    "\x73\x74\xE5\x5D\x0D\x0A",
+    6,
+    "The string contains an incomplete byte-sequence at position 2."


We should either use "byte sequence" or "byte-sequence". Currently, we have both.

stefanuhrig · 2022-08-04T14:15:10Z

src/odbc/StringConverter.cpp

+        ODBC_CHECK(utf16::isRepresentable(cp.second),
+                   "The UTF-8 string contains codepoint U+" <<
+                   std::hex << (uint32_t)cp.second <<
+                   ", which cannot be represented in UTF-16.");


An assert should suffice here because we checked this already in utf8ToUtf16Length().

stefanuhrig · 2022-08-04T14:17:34Z

src/odbc/StringConverter.cpp

+        ODBC_FAIL("The string contains an incomplete byte sequence at "
+                  "position " << (curr - begin) << ".");


Probably we need to distinguish the "incomplete sequence" from the "invalid sequence" case.
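A simplified sketch of how the two cases might be told apart. All names are hypothetical, and the check is deliberately incomplete: it looks only at the lead byte and the continuation-byte pattern, not at overlong forms or encoded surrogates.

```cpp
#include <cstddef>

enum class Utf8Status { Ok, Incomplete, Invalid };

// Expected sequence length for a lead byte, or 0 if the byte cannot
// start a sequence.
inline int sequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Requires curr < end.
Utf8Status checkSequence(const char* curr, const char* end)
{
    const int len = sequenceLength(static_cast<unsigned char>(*curr));
    if (len == 0)
        return Utf8Status::Invalid;
    for (int i = 1; i < len; ++i)
    {
        if (curr + i == end)
            return Utf8Status::Incomplete;  // input ended mid-sequence
        const unsigned char c = static_cast<unsigned char>(curr[i]);
        if ((c & 0xC0) != 0x80)
            return Utf8Status::Invalid;     // not a continuation byte
    }
    return Utf8Status::Ok;
}
```

Under this classification, a lead byte followed by a non-continuation byte counts as invalid, while input that stops before the sequence is finished counts as incomplete.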

mrylov added 2 commits July 27, 2022 16:20

Implement StringConverter class

2955dd5

Implement StringConverter class

cdfc27b

mrylov requested a review from stefanuhrig August 3, 2022 10:45

stefanuhrig reviewed Aug 3, 2022

Implement StringConverter class (patchset 2)

779a9b6

stefanuhrig reviewed Aug 4, 2022

Implement StringConverter class (patchset 3)

a289480

stefanuhrig reviewed Aug 4, 2022

Implement StringConverter class (patchset 4)

1b283ba

stefanuhrig merged commit 2f08d5f into SAP:master Aug 5, 2022
