Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement StringConverter class #38

Merged
merged 5 commits into from
Aug 5, 2022
Merged

Conversation

mrylov
Copy link
Contributor

@mrylov mrylov commented Aug 3, 2022

No description provided.

@mrylov mrylov requested a review from stefanuhrig August 3, 2022 10:45
Comment on lines 18 to 31
enum class endian
{
#ifdef BYTE_ORDER
little = LITTLE_ENDIAN,
big = BIG_ENDIAN,
native = BYTE_ORDER
#elif defined(_M_IX86) || defined(_M_AMD64)
little = 0,
big = 1,
native = little
#else
#error "Cannot determine endianness"
#endif
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need any endianness information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public:
StringConverter() = delete;

static std::u16string utf8ToUtf16(const char* src, std::size_t srcLength);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add an overload just taking a const char* for null-terminated strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 19 to 25
static std::pair<int, char32_t> utf8ToCodePoint(
const char* curr,
const char* end);

static std::size_t utf8ToUtf16Length(const char* src,
std::size_t srcLength);
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods can be moved to the anonymous namespace in the .cpp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible as the Exception class can only be used by its friends.

u16string StringConverter::utf8ToUtf16(const char* src, size_t srcLength)
{
if (src == nullptr)
ODBC_FAIL("Input string cannot be nullptr.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"must not" instead of "cannot".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/odbc/StringConverter.h Show resolved Hide resolved
Comment on lines 12 to 43
/**
* Checks if a 16-bit code unit is a high surrogate (starting a surrogate pair).
*
* @param c The 16-bit code unit to check.
* @return True if the code unit is a high surrogate, false otherwise.
*/
inline bool isHighSurrogate(char16_t c)
{
return (c >= 0xD800 && c <= 0xDBFF);
}
//------------------------------------------------------------------------------
/**
* Checks if a 16-bit code unit is a low surrogate (ending a surrogate pair).
*
* @param c The 16-bit code unit to check.
* @return True if the code unit is a high surrogate, false otherwise.
*/
inline bool isLowSurrogate(char16_t c)
{
return (c >= 0xDC00 && c <= 0xDFFF);
}
//------------------------------------------------------------------------------
/**
* Checks if the 16-bit code unit is either a high or low surrogate.
*
* @param c The 16-bit code unit to check
* @return True if the code unit is a low or high surrogate, false otherwise.
*/
inline bool isSurrogatePart(char16_t c)
{
return (c >= 0xD800 && c <= 0xDFFF);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only required for decoding. Can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 68 to 90
/**
* Encodes a 16-bit code unit as little endian.
*
* @param c The 16-bit code unit to encode.
* @param target The target buffer. buffer[0] and buffer[1] must writable.
*/
inline void encodeSingleLE(char16_t c, char* target)
{
target[0] = (char)(c & 0xFF);
target[1] = (char)(c >> 8);
}
//------------------------------------------------------------------------------
/**
* Encodes a 16-bit code unit as big endian.
*
* @param c The 16-bit code unit to encode.
* @param target The target buffer. buffer[0] and buffer[1] must writable.
*/
inline void encodeSingleBE(char16_t c, char* target)
{
target[0] = (char)(c >> 8);
target[1] = (char)(c & 0xFF);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required if we work on character level only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 110 to 166
/**
* Encodes a code point as UTF-16 little endian.
*
* This method automatically encodes as single 16-bit code point or a surrogate
* pair depending on the code point. Therefore the first 4 bytes of the target
* buffer must be accessible.
*
* @param c The code point to encode.
* @param target The target buffer. The first four bytes must be accessible.
* @return The number of bytes written to the target buffer.
*/
inline int encodeLE(char32_t c, char* target)
{
ODBC_ASSERT(
isRepresentable(c), "Codepoint " << (uint32_t)c << " is invalid");
if (!needsSurrogatePair(c))
{
encodeSingleLE((char16_t)c, target);
return 2;
}
else
{
std::pair<char16_t, char16_t> sp = encodeSurrogatePair(c);
encodeSingleLE(sp.first, target);
encodeSingleLE(sp.second, target + 2);
return 4;
}
}
//------------------------------------------------------------------------------
/**
* Encodes a code point as UTF-16 big endian.
*
* This method automatically encodes as single 16-bit code point or a surrogate
* pair depending on the code point. Therefore the first 4 bytes of the target
* buffer must be accessible.
*
* @param c The code point to encode.
* @param target The target buffer. The first four bytes must be accessible.
* @return The number of bytes written to the target buffer.
*/
inline int encodeBE(char32_t c, char* target)
{
ODBC_ASSERT(
isRepresentable(c), "Codepoint " << (uint32_t)c << " is invalid");
if (!needsSurrogatePair(c))
{
encodeSingleBE((char16_t)c, target);
return 2;
}
else
{
std::pair<char16_t, char16_t> sp = encodeSurrogatePair(c);
encodeSingleBE(sp.first, target);
encodeSingleBE(sp.second, target + 2);
return 4;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required if we work only character level only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
const char* src;
size_t srcLength;
const char* dst;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use const char16_t* here. dst is kind of a misleading name. It's the expected outcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
"\x48",
0,
"",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use UTF-16 string literals here, e.g. u"Oststraße".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

StringConverter() = delete;

/**
* Converts a null-terminated UTF-8 string to a UTF16 string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-16 instead of UTF16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 41 to 86
#define ODBC_TERMINATE(msg) std::terminate()
//------------------------------------------------------------------------------
#define ODBC_TERMINATE_CHECK(cond, expr) \
do \
{ \
if (!(cond)) \
{ \
ODBC_TERMINATE(expr << "; Condition '" << #cond << "' failed."); \
} \
} while (false)
//------------------------------------------------------------------------------
#define ODBC_TERMINATE_CHECK_0(cond) \
do \
{ \
if (!(cond)) \
{ \
ODBC_TERMINATE("Condition '" << #cond << "' failed."); \
} \
} while (false)
//------------------------------------------------------------------------------
// Asserts are executed in debug mode only
#ifdef ODBC_DBG
#define ODBC_ASSERT(cond, expr) ODBC_TERMINATE_CHECK(cond, expr)
#define ODBC_ASSERT_0(cond) ODBC_TERMINATE_CHECK_0(cond)
#else
#define ODBC_ASSERT(cond, expr)
#define ODBC_ASSERT_0(cond)
#endif
//------------------------------------------------------------------------------
#if defined(__GNUC__) || defined(__clang__)
#define ODBC_BUILTIN_UNREACHABLE __builtin_unreachable()
#elif defined(_MSC_VER)
#define ODBC_BUILTIN_UNREACHABLE __assume(0)
#endif
//------------------------------------------------------------------------------
// Use these macros at code locations that should be unreachable.
#define ODBC_TERMINATE_CHECK_UNREACHABLE \
ODBC_TERMINATE("Reached unreachable code location")
//------------------------------------------------------------------------------
// Use this macro at code locations that are unreachable.
#ifdef ODBC_DBG
#define ODBC_ASSERT_UNREACHABLE ODBC_TERMINATE_CHECK_UNREACHABLE
#else
#define ODBC_ASSERT_UNREACHABLE ODBC_BUILTIN_UNREACHABLE
#endif
//------------------------------------------------------------------------------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not add all that stuff because certain functionality is platform dependent. At other locations, we just use assert , which is good enough from my point of view. The macros don't do anything with message anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/odbc/StringConverter.cpp Show resolved Hide resolved
Comment on lines 32 to 35
u16string str(dstLength, 0);

const char* curr = begin;
size_t i = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get rid of i, we could just call reserve on str (instead of creating it with a given size) and then use push_back. That way we also wouldn't do the unnecessary initialization with 0 anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

curr += cp.first;

ODBC_CHECK(utf16::isRepresentable(cp.second),
"Codepoint " << (uint32_t)cp.second << " is invalid");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's valid, but cannot be represented. Maybe something like: The UTF-8 string contains codepoint U+xxxxxx, which cannot represented in UTF-16. The codepoint should also be encoded in hexadecimal, because that's the convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 48 to 49
str[i] = static_cast<char16_t>(sp.first);
str[i + 1] = static_cast<char16_t>(sp.second);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_casts are not needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/odbc/StringConverter.cpp Show resolved Hide resolved
Comment on lines 80 to 81
// We have to make sure that the sequence does not contain a terminating
// zero and the following byte-sequence is valid.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about the terminating zero does not apply here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

u16string actual = p.srcLength >= 0 ?
StringConverter::utf8ToUtf16(p.src, p.srcLength) :
StringConverter::utf8ToUtf16(p.src);
ASSERT_EQ(p.expected.length(), actual.length());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ASSERT_EQ(actual, p.expected) should work, because there is an comparison operator for u16string and const char16_t*.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{
"\x73\x74\xE5\x5D\x0D\x0A",
6,
"The string contains an incomplete byte-sequence at position 2."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either use "byte sequence" or "byte-sequence". Currently, we have both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 46 to 49
ODBC_CHECK(utf16::isRepresentable(cp.second),
"The UTF-8 string contains codepoint U+" <<
std::hex << (uint32_t)cp.second <<
", which cannot be represented in UTF-16.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An assert should suffice here because we checked this already in utf8ToUtf16Length().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 88 to 89
ODBC_FAIL("The string contains an incomplete byte sequence at "
"position " << (curr - begin) << ".");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we need to distinguish the "incomplete sequence" from the "invalid sequence" case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@stefanuhrig stefanuhrig merged commit 2f08d5f into SAP:master Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants