The convertCharset (s, 'utf-8', 'utf-16') doesn't seem to work properly. #7643

achimbab · 2019-11-06T02:18:56Z

Executed Query

SELECT javaHash(convertCharset('a1가', 'utf-8', 'utf-16'))

Result of convertCharset
I set a breakpoint in javaHash() and checked the value passed in from convertCharset().
I expected the bytes 97 0 49 0 0 -84 and a size of 6.
But the actual value is different.

    struct JavaHashImpl
    {
        static constexpr auto name = "javaHash";
        using ReturnType = Int32;

        static Int32 apply(const char * data, const size_t size)
        {
            UInt32 h = 0;
            for (size_t i = 0; i < size; ++i)
                h = 31 * h + static_cast<UInt32>(static_cast<Int8>(data[i]));
            return static_cast<Int32>(h);
        }

(gdb) p size
$132 = 8    

(gdb) x/8db data                                                           
0x7fff46814150: -1      -2      97      0       49      0       0       -84

Notice that the first two bytes are -1 -2, which looks odd.

Versions

ClickHouse client version 19.17.1.1.
Connecting to localhost:19000 as user default.
Connected to ClickHouse server version 19.17.1 revision 54428.

achimbab · 2019-11-06T03:29:32Z

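I found out what -1 -2 means.
These bytes are the UTF-16 BOM (Byte Order Mark).
U+FEFF is used as the BOM in UTF-16; serialized in little-endian order it becomes the bytes 0xFF 0xFE, which read as -1 -2 when interpreted as signed bytes.
So the javaHash() function has to recognize the charset of the source string, because Java calculates a string's hashCode value based on its internal encoding.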

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}
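
Whichever branch is taken, Java defines String.hashCode over the string's 16-bit chars (UTF-16 code units), not over UTF-8 bytes. A minimal C++ sketch of that definition follows; the function name javaStringHash is hypothetical and is not part of ClickHouse or the JDK.

```cpp
#include <cstdint>
#include <string>

// Java-style string hash: h = 31 * h + c for every 16-bit code unit c.
// Unsigned arithmetic wraps modulo 2^32, which mirrors Java's int overflow.
// The final cast to a signed type is modular on mainstream compilers
// (and guaranteed to be so since C++20).
std::int32_t javaStringHash(const std::u16string & s)
{
    std::uint32_t h = 0;
    for (char16_t c : s)
        h = 31 * h + static_cast<std::uint32_t>(c);
    return static_cast<std::int32_t>(h);
}
```

For the string from the query, javaStringHash(u"a1\uAC00") evaluates to 97 * 31^2 + 49 * 31 + 0xAC00 = 138768, the value a byte-wise hash would need to reproduce.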

alexey-milovidov · 2019-11-06T08:36:50Z

But our strings carry no information about their encoding. The only way to solve this is to provide another function, javaHashUTF16, that calculates javaHash under the assumption that the string is in UTF-16.

alexey-milovidov · 2019-11-06T08:38:38Z

To avoid the BOM, you should specify utf-16be or utf-16le:

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16'))─┐
│ FFFE3100                                    │
└─────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.011 sec. 

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16be'))─┐
│ 0031                                          │
└───────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.010 sec. 

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16le'))─┐
│ 3100                                          │
└───────────────────────────────────────────────┘
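
The three hex results above differ only in byte order and in whether a BOM is prepended. The sketch below reproduces those byte layouts for illustration. The names utf16Bytes and ByteOrder are hypothetical, and the code takes input that is already UTF-16 code units rather than performing the UTF-8 conversion that convertCharset does.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ByteOrder { Little, Big };

// Serialize UTF-16 code units to bytes in the requested byte order,
// optionally prefixed with the byte order mark U+FEFF.
std::vector<std::uint8_t> utf16Bytes(const std::u16string & s, ByteOrder order, bool with_bom)
{
    std::vector<std::uint8_t> out;
    auto put = [&](char16_t c)
    {
        const std::uint8_t lo = static_cast<std::uint8_t>(c & 0xFF);
        const std::uint8_t hi = static_cast<std::uint8_t>(c >> 8);
        if (order == ByteOrder::Little)
        {
            out.push_back(lo);
            out.push_back(hi);
        }
        else
        {
            out.push_back(hi);
            out.push_back(lo);
        }
    };

    if (with_bom)
        put(0xFEFF);
    for (char16_t c : s)
        put(c);
    return out;
}
```

With this sketch, utf16Bytes(u"1", ByteOrder::Little, true) yields FF FE 31 00, matching the 'utf-16' result in this session, while the Big and Little variants without a BOM yield 00 31 and 31 00. What a plain 'utf-16' target emits may depend on the conversion library and platform, so treat the little-endian-with-BOM behaviour as an observation from this thread rather than a guarantee.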

alexey-milovidov · 2019-11-06T08:39:12Z

And probably we should add a function javaHashUTF16LE.
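
A rough sketch of what such a function could look like, assuming the input is a raw UTF-16LE byte buffer. This is not the ClickHouse implementation: the name javaHashUTF16LESketch is hypothetical, and skipping a leading BOM and rejecting odd-length input are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Java-style hash over little-endian UTF-16 code units stored as raw bytes.
std::int32_t javaHashUTF16LESketch(const char * data, std::size_t size)
{
    // Assumption: a leading little-endian BOM (FF FE) is skipped, not hashed.
    if (size >= 2
        && static_cast<unsigned char>(data[0]) == 0xFF
        && static_cast<unsigned char>(data[1]) == 0xFE)
    {
        data += 2;
        size -= 2;
    }

    // Assumption: input that is not a whole number of code units is an error.
    if (size % 2 != 0)
        throw std::invalid_argument("UTF-16 input must have an even number of bytes");

    std::uint32_t h = 0;
    for (std::size_t i = 0; i < size; i += 2)
    {
        // Combine two bytes into one code unit, low byte first.
        const std::uint32_t lo = static_cast<unsigned char>(data[i]);
        const std::uint32_t hi = static_cast<unsigned char>(data[i + 1]);
        h = 31 * h + (lo | (hi << 8));
    }
    return static_cast<std::int32_t>(h);
}
```

Fed the six bytes 97 0 49 0 0 -84, or the same bytes preceded by -1 -2, this sketch returns 138768, which is 97 * 31^2 + 49 * 31 + 0xAC00, the hash over the code units of 'a1가'.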

achimbab · 2019-11-06T09:18:56Z

@alexey-milovidov
Thank you for your help.
I have already written a function like javaHashUTF16LE for my own work.
If you don't mind, I will open a PR for javaHashUTF16LE as soon as possible.

achimbab added the question Question? label Nov 6, 2019

achimbab changed the title ~~The convertCharset(s, 'utf-8', 'utf-16') look not working properly.~~ The convertCharset(s, 'utf-8', 'utf-16') looks not working properly. Nov 6, 2019

achimbab changed the title ~~The convertCharset(s, 'utf-8', 'utf-16') looks not working properly.~~ The convertCharset (s, 'utf-8', 'utf-16') doesn't seem to work properly. Nov 6, 2019

alexey-milovidov added the invalid label Nov 6, 2019

achimbab mentioned this issue Nov 6, 2019

Implemented javaHashUTF16LE() #7651

Merged

achimbab closed this as completed Nov 6, 2019
