The convertCharset (s, 'utf-8', 'utf-16') doesn't seem to work properly. #7643

Closed
achimbab opened this issue Nov 6, 2019 · 5 comments

@achimbab
Contributor

achimbab commented Nov 6, 2019

Executed Query

SELECT javaHash(convertCharset('a1가', 'utf-8', 'utf-16'))

Result of convertCharset
I set a breakpoint in javaHash() and inspected the value that convertCharset() passed to it.
I expected the bytes 97 0 49 0 0 -84 with a size of 6,
but the actual value is different:

    struct JavaHashImpl
    {
        static constexpr auto name = "javaHash";
        using ReturnType = Int32;

        static Int32 apply(const char * data, const size_t size)   // breakpoint set here
        {
            UInt32 h = 0;
            for (size_t i = 0; i < size; ++i)
                h = 31 * h + static_cast<UInt32>(static_cast<Int8>(data[i]));
            return static_cast<Int32>(h);
        }

(gdb) p size
$132 = 8    

(gdb) x/8db data                                                           
0x7fff46814150: -1      -2      97      0       49      0       0       -84

Notice that the first two bytes are -1 -2 (0xFF 0xFE) and the size is 8 instead of 6. That looks wrong.
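To make the effect of the two extra bytes concrete, here is a minimal standalone sketch of the same byte-wise loop as JavaHashImpl::apply above. The name javaHashBytes and the two buffers are made up for this illustration, and the standard fixed-width types are assumed to behave like ClickHouse's Int32/UInt32/Int8 aliases.

```cpp
#include <cstddef>
#include <cstdint>

// Standalone copy of the byte-wise loop from JavaHashImpl::apply.
// javaHashBytes is a hypothetical name used only in this sketch.
int32_t javaHashBytes(const char * data, size_t size)
{
    uint32_t h = 0;
    for (size_t i = 0; i < size; ++i)
        h = 31 * h + static_cast<uint32_t>(static_cast<int8_t>(data[i]));
    return static_cast<int32_t>(h);
}

// The 6 bytes that were expected: 'a', '1', U+AC00 as UTF-16LE code units.
const char expected_bytes[] = {97, 0, 49, 0, 0, static_cast<char>(-84)};

// The 8 bytes seen in gdb: the same payload preceded by -1 -2.
const char observed_bytes[] = {
    static_cast<char>(-1), static_cast<char>(-2),
    97, 0, 49, 0, 0, static_cast<char>(-84)};
```

Calling javaHashBytes(expected_bytes, sizeof(expected_bytes)) and javaHashBytes(observed_bytes, sizeof(observed_bytes)) gives two different values, because every leading byte is folded into the running hash.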

Versions

ClickHouse client version 19.17.1.1.
Connecting to localhost:19000 as user default.
Connected to ClickHouse server version 19.17.1 revision 54428.
achimbab added the question label Nov 6, 2019
achimbab changed the title from "The convertCharset(s, 'utf-8', 'utf-16') look not working properly." to "The convertCharset(s, 'utf-8', 'utf-16') looks not working properly." Nov 6, 2019
achimbab changed the title from "The convertCharset(s, 'utf-8', 'utf-16') looks not working properly." to "The convertCharset (s, 'utf-8', 'utf-16') doesn't seem to work properly." Nov 6, 2019
@achimbab
Contributor Author

achimbab commented Nov 6, 2019

I found out what -1 -2 means.
It is the BOM (Byte Order Mark) of UTF-16.
UTF-16 uses U+FEFF as its BOM; the bytes 0xFF 0xFE (-1 -2) are its little-endian encoding.
So the javaHash() function has to know the charset of the source string, because Java calculates a hashCode value based on the string's encoding:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}

@alexey-milovidov
Member

But our strings carry no information about encoding. The only way to solve this is to provide another function, javaHashUTF16, that calculates javaHash under the assumption that the string is in UTF-16.

@alexey-milovidov
Member

To avoid the BOM, you should specify utf-16be or utf-16le:

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16'))─┐
│ FFFE3100                                    │
└─────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.011 sec. 

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16be'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16be'))─┐
│ 0031                                          │
└───────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.010 sec. 

milovidov-Pro-P30 :) SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))

SELECT hex(convertCharset('1', 'utf-8', 'utf-16le'))

┌─hex(convertCharset('1', 'utf-8', 'utf-16le'))─┐
│ 3100                                          │
└───────────────────────────────────────────────┘
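For buffers that already went through the plain 'utf-16' conversion, the FFFE prefix in the first result can be recognized from the first two bytes. The following is only a sketch of such a check; detectUtf16Bom and the Utf16Bom enum are hypothetical names, not ClickHouse functions.

```cpp
#include <cstddef>

// Hypothetical helper for illustration: classify the first two bytes
// of a buffer as a UTF-16 byte order mark, if present.
enum class Utf16Bom { None, LittleEndian, BigEndian };

Utf16Bom detectUtf16Bom(const unsigned char * data, size_t size)
{
    if (size >= 2)
    {
        // U+FEFF stored low byte first: FF FE
        if (data[0] == 0xFF && data[1] == 0xFE)
            return Utf16Bom::LittleEndian;
        // U+FEFF stored high byte first: FE FF
        if (data[0] == 0xFE && data[1] == 0xFF)
            return Utf16Bom::BigEndian;
    }
    return Utf16Bom::None;
}
```

With the three results above, FFFE3100 would be classified as LittleEndian, while 0031 and 3100 carry no BOM and would be classified as None.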

@alexey-milovidov
Member

And probably we should add a function javaHashUTF16LE.
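A rough sketch of what such a function could compute, not the eventual ClickHouse implementation: it assumes the input is UTF-16LE without a BOM and has an even number of bytes, and it hashes 16-bit code units the way Java's String.hashCode() hashes char values. The name javaHashUTF16LESketch is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: hash a UTF-16LE buffer (no BOM) per 16-bit code unit,
// using the same h = 31 * h + c recurrence as Java's String.hashCode().
int32_t javaHashUTF16LESketch(const char * data, size_t size)
{
    // Assumption of this sketch: the buffer holds whole code units only.
    assert(size % 2 == 0);

    uint32_t h = 0;
    for (size_t i = 0; i + 1 < size; i += 2)
    {
        // Little-endian: low byte first, then high byte.
        uint32_t lo = static_cast<uint8_t>(data[i]);
        uint32_t hi = static_cast<uint8_t>(data[i + 1]);
        uint32_t code_unit = lo | (hi << 8);
        h = 31 * h + code_unit;
    }
    return static_cast<int32_t>(h);
}
```

For the bytes 97 0 49 0 0 -84 (the code units 97, 49 and 0xAC00) this yields 97 * 31 * 31 + 49 * 31 + 44032 = 138768, which is the value the String.hashCode() formula gives for "a1가".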

@achimbab
Contributor Author

achimbab commented Nov 6, 2019

@alexey-milovidov
Thank you for your help.
I have already written a function like javaHashUTF16LE for my work.
If you don't mind, I will open a PR for javaHashUTF16LE as soon as possible.
