Detected encoding is wrong with DetectFromBytes, ok with other methods for UTF-8 file containing emoji #38
Comments
Need some time, but figured something out
Now we have to check why it's after 3072 bytes of UTF-8 and not after 73939 bytes.
It first fails at byte count 73837 (so within the last 102 bytes).
F0 9F 98 80 is not recognized as UTF-8 :(
In general, emoji follow a different logic :)
Maybe we can come up with a table of special characters that it needs to skip in certain scenarios.
Replaced with #149
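For reference, the byte sequence F0 9F 98 80 mentioned above is the well-formed 4-byte UTF-8 encoding of U+1F600 (😀). The sketch below, in Python rather than the library's own language, checks that and also shows one way a valid sequence can look invalid: when a fixed-size buffer cuts it in half. The helper names and the choice of 3072 as chunk size are illustrative assumptions; the thread does not establish that a chunk boundary is what trips up `DetectFromBytes`.

```python
import codecs

# The four bytes quoted in the comment above: U+1F600 encoded as UTF-8.
EMOJI_BYTES = bytes([0xF0, 0x9F, 0x98, 0x80])


def is_valid_utf8(data: bytes) -> bool:
    """Strict check: the whole buffer must decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_valid_utf8_chunked(data: bytes, chunk_size: int) -> bool:
    """Check in fixed-size chunks, carrying partial sequences across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for start in range(0, len(data), chunk_size):
            decoder.decode(data[start:start + chunk_size])
        # final=True rejects a sequence left incomplete at the very end.
        decoder.decode(b"", final=True)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: the emoji straddles a 3072-byte boundary.
sample = b"a" * 3070 + EMOJI_BYTES + b"b"
first_chunk_alone_ok = is_valid_utf8(sample[:3072])   # chunk ends mid-sequence
whole_stream_ok = is_valid_utf8_chunked(sample, 3072)
```

Validating each chunk in isolation rejects the first 3072 bytes here, while the incremental decoder accepts the stream because it keeps the two leading bytes until the rest of the sequence arrives.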
Test program launched from latest source:
Result:
![image](https://user-images.githubusercontent.com/305637/29879908-b6486e48-8da6-11e7-874b-399c3b502350.png)
The file is a UTF-8 encoded HTML file (without BOM) containing a single emoji: 😀
(attached in the zip below)
utf8_with_emoji.zip
Why does the `DetectFromBytes` method give a different result?