Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detected encoding is wrong with DetectFromBytes, ok with other methods for UTF-8 file containing emoji #38

Closed
fretman92 opened this issue Aug 30, 2017 · 6 comments
Labels
Milestone

Comments

@fretman92
Copy link

fretman92 commented Aug 30, 2017

Test program launched from latest source:

            string filename = args[0];

            var result = CharsetDetector.DetectFromFile(filename);

            if (result.Detected != null)
            {
                Console.WriteLine("DetectFromFile - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
            }

            byte[] bytes = System.IO.File.ReadAllBytes(filename);
            result = CharsetDetector.DetectFromBytes(bytes);

            if (result.Detected != null)
            {
                Console.WriteLine("DetectFromBytes - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
            }

            System.IO.Stream fileStream = System.IO.File.OpenRead(filename);
            result = CharsetDetector.DetectFromStream(fileStream);

            if (result.Detected != null)
            {
                Console.WriteLine("DetectFromStream - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
            }

Result:
image

The file is a HTML UTF-8 (without BOM) encoded file containing 1 simple emoji : 😀
(attached in the zip below)
utf8_with_emoji.zip

Why does the DetectFromBytes method gives a different result?

@304NotModified 304NotModified added this to the 1.1 milestone Feb 6, 2019
@304NotModified 304NotModified modified the milestones: 2.0, 2.1 Mar 27, 2019
@304NotModified
Copy link
Member

Why does the DetectFromBytes method gives a different result?

Need some time, but figured something out

  • The detect stream read in buffers of 1024 bytes. After 3 cycles, it's convinced it's UTF-8
  • The detect bytes reads all bytes (as they al already in memory) (73939 bytes), and after that It don't think it's ut8
    --

Now we have to check why it's after 3072 bytes UTF-8 and not after 73939 bytes.

@304NotModified
Copy link
Member

304NotModified commented Jun 29, 2019

it failes at first at bytecount 73837 (so last 102 bytes)

@304NotModified 304NotModified modified the milestones: 2.1, Backlog Aug 13, 2019
@304NotModified
Copy link
Member

F0 9F 98 80 is not recognized as utf8 :(

@rstm-sf
Copy link
Collaborator

rstm-sf commented Nov 9, 2019

In general, emoji is another logic :)

@rstm-sf
Copy link
Collaborator

rstm-sf commented Nov 9, 2019

Maybe can come up with a table of special characters that it need to skip in a certain scenario

@304NotModified
Copy link
Member

replaced with #149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants