Detected encoding is wrong with DetectFromBytes, ok with other methods for UTF-8 file containing emoji #38

fretman92 · 2017-08-30T15:15:25Z

Test program launched from latest source:

    string filename = args[0];

    // 1. Detect directly from the file path.
    var result = CharsetDetector.DetectFromFile(filename);

    if (result.Detected != null)
    {
        Console.WriteLine("DetectFromFile - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
    }

    // 2. Detect from the complete file contents held in memory.
    byte[] bytes = System.IO.File.ReadAllBytes(filename);
    result = CharsetDetector.DetectFromBytes(bytes);

    if (result.Detected != null)
    {
        Console.WriteLine("DetectFromBytes - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
    }

    // 3. Detect from a stream; the using block disposes the file handle afterwards.
    using (System.IO.Stream fileStream = System.IO.File.OpenRead(filename))
    {
        result = CharsetDetector.DetectFromStream(fileStream);
    }

    if (result.Detected != null)
    {
        Console.WriteLine("DetectFromStream - Charset: {0}, confidence: {1}", result.Detected.EncodingName, result.Detected.Confidence);
    }

Result:

The file is an HTML file encoded as UTF-8 (without BOM) that contains one simple emoji: 😀
(attached in the zip below)
utf8_with_emoji.zip

Why does the DetectFromBytes method give a different result?


304NotModified · 2019-06-29T14:24:52Z

Why does the DetectFromBytes method give a different result?

It took some time, but I figured something out:

DetectFromStream reads in buffers of 1024 bytes. After 3 cycles, it is convinced the content is UTF-8.
DetectFromBytes reads all bytes at once (they are already in memory; 73939 bytes), and after that it does not think the content is UTF-8.

Now we have to check why the result is UTF-8 after 3072 bytes but not after 73939 bytes.

304NotModified · 2019-06-29T14:30:50Z

It first fails at byte count 73837 (so within the last 102 bytes).
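To make the numbers in the comments above concrete, here is a small sketch of the arithmetic only. It does not use the detector itself. `ChunkWalk`, `CountChunks` and `ChunkIndexOf` are hypothetical names for illustration; the 1024-byte buffer size, the 73939-byte file size and the 73837 offset are taken from this thread. If the stream path really stops once it is convinced after 3 reads, as the comment suggests, it would have examined only the first 3072 bytes and never reached the failing offset.

```csharp
using System;
using System.IO;

public static class ChunkWalk
{
    // Buffer size reported in this thread for the stream-based detection.
    public const int BufferSize = 1024;

    // Zero-based index of the chunk that contains the given byte offset.
    public static int ChunkIndexOf(long byteOffset)
    {
        return (int)(byteOffset / BufferSize);
    }

    // Number of reads needed to consume a stream in BufferSize chunks.
    public static int CountChunks(Stream stream)
    {
        var buffer = new byte[BufferSize];
        int chunks = 0;
        while (stream.Read(buffer, 0, buffer.Length) > 0)
        {
            chunks++;
        }
        return chunks;
    }

    public static void Main()
    {
        // A dummy payload with the same length as the file from the issue.
        using (var stream = new MemoryStream(new byte[73939]))
        {
            Console.WriteLine(CountChunks(stream)); // 73
        }

        // The failing offset lies in chunk index 72, far past the third chunk.
        Console.WriteLine(ChunkIndexOf(73837)); // 72
    }
}
```

This would explain why only DetectFromBytes, which looks at every byte, trips over something near the end of the file.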

304NotModified · 2019-08-14T21:03:09Z

F0 9F 98 80 is not recognized as UTF-8 :(
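For reference, F0 9F 98 80 is the well-formed 4-byte UTF-8 encoding of U+1F600 (😀), so the bytes themselves are valid. A minimal sketch that confirms this with the .NET strict decoder, independent of the detector; `EmojiUtf8Check` and `IsValidUtf8` are hypothetical helper names, not part of this library:

```csharp
using System;
using System.Text;

public static class EmojiUtf8Check
{
    // Strict decoder: no BOM, throws on any invalid byte sequence.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the whole buffer decodes as well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] emoji = { 0xF0, 0x9F, 0x98, 0x80 };

        Console.WriteLine(IsValidUtf8(emoji)); // True

        // Decode and print the code point in hexadecimal.
        string text = StrictUtf8.GetString(emoji);
        Console.WriteLine(char.ConvertToUtf32(text, 0).ToString("X")); // 1F600

        // A truncated sequence is rejected by the strict decoder.
        Console.WriteLine(IsValidUtf8(new byte[] { 0xF0, 0x9F, 0x98 })); // False
    }
}
```

This only shows that the input is valid UTF-8; it says nothing about how the detector scores 4-byte sequences internally.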

rstm-sf · 2019-11-09T11:35:00Z

In general, emoji need different logic :)

rstm-sf · 2019-11-09T11:36:50Z

Maybe we can come up with a table of special characters that the detector needs to skip in certain scenarios.

304NotModified · 2022-06-27T20:26:57Z

Replaced by #149.

304NotModified added this to the 1.1 milestone Feb 6, 2019

304NotModified added the bug label Feb 6, 2019

304NotModified modified the milestones: 2.0, 2.1 Mar 27, 2019

304NotModified closed this as completed Jun 29, 2019

304NotModified reopened this Jun 29, 2019

304NotModified modified the milestones: 2.1, Backlog Aug 13, 2019

rstm-sf mentioned this issue Nov 23, 2019

Update README, resolve some todo and add tests #99

Merged

3 tasks

adimosh mentioned this issue Feb 2, 2022

SingleByteCharSetProber.Reset() does not correctly reset #138

Closed

304NotModified mentioned this issue Jun 27, 2022

DetectFromBytes should follow same logic as DetectFromStream #149

Open

304NotModified closed this as completed Jun 27, 2022
