Add detect encoding with BOM: UTF-7 and GB-18030 #98

rstm-sf · 2019-11-16T10:52:57Z

Resolve #79

Adds detection of:

- UTF-7 BOM
- GB-18030 BOM
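
For readers unfamiliar with the two signatures, here is a minimal sketch of what detecting them could look like. It is not the code from this PR: the class `BomSketch`, the method `TryDetectBom`, and the returned names are hypothetical. The byte values are the published BOM signatures: `84 31 95 33` for GB-18030, and `2B 2F 76` followed by `38`, `39`, `2B`, or `2F` for UTF-7.

```csharp
using System;

// Hypothetical sketch, not the UTF-unknown implementation.
public static class BomSketch
{
    // Returns true and sets `name` when the first `len` bytes of `buf`
    // start with a GB-18030 or UTF-7 BOM; otherwise returns false.
    public static bool TryDetectBom(byte[] buf, int len, out string name)
    {
        name = string.Empty;

        // Both signatures checked here are four bytes long.
        if (buf == null || len < 4 || len > buf.Length)
            return false;

        // GB-18030 BOM: 84 31 95 33
        if (buf[0] == 0x84 && buf[1] == 0x31 && buf[2] == 0x95 && buf[3] == 0x33)
        {
            name = "gb18030";
            return true;
        }

        // UTF-7 BOM: 2B 2F 76, then one of 38, 39, 2B, 2F
        if (buf[0] == 0x2B && buf[1] == 0x2F && buf[2] == 0x76 &&
            (buf[3] == 0x38 || buf[3] == 0x39 || buf[3] == 0x2B || buf[3] == 0x2F))
        {
            name = "utf-7";
            return true;
        }

        return false;
    }
}
```

The `Try...` shape is only a convenience for the sketch; the actual detector returns its own result type.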


rstm-sf · 2019-11-16T17:37:10Z

Simplified the checks in FindCharSetByBom, because len <= buf.Length always holds.

In the end, it is called from the following places:

UTF-unknown/src/CharsetDetector.cs

Line 142 in cb3dca2

detector.Feed(bytes, 0, bytes.Length);

UTF-unknown/src/CharsetDetector.cs

Lines 199 to 201 in cb3dca2

    
           while ((read = stream.Read(buff, 0, toRead)) > 0) 
        
           { 
        
               detector.Feed(buff, 0, read);

via

UTF-unknown/src/CharsetDetector.cs

Line 269 in cb3dca2

protected virtual void Feed(byte[] buf, int offset, int len)

UTF-unknown/src/CharsetDetector.cs

Line 283 in cb3dca2

_done = IsStartsWithBom(buf, len);

UTF-unknown/src/CharsetDetector.cs

Lines 297 to 299 in cb3dca2

    
           private bool IsStartsWithBom(byte[] buf, int len) 
        
           { 
        
               var bomSet = FindCharSetByBom(buf, len);
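
The call chain above is what makes the simplification safe: every caller passes either `bytes.Length` or the count returned by `stream.Read`, so `len` never exceeds `buf.Length`. Under that assumption, a signature check only has to compare against `len`. The following is a rough sketch of that idea, with a hypothetical helper `StartsWith` that is not part of the library:

```csharp
using System;

// Hypothetical helper, shown only to illustrate the single length check.
public static class PrefixSketch
{
    // Assumes the caller guarantees len <= buf.Length, as the call sites
    // quoted above do. Only `len` is compared against the signature length.
    public static bool StartsWith(byte[] buf, int len, params byte[] signature)
    {
        if (len < signature.Length)
            return false;

        for (int i = 0; i < signature.Length; i++)
        {
            if (buf[i] != signature[i])
                return false;
        }

        return true;
    }
}
```

With a helper like this, each BOM test becomes one call, for example `PrefixSketch.StartsWith(buf, len, 0x84, 0x31, 0x95, 0x33)`, instead of a repeated length check followed by byte comparisons.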

rstm-sf · 2019-11-17T07:10:08Z

Is it a bug or a feature?

rstm-sf · 2019-11-17T07:32:24Z

At the very least, adding the verification will be more efficient.

rstm-sf · 2019-11-17T07:56:20Z

It seemed like it would be better, because otherwise you would need to add a length check every time.

304NotModified

Great work! Thanks!

304NotModified · 2019-11-20T16:21:01Z

tests/CharsetDetectorTest.cs

+        [TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x39 })]
+        [TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x2B })]
+        [TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x2F })]
+        [TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x38, 0x2D })]


304NotModified · 2019-11-20T16:22:24Z

Thanks for the refactor also :)

Add test for detect UTF-7 BOM

4d76ea4

rstm-sf changed the title ~~Add detect encoding with BOM: UTF-7 and GB-18030~~ WIP: Add detect encoding with BOM: UTF-7 and GB-18030 Nov 16, 2019

rstm-sf added 2 commits November 16, 2019 13:58

Simplification of checks in FindCharSetByBom

980b1c0

because len <= buf.Length always

Add detect UTF-7 BOM

758136e

Add test for detect GB18030 BOM

6440658

Add check detect GB18030 BOM

67e6ace

refactor FindCharSetByBom

5eeb066

rstm-sf changed the title ~~WIP: Add detect encoding with BOM: UTF-7 and GB-18030~~ Add detect encoding with BOM: UTF-7 and GB-18030 Nov 17, 2019

rstm-sf added 2 commits November 17, 2019 11:00

Remove extra changes

f1afa1d

refactor check position

9ac0615

304NotModified approved these changes Nov 20, 2019


304NotModified merged commit 303f024 into CharsetDetector:master Nov 20, 2019

304NotModified added this to the 2.3 milestone Nov 20, 2019

304NotModified added feature refactor labels Nov 20, 2019

rstm-sf deleted the feature/add_detect_utf7_bom branch January 12, 2020 08:26
