Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add detect encoding with BOM: UTF-7 and GB-18030 #98

Merged
merged 8 commits into from
Nov 20, 2019
Merged

Add detect encoding with BOM: UTF-7 and GB-18030 #98

merged 8 commits into from
Nov 20, 2019

Conversation

rstm-sf
Copy link
Collaborator

@rstm-sf rstm-sf commented Nov 16, 2019

Resolve #79

Add detect

  • UTF-7 BOM
  • GB-18030 BOM

@rstm-sf rstm-sf changed the title Add detect encoding with BOM: UTF-7 and GB-18030 WIP: Add detect encoding with BOM: UTF-7 and GB-18030 Nov 16, 2019
@rstm-sf
Copy link
Collaborator Author

rstm-sf commented Nov 16, 2019

Simplification of checks in FindCharSetByBom
because len <= buf.Length always

So in the end it is called from the following places

detector.Feed(bytes, 0, bytes.Length);

while ((read = stream.Read(buff, 0, toRead)) > 0)
{
detector.Feed(buff, 0, read);

via

protected virtual void Feed(byte[] buf, int offset, int len)

_done = IsStartsWithBom(buf, len);

private bool IsStartsWithBom(byte[] buf, int len)
{
var bomSet = FindCharSetByBom(buf, len);

@rstm-sf
Copy link
Collaborator Author

rstm-sf commented Nov 17, 2019

It's bug or feature?

@rstm-sf
Copy link
Collaborator Author

rstm-sf commented Nov 17, 2019

At least add verification will be more efficient

@rstm-sf
Copy link
Collaborator Author

rstm-sf commented Nov 17, 2019

It seemed like it would be better, because otherwise, you need to add a length check every time

@rstm-sf rstm-sf changed the title WIP: Add detect encoding with BOM: UTF-7 and GB-18030 Add detect encoding with BOM: UTF-7 and GB-18030 Nov 17, 2019
Copy link
Member

@304NotModified 304NotModified left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks!

[TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x39 })]
[TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x2B })]
[TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x2F })]
[TestCase(new byte[] { 0x2B, 0x2F, 0x76, 0x38, 0x2D })]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍

@304NotModified 304NotModified merged commit 303f024 into CharsetDetector:master Nov 20, 2019
@304NotModified 304NotModified added this to the 2.3 milestone Nov 20, 2019
@304NotModified
Copy link
Member

Thanks for the refactor also :)

@rstm-sf rstm-sf deleted the feature/add_detect_utf7_bom branch January 12, 2020 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add detect encoding with BOM: UTF-7 and GB-18030
2 participants