Port multi-byte character ratio detection in UTF-8 prober confidence function from jschardet #117

Closed
yinyue200 wants to merge 2 commits

yinyue200
Contributor

fix #108

@yinyue200 yinyue200 changed the title Port Multi-byte character ratio detection in UTF-8 prober confidence function from jschardet Port multi-byte character ratio detection in UTF-8 prober confidence function from jschardet Feb 19, 2021
@304NotModified
Member

@rstm-sf do you think we should merge this one?

 {
     for (int i = 0; i < numOfMBChar; i++)
-        unlike *= ONE_CHAR_PROB;
+        unlike *= (float)Math.Pow(ONE_CHAR_PROB, numOfMBChar);
Collaborator

@rstm-sf rstm-sf Apr 9, 2021

Hello, sorry for the delay, it took a while to understand this change.

This method can be simplified to the following:

public override float GetConfidence(StringBuilder status = null)
{
    const float like = 0.99f;

    if (numOfMBChar >= 6)
        return like;

    var mbCharRatio = (float)mbCharLen / (fullLen - basicAsciiLen);
    if (mbCharRatio > 0.6f)
        return like;

    var negative = (float)Math.Pow(ONE_CHAR_PROB, numOfMBChar * numOfMBChar);
    return like * (1f - negative);
}

I found a partial explanation in this PR: aadsm/jschardet#59

But this particular change is still not entirely clear to me, so I asked about it here: aadsm/jschardet#57 (comment)

Also, I can't figure out why it works.

Contributor Author

Collaborator

If ONE_CHAR_PROB is set to 0.45f, the tests also start to pass (instead of using the double pow).

(... and I still can't figure out why it works)
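
To make the comparison concrete, here is a minimal sketch that tabulates the confidence for 0 to 5 multi-byte characters under three variants: the plain exponent `n`, the squared exponent `n * n` (the "double pow"), and the plain exponent with `ONE_CHAR_PROB = 0.45f`. It uses the `like * (1f - negative)` form from the simplified method above. The class and method names are hypothetical, and `0.50f` as the current value of `ONE_CHAR_PROB` is an assumption that is not stated in this thread.

```csharp
using System;

public static class Utf8ConfidenceSketch
{
    // Confidence in the form used by the simplified method:
    // like * (1 - oneCharProb^exponent), with like = 0.99.
    // "exponent" is numOfMBChar for the plain variant and
    // numOfMBChar * numOfMBChar for the "double pow" variant.
    public static float Confidence(float oneCharProb, int exponent)
    {
        const float like = 0.99f;
        var negative = (float)Math.Pow(oneCharProb, exponent);
        return like * (1f - negative);
    }

    public static void Main()
    {
        // Assumption: 0.50f is the current ONE_CHAR_PROB; 0.45f is the
        // alternative value mentioned in the review comment.
        Console.WriteLine("n   0.50^n   0.50^(n*n)   0.45^n");
        for (int n = 0; n < 6; n++)
        {
            var plain = Confidence(0.50f, n);
            var squared = Confidence(0.50f, n * n);
            var lowerProb = Confidence(0.45f, n);
            Console.WriteLine($"{n}   {plain:F4}   {squared:F4}       {lowerProb:F4}");
        }
    }
}
```

In this form the plain and squared exponents give the same value at n = 0 and n = 1, so any difference between those two variants can only show up for inputs with two or more multi-byte characters.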

@@ -107,11 +119,17 @@ public override float GetConfidence(StringBuilder status = null)
{
    float unlike = 0.99f;
    float confidence;
    var mbCharRatio = 0.0f;
    var nonBasciAsciiLen = fullLen - basicAsciiLen;
    if (nonBasciAsciiLen > 0)
Collaborator

It seems to me that this is always true. Why could it be otherwise?
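
One case where the guard could matter is sketched below, assuming `fullLen` counts every byte fed to the prober and `basicAsciiLen` counts the plain ASCII ones (an assumption, since the counter updates are not shown in this hunk). For pure-ASCII input the difference would then be zero. In C#, dividing a `float` by zero does not throw but yields `NaN` or infinity, so the unguarded ratio from the simplified method silently fails the `> 0.6f` comparison. The helper names here are hypothetical.

```csharp
using System;

public static class RatioGuardSketch
{
    // Hypothetical helper mirroring the guarded computation in the diff.
    public static float GuardedRatio(int mbCharLen, int fullLen, int basicAsciiLen)
    {
        var nonBasicAsciiLen = fullLen - basicAsciiLen;
        return nonBasicAsciiLen > 0 ? (float)mbCharLen / nonBasicAsciiLen : 0.0f;
    }

    // Same ratio without the guard, as written in the simplified method.
    public static float UnguardedRatio(int mbCharLen, int fullLen, int basicAsciiLen)
    {
        return (float)mbCharLen / (fullLen - basicAsciiLen);
    }

    public static void Main()
    {
        // Assumed counters for a 10-byte, pure-ASCII input.
        Console.WriteLine(GuardedRatio(0, 10, 10));                 // guard keeps the ratio at 0
        Console.WriteLine(float.IsNaN(UnguardedRatio(0, 10, 10)));  // 0f / 0f is NaN
        Console.WriteLine(UnguardedRatio(0, 10, 10) > 0.6f);        // comparisons with NaN are false
    }
}
```

Under those assumptions both versions end up on the same branch, so the guard mainly keeps `NaN` out of the intermediate value rather than changing the result.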

@304NotModified
Member

@yinyue200 do you think you could check the review comments? Or should we close this PR for now?

@yinyue200 yinyue200 closed this Jul 24, 2021
Successfully merging this pull request may close these issues.

File detected as Windows-1250, but is UTF-8