nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.

If all sequences in a text are positive sequences, the ratio of positive
sequences to total sequences cannot distinguish between 2 very close
charsets. A ratio of positive sequences per letter, on the other hand,
can break a tie between 2 encodings. If adding a letter does not
increase the number of positive sequences, the confidence decreases
(reflecting the fact that it was likely not a letter).
Conversely, if the number of positive sequences increases, so
does the confidence.
For instance, this fixes wrong detections between ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
Jehan committed Nov 30, 2015
1 parent 9cb5764 commit 4f1c3ff
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions src/nsSBCharSetProber.cpp
@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)

  if (mTotalSeqs > 0) {
    r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
    /* Multiply by a ratio of positive sequences per characters.
     * This would help in particular to distinguish close winners.
     * Indeed if you add a letter, you'd expect the positive sequence count
     * to increase as well. If it doesn't, it may mean that this new codepoint
     * may not have been a letter, but instead a symbol (or some other
     * character). This could make the difference between very closely related
     * charsets used for the same language.
     */
    r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
    r = r*mFreqChar/mTotalChar;
    if (r >= (float)1.00)
      r = (float)0.99;
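
The tie-breaking effect of the added factor can be illustrated outside the prober. The following is only a minimal sketch, not the project's code: the `ProberCounters` struct, the `confidence` function, its `usePerCharRatio` switch, the `0.01f` fallback and all counter values are hypothetical stand-ins chosen for illustration. Only the arithmetic mirrors the three `r = ...` lines and the cap from the hunk above.

```cpp
#include <cassert>

// Hypothetical container for the counters read by the confidence formula.
// The real prober keeps these as members (mSeqCounters, mTotalSeqs, ...).
struct ProberCounters {
    unsigned positiveSeqs;       // stands in for mSeqCounters[POSITIVE_CAT]
    unsigned totalSeqs;          // stands in for mTotalSeqs
    unsigned totalChars;         // stands in for mTotalChar
    unsigned freqChars;          // stands in for mFreqChar
    float typicalPositiveRatio;  // stands in for mModel->mTypicalPositiveRatio
};

// Mirrors the arithmetic of the hunk. usePerCharRatio toggles the new factor
// so both behaviours can be compared (the toggle itself is an assumption).
float confidence(const ProberCounters& c, bool usePerCharRatio) {
    if (c.totalSeqs == 0 || c.totalChars == 0)
        return 0.01f;  // assumed fallback, also guards the divisions below
    float r = 1.0f * c.positiveSeqs / c.totalSeqs / c.typicalPositiveRatio;
    if (usePerCharRatio)
        r = r * c.positiveSeqs / c.totalChars;  // the factor this commit adds
    r = r * c.freqChars / c.totalChars;
    if (r >= 1.0f)
        r = 0.99f;  // same cap as in the hunk
    return r;
}

// Two made-up candidates: every sequence is positive in both, but candidate B
// finds more positive sequences for the same number of characters.
void checkTieBreak() {
    ProberCounters a{90, 90, 100, 100, 0.98f};
    ProberCounters b{95, 95, 100, 100, 0.98f};
    assert(confidence(a, false) == confidence(b, false));  // tie without it
    assert(confidence(b, true) > confidence(a, true));     // tie broken
}
```

Calling `checkTieBreak()` from a test or a `main` exercises both cases: without the per-character factor the two candidates both hit the cap and tie, while with it the candidate that yields more positive sequences per character comes out ahead.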