nsSBCharSetProber: multiply confidence by ratio of positive seqs per …

…chars. If all sequences in a text are positive sequences, the ratio of positive sequences to total sequences cannot break a tie between two very close charsets. A ratio of positive sequences per letter, on the other hand, can break a tie between two encodings. If adding a letter does not increase the number of positive sequences, the confidence decreases (reflecting the fact that the character was likely not a letter). Conversely, if the number of positive sequences increases, so does the confidence. For instance, this fixes wrong detections between ISO-8859-1 and ISO-8859-15: when letters only available in ISO-8859-15 appear in a text, we expect confidence to tilt towards the close yet slightly different ISO-8859-15.
BYVoid · Nov 30, 2015 · 4f1c3ff
1 parent 9cb5764
commit 4f1c3ff
Showing 1 changed file with 9 additions and 0 deletions.
diff --git a/src/nsSBCharSetProber.cpp b/src/nsSBCharSetProber.cpp
@@ -102,6 +102,15 @@ float nsSingleByteCharSetProber::GetConfidence(void)
 
   if (mTotalSeqs > 0) {
     r = ((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio;
+    /* Multiply by the ratio of positive sequences per character.
+     * This helps in particular to distinguish close winners.
+     * Indeed, if you add a letter, you'd expect the positive sequence count
+     * to increase as well. If it doesn't, the new codepoint was probably
+     * not a letter but a symbol (or some other character). This can make
+     * the difference between very closely related charsets used for the
+     * same language.
+     */
+    r = r*mSeqCounters[POSITIVE_CAT] / mTotalChar;
     r = r*mFreqChar/mTotalChar;
     if (r >= (float)1.00)
       r = (float)0.99;
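
To see how the extra factor can break a tie, here is a minimal standalone sketch of the confidence formula from the hunk above. The `Counters` struct, the `confidence` function, the `useLetterRatio` flag, the `0.01f` fallback, the zero-`totalChar` guard, and all numbers are assumptions made for illustration. They are not part of the prober's API, and the counts were not measured on real text.

```cpp
// Hypothetical stand-in for the prober's state; each field mirrors a name
// used in the diff, but this struct does not exist in the source tree.
struct Counters {
    unsigned positiveSeqs;       // mSeqCounters[POSITIVE_CAT]
    unsigned totalSeqs;          // mTotalSeqs
    unsigned totalChar;          // mTotalChar
    unsigned freqChar;           // mFreqChar
    float typicalPositiveRatio;  // mModel->mTypicalPositiveRatio
};

// Sketch of the confidence computation. With useLetterRatio == false it
// follows the lines that were already present in the hunk; with true it
// also applies the factor added by this commit.
float confidence(const Counters& c, bool useLetterRatio) {
    // Assumed fallback and guard: the hunk only shows the mTotalSeqs > 0 branch.
    if (c.totalSeqs == 0 || c.totalChar == 0)
        return 0.01f;

    float r = 1.0f * c.positiveSeqs / c.totalSeqs / c.typicalPositiveRatio;
    if (useLetterRatio)
        r = r * c.positiveSeqs / c.totalChar;  // positive sequences per character
    r = r * c.freqChar / c.totalChar;
    if (r >= 1.0f)
        r = 0.99f;
    return r;
}

// Made-up counts for two closely related charsets on the same text.
// In both cases every sequence is positive, so the first ratio is 1.0.
const Counters charsetA{90, 90, 100, 95, 0.98f};
const Counters charsetB{94, 94, 100, 95, 0.98f};
```

With these assumed counts, `confidence(charsetA, false)` and `confidence(charsetB, false)` are equal, because both charsets have only positive sequences and the same frequent-character ratio. With `useLetterRatio` set to `true`, `charsetB` scores higher, since more of its characters took part in a positive sequence.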