SMOG calculation discrepancies #31

Open
srdjan-stojkovic opened this issue Jun 24, 2015 · 2 comments

@srdjan-stojkovic

Hi,

Text: "June 23rd, 2015 How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate using the Blue Cross and Blue Shield brand, they are generally limited to business in a specific state or region as part of a licensing agreement with their trade group, the Blue Cross and Blue Shield Association. So when Anthem (ANTM), a major operator of Blue Cross plans, made its $184-a-share offer for Cigna (CI) to grow both health insurance businesses, it created potential hurdles when it comes to Anthem’s valuable Blue Cross brands expanding."

On readability-score.com I'm getting a value of 15.2 for SMOG, but $textStatistics->smogIndex($input) returns only 9.4. That is a big difference. Am I doing something wrong?

gburtini commented Mar 5, 2016

https://travis-ci.org/DaveChild/Text-Statistics

It appears SMOG is broken. Can anyone confirm this?

jee7 commented Jul 5, 2018

Yes. There seem to be many issues here.

  1. The SMOG value here is always "normalised" (i.e. clamped) to the range [0, 12]. With that enabled you can never get 15.2, so I turned it off:
public $normalise = false;
  2. The SMOG formula is implemented incorrectly. It takes the square root of the sum and multiplies last, whereas the order should be: square root, then multiplication, then addition:
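        Maths::bcCalc(
            Maths::bcCalc(
                Maths::bcCalc(
                    // polysyllable count * (30 / sentence count)
                    Maths::bcCalc(
                        Syllables::wordsWithThreeSyllables($strText, true, $this->strEncoding),
                        '*',
                        Maths::bcCalc(
                            30,
                            '/',
                            Text::sentenceCount($strText, $this->strEncoding)
                        )
                    ),
                    'sqrt', // square root first
                    0
                ),
                '*', // then multiply
                1.043
            ),
            '+', // and add the constant last
            3.1291
        );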
  3. When the input text is cleaned, it is passed through utf8_decode. Any character that has no ISO-8859-1 equivalent (such as the curly quotes ’ and “ in your text) gets converted to a "?" sign, and those are then interpreted as sentence terminators. So your example text has 2 sentences, but the script finds 5. I commented that call out:
//$strText = utf8_decode($strText);
  4. I'm not sure about this one, but I also removed all the words that contain digits. It didn't make sense to me to count "23rd" or "$184-a-share" as words.
$strText = preg_replace('/[^.\s]*[0-9][^.\s]*/', '', $strText); // Remove words that contain digits
$strText = str_replace("'", '', $strText); // Remove the ' symbol, not sure if it helps
$strText = preg_replace('/ {2,}/', ' ', $strText); // Collapse runs of spaces (because words are counted based on the number of spaces)
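
For anyone who wants to reproduce the text-cleaning effects in points 3 and 4, here is a minimal sketch. `countTerminators` is a hypothetical helper that treats every `.`, `!` or `?` as a sentence end, which is an assumption about how sentences get counted and not the library's actual code. The sample strings are made up. Note that `utf8_decode` is deprecated as of PHP 8.2, and that the sketch collapses runs of spaces with ` {2,}`, since a pattern that only matches pairs would turn three spaces into two in a single pass.

```php
<?php
// Hypothetical helper (not the library's code): count '.', '!' and '?'
// characters, assuming each one ends a sentence.
function countTerminators(string $text): int
{
    return preg_match_all('/[.!?]/', $text);
}

// Two sentences containing curly quotes (U+2019, U+201C, U+201D), written
// with escapes so the example does not depend on the file's encoding.
$text = "Anthem\u{2019}s offer \u{201C}created hurdles\u{201D}. It limits Anthem\u{2019}s brand.";

// utf8_decode() converts UTF-8 to ISO-8859-1 and substitutes '?' for every
// character that has no ISO-8859-1 equivalent. '@' hides the PHP 8.2+
// deprecation notice.
$decoded = @utf8_decode($text);

echo countTerminators($text), "\n";    // 2
echo countTerminators($decoded), "\n"; // 6 (2 periods + 4 substituted '?')

// Point 4: drop tokens that contain a digit, then collapse the leftover spaces.
$header   = 'June 23rd, 2015 How its $184-a-share offer';
$noDigits = preg_replace('/[^.\s]*[0-9][^.\s]*/', '', $header);
$clean    = trim(preg_replace('/ {2,}/', ' ', $noDigits));

echo $clean, "\n"; // June How its offer
```

Every substituted "?" adds one to the terminator count, which is how a 2-sentence text can end up being counted as 5 sentences.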

Now, I don't have an account on readability-score.com, but I tried with other online calculators:

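| Metric | Online-Utility | LearningAndWork | StoryToolz | Current TS | Improved TS |
|---|---|---|---|---|---|
| characters | 437 | - | 436 | 427 | 425 |
| words | 94 | 92 | 94 | 94 | 92 |
| poly-words | - | 14 | - | 13 | 13 |
| sentences | 2 | 2 | 2 | 5 | 2 |
| syl. per word | 1.48 | - | 1.38 | 1.44 | 1.46 |
| ARI | 23.97 | - | 23.9 | 9.4 | 23.3 |
| Gunning-F | 23.06 | - | 23.9 | 12.6 | 23.6 |
| Flesch-K | 20.19 | - | 19.1 | 8.7 | 19.5 |
| Coleman-L | 10.94 | - | 10.8 | 10.9 | 11.4 |
| SMOG | 16.96 | 23.2 | 16.4 | 9.4 | 17.7 |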
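
To make the order-of-operations point concrete, here is a rough sketch that evaluates both orderings on the counts from the table above (13 polysyllabic words; 5 sentences as counted by the current code, 2 after the cleanup). The function names are made up for illustration, and `clamp` is only a stand-in for the normalisation step, assuming it simply limits the score to [0, 12].

```php
<?php
// Standard SMOG: 1.043 * sqrt(polysyllables * 30 / sentences) + 3.1291
function smogCorrect(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences) + 3.1291;
}

// The ordering described in point 2: the constant is added inside the
// square root and the multiplication comes last.
function smogWrongOrder(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences + 3.1291);
}

// Hypothetical stand-in for the normalisation step: clamp to [$min, $max].
function clamp(float $score, float $min, float $max): float
{
    return max($min, min($max, $score));
}

echo round(smogWrongOrder(13, 5), 1), "\n";  // 9.4
echo round(smogCorrect(13, 2), 1), "\n";     // 17.7
echo clamp(smogCorrect(13, 2), 0, 12), "\n"; // 12
```

With these inputs the wrong ordering gives 9.4 and the standard ordering gives 17.7, which matches the Current TS and Improved TS cells in the SMOG row. The last line shows why a clamped score can never reach a value like 15.2.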

I also tried with my own test text, which is a bit longer.

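| Metric | Online-Utility | LearningAndWork | StoryToolz | Current TS | Improved TS |
|---|---|---|---|---|---|
| characters | 2919 | - | 2924 | 2899 | 2890 |
| words | 604 | 604 | 592 | 604 | 585 |
| poly-words | - | 90 | - | 108 | 108 |
| sentences | 32 | 32 | 32 | 32 | 32 |
| syl. per word | 1.65 | - | 1.58 | 1.64 | 1.68 |
| ARI | 10.77 | - | 11.1 | 10.6 | 11 |
| Gunning-F | 12.32 | - | 13.4 | 14 | 13.9 |
| Flesch-K | 11.2 | - | 10.3 | 11.2 | 11.4 |
| Coleman-L | 11.08 | - | 11.6 | 12 | 13.3 |
| SMOG | 12.8 | 17.7 | 12.1 | 10.7 | 13.6 |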

But, yeah, there still seem to be problems. For example, the Coleman-Liau index now comes out higher than in the other calculators.
