In [None]:
import re  #just in case
import nltk
from nltk.tokenize import SyllableTokenizer  #for tokenizing syllables
from nltk.tokenize import RegexpTokenizer   #for word tokenizing to ignore punctuation
import itertools #used later to flatten/merge a list

### How Jargon-y is DND?

To take a look, I grabbed my copy of the Player's Handbook and converted it to a text file. I won't be sharing the file, since the copyright belongs to Wizards of the Coast, but I will be discussing my findings here.

In [None]:
with open('../data/PHB.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
len(text)

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
phbtext = tokenizer.tokenize(text)

Our total tokenized word count for the Player's Handbook is below. This excludes punctuation, unlike the usual nltk word tokenizing, since the measurements I'm going to be using don't need that information. There's also a short glance at a small section to get a feel for what we're looking at. The transformation into a .txt file was not perfect because of some text formatting in the original, but I did the best I could without spending an excessive amount of time correcting every single issue.

In [None]:
len(phbtext) #total words (roughly)

In [None]:
phbtext[8000:8020]

#### On second thought...

Some of the contents of this text are things like the table of contents or an appendix, which list page numbers. I don't think these would be considered part of the text, or would affect its readability. However, this is D&D: there is also a lot of talk about dice and numbers, ability scores, modifiers, rolling a D20, D10, D6, etc. I don't love ditching all instances of numbers, but stripping every digit is the simplest way to drop the page numbers, so I'm going to look at the difference in word count without them. At worst, I'll run the calculations twice and see how different the scores look.
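
Before running it on the whole book, here is a small sketch of what stripping digits does to dice notation. The sample sentence is invented for illustration (it is not from the PHB), and `words` is a stand-in I'm assuming behaves like `RegexpTokenizer(r'\w+').tokenize` for this pattern.

```python
import re

# Invented sample sentence, not taken from the Player's Handbook.
sample = "Roll 2d6 + 3 and add your level (see page 173)."

def words(text):
    # Stand-in for RegexpTokenizer(r'\w+').tokenize: grab runs of word characters.
    return re.findall(r"\w+", text)

with_numbers = words(sample)
without_numbers = words(re.sub(r"\d+", "", sample))

print(with_numbers)     # '2d6', '3' and '173' are all counted as tokens
print(without_numbers)  # '3' and '173' vanish, and '2d6' shrinks to a stray 'd'
```

So the regex drops page numbers and modifiers entirely, but dice like "2d6" leave a leftover "d" behind, which is one reason the two word counts won't differ by exactly the number of numbers in the book.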

In [None]:
nonums = re.sub(r"\d+", "", text)

In [None]:
nonumtxt = tokenizer.tokenize(nonums)
nonumtxt[:10]

In [None]:
len(nonumtxt)

In [None]:
phbtext[:10] #for comparison

## Flesch–Kincaid readability test
<img src=https://wikimedia.org/api/rest_v1/media/math/render/svg/bd4916e193d2f96fa3b74ee258aaa6fe242e110e>

This readability test was created to measure how easy a text is to read. It was formulated for use on technical manuals for the military and has been adopted into the educational field as well. The formula above results in a score: the higher the score, the easier the text is to read. The scale runs from 0 to 100, ranging from Professional (lowest scores) to 5th grade (highest scores).

[source](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)

The test also comes with a translation of the score, the Grade Level Score:
<img src=https://readable.com/wp-content/uploads/2017/01/fleschkincaidchart.png>

[source](https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/)

In [None]:
SSP = SyllableTokenizer()

In [None]:
syls = [SSP.tokenize(w) for w in phbtext]

#### Syllables

I tried multiple syllable splitters, and NLTK's syllable tokenizer was easily the best among them. It's still not perfect. I give it some leeway because we are dealing here with a lot of invented fantasy words, but it struggles even with regular words. See "appreciative" and "rare" below.

It's doing its best, but there's room to grow. We'll take the score we get from our measurement very cautiously.
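
To give a feel for why syllable counting is hard to automate, here is a deliberately naive counter that just counts runs of vowel letters. This is not the algorithm NLTK's `SyllableTokenizer` uses and not something I used for the scores; `vowel_group_count` is a hypothetical helper for illustration only.

```python
import re

def vowel_group_count(word):
    # Naive heuristic: every run of vowel letters counts as one syllable.
    return len(re.findall(r"[aeiouy]+", word.lower()))

for w in ["spell", "rare", "dice"]:
    print(w, vowel_group_count(w))
# "spell" comes out as 1, but "rare" and "dice" both come out as 2,
# because the silent final "e" is counted as its own syllable.
```

Any rule-based splitter has to deal with cases like the silent "e", and English has plenty of them, so some miscounts are to be expected.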

In [None]:
len(syls) #same as tokens - each word is made into a sub-list 

In [None]:
syls[20300:20325]

In [None]:
count = [len(w) for w in syls] #len for each sublist in the larger list of syllables

In [None]:
count[20300:20325] #verification by looking at the same slice as above, looks accurate

In [None]:
sum(count)  #sum of all values = count of all syllables

#### Syllables p.2 (no numbers)

In [None]:
nonumsyls = [SSP.tokenize(w) for w in nonumtxt]

In [None]:
len(nonumsyls)

In [None]:
nonumcount = [len(w) for w in nonumsyls]
sum(nonumcount) #not a HUGE loss or change, but I'll keep this anyway just to see what changes.

#### Sentences

Again, nltk is the best bet for sentence tokenizing. And, again, we can't be 100% positive that this is entirely accurate: some of these "sentences" come from the table of contents, some may be sentence fragments, etc. Still, I feel confident this output is more accurate than the syllable tokenizer's.

Remember, we're tokenizing sentences on the original `text`, from before the punctuation was removed.

In [None]:
sents = nltk.sent_tokenize(text)

In [None]:
sents[400:405]

In [None]:
len(sents)

#### Sents p.2

In [None]:
nonumsents = nltk.sent_tokenize(nonums)
len(nonumsents)

## Running the Calculations

### First: PHB text including numbers

1. Reading Ease
2. Grade Level Score

In [None]:
# Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
206.835-(1.015*(219554/11042))-(84.6*(353627/219554))

In [None]:
(0.39*(219554/11042))+(11.8*(353627/219554))-15.59

### Second: PHB with numbers removed

In [None]:
# Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
206.835-(1.015*(210805/11020))-(84.6*(344878/210805))

In [None]:
(0.39*(210805/11020))+(11.8*(344878/210805))-15.59

Not all that different! Reading Ease comes out at roughly 50.4 with numbers and 49.0 without, and the Grade Level is about 11.2 either way. That's cool to see.
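
For anyone who wants to rerun this without retyping the counts, the four cells above can be wrapped in two small functions. This is just a sketch using the published Flesch constants; the function names are my own, and the counts are the word, sentence, and syllable totals from the cells above.

```python
def flesch_reading_ease(words, sentences, syllables):
    # Higher score = easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# (words, sentences, syllables) as counted earlier in this notebook
phb_counts = {
    "with numbers": (219554, 11042, 353627),
    "without numbers": (210805, 11020, 344878),
}

scores = {
    label: (flesch_reading_ease(*c), flesch_kincaid_grade(*c))
    for label, c in phb_counts.items()
}

for label, (ease, grade) in scores.items():
    print(f"{label}: reading ease {ease:.2f}, grade level {grade:.2f}")
```

Keeping the counts in one place also makes it easy to swap in numbers from a different syllable counter later.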

## Results

According to the scores listed, a score of 50 is the lowest possible score for the "Fairly difficult to read" 10th-12th grade level and the highest for "Difficult to read", which corresponds to a College level. Since the removal of numbers (page numbers, but also dice denominations, tutorial text about rolling dice, and so on) pushed it just a little more into the "College" level difficulty, I think it's reasonable to place the Player's Handbook readability there.

The score information at the wiki source for this readability test shares some other scores for popular texts to put this into some perspective:
- Time Magazine: 52
- Moby Dick: 57.9
    - one particularly long sentence about sharks in chapter 64 has a readability score of −146.77
- Highest (easiest) possible score is 121.22; every sentence must use only 1-syllable words (think Dr. Seuss!)

### Thoughts

The Readability Score is interesting. The Player's Handbook is no child's book, necessarily, but is it more difficult than Moby Dick? That gives me pause. It's ultimately a game instruction book, and the game is playable for children. Is it just down to the fantasy words being used? Prestidigitation, Thaumaturgy, Polymorph... They're tricky words, but once you know the meaning, it's not that crazy.

The Grade Score, which is very similar for the PHB with and without numbers included, makes sense as a good translation! A score of around 11 puts it on par with Jurassic Park. As someone who has read and owns the PHB, I can agree that feels about right.

## SMOG Readability Test

G. Harry McLaughlin created the SMOG (Simple Measure of Gobbledygook) in 1969 to measure text readability. There is a full breakdown of the formula [here](https://readabilityformulas.com/the-smog-readability-formula/), but it functions similarly to the Flesch–Kincaid test, and I'd like to compare.
<img src=https://readabilityformulas.com/wp-content/uploads/01-SMOG-readability-formula.png>

The SMOG test is designed to be run on 3 groups of ten sentences, taken from the beginning, middle, and end of a text, so I'll take some samples rather than use the full text. This is likely because the test was done by hand at the time of its creation, and sampling is easier than working through a whole book, but these are the instructions, so I will stick to them.

So this formula can basically be simplified down to the square root of the total number of polysyllabic words (rounded to the nearest perfect square), plus 3. Since we're sampling exactly 30 sentences, the 30/30 factor is 1 anyway. There are simplified instructions linked below.

Link to Ohio State instructions PDF [here](https://ogg.osu.edu/media/documents/health_lit/WRRSMOG_Example.pdf)

### Sampling from the text

Knowing the beginning of the book is the table of contents and the end is the appendix, I want to pick 30 sentences that are actual prose. I'll do some searching and concatenate the samples into a list!

In [None]:
early = sents[84:94] #some introductory stuff about DND
early 

In [None]:
late = sents[9500:9510] #spells and spell descriptions
late  

In [None]:
mid = sents[4703:4713] #something around the middle point - looks like information about shopping/money
mid  

In [None]:
testvals = early+mid+late
len(testvals) #30 total sentences

### Polysyllabic Words

For the SMOG test, these are defined as words of 3 syllables or more.

`wtoks` is a list of lists (one list of words per sentence), which I merge into one long list and then turn into a list of syllables and then syllable counts. Again, this is a little rough. The syllable counter is... not the best. You can see in the preview below the mistakes it's making again. This module is good, but it still needs a lot of work. From there, I select only the words with 3 or more syllables and count how many appear (a count, not a sum like before).

In [None]:
wtoks = [tokenizer.tokenize(s) for s in testvals] #tokenized sents

In [None]:
merged = list(itertools.chain(*wtoks))
len(merged) #merged into one long list, 651 words in 30 sentences

In [None]:
polysyl = [SSP.tokenize(w) for w in merged]

In [None]:
polysyl[40:50]

In [None]:
polysylval = [len(w) for w in polysyl]

In [None]:
multsyls = [w for w in polysylval if w>=3]
multsyls[:25]

In [None]:
len(multsyls) #89 instances of multisyllabic words

The next step is to find the nearest perfect square and take its square root. 89 falls between 81 (square root **9**) and 100 (square root **10**), and 81 is the nearer of the two.

Adding 3, this gives us a grade level of 12, or 13 if we round up to 100 instead.

### One More Thing...

With our syllable tokens being a bit unsatisfying to me, and with the idea in my mind that this test was designed to be done by hand... I did it by hand. It's only 30 sentences I'm looking through. Let's see how far off the syllable tokenizer really is. After all, we see that it sometimes splits a 3-syllable word into 2 as well as the reverse. Maybe it evens out...

I counted 68 words of 3+ syllables in our 30 sentences by hand. Compared with the tokenizer's 89, that's a net 21 instances where a token was incorrectly split into 3 or more syllables. Not ideal.

This means our nearest perfect square is *actually* 64, which takes us to 8 for a grade level score of 11. This puts us quite in agreement with the results of the Flesch–Kincaid test.
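
As a cross-check on the hand arithmetic, here is a sketch of both versions of the calculation: the simplified "nearest perfect square plus 3" rule, and McLaughlin's full formula, 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291. The function names are my own; the inputs are the two counts from above (89 from the tokenizer, 68 by hand).

```python
import math

def smog_simplified(polysyllables):
    # Round to the nearest perfect square, take its root, add 3.
    low = math.isqrt(polysyllables)
    high = low + 1
    if polysyllables - low**2 <= high**2 - polysyllables:
        nearest_root = low
    else:
        nearest_root = high
    return nearest_root + 3

def smog_full(polysyllables, sentences=30):
    # Full formula; the 30/sentences factor is 1 for a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

for label, count in [("tokenizer", 89), ("by hand", 68)]:
    print(f"{label}: simplified {smog_simplified(count)}, full {smog_full(count):.2f}")
```

The full formula skips the rounding step, so it lands a bit higher than the simplified rule for both counts, but the gap between the tokenizer count and the hand count stays about one grade level either way.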

## Thoughts about Readability Scores

