In [1]:
import re  #used below to strip digits from the text
import nltk
from nltk.tokenize import SyllableTokenizer  #for tokenizing syllables
from nltk.tokenize import RegexpTokenizer   #for word tokenizing to ignore punctuation
import itertools #used later to flatten/merge a list
import pickle

### How Jargon-y is DND?

To take a look, I grabbed my copy of the Player's Handbook (PHB) and converted it to a text file. I won't be sharing the file, since the copyright belongs to Wizards of the Coast, but I will discuss my findings here.

In [2]:
with open('../data/PHB.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
len(text)

1220025

In [4]:
tokenizer = RegexpTokenizer(r'\w+')
phbtext = tokenizer.tokenize(text)

The total tokenized word count for the Player's Handbook is below. Unlike the usual nltk word tokenizing, this excludes punctuation, since the measurements I'm going to use don't need that information. There's also a short glance at a small section to get a feel for what we're looking at. The conversion to a .txt file was not perfect because of some text formatting in the original, but I did the best I could without spending an excessive amount of time correcting every single issue.

In [5]:
len(phbtext) #total words (roughly)

215058

In [6]:
phbtext[8000:8020]

['you',
 'a',
 'different',
 'way',
 'to',
 'calculate',
 'your',
 'AC',
 'If',
 'you',
 'have',
 'multiple',
 'features',
 'that',
 'give',
 'you',
 'different',
 'ways',
 'to',
 'calculate']
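One thing worth knowing about the `\w+` pattern: it splits on apostrophes as well as on other punctuation, so possessives and contractions come out as two tokens, which nudges the word count up a little. Here is a minimal sketch using the standard library's `re.findall` as a stand-in (an assumption on my part: with a plain pattern like this, `RegexpTokenizer` should return the same matches), run on a sentence from the PHB that shows up later in this notebook.

```python
import re

def word_tokens(text):
    # Stand-in for RegexpTokenizer(r'\w+'); assumed to give the same matches
    # for this simple pattern.
    return re.findall(r"\w+", text)

sample = ("Without armor or a shield, your character’s AC equals 10 + "
          "his or her Dexterity modifier.")
tokens = word_tokens(sample)

print(tokens)       # 'character' and 's' come out as separate tokens
print(len(tokens))  # punctuation and the '+' sign are dropped entirely
```

Since every word count in the readability formulas comes from this tokenizer, stray one-letter tokens like `s` and `t` also count as (one-syllable) words.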

#### On second thought...

Some of the contents of this text are things like the table of contents or an appendix, which list page numbers. I don't think these would be considered part of the text, or that they would affect its readability. However, this is D&D: there is also a lot of talk about dice and numbers, ability scores, modifiers, rolling a D20, D10, D6, etc. I don't love ditching all instances of numbers, but stripping every digit is the simplest way to drop the page numbers, so I'm going to compare the word count with and without numbers to see the difference. At worst, I'll run the calculations twice and see how different the scores look.

In [7]:
nonums = re.sub(r"\d+", "", text)

In [8]:
nonumtxt = tokenizer.tokenize(nonums)
nonumtxt[:10]

['Contents',
 'Preface',
 'Introduction',
 'Worlds',
 'of',
 'Adventure',
 'Using',
 'This',
 'Book',
 'How']

In [9]:
len(nonumtxt)

206353

In [10]:
phbtext[:10] #for comparison

['Contents',
 'Preface',
 '4',
 'Introduction',
 '5',
 'Worlds',
 'of',
 'Adventure',
 '5',
 'Using']
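A side effect of stripping digits with `\d+` that is easy to miss: dice notation doesn't disappear, it collapses to a bare `d`. The sketch below uses a made-up line (not from the PHB) and the standard library's `re.findall` as a stand-in for the `\w+` tokenizer.

```python
import re

def word_tokens(text):
    # stdlib stand-in for the \w+ tokenizer used in this notebook
    return re.findall(r"\w+", text)

line = "Roll 1d6 or a d20, see page 45."  # invented example sentence

with_numbers = word_tokens(line)
without_numbers = word_tokens(re.sub(r"\d+", "", line))

print(with_numbers)     # ['Roll', '1d6', 'or', 'a', 'd20', 'see', 'page', '45']
print(without_numbers)  # ['Roll', 'd', 'or', 'a', 'd', 'see', 'page']
```

So the numbers-removed version loses standalone numbers like page references, but keeps a one-letter token for each die mentioned.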

## Flesch–Kincaid readability test
<img src=https://wikimedia.org/api/rest_v1/media/math/render/svg/bd4916e193d2f96fa3b74ee258aaa6fe242e110e>

This readability test was created to measure how easy a text is to read. It was formulated for use on technical manuals for the military and has been adopted into the educational field as well. The formula above results in a score: the higher the score, the easier the text is to read. The scale runs from 0 to 100, ranging from Professional (hardest) to 5th grade (easiest).

[source](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)

This comes with a translation of the score: The Grade Level Score
<img src=https://readable.com/wp-content/uploads/2017/01/fleschkincaidchart.png>

[source](https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/)
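For reference, here are both formulas written out as small functions. This is a sketch using the constants as published on the Wikipedia page linked above; the function names and the toy counts are my own, chosen just to show the arithmetic.

```python
def flesch_reading_ease(words, sentences, syllables):
    # Higher score = easier text.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Toy counts (not from the PHB): 100 words, 5 sentences, 150 syllables.
toy_ease = flesch_reading_ease(100, 5, 150)
toy_grade = flesch_kincaid_grade(100, 5, 150)
print(round(toy_ease, 3), round(toy_grade, 2))

# Upper limit: every sentence is a single one-syllable word.
max_ease = flesch_reading_ease(1, 1, 1)
print(round(max_ease, 2))
```

Both scores depend only on two ratios, words per sentence and syllables per word, which is why the syllable and sentence counts are the numbers to be careful about.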

In [11]:
SSP = SyllableTokenizer()

In [12]:
syls = [SSP.tokenize(w) for w in phbtext]

#### Syllables

I tried multiple syllable splitters, and none were perfect, but NLTK's syllable tokenizer was easily the best among them. I give it some leeway because we are dealing with a lot of invented fantasy words, but it struggles even with regular words. See "ability" and "gnome" below.

It's doing its best, but there's room to grow. We'll take the score we get from our measurement very cautiously.

In [13]:
len(syls) #same as tokens - each word is made into a sub-list 

215058

In [14]:
syls[20300:20325]

[['gno', 'mes'],
 ['Most'],
 ['gno', 'mes'],
 ['in'],
 ['the'],
 ['worlds'],
 ['of'],
 ['DND'],
 ['a', 're'],
 ['rock'],
 ['gno', 'mes'],
 ['in', 'clu', 'ding'],
 ['the'],
 ['tin', 'ker'],
 ['gno', 'mes'],
 ['of'],
 ['the'],
 ['Dra', 'gon', 'lan', 'ce'],
 ['set', 'ting'],
 ['Abi', 'li', 'ty'],
 ['Sco', 're'],
 ['Increa', 'se'],
 ['Yo', 'ur'],
 ['Cons', 'ti', 'tu', 'tion'],
 ['sco', 're']]

In [15]:
count = [len(w) for w in syls] #len for each sublist in the larger list of syllables

In [16]:
count[20300:20325] #verification by looking at the same slice as above, looks accurate

[2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 2, 2, 1, 1, 4, 2, 3, 2, 2, 2, 4, 2]

In [17]:
sum(count)  #sum of all values = count of all syllables

350211
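To put the errors in that slice side by side with something, here is a very rough vowel-group heuristic. It is not the method NLTK's tokenizer uses and it is not something I ran on the full text; it's just a quick cross-check, and it has blind spots of its own. The tokenizer counts in the list are copied from the output above.

```python
import re

def rough_syllables(word):
    """Crude estimate: count runs of vowels, treat a final 'e' as silent."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    # Final 'e' is usually silent, except in '-le' endings ("table").
    if w.endswith("e") and not w.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

# (word, syllable count shown in the tokenizer output above)
shown = [("are", 2), ("Dragonlance", 4), ("Ability", 3), ("Score", 2),
         ("Increase", 2), ("Your", 2), ("Constitution", 4), ("gnomes", 2)]

for word, tokenizer_count in shown:
    print(f"{word:13} tokenizer={tokenizer_count} heuristic={rough_syllables(word)}")
```

The heuristic gets "are", "score", "your", "ability" and "Dragonlance" right where the tokenizer doesn't, but it also counts "gnomes" as 2, so it is no replacement, only a second opinion.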

#### Syllables p.2 (no numbers)

In [18]:
nonumsyls = [SSP.tokenize(w) for w in nonumtxt]

In [19]:
len(nonumsyls)

206353

In [20]:
nonumcount = [len(w) for w in nonumsyls]
sum(nonumcount) #not a HUGE loss or change, but I'll keep this anyway just to see what changes.

341505

#### Sentences

Again, nltk is the best bet for sentence tokenizing. And, again, we can't be 100% positive that this is entirely accurate. Some of these things are table of contents, some may be sentence fragments, etc. I feel confident this output is more accurate than the syllable tokenizer.

Remember, I'm tokenizing sentences on the original `text`, from before the punctuation was removed.

In [21]:
sents = nltk.sent_tokenize(text)

In [22]:
sents[400:405]

['Not all characters wear armor or carry shields, however.',
 'Without armor or a shield, your character’s AC equals 10 + his or her Dexterity modifier.',
 'If your character wears armor, carries a shield, or both, calculate your AC using the rules in chapter 5.',
 'Record your AC on your character sheet.',
 'Your character needs to be proficient with armor and shields to wear and use them effectively, and your armor and shield proficiencies are determined by your class.']

In [23]:
len(sents)

11046

#### Sents p.2

In [24]:
nonumsents = nltk.sent_tokenize(nonums)
len(nonumsents)

11027

## Running the Calculations

### First: PHB text including numbers

1. Reading Ease
2. Grade Level Score

In [48]:
206.853-(1.015*(215058/11046))-(84.6*(350211/215058))

49.3248589059024

In [49]:
(0.39*(215058/11046))+(11.8*(350211/215058))-15.59

11.218729982163296

### Second: PHB with numbers removed

In [52]:
206.853-(1.015*(206353/11046))-(84.6*(341505/206353))

47.882317229015655

In [51]:
(0.39*(206353/11046))+(11.8*(341505/206353))-15.59

11.224157464103332

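Not all that different! That's cool to see.

Two caveats on the arithmetic above. First, the published Reading Ease constant is 206.835, not the 206.853 I typed, so both ease scores should be 0.018 lower. Second, the numbers-removed cells reuse the 11046 sentence count rather than the 11027 from `nonumsents`; swapping it in moves the ease score down by about 0.03 and the grade level up by about 0.01. Neither changes the picture.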

## Results

According to the scores listed, a score of 47-49 lands the PHB at the easier end of "difficult to read", which corresponds to a college level. It's interesting that the removal of numbers (page numbers, but also dice denominations and tutorial text about rolling dice and so on) increased the difficulty as much as it did. I suspect numbers were quite often tokenized as single-syllable words even though the majority of numbers are not single-syllable.

I don't think it's an unreasonable score, but I take it cautiously.
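For anyone who wants to map a Reading Ease score to its band without squinting at the chart, here is a small lookup. The boundaries and labels are my transcription of the table on the Wikipedia page linked above, so double-check them against the source before relying on them.

```python
# (lower bound, label), checked from the top down; transcribed, not official
FLESCH_BANDS = [
    (90, "5th grade (very easy)"),
    (80, "6th grade (easy)"),
    (70, "7th grade (fairly easy)"),
    (60, "8th-9th grade (plain English)"),
    (50, "10th-12th grade (fairly difficult)"),
    (30, "College (difficult)"),
    (10, "College graduate (very difficult)"),
]

def flesch_band(score):
    for lower_bound, label in FLESCH_BANDS:
        if score >= lower_bound:
            return label
    return "Professional (extremely difficult)"

for score in (49.32, 47.88):
    print(score, "->", flesch_band(score))
```

Both PHB scores fall in the same band, within three points of the boundary with the band above it.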

The score information at the wiki source for this readability test shares some other scores for popular texts to put this into some perspective:
- Time Magazine: 52
- Moby Dick: 57.9
    - one particularly long sentence about sharks in chapter 64 has a readability score of −146.77
- The highest (easiest) possible score is 121.22, reached only when every sentence is a single one-syllable word (think Dr. Seuss!)

### Thoughts

The Readability Score is interesting. The Player's Handbook is no children's book, but is it more difficult than Moby Dick? That gives me pause. It's ultimately a game instruction book, and the game is playable by children. Is it just down to the fantasy words being used? Prestidigitation, Thaumaturgy, Polymorph... They're tricky words, but once you know the meaning, it's not that crazy. (It's probably down more to the syllable tokenizer.)

The Grade Score, which is very similar for the PHB with and without numbers included, makes sense as a good translation! A score of around 11 puts it on par with Jurassic Park. As someone who has read and owns the PHB, I can agree that feels about right.

**However**, I still take this score very cautiously. The syllable tokenizer is unreliable and both under- and over-splits across the board. Not to mention that the text file itself was scraped from a PDF and is imperfect. Words are at times split apart because of the original text formatting and caption information, as well as the original layout being two narrow columns of text per page. I did my best to correct as much of it as possible, but I'm sure there are parts I missed.

## SMOG Readability Test

G. Harry McLaughlin created the SMOG (Simple Measure of Gobbledygook) in 1969 to measure text readability. There is a full breakdown of the formula [here](https://readabilityformulas.com/the-smog-readability-formula/), but it functions similarly to the Flesch–Kincaid test, and I'd like to compare.
<img src=https://readabilityformulas.com/wp-content/uploads/01-SMOG-readability-formula.png>

The SMOG test is meant to be run on 3 groups of ten sentences, from the beginning, middle, and end of a text, so I'll take some samples rather than the full text. Now, this is likely because the test was being done by hand at the time of its creation, and this is easier than doing the whole book, but these are the instructions and so I will stick to them.

So this formula can basically be simplified down to the square root of the total number of polysyllabic words (rounded to the nearest perfect square), plus 3. Since we're sampling exactly 30 sentences, the 30/sentences factor is 1 anyway. There are simplified instructions linked below.

Link to Ohio State instructions PDF [here](https://ogg.osu.edu/media/documents/health_lit/WRRSMOG_Example.pdf)

### Sampling from the text

Knowing the beginning of the book is the table of contents and the end is an appendix, I want to pick a good set of 30 sentences. I'll do some searching and concatenating into a list!

In [29]:
early = sents[84:94] #some introductory stuff about DND
early 

['Because the DM can improvise to react to anything the players attempt, DND is infinitely flexible, and each adventure can be exciting and unexpected.',
 'The game has no real end; when one story or quest wraps up, another one can begin, creating an ongoing story called a campaign.',
 'Many people who play the game keep their campaigns going for months or years, meeting with their friends every week or so to pick up the story where they left off.',
 'The adventurers grow in might as the campaign continues.',
 'Each monster defeated, each adventure completed, and each treasure recovered not only adds to the continuing story, but also earns the adventurers new capabilities.',
 'This increase in power is reflected by an adventurer’s level.',
 'There’s no winning and losing in the Dungeons N Dragons game—at least, not the way those terms are usually understood.',
 'Together, the DM and the players create an exciting story of bold adventurers who confront deadly perils.',
 'sometimes an ad

In [30]:
late = sents[9500:9510] #spells and spell descriptions
late  

['Each target must make a Wisdom saving throw and falls unconscious for 10 minutes on a failed save.',
 'A creature awakens if it takes damage or if someone uses an action to shake or slap it awake.',
 'Stunning.',
 'Each target must make a Wisdom saving throw and becomes stunned for 1 minute on a failed save.',
 'Tasha ’s Hideous Laughter 1st-level enchantment Casting Time: 1 action Range: 30 feet Components: V, S, M (tiny tarts and a feather that is waved in the air) Duration: Concentration, up to 1 minute A creature of your choice that you can see within range perceives everything as hilariously funny and falls into fits of laughter if this spell affects it.',
 'The target must succeed on a Wisdom saving throw or fall prone, becoming incapacitated and unable to stand up for the duration.',
 'A creature with an Intelligence score of 4 or less isn’t affected.',
 'At the end of each of its turns, and each time it takes damage, the target can make another Wisdom saving throw.',
 'The ta

In [31]:
mid = sents[4703:4713] #something around the middle point - looks like information about shopping/money
mid  

['Only merchants, adventurers, and those offering professional services for hire commonly deal in coins.',
 'Coinage Common coins come in several different denominations based on the relative worth of the metal from which they are made.',
 'The three most common coins are the gold piece (gp), the silver piece (sp), and the copper piece (cp).',
 'With one gold piece, a character can buy a belt pouch, 50 feet of good rope, or a goat.',
 'A skilled (but not exceptional) artisan can earn one gold piece a day.',
 'The gold piece is the standard unit of measure for wealth, even if the coin itself is not commonly used.',
 'When merchants discuss deals that involve goods or services worth hundreds or thousands of gold pieces, the transactions don’t usually involve the exchange of individual coins.',
 'Rather, the gold piece is a standard measure of value, and the actual exchange is in gold bars, letters of credit, or valuable goods.',
 'One gold piece is worth ten silver pieces, the most preva

In [32]:
testvals = early+mid+late
len(testvals) #30 total sentences

30
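The three slices above were picked by eye. If you wanted to make the sampling repeatable, something like the following hypothetical helper would do it. It is not what this notebook ran, and the function name and the `front_skip`/`back_skip` parameters are my own invention for skipping the table of contents and the appendix.

```python
def smog_sample(sentences, size=10, front_skip=0, back_skip=0):
    """Return `size` sentences each from the start, middle and end of the body."""
    body = sentences[front_skip:len(sentences) - back_skip]
    if len(body) < 3 * size:
        raise ValueError("not enough sentences for three non-overlapping samples")
    mid_start = (len(body) - size) // 2
    return body[:size] + body[mid_start:mid_start + size] + body[-size:]

# Toy run on numbered placeholders instead of real sentences.
fake_sentences = list(range(100))
sample = smog_sample(fake_sentences, front_skip=10, back_skip=10)
print(len(sample))
print(sample[:10], sample[10:20], sample[20:])
```

Picking by eye has one advantage this helper lacks: you can check that each window is running prose rather than a table or a stat block.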

### Polysyllabic Words

For the SMOG test, these are defined as words 3 syllables or longer

`wtoks` is a list of lists (one list of word tokens per sentence), which I merge into one long list and then turn into a list of syllable splits and then syllable counts. Again, this is a little rough. The syllable counter is... not the best. You can see in the preview below the mistakes it's making again. This module is good, but it still needs a lot of work. From there, I select only the words with 3 or more syllables and count how many appear (a count of words, not a sum of syllables like before).

In [33]:
wtoks = [tokenizer.tokenize(s) for s in testvals] #tokenized sents

In [34]:
merged = list(itertools.chain(*wtoks))
len(merged) #merged into one long list, 602 words in 30 sentences

602

In [35]:
polysyl = [SSP.tokenize(w) for w in merged]

In [36]:
polysyl[40:50] #still..... imperfect.

[['be', 'gin'],
 ['crea', 'ting'],
 ['an'],
 ['on', 'going'],
 ['sto', 'ry'],
 ['cal', 'led'],
 ['a'],
 ['cam', 'paign'],
 ['Ma', 'ny'],
 ['peo', 'ple']]

In [37]:
polysylval = [len(w) for w in polysyl]

In [38]:
multsyls = [w for w in polysylval if w>=3]
multsyls[:25]

[3, 4, 3, 5, 3, 4, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 4, 5, 3, 3, 4, 3, 3]

In [39]:
len(multsyls) #96 instances of multisyllabic words

96

The next step is to find the nearest perfect square and take its square root: the nearest perfect square to 96 is 100, and the square root of 100 is **10**.

Adding 3, this gives us a grade level of 13. (College?)

### One More Thing...

With our syllable tokens being a bit unsatisfying to me and with the idea in my mind that this test was designed to be done by hand..... I did it by hand. It's only 30 sentences I'm looking through. Let's see how far off the syllable tokenizer really is. After all, we see that it sometimes splits a 3 syllable word into 2 as well as the reverse. Maybe it evens out...

Counting by hand, there are 66 words of 3+ syllables in our 30 sentences. That's a net of 30 instances where a token was incorrectly split into 3 or more syllables. Not ideal.

This means our nearest perfect square is *actually* 64, which takes us to 8 for a grade level score of 11. This puts us quite in agreement with the results of the Flesch–Kincaid test. 
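Both SMOG results above can be reproduced with a few lines. This is a sketch: the simplified version follows the by-hand instructions (nearest perfect square, square root, plus 3), and the full version uses the constants 1.0430 and 3.1291 from McLaughlin's formula as I understand it from the page linked above. The function names are mine.

```python
import math

def nearest_perfect_square(n):
    root = math.isqrt(n)  # integer square root, rounds down
    lower, upper = root * root, (root + 1) * (root + 1)
    return lower if n - lower <= upper - n else upper

def smog_simplified(polysyllables):
    # By-hand version: square root of the nearest perfect square, plus 3.
    return math.isqrt(nearest_perfect_square(polysyllables)) + 3

def smog_full(polysyllables, sentences=30):
    # Full formula; with exactly 30 sentences the 30/sentences factor is 1.
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291

for label, count in [("tokenizer", 96), ("by hand", 66)]:
    print(label, smog_simplified(count), round(smog_full(count), 2))
```

The full formula lands a little above the simplified one in both cases, but the gap between the tokenizer count and the hand count stays at roughly two grade levels either way.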

## Thoughts about Readability Scores

While, overall, I'm okay with the grade level assessment of the readability of the Player's Handbook being around an 11th grade level, I have some reservations.

1. What does that mean, exactly? 
    - A critique of these readability tests I've run into during my research is that they don't really take into account variation in readers. I assume it means "what an average 11th grader is capable of reading" and is based on statistics from schools and exams. 
    - It's imperfect but it's a reasonable label that can be understood by a typical person to indicate "this is the developmental and educational point a person is at or near to understand this text" and not strictly "anyone younger than grade 11 students will struggle to understand this"<br>
<br>
2. Readability doesn't necessarily correspond to comprehensibility. 
    - The IRS forms for filing individual taxes have a readability score of 8 (according to the PDF [here](https://www.irs.gov/pub/irs-soi/14rpreadabilityfederalincometaxsystem.pdf) on the IRS site, from someone doing a very similar assessment to this one). Maybe this is just me, but I've done my taxes the long way before and those are *not* an easy or pleasant read. 
    - So, great, you use shorter words on average and your syllable to word and sentence ratio indicates the reading level for 8th grade. How well written is the text? IRS forms are circuitous, confusing, and aggravating in a way that I feel complicates the readability very much!
    - The PHB, in comparison, is written fairly colloquially in a modern, approachable tone. Sections are labeled and organized appropriately. It's been a while since I've been a new player, but I think that once the average person adapts to the specialty language, it is not very difficult to read. (definitely not more difficult than tax forms)

## The Question of Jargon

I had the thought about specialized language in Dungeons & Dragons, and how to pinpoint exactly the most niche vocabulary. My method of narrowing down the vocabulary specific to DND is this:
1. Use Peter Norvig's [google](https://norvig.com/ngrams/) "count_1w.txt" corpus of 330,000(ish) of the most common English words (already made into tuples and pickled in Ling 1330), and pull out just the words.
2. Take the tokenized list of words with numbers removed from the PHB, lowercase the full list, and make a set to remove duplicates.
3. Cross-check the two lists and return the words from the PHB not found in the Norvig list.
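The three steps boil down to a set difference. Here is a toy version with made-up inputs (a handful of tokens and a tiny "common words" set standing in for the Norvig list), just to show the shape of it; the function name is mine.

```python
def jargon_candidates(tokens, common_words):
    """Lowercase and dedupe the tokens, keep those missing from common_words."""
    types = {t.lower() for t in tokens}
    return sorted(types - set(common_words))

# Toy stand-ins, not the real data.
toy_tokens = ["The", "cantrip", "the", "Owlbear", "attacks", "Cantrip"]
toy_common = ["the", "attacks", "of", "and"]

toy_jargon = jargon_candidates(toy_tokens, toy_common)
print(toy_jargon)  # ['cantrip', 'owlbear']
```

Anything the common-word list happens to miss (typos, rare inflections, extraction glitches) will also land in the result, so the output is a list of candidates rather than a clean jargon list.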

In [40]:
with open('../pickle_jar/goog1w_rank.pkl', 'rb') as f:
    goog1w_rank = pickle.load(f)

In [41]:
#goog1w_rank is a list of tuples of words and their frequency counts. We just want the words for this.
googwords = [w for (w,c) in goog1w_rank]
googwords[:5]

['the', 'of', 'and', 'to', 'a']

In [42]:
phbtoks = [w.lower() for w in nonumtxt]
phbtypes = list(set(phbtoks))
len(phbtypes)

11269

In [43]:
phbtypes[:8]

['portrait',
 'discordant',
 'companions',
 'hits',
 'clone',
 'booms',
 'beginning',
 'function']

In [44]:
googset = set(googwords)  #set lookups are fast; checking against the list rescans it for every word
jargon = [w for w in phbtypes if w not in googset]
len(jargon)

836

In [45]:
jargon[200:215]

['glimmerings',
 'lunitari',
 'loviatar',
 'arrah',
 'truesilver',
 'habbakuk',
 'volen',
 'cutpurses',
 'mindfire',
 'seipora',
 'flumphs',
 'ensurate',
 'damarans',
 'seebo',
 'carceri']

### About this Process

I think it did really well! 836 word types is not so bad, I think. It appears to me to be a non-comprehensive but quite informative list of words. It honestly was more effective than I expected it to be, and I feel pretty good about referencing this as a list of DND-specific words to check my speech data for references. 

## The Jargon List

So what kinds of words made it into the list?

Looking through all the words we've isolated, I can sort them into 8 categories:
1. **vocab**: subrace, clanless, cantrip, multiattack
2. **names**: thamior, breena, stumbleduck, myrkul, agathys, keyleth, strahd
3. **locations**: shadowfell, feywild, candlekeep
4. **spells**: thaumaturgy, barkskin, truesight, enfeeblement, longstrider, cloudkill, revivify, countercharm, thunderwave
5. **groups**: archfey, dragonborn, tiefling(s), lightfoots, merfolk, lizardfolk
6. **creatures**: hippogriffs, wererats, tarrasque, owlbear
7. **weapons**: greatclub, handaxes, scimitars, greataxe, longswords, shortbow, shortsword
8. **languages**: dwarvish, undercommon, draconic

The majority of these are names, by far.

There are some words I'm surprised to have found in here, too! Words I think of as, sure, not exactly common, but still regular-ish English words: nonhostile, extradimensional, wineskin/waterskin, shapechanger, clumsier, crewmate, reckonings, longboats, nonplayer, semiconscious, feinting, falteringly, unadventurous, nonmagical, boastfulness, thunderously, glassblowers, gesticulation, otherwordly

Some honorable mentions:<br>
sibilants<br>
blibdoolpoolp<br>
mordenkainen<br>
maglubiyet

### Is that everything...?

No! In her recent response to my progress report, Lillian asked: what about words that are pretty common in the English language but have a particular or unique meaning in DND? Those certainly exist, and we'll take a little bit of a look. 

Some examples: perception, proficiency, halfling, beholder, polymorph, prestidigitation... to name a few

I don't plan to go out of my way to add them into the Jargon list just yet though, since this endeavor is really secondary to the main goals of the project. It's still interesting!

Finally, I noticed that a lot of the words I was surprised not to find in the Norvig list were ordinary words with some kind of prefix or suffix attached. Reckoning, longboat, sibilant, feint, boastful, thunderous, and glassblower all appeared in the Norvig data. Gesticulate did not. 

In [46]:
#used to check whether a word is in the norvig list: a hit returns the word, a miss raises KeyError
rankdict = dict()
for (index, (word, count)) in enumerate(goog1w_rank):
    rankdict[word] = word

In [69]:
rankdict['prestidigitation']

'prestidigitation'
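The affix pattern noted above could be checked automatically instead of one lookup at a time. Below is a rough sketch under stated assumptions: the suffix list is a small illustrative set I picked, not a real stemmer, and the "common" set contains only the base forms that I found in the Norvig data by hand.

```python
SUFFIXES = ("ness", "ing", "ly", "s")  # illustrative, far from complete

def base_forms(word):
    """Candidate base forms made by stripping one known suffix."""
    return [word[:-len(s)] for s in SUFFIXES
            if word.endswith(s) and len(word) > len(s)]

def split_by_affix(jargon_words, common_words):
    """Separate words whose stripped form is common from those still unexplained."""
    explained, unexplained = [], []
    for w in jargon_words:
        if any(b in common_words for b in base_forms(w)):
            explained.append(w)
        else:
            unexplained.append(w)
    return explained, unexplained

# Base forms confirmed above as present in the Norvig list.
common = {"reckoning", "longboat", "sibilant", "feint",
          "boastful", "thunderous", "glassblower"}
words = ["reckonings", "longboats", "sibilants", "feinting", "boastfulness",
         "thunderously", "glassblowers", "gesticulation", "blibdoolpoolp"]

explained, unexplained = split_by_affix(words, common)
print(explained)
print(unexplained)  # ['gesticulation', 'blibdoolpoolp']
```

Run against the full jargon list with the real Norvig set, this would separate ordinary inflected words from the D&D-specific ones, at the cost of some false matches from blind suffix stripping.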