# Frequency of Words

The Quintus Curtius Snodgrass Letters: A famous forensic application of statistics concerns the claim that Mark Twain was a Confederate deserter during the Civil War. The evidence offered was a set of ten letters published in the New Orleans Daily Crescent under the name Quintus Curtius Snodgrass. In 1963 Claude Brinegar published an article in the Journal of the American Statistical Association in which he used word-length frequencies and a chi-squared test to show that the letters were almost certainly not Twain’s.


Brinegar’s Abstract: 
“Mark Twain is widely credited with the authorship of 10 letters published in 1861 in the New Orleans Daily Crescent. The adventures described in these letters, which are signed ‘Quintus Curtius Snodgrass,’ provide the historical basis of a main part of Twain’s presumed role in the Civil War. This study applies an old, though little used statistical test of authorship - a word-length frequency test - to show that Twain almost certainly did not write these 10 letters. The statistical analysis includes a visual comparison of several word-length frequency distributions and applications of the χ² and two-sample t tests.”



The following table shows the relative frequencies of three-letter words in the ten Snodgrass letters and in eight samples of Twain’s known works. Rather than run them through complex calculations, let’s make box plots!


In [17]:
import pandas as pd
snodgrass = [.209,.205,.196,.210,.202,.207,.224,.223,.220,.201]
twain = [.225,.262,.217,.240,.230,.229,.235,.217]
df = pd.DataFrame([snodgrass,twain]).T
df.columns= ["QCS","MT"]
df

     QCS     MT
0  0.209  0.225
1  0.205  0.262
2  0.196  0.217
3  0.210  0.240
4  0.202  0.230
5  0.207  0.229
6  0.224  0.235
7  0.223  0.217
8  0.220    NaN
9  0.201    NaN


# Let's check whether the three-letter-word frequencies differ noticeably

In [16]:
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode(connected=True)


snodgrass = [.209,.205,.196,.210,.202,.207,.224,.223,.220,.201]
twain = [.225,.262,.217,.240,.230,.229,.235,.217]

data = [
    go.Box(
        y=snodgrass,
        name='QCS'
    ),
    go.Box(
        y=twain,
        name='MT'
    )
]
layout = go.Layout(
    # Adjacent string literals are joined, so no stray indentation ends up in the title
    title=('Comparison of three-letter-word frequencies<br>'
           'between Quintus Curtius Snodgrass and Mark Twain')
)
fig = go.Figure(data=data, layout=layout)
pyo.iplot(fig, filename='box3.html')
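
The box plot gives only a visual impression. Brinegar's abstract mentions a two-sample t test, so as a rough numerical check the sketch below computes Welch's t statistic for the two samples using the standard library alone. The unequal-variance (Welch) variant is an assumption made here and not necessarily the form Brinegar used, and the function name `welch_t` is hypothetical.

```python
import math
import statistics

snodgrass = [.209, .205, .196, .210, .202, .207, .224, .223, .220, .201]
twain = [.225, .262, .217, .240, .230, .229, .235, .217]

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t test."""
    # Squared standard error of each sample mean
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, dof = welch_t(snodgrass, twain)
print("t = %.2f, degrees of freedom = %.1f" % (t_stat, dof))
```

For these samples the statistic comes out at roughly -3.7 with about 12 degrees of freedom. The two-sided 5% critical value at that many degrees of freedom is about 2.2, so the gap between the means is unlikely to be chance alone. If SciPy is installed, `scipy.stats.ttest_ind(snodgrass, twain, equal_var=False)` would also report a p-value.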

# How to count the frequency of three-letter words

In [34]:
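import nltk
from nltk import FreqDist  # FreqDist must be imported before it is used

# word_tokenize needs NLTK's tokenizer data to be downloaded beforehand
sample_text = "plotti was here here wer was asd"
tokens = nltk.word_tokenize(sample_text)
fdist1 = FreqDist(tokens)  # maps each distinct token to its count

# Distinct tokens that are exactly three characters long
words_length_3 = [w for w in fdist1 if len(w) == 3]
print("Words of length 3: %s" % words_length_3)

# Add up the occurrences of every three-letter token
amount_of_three_letter_words = sum(fdist1[w] for w in words_length_3)
print("Total amount of three letter words %s " % amount_of_three_letter_words)
print("Total words %s" % len(tokens))  # count tokens, not characters

# Share of all tokens that have three letters (a proportion between 0 and 1)
proportion = amount_of_three_letter_words / len(tokens)
print("Proportion of three letter words %s " % proportion)


Words of length 3: ['was', 'wer', 'asd']
Total amount of three letter words 4 
Total words 7
Proportion of three letter words 0.5714285714285714 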


In [35]:
def count_words(text):
    """Return the proportion of tokens in `text` that are three characters long."""
    tokens = nltk.word_tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    # Punctuation tokens are counted as tokens too, as in the cell above
    amount_of_three_letter_words = sum(1 for t in tokens if len(t) == 3)
    return amount_of_three_letter_words / len(tokens)
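
Brinegar's test compares whole word-length frequency distributions, not only the three-letter share. As a minimal sketch of that idea, the hypothetical helper `word_length_distribution` below tallies every word length at once. It assumes a "word" is a run of ASCII letters found with a regular expression, which differs from the NLTK tokenizer used above: punctuation and digits are ignored, and a contraction such as "don't" is split at the apostrophe.

```python
import re
from collections import Counter

def word_length_distribution(text):
    """Map each word length to its relative frequency in `text`."""
    # Assumption: a word is a run of ASCII letters; everything else is a separator
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {}
    counts = Counter(len(w) for w in words)
    return {length: n / len(words) for length, n in sorted(counts.items())}

dist = word_length_distribution("plotti was here here wer was asd")
print(dist)
```

The entry for length 3 matches the proportion computed by hand above (4 of 7 words), and the remaining entries show how the other lengths are distributed.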

# Let's compare Moby Dick with Sense and Sensibility

In [36]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [54]:
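# Rebuild the text from its tokens, then cut the first 10,000 characters
# into 20 chunks of 500 characters each (words at chunk edges may be split)
moby_dick = " ".join(text1.tokens)
letters = [moby_dick[i*500:i*500+500] for i in range(20)]
moby_dick_three_letter_words = [count_words(letter) for letter in letters]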

In [55]:
# Same procedure for Sense and Sensibility: 20 chunks of 500 characters each
sense = " ".join(text2.tokens)
letters = [sense[i*500:i*500+500] for i in range(20)]
sense_three_letter_words = [count_words(letter) for letter in letters]
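
Slicing the joined text every 500 characters can cut a word in half at a chunk boundary, which slightly distorts the word lengths. One possible alternative, sketched here with the hypothetical helpers `chunk_tokens` and `three_letter_share`, is to chunk the token list itself so every chunk holds whole tokens. The short `demo_tokens` list is a stand-in for `text1.tokens` or `text2.tokens` so the sketch runs without the NLTK corpora.

```python
def chunk_tokens(tokens, size):
    """Split `tokens` into consecutive lists of `size` items, dropping a short tail."""
    full = len(tokens) - len(tokens) % size
    return [tokens[i:i + size] for i in range(0, full, size)]

def three_letter_share(tokens):
    """Proportion of tokens that are exactly three characters long."""
    return sum(1 for t in tokens if len(t) == 3) / len(tokens)

# Stand-in data (assumption): replace with text1.tokens or text2.tokens in the notebook
demo_tokens = "the cat sat on a mat and the dog ran off".split()
chunks = chunk_tokens(demo_tokens, 4)
shares = [three_letter_share(chunk) for chunk in chunks]
print(shares)
```

In the notebook, something like `chunk_tokens(text1.tokens[:2000], 100)` would give 20 samples of 100 tokens each. Those sizes are illustrative, not the ones used in the cells above.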

In [58]:
import plotly.offline as pyo
import plotly.graph_objs as go
pyo.init_notebook_mode(connected=True)

data = [
    go.Box(
        y=moby_dick_three_letter_words,
        name='Moby Dick'
    ),
    go.Box(
        y=sense_three_letter_words,
        name='Sense and Sensibility'
    )
]
layout = go.Layout(
    # Adjacent string literals are joined, so no stray indentation ends up in the title
    title=('Comparison of three-letter-word frequencies<br>'
           'between Moby Dick and Sense and Sensibility')
)
fig = go.Figure(data=data, layout=layout)
pyo.iplot(fig, filename='box3.html')