# The Most Dangerous Game

The first line of code is an IPython magic command, specific to Jupyter notebooks, that places any graphical output in the page itself rather than in a separate window or file. (We can still save output to a file if we want.)

In [12]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


With that done, it's time to get our text and start examining it. The first thing we need to do is load the text file:

In [13]:
# Read the whole file into one string; the with-block closes the file for us
with open('../texts/mdg.txt', 'r') as f:
    mdg = f.read()
# print(mdg)
# len(mdg)
# type(mdg)

If you were to look at the file by uncommenting `print(mdg)` above, you would see all of the text in a completely readable form. If you want to see how long it is, you can uncomment `len(mdg)`: Python tells you it is 44078 characters long. But characters aren't a very useful way to measure texts, are they? (It's not clear how useful words are as a unit of measure either, but they are the unit we are used to using.)

One of the problems is that, as `type(mdg)` reveals, our text is currently a string. Like almost all computer languages, Python doesn't natively understand human languages: to Python, a text is nothing more than a sequence of characters, that is, letters, numbers, punctuation marks, and spaces.
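
To make the "sequence of characters" idea concrete, here is a minimal sketch on a short sample phrase. The variable `sample` and its capitalization are assumptions made for illustration; it is not read from the file.

```python
# Illustrative sample phrase (an assumption, not loaded from the text file).
sample = "Off there to the right"

print(type(sample))       # the built-in string type
print(len(sample))        # counts every character, spaces included
print(sample[0])          # indexing returns a single character
print(list(sample[:9]))   # the first nine characters, one by one
```

Nothing in the string marks where one word ends and the next begins; the spaces are characters like any other.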

In order to count the words, we have to tell Python how to break the string into words.

In [14]:
# =-=-=-=-=-=-=-=-=-=-=
# Convert the string into a list of words (still human readable!)
# =-=-=-=-=-=-=-=-=-=-= 

import re

mdg_words = re.sub("[^a-zA-Z'-]"," ", mdg).lower().split()
print('Words in text: {}.'.format(len(mdg_words)))

Words in text: 8109.


In [15]:
# Proof of readability:

print(mdg_words)

['off', 'there', 'to', 'the', 'right', 'somewhere', 'is', 'a', 'large', 'island', 'said', 'whitney', 'it', 's', 'rather', 'a', 'mystery', 'what', 'island', 'is', 'it', 'rainsford', 'asked', 'the', 'old', 'charts', 'call', 'it', "'ship-trap", 'island', "'", 'whitney', 'replied', 'a', 'suggestive', 'name', 'isn', 't', 'it', 'sailors', 'have', 'a', 'curious', 'dread', 'of', 'the', 'place', 'i', 'don', 't', 'know', 'why', 'some', 'superstition', 'can', 't', 'see', 'it', 'remarked', 'rainsford', 'trying', 'to', 'peer', 'through', 'the', 'dank', 'tropical', 'night', 'that', 'was', 'palpable', 'as', 'it', 'pressed', 'its', 'thick', 'warm', 'blackness', 'in', 'upon', 'the', 'yacht', 'you', 've', 'good', 'eyes', 'said', 'whitney', 'with', 'a', 'laugh', 'and', 'i', 've', 'seen', 'you', 'pick', 'off', 'a', 'moose', 'moving', 'in', 'the', 'brown', 'fall', 'bush', 'at', 'four', 'hundred', 'yards', 'but', 'even', 'you', 'can', 't', 'see', 'four', 'miles', 'or', 'so', 'through', 'a', 'moonless', 'car
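
The list above contains fragments such as `'it', 's'` and `'isn', 't'`, even though the pattern keeps apostrophes. One plausible explanation, and it is only an assumption about the source file, is that the contractions use a typographic apostrophe (U+2019) rather than the ASCII one. A small sketch on a made-up sample sentence shows the difference:

```python
import re

# Hypothetical sample: the first contraction uses a typographic apostrophe
# (U+2019), the second uses the plain ASCII one.
sample = "It\u2019s rather a mystery, isn't it?"

# Same pattern as above: anything that is not a letter, an ASCII
# apostrophe, or a hyphen becomes a space.
words = re.sub("[^a-zA-Z'-]", " ", sample).lower().split()
print(words)
```

If that is indeed the cause, replacing U+2019 with `'` before applying the pattern would keep such contractions whole.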

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Create dictionary of word:frequency pairs
# =-=-=-=-=-=-=-=-=-=-= 

# empty dictionary
freq_dic = {}

# Remove punctuation marks:
punctuation = re.compile(r'[.?!,":;]') 

# Build the dictionary:
for word in mdg_words:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary: a KeyError means we have not seen this word yet
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1
    
print('Unique words: {}.'.format(len(freq_dic)))
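
The counting loop above can also be written with `collections.Counter` from the standard library. This is a minimal sketch of that alternative, using a made-up word list (`words`) in place of the real one:

```python
from collections import Counter

# Hypothetical mini word list standing in for the story's words.
words = ["the", "island", "is", "the", "island", "the"]

freq = Counter(words)   # maps each distinct word to its count
print(freq["the"])      # how often "the" occurs
print(len(freq))        # number of unique words
```

A `Counter` behaves like a dictionary, so the later steps that loop over `.items()` would work on it unchanged.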

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's see those words
# =-=-=-=-=-=-=-=-=-=-= 

# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]

# sort by value or frequency
freq_list2.sort(reverse=True)

# display result
print("Sorted by highest frequency first:")
for freq, word in freq_list2:
    print(word + "," + str(freq))
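
Because the tuples are `(val, key)` and the sort is reversed, words with equal counts come out in reverse alphabetical order. The sketch below, on a made-up dictionary, compares that approach with one way of keeping ties alphabetical:

```python
# Hypothetical frequencies standing in for the real dictionary.
freq = {"island": 2, "yacht": 2, "the": 5, "moose": 1}

# Approach used above: reverse-sorted (count, word) tuples.
by_tuple = sorted(((val, key) for key, val in freq.items()), reverse=True)

# Alternative: sort on (-count, word) so ties stay alphabetical.
by_key = sorted(freq.items(), key=lambda item: (-item[1], item[0]))

print(by_tuple)
print(by_key)
```

Either ordering is fine for counting; the difference only matters when you read down a list of equally frequent words.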

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Save these results to a CSV file (makes it easier for the Excel-impaired)
# =-=-=-=-=-=-=-=-=-=-= 

with open("mdg_word_freq.csv", "w") as fileOut:
    for x,y in freq_list2:
        print("{},{}".format(y,x), file=fileOut)
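
Writing the rows with `print` works here because the words contain no commas or quotes. As a hedged alternative, the standard library's `csv` module handles quoting automatically. The sketch uses made-up pairs and an in-memory buffer instead of a real file:

```python
import csv
import io

# Hypothetical (frequency, word) pairs standing in for the real list.
pairs = [(5, "the"), (2, "island"), (1, "moose")]

# io.StringIO stands in for a real file so the sketch leaves nothing on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for freq, word in pairs:
    writer.writerow([word, freq])   # word first, then its count

print(buffer.getvalue())
```

With a real file, the `csv` documentation recommends opening it with `newline=''`.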

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's take a look at this from within Python
# =-=-=-=-=-=-=-=-=-=-= 

import numpy as np, pandas as pd, matplotlib as mpl

# Pass the separator by keyword; recent pandas versions reject it as a positional argument
df = pd.read_csv("mdg_word_freq.csv", sep=",", names=["Word", "Frequency"])
print(df)

In [None]:
# DataFrame.sort() was removed from pandas; sort_values() is its replacement
alphabetical = df.sort_values(by='Word', ascending=True)
print(alphabetical[0:20])

In [None]:
mpl.style.use('ggplot')
ax = df[['Word','Frequency']].plot(kind='bar', 
                                   title ="Frequency of Words in MDG",
                                   figsize=(20,10),
                                   legend=True)
ax.set_xlabel("Word")
ax.set_ylabel("Occurrences")
ax.set_xticklabels(list(df['Word'])) 
mpl.pyplot.show()

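In [None]:
import nltk

# concordance() belongs to nltk.Text, not to a plain string; it prints
# its results and returns None, so there is nothing to assign.
mdg_text = nltk.Text(mdg_words)
mdg_text.concordance("dangerous")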
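
For intuition about what a concordance shows, here is a minimal pure-Python sketch of a keyword-in-context listing. It is not NLTK's implementation; the function name `kwic`, its `width` parameter, and the sample tokens are all made up for illustration.

```python
def kwic(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append("{} [{}] {}".format(left, token, right))
    return lines

# Hypothetical tokens, not taken from the story.
tokens = "the most dangerous game is the most dangerous of all".split()
result = kwic(tokens, "dangerous")
for line in result:
    print(line)
```

Each output line centers one occurrence of the word, which is what makes a concordance useful for comparing the contexts a word appears in.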

In [None]:
# Query words below are taken from the word list printed earlier
mdg_text.similar("island")
mdg_text.common_contexts(["rainsford", "whitney"])
mdg_text.collocations()

In [None]:
import nltk
mdg_tokens = nltk.word_tokenize(mdg)
len(mdg_tokens)

In [None]:
import nltk, re

mdg_raw = open("./mdg.txt").read()
mdg_clean = re.sub("[^a-zA-Z'-]", " ", mdg_raw)
mdg_case = mdg_clean.lower()

# print(mdg_case)

In [None]:
# Split the lowercased text so that counting ignores case
mdg_word_list = mdg_case.split()
print(mdg_word_list)

In [None]:
sorted(set(mdg_word_list))

In [None]:
len(set(mdg_word_list))

In [None]:
# Lexical Diversity of MDG:
len(mdg_word_list) / len(set(mdg_word_list))

In [None]:
len(mdg_tokens) / len(set(mdg_tokens))

On average, each distinct word occurs about four times in "The Most Dangerous Game."
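
The ratio used above divides the total number of tokens by the number of distinct tokens. A minimal sketch on a made-up sentence (the tokens are an assumption, not taken from the story) shows both that ratio and its inverse, which some treatments use as the measure of lexical diversity:

```python
# Hypothetical tokens, not taken from the story.
tokens = "the hunter and the hunted and the island".split()

total = len(tokens)           # every token counts
distinct = len(set(tokens))   # each different word counts once

print(total / distinct)   # average occurrences per distinct word
print(distinct / total)   # the inverse: share of tokens that are distinct
```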

Out of curiosity, how many words occur four times?

In [None]:
wordfrequency = nltk.FreqDist(mdg_tokens)
four_times = [word for word in wordfrequency.keys() if wordfrequency[word] == 4]
print(four_times)

In [None]:
mdg_text.count("dangerous")

In [None]:
mdg_text.concordance("dangerous")

Where does "dangerous" occur within the larger text?

In [None]:
mdg_text.dispersion_plot(["dangerous", "danger", "game", "fear"])
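
A dispersion plot marks, for each word, the positions in the token sequence where that word occurs. The sketch below computes those positions in plain Python on made-up tokens; it illustrates the idea only and draws nothing.

```python
# Hypothetical tokens, not taken from the story.
tokens = "the game was dangerous and the danger made the game a dangerous game".split()
targets = ["dangerous", "game"]

# For each target word, collect the token positions where it occurs.
offsets = {t: [i for i, tok in enumerate(tokens) if tok == t] for t in targets}
for word, positions in offsets.items():
    print(word, positions)
```

Clusters of nearby positions correspond to the dense bands in the plot; long gaps show stretches where the word is absent.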

In [None]:
wordfrequency.plot()