# Examining the amount of non-meaning-carrying words in a children's story
by Cole DeMeulemeester, Sam Tobin

#### Inspiration

We wondered whether the level of a piece of literature depends on its proportion of meaning-carrying words. Surely a high-brow piece of literature has more meaning per word, right?

We will be looking at "Mother Goose or the Old Nursery Rhymes" and "Siddhartha" to examine their contents and compare the difference in composition between adult and children's literature. 

#### Preparing the Data

First we will import all the needed libraries and install nltk.

In [1]:
import json
import string
import numpy as np
import pandas as pd
import requests
import re
import nltk

In [3]:
!pip install nltk



Now let us access our text that we want to work with.

Hmmm... "Mother Goose or the Old Nursery Rhymes" looks like a good choice for a children's book.

In [2]:
url1 = "http://www.gutenberg.org/cache/epub/23794/pg23794.txt"
res1 = requests.get(url1)
# print(res1.text[:1000])

Let's strip down the contents of the book to just the text of the story. First, let us cut out the beginning before the story.

In [3]:
i = res1.text.index("Hark! hark! the dogs bark,")

text = res1.text[i:]

print(text[:1000])

Hark! hark! the dogs bark,
The beggars are coming to town;
Some in rags and some in tags,
And some in a silken gown.
Some gave them white bread,
And some gave them brown,
And some gave them a good horse-whip,
And sent them out of the town.




[Illustration]

Little Jack Horner sat in the corner,
Eating a Christmas pie;
He put in his thumb, and pulled out a plum,
And said, oh! what a good boy am I.




[Illustration]

There was an old woman
Lived under a hill;
And if she's not gone,
She lives there still.




[Illustration]

Diddlty, diddlty, dumpty,
The cat ran up the plum tree,
Give her a plum, and down she'll come,
Diddlty, diddlty, dumpty.




[Illustration]

We're all jolly boys, and we're coming with a noise,
Our stockings shall be made
Of the finest silk,
And our tails shall trail the ground.




[Illustration]

To market, to market, to buy a plum cake,
Home again, home again, market is late;
To market, to market, to buy a 

Now let us cut off all the excess after the story.

In [4]:
i = text.index("ENGRAVED AND PRINTED BY")

text = text[:i]
print(text[-100:])

of posies;
Hush! hush! hush! hush!
We're all tumbled down.




[Illustration]
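
The two trimming steps above follow the same pattern: find a marker string, then slice around it. Here is a minimal sketch of that pattern as a single helper. The function name `trim_between` is made up for illustration (it is not part of the notebook), and the sample string is a tiny stand-in for the Gutenberg file.

```python
def trim_between(raw_text, start_marker, end_marker):
    """Return raw_text from start_marker up to (not including) end_marker.

    Hypothetical helper for illustration. str.index raises ValueError
    when a marker is missing, just like the cells above.
    """
    start = raw_text.index(start_marker)
    end = raw_text.index(end_marker, start)
    return raw_text[start:end]


# Made-up sample standing in for the downloaded file.
sample = (
    "HEADER\n"
    "Hark! hark! the dogs bark,\n"
    "We're all tumbled down.\n"
    "ENGRAVED AND PRINTED BY\n"
    "FOOTER"
)
trimmed = trim_between(sample, "Hark! hark!", "ENGRAVED AND PRINTED BY")
print(trimmed)
```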

            


Now we want to clear the "[Illustration]" markers from our text.

In [5]:
from nltk import RegexpTokenizer
# Raw string so the backslashes reach the regex engine unchanged.
re_tokenizer = RegexpTokenizer(r"\[.+?\]")
illustrations = re_tokenizer.tokenize(text)
print(illustrations)

['[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]', '[Illustration]']


In [6]:
for _ in illustrations:
    # Remove one "[Illustration]" marker (14 characters long) per pass.
    i = text.index("[Illustration]")
    text = text[:i] + text[(i + 14):]
print(text)

Hark! hark! the dogs bark,
The beggars are coming to town;
Some in rags and some in tags,
And some in a silken gown.
Some gave them white bread,
And some gave them brown,
And some gave them a good horse-whip,
And sent them out of the town.






Little Jack Horner sat in the corner,
Eating a Christmas pie;
He put in his thumb, and pulled out a plum,
And said, oh! what a good boy am I.






There was an old woman
Lived under a hill;
And if she's not gone,
She lives there still.






Diddlty, diddlty, dumpty,
The cat ran up the plum tree,
Give her a plum, and down she'll come,
Diddlty, diddlty, dumpty.






We're all jolly boys, and we're coming with a noise,
Our stockings shall be made
Of the finest silk,
And our tails shall trail the ground.






To market, to market, to buy a plum cake,
Home again, home again, market is late;
To market, to market, to buy a plum bun,
Home again, home again, market is done.






Elsie 
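
The same cleanup can also be done in a single pass with the standard library. This is a sketch of an alternative, not what the notebook ran. The function name `strip_bracketed` is made up for illustration, and note that it removes every bracketed marker on a line, not only "[Illustration]".

```python
import re


def strip_bracketed(raw_text):
    """Remove every non-greedy [ ... ] span, such as "[Illustration]"."""
    return re.sub(r"\[.+?\]", "", raw_text)


# Made-up sample in the shape of the Mother Goose text.
sample = "She lives there still.\n\n[Illustration]\n\nDiddlty, diddlty, dumpty,"
cleaned = strip_bracketed(sample)
print(cleaned)
```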

In case the nltk packages have not been downloaded yet:

In [24]:
# After running this cell, a dialog window should open:
# Under "Models", download "punkt",
# Under "Corpora", download "stopwords" and "wordnet"

nltk.download()

270


Now we are finally ready to split our story up into individual words.

In [7]:
# Split the text into word tokens, then drop single-character punctuation tokens.
gooseWord_tokens = nltk.word_tokenize(text)
for i, word in enumerate(gooseWord_tokens):
    if len(word) == 1 and word[0] in string.punctuation:
        gooseWord_tokens.pop(i)
        
gooseWord_tokens

['Hark',
 'hark',
 'the',
 'dogs',
 'bark',
 'The',
 'beggars',
 'are',
 'coming',
 'to',
 'town',
 'Some',
 'in',
 'rags',
 'and',
 'some',
 'in',
 'tags',
 'And',
 'some',
 'in',
 'a',
 'silken',
 'gown',
 'Some',
 'gave',
 'them',
 'white',
 'bread',
 'And',
 'some',
 'gave',
 'them',
 'brown',
 'And',
 'some',
 'gave',
 'them',
 'a',
 'good',
 'horse-whip',
 'And',
 'sent',
 'them',
 'out',
 'of',
 'the',
 'town',
 'Little',
 'Jack',
 'Horner',
 'sat',
 'in',
 'the',
 'corner',
 'Eating',
 'a',
 'Christmas',
 'pie',
 'He',
 'put',
 'in',
 'his',
 'thumb',
 'and',
 'pulled',
 'out',
 'a',
 'plum',
 'And',
 'said',
 'oh',
 'what',
 'a',
 'good',
 'boy',
 'am',
 'I',
 'There',
 'was',
 'an',
 'old',
 'woman',
 'Lived',
 'under',
 'a',
 'hill',
 'And',
 'if',
 'she',
 "'s",
 'not',
 'gone',
 'She',
 'lives',
 'there',
 'still',
 'Diddlty',
 'diddlty',
 'dumpty',
 'The',
 'cat',
 'ran',
 'up',
 'the',
 'plum',
 'tree',
 'Give',
 'her',
 'a',
 'plum',
 'and',
 'down',
 'she',
 "'ll",
 'c
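
One caveat about the loop above: removing items with `pop` while enumerating the same list shifts the remaining items left, so the token right after each removed one is never checked. Two punctuation tokens in a row therefore leave the second one behind. The sketch below uses made-up tokens to compare that loop with a list comprehension that builds a new list. The helper name `drop_punctuation` is an assumption for illustration, and the counts reported in this notebook come from the original loop.

```python
import string


def drop_punctuation(tokens):
    """Return a new list without single-character punctuation tokens."""
    return [t for t in tokens if not (len(t) == 1 and t in string.punctuation)]


tokens = ["oh", "!", "!", "what"]

# The pop-while-enumerating pattern, applied to a copy of the tokens.
popped = list(tokens)
for i, word in enumerate(popped):
    if len(word) == 1 and word[0] in string.punctuation:
        popped.pop(i)

filtered = drop_punctuation(tokens)
print(popped)    # ['oh', '!', 'what']
print(filtered)  # ['oh', 'what']
```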

What we want to explore in our visualizations is the proportion of stopwords relative to the rest of the words in this children's book.

In [8]:
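from nltk.corpus import stopwords
# Build the stopword set once; calling stopwords.words() for every token is very slow.
english_stopwords = set(stopwords.words('english'))
gooseNonStopwords = [word for word in gooseWord_tokens if word not in english_stopwords]
gooseStopwords = [word for word in gooseWord_tokens if word in english_stopwords]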
numGooseNonStopwords = len(gooseNonStopwords)
numGooseStopwords = len(gooseStopwords)
print(numGooseNonStopwords)
print(numGooseStopwords)


799
407
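
One thing to be aware of: the NLTK English stopword list is all lowercase, so capitalized tokens such as 'The' or 'And' are counted as non-stopwords by the comparison above. Below is a hedged sketch of a case-insensitive split. It uses a tiny hand-written stopword set so that it runs without the NLTK corpus, and the helper name `split_by_stopwords` is hypothetical. Applying this to the real text would change the counts printed above.

```python
def split_by_stopwords(tokens, stopword_set):
    """Split tokens into (non_stopwords, stopwords), ignoring case."""
    non_stop = [t for t in tokens if t.lower() not in stopword_set]
    stop = [t for t in tokens if t.lower() in stopword_set]
    return non_stop, stop


# Illustrative stand-in for set(stopwords.words('english')).
tiny_stopwords = {"the", "and", "a", "in", "some"}
tokens = ["The", "beggars", "are", "coming", "And", "some", "in", "a", "silken", "gown"]

non_stop, stop = split_by_stopwords(tokens, tiny_stopwords)
print(len(non_stop), len(stop))
```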


Now if we wanted to, we could make a CSV with the number of Stopwords and Non-stopwords like so.

In [55]:
# data = pd.DataFrame(columns = ['Non_Stop_words','Stop_Words'])
# data.loc[len(data.index)] = [numGooseNonStopwords, numGooseStopwords]

# data
# data.to_csv('MotherGoose.csv')

What will be really interesting to see is whether a much more complicated book, let us say "Siddhartha," has a higher non-stopword to stopword ratio. Let us use the processes that Dr. Z has so kindly provided for us to gather the non-stopwords and stopwords in that piece. (This cell takes a very long time to run because I just threw every piece of the text cleaning into it.)

In [9]:
url2 = "http://www.gutenberg.org/cache/epub/2500/pg2500.txt"
res2 = requests.get(url2)
i = res2.text.index("FIRST PART")
text = res2.text[i:] #cut out the initial part
i = text.index("End of the Project Gutenberg EBook")
text = text[:i] #cut out the end of the file
re_tokenizer = RegexpTokenizer("^(?:[A-Z][A-Z]+ ?)+")
sections = re_tokenizer.tokenize(text)
for chapter in sections:
    i = text.index(chapter)
    text = text[:i]+text[(i+len(chapter)):] #takes out chapter titles
siddhartha_word_tokens = nltk.word_tokenize(text)
for i, word in enumerate(siddhartha_word_tokens):
    if len(word) == 1 and word[0] in string.punctuation:
        siddhartha_word_tokens.pop(i)
        
from nltk.corpus import stopwords
# Build the stopword set once; calling stopwords.words() for every token is very slow.
english_stopwords = set(stopwords.words('english'))
siddharthaNonStopwords = [word for word in siddhartha_word_tokens if word not in english_stopwords]
siddharthaStopwords = [word for word in siddhartha_word_tokens if word in english_stopwords]
numSiddharthaNonStopwords=len(siddharthaNonStopwords)
numSiddharthaStopwords=len(siddharthaStopwords)
print(numSiddharthaNonStopwords)
print(numSiddharthaStopwords)

21123
19274


The following cell would create a csv with the total number of stopwords and nonstopwords for each text.

In [24]:
# d = {'SW': [numSiddharthaStopwords, numGooseStopwords], 'NSW': [numSiddharthaNonStopwords, numGooseNonStopwords]}
# data = pd.DataFrame(data=d)
# data = data.rename(index={0: 'Sid', 1: 'MG'}) # rename returns a new frame, so assign it back to rename the rows

# data.to_csv('StopWordsComparison.csv')

However, we already know that we are interested in the proportion of non-stopwords among all words in each text, so we will just create a csv with those proportions already calculated.

In [12]:
SidRatio = numSiddharthaNonStopwords/(numSiddharthaNonStopwords+numSiddharthaStopwords)
MGRatio = numGooseNonStopwords/(numGooseStopwords+numGooseNonStopwords)
d = {'Sid': [SidRatio], 'MG': [MGRatio]}
data = pd.DataFrame(data=d)
data.to_csv('StopWordsComparison.csv')
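
Plugging in the counts printed earlier gives a quick check of what ends up in the csv. This sketch only redoes the arithmetic with those four numbers; the lowercase variable names are made up for illustration.

```python
# Counts printed by the cells above.
goose_non_stop, goose_stop = 799, 407
sid_non_stop, sid_stop = 21123, 19274

# Share of non-stopwords among all counted tokens in each text.
mg_ratio = goose_non_stop / (goose_non_stop + goose_stop)
sid_ratio = sid_non_stop / (sid_non_stop + sid_stop)

print(f"Mother Goose: {mg_ratio:.3f}")  # Mother Goose: 0.663
print(f"Siddhartha:   {sid_ratio:.3f}")  # Siddhartha:   0.523
```

With these counts, roughly 66% of the Mother Goose tokens and 52% of the Siddhartha tokens are non-stopwords.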

We also need to download some images to be used in our visualization. A link for the Siddhartha book cover image can be found at the following: https://images-na.ssl-images-amazon.com/images/I/518G7jJJhcL._SX322_BO1,204,203,200_.jpg
A link for the Mother Goose book cover image can be found at the following: https://images-na.ssl-images-amazon.com/images/I/51UwQpNGtFL._SX258_BO1,204,203,200_.jpg

Now with our texts processed and images downloaded, we are ready to visualize our data.

#### How it Works

Our program is very simple as we have done nearly all of the heavy lifting of processing the data in the Jupyter Notebook.

First, we load the data [line 44] and create an svg on which we will draw our graphic [line 49].

Using the images of the book covers that we found, we append the images to the graph in place of data bars [lines 68, 76]. We retrieve the data in the csv for the y-coordinate and height attributes [lines 71, 79].

Following chart design principles inspired by Tufte, we add invisible axis lines [lines 98-116].

#### Running the Program

To run the program, work through this notebook, make sure that the csv file, the book cover images, and TextProj.html are in the same folder, and start a local host server. After that, just open TextProj.html through that server.
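
A local server can be started from Python's standard library, for example. The sketch below is one possible way, not necessarily how the project was served; the helper name `make_server` and port 8000 are assumptions.

```python
import functools
import http.server


def make_server(directory=".", port=8000):
    """Create (but do not start) an HTTP server for the files in directory."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# To serve the project folder until interrupted, uncomment the next line
# and then visit http://127.0.0.1:8000/TextProj.html in a browser.
# make_server().serve_forever()
```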

# Enjoy!