# Build your own concordance

In the lecture, we discussed how it took 500 Dominican monks to write
the first concordance of the Latin Bible, and how it took Rabbi Mordecai
Nathan 10 years to write the first concordance of the Hebrew Bible. With
Python, it takes only a matter of seconds to find words in a text, along
with the surrounding words.

In this exercise, first run the code and check that everything works.
Then, try modifying the code. Work through at least as far as the part
where it says **Stop now if you want!** Do as much or as little of the
other exercises as you can or want to.

Run the cells in this notebook one at a time, in order. If something in
one cell doesn't work right, it might be because you have overwritten a
variable, so try going back and running all the previous cells again.

The point of the exercise isn't for you to understand all the commands
fully; the idea is to start getting you familiar with how Jupyter
notebooks work and with the concept of copying and modifying code, and
hopefully to give you a little taste of the power of Python.

In [72]:
# install the natural language toolkit package (nltk), which has a copy of several texts,
# including the King James Bible

%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [73]:
# import the nltk package so that it is accessible to Python, and download a collection of texts from Project Gutenberg
import nltk
nltk.download('gutenberg')

# Create a variable called "bible" which contains the text of the King James bible.
bible = nltk.corpus.gutenberg.raw('bible-kjv.txt')

# make all characters lowercase
bible = bible.lower()

# remove the "\n" characters, which indicate line breaks in the text (newlines)
bible = bible.replace('\n', ' ')

# split up the text into a long list of individual words
bible = bible.split(' ')



[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/Nikita/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [74]:

# make a variable called "concordance", and fill it with every occurrence of the phrase "this world",
# along with a few words preceding and following "this world"
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        if bible[i-1] == "this":
            concordance.append(str(' '.join(bible[i-5:i+5])))


# take a look at what the algorithm has found
concordance

['for the children of this world are in their generation',
 'them, the children of this world marry, and are given',
 'hateth his life in this world shall keep it unto',
 'shall the prince of this world be cast out. ',
 'should depart out of this world unto the father, having',
 'for the prince of this world cometh, and hath nothing',
 'because the prince of this world is judged.  16:12',
 'of the princes of this world knew: for had they',
 'for the wisdom of this world is foolishness with god.',
 'for the fashion of this world passeth away.  7:32',
 'whom the god of this world hath blinded the minds',
 'chosen the poor of this world rich in faith, and',
 'saying, the kingdoms of this world are become the kingdoms']

Let's try again, but this time let's just search for "world" by itself,
not "this world".

In [75]:

concordance = []
for i, val in enumerate(bible):
    if val == "world":
        concordance.append(str(' '.join(bible[i-5:i+5])))



# take a look at what the algorithm has found
concordance

['and he hath set the world upon them.  2:9',
 'appeared, the foundations of the world were discovered, at the',
 'him, all the earth: the world also shall be stable,',
 'upon the face of the world in the earth. ',
 'and he shall judge the world in righteousness, he shall',
 'and the foundations of the world were discovered at thy',
 'all the ends of the world shall remember and turn',
 'all the inhabitants of the world stand in awe of',
 'not tell thee: for the world is mine, and the',
 'is thine: as for the world and the fulness thereof,',
 'he hath girded himself: the world also is stablished, that',
 'that the lord reigneth: the world also shall be established',
 'earth: he shall judge the world with righteousness, and the',
 'also he hath set the world in their heart, so',
 'and i will punish the world for their evil, and',
 'kingdoms; 14:17 that made the world as a wilderness, and',
 'fill the face of the world with cities.  14:22',
 'all the kingdoms of the world upon the face o

Now, in the cell below, modify the code to search for a different word.

In [76]:
concordance = []
for i, val in enumerate(bible):
    if val == "nature":
        concordance.append(str(' '.join(bible[i-5:i+5])))

# take a look at what the algorithm has found
concordance

['not the law, do by nature the things contained in',
 'and wert graffed contrary to nature into a good olive',
 ' 11:14 doth not even nature itself teach you, that,',
 'service unto them which by nature are no gods. ',
 'the mind; and were by nature the children of wrath,',
 'took not on him the nature of angels; but he']
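The loop in the cells above is copied and edited each time the search word changes. One way to avoid that is to wrap it in a function. This is only a sketch: the name `build_concordance` and the `window` parameter are made up for illustration, and the tiny `sample` list stands in for the real `bible` list so the cell runs on its own. It also guards against one edge case: if the word occurs within the first five positions, `i-5` is negative, Python counts from the end of the list, and the slice usually comes back empty.

```python
def build_concordance(words, target, window=5):
    """Return one context string for each occurrence of `target` in `words`."""
    lines = []
    for i, word in enumerate(words):
        if word == target:
            # max() keeps the start index from going negative near the start of the list
            start = max(i - window, 0)
            lines.append(' '.join(words[start:i + window]))
    return lines

# a tiny made-up word list, standing in for the `bible` list
sample = "in the beginning god created the heaven and the earth".split(' ')

print(build_concordance(sample, "god", window=2))
print(build_concordance(sample, "in", window=2))
```

With the real data, the call would look like `build_concordance(bible, "nature")`.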

The nltk package has the full text of several other classic books. You
can see what they are called by running the command in the cell below:

In [77]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

## Stop now if you want!

If you have come this far, and got everything to work, you can stop if
you want! The goal of this assignment is to successfully run and modify
some Python code to achieve a goal. But if you want to challenge
yourself, here are a few more things you can try. In each case, I have
given you a little bit of starter code to get you going.

## Challenge 1: build your own concordance


Pick a different book and a different word, or pair of words. Copy and
paste from the code above to write some Python code that searches the
book of your choice for the word or pair of words of your choice. Put
this code in the cell below. By the way, some of the texts use the
characters "\r" for "carriage return" instead of "\n" for "newline". You
can remove these the same way that you remove the "\n" characters.
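As a small illustration of the "\r" point, here is a sketch on a made-up string (`raw_text` is not taken from any of the books). It compares the replace-then-split approach used above with calling `split()` with no argument, which splits on any run of whitespace, including "\r" and "\n", and therefore leaves no empty strings in the list.

```python
# a made-up snippet containing "\r\n", "\n" and a double space
raw_text = "Alice was beginning\r\nto get very  tired\nof sitting"

# approach used in this notebook: replace line-break characters, then split on single spaces
replace_version = raw_text.lower().replace('\r', ' ').replace('\n', ' ').split(' ')

# alternative: split() with no argument treats any run of whitespace as one separator
words = raw_text.lower().split()

print(replace_version)  # contains empty strings wherever two spaces ended up side by side
print(words)
```

Either approach works for the concordance; the second just saves you from filtering out the empty strings later.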





In [9]:
# add your code to search for a word or pair of words in a different book here
# the next two lines are only needed if this challenge is run on its own
import nltk
nltk.download('gutenberg')
# clean the text
# create a variable called "alice" containing the text of Alice in Wonderland
alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')

# make all characters lowercase
alice = alice.lower()

# remove the "\n" characters, which indicate line breaks in the text (newlines)
alice = alice.replace('\n', ' ')

# split up the text into a long list of individual words
alice = alice.split(' ')


[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/Nikita/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [13]:
concordance = []
for i, val in enumerate(alice):
    if val == "tea":
        concordance.append(str(' '.join(alice[i-5:i+5])))
concordance

['and the hatter were having tea at it: a dormouse',
 'he poured a little hot tea upon its nose. ',
 "you know--' (pointing with his tea spoon at the march",
 'she helped herself to some tea and bread-and-butter, and then',
 "i hadn't quite finished my tea when i was sent"]

In [11]:
# search for the words "white rabbit", filling a variable "concordance" in the same way as for the Bible text
concordance = []
for i, val in enumerate(alice):
    if val == "rabbit":
        if alice[i-1] == "white":
            concordance.append(str(' '.join(alice[i-5:i+5])))
concordance

['daisies, when suddenly a white rabbit with pink eyes ran',
 'long passage, and the white rabbit was still in sight,',
 'coming. it was the white rabbit returning, splendidly dressed, with',
 "stopped hastily, for the white rabbit cried out, 'silence in",
 ' on this the white rabbit blew three blasts on',
 'the king; and the white rabbit blew three blasts on',
 ' alice watched the white rabbit as he fumbled over',
 'her surprise, when the white rabbit read out, at the',
 "their slates, when the white rabbit interrupted: 'unimportant, your majesty",
 'the king.  the white rabbit put on his spectacles.',
 'were the verses the white rabbit read:--   ',
 'her feet as the white rabbit hurried by--the frightened mouse']

## Challenge 2: compare lengths of books

We can use the command `len` to find how many items there are in a list.
E.g., to find the number of words in the list called `bible`, above, we
can write: `len(bible)`.

Use the starter code below to find out which of the books included in
`nltk` has the most words.

In [80]:
# more advanced, for those with some Python experience, or those who want to google around
# solution 2: make a list of titles and a list of word counts, zip them together, then sort them by word count

import nltk
# I found out you can simply download the gutenberg corpus like this by looking at the nltk.org
# https://www.nltk.org/howto/corpus.html
nltk.download('gutenberg')

# Get a list of the book names in the Gutenberg Corpus
books = nltk.corpus.gutenberg.fileids()
titles = []
numwords = []

for title in books:
    # Get the title of the book by deleting the extension and replacing the dash with a space
    book = title.replace('.txt', '').replace('-', ' ').title()
    titles.append(book)

    # Get the words in the book and count them
    # (words() returns tokens, which include punctuation marks as well as words)
    words = nltk.corpus.gutenberg.words(title)
    # count the words with len
    word_count = len(words)

    numwords.append(word_count)

print(titles)
print(numwords)    

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/Nikita/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['Austen Emma', 'Austen Persuasion', 'Austen Sense', 'Bible Kjv', 'Blake Poems', 'Bryant Stories', 'Burgess Busterbrown', 'Carroll Alice', 'Chesterton Ball', 'Chesterton Brown', 'Chesterton Thursday', 'Edgeworth Parents', 'Melville Moby_Dick', 'Milton Paradise', 'Shakespeare Caesar', 'Shakespeare Hamlet', 'Shakespeare Macbeth', 'Whitman Leaves']
[192427, 98171, 141576, 1010654, 8354, 55563, 18963, 34110, 96996, 86063, 69213, 210663, 260819, 96825, 25833, 37360, 23140, 154883]


In [81]:
# zip the lists together
combined_list = list(zip(titles, numwords))

# sort the combined list by word count
# the sorted() function sorts in ascending order; `key=lambda wc: wc[1]` tells it what to sort by
# `key=lambda wc: wc[1]` takes the value wc (wordcount, not toilet!) at index [1] (the second element of each pair)
sorted_list = sorted(combined_list, key=lambda wc: wc[1])

print(sorted_list)

[('Blake Poems', 8354), ('Burgess Busterbrown', 18963), ('Shakespeare Macbeth', 23140), ('Shakespeare Caesar', 25833), ('Carroll Alice', 34110), ('Shakespeare Hamlet', 37360), ('Bryant Stories', 55563), ('Chesterton Thursday', 69213), ('Chesterton Brown', 86063), ('Milton Paradise', 96825), ('Chesterton Ball', 96996), ('Austen Persuasion', 98171), ('Austen Sense', 141576), ('Whitman Leaves', 154883), ('Austen Emma', 192427), ('Edgeworth Parents', 210663), ('Melville Moby_Dick', 260819), ('Bible Kjv', 1010654)]
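If you only want the longest (or shortest) book, you don't need to sort the whole list. The sketch below uses Python's built-in `max` and `min` with the same `key` idea, plus `reverse=True` to sort from largest to smallest. The short `word_counts` list is typed in by hand from the output above so that the cell runs on its own; with the real data you would pass `combined_list` instead.

```python
# a few (title, wordcount) pairs copied from the output above
word_counts = [
    ('Austen Emma', 192427),
    ('Melville Moby_Dick', 260819),
    ('Bible Kjv', 1010654),
    ('Blake Poems', 8354),
]

# max() and min() accept the same `key` argument as sorted()
longest = max(word_counts, key=lambda wc: wc[1])
shortest = min(word_counts, key=lambda wc: wc[1])

# reverse=True sorts from the largest count to the smallest
descending = sorted(word_counts, key=lambda wc: wc[1], reverse=True)

print("Longest:", longest)
print("Shortest:", shortest)
print(descending)
```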


## Challenge 3: what are the most frequent words?

`nltk` has a built-in class called `FreqDist` which counts up how
many times each word in a text occurs. So, if you have a list called
`words` which contains all the words in a book, you can find the
frequencies of all of them by writing `freq = nltk.FreqDist(words)`. You
can then get, for example, the ten most common words by writing
`freq.most_common(10)`. What are the ten most common words in Jane
Austen's "Emma"? What about Herman Melville's "Moby Dick"?
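To get a feel for what `most_common` returns before running it on a whole novel, here is a sketch on a made-up sentence. It uses `Counter` from Python's standard library as a stand-in, so it runs without `nltk`; as far as I know, `FreqDist` is built on top of `Counter`, which is why the `most_common` call looks the same.

```python
from collections import Counter

# a made-up sentence standing in for the list of words in a book
words = "the cat sat on the mat and the dog sat too".split()

# count how many times each word occurs
freq = Counter(words)

# a list of (word, count) pairs, most frequent first
print(freq.most_common(2))

# look up the count for a single word
print(freq["cat"])
```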

In [94]:
from nltk import FreqDist
emma_book = nltk.corpus.gutenberg.raw('austen-emma.txt')
emma = emma_book.lower()
# remove the "\n" characters, which indicate line breaks in the text (newlines)
emma = emma.replace('\n', ' ')
# split up the text into a long list of individual words
emma = emma.split(' ')

# get the frequency for the Emma novel
emma_freq = FreqDist(emma)
# Get the most common words 
emma_top10 = emma_freq.most_common(10)

# Doing the same for Moby Dick
moby_book = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
moby = moby_book.lower()
moby = moby.replace('\n', ' ')
moby = moby.split(' ')
moby_freq = FreqDist(moby)
moby_top10 = moby_freq.most_common(10)

print(emma_top10)
print(moby_top10)


[('', 6290), ('the', 5120), ('to', 5079), ('and', 4445), ('of', 4196), ('a', 3055), ('i', 2602), ('was', 2302), ('she', 2169), ('in', 2091)]
[('the', 12757), ('of', 6014), ('', 5720), ('and', 5629), ('to', 4192), ('a', 4095), ('in', 3756), ('\r', 3273), ('that', 2537), ('his', 2214)]



## Challenge 4: Remove stopwords

Often, the most frequent words are not the most interesting ones. Words
like "a" and "the" are so common in English that they don't really tell
us much about the text. That is why we often remove "stopwords", that
is, a list of the most common words in English, before e.g. counting
frequencies. Below is some starter code to remove stopwords. Use these
snippets to see what the most common words in Emma and Moby Dick are
once the stopwords have been removed.

Hint: In Moby Dick, you will also have to remove the string `\r`, in
addition to removing `\n`.
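Here is the filtering idea on a made-up sentence, as a sketch rather than the required solution. Two things in it are my own additions and not part of the starter code: the stopwords are stored in a set (checking `word not in stopwords` is faster on a set than on a long list), and `strip(string.punctuation)` trims punctuation from the ends of each word, so that e.g. "whale," and "whale!" are counted together. The stopword set here is a short illustrative subset.

```python
import string

# a short illustrative subset of stopwords, stored as a set
stopwords = {"", "the", "a", "and", "of", "to"}

# a made-up sentence standing in for the text of a book
text = "the whale, and the sea. a ship of the whale!"
words = text.lower().split()

# trim punctuation from the start and end of each word
cleaned = [word.strip(string.punctuation) for word in words]

# keep only the words that are not stopwords
kept = [word for word in cleaned if word not in stopwords]

print(kept)
```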


In [101]:
# list of stopwords
stopwords = ["", "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]
# emma cleanup
from nltk import FreqDist
emma_book = nltk.corpus.gutenberg.raw('austen-emma.txt')

emma = emma_book.lower()
# remove the "\n" characters, which indicate line breaks in the text (newlines)
emma = emma.replace('\n', ' ')
# split up the text into a long list of individual words
emma = emma.split(' ')

# code to remove stopwords.
emma = [word for word in emma if word not in stopwords]
# get the frequency for the Emma novel
emma_freq = FreqDist(emma)
# Get the most common words 
emma_top10 = emma_freq.most_common(10)

# Doing the same for Moby Dick
moby_book = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
moby = moby_book.lower()
moby = moby.replace('\n', ' ')
moby = moby.replace('\r', ' ')
moby = moby.split(' ')
# code to remove stopwords.
moby = [word for word in moby if word not in stopwords]
moby_freq = FreqDist(moby)
moby_top10 = moby_freq.most_common(10)

print("Emma top 10 most frequent words", emma_top10)
print("Moby Dick top 10 most frequent words", moby_top10)



Emma top 10 most frequent words [('mr.', 1097), ('could', 800), ('would', 795), ('mrs.', 675), ('miss', 568), ('must', 543), ('emma', 481), ('much', 427), ('every', 425), ('said', 392)]
Moby Dick top 10 most frequent words [('one', 779), ('like', 564), ('upon', 556), ('whale', 528), ('old', 425), ('would', 416), ('though', 311), ('great', 292), ('still', 282), ('seemed', 273)]
