# Build your own concordance

It took 500 Dominican munks to write the first concordance of the Latin bible, and it took Rabbi Mordecai Nathan 10 years to write the first concordance of the Hebrew bible. With Python, it only takes a matter of seconds to find words in a text, along with the surrounding words.

Run each cell in this notebook one at a time, in order. If something in one cell doesn't work right, it might be because you have overwritten a variable, so try going back and running all the previous cells again.

First run the code and check that everything works. Then, try modifying the code. Start with the first challenges, and then continue in order. Feel free to work together, and see how far you can get. The important thing is to learn, not to solve all the challenges!

In [14]:
# install the natural language toolkit package (nltk), which has a copy of several texts, 
#including the King James Bible

%pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading regex-2024.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (797 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m797.0/797.0 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Installing

In [15]:
# import the nltk package so that it is accessible to Python, and download a collection of texts from Project Gutenberg
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /home/ucloud/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [16]:
# Create a variable called "bible" which contains the text of the King James bible.
bible = nltk.corpus.gutenberg.raw('bible-kjv.txt')

# make all characters lowercase
bible = bible.lower()

# remove the "\n" characters, which indicate line breaks in the text (newlines)
bible = bible.replace('\n', ' ')

# split up the text into a long list of individual words
bible = bible.split(' ')

In [17]:
# make a variable called "concordance", and fill it with every occurrence of the phrase "this world", and a few words preceeding and following "this world"
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        if bible[i-1] == "this":
            concordance.append(str(' '.join(bible[i-5:i+5])))

In [18]:
# take a look at what the algorithm has found
# concordance

In [19]:
# let's see how many instances of the phrase "this world" were found
# len(concordance)


Let's try again, but this time let's just search for "world" by itself, not "this world".

In [20]:
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        concordance.append(str(' '.join(bible[i-5:i+5])))

In [21]:
# take a look at what the algorithm has found
# concordance

In [22]:
# let's see how many instances of just the word "world" were found
# len(concordance)

Now, in the cell below, modify the code to search for a different word.

In [23]:
# add your modified code here and run the cell...
concordance = []
for i, val in enumerate(bible):
    if val == "god":
        concordance.append(str(' '.join(bible[i-5:i+5])))

In [24]:
# concordance

In [25]:
# len(concordance)

The nltk package has the full text of several other classic books. You can see what they are called by running the command in the cell below:

In [26]:
# nltk.corpus.gutenberg.fileids()

## Your turn!

Here are a some more things you can try. In each case, I have given you a little bit of starter code to get you going, but the cells will not run without some additional code from you.


In [27]:
# Setup for challenges
%pip install nltk
import nltk
nltk.download('gutenberg')

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package gutenberg to /home/ucloud/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True



## Challenge 1: build your own concordance

Pick a different book and a different word, or pair of words. Copy and paste from the code above to write some Python code that searches the book of your choice for the word or pair of words of your choice. Put this code in the cell below. By the way, some of the texts use the characters "\r" for "carriage return" instead of "\n" for "newline". You can remove these the same way that you remove the "\n" characters.

In [28]:
# add your code to search for a word or pair of words in a different book here

# Create a variable called "paradise" which contains the text of Milton's Paradise Lost.
paradise = nltk.corpus.gutenberg.raw('milton-paradise.txt')

# make all characters lowercase
paradise = paradise.lower()

# remove the "\n" characters, which indicate line breaks in the text (newlines)
paradise = paradise.replace('\n', ' ')

# split up the text into a long list of individual words
paradise = paradise.split(' ')

# From here, this is referred to as text preprocessing:

# Make concordance list, which counts amount of times the word "satan" appears in the book:
par_concordance = []
for i, val in enumerate(paradise):
    if val == "satan":
        par_concordance.append(str(' '.join(paradise[i-5:i+5])))

In [29]:
# Output the list of the chosen word + the 5 words in either direction from it
par_concordance

['hell?"    so satan spake; and him beelzebub',
 'the high capital  of satan and his peers. their',
 'barbaric pearl and gold,  satan exalted sat, by merit',
 'when beelzebub perceived--than whom,  satan except, none higher sat--with',
 'kingly crown had on.  satan was now at hand,',
 'side,  incensed with indignation, satan stood  unterrified, and',
 'forbore: then these to her satan returned:--   ',
 '  he ceased; and satan stayed not to reply,',
 'less hostile din;  that satan with less toil, and',
 'and the gulf between, and satan there   coasting',
 'inroad of darkness old,  satan alighted walks:  a',
 'bound the ocean wave.  satan from hence, now on',
 "o'er sea and land: him satan thus accosts.  uriel,",
 'that steep savage hill  satan had journeyed on, pensive',
 'usher evening rose:  when satan still in gaze, as',
 'bliss!  to whom thus satan with contemptuous brow. ',
 'judge of wise  since satan fell, whom folly overthrew,',
 ' so threatened he; but satan to no threats ',
 '

In [30]:
# Output is the total amount of entries in the list made legible with an f-string
print(f' This concordance has {len(par_concordance)} entries')

 This concordance has 36 entries


## Challenge 2: compare lengths of books

We can use the command `len` to find how many items there are in a list. E.g., to find the number of words in the list called `bible`, above, we can write: `len(bible)`. 

Use the starter code below to find out which book in the books included in `nltk` has the most words.

In [31]:
# solution 1: print all the titles and numbers of words
# starter code:

books = nltk.corpus.gutenberg.fileids()

for title in books:

    # Book preprocessing:
    book = nltk.corpus.gutenberg.raw(title)
    book = book.lower()
    book = book.replace('\n', ' ')
    book = book.split(' ')

    # Applying the length of the 'book' object to a new object called 'words':
    words = len(book)
    
    # Print all book titles and their word count, made legible with an f-string:
    print(f'The book {title} has {words} words')

The book austen-emma.txt has 164457 words
The book austen-persuasion.txt has 86270 words
The book austen-sense.txt has 123514 words
The book bible-kjv.txt has 848001 words
The book blake-poems.txt has 8886 words
The book bryant-stories.txt has 49404 words
The book burgess-busterbrown.txt has 16305 words
The book carroll-alice.txt has 28387 words
The book chesterton-ball.txt has 86481 words
The book chesterton-brown.txt has 80382 words
The book chesterton-thursday.txt has 59297 words
The book edgeworth-parents.txt has 177685 words
The book melville-moby_dick.txt has 221023 words
The book milton-paradise.txt has 91832 words
The book shakespeare-caesar.txt has 23339 words
The book shakespeare-hamlet.txt has 33477 words
The book shakespeare-macbeth.txt has 20164 words
The book whitman-leaves.txt has 138730 words


In [32]:
# more advanced, for those with some Python experience, or those who want to google around..
# solution 2: make a list of titles and a list of wordcounts, zip them together, then sort them based on wordcount
# starter code:

# Store Gutenberg files in the 'books' object
books = nltk.corpus.gutenberg.fileids()

# Make lists for future processing
Book_wordcount = []
titles = []
words = []

for title in books:

    # Book preprocessing:
    book = nltk.corpus.gutenberg.raw(title)
    book = book.lower()
    book = book.replace('\n', ' ')
    book = book.split(' ')

    # Applying the length of the 'book' object to a new object called 'numwords':
    words.append(len(book))
    titles.append(title)

# Zip together the 'title' list and the 'words' list into one zip-object:
wordcount = zip(titles, words)

# Sort the new list by the second column. Since it sorts automatically- 
# -from least to most, the reverse argument is used to make it from most to least
sorted_wordcount = sorted(wordcount, reverse = True,
                        key = lambda x: x[1])

# Print only the first entry, thereby the book with the most words.
print(f'Title of book and the amount of words: {sorted_wordcount [0]}')


Title of book and the amount of words: ('bible-kjv.txt', 848001)


## Challenge 3: what are the most frequent words?

`nltk` has a built-in function called `FreqDist` which counts up how many times each word in a text occurs. So, if you have a list called `words` which contains all the words in a book, you can find the frequencies of all of them by writing `freq = nltk.FreqDist(words)`. You can then get the e.g. ten most common words by writing `freq.most_common(10)`. What are the ten most common words in Jane Austen's "Emma"? What about Herman Melville's "Moby Dick"?

In [33]:
# starter code:

# Create lists for future storage:
freqdist1 = []
freqdist2 = []

# First chosen book preprocessing:
book1 = nltk.corpus.gutenberg.raw('austen-emma.txt')
words1 = book1.lower()
words1 = book1.replace('\n', ' ')
words1 = book1.replace('\r', ' ')
words1 = book1.split(' ')

# Second book preprocessing:
book2 = nltk.corpus.gutenberg.raw('milton-paradise.txt')
words2 = book2.lower()
words2 = book2.replace('\n', ' ')
words2 = book2.replace('\r', ' ')
words2 = book2.split(' ')

# Make new object with most frequent words 
freq1 = nltk.FreqDist(words1)
freq2 = nltk.FreqDist(words2)

# Append frequent distribution to list to make output more legible:
freqdist1.append(freq1.most_common(10))
freqdist2.append(freq2.most_common(10))

# Output specifying books worked with, and then list of 10 most common words in both:
print(f'Most frequent words in the books "Emma" and "Paradise Lost"'), freqdist1, freqdist2




Most frequent words in the books "Emma" and "Paradise Lost"


(None,
 [[('to', 4475),
   ('the', 4380),
   ('', 3704),
   ('of', 3647),
   ('and', 3357),
   ('a', 2707),
   ('I', 2151),
   ('was', 2063),
   ('not', 1808),
   ('in', 1791)]],
 [[('and', 2704),
   ('the', 2500),
   ('to', 1752),
   ('of', 1483),
   ('', 1357),
   ('in', 1080),
   ('his', 981),
   ('with', 871),
   ('or', 562),
   ('all', 553)]])

## Challenge 4: Remove stopwords

Often, the most frequent words are not the most interesting ones. Words like "a" and "the" are so common in English, that they don't really tell us much about the text. That is why we often remove "stopwords", that is, a list of the most common words in English, before e.g. counting frequencies. There are several of these lists available, in [English]((https://gist.github.com/sebleier/554280)) as well as other languages, such as [Danish](https://gist.github.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b). Below is some starter code to remove stopwords. Use these snippets to see what the most common words in Emma and Moby Dick are after removing these most common words.

Hint: In Moby Dick, you will also have to remove the string `\r`, in addition to removing `\n`.

In [34]:
# create list of stopwords:
stopwords = ["", "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you",
             "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself",
             "she", "her", "hers", "herself", "it", "its", "itself", "they", "them",
             "their", "theirs", "themselves", "what", "which", "who", "whom", "this",
             "that", "these", "those", "am", "is", "are", "was", "were", "be", "been",
             "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a",
             "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
             "of", "at", "by", "for", "with", "about", "against", "between", "into", "through",
             "during", "before", "after", "above", "below", "to", "from", "up", "down", "in",
             "out", "on", "off", "over", "under", "again", "further", "then", "once", "here",
             "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more",
             "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so",
             "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

freqdist = []
book = nltk.corpus.gutenberg.raw('austen-emma.txt')

# Book preprocessing:
words = book.lower()
words = book.replace('\n', ' ')
words = book.replace('\r', ' ')
words = book.split(' ')

# code to exclude stopwords from wordcount:
words = [word for word in words if word not in stopwords]

# Make a frequence distribution of the list 'words':
freq = nltk.FreqDist(words)

# Append frequence distribution to new list to make output more legible
freqdist.append(freq.most_common(10))

# Output list of 10 most common words, excluding stop-words:
print(f'List of most common words i "Emma" by Jane Austen, excluding stop words'), freqdist

List of most common words i "Emma" by Jane Austen, excluding stop words


(None,
 [[('I', 2151),
   ('Mr.', 936),
   ('could', 682),
   ('would', 676),
   ('Mrs.', 575),
   ('Miss', 491),
   ('must', 472),
   ('much', 374),
   ('said', 354),
   ('every', 349)]])