# Build your own concordance

It took 500 Dominican monks to write the first concordance of the Latin bible, and it took Rabbi Mordecai Nathan 10 years to write the first concordance of the Hebrew bible. With Python, it only takes a matter of seconds to find words in a text, along with the surrounding words.

Run each cell in this notebook one at a time, in order. If something in one cell doesn't work right, it might be because you have overwritten a variable, so try going back and running all the previous cells again.

First run the code and check that everything works. Then, try modifying the code. Start with the first challenges, and then continue in order. Feel free to work together, and see how far you can get. The important thing is to learn, not to solve all the challenges!

In [None]:
# install the natural language toolkit package (nltk), which has a copy of several texts, 
#including the King James Bible

%pip install nltk

In [None]:
# import the nltk package so that it is accessible to Python, and download a collection of texts from Project Gutenberg
import nltk
nltk.download('gutenberg')

In [3]:
# Create a variable called "bible" which contains the text of the King James bible.
bible = nltk.corpus.gutenberg.raw('bible-kjv.txt')

# make all characters lowercase
bible = bible.lower()

# remove the "\n" characters, which indicate line breaks in the text (newlines)
bible = bible.replace('\n', ' ')

# split up the text into a long list of individual words
bible = bible.split(' ')

In [4]:
# make a variable called "concordance", and fill it with every occurrence of the phrase "this world", and a few words preceding and following "this world"
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        if bible[i-1] == "this":
            concordance.append(str(' '.join(bible[i-5:i+5])))

In [5]:
# take a look at what the algorithm has found
concordance

['for the children of this world are in their generation',
 'them, the children of this world marry, and are given',
 'hateth his life in this world shall keep it unto',
 'shall the prince of this world be cast out. ',
 'should depart out of this world unto the father, having',
 'for the prince of this world cometh, and hath nothing',
 'because the prince of this world is judged.  16:12',
 'of the princes of this world knew: for had they',
 'for the wisdom of this world is foolishness with god.',
 'for the fashion of this world passeth away.  7:32',
 'whom the god of this world hath blinded the minds',
 'chosen the poor of this world rich in faith, and',
 'saying, the kingdoms of this world are become the kingdoms']

In [6]:
# let's see how many instances of the phrase "this world" were found
len(concordance)


13

Let's try again, but this time let's just search for "world" by itself, not "this world".

In [7]:
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        concordance.append(str(' '.join(bible[i-5:i+5])))

In [8]:
# take a look at what the algorithm has found
concordance

['and he hath set the world upon them.  2:9',
 'appeared, the foundations of the world were discovered, at the',
 'him, all the earth: the world also shall be stable,',
 'upon the face of the world in the earth. ',
 'and he shall judge the world in righteousness, he shall',
 'and the foundations of the world were discovered at thy',
 'all the ends of the world shall remember and turn',
 'all the inhabitants of the world stand in awe of',
 'not tell thee: for the world is mine, and the',
 'is thine: as for the world and the fulness thereof,',
 'he hath girded himself: the world also is stablished, that',
 'that the lord reigneth: the world also shall be established',
 'earth: he shall judge the world with righteousness, and the',
 'also he hath set the world in their heart, so',
 'and i will punish the world for their evil, and',
 'kingdoms; 14:17 that made the world as a wilderness, and',
 'fill the face of the world with cities.  14:22',
 'all the kingdoms of the world upon the face o

In [9]:
# let's see how many instances of just the word "world" were found
len(concordance)

113

Now, in the cell below, modify the code to search for a different word.

In [10]:
# searching for the word "god" instead
concordance = []
for i, val in enumerate(bible):
    if val == "god":
        concordance.append(str(' '.join(bible[i-5:i+5])))

# counting the number of occurrences
len(concordance)

2303
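
The searches above all follow the same pattern, so they can be wrapped in a function. The following is a minimal sketch, not part of the original notebook: the name `build_concordance` and its `window` parameter are made up for illustration, and it runs on a short sample sentence rather than the full Bible so that it works without downloading anything. Unlike the cells above, it takes `window` words on each side of the phrase, and it clamps the start of the slice with `max(0, ...)`, because a negative start index would make Python count from the end of the list.

```python
def build_concordance(words, phrase, window=5):
    """Return each occurrence of `phrase` (one or more words) with its context."""
    target = phrase.lower().split()
    n = len(target)
    hits = []
    for i in range(len(words) - n + 1):
        # compare a slice of the word list with the words of the phrase
        if words[i:i + n] == target:
            start = max(0, i - window)  # avoid negative indices near the start
            hits.append(' '.join(words[start:i + n + window]))
    return hits


# hypothetical sample text, prepared the same way as the "bible" list
sample = "for the prince of this world cometh and the world knew him not"
sample_words = sample.lower().split(' ')

print(build_concordance(sample_words, "this world", window=2))
print(len(build_concordance(sample_words, "world", window=2)))
```

With the real data, the call would look like `build_concordance(bible, "this world")`.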

The nltk package has the full text of several other classic books. You can see what they are called by running the command in the cell below:

In [11]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

## Your turn!

Here are some more things you can try. In each case, I have given you a little bit of starter code to get you going, but the cells will not run without some additional code from you.




## Challenge 1: build your own concordance

Pick a different book and a different word, or pair of words. Copy and paste from the code above to write some Python code that searches the book of your choice for the word or pair of words of your choice. Put this code in the cell below. By the way, some of the texts use the characters `\r` for "carriage return" instead of `\n` for "newline". You can remove these the same way that you remove the `\n` characters.

In [12]:
# Searching for instances of "white whale" in the Moby Dick book

# Create a variable called "moby_dick"
moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

# make all characters lowercase
moby_dick = moby_dick.lower()

# remove the "\n" characters
moby_dick = moby_dick.replace('\n', ' ')

# remove the "\r" characters
moby_dick = moby_dick.replace('\r', ' ')

# split up the text into a long list of individual words
moby_dick = moby_dick.split(' ')

# searching for the words "white whale"
concordance = []
for i, val in enumerate(moby_dick):
    if val == "whale":
        if moby_dick[i-1] == "white":
            concordance.append(str(' '.join(moby_dick[i-5:i+5])))

# counting the number of occurrences
len(concordance)

48

## Challenge 2: compare lengths of books

We can use the command `len` to find how many items there are in a list. E.g., to find the number of words in the list called `bible`, above, we can write: `len(bible)`. 

Use the starter code below to find out which of the books included in `nltk` has the most words.

In [13]:
# solution 1: print all the titles and numbers of words
# starter code:

books = nltk.corpus.gutenberg.fileids()

for title in books:
    book = nltk.corpus.gutenberg.raw(title)
    book = book.lower()
    book = book.replace('\n', ' ')
    book = book.replace('\r', ' ')
    book = book.split(' ')
    print(title, len(book))

austen-emma.txt 164457
austen-persuasion.txt 86270
austen-sense.txt 123514
bible-kjv.txt 848001
blake-poems.txt 8886
bryant-stories.txt 54942
burgess-busterbrown.txt 17976
carroll-alice.txt 28387
chesterton-ball.txt 86481
chesterton-brown.txt 80382
chesterton-thursday.txt 59297
edgeworth-parents.txt 195982
melville-moby_dick.txt 243947
milton-paradise.txt 91832
shakespeare-caesar.txt 23339
shakespeare-hamlet.txt 33477
shakespeare-macbeth.txt 20164
whitman-leaves.txt 138730


Looking at the output above, the bible seems to have the most words.

In [14]:
# more advanced, for those with some Python experience, or those who want to google around..
# solution 2: make a list of titles and a list of wordcounts, zip them together, then sort them based on wordcount
# starter code:

books = nltk.corpus.gutenberg.fileids()

titles = []
numwords = []
for title in books:
    book = nltk.corpus.gutenberg.raw(title)
    book = book.lower()
    book = book.replace('\n', ' ')
    book = book.replace('\r', ' ')
    book = book.split(' ')
    # adding the title to a list
    titles.append(title)
    # adding the word count to a list
    numwords.append(len(book))

# zipping together the two lists
zipped_list = list(zip(numwords, titles))

In [15]:
# sorting the zipped_list in descending order and printing
sorted(zipped_list, reverse=True)

[(848001, 'bible-kjv.txt'),
 (243947, 'melville-moby_dick.txt'),
 (195982, 'edgeworth-parents.txt'),
 (164457, 'austen-emma.txt'),
 (138730, 'whitman-leaves.txt'),
 (123514, 'austen-sense.txt'),
 (91832, 'milton-paradise.txt'),
 (86481, 'chesterton-ball.txt'),
 (86270, 'austen-persuasion.txt'),
 (80382, 'chesterton-brown.txt'),
 (59297, 'chesterton-thursday.txt'),
 (54942, 'bryant-stories.txt'),
 (33477, 'shakespeare-hamlet.txt'),
 (28387, 'carroll-alice.txt'),
 (23339, 'shakespeare-caesar.txt'),
 (20164, 'shakespeare-macbeth.txt'),
 (17976, 'burgess-busterbrown.txt'),
 (8886, 'blake-poems.txt')]

Again, the bible has the most words.
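
If only the longest book is needed, sorting the whole list is not necessary. Here is a small sketch, using three of the word counts printed above as sample data. Because each tuple starts with the word count, `max` compares on that number first.

```python
# three (wordcount, title) pairs copied from the output above
zipped_sample = [
    (164457, 'austen-emma.txt'),
    (848001, 'bible-kjv.txt'),
    (243947, 'melville-moby_dick.txt'),
]

# tuples are compared element by element, so the largest word count wins
longest_count, longest_title = max(zipped_sample)
print(longest_title, longest_count)
```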

## Challenge 3: what are the most frequent words?

`nltk` has a built-in function called `FreqDist` which counts up how many times each word in a text occurs. So, if you have a list called `words` which contains all the words in a book, you can find the frequencies of all of them by writing `freq = nltk.FreqDist(words)`. You can then get, for example, the ten most common words by writing `freq.most_common(10)`. What are the ten most common words in Jane Austen's "Emma"? What about Herman Melville's "Moby Dick"?

In [16]:
# finding ten most common words in "Emma"
book = nltk.corpus.gutenberg.raw('austen-emma.txt')
words = book.lower()
words = words.replace('\n', ' ')
words = words.replace('\r', ' ')
words = words.split(' ')
emma_freq = nltk.FreqDist(words)
emma_freq.most_common(10)

[('', 6290),
 ('the', 5120),
 ('to', 5079),
 ('and', 4445),
 ('of', 4196),
 ('a', 3055),
 ('i', 2602),
 ('was', 2302),
 ('she', 2169),
 ('in', 2091)]

In [17]:
# finding ten most common words in "Moby Dick"
moby_dick_freq = nltk.FreqDist(moby_dick) # reusing variable from earlier
moby_dick_freq.most_common(10)

[('', 31917),
 ('the', 14226),
 ('of', 6545),
 ('and', 6238),
 ('a', 4597),
 ('to', 4518),
 ('in', 4058),
 ('that', 2744),
 ('his', 2485),
 ('it', 1765)]
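
In both lists the most frequent "word" is the empty string `''`. This happens because `split(' ')` produces an empty string wherever two spaces occur in a row, for example where a `\n` or `\r` was replaced by a space next to another space. Here is a short sketch of the effect, using `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` (which is built on `Counter`) and a made-up sample string:

```python
from collections import Counter

# hypothetical sample with Windows-style line endings
sample = "call me ishmael.\r\n\r\nsome years ago"
sample = sample.replace('\n', ' ').replace('\r', ' ')

# splitting on a single space keeps an empty string between adjacent spaces
with_empties = sample.split(' ')
# splitting with no argument treats any run of whitespace as one separator
without_empties = sample.split()

print(Counter(with_empties).most_common(1))
print(without_empties)
```

Calling `split()` with no argument is one way to avoid the empty strings. Removing `""` along with the stopwords, as in Challenge 4, is another.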

## Challenge 4: Remove stopwords

Often, the most frequent words are not the most interesting ones. Words like "a" and "the" are so common in English that they don't really tell us much about the text. That is why we often remove "stopwords", that is, a list of the most common words in English, before e.g. counting frequencies. There are several of these lists available, in [English](https://gist.github.com/sebleier/554280) as well as other languages, such as [Danish](https://gist.github.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b). Below is some starter code to remove stopwords. Use these snippets to see what the most common words in Emma and Moby Dick are after removing these most common words.

Hint: In Moby Dick, you will also have to remove the string `\r`, in addition to removing `\n`.

In [18]:
# list of stopwords

stopwords = ["", "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [19]:
# finding ten most common words in "Emma" w/o stopwords
book = nltk.corpus.gutenberg.raw('austen-emma.txt')
words = book.lower()
words = words.replace('\n', ' ')
words = words.replace('\r', ' ')
words = words.split(' ')
words = [word for word in words if word not in stopwords] # code to remove stopwords
emma_freq = nltk.FreqDist(words) # finding each word freq
emma_freq.most_common(10) # printing top ten

[('mr.', 1097),
 ('could', 800),
 ('would', 795),
 ('mrs.', 675),
 ('miss', 568),
 ('must', 543),
 ('emma', 481),
 ('much', 427),
 ('every', 425),
 ('said', 392)]

In [20]:
# finding ten most common words in "Moby Dick" w/o stopwords
words = moby_dick
words = [word for word in words if word not in stopwords] # code to remove stopwords
moby_dick_freq = nltk.FreqDist(words) # finding each word freq
moby_dick_freq.most_common(10) # printing top ten

[('one', 779),
 ('like', 564),
 ('upon', 556),
 ('whale', 528),
 ('old', 425),
 ('would', 416),
 ('though', 311),
 ('great', 292),
 ('still', 282),
 ('seemed', 273)]
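
Two refinements are worth trying here, sketched below on a made-up sentence with a shortened stopword list. First, `word not in stopwords` scans the whole list for every word, and converting the list to a `set` makes that check much faster on long texts. Second, tokens such as `'mr.'` in the Emma results still carry punctuation, which `str.strip` with `string.punctuation` can remove. Whether to strip punctuation is a judgment call, not something the exercise requires. As before, `collections.Counter` stands in for `nltk.FreqDist`.

```python
import string
from collections import Counter

# shortened stopword list, stored as a set, for illustration only
stopwords_sample = {"", "the", "a", "and", "of", "was", "to"}
text = "the whale, the white whale! and the sea was a friend to the whale."

words = text.lower().split()
# remove leading/trailing punctuation from each token
words = [word.strip(string.punctuation) for word in words]
# keep only the words that are not stopwords
words = [word for word in words if word not in stopwords_sample]

print(Counter(words).most_common(1))
```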

# Lab 03: Fun with `pandas`!

Below are some exercises to get you working with `pandas` to manipulate data. As always, get as far as you can, and ask for help when you need it! Your teacher (me), your instructor, and your classmates are all here to help each other get better at coding. Getting the code to work is important, but do also take the time to make sure you understand what the commands are doing. This time (with the exception of the Stroop challenge), all I've given you is the code to download the data. Then you are on your own. For the Stroop challenge, I gave you the code for the first step. After that, it's up to you :-)

## Music sales challenge

Write a script that:

1. Combines the tables of best-selling physical singles and best-selling digital singles on the Wikipedia page "List_of_best-selling_singles"
2. Adds a column which marks whether each row is from the list of physical singles or digital singles
3. Outputs the artist and single name for the year you were born. If there is no entry for that year, take the closest year after you were born.
4. Outputs the artist and single name for the year you were 15 years old.

In [None]:
%pip install lxml
%pip install pandas
import pandas as pd

In [None]:
rawdata = pd.read_html("https://en.wikipedia.org/wiki/List_of_best-selling_singles")
df1 = rawdata[0] # physical, 15 million physical copies or more
df1['Type'] = 'physical' # adding new col with type
df2 = rawdata[1] # physical, 10–14.9 million copies
df2['Type'] = 'physical'
df3 = rawdata[3] # digital, 15 million digital copies or more
df3['Type'] = 'digital'
df4 = rawdata[4] # digital, 10–14.99 million copies
df4['Type'] = 'digital'

# combining
df = pd.concat([df1, df2, df3, df4])

# I was born in 1997, so making a new df with those entries
df_1997 = df[df['Released'] == 1997]

# printing content of 1997 list in prose
for i in range(len(df_1997)):
    artist = df_1997.iloc[i,0]
    single = df_1997.iloc[i,1]
    print('Entries from my birthyear, 1997, include', artist, 'with', single)

# adding 15 years to 1997, and making a new df with those entries
df_age_15 = df[df['Released'] == 1997+15]

# printing content of age_15 list in prose
for i in range(len(df_age_15)):
    artist = df_age_15.iloc[i,0]
    single = df_age_15.iloc[i,1]
    print('Entries from the year I was 15 include', artist, 'with', single)

Entries from my birthyear, 1997, include Elton John with "Something About the Way You Look Tonight"/"Candle in the Wind 1997"
Entries from my birthyear, 1997, include Celine Dion with "My Heart Will Go On"
Entries from the year I was 15 include Imagine Dragons with "Radioactive"
Entries from the year I was 15 include Macklemore and Ryan Lewis featuring Wanz with "Thrift Shop"
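
Step 3 of the challenge also asks what to do when there is no entry for the birth year. The sketch below shows one way to handle that on a small hand-made table (the rows are adapted from the output above, not downloaded from Wikipedia): keep the rows released in or after the target year and take the smallest year among them. The helper name `entries_for_year` is made up for illustration, and the sketch assumes the `Released` column holds plain integers.

```python
import pandas as pd

# hand-made sample rows; single titles shortened for readability
singles = pd.DataFrame({
    'Artist': ['Elton John', 'Celine Dion', 'Imagine Dragons'],
    'Single': ['Candle in the Wind 1997', 'My Heart Will Go On', 'Radioactive'],
    'Released': [1997, 1997, 2012],
})


def entries_for_year(df, year):
    """Return rows for `year`, or for the closest later year if `year` has none."""
    later = df[df['Released'] >= year]
    if later.empty:
        return later  # nothing was released in or after that year
    closest = later['Released'].min()
    return df[df['Released'] == closest]


print(entries_for_year(singles, 1997)[['Artist', 'Single']])
print(entries_for_year(singles, 2000)[['Artist', 'Single']])  # falls forward to 2012
```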


## Space challenge

1. Make a single dataframe that combines the space missions from the 1950's to the 2020's
2. Write a script that returns the year with the most launches
3. Write a script that returns the most common month for launches
4. Write a script that ranks the months from most launches to fewest launches


In [None]:
rawdata = pd.read_html("https://en.wikipedia.org/wiki/Timeline_of_Solar_System_exploration")

# combining df's
df = pd.DataFrame() # preparing df
for i in range(8): # looping through the 8 df on wiki page
    df = pd.concat([df,rawdata[i]]) # adding the content of iteration to main df

# script that returns year with most launches
df['Launch year'] = df['Launch date'].str[-4:] # making new col with launch year, i.e. the last four characters of the launch date
grouped_df = df.groupby('Launch year').count().reset_index() # grouping by launch year and counting number of entries in each group
grouped_df = grouped_df.sort_values(by = ['Mission name'], ascending = False) # sorting from highest to lowest count
print('The year with the most launches was', grouped_df.iloc[0]['Launch year'], 'with', grouped_df.iloc[0]['Description'], 'launches.') # printing top result in prose

# script that returns most common month for launches
df['Launch date'] = pd.to_datetime(df['Launch date']) # formatting launch date as datetime
df['Launch month'] = df['Launch date'].dt.strftime('%B') # extracting month from Launch date to new col
grouped_df = df.groupby('Launch month').count().reset_index() # grouping by launch month and counting number of entries in each group
grouped_df = grouped_df.sort_values(by = ['Mission name'], ascending = False) # sorting from highest to lowest count
print('The month with the most launches was', grouped_df.iloc[0]['Launch month'], 'with', grouped_df.iloc[0]['Description'], 'launches.') # printing top result in prose

# script that ranks the months from most launches to fewest launches
print('Ranking the months from most launches to fewest launches:')
for i in range(len(grouped_df)):
    print(grouped_df.iloc[i]['Launch month'])

The year with the most launches was 1965 with 12 launches.
The month with the most launches was November with 30 launches.
Ranking the months from most launches to fewest launches:
November
August
October
September
July
January
December
May
March
February
June
April
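
A shorter route to the same kind of ranking is `value_counts`, which counts each distinct value in a column and returns the counts sorted from largest to smallest. Here is a minimal sketch on a few made-up launch dates (not the Wikipedia data), assuming the dates are written as day, month name, year:

```python
import pandas as pd

# made-up sample dates for illustration
launches = pd.DataFrame({
    'Launch date': ['4 October 1957', '3 November 1957',
                    '12 November 1960', '28 November 1964'],
})

# parse the text into datetimes, stating the expected format explicitly
dates = pd.to_datetime(launches['Launch date'], format='%d %B %Y')

# count launches per month name; the result is already sorted by count
month_counts = dates.dt.month_name().value_counts()
print(month_counts)

# count launches per year and pick the year with the highest count
year_counts = dates.dt.year.value_counts()
print(year_counts.idxmax())
```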


## Supervillain challenge

1. Write a script that combines the tables showing supervillain debuts from the 30's through the 2010's
2. Write a script that ranks each decade in terms of how many supervillains debuted in that decade
3. Write a script that ranks the different comics companies in terms of how many supervillains they have, and display the results in a nice table (pandas dataframe)

In [None]:
rawdata = pd.read_html("https://en.wikipedia.org/wiki/List_of_comic_book_supervillain_debuts")

df = pd.DataFrame() # preparing df
Decade = 1930 # preparing start value for decade variable

# script that combines tables
for i in range(3,12): # looping through item 3-11 on wiki page
    rawdata[i]['Decade'] = Decade # storing decade variable in new col
    df = pd.concat([df,rawdata[i]]) # adding the content of iteration to main df
    Decade = Decade + 10 # updating decade variable

# script that ranks each decade by number of debuting supervillains
grouped_df = df.groupby('Decade').count().reset_index() # grouping by decade and counting number of entries in each group
nice_df = grouped_df[['Decade','Year Debuted']].copy().rename(columns={'Decade':'Decade','Year Debuted':'Count'}) # copying Decade and count to new df
nice_df = nice_df.sort_values(by = ['Count'], ascending = False) # sorting from highest to lowest count
nice_df

|   | Decade | Count |
|---|---|---|
| 3 | 1960 | 228 |
| 4 | 1970 | 97 |
| 5 | 1980 | 92 |
| 6 | 1990 | 84 |
| 7 | 2000 | 49 |
| 1 | 1940 | 47 |
| 2 | 1950 | 26 |
| 8 | 2010 | 9 |
| 0 | 1930 | 4 |


In [None]:
# script that ranks comic companies by number of supervillains as pd df
grouped_df = df.groupby('Company').count().reset_index() # grouping by company and counting number of entries in each group
nice_df = grouped_df[['Company','Year Debuted']].copy().rename(columns={'Company':'Company','Year Debuted':'Count'}) # copying Company and count to new df
nice_df = nice_df.sort_values(by = ['Count'], ascending = False) # sorting from highest to lowest count
nice_df

|   | Company | Count |
|---|---|---|
| 1 | DC | 338 |
| 9 | Marvel | 264 |
| 5 | Fawcett Comics/DC | 6 |
| 2 | Dark Horse | 5 |
| 6 | Image | 5 |
| 3 | Disney/Hyperion | 4 |
| 10 | Marvel/Timely | 4 |
| 4 | Eternity | 3 |
| 0 | Comico | 1 |
| 7 | Image Comics | 1 |
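
In the results above, `Image` and `Image Comics` appear as separate companies, so what looks like one publisher is counted under two labels. If you decide those labels should be merged (that is an assumption, not something the Wikipedia table states), the names can be normalized before counting. Here is a sketch on made-up rows:

```python
import pandas as pd

# made-up sample rows; the character names are placeholders
villains = pd.DataFrame({
    'Character': ['A', 'B', 'C', 'D'],
    'Company': ['Image', 'Image Comics', 'DC', 'Image'],
})

# map variant spellings onto one label before counting
villains['Company'] = villains['Company'].replace({'Image Comics': 'Image'})

# count rows per company and turn the result into a two-column dataframe
counts = villains['Company'].value_counts().reset_index()
counts.columns = ['Company', 'Count']
print(counts)
```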


## Stroop challenge

Every year between 2015 and 2021, the students in my Language, Cognition, and the Brain course participated in a version of the Stroop task. Using a stopwatch (ok, using their phones), they recorded how fast they could say a list of things (either reading or naming colors or color words). The column names mean "Reading with No Interference", "Naming with Interference", "Naming with No Interference", and "Reading with Interference". The times are in seconds.

### Stroop challenge 1: 
Transform these data from wide format to long format, so that the result is a dataframe with
- 1 column named "Participant_id" with a unique number for each participant (you can use the row indices)
- 1 column named "Year" with the year data
- 1 column named "Task" that shows which task they were doing
- 1 column named "RT" that shows their response time

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/ethanweed/Stroop/master/Stroop-raw-over-the-years.csv")
# Make a new column using the dataframe indices as participant numbers
df.index.name = 'Participant_id'
df = df.reset_index()

# transforming from wide to long format
df_long = pd.melt(df, id_vars=[ 'Participant_id', 'Year']) # transforming to long format
df_long = df_long.rename(columns={'Participant_id':'Participant_id','Year':'Year', 'variable': 'Task', 'value': 'RT'}) # renaming headers
df_long.head(185) # inspecting

|   | Participant_id | Year | Task | RT |
|---|---|---|---|---|
| 0 | 0 | 2015 | Reading_NoInt | 4.16 |
| 1 | 1 | 2015 | Reading_NoInt | 4.35 |
| 2 | 2 | 2015 | Reading_NoInt | 3.60 |
| 3 | 3 | 2015 | Reading_NoInt | 3.90 |
| 4 | 4 | 2015 | Reading_NoInt | 4.22 |
| ... | ... | ... | ... | ... |
| 180 | 180 | 2021 | Reading_NoInt | 5.16 |
| 181 | 181 | 2021 | Reading_NoInt | 4.27 |
| 182 | 0 | 2015 | Naming_Int | 6.76 |
| 183 | 1 | 2015 | Naming_Int | 7.73 |
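
`pd.melt` can also name the new columns directly through its `var_name` and `value_name` parameters, which makes the separate `rename` step unnecessary. Here is a sketch on a two-participant table built from the first values shown above:

```python
import pandas as pd

# small wide-format sample: one row per participant, one column per task
wide = pd.DataFrame({
    'Participant_id': [0, 1],
    'Year': [2015, 2015],
    'Reading_NoInt': [4.16, 4.35],
    'Naming_Int': [6.76, 7.73],
})

# one row per participant per task, with the new columns named in the same call
long = pd.melt(wide, id_vars=['Participant_id', 'Year'],
               var_name='Task', value_name='RT')
print(long)
```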


### Stroop challenge 2 (Advanced!!!):

Make a new dataframe which shows the mean response time (in seconds) for each task for each year.

In [None]:
grouped_df = df_long.groupby(['Task', 'Year'])['RT'].mean().reset_index()
grouped_df

|   | Task | Year | RT |
|---|---|---|---|
| 0 | Naming_Int | 2015 | 8.617143 |
| 1 | Naming_Int | 2016 | 8.859268 |
| 2 | Naming_Int | 2017 | 9.311765 |
| 3 | Naming_Int | 2018 | 9.372667 |
| 4 | Naming_Int | 2019 | 9.536087 |
| 5 | Naming_Int | 2020 | 9.740833 |
| 6 | Naming_Int | 2021 | 10.105484 |
| 7 | Naming_NoInt | 2015 | 5.123571 |
| 8 | Naming_NoInt | 2016 | 5.40561 |
| 9 | Naming_NoInt | 2017 | 5.771176 |
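
For reading the means side by side, the same calculation can be laid out so that each task becomes a column. Here is a sketch using `pivot_table` on a few made-up response times (not the class data):

```python
import pandas as pd

# made-up long-format sample: one row per measurement
rt_long = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2016, 2016],
    'Task': ['Naming_Int', 'Naming_Int', 'Naming_NoInt', 'Naming_Int', 'Naming_NoInt'],
    'RT': [8.0, 9.0, 5.0, 10.0, 6.0],
})

# rows are years, columns are tasks, cells hold the mean response time
means = rt_long.pivot_table(index='Year', columns='Task', values='RT', aggfunc='mean')
print(means)
```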
