# Loops

Loops are used to repeat a process over and over until a given condition is met. This is similar to how we might, for example, search for a specific quote in a text.

You read a sentence and ask yourself, "Is this the sentence I'm looking for?" "No," your brain answers, and so you repeat this process of reading each sentence until you find the one you're looking for. Translated into something closer to computer code, this process would look something like this:

```
for every sentence on this page:
    if this is the quote I am looking for:
        I can stop reading, I've found it!
    if this isn't the quote:
        Let's read the next sentence.
```
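
As a rough sketch, the pseudocode above could be written in Python like this. The `sentences` list and the `target` quote are made-up examples for illustration; the `break` statement is what lets the loop stop reading once the quote has been found.

```python
# hypothetical example data: a short "page" of sentences and the quote we want
sentences = [
    "The archive opens at nine.",
    "Bring a pencil, not a pen.",
    "Gloves are provided at the desk.",
]
target = "Bring a pencil, not a pen."

found_at = None  # will hold the position of the quote once we find it

for position, sentence in enumerate(sentences):
    if sentence == target:
        found_at = position
        print("I can stop reading, I've found it!")
        break  # leave the loop early, skipping the remaining sentences
    print("Let's read the next sentence.")

print(found_at)
```

Without `break`, the loop would still find the quote, but it would keep reading every remaining sentence afterwards.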

In DH applications, for loops allow you to search a large amount of data very quickly.

In [None]:
# here is a list of names
list_of_names = ["George Atherton",
                "Marian Hyde",
                "Sybil Dickenson",
                "Sabina Dobson",
                "Jessica Bradbury",
                "Cindy Salter",
                "Carolina McCabe",
                "Glynis Graves",
                "Laurie Dobson",
                "Phoebe Watkins",
                "Noel Boardman"]

# first, we can make a loop to "iterate" over the list with no conditions
# it will simply continue to go over each item in the list until there is nothing left
# to simply just print out every name in the list:

for name in list_of_names:
    print(name)

# NOTE: "name" is a variable created by the loop, and it stores the item that the loop is presently looking at
# in our case, on the first pass through the loop "name" = "George Atherton"; after that name is printed, the loop repeats and "name" = "Marian Hyde", on the next pass "name" = "Sybil Dickenson", and so on until the end of the list

In [None]:
# now, as stated earlier, you'll more likely want to use a loop to find something relevant to your work
# let's say we're only interested in people with the surname "Dobson"
# we can use a combination of for loops and if statements to create a new list of only Dobsons!

# declare your new list that we will add to
only_dobsons = []

for name in list_of_names:
    # we check if this name includes "Dobson"
    if "Dobson" in name:
        # if this is True, we add this name to our new list
        only_dobsons.append(name)

print(only_dobsons)
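
The same filter can also be written as a list comprehension, which packs the `for` loop, the `if` statement, and the `append()` into a single line. This is only an alternative spelling of the cell above, shown here as a self-contained sketch with the same list of names.

```python
list_of_names = ["George Atherton", "Marian Hyde", "Sybil Dickenson",
                 "Sabina Dobson", "Jessica Bradbury", "Cindy Salter",
                 "Carolina McCabe", "Glynis Graves", "Laurie Dobson",
                 "Phoebe Watkins", "Noel Boardman"]

# read as: "keep each name in list_of_names if it contains 'Dobson'"
only_dobsons = [name for name in list_of_names if "Dobson" in name]

print(only_dobsons)
# → ['Sabina Dobson', 'Laurie Dobson']
```

The longer loop is often easier to read while you are learning, so use whichever form is clearer to you.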

## Practice Activity #4: Loop and Look 👀
Here is a list of quotes from the novel *Little Women* by Louisa May Alcott. In this activity, use a `for` loop and `if` statement as done above to find quotes that include the word **"work"**. These quotes should be added to a new list, then printed.

In [None]:
little_women_quotes = [
                        "...I do think washing dishes and keeping things tidy is the worst work in the world. It makes me cross; and my hands get so stiff, I can't practise well at all.",
                        "I don't see how you can write and act such splendid things, Jo. You're a regular Shakespeare!",
                        "But it does seem so nice to have little suppers and bouquets, and go to parties, and drive home, and read and rest, and not work. It's like other people, you know, and I always envy girls who do such things; I'm so fond of luxury",
                        "She caught up her knitting, which had dropped out of her hands, gave me a sharp look through her specs, and said, in her short way, 'Finish the chapter, and don't be impertinent, miss.'",
                        "You may try your experiment for a week, and see how you like it. I think by Saturday night you will find that all play and no work is as bad as all work and no play."
                    ]

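# declare a new list to collect the matching quotes
work_quotes = []

for quote in little_women_quotes:
    if "work" in quote:
        work_quotes.append(quote)

print(work_quotes)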

# Functions

'Functions' are blocks of reusable code. As you know by now, Python has many built-in functions, such as `print()` or `len()`, which were designed to perform specific tasks when called. If there is a task in your own code that you will need to repeat at various points, you can write a function yourself! For example, instead of having something like this:

In [None]:
# find age of each person from records
life_records = [["Cindy Salter", "Born: 1903", "Died: 1933"], ["Glynis Graves", "Born: 1911", "Died: 1989"], ["Noel Boardman", "Born: 1908", "Died: 1972"]]

cs_born = life_records[0][1]
cs_death = life_records[0][2]

# get only the number
for word in cs_born.split():
    if word.isdigit():
        cs_born = int(word)

for word in cs_death.split():
    if word.isdigit():
        cs_death = int(word)

cs_age = cs_death - cs_born
print(life_records[0][0] + "'s age: " + str(cs_age))

gg_born = life_records[1][1]
gg_death = life_records[1][2]

# get only the number
for word in gg_born.split():
    if word.isdigit():
        gg_born = int(word)

for word in gg_death.split():
    if word.isdigit():
        gg_death = int(word)

gg_age = gg_death - gg_born
print(life_records[1][0] + "'s age: " + str(gg_age))

nb_born = life_records[2][1]
nb_death = life_records[2][2]

# get only the number
for word in nb_born.split():
    if word.isdigit():
        nb_born = int(word)

for word in nb_death.split():
    if word.isdigit():
        nb_death = int(word)

nb_age = nb_death - nb_born
print(life_records[2][0] + "'s age: " + str(nb_age))


...We could have something much tidier and easier to read, like this:

In [None]:
# find age of each person from records
life_records = [["Cindy Salter", "Born: 1903", "Died: 1933"], ["Glynis Graves", "Born: 1911", "Died: 1989"], ["Noel Boardman", "Born: 1908", "Died: 1972"]]

def find_age(record):
    born = ''
    death = ''
    for word in record[1].split():
        if word.isdigit():
            born = int(word)

    for word in record[2].split():
        if word.isdigit():
            death = int(word)

    age = record[0] + "'s age: " + str(death - born)
    return age

print(find_age(life_records[0]))
print(find_age(life_records[1]))
print(find_age(life_records[2]))
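
Because `find_age()` works on one record at a time, it combines naturally with a `for` loop, so the three `print()` calls above do not need to be written out by hand. The sketch below repeats the function from the cell above so that it runs on its own.

```python
life_records = [["Cindy Salter", "Born: 1903", "Died: 1933"],
                ["Glynis Graves", "Born: 1911", "Died: 1989"],
                ["Noel Boardman", "Born: 1908", "Died: 1972"]]

def find_age(record):
    born = ''
    death = ''
    # keep only the number from "Born: ...."
    for word in record[1].split():
        if word.isdigit():
            born = int(word)

    # keep only the number from "Died: ...."
    for word in record[2].split():
        if word.isdigit():
            death = int(word)

    age = record[0] + "'s age: " + str(death - born)
    return age

# call the function once for every record, however many there are
for record in life_records:
    print(find_age(record))
```

If more people are added to `life_records` later, the loop handles them without any further changes.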

# Libraries

Now, what can make your code even *tidier*, plus easier to read *and* write? Libraries! Also referred to as "packages", these helpful tools are essentially large collections of pre-written functions that you can install in your Python environment and import so that you can use these functions in your code. 
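
As a small illustration of what importing looks like, here are the three common forms of the `import` statement. The sketch uses modules from Python's standard library (`math`, `statistics`, and `string`), which need no installation; the same syntax should apply to libraries you install yourself.

```python
# 1. import the whole library, then refer to its functions by the library name
import math
print(math.sqrt(16))

# 2. import a library under a shorter nickname (an "alias")
import statistics as stats
print(stats.mean([30, 78, 64]))

# 3. import only the specific names you need from a library
from string import punctuation
print(punctuation)
```

All three forms appear later in this notebook, for example `import nltk`, `import pandas as pd`, and `from string import punctuation`.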

## NLTK (Natural Language Tool Kit)

In this workshop, we will be introducing two libraries which are necessities for any digital humanist's tool kit, the first of which is [NLTK (Natural Language Tool Kit)](https://www.nltk.org/). This is an all-encompassing library to support work in natural language processing (NLP), a multidisciplinary field which deals with the interactions between "natural" human language and computers. It has its roots in linguistics, which is why it can do things like this:

In [None]:
!pip3 install nltk

In [13]:
import nltk
from nltk import pos_tag, word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# tagging PoS in inputted text
text = word_tokenize("Be careful with that butter knife.")
nltk.pos_tag(text)



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Be', 'VB'),
 ('careful', 'JJ'),
 ('with', 'IN'),
 ('that', 'DT'),
 ('butter', 'NN'),
 ('knife', 'NN'),
 ('.', '.')]

...But as the `word_tokenize()` function hints at, NLTK is also excellent at preparing text for, and performing, textual analysis on a broader scale than individual parts of speech!

(**Note**: NLTK uses the Penn Treebank Tag Set for POS tagging, [which can be found here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).)

In [14]:
import nltk

# tokenization is the process of splitting strings into their individual "tokens"
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

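# to import a .txt file we use the "open" function, giving it the path to our text file and, optionally, an instruction about what we want to do with the file
# here, we would like to "read" our file into a variable so that we can work with its text (reading is the default, so no extra instruction is needed)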
transcript = open('Bette-Smith-Transcript.txt').read().lower()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
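
The cell above opens and reads the file in a single line. A common alternative, sketched below, is to open the file inside a `with` block, which closes the file automatically once the block ends, and to state the mode (`"r"` for read) and the encoding explicitly. So that the sketch runs on its own, it first writes a small stand-in file; `example-transcript.txt` and its contents are made up for illustration and are not the workshop transcript.

```python
import os
import tempfile

# create a small stand-in text file in a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example-transcript.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("My name is Bette Smith.\nI grew up in Ottawa.")

# "r" means read; the file is closed automatically when the "with" block ends
with open(path, "r", encoding="utf-8") as f:
    transcript = f.read().lower()

print(transcript)
```

Stating `encoding="utf-8"` helps avoid surprises with accented characters, since the default encoding can differ between operating systems.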


In [15]:
# we could then tokenize by sentence, which splits the text into sentences
transcript_sentences = sent_tokenize(transcript)
transcript_sentences

["my name is bette smith, and i grew up in ottawa in the carleton, carling area, sorry i'm going back a little bit.",
 'and i went to fisher park high school, and while i was there, i went to grade 13, which usually prepares you for university.',
 "however, in my grade 13 year, i really didn't feel i wanted to go to university, but i decided instead to take the money that my parents had put aside for me for schooling or whatever and went to business college instead.",
 'so i took a year at willis business college in downtown ottawa, and at the end of that, they just said to me well, where would you like to work?',
 'and i said maybe i would like to work at the university.',
 "so they actually got me the job, i didn't apply.",
 'and i started at carleton university in 1972, so i would have been just 19 years of age.',
 'i started as a steno 03, which is basically the bottom of the pile, and i remember that i made 3,500 a year.',
 "which at that point wasn't enough to live on by yourself

In [16]:
# or more commonly, we can tokenize into words, which splits the text into individual words and punctuation marks
transcript_words = word_tokenize(transcript)
transcript_words

['my',
 'name',
 'is',
 'bette',
 'smith',
 ',',
 'and',
 'i',
 'grew',
 'up',
 'in',
 'ottawa',
 'in',
 'the',
 'carleton',
 ',',
 'carling',
 'area',
 ',',
 'sorry',
 'i',
 "'m",
 'going',
 'back',
 'a',
 'little',
 'bit',
 '.',
 'and',
 'i',
 'went',
 'to',
 'fisher',
 'park',
 'high',
 'school',
 ',',
 'and',
 'while',
 'i',
 'was',
 'there',
 ',',
 'i',
 'went',
 'to',
 'grade',
 '13',
 ',',
 'which',
 'usually',
 'prepares',
 'you',
 'for',
 'university',
 '.',
 'however',
 ',',
 'in',
 'my',
 'grade',
 '13',
 'year',
 ',',
 'i',
 'really',
 'did',
 "n't",
 'feel',
 'i',
 'wanted',
 'to',
 'go',
 'to',
 'university',
 ',',
 'but',
 'i',
 'decided',
 'instead',
 'to',
 'take',
 'the',
 'money',
 'that',
 'my',
 'parents',
 'had',
 'put',
 'aside',
 'for',
 'me',
 'for',
 'schooling',
 'or',
 'whatever',
 'and',
 'went',
 'to',
 'business',
 'college',
 'instead',
 '.',
 'so',
 'i',
 'took',
 'a',
 'year',
 'at',
 'willis',
 'business',
 'college',
 'in',
 'downtown',
 'ottawa',
 '

In [17]:
# now remember that huge block of stopwords manually typed out in the sample block of code from the first lesson? That comes built into NLTK, as you may have guessed from the earlier import statement
# we can assign the NLTK stopwords to a variable like so:
stop_words = stopwords.words('english')

# and then remove the stopwords from our text, using a loop to check each word in the transcript and keep only the words that are NOT in our stopword list
filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words:
        filtered_transcript_words.append(word)
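
For longer texts, one optional refinement is to turn the stopword list into a `set` before filtering, since checking whether a word is in a set is generally faster than checking a list. The sketch below uses a tiny hand-written stopword list and token list as stand-ins for the NLTK stopwords and the transcript, so it runs without any downloads.

```python
# hypothetical stand-ins for stopwords.words('english') and transcript_words
stop_words = ["my", "is", "and", "i", "up", "in", "the"]
transcript_words = ["my", "name", "is", "bette", "smith", ",",
                    "and", "i", "grew", "up", "in", "ottawa"]

# a set holds the same words, but membership checks are quicker
stop_word_set = set(stop_words)

# keep only the words that are NOT stopwords
filtered_transcript_words = [word for word in transcript_words
                             if word not in stop_word_set]

print(filtered_transcript_words)
# → ['name', 'bette', 'smith', ',', 'grew', 'ottawa']
```

The result is the same as the loop above; only the speed of the lookup changes.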

In [18]:
filtered_transcript_words

['name',
 'bette',
 'smith',
 ',',
 'grew',
 'ottawa',
 'carleton',
 ',',
 'carling',
 'area',
 ',',
 'sorry',
 "'m",
 'going',
 'back',
 'little',
 'bit',
 '.',
 'went',
 'fisher',
 'park',
 'high',
 'school',
 ',',
 ',',
 'went',
 'grade',
 '13',
 ',',
 'usually',
 'prepares',
 'university',
 '.',
 'however',
 ',',
 'grade',
 '13',
 'year',
 ',',
 'really',
 "n't",
 'feel',
 'wanted',
 'go',
 'university',
 ',',
 'decided',
 'instead',
 'take',
 'money',
 'parents',
 'put',
 'aside',
 'schooling',
 'whatever',
 'went',
 'business',
 'college',
 'instead',
 '.',
 'took',
 'year',
 'willis',
 'business',
 'college',
 'downtown',
 'ottawa',
 ',',
 'end',
 ',',
 'said',
 'well',
 ',',
 'would',
 'like',
 'work',
 '?',
 'said',
 'maybe',
 'would',
 'like',
 'work',
 'university',
 '.',
 'actually',
 'got',
 'job',
 ',',
 "n't",
 'apply',
 '.',
 'started',
 'carleton',
 'university',
 '1972',
 ',',
 'would',
 '19',
 'years',
 'age',
 '.',
 'started',
 'steno',
 '03',
 ',',
 'basically',
 '

In [19]:
# finally, we can easily find word frequency with NLTK's frequency distribution function
from nltk import FreqDist

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)



[(',', 885),
 ('.', 712),
 ('?', 119),
 ("n't", 111),
 ('would', 94),
 ("'s", 86),
 ('know', 73),
 ('staff', 61),
 ('faculty', 60),
 ('think', 58)]
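
To see what a frequency distribution is doing, a similar tally can be made with `collections.Counter` from Python's standard library, which offers the same `most_common()` method. The word list below is a made-up example, not taken from the transcript.

```python
from collections import Counter

# hypothetical list of tokens
words = ["work", "university", "work", "staff", "faculty", "staff", "work"]

# count how many times each unique word appears
word_counts = Counter(words)

# the two most frequent words, as (word, count) pairs
print(word_counts.most_common(2))
# → [('work', 3), ('staff', 2)]
```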

In [7]:
# now, as you can see, our list is topped by punctuation and contractions!

# to remove punctuation, we can use Python's string library to create a list of punctuation
from string import punctuation
punctuation = list(punctuation)

# and luckily, you can modify your stopwords and punctuation lists like any other list!
# let's add "n't", "'s", and "would"
# to add multiple elements to a list at once, we use extend() rather than append()
stop_words.extend(["n't", "'s", 'would'])
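
It can help to look at what the punctuation list actually contains. In the sketch below, `punctuation_list` and `tokens` are names chosen for illustration. Each entry in the list is a single character, so multi-character tokens that a tokenizer can produce, such as `''` or the two backticks used for quotation marks, will not match and would need to be added to the stopword list by hand.

```python
from string import punctuation

# string.punctuation is one long string; list() splits it into single characters
punctuation_list = list(punctuation)
print(punctuation_list)
print(len(punctuation_list))

# check a few example tokens against the list
tokens = [",", ".", "?", "''", "``", "n't"]
for token in tokens:
    print(repr(token), token in punctuation_list)
```

The single characters `,`, `.` and `?` are found in the list, while `''`, the double backtick, and `n't` are not.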

In [8]:
# let's re-run with our new stopwords and punctuation list to see the improved results
filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)

[('know', 73),
 ('staff', 61),
 ('faculty', 60),
 ('think', 58),
 ('well', 52),
 ('work', 52),
 ('going', 47),
 ('yeah', 44),
 ('really', 41),
 ('like', 39)]

In [9]:
# now that we have a word frequency list, we can even use NLTK for concordance analysis (seeing a word in context)
# we can choose a word from the word frequency list, and search the original tokenized text for it after making it a Text object
from nltk.text import Text

text_list = Text(transcript_words)
text_list.concordance("work", lines=52)


Displaying 52 of 52 matches:
to me well , where would you like to work ? and i said maybe i would like to w
k ? and i said maybe i would like to work at the university . so they actually
counts payable , accounts receivable work . my brother gordon went on to do a 
 went on to do a master 's in social work , sorry not social work , sociology 
's in social work , sorry not social work , sociology and actually did some do
ology and actually did some doctoral work at london school of economics . marg
n dj of course is also done graduate work in political science , actually as w
h it 's very hard , as you know , to work , to work full time and do courses .
ry hard , as you know , to work , to work full time and do courses . and have 
ves . so then did he , when , did he work that farm ? dad was never really , h
 machinery . and he also enjoyed his work later on with the city of ottawa whi
and then he at some point started to work for the city ? yes , i think his bro
 . and of course mom di
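
To make the idea of a concordance more concrete, here is a minimal sketch in plain Python that shows each match with a few tokens on either side. The function `keyword_in_context()` and its `window` parameter are hypothetical names for this illustration; this is not how NLTK implements `concordance()`, which trims its lines by character width instead.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens on either side."""
    lines = []
    for position, token in enumerate(tokens):
        if token == keyword:
            # max() stops the slice from starting before the first token
            start = max(0, position - window)
            left = tokens[start:position]
            right = tokens[position + 1:position + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines

# a short run of tokens from the transcript
tokens = ["and", "i", "said", "maybe", "i", "would", "like", "to",
          "work", "at", "the", "university", "."]

for line in keyword_in_context(tokens, "work"):
    print(line)
# → would like to work at the university
```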

In [34]:
# Here is what our original Python script from the Python-I notebook now looks like with NLTK
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist

from string import punctuation
punctuation = list(punctuation)

nltk.download('punkt')
nltk.download('stopwords')

transcript = open('Bette-Smith-Transcript.txt', encoding="utf-8").read().lower()

transcript_words = word_tokenize(transcript)

stop_words = stopwords.words('english')
stop_words.extend(["n't", "'s", 'would'])

filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('know', 73),
 ('staff', 61),
 ('faculty', 60),
 ('think', 58),
 ('well', 52),
 ('work', 52),
 ('going', 47),
 ('yeah', 44),
 ('really', 41),
 ('like', 39)]

## Practice Activity #5: Investigate your own text 🔍
For this activity, use a `.txt` file you have on hand, or download a plain text file from [Project Gutenberg](https://www.gutenberg.org/). Place it in the same folder as this notebook, then open it in your code and see if you can use NLTK to perform a frequency distribution or concordance analysis!

In [12]:
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist

from string import punctuation
punctuation = list(punctuation)

nltk.download('punkt')
nltk.download('stopwords')

transcript = open('pg1404.txt', encoding="utf-8").read().lower()

transcript_words = word_tokenize(transcript)

stop_words = stopwords.words('english')
stop_words.extend(["n't", "'s", 'would'])

filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

## Pandas

[Pandas](https://pandas.pydata.org/docs/) is a data analysis and manipulation tool, working with data in the form of a `dataframe`. A `dataframe` is a Python version of a spreadsheet!

Like a spreadsheet, each column can be of a different type, and using Pandas means we can quickly perform a number of operations on our `dataframe` to prepare our data for use in analysis. To demonstrate functionality, we will be using an exported list of individuals accused of witchcraft in Scotland, from the [Survey of Scottish Witchcraft](https://www.shca.ed.ac.uk/Research/witches/).

In [None]:
!pip3 install pandas

In [16]:
import pandas as pd
# we can set these options to control how many columns and rows we want Jupyter Notebook to display
pd.options.display.max_columns = 70
pd.options.display.max_rows = 70

# we can import a CSV file very simply using Pandas's built-in read_csv() function
witches_df = pd.read_csv("wdb_accused.csv", delimiter=",")
witches_df

Unnamed: 0,accusedref,accusedsystemid,accusedid,firstname,lastname,m_firstname,m_surname,alias,patronymic,destitle,sex,age,age_estcareer,age_estchild,res_settlement,res_parish,res_presbytery,res_county,res_burgh,res_ngr_letters,res_ngr_easting,res_ngr_northing,ethnic_origin,maritalstatus,socioecstatus,occupation,notes,createdby,createdate,lastupdatedby,lastupdatedon
0,A/EGD/10,EGD,10,Mareon,Quheitt,Marion,White,,,,Female,,False,False,Sammuelston,P/JO/3539,Haddington,Haddington,,,,,,,,,,SMD,2001-05-15T11:06:51,jhm,2002-08-09T11:40:51
1,A/EGD/100,EGD,100,Thom,Cockburn,Thomas,Cockburn,,,,Male,,False,False,,,,Haddington,,,,,,,,,,SMD,2001-05-15T11:06:51,jhm,2002-10-02T10:32:30
2,A/EGD/1000,EGD,1000,Christian,Aitkenhead,Christine,Aikenhead,,,,Female,,False,False,Rottinraw,,,Dumfries,,,,,,Married,,,,SMD,2001-05-15T11:06:51,jhm,2002-10-01T10:48:12
3,A/EGD/1001,EGD,1001,Janet,Ireland,Janet,Ireland,,,,Female,,False,False,Rottinraw,,,Dumfries,,,,,,Widowed,,,,SMD,2001-05-15T11:06:51,jhm,2002-10-01T10:49:00
4,A/EGD/1002,EGD,1002,Agnes,Hendersoun,Agnes,Henderson,,,,Female,,False,False,,P/ST/1446,Stirling,Stirling,,,,,,,,,,SMD,2001-05-15T11:06:51,jhm,2002-10-01T10:50:07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3214,A/LA/3240,LA,3240,Cristeane,Johnnestoun,Christine,Johnson,,,,Female,,False,False,,,,,,,,,,,,,"In St Johnston, I don't know where this is.",LEM,2002-11-08T16:27:45,LEM,2002-11-08T16:29:08
3215,A/LA/3241,LA,3241,Jonet,Curchan,Janet,Curchan,,,,Female,,False,False,,,,,,,,,,,,,unable to modernise the surname and she is fro...,LEM,2002-11-08T16:30:17,LEM,2002-11-08T16:31:06
3216,A/LA/3242,LA,3242,James,Chalmer,James,Chalmers,,,,Male,,False,False,,,,,,,,,,,,,"He is from St John's Town, I don't know where ...",LEM,2002-11-08T16:31:55,LEM,2002-12-06T14:48:40
3217,A/LA/3243,LA,3243,Catherine,Campbell,Katherine,Campbell,,,,Female,,False,False,Fowlis,P/ST/1168,Forfar,Forfar,,,,,,,,,,LEM,2002-11-08T16:33:23,LEM,2002-11-08T16:36:06


In [17]:
# wow! that's a lot of confusing data!

# to get the contents of only one column you can call the column by name
print(witches_df['res_county'])

0       Haddington
1       Haddington
2         Dumfries
3         Dumfries
4         Stirling
           ...    
3214           NaN
3215           NaN
3216           NaN
3217        Forfar
3218        Forfar
Name: res_county, Length: 3219, dtype: object


In [18]:
# you can treat individual columns like lists by assigning them to a variable
witches_residence = witches_df['res_county']
print(type(witches_residence))

# ...but this is still a pandas series, so to make a column into a list "officially" to avoid surprise errors, you can cast the column to be a list
witches_residence = list(witches_df['res_county'])
print(type(witches_residence))

<class 'pandas.core.series.Series'>
<class 'list'>


In [19]:
# there are a lot of columns, so let's reshape our dataframe to only have a few we're interested in
witches_df = witches_df[['firstname', 'lastname', 'sex', 'age', 'res_county', 'maritalstatus', 'socioecstatus', 'occupation', 'notes']].copy()
witches_df

Unnamed: 0,firstname,lastname,sex,age,res_county,maritalstatus,socioecstatus,occupation,notes
0,Mareon,Quheitt,Female,,Haddington,,,,
1,Thom,Cockburn,Male,,Haddington,,,,
2,Christian,Aitkenhead,Female,,Dumfries,Married,,,
3,Janet,Ireland,Female,,Dumfries,Widowed,,,
4,Agnes,Hendersoun,Female,,Stirling,,,,
...,...,...,...,...,...,...,...,...,...
3214,Cristeane,Johnnestoun,Female,,,,,,"In St Johnston, I don't know where this is."
3215,Jonet,Curchan,Female,,,,,,unable to modernise the surname and she is fro...
3216,James,Chalmer,Male,,,,,,"He is from St John's Town, I don't know where ..."
3217,Catherine,Campbell,Female,,Forfar,,,,


In [20]:
# much better! now we can change a column name to make naming clearer
witches_df = witches_df.rename(columns={"res_county": "residing_county"})
witches_df

Unnamed: 0,firstname,lastname,sex,age,residing_county,maritalstatus,socioecstatus,occupation,notes
0,Mareon,Quheitt,Female,,Haddington,,,,
1,Thom,Cockburn,Male,,Haddington,,,,
2,Christian,Aitkenhead,Female,,Dumfries,Married,,,
3,Janet,Ireland,Female,,Dumfries,Widowed,,,
4,Agnes,Hendersoun,Female,,Stirling,,,,
...,...,...,...,...,...,...,...,...,...
3214,Cristeane,Johnnestoun,Female,,,,,,"In St Johnston, I don't know where this is."
3215,Jonet,Curchan,Female,,,,,,unable to modernise the surname and she is fro...
3216,James,Chalmer,Male,,,,,,"He is from St John's Town, I don't know where ..."
3217,Catherine,Campbell,Female,,Forfar,,,,


In [21]:
# let's say we want to look at the occupation of each accused witch
# there are a lot of NaN values ("Not a Number", which is how Pandas marks blank cells) which we can filter out using Pandas's .loc[] indexer and .notna() method
witches_df.loc[witches_df["occupation"].notna()]

# NOTE: there is also a function .isna() that does the opposite of .notna()!

Unnamed: 0,firstname,lastname,sex,age,residing_county,maritalstatus,socioecstatus,occupation,notes
21,Niniane,Chirneyside,Male,,Edinburgh,,Middling,Servant,He was the Servant to Earl of Bothwell. There...
38,Margaret,Muirhead,Female,,Edinburgh,,Landless,Vagabond,few details.
55,John,McReadie,Male,,Berwick,,Middling,Weaver,
76,Isobel,Rutherfurde,Female,,Peebles,,Landless,Vagabond,
86,Janet,Melros,Female,,,,Lower,Midwife,
...,...,...,...,...,...,...,...,...,...
3138,Margaret,Fraser,Female,,Aberdeen,,Landless,Vagabond,The Privy councic commission did not specify a...
3155,Marion,Dobie,Female,,Haddington,,,Midwife,
3167,Williams,Weems,Male,,Berwick,,Lower,Sailor,
3175,Henry,Hoggart,Male,,Berwick,,Lower,Creelman,described as a creillman - one who carries goo...
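
It may help to see what happens inside the square brackets. In the sketch below, `.notna()` produces a column of `True`/`False` values (often called a boolean mask), and `.loc[]` keeps only the rows where the mask is `True`. The miniature dataframe `people_df` is made up for illustration and assumes pandas is installed.

```python
import pandas as pd

# hypothetical miniature version of the dataframe
people_df = pd.DataFrame({
    "firstname": ["Mareon", "Janet", "John"],
    "occupation": [None, "Midwife", "Weaver"],
})

# step 1: build the mask, one True/False value per row
has_occupation = people_df["occupation"].notna()
print(has_occupation.tolist())
# → [False, True, True]

# step 2: keep only the rows where the mask is True
print(people_df.loc[has_occupation])
```

The comparison `witches_df["occupation"] == "Midwife"` builds the same kind of mask, just with a different test.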


In [22]:
# if I want to look only at those who were midwives, I can use .loc[] with a comparison operator
witches_df.loc[witches_df["occupation"] == "Midwife"]

Unnamed: 0,firstname,lastname,sex,age,residing_county,maritalstatus,socioecstatus,occupation,notes
86,Janet,Melros,Female,,,,Lower,Midwife,
250,Alisone,Nisbet,Female,,Berwick,Married,Lower,Midwife,She is also referred to as Elie Nesbitt. Repu...
1557,Helen,Beatie,Female,,Peebles,,,Midwife,
1741,Marioun,Lynn,Female,,Haddington,Widowed,Lower,Midwife,"Servitor to John ??, Also described as a midwife."
1776,Beatrix,Leslie,Female,84.0,Edinburgh,Married,Lower,Midwife,She was described as having left her 'pock' in...
2065,Bessie,Gourlie,Female,,Edinburgh,Married,Lower,Midwife,
2620,Margerat,Bane,Female,55.0,Aberdeen,,Lower,Midwife,Bane appears to have been quite old as her dau...
2921,Unknown,Bell,Female,,Lanark,,Middling,Midwife,
3155,Marion,Dobie,Female,,Haddington,,,Midwife,


In [23]:
# like FreqDist in NLTK, Pandas has .value_counts() which will tally up the occurrences of unique values in a given column
# so let's check the distribution of occupations
witches_df["occupation"].value_counts()

occupation
Servant           23
Vagabond          23
Midwife            9
Weaver             8
Miller             3
Tailor             2
Messenger          2
Nurse              2
Smith              2
Minister           2
Farmer             2
School teacher     2
Shop-keeper        2
Merchant           2
School Master      1
Workman            1
Cook               1
Fisherman          1
Henwife            1
Sailor             1
Slaterer           1
Loadman            1
Collier            1
Stabler            1
Blacksmith         1
Healer             1
Maltman            1
Tasker             1
Mealmaker          1
Brewster           1
Creelman           1
Name: count, dtype: int64
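
Note that `.value_counts()` leaves blank cells out of the tally by default. If you also want to know how many values are missing, the `dropna=False` argument includes them, as in this small sketch with a made-up `occupations` column (assuming pandas is installed).

```python
import pandas as pd

# hypothetical column with three blank cells
occupations = pd.Series(
    ["Servant", "Midwife", None, "Servant", None, None],
    name="occupation",
)

# default: missing values are ignored
print(occupations.value_counts())

# dropna=False: missing values are counted as their own category
print(occupations.value_counts(dropna=False))
```

In a dataset where most cells are blank, comparing the two tallies gives a quick sense of how much of the column you are actually looking at.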

In [24]:
# if we want all basic statistics for numerical columns we can use .describe()
# I'm interested to see the mean age of the accused
witches_df.describe()

Unnamed: 0,age
count,166.0
mean,43.126506
std,14.205919
min,9.0
25%,34.25
50%,45.0
75%,50.0
max,100.0
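
If only one statistic is needed, it can also be requested directly from a single column. The sketch below uses a made-up three-row dataframe, `ages_df`, and assumes pandas is installed; note that `.mean()` skips blank cells, which is also why `count` in the table above is far lower than the number of rows.

```python
import pandas as pd

# hypothetical miniature dataframe with one missing age
ages_df = pd.DataFrame({
    "firstname": ["Beatrix", "Margerat", "Janet"],
    "age": [84.0, 55.0, None],
})

# the mean is calculated from the two known ages only
mean_age = ages_df["age"].mean()
print(mean_age)
# → 69.5

# .count() reports how many non-missing values went into the calculation
print(ages_df["age"].count())
```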


In [25]:
# if we want to replace all instances of NaN in the dataframe with something more meaningful we can use the .fillna() function
witches_df = witches_df.fillna("Unknown")
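
One thing to be aware of is that filling every column with `"Unknown"` also places text in the numeric `age` column. If you expect to calculate with a column later, one option is to pass `.fillna()` a dictionary naming only the columns to fill. The sketch below uses a made-up two-row dataframe and assumes pandas is installed.

```python
import pandas as pd

# hypothetical miniature dataframe
people_df = pd.DataFrame({
    "firstname": ["Mareon", "Beatrix"],
    "age": [None, 84.0],
    "occupation": [None, "Midwife"],
})

# fill only the text column; missing ages stay as NaN so the column remains numeric
people_df = people_df.fillna({"occupation": "Unknown"})
print(people_df)

# the age column can still be used for calculations
print(people_df["age"].mean())
```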

In [26]:
witches_df

Unnamed: 0,firstname,lastname,sex,age,residing_county,maritalstatus,socioecstatus,occupation,notes
0,Mareon,Quheitt,Female,Unknown,Haddington,Unknown,Unknown,Unknown,Unknown
1,Thom,Cockburn,Male,Unknown,Haddington,Unknown,Unknown,Unknown,Unknown
2,Christian,Aitkenhead,Female,Unknown,Dumfries,Married,Unknown,Unknown,Unknown
3,Janet,Ireland,Female,Unknown,Dumfries,Widowed,Unknown,Unknown,Unknown
4,Agnes,Hendersoun,Female,Unknown,Stirling,Unknown,Unknown,Unknown,Unknown
...,...,...,...,...,...,...,...,...,...
3214,Cristeane,Johnnestoun,Female,Unknown,Unknown,Unknown,Unknown,Unknown,"In St Johnston, I don't know where this is."
3215,Jonet,Curchan,Female,Unknown,Unknown,Unknown,Unknown,Unknown,unable to modernise the surname and she is fro...
3216,James,Chalmer,Male,Unknown,Unknown,Unknown,Unknown,Unknown,"He is from St John's Town, I don't know where ..."
3217,Catherine,Campbell,Female,Unknown,Forfar,Unknown,Unknown,Unknown,Unknown


In [27]:
# and take note that you can apply string methods to any column! 
# let's make everything in the "notes" column lowercase so it's normalised in case you need it for text analysis later
witches_df["notes"] = witches_df["notes"].str.lower()


In [28]:
witches_df

Unnamed: 0,firstname,lastname,sex,age,residing_county,maritalstatus,socioecstatus,occupation,notes
0,Mareon,Quheitt,Female,Unknown,Haddington,Unknown,Unknown,Unknown,unknown
1,Thom,Cockburn,Male,Unknown,Haddington,Unknown,Unknown,Unknown,unknown
2,Christian,Aitkenhead,Female,Unknown,Dumfries,Married,Unknown,Unknown,unknown
3,Janet,Ireland,Female,Unknown,Dumfries,Widowed,Unknown,Unknown,unknown
4,Agnes,Hendersoun,Female,Unknown,Stirling,Unknown,Unknown,Unknown,unknown
...,...,...,...,...,...,...,...,...,...
3214,Cristeane,Johnnestoun,Female,Unknown,Unknown,Unknown,Unknown,Unknown,"in st johnston, i don't know where this is."
3215,Jonet,Curchan,Female,Unknown,Unknown,Unknown,Unknown,Unknown,unable to modernise the surname and she is fro...
3216,James,Chalmer,Male,Unknown,Unknown,Unknown,Unknown,Unknown,"he is from st john's town, i don't know where ..."
3217,Catherine,Campbell,Female,Unknown,Forfar,Unknown,Unknown,Unknown,unknown


In [29]:
# finally, Pandas makes it really easy to export your dataframe as a CSV for publication or later use
witches_df.to_csv("accused_cleaned.csv")
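
By default, `.to_csv()` also writes the row numbers (the index) as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. Passing `index=False` leaves it out. The sketch below uses a made-up two-row dataframe and builds the CSV text in memory rather than writing a file; it assumes pandas is installed.

```python
import io
import pandas as pd

# hypothetical miniature dataframe
people_df = pd.DataFrame({
    "firstname": ["Mareon", "Thom"],
    "sex": ["Female", "Male"],
})

# with no file name, to_csv() returns the CSV as a string
with_index = people_df.to_csv()
without_index = people_df.to_csv(index=False)
print(with_index)
print(without_index)

# reading the first version back in shows the extra column
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
# → ['Unnamed: 0', 'firstname', 'sex']
```

Whether to keep the index depends on your data: if the row numbers carry no meaning, `index=False` usually gives a cleaner file.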

# Putting Everything Together

In [62]:
# write activity code here
data = pd.read_csv("Excel_20240611_153617.csv", delimiter=",")
data['Description'].to_csv('des.txt', index=False, header=None)

In [24]:

import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist

from string import punctuation
punctuation = list(punctuation)

nltk.download('punkt')
nltk.download('stopwords')

transcript = open('des.txt', encoding="utf-8").read().lower()

transcript_words = word_tokenize(transcript)

stop_words = stopwords.words('english')
stop_words.extend(["n't", "'s", 'would', "''", '``'])

filtered_transcript_words = []
for word in transcript_words:
    if word not in stop_words and word not in punctuation:
        filtered_transcript_words.append(word)

transcript_fdist = FreqDist(filtered_transcript_words)
transcript_fdist.most_common(10)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('digital', 23),
 ('humanities', 15),
 ('includes', 7),
 ('bibliographical', 6),
 ('references', 6),
 ('research', 5),
 ('studies', 5),
 ('index', 4),
 ('volume', 4),
 ('global', 4)]

## Answer key (No peeking!)

In [5]:
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')

pd.options.display.max_rows = 100

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ZJW\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
litrev_df = pd.read_csv('digihum-lit-rev.csv', delimiter=",")

litrev_df

Unnamed: 0,Title,Creator,Supplement to,Dissertation,Subject,Subject.1,Genre,Description,Contents,Uniform title,...,Publisher,Creation Date,Format,Edition,Frequency,General note,Local note,Source,Bound with,Permalink
0,Visual Text Analysis in Digital Humanities,"Jänicke, S ; Franzini, G ; Cheema, M. F ; S...","Computer graphics forum, 2017-09, Vol.36 (6), ...",,close reading ; digital humanities ; distant...,,,"In 2005, Franco Moretti introduced Distant Rea...",,,...,"Wiley Subscription Services, Inc",,,,,,,Wiley-Blackwell Full Collection - CRKN; Busine...,,https://ocul-crl.primo.exlibrisgroup.com/perma...
1,CASS Digital Humanities Feature: An Introduction,"Reischl, Katherine M.H","Canadian-American Slavic studies, 2021-03-25, ...",,,,,,,,...,,,,,,,,Alma/SFX Local Collection,,https://ocul-crl.primo.exlibrisgroup.com/perma...
2,"Defining Digital Theology: Digital Humanities,...","Phillips, Peter ; Schiefelbein-Guerrero, Kyle...","Open theology, 2019-05-22, Vol.5 (1), p.29-43",,CODEC ; Computing for Humanities ; digital c...,,,"This article seeks to define Digital Theology,...",,,...,De Gruyter,,,,,,,De Gruyter Open Access Journals,,https://ocul-crl.primo.exlibrisgroup.com/perma...
3,Distribution features and intellectual structu...,"Wang, Qing","Journal of documentation, 2018-01-08, Vol.74 (...",,Author productivity ; Bibliometrics ; Cultur...,,,Purpose The purpose of this paper is to conduc...,,,...,Bradford: Emerald Group Publishing Limited,,,,,,,Computer Science Database,,https://ocul-crl.primo.exlibrisgroup.com/perma...
4,On the Meanings of Self-Regulation: Digital Hu...,"Burman, Jeremy T ; Green, Christopher D ; Sh...","Child development, 2015-09, Vol.86 (5), p.1507...",,Controlled vocabularies ; EMPIRICAL ARTICLES ...,,,Self-regulation is of interest both to psychol...,,,...,United States: Blackwell Publishing Ltd,,,,,,,Wiley-Blackwell Full Collection - CRKN; MEDLIN...,,https://ocul-crl.primo.exlibrisgroup.com/perma...
5,"Natural allies: Librarians, archivists, and bi...","Poole, Alex H ; Garwood, Deborah A","Journal of documentation, 2018-07-09, Vol.74 (...",,,,,,,,...,,,,,,,,Computer Science Database; Emerald Management ...,,https://ocul-crl.primo.exlibrisgroup.com/perma...
6,A Chinese ancient book digital humanities rese...,"Chen, Chih-Ming ; Chang, Chung","Electronic library, 2019, Vol.37 (2), p.314-336",,Academic libraries ; Annotations ; Archival ...,,,Presents a Chinese ancient books digital human...,,,...,Oxford: Emerald Group Publishing Limited,,,,,,,Computer Science Database,,https://ocul-crl.primo.exlibrisgroup.com/perma...
7,Library and information science and the digita...,"Koltay, Tibor","Journal of documentation, 2016-07-11, Vol.72 (...",,,,,,,,...,,,,,,,,Computer Science Database; Emerald Management ...,,https://ocul-crl.primo.exlibrisgroup.com/perma...
8,Digital humanities in Sweden and its infrastru...,"Golub, Koraljka ; Göransson, Elisabet ; Foka...","Digital Scholarship in the Humanities, 2019-08...",,Annan humaniora ; Annan teknik ; digital hum...,,,Abstract The article offers a state-of-the-art...,,,...,,,,,,,,Oxford Journals - CRKN,,https://ocul-crl.primo.exlibrisgroup.com/perma...
9,Editorial for the Special Issue on “Digital Hu...,"Gonzalez-Perez, Cesar","Information (Basel), 2020-07-10, Vol.11 (7), p...",,Archaeology ; Artificial intelligence ; Data...,,,Digital humanities are often described in term...,,,...,Basel: MDPI AG,,,,,,,Computer Science Database; Advanced Technologi...,,https://ocul-crl.primo.exlibrisgroup.com/perma...


In [25]:
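# keep only the rows that actually have an abstract
litrev_df = litrev_df[litrev_df["Description"].notna()]

# join every abstract into one long string
abstracts_as_text = ""

for abstract in litrev_df["Description"]:
    abstracts_as_text += abstract + "\n"

# lowercase the text and split it into individual tokens
abstractTokens = word_tokenize(abstracts_as_text.lower())

# load the stopword list once, rather than on every pass through the loop
english_stopwords = set(stopwords.words("english"))

cleaned_abstractTokens = []

# keep only alphabetic tokens that are not stopwords
for word in abstractTokens:
    if word not in english_stopwords and word.isalpha():
        cleaned_abstractTokens.append(word)

# note: despite the column name, this column holds every cleaned token, repeats included
abstracts_df = pd.DataFrame(cleaned_abstractTokens, columns=['uniqueWords'])

# count how many times each word appears, most frequent first
keywords = abstracts_df["uniqueWords"].value_counts()

keywords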

uniqueWords
digital          138
humanities       101
research          54
dh                48
analysis          27
                ... 
open               1
consider           1
peers              1
assessing          1
applicability      1
Name: count, Length: 1762, dtype: int64
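
If you would rather see the counts as a list of `(word, count)` pairs, Python's built-in `collections.Counter` is one alternative to `value_counts()`. The sketch below is only an illustration, not the method used in the answer key: `sample_tokens` is a small made-up list standing in for `cleaned_abstractTokens`, so the numbers do not come from the real dataset.

```python
from collections import Counter

# hypothetical stand-in for cleaned_abstractTokens (made-up data, not the real abstracts)
sample_tokens = ["digital", "humanities", "digital", "research",
                 "humanities", "digital", "archive"]

# Counter tallies how many times each token appears in the list
token_counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent tokens as (word, count) tuples
top_keywords = token_counts.most_common(2)

print(top_keywords)
# [('digital', 3), ('humanities', 2)]
```

To try this on the real data, you could pass `cleaned_abstractTokens` to `Counter` instead of the sample list.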

# Identifying and Solving Errors

Try to correct the following errors! For more of a challenge, try to identify each error before running the code 🔎

In [2]:
# Error 1

people = [
    {'name': 'Jolene', 'birth_year': 1955, 'death_year': 1972},
    {'name': 'George', 'birth_year': 1942, 'death_year': 2010},
    {'name': 'Charlene', 'birth_year': 1927, 'death_year': 1941},
    {'name': 'David', 'birth_year': 1830, 'death_year': 1923},
    {'name': 'Eve', 'birth_year': 1899, 'death_year': 1940},
]

print(people[5])

IndexError: list index out of range

In [1]:
# Error 2

# takes two arguments and returns their sum
def add_numbers(x, y):
    return x + y

result = add_numbers(5, 10)

print("The sum of the numbers is:", result)
print("The difference of the numbers is:", result2)

The sum of the numbers is: 15


NameError: name 'result2' is not defined

In [None]:
# Error 3

year = 1955
name = "Jolene Barrie"

result = name + " was born in " + year
print(result)

In [4]:
# Error 4

people = [
    {'name': 'Jolene', 'birth_year': 1955, 'death_year': 1972},
    {'name': 'George', 'birth_year': 1942, 'death_year': 2010},
    {'name': 'Charlene', 'birth_year': 1927, 'death_year': 1941},
    {'name': 'David', 'birth_year': 1830, 'death_year' 1923},
    {'name': 'Eve', 'birth_year': 1899, 'death_year': 1940},
]

for person in people:
    print("Age at death: " + str(person['death_year'] - person['birth_year']))

SyntaxError: invalid syntax (<ipython-input-4-c5d0a82bee4e>, line 7)

In [3]:
# Error 5

# convert strings to int
def convert_to_num(year):
    return int(year)

year = "1955"
name = "Jolene Barrie"

convert_to_num(name)

ValueError: invalid literal for int() with base 10: 'Jolene Barrie'
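
If you get stuck (no peeking until you've tried!), here is one possible set of corrections for the five errors above. It is a sketch rather than the only right answer: the names `subtract_numbers`, `last_person`, `sentence`, `ages_at_death`, `year_as_text`, and `year_as_number` are made up for this example, and Error 2 in particular could also be fixed by simply removing the line that prints `result2`.

```python
people = [
    {'name': 'Jolene', 'birth_year': 1955, 'death_year': 1972},
    {'name': 'George', 'birth_year': 1942, 'death_year': 2010},
    {'name': 'Charlene', 'birth_year': 1927, 'death_year': 1941},
    {'name': 'David', 'birth_year': 1830, 'death_year': 1923},  # Error 4: the missing colon is restored
    {'name': 'Eve', 'birth_year': 1899, 'death_year': 1940},
]

# Error 1: indexes start at 0, so a list of five items only has indexes 0 through 4
last_person = people[4]
print(last_person)

# Error 2: result2 was never defined, so we create it before printing it
def add_numbers(x, y):
    return x + y

def subtract_numbers(x, y):  # hypothetical helper added for this sketch
    return x - y

result = add_numbers(5, 10)
result2 = subtract_numbers(5, 10)
print("The sum of the numbers is:", result)
print("The difference of the numbers is:", result2)

# Error 3: a string and an integer cannot be joined with +, so convert the integer with str()
year = 1955
name = "Jolene Barrie"
sentence = name + " was born in " + str(year)
print(sentence)

# Error 4: with the colon fixed in the list above, the loop runs as intended
ages_at_death = []
for person in people:
    age = person['death_year'] - person['birth_year']
    ages_at_death.append(age)
    print("Age at death: " + str(age))

# Error 5: int() can only convert text that looks like a number, so pass the year, not the name
def convert_to_num(year):
    return int(year)

year_as_text = "1955"
year_as_number = convert_to_num(year_as_text)
print(year_as_number)
```

In every case the last line of the error message names the kind of error (`IndexError`, `NameError`, `SyntaxError`, `ValueError`), which is usually the quickest clue to what went wrong.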