<div style="display: flex;">
<!-- ; background:url('https://desirschool.sciencesconf.org/data/header/archives_1.jpg')  repeat center center; color: #fff; -->    
  <div style="flex: 33%;">
      <img src="https://desirschool.sciencesconf.org/data/header/DESIR_logo_1.jpg" width=500>
  </div>
  <div style="flex: 66%; margin: 1em; text-align: center;">
    <h1> DESIR Winter School: Shaping new approaches to data management in arts and humanities </h1>
    <h2> Open Research Notebooks </h2>
    <h3> 10-13 Dec 2019 Lisbon (Portugal) </h3>
  </div>
</div>

### About me
<br>
<div style="display: flex;">
  <div style="flex: 50%;">
      <img src="https://www.dropbox.com/s/8u2cy57qpz4yx1y/profile_pic.jpg?raw=1" width="200" />
  </div>
  <div style="flex: 50%;margin: 1em;">
      <b>Javier de la Rosa</b>, <a href="mailto:versae@gmail.com"><i>versae@gmail.com</i></a>, <a href="https://twitter.com/versae"><i>@versae</i></a>
      <br />
      <br />
      <div style="padding-left: 1em;">
      Postdoctoral Researcher in NLP at UNED (ERC POSTDATA Project), Spain
      <br />
      PhD in Hispanic Studies (Digital Humanities), University of Western Ontario, Canada
      <br />
      Master in Artificial Intelligence, Universidad de Sevilla, Spain
      <br />
      <br />
      Ex-Research Software Engineer at Stanford University, California
      <br />
      Ex-Technical Lead at the CulturePlex Lab, University of Western Ontario, Canada
      </div>
  </div>
</div>

# Reproducible Research

We'll be focusing on <strong>Mary Chester-Kadwell's <a href="https://github.com/mchesterkadwell/bughunt-analysis" target="_blank">"text mining of English children’s literature 1789-1914 for the representation of insects and other creepy crawlies"</a></strong>. As such, this notebook is basically an adaptation of her MIT-licensed notebooks and all credit goes to her.

## Simple String Manipulation in Python
This section introduces some very basic things you can do in Python to create and manipulate *strings*. A string is a simple sequence of characters, like `flabbergast`. This introduction is limited to those things that may be useful to know in order to understand the *Bughunt!* data mining in the following two notebooks.

### Creating and Storing Strings in Variables
Strings are simple to create in Python. You can simply write some characters in quote marks.

In [None]:
'Butterflies are important as pollinators.'

In order to do something useful with this string, other than print it out, we need to store it in a *variable* by using the assignment operator `=` (equals sign). Whatever is on the right-hand side of the `=` is stored into a variable with the name on the left-hand side.

In [None]:
# my_variable is the variable on the left
# 'Butterflies are important as pollinators.' is the string on the right
# that is stored in the variable my_variable

my_variable = 'Butterflies are important as pollinators.'

Notice that nothing is printing to the screen. That's because the string is stored in the variable `my_variable`. In order to see what is inside the variable `my_variable` we can simply write `my_variable` in a code cell, run it, and the interpreter will print it out for us.

In [None]:
my_variable

### Manipulating Bits of Strings

#### Accessing Individual Characters
A string is just a sequence (or list) of characters. You can access **individual characters** in a string by specifying which one you want in square brackets. If you want the first character, you might expect to specify `1`.

In [None]:
my_variable[1]

Hang on a minute! Why did it give us `u` instead of `B`?

In programming, everything tends to be *zero indexed*, which means that things are counted from 0 rather than 1. Thus, in the example above, `1` gives us the *second* character in the string.

If you want the first character in the string, you need to specify the index `0`! 

In [None]:
my_variable[0]

#### Accessing a Range of Characters

You can also pick out a **range of characters** from within a string, by giving the *start index* followed by the *end index* with a colon (`:`) in between.

The example below gives us the character at index `0` all the way up to, *but not including*, the character at index `20`.

In [None]:
my_variable[0:20]
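
Slices can also leave out the start or the end index, and negative numbers count from the end of the string. The following is a minimal sketch using the same sentence; the variable names `sentence`, `first_word`, `last_character` and `last_word` are only illustrative.

```python
sentence = 'Butterflies are important as pollinators.'

# Leaving out the start index means "from the beginning"
first_word = sentence[:11]

# Negative indices count backwards from the end of the string
last_character = sentence[-1]

# Leaving out the end index means "up to the end"
last_word = sentence[-12:]

print(first_word)
print(last_character)
print(last_word)
```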

### Changing Whole Strings with Functions
Python strings have some built-in *functions* (strictly speaking, *methods*, because they are called on the string itself) that transform a whole string at once. You can change all characters to lowercase or uppercase:

In [None]:
my_variable.lower()

In [None]:
my_variable.upper()

NB: These functions do not change the original string but create a new one. Our original string is still the same as it was before:

In [None]:
my_variable
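
If you do want to keep the changed version, one option is to store the result in a new variable. A small sketch (the name `shouting` is just an example):

```python
my_variable = 'Butterflies are important as pollinators.'

# upper() returns a new string; store it if you want to keep it
shouting = my_variable.upper()

print(shouting)
print(my_variable)  # the original string is unchanged
```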

### Testing Strings

You can also test a string to see if it passes some test, e.g. does the string consist of alphabetic characters only?

In [None]:
my_variable.isalpha()

Does the string have the letter `p` in it?

In [None]:
'p' in my_variable
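
Our sentence is not purely alphabetic because spaces and the full stop count as non-alphabetic characters. The sketch below, with illustrative variable names, contrasts it with a single word and also shows that these tests are case-sensitive.

```python
sentence = 'Butterflies are important as pollinators.'
word = 'Butterflies'

sentence_is_alpha = sentence.isalpha()  # False: spaces and '.' are not letters
word_is_alpha = word.isalpha()          # True: only letters here

# Membership tests are case-sensitive
has_lower_b = 'b' in word
has_upper_b = 'B' in word

print(sentence_is_alpha, word_is_alpha)
print(has_lower_b, has_upper_b)
```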

### Lists of Strings
Another important thing we can do with strings is creating a list of strings by listing them inside square brackets `[]`:

In [None]:
my_list = ['Butterflies are important as pollinators',
          'Butterflies feed primarily on nectar from flowers',
          'Butterflies are widely used in objects of art']
my_list

### Manipulating Lists of Strings
Just like with strings, we can access individual items inside a list by index number:

In [None]:
my_list[0]

And we can access a range of items inside a list by *slicing*:

In [None]:
my_list[0:2]

### Advanced: Creating Lists of Strings with List Comprehensions
We can create new lists in an elegant way by combining some of the things we have covered above. Here is an example where we have taken our original list `my_list` and created a new list `new_list` by going over each string in the list:

In [None]:
new_list = [string for string in my_list]
new_list

Why do this? If we combine it with a test, we can have a list that only contains strings with the letter `p` in them:

In [None]:
new_list_p = [string for string in my_list if 'p' in string]
new_list_p

This is a very powerful way to quickly create lists. We can even change all the strings to uppercase at the same time!

In [None]:
new_list_p_upper = [string.upper() for string in my_list if 'p' in string]
new_list_p_upper
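
If list comprehensions feel too compact at first, it may help to see the same logic written as an ordinary loop. This is only a sketch for comparison; `loop_result` and `comprehension_result` are names chosen for this example.

```python
my_list = ['Butterflies are important as pollinators',
           'Butterflies feed primarily on nectar from flowers',
           'Butterflies are widely used in objects of art']

# Loop version of: [string.upper() for string in my_list if 'p' in string]
loop_result = []
for string in my_list:
    if 'p' in string:                       # the test
        loop_result.append(string.upper())  # the change applied to each item

comprehension_result = [string.upper() for string in my_list if 'p' in string]

# Both approaches build the same list
print(loop_result == comprehension_result)
```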

# Text Processing the Bughunt Corpus
This notebook follows the process of taking the manually cleaned Bughunt corpus and creating a frequency distribution of the different bug words.

NB: This notebook does not actually process the whole corpus -- that is done by the script `insect-freq-unigram.py`. The examples here are a walk-through and explanation of the code using a single file.

We will use the code library called Natural Language Toolkit (NLTK) to provide a lot of text mining functions that are already written. More information on this can be found here: http://www.nltk.org/

## Corpus Files

We already have the corpus **split into files by decade**. Here is a list of them:

In [None]:
import os
from pathlib import Path
import requests

if not os.path.exists('corpora'):
    os.makedirs('corpora')
    
urls = "https://raw.githubusercontent.com/mchesterkadwell/bughunt-analysis/master/corpora/bughunt/2-clean-by-decade/bughunt-clean-{}.txt"
for i in range(1800, 1920, 10):
    url = urls.format(str(i))
    filename = url.rsplit("/", 1)[1]
    print("Downloading and storing", filename)
    text = requests.get(url).text
    with open(Path("corpora", filename), "w") as f:
        f.write(text)
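
To see which files the loop above asks for without touching the network, we can build just the file names. This is a small sketch; `template`, `decades` and `filenames` are illustrative names, and only the file name part of the URL is used.

```python
template = "bughunt-clean-{}.txt"

# range() stops before its end value, so 1920 itself is not included
decades = list(range(1800, 1920, 10))
filenames = [template.format(decade) for decade in decades]

print(len(filenames))
print(filenames[0], filenames[-1])
```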

## Preparing to Process
Before we are ready to process these files, we need to gather together some resources.

### Bug Words
We have our list of **simple bug words** as a text file. Here it is:

In [None]:
url = "https://raw.githubusercontent.com/mchesterkadwell/bughunt-analysis/master/wordlists/insect-wordlist.txt"
wordlist = requests.get(url).text
with open(Path("insect-wordlist.txt"), "w") as f:
    f.write(wordlist)
bug_words = wordlist.split("\n")
bug_words

We also have a list of the **stems** of bug words. **Stemming** is a form of word normalisation. It means reducing a word to its root, eliminating plurals and other inflections. Stems may not be actual words. 

In [None]:
url = "https://raw.githubusercontent.com/mchesterkadwell/bughunt-analysis/master/wordlists/insect-wordstems.txt"
wordstems = requests.get(url).text
with open(Path("insect-wordstems.txt"), "w") as f:
    f.write(wordstems)
bug_stems = wordstems.split("\n")
bug_stems

As you can see in the list above, the stems 'butterfli', 'dragonfli' and 'fli' are not real words.

This contrasts with **lemmatisation** where the reduced word, the **lemma**, is a proper word in the language; in fact, it is the canonical or dictionary form.

### English Stopwords
We are not interested in common words in English that carry little meaning, such as 'the', 'a' and 'its'. There is no definitive list of stopwords, but a commonly used list is provided by the Natural Language Toolkit (NLTK).

In [None]:
import sys
import nltk
!{sys.executable} -m nltk.downloader stopwords
# nltk.download('stopwords', download_dir=Path('nltk_data'))
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
sorted(list(english_stops))[:20]

## Tokenising the Corpus
Tokenising means splitting a text into meaningful elements, such as words, sentences, or symbols.

To do this we use a simple facility provided by the NLTK to read in the files and a function to do the tokenising for us. The code example below takes a single corpus file and tokenises it. 

In [None]:
# nltk.download('punkt', download_dir=Path('..', 'nltk_data'))
!{sys.executable} -m nltk.downloader punkt

from nltk.corpus.reader import PlaintextCorpusReader
reader = PlaintextCorpusReader('.', '')
file_1810 = os.path.join("corpora", 'bughunt-clean-1810.txt')
text = reader.raw(file_1810)

from nltk import word_tokenize
tokens = word_tokenize(text)
tokens[:20]

There are a number of problems with these tokens: the capitalisation of the words has been preserved, and some of the tokens have unwanted special characters or comprise single items of punctuation.
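
To make the idea of tokenising concrete, here is a deliberately simple tokeniser built on a regular expression. It is not the algorithm NLTK's `word_tokenize` uses, and the function name `simple_tokenize` and the sample sentence are made up for this sketch, but it shows the same two effects: capital letters are preserved and punctuation marks become tokens of their own.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The Gnat, it seems, was small."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```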

### Normalising to Lowercase
Normalising all words in a corpus to lowercase ensures that the same word in different cases can be recognised as the same word, e.g. we want 'Gnat', 'gnat' and 'GNAT' to be recognised as the same word.

However, whether you choose to do this depends on the nature of your corpus and the questions you are investigating. For example, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.

In our case, we will lowercase the whole corpus immediately before tokenising it:

In [None]:
tokens = word_tokenize(text.lower())
tokens[:20]

### Removing Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in, the words "termite" and "termite," might be considered to be different words.

This is a complicated matter, however, and what you choose to do would vary depending on the nature of your corpus and what questions you wish to ask.

It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.

We will replace *all* punctuation with the empty string ''.

In [None]:
import string
table = str.maketrans('', '', string.punctuation)
tokens_nopunct = [token.translate(table) for token in tokens]
tokens_nopunct[:20]
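
The effect of the translation table is easier to see on a handful of tokens. The sketch below uses a made-up list, `sample_tokens`, and shows that a token consisting only of punctuation becomes the empty string.

```python
import string

# Map every punctuation character to "nothing" (i.e. delete it)
table = str.maketrans('', '', string.punctuation)

sample_tokens = ['termite,', "don't", '.', 'gnat']
cleaned = [token.translate(table) for token in sample_tokens]
print(cleaned)
```

Note that `"don't"` becomes `'dont'`, which matters later when we match against the stopword list.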

### Removing Non-Word Tokens

We are still left with some problematic tokens that are not useful words, such as empty tokens `''` and tokens that may be chapter numbers:

In [None]:
tokens_empty = [word for word in tokens_nopunct if not word.isalpha()]
tokens_empty[:10]

In [None]:
tokens_nonwords = [word for word in tokens_nopunct if word.isnumeric()]
tokens_nonwords[:10]

We can remove both these by filtering for only those words that are alphabetic:

In [None]:
words = [word for word in tokens_nopunct if word.isalpha()]
words[:20]
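
The difference between the tests used above can be checked on a few made-up tokens. In this sketch (all names are illustrative), note that the empty string and mixed tokens such as `'3rd'` are neither alphabetic nor numeric, so filtering with `isalpha()` removes them as well.

```python
sample_tokens = ['', '12', 'gnat', 'iv', '3rd']

only_alpha = [t for t in sample_tokens if t.isalpha()]
only_numeric = [t for t in sample_tokens if t.isnumeric()]
neither = [t for t in sample_tokens
           if not t.isalpha() and not t.isnumeric()]

print(only_alpha)
print(only_numeric)
print(neither)
```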

### Removing Stopwords
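We are now ready to remove the stopwords we prepared earlier and thereby create a list of only meaningful words. Before using the stopwords, we will also remove the punctuation from the stopwords themselves, so that they match the punctuation-free tokens of the corpus (for example, a stopword containing an apostrophe must lose it to match).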

In [None]:
english_stops_nopunct = {stopword.translate(table) for stopword in english_stops}
words_nostops = [word for word in words if word not in english_stops_nopunct]
words_nostops[:20]
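
Here is the same filtering step on a tiny scale. The stopword set `toy_stops` is hand-picked for this sketch and is not NLTK's list; the point is only to show why the apostrophe has to be stripped from the stopwords before matching.

```python
import string

table = str.maketrans('', '', string.punctuation)

# A tiny hand-picked stopword set for illustration only (not NLTK's list)
toy_stops = {'the', 'a', 'its', "don't"}
toy_stops_nopunct = {stop.translate(table) for stop in toy_stops}

# Tokens as they look after punctuation removal: "don't" is now 'dont'
sample_words = ['the', 'gnat', 'dont', 'bite', 'its', 'prey']
kept = [word for word in sample_words if word not in toy_stops_nopunct]
print(kept)
```

Using a set rather than a list for the stopwords keeps each `not in` check fast, which helps on a large corpus.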

### Stemming the Tokens
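Stemming the tokens reduces plurals and other inflected forms to a common stem so that they can be counted as the same word. For example, 'butterflies' and 'butterfly' are both reduced to 'butterfli'. Stemming is rule-based, however, so irregular forms such as 'lice' and 'louse' are not necessarily merged, while related words such as 'lousy' may be pulled towards the stem of 'louse', which may or may not be desirable.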

To do this we use another facility provided by the NLTK called a **stemmer**. There are many different ways to stem words, but we will use the Porter Stemmer. (The Porter Stemmer is one of the oldest and most widely used stemmers, published by Martin Porter in 1980. It is simple and speedy, but has some important limitations.)

In [None]:
from nltk import PorterStemmer
porter = PorterStemmer()
stems = [porter.stem(word) for word in words_nostops]
stems[:20]
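
To get a feel for what a rule-based stemmer does, here is a deliberately naive suffix stripper. It is a toy written for this walk-through, not the Porter algorithm, and the function `naive_stem` and its three rules are assumptions made for illustration. It shows both that stems need not be real words and that irregular plurals slip through simple suffix rules.

```python
def naive_stem(word):
    """Toy suffix stripper for illustration (NOT the Porter algorithm)."""
    if word.endswith('ies') and len(word) > 4:
        return word[:-3] + 'i'      # butterflies -> butterfli
    if word.endswith('y') and len(word) > 2:
        return word[:-1] + 'i'      # butterfly -> butterfli
    if word.endswith('s') and not word.endswith('ss') and len(word) > 3:
        return word[:-1]            # gnats -> gnat
    return word                     # no rule applies

examples = ['butterflies', 'butterfly', 'gnats', 'gnat', 'lice', 'louse']
toy_stems = {word: naive_stem(word) for word in examples}
print(toy_stems)
```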

## Creating a Frequency Distribution
At last, we are ready to create a frequency distribution. We will use another NLTK facility called `FreqDist` to count the frequency of each unique word in the corpus, and then create a relative frequency value between `0` and `1`.

First, we create a frequency distribution:

In [None]:
from nltk.probability import FreqDist
freqdist = FreqDist(stems)

Here are the top 20 most frequent words (the numbers are the absolute word count):

In [None]:
freqdist.most_common(20)

We are not interested in a lot of these words, so the next thing to do is filter out all the words that are not in our list of bugs. Once we have done this we have a dictionary of stems and their relative frequencies.

In [None]:
from nltk.corpus.reader import WordListCorpusReader
insect_words = WordListCorpusReader('.', [Path('insect-wordstems.txt')])

insect_freq = {word: freqdist.freq(word) for word in insect_words.words()}
insect_freq
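
The relative frequency of a stem is simply its count divided by the total number of stems. The sketch below reproduces that idea with `collections.Counter` from the standard library instead of NLTK; the lists `toy_stems` and `toy_bug_stems` are made up for the example.

```python
from collections import Counter

# A made-up list of stems standing in for a processed corpus file
toy_stems = ['gnat', 'fli', 'gnat', 'bee', 'tree', 'fli', 'gnat', 'sun']
toy_bug_stems = ['gnat', 'fli', 'bee', 'wasp']

counts = Counter(toy_stems)
total = sum(counts.values())

# Relative frequency = count of the stem / total number of stems.
# A Counter returns 0 for stems it has never seen, such as 'wasp'.
toy_freq = {stem: counts[stem] / total for stem in toy_bug_stems}
print(toy_freq)
```

Stems that are not bug words ('tree', 'sun') still count towards the total, so the frequencies of the bug stems do not add up to 1.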

## What's Next
In the script `insect-freq-unigram.py` the process above is applied to each of the corpus files in turn, and the results are output as the CSV file `insect-stem-freq-unigram.csv`.

In [None]:
!python scripts/insect-freq-unigram.py

# Visualising the Bughunt Corpus
This notebook follows the process of taking the frequency distribution of the different bug words and creating a visualisation of how frequency changes over time.

NB: This notebook does not actually create the figure `insect-stem-freq-unigram.png` -- that is done by the script `insect-freq-unigram.py`. The examples here are a walk-through and explanation of the code.

We will use the code library called Natural Language Toolkit (NLTK) to provide a lot of text mining functions that are already written. More information on this can be found here: http://www.nltk.org/. We will also use two popular libraries: Pandas for data manipulation (https://pandas.pydata.org/) and matplotlib (https://matplotlib.org/) to create the graph.

## Preparing the Data
### Loading the CSV File

After text processing the corpus, the results were saved as a CSV file. First we have to load the data from this file into what is called a 'dataframe', which is much like a table.

In [None]:
from pathlib import Path
data_path = Path('insect-stem-freq-unigram.csv')

import pandas as pd
df = pd.read_csv(data_path, index_col='year').sort_index()
df
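
If you have not worked with this kind of file before, the layout assumed here is one row per decade, with a `year` column and one column per bug stem. The sketch below reads a tiny inline CSV with the standard library's `csv` module rather than Pandas; the two stem columns and all the numbers are made up purely to show the shape of the data.

```python
import csv
import io

# Made-up numbers, only to show the assumed layout: one row per decade
toy_csv = io.StringIO(
    "year,gnat,fli\n"
    "1810,0.0002,0.0010\n"
    "1800,0.0001,0.0008\n"
)

rows = list(csv.DictReader(toy_csv))

# Sort the rows by year, similar in spirit to .sort_index() on the dataframe
rows_sorted = sorted(rows, key=lambda row: int(row['year']))
years = [int(row['year']) for row in rows_sorted]
print(years)
```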

### Loading the Bug Word List
We also need the bug word list again from the text file.

In [None]:
from pathlib import Path
stemlist = Path('insect-wordstems.txt')

from nltk.corpus.reader import WordListCorpusReader
insect_words = WordListCorpusReader('.', [stemlist])
insect_words.words()

## Plotting
Now that we have the data ready, we can experiment with some plotting!

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from cycler import cycler

plt.rcParams['figure.figsize'] = [15, 10]
plt.style.use('fivethirtyeight')
cc = (cycler(color=['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', 
                    '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe', 
                    '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000', 
                    '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080',]) *
      cycler(linestyle=['-']))
plt.rc('axes', prop_cycle=cc)

ax = plt.gca()
for insect in insect_words.words():
    df.plot(kind='line', y=insect, ax=ax)

plt.axis([1800, 1910, 0, 0.009])
plt.xticks(np.arange(1800, 1920, 10))
plt.ylabel('frequency of bug stem')
plt.suptitle('Frequency of Bugs in Children\'s Literature by Decade 1800-1920')