# Literature analysis with unigrams: First the drudgery

Now that you have seen a few simple applications that use tokenization, it's time to look at something more realistic.
As you probably know from your English homework, comparing works of fiction can be a very hard and time-consuming task.
It would be much nicer if we could just have the computer do all the work.
But what could a computer possibly have to say about literature?

One simple idea is that an author's style is represented by which words (s)he uses, and in particular which words (s)he uses most.
Words are also known as *unigrams*.
This is in contrast to *bigrams*, which consist of two words, *trigrams* (three words), and so on.
For instance, the sentence

    John likes Mary and Peter
    
contains the unigrams

    John, likes, Mary, and, Peter
    
the bigrams

    John likes, likes Mary, Mary and, and Peter
    
and the trigrams

    John likes Mary, likes Mary and, Mary and Peter
    
We could also have 4-grams, 5-grams, or 127-grams.
Quite generally, a model that is based on words or sequences of words is called an *n-gram model*.
So if we want to analyze an author's style in terms of their word usage, we are proposing a unigram model of stylistic analysis.
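To make these definitions concrete, here is a minimal sketch that extracts n-grams from the example sentence. The helper name `ngrams` is made up for this illustration. It uses `range` and a list comprehension, which this unit has not introduced, so treat it as a preview rather than something you need to understand yet.

```python
def ngrams(words, n):
    """Return all n-grams of a list of words, each n-gram as one string."""
    # slide a window of n words over the list, one position at a time
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "John likes Mary and Peter"
words = sentence.split()

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(unigrams)  # ['John', 'likes', 'Mary', 'and', 'Peter']
print(bigrams)   # ['John likes', 'likes Mary', 'Mary and', 'and Peter']
print(trigrams)  # ['John likes Mary', 'likes Mary and', 'Mary and Peter']
```

If `n` is larger than the number of words, the sketch simply returns an empty list.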

But does a unigram model actually work?
Well, let's put the idea to the test: we will compare three works of fiction using this technique:

- William Shakespeare's *Hamlet*
- Christopher Marlowe's *The Tragical History of Dr. Faustus*
- Edgar Rice Burroughs's *A Princess of Mars*

If we find something interesting, then unigram models might be worthwhile after all.

A brief remark on those works: The first two are world-famous Elizabethan plays, whereas the third is an early 20th century pulp novel that you might know as the basis for Disney's 2012 box office debacle *John Carter*. Although the movie is better than its reputation, it still doesn't do justice to the book, so give it a read if you are in the mood for a fun science fantasy story.

## Getting the files

First we need to have the books in some digital format that we can feed into Python.
Ideally, we want this to be a plaintext format, i.e. the pure text without any layout information.
We do not want a pdf or doc file, as those are much harder to work with.
We can use Python to download all the files from [Project Gutenberg](https://www.gutenberg.org/), an online platform that hosts literary works that are no longer under copyright.

To do so, we first import the library `urllib.request` and then use the following command:

```python
urllib.request.urlretrieve("url_to_download", "filename_of_your_choice")
```

In [None]:
import urllib.request
urllib.request.urlretrieve("https://www.gutenberg.org/files/1524/1524-h/1524-h.htm", "hamlet.txt")
urllib.request.urlretrieve("https://www.gutenberg.org/cache/epub/811/pg811.txt", "faustus.txt")
urllib.request.urlretrieve("https://www.gutenberg.org/cache/epub/62/pg62.txt", "mars.txt")
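Every time you run the download cell, the files are fetched again and any local changes to them are overwritten. One way to avoid that is to download only if the file is not there yet. The helper `download_if_missing` below is a hypothetical name, not part of `urllib`. It is a sketch that assumes a file with the right name is also the right file.

```python
import os
import urllib.request


def download_if_missing(url, filename):
    """Download url to filename unless that file already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(filename):
        # nothing to do, keep the existing file untouched
        return False
    urllib.request.urlretrieve(url, filename)
    return True


# example call (commented out so that nothing is downloaded by accident):
# download_if_missing("https://www.gutenberg.org/cache/epub/62/pg62.txt", "mars.txt")
```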

**Exercise.**
Browse Project Gutenberg and find a book you really like.
Keep in mind that Project Gutenberg only has texts that are in the public domain, which means that they are no longer copyrighted.
So you won't see *Harry Potter*, *Hunger Games*, or even Stephen King's *It*, but almost everything from the 19th century and earlier can be found there.

Once you have picked a book, look at the different file formats.
You might see html (for display in web browsers), epub (an ebook format), and txt (plaintext, usually the easiest format for computational analysis).
Download one of them using the `urllib.request.urlretrieve` command and save it as `mybookpick.txt`.

In [None]:
# put your code here

Running the download code above should have put three files in the folder you are running this notebook from:

1.  `faustus.txt`
1.  `hamlet.txt`
1.  `mars.txt`

You can open them in CoCalc to look at their contents.
If you're not using CoCalc, open them with a text editor, for example Notepad if your computer is running Windows.
Scroll up and down a bit to get a better idea of what the files look like.
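If you would rather peek at a file from within the notebook, a small helper can return its first few lines. This is only a sketch: `first_lines` is a made-up name, and the code assumes the files are encoded as UTF-8.

```python
def first_lines(filename, n=10):
    """Return the first n lines of a text file as a list of strings."""
    lines = []
    with open(filename, mode="r", encoding="utf-8") as text:
        for line in text:
            # stop as soon as we have collected n lines
            if len(lines) == n:
                break
            lines.append(line.rstrip("\n"))
    return lines


# example call (requires that mars.txt has been downloaded):
# print(first_lines("mars.txt", 20))
```

Because the function stops reading after `n` lines, it stays fast even for very long files.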

**Exercise.**
Write down a list of the things that stand out to you in these files.
In particular:

1. Do the files look the same, or are there major differences?
1. Do the files just contain the text of the plays, or also additional information (check the top and bottom of each file carefully)?
1. If we want just the words used by the protagonists of the plays, what changes need to be made to the files?

*put your answers here*

## Cleaning up the files

### Analysis

You should have noticed quite a few problems with the files, only some of which we can fix by hand.

1. While `faustus.txt` and `mars.txt` are fairly easy to read, `hamlet.txt` is cluttered with all kinds of weird code like `<p>` and `<br/>`. That's because we downloaded a plaintext file for `faustus.txt` and `mars.txt`, but an HTML file for `hamlet.txt`. The expressions between `<` and `>` are HTML markup, which is needed to display a file in a web browser.

1. All files start with information about Project Gutenberg, which we do not want.

1. All files have information at the end that is not part of the actual text. In `hamlet.txt` and `mars.txt`, it's just a notice that the text is over, whereas `faustus.txt` is also full of footnotes.

1. In `faustus.txt`, the text is often interrupted by strings like `[17]`. Those are references to footnotes.

1. For the two plays, slightly different formats are used to indicate who is speaking.
    - In `hamlet.txt`, names are fully capitalized and occur between the markup `<p>` and `<br/>`.
      Sometimes there is a dot after the name, sometimes there isn't.
    - In `faustus.txt`, names are fully capitalized and followed by a dot.
      The actual text usually starts on the same line.
    
1. In `faustus.txt`, stage instructions are indicated by indentation.
   In `hamlet.txt`, they occur between `<p class="scenedesc">` and `</p>`.

1. In `faustus.txt`, all dialog is indented, but less so than the stage instructions.
    
1. All three files contain many empty lines.

1. Both plays capitalize words at the beginning of a new line.

1. In `mars.txt`, chapter headings are written in all caps.

These are all problematic for us:

- We just want to be able to see which words are used in each play, and how often each word is used.
- We do not want HTML markup, information about Project Gutenberg, footnotes, or empty lines.
- We also do not want to keep track of names if they just indicate who is speaking. That's not part of the play as such.
- We should also exclude stage instructions because those do not belong to the literary part of the play either.

Fixing all these things by hand would be tons of work.
Fortunately, we only need to delete a few things by hand, while Python can do the rest.

### Clean-up

Let's first do the manual fixes.
Carry out the fixes below, then save the modified files under new names so that they don't get overwritten in case you redownload the files: `hamlet_manual.txt`, `faustus_manual.txt`, and `mars_manual.txt`.

1. Open `hamlet.txt` and delete the first 189 lines. That's everything before the line `<h4><b>SCENE. Elsinore.</b></h4>`.

1. Now go to the end of `hamlet.txt` and delete everything after line 7942. That's everything after (and including) the line with the single tag `<pre>`. It is the only such tag in the file, so it is easy to find with your editor's search function.

1. Open `faustus.txt` and delete the first 140 lines. That's everything up to and including the empty line right after `FROM THE QUARTO OF 1616.`

1. In the same file, delete everything after the line `Terminat hora diem; terminat auctor opus.`
   Use the editor's search function to find it quickly.
   
1. Open `mars.txt` and delete the first 235 lines. That's everything before the line that says `CHAPTER I`.

1. In the same file, delete everything after the line `that I shall soon know.`
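The manual steps above could in principle also be scripted: keep only the lines between a start marker and an end marker. The sketch below illustrates the idea on a made-up miniature file. The function name `trim_lines` is hypothetical, and the sketch assumes that the first occurrence of each marker is the right one. In the real files a marker may occur more than once (for example in a table of contents), so check the result by eye.

```python
def trim_lines(lines, first_marker, last_marker):
    """Keep the lines from the first one containing first_marker
    up to and including the next one containing last_marker."""
    # position of the first line that contains the start marker
    start = next(i for i, line in enumerate(lines) if first_marker in line)
    # position of the first line at or after start that contains the end marker
    end = next(i for i in range(start, len(lines)) if last_marker in lines[i])
    return lines[start:end + 1]


# made-up miniature file, standing in for mars.txt
demo = [
    "some header information",
    "CHAPTER I",
    "some text of the novel",
    "that I shall soon know.",
    "some footer information",
]

trimmed = trim_lines(demo, "CHAPTER I", "that I shall soon know.")
print(trimmed)
```

If a marker does not occur at all, `next` raises a `StopIteration` error, which at least tells you that something went wrong.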

We have removed quite a bit of unwanted stuff, but there are still many problems with the formatting.
The Python code below fixes all of those for us using the power of regular expressions.

The code uses several commands we haven't encountered before, such as `with`, `raise`, and `for`, as well as advanced regular expression techniques.
Ignore them, they're not the point of this unit (`for` will be explained in the next unit, and there are separate expansion units for `with` and `raise`).
The important thing is that we now have a function `text_cleaner` that will clean up the text for us.
Remember, that's the great thing about functions - you can treat them as black boxes and use them efficiently even if you don't fully understand how they work!

In [None]:
# Code to clean up hamlet.txt, faustus.txt, and mars.txt
# ======================================================

# import regular expression module
import re

def text_cleaner(filename):
    """
    Open text and run required cleaning procedures.
    
    Arguments
    ---------
    filename: str
        name of file without extension (for instance .txt)
    """
    # Step 1: open the manually trimmed file for reading
    with open(filename + "_manual.txt", mode="r", encoding='utf-8-sig') as infile:
        # Step 2: create a new file to save cleaned up version
        with open(filename + "_clean.txt", mode="w", encoding='utf-8') as cleaned:
            # Step 2.5: read the whole file into the string "text";
            # hamlet needs some special tricks for multiline scene descriptions
            text = infile.read()
            if filename == "hamlet":
                text = re.sub(r'<p.*?class="scenedesc".*?>[\s\S]*?</p>', r'', text)
            # Step 3: clean each line and write to clean-up file
            for line in text.split('\n'):
                # cleaning
                line = line_cleaner(filename, line)
                # write line if it isn't empty
                if line != '':
                    cleaned.write(line)
                    cleaned.write('\n')

                    
def line_cleaner(filename, line):
    """clean line for hamlet, faustus, and mars"""
    # hamlet-specific cleaning
    if filename == "hamlet":
        # 1. remove all headers
        line = re.sub(r'<h[0-9].*', r'', line)
        # 2. remove speaker information
        #    (identified by html tags)
        line = re.sub(r'<p.*?>[A-Z\. ]*?<br/>', r'', line)
        # 3. remove html tags
        line = re.sub(r'<.*?>', r'', line)
        # 4. remove anything after [ or before ]
        line = re.sub(r'\[[^\]]*', r'', line)
        line = re.sub(r'[^\[]*\]', r'', line)
        # 5. replace special html codes by characters
        line = re.sub(r'&[rl]squo;', r"'", line)
        line = re.sub(r'&mdash;', r" --- ", line)
        line = re.sub(r"&amp;c[\.,]", r"&", line)
    # faustus-specific cleaning
    elif filename == "faustus":
        # 1. remove stage information
        #    (anything after 10 spaces)
        line = re.sub(r'(\s){10}.*', r'', line)
        # 2. remove speaker information
        #    (any word in upper caps followed by space or dot)
        line = re.sub(r'[A-Z]{2,}[\s\.]', r'', line)
        # 3. remove anything between square brackets
        line = re.sub(r'\[[^\]]*\]', r'', line)
        # 4. remove sentence initial spaces
        line = re.sub(r'^\s+', r'', line)
    # mars-specific cleaning
    elif filename == "mars":
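        # 1. delete chapter headings such as CHAPTER I or CHAPTER XIV
        # (must be done explicitly because one-letter Roman numerals like
        # I, V, and X are not caught by the all-caps rule below)
        line = re.sub(r'CHAPTER [IVXLC]+\b', '', line)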
        # 2. remove any word in upper caps
        line = re.sub(r'[A-Z]{2,}[\s\.]?', r'', line)
        # 3. remove anything after [ or before ]
        line = re.sub(r'\[[^\]]*', r'', line)
        line = re.sub(r'[^\[]*\]', r'', line)
    else:
        # give an error message
        raise Exception("No cleaning profile exists for this file")
    # remove multiple spaces that might be left after clean up
    line = re.sub(r'\s+', ' ', line)
    # return cleaned up line with everything in lower case
    return line.lower()
        
# do the actual cleaning
for filename in ["hamlet", "faustus", "mars"]:
    text_cleaner(filename)

After running the code, open the newly created files `faustus_clean.txt`, `hamlet_clean.txt`, and `mars_clean.txt` in your text editor.
Contrast them to `faustus_manual.txt`, `hamlet_manual.txt`, and `mars_manual.txt` that were fed into the cleaning function.
All the unwanted annotations, markup and stage instructions are gone, and we have a much cleaner file now.
Also note that now all words are lowercase, including proper names.
That is a feature, not a bug: *but* and *But* are the same word, so we do not want to count them separately.
That the texts now talk about *hamlet*, *faustus*, and *carter* is not much of an issue since proper names are rarely identical to existing words.
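To get a feel for what the regular expressions in the cleaning code do, here are three of its patterns applied to made-up sample lines. The sample lines are invented for illustration and only imitate the formats found in the files.

```python
import re

# made-up sample lines that imitate the formats described in this notebook
html_line = "Some line of verse,<br/>"
footnote_line = "A line with a footnote[17] in the middle"
spaced_line = "too   many    spaces"

# remove anything between < and > (non-greedy, so each tag is matched separately)
no_tags = re.sub(r'<.*?>', '', html_line)
# remove anything between square brackets, such as footnote references
no_footnotes = re.sub(r'\[[^\]]*\]', '', footnote_line)
# collapse runs of whitespace into a single space
single_spaces = re.sub(r'\s+', ' ', spaced_line)

print(no_tags.lower())        # some line of verse,
print(no_footnotes.lower())   # a line with a footnote in the middle
print(single_spaces.lower())  # too many spaces
```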

Cleaning up files isn't too much fun, but it is really necessary.
Always remember the old saying: **garbage in, garbage out!**
We have to make sure our data is as clean as possible in order to carry out a good analysis.
But now we can finally get started on the fun part!

## Tokenization

Remember that we are interested in determining which words each author uses, and how often they do so.
As far as Python is concerned, each of our text files is just one very long string of characters.
Python has no understanding of what a word is, so it cannot count words without our help.
What we need to do is to tell Python how it can convert a string into a list of words.
And as you know by now, that's exactly what tokenizers are for.

In [None]:
import re

def tokenize(the_string):
    """Convert string to list of words"""
    return re.findall(r"\w+", the_string)


def tokenize_file(the_file):
    """Read file as string and tokenize it"""
    with open(the_file, mode="r", encoding="utf-8") as text:
        return tokenize(text.read())


# define a variable for each token list
hamlet = tokenize_file("hamlet_clean.txt")
faustus = tokenize_file("faustus_clean.txt")
mars = tokenize_file("mars_clean.txt")
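It is worth checking on a small string what this tokenizer actually does. The sketch below repeats the definition of `tokenize` from the cell above and applies it to a made-up sentence. Note that `\w+` only matches letters, digits, and underscores, so words are split at apostrophes and hyphens.

```python
import re


def tokenize(the_string):
    """Convert string to list of words (same definition as in the cell above)"""
    return re.findall(r"\w+", the_string)


tokens = tokenize("it's a well-known fact, isn't it?")
print(tokens)
# ['it', 's', 'a', 'well', 'known', 'fact', 'isn', 't', 'it']
```

For a rough count of word frequencies this is usually acceptable, but it does mean that fragments like `s` and `t` will show up as "words".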

Before we continue, let's see what these lists look like compared to what we would get without the prior clean-up step.
After all, if we put so much effort in cleaning up the files, we want to know that it has paid off.

**Exercise.**
You could look at the cleaned-up lists with the `print` command:

```python
print(hamlet)
print(faustus)
print(mars)
```

**Don't do that!!!**

The output would be huge because these are long texts with thousands of words.
Use the `len` function to check how long each text is.

In [None]:
# put your code here

As you know, we can use indices to look at individual elements of a list.
So we can, say, compare the first word in the original version to the cleaned-up version.

In [None]:
import re

def tokenize(the_string):
    """Convert string to list of words"""
    return re.findall(r"\w+", the_string)


def tokenize_file(the_file):
    """Read file as string and tokenize it"""
    with open(the_file, mode="r", encoding="utf-8") as text:
        return tokenize(text.read())


# define a variable for each token list
hamlet = tokenize_file("hamlet_clean.txt")
faustus = tokenize_file("faustus_clean.txt")
mars = tokenize_file("mars_clean.txt")

# and the counterparts without cleaning up
hamlet_manual = tokenize_file("hamlet_manual.txt")
faustus_manual = tokenize_file("faustus_manual.txt")
mars_manual = tokenize_file("mars_manual.txt")

print("Hamlet comparison")
print("-----------------")
print("First word in Hamlet before cleaning:", hamlet_manual[0])
print("First word in Hamlet after cleaning:", hamlet[0])

**Caution:**
The rest of this notebook assumes that the variables `hamlet`, `faustus`, and `mars` exist, and similarly for `hamlet_manual`, `faustus_manual`, and `mars_manual`.
That's the case if you have run the cell above.
But if you restart the kernel at a later point, you have to rerun the cell above so that the variables are defined again.
So if you run one of the cells below and get an error that `hamlet`, `faustus`, or `mars` are undefined, come back up here and run the code cell.

Obviously it would be very tedious to compare, say, the first 100 words by referencing each one with its index.
Fortunately, Python's got you covered.
The index notation can also be used to get **slices**.
A slice is a contiguous part of a list.

In [None]:
# a short list
example_list = ["John", "really", "likes", "Sue"]
# show the first two elements
print(example_list[0:2])
# show the slice from index 1 up to (but not including) index 4
print(example_list[1:4])

Slices are very easy to use:

```python
some_list[start_index:end_index]
```

This gives you a list that starts at position `start_index` and stops right before position `end_index`; the element at `end_index` itself is not included.
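The fact that the end index is not part of the slice has two handy consequences, sketched below on a made-up list: the length of a slice is the difference between the two indices, and two slices that share a boundary fit together without overlap.

```python
letters = ["a", "b", "c", "d", "e"]

# elements at positions 1, 2, and 3, but not 4
middle = letters[1:4]
print(middle)                 # ['b', 'c', 'd']

# the length of the slice is end_index minus start_index
print(len(middle) == 4 - 1)   # True

# slices that share a boundary can be glued back together
print(letters[0:2] + letters[2:5] == letters)  # True
```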

**Exercise.**
Experimentation time!
Play around with slices to figure out how they work.
Pay particular attention to the following issues:

1. What happens if the start index is greater than the end index?
1. What happens if the end index does not exist (e.g. 17 for the example list above)?
1. What happens if one of the indices is omitted?
   For instance, `example_list[:3]`, or `example_list[2:]`, or `example_list[-1:]`?

In [None]:
# experiment here

*put your answers here*

**Exercise.**
Compare the output of the two code cells below.
Explain why the two outputs are not the same.

In [None]:
["John", "Mary", "Sue"][1]

In [None]:
["John", "Mary", "Sue"][1:1]

*put your explanation here*

With slices, it is now very easy to compare specific passages of the texts.
For example, we can look at the first 50 words in each text.

In [None]:
def print_50(the_list):
    print("--------")
    print(the_list[:50])
    print("--------")
    
print_50(hamlet)
print_50(hamlet_manual)

**Exercise.**
Write a small custom function `print_first_last` that prints the first *n* and last *n* words of each one of the three texts.
For example, `print_first_last(5)` should print the first 5 words of `hamlet`, then the last 5 words of `hamlet`, and then the same for `faustus` and `mars`.

In [None]:
# put your code here

The comparison of the lists with and without cleaning shows how important it is to remove all unnecessary crud from the files you work with.
The list for the cleaned file looks like the actual beginning of *Hamlet*, the other one not so much.
So it's a good thing we cleaned up the files, but admittedly it's not exactly the most fun activity.
In the next notebook, we finally get to reap the rewards in the form of a quantitative analysis of writing style.

## Bullet point summary

- **Getting files**
    - `urllib.request.urlretrieve(url, filename)` is used to download and save a file (yeah, it will take a while to memorize that function name).
- **Manipulating lists**
    - Slices allow you to extract a contiguous chunk of a list.
      The notation is `some_list[start_index:end_index]`.
      For example, `["John", "Mary", "Sue"][1:3]` is `["Mary", "Sue"]`.
    - In slices, start and end can be omitted.
      ```python
      ["John", "Mary", "Sue"][:2] == ["John", "Mary"]
      ["John", "Mary", "Sue"][2:] == ["Sue"]
      ["John", "Mary", "Sue"][:] == ["John", "Mary", "Sue"]
      ```