# Tagging Text for Parts of Speech and Named Entities (People, Places, Groups) - Working Notebook

The lesson for simple counting was long and fairly involved. We covered an enormous amount, from getting a straightforward count of the number of tokens in a single text to massive, rather complicated analyses of an entire corpus.

Here, we're going to take a look at tagging, essentially a way of putting words and phrases into groups, or "word classes." To a significant degree, these techniques tend to be more automated than simple-counting measures. Which is simply to say that, for simple counting, you have to think through the type of analysis you want to run and there's a great deal of room for you to make choices, and to specify exactly the types of results you might want. Tagging is different, in that it's a more automated form of text analysis - it's "set-it-and-forget-it" text analysis, if you will. This isn't necessarily to say that tagging is in any way less important, but there is certainly less to teach, and the emphasis will tend to fall on what you do with tagging information, after you've generated it, rather than on the process of tagging itself.

## 1. Open a Single Text

By this point, you should be fairly comfortable pointing to a specific directory and opening a text on your own. Let's start with James Boswell's <i>Life of Johnson</i>. This is a particularly interesting text, because it's a nonfictional biography of the great lexicographer -- so it concerns real people and places. It also has the advantage of being very long, so it will necessarily discuss many more people and places than, say, a short poem. On your own, open the two volumes of Boswell's <i>Life of Johnson</i> (`K055619.001.txt` and `K055619.002.txt`), read them into working memory, and combine the two resulting texts into a single string. For this lesson, we're going to want to use the uncleaned working set, because capitalization and punctuation are important cues that help the computer to make distinctions among words when it applies tags.

In [None]:
import os
from pathlib import Path
home = str(Path.home())

textdirectory = home + 

# Write your own code to point at the `working_set` directory in `/dh2/corpora_and_metadata/`. Make sure you're in the right directory.

The two files you'll need are `K055619.001.txt` and `K055619.002.txt`.

Open both texts and then read them into working memory. Combine both documents into a single string called `johnsonTxt`. 
Check the length of each of the three strings to ensure you've done this correctly.


In [None]:
johnson1 = open("K055619.001.txt", "r")
johnson2 = open("K055619.002.txt", "r")

In [None]:
johnson1Txt = johnson1.read()
johnson2Txt = johnson2.read()

# Close the file handles now that both texts are in memory
johnson1.close()
johnson2.close()

In [None]:
johnsonTxt = johnson1Txt + johnson2Txt

In [None]:
len(johnson1Txt), len(johnson2Txt), len(johnsonTxt)
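The same open, read, and combine steps can also be written with `with` blocks, which close each file for you. What follows is only a minimal sketch: it uses two tiny throwaway files in a temporary directory rather than the Boswell volumes, and the names `part1.txt`, `part2.txt`, `first_text`, `second_text`, and `combined_text` are hypothetical stand-ins.

```python
import tempfile
from pathlib import Path

# Create two small stand-in files in a temporary directory (hypothetical names and contents)
with tempfile.TemporaryDirectory() as tmp:
    first_path = Path(tmp) / "part1.txt"
    second_path = Path(tmp) / "part2.txt"
    first_path.write_text("Volume the first. ", encoding="utf-8")
    second_path.write_text("Volume the second.", encoding="utf-8")

    # A `with` block closes each file automatically once it has been read
    with open(first_path, "r", encoding="utf-8") as f:
        first_text = f.read()
    with open(second_path, "r", encoding="utf-8") as f:
        second_text = f.read()

# Joining two strings with + gives one longer string
combined_text = first_text + second_text

# The combined length should equal the sum of the two parts
print(len(first_text), len(second_text), len(combined_text))
```

If the third number is not the sum of the first two, something went wrong when combining the texts.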

## 2. Part-Of-Speech (POS) Tagging

Now that we've got the entirety of Boswell's <i>Life of Johnson</i> in a single variable, let's get down to business. What does the beginning of the text look like? Print the first 2000 characters and see.

In [None]:
# Without tokenizing, in this cell print the first 2000 characters of `johnsonTxt`.

There are a couple of things to say about this snippet, off the bat. It contains capitalization and punctuation, which is good, but it does include a bit of information about the frontispiece, as well as a transcription of the title page. If we wanted to be absolutely precise, we might want to remove these, and possibly also the Latin epigraph. You might even want to remove the dedication to Reynolds (the first president of the Royal Academy and one of the most famous portraitists of the time, as well as a close friend of Boswell and Johnson). 

The title and headers set entirely in caps could cause particular difficulty for our tagging algorithms, but let's give it a shot, and see how the results come out. We're not particularly interested in getting an exact count, for our purposes. Ultimately, we just want to get a sense of the people, places, and organizations mentioned most often in the text, so these marginal cases that might cause problems near the beginning of the text are probably not a serious concern.

It's worth mentioning one final consideration. Notice in the dedication that "Art," "Philosophy," and "Literature" are all capitalized. If this were to continue throughout the text, it might cause serious problems. Luckily, though, Boswell seems to be capitalizing only the names of arts, and the ECCO-TCP doesn't have the indiscriminate capitalization of nouns that was so common in eighteenth-century printing. The volumes that we're using here normalize capitalization. If we were dealing with eighteenth-century capitalization we wouldn't necessarily be out of luck. A tool called <a href="https://pypi.org/project/truecase/">TrueCase</a> would give us the opportunity to try to correct the capitalization throughout the text, which would then make it possible to tag the text. Luckily, we don't have to worry about that, in this situation. 

Let's begin by testing out part-of-speech tagging on a well-known passage from the <i>Life</i>, in which Johnson and Boswell discuss the novelists Fielding and Richardson with the parliamentarian Thomas Erskine. We'll use an excerpt from the ECCO-TCP file, although we've introduced backslashes throughout. Marks like `\"` and `\'` are what Python calls <i>escape sequences</i>: the backslash tells Python that the quotation mark is part of the text rather than the end of the string, so this punctuation doesn't interrupt our ability to read the passage into memory as one complete string.
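To see what an escape sequence does, here is a minimal sketch using one clause from the passage. The variable names `escaped` and `unescaped` are just illustrative.

```python
# A backslash before a quotation mark tells Python the mark is part of the string
escaped = "Johnson exclaimed, \"he was a blockhead;\""

# Using single quotes as the delimiter avoids the need to escape double quotes
unescaped = 'Johnson exclaimed, "he was a blockhead;"'

# Both spellings produce the same string; the backslashes are not stored in it
print(escaped)
print(escaped == unescaped)
print("\\" in escaped)
```

The backslashes exist only in the source code. Once the string is in memory, it contains ordinary quotation marks.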

In [None]:
passage = "Fielding being mentioned, Johnson exclaimed, \"he was a blockhead;\" and upon my expressing my astonishment at so strange an assertion, he said, \"What I mean by his being a blockhead is, that he was a barren rascal.\" BOSWELL. \"Will you not allow, Sir, that he draws very natural pictures of human life?\" JOHNSON. \"Why, Sir, it is of very low life. Richardson used to say, that had he not known who Fielding was, he should have believed he was an ostler. Sir, there is more knowledge of the heart in one letter of Richardson\'s, than in all \'Tom Jones.\' I, indeed, never read \'Joseph Andrews.\" ERSKINE. \"Surely, Sir, Richardson is very tedious.\" JOHNSON. \"Why, Sir, if you were to read Richardson for the story, your impatience would be so much fretted, that you would hang yourself. But you must read him for the sentiment, and consider the story as only giving occasion to the sentiment.\" -- I have already given my opinion of Fielding; but I cannot refrain from repeating here my wonder at Johnson\'s excessive and unaccountable depreciation of one of the best writers that England has produced. \"Tom Jones\" has stood the test of public opinion with such success, as to have established its great merit, both for the story, the sentiments, and the manners, and also the varieties of diction, so as to leave no doubt of its having an animated truth of execution throughout.\""

In [None]:
print(passage)

In [None]:
import nltk
from nltk import pos_tag
from string import punctuation

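# sent_tokenize and word_tokenize need the "punkt" models; pos_tag needs the tagger
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
# Assumption: newer NLTK releases may ask for "punkt_tab" and "averaged_perceptron_tagger_eng" instead;
# if you get a LookupError, download whichever resource name the error message gives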

sentences = nltk.tokenize.sent_tokenize(passage)
print("\nPOS Tags:")
for sentence in sentences:
    sentence = ''.join(c for c in sentence if not c.isdigit())
    sentence = ''.join(c for c in sentence if c not in punctuation)
    words = nltk.tokenize.word_tokenize(sentence)
    pos_tokens = nltk.pos_tag(words)
    print(pos_tokens)

These results might seem hard to read, initially, but they're very regular. The passage is turned into a series of <i>tuples</i>, where the first value is a word and the second value is a tag that represents a specific part of speech. For example, in `('Johnson', 'NNP')`, _Johnson_ is a word from a sentence ("Fielding being mentioned, Johnson exclaimed...") and _NNP_ is a part-of-speech tag. In this case, NLTK has guessed correctly: Johnson is a proper noun. In the case of "Fielding", though, the part of speech was assigned incorrectly.
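Because every item is a two-part tuple, the tagged output is easy to work with once you have it. The following is a small sketch, assuming a hand-written list in the same shape as the tagger's output. The words and tags here were typed in for illustration and are not real NLTK results.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape pos_tag returns; illustrative, not real NLTK output
tagged = [("Johnson", "NNP"), ("exclaimed", "VBD"), ("he", "PRP"),
          ("was", "VBD"), ("a", "DT"), ("blockhead", "NN")]

# Each tuple unpacks into a word and its tag
tag_counts = Counter(tag for word, tag in tagged)
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(tag_counts.most_common(1))
print(proper_nouns)
```

The same two lines would work on the real `pos_tokens` list, which is where the emphasis of this lesson falls: what you do with the tags after you've generated them.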

To see human-readable definitions for these tags, run the following cell:

In [None]:
# download tagsets
nltk.download('tagsets')

# print the definition of a single tag, as well as some examples.
nltk.help.upenn_tagset('NNP')
nltk.help.upenn_tagset('VBG')

You can also refer to <a href = "https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/">this list</a> of POS tags, or you can retrieve multiple definitions by using regex instead of a full tag.

In [None]:
# print definitions for every tag that starts with "N" -- this will show all types of nouns in this tagset
nltk.help.upenn_tagset('N.*')
# change 'N.*' to '.*' to get all the tags in this set at once
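If regular expressions are unfamiliar, here is a rough sketch of how a pattern like `N.*` picks out tags. It uses Python's built-in `re` module on a small, hand-picked subset of Penn Treebank tags. The list `tags` is an assumption for illustration, and the sketch doesn't claim to reproduce how NLTK matches tags internally.

```python
import re

# A small, hand-picked subset of Penn Treebank tags (assumed for illustration)
tags = ["NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "JJ", "DT"]

# "N" matches a literal N; ".*" matches any run of characters after it
pattern = re.compile(r"N.*")
noun_tags = [tag for tag in tags if pattern.fullmatch(tag)]

print(noun_tags)
```

Changing the pattern to `V.*` would select the verb tags in the same way.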

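POS tags can also be used to pull out noun phrases. The next cell defines a small grammar (an optional determiner, any number of adjectives, and a singular noun), parses each tagged sentence into a tree, and collects the noun-phrase branches.

In [None]: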
nltk.download("averaged_perceptron_tagger")

# Define the grammar for a noun phrase
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

# Split the document into sentences
sentences = nltk.tokenize.sent_tokenize(passage.lower())

noun_phrases = []

# Each tagged sentence consists of an "S" root with POS tagged words and "NP" noun phrase branches
print('\nSentence Trees\n------------------')
for sentence in sentences:
    # Normalize the text
    sentence = ''.join(c for c in sentence if not c.isdigit())
    sentence = ''.join(c for c in sentence if c not in punctuation)
    
    # Get the individual words
    words = nltk.tokenize.word_tokenize(sentence)
    
    # Tag the parts of speech
    pos_tokens = nltk.pos_tag(words)
    
    # Parse the tagged words
    parsed_sentence = parser.parse(pos_tokens)
    print(parsed_sentence)
    
    for chunk in parsed_sentence.subtrees():
        # Find the NP subtrees - these are noun phrases
        if chunk.label() == 'NP':
            # Assemble the phrase from its constituent words
            noun_phrase = []
            for word in chunk:
                noun_phrase.append(word[0])
            # Add the phrase to the list of noun phrases found in the document
            noun_phrases.append(' '.join(noun_phrase))

# Print the extracted noun phrases (the NP branches in the sentence trees)            
print('\nNoun Phrases\n------------------')
for phrase in noun_phrases:
    print(phrase)

Or, we can get a massive list of all the distinct POS tuples in an entire string, whether it's a passage or a whole volume.

In [None]:
import nltk
nltk.download("punkt")
from string import punctuation

doc6Txt = ''.join(c for c in passage if not c.isdigit())
doc6Txt = ''.join(c for c in doc6Txt if c not in punctuation)

# Tokenize the words
words = nltk.tokenize.word_tokenize(doc6Txt)

# Tag the words
pos_tokens = nltk.pos_tag(words)

# Get the vocabulary
vocab = set(pos_tokens)

print("\nTotal words: %i" % len(pos_tokens))
print("Vocabulary: %i" % len(vocab))

# Tag the vocabulary
print("\nTagged Vocabulary")
print(sorted(vocab))
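One thing to notice about the cell above is that the "vocabulary" is a set of (word, tag) pairs rather than a set of words, so a word that receives two different tags is counted twice. Here is a minimal sketch of the difference, using hand-written pairs that are illustrative rather than real tagger output:

```python
# Hand-written (word, tag) pairs; "read" appears with two different tags
tagged = [("read", "VB"), ("read", "VBD"), ("the", "DT"),
          ("the", "DT"), ("story", "NN")]

# A set of pairs keeps one entry per distinct (word, tag) combination
pair_vocab = set(tagged)

# A set of words alone ignores the tags
word_vocab = {word for word, tag in tagged}

print(len(tagged), len(pair_vocab), len(word_vocab))
```

Which of the two counts you want depends on whether you care about a word's different grammatical uses.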

## 3. Identifying People and Places with Named Entity Recognition (NER)

The NLTK named entity package relies on part-of-speech tagging to make inferences about whether a noun or noun phrase might be a person, place, or organization. Before turning to the whole of Boswell's <i>Life of Johnson</i>, we'll first tokenize our test passage into sentences.

In [None]:
import nltk
from nltk import pos_tag, ne_chunk
from collections import Counter

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [None]:
sentences = nltk.sent_tokenize(passage)

For each of these sentences, break the sentence into pieces using `ne_chunk`, and for each of those chunks that has a person or place label, append it to the appropriate list. Whenever we append to the main lists, we want to add only the word from each tuple, which is why we pull `c[0]`, leaving out the tag.

In [None]:
people = []
places = []
for sentence in sentences:
    # Find named-entity chunks
    for chunk in ne_chunk(pos_tag(nltk.word_tokenize(sentence))):
        # Check for a label
        if hasattr(chunk, 'label'):
            # Which type of named entity was found?
            if chunk.label() == 'PERSON':
                people.append(' '.join(c[0] for c in chunk))
            elif chunk.label() == 'GPE':
                places.append(' '.join(c[0] for c in chunk))
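The loop above works because `ne_chunk` returns a tree in which named entities are labeled subtrees and everything else is a plain (word, tag) tuple. The sketch below builds such a tree by hand with `nltk.tree.Tree`, so it needs no downloads. The sentence and its labels are invented for illustration and are not real chunker output.

```python
from nltk.tree import Tree

# A hand-built tree in the shape ne_chunk returns (illustrative, not real chunker output)
sentence_tree = Tree("S", [
    Tree("PERSON", [("Samuel", "NNP"), ("Johnson", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Scotland", "NNP")]),
    (".", "."),
])

people, places = [], []
for chunk in sentence_tree:
    # Plain (word, tag) tuples have no label() method; named-entity subtrees do
    if hasattr(chunk, "label"):
        # Join the words of a multi-word entity into one string
        name = " ".join(word for word, tag in chunk)
        if chunk.label() == "PERSON":
            people.append(name)
        elif chunk.label() == "GPE":
            places.append(name)

print(people, places)
```

This is why the `hasattr(chunk, 'label')` check comes first: calling `label()` on an ordinary tuple would raise an error.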

Finally, count the entries in each of these lists and print the most common.

In [None]:
print("\nNamed entity chunks:")
print("\nPeople:")
print("\t", Counter(people).most_common()[:30])
print("Places:")
print("\t", Counter(places).most_common()[:30])

Great! The results of this run are admittedly underwhelming, in that neither "Sir" nor "Fielding" is a place. We have shown, however, that we can get an ordered list of people and places from the short passage we were using to test our script. Let's run the whole of <i>Life of Johnson</i> through the same analysis, combined in a single cell.

In [None]:
sentences = nltk.sent_tokenize(johnsonTxt)
people = []
places = []
for sentence in sentences:
    # Find named-entity chunks
    for chunk in ne_chunk(pos_tag(nltk.word_tokenize(sentence))):
        # Check for a label
        if hasattr(chunk, 'label'):
            # Which type of named entity was found?
            if chunk.label() == 'PERSON':
                people.append(' '.join(c[0] for c in chunk))
            elif chunk.label() == 'GPE':
                places.append(' '.join(c[0] for c in chunk))
print("\nNamed entity chunks:")
print("\nPeople:")
print("\t", Counter(people).most_common()[:30])
print("Places:")
print("\t", Counter(places).most_common()[:30])

Done. We've produced two lists: one, the most common people in Boswell's biography of Johnson; the other, the most common places. We can clearly see that the success of NER depends a great deal on the amount of cleaning you're willing to do in advance. The string `JOHNSON`, as we've seen, is used by Boswell almost as a dramatic cue, to indicate when Samuel Johnson is speaking. Our NER algorithm recognizes `Johnson` and `JOHNSON` as two different entities, and it unhelpfully recognizes `Sir` and `Mr.` and `Esquire` as separate people unto themselves. While these are certainly problems, we do have enough information so that, if we were willing to do a bit of manual cleaning, we would have a rather compelling list of people and places:

    People: Johnson, Garrick, Mr. Langton, Goldsmith, Williams, Sir Joshua Reynolds, Mr. Thrale, Pope, Thrale, Adams, Sir Joshua, Mr. Burke, etc.
    
    Places: London, Scotland, England, Oxford, Lichfield, Ireland, France, Italy, Edinburgh, America, Paris
    
Allowing that these results are far from perfect, they're good enough for our purposes. If we were to run this analysis against a larger dataset, we could determine at a glance roughly which people and places are most discussed in each text. Let's now put that to use, by analysing a larger corpus of multiple texts.
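Some of the manual cleaning described above can be automated. The following is only a rough sketch, under stated assumptions: the counts in `raw_counts` are invented, the `titles` set is a hypothetical and incomplete list, and title-casing names is just one possible way to merge `JOHNSON` with `Johnson`.

```python
from collections import Counter

# Hypothetical raw counts, shaped like Counter(people); the numbers are invented
raw_counts = Counter({"Johnson": 50, "JOHNSON": 40, "Sir": 30, "Mr.": 12, "Garrick": 9})

# Bare titles that the chunker mistakes for people (an assumed, incomplete list)
titles = {"sir", "mr.", "esquire"}

cleaned = Counter()
for name, count in raw_counts.items():
    if name.lower() in titles:
        continue  # drop entries that are only a title
    # Title-case the name so that JOHNSON and Johnson are counted together
    cleaned[name.title()] += count

print(cleaned.most_common())
```

Note that `str.title()` is a blunt instrument: it would turn a name like "McDonald" into "Mcdonald", so you'd still want to look over the merged list by eye.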

## 4. Do NER for a Collection of Files

Now that we've performed NER on a single document, let's apply our script to a whole set of files. If you have the time, you may want to run the following cells against the entire `working_set_nocleaning` directory, which will give you results for the entirety of our enriched ECCO-TCP set. Depending on your computer, though, this will likely take about four or five hours to complete, so you may want to start the script just before you go to bed, in which case you should have results the following morning.

For the sake of expediency, we'll use a slightly smaller dataset, here, based on the volumes in our dataset by America's Founding Fathers. Go ahead, and fill in the following cells. If you need help, you can refer back to the simple counting notebook, or, if you really need it, to the explicit notebook for this lesson. 

In [None]:
# In this cell, change your working directory to `/dh2/corpora_and_metadata/`

Now, use the metadata file `ecco_data_w_counts.csv` to get a list of the volumes written by John Adams, Benjamin Franklin, Thomas Jefferson, James Madison, George Washington, and Thomas Paine. Refer back to the previous lesson on simple counting or consult the complete notebook, if you need to refresh your memory.

In [None]:
ecco_metadata_w_counts = ???
array = ???
filenames = [???].tolist()

In [None]:
# In this cell, change your working directory to `/dh2/corpora_and_metadata/working_set_nocleaning/`

Finally, turn the script we've used to run NER against Boswell's <i>Life</i> into a loop that analyses every file in `filenames`. We've built in a counter, so that you can monitor the progress of your script, as it loops through the documents.

In [None]:
import pandas as pd
named_entities = []
n = 0

for ??? in ???:
    with open (str(file),'r') as inputFile:
        readFile = ???
        sentences = nltk.sent_tokenize(readFile)
        # As with NER for a single text, create blank lists for `people` and `places`, and then fill them with each of the named-entity chunks that match `PERSON` and `GPE`.
        
        
        
        
        # Then, count the most common entries in each list, as you did in NER for a single text, and store them as `topPeople` and `topPlaces`.

        
        
        new_data = {'DocName':file,'topPeople':topPeople,'topPlaces':topPlaces}
        named_entities.append(new_data)
        n += 1
        if n % 5 == 0:
            print(n)      
ner_df = pd.DataFrame(named_entities)
print("Script Complete")
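The record-keeping part of the loop above (one dictionary per document, plus a progress message every fifth file) can be sketched without any files or NER at all. In this minimal example, the `documents` dictionary, its file names, and its place lists are hypothetical stand-ins for entities you would extract yourself.

```python
from collections import Counter

# Hypothetical stand-ins for files: document name -> places already extracted from it
documents = {f"doc{i}.txt": ["London", "Paris", "London"] for i in range(1, 11)}

records = []
progress_marks = []
for n, (name, places) in enumerate(documents.items(), start=1):
    # Keep only the two most common places for this document
    top_places = Counter(places).most_common(2)
    records.append({"DocName": name, "topPlaces": top_places})
    # Report progress after every fifth document
    if n % 5 == 0:
        progress_marks.append(n)
        print(n)

print(records[0])
```

A list of dictionaries like `records` is the shape that `pd.DataFrame` expects, with one row per dictionary and one column per key.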

Okay! Once again, let's cross our fingers and check our results. Good luck!

In [None]:
ner_df[:10]

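You should have a sizeable set of results. As written, the loop above stores `DocName`, `topPeople`, and `topPlaces` for each file; `main_author` and `title` will appear only if you also carry them over from the metadata file.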