# NLP with spaCy, part 2 

*I wrote version 1.0 of this notebook based off materials by Allison Parrish. Dan Sinykin supplemented the 2020 version with material from Melanie Walsh's chapters [Named Entity Recognition](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/Named-Entity-Recognition.html) and [Part-of-Speech Tagging](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Text-Analysis/POS-Keywords.html). For 2021, I've added material adapted from David Bamman's [Applied NLP](https://github.com/dbamman/anlp21) course.*

This notebook picks up where [NLP with spaCy, part 1](class10-spacy-part1) leaves off, and covers Part-of-Speech tagging. 

# Part-of-Speech Tagging

I've left part-of-speech (PoS) tagging until last because it's simultaneously very useful, very complicated (unless you're a big grammar nerd), and very boring (again, unless you're a big grammar nerd). 

Why should we bother? Well, for one, parts of speech are the grammatical units of language — such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each of these parts of speech plays a different role in a sentence. By computationally identifying parts of speech, we can start computationally exploring syntax, the relationship between words — rather than only focusing on words in isolation.

I've attempted to make this lesson more exciting by including some dynamically generated, brightly colored charts and some [xkcd](https://xkcd.com/1443/). Buckle up!

<img src="https://imgs.xkcd.com/comics/language_nerd.png" >


## Part-of-Speech Tagging with spaCy 

In spaCy, POS tagging works much like NER tagging. Here is a chart (which resembles the NER chart) of the parts of speech spaCy is able to recognize and identify:

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |

You can access the POS for any word using the `pos_` attribute. If you want a more specific designation, you can use the `tag_` attribute.

Note that this *is* slightly different than accessing NER tags, which requires that you start with the document's entities (`document.ents`) rather than the document itself. There are technical reasons for this that I can explain if you're curious.
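
If the labels in the chart above (or the finer-grained `tag_` values) look cryptic, spaCy ships a small glossary you can query with `spacy.explain()`. That helper isn't used elsewhere in this notebook, so treat the following as a minimal sketch. The three labels are arbitrary examples, and no language model is needed to run it.

```python
import spacy

# a few labels to look up: two coarse POS labels and one fine-grained tag
# (arbitrary examples chosen for illustration)
labels = ["ADP", "PROPN", "NNS"]

# spacy.explain() returns a short description, or None if it doesn't know the label
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```

The same lookup can be handy inside a loop over a document, for example `spacy.explain(word.tag_)`, whenever a tag's meaning isn't obvious.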

### Enough talk. Let's see POS tagging in action!

First, we need to rerun our setup...

In [None]:
# import spacy and our English language model

import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
# re-process our various documents from spaCy part 1...

# 2020 Democratic Party Platform
# open file
with open("../corpora/platforms/democrat_platform_2020.txt", "r", encoding="utf-8") as file:
    dem_platform = file.read()

# turn into a spaCy doc 
dem_plat_doc = nlp(dem_platform)

# 2020 Republican Party Platform
# open file
with open("../corpora/platforms/republican_platform_2020.txt", "r", encoding="utf-8") as file:
    repub_platform = file.read()
    
# turn into a spaCy doc 
repub_plat_doc = nlp(repub_platform)

And now we're ready to go. 

Let's look at the parts-of-speech in the Republican party platform.

In [None]:
print("Word, POS, tag\n")

for word in repub_plat_doc:
    print(word.text, word.pos_, word.tag_)

### Extracting words by part of speech

Now we can write simple code to extract and recombine words by their part of speech. The following code creates two lists, one for all the nouns and another for all of the adjectives in the Republican platform:

In [None]:
nouns = []
adjectives = []

for word in repub_plat_doc:
    if word.pos_ == 'NOUN' and word.text not in nouns:
        nouns.append(word.text)
    elif word.pos_ == 'ADJ' and word.text not in adjectives:
        adjectives.append(word.text)

# slice up to (but not including) index 20 to get exactly 20 items
print("here are the first 20 nouns: " + str(nouns[:20]))
print("and here are the first 20 adjectives: " + str(adjectives[:20]))
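
The lists above record each noun and adjective only once. If you'd rather know how often each one occurs, `collections.Counter` can do the tallying. The sketch below assumes spaCy v3, where a `Doc` can be built by hand with its own `pos` values. The sentence and its labels are made up so the cell runs without loading a model, and `blank_nlp`, `toy_doc`, and `noun_counts` are illustrative names. With the real data, you would loop over `repub_plat_doc` instead.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

# blank English pipeline: supplies a vocabulary but no trained tagger
blank_nlp = spacy.blank("en")

# made-up sentence with hand-assigned POS labels (NOT model output)
words = ["Workers", "and", "families", "need", "jobs", ",",
         "and", "jobs", "need", "workers", "."]
pos = ["NOUN", "CCONJ", "NOUN", "VERB", "NOUN", "PUNCT",
       "CCONJ", "NOUN", "VERB", "NOUN", "PUNCT"]
toy_doc = Doc(blank_nlp.vocab, words=words, pos=pos)

# tally every noun, lowercased so "Workers" and "workers" count together
noun_counts = Counter(word.text.lower() for word in toy_doc if word.pos_ == "NOUN")

# most_common() lists the nouns from most to least frequent
for noun, count in noun_counts.most_common():
    print(noun, count)
```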

And below, some code to print out random pairings of an adjective from the text with a noun from the text:

In [None]:
import random
print(random.choice(adjectives) + " " + random.choice(nouns))
print(random.choice(adjectives) + " " + random.choice(nouns))
print(random.choice(adjectives) + " " + random.choice(nouns))
print(random.choice(adjectives) + " " + random.choice(nouns))

### Exercise!

Remember from a few cells up above how the `.tag_` attribute allows us to be even more specific than the `.pos_` attribute about the parts of speech we want? 

How would we iterate through the words included in the Republican platform to get a list of only verbs in the past participle?

*Hint: you'll find the name of the tag you're looking for [here](https://spacy.io/api/annotation#pos-tagging). Click on "English" to expand.*

In [None]:
only_past_participles = []

for word in repub_plat_doc:
    # your code here...
    pass  # placeholder so the cell runs; replace it with your own code

## Larger syntactic units

So we can get individual words by their part of speech. Great! But what if we want larger chunks, based on their syntactic role in the sentence? The easy way is `.noun_chunks`, an attribute of a document or a sentence that lets you iterate over the [spans](https://spacy.io/docs/api/span) of its noun phrases, regardless of their position in the document:

In [None]:
for item in dem_plat_doc.noun_chunks:
    print(item.text)

For anything more sophisticated than this, though, we'll need to learn about how spaCy parses sentences into their syntactic components.

### Understanding dependency grammars

![displacy parse](http://static.decontextualize.com/syntax_example.png)

The idea of a dependency grammar is that every word in a sentence is a "dependent" on some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")
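
To make the head/dependent idea concrete, here is a minimal sketch in plain Python, with no spaCy involved. The `heads` list is a hand-written analysis of a short phrase, not parser output, and the helper names `children_of` and `path_to_root` are made up for this illustration. It borrows one convention from spaCy: the root word is recorded as its own head.

```python
# hand-written analysis (an assumption for illustration, not parser output)
words = ["Everyone", "has", "the", "right", "to", "life"]
# heads[i] is the position of word i's head; the root points at itself
heads = [1, 1, 3, 1, 3, 4]

def children_of(index):
    """Return the positions of the words whose head is the word at `index`."""
    return [i for i, head in enumerate(heads) if head == index and i != index]

def path_to_root(index):
    """Follow head links upward from one word until we reach the root."""
    path = [words[index]]
    while heads[index] != index:
        index = heads[index]
        path.append(words[index])
    return path

# the root is the only word that is its own head
root = next(i for i, head in enumerate(heads) if head == i)

print("Root:", words[root])
print("Children of the root:", [words[i] for i in children_of(root)])
print("From 'life' up to the root:", path_to_root(5))
```

Every word reaches the root by following head links, which is what makes the whole sentence a single tree.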

The question of how to know what constitutes a "head" and a "dependent" is complicated. As a starting point, here's a passage from [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf):

> Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):
>
> 1. H determines the syntactic category of C and can often replace C.
> 2. H determines the semantic category of C; D gives semantic specification.
> 3. H is obligatory; D may be optional.
> 4. H selects D and determines whether D is obligatory or optional.
> 5. The form of D depends on H (agreement or government).
> 6. The linear position of D is specified with reference to H.

There are different *types* of relationships between heads and dependents, and each type of relation has its own name. 

**Visit [the displaCy visualizer](https://demos.explosion.ai/displacy/?text=Everyone%20has%20the%20right%20to%20life%2C%20liberty%20and%20security%20of%20person&model=en&cpu=1&cph=0) to see how a particular sentence is parsed, and what the relations between the heads and dependents are.**

Here's a list of a few dependency relations and what they mean. ([A more complete list can be found here.](http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf))

* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb
* `nsubjpass`: same as above, but for subjects in sentences in the passive voice
* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb
* `iobj`: same as above, but indirect object
* `aux`: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
* `attr`: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", `brand` is dependent on `is` with the `attr` dependency relation)
* `det`: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
* `amod`: this word's head is a noun, and this word is an adjective describing that noun
* `prep`: this word is a preposition that modifies its head
* `pobj`: this word is a dependent (object) of a preposition

In [None]:
# Let's take a look at how this works in practice
# We'll go back to using our first doc, the Universal Declaration of Human Rights, for the rest of this notebook

# Let's load it in again in case you're doing this part of the notebook as part of your homework
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Tag:", word.tag_)
    print("Head:", word.head.text)
    print("Dependency relation:", word.dep_)
    print("Children:", list(word.children))
    print("")

We can also quickly see spaCy's dependency parsing in action by using displaCy on `doc` with the `style=` parameter set to `"dep"` (short for dependency parsing):

In [None]:
# reimport all our stuff in case you're running this notebook as homework 
from __future__ import unicode_literals
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(doc, style="dep", options=options)

### Using .subtree for extracting syntactic units

Now that the above makes perfect sense (or your eyes are glazing over), let's look at the `.subtree` attribute, which evaluates to a generator (remember that?!) that can be flattened by passing it to `list()`. 

In this case, the subtree consists of the word itself plus all of its syntactic descendants: essentially, the phrase that the word heads.

This function merges a subtree and returns a string with the text of the words contained in it:

In [None]:
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip() # just take my word for it!

With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence:

In [None]:
for word in list(doc.sents)[2]:
    print("Word:", word.text)
    print("Flattened subtree: ", flatten_subtree(word.subtree))
    print("")

Using the subtree and our knowledge of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:

In [None]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

In [None]:
subjects

Or every prepositional phrase:

In [None]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

In [None]:
prep_phrases
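
The same pattern works for any relation in the list above, and you can also pair each phrase with its head. The sketch below collects (verb, direct object phrase) pairs using `dobj`. It assumes spaCy v3, where a `Doc` can be constructed with hand-supplied `heads` and `deps`. The parse here is written by hand so the cell runs without a trained model, and it may differ from what `en_core_web_sm` would produce. `blank_nlp`, `toy_doc`, and `verb_objects` are illustrative names. With a parsed document, you would loop over `doc` instead.

```python
import spacy
from spacy.tokens import Doc

def flatten_subtree(st):
    # same helper as earlier in the notebook, repeated so this cell runs on its own
    return ''.join([w.text_with_ws for w in st]).strip()

# blank English pipeline: supplies a vocabulary but no trained parser
blank_nlp = spacy.blank("en")

# hand-written parse (NOT model output); heads are absolute token positions
words = ["Everyone", "has", "the", "right", "to", "life"]
heads = [1, 1, 3, 1, 3, 4]
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "pobj"]
toy_doc = Doc(blank_nlp.vocab, words=words, heads=heads, deps=deps)

verb_objects = []
for word in toy_doc:
    if word.dep_ == 'dobj':
        # word.head is the verb; word.subtree is the whole object phrase
        verb_objects.append((word.head.text, flatten_subtree(word.subtree)))

print(verb_objects)
```

Keeping the head alongside the phrase lets you ask questions like "what does this text say people *have*?" rather than just listing phrases out of context.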

Now we know a large part of how the "Connotation Frames" and "Birth Stories" projects got made!

Type something in the cell below so that I know you've made it to the end. 

In [None]:
# your words here!!!

## Further reading and resources

[A few example programs can be found here.](https://github.com/aparrish/rwet-examples/tree/master/spacy)

We've barely scratched the surface of what it's possible to do with spaCy. [There's a great page of tutorials on the official site](https://spacy.io/docs/usage/tutorials) that you should check out!