Joseph Caguioa

Spring 2020

DS 7337: Natural Language Processing

Section 404 (Tuesday 2030-2200)

HW4 Due: Date of Live Session 8 (2/25/20)

---

# Homework 4

## <u><a name="toc">Table of Contents:</a></u>
* [Question 1](#question1)
* [Question 2](#question2)
* [Question 3](#question3)

---

### <a name="question1">Question 1</a> 

<b>Run one of the part-of-speech (POS) taggers available in Python.
  
* Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
* Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.</b> <sub>[(back to top)](#toc)</sub>

NLTK, the main natural language processing library used throughout previous homeworks, also has a built-in POS tagger, so it makes sense to try that one first. In current NLTK releases, the default English tagger behind `nltk.pos_tag` is an averaged perceptron model.

In [1]:
import nltk
import pandas as pd

In [2]:
nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

For simplicity, tokenization and tagging are combined into a single helper function, demonstrated below.

In [3]:
def token_tag(sentence):
    """
    POS tagging using NLTK.
    
    Args:
        sentence (str): The text to be tagged.
    
    Returns:
        tags (list): A list of tokens and POS tags.
    
    """
    
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    return tags

In [4]:
pangram = "The quick brown fox jumps over the lazy dog."
pangram_df = pd.DataFrame(token_tag(pangram), columns=["Token", "NLTK POS Tag"])
pangram_df

| | Token | NLTK POS Tag |
| - | - | - |
| 0 | The | DT |
| 1 | quick | JJ |
| 2 | brown | NN |
| 3 | fox | NN |
| 4 | jumps | VBZ |
| 5 | over | IN |
| 6 | the | DT |
| 7 | lazy | JJ |
| 8 | dog | NN |
| 9 | . | . |


Notice that the NLTK tagger already makes a mistake on this commonly used pangram. The third word, "brown," is labeled a singular noun (NN) when it should be an adjective (JJ).

A full list of POS tags in the Penn Treebank Project can be found at the following web link:

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

The NLTK package also provides documentation that can be accessed using `nltk.help.upenn_tagset()`. Tags that are passed to this help function return with acronym expansions and example words, which is helpful for understanding what each tag means. An example, which was used extensively during manual tagging, is demonstrated below.

In [5]:
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


For the first task, any long sentence can likely be tagged correctly in full if it is lexically unambiguous, such as by avoiding homographs and homophones. Manually verifying the results of sentences for this unbounded problem could become very time-consuming, especially when interesting candidates include those on the scale of entire one-sentence books. Thus, a relatively interesting sentence longer than ten words, but not ridiculously so, is investigated.

In [6]:
decision = "We choose to go to the moon in this decade and do the other things, \
not because they are easy, but because they are hard, \
because that goal will serve to organize and measure the best of our energies and skills, \
because that challenge is one that we are willing to accept, \
one we are unwilling to postpone, and one which we intend to win, and the others, too."

pd.set_option('display.max_rows', 80)
decision_df = pd.DataFrame(token_tag(decision), columns=["Token", "NLTK POS Tag"])
decision_df

| | Token | NLTK POS Tag |
| - | - | - |
| 0 | We | PRP |
| 1 | choose | VBP |
| 2 | to | TO |
| 3 | go | VB |
| 4 | to | TO |
| 5 | the | DT |
| 6 | moon | NN |
| 7 | in | IN |
| 8 | this | DT |
| 9 | decade | NN |


A line from John F. Kennedy's September 12, 1962 address at Rice University, popularly known as the "We choose to go to the Moon" speech, is fed into the POS tagger. My manual check of the output does not detect any major errors. His rhetoric is simple yet evocative, and the words he uses are unambiguous in their meaning.

For the second task, ambiguous sentences can result in less than 100% correct tagging. One example is the famous buffalo sentence, in which the word "buffalo" is repeated eight times: "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo." It uses three meanings:

* Noun: The animal also known as the bison. Note that the word "buffalo" counts for both the singular and plural forms.
* Verb: An uncommon, informal usage that means "to puzzle or confuse" or "to impress or intimidate." (Retrieved from Dictionary.com.) 
* Proper noun: The city of Buffalo, New York. In this context it could be thought of as a noun adjunct, i.e., it specifies bison that come from the city in New York.

As explained on Wikipedia, the meaning can be understood as: "Buffalo bison, that other Buffalo bison bully, also bully Buffalo bison." A simplified parse tree is already available for this sentence on Wikimedia, but some manual tagging is needed to use the Penn Treebank tags.

| | Buffalo | buffalo | Buffalo | buffalo | buffalo | buffalo | Buffalo | buffalo |
| - | - | - | - | - | - | - | - | - |
| Parse tags: | PN | N | PN | N | V | V | PN | N |
| Penn POS tags: | NNP | NNS | NNP | NNS | VBP | VBP | NNP | NNS |

Please note that while I am a native English speaker, I may make mistakes when it comes to manual POS tagging, especially for confusing sentences like this one. For clarity, the POS tag descriptions used in this section follow:

* NN: Noun, singular or mass
* NNP: Proper noun, singular
* NNS: Noun, plural
* VBP: Verb, non-3rd person singular present

Testing this sentence on the NLTK POS tagger gives the following results:

In [7]:
buffalo_long = "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."

buffalo_long_df = pd.DataFrame(token_tag(buffalo_long), columns=["Token", "NLTK POS Tag"])
buffalo_long_df

| | Token | NLTK POS Tag |
| - | - | - |
| 0 | Buffalo | NNP |
| 1 | buffalo | NN |
| 2 | Buffalo | NNP |
| 3 | buffalo | NN |
| 4 | buffalo | NN |
| 5 | buffalo | NN |
| 6 | Buffalo | NNP |
| 7 | buffalo | NN |
| 8 | . | . |


While the tagger correctly identified the proper noun version, likely due to capitalization, it failed to identify the verbs. Because the verb usage is rare, this is not surprising for a statistical tagger that learns from labeled examples. It is unlikely that the Penn Treebank training data contains many, if any, instances of "buffalo" being used as a verb, so it makes sense that the NLTK tagger would have difficulties.

Additionally, notice that the other noun instances are called "NN" for singular or mass (as in, uncountable) as opposed to "NNS" for plural. The tagger likely expects "buffaloes" as the plural form (though "buffalo" as plural is still considered correct) and receives no help for determining plurality because it does not detect any verbs.
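To make the comparison concrete, the manual tags and the NLTK output shown above can be scored token by token. The following is a minimal sketch rather than part of the original analysis; the helper name `score_tags` is hypothetical, and the final period is left out of both lists because the manual table does not tag it.

```python
# Manual (gold) tags from the table above and the NLTK output for the
# eight-word buffalo sentence (the final period is omitted from both).
tokens = ["Buffalo", "buffalo", "Buffalo", "buffalo",
          "buffalo", "buffalo", "Buffalo", "buffalo"]
manual = ["NNP", "NNS", "NNP", "NNS", "VBP", "VBP", "NNP", "NNS"]
nltk_tags = ["NNP", "NN", "NNP", "NN", "NN", "NN", "NNP", "NN"]


def score_tags(tokens, gold, predicted):
    """Return (accuracy, mismatches) for equally long token and tag lists."""
    if not (len(tokens) == len(gold) == len(predicted)):
        raise ValueError("tokens, gold, and predicted must have equal length")
    mismatches = [
        (i, tok, g, p)
        for i, (tok, g, p) in enumerate(zip(tokens, gold, predicted))
        if g != p
    ]
    accuracy = 1 - len(mismatches) / len(tokens)
    return accuracy, mismatches


accuracy, mismatches = score_tags(tokens, manual, nltk_tags)
print(f"Accuracy against manual tags: {accuracy:.3f}")
for i, tok, g, p in mismatches:
    print(f"index {i}: {tok!r} manual={g} tagger={p}")
```

Only the three capitalized proper nouns match the manual tags, so every lowercase "buffalo" counts as an error, either for plurality (NN versus NNS) or for the missed verb reading (NN versus VBP).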

Some interesting things happen if the novelty of this sentence is broken by changing "buffalo" to "buffaloes," which helps the tagger register plural nouns and usage as a verb.

In [8]:
token_tag("Buffalo buffaloes Buffalo buffaloes buffalo buffaloes Buffalo buffaloes.")

[('Buffalo', 'NNP'),
 ('buffaloes', 'NNS'),
 ('Buffalo', 'NNP'),
 ('buffaloes', 'NNS'),
 ('buffalo', 'VBD'),
 ('buffaloes', 'NNS'),
 ('Buffalo', 'NNP'),
 ('buffaloes', 'NNS'),
 ('.', '.')]

A shorter version that retains ambiguity can trick the tagger in a similar fashion, largely due to the low likelihood of even considering the verb usage of the word "buffalo." Here is the expected output for a simplified, yet still grammatically correct, three-word edition of the sentence.

| | Buffalo | buffalo | buffalo |
| - | - | - | - |
| Parse tags: | N | V | N |
| Penn POS tags: | NNS | VBP | NNS |

The meaning could be understood as "Bison bully bison," where "buffalo" is still being used in the plural. Note that the first capitalization is because that word begins the sentence, not because it means the city as it did before. The actual tag output errs on that and again fails to detect the verb, also likely due to uncommon usage in the training data as suggested before.

In [9]:
buffalo_short = "Buffalo buffalo buffalo."

buffalo_short_df = pd.DataFrame(token_tag(buffalo_short), columns=["Token", "NLTK POS Tag"])
buffalo_short_df

| | Token | NLTK POS Tag |
| - | - | - |
| 0 | Buffalo | NNP |
| 1 | buffalo | NN |
| 2 | buffalo | NN |
| 3 | . | . |
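The conjecture that a rare verb reading gets drowned out by the common noun reading can be illustrated with a most-frequent-tag baseline. This is a deliberately simplified sketch and not how NLTK's tagger works internally, since that tagger also uses context features. The tiny corpus is invented for illustration, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy training data, invented purely for illustration: the noun reading of
# "buffalo" is frequent and the verb reading appears only once.
toy_corpus = [
    [("the", "DT"), ("buffalo", "NN"), ("grazed", "VBD")],
    [("a", "DT"), ("buffalo", "NN"), ("ran", "VBD")],
    [("wild", "JJ"), ("buffalo", "NNS"), ("roam", "VBP")],
    [("they", "PRP"), ("buffalo", "VBP"), ("rivals", "NNS")],
]


def train_most_frequent_tag(tagged_sentences):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_with_baseline(tokens, model, default="NN"):
    """Tag tokens by lookup, falling back to `default` for unseen words."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


model = train_most_frequent_tag(toy_corpus)
baseline_tags = tag_with_baseline(["Buffalo", "buffalo", "buffalo"], model)
print(baseline_tags)
```

Because the singular noun reading dominates the toy counts, every occurrence comes back as NN and the verb is never proposed. A context-aware tagger can do better in principle, but only if the verb reading is represented well enough in its training data.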


---

### <a name="question2">Question 2</a> 

<b>Run a different POS tagger in Python. Process the same two sentences from question 1.

* Does it produce the same or different output?
* Explain any differences as best you can.</b> <sub>[(back to top)](#toc)</sub>

spaCy is another NLP library in Python that has POS tagging capabilities. The en_core_web_sm model is a pretrained convolutional neural network (CNN) trained on OntoNotes, which builds on top of the Penn Treebank.

An initial run of this section used the simplified Universal Dependencies POS tag set on `token.pos_` before I realized the Penn Treebank tag set was also available via `token.tag_`. The code below uses `token.tag_` so that spaCy's output can be compared directly with NLTK's. The Universal Dependencies tag set does not go into details like singular versus plural nouns, unlike the Penn tag set.
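The loss of detail in the coarse tag set can be sketched without running spaCy at all. The mapping below is a hand-written, partial assumption that covers only tags appearing in this notebook. It is not spaCy's own mapping table, and it simplifies IN to a single category.

```python
# Hand-written, partial mapping from Penn Treebank tags to coarse
# Universal POS categories. This is an assumption for illustration only:
# it covers just the tags used in this notebook and simplifies IN to ADP.
PENN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "DT": "DET", "PRP": "PRON", "UH": "INTJ",
    "IN": "ADP", "TO": "PART", ".": "PUNCT", ",": "PUNCT",
}


def collapse(tags, mapping=PENN_TO_COARSE, unknown="X"):
    """Replace each fine-grained tag with its coarse category."""
    return [mapping.get(tag, unknown) for tag in tags]


# Singular and plural nouns become indistinguishable after collapsing.
print(collapse(["NN", "NNS"]))
# Verb tense and agreement distinctions disappear as well.
print(collapse(["VBZ", "VBP", "VBD"]))
```

Since the questions here hinge on distinctions such as NN versus NNS and VBP versus VB, the fine-grained Penn tags are the more useful choice for comparing the two taggers.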

In [10]:
import spacy
import en_core_web_sm
eng_mod = en_core_web_sm.load()

def spacy_tag(sentence):
    """
    POS tagging using spaCy.
    
    Args:
        sentence (str): The text to be tagged.
    
    Returns:
        tags (list): A list of POS tags.
    
    """
    
    doc = eng_mod(sentence)
    tags = [token.tag_ for token in doc]
    return tags

In [11]:
pangram_spacy = spacy_tag(pangram)
pangram_df['spaCy POS Tag'] = pangram_spacy
pangram_df

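| | Token | NLTK POS Tag | spaCy POS Tag |
| - | - | - | - |
| 0 | The | DT | DT |
| 1 | quick | JJ | JJ |
| 2 | brown | NN | JJ |
| 3 | fox | NN | NN |
| 4 | jumps | VBZ | NNS |
| 5 | over | IN | IN |
| 6 | the | DT | DT |
| 7 | lazy | JJ | JJ |
| 8 | dog | NN | NN |
| 9 | . | . | . |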


Interestingly enough, the spaCy tagger makes a different mistake on the pangram than the NLTK tagger. While spaCy correctly calls "brown" another adjective, it does not identify "jumps" as a verb and instead labels it a plural noun.

To address the long and short sentences part of this question, the spaCy tags are printed out alongside the NLTK tags for ease of comparison. First is the long Kennedy sentence about going to the moon.

In [12]:
decision_spacy = spacy_tag(decision)
decision_df['spaCy POS Tag'] = decision_spacy
decision_df

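| | Token | NLTK POS Tag | spaCy POS Tag |
| - | - | - | - |
| 0 | We | PRP | PRP |
| 1 | choose | VBP | VBP |
| 2 | to | TO | TO |
| 3 | go | VB | VB |
| 4 | to | TO | IN |
| 5 | the | DT | DT |
| 6 | moon | NN | NN |
| 7 | in | IN | IN |
| 8 | this | DT | DT |
| 9 | decade | NN | NN |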


There are three POS tag differences shown in the table. The reasoning behind the taggers' decisions likely stems back to the manual tagging done on the training datasets. The words in question are very commonly used in both parts of speech given, which can make diagnosing the correct one tricky.

* Index 4 ("to", TO, IN): The context is "We choose to go <u>to</u> the moon..." where "to" receives the dedicated TO tag from NLTK and the preposition or subordinating conjunction tag (IN) from spaCy. The key is the succeeding phrase "the moon," which indicates that this "to" describes where the infinitive "to go" is directed. In this sense it is a preposition. Both the IN and TO tag descriptions include prepositions, so it seems that both could be valid.
* Index 11 ("do", VBP, VB): The context is "We choose...and <u>do</u> the other things..." where "do" is labeled a non-3rd person singular present verb by NLTK and a base form verb by spaCy. Of the two, VBP feels more correct, as the conjunction "and" preceding it suggests that the pronoun "We" at the beginning of the sentence is performing both verbs.
* Index 50 ("that", IN, WDT): The context is "...because that challenge is one <u>that</u> we are willing to accept..." where "that" is labeled a preposition or subordinating conjunction by NLTK and a wh-determiner by spaCy. Of these, the latter case makes more sense, as "we are willing to accept" answers a wh-question, such as "Which challenge?"

These cases feel like they could go one way or the other, and my personal judgments might pale compared to that of another with more knowledge of linguistics. Nevertheless, no major content or action words (nouns and verbs) that would likely matter more in analysis were labeled incorrectly.
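The distinction at index 4 can be made explicit with a small heuristic that looks at the tag of the following token. This is only an illustrative sketch under the assumption that a "to" followed by a base-form verb (VB) is an infinitive marker and any other "to" is a preposition. The function name `describe_to` is hypothetical, and neither tagger is claimed to work this way.

```python
def describe_to(tagged, index):
    """Guess the role of 'to' at `index` from the tag of the next token."""
    word, _ = tagged[index]
    if word.lower() != "to":
        raise ValueError(f"token at index {index} is {word!r}, not 'to'")
    next_tag = tagged[index + 1][1] if index + 1 < len(tagged) else None
    # Assumed rule: 'to' + base-form verb is an infinitive marker.
    return "infinitive marker" if next_tag == "VB" else "preposition"


# First ten tokens of the sentence with the NLTK tags from the table above.
tagged = [
    ("We", "PRP"), ("choose", "VBP"), ("to", "TO"), ("go", "VB"),
    ("to", "TO"), ("the", "DT"), ("moon", "NN"), ("in", "IN"),
    ("this", "DT"), ("decade", "NN"),
]

roles = {i: describe_to(tagged, i)
         for i, (word, _) in enumerate(tagged) if word == "to"}
print(roles)
```

Under this rule, the "to" at index 2 is an infinitive marker and the "to" at index 4 is a preposition. That matches spaCy's split between TO and IN, while NLTK's single TO tag covers both roles.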

Next, the buffalo sentence in both its long and short variants is passed to the spaCy tagger.

In [13]:
nltk.help.upenn_tagset('WDT')

WDT: WH-determiner
    that what whatever which whichever


In [14]:
buffalo_long_spacy = spacy_tag(buffalo_long)
buffalo_long_df['spaCy POS Tag'] = buffalo_long_spacy
buffalo_long_df

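| | Token | NLTK POS Tag | spaCy POS Tag |
| - | - | - | - |
| 0 | Buffalo | NNP | NNP |
| 1 | buffalo | NN | NNP |
| 2 | Buffalo | NNP | NNP |
| 3 | buffalo | NN | NNP |
| 4 | buffalo | NN | NNP |
| 5 | buffalo | NN | NNP |
| 6 | Buffalo | NNP | NNP |
| 7 | buffalo | NN | NNP |
| 8 | . | . | . |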


In [15]:
buffalo_short_spacy = spacy_tag(buffalo_short)
buffalo_short_df['spaCy POS Tag'] = buffalo_short_spacy
buffalo_short_df

| | Token | NLTK POS Tag | spaCy POS Tag |
| - | - | - | - |
| 0 | Buffalo | NNP | NNP |
| 1 | buffalo | NN | NNP |
| 2 | buffalo | NN | NNP |
| 3 | . | . | . |


The spaCy tagger misses the verb usage similarly to the NLTK tagger, but defaults all instances of "buffalo" to the singular proper noun. This is interesting, as one would not expect an uncapitalized instance of an animal to be labeled as a proper noun like this. Perhaps, because the verb form is so uncommon, the spaCy tagger perceives these repeated words as compound nouns, which is helped along by the first instance that is simply capitalized because it begins the sentence. It is possible that the OntoNotes training dataset, which has content from more varied sources (including broadcast news, telephone conversations, and web data) than the Penn Treebank dataset (Wall Street Journal stories), led the underlying CNN for the spaCy tagger to learn patterns that produce this behavior.

---

### <a name="question3">Question 3</a> 

<b>In a news article from this week's news, find a random sentence of at least 10 words.

* Looking at the Penn tag set, manually POS tag the sentence yourself.
* Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
* Explain any differences between the two taggers and your manual tagging as much as you can.</b> <sub>[(back to top)](#toc)</sub>

Major current events this past week include the worldwide spread of COVID-19 and the ninth debate of the US Democratic presidential primary. But in more entertaining news, a part of the 11-story Affiliated Computer Services building in Dallas, TX survived a planned implosion, resulting in a social media landmark dubbed the "Leaning Tower of Dallas."

Read more at "Leaning Tower of Dallas survived demolition to become city's accidental Instagram star," an article written by Charles Trepany for USA Today: https://www.usatoday.com/story/travel/destinations/2020/02/20/leaning-tower-dallas-citys-accidental-tourist-destination/4825127002/

While the core will inevitably be fully demolished, it will live on in Internet fame through pictures, including those of a LEGO likeness built by Matt Graham at LEGOLAND Discovery Center Dallas/Fort Worth. The closing line of the article pokes self-aware fun at tourist tendencies and, with 12 words, is the sentence of choice for this task.

"Yes, Graham included little LEGO people ogling it with little LEGO iPhones."

This sentence is fairly straightforward compared to the buffalo one used in Questions 1 and 2. There are no homonyms introducing ambiguity, and the echoing repetition of "little LEGO [people/iPhones]" should be exactly the same in parts of speech, aside from the last word. The proper nouns are the main words that could be expected to trip up the POS taggers, although they would likely default to nouns for unknown words anyway. Below is my manual tagging using the Penn tag set.

| | Yes | Graham | included | little | LEGO | people | ogling | it | with | little | LEGO | iPhones |
| - | - | - | - | - | - | - | - | - | - | - | - | - | 
| Penn POS tags: | UH | NNP | VBD | JJ | NNP | NNS | VBG | PRP | IN | JJ | NNP | NNS |

And next, here are the results from the NLTK and spaCy taggers:

In [16]:
# Long live the Leaning Tower of Dallas (ltd)!
ltd = "Yes, Graham included little LEGO people ogling it with little LEGO iPhones."

ltd_df = pd.DataFrame(token_tag(ltd), columns=["Token", "NLTK POS Tag"])
ltd_spacy = spacy_tag(ltd)
ltd_df['spaCy POS Tag'] = ltd_spacy
ltd_df

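| | Token | NLTK POS Tag | spaCy POS Tag |
| - | - | - | - |
| 0 | Yes | UH | UH |
| 1 | , | , | , |
| 2 | Graham | NNP | NNP |
| 3 | included | VBD | VBD |
| 4 | little | JJ | JJ |
| 5 | LEGO | NNP | NNP |
| 6 | people | NNS | NNS |
| 7 | ogling | VBG | VBG |
| 8 | it | PRP | PRP |
| 9 | with | IN | IN |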


This time, both the NLTK and spaCy POS taggers produced the same results, and they happen to agree with my manual judgments using the Penn tag set.
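One practical wrinkle in this check is that the manual table tags only words, while both taggers also emit rows for punctuation. A minimal sketch of aligning the two follows. It is restricted to the rows visible in the output above (through "with"), and the helper names `drop_punctuation` and `all_agree` are hypothetical.

```python
import string

# Manual tags for the words up to "with", taken from the manual table.
manual = [
    ("Yes", "UH"), ("Graham", "NNP"), ("included", "VBD"), ("little", "JJ"),
    ("LEGO", "NNP"), ("people", "NNS"), ("ogling", "VBG"), ("it", "PRP"),
    ("with", "IN"),
]

# Tagger rows as shown in the output table (identical for NLTK and spaCy).
nltk_rows = [
    ("Yes", "UH"), (",", ","), ("Graham", "NNP"), ("included", "VBD"),
    ("little", "JJ"), ("LEGO", "NNP"), ("people", "NNS"), ("ogling", "VBG"),
    ("it", "PRP"), ("with", "IN"),
]
spacy_rows = list(nltk_rows)


def drop_punctuation(rows):
    """Remove rows whose token consists only of punctuation characters."""
    return [(tok, tag) for tok, tag in rows
            if not all(ch in string.punctuation for ch in tok)]


def all_agree(*taggings):
    """Return True when every tagging is identical to the first one."""
    first, *rest = taggings
    return all(tagging == first for tagging in rest)


agreement = all_agree(manual,
                      drop_punctuation(nltk_rows),
                      drop_punctuation(spacy_rows))
print("Manual, NLTK, and spaCy agree:", agreement)
```

Filtering on the token rather than the tag keeps the alignment independent of how each tagger labels punctuation.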

While there are no differences to discuss in this section, from the investigations in Questions 1 and 2, it seems that taggers are more likely to get different results when some of the words used have multiple synsets.