# Cameron Stewart
# HW4

## 1.	Run one of the part-of-speech (POS) taggers available in Python. 
* Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
* Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence. 


I plan to use the NLTK POS tagger for Question 1. Below, I show the input and output for a long sentence that the tagger tags completely correctly.

In [1]:
import nltk

In [2]:
long_sentence = "After I finish this assignment, I will need to take a vacation."
tokenized_long_sentence = nltk.word_tokenize(long_sentence)
nltk.pos_tag(tokenized_long_sentence)

[('After', 'IN'),
 ('I', 'PRP'),
 ('finish', 'VBP'),
 ('this', 'DT'),
 ('assignment', 'NN'),
 (',', ','),
 ('I', 'PRP'),
 ('will', 'MD'),
 ('need', 'VB'),
 ('to', 'TO'),
 ('take', 'VB'),
 ('a', 'DT'),
 ('vacation', 'NN'),
 ('.', '.')]

Now, we will look at a short sentence that the NLTK POS tagger does not tag correctly.

In [3]:
short_sentence = "Big Red is a spicy gum."
tokenized_short_sentence = nltk.word_tokenize(short_sentence)
nltk.pos_tag(tokenized_short_sentence)

[('Big', 'JJ'),
 ('Red', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('spicy', 'NN'),
 ('gum', 'NN'),
 ('.', '.')]

The NLTK POS tagger made two mistakes on this short sentence. First, 'Big Red' is a brand name, so both words should be proper nouns (NNP), but the tagger marked 'Big' as an adjective (JJ). Second, 'spicy' is an adjective describing the noun 'gum', but it was tagged as a noun (NN). What makes this especially odd is that 'spicy' has no word sense in WordNet where it could be interpreted as a noun.
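The two errors described above can also be found mechanically by comparing the tagger output against a hand-corrected reference. This is only a minimal sketch: `tag_mismatches` is a hypothetical helper (not part of NLTK), and the gold tags are my own corrections based on the reasoning above.

```python
# Output copied from the NLTK run above.
nltk_tags = [('Big', 'JJ'), ('Red', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
             ('spicy', 'NN'), ('gum', 'NN'), ('.', '.')]

# Assumed gold tags: my own corrections, not an official annotation.
gold_tags = [('Big', 'NNP'), ('Red', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
             ('spicy', 'JJ'), ('gum', 'NN'), ('.', '.')]


def tag_mismatches(predicted, gold):
    """Return (token, predicted_tag, gold_tag) for every token whose tags differ."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be the same length")
    mismatches = []
    for (token, pred_tag), (gold_token, gold_tag) in zip(predicted, gold):
        if token != gold_token:
            raise ValueError(f"token mismatch: {token!r} vs {gold_token!r}")
        if pred_tag != gold_tag:
            mismatches.append((token, pred_tag, gold_tag))
    return mismatches


print(tag_mismatches(nltk_tags, gold_tags))
# [('Big', 'JJ', 'NNP'), ('spicy', 'NN', 'JJ')]
```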

## 2.	Run a different POS tagger in Python. Process the same two sentences from question 1.
* Does it produce the same or different output?
* Explain any differences as best you can.


I plan to use the Flair POS tagger for Question 2. It was listed as the most accurate tagger in a comparison on WSJ articles (https://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)). It also runs on PyTorch. Below, I show the input and output for the same long sentence as before.

In [4]:
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/pos-english")

2022-02-13 21:50:36,371 loading file /Users/cameron/.flair/models/pos-english/a9a73f6cd878edce8a0fa518db76f441f1cc49c2525b2b4557af278ec2f0659e.121306ea62993d04cd1978398b68396931a39eb47754c8a06a87f325ea70ac63


In [5]:
# make example sentence
sentence = Sentence("After I finish this assignment, I will need to take a vacation.")

# predict POS tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted POS tags
print('The following POS tags are found:')
# iterate over the tagged tokens and print each one
for entity in sentence.get_spans('pos'):
    print(entity)

Sentence: "After I finish this assignment , I will need to take a vacation ."   [− Tokens: 14  − Token-Labels: "After <IN> I <PRP> finish <VBP> this <DT> assignment <NN> , <,> I <PRP> will <MD> need <VB> to <TO> take <VB> a <DT> vacation <NN> . <.>"]
The following POS tags are found:
Span [1]: "After"   [− Labels: IN (0.9993)]
Span [2]: "I"   [− Labels: PRP (1.0)]
Span [3]: "finish"   [− Labels: VBP (1.0)]
Span [4]: "this"   [− Labels: DT (1.0)]
Span [5]: "assignment"   [− Labels: NN (1.0)]
Span [6]: ","   [− Labels: , (1.0)]
Span [7]: "I"   [− Labels: PRP (1.0)]
Span [8]: "will"   [− Labels: MD (1.0)]
Span [9]: "need"   [− Labels: VB (1.0)]
Span [10]: "to"   [− Labels: TO (1.0)]
Span [11]: "take"   [− Labels: VB (0.9998)]
Span [12]: "a"   [− Labels: DT (1.0)]
Span [13]: "vacation"   [− Labels: NN (1.0)]
Span [14]: "."   [− Labels: . (1.0)]


Now, we will look at the same short sentence that the NLTK POS tagger did not tag correctly.

In [6]:
# make example sentence
sentence = Sentence("Big Red is a spicy gum.")

# predict POS tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted POS tags
print('The following POS tags are found:')
# iterate over the tagged tokens and print each one
for entity in sentence.get_spans('pos'):
    print(entity)

Sentence: "Big Red is a spicy gum ."   [− Tokens: 7  − Token-Labels: "Big <JJ> Red <NNP> is <VBZ> a <DT> spicy <JJ> gum <NN> . <.>"]
The following POS tags are found:
Span [1]: "Big"   [− Labels: JJ (0.8844)]
Span [2]: "Red"   [− Labels: NNP (0.7943)]
Span [3]: "is"   [− Labels: VBZ (1.0)]
Span [4]: "a"   [− Labels: DT (1.0)]
Span [5]: "spicy"   [− Labels: JJ (0.9904)]
Span [6]: "gum"   [− Labels: NN (1.0)]
Span [7]: "."   [− Labels: . (1.0)]


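On the long sentence, Flair produced exactly the same tags as NLTK. On the short sentence, the 'Big' in the 'Big Red' brand name was still incorrectly tagged as an adjective instead of a proper noun, although with a relatively low confidence of 0.8844. 'spicy' is now correctly tagged as an adjective. Overall, Flair did better than NLTK on these two sentences, with one error instead of two.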
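To compare the two taggers directly, Flair's printed `Token-Labels` string can be converted into the same list of `(token, tag)` pairs that `nltk.pos_tag` returns. This sketch parses the printed text shown above rather than using Flair's object API, so it assumes the `token <TAG>` layout of that output; `parse_token_labels` is a hypothetical helper name.

```python
import re

# Token-Labels string copied from the Flair output above.
flair_labels = "Big <JJ> Red <NNP> is <VBZ> a <DT> spicy <JJ> gum <NN> . <.>"

# NLTK output for the same sentence, copied from Question 1.
nltk_tags = [('Big', 'JJ'), ('Red', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
             ('spicy', 'NN'), ('gum', 'NN'), ('.', '.')]


def parse_token_labels(text):
    """Turn 'token <TAG> token <TAG> ...' into a list of (token, tag) pairs."""
    return re.findall(r"(\S+) <([^>\s]+)>", text)


flair_tags = parse_token_labels(flair_labels)

# Collect (token, nltk_tag, flair_tag) wherever the two taggers disagree.
disagreements = [(token, nltk_tag, flair_tag)
                 for (token, nltk_tag), (_, flair_tag) in zip(nltk_tags, flair_tags)
                 if nltk_tag != flair_tag]

print(disagreements)
# [('spicy', 'NN', 'JJ')]
```

The only disagreement is 'spicy', which matches the comparison above: both taggers agree (incorrectly) on 'Big'.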

## 3.	In a news article from this week’s news, find a random sentence of at least 10 words.
* Looking at the Penn tag set, manually POS tag the sentence yourself.
* Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
* Explain any differences between the two taggers and your manual tagging as much as you can.


The sentence was pulled from a Super Bowl article found here: https://sports.yahoo.com/rams-beat-bengals-for-super-bowl-lvi-championship-thanks-to-cooper-kupps-heroics-030114941.html

Selected Sentence: Cooper Kupp was about the only option the Los Angeles Rams had on offense at the end of Super Bowl LVI.

Manual Tagging: <br>
Cooper/  NNP<br>
Kupp/    NNP<br>
was/     VBD<br>
about/   RB<br>
the/     DT<br>
only/    JJ<br>
option/  NN<br>
the/     DT<br>
Los/     NNP<br>
Angeles/ NNP<br>
Rams/    NNPS<br>
had/     VBD<br>
on/      IN<br>
offense/ NN<br>
at/      IN<br>
the/     DT<br>
end/     NN<br>
of/      IN<br>
Super/   NNP<br>
Bowl/    NNP<br>
LVI/     NNP<br>
./       .

Next, we will use the NLTK POS tagger on the same sentence.

In [7]:
news_sentence = "Cooper Kupp was about the only option the Los Angeles Rams had on offense at the end of Super Bowl LVI."
tokenized_news_sentence = nltk.word_tokenize(news_sentence)
nltk.pos_tag(tokenized_news_sentence)

[('Cooper', 'NNP'),
 ('Kupp', 'NNP'),
 ('was', 'VBD'),
 ('about', 'IN'),
 ('the', 'DT'),
 ('only', 'JJ'),
 ('option', 'NN'),
 ('the', 'DT'),
 ('Los', 'NNP'),
 ('Angeles', 'NNP'),
 ('Rams', 'NNP'),
 ('had', 'VBD'),
 ('on', 'IN'),
 ('offense', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('end', 'NN'),
 ('of', 'IN'),
 ('Super', 'NNP'),
 ('Bowl', 'NNP'),
 ('LVI', 'NNP'),
 ('.', '.')]

Finally, we will use the Flair POS tagger on the same sentence.

In [8]:
# make example sentence
sentence = Sentence("Cooper Kupp was about the only option the Los Angeles Rams had on offense at the end of Super Bowl LVI.")

# predict POS tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted POS tags
print('The following POS tags are found:')
# iterate over the tagged tokens and print each one
for entity in sentence.get_spans('pos'):
    print(entity)

Sentence: "Cooper Kupp was about the only option the Los Angeles Rams had on offense at the end of Super Bowl LVI ."   [− Tokens: 22  − Token-Labels: "Cooper <NNP> Kupp <NNP> was <VBD> about <RB> the <DT> only <JJ> option <NN> the <DT> Los <NNP> Angeles <NNP> Rams <NNPS> had <VBD> on <IN> offense <NN> at <IN> the <DT> end <NN> of <IN> Super <NNP> Bowl <NNP> LVI <NNP> . <.>"]
The following POS tags are found:
Span [1]: "Cooper"   [− Labels: NNP (0.8798)]
Span [2]: "Kupp"   [− Labels: NNP (1.0)]
Span [3]: "was"   [− Labels: VBD (1.0)]
Span [4]: "about"   [− Labels: RB (0.9998)]
Span [5]: "the"   [− Labels: DT (1.0)]
Span [6]: "only"   [− Labels: JJ (0.9924)]
Span [7]: "option"   [− Labels: NN (1.0)]
Span [8]: "the"   [− Labels: DT (1.0)]
Span [9]: "Los"   [− Labels: NNP (1.0)]
Span [10]: "Angeles"   [− Labels: NNP (0.9999)]
Span [11]: "Rams"   [− Labels: NNPS (0.9414)]
Span [12]: "had"   [− Labels: VBD (1.0)]
Span [13]: "on"   [− Labels: IN (1.0)]
Span [14]: "offense"   [− Labels: NN (1.

The Flair POS tagger produced the same result as my manual tags, which is consistent with its reported accuracy. The NLTK POS tagger differed from both the manual and Flair tags in two places.

The first difference was the word 'about', which NLTK tagged as a preposition (IN) instead of an adverb (RB). Here 'about' means 'almost or nearly', which is an adverb sense.

The second difference was the word 'Rams', which NLTK tagged as a singular proper noun (NNP) instead of a plural proper noun (NNPS). The Rams are a single team but represent a collective of players. I believe plural proper noun is the correct tag, but I can see why this case is confusing.
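As a rough summary of the comparison, the per-token agreement of each tagger with my manual tags can be computed directly from the outputs shown above. This is a small sketch rather than a real evaluation: it covers a single sentence, treats my manual tags as the gold standard, and assumes the final period is tagged `.`. The `accuracy` function is a hypothetical helper.

```python
tokens = ("Cooper Kupp was about the only option the Los Angeles Rams had "
          "on offense at the end of Super Bowl LVI .").split()

# Manual tags from above; the tag for the final period is assumed to be '.'.
manual = ("NNP NNP VBD RB DT JJ NN DT NNP NNP NNPS VBD "
          "IN NN IN DT NN IN NNP NNP NNP .").split()

# Tags copied from the NLTK output above.
nltk_pred = ("NNP NNP VBD IN DT JJ NN DT NNP NNP NNP VBD "
             "IN NN IN DT NN IN NNP NNP NNP .").split()

# Tags copied from the Token-Labels line of the Flair output above.
flair_pred = ("NNP NNP VBD RB DT JJ NN DT NNP NNP NNPS VBD "
              "IN NN IN DT NN IN NNP NNP NNP .").split()


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


for name, predicted in [("NLTK", nltk_pred), ("Flair", flair_pred)]:
    wrong = [(tok, p, g) for tok, p, g in zip(tokens, predicted, manual) if p != g]
    print(f"{name}: {accuracy(predicted, manual):.1%} of {len(manual)} tokens, mismatches: {wrong}")
# NLTK: 90.9% of 22 tokens, mismatches: [('about', 'IN', 'RB'), ('Rams', 'NNP', 'NNPS')]
# Flair: 100.0% of 22 tokens, mismatches: []
```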