## DS 7337: Natural Language Processing

## Jaclyn Coate
## Homework 4

#### Spring 2021
#### Natural Language Processing w/ Python: Bird, Klein, & Loper

In [1]:
import nltk
import numpy as np
from nltk.metrics import *
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from pattern3.text.en import tag
from difflib import SequenceMatcher

##### HW 4: Question 1
Run one of the part-of-speech (POS) taggers available in Python.

a. Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.

In [25]:
longSentence = "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
shortSentence = "Call me Ishmael."

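# NOTE: nltk.pos_tag expects a list of tokens (e.g. nltk.word_tokenize(longSentence));
# given a raw string, it iterates over it and tags each character.
taggedLongSent = nltk.pos_tag(longSentence)
taggedShortSent = nltk.pos_tag(shortSentence)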

print('Original long sentence:')
print('-----------------------')
print(longSentence,'\n')
print('POS tagged sentence:')
print('--------------------')
print(taggedLongSent,'\n')

print('Original short sentence:')
print('-----------------------')
print(shortSentence,'\n')
print('POS tagged sentence:')
print('--------------------')
print(taggedShortSent,'\n')

Original long sentence:
-----------------------
Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. 

POS tagged sentence:
--------------------
[('C', 'VB'), ('a', 'DT'), ('l', 'NN'), ('l', 'NN'), (' ', 'NNP'), ('m', 'NN'), ('e', 'NN'), (' ', 'NN'), ('I', 'PRP'), ('s', 'VBP'), ('h', 'JJ'), ('m', 'FW'), ('a', 'DT'), ('e', 'NN'), ('l', 'NN'), ('.', '.'), (' ', 'CC'), ('S', 'NNP'), ('o', 'VBP'), ('m', 'NN'), ('e', 'NN'), (' ', 'NNP'), ('y', 'NNP'), ('e', 'VBZ'), ('a', 'DT'), ('r', 'NN'), ('s', 'NN'), (' ', 'VBZ'), ('a', 'DT'), ('g', 'NN'), ('o', 'NN'), ('—', 'NNP'), ('n', 'CC'), ('e', 'JJ'), ('v', 'NN'), ('e', 'NN'), ('r', 'NN'), (' ', 'NNP'), ('m', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('d', 'NN'), (' ', 'NN'), ('h', 'NN'), ('o', 'JJ'), ('w', 'NN'), (' ', 'NNP'), ('l', 'NN'), ('o', 'NN'), ('n', 'JJ'), ('g', 'NN'), (' ', '
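
The tagged output above has one entry per character because `nltk.pos_tag` iterates over whatever sequence it receives, and iterating over a Python string yields single characters. A minimal sketch of tokenizing first is shown below. The `simple_tokenize` helper is a hypothetical regex stand-in, not NLTK's tokenizer, so the example runs without any downloaded data. The optional `nltk` call at the end assumes NLTK and its tokenizer and tagger data are installed.

```python
import re


def simple_tokenize(text):
    """Rough stand-in for nltk.word_tokenize: words (keeping internal
    apostrophes) and individual punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


sentence = "Call me Ishmael."

# Iterating over a string yields characters, which is what a tagger
# sees when it is handed the raw sentence instead of a token list.
chars = list(sentence)[:5]
print(chars)   # ['C', 'a', 'l', 'l', ' ']

# Tokenizing first gives the tagger whole words and punctuation marks.
tokens = simple_tokenize(sentence)
print(tokens)  # ['Call', 'me', 'Ishmael', '.']

try:
    import nltk
    # Word-level tagging: tokenize, then tag the resulting token list.
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
except (ImportError, LookupError):
    # NLTK, or its tokenizer/tagger data, is not available here.
    pass
```

Note that NLTK's own tokenizer splits contractions differently from this stand-in (for example, it separates "I'm" into two tokens), so the two should not be treated as interchangeable.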

b. Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.

I did not feel I could effectively 'break' the POS tagger, so I used a colloquial sentence that ends in a preposition to see what it would do.

In [27]:
breakPOS = "I'm going with."
tokensBreak = nltk.word_tokenize(breakPOS)

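# NOTE: tokensBreak is not used below; pos_tag receives the raw string,
# so it tags one character at a time.
taggedbreakPOS = nltk.pos_tag(breakPOS)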

print('Original Sentence to Break POS:')
print('-----------------------')
print(breakPOS,'\n')
print('POS Broken Error')
print('--------------------')
print(taggedbreakPOS,'\n')

Original Sentence to Break POS:
-----------------------
I'm going with. 

POS Broken Error
--------------------
[('I', 'PRP'), ("'", "''"), ('m', 'JJ'), (' ', 'NNP'), ('g', 'NN'), ('o', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('g', 'NN'), (' ', 'NNP'), ('w', 'NN'), ('i', 'NN'), ('t', 'VBP'), ('h', 'NN'), ('.', '.')] 



##### HW 4: Question 2
Run a different POS tagger in Python. Process the same two sentences from _Question 1_.

a. Does it produce the same or different output?

In [4]:
import spacy
sp = spacy.load('en_core_web_sm')

In [20]:
longsenspacy = sp(u"Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.")
print('Original long sentence:')
print(longsenspacy.text)
print('-----------------------','\n')

# pos_ returns the coarse-grained POS tag; index 2 is the third token ('Ishmael')
print('Coarse-Grained POS Tagging for 3rd Word')
print(longsenspacy[2].pos_)
print('-----------------------','\n')

# tag_ returns the fine-grained tag; spacy.explain gives its description
print('Granular POS Tagging for 3rd Word')
print(spacy.explain(longsenspacy[2].tag_))
print('-----------------------','\n')

# Full analyzed sentence
print('Granular POS Tagging for Entire Sentence', '\n')
for word in longsenspacy:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

Original long sentence:
Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
----------------------- 

Coarse-Grained POS Tagging for 3rd Word
PROPN
----------------------- 

Granular POS Tagging for 3rd Word
noun, proper singular
----------------------- 

Granular POS Tagging for Entire Sentence 

Call         VERB       VB       verb, base form
me           PRON       PRP      pronoun, personal
Ishmael      PROPN      NNP      noun, proper singular
.            PUNCT      .        punctuation mark, sentence closer
Some         DET        DT       determiner
years        NOUN       NNS      noun, plural
ago          ADV        RB       adverb
—            PUNCT      :        punctuation mark, colon or ellipsis
never        ADV        RB       adverb
mind         VERB       VB       verb, base form
how          ADV    

b. Explain any differences as best you can.

There are many similarities and differences. Both taggers regularly recognize verbs and subjects, and adjectives tend to be identified similarly; most of the differences I see involve prepositional phrases. I find the spaCy POS tagger much more informative. As someone who is new to NLP, being able to see the word, its part of speech, and its POS tag on a single line with a short explanation makes the data much easier to digest.
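
A comparison like this can also be made mechanically. The sketch below is one possible approach, not part of the original assignment: it aligns two lists of `(token, tag)` pairs with `difflib.SequenceMatcher`, because two taggers may tokenize the same sentence differently, and then reports where the tags agree and disagree. The function name `compare_taggings` and the toy inputs are hypothetical; the toy tags are invented for illustration and are not real output from either tagger.

```python
from difflib import SequenceMatcher


def compare_taggings(tagged_a, tagged_b):
    """Align two (token, tag) lists on their tokens and split the
    tokens both taggers share into agreeing and disagreeing tags."""
    words_a = [word for word, _ in tagged_a]
    words_b = [word for word, _ in tagged_b]
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)

    same, different = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            continue  # tokens that only one tagger produced are skipped
        for offset in range(i2 - i1):
            word, tag_a = tagged_a[i1 + offset]
            _, tag_b = tagged_b[j1 + offset]
            bucket = same if tag_a == tag_b else different
            bucket.append((word, tag_a, tag_b))
    return same, different


# Toy inputs invented for illustration only; NOT real tagger output.
toy_a = [("I", "PRP"), ("'m", "VBP"), ("going", "VBG"), ("with", "IN"), (".", ".")]
toy_b = [("I'm", "PRP"), ("going", "VBG"), ("with", "RP"), (".", ".")]

same, different = compare_taggings(toy_a, toy_b)
print("agree:   ", same)
print("disagree:", different)
```

Skipping the unaligned tokens keeps the sketch short; a fuller comparison would also report them, since tokenization differences are themselves a difference between the taggers.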

##### HW 4: Question 3
In a news article from this week’s news, find a random sentence of at least 10 words.

a. Looking at the Penn tag set, manually POS tag the sentence yourself.

* Noun: Tiger Woods
* Adjective: Seriously
* Verb: Injured
* Preposition: in
* Verb: rollover
* Object: car
* Verb: crash
* Preposition: near
* Noun: Los Angeles

b. Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?

In [28]:
newssen = "Tiger Woods seriously injured in rollover car crash near Los Angeles."
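# NOTE: pos_tag receives the raw string here, so it tags one character at a time;
# pass nltk.word_tokenize(newssen) to tag whole words.
taggednewssen = nltk.pos_tag(newssen)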

print('Original news sentence:')
print('-----------------------')
print(newssen,'\n')
print('POS tagged news sentence:')
print('--------------------')
print(taggednewssen,'\n')

Original news sentence:
-----------------------
Tiger Woods seriously injured in rollover car crash near Los Angeles. 

POS tagged news sentence:
--------------------
[('T', 'NNP'), ('i', 'NN'), ('g', 'VBP'), ('e', 'NN'), ('r', 'NN'), (' ', 'NNP'), ('W', 'NNP'), ('o', 'MD'), ('o', 'VB'), ('d', 'JJ'), ('s', 'NN'), (' ', 'NNP'), ('s', 'NN'), ('e', 'NN'), ('r', 'NN'), ('i', 'NN'), ('o', 'VBP'), ('u', 'JJ'), ('s', 'NN'), ('l', 'NN'), ('y', 'NN'), (' ', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('j', 'NN'), ('u', 'JJ'), ('r', 'NN'), ('e', 'NN'), ('d', 'NN'), (' ', 'NN'), ('i', 'NN'), ('n', 'VBP'), (' ', 'JJ'), ('r', 'NN'), ('o', 'NN'), ('l', 'NN'), ('l', 'NN'), ('o', 'NN'), ('v', 'NN'), ('e', 'NN'), ('r', 'NN'), (' ', 'NNP'), ('c', 'VBZ'), ('a', 'DT'), ('r', 'NN'), (' ', 'NNP'), ('c', 'VBZ'), ('r', 'VB'), ('a', 'DT'), ('s', 'NN'), ('h', 'NN'), (' ', 'NNP'), ('n', 'RB'), ('e', 'VBZ'), ('a', 'DT'), ('r', 'NN'), (' ', 'NNP'), ('L', 'NNP'), ('o', 'MD'), ('s', 'VB'), (' ', 'VB'), ('A', 'NNP'), ('n', 'JJ

In [33]:
#Full analyzed sentence in spacy
newssen2 = sp("Tiger Woods seriously injured in rollover car crash near Los Angeles.")
print('Original news sentence:')
print(newssen2.text)
print('-----------------------','\n')

print('Spacy POS Tagging:','\n')
for word in newssen2:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

Original news sentence:
Tiger Woods seriously injured in rollover car crash near Los Angeles.
----------------------- 

Spacy POS Tagging: 

Tiger        PROPN      NNP      noun, proper singular
Woods        PROPN      NNP      noun, proper singular
seriously    ADV        RB       adverb
injured      VERB       VBN      verb, past participle
in           ADP        IN       conjunction, subordinating or preposition
rollover     NOUN       NN       noun, singular or mass
car          NOUN       NN       noun, singular or mass
crash        NOUN       NN       noun, singular or mass
near         SCONJ      IN       conjunction, subordinating or preposition
Los          PROPN      NNP      noun, proper singular
Angeles      PROPN      NNP      noun, proper singular
.            PUNCT      .        punctuation mark, sentence closer


c. Explain any differences between the two taggers and your manual tagging as much as you can.

The differences between my manual tagging and the two POS packages come down to human interpretation versus machine learning interpretation, and to the packages being able to run through every variation easily.
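
To make the comparison concrete, the sketch below lines up the manual labels from part (a) with the coarse spaCy tags printed in part (b). The `LABEL_TO_UPOS` mapping from informal labels to spaCy's coarse tags is an assumption made for this illustration, and `disagreements` is a hypothetical helper. "Object" is a grammatical role rather than a part of speech, so it is mapped to no tag at all.

```python
# Manual labels from Question 3a, expanded to one label per word.
manual = [
    ("Tiger", "Noun"), ("Woods", "Noun"), ("seriously", "Adjective"),
    ("injured", "Verb"), ("in", "Preposition"), ("rollover", "Verb"),
    ("car", "Object"), ("crash", "Verb"), ("near", "Preposition"),
    ("Los", "Noun"), ("Angeles", "Noun"),
]

# Coarse spaCy tags copied from the output in part (b); final period omitted.
spacy_coarse = [
    ("Tiger", "PROPN"), ("Woods", "PROPN"), ("seriously", "ADV"),
    ("injured", "VERB"), ("in", "ADP"), ("rollover", "NOUN"),
    ("car", "NOUN"), ("crash", "NOUN"), ("near", "SCONJ"),
    ("Los", "PROPN"), ("Angeles", "PROPN"),
]

# ASSUMED mapping from the informal manual labels to coarse tags.
LABEL_TO_UPOS = {
    "Noun": {"NOUN", "PROPN"},
    "Adjective": {"ADJ"},
    "Verb": {"VERB"},
    "Preposition": {"ADP"},
    "Object": set(),  # a grammatical role, not a part of speech
}


def disagreements(manual_tags, machine_tags):
    """Return (word, manual label, machine tag) wherever the machine
    tag is not one of the tags allowed for the manual label."""
    out = []
    for (word, label), (machine_word, upos) in zip(manual_tags, machine_tags):
        assert word == machine_word, "both lists must cover the same words"
        if upos not in LABEL_TO_UPOS[label]:
            out.append((word, label, upos))
    return out


diffs = disagreements(manual, spacy_coarse)
for word, label, upos in diffs:
    print(f"{word:<12} manual={label:<12} spaCy={upos}")
print(f"{len(manual) - len(diffs)} of {len(manual)} words agree")
```

Under this mapping, the manual labels and spaCy agree on 6 of the 11 words. The mismatches are "seriously", "rollover", "car", "crash", and "near".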

Citations:
1. http://www.nltk.org/book/ch05.html

2. https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/
