## Homework 4
*Author: Puri Rudick*

##### 1. Run one of the part-of-speech (POS) taggers available in Python.
- Find the longest sentence you can, longer than 10 words, that the POS tagger tags correctly. Show the input and output.
- Find the shortest sentence you can, shorter than 10 words, that the POS tagger fails to tag 100 percent correctly. Show the input and output. Explain your conjecture as to why the tagger might have been less than perfect with this sentence.

In [1]:
import re
import nltk
from nltk import pos_tag
from nltk.tag import UnigramTagger
from nltk.corpus import brown
from nltk import word_tokenize

The tagger I chose here is `pos_tag` from NLTK, which by default uses a pretrained averaged perceptron model and returns Penn Treebank tags.

In [33]:
long_sentence = 'Belle knew that this was her chance to escape, but when she looked at the fallen Beast, she could not leave him.'
short_sentence = 'Giving up is not a choice.'

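# Get words from a sentence: drop every character that is not a letter or a
# space (this also removes hyphens and digits), then split on whitespace
def words(txt):
    comp = re.compile(r'[^a-zA-Z ]+')
    txt = re.sub(comp, '', txt)
    splitted_words = txt.split()
    return splitted_words

# Tag a list of tokens with nltk.pos_tag and return the (word, tag) pairs
def postag(splitted_words):
    return nltk.pos_tag(splitted_words)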

In [34]:
long_words = words(long_sentence)
print('NLTK Pos Tag for Long Sentence:\n', pos_tag(long_words))

short_words = words(short_sentence)
print('\nNLTK Pos Tag for Short Sentence:\n', pos_tag(short_words))

NLTK Pos Tag for Long Sentence:
 [('Belle', 'NNP'), ('knew', 'VBD'), ('that', 'IN'), ('this', 'DT'), ('was', 'VBD'), ('her', 'PRP$'), ('chance', 'NN'), ('to', 'TO'), ('escape', 'VB'), ('but', 'CC'), ('when', 'WRB'), ('she', 'PRP'), ('looked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('fallen', 'VBN'), ('Beast', 'NNP'), ('she', 'PRP'), ('could', 'MD'), ('not', 'RB'), ('leave', 'VB'), ('him', 'PRP')]

NLTK Pos Tag for Short Sentence:
 [('Giving', 'VBG'), ('up', 'RP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('choice', 'NN')]


##### 2. Run a different POS tagger in Python. Process the same two sentences from question 1.
- Does it produce the same or different output?
- Explain any differences as best you can.

In [35]:
brown_tagged_sents = brown.tagged_sents()
unigram_tagger = UnigramTagger(brown_tagged_sents)

print('UnigramTagger for Long Sentence:\n', unigram_tagger.tag(long_words))
print('\nUnigramTagger for Short Sentence:\n', unigram_tagger.tag(short_words))

UnigramTagger for Long Sentence:
 [('Belle', 'NP'), ('knew', 'VBD'), ('that', 'CS'), ('this', 'DT'), ('was', 'BEDZ'), ('her', 'PP$'), ('chance', 'NN'), ('to', 'TO'), ('escape', 'VB'), ('but', 'CC'), ('when', 'WRB'), ('she', 'PPS'), ('looked', 'VBD'), ('at', 'IN'), ('the', 'AT'), ('fallen', 'VBN'), ('Beast', None), ('she', 'PPS'), ('could', 'MD'), ('not', '*'), ('leave', 'VB'), ('him', 'PPO')]

UnigramTagger for Short Sentence:
 [('Giving', 'VBG'), ('up', 'RP'), ('is', 'BEZ'), ('not', '*'), ('a', 'AT'), ('choice', 'NN')]
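
Because `pos_tag` reports Penn Treebank tags while a `UnigramTagger` trained on Brown reports Brown tags, many of the apparent differences are only differences in tag names. The sketch below is one way to check this: it maps Brown tags to Penn tags and lists the tokens that still disagree. The `BROWN_TO_PENN` table and the helper names are hand-written assumptions for illustration and cover only the tags seen in the outputs above, not the full tag sets.

```python
# Hand-written, partial Brown -> Penn Treebank mapping (an assumption for
# illustration; it covers only the tags that appear in the outputs above).
BROWN_TO_PENN = {
    'NP': 'NNP', 'CS': 'IN', 'BEDZ': 'VBD', 'BEZ': 'VBZ', 'PP$': 'PRP$',
    'PPS': 'PRP', 'PPO': 'PRP', 'AT': 'DT', '*': 'RB',
}

def to_penn(tag):
    """Translate a Brown tag to Penn where a mapping is known; otherwise keep it."""
    return BROWN_TO_PENN.get(tag, tag)

def tag_differences(penn_tagged, brown_tagged):
    """Return (word, penn_tag, brown_tag) triples that still differ after mapping."""
    return [
        (word, penn, brown)
        for (word, penn), (_, brown) in zip(penn_tagged, brown_tagged)
        if to_penn(brown) != penn
    ]

# Tags copied from the two outputs above (short sentence, and part of the long one)
nltk_short = [('Giving', 'VBG'), ('up', 'RP'), ('is', 'VBZ'),
              ('not', 'RB'), ('a', 'DT'), ('choice', 'NN')]
unigram_short = [('Giving', 'VBG'), ('up', 'RP'), ('is', 'BEZ'),
                 ('not', '*'), ('a', 'AT'), ('choice', 'NN')]
nltk_long_part = [('the', 'DT'), ('fallen', 'VBN'), ('Beast', 'NNP'), ('she', 'PRP')]
unigram_long_part = [('the', 'AT'), ('fallen', 'VBN'), ('Beast', None), ('she', 'PPS')]

print(tag_differences(nltk_short, unigram_short))          # []
print(tag_differences(nltk_long_part, unigram_long_part))  # [('Beast', 'NNP', None)]
```

Under this mapping the short sentence agrees token for token, and in the excerpt of the long sentence the only remaining disagreement is `Beast`, which the unigram tagger leaves untagged (`None`).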


In [41]:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

def spacy_tag(txt):
    cols = ["text", "pos", "tag", "explain pos", "explain tag"]
    doc = nlp(txt)
    rows = []
    for token in doc:
        row = token.text, token.pos_, token.tag_, spacy.explain(token.pos_), spacy.explain(token.tag_)
        rows.append(row)
    df = pd.DataFrame(rows, columns=cols)
    return df

spacy_tag(long_sentence)

| | text | pos | tag | explain pos | explain tag |
|---|---|---|---|---|---|
| 0 | Belle | PROPN | NNP | proper noun | noun, proper singular |
| 1 | knew | VERB | VBD | verb | verb, past tense |
| 2 | that | SCONJ | IN | subordinating conjunction | conjunction, subordinating or preposition |
| 3 | this | PRON | DT | pronoun | determiner |
| 4 | was | AUX | VBD | auxiliary | verb, past tense |
| 5 | her | PRON | PRP$ | pronoun | pronoun, possessive |
| 6 | chance | NOUN | NN | noun | noun, singular or mass |
| 7 | to | PART | TO | particle | infinitival "to" |
| 8 | escape | VERB | VB | verb | verb, base form |
| 9 | , | PUNCT | , | punctuation | punctuation mark, comma |


In [42]:
spacy_tag(short_sentence)

| | text | pos | tag | explain pos | explain tag |
|---|---|---|---|---|---|
| 0 | Giving | VERB | VBG | verb | verb, gerund or present participle |
| 1 | up | ADP | RP | adposition | adverb, particle |
| 2 | is | AUX | VBZ | auxiliary | verb, 3rd person singular present |
| 3 | not | PART | RB | particle | adverb |
| 4 | a | DET | DT | determiner | determiner |
| 5 | choice | NOUN | NN | noun | noun, singular or mass |
| 6 | . | PUNCT | . | punctuation | punctuation mark, sentence closer |


##### 3. In a news article from this week’s news, find a random sentence of at least 10 words.
- Looking at the Penn tag set, manually POS tag the sentence yourself.
- Now run the same sentences through both taggers that you implemented for questions 1 and 2. Did either of the taggers produce the same results as you had created manually?
- Explain any differences between the two taggers and your manual tagging as much as you can.


My news article: [link](https://www.nbcnews.com/tech/internet/internet-explorers-run-finally-comes-end-rcna33628).

In [48]:
news = 'As of Wednesday, Microsoft will no longer support the once-dominant browser that legions of web surfers loved to hate — and a few still claim to adore.'

In [45]:
# Using nltk.pos_tag
news_words = words(news)
print('NLTK Pos Tag for News Sentence:\n', pos_tag(news_words))

NLTK Pos Tag for News Sentence:
 [('As', 'IN'), ('of', 'IN'), ('Wednesday', 'NNP'), ('Microsoft', 'NNP'), ('will', 'MD'), ('no', 'RB'), ('longer', 'RBR'), ('support', 'VB'), ('the', 'DT'), ('oncedominant', 'JJ'), ('browser', 'NN'), ('that', 'IN'), ('legions', 'NNS'), ('of', 'IN'), ('web', 'NN'), ('surfers', 'NNS'), ('loved', 'VBD'), ('to', 'TO'), ('hate', 'VB'), ('and', 'CC'), ('a', 'DT'), ('few', 'JJ'), ('still', 'RB'), ('claim', 'VBP'), ('to', 'TO'), ('adore', 'VB')]
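
The token `oncedominant` in the output above is a side effect of `words`, which deletes the hyphen before tagging. A minimal sketch of an alternative tokenizer follows; `words_keep_hyphens` is a hypothetical helper, not part of the original notebook, and it assumes that hyphens and apostrophes should be kept only when they sit between letters.

```python
import re

def words_keep_hyphens(txt):
    """Hypothetical variant of words(): keep runs of letters, plus hyphens
    and apostrophes that sit between letters (e.g. 'once-dominant')."""
    return re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", txt)

sample = 'the once-dominant browser that legions of web surfers loved to hate'
print(words_keep_hyphens(sample))
# ['the', 'once-dominant', 'browser', 'that', 'legions', 'of', 'web',
#  'surfers', 'loved', 'to', 'hate']
```

Free-standing dashes and other punctuation are still dropped, so the token list stays comparable to the one produced by `words` apart from the hyphenated compound.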


In [49]:
# Using UnigramTagger
print('UnigramTagger for News Sentence:\n', unigram_tagger.tag(news_words))

UnigramTagger for News Sentence:
 [('As', 'CS'), ('of', 'IN'), ('Wednesday', 'NR'), ('Microsoft', None), ('will', 'MD'), ('no', 'AT'), ('longer', 'RBR'), ('support', 'NN'), ('the', 'AT'), ('oncedominant', None), ('browser', None), ('that', 'CS'), ('legions', 'NNS'), ('of', 'IN'), ('web', 'NN'), ('surfers', None), ('loved', 'VBD'), ('to', 'TO'), ('hate', 'VB'), ('and', 'CC'), ('a', 'AT'), ('few', 'AP'), ('still', 'RB'), ('claim', 'NN'), ('to', 'TO'), ('adore', 'VB')]
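
Two patterns in this output follow from how a unigram tagger works: each word receives the single tag it carried most often in the training data, regardless of context (so `support` and `claim` come out as `NN`), and words never seen in training receive `None` (`Microsoft`, `browser`). The sketch below re-implements that idea in plain Python to make the behavior visible. It is not NLTK's implementation, and `toy_corpus` is a small made-up training set, not data from the Brown corpus.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def unigram_tag(model, tokens, default=None):
    """Look up each token on its own; unseen tokens get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Tiny made-up training set (hypothetical, for illustration only)
toy_corpus = [
    [('the', 'AT'), ('support', 'NN'), ('helped', 'VBD')],
    [('their', 'PP$'), ('support', 'NN'), ('grew', 'VBD')],
    [('they', 'PPSS'), ('will', 'MD'), ('support', 'VB'), ('it', 'PPO')],
]
model = train_unigram(toy_corpus)
tokens = ['Microsoft', 'will', 'support', 'the', 'browser']

print(unigram_tag(model, tokens))
# [('Microsoft', None), ('will', 'MD'), ('support', 'NN'), ('the', 'AT'), ('browser', None)]
print(unigram_tag(model, tokens, default='NN'))
# [('Microsoft', 'NN'), ('will', 'MD'), ('support', 'NN'), ('the', 'AT'), ('browser', 'NN')]
```

In NLTK itself, a similar fallback can be set up by passing a `backoff` tagger, such as `nltk.DefaultTagger('NN')`, when constructing the `UnigramTagger`.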


In [50]:
# Using spaCy
spacy_tag(news)

| | text | pos | tag | explain pos | explain tag |
|---|---|---|---|---|---|
| 0 | As | ADP | IN | adposition | conjunction, subordinating or preposition |
| 1 | of | ADP | IN | adposition | conjunction, subordinating or preposition |
| 2 | Wednesday | PROPN | NNP | proper noun | noun, proper singular |
| 3 | , | PUNCT | , | punctuation | punctuation mark, comma |
| 4 | Microsoft | PROPN | NNP | proper noun | noun, proper singular |
| 5 | will | AUX | MD | auxiliary | verb, modal auxiliary |
| 6 | no | ADV | RB | adverb | adverb |
| 7 | longer | ADV | RB | adverb | adverb |
| 8 | support | VERB | VB | verb | verb, base form |
| 9 | the | DET | DT | determiner | determiner |
