# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:
with open('../../TextFiles/peterrabbit.txt') as file:
    document = file.read()
    doc = nlp(document)


**2. For every token in the third sentence, print the token text, the coarse-grained POS tag, the fine-grained tag (TAG), and the description of the fine-grained tag.**

In [3]:
# Materialize the sentences once so later cells can reuse the list
list_of_sents = list(doc.sents)
sent = list_of_sents[2]  # third sentence (zero-based index 2)

print(f'{"Token":{15}} {"POS":{10}} {"TAG":{13}} {"Description":{15}}')
print('------------------------------------------------------------------------------------------')
for token in sent:
    print(f'{token.text:{15}} {token.pos_:{10}} {token.tag_:{13}} {spacy.explain(token.tag_):{15}}')

Token           POS        TAG           Description    
------------------------------------------------------------------------------------------
They            PRON       PRP           pronoun, personal
lived           VERB       VBD           verb, past tense
with            ADP        IN            conjunction, subordinating or preposition
their           PRON       PRP$          pronoun, possessive
Mother          NOUN       NN            noun, singular or mass
in              ADP        IN            conjunction, subordinating or preposition
a               DET        DT            determiner     
sand            NOUN       NN            noun, singular or mass
-               PUNCT      HYPH          punctuation mark, hyphen
bank            NOUN       NN            noun, singular or mass
,               PUNCT      ,             punctuation mark, comma
underneath      ADP        IN            conjunction, subordinating or preposition
the             DET        DT            dete
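The column alignment in the cell above comes from the nested `{value:{width}}` fields in the f-strings: each value is padded on the right to at least `width` characters. Below is a minimal sketch of that formatting on its own, with a few rows copied from the output above standing in for a parsed `Doc`, so it runs without spaCy. The helper name `format_row` is hypothetical and not part of spaCy.

```python
# Rows copied by hand from the output above; no spaCy model is needed here.
rows = [
    ("They", "PRON", "PRP", "pronoun, personal"),
    ("lived", "VERB", "VBD", "verb, past tense"),
    ("with", "ADP", "IN", "conjunction, subordinating or preposition"),
]


def format_row(text, pos, tag, description):
    """Hypothetical helper: pad the first three columns to fixed widths."""
    # {text:{15}} pads `text` with spaces on the right up to 15 characters.
    return f"{text:{15}} {pos:{10}} {tag:{13}} {description}"


header = format_row("Token", "POS", "TAG", "Description")
print(header)
print("-" * len(header))
for row in rows:
    print(format_row(*row))
```

Using the same widths for the header and the rows keeps the columns aligned, and sizing the separator from `len(header)` avoids a hand-typed run of dashes.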

**3. Provide a frequency list of POS tags from the entire document**

In [4]:
POS_counts = doc.count_by(spacy.attrs.POS)
for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 53
85. ADP  : 125
86. ADV  : 63
87. AUX  : 49
89. CCONJ: 61
90. DET  : 90
92. NOUN : 172
93. NUM  : 9
94. PART : 28
95. PRON : 110
96. PROPN: 74
97. PUNCT: 171
98. SCONJ: 19
100. VERB : 135
103. SPACE: 99


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 92 (see the frequency list above)

In [5]:
# Attribute ID of NOUN is 92

total = len(doc)
print(total)
# Total token count: 1258

total_noun = POS_counts[92]
print(total_noun)
# Total NOUN token count: 172

# Print the percentage of noun tokens, rounded to 3 decimal places
print(f'Noun token percent: {round((total_noun/total)*100, 3)}%')




1258
172
Noun token percent: 13.672%
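As a sanity check on these figures, the total token count should equal the sum of the per-tag counts from question 3. The sketch below copies those counts by hand into a plain dictionary (the name `pos_counts` is illustrative), so it runs without spaCy and only verifies the arithmetic.

```python
# Per-tag counts copied from the frequency list in question 3.
pos_counts = {
    "ADJ": 53, "ADP": 125, "ADV": 63, "AUX": 49, "CCONJ": 61,
    "DET": 90, "NOUN": 172, "NUM": 9, "PART": 28, "PRON": 110,
    "PROPN": 74, "PUNCT": 171, "SCONJ": 19, "VERB": 135, "SPACE": 99,
}

# Every token gets exactly one POS tag, so the counts sum to the token total.
total = sum(pos_counts.values())
noun_percent = pos_counts["NOUN"] / total * 100

print(total)
print(f"Noun token percent: {noun_percent:.3f}%")
```

Note that the total includes `SPACE` and `PUNCT` tokens; excluding them from the denominator would give a different, higher noun share.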


**5. Display the Dependency Parse for the third sentence**

In [6]:
displacy.render(sent, style='dep', jupyter=True, options={'distance': 150})

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [7]:
# Print text, label, and label description for the first two entities
for ent in doc.ents[:2]:
    print(f'{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}')


The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - PERSON - People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [8]:
len(list_of_sents)

55

**8. CHALLENGE: How many sentences contain named entities?**

In [9]:
ent_in_sents = [sent for sent in list_of_sents if sent.ents]
len(ent_in_sents)


35

**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [10]:
displacy.render(list_of_sents[0], style='ent', jupyter=True)