___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:
# Quick function to remove ents formed on whitespace:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

# Insert this into the pipeline AFTER the ner component:
nlp.add_pipe(remove_whitespace_entities, after='ner')

In [3]:
with open('..\\TextFiles\\peterrabbit.txt') as f:
    doc = nlp(f.read()) 

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [4]:
# Enter your code here:
sentence = 0
index = 0
while sentence < 3:
    if doc[index].is_sent_start == True:
        sentence +=1
    if sentence == 2:
        print(f'{doc[index].text:<{10}} {doc[index].pos_:<{10}} {doc[index].tag_:<{10}} {spacy.explain(doc[index].tag_)}')
    index += 1

They       PRON       PRP        pronoun, personal
lived      VERB       VBD        verb, past tense
with       ADP        IN         conjunction, subordinating or preposition
their      ADJ        PRP$       pronoun, possessive
Mother     PROPN      NNP        noun, proper singular
in         ADP        IN         conjunction, subordinating or preposition
a          DET        DT         determiner
sand       NOUN       NN         noun, singular or mass
-          PUNCT      HYPH       punctuation mark, hyphen
bank       NOUN       NN         noun, singular or mass
,          PUNCT      ,          punctuation mark, comma
underneath ADP        IN         conjunction, subordinating or preposition
the        DET        DT         determiner
root       NOUN       NN         noun, singular or mass
of         ADP        IN         conjunction, subordinating or preposition
a          DET        DT         determiner

          SPACE                 None
very       ADV        RB         adver

**3. Provide a frequency list of POS tags from the entire document**

In [5]:
POS_Count = doc.count_by(spacy.attrs.POS)

for key, value in POS_Count.items():
    print(f'{key}. {doc.vocab[key].text:<{5}}: {value}')

96. PUNCT: 174
99. VERB : 182
102. SPACE: 99
83. ADJ  : 83
84. ADP  : 127
85. ADV  : 75
88. CCONJ: 61
89. DET  : 90
91. NOUN : 176
92. NUM  : 8
93. PART : 36
94. PRON : 72
95. PROPN: 75


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 91

In [6]:
POS_Count[91]/sum(POS_Count.values())*100

13.990461049284578

**5. Display the Dependency Parse for the third sentence**

In [7]:
list_of_sents = [sent for sent in doc.sents]

displacy.render(list_of_sents[2], style = 'dep', jupyter = True, options = {'distance': 100})

**6. Show the first two named entities from Beatrix Potter's \'The Tale of Peter Rabbit' **

In [8]:
doc.ents[:2]

(The Tale of Peter Rabbit, Beatrix Potter)

**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [9]:
len(list_of_sents)

56

**8. CHALLENGE: How many sentences contain named entities?**

In [10]:
ner_index = 0
ner_sent_count = 0
len_ner = len(doc.ents)
for sent in doc.sents:
    sent_start = sent.start
    sent_end = sent.end
    sent_has_ner = False
    while ner_index < len_ner and (sent_start <= doc.ents[ner_index].start and doc.ents[ner_index].end < sent_end):
        sent_has_ner = True
        ner_index += 1
    if sent_has_ner == True:
        ner_sent_count += 1
        
print(ner_sent_count)

36


**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [11]:
displacy.render(nlp(list_of_sents[0].text), style = 'ent', jupyter = True)

### Great Job!