### Parts-of-Speech Assessment

Reference : https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/

#### For this assessment we'll be using the short story 'The Tale of Peter Rabbit' by Beatrix Potter (1902).
#### The story is in the public domain; the text file was obtained from 'Project Gutenberg'.

In [483]:
# RUN THIS CELL to perform standard imports:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

### 1. Create a Doc object from the file peterrabbit.txt

In [484]:
filepath = '/Users/ratulnandy/Documents/MSAAI/NLP/Module 5/peterrabbit.txt'

In [485]:
with open(filepath) as f:
    filetext = f.read()
    doc = nlp(filetext)

### 2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG, and the description of the fine-grained tag

In [486]:
sentence = [sent for sent in doc.sents]

In [487]:
# With the file downloaded from 'Project Gutenberg', this version of spaCy splits the text
# so that the target sentence is the fourth one (index 3) rather than the third.

print(sentence[3].text)

They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.


In [488]:
sen3 = nlp(sentence[3].text)

In [489]:
for word in sen3:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

They         PRON       PRP      pronoun, personal
lived        VERB       VBD      verb, past tense
with         ADP        IN       conjunction, subordinating or preposition
their        PRON       PRP$     pronoun, possessive
Mother       NOUN       NN       noun, singular or mass
in           ADP        IN       conjunction, subordinating or preposition
a            DET        DT       determiner
sand         NOUN       NN       noun, singular or mass
-            PUNCT      HYPH     punctuation mark, hyphen
bank         NOUN       NN       noun, singular or mass
,            PUNCT      ,        punctuation mark, comma
underneath   ADP        IN       conjunction, subordinating or preposition
the          DET        DT       determiner
root         NOUN       NN       noun, singular or mass
of           ADP        IN       conjunction, subordinating or preposition
a            DET        DT       determiner

            SPACE      _SP      whitespace
very         ADV        RB     
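The blank row in the output above appears to come from the newline token (tagged `SPACE` / `_SP`), which is printed literally and breaks the column layout. A minimal sketch of the same fixed-width formatting with `repr()` applied to the token text so whitespace stays visible; `format_row` is a hypothetical helper, and the two rows are copied from the output above rather than read from a live `Doc`:

```python
def format_row(text, pos, tag, description):
    """Pad each column to a fixed width; !r shows whitespace as an escape."""
    return f'{text!r:{12}} {pos:{10}} {tag:{8}} {description}'

# Rows copied from the output above (not produced by spaCy here).
rows = [
    ('They', 'PRON', 'PRP', 'pronoun, personal'),
    ('\n', 'SPACE', '_SP', 'whitespace'),
]

for row in rows:
    print(format_row(*row))
```

In the notebook itself, the same idea would be to pass `word.text`, `word.pos_`, `word.tag_` and `spacy.explain(word.tag_)` to the helper inside the existing loop.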

### 3. Provide a frequency list of POS tags from the entire document

In [490]:
POS_count = doc.count_by(spacy.attrs.POS)

In [491]:
for i,j in sorted(POS_count.items()):
    print(f'{i}. {doc.vocab[i].text:{8}}: {j}')

84. ADJ     : 55
85. ADP     : 126
86. ADV     : 65
87. AUX     : 41
89. CCONJ   : 61
90. DET     : 95
92. NOUN    : 173
93. NUM     : 8
94. PART    : 29
95. PRON    : 105
96. PROPN   : 72
97. PUNCT   : 173
98. SCONJ   : 16
100. VERB    : 140
103. SPACE   : 98
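The list above is ordered by attribute ID. A minimal sketch of ordering the same counts by frequency instead, assuming the numbers printed above are copied into a plain dict keyed by tag name (`pos_counts` is an illustrative name, not a variable from the notebook):

```python
# Counts copied from the output above, keyed by tag name for readability.
pos_counts = {
    'ADJ': 55, 'ADP': 126, 'ADV': 65, 'AUX': 41, 'CCONJ': 61,
    'DET': 95, 'NOUN': 173, 'NUM': 8, 'PART': 29, 'PRON': 105,
    'PROPN': 72, 'PUNCT': 173, 'SCONJ': 16, 'VERB': 140, 'SPACE': 98,
}

# Sort by count, largest first; ties keep their original order.
by_frequency = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for tag, count in by_frequency:
    print(f'{tag:{8}}: {count}')
```

In the notebook, an equivalent dict could likely be built from `POS_count` by mapping each ID through `doc.vocab[i].text`, as the loop above already does for printing.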


### 4. CHALLENGE: What percentage of tokens are nouns?
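HINT: the original hint gives the attribute ID for 'NOUN' as 91, but in the spaCy version used here the frequency list above shows NOUN as 92.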

In [492]:
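from spacy.symbols import NOUN  # look up the ID instead of hard-coding 92

# Number of NOUN tokens (0 if the tag never occurs) and total token count
noun = POS_count.get(NOUN, 0)
total = sum(POS_count.values())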


In [493]:
print('Total Noun : '+str(noun))

Total Noun : 173


In [494]:
print('Total tokens : '+str(total))

Total tokens : 1257


In [495]:
perc = round((noun/total)*100,3)

In [496]:
print('Percentage of tokens that are nouns : '+str(perc)+'%')

Percentage of tokens that are nouns : 13.763%
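The total of 1257 includes the 98 `SPACE` tokens from the frequency list, so the percentage depends on whether whitespace counts as a token. A small sketch comparing both denominators, using only the numbers reported above (the variable names are illustrative):

```python
# Figures copied from the outputs above.
noun_count = 173
total_tokens = 1257
space_tokens = 98

# Share of nouns among all tokens, whitespace included.
perc_all = round(noun_count / total_tokens * 100, 3)

# Share of nouns when whitespace tokens are left out of the denominator.
perc_no_space = round(noun_count / (total_tokens - space_tokens) * 100, 3)

print(f'Nouns / all tokens            : {perc_all}%')
print(f'Nouns / non-whitespace tokens : {perc_no_space}%')
```

Which figure is the "right" answer is a matter of definition; the assessment as written counts every token.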


### 5. Display the Dependency Parse for the third sentence

In [497]:
displacy.render(sentence[3], style='dep', jupyter=True, options={'distance': 120})

### 6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit*

In [504]:
ner = list(doc.ents)

In [505]:
print(ner[0].text + ' - ' + ner[0].label_ + ' - ' + str(spacy.explain(ner[0].label_)))
print(ner[1].text + ' - ' + ner[1].label_ + ' - ' + str(spacy.explain(ner[1].label_)))

The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - ORG - Companies, agencies, institutions, etc.


### 7. How many sentences are contained in The Tale of Peter Rabbit?

In [506]:
print('Total sentences :'+str(len(sentence)))

Total sentences :76


### 8. CHALLENGE: How many sentences contain named entities?

In [507]:
ner_s = 0
for i, sent in enumerate(sentence):
    # A sentence counts if it contains at least one named entity
    if len(sent.ents) > 0:
        print('Sentence : ' + str(i))
        print(sent.ents)
        ner_s += 1

Sentence : 0
[The Tale of Peter Rabbit, Beatrix Potter, 1902]
Sentence : 1
[four, Rabbits, Peter]
Sentence : 4
[one morning, McGregor, McGregor]
Sentence : 7
[Rabbit, baker]
Sentence : 8
[five]
Sentence : 9
[Flopsy, Mopsy, Cottontail]
Sentence : 10
[McGregor]
Sentence : 12
[First, French]
Sentence : 15
[McGregor]
Sentence : 16
[McGregor]
Sentence : 17
[Peter]
Sentence : 20
[one]
Sentence : 21
[four]
Sentence : 24
[McGregor, Peter, Peter]
Sentence : 28
[McGregor, Peter]
Sentence : 32
[McGregor]
Sentence : 34
[Peter, three]
Sentence : 35
[McGregor, Peter]
Sentence : 44
[Peter]
Sentence : 49
[McGregor]
Sentence : 51
[Benjamin Bunny]
Sentence : 55
[Peter]
Sentence : 58
[first, McGregor]
Sentence : 59
[Peter]
Sentence : 62
[McGregor, Peter]
Sentence : 64
[McGregor]
Sentence : 69
[second, Peter]
Sentence : 70
[Peter]
Sentence : 72
[One]
Sentence : 74
[Flopsy, Mopsy, Cotton]


In [508]:
print('Total number of sentences containing named entities : '+str(ner_s))

Total number of sentences containing named entities : 30
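If only the total is needed, the loop can be condensed into a single expression. A minimal sketch, assuming each sentence exposes an `.ents` sequence as spaCy spans do; `count_sentences_with_entities` is a hypothetical helper, and the demo uses stand-in objects rather than real spaCy spans so that it runs without a model:

```python
from types import SimpleNamespace


def count_sentences_with_entities(sentences):
    """Count the items whose .ents sequence is non-empty."""
    return sum(1 for sent in sentences if len(sent.ents) > 0)


# Hypothetical stand-ins for spaCy Span objects (illustration only).
demo_sentences = [
    SimpleNamespace(ents=['Peter']),
    SimpleNamespace(ents=[]),
    SimpleNamespace(ents=['McGregor', 'Peter']),
]

print(count_sentences_with_entities(demo_sentences))
```

In the notebook, calling the helper on the `sentence` list should give the same total as `ner_s`.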


### 9. CHALLENGE: Display the named entity visualization for the first sentence that contains named entities (`sentence[0]` in the previous problem)

In [509]:
displacy.render(sentence[0], style='ent', jupyter=True)