# Parts of Speech and Named Entity Recognition Assessment

- Kaung Khant Lin
- 6540131
- 541

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
from spacy import displacy
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [2]:
with open("peterrabbit 1.txt") as f:
    doc = nlp(f.read())
print(doc.text[:100])

The Tale of Peter Rabbit, by Beatrix Potter (1902).

Once upon a time there were four little Rabbits


**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [3]:
# Get the third sentence
sents = list(doc.sents)
third_sent = sents[2]

# Print token information
for token in third_sent:
    print(f'{token.text:{12}} {token.pos_:{6}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

They         PRON   PRP    pronoun, personal
lived        VERB   VBD    verb, past tense
with         ADP    IN     conjunction, subordinating or preposition
their        PRON   PRP$   pronoun, possessive
Mother       PROPN  NNP    noun, proper singular
in           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner
sand         NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
bank         NOUN   NN     noun, singular or mass
,            PUNCT  ,      punctuation mark, comma
underneath   ADP    IN     conjunction, subordinating or preposition
the          DET    DT     determiner
root         NOUN   NN     noun, singular or mass
of           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner

            SPACE  _SP    whitespace
very         ADV    RB     adverb
big          ADJ    JJ     adjective (English), other noun-modifier (Chinese)
fir          NOUN   NN     noun, singular or mass

**3. Provide a frequency list of POS tags from the entire document**

In [4]:
# Count POS tags
POS_counts = doc.count_by(spacy.attrs.POS)

# Print frequency list
for k, v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 54
85. ADP  : 124
86. ADV  : 65
87. AUX  : 50
89. CCONJ: 61
90. DET  : 90
92. NOUN : 173
93. NUM  : 8
94. PART : 28
95. PRON : 108
96. PROPN: 75
97. PUNCT: 172
98. SCONJ: 20
100. VERB : 131
103. SPACE: 99
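
The frequency list above is keyed by integer attribute IDs, which then have to be mapped back to names through `doc.vocab`. As a rough alternative sketch, the same kind of tally can be built directly on tag names with `collections.Counter`. The `tagged_tokens` list below is a hypothetical stand-in for `[(token.text, token.pos_) for token in doc]`, filled with the first ten tokens of the third sentence so that the block runs without spaCy or the model.

```python
from collections import Counter

# Hypothetical stand-in for [(token.text, token.pos_) for token in doc];
# the tags are copied from the third-sentence output earlier in this notebook.
tagged_tokens = [
    ("They", "PRON"), ("lived", "VERB"), ("with", "ADP"), ("their", "PRON"),
    ("Mother", "PROPN"), ("in", "ADP"), ("a", "DET"), ("sand", "NOUN"),
    ("-", "PUNCT"), ("bank", "NOUN"),
]

# Tally by tag name instead of by integer attribute ID
pos_counts = Counter(pos for _, pos in tagged_tokens)

# Print the most frequent tags first
for pos, count in pos_counts.most_common():
    print(f"{pos:{5}}: {count}")
```

In the notebook itself, feeding `token.pos_ for token in doc` into the `Counter` should give the same totals as `doc.count_by`, just keyed by name.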


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 92 in the frequency list above (the ID can differ between spaCy versions)

In [5]:
# Look up the noun count by its attribute ID (92 in the frequency list above)
nouns_count = POS_counts[92]
total_tokens = len(doc)

# Calculate and print percentage
percentage = (nouns_count / total_tokens) * 100
print(f'{nouns_count}/{total_tokens} = {percentage:.2f}%')

173/1258 = 13.75%
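
Note that `len(doc)` also counts the 99 `SPACE` tokens from the frequency list, so the denominator includes whitespace. The sketch below reuses the counts printed in question 3 to compare the noun share with and without those tokens. The helper name `share` and its `exclude` parameter are made up for this illustration and are not part of spaCy.

```python
# POS counts copied from the frequency list printed in question 3
pos_counts = {
    "ADJ": 54, "ADP": 124, "ADV": 65, "AUX": 50, "CCONJ": 61,
    "DET": 90, "NOUN": 173, "NUM": 8, "PART": 28, "PRON": 108,
    "PROPN": 75, "PUNCT": 172, "SCONJ": 20, "VERB": 131, "SPACE": 99,
}


def share(tag, counts, exclude=()):
    """Percentage of tokens carrying `tag`, ignoring any tags in `exclude`."""
    total = sum(n for name, n in counts.items() if name not in exclude)
    return 100 * counts[tag] / total


with_space = share("NOUN", pos_counts)
without_space = share("NOUN", pos_counts, exclude={"SPACE"})

print(f"Including SPACE tokens: {with_space:.2f}%")
print(f"Excluding SPACE tokens: {without_space:.2f}%")
```

Which denominator is appropriate depends on whether whitespace should count as a token for the question being asked; the assessment answer above uses the full `len(doc)`.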


**5. Display the Dependency Parse for the third sentence**

In [6]:
displacy.render(third_sent, style='dep', jupyter=True, options={'distance': 110})

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [7]:
for ent in doc.ents[:2]:
    print(f'{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}')

The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - PERSON - People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [8]:
print(len(list(doc.sents)))

57


**8. CHALLENGE: How many sentences contain named entities?**

In [9]:
list_of_sents = [sent for sent in doc.sents if sent.ents]
print(len(list_of_sents))

38
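
The list comprehension above relies on `sent.ents`, which holds the entities that fall inside a sentence span, so a sentence is kept whenever that sequence is non-empty. The idea can be sketched without spaCy by using plain `(start, end)` offsets. The function name and the sample offsets below are hypothetical and are not taken from the story.

```python
def count_sentences_with_entities(sentences, entities):
    """Count sentences that fully contain at least one entity.

    Both arguments are lists of (start, end) offsets with an exclusive end.
    """
    return sum(
        any(s_start <= e_start and e_end <= s_end for e_start, e_end in entities)
        for s_start, s_end in sentences
    )


# Hypothetical offsets: three sentences and three entities
sentences = [(0, 11), (11, 20), (20, 41)]
entities = [(0, 6), (8, 10), (30, 33)]

# The first sentence holds two entities, the second none, the third one
print(count_sentences_with_entities(sentences, entities))
```

A sentence with several entities is still counted once, which matches the behaviour of filtering on `if sent.ents`.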


**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [10]:
# Display the first sentence with named entities
displacy.render(list_of_sents[0], style='ent', jupyter=True)

### Great Job!