# Parts of Speech and Named Entity Recognition Assessment

Khant Nyi Thu
6632108

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [184]:
with open('peterrabbit.txt', 'r', encoding='utf-8') as f:
    text = f.read()
doc = nlp(text)

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [188]:
third_sentence = list(doc.sents)[2]
for token in third_sentence:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

They       PRON     PRP    pronoun, personal
lived      VERB     VBD    verb, past tense
with       ADP      IN     conjunction, subordinating or preposition
their      PRON     PRP$   pronoun, possessive
Mother     PROPN    NNP    noun, proper singular
in         ADP      IN     conjunction, subordinating or preposition
a          DET      DT     determiner
sand       NOUN     NN     noun, singular or mass
-          PUNCT    HYPH   punctuation mark, hyphen
bank       NOUN     NN     noun, singular or mass
,          PUNCT    ,      punctuation mark, comma
underneath ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
root       NOUN     NN     noun, singular or mass
of         ADP      IN     conjunction, subordinating or preposition
a          DET      DT     determiner

          SPACE    _SP    whitespace
very       ADV      RB     adverb
big        ADJ      JJ     adjective (English), other noun-modifier (Chinese)
fir        NOUN     NN     noun, singular or mass
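
The aligned columns above come from the nested format specs in the f-string: `{token.text:{10}}` pads the text to a minimum width of 10 characters. The following is a minimal, standard-library-only sketch of that padding. The `rows` list and the `format_row` helper are illustrative stand-ins typed by hand from the output above, not spaCy objects.

```python
# Rows copied by hand from the output above; stand-ins for real spaCy tokens.
rows = [
    ("They", "PRON", "PRP", "pronoun, personal"),
    ("lived", "VERB", "VBD", "verb, past tense"),
    ("underneath", "ADP", "IN", "conjunction, subordinating or preposition"),
]

def format_row(text, pos, tag, description):
    # {text:{10}} behaves like {text:10}: strings are left-aligned and
    # padded with spaces up to the given minimum width.
    return f"{text:{10}} {pos:{8}} {tag:{6}} {description}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

A value longer than the width is not truncated, so a long token would simply push the later columns to the right.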

**3. Provide a frequency list of POS tags from the entire document**

In [190]:
POS_counts = doc.count_by(spacy.attrs.POS)

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 54
85. ADP  : 124
86. ADV  : 65
87. AUX  : 50
89. CCONJ: 61
90. DET  : 90
92. NOUN : 173
93. NUM  : 8
94. PART : 28
95. PRON : 108
96. PROPN: 75
97. PUNCT: 172
98. SCONJ: 20
100. VERB : 131
103. SPACE: 99
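
The list above is ordered by attribute ID. As a small sketch of an alternative view, the same counts can be ordered by frequency. To stay runnable without the model, the `pos_counts` dictionary below is typed by hand from the printed output and keyed by tag name; it is an assumption of this sketch, not the `POS_counts` object itself.

```python
# Counts copied from the printed frequency list above (tag name -> count).
pos_counts = {
    "ADJ": 54, "ADP": 124, "ADV": 65, "AUX": 50, "CCONJ": 61,
    "DET": 90, "NOUN": 173, "NUM": 8, "PART": 28, "PRON": 108,
    "PROPN": 75, "PUNCT": 172, "SCONJ": 20, "VERB": 131, "SPACE": 99,
}

# Sort by count, largest first.
by_frequency = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for tag, count in by_frequency:
    print(f"{tag:{5}}: {count}")
```

In the notebook itself, the same `sorted(..., key=lambda item: item[1], reverse=True)` call should work directly on `POS_counts.items()`.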


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 91

In [192]:
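# Look up the NOUN attribute ID instead of hard-coding it; the ID varies across spaCy versions.
noun_count = POS_counts.get(doc.vocab.strings['NOUN'], 0)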
total_tokens = sum(POS_counts.values())

noun_percentage = (noun_count/total_tokens) * 100

print(f"Total tokens: {total_tokens}")
print(f"Noun tokens: {noun_count}")
print(f"Percentage of nouns: {noun_percentage:.2f}%")


# Note: this run finds 173 nouns rather than the 176 in the provided notebook, using the same peterrabbit.txt file.
# The attribute ID for NOUN is also 92 here, not 91. Both differences are presumably due to a different spaCy/model version.


Total tokens: 1258
Noun tokens: 173
Percentage of nouns: 13.75%
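
The total of 1258 includes the 99 `SPACE` tokens from the frequency list, so the percentage depends on whether whitespace counts as a token. The sketch below recomputes the figure both ways from the printed counts. The `pos_counts` dictionary is typed by hand from the output of question 3 and is an assumption of this sketch; which denominator the assessment intends is not stated in the source.

```python
# Counts copied from the printed frequency list in question 3.
pos_counts = {
    "ADJ": 54, "ADP": 124, "ADV": 65, "AUX": 50, "CCONJ": 61,
    "DET": 90, "NOUN": 173, "NUM": 8, "PART": 28, "PRON": 108,
    "PROPN": 75, "PUNCT": 172, "SCONJ": 20, "VERB": 131, "SPACE": 99,
}

noun_count = pos_counts["NOUN"]
total_tokens = sum(pos_counts.values())

# Percentage over all tokens, as in the cell above.
pct_all = 100 * noun_count / total_tokens

# Percentage when whitespace tokens are left out of the denominator.
non_space_total = total_tokens - pos_counts["SPACE"]
pct_no_space = 100 * noun_count / non_space_total

print(f"All tokens:          {pct_all:.2f}%")
print(f"Excluding SPACE:     {pct_no_space:.2f}%")
```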


**5. Display the Dependency Parse for the third sentence**

In [194]:
displacy.render(third_sentence, style="dep", jupyter=True, options={"compact": False, "distance": 100})

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [196]:
first_two_NER = list(doc.ents)[:2]
for ent in first_two_NER:
    print(f"{ent.text}, {ent.label_} ({spacy.explain(ent.label_)})")


The Tale of Peter Rabbit, WORK_OF_ART (Titles of books, songs, etc.)
Beatrix Potter, PERSON (People, including fictional)


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [201]:
sentence_count = len(list(doc.sents))

print(sentence_count)

57


**8. CHALLENGE: How many sentences contain named entities?**

In [215]:
# Collect each sentence that contains at least one named entity.
# A set keeps a sentence from being counted twice when it holds several entities.
counted_sentences = {ent.sent for ent in doc.ents}

num_sentences_with_entities = len(counted_sentences)
num_sentences_with_entities


38
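
The count relies on a set, so a sentence with several entities is counted only once. Here is a standard-library-only sketch of the same idea using token offsets. The `sentences` and `entities` ranges are made-up illustrative data, not values from `peterrabbit.txt`, and `sentence_index` is a hypothetical helper that plays the role of `ent.sent`.

```python
from bisect import bisect_right

# Hypothetical (start, end) token ranges; end is exclusive.
sentences = [(0, 5), (5, 12), (12, 20), (20, 26)]
entities = [(1, 3), (13, 14), (16, 18)]  # two entities fall in the third sentence

sentence_starts = [start for start, _ in sentences]

def sentence_index(token_position):
    # Index of the sentence whose range contains the given token position.
    return bisect_right(sentence_starts, token_position) - 1

# The set collapses repeated hits on the same sentence.
sentences_with_entities = {sentence_index(start) for start, _ in entities}

print(len(sentences_with_entities))
```

With this sample data, three entities map to two distinct sentences, which is why the entity count and the sentence count can differ.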

**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [203]:
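# Sentences that contain at least one named entity, as counted in the previous problem.
list_of_sents = [sent for sent in doc.sents if sent.ents]

first_sentence = list_of_sents[0]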

displacy.render(first_sentence, style="ent", jupyter=True)


### Great Job!