___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

In [7]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7c81f89215a0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7c81f8921660>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7c81fcfbfe60>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7c81f89c6bc0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7c81f8889200>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7c81fca0f8b0>)]

In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [11]:
# Read the story text and run it through the spaCy pipeline
with open("/content/The Tale of Peter Rabbit is a child.txt", "r", encoding="utf-8") as file:
    text = file.read()
doc = nlp(text)

**2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.**

In [16]:
# Take the third sentence (index 2) and print the tags for each of its tokens
t_sent = list(doc.sents)[2]
for token in t_sent:
    print(f"Token: {token.text}")
    print(f"POS Tag: {token.pos_}")
    print(f"Fine-grained TAG: {token.tag_}")
    print(f"Description of Fine-grained TAG: {spacy.explain(token.tag_)}")
    print("-" * 20)

Token: The
POS Tag: DET
Fine-grained TAG: DT
Description of Fine-grained TAG: determiner
--------------------
Token: tale
POS Tag: NOUN
Fine-grained TAG: NN
Description of Fine-grained TAG: noun, singular or mass
--------------------
Token: was
POS Tag: AUX
Fine-grained TAG: VBD
Description of Fine-grained TAG: verb, past tense
--------------------
Token: written
POS Tag: VERB
Fine-grained TAG: VBN
Description of Fine-grained TAG: verb, past participle
--------------------
Token: for
POS Tag: ADP
Fine-grained TAG: IN
Description of Fine-grained TAG: conjunction, subordinating or preposition
--------------------
Token: five
POS Tag: NUM
Fine-grained TAG: CD
Description of Fine-grained TAG: cardinal number
--------------------
Token: -
POS Tag: PUNCT
Fine-grained TAG: HYPH
Description of Fine-grained TAG: punctuation mark, hyphen
--------------------
Token: year
POS Tag: NOUN
Fine-grained TAG: NN
Description of Fine-grained TAG: noun, singular or mass
--------------------
Token: -
[remaining output truncated]
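
The descriptions in the output above come from `spacy.explain`, which looks a label up in spaCy's built-in glossary and does not need a loaded model. A minimal sketch, reusing three tags from that output; the label `NOT_A_TAG` is a made-up example of an unknown label, for which `spacy.explain` returns `None`:

```python
import warnings

import spacy

# Tags copied from the output above
for tag in ["NN", "VBD", "HYPH"]:
    print(f"{tag:<5} {spacy.explain(tag)}")

# Hypothetical unknown label: the lookup fails and None comes back.
# Some spaCy versions also emit a warning here, so it is silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    unknown = spacy.explain("NOT_A_TAG")
print(unknown)
```

Because a failed lookup gives `None` rather than raising, it is worth guarding against it when the description is used in further string processing.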

**3. Provide a frequency list of POS tags from the entire document**

In [19]:
from collections import Counter

In [20]:
# Count the coarse POS tag of every token in the document
pos_tag_counter = Counter(token.pos_ for token in doc)

print("POS Tag Frequency List:")
for pos_tag, count in pos_tag_counter.items():
    print(f"{pos_tag}: {count}")

POS Tag Frequency List:
DET: 10
PROPN: 19
ADP: 20
AUX: 10
NOUN: 24
PART: 3
VERB: 18
CCONJ: 9
PRON: 11
ADJ: 7
SCONJ: 2
PUNCT: 18
ADV: 5
NUM: 8
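
spaCy can also produce this kind of tally itself through `Doc.count_by`, which returns counts keyed by integer attribute IDs instead of tag names. The sketch below is only an illustration: it builds a tiny hand-annotated `Doc` (the first four tokens of the third sentence, with the POS tags printed earlier) so that it runs without a trained model, and `toy_doc` and `named_counts` are names introduced here.

```python
from spacy.attrs import POS
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built toy Doc: tokens and POS tags copied from the earlier output
words = ["The", "tale", "was", "written"]
pos = ["DET", "NOUN", "AUX", "VERB"]
toy_doc = Doc(Vocab(), words=words, pos=pos)

# count_by returns a dict of {integer attribute ID: frequency}
pos_counts = toy_doc.count_by(POS)

# Translate the integer IDs back into readable tag names
named_counts = {toy_doc.vocab.strings[pos_id]: n for pos_id, n in pos_counts.items()}

# Print most frequent first, ties broken alphabetically
for name, n in sorted(named_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{name}: {n}")
```

On the real document the same call would be `doc.count_by(POS)`. Looking the IDs up through the vocabulary avoids hard-coding numeric values.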


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
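HINT: the attribute ID for 'NOUN' is 91 in the spaCy version this assessment was written for; the ID may differ in other versions, so prefer looking it up (for example via `spacy.symbols.NOUN`) over hard-coding the number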

In [21]:
noun_count = sum(1 for token in doc if token.pos_ == "NOUN")

total_t = len(doc)
percentage_n = (noun_count/total_t)*100

print(f"Percentage of tokens that are nouns: {percentage_n:.2f}%")


Percentage of tokens that are nouns: 14.63%
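
As a cross-check, the same percentage can be recomputed from the frequency list printed for question 3. The sketch below copies those counts by hand into a dictionary; the second figure, which leaves punctuation out of the denominator, is an alternative reading of "tokens" added here for comparison and is not part of the assessment.

```python
# POS counts copied from the frequency list printed for question 3
pos_counts = {
    "DET": 10, "PROPN": 19, "ADP": 20, "AUX": 10, "NOUN": 24,
    "PART": 3, "VERB": 18, "CCONJ": 9, "PRON": 11, "ADJ": 7,
    "SCONJ": 2, "PUNCT": 18, "ADV": 5, "NUM": 8,
}

# All tokens, punctuation included (this matches len(doc))
total_tokens = sum(pos_counts.values())
noun_share = 100 * pos_counts["NOUN"] / total_tokens
print(f"{pos_counts['NOUN']}/{total_tokens} = {noun_share:.2f}%")

# Assumption for comparison only: ignore punctuation tokens
non_punct = total_tokens - pos_counts["PUNCT"]
noun_share_no_punct = 100 * pos_counts["NOUN"] / non_punct
print(f"{pos_counts['NOUN']}/{non_punct} = {noun_share_no_punct:.2f}%")
```

The first line reproduces the 14.63% reported above (24 nouns out of 164 tokens), which confirms that the frequency list and the percentage were computed over the same token set.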


**5. Display the Dependency Parse for the third sentence**

In [22]:
t_sent = list(doc.sents)[2]

print("Dependency Parse for the Third Sentence: ")

# Print each token, its dependency label, and the head it attaches to
for token in t_sent:
    print(f"{token.text}-->{token.dep_}-->{token.head.text}")

Dependency Parse for the Third Sentence: 
The-->det-->tale
tale-->nsubjpass-->written
was-->auxpass-->written
written-->ROOT-->written
for-->prep-->written
five-->nummod-->year
--->punct-->year
year-->npadvmod-->old
--->punct-->old
old-->amod-->Moore
Noel-->compound-->Moore
Moore-->pobj-->for
,-->punct-->Moore
the-->det-->son
son-->appos-->Moore
of-->prep-->son
Potter-->poss-->governess
's-->case-->Potter
former-->amod-->governess
governess-->pobj-->of
,-->punct-->Moore
Annie-->compound-->Moore
Carter-->compound-->Moore
Moore-->appos-->Moore
,-->punct-->Moore
in-->prep-->written
1893-->pobj-->in
.-->punct-->written
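
The same parse can be drawn as an arc diagram with displaCy's `dep` style. The sketch below is an illustration under stated assumptions: instead of the parsed sentence it uses a hand-built `Doc` holding a shortened version of the third sentence, with heads and labels copied from the printed parse, so it runs without a trained model. `toy_doc` and `svg` are names introduced here.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Shortened third sentence; heads and labels copied from the printed parse
words = ["The", "tale", "was", "written", "in", "1893"]
heads = [1, 3, 3, 3, 3, 4]  # absolute index of each token's head
deps = ["det", "nsubjpass", "auxpass", "ROOT", "prep", "pobj"]
toy_doc = Doc(spacy.blank("en").vocab, words=words, heads=heads, deps=deps)

# jupyter=False returns the SVG markup as a string instead of displaying it
svg = displacy.render(toy_doc, style="dep", jupyter=False, options={"distance": 110})
print(svg[:60])
```

Inside a notebook, the equivalent call on the real sentence would be `displacy.render(t_sent, style="dep", jupyter=True)`; the `distance` option only narrows the spacing between words.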


**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [24]:
named_entities = [(ent.text, ent.label_) for ent in doc.ents]
print("First Two Named Entities:")
for i, (text, label) in enumerate(named_entities[:2], start=1):
    print(f"{i}. Text: {text}, Label: {label}")

First Two Named Entities:
1. Text: The Tale of Peter Rabbit, Label: WORK_OF_ART
2. Text: Beatrix Potter, Label: PERSON


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [25]:
num_sentences = len(list(doc.sents))
print(f"The Tale of Peter Rabbit contains {num_sentences} sentences.")

The Tale of Peter Rabbit contains 6 sentences.


**8. CHALLENGE: How many sentences contain named entities?**

In [26]:
# A sentence counts if its span has at least one entity (a non-empty sent.ents)
num_sentences_with_entities = sum(1 for sent in doc.sents if sent.ents)

print(f"The Tale of Peter Rabbit contains {num_sentences_with_entities} sentences with named entities.")


The Tale of Peter Rabbit contains 5 sentences with named entities.
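
The count relies on `Span.ents`, which holds only the entities that fall inside a given sentence. A small sketch of that behaviour, using a hand-annotated two-sentence `Doc` so that no trained model is needed; the text, sentence boundaries, and entity labels are toy data invented for this illustration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Toy data (invented): two sentences, only the first contains an entity
words = ["Beatrix", "Potter", "wrote", "the", "tale", ".", "It", "sold", "well", "."]
sent_starts = [True, False, False, False, False, False, True, False, False, False]
ents = ["B-PERSON", "I-PERSON", "O", "O", "O", "O", "O", "O", "O", "O"]
toy_doc = Doc(Vocab(), words=words, sent_starts=sent_starts, ents=ents)

# Keep the sentences whose span contains at least one entity
sents_with_ents = [sent for sent in toy_doc.sents if sent.ents]

print(f"{len(list(toy_doc.sents))} sentences, {len(sents_with_ents)} with entities")
for sent in sents_with_ents:
    print([(ent.text, ent.label_) for ent in sent.ents])
```

Building the filtered list once, rather than only counting, makes the first matching sentence available for later steps.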


**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [27]:
# Collect the sentences that contain at least one named entity
sentences_with_entities = [sent for sent in doc.sents if sent.ents]
if sentences_with_entities:
    # Render the first such sentence with the entity visualizer
    first_sentence_with_entities = sentences_with_entities[0]
    displacy.render(first_sentence_with_entities, style="ent", jupyter=True)
else:
    print("No sentences with named entities found.")
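
With `jupyter=True` the visualization is displayed in the notebook and nothing is returned. Outside a notebook, passing `jupyter=False` gives back the HTML markup as a string, which can then be saved or embedded. A minimal sketch, again on an invented, hand-annotated `Doc` rather than the real story text:

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Toy data (invented): one sentence with a single PERSON entity
words = ["Beatrix", "Potter", "wrote", "the", "tale", "."]
ents = ["B-PERSON", "I-PERSON", "O", "O", "O", "O"]
toy_doc = Doc(spacy.blank("en").vocab, words=words, ents=ents)

# jupyter=False returns the entity markup as an HTML string
html = displacy.render(toy_doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string could, for instance, be written to an `.html` file and opened in a browser.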