<h2 align="center">Part-of-Speech (POS) Tagging</h2>

In [1]:
import spacy



<h3>POS tags</h3>

In [2]:
nlp = spacy.load("en_core_web_sm")

In [5]:
doc = nlp("Yadav flew to Mumbai yesterday. He carried book with him")

for token in doc:
    print(token)

Yadav
flew
to
Mumbai
yesterday
.
He
carried
book
with
him


In [4]:
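# print each entity with its label name
# (ent.label is the integer ID, ent.label_ is the readable string)
for ent in doc.ents:
    print(ent, ent.label_)

Mumbai GPE
yesterday DATE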


In [12]:
#Part of speech

for token in doc:
    print(token," | ",token.pos_," | ",spacy.explain(token.pos_))

Yadav  |  PROPN  |  proper noun
flew  |  VERB  |  verb
to  |  ADP  |  adposition
Mumbai  |  PROPN  |  proper noun
yesterday  |  NOUN  |  noun
.  |  PUNCT  |  punctuation
He  |  PRON  |  pronoun
carried  |  VERB  |  verb
book  |  NOUN  |  noun
with  |  ADP  |  adposition
him  |  PRON  |  pronoun
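
Once every token has a coarse tag, it is often handy to group the words by that tag. The sketch below is only an illustration: the `(text, pos)` pairs are copied by hand from the output above so that it runs without loading a model, and `group_by_pos` is a hypothetical helper name, not part of spaCy.

```python
from collections import defaultdict

# (token text, coarse POS tag) pairs copied from the output above.
tagged = [
    ("Yadav", "PROPN"), ("flew", "VERB"), ("to", "ADP"), ("Mumbai", "PROPN"),
    ("yesterday", "NOUN"), (".", "PUNCT"), ("He", "PRON"), ("carried", "VERB"),
    ("book", "NOUN"), ("with", "ADP"), ("him", "PRON"),
]


def group_by_pos(pairs):
    """Collect token texts under their POS tag, keeping document order."""
    groups = defaultdict(list)
    for text, pos in pairs:
        groups[pos].append(text)
    return dict(groups)


groups = group_by_pos(tagged)
for pos, words in groups.items():
    print(pos, "|", words)
```

With a real `doc`, the same helper could be fed `((token.text, token.pos_) for token in doc)`.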


List of POS categories in spaCy: https://v2.spacy.io/api/annotation

In [11]:
# list the components of the loaded pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [17]:
# token.tag_ holds the fine-grained tag, which gives more detail than token.pos_


doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," | ",token.pos_," | ", spacy.explain(token.pos_)," | ", token.tag_," | ",  spacy.explain(token.tag_))

Wow  |  INTJ  |  interjection  |  UH  |  interjection
!  |  PUNCT  |  punctuation  |  .  |  punctuation mark, sentence closer
Dr.  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
Strange  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
made  |  VERB  |  verb  |  VBD  |  verb, past tense
265  |  NUM  |  numeral  |  CD  |  cardinal number
million  |  NUM  |  numeral  |  CD  |  cardinal number
$  |  NUM  |  numeral  |  CD  |  cardinal number
on  |  ADP  |  adposition  |  IN  |  conjunction, subordinating or preposition
the  |  DET  |  determiner  |  DT  |  determiner
very  |  ADV  |  adverb  |  RB  |  adverb
first  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
day  |  NOUN  |  noun  |  NN  |  noun, singular or mass


<h2>spaCy distinguishes past from present tense for "quit"</h2>

In [18]:
doc = nlp("He quits the job")

for token in doc:
    print(token," | ",token.pos_," | ", spacy.explain(token.pos_)," | ", token.tag_," | ",  spacy.explain(token.tag_))

He  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
quits  |  VERB  |  verb  |  VBZ  |  verb, 3rd person singular present
the  |  DET  |  determiner  |  DT  |  determiner
job  |  NOUN  |  noun  |  NN  |  noun, singular or mass


In [19]:
doc = nlp("He quit the job")

for token in doc:
    print(token," | ",token.pos_," | ", spacy.explain(token.pos_)," | ", token.tag_," | ",  spacy.explain(token.tag_))

He  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
quit  |  VERB  |  verb  |  VBD  |  verb, past tense
the  |  DET  |  determiner  |  DT  |  determiner
job  |  NOUN  |  noun  |  NN  |  noun, singular or mass


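From the two outputs above we can conclude that spaCy uses context to work out the tense: "quits" is tagged VBZ (3rd person singular present) and "quit" is tagged VBD (past tense), even though both share the coarse tag VERB.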
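
To make the tense information easier to read, the fine-grained verb tags can be mapped to short descriptions. This is a minimal sketch, assuming the Penn Treebank verb tags that the English models report in `token.tag_`; the dictionary and the `describe_verb_tag` helper are illustrative names, not spaCy API.

```python
# Penn Treebank verb tags, as seen in token.tag_ for English text.
VERB_TAG_TO_FORM = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "present tense, non-3rd person singular",
    "VBZ": "present tense, 3rd person singular",
}


def describe_verb_tag(tag):
    """Return a short description for a verb tag, or a fallback for other tags."""
    return VERB_TAG_TO_FORM.get(tag, "not a verb tag")


# (word, fine-grained tag) pairs taken from the two outputs above.
for word, tag in [("quits", "VBZ"), ("quit", "VBD"), ("job", "NN")]:
    print(word, "|", tag, "|", describe_verb_tag(tag))
```

`spacy.explain(token.tag_)` gives similar descriptions; the explicit table is useful when the goal is to branch on tense in code.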

<h2>Removing all SPACE, PUNCT and X tokens from text</h2>

In [22]:


amazon_earning_report = """ SEATTLE--(BUSINESS WIRE)-- Amazon.com, Inc. (NASDAQ: AMZN) today announced financial results for its fourth quarter ended December 31, 2023.

Fourth Quarter 2023

Net sales increased 14% to $170.0 billion in the fourth quarter, compared with $149.2 billion in fourth quarter 2022. Excluding the $1.3 billion favorable impact from year-over-year changes in foreign exchange rates throughout the quarter, net sales increased 13% compared with fourth quarter 2022.
North America segment sales increased 13% year-over-year to $105.5 billion.
International segment sales increased 17% year-over-year to $40.2 billion, or increased 13% excluding changes in foreign exchange rates.
AWS segment sales increased 13% year-over-year to $24.2 billion. etc.."""


In [23]:
# inspect the tags first to see which tokens should be removed

doc = nlp(amazon_earning_report)

for token in doc:
    print(token," | ",token.pos_," | ", spacy.explain(token.pos_)," | ", token.tag_," | ",  spacy.explain(token.tag_))

   |  SPACE  |  space  |  _SP  |  whitespace
SEATTLE--(BUSINESS  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
WIRE)--  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
Amazon.com  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
,  |  PUNCT  |  punctuation  |  ,  |  punctuation mark, comma
Inc.  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
(  |  PUNCT  |  punctuation  |  -LRB-  |  left round bracket
NASDAQ  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
:  |  PUNCT  |  punctuation  |  :  |  punctuation mark, colon or ellipsis
AMZN  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
)  |  PUNCT  |  punctuation  |  -RRB-  |  right round bracket
today  |  NOUN  |  noun  |  NN  |  noun, singular or mass
announced  |  VERB  |  verb  |  VBD  |  verb, past tense
financial  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
results  |  NOUN  |  noun  |  NNS  |  noun, plural
for  |  ADP  |  adpos

In [30]:
# remove SPACE, X and PUNCT tokens


filteredToken = []

for token in doc:
    if token.pos_ not in ["SPACE", "X", "PUNCT"]:
        filteredToken.append(token)

In [31]:
# SPACE, X and PUNCT tokens are gone; show the first 20 tokens that remain
filteredToken[:20]

[SEATTLE--(BUSINESS,
 WIRE)--,
 Amazon.com,
 Inc.,
 NASDAQ,
 AMZN,
 today,
 announced,
 financial,
 results,
 for,
 its,
 fourth,
 quarter,
 ended,
 December,
 31,
 2023,
 Fourth,
 Quarter]
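
The same filter can be written as a reusable function with a list comprehension. The sketch below assumes only that each token exposes a `pos_` attribute; `FakeToken` is a hypothetical stand-in used so the example runs without loading a model, and its sample tags are copied from the output printed earlier.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token: only the attributes used here.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

UNWANTED_POS = {"SPACE", "X", "PUNCT"}


def drop_unwanted(tokens, unwanted=UNWANTED_POS):
    """Return the tokens whose coarse POS tag is not in `unwanted`."""
    return [token for token in tokens if token.pos_ not in unwanted]


# First tokens of the earnings report, tags copied from the earlier output.
sample = [
    FakeToken(" ", "SPACE"),
    FakeToken("SEATTLE--(BUSINESS", "PROPN"),
    FakeToken("WIRE)--", "PROPN"),
    FakeToken("Amazon.com", "PROPN"),
    FakeToken(",", "PUNCT"),
    FakeToken("Inc.", "PROPN"),
    FakeToken("(", "PUNCT"),
    FakeToken("NASDAQ", "PROPN"),
]

kept = drop_unwanted(sample)
print([token.text for token in kept])
```

With the real document, `drop_unwanted(doc)` should behave like the loop above. Note that tokens such as `SEATTLE--(BUSINESS` survive because the tokenizer kept the punctuation inside a single PROPN token.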

In [38]:
# count how many tokens carry each POS tag using the count_by API

count = doc.count_by(spacy.attrs.POS)
count

{103: 6,
 96: 12,
 97: 22,
 92: 36,
 100: 12,
 84: 11,
 85: 17,
 95: 1,
 93: 23,
 99: 6,
 90: 3,
 89: 1,
 101: 2}

In [39]:
doc.vocab[96].text

'PROPN'

In [40]:
for k, v in count.items():
    print(doc.vocab[k].text, " | ", v)

SPACE  |  6
PROPN  |  12
PUNCT  |  22
NOUN  |  36
VERB  |  12
ADJ  |  11
ADP  |  17
PRON  |  1
NUM  |  23
SYM  |  6
DET  |  3
CCONJ  |  1
X  |  2
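
The counts are easier to compare when they are sorted and shown as a share of all tokens. A small sketch, using the numbers copied from the output above (`ranked_with_share` is a hypothetical helper name):

```python
# POS counts copied from the output above.
pos_counts = {
    "SPACE": 6, "PROPN": 12, "PUNCT": 22, "NOUN": 36, "VERB": 12,
    "ADJ": 11, "ADP": 17, "PRON": 1, "NUM": 23, "SYM": 6,
    "DET": 3, "CCONJ": 1, "X": 2,
}


def ranked_with_share(counts):
    """Sort tags by count (descending) and add each tag's percentage of the total."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(name, n, round(100 * n / total, 1)) for name, n in ranked]


ranking = ranked_with_share(pos_counts)
for name, n, share in ranking:
    print(f"{name:<6} | {n:>3} | {share}%")
```

When working from a `doc` directly, `collections.Counter(token.pos_ for token in doc)` is another way to get name-keyed counts without looking up the integer IDs in `doc.vocab`.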


<h2>Tasks:</h2>

1. You are parsing a news story from cnbc.com. The news story is stored in news_story.txt
    * Extract all NOUN tokens from this story. You will have to read the file in Python first to collect all the text, and then extract the NOUNs into a Python list
    * Extract all numbers (NUM POS type) in a python list
    * Print a count of all POS tags in this story

In [44]:
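with open("news_story.txt") as f:
    text = f.readlines()  # list of lines, each keeping its trailing "\n"
text = " ".join(text)     # joining with spaces is why the output shows "\n \n"
text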

'Inflation rose again in April, continuing a climb that has pushed consumers to the brink and is threatening the economic expansion, the Bureau of Labor Statistics reported Wednesday.\n \n The consumer price index, a broad-based measure of prices for goods and services, increased 8.3% from a year ago, higher than the Dow Jones estimate for an 8.1% gain. That represented a slight ease from Marchâ€™s peak but was still close to the highest level since the summer of 1982.\n \n Removing volatile food and energy prices, so-called core CPI still rose 6.2%, against expectations for a 6% gain, clouding hopes that inflation had peaked in March.\n \n The month-over-month gains also were higher than expectations â€” 0.3% on headline CPI versus the 0.2% estimate and a 0.6% increase for core, against the outlook for a 0.4% gain.\n \n The price gains also meant that workers continued to lose ground. Real wages adjusted for inflation decreased 0.1% on the month despite a nominal increase of 0.3% in a
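
The `â€™` sequences in the output above look like UTF-8 text that was decoded with a Windows code page. That is an assumption about how news_story.txt is stored, not something the notebook confirms. The sketch below reproduces the effect on a short string and shows that it is reversible:

```python
# A phrase with a typographic apostrophe (U+2019), as in the news story.
original = "March\u2019s peak"

# Store as UTF-8, then (wrongly) decode with cp1252, a common Windows default.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)   # Marchâ€™s peak

# Reversing the two steps recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)   # True
```

If that assumption holds, opening the file with `open("news_story.txt", encoding="utf-8")` would likely avoid the garbled characters, and the token lists and counts further down would change slightly as a result.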

In [47]:
doc = nlp(text)

In [56]:
# count how many tokens carry each POS tag using the count_by API
# iii. Print a count of all POS tags in this story
count = doc.count_by(spacy.attrs.POS)
count

{92: 98,
 100: 27,
 86: 15,
 85: 39,
 96: 17,
 97: 32,
 90: 34,
 95: 4,
 87: 13,
 89: 10,
 84: 23,
 103: 7,
 93: 20,
 94: 4,
 98: 8,
 101: 1}

In [49]:
for k, v in count.items():
    print(doc.vocab[k].text, " | ", v)

NOUN  |  98
VERB  |  27
ADV  |  15
ADP  |  39
PROPN  |  17
PUNCT  |  32
DET  |  34
PRON  |  4
AUX  |  13
CCONJ  |  10
ADJ  |  23
SPACE  |  7
NUM  |  20
PART  |  4
SCONJ  |  8
X  |  1


In [51]:
# i. Extract all NOUN tokens from this story (proper nouns, PROPN, are included here as well)
nouns = []
for token in doc:
    if token.pos_ in ["PROPN", "NOUN"]:
        nouns.append(token)

nouns
        


[Inflation,
 April,
 climb,
 consumers,
 brink,
 expansion,
 Bureau,
 Labor,
 Statistics,
 Wednesday,
 consumer,
 price,
 index,
 measure,
 prices,
 goods,
 services,
 %,
 year,
 Dow,
 Jones,
 estimate,
 %,
 gain,
 ease,
 Marchâ€,
 ™,
 peak,
 level,
 summer,
 food,
 energy,
 prices,
 core,
 CPI,
 %,
 expectations,
 %,
 gain,
 hopes,
 inflation,
 March,
 month,
 month,
 gains,
 expectations,
 %,
 headline,
 CPI,
 %,
 estimate,
 %,
 increase,
 core,
 outlook,
 %,
 gain,
 price,
 gains,
 workers,
 ground,
 wages,
 inflation,
 %,
 month,
 increase,
 %,
 earnings,
 year,
 earnings,
 %,
 earnings,
 %,
 Inflation,
 threat,
 recovery,
 Covid,
 pandemic,
 economy,
 stage,
 year,
 growth,
 level,
 prices,
 pump,
 grocery,
 stores,
 problem,
 inflation,
 areas,
 housing,
 auto,
 sales,
 host,
 areas,
 Federal,
 Reserve,
 officials,
 problem,
 interest,
 rate,
 hikes,
 year,
 pledges,
 inflation,
 bankâ€,
 ™,
 %,
 goal,
 Wednesdayâ€,
 ™,
 data,
 Fed,
 job,
 Credits]

In [52]:
print("Total number of nouns in the given text:", len(nouns))

Total number of nouns in the given text: 115


In [54]:
# ii.Extract all numbers (NUM POS type) in a python list

numbers = []
for token in doc:
    if token.pos_ in ["NUM"]:
        numbers.append(token)

numbers

[8.3,
 8.1,
 1982,
 6.2,
 6,
 â€,
 0.3,
 0.2,
 0.6,
 0.4,
 0.1,
 0.3,
 2.6,
 5.5,
 2021,
 1984,
 one,
 two,
 two,
 2]

In [55]:

print("Total number of numerals in the given text:", len(numbers))

Total number of numerals in the given text: 20
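
The `numbers` list holds spaCy tokens, not Python numbers, and not every NUM token is written in digits. A minimal sketch of converting the digit-based ones, using the token texts copied from the output above (`to_float` is a hypothetical helper; number words such as "one" are simply skipped here):

```python
# Token texts copied from the `numbers` output above.
numeral_texts = [
    "8.3", "8.1", "1982", "6.2", "6", "â€", "0.3", "0.2", "0.6", "0.4",
    "0.1", "0.3", "2.6", "5.5", "2021", "1984", "one", "two", "two", "2",
]


def to_float(text):
    """Convert a token's text to float, or return None if it is not digit-based."""
    try:
        return float(text)
    except ValueError:
        return None


parsed = [value for value in map(to_float, numeral_texts) if value is not None]
skipped = [text for text in numeral_texts if to_float(text) is None]

print("parsed :", parsed)
print("skipped:", skipped)
```

With the real list, `[to_float(token.text) for token in numbers]` would do the same job. The skipped items show two different cases: genuine number words and a stray piece of garbled punctuation that was tagged NUM.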


In [57]:
# another way: read the whole file at once, then collect nouns and numerals in a single pass

with open("news_story.txt","r") as f:
    news_text = f.read()
    
news_text[:500]

'Inflation rose again in April, continuing a climb that has pushed consumers to the brink and is threatening the economic expansion, the Bureau of Labor Statistics reported Wednesday.\n\nThe consumer price index, a broad-based measure of prices for goods and services, increased 8.3% from a year ago, higher than the Dow Jones estimate for an 8.1% gain. That represented a slight ease from Marchâ€™s peak but was still close to the highest level since the summer of 1982.\n\nRemoving volatile food and ene'

In [58]:
doc = nlp(news_text)

numeral_tokens = []
noun_tokens = []

for token in doc:
    if token.pos_ == "NOUN":
        noun_tokens.append(token)
    elif token.pos_ == 'NUM':
        numeral_tokens.append(token)

In [59]:
numeral_tokens[:10]

[8.3, 8.1, 1982, 6.2, 6, â€, 0.3, 0.2, 0.6, 0.4]

In [60]:
noun_tokens[:10]

[Inflation,
 climb,
 consumers,
 brink,
 expansion,
 consumer,
 price,
 index,
 measure,
 prices]

In [61]:
count = doc.count_by(spacy.attrs.POS)
count

{92: 98,
 100: 27,
 86: 15,
 85: 39,
 96: 17,
 97: 32,
 90: 34,
 95: 4,
 87: 13,
 89: 10,
 84: 23,
 103: 7,
 93: 20,
 94: 4,
 98: 8,
 101: 1}

In [62]:
for k,v in count.items():
    print(doc.vocab[k].text, "|",v)

NOUN | 98
VERB | 27
ADV | 15
ADP | 39
PROPN | 17
PUNCT | 32
DET | 34
PRON | 4
AUX | 13
CCONJ | 10
ADJ | 23
SPACE | 7
NUM | 20
PART | 4
SCONJ | 8
X | 1
