In [1]:
import spacy

In [7]:
# create a pipeline from the pre-trained English model.
nlp = spacy.load("en_core_web_sm")

In [8]:
text = "Elon flew to mars yesterday. He carried chicken briyani with him"
doc = nlp(text)

In [21]:
for token in doc:
    print(token, " | ", token.pos_, " | ", spacy.explain(token.pos_))  # coarse part-of-speech tag with its explanation

Elon  |  PROPN  |  proper noun
flew  |  VERB  |  verb
to  |  ADP  |  adposition
mars  |  NOUN  |  noun
yesterday  |  NOUN  |  noun
.  |  PUNCT  |  punctuation
He  |  PRON  |  pronoun
carried  |  VERB  |  verb
chicken  |  NOUN  |  noun
briyani  |  PROPN  |  proper noun
with  |  ADP  |  adposition
him  |  PRON  |  pronoun


In [24]:
# token.tag_ is the fine-grained tag: it adds detail such as verb tense or noun number.
for token in doc:
    print(token, " | ", token.tag_, " | ", spacy.explain(token.tag_))

Elon  |  NNP  |  noun, proper singular
flew  |  VBD  |  verb, past tense
to  |  IN  |  conjunction, subordinating or preposition
mars  |  NNS  |  noun, plural
yesterday  |  NN  |  noun, singular or mass
.  |  .  |  punctuation mark, sentence closer
He  |  PRP  |  pronoun, personal
carried  |  VBD  |  verb, past tense
chicken  |  NN  |  noun, singular or mass
briyani  |  NNP  |  noun, proper singular
with  |  IN  |  conjunction, subordinating or preposition
him  |  PRP  |  pronoun, personal


### spaCy tags each token using its context, so a real document can be filtered by part of speech

In [29]:
news = """
REDMOND, Wash. — July 25, 2023 — Microsoft Corp. today announced the following results for the quarter ended June 30, 2023, as compared to the corresponding period of last fiscal year (test etc.):

·        Revenue was $56.2 billion and increased 8% (up 10% in constant currency)

·        Operating income was $24.3 billion and increased 18% (up 21% in constant currency)

·        Net income was $20.1 billion and increased 20% (up 23% in constant currency)

·        Diluted earnings per share was $2.69 and increased 21% (up 23% in constant currency)

“Organizations are asking not only how – but how fast – they can apply this next generation of AI to address the biggest opportunities and challenges they face – safely and responsibly,” said Satya Nadella, chairman and chief executive officer of Microsoft. “We remain focused on leading the new AI platform shift, helping customers use the Microsoft Cloud to get the most value out of their digital spend, and driving operating leverage.”
"""

### Remove punctuation, whitespace and other unwanted tokens from the news text:

In [65]:
doc = nlp(news)
filtered_text = []
for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT","X"]:
        filtered_text.append(token.text)
        print(token, " | ", token.pos_)

REDMOND  |  PROPN
Wash.  |  PROPN
July  |  PROPN
25  |  NUM
2023  |  NUM
Microsoft  |  PROPN
Corp.  |  PROPN
today  |  NOUN
announced  |  VERB
the  |  DET
following  |  VERB
results  |  NOUN
for  |  ADP
the  |  DET
quarter  |  NOUN
ended  |  VERB
June  |  PROPN
30  |  NUM
2023  |  NUM
as  |  SCONJ
compared  |  VERB
to  |  ADP
the  |  DET
corresponding  |  ADJ
period  |  NOUN
of  |  ADP
last  |  ADJ
fiscal  |  ADJ
year  |  NOUN
test  |  NOUN
Revenue  |  NOUN
was  |  AUX
$  |  SYM
56.2  |  NUM
billion  |  NUM
and  |  CCONJ
increased  |  VERB
8  |  NUM
%  |  NOUN
up  |  ADV
10  |  NUM
%  |  NOUN
in  |  ADP
constant  |  ADJ
currency  |  NOUN
Operating  |  VERB
income  |  NOUN
was  |  AUX
$  |  SYM
24.3  |  NUM
billion  |  NUM
and  |  CCONJ
increased  |  VERB
18  |  NUM
%  |  NOUN
up  |  ADV
21  |  NUM
%  |  NOUN
in  |  ADP
constant  |  ADJ
currency  |  NOUN
Net  |  ADJ
income  |  NOUN
was  |  AUX
$  |  SYM
20.1  |  NUM
billion  |  NUM
and  |  CCONJ
increased  |  VERB
20  |  NUM
%  |  NOUN


In [74]:
# join the kept tokens back into a single clean string:
clean_news = ' '.join(filtered_text)

In [75]:
clean_news

'REDMOND Wash. July 25 2023 Microsoft Corp. today announced the following results for the quarter ended June 30 2023 as compared to the corresponding period of last fiscal year test Revenue was $ 56.2 billion and increased 8 % up 10 % in constant currency Operating income was $ 24.3 billion and increased 18 % up 21 % in constant currency Net income was $ 20.1 billion and increased 20 % up 23 % in constant currency Diluted earnings per share was $ 2.69 and increased 21 % up 23 % in constant currency Organizations are asking not only how but how fast they can apply this next generation of AI to address the biggest opportunities and challenges they face safely and responsibly said Satya Nadella chairman and chief executive officer of Microsoft We remain focused on leading the new AI platform shift helping customers use the Microsoft Cloud to get the most value out of their digital spend and driving operating leverage'

In [78]:
# counting the different parts of speech in the text
# (count_by returns a dict keyed by integer POS IDs, not by name):
doc2 = nlp(clean_news)
count = doc2.count_by(spacy.attrs.POS)
count

{96: 14,
 93: 19,
 92: 34,
 100: 22,
 90: 8,
 85: 13,
 98: 3,
 84: 16,
 87: 6,
 99: 4,
 89: 9,
 86: 8,
 94: 3,
 95: 4}
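The integer keys in the dictionary above are POS IDs. If only the names and counts are needed, one alternative is to count the `pos_` strings directly with `collections.Counter`. The sketch below is a minimal illustration, not the notebook's own method: `count_pos` and `FakeToken` are hypothetical names, and `FakeToken` is a stand-in so the example runs without a loaded model. Because the function reads only the `pos_` attribute, passing a spaCy `Doc` should work the same way.

```python
from collections import Counter, namedtuple


def count_pos(tokens):
    """Count coarse POS labels; reads only the `pos_` attribute of each token."""
    return Counter(token.pos_ for token in tokens)


# Hypothetical stand-in for spaCy tokens (assumption: only .text and .pos_ are needed).
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
sample = [
    FakeToken("Revenue", "NOUN"),
    FakeToken("was", "AUX"),
    FakeToken("$", "SYM"),
    FakeToken("56.2", "NUM"),
    FakeToken("billion", "NUM"),
]

pos_counts = count_pos(sample)
for label, n in pos_counts.most_common():
    print(label, " | ", n)
```

With a real pipeline the call would be `count_pos(doc2)`, which gives label-keyed counts without the `vocab` lookup.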

In [79]:
# map each POS ID back to its name and print the count:
for key, value in count.items():
    print(doc2.vocab[key].text, " | ", value)

PROPN  |  14
NUM  |  19
NOUN  |  34
VERB  |  22
DET  |  8
ADP  |  13
SCONJ  |  3
ADJ  |  16
AUX  |  6
SYM  |  4
CCONJ  |  9
ADV  |  8
PART  |  3
PRON  |  4


# Exercise: spaCy POS tutorial

**1.** You are parsing a news story from cnbc.com. The story is stored in `news_story.txt`, which is available in this same folder. You need to:

**i.** Extract all NOUN tokens from this story. You will have to read the file in Python first to collect all the text, and then extract the NOUNs into a Python list.

**ii.** Extract all numbers (NUM POS type) into a Python list.

**iii.** Print a count of all POS tags in this story.

In [102]:
import pandas as pd

In [103]:
# open the file in read mode:
news_file = open("news_story.txt", 'r')

In [104]:
# reading the content of the file.
content = news_file.read()
news_file.close() # closing the opened file.
content

'Inflation rose again in April, continuing a climb that has pushed consumers to the brink and is threatening the economic expansion, the Bureau of Labor Statistics reported Wednesday.\n\nThe consumer price index, a broad-based measure of prices for goods and services, increased 8.3% from a year ago, higher than the Dow Jones estimate for an 8.1% gain. That represented a slight ease from March’s peak but was still close to the highest level since the summer of 1982.\n\nRemoving volatile food and energy prices, so-called core CPI still rose 6.2%, against expectations for a 6% gain, clouding hopes that inflation had peaked in March.\n\nThe month-over-month gains also were higher than expectations — 0.3% on headline CPI versus the 0.2% estimate and a 0.6% increase for core, against the outlook for a 0.4% gain.\n\nThe price gains also meant that workers continued to lose ground. Real wages adjusted for inflation decreased 0.1% on the month despite a nominal increase of 0.3% in average hourl
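As an alternative to calling `open` and `close` by hand, a `with` block closes the file automatically, even if reading raises an error. The sketch below is an illustration only: `read_text_file` and `demo_story.txt` are hypothetical names, the UTF-8 encoding is an assumption about the file, and a temporary file stands in for `news_story.txt` so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def read_text_file(path):
    """Return the full text of a file; the `with` block closes it on exit."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Demo on a throwaway file (assumption: the real story file is UTF-8 text).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_story.txt"
    demo_path.write_text("Inflation rose again in April.", encoding="utf-8")
    demo_content = read_text_file(demo_path)

print(demo_content)
```

For the exercise, the call would be `content = read_text_file("news_story.txt")`.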

In [105]:
# Extracting the proper nouns (PROPN) and numbers (NUM) from the content:

In [108]:
# first, create a language-processing pipeline with spaCy
# by loading the pre-trained linguistic model:
import spacy
nlp = spacy.load("en_core_web_sm") # loading the pre_trained pipeline

In [109]:
# passing the data through the pipeline:
doc = nlp(content)

In [121]:
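proper_noun = []
nums = []

# compare strings with ==, not `is`: `is` tests object identity and is unreliable here.
# note: this collects proper nouns (PROPN); use "NOUN" to collect common nouns as the exercise asks.
for token in doc:
    if token.pos_ == "PROPN":
        proper_noun.append(token)
    if token.pos_ == "NUM":
        nums.append(token)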

In [124]:
proper_noun

[April,
 Bureau,
 Labor,
 Statistics,
 Wednesday,
 Dow,
 Jones,
 March,
 CPI,
 March,
 CPI,
 Covid,
 Federal,
 Reserve,
 Wednesday,
 Fed]
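The exercise asks for NOUN and NUM tokens, and both can be gathered in a single pass. The sketch below is one possible way to do it, under stated assumptions: `texts_by_pos` and `FakeToken` are hypothetical names, and `FakeToken` stands in for spaCy tokens so the example runs without a model. It stores `token.text` (plain strings) rather than token objects, which is a design choice that makes the lists easy to print or save.

```python
from collections import namedtuple


def texts_by_pos(tokens, wanted):
    """Return {pos: [token text, ...]} for each POS label listed in `wanted`."""
    buckets = {pos: [] for pos in wanted}
    for token in tokens:
        # Dictionary membership compares label values, not object identity.
        if token.pos_ in buckets:
            buckets[token.pos_].append(token.text)
    return buckets


# Hypothetical stand-in for spaCy tokens (assumption: only .text and .pos_ are read).
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
sample = [
    FakeToken("Inflation", "NOUN"),
    FakeToken("rose", "VERB"),
    FakeToken("8.3", "NUM"),
    FakeToken("%", "NOUN"),
    FakeToken("April", "PROPN"),
]

found = texts_by_pos(sample, ["NOUN", "NUM"])
print(found["NOUN"])
print(found["NUM"])
```

With the pipeline from this notebook, the call would be `texts_by_pos(doc, ["NOUN", "NUM"])`.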

In [126]:
# printing the count of all the different parts of speech:
count = doc.count_by(spacy.attrs.POS)
count

{92: 96,
 100: 27,
 86: 15,
 85: 39,
 96: 16,
 97: 32,
 90: 34,
 95: 4,
 87: 13,
 89: 10,
 84: 23,
 103: 7,
 93: 19,
 94: 4,
 98: 8,
 101: 1}

In [131]:
for key, value in count.items():
    print(doc.vocab[key].text, " | ",  value)

NOUN  |  96
VERB  |  27
ADV  |  15
ADP  |  39
PROPN  |  16
PUNCT  |  32
DET  |  34
PRON  |  4
AUX  |  13
CCONJ  |  10
ADJ  |  23
SPACE  |  7
NUM  |  19
PART  |  4
SCONJ  |  8
X  |  1
