In [1]:
import spacy

# POS tags

In [3]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token, " | ", token.pos_, " | ", spacy.explain(token.pos_))

Elon  |  PROPN  |  proper noun
flew  |  VERB  |  verb
to  |  ADP  |  adposition
mars  |  NOUN  |  noun
yesterday  |  NOUN  |  noun
.  |  PUNCT  |  punctuation
He  |  PRON  |  pronoun
carried  |  VERB  |  verb
biryani  |  ADJ  |  adjective
masala  |  NOUN  |  noun
with  |  ADP  |  adposition
him  |  PRON  |  pronoun


See https://v2.spacy.io/api/annotation for the complete list of POS categories in spaCy.

https://en.wikipedia.org/wiki/Preposition_and_postposition

https://en.wikipedia.org/wiki/Part_of_speech

# Tags

In [7]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token, " | ", token.pos_, " | ", spacy.explain(token.pos_),
          " | ", token.tag_, "| ", spacy.explain(token.tag_))

Wow  |  INTJ  |  interjection  |  UH |  interjection
!  |  PUNCT  |  punctuation  |  . |  punctuation mark, sentence closer
Dr.  |  PROPN  |  proper noun  |  NNP |  noun, proper singular
Strange  |  PROPN  |  proper noun  |  NNP |  noun, proper singular
made  |  VERB  |  verb  |  VBD |  verb, past tense
265  |  NUM  |  numeral  |  CD |  cardinal number
million  |  NUM  |  numeral  |  CD |  cardinal number
$  |  NUM  |  numeral  |  CD |  cardinal number
on  |  ADP  |  adposition  |  IN |  conjunction, subordinating or preposition
the  |  DET  |  determiner  |  DT |  determiner
very  |  ADV  |  adverb  |  RB |  adverb
first  |  ADJ  |  adjective  |  JJ |  adjective (English), other noun-modifier (Chinese)
day  |  NOUN  |  noun  |  NN |  noun, singular or mass


## spaCy's fine-grained tag shows whether "quit" is past or present tense

In [8]:
doc = nlp("He quits the job")

print(doc[1].text, "|", doc[1].tag_, "|", spacy.explain(doc[1].tag_))

quits | VBZ | verb, 3rd person singular present
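
The tense is carried by the fine-grained tag (`token.tag_`), not by the coarse `pos_` label. As a rough sketch, the lookup below maps the Penn Treebank-style verb tags to a short description. `VERB_TAG_TENSE` and `describe_tense` are hypothetical names used for illustration and are not part of spaCy; the snippet takes plain tag strings so it runs without loading a model.

```python
# Hypothetical lookup (not part of spaCy): Penn Treebank verb tags -> tense/form
VERB_TAG_TENSE = {
    "VB": "base form",
    "VBD": "past tense",
    "VBG": "gerund or present participle",
    "VBN": "past participle",
    "VBP": "present tense, not 3rd person singular",
    "VBZ": "present tense, 3rd person singular",
}


def describe_tense(tag):
    """Return a short tense description for a verb tag, or None for other tags."""
    return VERB_TAG_TENSE.get(tag)


print(describe_tense("VBZ"))  # the tag reported for "quits" above
print(describe_tense("VBD"))  # the tag a past-tense verb form would carry
```

With a parsed sentence, the same helper could be called as `describe_tense(doc[1].tag_)`.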


## Removing all SPACE, PUNCT and X tokens from text

Processing Microsoft's earnings report: https://www.microsoft.com/en-us/investor/earnings/FY-2023-Q2/press-release-webcast

In [9]:
earnings_text = """Microsoft Corp. today announced the following results for the quarter ended December 31, 2022, as compared to the corresponding period of last fiscal year:

·        Revenue was $52.7 billion and increased 2%  

·        Operating income was $20.4 billion GAAP and $21.6 billion non-GAAP, and decreased 8% and 3%, respectively

·        Net income was $16.4 billion GAAP and $17.4 billion non-GAAP, and decreased 12% and 7%, respectively

·        Diluted earnings per share was $2.20 GAAP and $2.32 non-GAAP, and decreased 11% and 6%, respectively
“The next major wave of computing is being born, as the Microsoft Cloud turns the world’s most advanced AI models into a new computing platform,” said Satya Nadella, chairman and chief executive officer of Microsoft. “We are committed to helping our customers use our platforms and tools to do more with less today and innovate for the future in the new era of AI.”
"""

doc = nlp(earnings_text)

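# Keep only tokens whose coarse POS is not whitespace (SPACE),
# punctuation (PUNCT), or the catch-all "other" category (X)
filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)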

In [11]:
filtered_tokens[:20]

[Microsoft,
 Corp.,
 today,
 announced,
 the,
 following,
 results,
 for,
 the,
 quarter,
 ended,
 December,
 31,
 2022,
 as,
 compared,
 to,
 the,
 corresponding,
 period]
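
The same filter can be wrapped in a small reusable function. This is a minimal sketch, not the notebook's own code: `drop_pos`, `EXCLUDED_POS` and `FakeToken` are hypothetical names, and `FakeToken` is only a stand-in for spaCy tokens so the snippet runs without a model. The POS labels in `sample` are hand-assigned for illustration and are not model output.

```python
from collections import namedtuple

# Stand-in for spaCy tokens (assumption: only .text and .pos_ are needed here)
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

# Coarse POS labels to discard; a set gives constant-time membership checks
EXCLUDED_POS = {"SPACE", "PUNCT", "X"}


def drop_pos(tokens, excluded=EXCLUDED_POS):
    """Return the tokens whose coarse POS label is not in `excluded`."""
    return [tok for tok in tokens if tok.pos_ not in excluded]


# Hand-labelled sample, for illustration only
sample = [
    FakeToken("Revenue", "NOUN"),
    FakeToken("was", "AUX"),
    FakeToken("$", "SYM"),
    FakeToken("52.7", "NUM"),
    FakeToken("billion", "NUM"),
    FakeToken("\n\n", "SPACE"),
    FakeToken(",", "PUNCT"),
]

kept = drop_pos(sample)
print([tok.text for tok in kept])
```

Because the function only reads `pos_`, passing a real `doc` in place of `sample` should work the same way.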

In [13]:
# count_by returns a dict that maps each POS value's integer ID to its frequency
count = doc.count_by(spacy.attrs.POS)
count

{96: 10,
 92: 43,
 100: 17,
 90: 9,
 85: 12,
 93: 21,
 97: 22,
 98: 2,
 84: 14,
 103: 10,
 87: 7,
 99: 7,
 89: 13,
 86: 4,
 94: 2,
 95: 3}

In [15]:
# Look up each integer ID in the vocab to get the readable POS label
for k, v in count.items():
    print(doc.vocab[k].text, " | ", v)

PROPN  |  10
NOUN  |  43
VERB  |  17
DET  |  9
ADP  |  12
NUM  |  21
PUNCT  |  22
SCONJ  |  2
ADJ  |  14
SPACE  |  10
AUX  |  7
SYM  |  7
CCONJ  |  13
ADV  |  4
PART  |  2
PRON  |  3
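
To see which categories dominate and how many tokens the SPACE/PUNCT/X filter discards, the counts can be ranked and summed. The sketch below hard-codes the label/count pairs printed above so it runs without a model; `pos_counts` is a hypothetical name. With a live `doc`, `collections.Counter(token.pos_ for token in doc)` would likely give the same label-to-count mapping without the vocab lookup.

```python
# Label/count pairs copied from the output above
pos_counts = {
    "PROPN": 10, "NOUN": 43, "VERB": 17, "DET": 9,
    "ADP": 12, "NUM": 21, "PUNCT": 22, "SCONJ": 2,
    "ADJ": 14, "SPACE": 10, "AUX": 7, "SYM": 7,
    "CCONJ": 13, "ADV": 4, "PART": 2, "PRON": 3,
}

total = sum(pos_counts.values())

# Tokens the SPACE/PUNCT/X filter would discard (X does not occur in this text)
removed = sum(pos_counts.get(label, 0) for label in ("SPACE", "PUNCT", "X"))
kept = total - removed

# Rank labels by frequency, most common first
ranked = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

for label, n in ranked[:5]:
    print(f"{label:<6} | {n:>3} | {n / total:.1%}")

print(f"tokens: {total}, removed by the filter: {removed}, kept: {kept}")
```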
