### Mapping a word to its base word
- talking -> talk
- adjustable -> adjust
- flying -> fly
etc.

### Reducing a word to its base form with fixed rules is called stemming. Using knowledge of a particular language to derive the base word is called lemmatization, and the base word it produces is called a lemma.

Lemmatization is language-specific and generally more accurate than stemming, but it is computationally more expensive and slower. Both have their own use cases.
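To make the difference concrete, here is a minimal toy sketch. It is not the Porter algorithm and not spaCy's lemmatizer: `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem` and `toy_lemmatize` are hypothetical names made up for illustration. The stemmer only strips suffixes, while the lemmatizer relies on a small lookup table standing in for real linguistic knowledge.

```python
# Toy illustration only: hypothetical rules and vocabulary, not a real library.
SUFFIX_RULES = ["ing", "able", "s"]  # suffixes the toy stemmer strips

# Tiny lookup table standing in for language-specific knowledge.
LEMMA_TABLE = {"ate": "eat", "flying": "fly", "meeting": "meet"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIX_RULES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["talking", "flying", "ate"]:
    print(w, "--->", toy_stem(w), "|", toy_lemmatize(w))
```

The irregular form "ate" shows the gap: no suffix rule can turn it into "eat", but a vocabulary lookup can. The price is that the lookup only knows the words it was given.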

### In this notebook, I use the NLTK library for stemming and spaCy for lemmatization on the text data.

In [1]:
import nltk
import spacy

In [3]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['talking', 'adjustable', 'flying']

# Not every result is correct (e.g. 'flying' -> 'fli'): the stemmer applies fixed rules and has no linguistic knowledge
for word in words:
    print(word, ' ---> ', stemmer.stem(word))

talking  --->  talk
adjustable  --->  adjust
flying  --->  fli


In [4]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('eating eat talk ate adjustable rat meeting better ability')

for token in doc:
    print(token.text, ' | ', token.lemma_)

eating  |  eat
eat  |  eat
talk  |  talk
ate  |  eat
adjustable  |  adjustable
rat  |  rat
meeting  |  meet
better  |  well
ability  |  ability


In [7]:
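# Add a custom rule: map 'bro' and 'bruv' to the lemma 'Brother'.
# LOWER matches regardless of casing; TEXT is case-sensitive and would miss the lowercase 'bruv'.
ar = nlp.get_pipe('attribute_ruler')
ar.add([[{'LOWER': 'bro'}], [{'LOWER': 'bruv'}]], {'LEMMA': 'Brother'})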

doc = nlp("Bro, you wanna hit chest today? Hell yeah, bruv")

for token in doc:
    print(token.text, ' | ', token.lemma_)

Bro  |  Brother
,  |  ,
you  |  you
wanna  |  wanna
hit  |  hit
chest  |  chest
today  |  today
?  |  ?
Hell  |  hell
yeah  |  yeah
,  |  ,
bruv  |  Brother


# Part-of-speech tagging

In [8]:
import spacy

In [12]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('Elon flew to mars yesterday. He forgot to carry his mobile phone')

for token in doc:
    print(token, ' | ', token.pos_ , ' | ', spacy.explain(token.pos_) , ' | ', token.tag_ , ' | ', spacy.explain(token.tag_))

Elon  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
flew  |  VERB  |  verb  |  VBD  |  verb, past tense
to  |  ADP  |  adposition  |  IN  |  conjunction, subordinating or preposition
mars  |  NOUN  |  noun  |  NNS  |  noun, plural
yesterday  |  NOUN  |  noun  |  NN  |  noun, singular or mass
.  |  PUNCT  |  punctuation  |  .  |  punctuation mark, sentence closer
He  |  PRON  |  pronoun  |  PRP  |  pronoun, personal
forgot  |  VERB  |  verb  |  VBD  |  verb, past tense
to  |  PART  |  particle  |  TO  |  infinitival "to"
carry  |  VERB  |  verb  |  VB  |  verb, base form
his  |  PRON  |  pronoun  |  PRP$  |  pronoun, possessive
mobile  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
phone  |  NOUN  |  noun  |  NN  |  noun, singular or mass
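The output above has two tag columns: `pos_` is the coarse part-of-speech tag and `tag_` is the fine-grained tag. As a rough sketch of how they relate, the snippet below groups the fine tags under each coarse tag. The triples are copied by hand from a subset of the output above rather than recomputed with spaCy, and the names `tagged` and `fine_by_coarse` are only illustrative.

```python
from collections import defaultdict

# (token, coarse pos_, fine tag_) triples copied from the output above.
tagged = [
    ("Elon", "PROPN", "NNP"),
    ("flew", "VERB", "VBD"),
    ("to", "ADP", "IN"),
    ("forgot", "VERB", "VBD"),
    ("to", "PART", "TO"),
    ("carry", "VERB", "VB"),
    ("his", "PRON", "PRP$"),
    ("mobile", "ADJ", "JJ"),
    ("phone", "NOUN", "NN"),
]

# Collect every fine-grained tag seen under each coarse tag.
fine_by_coarse = defaultdict(set)
for _, pos, tag in tagged:
    fine_by_coarse[pos].add(tag)

for pos in sorted(fine_by_coarse):
    print(pos, "-->", sorted(fine_by_coarse[pos]))
```

In this sample, one coarse tag (VERB) covers several fine tags (VB, VBD), and the same word ("to") receives different tags depending on its role in the sentence.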


In [13]:
text = """REDMOND, Wash. — April 25, 2024 — Microsoft Corp. today announced the following results for the quarter ended March 31, 2024, as compared to the corresponding period of last fiscal year:

·        Revenue was $61.9 billion and increased 17%

·        Operating income was $27.6 billion and increased 23%

·        Net income was $21.9 billion and increased 20%

·        Diluted earnings per share was $2.94 and increased 20%

“Microsoft Copilot and Copilot stack are orchestrating a new era of AI transformation, driving better business outcomes across every role and industry," said Satya Nadella, chairman and chief executive officer of Microsoft."""

In [14]:
doc = nlp(text=text)
for token in doc:
    if token.pos_ not in ['SPACE','X','PUNCT']:
        print(token.text, ' | ', token.pos_, '|', spacy.explain(token.pos_))

REDMOND  |  PROPN | proper noun
Wash.  |  PROPN | proper noun
April  |  PROPN | proper noun
25  |  NUM | numeral
2024  |  NUM | numeral
Microsoft  |  PROPN | proper noun
Corp.  |  PROPN | proper noun
today  |  NOUN | noun
announced  |  VERB | verb
the  |  DET | determiner
following  |  VERB | verb
results  |  NOUN | noun
for  |  ADP | adposition
the  |  DET | determiner
quarter  |  NOUN | noun
ended  |  VERB | verb
March  |  PROPN | proper noun
31  |  NUM | numeral
2024  |  NUM | numeral
as  |  SCONJ | subordinating conjunction
compared  |  VERB | verb
to  |  ADP | adposition
the  |  DET | determiner
corresponding  |  ADJ | adjective
period  |  NOUN | noun
of  |  ADP | adposition
last  |  ADJ | adjective
fiscal  |  ADJ | adjective
year  |  NOUN | noun
Revenue  |  NOUN | noun
was  |  AUX | auxiliary
$  |  SYM | symbol
61.9  |  NUM | numeral
billion  |  NUM | numeral
and  |  CCONJ | coordinating conjunction
increased  |  VERB | verb
17  |  NUM | numeral
%  |  NOUN | noun
Operating  |  VE

In [15]:
import spacy.attrs

# Important: count_by returns a dict mapping integer POS IDs to token counts
count = doc.count_by(spacy.attrs.POS)
count

{96: 12,
 97: 17,
 93: 15,
 92: 24,
 100: 13,
 90: 5,
 85: 7,
 98: 1,
 84: 8,
 103: 9,
 87: 5,
 99: 4,
 89: 7}

In [18]:
for key in count:
    print(doc.vocab[key].text,' -->',count[key])

PROPN  --> 12
PUNCT  --> 17
NUM  --> 15
NOUN  --> 24
VERB  --> 13
DET  --> 5
ADP  --> 7
SCONJ  --> 1
ADJ  --> 8
SPACE  --> 9
AUX  --> 5
SYM  --> 4
CCONJ  --> 7
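Once the IDs are mapped back to names, the counts are easier to read when ranked. A minimal sketch, assuming the counts printed above are copied into a plain dict (`pos_counts` is a hypothetical name, not something spaCy provides):

```python
# Counts copied from the output above (POS tag -> number of tokens).
pos_counts = {
    "PROPN": 12, "PUNCT": 17, "NUM": 15, "NOUN": 24, "VERB": 13,
    "DET": 5, "ADP": 7, "SCONJ": 1, "ADJ": 8, "SPACE": 9,
    "AUX": 5, "SYM": 4, "CCONJ": 7,
}

total = sum(pos_counts.values())

# Sort by count, most frequent first.
ranked = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

# Print each tag with its count and its share of all tokens.
for tag, n in ranked:
    print(f"{tag:<6} {n:>3}  {n / total:.1%}")
```

In a live session the same dict could be built from `count` and `doc.vocab` as in the cell above instead of being typed in by hand.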
