### **Lemmatization using NLTK and spaCy**

**Import libraries:**

In [1]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

In [4]:
import spacy

**Performing Lemmatization using NLTK:**

In [5]:
# Download the WordNet corpus, which WordNetLemmatizer uses to look up lemmas
nltk.download('wordnet')

# omw-1.4 stands for "Open Multilingual WordNet 1.4"; it provides extra data files used alongside the WordNet corpus
nltk.download('omw-1.4')

# Download the pre-trained averaged perceptron part-of-speech tagger used by nltk.pos_tag (nothing is trained here)
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [6]:
# Sample text
text = "He was running very fast when he tripped and fell onto the ground."

* Create a lemmatizer object:

In [7]:
lemmatizer = WordNetLemmatizer()

In [8]:
for word, pos in nltk.pos_tag(text.split()):
    print(word, pos)

He PRP
was VBD
running VBG
very RB
fast RB
when WRB
he PRP
tripped VBD
and CC
fell VBD
onto IN
the DT
ground. NN
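Note that `text.split()` splits on whitespace only, so the last token above is `ground.` with the period still attached. The following is a minimal sketch of one workaround using only the standard library. The helper name `simple_tokenize` and the regular expression are assumptions made for illustration and are not part of the original notebook; NLTK's own `nltk.word_tokenize` is the more usual choice, but it needs an additional tokenizer data download.

```python
import re

text = "He was running very fast when he tripped and fell onto the ground."

def simple_tokenize(sentence):
    """Hypothetical helper: return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize(text)
print(tokens[-2:])  # the final word and the period are now separate tokens
```

This pattern is deliberately crude: it would also split contractions such as "don't" into three pieces, so it is only suitable for simple sentences like the sample.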


* Iterate through each word and lemmatize based on its part-of-speech (POS) tag:

In [9]:
for word, pos in nltk.pos_tag(text.split()):
    if pos.startswith('V'):     # Verb (VB, VBD, VBG, ...)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'v')}")
    elif pos.startswith('N'):   # Noun (NN, NNS, ...)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'n')}")
    elif pos.startswith('J'):   # Adjective (JJ, JJR, JJS)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'a')}")
    elif pos.startswith('RB'):  # Adverb (RB, RBR, RBS)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'r')}")
    else:                       # Other (leave unchanged)
        print(f"NLTK: {word} -> {word}")

NLTK: He -> He
NLTK: was -> be
NLTK: running -> run
NLTK: very -> very
NLTK: fast -> fast
NLTK: when -> when
NLTK: he -> he
NLTK: tripped -> trip
NLTK: and -> and
NLTK: fell -> fell
NLTK: onto -> onto
NLTK: the -> the
NLTK: ground. -> ground.
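The tag-prefix branching used for lemmatization can also be factored into a small mapping function, which keeps the loop body short and makes the mapping easy to check on its own. This is a sketch under stated assumptions: `penn_to_wordnet` is a hypothetical helper name, and the mapping covers only the four parts of speech that the WordNet lemmatizer accepts (`'v'`, `'n'`, `'a'`, `'r'`).

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS code, or None."""
    if tag.startswith('V'):   # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag.startswith('N'):   # NN, NNS, NNP, NNPS
        return 'n'
    if tag.startswith('J'):   # JJ, JJR, JJS
        return 'a'
    if tag.startswith('RB'):  # RB, RBR, RBS
        return 'r'
    return None               # pronouns, determiners, conjunctions, ...

# Show the mapping for a few of the tags seen in the sample sentence
for tag in ['VBD', 'NN', 'JJ', 'RB', 'PRP']:
    print(tag, '->', penn_to_wordnet(tag))
```

In the notebook this could be combined with the existing `lemmatizer` object as `lemmatizer.lemmatize(word, wn_pos) if wn_pos else word`, where `wn_pos = penn_to_wordnet(pos)`.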


**Performing Lemmatization using spaCy:**

In [10]:
nlp = spacy.load("en_core_web_sm")

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
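The `E050` error above means the `en_core_web_sm` pipeline package is not installed in the current environment; spaCy models are installed separately from the library itself, typically with `python -m spacy download en_core_web_sm`. A minimal sketch of a loader that downloads the model when it is missing follows. The helper names `model_is_installed` and `load_model` are hypothetical, and the check assumes the model was installed as a regular Python package.

```python
import importlib.util

MODEL_NAME = "en_core_web_sm"

def model_is_installed(name):
    """Return True if a package with this name can be found in the environment."""
    return importlib.util.find_spec(name) is not None

def load_model(name=MODEL_NAME):
    """Load a spaCy pipeline, downloading it first if the package is missing."""
    import spacy
    from spacy.cli import download

    if not model_is_installed(name):
        download(name)  # same effect as: python -m spacy download <name>
    return spacy.load(name)

# In the notebook, replace the failing cell with:
# nlp = load_model()
```

Depending on the environment, a freshly downloaded model may only become loadable after the notebook kernel is restarted.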

**Create a Doc object and iterate through tokens:**

In [None]:
doc = nlp(text)
for token in doc:
    print(f"spaCy: {token.text} -> {token.lemma_}")

spaCy: He -> he
spaCy: was -> be
spaCy: running -> run
spaCy: very -> very
spaCy: fast -> fast
spaCy: when -> when
spaCy: he -> he
spaCy: tripped -> trip
spaCy: and -> and
spaCy: fell -> fall
spaCy: onto -> onto
spaCy: the -> the
spaCy: ground -> ground
spaCy: . -> .


**Differences between NLTK and spaCy:**

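* **NLTK:**
    * The WordNet lemmatizer needs a part-of-speech tag for each word; if none is given, it treats the word as a noun.
    * Tokenization, tagging, and lemmatization are separate steps, which gives more control and makes rule-based customization easier.
    * Can be slower for large datasets.
* **spaCy:**
    * Tokenizes the text itself (note the separate `.` token above) and predicts part-of-speech tags with a statistical model, then uses them to pick a context-aware lemma.
    * Faster and generally more accurate for everyday language.
    * Less control over the specific lemma chosen.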
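To make the comparison concrete, the two printed outputs in this notebook can be diffed token by token. The sketch below hard-codes the word and lemma pairs copied from those outputs, so it runs without NLTK or spaCy; `differing_lemmas` is a hypothetical helper, and pairing by position works here only because the two token sequences line up until the final period.

```python
# (word, lemma) pairs copied from the NLTK output in this notebook
nltk_out = [("He", "He"), ("was", "be"), ("running", "run"), ("very", "very"),
            ("fast", "fast"), ("when", "when"), ("he", "he"), ("tripped", "trip"),
            ("and", "and"), ("fell", "fell"), ("onto", "onto"), ("the", "the"),
            ("ground.", "ground.")]

# (word, lemma) pairs copied from the spaCy output in this notebook
spacy_out = [("He", "he"), ("was", "be"), ("running", "run"), ("very", "very"),
             ("fast", "fast"), ("when", "when"), ("he", "he"), ("tripped", "trip"),
             ("and", "and"), ("fell", "fall"), ("onto", "onto"), ("the", "the"),
             ("ground", "ground"), (".", ".")]

def differing_lemmas(a, b):
    """Pair results by position and keep the entries whose lemmas disagree."""
    return [(wa, la, lb) for (wa, la), (_, lb) in zip(a, b) if la != lb]

for word, nltk_lemma, spacy_lemma in differing_lemmas(nltk_out, spacy_out):
    print(f"{word}: NLTK -> {nltk_lemma}, spaCy -> {spacy_lemma}")
```

Three tokens differ: spaCy lowercases `He`, separates the period from `ground`, and maps `fell` to `fall`. NLTK leaving `fell` unchanged is likely because WordNet also lists `fell` as a verb in its own right (as in felling a tree).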
