### **Lemmatization using NLTK and spaCy**

**Import libraries:**

In [1]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

In [2]:
import spacy

**Performing Lemmatization using NLTK:**

In [3]:
# Download the WordNet corpus, the lexical database that WordNetLemmatizer looks words up in
nltk.download('wordnet')

# omw-1.4 is the "Open Multilingual WordNet 1.4" package; it contains files the WordNet corpus relies on
nltk.download('omw-1.4')

# Download the pre-trained averaged perceptron part-of-speech tagger used by nltk.pos_tag (no training happens locally)
nltk.download('averaged_perceptron_tagger')

[nltk_data] Error loading wordnet: <urlopen error [WinError 10065] A
[nltk_data]     socket operation was attempted to an unreachable host>
[nltk_data] Error loading omw-1.4: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno 11001] getaddrinfo failed>


False

In [4]:
# Sample text
text = "He was running very fast when he tripped and fell onto the ground."

* Create a lemmatizer object:

In [5]:
lemmatizer = WordNetLemmatizer()

In [6]:
for word, pos in nltk.pos_tag(text.split()):
    print(word, pos)

He PRP
was VBD
running VBG
very RB
fast RB
when WRB
he PRP
tripped VBD
and CC
fell VBD
onto IN
the DT
ground. NN
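
Because `text.split()` only splits on whitespace, the last token keeps its period (`ground.`). Below is a minimal sketch of one way to separate punctuation before tagging, using only the standard library. The `simple_tokenize` helper is a hypothetical name used for illustration, not an NLTK function; NLTK's own `word_tokenize` is likely the more robust choice when its tokenizer data is available.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (hypothetical helper)."""
    # \w+(?:'\w+)? matches a word, optionally with an internal apostrophe (e.g. "don't")
    # [^\w\s] matches any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

text = "He was running very fast when he tripped and fell onto the ground."
tokens = simple_tokenize(text)
print(tokens[-2:])  # ['ground', '.']
```

Passing `tokens` instead of `text.split()` to `nltk.pos_tag` would let the tagger and lemmatizer see `ground` without the trailing period.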


* Iterate through each word and lemmatize based on its part-of-speech (POS) tag:

In [7]:
for word, pos in nltk.pos_tag(text.split()):
    if pos.startswith('V'):     # Verb (VB, VBD, VBG, ...)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'v')}")
    elif pos.startswith('N'):   # Noun (NN, NNS, ...)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'n')}")
    elif pos.startswith('J'):   # Adjective (JJ, JJR, JJS)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'a')}")
    elif pos.startswith('RB'):  # Adverb (RB, RBR, RBS)
        print(f"NLTK: {word} -> {lemmatizer.lemmatize(word, 'r')}")
    else:                       # Other (leave unchanged)
        print(f"NLTK: {word} -> {word}")

NLTK: He -> He
NLTK: was -> be
NLTK: running -> run
NLTK: very -> very
NLTK: fast -> fast
NLTK: when -> when
NLTK: he -> he
NLTK: tripped -> trip
NLTK: and -> and
NLTK: fell -> fell
NLTK: onto -> onto
NLTK: the -> the
NLTK: ground. -> ground.
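
The tag dispatch in the loop above can be factored into a small helper that maps Penn Treebank tags (as returned by `nltk.pos_tag`) to the single-letter POS codes that `WordNetLemmatizer.lemmatize` accepts. This is a sketch: `penn_to_wordnet` is a hypothetical helper name, and the mapping assumes that adjective tags start with `J` and adverb tags with `RB`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('N'):
        return 'n'  # noun
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('RB'):
        return 'r'  # adverb
    return None     # pronouns, determiners, conjunctions, etc.

for tag in ['VBD', 'NN', 'JJ', 'RB', 'PRP']:
    print(tag, '->', penn_to_wordnet(tag))
```

With this helper, the loop body reduces to looking up `wn_pos = penn_to_wordnet(pos)` and calling `lemmatizer.lemmatize(word, wn_pos)` only when `wn_pos` is not `None`.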


**Performing Lemmatization using spaCy:**

In [8]:
# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

**Create a Doc object and iterate through tokens:**

In [9]:
doc = nlp(text)
for token in doc:
    print(f"spaCy: {token.text} -> {token.lemma_}")

spaCy: He -> he
spaCy: was -> be
spaCy: running -> run
spaCy: very -> very
spaCy: fast -> fast
spaCy: when -> when
spaCy: he -> he
spaCy: tripped -> trip
spaCy: and -> and
spaCy: fell -> fall
spaCy: onto -> onto
spaCy: the -> the
spaCy: ground -> ground
spaCy: . -> .
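
To see exactly where the two runs disagree, the lemmas printed above can be compared side by side. The lists below are copied by hand from the two outputs in this notebook; the variable names are illustrative only.

```python
words = "He was running very fast when he tripped and fell onto the ground.".split()

# Lemmas copied from the NLTK and spaCy outputs shown above
nltk_lemmas = ["He", "be", "run", "very", "fast", "when", "he",
               "trip", "and", "fell", "onto", "the", "ground."]
spacy_lemmas = ["he", "be", "run", "very", "fast", "when", "he",
                "trip", "and", "fall", "onto", "the", "ground", "."]

# zip stops at the shortest list, so spaCy's extra "." token is ignored
differences = [
    (word, n, s)
    for word, n, s in zip(words, nltk_lemmas, spacy_lemmas)
    if n != s
]

for word, n, s in differences:
    print(f"{word}: NLTK={n!r}, spaCy={s!r}")
```

Three words differ: spaCy lowercases the pronoun `He`, maps `fell` to `fall`, and separates the period from `ground`, while the NLTK run leaves all three as they appeared in the input.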


**Differences between NLTK and spaCy:**

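* **NLTK:**
    * `WordNetLemmatizer` treats a word as a noun unless a part-of-speech tag is passed, so good results require tagging first and mapping the tags yourself.
    * Offers more control over the process and flexibility for rule-based approaches.
    * Can be slower for large datasets, and it does not tokenize for you: with `text.split()`, punctuation stays attached (`ground.`).
* **spaCy:**
    * Lemmatizes inside a trained pipeline, using the part-of-speech tags predicted from context, so no manual tag mapping is needed.
    * Faster and generally more accurate for everyday language; in the example above it maps `fell` to `fall`, which the NLTK run leaves unchanged.
    * Less control over the specific lemma chosen.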
