# NLP Tokenization & Lemmatization: NLTK vs spaCy

This notebook demonstrates key NLP preprocessing tasks using two popular Python libraries: **NLTK** and **spaCy**.

We'll explore:
- Tokenization
- Stemming
- Lemmatization
- Comparison between NLTK and spaCy


In [1]:
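import nltk
import spacy

# Tokenizer models used by word_tokenize (newer NLTK releases look for punkt_tab)
nltk.download('punkt')
nltk.download('punkt_tab')
# WordNet data used by WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

# Small English pipeline; install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")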


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shirvanit.ir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\shirvanit.ir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shirvanit.ir\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\shirvanit.ir\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Load Sample Text

We read the input text from the `sample.txt` file.
Make sure this file exists in the same directory as the notebook.


In [None]:
with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

print(" Sample Text:\n")
print(text)


📄 Sample Text:

The striped bats are hanging on their feet for best.
He studies all the time, but he studied little yesterday.
They're looking for better solutions than what we had.
Running faster won't always get you the best results.
My friends' bikes were stolen, and the police are investigating.
Better tools can be found in newer technologies.
I was hoping that better results had been achieved.
The children are playing with robotic dogs in the park.
He gave better advice than the older advisor.
Cats chasing mice is a common cartoon scenario.


## Tokenization

We'll tokenize the text using:
- **NLTK**'s `word_tokenize`
- **spaCy**'s built-in tokenizer


In [3]:
from nltk.tokenize import word_tokenize

# NLTK Tokenization
nltk_tokens = word_tokenize(text)
print("\n NLTK Tokens:")
print(nltk_tokens)

# spaCy Tokenization
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print("\n spaCy Tokens:")
print(spacy_tokens)



 NLTK Tokens:
['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best', '.', 'He', 'studies', 'all', 'the', 'time', ',', 'but', 'he', 'studied', 'little', 'yesterday', '.', 'They', "'re", 'looking', 'for', 'better', 'solutions', 'than', 'what', 'we', 'had', '.', 'Running', 'faster', 'wo', "n't", 'always', 'get', 'you', 'the', 'best', 'results', '.', 'My', 'friends', "'", 'bikes', 'were', 'stolen', ',', 'and', 'the', 'police', 'are', 'investigating', '.', 'Better', 'tools', 'can', 'be', 'found', 'in', 'newer', 'technologies', '.', 'I', 'was', 'hoping', 'that', 'better', 'results', 'had', 'been', 'achieved', '.', 'The', 'children', 'are', 'playing', 'with', 'robotic', 'dogs', 'in', 'the', 'park', '.', 'He', 'gave', 'better', 'advice', 'than', 'the', 'older', 'advisor', '.', 'Cats', 'chasing', 'mice', 'is', 'a', 'common', 'cartoon', 'scenario', '.']

 spaCy Tokens:
['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best', '.', '\n', 'He',
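
The spaCy list contains a `'\n'` token after the first sentence that the NLTK list does not have, so the two lists cannot be compared position by position. The sketch below is one possible way to line them up using only the standard library. The helper name `token_differences` and the two short token lists are hypothetical and hand-written for illustration; they are not output from either library.

```python
from difflib import SequenceMatcher


def token_differences(a, b):
    """Return (tag, a_span, b_span) for every span where two token lists differ.

    Whitespace-only tokens (such as a '\n' token) are dropped first,
    so that only differences in the actual words and punctuation remain.
    """
    a = [t for t in a if not t.isspace()]
    b = [t for t in b if not t.isspace()]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical token lists, written by hand for this example.
left = ["do", "n't", "stop", "."]
right = ["don't", "stop", ".", "\n"]
print(token_differences(left, right))
# [('replace', ['do', "n't"], ["don't"])]
```

In the notebook, the same helper could be called as `token_differences(nltk_tokens, spacy_tokens)` to list only the places where the two tokenizers disagree.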

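## Stemming (NLTK Only)

Stemming strips suffixes with heuristic rules to reduce a word to a stem, which is not always a dictionary word (for example, Porter turns `studies` into `studi`).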
We'll compare three stemmers from NLTK:
- **Porter**
- **Lancaster**
- **Snowball**


In [4]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

print(f"{'Word':<15}{'Porter':<15}{'Lancaster':<15}{'Snowball'}")
print("-" * 60)

for word in nltk_tokens:
    if word.isalpha():
        print(f"{word:<15}{porter.stem(word):<15}{lancaster.stem(word):<15}{snowball.stem(word)}")


Word           Porter         Lancaster      Snowball
------------------------------------------------------------
The            the            the            the
striped        stripe         striped        stripe
bats           bat            bat            bat
are            are            ar             are
hanging        hang           hang           hang
on             on             on             on
their          their          their          their
feet           feet           feet           feet
for            for            for            for
best           best           best           best
He             he             he             he
studies        studi          study          studi
all            all            al             all
the            the            the            the
time           time           tim            time
but            but            but            but
he             he             he             he
studied        studi          study         
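
The table shows `studies` and `studied` collapsing to the same stem, which is the practical point of stemming: different surface forms end up grouped under one key. The sketch below makes that grouping explicit. `group_by_stem` is a hypothetical helper, and `toy_stem` is a deliberately crude stand-in written for this example so the block runs without NLTK; it is not the Porter algorithm and its stems differ from the ones above.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the sorted, de-duplicated surface forms that reduce to it.

    `stem` is any callable that takes a word and returns its stem.
    """
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {s: sorted(forms) for s, forms in groups.items()}


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: lower-case the word and
    # strip one of a few suffixes. This is NOT Porter, Lancaster or Snowball.
    word = word.lower()
    for suffix in ("ies", "ied", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


print(group_by_stem(["studies", "studied", "bats", "bat"], toy_stem))
# {'stud': ['studied', 'studies'], 'bat': ['bat', 'bats']}
```

With the objects defined in the notebook, passing a real stemmer should work the same way, for example `group_by_stem(nltk_tokens, porter.stem)`.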

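## Lemmatization

Lemmatization maps a word to its dictionary form (lemma).  
Unlike stemming, it relies on a vocabulary and, ideally, on part-of-speech (POS) information taken from the surrounding context.

We'll use:
- **NLTK**'s `WordNetLemmatizer`, which treats every word as a noun unless a `pos` argument is passed
- **spaCy**'s `token.lemma_`, which the pipeline fills in using the POS tags it predicts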


In [5]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(" NLTK Lemmatization:\n")

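# No POS tag is passed, so lemmatize() falls back to its default pos="n"
# and treats every token as a noun.
for word in nltk_tokens:
    if word.isalpha():
        print(f"{word:<15}{lemmatizer.lemmatize(word)}")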


 NLTK Lemmatization:

The            The
striped        striped
bats           bat
are            are
hanging        hanging
on             on
their          their
feet           foot
for            for
best           best
He             He
studies        study
all            all
the            the
time           time
but            but
he             he
studied        studied
little         little
yesterday      yesterday
They           They
looking        looking
for            for
better         better
solutions      solution
than           than
what           what
we             we
had            had
Running        Running
faster         faster
wo             wo
always         always
get            get
you            you
the            the
best           best
results        result
My             My
friends        friend
bikes          bike
were           were
stolen         stolen
and            and
the            the
police         police
are            are
investigating  investig
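
Verb forms such as `hanging`, `studied` and `stolen` come back unchanged above because `lemmatize()` was called without a POS tag and therefore looked each word up as a noun. One common workaround, sketched below, is to tag the tokens with `nltk.pos_tag` and translate each Penn Treebank tag into the single-letter POS that WordNet expects. The helper names `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical. This also assumes the tagger resource has been downloaded, which the setup cell does not do (`averaged_perceptron_tagger_eng` in recent NLTK releases, `averaged_perceptron_tagger` in older ones).

```python
def penn_to_wordnet(tag):
    """Translate a Penn Treebank tag into a WordNet POS letter.

    Anything that is not an adjective, verb or adverb falls back to
    'n' (noun), which is also WordNetLemmatizer's own default.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun and everything else


def lemmatize_with_pos(tokens):
    """Return (token, lemma) pairs, passing a POS hint to WordNetLemmatizer."""
    # Imported here so penn_to_wordnet can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        (word, lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))
        for word, tag in pos_tag(tokens)
    ]
```

In the notebook this could be called as `lemmatize_with_pos(nltk_tokens)`. The quality of the lemmas then depends on the tagger, so the results may still differ from spaCy's.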

In [6]:
print(" spaCy Lemmatization:\n")

for token in doc:
    if token.is_alpha:
        print(f"{token.text:<15}{token.lemma_}")


 spaCy Lemmatization:

The            the
striped        striped
bats           bat
are            be
hanging        hang
on             on
their          their
feet           foot
for            for
best           good
He             he
studies        study
all            all
the            the
time           time
but            but
he             he
studied        study
little         little
yesterday      yesterday
They           they
looking        look
for            for
better         well
solutions      solution
than           than
what           what
we             we
had            have
Running        run
faster         fast
wo             will
always         always
get            get
you            you
the            the
best           good
results        result
My             my
friends        friend
bikes          bike
were           be
stolen         steal
and            and
the            the
police         police
are            be
investigating  investigate
Better       

## Final Comparison

| Feature           | NLTK                                                   | spaCy                                            |
|-------------------|--------------------------------------------------------|--------------------------------------------------|
| **Tokenization**  | Rule-based (`word_tokenize`)                           | Rule-based, run as the first step of `nlp(text)` |
| **Stemming**      | Porter, Lancaster, Snowball                            | Not supported                                    |
| **Lemmatization** | WordNet-based; assumes noun unless a POS tag is passed | Uses the POS tags predicted by the pipeline      |
| **Ease of Use**   | Modular; each step is called separately                | Built-in NLP pipeline; one call runs every step  |

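**Conclusion**:  
In the runs above, spaCy lemmatized inflected and irregular forms that NLTK's `WordNetLemmatizer`, called without POS tags, left unchanged (`are` to `be`, `studied` to `study`, `stolen` to `steal`). For lemmatization in real-world pipelines, **spaCy** is typically the more convenient and context-aware choice, while **NLTK** remains the option when stemming is needed.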
