
# 📘 Natural Language Processing (NLP): Tokenization, Stemming & Lemmatization

**What you'll learn:**  
- Clear definitions of **NLP** and **NLG**  
- The **NLP preprocessing pipeline**  
- How to do **Tokenization**, **Stemming**, and **Lemmatization** in NLTK  
- When to use each technique (with pros/cons)  

---



## 1️⃣ Introduction

### What is NLP?
**Natural Language Processing (NLP)** is a field of AI that helps computers understand, interpret, and generate human language. It powers applications like chatbots, search engines, summarizers, and translators.

### What is NLG?
**Natural Language Generation (NLG)** is a subfield of NLP focused on generating human-like text from structured or unstructured data. Examples: automated report writing, product descriptions, and conversational responses.

### Why text preprocessing matters
Raw text is noisy. We typically **normalize and structure** it before modeling. Three core steps you're learning here:
- **Tokenization:** Split text into sentences/words.
- **Stemming:** Chop words to their rough root forms.
- **Lemmatization:** Reduce words to valid dictionary forms (lemmas).



## 2️⃣ NLP Pipeline (Overview)

```
Raw Text
   │
   ├──► Tokenization (sentences, words)
   │         │
   │         ├──► Cleaning (lowercase, remove punctuation, keep alphabetic)
   │         └──► Stopword Removal (a, the, is, ...)
   │
   ├──► Normalization
   │         ├──► Stemming (rule-based roots)
   │         └──► Lemmatization (dictionary-based lemmas)
   │
   └──► Downstream Tasks (classification, NER, QA, search, etc.)
```
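
The stages in the diagram can be read as a chain of small functions. Here is a minimal sketch in plain Python. The regex tokenizer, the `TOY_STOPWORDS` set, and the function names are illustrative assumptions, not NLTK's behaviour:

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
TOY_STOPWORDS = {"a", "an", "the", "is", "are", "and", "to", "of", "in"}

def tokenize(text):
    """Split on runs of letters/apostrophes; a stand-in for a real tokenizer."""
    return re.findall(r"[A-Za-z']+", text)

def clean(tokens):
    """Lowercase and keep purely alphabetic tokens."""
    return [t.lower() for t in tokens if t.isalpha()]

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def preprocess(text, normalize=lambda t: t):
    """Run the stages in order; `normalize` is where a stemmer or lemmatizer plugs in."""
    tokens = remove_stopwords(clean(tokenize(text)))
    return [normalize(t) for t in tokens]

print(preprocess("The models are learning to predict new values."))
```

Keeping normalization as a pluggable argument makes it easy to swap between stemming and lemmatization without touching the earlier stages.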



## 3️⃣ Setup (NLTK Resources)

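Run the cell below **once** to download the required NLTK resources:
- `punkt` (tokenizer models)
- `stopwords` (list of common English stopwords)
- `wordnet` + `omw-1.4` (lemmatizer dictionary data)
- `averaged_perceptron_tagger` (POS tagger used in the lemmatization step)

Recent NLTK releases look for `punkt_tab` and `averaged_perceptron_tagger_eng` instead, so the cell requests those as well.


In [None]:

# ⬇️ One-time downloads (safe to re-run)
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')  # needed by pos_tag in the lemmatization step
# Newer NLTK releases use these resource names for the tokenizer and tagger
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')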



## 4️⃣ Example Text

We'll use the same paragraph across all steps for a fair comparison.


In [None]:

paragraph = "AI, machine learning and deep learning are common terms in enterprise\nIT and sometimes used interchangeably, especially by companies in their marketing materials.\nBut there are distinctions. The term AI, coined in the 1950s, refers to the simulation of human\nintelligence by machines. It covers an ever-changing set of capabilities as new technologies\nare developed. Technologies that come under the umbrella of AI include machine learning and\ndeep learning. Machine learning enables software applications to become more accurate at\npredicting outcomes without being explicitly programmed to do so. Machine learning algorithms\nuse historical data as input to predict new output values. This approach became vastly more\neffective with the rise of large data sets to train on. Deep learning, a subset of machine\nlearning, is based on our understanding of how the brain is structured. Deep learning's\nuse of artificial neural networks structure is the underpinning of recent advances in AI,\nincluding self-driving cars and ChatGPT.\n"
print(paragraph)



## 5️⃣ Tokenization

**Definition:** Splitting text into **sentences** and **words**.  
This is often the first step in any NLP pipeline.

**Why it matters:** Models and rules operate on tokens, not raw strings.


In [None]:

from nltk.tokenize import sent_tokenize, word_tokenize

# Sentence tokenization
sentences = sent_tokenize(paragraph)

# Word tokenization
words = word_tokenize(paragraph)

print("Number of sentences:", len(sentences))
print("Sample sentences (first 2):", sentences[:2])
print("\nNumber of word tokens:", len(words))
print("Sample word tokens (first 20):", words[:20])
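
To see why a tokenizer is more than splitting on spaces, compare whitespace splitting with a simple regex on one sentence from the paragraph. This is a rough sketch in plain Python. The regex is an illustrative stand-in and is much simpler than what `word_tokenize` does:

```python
import re

text = "The term AI, coined in the 1950s, refers to machines."

# Naive approach: whitespace splitting leaves punctuation glued to words
naive = text.split()

# Regex approach: words and punctuation marks become separate tokens
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(naive)
print(regex_tokens)
```

With whitespace splitting, `AI,` and `AI` would count as different tokens, which is exactly the kind of noise tokenization is meant to remove.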



### Stopwords & Cleaning Helper

We'll remove common words like *the, is, and, to,* etc., and keep only alphabetic tokens for clarity.


In [None]:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def clean_and_filter(tokens):
    cleaned = []
    for w in tokens:
        w_lower = w.lower()
        # keep alphabetic tokens only
        if w_lower.isalpha() and w_lower not in stop_words:
            cleaned.append(w_lower)
    return cleaned

cleaned_words = clean_and_filter(words)
print("Cleaned tokens (first 30):", cleaned_words[:30])
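
One way to see what this kind of filtering buys you is to count tokens before and after. The sketch below applies the same lowercase, alphabetic-only, no-stopwords rule in plain Python. The token list and `toy_stopwords` are hand-written assumptions for illustration, not NLTK's stopword list:

```python
from collections import Counter

# Hand-written sample tokens and a tiny stopword set (illustrative assumptions)
tokens = ["Machine", "learning", "is", "a", "subset", "of", "AI", ",", "and",
          "deep", "learning", "is", "a", "subset", "of", "machine", "learning", "."]
toy_stopwords = {"is", "a", "of", "and", "the"}

# Same rule as the helper above: lowercase, alphabetic only, not a stopword
kept = [t.lower() for t in tokens if t.isalpha() and t.lower() not in toy_stopwords]

print("Before:", len(tokens), "tokens")
print("After: ", len(kept), "tokens")
print(Counter(kept).most_common(3))
```

After filtering, the most frequent tokens are content words rather than punctuation or function words.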



## 6️⃣ Stemming (PorterStemmer)

**Definition:** Heuristic process that chops endings to reach a **root** (may not be a valid word).  
**Pros:** Fast, simple.  
**Cons:** Can be **too aggressive** (`studies` → `studi`) and cannot relate irregular forms (`better` stays `better` instead of mapping to `good`).
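
A quick way to get a feel for these trade-offs is to stem a few individual words. This minimal sketch uses NLTK's `PorterStemmer`, which needs no downloaded data. The word list is an arbitrary choice for illustration:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Arbitrary sample words chosen to show typical stemmer behaviour
words_to_stem = ["studies", "studying", "learning", "better", "university", "universe"]
stems = {w: stemmer.stem(w) for w in words_to_stem}

for word, stem in stems.items():
    print(f"{word:<12} -> {stem}")
```

Note that `university` and `universe` end up with the same stem even though they mean different things. This is the over-stemming risk in practice.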

> 🔧 Your original `nlp2.py` kept only stopwords and then stemmed them. Here we fix it to **exclude** stopwords and stem meaningful tokens.


In [None]:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

# Stem words sentence-by-sentence
stemmed_sentences = []
for s in sentences:
    tokens = word_tokenize(s)
    tokens = clean_and_filter(tokens)  # remove stopwords, keep alphabetic
    stems = [stemmer.stem(t) for t in tokens]
    stemmed_sentences.append(" ".join(stems))

print("Original sentence (sample):", sentences[0])
print("Stemmed sentence (sample): ", stemmed_sentences[0])



## 7️⃣ Lemmatization (WordNetLemmatizer)

**Definition:** Reduces words to their **dictionary/canonical form** using vocabulary + morphology.  
**Pros:** More accurate than stemming (`studies` → `study`, `mice` → `mouse`).  
**Cons:** Slightly slower; often needs **POS tags** for best results.
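
The effect of the POS tag is easiest to see on single words. The sketch below lemmatizes the same word under different WordNet POS letters. It assumes the `wordnet` resource has been downloaded and prints a hint instead of failing if it has not:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# (word, WordNet POS) pairs; "n" (noun) is the default POS when none is given
examples = [("running", "n"), ("running", "v"), ("better", "n"), ("better", "a"), ("mice", "n")]

try:
    results = {(word, pos): lemmatizer.lemmatize(word, pos) for word, pos in examples}
except LookupError:
    # WordNet data is missing; nltk.download('wordnet') provides it
    results = {}
    print("WordNet not found: download the NLTK resources first.")

for (word, pos), lemma in results.items():
    print(f"{word!r} as {pos!r} -> {lemma!r}")
```

`running` is reduced to `run` only when treated as a verb, and `better` maps to `good` only when treated as an adjective. This is why the lemmatizer benefits from POS tags.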

> 🔧 Your original `nlp3.py` had indentation issues. Here it's corrected and aligned with proper stopword removal.


In [None]:

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

lemmatizer = WordNetLemmatizer()

# Simple POS map to WordNet tags
from nltk.corpus.reader import wordnet
def to_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default

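lemmatized_sentences = []
for s in sentences:
    # Tag the raw tokens first: the tagger uses casing and function words as context
    tagged = pos_tag(word_tokenize(s))  # [('word', 'NN'), ...]
    # Then apply the same filter as clean_and_filter and lemmatize what remains
    lemmas = [
        lemmatizer.lemmatize(w.lower(), to_wordnet_pos(tag))
        for w, tag in tagged
        if w.isalpha() and w.lower() not in stop_words
    ]
    lemmatized_sentences.append(" ".join(lemmas))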

print("Original sentence (sample):    ", sentences[0])
print("Lemmatized sentence (sample): ", lemmatized_sentences[0])
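
The tag mapping can also be written as a lookup table, which some readers find easier to extend. This is a sketch of an equivalent, table-driven variant. `TAG_PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical names, and the single letters are the values behind WordNet's `ADJ`, `VERB`, `NOUN`, and `ADV` constants:

```python
# WordNet's POS constants are single letters: 'a' (adjective), 'v' (verb), 'n' (noun), 'r' (adverb)
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter using its first character."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], default)

# Hand-written (token, tag) pairs in the shape pos_tag returns; the tags are illustrative
tagged_example = [("machines", "NNS"), ("became", "VBD"), ("vastly", "RB"),
                  ("accurate", "JJ"), ("the", "DT")]

for word, tag in tagged_example:
    print(f"{word:<10} {tag:<4} -> {treebank_to_wordnet(tag)}")
```

Tags with no WordNet counterpart, such as `DT`, fall back to noun. This matches the default branch of the function above.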



## 8️⃣ Quick Comparison: Stemming vs Lemmatization

Below we compare outputs on the first few sentences.


In [None]:

for i in range(min(3, len(sentences))):
    print(f"\nSentence {i+1}:")
    print("Original:     ", sentences[i])
    print("Stemmed:      ", stemmed_sentences[i])
    print("Lemmatized:   ", lemmatized_sentences[i])
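
Reading whole sentences side by side can hide the individual differences. A small helper can list only the tokens where the two outputs disagree. `differing_tokens` is a hypothetical name, and the sketch assumes both strings come from the same filtered tokens, so positions line up:

```python
def differing_tokens(stemmed, lemmatized):
    """Return (stem, lemma) pairs that differ in two space-joined, position-aligned strings."""
    pairs = zip(stemmed.split(), lemmatized.split())
    return [(stem, lemma) for stem, lemma in pairs if stem != lemma]

# Hand-written strings in the same shape as the lists above (illustrative, not captured output)
stemmed_example = "machin learn enabl softwar applic"
lemmatized_example = "machine learning enable software application"

for stem, lemma in differing_tokens(stemmed_example, lemmatized_example):
    print(f"{stem:<10} vs {lemma}")
```

In the notebook, the same helper could be called as `differing_tokens(stemmed_sentences[i], lemmatized_sentences[i])`.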



### 📊 Stemming vs Lemmatization (At a Glance)

| Feature | Stemming | Lemmatization |
|---|---|---|
| Output form | Rough root (may not be a word) | Dictionary word (lemma) |
| Speed | Faster | Slower |
| Accuracy | Lower | Higher |
| Needs POS? | No | Recommended |
| Typical use | Quick search, lightweight preprocessing | Production NLP, linguistically clean features |

**Rule of thumb:**  
- Use **stemming** when speed matters and minor errors are acceptable.  
- Use **lemmatization** when correctness matters (search quality, analytics, production).  



## 9️⃣ Real-World Applications

- **Search engines:** Normalize queries & documents for better matching.  
- **Chatbots/Assistants:** Clean user input before intent classification.  
- **Topic Modeling & IR:** Normalize tokens for stable topics/indices.  
- **Sentiment Analysis:** Reduce sparsity by merging word variants.  
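
As an illustration of the search use case, the sketch below builds a tiny inverted index over stemmed tokens so that a query matches documents containing different forms of the same word. The documents, the regex tokenizer, and the `normalize` and `search` helpers are hypothetical and kept deliberately small:

```python
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(text):
    """Lowercase, split on letter runs, and stem each token."""
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())]

# Hypothetical mini-corpus, for illustration only
docs = {
    1: "Machine learning predicts outcomes",
    2: "Deep networks are structured like the brain",
    3: "The machines learned from historical data",
}

# Inverted index: stem -> ids of the documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for stem in normalize(text):
        index[stem].add(doc_id)

def search(query):
    """Return ids of documents that contain every stemmed query term."""
    postings = [index.get(stem, set()) for stem in normalize(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(search("learning machines"))
```

Because queries and documents pass through the same `normalize` function, "learning machines" matches both "Machine learning" and "machines learned".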

## 🔟 Summary

- **Tokenization** breaks text into usable units.  
- **Stopword removal + cleaning** reduces noise.  
- **Stemming** is fast but crude; **Lemmatization** is slower but more accurate.  
- Together, these steps build a strong foundation for downstream NLP tasks.
