<span style="font-size:16px; font-weight:bold">Welcome to Natural Language Processing (NLP) in Python</span><br/>

Presented by: Reza Saadatyar (2024-2025)<br/>
E-mail: Reza.Saadatyar@outlook.com<br/>

<span style="font-size: 16px;font-weight:bold"> Stemming & Lemmatization:</span><br/>
Stemming and Lemmatization are two fundamental techniques in NLP used to reduce words to their root or base forms.<br/>

**Stemming:**<br/>
▪ Stemming is the process of removing suffixes (and sometimes prefixes) from words to obtain their stem or root form.<br/>
▪ The resulting stem may not always be a valid word in the language, but it helps group together words with similar meanings (e.g., "playing", "played", "plays" → "play").<br/>
▪ Stemming algorithms are typically rule-based and fast, but can be less accurate.<br/>
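
To make the rule-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy, not the Porter or Lancaster algorithm: the `SUFFIXES` list, the `toy_stem` function, and the `min_stem_len` threshold are all assumptions made up for illustration.

```python
# Toy suffix-stripping stemmer: an illustrative sketch only,
# NOT the Porter or Lancaster algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in order


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


words = ["playing", "played", "plays", "running", "better"]
toy_stems = {w: toy_stem(w) for w in words}
print(toy_stems)
# {'playing': 'play', 'played': 'play', 'plays': 'play', 'running': 'runn', 'better': 'better'}
```

The three forms of "play" collapse to one stem, while "running" becomes "runn", which is not a valid word. Real stemmers add rules (for example, undoubling consonants) to handle such cases.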

**Lemmatization:**<br/>
▪ Lemmatization reduces words to their base or dictionary form, known as the lemma.<br/>
▪ Unlike stemming, lemmatization considers the context and part of speech of a word, ensuring that the root form is a valid word (e.g., "better" → "good", "running" → "run").<br/>
▪ Lemmatization is generally more accurate but may require more computational resources and linguistic knowledge.<br/>
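
The role of part of speech can be sketched with a small lookup table. This is only a toy model of a lemmatizer: the `LEMMA_TABLE` entries, the POS labels, and the `toy_lemmatize` function are hypothetical and stand in for a real dictionary such as WordNet.

```python
# Toy lemma table keyed by (word, part of speech).
# Entries are illustrative; a real lemmatizer consults a full dictionary.
LEMMA_TABLE = {
    ("better", "adj"): "good",
    ("running", "verb"): "run",
    ("running", "noun"): "running",
    ("feet", "noun"): "foot",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("running", "verb"))  # run
print(toy_lemmatize("running", "noun"))  # running
print(toy_lemmatize("better", "adj"))    # good
```

The same surface form maps to different lemmas depending on its part of speech, which is why a lemmatizer needs that information to be accurate.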

**Workflow for Stemming and Lemmatization:**<br/>
▪ `Lowercasing:` Convert all text to lowercase for consistency.<br/>
▪ `Tokenization:` Split text into individual words (tokens).<br/>
▪ `Stemming/Lemmatization:` Apply stemming or lemmatization to each token to obtain root forms.<br/>
▪ `Reconstruction (Optional):` Reconstruct the processed tokens back into text for further analysis.<br/>
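
The four steps above can be sketched as one small function. This is a simplified illustration under stated assumptions: tokenization uses a basic regular expression rather than a library tokenizer, and `preprocess` and `strip_plural_s` are hypothetical names, with `strip_plural_s` standing in for a real stemmer or lemmatizer.

```python
import re
from typing import Callable, List, Union


def preprocess(
    text: str, normalize: Callable[[str], str], rejoin: bool = False
) -> Union[List[str], str]:
    """Lowercase, tokenize, normalize each token, and optionally rejoin."""
    lowered = text.lower()                                   # 1. lowercasing
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", lowered)   # 2. tokenization (simplified)
    normalized = [normalize(tok) for tok in tokens]          # 3. stemming/lemmatization
    return " ".join(normalized) if rejoin else normalized    # 4. optional reconstruction


def strip_plural_s(token: str) -> str:
    """Hypothetical stand-in normalizer: drop a trailing 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


print(preprocess("Cats are running", strip_plural_s))
# ['cat', 'are', 'running']
print(preprocess("Dogs played outside", strip_plural_s, rejoin=True))
# dog played outside
```

Passing the normalizer as a parameter keeps the pipeline unchanged when a real stemmer or lemmatizer method is swapped in.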

These techniques are commonly used in text preprocessing to normalize words, improve search results, and enhance the performance of NLP models.<br/>

**Difference between Stemming and Lemmatization (with Lancaster Stemmer):**<br/>

The main difference between stemming and lemmatization is that stemming crudely removes word suffixes to arrive at a root form, which may not be a valid word, while lemmatization reduces words to their dictionary form (lemma), considering context and part of speech.<br/>

**Stemming (Lancaster):**<br/>
▪ The Lancaster stemmer is more aggressive than the Porter stemmer, often producing shorter stems.<br/>
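▪ Example: Porter and Lancaster both reduce "playing", "played", "plays" to "play", but Lancaster shortens some words further; in the table below it turns "better" into "bet", while Porter leaves "better" unchanged.<br/>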

**Lemmatization:**<br/>
▪ Lemmatization returns a dictionary form (lemma), and the result depends on the part of speech (POS) assigned to the word.<br/>
▪ Example: "better" → "good" (tagged as an adjective), "running" → "run" (tagged as a verb).<br/>

**Comparison Table:**<br/>
| Word      | Lancaster Stem | Porter Stem | Lemma (no POS tag) |
|-----------|:--------------|:------------|:-----------|
| playing   | play           | play        | playing    |
| played    | play           | play        | played     |
| plays     | play           | play        | play       |
| better    | bet            | better      | better     |
| running   | run            | run         | running    |

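The Lancaster stemmer can be too aggressive for some applications, while lemmatization is more accurate but slower. The Lemma column matches what NLTK's `WordNetLemmatizer` returns when no part-of-speech tag is supplied: every word is then treated as a noun, which is why "playing", "played", "better", and "running" are unchanged.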
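
The two stemmer columns can be regenerated with a short sketch using NLTK's `PorterStemmer` and `LancasterStemmer`, neither of which needs extra data downloads. The variable names `porter`, `lancaster`, and `rows` are illustrative, and exact Lancaster stems may vary between NLTK versions, so none are asserted here.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["playing", "played", "plays", "better", "running"]

# One (word, Porter stem, Lancaster stem) tuple per input word.
rows = [(w, porter.stem(w), lancaster.stem(w)) for w in words]

print(f"{'word':<10}{'porter':<10}{'lancaster'}")
for word, porter_stem, lancaster_stem in rows:
    print(f"{word:<10}{porter_stem:<10}{lancaster_stem}")
```

Printing both stems side by side makes it easy to spot the words where the two algorithms disagree.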


<span style="font-size:16.5px; color:rgb(245, 5, 5); font-weight:bold;">Importing libraries</span>

In [6]:
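import nltk
from nltk import word_tokenize, wordpunct_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

import pandas as pd

# One-time data downloads used by word_tokenize and WordNetLemmatizer
nltk.download("punkt", quiet=True)  # newer NLTK releases may ask for "punkt_tab" instead
nltk.download("wordnet", quiet=True)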

In [None]:
documents = [
    "Cats are running",
    "Dogs played outside",
]

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()  # Create a stemmer object using the Porter algorithm
lemmatizer = WordNetLemmatizer() # Create a lemmatizer object using WordNet

# Tokenize, stem, and lemmatize each document
results = []
for doc in documents:
    tokens = word_tokenize(doc.lower())  # Lowercase, then split into tokens
    stems = [stemmer.stem(token) for token in tokens]  # Porter stem of each token
    # lemmatize() treats every token as a noun unless pos is given,
    # so verb forms such as "running" and "played" come back unchanged
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    results.append({
        "original": doc,
        "tokens": tokens,
        "stems": stems,
        "lemmas": lemmas,
    })

# Display results in a DataFrame
df = pd.DataFrame(results)
df.head()

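|   | original            | tokens                  | stems               | lemmas                 |
|---|:--------------------|:------------------------|:--------------------|:-----------------------|
| 0 | Cats are running    | [cats, are, running]    | [cat, are, run]     | [cat, are, running]    |
| 1 | Dogs played outside | [dogs, played, outside] | [dog, play, outsid] | [dog, played, outside] |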


In [34]:
# Example: Compare stemming and lemmatization for a few words
sample_words = ["playing", "played", "plays", "better", "running", "feet"]
comparison = []
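for word in sample_words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word)  # No pos argument, so each word is looked up as a noun
    comparison.append({"word": word, "stem": stem, "lemma": lemma})

print("\nStemming vs Lemmatization Comparison:")
df = pd.DataFrame(comparison)
df.head()  # head() returns only the first five rows, so "feet" is not shown; display df to see all six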


Stemming vs Lemmatization Comparison:


|   | word    | stem   | lemma   |
|---|:--------|:-------|:--------|
| 0 | playing | play   | playing |
| 1 | played  | play   | played  |
| 2 | plays   | play   | play    |
| 3 | better  | better | better  |
| 4 | running | run    | running |
