# 📌 NLP Text Preprocessing (Beginner-Friendly Interactive Guide)

## 🔹 Objective


### Text preprocessing is an essential step in NLP. Before we apply deep learning algorithms, we must clean and format the text so that it can be processed effectively. This guide accompanies the practical sessions on text preprocessing in this course and covers:

* Lowercasing
* Removing Punctuation
* Removing Stopwords
* Stemming / Lemmatization

### Before we can apply NLP techniques like sentiment analysis or topic modeling, we must clean and preprocess text data to:
* ✔️ Ensure uniformity → Convert text to lowercase to avoid treating "Great" and "great" differently.
* ✔️ Remove unnecessary characters → Punctuation, special symbols, and emojis.
* ✔️ Eliminate stopwords → Words that do not add significant meaning (e.g., "the", "is", "and").
* ✔️ Apply stemming → Reduce words to their base/root form (e.g., "running" → "run").





## 1️⃣ Import Required Libraries


In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
print("Success")

Success


In [2]:
# Download required NLTK datasets
nltk.download('stopwords')  # Stopwords list
print("Stopword Downloaded")
nltk.download('punkt')  # Tokenization
print("Punkt Downloaded")
nltk.download('wordnet')  # WordNet for Lemmatization
print("WordNet Downloaded")

[nltk_data] Error loading stopwords: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


Stopword Downloaded


[nltk_data] Error loading punkt: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


Punkt Downloaded
WordNet Downloaded


[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>
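The output above shows every success message printed even though each download failed, because the `print` calls run unconditionally. `nltk.download` returns `True` or `False`, so the result can be checked instead. The following is a minimal sketch, assuming a hypothetical helper named `download_all` (not part of NLTK) that accepts any download function, so it can be tried without a network connection:

```python
def download_all(resources, download):
    """Call download(name) for each resource and return the names that failed.

    `download` is any callable that returns True on success, such as
    nltk.download. The helper name is hypothetical, not part of NLTK.
    """
    failed = []
    for name in resources:
        ok = download(name)
        print(f"{name}: {'downloaded' if ok else 'FAILED'}")
        if not ok:
            failed.append(name)
    return failed


# Offline stand-in that simulates a connection failure for 'wordnet'.
def fake_download(name):
    return name != "wordnet"


failed = download_all(["stopwords", "punkt", "wordnet"], fake_download)
print("Failed:", failed)

# With NLTK installed, the real call would be:
# failed = download_all(["stopwords", "punkt", "wordnet"], nltk.download)
```

If the returned list is not empty, later cells that rely on those resources (for example the lemmatizer) are likely to raise a `LookupError`.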


### 📌 Why These Libraries?

* pandas → Stores structured text data in a table format.
* re (regex) → Removes punctuation and special characters.
* nltk (Natural Language Toolkit) → Handles stopwords, stemming, and lemmatization.

## 2️⃣ Creating a Large, Diverse Dataset
### We will create a dataset of 49 customer reviews with:
* ✔️ Mixed sentiments (positive, negative, neutral)
* ✔️ Punctuation and special characters
* ✔️ Emojis for expression

In [3]:
# Sample dataset with 49 diverse reviews
reviews = [
    "I absolutely LOVE this product!! ❤️ It's super efficient and really worth the money. Definitely recommend! 👍",
    "Worst purchase ever... 😡 Waste of money. DO NOT BUY!! Full of issues.",
    "This product does what it says, but nothing special. 🤷‍♂️ It's okay for the price, I guess.",
    "AMAZING quality and fast shipping!!! 🚀🔥 #satisfied #fastdelivery",
    "Terrible! Had high expectations, but it broke in a week. Really disappointed. 😞",
    "This phone is great 📱, but the battery drains too fast. 🔋😕",
    "I love how easy it is to use! 🥰 Definitely a game-changer.",
    "Do not buy this laptop! 👎 It crashes every 10 minutes. So frustrating! 😡",
    "The camera quality is excellent! 📸 Love the night mode. 🌙✨",
    "Meh... the product is just average. 😐 I expected more for this price.",
    "Great customer service! 🙌 They replaced my faulty item within 24 hours.",
    "Horrible experience!! 💔 Received a broken item and no refund.",
    "I use this every day now. Super helpful! ✅",
    "The features are nice, but the software is laggy. 🤦‍♀️ Annoying!",
    "Best investment I've made this year! 🔥🔥",
    "Delivery took 2 months 😤 but the product is okay.",
    "Wouldn't recommend to anyone. Waste of time. 🙅‍♂️",
    "Super happy with my purchase!! 🎉 Everything works perfectly.",
    "This product changed my life. ✨ Absolutely incredible!",
    "Not bad, but also not great. Just okay. 😶",
    "Expected more for the price I paid. 😕",
    "Highly recommended! 👏 Fast shipping and great quality.",
    "This is my third purchase from this store and I'm never disappointed! ❤️",
    "Overpriced and underwhelming. 🫤 Could be better.",
    "Works well, but setup was a nightmare. 🛠️ Took me 2 hours!",
    "Love the design but the materials feel cheap. 🧐",
    "Absolutely worth it!! 💎 Super happy with this purchase.",
    "Stopped working after a month. 😔 Very disappointed.",
    "10/10! Would buy again. ⭐⭐⭐⭐⭐",
    "Disgusting smell 🤢, returned immediately.",
    "Best headphones I’ve ever used! 🎧 Sound quality is top-notch.",
    "Regret buying this. Not as advertised. 😠",
    "It does the job. Nothing exceptional. 🤷",
    "The size is perfect, but the fabric feels cheap. 🏷️",
    "Feels premium! 🔝 Great value for money.",
    "Returned because it didn't fit. 🚚 Hassle-free process.",
    "The manual is useless. Had to figure it out myself. 📖",
    "Why is this so expensive?? 💰 Not worth the price.",
    "Awesome customer support! 🎈 They solved my issue instantly.",
    "Very fragile. Broke after one drop. 🫣",
    "Exactly what I needed! 🎯 Highly recommended.",
    "Didn't expect much, but it exceeded my expectations! 🎊",
    "Scratches easily, but functions well. 🏁",
    "This brand never disappoints! 🏆 Will keep buying.",
    "Fake reviews everywhere. Product is terrible. 😒",
    "No words… just amazing. 🔥🔥🔥",
    "This was a gift and the recipient loved it! 🎁",
    "Cheap plastic, feels like a toy. 😡 Not recommended.",
    "Perfect for my needs. 👍 Would purchase again."
]

In [4]:
# Convert to DataFrame
df_reviews = pd.DataFrame(reviews, columns=["Review"])

# Display the dataset
df_reviews.head(10)  # Show the first 10 rows for preview

Unnamed: 0,Review
0,I absolutely LOVE this product!! ❤️ It's super...
1,Worst purchase ever... 😡 Waste of money. DO NO...
2,"This product does what it says, but nothing sp..."
3,AMAZING quality and fast shipping!!! 🚀🔥 #satis...
4,"Terrible! Had high expectations, but it broke ..."
5,"This phone is great 📱, but the battery drains ..."
6,I love how easy it is to use! 🥰 Definitely a g...
7,Do not buy this laptop! 👎 It crashes every 10 ...
8,The camera quality is excellent! 📸 Love the ni...
9,Meh... the product is just average. 😐 I expect...


## 3️⃣ Lowercasing
### 📌 Why? 
### NLP models treat "Love" and "love" as different tokens, so we standardize the text by converting it all to lowercase.

In [5]:
# Convert text to lowercase
df_reviews["Review_Lowercase"] = df_reviews["Review"].str.lower()

# Display results
df_reviews.head(10)  # Show the first 10 rows


Unnamed: 0,Review,Review_Lowercase
0,I absolutely LOVE this product!! ❤️ It's super...,i absolutely love this product!! ❤️ it's super...
1,Worst purchase ever... 😡 Waste of money. DO NO...,worst purchase ever... 😡 waste of money. do no...
2,"This product does what it says, but nothing sp...","this product does what it says, but nothing sp..."
3,AMAZING quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping!!! 🚀🔥 #satis...
4,"Terrible! Had high expectations, but it broke ...","terrible! had high expectations, but it broke ..."
5,"This phone is great 📱, but the battery drains ...","this phone is great 📱, but the battery drains ..."
6,I love how easy it is to use! 🥰 Definitely a g...,i love how easy it is to use! 🥰 definitely a g...
7,Do not buy this laptop! 👎 It crashes every 10 ...,do not buy this laptop! 👎 it crashes every 10 ...
8,The camera quality is excellent! 📸 Love the ni...,the camera quality is excellent! 📸 love the ni...
9,Meh... the product is just average. 😐 I expect...,meh... the product is just average. 😐 i expect...


### ✅ Effect:

#### "LOVE" → "love"
#### "BEST" → "best"
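To see why case matters, it helps to count tokens before and after lowercasing. This is a small sketch on a made-up sentence (the sample text is an assumption, not taken from the dataset):

```python
from collections import Counter

tokens = "Love it love it LOVE it".split()

# Case-sensitive counting keeps separate entries for each spelling.
raw_counts = Counter(tokens)

# Lowercasing first merges them into a single entry.
lower_counts = Counter(token.lower() for token in tokens)

print(raw_counts)
print(lower_counts)
```

The raw counter holds four distinct keys, while the lowercased counter holds two, so the vocabulary a model has to learn becomes smaller.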

## 4️⃣ Removing Punctuation Marks
### 📌 Why?
### For many classic NLP models (e.g., bag-of-words), punctuation carries little meaning.
### Removing it simplifies the text and shrinks the vocabulary. Note that the pattern used here also strips emojis and symbols such as "#".

In [6]:
# Function to remove punctuation from text
def remove_punctuation(text):
    """
    This function removes all punctuation marks from the given text.
    
    - The function uses `re.sub(r'[^\w\s]', '', text)`, which means:
      - `[^\w\s]` → Matches any character that is NOT a word character (`\w`: letters, digits, underscore) or whitespace (`\s`).
      - `''` → Replaces all matched punctuation with an empty string.
    
    Example:
    ----------
    Input  : "Hello, World!!!"
    Output : "Hello World"
    """
    return re.sub(r'[^\w\s]', '', text)

# Apply the function to each review in the dataset
df_reviews["Review_NoPunct"] = df_reviews["Review_Lowercase"].apply(remove_punctuation)

# Display the first 10 rows to observe the changes
df_reviews.head(10)


Unnamed: 0,Review,Review_Lowercase,Review_NoPunct
0,I absolutely LOVE this product!! ❤️ It's super...,i absolutely love this product!! ❤️ it's super...,i absolutely love this product its super effi...
1,Worst purchase ever... 😡 Waste of money. DO NO...,worst purchase ever... 😡 waste of money. do no...,worst purchase ever waste of money do not buy...
2,"This product does what it says, but nothing sp...","this product does what it says, but nothing sp...",this product does what it says but nothing spe...
3,AMAZING quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping satisfied f...
4,"Terrible! Had high expectations, but it broke ...","terrible! had high expectations, but it broke ...",terrible had high expectations but it broke in...
5,"This phone is great 📱, but the battery drains ...","this phone is great 📱, but the battery drains ...",this phone is great but the battery drains to...
6,I love how easy it is to use! 🥰 Definitely a g...,i love how easy it is to use! 🥰 definitely a g...,i love how easy it is to use definitely a gam...
7,Do not buy this laptop! 👎 It crashes every 10 ...,do not buy this laptop! 👎 it crashes every 10 ...,do not buy this laptop it crashes every 10 mi...
8,The camera quality is excellent! 📸 Love the ni...,the camera quality is excellent! 📸 love the ni...,the camera quality is excellent love the nigh...
9,Meh... the product is just average. 😐 I expect...,meh... the product is just average. 😐 i expect...,meh the product is just average i expected mo...


### ✅ Effect:

#### "Amazing product!!!" → "amazing product"
#### "Good, but pricey." → "good but pricey"
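The pattern `[^\w\s]` has a few side effects worth checking: it deletes emojis, it keeps underscores (they count as word characters), and it can leave double spaces where a symbol stood between two spaces. The sketch below repeats the notebook's function and adds a hypothetical `collapse_spaces` helper as one possible clean-up step:

```python
import re


def remove_punctuation(text):
    # Same pattern as the notebook cell above.
    return re.sub(r'[^\w\s]', '', text)


def collapse_spaces(text):
    # Hypothetical extra step: squeeze runs of whitespace into one space.
    return ' '.join(text.split())


sample = "amazing quality!!! 🚀🔥 #fast_delivery it's here"
stripped = remove_punctuation(sample)
cleaned = collapse_spaces(stripped)

print(repr(stripped))
print(repr(cleaned))
```

The later stopword step splits on whitespace and re-joins the words, so the double spaces disappear there anyway; collapsing them earlier only matters if the intermediate column is used directly.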


## 5️⃣ Removing Stopwords
### 📌 Why?
### Stopwords (e.g., "is", "the", "of") appear frequently but add little meaning.

In [7]:
# Load stopwords from the NLTK library
stop_words = set(stopwords.words('english'))  

"""
Explanation:
------------
- `stopwords.words('english')` loads a predefined list of common English stopwords.
- `set(stopwords.words('english'))` converts the list into a set for **faster lookup**.
- Example stopwords: {"the", "is", "in", "at", "which", "and", "but", "or", "a", "an"}

Why use a set?
--------------
- Checking if a word is in a **set** is faster (O(1) time complexity) compared to a **list** (O(n)).
- This makes the stopword removal process much **more efficient** for large datasets.
"""

# Function to remove stopwords from a given text
def remove_stopwords(text):
    """
    This function removes all stopwords from the given text.
    
    - It takes a sentence, splits it into individual words, 
      and filters out any word that is present in the `stop_words` set.
    - It then joins the remaining words back into a single cleaned string.

    Example:
    ----------
    Input  : "this is a great product with amazing quality"
    Output : "great product amazing quality"
    """
    
    # Split the sentence into words and keep only words not in stop_words
    cleaned_text = ' '.join(word for word in text.split() if word not in stop_words)
    
    return cleaned_text  # Return the processed text without stopwords

# Apply the remove_stopwords function to each review in the dataset
df_reviews["Review_NoStopwords"] = df_reviews["Review_NoPunct"].apply(remove_stopwords)

# Display the first 10 rows to observe the changes
df_reviews.head(10)


Unnamed: 0,Review,Review_Lowercase,Review_NoPunct,Review_NoStopwords
0,I absolutely LOVE this product!! ❤️ It's super...,i absolutely love this product!! ❤️ it's super...,i absolutely love this product its super effi...,absolutely love product super efficient really...
1,Worst purchase ever... 😡 Waste of money. DO NO...,worst purchase ever... 😡 waste of money. do no...,worst purchase ever waste of money do not buy...,worst purchase ever waste money buy full issues
2,"This product does what it says, but nothing sp...","this product does what it says, but nothing sp...",this product does what it says but nothing spe...,product says nothing special okay price guess
3,AMAZING quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping satisfied f...,amazing quality fast shipping satisfied fastde...
4,"Terrible! Had high expectations, but it broke ...","terrible! had high expectations, but it broke ...",terrible had high expectations but it broke in...,terrible high expectations broke week really d...
5,"This phone is great 📱, but the battery drains ...","this phone is great 📱, but the battery drains ...",this phone is great but the battery drains to...,phone great battery drains fast
6,I love how easy it is to use! 🥰 Definitely a g...,i love how easy it is to use! 🥰 definitely a g...,i love how easy it is to use definitely a gam...,love easy use definitely gamechanger
7,Do not buy this laptop! 👎 It crashes every 10 ...,do not buy this laptop! 👎 it crashes every 10 ...,do not buy this laptop it crashes every 10 mi...,buy laptop crashes every 10 minutes frustrating
8,The camera quality is excellent! 📸 Love the ni...,the camera quality is excellent! 📸 love the ni...,the camera quality is excellent love the nigh...,camera quality excellent love night mode
9,Meh... the product is just average. 😐 I expect...,meh... the product is just average. 😐 i expect...,meh the product is just average i expected mo...,meh product average expected price


### ✅ Effect:

#### "this product is very good" → "product good"
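One caveat for sentiment tasks: standard stopword lists include negations such as "not", which is why "do not buy this laptop" becomes "buy laptop" in the table above. A possible workaround, sketched here with a small illustrative stopword set (an assumption; NLTK's English list is much longer), is to subtract the negation words from the set before filtering:

```python
# Small illustrative stopword set (an assumption, not NLTK's full list).
toy_stop_words = {"do", "not", "this", "is", "the", "a", "it"}

# Keep negation words by removing them from the stopword set.
negations = {"not", "no", "nor"}
stop_words_keep_neg = toy_stop_words - negations


def remove_stopwords(text, stop_words):
    # Same logic as the notebook function, with the set passed in explicitly.
    return ' '.join(word for word in text.split() if word not in stop_words)


text = "do not buy this laptop"
without_negation = remove_stopwords(text, toy_stop_words)
with_negation = remove_stopwords(text, stop_words_keep_neg)

print(without_negation)
print(with_negation)
```

Whether to keep negations depends on the task; for topic modeling they rarely matter, while for sentiment analysis they can flip the meaning of a review.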

## 6️⃣ Stemming & Lemmatization (Reducing Words to Their Base Form)
### 📌 Why is this step important?
### Both stemming and lemmatization reduce words to a base form, but they work differently: stemming strips suffixes with rules, while lemmatization looks words up in a dictionary.
### This matters for NLP models because different word forms (e.g., "running", "ran", "runs") should usually be treated as the same word.
### Example:
### Without processing: "He is running and she ran fast"
### With stemming (Porter): "he is run and she ran fast"
### With lemmatization (words tagged as verbs): "He be run and she run fast"
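Lemmatization results depend on the part of speech. NLTK's `WordNetLemmatizer.lemmatize(word, pos='n')` treats every word as a noun unless told otherwise, so "better" only maps to "good" when it is tagged as an adjective. The toy lookup below imitates that behaviour without needing the WordNet download; the table and the `toy_lemmatize` function are hypothetical and exist only for illustration:

```python
# Hypothetical lookup table keyed by (word, part of speech).
# "n" = noun, "v" = verb, "a" = adjective, mirroring WordNet's tag letters.
TOY_LEMMAS = {
    ("wolves", "n"): "wolf",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("is", "v"): "be",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("wolves"))             # noun lookup succeeds
print(toy_lemmatize("better"))             # no noun entry, so unchanged
print(toy_lemmatize("better", pos="a"))    # adjective entry maps to "good"
print(toy_lemmatize("running", pos="v"))   # verb entry maps to "run"
```

A real pipeline would usually obtain the tags from a part-of-speech tagger and pass them to the lemmatizer word by word.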

In [11]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk

# Download WordNet dataset for lemmatization
nltk.download('wordnet')

# Initialize Stemmer and Lemmatizer
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Test words
words = ["running", "flies", "happiness", "better", "wolves", "studies"]

# Apply Stemming
stemmed_words = [ps.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

# Apply Lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)


Stemmed Words: ['run', 'fli', 'happi', 'better', 'wolv', 'studi']


[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\DELL/nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python311\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python311\\share\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [8]:
# Initialize the Porter Stemmer (for stemming) and WordNet Lemmatizer (for lemmatization)
ps = PorterStemmer()  # Stemming
lemmatizer = WordNetLemmatizer()  # Lemmatization

"""
What is Stemming?
-----------------
- Stemming chops words down to their root form by **removing suffixes**.
- It applies **heuristic rules** rather than a linguistic approach.
- It does NOT always produce a real English word.

Example:
- "running" → "run"
- "happiness" → "happi"
- "better" → "better"  (incorrect, should be "good")

What is Lemmatization?
----------------------
- Lemmatization is **more accurate** than stemming.
- It uses a **dictionary-based approach** to return the correct base form of a word.
- It ensures words are **real** words in the English language.

Example:
- "running" → "run"
- "happiness" → "happiness"  (unchanged because it’s already a base form)
- "better" → "good"  (correctly mapped)
"""

# Function to apply stemming
def apply_stemming(text):
    """
    This function applies stemming to all words in a given text.
    
    - It splits the sentence into individual words.
    - Each word is reduced to its base form using the Porter Stemmer.
    - The words are then joined back into a processed string.

    Example:
    ----------
    Input  : "running quickly towards happiness"
    Output : "run quickli toward happi"
    """
    return ' '.join(ps.stem(word) for word in text.split())

In [9]:
# Apply stemming to the reviews
df_reviews["Review_Stemmed"] = df_reviews["Review_NoStopwords"].apply(apply_stemming)

# Display the first 10 rows to observe changes
df_reviews.head(10)


Unnamed: 0,Review,Review_Lowercase,Review_NoPunct,Review_NoStopwords,Review_Stemmed
0,I absolutely LOVE this product!! ❤️ It's super...,i absolutely love this product!! ❤️ it's super...,i absolutely love this product its super effi...,absolutely love product super efficient really...,absolut love product super effici realli worth...
1,Worst purchase ever... 😡 Waste of money. DO NO...,worst purchase ever... 😡 waste of money. do no...,worst purchase ever waste of money do not buy...,worst purchase ever waste money buy full issues,worst purchas ever wast money buy full issu
2,"This product does what it says, but nothing sp...","this product does what it says, but nothing sp...",this product does what it says but nothing spe...,product says nothing special okay price guess,product say noth special okay price guess
3,AMAZING quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping!!! 🚀🔥 #satis...,amazing quality and fast shipping satisfied f...,amazing quality fast shipping satisfied fastde...,amaz qualiti fast ship satisfi fastdeliveri
4,"Terrible! Had high expectations, but it broke ...","terrible! had high expectations, but it broke ...",terrible had high expectations but it broke in...,terrible high expectations broke week really d...,terribl high expect broke week realli disappoint
5,"This phone is great 📱, but the battery drains ...","this phone is great 📱, but the battery drains ...",this phone is great but the battery drains to...,phone great battery drains fast,phone great batteri drain fast
6,I love how easy it is to use! 🥰 Definitely a g...,i love how easy it is to use! 🥰 definitely a g...,i love how easy it is to use definitely a gam...,love easy use definitely gamechanger,love easi use definit gamechang
7,Do not buy this laptop! 👎 It crashes every 10 ...,do not buy this laptop! 👎 it crashes every 10 ...,do not buy this laptop it crashes every 10 mi...,buy laptop crashes every 10 minutes frustrating,buy laptop crash everi 10 minut frustrat
8,The camera quality is excellent! 📸 Love the ni...,the camera quality is excellent! 📸 love the ni...,the camera quality is excellent love the nigh...,camera quality excellent love night mode,camera qualiti excel love night mode
9,Meh... the product is just average. 😐 I expect...,meh... the product is just average. 😐 i expect...,meh the product is just average i expected mo...,meh product average expected price,meh product averag expect price


In [12]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')  # Download additional WordNet data


[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>
[nltk_data] Error loading omw-1.4: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


False

In [None]:
import nltk
nltk.download()


In [None]:
import nltk
nltk.data.path.append("C:/nltk_data")  # Change this path if needed
nltk.download('wordnet', download_dir="C:/nltk_data")
nltk.download('omw-1.4', download_dir="C:/nltk_data")


In [10]:
import nltk
from nltk.stem import WordNetLemmatizer

# Ensure that WordNet is downloaded
nltk.download('wordnet')

# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization
def apply_lemmatization(text):
    """
    This function applies lemmatization to all words in a given text.
    
    - It splits the sentence into individual words.
    - Each word is converted to its base form using the WordNet Lemmatizer.
    - The words are then joined back into a processed string.

    Example:
    ----------
    Input  : "wolves better going"
    Output : "wolf better going"  (default noun tag: only "wolves" changes)
    """
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

# Apply lemmatization to the reviews
df_reviews["Review_Lemmatized"] = df_reviews["Review_NoStopwords"].apply(apply_lemmatization)

# Display the first 10 rows to observe changes
df_reviews.head(10)


[nltk_data] Error loading wordnet: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time, or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>


LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - 'C:\\Users\\DELL/nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python311\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python311\\share\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Local\\Programs\\Python\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [None]:
# Function to apply lemmatization
def apply_lemmatization(text):
    """
    This function applies lemmatization to all words in a given text.
    
    - It splits the sentence into individual words.
    - Each word is converted to its base form using the WordNet Lemmatizer.
    - The words are then joined back into a processed string.

    Example:
    ----------
    Input  : "running quickly towards happiness"
    Output : "running quickly towards happiness"  (unchanged)
    
    Input  : "wolves better going"
    Output : "wolf better going"  (default noun tag: only "wolves" changes)
    """
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())


In [None]:
# Apply lemmatization to the reviews
df_reviews["Review_Lemmatized"] = df_reviews["Review_NoStopwords"].apply(apply_lemmatization)
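
The individual steps in this guide can also be chained into a single function. The sketch below is one possible arrangement, assuming a hypothetical `preprocess` helper and a small illustrative stopword set; the optional `normalize` argument stands in for a stemmer or lemmatizer function such as `ps.stem`:

```python
import re


def preprocess(text, stop_words, normalize=None):
    """Lowercase, strip punctuation, drop stopwords, then optionally
    apply `normalize` (e.g. a stemming function) to each remaining word."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    words = [word for word in text.split() if word not in stop_words]
    if normalize is not None:
        words = [normalize(word) for word in words]
    return ' '.join(words)


# Small illustrative stopword set (an assumption, not NLTK's full list).
toy_stop_words = {"this", "is", "the", "but", "too"}

review = "This phone is great 📱, but the battery drains too fast. 🔋😕"
result = preprocess(review, toy_stop_words)
print(result)
```

Passing the stopword set and the normalizer as arguments keeps the function easy to test and lets the same code run with either stemming or lemmatization.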