#✅ Practical: Text Preprocessing Using NLTK and spaCy


#🎯 Objective:

Clean and preprocess raw customer reviews using:

- NLTK for tokenization, stopword removal, and stemming
- spaCy for advanced lemmatization and named entity recognition (NER)


#🛍️ Scenario: Synthetic Review Dataset (e.g., product reviews)

#🧩 Step 1: Install Required Libraries

In [4]:
# Run this if not already installed
# Installs the Natural Language Toolkit (NLTK) and spaCy libraries.
# NLTK is used for various text processing tasks, and spaCy is an industrial-strength
# natural language processing (NLP) library.
!pip install nltk spacy

# Download required NLTK data
import nltk
# Downloads the Punkt sentence-tokenizer models that word_tokenize() relies on.
# Newer NLTK releases load 'punkt_tab' instead, which Step 3 downloads.
nltk.download('punkt')
# Downloads the 'stopwords' corpus from NLTK. Stopwords are common words (e.g., "the", "is", "and")
# that are often removed from text before processing to focus on more meaningful terms.
nltk.download('stopwords')

# Download spaCy model
# Downloads the small English language model for spaCy ('en_core_web_sm').
# This model includes pre-trained pipelines for tasks like tokenization, part-of-speech tagging,
# named entity recognition, and more, providing a good balance of speed and accuracy for many NLP tasks.
!python -m spacy download en_core_web_sm



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
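Before re-running the install cell, it can help to see which pieces are already present. The following is a minimal sketch, not part of the original notebook: `missing_packages` is a hypothetical helper that uses only the standard library to test whether each name is importable. It assumes the spaCy model is installed as a regular Python package named `en_core_web_sm`, which is what the pip output above suggests. It does not check NLTK's data files (`punkt`, `stopwords`); `nltk.download` already reports those as up to date.

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be located as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The spaCy model is checked like any other package (assumption: it was installed via pip).
required = ["nltk", "spacy", "en_core_web_sm"]
missing = missing_packages(required)

if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If the model shows up as missing right after a successful download, restarting the kernel, as the message above recommends, is the likely fix.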


#🧾 Step 2: Create Synthetic Review Data

In [5]:
# Sample review data
reviews = [
    "I absolutely loved the product! It works like a charm.",
    "Terrible customer service. I waited 30 minutes to talk to an agent.",
    "The packaging was okay, but the item arrived damaged.",
    "Excellent value for the price. Will buy again.",
    "Worst purchase ever. Completely useless and broke in a day!"
]


#🔤 Step 3: Preprocessing with NLTK

In [7]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Tokenizer data that word_tokenize() needs in newer NLTK releases.
nltk.download('punkt_tab')

# Initialize components
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_nltk(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 3. Tokenize
    tokens = word_tokenize(text)
    # 4. Remove stopwords and stem
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return tokens

# Apply preprocessing
print("🔹 Preprocessing with NLTK:\n")
for i, review in enumerate(reviews):
    print(f"Review {i+1}: {preprocess_nltk(review)}\n")


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


🔹 Preprocessing with NLTK:

Review 1: ['absolut', 'love', 'product', 'work', 'like', 'charm']

Review 2: ['terribl', 'custom', 'servic', 'wait', '30', 'minut', 'talk', 'agent']

Review 3: ['packag', 'okay', 'item', 'arriv', 'damag']

Review 4: ['excel', 'valu', 'price', 'buy']

Review 5: ['worst', 'purchas', 'ever', 'complet', 'useless', 'broke', 'day']
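The order of the steps in `preprocess_nltk` matters: punctuation is stripped before stopwords are filtered, so a contraction loses its apostrophe and no longer matches a stopword entry that contains one. The sample reviews contain no contractions, so the output above is unaffected. The sketch below illustrates the effect with the standard library only. `demo_stop_words` is a hypothetical mini list standing in for `stopwords.words('english')`, and `str.split()` stands in for `word_tokenize`.

```python
import string

# Translation table that deletes every ASCII punctuation character,
# equivalent to the table preprocess_nltk builds on each call.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation (steps 1 and 2)."""
    return text.lower().translate(PUNCT_TABLE)

# Hypothetical stopword list for illustration only; it is not NLTK's list.
demo_stop_words = {"i", "the", "it", "don't"}

cleaned = strip_punctuation("I don't like it!")
tokens = [word for word in cleaned.split() if word not in demo_stop_words]

print(cleaned)  # i dont like it
print(tokens)   # ['dont', 'like']
```

Here "don't" becomes "dont" and slips past the filter. If real review data contains many contractions, one option is to filter stopwords before stripping punctuation.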



#🧠 Step 4: Preprocessing with spaCy

In [10]:
import spacy

# Load English model
# Loads the small English language model pre-trained for spaCy.
# This model provides various linguistic features like tokenization, part-of-speech tagging,
# lemmatization, and named entity recognition.
nlp = spacy.load("en_core_web_sm")

def preprocess_spacy(text):
    # Process the input text with the loaded spaCy NLP model.
    # This creates a 'Doc' object containing linguistic annotations for the text.
    doc = nlp(text)
    # Keep the lowercased lemma of every token that is neither a stop word nor punctuation.
    # - `token.lemma_`: the base (dictionary) form of the word.
    # - `.lower()`: Python string method applied to the lemma text.
    # - `token.is_stop`: True for stop words (e.g., "the", "is").
    # - `token.is_punct`: True for punctuation tokens.
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    return tokens

print("🔹 Preprocessing with spaCy:\n")
# Preprocess each review from Step 2 and print its lemmatized tokens.
for i, review in enumerate(reviews):
    print(f"Review {i+1}: {preprocess_spacy(review)}\n")

🔹 Preprocessing with spaCy:

Review 1: ['absolutely', 'love', 'product', 'work', 'like', 'charm']

Review 2: ['terrible', 'customer', 'service', 'wait', '30', 'minute', 'talk', 'agent']

Review 3: ['packaging', 'okay', 'item', 'arrive', 'damage']

Review 4: ['excellent', 'value', 'price', 'buy']

Review 5: ['bad', 'purchase', 'completely', 'useless', 'break', 'day']
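To see where stemming and lemmatization diverge, the two outputs can be compared position by position. This is a small sketch that copies the token lists from the Step 3 and Step 4 outputs rather than re-running either library; `differing_pairs` is a hypothetical helper introduced here for illustration.

```python
# Token lists copied from the Step 3 (NLTK) and Step 4 (spaCy) outputs.
nltk_tokens = [
    ['absolut', 'love', 'product', 'work', 'like', 'charm'],
    ['terribl', 'custom', 'servic', 'wait', '30', 'minut', 'talk', 'agent'],
    ['packag', 'okay', 'item', 'arriv', 'damag'],
    ['excel', 'valu', 'price', 'buy'],
    ['worst', 'purchas', 'ever', 'complet', 'useless', 'broke', 'day'],
]
spacy_tokens = [
    ['absolutely', 'love', 'product', 'work', 'like', 'charm'],
    ['terrible', 'customer', 'service', 'wait', '30', 'minute', 'talk', 'agent'],
    ['packaging', 'okay', 'item', 'arrive', 'damage'],
    ['excellent', 'value', 'price', 'buy'],
    ['bad', 'purchase', 'completely', 'useless', 'break', 'day'],
]

def differing_pairs(stems, lemmas):
    """Pair tokens by position and keep the (stem, lemma) pairs that differ.

    Only meaningful when both lists have the same length.
    """
    return [(stem, lemma) for stem, lemma in zip(stems, lemmas) if stem != lemma]

for i, (stems, lemmas) in enumerate(zip(nltk_tokens, spacy_tokens), start=1):
    if len(stems) == len(lemmas):
        print(f"Review {i}: {differing_pairs(stems, lemmas)}")
    else:
        print(f"Review {i}: token counts differ ({len(stems)} vs {len(lemmas)})")
```

The stems are truncated forms such as 'excel' and 'valu', while the lemmas are dictionary words. Review 5 differs in length because the spaCy output has no 'ever', and it also shows 'worst' mapped to 'bad' and 'broke' to 'break'.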



#🎯 Step 5: Named Entity Recognition (NER) with spaCy

In [11]:
print("🔍 Named Entities:\n")
# Iterate through each review in the 'reviews' list along with its index.
for i, review in enumerate(reviews):
    # Process the current review text using the pre-loaded spaCy NLP model (nlp).
    # This creates a 'Doc' object, which is a container for accessing linguistic annotations.
    doc = nlp(review)
    print(f"Review {i+1} Entities:")
    # Iterate through each detected named entity in the 'doc' object.
    # Named entities are real-world objects such as persons, organizations, locations, etc.
    for ent in doc.ents:
        # Print the text of the named entity and its corresponding label (type of entity).
        # `ent.text` gives the exact text span of the entity.
        # `ent.label_` gives the string representation of the entity type (e.g., PERSON, GPE for geopolitical entity).
        print(f" - {ent.text} ({ent.label_})")
    print() # Print a blank line for better readability between reviews.

🔍 Named Entities:

Review 1 Entities:

Review 2 Entities:
 - 30 minutes (TIME)

Review 3 Entities:

Review 4 Entities:

Review 5 Entities:
 - a day (DATE)
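For downstream use, printed entities are often less convenient than a structured result. The sketch below groups entity texts by label. `entities_by_label` is a hypothetical helper, not part of spaCy; it relies only on the `.ents`, `.text`, and `.label_` attributes used in the cell above. So that the example runs without the model, a `SimpleNamespace` stand-in mirrors the Review 2 output; in the notebook, the real `nlp(review)` result would be passed instead.

```python
from collections import defaultdict
from types import SimpleNamespace

def entities_by_label(doc):
    """Group entity texts by label for any object exposing `.ents`,
    where each entity has `.text` and `.label_` (as spaCy entity spans do)."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-in for nlp(reviews[1]), built from the Review 2 output (illustration only).
fake_doc = SimpleNamespace(
    ents=[SimpleNamespace(text="30 minutes", label_="TIME")]
)

print(entities_by_label(fake_doc))  # {'TIME': ['30 minutes']}
```

A review with no detected entities, such as Review 1, yields an empty dictionary.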



#✅ Summary Comparison

| Feature                  | NLTK                                             | spaCy                          |
| ------------------------ | ------------------------------------------------ | ------------------------------ |
| Tokenization             | `word_tokenize()`                                | Built-in via `nlp()`           |
| Stopword Removal         | Manual with `stopwords.words('english')`         | `token.is_stop`                |
| Stemming                 | `PorterStemmer` (rule-based)                     | ❌ Not built in                 |
| Lemmatization            | Not used here (`WordNetLemmatizer` is available) | `token.lemma_` (context-aware) |
| Named Entity Recognition | Not used here (`ne_chunk` is available)          | ✅ `doc.ents`                   |
