# Lab 02: Basic NLP Preprocessing Techniques

**Course:** ITAI 2373 - Natural Language Processing  
**Module:** 02 - Text Preprocessing  
**Duration:** 2-3 hours  
**Student Name:** Chloe Tu

**Date:** 9/8/2025

---

## 🎯 Learning Objectives

By completing this lab, you will:
1. Understand the critical role of preprocessing in NLP pipelines
2. Master fundamental text preprocessing techniques
3. Compare different libraries and their approaches
4. Analyze the effects of preprocessing on text data
5. Build a complete preprocessing pipeline
6. Load and work with different types of text datasets

## 📖 Introduction to NLP Preprocessing

Natural Language Processing (NLP) preprocessing refers to the initial steps taken to clean and transform raw text data into a format that's more suitable for analysis by machine learning algorithms.

### Why is preprocessing crucial?

1. **Standardization:** Ensures consistent text format across your dataset
2. **Noise Reduction:** Removes irrelevant information that could confuse algorithms
3. **Complexity Reduction:** Simplifies text to focus on meaningful patterns
4. **Performance Enhancement:** Improves the efficiency and accuracy of downstream tasks

### Real-world Impact
Consider searching for "running shoes" vs "Running Shoes!". Without preprocessing, a system that compares strings exactly would treat these as different queries; lowercasing and stripping punctuation makes them identical.
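The query example can be sketched in a few lines of standard-library Python. The helper name `normalize_query` is hypothetical and the normalization rules (lowercase, strip ASCII punctuation, collapse whitespace) are an assumption chosen for illustration, not a recommendation for every task.

```python
import string

def normalize_query(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace (illustrative only)."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

# Exact string comparison treats the two queries as different...
raw_match = "running shoes" == "Running Shoes!"
# ...while comparing the normalized forms treats them as the same query.
normalized_match = normalize_query("running shoes") == normalize_query("Running Shoes!")

print(raw_match, normalized_match)
```

The comparison only succeeds because both sides pass through the same function, which is why preprocessing has to be applied consistently to queries and to the indexed text.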

### 🤔 Conceptual Question 1
**Before we start coding, think about your daily interactions with text processing systems (search engines, chatbots, translation apps). What challenges do you think these systems face when processing human language? List at least 3 specific challenges and explain why each is problematic.**

Here are three big challenges text systems face from my perspective:

**Challenge 1: Words have many meanings.** Think about the word "bank." It could be a place where you keep money, or the ground beside a river. People understand which "bank" is meant by looking at the rest of the sentence or conversation. But for a computer, it's hard to know the right meaning without really understanding the context, which can lead to mistakes in search results or chatbot answers.

**Challenge 2: People say the same thing in different ways.** I see people use synonyms, like "car" and "automobile," or phrase things differently, like "buy a ticket" versus "purchase a ticket." A computer needs to be smart enough to recognize that these different wordings mean the same thing. If it doesn't, a search engine might miss relevant results, or a translation app might not capture the full sense of what someone is trying to say.

**Challenge 3: Language is messy and changes.** People use slang, abbreviations (like "LOL"), make typos, and even use emojis to express feelings. Plus, new words and phrases pop up all the time. Many systems are trained mostly on clean, formal text, which makes it tough for them to understand everyday language, social media posts, or things said informally. They have to keep learning and adapting to keep up with how people actually talk and write.

* * *

## 🛠️ Part 1: Environment Setup

We'll be working with two major NLP libraries:
- **NLTK (Natural Language Toolkit):** Comprehensive NLP library with extensive resources
- **spaCy:** Industrial-strength NLP with pre-trained models

**⚠️ Note:** Installation might take 2-3 minutes to complete.

In [1]:
# Step 1: Install Required Libraries
print("🔧 Installing NLP libraries...")

!pip install -q nltk spacy
!python -m spacy download en_core_web_sm

print("✅ Installation complete!")

🔧 Installing NLP libraries...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ Installation complete!


### 🤔 Conceptual Question 2
**Why do you think we need to install a separate language model (en_core_web_sm) for spaCy? What components might this model contain that help with text processing? Think about what information a computer needs to understand English text.**



I think I need to install a language model like `en_core_web_sm` because spaCy does more than split text into words; its more complex tasks depend on knowledge of English that has to be learned from data. Think of the model as spaCy's brain for English. It contains a vocabulary and statistical weights learned from annotated text, a part-of-speech tagger, a dependency parser that works out how a sentence is put together, lemmatization rules, and a named entity recognizer trained to spot people, places, organizations, and dates, which is super helpful for extracting information. (The small `sm` model does not ship with static word vectors; the larger `md` and `lg` models add those.) Basically, the model gives spaCy the background knowledge and patterns it needs to process English text intelligently, not just as a string of characters.

* * *

In [2]:
# Step 2: Import Libraries and Download NLTK Data
import nltk
import spacy
import string
import re
from collections import Counter

# Download essential NLTK data
print("📦 Downloading NLTK data packages...")
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For stop word removal
nltk.download('wordnet')    # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
nltk.download('punkt_tab') # Download punkt_tab resource

print("\n✅ All imports and downloads completed!")

📦 Downloading NLTK data packages...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

✅ All imports and downloads completed!


## 📂 Part 2: Sample Text Data

We'll work with different types of text to understand how preprocessing affects various text styles:
- Simple text
- Academic text (with citations, URLs)
- Social media text (with emojis, hashtags)
- News text (formal writing)
- Product reviews (informal, ratings)

In [3]:
# Step 3: Load Sample Texts
simple_text = "Natural Language Processing is a fascinating field of AI. It's amazing!"

academic_text = """
Dr. Smith's research on machine-learning algorithms is groundbreaking!
She published 3 papers in 2023, focusing on deep neural networks (DNNs).
The results were amazing - accuracy improved by 15.7%!
"This is revolutionary," said Prof. Johnson.
Visit https://example.com for more info. #NLP #AI @university
"""

social_text = "OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍"

news_text = """
The stock market experienced significant volatility today, with tech stocks leading the decline.
Apple Inc. (AAPL) dropped 3.2%, while Microsoft Corp. fell 2.8%.
"We're seeing a rotation out of growth stocks," said analyst Jane Doe from XYZ Capital.
"""

review_text = """
This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The battery life is incredible - lasts 8-10 hours easily.
Only complaint: the keyboard could be better. Overall rating: 4.5/5 stars.
"""

# Store all texts
sample_texts = {
    "Simple": simple_text,
    "Academic": academic_text.strip(),
    "Social Media": social_text,
    "News": news_text.strip(),
    "Product Review": review_text.strip()
}

print("📄 Sample texts loaded successfully!")
for name, text in sample_texts.items():
    preview = text[:80] + "..." if len(text) > 80 else text
    print(f"\n🏷️ {name}: {preview}")

📄 Sample texts loaded successfully!

🏷️ Simple: Natural Language Processing is a fascinating field of AI. It's amazing!

🏷️ Academic: Dr. Smith's research on machine-learning algorithms is groundbreaking!
She publi...

🏷️ Social Media: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yu...

🏷️ News: The stock market experienced significant volatility today, with tech stocks lead...

🏷️ Product Review: This laptop is absolutely fantastic! I've been using it for 6 months and it's st...


### 🤔 Conceptual Question 3
**Looking at the different text types we've loaded, what preprocessing challenges do you anticipate for each type? For each text type below, identify at least 2 specific preprocessing challenges and explain why they might be problematic for NLP analysis.**


**Simple text challenges:**
1. Even in simple text, ambiguity can pop up. For example, "AI" could mean "Artificial Intelligence" or refer to a person named "Ai." People can usually tell from context, but a machine might struggle to pick the right meaning without more information.
2. Contractions like "It's" can be a challenge. Do you keep them as one token or split them into "It" and "is"? The choice can affect things like vocabulary size and how relationships between words are learned.

**Academic text challenges:**
1. Academic text often uses citations (like "(Smith, 2023)") and references URLs. These aren't usually important for understanding the core content and just add noise. Removing or handling them correctly is needed so they don't interfere with analysis.
2. It uses lots of technical terms and abbreviations (like "DNNs"). NLP models might not recognize these, or they could have different meanings in other fields. Understanding or standardizing this specialized language is important.

**Social media text challenges:**
1. This text is full of informal language, slang, and abbreviations ("OMG"), plus emphatic capitalization and repeated punctuation ("SO GOOD!!!"). It also includes emojis and hashtags. This "messiness" makes it hard for traditional NLP methods that expect cleaner text.
2. Mentions (@username) and URLs are common. These often don't add semantic value to the message itself and can be removed to simplify the text, but you might lose information about who was mentioned or linked.

**News text challenges:**
1. News articles often contain named entities like company names (Apple Inc., Microsoft Corp.) and people's names (Jane Doe). Identifying these correctly is important for information extraction, but capitalization or variations in naming can make it tricky.
2. Quotes from people ("We're seeing...") include contractions and sometimes slightly less formal language than the main body of the text. Handling these nested structures and potentially different language styles within the same text can be complex.

**Product review challenges:**
1. Reviews often use informal language, misspellings, and emotional words ("absolutely fantastic," "super fast"). Capturing the sentiment accurately requires handling these variations and understanding how words contribute to positive or negative feelings.
2. Ratings (like "4.5/5 stars") are structured data within the text. Extracting this numerical information and linking it to the textual review requires specific parsing techniques beyond standard text processing.
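Two of the challenges listed above, web noise (URLs, hashtags, mentions) and ratings embedded in text, can be handled with regular expressions. The sketch below uses only the standard library; the pattern names and helper functions are hypothetical, and the patterns are deliberately simple assumptions that would need hardening for real data.

```python
import re

# Deliberately simple patterns; real-world text needs more careful rules.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"[@#]\w+")
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*(\d+)\s*stars?")

def strip_web_noise(text):
    """Remove URLs, hashtags, and mentions, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())

def extract_rating(text):
    """Return (score, scale) for the first 'x/y stars' pattern, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

academic_line = "Visit https://example.com for more info. #NLP #AI @university"
review_line = "Only complaint: the keyboard could be better. Overall rating: 4.5/5 stars."

print(strip_web_noise(academic_line))
print(extract_rating(review_line))
```

Extracting the rating before any punctuation stripping matters: once "." and "/" are removed, "4.5/5" can no longer be recovered from the text.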



## 🔤 Part 3: Tokenization

### What is Tokenization?
Tokenization is the process of breaking down text into smaller, meaningful units called **tokens**. These tokens are typically words, but can also be sentences, characters, or subwords.

### Why is it Important?
- Most NLP algorithms work with individual tokens, not entire texts
- It's the foundation for all subsequent preprocessing steps
- Different tokenization strategies can significantly impact results

### Common Challenges:
- **Contractions:** "don't" → "do" + "n't" or "don't"?
- **Punctuation:** Keep with words or separate?
- **Special characters:** How to handle @, #, URLs?
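Each of these challenges is a design decision that a tokenizer has to make. The sketch below is a small regex tokenizer that makes one possible set of choices: contractions stay whole, punctuation is split off, and URLs, hashtags, and mentions are kept as single tokens. It is not how NLTK or spaCy tokenize; `TOKEN_RE` and `simple_tokenize` are hypothetical names used only for illustration.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+        # URLs are kept as one token
    | [@\#]\w+          # mentions and hashtags are kept as one token
    | \w+(?:'\w+)*      # words; apostrophes inside a word are kept
    | [^\w\s]           # any other single symbol becomes its own token
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    """Split text into tokens according to TOKEN_RE."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Don't miss #NLP at https://example.com @university!"))
```

Changing the order of the alternatives or the word pattern changes the answers to the three questions above, which is why different libraries produce different token lists for the same text.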

In [4]:
# Step 4: Tokenization with NLTK
from nltk.tokenize import word_tokenize, sent_tokenize

# Test on simple text
print("🔍 NLTK Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Word tokenization
nltk_tokens = word_tokenize(simple_text)
print(f"\nWord tokens: {nltk_tokens}")
print(f"Number of tokens: {len(nltk_tokens)}")

# Sentence tokenization
sentences = sent_tokenize(simple_text)
print(f"\nSentences: {sentences}")
print(f"Number of sentences: {len(sentences)}")

🔍 NLTK Tokenization Results
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Word tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
Number of tokens: 14

Sentences: ['Natural Language Processing is a fascinating field of AI.', "It's amazing!"]
Number of sentences: 2


### 🤔 Conceptual Question 4
**Examine the NLTK tokenization results above. How did NLTK handle the contraction "It's"? What happened to the punctuation marks? Do you think this approach is appropriate for all NLP tasks? Explain your reasoning.**


**How "It's" was handled:** NLTK split the contraction "It's" into two separate tokens: "It" and "'s". This approach separates the pronoun from the contracted form of "is" or "has".

**Punctuation treatment:** NLTK treated punctuation marks like the period "." and exclamation mark "!" as individual tokens, distinct from the words they are attached to.

**Appropriateness for different tasks:** Splitting contractions and separating punctuation is a common default, and it suits many NLP tasks that operate on individual words. It is not ideal everywhere, though. In sentiment analysis, splitting "It's" in "It's amazing!" does no harm, and for a negated contraction such as "isn't" the split into "is" and "n't" actually helps, because the negation survives as its own token. Treating punctuation as separate tokens is also generally useful, since punctuation marks sentence boundaries and grammatical structure. But in emoticons (e.g., ":)") or some domain-specific text, the punctuation is part of a meaningful unit that should stay together, and linguistic analyses that need the exact original text may prefer to keep contractions whole. So NLTK's approach is a solid default, but its appropriateness depends on the goals of the specific task.

* * *

In [5]:
# Step 5: Tokenization with spaCy
nlp = spacy.load('en_core_web_sm')

print("🔍 spaCy Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Process with spaCy
doc = nlp(simple_text)

# Extract tokens
spacy_tokens = [token.text for token in doc]
print(f"\nWord tokens: {spacy_tokens}")
print(f"Number of tokens: {len(spacy_tokens)}")

# Show detailed token information
print(f"\n🔬 Detailed Token Analysis:")
print(f"{'Token':<12} {'POS':<8} {'Lemma':<12} {'Is Alpha':<8} {'Is Stop':<8}")
print("-" * 50)
for token in doc:
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {token.is_alpha:<8} {token.is_stop:<8}")

🔍 spaCy Tokenization Results
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Word tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
Number of tokens: 14

🔬 Detailed Token Analysis:
Token        POS      Lemma        Is Alpha Is Stop 
--------------------------------------------------
Natural      PROPN    Natural      1        0       
Language     PROPN    Language     1        0       
Processing   NOUN     processing   1        0       
is           AUX      be           1        1       
a            DET      a            1        1       
fascinating  ADJ      fascinating  1        0       
field        NOUN     field        1        0       
of           ADP      of           1        1       
AI           PROPN    AI           1        0       
.            PUNCT    .            0        0       
It           PRON     it           1        1       
's           AU

### 🤔 Conceptual Question 5
**Compare the NLTK and spaCy tokenization results. What differences do you notice? Which approach do you think would be better for different NLP tasks? Consider specific examples like sentiment analysis vs. information extraction.**

**Key differences observed:** Both NLTK and spaCy split "It's" and separate punctuation. However, spaCy provides more detailed information about each token, like its part of speech (POS), lemma (base form), and whether it's a stop word. NLTK just gives you the tokens themselves.

**Better for sentiment analysis:** For sentiment analysis, both could work. NLTK's simpler approach is often enough. But spaCy's ability to identify stop words easily might be helpful if you want to quickly remove common words that don't carry much sentiment.

**Better for information extraction:** spaCy is generally better for information extraction. Because it provides POS tags and can identify named entities (like names and places) with its models, it's more powerful for pulling specific pieces of information out of text. NLTK would require more manual steps to get this level of detail.

**Overall assessment:** spaCy offers a richer analysis of tokens which is useful for tasks needing deeper linguistic understanding, like information extraction. NLTK is simpler and faster for basic tokenization, suitable for tasks where just splitting words is enough, like some forms of sentiment analysis or simple text counting.

* * *

In [6]:
# Step 6: Test Tokenization on Complex Text
print("🧪 Testing on Social Media Text")
print("=" * 40)
print(f"Original: {social_text}")

# NLTK approach
social_nltk_tokens = word_tokenize(social_text)
print(f"\nNLTK tokens: {social_nltk_tokens}")

# spaCy approach
social_doc = nlp(social_text)
social_spacy_tokens = [token.text for token in social_doc]
print(f"spaCy tokens: {social_spacy_tokens}")

print(f"\n📊 Comparison:")
print(f"NLTK token count: {len(social_nltk_tokens)}")
print(f"spaCy token count: {len(social_spacy_tokens)}")

🧪 Testing on Social Media Text
Original: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍

NLTK tokens: ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop', '☕️', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend', '👍', '#', 'coffee', '#', 'yum', '😍']
spaCy tokens: ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop', '☕', '️', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend', '👍', '#', 'coffee', '#', 'yum', '😍']

📊 Comparison:
NLTK token count: 22
spaCy token count: 23


### 🤔 Conceptual Question 6
**Looking at how the libraries handled social media text (emojis, hashtags), which library seems more robust for handling "messy" real-world text? What specific advantages do you notice? How might this impact a real-world application like social media sentiment analysis?**

**More robust library:** Based on this output, the two libraries are close. spaCy seems slightly more practical for messy text like social media, mainly because of the extra information it attaches to each token.

**Specific advantages:** The tokenizers behaved almost identically here. Both split each hashtag into the symbol and the word (# and coffee, # and yum), and both kept 👍 and 😍 as single tokens. The one difference is the coffee emoji: "☕️" is the character ☕ followed by an invisible variation selector (U+FE0F), not a skin tone modifier. NLTK kept the pair together, while spaCy split the selector off as its own token, which explains the count of 23 versus 22. spaCy's real advantage is the detailed token information it provides, which makes it easier to filter or normalize such tokens afterwards.

**Impact on sentiment analysis:** For social media sentiment analysis, handling emojis and hashtags correctly is crucial because they often carry significant sentiment or context. Stray tokens such as a lone variation selector or a "#" detached from its word add noise, so whichever library is used, a social media pipeline usually needs extra handling (for example custom tokenization rules) to keep emojis and hashtags intact.
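The one-token difference between the two libraries is easier to interpret at the code point level. The sketch below uses the standard `unicodedata` module and assumes the coffee emoji in the sample text is U+2615 followed by U+FE0F, which is what the extra token in the spaCy output suggests; `describe_code_points` is a hypothetical helper name.

```python
import unicodedata

def describe_code_points(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

coffee = "\u2615\ufe0f"     # hot beverage followed by a variation selector
thumbs_up = "\U0001F44D"    # a single code point

for label, emoji in [("coffee", coffee), ("thumbs up", thumbs_up)]:
    print(label, len(emoji), describe_code_points(emoji))
```

A string that renders as one emoji can contain several code points, so a tokenizer that splits on character properties may produce an extra, visually empty token.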

* * *

## 🛑 Part 4: Stop Words Removal

### What are Stop Words?
Stop words are common words that appear frequently in a language but typically don't carry much meaningful information about the content. Examples include "the", "is", "at", "which", "on", etc.

### Why Remove Stop Words?
1. **Reduce noise** in the data
2. **Improve efficiency** by reducing vocabulary size
3. **Focus on content words** that carry semantic meaning

### When NOT to Remove Stop Words?
- **Sentiment analysis:** "not good" vs "good" - the "not" is crucial!
- **Question answering:** "What is the capital?" - "what" and "is" provide context
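The sentiment case can be made concrete with a filter that accepts a set of words to protect. The stop list below is a hypothetical miniature list invented for this illustration (it is not the NLTK or spaCy list), and `remove_stopwords` is a hypothetical helper.

```python
# Hypothetical miniature stop list for illustration; real lists are much longer.
BASIC_STOPWORDS = {"the", "is", "a", "this", "was", "not", "no", "it"}
NEGATIONS = {"not", "no", "never", "n't"}

def remove_stopwords(tokens, keep=frozenset()):
    """Drop stop words, except those explicitly listed in `keep`."""
    return [
        t for t in tokens
        if t.lower() not in BASIC_STOPWORDS or t.lower() in keep
    ]

tokens = ["This", "movie", "is", "not", "good"]

print(remove_stopwords(tokens))                   # negation is lost
print(remove_stopwords(tokens, keep=NEGATIONS))   # negation is preserved
```

With the plain filter the review collapses to the same tokens as "This movie is good", which is exactly the failure the first bullet warns about.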

In [7]:
# Step 7: Explore Stop Words Lists
from nltk.corpus import stopwords

# Get NLTK English stop words
nltk_stopwords = set(stopwords.words('english'))
print(f"📊 NLTK has {len(nltk_stopwords)} English stop words")
print(f"First 20: {sorted(list(nltk_stopwords))[:20]}")

# Get spaCy stop words
spacy_stopwords = nlp.Defaults.stop_words
print(f"\n📊 spaCy has {len(spacy_stopwords)} English stop words")
print(f"First 20: {sorted(list(spacy_stopwords))[:20]}")

# Compare the lists
common_stopwords = nltk_stopwords.intersection(spacy_stopwords)
nltk_only = nltk_stopwords - spacy_stopwords
spacy_only = spacy_stopwords - nltk_stopwords

print(f"\n🔍 Comparison:")
print(f"Common stop words: {len(common_stopwords)}")
print(f"Only in NLTK: {len(nltk_only)} - Examples: {sorted(list(nltk_only))[:5]}")
print(f"Only in spaCy: {len(spacy_only)} - Examples: {sorted(list(spacy_only))[:5]}")

📊 NLTK has 198 English stop words
First 20: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

📊 spaCy has 326 English stop words
First 20: ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also']

🔍 Comparison:
Common stop words: 123
Only in NLTK: 75 - Examples: ['ain', 'aren', "aren't", 'couldn', "couldn't"]
Only in spaCy: 203 - Examples: ["'d", "'ll", "'m", "'re", "'s"]


### 🤔 Conceptual Question 7
**Why do you think NLTK and spaCy have different stop word lists? Look at the examples of words that are only in one list - do you agree with these choices? Can you think of scenarios where these differences might significantly impact your NLP results?**


**Reasons for differences:** Different libraries are built with different goals and might use different criteria to decide which words are "stop words." Some might be more aggressive in removing words, while others might be more conservative.

**Agreement with choices:** Looking at the examples, the words only in NLTK's list are mostly contraction pieces (like "ain", "aren", and "aren't"), while the ones only in spaCy's list include contraction endings (like "'s" and "'ll"). Both kinds are reasonable stop words, so whether I agree really depends on the specific task.

**Scenarios where differences matter:** These differences can impact results when those specific words are important for the task. For example, "'s" is only in spaCy's list, so filtering with spaCy drops it while filtering with NLTK's list keeps it, which matters if you're analyzing contractions or possessives. More generally, if a word that is crucial for the task is on the list you use (negations like "not" are crucial for sentiment), removing it would hurt accuracy.

* * *

In [8]:
# Step 8: Remove Stop Words with NLTK
# Test on simple text
original_tokens = nltk_tokens  # From earlier tokenization
filtered_tokens = [word for word in original_tokens if word.lower() not in nltk_stopwords]

print("🧪 NLTK Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(original_tokens)}): {original_tokens}")
print(f"After removing stop words ({len(filtered_tokens)}): {filtered_tokens}")

# Show which words were removed
removed_words = [word for word in original_tokens if word.lower() in nltk_stopwords]
print(f"\nRemoved words: {removed_words}")

# Calculate reduction percentage
reduction = (len(original_tokens) - len(filtered_tokens)) / len(original_tokens) * 100
print(f"Vocabulary reduction: {reduction:.1f}%")

🧪 NLTK Stop Word Removal
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Original tokens (14): ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
After removing stop words (10): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', '.', "'s", 'amazing', '!']

Removed words: ['is', 'a', 'of', 'It']
Vocabulary reduction: 28.6%


In [9]:
# Step 9: Remove Stop Words with spaCy
doc = nlp(simple_text)
spacy_filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]

print("🧪 spaCy Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(spacy_tokens)}): {spacy_tokens}")
print(f"After removing stop words & punctuation ({len(spacy_filtered)}): {spacy_filtered}")

# Show which words were removed
spacy_removed = [token.text for token in doc if token.is_stop or token.is_punct]
print(f"\nRemoved words: {spacy_removed}")

# Calculate reduction percentage
spacy_reduction = (len(spacy_tokens) - len(spacy_filtered)) / len(spacy_tokens) * 100
print(f"Vocabulary reduction: {spacy_reduction:.1f}%")

🧪 spaCy Stop Word Removal
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Original tokens (14): ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
After removing stop words & punctuation (7): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', 'amazing']

Removed words: ['is', 'a', 'of', '.', 'It', "'s", '!']
Vocabulary reduction: 50.0%
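One caveat about the percentages in Steps 8 and 9: they are computed from token counts, so they measure how many tokens were removed rather than how much the vocabulary (the set of distinct words) shrank. The two numbers can differ when words repeat. The sketch below shows both on a made-up sentence; `reduction_stats` and the three-word stop set are hypothetical and exist only for this illustration.

```python
def reduction_stats(tokens, filtered):
    """Return (token reduction %, vocabulary reduction %) after filtering."""
    token_reduction = (len(tokens) - len(filtered)) / len(tokens) * 100
    vocab_before = {t.lower() for t in tokens}
    vocab_after = {t.lower() for t in filtered}
    vocab_reduction = (len(vocab_before) - len(vocab_after)) / len(vocab_before) * 100
    return token_reduction, vocab_reduction

# Made-up example with repeated words so the two measures diverge.
tokens = ["the", "cat", "sat", "on", "the", "mat", "and", "the", "dog", "sat"]
filtered = [t for t in tokens if t not in {"the", "on", "and"}]

token_pct, vocab_pct = reduction_stats(tokens, filtered)
print(f"Token reduction: {token_pct:.1f}%")
print(f"Vocabulary reduction: {vocab_pct:.1f}%")
```

On a single short sentence like the lab's sample the two measures are close, but on a full corpus stop words account for a large share of tokens and only a small share of distinct words.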


### 🤔 Conceptual Question 8
**Compare the NLTK and spaCy stop word removal results. Which approach removed more words? Do you think removing punctuation (as spaCy did) is always a good idea? Give a specific example where keeping punctuation might be important for NLP analysis.**


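**Which removed more:** spaCy removed more tokens in this example (7 versus 4). Most of the gap comes from the code also filtering punctuation in the spaCy version, whereas the NLTK version only removed stop words. spaCy also dropped "'s", which is in its stop word list but not in NLTK's.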

**Punctuation removal assessment:** Removing punctuation is not always a good idea. While it can help simplify text and reduce noise for some tasks, punctuation can carry important meaning in other contexts.

**Example where punctuation matters:** In sentiment analysis, punctuation can significantly impact the perceived emotion. For example, "This is great!" expresses stronger positive sentiment than "This is great." Removing the exclamation mark would lose this intensity. Similarly, emoticons like ":)" or ":(" are made of punctuation and are crucial for understanding sentiment in informal text.
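One way to keep this information is to record punctuation-based signals before punctuation is stripped. The sketch below is a minimal illustration using the standard library; the feature names and the emoticon pattern are assumptions, and the pattern covers only a handful of common emoticons.

```python
import re

# Matches a few common emoticons such as :) :( ;) :-D (far from exhaustive).
EMOTICON_RE = re.compile(r"[:;]-?[()DP]")

def punctuation_features(text):
    """Collect simple punctuation signals that stripping would destroy."""
    return {
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "emoticons": EMOTICON_RE.findall(text),
    }

print(punctuation_features("This is great!!! :)"))
print(punctuation_features("This is great."))
```

The two sentences contain the same words, so after punctuation removal they would be indistinguishable; the recorded features preserve the difference in intensity.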

* * *

## 🌱 Part 5: Lemmatization and Stemming

### What is Lemmatization?
Lemmatization reduces words to their base or dictionary form (called a **lemma**). It considers context and part of speech to ensure the result is a valid word.

### What is Stemming?
Stemming reduces words to their root form by removing suffixes. It's faster but less accurate than lemmatization.

### Key Differences:
| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| Speed | Fast | Slower |
| Accuracy | Lower | Higher |
| Output | May be non-words | Always valid words |
| Context | Ignores context | Considers context |

### Examples:
- **"running"** → Stem: "run", Lemma: "run"
- **"better"** → Stem: "better", Lemma: "good" (as an adjective; the result depends on the part of speech the lemmatizer assigns)
- **"was"** → Stem: "wa", Lemma: "be"
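The contrast between the two techniques can be sketched with a toy implementation: a stemmer that blindly strips suffixes and a lemmatizer that looks words up in a table. Both are hypothetical simplifications written for this illustration; the suffix list is far smaller than the Porter algorithm's rules, and the lookup table stands in for the dictionary and part-of-speech information a real lemmatizer uses.

```python
SUFFIXES = ("ing", "es", "s")  # hypothetical rule set, far smaller than Porter's
LEMMA_TABLE = {
    "running": "run", "better": "good", "was": "be", "ran": "run", "feet": "foot",
}

def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the stem if it is unknown."""
    return LEMMA_TABLE.get(word, toy_stem(word))

for word in ["running", "cats", "was", "better", "feet"]:
    print(f"{word:<10} stem={toy_stem(word):<8} lemma={toy_lemmatize(word)}")
```

The stemmer is cheap but can return non-words and cannot relate irregular forms, while the lookup handles irregular forms only because someone supplied that knowledge in advance.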

In [10]:
# Step 10: Stemming with NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Test words that demonstrate stemming challenges
test_words = ['running', 'runs', 'ran', 'better', 'good', 'best', 'flying', 'flies', 'was', 'were', 'cats', 'dogs']

print("🌿 Stemming Demonstration")
print("=" * 30)
print(f"{'Original':<12} {'Stemmed':<12}")
print("-" * 25)

for word in test_words:
    stemmed = stemmer.stem(word)
    print(f"{word:<12} {stemmed:<12}")

# Apply to our sample text
sample_tokens = [token for token in nltk_tokens if token.isalpha()]
stemmed_tokens = [stemmer.stem(token.lower()) for token in sample_tokens]

print(f"\n🧪 Applied to sample text:")
print(f"Original: {sample_tokens}")
print(f"Stemmed: {stemmed_tokens}")

🌿 Stemming Demonstration
Original     Stemmed     
-------------------------
running      run         
runs         run         
ran          ran         
better       better      
good         good        
best         best        
flying       fli         
flies        fli         
was          wa          
were         were        
cats         cat         
dogs         dog         

🧪 Applied to sample text:
Original: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', 'It', 'amazing']
Stemmed: ['natur', 'languag', 'process', 'is', 'a', 'fascin', 'field', 'of', 'ai', 'it', 'amaz']


### 🤔 Conceptual Question 9
**Look at the stemming results above. Can you identify any cases where stemming produced questionable results? For example, how were "better" and "good" handled? Do you think this is problematic for NLP applications? Explain your reasoning.**


**Questionable results identified:** Stemming produced two kinds of questionable results. Some stems are not actual words: "flying" and "flies" both became "fli", and "was" became "wa". Others were left untouched even though they are inflected forms: "ran" stayed "ran" instead of joining "run", "were" stayed "were", and "better" and "best" were not linked to "good".

**Assessment of "better" and "good":** "Better" and "good" are related in meaning but stemming didn't reduce them to a common root. They were left as they were, which means a system using stemming wouldn't recognize them as variations of the same concept.

**Impact on NLP applications:** This can be problematic. For a search engine, if someone searches for "running shoes," stemming "running" to "run" is fine. But if they search for "best price," and "best" isn't stemmed to "good," the search might miss results containing "good price." In sentiment analysis, stemming "better" and "good" separately could prevent the system from recognizing that both indicate positive sentiment. Stemming's aggressive approach can sometimes group unrelated words or fail to group related ones, impacting the accuracy of tasks that rely on understanding word relationships.

* * *

In [11]:
# Step 11: Lemmatization with spaCy
print("🌱 spaCy Lemmatization Demonstration")
print("=" * 40)

# Test on a complex sentence
complex_sentence = "The researchers were studying the effects of running and swimming on better performance."
doc = nlp(complex_sentence)

print(f"Original: {complex_sentence}")
print(f"\n{'Token':<15} {'Lemma':<15} {'POS':<10} {'Explanation':<20}")
print("-" * 65)

for token in doc:
    if token.is_alpha:
        explanation = "No change" if token.text.lower() == token.lemma_ else "Lemmatized"
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {explanation:<20}")

# Extract lemmas
lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
print(f"\n🔤 Lemmatized tokens (no stop words): {lemmas}")

🌱 spaCy Lemmatization Demonstration
Original: The researchers were studying the effects of running and swimming on better performance.

Token           Lemma           POS        Explanation         
-----------------------------------------------------------------
The             the             DET        No change           
researchers     researcher      NOUN       Lemmatized          
were            be              AUX        Lemmatized          
studying        study           VERB       Lemmatized          
the             the             DET        No change           
effects         effect          NOUN       Lemmatized          
of              of              ADP        No change           
running         run             VERB       Lemmatized          
and             and             CCONJ      No change           
swimming        swim            VERB       Lemmatized          
on              on              ADP        No change           
better          well       

In [12]:
# Step 12: Compare Stemming vs Lemmatization
comparison_words = ['better', 'running', 'studies', 'was', 'children', 'feet']

print("⚖️ Stemming vs Lemmatization Comparison")
print("=" * 50)
print(f"{'Original':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 40)

for word in comparison_words:
    # Stemming
    stemmed = stemmer.stem(word)

    # Lemmatization with spaCy
    doc = nlp(word)
    lemmatized = doc[0].lemma_

    print(f"{word:<12} {stemmed:<12} {lemmatized:<12}")

⚖️ Stemming vs Lemmatization Comparison
Original     Stemmed      Lemmatized  
----------------------------------------
better       better       well        
running      run          run         
studies      studi        study       
was          wa           be          
children     children     child       
feet         feet         foot        


### 🤔 Conceptual Question 10
**Compare the stemming and lemmatization results. Which approach do you think is more suitable for:**
1. **A search engine** (where speed is crucial and you need to match variations of words)?
2. **A sentiment analysis system** (where accuracy and meaning preservation are important)?
3. **A real-time chatbot** (where both speed and accuracy matter)?

**Explain your reasoning for each choice.**

**1. Search engine:** Stemming is often better for a search engine. Speed is very important when someone searches, and stemming is faster. It's good enough to find variations of words even if the root isn't a real word. For example, stemming "running" and "runs" to "run" helps the search find documents with either word when the user searches for "run." Irregular forms such as "ran" are not caught, as the earlier stemming results showed.

**2. Sentiment analysis:** Lemmatization is usually better for sentiment analysis. Accuracy and keeping the word's real meaning are key. Lemmatization returns a valid dictionary word (spaCy mapped "better" to "well" in the comparison above, while stemming left it as "better"), which helps the system connect each word to its base form. Stemming might create non-words or fail to group related words, which could confuse the sentiment analysis.

**3. Real-time chatbot:** This is tricky and might depend on the chatbot's specific job. If speed is the absolute top priority, stemming might be chosen. However, for better understanding of user input and more accurate responses, especially in complex conversations, lemmatization is likely preferred despite being a bit slower. A hybrid approach or using pre-computed lemmas could also be an option to balance speed and accuracy.
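
The pre-computed lemma idea can be sketched with a cache. This is a minimal sketch, assuming lemmas are looked up one word at a time; `slow_lemmatize` and its lookup table are hypothetical stand-ins for an expensive call such as running a spaCy pipeline on a single word.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive function really runs

# Hypothetical stand-in data; a real system would call a lemmatizer here.
_LEMMA_TABLE = {"running": "run", "studies": "study", "children": "child", "feet": "foot"}


def slow_lemmatize(word):
    """Pretend-expensive lemmatizer (assumption for illustration)."""
    CALLS["count"] += 1
    return _LEMMA_TABLE.get(word, word)


@lru_cache(maxsize=50_000)
def cached_lemma(word):
    """Return the lemma, computing it only the first time a word is seen."""
    return slow_lemmatize(word)


def lemmatize_message(message):
    """Lowercase, split on whitespace, and lemmatize each word via the cache."""
    return [cached_lemma(w) for w in message.lower().split()]


print(lemmatize_message("children running running"))
print("expensive calls:", CALLS["count"])
```

One design caveat: caching by word alone ignores sentence context, so a word whose lemma depends on its part of speech would always get the first answer that was cached.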

* * *

## 🧹 Part 6: Text Cleaning and Normalization

### What is Text Cleaning?
Text cleaning involves removing or standardizing elements that might interfere with analysis:
- **Case normalization** (converting to lowercase)
- **Punctuation removal**
- **Number handling** (remove, replace, or normalize)
- **Special character handling** (URLs, emails, mentions)
- **Whitespace normalization**

### Why is it Important?
- Ensures consistency across your dataset
- Reduces vocabulary size
- Improves model performance
- Handles edge cases in real-world data
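
The vocabulary-size point can be checked with a few lines of standard-library Python. This is a small sketch on a made-up sentence, using whitespace splitting rather than the NLTK or spaCy tokenizers from earlier parts of the lab.

```python
import string

raw = "Running shoes! running SHOES? Running, shoes."

# Naive whitespace tokens: case and attached punctuation make each one distinct.
tokens_raw = raw.split()

# Lowercase and strip leading/trailing punctuation from each token.
tokens_clean = [t.lower().strip(string.punctuation) for t in tokens_raw]

print("Vocabulary before cleaning:", sorted(set(tokens_raw)))
print("Vocabulary after cleaning: ", sorted(set(tokens_clean)))
```

Six surface forms collapse into two vocabulary entries, which is the consistency and size reduction the list above describes.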

In [13]:
# Step 13: Basic Text Cleaning
def basic_clean_text(text):
    """Apply basic text cleaning operations"""
    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra spaces again
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test basic cleaning
test_text = "   Hello WORLD!!! This has 123 numbers and   extra spaces.   "
cleaned = basic_clean_text(test_text)

print("🧹 Basic Text Cleaning")
print("=" * 30)
print(f"Original: '{test_text}'")
print(f"Cleaned: '{cleaned}'")
print(f"Length reduction: {(len(test_text) - len(cleaned))/len(test_text)*100:.1f}%")

🧹 Basic Text Cleaning
Original: '   Hello WORLD!!! This has 123 numbers and   extra spaces.   '
Cleaned: 'hello world this has numbers and extra spaces'
Length reduction: 26.2%


In [14]:
# Step 14: Advanced Cleaning for Social Media
def advanced_clean_text(text):
    """Apply advanced cleaning for social media and web text"""
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)

    # Convert hashtags (keep the word, remove #)
    text = re.sub(r'#(\w+)', r'\1', text)

    # Remove emojis (basic approach)
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # Convert to lowercase and normalize whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test on social media text
print("🚀 Advanced Cleaning on Social Media Text")
print("=" * 45)
print(f"Original: {social_text}")

cleaned_social = advanced_clean_text(social_text)
print(f"Cleaned: {cleaned_social}")
print(f"Length reduction: {(len(social_text) - len(cleaned_social))/len(social_text)*100:.1f}%")

🚀 Advanced Cleaning on Social Media Text
Original: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍
Cleaned: omg! just tried the new coffee shop ☕️ so good!!! highly recommend coffee yum
Length reduction: 7.2%
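
The coffee-cup emoji survives in the output above because U+2615 lies outside the four ranges in `emoji_pattern`. A hedged sketch of a wider pattern follows; the extra ranges are assumptions about which Unicode blocks to cover and are not an exhaustive emoji list.

```python
import re

# Ranges beyond the lab's original four are assumptions, not a complete emoji set.
EXTENDED_EMOJI = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u26FF"          # miscellaneous symbols, includes U+2615 (hot beverage)
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector-16, which trails some emojis
    "]+"
)


def strip_emojis(text):
    """Remove matched emojis, then collapse the whitespace they leave behind."""
    without = EXTENDED_EMOJI.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


sample = "coffee shop \u2615\uFE0F so good \U0001F60D"
print(strip_emojis(sample))
```

Emoji code points are spread across many blocks, so any hand-written range list is likely to miss some; checking the cleaned output, as done above, is the practical safeguard.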


### 🤔 Conceptual Question 11
**Look at the advanced cleaning results for the social media text. What information was lost during cleaning? Can you think of scenarios where removing emojis and hashtags might actually hurt your NLP application? What about scenarios where keeping them would be beneficial?**


**Information lost:** In this example, advanced cleaning removed the 👍 and 😍 emojis, the # symbols on the hashtags, and the capitalization that signaled emphasis ("OMG", "SO GOOD"). The ☕️ emoji survived because it falls outside the Unicode ranges in the pattern. The function also strips URLs, email addresses, and mentions (@usernames), although this text contained none.

**Scenarios where removal hurts:** Removing emojis and hashtags can hurt NLP applications where they carry significant meaning or sentiment. For example, in sentiment analysis of social media, an emoji like 😂 or a hashtag like #blessed strongly indicates positive sentiment. Removing them would strip away this crucial emotional information. Similarly, for tasks like identifying trending topics or understanding social network interactions, removing hashtags and mentions would eliminate valuable data.

**Scenarios where keeping helps:** Keeping emojis and hashtags is beneficial when they provide important context, sentiment, or structural information. For sentiment analysis, as mentioned, emojis are key indicators. For topic modeling or trend analysis, hashtags define themes. For social network analysis, mentions show interactions between users. In these cases, preserving these elements allows the NLP system to capture a more complete understanding of the text and its social context.
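
One way to keep these signals instead of deleting them is to turn emojis into word-like tokens and collect hashtags separately. The sketch below is an assumption-laden illustration: the function names and the `emoji_` token prefix are hypothetical, the pattern covers only two emoji blocks, and the token text comes from Python's `unicodedata.name`.

```python
import re
import unicodedata

# Only two emoji blocks are covered here, to keep the sketch short (assumption).
EMOJI = re.compile("[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]")


def emoji_to_token(match):
    """Replace one emoji with a token built from its Unicode character name."""
    name = unicodedata.name(match.group(0), "unknown emoji")
    return " emoji_" + re.sub(r"[^a-z0-9]+", "_", name.lower()) + " "


def keep_social_signals(text):
    """Return (text with emojis as tokens, list of hashtags without '#')."""
    hashtags = re.findall(r"#(\w+)", text)
    text = EMOJI.sub(emoji_to_token, text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text, hashtags


cleaned, tags = keep_social_signals("Highly recommend \U0001F44D #coffee #yum")
print(cleaned)
print(tags)
```

A sentiment model can then treat a token such as `emoji_thumbs_up_sign` like any other word, and the hashtag list can feed topic or trend analysis.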

* * *

## 🔧 Part 7: Building a Complete Preprocessing Pipeline

Now let's combine everything into a comprehensive preprocessing pipeline that you can customize based on your needs.

### Pipeline Components:
1. **Text cleaning** (basic or advanced)
2. **Tokenization** (NLTK or spaCy)
3. **Stop word removal** (optional)
4. **Lemmatization/Stemming** (optional)
5. **Additional filtering** (length, etc.)

In [15]:
# Step 15: Complete Preprocessing Pipeline
def preprocess_text(text,
                   clean_level='basic',     # 'basic' or 'advanced'
                   remove_stopwords=True,
                   use_lemmatization=True,
                   use_stemming=False,
                   min_length=2):
    """
    Complete text preprocessing pipeline
    """
    # Step 1: Clean text
    if clean_level == 'basic':
        cleaned_text = basic_clean_text(text)
    else:
        cleaned_text = advanced_clean_text(text)

    # Step 2: Tokenize
    if use_lemmatization:
        # Use spaCy for lemmatization
        doc = nlp(cleaned_text)
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha]
    else:
        # Use NLTK for basic tokenization
        tokens = word_tokenize(cleaned_text)
        tokens = [token for token in tokens if token.isalpha()]

    # Step 3: Remove stop words
    if remove_stopwords:
        if use_lemmatization:
            tokens = [token for token in tokens if token not in spacy_stopwords]
        else:
            tokens = [token.lower() for token in tokens if token.lower() not in nltk_stopwords]

    # Step 4: Apply stemming if requested
    if use_stemming and not use_lemmatization:
        tokens = [stemmer.stem(token.lower()) for token in tokens]

    # Step 5: Filter by length
    tokens = [token for token in tokens if len(token) >= min_length]

    return tokens

print("🔧 Preprocessing Pipeline Created!")
print("✅ Ready to test different configurations.")

🔧 Preprocessing Pipeline Created!
✅ Ready to test different configurations.


In [16]:
# Step 16: Test Different Pipeline Configurations
test_text = sample_texts["Product Review"]
print(f"🎯 Testing on: {test_text[:100]}...")
print("=" * 60)

# Configuration 1: Minimal processing
minimal = preprocess_text(test_text,
                         clean_level='basic',
                         remove_stopwords=False,
                         use_lemmatization=False,
                         use_stemming=False)
print(f"\n1. Minimal processing ({len(minimal)} tokens):")
print(f"   {minimal[:10]}...")

# Configuration 2: Standard processing
standard = preprocess_text(test_text,
                          clean_level='basic',
                          remove_stopwords=True,
                          use_lemmatization=True)
print(f"\n2. Standard processing ({len(standard)} tokens):")
print(f"   {standard[:10]}...")

# Configuration 3: Aggressive processing
aggressive = preprocess_text(test_text,
                            clean_level='advanced',
                            remove_stopwords=True,
                            use_lemmatization=False,
                            use_stemming=True,
                            min_length=3)
print(f"\n3. Aggressive processing ({len(aggressive)} tokens):")
print(f"   {aggressive[:10]}...")

# Show reduction percentages
original_count = len(word_tokenize(test_text))
print(f"\n📊 Token Reduction Summary:")
print(f"   Original: {original_count} tokens")
print(f"   Minimal: {len(minimal)} ({(original_count-len(minimal))/original_count*100:.1f}% reduction)")
print(f"   Standard: {len(standard)} ({(original_count-len(standard))/original_count*100:.1f}% reduction)")
print(f"   Aggressive: {len(aggressive)} ({(original_count-len(aggressive))/original_count*100:.1f}% reduction)")

🎯 Testing on: This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The ...

1. Minimal processing (34 tokens):
   ['this', 'laptop', 'is', 'absolutely', 'fantastic', 'ive', 'been', 'using', 'it', 'for']...

2. Standard processing (18 tokens):
   ['laptop', 'absolutely', 'fantastic', 've', 'use', 'month', 'super', 'fast', 'battery', 'life']...

3. Aggressive processing (21 tokens):
   ['laptop', 'absolut', 'fantast', 'use', 'month', 'still', 'super', 'fast', 'batteri', 'life']...

📊 Token Reduction Summary:
   Original: 47 tokens
   Minimal: 34 (27.7% reduction)
   Standard: 18 (61.7% reduction)
   Aggressive: 21 (55.3% reduction)
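
The `'ive'` token in the Minimal output and the stray `'ve'` in the Standard output appear to come from basic cleaning stripping the apostrophe out of "I've" before tokenization. A possible remedy, sketched below, is to expand contractions first. The `CONTRACTIONS` table is a tiny hypothetical sample, and both function names are invented for illustration.

```python
import re
import string

# Hypothetical mini table; a real project would need a much fuller mapping.
CONTRACTIONS = {"i've": "i have", "it's": "it is", "don't": "do not"}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)


def naive_clean(text):
    """Lowercase and strip punctuation directly, as basic cleaning does."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def clean_with_expansion(text):
    """Expand known contractions before stripping punctuation."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    return naive_clean(text)


sentence = "I've been using it and it's fast!"
print(naive_clean(sentence))
print(clean_with_expansion(sentence))
```

Expanding "don't" to "do not" also keeps the negation as a separate word, which matters for sentiment tasks.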


### 🤔 Conceptual Question 12
**Compare the three pipeline configurations (Minimal, Standard, Aggressive). For each configuration, analyze:**
1. **What information was preserved?**
2. **What information was lost?**
3. **What type of NLP task would this configuration be best suited for?**

**Minimal Processing:**
- Preserved: All of the original words are kept in lowercase, including stop words and inflected forms.
- Lost: Capitalization, punctuation, numbers, and extra whitespace are removed by basic cleaning, so contractions collapse ("I've" becomes "ive").
- Best for: Tasks where the exact wording and structure are important, like grammar checking, part-of-speech tagging, or when you need a large vocabulary.

**Standard Processing:**
- Preserved: Core meaningful words are kept. Variations of words are reduced to their base form (lemmatization).
- Lost: Punctuation, numbers, and common stop words are removed. Some subtle meaning from word variations might be lost.
- Best for: Many common NLP tasks like text classification, topic modeling, or information retrieval where focusing on the main content words is beneficial.

**Aggressive Processing:**
- Preserved: Only the most basic word roots are kept. More aggressive cleaning removes URLs, mentions, and emojis.
- Lost: Significant structural and contextual information is lost (punctuation, stop words, numbers, special characters). Word roots may not be actual words.
- Best for: Tasks where reducing dimensionality and focusing on very core word components is critical, often in highly specific or experimental scenarios, or for tasks where speed is paramount and some loss of nuance is acceptable (though lemmatization is usually preferred over stemming for accuracy).

* * *

In [17]:
# Step 17: Comprehensive Analysis Across Text Types
print("🔬 Comprehensive Preprocessing Analysis")
print("=" * 50)

# Test standard preprocessing on all text types
results = {}
for name, text in sample_texts.items():
    original_tokens = len(word_tokenize(text))
    processed_tokens = preprocess_text(text,
                                      clean_level='basic',
                                      remove_stopwords=True,
                                      use_lemmatization=True)

    reduction = (original_tokens - len(processed_tokens)) / original_tokens * 100
    results[name] = {
        'original': original_tokens,
        'processed': len(processed_tokens),
        'reduction': reduction,
        'sample': processed_tokens[:8]
    }

    print(f"\n📄 {name}:")
    print(f"   Original: {original_tokens} tokens")
    print(f"   Processed: {len(processed_tokens)} tokens ({reduction:.1f}% reduction)")
    print(f"   Sample: {processed_tokens[:8]}")

# Summary table
print(f"\n\n📋 Summary Table")
print(f"{'Text Type':<15} {'Original':<10} {'Processed':<10} {'Reduction':<10}")
print("-" * 50)
for name, data in results.items():
    print(f"{name:<15} {data['original']:<10} {data['processed']:<10} {data['reduction']:<10.1f}%")

🔬 Comprehensive Preprocessing Analysis

📄 Simple:
   Original: 14 tokens
   Processed: 7 tokens (50.0% reduction)
   Sample: ['natural', 'language', 'processing', 'fascinating', 'field', 'ai', 'amazing']

📄 Academic:
   Original: 61 tokens
   Processed: 26 tokens (57.4% reduction)
   Sample: ['dr', 'smith', 'research', 'machinelearning', 'algorithm', 'groundbreake', 'publish', 'paper']

📄 Social Media:
   Original: 22 tokens
   Processed: 10 tokens (54.5% reduction)
   Sample: ['omg', 'try', 'new', 'coffee', 'shop', 'good', 'highly', 'recommend']

📄 News:
   Original: 51 tokens
   Processed: 25 tokens (51.0% reduction)
   Sample: ['stock', 'market', 'experience', 'significant', 'volatility', 'today', 'tech', 'stock']

📄 Product Review:
   Original: 47 tokens
   Processed: 18 tokens (61.7% reduction)
   Sample: ['laptop', 'absolutely', 'fantastic', 've', 'use', 'month', 'super', 'fast']


📋 Summary Table
Text Type       Original   Processed  Reduction 
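--------------------------------------------------
Simple          14         7          50.0      %
Academic        61         26         57.4      %
Social Media    22         10         54.5      %
News            51         25         51.0      %
Product Review  47         18         61.7      %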

### 🤔 Final Conceptual Question 13
**Looking at the comprehensive analysis results across all text types:**

1. **Which text type was most affected by preprocessing?** Why do you think this happened?
2. **Which text type was least affected?** What does this tell you about the nature of that text?
3. **If you were building an NLP system to analyze customer reviews for a business, which preprocessing approach would you choose and why?**
4. **What are the main trade-offs you need to consider when choosing preprocessing techniques for any NLP project?**

**1. Most affected text type:** Product Review was the most affected, showing the highest percentage reduction in tokens (61.7%). This likely happened because product reviews often contain more informal language, numbers (ratings), and potentially more stop words compared to more formal text types. The standard preprocessing pipeline (basic cleaning, stop word removal, lemmatization) would aggressively target these elements, leading to a higher reduction.

**2. Least affected text type:** Simple text was the least affected, with a 50.0% reduction. This tells us that simple text is already relatively clean and contains fewer complex structures, special characters, or a high density of stop words compared to the other text types. It's closer to the "ideal" input for standard NLP processing, so less is removed.

**3. For customer review analysis:** I would likely choose a **Standard Processing** approach, possibly with some modifications. Customer reviews contain sentiment, which means removing *all* stop words might be harmful ("not good"). I would likely keep negation words. Lemmatization is important to group word variations ("using" to "use," "months" to "month") for consistent sentiment analysis. Basic cleaning is good for removing punctuation and numbers (though extracting the rating itself would need a specific step before this). Aggressive cleaning might remove emojis or informal language crucial for sentiment.

**4. Main trade-offs to consider:** When choosing preprocessing techniques for any NLP project, you need to consider several trade-offs. One is the balance between **information loss and noise reduction**. More aggressive techniques reduce noise but can also remove valuable data needed for certain tasks, such as sentiment conveyed by emojis. Another trade-off is **speed versus accuracy**. Stemming is faster but less accurate than lemmatization; the choice depends on whether real-time performance or precise meaning is more critical for your application. You also need to think about **complexity versus robustness**. More complex cleaning methods handle messy text better but might be unnecessary for cleaner text and require more specific rules. Finally, there's the consideration of **generalization versus task specificity**. A general pipeline can be used for many tasks, but customizing it for your exact project will usually lead to better performance.
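
The negation point from the customer review answer can be sketched as a set difference. This is a minimal sketch: `STOP_WORDS` is a hypothetical mini list (the NLTK and spaCy lists used in this lab are much longer), and `NEGATIONS` is an assumed choice of words to protect.

```python
# Hypothetical mini stop word list for illustration only.
STOP_WORDS = {"the", "is", "was", "a", "it", "and", "but", "not", "no", "never"}

# Words to protect because they flip sentiment (assumed selection).
NEGATIONS = {"not", "no", "never"}

SENTIMENT_STOP_WORDS = STOP_WORDS - NEGATIONS


def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not in the given stop word set."""
    return [t for t in tokens if t not in stop_words]


review = "the battery is not good but the screen is great".split()
print(remove_stop_words(review, STOP_WORDS))            # negation dropped
print(remove_stop_words(review, SENTIMENT_STOP_WORDS))  # negation kept
```

With the full list, "not good" shrinks to "good" and the review reads as positive about the battery; with negations protected, the signal survives.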

* * *

## 🎯 Lab Summary and Reflection

Congratulations! You've completed a comprehensive exploration of NLP preprocessing techniques.

### 🔑 Key Concepts You've Mastered:

1. **Text Preprocessing Fundamentals** - Understanding why preprocessing is crucial
2. **Tokenization Techniques** - NLTK vs spaCy approaches and their trade-offs
3. **Stop Word Management** - When to remove them and when to keep them
4. **Morphological Processing** - Stemming vs lemmatization for different use cases
5. **Text Cleaning Strategies** - Basic vs advanced cleaning for different text types
6. **Pipeline Design** - Building modular, configurable preprocessing systems

### 🎓 Real-World Applications:
These techniques form the foundation for search engines, chatbots, sentiment analysis, document classification, machine translation, and information extraction systems.

### 💡 Key Insights to Remember:
- **No Universal Solution**: Different NLP tasks require different preprocessing approaches
- **Trade-offs Are Everywhere**: Balance information preservation with noise reduction
- **Context Matters**: The same technique can help or hurt depending on your use case
- **Experimentation Is Key**: Always test and measure impact on your specific task

---

**Excellent work completing Lab 02!** 🎉

For your reflection journal, focus on the insights you gained about when and why to use different techniques, the challenges you encountered, and connections you made to real-world applications.