<a href="https://colab.research.google.com/github/AbdullahFaiza/NLP-ITAI2373/blob/main/Lab_02_NLP_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 02: Basic NLP Preprocessing Techniques

**Course:** ITAI 2373 - Natural Language Processing  
**Module:** 02 - Text Preprocessing  
**Duration:** 2-3 hours  
**Student Name:** Faiza Abdullah  
**Date:** June 8, 2025

---

## 🎯 Learning Objectives

By completing this lab, you will:
1. Understand the critical role of preprocessing in NLP pipelines
2. Master fundamental text preprocessing techniques
3. Compare different libraries and their approaches
4. Analyze the effects of preprocessing on text data
5. Build a complete preprocessing pipeline
6. Load and work with different types of text datasets

## 📖 Introduction to NLP Preprocessing

Natural Language Processing (NLP) preprocessing refers to the initial steps taken to clean and transform raw text data into a format that's more suitable for analysis by machine learning algorithms.

### Why is preprocessing crucial?

1. **Standardization:** Ensures consistent text format across your dataset
2. **Noise Reduction:** Removes irrelevant information that could confuse algorithms
3. **Complexity Reduction:** Simplifies text to focus on meaningful patterns
4. **Performance Enhancement:** Improves the efficiency and accuracy of downstream tasks

### Real-world Impact
Consider searching for "running shoes" vs "Running Shoes!" - without preprocessing, these might be treated as completely different queries. Preprocessing ensures they're recognized as equivalent.
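
The query example above can be made concrete with a minimal sketch. The helper name `normalize_query` is hypothetical, and the rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are only one possible normalization, not what any particular search engine does.

```python
import string

def normalize_query(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse repeated whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

queries = ["running shoes", "Running Shoes!", "  RUNNING   shoes?? "]
normalized = [normalize_query(q) for q in queries]

# All three surface forms collapse to the same normalized query.
print(normalized)
print(len(set(normalized)) == 1)
```

Note that this sketch only handles ASCII punctuation; typographic quotes or emojis would pass through unchanged.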

### 🤔 Conceptual Question 1
**Before we start coding, think about your daily interactions with text processing systems (search engines, chatbots, translation apps). What challenges do you think these systems face when processing human language? List at least 3 specific challenges and explain why each is problematic.**

**Challenge 1: Ambiguity in Language**

Human language is inherently ambiguous, with words and phrases often having multiple meanings depending on context. For example, the word "bank" can refer to a financial institution, the edge of a river, or even a verb meaning to tilt an aircraft. This poses a problem for systems like search engines or chatbots because misinterpreting the intended meaning can lead to irrelevant results or incorrect responses. Resolving ambiguity requires understanding context, which is difficult without deep knowledge of the situation, cultural nuances, or user intent.

**Challenge 2: Handling Sarcasm and Tone**

Detecting sarcasm, irony, or emotional tone in text is a significant challenge. For instance, a phrase like "Great job!" could be genuine praise or sarcastic criticism, depending on the context and intent. Systems struggle to pick up on these subtleties because they rely on patterns in data rather than human-like emotional intuition. This can lead to misinterpretations in chatbots or sentiment analysis tools, resulting in inappropriate responses or inaccurate sentiment classification.

**Challenge 3: Cultural and Linguistic Variations**

Languages vary widely across cultures, dialects, and regions, with differences in slang, idioms, or grammatical structures. For example, the phrase "throwing shade" might be understood in some English-speaking communities but confuse systems or users unfamiliar with the term. Translation apps and chatbots often struggle to account for these variations, leading to mistranslations or responses that feel unnatural or irrelevant to users from different backgrounds. This challenge is compounded by the need for systems to adapt to evolving language trends in real time.

## 🛠️ Part 1: Environment Setup

We'll be working with two major NLP libraries:
- **NLTK (Natural Language Toolkit):** Comprehensive NLP library with extensive resources
- **spaCy:** Industrial-strength NLP with pre-trained models

**⚠️ Note:** Installation might take 2-3 minutes to complete.

In [1]:
# Step 1: Install Required Libraries
print("🔧 Installing NLP libraries...")

!pip install -q nltk spacy
!python -m spacy download en_core_web_sm

print("✅ Installation complete!")

🔧 Installing NLP libraries...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 85.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
✅ Installation complete!


### 🤔 Conceptual Question 2
**Why do you think we need to install a separate language model (en_core_web_sm) for spaCy? What components might this model contain that help with text processing? Think about what information a computer needs to understand English text.**

Installing a separate language model like en_core_web_sm is necessary because spaCy itself is a framework (a set of tools and algorithms for natural language processing) and does not bundle the trained, language-specific data needed to analyze text in a particular language, like English. The model package provides the pre-trained statistical weights and rules for English that spaCy uses to analyze text. Rule-based tokenization ships with spaCy's built-in English language data, so it works without the download, but trained components such as part-of-speech tagging, dependency parsing, and named entity recognition are unavailable until a model is installed.

**Why a Separate Language Model?**

- Modularity: By separating the framework from language-specific data, spaCy remains lightweight and flexible. Users can install only the models they need (e.g., English, Spanish, or multilingual), saving storage and memory.
- Language-Specific Knowledge: Each language has unique grammar, syntax, and vocabulary. A model like en_core_web_sm is trained on English text to capture these nuances, enabling spaCy to process English accurately.
- Pre-Trained Efficiency: The model contains pre-trained weights from large datasets, allowing spaCy to perform complex tasks without requiring users to train models from scratch, which would be computationally expensive and time-consuming.
- Scalability: Different models (e.g., en_core_web_sm for small, en_core_web_lg for large) offer trade-offs between speed, memory usage, and accuracy, letting users choose based on their needs.

**Components of en_core_web_sm for Text Processing**
The en_core_web_sm model includes several components that help spaCy understand and process English text. These components provide the structured information a computer needs to interpret human language:

1. Tokenizer:
   - Purpose: Breaks text into individual tokens (words, punctuation, etc.).
   - Why it's needed: English text arrives as a continuous stream of characters. The tokenizer uses rules and patterns to segment it into meaningful units (e.g., splitting "I'm running!" into ["I", "'m", "running", "!"]). This is the foundation for all further analysis, as computers need discrete units to process language. The tokenization rules ship with spaCy's English language data; the model's trained components then operate on the resulting tokens.
   - Example: For "spaCy's great.", the tokenizer separates the clitic from its host word, giving ["spaCy", "'s", "great", "."].

2. Part-of-Speech (POS) Tagger:
   - Purpose: Assigns grammatical categories (e.g., noun, verb, adjective) to each token.
   - Why it's needed: Understanding the role of each word in a sentence (e.g., "run" as a verb vs. a noun in "a morning run") helps the computer grasp sentence structure and meaning. This is critical for tasks like dependency parsing or text analysis.
   - Example: In "The cat sleeps", the model tags "cat" as a noun and "sleeps" as a verb.

3. Dependency Parser:
   - Purpose: Analyzes the grammatical structure of a sentence by identifying relationships between words (e.g., subject, object, modifier).
   - Why it's needed: English sentences follow syntactic rules that determine meaning (e.g., "The dog chased the cat" vs. "The cat chased the dog"). The parser builds a tree of dependencies, helping the computer understand how words relate, which is essential for tasks like question answering or translation.
   - Example: In "She loves coding", the parser links "loves" to "She" (subject) and "coding" (object).

4. Named Entity Recognizer (NER):
   - Purpose: Identifies and classifies named entities like people, organizations, or locations in text.
   - Why it's needed: English text often contains proper nouns (e.g., "Apple" as a company vs. a fruit) that carry specific meaning. Recognizing these entities helps with information extraction and context understanding, crucial for search engines or chatbots.
   - Example: In "Elon Musk founded xAI", the NER should ideally tag "Elon Musk" as a person and "xAI" as an organization, although a small model can miss newer names.

5. Word Vectors (not included in en_core_web_sm):
   - Purpose: Provide numerical representations of words based on their semantic meaning.
   - Why it's needed: Word vectors capture relationships between words (e.g., "king" is close to "queen" in meaning), giving the computer a way to quantify word meanings for text similarity or machine learning applications. To stay small, en_core_web_sm does not ship static word vectors; it only exposes context-sensitive tensors, so similarity scores are less reliable than with en_core_web_md or en_core_web_lg.
   - Example: With a vector-bearing model, the system can tell that "big" and "large" are semantically similar.

6. Lemmatizer:
   - Purpose: Reduces words to their base or dictionary form (e.g., "running" → "run").
   - Why it's needed: English words often have multiple forms (plurals, tenses). Lemmatization standardizes them, making it easier for the computer to recognize that "ran" and "running" refer to the same concept, improving search or text analysis accuracy.
   - Example: For "cats", the lemmatizer outputs "cat".

7. Sentence Boundary Detector:
   - Purpose: Identifies sentence boundaries in a block of text.
   - Why it's needed: A period does not always end a sentence (e.g., in abbreviations like "Dr."). Detecting sentences helps the computer process text in meaningful chunks, crucial for tasks like summarization or dialogue systems. In en_core_web_sm, sentence boundaries are derived from the dependency parse by default.
   - Example: Splits "I love coding. It's fun!" into two sentences.
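
The "Dr." problem mentioned for sentence boundary detection can be illustrated with a small sketch. This is not how spaCy or NLTK detect sentences; the function names and the short `ABBREVIATIONS` set are hypothetical and exist only to show why a bare "split on the period" rule is not enough.

```python
import re

# Hypothetical, deliberately short abbreviation list for illustration only.
ABBREVIATIONS = {"Dr.", "Prof.", "Inc.", "Corp."}

def naive_split(text: str) -> list:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def abbreviation_aware_split(text: str) -> list:
    """Same idea, but a word from ABBREVIATIONS never closes a sentence."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith's research is groundbreaking! She published 3 papers in 2023."
print(naive_split(text))               # "Dr." becomes its own "sentence"
print(abbreviation_aware_split(text))  # two sentences
```

A fixed abbreviation list does not scale to real text, which is one reason libraries rely on trained or statistically derived boundary detection instead.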

**What a Computer Needs to Understand English Text**
To process English text effectively, a computer needs:

- Lexical Knowledge: Understanding words, their forms (e.g., plurals, tenses), and meanings, including disambiguating homonyms (e.g., "bank" as riverbank vs. financial institution).

- Syntactic Structure: Knowledge of grammar rules to parse sentence structure and relationships between words.

- Semantic Context: Ability to infer meaning based on context, including resolving ambiguities and understanding idioms or cultural references.

- Pragmatic Understanding: Insight into intent, tone, or implied meaning (e.g., sarcasm), though this is less developed in models like en_core_web_sm.

- Cultural Nuances: Awareness of slang, regional variations, or evolving language trends to handle diverse English dialects.

The en_core_web_sm model equips spaCy with these capabilities (to varying degrees) by providing pre-trained data and rules specific to English, enabling tasks like text analysis, entity extraction, or chatbot development. Larger models like en_core_web_lg add static word vectors and offer higher accuracy but require more resources, while en_core_web_sm balances efficiency and functionality for common use cases.

**Reference:**
https://spacy.io/usage/models


In [2]:
# Step 2: Import Libraries and Download NLTK Data
import nltk
import spacy
import string
import re
from collections import Counter

# Download essential NLTK data
print("📦 Downloading NLTK data packages...")
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For stop word removal
nltk.download('wordnet')    # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging

print("\n✅ All imports and downloads completed!")

📦 Downloading NLTK data packages...


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.



✅ All imports and downloads completed!


## 📂 Part 2: Sample Text Data

We'll work with different types of text to understand how preprocessing affects various text styles:
- Simple text
- Academic text (with citations, URLs)
- Social media text (with emojis, hashtags)
- News text (formal writing)
- Product reviews (informal, ratings)

In [3]:
# Step 3: Load Sample Texts
simple_text = "Natural Language Processing is a fascinating field of AI. It's amazing!"

academic_text = """
Dr. Smith's research on machine-learning algorithms is groundbreaking!
She published 3 papers in 2023, focusing on deep neural networks (DNNs).
The results were amazing - accuracy improved by 15.7%!
"This is revolutionary," said Prof. Johnson.
Visit https://example.com for more info. #NLP #AI @university
"""

social_text = "OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍"

news_text = """
The stock market experienced significant volatility today, with tech stocks leading the decline.
Apple Inc. (AAPL) dropped 3.2%, while Microsoft Corp. fell 2.8%.
"We're seeing a rotation out of growth stocks," said analyst Jane Doe from XYZ Capital.
"""

review_text = """
This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The battery life is incredible - lasts 8-10 hours easily.
Only complaint: the keyboard could be better. Overall rating: 4.5/5 stars.
"""

# Store all texts
sample_texts = {
    "Simple": simple_text,
    "Academic": academic_text.strip(),
    "Social Media": social_text,
    "News": news_text.strip(),
    "Product Review": review_text.strip()
}

print("📄 Sample texts loaded successfully!")
for name, text in sample_texts.items():
    preview = text[:80] + "..." if len(text) > 80 else text
    print(f"\n🏷️ {name}: {preview}")

📄 Sample texts loaded successfully!

🏷️ Simple: Natural Language Processing is a fascinating field of AI. It's amazing!

🏷️ Academic: Dr. Smith's research on machine-learning algorithms is groundbreaking!
She publi...

🏷️ Social Media: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yu...

🏷️ News: The stock market experienced significant volatility today, with tech stocks lead...

🏷️ Product Review: This laptop is absolutely fantastic! I've been using it for 6 months and it's st...


### 🤔 Conceptual Question 3
**Looking at the different text types we've loaded, what preprocessing challenges do you anticipate for each type? For each text type below, identify at least 2 specific preprocessing challenges and explain why they might be problematic for NLP analysis.**

**Simple Text Challenges:**

1. Contractions Handling: The simple text contains "It's". A tokenizer must decide whether to keep a contraction whole or split it: NLTK and spaCy both split it into "It" and "'s", while a plain whitespace split keeps "It's" (and even "amazing!") as one token. Mixing conventions leads to inconsistent token counts, and a bare "'s" token is ambiguous between "is", "has", and the possessive marker.

Problem: Inconsistent tokenization can affect feature extraction, reducing model accuracy in tasks requiring precise word boundaries.

2. Punctuation Sensitivity: The text includes punctuation like "!" and "." (e.g., "It’s amazing!"). Removing punctuation during cleaning might strip emphasis or sentence boundaries, which are crucial for sentiment detection or sentence segmentation.

Problem: Losing punctuation can obscure emotional intensity or sentence structure, impacting tasks like text summarization.

**Academic Text Challenges:**

1. Specialized Terminology: The academic text includes terms like "machine-learning" and "DNNs". Tokenizers may split hyphenated words incorrectly (e.g., "machine", "-", "learning"), and lemmatization might not recognize domain-specific terms.

Problem: Incorrect tokenization or normalization can fragment key concepts, reducing accuracy in tasks like information retrieval or topic modeling.

2. Mixed Content: The text contains URLs, citations, and hashtags (e.g., "https://example.com", "#NLP"). Advanced cleaning might remove these, losing context or metadata critical for tasks like citation analysis or social media integration.

Problem: Removing URLs or hashtags can hinder tracking sources or trends, affecting applications like academic search engines.

**Social Media Text Challenges:**

1. Emojis and Hashtags: The social media text includes emojis (☕️, 😍) and hashtags (#coffee, #yum). Advanced cleaning removes these (Step 14), potentially losing sentiment or topic indicators.

Problem: Emojis and hashtags convey emotion or categorize content, and their removal can degrade performance in sentiment analysis or trend detection.

2. Informal Language: Words like "OMG" and "SO GOOD!!!" reflect slang and emphasis. Tokenizers or lemmatizers may struggle with non-standard terms or excessive punctuation, leading to incorrect splits or normalization.

Problem: Misprocessing informal language can distort meaning, impacting tasks like social media monitoring.

**News Text Challenges:**

1. Named Entities: The news text includes entities like "Apple Inc." and "Jane Doe" (Step 3). Cleaning steps (e.g., lowercase conversion) or tokenization might disrupt entity boundaries, complicating named entity recognition (NER).

Problem: Incorrect entity handling can lead to errors in information extraction, such as misidentifying companies or people.

2. Formal Structure: The text uses quotes and formal phrasing (e.g., "We're seeing..."). Removing punctuation or stop words (Step 9) might obscure quotation boundaries or syntactic roles, affecting tasks like quote attribution.

Problem: Loss of structural cues can hinder applications requiring precise context, like news summarization.

**Product Review Challenges:**

1. Numbers and Ratings: The review text includes numbers like "6 months" and "4.5/5 stars" (Step 3). Basic cleaning removes numbers (Step 13), losing quantitative details critical for review analysis.

Problem: Removing numbers can obscure key information, reducing accuracy in tasks like rating prediction.

2. Mixed Sentiment: The text expresses both positive ("fantastic") and negative ("could be better") sentiments. Stop word removal (Step 8) might eliminate words like "not" or "only," flipping sentiment polarity.

Problem: Altering sentiment cues can mislead sentiment analysis, leading to incorrect customer feedback interpretation.

**Reference:** https://spacy.io/usage/linguistic-features#tokenization
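
Before deciding what to clean, it can help to list the special elements a text actually contains. The sketch below is one rough way to do that; the `PATTERNS` dictionary and the `audit` helper are hypothetical names, and the regular expressions are deliberately simple approximations rather than robust URL or hashtag matchers.

```python
import re

# Hypothetical, simplified patterns; production cleaning rules need more care.
PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "hashtag": re.compile(r"#\w+"),
    "mention": re.compile(r"@\w+"),
    "number": re.compile(r"\d+(?:\.\d+)?"),
}

def audit(text: str) -> dict:
    """Return every match of each pattern so it can be reviewed before cleaning."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

# Lines taken from the academic and review sample texts in Step 3.
academic_line = "Visit https://example.com for more info. #NLP #AI @university"
review_line = "Only complaint: the keyboard could be better. Overall rating: 4.5/5 stars."

print(audit(academic_line))
print(audit(review_line))
```

Running the audit first makes the trade-off explicit: the rating "4.5" and the hashtags are exactly the items that a blanket "remove numbers and symbols" step would discard.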


## 🔤 Part 3: Tokenization

### What is Tokenization?
Tokenization is the process of breaking down text into smaller, meaningful units called **tokens**. These tokens are typically words, but can also be sentences, characters, or subwords.

### Why is it Important?
- Most NLP algorithms work with individual tokens, not entire texts
- It's the foundation for all subsequent preprocessing steps
- Different tokenization strategies can significantly impact results

### Common Challenges:
- **Contractions:** "don't" → "do" + "n't" or "don't"?
- **Punctuation:** Keep with words or separate?
- **Special characters:** How to handle @, #, URLs?
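
The three design choices listed above can be sketched with a small regex tokenizer. This is an illustrative toy, not the algorithm used by NLTK or spaCy; `TOKEN_RULES` and `tokenize` are hypothetical names, and the rules assume straight apostrophes (') rather than typographic ones.

```python
import re

# Rules are tried left to right, so the specific ones come first.
TOKEN_RULES = [
    r"https?://\S+",    # URLs stay whole
    r"[#@]\w+",         # hashtags and mentions stay whole
    r"\w+?(?=n't\b)",   # stem before a negative clitic: "do" in "don't"
    r"n't\b",           # the negative clitic itself
    r"'\w+",            # other clitics such as 's, 're, 'll
    r"\w+",             # ordinary words and numbers
    r"[^\w\s]",         # any remaining symbol becomes its own token
]
TOKEN_PATTERN = re.compile("|".join(TOKEN_RULES))

def tokenize(text: str) -> list:
    """Return the tokens matched by TOKEN_PATTERN, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)

sentence = "I don't like #NLP!"
print(sentence.split())    # whitespace baseline: ['I', "don't", 'like', '#NLP!']
print(tokenize(sentence))  # ['I', 'do', "n't", 'like', '#NLP', '!']
```

Changing the order or content of `TOKEN_RULES` changes the answer to each question in the list, which is why tokenizers that look similar can still disagree on the same input.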

In [4]:
# Step 4: Tokenization with NLTK
from nltk.tokenize import word_tokenize, sent_tokenize

# Test on simple text
print("🔍 NLTK Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Download the missing resource
import nltk
print("📦 Downloading missing NLTK data package...")
nltk.download('punkt_tab')
print("✅ Download completed!")

# Word tokenization
nltk_tokens = word_tokenize(simple_text)
print(f"\nWord tokens: {nltk_tokens}")
print(f"Number of tokens: {len(nltk_tokens)}")

# Sentence tokenization
sentences = sent_tokenize(simple_text)
print(f"\nSentences: {sentences}")
print(f"Number of sentences: {len(sentences)}")

🔍 NLTK Tokenization Results
Original: Natural Language Processing is a fascinating field of AI. It's amazing!
📦 Downloading missing NLTK data package...


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


✅ Download completed!

Word tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
Number of tokens: 14

Sentences: ['Natural Language Processing is a fascinating field of AI.', "It's amazing!"]
Number of sentences: 2


### 🤔 Conceptual Question 4
**Examine the NLTK tokenization results above. How did NLTK handle the contraction "It's"? What happened to the punctuation marks? Do you think this approach is appropriate for all NLP tasks? Explain your reasoning.**

**How "It's" was handled:**
NLTK's word_tokenize on the simple text "Natural Language Processing is a fascinating field of AI. It's amazing!" splits "It's" into two tokens: ["It", "'s"]. This follows the Penn Treebank convention, in which a clitic such as "'s" is separated from its host word but stays attached to its apostrophe.

**Punctuation treatment:**
Punctuation marks (".", "!") are separated as individual tokens. The output is: ["Natural", "Language", ..., "AI", ".", "It", "'s", "amazing", "!"], 14 tokens in total.

**Appropriateness for different tasks:**

- Suitable Tasks:

a) Information Retrieval: Splitting "It's" into "It" and "'s" lets a query for "it" match the contraction, increasing recall. Separating punctuation means "AI." and "AI" are indexed as the same word form.

b) Basic Text Analysis: Tasks like word frequency counting benefit from isolated punctuation, as it avoids conflating "word" and "word!"

c) Syntactic Analysis: Taggers and parsers trained on Treebank-style data expect clitics as separate tokens, so "'s" can be analyzed as its own word (here, a form of "be").

- Less Suitable Tasks (without extra processing):

a) Sentiment Analysis: The bare "'s" token is ambiguous between "is", "has", and the possessive marker, so it adds little to a bag-of-words model. More importantly, later cleaning steps that drop punctuation or clitics can remove emphasis ("!") or negation ("n't").

b) Text with hashtags, mentions, or URLs: word_tokenize separates "#" from the word that follows it, so social media markers need to be rejoined before they can be used as features.

Reasoning: NLTK's word_tokenize is rule-based and general-purpose, and it applies the same rules regardless of the downstream task. For tasks that depend on contractions being expanded or on special tokens staying intact, this approach needs task-specific post-processing.

**Reference:** https://www.nltk.org/api/nltk.tokenize.html
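
When a task needs contractions as single units, one possible post-processing step is to glue clitic tokens back onto the preceding word. The sketch below applies that idea to the token list printed in Step 4; `CLITICS` and `merge_clitics` are hypothetical names, and the clitic set is a small illustrative sample rather than a complete list.

```python
# Hypothetical sample of Treebank-style clitic tokens (straight apostrophes only).
CLITICS = {"'s", "n't", "'re", "'ve", "'ll", "'d", "'m"}

def merge_clitics(tokens: list) -> list:
    """Attach each clitic token to the token that precedes it."""
    merged = []
    for token in tokens:
        if token in CLITICS and merged:
            merged[-1] += token
        else:
            merged.append(token)
    return merged

# Token list copied from the NLTK output in Step 4.
nltk_tokens = ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating',
               'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']

merged = merge_clitics(nltk_tokens)
print(merged)
print(len(nltk_tokens), "->", len(merged))
```

This sketch merges possessive "'s" as well as the verb form, so it restores the surface text but does not resolve which meaning was intended.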

In [5]:
# Step 5: Tokenization with spaCy
nlp = spacy.load('en_core_web_sm')

print("🔍 spaCy Tokenization Results")
print("=" * 40)
print(f"Original: {simple_text}")

# Process with spaCy
doc = nlp(simple_text)

# Extract tokens
spacy_tokens = [token.text for token in doc]
print(f"\nWord tokens: {spacy_tokens}")
print(f"Number of tokens: {len(spacy_tokens)}")

# Show detailed token information
print(f"\n🔬 Detailed Token Analysis:")
print(f"{'Token':<12} {'POS':<8} {'Lemma':<12} {'Is Alpha':<8} {'Is Stop':<8}")
print("-" * 50)
for token in doc:
    print(f"{token.text:<12} {token.pos_:<8} {token.lemma_:<12} {token.is_alpha:<8} {token.is_stop:<8}")

🔍 spaCy Tokenization Results
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Word tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
Number of tokens: 14

🔬 Detailed Token Analysis:
Token        POS      Lemma        Is Alpha Is Stop 
--------------------------------------------------
Natural      PROPN    Natural      1        0       
Language     PROPN    Language     1        0       
Processing   NOUN     processing   1        0       
is           AUX      be           1        1       
a            DET      a            1        1       
fascinating  ADJ      fascinating  1        0       
field        NOUN     field        1        0       
of           ADP      of           1        1       
AI           PROPN    AI           1        0       
.            PUNCT    .            0        0       
It           PRON     it           1        1       
's           AUX     

### 🤔 Conceptual Question 5
**Compare the NLTK and spaCy tokenization results. What differences do you notice? Which approach do you think would be better for different NLP tasks? Consider specific examples like sentiment analysis vs. information extraction.**

**Key differences observed:**

- Contractions: NLTK and spaCy both split "It's" into ["It", "'s"].

- Punctuation: Both separate punctuation (e.g., ".", "!") as individual tokens.

- Token Count: Both produce 14 tokens, and for this sentence the two token lists are identical.

- Additional Features: This is the real difference. spaCy returns token objects that already carry POS tags, lemmas, and stop word flags, while NLTK's word_tokenize returns plain strings and leaves tagging and lemmatization to separate calls.

**Better for sentiment analysis:**

- spaCy: Because the tokens are the same, the advantage comes from the annotations. POS tags help identify sentiment-bearing words (e.g., "fascinating" is tagged ADJ in the output above), and lemmas group inflected forms together.

- Example: In "It's not great," the word "not" carries the sentiment. Both libraries' default stop word lists include "not", so with either one the pipeline must be configured to keep negations.

**Better for information extraction:**

- spaCy: The en_core_web_sm pipeline runs dependency parsing and NER on the same tokens in a single pass, which supports extracting entities (e.g., "Apple Inc." as an organization) and the relationships between them. With NLTK, the same result requires chaining separate tokenization, tagging, and chunking steps.

- Example: Extracting "Dr. Smith" from the academic text is more convenient with spaCy, since entity recognition is part of the loaded pipeline.

**Overall assessment:**

spaCy is better for tasks requiring context and linguistic structure (e.g., sentiment analysis, information extraction) because its pipeline attaches linguistic annotations to every token; on this sentence, tokenization itself is not the differentiator. NLTK is suitable for simpler tasks like word counting or search indexing, where only the token strings are needed.

**Reference:** https://spacy.io/usage/linguistic-features#tokenization,
https://www.nltk.org/api/nltk.tokenize.html
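
Token lists are easy to misread by eye, so a position-by-position comparison is a useful check. The sketch below compares the two lists printed in Steps 4 and 5 and, for contrast, a plain whitespace split of the same sentence; `token_differences` is a hypothetical helper name.

```python
from itertools import zip_longest

def token_differences(left: list, right: list) -> list:
    """Return (index, left_token, right_token) for every position that differs."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip_longest(left, right))
        if a != b
    ]

simple_text = "Natural Language Processing is a fascinating field of AI. It's amazing!"

# Token lists copied from the Step 4 (NLTK) and Step 5 (spaCy) outputs.
nltk_tokens = ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating',
               'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
spacy_tokens = ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating',
                'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']

print(token_differences(nltk_tokens, spacy_tokens))            # [] means identical
print(token_differences(simple_text.split(), nltk_tokens)[0])  # first divergence
```

A positional comparison like this only pinpoints the first divergence reliably; once the lists drift out of alignment, later entries differ for that reason alone.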

---

In [6]:
# Step 6: Test Tokenization on Complex Text
print("🧪 Testing on Social Media Text")
print("=" * 40)
print(f"Original: {social_text}")

# NLTK approach
social_nltk_tokens = word_tokenize(social_text)
print(f"\nNLTK tokens: {social_nltk_tokens}")

# spaCy approach
social_doc = nlp(social_text)
social_spacy_tokens = [token.text for token in social_doc]
print(f"spaCy tokens: {social_spacy_tokens}")

print(f"\n📊 Comparison:")
print(f"NLTK token count: {len(social_nltk_tokens)}")
print(f"spaCy token count: {len(social_spacy_tokens)}")

🧪 Testing on Social Media Text
Original: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍

NLTK tokens: ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop', '☕️', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend', '👍', '#', 'coffee', '#', 'yum', '😍']
spaCy tokens: ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop', '☕', '️', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend', '👍', '#', 'coffee', '#', 'yum', '😍']

📊 Comparison:
NLTK token count: 22
spaCy token count: 23


### 🤔 Conceptual Question 6
**Looking at how the libraries handled social media text (emojis, hashtags), which library seems more robust for handling "messy" real-world text? What specific advantages do you notice? How might this impact a real-world application like social media sentiment analysis?**

**More robust library:** spaCy, though mainly for its per-token annotations; on raw tokenization the two libraries are nearly tied.

For the social media text "OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍":

- NLTK Tokens: ["OMG", "!", "Just", ..., "☕️", "SO", "GOOD", "!", "!", "!", "Highly", "recommend", "👍", "#", "coffee", "#", "yum", "😍"] (22 tokens).
- spaCy Tokens: the same sequence, except that "☕️" is split into "☕" plus a separate token for the invisible variation selector (U+FE0F) (23 tokens).

**Specific advantages:**

1. Hashtag Handling: Neither library keeps hashtags whole by default; both split "#coffee" into "#" and "coffee". Hashtags must be rejoined in post-processing, or the tokenizer customized, before they can serve as topic features.

2. Informal Text: Both keep "OMG", "SO", and "GOOD" intact and split "!!!" into three "!" tokens, so emphasis can still be counted.

3. Emoji Support: Both retain 👍 and 😍 as single tokens. NLTK keeps "☕️" together, while spaCy splits off the variation selector, which accounts for the one-token difference.

4. Per-Token Annotations: spaCy attaches POS tags, lemmas, and stop word flags to every token in the same pass, which makes downstream filtering easier. This, rather than the token boundaries, is spaCy's main advantage here.

**Impact on sentiment analysis:**

- Shared Strength: Emojis survive tokenization in both libraries, so sentiment cues such as 😍 remain available as features, provided later cleaning steps do not strip them.

- Shared Limitation: Splitting hashtags fragments topic markers, because "#" and "coffee" are no longer associated. A model might then miss topic-sentiment connections unless the hashtag is rebuilt.

- Real-World Example: In analyzing tweets for a coffee brand, a pipeline built on either library should rejoin "#" with the following word so that "#yum" and "😍" both reach the classifier, capturing positive sentiment and brand-related topics.

**Reference:** https://spacy.io/usage/linguistic-features#tokenization
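
Both token lists in Step 6 separate "#" from the hashtag word, and spaCy additionally emits the emoji variation selector as its own token. One possible repair is a small post-processing pass, sketched below on the lists copied from that output. The helper name `merge_social_tokens` is hypothetical, and the coffee emoji is written with explicit escapes so the invisible selector is visible in the code.

```python
VARIATION_SELECTOR_16 = "\ufe0f"  # invisible code point that requests emoji presentation

def merge_social_tokens(tokens: list) -> list:
    """Rejoin '#' with the following word and attach stray variation selectors."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if token == "#" and nxt is not None and nxt.isalnum():
            merged.append("#" + nxt)
            i += 2
        elif token == VARIATION_SELECTOR_16 and merged:
            merged[-1] += token
            i += 1
        else:
            merged.append(token)
            i += 1
    return merged

# Token lists copied from the Step 6 output ("\u2615" is the coffee emoji).
nltk_tokens = ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop',
               '\u2615\ufe0f', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend',
               '👍', '#', 'coffee', '#', 'yum', '😍']
spacy_tokens = ['OMG', '!', 'Just', 'tried', 'the', 'new', 'coffee', 'shop',
                '\u2615', '\ufe0f', 'SO', 'GOOD', '!', '!', '!', 'Highly', 'recommend',
                '👍', '#', 'coffee', '#', 'yum', '😍']

print(len(nltk_tokens), len(spacy_tokens))
print(merge_social_tokens(spacy_tokens))
print(merge_social_tokens(nltk_tokens) == merge_social_tokens(spacy_tokens))
```

After this pass the two libraries yield the same token sequence, which suggests that the differences seen in Step 6 are a matter of post-processing rather than of tokenizer quality.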

---

## 🛑 Part 4: Stop Words Removal

### What are Stop Words?
Stop words are common words that appear frequently in a language but typically don't carry much meaningful information about the content. Examples include "the", "is", "at", "which", "on", etc.

### Why Remove Stop Words?
1. **Reduce noise** in the data
2. **Improve efficiency** by reducing vocabulary size
3. **Focus on content words** that carry semantic meaning

### When NOT to Remove Stop Words?
- **Sentiment analysis:** "not good" vs "good" - the "not" is crucial!
- **Question answering:** "What is the capital?" - "what" and "is" provide context
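
The "not good" caveat above can be demonstrated with a few lines of code. In this sketch, `STOP_WORDS` is a tiny hypothetical list (not NLTK's or spaCy's), and the `keep` parameter is an assumed way of protecting selected words from removal.

```python
# Hypothetical miniature stop list for illustration only.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "this", "what", "not"}
NEGATIONS = frozenset({"not", "no", "never"})

def remove_stop_words(tokens: list, keep: frozenset = frozenset()) -> list:
    """Drop stop words, except those explicitly listed in `keep`."""
    return [
        token for token in tokens
        if token.lower() not in STOP_WORDS or token.lower() in keep
    ]

tokens = ["This", "movie", "is", "not", "good"]

print(remove_stop_words(tokens))                  # negation is lost
print(remove_stop_words(tokens, keep=NEGATIONS))  # negation is preserved
```

Without the `keep` set, the negative review is reduced to the same tokens as a positive one, which is exactly the failure mode to avoid in sentiment analysis.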

In [7]:
# Step 7: Explore Stop Words Lists
from nltk.corpus import stopwords

# Get NLTK English stop words
nltk_stopwords = set(stopwords.words('english'))
print(f"📊 NLTK has {len(nltk_stopwords)} English stop words")
print(f"First 20: {sorted(list(nltk_stopwords))[:20]}")

# Get spaCy stop words
spacy_stopwords = nlp.Defaults.stop_words
print(f"\n📊 spaCy has {len(spacy_stopwords)} English stop words")
print(f"First 20: {sorted(list(spacy_stopwords))[:20]}")

# Compare the lists
common_stopwords = nltk_stopwords.intersection(spacy_stopwords)
nltk_only = nltk_stopwords - spacy_stopwords
spacy_only = spacy_stopwords - nltk_stopwords

print(f"\n🔍 Comparison:")
print(f"Common stop words: {len(common_stopwords)}")
print(f"Only in NLTK: {len(nltk_only)} - Examples: {sorted(list(nltk_only))[:5]}")
print(f"Only in spaCy: {len(spacy_only)} - Examples: {sorted(list(spacy_only))[:5]}")

📊 NLTK has 198 English stop words
First 20: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

📊 spaCy has 326 English stop words
First 20: ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also']

🔍 Comparison:
Common stop words: 123
Only in NLTK: 75 - Examples: ['ain', 'aren', "aren't", 'couldn', "couldn't"]
Only in spaCy: 203 - Examples: ["'d", "'ll", "'m", "'re", "'s"]
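
The counts printed above can be sanity-checked with basic set arithmetic: each list's size should equal the shared words plus the words unique to that list. The snippet below is a small sketch that hard-codes the numbers from this run (they may differ with other NLTK or spaCy versions); the variable names are chosen here for illustration.

```python
# Counts copied from the output of this run.
nltk_total, spacy_total = 198, 326
common, nltk_only, spacy_only = 123, 75, 203

# |A| = |A ∩ B| + |A − B| must hold for both lists.
assert common + nltk_only == nltk_total
assert common + spacy_only == spacy_total

# Size of the combined (union) list.
union_size = common + nltk_only + spacy_only
print(union_size)  # 401
```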


### 🤔 Conceptual Question 7
**Why do you think NLTK and spaCy have different stop word lists? Look at the examples of words that are only in one list - do you agree with these choices? Can you think of scenarios where these differences might significantly impact your NLP results?**

**Reasons for differences:**

In this run, NLTK has 198 stop words and spaCy has 326, with 123 in common, 75 NLTK-only (e.g., "ain", "aren", "couldn't"), and 203 spaCy-only (e.g., "'d", "'ll", "almost", "amongst").

- Design Philosophy: NLTK’s list is smaller, focusing on high-frequency function words (e.g., "the", "is") for general-purpose tasks like information retrieval. spaCy’s list is broader, including context-sensitive words (e.g., "almost") to support industrial applications like sentiment analysis, where nuance matters.

- Use Case Focus: spaCy’s list ships with its English language defaults (read here via `nlp.Defaults.stop_words`) and appears aimed at broad, production-style text processing, while NLTK’s list is more traditional and compact, in keeping with its teaching and research focus.

- Language Evolution: spaCy’s list includes contemporary terms (e.g., "via"), reflecting evolving English usage, while NLTK’s is more static.

**Agreement with choices:**

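- NLTK-Only (e.g., "ain", "aren", "couldn't"): Agree, as contraction forms and fragments are often noise in tasks like topic modeling, where content words dominate. Negated forms such as "couldn't" still carry negation, though, so they may be worth keeping for sentiment tasks.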

- spaCy-Only (e.g., "almost", "amongst"): Partially agree. "almost" carries semantic weight in sentiment (e.g., "almost perfect" vs. "perfect"), so its inclusion as a stop word may be overly aggressive. "amongst" is less frequent and reasonable to remove in most cases.

**Scenarios where differences matter:**

- Sentiment Analysis: spaCy’s removal of "almost" could alter sentiment in "almost good" (neutral) vs. "good" (positive), leading to misclassification. NLTK’s retention of "almost" preserves this nuance, improving accuracy.

- Question Answering: spaCy’s broader list removes words like "via," which might be critical in queries like "flights via London." NLTK’s shorter list retains such words, aiding context understanding.

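- Topic Modeling: The two lists cover contraction pieces differently. spaCy’s list includes apostrophe forms such as "'ll" and "'s", while NLTK’s includes fragments such as "aren" and "couldn". Depending on how the tokenizer splits contractions, one list may leave stray fragments behind as tokens, affecting topic coherence in the academic text (Step 3).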

**Reference:** https://spacy.io/api/language#defaults, https://www.nltk.org/api/nltk.corpus.html

In [8]:
# Step 8: Remove Stop Words with NLTK
# Test on simple text
original_tokens = nltk_tokens  # From earlier tokenization
filtered_tokens = [word for word in original_tokens if word.lower() not in nltk_stopwords]

print("🧪 NLTK Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(original_tokens)}): {original_tokens}")
print(f"After removing stop words ({len(filtered_tokens)}): {filtered_tokens}")

# Show which words were removed
removed_words = [word for word in original_tokens if word.lower() in nltk_stopwords]
print(f"\nRemoved words: {removed_words}")

# Calculate reduction percentage
reduction = (len(original_tokens) - len(filtered_tokens)) / len(original_tokens) * 100
print(f"Vocabulary reduction: {reduction:.1f}%")

🧪 NLTK Stop Word Removal
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Original tokens (14): ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
After removing stop words (10): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', '.', "'s", 'amazing', '!']

Removed words: ['is', 'a', 'of', 'It']
Vocabulary reduction: 28.6%


In [9]:
# Step 9: Remove Stop Words with spaCy
doc = nlp(simple_text)
spacy_filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]

print("🧪 spaCy Stop Word Removal")
print("=" * 40)
print(f"Original: {simple_text}")
print(f"\nOriginal tokens ({len(spacy_tokens)}): {spacy_tokens}")
print(f"After removing stop words & punctuation ({len(spacy_filtered)}): {spacy_filtered}")

# Show which words were removed
spacy_removed = [token.text for token in doc if token.is_stop or token.is_punct]
print(f"\nRemoved words: {spacy_removed}")

# Calculate reduction percentage
spacy_reduction = (len(spacy_tokens) - len(spacy_filtered)) / len(spacy_tokens) * 100
print(f"Vocabulary reduction: {spacy_reduction:.1f}%")

🧪 spaCy Stop Word Removal
Original: Natural Language Processing is a fascinating field of AI. It's amazing!

Original tokens (14): ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', '.', 'It', "'s", 'amazing', '!']
After removing stop words & punctuation (7): ['Natural', 'Language', 'Processing', 'fascinating', 'field', 'AI', 'amazing']

Removed words: ['is', 'a', 'of', '.', 'It', "'s", '!']
Vocabulary reduction: 50.0%


### 🤔 Conceptual Question 8
**Compare the NLTK and spaCy stop word removal results. Which approach removed more words? Do you think removing punctuation (as spaCy did) is always a good idea? Give a specific example where keeping punctuation might be important for NLP analysis.**

**Which removed more:**

For simple text:

- NLTK: Original tokens: 14, after stop word removal: 10 (filtered: ["Natural", "Language", "Processing", "fascinating", "field", "AI", ".", "'s", "amazing", "!"]). Removed: ["is", "a", "of", "It"]. Reduction: 28.6%.

- spaCy: Original tokens: 14, after stop word and punctuation removal: 7 (filtered: ["Natural", "Language", "Processing", "fascinating", "field", "AI", "amazing"]). Removed: ["is", "a", "of", ".", "It", "'s", "!"]. Reduction: 50.0%.

- Comparison: spaCy removed more tokens (50.0% vs. 28.6%) because the Step 9 filter drops punctuation as well as stop words, and spaCy's list contains "'s", which NLTK's list does not. The NLTK filter in Step 8 only removes stop words.

**Punctuation removal assessment:**

Removing punctuation, as spaCy does, is not always ideal. Punctuation carries structural or semantic information in certain contexts, and its removal can lead to loss of meaning or ambiguity.

**Example where punctuation matters:**

In chatbot dialogue systems, punctuation like question marks (?) and exclamation marks (!) indicates intent or tone. For example, in "Are you sure?" vs. "Are you sure!", the exclamation mark suggests urgency or surprise. Removing punctuation makes the two indistinguishable, so the chatbot may misinterpret user intent and respond inappropriately (e.g., giving a neutral response instead of addressing urgency). Retaining punctuation allows the system to parse tone and intent more accurately.
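
A middle ground between keeping and stripping all punctuation is to remove everything except the marks that signal tone. The following is a minimal sketch using only the standard library; `TONE_MARKS` and `strip_punct_keep_tone` are hypothetical names introduced here, and which marks count as tone-bearing is an assumption to adapt per task.

```python
import string

# Assumption: only '?' and '!' carry tone for this application.
TONE_MARKS = "?!"
_REMOVE = "".join(ch for ch in string.punctuation if ch not in TONE_MARKS)
_TABLE = str.maketrans("", "", _REMOVE)

def strip_punct_keep_tone(text):
    """Remove ASCII punctuation except the tone-bearing marks."""
    return text.translate(_TABLE)

print(strip_punct_keep_tone("Are you sure?"))        # Are you sure?
print(strip_punct_keep_tone("Well, are you sure!"))  # Well are you sure!
```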

## 🌱 Part 5: Lemmatization and Stemming

### What is Lemmatization?
Lemmatization reduces words to their base or dictionary form (called a **lemma**). It considers context and part of speech to ensure the result is a valid word.

### What is Stemming?
Stemming reduces words to their root form by removing suffixes. It's faster but less accurate than lemmatization.

### Key Differences:
| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| Speed | Fast | Slower |
| Accuracy | Lower | Higher |
| Output | May be non-words | Always valid words |
| Context | Ignores context | Considers context |

### Examples:
- **"running"** → Stem: "run", Lemma: "run"
- **"better"** → Stem: "better", Lemma: "good"
- **"was"** → Stem: "wa", Lemma: "be"
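
The contrast in these examples can be sketched without any NLP library. The toy code below is not the Porter algorithm or spaCy's lemmatizer: `toy_stem` is a crude suffix stripper and `LEMMA_TABLE` is a hypothetical mini-dictionary built from the examples above, included only to show why rule-based stripping can yield non-words while a dictionary lookup handles irregular forms.

```python
# Hypothetical mini-dictionary of irregular forms (illustration only).
LEMMA_TABLE = {
    "running": "run",
    "better": "good",
    "was": "be",
    "feet": "foot",
}

def toy_stem(word):
    """Strip the first matching suffix; no dictionary, no context."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["running", "better", "was", "feet"]:
    print(f"{w:<10} stem={toy_stem(w):<8} lemma={toy_lemma(w)}")
```

Here `toy_stem("running")` returns the non-word "runn" and `toy_stem("was")` returns "wa", while the lookup returns "run" and "be". Real stemmers apply many more rules, and real lemmatizers use part-of-speech information rather than a fixed table.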

In [10]:
# Step 10: Stemming with NLTK
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Test words that demonstrate stemming challenges
test_words = ['running', 'runs', 'ran', 'better', 'good', 'best', 'flying', 'flies', 'was', 'were', 'cats', 'dogs']

print("🌿 Stemming Demonstration")
print("=" * 30)
print(f"{'Original':<12} {'Stemmed':<12}")
print("-" * 25)

for word in test_words:
    stemmed = stemmer.stem(word)
    print(f"{word:<12} {stemmed:<12}")

# Apply to our sample text
sample_tokens = [token for token in nltk_tokens if token.isalpha()]
stemmed_tokens = [stemmer.stem(token.lower()) for token in sample_tokens]

print(f"\n🧪 Applied to sample text:")
print(f"Original: {sample_tokens}")
print(f"Stemmed: {stemmed_tokens}")

🌿 Stemming Demonstration
Original     Stemmed     
-------------------------
running      run         
runs         run         
ran          ran         
better       better      
good         good        
best         best        
flying       fli         
flies        fli         
was          wa          
were         were        
cats         cat         
dogs         dog         

🧪 Applied to sample text:
Original: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'AI', 'It', 'amazing']
Stemmed: ['natur', 'languag', 'process', 'is', 'a', 'fascin', 'field', 'of', 'ai', 'it', 'amaz']


### 🤔 Conceptual Question 9
**Look at the stemming results above. Can you identify any cases where stemming produced questionable results? For example, how were "better" and "good" handled? Do you think this is problematic for NLP applications? Explain your reasoning.**

**Questionable results identified:**

The Porter Stemmer results include:

- "better" → "better" (no change; a stemmer has no way to relate it to "good").

- "good" → "good" (no change).

- "was" → "wa" (non-word stem).

- "flying" → "fli" (non-word stem).

- "were" → "were" (no change; not linked to "was" or "be").

**Assessment of "better" and "good":**

The stemmer left "better" and "good" as distinct stems despite their semantic connection. Stemming applies heuristic suffix-stripping rules without context or dictionary lookup, unlike lemmatization, which can map an irregular form such as "better" to a dictionary form.

**Impact on NLP applications:**

- Problematic Cases:

a) Sentiment Analysis: Treating "better" and "good" as unrelated can misclassify sentiments (e.g., "This is better" vs. "This is good" may be scored differently), reducing model consistency.

b) Information Retrieval: Failing to equate "better" with "good" can miss relevant documents in searches, lowering recall (e.g., searching "good" won’t find "better").

c) Text Clustering: Errors like "flying" → "fli" or "was" → "wa" introduce noise, creating invalid tokens that disrupt cluster coherence.

- Reasoning: Stemming’s simplicity suits tasks prioritizing speed, but its inaccuracies (non-words, missed synonyms) harm applications requiring semantic precision. Lemmatization, as shown in Step 11, is more reliable for such tasks.

**Reference:** https://www.nltk.org/api/nltk.stem.html

In [11]:
# Step 11: Lemmatization with spaCy
print("🌱 spaCy Lemmatization Demonstration")
print("=" * 40)

# Test on a complex sentence
complex_sentence = "The researchers were studying the effects of running and swimming on better performance."
doc = nlp(complex_sentence)

print(f"Original: {complex_sentence}")
print(f"\n{'Token':<15} {'Lemma':<15} {'POS':<10} {'Explanation':<20}")
print("-" * 65)

for token in doc:
    if token.is_alpha:
        explanation = "No change" if token.text.lower() == token.lemma_ else "Lemmatized"
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {explanation:<20}")

# Extract lemmas
lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
print(f"\n🔤 Lemmatized tokens (no stop words): {lemmas}")

🌱 spaCy Lemmatization Demonstration
Original: The researchers were studying the effects of running and swimming on better performance.

Token           Lemma           POS        Explanation         
-----------------------------------------------------------------
The             the             DET        No change           
researchers     researcher      NOUN       Lemmatized          
were            be              AUX        Lemmatized          
studying        study           VERB       Lemmatized          
the             the             DET        No change           
effects         effect          NOUN       Lemmatized          
of              of              ADP        No change           
running         run             VERB       Lemmatized          
and             and             CCONJ      No change           
swimming        swim            VERB       Lemmatized          
on              on              ADP        No change           
better          well          

In [12]:
# Step 12: Compare Stemming vs Lemmatization
comparison_words = ['better', 'running', 'studies', 'was', 'children', 'feet']

print("⚖️ Stemming vs Lemmatization Comparison")
print("=" * 50)
print(f"{'Original':<12} {'Stemmed':<12} {'Lemmatized':<12}")
print("-" * 40)

for word in comparison_words:
    # Stemming
    stemmed = stemmer.stem(word)

    # Lemmatization with spaCy
    doc = nlp(word)
    lemmatized = doc[0].lemma_

    print(f"{word:<12} {stemmed:<12} {lemmatized:<12}")

⚖️ Stemming vs Lemmatization Comparison
Original     Stemmed      Lemmatized  
----------------------------------------
better       better       well        
running      run          run         
studies      studi        study       
was          wa           be          
children     children     child       
feet         feet         foot        


### 🤔 Conceptual Question 10
**Compare the stemming and lemmatization results. Which approach do you think is more suitable for:**
1. **A search engine** (where speed is crucial and you need to match variations of words)?
2. **A sentiment analysis system** (where accuracy and meaning preservation are important)?
3. **A real-time chatbot** (where both speed and accuracy matter)?

**Explain your reasoning for each choice.**

**1. Search engine:**

- Stemming: Preferred for speed and broad matching. Stemming reduces words to roots (e.g., "running", "runs" → "run"), enabling fast indexing and matching of word variants, increasing recall. The notebook does not time either method, but rule-based stemming needs no model or part-of-speech tagging.

- Reasoning: Search engines prioritize quick responses over semantic precision. Errors like "studies" → "studi" are tolerable if they match related terms.
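
The recall argument for search can be illustrated with a tiny inverted index. This is a sketch under stated assumptions: `strip_plural` is a crude stand-in for a real stemmer (it only drops a trailing "s"), and `build_index` and the three sample documents are made up for the example.

```python
from collections import defaultdict

def strip_plural(word):
    """Crude stand-in for a stemmer: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(docs, normalize):
    """Map each normalized word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return index

docs = ["cats chase dogs", "a cat sleeps", "dog food"]
exact_index = build_index(docs, lambda w: w)
stem_index = build_index(docs, strip_plural)

query = "cat"
exact_hits = exact_index.get(query, set())
stem_hits = stem_index.get(strip_plural(query), set())
print(sorted(exact_hits))  # [1]
print(sorted(stem_hits))   # [0, 1]
```

Normalizing both the documents and the query lets "cat" match the document containing "cats", which is the recall gain described above. The same normalizer must be applied at index time and at query time.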

**2. Sentiment analysis:**

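- Lemmatization: Preferred for accuracy and meaning preservation. Lemmatization maps inflected forms to valid dictionary forms (e.g., "was" → "be", "studies" → "study"), which supports consistent sentiment scoring. Note that in Step 12 spaCy lemmatized "better" to "well", not "good", so comparatives are not automatically merged with their base adjective.

- Reasoning: Sentiment analysis relies on precise word meanings. Stemming’s non-word outputs (e.g., "was" → "wa", "studies" → "studi") and unmerged irregular forms can fragment sentiment features, while lemmatization’s valid outputs tend to give cleaner features for a classifier.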

**3. Real-time chatbot:**

- Lemmatization: Preferred, with optimization for speed. Lemmatization supports accurate intent recognition (e.g., "running" → "run", "was" → "be") while maintaining conversational coherence. spaCy’s lightweight en_core_web_sm balances speed and accuracy.

- Reasoning: Chatbots need both speed (for real-time responses) and accuracy (for understanding user intent). Lemmatization’s context-awareness outweighs stemming’s slight speed advantage, as errors like "was" → "wa" could confuse dialogue systems.

**Comparison:**

- Stemming: "better" → "better", "running" → "run", "studies" → "studi", "was" → "wa", "children" → "children", "feet" → "feet".

- Lemmatization: "better" → "well", "running" → "run", "studies" → "study", "was" → "be", "children" → "child", "feet" → "foot".

- Differences: Lemmatization produces valid words and handles irregular forms (e.g., "children" → "child", "feet" → "foot"), while stemming produces non-words (e.g., "studi", "wa") and leaves irregular forms unchanged.

**Reference:** https://spacy.io/usage/linguistic-features#lemmatization

## 🧹 Part 6: Text Cleaning and Normalization

### What is Text Cleaning?
Text cleaning involves removing or standardizing elements that might interfere with analysis:
- **Case normalization** (converting to lowercase)
- **Punctuation removal**
- **Number handling** (remove, replace, or normalize)
- **Special character handling** (URLs, emails, mentions)
- **Whitespace normalization**
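
For the number-handling option in this list, replacing numbers with a placeholder keeps the fact that a quantity was present without inflating the vocabulary. The snippet below is a minimal sketch; `normalize_numbers` and the `<NUM>` placeholder are hypothetical choices, and the pattern only covers plain integers and simple decimals.

```python
import re

# Matches integers and simple decimals such as 6 or 4.5.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def normalize_numbers(text, placeholder="<NUM>"):
    """Replace each number with a single placeholder token."""
    return NUMBER_PATTERN.sub(placeholder, text)

print(normalize_numbers("Rated 4.5 out of 5 after 6 months"))
# Rated <NUM> out of <NUM> after <NUM> months
```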

### Why is it Important?
- Ensures consistency across your dataset
- Reduces vocabulary size
- Improves model performance
- Handles edge cases in real-world data

In [13]:
# Step 13: Basic Text Cleaning
def basic_clean_text(text):
    """Apply basic text cleaning operations"""
    # Convert to lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove extra spaces again
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test basic cleaning
test_text = "   Hello WORLD!!! This has 123 numbers and   extra spaces.   "
cleaned = basic_clean_text(test_text)

print("🧹 Basic Text Cleaning")
print("=" * 30)
print(f"Original: '{test_text}'")
print(f"Cleaned: '{cleaned}'")
print(f"Length reduction: {(len(test_text) - len(cleaned))/len(test_text)*100:.1f}%")

🧹 Basic Text Cleaning
Original: '   Hello WORLD!!! This has 123 numbers and   extra spaces.   '
Cleaned: 'hello world this has numbers and extra spaces'
Length reduction: 26.2%


In [14]:
# Step 14: Advanced Cleaning for Social Media
def advanced_clean_text(text):
    """Apply advanced cleaning for social media and web text"""
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)

    # Convert hashtags (keep the word, remove #)
    text = re.sub(r'#(\w+)', r'\1', text)

    # Remove emojis (basic approach: these ranges miss symbols such as ☕ U+2615,
    # which is why it survives in the output)
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)

    # Convert to lowercase and normalize whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Test on social media text
print("🚀 Advanced Cleaning on Social Media Text")
print("=" * 45)
print(f"Original: {social_text}")

cleaned_social = advanced_clean_text(social_text)
print(f"Cleaned: {cleaned_social}")
print(f"Length reduction: {(len(social_text) - len(cleaned_social))/len(social_text)*100:.1f}%")

🚀 Advanced Cleaning on Social Media Text
Original: OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍
Cleaned: omg! just tried the new coffee shop ☕️ so good!!! highly recommend coffee yum
Length reduction: 7.2%
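
Note that ☕️ survives in the cleaned output while 👍 and 😍 are removed: the coffee symbol is U+2615 followed by the variation selector U+FE0F, and neither falls inside the four ranges used in `advanced_clean_text`. The sketch below shows one possible extension, assuming the Miscellaneous Symbols and Dingbats blocks (U+2600 to U+27BF) and U+FE0F should also be stripped. `EXTENDED_EMOJI_PATTERN` and `strip_emojis` are names introduced here, and the pattern is still not exhaustive.

```python
import re

EXTENDED_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\u2600-\u27BF"          # misc symbols and dingbats (includes U+2615)
    "\uFE0F"                 # variation selector left behind by some emojis
    "]+"
)

def strip_emojis(text):
    """Remove matched emoji characters, then collapse leftover whitespace."""
    without_emojis = EXTENDED_EMOJI_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_emojis).strip()

sample = "coffee shop \u2615\ufe0f SO GOOD \U0001F44D"
print(strip_emojis(sample))  # coffee shop SO GOOD
```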


### 🤔 Conceptual Question 11
**Look at the advanced cleaning results for the social media text. What information was lost during cleaning? Can you think of scenarios where removing emojis and hashtags might actually hurt your NLP application? What about scenarios where keeping them would be beneficial?**

**Information lost:**

The advanced cleaning function processes the social media text: "OMG! Just tried the new coffee shop ☕️ SO GOOD!!! Highly recommend 👍 #coffee #yum 😍". The cleaned output is: "omg! just tried the new coffee shop ☕️ so good!!! highly recommend coffee yum". The following information is lost:

- Emojis: Removed 👍 and 😍, which convey positive sentiment and emphasis. The ☕️ emoji survived because its code point (U+2615) falls outside the ranges in the function's emoji pattern.

- Hashtags: Removed "#" from "#coffee" and "#yum", leaving only the words "coffee" and "yum". The hashtag symbol, which marks topics, is lost.

- Uppercase: Converted "OMG" and "SO GOOD" to lowercase, losing emphasis.

Punctuation such as "!!!" is kept by `advanced_clean_text`; it is only removed later, by basic cleaning or by the alphabetic-token filter in the pipeline. Stop words (e.g., "the", "so") are also kept at this stage.

The notebook reports a 7.2% length reduction, reflecting the removal of these elements alongside whitespace normalization.

**Scenarios where removal hurts:**

1. Sentiment Analysis: Emojis like 😍 and 👍 carry strong positive sentiment. Removing them could weaken the detected sentiment, leading to misclassification (e.g., "SO GOOD 😍" might be rated as less positive without 😍). This is critical for analyzing social media feedback, where emojis amplify user emotions.

Problem: Loss of sentiment cues reduces accuracy in gauging user opinions, affecting applications like brand monitoring.

2. Topic Modeling: Hashtags like #coffee categorize content by topic. Removing the "#" symbol or treating hashtags as regular words (e.g., "coffee" instead of "#coffee") could make it harder to identify trending topics or group related posts, reducing the effectiveness of topic clustering.

3. Social Media Analytics: Removing hashtags hinders tracking campaigns or influencers (e.g., #coffee might link to a brand promotion), impacting marketing insights.

**Scenarios where keeping helps:**

1. Sentiment Analysis: Keeping emojis allows models to capture emotional tone, improving classification accuracy. For example, distinguishing "Great coffee ☕️😍" (positive) from "Great coffee 😒" (sarcastic) relies on emojis. The notebook’s focus on social media text highlights the importance of emojis in real-world data.

2. Trend Analysis: Hashtags enable tracking of trending topics or campaigns (e.g., #coffee might indicate a viral coffee shop promotion). Retaining hashtags supports applications like social media monitoring or market research by preserving topic markers.

3. Engagement Prediction: Emojis and hashtags correlate with higher engagement (e.g., posts with 😍 or #yum are more shareable). Keeping them as features in a model can improve performance in predicting post virality or user interaction.

4. Context Understanding: Emojis like ☕️ provide contextual clues (coffee-related content), aiding tasks like content recommendation or chatbot responses by ensuring relevant associations.
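
Rather than deleting emojis and hashtags, one option is to pull them out as separate features before cleaning. The sketch below is only an illustration: `EMOJI_NAMES` is a hypothetical two-entry mapping (a real system would need a much larger table or a dedicated library), and `extract_social_features` is a name introduced here.

```python
import re

# Hypothetical mapping from emoji to a readable feature name.
EMOJI_NAMES = {
    "\U0001F60D": "heart_eyes",
    "\U0001F44D": "thumbs_up",
}

def extract_social_features(text):
    """Collect hashtags and known emojis as features, in order of appearance."""
    hashtags = re.findall(r"#(\w+)", text)
    emojis = [EMOJI_NAMES[ch] for ch in text if ch in EMOJI_NAMES]
    return {"hashtags": hashtags, "emojis": emojis}

post = "Highly recommend \U0001F44D #coffee #yum \U0001F60D"
print(extract_social_features(post))
# {'hashtags': ['coffee', 'yum'], 'emojis': ['thumbs_up', 'heart_eyes']}
```

The extracted features can then be passed to a model alongside the cleaned tokens, so the sentiment and topic signals are kept even if the text itself is cleaned aggressively.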

**Reference:** https://spacy.io/usage/processing-pipelines (for text cleaning considerations in NLP pipelines)

## 🔧 Part 7: Building a Complete Preprocessing Pipeline

Now let's combine everything into a comprehensive preprocessing pipeline that you can customize based on your needs.

### Pipeline Components:
1. **Text cleaning** (basic or advanced)
2. **Tokenization** (NLTK or spaCy)
3. **Stop word removal** (optional)
4. **Lemmatization/Stemming** (optional)
5. **Additional filtering** (length, etc.)

In [15]:
# Step 15: Complete Preprocessing Pipeline
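def preprocess_text(text,
                   clean_level='basic',     # 'basic' or 'advanced'
                   remove_stopwords=True,
                   use_lemmatization=True,
                   use_stemming=False,
                   min_length=2):
    """
    Complete text preprocessing pipeline.

    Cleans the text, tokenizes it (spaCy lemmas if use_lemmatization is True,
    otherwise NLTK word_tokenize), optionally removes stop words, optionally
    stems (only applied when use_lemmatization is False), and drops tokens
    shorter than min_length. Returns a list of tokens.
    """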
    # Step 1: Clean text
    if clean_level == 'basic':
        cleaned_text = basic_clean_text(text)
    else:
        cleaned_text = advanced_clean_text(text)

    # Step 2: Tokenize
    if use_lemmatization:
        # Use spaCy for lemmatization
        doc = nlp(cleaned_text)
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha]
    else:
        # Use NLTK for basic tokenization
        tokens = word_tokenize(cleaned_text)
        tokens = [token for token in tokens if token.isalpha()]

    # Step 3: Remove stop words
    if remove_stopwords:
        if use_lemmatization:
            tokens = [token for token in tokens if token not in spacy_stopwords]
        else:
            tokens = [token.lower() for token in tokens if token.lower() not in nltk_stopwords]

    # Step 4: Apply stemming if requested
    if use_stemming and not use_lemmatization:
        tokens = [stemmer.stem(token.lower()) for token in tokens]

    # Step 5: Filter by length
    tokens = [token for token in tokens if len(token) >= min_length]

    return tokens

print("🔧 Preprocessing Pipeline Created!")
print("✅ Ready to test different configurations.")

🔧 Preprocessing Pipeline Created!
✅ Ready to test different configurations.


In [16]:
# Step 16: Test Different Pipeline Configurations
test_text = sample_texts["Product Review"]
print(f"🎯 Testing on: {test_text[:100]}...")
print("=" * 60)

# Configuration 1: Minimal processing
minimal = preprocess_text(test_text,
                         clean_level='basic',
                         remove_stopwords=False,
                         use_lemmatization=False,
                         use_stemming=False)
print(f"\n1. Minimal processing ({len(minimal)} tokens):")
print(f"   {minimal[:10]}...")

# Configuration 2: Standard processing
standard = preprocess_text(test_text,
                          clean_level='basic',
                          remove_stopwords=True,
                          use_lemmatization=True)
print(f"\n2. Standard processing ({len(standard)} tokens):")
print(f"   {standard[:10]}...")

# Configuration 3: Aggressive processing
aggressive = preprocess_text(test_text,
                            clean_level='advanced',
                            remove_stopwords=True,
                            use_lemmatization=False,
                            use_stemming=True,
                            min_length=3)
print(f"\n3. Aggressive processing ({len(aggressive)} tokens):")
print(f"   {aggressive[:10]}...")

# Show reduction percentages
original_count = len(word_tokenize(test_text))
print(f"\n📊 Token Reduction Summary:")
print(f"   Original: {original_count} tokens")
print(f"   Minimal: {len(minimal)} ({(original_count-len(minimal))/original_count*100:.1f}% reduction)")
print(f"   Standard: {len(standard)} ({(original_count-len(standard))/original_count*100:.1f}% reduction)")
print(f"   Aggressive: {len(aggressive)} ({(original_count-len(aggressive))/original_count*100:.1f}% reduction)")

🎯 Testing on: This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast.
The ...

1. Minimal processing (34 tokens):
   ['this', 'laptop', 'is', 'absolutely', 'fantastic', 'ive', 'been', 'using', 'it', 'for']...

2. Standard processing (18 tokens):
   ['laptop', 'absolutely', 'fantastic', 've', 'use', 'month', 'super', 'fast', 'battery', 'life']...

3. Aggressive processing (21 tokens):
   ['laptop', 'absolut', 'fantast', 'use', 'month', 'still', 'super', 'fast', 'batteri', 'life']...

📊 Token Reduction Summary:
   Original: 47 tokens
   Minimal: 34 (27.7% reduction)
   Standard: 18 (61.7% reduction)
   Aggressive: 21 (55.3% reduction)
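
The percentages in this summary all come from the same formula, (original − processed) / original × 100. The helper below reproduces them from the token counts printed above. `reduction_pct` is a hypothetical name, and the guard for an empty input is an added assumption, since the notebook code would raise a `ZeroDivisionError` on empty text.

```python
def reduction_pct(original_count, processed_count):
    """Percentage of tokens removed; 0.0 when there was nothing to process."""
    if original_count == 0:
        return 0.0
    return (original_count - processed_count) / original_count * 100

# Token counts copied from the summary above.
for name, processed in [("Minimal", 34), ("Standard", 18), ("Aggressive", 21)]:
    print(f"{name}: {reduction_pct(47, processed):.1f}%")
# Minimal: 27.7%
# Standard: 61.7%
# Aggressive: 55.3%
```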


### 🤔 Conceptual Question 12
**Compare the three pipeline configurations (Minimal, Standard, Aggressive). For each configuration, analyze:**
1. **What information was preserved?**
2. **What information was lost?**
3. **What type of NLP task would this configuration be best suited for?**



**Minimal Processing:**

- Preserved: Stop words and original word forms (e.g., "this", "is", "been", "using"), since no stop word removal, lemmatization, or stemming is applied. Output: ["this", "laptop", "is", "absolutely", ...]

- Lost: Uppercase (e.g., "This" → "this"), extra whitespace, all punctuation, and numbers (e.g., "!", "6"), which basic cleaning and the alphabetic-token filter remove. Stripping apostrophes also fuses contractions ("I've" → "ive").

- Best for: Question Answering or Dialogue Systems. Retaining stop words and original forms preserves function words and inflection (e.g., "is" for tense, "not" for negation), which helps in understanding queries or generating coherent responses. If punctuation matters for the task, a lighter cleaning step would be needed.

**Standard Processing:**

- Preserved: Content words, lemmatized forms (e.g., "using" → "use", "fantastic" → "fantastic"), alphabetic tokens. Stop words removed, basic cleaning applied.

- Lost: Stop words (e.g., "is", "the"), punctuation (e.g., "!"), numbers (e.g., "6"), uppercase, and non-alphabetic tokens.

- Best for: Sentiment Analysis or Text Classification. Lemmatization ensures consistent word forms, and stop word removal focuses on sentiment-bearing terms (e.g., "fantastic", "fast"), improving feature quality for classifiers.

**Aggressive Processing:**

- Preserved: Stemmed content words (e.g., "fantastic" → "fantast", "using" → "use"), tokens of 3 or more letters, after advanced cleaning (removes URLs, emails, mentions, "#" symbols, and some emojis).

- Lost: Stop words, punctuation, numbers, emojis, uppercase, non-alphabetic tokens, short tokens (<3 letters), and valid word forms due to stemming (e.g., "battery" → "batteri"). This configuration still kept more tokens than Standard (21 vs. 18) because it filters with NLTK's shorter stop word list, which retains words such as "still".

- Best for: Topic Modeling or Search Indexing. Stemming and aggressive cleaning reduce vocabulary size, focusing on core keywords, suitable for clustering topics or fast word matching, where noise reduction outweighs semantic loss.

In [17]:
# Step 17: Comprehensive Analysis Across Text Types
print("🔬 Comprehensive Preprocessing Analysis")
print("=" * 50)

# Test standard preprocessing on all text types
results = {}
for name, text in sample_texts.items():
    original_tokens = len(word_tokenize(text))
    processed_tokens = preprocess_text(text,
                                      clean_level='basic',
                                      remove_stopwords=True,
                                      use_lemmatization=True)

    reduction = (original_tokens - len(processed_tokens)) / original_tokens * 100
    results[name] = {
        'original': original_tokens,
        'processed': len(processed_tokens),
        'reduction': reduction,
        'sample': processed_tokens[:8]
    }

    print(f"\n📄 {name}:")
    print(f"   Original: {original_tokens} tokens")
    print(f"   Processed: {len(processed_tokens)} tokens ({reduction:.1f}% reduction)")
    print(f"   Sample: {processed_tokens[:8]}")

# Summary table
print(f"\n\n📋 Summary Table")
print(f"{'Text Type':<15} {'Original':<10} {'Processed':<10} {'Reduction':<10}")
print("-" * 50)
for name, data in results.items():
    print(f"{name:<15} {data['original']:<10} {data['processed']:<10} {data['reduction']:<10.1f}%")

🔬 Comprehensive Preprocessing Analysis

📄 Simple:
   Original: 14 tokens
   Processed: 7 tokens (50.0% reduction)
   Sample: ['natural', 'language', 'processing', 'fascinating', 'field', 'ai', 'amazing']

📄 Academic:
   Original: 61 tokens
   Processed: 26 tokens (57.4% reduction)
   Sample: ['dr', 'smith', 'research', 'machinelearning', 'algorithm', 'groundbreake', 'publish', 'paper']

📄 Social Media:
   Original: 22 tokens
   Processed: 10 tokens (54.5% reduction)
   Sample: ['omg', 'try', 'new', 'coffee', 'shop', 'good', 'highly', 'recommend']

📄 News:
   Original: 51 tokens
   Processed: 25 tokens (51.0% reduction)
   Sample: ['stock', 'market', 'experience', 'significant', 'volatility', 'today', 'tech', 'stock']

📄 Product Review:
   Original: 47 tokens
   Processed: 18 tokens (61.7% reduction)
   Sample: ['laptop', 'absolutely', 'fantastic', 've', 'use', 'month', 'super', 'fast']


📋 Summary Table
Text Type       Original   Processed  Reduction 
----------------------------------

### 🤔 Final Conceptual Question 13
**Looking at the comprehensive analysis results across all text types:**

1. **Which text type was most affected by preprocessing?** Why do you think this happened?

2. **Which text type was least affected?** What does this tell you about the nature of that text?

3. **If you were building an NLP system to analyze customer reviews for a business, which preprocessing approach would you choose and why?**

4. **What are the main trade-offs you need to consider when choosing preprocessing techniques for any NLP project?**

**Results:**

- Simple: 14 → 7 tokens (50.0% reduction).
- Academic: 61 → 26 tokens (57.4% reduction).
- Social Media: 22 → 10 tokens (54.5% reduction).
- News: 51 → 25 tokens (51.0% reduction).
- Product Review: 47 → 18 tokens (61.7% reduction).

**1. Most affected text type:**

Product Review: 61.7% reduction.

Reason: The review is written in a conversational style that is dense with stop words, contractions, punctuation, and numbers (e.g., "This laptop is absolutely fantastic! I've been using it for 6 months and it's still super fast."). Standard preprocessing (basic cleaning, stop word removal, lemmatization) removes all of these, leaving only 18 content tokens. Academic text was the next most affected (57.4%), since its formal sentences carry many function words (e.g., "the", "of").

**2. Least affected text type:**

Simple: 50.0% reduction, with News close behind at 51.0%.

Reason: The simple sentence is short and made up mostly of content words: 7 of its 14 tokens survive ("natural", "language", "processing", "fascinating", "field", "ai", "amazing"). This suggests that text with a high proportion of content words has less "noise" to remove. Social Media text, despite being short and informal, lost 54.5% of its tokens because basic cleaning and the alphabetic-token filter drop its emojis, "#" symbols, and repeated punctuation.

**3. For customer review analysis:**

Standard processing: basic cleaning, stop word removal, and lemmatization.

Reasoning: Reviews (Step 3, Page 5) contain sentiment-heavy terms (e.g., "fantastic") and ratings (e.g., "4.5/5"). Standard processing normalizes key terms via lemmatization (e.g., "using" → "use") and removes noise (stop words, punctuation), while avoiding over-aggressive stemming and emoji removal. This supports sentiment analysis by focusing on meaningful words while retaining sentiment cues.
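
The three stages of standard processing can be sketched without any external library. This is a minimal illustration, not the lab's pipeline: `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins for the resources NLTK or spaCy would provide, the lemmatizer is a plain dictionary lookup, and the review sentence is made up for the example.

```python
import re

# Hypothetical, deliberately tiny resources. A real pipeline would load a full
# stop word list and a proper lemmatizer from a library such as NLTK or spaCy.
STOP_WORDS = {"the", "is", "i", "it", "a", "and", "for", "this", "been", "ve"}
LEMMAS = {"using": "use", "months": "month", "loved": "love"}


def standard_preprocess(text):
    """Basic cleaning, then stop word removal, then lookup-based lemmatization."""
    text = text.lower()
    # Basic cleaning: replace anything that is not a letter, digit, or space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Lemmatization stand-in: map known inflected forms to their base form.
    return [LEMMAS.get(t, t) for t in tokens]


review = "This laptop is absolutely fantastic! I've been using it for 3 months."
tokens = standard_preprocess(review)
print(tokens)
# ['laptop', 'absolutely', 'fantastic', 'use', '3', 'month']
```

The sentiment-bearing words ("absolutely", "fantastic") survive, while function words and punctuation are dropped.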

**4. Main trade-offs to consider:**

- Information vs. Noise: Aggressive preprocessing reduces noise but risks losing context (e.g., emojis, numbers). Minimal preprocessing preserves information but keeps the noise, which increases computation.

- Accuracy vs. Speed: Lemmatization is more accurate but slower than stemming, which matters for real-time applications.

- Task-Specificity: Stop word removal suits classification but can harm translation. The cleaning level (basic vs. advanced) must match the text type.

- Resource Constraints: Full pipelines (e.g., spaCy) can improve accuracy but typically require more memory than NLTK.
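
The information vs. noise trade-off is easy to see with negation. In the sketch below, an aggressive stop word list removes "not" and the filtered review reads as positive. Both stop word lists are hypothetical and hand-written for this example, not taken from any library.

```python
# Two hypothetical stop word lists, written by hand for this illustration.
AGGRESSIVE_STOP_WORDS = {"the", "is", "a", "was", "not", "no", "but"}
# Keep negation and contrast words, which carry sentiment.
SENTIMENT_SAFE_STOP_WORDS = AGGRESSIVE_STOP_WORDS - {"not", "no", "but"}


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]


review = "the battery is not good"
aggressive = remove_stop_words(review, AGGRESSIVE_STOP_WORDS)
safe = remove_stop_words(review, SENTIMENT_SAFE_STOP_WORDS)

print(aggressive)  # ['battery', 'good']
print(safe)        # ['battery', 'not', 'good']
```

For a task such as sentiment analysis, the second list costs one extra token and keeps the meaning intact.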



**Reference:** https://spacy.io/usage/processing-pipelines

## 🎯 Lab Summary and Reflection

Congratulations! You've completed a comprehensive exploration of NLP preprocessing techniques.

### 🔑 Key Concepts You've Mastered:

1. **Text Preprocessing Fundamentals** - Understanding why preprocessing is crucial
2. **Tokenization Techniques** - NLTK vs spaCy approaches and their trade-offs
3. **Stop Word Management** - When to remove them and when to keep them
4. **Morphological Processing** - Stemming vs lemmatization for different use cases
5. **Text Cleaning Strategies** - Basic vs advanced cleaning for different text types
6. **Pipeline Design** - Building modular, configurable preprocessing systems

### 🎓 Real-World Applications:
These techniques form the foundation for search engines, chatbots, sentiment analysis, document classification, machine translation, and information extraction systems.

### 💡 Key Insights to Remember:
- **No Universal Solution**: Different NLP tasks require different preprocessing approaches
- **Trade-offs Are Everywhere**: Balance information preservation with noise reduction
- **Context Matters**: The same technique can help or hurt depending on your use case
- **Experimentation Is Key**: Always test and measure impact on your specific task

---

**Excellent work completing Lab 02!** 🎉

For your reflection journal, focus on the insights you gained about when and why to use different techniques, the challenges you encountered, and connections you made to real-world applications.