# 📝 Chapter 1: Text Preprocessing

## 📌 Overview  
Text preprocessing is the first step in most NLP pipelines and one of the most important. Raw text is often messy and inconsistent; cleaning and structuring it makes it usable by machine learning models and statistical analysis.

This chapter covers:
- Text normalization
- Tokenization
- Stopword removal
- Stemming and lemmatization
- POS tagging

---

## 1️⃣ Text Normalization  
**Goal:** Standardize the text format to reduce variability.

Common steps:
- Lowercasing all text  
- Removing punctuation and special characters  
- Removing numbers (optional, task-dependent)  
- Removing extra whitespace  

In [36]:
import re
# Original sentence
text = "The Quick Brown Fox! Jumps over 123 lazy dogs."

# Convert all characters to lowercase
text = text.lower()

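# Replace every character that is NOT a lowercase letter (a-z) or whitespace with a space
text = re.sub(r'[^a-z\s]', ' ', text)

# Collapse runs of whitespace into a single space and trim both ends
text = re.sub(r'\s+', ' ', text).strip()

# Print the cleaned text
print(text)  # Output: the quick brown fox jumps over lazy dogs

the quick brown fox jumps over lazy dogs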
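
The normalization steps listed above can be wrapped in one reusable helper. The following is a minimal sketch, not a definitive implementation: the function name `normalize` and the `keep_digits` flag are hypothetical names introduced here to show how the "optional, task-dependent" number removal might be made configurable.

```python
import re

def normalize(text, keep_digits=False):
    """Lowercase, strip punctuation, optionally drop digits, collapse whitespace."""
    text = text.lower()
    # Pattern of characters to replace: everything except letters, whitespace
    # and (when keep_digits is True) digits
    unwanted = r'[^a-z0-9\s]' if keep_digits else r'[^a-z\s]'
    text = re.sub(unwanted, ' ', text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

sample = "The Quick Brown Fox! Jumps over 123 lazy dogs."
print(normalize(sample))                    # the quick brown fox jumps over lazy dogs
print(normalize(sample, keep_digits=True))  # the quick brown fox jumps over 123 lazy dogs
```

Note that the `a-z` character class also discards accented and non-Latin letters, so this sketch only suits plain English text.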


## 2️⃣ Tokenization

**Goal:** Split the text into individual units (tokens), typically words or subwords.

Example using NLTK:

In [25]:
#%pip install nltk

In [26]:
import nltk  # Natural Language Toolkit
nltk.download('punkt')      # Download the Punkt tokenizer model
nltk.download('punkt_tab')  # Also needed on recent NLTK releases; without it, word_tokenize raises "Resource punkt_tab not found"

from nltk.tokenize import word_tokenize  # Import the word tokenizer


[nltk_data] Downloading package punkt to /Users/moka/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/moka/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [37]:
# Example sentence
sentence = "Natural Language Processing (NLP) is fun!"

# Apply tokenization
tokens = word_tokenize(sentence)

# Print the list of tokens
print(tokens)  # Output: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'fun', '!']



['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'fun', '!']
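
To see what a word tokenizer has to decide, here is a rough regular-expression sketch that splits words from punctuation. It is a simplification for illustration, not how NLTK's `word_tokenize` works internally; `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(text):
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Natural Language Processing (NLP) is fun!"))
# ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'fun', '!']

print(simple_tokenize("Don't stop"))
# ['Don', "'", 't', 'stop']
```

The second call shows where the naive pattern falls short: it breaks the contraction into three fragments. Library tokenizers such as `word_tokenize` carry dedicated rules for cases like this, which is one reason to prefer them over a hand-written pattern.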


## 3️⃣ Stopword Removal

**Goal:** Remove common words that usually carry little meaning on their own, such as "the", "is", and "and".

Example using NLTK:

In [28]:
from nltk.corpus import stopwords  # Import stopword list
nltk.download('stopwords')  # Download the English stopwords

# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the tokenized words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the filtered tokens (stopwords removed)
print(filtered_tokens)


['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fun', '!']


[nltk_data] Downloading package stopwords to /Users/moka/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
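
The filtered list above still contains punctuation tokens such as `(` and `!`, because they are not in the stopword list. One possible follow-up step, sketched here with the token list copied in as a literal so the snippet runs on its own, is to keep only alphabetic tokens.

```python
# Tokens left after stopword removal (copied from the output above)
filtered_tokens = ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'fun', '!']

# str.isalpha() is True only when every character in the token is a letter
word_tokens = [tok for tok in filtered_tokens if tok.isalpha()]

print(word_tokens)  # ['Natural', 'Language', 'Processing', 'NLP', 'fun']
```

Be aware that `isalpha()` also drops tokens containing digits or hyphens (for example `"covid-19"`), so whether this filter is appropriate depends on the task.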


## 4️⃣ Stemming and Lemmatization

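Both techniques reduce inflected words to a base form:

- Stemming: Strips suffixes using heuristic rules. It is fast, but the result may not be an actual word.
- Lemmatization: Maps each word to its dictionary form (lemma) using a vocabulary and the word's part of speech. The result is a valid word.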

Example using NLTK Stemmer:

In [29]:
from nltk.stem import PorterStemmer  # Import the Porter Stemmer

# Create a stemmer instance
stemmer = PorterStemmer()

# Apply stemming
print(stemmer.stem('running'))  # Output: 'run'
print(stemmer.stem('flies'))    # Output: 'fli' (note: not always a valid word)


run
fli
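
To make "heuristic rules" concrete, here is a toy suffix-stripping stemmer. It is a deliberately crude sketch and not the Porter algorithm: the rule table `SUFFIX_RULES` and the function `toy_stem` are hypothetical names made up for this illustration.

```python
# (suffix, replacement) pairs, checked in order; the first match wins
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain before the suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)] + replacement
    return word  # no rule applied

for w in ["flies", "running", "jumped", "dogs", "is"]:
    print(w, "->", toy_stem(w))
# flies -> fli
# running -> runn
# jumped -> jump
# dogs -> dog
# is -> is
```

The toy version turns "running" into "runn" because it has no rule for doubled consonants. Real stemmers such as Porter's add many more rules and conditions, but they remain string heuristics and can still return non-words like "fli".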


Example using NLTK Lemmatizer:

In [30]:
from nltk.stem import WordNetLemmatizer  # Import the lemmatizer
nltk.download('wordnet')  # Download WordNet data for lemmatization

# Create a lemmatizer instance
lemmatizer = WordNetLemmatizer()

# Apply lemmatization (pass the part of speech; the default is noun, which would leave 'running' unchanged)
print(lemmatizer.lemmatize('running', pos='v'))  # Output: 'run'
print(lemmatizer.lemmatize('flies', pos='n'))    # Output: 'fly'


[nltk_data] Downloading package wordnet to /Users/moka/nltk_data...


run
fly
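
The reason the part of speech matters can be shown with a dictionary lookup. The sketch below uses a tiny hand-written lexicon in place of WordNet; `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, and the entries are chosen only for illustration.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma
TOY_LEXICON = {
    ("running", "v"): "run",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("saw", "n"): "saw",   # the tool
    ("saw", "v"): "see",   # past tense of "see"
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("saw", pos="n"))      # saw
print(toy_lemmatize("saw", pos="v"))      # see
print(toy_lemmatize("running"))           # running (no noun entry, so unchanged)
print(toy_lemmatize("running", pos="v"))  # run
```

The same surface form can map to different lemmas depending on its part of speech, which is why supplying `pos` improves the result.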


## 5️⃣ Part-of-Speech (POS) Tagging

**Goal:** Assign a grammatical category (noun, verb, adjective, etc.) to each token.

Example using NLTK:

In [33]:
nltk.download('averaged_perceptron_tagger')      # POS tagger model (resource name used by older NLTK releases)
nltk.download('averaged_perceptron_tagger_eng')  # English POS tagger model (resource name used by recent NLTK releases)

# Example sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence
tokens = word_tokenize(sentence)
print(tokens)

# Apply POS tagging
pos_tags = nltk.pos_tag(tokens)

# Print the tokens with their corresponding POS tags
print(pos_tags)
# Example output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ...]


['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/moka/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/moka/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
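
POS tags can feed back into lemmatization. The tags above follow the Penn Treebank scheme, while `WordNetLemmatizer.lemmatize` expects a one-letter `pos` value (`'n'`, `'v'`, `'a'`, or `'r'`). A common bridging step is a small mapping function; the sketch below is one possible version, and `penn_to_wordnet` is a hypothetical helper name. The tagged pairs are copied in as a literal so the snippet runs without NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet-style part-of-speech letter."""
    if tag.startswith('J'):
        return 'a'   # JJ, JJR, JJS -> adjective
    if tag.startswith('V'):
        return 'v'   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    if tag.startswith('RB'):
        return 'r'   # RB, RBR, RBS -> adverb
    return 'n'       # everything else falls back to noun

# Tagged tokens copied from the output above
pos_tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

wordnet_ready = [(word, penn_to_wordnet(tag)) for word, tag in pos_tags]
print(wordnet_ready)
```

Each `(word, pos)` pair could then be passed to `lemmatizer.lemmatize(word, pos=pos)`. Falling back to noun for all other tags is a design choice that mirrors the lemmatizer's own default.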


## 🎯 Practice Questions and Answers

### 1. Why is tokenization an important step in NLP?

Tokenization is the process of breaking down text into smaller units, such as words, subwords, or sentences.  
It is important because:
- It transforms raw text into structured data that models can process.
- Many NLP tasks, like text classification or sentiment analysis, operate on individual words or tokens.
- It helps handle punctuation, contractions, and special characters appropriately.
- Without tokenization, the model sees one long character string and has no explicit indication of where one word ends and the next begins.

---

### 2. What is the difference between stemming and lemmatization?

|                | Stemming                              | Lemmatization                               |
|----------------|----------------------------------------|----------------------------------------------|
| **Definition** | Cuts off word endings to reduce words to their base/root form (often not a real word). | Reduces words to their dictionary form (lemma) using linguistic rules. |
| **Example**    | "running" → "run", "flies" → "fli"     | "running" → "run", "flies" → "fly"           |
| **Approach**   | Rule-based and heuristic; fast        | Uses a vocabulary and part-of-speech information; slower but more accurate |
| **Result**     | May produce stems that are not real words | Produces valid dictionary words |

---

### 3. Can stopword removal hurt model performance in some tasks? Why or why not?

Yes, stopword removal can hurt model performance depending on the task.

- In tasks like **sentiment analysis**, words like *"not"* or *"never"* are important for understanding meaning and sentiment. Removing them could reverse or obscure the sentiment.
- In **language modeling** or **machine translation**, stopwords contribute to sentence structure and meaning, so removing them might reduce accuracy.
- However, for tasks like **topic modeling** or **information retrieval**, stopword removal often helps by reducing noise.

**Conclusion:** Whether stopword removal is beneficial depends on the specific NLP task and the importance of those words for meaning.
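
The sentiment-analysis risk can be reproduced in a few lines. This is a minimal sketch using a small hand-written stopword set rather than a library list; `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names chosen for the example.

```python
# Hypothetical miniature stopword list that, like many real lists, includes "not"
STOP_WORDS = {"this", "is", "the", "a", "not"}

# Words we may want to protect in sentiment tasks
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep=frozenset()):
    # Drop a token if it is a stopword, unless it is explicitly protected
    return [t for t in tokens if t.lower() not in STOP_WORDS or t.lower() in keep]

tokens = "this movie is not good".split()

print(remove_stopwords(tokens))                  # ['movie', 'good']
print(remove_stopwords(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

With plain stopword removal the negative review collapses to "movie good", which reads as positive. Protecting negation words keeps the signal a sentiment model needs.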
