# 📝 Chapter 1: Text Preprocessing

## 📌 Overview  
Text preprocessing is the first and one of the most critical steps in any NLP pipeline. Raw text data is often messy and inconsistent, so cleaning and structuring the data makes it ready for machine learning models or statistical analysis.

This chapter covers:
- Text normalization
- Tokenization
- Stopword removal
- Stemming and lemmatization
- POS tagging

---

## 1️⃣ Text Normalization  
**Goal:** Standardize the text format to reduce variability.

Common steps:
- Lowercasing all text  
- Removing punctuation and special characters  
- Removing numbers (optional, task-dependent)  
- Removing extra whitespace  

In [2]:
import re
# Original sentence
text = "The Quick Brown Fox! Jumps over 123 lazy dogs."

# Convert all characters to lowercase
text = text.lower()

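# Remove every character that is not a lowercase letter (a-z) or whitespace;
# this drops punctuation and digits in one pass
text = re.sub(r'[^a-z\s]', '', text)

# Collapse the extra whitespace left behind where "123" was removed
text = re.sub(r'\s+', ' ', text).strip()

# Print the cleaned text
print(text)  # Output: the quick brown fox jumps over lazy dogs

the quick brown fox jumps over lazy dogs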
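
The normalization steps listed in this section can also be wrapped in a reusable function. The sketch below is one possible arrangement, not a fixed recipe: the function name `normalize_text` and the `keep_digits` flag are assumptions introduced here to show how the optional, task-dependent number removal might be exposed.

```python
import re


def normalize_text(text: str, keep_digits: bool = False) -> str:
    """Lowercase, strip punctuation (and optionally digits), tidy whitespace."""
    text = text.lower()
    # Keep letters and whitespace; keep digits too when the task needs them
    disallowed = r"[^a-z0-9\s]" if keep_digits else r"[^a-z\s]"
    text = re.sub(disallowed, "", text)
    # Collapse runs of whitespace into single spaces and trim both ends
    return re.sub(r"\s+", " ", text).strip()


sample = "The Quick Brown Fox! Jumps over 123 lazy dogs."
print(normalize_text(sample))                    # the quick brown fox jumps over lazy dogs
print(normalize_text(sample, keep_digits=True))  # the quick brown fox jumps over 123 lazy dogs
```

Keeping digits is often the safer default for tasks where numbers carry meaning, such as prices or dates.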


## 2️⃣ Tokenization

**Goal:** Split the text into individual units (tokens), typically words or subwords.

Example using NLTK:

In [26]:
#%pip install nltk



In [5]:
import os
import nltk  # Natural Language Toolkit

print(os.getcwd())

# Keep NLTK data in an nltk_data folder inside the current directory
nltk_data_dir = os.path.join(os.getcwd(), "nltk_data")
nltk.data.path.append(nltk_data_dir)

# Download the Punkt tokenizer data into that folder.
# Recent NLTK releases load the 'punkt_tab' resource; older ones load 'punkt'.
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('punkt_tab', download_dir=nltk_data_dir)


/Users/moka/Documents/GitHub/NLP


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/moka/Documents/GitHub/NLP/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
from nltk.tokenize import word_tokenize  # Import word tokenizer

# Example sentence
sentence = "Natural Language Processing (NLP) is fun!"

# Apply tokenization
tokens = word_tokenize(sentence)

# Print the list of tokens
print(tokens)  # Output: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'fun', '!']


**Note:** In the recorded run, the `word_tokenize` call raised `LookupError: Resource punkt_tab not found`. Recent NLTK releases load the `punkt_tab` resource rather than `punkt`, so the tokenizer data must be available (for example via `nltk.download('punkt_tab')`, with the same `download_dir` if you keep NLTK data in a local folder) before the call succeeds and prints the token list shown in the comment.
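
To make the idea of splitting text into tokens concrete without any downloaded resources, here is a minimal regex-based sketch. It is a deliberate simplification and not how `word_tokenize` works internally; the names `TOKEN_PATTERN` and `simple_tokenize` are assumptions made for this example.

```python
import re

# One token per run of word characters, or per single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> list:
    """Split text into word and punctuation tokens with a single regex."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Natural Language Processing (NLP) is fun!"))
# ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'fun', '!']
print(simple_tokenize("Don't panic"))
# ['Don', "'", 't', 'panic']
```

The second call shows where a naive rule falls short: the contraction is broken into three pieces. Dedicated tokenizers handle such cases with extra rules, which is one reason to prefer a library tokenizer for real work.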


## 3️⃣ Stopword Removal

**Goal:** Remove common words that usually carry little meaning on their own, such as "the", "is", and "and".

Example using NLTK:

In [7]:
from nltk.corpus import stopwords  # Import stopword list
nltk.download('stopwords')  # Download the English stopwords

# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the tokenized words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the filtered tokens (stopwords removed)
print(filtered_tokens)


[nltk_data] Downloading package stopwords to /Users/moka/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


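**Note:** In the recorded run, this cell raised `NameError: name 'tokens' is not defined` because the tokenization cell that creates `tokens` had failed. Once `tokens` holds the tokenized example sentence, the only token removed is `is`; punctuation tokens such as `(` and `!` are not stopwords and remain in the list.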
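
Stopwords only *usually* carry little meaning, so it can help to protect some of them. The sketch below uses a small hand-written stopword set so that it runs without downloading anything; `BASIC_STOPWORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names for this example, not part of NLTK. NLTK's English list also contains negations such as "not", so the same concern applies there.

```python
# Small hand-written stopword set for illustration (an assumption, not NLTK's full list)
BASIC_STOPWORDS = {"the", "is", "and", "this", "a", "not"}

# Negation words to protect in polarity-sensitive tasks (a hypothetical choice)
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens whose lowercase form is a stopword, unless listed in `keep`."""
    return [t for t in tokens if t.lower() not in stop_words or t.lower() in keep]


review = ["This", "movie", "is", "not", "good"]
print(remove_stopwords(review, BASIC_STOPWORDS))                  # ['movie', 'good']
print(remove_stopwords(review, BASIC_STOPWORDS, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

Without the `keep` set, a negative review collapses to the same tokens as a positive one, which is the kind of information loss to watch for in sentiment analysis.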

## 4️⃣ Stemming and Lemmatization

Both techniques reduce inflected words to a common base form:

- **Stemming:** Applies heuristic suffix-stripping rules; the result (the stem) may not be an actual word.
- **Lemmatization:** Uses vocabulary and grammar rules to return the dictionary form (the lemma), which is a valid word.

Example using NLTK Stemmer:

In [8]:
from nltk.stem import PorterStemmer  # Import the Porter Stemmer

# Create a stemmer instance
stemmer = PorterStemmer()

# Apply stemming
print(stemmer.stem('running'))  # Output: 'run'
print(stemmer.stem('flies'))    # Output: 'fli' (note: not always a valid word)


run
fli


Example using NLTK Lemmatizer:

In [9]:
from nltk.stem import WordNetLemmatizer  # Import the lemmatizer
nltk.download('wordnet')  # Download WordNet data for lemmatization

# Create a lemmatizer instance
lemmatizer = WordNetLemmatizer()

# Apply lemmatization (specify part-of-speech for accuracy)
print(lemmatizer.lemmatize('running', pos='v'))  # Output: 'run'
print(lemmatizer.lemmatize('flies', pos='n'))    # Output: 'fly'


[nltk_data] Downloading package wordnet to /Users/moka/nltk_data...


run
fly
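
The contrast between rule-based stemming and vocabulary-based lemmatization can be seen in a toy version of each. This sketch is purely illustrative: the suffix rules are not the Porter algorithm, and `LEMMA_TABLE` is a tiny hand-written stand-in for a real vocabulary such as WordNet. All names in it are assumptions for this example.

```python
# Toy suffix rules, checked in order: (suffix, replacement).
# Illustrative only; this is NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Tiny hand-written lemma dictionary standing in for a real vocabulary
LEMMA_TABLE = {"running": "run", "flies": "fly", "studies": "study", "better": "good"}


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule, keeping at least two characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["running", "flies", "studies", "better"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rules are cheap and need no data, but they produce non-words such as `runn` and `studi` and cannot relate "better" to "good". The lookup returns valid words but only for entries the vocabulary knows about.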


## 5️⃣ Part-of-Speech (POS) Tagging

**Goal:** Assign a grammatical category (noun, verb, adjective, etc.) to each token.

Example using NLTK:

In [10]:
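# Download the POS tagger model. Recent NLTK releases look for the
# 'averaged_perceptron_tagger_eng' resource; older ones use 'averaged_perceptron_tagger'.
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')  # Tokenizer data needed by word_tokenize below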

# Example sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Apply POS tagging
pos_tags = nltk.pos_tag(tokens)

# Print the tokens with their corresponding POS tags
print(pos_tags)
# Example output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ...]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/moka/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


**Note:** In the recorded run, this cell stopped at `word_tokenize` with `LookupError: Resource punkt_tab not found`, so no tags were printed. The error goes away once the `punkt_tab` tokenizer resource is available (`nltk.download('punkt_tab')`).
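
POS tags are also what the lemmatizer's `pos` argument needs. `nltk.pos_tag` returns Penn Treebank tags such as `JJ` or `NN`, while the lemmatization example earlier in this chapter passes single letters such as `'v'` and `'n'`. The sketch below shows one common way to bridge the two; the helper name `penn_to_wordnet_pos` is an assumption for this example, and defaulting to noun for unmatched tags is a simplifying choice.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # nouns and everything else


# Tagged tokens copied from the example output comment in the cell above
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet_pos(tag))
```

The returned letter can then be passed as `pos=` to `lemmatizer.lemmatize`, so each token is lemmatized according to its own tag instead of a hard-coded one.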


## 🎯 Practice Questions:
1. Why is tokenization an important step in NLP?
2. What is the difference between stemming and lemmatization?
3. Can stopword removal hurt model performance in some tasks? Why or why not?