# Text Preprocessing in NLP

This notebook demonstrates essential text preprocessing techniques used in Natural Language Processing.

## What you'll learn:
- Tokenization
- Lowercasing
- Punctuation removal
- Stop word removal
- Stemming and Lemmatization
- Complete preprocessing pipeline

In [None]:
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag

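# Download required NLTK data.
# Assumption: NLTK 3.9+ looks for 'punkt_tab' and 'averaged_perceptron_tagger_eng',
# while older releases use 'punkt' and 'averaged_perceptron_tagger'.
# Requesting both sets keeps the notebook working on either version.
for resource in ['punkt', 'punkt_tab', 'stopwords', 'wordnet',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng']:
    nltk.download(resource, quiet=True)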

print("Libraries imported successfully!")

## Sample Text for Demonstration

In [None]:
# Sample text for demonstration
sample_text = """Hello World! This is a SAMPLE text for demonstrating 
text preprocessing techniques in Natural Language Processing. 
It contains various punctuation marks, UPPERCASE letters, 
and common stopwords like 'the', 'and', 'is'. 
We'll also see how stemming and lemmatization work with words like 
'running', 'ran', 'better', and 'best'."""

print("Original text:")
print(sample_text)
print(f"\nText length: {len(sample_text)} characters")

## Step 1: Tokenization
Breaking text into individual words or sentences.

In [None]:
# Sentence tokenization
sentences = sent_tokenize(sample_text)
print("Sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

print(f"\nNumber of sentences: {len(sentences)}")

In [None]:
# Word tokenization
tokens = word_tokenize(sample_text)
print("Tokens:")
print(tokens)
print(f"\nNumber of tokens: {len(tokens)}")
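
To make the idea of tokenization concrete, here is a minimal regex-based sketch that uses only the standard library. `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names introduced for illustration; this is not how `word_tokenize` works internally, only an approximation of the same idea.

```python
import re

# Hypothetical helper for illustration; not part of NLTK.
# Matches either a word (optionally with internal apostrophes, e.g. "We'll")
# or a single non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("Hello World! We'll see."))
# ['Hello', 'World', '!', "We'll", 'see', '.']
```

Unlike `word_tokenize`, which splits a contraction such as "We'll" into "We" and "'ll", this sketch keeps contractions whole. Which behavior is preferable depends on the downstream task.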

## Step 2: Convert to Lowercase
Standardizing text case for consistency.

In [None]:
# Convert to lowercase
lowercase_tokens = [token.lower() for token in tokens]
print("Lowercase tokens:")
print(lowercase_tokens[:20])  # Show first 20 tokens

# Compare before and after
print("\nBefore:", tokens[:10])
print("After:", lowercase_tokens[:10])
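
For English text `str.lower()` is usually enough. As a small aside, the sketch below compares it with `str.casefold()`, which applies more aggressive Unicode case folding and can matter for non-English words. The word list is made up for illustration.

```python
# Illustrative words; "Straße" and "STRASSE" are two spellings of the same German word.
words = ["SAMPLE", "Straße", "STRASSE"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)  # ['sample', 'straße', 'strasse']
print(folded)   # ['sample', 'strasse', 'strasse']
```

With `casefold()` both spellings collapse to the same token, whereas `lower()` leaves them distinct.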

## Step 3: Remove Punctuation
Cleaning out punctuation marks that don't add meaning.

In [None]:
# Check what punctuation looks like
print("Punctuation marks:", string.punctuation)

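# Remove tokens made up entirely of punctuation.
# string.punctuation is a str, so `token in string.punctuation` would be a substring test;
# checking each character against a set also catches multi-character tokens such as '...'.
punct_chars = set(string.punctuation)
no_punct_tokens = [token for token in lowercase_tokens
                   if not all(ch in punct_chars for ch in token)]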
print("\nTokens without punctuation:")
print(no_punct_tokens)

print(f"\nTokens before: {len(lowercase_tokens)}")
print(f"Tokens after: {len(no_punct_tokens)}")
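
One subtlety when filtering with `string.punctuation`: it is a plain string, so the `in` operator performs a substring test rather than a per-character membership test. The sketch below shows the difference and two alternatives. `is_punctuation` and `strip_punctuation` are hypothetical helper names, not NLTK functions.

```python
import string

# `in` on a str checks for a contiguous substring.
print("!" in string.punctuation)    # True
print("()" in string.punctuation)   # True, because "(" and ")" happen to be adjacent
print("..." in string.punctuation)  # False, although "..." is clearly punctuation

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Character-level alternative: delete punctuation characters from raw text.
STRIP_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove every ASCII punctuation character from the text."""
    return text.translate(STRIP_TABLE)

print(is_punctuation("..."))               # True
print(strip_punctuation("Hello, World!"))  # Hello World
```

Token-level filtering keeps words like "don't" intact, while character-level stripping turns them into "dont". The two approaches are not interchangeable.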

## Step 4: Remove Stop Words
Filtering out common words that don't carry much meaning.

In [None]:
# Get English stopwords
stop_words = set(stopwords.words('english'))
print("Number of stopwords:", len(stop_words))
# Sets are unordered, so sort to get a stable sample
print("20 sample stopwords:", sorted(stop_words)[:20])

# Remove stopwords
filtered_tokens = [token for token in no_punct_tokens if token not in stop_words]
print("\nTokens after removing stopwords:")
print(filtered_tokens)

print(f"\nTokens before: {len(no_punct_tokens)}")
print(f"Tokens after: {len(filtered_tokens)}")
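
Stop word lists are not one-size-fits-all. NLTK's English list includes negations such as "not" and "no", which can matter for tasks like sentiment analysis. The sketch below uses a tiny hand-written stop word set (an assumption for illustration, far shorter than NLTK's list) to show how a customized set changes the result. `remove_stopwords` is a hypothetical helper name.

```python
# Hypothetical mini stop word list for illustration; NLTK's English list is much longer.
BASE_STOPWORDS = {"the", "is", "a", "this", "and", "not", "no"}
NEGATIONS = {"not", "no"}

# Keep negations by removing them from the stop word set.
custom_stopwords = BASE_STOPWORDS - NEGATIONS

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in the given stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["this", "movie", "is", "not", "good"]
print(remove_stopwords(tokens, BASE_STOPWORDS))    # ['movie', 'good']
print(remove_stopwords(tokens, custom_stopwords))  # ['movie', 'not', 'good']
```

With the default set the sentence loses its negation and appears positive. The same set-difference approach works on the `stop_words` set built from `stopwords.words('english')`.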

## Step 5: Stemming
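Reducing words to a stem by stripping suffixes with heuristic rules. The stem is not always a dictionary word, and irregular forms such as 'ran' are not mapped to 'run'.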

In [None]:
# Initialize stemmer
stemmer = PorterStemmer()

# Apply stemming
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("Stemmed tokens:")
print(stemmed_tokens)

# Show some examples of stemming
example_words = ['running', 'ran', 'runs', 'better', 'best', 'preprocessing', 'demonstrates']
print("\nStemming examples:")
for word in example_words:
    print(f"{word} -> {stemmer.stem(word)}")
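
To show why stems are not always real words, here is a deliberately naive suffix stripper. It is a toy sketch, not the Porter algorithm: Porter applies many more rules (for example, it handles the doubled consonant so that 'running' becomes 'run'). `naive_stem` and `SUFFIXES` are hypothetical names.

```python
# Toy rule set for illustration only; the real Porter stemmer is far more elaborate.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["running", "runs", "ran", "demonstrates"]:
    print(f"{word} -> {naive_stem(word)}")
# running -> runn
# runs -> run
# ran -> ran
# demonstrates -> demonstrat
```

Purely rule-based stripping has no way to relate 'ran' to 'run', because that link is lexical rather than a matter of suffixes.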

## Step 6: Lemmatization
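Converting words to their base or dictionary form. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so verbs and adjectives such as 'running' or 'better' are left unchanged in the cell below.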

In [None]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatized tokens:")
print(lemmatized_tokens)

# Compare stemming vs lemmatization
print("\nStemming vs Lemmatization:")
for word in example_words:
    print(f"{word} -> Stem: {stemmer.stem(word)}, Lemma: {lemmatizer.lemmatize(word)}")
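
Lemmatization results improve when each token's part of speech is supplied. `pos_tag` returns Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects a WordNet letter in its `pos` argument, so a small mapping is needed. The sketch below shows one common way to write that mapping. `penn_to_wordnet` is a hypothetical helper name, and the tagged pairs are hand-written for illustration rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun (WordNetLemmatizer's own default)

# Hand-written (token, tag) pairs in the shape that pos_tag returns.
tagged = [("runners", "NNS"), ("running", "VBG"), ("better", "JJR"), ("quickly", "RB")]

for token, tag in tagged:
    print(f"{token} ({tag}) -> pos='{penn_to_wordnet(tag)}'")
```

With the objects defined in this notebook, the mapping could be applied as `[lemmatizer.lemmatize(tok, pos=penn_to_wordnet(tag)) for tok, tag in pos_tag(filtered_tokens)]`. Given the right tag, `lemmatizer.lemmatize('running', pos='v')` returns 'run' and `lemmatizer.lemmatize('better', pos='a')` returns 'good'.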

## Complete Preprocessing Function
Putting it all together in a reusable function.

In [None]:
def preprocess_text(text, remove_stopwords=True, use_stemming=False, use_lemmatization=True):
    """
    Complete text preprocessing pipeline.
    
    Args:
        text (str): Raw text to preprocess
        remove_stopwords (bool): Whether to remove stopwords
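        use_stemming (bool): Whether to apply stemming
        use_lemmatization (bool): Whether to apply lemmatization
            (noun-only, since no POS tags are passed). If both flags are
            True, stemming runs first and the stems are then lemmatized.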
    
    Returns:
        list: Preprocessed tokens
    """
    # Tokenize
    tokens = word_tokenize(text)
    
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    
    # Remove punctuation and non-alphabetic tokens
    tokens = [token for token in tokens if token.isalpha()]
    
    # Remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]
    
    # Apply stemming
    if use_stemming:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(token) for token in tokens]
    
    # Apply lemmatization
    if use_lemmatization:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return tokens

# Test the function
processed = preprocess_text(sample_text)
print("Final preprocessed tokens:")
print(processed)
print(f"\nOriginal tokens: {len(word_tokenize(sample_text))}")
print(f"Processed tokens: {len(processed)}")

## Comparing Different Preprocessing Options

In [None]:
# Test different preprocessing options
print("Comparison of preprocessing options:")
print("=" * 50)

# Basic preprocessing
basic = preprocess_text(sample_text, remove_stopwords=False, use_stemming=False, use_lemmatization=False)
print(f"Basic (lowercase, no punctuation): {len(basic)} tokens")
print(basic[:10])

# With stopword removal
no_stopwords = preprocess_text(sample_text, remove_stopwords=True, use_stemming=False, use_lemmatization=False)
print(f"\nWithout stopwords: {len(no_stopwords)} tokens")
print(no_stopwords[:10])

# With stemming
with_stemming = preprocess_text(sample_text, remove_stopwords=True, use_stemming=True, use_lemmatization=False)
print(f"\nWith stemming: {len(with_stemming)} tokens")
print(with_stemming[:10])

# With lemmatization (default)
with_lemmatization = preprocess_text(sample_text)
print(f"\nWith lemmatization: {len(with_lemmatization)} tokens")
print(with_lemmatization[:10])

## Practice Exercise

Try preprocessing this text with different options and observe the differences:

In [None]:
practice_text = """
The runners were running quickly through the beautiful gardens. 
They had been training for months, and their performance was better than expected. 
"This is the best race I've ever run!" shouted one of the participants.
"""

# Your turn: try different preprocessing options on this text
# TODO: Experiment with the preprocessing function

result = preprocess_text(practice_text)
print("Preprocessed practice text:")
print(result)

## Key Takeaways

1. **Tokenization** breaks text into meaningful units
2. **Lowercasing** ensures consistency
3. **Punctuation removal** cleans the text
4. **Stopword removal** focuses on meaningful words
5. **Stemming** is fast but rule-based, so stems are not always real words
6. **Lemmatization** returns dictionary forms but is slower and needs part-of-speech tags to handle verbs and adjectives

The choice of preprocessing techniques depends on your specific use case and the trade-offs you're willing to make between accuracy and speed.