# 01 - Text Preprocessing and Cleaning

## Learning Objectives
- Master text preprocessing techniques for NLP
- Understand the impact of different cleaning strategies
- Build a robust preprocessing pipeline
- Prepare data for tokenization and modeling

## Phase 1: PyTorch Fundamentals 🧠
*Build everything from scratch to understand the foundations*

## Phase 2: Transformers Enhancement 🚀
*Enhance with modern NLP tools after mastering fundamentals*

---

## Preprocessing Strategy

Based on your exploration findings, implement a comprehensive text cleaning pipeline that handles:
- Special characters and URLs
- Case normalization
- Punctuation handling
- Stopword removal (optional)
- Custom cleaning rules for social media text


## TODO 1: Basic Text Cleaning Functions

**Goal**: Create reusable text cleaning functions

**Steps**:
1. Implement URL removal function:
   - Remove http/https URLs
   - Handle shortened URLs
2. Implement username/hashtag handling:
   - Decide whether to remove or keep @mentions and #hashtags
3. Implement special character cleaning:
   - Handle emojis (remove or convert to text)
   - Clean punctuation
   - Handle repeated characters (e.g., "soooooo" → "so")

**Hint**: Use the `re` module for regex patterns, and consider different strategies for each element

**Expected Output**: Clean, reusable functions for text preprocessing
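
Two items from the steps above, scheme-less or shortened URLs and repeated characters, can be sketched as follows. This is a minimal sketch, not the notebook's own implementation: the function names, the `keep` parameter, and the list of shortener domains are illustrative assumptions.

```python
import re

# Assumption: a small, hand-picked list of shortener domains; extend it for your data
SHORTENER_DOMAINS = ("t.co", "bit.ly", "tinyurl.com")

_SHORTENERS = "|".join(re.escape(d) for d in SHORTENER_DOMAINS)
# Matches http(s) URLs, "www." URLs, and scheme-less links on the shortener domains
URL_RE = re.compile(rf"(?:https?://|www\.)\S+|\b(?:{_SHORTENERS})/\S+", re.IGNORECASE)
# Matches a letter or one of ! ? . repeated three or more times in a row
REPEAT_RE = re.compile(r"([A-Za-z!?.])\1{2,}")


def remove_all_urls(text):
    """Remove URLs with or without a scheme (hypothetical helper)."""
    if not isinstance(text, str):
        return ""
    return URL_RE.sub("", text)


def collapse_repeats(text, keep=1):
    """Shrink runs of 3+ identical characters down to `keep` characters."""
    if not isinstance(text, str):
        return ""
    return REPEAT_RE.sub(lambda m: m.group(1) * keep, text)


print(remove_all_urls("see www.example.com and t.co/abc123 now"))
print(collapse_repeats("soooooo good!!!", keep=2))
```

With `keep=1`, "soooooo" becomes "so" as in the step description, but a word such as "coool" becomes "col"; `keep=2` is a gentler choice that leaves a trace of the emphasis.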


In [10]:
# TODO 1: Basic text cleaning functions
# Your implementation here
# Create url removal function
import re
import pandas as pd
import emoji
import nltk
import string

df = pd.read_csv("../data/raw/train.csv")

def remove_url(text):
    if not isinstance(text, str):
        return ""
    return re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)

# Define a function to count emojis in a string
def count_emojis(text):
    if not isinstance(text, str):
        return 0
    return sum(1 for char in text if char in emoji.EMOJI_DATA)

def remove_mentions(text):
    if not isinstance(text, str):
        return ""
    return re.sub(r"@[^\s]+", "", text)

def remove_hashtags(text):
    if not isinstance(text, str):
        return ""
    return re.sub(r"#", "", text)

def convert_emojis_to_text(text):
    if not isinstance(text, str):
        return ""
    # Replace each emoji with its text name; the empty delimiters drop the
    # surrounding colons (e.g., "fire" instead of ":fire:")
    return emoji.demojize(text, delimiters=("", ""))

def get_emoji_text_or_none(text):
    if not isinstance(text, str):
        return None
    # Detect emojis directly: with empty delimiters the demojized text has no ":" marker,
    # and a ":" check would also match ordinary colons such as those in URLs
    if emoji.emoji_count(text) == 0:
        return None
    return convert_emojis_to_text(text)

# Column if hashtag and if mention
df["has_hashtag"] = df["text"].str.contains(r"#")
df["has_mention"] = df["text"].str.contains(r"@")
df["has_url"] = df["text"].str.contains(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")

df["number_urls"] = df["text"].str.count(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")
df["number_hashtags"] = df["text"].str.count(r"#")
df["number_mentions"] = df["text"].str.count(r"@")

# Average number of URLs, hashtags, and mentions per tweet
print(f"Average number of urls per tweet: {df['number_urls'].mean()}")
print(f"Average number of hashtags per tweet: {df['number_hashtags'].mean()}")
print(f"Average number of mentions per tweet: {df['number_mentions'].mean()}")

# Create the text_clean column: strip URLs, mentions, and "#" symbols, then lowercase once
df["text_clean"] = df["text"].apply(remove_url)
df["text_clean"] = df["text_clean"].apply(remove_mentions)
df["text_clean"] = df["text_clean"].apply(remove_hashtags)
df["text_clean"] = df["text_clean"].str.lower()



df["emojis_text"] = df["text"].apply(get_emoji_text_or_none)

print(df["text_clean"].head())

print(df.head())





Average number of urls per tweet: 0.6201234730066991
Average number of hashtags per tweet: 0.4469985551031131
Average number of mentions per tweet: 0.36240641008800734
0    our deeds are the reason of this earthquake ma...
1               forest fire near la ronge sask. canada
2    all residents asked to 'shelter in place' are ...
3    13,000 people receive wildfires evacuation ord...
4    just got sent this photo from ruby alaska as s...
Name: text_clean, dtype: object
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  has_hashtag  has_mention  has_url  number_urls  num

## TODO 2: Text Normalization

**Goal**: Standardize text format for consistent processing

**Steps**:
1. Implement case normalization:
   - Decide on lowercasing strategy
   - Handle acronyms and proper nouns
2. Implement whitespace normalization:
   - Remove extra spaces
   - Handle line breaks and tabs
3. Implement number handling:
   - Replace numbers with tokens or remove
   - Handle dates and times

**Hint**: Consider the impact on meaning: some information might be lost in normalization

**Expected Output**: Consistent text format across the dataset
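
One way to approach step 3 is to replace times, dates, and remaining numbers with placeholder tokens instead of spelling them out. The sketch below is only an illustration under stated assumptions: the token strings (`<time>`, `<date>`, `<num>`), the function name, and the date and time formats covered are all hypothetical choices, and real tweets will contain formats these patterns miss.

```python
import re

# Times such as "10:30" or "5:45pm"
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s?[ap]m)?\b", re.IGNORECASE)
# ISO dates ("2015-08-12") and slash or dash dates ("12/08/2015", "12/08")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}[/-]\d{1,2}(?:[/-]\d{2,4})?\b")
# Any remaining number, including thousands separators and decimals ("13,000", "3.5")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def replace_numeric_tokens(text):
    """Replace times, dates, and numbers with placeholder tokens (hypothetical helper)."""
    if not isinstance(text, str):
        return ""
    # Order matters: times and dates must be replaced before plain numbers
    text = TIME_RE.sub(" <time> ", text)
    text = DATE_RE.sub(" <date> ", text)
    text = NUM_RE.sub(" <num> ", text)
    # Collapse the extra spaces introduced by the substitutions
    return re.sub(r"\s+", " ", text).strip()


print(replace_numeric_tokens("Evacuate by 10:30 on 12/08/2015 with 13,000 people"))
```

If punctuation is stripped later in the pipeline, the angle brackets will be removed, so either run this step after punctuation removal or pick plain-word tokens.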


In [14]:
# TODO 2: Text normalization
# Your implementation here
import inflect
def whitespace_normalization(text):
    if not isinstance(text, str):
        return ""
    return re.sub(r'\s+', ' ', text).strip()

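# Create the inflect engine once rather than on every call
_inflect_engine = inflect.engine()

def replace_numbers_with_tokens(text):
    if not isinstance(text, str):
        return ""
    # Spell out each run of digits, e.g. "123" -> "one hundred and twenty-three"
    return re.sub(r'\d+', lambda match: _inflect_engine.number_to_words(match.group()), text)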

df["text_clean"] = df["text_clean"].apply(whitespace_normalization)
df["text_clean"] = df["text_clean"].apply(replace_numbers_with_tokens)

# test
test_text = "Hello 123, how are you? 456"
print(replace_numbers_with_tokens(test_text))


Hello one hundred and twenty-three, how are you? four hundred and fifty-six


## TODO 3: Stopword and Noise Removal

**Goal**: Remove common words and noise that don't contribute to classification

**Steps**:
1. Implement stopword removal:
   - Use NLTK or spaCy stopword lists
   - Consider domain-specific stopwords
   - Evaluate impact on disaster detection
2. Implement noise removal:
   - Remove very short words (< 2 characters)
   - Remove very long words (potential typos)
   - Handle non-alphabetic sequences

**Hint**: Be careful with stopwords: standard lists include words such as "no" and "not" that can change a tweet's meaning, and disaster terms (e.g., "fire", "flood") must never be removed

**Expected Output**: Cleaned text with reduced noise
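
The steps above can be sketched in a few lines. To stay self-contained, this sketch uses a tiny hard-coded stopword set as a stand-in for the NLTK list; the set contents, the `KEEP` set of protected negations, and the function name are assumptions made for illustration.

```python
import re

# Assumption: a tiny stand-in for a real stopword list such as NLTK's
BASE_STOPWORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "and", "that", "has", "no", "not"}
# Negations can flip the meaning of a tweet, so they are protected from removal
KEEP = {"no", "not"}
STOPWORDS = BASE_STOPWORDS - KEEP


def filter_tokens(text, min_len=2, max_len=15):
    """Return lowercase alphabetic tokens minus stopwords and length outliers."""
    if not isinstance(text, str):
        return []
    # Keeping only runs of letters also drops digits and punctuation
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and min_len <= len(t) <= max_len]


print(filter_tokens("The fire is NOT in the forest!!! 123"))
```

Extracting only alphabetic runs splits contractions such as "don't" into "don" and "t", so this approach trades some fidelity for simplicity.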


In [23]:
# TODO 3: Stopword and noise removal
# Your implementation here
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Implement stop word removal but keeping some 50 critical words related to disasters
disaster_50_words_list = [
    "fire", "flood", "earthquake", "hurricane", "tornado", "tsunami", "volcano", "wildfire", "storm", "hail",
    "avalanche", "drought", "cyclone", "disaster", "emergency", "crisis", "accident", "collapse", "explosion", "meteorite",
    "hunger", "disease", "death", "injury", "damage", "destruction", "evacuation", "pandemic", "eruption", "aftershock", "landslide",
    "mudslide", "outbreak", "plague", "rescue", "survivors", "fatality", "wreckage", "contamination", "quarantine", "hazard",
    "sinkhole", "blackout", "blizzard", "storm surge", "typhoon", "structural", "apocalypse", "casualty", "distress", "emergency services"
]

df["word_count"] = df["text_clean"].apply(lambda x: len(x.split()))
print("Head before removing stop words:")
print(df.head())

# Create a stopwords set but EXCLUDING the disaster-related words so that those are not removed
# Handle multi-word phrases: remove them from stopwords if present as phrases or as separated words (normalize to lower-case)
stopword_set = set(stopwords.words('english'))

# Remove all single-word disaster keywords from the stopword set, NOT from the text
for word in disaster_50_words_list:
    # Only remove from stopword_set the single words (if phrase, will be handled in text)
    for w in word.lower().split():
        if w in stopword_set:
            stopword_set.remove(w)

# Function to remove stop words while keeping disaster-related words
def remove_stop_words(text, stop_words=stopword_set, keywords=disaster_50_words_list):
    if not isinstance(text, str):
        return ""
    # Lowercase the keywords once instead of on every word comparison
    keywords_lower = {kw.lower() for kw in keywords}
    # For multi-word keywords, preserve them by temporarily replacing with placeholders
    placeholder_map = {}
    text_norm = text
    for idx, phrase in enumerate(keywords):
        phrase_l = phrase.lower()
        if " " in phrase_l and phrase_l in text_norm.lower():
            placeholder = f"__KEYWORDPHRASE{idx}__"
            placeholder_map[placeholder] = phrase_l
            # Replace every occurrence of the phrase with its placeholder (case-insensitive)
            text_norm = re.sub(re.escape(phrase_l), placeholder, text_norm, flags=re.IGNORECASE)

    # Keep placeholders, disaster keywords, and any word that is not a stopword
    filtered_words = [
        word for word in text_norm.split()
        if word in placeholder_map
        or word.lower() in keywords_lower
        or word.lower() not in stop_words
    ]
    # Restore multi-word phrases
    result = ' '.join(filtered_words)
    for placeholder, phrase in placeholder_map.items():
        result = result.replace(placeholder, phrase)
    return result

# Remove very short (2 characters or fewer) and very long (15 characters or more) words
def remove_short_and_long_words(text):
    if not isinstance(text, str):
        return ""
    words = text.split()
    filtered_words = [word for word in words if 2 < len(word) < 15]
    return ' '.join(filtered_words)

def remove_punctuation(text):
    if not isinstance(text, str):
        return ""
    return text.translate(str.maketrans('', '', string.punctuation))

def convert_lower(text):
    if not isinstance(text, str):
        return ""
    return text.lower()

# Strip punctuation first so that stopwords followed by punctuation (e.g. "the,") are still matched
df["text_clean"] = df["text_clean"].apply(remove_punctuation)
df["text_clean"] = df["text_clean"].apply(remove_stop_words)
df["text_clean"] = df["text_clean"].apply(remove_short_and_long_words)
df["text_clean"] = df["text_clean"].apply(convert_lower)


print("Head after removing stop words and short/long words:")
print(df.head())

# Test it works
test_text = "The fire in the forest is a disaster that has caused a lot of damage."
print(remove_stop_words(test_text))
print(remove_short_and_long_words(test_text))
print(remove_punctuation(test_text))










[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/franciscoteixeirabarbosa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Head before removing stop words:
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  has_hashtag  has_mention  has_url  number_urls  number_hashtags  \
0       1         True        False    False            0                1   
1       1        False        False    False            0                0   
2       1        False        False    False            0                0   
3       1         True        False    False            0                1   
4       1         True        False    False            0                2   

   number_mentions                 

## TODO 4: Complete Preprocessing Pipeline

**Goal**: Combine all cleaning steps into a comprehensive pipeline

**Steps**:
1. Create a main preprocessing function that:
   - Applies all cleaning steps in order
   - Handles edge cases (empty strings, very short text)
   - Returns cleaned text
2. Test the pipeline on sample data:
   - Compare before/after examples
   - Measure processing time
   - Check for any issues

**Hint**: Make the pipeline configurable with parameters for different cleaning strategies

**Expected Output**: Robust preprocessing pipeline ready for the entire dataset
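
The hint about configurability could look like the sketch below: a factory that assembles the chosen steps into one function. The step functions here are simplified stand-ins written for this example (they are not the notebook's own helpers), and the flag names are assumptions. The sample run also times the pipeline with `time.perf_counter`.

```python
import re
import string
import time

# Map every punctuation character to a space so adjacent words are not glued together
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_urls(text):
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)


def strip_punctuation(text):
    return text.translate(_PUNCT_TO_SPACE)


def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()


def make_pipeline(lowercase=True, remove_urls=True, remove_punct=True):
    """Build a cleaning function from the selected steps (hypothetical helper)."""
    steps = []
    if remove_urls:
        steps.append(strip_urls)
    if remove_punct:
        steps.append(strip_punctuation)
    if lowercase:
        steps.append(str.lower)
    steps.append(squeeze_spaces)  # always normalize whitespace last

    def run(text):
        # Edge case: non-string input (e.g. NaN) becomes an empty string
        if not isinstance(text, str):
            return ""
        for step in steps:
            text = step(text)
        return text

    return run


samples = ["FLOOD....flood.... Flooded.", "http://link.com Disaster!!!", "     "]
pipeline = make_pipeline()
start = time.perf_counter()
cleaned = [pipeline(s) for s in samples]
elapsed = time.perf_counter() - start
print(cleaned, f"({elapsed:.6f}s)")
```

Replacing punctuation with a space rather than deleting it keeps "FLOOD....flood" as two words, at the cost of splitting contractions.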


In [30]:
# TODO 4: Complete preprocessing pipeline
# Your implementation here
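def preprocess_text(text):
    """Apply every cleaning step in a fixed order and return the cleaned text."""
    if not isinstance(text, str):
        return ""
    text = remove_url(text)
    text = remove_mentions(text)
    text = remove_hashtags(text)
    text = whitespace_normalization(text)
    text = replace_numbers_with_tokens(text)
    # Strip punctuation before stopword removal so "the," is matched as "the"
    text = remove_punctuation(text)
    text = remove_stop_words(text)
    text = remove_short_and_long_words(text)
    text = convert_lower(text)
    return text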

# Representative test cases covering each preprocessing step
test_text = [
    "",  # empty string
    "a",  # single short word
    "LOL!!!",  # punctuation and acronym
    "No disaster here.",  # contains a stopword
    "RT @user: #BreakingNews Disaster at www.example.com!!",  # contains RT, mention, hashtag, scheme-less URL, punctuation
    "FLOOD....flood.... Flooded.",  # repeated, case variants, punctuation
    "     ",  # whitespace only
    "This is a test tweet with a reallysuperlongwordthatexceedslimits",  # very long word
    "Fire! 🔥🔥🔥 Very bad situation... #emergency",  # emojis, punctuation, hashtag
    "OMG 💥",  # uppercase acronym and emoji
    "The THE the tHe",  # case sensitivity
    "The fire in the forest is a disaster that has caused a lot of damage 🔥 and OMG 💥.",  # stopwords, emojis, punctuation
    "http://link.com Disaster!!!",  # link + exclamation
]
# Apply the pipeline to each test case and print the original next to the result
for t in test_text:
    print(f"Original: {t!r}")
    print(f"Processed: {preprocess_text(t)!r}")
    print('-'*50)


Original: ''
Processed: ''
--------------------------------------------------
Original: 'a'
Processed: ''
--------------------------------------------------
Original: 'LOL!!!'
Processed: 'lol'
--------------------------------------------------
Original: 'No disaster here.'
Processed: 'disaster'
--------------------------------------------------
Original: 'RT @user: #BreakingNews Disaster at www.example.com!!'
Processed: 'breakingnews disaster wwwexamplecom'
--------------------------------------------------
Original: 'FLOOD....flood.... Flooded.'
Processed: 'floodflood flooded'
--------------------------------------------------
Original: '     '
Processed: ''
--------------------------------------------------
Original: 'This is a test tweet with a reallysuperlongwordthatexceedslimits'
Processed: 'test tweet'
--------------------------------------------------
Original: 'Fire! 🔥🔥🔥 Very bad situation... #emergency'
Processed: 'fire 🔥🔥🔥 bad situation emergency'
----------------------------

## TODO 5: Apply Preprocessing and Save Results

**Goal**: Process the entire dataset and save cleaned data

**Steps**:
1. Apply preprocessing pipeline to training data
2. Apply same pipeline to test data
3. Save cleaned data to `data/interim/` folder
4. Compare before/after statistics
5. Document any data lost during preprocessing

**Expected Output**: Clean datasets ready for tokenization
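
Steps 4 and 5 can be made concrete with a small report function. This is a sketch with hypothetical names; it works on plain lists so it stays self-contained, and in the notebook the inputs could come from `df["text"].tolist()` and `df["text_clean"].tolist()`.

```python
from statistics import mean


def cleaning_report(before, after):
    """Summarize how much text was lost between raw and cleaned versions."""
    before_counts = [len(t.split()) for t in before]
    after_counts = [len(t.split()) for t in after]
    # Rows that had at least one word before cleaning and none afterwards
    emptied = sum(1 for b, a in zip(before_counts, after_counts) if b > 0 and a == 0)
    return {
        "rows": len(before),
        "mean_words_before": mean(before_counts),
        "mean_words_after": mean(after_counts),
        "rows_emptied": emptied,
    }


raw = ["The fire is here", "a", ""]
clean = ["fire", "", ""]
print(cleaning_report(raw, clean))
```

Rows counted in `rows_emptied` are worth inspecting by hand, since a model receives no signal at all from an empty string.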

---

## Phase 2: Transformers Enhancement

*After completing Phase 1, consider these enhancements:*

- Use HuggingFace tokenizers for consistent preprocessing
- Leverage pre-trained tokenization (BERT, RoBERTa)
- Compare custom preprocessing vs. transformer tokenization
- Analyze preprocessing impact on different model architectures


In [44]:
# TODO 5: Apply preprocessing and save results
# Your implementation here
# Create a column for potential debugging purposes so it is present in the cleaned dataset
# One potential way to check that the preprocessing is working as expected is to compare the word count before and after preprocessing
df["word_count_difference"] = df["word_count"] - df["text_clean"].apply(lambda x: len(x.split()))


import pandas as pd

# Save the cleaned dataset
df.to_csv("../data/interim/train_cleaned.csv", index=False)

df_test = pd.read_csv("../data/raw/test.csv")
# Add same column for test dataset ['id', 'keyword', 'location', 'text', 'target', 'has_hashtag','has_mention', 'has_url', 'number_urls', 'number_hashtags', 'number_mentions', 'text_clean', 'emojis_text', 'word_count', 'word_count_difference']

df_test["has_hashtag"] = df_test["text"].str.contains(r"#")
df_test["has_mention"] = df_test["text"].str.contains(r"@")
df_test["has_url"] = df_test["text"].str.contains(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")
df_test["number_urls"] = df_test["text"].str.count(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+")
df_test["number_hashtags"] = df_test["text"].str.count(r"#")
df_test["number_mentions"] = df_test["text"].str.count(r"@")
df_test["word_count"] = df_test["text"].apply(lambda x: len(x.split()))
df_test["text_clean"] = df_test["text"].apply(preprocess_text)
df_test["word_count_difference"] = df_test["word_count"] - df_test["text_clean"].apply(lambda x: len(x.split()))
df_test["emojis_text"] = df_test["text"].apply(get_emoji_text_or_none)

# Train and test should share every column except "target"
if df.columns.drop("target").difference(df_test.columns).empty:
    print("Columns are the same")
else:
    print("Columns are not the same")

# Check that train and test have the same columns except for "target" (test should not have it)
expected_columns = [col for col in df.columns if col != "target"]
missing_from_test = [col for col in expected_columns if col not in df_test.columns]
extra_in_test = [col for col in df_test.columns if col not in expected_columns]

if not missing_from_test and not extra_in_test:
    print(f"✅ Test columns match train columns (except 'target'). Train columns: {len(df.columns)}, test columns: {len(df_test.columns)}")
else:
    print(f"❌ Columns mismatch. Missing from test: {missing_from_test}, Extra in test: {extra_in_test}")

# Checking which columns are missing
missing_columns = []

for col in df.columns:
    if col not in df_test.columns:
        missing_columns.append(col)

print(f"Missing columns: {missing_columns}")

if "target" in missing_columns:
    print("Target column is missing, which is fine as it is not needed for the test dataset")
else:
    print("Target column is present, which is not fine as it is not needed for the test dataset")

# Save the cleaned test dataset
df_test.to_csv("../data/interim/test_cleaned.csv", index=False)

print(df.columns)
print(df_test.columns)

Columns are the same
✅ Test columns match train columns (except 'target'). Train columns: 15, test columns: 14
Missing columns: ['target']
Target column is missing, which is fine as it is not needed for the test dataset
Index(['id', 'keyword', 'location', 'text', 'target', 'has_hashtag',
       'has_mention', 'has_url', 'number_urls', 'number_hashtags',
       'number_mentions', 'text_clean', 'emojis_text', 'word_count',
       'word_count_difference'],
      dtype='object')
Index(['id', 'keyword', 'location', 'text', 'has_hashtag', 'has_mention',
       'has_url', 'number_urls', 'number_hashtags', 'number_mentions',
       'word_count', 'text_clean', 'word_count_difference', 'emojis_text'],
      dtype='object')


## Phase 2: HuggingFace Tokenizers Comparison

**Goal**: Compare custom preprocessing with HuggingFace tokenizers

**Steps**:
1. Install and import HuggingFace transformers
2. Load BERT tokenizer and compare outputs
3. Analyze differences between custom vs. transformer tokenization
4. Document pros/cons of each approach

**Expected Output**: Understanding of when to use custom vs. pre-trained tokenizers

In [None]:
# Phase 2: HuggingFace Tokenizers Comparison
# Your implementation here

# Install transformers if not already installed
# !pip install transformers

from transformers import BertTokenizer
import torch

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample texts for comparison
sample_texts = [
    "Fire! 🔥🔥🔥 Very bad situation... #emergency",
    "RT @user: #BreakingNews Disaster at www.example.com!!",
    "The fire in the forest is a disaster that has caused a lot of damage 🔥",
    "Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all",
    "Forest fire near La Ronge Sask. Canada"
]

print("=== COMPARISON: Custom Preprocessing vs. BERT Tokenizer ===\n")

for i, text in enumerate(sample_texts):
    print(f"Sample {i+1}:")
    print(f"Original: {repr(text)}")
    
    # Custom preprocessing
    custom_cleaned = preprocess_text(text)
    print(f"Custom cleaned: {repr(custom_cleaned)}")
    
    # BERT tokenization
    bert_tokens = tokenizer.tokenize(text)
    bert_cleaned = " ".join(bert_tokens)
    print(f"BERT tokens: {bert_tokens}")
    print(f"BERT cleaned: {repr(bert_cleaned)}")
    
    # Compare lengths
    custom_words = len(custom_cleaned.split()) if custom_cleaned else 0
    bert_words = len(bert_tokens)
    
    print(f"Custom word count: {custom_words}")
    print(f"BERT token count: {bert_words}")
    print("-" * 80)


In [None]:
# Analyze tokenization differences and create comparison statistics

def analyze_tokenization_differences(df_sample, tokenizer):
    """
    Compare custom preprocessing vs BERT tokenization on a sample of data
    """
    results = {
        'custom_word_counts': [],
        'bert_token_counts': [],
        'length_ratios': [],
        'samples': []
    }
    
    # Sample a subset for analysis (first 100 rows)
    sample_df = df_sample.head(100)
    
    for idx, row in sample_df.iterrows():
        original_text = row['text']
        custom_cleaned = preprocess_text(original_text)
        bert_tokens = tokenizer.tokenize(original_text)
        
        custom_words = len(custom_cleaned.split()) if custom_cleaned else 0
        bert_tokens_count = len(bert_tokens)
        
        results['custom_word_counts'].append(custom_words)
        results['bert_token_counts'].append(bert_tokens_count)
        
        if custom_words > 0:
            ratio = bert_tokens_count / custom_words
            results['length_ratios'].append(ratio)
        
        results['samples'].append({
            'original': original_text[:50] + "...",
            'custom_count': custom_words,
            'bert_count': bert_tokens_count
        })
    
    return results

# Run analysis
print("=== TOKENIZATION ANALYSIS ON SAMPLE DATA ===\n")
tokenization_results = analyze_tokenization_differences(df, tokenizer)

# Calculate statistics
custom_counts = tokenization_results['custom_word_counts']
bert_counts = tokenization_results['bert_token_counts']
ratios = tokenization_results['length_ratios']
avg_custom = sum(custom_counts) / len(custom_counts)
avg_bert = sum(bert_counts) / len(bert_counts)
# Ratios are only recorded for rows with at least one word left after cleaning
avg_ratio = sum(ratios) / len(ratios) if ratios else float("nan")

print(f"Average custom preprocessing word count: {avg_custom:.2f}")
print(f"Average BERT token count: {avg_bert:.2f}")
print(f"Average BERT/Custom ratio: {avg_ratio:.2f}")
print(f"On average, BERT produces {avg_ratio:.1f}x as many tokens as custom preprocessing leaves words")

# Show some examples
print("\n=== SAMPLE COMPARISONS ===")
for i, sample in enumerate(tokenization_results['samples'][:5]):
    print(f"Sample {i+1}:")
    print(f"  Text: {sample['original']}")
    print(f"  Custom words: {sample['custom_count']}, BERT tokens: {sample['bert_count']}")
    print()


## Summary and Recommendations

### Custom Preprocessing vs. HuggingFace Tokenizers

**Custom Preprocessing Advantages:**
- ✅ **Domain-specific**: Preserves disaster-related keywords
- ✅ **Controllable**: Full control over cleaning steps
- ✅ **Interpretable**: Clear understanding of each transformation
- ✅ **Efficient**: Smaller vocabulary, faster processing

**HuggingFace Tokenizers Advantages:**
- ✅ **Pre-trained**: Optimized for specific models (BERT, RoBERTa)
- ✅ **Robust**: Handles edge cases and special tokens
- ✅ **Standardized**: Consistent across different projects
- ✅ **Subword handling**: Better handling of out-of-vocabulary (OOV) words

**Recommendations:**
1. **Phase 1 (PyTorch Fundamentals)**: Use custom preprocessing for learning
2. **Phase 2 (Transformers)**: Use the HuggingFace tokenizer that matches the pre-trained model, since the model expects the vocabulary it was trained with
3. **Hybrid Approach**: Combine custom cleaning with pre-trained tokenizers
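
A hybrid approach might look like the sketch below: a light cleaning pass that only normalizes noisy elements, followed by whatever tokenizer is passed in. The function names and the placeholder strings `"http"` and `"@user"` are assumptions for illustration. The `tokenize` argument can be any callable, for example `tokenizer.tokenize` from the BERT tokenizer loaded earlier; `str.split` is used here so the example runs without downloading a model.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def light_clean(text):
    """Normalize URLs and mentions but keep casing, stopwords, and punctuation."""
    if not isinstance(text, str):
        return ""
    text = URL_RE.sub("http", text)        # assumption: generic URL placeholder
    text = MENTION_RE.sub("@user", text)   # assumption: generic mention placeholder
    return re.sub(r"\s+", " ", text).strip()


def hybrid_tokenize(text, tokenize):
    """Run light cleaning, then delegate to the supplied tokenizer callable."""
    return tokenize(light_clean(text))


print(hybrid_tokenize("RT @bob: fire at http://t.co/abc   now", str.split))
```

Stopwords and punctuation are left in place on purpose, because pre-trained transformer models were trained on natural text and can use that context.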

### Next Steps:
- Move to `02_vocab_and_dataloader.ipynb` for vocabulary building
- Use cleaned datasets from `data/interim/` folder
- Implement custom PyTorch data loaders with your cleaned text
