# Text Preprocessing for NLP and Memory Optimization

This notebook explores text preprocessing techniques for natural language processing (NLP) tasks, with a focus on keeping memory usage low by using generators and iterators. We'll examine:

1. Different text preprocessing methods for classification
2. Memory-efficient data loading techniques
3. Performance benchmarking of different approaches
4. Integration with the DVC pipeline

## Setup and Imports

In [None]:
# Standard libraries
import os
import sys
import time
import gc
import json
import psutil
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

In [None]:
# Add project root to path
sys.path.append('..')

# Import custom modules
from src.config import Config
from src.data.data_loader import DataLoader
from src.data.preprocessing import TextPreprocessor
from src.utils.decorators import timing_decorator, memory_usage_decorator

## 1. Create Sample Dataset for Experimentation

In [None]:
# Create a larger sample text classification dataset
sentences = [
    "This product is amazing! I love it.",
    "Terrible service, would not recommend to anyone.",
    "The quality is acceptable but the price is too high.",
    "Best purchase I've made all year! Works perfectly.",
    "Completely useless product. Waste of money.",
    "Not bad, but not great either. Just average.",
    "The customer service was very helpful and responsive.",
    "Broke after two days. Very disappointed.",
    "Exceeded my expectations in every way possible!",
    "Slightly overpriced for what you get, but decent quality."
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative', 
          'neutral', 'positive', 'negative', 'positive', 'neutral']

# Create a larger dataset by repeating and modifying slightly
import random
random.seed(42)

# Words to randomly insert/replace for variation
adjectives = ['good', 'great', 'excellent', 'terrible', 'awful', 'bad', 'decent', 'fine', 'poor', 'exceptional']
adverbs = ['very', 'extremely', 'somewhat', 'quite', 'rather', 'incredibly', 'barely', 'hardly', 'really', 'truly']

large_texts = []
large_labels = []

for _ in range(100):  # Create 100 examples
    idx = random.randint(0, len(sentences)-1)
    text = sentences[idx]
    label = labels[idx]
    
    # Add some variation
    if random.random() > 0.7:
        words = text.split()
        if len(words) > 3:
            pos = random.randint(0, len(words)-1)
            if random.random() > 0.5:
                words[pos] = random.choice(adjectives)
            else:
                words.insert(pos, random.choice(adverbs))
            text = ' '.join(words)
    
    large_texts.append(text)
    large_labels.append(label)

# Create DataFrame
df_large = pd.DataFrame({
    'text': large_texts,
    'label': large_labels
})

print(f"Created dataset with {len(df_large)} examples")
df_large.head()

In [None]:
# Save to CSV for further experimentation
os.makedirs('../data/raw', exist_ok=True)
csv_path = '../data/raw/sample_reviews.csv'
df_large.to_csv(csv_path, index=False)
print(f"Saved to {csv_path}")

## 2. Basic Text Preprocessing Functions

Let's examine how different preprocessing methods affect text data.

In [None]:
def display_preprocessing_steps(text):
    """Display the effect of each preprocessing step."""
    print(f"Original: '{text}'")
    
    # Lowercase
    lowercase = text.lower()
    print(f"Lowercase: '{lowercase}'")
    
    # Remove punctuation
    import string
    no_punct = lowercase.translate(str.maketrans('', '', string.punctuation))
    print(f"No punctuation: '{no_punct}'")
    
    # Tokenize
    tokens = word_tokenize(no_punct)
    print(f"Tokenized: {tokens}")
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    print(f"No stopwords: {filtered_tokens}")
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(token) for token in filtered_tokens]
    print(f"Stemmed: {stemmed}")
    
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    print(f"Lemmatized: {lemmatized}")
    
    return {
        'original': text,
        'lowercase': lowercase,
        'no_punct': no_punct,
        'tokens': tokens,
        'filtered': filtered_tokens,
        'stemmed': stemmed,
        'lemmatized': lemmatized
    }

In [None]:
# Test with a sample sentence
sample_sentence = "This product is amazing! I love it and would recommend it to everyone."
processed = display_preprocessing_steps(sample_sentence)

## 3. Using the TextPreprocessor Class

Now, let's use our custom TextPreprocessor class with different configurations.
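
The source of `TextPreprocessor` is not shown in this notebook, so the following is only a minimal sketch of how a preprocessor driven by an options dictionary could work. `SketchPreprocessor`, `DEFAULTS`, and `TOY_STOPWORDS` are hypothetical names introduced for illustration, and the real class may behave differently (for example, it uses NLTK for stopwords, stemming, and lemmatization).

```python
import string

# Hypothetical defaults; the real TextPreprocessor may use other keys or values.
DEFAULTS = {
    "lowercase": True,
    "remove_punctuation": True,
    "remove_numbers": False,
    "remove_stopwords": False,
}

# Tiny illustrative stopword list; the notebook relies on NLTK's English list instead.
TOY_STOPWORDS = {"this", "is", "it", "i", "the", "a", "to", "and"}


class SketchPreprocessor:
    """Illustrative stand-in for a config-driven text preprocessor."""

    def __init__(self, options=None):
        # Options that are not supplied fall back to DEFAULTS.
        self.options = {**DEFAULTS, **(options or {})}

    def preprocess_text(self, text):
        opts = self.options
        if opts["lowercase"]:
            text = text.lower()
        if opts["remove_punctuation"]:
            text = text.translate(str.maketrans("", "", string.punctuation))
        if opts["remove_numbers"]:
            text = "".join(ch for ch in text if not ch.isdigit())
        tokens = text.split()
        if opts["remove_stopwords"]:
            tokens = [t for t in tokens if t.lower() not in TOY_STOPWORDS]
        return " ".join(tokens)


basic = SketchPreprocessor()
no_stop = SketchPreprocessor({"remove_stopwords": True})
print(basic.preprocess_text("This product is amazing! I love it."))
print(no_stop.preprocess_text("This product is amazing! I love it."))
```

Each flag switches one step on or off, which is why the configurations in the next cell can differ by a single key.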

In [None]:
# Initialize preprocessor with different configurations
preprocessor_basic = TextPreprocessor({
    'lowercase': True,
    'remove_punctuation': True,
    'remove_numbers': False,
    'remove_stopwords': False,
    'stemming': False,
    'lemmatization': False
})

preprocessor_stopwords = TextPreprocessor({
    'lowercase': True,
    'remove_punctuation': True,
    'remove_numbers': False,
    'remove_stopwords': True,
    'stemming': False,
    'lemmatization': False
})

preprocessor_stemming = TextPreprocessor({
    'lowercase': True,
    'remove_punctuation': True,
    'remove_numbers': False,
    'remove_stopwords': True,
    'stemming': True,
    'lemmatization': False
})

preprocessor_lemmatization = TextPreprocessor({
    'lowercase': True,
    'remove_punctuation': True,
    'remove_numbers': False,
    'remove_stopwords': True,
    'stemming': False,
    'lemmatization': True
})

In [None]:
# Compare preprocessing results
test_sentences = [
    "This product is amazing! I love it.",
    "Terrible service, would not recommend to anyone.",
    "The system crashed 3 times during the demo on 04/25/2023."
]

for sentence in test_sentences:
    print(f"\nOriginal: {sentence}")
    print(f"Basic: {preprocessor_basic.preprocess_text(sentence)}")
    print(f"No Stopwords: {preprocessor_stopwords.preprocess_text(sentence)}")
    print(f"Stemming: {preprocessor_stemming.preprocess_text(sentence)}")
    print(f"Lemmatization: {preprocessor_lemmatization.preprocess_text(sentence)}")

## 4. Memory-Efficient Data Loading

Now, let's explore how our memory-efficient data loading works using generators and iterators.
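
As a rough illustration of the idea, a batch generator can be built from the standard library alone. This is a sketch, not the project's `DataLoader`: `iter_batches` is a hypothetical helper, and an in-memory `io.StringIO` stands in for the CSV file.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of at most batch_size row dicts from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        # islice pulls only the next batch_size rows from the reader.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# In-memory stand-in for a CSV file; an open file object works the same way.
sample = io.StringIO("text,label\ngood,positive\nbad,negative\nfine,neutral\n")
for batch in iter_batches(sample, batch_size=2):
    print(len(batch), [row["label"] for row in batch])
```

Because rows are parsed only when the consumer asks for the next batch, at most `batch_size` parsed rows are held by the generator at any time.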

In [None]:
# Initialize the data loader
config = Config()
loader = DataLoader(csv_path, config)

# Get dataset statistics
stats = loader.get_statistics()
print(f"Dataset statistics:\n{json.dumps(stats, indent=2)}")

### 4.1 Loading Data in Batches vs. All at Once

In [None]:
# Function to measure memory usage
def get_memory_usage():
    """Get current memory usage in MB."""
    process = psutil.Process()
    return process.memory_info().rss / (1024 * 1024)

# Function to benchmark data loading methods
def benchmark_loading_methods(data_path, batch_sizes=(10, 50, 100, None)):
    """Benchmark different data loading methods."""
    results = []
    
    for batch_size in batch_sizes:
        # Create data loader with specified batch size
        config = Config()
        if batch_size is not None:
            config.batch_size = batch_size
        loader = DataLoader(data_path, config)
        
        # Measure memory before
        gc.collect()
        time.sleep(1)  # Allow memory to stabilize
        memory_before = get_memory_usage()
        
        # Start timing (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        
        if batch_size is None:
            # Load all data at once using pandas
            data = loader.load_pandas()
            method = "All at once (Pandas)"
        else:
            # Process data in batches using generator
            record_count = 0
            for batch in loader.load_batch_generator(batch_size=batch_size):
                record_count += len(batch)
                # Simulate processing each record
                for record in batch:
                    _ = record['text'].lower()
            method = f"Batch size {batch_size}"
        
        # End timing
        elapsed_time = time.perf_counter() - start_time
        
        # Measure memory after (resident size at the end of the run, not the peak)
        memory_after = get_memory_usage()
        memory_increase = memory_after - memory_before
        
        results.append({
            'method': method,
            'time': elapsed_time,
            'memory_increase_mb': memory_increase
        })
    
    return results

In [None]:
# Run the benchmark
benchmark_results = benchmark_loading_methods(csv_path)

# Create a DataFrame from results
df_benchmark = pd.DataFrame(benchmark_results)
df_benchmark

In [None]:
# Visualize the benchmark results
plt.figure(figsize=(12, 6))

# Plot execution time
plt.subplot(1, 2, 1)
sns.barplot(x='method', y='time', data=df_benchmark)
plt.title('Execution Time by Method')
plt.xlabel('Method')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=45)

# Plot memory usage
plt.subplot(1, 2, 2)
sns.barplot(x='method', y='memory_increase_mb', data=df_benchmark)
plt.title('Memory Usage by Method')
plt.xlabel('Method')
plt.ylabel('Memory Increase (MB)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

### 4.2 Iterating Through Records One by One
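
The cell below loops over the loader directly, which implies that `DataLoader` implements the iterator protocol. How it does so is not shown here, so the following is only a sketch of one common pattern: a class whose `__iter__` is a generator. `CsvRecords` is a hypothetical name, and a temporary file keeps the example self-contained.

```python
import csv
import os
import tempfile


class CsvRecords:
    """Re-iterable view over a CSV file that reads one row at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Each call opens the file afresh, so the object can be looped over repeatedly.
        with open(self.path, newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


# Write a tiny temporary CSV so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("text,label\nWorks perfectly,positive\nBroke after two days,negative\n")

records = CsvRecords(tmp.name)
for i, record in enumerate(records):
    print(i + 1, record["text"], record["label"])

# A second pass works because __iter__ returns a new generator every time.
first_pass = list(records)
second_pass = list(records)
os.remove(tmp.name)
```

Returning a fresh generator from `__iter__`, rather than making the object its own iterator, avoids the surprise of an exhausted loader on the second loop.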

In [None]:
# Using the DataLoader's iterator interface
loader = DataLoader(csv_path, config)

# Process the first 5 records
for i, record in enumerate(loader):
    if i >= 5:  # Only show first 5
        break
    print(f"Record {i+1}: {record['text'][:50]}... [{record['label']}]")

## 5. Benchmarking Preprocessing Methods

In [None]:
# Function to benchmark preprocessing configurations
def benchmark_preprocessing(data_path, configs):
    """Benchmark different preprocessing configurations."""
    results = []
    
    # Load the sample data
    df = pd.read_csv(data_path)
    sample_texts = df['text'].tolist()[:20]  # Use first 20 for benchmark
    
    for config_name, config_options in configs.items():
        # Create preprocessor
        preprocessor = TextPreprocessor(config_options)
        
        # Start timing (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        
        # Process all sample texts
        processed_texts = []
        for text in sample_texts:
            processed = preprocessor.preprocess_text(text)
            processed_texts.append(processed)
        
        # End timing
        elapsed_time = time.perf_counter() - start_time
        
        # Calculate average length reduction
        orig_lengths = [len(text) for text in sample_texts]
        proc_lengths = [len(text) for text in processed_texts]
        avg_reduction = 1 - (sum(proc_lengths) / sum(orig_lengths))
        
        results.append({
            'config': config_name,
            'time': elapsed_time,
            'avg_length_reduction': avg_reduction,
            'sample_result': processed_texts[0]
        })
    
    return results

In [None]:
# Define preprocessing configurations to test
preprocessing_configs = {
    'Basic (lowercase only)': {
        'lowercase': True,
        'remove_punctuation': False,
        'remove_stopwords': False,
        'stemming': False,
        'lemmatization': False
    },
    'No punctuation': {
        'lowercase': True,
        'remove_punctuation': True,
        'remove_stopwords': False,
        'stemming': False,
        'lemmatization': False
    },
    'No stopwords': {
        'lowercase': True,
        'remove_punctuation': True,
        'remove_stopwords': True,
        'stemming': False,
        'lemmatization': False
    },
    'Stemming': {
        'lowercase': True,
        'remove_punctuation': True,
        'remove_stopwords': True,
        'stemming': True,
        'lemmatization': False
    },
    'Lemmatization': {
        'lowercase': True,
        'remove_punctuation': True,
        'remove_stopwords': True,
        'stemming': False,
        'lemmatization': True
    },
    'Complete (all steps)': {
        'lowercase': True,
        'remove_punctuation': True,
        'remove_numbers': True,
        'remove_stopwords': True,
        'stemming': True,
        'lemmatization': True
    }
}

# Run the benchmark
preprocessing_results = benchmark_preprocessing(csv_path, preprocessing_configs)

# Create a DataFrame from results
df_preproc = pd.DataFrame(preprocessing_results)
df_preproc

In [None]:
# Visualize the preprocessing benchmark results
plt.figure(figsize=(12, 8))

# Plot execution time
plt.subplot(2, 1, 1)
sns.barplot(x='config', y='time', data=df_preproc)
plt.title('Preprocessing Time by Configuration')
plt.xlabel('Configuration')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=45)

# Plot length reduction
plt.subplot(2, 1, 2)
sns.barplot(x='config', y='avg_length_reduction', data=df_preproc)
plt.title('Average Text Length Reduction by Configuration')
plt.xlabel('Configuration')
plt.ylabel('Length Reduction (fraction of original)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 6. End-to-End Preprocessing Pipeline
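
The implementation of `preprocess_and_save` is not shown in this notebook. One way such a step can stay memory-efficient is to stream rows from input to output, as in the sketch below. `clean` and `preprocess_stream` are hypothetical helpers. The `processed_text` column name matches the column read back later in this section, and the `text` input column is taken from the sample dataset.

```python
import csv
import io
import string


def clean(text):
    """Lowercase and strip ASCII punctuation (a stand-in for the full preprocessor)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def preprocess_stream(src, dst, transform=clean):
    """Copy CSV rows from src to dst, adding a processed_text column."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + ["processed_text"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:  # only one row is held in memory at a time
        row["processed_text"] = transform(row["text"])
        writer.writerow(row)
        count += 1
    return count


# In-memory buffers stand in for the input and output files.
src = io.StringIO("text,label\nBest purchase ever!,positive\n")
dst = io.StringIO()
n = preprocess_stream(src, dst)
print(n)
print(dst.getvalue())
```

Writing each row as soon as it is processed means the output file grows incrementally, and memory use does not depend on the dataset size.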

In [None]:
# Set up the data loader
config = Config()
loader = DataLoader(csv_path, config)

# Set up the preprocessor with standard configuration
preprocessor = TextPreprocessor({
    'lowercase': True,
    'remove_punctuation': True,
    'remove_numbers': False,
    'remove_stopwords': True,
    'language': 'english',
    'stemming': False,
    'lemmatization': True
})

# Define output path
output_path = os.path.join(config.processed_data_path, 'processed_sample_reviews.csv')

# Process and save the dataset
preprocessor.preprocess_and_save(loader, output_path)

print(f"Dataset processed and saved to {output_path}")

In [None]:
# Load and examine the processed dataset
df_processed = pd.read_csv(output_path)
df_processed.head()

In [None]:
# Compare original and processed texts
comparison = pd.DataFrame({
    'original': df_large['text'].head(5),
    'processed': df_processed['processed_text'].head(5)
})
comparison

## 7. Integration with DVC Pipeline

In [None]:
# Display the DVC YAML pipeline configuration
!cat ../dvc.yaml

In [None]:
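# Run the preprocessing stage through DVC.
# Note: `dvc run` was removed in DVC 3.0; on newer versions, use `dvc stage add` with the same
# -n/-d/-o options and then `dvc repro`. Also, csv_path is relative to the notebook directory,
# so after `cd ..` it may need to be rewritten relative to the project root.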
!cd .. && dvc run -n preprocess -d {csv_path} -d src/data/preprocessing.py -d src/data/data_loader.py -o {output_path} python -c "from src.data.preprocessing import TextPreprocessor; from src.data.data_loader import DataLoader; from src.config import Config; config = Config(); preprocessor = TextPreprocessor(config.text_preprocessing); loader = DataLoader('{csv_path}', config); preprocessor.preprocess_and_save(loader, '{output_path}')"

## 8. Summary and Best Practices

Based on our exploration, here are some best practices for text preprocessing and memory optimization:

1. **Memory-Efficient Data Loading**:
   - Use generators and iterators for large datasets
   - Process data in batches rather than loading everything at once
   - Choose the batch size from your memory budget; the benchmark in Section 4.1 shows how to compare candidates

2. **Text Preprocessing**:
   - Start with basic preprocessing (lowercase, remove punctuation)
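   - Remove stopwords for most classification tasks, but check the list first: NLTK's English stopwords include negations such as "not", which carry signal in sentiment data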
   - Choose between stemming (faster) and lemmatization (more accurate) based on your needs
   - Customize preprocessing based on the specific domain

3. **Pipeline Integration**:
   - Use DVC to track both raw and processed datasets
   - Ensure reproducibility by versioning preprocessing code along with data
   - Create well-defined pipeline stages for preprocessing
   - Track dependencies between data and code

4. **Performance Considerations**:
   - Each additional preprocessing step (stopword removal, stemming, lemmatization) adds processing time
   - Weigh preprocessing thoroughness against speed and memory usage
   - Consider using multiprocessing for very large datasets
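
To make the first point concrete, the sketch below compares peak Python-level allocations for an eager list and a lazy generator using the standard `tracemalloc` module. It measures only memory allocated by the Python interpreter, not the process RSS used in Section 4.1. All function names and the synthetic data are assumptions made for illustration.

```python
import tracemalloc


def peak_kib(func):
    """Return the peak traced Python allocation (KiB) while func runs."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024


def synthetic_texts(n):
    # Synthetic records, produced lazily one at a time.
    return (f"review number {i} is fine" for i in range(n))


def eager(n=50_000):
    texts = list(synthetic_texts(n))  # materialize every record first
    return sum(len(t) for t in texts)


def lazy(n=50_000):
    return sum(len(t) for t in synthetic_texts(n))  # one record at a time


eager_peak = peak_kib(eager)
lazy_peak = peak_kib(lazy)
print(f"eager: {eager_peak:.0f} KiB, lazy: {lazy_peak:.0f} KiB")
```

Both functions compute the same total, but the eager version must hold all records at once, so its peak grows with the dataset while the lazy version's peak stays roughly constant.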