# Lab 2.5.3: Dataset Processing with Hugging Face Datasets

**Module:** 2.5 - Hugging Face Ecosystem  
**Time:** 2 hours  
**Difficulty:** ⭐⭐ (Intermediate)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Load datasets from the Hugging Face Hub
- [ ] Apply transformations with `map()` and `filter()`
- [ ] Create train/validation/test splits
- [ ] Handle large datasets with streaming
- [ ] Prepare datasets for training with tokenization

---

## Prerequisites

- Completed: Labs 2.5.1 and 2.5.2
- Knowledge of: Tokenization concepts, PyTorch DataLoader basics

---

## DGX Spark Advantage

With DGX Spark's 128GB unified memory, you can:
- Load larger datasets directly into memory (no streaming needed for most datasets)
- Use more parallel workers (`num_proc` set to a value between 4 and 8) for faster preprocessing
- Process entire datasets without batching constraints
- Cache tokenized datasets for faster iteration

---

## Real-World Context

**The Data Pipeline Challenge**: You're building a sentiment classifier for product reviews. You have:
- 1 million reviews from multiple sources
- Mixed languages, varying lengths
- Imbalanced classes (80% positive, 20% negative)

The Hugging Face **Datasets** library gives you the tools to handle data at this scale efficiently, with:
- Memory-mapped data (process datasets larger than RAM)
- Parallel processing with `num_proc`
- Built-in streaming for huge datasets
- One-line integration with the Trainer API

---

## ELI5: Datasets Library

> **Imagine you're organizing a massive library...**
>
> The old way: Load every book into your arms, then sort them. Your arms get tired fast!
>
> The Datasets way: Use a magical catalog that lets you:
> - **Point** to books without carrying them (memory-mapping)
> - **Clone yourself** to sort faster (parallel processing)
> - **Read page by page** without loading the whole book (streaming)
>
> **In AI terms:** It's a library that handles millions of examples efficiently:
> ```python
> # Tens of thousands of reviews (or millions)? No problem!
> dataset = load_dataset("imdb")  # Downloads once, then reuses the local cache
> dataset = dataset.map(tokenize, num_proc=8)  # 8 worker processes in parallel
> ```

---

## Part 1: Loading Datasets

In [None]:
import torch
from datasets import load_dataset, Dataset, DatasetDict
from datasets import concatenate_datasets
from transformers import AutoTokenizer
import numpy as np
from collections import Counter
import time

print("Environment Check")
print("=" * 50)
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

### 1.1 Loading from the Hub

In [None]:
# Load the IMDB dataset - a classic for sentiment analysis
print("Loading IMDB dataset...")
start = time.time()
imdb = load_dataset("imdb")
print(f"Loaded in {time.time() - start:.2f} seconds")

# Explore the structure
print("\nDataset Structure:")
print(imdb)

print("\nSplits available:")
for split in imdb:
    print(f"  {split}: {len(imdb[split]):,} examples")

print("\nFeatures (columns):")
print(imdb['train'].features)

In [None]:
# Look at a single example
print("\nSample from training set:")
print("-" * 60)
sample = imdb['train'][0]
print(f"Label: {sample['label']} ({'positive' if sample['label'] == 1 else 'negative'})")
print(f"Text preview: {sample['text'][:300]}...")
print(f"Text length: {len(sample['text'])} characters")

### 1.2 Loading Dataset Subsets and Specific Splits

In [None]:
# Load only specific split
train_only = load_dataset("imdb", split="train")
print(f"Train only: {len(train_only):,} examples")

# Load a slice (for quick testing)
small_train = load_dataset("imdb", split="train[:1000]")
print(f"Small train (first 1000): {len(small_train):,} examples")

# Load percentage
train_10pct = load_dataset("imdb", split="train[:10%]")
print(f"10% of train: {len(train_10pct):,} examples")

# Load with train/test merged
all_data = load_dataset("imdb", split="train+test")
print(f"All data combined: {len(all_data):,} examples")

### 1.3 Loading Datasets with Configurations

In [None]:
# Many datasets have configurations (subsets)
# GLUE is a benchmark with multiple tasks

print("Loading GLUE SST-2 (Stanford Sentiment Treebank):")
sst2 = load_dataset("glue", "sst2")
print(sst2)

print("\nSample:")
print(sst2['train'][0])

---

## Part 2: Dataset Analysis

In [None]:
def analyze_dataset(dataset, text_column='text', label_column='label', name='Dataset'):
    """
    Comprehensive dataset analysis.
    """
    print(f"\n{'='*60}")
    print(f"DATASET ANALYSIS: {name}")
    print(f"{'='*60}")
    
    # Basic info
    print(f"\nSize: {len(dataset):,} examples")
    print(f"Columns: {dataset.column_names}")
    print(f"Features: {dataset.features}")
    
    # Label distribution
    if label_column in dataset.column_names:
        labels = dataset[label_column]
        label_counts = Counter(labels)
        print(f"\nLabel Distribution:")
        total = len(labels)
        for label, count in sorted(label_counts.items()):
            pct = 100 * count / total
            print(f"  Label {label}: {count:,} ({pct:.1f}%)")
    
    # Text length statistics
    if text_column in dataset.column_names:
        # Use the first rows as a cheap (non-random) sample; one slice is faster than row-by-row access
        sample_size = min(1000, len(dataset))
        lengths = [len(t) for t in dataset[:sample_size][text_column]]
        
        print(f"\nText Length (characters) - sampled {sample_size}:")
        print(f"  Mean: {np.mean(lengths):.0f}")
        print(f"  Std: {np.std(lengths):.0f}")
        print(f"  Min: {min(lengths)}")
        print(f"  Max: {max(lengths)}")
        print(f"  Median: {np.median(lengths):.0f}")
    
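    # Size estimate of the underlying Arrow table.
    # Note: hasattr(dataset, '_indices') is always True, so check the value instead.
    try:
        size_mb = dataset.data.nbytes / 1e6
        note = " (full table; this view selects a subset of rows)" if dataset._indices is not None else ""
        print(f"\nEstimated Arrow table size: {size_mb:.1f} MB{note}")
    except Exception:
        pass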

# Analyze IMDB
analyze_dataset(imdb['train'], name='IMDB Train')
analyze_dataset(imdb['test'], name='IMDB Test')

---

## Part 3: The `map()` Function - Your Swiss Army Knife

The `map()` function applies a transformation to every example in the dataset.

In [None]:
# Simple transformation: add text length
def add_length(example):
    example['text_length'] = len(example['text'])
    return example

# Apply to dataset
print("Adding text length column...")
imdb_with_length = imdb['train'].map(add_length)

print("\nNew columns:", imdb_with_length.column_names)
print("Sample:", imdb_with_length[0]['text_length'], "characters")

### 3.1 Batched Processing (Much Faster!)

In [None]:
# Batched transformation - processes multiple examples at once
def add_length_batched(examples):
    examples['text_length'] = [len(t) for t in examples['text']]
    return examples

# Compare speeds
print("Speed Comparison:")
print("-" * 50)

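# map() caches results, and add_length already ran above, so bypass the cache
# (load_from_cache_file=False) to time real work instead of a cache hit

# Non-batched (slow)
start = time.time()
_ = imdb['train'].map(add_length, load_from_cache_file=False)
print(f"Non-batched: {time.time() - start:.2f}s")

# Batched (fast!)
start = time.time()
_ = imdb['train'].map(add_length_batched, batched=True, load_from_cache_file=False)
print(f"Batched: {time.time() - start:.2f}s")

# Batched + parallel (often fastest; for a function this cheap, process startup can dominate)
start = time.time()
_ = imdb['train'].map(add_length_batched, batched=True, num_proc=4, load_from_cache_file=False)
print(f"Batched + 4 workers: {time.time() - start:.2f}s")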

### 3.2 Tokenization with `map()`

In [None]:
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256
    )

# Tokenize the dataset
print("Tokenizing IMDB dataset...")
start = time.time()

tokenized_imdb = imdb.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=['text'],  # Remove original text to save memory
    desc="Tokenizing"
)

print(f"\nTokenization completed in {time.time() - start:.2f}s")
print("\nNew columns:", tokenized_imdb['train'].column_names)
print("\nSample tokenized example:")
print(f"  input_ids length: {len(tokenized_imdb['train'][0]['input_ids'])}")
print(f"  attention_mask length: {len(tokenized_imdb['train'][0]['attention_mask'])}")
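
The cell above pads every review to `max_length=256`. An alternative is to tokenize with `padding=False` and let a collator pad each batch to its own longest sequence. The pure-Python sketch below only estimates how many token slots each strategy would allocate; the sequence lengths and the `padded_token_counts` helper are hypothetical.

```python
def padded_token_counts(lengths, batch_size, max_length):
    """Return (fixed, dynamic) token-slot totals for a list of sequence lengths."""
    # Truncation caps every sequence at max_length in both strategies
    truncated = [min(n, max_length) for n in lengths]

    # Fixed padding: every sequence occupies max_length slots
    fixed = len(truncated) * max_length

    # Dynamic padding: each batch is padded to its own longest sequence
    dynamic = 0
    for start in range(0, len(truncated), batch_size):
        batch = truncated[start:start + batch_size]
        dynamic += len(batch) * max(batch)
    return fixed, dynamic

# Hypothetical token counts for eight reviews
lengths = [12, 40, 256, 300, 18, 22, 35, 60]
fixed, dynamic = padded_token_counts(lengths, batch_size=4, max_length=256)
print(f"Fixed padding:   {fixed} token slots")
print(f"Dynamic padding: {dynamic} token slots")
```

The saving depends heavily on how varied the lengths are within each batch, so treat these numbers as an illustration rather than a benchmark.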

---

## Part 4: Filtering Data

In [None]:
# Filter examples based on criteria

# Keep only reviews with at least 500 characters
long_reviews = imdb['train'].filter(
    lambda x: len(x['text']) >= 500,
    num_proc=4
)
print(f"Reviews >= 500 chars: {len(long_reviews):,} / {len(imdb['train']):,}")

# Keep only positive reviews
positive_only = imdb['train'].filter(lambda x: x['label'] == 1)
print(f"Positive reviews: {len(positive_only):,}")

# Keep reviews between 200-1000 characters
medium_reviews = imdb['train'].filter(
    lambda x: 200 <= len(x['text']) <= 1000
)
print(f"Medium reviews (200-1000 chars): {len(medium_reviews):,}")

### 4.1 Batched Filtering

In [None]:
# Batched filtering is faster for simple conditions
def filter_by_length_batched(examples):
    return [len(t) >= 500 for t in examples['text']]

# Compare speeds
print("Filter speed comparison:")

# Bypass the cache so repeated runs measure real filtering work
start = time.time()
_ = imdb['train'].filter(lambda x: len(x['text']) >= 500, load_from_cache_file=False)
print(f"Non-batched: {time.time() - start:.2f}s")

start = time.time()
_ = imdb['train'].filter(filter_by_length_batched, batched=True, load_from_cache_file=False)
print(f"Batched: {time.time() - start:.2f}s")

---

## Part 5: Creating Custom Splits

In [None]:
# Create train/validation/test splits
print("Creating custom splits from IMDB train...")

# Start with train split only
full_train = imdb['train']

# First split: separate test set (10%)
split1 = full_train.train_test_split(test_size=0.1, seed=42)
train_val = split1['train']
test_set = split1['test']

# Second split: separate validation from remaining train (10% of 90% = 9%)
split2 = train_val.train_test_split(test_size=0.1, seed=42)
train_set = split2['train']
val_set = split2['test']

print(f"\nFinal splits:")
print(f"  Train: {len(train_set):,} ({100*len(train_set)/len(full_train):.1f}%)")
print(f"  Val:   {len(val_set):,} ({100*len(val_set)/len(full_train):.1f}%)")
print(f"  Test:  {len(test_set):,} ({100*len(test_set)/len(full_train):.1f}%)")

# Create a DatasetDict
custom_splits = DatasetDict({
    'train': train_set,
    'validation': val_set,
    'test': test_set
})

print("\nCustom DatasetDict:")
print(custom_splits)
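
Splitting off 10% twice gives roughly 81/9/10, not 80/10/10. To hit exact proportions, the second `test_size` has to be rescaled, or absolute counts can be used (`train_test_split` also accepts an integer `test_size`). The helpers below are hypothetical and only do the arithmetic:

```python
def second_split_fraction(val_fraction, test_fraction):
    """Fraction for the second split so validation is val_fraction of the ORIGINAL data."""
    return val_fraction / (1.0 - test_fraction)

def split_counts(n_examples, val_fraction, test_fraction):
    """Absolute (train, val, test) sizes for the requested fractions."""
    n_test = round(n_examples * test_fraction)
    n_val = round(n_examples * val_fraction)
    return n_examples - n_val - n_test, n_val, n_test

# IMDB train has 25,000 examples; target an 80/10/10 split
fraction = second_split_fraction(val_fraction=0.1, test_fraction=0.1)
n_train, n_val, n_test = split_counts(25_000, val_fraction=0.1, test_fraction=0.1)

print(f"Second split test_size: {fraction:.4f}")
print(f"Train/val/test counts: {n_train:,} / {n_val:,} / {n_test:,}")
```

Passing `n_test` and then `n_val` as integer `test_size` values avoids any rounding surprises from floating-point fractions.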

### 5.1 Stratified Splits (Maintaining Label Balance)

In [None]:
# Stratified split maintains label proportions
print("Stratified split (maintains label balance):")

stratified_split = full_train.train_test_split(
    test_size=0.2,
    stratify_by_column='label',
    seed=42
)

# Check label distribution in both splits
for split_name in ['train', 'test']:
    labels = stratified_split[split_name]['label']
    counts = Counter(labels)
    total = len(labels)
    print(f"\n{split_name}:")
    for label, count in sorted(counts.items()):
        print(f"  Label {label}: {count:,} ({100*count/total:.1f}%)")
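
To see what stratification buys you, the sketch below splits each label group separately so both sides keep the original proportions. It is a simplified pure-Python illustration of the idea, not the library's implementation, and it uses the 80/20 imbalance from the Real-World Context section as hypothetical data.

```python
import random
from collections import Counter, defaultdict

def stratified_indices(labels, test_fraction, seed=42):
    """Split indices so each label keeps the same share in train and test."""
    rng = random.Random(seed)

    # Group example indices by label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# 80% positive, 20% negative
labels = [1] * 80 + [0] * 20
train_idx, test_idx = stratified_indices(labels, test_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("Train:", dict(train_counts))
print("Test: ", dict(test_counts))
```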

---

## Part 6: Working with Large Datasets (Streaming)

In [None]:
# Streaming mode: process huge datasets without loading into memory
print("Loading C4 dataset in streaming mode...")
print("(C4 is ~300GB - we don't want to download all of it!)")

# Load in streaming mode
c4_stream = load_dataset(
    "allenai/c4",
    "en",  # English subset
    split="train",
    streaming=True
)

print(f"\nType: {type(c4_stream)}")
print("(IterableDataset - data is fetched on-demand)")

# Take a few examples
print("\nFirst 3 examples:")
for i, example in enumerate(c4_stream.take(3)):
    print(f"\nExample {i+1}:")
    print(f"  URL: {example.get('url', 'N/A')[:50]}...")
    print(f"  Text: {example['text'][:100]}...")

In [None]:
# Processing streaming datasets
print("\nProcessing streaming data:")

# Map operations work on streams too
processed_stream = c4_stream.map(
    lambda x: {'text_length': len(x['text'])}
)

# Filter also works
long_texts = processed_stream.filter(
    lambda x: x['text_length'] > 1000
)

# Shuffle with a buffer
shuffled = long_texts.shuffle(buffer_size=1000, seed=42)

# Take samples
print("\nSampling from processed stream:")
for i, example in enumerate(shuffled.take(3)):
    print(f"  Sample {i+1}: {example['text_length']} chars")
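
A stream cannot be shuffled globally because its full contents are never in memory, which is why `shuffle()` takes a `buffer_size`. The generator below is a simplified model of buffer-based shuffling written for illustration; the library's actual implementation may differ in its details.

```python
import random

def buffered_shuffle(stream, buffer_size, seed=42):
    """Approximate shuffle: keep a fixed-size buffer and emit random items from it."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled_items = list(buffered_shuffle(range(20), buffer_size=5))
unshuffled_items = list(buffered_shuffle(range(5), buffer_size=1))
print(shuffled_items)
```

In this model, the first emitted item always comes from the first `buffer_size` items of the stream, so a small buffer gives only local mixing. A larger buffer mixes better at the cost of memory and start-up time.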

---

## Part 7: Preparing for Training

In [None]:
# Complete pipeline: Load -> Process -> Ready for Trainer
print("Complete Data Preparation Pipeline")
print("=" * 60)

# 1. Load dataset
print("\n1. Loading dataset...")
raw_dataset = load_dataset("imdb")

# 2. Create validation split
print("2. Creating validation split...")
split = raw_dataset['train'].train_test_split(test_size=0.1, seed=42)
raw_dataset = DatasetDict({
    'train': split['train'],
    'validation': split['test'],
    'test': raw_dataset['test']
})

# 3. Tokenize
print("3. Tokenizing...")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_and_prepare(examples):
    tokenized = tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256
    )
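    # Store the label under 'labels', the argument name the model's forward() expects
    # (the Trainer's default collators would also rename 'label' automatically)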
    tokenized['labels'] = examples['label']
    return tokenized

processed_dataset = raw_dataset.map(
    tokenize_and_prepare,
    batched=True,
    num_proc=4,
    remove_columns=['text', 'label'],
    desc="Tokenizing"
)

# 4. Set format for PyTorch
print("4. Setting format for PyTorch...")
processed_dataset.set_format("torch")

print("\n" + "=" * 60)
print("READY FOR TRAINING!")
print("=" * 60)
print(f"\nDataset structure: {processed_dataset}")
print(f"\nColumns: {processed_dataset['train'].column_names}")
print(f"\nSample tensor shapes:")
sample = processed_dataset['train'][0]
for key, val in sample.items():
    if hasattr(val, 'shape'):
        print(f"  {key}: {val.shape}")
    else:
        print(f"  {key}: {type(val).__name__}")

### 7.1 Using with DataLoader

In [None]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Create a data collator: it pads each batch to its longest sequence
# (a no-op here, since every example was already padded to max_length=256)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create DataLoader
train_dataloader = DataLoader(
    processed_dataset['train'],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)

# Get a batch
batch = next(iter(train_dataloader))
print("Batch contents:")
for key, val in batch.items():
    print(f"  {key}: {val.shape}")

---

## Part 8: Saving and Loading Processed Datasets

In [None]:
# Save processed dataset to disk
print("Saving processed dataset...")
processed_dataset.save_to_disk("../data/processed_imdb")
print("Saved to ../data/processed_imdb")

# Load it back
print("\nLoading from disk...")
from datasets import load_from_disk
loaded_dataset = load_from_disk("../data/processed_imdb")
print(loaded_dataset)

In [None]:
# Save to different formats

# As Parquet (efficient columnar format)
processed_dataset['train'].to_parquet("../data/imdb_train.parquet")
print("Saved as Parquet")

# As JSON
processed_dataset['train'].select(range(100)).to_json("../data/imdb_sample.json")
print("Saved sample as JSON")

# As CSV (note: tensors will be converted to lists)
# raw_dataset['train'].select(range(100)).to_csv("../data/imdb_sample.csv")
# print("Saved sample as CSV")

---

## Try It Yourself: Build a Data Pipeline

In [None]:
# YOUR CODE HERE
# Build a complete data pipeline for the AG News dataset
# Requirements:
# 1. Load the AG News dataset
# 2. Analyze the label distribution
# 3. Filter to only keep articles > 200 characters
# 4. Create train/val/test splits (80/10/10)
# 5. Tokenize with a transformer tokenizer
# 6. Set format for PyTorch

# Hint: AG News has 4 categories - World, Sports, Business, Sci/Tech

# Your code:
# ag_news = load_dataset("ag_news")
# ...


---

## Common Mistakes

### Mistake 1: Not Using Batched Processing

```python
# Wrong: Processes one example at a time (SLOW)
dataset.map(lambda x: tokenizer(x['text']))

# Right: Process in batches (FAST)
dataset.map(lambda x: tokenizer(x['text']), batched=True)
```

### Mistake 2: Forgetting `remove_columns`

```python
# Wrong: Keeps original text (wastes memory)
dataset.map(tokenize_fn, batched=True)

# Right: Remove columns you don't need
dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
```

### Mistake 3: Wrong Label Column Name

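```python
# Risky: models expect a 'labels' argument; 'label' only works when the
# collator renames it (the Trainer's default collators do, a custom loop may not)
tokenized['label'] = examples['label']

# Safer: use 'labels' explicitly
tokenized['labels'] = examples['label']
```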

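### Mistake 4: Not Setting Format Before a Custom Training Loop

```python
# Wrong for a plain DataLoader: rows come back as Python lists, not tensors
loader = DataLoader(dataset, batch_size=16)

# Right: return PyTorch tensors first
dataset.set_format("torch")
loader = DataLoader(dataset, batch_size=16)

# Note: Trainer builds tensors through its data collator, so this step is optional there
```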

---

## Checkpoint

You've learned:
- Loading datasets from the Hub with various options
- Analyzing dataset statistics
- Applying transformations with `map()` and `filter()`
- Creating stratified train/val/test splits
- Handling huge datasets with streaming
- Preparing datasets for the Trainer API

---

## Further Reading

- [Datasets Documentation](https://huggingface.co/docs/datasets)
- [Dataset Processing Guide](https://huggingface.co/docs/datasets/process)
- [Streaming Datasets](https://huggingface.co/docs/datasets/stream)

---

## Cleanup

In [None]:
import gc
import shutil

# Clean up saved files
# shutil.rmtree("../data/processed_imdb", ignore_errors=True)

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("\nLab 2.5.3 complete!")