# Task 9.3: Dataset Processing with Hugging Face Datasets

**Module:** 9 - Hugging Face Ecosystem  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Load datasets from the Hugging Face Hub
- [ ] Apply map(), filter(), and select() operations
- [ ] Process large datasets efficiently with batching and multiprocessing
- [ ] Create train/validation/test splits
- [ ] Understand streaming for massive datasets
- [ ] Save and share processed datasets

---

## Prerequisites

- Completed: Tasks 9.1 and 9.2 (Hub and Pipelines)
- Knowledge of: Pandas basics, tokenization concepts

---

## Real-World Context

Imagine you're a chef preparing ingredients for a restaurant:
- You receive raw ingredients (raw data)
- You wash, chop, measure, and portion everything (preprocessing)
- You organize into prep containers (batching)
- You store properly for service (caching)

Data preprocessing often takes up most of the time in an ML project. The `datasets` library makes that work efficient and reproducible.

**Real examples:**
- Training a chatbot: Process millions of conversation pairs
- Sentiment analysis: Tokenize and encode text with labels
- Translation: Align source and target language pairs
- Fine-tuning LLMs: Format instruction-response pairs

---

## ELI5: What is the Datasets Library?

> **Imagine you have a huge recipe book** with millions of recipes. Reading the whole book into your head at once would be impossible!
>
> The `datasets` library is like a magical bookmark that:
> - Lets you flip to any page instantly (memory-mapping)
> - Can modify recipes without rewriting the whole book (lazy evaluation)
> - Has helpers that can work on many recipes at once (multiprocessing)
> - Remembers your changes so you don't redo work (caching)
>
> **Special trick:** Instead of loading all recipes, it just remembers WHERE each recipe is. When you need one, it grabs just that page!
>
> **In AI terms:** The library uses Apache Arrow format for efficient memory-mapped I/O, enabling zero-copy reads and fast random access without loading entire datasets into RAM.
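
To make the "remembers WHERE each recipe is" idea concrete, here is a minimal sketch of memory-mapped random access using only Python's standard `mmap` module. It is not how Arrow or the `datasets` library is implemented; the fixed-width record layout and the `read_record` helper are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per record

# Write 100,000 fixed-width records to a temporary file on disk.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for i in range(100_000):
        f.write(RECORD.pack(i * i))

def read_record(mapped, index):
    """Jump straight to one record using its byte offset (hypothetical helper)."""
    offset = index * RECORD.size
    return RECORD.unpack_from(mapped, offset)[0]

with open(path, "rb") as f:
    # Map the file into the address space; pages are read only when touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        value = read_record(mapped, 70_000)

print(value)
```

Because every record has the same width, the position of record `index` is simple arithmetic, so the program never has to read the records that come before it.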

---

## Part 1: Loading Datasets

In [None]:
# Install required packages
# Note: These packages are pre-installed in the NGC PyTorch container.
# Running pip install ensures you have compatible versions.

!pip install -q "transformers>=4.35.0" "huggingface_hub>=0.19.0" "datasets>=2.14.0"

print("Packages ready!")

In [None]:
# Import the datasets library
from datasets import load_dataset, Dataset, DatasetDict
from datasets import load_dataset_builder
import torch
import numpy as np
import os

# Set cache directory (useful for DGX Spark with mounted volumes)
# os.environ['HF_DATASETS_CACHE'] = '/workspace/data/hf_cache'

print("Datasets library ready!")
print(f"Cache dir: {os.environ.get('HF_DATASETS_CACHE', '~/.cache/huggingface/datasets')}")

### Loading from the Hub

The simplest way to load a dataset:

In [None]:
# Load the IMDB dataset (sentiment analysis)
print("Loading IMDB dataset...")
imdb = load_dataset("imdb")

print(f"\nDataset structure:")
print(imdb)

print(f"\nTrain split size: {len(imdb['train']):,} examples")
print(f"Test split size: {len(imdb['test']):,} examples")

In [None]:
# Explore the structure
print("Dataset features (columns):")
print(imdb['train'].features)

print("\nFirst example:")
example = imdb['train'][0]
print(f"  Text: {example['text'][:200]}...")
print(f"  Label: {example['label']} (0=negative, 1=positive)")

### Loading Specific Splits and Subsets

In [None]:
# Load only the train split
train_only = load_dataset("imdb", split="train")
print(f"Train only: {len(train_only):,} examples")

# Load a subset (first 1000 examples)
small_train = load_dataset("imdb", split="train[:1000]")
print(f"First 1000: {len(small_train):,} examples")

# Load a percentage
tiny_train = load_dataset("imdb", split="train[:5%]")
print(f"First 5%: {len(tiny_train):,} examples")

# Load a range
middle = load_dataset("imdb", split="train[1000:2000]")
print(f"Range 1000-2000: {len(middle):,} examples")

### What Just Happened?

When you load a dataset:
1. **Download**: Files are downloaded to cache (~/.cache/huggingface/datasets)
2. **Convert**: Data is converted to Arrow format (efficient binary)
3. **Memory-map**: Files are memory-mapped (not loaded into RAM!)
4. **Cache**: Processing is cached for future use
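
Step 4 is the reason rerunning a processing cell is usually much faster the second time. The sketch below is a toy illustration of the idea, not the library's implementation: it derives a cache key from the previous dataset state, the function's bytecode, and its arguments, and reuses a stored result when the key matches. All names here (`fingerprint`, `cached_apply`, `_cache`) are hypothetical.

```python
import hashlib

_cache = {}  # hypothetical in-memory stand-in for on-disk cache files

def fingerprint(previous, fn, kwargs):
    """Hash the prior state, the function's bytecode, and its arguments."""
    h = hashlib.sha256()
    h.update(previous.encode())
    h.update(fn.__code__.co_code)
    h.update(repr(fn.__code__.co_consts).encode())
    h.update(repr(sorted(kwargs.items())).encode())
    return h.hexdigest()

def cached_apply(rows, state, fn, **kwargs):
    """Return (result, cache_key, was_cache_hit)."""
    key = fingerprint(state, fn, kwargs)
    if key in _cache:
        return _cache[key], key, True
    result = [fn(row, **kwargs) for row in rows]
    _cache[key] = result
    return result, key, False

def shout(row, limit=5):
    return {**row, "short": row["text"].upper()[:limit]}

rows = [{"text": "great movie"}, {"text": "terrible plot"}]
first, key1, hit1 = cached_apply(rows, "raw-v1", shout, limit=5)
second, key2, hit2 = cached_apply(rows, "raw-v1", shout, limit=5)  # same call: reused
third, key3, hit3 = cached_apply(rows, "raw-v1", shout, limit=3)   # new argument: recomputed

print(hit1, hit2, hit3)
```

The practical consequence is the same as in the notebook: changing the function or its arguments produces a new key and triggers reprocessing, while an identical call is served from the cache.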

On your DGX Spark with 128GB memory, you can load MUCH larger datasets than typical systems!

---

## Part 2: Exploring Dataset Information

In [None]:
# Get info without downloading
builder = load_dataset_builder("squad")

print("SQuAD Dataset Info:")
print(f"  Description: {builder.info.description[:200]}...")
print(f"  Homepage: {builder.info.homepage}")
print(f"  Features: {builder.info.features}")
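# Size fields are optional metadata and may be None for some datasets
for name, size in [("Download size", builder.info.download_size),
                   ("Dataset size", builder.info.dataset_size)]:
    print(f"  {name}: {size / 1e6:.1f} MB" if size is not None else f"  {name}: unknown")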

In [None]:
# Explore available datasets
from huggingface_hub import list_datasets

# Find popular datasets for specific tasks
text_class_datasets = list(list_datasets(
    filter="task_categories:text-classification",
    sort="downloads",
    direction=-1,
    limit=10
))

print("Top 10 Text Classification Datasets:\n")
for i, ds in enumerate(text_class_datasets, 1):
    print(f"{i}. {ds.id} ({ds.downloads:,} downloads)")

---

## Part 3: The Map Operation (Your Most Important Tool)

`.map()` applies a function to every example. It's how you:
- Tokenize text
- Add new columns
- Transform existing columns
- Clean data
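
A plain-Python sketch of that contract, assuming the behavior used throughout this notebook: the function receives one example as a dictionary and returns a dictionary whose keys are added to (or overwrite) the example's columns. `toy_map` is a hypothetical stand-in, not the library's implementation.

```python
def toy_map(rows, fn):
    """Apply fn to each row and merge the returned keys into a copy of the row."""
    out = []
    for row in rows:
        updates = fn(dict(row))          # hand the function its own copy
        out.append({**row, **updates})   # new keys are added, existing keys overwritten
    return out

def add_length(example):
    # Returning only the new column is enough; it is merged into the row.
    return {"word_count": len(example["text"].split())}

def clean(example):
    # Returning an existing key overwrites that column.
    return {"text": example["text"].strip().lower()}

rows = [{"text": "  A Great Movie ", "label": 1}, {"text": "Boring", "label": 0}]
result = toy_map(toy_map(rows, clean), add_length)
print(result[0])
```

Note that the original `rows` list is left untouched; each call produces a new collection, which mirrors how the notebook reassigns the result of `.map()` to a variable.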

In [None]:
# Simple map: add text length
def add_length(example):
    """Add word count to each example."""
    example['word_count'] = len(example['text'].split())
    return example

# Apply to small subset
small_imdb = load_dataset("imdb", split="train[:100]")

print("Before map:")
print(f"  Columns: {small_imdb.column_names}")

# Apply the function
small_imdb = small_imdb.map(add_length)

print("\nAfter map:")
print(f"  Columns: {small_imdb.column_names}")
print(f"  Sample word counts: {small_imdb['word_count'][:5]}")

### Batched Map (Much Faster!)

In [None]:
import time

# Comparison: batched vs non-batched
test_data = load_dataset("imdb", split="train[:5000]")

# Non-batched (slower)
def process_single(example):
    example['upper_text'] = example['text'].upper()[:100]
    return example

start = time.time()
_ = test_data.map(process_single)
non_batched_time = time.time() - start

# Batched (faster!)
def process_batch(examples):
    examples['upper_text'] = [t.upper()[:100] for t in examples['text']]
    return examples

start = time.time()
_ = test_data.map(process_batch, batched=True, batch_size=1000)
batched_time = time.time() - start

print(f"Non-batched: {non_batched_time:.2f}s")
print(f"Batched: {batched_time:.2f}s")
print(f"Speedup: {non_batched_time/batched_time:.1f}x faster!")

### Tokenization with Map (The Most Common Use Case)

In [None]:
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenization function
def tokenize_function(examples):
    """Tokenize text using BERT tokenizer."""
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=256
    )

# Tokenize with batching and multiprocessing
tokenized = test_data.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
    num_proc=4,  # Use 4 CPU cores
    remove_columns=['text']  # Remove original text to save memory
)

print("Tokenized dataset:")
print(f"  Columns: {tokenized.column_names}")
print(f"  Example input_ids length: {len(tokenized[0]['input_ids'])}")
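
To make `truncation=True`, `padding='max_length'`, and `max_length` concrete, here is a minimal sketch of what those settings do to one sequence of token IDs. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the IDs are illustrative, and a padding ID of 0 is assumed. A real tokenizer also keeps special tokens in place when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then pad it back up to exactly max_length."""
    ids = token_ids[:max_length]          # truncation
    mask = [1] * len(ids)                 # 1 = real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": mask + [0] * padding,  # 0 = padding, ignored by the model
    }

short = pad_or_truncate([101, 2307, 3185, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)

print(short)
print(long)
```

Every output has the same length, which is why the cell above reports one fixed `input_ids` length for all examples.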

### What Just Happened?

The `map()` call:
1. Split data into batches of 1000
2. Distributed batches across 4 CPU cores
3. Applied tokenization in parallel
4. Removed original text column to save memory
5. Cached results for future use

**DGX Spark Tip:** Check `os.cpu_count()` to see how many CPU cores you have. If it reports more than the 4 used here, try `num_proc=8` or higher for even faster processing!
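
With `batched=True` the function no longer sees one example at a time. It receives a dictionary of columns, each a list of up to `batch_size` values, and returns lists of the same length. A plain-Python sketch of that layout, with `iter_batches` as a hypothetical helper rather than anything from the library:

```python
def iter_batches(rows, batch_size):
    """Yield column-oriented batches: {"column": [value0, value1, ...]}."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}

def process_batch(examples):
    # examples["text"] is a list of strings, so we build a list of results.
    examples["word_count"] = [len(t.split()) for t in examples["text"]]
    return examples

rows = [{"text": f"review number {i}", "label": i % 2} for i in range(5)]
batches = [process_batch(b) for b in iter_batches(rows, batch_size=2)]
sizes = [len(b["text"]) for b in batches]

print(sizes)
print(batches[0])
```

The last batch is smaller than `batch_size` when the row count does not divide evenly, so a batched function should never assume a fixed batch length.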

---

## Part 4: Filter and Select Operations

In [None]:
# Reload with text for filtering examples
imdb_sample = load_dataset("imdb", split="train[:10000]")

# Add word count
def add_stats(examples):
    examples['word_count'] = [len(t.split()) for t in examples['text']]
    return examples

imdb_sample = imdb_sample.map(add_stats, batched=True)

print(f"Original size: {len(imdb_sample):,}")

In [None]:
# Filter: Keep only reviews with 100-500 words
def is_medium_length(example):
    return 100 <= example['word_count'] <= 500

filtered = imdb_sample.filter(is_medium_length)
print(f"After filtering (100-500 words): {len(filtered):,}")

# Filter: Keep only positive reviews
positive = imdb_sample.filter(lambda x: x['label'] == 1)
print(f"Positive reviews only: {len(positive):,}")

In [None]:
# Batched filtering is also available
def filter_batch(examples):
    return [100 <= wc <= 500 for wc in examples['word_count']]

filtered_batch = imdb_sample.filter(filter_batch, batched=True, batch_size=1000)
print(f"Batched filter result: {len(filtered_batch):,}")
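
The difference between the two filter styles is the return value: a per-example predicate returns one boolean, while a batched predicate returns one boolean per row in the batch. Here is a sketch of how such a mask selects rows, in plain Python with made-up word counts:

```python
from itertools import compress

# A column-oriented batch, as a batched function would receive it.
batch = {"word_count": [42, 180, 950, 100, 500], "label": [0, 1, 1, 0, 1]}

def filter_batch(examples):
    return [100 <= wc <= 500 for wc in examples["word_count"]]

mask = filter_batch(batch)
# Keep, in every column, only the positions where the mask is True.
kept = {col: list(compress(values, mask)) for col, values in batch.items()}

print(mask)
print(kept)
```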

In [None]:
# Select specific indices
indices = [0, 100, 200, 300, 400]
selected = imdb_sample.select(indices)

print(f"Selected 5 specific examples:")
for i, example in enumerate(selected):
    print(f"  {indices[i]}: {example['word_count']} words, label={example['label']}")

---

## Part 5: Creating Train/Validation/Test Splits

In [None]:
# IMDB ships with train and test splits but no validation split,
# so we carve a validation set out of the training data.

# Method 1: train_test_split
train_data = load_dataset("imdb", split="train")

# Split train into train and validation (90/10)
split_data = train_data.train_test_split(test_size=0.1, seed=42)

print("After splitting:")
print(f"  New train: {len(split_data['train']):,}")
print(f"  Validation: {len(split_data['test']):,}")

In [None]:
# Method 2: Stratified split (maintains class balance)
split_stratified = train_data.train_test_split(
    test_size=0.1, 
    seed=42,
    stratify_by_column='label'  # Maintain label distribution!
)

# Check label distribution
import collections

original_dist = collections.Counter(train_data['label'])
train_dist = collections.Counter(split_stratified['train']['label'])
val_dist = collections.Counter(split_stratified['test']['label'])

print("Label distributions:")
print(f"  Original: {dict(original_dist)}")
print(f"  New train: {dict(train_dist)}")
print(f"  Validation: {dict(val_dist)}")
print(f"\n  Original ratio: {original_dist[1]/sum(original_dist.values()):.2%} positive")
print(f"  Train ratio: {train_dist[1]/sum(train_dist.values()):.2%} positive")
print(f"  Val ratio: {val_dist[1]/sum(val_dist.values()):.2%} positive")
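
IMDB's training split is evenly balanced, so the effect of stratification is hard to see there. The sketch below shows the idea on deliberately imbalanced toy labels: split each class separately, then combine. It is a plain-Python illustration under that assumption, not the library's algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [1] * 300 + [0] * 700  # toy labels: 30% positive
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=42)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print(dict(train_counts), dict(test_counts))
```

Both resulting splits keep the 30% positive ratio, which a purely random split only achieves approximately.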

In [None]:
# Method 3: Create custom splits using select
# (numpy was imported at the top of the notebook)

# Shuffle indices
np.random.seed(42)
indices = np.random.permutation(len(train_data))

# Split 80/10/10
train_size = int(0.8 * len(indices))
val_size = int(0.1 * len(indices))

train_idx = indices[:train_size]
val_idx = indices[train_size:train_size + val_size]
test_idx = indices[train_size + val_size:]

# Create splits
custom_train = train_data.select(train_idx)
custom_val = train_data.select(val_idx)
custom_test = train_data.select(test_idx)

# Combine into DatasetDict
full_dataset = DatasetDict({
    'train': custom_train,
    'validation': custom_val,
    'test': custom_test
})

print("Custom 80/10/10 split:")
print(full_dataset)

---

## Part 6: Working with Large Datasets (Streaming)

In [None]:
# Streaming: Process data without downloading everything
# Perfect for datasets too large to fit on disk!

# Stream a HUGE dataset
streaming_ds = load_dataset(
    "HuggingFaceFW/fineweb",  # Roughly 15 trillion tokens in the full dataset!
    name="sample-10BT",
    split="train",
    streaming=True  # Magic flag!
)

print("Streaming dataset created (no download yet!)")
print(f"Type: {type(streaming_ds)}")

In [None]:
# Iterate through streaming dataset
# Data is downloaded on-demand!

print("First 3 examples from streaming dataset:\n")
for i, example in enumerate(streaming_ds):
    print(f"Example {i}:")
    print(f"  Text: {example['text'][:100]}...")
    print()
    if i >= 2:
        break

In [None]:
# Operations on streaming datasets
# They're applied lazily!

# Filter and map work on streaming datasets
processed_stream = (
    streaming_ds
    .filter(lambda x: len(x['text']) > 100)  # Only long texts
    .map(lambda x: {'word_count': len(x['text'].split()), **x})
    .take(5)  # Take first 5 that pass filter
)

print("Processed streaming results:\n")
for example in processed_stream:
    print(f"  Word count: {example['word_count']}")
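
"Applied lazily" means nothing is read or computed until you iterate. The same behavior can be sketched with ordinary Python generators. This is an analogy, not the library's implementation; `source` is a hypothetical stand-in for a remote dataset, and the counter shows how many records were actually pulled.

```python
from itertools import islice

stats = {"pulled": 0}  # how many records were actually read from the source

def source():
    """Endless stand-in for a remote dataset; records are produced on demand."""
    i = 0
    while True:
        stats["pulled"] += 1
        yield {"text": "word " * (i % 7)}
        i += 1

# Build the pipeline: filter, then map, then take 5. No work happens yet.
filtered = (ex for ex in source() if len(ex["text"]) > 10)
mapped = ({**ex, "word_count": len(ex["text"].split())} for ex in filtered)
first_five = islice(mapped, 5)

pulled_before = stats["pulled"]   # still zero: the pipeline is only a description
results = list(first_five)        # iteration drives the whole chain
pulled_after = stats["pulled"]

print(pulled_before, pulled_after, [r["word_count"] for r in results])
```

Even though the source is endless, the pipeline stops reading as soon as five examples have passed the filter.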

### When to Use Streaming

| Situation | Use Streaming? |
|-----------|----------------|
| Dataset fits on disk | No - regular loading is faster |
| Dataset > disk space | Yes! |
| Only need a sample | Yes |
| Need random access | No |
| Training with shuffling | Depends (use shuffle buffer) |

**DGX Spark Note:** With 128GB memory, you can load datasets that would crash most systems. Streaming is still useful for truly massive datasets like The Pile, FineWeb, or RedPajama.
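
The "shuffle buffer" mentioned in the table deserves a closer look, because a stream cannot be shuffled globally without reading all of it. Here is a minimal sketch of the technique in plain Python, assuming a simple fill-then-swap buffer; `shuffle_buffer` is a hypothetical helper and the library's own implementation may differ in detail.

```python
import random

def shuffle_buffer(stream, buffer_size, seed):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)   # fill the buffer first
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(shuffle_buffer(iter(range(20)), buffer_size=5, seed=42))
print(shuffled)
```

Memory use is bounded by `buffer_size`, but so is the randomness: an element can only move within a limited window, so a larger buffer gives a better shuffle at the cost of more memory. On a streaming dataset, the corresponding call is `streaming_ds.shuffle(seed=42, buffer_size=10_000)`, where the buffer size shown is just an example value.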

---

## Part 7: Complete Processing Pipeline

Let's build a complete pipeline for preparing the IMDB dataset for training.

In [None]:
from transformers import AutoTokenizer
import time

print("=" * 60)
print("COMPLETE DATASET PROCESSING PIPELINE")
print("=" * 60)

start_time = time.time()

# Step 1: Load dataset
print("\n[1/5] Loading dataset...")
raw_dataset = load_dataset("imdb")
print(f"      Loaded {len(raw_dataset['train']):,} train, {len(raw_dataset['test']):,} test")

# Step 2: Create validation split
print("\n[2/5] Creating validation split...")
train_val = raw_dataset['train'].train_test_split(
    test_size=0.1,
    seed=42,
    stratify_by_column='label'
)

dataset = DatasetDict({
    'train': train_val['train'],
    'validation': train_val['test'],
    'test': raw_dataset['test']
})
print(f"      Train: {len(dataset['train']):,}")
print(f"      Validation: {len(dataset['validation']):,}")
print(f"      Test: {len(dataset['test']):,}")

In [None]:
# Step 3: Tokenize
print("\n[3/5] Tokenizing...")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_batch(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=256,
        return_tensors=None  # Return lists, not tensors
    )

tokenized_dataset = dataset.map(
    tokenize_batch,
    batched=True,
    batch_size=1000,
    num_proc=4,
    remove_columns=['text'],
    desc="Tokenizing"
)

print(f"      Columns: {tokenized_dataset['train'].column_names}")

In [None]:
# Step 4: Format for PyTorch
print("\n[4/5] Formatting for PyTorch...")

# Rename label column to 'labels' (expected by HF Trainer)
tokenized_dataset = tokenized_dataset.rename_column('label', 'labels')

# Set format to PyTorch tensors
tokenized_dataset.set_format(
    type='torch',
    columns=['input_ids', 'attention_mask', 'labels']
)

print(f"      Format: PyTorch tensors")
print(f"      Sample input_ids type: {type(tokenized_dataset['train'][0]['input_ids'])}")

In [None]:
# Step 5: Verify and summarize
print("\n[5/5] Verification...")

# Check a sample
sample = tokenized_dataset['train'][0]
print(f"      Sample shapes:")
print(f"        input_ids: {sample['input_ids'].shape}")
print(f"        attention_mask: {sample['attention_mask'].shape}")
print(f"        labels: {sample['labels']}")

total_time = time.time() - start_time

print("\n" + "=" * 60)
print(f"PROCESSING COMPLETE in {total_time:.1f}s")
print("=" * 60)
print(f"\nFinal dataset:")
print(tokenized_dataset)

---

## Part 8: Saving and Loading Processed Datasets

In [None]:
# Save to disk
save_path = "./processed_imdb"

print(f"Saving to {save_path}...")
tokenized_dataset.save_to_disk(save_path)
print("Saved!")

# Check what was created
import os
for root, dirs, files in os.walk(save_path):
    level = root.replace(save_path, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f'{indent}{os.path.basename(root)}/')
    subindent = ' ' * 2 * (level + 1)
    for file in files[:5]:  # Show first 5 files
        print(f'{subindent}{file}')
    if len(files) > 5:
        print(f'{subindent}... and {len(files)-5} more files')

In [None]:
# Load from disk (much faster than reprocessing!)
from datasets import load_from_disk

print("Loading from disk...")
start = time.time()
loaded_dataset = load_from_disk(save_path)
load_time = time.time() - start

print(f"Loaded in {load_time:.2f}s!")
print(loaded_dataset)

In [None]:
# Push to Hub (requires HF login)
# tokenized_dataset.push_to_hub("username/imdb-bert-tokenized")

print("To push to Hub:")
print("1. Run: huggingface-cli login")
print("2. Then: dataset.push_to_hub('your-username/dataset-name')")

---

## Try It Yourself: Process the AG News Dataset

Build a complete processing pipeline for the AG News dataset:
1. Load `ag_news` dataset
2. Create train/validation split (90/10)
3. Tokenize with `distilbert-base-uncased`
4. Add a column for text length
5. Filter out very short texts (< 10 words)
6. Save the processed dataset

<details>
<summary>Hint</summary>

```python
# AG News has 'text' and 'label' columns
ag_news = load_dataset("ag_news")

# AG News labels: 0=World, 1=Sports, 2=Business, 3=Sci/Tech
print(ag_news['train'].features)
```
</details>

In [None]:
# YOUR CODE HERE
# Process the AG News dataset




---

## Common Mistakes

### Mistake 1: Not Using Batched Processing

In [None]:
# WRONG: Processing one at a time (slow!)
# def slow_tokenize(example):
#     return tokenizer(example['text'], truncation=True)
# dataset.map(slow_tokenize)  # Very slow!

# CORRECT: Batched processing
# def fast_tokenize(examples):
#     return tokenizer(examples['text'], truncation=True)
# dataset.map(fast_tokenize, batched=True)  # Much faster!

print("Always use batched=True for tokenization!")

### Mistake 2: Forgetting to Remove Columns

In [None]:
# WRONG: Keeping original text after tokenization wastes memory
# tokenized = dataset.map(tokenize_fn, batched=True)
# Result: columns = ['text', 'label', 'input_ids', 'attention_mask']

# CORRECT: Remove original text
# tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=['text'])
# Result: columns = ['label', 'input_ids', 'attention_mask']

print("Use remove_columns to save memory after processing!")

### Mistake 3: Not Setting the Format for Training

In [None]:
# WRONG: Trying to use dataset directly with PyTorch
# for batch in DataLoader(dataset):  # May fail or be slow!

# CORRECT: Set format first
# dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
# for batch in DataLoader(dataset):  # Works great!

print("Call set_format('torch', ...) before creating DataLoader!")

---

## Checkpoint

You've learned:
- ✅ How to load datasets from the Hub
- ✅ How to use map(), filter(), and select() operations
- ✅ How to process data efficiently with batching and multiprocessing
- ✅ How to create train/validation/test splits (including stratified)
- ✅ How to use streaming for massive datasets
- ✅ How to save and load processed datasets

---

## Challenge: Multi-Dataset Pipeline

Create a processing pipeline that:
1. Loads IMDB, Yelp, and Amazon reviews
2. Standardizes them to same format
3. Combines into one dataset
4. Creates balanced splits
5. Tokenizes with consistent settings

In [None]:
# YOUR CHALLENGE CODE HERE
# Hint: Use concatenate_datasets from datasets
# from datasets import concatenate_datasets


---

## Further Reading

- [Datasets Documentation](https://huggingface.co/docs/datasets)
- [Processing Data Tutorial](https://huggingface.co/docs/datasets/process)
- [Streaming Datasets](https://huggingface.co/docs/datasets/stream)
- [Dataset Performance Tips](https://huggingface.co/docs/datasets/about_cache)

---

## Cleanup

In [None]:
# Clean up saved files
import shutil

if os.path.exists("./processed_imdb"):
    shutil.rmtree("./processed_imdb")
    print("Cleaned up processed_imdb directory")

# Clear memory
import gc
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Cleanup complete!")

---

## Next Steps

In the next notebook, **04-trainer-finetuning.ipynb**, we'll use our processed datasets to actually train models using the Hugging Face Trainer API!

Great job completing Task 9.3! You now know how to efficiently process datasets for any ML task!