# Lecture 7: HF Datasets in Practice
In this notebook, you'll learn how to work with large datasets using the Hugging Face Datasets library. We'll cover streaming, filtering, and memory-safe transforms so you can process data that doesn't fit comfortably in memory, even on a laptop.

## Learning Goals
- Load datasets from the Hugging Face Hub
- Stream and filter large datasets efficiently
- Apply memory-safe transforms and batching

## 1. Setup: Install and Import Libraries
Let's make sure you have the `datasets` library installed and import what we need.

In [1]:
!pip install -q datasets


In [2]:
# `datasets` was installed in the cell above; import the loader used throughout this notebook.
from datasets import load_dataset

## 2. Load a Sample Dataset
We'll start by loading a popular dataset from the Hugging Face Hub. Let's use the IMDB movie reviews dataset as an example.

In [3]:
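# Load a 2,000-row sample of the IMDB train split for the demo.
# The train split appears to be ordered by label, so take rows from both ends
# to make sure the sample contains negative and positive reviews.
dataset = load_dataset("imdb", split="train[:1000]+train[-1000:]")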
print(dataset)
print(dataset[0])

Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered p

## 3. Streaming Large Datasets
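If you want to work with huge datasets without downloading everything first, pass `streaming=True`. Instead of a regular `Dataset`, you get an `IterableDataset` that fetches examples lazily as you iterate over it. This is useful when disk space or memory is limited.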

In [None]:
# Load the IMDB dataset in streaming mode
dataset_stream = load_dataset("imdb", split="train", streaming=True)

# Let's peek at the first few examples
from itertools import islice
for example in islice(dataset_stream, 3):
    print(example)
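
To make "lazy" concrete, here is a plain-Python sketch that uses a generator as a stand-in for a streamed split. It is not how the library is implemented; `fake_stream` and its records are hypothetical. The point is that `islice` pulls only as many records as it needs, which is the same access pattern used with `dataset_stream` above.

```python
from itertools import islice

records_read = 0  # counts how many records the source actually produced


def fake_stream(n_total):
    """Hypothetical stand-in for a streamed split: yields one record at a time."""
    global records_read
    for i in range(n_total):
        records_read += 1
        yield {"text": f"review {i}", "label": i % 2}


# Ask for 3 examples out of a "dataset" of one million records
first_three = list(islice(fake_stream(1_000_000), 3))
print(first_three)
print(f"Records read: {records_read}")
```

Only three records are ever generated here, even though the source could produce a million.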

## 4. Filtering Data
You can filter datasets to only keep the rows you care about. For example, let's keep only positive reviews.

In [None]:
# Filter for positive reviews (label == 1)
positive_reviews = dataset.filter(lambda x: x["label"] == 1)
print(f"Number of positive reviews: {len(positive_reviews)}")
print(positive_reviews[0])
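
Predicates can also be written as named functions, which are easier to test and combine than lambdas. A minimal sketch, assuming the same `text` and `label` columns: `is_short_positive` and the 200-word cutoff are illustrative choices, not part of the original notebook. The function is checked on hand-written rows here.

```python
MAX_WORDS = 200  # illustrative cutoff, not taken from the notebook


def is_short_positive(example):
    """Keep positive reviews (label == 1) with at most MAX_WORDS words."""
    return example["label"] == 1 and len(example["text"].split()) <= MAX_WORDS


# Quick check on hand-written rows shaped like IMDB examples
rows = [
    {"text": "Loved it.", "label": 1},
    {"text": "Terrible pacing.", "label": 0},
    {"text": "great " * 500, "label": 1},
]
kept = [row for row in rows if is_short_positive(row)]
print(kept)
```

Once the predicate behaves as expected, it could be passed to the dataset as `dataset.filter(is_short_positive)`.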

## 5. Memory-Safe Transforms with .map()
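Use `.map()` to apply a function to every example in your dataset. The results are written to an Arrow cache file on disk rather than held entirely in memory, and with `batched=True` the function receives many examples per call instead of one.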

In [None]:
# Lowercase all review texts
def to_lower(example):
    example["text"] = example["text"].lower()
    return example

lowercased = positive_reviews.map(to_lower, batched=False)
print(lowercased[0]["text"])
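
With `batched=True`, `.map()` passes the function a dictionary of lists (one list per column) rather than a single example. The sketch below shows a batched version of the lowercase transform. `to_lower_batch` is a hypothetical name, and it is exercised on a hand-built batch rather than a real dataset.

```python
def to_lower_batch(batch):
    """Lowercase every text in a batch shaped like {"text": [...], "label": [...]}."""
    batch["text"] = [text.lower() for text in batch["text"]]
    return batch


# A hand-built batch with the same column layout a batched map call would pass in
demo_batch = {"text": ["GREAT Movie", "Not For Me"], "label": [1, 0]}
result = to_lower_batch(demo_batch)
print(result["text"])
```

On a real dataset this would be called as `positive_reviews.map(to_lower_batch, batched=True)`. Batched mapping usually reduces per-example Python overhead.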

## 6. Batching and Iteration
Batching helps you process data efficiently, especially for training. Let's see how to batch and iterate over the dataset.

In [None]:
# Example: batching with a streaming dataset
from itertools import islice

batch_size = 2
batch = []
# Take only the first 6 examples so the demo prints exactly 3 full batches
for example in islice(dataset_stream, 6):
    batch.append(example)
    if len(batch) == batch_size:
        print(batch)
        batch = []
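
The manual loop can be wrapped in a small reusable generator. This is a sketch under the assumption that any iterable of examples is acceptable input. `iter_batches` is a hypothetical helper, not a library function, and the demo uses a plain generator in place of a streamed dataset.

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of up to batch_size items; the last batch may be smaller."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a stream of examples; any iterable works here
examples = ({"id": i} for i in range(5))
batches = list(iter_batches(examples, batch_size=2))
print(batches)
```

Unlike the loop above, this helper also yields the final partial batch instead of dropping it. The same helper could wrap the stream, for example `iter_batches(islice(dataset_stream, 6), 2)`.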

## 7. Key Takeaways
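- Hugging Face Datasets lets you work with datasets larger than memory by keeping data on disk or streaming it.
- Streaming avoids downloading a full dataset up front, and filtering keeps only the rows you need.
- Use `.map()` for memory-safe transforms, with `batched=True` when you want to process many examples per call.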

Try streaming a dataset from the Hub, apply a filter or transform, and share your code snippet in the course repo!