# 🗃️ Big data? 🤗 Datasets to the rescue!

Modern NLP workloads often face "too big for RAM" datasets.  
Hugging Face Datasets handles multi-gigabyte and even terabyte-scale corpora with memory mapping and streaming.

We'll use the 825GB "Pile" corpus, starting with its PubMed Abstracts subset, a 19GB collection of biomedical abstracts.



Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ Install zstandard for Decompression

First, install the zstandard library used for decompressing the dataset files.


In [None]:
!pip install zstandard

## 2️⃣ Load a Massive Dataset

Let's load the PubMed abstracts subset (abstracts from 15 million biomedical publications, about 19GB).
Even at this size, 🤗 Datasets can load it without reading the whole file into RAM.


In [None]:
from datasets import load_dataset

# URL to the compressed PubMed subset
data_files = "https://huggingface.co/datasets/qualis2006/PUBMED_title_abstracts_2020_baseline/resolve/main/PUBMED_title_abstracts_2020_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset


## 3️⃣ Inspect the First Example

Let's check the format and content of a single PubMed entry.


In [None]:
# Peek at the very first row in the dataset
print(pubmed_dataset[0])

## 4️⃣ Measure RAM Usage with psutil

Despite the dataset's size, RAM usage stays moderate because the Arrow cache file is memory-mapped rather than loaded into memory.


In [None]:
!pip install psutil

In [None]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")
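
To make the idea of memory mapping concrete, here is a minimal standard-library sketch. It is not how 🤗 Datasets is implemented internally; it only illustrates that a mapped file can be sliced without reading all of it. The temporary file, the 64-byte record layout, and the variable names are assumptions made for illustration.

```python
import mmap
import os
import tempfile

# Write a small file of fixed-width records to stand in for a large cache file.
# (Hypothetical data; a real Arrow cache is far larger and columnar.)
record = b"x" * 63 + b"\n"  # 64 bytes per record
num_records = 10_000

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(record * num_records)
    path = tmp.name

with open(path, "rb") as f:
    # Map the file into the process address space; no bulk read happens here.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing touches only the pages that back this one record.
        last_record = mm[-64:]
        mapped_size = len(mm)

os.remove(path)
print(f"Mapped {mapped_size} bytes, read {len(last_record)} of them")
```

The operating system pages data in on demand, which is why the resident memory of the process can stay well below the size of the file on disk.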


## 5️⃣ Dataset On-Disk Size vs. RAM

How big is the Arrow cache file? Compare RAM and disk footprint below.


In [None]:
print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

## 6️⃣ Speed Test: Iterate Over the Whole Dataset

Let's measure sequential read speed from disk into RAM using memory mapping and Arrow.


In [None]:
import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""

# Use a name other than `time` to avoid shadowing the standard-library module
elapsed = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{elapsed:.1f}s, i.e. {size_gb/elapsed:.3f} GB/s"
)


## 7️⃣ Streaming Datasets

For *really* big data (larger than local disk!), load in streaming mode to fetch samples on-demand.

This keeps only a small subset in memory and never stores the full dataset locally.


In [None]:
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

# pubmed_dataset_streamed is an IterableDataset, so use iter() and next()
print(next(iter(pubmed_dataset_streamed)))
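
The on-demand behavior can be pictured with a plain Python generator over a JSON Lines file. This is only a sketch of the idea, not the library's streaming code: it reads a small local file rather than a remote compressed one, and the `stream_jsonl` helper, the `id` field, and the sample records are hypothetical.

```python
import json
import os
import tempfile


def stream_jsonl(path):
    """Yield one parsed record at a time; only the current line is held in memory."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical three-record file standing in for a large corpus.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for i in range(3):
        tmp.write(json.dumps({"id": i, "text": f"abstract {i}"}) + "\n")
    jsonl_path = tmp.name

records = stream_jsonl(jsonl_path)
first_record = next(records)  # parses only the first line
print(first_record)

records.close()  # closing the generator releases the file handle
os.remove(jsonl_path)
```

Nothing past the first line is parsed until the next call to `next()`, which mirrors why `next(iter(...))` returns quickly on a streamed dataset.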

## 8️⃣ On-the-fly Tokenization With Streaming

Streamed datasets can be mapped over and tokenized in batches, just like regular datasets.

We'll use a pretrained tokenizer.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize each streamed example as it's read
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
print(next(iter(tokenized_dataset)))
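
The cell above tokenizes one example at a time; passing `batched=True` to `map` is how the library processes examples in batches. The pure-Python sketch below shows what a lazy batched map does conceptually. It is an illustration under stated assumptions, not the library's code: `whitespace_tokenize` is a hypothetical stand-in for a real tokenizer, and `lazy_batched_map` is a made-up helper name.

```python
from itertools import islice


def whitespace_tokenize(texts):
    """Hypothetical stand-in for a real tokenizer: split each text on whitespace."""
    return [text.split() for text in texts]


def lazy_batched_map(examples, fn, batch_size):
    """Apply fn to lists of up to batch_size texts, yielding results one by one."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        texts = [example["text"] for example in batch]
        for example, tokens in zip(batch, fn(texts)):
            # Keep the original fields and add the new column.
            yield {**example, "tokens": tokens}


# A generator expression plays the role of the streamed dataset.
stream = ({"text": f"sample abstract number {i}"} for i in range(5))
tokenized = lazy_batched_map(stream, whitespace_tokenize, batch_size=2)

first_tokenized = next(tokenized)  # pulls only the first batch from the stream
print(first_tokenized)
```

Batching matters mostly because real tokenizers process a list of texts faster than the same texts one by one.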

## 9️⃣ Buffer-based Shuffling with Streaming

Streamed data can be shuffled, but only within a fixed-size buffer; this is not a global random shuffle.


In [None]:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
print(next(iter(shuffled_dataset)))
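
To see why a buffer gives only approximate shuffling, here is a minimal sketch of one common buffer-shuffle scheme. It is assumed to resemble what the library does but is not taken from its source; `buffer_shuffle` is a hypothetical helper and the integers stand in for examples.

```python
import random


def buffer_shuffle(iterable, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

An item can only be emitted after it has entered the buffer, so the element at output position `k` always comes from the first `k + buffer_size` inputs. A larger buffer gets closer to a true shuffle at the cost of more memory.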

## 🔟 "Take" and "Skip" for Streams

Use `.take(N)` to preview or split data from a streamed source, and `.skip(N)` to discard/partition efficiently.


In [None]:
# Take the first 5 examples
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

In [None]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
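
The split above works because both calls iterate the same seeded stream in the same order, so the skipped prefix and the taken prefix are the same examples. The sketch below shows the equivalent logic with `itertools.islice`; `make_stream`, `take`, and `skip` are hypothetical helpers that stand in for the dataset methods, not the library's implementation.

```python
from itertools import islice


def make_stream():
    """Hypothetical deterministic stream; each call restarts from the beginning."""
    return iter(range(10))


def take(stream, n):
    """Yield only the first n items."""
    return islice(stream, n)


def skip(stream, n):
    """Discard the first n items and yield the rest."""
    return islice(stream, n, None)


validation_ids = list(take(make_stream(), 3))
train_ids = list(skip(make_stream(), 3))
print(validation_ids, train_ids)
```

If the underlying order changed between the two passes (for example, a shuffle with a different seed), the two splits could overlap.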

## 1️⃣1️⃣ Interleaving Huge Datasets On the Fly

Want to mix multiple massive sources? Use `interleave_datasets` for fair round-robin reading.


In [None]:
from datasets import interleave_datasets
from itertools import islice

# Load another large streamed subset, e.g. FreeLaw (51GB of legal cases)
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
# Print explicitly: a bare expression in the middle of a cell is not displayed
print(next(iter(law_dataset_streamed)))

# Interleave PubMed and FreeLaw streams
combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
print(list(islice(combined_dataset, 2)))
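
Round-robin reading can be sketched in a few lines of plain Python. This simplified version stops as soon as any source runs out, which is assumed here to approximate the default behavior of `interleave_datasets`; the `round_robin` helper and the sample strings are hypothetical.

```python
def round_robin(*iterables):
    """Alternate between sources, stopping when any one of them is exhausted."""
    iterators = [iter(it) for it in iterables]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                # One source ran dry, so end the combined stream.
                return


pubmed_like = ["pubmed-0", "pubmed-1", "pubmed-2"]
law_like = ["law-0", "law-1"]

combined = list(round_robin(pubmed_like, law_like))
print(combined)
```

Because each step pulls a single example from one source, the combined stream stays lazy and never needs either source to fit in memory.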

## 1️⃣2️⃣ Streaming the Entire Pile (825GB)

With URLs and streaming, even the largest datasets are feasible (if high network throughput is available!).


In [None]:
base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
# pile_dataset is a dict-like object keyed by split name, so index it with []
print(next(iter(pile_dataset["train"])))
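
As a quick check of the training shard list built above, the sketch below repeats only the URL construction, with no network access. The `train_files` name is introduced here for illustration.

```python
base_url = "https://the-eye.eu/public/AI/pile/"

# The :02d format spec zero-pads the shard index to two digits (00 through 29).
train_files = [f"{base_url}train/{idx:02d}.jsonl.zst" for idx in range(30)]

print(len(train_files))
print(train_files[0])
print(train_files[-1])
```

Printing the first and last entries before loading is a cheap way to catch an off-by-one in the shard range or a missing zero pad.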