## Handling Big Data

### What is the Pile Dataset?

The Pile is an English text corpus (825GB) that was created by EleutherAI for training large-scale language models. It includes a diverse range of datasets, spanning scientific articles, GitHub code repositories, and filtered web text. The training corpus is available in 14 GB chunks. Let’s start by taking a look at the PubMed Abstracts dataset, which is a corpus of abstracts from 15 million biomedical publications on PubMed. The dataset is in JSON Lines format and is compressed using the zstandard library, so first we need to install that:

In [2]:
!pip install zstandard
!pip install datasets


In [4]:
from datasets import load_dataset

# Loading takes some time.
# The original link below no longer works:
# data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"          # PubMed
# While loading, the zst-compressed file is decompressed and stored on disk.
# To save disk space:
# from datasets import DownloadConfig
# dataset = load_dataset(
#     "dataset_name",
#     download_config=DownloadConfig(delete_extracted=True)
# )
# With delete_extracted=True, the extracted files are deleted once the dataset is loaded, keeping only the essential data.
# pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
# pubmed_dataset
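
The `json` loader referenced in the commented-out cell above reads JSON Lines, where every line is one standalone JSON object. As a minimal sketch using only the standard library (the two records are shortened to the titles of abstracts that appear in this notebook, and `parse_jsonl` is a hypothetical helper, not part of Datasets):

```python
import json

# Two shortened records in the same shape as the PubMed data:
# a "meta" dict and a "text" string, one JSON object per line.
raw_jsonl = (
    '{"meta": {"pmid": 1673585, "language": "eng"}, '
    '"text": "Cardiac beta-adrenoceptor regulation and the effects of partial agonism."}\n'
    '{"meta": {"pmid": 1673588, "language": "eng"}, '
    '"text": "Local cardiac responses--alternative methods of control."}\n'
)


def parse_jsonl(payload):
    """Parse a JSON Lines string into a list of dicts (hypothetical helper)."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


records = parse_jsonl(raw_jsonl)
print(len(records), sorted(records[0]))  # prints: 2 ['meta', 'text']
```

Because each line parses on its own, a reader never needs the whole file in memory to get at a single record.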

In [5]:
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset

README.md:   0%|          | 0.00/635 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


(…)_title_abstracts_2020_baseline.jsonl.zst:   0%|          | 0.00/7.98G [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Loading dataset shards:   0%|          | 0/49 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['meta', 'text'],
        num_rows: 17722096
    })
})

#### 17,722,096 rows (about 17.7M rows and 2 columns, which is quite large)

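- Loading the entire dataset into memory the usual way would need at least as much free RAM as the decompressed data, which is larger than the roughly 8 GB compressed download, plus whatever the OS, other processes, and the interpreter need.

- If you’re familiar with Pandas, being able to work with a dataset of this size on an ordinary machine might come as a surprise because of Wes McKinney’s famous rule of thumb that you typically need 5 to 10 times as much RAM as the size of your dataset.

- Hugging Face Datasets stores data in the Apache Arrow format, which gives us *memory mapping*: the data stays on disk and is paged into RAM only as it is accessed.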

- Memory-mapped files can also be shared across multiple processes, which enables methods like Dataset.map() to be parallelized without needing to move or copy the dataset. Under the hood, these capabilities are all realized by the Apache Arrow memory format and pyarrow library, which make the data loading and processing lightning fast
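
To make the idea of memory mapping concrete, here is a minimal sketch using Python's standard `mmap` module rather than Arrow itself. The temporary file and its 16-byte records are made up for illustration; the point is only that a slice of a mapped file can be read without loading the whole file.

```python
import mmap
import os
import tempfile

# Write a small file to stand in for a large on-disk dataset.
record = b"0123456789abcdef"  # 16 bytes per record
num_records = 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(record * num_records)
    path = tmp.name

# Map the file into the address space; the OS pages bytes in only when touched.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        total_bytes = len(mapped)
        # Slicing reads just this record instead of loading the whole file.
        tenth_record = mapped[9 * len(record):10 * len(record)]

os.remove(path)
print(total_bytes, tenth_record)  # prints: 16000 b'0123456789abcdef'
```

Arrow applies the same principle to columnar data, which is why indexing into a very large dataset does not require holding it all in RAM.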

In [17]:
pubmed_dataset['train'][3]

{'meta': {'pmid': 1673588, 'language': 'eng'},
 'text': 'Local cardiac responses--alternative methods of control.\nMuch attention has been paid to the influence of the beta-adrenoceptor system on cardiac function in heart failure. Full agonists and partial agonists acting on cardiac beta 1 receptors have been widely investigated, as has the density of these receptors in the failing heart. However, other cardiac control mechanisms may play important roles in the normal heart as well as in heart failure. The Frank-Starling mechanism of enhanced cardiac contraction produced by mechanical stretching of the ventricular myofibrils is well known. When treating patients with heart failure with diuretics, vasodilators and other drugs that influence preload, it is important to consider their overall effects in relation to the Starling curves. Atrial stretching also produces compensatory responses which are currently being intensively studied. Reflex release of atrial natriuretic factor after sti

In [8]:
!pip install psutil



In [18]:
import psutil

print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 990.26 MB


In [19]:
size_gb = pubmed_dataset['train'].info.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

Dataset size (cache file) : 22.77 GB
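
As a quick back-of-the-envelope check, the two figures printed above can be compared directly, and set against the 5 to 10 times rule of thumb mentioned earlier. This is only arithmetic on the reported numbers; the RAM figure also includes the interpreter and imported libraries, so it is an upper bound on what the dataset itself occupies.

```python
# Figures reported by the cells above
ram_used_mb = 990.26
dataset_size_gb = 22.77

# Share of the dataset size that the process actually holds in RAM
ram_fraction = ram_used_mb / (dataset_size_gb * 1024)

# RAM suggested by the 5x to 10x Pandas rule of thumb
pandas_low_gb = 5 * dataset_size_gb
pandas_high_gb = 10 * dataset_size_gb

print(f"RAM used is {ram_fraction:.1%} of the dataset size")
print(f"Rule of thumb: {pandas_low_gb:.1f} to {pandas_high_gb:.1f} GB of RAM")
```

In other words, the process holds only a few percent of the dataset size in RAM, while the rule of thumb would call for more than 100 GB.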


### TADAA!!! About 1 GB of RAM for a 22.77 GB dataset

In [22]:
import timeit

# Time one full pass over the train split in slices of 1,000 rows
code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset['train']), batch_size):
    _ = pubmed_dataset['train'][idx:idx + batch_size]
"""

elapsed = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
# Count the rows of the train split; len() of the DatasetDict only counts its splits
num_examples = len(pubmed_dataset['train'])
print(
    f"Iterated over {num_examples} examples (about {size_gb:.1f} GB) in "
    f"{elapsed:.1f}s, i.e. {size_gb/elapsed:.3f} GB/s"
)

Iterated over 17722096 examples (about 22.8 GB) in 298.2s, i.e. 0.076 GB/s


Memory mapping saves our RAM from melting!!!


What if our dataset is too big for our personal HDD too?
- The Pile dataset (825 GB)
- If the dataset is on the Hugging Face Hub, or is otherwise remotely accessible (special cases), we can use *streaming*

In [3]:
from datasets import load_dataset

pubmed_dataset_streamed = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline", streaming=True)

README.md:   0%|          | 0.00/635 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


In [4]:
pubmed_dataset_streamed

IterableDatasetDict({
    train: IterableDataset({
        features: Unknown,
        num_shards: 1
    })
})

In [5]:
# Not a DatasetDict but an IterableDatasetDict

The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs.
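
The "on the fly" part can be sketched with plain Python generators. This is not how Datasets is implemented; `stream_records` and `lazy_map` are hypothetical helpers and the records are placeholders. The sketch only shows that a record is read and transformed at the moment it is requested.

```python
import io
import json

# Stand-in for a remote JSON Lines file; these records are placeholders.
remote_file = io.StringIO(
    "".join(json.dumps({"text": f"Abstract {i}"}) + "\n" for i in range(5))
)

stats = {"lines_read": 0}


def stream_records(handle):
    """Yield one parsed record at a time instead of reading the whole file."""
    for line in handle:
        stats["lines_read"] += 1
        yield json.loads(line)


def lazy_map(function, records):
    """Apply function to each record only when that record is requested."""
    for record in records:
        yield function(record)


lowercased = lazy_map(lambda r: {"text": r["text"].lower()}, stream_records(remote_file))
first = next(lowercased)
print(first, stats["lines_read"])  # prints: {'text': 'abstract 0'} 1
```

Only one line has been read after asking for the first element, which is the behaviour that makes streaming practical for corpora that do not fit on disk.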

In [6]:
next(iter(pubmed_dataset_streamed['train']))

{'meta': {'pmid': 1673585, 'language': 'eng'},
 'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was n

In [9]:
# You can also shuffle a streamed dataset using IterableDataset.shuffle(), but unlike Dataset.shuffle()
# this only shuffles the elements within a buffer of buffer_size examples (it fetches them into main memory and shuffles them):

shuffled_dataset = pubmed_dataset_streamed['train'].shuffle(buffer_size=100, seed=42)
next(iter(shuffled_dataset))

{'meta': {'pmid': 1673593, 'language': 'eng'},
 'text': 'Effects of a histamine type-2 receptor antagonist (BMY-25368) on gastric secretion in horses.\nThe effects of a potent new histamine-2 (H2) receptor antagonist, BMY-25368, were studied on gastric acid secretion in 5 foals from which food was withheld. Doses of 0.02, 0.11, 0.22, and 1.10 mg/kg of body weight were administered IM in a randomly assigned treatment sequence. Following BMY-25368 administration, hydrogen ion concentration was decreased and mean pH was higher than baseline values in a dose-response pattern. At the 0.22 and 1.10 mg/kg doses, the high pH was sustained for greater than 4 hours. The BMY-25368 thus may be useful for treating gastric ulcer disease in horses.'}
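
The buffer-based shuffling used above can be sketched in pure Python. This is an illustrative approximation under the assumption that a random buffered item is emitted and replaced by the next incoming one; `buffered_shuffle` is a hypothetical function, not the library's code.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Approximate shuffle that keeps at most buffer_size items in memory."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer first
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: emit what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(1000), buffer_size=100, seed=42))
print(sorted(shuffled) == list(range(1000)))  # every item appears exactly once
print(shuffled[0] < 100)  # the first item can only come from the first buffer
```

This also explains why a small `buffer_size` gives a weak shuffle: early outputs can only come from the first few hundred examples of the stream.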

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed['train'].map(lambda x: tokenizer(x["text"]), batched=True)
print(next(iter(tokenized_dataset)))

Token indices sequence length is longer than the specified maximum sequence length for this model (567 > 512). Running this sequence through the model will result in indexing errors


{'meta': {'pmid': 1673585, 'language': 'eng'}, 'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was no

In [15]:
# You can use functions like take() and skip()

first_three = pubmed_dataset_streamed['train'].take(3)
list(first_three)

[{'meta': {'pmid': 1673585, 'language': 'eng'},
  'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was

In [16]:
val_dataset = pubmed_dataset_streamed['train'].take(1000)
train_dataset = pubmed_dataset_streamed['train'].skip(1000)        # skip() leaves out the first 1000 examples already used for validation
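
The validation/training split above can be pictured with `itertools.islice` on an ordinary iterator. The `take`, `skip`, and `examples` functions below are hypothetical stand-ins written for this sketch, with ten integers playing the role of the streamed examples.

```python
from itertools import islice


def take(iterable, n):
    """Lazily yield only the first n items."""
    return islice(iterable, n)


def skip(iterable, n):
    """Lazily yield everything after the first n items."""
    return islice(iterable, n, None)


def examples():
    """Stand-in for a streamed dataset of 10 examples."""
    return iter(range(10))


val_ids = list(take(examples(), 3))
train_ids = list(skip(examples(), 3))
print(val_ids, train_ids)  # prints: [0, 1, 2] [3, 4, 5, 6, 7, 8, 9]
```

The two pieces do not overlap, which is what makes the pair suitable for carving a validation set out of a stream.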

#### Combining multiple datasets to create a single corpus

Datasets provides an *interleave_datasets()* function that converts a *list* of IterableDataset objects into a single IterableDataset.

In [19]:
from itertools import islice
from datasets import interleave_datasets

# The Pile has been taken down due to copyright issues, so a second streamed copy of the PubMed dataset stands in for the law dataset here
law_dataset_streamed = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline",
    streaming=True,
)


combined_dataset = interleave_datasets([pubmed_dataset_streamed['train'], law_dataset_streamed['train']])
print(list(islice(combined_dataset, 2)))

# Here we’ve used the islice() function from Python’s itertools module to select the first two examples from the combined dataset
# This selects one from each dataset

Repo card metadata block was not found. Setting CardData to empty.


[{'meta': {'language': 'eng', 'pmid': 1673585}, 'text': 'Cardiac beta-adrenoceptor regulation and the effects of partial agonism.\nThe in vivo effects of xamoterol on the regulation of rat cardiac beta adrenoceptors were investigated. Rats were implanted subcutaneously with osmotic minipumps and exposed to the following treatment regimens: (1) subcutaneous infusion of saline (control), isoprenaline or xamoterol for 6 days, (2) subcutaneous infusion of isoprenaline with co-administration of xamoterol for various periods up to 96 hours, and (3) subcutaneous infusion of xamoterol for up to 96 hours after previous treatment with isoprenaline for 72 hours. Xamoterol did not induce beta-adrenoceptor down-regulation after short-term (72-hour) or long-term (6-day) infusions. When coadministered with isoprenaline xamoterol did not affect the rate or extent of down-regulation induced by isoprenaline alone. In addition, recovery of beta adrenoceptors down-regulated by isoprenaline treatment was n
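
The alternating behaviour seen in this output can be sketched as a simple round-robin generator. This is a simplified illustration, not the library's implementation: the `interleave` function and the placeholder strings are hypothetical, and the stopping rule (stop as soon as any source runs out) is an assumption of the sketch.

```python
def interleave(*iterables):
    """Alternate between sources, stopping when any source runs out."""
    iterators = [iter(it) for it in iterables]
    while True:
        for iterator in iterators:
            try:
                yield next(iterator)
            except StopIteration:
                return


pubmed_like = ["pubmed-0", "pubmed-1"]
law_like = ["law-0", "law-1"]
combined = list(interleave(pubmed_like, law_like))
print(combined)  # prints: ['pubmed-0', 'law-0', 'pubmed-1', 'law-1']
```

Taking the first two items of such a stream yields one example from each source, which matches what `islice(combined_dataset, 2)` returned above.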