### Pretraining a model from scratch requires a large amount of training data, but a computer's RAM is limited

### With the Datasets library, data can be loaded directly from Hugging Face datasets; the steps below use the_pile as an example

In [2]:
from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
pubmed_dataset = load_dataset("json", 
                              data_dir="/Volumes/WD_BLACK/data/pubmed",
                              data_files="PUBMED_title_abstracts_2019_baseline.jsonl.zip",
                              split="train")
pubmed_dataset

Check memory usage

In [1]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 65.42 MB


In [None]:
# dataset_size is the size of the dataset in bytes, not a file count
print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

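Datasets treats each dataset as a memory-mapped file. This provides a mapping between RAM and filesystem storage, which lets the library access and operate on elements of the dataset without loading it fully into memory.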
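
To make the idea of a memory-mapped file concrete, here is a minimal sketch using Python's standard `mmap` module. It illustrates the operating-system mechanism only and is not how the Datasets library is implemented internally; the file contents, `record_size`, and `num_records` are hypothetical values chosen for the demonstration.

```python
import mmap
import os
import tempfile

# Write a small binary file to stand in for a dataset cache file (hypothetical data).
record_size = 16
num_records = 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(num_records):
        tmp.write(f"record-{i:09d}".encode("ascii"))  # exactly 16 bytes per record
    path = tmp.name

# Map the file into the process address space. The OS loads pages from disk
# only when they are touched, so the whole file never has to sit in RAM.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Jump straight to record 500 without reading the records before it.
        start = 500 * record_size
        record_500 = mm[start:start + record_size].decode("ascii")
        mapped_size = len(mm)

os.remove(path)
print(record_500, mapped_size)
```

The slice on `mm` behaves like a slice on a `bytes` object, but only the pages that back the requested range need to be resident in memory.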

## Streaming datasets
Elements are downloaded and accessed on the fly, so the entire dataset does not need to be downloaded first.

In [3]:
from datasets import load_dataset

pubmed_dataset_streamed = load_dataset("json", 
                                       data_files="/Volumes/WD_BLACK/data/pubmed/PUBMED_title_abstracts_2019_baseline.jsonl",
                                       split="train", streaming=True)

Access the first element

In [4]:
next(iter(pubmed_dataset_streamed))

{'meta': {'pmid': 11409574, 'language': 'eng'},
 'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that i
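
The streaming behaviour above can be pictured as a generator that parses one record at a time. The following is a simplified sketch under that assumption, written with the standard library only; `stream_jsonl` and the toy file are hypothetical and are not part of the Datasets API.

```python
import json
import os
import tempfile

def stream_jsonl(path):
    """Yield one parsed record at a time instead of loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Hypothetical toy file standing in for a large JSON Lines dataset.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for i in range(3):
        tmp.write(json.dumps({"meta": {"id": i}, "text": f"document {i}"}) + "\n")
    path = tmp.name

records = stream_jsonl(path)
first = next(iter(records))  # only the first line has been parsed so far
records.close()              # release the file handle before deleting the file
os.remove(path)
print(first)
```

Because the generator yields lazily, memory use depends on the size of a single record rather than the size of the file.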

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))
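
When `map` is applied to a streamed dataset, the function runs only as elements are consumed. A rough sketch of that lazy evaluation, using a plain generator, is shown here; `fake_tokenize` and `lazy_map` are hypothetical stand-ins for illustration, not the real tokenizer or the library's implementation.

```python
call_log = []  # records every example the function has actually processed

def fake_tokenize(example):
    """Hypothetical stand-in for a tokenizer: splits text on whitespace."""
    call_log.append(example["text"])
    return {**example, "tokens": example["text"].split()}

def lazy_map(function, examples):
    """Apply `function` to each example only when it is requested."""
    for example in examples:
        yield function(example)

# A large stream that is never materialised in memory.
stream = ({"text": f"sample number {i}"} for i in range(1_000_000))
mapped = lazy_map(fake_tokenize, stream)

calls_before = len(call_log)  # nothing has been processed yet
first = next(mapped)          # processes exactly one example
calls_after = len(call_log)
print(calls_before, calls_after, first["tokens"])
```

Setting up the mapping is essentially free; the cost is paid per element during iteration.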

In [None]:
# Take the first five examples
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)

In [None]:
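# Shuffle the stream: elements are drawn at random from a buffer of 10,000 examples
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)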
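
A stream cannot be shuffled globally, so shuffling is approximated with a fixed-size buffer, and `skip` / `take` then split the shuffled stream. The sketch below shows one way such a scheme can work, assuming a buffer that is sampled at random and refilled from the stream; `buffer_shuffle` and `make_stream` are hypothetical helpers, and the real library may differ in its details.

```python
import random
from itertools import islice

def buffer_shuffle(iterable, buffer_size, seed):
    """Approximate shuffle: emit a random buffer slot, refill it from the stream."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # replace it with the incoming one
    rng.shuffle(buffer)                  # drain what is left at the end
    yield from buffer

def make_stream():
    # The same seed reproduces the same order, so the two splits stay disjoint.
    return buffer_shuffle(range(100), buffer_size=10, seed=42)

validation = list(islice(make_stream(), 20))   # analogous to take(20)
train = list(islice(make_stream(), 20, None))  # analogous to skip(20)
print(len(validation), len(train))
```

The split is only disjoint because both passes use the same seed; a different seed per pass would leak validation examples into the training stream.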

In [None]:
# Reading .zst archives requires the zstandard package (pip install zstandard)
law_dataset_streamed = load_dataset(
    "json",
    data_files="https://the-eye.eu/public/AI/pile_preliminary_components/FreeLaw_Opinions.jsonl.zst",
    split="train",
    streaming=True,
)
next(iter(law_dataset_streamed))

In [None]:
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets(
    [pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
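
Without sampling probabilities, interleaving alternates between the sources. A simplified sketch of that alternation follows, assuming iteration stops as soon as one source runs out; `interleave` and the toy streams are hypothetical and do not reproduce the library's implementation.

```python
def interleave(*sources):
    """Alternate between sources, stopping when the shortest one is exhausted."""
    for group in zip(*sources):
        yield from group

# Hypothetical toy streams standing in for the two streamed datasets.
pubmed = ({"source": "pubmed", "idx": i} for i in range(3))
law = ({"source": "law", "idx": i} for i in range(5))

combined = list(interleave(pubmed, law))
order = [example["source"] for example in combined]
print(order)
```

With three examples in the first stream and five in the second, the result contains six examples, alternating between the two sources.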

In [None]:
base_url = "https://the-eye.eu/public/AI/pile/"
data_files = {
    "train": [base_url + "train/" + f"{idx:02d}.jsonl.zst" for idx in range(30)],
    "validation": base_url + "val.jsonl.zst",
    "test": base_url + "test.jsonl.zst",
}
pile_dataset = load_dataset("json", data_files=data_files, streaming=True)
next(iter(pile_dataset["train"]))