<a href="https://colab.research.google.com/github/Leoli04/llms-notebooks/blob/main/huggingface/hf_nlp_05_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

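## Datasets

### Introduction

Earlier chapters gave a first taste of the Datasets library and showed the three main steps of fine-tuning a model:

- Load a dataset from the Hugging Face Hub.
- Preprocess the data with Dataset.map().
- Load and compute metrics.

This only scratches the surface of what Datasets can do. This chapter addresses the following questions:
- What do you do when your dataset is not on the Hub?
- How can you slice and dice a dataset? (And what if you really need to use Pandas?)
- What do you do when your dataset is huge and would exhaust your laptop's RAM?
- What exactly are "memory mapping" and Apache Arrow?
- How can you create your own dataset and push it to the Hub?

For more about loading datasets, see the [documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files).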

### Loading local and remote datasets

Datasets provides loading scripts for local and remote datasets. It supports several common data formats:

| Data format          | Loading script | Example                                                 |
| -------------------- | -------------- | ------------------------------------------------------- |
| CSV & TSV            | `csv`          | `load_dataset("csv", data_files="my_file.csv")`         |
| Text files           | `text`         | `load_dataset("text", data_files="my_file.txt")`        |
| JSON & JSON Lines    | `json`         | `load_dataset("json", data_files="my_file.jsonl")`      |
| Pickled DataFrames   | `pandas`       | `load_dataset("pandas", data_files="my_dataframe.pkl")` |

#### Loading local data

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
# Download the files
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

The data_files argument of load_dataset() is quite flexible: it can be a single file path, a list of file paths, or a dictionary that maps split names to file paths. It also accepts glob patterns that follow Unix shell rules; for example, data_files="*.json" loads every JSON file in a directory as a single split.
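
The three accepted shapes of data_files, and the effect of a glob pattern, can be illustrated with the standard library alone. This is a minimal sketch with hypothetical file names; it uses `fnmatchcase` rather than the resolver inside Datasets, so it only approximates how a pattern such as `*.json` selects files.

```python
from fnmatch import fnmatchcase

# The three shapes data_files can take (all file names here are hypothetical).
single_file = "train.json"
file_list = ["part-1.json", "part-2.json"]
split_mapping = {"train": "train.json", "test": "test.json"}

# Hypothetical directory listing, used only for illustration.
file_names = ["train.json", "test.json", "notes.txt", "extra.jsonl"]

# Unix-shell-style matching: "*" stands for any run of characters,
# and the pattern has to match the whole file name.
matched = sorted(name for name in file_names if fnmatchcase(name, "*.json"))
print(matched)  # ['test.json', 'train.json']
```

Note that `extra.jsonl` is not selected, because the pattern must match the complete name.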

##### Loading gzip-compressed files

In [None]:
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
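
What `field="data"` selects can be shown without the library. The sketch below is an assumption-laden illustration: it builds a tiny, hypothetical gzip-compressed JSON file whose records sit under a top-level `data` key (the layout that the `field` argument targets), then reads that key back with `gzip` and `json`.

```python
import gzip
import json
import os
import tempfile

# Hypothetical miniature file: the records live under a top-level "data"
# key, next to other metadata.
payload = {
    "version": "demo",
    "data": [
        {"title": "A", "paragraphs": []},
        {"title": "B", "paragraphs": []},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny.json.gz")
    # Write gzip-compressed JSON text.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)
    # Read it back; gzip.open decompresses transparently.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        records = json.load(f)["data"]  # the role played by field="data"

print(len(records), sorted(records[0]))  # 2 ['paragraphs', 'title']
```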

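##### Loading JSON files

In [None]:
# Decompress the archives
# gzip: a standard tool for compressing and decompressing files, typically used with .gz files.
# -d: decompress. Without any option, gzip compresses the file.
# -k: keep the input file. By default, gzip deletes the input file once it has been processed.
# -v: verbose mode, which prints the name of each file as it is processed.
!gzip -dkv SQuAD_it-*.json.gz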

SQuAD_it-test.json.gz:	 87.5% -- created SQuAD_it-test.json
SQuAD_it-train.json.gz:	 82.3% -- created SQuAD_it-train.json


In [None]:
from datasets import load_dataset

data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

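#### Loading a remote dataset

Remote files can be hosted on GitHub or in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), for example. In that case data_files points to URLs instead of local paths.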

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

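### Slicing and dicing

In data analysis, "slicing" usually means viewing a subset of the data along one dimension, while "dicing" means dividing the data further into smaller pieces. Together, "slice and dice" describes exploring data from multiple angles and at multiple levels.

#### Slicing and dicing the data

In [None]:
# Use the drug review dataset hosted on the UC Irvine Machine Learning Repository.
# It contains patient reviews of various drugs, the condition being treated, and a 10-star satisfaction rating.
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

The files are in TSV format. TSV is simply a variant of CSV that uses tabs instead of commas as the separator.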
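
The tab delimiter can be seen in isolation with the standard `csv` module. This is a small sketch on a hypothetical two-line sample, not the drug review files themselves.

```python
import csv
import io

# Hypothetical sample. The review contains a comma, which is harmless
# here because fields are separated by tabs.
tsv_text = "drugName\treview\nMobic\tWorks well, no side effects\n"

# The same reader that parses CSV handles TSV once delimiter is a tab.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)
print(rows[0]["review"])  # Works well, no side effects
```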

In [None]:
from datasets import load_dataset
# Load the data
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [None]:
# Inspect a random sample
# Shuffle the training set, keep the first 1,000 examples, and look at the first three of them
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

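The sample reveals several problems:
- The meaning of the 'Unnamed: 0' column is unclear.
- The condition column mixes uppercase and lowercase labels.
- The reviews vary in length and contain line separators ( \r\n ) as well as HTML character codes (such as &#039; ).


##### Renaming a column (DatasetDict.rename_column())

DatasetDict.rename_column() renames the column in every split (here, train and test) at once.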

In [None]:
# Use Dataset.unique() to test the hypothesis that the "Unnamed: 0" column holds a patient ID
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [None]:
# Use DatasetDict.rename_column() to rename the "Unnamed: 0" column to patient_id
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

In [None]:
# Count the unique drugs and conditions in the training and test sets
train_drug_dataset=drug_dataset["train"]
test_drug_dataset=drug_dataset["test"]

print("train_drug_dataset:",len(train_drug_dataset.unique("drugName")),len(train_drug_dataset.unique("condition")))

print("test_drug_dataset:",len(test_drug_dataset.unique("drugName")),len(test_drug_dataset.unique("condition")))


train_drug_dataset: 3436 885
test_drug_dataset: 2637 709


##### Normalizing the condition labels with Dataset.map()

In [None]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# Drop rows whose condition is None; calling .lower() on None would raise an AttributeError
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

# Lowercase the labels
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]



Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

['left ventricular dysfunction', 'adhd', 'birth control']

##### Adding a new column to the dataset

- `Dataset.map()`: computes the column with a function.
- `Dataset.add_column()`: takes the column as a Python list or NumPy array.

In [None]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

# Add a new review_length column.
# When compute_review_length() is passed to Dataset.map(), it is applied to every row of the dataset to create the column.
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

##### Filtering for the most useful data (Dataset.filter())

In [None]:
# Keep only reviews longer than 30 words
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'train': 138514, 'test': 46108}


In [None]:
# Look at the longest review by word count

drug_dataset.sort("review_length", reverse=True)["train"][0]

##### Handling HTML character codes

In [None]:
import html
# Quick test
text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [None]:
import html
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

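#### More about Dataset.map()

The map method returns a new Dataset object that contains the results of applying the function. It does not modify the original dataset; instead it produces a new one. It is typically used for preprocessing, feature extraction, or any custom transformation.



##### Speeding up data processing
- The batched argument: if True, examples are passed to the function in batches (the batch size is configurable and defaults to 1,000).
- A list comprehension is usually faster than running the same code in a for loop.
- The num_proc argument: sets the maximum number of processes used by map. For example, num_proc=8 runs the operation (tokenization, in the example further down) in up to 8 parallel processes.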
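
The difference between per-example and batched functions can be sketched in plain Python. `toy_map` below is a hypothetical stand-in written for this illustration, not the library's implementation; it assumes column-oriented data (one list per column) and only mimics the calling convention: a dict of scalars per example, or a dict of lists per batch.

```python
import html

# Column-oriented toy data (hypothetical): one list per column.
columns = {"review": ["I&#039;m fine", "It&#039;s ok", "No issues"]}


def unescape_one(example):
    # Per-example style: receives a single row as a dict of scalars.
    return {"review": html.unescape(example["review"])}


def unescape_batch(batch):
    # Batched style: receives a dict of lists and returns a dict of lists.
    return {"review": [html.unescape(text) for text in batch["review"]]}


def toy_map(columns, function, batched=False, batch_size=1000):
    """Illustrative stand-in for map(); not the library's implementation."""
    num_rows = len(next(iter(columns.values())))
    updates = {}
    if batched:
        for start in range(0, num_rows, batch_size):
            batch = {k: v[start:start + batch_size] for k, v in columns.items()}
            for key, values in function(batch).items():
                updates.setdefault(key, []).extend(values)
    else:
        for i in range(num_rows):
            example = {k: v[i] for k, v in columns.items()}
            for key, value in function(example).items():
                updates.setdefault(key, []).append(value)
    # Return a new dict; the input columns are left untouched.
    return {**columns, **updates}


per_example = toy_map(columns, unescape_one)
per_batch = toy_map(columns, unescape_batch, batched=True, batch_size=2)
print(per_example == per_batch)  # True
print(per_batch["review"][0])  # I'm fine
```

Both styles produce the same result; the batched variant simply calls the function fewer times, which is where the speed-up tends to come from.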

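The main parameters of the map method are:
- function:

> A callable that takes a single example (or a batch of examples when batched=True) and returns a dict holding the new or updated columns.

- batched (optional):

> bool, defaults to False. When True, the function receives a batch of examples instead of a single example, which allows batch processing and can improve efficiency.

- batch_size (optional):

> int, defaults to 1000. The number of examples per batch when batched=True.

- num_proc (optional):

> int. The number of processes used for parallel processing. With the default of None, no multiprocessing is used.

- fn_kwargs (optional):

> A dict of keyword arguments passed to function. This lets you supply extra arguments to function.

- input_columns (optional):

> str or list. The columns passed to function as positional arguments. By default, the whole example is passed as a dict.

- features (optional):

> A Features object that sets the schema (column types) of the resulting dataset.

- remove_columns (optional):

> list or str. The names of the columns to remove during mapping.

- keep_in_memory (optional):

> bool, defaults to False. Keeps the result in memory instead of writing it to a cache file.

- load_from_cache_file (optional):

> bool. If a cache file holding the result of the same computation already exists, it is used instead of recomputing.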

In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)
# %time measures the execution time of a single statement
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 1min 54s, sys: 882 ms, total: 1min 55s
Wall time: 1min 18s


Timing comparison for batched, num_proc, and the tokenizer option use_fast:


| Options                       | Fast tokenizer | Slow tokenizer |
| ----------------------------- | -------------- | -------------- |
| `batched=True`                | 10.8s          | 4min41s        |
| `batched=False`               | 59.2s          | 5min3s         |
| `batched=True`, `num_proc=8`  | 6.52s          | 41.3s          |
| `batched=False`, `num_proc=8` | 9.49s          | 45.2s          |

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

##### Extracting features

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [None]:


result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

In [None]:
drug_dataset["train"].column_names

['patient_id',
 'drugName',
 'condition',
 'review',
 'rating',
 'date',
 'usefulCount',
 'review_length']

In [None]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

In [None]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

In [None]:
# Tokenize examples["review"] and repeat the other columns to match the new rows
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})
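
Why the row count grows, and why the other columns must be repeated, can be sketched without a tokenizer. The following is a simplified illustration under stated assumptions: `split_with_mapping` is a hypothetical helper that cuts plain lists into chunks, ignoring the special tokens and other details a real tokenizer adds, and `sample_map` plays the role of `overflow_to_sample_mapping`.

```python
def split_with_mapping(token_lists, max_length):
    """Cut each token list into chunks of at most max_length tokens.

    Returns the chunks plus, for each chunk, the index of the sample it
    came from (the analogue of overflow_to_sample_mapping).
    """
    chunks, sample_map = [], []
    for sample_index, tokens in enumerate(token_lists):
        for start in range(0, len(tokens), max_length):
            chunks.append(tokens[start:start + max_length])
            sample_map.append(sample_index)
    return chunks, sample_map


# Two hypothetical samples: one with 5 tokens, one with 2.
token_lists = [list(range(5)), list(range(2))]
ratings = [9.0, 4.0]

chunks, sample_map = split_with_mapping(token_lists, max_length=3)
# Repeat each original value once per chunk so all columns keep equal length.
expanded_ratings = [ratings[i] for i in sample_map]

print([len(c) for c in chunks])  # [3, 2, 2]
print(sample_map)  # [0, 0, 1]
print(expanded_ratings)  # [9.0, 9.0, 4.0]
```

Two samples become three rows, and the ratings column is stretched to three entries so that every column has the same length again.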

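#### From Dataset to DataFrame and back

##### Converting a dataset to Pandas

Under the hood, Dataset.set_format() changes the return format of the dataset's __getitem__() dunder method. The underlying storage format, Apache Arrow, is unchanged.

In [None]:
# Convert the dataset to Pandas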
drug_dataset.set_format("pandas")

In [None]:
drug_dataset["train"][:3]

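|     | patient_id | drugName   | condition     | review                                             | rating | date              | usefulCount | review_length |
| --- | ---------- | ---------- | ------------- | -------------------------------------------------- | ------ | ----------------- | ----------- | ------------- |
| 0   | 95260      | Guanfacine | adhd          | "My son is halfway through his fourth week of ...  | 8.0    | April 27, 2010    | 192         | 141           |
| 1   | 92703      | Lybrel     | birth control | "I used to take another oral contraceptive, wh...  | 5.0    | December 14, 2009 | 17          | 134           |
| 2   | 138000     | Ortho Evra | birth control | "This is my first time using any form of birth...  | 8.0    | November 3, 2015  | 10          | 89            |


##### Creating a pandas.DataFrame from the training data and using its features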

In [None]:
train_df = drug_dataset["train"][:]

In [None]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .rename_axis("condition")  # name the index so reset_index() yields a "condition" column
    .reset_index(name="frequency")  # store the counts in a "frequency" column
)
frequencies.head()

|     | condition     | frequency |
| --- | ------------- | --------- |
| 0   | birth control | 27655     |
| 1   | depression    | 8023      |
| 2   | acne          | 5209      |
| 3   | anxiety       | 4991      |
| 4   | pain          | 4744      |


In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

##### Resetting the format of drug_dataset

In [None]:
drug_dataset.reset_format()

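#### Creating a validation set with Dataset.train_test_split()

Keep the test set untouched during development and create a separate validation set. Once you are happy with the model's performance on the validation set, you can do a final sanity check on the test set. This process helps reduce the risk of overfitting to the test set and deploying a model that fails on real-world data.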
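
The idea of carving a validation set out of the training data can be sketched in plain Python. `toy_train_validation_split` is a hypothetical helper written for this illustration; it shuffles indices with a fixed seed and cuts at 80%, which is the general idea behind a seeded split, not necessarily how the library implements it.

```python
import random


def toy_train_validation_split(rows, train_size=0.8, seed=42):
    """Shuffle row indices with a fixed seed, then cut at train_size."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    validation = [rows[i] for i in indices[cut:]]
    return train, validation


rows = list(range(10))  # hypothetical stand-in for training examples
train_rows, validation_rows = toy_train_validation_split(rows)
print(len(train_rows), len(validation_rows))  # 8 2
```

Because the seed is fixed, repeated runs give the same split, and no example lands in both parts.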

In [None]:
# Split off part of the original training set as a validation set.
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation".
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the original test set back to the new DatasetDict.
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

#### Saving a dataset

Datasets provides three main functions for saving a dataset to disk in different formats:

| Data format | Function                 |
| ----------- | ------------------------ |
| Arrow       | `Dataset.save_to_disk()` |
| CSV         | `Dataset.to_csv()`       |
| JSON        | `Dataset.to_json()`      |

##### Saving as Arrow

In [None]:
drug_dataset_clean.save_to_disk("drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

This creates a directory with the following structure:
```
drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json
```

In [None]:
from datasets import load_from_disk
# Load the saved data
drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

##### Saving as CSV and JSON

For the CSV and JSON formats, each split must be stored as a separate file.
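
The one-file-per-split layout can be illustrated with the standard library. This is a minimal sketch on hypothetical toy splits: it writes each split as JSON Lines (one JSON object per line) into a temporary directory and reads the files back.

```python
import json
import os
import tempfile

# Hypothetical toy splits; each row is a plain dict.
splits = {
    "train": [{"review": "good", "rating": 9.0}, {"review": "bad", "rating": 2.0}],
    "validation": [{"review": "fine", "rating": 6.0}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    # One file per split, one JSON object per line.
    for split, rows in splits.items():
        path = os.path.join(tmp_dir, f"demo-{split}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    # Reading back: map each split name to its own file again.
    reloaded = {}
    for split in splits:
        path = os.path.join(tmp_dir, f"demo-{split}.jsonl")
        with open(path, encoding="utf-8") as f:
            reloaded[split] = [json.loads(line) for line in f]

print(reloaded == splits)  # True
```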

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

In [None]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}


In [None]:
# Load the data
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

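### Working with big data

Working with multi-gigabyte datasets is common when training models. For example, the WebText corpus used to pretrain GPT-2 contains over 8 million documents and 40 GB of text; loading that into a laptop's RAM would likely give it a heart attack!



The following example uses the [Pile](https://pile.eleuther.ai/):

- The Pile is an English text corpus created by EleutherAI for training large-scale language models. At 825 GB it is huge, and it combines many datasets, spanning scientific articles, GitHub code repositories, and filtered web text.

- The Pile is distributed as jsonlines data compressed with zstandard.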

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [2]:
!pip install zstandard

Collecting zstandard
  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.4/5.4 MB 18.8 MB/s eta 0:00:00
Installing collected packages: zstandard
Successfully installed zstandard-0.22.0


In [3]:
from datasets import load_dataset

# This takes a few minutes to run, so go grab a tea or coffee while you wait :)
# The file's original location no longer exists, so it is loaded from this URL instead
data_files = "https://huggingface.co/datasets/casinca/PUBMED_title_abstracts_2019_baseline/resolve/main/PUBMED_title_abstracts_2019_baseline.jsonl.zst"
pubmed_dataset = load_dataset("json", data_files=data_files, split="train")
pubmed_dataset

Downloading data:   0%|          | 0.00/6.86G [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Loading dataset shards:   0%|          | 0/42 [00:00<?, ?it/s]

Dataset({
    features: ['meta', 'text'],
    num_rows: 15518009
})

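#### Memory mapping

Datasets treats each dataset as a memory-mapped file, which provides a mapping between RAM and filesystem storage and lets the library access and operate on elements of the dataset without loading it fully into memory.

Memory-mapped files can also be shared across multiple processes, which allows methods such as Dataset.map() to be parallelized without moving or copying the dataset. Under the hood, these capabilities are provided by the Apache Arrow memory format and the pyarrow library, which make data loading and processing very fast.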
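
Memory mapping itself can be tried with Python's `mmap` module. The sketch below is only an analogy, assuming a hypothetical file of fixed-width text records rather than an Arrow file: it maps the file and reads one record from the middle by offset, without reading the rest into a Python object.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "records.bin")
    # Write 1,000 fixed-width records (hypothetical data).
    with open(path, "wb") as f:
        for i in range(1000):
            f.write(f"record-{i:04d}\n".encode("ascii"))

    record_size = len("record-0000\n")  # 12 bytes per record

    with open(path, "rb") as f:
        # Map the whole file; pages are loaded by the OS only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = 500 * record_size
            record = mapped[start:start + record_size].decode("ascii").strip()
            total_bytes = len(mapped)

print(record, total_bytes)  # record-0500 12000
```

Slicing the mapped object looks like slicing an in-memory bytes value, yet only the touched region needs to be brought into RAM.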

In [None]:
!pip install psutil



In [None]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 872.36 MB


In [None]:
print(f"Dataset size in bytes : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

In [None]:
import timeit

code_snippet = """batch_size = 1000

for idx in range(0, len(pubmed_dataset), batch_size):
    _ = pubmed_dataset[idx:idx + batch_size]
"""
# Use Python's timeit module to measure how long code_snippet takes to run
time = timeit.timeit(stmt=code_snippet, number=1, globals=globals())
print(
    f"Iterated over {len(pubmed_dataset)} examples (about {size_gb:.1f} GB) in "
    f"{time:.1f}s, i.e. {size_gb/time:.3f} GB/s"
)

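#### Streaming datasets

To enable dataset streaming, simply pass streaming=True to load_dataset().

With streaming=True, the returned object is an IterableDataset, whose elements are accessed by iterating over it rather than by index.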

In [None]:
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)

In [None]:
# Shuffle the examples within a buffer of buffer_size elements
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
next(iter(shuffled_dataset))

In [None]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = shuffled_dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = shuffled_dataset.take(1000)
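
How a stream can be shuffled and split without loading it can be sketched with generators. `buffer_shuffle` is a hypothetical helper that follows the general shuffle-buffer idea (fill a buffer, emit a random element, replace it with the next one); it is an approximation for illustration, not the library's code. `itertools.islice` stands in for take() and skip().

```python
import itertools
import random


def buffer_shuffle(iterable, buffer_size, seed):
    """Approximate shuffle that holds at most buffer_size items in memory."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]      # emit a random buffered item
        buffer[index] = item     # and put the new item in its place
    rng.shuffle(buffer)          # flush what is left at the end
    yield from buffer


def stream():
    # Hypothetical data source that yields examples one at a time.
    yield from range(20)


# Same seed in both passes, so both see the same shuffled order.
validation_rows = list(itertools.islice(buffer_shuffle(stream(), 5, seed=42), 4))
train_rows = list(itertools.islice(buffer_shuffle(stream(), 5, seed=42), 4, None))

print(len(validation_rows), len(train_rows))  # 4 16
print(sorted(validation_rows + train_rows) == list(range(20)))  # True
```

Because both passes use the same seed, the first four shuffled examples go to validation and the rest to training, with no overlap.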

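### Creating your own dataset

This section shows how to create a corpus of GitHub issues, which are commonly used to track bugs or feature requests in GitHub repositories.

#### Getting the data

The data comes from the [Issues endpoint](https://docs.github.com/en/rest/issues?apiVersion=2022-11-28#list-repository-issues) of the GitHub REST API.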

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

In [None]:
!pip install requests

In [None]:
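import time
import math
from pathlib import Path
import pandas as pd
import requests
from tqdm.notebook import tqdm

# Download the issues in batches to avoid exceeding GitHub's limit on requests per hour.
# The result is stored in a repository_name-issues.jsonl file in which each line is a JSON object representing one issue.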
def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=10_000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"
    # tqdm is a fast, extensible progress-bar library for Python that shows progress through long loops.
    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}")
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print(f"Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )

In [None]:
fetch_issues()

  0%|          | 0/100 [00:00<?, ?it/s]

Reached GitHub rate limit. Sleeping for one hour ...
Downloaded all the issues for datasets! Dataset stored at ./datasets-issues.jsonl


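The Issues tab of the Datasets repository shows only about 1,000 issues in total, so why are there several thousand here?
> GitHub's REST API v3 considers every pull request an issue, but not every issue is a pull request. The "Issues" endpoints may therefore return both issues and pull requests in the response. For request limits, see the [GitHub rate limits documentation](https://docs.github.com/zh/apps/creating-github-apps/registering-a-github-app/rate-limits-for-github-apps?apiVersion=2022-11-28).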

In [None]:
from datasets import load_dataset

import pandas as pd
from datasets import Dataset
df = pd.read_json('datasets-issues.jsonl', orient='records', lines=True)
issues_dataset = Dataset.from_pandas(df, split="train")
issues_dataset

# Loading the file this way fails with: TypeError: Couldn't cast array of type timestamp[s] to null
# issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
# issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request'],
    num_rows: 6884
})

#### Cleaning the data

The pull_request column can be used to tell issues and pull requests apart.
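
The distinction can be shown on plain dictionaries. The records below are hypothetical and only shaped like Issues endpoint responses; the sketch assumes that a missing or None `pull_request` entry marks a plain issue.

```python
# Hypothetical records; the URL is a placeholder, not a real API address.
records = [
    {"number": 1, "title": "Bug in map", "pull_request": None},
    {"number": 2, "title": "Fix typo",
     "pull_request": {"url": "https://example.invalid/pulls/2"}},
    {"number": 3, "title": "Question"},  # key absent altogether
]


def is_pull_request(record):
    # A record counts as a pull request only if the entry is present and not None.
    return record.get("pull_request") is not None


issue_numbers = [r["number"] for r in records if not is_pull_request(r)]
pr_numbers = [r["number"] for r in records if is_pull_request(r)]
print(issue_numbers, pr_numbers)  # [1, 3] [2]
```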

In [None]:
# Look at a sample of the data

sample = issues_dataset.shuffle(seed=666).select(range(5))

# zip() pairs up the corresponding elements of several iterables (lists, tuples, and so on) into tuples
# Print out the URL and pull request entries
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

>> URL: https://github.com/huggingface/datasets/pull/500
>> Pull request: {'diff_url': 'https://github.com/huggingface/datasets/pull/500.diff', 'html_url': 'https://github.com/huggingface/datasets/pull/500', 'merged_at': '2020-08-20T07:59:18Z', 'patch_url': 'https://github.com/huggingface/datasets/pull/500.patch', 'url': 'https://api.github.com/repos/huggingface/datasets/pulls/500'}

>> URL: https://github.com/huggingface/datasets/pull/1589
>> Pull request: {'diff_url': 'https://github.com/huggingface/datasets/pull/1589.diff', 'html_url': 'https://github.com/huggingface/datasets/pull/1589', 'merged_at': None, 'patch_url': 'https://github.com/huggingface/datasets/pull/1589.patch', 'url': 'https://api.github.com/repos/huggingface/datasets/pulls/1589'}

>> URL: https://github.com/huggingface/datasets/pull/3026
>> Pull request: {'diff_url': 'https://github.com/huggingface/datasets/pull/3026.diff', 'html_url': 'https://github.com/huggingface/datasets/pull/3026', 'merged_at': '2021-10-08T16:

In [None]:
# Flag each entry as a pull request or a plain issue
issues_dataset = issues_dataset.map(
    lambda x: {"is_pull_request": x["pull_request"] is not None}
)

Map:   0%|          | 0/6884 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 6884
})

#### Augmenting the dataset

The GitHub REST API provides a [Comments endpoint](https://docs.github.com/en/rest/reference/issues#list-issue-comments) that returns all comments associated with an issue number.

In [None]:
# Look at one of them
issue_number = 6829
url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
response = requests.get(url)
response.json()

[]

In [None]:
# Switch the output format to pandas
issues_dataset.set_format("pandas")

# issues_dataset[:3]
df = issues_dataset[:]

is_pull_request_true_count = df[df['is_pull_request']].shape[0]
is_pull_request_false_count = df[~df['is_pull_request']].shape[0]

# Print the results
print(f"is_pull_request is True: {is_pull_request_true_count}")
print(f"is_pull_request is False: {is_pull_request_false_count}")

is_pull_request is True: 4050
is_pull_request is False: 2834


In [None]:
# Restore the default format
issues_dataset.reset_format()

In [None]:
# Fetch the comments for an issue
def get_comments(issue_number, is_pull_request=False):
    # Pull requests are skipped; only plain issues are queried
    if is_pull_request:
        return []
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url)
    return [r["body"] for r in response.json() if r and "body" in r]



In [None]:
issues_dataset[95]

{'url': 'https://api.github.com/repos/huggingface/datasets/issues/6829',
 'repository_url': 'https://api.github.com/repos/huggingface/datasets',
 'labels_url': 'https://api.github.com/repos/huggingface/datasets/issues/6829/labels{/name}',
 'comments_url': 'https://api.github.com/repos/huggingface/datasets/issues/6829/comments',
 'events_url': 'https://api.github.com/repos/huggingface/datasets/issues/6829/events',
 'html_url': 'https://github.com/huggingface/datasets/issues/6829',
 'id': 2258424577,
 'node_id': 'I_kwDODunzps6GnNMB',
 'number': 6829,
 'title': 'Load and save from/to disk no longer accept pathlib.Path',
 'user': {'avatar_url': 'https://avatars.githubusercontent.com/u/8515462?v=4',
  'events_url': 'https://api.github.com/users/albertvillanova/events{/privacy}',
  'followers_url': 'https://api.github.com/users/albertvillanova/followers',
  'following_url': 'https://api.github.com/users/albertvillanova/following{/other_user}',
  'gists_url': 'https://api.github.com/users/alb

In [None]:
get_comments(6829,False)

[]

In [None]:
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"],x["is_pull_request"])}
)

Map:   0%|          | 0/6884 [00:00<?, ? examples/s]

#### Uploading the dataset to the Hugging Face Hub

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
issues_with_comments_dataset.push_to_hub("github-HF-datasets-issues")

In [None]:
# Load it later
remote_dataset = load_dataset("leoli04/github-HF-datasets-issues", split="train")
remote_dataset

Downloading readme:   0%|          | 0.00/6.07k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/5.31M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6884 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 6884
})

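#### Creating a dataset card

On the Hugging Face Hub, the dataset card is stored in the README.md file of each dataset repository. Two main steps come before creating this file:
- Use the datasets-tagging application to create metadata tags in YAML format. These tags are what users search by.
- Read the [Datasets guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) and use it as a template.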

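### Semantic search with FAISS

#### Using embeddings for semantic search

Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that the individual embeddings can be "pooled" to create a vector representation for a whole sentence, paragraph, or (in some cases) document. These embeddings can then be used to find similar documents in a corpus by computing the dot-product similarity (or some other similarity metric) between each pair of embeddings and returning the documents with the greatest overlap.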
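
Pooling and dot-product ranking can be sketched with plain lists. The vectors below are hypothetical two-dimensional "token embeddings" chosen by hand, and mean pooling is one assumed pooling strategy among several; a real model would produce vectors with hundreds of dimensions.

```python
def mean_pool(token_vectors):
    """Average the token vectors into a single vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def dot(a, b):
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical token embeddings for two documents and one query.
documents = {
    "doc_a": [[1.0, 0.0], [1.0, 0.0]],
    "doc_b": [[0.0, 1.0], [0.0, 3.0]],
}
query_tokens = [[0.0, 1.0], [0.0, 1.0]]

query = mean_pool(query_tokens)
scores = {name: dot(query, mean_pool(tokens)) for name, tokens in documents.items()}
best = max(scores, key=scores.get)

print(scores)  # {'doc_a': 0.0, 'doc_b': 2.0}
print(best)  # doc_b
```

The document whose pooled vector points in the same direction as the query gets the highest score and is returned first.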

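In this section we use embeddings to build a semantic search engine. Compared with traditional approaches that match keywords in a query against documents, embedding-based search offers advantages such as:

- Understanding context
- Handling synonyms and polysemous words
- Understanding phrases and concepts
- Handling natural-language queries
- A more personalized search experience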

#### Loading and preparing the data


In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install faiss-gpu

In [10]:
from datasets import load_dataset

issues_dataset = load_dataset("leoli04/github-HF-datasets-issues", split="train")

issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 6884
})

In [11]:
# The original dataset also contains pull requests, so keep only the issues
issues_dataset = issues_dataset.filter(lambda x: not x["is_pull_request"])

issues_dataset

Filter:   0%|          | 0/6884 [00:00<?, ? examples/s]

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'body', 'reactions', 'timeline_url', 'performed_via_github_app', 'state_reason', 'draft', 'pull_request', 'is_pull_request'],
    num_rows: 2834
})

In [12]:
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2834
})

In [13]:
issues_dataset.set_format("pandas")
df = issues_dataset[:]

In [14]:
# Use explode to turn each element of the list in the comments column into its own row of the DataFrame.
# With ignore_index=True, the DataFrame index is reset.
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

|   | html_url | title | comments | body |
| - | -------- | ----- | -------- | ---- |
| 0 | https://github.com/huggingface/datasets/issues... | Caching map result of DatasetDict. |  | Hi!\r\n\r\nI'm currenty using the map function... |
| 1 | https://github.com/huggingface/datasets/issues... | Export Parquet Tablet Audio-Set is null bytes ... |  | ### Describe the bug\n\nExporting the processe... |
| 2 | https://github.com/huggingface/datasets/issues... | Invalid YAML in README.md: unknown tag !<tag:y... |  | ### Describe the bug\n\nI wrote a notebook to ... |
| 3 | https://github.com/huggingface/datasets/issues... | NonMatchingSplitsSizesError when using data_dir | Thanks for reporting, @srehaag.\r\n\r\nWe are ... | ### Describe the bug\n\nLoading a dataset from... |
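
The effect of explode can be sketched in plain Python: every element of a list-valued column becomes its own row, with the other fields copied. An empty list still yields one row with a missing value, which is why some rows above have no comment. The `explode_rows` helper is a hypothetical illustration, not the pandas implementation (pandas uses NaN where this sketch uses None).

```python
def explode_rows(rows, column):
    """Return one row per element of row[column], copying the other fields."""
    exploded = []
    for row in rows:
        values = row[column] or [None]  # an empty list still yields one row
        for value in values:
            exploded.append({**row, column: value})
    return exploded


rows = [
    {"title": "issue A", "comments": ["first", "second"]},
    {"title": "issue B", "comments": []},
]
exploded = explode_rows(rows, "comments")
for row in exploded:
    print(row)
```

Two input rows become three output rows here, in the same way that the number of rows grows once the issues are expanded into individual comments.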


In [15]:
from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 3256
})

In [17]:
# Add a comment_length column (number of words in each comment)
comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split()) if x["comments"] else 0 }
)

Map:   0%|          | 0/3256 [00:00<?, ? examples/s]

In [18]:
# Drop short comments (15 words or fewer)
comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Filter:   0%|          | 0/3256 [00:00<?, ? examples/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 435
})

In [19]:
# Add a new text field: concatenate the issue title, description, and comment into a new text column
def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

Map:   0%|          | 0/435 [00:00<?, ? examples/s]
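
String concatenation with `+` raises a `TypeError` if any field is `None`, for example when an issue has no description. A defensive variant could look like the sketch below; `concatenate_text_safe` is a hypothetical name, and for rows where all three fields are present it produces the same text as `concatenate_text`.

```python
def concatenate_text_safe(example):
    """Join title, body and comment, treating missing (None) fields as empty strings."""
    parts = [example.get("title"), example.get("body"), example.get("comments")]
    return {"text": " \n ".join(part or "" for part in parts)}


# Toy row with a missing body
example = {"title": "Bug in map", "body": None, "comments": "Thanks for reporting."}
result = concatenate_text_safe(example)
print(result)
```

It could be passed to `Dataset.map()` in the same way as the original function.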

#### Creating text embeddings



In [20]:
from transformers import AutoTokenizer, AutoModel

# According to the sentence-transformers documentation, the multi-qa-mpnet-base-dot-v1 checkpoint performs best for semantic search
model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [None]:
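# Place the model on the GPU when one is available, otherwise fall back to the CPU
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)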

In [22]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
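
The indexing `[:, 0]` keeps, for every sequence in the batch, only the hidden state at position 0, which is where the tokenizer places the special start-of-sequence token. The same selection on nested lists, with made-up numbers and a hypothetical helper name, looks roughly like this:

```python
# Toy "last_hidden_state": 2 sequences, 3 tokens each, hidden size 2 (made-up values)
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]


def cls_pooling_lists(hidden_states):
    """Keep only the vector at position 0 of every sequence."""
    return [sequence[0] for sequence in hidden_states]


pooled = cls_pooling_lists(last_hidden_state)
print(pooled)
```

The result has one vector per sequence, so a batch of shape (batch, tokens, hidden) is reduced to (batch, hidden).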

In [23]:
# Tokenize a list of documents, move the tensors to the same device as the model, feed them to the model, and apply CLS pooling to the output
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

In [24]:
# Check the output shape to verify that the function works
embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

In [25]:
# Add an embeddings column
embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Map:   0%|          | 0/435 [00:00<?, ? examples/s]

#### Using FAISS for efficient similarity search

In [26]:
# Add a FAISS index on the embeddings column
embeddings_dataset.add_faiss_index(column="embeddings")


  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length', 'text', 'embeddings'],
    num_rows: 435
})

In [27]:
# Convert the question into an embedding vector
question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

In [28]:
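# Query the index with Dataset.get_nearest_examples(), which performs a nearest-neighbor lookup.
# It returns a tuple of scores that rank the overlap between the query and the documents, together with the corresponding samples (here, the 5 best matches).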
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)
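
Conceptually, a nearest-neighbor lookup compares the query vector with every stored vector and keeps the k best. The sketch below does this exhaustively in pure Python, assuming squared Euclidean distance as the score; this is an assumption for illustration, since the metric actually used depends on how the FAISS index is configured. The `nearest_examples` helper and the toy vectors are hypothetical.

```python
import heapq


def nearest_examples(vectors, query, k):
    """Return (scores, indices) of the k vectors closest to query (squared L2)."""

    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = [(sq_l2(vec, query), idx) for idx, vec in enumerate(vectors)]
    best = heapq.nsmallest(k, scored)  # smallest distance first
    return [score for score, _ in best], [idx for _, idx in best]


# Toy 2-dimensional "embeddings" (made-up numbers)
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
toy_scores, toy_indices = nearest_examples(vectors, [1.0, 0.5], k=2)
print(toy_scores, toy_indices)
```

With a distance-based score, smaller values indicate closer matches; with an inner-product score, larger values do. It is worth checking which one an index uses before sorting results by score.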

In [29]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)

In [30]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

COMMENT: It's still not officially supported x) But you can try to update `request_etag` in `file_utils.py` to use `fsspec_head` instead of `http_head`. It is responsible of getting the ETags of the remote files for caching. This change may do the trick for S3 urls
SCORE: 42.98045349121094
TITLE: Support cloud storage in load_dataset
URL: https://github.com/huggingface/datasets/issues/5281

COMMENT: Makes sense ! If you want to load locally a dataset that you download_and_prepared on a cloud storage, you would use `load_dataset(path_to_cloud_storage)` indeed. It would download the data from the cloud storage, cache them locally, and return a `Dataset`.
SCORE: 42.69151306152344
TITLE: Support cloud storage in load_dataset
URL: https://github.com/huggingface/datasets/issues/5281

COMMENT: Hi @kswamy15, thanks for reporting.

We are fixing this critical issue and making an urgent patch release of the `datasets` library today.

In the meantime, you can circumvent this issue by updating