# What if my dataset isn't on the Hub?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## Loading a local dataset
The Datasets library provides loading scripts for several common data formats:

>CSV & TSV: load_dataset("csv", data_files="my_file.csv")<br>
Text files: load_dataset("text", data_files="my_file.txt")<br>
JSON & JSON Lines: load_dataset("json", data_files="my_file.jsonl")<br>
Pickled DataFrames (pandas): load_dataset("pandas", data_files="my_dataframe.pkl")<br>
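
The mapping from file type to loading script in the list above can be written down as a small lookup. The sketch below is only an illustration: `guess_builder`, `SUFFIX_TO_BUILDER`, and `COMPRESSION_SUFFIXES` are hypothetical names that are not part of the Datasets library, and the table covers only the extensions listed here.

```python
from pathlib import Path

# Hypothetical lookup table: file suffix -> builder name passed to load_dataset.
# The builder names come from the list above; the helper itself is illustrative.
SUFFIX_TO_BUILDER = {
    ".csv": "csv",
    ".tsv": "csv",
    ".txt": "text",
    ".json": "json",
    ".jsonl": "json",
    ".pkl": "pandas",
}

# Compression suffixes to ignore when looking for the real extension.
COMPRESSION_SUFFIXES = {".gz"}


def guess_builder(path: str) -> str:
    """Return the loading-script name that matches a data file's extension."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop a trailing compression suffix such as ".gz".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes or suffixes[-1] not in SUFFIX_TO_BUILDER:
        raise ValueError(f"No known builder for {path!r}")
    return SUFFIX_TO_BUILDER[suffixes[-1]]


print(guess_builder("my_file.csv"))
print(guess_builder("SQuAD_it-train.json.gz"))
```

The returned name would be used as the first argument, as in `load_dataset(guess_builder(path), data_files=path)`. A TSV file also needs a tab delimiter, for example `delimiter="\t"`.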

Take the SQuAD-it dataset as an example: it is a large-scale question answering dataset in Italian.

In [None]:
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

In [None]:
!gzip -dkv SQuAD_it-*.json.gz

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset(
    "json",
    data_files="SQuAD_it-train.json",
    field="data",  # read the records stored under the top-level "data" field
)
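
To make the role of `field="data"` concrete, here is a minimal sketch in plain Python. It assumes the file follows the usual SQuAD layout, with the records under a top-level `data` key next to metadata such as `version`. The miniature JSON document, including its second title and version string, is made up for illustration.

```python
import json

# Hypothetical miniature file in the SQuAD layout: the records live under "data".
raw_text = json.dumps(
    {
        "version": "1.1",  # made-up metadata value
        "data": [
            {"title": "Terremoto del Sichuan del 2008", "paragraphs": []},
            {"title": "Second article (made up)", "paragraphs": []},
        ],
    }
)

parsed = json.loads(raw_text)

# Roughly what field="data" asks for: use the list under "data" as the rows,
# ignoring sibling keys such as "version".
rows = parsed["data"]

print(len(rows))        # number of rows the dataset would have: 2
print(sorted(rows[0]))  # column names: ['paragraphs', 'title']
```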

In [None]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In [None]:
squad_it_dataset["train"][0]

{
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                },
                ...
            ],
        },
        ...
    ],
}
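
Each row is deeply nested: an article holds paragraphs, and each paragraph holds its own list of questions and answers. As a rough sketch of how that structure can be walked, the plain-Python generator below flattens one article into one record per question. `iter_questions` is a hypothetical helper, and the sample record copies the output above without the elided parts.

```python
# Sample record shaped like the output above; the elided parts ("...") are left out.
example = {
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [
        {
            "context": "Il terremoto del Sichuan del 2008 o il terremoto...",
            "qas": [
                {
                    "answers": [{"answer_start": 29, "text": "2008"}],
                    "id": "56cdca7862d2951400fa6826",
                    "question": "In quale anno si è verificato il terremoto nel Sichuan?",
                }
            ],
        }
    ],
}


def iter_questions(article):
    """Yield one flat dict per question in a SQuAD-style article."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            yield {
                "title": article["title"],
                "id": qa["id"],
                "question": qa["question"],
                # Keep only the answer strings; offsets stay in the original record.
                "answers": [answer["text"] for answer in qa["answers"]],
            }


flat = list(iter_questions(example))
print(flat[0]["question"], "->", flat[0]["answers"])
```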

Training a model usually requires both training and test data, that is, a DatasetDict object with both a train and a test split.

In [None]:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

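The Datasets library's loading scripts can decompress files automatically, so you can pass the paths of the compressed data files directly in data_files and skip the manual decompression step: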

In [None]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
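
For comparison, the following sketch shows roughly what doing the decompression by hand looks like with the standard library. It is not how the Datasets library works internally. The tiny archive and its contents are made up so the example runs without downloading anything.

```python
import gzip
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for SQuAD_it-train.json.gz with a single record.
    archive = Path(tmp_dir) / "tiny-train.json.gz"
    payload = {"data": [{"title": "Esempio", "paragraphs": []}]}
    with gzip.open(archive, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)

    # Manual equivalent of the automatic step: decompress on the fly, then parse
    # and keep the list stored under "data".
    with gzip.open(archive, "rt", encoding="utf-8") as handle:
        decompressed_rows = json.load(handle)["data"]

print(decompressed_rows[0]["title"])
```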

## Loading a remote dataset

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
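
When more splits are hosted under the same base URL, the `data_files` mapping can be built programmatically. This is a small sketch only: `remote_data_files` is a hypothetical helper, and the file-name pattern is an assumption taken from the two URLs above.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com/crux82/squad-it/raw/master/"


def remote_data_files(base_url, splits=("train", "test")):
    """Map each split name to the URL of its compressed SQuAD-it file."""
    # urljoin appends the file name only when base_url ends with "/".
    # The file-name pattern mirrors the two URLs used above; other repositories
    # will use different names.
    return {split: urljoin(base_url, f"SQuAD_it-{split}.json.gz") for split in splits}


data_files = remote_data_files(BASE_URL)
print(data_files["train"])
```

The resulting dictionary has the same shape as the hand-written one and can be passed to `load_dataset("json", data_files=data_files, field="data")`.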