# What if my dataset isn’t on the Hub?


## 1. Loading a local dataset

For this example we’ll use the SQuAD-it dataset, which is a large-scale dataset for question answering in Italian.



The training and test splits are hosted on GitHub, so we can download them with Python:



In [2]:
import urllib.request

url_train = "https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz"
url_test = "https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz"
urllib.request.urlretrieve(url_train, "SQuAD_it-train.json.gz")
urllib.request.urlretrieve(url_test, "SQuAD_it-test.json.gz")

('SQuAD_it-test.json.gz', <http.client.HTTPMessage at 0x2561d3e5310>)

This will download two compressed files called `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz`, which we can decompress as follows:



In [3]:
import gzip
import shutil

# Decompress training file
with gzip.open("SQuAD_it-train.json.gz", "rb") as f_in:
    with open("SQuAD_it-train.json", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Decompress test file
with gzip.open("SQuAD_it-test.json.gz", "rb") as f_in:
    with open("SQuAD_it-test.json", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

This writes the decompressed files `SQuAD_it-train.json` and `SQuAD_it-test.json` next to the compressed originals (the `.gz` files are left in place), and the data is stored in the JSON format.
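The two near-identical decompression blocks can be folded into one small function. The following is a minimal sketch, not part of any library: `decompress_gzip` is a hypothetical helper name, and the demonstration runs on a throwaway file in a temporary directory rather than on the real dataset.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gzip(src, dest=None):
    """Decompress a .gz file, keeping the original, and return the output path."""
    src = Path(src)
    # Default output name: drop the trailing ".gz" suffix.
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Demonstrate on a small throwaway file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "example.json.gz"
    with gzip.open(gz_path, "wb") as f:
        f.write(b'{"data": []}')
    out_path = decompress_gzip(gz_path)
    out_name = out_path.name
    out_bytes = out_path.read_bytes()
    original_kept = gz_path.exists()
```

With this helper, calling `decompress_gzip("SQuAD_it-train.json.gz")` would write `SQuAD_it-train.json` in the same directory.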



To load a JSON file with the `load_dataset()` function, we just need to know if we’re dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a `data` field. This means we can load the dataset by specifying the `field` argument as follows:



In [4]:
from datasets import load_dataset

squad_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

Generating train split: 0 examples [00:00, ? examples/s]
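To make the difference between ordinary JSON and JSON Lines concrete, here is a small sketch using only the standard library. The records are made-up stand-ins for SQuAD-style entries, not real data from SQuAD-it.

```python
import json

# Toy records standing in for SQuAD-style entries (made up for illustration).
records = [
    {"title": "Roma", "paragraphs": []},
    {"title": "Milano", "paragraphs": []},
]

# Ordinary (nested) JSON: one document, with the records under a "data" key.
nested_text = json.dumps({"version": "toy", "data": records})

# JSON Lines: one self-contained JSON object per line, with no outer wrapper.
jsonl_text = "\n".join(json.dumps(record) for record in records)

# Nested JSON is parsed in one go, then indexed by the field name.
from_nested = json.loads(nested_text)["data"]

# JSON Lines is parsed one line at a time.
from_jsonl = [json.loads(line) for line in jsonl_text.splitlines()]
```

Both routes recover the same list of records. The nested layout is why `field="data"` is needed for SQuAD-it: the rows live one level below the top of the document.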

By default, loading local files creates a `DatasetDict` object with a `train` split. We can see this by inspecting the `squad_dataset` object:



In [5]:
squad_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In [11]:
squad_dataset["train"]

Dataset({
    features: ['title', 'paragraphs'],
    num_rows: 442
})

Great, we’ve loaded our first local dataset! But while this worked for the training set, what we really want is to include both the `train` and `test` splits in a single `DatasetDict` object so we can apply `Dataset.map()` functions across both splits at once. To do this, we can provide a dictionary to the `data_files` argument that maps each split name to a file associated with that split:

In [12]:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_dataset = load_dataset("json", data_files=data_files, field="data")
squad_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

This is exactly what we wanted. Now, we can apply various preprocessing techniques to clean up the data, tokenize the text, and so on.
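When there are more than a couple of split files, the split-to-file mapping can be built from the file names instead of typed by hand. The sketch below assumes files follow the `SQuAD_it-<split>.json` naming used here. `build_data_files` is a hypothetical helper, not part of 🤗 Datasets, and the demonstration uses empty placeholder files.

```python
import re
import tempfile
from pathlib import Path


def build_data_files(directory, pattern=r"SQuAD_it-(\w+)\.json$"):
    """Map split names to file paths for files whose names match the pattern."""
    mapping = {}
    for path in sorted(Path(directory).iterdir()):
        match = re.match(pattern, path.name)
        if match:
            # The captured group is the split name, e.g. "train" or "test".
            mapping[match.group(1)] = str(path)
    return mapping


# Demonstrate on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "test"):
        (Path(tmp) / f"SQuAD_it-{split}.json").touch()
    data_files = build_data_files(tmp)
    split_names = sorted(data_files)
```

The resulting dictionary has the same shape as the hand-written `data_files` mapping, so it could be passed to `load_dataset("json", data_files=..., field="data")` in the same way.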



The loading scripts in 🤗 Datasets actually support automatic decompression of the input files, so we could have skipped the use of `gzip` by pointing the `data_files` argument directly to the compressed files:



In [None]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_dataset = load_dataset("json", data_files=data_files, field="data")

This can be useful if you don’t want to manually decompress many GZIP files. The automatic decompression also applies to other common formats like ZIP and TAR, so you just need to point `data_files` to the compressed files and you’re good to go!
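As a rough illustration of what reading a compressed file in place means, the standard library can parse a gzipped JSON file without ever writing a decompressed copy to disk. This is only a sketch of the idea on a made-up toy file; it is not a description of how 🤗 Datasets implements its decompression.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Made-up payload in the same nested shape as SQuAD-it (a top-level "data" field).
payload = {"data": [{"title": "toy", "paragraphs": []}]}

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "toy.json.gz"

    # Write a small gzip-compressed JSON file.
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back directly, decompressing on the fly.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

    # Only the compressed file exists; no .json copy was created.
    files_on_disk = sorted(p.name for p in Path(tmp).iterdir())
```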



## 2. Loading a remote dataset

Instead of providing a path to local files, we point the `data_files` argument of `load_dataset()` to one or more URLs where the remote files are stored. For example, for the SQuAD-it dataset hosted on GitHub, we can point `data_files` to the `SQuAD_it-*.json.gz` URLs as follows:

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")