# Loading local and remote datasets (HF LLM Course - Chapter 5.2)



This notebook follows the "What if my dataset isn't on the Hub?" section of the Hugging Face LLM Course, using the SQuAD-it question-answering dataset to show how to load local, compressed, and remote files with Hugging Face Datasets.

In [None]:
# Download the compressed SQuAD-it splits locally
!wget -q https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget -q https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz



In [None]:
# Decompress the archives (Datasets can also do this automatically)
!gzip -dkv SQuAD_it-*.json.gz
!ls -lh SQuAD_it-*.json

SQuAD_it-test.json.gz:	 87.5% -- created SQuAD_it-test.json
SQuAD_it-train.json.gz:	 82.3% -- created SQuAD_it-train.json

In [None]:
from datasets import load_dataset

# Load a single local JSON file (defaults to a `train` split)
squad_it_dataset = load_dataset(
    "json",
    data_files="SQuAD_it-train.json",
    field="data",
)
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

In [None]:
# Inspect the first training example
squad_it_dataset["train"][0]

In [None]:
# Load both train and test splits from local JSON files
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

In [None]:
# Peek at the test split
squad_it_dataset["test"][0]

### Automatic decompression

You can skip the manual `gzip` step: point `load_dataset` at the compressed files directly and Hugging Face Datasets will decompress the `.gz` archives for you.

In [None]:
# Load directly from compressed JSON files
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
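
To make concrete what the library spares you from doing, here is a minimal standard-library sketch of reading a gzip-compressed JSON file without first unpacking it to disk. It only illustrates the idea and is not how Hugging Face Datasets implements decompression; the toy payload and the file name `toy.json.gz` are made up for the example.

```python
import gzip
import json
import os
import tempfile

# Toy SQuAD-style payload (hypothetical content, not taken from SQuAD-it).
payload = {"version": "toy", "data": [{"title": "Example", "paragraphs": []}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "toy.json.gz")

    # Write the JSON as gzip-compressed text.
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back in one step: gzip.open decompresses on the fly,
    # so no intermediate .json file is created on disk.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == payload)
```

The compressed file round-trips to the same JSON content, which is why the row counts reported for the `.json.gz` files match those of the decompressed files.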

### Loading remote files

Instead of downloading the files yourself first, you can pass URLs to `data_files` and let Hugging Face Datasets download and decompress them for you.

In [None]:
# Load the same splits directly from remote URLs
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [None]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
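
Every `load_dataset` call in this notebook passes `field="data"`. In the SQuAD layout the records sit in a list under a top-level `data` key, next to a `version` string, so the loader has to be told which key holds the rows. The standard-library sketch below mimics that selection on a made-up miniature file to show why the loaded splits expose `title` and `paragraphs` columns. The record contents are hypothetical, and this illustrates the idea rather than the library's implementation.

```python
import json

# Made-up miniature file in the SQuAD layout (contents are hypothetical).
raw = """
{
  "version": "toy",
  "data": [
    {"title": "First article", "paragraphs": [{"context": "Some passage.", "qas": []}]},
    {"title": "Second article", "paragraphs": []}
  ]
}
"""

document = json.loads(raw)

# Selecting the "data" field keeps only the list of records;
# each record becomes one row, and its keys become the columns.
records = document["data"]
columns = sorted({key for record in records for key in record})

print(len(records))
print(columns)
```

Here the two top-level articles become two rows, which mirrors how the 442 and 48 rows reported above are counted per article, not per question.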