
# 📖 What if my dataset isn't on the Hub?

Many datasets live on GitHub or other remote sources instead of the Hub.
Here, we'll see how to load local and remote datasets, including compressed files, and how Hugging Face Datasets handles them efficiently.



Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5


## 1️⃣ Downloading Example Files

We'll use the SQuAD-it (Italian QA) dataset from GitHub.
These commands download and decompress two .json.gz files locally.


In [12]:
# Download the SQuAD-it training and test splits from GitHub (compressed)
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

# Decompress both files (.json.gz -> .json), keeping the originals (-k)
!gzip -dkv SQuAD_it-*.json.gz

--2025-09-08 06:17:02--  https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-train.json.gz [following]
--2025-09-08 06:17:02--  https://raw.githubusercontent.com/crux82/squad-it/master/SQuAD_it-train.json.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7725286 (7.4M) [application/octet-stream]
Saving to: ‘SQuAD_it-train.json.gz.2’


2025-09-08 06:17:03 (113 MB/s) - ‘SQuAD_it-train.json.gz.2’ saved [7725286/7725286]

--2025-09-08 06:17:03--  https://github.com/crux82/squad-it/raw/master/SQu

## 2️⃣ Loading a Local JSON Dataset

We can load the training set with the `load_dataset` function. Because the examples are nested inside the JSON file, the `field` argument names the key that holds them.


In [13]:
from datasets import load_dataset

# Load the local JSON file, specifying the "data" field that contains the examples
squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

# Inspect the resulting DatasetDict (a single "train" split by default)
print(squad_it_dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
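
To see why `field="data"` is needed, it can help to look at how a SQuAD-style file is laid out. The sketch below uses a tiny hand-written stand-in (the `version` value and the titles are made up for illustration, not copied from the real file) and the standard `json` module to show that the examples sit under a top-level `data` key, which is the key `field` selects.

```python
import json

# Hypothetical miniature of a SQuAD-style file: the examples live under "data",
# next to other top-level metadata (the "version" value here is made up).
raw = json.dumps({
    "version": "demo",
    "data": [
        {"title": "Article A", "paragraphs": []},
        {"title": "Article B", "paragraphs": []},
    ],
})

parsed = json.loads(raw)

# Picking out the "data" key by hand mirrors what field="data" asks for:
# each element of this list corresponds to one row of the dataset.
rows = parsed["data"]
num_rows = len(rows)
columns = sorted(rows[0].keys())

print(num_rows)  # 2
print(columns)   # ['paragraphs', 'title']
```

This matches the shape of the output above: one row per article, with `title` and `paragraphs` as the features.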


## 3️⃣ Inspecting Data Structure

View the first example to understand the format and fields.


In [14]:
# Print the first example in the train split
print(squad_it_dataset["train"][0])

{'title': 'Terremoto del Sichuan del 2008', 'paragraphs': [{'context': "Il terremoto del Sichuan del 2008 o il terremoto del Gran Sichuan, misurato a 8.0 Ms e 7.9 Mw, e si è verificato alle 02:28:01 PM China Standard Time all' epicentro (06:28:01 UTC) il 12 maggio nella provincia del Sichuan, ha ucciso 69.197 persone e lasciato 18.222 dispersi.", 'qas': [{'answers': [{'answer_start': 29, 'text': '2008'}], 'id': '56cdca7862d2951400fa6826', 'question': 'In quale anno si è verificato il terremoto nel Sichuan?'}, {'answers': [{'answer_start': 232, 'text': '69.197'}], 'id': '56cdca7862d2951400fa6828', 'question': 'Quante persone sono state uccise come risultato?'}, {'answers': [{'answer_start': 29, 'text': '2008'}], 'id': '56d4f9902ccc5a1400d833c0', 'question': 'Quale anno ha avuto luogo il terremoto del Sichuan?'}, {'answers': [{'answer_start': 78, 'text': '8.0 Ms e 7.9 Mw'}], 'id': '56d4f9902ccc5a1400d833c1', 'question': 'Che cosa ha fatto la misura di sisma?'}, {'answers': [{'answer_st
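
Each row is a whole article, with the questions nested two levels down (`paragraphs`, then `qas`, then `answers`). Here is a minimal sketch of walking that nesting in plain Python, using a shortened copy of the record printed above. The helper name `flatten_article` is hypothetical and not part of the Datasets library.

```python
def flatten_article(article):
    """Yield one flat dict per question in a SQuAD-style article record."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            yield {
                "title": article["title"],
                "id": qa["id"],
                "question": qa["question"],
                "answer_texts": [a["text"] for a in qa["answers"]],
                "answer_starts": [a["answer_start"] for a in qa["answers"]],
            }


# Shortened copy of the first training example (context abridged,
# only the first two questions kept).
article = {
    "title": "Terremoto del Sichuan del 2008",
    "paragraphs": [{
        "context": "Il terremoto del Sichuan del 2008 ...",
        "qas": [
            {"answers": [{"answer_start": 29, "text": "2008"}],
             "id": "56cdca7862d2951400fa6826",
             "question": "In quale anno si è verificato il terremoto nel Sichuan?"},
            {"answers": [{"answer_start": 232, "text": "69.197"}],
             "id": "56cdca7862d2951400fa6828",
             "question": "Quante persone sono state uccise come risultato?"},
        ],
    }],
}

flat_rows = list(flatten_article(article))
for row in flat_rows:
    print(row["id"], row["answer_texts"])
```

The `answer_start` values are character offsets into the full `context` string, so they only line up with the unabridged text.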

## 4️⃣ Loading Both Train & Test Splits

By passing a dictionary to `data_files`, you can load multiple splits at once.


In [15]:
# Load both the train and test splits into a single DatasetDict
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# Inspect the dictionary: both splits are present
print(squad_it_dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})


## 5️⃣ (Advanced Tip) Loading from Compressed Files Directly

Hugging Face Datasets can decompress gzip files automatically, so you don't have to unzip them first!


In [16]:
# Load directly from compressed .json.gz files by passing their paths
data_files = {
    "train": "SQuAD_it-train.json.gz",
    "test": "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# The resulting object is the same
print(squad_it_dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
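
What the automatic decompression saves you is roughly the manual step sketched below. This is only an illustration with the standard `gzip` and `json` modules on a small throwaway file (the file name `demo.json.gz` and its contents are made up). It is not a description of how the Datasets library works internally.

```python
import gzip
import json
import os
import tempfile

# Made-up payload standing in for a SQuAD-style file.
payload = {"data": [{"title": "Article A", "paragraphs": []}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "demo.json.gz")

    # Write the JSON straight into a gzip-compressed file.
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back without creating an uncompressed copy on disk.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == payload)  # True
```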


## 6️⃣ Loading Remote Datasets

Point `data_files` to the raw URLs, and Datasets will download and parse the files for you.

This is handy when your data is hosted on GitHub or elsewhere online.


In [18]:
# Example: loading SQuAD-it splits directly from GitHub URLs
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
print(squad_it_dataset)

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
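
If you load the same splits in several places, the URL dictionary can be built by a small helper. The sketch below is one possible way to do it. The function name `build_data_files` is hypothetical, and it assumes the `SQuAD_it-<split>.json.gz` naming used by the two files in this notebook.

```python
def build_data_files(base_url, splits=("train", "test")):
    """Map each split name to its SQuAD-it file URL under base_url."""
    base = base_url.rstrip("/")  # tolerate a trailing slash
    return {split: f"{base}/SQuAD_it-{split}.json.gz" for split in splits}


data_files = build_data_files("https://github.com/crux82/squad-it/raw/master/")
for split, file_url in data_files.items():
    print(split, file_url)
```

The resulting dictionary has the same keys and values as the one written by hand in the cell above, so it can be passed as `data_files` in the same way.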


# ✅ Summary

- Load local and remote files in many formats (csv, json, text, etc.)
- Open compressed files without manual extraction
- Organize multiple splits with a dictionary argument
- Inspect and manipulate resulting datasets with standard Hugging Face tools
