<a href="https://colab.research.google.com/github/Lakshmi-Adhikari-AI/LLM-HuggingFace/blob/main/ch5/loading_local_and_remote_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📖 What if my dataset isn't on the Hub?

Many datasets are stored on GitHub or other remote servers rather than on the Hugging Face Hub.  
This notebook shows how to load local and remote datasets, including compressed files, with the Hugging Face Datasets library.



Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 1️⃣ Downloading Example Files

We'll use the SQuAD-it (Italian QA) dataset from GitHub.
These commands download and decompress two .json.gz files locally.


In [None]:
# Download the SQuAD-it training and test splits from GitHub (gzip-compressed)
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

# Decompress both files (.json.gz -> .json); -k keeps the archives, -v prints progress
!gzip -dkv SQuAD_it-*.json.gz
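
If `wget` or `gzip` is not available (for instance on Windows), the same two steps can be sketched with Python's standard library. The helper names `download`, `gunzip_keep`, and `main` are hypothetical and chosen only for this illustration; the URLs are the ones used in the cell above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest_dir: str = ".") -> Path:
    """Save the file at `url` into `dest_dir`, keeping its original name."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip_keep(gz_path: Path) -> Path:
    """Decompress `name.json.gz` to `name.json` and keep the archive (like `gzip -dk`)."""
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def main() -> None:
    """Download and decompress both SQuAD-it splits into the current directory."""
    base = "https://github.com/crux82/squad-it/raw/master/"
    for name in ("SQuAD_it-train.json.gz", "SQuAD_it-test.json.gz"):
        print(gunzip_keep(download(base + name)))
```

Calling `main()` performs the download; it is left uncalled here so that importing or pasting the helpers has no side effects.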

## 2️⃣ Loading a Local JSON Dataset

We can load the training set using the `load_dataset` function; the `field` argument points to the actual data inside the nested JSON structure.


In [None]:
from datasets import load_dataset

# Load the local JSON file, specifying the "data" field that contains the examples
squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

# Inspect the resulting DatasetDict (local files go into a "train" split by default)
print(squad_it_dataset)
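
To make the role of `field` concrete, the sketch below builds a tiny file with a SQuAD-style top-level layout (a `data` list next to a `version` string) and pulls out the records with the standard `json` module. The file name `toy_squad.json` and its contents are made up for illustration, and the top-level layout is an assumption based on the SQuAD format. The selection mirrors what `field="data"` asks `load_dataset` to do; it is not the library's implementation.

```python
import json
import tempfile
from pathlib import Path

# A made-up miniature of the SQuAD-style layout: the examples sit in a
# list under the top-level "data" key, next to file-level metadata.
toy = {
    "version": "1.1",
    "data": [
        {"title": "Milano", "paragraphs": []},
        {"title": "Roma", "paragraphs": []},
    ],
}

path = Path(tempfile.mkdtemp()) / "toy_squad.json"
path.write_text(json.dumps(toy), encoding="utf-8")

with open(path, encoding="utf-8") as f:
    raw = json.load(f)

# Without selecting a field, the file is one object, not a list of rows.
print(type(raw).__name__, list(raw))

# Selecting raw["data"] yields one record per article, which is the role
# that field="data" plays in the load_dataset call above.
records = raw["data"]
print(len(records), [r["title"] for r in records])
```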

## 3️⃣ Inspecting Data Structure

View the first example to understand the format and fields.


In [None]:
# Print the first example in the train split
print(squad_it_dataset["train"][0])
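
The printed example is deeply nested, so a small helper can make it easier to read. The sketch below assumes the SQuAD v1.1 layout (a `title` plus a list of `paragraphs`, each holding a `context` and a list of `qas` with `question` and `answers`), which SQuAD-it is derived from. The function name `iter_questions` and the sample record are hypothetical.

```python
from typing import Dict, Iterator


def iter_questions(article: dict) -> Iterator[Dict[str, object]]:
    """Yield one flat row per question found in a SQuAD-style article."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            yield {
                "title": article["title"],
                "context": paragraph["context"],
                "question": qa["question"],
                "answers": [a["text"] for a in qa["answers"]],
            }


# Made-up record in the assumed layout (not taken from the dataset).
sample = {
    "title": "Roma",
    "paragraphs": [
        {
            "context": "Roma è la capitale d'Italia.",
            "qas": [
                {
                    "id": "q1",
                    "question": "Qual è la capitale d'Italia?",
                    "answers": [{"text": "Roma", "answer_start": 0}],
                }
            ],
        }
    ],
}

for row in iter_questions(sample):
    print(row["question"], "->", row["answers"])
```

If the layout assumption holds, passing `squad_it_dataset["train"][0]` instead of `sample` should list the questions of the first article in the same way.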

## 4️⃣ Loading Both Train & Test Splits

By passing a dictionary to `data_files`, you can load multiple splits at once.


In [None]:
# Load both the train and test splits into a single DatasetDict
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# Inspect the dictionary: both splits are present
print(squad_it_dataset)
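
When there are more than a couple of splits, the dictionary can be built programmatically. The following is a minimal sketch, assuming the files follow a `<prefix><split>.json` naming pattern like the SQuAD-it files do; `build_data_files` is a hypothetical helper, not part of the Datasets library, and the demo uses empty placeholder files.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def build_data_files(directory: Path, prefix: str, splits: Iterable[str],
                     suffix: str = ".json") -> Dict[str, str]:
    """Map each split name to `<directory>/<prefix><split><suffix>`.

    Raises FileNotFoundError early, so a typo in a file name surfaces
    before any loading starts.
    """
    data_files = {}
    for split in splits:
        path = Path(directory) / f"{prefix}{split}{suffix}"
        if not path.is_file():
            raise FileNotFoundError(f"Missing file for split '{split}': {path}")
        data_files[split] = str(path)
    return data_files


# Demo with empty placeholder files in a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
for split in ("train", "test"):
    (demo_dir / f"SQuAD_it-{split}.json").touch()

data_files = build_data_files(demo_dir, "SQuAD_it-", ["train", "test"])
print(data_files)
```

The returned dictionary has the same shape as the hand-written one, so it can be passed to `load_dataset` through `data_files` in the same way.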

## 5️⃣ (Advanced Tip) Loading from Compressed Files Directly

Hugging Face Datasets decompresses gzip files automatically, so you don't have to unzip them first.


In [None]:
# Load directly from compressed .json.gz files by passing their paths
data_files = {
    "train": "SQuAD_it-train.json.gz",
    "test": "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# The resulting DatasetDict has the same splits as in the previous step
print(squad_it_dataset)
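
To illustrate what decompressing on the fly means, the sketch below reads a JSON file either directly or through `gzip`, depending on the two "magic" bytes that begin every gzip stream. This is only an illustration with the standard library, not a description of how Datasets implements it internally; `read_json_maybe_gzipped` and the toy payload are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_json_maybe_gzipped(path: Path) -> dict:
    """Parse a JSON file, decompressing it first if it is gzip-compressed."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Write the same made-up payload once as plain JSON and once gzip-compressed.
payload = {"data": [{"title": "Roma"}], "version": "1.1"}
tmp = Path(tempfile.mkdtemp())
plain_path = tmp / "toy.json"
gz_path = tmp / "toy.json.gz"
plain_path.write_text(json.dumps(payload), encoding="utf-8")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

# Both routes give the same parsed content.
print(read_json_maybe_gzipped(plain_path) == read_json_maybe_gzipped(gz_path))
```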

## 6️⃣ Loading Remote Datasets

Point `data_files` to the raw file URLs, and Datasets will download and parse them.

This is handy when your data is hosted on GitHub or elsewhere online.


In [None]:
# Example: loading SQuAD-it splits directly from GitHub URLs
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
print(squad_it_dataset)
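
A common pitfall with GitHub is copying the address of a file's web page, which contains `/blob/` and serves HTML, instead of the `/raw/` address that serves the file itself. The helper below is a small sketch for converting one into the other; `github_raw_url` is a hypothetical name, and the function only handles plain `github.com` links of the shape shown in the comment.

```python
from urllib.parse import urlsplit, urlunsplit


def github_raw_url(url: str) -> str:
    """Turn a github.com `/blob/` page URL into its `/raw/` download URL.

    URLs that are not github.com blob links are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path shape: /<owner>/<repo>/blob/<branch>/<file path...>
    if parts.netloc == "github.com" and len(segments) > 4 and segments[3] == "blob":
        segments[3] = "raw"
        return urlunsplit(parts._replace(path="/".join(segments)))
    return url


page = "https://github.com/crux82/squad-it/blob/master/SQuAD_it-train.json.gz"
print(github_raw_url(page))
```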

## ✅ Summary

- Load local and remote files in many formats (csv, json, text, etc.)
- Open compressed files without manual extraction
- Organize multiple splits with a dictionary argument
- Inspect and manipulate resulting datasets with standard Hugging Face tools
