### What if my dataset isn't on the Hub?
You know how to use the [Hugging Face Hub](https://huggingface.co/datasets) to download datasets, but you’ll often find yourself working with data that is stored either on your laptop or on a remote server. In this section we’ll show you how 🤗 Datasets can be used to load datasets that aren’t available on the Hugging Face Hub.

### Working with local and remote datasets
🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:

| Data format | Loading script | Example |
|---|---|---|
| CSV & TSV | `csv` | `load_dataset("csv", data_files="my_file.csv")` |
| Text files | `text` | `load_dataset("text", data_files="my_file.txt")` |
| JSON & JSON Lines | `json` | `load_dataset("json", data_files="my_file.jsonl")` |
| Pickled DataFrames | `pandas` | `load_dataset("pandas", data_files="my_dataframe.pkl")` |

In [1]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

--2022-01-22 06:44:14--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264426 (258K) [application/x-httpd-php]
Saving to: 'winequality-white.csv'

2022-01-22 06:44:16 (311 KB/s) - 'winequality-white.csv' saved [264426/264426]



In [2]:
from datasets import load_dataset

# Load a local .csv file; this file is semicolon-delimited, hence sep=";"
local_csv_dataset = load_dataset("csv", data_files="winequality-white.csv", sep=";")
local_csv_dataset["train"]

Using custom data configuration default-1a0438eb1aef2e17


Downloading and preparing dataset csv/default to C:\Users\batuh\.cache\huggingface\datasets\csv\default-1a0438eb1aef2e17\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


100%|██████████| 1/1 [00:00<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 334.37it/s]


Dataset csv downloaded and prepared to C:\Users\batuh\.cache\huggingface\datasets\csv\default-1a0438eb1aef2e17\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 83.56it/s]


Dataset({
    features: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'],
    num_rows: 4898
})

In [3]:
# Load the dataset from the URL directly
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
remote_csv_dataset = load_dataset("csv", data_files=dataset_url, sep=";")
remote_csv_dataset

Using custom data configuration default-56a093e4587a7c78


Downloading and preparing dataset csv/default to C:\Users\batuh\.cache\huggingface\datasets\csv\default-56a093e4587a7c78\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e...


Downloading: 100%|██████████| 264k/264k [00:00<00:00, 314kB/s]
100%|██████████| 1/1 [00:02<00:00,  2.58s/it]
100%|██████████| 1/1 [00:00<00:00, 1003.66it/s]


Dataset csv downloaded and prepared to C:\Users\batuh\.cache\huggingface\datasets\csv\default-56a093e4587a7c78\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 91.14it/s]


DatasetDict({
    train: Dataset({
        features: ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'],
        num_rows: 4898
    })
})

In [4]:
# Load a .txt file
dataset_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text_dataset = load_dataset("text", data_files=dataset_url)
text_dataset["train"][:5]

Using custom data configuration default-56894355c5ba7b42


Downloading and preparing dataset text/default to C:\Users\batuh\.cache\huggingface\datasets\text\default-56894355c5ba7b42\0.0.0\d86c40dad297bdddf277b406c6a59f0250b5318c400bf23d420a31aff88c84c4...


Downloading: 1.12MB [00:00, 3.45MB/s]
100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
100%|██████████| 1/1 [00:00<00:00, 501.11it/s]


Dataset text downloaded and prepared to C:\Users\batuh\.cache\huggingface\datasets\text\default-56894355c5ba7b42\0.0.0\d86c40dad297bdddf277b406c6a59f0250b5318c400bf23d420a31aff88c84c4. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 100.27it/s]


{'text': ['First Citizen:',
  'Before we proceed any further, hear me speak.',
  '',
  'All:',
  'Speak, speak.']}

In [5]:
# Load a JSON Lines file (one JSON object per line)
dataset_url = "https://raw.githubusercontent.com/hirupert/sede/main/data/sede/train.jsonl"
json_lines_dataset = load_dataset("json", data_files=dataset_url)
json_lines_dataset["train"][:2]

Using custom data configuration default-f5afb45b7a3adfed


Downloading and preparing dataset json/default to C:\Users\batuh\.cache\huggingface\datasets\json\default-f5afb45b7a3adfed\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...


Downloading: 5.39MB [00:00, 6.37MB/s]
100%|██████████| 1/1 [00:01<00:00,  1.33s/it]
100%|██████████| 1/1 [00:00<00:00, 501.53it/s]


Dataset json downloaded and prepared to C:\Users\batuh\.cache\huggingface\datasets\json\default-f5afb45b7a3adfed\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 91.14it/s]


{'QuerySetId': [466, 784],
 'Title': ['Most controversial posts on the site',
  'Comments asking for questions to be made wiki'],
 'Description': ['Looks for posts with more than half the amount of downvotes as they have upvotes\nOrdered by upvotes\n',
  'All comments that contain the text should and wiki'],
 'QueryBody': ['SELECT \n* from Votes',
  "SELECT  PostId as [Post Link], Text from Comments\nwhere Text like '%should%wiki%'"],
 'CreationDate': [datetime.datetime(2020, 6, 24, 11, 23, 10),
  datetime.datetime(2019, 7, 7, 11, 1, 51)],
 'validated': [False, False]}

In [6]:
# Load a nested JSON file; field="data" names the key that holds the records
dataset_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
json_dataset = load_dataset("json", data_files=dataset_url, field="data")
json_dataset

Using custom data configuration default-3a242dc4b2dda81d


Downloading and preparing dataset json/default to C:\Users\batuh\.cache\huggingface\datasets\json\default-3a242dc4b2dda81d\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...


Downloading: 42.1MB [00:16, 2.56MB/s]
100%|██████████| 1/1 [00:17<00:00, 17.92s/it]
100%|██████████| 1/1 [00:00<00:00, 449.31it/s]


Dataset json downloaded and prepared to C:\Users\batuh\.cache\huggingface\datasets\json\default-3a242dc4b2dda81d\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 83.57it/s]


DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
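
The `field="data"` argument is needed because the SQuAD file is a single JSON object whose records sit under a top-level `data` key, next to a `version` key. The sketch below uses only the standard library and a made-up miniature file (the titles and file name are hypothetical) to show that layout. Each element of `data` becomes one row, which is why the dataset above has `title` and `paragraphs` as its features.

```python
import json
import os
import tempfile

# Hypothetical miniature file with the same top-level shape as the SQuAD download
toy = {
    "version": "v2.0",
    "data": [
        {"title": "Article A", "paragraphs": []},
        {"title": "Article B", "paragraphs": []},
    ],
}
json_path = os.path.join(tempfile.mkdtemp(), "toy_squad.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(toy, f)

# Read the file back and inspect its structure
with open(json_path, encoding="utf-8") as f:
    loaded = json.load(f)

top_level_keys = sorted(loaded.keys())
records = loaded["data"]  # the list that field="data" points the loader at

print(top_level_keys)
print(len(records), sorted(records[0].keys()))
```

A JSON Lines file, by contrast, has one record per line and no wrapping object, so it loads without a `field` argument.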

In [7]:
# Map split names to files with the data_files argument to get several splits at once
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
data_files = {"train": f"{url}train-v2.0.json", "validation": f"{url}dev-v2.0.json"}
json_dataset = load_dataset("json", data_files=data_files, field="data")
json_dataset

Using custom data configuration default-3bbb4f3f65b9a5b6


Downloading and preparing dataset json/default to C:\Users\batuh\.cache\huggingface\datasets\json\default-3bbb4f3f65b9a5b6\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...


Downloading: 4.37MB [00:00, 8.18MB/s]
100%|██████████| 2/2 [00:01<00:00,  1.68it/s]
100%|██████████| 2/2 [00:00<00:00, 668.63it/s]


Dataset json downloaded and prepared to C:\Users\batuh\.cache\huggingface\datasets\json\default-3bbb4f3f65b9a5b6\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde. Subsequent calls will reuse this data.


100%|██████████| 2/2 [00:00<00:00, 83.51it/s]


DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    validation: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 35
    })
})