# Handling Local Data
To load datasets that are stored on your laptop or on a remote server, we can still use the `load_dataset()` function. This time, we specify the type of loading script, along with a `data_files` argument that gives the path to one or more files.

### Loading a local dataset

| Data format | Loading script | Example |
|-------------|----------------|---------|
| CSV & TSV |`csv`|`load_dataset("csv", data_files="my_file.csv")`|
| Text files |`text`|`load_dataset("text", data_files="my_file.txt")`|
| JSON & JSON Lines |`json`|`load_dataset("json", data_files="my_file.json")`|
| Pickled DataFrames |`pandas`|`load_dataset("pandas", data_files="my_dataframe.pkl")`|
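
As a quick illustration of the table above, here is a minimal sketch that picks a loading script name from a file extension. The helper `guess_loading_script` is hypothetical (it is not part of 🤗 Datasets), and the `.tsv` and `.jsonl` extensions are assumed spellings for TSV and JSON Lines files.

```python
from pathlib import Path

# Extension -> loading script name, following the table above.
# ".tsv" and ".jsonl" are assumed extensions for TSV and JSON Lines files.
SCRIPT_BY_SUFFIX = {
    ".csv": "csv",
    ".tsv": "csv",
    ".txt": "text",
    ".json": "json",
    ".jsonl": "json",
    ".pkl": "pandas",
}


def guess_loading_script(path: str) -> str:
    """Return the loading script name for a file path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in SCRIPT_BY_SUFFIX:
        raise ValueError(f"No loading script known for extension {suffix!r}")
    return SCRIPT_BY_SUFFIX[suffix]


print(guess_loading_script("my_file.csv"))       # csv
print(guess_loading_script("my_dataframe.pkl"))  # pandas
```

The returned string is what you would pass as the first argument to `load_dataset()`, with the file path going to `data_files`.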

For this example, let's use the [SQuAD-it](https://github.com/crux82/squad-it/) dataset, a large-scale **JSON** dataset for question answering in Italian. It's hosted on GitHub, so let's first download the compressed files `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz` into our `data/chapter_5` directory using `wget`, and then decompress them using `gzip`:

In [None]:
!cd data/chapter_5 && wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!cd data/chapter_5 && wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

!cd data/chapter_5 && gzip -dkv SQuAD_it-*.json.gz

Now that we have our data in the `JSON` format, we can load it with the `load_dataset()` function. We just need to know whether we're dealing with **ordinary JSON** (*similar to a nested dictionary*) or **JSON Lines** (*line-separated JSON*). Like many question answering datasets, **SQuAD-it** uses the *nested format*, with all the text stored in a `data` field. This means we can load the dataset by specifying the `field="data"` argument:

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="data/chapter_5/SQuAD_it-train.json", field="data")

squad_it_dataset
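
To make the difference between the two JSON layouts concrete, here is a small sketch using only Python's standard `json` module. The toy records and the `"version"` key are made up for illustration; only the idea of a top-level `"data"` field comes from the text above.

```python
import json

# Ordinary (nested) JSON: a single document, with the records under a "data" key.
nested_text = json.dumps(
    {"version": "toy", "data": [{"title": "A"}, {"title": "B"}]}
)

# JSON Lines: one independent JSON object per line, no enclosing structure.
jsonl_text = '{"title": "A"}\n{"title": "B"}\n'

# Nested JSON must be parsed as a whole, then the records pulled out of "data".
nested_records = json.loads(nested_text)["data"]

# JSON Lines is parsed line by line.
jsonl_records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(nested_records == jsonl_records)  # True
```

Both layouts carry the same records here; the `field` argument tells the loader which key of the nested document holds them.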

As we can see, by default, loading local files creates a `DatasetDict` object with only a **train** split. What we really want is to include both the **train** and **test** splits in a single `DatasetDict` object so we can apply `Dataset.map()` functions across both splits at once. To do this, we can pass the `data_files` argument a dictionary of the form `{"train": "path to the training data", "test": "path to the testing data"}`, which maps each split name to a file associated with that split:

In [None]:
data_files = {
    "train":"data/chapter_5/SQuAD_it-train.json",
    "test":"data/chapter_5/SQuAD_it-test.json"
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

The loading scripts in Datasets actually support automatic decompression of the input files, so we could have skipped the `gzip` step by pointing the `data_files` argument directly at the compressed files:
```python
data_files = {
    "train": "data/chapter_5/SQuAD_it-train.json.gz", 
    "test": "data/chapter_5/SQuAD_it-test.json.gz"
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```
This can be useful if you don’t want to manually decompress many `GZIP` files. The automatic decompression also applies to other common formats like `ZIP` and `TAR`, so you just need to point `data_files` to the compressed files.
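
For intuition about what reading a compressed file directly looks like, here is a standard-library sketch that writes a tiny gzipped JSON file and reads it back without ever creating a decompressed copy on disk. This is only an illustration with made-up data; it is not a description of how Datasets implements decompression internally.

```python
import gzip
import json
import tempfile
from pathlib import Path

payload = {"data": [{"title": "A"}]}  # toy content, made up for this example

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "toy.json.gz"

    # Write the JSON document straight into a gzip-compressed file.
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back: gzip.open decompresses on the fly while json parses.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == payload)  # True
```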

> The `data_files` argument is also quite flexible and can be either *a single file path*, *a list of file paths*, or *a dictionary* that maps split names to file paths. You can also *glob files* that match a *specified pattern* according to the rules used by the `Unix shell` (e.g., you can glob all the `JSON` files in a directory as a single split by setting `data_files="*.json"`). See the [Datasets documentation](https://huggingface.co/docs/datasets/loading#local-and-remote-files) for more details.
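
To see what a pattern such as `*.json` selects, the sketch below uses Python's standard `glob` module on a temporary directory. The file names are invented, and `glob` is used here only as an approximation of shell-style matching; the exact resolution rules in Datasets may differ in details.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty placeholder files (hypothetical names).
    for name in ("a.json", "b.json", "notes.txt"):
        (Path(tmp) / name).write_text("{}", encoding="utf-8")

    # "*.json" matches every file in the directory whose name ends in ".json".
    pattern = str(Path(tmp) / "*.json")
    matched = sorted(Path(p).name for p in glob.glob(pattern))

print(matched)  # ['a.json', 'b.json']
```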

### Loading a remote dataset

Fortunately, loading *remote files* is just as simple as loading *local* ones! Instead of providing a path to *local files*, we point the `data_files` argument to **one or more URLs** where the *remote files* are stored.

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
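
As an optional variation, the split-to-URL dictionary can also be built with `urllib.parse.urljoin` instead of string concatenation. This is a minimal sketch that assumes the same base URL and file naming as above; the variable name `base_url` is introduced here for illustration.

```python
from urllib.parse import urljoin

# The trailing slash matters: without it, urljoin would replace the last
# path segment ("master") instead of appending to it.
base_url = "https://github.com/crux82/squad-it/raw/master/"

data_files = {
    split: urljoin(base_url, f"SQuAD_it-{split}.json.gz")
    for split in ("train", "test")
}

for split, file_url in data_files.items():
    print(split, file_url)
```

The resulting dictionary has the same shape as the one above and can be passed to `load_dataset()` in the same way.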