# **Dataset Library**

In this chapter, you'll delve deeper into the capabilities of the 🤗 Datasets library. Here are some of the key questions you'll explore:

1. What should you do when your dataset isn't available on the Hugging Face Hub?
2. How can you slice, dice, and otherwise manipulate a dataset, including with Pandas?
3. How do you handle datasets that are too large to fit in your system's RAM?
4. What are memory mapping and Apache Arrow?
5. How can you create your own dataset and contribute it to the Hugging Face Hub?

Let's embark on this journey to enhance your understanding of 🤗 Datasets!

## What should I do if my dataset isn't available on the Hugging Face Hub?

You've learned how to utilize the Hugging Face Hub to fetch datasets, but there will be instances where you need to work with data stored locally on your laptop or on a remote server. In this section, we'll explore how 🤗 Datasets can be employed to load datasets that aren't accessible on the Hugging Face Hub.

### Working with local and remote datasets

🤗 Datasets simplifies the loading of local and remote datasets by providing loading scripts for various common data formats. Here are examples of loading scripts for different data formats:

- CSV & TSV: `load_dataset("csv", data_files="my_file.csv")`
- Text files: `load_dataset("text", data_files="my_file.txt")`
- JSON & JSON Lines: `load_dataset("json", data_files="my_file.jsonl")`
- Pickled DataFrames: `load_dataset("pandas", data_files="my_dataframe.pkl")`

As the list shows, for each data format you only need to specify the type of loading script in `load_dataset()`, along with a `data_files` argument that gives the path to one or more files. Let's begin by loading a dataset from local files; afterward, we'll see how to do the same with remote files.

### Loading a local dataset

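For this example, we'll download the training and test splits of the SQuAD-it dataset from GitHub:

```python
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
```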
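Note that `wget` is not available on every system (a default Windows shell, for example, does not ship it). As a possible alternative, the sketch below fetches the same two URLs using only the Python standard library. The `download` helper and the `URLS` list are names introduced here for illustration, not part of 🤗 Datasets.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# The same two archives that the wget commands fetch.
URLS = [
    "https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz",
    "https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz",
]


def download(url: str, dest_dir: str = ".") -> Path:
    """Download `url` into `dest_dir`, keeping the file name from the URL."""
    filename = Path(urlparse(url).path).name
    target = Path(dest_dir) / filename
    # Stream the response to disk so the whole file is never held in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download(url)` for each entry in `URLS` saves both archives to the current working directory.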

This will download two compressed files called `SQuAD_it-train.json.gz` and `SQuAD_it-test.json.gz`, which we can decompress with the Linux `gzip` command: