# What is Hugging Face Datasets?

Datasets is one of the three main libraries (Datasets, Transformers, Tokenizers) of the 🤗 Hugging Face ecosystem.
Using Datasets we can **access** 101,839 datasets (as of February 2024) and also **publish** and share our own.

Using a dataset is the first step of researching and developing our cool AI projects.

These datasets are categorized into 6 broad categories called **"Tasks"**:

![image 1](https://cdn-images-1.medium.com/max/800/1*5e9YFMLDu9OSaTMGycbmIg.png)

Each Task is divided into numerous **"Sub-tasks"**, totaling over 100. There is a Sub-task for nearly anything.

Some examples include *Object Detection, Text Classification, Summarization, Zero-Shot Classification, Question Answering, Text-to-Audio, Time Series Forecasting, and Image-to-3D*.

The main goal of 🤗 Datasets is to provide a simple way to load a dataset of any format or type.

The library delivers on this goal: **reading a dataset takes a single line of code**.

In this tutorial, we will explore the four basic properties of the 🤗 Datasets:

![image 2](https://cdn-images-1.medium.com/max/800/1*KoPiR26XAAYIoj4FHziwyA.png)

# Download a Dataset

Before downloading any dataset we have to **install the library**:

In [None]:
# install Hugging Face's transformers and datasets modules
# !pip install transformers datasets

In [99]:
import tqdm as notebook_tqdm

from datasets import (
    concatenate_datasets,
    Dataset,
    get_dataset_config_names,
    list_datasets,
    load_dataset,
    load_from_disk,
    Value,
)

import pandas as pd

In [13]:
all_datasets = list_datasets()
print(f"There are {len(all_datasets)} datasets currently available on the Hugging Face Datasets.")

There are 101847 datasets currently available on the Hugging Face Datasets.


Let's first decide which dataset to download. 

The choice is easy for me. Back in 2016, I did my Thesis called "[Emotion Detection on Movie Reviews](https://deffro.github.io/projects/emotion-detection-on-movie-reviews/)". I made a brave attempt to construct a classifier capable of classifying a sentence in one of the 6 basic categories of emotion which are anger, disgust, fear, happiness, sadness, and surprise.

8 years later, I am ready to revisit this problem since I saw that there is a dataset in 🤗 Datasets called [emotion](https://huggingface.co/datasets/dair-ai/emotion).

In this tutorial, I will only load the dataset and leave the model training for a future one. Our focus for now is to get familiar with Datasets.

Let's **download** the dataset:

In [15]:
emotions = load_dataset("emotion", trust_remote_code=True)

As downloading might take some time depending on the dataset, you might first want to **inspect it without downloading**.

In [16]:
from datasets import load_dataset_builder

ds_builder = load_dataset_builder("emotion")

In [17]:
ds_builder.info.description

'Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.\n'

In [18]:
ds_builder.info.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [20]:
ds_builder.info.splits

{'train': SplitInfo(name='train', num_bytes=1741533, num_examples=16000, shard_lengths=None, dataset_name='emotion'),
 'validation': SplitInfo(name='validation', num_bytes=214695, num_examples=2000, shard_lengths=None, dataset_name='emotion'),
 'test': SplitInfo(name='test', num_bytes=217173, num_examples=2000, shard_lengths=None, dataset_name='emotion')}

There is also the possibility to **load a single split**:

In [31]:
load_dataset("emotion", split="train", trust_remote_code=True)

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

## Meet your Dataset

Let's look at the emotion dataset:

In [21]:
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

It looks like a Python dictionary with dataset splits as keys. 

The [Dataset object](https://huggingface.co/docs/datasets/v2.16.1/en/package_reference/main_classes#datasets.Dataset) is one of the core data structures in 🤗 Datasets. 

It is based on [Apache Arrow](https://arrow.apache.org/), which is more memory efficient than native Python objects. It represents data in a columnar format, which is highly efficient for analytical processing.

Let's see a sample.

In [23]:
emotions["train"][10]

{'text': 'i feel like i have to make the suffering i m seeing mean something',
 'label': 0}

We can access dataset information in the same way we did using `load_dataset_builder`:

In [27]:
emotions["train"].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [29]:
emotions["train"].num_rows

16000

Some datasets contain several sub-datasets. For example, the MInDS-14 dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations, and you must explicitly select one when loading the dataset. If you don’t provide a configuration name, 🤗 Datasets will raise a ValueError and remind you to choose a configuration.

Use the `get_dataset_config_names()` function to retrieve a list of **all the possible configurations** available to your dataset:

In [35]:
get_dataset_config_names("PolyAI/minds14", trust_remote_code=True)

['cs-CZ',
 'de-DE',
 'en-AU',
 'en-GB',
 'en-US',
 'es-ES',
 'fr-FR',
 'it-IT',
 'ko-KR',
 'nl-NL',
 'pl-PL',
 'pt-PT',
 'ru-RU',
 'zh-CN',
 'all']

In [42]:
emotions["train"]["text"][:5]

['i didnt feel humiliated',
 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'im grabbing a minute to post i feel greedy wrong',
 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
 'i am feeling grouchy']

## Loading your own dataset

In [53]:
load_dataset("csv", data_files={"train": "./my_data.csv"})

Generating train split: 1460 examples [00:00, 31596.33 examples/s]


DatasetDict({
    train: Dataset({
        features: ['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPor

You can also load other file types:

In [None]:
load_dataset("json", data_files={'train': 'train.json', 'test': 'test.json'})
load_dataset("parquet", data_files={'train': 'train.parquet', 'test': 'test.parquet'})
load_dataset("arrow", data_files={'train': 'train.arrow', 'test': 'test.arrow'})
Dataset.from_sql("data_table_name", con="sqlite:///sqlite_file.db")

You can also create a Dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames.

In [57]:
my_dict = {"score": [10, 7, 8.5]}
dataset_d = Dataset.from_dict(my_dict)

my_list = [{"score": 10}, {"score": 7}, {"score": 8.5}]
dataset_l = Dataset.from_list(my_list)

df = pd.DataFrame({"score": [10, 7, 8.5]})
dataset_p = Dataset.from_pandas(df)

In [58]:
dataset_p

Dataset({
    features: ['score'],
    num_rows: 3
})

# Modifying a Dataset
![image 3](https://cdn-images-1.medium.com/max/800/1*KDzpnwIUplELboUW9aUv4w.png)

Dataset provides functionalities for sorting, shuffling, selecting, filtering, splitting, and sharding data.

In [70]:
# Sorting based on the "text" column
sorted_dataset = emotions["train"].sort("text")  

# Provide a seed for reproducibility
shuffled_dataset = emotions["train"].shuffle(seed=42)  

# Create a new dataset with rows selected following the list/array of indices.
selected_dataset = emotions["train"].select(range(5))

# Create train and test splits if your dataset doesn't already have them.
# train_test_split() returns a DatasetDict with "train" and "test" keys.
splits = emotions["train"].train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]

# Filter the dataset based on a condition (here: keep only non-empty texts)
filtered_dataset = emotions["train"].filter(lambda example: len(example["text"]) > 0)

# Sharding is useful for distributing the dataset across multiple processes or nodes
sharded_dataset = emotions["train"].shard(num_shards=5, index=0)  # Assuming you have 5 shards and selecting the first one

You can also **rename** and **remove** columns, **cast** data types, **flatten** nested structures, and get **unique** values.

In [93]:
# Rename a column
renamed_dataset = emotions["train"].rename_column("text", "tweet")

# Remove a column
dataset_without_column = emotions["train"].remove_columns(["text"])

# Cast a column to a different data type
dataset_casted = emotions["train"].cast_column("label", Value("int8"))

# Flatten the dataset
flattened_dataset = emotions["train"].flatten()

# Get unique values from a column
unique_values = emotions["train"].unique("label")

## Map

In [94]:
# Define a function to apply to each example
def preprocess_example(example):
    example["text"] = example["text"].lower()
    return example

# Apply the function to each example
preprocessed_dataset = emotions["train"].map(preprocess_example)

Map: 100%|█████████████████████████████████████████████████████████████| 16000/16000 [00:01<00:00, 11728.01 examples/s]


## Concatenate

The following code snippet is for demonstration purposes only.

**Running it will download over 20 GB of data.**

In [97]:
bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])  # only keep the 'text' column

assert bookcorpus.features.type == wiki.features.type
bert_dataset = concatenate_datasets([bookcorpus, wiki])


## Change Format

The `set_format()` function changes the format in which data is returned when you access the dataset (for example pandas, NumPy, PyTorch, or TensorFlow), without changing the underlying Arrow data.

In [100]:
emotions.set_format(type="pandas")

# Restore original format
emotions.reset_format()

# Saving and exporting data

You can save and load your dataset locally using:

In [103]:
emotions.save_to_disk("./")

reloaded_dataset = load_from_disk("./")

Saving the dataset (1/1 shards): 100%|███████████████████████████████| 16000/16000 [00:00<00:00, 1228042.97 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████| 2000/2000 [00:00<00:00, 165950.03 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████| 2000/2000 [00:00<00:00, 265882.98 examples/s]


Export to various data types:

In [102]:
emotions["train"].to_csv("./dataset.csv")
emotions["train"].to_json("./dataset.json")
emotions["train"].to_parquet("./dataset.parquet")

Creating CSV from Arrow format: 100%|█████████████████████████████████████████████████| 16/16 [00:00<00:00, 145.98ba/s]
Creating json from Arrow format: 100%|████████████████████████████████████████████████| 16/16 [00:00<00:00, 337.91ba/s]
Creating parquet from Arrow format: 100%|█████████████████████████████████████████████| 16/16 [00:00<00:00, 559.53ba/s]

