# Load a dataset from the Hub
Finding high-quality datasets that are reproducible and accessible can be difficult. One of 🤗 Datasets' main goals is to provide a simple way to load a dataset of any format or type. The easiest way to get started is to discover an existing dataset on the Hugging Face Hub - a community-driven collection of datasets for tasks in NLP, computer vision, and audio - and use 🤗 Datasets to download and generate the dataset.

This tutorial uses the rotten_tomatoes and MInDS-14 datasets, but feel free to load any dataset you want and follow along. Head over to the Hub now and find a dataset for your task!

## Load a dataset
Before you take the time to download a dataset, it’s often helpful to quickly get some general information about it. A dataset’s information is stored inside DatasetInfo and can include the dataset description, features, and dataset size.

Use the load_dataset_builder() function to load a dataset builder and inspect a dataset’s attributes without committing to downloading it:

In [1]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")

# inspect dataset description
ds_builder.info.description

''

In [2]:
# inspect dataset features
ds_builder.info.features

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

In [3]:
# The description above came back as an empty string, which was not expected.
# Trying an alternative approach: download and prepare the dataset first.

from datasets import load_dataset_builder
ds_builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
ds_builder.download_and_prepare("hidden/tut1")

ds_builder.info.description

train.parquet:   0%|          | 0.00/699k [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

''

In [4]:
# Let's try another approach: load the full dataset and read the description from a split.

from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")

dataset['train'].info.description

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

''

Note to self: this dataset evidently does not contain, or no longer contains, a description. It was supposed to say the following:

    Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.
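Since the description can come back empty, as it did in the cells above, a small helper can make that case explicit when inspecting a dataset. The following is only a sketch: `summarize_info` is a hypothetical helper, not part of 🤗 Datasets, and it assumes the object it receives exposes `description` and `features` attributes the way `ds_builder.info` does. A stand-in object is used so the example runs without network access.

```python
from types import SimpleNamespace


def summarize_info(info):
    """Summarize a DatasetInfo-like object, tolerating an empty description."""
    # The description may be an empty string, as observed for rotten_tomatoes.
    description = (getattr(info, "description", "") or "").strip()
    features = getattr(info, "features", None)
    return {
        "description": description or "(no description provided)",
        # Iterating a features mapping yields its column names.
        "columns": list(features) if features else [],
    }


# Stand-in for ds_builder.info (an assumption made for illustration only).
fake_info = SimpleNamespace(
    description="",
    features={"text": "string", "label": "class label"},
)
summary = summarize_info(fake_info)
print(summary)
```

With the real library, the same helper could presumably be called as `summarize_info(ds_builder.info)`.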

And now back to the lesson. If you’re happy with the dataset, load it with load_dataset():


In [5]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

## Splits
A split is a specific subset of a dataset like train and test. List a dataset’s split names with the get_dataset_split_names() function:

In [6]:
from datasets import get_dataset_split_names

get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes")

['train', 'validation', 'test']

Then you can load a specific split with the split parameter. Loading a dataset split returns a Dataset object:

In [7]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
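When a script cannot assume which splits a dataset provides, the names returned by get_dataset_split_names() can be checked before loading. The sketch below uses a hypothetical helper, `pick_split`, which is not part of 🤗 Datasets. It works on a plain list of names, here copied from the output shown earlier, so it runs without network access.

```python
def pick_split(available, preferred=("train", "validation", "test")):
    """Return the first preferred split name that the dataset provides."""
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"None of {list(preferred)} found in {list(available)}")


# Split names as returned by get_dataset_split_names() for rotten_tomatoes.
available_splits = ["train", "validation", "test"]

# "dev" is not available, so the helper falls back to "validation".
chosen = pick_split(available_splits, preferred=("dev", "validation"))
print(chosen)
```

The chosen name could then be passed as `split=chosen` to load_dataset().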

If you don’t specify a split, 🤗 Datasets returns a DatasetDict object instead:

In [8]:
from datasets import load_dataset

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})
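Because a DatasetDict maps split names to Dataset objects, the per-split sizes can be collected with an ordinary dictionary comprehension. This is a minimal sketch under stated assumptions: `split_sizes` is a hypothetical helper, and plain `range` objects stand in for the real splits so the example runs offline. It relies only on `.items()` and `len()`, which a DatasetDict and its Dataset values are expected to support.

```python
def split_sizes(dataset_dict):
    """Map each split name to its number of rows."""
    # Works for any mapping of split name -> sized collection.
    return {name: len(split) for name, split in dataset_dict.items()}


# Stand-in for the DatasetDict above: ranges with the same row counts.
fake_dataset_dict = {
    "train": range(8530),
    "validation": range(1066),
    "test": range(1066),
}
sizes = split_sizes(fake_dataset_dict)
total_rows = sum(sizes.values())
print(sizes)
print(total_rows)
```

The three splits add up to 10,662 rows, which matches the 5,331 positive plus 5,331 negative sentences mentioned in the dataset's original description.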

## Configurations
Some datasets contain several sub-datasets. For example, the MInDS-14 dataset has several sub-datasets, each one containing audio data in a different language. These sub-datasets are known as configurations or subsets, and you must explicitly select one when loading the dataset. If you don’t provide a configuration name, 🤗 Datasets will raise a ValueError and remind you to choose a configuration.

Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to your dataset:

In [10]:
from datasets import get_dataset_config_names

configs = get_dataset_config_names("PolyAI/minds14")
print(configs)

README.md: 0.00B [00:00, ?B/s]

['all', 'cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN']


Then load the configuration you want:



In [11]:
from datasets import load_dataset

# load the US English (en-US) configuration
minds_en_us = load_dataset("PolyAI/minds14", "en-US", split="train")

en-US/train-00000-of-00001.parquet:   0%|          | 0.00/34.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/563 [00:00<?, ? examples/s]

In [12]:
minds_en_us

Dataset({
    features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
    num_rows: 563
})
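When a dataset has many configurations, it can help to filter the list by language before choosing one to load. The sketch below assumes configuration names follow a `language-REGION` pattern, as the MInDS-14 names listed earlier do. `configs_for_language` is a hypothetical helper and not part of 🤗 Datasets. The names are copied from the get_dataset_config_names() output so the example runs offline.

```python
def configs_for_language(configs, language):
    """Return configuration names whose locale starts with the language code."""
    prefix = language.lower() + "-"
    return [name for name in configs if name.lower().startswith(prefix)]


# Configuration names copied from the get_dataset_config_names() output.
minds14_configs = [
    "all", "cs-CZ", "de-DE", "en-AU", "en-GB", "en-US", "es-ES",
    "fr-FR", "it-IT", "ko-KR", "nl-NL", "pl-PL", "pt-PT", "ru-RU", "zh-CN",
]

english = configs_for_language(minds14_configs, "en")
print(english)
```

Any name in the resulting list could then be passed as the second argument to load_dataset(), in the same way "en-US" is passed above.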