# Hugging Face Datasets API

https://huggingface.co/docs/datasets/index

Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.
    

#### Google Colab

If you are running the code in Google Colab, install the packages by uncommenting and running the cell below.

In [None]:
#%pip install datasets huggingface-hub

## Import packages

In [None]:
import huggingface_hub as hf_hub  
import datasets

# Get datasets information



In [None]:
# Returns an Iterable[DatasetInfo]
# Set the filter using DatasetFilter
# https://huggingface.co/docs/huggingface_hub/v0.19.3/en/package_reference/hf_api#huggingface_hub.DatasetFilter

all_datasets = hf_hub.list_datasets()

for datasetinfo in all_datasets:
    print(datasetinfo.id, datasetinfo.downloads, datasetinfo.likes)

In [None]:
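# Note: DatasetFilter matches the huggingface_hub version linked above (v0.19.3);
# newer releases may have deprecated it.
named_datasets = hf_hub.list_datasets(filter=hf_hub.DatasetFilter(dataset_name='wikineural'))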

for datasetinfo in named_datasets:
    print(datasetinfo.id, datasetinfo.downloads, datasetinfo.likes)

In [None]:
# Easier to see it on the HF portal than here - but go ahead and run it :)

ds_builder = datasets.load_dataset_builder("rotten_tomatoes")

ds_builder.info

# Load dataset

https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset

Requires at a minimum:

1. Path to be set:

* **Path** to the dataset on your local file system
* OR **Name** of the dataset on the Hugging Face Hub (downloaded on first use)

2. Configuration name to be set:

* Optional in general; needed when the dataset has multiple configurations (subsets), such as `glue`

In [None]:
# Get the dataset with the name 'squad'

path = 'squad'

dataset_squad = datasets.load_dataset(path)

In [None]:
# Print information about the dataset
dataset_squad

In [None]:
# Get the 'glue' sub dataset with name='cola'

path = 'glue'
name = 'cola'

dataset_glue_cola = datasets.load_dataset(path, name)

In [None]:
# Print information about the dataset
dataset_glue_cola

# Dataset operations

https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset

* Access the metadata
* Select specific columns to access
* Sort
* Shuffle
* Access the data across splits (Train, Test, Validation)
* Cast the columns to different data types
* Rename columns
* Update the data
* Apply filter
* Create your own dataset and push to hub
* ...   ...   ...

In [None]:
X_train = dataset_squad['train'].select_columns(['question', 'answers'])

print(X_train[123])

## Iterate over data

https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.iter

Datasets are integrated with other Hugging Face classes, which use the various access interfaces exposed by the Dataset class.
You can also iterate over the data yourself, either with a plain `for` loop or in batches with the `iter` function.

In [None]:
X_train = dataset_squad['train'].select_columns(['question', 'answers'])

type(X_train)

In [None]:
# Dataset is iterable
for X in X_train:
    print('[Q]', X['question'], '[A]', X['answers']['text'][0])

In [None]:
# Remove columns instead of Selecting columns

X_train = dataset_squad['train'].remove_columns(['id','context'])

X_train

## Test - Train split

https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.train_test_split

In [None]:
# Split the dataset into train (80%) and test (20%) sets
my_dataset = X_train.train_test_split(test_size=0.2,  shuffle=True)

my_dataset

# Exercises

## Save & Load datasets from disk

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk
    
https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.load_from_disk

In [None]:
# Change the path to folder
local_folder = 'c:/temp/my_dataset1'

# Save the dataset
my_dataset.save_to_disk(local_folder)

# Now check the local folder

# Load my_dataset
my_dataset_from_disk = datasets.load_from_disk(local_folder)

# Inspect the reloaded DatasetDict (it should show the same 'train' and 'test' splits)
my_dataset_from_disk

## Convert the dataset to other formats

to_csv

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.to_csv

to_pandas

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.to_pandas

to_dict

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.to_dict

to_json

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.to_json

to_parquet

https://huggingface.co/docs/datasets/v2.15.0/en/package_reference/main_classes#datasets.Dataset.to_parquet

... ...  ...

In [None]:
# Local path to the output CSV file (to_csv writes a single file, not a folder)
local_folder_csv = 'c:/temp/my_dataset.csv'

# Save the 'test' split in CSV format
my_dataset_from_disk['test'].to_csv(local_folder_csv)

# Now check the local folder

## Streaming


* With `streaming=True`, the dataset is not downloaded up front; examples are fetched and processed on demand as you iterate
* Streamed data is not written to the local dataset cache

In [None]:
# Use streaming
path = 'amazon_polarity'
dataset_amazon_polarity = datasets.load_dataset(path, streaming=True)

In [None]:
# Check dataset cache folder - you won't see the cached dataset
# CHANGE folder on Windows
# !dir C:\Users\raj\.cache\huggingface\datasets
    
# Mac, Linux - uncomment
# ls ~/.cache/huggingface/datasets

## Map