# Hugging Face Datasets
Hugging Face Datasets is a library for loading and preprocessing a wide range of NLP and computer vision datasets in a consistent and easy-to-use way. It provides a high-level API for loading, splitting, and processing various datasets, including popular benchmark datasets for natural language processing tasks such as text classification, question answering, and machine translation, as well as computer vision datasets for image classification, object detection, and semantic segmentation.

The library is built on top of Apache Arrow, which lets it memory-map datasets from disk, and it is framework-agnostic: the same dataset can be handed to PyTorch, TensorFlow, JAX, NumPy, or pandas. Datasets are exposed in a consistent format, usually with predefined splits for training, validation, and testing data.


Start by creating an account on [Hugging Face](https://huggingface.co/) if you don't already have one. Then, install the library and the other packages used in this notebook with pip:

In [None]:
!pip install datasets
!pip install librosa
!pip install matplotlib
!pip install nlpaug

# Accessing and Viewing Datasets

In [None]:
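import datasets

# Note: datasets.list_datasets() is deprecated in recent releases of the library;
# on newer versions, huggingface_hub.list_datasets() is the replacement.
ds_list = datasets.list_datasets()
print(f"There are {len(ds_list)} datasets in the library.")

# Keep only the datasets whose name contains "asr" (case-insensitive)
asr_datasets = [ds for ds in ds_list if 'asr' in ds.lower()]
print(f"There are {len(asr_datasets)} datasets with ASR in the name\n")
print("\n".join(asr_datasets[:10]))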


Let's pick one dataset. We'll use the Samrómur Speech Corpus, which is a collection of crowdsourced prompted speech. It contains about 100 thousand utterances from a variety of speakers. The dataset is available on the Language and Voice Lab page on Hugging Face, and we can load it with the load_dataset function. The dataset is loaded as a DatasetDict, which is a dictionary-like object that holds one dataset per split. The actual files will be stored in the .cache directory in your home directory (under ~/.cache/huggingface/datasets).

Let's look at the features of the dataset. They can also be viewed on the [Hugging Face page](https://huggingface.co/datasets/language-and-voice-lab/samromur_asr).

In [None]:
dataset = datasets.load_dataset('language-and-voice-lab/samromur_asr', num_proc=5)

print(dataset.keys())

print(dataset['train'].description)
for i in dataset["train"].features.items():
    print(i)

We can now iterate over the examples in the dataset and print their features.

In [None]:
n=3
print("\t".join(dataset["train"][0].keys()))
for idx,item in enumerate(dataset["train"]):
    print(f"{item['audio_id']}\t{item['speaker_id']}\t{item['gender']}\t{item['age']}\t{item['duration']}\t{item['normalized_text']}")
    if idx ==n:
        break

print("\nThe audio key has a dict with the path to the file, numpy array of floats corresponding to the audio file and the sampling rate.")
for idx,item in enumerate(dataset["train"]):
    print(f"{item['audio']}")
    if idx ==n:
        break


print("\nWe can verify that array is indeed the audio file by comparing the audio duration to the array length divided by the sampling rate.")
for idx,item in enumerate(dataset["train"]):
    print(f"{item['duration']}\t{len(item['audio']['array'])/item['audio']['sampling_rate']}")
    if idx ==n:
        break

We can now run standard preprocessing steps on the dataset. For example, we can lowercase the text or remove punctuation. We can also manipulate the audio, for example by resampling it to a different sample rate, extracting features, or applying augmentation. Let's run through a few examples.
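
As a minimal sketch of the text side of this, the function below lowercases a transcript, removes ASCII punctuation, and collapses whitespace. The function name `normalize_text` and the toy example are illustrative assumptions, not part of the library; only the column name `normalized_text` comes from the dataset above.

```python
import string

# Translation table that deletes ASCII punctuation; non-ASCII letters such as
# the Icelandic "ó" or "þ" are left untouched.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(example: dict, text_column: str = "normalized_text") -> dict:
    """Lowercase the transcript, drop punctuation, and collapse whitespace.

    Only the modified column is returned, which is the shape expected from a
    function that updates examples one at a time.
    """
    text = example[text_column].lower().translate(_PUNCT_TABLE)
    return {text_column: " ".join(text.split())}


# Toy example standing in for one row of the dataset (hypothetical values)
toy_example = {"audio_id": "000001", "normalized_text": "Halló,  Heimur!"}
normalized = normalize_text(toy_example)
print(normalized["normalized_text"])
```

With the dataset loaded earlier, a function of this shape could be applied to every example with `dataset["train"].map(normalize_text)`. For the audio side, resampling can be requested with `dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))`, which resamples each file when it is decoded.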

In [None]:
import librosa as lb
import librosa.display 
import matplotlib.pyplot as plt
import IPython.display as ipd
import nlpaug.augmenter.audio as naa
import numpy as np

In [None]:
n = 5
# Index the example once so the audio file is only decoded a single time
sample = dataset["train"][n]
sentence = sample["normalized_text"]
audio_array = sample["audio"]["array"]
sample_rate = sample["audio"]["sampling_rate"]
audio_path = sample["audio"]["path"]
print("Sentence:", sentence)
print(audio_array.shape)


fig, ax = plt.subplots(nrows=3)

# Let's first plot the audio array (waveshow uses a time axis by default)
lb.display.waveshow(audio_array, sr=sample_rate, ax=ax[0])
ax[0].set(title=sentence)
ax[0].label_outer()

# Then calculate and plot the mel spectrogram for the same audio file.
# power=0.2 is a small exponent that compresses the dynamic range;
# power=2.0 gives the conventional power spectrogram.
mel_spectrogram = lb.feature.melspectrogram(y=audio_array, sr=sample_rate, power=0.2)
print(mel_spectrogram.shape)
# Pass sr so the time axis is correct, and use a mel-scaled frequency axis
img = lb.display.specshow(mel_spectrogram, sr=sample_rate, x_axis='time', y_axis='mel', ax=ax[1])
ax[1].set(title="Mel spectrogram")
ax[1].label_outer()

# Finally, let's use nlpaug to augment the audio file and plot the result
# More augmentation methods can be found here:
# https://github.com/makcedward/nlpaug/examples
aug = naa.NoiseAug()
augmented = aug.augment(audio_array)
# Depending on the nlpaug version, augment() returns an array or a list of arrays
augmented_array = np.asarray(augmented[0] if isinstance(augmented, list) else augmented)
lb.display.waveshow(augmented_array, sr=sample_rate, ax=ax[2])
ax[2].set(title="Augmented audio")
ax[2].label_outer()

fig.tight_layout()

# Let's add a player as well
ipd.Audio(audio_path)



# Let's go over an issue with our current setup.
When training good ASR models, we often need hundreds or thousands of hours of data. A typical speech dataset consists of approximately 100 hours of audio-transcription data, requiring upwards of 130GB of storage space for download and preparation. For most ASR researchers, this is already at the upper limit of what is feasible for disk space. So what happens when we want to train on a larger dataset? The full [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) dataset consists of 960 hours of audio data. Do we need to bite the bullet and buy additional storage? Or is there a way we can train on all of these datasets with no disk drive requirements?

When training machine learning systems, we rarely use the entire dataset at once. We typically _batch_ our data into smaller subsets of data, and pass these incrementally through our training pipeline. This is because we train our system on an accelerator device, such as a GPU or TPU, which has a memory limit typically around 12GB. We have to fit our model, optimiser and training data all on the same accelerator device, so we usually have to divide the dataset up into smaller batches and move them from the CPU to the GPU when required.
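
The batching idea can be sketched in a few lines of plain Python. The helper `make_batches` and the toy list of integers are illustrative assumptions; a real pipeline would normally rely on the data loader of the chosen framework.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(samples: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `samples`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


# Ten toy "samples" split into batches of four; the last batch is smaller.
batches = list(make_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Only one batch at a time needs to be moved to the accelerator, which is why the batch size, rather than the dataset size, determines how much device memory the data occupies.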

Consequently, we don't require the entire dataset to be downloaded at once; we simply need the batch of data that we pass to our model at any one time. We can leverage this principle of partial dataset loading when preparing our dataset: rather than downloading the entire dataset at the start, we can load each piece of data as and when we need it. For each batch, we load the relevant data from a remote server and pass it through the training pipeline. For the next batch, we load the next items and again pass them through the training pipeline. At no point do we have to save data to our disk drive; we simply load the samples into memory and use them in our pipeline. In doing so, we only ever need as much memory as each individual batch requires.

This is analogous to downloading a TV show versus streaming it 📺. When we download a TV show, we download the entire video offline and save it to our disk. Compare this to when we stream a TV show. Here, we don't download any part of the video to disk, but iterate over the video file and load each part in real time as required. It's this same principle that we can apply to our ML training pipeline! We want to iterate over the dataset and load each sample of data as required.
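
The principle of loading each sample only when it is needed can be illustrated with a plain Python generator. This is a toy sketch, not how the library is implemented: `remote_samples` merely simulates a remote source, and the `fetched` list is a hypothetical bookkeeping device that records which samples were actually requested.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

fetched: List[int] = []  # records which samples were actually "downloaded"


def remote_samples(total: int) -> Iterator[Dict[str, int]]:
    """Stand-in for a remote dataset: a sample is produced only when requested."""
    for i in range(total):
        fetched.append(i)  # simulate the download of sample i
        yield {"id": i}


def stream_batches(samples: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a lazy stream into batches without materialising the whole stream."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


stream = stream_batches(remote_samples(total=1_000_000), batch_size=4)
first_batch = next(stream)
print([s["id"] for s in first_batch])  # [0, 1, 2, 3]
print(len(fetched))                    # 4: only the first batch was fetched
```

Although the simulated dataset has a million samples, only the four that make up the first batch were ever produced, so memory use is bounded by the batch rather than by the dataset.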

The Hugging Face setup allows us to do this easily (given that the actual data repository is set up in a compatible manner).


Let's first check out the memory footprint of Samrómur.

In [None]:
!du -sh ~/.cache/huggingface/datasets/*


We will now make a minor change to the code above and load the dataset in a streaming fashion. This is done by setting the streaming parameter to True.

In [None]:
dataset = datasets.load_dataset('language-and-voice-lab/samromur_asr', streaming=True)
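
A streamed dataset is consumed by iterating over it rather than by indexing into it, so an expression like `dataset["train"][0]` is not expected to work in streaming mode. The sketch below shows one way to peek at the first few examples of any iterable. The helper `head` is a hypothetical name, and a plain generator stands in for the stream so that the example runs offline.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def head(iterable: Iterable[T], n: int = 3) -> List[T]:
    """Return the first n items of an iterable without requesting any more."""
    return list(islice(iterable, n))


# With the streaming dataset loaded above, the call would look like
#   head(dataset["train"], n=3)
# Here an effectively endless generator plays the role of the remote stream.
fake_stream = (
    {"audio_id": i, "normalized_text": f"utterance {i}"} for i in range(10**9)
)
first_three = head(fake_stream)
for example in first_three:
    print(example)
```

Streaming datasets in the library also provide a `take(n)` method that serves the same purpose; the helper above is only meant to show that nothing beyond the requested examples has to be loaded.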

Note: If you want to clear Samrómur out of the cache directory, you can do so by running the following command, but this will also delete all other Hugging Face datasets you have downloaded. 

!rm -r ~/.cache/huggingface/datasets/downloads