<a href="https://colab.research.google.com/github/Antony-M1/huggingface_eco/blob/main/datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**README**

This document was prepared in **Google Colab**.


# Install Packages

In [2]:
!pip install -q datasets

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 16.1.0 w

In [15]:
from datasets import load_dataset, Dataset
from huggingface_hub import list_datasets, login
from google.colab import userdata


In [7]:
login(userdata.get('HUGGINGFACEHUB_API_TOKEN'))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# [Datasets](https://pypi.org/project/datasets/)

🤗 Datasets is a `lightweight` library providing two main features:

* `one-line dataloaders for many public datasets`: one-liners to download and pre-process any of the major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating an ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX).

* `efficient data pre-processing`: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.


🤗 Datasets is designed to let the community easily `add` and `share new datasets.`

🤗 Datasets has many additional interesting features:

* `Thrive on large datasets`: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).
* `Smart caching`: never wait for your data to process several times.
* Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
* Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.
* Native support for audio and image data.
* Enable streaming mode to save disk space and start iterating over the dataset immediately.


### Example: loading a dataset

In [8]:

list_of_datasets = list_datasets() # Return a generator object

print(len([dataset.id for dataset in list_of_datasets]))

168655


In [9]:
# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad') # datasets.dataset_dict.DatasetDict Object

print(squad_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [10]:
squad_dataset['train'][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [11]:
# Process the dataset - add a column with the length of the context texts

dataset_with_length = squad_dataset.map(lambda x:{"length": len(x["context"])})
dataset_with_length

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'length'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'length'],
        num_rows: 10570
    })
})

In [12]:
# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased') # AutoTokenizer picks the tokenizer class automatically based on the model architecture
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [13]:
tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)
tokenized_dataset

Token indices sequence length is longer than the specified maximum sequence length for this model (539 > 512). Running this sequence through the model will result in indexing errors


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10570
    })
})

### Data Streaming

When a dataset is too large to download comfortably, you can stream it by passing `streaming=True` to `load_dataset`.

In [14]:
image_dataset = load_dataset('cifar100', streaming=True)
for example in image_dataset["train"]:
    break

Downloading readme:   0%|          | 0.00/9.98k [00:00<?, ?B/s]

### Other Info

Types of data available on the Hugging Face Hub:
* NLP (text)
* Audio
* Vision (image)

**Why `.parquet` format?**

The Hugging Face datasets are often stored in the `.parquet` format for several reasons:

1. **Efficiency**: Parquet is a columnar storage format that is highly optimized for analytical queries. It offers `efficient compression and encoding schemes, enabling faster read and write operations, especially for large datasets`. This efficiency is beneficial when working with large-scale datasets commonly used in machine learning and deep learning tasks.

2. **Scalability**: Parquet files can be easily `partitioned` and `distributed across multiple storage locations or clusters`. This makes it suitable for distributed processing frameworks like Apache Spark, enabling scalable data processing and analysis.

3. **Columnar Storage**: Parquet stores data in a `columnar` format, meaning that values within each column are stored together, allowing for efficient column-wise operations and data retrieval. This can be advantageous for tasks like `feature extraction` and `transformation`, where access to specific columns is required.

4. **Compatibility**: Parquet files are supported by a wide range of data processing frameworks and tools, including `pandas`, `Apache Spark`, and `Hadoop`. This makes it easier to integrate datasets stored in Parquet format into existing data pipelines and workflows.

"Parquet" is a name, not an acronym. Apache Parquet is an open-source columnar storage format developed as part of the `Apache Hadoop ecosystem`.

The everyday meaning of the word `Parquet`:
![image](https://i.pinimg.com/564x/a2/bf/8f/a2bf8fa220cb0020cff8358832aae2e5.jpg)

Not all types of datasets (NLP, audio, vision, etc.) are stored in the same format. While Parquet is a common format for storing `tabular data` or `structured datasets`, other formats may be more suitable for different types of data:

- **NLP Datasets**: NLP datasets may be stored in various formats, including plain text (e.g., CSV, JSON), tokenized text (e.g., HDF5, TFRecord), or specialized formats optimized for sequence data.
- **Audio Datasets**: Audio datasets are typically stored in formats like `WAV`, `MP3`, or more specialized formats designed for audio processing (e.g., HDF5 with audio features stored as arrays).
- **Vision Datasets**: Vision datasets often use image formats such as `JPEG`, `PNG`, or `TIFF`. Additionally, they may use specialized formats like HDF5 with image data stored as arrays or TFRecord for integration with TensorFlow.

In summary, while Parquet is a common format for structured datasets, different types of datasets may use specialized formats tailored to their specific data structures and requirements.

# Dataset Conversion

Convert data from different sources into the `Dataset` format.



## Dict

Convert a Python `dict` into a `Dataset`.

In [16]:
# Sample dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Los Angeles', 'Chicago']
}

In [17]:
# Convert dictionary to Hugging Face dataset
dataset = Dataset.from_dict(data)

In [18]:

# Display the dataset
print(dataset)

Dataset({
    features: ['name', 'age', 'city'],
    num_rows: 3
})
