# Loading and processing data from Hugging Face 🤗

```bash 
pip install datasets
```

- If you want to create your own dataset or share it with others, you can follow the guide on how to add a dataset to the Hugging Face Hub.

## Inspecting a dataset 🔎

A `DatasetInfo` object contains the dataset's description, features, and size.

You can access it without downloading the dataset.

> **Note:** Image and audio datasets have additional dependencies:
> ```bash
> pip install datasets[audio]
> pip install datasets[vision]
> ```


In [None]:
from datasets import load_dataset_builder
DATASET_NAME = 'poloclub/diffusiondb'
ds_diffusion = load_dataset_builder(DATASET_NAME)

In [None]:
print(ds_diffusion.info.description)

[Features](https://huggingface.co/docs/datasets/v2.12.0/about_dataset_features)
- Value: a scalar type such as `int64`, `float32`, or `string`
- ClassLabel: class names, stored as integers
- Sequence: a list of other features
- Array
- Image
- Audio

In [None]:
from pprint import pprint
pprint(ds_diffusion.info.features, indent=4, width=15)

A split is a subset of the dataset, such as `train`, `validation`, or `test`.

In [None]:
from datasets import get_dataset_split_names
get_dataset_split_names(DATASET_NAME)

A configuration is a sub-dataset within a dataset, for example one size variant.

In [None]:
from datasets import get_dataset_config_names
get_dataset_config_names(DATASET_NAME)

## [Datasets 🌎 vs IterableDatasets 🌌](https://huggingface.co/docs/datasets/about_mapstyle_vs_iterable)

### 📖 Datasets: random access and memory-mapping (optimized for memory use)

### 💧 IterableDatasets: sequential access, without having to download the dataset completely

In [None]:
from datasets import load_dataset
import numpy as np
CONFIGURATION = '2m_first_1k'
diffusiondb = load_dataset(DATASET_NAME, CONFIGURATION, split='train')


> **Note:** For large datasets, index the row first and the column second, so that only one row is loaded rather than a whole column:
> ```python
> dataset[0]['text']
> ```


In [None]:
import textwrap

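# Pick a random row; int() keeps the index a plain Python integer
random_i = int(np.random.randint(diffusiondb.num_rows))

# Index the row first, then the column, so only one row is loaded
wrapped_text = textwrap.fill(diffusiondb[random_i]['prompt'], width=200)
print(wrapped_text)

image = diffusiondb[random_i]['image']
display(image)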

In [None]:
from matplotlib import pyplot

def show_images(diff_images):
    fig, axes = pyplot.subplots(1, 3, figsize=(12,4))
    for image, ax in zip(diff_images, axes.ravel()):
        ax.imshow(image)
    fig.subplots_adjust(wspace=0.2)

#show_images(diffusiondb['image'][random_i: 3 + random_i])
show_images(diffusiondb[random_i: 3 + random_i]['image'])

## [Pre-processing](https://huggingface.co/docs/datasets/process) 💽

Filter

In [None]:
filter_subset = diffusiondb.filter(lambda sample: ' cat ' in sample['prompt'])
len(filter_subset)

In [None]:
show_images(filter_subset[:3]['image'])

Shards

- Split a dataset into smaller pieces that fit the available memory
- Process the pieces of a dataset in a distributed way

In [None]:
diffusiondb.shard(num_shards=2, index=0)

Export

Allowed formats:
- csv
- json
- parquet
- sql
- pandas
- dict

In [None]:
diffusiondb.to_parquet('dataset/export/diffusiondb.parquet')

Apache Arrow 🪶

- Arrow allows zero-copy reads which removes virtually all serialization overhead.
- Arrow is language-agnostic (C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust).
- Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
- Arrow can be passed directly to ML tools such as NumPy, Pandas, PyTorch, and TensorFlow.
- Arrow supports many, possibly nested, column types.

But... models need numbers!

In [None]:
diffusiondb.features

In [None]:
diffusiondb = diffusiondb.remove_columns(['user_name', 'timestamp'])

In [None]:
diffusiondb.features

In [None]:
samplers = diffusiondb.unique('sampler')
print(samplers)

In [None]:
label2id = {"k_euler_ancestral": 0, "k_lms": 1, "k_euler": 2, "k_heun": 3}

In [None]:
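# align_labels_with_mapping expects a ClassLabel column. 'sampler' is still a string
# column at this point, so this call is expected to fail until the column is cast (see Cast).
diffusiondb_aligned = diffusiondb.align_labels_with_mapping(label2id, "sampler")
diffusiondb_aligned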

In [None]:
diffusiondb[:3]

Cast

In [None]:
from datasets import ClassLabel
new_sampler_feat = diffusiondb.features.copy()
new_sampler_feat['sampler'] = ClassLabel(names=['k_euler_ancestral', 'k_lms', 'k_euler', 'k_heun'])
diffusiondb = diffusiondb.cast(new_sampler_feat)
diffusiondb.features

In [None]:
diffusiondb[:3]

In [None]:
diffusiondb = diffusiondb.align_labels_with_mapping(label2id, "sampler")
diffusiondb

```bash 
pip install transformers
```

[AutoTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer)

The tokenizer returns a dictionary with up to three items, depending on the model:

- input_ids: the numbers representing the tokens in the text.
- token_type_ids: indicates which sequence a token belongs to if there is more than one sequence. Not every tokenizer returns this item.
- attention_mask: indicates whether a token should be masked or not. The value is 1 for tokens that should be attended to and 0 for padding tokens that should be ignored.
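
To make `input_ids` and `attention_mask` concrete without downloading a model, here is a toy whitespace tokenizer. It is purely hypothetical: `toy_tokenize` and its vocabulary are invented for this sketch, and real tokenizers split text into subwords rather than whole words.

```python
from typing import Dict, List


def toy_tokenize(texts: List[str], vocab: Dict[str, int], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Hypothetical whitespace tokenizer that pads every text to the longest one."""
    unk_id = vocab["<unk>"]
    ids = [[vocab.get(word, unk_id) for word in text.lower().split()] for text in texts]
    max_len = max(len(seq) for seq in ids)
    # Pad the ids and mark real tokens with 1 and padding with 0.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up vocabulary, used only for illustration.
vocab = {"<pad>": 0, "<unk>": 1, "a": 2, "cat": 3, "on": 4, "mat": 5}
encoded = toy_tokenize(["a cat", "a cat on a mat"], vocab)

print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The shorter text is padded with zeros, and its attention mask tells the model to ignore those positions.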

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("halffried/sd2-laion-clipH14-tokenizer")

print(tokenizer(diffusiondb[random_i]["prompt"]))

In [None]:
print(diffusiondb[random_i]["prompt"])

Map

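Apply a function to the examples in a dataset:
- individually, one example at a time (the default)
- in batches (`batched=True`): the output may have a different number of rows than the input, but all values in the output dictionary must contain the same number of elements


Multiprocessing: pass `num_proc` to `map`; `with_rank=True` additionally passes the process rank to the function.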

In [None]:
def tokenization(sample):
    return tokenizer(sample['prompt'])

diffusiondb = diffusiondb.map(tokenization)

In [None]:
diffusiondb.features

[Image Augmentations](https://pytorch.org/vision/stable/auto_examples/plot_transforms.html#sphx-glr-auto-examples-plot-transforms-py)


In [None]:
from torchvision.transforms import Grayscale

gray = Grayscale()

def transforms(samples):
    # Batched function: convert every image in the batch to grayscale
    samples['gray_image'] = [gray(img) for img in samples['image']]
    return samples

In [None]:
diffusiondb = diffusiondb.map(transforms, batched=True) #, remove_columns=["image"]

In [None]:
diffusiondb[random_i]['gray_image']

In [None]:
diffusiondb.features

In [None]:
diffusiondb.reset_format()

diffusiondb.set_format(type="torch", columns=["input_ids", "attention_mask", "gray_image"])

diffusiondb.format['type']

In [None]:
diffusiondb.save_to_disk("/home/djm/Documents/Hugging Face Workshops/Datasets/dataset/save2disk")

In [None]:
from transformers import DataCollatorWithPadding

diffusiondb.reset_format()

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = diffusiondb.to_tf_dataset(
    columns=["input_ids", "attention_mask", "gray_image"],
    label_cols=["sampler"],  # the ClassLabel column built above; the dataset has no "labels" column
    batch_size=2,
    collate_fn=data_collator,
    shuffle=True
)

set_transform: formatting on the fly

- Applies a user-defined formatting function and replaces the format set by `datasets.Dataset.set_format()`.
- The function takes a batch (as a dict) as input and returns a batch.
- It is applied right before the objects are returned in `__getitem__`.

In [None]:
import albumentations as A
from PIL import Image

# Each augmentation is applied with probability 0.5
augmentation_pipeline = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=30, p=0.5),
    A.RandomBrightnessContrast(p=0.5),
])

def pipeline_transforms(samples):
    augmented_image = []
    for img in samples['image']:
        # albumentations works on NumPy arrays in (height, width, channel) order
        np_image = np.array(img.convert("RGB"))

        transformed_image = augmentation_pipeline(image=np_image)['image']

        # Convert the augmented array back to a PIL image
        augmented_image.append(Image.fromarray(transformed_image))

    samples['augmented_image'] = augmented_image

    return samples


In [None]:
diffusiondb.set_transform(pipeline_transforms)

In [None]:

show_images(diffusiondb[random_i:random_i+3]['augmented_image'])

Medellín AI - Meetup