<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Use_Datasets_From_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to access the genomic_benchmarks datasets from Hugging Face and use them with the PyTorch and TensorFlow frameworks. For more examples, see https://huggingface.co/docs/datasets/index

In [1]:
!pip install -qq datasets

To access a dataset, say `human_nontata_promoters`, add the prefix `katarinagresova/Genomic_Benchmarks_` to its name to get the path of the Hugging Face dataset.
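The naming rule above can be written as a small helper. This is only a sketch: the names `HF_PREFIX` and `hf_dataset_path` are hypothetical and are not part of genomic_benchmarks or the `datasets` library.

```python
# Hypothetical helper: builds the Hugging Face path from a dataset name
# by prepending the prefix described above.
HF_PREFIX = "katarinagresova/Genomic_Benchmarks_"


def hf_dataset_path(name):
    """Return the Hugging Face path for a genomic_benchmarks dataset name."""
    return HF_PREFIX + name


path = hf_dataset_path("human_nontata_promoters")
print(path)
```

The resulting string is what gets passed to `load_dataset` later in this notebook.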



In [2]:
from huggingface_hub import notebook_login

notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from datasets import load_dataset

# The path is the prefix followed by the dataset name.
dataset = load_dataset("katarinagresova/Genomic_Benchmarks_human_nontata_promoters", use_auth_token=True)

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/katarinagresova___parquet/katarinagresova--Genomic_Benchmarks_human_nontata_promoters-9d8ebae779f9ff53/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/katarinagresova___parquet/katarinagresova--Genomic_Benchmarks_human_nontata_promoters-9d8ebae779f9ff53/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.



When the dataset is loaded, the data are downloaded and stored in the Hugging Face cache, but the easiest way to work with them is through the `dataset` variable. It is of type `DatasetDict` (similar to a Python dictionary), with splits as keys and data as values.

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['seq', 'label'],
        num_rows: 27097
    })
    test: Dataset({
        features: ['seq', 'label'],
        num_rows: 9034
    })
})

In [11]:
train_dset = dataset["train"]
train_dset

Dataset({
    features: ['seq', 'label'],
    num_rows: 27097
})

## TensorFlow

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get TensorFlow tensors instead, you can set the format of the dataset to `tf`:

In [12]:
ds = train_dset.with_format("tf")
ds[0]

{'seq': <tf.Tensor: shape=(), dtype=string, numpy=b'CAAGGGTGTAGTGCCCTGAGGGTGGCAATAGTTCCTGAGGCCATAACTGTTCTGAGCCCTTGCTGGGTGCCAGGCACAGTGCTGCTAGTGCGCTCTGCAGAGCTGATCTCACAATAACTTTTGGAGGTGCAAATACTCTATCCAGTTTATGAATGAGGAAACTGAGGCACAAAGTGGCTCCATGACTTGCCTGAGTCCCCACAGCTAGTAAGGGATGCCAGCAGGCGTTGAACCTCAACCCTAGAGCCTGC'>,
 'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>}

Although you can load individual samples and batches just by indexing into your dataset, this won't work if you want to use Keras methods like `fit()` and `predict()`. You could write a generator function that shuffles and loads batches from your dataset and call `fit()` on that, but that sounds like a lot of unnecessary work. Instead, if you want to stream data from your dataset on the fly, we recommend converting your dataset to a `tf.data.Dataset` using the `to_tf_dataset()` method.

In [13]:
tf_ds = train_dset.to_tf_dataset(
    columns=["seq"],
    label_cols=["label"],
    batch_size=2,
    shuffle=True,
)
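For comparison, the hand-written generator mentioned above might look roughly like the following. It is a plain-Python sketch, not what `to_tf_dataset()` does internally; the function name `batch_generator` and the toy rows are hypothetical. It assumes only that the dataset supports `len()` and integer indexing that returns a dict with `seq` and `label` keys.

```python
import random


def batch_generator(dataset, batch_size, seed=None):
    """Yield shuffled (sequences, labels) batches from an indexable dataset."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)  # shuffle once per pass over the data
    for start in range(0, len(indices), batch_size):
        rows = [dataset[i] for i in indices[start:start + batch_size]]
        seqs = [row["seq"] for row in rows]
        labels = [row["label"] for row in rows]
        yield seqs, labels


# Toy rows standing in for the real dataset (invented for illustration).
toy_rows = [
    {"seq": "ACGT", "label": 0},
    {"seq": "TTGA", "label": 1},
    {"seq": "GGCC", "label": 1},
    {"seq": "CATG", "label": 0},
    {"seq": "TACG", "label": 1},
]

batches = list(batch_generator(toy_rows, batch_size=2, seed=0))
for seqs, labels in batches:
    print(seqs, labels)
```

Writing and maintaining this by hand is the extra work that `to_tf_dataset()` avoids.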

## PyTorch

To get PyTorch tensors instead, you can set the format of the dataset to `torch` using `Dataset.with_format()`:

In [19]:
ds = train_dset.with_format("torch")
ds[0]

{'seq': 'CAAGGGTGTAGTGCCCTGAGGGTGGCAATAGTTCCTGAGGCCATAACTGTTCTGAGCCCTTGCTGGGTGCCAGGCACAGTGCTGCTAGTGCGCTCTGCAGAGCTGATCTCACAATAACTTTTGGAGGTGCAAATACTCTATCCAGTTTATGAATGAGGAAACTGAGGCACAAAGTGGCTCCATGACTTGCCTGAGTCCCCACAGCTAGTAAGGGATGCCAGCAGGCGTTGAACCTCAACCCTAGAGCCTGC',
 'label': tensor(0)}

Like torch.utils.data.Dataset objects, a Dataset can be passed directly to a PyTorch DataLoader:

In [21]:
from torch.utils.data import DataLoader

dataloader = DataLoader(ds, batch_size=4)
for batch in dataloader:
    print(batch)
    break

{'seq': ['CAAGGGTGTAGTGCCCTGAGGGTGGCAATAGTTCCTGAGGCCATAACTGTTCTGAGCCCTTGCTGGGTGCCAGGCACAGTGCTGCTAGTGCGCTCTGCAGAGCTGATCTCACAATAACTTTTGGAGGTGCAAATACTCTATCCAGTTTATGAATGAGGAAACTGAGGCACAAAGTGGCTCCATGACTTGCCTGAGTCCCCACAGCTAGTAAGGGATGCCAGCAGGCGTTGAACCTCAACCCTAGAGCCTGC', 'TGCAGTTAGGAGGGCAGGCCAGGGAGGATCCCACAGTGGCCCAGGGGTTTGAGATTTGAGCAGCAAATAAGAGAAAATGTGTGGATCTGAAATGTAGAAAGACGGAGGATTGAACCTCAAGGGGAACAAGGTGGCTGACGTGAGTGGAACAGGAGTAAAGAAGGGGAGGTGAGGCTTGAACCGCGAGGTGCCATGTGGGGAGCTTATGCAGAGGCTGGGGCATCTCAGGATGCATACCCAAGATGTTCTTG', 'CCCCCAATTTATCCTAGCTCCTCGTAGGACCTGACCTCCTCTTTATTCTGATTATTCCATCTGGGTTTTGTTGTTTTCTTAAGAAAACAATTTTTTTTCCTACTTGGCTGGTCTAGTTTTTTGAGGGAGAGCCAATCTTTTATCAGCTGAACCAAAATAATAATGGCTTTGGTTGCTAACTTCTCTGTGTCATGTAGGACCTTGGTTTGCTGCCAAGGACTGGAGTAGAAAAAAGGGGAACGAGATGCAGG', 'CCCGATGCCATCGTGCTGGCCGAGGAGGCCCTGGACAAAGCCCAGGAAGTGCTGGAGTTCCACCAAAGCCTGGGGGCCTTGGTGGAGGGCACAGGGCACCTGCTGGAGGCCCACTATGCTCGGCCAGAGGTCGTGGGGCAGACCAGTGCCCTCCTGCGGGCCAAGCTGGCCCAGGGCGCCTACCGCACAGCTGTGGACTTGGAGTCTCTGGCCTCTCAGCTCACA