<a href="https://colab.research.google.com/github/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Use_Datasets_From_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to access the genomic_benchmarks datasets on the Hugging Face Hub and use them with the PyTorch and TensorFlow frameworks. For more examples, see the Hugging Face Datasets documentation: https://huggingface.co/docs/datasets/index

In [1]:
!pip install -qq datasets


When you want to access a dataset, say `human_nontata_promoters`, add the prefix `katarinagresova/Genomic_Benchmarks_` to its name to get the path of the Hugging Face dataset.
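The naming rule above can be sketched as a small helper. The function name `hf_dataset_path` is hypothetical and only illustrates the string concatenation; it is not part of genomic_benchmarks or `datasets`.

```python
# Prefix stated in this notebook for Genomic Benchmarks datasets on Hugging Face.
HF_PREFIX = "katarinagresova/Genomic_Benchmarks_"


def hf_dataset_path(name: str) -> str:
    """Return the Hugging Face path for a Genomic Benchmarks dataset name."""
    return HF_PREFIX + name


print(hf_dataset_path("human_nontata_promoters"))
# katarinagresova/Genomic_Benchmarks_human_nontata_promoters
```

The returned string is what gets passed to `load_dataset()`.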



In [2]:
from datasets import load_dataset

dataset = load_dataset("katarinagresova/Genomic_Benchmarks_human_nontata_promoters")

When the dataset is loaded, the data are downloaded and stored in the Hugging Face cache, but the easiest way to work with them is through the `dataset` variable. It is of type `DatasetDict` (similar to a Python dictionary), with splits as keys and data as values.

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['seq', 'label'],
        num_rows: 27097
    })
    test: Dataset({
        features: ['seq', 'label'],
        num_rows: 9034
    })
})

In [4]:
train_dset = dataset["train"]
train_dset

Dataset({
    features: ['seq', 'label'],
    num_rows: 27097
})

## TensorFlow

By default, datasets return regular Python objects: integers, floats, strings, lists, etc.

To get TensorFlow tensors instead, set the format of the dataset to `tf`:

In [5]:
ds = train_dset.with_format("tf")
ds[0]

{'seq': <tf.Tensor: shape=(), dtype=string, numpy=b'CAAGGGTGTAGTGCCCTGAGGGTGGCAATAGTTCCTGAGGCCATAACTGTTCTGAGCCCTTGCTGGGTGCCAGGCACAGTGCTGCTAGTGCGCTCTGCAGAGCTGATCTCACAATAACTTTTGGAGGTGCAAATACTCTATCCAGTTTATGAATGAGGAAACTGAGGCACAAAGTGGCTCCATGACTTGCCTGAGTCCCCACAGCTAGTAAGGGATGCCAGCAGGCGTTGAACCTCAACCCTAGAGCCTGC'>,
 'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>}

Although you can load individual samples and batches just by indexing into your dataset, this won’t work with Keras methods such as `fit()` and `predict()`. You could write a generator function that shuffles and loads batches from your dataset and call `fit()` on that, but that is a lot of unnecessary work. Instead, to stream data from your dataset on the fly, we recommend converting it to a `tf.data.Dataset` with the `to_tf_dataset()` method.

In [6]:
tf_ds = train_dset.to_tf_dataset(
    columns=["seq"],
    label_cols=["label"],
    batch_size=2,
    shuffle=True,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 
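For comparison, the hand-written generator that `to_tf_dataset()` makes unnecessary might look roughly like this. It is a minimal pure-Python sketch: the name `batch_generator` and the toy records are hypothetical, and a real version would still need to be wired up to Keras.

```python
import random


def batch_generator(examples, batch_size, seed=None):
    """Yield shuffled batches as (list of sequences, list of labels)."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)  # visit the examples in a new random order
    for start in range(0, len(indices), batch_size):
        chunk = [examples[i] for i in indices[start:start + batch_size]]
        yield [ex["seq"] for ex in chunk], [ex["label"] for ex in chunk]


# Made-up stand-in records with the same fields as the dataset.
toy = [
    {"seq": "ACGT", "label": 0},
    {"seq": "TTGA", "label": 1},
    {"seq": "GGCA", "label": 1},
]
batches = list(batch_generator(toy, batch_size=2, seed=0))
print(len(batches))  # 2 (one batch of two examples, one of one)
```

Even this small version has to handle shuffling, the final partial batch, and splitting features from labels by hand.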


## PyTorch

To get PyTorch tensors instead, set the format of the dataset to `torch` using `Dataset.with_format()`:

In [7]:
ds = train_dset.with_format("torch")
ds[0]

{'seq': 'CAAGGGTGTAGTGCCCTGAGGGTGGCAATAGTTCCTGAGGCCATAACTGTTCTGAGCCCTTGCTGGGTGCCAGGCACAGTGCTGCTAGTGCGCTCTGCAGAGCTGATCTCACAATAACTTTTGGAGGTGCAAATACTCTATCCAGTTTATGAATGAGGAAACTGAGGCACAAAGTGGCTCCATGACTTGCCTGAGTCCCCACAGCTAGTAAGGGATGCCAGCAGGCGTTGAACCTCAACCCTAGAGCCTGC',
 'label': tensor(0)}
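Note that in the output above only `label` became a tensor; `seq` is still a Python string, so it needs a numeric encoding before a model can consume it. A minimal sketch, assuming a simple base-to-integer mapping (the mapping and the name `encode_seq` are illustrative choices, not part of genomic_benchmarks or `datasets`):

```python
# Hypothetical vocabulary: 0..3 for the four bases, 4 for anything else (e.g. N).
NUCLEOTIDE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
UNKNOWN_INDEX = 4


def encode_seq(seq):
    """Map a DNA string to a list of integer indices, one per base."""
    return [NUCLEOTIDE_TO_INDEX.get(base, UNKNOWN_INDEX) for base in seq.upper()]


print(encode_seq("CAAGGN"))  # [1, 0, 0, 2, 2, 4]
```

The resulting lists could then be converted to tensors, for example in a `Dataset.map()` call or in a `collate_fn` passed to the `DataLoader`.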

Like `torch.utils.data.Dataset` objects, a `Dataset` can be passed directly to a PyTorch `DataLoader`:

In [8]:
from torch.utils.data import DataLoader

dataloader = DataLoader(ds, batch_size=4)
for batch in dataloader:
    print(batch)
    break

{'seq': ['CAAGGGTGTAGTGCCCTGAGGGTGGCAATAGTTCCTGAGGCCATAACTGTTCTGAGCCCTTGCTGGGTGCCAGGCACAGTGCTGCTAGTGCGCTCTGCAGAGCTGATCTCACAATAACTTTTGGAGGTGCAAATACTCTATCCAGTTTATGAATGAGGAAACTGAGGCACAAAGTGGCTCCATGACTTGCCTGAGTCCCCACAGCTAGTAAGGGATGCCAGCAGGCGTTGAACCTCAACCCTAGAGCCTGC', 'TGCAGTTAGGAGGGCAGGCCAGGGAGGATCCCACAGTGGCCCAGGGGTTTGAGATTTGAGCAGCAAATAAGAGAAAATGTGTGGATCTGAAATGTAGAAAGACGGAGGATTGAACCTCAAGGGGAACAAGGTGGCTGACGTGAGTGGAACAGGAGTAAAGAAGGGGAGGTGAGGCTTGAACCGCGAGGTGCCATGTGGGGAGCTTATGCAGAGGCTGGGGCATCTCAGGATGCATACCCAAGATGTTCTTG', 'CCCCCAATTTATCCTAGCTCCTCGTAGGACCTGACCTCCTCTTTATTCTGATTATTCCATCTGGGTTTTGTTGTTTTCTTAAGAAAACAATTTTTTTTCCTACTTGGCTGGTCTAGTTTTTTGAGGGAGAGCCAATCTTTTATCAGCTGAACCAAAATAATAATGGCTTTGGTTGCTAACTTCTCTGTGTCATGTAGGACCTTGGTTTGCTGCCAAGGACTGGAGTAGAAAAAAGGGGAACGAGATGCAGG', 'CCCGATGCCATCGTGCTGGCCGAGGAGGCCCTGGACAAAGCCCAGGAAGTGCTGGAGTTCCACCAAAGCCTGGGGGCCTTGGTGGAGGGCACAGGGCACCTGCTGGAGGCCCACTATGCTCGGCCAGAGGTCGTGGGGCAGACCAGTGCCCTCCTGCGGGCCAAGCTGGCCCAGGGCGCCTACCGCACAGCTGTGGACTTGGAGTCTCTGGCCTCTCAGCTCACA