This notebook documents how the human genome (downloaded from ensembl.org as a gzipped FASTA file) is processed into two Hugging Face datasets:


*   [Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0): DNA split into 10 kb pieces
*   [Human_DNA_v0_DNABert6tokenized](https://huggingface.co/datasets/simecek/Human_DNA_v0_DNABert6tokenized): DNA tokenized and ready for language model training (tensors of 512 tokens)



## 0) PIP installation & FASTA.GZ download

In [1]:
!pip install transformers datasets Bio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting Bio
  Downloading bio-1.3.9-py3-none-any.whl (270 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_

In [2]:
!wget http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

--2022-06-07 13:43:30--  http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1107654500 (1.0G) [application/x-gzip]
Saving to: ‘Homo_sapiens.GRCh38.dna.toplevel.fa.gz’



## 1) Preprocessing FASTA into pandas DataFrame

Chromosomes are cut into 10 kb pieces. Pieces containing 100 or more N bases (that is, at least 1% of the piece) are filtered out.
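
The two rules above can be illustrated on a toy sequence. The sketch below is not the notebook's own code: the helper names `split_fixed` and `keep_piece`, the 10-base piece size, and the 20% threshold are hypothetical values chosen to keep the example readable, whereas the notebook itself works with 10,000-base pieces and drops pieces with 100 or more N bases.

```python
def split_fixed(seq, size):
    """Split seq into consecutive, non-overlapping pieces of exactly `size` bases.

    A trailing piece shorter than `size` is dropped.
    """
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def keep_piece(piece, max_n_fraction):
    """Return True if the fraction of 'N' bases is strictly below max_n_fraction."""
    return piece.count("N") / len(piece) < max_n_fraction


# Toy input: three full 10-base pieces followed by a 3-base remainder.
toy = "ACGTACGTAC" + "NNNNACGTAC" + "ACGTNACGTA" + "ACG"

pieces = split_fixed(toy, 10)                       # remainder "ACG" is dropped
kept = [p for p in pieces if keep_piece(p, 0.2)]    # piece with 4 N bases is removed

print(pieces)
print(kept)
```

The middle piece contains 4 N bases out of 10 and is removed, while the last full piece contains a single N and is kept.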

In [None]:
import gzip
from Bio import SeqIO
from Bio.Seq import Seq
from tqdm.autonotebook import tqdm

def _fastagz2dict(fasta_path, fasta_total=None, stop_id=None, region_name_transform=lambda x: x):
    # load gzipped fasta into dictionary
    fasta = {}

    with gzip.open(fasta_path, "rt") as handle:
        for record in tqdm(SeqIO.parse(handle, "fasta"), total=fasta_total):
            fasta[region_name_transform(record.id)] = str(record.seq)
            if stop_id and (record.id == stop_id):
                # stop, do not read small contigs
                break
    return fasta

dna_raw = _fastagz2dict("/content/Homo_sapiens.GRCh38.dna.toplevel.fa.gz", 24, "MT")

In [None]:
# Rough number of 10 kb pieces to expect (each piece is 10**4 bases long)
sum(len(x) for x in dna_raw.values()) / 10**4

In [8]:
def kmers(s, k=6):
    return [s[i:i + k] for i in range(0, len(s), k) if i + k <= len(s)]

kmers("ACTGACTA", 3)

['ACT', 'GAC']

In [None]:
output = {}

# `chrom` is used instead of `chr` to avoid shadowing the Python built-in
for chrom in dna_raw:
    for i, chunk in enumerate(kmers(dna_raw[chrom], 10_000)):
        key = f"{chrom}_{i}"
        output[key] = chunk

len(output)

In [35]:
import pandas as pd

s = pd.Series(output, index=output.keys())
# Number of pieces with at least 100 N bases (>= 1% of a 10,000-base piece)
sum(s.str.count("N") >= 100)

15689

In [37]:
df = pd.DataFrame(s[s.str.count("N") < 100])
df

| | 0 |
|---|---|
| 1_1 | TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC... |
| 1_2 | CCTGGTGCTCCCACAAAGGAGAAGGGCTGATCACTCAAAGTTGCGA... |
| 1_3 | GGGGAAGCAAGGCGGAGTTGGGCAGCTCGTGTTCAATGGGTAGAGT... |
| 1_4 | GCCTCATGGAGGGGATCAGCTGCGAGGAGCTAAGAGCCCCCTCCAG... |
| 1_5 | AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTA... |
| ... | ... |
| Y_5684 | CTCTTGTCTAGGCTCTGCCTAAAGGGGGTATTGTGACATATCTCTG... |
| Y_5685 | TTATTGTGACATATCTTTGCACTGATCACTCAGGTGATGGGACTAT... |
| Y_5686 | TCACCCAGGAGGAGGATAAATTTTCAACTTTCATAACTTAAATATG... |
| Y_5687 | AATCTGCAGTTTAAAAATGCCAATCACTACAAATCACAGGGAAAAC... |


## 2) From pandas DataFrame to HF Dataset (split into train/test and uploaded to the HF Hub)

In [47]:
from datasets import Dataset

raw_dataset = (
    Dataset.from_pandas(df)
    .shuffle(seed=42)
    .rename_column(original_column_name="0", new_column_name="Seq")
    .remove_columns(["__index_level_0__"])
)
raw_dataset

Dataset({
    features: ['Seq'],
    num_rows: 292955
})

In [48]:
raw_dataset[0]

{'Seq': 'ATGCGGCATACTTCTGCTGTAGGAATTGCTGCTGGAATATACTAATCATTTTAAGCAGTGAAGAGCATTGCGTCTGAAAGGCAAGTAGAGATACGGAAACACAATCAAGTCTAACAATAGATTGAGAAGAACCATGCCTAACGACAAGGGATAAAGATGAAAACCACCGTTTCATTCTTTCTCCTCCTTTTGACTTTATTGCTGCAGCTCAGTTTGCATCTTGAGCACCCTCGTTGCCCAGCAGCCCGGCGGGGTTCATTTGCATGTGTCTTCCCAGGTCTTCCCTTCAACCTCTGCAAAGCCAGCCAGGCGGAGAGGGGGCAGAGGCGTCCTTGGAGGGGAGCAATTCAGAAACAGCCACATCTTTTCTTTAAGGAAAAGGGAGGTCTCAAGATTACTTTCTATTTTTCATCACTTCTCTAATATATGCACACTGTATTTGCATAACATCTATTTGCACTGGGAGCCCATCCCTGGCTTCCTGAAAGATACAGGAGGGCATTTGAATATATATTTTATTCCCTGTGATGTCTCAGAGTTGAGCCTCTAATCTCATTACCAGCTTGCATGCTTCCAGTGAGTTATTCTATGGTCTTTAGAATTGTGCCTCCAATTTGTAAGCCTAGCTAACAATTACATTTTCATCGTGGAAAGATGTTAAAGATTGCTTTCAGTGAGAATTAAATCAAAGATCTCAGCATGTGCAGCTCCCAACCCCCAACCTCACTTCTTTGCGCACTTAATAGAGGTTGGCAACATAAAACTCCCTTTCTCTAGAACATCATCTTCACACAGAAAATCCTGCAGAAACTATGTTAAAAACACAGCATTGTCTAGTCTTATTCACTAAATGCTCCAACTTGACCACCTCAAAAAAAATAATAATTTCCAGGCTTGGAGAGACTGTTTAATTATTGAAGGACCAGCCTAGTGAAATGACATGGGATCCCAGACTGGTGGGATTTATGAAGAAATTGTCCATCCCTAAAAG

In [52]:
# Hold out 10% of the rows as the test split
splitted_datasets = raw_dataset.train_test_split(test_size=0.1, seed=42)
splitted_datasets

DatasetDict({
    train: Dataset({
        features: ['Seq'],
        num_rows: 263659
    })
    test: Dataset({
        features: ['Seq'],
        num_rows: 29296
    })
})
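
The split sizes reported above can be checked by hand. The short calculation below assumes that the test share is rounded up to a whole row and that the remaining rows go to train; this assumption agrees with the numbers shown, but it is inferred from the output rather than stated anywhere in the notebook.

```python
import math

n_rows = 292_955      # rows in raw_dataset, taken from the output above
test_fraction = 0.1   # test share passed to train_test_split

# Assumption: the test size is rounded up and train receives the rest.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)  # 263659 29296
```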

In [32]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [54]:
splitted_datasets.push_to_hub("Human_DNA_v0")

Pushing split train to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/6 [00:00<?, ?it/s]

Pushing split test to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

## 3) Tokenization and chunking into 512-token tensors

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("armheb/DNA_bert_6")

Downloading:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [13]:
from datasets import load_dataset

splitted_datasets = load_dataset("simecek/Human_DNA_v0")

Using custom data configuration simecek--Human_DNA_v0-d7be3fc44fadbb72
Reusing dataset parquet (/root/.cache/huggingface/datasets/simecek___parquet/simecek--Human_DNA_v0-d7be3fc44fadbb72/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901)


  0%|          | 0/2 [00:00<?, ?it/s]

In [11]:
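def tokenize_function(s, k=6):
    # Split the sequence into non-overlapping k-mers and join them with spaces
    # before passing the resulting string to the tokenizer.
    seq_split = " ".join(kmers(s['Seq'], k))
    return tokenizer(seq_split)

tokenize_function({'Seq':'ACCTGCTGGACGATCATA'})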

{'input_ids': [2, 675, 2000, 393, 3], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
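
To make the output above easier to read, the following sketch reproduces only the string that is handed to the tokenizer; it does not call the tokenizer itself. The 18-base example splits into three 6-mers, while the output has five ids, so the leading `2` and trailing `3` are presumably the tokenizer's start and end special tokens (an interpretation, not something the notebook states).

```python
def kmers(s, k=6):
    # Same non-overlapping split as defined earlier in the notebook.
    return [s[i:i + k] for i in range(0, len(s), k) if i + k <= len(s)]


seq = "ACCTGCTGGACGATCATA"

# Intermediate string built inside tokenize_function before tokenization.
seq_split = " ".join(kmers(seq, 6))
n_kmers = len(seq_split.split())

print(seq_split)  # ACCTGC TGGACG ATCATA
print(n_kmers)    # 3
```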

In [None]:
tokenized_datasets = splitted_datasets.map(tokenize_function, remove_columns='Seq', num_proc=8)
tokenized_datasets

In [None]:
tokenized_datasets['train'][0]['input_ids']

In [15]:
from itertools import chain
# Main data processing function: concatenates all tokenized sequences in a batch and
# splits the result into chunks of max_length tokens.
# Adapted from: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py

def group_texts(examples, max_length=512):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated_examples.items()
    }
    return result

chunked_datasets = tokenized_datasets.map(group_texts, batched=True, desc="Grouping texts in chunks of 512")
chunked_datasets

Grouping texts in chunks of 512:   0%|          | 0/264 [00:00<?, ?ba/s]

Grouping texts in chunks of 512:   0%|          | 0/30 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 858737
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 95417
    })
})
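
The row counts above can be reproduced with a little arithmetic, which also shows how much data the remainder drop in `group_texts` discards. The sketch rests on three assumptions that are not stated in the notebook: every sequence is exactly 10,000 bases long and therefore yields `10_000 // 6` k-mer tokens plus 2 special tokens; `map(..., batched=True)` processes 1,000 rows per batch (which matches the 264 and 30 batches in the progress bars); and every batch contains at least 512 tokens. The helper name `expected_chunks` is hypothetical.

```python
def expected_chunks(n_sequences, tokens_per_sequence, batch_size=1000, max_length=512):
    """Count the chunks produced when each batch is concatenated and cut into
    max_length-token pieces, dropping the remainder of every batch.

    Assumes each batch holds at least max_length tokens.
    """
    total = 0
    for start in range(0, n_sequences, batch_size):
        n_in_batch = min(batch_size, n_sequences - start)
        total += (n_in_batch * tokens_per_sequence) // max_length
    return total


# 1666 six-mers per 10,000-base sequence, plus 2 assumed special tokens.
tokens_per_sequence = 10_000 // 6 + 2

train_chunks = expected_chunks(263_659, tokens_per_sequence)
test_chunks = expected_chunks(29_296, tokens_per_sequence)

print(train_chunks, test_chunks)  # 858737 95417
```

Under these assumptions the result matches the `num_rows` values reported for both splits. Because the remainder is dropped once per batch rather than once per split, a few hundred tokens are lost at every batch boundary.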

In [33]:
chunked_datasets.push_to_hub("Human_DNA_v0_DNABert6tokenized")

Pushing split train to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/6 [00:00<?, ?it/s]

Pushing split test to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]