This repository contains utilities for generating DNA sequence embeddings with the DNA language models DNABERT (first generation, 6-mer), DNABERT-2, and NT-MS. These tools convert DNA sequences from FASTA files into numerical embeddings suitable for downstream analysis, using pretrained models from the Hugging Face Hub.
To use this repository, you need a Python installation managed by conda.
Clone this repository from GitHub.
git clone https://github.com/IsoformAnalysisGroup/dnaLM
Set up your conda environment and install the required packages.
conda create -n dnaLM python=3.8 # Create the conda environment
conda activate dnaLM # Activate the conda environment
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 # Install PyTorch with GPU support (CUDA 11.7)
python3 -m pip install -r requirements.txt # Install other requirements
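After installation, it can be useful to confirm that PyTorch is importable and whether it sees a GPU. The following is a minimal sketch and not part of the repository; the function name describe_environment is hypothetical. It relies only on torch.__version__ and torch.cuda.is_available(), and falls back to a message if PyTorch is missing.

```python
import importlib.util
import sys


def describe_environment():
    """Return a one-line summary of the Python and PyTorch setup."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    # Check for PyTorch without importing it, so the script also runs
    # in an environment where the installation step was skipped.
    if importlib.util.find_spec("torch") is None:
        return f"Python {python_version}; PyTorch is not installed"
    import torch

    # "cuda" means PyTorch can use a GPU; "cpu" means it cannot.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"Python {python_version}; PyTorch {torch.__version__}; device: {device}"


if __name__ == "__main__":
    print(describe_environment())
```

If the reported device is cpu on a machine with a GPU, the installed CUDA build may not match the local driver.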
Generate sequence embeddings from a FASTA file containing DNA sequences and save them to an HDF5 file. Example FASTA files with artificial sequences are included in the data/ directory. Try an example with 10 sequences:
python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10.h5
-f, --fasta: Input FASTA file (.fasta or .fasta.gz)
-o, --out: Output HDF5 file (.h5)
-m, --model: DNA language model to use (dnabert1, dnabert2, or nt_ms). Default: dnabert1
-bs, --batch_size: Batch size (default: 5)
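Because the input may be either plain or gzip-compressed FASTA, a reader has to choose the right opener. The sketch below shows one way to do that with the standard library only. It is an illustration, not the repository's actual parser: the function name read_fasta is hypothetical, and it assumes the accession is the first whitespace-delimited token of each header line.

```python
import gzip


def read_fasta(path):
    """Yield (accession, sequence) pairs from a plain or gzipped FASTA file."""
    # Pick the opener from the file extension; both are opened in text mode.
    opener = gzip.open if str(path).endswith(".gz") else open
    accession, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if accession is not None:
                    yield accession, "".join(chunks)
                # Assumption: accession = first token after ">".
                header = line[1:].split()
                accession, chunks = (header[0] if header else ""), []
            else:
                # Sequences may span several lines; collect and upper-case them.
                chunks.append(line.upper())
    # Emit the final record once the file is exhausted.
    if accession is not None:
        yield accession, "".join(chunks)
```

For example, dict(read_fasta("data/example_seqs_10.fasta")) would give a mapping from accession to sequence for the bundled example file.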
To use DNABERT-2 or NT-MS instead of the default model, pass the -m option:
python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10_dnabert2.h5 -m dnabert2
python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10_ntms.h5 -m nt_ms
Sequence-level embeddings are computed by mean pooling the token embeddings and are saved in HDF5 format. Each sequence's embedding is stored under its accession in the output file.
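The pooling step and the output layout can be illustrated with a short sketch using NumPy and h5py. This is not the repository's implementation, and the function names mean_pool and load_embeddings are hypothetical. Two assumptions are made: padding tokens are excluded from the average through an attention mask (the repository may instead average over all tokens), and each embedding is a top-level dataset named by its accession.

```python
import h5py
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens.

    token_embeddings: array of shape (batch, tokens, hidden)
    attention_mask:   array of shape (batch, tokens), 1 for real tokens, 0 for padding
    Returns an array of shape (batch, hidden).
    """
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token counts to at least 1 to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


def load_embeddings(path):
    """Return {accession: NumPy array} for every top-level dataset in the file."""
    with h5py.File(path, "r") as handle:
        return {
            name: handle[name][()]  # [()] reads the whole dataset into memory
            for name in handle.keys()
            if isinstance(handle[name], h5py.Dataset)
        }
```

Under these assumptions, load_embeddings("embeddings_10.h5") would return one vector per sequence in the example FASTA file, which can then be stacked into a matrix for downstream analysis.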