IsoformAnalysisGroup/dnaLM

dnaLM

This repository contains utilities for generating DNA sequence embeddings with the DNA language models DNABERT (first generation, 6-mer), DNABERT-2, and NT-MS. The tools convert DNA sequences from FASTA files into numerical embeddings suitable for downstream analysis, using pretrained models from the Hugging Face Hub.
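As background on the "6-mer" label: first-generation DNABERT represents a sequence as overlapping k-mers taken with a stride of one base. The sketch below illustrates that idea only. The helper name `kmer_tokens` is hypothetical and is not part of this repository, which presumably relies on each model's own tokenizer.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokens("ATGCGTAC")
print(" ".join(tokens))  # ATGCGT TGCGTA GCGTAC
```

A sequence of length n therefore produces n - k + 1 tokens, so the token count stays close to the sequence length.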

Getting started

To use this repository, you need a Python installation managed by conda.

1. Clone the repository

Clone this repository from GitHub.

git clone https://github.com/IsoformAnalysisGroup/dnaLM

2. Create environment

Set up your conda environment and install the required packages.

conda create -n dnaLM python=3.8                                                          # Create the conda environment
conda activate dnaLM                                                                      # Activate the conda environment
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117  # Install PyTorch with GPU support (CUDA 11.7)
python3 -m pip install -r requirements.txt                                                # Install the remaining requirements

3. Usage

Generate sequence embeddings from a FASTA file of DNA sequences and save them to an HDF5 file. Example FASTA files containing artificial sequences are included in the data/ directory. Try an example with 10 sequences:

python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10.h5
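To inspect an input file before embedding it, a FASTA file can be parsed with the standard library alone. This is a minimal sketch, not the repository's own reader: the function name `read_fasta` is hypothetical, and it assumes the accession is the first whitespace-delimited word of each header line.

```python
import gzip
from typing import Dict, List


def read_fasta(path: str) -> Dict[str, str]:
    """Return {accession: sequence} from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    chunks: Dict[str, List[str]] = {}
    accession = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Assumption: accession = first word after ">".
                parts = line[1:].split()
                accession = parts[0] if parts else ""
                chunks[accession] = []
            elif accession is not None:
                # Sequence lines may wrap; collect them per record.
                chunks[accession].append(line.upper())
    return {name: "".join(parts) for name, parts in chunks.items()}
```

For example, `len(read_fasta("data/example_seqs_10.fasta"))` returns the number of records found in the example file.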

Script arguments

  • -f, --fasta: Input FASTA file (.fasta or .fasta.gz)
  • -o, --out: Output HDF5 file (.h5)
  • -m, --model: DNA language model to use (dnabert1, dnabert2, or nt_ms). Default: dnabert1
  • -bs, --batch_size: Batch size (default: 5)
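The options above could be declared with `argparse` roughly as follows. This is an illustrative sketch rather than the script's actual source: the function name `build_parser`, the help strings, and the choice to mark `-f` and `-o` as required are assumptions. Only the flag names, model choices, and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Generate DNA sequence embeddings.")
    # Assumption: input and output paths are mandatory.
    parser.add_argument("-f", "--fasta", required=True,
                        help="Input FASTA file (.fasta or .fasta.gz)")
    parser.add_argument("-o", "--out", required=True,
                        help="Output HDF5 file (.h5)")
    parser.add_argument("-m", "--model", default="dnabert1",
                        choices=["dnabert1", "dnabert2", "nt_ms"],
                        help="DNA language model to use")
    parser.add_argument("-bs", "--batch_size", type=int, default=5,
                        help="Batch size")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-f", "data/example_seqs_10.fasta", "-o", "embeddings_10.h5"]
)
print(args.model, args.batch_size)  # dnabert1 5
```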

Example: Use DNABERT-2

python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10_dnabert2.h5 -m dnabert2

Example: Use NT-MS model

python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10_ntms.h5 -m nt_ms

Sequence-level embeddings are computed by mean pooling the token embeddings and are saved in HDF5 format. Each sequence's embedding is stored under its accession in the output file.
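The pooling step and a way to read the results back can be sketched as below. Both functions are hypothetical helpers, not code from this repository. `mean_pool` assumes padding positions are excluded through an attention mask, which the description above does not confirm, and `load_embeddings` assumes each accession is a top-level dataset in the HDF5 file.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: shape (batch, tokens, hidden)
    attention_mask:   shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token count to at least 1 to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1, None)
    return summed / counts


def load_embeddings(path: str) -> dict:
    """Read every top-level dataset of an HDF5 file into {accession: array}."""
    import h5py  # imported here so mean_pool can be used without h5py installed

    with h5py.File(path, "r") as handle:
        return {accession: handle[accession][()] for accession in handle.keys()}


# One sequence with two real tokens and one padding token.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]]
```

After running one of the commands above, `load_embeddings("embeddings_10.h5")` would return a dictionary keyed by accession, provided the layout assumption holds.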
