dnaLM

This repository contains utilities for generating DNA sequence embeddings with the DNA language models DNABERT (first generation, 6-mer), DNABERT-2, and NT-MS. The tools convert DNA sequences from FASTA files into numerical embeddings suitable for downstream analysis, using pretrained models from the Hugging Face Hub.
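The "6-mer" label on the first-generation DNABERT refers to its input representation: a sequence is split into overlapping words of six nucleotides before being passed to the model. The sketch below illustrates that splitting. The function name `seq_to_kmers` is hypothetical, and whether this repository performs the step itself or delegates it to the tokenizer is an assumption left open here.

```python
from typing import List


def seq_to_kmers(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with a stride of 1.

    Hypothetical helper for illustration; not taken from this repository.
    """
    seq = seq.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


kmers = seq_to_kmers("ATGCGTAC")
print(" ".join(kmers))  # ATGCGT TGCGTA GCGTAC
```

A sequence of length n produces n - k + 1 tokens, so an 8-nucleotide input gives three 6-mers.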

Getting started

To use this repository, you need a Python installation managed by conda.

1. Clone the repository

Clone this repository from GitHub.

git clone https://github.com/IsoformAnalysisGroup/dnaLM

2. Create environment

Set up your conda environment and install the required packages.

conda create -n dnaLM python=3.8                                                          # Create the conda environment
conda activate dnaLM                                                                      # Activate the conda environment
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117  # Install PyTorch with GPU support (CUDA 11.7)
python3 -m pip install -r requirements.txt                                                # Install the other requirements

3. Usage

Generate sequence embeddings from a FASTA file containing DNA sequences and save them to an HDF5 file. Example FASTA files with artificial sequences are included in the data/ directory. Try an example with 10 sequences:

python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10.h5
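To make the input side of this command concrete, here is a minimal sketch of reading a FASTA file, plain or gzipped, into a mapping from accession to sequence. The function name `read_fasta` is hypothetical, and treating the first word of each header line as the accession is an assumption; the repository's own parser may differ.

```python
import gzip
from typing import Dict, Optional


def read_fasta(path: str) -> Dict[str, str]:
    """Return {accession: sequence} from a plain or gzipped FASTA file.

    Hypothetical helper for illustration; not taken from this repository.
    """
    # Pick the opener from the file extension; both are used in text mode.
    opener = gzip.open if path.endswith(".gz") else open
    records: Dict[str, str] = {}
    accession: Optional[str] = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the accession is the first whitespace-delimited
                # word of the header line.
                words = line[1:].split()
                accession = words[0] if words else ""
                records[accession] = ""
            elif accession is not None:
                # Sequences may span several lines; join them.
                records[accession] += line.upper()
    return records
```

For example, `read_fasta("data/example_seqs_10.fasta")` would return a dictionary whose keys could later serve as names for the stored embeddings.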

Script arguments

-f, --fasta: Input FASTA file (.fasta or .fasta.gz)
-o, --out: Output HDF5 file (.h5)
-m, --model: DNA language model to use (dnabert1, dnabert2, or nt_ms). Default: dnabert1
-bs, --batch_size: Batch size (default: 5)
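The options above map naturally onto an `argparse` parser. The following is a sketch of how such a command-line interface could be declared, not the script's actual source: the function name `build_parser`, the help strings, and the choice to mark `-f` and `-o` as required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Generate DNA sequence embeddings from a FASTA file."
    )
    # Assumption: input and output paths are mandatory.
    parser.add_argument("-f", "--fasta", required=True,
                        help="Input FASTA file (.fasta or .fasta.gz)")
    parser.add_argument("-o", "--out", required=True,
                        help="Output HDF5 file (.h5)")
    parser.add_argument("-m", "--model", default="dnabert1",
                        choices=["dnabert1", "dnabert2", "nt_ms"],
                        help="DNA language model to use")
    parser.add_argument("-bs", "--batch_size", type=int, default=5,
                        help="Number of sequences per batch")
    return parser


args = build_parser().parse_args(
    ["-f", "data/example_seqs_10.fasta", "-o", "embeddings_10.h5"]
)
print(args.model, args.batch_size)  # dnabert1 5
```

Restricting `--model` with `choices` makes a misspelled model name fail immediately with a usage message rather than later during model loading.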

Example: Use DNABERT-2

python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10_dnabert2.h5 -m dnabert2

Example: Use NT-MS model

python scr/generate_embedding.py -f data/example_seqs_10.fasta -o embeddings_10_ntms.h5 -m nt_ms

Sequence-level embeddings are computed by mean pooling the token embeddings and are saved in HDF5 format. Each sequence's embedding is stored under its accession in the output file.
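Mean pooling reduces a variable-length list of token vectors to one fixed-length vector per sequence by averaging each dimension across tokens. The sketch below shows the idea in dependency-free Python. The function name `mean_pool` is hypothetical, and excluding padded positions through an attention mask is an assumption; this README does not state whether the script masks padding or special tokens. In practice the same computation would usually be done on PyTorch tensors.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero.

    Hypothetical helper for illustration; not taken from this repository.
    """
    # Keep only the positions marked as real tokens (assumed masking scheme).
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    # Average each embedding dimension over the kept tokens.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row stands in for padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padded row would pull the average toward zero, so sequences of different lengths in the same batch would not be comparable.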

