SIAT-HPCC/gene-sequence-clustering

Introduction

nGIA is an accurate and fast gene sequence clustering tool. It uses a greedy incremental clustering algorithm: sequence similarity is determined by alignment, and filters are applied before alignment to avoid unnecessary comparisons. All algorithms are implemented on the GPU, which makes clustering extremely fast.
The oneAPI version is no longer maintained.
Only NVIDIA RTX 30-series or newer GPUs are supported.
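
As a rough illustration of the greedy incremental approach described above, the following CPU-only Python sketch visits sequences one at a time, compares each against the existing cluster representatives, applies a cheap filter before the more expensive comparison, and either joins an existing cluster or starts a new one. The k-mer filter, the `difflib`-based similarity (a stand-in for real alignment), the longest-first ordering, and all function names are assumptions made for illustration; none of them are taken from the nGIA source.

```python
from difflib import SequenceMatcher


def kmer_set(seq, k=4):
    """Return the set of k-mers in seq (used by the cheap pre-comparison filter)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_filter(a, b, threshold, k=4):
    """Assumed filter: reject pairs that share too few k-mers."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return True  # too short to filter; fall through to the full comparison
    shared = len(ka & kb) / min(len(ka), len(kb))
    return shared >= threshold / 2  # deliberately loose bound (assumption)


def similarity(a, b):
    """Stand-in for alignment-based identity; NOT the alignment used by nGIA."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.95):
    """seqs maps name -> sequence. Returns representative name -> member names."""
    clusters = {}
    # Longest first, so longer sequences become representatives (assumed ordering).
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if (passes_filter(seqs[rep], seqs[name], threshold)
                    and similarity(seqs[rep], seqs[name]) >= threshold):
                clusters[rep].append(name)  # redundant: joins rep's cluster
                break
        else:
            clusters[name] = []  # no similar representative: new cluster
    return clusters
```

The filter only decides whether the costlier comparison runs at all, which is the role the introduction assigns to filtering before alignment.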

Install

  1. Install MPICH.

Install MPICH, then copy its headers to "/usr/local/include" and its libraries to "/usr/local/lib". On Ubuntu, it can be installed with the package manager:

sudo apt install mpich

  2. Install CUDA and the GPU driver.

CUDA can be downloaded from https://developer.nvidia.com/cuda-downloads.

  3. Compile.

Compile with make:

cd cuda/makeDB && make && cd -
cd cuda/cluster && make && cd -

Usage

  1. Make a database.
cd cuda/makeDB && ./makeDB -f ../../data/gene.fasta -p ../../data/gene.packed -t 0 && cd -
  • -f: fasta file (gene or protein sequences)
  • -p: packed file (generated by makeDB)
  • -t: data type (0: gene, 1: protein)
  2. Run clustering.
cd cuda/cluster && mpirun -n 1 ./cluster -p ../../data/gene.packed -r ../../data/result.txt -s 0.95 -m 0 && cd -
  • -p: packed file (generated by makeDB)
  • -r: result file (generated by cluster)
  • -s: similarity threshold (for example, 0.95)
  • -m: mode (0: fast, 1: precise)
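
The two steps above can also be driven from a script. The sketch below is one possible Python wrapper: it only assembles the same command lines shown above, using the documented flags, and runs each tool from its own directory. The function names and the `repo_root` parameter are hypothetical, and the directory layout is assumed to match the repository paths used in the commands above.

```python
import subprocess


def makedb_command(fasta, packed, data_type=0):
    """Build the makeDB command line (-t: 0 for gene, 1 for protein)."""
    return ["./makeDB", "-f", fasta, "-p", packed, "-t", str(data_type)]


def cluster_command(packed, result, similarity=0.95, mode=0):
    """Build the cluster command line, launched with a single MPI process
    as in the README example (-m: 0 for fast, 1 for precise)."""
    return ["mpirun", "-n", "1", "./cluster",
            "-p", packed, "-r", result,
            "-s", str(similarity), "-m", str(mode)]


def run_pipeline(fasta, packed, result, repo_root=".", similarity=0.95, mode=0):
    """Run makeDB, then cluster. Relative paths are resolved from each
    tool's directory, so pass them as in the README (e.g. ../../data/...)."""
    subprocess.run(makedb_command(fasta, packed),
                   cwd=f"{repo_root}/cuda/makeDB", check=True)
    subprocess.run(cluster_command(packed, result, similarity, mode),
                   cwd=f"{repo_root}/cuda/cluster", check=True)
```

For example, `run_pipeline("../../data/gene.fasta", "../../data/gene.packed", "../../data/result.txt")` would issue the same two commands as the manual steps; `check=True` makes the script stop if either tool exits with an error.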

Result

The generated result file will look like this:

>sequence1
ACGT
  >sequence2
  >sequence3

Lines without leading spaces describe a non-redundant sequence: the first line starts with '>' and gives the sequence name, and the second line gives the sequence content. The third and fourth lines begin with spaces, which marks them as redundant sequences that are similar to the sequence named on the first line. Redundant sequences are listed by name only, without sequence content.
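
To make the format concrete, here is a minimal Python sketch of a parser for it, based only on the description above. The function name `parse_clusters` and the returned dictionary layout are hypothetical. It also assumes that sequence content could span more than one unindented line, which the example does not show.

```python
def parse_clusters(lines):
    """Parse result-file lines into one dict per non-redundant sequence.

    Each dict has "name", "sequence", and "members" (names of the
    redundant sequences listed under that representative).
    """
    clusters = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if line[0].isspace():
            # Indented line: a redundant sequence, listed by name only.
            if not clusters:
                raise ValueError("redundant sequence before any representative")
            clusters[-1]["members"].append(line.strip().lstrip(">"))
        elif line.startswith(">"):
            # Unindented '>' line: name of a new non-redundant sequence.
            clusters.append({"name": line[1:].strip(), "sequence": "", "members": []})
        else:
            # Unindented non-'>' line: content of the current representative.
            if not clusters:
                raise ValueError("sequence content before any sequence name")
            clusters[-1]["sequence"] += line.strip()
    return clusters
```

Because the function accepts any iterable of lines, an open file can be passed directly, for example `with open("data/result.txt") as fh: clusters = parse_clusters(fh)`.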

Citing

Please cite the following publication if you use nGIA:

Ju Z, Zhang H, Meng J, et al. nGIA: A novel Greedy Incremental Alignment based algorithm for gene sequence clustering[J]. Future Generation Computer Systems, 2022, 136: 221-230.
