Meta-Prism 2.0

Introduction

Microbial community samples are accumulating faster than ever, with hundreds of thousands of samples sequenced each year. Mining such a large volume of multi-source, heterogeneous data is increasingly difficult. Among the data mining bottlenecks, efficient and accurate sample search is one of the most prominent: faced with millions of samples in a data repository, traditional sample comparison and search approaches fall short in speed and accuracy.

Here we propose Meta-Prism 2.0, a microbial community sample search method based on smart pair-wise sample comparison, which pushes time and memory efficiency to a new limit without compromising accuracy. Based on an index-free, memory-saving data structure, a time-saving instruction pipeline, and boost scheme optimization, Meta-Prism 2.0 enables ultra-fast, memory-efficient, flexible and accurate search among millions of samples. By default it uses the SILVA phylogenetic tree and supports EBI MGnify data; custom evolutionary trees are also supported. It can calculate the similarity matrix of microbiome samples and search a microbiome database. Because its computational performance is much improved over Meta-Prism 1.0, version 2.0 does not need an index system and instead performs an exhaustive search.

Meta-Prism 2.0 has been tested on several datasets, the largest containing one million samples. The results show three things. First, as a distance-based method, Meta-Prism 2.0 is faster not only than other distance-based methods but also than unsupervised methods: its search speed of 0.00001 s per sample pair and its memory need of 8 GB for searching against one million samples make it the most efficient method for sample comparison. Second, its comparison accuracy and search precision are comparable to or better than those of other contemporary methods. Third, Meta-Prism 2.0 can precisely identify the biome of origin for samples, thus enabling sample source tracking.

In summary, Meta-Prism 2.0 turns resource-intensive sample search into an efficient procedure that researchers can run every day, even on a laptop, for insightful sample search and knowledge discovery. Our work is available as a preprint on bioRxiv.

Requirement

A C++ compiler that supports C++11.

Installation

To clone the repository, run

git clone https://github.com/HUST-NingKang-Lab/Meta-Prism-2.0.git

To install, enter the repository folder and run

make

To uninstall, run

make clean

QuickStart

The commands below get you started with Meta-Prism 2.0. We provide example datasets to help you start, but please note that this tutorial requires wget and an installed copy of Meta-Prism 2.0.

Download the example datasets

wget -c https://github.com/HUST-NingKang-Lab/Meta-Prism-2.0/releases/download/datas/Skin_samples.ascii_packaged.pdata
wget -c https://github.com/HUST-NingKang-Lab/Meta-Prism-2.0/releases/download/datas/Oral_samples.ascii_packaged.pdata

Download the phylogenetic tree

wget -c https://github.com/HUST-NingKang-Lab/Meta-Prism-2.0/releases/download/datas/LTPs132_SSU_tree.txt

Search one dataset against another in search mode

./bin/Meta-Prism2.0 --tree LTPs132_SSU_tree.txt -l ascii Skin_samples.ascii_packaged.pdata -s ascii Oral_samples.ascii_packaged.pdata 5 -o Skin_against_Oral.txt

N-against-N comparison in matrix mode

./bin/Meta-Prism2.0 -m --tree LTPs132_SSU_tree.txt -l ascii Oral_samples.ascii_packaged.pdata -o Oral_SimMatrix.txt

Resources

We provide the SILVA phylogenetic tree, example taxa abundance data, and example ascii packaged data.

Data	Example
Phylogenetic tree	at SILVA (with header) or our local backup (without header).
taxa abundance data	at MGnify or our local backup.
ascii packaged data	100 samples or 1000 samples

Datasets used in our study:

Dataset	Description	Source	Link	Format
Combined dataset	126,727 samples curated from MGnify database	MGnify	download	packaged (binary)
FEAST dataset	10,270 samples curated from MGnify database, also used in FEAST study	MGnify	download	packaged (ascii)
skin samples	1,261 skin metagenomic samples	MGnify	download	packaged (ascii)
oral samples	70 oral metagenomic samples	MGnify	download	packaged (ascii)
groundwater samples	11 Query samples curated from groundwater in Saudi Arabia	MGnify	download	abundance matrix
dataset for detecting contaminations	611 Query samples from indoor house surfaces + 6,285 Target samples	MGnify	download	Run IDs list

Their related metadata are available from the MGnify database.

Detailed Usage

Prepare phylogeny tree:

Meta-Prism 2.0 needs a phylogenetic tree in Newick format as the basis of calculation. We recommend the SILVA tree, but you can also supply a tree built from your sequencing results or suited to your use case. Use -t [path] or --tree [path] to set the phylogenetic tree path.

Load data:

Meta-Prism 2.0 accepts SSU abundance data files from EBI MGnify and other databases as input, and provides a dedicated packaged data format in two types, ascii and binary, to store abundance more efficiently. Note that packaged data is bound to a specific phylogenetic tree.

--load(-l) [list|OTU|ascii|binary] [path]

Package data:

Meta-Prism 2.0 can package taxa data in ascii or binary format for space-efficient storage and fast loading. The ascii format is easy to read and edit, while the binary format is more efficient in space and time.

--package(-p) [ascii|binary] [path]

For example:

./bin/Meta-Prism2.0 --tree [TreePath] -l list [PathListFile] -p binary [PdataPath]

PathListFile is a file that contains the absolute paths of all the TSV files you want to package, one per line. This command loads these TSV files, maps each abundance onto the phylogenetic tree, and packages the data in binary format to PdataPath. Note that the packaged data is bound to that phylogenetic tree and cannot be used with a different one.
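A path list file can be written by hand, but for many samples a small script is convenient. The following is a minimal sketch, assuming all TSV files sit in one directory and end in .tsv; the function name write_path_list is hypothetical and not part of Meta-Prism 2.0.

```python
from pathlib import Path


def write_path_list(tsv_dir, list_file):
    """Write the absolute path of every *.tsv file in tsv_dir, one per line.

    Assumption: samples are stored as individual .tsv files in a single
    directory. Returns the list of paths that were written.
    """
    paths = sorted(p.resolve() for p in Path(tsv_dir).glob("*.tsv"))
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, write_path_list("samples", "PathListFile.txt") would produce a file that can be passed to -l list.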

Calculate similarity matrix:

Meta-Prism 2.0 can load microbiome sample data and calculate the similarity matrix between samples.

--matrix(-m) ## Calculate similarity matrix
--output(-o) [path] ## Result output path
--threads(-T) [Number] ## Threads for calculation

For example:

./bin/Meta-Prism2.0 --tree [TreePath] -l OTU [OTUPath] -m -o [ResultPath]

This command opens the OTU file, reads all samples, calculates the similarity matrix, and writes it to ResultPath.

Database search:

Meta-Prism 2.0 can search one or more query samples against another group of microbial samples and return, for each query, the names and similarity values of the top N most similar samples. The relevant arguments are:

--search(-s) [single|list|OTU|ascii|binary] [search sample path] [select number|f] ## Load and search samples, then return the top N (default=5) most similar samples; the character 'f' returns the full result as a matrix
--output(-o) [path] ## Result output path
--threads(-T) [Number] ## Threads for calculation

For example:

./bin/Meta-Prism2.0 --tree [TreePath] -l ascii [PdataPath] -s single [FilePath] 3 -o [ResultPath]

This command loads the samples in the ascii packaged file PdataPath as the database, searches it with the single sample in FilePath, and writes the three best-matched samples to ResultPath.
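When search runs are scripted, it can help to assemble the argument list in code. The sketch below only arranges the flags documented above (--tree, -l, -s, -o, -T) in the order used by the examples; the function build_search_command and its parameter names are hypothetical, and the binary path is passed in by the caller.

```python
LOAD_FORMATS = {"list", "OTU", "ascii", "binary"}
SEARCH_FORMATS = {"single"} | LOAD_FORMATS


def build_search_command(binary, tree, db_format, db_path,
                         query_format, query_path, top=5,
                         output=None, threads=None):
    """Return the argv list for a search-mode run (hypothetical helper)."""
    if db_format not in LOAD_FORMATS:
        raise ValueError(f"unknown load format: {db_format}")
    if query_format not in SEARCH_FORMATS:
        raise ValueError(f"unknown search format: {query_format}")
    # 'f' asks for the full result matrix; otherwise top must be a positive N.
    if str(top) != "f" and int(top) < 1:
        raise ValueError("top must be a positive integer or 'f'")
    cmd = [binary, "--tree", tree, "-l", db_format, db_path,
           "-s", query_format, query_path, str(top)]
    if output is not None:
        cmd += ["-o", output]
    if threads is not None:
        cmd += ["-T", str(threads)]
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True) from the Python standard library.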

Formats

Format of phylogeny tree

Newick format: (left_node: Phylogeny_distance, right_node: Phylogeny_distance)

Both the left and the right node can themselves be nested subtrees. For example

((NodeA:0.2,(NodeB:0.1,NodeC:0.2):0.2):0.3,(((NodeD:0.15,NodeE:0.23):0.66,NodeF:0.4)NodeH:0.7,NodeG:0.5):0.2)
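To make the nesting concrete, here is a minimal recursive-descent sketch that reads a tree like the one above and sums branch lengths from the root to each leaf. It is an illustration only, not the parser used by Meta-Prism 2.0; it assumes unquoted names, no comments, and treats a missing branch length as 0. The names parse_newick and leaf_depths are hypothetical.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while pos < len(text) and text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                      # skip ")"
        start = pos                       # optional node name
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos].strip()
        length = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos                   # branch length after ":"
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return root


def leaf_depths(tree, depth=0.0):
    """Map each leaf name to the summed branch length from the root."""
    depth += tree["length"]
    if not tree["children"]:
        return {tree["name"]: depth}
    depths = {}
    for child in tree["children"]:
        depths.update(leaf_depths(child, depth))
    return depths
```

Applied to the example tree, leaf_depths(parse_newick(...)) gives one entry per leaf (NodeA to NodeG); NodeH is an internal node and therefore does not appear.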

Format of taxa abundance data (for a microbial community sample)

Each line contains an OTU ID, its abundance, and its taxonomy, separated by tabs.

# Constructed from biom file
# OTU ID	ERR1754760	taxonomy
207119	19.0	sk__Archaea
118090	45.0	sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitrosopumilales;f__Nitro...
153156	38.0	sk__Archaea;k__;p__Thaumarchaeota;c__;o__Nitrosopumilales;f__Nitro...
131704	1.0	sk__Archaea;k__;p__Thaumarchaeota;c__Nitrososphaeria;o__Nitrososp...
103181	5174.0	sk__Bacteria
157361	9.0	sk__Bacteria;k__;p__;c__;o__;f__;g__;s__agricultural_soil_bacterium_SC-I-11
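For inspecting such a file outside Meta-Prism 2.0, a minimal sketch of a reader might look as follows. It assumes tab-separated columns in the order shown above and treats lines starting with '#' as comments; parse_abundance_tsv and total_abundance are hypothetical helper names.

```python
def parse_abundance_tsv(text):
    """Return a list of (otu_id, abundance, taxonomy_ranks) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the '#' header lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected 3 tab-separated fields: {line!r}")
        otu_id, abundance, taxonomy = fields[:3]
        # Taxonomy ranks are separated by ';', e.g. sk__Archaea;k__;p__...
        records.append((otu_id, float(abundance), taxonomy.split(";")))
    return records


def total_abundance(records):
    """Sum of all abundance values in one sample."""
    return sum(abundance for _, abundance, _ in records)
```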

Format of abundance matrix data

A matrix of relative abundances, with one row per taxonomy and one column per sample. For example

#Taxonomy     Sample1     Sample2     Sample3
sk__Bacteria;k__;p__Actinobacteria;c__Actinobacteria        0.3     0.1     0.5
sk__Bacteria;k__;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Actinotignum     0.4     0.9     0
sk__Bacteria;k__;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales;f__Xanthomonadaceae        0.3     0       0.5
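A minimal sketch of reading this matrix into per-sample dictionaries is given below. It assumes whitespace-separated columns, as in the example, and that the first non-blank line is the header; parse_abundance_matrix is a hypothetical name.

```python
def parse_abundance_matrix(text):
    """Return {sample_name: {taxonomy: relative_abundance}}."""
    samples, table = None, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if samples is None:
            # Header line: "#Taxonomy Sample1 Sample2 ..."
            samples = line.lstrip("#").split()[1:]
            table = {name: {} for name in samples}
            continue
        # Split from the right so the taxonomy keeps any inner spaces.
        parts = line.rsplit(None, len(samples))
        if len(parts) != len(samples) + 1:
            raise ValueError(f"wrong number of columns: {line!r}")
        taxonomy = parts[0].strip()
        for name, value in zip(samples, parts[1:]):
            table[name][taxonomy] = float(value)
    return table
```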

Format of ascii packaged data

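Samples are delimited by '{' and '}'. Each sample contains its name, followed by a line of the form count:id abundance,id abundance,... that gives the number of entries and then pairs of taxon id in the phylogenetic tree (identical to the taxon name) and relative abundance. For example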

#Generated By Prism2.0,Data number: 2
{
MGYS00002173-SRR2086747.tsv
98:557 4.06554e-05,816 0.0060983,2968 4.06554e-05,3617 0.00422816,4050 4.06554e-05,4158 0.000243932,4229 0.00959467,4494 0.0446802,4610 0.00195146,4616 0.000162621,4621 4.06554e-05,4662 4.06554e-05,4686 0.000121966,4756 0.000284588,5231 4.06554e-05,5478 8.13107e-05,5561 0.000731797,5818 0.00166687,6464 4.06554e-05,8678 0.000121966,9419 4.06554e-05,9422 0.00060983,9563 0.069602,9633 0.000487864,9634 0.00126032,9692 0.133675,9725 0.0473228,9726 0.0133756,9925 0.000569175,9999 0.044355,10095 0.000365898,10515 0.0015449,11693 0.00378095,12211 0.053909,12245 4.06554e-05,12262 0.00138228,12268 4.06554e-05,12274 0.00317112,12276 0.000243932,12312 0.000121966,12424 0.113672,12441 4.06554e-05,12479 0.0197585,12507 0.0566329,12627 4.06554e-05,12645 0.0921657,12862 0.000121966,13867 0.00390291,13939 4.06554e-05,14921 0.0015449,14971 0.000325243,14984 0.000650486,14988 0.000203277,14995 8.13107e-05,15047 4.06554e-05,15932 0.00134163,16149 0.015571,18277 0.00556978,20690 0.000447209,20797 0.00536651,20848 0.000325243,20940 8.13107e-05,20955 8.13107e-05,20976 0.000487864,21023 0.0207749,21609 8.13107e-05,21627 0.00101638,21646 4.06554e-05,21988 0.0373216,22051 0.00101638,22101 0.000121966,22118 0.00113835,22119 0.00439078,22400 0.000121966,22541 0.000243932,22581 0.00898484,22780 0.000121966,24879 0.00162621,24940 0.0045534,25074 4.06554e-05,25144 0.0123592,25537 0.00430947,25593 0.000975729,25597 4.06554e-05,25737 0.00150425,25800 0.000691141,25939 0.00187015,25970 0.00357767,25972 0.000203277,25996 0.000162621,26023 0.00573241,26078 0.000691141,26123 0.0428508,26134 0.000121966,26399 0.00101638,26575 0.0079278,26699 0.00939139,26810 0.0603326,
}
{
MGYS00002173-SRR2086906.tsv
28:4050 0.0247525,4158 0.618812,4610 0.00990099,9286 0.019802,9422 0.00495049,9563 0.039604,9692 0.019802,9725 0.0148515,9726 0.0148515,9999 0.00990099,12211 0.00495049,12239 0.00495049,12274 0.00495049,12424 0.039604,12507 0.0247525,12645 0.0445545,13867 0.00990099,20976 0.00495049,21023 0.00495049,21627 0.00495049,21988 0.00990099,22119 0.00495049,22541 0.00495049,22581 0.00495049,26023 0.00495049,26123 0.029703,26575 0.00990099,26810 0.00495049,
}
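The layout above is simple enough to read with a few lines of code. The following sketch is inferred from the example only, not from the Meta-Prism 2.0 source: it assumes that the number before ':' is the count of id/abundance pairs and that taxon ids are integers. The name parse_ascii_pdata is hypothetical.

```python
def parse_ascii_pdata(text):
    """Return {sample_name: {taxon_id: relative_abundance}}."""
    samples = {}
    lines = iter(text.splitlines())
    for line in lines:
        if line.strip() != "{":
            continue  # header line ('#...'), closing '}' or blank line
        name = next(lines, "").strip()
        count, _, body = next(lines, "").strip().partition(":")
        pairs = {}
        for item in body.split(","):
            if item.strip():  # the data line ends with a trailing comma
                taxon_id, abundance = item.split()
                pairs[int(taxon_id)] = float(abundance)
        # Assumption from the example: the leading number is the pair count.
        if len(pairs) != int(count):
            raise ValueError(f"{name}: expected {count} pairs, got {len(pairs)}")
        samples[name] = pairs
    return samples
```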

Format of binary packaged data

The binary packaged format closely mirrors the ascii packaged format. The structure is (little endian):

20 Byte char Source
4 Byte float Version
8 Byte unsigned_int Size # Count of samples

# At each sample:
38 Byte char Sample_Name
8 Byte unsigned_int SampleSize # Count of sample's abundance
Repeat of 4 Byte unsigned_int Taxa_ID and 4 Byte float abundance pair
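As an illustration of this layout, the sketch below decodes a byte string with Python's struct module. It takes the field sizes listed above literally and assumes that there are no padding bytes between fields and that names are NUL-padded; neither assumption has been checked against files written by Meta-Prism 2.0. The name read_binary_pdata is hypothetical.

```python
import struct

HEADER = struct.Struct("<20sfQ")  # Source, Version, Size (count of samples)
SAMPLE = struct.Struct("<38sQ")   # Sample_Name, SampleSize
PAIR = struct.Struct("<If")       # Taxa_ID, abundance


def _text(raw):
    """Decode a fixed-width char field, dropping NUL padding (assumed)."""
    return raw.split(b"\0", 1)[0].decode("ascii", "replace")


def read_binary_pdata(data):
    """Return (source, version, {sample_name: {taxa_id: abundance}})."""
    source, version, n_samples = HEADER.unpack_from(data, 0)
    offset = HEADER.size
    samples = {}
    for _ in range(n_samples):
        raw_name, n_pairs = SAMPLE.unpack_from(data, offset)
        offset += SAMPLE.size
        pairs = {}
        for _ in range(n_pairs):
            taxa_id, abundance = PAIR.unpack_from(data, offset)
            offset += PAIR.size
            pairs[taxa_id] = abundance
        samples[_text(raw_name)] = pairs
    return _text(source), version, samples
```

A file would be read with open(path, "rb").read() and the resulting bytes passed to read_binary_pdata. If the decoded names or counts look wrong, padding in the real files is the first thing to check.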

Format of matrix mode result

The result of a matrix mode calculation is written as ascii text. The structure is

N # count of samples
SampleName1 SampleName2 SampleName3 ... SampleNameN # N samples' name
value11 value12 ... value1N # first line of similarity matrix
...
valueN1 valueN2 ... valueNN # last line of similarity matrix
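A minimal sketch for loading this output is shown below. It assumes whitespace-separated fields and sample names without spaces, and it treats the '#' annotations above as documentation rather than file content; parse_matrix_result is a hypothetical name.

```python
def parse_matrix_result(text):
    """Return (sample_names, rows) where rows is an N x N list of floats."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])          # first line: count of samples
    names = lines[1].split()              # second line: N sample names
    rows = [[float(v) for v in line.split()] for line in lines[2:2 + n]]
    if len(names) != n or len(rows) != n or any(len(r) != n for r in rows):
        raise ValueError("matrix shape does not match the declared count")
    return names, rows
```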

Format of search mode result

The top N result of a search mode calculation is written as ascii text. The structure is

QuerySample1: 1st_Matched_Target_Sample:Similarity, 2nd_Matched_Target_Sample:Similarity, ... Nth_Matched_Target_Sample:Similarity,
QuerySample2: 1st_Matched_Target_Sample:Similarity, 2nd_Matched_Target_Sample:Similarity, ... Nth_Matched_Target_Sample:Similarity,
...
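The top N output can be turned into a dictionary with a short parser. This sketch follows the schematic above and assumes that query names contain no ':' and target names contain no ','; the exact separators in real output files should be checked before relying on it. The name parse_top_n_result is hypothetical.

```python
def parse_top_n_result(text):
    """Return {query_name: [(target_name, similarity), ...]} in file order."""
    results = {}
    for line in text.splitlines():
        query, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # not a result line
        hits = []
        for item in rest.split(","):
            item = item.strip()
            if not item:
                continue  # trailing comma at the end of the line
            # Split on the last ':' so the similarity is always the tail.
            target, _, similarity = item.rpartition(":")
            hits.append((target.strip(), float(similarity)))
        results[query.strip()] = hits
    return results
```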

The full result of a search mode calculation is written as ascii text. The structure is

X # Count of Query samples
TargetSampleName1 TargetSampleName2 ... TargetSampleNameN
QuerySample1 SimilarityValue11 SimilarityValue12 ... SimilarityValue1N
QuerySample2 SimilarityValue21 SimilarityValue22 ... SimilarityValue2N
...
QuerySampleX SimilarityValueX1 SimilarityValueX2 ... SimilarityValueXN
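The full result contains every query-target similarity, so a top N list can be derived from it afterwards. The sketch below parses the full result and picks the N highest values per query with heapq.nlargest from the Python standard library. It assumes whitespace-separated fields and names without spaces; parse_full_result and top_hits are hypothetical names, and the tie-breaking order may differ from what Meta-Prism 2.0 itself reports.

```python
import heapq


def parse_full_result(text):
    """Return {query_name: {target_name: similarity}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_queries = int(lines[0].split()[0])   # first line: count of queries
    targets = lines[1].split()             # second line: target names
    table = {}
    for line in lines[2:2 + n_queries]:
        fields = line.split()
        query, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(targets):
            raise ValueError(f"{query}: expected {len(targets)} values")
        table[query] = dict(zip(targets, values))
    return table


def top_hits(table, n=5):
    """For each query, the n most similar targets, best first."""
    return {query: heapq.nlargest(n, row.items(), key=lambda kv: kv[1])
            for query, row in table.items()}
```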

Contact

Feel free to contact us if you have any questions. Thank you for using Meta-Prism 2.0.

Name	Email	Organization
Kai Kang	kang_kai_kk@icloud.com	Research Assistant, School of Life Science and Technology, Huazhong University of Science & Technology
Hui Chong	huichong.me@gmail.com	Research Assistant, School of Life Science and Technology, Huazhong University of Science & Technology
Kang Ning	ningkang@hust.edu.cn	Professor, School of Life Science and Technology, Huazhong University of Science & Technology
