Skip to content

Finn-Lab/Tax-Benchmarking

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

87 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Benchmarking taxonomic assignment pipelines

Scripts for benchmarking taxonomic analysis tools and databases used in the following publication:

A Almeida, AL Mitchell, A Tarkowska, RD Finn (2018). Benchmarking taxonomic assignments based on 16S rRNA gene profiling of the microbiota from commonly sampled environments. GigaScience, Volume 7, Issue 5, 1 May 2018.

Requirements:

  • Python 2.7
  • BioPython
  • R 3.4.1
  • R libraries: ggplot2, phyloseq, vegan, scales, grid, ape, RColorBrewer, data.table

Python_scripts / random_species.py

Python script for randomly selecting a set number of species from a taxonomy file.

Usage:

python random_species.py -i [tax.file] -s [number of species] -m [max species per genera]

Options:
-i TAX Input taxonomy file (see example below)
-s SPECIES Number of species to select
-m MAXSPECIES Maximum number of species allowed per genera

The taxonomy file should consist of two columns: the first with the sequence identifier and the second with the full taxonomy lineage.

e.g.

AWVO01000035.3882.5400	sk__Bacteria;k__;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Oscillospiraceae;g__Oscillibacter;s__Oscillibacter_sp._KLE_1745
ADLK01000048.472.1991	sk__Bacteria;k__;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Lachnoclostridium;s__[Clostridium]_citroniae
CVPA01000001.8833.10362	sk__Bacteria;k__;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus_sp._Co1-6
KM360064.1.1408	sk__Bacteria;k__;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Porphyromonas;s__Porphyromonas_katsikii
JUZZ01000099.289.1808	sk__Bacteria;k__;p__Proteobacteria;c__Gammaproteobacteria;o__Pasteurellales;f__Pasteurellaceae;g__Haemophilus;s__Haemophilus_parainfluenzae
ADCW01000016.45.1587	sk__Bacteria;k__;p__Firmicutes;c__Negativicutes;o__Veillonellales;f__Veillonellaceae;g__Veillonella;s__Veillonella_sp._6_1_27

Python_scripts / mutate_seqs.py

Python script for generating simulated sequence data with a certain percentage of nucleotide substitutions.

Usage:

python mutate_seqs.py -i [in.fasta] -p [perc]

Options:
-i FASTA Input FASTA file
-p PERC Base substitutions (in %)

R_scripts /

  • plot_ds.R : calculate dissimilarity scores and plot heatmap
  • plot_f-score.R : calculate recall, precision and F-scores and plot scatterplot
  • plot_pcoa.R : plot principal coordinates analysis (PCoA)
  • plot_stack.R : plot stacked histogram with predicted genera
  • plot_usage.R : plot bargraph with computational cost

About

Benchmarking taxonomic analysis tools and databases.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published