paycheck

tAxonomY Classification Expected Composition

Code for this study, in support of q2-clawback.

paycheck contains the cross-validation code.

analysis collates results.

figures-and-tables draws figures and generates tables.

Works with data found here and here.

The main results in the paper were obtained in the following fashion.

Install QIIME 2 and q2-clawback. Install paycheck by running

pip install git+https://github.com/BenKaehler/paycheck.git
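
To check that everything registered in your environment (assuming an activated QIIME 2 conda environment; paycheck_cv is the script installed by paycheck and is assumed to expose the standard --help flag):

qiime dev refresh-cache
qiime clawback --help
paycheck_cv --help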

Download the Greengenes 13.8 reference data, which can also be found on the QIIME 2 site. Make some working directories

mkdir empo-3
cd empo-3
mkdir ref
mkdir sediment-non-saline

Copy the files rep_set/99_otus.fasta and taxonomy/99_otu_taxonomy.txt from Greengenes to the reference directory (ref), for example as sketched below.
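
Assuming the Greengenes archive was unpacked to a directory such as ~/gg_13_8_otus (adjust the path to wherever you extracted it):

cp ~/gg_13_8_otus/rep_set/99_otus.fasta ref/
cp ~/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt ref/

Now import them into QIIME 2 Artifacts and extract V4 reads: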

cd ref
qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path 99_otus.fasta \
  --output-path 99_otus.qza

qiime feature-classifier extract-reads \
  --i-sequences 99_otus.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA \
  --p-r-primer GGACTACNVGGGTWTCTAAT \
  --o-reads ref-seqs-v4.qza

qiime tools export \
  --input-path ref-seqs-v4.qza \
  --output-path .

mv dna-sequences.fasta 99_otus_v4.fasta
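
The class-weight step later in this walkthrough also expects the reference taxonomy as an Artifact (../ref/99_otu_taxonomy.qza). If you do not already have one, a minimal sketch of the import, assuming the standard headerless Greengenes taxonomy format:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path 99_otu_taxonomy.txt \
  --output-path 99_otu_taxonomy.qza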

Move into our working directory.

cd ../sediment-non-saline

Retrieve a dataset.

wget https://zenodo.org/record/2548899/files/sediment-non-saline.qza

An updated dataset for the same data could alternatively be obtained by running

qiime clawback fetch-Qiita-samples \
  --p-metadata-key empo_3 \
  --p-metadata-value 'Sediment (non-saline)' \
  --p-metadata-value 'sediment (non-saline)' \
  --o-samples sediment-non-saline.qza \
  --p-context Deblur-Illumina-16S-V4-150nt-780653

See the q2-clawback tutorial for details.

Extract the sequence variant data.

qiime tools export \
  --input-path sediment-non-saline.qza \
  --output-path .
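
The export writes feature-table.biom into the current directory. As an optional sanity check, the biom CLI that ships with the QIIME 2 environment can summarise it (file names here follow the default export output):

biom summarize-table \
  -i feature-table.biom \
  -o feature-table-summary.txt
head feature-table-summary.txt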

We can now run the cross-validation script.

mkdir results tmp
paycheck_cv \
  --empirical-samples feature-table.biom \
  --ref-seqs ../ref/99_otus_v4.fasta  \
  --ref-taxa ../ref/99_otu_taxonomy.txt \
  --results-dir results \
  --intermediate-dir tmp \
  --k 5 \
  --log-file results/log \
  --log-level DEBUG

If your machine has enough memory, you can use --n-jobs to accelerate the process.
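
For example, the same command with parallelism enabled (the value 4 is illustrative; pick a number of jobs your memory can support):

paycheck_cv \
  --empirical-samples feature-table.biom \
  --ref-seqs ../ref/99_otus_v4.fasta \
  --ref-taxa ../ref/99_otu_taxonomy.txt \
  --results-dir results \
  --intermediate-dir tmp \
  --k 5 \
  --n-jobs 4 \
  --log-file results/log \
  --log-level DEBUG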

The cross-validation results should now be in the bespoke70 and uniform70 subdirectories of results.

The above steps can be repeated for any of the EMPO 3 types in the paper.

Next we analyse the results. Change directories to wherever you keep git repositories, clone the paycheck repo, and launch a Jupyter notebook server in the analysis subdirectory. Note that Jupyter is included in your QIIME 2 installation.

git clone https://github.com/BenKaehler/paycheck.git
cd paycheck/analysis
jupyter notebook

Now open the eval-empo-3-error-rate.ipynb notebook and change the values in the third cell to reflect the location of your empo-3 directory, the EMPO 3 types for which you have run the analysis, and the weights that you have tried. So far we have only run bespoke70 and uniform70, so truncate class_weights to include only those weights. Similarly, if you have only run cross validation for Sediment (non-saline), truncate sample_types to include only sediment-non-saline.

Run the fourth cell to generate the eval_taxa_er.tsv result file.

The results can be visualised by running, for instance, the error-rate-vs-habitat.ipynb Jupyter notebook in the figures-and-tables subdirectory.

To calculate and visualise the other metrics presented in the paper, it is necessary to run the other paycheck-eval-* notebooks in the analysis subdirectory, followed by the required visualisation notebooks in the figures-and-tables subdirectory.

Finally, analysing the results for average taxonomic weights is somewhat more involved, because it requires that weights be drawn from all of the EMPO 3 habitats at once.

For this process, start by calculating the average weights for each EMPO 3 habitat. For example

cd empo-3/sediment-non-saline
qiime clawback sequence-variants-from-samples \
  --i-samples sediment-non-saline.qza \
  --o-sequences tmp/reads.qza
qiime feature-classifier classify-sklearn \
  --i-reads tmp/reads.qza \
  --i-classifier ../ref/uniform-classifier.qza \
  --p-confidence -1 \
  --o-classification tmp/classification.qza
qiime clawback generate-class-weights \
  --i-reference-taxonomy ../ref/99_otu_taxonomy.qza \
  --i-reference-sequences ../ref/99_otus_v4.qza \
  --i-samples sediment-non-saline.qza \
  --i-taxonomy-classification tmp/classification.qza \
  --o-class-weight results/weights.qza

The uniform-classifier.qza for the V4 region can be downloaded from the QIIME 2 website or trained in the standard way. See the q2-feature-classifier tutorial, which is available at the same site, for details.
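
If you train it yourself, a minimal sketch following the standard naive Bayes workflow from q2-feature-classifier (run from the ref directory, and assuming the taxonomy was imported as 99_otu_taxonomy.qza as above):

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-v4.qza \
  --i-reference-taxonomy 99_otu_taxonomy.qza \
  --o-classifier uniform-classifier.qza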

Now, to calculate the average taxonomic weights results, start by telling the cross-validation script where to find the taxonomic weights for each EMPO 3 habitat. This takes the form of a file that contains one file path per line. For instance

cat > ../weights_files << END
../sediment-saline/results/weights.qza
END

Now we can perform cross-validation using average weights by running

paycheck_cv_average \
  --ref-taxa ../ref/99_otu_taxonomy.txt \
  --ref-seqs ../ref/99_otus_v4.fasta \
  --weights ../weights_files \
  --exclude results/weights.qza \
  --obs-dir average \
  --results-dir results \
  --intermediate-dir tmp \
  --log-file results/log-average \
  --log-level DEBUG

This command will leave results in the results/average directory alongside bespoke70 and uniform70. Note that paycheck_cv must be run before paycheck_cv_average, because the latter reuses the same cross-validation partitions.

The average results can now be analysed in the same way that we analysed bespoke and uniform results, above.
