# A Reproducibility Analysis of 'A comprehensive metagenomics framework to characterize organisms relevant for planetary protection'
### Annaliese Meyer and Matt Baldes
#### Environmental Bioinformatics 2021

Danko et al. (2021) make a compelling case for the need for reproducible, user-friendly pipelines to facilitate the reporting of microbiomes in spacecraft assembly clean rooms. A thorough knowledge of the bioburden on spacecraft is critical to planetary protection and life detection efforts. However, despite asserting the necessity of reproducibility, the authors fall short of delivering a usable pipeline, let alone a reproducible method for processing their genomic samples. In this report, we compare figures from Danko et al. (2021) with our best efforts at reproducing them. We address the numerous reporting failures in the paper and its accompanying GitHub repository, and discuss the dangers of using completely novel in-house code and methods rather than field-standard modules for data processing.

## Background
Planetary protection is the practice of protecting both the Earth and the planets we interact with from contamination. One of the foremost ways to protect against contamination of other planets is through stringent cleaning practices and rigorous bioburden assessments of the clean rooms where spacecraft are assembled. At the moment, no agreed-upon best practice exists for assessing spacecraft and spacecraft assembly facilities. Further, there is no hard legislation surrounding acceptable bioburden on interplanetary spacecraft. The Committee on Space Research (COSPAR) provides guidelines on bioburden limits for various categories of space missions, but enforcement is solely at the discretion of member states.

## Attempted Methods
Our original intention in this project was to use the MetaSUB CAP pipeline described in the paper, in an effort to closely follow the methods of the original authors. However, this pipeline relies on a custom pipeline manager, _Module Ultra_. Installation of Module Ultra failed across multiple attempts: with both pip and conda, when building from source, and in several conda environments with different Python versions. Given that the original CAP pipeline was deprecated and its functionality had been transferred to the CAP2 pipeline, bugs in this first pipeline were expected. However, the CAP2 pipeline also failed to function as promised when we attempted the same extensive installation and run options. While we did not use either pipeline for our analyses, we did consult their source code to find version numbers and flags that were not specified in the text. This report distinguishes between parameters that were explicitly stated in the publication and those that required searching the source code.

## Figure 1
Figure 1 describes the diversity of each sample. Danko et al. (2021) use (A) species-level richness (i.e., the total number of detected species), (B) Shannon entropy of species abundance, and (C) uniform manifold and abundance. To create this figure and Figure 2, Danko et al. (2021) generated taxonomic profiles with KrakenUniq, a program based on Kraken. While the version number for KrakenUniq was specified, the program failed on Poseidon, and the existence of multiple (>40) open issues on GitHub led us to believe that the problem was intrinsic to the program. As such, we used the more common program, Kraken 2. No parameters were listed for KrakenUniq in the text, and the source code for this portion was difficult to parse. For this reason, and because we were not using the original program, we analyzed our data with the default parameters for Kraken 2.
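
The two scalar metrics in panels A and B can in principle be computed directly from a Kraken 2 report. The following is a minimal sketch and not the code of Danko et al. (2021): it assumes the default six-column, tab-delimited Kraken 2 report (clade-level read count in column 2, rank code in column 4), natural-log Shannon entropy, and a hypothetical helper name, `kreport_diversity`. Any abundance thresholds or normalization the original authors may have applied are not reproduced here.

```shell
# Hypothetical helper (not from Danko et al.): print species richness and
# Shannon entropy (natural log) for one Kraken 2 report.
# Assumes the default tab-delimited report layout:
#   column 2 = reads assigned to the clade, column 4 = rank code ("S" = species).
kreport_diversity() {
    awk -F'\t' '
        # keep species-level rows with at least one read (sub-ranks such as S1 are ignored)
        $4 == "S" && $2 > 0 { reads[++n] = $2; total += $2 }
        END {
            for (j = 1; j <= n; j++) {
                p = reads[j] / total    # relative abundance among species-level reads
                h -= p * log(p)         # Shannon entropy: H = -sum(p * ln p)
            }
            printf "%d\t%.4f\n", n, h   # richness <TAB> entropy
        }
    ' "$1"
}
```

Calling `kreport_diversity report_default.kreport` inside a sample directory would print one richness and entropy pair for that sample, which could then be collected across samples for plotting.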


![image.png](attachment:image.png)

We ran Kraken 2 as both a stand-alone program and within a pipeline developed by the Bhatt Lab (Bhatt and Siranosian, 2021). The stand-alone script is provided below:
```
for i in $(cat "${1}") # loop over sample names listed in the file given as the first argument
do
    # enter the sample directory so that outputs land in sample-specific locations
    cd "/vortexfs1/omics/env-bio/collaboration/clean_room/output/kraken2/$i" || { echo "missing output directory for $i" >&2; continue; }
    # run Kraken 2 on the error-corrected paired reads; outputs use standard names for later processing
    kraken2 --db /vortexfs1/omics/env-bio/collaboration/databases/kraken2db_pluspf --threads 8 \
        --output output_default --report report_default.kreport --paired \
        "/vortexfs1/omics/env-bio/collaboration/clean_room/output/error_corrected/$i/corrected/$i.1.00.0_0.cor.fastq.gz" \
        "/vortexfs1/omics/env-bio/collaboration/clean_room/output/error_corrected/$i/corrected/$i.2.00.0_0.cor.fastq.gz"
done
```



Input figure 1 here

![image.png](attachment:image.png)

## Other analyses
#### Jellyfish
