___
# Jovian analysis report
___

<b>If you want to show or hide the programming code, press the button with the eye in it on the toolbar above.</b>  
<br>
<i><b>In order to load in the data, click the "Cell" button above, and then press "Run all".</b></i>  
<br>
You can view the Jovian rulegraph by clicking [here](files/GitHub_images/rulegraph_Jovian.png).

In [None]:
######################################
# Required packages for this script  #
######################################
import pandas as pd
import qgrid
import glob
import os

grid_options = {
    'fullWidthRows': True,
    'syncColumnCellResize': True,
    'forceFitColumns': False,
    'defaultColumnWidth': 100,
    'rowHeight': 23,
    'enableColumnReorder': True,
    'enableTextSelectionOnCells': True,
    'editable': True,
    'autoEdit': False,
    'explicitInitialization': True,
    'maxVisibleRows': 20,
    'minVisibleRows': 8,
    'sortable': True,
    'filterable': True,
    'highlightSelectedCell': True,
    'highlightSelectedRow': True
}

___
## Quality control metrics report (MultiQC):
[Open the MultiQC report in a separate tab by clicking here](results/multiqc.html)  
___

In [None]:
%%HTML
<div style="text-align: center">
    <iframe src="results/multiqc.html" width=1400 height=980></iframe>
</div>

___
## Metagenomics:
___


### Interactive metagenomics overview (Krona):
[Open the Krona graph in a separate tab by clicking here](results/krona.html)  

In [None]:
%%HTML
<div style="text-align: center">
    <iframe src="results/krona.html" width=1400 height=980></iframe>
</div>

### Summary heatmaps:

Note (2019-03-28): A bug in the interaction between Jupyter Notebook and Tornado (a dependency of Jupyter) causes the links below to give a 403 error in the Chrome browser; they do work in Firefox. This is a known bug that is expected to be fixed shortly by the developers of Tornado. See [here](https://github.com/jupyter/notebook/issues/4498), [here](https://github.com/jupyterlab/jupyterlab/issues/6106), [here](https://github.com/jupyterlab/jupyterlab/issues/6131) and [here](https://github.com/jupyter/terminado/issues/62).

In [None]:
%%HTML
<script>
function goBack() {
    window.history.back()
}
</script>

<button onclick="goBack()">Click this button to go back</button>


<div style="text-align: center">
    <iframe src="results/Heatmap_index.html" width=1400 height=980></iframe>
</div>

### Read-based composition of analysed samples

Quantifies all annotations made by Jovian by counting the reads in each of the following categories:  
reads discarded due to low quality, reads removed because they align to the human genome, reads classified by Megablast, and finally reads that could not be assembled into contigs longer than the user-specified length (default is 500 nt) and were therefore not included in the classification steps ("remaining").
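
As a minimal sketch of how these categories relate to each other, the snippet below converts per-category read counts into percentages of the total. The category keys, the counts, and the helper name `to_percentages` are hypothetical placeholders chosen for illustration; they are not values or code produced by Jovian.

```python
# Hypothetical read counts for one sample (illustrative only, not Jovian output).
read_counts = {
    "low_quality": 1000,
    "human": 4000,
    "classified": 3500,
    "remaining": 1500,
}


def to_percentages(counts):
    """Return each category as a percentage of the total read count."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty sample.
        return {category: 0.0 for category in counts}
    return {category: 100 * n / total for category, n in counts.items()}


percentages = to_percentages(read_counts)
for category, pct in percentages.items():
    print(f"{category}: {pct:.1f}%")
```

Because every read falls into exactly one category in this sketch, the percentages add up to 100.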

In [None]:
%%HTML
<div style="text-align: center">
    <iframe src="results/Sample_composition_graph.html" width=1400 height=980></iframe>
</div>

### Classified scaffolds:

In [None]:
if os.path.exists("results/all_taxClassified.tsv"):
    ClassifiedScaffolds_df = pd.read_csv("results/all_taxClassified.tsv" , sep = "\t")
else:
    print("The file \"results/all_taxClassified.tsv\" does not exist. Either no scaffolds were classified, or something went wrong. Please double-check the logfiles below:")
    print("\t\"logs/Merge_all_metrics_into_single_tsv_[sample_name].log\"")
    print("\t\"logs/Concat_files.log\"")
    ClassifiedScaffolds_df = pd.DataFrame({'Error' : ["Please", "see", "error", "message", "above"]})

qgrid.show_grid(ClassifiedScaffolds_df, show_toolbar=False, grid_options=grid_options)
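
The table cells in this report all repeat the same check before loading a result file, and the typing-tool cells further down additionally treat an empty file as "nothing found". The sketch below shows one way that check could be factored into a single helper. The name `file_state` and the demonstration file names are hypothetical and not part of Jovian.

```python
import os
import tempfile


def file_state(path):
    """Classify a result file as 'missing', 'empty', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "missing.tsv")
    empty = os.path.join(tmp, "empty.tsv")
    filled = os.path.join(tmp, "filled.tsv")
    open(empty, "w").close()  # create a zero-byte file
    with open(filled, "w") as handle:
        handle.write("scaffold\ttaxon\n")
    states = [file_state(p) for p in (missing, empty, filled)]

print(states)
```

A cell could then call `pd.read_csv` only when the state is `'ok'` and otherwise build the placeholder table, as the existing cells do.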

### Dark matter (i.e. unclassified scaffolds):

In [None]:
if os.path.exists("results/all_taxUnclassified.tsv"):
    UnclassifiedScaffolds_df = pd.read_csv("results/all_taxUnclassified.tsv" , sep = "\t")
else:
    print("The file \"results/all_taxUnclassified.tsv\" does not exist. Either no scaffolds were unclassified, or something went wrong. Please double-check the logfiles below:")
    print("\t\"logs/Merge_all_metrics_into_single_tsv_[sample_name].log\"")
    print("\t\"logs/Concat_files.log\"")
    UnclassifiedScaffolds_df = pd.DataFrame({'Error' : ["Please", "see", "error", "message", "above"]})

qgrid.show_grid(UnclassifiedScaffolds_df, show_toolbar=False, grid_options=grid_options)

___
## Predicted virus hosts:
___

In [None]:
if os.path.exists("results/all_virusHost.tsv"):
    virusHost_df = pd.read_csv("results/all_virusHost.tsv" , sep = "\t")
else:
    print("The file \"results/all_virusHost.tsv\" does not exist. Either no viral scaffolds had host information, or something went wrong. Please double-check the logfiles below:")
    print("\t\"logs/Merge_all_metrics_into_single_tsv_[sample_name].log\"")
    print("\t\"logs/Concat_files.log\"")
    virusHost_df = pd.DataFrame({'Error' : ["Please", "see", "error", "message", "above"]})

qgrid.show_grid(virusHost_df, show_toolbar=False, grid_options=grid_options)

___
## Virus typing tool output
___


### Norovirus typing tool output:  
[Link to the norovirus typing tool](https://www.rivm.nl/mpf/typingtool/norovirus/)  

In [None]:
if os.path.exists("results/all_NoV-TT.tsv") and os.path.getsize("results/all_NoV-TT.tsv") > 0:
    NoV_TT_df = pd.read_csv("results/all_NoV-TT.tsv" , sep = ",")
elif os.path.exists("results/all_NoV-TT.tsv") and os.path.getsize("results/all_NoV-TT.tsv") == 0:
    print("No viral scaffolds with species equal to \"Norwalk virus\" were found in this dataset.")
    NoV_TT_df = pd.DataFrame({'NA' : ["No", "Norwalk virus", "species", "scaffolds", "found"]})
else:
    print("The file \"results/all_NoV-TT.tsv\" does not exist. Something went wrong. Please see the logfiles below:")
    print("\t\"logs/Viral_typing_[sample_name].log\"")
    print("\t\"logs/Concat_TT_files.log\"")
    NoV_TT_df = pd.DataFrame({'Error' : ["Please", "see", "error", "message", "above"]})

qgrid.show_grid(NoV_TT_df, show_toolbar=False, grid_options=grid_options)

### Enterovirus typing tool output:  
[Link to the enterovirus typing tool](https://www.rivm.nl/mpf/typingtool/enterovirus/)  

In [None]:
if os.path.exists("results/all_EV-TT.tsv") and os.path.getsize("results/all_EV-TT.tsv") > 0:
    EV_TT_df = pd.read_csv("results/all_EV-TT.tsv" , sep = ",")
elif os.path.exists("results/all_EV-TT.tsv") and os.path.getsize("results/all_EV-TT.tsv") == 0:
    print("No viral scaffolds with family equal to \"Picornaviridae\" were found in this dataset.")
    EV_TT_df = pd.DataFrame({'NA' : ["No", "Picornaviridae", "family", "scaffolds", "found"]})
else:
    print("The file \"results/all_EV-TT.tsv\" does not exist. Something went wrong. Please see the logfiles below:")
    print("\t\"logs/Viral_typing_[sample_name].log\"")
    print("\t\"logs/Concat_TT_files.log\"")
    EV_TT_df = pd.DataFrame({'Error' : ["Please", "see", "error", "message", "above"]})

qgrid.show_grid(EV_TT_df, show_toolbar=False, grid_options=grid_options)

___
## Scaffold viewer:
**Containing: SNPs and minority variants (quasispecies), predicted ORFs, depth of coverage graph, GC content graph**
___

N.B. Depending on the depth of coverage of the contig, this can be <b>(very) slow, or even crash your browser</b>. This is a <b>client-side</b> problem: the viewer runs in your browser, so its performance is limited by your own computer. We, the developers, cannot fix this.  

In [None]:
!bash bin/start_nginx.sh start

In [None]:
%%HTML
<script>
function goBack() {
    window.history.back()
}
</script>

<button onclick="goBack()">Click this button to go back</button>

<div style="text-align: center">
    <iframe src="results/IGVjs_index.html" width=1400 height=980></iframe>
</div>

___
## Interactive minority variants table
___

In [None]:
if os.path.exists("results/all_filtered_SNPs.tsv"):
    filtered_VCF_df = pd.read_csv("results/all_filtered_SNPs.tsv" , sep = "\t")
else:
    print("The file \"results/all_filtered_SNPs.tsv\" does not exist. Either no SNPs were found, perhaps because the minimum allele frequency was set too high, or something went wrong. Please double-check the logfiles below:")
    print("\t\"logs/SNP_calling_[sample_name].log\"")
    print("\t\"logs/Concat_filtered_SNPs.log\"")
    filtered_VCF_df = pd.DataFrame({'Error' : ["Please", "see", "error", "message", "above"]})

qgrid.show_grid(filtered_VCF_df, show_toolbar=False, grid_options=grid_options)

___
# Logging and audit trail:
___

## Snakemake summary statistics
[Open the Snakemake summary statistics in a separate tab by clicking here](results/snakemake_report.html#stats)

In [None]:
%%HTML
<div style="text-align: center">
    <iframe src="results/snakemake_report.html" width=1400 height=980></iframe>
</div>

### Software list in Jovian_master environment:

In [None]:
%%bash
cat envs/Jovian_master_environment.yaml

<br>  
### Database versions:

In [None]:
%%bash
echo -e "==> NCBI NT Database: <==\n$(ls -lah /mnt/db/NT_Database/)"
echo -e "\n==> NCBI NR Database: <==\n$(ls -lah /mnt/db/NR_Database/)"
echo -e "\n==> NCBI Taxdb Database: <==\n$(ls -lah /mnt/db/taxdb/)"
echo -e "\n==> NCBI new_taxdump Database: <==\n$(ls -lah /mnt/db/new_taxdump/)"
echo -e "\n==> Krona Taxonomy Database: <==\n$(ls -lah /mnt/db/taxonomy_krona/)"
echo -e "\n==> Homo sapiens NCBI GRCh38 NO DECOY genome: <==\n$(ls -lah /mnt/db/Homo_sapiens_NO_DECOY/NCBI/GRCh38/Sequence/Bowtie2Index/)"
echo -e "\n==> Virus-Host Interaction Database: <==\n$(ls -lah /mnt/db/Virus-Host_interaction_DB/)"

<br>  
### Pipeline code with unique methodological "fingerprint":

In [None]:
%%bash
echo -e "This is the link to the code used for this analysis:\thttps://github.com/DennisSchmitz/Jovian/tree/$(git log -n 1 --pretty=format:"%H")" 
echo -e "This code with unique fingerprint $(git log -n1 --pretty=format:"%H") was committed by $(git log -n1 --pretty=format:"%an <%ae>") at $(git log -n1 --pretty=format:"%ad")"
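
For readers who prefer Python over the shell, the same fingerprint link could be built roughly as follows. This is a sketch under assumptions: the function names `commit_url` and `current_commit_hash` are hypothetical, and the second function only works when `git` is installed and the notebook runs inside the repository.

```python
import subprocess

# Repository URL as used in the cell above.
REPO_URL = "https://github.com/DennisSchmitz/Jovian"


def commit_url(commit_hash, repo_url=REPO_URL):
    """Build the link to the repository tree at a given commit."""
    return f"{repo_url}/tree/{commit_hash}"


def current_commit_hash():
    """Return the full hash of the checked-out commit (needs git and a repository)."""
    result = subprocess.run(
        ["git", "log", "-n", "1", "--pretty=format:%H"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `commit_url(current_commit_hash())` inside the repository would give the link that the shell cell prints.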

<br>  
### Snakemake config file (containing pipeline parameters):

In [None]:
%%bash
echo -e "________________________________________________________________________________\n\tContents of the config.yaml (contains the Snakemake CLI parameters):\n________________________________________________________________________________"
cat profile/config.yaml
echo -e "________________________________________________________________________________\n\tContents of the pipeline_parameters.yaml (contains the parameters for the pipeline software):\n________________________________________________________________________________"
cat profile/pipeline_parameters.yaml

___
# Acknowledgements
___

|Name |Publication|Website|
|:---|:---|:---|
|BBtools|NA|https://jgi.doe.gov/data-and-tools/bbtools/|
|BEDtools|Quinlan, A.R. and I.M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 2010. 26(6): p. 841-842.|https://bedtools.readthedocs.io/en/latest/|
|BLAST|Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25(17): p. 3389-3402.|https://www.ncbi.nlm.nih.gov/books/NBK279690/|
|BWA|Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.|https://github.com/lh3/bwa|
|BioConda|Grüning, B., et al., Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 2018. 15(7): p. 475.|https://bioconda.github.io/|
|Biopython|Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., ... & De Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423.|https://biopython.org/|
|Bokeh|Bokeh Development Team (2018). Bokeh: Python library for interactive visualization.|https://bokeh.pydata.org/en/latest/|
|Bowtie2|Langmead, B. and S.L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012. 9(4): p. 357.|http://bowtie-bio.sourceforge.net/bowtie2/index.shtml|
|Conda|NA|https://conda.io/|
|DRMAA|NA|http://drmaa-python.github.io/|
|FastQC|Andrews, S., FastQC: a quality control tool for high throughput sequence data. 2010.|https://www.bioinformatics.babraham.ac.uk/projects/fastqc/|
|gawk|NA|https://www.gnu.org/software/gawk/|
|GNU Parallel|O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.|https://www.gnu.org/software/parallel/|
|Git|NA|https://git-scm.com/|
|igvtools|NA|https://software.broadinstitute.org/software/igv/igvtools|
|Jupyter Notebook|Kluyver, Thomas, et al. "Jupyter Notebooks-a publishing format for reproducible computational workflows." ELPUB. 2016.|https://jupyter.org/|
|Jupyter_contrib_nbextension|NA|https://github.com/ipython-contrib/jupyter_contrib_nbextensions|
|Jupyterthemes|NA|https://github.com/dunovank/jupyter-themes|
|Krona|Ondov, B.D., N.H. Bergman, and A.M. Phillippy, Interactive metagenomic visualization in a Web browser. BMC Bioinformatics, 2011. 12: p. 385.|https://github.com/marbl/Krona/wiki|
|Lofreq|Wilm, A., et al., LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Research, 2012. 40(22): p. 11189-11201.|http://csb5.github.io/lofreq/|
|Minimap2|Li, H., Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 2018.|https://github.com/lh3/minimap2|
|MultiQC|Ewels, P., et al., MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 2016. 32(19): p. 3047-3048.|https://multiqc.info/|
|Nb_conda|NA|https://github.com/Anaconda-Platform/nb_conda|
|Nb_conda_kernels|NA|https://github.com/Anaconda-Platform/nb_conda_kernels|
|Nginx|NA|https://www.nginx.com/|
|Numpy|Walt, S. V. D., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22-30.|http://www.numpy.org/|
|Pandas|McKinney, W. Data structures for statistical computing in python. in Proceedings of the 9th Python in Science Conference. 2010. Austin, TX.|https://pandas.pydata.org/|
|Picard|NA|https://broadinstitute.github.io/picard/|
|Prodigal|Hyatt, D., et al., Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 2010. 11(1): p. 119.|https://github.com/hyattpd/Prodigal/wiki/Introduction|
|Python|G. van Rossum, Python tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, May 1995.|https://www.python.org/|
|Qgrid|NA|https://github.com/quantopian/qgrid|
|SAMtools|Li, H., et al., The sequence alignment/map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-2079.|http://www.htslib.org/|
|SPAdes|Nurk, S., et al., metaSPAdes: a new versatile metagenomic assembler. Genome Res, 2017. 27(5): p. 824-834.|http://cab.spbu.ru/software/spades/|
|Seqtk|NA|https://github.com/lh3/seqtk|
|Snakemake|Köster, J. and S. Rahmann, Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 2012. 28(19): p. 2520-2522.|https://snakemake.readthedocs.io/en/stable/|
|Tabix|NA|http://www.htslib.org/doc/tabix.html|
|tree|NA|http://mama.indstate.edu/users/ice/tree/|
|Trimmomatic|Bolger, A.M., M. Lohse, and B. Usadel, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 2014. 30(15): p. 2114-20.|http://www.usadellab.org/cms/?page=trimmomatic|
|Virus-Host Database|Mihara, T., Nishimura, Y., Shimizu, Y., Nishiyama, H., Yoshikawa, G., Uehara, H., ... & Ogata, H. (2016). Linking virus genomes with host taxonomy. Viruses, 8(3), 66.|http://www.genome.jp/virushostdb/note.html|

#### Link to the GitHub repository:  
https://github.com/DennisSchmitz/Jovian

#### Authors:
- Dennis Schmitz ([RIVM](https://www.rivm.nl/en) and [EMC](https://www6.erasmusmc.nl/viroscience/))  
- Sam Nooij ([RIVM](https://www.rivm.nl/en) and [EMC](https://www6.erasmusmc.nl/viroscience/))  
- Robert Verhagen ([RIVM](https://www.rivm.nl/en))  
- Thierry Janssens ([RIVM](https://www.rivm.nl/en))  
- Jeroen Cremer ([RIVM](https://www.rivm.nl/en))  
- Harry Vennema ([RIVM](https://www.rivm.nl/en))  
- Annelies Kroneman ([RIVM](https://www.rivm.nl/en))  
- Marion Koopmans ([EMC](https://www6.erasmusmc.nl/viroscience/))  

___