# Vignette demonstrating microbial-like sequence discovery workflow

#### Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

<h3><center>Abstract</center></h3>
In this vignette, we will demonstrate how to prepare and run the workflow for detecting microbial-like sequences in eukaryotic reference genomes. The workflow accepts a eukaryotic reference in FASTA format and outputs the coordinates of microbial-like regions together with microbial species annotation.

### Table of Contents
* [Prepare input files](#Prepare-input-files)
* [Run workflow](#Run-workflow)

![Green algae](images/GreenAlgae.png)

### Prepare input files <a class="anchor" id="Prepare-input-files"></a>

For demonstration purposes, we are going to use the reference genome of [*Bathycoccus prasinos*](https://en.wikipedia.org/wiki/Bathycoccus_prasinos), a green alga: a eukaryotic picoplankton organism related to plants. The reference genome [GCF_002220235.1](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_002220235.1/) of this alga is small (15 Mb) and therefore computationally easy to handle. The workflow, together with the test files, is available on GitHub at https://github.com/NikolayOskolkov/MCWorkflow. Let us first clone the GitHub repository and inspect its contents:

In [1]:
cd /home/nikolay
git clone https://github.com/NikolayOskolkov/MCWorkflow
cd MCWorkflow

Cloning into 'MCWorkflow'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 19 (delta 3), reused 16 (delta 3), pack-reused 0 (from 0)
Unpacking objects: 100% (19/19), done.


In [2]:
ls -l

total 7584
drwxrwxr-x 2 nikolay nikolay    4096 mar  7 20:20 data
-rwxrwxr-x 1 nikolay nikolay    6265 mar  7 20:20 extract_coords_micr_contam.R
-rw-rw-r-- 1 nikolay nikolay 3766396 mar  7 20:20 GTDB_fna2name.txt
-rw-rw-r-- 1 nikolay nikolay 3675565 mar  7 20:20 GTDB_sliced_seqs_sliding_window.fna.gz
drwxrwxr-x 2 nikolay nikolay    4096 mar  7 20:20 images
-rwxrwxr-x 1 nikolay nikolay    5000 mar  7 20:20 micr_cont_detect.sh
-rw-rw-r-- 1 nikolay nikolay      26 mar  7 20:20 README.md
-rw-rw-r-- 1 nikolay nikolay  280904 mar  7 20:20 vignette.html
-rw-rw-r-- 1 nikolay nikolay    6959 mar  7 20:20 vignette.ipynb


Let us now download a eukaryotic reference genome and place it in the `data` folder:

In [3]:
pwd

/home/nikolay/MCWorkflow
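
Before starting the workflow, it can be worth confirming that the reference is in place and is a valid gzip-compressed FASTA file. The snippet below is a minimal sketch, not part of MCWorkflow: `count_fasta_records` is a hypothetical helper, and the path `data/GCF_002220235.fna.gz` is an assumption based on the file name and data folder used in the workflow command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MCWorkflow): count the sequence records
# in a gzip-compressed FASTA file by counting header lines that start with ">".
count_fasta_records() {
    local fasta_gz="$1"
    # gzip -t fails if the file is missing or is not a valid gzip archive.
    gzip -t "$fasta_gz" || return 1
    # grep -c exits with status 1 when the count is 0, so mask that status.
    gzip -dc "$fasta_gz" | grep -c '^>' || true
}

# Assumed location of the reference, relative to the MCWorkflow directory.
ref="data/GCF_002220235.fna.gz"
if [ -f "$ref" ]; then
    echo "$ref contains $(count_fasta_records "$ref") sequence records"
else
    echo "$ref not found; download the reference first" >&2
fi
```

If the reported number looks implausible for the assembly, the download was probably incomplete or the file is not in FASTA format.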


### Run workflow <a class="anchor" id="Run-workflow"></a>

Now we can start the workflow with the following command:

In [4]:
./micr_cont_detect.sh GCF_002220235.fna.gz /home/nikolay/MCWorkflow/data GTDB 4 GTDB_sliced_seqs_sliding_window.fna.gz GTDB_fna2name.txt
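
The script takes six positional arguments. Their meaning is not spelled out in this vignette, so the variable names in the sketch below are guesses based on the values in the command above (for instance, that `4` is a thread count and `GTDB` a database label) and should be checked against `micr_cont_detect.sh` itself. The sketch only assembles and prints the command line; it does not run the workflow.

```shell
#!/usr/bin/env bash
# All variable names are assumptions inferred from the argument values;
# consult micr_cont_detect.sh for the authoritative meaning of each position.
ref_genome="GCF_002220235.fna.gz"                      # reference file name
data_dir="/home/nikolay/MCWorkflow/data"               # folder holding the reference
db_label="GTDB"                                        # presumably a database label
threads=4                                              # presumably a thread count
pseudo_reads="GTDB_sliced_seqs_sliding_window.fna.gz"  # microbial sequences to align
fna2name="GTDB_fna2name.txt"                           # presumably a file-to-name mapping

# Print the assembled command line without executing it.
build_cmd() {
    echo "./micr_cont_detect.sh $ref_genome $data_dir $db_label $threads $pseudo_reads $fna2name"
}

build_cmd
```

Keeping the values in named variables makes it easier to swap in another reference genome or data folder without miscounting positions.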


PREPARING FILES FOR ANALYSIS OF GCF_002220235.fna.gz REFERENCE GENOME

BUILDING BOWTIE2 INDEX FOR GCF_002220235.fna.gz REFERENCE GENOME
ALIGNING MICROBIAL READS WITH BOWTIE2 TO GCF_002220235.fna.gz REFERENCE GENOME
[bam_sort_core] merging from 0 files and 4 in-memory blocks...

RANKING GCF_002220235.fna.gz CONTIGS BY NUMBER OF MAPPED MICROBIAL READS
COMPUTING BREADTH OF COVERAGE FOR EACH CONTIG AND COORDINATES OF MICROBIAL CONTAMINATION FOR GCF_002220235.fna.gz REFERENCE GENOME
NC_023997.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024004.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_024008.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSING BOC FILES
NC_023992.1 CONTIG OF GCF_002220235.fna.gz
EXTRACTING COORDINATES OF MICROBIAL CONTAMINATION
DELETING BAM AND COMPRESSIN