# JM-lab virome pipeline: tutorial
This Jupyter notebook gives an overview of the commands needed for the primary analysis of raw NGS data, with explanations. It is intended as a learning tool for new PhD students, master's students, interns, etc.

## Logging in to teaching server
For this tutorial we can work on the teaching server of gbiomed (bmw.gbiomed.kuleuven.be). Everyone with a u- or r-number can connect to this server by ssh-ing to *__'your_r/u-number'@bmw.gbiomed.kuleuven.be__* and entering their intranet password.

Perform the following action in the terminal:
```bash
ssh 'your_r/u-number'@bmw.gbiomed.kuleuven.be
```
Next, enter the password of your KU Leuven account.

## Installing all necessary software
### Miniconda
(Mini)conda is a package manager from which you can install a lot of (bioinformatics) software. https://docs.conda.io/projects/conda/en/latest/
1. Create a new directory in your data folder and move into it:

In [1]:
cd ~/data
mkdir software
cd software
pwd

/home/luna.kuleuven.be/u0140985/data/software


2. Download the Miniconda installer with `wget`. Next, run the installation script (`-b` runs the installation silently in batch mode and `-p` sets the path where Miniconda is installed). When Miniconda is installed, activate it by sourcing the initialization script; this simply sets a couple of shell environment variables and defines the `conda` command as a shell function. More information: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

In [2]:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/data/software/miniconda
source $HOME/data/software/miniconda/bin/activate
conda init
source ~/.bashrc

--2021-01-13 16:49:44--  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94235922 (90M) [application/x-sh]
Saving to: ‘Miniconda3-latest-Linux-x86_64.sh’


2021-01-13 16:49:45 (84.1 MB/s) - ‘Miniconda3-latest-Linux-x86_64.sh’ saved [94235922/94235922]

PREFIX=/home/luna.kuleuven.be/u0140985/data/software/miniconda
Unpacking payload ...
Collecting package metadata (current_repodata.json): done                       
Solving environment: done

## Package Plan ##

  environment location: /home/luna.kuleuven.be/u0140985/data/software/miniconda

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - brotlipy==0.7.0=py38h27cfd23_1003
    - ca-certificates==2020.10.14=0
    - certifi==2020.6.20=pyhd3eb1b0_3
    - cffi==1.14.3
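
Before running a downloaded installer, you may want to check that the file arrived intact. The sketch below compares a file's SHA-256 hash with an expected value. The helper name `verify_sha256` is made up for this tutorial, and it assumes that the `sha256sum` command (GNU coreutils) is available on the server and that you look up the expected hash for your installer version yourself.

```bash
# verify_sha256: hypothetical helper that compares a file's SHA-256 hash
# with an expected value. Assumes the GNU coreutils `sha256sum` command.
verify_sha256() {
    local file="$1"
    local expected="$2"
    local actual
    # sha256sum prints "<hash>  <filename>"; keep only the hash
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Example call (replace <expected_hash> with the published hash of the installer):
# verify_sha256 Miniconda3-latest-Linux-x86_64.sh <expected_hash>
```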


3. When installing new software with conda, the best practice is to create a new conda environment for each part of a project you're working on.

In this tutorial we will run the virome pipeline, so we will create a conda environment that contains all the software the pipeline needs. We then activate this environment to make the software available for use.


In [6]:
conda create -y --name virome_pipeline python
conda activate virome_pipeline
conda install -y -c bioconda krona samtools bwa-mem2 bowtie2 spades trimmomatic bedtools

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/luna.kuleuven.be/u0140985/data/software/miniconda/envs/virome_pipeline

  added / updated specs:
    - biopython


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    biopython-1.78             |   py38h7b6447c_0         2.1 MB
    blas-1.0                   |              mkl           6 KB
    ca-certificates-2020.12.8  |       h06a4308_0         121 KB
    certifi-2020.12.5          |   py38h06a4308_0         141 KB
    intel-openmp-2020.2        |              254         786 KB
    mkl-2020.2                 |              256       138.3 MB
    mkl-service-2.3.0          |   py38he904b0f_0          62 KB
    mkl_fft-1.2.0              |   py38h23d657b_0         157 KB
    mkl_random-1.1.1           |   py38h0573a6f_0         341 KB
    numpy-1.19.2     

Downloading and Extracting Packages
samtools-1.7         | 1.0 MB    | ##################################### | 100% 
perl-5.26.2          | 10.5 MB   | ##################################### | 100% 
curl-7.71.1          | 140 KB    | ##################################### | 100% 
libssh2-1.9.0        | 269 KB    | ##################################### | 100% 
krb5-1.18.2          | 1.3 MB    | ##################################### | 100% 
trimmomatic-0.39     | 142 KB    | ##################################### | 100% 
bwa-mem2-2.1         | 3.0 MB    | ##################################### | 100% 
libgcc-7.2.0         | 269 KB    | ##################################### | 100% 
bedtools-2.29.2      | 13.8 MB   | ##################################### | 100% 
libcurl-7.71.1       | 305 KB    | ##################################### | 100% 
openjdk-8.0.152      | 57.4 MB   | ##################################### | 100% 
krona-2.7.1          | 189 KB    | ##################################### 
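
After activating the environment, it can be useful to confirm that the installed tools are actually found on the `$PATH`. The sketch below is one way to do this. The helper name `check_tools` is made up for this tutorial, and the executable names in the example call (for instance `spades.py` for SPAdes) are assumptions about what the bioconda packages install, so adjust them if needed.

```bash
# check_tools: hypothetical helper, not part of conda or the pipeline.
# Prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions, adjust as needed):
# check_tools samtools bwa-mem2 bowtie2 spades.py trimmomatic bedtools ktUpdateTaxonomy.sh
```

If a tool is reported as missing, first check that the `virome_pipeline` environment is the active one.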


###### Downloading taxonomy database for Krona
Krona is installed, but we still need to run `ktUpdateTaxonomy.sh` to download the taxonomy database; see the message below:

```sh
Krona installed.  You still need to manually update the taxonomy databases before Krona can generate taxonomic reports. The update script is ktUpdateTaxonomy.sh. 
The default location for storing taxonomic databases is /home/luna.kuleuven.be/u0140985/data/software/miniconda/envs/virome_pipeline/opt/krona/taxonomy

If you would like the taxonomic data stored elsewhere, simply replace
this directory with a symlink.  For example:

rm -rf /home/luna.kuleuven.be/u0140985/data/software/miniconda/envs/virome_pipeline/opt/krona/taxonomy
mkdir /path/on/big/disk/taxonomy
ln -s /path/on/big/disk/taxonomy /home/luna.kuleuven.be/u0140985/data/software/miniconda/envs/virome_pipeline/opt/krona/taxonomy
ktUpdateTaxonomy.sh
```

In [7]:
ktUpdateTaxonomy.sh

Fetching taxdump.tar.gz...
   Fetching checksum...
   Checksum for taxdump.tar.gz matches server.
Extracting taxonomy...

Cleaning up...

Finished.


### From another source
Besides Anaconda/Miniconda, there are many other ways to install software (`pip`, building from source, installing binaries, cloning from GitHub, etc.).

As the latest version of Diamond (a sequence aligner for protein and translated DNA searches) is not available through `conda`, we can install it from GitHub by following the instructions (https://github.com/bbuchfink/diamond; installation: https://github.com/bbuchfink/diamond/wiki).

In [1]:
cd ~/data/software/
mkdir diamond
cd diamond
wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz
tar -xzf diamond-linux64.tar.gz

--2021-01-13 23:12:37--  http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz [following]
--2021-01-13 23:12:37--  https://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/31987083/77690800-3fb3-11eb-8628-ddfbe9c08477?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210113%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210113T221237Z&X-Amz-Expires=300&X-Amz-Signature=e75ab16f5a0f77a206ddf1606483ca7ef8ff4778963a2aa18a08cfed9c55466e&X-Amz-Signed

Now we still need to put the diamond executable in our `$PATH` so that we can call it from anywhere in the terminal. To do this, make a `bin` subdirectory in the `software` directory, create a symlink to the `diamond` executable in this `bin` directory, and finally add `bin` to the `$PATH` in your `.profile` or `.bash_profile` file.

In [2]:
cd ~/data/software
mkdir bin
cd bin/
ln -s ~/data/software/diamond/diamond .

Next, open the `.profile` file with `nano` and add the following line to the bottom of the file:
```bash
PATH="$PATH:$HOME/data/software/bin"
```
More documentation on where and how to set the `PATH` variable can be found in these two topics:
* https://superuser.com/questions/183870/difference-between-bashrc-and-bash-profile/183980#183980 
* https://unix.stackexchange.com/questions/26047/how-to-correctly-add-a-path-to-path
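
Each time a profile file is sourced, a plain `PATH=...` line appends the directory again, so the `PATH` can accumulate duplicates. A minimal sketch of a guard against this is shown below. The helper name `add_to_path` is made up for this tutorial; the directory is the `bin` directory created earlier.

```bash
# add_to_path: hypothetical helper that appends a directory to PATH
# only if it is not already listed.
add_to_path() {
    local dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;              # already present, do nothing
        *) PATH="$PATH:$dir" ;;     # not present yet, append to the end
    esac
    export PATH
}

add_to_path "$HOME/data/software/bin"
```

Surrounding `$PATH` with colons in the `case` pattern lets the same test match an entry at the start, in the middle, or at the end of the list.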


When you `source` your `.profile` file, you should now be able to call diamond.
```bash
source ~/.profile
```
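
To see the symlink and `PATH` mechanism in isolation, the sketch below repeats the same steps in a throwaway directory with a stand-in script instead of diamond. All names in it (`demo_dir`, `demo_tool`) are hypothetical, and nothing in your real setup is touched.

```bash
# Work in a throwaway directory so nothing in your real setup is touched.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/tool" "$demo_dir/bin"

# A stand-in "tool": a tiny script (hypothetical name) that prints a message.
printf '#!/bin/sh\necho "hello from demo_tool"\n' > "$demo_dir/tool/demo_tool"
chmod +x "$demo_dir/tool/demo_tool"

# Symlink the executable into bin and put bin on the PATH.
ln -s "$demo_dir/tool/demo_tool" "$demo_dir/bin/demo_tool"
PATH="$PATH:$demo_dir/bin"

# The shell now finds the tool by name alone.
demo_output=$(demo_tool)
echo "$demo_output"
```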

In [1]:
diamond help

diamond v2.0.6.144 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org

Syntax: diamond COMMAND [OPTIONS]

Commands:
makedb	Build DIAMOND database from a FASTA file
blastp	Align amino acid query sequences against a protein reference database
blastx	Align DNA query sequences against a protein reference database
view	View DIAMOND alignment archive (DAA) formatted file
help	Produce help message
version	Display version information
getseq	Retrieve sequences from a DIAMOND database file
dbinfo	Print information about a DIAMOND database file
test	Run regression tests

General options:
--threads (-p)           number of CPU threads
--db (-d)                database file
--out (-o)               output file
--outfmt (-f)            output format
	0   = BLAST pairwise
	5   = BLAST XML
	6   = BLAST tabular
	100 = DIAMOND alignment archive (DAA)
	101 = SAM

	Value 6 may be followed by a space-separated list of these key

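```bash
# Collect the unique sample names (the part of each file name before the first underscore)
ls *.fastq.gz | cut -f1 -d '_' | sort -u > names.txt
# Concatenate the per-lane files of each sample into one R1 and one R2 file
while read -r line; do cat "${line}"_*_*_R1_*.fastq.gz > "${line}.R1.fastq.gz"; done < names.txt
while read -r line; do cat "${line}"_*_*_R2_*.fastq.gz > "${line}.R2.fastq.gz"; done < names.txt
# Remove the original per-lane files
rm *L00*
```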
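
Because the last command deletes files, it may be worth trying the merge on dummy data first. The sketch below creates fake lane files for one hypothetical sample in a throwaway directory and runs the merge commands on them. The file names (`sampleA_S1_L001_R1_001.fastq.gz` and so on) are an assumption about the naming scheme, so check them against your own data.

```bash
# Throwaway directory with fake lane files for one hypothetical sample.
demo=$(mktemp -d)
cd "$demo" || exit 1
for lane in L001 L002; do
    for mate in R1 R2; do
        echo "${lane}_${mate}" | gzip -c > "sampleA_S1_${lane}_${mate}_001.fastq.gz"
    done
done

# The merge commands, with `ls` restricted to *.fastq.gz so that names.txt itself is not listed.
ls *.fastq.gz | cut -f1 -d '_' | sort -u > names.txt
while read -r line; do cat "${line}"_*_*_R1_*.fastq.gz > "${line}.R1.fastq.gz"; done < names.txt
while read -r line; do cat "${line}"_*_*_R2_*.fastq.gz > "${line}.R2.fastq.gz"; done < names.txt
rm *L00*

# Concatenated gzip files decompress to the concatenation of their contents.
merged_r1=$(gzip -dc sampleA.R1.fastq.gz)
echo "$merged_r1"
cd - > /dev/null
```

After the run, the demo directory holds only `names.txt` and the two merged files, which shows that `rm *L00*` removes the per-lane files but leaves the merged ones alone.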
