# <u>MSc Module 2 - metagenomics workshop </u>

The main aim of this workshop is for you to become familiar with analysing metagenomic sequencing data. Because this analysis needs substantial computational power, it is routinely run on compute clusters.
We will use CREATE, the high performance computing (HPC) environment available at King's.

Please see this wiki : https://docs.er.kcl.ac.uk/CREATE/access/

This notebook contains all the necessary steps for you to:
1. Log into the HPC environment
2. Set up a virtual environment using conda that contains MetaPhlAn
3. Run a single sample and inspect the output
4. Submit a script to the cluster to run the entire dataset
5. Start the downstream analysis

##### The commands/scripts you will need will appear as below

In [None]:
these are commands that you should run

and the rest of the text is to guide you through the workshop

We will break it down into tasks according to the steps above. 

##### Please follow along throughout; we will ONLY continue once everyone has finished each step.

### <u> Task 1 : logging into the HPC </u>

<u>Step1</u> : open terminal 

* Mac: open Terminal from Launchpad (on a Linux distribution, open your terminal application)
* Windows: use MobaXterm downloaded previously

<u>Step2</u> : ssh (replace k1234567 with your k-number)

In [None]:
ssh k1234567@hpc.create.kcl.ac.uk

Once you are logged in, you are on what is called a login node. Login nodes are used to edit scripts and run small tasks that do not require a lot of CPU. To run bigger jobs (the reason we use an HPC) you need to be on a larger node, usually termed a compute node. There are different ways to get onto one, and we will cover them in the next steps.

<u>Step3</u> : move to project shared space and view the directory structure. Then move into your dedicated space

In [None]:
cd /scratch/prj/docs_microbiome_msc/
ls
cd k1234567

The ls output shows one directory per workshop participant, named by k-number, as well as a directory called <b>shared</b> that houses the data that you will need to run MetaPhlAn.

### <u> Task 2 : setup a virtual environment for metaphlan </u>

https://github.com/biobakery/MetaPhlAn

https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4#installation


A big advantage of using an HPC is that it usually comes with several modules (or software packages) already installed 

To view all modules you can run the command

In [None]:
module spider

To load anaconda into your current session, run the command below

In [None]:
module load anaconda3

Note: remember that you will have to load any module again each time you log out and back in to the cluster.

Set the default shell of conda to bash. You only have to do this once. After you have done this, log out and back in to the cluster

In [None]:
conda init bash

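Create a conda environment with the name <u>msc</u> (you can name it anything, but remember the name). Including python and pip in the environment ensures that the pip step later installs into this environment rather than into the base installation.

In [None]:
conda create --name msc python pip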

If asked to proceed type <i>y</i> and hit enter

The necessary packages will then be installed in the environment.

To enter the environment:

In [None]:
conda activate msc

Install MetaPhlAn in this environment using pip

In [None]:
pip install metaphlan
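
A quick way to confirm that the install landed where you expect is to ask the shell which metaphlan it would run. This is a minimal sketch; it assumes only that the metaphlan executable is on your PATH while the msc environment is active.

```shell
# Report which metaphlan executable the shell would run, if any.
if command -v metaphlan >/dev/null 2>&1; then
    status="metaphlan found at $(command -v metaphlan)"
    # Print the installed version number.
    metaphlan --version
else
    status="metaphlan not found: run 'conda activate msc' first"
fi
echo "$status"
```

If the path printed does not sit inside your msc environment, the package was probably installed somewhere else.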

### <u> Task 3 : Run a single sample </u>


### Run this on an interactive node

Create directories (folders) where the input and output of the pipeline will be stored

In [None]:
mkdir input output

move into the input directory

In [None]:
cd input

create symbolic links for all raw data files to this (input) directory

In [None]:
ln -s /scratch/prj/docs_microbiome_msc/shared/data/*.gz .
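
To see what the symbolic links do without touching the real data, the same pattern can be tried in a throwaway directory. This is only an illustrative sketch: the temporary directory and the SAMPLE file names are made up and are not part of the workshop data.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_links=$(mktemp -d)
mkdir -p "$demo_links/shared/data" "$demo_links/input"

# Two empty stand-in files (hypothetical names) play the role of the raw data.
touch "$demo_links/shared/data/SAMPLE_1.fastq.gz" "$demo_links/shared/data/SAMPLE_2.fastq.gz"

# Same pattern as the workshop command: link every .gz file into the current
# directory. The parentheses run this in a subshell, so your own shell stays put.
( cd "$demo_links/input" && ln -s "$demo_links"/shared/data/*.gz . )

# ls -l shows each link and the file it points to; the data itself is not copied.
ls -l "$demo_links/input"
n_links=$(find "$demo_links/input" -maxdepth 1 -type l | wc -l | tr -d ' ')
echo "links created: $n_links"
```

Because only links are created, every participant can work on the same raw files without filling the shared space with copies.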

Even a single MetaPhlAn run requires significant CPU, so it should be done either on a compute node or by submitting a bash script to the scheduler. We will do both to show the difference.

First, log in to a compute node using this command

In [None]:
srun -p cpu --pty /bin/bash
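
To check that you really have moved off the login node, you can compare the machine name before and after running srun. A small sketch, assuming the standard behaviour that SLURM sets the SLURM_JOB_ID variable inside an allocation:

```shell
# Name of the machine you are currently on; compare it before and after srun.
node=$(hostname)
echo "current node: $node"

# SLURM_JOB_ID is set inside a SLURM allocation and unset on a login node.
echo "SLURM job: ${SLURM_JOB_ID:-none (not inside a SLURM job)}"
```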

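Move back up to your own project directory (cd ..) so that the input/ and output/ paths below resolve, then run the following command to process a single sample. PLEASE WAIT so that we do this together to ensure that you do this correctly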

In [None]:
metaphlan input/ERR526291_1.fastq.gz,input/ERR526291_2.fastq.gz \
--bowtie2out output/metagenome.bowtie2.bz2 \
--input_type fastq \
-o output/profiled_metagenome.txt \
--bowtie2db /scratch/prj/docs_microbiome_msc/shared/metaphlan_db/

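This will take some time to run, so cancel it using CONTROL+C, then type exit to leave the compute node and return to the login node. If the cancelled run left a partial output/metagenome.bowtie2.bz2 file behind, remove it before the next step, because MetaPhlAn stops when the --bowtie2out file already exists.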

### Submit a job to the SLURM queue

Navigate back to /scratch/prj/docs_microbiome_msc/k1234567

Create a file called submit.sh and open it with either vim or nano

Fill the file with the following lines of code

In [None]:
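#!/bin/bash -l
# Name shown in the queue
#SBATCH --job-name=test_metaphlan
# Number of tasks requested from the scheduler
#SBATCH --ntasks=10

# Make conda available and activate the environment created in Task 2
module load anaconda3
source activate msc

# Profile one paired-end sample (same command as the interactive run)
metaphlan input/ERR526291_1.fastq.gz,input/ERR526291_2.fastq.gz --bowtie2out output/metagenome.bowtie2.bz2 --input_type fastq -o output/profiled_metagenome.txt --bowtie2db /scratch/prj/docs_microbiome_msc/shared/metaphlan_db/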

submit the job with the following command

In [None]:
sbatch submit.sh

view the job with the following command

In [None]:
squeue -u k1234567
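
Once the job has started, SLURM writes anything the script prints to a log file in the directory you submitted from. A minimal sketch for looking at the most recent log, assuming the default log name slurm-<jobid>.out (submit.sh does not set a different one):

```shell
# Newest SLURM log in the current directory, if there is one.
log=$(ls -t slurm-*.out 2>/dev/null | head -n 1)
if [ -n "$log" ]; then
    # Show the last 20 lines, where errors usually appear.
    tail -n 20 "$log"
else
    echo "no slurm-*.out file here yet"
fi
```

If a job needs to be stopped, scancel followed by the job ID shown by squeue cancels it.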

### <u> Task 4 : Run multiple samples </u>


Create a sample list file

In [None]:
ls input/ | awk -F'_' '{print $1}' | sort -u > sample_list.txt
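
To see what this pipeline produces, it can be run on a few empty stand-in files. This is an illustrative sketch: the temporary directory and the SAMPLEA/SAMPLEB names are made up.

```shell
# Throwaway directory with four empty files named like paired-end reads.
demo_list=$(mktemp -d)
mkdir "$demo_list/input"
touch "$demo_list/input/SAMPLEA_1.fastq.gz" "$demo_list/input/SAMPLEA_2.fastq.gz" \
      "$demo_list/input/SAMPLEB_1.fastq.gz" "$demo_list/input/SAMPLEB_2.fastq.gz"

# Same pipeline as above: keep the text before the first underscore,
# then sort and drop duplicates so each sample appears once.
ls "$demo_list/input/" | awk -F'_' '{print $1}' | sort -u > "$demo_list/sample_list.txt"
cat "$demo_list/sample_list.txt"
# prints:
# SAMPLEA
# SAMPLEB
```

Each sample has two read files, so without sort -u every sample ID would be listed twice.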

Create a file named submit_all.sh and populate it with the code below

In [None]:
#!/bin/bash -l
#SBATCH --job-name=test_metaphlan_multiple
#SBATCH --ntasks=10

module load anaconda3
source activate msc

input="sample_list.txt"

while read -r line
do
  metaphlan "input/${line}_1.fastq.gz,input/${line}_2.fastq.gz" --input_type fastq --bowtie2out "output/${line}_bowtie2.bz2" -o "output/${line}_profiled.tsv" --bowtie2db /scratch/prj/docs_microbiome_msc/shared/metaphlan_db/
done < "$input"
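
Before submitting a loop like this, it can help to do a dry run that prints each command instead of executing it and flags samples with a missing read file. The sketch below runs on made-up stand-in files (SAMPLEA and SAMPLEB are hypothetical, and SAMPLEB deliberately lacks its second read file); the metaphlan command is abbreviated to the input and output arguments.

```shell
# Throwaway directory with stand-in files; SAMPLEB has no _2 file on purpose.
demo_loop=$(mktemp -d)
mkdir "$demo_loop/input" "$demo_loop/output"
touch "$demo_loop/input/SAMPLEA_1.fastq.gz" "$demo_loop/input/SAMPLEA_2.fastq.gz" \
      "$demo_loop/input/SAMPLEB_1.fastq.gz"
printf 'SAMPLEA\nSAMPLEB\n' > "$demo_loop/sample_list.txt"

missing=0
while read -r line
do
  r1="$demo_loop/input/${line}_1.fastq.gz"
  r2="$demo_loop/input/${line}_2.fastq.gz"
  if [ -e "$r1" ] && [ -e "$r2" ]; then
    # echo prints the command that would be run, without running it.
    echo metaphlan "$r1,$r2" --input_type fastq -o "$demo_loop/output/${line}_profiled.tsv"
  else
    echo "skipping $line: a read file is missing" >&2
    missing=$((missing + 1))
  fi
done < "$demo_loop/sample_list.txt"
echo "samples skipped: $missing"
```

Removing the echo in front of metaphlan turns the dry run into the real loop.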

You could submit this with sbatch submit_all.sh, but please do not do so during the workshop, to spare the cluster.