# <u>MSc Module 2 - metagenomics workshop </u>

The main aim of this workshop is for you to become familiar with analysing metagenomic sequencing data. Because this analysis needs a lot of computational power, it is routinely done on compute clusters. 
We will use CREATE, the high performance computing (HPC) environment available at King's.

Please see this wiki: https://docs.er.kcl.ac.uk/CREATE/access/

This notebook contains all the necessary steps for you to:
1. Log into the HPC environment
2. Set up a virtual environment using conda that contains MetaPhlAn
3. Run a single sample and inspect the output
4. Submit a script to the cluster to run the entire dataset
5. Start the downstream analysis

##### The commands/scripts you will need will appear as below

and the rest of the text is to guide you through the workshop

We will break it down into tasks according to the steps above. 

##### Please follow along throughout; we will ONLY continue once everyone has finished each step.

### <u> Task 1 : logging into the HPC </u>

<u>Step1</u> : open terminal 

* Mac: if you are using a MacBook you will find Terminal in Launchpad (on Linux, open your distribution's terminal application)
* Windows: use MobaXterm downloaded previously

<u>Step2</u> : ssh (replace k1234567 with your k-number)

Once you are logged in, you are on what is called a login node. Login nodes are used to edit scripts and run small tasks that do not require a lot of CPU. To run bigger jobs (the reason we use an HPC) you need to be on a larger node, usually termed a compute node. There are different ways to get onto one, and we will do this in the next steps.

<u>Step3</u> : move to project shared space and view the directory structure

Here you will see a directory for each workshop participant, named by k-number, as well as a directory called <b>shared</b> that houses the data that you will need to run MetaPhlAn.

### <u> Task 2 : setup a virtual environment </u>


A big advantage of using an HPC is that it usually comes with several modules (or software packages) already installed.

To view all modules you can run the command

To load Anaconda into your current session, run the command below

** Remember that you will have to load any module again if you log out and back in to the cluster.

Create a conda environment with the name <u>msc</u> (you can name it anything, but remember the name) and install MetaPhlAn in this environment with the same command

If asked to proceed, type <i>y</i> and hit enter.

The necessary packages will then be installed in the environment.

To enter the environment:

### <u> Task 3 : Run a single sample </u>


Create directories (folders) where the input and output of the pipeline will be stored

Move into the input directory

Create symbolic links for all raw data files in this (input) directory

Even a single MetaPhlAn run requires significant CPU, so it should be done either on a compute node or by submitting a bash script to the scheduler. We will do both to show the difference.

First, log in to a compute node using this command

Run the following command to process a single sample. PLEASE WAIT so that we do this together and can make sure it is done correctly

Once this is completed you can view the output using this command

In [2]:
import pandas as pd
from collections import Counter,defaultdict

In [10]:
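# Load the merged MetaPhlAn table (clades as rows, one column per run) and the sample tables
df = pd.read_csv('M2/Metagenomics analysis/merged.tsv',sep='\t',index_col=0)
# Note: rstrip('d') removes every trailing 'd' character from each column name
df.columns = df.columns.str.rstrip('d')
sample_map = pd.read_csv('M2/Metagenomics analysis/chinese_sample_map.txt',sep='\t')
meta = pd.read_csv('M2/Metagenomics analysis/chinese_metadata.csv',sep=',')

# Rename columns from run accession to sample alias (e.g. ERR527006 -> HV1_Run4)
df = df.rename(columns=dict(zip(sample_map['run_accession'],sample_map['sample_alias'])))

# Keep only the columns whose alias contains 'HD' or 'LD'
select = [col for col in df.columns if 'HD' in col or 'LD' in col]
df = df[select]

# Group the columns by sample ID (the part of the alias before the first '_')
select = defaultdict(list)
for col in df.columns:
    select[col.split('_')[0]].append(col)

# Keep the first column (run) listed for each sample ID
select2 = [cols[0] for cols in select.values()]
df = df[select2]

# Drop the run suffix so that the columns are plain sample IDs
df.columns = df.columns.str.split('_').str[0]
#df.to_csv('M2/Metagenomics analysis/merged_parsed.tsv')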
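
To see what the column selection above does, here is a minimal sketch on a toy table. The aliases (`HD1_Run1` and so on) and the abundance values are made up for illustration; only the `HD`/`LD` substrings and the `_` separator are taken from the cell above. It uses `duplicated()` as an alternative to the `defaultdict` grouping, which should give the same result when each alias is unique.

```python
import pandas as pd

# Hypothetical toy table: two runs for HD1, one for LD2, one for HV1 (made-up values).
toy = pd.DataFrame(
    {"HD1_Run1": [60.0, 40.0], "HD1_Run2": [55.0, 45.0],
     "LD2_Run1": [30.0, 70.0], "HV1_Run4": [50.0, 50.0]},
    index=["k__Bacteria|p__Firmicutes", "k__Bacteria|p__Bacteroidetes"],
)

# Keep only aliases containing 'HD' or 'LD'.
toy = toy[[col for col in toy.columns if "HD" in col or "LD" in col]]

# Sample ID = the part of the alias before the first underscore.
sample_ids = toy.columns.str.split("_").str[0]

# Keep the first run seen for each sample ID, then rename the columns.
keep = ~sample_ids.duplicated()
toy = toy.loc[:, keep]
toy.columns = sample_ids[keep]

print(list(toy.columns))  # ['HD1', 'LD2']
```

The second HD1 run and the HV1 column are dropped, leaving one column per selected sample.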

In [27]:
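# Label each row by the deepest level of its lineage (the text after the last '|')
df2 = df.copy()
df2.index = df2.index.str.split('|').str[-1]

#kingdom
df3 = df2.loc[df2.index.str.contains('k__')]
df3.index = df3.index.str.replace('k__','')
df3.to_csv('M2/Metagenomics analysis/kingdom_abd.csv')

# For the lower ranks, keep bacterial lineages only, then relabel the rows as above
df2 = df.copy()
df2 = df2.loc[df2.index.str.contains('k__Bacteria')]
df2.index = df2.index.str.split('|').str[-1]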

#phylum
df3 = df2.loc[df2.index.str.contains('p__')]
df3.index = df3.index.str.replace('p__','')
df3.to_csv('M2/Metagenomics analysis/phylum_abd.csv')

#class
df3 = df2.loc[df2.index.str.contains('c__')]
df3.index = df3.index.str.replace('c__','')
df3.to_csv('M2/Metagenomics analysis/class_abd.csv')

#order
df3 = df2.loc[df2.index.str.contains('o__')]
df3.index = df3.index.str.replace('o__','')
df3.to_csv('M2/Metagenomics analysis/order_abd.csv')

#family
df3 = df2.loc[df2.index.str.contains('f__')]
df3.index = df3.index.str.replace('f__','')
df3.to_csv('M2/Metagenomics analysis/family_abd.csv')

#genus
df3 = df2.loc[df2.index.str.contains('g__')]
df3.index = df3.index.str.replace('g__','')
df3.to_csv('M2/Metagenomics analysis/genus_abd.csv')

#species
df3 = df2.loc[df2.index.str.contains('s__')]
df3.index = df3.index.str.replace('s__','')
df3.to_csv('M2/Metagenomics analysis/species_abd.csv')
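
The blocks above repeat the same three steps for every rank: select the rows, strip the prefix, save the table. As a rough sketch, the same idea can be written as one loop. The names `RANKS` and `split_by_rank` are hypothetical, the toy lineage and its values are made up, and the sketch returns the tables instead of writing files. It matches prefixes with `startswith` rather than `contains`, which is slightly stricter, and it does not apply the `k__Bacteria` filter used above.

```python
import pandas as pd

# Rank prefixes in the clade names, mapped to the labels used for the output files above.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}

def split_by_rank(abundance):
    """Return {rank label: table of clades at that rank}, indexed by the bare clade name."""
    last = abundance.index.str.split("|").str[-1]  # deepest level of each lineage
    tables = {}
    for prefix, label in RANKS.items():
        mask = last.str.startswith(prefix + "__")
        table = abundance.loc[mask].copy()
        # Drop the 'x__' prefix from the row labels.
        table.index = [name[len(prefix) + 2:] for name in last[mask]]
        tables[label] = table
    return tables

# Made-up toy profile: one lineage reported at every rank from kingdom to species.
levels = ["k__Bacteria", "p__Firmicutes", "c__Clostridia", "o__Clostridiales",
          "f__Ruminococcaceae", "g__Faecalibacterium", "s__Faecalibacterium_prausnitzii"]
lineages = ["|".join(levels[: i + 1]) for i in range(len(levels))]
toy = pd.DataFrame({"HD1": [100.0] * len(levels)}, index=lineages)

tables = split_by_rank(toy)
print(tables["species"])
```

Each table could then be saved with `to_csv` under the file names used above.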

In [14]:
# Collect the rank prefix (the part before '__') of every clade name in the index
set(i.split('__')[0] for i in df2.index)

{'c', 'f', 'g', 'k', 'o', 'p', 's', 't'}
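
The set shows which rank prefixes occur, including `t`, for which no table is written above. To also see how many rows sit at each rank, the `Counter` imported earlier can be used. This is a small sketch on a made-up index in the same `x__Name` form as `df2.index`; the clade names are illustrative only.

```python
from collections import Counter
import pandas as pd

# Made-up index in the same 'x__Name' form as df2.index.
toy_index = pd.Index(["k__Bacteria", "p__Firmicutes", "p__Bacteroidetes",
                      "g__Faecalibacterium", "s__Faecalibacterium_prausnitzii"])

# Count how many rows carry each rank prefix.
rank_counts = Counter(name.split("__")[0] for name in toy_index)
print(rank_counts)
```

On the real table, replacing `toy_index` with `df2.index` would give the number of clades reported at each rank.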

In [9]:
# Keep only the metadata rows for samples that are present in the abundance table
meta = pd.read_csv('M2/Metagenomics analysis/chinese_metadata.csv',sep=',')
meta = meta[meta['Sample ID'].isin(df.columns)]
meta.to_csv('M2/Metagenomics analysis/metadata_parsed.csv')
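
The filter above keeps metadata rows that have a matching abundance column, but it does not report abundance columns that lack metadata. A quick check for that could look like the sketch below. The tables `toy_df` and `toy_meta`, the `Age` column and all values are made up; only the `Sample ID` column name comes from the cell above.

```python
import pandas as pd

# Made-up stand-ins for df (abundances) and meta (metadata).
toy_df = pd.DataFrame({"HD1": [60.0], "LD2": [30.0], "HD3": [10.0]}, index=["k__Bacteria"])
toy_meta = pd.DataFrame({"Sample ID": ["HD1", "LD2", "HV1"], "Age": [50, 61, 47]})

# Same filter as above: keep metadata rows whose sample is in the abundance table.
toy_meta = toy_meta[toy_meta["Sample ID"].isin(toy_df.columns)]

# Samples that have abundances but no metadata row.
missing = sorted(set(toy_df.columns) - set(toy_meta["Sample ID"]))
print(missing)  # ['HD3']
```

An empty list means every sample in the abundance table has a metadata row.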

In [17]:
sample_map = pd.read_csv('M2/Metagenomics analysis/chinese_sample_map.txt',sep='\t')
sample_map.head(1)

| Unnamed: 0 | run_accession | sample_accession | experiment_accession | study_accession | tax_id | scientific_name | fastq_ftp | submitted_ftp | sra_ftp | sample_alias |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ERR527006 | SAMEA2581978 | ERX492209 | PRJEB6337 | 256318 | metagenome | ftp.sra.ebi.ac.uk/vol1/fastq/ERR527/ERR527006/... | ftp.sra.ebi.ac.uk/vol1/run/ERR527/ERR527006/HV... | ftp.sra.ebi.ac.uk/vol1/err/ERR527/ERR527006 | HV1_Run4 |
