# How to run Snippy, Gubbins, AMRFinderPlus and MLST.

#### 1. Snippy and Gubbins

Link to Snippy Github: https://github.com/tseemann/snippy

Link to Gubbins Github: https://github.com/nickjcroucher/gubbins

For this demo, we will use the dataset from Assignment 2, which is located here:

```/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/```

Since Gubbins runs on the final output of Snippy, we will run both analyses in the same folder.

- We will create a new directory, ```shared_data/data/snippy_and_gubbins_demo```, to save the Snippy and Gubbins results.

In [6]:
cd /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/

In [2]:
mkdir snippy_and_gubbins_demo

In [7]:
cd snippy_and_gubbins_demo

- Prepare multiple sample list for Snippy

In class9, we ran Snippy on a single isolate, PCMP_H326. Running Snippy individually on every isolate in a dataset quickly becomes cumbersome and would require writing a for loop.

However, the Snippy package comes with a helper script called ```snippy-multi``` that takes a tab-separated list of sample names and the paths to their FASTQ files, and generates a bash script that we can then submit as a job.

To prepare an input list of samples, run this command:

```
# Loop through each forward-read file and print the genome name and F/R read files for the Snippy batch
for r1 in /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/*_1.fastq;
do
    # Get the genome name from the forward-read file name (strip the directory and the _1.fastq suffix)
    isolate=$(basename "$r1" _1.fastq)
    # Get the reverse reads corresponding to the current forward reads
    r2="${r1%_1.fastq}_2.fastq"
    # Print genome, forward reads and reverse reads, separated by tabs
    printf '%s\t%s\t%s\n' "$isolate" "$r1" "$r2"
done > input.tab
```

***Let's check the input.tab file we just created:***

In [6]:
head input.tab

SRR6204326	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204326_1.fastq	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204326_2.fastq
SRR6204327	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204327_1.fastq	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204327_2.fastq
SRR6204328	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204328_1.fastq	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204328_2.fastq
SRR6204329	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204329_1.fastq	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204329_2.fastq
SRR6204330	/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crk

***The first column contains the sample names that Snippy will use to create an output folder for each sample; the second and third columns contain the paths to that sample's paired-end reads.***

- Run snippy-multi on the input sample list

- Let's load the modules required to run Snippy and check that all the dependencies look good.

In [18]:
# Load these modules
module load python3.9-anaconda/2021.11
module load Bioinformatics
module load perl-modules


    Provides Bioinformatics software.
    For more information please use:

        $ module help Bioinformatics




In [19]:
snippy --check

[14:40:04] This is snippy 4.6.0
[14:40:04] Written by Torsten Seemann
[14:40:04] Obtained from https://github.com/tseemann/snippy
[14:40:04] Detected operating system: linux
[14:40:04] Enabling bundled linux tools.
[14:40:04] Found bwa - /gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/bin/snippy/binaries/linux/bwa
[14:40:04] Found bcftools - /gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/bin/snippy/binaries/linux/bcftools
[14:40:04] Found samtools - /gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/bin/snippy/binaries/linux/samtools
[14:40:04] Found java - /usr/bin/java
[14:40:04] Found snpEff - /gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/bin/snippy/binaries/noarch/snpEff
[14:40:04] Found samclip - /gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/bin/snippy/binaries/noarch/samclip
[14:40:04] Found seqtk - /gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/bin/snippy/binaries/linux

- Only proceed if the last line of the snippy check is "Dependences look good!"
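
The check on the last line can also be scripted. The sketch below is illustrative only: `snippy_deps_ok` is a hypothetical helper name, and merging standard error into standard output assumes that Snippy writes its log messages to standard error.

```shell
# Hypothetical helper: succeeds only if the last line read from standard input
# is Snippy's success message
snippy_deps_ok() {
    tail -n 1 | grep -q 'Dependences look good!'
}

# Run the check only where snippy is available on the PATH
if command -v snippy >/dev/null 2>&1; then
    # 2>&1 merges the log into standard output so the helper can read it
    if snippy --check 2>&1 | snippy_deps_ok; then
        echo "Snippy dependencies look good, safe to proceed"
    else
        echo "Snippy dependency check failed, fix the environment first" >&2
    fi
fi
```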

We will run snippy-multi on the command line with input.tab as the data input and the path to a reference genome GenBank file.

Note: Make sure to use a project-specific reference genome.

In [21]:
snippy-multi input.tab --ref /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/class9/KPNIH1.gbk --cpu 4 --force --report > runme.sh

Reading: input.tab
Generating output commands for 69 isolates
Done.


Note:

- snippy-multi generates an individual snippy command for each sample, as well as a snippy-core command that aggregates the per-sample Snippy results and generates a core alignment file. This alignment can then be used as input for Gubbins, which masks recombinant regions in the alignment and generates a phylogenetic tree.

- If snippy-multi throws an error saying that it can't find a read file or a path, it is probably due to the format of your input.tab. Check your for loop and regenerate the input.tab file. This file should contain three columns, each separated by a tab.
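
The three-column requirement can be checked before calling snippy-multi. This is a minimal sketch rather than anything shipped with Snippy: `check_input_tab` is a hypothetical helper, and it assumes the layout described above (sample name, forward reads, reverse reads).

```shell
# Hypothetical helper: report malformed lines and missing read files in a sample list
check_input_tab() {
    # Every line must have exactly three tab-separated fields
    awk -F'\t' '
        NF != 3 { printf "line %d: expected 3 tab-separated columns, found %d\n", NR, NF; bad = 1 }
        END     { exit bad }
    ' "$1" || return 1
    # Every read file named in columns 2 and 3 must exist
    tab=$(printf '\t')
    rc=0
    while IFS="$tab" read -r isolate r1 r2; do
        for f in "$r1" "$r2"; do
            [ -f "$f" ] || { echo "$isolate: missing read file $f"; rc=1; }
        done
    done < "$1"
    return "$rc"
}
```

Running `check_input_tab input.tab` prints one message per problem and returns a non-zero status if anything is wrong, so it can gate the snippy-multi call.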

In [22]:
head -n5 runme.sh

snippy --outdir 'SRR6204326' --R1 '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204326_1.fastq' --R2 '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204326_2.fastq' --ref /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/class9/KPNIH1.gbk --cpu 4 --force --report
snippy --outdir 'SRR6204327' --R1 '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204327_1.fastq' --R2 '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204327_2.fastq' --ref /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/class9/KPNIH1.gbk --cpu 4 --force --report
snippy --outdir 'SRR6204328' --R1 '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_fastq/SRR6204328_1.fastq' --R2 '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/data/ass

***snippy-multi generated a snippy command for each sample and saved them in the runme.sh script. At the end of runme.sh, you will find a snippy-core command, which aggregates the variant calling results and generates a core alignment that will be our input for Gubbins.***

In [23]:
tail -n1 runme.sh

snippy-core --ref 'SRR6204326/ref.fa' SRR6204326 SRR6204327 SRR6204328 SRR6204329 SRR6204330 SRR6204331 SRR6204332 SRR6204333 SRR6204334 SRR6204335 SRR6204336 SRR6204337 SRR6204338 SRR6204339 SRR6204340 SRR6204341 SRR6204342 SRR6204343 SRR6204344 SRR6204345 SRR6204346 SRR6204347 SRR6204348 SRR6204349 SRR6204350 SRR6204351 SRR6204352 SRR6204353 SRR6204354 SRR6204355 SRR6204356 SRR6204357 SRR6204358 SRR6204359 SRR6204360 SRR6204361 SRR6204362 SRR6204363 SRR6204364 SRR6204365 SRR6204366 SRR6204367 SRR6204368 SRR6204369 SRR6204370 SRR6204371 SRR6204372 SRR6204373 SRR6204374 SRR6204375 SRR6204376 SRR6204377 SRR6204378 SRR6204379 SRR6204380 SRR6204381 SRR6204382 SRR6204383 SRR6204384 SRR6204385 SRR6204386 SRR6204387 SRR6204388 SRR6204389 SRR6204390 SRR6204391 SRR6204392 SRR6204393 SRR6204394
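
Before submitting the job, it can be worth confirming that runme.sh contains one snippy command per sample and a single snippy-core command. The following is a sketch under the assumption that the generated lines start with `snippy --outdir` and `snippy-core ` as in the output above; `check_runme` is a hypothetical helper name.

```shell
# Hypothetical helper: compare the number of samples in the input list with the
# number of per-sample snippy commands in the generated script
check_runme() {
    n_samples=$(awk 'NF > 0' "$1" | wc -l)
    # grep -c exits non-zero when it counts 0 matches, hence the "|| true"
    n_commands=$(grep -c '^snippy --outdir' "$2" || true)
    n_core=$(grep -c '^snippy-core ' "$2" || true)
    echo "samples=$((n_samples)) snippy_commands=$((n_commands)) snippy_core_commands=$((n_core))"
    [ "$((n_samples))" -eq "$((n_commands))" ] && [ "$((n_core))" -eq 1 ]
}
```

Calling `check_runme input.tab runme.sh` prints the three counts and returns a non-zero status if they do not match.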


- Submit Snippy job

In [25]:
touch snippy.sbat

**Substitute username with your umich uniqname and paste these lines into the snippy.sbat file using nano:**

```
#!/bin/sh
# Job name
#SBATCH --job-name=Snippy
# User info
#SBATCH --mail-user=username@umich.edu
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE
#SBATCH --export=ALL
#SBATCH --partition=standard
#SBATCH --account=epid582w23_class
# Number of cores, amount of memory, and walltime
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=20g --time=240:00:00
#  Change to the directory you submitted from
cd $SLURM_SUBMIT_DIR
echo $SLURM_SUBMIT_DIR

bash runme.sh
```

Submit the job using sbatch

In [26]:
sbatch snippy.sbat

Submitted batch job 49825358
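
The job ID in the sbatch message is what you need to follow the job afterwards. As a rough sketch (the helper name `job_id_from_sbatch` is hypothetical), the ID can be captured at submission time and passed to `squeue`:

```shell
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Only attempt a submission where Slurm is available and the job script exists
if command -v sbatch >/dev/null 2>&1 && [ -f snippy.sbat ]; then
    job_id=$(sbatch snippy.sbat | job_id_from_sbatch)
    # Show the current state of this job in the queue
    squeue --jobs "$job_id"
fi
```

Once the job has left the queue, `sacct -j` with the same ID reports how it ended.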


Snippy will generate variant calling results for each sample, along with other useful reports that we can analyze further as needed. The final snippy-core step additionally writes the following files, each named with the prefix `core`:

Extension | Description
----------|--------------
.aln | A core SNP alignment in the `--aformat` format (default FASTA)
.full.aln | A whole genome SNP alignment (includes invariant sites)
.tab | Tab-separated columnar list of **core** SNP sites with alleles but NO annotations
.vcf | Multi-sample VCF file with genotype `GT` tags for all discovered alleles
.txt | Tab-separated columnar list of alignment/core-size statistics
.ref.fa | FASTA version/copy of the `--ref`
.self_mask.bed | BED file generated if `--mask auto` is used.

In [8]:
cat core.txt

ID	LENGTH	ALIGNED	UNALIGNED	VARIANT	HET	MASKED	LOWCOV
SRR6204326	5766615	5461922	286542	208	2437	0	15714
SRR6204327	5766615	5419140	292048	189	11778	0	43649
SRR6204328	5766615	5312074	292778	168	7127	0	154636
SRR6204329	5766615	5460187	286745	176	3617	0	16066
SRR6204330	5766615	5463038	284579	173	2810	0	16188
SRR6204331	5766615	5443496	287518	174	9402	0	26199
SRR6204332	5766615	5450708	286854	176	4746	0	24307
SRR6204333	5766615	5445196	286778	175	8733	0	25908
SRR6204334	5766615	5397586	354903	164	1474	0	12652
SRR6204335	5766615	5453872	287043	170	1345	0	24355
SRR6204336	5766615	5441449	287387	170	10522	0	27257
SRR6204337	5766615	5427392	289800	168	1085	0	48338
SRR6204338	5766615	5363788	298058	165	1914	0	102855
SRR6204339	5766615	5406545	348332	163	319	0	11419
SRR6204340	5766615	5173207	311606	149	3873	0	277929
SRR6204341	5766615	5453867	286001	146	493	0	26254
SRR6204342	5766615	4299630	417158	107	6937	0	1042890
SRR6204343	5766615	5045617	346284	107	4465	0	370249
SRR6204344	5766615	532
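
The ALIGNED and LENGTH columns give a quick quality check: samples that align to only a small fraction of the reference may deserve a closer look before tree building. The sketch below is one way to flag them; `flag_low_aligned` is a hypothetical helper and the cutoff is an arbitrary illustration, not a recommended value.

```shell
# Hypothetical helper: list samples whose aligned fraction of the reference
# is below a threshold given in percent
# Usage: flag_low_aligned core.txt 90
flag_low_aligned() {
    awk -F'\t' -v min="$2" '
        NR == 1 { next }                    # skip the header line
        NF >= 3 && $2 > 0 {
            pct = 100 * $3 / $2             # ALIGNED / LENGTH as a percentage
            if (pct < min) printf "%s\t%.2f\n", $1, pct
        }
    ' "$1"
}
```

For example, `flag_low_aligned core.txt 90` would include SRR6204342, which aligns to roughly 75% of the reference in the table above.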

- Run Gubbins on the multiple sequence alignment `core.full.aln`

In [2]:
module load Bioinformatics
module load gubbins/2.3.1


    Provides Bioinformatics software.
    For more information please use:

        $ module help Bioinformatics




In [3]:
# Check if gubbins was loaded into the environment
run_gubbins -h

usage: run_gubbins [-h] [--outgroup OUTGROUP] [--starting_tree STARTING_TREE]
                   [--use_time_stamp] [--verbose] [--no_cleanup]
                   [--tree_builder TREE_BUILDER] [--iterations ITERATIONS]
                   [--min_snps MIN_SNPS]
                   [--filter_percentage FILTER_PERCENTAGE] [--prefix PREFIX]
                   [--threads THREADS] [--converge_method CONVERGE_METHOD]
                   [--version] [--min_window_size MIN_WINDOW_SIZE]
                   [--max_window_size MAX_WINDOW_SIZE]
                   [--raxml_model RAXML_MODEL] [--remove_identical_sequences]
                   alignment_filename

Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley
S. D., Parkhill J., Harris S.R. "Rapid phylogenetic analysis of large samples
of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids
Res. 2015 Feb 18;43(3):e15. doi: 10.1093/nar/gku1196 .

positional arguments:
  alignment_filename    Multifasta ali

In [9]:
touch gubbins.sbat

Copy and paste these lines to gubbins.sbat file using nano:

```
#!/bin/sh
# Job name
#SBATCH --job-name=Gubbins
# User info
#SBATCH --mail-user=username@umich.edu
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE
#SBATCH --export=ALL
#SBATCH --partition=standard
#SBATCH --account=epid582w23_class
# Number of cores, amount of memory, and walltime
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=40g --time=240:00:00
#  Change to the directory you submitted from
cd $SLURM_SUBMIT_DIR
echo $SLURM_SUBMIT_DIR

# --threads matches the --cpus-per-task value requested above
run_gubbins --prefix crkp_core_full_aln --threads 8 --verbose /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/snippy_and_gubbins_demo/core.full.aln
```

In [13]:
sbatch gubbins.sbat

Submitted batch job 49880796


- Gubbins will create various output files with the prefix `crkp_core_full_aln`.
- Once the job has finished, we will move these files to a new folder, gubbins_results.

In [11]:
mkdir gubbins_results

In [12]:
mv crkp_core_full_aln.* gubbins_results/
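
Because the move only makes sense after the job has finished, it can be guarded by a check for the final tree. This is a sketch only: `collect_gubbins_results` is a hypothetical helper, and it assumes that Gubbins names its final tree `<prefix>.final_tree.tre`.

```shell
# Hypothetical helper: move Gubbins outputs only once the final tree exists
# Assumption: the final tree is written as <prefix>.final_tree.tre
collect_gubbins_results() {
    prefix=$1
    dest=$2
    if [ -f "$prefix.final_tree.tre" ]; then
        mkdir -p "$dest"
        mv "$prefix".* "$dest"/
    else
        echo "$prefix.final_tree.tre not found; is the Gubbins job still running?" >&2
        return 1
    fi
}
```

It would be called as `collect_gubbins_results crkp_core_full_aln gubbins_results` from the directory where the job was submitted.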

#### 2. Run AMRFinderPlus on CRKP assemblies

In [14]:
cd /scratch/epid582w23_class_root/epid582w23_class/shared_data/data

In [15]:
mkdir AMRFinderPlus_demo

In [16]:
cd AMRFinderPlus_demo

In [17]:
conda activate class7

In [19]:
# Check if amrfinder is loaded into the environment
amrfinder -h

Identify AMR and virulence genes in proteins and/or contigs and print a report

DOCUMENTATION
    See https://github.com/ncbi/amr/wiki for full documentation

UPDATES
    Subscribe to the amrfinder-announce mailing list for database and software update notifications:
    https://www.ncbi.nlm.nih.gov/mailman/listinfo/amrfinder-announce

USAGE:   amrfinder [--update] [--force_update] [--protein PROT_FASTA] [--nucleotide NUC_FASTA] [--gff GFF_FILE] [--pgap] [--annotation_format ANNOTATION_FORMAT] [--database DATABASE_DIR] [--ident_min MIN_IDENT] [--coverage_min MIN_COV] [--organism ORGANISM] [--list_organisms] [--translation_table TRANSLATION_TABLE] [--plus] [--report_common] [--mutation_all MUT_ALL_FILE] [--blast_bin BLAST_DIR] [--report_all_equal] [--print_node] [--name NAME] [--output OUTPUT_FILE] [--protein_output PROT_FASTA_OUT] [--nucleotide_output NUC_FASTA_OUT] [--nucleotide_flank5_output NUC_FLANK5_FASTA_OUT] [--nucleotide_flank5_size NUC_FLANK5_SIZE] [--quiet] [--gpipe_

In [None]:
# Generate an AMRFinderPlus command for each genome with the following for loop. We will later submit this loop as a cluster job.
for i in /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/*.fasta;
do
# Derive the output file names from the assembly name
name=$(basename "$i" .fasta)
output="$name.txt"
report="${name}_mutation_report.tsv"
# Dry run: print the AMRFinderPlus command for this genome
echo "amrfinder --plus --output $output -n $i --mutation_all $report --organism Klebsiella_pneumoniae";
# Uncomment this line to run AMRFinderPlus instead of only printing the command
#amrfinder --plus --output $output -n $i --mutation_all $report --organism Klebsiella_pneumoniae;
done

- Note that you need to specify a project-specific organism name with the `--organism` argument.
- `amrfinder -l` will list possible options for this argument.

```Available --organism options: Acinetobacter_baumannii, Burkholderia_cepacia, Burkholderia_pseudomallei, Campylobacter, Clostridioides_difficile, Enterococcus_faecalis, Enterococcus_faecium, Escherichia, Klebsiella_oxytoca, Klebsiella_pneumoniae, Neisseria_gonorrhoeae, Neisseria_meningitidis, Pseudomonas_aeruginosa, Salmonella, Staphylococcus_aureus, Staphylococcus_pseudintermedius, Streptococcus_agalactiae, Streptococcus_pneumoniae, Streptococcus_pyogenes, Vibrio_cholerae```
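
To avoid a typo in the organism name, the list can be checked programmatically. The sketch below assumes that `amrfinder -l` prints a line in the `Available --organism options: ...` form shown above; `organism_available` is a hypothetical helper that reads that text from standard input.

```shell
# Hypothetical helper: succeed if the organism given as $1 appears in the
# "Available --organism options: ..." line read from standard input
organism_available() {
    sed -n 's/^Available --organism options: //p' | tr -d ' ' | tr ',' '\n' | grep -qx "$1"
}

# Run the check only where amrfinder is available on the PATH
if command -v amrfinder >/dev/null 2>&1; then
    if amrfinder -l 2>/dev/null | organism_available Klebsiella_pneumoniae; then
        echo "Klebsiella_pneumoniae is a valid --organism value"
    else
        echo "Klebsiella_pneumoniae was not found in the organism list" >&2
    fi
fi
```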

In [25]:
touch AMRFinderPlus_demo.sbat

Substitute username with your umich uniqname and paste these lines into the AMRFinderPlus_demo.sbat file using nano:

```
#!/bin/sh
# Job name
#SBATCH --job-name=AMRFinderPlus
# User info
#SBATCH --mail-user=username@umich.edu
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE
#SBATCH --export=ALL
#SBATCH --partition=standard
#SBATCH --account=epid582w23_class
# Number of cores, amount of memory, and walltime
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=8 --mem=40g --time=240:00:00
#  Change to the directory you submitted from
cd $SLURM_SUBMIT_DIR
echo $SLURM_SUBMIT_DIR

for i in /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/*.fasta;
do
# Derive the output file names from the assembly name
name=$(basename "$i" .fasta)
output="$name.txt"
report="${name}_mutation_report.tsv"
# Uncomment this line to print AMRFinderPlus commands.
#echo "amrfinder --plus --output $output -n $i --mutation_all $report --organism Klebsiella_pneumoniae";
amrfinder --plus --output "$output" -n "$i" --mutation_all "$report" --organism Klebsiella_pneumoniae;
done

```
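
After the job has finished, it is useful to confirm that every assembly produced a report. The loop above writes `<assembly name>.txt` for each genome, so a missing or empty file points to a failed run. The following is a sketch only; `missing_amr_reports` is a hypothetical helper name.

```shell
# Hypothetical helper: print the name of every assembly in directory $1 that has
# no non-empty <name>.txt report in directory $2
missing_amr_reports() {
    for fasta in "$1"/*.fasta; do
        [ -e "$fasta" ] || continue      # the glob matched no assemblies
        name=$(basename "$fasta" .fasta)
        [ -s "$2/$name.txt" ] || echo "$name"
    done
}
```

Called with the crkp_assembly folder and the AMRFinderPlus_demo folder as its two arguments, it prints nothing when all reports are present.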

#### 3. Perform MLST typing using the mlst tool

Link to MLST: https://github.com/tseemann/mlst

Installation instructions for mlst tool:

`conda create -n mlst -c conda-forge -c bioconda -c defaults mlst`


In [31]:
cd /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/

In [None]:
# Create a new folder to save mlst results
mkdir mlst_demo

In [33]:
# Activate the mlst conda environment
conda activate mlst

In [40]:
# Check that the mlst tool is installed and that its dependencies look okay
mlst -check

[16:30:18] This is mlst 2.23.0 running on linux with Perl 5.032001
[16:30:18] Checking mlst dependencies:
[16:30:18] Found 'blastn' => /home/apirani/.conda/envs/mlst/bin/blastn
[16:30:18] Found 'any2fasta' => /home/apirani/.conda/envs/mlst/bin/any2fasta
[16:30:18] Found blastn: 2.13.0+ (002013)
[16:30:18] OK.

In [41]:
# Check that the mlst tool contains an MLST typing scheme for klebsiella
mlst -list

[16:30:48] This is mlst 2.23.0 running on linux with Perl 5.032001
[16:30:48] Checking mlst dependencies:
[16:30:48] Found 'blastn' => /home/apirani/.conda/envs/mlst/bin/blastn
[16:30:48] Found 'any2fasta' => /home/apirani/.conda/envs/mlst/bin/any2fasta
[16:30:48] Found blastn: 2.13.0+ (002013)
pmultocida plarvae sbsec ureaplasma brachyspira psalmonis chlamydiales spneumoniae cbotulinum campylobacter_nonjejuni_8 sepidermidis campylobacter_nonjejuni_5 pfluorescens mplutonius brachyspira_5 campylobacter_nonjejuni_7 shaemolyticus csepticum neisseria bhenselae bwashoensis campylobacter_nonjejuni cronobacter staphlugdunensis spyogenes scanis helicobacter leptospira listeria_2 bsubtilis pputida cdifficile bordetella_3 mhominis_3 hparasuis ecloacae mgallisepticum_2 campylobacter_nonjejuni_9 mcatarrhalis_achtman_6 manserisalpingitidis lsalivarius bcereus wolbachia senterica_achtman_2 mycobacteria_2 brucella miowae vcholerae brachyspira_2 gallibacterium tpallidum xfastidiosa llactis_phag

In [None]:
# Run the mlst tool on all the CRKP assemblies in the crkp_assembly folder and save the output in CSV format to the file crkp_mlst.csv.
mlst --quiet --scheme klebsiella --csv /scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/*.fasta > crkp_mlst.csv

In [45]:
head crkp_mlst.csv

/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/SRR6204326.fasta,klebsiella,258,gapA(3),infB(3),mdh(1),pgi(1),phoE(1),rpoB(1),tonB(79)
/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/SRR6204327.fasta,klebsiella,258,gapA(3),infB(3),mdh(1),pgi(1),phoE(1),rpoB(1),tonB(79)
/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/SRR6204328.fasta,klebsiella,258,gapA(3),infB(3),mdh(1),pgi(1),phoE(1),rpoB(1),tonB(79)
/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/SRR6204329.fasta,klebsiella,258,gapA(3),infB(3),mdh(1),pgi(1),phoE(1),rpoB(1),tonB(79)
/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/SRR6204330.fasta,klebsiella,258,gapA(3),infB(3),mdh(1),pgi(1),phoE(1),rpoB(1),tonB(79)
/scratch/epid582w23_class_root/epid582w23_class/shared_data/data/assignment_2/crkp_assembly/SRR6204331.fa
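
The third column of this file holds the sequence type (ST), so a tally of STs across the dataset takes a single pipeline. This is a sketch that assumes the column layout shown above (file, scheme, ST, then one column per allele); `count_sequence_types` is a hypothetical helper name.

```shell
# Hypothetical helper: count how many assemblies were assigned to each sequence type
# Assumes column 3 of the mlst CSV output holds the ST
count_sequence_types() {
    cut -d',' -f3 "$1" | sort | uniq -c | sort -rn
}
```

Running `count_sequence_types crkp_mlst.csv` prints one line per ST, most frequent first, with the count in front of the ST.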
