
AWS GBI AMI Documentation

FlintMitchell edited this page Feb 14, 2022 · 78 revisions


What is this page?

This page contains the documentation I am recording as I install software onto the AMI. For each tool it includes the date I verified a successful installation, the version of the software installed, the process for installation (a copy of the commands I entered to install the software and, if necessary, either symlink it into /usr/lib or add it to the PATH), examples of the software's use (either the generic command syntax or an example command, some of which are copied from the user guides or manuals), and a brief description of the software, including links to manuals, Githubs, or other relevant repositories.

You can find this same information on the AMI itself, within the software's subdirectory of GBI-software. For example, the subdirectory Picard holds both the software I downloaded and installed from Picard's Github repo and a README file named GBI-README-PICARD. Within this file is a copy of the brief documentation that is on this wiki page you're currently reading. Each software has its own README file with that syntax: GBI-README-[SOFTWARE_NAME].

Because this page is quite long, I would recommend using your browser's search function to find the software you are interested in.

The date associated with these entries is simply the day that I put the file onto the AMI, not necessarily the day that I installed the software or wrote any of this information. It is more of a placeholder: if someone else edits or adds to these files, there is a place for them to log when they did so.

The general layout of these READMEs is:

  • Software name
  • Date README uploaded
  • Location of the software on the AMI
  • Version of the software on the AMI
  • Links to / the method I used for installing the software (so you can reproduce it if needed)
  • Examples of the software's use (basic syntax of the command, sometimes with example commands I found in the documentation)
  • Brief description of the software and links to its documentation, Github, papers, or anything else I found relevant

You should always check the man page (or --help option) and the Github or any other linked documentation page for answers regarding these tools. These READMEs are really just meant to document a successful installation on this AMI and to serve as a quick pointer to further information; they are in no way meant to be a replacement for what the actual authors have published!

The Software



Jellyfish

Date: 11.01.2021

Location of installation:

/usr/bin/jellyfish

Version of the software on the AMI and date installed:

(base) ubuntu@ip-172-31-52-167:~$ jellyfish --version
jellyfish 2.3.0

Process for Installation:

sudo apt update
sudo apt install jellyfish

Software Example of Use:

Example command from the manual to count k-mers:

jellyfish count -m 22 -o output -c 3 -s 10000000 -t 32 input.fasta

Description of the above command: “This will count the 22-mers in input.fasta with 32 threads. The counter field in the hash uses only 3 bits and the hash has at least 10 million entries. The output files will be named output 0, output 1, etc. (the prefix is specified with the -o switch). If the hash is large enough (as specified by the -s switch) to fit all the k-mers, there will be only one output file named output 0. If the hash filled up before all the mers were read, the hash is dumped to disk, zeroed out and reading in mers resumes. Multiple intermediary files will be present on the disks, named output 0, output 1, etc.”
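To make the terminology concrete, counting k-mers just means tallying every length-k substring of a sequence. Here is a minimal pure-Python sketch of the idea; it is purely illustrative and not part of Jellyfish, which does the same thing at scale with a multithreaded hash table:

```python
from collections import Counter

def count_kmers(seq, k):
    """Tally every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ACGTACGT", 3)
# ACG and CGT each occur twice; GTA and TAC once each
```

Jellyfish's job is exactly this tally, just performed over millions of reads with a fixed-size counter field per hash entry (the -c switch above).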

Description of Software:

From the README: “Jellyfish is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence.” The Github and further information can be found at: https://github.com/gmarcais/Jellyfish. The last release was in 2019, but it still has an active community engaging with questions and issues on the Github. In order to execute a certain function within jellyfish, enter that function name after jellyfish, like jellyfish count or jellyfish merge. A helpful manual for jellyfish, which includes information on how to execute the command, relevant and required options, and recommendations for use, can be found at: https://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual-1.1.pdf



TrimGalore

Date: 11.01.2021

Location of installation:

(base) ubuntu@ip-172-31-52-167:~$ which trim_galore 
/usr/local/bin/trim_galore

Version of the software on the AMI and date installed:

(base) ubuntu@ip-172-31-52-167:~$ trim_galore --version

                        Quality-/Adapter-/RRBS-/Speciality-Trimming
                                [powered by Cutadapt]
                                  version 0.6.6

                               Last update: 11 05 2020

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ curl -fsSL https://github.com/FelixKrueger/TrimGalore/archive/0.6.6.tar.gz -o trim_galore.tar.gz
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ tar -xzvf trim_galore.tar.gz 
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ rm trim_galore.tar.gz 
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir TrimGalore
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mv TrimGalore-0.6.6/ TrimGalore
(base) ubuntu@ip-172-31-52-167:~/GBI-software/TrimGalore/TrimGalore-0.6.6$ sudo ln -s "$(pwd)/trim_galore" ~/../../usr/local/bin/trim_galore

Software Example of Use:

Basic command usage: trim_galore [options] <filename(s)>

There are a lot of options for this software because of the scope of things that it can do. For a full list, check out the user guide link below.

Description of Software:

From the README: “Trim Galore is a Perl wrapper around two tools: Cutadapt and FastQC.” This means that it is a piece of software which contains and builds upon two other tools (both included on this AWS AMI): Cutadapt and FastQC. TrimGalore is intended to be used for quality control checks and trimming of sequencing data. The Github for TrimGalore can be found here: https://github.com/FelixKrueger/TrimGalore. The command takes in fastq files. The README has a brief introduction to the software, but more helpful documentation can be found in the user guide. To find information on how to use the trim_galore command and the options you can use with it, look at the user guide here: https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md



BLAST

Date: 11.01.2021

Verification of installation: For the time being, I only linked blastn and blastp into the PATH of this instance, but not the rest (because there are quite a few of them).

In order to use the other BLAST commands, simply go into the bin folder using this command from the directory this readme is in:

$ cd ncbi-blast-2.12.0+/bin/

Then, you can run any of the blast commands, for example blastn:

$ ./blastn [options]

Version of the software on the AMI and date installed: Nucleotide-Nucleotide BLAST 2.12.0+

Process for Installation: Blast software and executables: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

I was actually able to download it simply using apt-get: $ sudo apt-get install ncbi-blast+

Software Example of Use: There are two main programs within BLAST: blastp and blastn. These are for protein and nucleotide sequences, respectively. For the blastn command, there are sub-functions:

  • blastn : Traditional BLASTN, requiring an exact match of 11
  • blastn-short : BLASTN program optimized for sequences shorter than 50 bases
  • megablast : Traditional megablast, used to find very similar (e.g., intraspecies or closely related species) sequences
  • dc-megablast : Discontiguous megablast, used to find more distant (e.g., interspecies) sequences

The basic blastn command syntax is:

blastn -query fasta.file -db database_name

However, there are lots of useful options, like -out for writing the results to a file. For a full list, check out the manual linked below.
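The sub-functions listed above are selected with blastn's -task option. As a sketch of how such an invocation fits together, here is a small, purely hypothetical Python helper (the -query, -db, -task, and -out flags are real blastn options; the helper itself is only for illustration):

```python
def blastn_command(query, db, task="megablast", out=None):
    """Assemble a blastn invocation as an argument list (illustrative helper)."""
    cmd = ["blastn", "-query", query, "-db", db, "-task", task]
    if out is not None:
        cmd += ["-out", out]  # write results to a file instead of stdout
    return cmd

cmd = blastn_command("fasta.file", "database_name",
                     task="dc-megablast", out="hits.txt")
```

The resulting list could be passed to subprocess.run on a machine where BLAST is installed.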

As mentioned above, quite a few different commands come with the NCBI BLAST software. These include: blast2sam.pl, blast_report, blastdb_convert, blastdbcheck, blastdbcp, blastp, blast_formatter, blastdb_aliastool, blastdb_path, blastdbcmd, blastn, blastx.

The two most relevant (I think) are blastn, for nucleotide sequences, and blastp, for protein sequences.

Description of Software:

BLAST web application: https://blast.ncbi.nlm.nih.gov/Blast.cgi.

BLAST stands for Basic Local Alignment Search Tool. From the above website: “BLAST finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance.” This is usually done to try to ascertain important functional or evolutionary information about a nucleotide or protein sequence.

This is probably one of the most documented bioinformatics search tools available. To find out information regarding using the command line BLAST tool, check out the manual here: https://www.ncbi.nlm.nih.gov/books/NBK279690/.



Momi and Dadi

Date: 11.02.2021

Verification of installation:

>>> momi.__file__
'/home/ubuntu/dependency-software/miniconda3/envs/momi-env/lib/python3.7/site-packages/momi/__init__.py'

Version of the software on the AMI and date installed:

3.7/site-packages/momi-2.1.18-py3.7.egg-info$ cat PKG-INFO 
Metadata-Version: 1.2
Name: momi
Version: 2.1.18
Summary: MOran Model for Inference
Home-page: https://github.com/jackkamm/momi2
Author: Jack Kamm, Jonathan Terhorst, Richard Durbin, Yun S. Song
Author-email: jkamm@stat.berkeley.edu, terhorst@stat.berkeley.edu, yss@eecs.berkeley.edu
License: UNKNOWN
Description: UNKNOWN
Keywords: population genetics,statistics,site frequency spectrum,coalescent
Platform: UNKNOWN
Requires-Python: >=3.5

Process for Installation:

Create conda env with python 3.7

Install momi

(base) ubuntu@ip-172-31-52-167:~/GBI-software/momi$ conda create -n momi-env python=3.7
(momi-env) ubuntu@ip-172-31-52-167:~/GBI-software/momi$ conda install momi -c conda-forge -c bioconda -c jackkamm

Install dadi:

(momi-env) ubuntu@ip-172-31-52-167:~$ conda install -c conda-forge nlopt
(momi-env) ubuntu@ip-172-31-52-167:~/GBI-software/momi$ conda install -c conda-forge dadi

momi does not appear to work with Python 3.9, so I created a conda environment with Python 3.7 and installed it within that env. Therefore, in order to use momi (and dadi, which I cover here alongside it), you will need to enter the command:

(momi-env) ubuntu@ip-172-31-52-167:~/GBI-software/momi$ conda activate momi-env

And then once you are done, you will exit the Python 3.7 environment using the command:

(momi-env) ubuntu@ip-172-31-52-167:~/GBI-software/momi$ conda deactivate

Software Example of Use:

In order to use momi, either create a python program and import momi, or enter the python shell, import momi into that shell, and then do your work within that shell as shown below (the last command calling the DemographicModel function of momi is from the manual linked to below):

$ python
Python 3.7.11 (default, Jul 27 2021, 14:32:16) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import momi
>>> model = momi.DemographicModel(N_e=1.2e4, gen_time=29,
                              muts_per_gen=1.25e-8)

dadi example from manual for Importing data

dadi represents frequency spectra using dadi.Spectrum objects. As described in the Manipulating spectra section, Spectrum objects are subclassed from numpy.masked_array and thus can be constructed similarly. The most basic way to create a Spectrum is manually:

$ python
Python 3.7.11 (default, Jul 27 2021, 14:32:16) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dadi
>>> fs = dadi.Spectrum([0, 100, 20, 10, 1, 0])
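To see what those numbers mean: entry i of a one-population frequency spectrum counts the variants whose derived allele was observed in exactly i samples, and dadi masks the two corner entries (i = 0 and i = n) by default, since invariant sites carry no information for inference. A stdlib-only sketch of that convention (illustrative only; real Spectrum objects are numpy masked arrays with far more functionality):

```python
# toy SFS: entry i = number of variants seen in i of 5 sampled chromosomes
fs = [0, 100, 20, 10, 1, 0]

# mask the corners, mimicking dadi's default mask_corners behavior
masked = [i in (0, len(fs) - 1) for i in range(len(fs))]

# total segregating sites = sum over the unmasked entries
S = sum(count for count, m in zip(fs, masked) if not m)
```

Here S comes out to 131, the total number of segregating sites in the toy spectrum.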

Description of Software: Momi:

From the README, “momi (MOran Models for Inference) is a Python package that computes the expected sample frequency spectrum (SFS), a statistic commonly used in population genetics, and uses it to fit demographic history." The main resources for this software are:

The github: https://github.com/popgenmethods/momi2

The documentation pages / tutorial : https://momi2.readthedocs.io/en/latest/introduction.html

Dadi:

From the README, “dadi is a powerful software tool for simulating the joint frequency spectrum (FS) of genetic variation among multiple populations and employing the FS for population-genetic inference. An important aspect of dadi is its flexibility, particularly in model specification, but with that flexibility comes some complexity.” The main resources for this software are:

The docs: https://dadi.readthedocs.io/en/latest/

The bitbucket repo: https://bitbucket.org/gutenkunstlab/dadi/src/master/

Both of these packages are written in and intended to be used with Python. This means you must either write a Python script that imports them, OR you can work with them from the Python shell. To do this, enter the command python, which will take you to the Python shell (in the case of the momi-env I created and installed momi and dadi into, this Python shell runs Python 3.7). From here you can import momi and dadi using:

>>> import momi, dadi

With them imported, you can call the functions within the two programs. Read the dadi docs (link above) to become more familiar with how to enter these commands into the python shell.



Plink

Date: 11.02.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which plink
/usr/local/bin/plink

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ plink --version --noweb

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Process for Installation:

repo: https://zzz.bwh.harvard.edu/plink/download.shtml

Install:

Download plink-1.07-x86_64.zip from the above website, then scp it onto the EC2 instance:

Flints-MacBook-Pro:Desktop flintmitchell$ scp -i ~/Desktop/GBI/AWS/AWS_keypairs/r5large-gbi-keypair.pem plink-1.07-x86_64.zip ubuntu@ec2-34-222-181-231.us-west-2.compute.amazonaws.com:/home/ubuntu/GBI-software/Plink/

then on the EC2 instance:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir Plink
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd Plink/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Plink$ unzip plink-1.07-x86_64.zip 

Software Example of Use:

USAGE: Type plink or ./plink from the command line followed by the options of choice (see documentation)

EXAMPLE DATA: Two example files test.ped and test.map are included in the distribution; for example, once PLINK is installed try running:

     plink --file test

     plink --file test --freq

     plink --file test --assoc

     plink --file test --make-bed

     plink --bfile test --assoc

Description of Software:

From the plink documentation: “PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

The focus of PLINK is purely on analysis of genotype/phenotype data, so there is no support for steps prior to this (e.g. study design and planning, generating genotype or CNV calls from raw data). Through integration with gPLINK and Haploview, there is some support for the subsequent visualization, annotation and storage of results.”

The last release was in October of 2009, so it doesn't appear to be updated anymore. However, the documentation online is thorough. You can find that here:

https://zzz.bwh.harvard.edu/plink/index.shtml



HISAT2

Date: 11.02.2021

Verification of installation:

/home/ubuntu/GBI-software/HISAT2/hisat2-2.2.1/hisat2-align-s

Version of the software on the AMI:

version 2.2.1

Process for Installation:

Download the source from: http://daehwankimlab.github.io/hisat2/download/. Unzip the zip file, then move it to your EC2 instance using scp. Inside of the unzipped HISAT2 folder, run $ make, then $ make install, and then add it to the PATH using export.

Software Example of Use:

This software is extremely well documented using the help option:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ hisat2 --help

The basic command usage is:

$ hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]

where:
  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <SRA accession number>        Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

There is a good example for how to use the command here: http://daehwankimlab.github.io/hisat2/howto/
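The mutually exclusive input forms in the usage line above (paired -1/-2, unpaired -U) can be made concrete with a small, hypothetical Python helper that assembles a hisat2 invocation (the -x, -1, -2, -U, and -S flags come from the usage line; the helper itself is only a sketch):

```python
def hisat2_command(index, mates1=None, mates2=None, unpaired=None, sam_out=None):
    """Assemble a hisat2 invocation for paired or unpaired reads (sketch)."""
    cmd = ["hisat2", "-x", index]
    if mates1 and mates2:
        # comma-separated lists with no whitespace, per the usage notes
        cmd += ["-1", ",".join(mates1), "-2", ",".join(mates2)]
    elif unpaired:
        cmd += ["-U", ",".join(unpaired)]
    if sam_out:
        cmd += ["-S", sam_out]  # default SAM output is stdout
    return cmd

cmd = hisat2_command("genome_idx",
                     mates1=["r_1.fq.gz"], mates2=["r_2.fq.gz"],
                     sam_out="out.sam")
```

As with any such sketch, the resulting argument list would be handed to subprocess.run on a machine where HISAT2 is installed.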

Description of Software:

From the documentation: “HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) against the general human population (as well as against a single reference genome).” The last release (which we are using) was on 7/24/2020, so this software is still quite current; it was most recently published by the authors in 2019:

Kim, D., Paggi, J.M., Park, C. et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915 (2019). https://doi.org/10.1038/s41587-019-0201-4

The manual can be found on the Github site here: http://daehwankimlab.github.io/hisat2/manual/



FastQC

Date: 11.02.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which fastqc 
/usr/local/bin/fastqc

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ fastqc --version
FastQC v0.11.9

Process for Installation:

Instructions can be found here: https://raw.githubusercontent.com/s-andrews/FastQC/master/INSTALL.txt

Update java with sudo apt install default-jre

Download the most recent release for linux here: https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc

Software Example of Use:

I symlinked the FastQC command fastqc into /usr/local/bin so that it is on the PATH and can be used from anywhere on the instance. You could also use the command from within its FastQC subdirectory of GBI-software.

The software itself actually has a graphical component; however, on the EC2 instance we are only working with it from the CLI. The documentation says it has a couple of extra command options that are only available via the CLI. The plots are saved as .html files.

The command syntax is as follows:

$ fastqc seqfile1 seqfile2 .. seqfileN

With some optional commands:
    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

Description of Software:

From the Github: “FastQC is a program designed to spot potential problems in high throughput sequencing datasets. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report which summarises the results.”

The most recent update was in 2019. It appears to be continuously monitored, which makes sense considering how commonly used the software is.

This tool is very useful and well documented. For information about the software, check out the documentation here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

There is actually even a tutorial video for using FastQC! Check that out here: https://www.youtube.com/watch?v=bz93ReOv87Y

If you have issues, the github may have solutions or similar questions that have been asked. You can find that here: https://github.com/s-andrews/FastQC/



Cutadapt

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which cutadapt 
/home/ubuntu/dependency-software/miniconda3/bin/cutadapt

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ cutadapt --version
3.4

Process for Installation:

To install, follow the installation instructions on this page: https://cutadapt.readthedocs.io/en/stable/installation.html

Software Example of Use:

The basic command syntax is

cutadapt [options] [file]

From the documentation:

To trim a 3’ adapter, the basic command-line for Cutadapt is: cutadapt -a AACCGGTT -o output.fastq input.fastq

Compressed files can also be used: cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz
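The effect of -a (3' adapter) trimming can be pictured with a toy stdlib-only function: the adapter and everything 3' of it are removed from the read. This is only a sketch of the idea; Cutadapt itself does much more, such as tolerating mismatches and partial adapter occurrences at the read end.

```python
def trim_3prime(read, adapter):
    """Drop the adapter and everything 3' of it (exact matches only)."""
    i = read.find(adapter)
    return read if i == -1 else read[:i]

trimmed = trim_3prime("ACGTAACCGGTTACGT", "AACCGGTT")
# leaves "ACGT"; reads without the adapter pass through unchanged
```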

Description of Software:

From the Docs: “Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.”

This software is used for pre-assembly and analysis trimming, in order to remove sequences from the raw sequencing data that are not scientifically relevant.

There are a couple sources of information regarding Cutadapt. There is a documentation page here: https://cutadapt.readthedocs.io/en/stable/develop.html

And a Github page here: https://github.com/marcelm/cutadapt/

It appears to be continuously updated by the authors as the last update was in March of 2021.



CAP3

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which cap3
/home/ubuntu/dependency-software/miniconda3/bin/cap3

Version of the software on the AMI:

Version date: 02/10/15

Process for Installation:

https://bioconda.github.io/recipes/cap3/README.html

(base) ubuntu@ip-172-31-52-167:~/GBI-software/CAP3$ conda install cap3
(base) ubuntu@ip-172-31-52-167:~/GBI-software/CAP3$ conda update cap3

Software Example of Use:

From docs: Basic command syntax: cap3 File_of_reads [options]

Description of Software:

This appears to be the first/main paper associated with the software: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310812/pdf/x5.pdf

A helpful documentation page was adapted on the LSU website: http://www.hpc.lsu.edu/docs/compbio/cap3.php

It takes in FASTA files and outputs the assembly in what they call “ace” format (.ace). It also outputs consensus sequences in a .contigs file and a report in a .results file.

The last update to the software was in 2015, so it's not currently being monitored or updated.



BWA

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which bwa
/home/ubuntu/dependency-software/miniconda3/bin/bwa

Version of the software on the AMI:

Version: 0.7.17-r1188

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir BWA
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd BWA
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BWA$ git clone https://github.com/lh3/bwa.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BWA$ cd bwa/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BWA/bwa$ make

Software Example of Use:

Basic command syntax: ./bwa <command> [options]

Example from the Github:

./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

Description of Software:

Github can be found here: https://github.com/lh3/bwa

From the docs: “BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.”

This note seems important to read before using BWA: “Note: To use BWA, you need to first index the genome with 'bwa index'. There are three alignment algorithms in BWA: 'mem', 'bwasw', and 'aln/samse/sampe'. If you are not sure which to use, try 'bwa mem' first. Please use 'man ./bwa.1' for the manual.”

The last release was in 2017, however issues are still actively being opened and resolved, so there is a community still supporting this software.



Braker2

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which braker.pl
/home/ubuntu/GBI-software/Braker2/BRAKER/scripts//braker.pl

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ braker.pl --version
braker.pl version 2.1.6

Process for Installation:

The process for installing the dependencies is fairly long but straightforward. There are quite a few associated tools. Follow the installation procedure on the github here: https://github.com/Gaius-Augustus/BRAKER#installation

An important note about GeneMark! GeneMark-EX will only run if a valid key file resides in your home directory. The key file will expire after 200 days, which means that you have to download a new GeneMark-EX release and a new key file after 200 days. The key file is downloaded as gm_key.gz. Instructions for this are in the GeneMark installation file:

Installation instructions for GeneMark* software

a. Copy the content of distribution to desired location.
b. Install the key: copy key "gm_key" into users home directory as:

  cp gm_key ~/.gm_key

ProtHint is installed in the dependency-software/ProtHint directory, but is symlinked into /usr/local/bin so it is accessible anywhere on the CLI.

Software Example of Use:

Basic Command Syntax: braker.pl [OPTIONS] --genome=genome.fa {--bam=rnaseq.bam | --prot_seq=prot.fa}

Examples from docs:

To run with RNA-Seq

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --bam=accepted_hits.bam
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=rnaseq.gff

To run with protein sequences

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --prot_seq=proteins.fa
braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \
    --hints=prothint_augustus.gff

Description of Software:

From the docs: “BRAKER2 is an extension of BRAKER1 which allows for fully automated training of the gene prediction tools GeneMark-EX and AUGUSTUS from RNA-Seq and/or protein homology information, and that integrates the extrinsic evidence from RNA-Seq and protein homology information into the prediction.

In contrast to other available methods that rely on protein homology information, BRAKER2 reaches high gene prediction accuracy even in the absence of the annotation of very closely related species and in the absence of RNA-Seq data.”

The software appears to be well supported, with its last release in March of 2021. There is a substantial amount of information about this software, so definitely refer to the documentation on the Github, found here: https://github.com/Gaius-Augustus/BRAKER#braker



Picard

Date: 11.03.2021

Verification of installation:

The software itself is installed at /home/ubuntu/GBI-software/Picard/picard/build/libs/picard.jar

I created an environment variable $PICARD, that is linked to this .jar file.

Version of the software on the AMI:

Version 2.26.4, released October 26, 2021

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/Picard$ git clone https://github.com/broadinstitute/picard.git
Cloning into 'picard'...
remote: Enumerating objects: 167266, done.
remote: Counting objects: 100% (7305/7305), done.
remote: Compressing objects: 100% (1505/1505), done.
remote: Total 167266 (delta 6091), reused 6635 (delta 5623), pack-reused 159961
Receiving objects: 100% (167266/167266), 205.56 MiB | 28.65 MiB/s, done.
Resolving deltas: 100% (136390/136390), done.
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Picard$ ls
GBI-README-Picard  picard
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Picard$ cd picard/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Picard/picard$ ./gradlew shadowJar

Then, to create the environment variable, I added the following line into the .bashrc file in the home directory:

PICARD=/home/ubuntu/GBI-software/Picard/picard/build/libs/picard.jar

Software Example of Use: In order to use the command, you can either go into the picard folder itself and then use:

java -jar build/libs/picard.jar [options]

Or you can use the environment variable I created:

java -jar $PICARD [options]

For example, to get further information / help about the software, use the -h option:

java -jar $PICARD -h

Description of Software:

This software is created by the Broad Institute and is actively being updated; there have been new releases every couple of weeks for the past few months. Therefore, if you run into errors when using it, it would be wise to rebuild it: delete the picard folder within GBI-software/Picard and rebuild it using the commands in the installation procedure above (you won't have to recreate the $PICARD environment variable; it should remain the same).

Picard is described in the Github as: “A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.”

In order to see the full list of tools possible, either look at the help documentation (using the -h option shown in examples above) or check out the information at their website here: https://broadinstitute.github.io/picard/

In order to update to the most recent release, go to the Github repository here: https://github.com/broadinstitute/picard



HTSeq

Date: 11.03.2021

Verification of installation:

>>> HTSeq.__file__
'/home/ubuntu/dependency-software/miniconda3/lib/python3.9/site-packages/HTSeq/__init__.py'

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~/dependency-software/miniconda3/lib/python3.9/site-packages/HTSeq$ cat _version.py 
__version__ = "0.13.5"

Process for Installation:

$ pip install HTSeq

Software Example of Use: HTSeq is a Python package. Therefore, in order to use it you can either:

  • Write a Python script that imports HTSeq and implements some of its functions, or
  • Use it in the Python shell: enter the python command in the CLI, then import HTSeq using "import HTSeq". Within this Python shell you will have access to the HTSeq functions that are found in the user guide linked below!

Example of reading in a file into the Python shell from https://htseq.readthedocs.io/en/master/tour.html:

(base) ubuntu@ip-172-31-52-167:~$ python
Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> fastq_file = HTSeq.FastqReader("yeast_RNASeq_excerpt_sequence.txt", "solexa")
>>> fastq_file
<FastqReader object, connected to file name 'yeast_RNASeq_excerpt_sequence.txt'>
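For reference, the file FastqReader iterates over is just a stream of four-line records (header, sequence, separator, quality string). A minimal stdlib-only parser, purely to illustrate the record structure; HTSeq's reader additionally decodes the quality string under the given encoding (e.g. "solexa"):

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:                       # line 1: "@name"
        seq = next(it)                      # line 2: bases
        _plus = next(it)                    # line 3: "+" separator
        qual = next(it)                     # line 4: per-base qualities
        yield header.strip()[1:], seq.strip(), qual.strip()

records = list(parse_fastq(["@read1", "ACGT", "+", "IIII"]))
# one record: ("read1", "ACGT", "IIII")
```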

Description of Software: According to the docs: “HTSeq is a Python package for high-throughput sequencing assays.”

This software is well documented, in order to find out more, there is a helpful Overview here: https://htseq.readthedocs.io/en/master/overview.html

On that link, there are tutorials and links to documentation regarding the functions within the software.



Bedtools

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which bedtools
/home/ubuntu/dependency-software/miniconda3/bin/bedtools

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ bedtools --version
bedtools v2.30.0

Process for Installation:

Install: https://bedtools.readthedocs.io/en/latest/content/installation.html

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir bedtools
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd bedtools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/bedtools$ wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary
(base) ubuntu@ip-172-31-52-167:~/GBI-software/bedtools$ mv bedtools.static.binary bedtools
(base) ubuntu@ip-172-31-52-167:~/GBI-software/bedtools$ chmod a+x bedtools

Software Example of Use:

Basic command syntax: bedtools <subcommand> [options]

To get more information about any of bedtools' commonly used subcommands, use the -h option, e.g. $ bedtools intersect -h or $ bedtools merge -h

Description of Software:

From the github: “Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.”
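
To give a feel for what "genome arithmetic" means, here is a hedged, stdlib-only toy of the interval intersection that `bedtools intersect` performs on BED-style records (chrom, start, end; BED uses 0-based, half-open coordinates). This is an illustration, not bedtools' implementation:

```python
# Toy version of `bedtools intersect`: report the overlapping portions
# of two BED-style interval lists (chrom, start, end), 0-based half-open.
def intersect(a_intervals, b_intervals):
    hits = []
    for chrom_a, start_a, end_a in a_intervals:
        for chrom_b, start_b, end_b in b_intervals:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:          # non-empty overlap
                hits.append((chrom_a, start, end))
    return hits

a = [("chr1", 100, 200), ("chr2", 500, 600)]
b = [("chr1", 150, 250), ("chr2", 700, 800)]
print(intersect(a, b))  # [('chr1', 150, 200)]
```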

Github: https://github.com/arq5x/bedtools2

Documentation pages: http://bedtools.readthedocs.io/



Bamtools

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which bamtools
/usr/bin/bamtools

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ bamtools --version

bamtools 2.5.1
Part of BamTools API and toolkit
Primary authors: Derek Barnett, Erik Garrison, Michael Stromberg
(c) 2009-2012 Marth Lab, Biology Dept., Boston College

Process for Installation:

BamTools requires CMake (version >= 3.0)

BamTools also makes use of JsonCpp for certain serialization tasks:

(base) ubuntu@ip-172-31-52-167:~/dependency-software$ git clone https://github.com/Microsoft/vcpkg.git
(base) ubuntu@ip-172-31-52-167:~/dependency-software$ cd vcpkg/
(base) ubuntu@ip-172-31-52-167:~/dependency-software/vcpkg$ ./bootstrap-vcpkg.sh 
(base) ubuntu@ip-172-31-52-167:~/dependency-software/vcpkg$ sudo ./vcpkg integrate install
(base) ubuntu@ip-172-31-52-167:~/dependency-software/vcpkg$ sudo ./vcpkg install jsoncpp

Install BamTools:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir Bamtools
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd Bamtools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Bamtools$ git clone git://github.com/pezmaster31/bamtools.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Bamtools$ cd bamtools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Bamtools/bamtools$ mkdir build
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Bamtools/bamtools$ cd build
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Bamtools/bamtools/build$ cmake -DCMAKE_INSTALL_PREFIX=/my/install/dir ..

Software Example of Use:

The documentation is not super useful, but this website I found has some valuable information regarding important Bamtools commands: https://hcc.unl.edu/docs/applications/app_specific/bioinformatics_tools/data_manipulation_tools/bamtools/running_bamtools_commands/

For example, “the basic usage of the BamTools count is:”

$ bamtools count -in input_alignments.bam
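
For intuition about what `count` reports: BAM is the binary form of SAM, where header lines start with '@' and every other line is one alignment record, so counting records amounts to counting non-header lines. A hypothetical stdlib sketch on SAM text (not bamtools' code, which reads binary BAM):

```python
# Toy alignment counter: in SAM (the text equivalent of BAM), header
# lines start with '@'; every remaining line is one alignment record.
def count_alignments(sam_lines):
    return sum(1 for line in sam_lines if line and not line.startswith("@"))

sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t16\tchr1\t150\t60\t8M\t*\t0\t0\tTTGGCCAA\tHHHHHHHH",
]
print(count_alignments(sam))  # 2
```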

Description of Software:

From the Github: “BamTools is a project that provides both a C++ API and a command-line toolkit for reading, writing, and manipulating BAM (genome alignment) files.”

Make sure to check out the link above in “Software Example of Use”; it is more helpful for implementing bamtools than the Github wiki. That said, there is also a brief manual for the API on the Github worth checking out (though it only mentions how to use bamtools while programming, not necessarily from the CLI).

Github: https://github.com/pezmaster31/bamtools/wiki



VCFtools

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which vcftools
/usr/local/bin/vcftools

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ vcftools --version
VCFtools (0.1.17)

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir VCFtools
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd VCFtools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/VCFtools$ git clone https://github.com/vcftools/vcftools.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/VCFtools/vcftools$ ./autogen.sh
(base) ubuntu@ip-172-31-52-167:~/GBI-software/VCFtools/vcftools$ ./configure
(base) ubuntu@ip-172-31-52-167:~/GBI-software/VCFtools/vcftools$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/VCFtools/vcftools$ sudo make install

Software Example of Use:

Check out the man page for uses of the command and its options: man vcftools

From there, the general syntax of the command is shown:

$ vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE ] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ] [ OUTPUT OPTIONS ]

There are plenty of examples in the man page, such as:

Output allele frequency for all sites in the input vcf file from chromosome 1
$ vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis
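
For intuition about what `--freq` computes at each site: the frequency of each allele is its count over the total number of called alleles across the sample genotypes. A hedged stdlib-only sketch using hypothetical toy genotypes (not vcftools' code):

```python
from collections import Counter

# Toy allele-frequency calculation for one VCF site.
# Genotypes are GT strings; 0 = REF allele, 1 = first ALT allele, '.' = missing.
def allele_freqs(ref, alt, genotypes):
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":            # skip missing calls
                counts[int(allele)] += 1
    total = sum(counts.values())
    alleles = [ref] + alt
    return {alleles[i]: n / total for i, n in sorted(counts.items())}

# Three diploid samples at one site with REF=A, ALT=G
print(allele_freqs("A", ["G"], ["0/0", "0/1", "1/1"]))  # {'A': 0.5, 'G': 0.5}
```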

Description of Software:

A set of tools written in Perl and C++ for working with VCF files, such as those generated by the 1000 Genomes Project.

It looks like your best bet for information regarding this tool is the manual page (man vcftools)! There is a ton of information there.

Github: https://github.com/vcftools/vcftools

Website: https://vcftools.github.io/



Samtools

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which samtools
/home/ubuntu/dependency-software/miniconda3/bin/samtools

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ samtools --version
samtools 1.13
Using htslib 1.13

Process for Installation:

From the github:

$ autoheader            # Build config.h.in (this may generate a warning about AC_CONFIG_SUBDIRS - please ignore it).
$ autoconf -Wno-syntax  # Generate the configure script
$ ./configure           # Needed for choosing optional functionality
$ make
$ make install

Software Example of Use:

Basic command syntax: samtools <command> [options]

A good web page with examples and descriptions can be found here!: http://quinlanlab.org/tutorials/samtools/samtools.html

Description of Software:

About from man page: “Samtools is a set of utilities that manipulate alignments in the SAM (Sequence Alignment/Map), BAM, and CRAM formats. It converts between the formats, does sorting, merging and indexing, and can retrieve reads in any region swiftly.”
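
To illustrate one of those operations: coordinate sorting (what `samtools sort` does) orders alignment records by reference name and leftmost position. A hypothetical stdlib sketch on pre-parsed record tuples (note that real samtools orders reference names by their BAM header order, not alphabetically):

```python
# Toy coordinate sort, sketching what `samtools sort` does: order
# alignment records by (reference name, leftmost mapping position).
def coordinate_sort(records):
    return sorted(records, key=lambda r: (r[1], r[2]))  # (rname, pos)

records = [
    ("read3", "chr2", 50),
    ("read1", "chr1", 300),
    ("read2", "chr1", 100),
]
print(coordinate_sort(records))
# [('read2', 'chr1', 100), ('read1', 'chr1', 300), ('read3', 'chr2', 50)]
```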

Check the link above for examples and tutorials on using the samtools commands.

You can also find more information about samtools at:

Github: https://github.com/samtools/samtools

Documentations page: http://www.htslib.org/doc/

Or by looking at the man page : man samtools.



Trinity

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which Trinity 
/usr/local/bin/Trinity

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ Trinity --version
Trinity version: Trinity-v2.13.1

Process for Installation:

Download the most recent release from: https://github.com/trinityrnaseq/trinityrnaseq/releases

$ git clone https://github.com/trinityrnaseq/trinityrnaseq.git
$ cd trinityrnaseq   # enter the cloned trinity repository on the EC2 instance
$ make

Software Example of Use:

For more information on Trinity, use the help command:

(base) ubuntu@ip-172-31-52-167:~$ Trinity --help

Example from the github:

“Assemble RNA-Seq data like so:”

Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G

The resulting assembly files will be in `trinity_out_dir/`.

Description of Software:

From the github: “Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes.”

You can find out more about the software at the github: https://github.com/trinityrnaseq/trinityrnaseq/wiki



BUSCO & QUAST

Date: 11/10/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which busco && which quast.py
/home/ubuntu/.local/bin/busco
/usr/local/bin/quast.py

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ busco --version && quast.py --version
BUSCO 5.2.2
QUAST v5.1.0rc1, 278f61fe

Process for Installation: BUSCO: https://busco.ezlab.org/busco_userguide.html#manual-installation

Links (most from the above website):

  • https://biopython.org/
  • https://pandas.pydata.org/
  • https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
  • http://bioinf.uni-greifswald.de/augustus/
  • https://github.com/soedinglab/metaeuk
  • https://github.com/hyattpd/Prodigal
  • http://hmmer.org/
  • https://github.com/smirarab/sepp/
  • https://www.r-project.org/
  • https://ftp.osuosl.org/pub/cran/bin/linux/ubuntu/fullREADME.html

Ubuntu already has python installed.

Downloading pip

ubuntu@ip-172-31-58-54:~$ sudo apt update
ubuntu@ip-172-31-58-54:~$ sudo apt upgrade
ubuntu@ip-172-31-58-54:~$ sudo apt install python3-pip

Downloading Biopython and adding scripts to PATH

ubuntu@ip-172-31-58-54:~$ pip3 install biopython
Collecting biopython
  Downloading biopython-1.78-cp38-cp38-manylinux1_x86_64.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 5.6 MB/s 
Collecting numpy
  Downloading numpy-1.20.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 31.0 MB/s 
Installing collected packages: numpy, biopython
  WARNING: The scripts f2py, f2py3 and f2py3.8 are installed in '/home/ubuntu/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed biopython-1.78 numpy-1.20.3
ubuntu@ip-172-31-58-54:~$ cd /home/ubuntu/.local/bin
ubuntu@ip-172-31-58-54:~/.local/bin$ export PATH=$PATH:$(pwd)
ubuntu@ip-172-31-58-54:~/.local/bin$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/ubuntu/.local/bin

Downloading Pandas

ubuntu@ip-172-31-58-54:~/.local/bin$ pip3 install pandas

Downloading tBLASTn 2.2+: download the tar.gz file from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/, then scp it to the Linux EC2 instance:

Flints-MacBook-Pro:Downloads flintmitchell$ scp -i ~/Desktop/GBI/AWS/AWS_keypairs/keypair-r5xlarge-flint.pem ncbi-blast-2.11.0+-x64-linux.tar.gz ubuntu@ec2-44-242-146-7.us-west-2.compute.amazonaws.com:~/
ubuntu@ip-172-31-58-54:~$ tar -zxvf ncbi-blast-2.11.0+-x64-linux.tar.gz 

Downloading Augustus Dependencies:

ubuntu@ip-172-31-58-54:~$ sudo apt install libboost-iostreams-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install zlib1g-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install libgsl-dev 
ubuntu@ip-172-31-58-54:~$ sudo apt install libboost-all-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install libsuitesparse-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install liblpsolve55-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install libsqlite3-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install libmysql++-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install libbamtools-dev
ubuntu@ip-172-31-58-54:~$ sudo apt install samtools libhts-dev
ubuntu@ip-172-31-58-54:~$ git clone https://github.com/Gaius-Augustus/Augustus.git
ubuntu@ip-172-31-58-54:~$ cd Augustus/
ubuntu@ip-172-31-58-54:~/Augustus$ make
ubuntu@ip-172-31-58-54:~/Augustus$ sudo make install
# To check success of global install:
ubuntu@ip-172-31-58-54:~/Augustus$ augustus --help

Downloading Prodigal

ubuntu@ip-172-31-58-54:~$ git clone https://github.com/hyattpd/Prodigal
ubuntu@ip-172-31-58-54:~$ cd Prodigal/
ubuntu@ip-172-31-58-54:~/Prodigal$ make
ubuntu@ip-172-31-58-54:~/Prodigal$ sudo make install

Downloading Metaeuk

ubuntu@ip-172-31-58-54:~$ wget https://mmseqs.com/metaeuk/metaeuk-linux-sse41.tar.gz
tar xzvf metaeuk-linux-sse41.tar.gz
export PATH=$(pwd)/metaeuk/bin/:$PATH

Downloading Hmmer

ubuntu@ip-172-31-58-54:~$ sudo apt install hmmer

Downloading SEPP

ubuntu@ip-172-31-58-54:~$ git clone https://github.com/smirarab/sepp.git
ubuntu@ip-172-31-58-54:~$ cd sepp
ubuntu@ip-172-31-58-54:~/sepp$ python3 setup.py config
ubuntu@ip-172-31-58-54:~/sepp$ sudo python3 setup.py install

Downloading R

ubuntu@ip-172-31-58-54:~$ sudo apt update -qq
ubuntu@ip-172-31-58-54:~$ sudo apt install --no-install-recommends software-properties-common dirmngr
ubuntu@ip-172-31-58-54:~$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
ubuntu@ip-172-31-58-54:~$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/"
ubuntu@ip-172-31-58-54:~$ sudo apt install --no-install-recommends r-base

Downloading ggplot2

ubuntu@ip-172-31-58-54:~$ sudo apt-get install r-base-dev
ubuntu@ip-172-31-58-54:~$ sudo add-apt-repository ppa:c2d4u.team/c2d4u4.0+
ubuntu@ip-172-31-58-54:~$ sudo apt-get install -y r-cran-ggplot2

Finally, downloading BUSCO

ubuntu@ip-172-31-58-54:~$ git clone https://gitlab.com/ezlab/busco.git
ubuntu@ip-172-31-58-54:~$ cd busco/
ubuntu@ip-172-31-58-54:~/busco$ python3 setup.py install --user
#Check it is installed
ubuntu@ip-172-31-58-54:~$ busco --help

Using BUSCO!

Command:

/home/ubuntu/.local/bin/busco -i a-thaliana-output.scafSeq -l viridiplantae_odb10 -o a-thaliana-BUSCO -m genome

or

/home/ubuntu/.local/bin/busco -i a-thaliana-output.scafSeq -l embryophyta_odb10 -o a-thaliana-BUSCO -m genome

Downloading QUAST

ubuntu@ip-172-31-54-244:~/GBIsoftware$ git clone https://github.com/ablab/quast.git

There are 2 options for installation. The first only installs part of the full software. The second is the full install and “includes (1) tools for SV detection based on read pairs, which is used for more precise misassembly detection, (2) and tools/data for reference genome detection in metagenomic datasets)”. A slight increase in the storage size of this AMI is okay, so we will install the full option:

ubuntu@ip-172-31-54-244:~/GBIsoftware/quast$ sudo python3 ./setup.py install_full

Dependencies for the main pipeline:

  • Python2 (2.5 or higher) or Python3 (3.3 or higher) (installed with Ubuntu)
  • Perl 5.6.0 or higher (installed with Ubuntu)
  • GCC 4.7 or higher (installed with build-essential)
  • GNU make and ar (installed with build-essential)
  • zlib development files

ubuntu@ip-172-31-58-54:~$ sudo apt install zlib1g-dev

For the optional submodules:

  • Time::HiRes perl module for GeneMark-ES (needed when using --gene-finding --eukaryote)
  • Java 1.8 or later for GRIDSS (needed for SV detection)
  • R for GRIDSS (needed for SV detection)

Matplotlib and some other packages download for plotting capability of QUAST:

ubuntu@ip-172-31-58-54:~$ sudo apt-get update && sudo apt-get install -y pkg-config libfreetype6-dev libpng-dev python3-matplotlib

Software Example of Use: Here is an example of the BUSCO command:

$ busco -i [input-sequence.file] -l [busco-lineage] -o [output-prefix] -m [busco-analysis-mode] -c [num-cores]

The input sequence file can be an assembly of a genome, transcriptome, or protein sequences in FASTA format. Use the following command to list all of the current possible busco lineage datasets:

$ busco --list-datasets

With this list, I would first look at the taxonomy of the organism you are using and find which BUSCO dataset you think will overlap most. You can also choose a couple of them and run BUSCO several times to see which gives you the best results. Alternatively, you can replace the -l [busco-lineage] option with the following flag to test against all the eukaryote lineages and find which your assembly best fits. Obviously, running BUSCO multiple times increases the amount of time your instance has to operate, and therefore the cost of your analysis, so it is good to think about the correct lineage dataset to use before jumping in.

$ busco -i [input-sequence.file] --auto-lineage-euk -o [output-prefix] -m [busco-analysis-mode]

Using the -c flag you can assign a certain number of CPU cores. Two examples using an A. thaliana assembly "a-thaliana-output.scafSeq" with two different BUSCO lineages, outputting results into a folder named "a-thaliana-BUSCO":

$ busco -i a-thaliana-output.scafSeq -l viridiplantae_odb10 -o a-thaliana-BUSCO -m genome
$ busco -i a-thaliana-output.scafSeq -l embryophyta_odb10 -o a-thaliana-BUSCO -m genome

If you start a busco run with the above command and it fails for any reason, it will have already created a folder for that busco run with the prefix you entered. This will prevent you from running the busco command again with the same prefix! So first, before retrying busco, you should delete the busco folder from the failed run (after you have figured out what went wrong and are sure nothing in that folder is valuable for you). To do this do:

$ rm -rf [busco-folder-name]
# example for a folder named "athal-masurca-busco"
$ rm -rf athal-masurca-busco

QUAST: QUAST runs from a command line as follows:

python quast.py [options] <contig_file(s)>

Description of Software:

First off, what is BUSCO? Many tools used to assess the quality of a genome assembly look at "per-base error rates, insert size distributions; or genome biases, e.g. k-mer distributions; or fragment (contig) length distributions, e.g. N50, which summarizes assembly contiguity in a single number: half the genome is assembled on contigs of length N50 or longer." https://doi.org/10.1093/bioinformatics/btv351
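
The N50 mentioned in that quote can be computed directly from a list of contig lengths; here is a short, hedged stdlib sketch (not any particular tool's implementation):

```python
# N50: sort contig lengths in descending order and walk the cumulative
# sum until it reaches half the total assembly length; that contig's
# length is the N50.
def n50(lengths):
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```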

Using these metrics alone does not take into consideration the actual content of the genome assembly. In order to assess the "content" of an assembly (the genes that make it up), Benchmarking Universal Single-Copy Orthologs (BUSCOs) are used. As their name describes, these are genes that have remained single-copy (they have not been duplicated or lost) since the last speciation event (a lineage-splitting event that resulted in two or more different species). Since these single-copy orthologs are assumed to be highly conserved within a given clade, they can be used as a content-quality assessment - i.e., what percentage of the genes that we expect to find in this organism's genome, based on one of its close ancestors, do we actually find?

When looking at the results of a BUSCO run, you will find the result broken down into several categories dealing with the completeness of the given genes within a BUSCO lineage set. These categories are Complete (C), Fragmented (F), and Missing (M). Further, within the Complete (C) category are Complete and single-copy (S) and Complete and duplicated (D). Having a high percentage of complete genes means that many of the genes in the BUSCO lineage dataset you chose are present in the assembly you submitted for BUSCO analysis. On this topic, there is a note I found important that is printed when running BUSCO: "Busco evaluations are valid when an appropriate dataset is used, i.e., the dataset belongs to the lineage of the species to test. Because of overlapping markers/spurious matches among domains, busco matches in another domain do not necessarily mean that your genome/proteome contains sequences from this domain. However, a high busco score in multiple domains might help you identify possible contaminations."

Here is an example of a result that I got using Arabidopsis Thaliana data:
# BUSCO version is: 5.1.2 
# The lineage dataset is: viridiplantae_odb10 (Creation date: 2020-09-10, number of genomes: 57, number of BUSCOs: 425)
# Summarized benchmarking in BUSCO notation for file /home/ubuntu/masurca-athal-pacraw/superReadSequences.named.fasta
# BUSCO was run in mode: genome
# Gene predictor used: metaeuk

	***** Results: *****

	C:88.5%[S:27.3%,D:61.2%],F:10.6%,M:0.9%,n:425	   
	376	Complete BUSCOs (C)			   
	116	Complete and single-copy BUSCOs (S)	   
	260	Complete and duplicated BUSCOs (D)	   
	45	Fragmented BUSCOs (F)			   
	4	Missing BUSCOs (M)			   
	425	Total BUSCO groups searched		   

Dependencies and versions: hmmsearch: 3.3 metaeuk: 9dee7a78db0f2a8d6aafe7dbf18ac06bb6e23bf0
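
The percentage line in those results is just each category count over the total number of BUSCO groups searched. A quick sketch reproducing the notation from the counts above:

```python
# Reproduce BUSCO's C:..%[S:..%,D:..%],F:..%,M:..%,n:.. line from raw counts.
counts = {"S": 116, "D": 260, "F": 45, "M": 4}
total = 425  # total BUSCO groups searched

def pct(n):
    return round(100 * n / total, 1)

line = (f"C:{pct(counts['S'] + counts['D'])}%"
        f"[S:{pct(counts['S'])}%,D:{pct(counts['D'])}%],"
        f"F:{pct(counts['F'])}%,M:{pct(counts['M'])}%,n:{total}")
print(line)  # C:88.5%[S:27.3%,D:61.2%],F:10.6%,M:0.9%,n:425
```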

At a glance, there is a high number of duplicated BUSCOs. As discussed above, BUSCOs are single-copy orthologs, and are assumed not to be duplicated. One group of reasons for this could deal with the assembly itself. Did my assembly align multiple copies of a gene that only exists once in the genome? Another reason for this result could be that genes chosen for the BUSCO lineage Viridiplantae (the kingdom synonymous with Plantae) were single-copy in the plants chosen to create that dataset, but were duplicated sometime afterwards, before the speciation of A. thaliana. Therefore I might want to use a BUSCO dataset that is built on plants that are closer relatives of A. thaliana. Let's then look at the datasets within Viridiplantae:

  • viridiplantae_odb10
  • chlorophyta_odb10
  • embryophyta_odb10
  • liliopsida_odb10
  • poales_odb10
  • eudicots_odb10
  • brassicales_odb10
  • fabales_odb10
  • solanales_odb10

Since A. thaliana resides in the order Brassicales, we may also want to try it with the BUSCO lineages eudicots and brassicales. There are a lot of factors that come into play in what BUSCO results mean. For example, different conclusions could be drawn from comparing the BUSCO results of a genome assembly done with the same data using two different assemblers than from genome assemblies of two different plants. In the first case, a higher BUSCO score using the same data may represent a better assembly, whereas in the second case a higher BUSCO score may just mean that organism was a closer relative of the BUSCO lineage that you used.

Before jumping into the command, there is one note I have on the software dependency side. One of the programs used by BUSCO is called "Metaeuk." For some reason I cannot figure out how to keep metaeuk in the environment PATH: when you exit the EC2 instance, it deletes itself from PATH. The simple fix is to use the commands:

  • $ cd
  • $ export PATH=$(pwd)/metaeuk/bin/:$PATH

each time you start up the BUSCO EC2 instance. If BUSCO fails at any point because it can't find metaeuk, you just have to do those above commands once and it will work fine afterwards.
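
A possible way to make that fix permanent (assuming metaeuk was unpacked into the home directory, as in the install steps above; adjust the path if yours differs) is to append the export to ~/.bashrc so every new login shell picks it up:

```shell
# Persist the metaeuk PATH entry across logins by appending it to ~/.bashrc.
# Assumes metaeuk lives at $HOME/metaeuk/bin (a hypothetical location).
echo 'export PATH="$HOME/metaeuk/bin:$PATH"' >> "$HOME/.bashrc"
# Apply it to the current shell as well
export PATH="$HOME/metaeuk/bin:$PATH"
```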

QUAST documentation: http://quast.sourceforge.net/quast



Fastx toolkit

Date: 11/11/2021

Note: Since this is a collection of different tools, I only show the use of one of them in this documentation.

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which fastq_to_fasta 
/usr/local/bin/fastq_to_fasta

Version of the software on the AMI: Each tool has its own version. Use the -h option to find it for each. Example for fastq_to_fasta:

(base) ubuntu@ip-172-31-52-167:~$ fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir fastx-toolkit
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd fastx-toolkit

Must install libgtextutils first before installing fastx-toolkit:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit$ git clone https://github.com/agordon/libgtextutils.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit$ cd libgtextutils/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/libgtextutils$ ./reconf
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/libgtextutils$ ./configure
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/libgtextutils$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/libgtextutils$ sudo make install

Now fastx-toolkit:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit$ wget https://github.com/agordon/fastx_toolkit/releases/download/0.0.14/fastx_toolkit-0.0.14.tar.bz2 
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit$ tar -xjvf fastx_toolkit-0.0.14.tar.bz2
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/fastx_toolkit-0.0.14$ ./configure
# Because of an error, I found this fix:
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/fastx_toolkit$ cd src/fasta_formatter/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/fastx_toolkit/src/fasta_formatter$ vim fasta_formatter.cpp
# Press 'esc', then type '105j' to move to the 105th line. Press the right arrow until you are
# at the end of usage(), press 'i' to enter insert mode, then press "enter" to open a new line.
# Type "exit(0)", then press 'esc' followed by ":wq" to save your changes and quit.
# Now you can run the following without an error:
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/fastx_toolkit$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/fastx-toolkit/fastx_toolkit$ sudo make install

And it will be installed in /usr/local/bin.

Software Example of Use:

(base) ubuntu@ip-172-31-52-167:~$ fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

(base) ubuntu@ip-172-31-52-167:~$ fastx_quality_stats -h
usage: fastx_quality_stats [-h] [-N] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon (assafgordon@gmail.com)

(base) ubuntu@ip-172-31-52-167:~$ fastq_quality_boxplot_graph.sh -h
Solexa-Quality BoxPlot plotter
Generates a solexa quality score box-plot graph

(base) ubuntu@ip-172-31-52-167:~$ fastx_nucleotide_distribution_graph.sh -h
FASTA/Q Nucleotide Distribution Plotter

etc...

Description of Software:

The FASTX-Toolkit is a collection of command line tools for short-read FASTA/FASTQ file preprocessing. FASTX-Toolkit is unmaintained software; no new features have been added since 2010. Since it is a collection of different tools, note that I only showed the use of one of them in this documentation. The usage of the others is the same, as they are all installed in the same location on the EC2 instance. Just check out the documentation to see more: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

The tools available:

  • FASTQ-to-FASTA converter: Convert FASTQ files to FASTA files.
  • FASTQ Information: Chart Quality Statistics and Nucleotide Distribution
  • FASTQ/A Collapser: Collapsing identical sequences in a FASTQ/A file into a single sequence (while maintaining reads counts)
  • FASTQ/A Trimmer: Shortening reads in a FASTQ or FASTA file (removing barcodes or noise).
  • FASTQ/A Renamer: Renames the sequence identifiers in a FASTQ/A file.
  • FASTQ/A Clipper: Removing sequencing adapters / linkers
  • FASTQ/A Reverse-Complement: Producing the reverse-complement of each sequence in a FASTQ/FASTA file.
  • FASTQ/A Barcode splitter: Splitting a FASTQ/FASTA file containing multiple samples
  • FASTA Formatter: Changes the width of sequence lines in a FASTA file
  • FASTA Nucleotide Changer: Converts FASTA sequences from/to RNA/DNA
  • FASTQ Quality Filter: Filters sequences based on quality
  • FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
  • FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
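
As a sketch of what the first tool in that list does: FASTQ-to-FASTA conversion keeps the identifier and sequence of each four-line record and drops the separator and quality lines. A hedged stdlib-only illustration (not the toolkit's C++ code):

```python
# Toy fastq_to_fasta: keep the identifier and sequence of each 4-line
# FASTQ record, dropping the '+' line and the quality string.
def fastq_to_fasta(fastq_lines):
    out = []
    for i in range(0, len(fastq_lines), 4):
        name = fastq_lines[i].lstrip("@")
        seq = fastq_lines[i + 1]
        out.append(f">{name}\n{seq}")
    return "\n".join(out)

fastq = ["@read1", "ACGT", "+", "IIII", "@read2", "TTGG", "+", "HHHH"]
print(fastq_to_fasta(fastq))
# >read1
# ACGT
# >read2
# TTGG
```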


WENGAN

Date: 11/11/2021

Verification of installation: the software is located at /home/ubuntu/GBI-software/WENGAN/wengan. I have this folder added to the PATH via the .bashrc file.

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan$ ./wengan.pl --version
./wengan.pl version [unknown] calling Getopt::Std::getopts (version 1.12 [paranoid]),
running under Perl version 5.30.0.
  [Now continuing due to backward compatibility and excessive paranoia.
   See 'perldoc Getopt::Std' about $Getopt::Std::STANDARD_HELP_VERSION.]

Process for Installation:

ubuntu@ip-172-31-59-188:~$ sudo apt-get update
ubuntu@ip-172-31-59-188:~$ sudo apt-get upgrade
ubuntu@ip-172-31-51-221:~$ sudo apt-get install build-essential
ubuntu@ip-172-31-58-150:~$ sudo apt install cmake
ubuntu@ip-172-31-58-150:~$ sudo apt install clang
ubuntu@ip-172-31-58-150:~$ git clone --recursive https://github.com/adigenova/wengan.git wengan
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan$ cd components/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd minia/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/minia$ sh INSTALL
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/minia$ cd ..
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd fastmin-sg/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/fastmin-sg$ make all
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd ..
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd liger
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/liger$ sh install.sh
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/liger$ cd ..
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd intervalmiss/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/intervalmiss$ make all
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/intervalmiss$ cd ..
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd seqtk/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/seqtk$ make all
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/seqtk$ cd ..
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components$ cd discovarexp-51885/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/discovarexp-51885$ ./configure
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/discovarexp-51885$ make all
(base) ubuntu@ip-172-31-52-167:~/GBI-software/WENGAN/wengan/components/discovarexp-51885$ sudo make install

Software Example of Use: Copied from the synopsis in the README:

# Assembling Oxford Nanopore and Illumina reads with WenganM
 wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000

# Assembling PacBio reads and Illumina reads with WenganA
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and BGI reads with WenganM
 wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000

# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM
 wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000

# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (need a high memory machine 600GB)
 wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000

# Assembling pacraw reads with pre-assembled short-read contigs from Minia3
 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa

# Assembling pacraw reads with pre-assembled short-read contigs from Abyss
 wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa

# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo
 wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa

Description of Software:

From the README “Wengan is a new genome assembler that, unlike most of the current long-reads assemblers, avoids entirely the all-vs-all read comparison. The key idea behind Wengan is that long-read alignments can be inferred by building paths on a sequence graph. To achieve this, Wengan builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long-reads. Longer alignments are then built by performing a transitive reduction of the edges. Another distinct feature of Wengan is that it performs self-validation by following the read information. Wengan identifies miss-assemblies at different steps of the assembly process. For more information about the algorithmic ideas behind Wengan, please read the preprint available in bioRxiv.”

The last release of this was in 2020. Relevant information can be found at the Github (https://github.com/adigenova/wengan), via the built-in help (wengan.pl -h), and in their main published paper about WENGAN: https://www.nature.com/articles/s41587-020-00747-w



BCFtools

Date: 11.03.2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which bcftools 
/usr/local/bin/bcftools

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ bcftools --version
bcftools 1.13-36-gef5cf0a
Using htslib 1.13-19-g31bf087

Process for Installation: http://samtools.github.io/bcftools/howtos/install.html

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir BCFtools
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd BCFtools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools$ git clone --recurse-submodules https://github.com/samtools/htslib.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools$ git clone https://github.com/samtools/bcftools.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools$ cd bcftools
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools/bcftools$ autoheader
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools/bcftools$ autoconf
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools/bcftools$ ./configure --enable-libgsl --enable-perl-filters
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools/bcftools$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/BCFtools/bcftools$ sudo make install

Software Example of Use:

Information regarding the command and its options can be found at: http://samtools.github.io/bcftools/bcftools.html

Command Syntax: bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]
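As a concrete illustration of that syntax, a small variant-calling pipeline can be staged in a script. This is a sketch only: ref.fa and aln.bam are placeholder inputs, not files on the AMI, so the commands are written to a script here rather than executed.

```shell
# Sketch of a typical bcftools pipeline; ref.fa / aln.bam / calls.vcf.gz
# are placeholder names.
cat > call-variants.sh <<'EOF'
#!/bin/sh
set -e
# Pile up reads against the reference and call variants (bgzipped VCF)
bcftools mpileup -f ref.fa aln.bam | bcftools call -mv -Oz -o calls.vcf.gz
# Index the result and summarize it
bcftools index calls.vcf.gz
bcftools stats calls.vcf.gz > calls.stats.txt
EOF
chmod +x call-variants.sh
```

Both mpileup and call work on VCF/BCF transparently, so the same pattern applies whether your inputs are compressed or not.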

Description of Software:

From docs: “BCFtools is a program for variant calling and manipulating files in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.”

This software is actively being updated so if you have any issues, you will probably have luck asking them in the github issues page in the github link below: Github: https://github.com/samtools/bcftools

Related documentation: http://samtools.github.io/bcftools/howtos/index.html



SparseAssembler

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which SparseAssembler 
/usr/local/bin/SparseAssembler

Version of the software on the AMI: No version documented

Process for Installation:

  - $ cd
  - $ git clone https://github.com/yechengxi/SparseAssembler.git
  - $ cd SparseAssembler/
  - ~/SparseAssembler$ cd compiled/
  - ~/SparseAssembler/compiled$ chmod +x SparseAssembler

Software Example of Use:

Since this assembler is a little less documented, I am copying and pasting the documentation to the end, after I highlight some of the important parts of the command.

The flag that seems most important is

g, which controls the "sparseness" of the k-mers; a higher value skips more k-mers.

This is something that has to be optimized. The balance is between how many k-mers are created and what percentage of them eventually end up used in contigs and, further on, in scaffolds. More k-mers created means more memory and computing resources required to do an assembly (fewer k-mers therefore require less of each). Now, when skipping possible k-mers, how does it affect your final assembly? If you skip too many, does it negatively impact the final result?

After doing some basic optimization of these parameters while performing Arabidopsis thaliana assemblies, I unfortunately don't have any conclusive evidence for how to tune this parameter. With the Illumina A. thaliana data I used, I had ~20x coverage and got an N50 of ~5 Kb using ABySS with k=40, 50, and 60. Here, I tried k=40 and 60 with g (the sparseness parameter) = 10, 17, and 24 (these values span the ones used in examples found via the github page). I did not see a large difference in the output assembly results between g = 10, 17, and 24, yet saw a decrease in the number of k-mers used during the contig building process. Compared to the number of k-mers g = 10 used, g = 17 used 60% and g = 24 used 46%. This is proof of principle that the assembler did in fact lower the amount of memory required (via the number of k-mers used and therefore the number of possible contigs to align) while providing similar assembly quality.

Unfortunately I don't think this can be extrapolated to other assemblies. If you want to use this assembler to assemble a de novo genome, it would be wise to do your own sweep of the parameters to see which values are most efficient while still providing a quality result.

  • k gives the length of the k-mer value used when creating the De Bruijn graph.
  • LD this loads a saved k-mer graph and setting it to 0 means you are not loading a k-mer graph.
  • GS This is the estimated genome length, which is used for memory allocation. If memory isn't an issue, you might use 2x (or greater) of the estimated genome length to prevent running into issues.
  • NodeCovTh this sets a coverage threshold below which k-mers (nodes) are considered erroneous and removed. The default is 1 but I haven't played with adjusting it.
  • EdgeCovTh like the above parameter, this looks like it sets a threshold for identifying incorrect edges in the De Bruijn graph. Default is 0
  • f This is the flag that tells the assembler the following file will be sequencing data. You can use this several times like in the following example command from the github to input your illumina short-read sequencing data.

example command:

./SparseAssembler g 10 k 51 LD 0 GS 200000000 NodeCovTh 1 EdgeCovTh 0 f frag_1.fastq f frag_2.fastq f frag_3.fastq &
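Given the parameter discussion above, a sweep over g and k can be staged by printing the candidate commands first (a sketch; the input file names are the ones from the example command, and echo prints each command instead of running it):

```shell
# Print one SparseAssembler command per (g, k) combination, saving the
# list with tee so it can be reviewed before any runs are launched.
for g in 10 17 24; do
  for k in 40 60; do
    echo "./SparseAssembler g $g k $k LD 0 GS 200000000 NodeCovTh 1 EdgeCovTh 0 f frag_1.fastq f frag_2.fastq"
  done
done | tee sparse-sweep.txt
```

Remove the echo (or run the saved lines one at a time) to perform the actual sweep.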

Description of Software:

SparseAssembler is a De Bruijn Graph assembler, like ABySS and SOAPdenovo2, except that during k-mer construction, a part of De Bruijn assembly discussed on the assembly basics page, fewer of the possible k-mers are used. This is the "sparse" use of k-mers, and is intended to save on memory usage but, more importantly, on computational load. Fewer k-mers effectively means fewer pieces of information to try to pair together. Like SGA, this one hasn't been updated in a while and will probably not end up being our workhorse, but I think it's valuable to see what else is out there besides the most popular assemblers and how they've been adapted.

For more information check out the software’s github page: "A sparse k-mer graph based, memory-efficient genome assembler." https://github.com/yechengxi/SparseAssembler

Also check out the GBI github page for this software at: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-SparseAssembler-on-EC2



Shasta

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which shasta-Linux-0.7.0 
/usr/local/bin/shasta-Linux-0.7.0

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ shasta-Linux-0.7.0 --version
Shasta Release 0.7.0
Linux version

Process for Installation: my notes for this install were not recorded. Shasta releases are distributed as single static executables, so installation is likely equivalent to downloading the release binary from the Github releases page, making it executable, and copying it into /usr/local/bin:

$ wget https://github.com/chanzuckerberg/shasta/releases/download/0.7.0/shasta-Linux-0.7.0
$ chmod ugo+x shasta-Linux-0.7.0
$ sudo cp shasta-Linux-0.7.0 /usr/local/bin/

Software Example of Use:

Example command from the manual to run an assembly (note: the manual example references the 0.8.0 binary; substitute the shasta-Linux-0.7.0 binary installed on the AMI):

./shasta-Linux-0.8.0 --input myinput.fasta --config myConfigFile_2021

Description of Software:

De novo assembly from Oxford Nanopore reads.

From the github: “The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using DNA reads generated by Oxford Nanopore flow cells as input.

Computational methods used by the Shasta assembler include: Using a run-length representation of the read sequence. This makes the assembly process more resilient to errors in homopolymer repeat counts, which are the most common type of errors in Oxford Nanopore reads.

Using in some phases of the computation a representation of the read sequence based on markers, a fixed subset of short k-mers (k ≈ 10). As currently implemented, Shasta can run an assembly of a human genome at coverage around 60x in about 3 hours using a single, large machine (AWS instance type x1.32xlarge, with 128 virtual processors and 1952 GB of memory). The compute cost of such an assembly is around $20 at AWS spot market or reserved prices.”

Github: https://github.com/chanzuckerberg/shasta Documentation pages: https://chanzuckerberg.github.io/shasta/index.html Paper: https://www.nature.com/articles/s41587-020-0503-6



NECAT

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which necat.pl 
/usr/local/bin/necat.pl

Version of the software on the AMI: v0.0.1_update20200803

Process for Installation:

Note: this software must be run on an x86 infrastructure. It will not work using ARM. Most of the following instructions are from the following link:

https://github.com/xiaochuanle/necat (NECAT Documentation)

Make sure Perl is newer than 5.24:

$ perl -v

Download NECAT, extract, unzip, add the /bin folder within NECAT to the PATH

$ wget https://github.com/xiaochuanle/NECAT/releases/download/v0.0.1_update20200803/necat_20200803_Linux-amd64.tar.gz
$ tar xzvf necat_20200803_Linux-amd64.tar.gz
$ cd NECAT/Linux-amd64/bin
$ export PATH=$PATH:$(pwd)

Software Example of Use:

Create a directory to contain the assembly

mkdir my-assembly

Create a config file template using the following command:

$ necat.pl config my-assembly-config.txt

Modifying the relative information:

vim my-assembly-config.txt

Press i to insert text

PROJECT= your assembly project name
ONT_READ_LIST=read_list.txt
GENOME_SIZE=
THREADS=
MIN_READ_LENGTH=

Press escape, then type :wq and press enter to save the file and exit.
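If you prefer not to edit interactively, the same config can be written non-interactively with a heredoc. This is a sketch: the project name, genome size, thread count, and read path below are placeholder values to replace with your own.

```shell
# Write the NECAT config and the read list without opening vim.
# All values here are illustrative placeholders.
cat > my-assembly-config.txt <<'EOF'
PROJECT=my-assembly
ONT_READ_LIST=read_list.txt
GENOME_SIZE=3000000
THREADS=20
MIN_READ_LENGTH=3000
EOF

# read_list.txt is simply one read-file path per line:
echo "/path/to/ont_reads.fastq.gz" > read_list.txt
```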

Assembly time! Correct the raw reads:

$ necat.pl correct my-assembly-config.txt

Assemble the corrected reads:

$ necat.pl assemble my-assembly-config.txt

Bridge the newly-created contigs:

$ necat.pl bridge my-assembly-config.txt

Description of Software:

NECAT is an error correction and de-novo assembly tool for Nanopore long noisy reads.

Documentation at https://github.com/xiaochuanle/necat

For further information regarding this assembler, check out the GBI github page for it at: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-NECAT-on-EC2



Flye

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which flye
/usr/local/bin/flye

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ flye --version
2.8.3-b1767

Process for Installation:

Set up the basics and dependencies:

$ sudo apt update && sudo apt upgrade
$ sudo apt install build-essential g++ make zlib1g zlib1g-dev

Download miniconda (Anaconda Python but without unnecessary packages since this instance will only need certain packages).

$ cd
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Press enter to scroll down the license agreement Enter yes for the default settings Next

$ rm Miniconda3-latest-Linux-x86_64.sh

If you accidentally did not enter yes when the Miniconda installer asked to initialize conda (this sets the correct PATH for your conda environment), use the following:

$ conda init bash
$ source ~/.bashrc

Lastly

$ conda update --name base conda --yes

Install Flye

$ cd
$ git clone https://github.com/fenderglass/Flye
$ cd Flye
$ python setup.py install

Software Example of Use:

Assemble your genome with Flye! Here's an example for oxford nanopore data of Lambda Phage:

flye --nano-raw /home/ubuntu/lambda-ont-data/fastq_runid_5dd3f31631aaf8b094e6dfd522b916c92d81e5ac_0.fastq --genome-size 48502 --out-dir ~/lambda-ont-flye-results/ --threads 4

Description of Software:

De novo assembler for single molecule sequencing reads using repeat graphs

From the github:

“Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.”

Flye manual: https://github.com/fenderglass/Flye/blob/flye/docs/USAGE.md#examples

Github: https://github.com/fenderglass/Flye

For more information on running this assembler in the GBI github: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-Flye-on-EC2



Canu

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which canu
/usr/bin/canu

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ canu --version
Canu 1.9

Process for Installation:

Install Canu dependencies

$ sudo apt update &&  sudo apt upgrade
$ sudo apt install build-essential

Check that you have perl 5.12 or newer downloaded:

$ perl -v
$ sudo apt install openjdk-8-jre-headless
$ sudo apt install gnuplot

Download Canu

sudo apt install canu

Software Example of Use:

Assemble your genome with Canu! Here's an example for oxford nanopore data of Lambda Phage:

$ canu -p lambda -d lambda-phage-ont genomeSize=48.5k -nanopore-raw lambda_26620_read_11_ch_126d.fast5

To get assistance, check the help pages:

$ canu --help

Description of Software:

From the quick start page:

“Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu operates in three phases: correction, trimming and assembly. The correction phase will improve the accuracy of bases in reads. The trimming phase will trim reads to the portion that appears to be high-quality sequence, removing suspicious regions such as remaining SMRTbell adapter. The assembly phase will order the reads into contigs, generate consensus sequences and create graphs of alternate paths.”

Canu is extremely well documented. To find more information check: The github: https://github.com/marbl/canu The documentation pages: https://canu.readthedocs.io/en/latest/index.html The GBI github wiki canu page: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-Canu-on-EC2



ABySS

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which abyss-pe
/home/ubuntu/dependency-software/miniconda3/bin/abyss-pe

Version of the software on the AMI:

2.3.3 (updated 11/11/2021)

Process for Installation:

conda install -c bioconda abyss

Software Example of Use:

A basic command for ABySS looks like:

abyss-pe name=[assembly-name] j=[num-threads] v=-v k=53 in="SRR1946554_1.fastq SRR1946554_2.fastq" | tee filename.log

First is the command itself. If you type abyss- and then press tab, you will see a variety of options. These seem to mostly be individual modules of the ABySS program, some of which can be used on their own for doing things like calculating statistics from the assembly. However the main program as shown above is abyss-pe.

name=[assembly-name], where you set the output name of the assembly. You might put some information about the assembly itself here to create a descriptive output filename, like 'apallida-k53', to signify the plant being assembled and the k-mer value. Sometimes a longer filename creates a bit more work, but the more descriptive you make it, the easier it is when looking back on your data months later!

j=[num-threads] sets the number of threads available you want to allocate from the EC2 instance to do the assembly. Some of the processes are multi-threaded, so make sure to set the maximum amount of threads you have to expedite the assembly process.

v=-v makes the assembly give a verbose output. This can be helpful for understanding what went wrong (or right!) during the assembly afterwards. k=[k-mer-value] sets the k-mer value for the assembler.

in="filename1.filetype filename2.filetype" gives the assembler the location and name of the input sequencing read data. If you are in the folder where the data is located, you don't need to add the path to the data beforehand. If you are outside of the data folder, you need to include the path to the data.

| tee filename.log creates a file where all the assembly output information is stored. By using the v=-v flag above we can increase the amount of information saved into this log file.

Putting this all together, an example assembly using the plant Arabidopsis thaliana, 16 threads, a k-mer value of 53, and input files named SRR1946554_1.fastq and SRR1946554_2.fastq might look like:

abyss-pe name=athaliana-k53-FM060821 j=16 v=-v k=53 in="SRR1946554_1.fastq SRR1946554_2.fastq" | tee athaliana-k53-FM060821-stdout.log
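Since a good k value is usually found by trying several, a sweep can be staged the same way as a single run. This is a sketch: echo prints each candidate command instead of running it, and tee keeps the list in a file for review.

```shell
# Print one abyss-pe command per k value, saving the list with tee.
for k in 40 53 60; do
  echo "abyss-pe name=athaliana-k${k} j=16 v=-v k=${k} in=\"SRR1946554_1.fastq SRR1946554_2.fastq\""
done | tee abyss-k-sweep.txt
```

Drop the echo to launch the runs for real, ideally one at a time so each assembly gets the full thread count.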

Description of Software:

Find more information about this at the github page: https://github.com/bcgsc/abyss

https://www.bcgsc.ca/resources/software/abyss

Or at the GBI github page for ABySS: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-ABySS-on-EC2



MOTHUR

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which mothur
/usr/local/bin/mothur

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ mothur --version
Linux 
Mothur version=1.46.0
Release Date=8/25/21

Process for Installation: Installation instructions are very clear - read the INSTALL.md file within the main mothur directory.

(base) ubuntu@ip-172-31-52-167:~/GBI-software/MOTHUR/mothur$ cat INSTALL.md 

For more help with installation, check here: https://mothur.org/wiki/installation/

Software Example of Use: If you are going to use MOTHUR, I highly recommend reading the help page:

(base) ubuntu@ip-172-31-52-167:~$ mothur --help

Description of Software:

From the main webpage: “This project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community.”

Main webpage: https://mothur.org/ MOTHUR manual: https://mothur.org/wiki/mothur_manual/ Github: https://github.com/mothur/mothur



Dada2

Date: 11/11/2021

Verification of installation:

/usr/local/lib/R/site-library

Version of the software on the AMI:

> library(dada2); packageVersion("dada2")
Loading required package: Rcpp
[1] ‘1.20.0’

Process for Installation:

In order to install you need

R 3.6.1 or newer (√)

Bioconductor v3.10 https://www.bioconductor.org/install/

To install core packages, type the following in an R command window:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

I entered 'a' to update all the old packages (which looks like it just refers to one called 'tibble').

The downloaded source packages are in ‘/tmp/Rtmpy74zIP/downloaded_packages’

For RCurl, in bash:

$ sudo apt-get install libcurl4-gnutls-dev
$ sudo R

Then, in the R prompt:

install.packages("RCurl")
BiocManager::install("GenomeInfoDb")
BiocManager::install("Biostrings")
BiocManager::install("Rhtslib")

Lastly:

BiocManager::install("dada2")

Success!

Software Use: This software is an R package, so in order to use it you either go through an R session using the command 'R' and then importing the package:

$ R
# then
> library(dada2)

or use Rscript.
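For the Rscript route, a minimal batch script might look like this (a sketch: it only loads the package and reports its version, and is written to a file here rather than run, since it needs R with dada2 installed):

```shell
# Stage a tiny R script that verifies the dada2 installation.
cat > check-dada2.R <<'EOF'
# Load dada2 and report the installed version
library(dada2)
print(packageVersion("dada2"))
EOF
# Then run it with:  Rscript check-dada2.R
```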

Commands within the dada2 R package can then be found in the manual (linked below).

Description of Software:

Tutorial: https://benjjneb.github.io/dada2/tutorial.html

Dada2 manual: https://www.bioconductor.org/packages/release/bioc/manuals/dada2/man/dada2.pdf



GATK

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which gatk
/usr/local/bin/gatk

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ gatk --version
Using GATK jar /home/ubuntu/GBI-software/GATK/gatk/build/libs/gatk-package-4.2.2.0-SNAPSHOT-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ubuntu/GBI-software/GATK/gatk/build/libs/gatk-package-4.2.2.0-SNAPSHOT-local.jar --version
The Genome Analysis Toolkit (GATK) v4.2.2.0-SNAPSHOT
HTSJDK Version: 2.24.1
Picard Version: 2.25.4

Process for Installation:

Clone the GATK github to your computer
Enter the GATK repo
Use ./gradlew bundle

Software Example of Use: There are a lot of different tools within this toolkit. In order to check out the full list, look here: https://gatk.broadinstitute.org/hc/en-us/articles/4405443524763--Tool-Documentation-Index

The basic syntax for using one of these commands is

 gatk AnyTool toolArgs

You can get help with gatk itself using

gatk --help

Or with a specific tool within it using

gatk [tool] --help

To get the full list of tools from within the command line use:

gatk --list
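Putting the gatk AnyTool toolArgs pattern together, a single-tool invocation can be staged in a script. This is a sketch: ref.fa and sample.bam are placeholder inputs, so the command is written to a script rather than executed here.

```shell
# Stage a HaplotypeCaller run (placeholder file names throughout).
cat > run-haplotypecaller.sh <<'EOF'
#!/bin/sh
# Call variants on one sample in GVCF mode
gatk HaplotypeCaller -R ref.fa -I sample.bam -O sample.g.vcf.gz -ERC GVCF
EOF
chmod +x run-haplotypecaller.sh
```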

Description of Software:

The Genome Analysis Toolkit (GATK) from the Broad Institute: https://gatk.broadinstitute.org/hc/en-us

From the website: “A genomic analysis toolkit focused on variant discovery. The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic short variant calling, and to tackle copy number (CNV) and structural variation (SV). In addition to the variant callers themselves, the GATK also includes many utilities to perform related tasks such as processing and quality control of high-throughput sequencing data, and bundles the popular Picard toolkit.

These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. And although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.”

More information can be found on the github: https://github.com/broadinstitute/gatk



ANGSD

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which angsd
/usr/local/bin/angsd

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ angsd --version
	-> angsd version: 0.935-51-g48e3d61 (htslib: 1.10.2-3) build(Aug 25 2021 04:04:31)

Process for Installation:

$ git clone https://github.com/angsd/angsd.git;
$ cd angsd
$ make

Software Example of Use:

From http://www.popgen.dk/angsd/index.php/ANGSD

Basic syntax:

./angsd [OPTIONS]

Example of allele frequency estimated from genotype likelihoods with bam files as input using 10 threads

./angsd -out outFileName -bam bam.filelist -GL 1 -doMaf 1 -doMajorMinor 1 -nThreads 10
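The -bam argument expects a plain text file listing one BAM path per line; such a file can be generated like this (a sketch with placeholder file names):

```shell
# Build bam.filelist from explicit names
# (or collect them with: ls /path/to/bams/*.bam > bam.filelist)
printf '%s\n' sample1.bam sample2.bam sample3.bam > bam.filelist
cat bam.filelist
```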

Description of Software:

The title is quite concise as far as the function of this package: it is the Analysis of Next Gen Sequencing Data (ANGSD).

From their paper (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0356-4): This program can calculate various summary statistics, and perform association mapping and population genetic analyses utilizing the full information in next generation sequencing data by working directly on the raw sequencing data or by using genotype likelihoods.

For more information look at the github: https://github.com/angsd/angsd



SPAdes

Date: 11/11/2021

Verification of installation:

Because this software packages a handful of different commands, I added the bin folder within the SPAdes installation directory to the PATH using the .bashrc file

(base) ubuntu@ip-172-31-52-167:~$ which spades.py
/home/ubuntu/GBI-software/SPAdes/SPAdes-3.15.3/bin//spades.py

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ spades.py --version
SPAdes genome assembler v3.15.3

Process for Installation:

$ wget http://cab.spbu.ru/files/release3.15.3/SPAdes-3.15.3-Linux.tar.gz
$ tar -xzf SPAdes-3.15.3-Linux.tar.gz
$ cd SPAdes-3.15.3-Linux/bin/

Then I added the bin folder within the SPAdes installation directory to the PATH using the .bashrc file

Software Example of Use:

Basic command syntax:

Usage: spades.py [options] -o <output_dir>

This assembler is extremely well documented. Go to the manual on the github (linked below), and select the blue link to [3. Running SPAdes].
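As a minimal illustration of that usage line, a typical paired-end run might be staged like this (a sketch: the read file names are placeholders, so the command is written to a script rather than executed here):

```shell
# Stage a paired-end SPAdes run; -1/-2 are forward/reverse reads,
# -t is threads, -o is the output directory (all placeholders).
cat > run-spades.sh <<'EOF'
#!/bin/sh
spades.py -1 reads_1.fastq -2 reads_2.fastq -t 16 -o spades-out
EOF
chmod +x run-spades.sh
```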

Description of Software:

Github: https://github.com/ablab/spades

From the git summary:

“The current version of SPAdes works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs that will be used as long reads.

Version 3.15.3 of SPAdes supports paired-end reads, mate-pairs and unpaired reads. SPAdes can take as input several paired-end and mate-pair libraries simultaneously. Note, that SPAdes was initially designed for small genomes. It was tested on bacterial (both single-cell MDA and standard isolates), fungal and other small genomes. SPAdes is not intended for larger genomes (e.g. mammalian size genomes). For such purposes you can use it at your own risk.“



VEGAN

Date: 11/11/2021

PATH of installation:

/usr/local/lib/R/site-library

Version of the software on the AMI:

2.5.7:

> library(vegan)
Loading required package: permute
Loading required package: lattice
This is vegan 2.5-7

Process for Installation:

Downloading R

ubuntu@ip-172-31-58-54:~$ sudo apt update -qq
ubuntu@ip-172-31-58-54:~$ sudo apt install --no-install-recommends software-properties-common dirmngr
ubuntu@ip-172-31-58-54:~$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
ubuntu@ip-172-31-58-54:~$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/"
ubuntu@ip-172-31-58-54:~$ sudo apt install --no-install-recommends r-base

Then to add the repository with VEGAN I:

Go into the R prompt as root:

ubuntu@ip-172-31-52-167:~$ sudo R

And download the package:

install.packages("vegan")

Software Example of Use:

It is installed within the folder:

/usr/local/lib/R/site-library

You can access/use its functions by either opening up an R session and using its functions within that session or by using Rscript to execute a script written in R that uses VEGAN functions.

For the first method, in the command line, type:

$ R

This will create an R session within the command line.

The prompt will change from the bash prompt

$

to the R prompt:

>

Now to use the VEGAN functions, use the syntax "package::function". For example, to use one of the first functions in VEGAN, called "adipart":

> vegan::adipart(...)

where (...) represents the input variables for that function (which can be found in the documentation at https://cran.r-project.org/web/packages/vegan/vegan.pdf)
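The same works non-interactively via Rscript. Here is a minimal sketch (it requires R with vegan installed, so it is staged in a file rather than run here; BCI is an example community dataset bundled with vegan):

```shell
# Stage an R script that computes Shannon diversity on a bundled dataset.
cat > diversity-example.R <<'EOF'
library(vegan)
data(BCI)                          # community matrix shipped with vegan
H <- diversity(BCI, index = "shannon")
print(head(H))
EOF
# Then run it with:  Rscript diversity-example.R
```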

Description of Software:

From the documentation: “The vegan package provides tools for descriptive community ecology. It has most basic functions of diversity analysis, community ordination and dissimilarity analysis. Most of its multivariate tools can be used for other data types as well.

The functions in the vegan package contain tools for diversity analysis, ordination methods and tools for the analysis of dissimilarities. Together with the labdsv package, the vegan package provides most standard tools of descriptive community analysis. Package ade4 provides an alternative comprehensive package, and several other packages complement vegan and provide tools for deeper analysis in specific fields. Package BiodiversityR provides a GUI for a large subset of vegan functionality”

For more information check The docs: https://cran.r-project.org/web/packages/vegan/vegan.pdf The github: https://github.com/vegandevs/vegan/



QIIME2

Date: 11/11/2021

Verification of installation:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~/GBI-software/QIIME2$ which qiime 
/home/ubuntu/dependency-software/miniconda3/envs/qiime2-2021.4/bin/qiime

Version of the software on the AMI:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~/GBI-software/QIIME2$ qiime info
System versions
Python version: 3.8.8
QIIME 2 release: 2021.4
QIIME 2 version: 2021.4.0
q2cli version: 2021.4.0

Installed plugins
alignment: 2021.4.0
composition: 2021.4.0
cutadapt: 2021.4.0
dada2: 2021.4.0
deblur: 2021.4.0
demux: 2021.4.0
diversity: 2021.4.0
diversity-lib: 2021.4.0
emperor: 2021.4.0
feature-classifier: 2021.4.0
feature-table: 2021.4.0
fragment-insertion: 2021.4.0
gneiss: 2021.4.0
longitudinal: 2021.4.0
metadata: 2021.4.0
phylogeny: 2021.4.0
quality-control: 2021.4.0
quality-filter: 2021.4.0
sample-classifier: 2021.4.0
taxa: 2021.4.0
types: 2021.4.0
vsearch: 2021.4.0

Process for Installation:

$ conda update conda
$ conda install wget
$ wget https://data.qiime2.org/distro/core/qiime2-2021.8-py38-linux-conda.yml
$ conda env create -n qiime2-2021.8 --file qiime2-2021.8-py38-linux-conda.yml
# OPTIONAL CLEANUP
$ rm qiime2-2021.8-py38-linux-conda.yml
$ conda activate qiime2-2021.8

The qiime2 website recommends installing this software within a virtual environment. All this means is that in order to use it, you must activate the conda environment that was created during installation with qiime2 in it!
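In practice that means every session starts by activating the environment before invoking qiime. A sketch (the environment name and the miniconda path match the prompts shown above; staged in a script rather than run here):

```shell
# Stage a session script that enables the QIIME 2 environment.
cat > qiime-session.sh <<'EOF'
#!/bin/bash
# Make conda's activate available in a non-interactive shell
source ~/dependency-software/miniconda3/etc/profile.d/conda.sh
conda activate qiime2-2021.4
qiime info
EOF
chmod +x qiime-session.sh
```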

Software Example of Use:

The syntax for running the software is:

qiime [OPTIONS] COMMAND [ARGS]...

To get a list of the associated tools and commands use

$ qiime --help

And

$ qiime tools

Description of Software:

From the documentation (https://docs.qiime2.org/2021.8/): “QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.”

For information relevant to running QIIME 2 (how data or plots are stored), check out the concepts page of the manual: https://docs.qiime2.org/2021.8/concepts/



MaSuRCA

Date: 11/11/2021

Verification of installation:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ which masurca 
/usr/local/bin/masurca

Version of the software on the AMI:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ masurca --version
version 4.0.4

Process for Installation: On your personal or lab computer, download the most recent distribution from link above and then copy it to the EC2 instance

$ scp -i /path/to/keypairs/keypair.pem local/path/to/MaSuRCA-release.tar.gz ubuntu@ec2-xx-xxx-xxx-xxx.us-west-1.compute.amazonaws.com:~/

Then back on the EC2 instance, unzip and install it using the following commands:

$ tar -zxvf MaSuRCA-4.0.3.tar.gz 
$ cd MaSuRCA-4.0.3/
$ BOOST_ROOT=install ./install.sh

Software Example of Use:

MaSuRCA is run by creating a config file and then using the command masurca [config-file]. After running this command, masurca will build a bash script called assemble.sh, which you run from the directory you want the assembly done in. In the following examples I use an old version of MaSuRCA (v3.4.2) because I was having issues with the newest release. In theory the newer release does not require the config file for simple assemblies, but it is probably best practice to use one, because writing it gives the user a better understanding of what parameters there are to change.

An example of a masurca config file can be found here: https://github.com/Green-Biome-Institute/AWS/blob/master/masurca_config_ex
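For reference, the general shape of such a config file is sketched below. The insert size (500 +/- 50), file names, and parameter values are illustrative placeholders; see the linked example for a complete, working file.

```shell
# Write a skeleton MaSuRCA config (DATA and PARAMETERS sections,
# each closed by END); all values are placeholders.
cat > masurca-config.txt <<'EOF'
DATA
PE= pe 500 50 SRR1946554_1.fastq SRR1946554_2.fastq
NANOPORE=SRR11968809.fastq.gz
END
PARAMETERS
NUM_THREADS = 16
JF_SIZE = 200000000
END
EOF
```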

Running MaSuRCA with a config file example:

# Navigate to your assembly directory
$ cd
$ cd athaliana-assembly

# List what's in the directory athaliana-assembly
athaliana-assembly$ ls
athaliana-assembly

# run the masurca command
athaliana-assembly$ ../MaSuRCA-3.4.2/bin/masurca athaliana-config.txt

# check to see if the bash script assemble.sh was created
athaliana-assembly$ ls
 athaliana-assembly    assemble.sh

# run the MaSuRCA assembler using the assemble.sh script
athaliana-assembly$ ./assemble.sh

Example command without a config file:

$ MaSuRCA-4.0.4/bin/masurca -t 16 -i athal-data/short-read/SRR1946554_1.fastq.gz,athal-data/short-read/SRR1946554_2.fastq.gz -r athal-data/long-read/SRR11968809.fastq.gz 

Because you aren't using a configuration file, which would normally carry this information, you must supply it with flags:

-t = number of threads
-i = paired-end Illumina reads (forward and reverse files, comma-separated)
-r = long-read data (e.g. PacBio/Nanopore) file

Description of Software:

From the github:

“The MaSuRCA (Maryland Super Read Cabog Assembler) genome assembly and analysis toolkit contains of MaSuRCA genome assembler, QuORUM error corrector for Illumina data, POLCA genome polishing software, Chromosome scaffolder, jellyfish mer counter, and MUMmer aligner. The usage instructions for the additional tools that are exclusive to MaSuRCA, such as POLCA and Chromosome scaffolder are provided at the end of this Guide.

The MaSuRCA assembler combines the benefits of deBruijn graph and Overlap-Layout-Consensus assembly approaches. Since version 3.2.1 it supports hybrid assembly with short Illumina reads and long high error PacBio/MinION data.“

For more information check: The MaSuRCA github: https://github.com/alekseyzimin/masurca The GBI wiki page for MaSuRCA: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-MaSuRCA-on-EC2



SOAPDenovo2

Date: 11/11/2021

Verification of installation:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ which soapdenovo2-127mer 
/usr/bin/soapdenovo2-127mer

Version of the software on the AMI:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ soapdenovo2-127mer --help

Version 2.04: released on July 13th, 2012
Compile Mar 22 2020	15:58:14

Process for Installation:

$ sudo apt install soapdenovo2

Software Example of Use:

A basic command for SOAPdenovo2 looks like:

soapdenovo2-127mer all -s /path/to/config-file -K [k-mer-value] -p [num-threads] -o /path/to/assembly/directory/assembly-output-filename 1>assembly.log 2>assembly.err
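The -s flag points at a config file describing your read libraries. As a sketch only (paths, read length, and insert size are placeholders; see the SOAPdenovo2 manual linked below for the full option set), a minimal config for one paired-end Illumina library might be written like this:

```shell
# Write a minimal sketch of a SOAPdenovo2 config for one paired-end library.
# max_rd_len, avg_ins, and the read paths are placeholders for your data.
cat > soap_config_sketch.txt <<'EOF'
max_rd_len=150
[LIB]
avg_ins=300
reverse_seq=0
asm_flags=3
rank=1
q1=/path/to/reads_1.fastq
q2=/path/to/reads_2.fastq
EOF
cat soap_config_sketch.txt
```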

For lots more information, check out the GBI page dedicated to SOAPdenovo here!: https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-SOAPdenovo2-on-EC2

Description of Software:

There is lots of information about SOAPdenovo2.

A good entry point is the GBI soapdenovo2 github wiki:

https://github.com/Green-Biome-Institute/AWS/wiki/Assembling-with-SOAPdenovo2-on-EC2

For more information, use the --help command and go to the SOAPdenovo2 github page:

https://github.com/aquaskyline/SOAPdenovo2

Or the following links as well:

http://manpages.ubuntu.com/manpages/xenial/man1/soapdenovo-63mer.1.html

https://banana-slug.soe.ucsc.edu/_media/lecture_notes:soap_team2_report.pdf

https://home.cc.umanitoba.ca/~psgendb/doc/SOAP/SOAPdenovo-Trans-UsersGuide.html



Genome Tools

Date: 11/11/2021

Verification of installation:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ which gt
/usr/local/bin/gt

Version of the software on the AMI:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ gt --version
gt (GenomeTools) 1.6.2
Copyright (c) 2003-2016 G. Gremme, S. Steinbiss, S. Kurtz, and CONTRIBUTORS
Copyright (c) 2003-2016 Center for Bioinformatics, University of Hamburg
See LICENSE file or http://genometools.org/license.html for license details.

Used compiler: cc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Compile flags:  -g -Wall -Wunused-parameter -pipe -fPIC -Wpointer-arith -Wno-unknown-pragmas -O3 -Werror

Process for Installation:

(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ sudo apt-get install genometools
(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ sudo apt-get install libgenometools0
(qiime2-2021.4) ubuntu@ip-172-31-52-167:~$ sudo apt-get install libgenometools0-dev

Software Example of Use:

Basic syntax:

gt [option ...] [tool | script] [argument ...]

This software contains numerous tools. In order to list them, use:

$ gt --help

Description of Software:

Further information can be found at:

The website: http://genometools.org/index.html
Documentation for the individual tools: http://genometools.org/tools.html
The github: https://github.com/genometools/genometools
The development guide: http://genometools.org/documents/devguide.pdf



NGS Tools

Date: 11/11/2021

Verification of installation:

The PATH of the base directory for NGS tools is:

/home/ubuntu/GBI-software/ngsTools

However, the PATH for each of the tools within it is in their respective directories.

Version of the software on the AMI:

There isn’t a release version for this because it is a collection of different tools under the name “NGS tools”.

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ mkdir ngsTools
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd ngsTools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/ngsTools$ git clone --recursive https://github.com/mfumagalli/ngsTools.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/ngsTools$ cd ngsTools/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/ngsTools/ngsTools$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/ngsTools/ngsTools$ make test

Software Example of Use:

In order to use the tools within NGS tools, you will need to go into the NGS tools directory at the following path:

$ cd /home/ubuntu/GBI-software/ngsTools

Then from there you can enter into the specific tool you want to use. For example, to use ngsPopGen, you would use

$ cd ngsPopGen

Then list the contents of this directory and you will find a couple different commands (they will be in green lettering):

$ ls

In order to use a specific tool here, prefix the command with a period and forward slash, as shown here for ngsCovar:

$ ./ngsCovar 

To get help with any of these commands, simply run it with no options and it will print all the possible options and the input values required to run it.

Description of Software:

From the git:

“NGS (Next-Generation Sequencing) technologies have revolutionised population genetic research by enabling unparalleled data collection from the genomes or subsets of genomes from many individuals. Current technologies produce short fragments of sequenced DNA called reads that are either de novo assembled or mapped to a pre-existing reference genome. This leads to chromosomal positions being sequenced a variable number of times across the genome. This parameter is usually referred to as the sequencing depth. Individual genotypes are then inferred from the proportion of nucleotide bases covering each site after the reads have been aligned.”

For further information, check out the github here: https://github.com/mfumagalli/ngsTools

Links to all the tools within ngsTools are on that github above, so if you have questions regarding the individual tools, go to that github and click on the link that will take you to the documentation/repository of that specific tool! ​​



Trimmomatic

Date: 11/11/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which trimmomatic 
/home/ubuntu/dependency-software/miniconda3/bin/trimmomatic

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ trimmomatic -version
0.39

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~$ conda install -c bioconda trimmomatic

Software Example of Use:

Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
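As a hedged illustration of the PE synopsis above (all file names and trimming thresholds are placeholders, not recommended settings), the snippet below writes an example paired-end command to a script file so it can be inspected before running it on a machine where trimmomatic is installed:

```shell
# Write an example paired-end Trimmomatic command to a file for inspection.
# Input/output file names and trimming parameters below are placeholders.
cat > trimmomatic_example.sh <<'EOF'
trimmomatic PE -threads 4 -phred33 \
  reads_1.fastq.gz reads_2.fastq.gz \
  out_1P.fastq.gz out_1U.fastq.gz out_2P.fastq.gz out_2U.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36
EOF
cat trimmomatic_example.sh
```

The trimming steps (ILLUMINACLIP, SLIDINGWINDOW, MINLEN) are applied in the order given on the command line, as described in the github quote below.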

Description of Software:

From the github: “Trimmomatic performs a variety of useful trimming tasks for illumina paired-end and single ended data.The selection of trimming steps and their associated parameters are supplied on the command line. The current trimming steps are: ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read. SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. LEADING: Cut bases off the start of a read, if below a threshold quality TRAILING: Cut bases off the end of a read, if below a threshold quality CROP: Cut the read to a specified length HEADCROP: Cut the specified number of bases from the start of the read MINLEN: Drop the read if it is below a specified length TOPHRED33: Convert quality scores to Phred-33 TOPHRED64: Convert quality scores to Phred-64“

Further documentation can be found on the github here: https://github.com/usadellab/Trimmomatic

Or in the manual here: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf



MEGAN

Date: 11/11/2021

Verification of installation:

/home/ubuntu/GBI-software/MEGAN

Version of the software on the AMI:

Megan6

Process for Installation:

https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/megan6

Software Example of Use:

This software is available via the command line interface, but it is far easier to use through its dedicated GUI. If there are specific commands within MEGAN you would like to use to automate an analysis (for example, by having it run directly after another step), that can be done. Otherwise, it is better to use the GUI: click the link in the installation section above, download and install MEGAN on your own or your lab computer, and then open it (you can do this by running ./MEGAN in the directory where you installed it).

Description of Software:

The MEGAN6 manual: https://software-ab.informatik.uni-tuebingen.de/download/megan6/manual.pdf

Introductory video to MEGAN6: https://www.youtube.com/watch?v=2-4th7O0rOU&feature=youtu.be

From the manual: “Typically, after generating a RMA file (read-match archive) from a BLAST file, the user will then interact with the program, using the Find toolbar to determine the presence of key species, collapsing or un-collapsing nodes to produce summary statistics and using the Inspector window to look at the details of the matches that are the basis of the assignment of reads to taxa. The assignment of reads to taxa is computed using the LCA-assignment algorithm, see [10] for details. In addition to taxonomic binning, MEGAN also allows functional analysis. Another main feature is the comparison of samples. There are a number of tools for graphing data, and for import and export of data. The Community Edition of MEGAN provides a graphical user interface to allow the interactively explore and analyze their samples.”



NovoPlasty

Date: 11/17/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which NOVOPlasty.pl 
/home/ubuntu/dependency-software/miniconda3/bin/NOVOPlasty.pl

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ NOVOPlasty.pl 


-----------------------------------------------
NOVOPlasty: The Organelle Assembler
Version 4.3.1
Author: Nicolas Dierckxsens, (c) 2015-2020
-----------------------------------------------

Process for Installation:

Installed using conda, but I show how you can install it using git clone as well if that’s desirable in a future application:

(base) ubuntu@ip-172-31-52-167:~$ conda install novoplasty

or

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ git clone https://github.com/ndierckx/NOVOPlasty.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd NOVOPlasty/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/NOVOPlasty$ chmod +x NOVOPlasty4.3.1.pl 
(base) ubuntu@ip-172-31-52-167:~/GBI-software/NOVOPlasty$ chmod +x filter_reads.pl 

Software Example of Use:

The command is accessible system-wide. Example command (requires you to fill out the config file, or make a copy of it, fill out the copy, and then pass that copy to the program with the -c option):

(base) ubuntu@ip-172-31-52-167:~/GBI-software/NOVOPlasty$ NOVOPlasty.pl -c config.txt 

The example config file that is given can be found at /home/ubuntu/GBI-software/NOVOPlasty/config.txt or at the github (linked below). And for good measure, here is a copy of it:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/NOVOPlasty$ cat config.txt 
Project:
-----------------------
Project name          = Test
Type                  = mito
Genome Range          = 12000-22000
K-mer                 = 33
Max memory            = 
Extended log          = 0
Save assembled reads  = no
Seed Input            = /path/to/seed_file/Seed.fasta
Extend seed directly  = no
Reference sequence    = /path/to/reference_file/reference.fasta (optional)
Variance detection    = 
Chloroplast sequence  = /path/to/chloroplast_file/chloroplast.fasta (only for "mito_plant" option)

Dataset 1:
-----------------------
Read Length           = 151
Insert size           = 300
Platform              = illumina
Single/Paired         = PE
Combined reads        = 
Forward reads         = /path/to/reads/reads_1.fastq
Reverse reads         = /path/to/reads/reads_2.fastq
Store Hash            =

Heteroplasmy:
-----------------------
MAF                   = 
HP exclude list       = 
PCR-free              = 

Optional:
-----------------------
Insert size auto      = yes
Use Quality Scores    = no
Output path           = 




Project:
-----------------------
Project name         = Choose a name for your project, it will be used for the output files.
Type                 = (chloro/mito/mito_plant) "chloro" for chloroplast assembly, "mito" for mitochondrial assembly and 
                       "mito_plant" for mitochondrial assembly in plants.
Genome Range         = (minimum genome size-maximum genome size) The expected genome size range of the genome.
                       Default value for mito: 12000-20000 / Default value for chloro: 120000-200000
                       If the expected size is know, you can lower the range, this can be useful when there is a repetitive
                       region, what could lead to a premature circularization of the genome.
K-mer                = (integer) This is the length of the overlap between matching reads (Default: 33). 
                       If reads are shorter then 90 bp or you have low coverage data, this value should be decreased down to 23. 
                       For reads longer then 101 bp, this value can be increased, but this is not necessary.
Max memory           = You can choose a max memory usage, suitable to automatically subsample the data or when you have limited                      
                       memory capacity. If you have sufficient memory, leave it blank, else write your available memory in GB
                       (if you have for example a 8 GB RAM laptop, put down 7 or 7.5 (don't add the unit in the config file))
Extended log         = Prints out a very extensive log, could be useful to send me when there is a problem  (0/1).
Save assembled reads = All the reads used for the assembly will be stored in separate files (yes/no)
Seed Input           = The path to the file that contains the seed sequence.
Extend seed directly = This gives the option to extend the seed directly, instead of finding matching reads. Only use this when your seed 
                       originates from the same sample and there are no possible mismatches (yes/no)
Reference (optional) = If a reference is available, you can give here the path to the fasta file.
                       The assembly will still be de novo, but references of the same genus can be used as a guide to resolve 
                       duplicated regions in the plant mitochondria or the inverted repeat in the chloroplast. 
                       References from different genus haven't been tested yet.
Variance detection   = If you select yes, you should also have a reference sequence (previous line). It will create a vcf file                
                       with all the variances compared to the give reference (yes/no)
Chloroplast sequence = The path to the file that contains the chloroplast sequence (Only for mito_plant mode).
                       You have to assemble the chloroplast before you assemble the mitochondria of plants!

Dataset 1:
-----------------------
Read Length          = The read length of your reads.
Insert size          = Total insert size of your paired end reads, it doesn't have to be accurate but should be close enough.
Platform             = illumina/ion - The performance on Ion Torrent data is significantly lower
Single/Paired        = For the moment only paired end reads are supported.
Combined reads       = The path to the file that contains the combined reads (forward and reverse in 1 file)
Forward reads        = The path to the file that contains the forward reads (not necessary when there is a merged file)
Reverse reads        = The path to the file that contains the reverse reads (not necessary when there is a merged file)
Store Hash           = If you want several runs on one dataset, you can store the hash locally to speed up the process (put "yes" to store the hashes locally)
                       To run local saved files, goto te wiki section of the github page

Heteroplasmy:
-----------------------
MAF                  = (0.007-0.49) Minor Allele Frequency: If you want to detect heteroplasmy, first assemble the genome without this option. Then give the resulting                         
                       sequence as a reference and as a seed input. And give the minimum minor allele frequency for this option 
                       (0.01 will detect heteroplasmy of >1%)
HP exclude list      = Option not yet available  
PCR-free             = (yes/no) If you have a PCR-free library write yes

Optional:
-----------------------
Insert size auto     = (yes/no) This will finetune your insert size automatically (Default: yes)                               
Use Quality Scores   = It will take in account the quality scores, only use this when reads have low quality, like with the    
                       300 bp reads of Illumina (yes/no)
Output path          = You can change the directory where all the output files will be stored.

Description of Software:

Github: https://github.com/ndierckx/NOVOPlasty “NOVOPlasty is a de novo assembler and heteroplasmy/variance caller for short circular genomes.”

Abstract from the main NOVOplasty article: “The evolution in next-generation sequencing (NGS) technology has led to the development of many different assembly algorithms, but few of them focus on assembling the organelle genomes. These genomes are used in phylogenetic studies, food identification and are the most deposited eukaryotic genomes in GenBank. Producing organelle genome assembly from whole genome sequencing (WGS) data would be the most accurate and least laborious approach, but a tool specifically designed for this task is lacking. We developed a seed-and-extend algorithm that assembles organelle genomes from whole genome sequencing (WGS) data, starting from a related or distant single seed sequence. The algorithm has been tested on several new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa) whole genome Illumina data sets where it outperforms known assemblers in assembly accuracy and coverage. In our benchmark, NOVOPlasty assembled all tested circular genomes in less than 30 min with a maximum memory requirement of 16 GB and an accuracy over 99.99%. In conclusion, NOVOPlasty is the sole de novo assembler that provides a fast and straightforward extraction of the extranuclear genomes from WGS data in one circular high quality contig.”

Nicolas Dierckxsens, Patrick Mardulyn, Guillaume Smits, NOVOPlasty: de novo assembly of organelle genomes from whole genome data, Nucleic Acids Research, Volume 45, Issue 4, 28 February 2017, Page e18, https://doi.org/10.1093/nar/gkw955



ChloroExtractor

Date: 11/13/2021

Verification of installation:

Since there is a GNU core utility also called ptx, I am unable to install this system-wide. The path to the executable used to run chloroExtractor is:

(base) ubuntu@ip-172-31-52-167:~$ ./GBI-software/chloroExtractor/bin/ptx

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ ./GBI-software/chloroExtractor/bin/ptx --version
v1.0.9

Process for Installation:

There are quite a few dependencies for this software. These include:

  • Jellyfish (2.2.4)
  • Spades (v3.10.1)
  • bowtie2 (2.2.6)
  • NCBI-Blast+ (2.2.31+)
  • Samtools (0.1.19-96b5f2294a)
  • Bedtools (v2.25.0)
  • GNU R (3.2.3)
  • Ghostscript (9.18)
  • Python (2.7.12)
  • Perl (v5.22.1)

Most of these have their own dedicated page / respective README on the AMI (or in the folder with all of these software-validation documents for the GBI AMI). For the ones not included in those documents:

Ghostscript (https://ghostscript.com/doc/9.55.0/Readme.htm):

(base) ubuntu@ip-172-31-52-167:~/GBI-software/chloroExtractor$ sudo apt install ghostscript

The perl packages:

  • Moose (2.1604)
  • Log::Log4Perl (1.44)
  • Term::ProgressBar (2.17)
  • Graph (0.96)
  • IPC::Run (0.94)
  • File::Which (1.19)

These can be installed with:

(base) ubuntu@ip-172-31-52-167:~$ sudo cpanm Moose
(base) ubuntu@ip-172-31-52-167:~$ sudo cpanm Log::Log4perl
(base) ubuntu@ip-172-31-52-167:~$ sudo cpanm Term::ProgressBar
(base) ubuntu@ip-172-31-52-167:~$ sudo cpanm Graph
(base) ubuntu@ip-172-31-52-167:~$ sudo cpanm IPC::Run
(base) ubuntu@ip-172-31-52-167:~$ sudo cpanm File::Which

Then, clone the git repository:

git clone --recursive https://github.com/chloroExtractorTeam/chloroExtractor

For some reason, when I installed several of the perl packages, they were installed into the root perl folder (/usr/local/share/perl/5.30.0). I added this folder to the PERL5LIB environment variable and everything seems to work fine. So if you get errors about perl modules not being installed or found, this would be something to check.
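One way to make modules in that folder findable (a sketch; the exact perl version directory may differ on your system) is to prepend it to PERL5LIB:

```shell
# Prepend the system perl module folder to PERL5LIB for the current shell.
# The version directory (5.30.0) is whatever your system perl uses.
export PERL5LIB="/usr/local/share/perl/5.30.0${PERL5LIB:+:$PERL5LIB}"
echo "$PERL5LIB"
```

Adding that export line to ~/.bashrc makes the setting persist across sessions.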

Software Example of Use:

The command to use ChloroExtractor is called ptx. As I mention in the verification of installation, there is a GNU core utility also called ptx, so I am unable to install this executable system wide. The path to the executable command to use Chloro Extractor is:

(base) ubuntu@ip-172-31-52-167:~$ ./GBI-software/chloroExtractor/bin/ptx --help

If, in the future, you think it would be helpful to have it system wide, you can use the following command:

sudo ln -s "/home/ubuntu/GBI-software/chloroExtractor/bin/ptx" ~/../../usr/local/bin/ptx-gbi

This will symlink the ptx command into a directory on the PATH under the name ptx-gbi. After that, chloroExtractor is executable system-wide with the command ptx-gbi <OPTIONS>.

This software requires a configuration file. To create a custom config file use

(base) ubuntu@ip-172-31-52-167:~/GBI-software/chloroExtractor/bin$ ./ptx --create-config

This config file is rather long so it may take some time to parse through and decide what you want to change, but if you are going to use this software, it is likely important that you do so.

An example from the github of a command passing in the configuration file ownptx.cfg and the fastq files FQ_1 and FQ_2 looks like:

$ ./ptx -c ownptx.cfg -1 FQ_1 -2 FQ_2

Where the -c option signifies that the next file will be the config file, and -1 and -2 signify that the files following them are your fastq_1 and fastq_2 files, respectively.

Description of Software:

https://github.com/chloroExtractorTeam/chloroExtractor

From the github: “The chloroExtractor is a perl based program which provides a pipeline for DNA extraction of chloroplast DNA from whole genome plant data. Too huge amounts of chloroplast DNA can cast problems for the assembly of whole genome data. One solution for this problem can be a core extraction before sequencing, but this can be expensive. The chloroExtractor takes your whole genome data and extracts the chloroplast DNA, so you can have your different DNA separated easily by the chloroExractor. Furthermore the chloroExtractor takes the chloroplast DNA and tries to assemble it. This is possible because of the preserved nature of the chloroplasts primary and secondary structure. Through k-mer filtering the k-mers which contain the chloroplast sequences get extracted and can then be used to assemble the chloroplast on a guided assembly with several other chloroplasts."



GetOrganelle

Date: 11/17/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which get_organelle_from_reads.py
/home/ubuntu/dependency-software/miniconda3/bin/get_organelle_from_reads.py

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ get_organelle_from_reads.py --version
GetOrganelle v1.7.5

Process for Installation:

(base) ubuntu@ip-172-31-52-167:~$ conda install -c bioconda getorganelle
(base) ubuntu@ip-172-31-52-167:~$ get_organelle_config.py --add embplant_pt,embplant_mt

Software Example of Use:

For a full example of use, check out this wiki page on the github: https://github.com/Kinggerm/GetOrganelle/wiki/Example-1

Also, here are two example commands from the README that show the basic syntax of the command:

###  Embryophyta plant plastome, 2*(1G raw data, 150 bp) reads
$ get_organelle_from_reads.py -1 sample_1.fq -2 sample_2.fq -s cp_seed.fasta -o plastome_output  -R 15 -k 21,45,65,85,105 -F embplant_pt

And

###  Embryophyta plant mitogenome
$ get_organelle_from_reads.py -1 sample_1.fq -2 sample_2.fq -s mt_seed.fasta -o mitogenome_output  -R 30 -k 21,45,65,85,105 -F embplant_mt

Description of Software:

The github can be found here: https://github.com/Kinggerm/GetOrganelle

It appears to still be actively updated; the last release was on May 13, 2021. The issues page is also fairly active, so it shouldn’t be a problem to ask a question.

From the paper Jin, JJ., Yu, WB., Yang, JB. et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol 21, 241 (2020). https://doi.org/10.1186/s13059-020-02154-5 :

“GetOrganelle is a state-of-the-art toolkit to accurately assemble organelle genomes from whole genome sequencing data. It recruits organelle-associated reads using a modified “baiting and iterative mapping” approach, conducts de novo assembly, filters and disentangles the assembly graph, and produces all possible configurations of circular organelle genomes. For 50 published plant datasets, we are able to reassemble the circular plastomes from 47 datasets using GetOrganelle. GetOrganelle assemblies are more accurate than published and/or NOVOPlasty-reassembled plastomes as assessed by mapping. We also assemble complete mitochondrial genomes using GetOrganelle.”



IOGA

Date: 11/17/2021

Verification of installation:

This software requires Python 2. In order to use this version of Python, I set up a virtual environment, explained in the Process for Installation below. This means that in order to use IOGA, you need to go into the following folder:

/home/ubuntu/GBI-software/IOGA

And use the following command:

$ conda activate IOGA-env-py2

This will load the environment with all the dependencies required for IOGA. Since this is the case, I am not making this software available system wide. Instead, in order to use it, you will need to go into the IOGA directory /home/ubuntu/GBI-software/IOGA and use it from there.

Version of the software on the AMI: No release date or version is listed in any of the documentation for this software. The project is also no longer maintained, so the copy currently installed is the latest release, which appears to have been updated in 2020 (though this looks like a small update and the software itself is likely from 2017).

Process for Installation:

This requires python2 so I am setting up a virtual environment called IOGA-env-py2 that will have all the dependencies required within it to run IOGA.

(base) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda create --name IOGA-env-py2 python=2.7

To activate this environment use:

$ conda activate IOGA-env-py2

Then within the environment I install the dependencies:

(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install biopython
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install bbmap
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install soapdenovo2
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install seqtk
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install spades
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install samtools # already installed from above dependency but making sure to update if necessary
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ sudo apt install ale
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ conda install picard
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ pip install wget
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ pip install matplotlib
(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ sudo apt-get install -y libncurses5-dev

And then set up IOGA using:

(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ python setup_IOGA.py 

Software Example of Use:

As previously mentioned, in order to use this software you will need to use the executable within the IOGA directory at:

/home/ubuntu/GBI-software/IOGA

Once here and after you have activated the conda environment IOGA-env-py2 as shown how to do above, you can execute the IOGA command by using:

(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ python IOGA.py

For example, to get more information using the help option do:

(IOGA-env-py2) ubuntu@ip-172-31-52-167:~/GBI-software/IOGA$ python IOGA.py -h

This shows that the basic command syntax looks like:

usage: IOGA.py [-h] [--reference REFERENCE] [--name NAME] [--forward FORWARD]
               [--reverse REVERSE] [--insertsize INSERTSIZE]
               [--threads THREADS] [--maxrounds MAXROUNDS] [--verbose]

Where options within [ ] brackets represent optional files or methods.

Description of Software:

The Github: https://github.com/holmrenser/IOGA

From the Github: IOGA was used to assemble chloroplast genomes for a range of herbarium samples, and was published in The Biological Journal of the Linnean Society on 07/08/2015.

From the paper (https://onlinelibrary.wiley.com/doi/abs/10.1111/bij.12642, Herbarium genomics: plastome sequence assembly from a range of herbarium specimens using an Iterative Organelle Genome Assembly pipeline, Biol. J. Linnean Soc.): Herbarium genomics is proving promising as next-generation sequencing approaches are well suited to deal with the usually fragmented nature of archival DNA. We show that routine assembly of partial plastome sequences from herbarium specimens is feasible, from total DNA extracts and with specimens up to 146 years old. We use genome skimming and an automated assembly pipeline, Iterative Organelle Genome Assembly, that assembles paired-end reads into a series of candidate assemblies, the best one of which is selected based on likelihood estimation.

Once again it is important to note that this software is no longer being updated. If you find it is useful, and want it to include further functionality, that will only be possible via personal effort to program that functionality in.



Chloroplast Assembly Protocol

Date: 12/01/2021

PATH of installation: ~/GBI-software/Chloroplast_Assembly_Protocol_CAP/chloroplast_assembly_protocol

Version of the software on the AMI:

No info, using whatever release was available 12/01/2021

Process for Installation:

DUK

Download from https://sourceforge.net/projects/duk/files/README/download

SCP to the ec2 instance from my computer
Flints-MacBook-Pro:Desktop flintmitchell$ scp -i ~/Desktop/GBI/AWS/AWS_keypairs/r5large-gbi-keypair.pem ../Downloads/duk.tar ubuntu@44.226.207.191:/home/ubuntu/GBI-software/Chloroplast_Assembly_Protocol_CAP/DUK/
then
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/DUK$ cd duk
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/DUK/duk$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/DUK/duk$ ./duk --help
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/DUK/duk$ sudo ln -s "$(pwd)/duk" ~/../../usr/local/bin/duk
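The symlink step above puts duk on the PATH; note that ~/../../usr/local/bin is just a roundabout spelling of /usr/local/bin. Here is a minimal sketch of the same pattern, using a stand-in script and a temporary bin directory so it runs without sudo (the real step is sudo ln -s "$(pwd)/duk" /usr/local/bin/duk):

```shell
# Stand-in demonstration of symlinking a built binary into a directory on PATH.
mkdir -p /tmp/demo-src /tmp/demo-bin
printf '#!/bin/sh\necho duk-ok\n' > /tmp/demo-src/duk   # stand-in for the compiled binary
chmod +x /tmp/demo-src/duk
ln -sf /tmp/demo-src/duk /tmp/demo-bin/duk              # symlink into a bin directory
PATH="/tmp/demo-bin:$PATH" duk                          # prints: duk-ok
```

The same pattern is used below for musket, velvetg/velveth, and tardis, just with /usr/local/bin as the target directory.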

Musket

Download from http://musket.sourceforge.net/homepage.htm

SCP to the ec2 instance from my computer
Flints-MacBook-Pro:Desktop flintmitchell$ scp -i ~/Desktop/GBI/AWS/AWS_keypairs/r5large-gbi-keypair.pem  musket-1.1.tar.gz ubuntu@44.226.207.191:/home/ubuntu/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket/
then
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket$ tar xvf musket-1.1.tar.gz 
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket$ rm musket-1.1.tar.gz 
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket$ cd musket-1.1/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket/musket-1.1$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket/musket-1.1$ ./musket --version
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/Musket/musket-1.1$ sudo ln -s "$(pwd)/musket" ~/../../usr/local/bin/musket

Velvet

(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/velvet$ git clone https://github.com/dzerbino/velvet.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/velvet$ cd velvet/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/velvet/velvet$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/velvet/velvet$ sudo ln -s "$(pwd)/velvetg" ~/../../usr/local/bin/velvetg
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/velvet/velvet$ sudo ln -s "$(pwd)/velveth" ~/../../usr/local/bin/velveth

SSPACE

(base) ubuntu@ip-172-31-52-167:~$ sudo apt install sspace

GapFiller (FYI: no longer supported by the original company, BaseClear)

(base) ubuntu@ip-172-31-52-167:~$ conda install gapfiller

Seqtk

(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP$ git clone https://github.com/lh3/seqtk.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP$ cd seqtk/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/seqtk$ make
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/seqtk$ ./seqtk

CAP itself:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP$ git clone https://github.com/eead-csic-compbio/chloroplast_assembly_protocol.git
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP$ cd chloroplast_assembly_protocol/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/chloroplast_assembly_protocol$ cat README.txt
(base) ubuntu@ip-172-31-52-167:~/GBI-software/Chloroplast_Assembly_Protocol_CAP/chloroplast_assembly_protocol$ perl ./install.pl

Software Example of Use:

There is a well laid out example for using this software with the test data provided (within the test directory inside of the main directory chloroplast_assembly_protocol).

It appears that there are three main commands:

./0_get_cp_reads.pl INPUT_DIR WORKING_DIR FASTA_CP_GENOMES
./1_cleanreads.pl -folder WORKING_DIR [-ref FASTA_REF_GENOME] [-skip] [-regex REGEX]
./2_assemble_reads.pl WORKING_DIR ASSEMBLY_NAME [-ref FASTA_REF_GENOME] [-threads FLOAT] [-sample INTEGER] [-kmer INTEGER] [-outdir OUTPUT_DIR]

To understand how to use these commands, read the README.txt file:

cat README.txt

Description of Software:

CAP Github: https://github.com/eead-csic-compbio/chloroplast_assembly_protocol

This is the Chloroplast Assembly Protocol; from the Github bio: “A set of scripts for the assembly of chloroplast genomes out of whole-genome sequencing reads.” In order to use this you’ll have to read the README available on the front page of the Github or within the main directory chloroplast_assembly_protocol.



Org.Asm

Date: 11/19/2021

Verification of installation:

(Remember, this is from within the orgasm environment, activated by running the ./orgasm command within the directory ~/GBI-software/org-asm/get_orgasm.)

(base) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm/get_orgasm/ORG.asm-1.0.3/bin$ which oa
/home/ubuntu/GBI-software/org-asm/get_orgasm/ORG.asm-1.0.3/export/bin/oa

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm/get_orgasm/ORG.asm-1.0.3/bin$ oa --version
The Organelle Assembler - Version b'2.2'

Process for Installation:

https://docs.metabarcoding.org/asm/install.html

This program only runs on Python 3.7, so first I’m going to create a conda environment for this software:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ conda create -n organellegassembler python=3.7

This means that in order to run this software, you must activate the specific conda environment using:

$ conda activate organellegassembler

Then,

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ conda activate organellegassembler
(organellegassembler) ubuntu@ip-172-31-52-167:~/GBI-software$ sudo apt update && sudo apt -y upgrade
(organellegassembler) ubuntu@ip-172-31-52-167:~/GBI-software$ sudo apt-get install python-dev
(organellegassembler) ubuntu@ip-172-31-52-167:~/GBI-software$ git clone https://git.metabarcoding.org/org-asm/org-asm
(organellegassembler) ubuntu@ip-172-31-52-167:~/GBI-software$ cd org-asm/
(organellegassembler) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm$ python get-orgasm.py 

The Org.Asm program requires you to activate another environment in which the software is available. To do this, from within the get_orgasm directory, use the command:

(organellegassembler) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm/get_orgasm$ ./orgasm

You can now use ORG.asm

Now you will see that the environment indicator changes to (base), and the Org.Asm software is available:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm/get_orgasm$ 

Software Example of Use:

https://docs.metabarcoding.org/asm/index.html

Once you have entered the software environment by following the above commands, reprinted here:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ cd org-asm/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm$ cd get_orgasm/
(base) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm/get_orgasm$ ./orgasm 

Then you can use the command as follows:

oa [-h] [--version] [--log ORGASM:LOG] [--no-progress]

For example, to get further information about the options for the subcommand “index” use

(base) ubuntu@ip-172-31-52-167:~/GBI-software/org-asm$ oa index --help

Description of Software:

A de novo assembler dedicated to organelle genome assembling.

Manual: https://docs.metabarcoding.org/asm/index.html Look at the principles here in order to find further information about the theory behind the software: https://docs.metabarcoding.org/asm/oa.html#oa



Unicycler

Date: 12/05/2021

Verification of installation:

(base) ubuntu@ip-172-31-52-167:~$ which unicycler
/home/ubuntu/dependency-software/miniconda3/bin/unicycler

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ unicycler --version
Unicycler v0.4.9

Process for Installation:

$ git clone https://github.com/rrwick/Unicycler.git
$ cd Unicycler
Unicycler/$ python3 setup.py install

Software Example of Use:

Command syntax from README:

unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2]
                 [-s UNPAIRED] [-l LONG] -o OUT [--verbosity VERBOSITY]
                 [--min_fasta_length MIN_FASTA_LENGTH] [--keep KEEP]
                 [-t THREADS] [--mode {conservative,normal,bold}]
                 [--linear_seqs LINEAR_SEQS] [--vcf]

For example, to get the help page and more information on all the above options use:

(base) ubuntu@ip-172-31-52-167:~/GBI-software/Unicycler$ unicycler --help

Description of Software:

From the github: “Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser. It can also assemble long-read-only sets (PacBio or Nanopore) where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.”

Github: https://github.com/rrwick/Unicycler



TransABySS (and transabyss-merge)

Date: 01/15/2022

In order to use this software, you must use the environment created for it:

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ conda activate transabyss

Because the software is only installed within this environment, you will have to point to it in order to use it:

./GBI-software/transabyss/transabyss [OPTIONS]

Once you are done, you can leave the environment with:

$ conda deactivate

Location of installation:

(transabyss) ubuntu@ip-172-31-52-167:~/GBI-software/transabyss$ pwd
/home/ubuntu/GBI-software/transabyss
(transabyss) ubuntu@ip-172-31-52-167:~/GBI-software/transabyss$ ls
LICENSE  README.md  TUTORIAL.md  bin  prereqs  sample_dataset  transabyss  transabyss-merge  utilities

Version of the software on the AMI:

(transabyss) ubuntu@ip-172-31-52-167:~/GBI-software/transabyss$ ./transabyss --version
2.0.1

Process for Installation: This software requires python 3.6.0, so I am creating a conda environment with that version of python and installing the dependencies within it:

Create environment and enter into it

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ conda create -n transabyss python=3.6.0
(base) ubuntu@ip-172-31-52-167:~/GBI-software$ conda activate transabyss

blat

(base) ubuntu@ip-172-31-52-167:~/dependency-software$ wget --quiet --no-check-certificate -N -P ./bin http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/blat
(base) ubuntu@ip-172-31-52-167:~/dependency-software$ mv bin blat 
(base) ubuntu@ip-172-31-52-167:~/dependency-software$ cd blat
(base) ubuntu@ip-172-31-52-167:~/dependency-software/blat$ chmod +x blat 
(base) ubuntu@ip-172-31-52-167:~/dependency-software/blat$ sudo ln -s "/home/ubuntu/dependency-software/blat/blat" ~/../../usr/local/bin/blat

Python-igraph

pip install python-igraph

Then transabyss

(base) ubuntu@ip-172-31-52-167:~/GBI-software$ conda install -c bioconda transabyss

Software Example of Use: For information regarding using transabyss, check out the tutorial here: https://github.com/bcgsc/transabyss/blob/master/TUTORIAL.md

A basic command for TransABySS looks like:

transabyss [OPTIONS]

For example to get help regarding all the options transabyss has, use:

transabyss -h

Description of Software:

Find more information about this at the github page: https://github.com/bcgsc/transabyss

There is also a discussion group where emails are answered that looks helpful: https://groups.google.com/g/trans-abyss



Cufflinks

Date: 01/15/2022

Location of installation:

(base) ubuntu@ip-172-31-52-167:~$ which cufflinks
/home/ubuntu/GBI-software/cufflinks/cufflinks

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ cufflinks --version
cufflinks: unrecognized option '--version'
cufflinks v2.2.1

Process for Installation: Pretty simple: download the precompiled tar file from http://cole-trapnell-lab.github.io/cufflinks/install/ and then make sure the location of cufflinks, cuffdiff, and cuffcompare (the root directory of cufflinks) is accessible via your PATH.

Software Example of Use: For information regarding using cufflinks, check out the documentation on the site linked above.

The general command usage is:

$ cufflinks [options] <hits.sam>

For information regarding all the options, use:

$ cufflinks --help

Description of Software:

“Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.” Find more information about this at the github page: https://github.com/cole-trapnell-lab/cufflinks

You can read more in their Nature Protocols paper: https://www.nature.com/articles/nprot.2012.016



SOAPdenovo-Trans

Date: 01/15/2022

Location of installation:

(base) ubuntu@ip-172-31-52-167:~$ which SOAPdenovo-Trans-127mer
/home/ubuntu/dependency-software/miniconda3/bin/SOAPdenovo-Trans-127mer

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ SOAPdenovo-Trans-127mer --version

The version 1.03: released on July 19th, 2013
With bug fixes by Chris Boursnell January 23rd, 2014

Process for Installation:

Install using bioconda:

$ conda install -c bioconda soapdenovo-trans

Software Example of Use: There are two executables, for use with different maximum k-mer values. This is similar to the regular SOAPdenovo de novo assembler:

(base) ubuntu@ip-172-31-52-167:~$ SOAPdenovo-Trans-
SOAPdenovo-Trans-127mer  SOAPdenovo-Trans-31mer   

These just allocate different amounts of memory and therefore probably have different run speeds.

The general command syntax is:

SOAPdenovo-Trans-127mer [OPTIONS]

For example, to get help regarding the options, use:

(base) ubuntu@ip-172-31-52-167:~$ SOAPdenovo-Trans-127mer --help

Description of Software:

“SOAPdenovo-Trans is a de novo transcriptome assembler basing on the SOAPdenovo framework, adapt to alternative splicing and different expression level among transcripts.The assembler provides a more accurate, complete and faster way to construct the full-length transcript sets.” Read more in their github: https://github.com/aquaskyline/SOAPdenovo-Trans

Or in their paper: https://pubmed.ncbi.nlm.nih.gov/24532719/



SPAdes and rnaSPAdes

Date: 01/15/2022

The SPAdes repository contains both SPAdes and rnaSPAdes.

Location of installation:

(base) ubuntu@ip-172-31-52-167:~$ which spades.py && which rnaspades.py 
/home/ubuntu/dependency-software/miniconda3/bin/spades.py
/home/ubuntu/dependency-software/miniconda3/bin/rnaspades.py

Version of the software on the AMI:

(base) ubuntu@ip-172-31-52-167:~$ spades.py --version && rnaspades.py --version
SPAdes v3.13.0
SPAdes v3.13.0 [rnaSPAdes mode]

Process for Installation:

Download the binaries:

$ wget http://cab.spbu.ru/files/release3.15.3/SPAdes-3.15.3-Linux.tar.gz
$ tar -xzf SPAdes-3.15.3-Linux.tar.gz
$ cd SPAdes-3.15.3-Linux/bin/

Then add SPAdes and its affiliated softwares to the PATH by exporting this bin directory’s path (pwd) onto the PATH.
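The PATH step above can be sketched as follows; the directory name is assumed to match the 3.15.3 tarball extracted above, so adjust it if yours differs:

```shell
# Prepend the SPAdes bin directory to PATH for the current shell
# (install location assumed from the tarball above):
SPADES_BIN="$HOME/SPAdes-3.15.3-Linux/bin"
export PATH="$SPADES_BIN:$PATH"

# Persist the change for future shells:
echo "export PATH=\"$SPADES_BIN:\$PATH\"" >> ~/.bashrc
```

After this, spades.py and rnaspades.py can be invoked from any directory.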

Software Example of Use: The syntax for both of these softwares is similar.

Use either

spades.py [OPTIONS]

Or

rnaspades.py [OPTIONS]

For example to get more information about the options available with rnaspades, use:

(base) ubuntu@ip-172-31-52-167:~$ rnaspades.py --help

Description of Software:

For more information regarding SPAdes, check out the github at: https://github.com/ablab/spades

Or their most recent paper: https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102



Transposable element softwares

  • McClintock
    • TrimGalore
    • NGS TE Mapper (1 and 2)
    • RelocaTE (1 and 2)
    • TEMP (1 and 2)
    • RetroSeq
    • PoPoolation TE (1 and 2)
    • TE-Locate
    • TEFLON
  • RetroSeq
  • Tangram
  • TARDIS
  • MELT

Some papers to look through reviewing softwares for analyzing transposable elements:

  • Vendrell-Mir, P., Barteri, F., Merenciano, M., González, J., Casacuberta, J. M., & Castanera, R. (2019). A benchmark of transposon insertion detection tools using real data. Mobile DNA, 10(1), 1–19. https://doi.org/10.1186/s13100-019-0197-9
  • Lanciano, S., & Cristofari, G. (2020). Measuring and interpreting transposable element expression. Nature Reviews Genetics, 21(12), 721–736. https://doi.org/10.1038/s41576-020-0251-y
  • Ewing, A. D. (2015). Transposable element detection from whole genome sequence data. Mobile DNA, 6(1). https://doi.org/10.1186/s13100-015-0055-3
  • Chu, C., Borges-Monroy, R., Viswanadham, V. V., Lee, S., Li, H., Lee, E. A., & Park, P. J. (2021). Comprehensive identification of transposable element insertions using multiple sequencing technologies. Nature Communications, 12(1). https://doi.org/10.1038/s41467-021-24041-8

McClintock

Date: 02/14/2022

Before using this software, you must use the following command to enter the conda environment with all of McClintock’s dependencies:

$ conda activate mcclintock

Verification of installation:

$ which mcclintock.py 
/home/ubuntu/GBI-software/mcclintock-2.0.0/mcclintock.py

Version of the software on the AMI:

McClintock v.2.0.0 from September 17, 2021

Process for Installation:

$ # INSTALL (Requires Conda)
$ git clone git@github.com:bergmanlab/mcclintock.git
$ cd mcclintock
$ conda env create -f install/envs/mcclintock.yml --name mcclintock
$ conda activate mcclintock
$ python3 mcclintock.py --install
$ python3 test/download_test_data.py

For me, the env create command was not working. Instead, I built a conda environment with the dependency softwares:

  - mamba
  - python=3.7.8
  - snakemake=5.32.0
  - biopython=1.77
  - git=2.23.0

After doing this I again proceeded to follow the instructions as listed above.

Software Example of Use:

Before using the command you must activate the mcclintock conda environment as mentioned above. To do this, use:

$ conda activate mcclintock

Now, once in the conda environment you can use the mcclintock command like so:

(mcclintock) ubuntu@ip-172-31-52-167:~$ mcclintock.py --help
usage: McClintock [-h] -r REFERENCE -c CONSENSUS -1 FIRST [-2 SECOND]
                  [-p PROC] [-o OUT] [-m METHODS] [-g LOCATIONS] [-t TAXONOMY]
                  [-s COVERAGE_FASTA] [-T] [-a AUGMENT]
                  [--sample_name SAMPLE_NAME] [--resume] [--install] [--debug]
                  [--slow] [--make_annotations] [-k KEEP_INTERMEDIATE]
                  [--config CONFIG]

Meta-pipeline to identify transposable element insertions using next
generation sequencing data

Within McClintock, there are a variety of other softwares available related to Transposable Element detection and analysis. These other softwares can be used with the -m option. For example, to use the RetroSeq component method, you would add -m retroseq into the mcclintock command.
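A sketch of composing such a run is below; every input file name here is a placeholder, not a file shipped with McClintock, so substitute your own reference, TE consensus, and reads:

```shell
# Compose a McClintock command with the RetroSeq component method (-m retroseq).
# All input file names are hypothetical placeholders.
MCC="$HOME/GBI-software/mcclintock-2.0.0/mcclintock.py"
CMD="python3 $MCC -r ref.fasta -c te_consensus.fasta \
-1 reads_1.fastq.gz -2 reads_2.fastq.gz -m retroseq -o out_dir"
echo "$CMD"   # review the command, then run it inside the mcclintock conda env
```

Multiple methods can be given to -m as a comma-separated list per the usage above.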

Description of Software:

Github link: https://github.com/bergmanlab/mcclintock

From the Github: “Many methods have been developed to detect transposable element (TE) insertions from whole genome shotgun next-generation sequencing (NGS) data, each of which has different dependencies, run interfaces, and output formats. Here, we have developed a meta-pipeline to install and run multiple methods for detecting TE insertions in NGS data, which generates output in the UCSC Browser extensible data (BED) format. A detailed description of the original McClintock pipeline and evaluation of the original six McClintock component methods on the yeast genome can be found in Nelson, Linheiro and Bergman (2017) G3 7:2763-2778. The complete pipeline requires a fasta reference genome, a fasta consensus set of TE sequences present in the organism and fastq paired-end sequencing reads. Optionally if a detailed annotation of TE sequences in the reference genome has been performed, a GFF file with the locations of reference genome TE annotations and a tab delimited taxonomy file linking individual insertions to the TE family they belong to can be supplied (an example of this file is included in the test directory as sac_cer_te_families.tsv). If only single-end fastq sequencing data are available, then this can be supplied as option -1, however only ngs_te_mapper and RelocaTE will run as these are the only methods that handle single-ended data.”

The full list of available methods within McClintock for the -m/--methods option is given in the bullet list at the top of this Transposable element softwares section.

For further information regarding these methods, check out both the mcclintock Github and each method’s own page!


RetroSeq

Date: 02/14/2022

Verification of installation:

$ which retroseq.pl 
/home/ubuntu/GBI-software/RetroSeq/bin/retroseq.pl

Version of the software on the AMI:

RetroSeq Version: 1.5

Process for Installation:

Dependency: Exonerate

Download exonerate and uncompress using:

$  wget http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.2.0-x86_64.tar.gz
$ tar -zxvf exonerate-2.2.0-x86_64.tar.gz
$ rm exonerate-2.2.0-x86_64.tar.gz

Then add the path of its /bin folder to your PATH in your .bashrc file.
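This .bashrc step can be sketched as below; the install location is assumed to be where the tarball above was extracted, so adjust if you unpacked it elsewhere:

```shell
# Append the exonerate bin directory to PATH via ~/.bashrc
# (install location assumed from the tarball above):
EXONERATE_BIN="$HOME/GBI-software/exonerate-2.2.0-x86_64/bin"
echo "export PATH=\"$EXONERATE_BIN:\$PATH\"" >> ~/.bashrc
# Takes effect in new shells, or immediately after: source ~/.bashrc
```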

For RetroSeq, download from github and add the bin directory to your path:

$ git clone https://github.com/tk2/RetroSeq.git
$ cd RetroSeq/bin
$ pwd

Software Example of Use:

The command for RetroSeq is:

$ retroseq.pl [OPTIONS]

There are two options for this software:

            -discover       Takes a BAM and a set of reference TE (fasta) and calls candidate supporting read pairs (BED output)
            -call           Takes multiple output of discovery stage and a BAM and outputs a VCF of TE calls            

Description of Software:

The Github is: https://github.com/tk2/RetroSeq

According to the Github: “RetroSeq is a tool for discovery and genotyping of transposable element variants (TEVs) (also known as mobile element insertions) from next-gen sequencing reads aligned to a reference genome in BAM format. The goal is to call TEVs that are not present in the reference genome but present in the sample that has been sequenced. It should be noted that RetroSeq can be used to locate any class of viral insertion in any species where whole-genome sequencing data with a suitable reference genome is available.”


Tangram

Date: 02/14/2022

Verification of installation:

$ which tangram_filter.pl 
/home/ubuntu/GBI-software/Tangram/bin/tangram_filter.pl

Version of the software on the AMI: Tangram 0.3.1

Process for Installation:

$ git clone git://github.com/jiantao/Tangram.git
$ cd Tangram/src
$ make

Then export the location of the bin directory within Tangram to the PATH on your .bashrc file.

Software Example of Use:

The command for the software looks like:

./tangram_filter.pl [options] --vcf <input_vcf> --msk <mask_input_list>

And the mandatory arguments are

      --vcf  FILE   input vcf file for filtering
      --msk  FILE   input list of mask files with window size information

Alongside this command there are a series of submodules, which can be found on the Github (link below)

Description of Software:

Paper discussing it: https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-795

Github: https://github.com/jiantao/Tangram

From the paper above: “Here we report Tangram, a computationally efficient MEI detection program that integrates read-pair (RP) and split-read (SR) mapping signals to detect MEI events. By utilizing SR mapping in its primary detection module, a feature unique to this software, Tangram is able to pinpoint MEI breakpoints with single-nucleotide precision. To understand the role of MEI events in disease, it is essential to produce accurate individual genotypes in clinical samples. Tangram is able to determine sample genotypes with very high accuracy. Using simulations and experimental datasets, we demonstrate that Tangram has superior sensitivity, specificity, breakpoint resolution and genotyping accuracy, when compared to other, recently developed MEI detection methods.”


TARDIS (Variation Hunter)

Date: 02/14/2022

Verification of installation:

$ which tardis
/usr/local/bin/tardis

Version of the software on the AMI:

Version 1.0.8
	Last update: September 10, 2020, build date: Tue Feb  1 01:08:42 UTC 2022

Process for Installation: Dependencies:

$ sudo apt-get install zlib1g-dev liblzma-dev libbz2-dev

Install:

$ git clone https://github.com/BilkentCompGen/tardis.git --recursive
$ cd tardis/
$ make libs
$ make
$ sudo ln -s "$(pwd)/tardis" ~/../../usr/local/bin/tardis

Software Example of Use: To use the program, use the command:

$ tardis [OPTIONS]

For example to get the list of all the options, use

$ tardis --help

The basic parameters here are:

    --bamlist   [bamlist file] : A text file that lists input BAM files one file per line.
    --input/-i [BAM files]     : Input files in sorted and indexed BAM format. You can pass multiple BAMs using multiple --input parameters.
    --out   [output prefix]    : Prefix for the output file names.
    --ref   [reference genome] : Reference genome in FASTA format.
    --sonic [sonic file]       : SONIC file that contains assembly annotations.
    --hist-only                : Generate fragment size histograms only, then quit.

Description of Software:

From the Github: “Tardis is a toolkit for automated and rapid discovery of structural variants.”

Paper discussing this software: https://linkinghub.elsevier.com/retrieve/pii/S1046202317300762

Github: https://github.com/BilkentCompGen/tardis


MELT

Date: 02/14/2022

Verification of installation: ~/GBI-software/MELTv2.2.2/MELT.jar

Version of the software on the AMI: MELTv2.2.2

Process for Installation: Download from: https://melt.igs.umaryland.edu/downloads.php

Copy to your EC2 instance:

$ scp -i ~/Desktop/GBI/AWS/AWS_keypairs/r5large-gbi-keypair.pem ~/Desktop/MELTv2.2.2.tar.gz ubuntu@34.223.40.201:/home/ubuntu/GBI-software

Software Example of Use: To use the program:

$ java -jar GBI-software/MELTv2.2.2/MELT.jar --help

There is a help forum for this software found at https://groups.google.com/g/melt-help

Description of Software:

This software is developed by the University of Maryland School of Medicine. For more information regarding this software check out their website: https://melt.igs.umaryland.edu/

From this website: “The Mobile Element Locator Tool (MELT) is a software package, written in Java, that discovers, annotates, and genotypes non-reference Mobile Element Insertions (MEIs) in Illumina DNA paired-end whole genome sequencing (WGS) data. MELT was first conceived as a tool to identify non-reference MEIs in large genome sequencing projects, specifically as part of the 1000 Genomes Project, and has been further optimized to run on a wide range of data types. MELT is optimized for performing discovery in a large number of individual sequencing samples using the Sun Grid Engine (SGE). MELT also has two additional workflows: analysis without SGE (for adaptability to other parallel computing platforms) and single genome analysis. MELT is highly scalable for many different types of data, and the different workflows are outlined and detailed in this documentation.”

It is free to use for academic purposes but requires a license for commercial ones. Please cite them in any literature created using their software.
