# Overview

| Tool | Link | Function |
|------|------|----------|
| LSC 2 | http://www.healthcare.uiowa.edu/labs/au/LSC/default.asp | long read error correction |
| Proovread | https://github.com/BioInf-Wuerzburg/proovread | long read error correction |
| LoRDEC | http://www.atgc-montpellier.fr/lordec/ | long read error correction |
| HALC | https://github.com/lanl001/halc | long read error correction |
| PBJelly | https://sourceforge.net/p/pb-jelly/wiki/Home/?#058c | gap closing |
| Canu | http://canu.readthedocs.io/en/latest/index.html | hybrid assembly |
| MaSuRCA | http://www.genome.umd.edu/masurca.html | hybrid assembly |



| Corrector | Number of threads |
|-----------|-------------------|
| LSC 2 | 8 |
| Proovread | 8 |
| HALC | 8 |
| LoRDEC | 8 |




Cluster node configuration:

- Number of CPUs: 40
- CPU frequency: 2.6 GHz
- Available RAM: 256 GB

# Long Read Correction: Hybrid correctors (short and long reads):

## 1. LSC-2
#### Introduction
LSC-2 is a hybrid corrector of long reads. Long reads and short reads are first homopolymer-compressed,
then the short reads are mapped to the long reads with Bowtie2. Finally, the short read
consensus replaces the corresponding long read sequences.
Website: http://www.healthcare.uiowa.edu/labs/au/LSC/default.asp

#### Installation
LSC-2 can be downloaded as pre-compiled binaries. Bowtie2 is required and must be installed separately.
Extract the pre-compiled binaries:

    $ tar zxvf LSC-2.0.tar.gz

#### Input data
LSC-2 takes FASTA or FASTQ files as input.
#### Pipeline
+ The runLSC script divides long read error correction into 5 steps:
    - The long and short read sequences are transformed by homopolymer compression, so that each run of the same nucleotide is replaced by a single nucleotide of that type.
    - The quality of the short reads is checked, because some of them contain too many N bases or are too short.
    - The short reads are aligned against the long reads with Bowtie2.
    - The long reads are then modified according to the short read consensus obtained from this alignment.
    - Once the correction points have been replaced by the corresponding short read consensus, the remaining compressed points are decompressed.
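
The compression and decompression steps can be illustrated with a minimal Python sketch. It is not LSC-2's own code; the function names are hypothetical, and the run lengths are kept only so that the compression can be undone.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse each run of identical nucleotides into a single nucleotide.

    Returns the compressed sequence and the run lengths needed to undo it.
    """
    runs = [(base, len(list(group))) for base, group in groupby(seq)]
    compressed = "".join(base for base, _ in runs)
    lengths = [length for _, length in runs]
    return compressed, lengths


def decompress_homopolymers(compressed, lengths):
    """Rebuild the original sequence from the compressed form and run lengths."""
    return "".join(base * length for base, length in zip(compressed, lengths))


compressed, lengths = compress_homopolymers("AAACCGTTTT")
print(compressed, lengths)
print(decompress_homopolymers(compressed, lengths))
```

Mapping in compressed space makes the alignment insensitive to homopolymer length errors, which is why the run lengths have to be stored separately.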


Run the correction with:

    runLSC.py --long_reads LR.fa --short_reads SR.fa --specific_tempdir temp --output output_dir

- --long_reads : long read file.
- --short_reads : short read file.
- --specific_tempdir : folder containing temporary files (optional).
- --output : final assembly folder.

#### Encountered errors
- Died at /LSC-2.0/bin/../utilities/explode fasta.pl, followed by ValueError: invalid literal for int() with base 10: ''
    - Solution: convert the short reads from FASTQ to FASTA format (AWK command line or the Biopython Bio.SeqIO module).
- [bam header read] EOF marker is absent. The input is probably truncated.
    - Solution: install Bowtie2.
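
The FASTQ to FASTA conversion mentioned in the first solution can be sketched in plain Python. This is an illustration only: the function name is hypothetical, and it assumes the common four-lines-per-record FASTQ layout (no wrapped sequences).

```python
def fastq_to_fasta(fastq_text):
    """Convert FASTQ text (four lines per record) to FASTA text."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta_lines = []
    for start in range(0, len(lines), 4):
        header, sequence = lines[start], lines[start + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        # Keep the read name, drop the '+' separator and the quality string.
        fasta_lines.append(">" + header[1:])
        fasta_lines.append(sequence)
    return "".join(line + "\n" for line in fasta_lines)


print(fastq_to_fasta("@read1\nACGT\n+\nIIII\n"))
```

With Biopython installed, `Bio.SeqIO.convert("SR.fq", "fastq", "SR.fa", "fasta")` performs the same conversion on files; the file names here are placeholders.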
#### Output data
The corrected sequences are written to the "corrected read.fasta" file, while the "full LR.fasta" file
contains the corrected sequences concatenated with the uncorrected terminal sequences. Both files are
located in the final assembly folder.

## 2. Proovread
#### Introduction
Proovread is a hybrid corrector that maps accurate short reads onto long reads and computes a consensus from the alignments.
Website: https://github.com/BioInf-Wuerzburg/proovread
#### Installation
Proovread is available on Linux and requires NCBI Blast-2.2.24+, samtools-1.1+, Perl 5.10.1+
and the Perl modules Log::Log4perl and File::Which.
#### Source code compilation

    $ git clone --recursive https://github.com/BioInf-Wuerzburg/proovread
    $ cd proovread/util/bwa
    $ make

#### Input data
To avoid overloading the processor and the memory, it is wise to split the long-read data
(FASTA or FASTQ) into several files:

    $ SeqChunker -s 20M -o 0%03d pacbio_file

- -s : chunk size
- -o : output file name pattern
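
The idea behind chunking can be sketched in Python as follows. This is not how SeqChunker is implemented: the function name is hypothetical, and the limit is expressed in bases per chunk rather than in file size.

```python
def chunk_records(records, max_bases):
    """Group (header, sequence) records into chunks of at most max_bases bases.

    A record longer than max_bases is placed alone in its own chunk.
    """
    chunk, size = [], 0
    for header, sequence in records:
        # Start a new chunk when adding this record would exceed the limit.
        if chunk and size + len(sequence) > max_bases:
            yield chunk
            chunk, size = [], 0
        chunk.append((header, sequence))
        size += len(sequence)
    if chunk:
        yield chunk


reads = [("a", "AAAA"), ("b", "CC"), ("c", "GGGGG"), ("d", "T")]
for number, chunk in enumerate(chunk_records(reads, max_bases=6), start=1):
    print(number, [header for header, _ in chunk])
```

Records are never split across chunks, so each chunk remains a valid input file for the corrector.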
#### Pipeline
Run the long read error correction with the "proovread" binary, located in the "bin" folder,
on every file created in the previous step:

    $ for i in {0001..000n}; do proovread -l $i -s /Path/to/short_reads --pre pb_$i; done

- n : number of files generated by SeqChunker
- -l : raw noisy long reads
- -s : file containing accurate short reads
- --pre : prefix used to name the output files

It is also possible to add unitigs with the argument "-unitigs".
+ Proovread corrects long reads in 2 steps:
    - Short reads are mapped onto long reads with SHRiMP2, whose scoring is adapted to reflect that insertions are more frequent than deletions and that substitutions are rare events. Bowtie2 and bwa mem are also supported.
    - A consensus sequence is computed from these alignments.
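
The consensus step can be illustrated with a strongly simplified Python sketch. It is not Proovread's algorithm: the function name is hypothetical, alignments are assumed to be gapless placements given as (offset, short read) pairs, and each position is decided by a plain majority vote.

```python
from collections import Counter


def consensus_from_placements(long_read, placements):
    """Correct a long read by majority vote over gapless short read placements.

    placements is a list of (offset, short_read) pairs; positions that no
    short read covers keep the original long read base.
    """
    columns = [Counter() for _ in long_read]
    for offset, short_read in placements:
        for index, base in enumerate(short_read):
            position = offset + index
            if 0 <= position < len(long_read):
                columns[position][base] += 1
    corrected = []
    for original, votes in zip(long_read, columns):
        corrected.append(votes.most_common(1)[0][0] if votes else original)
    return "".join(corrected)


print(consensus_from_placements("ACGTTCGA", [(0, "ACGA"), (0, "ACGA"), (1, "CGAT")]))
```

Real correctors also have to handle insertions and deletions, which dominate long read errors; the sketch only covers substitutions.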
    
#### Output data
The corrected sequences (trimmed and untrimmed reads) are written to the specified output folder.

## 3. LoRDEC
#### Introduction
LoRDEC is a hybrid corrector that uses a de Bruijn graph constructed from short reads to correct
long reads.
Website: http://www.atgc-montpellier.fr/lordec/
#### Installation
LoRDEC is available on Linux and requires CMake 2.6+ and GCC 4.7+.
Download LoRDEC and the GATB library (http://gatb-core.gforge.inria.fr/):

    $ wget http://www.atgc-montpellier.fr/download/sources/lordec/LoRDEC-0.6.tar.gz
    $ tar zxvf LoRDEC-0.6.tar.gz
    $ cd LoRDEC-0.6
    $ wget https://github.com/GATB/gatb-core/releases/download/v1.1.0/gatb-core-1.1.0-bin-Linux.tar.gz
    $ tar zxvf gatb-core-1.1.0-bin-Linux.tar.gz

Set the GATB_VER variable in the Makefile to 1.1.0, then build LoRDEC:

    $ make
    $ cd ..

#### Input data
LoRDEC requires short reads in FASTA or FASTQ file format and long reads in FASTA or
FASTQ file format.
#### Pipeline
Run the long read error correction with the "lordec-correct" binary:

    $ lordec-correct -2 illumina.fasta -k 19 -s 3 -i pacbio.fasta \
    -o pacbio-corrected.fasta

- -2 : file of short reads
- -k : size of the k-mers used in the de Bruijn graph
- -s : abundance threshold for a k-mer to be considered correct
- -i : input file
- -o : output file

A series of steps is then performed to correct the long reads:

1. Construction of a de Bruijn graph from the short reads.
2. Removal of k-mers that occur fewer than s times.
3. Selection of an optimal path in the graph by computing the edit distance between the path and a region of the long read.
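
Two building blocks of these steps can be sketched in Python: filtering k-mers by abundance and computing an edit distance. This is a toy illustration under simplifying assumptions, not LoRDEC's implementation (which relies on the GATB library); both function names are hypothetical, and the graph traversal itself is omitted.

```python
from collections import Counter


def solid_kmers(short_reads, k, s):
    """Return the k-mers that occur at least s times in the short reads."""
    counts = Counter()
    for read in short_reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return {kmer for kmer, count in counts.items() if count >= s}


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (base_a != base_b),  # match or substitution
            ))
        previous = current
    return previous[-1]


print(sorted(solid_kmers(["ACGTA", "ACGTC", "TTTTT"], k=3, s=2)))
print(edit_distance("ACGT", "AGT"))
```

A candidate path through the graph would be scored with `edit_distance` against the long read region it is meant to replace, and the lowest-scoring path kept.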
#### Output data
The corrected sequences are written to the output file given with the "-o" parameter. This
FASTA file contains the long reads: corrected regions appear in uppercase letters,
while uncorrected regions appear in lowercase letters. LoRDEC can remove
the uncorrected regions at the beginning and at the end of each long read, or keep only
the corrected regions.

Trim the uncorrected ends:

    $ lordec-trim -i fichier_reads.fasta -o fichier_trim.fasta

- -i : corrected reads file
- -o : output file

Keep only the corrected regions:

    $ lordec-trim-split -i fichier_reads.fasta -o fichier_trim_split.fasta
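
The effect of the two trimming modes on the uppercase/lowercase convention can be sketched in Python. This only mimics the idea described above; the function names are hypothetical and the actual tools may apply additional rules.

```python
import re
import string


def trim_uncorrected_ends(seq):
    """Remove lowercase (uncorrected) bases from both ends of a read."""
    return seq.strip(string.ascii_lowercase)


def split_corrected_regions(seq):
    """Return every maximal uppercase (corrected) region of a read."""
    return re.findall(r"[A-Z]+", seq)


read = "acgTTGacCAgt"
print(trim_uncorrected_ends(read))
print(split_corrected_regions(read))
```

Trimming keeps internal uncorrected stretches and yields one sequence per read, whereas splitting can yield several shorter sequences per read.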

## 4. HALC
#### Introduction
HALC is a high-throughput error correction tool for long reads.
#### Installation
The BLASR aligner and the LoRDEC error correction software (only for -ordinary mode) are required to run HALC.

- Compile the source files in the 'src' and 'thirdparty' folders into a 'bin' folder by running the Makefile: make all.
- Add BLASR, LoRDEC and the 'bin' folder to your $PATH: export PATH=PATH2BLASR:$PATH, export PATH=PATH2LoRDEC:$PATH and export PATH=PATH2bin:$PATH, respectively.
#### Input data

- Long reads in FASTA format.
- Contigs assembled from the corresponding short reads in FASTA format.
- The initial short reads in FASTA format (only for -ordinary mode; obtained with cat left_reads.fa >short_reads.fa and then cat right_reads.fa >>short_reads.fa).


#### Pipeline
The general command-line syntax is runHALC.py long_reads.fa contigs.fa [-options|-options]. Scaffolds can be supplied instead of contigs.

Without the initial short reads:

    $ runHALC.py long_reads.fa contigs.fa -b 4 -a -w 2 -k 25 -t 8

With the initial short reads (-ordinary mode):

    $ runHALC.py long_reads.fa contigs.fa -o short_reads.fa -b 4 -a -w 2 -k 25 -t 8

#### Output data


- Error corrected full long reads.
- Error corrected trimmed long reads.
- Error corrected split long reads.


# PBJelly gap closing
PBJelly can be used to try to close or shrink gaps that may be present between contigs after scaffolding.
### Requirements:


    Blasr (https://github.com/PacificBiosciences/blasr)
    Version 1.3.1.127046 is fully vetted as compatible with
    Jelly. Other versions may run into problems. Use
    > blasr -version
    to figure out what you have. Blasr must be in your environment
    path.

    Python 2.7
    Python must be in your environment path and executable with
    the commands:
    > python
    > /usr/bin/env python

    Networkx v1.1
    Versions past v1.1 have been shown to have many issues. This will
    be updated in the future. To check your version use, in a python
    interactive terminal, type:
    > import networkx
    > networkx.__version__
    If you get an error saying the attribute isn't found, you don't have
    version 1.1 
    
    https://sourceforge.net/p/pb-jelly/wiki/Home/
    https://github.com/alvaralmstedt/Tutorials/wiki/Gap-closing-with-PBJelly
    


# Hybrid Assembly:
This process is usually carried out by pre-assembling short fragments and then joining the contigs using long reads to correctly order the sequences and fill gaps [5]; [6]. An opposite strategy is to employ long reads as a scaffold for the assembly of short reads [7]; [8]; [9]. However, as concluded in one of the recent studies, the highest efficiency in genome assembly is obtained with approaches that use high-accuracy Illumina sequencing data to correct PacBio SMRT or Nanopore sequencing results.

## 1. Canu
#### Introduction
Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION).

#### Installation


    git clone https://github.com/marbl/canu.git
    cd canu/src
    make -j <number of threads>

The general command-line syntax is:

    canu [-correct | -trim | -assemble | -trim-assemble] \
      [-s <assembly-specifications-file>] \
       -p <assembly-prefix> \
       -d <assembly-directory> \
       genomeSize=<number>[g|m|k] \
       [other-options] \
       [-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq

The -p option, which sets the file name prefix of intermediate and output files, is mandatory. If -d is not supplied, Canu runs in the current directory; otherwise, Canu creates the assembly directory and runs in it. It is _not_ possible to run two different assemblies in the same directory.

The -s option will import a list of parameters from the supplied specification (‘spec’) file. These parameters will be applied before any from the command line are used, providing a method for setting commonly used parameters, but overriding them for specific assemblies.

By default, all three top-level tasks are performed. It is possible to run exactly one task by using the -correct, -trim or -assemble options. Additionally, supplying pre-corrected reads with -pacbio-corrected or -nanopore-corrected will run only the trimming (-trim) and assembling (-assemble) stages.

One parameter is required: genomeSize, given in bases, with common SI prefixes allowed (for example, 4.7m or 2.8g).
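
To make the genomeSize notation concrete, the following Python sketch converts such a value into a number of bases. It is an illustration only, not Canu's own parser; the function name is hypothetical, and it assumes the k, m and g suffixes mean 10^3, 10^6 and 10^9 bases.

```python
from decimal import Decimal

# Assumed meaning of the suffixes: thousand, million and billion bases.
_MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9}


def genome_size_to_bases(value):
    """Convert a genomeSize string such as '4.7m' into a number of bases."""
    text = value.strip().lower()
    if text and text[-1] in _MULTIPLIERS:
        # Decimal avoids floating-point rounding on values like 4.7.
        return int(Decimal(text[:-1]) * _MULTIPLIERS[text[-1]])
    return int(Decimal(text))


for size in ("4.7m", "2.8g", "5000"):
    print(size, genome_size_to_bases(size))
```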

#### Pipeline


![title](http://canu.readthedocs.io/en/latest/_images/canu-pipeline.svg)

First, correct the raw reads:

    canu -correct \
      -p ecoli -d ecoli \
      genomeSize=4.8m \
      -pacbio-raw  pacbio.fastq


Then, trim the output of the correction:

    canu -trim \
      -p ecoli -d ecoli \
      genomeSize=4.8m \
      -pacbio-corrected ecoli/ecoli.correctedReads.fasta.gz

And finally, assemble the output of trimming, twice, with different stringency on which overlaps to use (see correctedErrorRate):

    canu -assemble \
      -p ecoli -d ecoli-erate-0.039 \
      genomeSize=4.8m \
      correctedErrorRate=0.039 \
      -pacbio-corrected ecoli/ecoli.trimmedReads.fasta.gz

    canu -assemble \
      -p ecoli -d ecoli-erate-0.075 \
      genomeSize=4.8m \
      correctedErrorRate=0.075 \
      -pacbio-corrected ecoli/ecoli.trimmedReads.fasta.gz

Note that the assembly stages use different ‘-d’ directories. It is not possible to run multiple copies of canu with the same work directory.


#### Consensus Accuracy

Canu consensus sequences are typically well above 99% identity. Accuracy can be improved by polishing the contigs with tools developed specifically for that task. We recommend Quiver (https://github.com/PacificBiosciences/GenomicConsensus) for PacBio and Nanopolish for Oxford Nanopore data. When Illumina reads are available, Pilon (http://software.broadinstitute.org/software/pilon/) can be used to polish either PacBio or Oxford Nanopore assemblies.


## 2. MaSuRCA
#### Introduction
The MaSuRCA assembler combines the benefits of the de Bruijn graph and Overlap-Layout-Consensus assembly approaches.
Since version 3.2.1 it supports hybrid assembly with short Illumina reads and long high-error PacBio/MinION data.

#### Compile/Install
Compiling the assembler requires gcc version 4.7 or newer to be installed on the system.
The assembler has been tested on the following distributions:

- Fedora 12 and up
- RedHat 5 and 6 (requires installation of gcc 4.7)
- CentOS 5 and 6 (requires installation of gcc 4.7)
- Ubuntu 12 LTS and up
- SUSE Linux 16 and up
#### Hardware requirements
The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are
supported. General guidelines for mammalian genomes (up to 3 Gb): 512 GB RAM, 32+ cores, 5 TB disk space.
Expected run time for mammalian genomes (up to 3 Gb): 15-20 days.

#### Installation
To install, first download the latest distribution from ftp://ftp.genome.umd.edu/pub/MaSuRCA/ and untar/unzip the package MaSuRCA-X.X.X.tgz. Then cd to the resulting folder and run './install.sh'. The installation script will configure and make all necessary packages.
In the rest of this document, '/install_path' refers to the path to the directory in which './install.sh' was run.

IMPORTANT! Do not preprocess Illumina data before providing it to MaSuRCA.  Do not do any 
trimming, cleaning or error correction. This WILL deteriorate the assembly.

First, create a configuration file which contains the location of the compiled assembler, the location of the data and some parameters. Copy into your assembly directory the template configuration file '/install_path/sr_config_example.txt', which was created by the installer with the correct paths to the freshly compiled software and with reasonable parameters. Many assembly projects should only need to set the path to the input data.

Second, run the 'masurca' script, which will generate from the configuration file a shell script 'assemble.sh'. This last script is the main driver of the assembly.

Finally, run the script 'assemble.sh' to assemble the data.


Configuration. To run the assembler, one must first create a configuration file that specifies the location of the executables, data and assembly parameters for the assembler. The installation script will create a sample config file 'sr_config_example.txt'. The sample configuration file is shown in the quick start guide:

ftp://ftp.genome.umd.edu/pub/MaSuRCA/MaSuRCA_QuickStartGuide.pdf