Skip to content

Section 7: Genome assembly with ABySS and SOAPdenovo2

FlintMitchell edited this page Jan 16, 2022 · 4 revisions

Go back to Section 6: Pre assembly software

Go back to tutorial overview

Learning Points for Section 7: Genome assembly with ABySS and SOAPdenovo2

  1. You will understand how to use the ABySS genome assembler.
  2. You will have a refresher on how to create virtual terminals for running software behind the scenes in your EC2 instance.
  3. You will know how to generate a SOAPdenovo2 configuration file.
  4. You will understand how to use the SOAPdenovo2 genome assembler.

ABySS assembly

Okay! We now have the test data we want to assemble our test arabidopsis thaliana genome with. We will get experience here with two separate genome assemblers. First, we will use ABySS. ABySS stands for “Assembly By Short Sequences”, and substantially more information can be found on it here:

Before we run this assembly, however, let’s set up a virtual terminal using tmux in order to create a clean working environment that we can log in and out of so that we can still work on this EC2 instance in this terminal.

To do use, use the command

$ tmux new -s [session-name]
ex. $ tmux new -s abyss-assembly

For future reference when we want to detach from a tmux session:

Press the 'control' button and then the letter 'B' button at the same time, and then release them. This tells tmux that you are about to give it a command to do something. After releasing those buttons, press the letter 'D' button. You will be detached from the "arabidopsis-assembly" terminal and re-attached to the main terminal. Any programs you started on the "arabidopsis-assembly" terminal will continue to run!

To attach back into a terminal session:

$ tmux a -t [session-name]
ex. $ tmux a -t arabidopsis-assembly

Okay, now we are ready. The ABySS command syntax is fairly simple, it looks like:

$ abyss-pe name=[assembly-name] j=[num-threads] v=-v k=53 in="fastq_1.fastq fastq_2.fastq" | tee filename.log

Let’s use the command below and then read up on the options while it runs.

First is the command itself. If you type abyss- and then press tab, you will see a variety of options. These seem to mostly be individual modules of the ABySS program, some of which can be used on their own for doing things like calculating statistics from the assembly. However the main program as shown above is abyss-pe. name=[assembly-name], where you set the output name of the assembly. You might put some information about the assembly itself here to create a descriptive output filename, like 'apallida-k53' to signify the plant being assembled and the kmer value. Sometimes it create a bit more work by adding a longer filename, but the more descriptive you make it the easier it is when looking back on your data months later! j=[num-threads] sets the number of threads available you want to allocate from the EC2 instance to do the assembly. Some of the processes are multi-threaded, so make sure to set the maximum amount of threads you have to expedite the assembly process. v=-v makes the assembly give a verbose output. This can be helpful for understanding what went wrong (or right!) during the assembly afterwards. k=[k-mer-value] sets the k-mer value for the assembler. in="filename1.filetype filename2.filetype" gives the assembler the locationa nd name of the input sequencing read data. If you are in the folder where the data is location, you don't need to add the path to the data beforehands. If you are outside of the data folder, you would need the path to the data. | tee filename.log creates a file where all the assembly output information is stored. By using the v=-v flag above we can increase the amount of information saved into this log file.

So in order to run an assembly of our test data we will use the following command:

$ abyss-pe name=abyss-athal-121621 j=1 v=-v k=53 in="SRR1946554_1_phred21_trimmed.fq SRR1946554_2_phred21_trimmed.fq" 

With that running, let’s detach from this tmux session as described above and do another assembly using a different assembler..

SOAPdenovo2 assembly

Here, we will use the SOAPdenovo2 assembler.

First, like we did with ABySS, let’s create a virtual terminal with tmux:

$ tmux new -s soap-assembly

SOAPdenovo2, unlike ABySS, requires a config file. -s tells the program that the next file you give it will be the location/name of it. The config file tells the assembler where to find the raw sequencing files (FASTA, FASTQ, or BAM) and has relevant information to them. Most of the information entered into the config file has default values, but you will want to write your own so that it reflects your experiment. The basic information to look at (and change if necessary) in this config file is:

### Global Information
# Maximum lead length. The length of the longest read in your dataset.
max_rd_len=150

### Library Information
# Multiple libraries can be submitted. Each library section starts with [LIB] and the following information 
# before the next [LIB] call (if there is one) relates to that library section. (i.e. if only have 1 library, 
# you only need to write [LIB] once)
[LIB]

# Average insert size of the library. You should know this from your experiment.
avg_ins=150

# This option takes value 0 or 1. It tells the assembler if the read sequences need to be complementarily reversed.
# Illumima produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with
# typical insert size less than 500 bp; b) reverse-forward, generated from circularizing libraries with typical insert
# size greater than 2 Kb. The parameter "reverse_seq" should be set to indicate this: 0, forward-reverse; 1, reverse-forward.
reverse_seq=0

# This indicator decides in which part of the assembly the reads are used.
# 1 = contig assembly only
# 2 = scaffold assembly only
# 3 = both contig and scaffold assembly
# 4 = only gap closure
asm_flags=3

# Length of the longest read in this library
rd_len_cutoff=150

# The order in which these reads are used in scaffolding. If you had a second library, you could have them
# used at the same time for scaffolding (make both have rank=1), or use one first then the other 
# (one would be rank=1 and the other rank=2).
rank=1

# This parameter is the cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds. 
# Min for paired-end is 3. Min for mate-pair is 5.
pair_num_cutoff=3

# Minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32

# Paired-end fastq files, read 1 file should always be followed by read 2 file.
q1=/home/ubuntu/a-thaliana-kmer-analysis/data/SRR1946554_1.fastq
q2=/home/ubuntu/a-thaliana-kmer-analysis/data/SRR1946554_2.fastq

Let’s create a config file using the above information. Create a new text file and enter the nano editor using:

$ nano arabidopsis-test-config

You are now in the nano editor, editing a new file you have just created called “arabidopsis-test-config”. Copy the above text and paste it into this file.

The information in this config file is well commented/documented, so for your own personal experiment, go through and change the parameters as they relate to your data. In this example we will leave everything except for the last lines describing q1 and q2. These are the PATHs to your data files.

Change them so that they have the correct PATH and fastq filename:

# Paired-end fastq files, read 1 file should always be followed by read 2 file.
q1=/home/ubuntu/a-thaliana-kmer-analysis/data/SRR1946554_1_phred21_trimmed.fq
q2=/home/ubuntu/a-thaliana-kmer-analysis/data/SRR1946554_2_phred21_trimmed.fq

Save and exit.

Now, the basic command for SOAPdenovo2 looks like:

soapdenovo2-127mer all -s /path/to/config-file -K [k-mer-value] -p [num-threads] -o /path/to/assembly/directory/assembly-output-filename 1>assembly.log 2>assembly.err

To fill out this information use the following: a k-mer value of 31. Since we are already running another assembly on this EC2 instance, just use 1 thread. Our output path will be the directory we created for this workshop.

So putting that together, we get:

$ soapdenovo2-127mer all -s PATH/to/config-file -p 1 -K 31 -R -o soap-athal 1>assembly.log 2>assembly.err

Review Questions:

What command do you use to create a new virtual terminal and how do you log out of it?

  • tmux new -s [session-name]. To log out of that terminal you press the 'control' button and then the letter 'B' button at the same time, and then release them. After releasing those buttons, press the letter 'D' button.

What is the command used to run the ABySS assembly program?

  • abyss-pe name=[assembly-name] j=[num-threads] v=-v k=53 in="fastq_1.fastq fastq_2.fastq" | tee filename.log

What does SOAPdenovo2 require to run, which ABySS does not?

  • a configuration file

How do you run a SOAPdenovo2 assembly?

  • soapdenovo2-127mer all -s /path/to/config-file -K [k-mer-value] -p [num-threads] -o /path/to/assembly/directory/assembly-output-filename 1>assembly.log 2>assembly.err

Move on to Section 8: Checking our assembly quality and understanding our results

Go back to tutorial overview

Clone this wiki locally