Skip to content

Commit

Permalink
Merge tip-clipping from GSoC into GenomeGraphs framework (#15)
Browse files Browse the repository at this point in the history
  • Loading branch information
Ben J. Ward committed Nov 27, 2019
1 parent 16ddf85 commit 1833350
Show file tree
Hide file tree
Showing 29 changed files with 2,095,188 additions and 58 deletions.
1 change: 1 addition & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ language: julia
os:
- linux
- osx
- windows
julia:
- 1.1
- 1.2
Expand Down
8,000 changes: 8,000 additions & 0 deletions 20june_unet_ecoli.fastq

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ authors = ["Ben J. Ward <benjward@protonmail.com>"]
version = "0.1.0"

[deps]
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
FASTX = "c2308a5c-f048-11e8-3e8a-31650f418d12"
ReadDatastores = "70a005b8-9d8a-11e9-0d98-c909fa2e52d2"
Expand Down
Binary file added after_tip_removal.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added before_tip_removal.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
5 changes: 4 additions & 1 deletion docs/Project.toml
Original file line number Diff line number Diff line change
@@ -1,2 +1,5 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

[compat]
Documenter = "0.24"
24 changes: 23 additions & 1 deletion docs/src/DeBruijnGraph.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ The kmer size is set to 15.

```
r = FASTQ.Reader(open("URnano_ecoli.fastq", "r"))
Ecoli_reads = Set{BioSequence{DNAAlphabet{4}}}()
Ecoli_reads = Vector{BioSequence{DNAAlphabet{4}}}()
## get first 5 reads
for i in 1:5
Expand Down Expand Up @@ -74,3 +74,25 @@ DeBruijnGraph(Dict{Int64,SequenceGraphNode}(7=>SequenceGraphNode{BioSequence{DNA


So the nodes with nodeID 6 and 3 are collapsed into node 7 and we can see that both outgoing edges of 3 are given to node 7.



#### Saving to gfa

Below is an example code for saving a SequenceDistanceGraph to GFA file. This allows us to load and visualize the graph using graph visualization tools.

```
# A helpful function to let me run Bandage to visualize graphs.
const BANDAGE_BIN = "/Applications/Bandage.app/Contents/MacOS/Bandage"
function draw_graph(gr)
filename = tempname()
BioSequenceGraphs.dump_to_gfa1(gr, filename)
run(`$BANDAGE_BIN image $filename.gfa $filename.png --height 500`)
run(`open $filename.png`)
end
function show_graph(gr)
filename = tempname()
BioSequenceGraphs.dump_to_gfa1(gr, filename)
run(`$BANDAGE_BIN load $filename.gfa`)
end
```
Binary file added docs/src/assets/logo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/src/assets/tips.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
92 changes: 88 additions & 4 deletions docs/src/man/guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ pkg> add GenomeGraphs
If you are interested in the cutting edge of the development, please check out
the master branch to try new features before release.

## Creating a WorkSpace
## Creating your first WorkSpace

You can create an empty genome graph workspace with the empty constructor:

Expand All @@ -42,12 +42,15 @@ do any exploration or analysis. There are a few ways to get your first graph:
1. Construct a de-Bruijn graph from raw sequencing reads.
2. Load a graph (such as produced from another assembler) from a GFAv1 file

## Constructing a de-Bruijn graph
## Constructing your first de-Bruijn graph

Let's see how to do option number 1, and construct a de-Bruijn graph from
raw sequencing reads. This can be achieved with a few simple steps:

1. Prepare the sequencing reads.
1. Prepare the sequencing reads & build a datastore.
2. Add the read datastore to a WorkSpace.
3. Run the dbg process.
4. Run the tip removal process.

### Preparing the sequencing reads

Expand Down Expand Up @@ -80,7 +83,9 @@ ds = PairedReads(fwq, rvq, "ecoli-test-paired", "my-ecoli", 250, 300, 0, FwRv)

Here "ecoli-test-paired" is provided as the base filename of the datastore, the
datastore is given the name of "my-ecoli", this name will be used to identify it
in the workspace later. The minimum length for the reads is set at 250 base
in the workspace later.

The minimum length for the reads is set at 250 base
pairs, and the maximum length is set to 300 base pairs. Reads that are too short
will be discarded, reads that are too long are truncated.

Expand All @@ -94,6 +99,8 @@ read 2 is oriented backwards (forwards on the opposite strand). This orientation
distinguishes regular paired-end reads from other paired read types like
Long Mate Pairs.

### Add datastore to the WorkSpace

Now the datastore is created, it can be added to a workspace.

```julia
Expand All @@ -115,3 +122,80 @@ produce a first de-Bruijn graph of the genome.
dbg!(BigDNAMer{61}, 10, ws, "my-ecoli")
```

!!! warning
Some steps of this process, especially the Kmer counting steps, may take a
long time for big inputs.
No parallelism or batching to disk is used currently (although it is planned).
This process should take just about a minute for E.coli paired end reads with
a decent coverage.
So, be warned, for big stuff, performance may suuuuuuucck in these early days!

### Run the `remove_tips` process

Now you have a raw compressed de-Bruijn assembly graph. You can start to use it
for analyses, and also try to improve its structure and resolve parts of the
graph that represent error, repetitive content, and so forth. Some of these
structures can be identified and resolved using only the topology of the graph,
some require additional information sources (linked reads / long reads / and so on),
to be incorporated into the workspace first.

Here let's see how to resolve a common structure, using only the graph topology.

Let's fix the tips of the graph.

Tips looks like this:

![tips](assets/tips.jpeg)

See how a piece of the assembly which should be one long stretch of sequence is
broken into 3 pieces (red) because of the existence of two tips (blue). Such
tips are defined topologically as very short segments which have one incoming
neighbour, and no outgoing neighbour.

Tips are caused by sequencer errors that occur at the end of reads, because the
sequencing by synthesis technique becomes more error prone over time; reagents
are consumed and products generated as time progresses, making the base detection
more difficult. Hence errors occur at the ends of reads, and erroneous kmers
from the read ends are unlikely to have forward neighbours, and they end up
forming tip nodes when they are incorporated into the de-Bruijn graph.

you can remove these tips and improve the contiguity of the graph by using the
`remove_tips!` process.

```julia
remove_tips!(ws, 200)
```

The value of 200 provided is a size (in base pairs) threshold:
Nodes are considered tips only if they have one ingoing neighbour, no outgoing
neighbours, and are (in this case) smaller than 200 base pairs in length.

Once this process is finished you will have a collapsed de-Bruijn graph with the
tips removed.

## Using the NodeView interface

Graphs can be complicated data structures. If you want an in depth explanation
of the data-structure used to represent genome graphs, then head [here](notyet).

However, most people should not have to care about the internal structure of the
graph.

To make things as simple as possible, the `NodeView` type provides a single
entry point for node-centric analyses. The `NodeView` wraps a point to a
workspace's graph and contains a node id. A `NodeView` gives you acces to a
node's underlying sequence, the nodes neighbouring nodes in the forward and
backward directions, the reads mapped to a node, and kmer coverage over the node.

To get a `NodeView` of a node, use the `node` method on a `WorkSpace`, providing
a node id number. A positive ID denotes a view of the node traversing it in the
forward (canonical) direction. A negative ID denotes a view of the node,
traversing it in the reverse complement direction.

```julia
julia> n = node(ws, 3)
A view of a graph node (node: 3, graph: sdg):
AAAAAACCTCCGCAACCCCATGTTTTCACATAACTGTTGGCCATGACCGGCTGGCTGTCAGGCTGTCACTGATAATCA


```
35 changes: 34 additions & 1 deletion docs/src/types/graphs.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,4 +31,37 @@ remove_node!
add_link!
remove_link!
disconnect_node!
```
```

#Code example for testing the error correction and simplifying the graph

In the example below we generate a toy graph from 4-mers which contain a bubble.
As for now, we make use of randomly generated coverage information for
```
using BioSequences
using BioSequenceGraphs
kmerlist = Vector{DNAKmer{4}}([DNAKmer{4}("ATAC"),DNAKmer{4}("TACG"),DNAKmer{4}("TACC"),DNAKmer{4}("ACGA"),DNAKmer{4}("CGAA"),DNAKmer{4}("GAAT"),DNAKmer{4}("AATC"),DNAKmer{4}("ACCA"),DNAKmer{4}("CCAA"),DNAKmer{4}("CAAT")])
sdg = SequenceDistanceGraph(kmerlist,true)
```

The new SequenceDistanceGraph constructor takes as input a list of kmerlist (not necessarily sorted or in canonical form) and a boolean for whether to do error correction or not. Then by using the coverage information generated randomly, simplifies the graph.
The resulting SDG before and after the error correction steps are as follows:

***Before:***
```
SequenceDistanceGraph{BioSequence{DNAAlphabet{2}}}(SDGNode{BioSequence{DNAAlphabet{2}}}[SDGNode{BioSequence{DNAAlphabet{2}}}(AATC, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATAC, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATTCGTA, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATTGGTA, false)], Array{DistanceGraphLink,1}[[DistanceGraphLink(1, 3, -3), DistanceGraphLink(1, 4, -3)], [DistanceGraphLink(-2, -4, -3), DistanceGraphLink(-2, -3, -3)], [DistanceGraphLink(3, 1, -3), DistanceGraphLink(-3, -2, -3)], [DistanceGraphLink(4, 1, -3), DistanceGraphLink(-4, -2, -3)]])
```

***After:***
```
SequenceDistanceGraph{BioSequence{DNAAlphabet{2}}}(SDGNode{BioSequence{DNAAlphabet{2}}}[SDGNode{BioSequence{DNAAlphabet{2}}}(GATTCGTAC, false)], Array{DistanceGraphLink,1}[[]])
```

So the algorithm first generates a sdg with collapsing all simple paths using the initial kmer list. Then at the next stage the low covered bubble branch is removed and path collapsing is applied one more time.
This results in a sdg with a whole single node, a perfect unitig.


##TODO:

Implement the sdg constructor to take as input a list of reads instead of kmers and generate all kmer + coverage information from them.
8 changes: 8 additions & 0 deletions examples/construct_dbg.jl
Original file line number Diff line number Diff line change
Expand Up @@ -27,3 +27,11 @@ end


## At the end we have generated a dbg with Base.length(BioSequenceGraphs.nodes(dbg))= 742343
<<<<<<< HEAD
=======
start = time(); BioSequenceGraphs.new_deBruijn_Constructor(kmer_set); diff = time()-start
## Finished in 605.3275558948517


## First we test the simple path finder on this dbg
>>>>>>> e837ae91cbfe8ae18f58a8c8082260f9f12361e3
Binary file added graph.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Empty file added graph1.fasta
Empty file.
Loading

0 comments on commit 1833350

Please sign in to comment.