Merge tip-clipping from GSoC into GenomeGraphs framework (#15)

BioJulia · Nov 27, 2019 · 1833350 · 1833350
1 parent 16ddf85
commit 1833350
Show file tree

Hide file tree

Showing 29 changed files with 2,095,188 additions and 58 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -3,6 +3,7 @@ language: julia
 os:
   - linux
   - osx
+  - windows
 julia:
   - 1.1
   - 1.2

diff --git a/20june_unet_ecoli.fastq b/20june_unet_ecoli.fastq
diff --git a/Project.toml b/Project.toml
@@ -4,6 +4,7 @@ authors = ["Ben J. Ward <benjward@protonmail.com>"]
 version = "0.1.0"
 
 [deps]
+Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
 BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
 FASTX = "c2308a5c-f048-11e8-3e8a-31650f418d12"
 ReadDatastores = "70a005b8-9d8a-11e9-0d98-c909fa2e52d2"

diff --git a/after_tip_removal.png b/after_tip_removal.png
diff --git a/before_tip_removal.png b/before_tip_removal.png
diff --git a/docs/Project.toml b/docs/Project.toml
@@ -1,2 +1,5 @@
 [deps]
-Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+
+[compat]
+Documenter = "0.24"
diff --git a/docs/src/DeBruijnGraph.md b/docs/src/DeBruijnGraph.md
@@ -34,7 +34,7 @@ The kmer size is set to 15.
 
 ```
 r = FASTQ.Reader(open("URnano_ecoli.fastq", "r"))
-Ecoli_reads = Set{BioSequence{DNAAlphabet{4}}}()
+Ecoli_reads = Vector{BioSequence{DNAAlphabet{4}}}()
 
 ## get first 5 reads
 for i in 1:5
@@ -74,3 +74,25 @@ DeBruijnGraph(Dict{Int64,SequenceGraphNode}(7=>SequenceGraphNode{BioSequence{DNA
 
 
 So the nodes with nodeID 6 and 3 are collapsed into node 7 and we can see that both outgoing edges of 3 are given to node 7.
+
+
+
+#### Saving to gfa
+
+Below is an example code for saving a SequenceDistanceGraph to GFA file. This allows us to load and visualize the graph using graph visualization tools.
+
+```
+# A helpful function to let me run Bandage to visualize graphs.
+const BANDAGE_BIN = "/Applications/Bandage.app/Contents/MacOS/Bandage"
+function draw_graph(gr)
+    filename = tempname()
+    BioSequenceGraphs.dump_to_gfa1(gr, filename)
+    run(`$BANDAGE_BIN image $filename.gfa $filename.png --height 500`)
+    run(`open $filename.png`)
+end
+function show_graph(gr)
+    filename = tempname()
+    BioSequenceGraphs.dump_to_gfa1(gr, filename)
+    run(`$BANDAGE_BIN load $filename.gfa`)
+end
+```
diff --git a/docs/src/assets/logo.png b/docs/src/assets/logo.png
diff --git a/docs/src/assets/tips.jpeg b/docs/src/assets/tips.jpeg
diff --git a/docs/src/man/guide.md b/docs/src/man/guide.md
@@ -24,7 +24,7 @@ pkg> add GenomeGraphs
 If you are interested in the cutting edge of the development, please check out
 the master branch to try new features before release.
 
-## Creating a WorkSpace
+## Creating your first WorkSpace
 
 You can create an empty genome graph workspace with the empty constructor:
 
@@ -42,12 +42,15 @@ do any exploration or analysis. There are a few ways to get your first graph:
 1. Construct a de-Bruijn graph from raw sequencing reads.
 2. Load a graph (such as produced from another assembler) from a GFAv1 file
 
-## Constructing a de-Bruijn graph
+## Constructing your first de-Bruijn graph
 
 Let's see how to do option number 1, and construct a de-Bruijn graph from
 raw sequencing reads. This can be achieved with a few simple steps:
 
-1. Prepare the sequencing reads.
+1. Prepare the sequencing reads & build a datastore.
+2. Add the read datastore to a WorkSpace.
+3. Run the dbg process.
+4. Run the tip removal process.
 
 ### Preparing the sequencing reads
 
@@ -80,7 +83,9 @@ ds = PairedReads(fwq, rvq, "ecoli-test-paired", "my-ecoli", 250, 300, 0, FwRv)
 
 Here "ecoli-test-paired" is provided as the base filename of the datastore, the
 datastore is given the name of "my-ecoli", this name will be used to identify it
-in the workspace later. The minimum length for the reads is set at 250 base
+in the workspace later.
+
+The minimum length for the reads is set at 250 base
 pairs, and the maximum length is set to 300 base pairs. Reads that are too short
 will be discarded, reads that are too long are truncated.
 
@@ -94,6 +99,8 @@ read 2 is oriented backwards (forwards on the opposite strand). This orientation
 distinguishes regular paired-end reads from other paired read types like
 Long Mate Pairs.
 
+### Add datastore to the WorkSpace
+
 Now the datastore is created, it can be added to a workspace.
 
 ```julia
@@ -115,3 +122,80 @@ produce a first de-Bruijn graph of the genome.
 dbg!(BigDNAMer{61}, 10, ws, "my-ecoli")
 ```
 
+!!! warning
+    Some steps of this process, especially the Kmer counting steps, may take a
+    long time for big inputs.
+    No parallelism or batching to disk is used currently (although it is planned).
+    This process should take just about a minute for E.coli paired end reads with
+    a decent coverage.
+    So, be warned, for big stuff, performance may suuuuuuucck in these early days!
+
+### Run the `remove_tips` process
+
+Now you have a raw compressed de-Bruijn assembly graph. You can start to use it
+for analyses, and also try to improve its structure and resolve parts of the
+graph that represent error, repetitive content, and so forth. Some of these
+structures can be identified and resolved using only the topology of the graph,
+some require additional information sources (linked reads / long reads / and so on),
+to be incorporated into the workspace first.
+
+Here let's see how to resolve a common structure, using only the graph topology.
+
+Let's fix the tips of the graph.
+
+Tips looks like this:
+
+![tips](assets/tips.jpeg)
+
+See how a piece of the assembly which should be one long stretch of sequence is
+broken into 3 pieces (red) because of the existence of two tips (blue). Such
+tips are defined topologically as very short segments which have one incoming
+neighbour, and no outgoing neighbour.
+
+Tips are caused by sequencer errors that occur at the end of reads, because the
+sequencing by synthesis technique becomes more error prone over time; reagents
+are consumed and products generated as time progresses, making the base detection
+more difficult. Hence errors occur at the ends of reads, and erroneous kmers
+from the read ends are unlikely to have forward neighbours, and they end up
+forming tip nodes when they are incorporated into the de-Bruijn graph. 
+
+you can remove these tips and improve the contiguity of the graph by using the
+`remove_tips!` process.
+
+```julia
+remove_tips!(ws, 200)
+```
+
+The value of 200 provided is a size (in base pairs) threshold:
+Nodes are considered tips only if they have one ingoing neighbour, no outgoing
+neighbours, and are (in this case) smaller than 200 base pairs in length.
+
+Once this process is finished you will have a collapsed de-Bruijn graph with the
+tips removed.
+
+## Using the NodeView interface
+
+Graphs can be complicated data structures. If you want an in depth explanation
+of the data-structure used to represent genome graphs, then head [here](notyet).
+
+However, most people should not have to care about the internal structure of the
+graph.
+
+To make things as simple as possible, the `NodeView` type provides a single
+entry point for node-centric analyses. The `NodeView` wraps a point to a
+workspace's graph and contains a node id. A `NodeView` gives you acces to a
+node's underlying sequence, the nodes neighbouring nodes in the forward and
+backward directions, the reads mapped to a node, and kmer coverage over the node.
+
+To get a `NodeView` of a node, use the `node` method on a `WorkSpace`, providing
+a node id number. A positive ID denotes a view of the node traversing it in the
+forward (canonical) direction. A negative ID denotes a view of the node,
+traversing it in the reverse complement direction.   
+
+```julia
+julia> n = node(ws, 3)
+A view of a graph node (node: 3, graph: sdg):
+  AAAAAACCTCCGCAACCCCATGTTTTCACATAACTGTTG…GCCATGACCGGCTGGCTGTCAGGCTGTCACTGATAATCA
+
+
+```
diff --git a/docs/src/types/graphs.md b/docs/src/types/graphs.md
@@ -31,4 +31,37 @@ remove_node!
 add_link!
 remove_link!
 disconnect_node!
-```
+```
+
+#Code example for testing the error correction and simplifying the graph
+
+In the example below we generate a toy graph from 4-mers which contain a bubble.
+As for now, we make use of randomly generated coverage information for
+```
+using BioSequences
+using BioSequenceGraphs
+
+kmerlist = Vector{DNAKmer{4}}([DNAKmer{4}("ATAC"),DNAKmer{4}("TACG"),DNAKmer{4}("TACC"),DNAKmer{4}("ACGA"),DNAKmer{4}("CGAA"),DNAKmer{4}("GAAT"),DNAKmer{4}("AATC"),DNAKmer{4}("ACCA"),DNAKmer{4}("CCAA"),DNAKmer{4}("CAAT")])
+sdg = SequenceDistanceGraph(kmerlist,true)
+```
+
+The new SequenceDistanceGraph constructor takes as input a list of kmerlist (not necessarily sorted or in canonical form) and a boolean for whether to do error correction or not. Then by using the coverage information generated randomly, simplifies the graph.
+The resulting SDG before and after the error correction steps are as follows:
+
+***Before:***
+```
+SequenceDistanceGraph{BioSequence{DNAAlphabet{2}}}(SDGNode{BioSequence{DNAAlphabet{2}}}[SDGNode{BioSequence{DNAAlphabet{2}}}(AATC, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATAC, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATTCGTA, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATTGGTA, false)], Array{DistanceGraphLink,1}[[DistanceGraphLink(1, 3, -3), DistanceGraphLink(1, 4, -3)], [DistanceGraphLink(-2, -4, -3), DistanceGraphLink(-2, -3, -3)], [DistanceGraphLink(3, 1, -3), DistanceGraphLink(-3, -2, -3)], [DistanceGraphLink(4, 1, -3), DistanceGraphLink(-4, -2, -3)]])
+```
+
+***After:***
+```
+SequenceDistanceGraph{BioSequence{DNAAlphabet{2}}}(SDGNode{BioSequence{DNAAlphabet{2}}}[SDGNode{BioSequence{DNAAlphabet{2}}}(GATTCGTAC, false)], Array{DistanceGraphLink,1}[[]])
+```
+
+So the algorithm first generates a sdg with collapsing all simple paths using the initial kmer list. Then at the next stage the low covered bubble branch is removed and path collapsing is applied one more time.
+This results in a sdg with a whole single node, a perfect unitig.
+
+
+##TODO:
+
+Implement the sdg constructor to take as input a list of reads instead of kmers and generate all kmer + coverage information from them.
diff --git a/examples/construct_dbg.jl b/examples/construct_dbg.jl
@@ -27,3 +27,11 @@ end
 
 
 ## At the end we have generated a dbg with Base.length(BioSequenceGraphs.nodes(dbg))=  742343
+<<<<<<< HEAD
+=======
+start = time(); BioSequenceGraphs.new_deBruijn_Constructor(kmer_set); diff  = time()-start
+## Finished  in 605.3275558948517
+
+
+## First  we test the simple path finder on this dbg
+>>>>>>> e837ae91cbfe8ae18f58a8c8082260f9f12361e3
diff --git a/graph.png b/graph.png
diff --git a/graph1.fasta b/graph1.fasta