Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge tip-clipping from GSoC into GenomeGraphs framework #15

Merged
merged 28 commits into from
Nov 27, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
80bd4f1
Draft clip_tips method
Jul 12, 2019
401616e
error
ardakdemir Jul 13, 2019
7ea92f8
tip removal added
ardakdemir Jul 13, 2019
ad3ff39
error correction fixed
ardakdemir Jul 14, 2019
bc17456
bubble popping implemented
ardakdemir Jul 21, 2019
03a5d1a
revised bubble popping
ardakdemir Jul 30, 2019
6f6c975
sorting and canonical conversion added into the builder
ardakdemir Aug 1, 2019
2c24e66
arapan way of bubble popping initialized
ardakdemir Aug 2, 2019
64a2fc1
bubble popping/unitig extension from graph implemented!
ardakdemir Aug 3, 2019
0aa3259
building from fastq
ardakdemir Aug 6, 2019
5512834
kmer coverage using mean
ardakdemir Aug 7, 2019
8cabf88
tip deletion from graph
ardakdemir Aug 12, 2019
932ef4e
neighbor finding for sequences is updated but still looks very messy.
ardakdemir Aug 15, 2019
f945ca2
error correction updates
ardakdemir Aug 24, 2019
532a1b7
Merge branch 'gsoc/error-correction2' into benjward/gsocmerge
Sep 26, 2019
631c1d3
Move Arda's tip deletion func to its own file in processes directory.
Sep 26, 2019
4e87caf
Merge branch 'benjward/tip-clipping' into benjward/gsocmerge
Sep 26, 2019
90c3a7c
Use links & graph topology instead of prefixes and suffixes to find tips
Sep 26, 2019
50a7ac4
Fix GFAv1 writing & node deletion & add unitig finding method
Sep 27, 2019
64562c1
Add ability to search for a link in the graph
Sep 29, 2019
59289bb
Allow sequence creation from a path, and joining paths to make a node
Sep 29, 2019
b9837bf
Fix the unitig finder
Oct 9, 2019
74ecb45
Add logo to the getting started guide
Oct 31, 2019
87d2dbd
Update manual again
Oct 31, 2019
334ee44
Add very basic NodeView interface
Nov 4, 2019
de973e5
Add Windows testing, and adjust behaviour of the sequence method
Nov 22, 2019
f1d15a9
Use updated documenter
Nov 25, 2019
60a3084
Fix a typo
Nov 25, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ language: julia
os:
- linux
- osx
- windows
julia:
- 1.1
- 1.2
Expand Down
8,000 changes: 8,000 additions & 0 deletions 20june_unet_ecoli.fastq

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ authors = ["Ben J. Ward <benjward@protonmail.com>"]
version = "0.1.0"

[deps]
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
FASTX = "c2308a5c-f048-11e8-3e8a-31650f418d12"
ReadDatastores = "70a005b8-9d8a-11e9-0d98-c909fa2e52d2"
Expand Down
Binary file added after_tip_removal.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added before_tip_removal.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
5 changes: 4 additions & 1 deletion docs/Project.toml
Original file line number Diff line number Diff line change
@@ -1,2 +1,5 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"

[compat]
Documenter = "0.24"
24 changes: 23 additions & 1 deletion docs/src/DeBruijnGraph.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ The kmer size is set to 15.

```
r = FASTQ.Reader(open("URnano_ecoli.fastq", "r"))
Ecoli_reads = Set{BioSequence{DNAAlphabet{4}}}()
Ecoli_reads = Vector{BioSequence{DNAAlphabet{4}}}()

## get first 5 reads
for i in 1:5
Expand Down Expand Up @@ -74,3 +74,25 @@ DeBruijnGraph(Dict{Int64,SequenceGraphNode}(7=>SequenceGraphNode{BioSequence{DNA


So the nodes with nodeID 6 and 3 are collapsed into node 7 and we can see that both outgoing edges of 3 are given to node 7.



#### Saving to gfa

Below is an example code for saving a SequenceDistanceGraph to GFA file. This allows us to load and visualize the graph using graph visualization tools.

```
# A helpful function to let me run Bandage to visualize graphs.
const BANDAGE_BIN = "/Applications/Bandage.app/Contents/MacOS/Bandage"
function draw_graph(gr)
filename = tempname()
BioSequenceGraphs.dump_to_gfa1(gr, filename)
run(`$BANDAGE_BIN image $filename.gfa $filename.png --height 500`)
run(`open $filename.png`)
end
function show_graph(gr)
filename = tempname()
BioSequenceGraphs.dump_to_gfa1(gr, filename)
run(`$BANDAGE_BIN load $filename.gfa`)
end
```
Binary file added docs/src/assets/logo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/src/assets/tips.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
92 changes: 88 additions & 4 deletions docs/src/man/guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ pkg> add GenomeGraphs
If you are interested in the cutting edge of the development, please check out
the master branch to try new features before release.

## Creating a WorkSpace
## Creating your first WorkSpace

You can create an empty genome graph workspace with the empty constructor:

Expand All @@ -42,12 +42,15 @@ do any exploration or analysis. There are a few ways to get your first graph:
1. Construct a de-Bruijn graph from raw sequencing reads.
2. Load a graph (such as produced from another assembler) from a GFAv1 file

## Constructing a de-Bruijn graph
## Constructing your first de-Bruijn graph

Let's see how to do option number 1, and construct a de-Bruijn graph from
raw sequencing reads. This can be achieved with a few simple steps:

1. Prepare the sequencing reads.
1. Prepare the sequencing reads & build a datastore.
2. Add the read datastore to a WorkSpace.
3. Run the dbg process.
4. Run the tip removal process.

### Preparing the sequencing reads

Expand Down Expand Up @@ -80,7 +83,9 @@ ds = PairedReads(fwq, rvq, "ecoli-test-paired", "my-ecoli", 250, 300, 0, FwRv)

Here "ecoli-test-paired" is provided as the base filename of the datastore, the
datastore is given the name of "my-ecoli", this name will be used to identify it
in the workspace later. The minimum length for the reads is set at 250 base
in the workspace later.

The minimum length for the reads is set at 250 base
pairs, and the maximum length is set to 300 base pairs. Reads that are too short
will be discarded, reads that are too long are truncated.

Expand All @@ -94,6 +99,8 @@ read 2 is oriented backwards (forwards on the opposite strand). This orientation
distinguishes regular paired-end reads from other paired read types like
Long Mate Pairs.

### Add datastore to the WorkSpace

Now the datastore is created, it can be added to a workspace.

```julia
Expand All @@ -115,3 +122,80 @@ produce a first de-Bruijn graph of the genome.
dbg!(BigDNAMer{61}, 10, ws, "my-ecoli")
```

!!! warning
Some steps of this process, especially the Kmer counting steps, may take a
long time for big inputs.
No parallelism or batching to disk is used currently (although it is planned).
This process should take just about a minute for E.coli paired end reads with
a decent coverage.
So, be warned, for big stuff, performance may suuuuuuucck in these early days!

### Run the `remove_tips` process

Now you have a raw compressed de-Bruijn assembly graph. You can start to use it
for analyses, and also try to improve its structure and resolve parts of the
graph that represent error, repetitive content, and so forth. Some of these
structures can be identified and resolved using only the topology of the graph,
some require additional information sources (linked reads / long reads / and so on),
to be incorporated into the workspace first.

Here let's see how to resolve a common structure, using only the graph topology.

Let's fix the tips of the graph.

Tips looks like this:

![tips](assets/tips.jpeg)

See how a piece of the assembly which should be one long stretch of sequence is
broken into 3 pieces (red) because of the existence of two tips (blue). Such
tips are defined topologically as very short segments which have one incoming
neighbour, and no outgoing neighbour.

Tips are caused by sequencer errors that occur at the end of reads, because the
sequencing by synthesis technique becomes more error prone over time; reagents
are consumed and products generated as time progresses, making the base detection
more difficult. Hence errors occur at the ends of reads, and erroneous kmers
from the read ends are unlikely to have forward neighbours, and they end up
forming tip nodes when they are incorporated into the de-Bruijn graph.

you can remove these tips and improve the contiguity of the graph by using the
`remove_tips!` process.

```julia
remove_tips!(ws, 200)
```

The value of 200 provided is a size (in base pairs) threshold:
Nodes are considered tips only if they have one ingoing neighbour, no outgoing
neighbours, and are (in this case) smaller than 200 base pairs in length.

Once this process is finished you will have a collapsed de-Bruijn graph with the
tips removed.

## Using the NodeView interface

Graphs can be complicated data structures. If you want an in depth explanation
of the data-structure used to represent genome graphs, then head [here](notyet).

However, most people should not have to care about the internal structure of the
graph.

To make things as simple as possible, the `NodeView` type provides a single
entry point for node-centric analyses. The `NodeView` wraps a point to a
workspace's graph and contains a node id. A `NodeView` gives you acces to a
node's underlying sequence, the nodes neighbouring nodes in the forward and
backward directions, the reads mapped to a node, and kmer coverage over the node.

To get a `NodeView` of a node, use the `node` method on a `WorkSpace`, providing
a node id number. A positive ID denotes a view of the node traversing it in the
forward (canonical) direction. A negative ID denotes a view of the node,
traversing it in the reverse complement direction.

```julia
julia> n = node(ws, 3)
A view of a graph node (node: 3, graph: sdg):
AAAAAACCTCCGCAACCCCATGTTTTCACATAACTGTTG…GCCATGACCGGCTGGCTGTCAGGCTGTCACTGATAATCA


```
35 changes: 34 additions & 1 deletion docs/src/types/graphs.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,4 +31,37 @@ remove_node!
add_link!
remove_link!
disconnect_node!
```
```

#Code example for testing the error correction and simplifying the graph

In the example below we generate a toy graph from 4-mers which contain a bubble.
As for now, we make use of randomly generated coverage information for
```
using BioSequences
using BioSequenceGraphs

kmerlist = Vector{DNAKmer{4}}([DNAKmer{4}("ATAC"),DNAKmer{4}("TACG"),DNAKmer{4}("TACC"),DNAKmer{4}("ACGA"),DNAKmer{4}("CGAA"),DNAKmer{4}("GAAT"),DNAKmer{4}("AATC"),DNAKmer{4}("ACCA"),DNAKmer{4}("CCAA"),DNAKmer{4}("CAAT")])
sdg = SequenceDistanceGraph(kmerlist,true)
```

The new SequenceDistanceGraph constructor takes as input a list of kmerlist (not necessarily sorted or in canonical form) and a boolean for whether to do error correction or not. Then by using the coverage information generated randomly, simplifies the graph.
The resulting SDG before and after the error correction steps are as follows:

***Before:***
```
SequenceDistanceGraph{BioSequence{DNAAlphabet{2}}}(SDGNode{BioSequence{DNAAlphabet{2}}}[SDGNode{BioSequence{DNAAlphabet{2}}}(AATC, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATAC, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATTCGTA, false), SDGNode{BioSequence{DNAAlphabet{2}}}(ATTGGTA, false)], Array{DistanceGraphLink,1}[[DistanceGraphLink(1, 3, -3), DistanceGraphLink(1, 4, -3)], [DistanceGraphLink(-2, -4, -3), DistanceGraphLink(-2, -3, -3)], [DistanceGraphLink(3, 1, -3), DistanceGraphLink(-3, -2, -3)], [DistanceGraphLink(4, 1, -3), DistanceGraphLink(-4, -2, -3)]])
```

***After:***
```
SequenceDistanceGraph{BioSequence{DNAAlphabet{2}}}(SDGNode{BioSequence{DNAAlphabet{2}}}[SDGNode{BioSequence{DNAAlphabet{2}}}(GATTCGTAC, false)], Array{DistanceGraphLink,1}[[]])
```

So the algorithm first generates a sdg with collapsing all simple paths using the initial kmer list. Then at the next stage the low covered bubble branch is removed and path collapsing is applied one more time.
This results in a sdg with a whole single node, a perfect unitig.


##TODO:

Implement the sdg constructor to take as input a list of reads instead of kmers and generate all kmer + coverage information from them.
8 changes: 8 additions & 0 deletions examples/construct_dbg.jl
Original file line number Diff line number Diff line change
Expand Up @@ -27,3 +27,11 @@ end


## At the end we have generated a dbg with Base.length(BioSequenceGraphs.nodes(dbg))= 742343
<<<<<<< HEAD
=======
start = time(); BioSequenceGraphs.new_deBruijn_Constructor(kmer_set); diff = time()-start
## Finished in 605.3275558948517


## First we test the simple path finder on this dbg
>>>>>>> e837ae91cbfe8ae18f58a8c8082260f9f12361e3
Binary file added graph.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Empty file added graph1.fasta
Empty file.
Loading