Placement of reads from orphan contigs #1661

Open
snurk opened this issue Mar 31, 2020 · 4 comments
@snurk
Contributor

snurk commented Mar 31, 2020

  • Options seem to be fix or disable.
  • The current implementation has trouble choosing the correct placement.
  • It is also somehow placing bad reads (e.g. lopsided ones).
  • How can they be essential for contig breaking if bubbles are not?
@brianwalenz
Member

The original incorrect placement was:

asm.010.mergeOrphans.thr000.num000.log:  First read 4490490 - placed 8 times
asm.010.mergeOrphans.thr000.num000.log:    read 4490490 at  14192489-14203923 
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14192489-14203923 (cov 0.70221 erate 0.0047)
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14204677-14216097 (cov 0.42294 erate 0.0064)
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14110050-14098620 (cov 0.04678 erate 0.0019)
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14122234-14110801 (cov 1.00000 erate 0.0022)
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14134447-14123039 (cov 1.00000 erate 0.0016)
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14146654-14135248 (cov 1.00000 erate 0.0001) *
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14158864-14147452 (cov 1.00000 erate 0.0032)
asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14171062-14159649 (cov 0.71801 erate 0.0034)
asm.010.mergeOrphans.thr000.num000.log:    read 4851148 at  14192813-14205035 
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14192813-14205035 (cov 0.84395 erate 0.0052)
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14205005-14217212 (cov 0.36899 erate 0.0065)
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14121901-14109678 (cov 1.00000 erate 0.0020)
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14134121-14121923 (cov 1.00000 erate 0.0014)
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14146329-14134136 (cov 1.00000 erate 0.0001) *
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14158538-14146334 (cov 1.00000 erate 0.0026)
asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14170737-14158536 (cov 0.76285 erate 0.0032)
asm.010.mergeOrphans.thr000.num000.log:   Last read 1584251 - placed 6 times
asm.010.mergeOrphans.thr000.num000.log:    read 1584251 at  14194309-14206489 
asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14194309-14206489 (cov 0.97093 erate 0.0053)
asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14120412-14108229 (cov 1.00000 erate 0.0016)
asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14132627-14120468 (cov 0.99111 erate 0.0022)
asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14144832-14132678 (cov 1.00000 erate 0.0000) *
asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14157047-14144879 (cov 1.00000 erate 0.0019)
asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14169245-14157090 (cov 0.85624 erate 0.0043)
Placing orphan 11458 (length 13966) into tig 11435 at position 14192489-14206489 (length 14000):
    read 4490490 at  14192489-14203923
    read 4851148 at  14192813-14205035
    read 1584251 at  14194309-14206489
  out of 3 reads, found 3 reads placed, 2 terminal reads

My interpretation of that:

Best placements:
  asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14146654-14135248 (cov 1.00000 erate 0.0001) *
  asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14146329-14134136 (cov 1.00000 erate 0.0001) *
  asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14144832-14132678 (cov 1.00000 erate 0.0000) *

read  4490490 -> 466..-352..    <----
read  4851148 -> 463..-341..   <----
read  1584251 -> 448..-326..  <----

placed from 14132678 to 14146654 = 13976 bases -- tig length 13966 bases

Used placement:  (just picked the first placement of every read??)
  asm.010.mergeOrphans.thr027.num000.log:tig  11458 read  4490490 -> tig  11435 ( 51664 reads) at 14192489-14203923 (cov 0.70221 erate 0.0047)
  asm.010.mergeOrphans.thr028.num000.log:tig  11458 read  4851148 -> tig  11435 ( 51664 reads) at 14192813-14205035 (cov 0.84395 erate 0.0052)
  asm.010.mergeOrphans.thr010.num000.log:tig  11458 read  1584251 -> tig  11435 ( 51664 reads) at 14194309-14206489 (cov 0.97093 erate 0.0053)

read  4490490 -> 1924..-2039..   ---->
read  4851148 -> 1928..-2050..    ---->
read  1584251 -> 1943..-2064..     ---->

In particular, this found only one possible place to put the orphan, and so put it at that one (incorrect) place.
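For reference, the starred "best" placements can be reproduced by ranking each read's candidates by coverage first and error rate second. This is only a sketch of that ranking, not bogart's code: `best_placement` and the tuple layout are hypothetical names made up for illustration, and the numbers are copied from the log lines above for the first read (4490490) and the last read (1584251).

```python
# Candidate placements copied from the log above: (begin, end, cov, erate).
# The function name and tuple layout are hypothetical, not bogart's.
candidates = {
    4490490: [
        (14192489, 14203923, 0.70221, 0.0047),
        (14204677, 14216097, 0.42294, 0.0064),
        (14110050, 14098620, 0.04678, 0.0019),
        (14122234, 14110801, 1.00000, 0.0022),
        (14134447, 14123039, 1.00000, 0.0016),
        (14146654, 14135248, 1.00000, 0.0001),
        (14158864, 14147452, 1.00000, 0.0032),
        (14171062, 14159649, 0.71801, 0.0034),
    ],
    1584251: [
        (14194309, 14206489, 0.97093, 0.0053),
        (14120412, 14108229, 1.00000, 0.0016),
        (14132627, 14120468, 0.99111, 0.0022),
        (14144832, 14132678, 1.00000, 0.0000),
        (14157047, 14144879, 1.00000, 0.0019),
        (14169245, 14157090, 0.85624, 0.0043),
    ],
}


def best_placement(placements):
    """Pick the placement with the highest coverage, then the lowest error rate."""
    return min(placements, key=lambda p: (-p[2], p[3]))


first = best_placement(candidates[4490490])
last = best_placement(candidates[1584251])

# Span covered by the two terminal reads, regardless of orientation.
coords = first[:2] + last[:2]
span = max(coords) - min(coords)
print(first[:2], last[:2], span)
```

With these inputs the two terminal reads land at 14146654-14135248 and 14144832-14132678, spanning 13976 bases, which matches the hand calculation above against the 13966 base tig.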

The fixed version now reports multiple possible placements, and will thus put each read where it thinks it best fits (not shown in the logging).

Processing potential orphan 11460 of length 13966 bp with 3 reads

Find anchors for orphan 11460:
  First read 4490490 - placed 8 times
   Last read 1584251 - placed 6 times
     Internal reads - placed      1/1      dovetail reads
                    - placed      0/0      contained reads

  Intervals (first read):
    tig    11437  14192489-14206455  ->
    tig    11437  14204677-14218643  ->
    tig    11437  14096084-14110050  <-
    tig    11437  14108268-14122234  <-
    tig    11437  14120481-14134447  <-
    tig    11437  14132688-14146654  <-
    tig    11437  14144898-14158864  <-
    tig    11437  14157096-14171062  <-

  Intervals (last read):
    tig    11437  14192523-14206489  ->
    tig    11437  14108229-14122195  <-
    tig    11437  14120468-14134434  <-
    tig    11437  14132678-14146644  <-
    tig    11437  14144879-14158845  <-
    tig    11437  14157090-14171056  <-

Finding intervals for orphan 11460 placed in tig 11437.
   14108229-14110050     13.0% of orphan length - first read at  14098620-14110050  last read at  14108229-14120412   TOO SMALL
   14120468-14110050    -74.6% of orphan length - first read at  14098620-14110050  last read at  14120468-14132627   MIS-ORDER
   14132678-14110050   -162.0% of orphan length - first read at  14098620-14110050  last read at  14132678-14144832   MIS-ORDER
   14144879-14110050   -249.4% of orphan length - first read at  14098620-14110050  last read at  14144879-14157047   MIS-ORDER
   14157090-14110050   -336.8% of orphan length - first read at  14098620-14110050  last read at  14157090-14169245   MIS-ORDER
   14108229-14122234    100.3% of orphan length - first read at  14110801-14122234  last read at  14108229-14120412   SUCCESS!
   14120468-14122234     12.6% of orphan length - first read at  14110801-14122234  last read at  14120468-14132627   TOO SMALL
   14132678-14122234    -74.8% of orphan length - first read at  14110801-14122234  last read at  14132678-14144832   MIS-ORDER
   14144879-14122234   -162.1% of orphan length - first read at  14110801-14122234  last read at  14144879-14157047   MIS-ORDER
   14157090-14122234   -249.6% of orphan length - first read at  14110801-14122234  last read at  14157090-14169245   MIS-ORDER
   14108229-14134447    187.7% of orphan length - first read at  14123039-14134447  last read at  14108229-14120412   TOO LARGE
   14120468-14134447    100.1% of orphan length - first read at  14123039-14134447  last read at  14120468-14132627   SUCCESS!
   14132678-14134447     12.7% of orphan length - first read at  14123039-14134447  last read at  14132678-14144832   TOO SMALL
   14144879-14134447    -74.7% of orphan length - first read at  14123039-14134447  last read at  14144879-14157047   MIS-ORDER
   14157090-14134447   -162.1% of orphan length - first read at  14123039-14134447  last read at  14157090-14169245   MIS-ORDER
   14108229-14146654    275.1% of orphan length - first read at  14135248-14146654  last read at  14108229-14120412   TOO LARGE
   14120468-14146654    187.5% of orphan length - first read at  14135248-14146654  last read at  14120468-14132627   TOO LARGE
   14132678-14146654    100.1% of orphan length - first read at  14135248-14146654  last read at  14132678-14144832   SUCCESS!
   14144879-14146654     12.7% of orphan length - first read at  14135248-14146654  last read at  14144879-14157047   TOO SMALL
   14157090-14146654    -74.7% of orphan length - first read at  14135248-14146654  last read at  14157090-14169245   MIS-ORDER
   14108229-14158864    362.6% of orphan length - first read at  14147452-14158864  last read at  14108229-14120412   TOO LARGE
   14120468-14158864    274.9% of orphan length - first read at  14147452-14158864  last read at  14120468-14132627   TOO LARGE
   14132678-14158864    187.5% of orphan length - first read at  14147452-14158864  last read at  14132678-14144832   TOO LARGE
   14144879-14158864    100.1% of orphan length - first read at  14147452-14158864  last read at  14144879-14157047   SUCCESS!
   14157090-14158864     12.7% of orphan length - first read at  14147452-14158864  last read at  14157090-14169245   TOO SMALL
   14108229-14171062    449.9% of orphan length - first read at  14159649-14171062  last read at  14108229-14120412   TOO LARGE
   14120468-14171062    362.3% of orphan length - first read at  14159649-14171062  last read at  14120468-14132627   TOO LARGE
   14132678-14171062    274.8% of orphan length - first read at  14159649-14171062  last read at  14132678-14144832   TOO LARGE
   14144879-14171062    187.5% of orphan length - first read at  14159649-14171062  last read at  14144879-14157047   TOO LARGE
   14157090-14171062    100.0% of orphan length - first read at  14159649-14171062  last read at  14157090-14169245   SUCCESS!
   14192489-14206489    100.2% of orphan length - first read at  14192489-14203923  last read at  14194309-14206489   SUCCESS!
   14204677-14206489     13.0% of orphan length - first read at  14204677-14216097  last read at  14194309-14206489   TOO SMALL

Found 6 target locations

Removing duplicate placements.

Placing orphan 11460 (length 13966) into tig 11437 at position 14108229-14122234 (length 14005):
    read 4490490 at  14122234-14110801 
    read 4851148 at  14121901-14109678 
    read 1584251 at  14120412-14108229 
  out of 3 reads, found 3 reads placed, 2 terminal reads

Placing orphan 11460 (length 13966) into tig 11437 at position 14120468-14134447 (length 13979):
    read 4490490 at  14134447-14123039 
    read 4851148 at  14134121-14121923 
    read 1584251 at  14132627-14120468 
  out of 3 reads, found 3 reads placed, 2 terminal reads

Placing orphan 11460 (length 13966) into tig 11437 at position 14132678-14146654 (length 13976):
    read 4490490 at  14146654-14135248 
    read 4851148 at  14146329-14134136 
    read 1584251 at  14144832-14132678 
  out of 3 reads, found 3 reads placed, 2 terminal reads

Placing orphan 11460 (length 13966) into tig 11437 at position 14144879-14158864 (length 13985):
    read 4490490 at  14158864-14147452 
    read 4851148 at  14158538-14146334 
    read 1584251 at  14157047-14144879 
  out of 3 reads, found 3 reads placed, 2 terminal reads

Placing orphan 11460 (length 13966) into tig 11437 at position 14157090-14171062 (length 13972):
    read 4490490 at  14171062-14159649 
    read 4851148 at  14170737-14158536 
    read 1584251 at  14169245-14157090 
  out of 3 reads, found 3 reads placed, 2 terminal reads

Placing orphan 11460 (length 13966) into tig 11437 at position 14192489-14206489 (length 14000):
    read 4490490 at  14192489-14203923 
    read 4851148 at  14192813-14205035 
    read 1584251 at  14194309-14206489 
  out of 3 reads, found 3 reads placed, 2 terminal reads

Result:
  tig    11460 of length    13966 with      3 reads      0 - MULTIPLY PLACED ORPHAN
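The interval labels in that log can be sketched as a simple size test on each (last read, first read) pairing. This is a rough illustration rather than the actual bogart logic: `classify_interval` is a hypothetical name, and the `min_frac` / `max_frac` bounds are assumed values chosen only to be consistent with the labels printed in the logs in this issue.

```python
def classify_interval(begin, end, orphan_length, min_frac=0.75, max_frac=1.25):
    """Label a candidate interval by its size relative to the orphan.

    min_frac and max_frac are assumptions for illustration; the real
    thresholds used by bogart are not stated in this issue.
    """
    frac = (end - begin) / orphan_length
    if frac < 0:
        label = "MIS-ORDER"   # the two anchor reads are in the wrong order
    elif frac < min_frac:
        label = "TOO SMALL"
    elif frac > max_frac:
        label = "TOO LARGE"
    else:
        label = "SUCCESS!"
    return frac, label


# Four intervals copied from the log for orphan 11460 (length 13966).
intervals = [
    (14108229, 14110050),
    (14120468, 14110050),
    (14108229, 14122234),
    (14108229, 14134447),
]

labels = []
for begin, end in intervals:
    frac, label = classify_interval(begin, end, 13966)
    labels.append(label)
    print(f"{begin}-{end} {100 * frac:7.1f}% of orphan length   {label}")
```

The percentages work out to the same values as in the log (13.0%, -74.6%, 100.3%, 187.7%), so at least the size column is just the interval length divided by the orphan length.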

@skoren
Member

skoren commented Apr 2, 2020

There's still some weirdness for multiply placed orphans I think:

Find anchors for orphan 808:
  First read    307 - placed 2 times
   Last read  29558 - placed 4 times
     Internal reads - placed      0/0      dovetail reads
                    - placed      0/0      contained reads

  Intervals (first read):
    tig      319    496567-513521    ->
    tig      319    506952-523906    ->

  Intervals (last read):
    tig      319    493113-510067    ->
    tig      319    496571-513525    ->
    tig      319    506957-523911    ->
    tig      319    527650-544604    ->

Finding intervals for orphan 808 placed in tig 319.
     496567-510067       79.6% of orphan length - first read at    496567-511382    last read at    493901-510067     SUCCESS!
     496567-513525      100.0% of orphan length - first read at    496567-511382    last read at    497359-513525     SUCCESS!
     496567-523911      161.3% of orphan length - first read at    496567-511382    last read at    507744-523911     TOO LARGE
     506952-510067       18.4% of orphan length - first read at    506952-521768    last read at    493901-510067     TOO SMALL
     506952-513525       38.8% of orphan length - first read at    506952-521768    last read at    497359-513525     TOO SMALL
     506952-523911      100.0% of orphan length - first read at    506952-521768    last read at    507744-523911     SUCCESS!

Found 3 target locations

Removing duplicate placements.

Placing orphan 808 (length 16954) into tig 319 at position 496567-510067 (length 13500):
  out of 2 reads, found 0 reads placed, 0 terminal reads

Placing orphan 808 (length 16954) into tig 319 at position 496567-513525 (length 16958):
    read     307 at    496567-511382
    read   29558 at    513525-497359
  out of 2 reads, found 2 reads placed, 2 terminal reads

Placing orphan 808 (length 16954) into tig 319 at position 506952-523911 (length 16959):
    read     307 at    506952-521768
    read   29558 at    523911-507744
  out of 2 reads, found 2 reads placed, 2 terminal reads

Result:
  tig      808 of length    16954 with      2 reads 1701978220 - MULTIPLY PLACED ORPHAN

The reads end up at positions

307	319	496567	511382
29558	319	510067	493901

However, that placement of read 29558 was supposed to be subsumed by the better one that was tested (100.0% of orphan length vs 79.6%). The latter placement is clearly better too:

pRUO()--   placements[4] - PLACE READ 29558 in tig 319 at 510067,493901 -- verified 510067,508144 -- covered 0,1919 11.9% -- errors 0.00 aligned 1919 novl 1
pRUO()--   placements[5] - PLACE READ 29558 in tig 319 at 513525,497359 -- verified 513525,497359 -- covered 0,16162 100.0% -- errors 0.00 aligned 113285 novl 18
tig    808 read    29558 -> tig    319 ( 18911 reads) at   510067-493901   (cov 0.11874 erate 0.0000)
tig    808 read    29558 -> tig    319 ( 18911 reads) at   513525-497359   (cov 1.00000 erate 0.0000)

So how does it end up using the placement with only about 12% coverage anyway?

@skoren
Member

skoren commented Apr 2, 2020

It just picks the first one at equal coverage. I suggest orphans should be required to fully place their reads; otherwise they're bubbles, not orphans.
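A minimal sketch of what a "reads must fully place" check could look like, using the two candidate placements of read 29558 from the log in the previous comment. `fully_placed` and the `min_cov` cutoff are assumptions for illustration, not existing bogart functions or parameters.

```python
def fully_placed(read_coverages, min_cov=0.99):
    """Accept a target only if every read is covered end to end.

    read_coverages maps read id -> fraction of the read covered by the
    placement. min_cov is an assumed cutoff, not a bogart parameter.
    """
    return all(cov >= min_cov for cov in read_coverages.values())


# Coverage of read 29558 at its two candidate positions (from the log).
partial_target = {29558: 0.11874}  # placed at 510067-493901
full_target = {29558: 1.00000}     # placed at 513525-497359

print(fully_placed(partial_target))  # rejected under this rule
print(fully_placed(full_target))     # accepted under this rule
```

Under a rule like this the 496567-510067 target would be dropped before duplicate removal, so it could not win by being first.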

There is also a lot of code duplication in these checks, present in at least bogart/AS_BAT_MergeOrphans.C, bogart/AS_BAT_AssemblyGraph.C, and bogart/AS_BAT_PlaceContains.C; it would be good to refactor.

@skoren
Member

skoren commented Apr 2, 2020

Lastly, what is causing these orphans to arise? We have orphans at 100% identity that are 100% covered by a larger tig. Example:

tig    430 read      609 -> tig    403 ( 17975 reads) at 12639072-12656400 (cov 1.00000 erate 0.0000)
tig    430 read     4769 -> tig    403 ( 17975 reads) at 12636091-12651262 (cov 1.00000 erate 0.0000)
tig    430 read    47157 -> tig    403 ( 17975 reads) at 12638286-12653990 (cov 1.00000 erate 0.0000)
tig    430 read   100025 -> tig    403 ( 17975 reads) at 12640074-12663612 (cov 1.00000 erate 0.0000)

Placing orphan 430 (length 27513) into tig 403 at position 12636091-12663612 (length 27521):
    read    4769 at  12636091-12651262
    read   47157 at  12638286-12653990
    read     609 at  12639072-12656400
    read  100025 at  12640074-12663612
  out of 4 reads, found 4 reads placed, 2 terminal reads

Result:
  tig      430 of length    27513 with      4 reads 1701978220 - UNIQUELY PLACED ORPHAN

Why weren't these reads in the path originally?

@skoren skoren removed this from the v2.1 milestone Jun 19, 2020