Error in tama_collapse: Genome seq is not the same length as query seq #80

husensofteng · 2022-10-09T13:26:29Z

Hi,
I am running the nf-core/isoseq workflow that internally uses tama_collapse and tama_merge.
The issue is that it exits with error at the tama_collapse step for some of the BAM chunks files although it works for most of the other files when I use Ultra for the alignment.

When I try tama_collapse as stand alone on the failed files, I get the following error:

python tama_collapse.py -s sample.chunk13.bam -f genome.fa -p sample.chunk13 -x no_cap -b BAM -a 100 -m 10 -z 100
tc_version_date_2020_12_14
Default collapse exon ends flag will be used: common_ends
Default coverage: 99
Default identity: 85
Default identity calculation method: ident_cov
Default duplicate merge flag: merge_dup
Default splice junction priority: no_priority
Default splice junction error threshold: 10
Default splice junction local density error threshold: 1000
Default simple error symbol for matches is the underscore "_" .
Using BAM format for reading in.
Default log output on
Default run mode original
Default 5 read threshold
time taken since last check:	0:0:0
time taken since beginning:	0:0:0
going through fasta
^[^[time taken since last check:	0:1:9
time taken since beginning:	0:1:9
going through sam file
10596
sam count 5000
sam count 10000
Genome seq is not the same length as query seq
[]
['C', 'T', 'G', 'T', 'A', 'A', 'C', 'A', 'G', 'T', 'G', 'T', 'T', 'T', 'A', 'T', 'T', 'G', 'A', 'T', 'G', 'A', 'T', 'G', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'G', 'G', 'C', 'T', 'C', 'C', 'T', 'G', 'C', 'T', 'G', 'A', 'A', 'G', 'C', 'C', 'C', 'T', 'G', 'G', 'T', 'G', 'G', 'G', 'G', 'A', 'G', 'G', 'G', 'G', 'C', 'A', 'C', 'A', 'G', 'A', 'G', 'C', 'G', 'A', 'G', 'A', 'T', 'G', 'G', 'G', 'G', 'C', 'G', 'T', 'A', 'A', 'T', 'G', 'G', 'A', 'A', 'T', 'G', 'C', 'T', 'T', 'G', 'A', 'A', 'G', 'G', 'C', 'T', 'G', 'C', 'T', 'C', 'C', 'G', 'T', 'G', 'A', 'T', 'G', 'T', 'C', 'G', 'G', 'T', 'C', 'G', 'G', 'A', 'G', 'C', 'T', 'T', 'C', 'C', 'A', 'G', 'A', 'C', 'T', 'A', 'G', 'G', 'C', 'G', 'A', 'G', 'G', 'G', 'C', 'A', 'G', 'G', 'G', 'T', 'G', 'A', 'G', 'G', 'C', 'C', 'T', 'C', 'G', 'G', 'G', 'C', 'A', 'C', 'A', 'C', 'A', 'G', 'C', 'C', 'G', 'G', 'C', 'G', 'C', 'C', 'C', 'A', 'G', 'T', 'C', 'A', 'C', 'C', 'C', 'G', 'G', 'C', 'C', 'C', 'A', 'G', 'A', 'T', 'G', 'G', 'A', 'G', 'G', 'G', 'C', 'G', 'G', 'C', 'C', 'G', 'G', 'G', 'C', 'C', 'C', 'T', 'G', 'C']
50360
0

Could you please share your thoughts on what could be wrong here? any particular flag or sequence that can be wrong in the BAM file ?

thanks a lot

The text was updated successfully, but these errors were encountered:

GenomeRIK · 2022-10-11T16:05:35Z

Hello,

Thank you for using TAMA.

I think what is happening is that there is at least 1 read which is mapping off of the genome. Could you look in the read.txt file to find the read ID and then find it in the BAM file to see if this is the case?

Thank you,
Richard

husensofteng · 2022-10-17T11:40:24Z

Thanks for the reply. my search for the read didn't yield any useful info. I have shared the files in a google folder via email that you should have already received it . I am sure you have more insights on what might be wrong in the reads.txt vs. the BAM file. thanks in advance.

GenomeRIK · 2022-10-17T21:08:12Z

Hello,

I am not seeing the invite for the google folder. Could you send it again?

Or could you copy paste the last 5 lines of the read.txt file and the last 5 lines of the trans_read.bed file?

Thank you,
Richard

husensofteng · 2022-10-18T07:11:53Z

Maybe I don't have your correct email address. My email is my username here + gmail.com, could you send me a hello world :) Thanks a lot!

The chunk13_trans_read.bed file is empty and these are the last few lines in chunk13_read.txt:

m54224_190921_125133/26411409/ccs forward_strand accepted 99.8 99.67 0;3;0;0;2 1523 3S42=1X325=1X1151=
m54224_190921_125133/26804667/ccs forward_strand accepted 99.86 99.81 0;3;0;0;1 2161 3S368=1X1789=
m54224_190921_125133/27198126/ccs forward_strand accepted 99.86 99.77 0;3;0;0;2 2161 3S368=1X1004=1X784=
m54224_190921_125133/26739057/ccs forward_strand accepted 99.84 99.78 0;3;0;0;1 1851 3S58=1X1789=
m54224_190921_125133/25493990/ccs forward_strand accepted 99.84 99.73 0;3;1;0;1 1848 3S54=1X478=1I1311=
m54224_190921_125133/25494052/ccs forward_strand accepted 99.65 99.13 0;4;1;0;5 1145 3S19=1X2=1X389=1I142=1X24=1X187=1X372=1S
m54224_190921_125133/26017888/ccs forward_strand accepted 99.7 99.21 0;3;0;0;5 1007 3S19=1X2=1X531=1X24=1X187=1X236=
m54224_190921_125133/26542188/ccs forward_strand accepted 99.65 99.21 0;4;0;0;5 1144 3S19=1X2=1X531=1X24=1X187=1X372=1S
m54224_190921_125133/26608262/ccs forward_strand accepted 99.74 99.3 0;3;0;0;5 1144 3S19=1X2=1X531=1X24=1X187=1X373=

GenomeRIK · 2022-10-18T11:11:18Z

Hello,

I just sent an email. However, would you be able to copy paste the line in the bam file corresponding to this read "m54224_190921_125133/26608262/ccs" and 5 lines after?

Thank you,
Richard

husensofteng · 2022-10-19T10:24:59Z

Thanks, I have now shared the google folder, and below you find the read info from the BAM file.

@sguizard, who is the developer of the isoseq workflow, has detected an outbound read reported by uLTRA. The issue is now reported on uLTRA. Therefore, we can wait since it might be related to uLTRA and not tama.

m54224_190921_125133/26608262/ccs 0 chrM 14747 60 3S19=1X2=1X531=1X24=1X187=1X373= * 0 0 GGGATGACCCCAATACGCAAAATTAGCCCCCTAATAAAATTAATTAACCACTCATTCATCGACCTCCCCACCCCATCCAACATCTCCGCATGATGAAACTTCGGCTCACTCCTTGGCGCCTGCCTGATCCTCCAAATCACCACAGGACTATTCCTAGCCATGCACTACTCACCAGACGCCTCAACCGCCTTTTCATCAATCGCCCACATCACTCGAGACGTAAATTATGGCTGAATCATCCGCTACCTTCACGCCAATGGCGCCTCAATATTCTTTATCTGCCTCTTCCTACACATCGGGCGAGGCCTATATTACGGATCATTTCTCTACTCAGAAACCTGAAACATCGGCATTATCCTCCTGCTTGCAACTATAGCAACAGCCTTCATAGGCTATGTCCTCCCGTGAGGCCAAATATCATTCTGAGGGGCCACAGTAATTACAAACTTACTATCCGCCATCCCATACATTGGGACAGACCTAGTTCAATGAATCTGAGGAGGCTACTCAGTAGACAGTCCCACCCTCACACGATTCTTTACCTTTCACTTCATCTTACCCTTCATTATTGCAGCCCTAGCAGCACTCCACCTCCTATTCTTGCACGAAACGGGATCAAACAACCCCCTAGGAATCACCTCCCATTCCGATAAAATCACCTTCCACCCTTACTACACAATCAAAGACGCCCTCGGCTTACTTCTCTTCCTTCTCTCCTTAATGACATTAACACTATTCTCACCAGACCTCCTAGGCGACCCAGACAATTACACCCTAGCCAACCCCTTAAACACCCCTCCCCACATCAAGCCCGAATGATATTTCCTATTCGCCTACACAATTCTCCGATCCGTCCCTAACAAACTAGGAGGCGTCCTTGCCCTATTACTATCCATCCTCATCCTAGCAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAATCGGAGGACAACCAGTAAGCTACCCTTTTACCATCATTGGACAAGTAGCATCCGTACTATACTTCACAACAATCCTAATCCTAATACCAACTATCTCCCTAATTGAAAACAAAATACTCAAATGGGCCT * NM:i:5 ms:i:1126 AS:i:1126 nn:i:0 tp:A:P cm:i:349 s1:i:1112 s2:i:386 de:f:0.0044 rl:i:0

sguizard · 2022-10-19T10:57:15Z

Hi Richard,

The coordinates of the read are after the end of the chromosome, therefore the extracted sequence is empty.
Obviously, the genomic seq and the query are not the same length and TAMA stop it's execution.
It's a normal behaviour from TAMA.

The problem comes from uLTRA, as @husensofteng explained above.

husensofteng · 2022-10-27T09:01:48Z

The issue was related to the files generated by ULTRA (issue #17) and it has now been fixed.

husensofteng changed the title ~~Error~~ Error in tama_collapse: Genome seq is not the same length as query seq Oct 9, 2022

husensofteng closed this as completed Oct 27, 2022

GenomeRIK mentioned this issue Jan 26, 2023

tama_merge IndexError: list index out of range #94

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error in tama_collapse: Genome seq is not the same length as query seq #80

Error in tama_collapse: Genome seq is not the same length as query seq #80

husensofteng commented Oct 9, 2022 •

edited

Loading

GenomeRIK commented Oct 11, 2022

husensofteng commented Oct 17, 2022

GenomeRIK commented Oct 17, 2022

husensofteng commented Oct 18, 2022

GenomeRIK commented Oct 18, 2022

husensofteng commented Oct 19, 2022

sguizard commented Oct 19, 2022

husensofteng commented Oct 27, 2022 •

edited

Loading

Error in tama_collapse: Genome seq is not the same length as query seq #80

Error in tama_collapse: Genome seq is not the same length as query seq #80

Comments

husensofteng commented Oct 9, 2022 • edited Loading

GenomeRIK commented Oct 11, 2022

husensofteng commented Oct 17, 2022

GenomeRIK commented Oct 17, 2022

husensofteng commented Oct 18, 2022

GenomeRIK commented Oct 18, 2022

husensofteng commented Oct 19, 2022

sguizard commented Oct 19, 2022

husensofteng commented Oct 27, 2022 • edited Loading

husensofteng commented Oct 9, 2022 •

edited

Loading

husensofteng commented Oct 27, 2022 •

edited

Loading