Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in tama_collapse: Genome seq is not the same length as query seq #80

Closed
husensofteng opened this issue Oct 9, 2022 · 8 comments
Closed

Comments

@husensofteng
Copy link

husensofteng commented Oct 9, 2022

Hi,
I am running the nf-core/isoseq workflow that internally uses tama_collapse and tama_merge.
The issue is that it exits with error at the tama_collapse step for some of the BAM chunks files although it works for most of the other files when I use Ultra for the alignment.

When I try tama_collapse as stand alone on the failed files, I get the following error:

python tama_collapse.py -s sample.chunk13.bam -f genome.fa -p sample.chunk13 -x no_cap -b BAM -a 100 -m 10 -z 100
tc_version_date_2020_12_14
Default collapse exon ends flag will be used: common_ends
Default coverage: 99
Default identity: 85
Default identity calculation method: ident_cov
Default duplicate merge flag: merge_dup
Default splice junction priority: no_priority
Default splice junction error threshold: 10
Default splice junction local density error threshold: 1000
Default simple error symbol for matches is the underscore "_" .
Using BAM format for reading in.
Default log output on
Default run mode original
Default 5 read threshold
time taken since last check:	0:0:0
time taken since beginning:	0:0:0
going through fasta
^[^[time taken since last check:	0:1:9
time taken since beginning:	0:1:9
going through sam file
10596
sam count 5000
sam count 10000
Genome seq is not the same length as query seq
[]
['C', 'T', 'G', 'T', 'A', 'A', 'C', 'A', 'G', 'T', 'G', 'T', 'T', 'T', 'A', 'T', 'T', 'G', 'A', 'T', 'G', 'A', 'T', 'G', 'A', 'G', 'T', 'C', 'C', 'A', 'G', 'G', 'G', 'C', 'T', 'C', 'C', 'T', 'G', 'C', 'T', 'G', 'A', 'A', 'G', 'C', 'C', 'C', 'T', 'G', 'G', 'T', 'G', 'G', 'G', 'G', 'A', 'G', 'G', 'G', 'G', 'C', 'A', 'C', 'A', 'G', 'A', 'G', 'C', 'G', 'A', 'G', 'A', 'T', 'G', 'G', 'G', 'G', 'C', 'G', 'T', 'A', 'A', 'T', 'G', 'G', 'A', 'A', 'T', 'G', 'C', 'T', 'T', 'G', 'A', 'A', 'G', 'G', 'C', 'T', 'G', 'C', 'T', 'C', 'C', 'G', 'T', 'G', 'A', 'T', 'G', 'T', 'C', 'G', 'G', 'T', 'C', 'G', 'G', 'A', 'G', 'C', 'T', 'T', 'C', 'C', 'A', 'G', 'A', 'C', 'T', 'A', 'G', 'G', 'C', 'G', 'A', 'G', 'G', 'G', 'C', 'A', 'G', 'G', 'G', 'T', 'G', 'A', 'G', 'G', 'C', 'C', 'T', 'C', 'G', 'G', 'G', 'C', 'A', 'C', 'A', 'C', 'A', 'G', 'C', 'C', 'G', 'G', 'C', 'G', 'C', 'C', 'C', 'A', 'G', 'T', 'C', 'A', 'C', 'C', 'C', 'G', 'G', 'C', 'C', 'C', 'A', 'G', 'A', 'T', 'G', 'G', 'A', 'G', 'G', 'G', 'C', 'G', 'G', 'C', 'C', 'G', 'G', 'G', 'C', 'C', 'C', 'T', 'G', 'C']
50360
0

Could you please share your thoughts on what could be wrong here? any particular flag or sequence that can be wrong in the BAM file ?

thanks a lot

@husensofteng husensofteng changed the title Error Error in tama_collapse: Genome seq is not the same length as query seq Oct 9, 2022
@GenomeRIK
Copy link
Owner

Hello,

Thank you for using TAMA.

I think what is happening is that there is at least 1 read which is mapping off of the genome. Could you look in the read.txt file to find the read ID and then find it in the BAM file to see if this is the case?

Thank you,
Richard

@husensofteng
Copy link
Author

Thanks for the reply. my search for the read didn't yield any useful info. I have shared the files in a google folder via email that you should have already received it . I am sure you have more insights on what might be wrong in the reads.txt vs. the BAM file. thanks in advance.

@GenomeRIK
Copy link
Owner

Hello,

I am not seeing the invite for the google folder. Could you send it again?

Or could you copy paste the last 5 lines of the read.txt file and the last 5 lines of the trans_read.bed file?

Thank you,
Richard

@husensofteng
Copy link
Author

Maybe I don't have your correct email address. My email is my username here + gmail.com, could you send me a hello world :) Thanks a lot!

The chunk13_trans_read.bed file is empty and these are the last few lines in chunk13_read.txt:

m54224_190921_125133/26411409/ccs forward_strand accepted 99.8 99.67 0;3;0;0;2 1523 3S42=1X325=1X1151=
m54224_190921_125133/26804667/ccs forward_strand accepted 99.86 99.81 0;3;0;0;1 2161 3S368=1X1789=
m54224_190921_125133/27198126/ccs forward_strand accepted 99.86 99.77 0;3;0;0;2 2161 3S368=1X1004=1X784=
m54224_190921_125133/26739057/ccs forward_strand accepted 99.84 99.78 0;3;0;0;1 1851 3S58=1X1789=
m54224_190921_125133/25493990/ccs forward_strand accepted 99.84 99.73 0;3;1;0;1 1848 3S54=1X478=1I1311=
m54224_190921_125133/25494052/ccs forward_strand accepted 99.65 99.13 0;4;1;0;5 1145 3S19=1X2=1X389=1I142=1X24=1X187=1X372=1S
m54224_190921_125133/26017888/ccs forward_strand accepted 99.7 99.21 0;3;0;0;5 1007 3S19=1X2=1X531=1X24=1X187=1X236=
m54224_190921_125133/26542188/ccs forward_strand accepted 99.65 99.21 0;4;0;0;5 1144 3S19=1X2=1X531=1X24=1X187=1X372=1S
m54224_190921_125133/26608262/ccs forward_strand accepted 99.74 99.3 0;3;0;0;5 1144 3S19=1X2=1X531=1X24=1X187=1X373=

@GenomeRIK
Copy link
Owner

Hello,

I just sent an email. However, would you be able to copy paste the line in the bam file corresponding to this read "m54224_190921_125133/26608262/ccs" and 5 lines after?

Thank you,
Richard

@husensofteng
Copy link
Author

Thanks, I have now shared the google folder, and below you find the read info from the BAM file.

@sguizard, who is the developer of the isoseq workflow, has detected an outbound read reported by uLTRA. The issue is now reported on uLTRA. Therefore, we can wait since it might be related to uLTRA and not tama.

m54224_190921_125133/26608262/ccs 0 chrM 14747 60 3S19=1X2=1X531=1X24=1X187=1X373= * 0 0 GGGATGACCCCAATACGCAAAATTAGCCCCCTAATAAAATTAATTAACCACTCATTCATCGACCTCCCCACCCCATCCAACATCTCCGCATGATGAAACTTCGGCTCACTCCTTGGCGCCTGCCTGATCCTCCAAATCACCACAGGACTATTCCTAGCCATGCACTACTCACCAGACGCCTCAACCGCCTTTTCATCAATCGCCCACATCACTCGAGACGTAAATTATGGCTGAATCATCCGCTACCTTCACGCCAATGGCGCCTCAATATTCTTTATCTGCCTCTTCCTACACATCGGGCGAGGCCTATATTACGGATCATTTCTCTACTCAGAAACCTGAAACATCGGCATTATCCTCCTGCTTGCAACTATAGCAACAGCCTTCATAGGCTATGTCCTCCCGTGAGGCCAAATATCATTCTGAGGGGCCACAGTAATTACAAACTTACTATCCGCCATCCCATACATTGGGACAGACCTAGTTCAATGAATCTGAGGAGGCTACTCAGTAGACAGTCCCACCCTCACACGATTCTTTACCTTTCACTTCATCTTACCCTTCATTATTGCAGCCCTAGCAGCACTCCACCTCCTATTCTTGCACGAAACGGGATCAAACAACCCCCTAGGAATCACCTCCCATTCCGATAAAATCACCTTCCACCCTTACTACACAATCAAAGACGCCCTCGGCTTACTTCTCTTCCTTCTCTCCTTAATGACATTAACACTATTCTCACCAGACCTCCTAGGCGACCCAGACAATTACACCCTAGCCAACCCCTTAAACACCCCTCCCCACATCAAGCCCGAATGATATTTCCTATTCGCCTACACAATTCTCCGATCCGTCCCTAACAAACTAGGAGGCGTCCTTGCCCTATTACTATCCATCCTCATCCTAGCAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAATCGGAGGACAACCAGTAAGCTACCCTTTTACCATCATTGGACAAGTAGCATCCGTACTATACTTCACAACAATCCTAATCCTAATACCAACTATCTCCCTAATTGAAAACAAAATACTCAAATGGGCCT * NM:i:5 ms:i:1126 AS:i:1126 nn:i:0 tp:A:P cm:i:349 s1:i:1112 s2:i:386 de:f:0.0044 rl:i:0

@sguizard
Copy link

Hi Richard,

The coordinates of the read are after the end of the chromosome, therefore the extracted sequence is empty.
Obviously, the genomic seq and the query are not the same length and TAMA stop it's execution.
It's a normal behaviour from TAMA.

The problem comes from uLTRA, as @husensofteng explained above.

@husensofteng
Copy link
Author

husensofteng commented Oct 27, 2022

The issue was related to the files generated by ULTRA (issue #17) and it has now been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants