Sanger sequences VER1=4.0 does not get correct error probability #7
Comments
Thank you for using Jillion! I'd be happy to help. Are you able to provide an example of a Sanger sequence with this "VER1=4.0" and the equivalent VER1=3.0? The Sanger parsing code is the oldest code in Jillion; I wrote it about 13 years ago, and I'm not sure there was a version 4.0 back then. Thanks again! |
Dear Dan,
Thanks for your quick response and your help! I have added two ab1 files:
5565810.ab1 => this is the VER1=4.0 file which gives wrong error probabilities
XF1359_0…ab1 => this one is read in just fine
If you need more examples, just let me know.
Kind regards,
Robin van Schendel
|
I hope you received the .ab1 files. I am not sure if GitHub keeps these attachments. If you did not receive them could you send me your e-mail address so I can mail them to you?
Thanks again,
Robin
|
It doesn't look like it. Looking at the GitHub help on how to attach files to issues, it only supports a few file formats. Can you please try zipping the 2 files and then attaching the zip file to this ticket? (You might need to use the web GUI.) Thanks |
Hi Dan, I have uploaded the test files and I hope the sequence files are now testable for you |
got it thanks |
I think I found the problem. It will take a bit of time to fix; I'll try to get it done over the weekend.

Long story short: ABI1 files store 2 versions of the trace data inside the chromatogram file, an original version and a current version. Often they are identical, but sometimes a different basecaller is used or manual edits are made, and they become different. ABI files store this information in datablocks that are stored as binary blobs inside the file, with an index for where each entry is inside the file.

The Jillion ABI1 parser was based on data sequenced at TIGR/JCVI, which produced over 1 million Sanger reads over the course of many years. The parser took a few shortcuts to parse this data because I'm not aware of a real file specification for it, although many parts have been reverse engineered since the 90s. I think the TIGR sequencing center happened to always have the datablocks for the original vs current version in the same specific order. For whatever reason, this "version 4" not only has large differences between the original and current sequence, but some of the datablocks are also in a different order, so in the example you attached the original and current qualities are swapped. There may be other differences too; I haven't fully investigated.

I will fix the parser to correctly detect original vs current, but in the meantime, if you need a workaround, you can access the original version of the traces:
I would suggest checking whether the current qualities have a lot of 0s in them and, if so, getting the original qualities instead, etc. Thanks! I'll let you know when the fix is tested and pushed. PS: May I include these files in the Jillion test suite to prevent regressions? |
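The suggested check can be sketched in a few lines. This is an illustration only, written in Python rather than against the Jillion API: the function name `pick_qualities`, the 50% threshold, and the plain integer lists are all assumptions, and it addresses only the quality values, not any difference between the original and current base calls.

```python
from typing import List, Sequence


def pick_qualities(current: Sequence[int],
                   original: Sequence[int],
                   max_zero_fraction: float = 0.5) -> List[int]:
    """Return the current qualities unless they look blanked out.

    "Blanked out" is taken to mean that more than `max_zero_fraction`
    of the values are 0; the threshold is an arbitrary assumption.
    """
    if not current:
        # Nothing to inspect, so fall back to the original values.
        return list(original)
    zero_fraction = sum(1 for q in current if q == 0) / len(current)
    if zero_fraction > max_zero_fraction:
        return list(original)
    return list(current)
```

For example, `pick_qualities([0, 0, 0, 40], [30, 31, 32, 33])` falls back to the original list because three of the four current values are 0.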
You can add the files to the test suite. I tested your solution and unfortunately it does not work. If you open, for example, the 5565810 file in SnapGene Viewer (free software), you see that the sequence corresponds to the current version. The original sequence that I get with your workaround looks quite different from that. The only difference is that in SnapGene Viewer the quality values seem to deviate a lot from those of the current (and original) version. I think SnapGene Viewer is correct in terms of sequence and quality. So how does it manage to obtain different quality values from this file? |
Jillion had a bug that swapped the qualities of the original sequence with those of the current one. Pushing the fix now |
Everything has been fixed and committed. If you pull the latest version from the repository and build it, it should work. If you have any questions, let me know. Thanks! |
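One general way to make a parser independent of datablock order is to index the entries by an identifying key and look them up explicitly, instead of relying on their position in the file. The sketch below is not the Jillion fix, whose details are not shown in this thread. The `DirEntry` type, the tag name `"PCON"` for per-base qualities, and the use of a tag number to distinguish the two versions are assumptions for illustration; which number means "current" and which means "original" would have to be verified against real files.

```python
from typing import Dict, List, NamedTuple, Tuple


class DirEntry(NamedTuple):
    """Hypothetical, already-decoded index entry for one datablock."""
    name: str     # four-character tag name (assumed: "PCON" for qualities)
    number: int   # tag number separating blocks that share a name
    data: bytes   # raw payload, one byte per quality value


def index_entries(entries: List[DirEntry]) -> Dict[Tuple[str, int], DirEntry]:
    """Key each entry by (name, number) so file order no longer matters."""
    return {(e.name, e.number): e for e in entries}


def qualities(index: Dict[Tuple[str, int], DirEntry], number: int) -> List[int]:
    """Fetch the quality block with the given tag number as a list of ints."""
    return list(index[("PCON", number)].data)
```

With this lookup, two files that store the same blocks in opposite order yield the same result for `qualities(index, 1)`.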
Great work! It now works as expected! Thanks a lot for your help! |
See subject. For some reason many bases get an error probability of 1.0, while other programs such as SnapGene Viewer assign the bases a very different quality. I am not 100% sure that VER1=4.0 is causing the difference, but it was the only apparent difference, as for VER1=3.0 it goes fine.
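For context, an error probability of exactly 1.0 is what a quality value of 0 yields under the standard Phred relation Q = -10 * log10(p). A minimal sketch, assuming the qualities in these files are Phred-scaled (the function name is illustrative and not part of Jillion):

```python
def phred_to_error_probability(q: int) -> float:
    """Convert a Phred quality score to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)
```

A base with quality 0 therefore maps to 1.0, quality 20 to about 0.01, and quality 40 to about 0.0001, so a run of 1.0 values suggests that the parser is reading zeros as qualities.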