Sanger sequences VER1=4.0 does not get correct error probability #7

Closed
RobinVanSchendel opened this issue Jan 15, 2019 · 12 comments

Comments

@RobinVanSchendel

See subject. For some reason, many bases get an error probability of 1.0, while other programs such as SnapGene Viewer assign the bases a very different quality. I am not 100% sure that the version is causing the difference, but it was the only apparent difference, as files with VER1 = 3.0 parse fine.
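For context, an error probability of 1.0 is what a Phred quality of 0 maps to under the standard relation P = 10^(-Q/10). The sketch below only illustrates that relation; the class and method names are made up for this example and are not part of Jillion.

```java
/** Illustration only: hypothetical helper, not a Jillion class. */
final class PhredDemo {

    /** Standard Phred relation: P(error) = 10^(-Q/10). */
    static double errorProbability(int phredQuality) {
        return Math.pow(10.0, -phredQuality / 10.0);
    }

    public static void main(String[] args) {
        // A quality of 0 gives an error probability of 1.0,
        // so a run of zero qualities shows up as a run of 1.0 values.
        for (int q : new int[] {0, 10, 20, 40}) {
            System.out.println("Q" + q + " -> " + errorProbability(q));
        }
    }
}
```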

@dkatzel-home
Collaborator

Thank you for using Jillion! I'd be happy to help. Are you able to provide an example of a Sanger sequence with this "VER1=4.0" and the equivalent with VER1=3.0? The Sanger parsing code is the oldest code in Jillion; I wrote it about 13 years ago, and I'm not sure there was a version 4.0 back then.

Thanks again!

@RobinVanSchendel
Author

RobinVanSchendel commented Jan 16, 2019 via email

@RobinVanSchendel
Author

RobinVanSchendel commented Jan 16, 2019 via email

@dkatzel-home
Collaborator

It doesn't look like it. Looking at the GitHub help on how to attach files to issues, it only supports a few file formats.

Can you please try zipping the 2 files and then attaching the zip file to this ticket? (You might need to use the web GUI.)

Thanks

@RobinVanSchendel
Author

Jillion_Test_AB1.zip

@RobinVanSchendel
Author

Hi Dan, I have uploaded the test files, and I hope the sequence files are now testable for you.

@dkatzel-home
Collaborator

Got it, thanks.

@dkatzel-home
Collaborator

I think I found the problem. It will take a bit of time to fix; I'll try to get it done over the weekend.

Long story short: ABI1 files store two versions of the trace data inside the chromatogram file, an original version and a current version. Often they are identical, but when a different basecaller is run or manual edits are made, they diverge. ABI files store this information in datablocks, which are binary blobs inside the file, along with an index recording where each entry is located.

The Jillion ABI1 parser was based on data sequenced at TIGR/JCVI, which produced over 1 million Sanger reads over the course of many years. The parser took a few shortcuts because I'm not aware of a real file specification for the format, although many parts have been reverse engineered since the 90s. I think the TIGR sequencing center happened to always write the datablocks for the original and current versions in the same order.

For whatever reason, this "version 4" not only has large differences between the original and current sequence, but some of the datablocks are also in a different order, so in the example you attached the original and current qualities are swapped. There may be other differences too; I haven't fully investigated.

I will fix the parser to correctly detect original vs current, but in the meantime, if you need a workaround, you can access the original version of the traces:
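To illustrate why relying on block order is fragile, here is a minimal sketch that looks blocks up by an identifying tag rather than by position. It assumes each index entry carries a tag name and a tag number; the `Entry` class, the `find` method, and the use of `PCON` with numbers 1 and 2 are illustrative labels for this example, not Jillion's parser and not a statement about which number means original or current.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustration only: hypothetical types, not Jillion's ABI parser. */
final class TagLookupSketch {

    /** Simplified, hypothetical index entry: tag name, tag number, payload. */
    static final class Entry {
        final String name;
        final int number;
        final byte[] data;

        Entry(String name, int number, byte[] data) {
            this.name = name;
            this.number = number;
            this.data = data;
        }
    }

    /** Finds a block by (name, number), so file order does not matter. */
    static Optional<byte[]> find(List<Entry> index, String name, int number) {
        for (Entry e : index) {
            if (e.name.equals(name) && e.number == number) {
                return Optional.of(e.data);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] first = {20, 30, 40};
        byte[] second = {0, 0, 0};

        // The same two blocks, written in opposite order.
        List<Entry> fileA = Arrays.asList(
                new Entry("PCON", 1, first), new Entry("PCON", 2, second));
        List<Entry> fileB = Arrays.asList(
                new Entry("PCON", 2, second), new Entry("PCON", 1, first));

        // A positional parser would read different blocks from fileA and fileB;
        // a lookup by tag returns the same block from both.
        System.out.println(find(fileA, "PCON", 1).get() == find(fileB, "PCON", 1).get());
    }
}
```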

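       // "id" is an identifier for the read; file points to the ABI chromatogram file
       AbiChromatogram abiChromatogram = new AbiChromatogramBuilder("id", file).build();

       // the original version of the trace data, as opposed to the current one
       Chromatogram original = abiChromatogram.getOriginalChromatogram();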

I would suggest checking whether the current qualities contain a lot of 0s and, if so, using the original qualities instead.
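A minimal sketch of such a check, assuming the quality values have already been extracted into plain `byte[]` arrays (how to get them out of the chromatogram objects is left out here). The class name, method names, and the 0.5 threshold are hypothetical choices for illustration, not part of Jillion.

```java
/** Illustration only: hypothetical helper, not a Jillion class. */
final class QualityFallback {

    /** Fraction of quality values that are exactly 0 (0.0 for an empty array). */
    static double zeroFraction(byte[] qualities) {
        if (qualities.length == 0) {
            return 0.0;
        }
        int zeros = 0;
        for (byte q : qualities) {
            if (q == 0) {
                zeros++;
            }
        }
        return (double) zeros / qualities.length;
    }

    /** Returns the original qualities when too many current values are 0. */
    static byte[] choose(byte[] current, byte[] original, double maxZeroFraction) {
        return zeroFraction(current) > maxZeroFraction ? original : current;
    }

    public static void main(String[] args) {
        byte[] current = {0, 0, 0, 35};     // mostly zeros, looks suspicious
        byte[] original = {30, 32, 28, 35};

        // Assumed threshold: fall back if more than half of the values are 0.
        byte[] chosen = choose(current, original, 0.5);
        System.out.println(chosen == original ? "using original" : "using current");
    }
}
```

The threshold is a judgment call: a read can legitimately contain some zero-quality bases, so it should be tuned against real data.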

Thanks! I'll let you know when the fix is tested and pushed.

PS: May I include these files in the Jillion test suite to prevent regressions?

@RobinVanSchendel
Author

You can add the files to the test suite. I tested your solution and unfortunately it does not work. If you, for example, open the 5565810 file in SnapGene Viewer (free software), you see that the sequence corresponds to the current version. The original sequence that I get with your workaround looks quite different from that. The only difference is that in SnapGene Viewer the quality values seem to deviate a lot from those of the current (and original) version. I think SnapGene Viewer is correct in terms of sequence and quality. So how does it manage to obtain different quality values from this file?

@dkatzel-home
Collaborator

Jillion had a bug that swapped the qualities of the original sequence with those of the current sequence. Pushing the fix now.

@dkatzel-home
Collaborator

Everything has been fixed and committed. If you pull the latest version from the repository and build it, it should work. If you have any questions, let me know.

Thanks!

@RobinVanSchendel
Author

Great work! It works now indeed as expected! Thanks a lot for your help!
