Erroneous characters generated in Sam #31
Can you please provide enough information to try to reproduce this problem? The .cfg you used might be a good place to start. AriocP's output log would also help. Thanks!
The logs of the two runs are:
and
Config files are:
Thanks! |
This looks like a bug. Unfortunately, I have not been able to reproduce it using v1.51.3138 with the human reference genome and paired-end WGS reads. Can you please point me to a copy of the reference genome you are using (W2A_rescaffolded_pantoea_removed.fasta)? Alternatively, could you please provide a few of the definition lines in that FASTA file, including the one that defines chr03, the RNAME in both of the problematic mappings? (A simple Thank you. |
Hi! The seqnames of the reference are like this:
And the read name is:
and
The reason I used
|
Thank you for posting the reference sequence deflines and the workflow script. I have a couple of comments and one more request for information.

1: Unlike other general-purpose read aligners like BWA and Bowtie, Arioc separately performs the tasks of transforming the ASCII-formatted FASTA (reference) and FASTQ (read) sequences into a binary format that facilitates alignment. The underlying idea is to do these things on different computers at different times, with the goal of running things concurrently on different machines in a cluster. (This is particularly helpful in the typical use case where Arioc is used in a cluster that comprises both GPU nodes and non-GPU nodes.)

This means you should design workflows to do these tasks separately. Encode the reference genome once (AriocE gapped, AriocE nongapped) and leave the encoded files in a shared directory where they can be referenced in subsequent AriocP config files. (This is actually no different from what you would do with BWA or Bowtie.) Also, if you are running in the typical cluster that has GPU nodes and non-GPU nodes, do the same with the encoded read sequences.

2: As you know, the archaic 1960s-style FASTA and SAM formats are pretty awful with regard to how whitespace is handled. AriocE and AriocP contain some special-case logic to deal with embedded spaces and tabs in RNAMEs and QNAMEs, but it's possible that your RNAMEs -- some of which contain embedded whitespace and some of which do not -- are the data that is eliciting the bug. You might try to confirm this suspicion by altering your SN specification to

Rather than guess at this, however, it would be preferable to use your data to reproduce this bug. Can you please upload (or otherwise make available, e.g. via FTP) the FASTA-formatted reference sequences and the FASTQ for at least one of the read pairs that causes the error to occur?
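For anyone who wants to check their own reference for the mixed-whitespace situation described in point 2, here is a minimal Python sketch. It is not part of Arioc; the function name `deflines_with_whitespace` is made up for illustration, and it only reports deflines that contain a space or tab. It says nothing about how AriocE itself treats them.

```python
def deflines_with_whitespace(fasta_lines):
    """Return the FASTA deflines (without the leading '>') that contain a space or tab.

    Hypothetical helper for illustration only; it does not reproduce Arioc's logic.
    """
    flagged = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a definition line
        defline = line[1:].rstrip("\r\n")
        if any(ch.isspace() for ch in defline):
            flagged.append(defline)
    return flagged
```

To run it on a file, pass an open handle, for example `deflines_with_whitespace(open("reference.fasta"))`, where `reference.fasta` stands in for your own FASTA path. If the result contains some sequence names but not all of them, the reference mixes both defline styles.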
I revisited this bug by extracting SEQ and QUAL from the SAM records you provided and aligning to the human reference genome, but I was still unable to reproduce the problem. Two comments:

-- Without the original FASTA (for the reference genome) and FASTQ (for the one or two read pairs whose alignment records are garbled) I see no straightforward way to troubleshoot the problem you observed.

-- In the SAM records you provided, both SEQ and QUAL are identical for both mates. This strikes me as being highly unlikely to occur in valid sequencer output. I do not think this explains the problem you have seen with spurious characters in the SAM output, but I would be curious to understand what's going on with those reads -- in particular, whether Arioc has garbled these output fields as well as the others you discovered in the SAM output. Again, the original FASTQ for those reads would be pertinent here.

Thank you again for taking the time to document this bug in the software.
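One way to check whether identical mates are already present in the input, rather than introduced in the SAM output, is to compare the two FASTQ files directly. The following is a rough Python sketch under two assumptions: both files use strict four-line FASTQ records, and mates appear in the same order in both files. The names `fastq_records` and `identical_mates` are hypothetical.

```python
def fastq_records(lines):
    """Yield (name, seq, qual) from strict four-line FASTQ records.

    Assumes no line wrapping inside SEQ or QUAL; incomplete trailing records are ignored.
    """
    rows = [line.rstrip("\r\n") for line in lines]
    for i in range(0, len(rows) - 3, 4):
        yield rows[i][1:], rows[i + 1], rows[i + 3]


def identical_mates(mate1_lines, mate2_lines):
    """Return the names of read pairs whose two mates share the same SEQ and QUAL."""
    pairs = zip(fastq_records(mate1_lines), fastq_records(mate2_lines))
    return [
        name1
        for (name1, seq1, qual1), (_name2, seq2, qual2) in pairs
        if seq1 == seq2 and qual1 == qual2
    ]
```

An empty result on the original FASTQ files would suggest the duplication arises somewhere after the reads are read in; a non-empty result would point at the input itself.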
Thank you for your advice.
My lab has a single workstation with a graphics card installed. That is why the pipeline is not separated.
The reads and reference sequences in the region are telomeric repeats.
The reference genome is not published yet, so I will send it to you by email. It would also be helpful if you could tell me how to transfer files securely on GitHub. Thanks,
Ok, when I used your reference genome and your FASTQ of one of the read pairs, I discovered the following:
The fixes were to repair the bug and to harden AriocE to produce an error message rather than a segfault when confronted with certain invalid FASTQ formats.

I then used the SAM output you provided previously to produce a correctly-formatted FASTQ representation of the two reads for which you originally encountered corrupt SAM records as well as for the third pair in your most recent FASTQ. These reads now align without errors to a properly-encoded reference:

Two comments:

First, the reference genome you provided is not identical to the one for which you originally reported this problem. You might want to verify that the bug is indeed fixed with the original FASTA as well.

Also, reference-genome encoding is a one-time-only process (assuming it isn't buggy :-). So even on a single machine, there's no reason to do it in the workflow for every WGS sample you align. (The same goes for every other short-read aligner I've ever used.)

I have pushed a new build (v1.51.3140) of the Arioc distribution that contains these fixes. Thank you again for your help.
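Since malformed FASTQ was part of the trigger here, a quick structural check before encoding may save some time. The thread does not say which invalid formats AriocE now rejects, so the checks below are only the generic four-line FASTQ rules, not a description of AriocE's validation. This is a Python sketch and `check_fastq` is a made-up name.

```python
def check_fastq(lines):
    """Return a list of human-readable problems found in four-line FASTQ records.

    Generic structural checks only; assumes SEQ and QUAL are not line-wrapped.
    """
    rows = [line.rstrip("\r\n") for line in lines]
    problems = []
    if len(rows) % 4 != 0:
        problems.append(f"line count {len(rows)} is not a multiple of 4")
    for i in range(0, len(rows) - 3, 4):
        header, seq, separator, qual = rows[i:i + 4]
        first = i + 1  # 1-based line number of the record's header
        if not header.startswith("@"):
            problems.append(f"line {first}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"line {first + 2}: separator does not start with '+'")
        if len(seq) != len(qual):
            problems.append(
                f"line {first + 1}: SEQ length {len(seq)} != QUAL length {len(qual)}"
            )
    return problems
```

An empty list means the file passed these particular checks; it does not guarantee that every aligner will accept it.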
I used Arioc to map reads to my reference genome twice, but both SAM files contain erroneous characters that cause samtools view to abort with the errors
[W::sam_read1] Parse error at line 2602870
and
[W::sam_read1] Parse error at line 2644696
Those two lines in the SAM files are:
K00134:236:H5CJWBBXY:1:1104:29609:17175 83 chr03 20253514/* 0 151M = 597 -462 ATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAACAACCCAATAACCCAATAACCCACTAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCC 7-F7<AF7A7F7JAFAF<A-FFJJJ<AA7F<JFAAAFJF<7-A-FJFF<<FJAJJJJ<<JJJJF-FJAJJAAA-JJJJF<AF<7JJFJJAJJJJF<-<F-FJJJJFJFAJJJFA<FJFFFJFJFFJJJFF7JJJJFAJJJJJF<FFFFFAA AS:i:286 XS:i:286 NM:i:2 MD:Z:73A22A54 YT:Z:CPNa:i:9975 Nb:i:8 c3:i:31 YS:i:252
K00134:236:H5CJWBBXY:1:1104:29609:17175 83 chr03 0515640 /* 0 151M = 597 -462 ATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAACAACCCAATAACCCAATAACCCACTAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCC 7-F7<AF7A7F7JAFAF<A-FFJJJ<AA7F<JFAAAFJF<7-A-FJFF<<FJAJJJJ<<JJJJF-FJAJJAAA-JJJJF<AF<7JJFJJAJJJJF<-<F-FJJJJFJFAJJJFA<FJFFFJFJFFJJJFF7JJJJFAJJJJJF<FFFFFAA AS:i:286 XS:i:286 NM:i:2 MD:Z:73A22A54 YT:Z:CPNa:i:9975 Nb:i:8 c3:i:31 YS:i:252
Another example is:
K00134:236:H5CJWBBXY:1:1106:21085:38486 83 chr03 151M = 42* 0 151M = 624 -479 AACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAA 7-A-FFF--7F<7-7F77FJAFFJ7JJJFA<FFJJJFJ<JFFJJJFFJA-JFJJFJAAFAJJFA<A<AFF<F<-<FJJJJAAA<AJAF7-J<<-<F-<JJFJJJFJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJFFFAAA AS:i:302 XS:i:302 NM:i:0 MD:Z:151 YT:Z:CP Na:i:872 Nb:i:1 c3:i:30 YS:i:302
K00134:236:H5CJWBBXY:1:1106:21085:38486 83 chr03 JJJFJJJJF* 0 151M = 624 -479 AACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAA 7-A-FFF--7F<7-7F77FJAFFJ7JJJFA<FFJJJFJ<JFFJJJFFJA-JFJJFJAAFAJJFA<A<AFF<F<-<FJJJJAAA<AJAF7-J<<-<F-<JJFJJJFJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJFFFAAA AS:i:302 XS:i:302 NM:i:0 MD:Z:151 YT:Z:CPNa:i:872 Nb:i:1 c3:i:30 YS:i:302
They seem to be generated by the same read. Could you help explain the occurrence of "/*"? I'm using Arioc v1.51.
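In case it helps others locate records like these without waiting for samtools to stop at the first one, here is a small Python sketch that scans a SAM file for records whose mandatory fields are malformed. It assumes tab-separated SAM text and checks only a few of the rules from the SAM specification (integer FLAG, POS, MAPQ, PNEXT and TLEN, and matching SEQ and QUAL lengths). The function name `find_malformed_sam_lines` is made up.

```python
import re

INTEGER = re.compile(r"-?\d+")
# (field name, zero-based column) for the mandatory integer fields of a SAM record.
INTEGER_FIELDS = (("FLAG", 1), ("POS", 3), ("MAPQ", 4), ("PNEXT", 7), ("TLEN", 8))


def find_malformed_sam_lines(lines):
    """Yield (line_number, reason) for alignment records that break basic SAM field rules."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("@"):
            continue  # header line, not an alignment record
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:
            yield number, f"expected at least 11 tab-separated fields, found {len(fields)}"
            continue
        bad = [
            f"{name} is not an integer: {fields[col]!r}"
            for name, col in INTEGER_FIELDS
            if not INTEGER.fullmatch(fields[col])
        ]
        seq, qual = fields[9], fields[10]
        if seq != "*" and qual != "*" and len(seq) != len(qual):
            bad.append("SEQ and QUAL lengths differ")
        if bad:
            yield number, "; ".join(bad)
```

Calling `list(find_malformed_sam_lines(open("output.sam")))`, with `output.sam` standing in for your own file, returns every offending line number at once, which makes it easier to see whether the corrupt records cluster around particular reads or regions.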