Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erroneous characters generated in Sam #31

Closed
ZexuanZhao opened this issue Feb 21, 2023 · 8 comments
Closed

Erroneous characters generated in Sam #31

ZexuanZhao opened this issue Feb 21, 2023 · 8 comments

Comments

@ZexuanZhao
Copy link

ZexuanZhao commented Feb 21, 2023

I used Arioc to map reads to my reference genome twice but both sam files has erroneous characters generated and it cause samtools view to abort with error [W::sam_read1] Parse error at line 2602870 and [W::sam_read1] Parse error at line 2644696. Those two lines in the sam file are:
K00134:236:H5CJWBBXY:1:1104:29609:17175 83 chr03 20253514/* 0 151M = 597 -462 ATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAACAACCCAATAACCCAATAACCCACTAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCC 7-F7<AF7A7F7JAFAF<A-FFJJJ<AA7F<JFAAAFJF<7-A-FJFF<<FJAJJJJ<<JJJJF-FJAJJAAA-JJJJF<AF<7JJFJJAJJJJF<-<F-FJJJJFJFAJJJFA<FJFFFJFJFFJJJFF7JJJJFAJJJJJF<FFFFFAA AS:i:286 XS:i:286 NM:i:2 MD:Z:73A22A54 YT:Z:CPNa:i:9975 Nb:i:8 c3:i:31 YS:i:252
K00134:236:H5CJWBBXY:1:1104:29609:17175 83 chr03 0515640 /* 0 151M = 597 -462 ATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAACAACCCAATAACCCAATAACCCACTAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCC 7-F7<AF7A7F7JAFAF<A-FFJJJ<AA7F<JFAAAFJF<7-A-FJFF<<FJAJJJJ<<JJJJF-FJAJJAAA-JJJJF<AF<7JJFJJAJJJJF<-<F-FJJJJFJFAJJJFA<FJFFFJFJFFJJJFF7JJJJFAJJJJJF<FFFFFAA AS:i:286 XS:i:286 NM:i:2 MD:Z:73A22A54 YT:Z:CPNa:i:9975 Nb:i:8 c3:i:31 YS:i:252

Another example is:
K00134:236:H5CJWBBXY:1:1106:21085:38486 83 chr03 151M = 42* 0 151M = 624 -479 AACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAA 7-A-FFF--7F<7-7F77FJAFFJ7JJJFA<FFJJJFJ<JFFJJJFFJA-JFJJFJAAFAJJFA<A<AFF<F<-<FJJJJAAA<AJAF7-J<<-<F-<JJFJJJFJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJFFFAAA AS:i:302 XS:i:302 NM:i:0 MD:Z:151 YT:Z:CP Na:i:872 Nb:i:1 c3:i:30 YS:i:302
K00134:236:H5CJWBBXY:1:1106:21085:38486 83 chr03 JJJFJJJJF* 0 151M = 624 -479 AACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAATAACCCAA 7-A-FFF--7F<7-7F77FJAFFJ7JJJFA<FFJJJFJ<JFFJJJFFJA-JFJJFJAAFAJJFA<A<AFF<F<-<FJJJJAAA<AJAF7-J<<-<F-<JJFJJJFJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJFFFAAA AS:i:302 XS:i:302 NM:i:0 MD:Z:151 YT:Z:CPNa:i:872 Nb:i:1 c3:i:30 YS:i:302

They seem to be generated by the same read. Could you help explain the occurance of "/*"? I'm using the Arioc v1.51.

@ZexuanZhao ZexuanZhao changed the title Erroneous and not reproducible characters generated in Sam Erroneous characters generated in Sam Feb 21, 2023
@RWilton
Copy link
Owner

RWilton commented Feb 21, 2023 via email

@ZexuanZhao
Copy link
Author

The log of 2 runs are:

191243002 [001adc55] AriocE v1.51.3138.23039 (release) +2023-02-08T21:39:05
191243174 [001adc55] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
191243174 [001adc55]  compiled             : 2023-02-19T13:19:27 (GNU g++ v11.3.0)
191243174 [001adc55]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
191243174 [001adc55]  computer name        : machado-kremer
191243174 [001adc55]  operating system     : Ubuntu 22.04.1 LTS
191243174 [001adc55]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocE
191243174 [001adc55]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/encode_reference_gapped.cfg
191243855 [001adc55] AriocE::encodeR: encoding 1 file (48 CPU threads available)...
191426322 [001adc55] AriocE::encodeR: encoded 1 file
191427213 [001adc55] AriocE::sortCcpu: sorting C list (1073741824 hash keys)...
191557647 [001adc55] AriocE::sortCcpu: sorted C list
191602349 [001adc55] AriocE::computeJlistSizes: m_nJ=1023599836 (61317691+962282145) m_iCshort=678580 m_iCzero=571860800
191602350 [001adc55] AriocE::computeJlistSizes: log2(nJ)  # J-lists
191602350 [001adc55] AriocE::computeJlistSizes:         0  351972116
191602350 [001adc55] AriocE::computeJlistSizes:         1  188177231
191602350 [001adc55] AriocE::computeJlistSizes:         2   25824973
191602350 [001adc55] AriocE::computeJlistSizes:         3    3914483
191602350 [001adc55] AriocE::computeJlistSizes:         4    1293417
191602350 [001adc55] AriocE::computeJlistSizes:         5     449923
191602350 [001adc55] AriocE::computeJlistSizes:         6     142946
191602350 [001adc55] AriocE::computeJlistSizes:         7      49662
191602350 [001adc55] AriocE::computeJlistSizes:         8      22208
191602350 [001adc55] AriocE::computeJlistSizes:         9      10642
191602350 [001adc55] AriocE::computeJlistSizes:        10       2507
191602350 [001adc55] AriocE::computeJlistSizes:        11        620
191602350 [001adc55] AriocE::computeJlistSizes:        12         56
191602350 [001adc55] AriocE::computeJlistSizes:        13          4
191602350 [001adc55] AriocE::computeJlistSizes:        14         10
191602350 [001adc55] AriocE::computeJlistSizes:        15          2
191610066 [001adc55] AriocE::computeJlistSizes: completed: J table uses 1024278416 elements for 1023599836 J values (0.066% for list counts in J lists)
191611209 [001adc55] AriocE::buildJlists: building J lists for 1 file (2 sequences) (48 CPU threads available)...
192125791 [001adc55] AriocE::buildJlists: built J lists for 1 files (2 sequences)
192125802 [001adc55] AriocE::sortJcpu: sorting J lists on 48 worker threads...
192155124 [001adc55] AriocE::sortJcpu: J list sort complete
192226068 [001adc55] AriocE::validateHJ: start
192243754 [001adc55] AriocE::validateHJ: H table uses 571860800/1073741824 (53.259%) hash keys
192243754 [001adc55] AriocE::validateHJ: J table contains 1024278416 J values (list counts in lists: 678580 (0.066%))
192243754 [001adc55] AriocE::validateHJ: maximum nJ=60464 for hash key 0x295C5954
192243754 [001adc55] AriocE::validateHJ: completed
192243754 [001adc55] AriocE::validateSubIds: start
192248779 [001adc55] AriocE::validateSubIds: validated 1073741824 hash values
192248779 [001adc55] AriocE::validateSubIds:  # subIds  # hash keys  avg J-list size
192248779 [001adc55] AriocE::validateSubIds:         0            0              0.0
192248779 [001adc55] AriocE::validateSubIds:         1    571860800              1.8
192248779 [001adc55] AriocE::validateSubIds:  subId        Jf       Jrc
192248779 [001adc55] AriocE::validateSubIds:      1 511799918 511799918
192248779 [001adc55] AriocE::validateSubIds: completed
192248779 [001adc55] AriocE::finalizeJ: finalizing J lists on 48 worker threads...
192308936 [001adc55] AriocE::finalizeJ: completed
192319510 [001adc55] AriocE ends (0)
192319687 [001b254e] AriocE v1.51.3138.23039 (release) +2023-02-08T21:39:05
192319691 [001b254e] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
192319691 [001b254e]  compiled             : 2023-02-19T13:19:27 (GNU g++ v11.3.0)
192319691 [001b254e]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
192319691 [001b254e]  computer name        : machado-kremer
192319691 [001b254e]  operating system     : Ubuntu 22.04.1 LTS
192319691 [001b254e]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocE
192319691 [001b254e]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/encode_reference_nongapped.cfg
192320329 [001b254e] AriocE::encodeR: encoding 1 file (48 CPU threads available)...
192511866 [001b254e] AriocE::encodeR: encoded 1 file
192512784 [001b254e] AriocE::sortCcpu: sorting C list (1073741824 hash keys)...
192630698 [001b254e] AriocE::sortCcpu: sorted C list
192635102 [001b254e] AriocE::computeJlistSizes: m_nJ=1023562844 (14769570+1008793274) m_iCshort=189354 m_iCzero=637163004
192635102 [001b254e] AriocE::computeJlistSizes: log2(nJ)  # J-lists
192635102 [001b254e] AriocE::computeJlistSizes:         0  385409029
192635102 [001b254e] AriocE::computeJlistSizes:         1  230344964
192635102 [001b254e] AriocE::computeJlistSizes:         2   19572720
192635102 [001b254e] AriocE::computeJlistSizes:         3    1209943
192635102 [001b254e] AriocE::computeJlistSizes:         4     436994
192635102 [001b254e] AriocE::computeJlistSizes:         5     121315
192635102 [001b254e] AriocE::computeJlistSizes:         6      42950
192635102 [001b254e] AriocE::computeJlistSizes:         7      17319
192635102 [001b254e] AriocE::computeJlistSizes:         8       7352
192635102 [001b254e] AriocE::computeJlistSizes:         9        399
192635102 [001b254e] AriocE::computeJlistSizes:        10          3
192635102 [001b254e] AriocE::computeJlistSizes:        11         16
192642840 [001b254e] AriocE::computeJlistSizes: completed: J table uses 1023752198 elements for 1023562844 J values (0.018% for list counts in J lists)
192643813 [001b254e] AriocE::buildJlists: building J lists for 1 file (2 sequences) (48 CPU threads available)...
193207710 [001b254e] AriocE::buildJlists: built J lists for 1 files (2 sequences)
193207721 [001b254e] AriocE::sortJcpu: sorting J lists on 48 worker threads...
193256288 [001b254e] AriocE::sortJcpu: J list sort complete
193327836 [001b254e] AriocE::validateHJ: start
193346108 [001b254e] AriocE::validateHJ: H table uses 637163004/1073741824 (59.340%) hash keys
193346108 [001b254e] AriocE::validateHJ: J table contains 1023752198 J values (list counts in lists: 189354 (0.018%))
193346108 [001b254e] AriocE::validateHJ: maximum nJ=3027 for hash key 0x11E84C2E
193346108 [001b254e] AriocE::validateHJ: completed
193346108 [001b254e] AriocE::validateSubIds: start
193350117 [001b254e] AriocE::validateSubIds: validated 1073741824 hash values
193350117 [001b254e] AriocE::validateSubIds:  # subIds  # hash keys  avg J-list size
193350117 [001b254e] AriocE::validateSubIds:         0            0              0.0
193350117 [001b254e] AriocE::validateSubIds:         1    637163004              1.6
193350117 [001b254e] AriocE::validateSubIds:  subId        Jf       Jrc
193350117 [001b254e] AriocE::validateSubIds:      1 511781422 511781422
193350117 [001b254e] AriocE::validateSubIds: completed
193350117 [001b254e] AriocE::finalizeJ: finalizing J lists on 48 worker threads...
193409502 [001b254e] AriocE::finalizeJ: completed
193416282 [001b254e] AriocE ends (0)
193416318 [001b7025] AriocE v1.51.3138.23039 (release) +2023-02-08T21:39:05
193416322 [001b7025] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
193416322 [001b7025]  compiled             : 2023-02-19T13:19:27 (GNU g++ v11.3.0)
193416322 [001b7025]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
193416322 [001b7025]  computer name        : machado-kremer
193416322 [001b7025]  operating system     : Ubuntu 22.04.1 LTS
193416322 [001b7025]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocE
193416322 [001b7025]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/encode_reads.cfg
193416323 [001b7025] AriocE::encodeQ: encoding 2 files (48 CPU threads available)...
193526665 [001b7074] tuEncodeFASTQ::summarizeReadLengthDistribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R1.trimmed.fastq: 44909161 reads [31-151 mean (sd) 145.4 (14.9)]
193526666 [001b7074] tuEncodeFASTQ::summarizeC3Distribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R1.trimmed.fastq: 44909161 reads [0-82 mean (sd) 48.3 (4.8)]
193528525 [001b70be] tuEncodeFASTQ::summarizeReadLengthDistribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R2.trimmed.fastq: 44909161 reads [31-151 mean (sd) 145.4 (14.9)]
193528525 [001b70be] tuEncodeFASTQ::summarizeC3Distribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R2.trimmed.fastq: 44909161 reads [0-82 mean (sd) 48.3 (4.9)]
193619060 [001b7025] AriocE::encodeQ: encoded 2 files
193619060 [001b7025] AriocE::validateQ: validating 2 files (48 CPU threads available)...
193619060 [001b7e2e] tuValidateQ::validateQNAME: tmp/reads_enc/W2M.2_R1.trimmed$sqm.sbf: 0x0000000800000000 'K00134:236:H5CJWBBXY:1:1101:1509:1261'
193619060 [001b7e2f] tuValidateQ::validateQNAME: tmp/reads_enc/W2M.2_R2.trimmed$sqm.sbf: 0x0000000800000001 'K00134:236:H5CJWBBXY:1:1101:1509:1261'
193621067 [001b7025] AriocE::validateQ: validated 2 files
193621067 [001b7025] AriocE ends (0)
193621107 [001b7e7a] AriocP v1.51.3138.23039 (release) +2023-02-08T21:39:05
193621111 [001b7e7a] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
193621111 [001b7e7a]  compiled             : 2023-02-19T13:29:26 (GNU g++ v11.3.0)
193621111 [001b7e7a]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
193621111 [001b7e7a]  computer name        : machado-kremer
193621111 [001b7e7a]  operating system     : Ubuntu 22.04.1 LTS
193621111 [001b7e7a]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocP
193621111 [001b7e7a]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/mapping/mapping_W2M.2.cfg
193621323 [001b7e7a] AriocBase::loadR: load R from disk...
193621562 [001b7e7a] AriocBase::loadR: loaded 1 R sequence: 390001080 bytes (0.4GB) from 2 files in 238ms (1562.7MB/s)
193621562 [001b7e7a] AriocBase::loadHJ: load H and J LUTs from disk...
193621562 [001b7e85] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 1801861
193623897 [001b7e85] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
193626032 [001b7f17] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5118761000 bytes, ... ) on thread 1802007
193628211 [001b7f17] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
193630319 [001b7fa9] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 1802153
193632578 [001b7fa9] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
193634717 [001b800a] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5121392092 bytes, ... ) on thread 1802250
193636886 [001b800a] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
193638945 [001b7e7a] AriocBase::loadHJ: loaded H and J LUTs from disk in 17383ms (1150.9MB/s)
193638945 [001b8087] void tuWatchdog::main: watchdog interval 0s
193639158 [001b8088] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
193832344 [001b8088] AriocBase::UpdateProgress:  10%: 4490920 pairs (8981840 mates) aligned (ETA 19 minutes)
194027569 [001b8088] AriocBase::UpdateProgress:  20%: 8981840 pairs (17963680 mates) aligned (ETA 16 minutes)
194225396 [001b8088] AriocBase::UpdateProgress:  30%: 13472760 pairs (26945520 mates) aligned (ETA 14 minutes)
194422324 [001b8088] AriocBase::UpdateProgress:  40%: 17963680 pairs (35927360 mates) aligned (ETA 12 minutes)
194620368 [001b8088] AriocBase::UpdateProgress:  50%: 22454600 pairs (44909200 mates) aligned (ETA 9 minutes)
194820610 [001b8088] AriocBase::UpdateProgress:  60%: 26945520 pairs (53891040 mates) aligned (ETA 7 minutes)
195049059 [001b8088] AriocBase::UpdateProgress:  70%: 31436440 pairs (62872880 mates) aligned (ETA 6 minutes)
195247223 [001b8088] AriocBase::UpdateProgress:  80%: 35927360 pairs (71854720 mates) aligned (ETA 4.1 minutes)
195443440 [001b8088] AriocBase::UpdateProgress:  90%: 40418280 pairs (80836560 mates) aligned (ETA 2.0 minutes)
195641702 [001b8088] AriocBase::UpdateProgress: 100%: 44909161 pairs (89818322 mates) aligned
195641940 [001b7e7a] AriocBase::releaseGpuResources: GPU LUT unload starts...
195645461 [001b7e7a] AriocBase::releaseGpuResources: GPU LUT unload complete in 3520ms
195645461 [001b7e7a] 
195645461 [001b7e7a] ----------------------------
195645461 [001b7e7a] 
195645461 [001b7e7a]   SAM output:
195645461 [001b7e7a]    pairs                          :  44909161 (89818322 SAM records)
195645461 [001b7e7a]     concordant pairs              :  41713628 (83427256 SAM records) (92.88%)
195645461 [001b7e7a]      with 1 mapping               :  38093302 (76186604 SAM records)
195645461 [001b7e7a]      with 2 or more mappings      :   3620326 ( 7240652 SAM records)
195645461 [001b7e7a]     discordant pairs              :   2733685 ( 5467370 SAM records)
195645461 [001b7e7a]     unmapped pairs                :    461848 (  923696 SAM records)
195645461 [001b7e7a]      two mates mapped ("rejected"):    180983 (  361966 SAM records)
195645461 [001b7e7a]      0 or 1 mates mapped          :    280865 (  561730 SAM records)
195645461 [001b7e7a]    mates not in paired mappings   :    923696
195645461 [001b7e7a]     with no mappings              :    413510
195645461 [001b7e7a]     with 1 mapping                :    510186
195645461 [001b7e7a]     with 2 or more mappings       :         0
195645461 [001b7e7a]    total mapped mates             :  89404812 (99.54%)
195645461 [001b7e7a]    duplicate mappings (unreported): 391897567
195645461 [001b7e7a]    estimated TLEN distribution    :
195645461 [001b7e7a]     TLEN mode (mean, sd, skewness):       183 (205.8, 74.0, 1.185)
195645461 [001b7e7a]     TLEN-discordant pairs         :      5241 (0.01%)
195645542 [001b7e7a] AriocP ends (0)
[W::sam_read1] Parse error at line 2644696
[main_samview] truncated file.

and

200836615 [001fd789] AriocE v1.51.3138.23039 (release) +2023-02-08T21:39:05
200836619 [001fd789] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
200836619 [001fd789]  compiled             : 2023-02-19T13:19:27 (GNU g++ v11.3.0)
200836619 [001fd789]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
200836619 [001fd789]  computer name        : machado-kremer
200836619 [001fd789]  operating system     : Ubuntu 22.04.1 LTS
200836619 [001fd789]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocE
200836619 [001fd789]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/encode_reference_gapped.cfg
200837042 [001fd789] AriocE::encodeR: encoding 1 file (48 CPU threads available)...
201013430 [001fd789] AriocE::encodeR: encoded 1 file
201014645 [001fd789] AriocE::sortCcpu: sorting C list (1073741824 hash keys)...
201148672 [001fd789] AriocE::sortCcpu: sorted C list
201153765 [001fd789] AriocE::computeJlistSizes: m_nJ=1023599836 (61317691+962282145) m_iCshort=678580 m_iCzero=571860800
201153765 [001fd789] AriocE::computeJlistSizes: log2(nJ)  # J-lists
201153765 [001fd789] AriocE::computeJlistSizes:         0  351972116
201153765 [001fd789] AriocE::computeJlistSizes:         1  188177231
201153765 [001fd789] AriocE::computeJlistSizes:         2   25824973
201153765 [001fd789] AriocE::computeJlistSizes:         3    3914483
201153765 [001fd789] AriocE::computeJlistSizes:         4    1293417
201153765 [001fd789] AriocE::computeJlistSizes:         5     449923
201153765 [001fd789] AriocE::computeJlistSizes:         6     142946
201153765 [001fd789] AriocE::computeJlistSizes:         7      49662
201153765 [001fd789] AriocE::computeJlistSizes:         8      22208
201153765 [001fd789] AriocE::computeJlistSizes:         9      10642
201153765 [001fd789] AriocE::computeJlistSizes:        10       2507
201153765 [001fd789] AriocE::computeJlistSizes:        11        620
201153765 [001fd789] AriocE::computeJlistSizes:        12         56
201153765 [001fd789] AriocE::computeJlistSizes:        13          4
201153765 [001fd789] AriocE::computeJlistSizes:        14         10
201153765 [001fd789] AriocE::computeJlistSizes:        15          2
201202552 [001fd789] AriocE::computeJlistSizes: completed: J table uses 1024278416 elements for 1023599836 J values (0.066% for list counts in J lists)
201203757 [001fd789] AriocE::buildJlists: building J lists for 1 file (2 sequences) (48 CPU threads available)...
201736549 [001fd789] AriocE::buildJlists: built J lists for 1 files (2 sequences)
201736561 [001fd789] AriocE::sortJcpu: sorting J lists on 48 worker threads...
201807674 [001fd789] AriocE::sortJcpu: J list sort complete
201840573 [001fd789] AriocE::validateHJ: start
201900887 [001fd789] AriocE::validateHJ: H table uses 571860800/1073741824 (53.259%) hash keys
201900887 [001fd789] AriocE::validateHJ: J table contains 1024278416 J values (list counts in lists: 678580 (0.066%))
201900887 [001fd789] AriocE::validateHJ: maximum nJ=60464 for hash key 0x295C5954
201900887 [001fd789] AriocE::validateHJ: completed
201900887 [001fd789] AriocE::validateSubIds: start
201906615 [001fd789] AriocE::validateSubIds: validated 1073741824 hash values
201906615 [001fd789] AriocE::validateSubIds:  # subIds  # hash keys  avg J-list size
201906615 [001fd789] AriocE::validateSubIds:         0            0              0.0
201906615 [001fd789] AriocE::validateSubIds:         1    571860800              1.8
201906615 [001fd789] AriocE::validateSubIds:  subId        Jf       Jrc
201906615 [001fd789] AriocE::validateSubIds:      1 511799918 511799918
201906615 [001fd789] AriocE::validateSubIds: completed
201906615 [001fd789] AriocE::finalizeJ: finalizing J lists on 48 worker threads...
201926756 [001fd789] AriocE::finalizeJ: completed
201935409 [001fd789] AriocE ends (0)
201935442 [002021ac] AriocE v1.51.3138.23039 (release) +2023-02-08T21:39:05
201935446 [002021ac] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
201935446 [002021ac]  compiled             : 2023-02-19T13:19:27 (GNU g++ v11.3.0)
201935446 [002021ac]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
201935446 [002021ac]  computer name        : machado-kremer
201935446 [002021ac]  operating system     : Ubuntu 22.04.1 LTS
201935446 [002021ac]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocE
201935446 [002021ac]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/encode_reference_nongapped.cfg
201935793 [002021ac] AriocE::encodeR: encoding 1 file (48 CPU threads available)...
202132305 [002021ac] AriocE::encodeR: encoded 1 file
202133517 [002021ac] AriocE::sortCcpu: sorting C list (1073741824 hash keys)...
202254627 [002021ac] AriocE::sortCcpu: sorted C list
202259716 [002021ac] AriocE::computeJlistSizes: m_nJ=1023562844 (14769570+1008793274) m_iCshort=189354 m_iCzero=637163004
202259716 [002021ac] AriocE::computeJlistSizes: log2(nJ)  # J-lists
202259716 [002021ac] AriocE::computeJlistSizes:         0  385409029
202259716 [002021ac] AriocE::computeJlistSizes:         1  230344964
202259716 [002021ac] AriocE::computeJlistSizes:         2   19572720
202259716 [002021ac] AriocE::computeJlistSizes:         3    1209943
202259716 [002021ac] AriocE::computeJlistSizes:         4     436994
202259716 [002021ac] AriocE::computeJlistSizes:         5     121315
202259716 [002021ac] AriocE::computeJlistSizes:         6      42950
202259716 [002021ac] AriocE::computeJlistSizes:         7      17319
202259716 [002021ac] AriocE::computeJlistSizes:         8       7352
202259716 [002021ac] AriocE::computeJlistSizes:         9        399
202259716 [002021ac] AriocE::computeJlistSizes:        10          3
202259716 [002021ac] AriocE::computeJlistSizes:        11         16
202308419 [002021ac] AriocE::computeJlistSizes: completed: J table uses 1023752198 elements for 1023562844 J values (0.018% for list counts in J lists)
202309684 [002021ac] AriocE::buildJlists: building J lists for 1 file (2 sequences) (48 CPU threads available)...
202853612 [002021ac] AriocE::buildJlists: built J lists for 1 files (2 sequences)
202853624 [002021ac] AriocE::sortJcpu: sorting J lists on 48 worker threads...
202935800 [002021ac] AriocE::sortJcpu: J list sort complete
203010689 [002021ac] AriocE::validateHJ: start
203028353 [002021ac] AriocE::validateHJ: H table uses 637163004/1073741824 (59.340%) hash keys
203028353 [002021ac] AriocE::validateHJ: J table contains 1023752198 J values (list counts in lists: 189354 (0.018%))
203028353 [002021ac] AriocE::validateHJ: maximum nJ=3027 for hash key 0x11E84C2E
203028353 [002021ac] AriocE::validateHJ: completed
203028353 [002021ac] AriocE::validateSubIds: start
203033216 [002021ac] AriocE::validateSubIds: validated 1073741824 hash values
203033216 [002021ac] AriocE::validateSubIds:  # subIds  # hash keys  avg J-list size
203033216 [002021ac] AriocE::validateSubIds:         0            0              0.0
203033216 [002021ac] AriocE::validateSubIds:         1    637163004              1.6
203033216 [002021ac] AriocE::validateSubIds:  subId        Jf       Jrc
203033216 [002021ac] AriocE::validateSubIds:      1 511781422 511781422
203033216 [002021ac] AriocE::validateSubIds: completed
203033216 [002021ac] AriocE::finalizeJ: finalizing J lists on 48 worker threads...
203052749 [002021ac] AriocE::finalizeJ: completed
203101274 [002021ac] AriocE ends (0)
203101309 [00206fd7] AriocE v1.51.3138.23039 (release) +2023-02-08T21:39:05
203101313 [00206fd7] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
203101313 [00206fd7]  compiled             : 2023-02-19T13:19:27 (GNU g++ v11.3.0)
203101313 [00206fd7]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
203101313 [00206fd7]  computer name        : machado-kremer
203101313 [00206fd7]  operating system     : Ubuntu 22.04.1 LTS
203101313 [00206fd7]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocE
203101313 [00206fd7]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/encode_reads.cfg
203101314 [00206fd7] AriocE::encodeQ: encoding 2 files (48 CPU threads available)...
203210400 [0020704a] tuEncodeFASTQ::summarizeReadLengthDistribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R1.trimmed.fastq: 44909161 reads [31-151 mean (sd) 145.4 (14.9)]
203210400 [0020704a] tuEncodeFASTQ::summarizeC3Distribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R1.trimmed.fastq: 44909161 reads [0-82 mean (sd) 48.3 (4.8)]
203214273 [002070bb] tuEncodeFASTQ::summarizeReadLengthDistribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R2.trimmed.fastq: 44909161 reads [31-151 mean (sd) 145.4 (14.9)]
203214273 [002070bb] tuEncodeFASTQ::summarizeC3Distribution: /home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R2.trimmed.fastq: 44909161 reads [0-82 mean (sd) 48.3 (4.9)]
203309259 [00206fd7] AriocE::encodeQ: encoded 2 files
203309259 [00206fd7] AriocE::validateQ: validating 2 files (48 CPU threads available)...
203309259 [00207e43] tuValidateQ::validateQNAME: tmp/reads_enc/W2M.2_R1.trimmed$sqm.sbf: 0x0000000800000000 'K00134:236:H5CJWBBXY:1:1101:1509:1261'
203309259 [00207e44] tuValidateQ::validateQNAME: tmp/reads_enc/W2M.2_R2.trimmed$sqm.sbf: 0x0000000800000001 'K00134:236:H5CJWBBXY:1:1101:1509:1261'
203310992 [00206fd7] AriocE::validateQ: validated 2 files
203310992 [00206fd7] AriocE ends (0)
203311031 [00207e68] AriocP v1.51.3138.23039 (release) +2023-02-08T21:39:05
203311035 [00207e68] Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
203311035 [00207e68]  compiled             : 2023-02-19T13:29:26 (GNU g++ v11.3.0)
203311035 [00207e68]  data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
203311035 [00207e68]  computer name        : machado-kremer
203311035 [00207e68]  operating system     : Ubuntu 22.04.1 LTS
203311035 [00207e68]  executable file      : /home/zexuanzhao/bin/Arioc.x.151/bin/AriocP
203311035 [00207e68]  configuration file   : /home/zexuanzhao/projects/pegoGenome/arioc/tmp/cfg/mapping/mapping_W2M.2.cfg
203311242 [00207e68] AriocBase::loadR: load R from disk...
203311465 [00207e68] AriocBase::loadR: loaded 1 R sequence: 390001080 bytes (0.4GB) from 2 files in 224ms (1660.4MB/s)
203311465 [00207e68] AriocBase::loadHJ: load H and J LUTs from disk...
203311466 [00207e73] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 2129523
203313667 [00207e73] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
203316115 [00207f05] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5118761000 bytes, ... ) on thread 2129669
203318193 [00207f05] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
203320641 [00207f97] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 2129815
203322828 [00207f97] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
203325202 [00208029] CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5121392092 bytes, ... ) on thread 2129961
203327274 [00208029] CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
203329649 [00207e68] AriocBase::loadHJ: loaded H and J LUTs from disk in 18184ms (1100.2MB/s)
203329649 [002080bb] void tuWatchdog::main: watchdog interval 0s
203329837 [002080bc] AriocBase::UpdateProgress:   0%: 0 pairs (0 mates) aligned
203521976 [002080bc] AriocBase::UpdateProgress:  10%: 4490920 pairs (8981840 mates) aligned (ETA 19 minutes)
203717284 [002080bc] AriocBase::UpdateProgress:  20%: 8981840 pairs (17963680 mates) aligned (ETA 16 minutes)
203914205 [002080bc] AriocBase::UpdateProgress:  30%: 13472760 pairs (26945520 mates) aligned (ETA 14 minutes)
204111264 [002080bc] AriocBase::UpdateProgress:  40%: 17963680 pairs (35927360 mates) aligned (ETA 12 minutes)
204308999 [002080bc] AriocBase::UpdateProgress:  50%: 22454600 pairs (44909200 mates) aligned (ETA 9 minutes)
204508572 [002080bc] AriocBase::UpdateProgress:  60%: 26945520 pairs (53891040 mates) aligned (ETA 7 minutes)
204735623 [002080bc] AriocBase::UpdateProgress:  70%: 31436440 pairs (62872880 mates) aligned (ETA 6 minutes)
204933556 [002080bc] AriocBase::UpdateProgress:  80%: 35927360 pairs (71854720 mates) aligned (ETA 4.1 minutes)
205130251 [002080bc] AriocBase::UpdateProgress:  90%: 40418280 pairs (80836560 mates) aligned (ETA 2.0 minutes)
205326563 [002080bc] AriocBase::UpdateProgress: 100%: 44909161 pairs (89818322 mates) aligned
205326833 [00207e68] AriocBase::releaseGpuResources: GPU LUT unload starts...
205329755 [00207e68] AriocBase::releaseGpuResources: GPU LUT unload complete in 2922ms
205329755 [00207e68] 
205329755 [00207e68] ----------------------------
205329755 [00207e68] 
205329755 [00207e68]   SAM output:
205329755 [00207e68]    pairs                          :  44909161 (89818322 SAM records)
205329755 [00207e68]     concordant pairs              :  41713628 (83427256 SAM records) (92.88%)
205329755 [00207e68]      with 1 mapping               :  38093302 (76186604 SAM records)
205329755 [00207e68]      with 2 or more mappings      :   3620326 ( 7240652 SAM records)
205329756 [00207e68]     discordant pairs              :   2733685 ( 5467370 SAM records)
205329756 [00207e68]     unmapped pairs                :    461848 (  923696 SAM records)
205329756 [00207e68]      two mates mapped ("rejected"):    180983 (  361966 SAM records)
205329756 [00207e68]      0 or 1 mates mapped          :    280865 (  561730 SAM records)
205329756 [00207e68]    mates not in paired mappings   :    923696
205329756 [00207e68]     with no mappings              :    413510
205329756 [00207e68]     with 1 mapping                :    510186
205329756 [00207e68]     with 2 or more mappings       :         0
205329756 [00207e68]    total mapped mates             :  89404812 (99.54%)
205329756 [00207e68]    duplicate mappings (unreported): 391897567
205329756 [00207e68]    estimated TLEN distribution    :
205329756 [00207e68]     TLEN mode (mean, sd, skewness):       183 (205.8, 74.0, 1.186)
205329756 [00207e68]     TLEN-discordant pairs         :      5241 (0.01%)
205329820 [00207e68] AriocP ends (0)
[W::sam_read1] Parse error at line 2602870
[main_samview] truncated file.

Config files are:

  1. encode gapped reference:
<?xml version="1.0" encoding="UTF-8"?>
<AriocE seed="hsi20_0_30" maxJ="*">
  <dataIn sequenceType="R" srcId="0" filePath="../data/assembly">
    <file subId="01" SN="(*)">W2A_rescaffolded_pantoea_removed.fasta</file>
  </dataIn>
  <dataOut>
    <path>tmp/ref_enc</path>
  </dataOut>
</AriocE>
  1. encode nongapped reference
<?xml version="1.0" encoding="UTF-8"?>
<AriocE seed="ssi84_2_30" maxJ="*">
  <dataIn sequenceType="R" srcId="0" filePath="../data/assembly">
    <file subId="01" SN="(*)">W2A_rescaffolded_pantoea_removed.fasta</file>
  </dataIn>
  <dataOut>
    <path>tmp/ref_enc</path>
  </dataOut>
</AriocE>
  1. encode reads
<?xml version="1.0" encoding="UTF-8"?>
<AriocE>
  <dataIn sequenceType="Q">
    <file subId="1" mate="1">/home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R1.trimmed.fastq</file>
    <file subId="1" mate="2">/home/zexuanzhao/projects/pegoGenome/data/illumina/W2A/trimmed/W2M.2_R2.trimmed.fastq</file>
  </dataIn>
  <dataOut>
    <path>tmp/reads_enc</path>
  </dataOut>
</AriocE>
  1. Pair mapping
<?xml version="1.0" encoding="UTF-8"?>
<AriocP gpuMask="0x00000001" batchSize="64k">
  <R>tmp/ref_enc</R>
  <nongapped seed="ssi84_2_30" maxJ="*"/>
  <gapped seed="hsi20_0_30" maxJ="*" Wmxgs="2,6,5,3" Vt="L,0,1"/>
  <Q filePath="tmp/reads_enc">
    <paired subId="1">
      <file>W2M.2_R1.trimmed</file>
      <file>W2M.2_R2.trimmed</file>
    </paired>
  </Q>
  <A overwrite="true" pairOrientation="c" pairCollision="oc" cigarFormat="MID" mapqVersion="3">
    <sam report="cdru">./</sam>
  </A>
</AriocP>

Thanks!

@RWilton
Copy link
Owner

RWilton commented Feb 24, 2023

This looks like a bug. Unfortunately, I have not been able to reproduce it using v1.51.3138 with the human reference genome and paired-end WGS reads.

Can you please point me to a copy of the reference genome you are using (W2A_rescaffolded_pantoea_removed.fasta)?

Alternatively, could you please provide a few of the definition lines in that FASTA file, including the one that defines chr03, the RNAME in both of the problematic mappings? (A simple grep '>' ought to suffice.) I just want to be certain that the SN="(*)" specification in the AriocE configuration file is appropriate for the format of the deflines.

Thank you.

@ZexuanZhao
Copy link
Author

Hi!

The seqnames of the reference are like this:

>chr01
>chr02
>chr03
>chr04
>chr05
>ptg000023l_unaligned ptg000023l
>ptg000037l_unaligned ptg000037l
>ptg000045l_unaligned ptg000045l
>ptg000048l_unaligned ptg000048l
>ptg000058l_unaligned ptg000058l
>ptg000061l_unaligned ptg000061l
>ptg000062l_unaligned ptg000062l
>ptg000063l_unaligned ptg000063l
>ptg000064l_unaligned ptg000064l
>ptg000066l_unaligned ptg000066l
>ptg000067l_unaligned ptg000067l
>ptg000069l_unaligned ptg000069l
>ptg000071l_unaligned ptg000071l
>ptg000072l_unaligned ptg000072l
>ptg000073l_unaligned ptg000073l
>ptg000079l_unaligned ptg000079l
>ptg000080l_unaligned ptg000080l
>ptg000081l_unaligned ptg000081l
>ptg000082l_unaligned ptg000082l
>ptg000083l_unaligned ptg000083l
>ptg000085l_unaligned ptg000085l
>ptg000086l_unaligned ptg000086l
>ptg000087l_unaligned ptg000087l
>ptg000088l_unaligned ptg000088l
>ptg000091l_unaligned ptg000091l
>ptg000093l_unaligned ptg000093l
>ptg000094l_unaligned ptg000094l
>ptg000095l_unaligned ptg000095l
>ptg000096l_unaligned ptg000096l
>ptg000097l_unaligned ptg000097l
>ptg000099l_unaligned ptg000099l
>ptg000100l_unaligned ptg000100l
>ptg000101l_unaligned ptg000101l
>ptg000103l_unaligned ptg000103l
>ptg000105l_unaligned ptg000105l
>ptg000106l_unaligned ptg000106l
>ptg000107l_unaligned ptg000107l
>ptg000108l_unaligned ptg000108l
>ptg000111l_unaligned ptg000111l
>ptg000112l_unaligned ptg000112l
>ptg000116l_unaligned ptg000116l
>ptg000119l_unaligned ptg000119l
>ptg000120l_unaligned ptg000120l
>ptg000121l_unaligned ptg000121l
>ptg000122l_unaligned ptg000122l
>ptg000124l_unaligned ptg000124l
>ptg000125l_unaligned ptg000125l
>ptg000126l_unaligned ptg000126l
>ptg000127l_unaligned ptg000127l
>ptg000128l_unaligned ptg000128l
>ptg000129l_unaligned ptg000129l
>ptg000131l_unaligned ptg000131l
>ptg000134l_unaligned ptg000134l
>ptg000136l_unaligned ptg000136l
>ptg000137l_unaligned ptg000137l
>ptg000139l_unaligned ptg000139l
>ptg000144l_unaligned ptg000144l
>ptg000146l_unaligned ptg000146l
>ptg000147l_unaligned ptg000147l
>ptg000148l_unaligned ptg000148l
>ptg000150l_unaligned ptg000150l
>ptg000154l_unaligned ptg000154l
>ptg000155l_unaligned ptg000155l
>ptg000156l_unaligned ptg000156l
>ptg000157l_unaligned ptg000157l
>ptg000158l_unaligned ptg000158l
>ptg000159l_unaligned ptg000159l
>ptg000161l_unaligned ptg000161l
>ptg000164l_unaligned ptg000164l
>ptg000168l_unaligned ptg000168l
>ptg000170c_unaligned ptg000170c
>ptg000172l_unaligned ptg000172l
>ptg000174l_unaligned ptg000174l
>ptg000175l_unaligned ptg000175l
>ptg000178l_unaligned ptg000178l
>ptg000179l_unaligned ptg000179l
>ptg000180l_unaligned ptg000180l
>ptg000181l_unaligned ptg000181l
>ptg000182l_unaligned ptg000182l
>ptg000183l_unaligned ptg000183l
>ptg000184l_unaligned ptg000184l
>ptg000185l_unaligned ptg000185l
>ptg000188l_unaligned ptg000188l
>ptg000189l_unaligned ptg000189l
>ptg000190l_unaligned ptg000190l
>ptg000191l_unaligned ptg000191l
>ptg000195l_unaligned ptg000195l
>ptg000198l_unaligned ptg000198l
>ptg000199l_unaligned ptg000199l
>ptg000201l_unaligned ptg000201l
>ptg000203l_unaligned ptg000203l
>ptg000204l_unaligned ptg000204l
>ptg000205l_unaligned ptg000205l
>ptg000206l_unaligned ptg000206l
>ptg000207l_unaligned ptg000207l
>ptg000208l_unaligned ptg000208l
>ptg000211l_unaligned ptg000211l
>ptg000212l_unaligned ptg000212l
>ptg000216l_unaligned ptg000216l
>ptg000222l_unaligned ptg000222l
>ptg000226l_unaligned ptg000226l
>ptg000228l_unaligned ptg000228l
>ptg000229l_unaligned ptg000229l
>ptg000230l_unaligned ptg000230l
>ptg000233l_unaligned ptg000233l
>ptg000234l_unaligned ptg000234l
>ptg000235l_unaligned ptg000235l
>ptg000236l_unaligned ptg000236l
>ptg000237l_unaligned ptg000237l
>ptg000239l_unaligned ptg000239l
>ptg000240l_unaligned ptg000240l
>ptg000244l_unaligned ptg000244l
>ptg000245l_unaligned ptg000245l
>ptg000246l_unaligned ptg000246l
>ptg000247l_unaligned ptg000247l
>ptg000248l_unaligned ptg000248l
>ptg000251l_unaligned ptg000251l
>ptg000253l_unaligned ptg000253l
>ptg000254l_unaligned ptg000254l
>ptg000255l_unaligned ptg000255l
>ptg000257l_unaligned ptg000257l
>ptg000258l_unaligned ptg000258l
>ptg000260l_unaligned ptg000260l
>ptg000262l_unaligned ptg000262l
>ptg000266l_unaligned ptg000266l
>ptg000268l_unaligned ptg000268l
>ptg000269l_unaligned ptg000269l
>ptg000270l_unaligned ptg000270l
>ptg000271l_unaligned ptg000271l
>ptg000272l_unaligned ptg000272l
>ptg000273l_unaligned ptg000273l
>ptg000274l_unaligned ptg000274l
>ptg000275l_unaligned ptg000275l
>ptg000276l_unaligned ptg000276l
>ptg000277l_unaligned ptg000277l
>ptg000279l_unaligned ptg000279l
>ptg000280l_unaligned ptg000280l
>ptg000282l_unaligned ptg000282l
>ptg000283l_unaligned ptg000283l
>ptg000285l_unaligned ptg000285l
>ptg000286l_unaligned ptg000286l
>ptg000290l_unaligned ptg000290l
>ptg000291l_unaligned ptg000291l
>ptg000293l_unaligned ptg000293l
>ptg000294l_unaligned ptg000294l
>ptg000295l_unaligned ptg000295l
>ptg000299l_unaligned ptg000299l
>ptg000300l_unaligned ptg000300l
>ptg000301l_unaligned ptg000301l
>ptg000303l_unaligned ptg000303l
>ptg000304l_unaligned ptg000304l
>ptg000305l_unaligned ptg000305l
>ptg000306l_unaligned ptg000306l
>ptg000307l_unaligned ptg000307l
>ptg000308l_unaligned ptg000308l
>ptg000309l_unaligned ptg000309l
>ptg000312l_unaligned ptg000312l
>ptg000314l_unaligned ptg000314l
>ptg000315l_unaligned ptg000315l
>ptg000319l_unaligned ptg000319l
>ptg000320l_unaligned ptg000320l
>ptg000321l_unaligned ptg000321l
>ptg000323l_unaligned ptg000323l
>ptg000324l_unaligned ptg000324l
>ptg000327l_unaligned ptg000327l
>ptg000328l_unaligned ptg000328l
>ptg000330l_unaligned ptg000330l
>ptg000333l_unaligned ptg000333l
>ptg000334l_unaligned ptg000334l
>ptg000335l_unaligned ptg000335l
>ptg000338l_unaligned ptg000338l
>ptg000339l_unaligned ptg000339l
>ptg000340l_unaligned ptg000340l
>ptg000342l_unaligned ptg000342l
>ptg000344l_unaligned ptg000344l
>ptg000345l_unaligned ptg000345l
>ptg000346l_unaligned ptg000346l
>ptg000347l_unaligned ptg000347l
>ptg000348l_unaligned ptg000348l
>ptg000349l_unaligned ptg000349l
>ptg000352l_unaligned ptg000352l
>ptg000353l_unaligned ptg000353l
>ptg000354l_unaligned ptg000354l
>ptg000355l_unaligned ptg000355l
>ptg000356l_unaligned ptg000356l
>ptg000357l_unaligned ptg000357l
>ptg000359l_unaligned ptg000359l
>ptg000360l_unaligned ptg000360l
>ptg000363l_unaligned ptg000363l
>ptg000365l_unaligned ptg000365l
>ptg000366l_unaligned ptg000366l
>ptg000369l_unaligned ptg000369l
>ptg000370l_unaligned ptg000370l
>ptg000373l_unaligned ptg000373l
>ptg000374l_unaligned ptg000374l
>ptg000375l_unaligned ptg000375l
>ptg000377l_unaligned ptg000377l
>ptg000378l_unaligned ptg000378l
>ptg000379l_unaligned ptg000379l
>ptg000380l_unaligned ptg000380l
>ptg000381l_unaligned ptg000381l
>ptg000382l_unaligned ptg000382l
>ptg000385l_unaligned ptg000385l
>ptg000386l_unaligned ptg000386l
>ptg000388l_unaligned ptg000388l
>ptg000389l_unaligned ptg000389l
>ptg000390l_unaligned ptg000390l
>ptg000392l_unaligned ptg000392l
>ptg000393l_unaligned ptg000393l
>ptg000394l_unaligned ptg000394l
>ptg000395l_unaligned ptg000395l
>ptg000397l_unaligned ptg000397l
>ptg000398l_unaligned ptg000398l
>ptg000399l_unaligned ptg000399l
>ptg000400l_unaligned ptg000400l
>ptg000401l_unaligned ptg000401l
>ptg000402l_unaligned ptg000402l
>ptg000405l_unaligned ptg000405l
>ptg000407l_unaligned ptg000407l
>ptg000408l_unaligned ptg000408l
>ptg000409l_unaligned ptg000409l
>ptg000410l_unaligned ptg000410l
>ptg000411l_unaligned ptg000411l

And the read name is:

@K00134:236:H5CJWBBXY:1:1104:29609:17175 1:N:0:ATCACGA

and

@K00134:236:H5CJWBBXY:1:1106:21085:38486 1:N:0:ATCACGA

The reason I used SN="(*)" is because I want to be as generic as possible. I also wrote a R scirpt to generate config files automatically so I could use Arioc in command line, due to the fact that aligning reads is a daily routine and it can really benefit from Arioc's speed up.

#!/usr/bin/env Rscript

# Author information
## Author: Zexuan Zhao  
## Email: zzhao127@umd.edu

if (!require('argparse')) install.packages('argparse', repos='http://cran.us.r-project.org'); suppressPackageStartupMessages(library("argparse"))
if (!require('xml2')) install.packages('xml2', repos='http://cran.us.r-project.org'); suppressPackageStartupMessages(library("xml2"))
if (!require('glue')) install.packages('glue', repos='http://cran.us.r-project.org'); suppressPackageStartupMessages(library("glue"))

# Create parser object
parser <- ArgumentParser(description='Description here')

# Parse arguments
parse_args().
parser$add_argument("-r", "--reference", 
                    action = "store", 
                    type = "character",
                    required = TRUE,
                    help = "Reference file")
parser$add_argument("-d", "--data_table", 
                    action = "store", 
                    type = "character",
                    required = TRUE,
                    help = "Tsv file storing absolute paths of reads with 3 columns: sample, R1, R2")

parser$add_argument("-b", "--binary_folder", 
                    action = "store", 
                    type = "character",
                    required = TRUE,
                    help = "Directory to Arioc's binary folder (bin).")

# get command line options, if help option encountered print help and exit,
# otherwise if options not found on command line then set defaults, 
args <- parser$parse_args()

# Debug chunk, please ignore
#print(args$reference)
#print(args$data_table)
#print(args$binary_folder)
#stop()

# Make temporary directory
dir.create("./tmp")
dir.create("./tmp/cfg")
dir.create("./tmp/cfg/mapping")
dir.create("tmp/ref_enc")
dir.create("tmp/reads_enc")

# Read data
data <- read.delim(file = args$data_table)

ref <- args$reference
ref_dir <- dirname(ref)
ref_file <- basename(ref)

# Prepare config files
## 1. prepare gapped reference
## Make root node
encode_reference_gapped <- xml_new_root("AriocE", seed="hsi20_0_30", maxJ="*")

## Make dataIn and dataOut node
xml_add_child(encode_reference_gapped, "dataIn", sequenceType="R", srcId="0", filePath = ref_dir)
xml_add_child(encode_reference_gapped, "dataOut")

## Make input and output node
xml_add_child(xml_children(encode_reference_gapped)[1], "file", subId="01", SN="(*)", ref_file)
xml_add_child(xml_children(encode_reference_gapped)[2], "path", "tmp/ref_enc")

## Write to file
write_xml(encode_reference_gapped, file = file.path("tmp", "cfg", "encode_reference_gapped.cfg"))

## 2. prepare nongapped reference
## Make root node
encode_reference_nongapped <- xml_new_root("AriocE", seed="ssi84_2_30", maxJ="*")

## Make dataIn and dataOut node
xml_add_child(encode_reference_nongapped, "dataIn", sequenceType="R", srcId="0", filePath = ref_dir)
xml_add_child(encode_reference_nongapped, "dataOut")

## Make file node to dataIn and path node to dataOut
xml_add_child(xml_children(encode_reference_nongapped)[1], "file", subId="01", SN="(*)", ref_file)
xml_add_child(xml_children(encode_reference_nongapped)[2], "path", "tmp/ref_enc")

## Write to file
write_xml(encode_reference_nongapped, file = file.path("tmp", "cfg", "encode_reference_nongapped.cfg"))

## 3. prepare reads 
## Make root node
encode_reads <- xml_new_root("AriocE")

## Make dataIn and dataOut node
xml_add_child(encode_reads, "dataIn", sequenceType="Q")
xml_add_child(encode_reads, "dataOut")

## Add read files to dataIn
for (i in 1:nrow(data)){
  xml_add_child(xml_children(encode_reads)[1], "file", subId = i, mate = "1", data[i,"R1"])
  xml_add_child(xml_children(encode_reads)[1], "file", subId = i, mate = "2", data[i,"R2"])
}

## Add path node to dataOut
xml_add_child(xml_children(encode_reads)[2], "path", "tmp/reads_enc")

## Write to file
write_xml(encode_reads, file = file.path("tmp", "cfg", "encode_reads.cfg"))

## 4. Prepare reads
## Make a file for each sample
for (i in 1:nrow(data)){
  sample <- data[i,"sample"]
  R1_file <- sub(".fastq", "", basename(data[i,"R1"]))
  R2_file <- sub(".fastq", "", basename(data[i,"R2"]))
  ## Make root node
  map_reads <- xml_new_root("AriocP", gpuMask="0x00000001", batchSize="64k")
  ## Add reference node
  xml_add_child(map_reads, "R", "tmp/ref_enc")
  ## Add gapped and nongapped reference
  xml_add_child(map_reads, "nongapped", seed="ssi84_2_30", maxJ="*")
  xml_add_child(map_reads, "gapped", seed="hsi20_0_30", maxJ="*", Wmxgs="2,6,5,3", Vt="L,0,1")
  ## Add read node
  xml_add_child(map_reads, "Q", filePath = "tmp/reads_enc")
  ## Add read files to read node
  xml_add_child(xml_children(map_reads)[4], "paired", subId = 1)
  xml_add_child(xml_children(xml_children(map_reads)[4])[1], "file", R1_file)
  xml_add_child(xml_children(xml_children(map_reads)[4])[1], "file", R2_file)
  ## Add output node
  xml_add_child(map_reads, "A", overwrite="true", pairOrientation="c", pairCollision="oc", cigarFormat="MID", mapqVersion="3")
  xml_add_child(xml_children(map_reads)[5], "sam", report="cdru", "./")
  write_xml(map_reads, file = file.path("tmp", "cfg", "mapping", paste0("mapping", "_", sample, ".cfg")))
}

## Run files

# 1. encode gapped reference
system(glue("{args$binary_folder}/AriocE tmp/cfg/encode_reference_gapped.cfg"))

# 2. encode nongapped reference
system(glue("{args$binary_folder}/AriocE tmp/cfg/encode_reference_nongapped.cfg"))

# 3. encode reads
system(glue("{args$binary_folder}/AriocE tmp/cfg/encode_reads.cfg"))

# 4. mapping 
mapping_cfgs <- list.files("tmp/cfg/mapping/")
for(mapping_cfg in mapping_cfgs){
  sample <- sub("mapping_", "", sub(".cfg", "", mapping_cfg))
  system(glue("{args$binary_folder}/AriocP tmp/cfg/mapping/{mapping_cfgs}"))
  system(glue("mv AriocP.cudr.001.sam tmp/{sample}.sam"))
}

# 5. convert to sorted bam
sams <- list.files("./tmp", "*.sam")
for (sam in sams){
  sample <- sub(".sam", "", sam)
  system(glue("samtools view -@ 10 -Sb tmp/{sam} > tmp/{sample}.bam"))
  system(glue("samtools sort -@ 10 -o {sample}.sorted.bam tmp/{sample}.bam"))
  system(glue("samtools index {sample}.sorted.bam"))
}

@RWilton
Copy link
Owner

RWilton commented Mar 2, 2023

Thank you for posting the reference sequence deflines and the workflow script. I have a couple of comments and one more request for information.

1: Unlike other general-purpose read aligners like BWA and Bowtie, Arioc separately performs the tasks of transforming the ASCII-formatted FASTA (reference) and FASTQ (read) sequences into a binary format that facilitates alignment. The underlying idea is to do these things on different computers at different times, with the goal of running things concurrently on different machines in a cluster. (This is particularly helpful in the typical use case where Arioc is used in a cluster that comprises both GPU nodes and non-GPU nodes.)

This means you should design workflows to do these tasks separately. Encode the reference genome once (AriocE gapped, AriocE nongapped) and leave the encoded files in a shared directory where they can be referenced in subsequent AriocP config files. (This is actually no different from what you would do with BWA or Bowtie.) Also, if you are running in the typical cluster that has GPU nodes and non-GPU nodes, do the same with the encoded read sequences.

2: As you know, the archaic 1960s-style FASTA and SAM formats are pretty awful with regard to how whitespace is handled. AriocE and AriocP contain some special-case logic to deal with embedded spaces and tabs in RNAMEs and QNAMEs, but it's possible that your RNAMEs -- some of which contain embedded whitespace and some of which do not -- are the data that is eliciting the bug.

You might try to confirm this suspicion by altering your SN specification to SN="(*) ", i.e., try to exclude the whitespace from the RNAMEs. This won't fix anything, since your RNAMEs will no longer contain everything in the FASTA deflines, but if the bug goes away, we'll know exactly what we're dealing with.

Rather than guess at this, however, it would be preferable to use your data to reproduce this bug. Can you please upload (or otherwise make available, e.g. via FTP) the FASTA-formatted reference sequences and the FASTQ for at least one of the read pairs that causes the error to occur?

@RWilton
Copy link
Owner

RWilton commented Mar 21, 2023

I revisited this bug by extracting SEQ and QUAL from the SAM records you provided and aligning to the human reference genome, but I was still unable to reproduce the problem.

Two comments:

-- Without the original FASTA (for the reference genome) and FASTQ (for the one or two read pairs whose alignment records are garbled) I see no straightforward way to troubleshoot the problem you observed.

-- In the SAM records you provided, both SEQ and QUAL are identical for both mates. This strikes me as being highly unlikely to occur in valid sequencer output. I do not think this explains the problem you have seen with spurious characters in the SAM output, but I would be curious to understand what's going on with those reads -- in particular, whether Arioc has garbled these output fields as well as the others you discovered in the SAM output. Again, the original FASTQ for those reads would be pertinent here.

Thank you again for taking the time to document this bug in the software.

@ZexuanZhao
Copy link
Author

Thank you for your advice.

This means you should design workflows to do these tasks separately. Encode the reference genome once (AriocE gapped, AriocE nongapped) and leave the encoded files in a shared directory where they can be referenced in subsequent AriocP config files.

My lab has a workstation with graphic card installed. That is why the pipeline is not separated.

In the SAM records you provided, both SEQ and QUAL are identical for both mates. This strikes me as being highly unlikely to occur in valid sequencer output.

The reads and reference sequences in the region are telomeric repeats.

Can you please upload (or otherwise make available, e.g. via FTP) the FASTA-formatted reference sequences and the FASTQ for at least one of the read pairs that causes the error to occur?

The reference genome is not published yet. I will send to you by email. It is also useful if you could inform me how to transfer files on GitHub securely.

Thanks,
Zexuan

@RWilton
Copy link
Owner

RWilton commented May 17, 2023

Ok, when I used your reference genome and your FASTQ of one of the read pairs, I discovered the following:

  1. There was a bug in AriocE's encoding of the reference genome that corrupted the defline of at least one of the reference sequences in the FASTA file you provided.

  2. The FASTQ you provided was not in FASTQ format; AriocE handled this with a segfault.

The fixes were to repair the bug and to harden AriocE to produce an error message rather than a segfault when confronted with certain invalid FASTQ formats.

I then used the SAM output you provided previously to produce a correctly-formatted FASTQ representation of the two reads for which you originally encountered corrupt SAM records as well as for the third pair in your most recent FASTQ. These reads now align without errors to a properly-encoded reference:

AriocP.debug.zip

Two comments:

First, the reference genome you provided is not identical to the one for which you originally reported this problem. You might want to verify that the bug is indeed fixed with the original FASTA as well.

Also, reference-genome encoding is a one-time-only process (assuming it isn't buggy :-). So even on a single machine, there's no reason to do it in the workflow for every WGS sample you align. (The same goes for every other short-read aligner I've ever used.)

I have pushed a new build (v1.51.3140) of the Arioc distribution that contains these fixes.

Thank you again for your help.

@RWilton RWilton closed this as completed Aug 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants