-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct way to encode hg38 ? #20
Comments
Please encode each of the canonical GRCh38 chromosomes in its own FASTA file. Place the 195 "masking chromosomes" into an additional FASTA file and reference it in the configuration file as well. You will, of course, need to ensure that each of the 195 deflines in that FASTA file is formatted so that a unique identifier (in SAM/BAM, Here is an example where the unique ID is the third bar-separated field in each defline:
The If it will help, the Arioc documentation says a bit more about how to accomplish this. As an aside: 5 or 6 months ago I was asked by my colleague, Steven Salzberg, to take a look at the Simons Genome Diversity Project coverage of those erroneously duplicated regions in GRCh38. It wasn't trivial to sort out, because a short-read aligner simply sees multiple potential mappings for reads that originate in the problematic regions, and the mappings are distributed between those regions nondeterministically, i.e., they may be spread across both regions "randomly" or may tend to pile up in one of the two regions. Either way, they have very low MAPQs because they have equivalent mappings in at least two different places. It's similar to what can happen with "alt contigs"... |
I did use the wildcard defline approach in my first attempt which yielded the "Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823" error. However, doing individual file lines for the large canonical chromosomes, and packing the remaining 173 ones into one fasta referenced by wildcard matching has worked ! Thanks! |
OK, this story goes on. I now try to encode the official GATK hg38 reference from https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle It contains >3500 contigs and apparently Arioc chokes on some of the deflines. As reported earlier i now use wildcard lines for encoding the reference, namely
After encoding i use the reference for AriocP and it produces erroneous @sq lines for the HLA chromosomes:
so it seems the "(*) *" parenthesis matching did not work for these deflines, it put two items into the SN tag of the HLA contigs which makes samtools choke. Is this analysis correct ? Do you see a reason why the parenthesis matching did not work for the HLA contigs ? |
Hello, Karl -- In the problematic definition line
can you please confirm exactly which characters separate the strings |
Dooh... they put a tab into these deflines.... |
Yes, that solved it. Thanks for your help! |
I try to encode the latest masked hg38 reference and face some problems. the file i want to encode is here: GRCh38
It contains maskings for some problems in the original hg38, see this blogpost and paper.
It has 195 contigs, using wildcard matching for the individual contigs like:
<file SN="*" subId="1">GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta</file>
yields:
worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.ungapped.cfg 215111040 [0001217e] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 215111046 [0001217e] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 215111046 [0001217e] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 215111046 [0001217e] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 215111046 [0001217e] computer name : sx253 215111046 [0001217e] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 215111046 [0001217e] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.ungapped.cfg 215112579 [0001217e] AriocE::encodeR: encoding 1 file (32 CPU threads available)... 220200556 [00012183] ApplicationException ([0x00074115] AriocE/tuEncodeFASTA.cpp 64): Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823: /scratch/karl/arioc/R/GRCh38_masked_v2/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta
splitting the fasta into individual chromosomes and using one file line for each chromosome (n=195) of the reference yields:
worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.gapped.cfg 230517831 [000126de] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 230517838 [000126de] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 230517838 [000126de] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 230517838 [000126de] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 230517838 [000126de] computer name : sx253 230517838 [000126de] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 230517838 [000126de] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.gapped.cfg 230518180 [000126de] AriocE::encodeR: encoding 195 files (32 CPU threads available)... 230543830 [000126de] RaiiFile::internalOpen: Unable to open file: /scratch/karl/arioc/R/GRCh38_masked_v2/chromosomes/chrUn_KI270414v1.fa (error 24: Too many open files) Segmentation fault
So, how do i correctly encode the hg38 reference with all its masking chromosomes ?
The text was updated successfully, but these errors were encountered: