Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct way to encode hg38 ? #20

Closed
karlkashofer opened this issue Oct 17, 2021 · 6 comments
Closed

Correct way to encode hg38 ? #20

karlkashofer opened this issue Oct 17, 2021 · 6 comments

Comments

@karlkashofer
Copy link

I try to encode the latest masked hg38 reference and face some problems. the file i want to encode is here: GRCh38
It contains maskings for some problems in the original hg38, see this blogpost and paper.

It has 195 contigs, using wildcard matching for the individual contigs like:

<file SN="*" subId="1">GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta</file>

yields:

worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.ungapped.cfg 215111040 [0001217e] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 215111046 [0001217e] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 215111046 [0001217e] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 215111046 [0001217e] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 215111046 [0001217e] computer name : sx253 215111046 [0001217e] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 215111046 [0001217e] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.ungapped.cfg 215112579 [0001217e] AriocE::encodeR: encoding 1 file (32 CPU threads available)... 220200556 [00012183] ApplicationException ([0x00074115] AriocE/tuEncodeFASTA.cpp 64): Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823: /scratch/karl/arioc/R/GRCh38_masked_v2/GCA_000001405.15_GRCh38_no_alt_analysis_set_maskedGRC_exclusions_v2.fasta

splitting the fasta into individual chromosomes and using one file line for each chromosome (n=195) of the reference yields:

worker@sx253:/scratch/karl/arioc/R/GRCh38_masked_v2$ ../../Arioc_150/bin/AriocE AriocE_HG38_p13.gapped.cfg 230517831 [000126de] AriocE v1.50.3035.21246 (release) +2021-09-03T17:49:58 cat: /etc/system-release: No such file or directory 230517838 [000126de] Copyright (c) 2015-2021 Johns Hopkins University. All rights reserved. 230517838 [000126de] compiled : 2021-09-24T16:15:04 (GNU g++ v8.3.0) 230517838 [000126de] data type sizes : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian) 230517838 [000126de] computer name : sx253 230517838 [000126de] executable file : /scratch/karl/arioc/Arioc_150/bin/AriocE 230517838 [000126de] configuration file : /scratch/karl/arioc/R/GRCh38_masked_v2/AriocE_HG38_p13.gapped.cfg 230518180 [000126de] AriocE::encodeR: encoding 195 files (32 CPU threads available)... 230543830 [000126de] RaiiFile::internalOpen: Unable to open file: /scratch/karl/arioc/R/GRCh38_masked_v2/chromosomes/chrUn_KI270414v1.fa (error 24: Too many open files) Segmentation fault

So, how do i correctly encode the hg38 reference with all its masking chromosomes ?

@RWilton
Copy link
Owner

RWilton commented Oct 18, 2021

Please encode each of the canonical GRCh38 chromosomes in its own FASTA file. Place the 195 "masking chromosomes" into an additional FASTA file and reference it in the configuration file as well. You will, of course, need to ensure that each of the 195 deflines in that FASTA file is formatted so that a unique identifier (in SAM/BAM, SN or "sequence name") for each can be parsed by using a simple wildcard specification.

Here is an example where the unique ID is the third bar-separated field in each defline:

<file SN="*|*|(*)|" subId="101" uriPath="https://rvdb.dbi.udel.edu/">U-RVDBv21.0.TP.1.fasta</file>

The subId= parameter is required and must be unique in the configuration file.

If it will help, the Arioc documentation says a bit more about how to accomplish this.

As an aside: 5 or 6 months ago I was asked by my colleague, Steven Salzberg, to take a look at the Simons Genome Diversity Project coverage of those erroneously duplicated regions in GRCh38. It wasn't trivial to sort out, because a short-read aligner simply sees multiple potential mappings for reads that originate in the problematic regions, and the mappings are distributed between those regions nondeterministically, i.e., they may be spread across both regions "randomly" or may tend to pile up in one of the two regions. Either way, they have very low MAPQs because they have equivalent mappings in at least two different places. It's similar to what can happen with "alt contigs"...

@karlkashofer
Copy link
Author

I did use the wildcard defline approach in my first attempt which yielded the "Aggregated R sequence length 3099956958 exceeds the maximum supported length 1073741823" error.

However, doing individual file lines for the large canonical chromosomes, and packing the remaining 173 ones into one fasta referenced by wildcard matching has worked !

Thanks!

@karlkashofer
Copy link
Author

OK, this story goes on. I now try to encode the official GATK hg38 reference from https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle

It contains >3500 contigs and apparently Arioc chokes on some of the deflines. As reported earlier i now use wildcard lines for encoding the reference, namely <file SN="chr1" subId="1">chr1.fa</file> for deflines of the canonical chromosomes:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
and wildcard matching
<file SN="(*) *" subId="101">others.fa</file>
for all other contigs concatenated into a single fasta file with deflines like

>chrUn_KN707992v1_decoy  AC:KN707992.1  gi:734691636  LN:1830  rl:unplaced  M5:1bac13a3ad2592a344e18bf8de117574  AS:hs38d1
...
>HLA-A*01:01:01:01      HLA00001 3503 bp

After encoding i use the reference for AriocP and it produces erroneous @sq lines for the HLA chromosomes:

[E::sam_hrecs_error] Malformed key:value pair at line 2844: "@SQ	SN:HLA-A*01:01:01:01	HLA00001	LN:3503"
[E::sam_hrecs_error] Malformed key:value pair at line 2844: "@SQ	SN:HLA-A*01:01:01:01	HLA00001	LN:3503"

so it seems the "(*) *" parenthesis matching did not work for these deflines, it put two items into the SN tag of the HLA contigs which makes samtools choke.

Is this analysis correct ? Do you see a reason why the parenthesis matching did not work for the HLA contigs ?
Any recommendations on how to solve this ?

@karlkashofer karlkashofer reopened this Oct 21, 2021
@RWilton
Copy link
Owner

RWilton commented Oct 21, 2021

Hello, Karl --

In the problematic definition line

>HLA-A*01:01:01:01 HLA00001 3503 bp

can you please confirm exactly which characters separate the strings >HLA-A*01:01:01:01 and HLA00001 in the FASTA input? (You may need to do this with a binary editor or with a utility such as hexdump.)

@karlkashofer
Copy link
Author

Dooh... they put a tab into these deflines....
Ill replace that tab by space and see...

@karlkashofer
Copy link
Author

Yes, that solved it. Thanks for your help!

@karlkashofer karlkashofer reopened this Oct 30, 2021
@RWilton RWilton closed this as completed Nov 7, 2021
Repository owner deleted a comment from karlkashofer Nov 13, 2021
Repository owner deleted a comment from karlkashofer Nov 13, 2021
Repository owner deleted a comment from karlkashofer Nov 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants