IgBLAST auxiliary data incongruences
This document keeps track of some incongruences observed in the auxiliary data included in IgBLAST (*_gl.aux files).
When possible, we suggest a possible fix, but note that we have no authority in these matters.
The "first coding frame start" and "CDR3 stop" fields in a *_gl.aux file (2nd and 4th field) are both 0-based positions. The latter should always be the position of the 3rd nucleotide in a codon. This means that the difference between the latter and the former should always be 2 modulo 3.
However, this is not the case for the following germline J gene alleles:
- human TRAJ31*01
- mouse TRAJ21*02, TRAJ24*02, and TRDJ2*02
For human allele TRAJ31*01, human_gl.aux reports:
           first_coding_frame_start  CDR3_stop
TRAJ31*01                         1         22
This gives a value for "(CDR3_stop - first_coding_frame_start) modulo 3" that is 0.
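The check described above can be sketched in a few lines of Python. The function names are ours, chosen for illustration; the values are the ones reported for TRAJ31*01 in human_gl.aux.

```python
def cdr3_offset_mod3(first_coding_frame_start: int, cdr3_stop: int) -> int:
    """Return (CDR3_stop - first_coding_frame_start) modulo 3.

    Both positions are 0-based, as in the *_gl.aux files.
    """
    return (cdr3_stop - first_coding_frame_start) % 3


def is_consistent(first_coding_frame_start: int, cdr3_stop: int) -> bool:
    """True if CDR3_stop falls on the 3rd nucleotide of a codon."""
    return cdr3_offset_mod3(first_coding_frame_start, cdr3_stop) == 2


# Values reported for human TRAJ31*01 in human_gl.aux.
print(cdr3_offset_mod3(1, 22))  # 0
print(is_consistent(1, 22))     # False
```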
Note that this issue is still present in the updated version of human_gl.aux from April 2025 available at
https://ftp.ncbi.nih.gov/blast/executables/igblast/release/patch/optional_file/
Suggested fix:
Knowing the nucleotide sequence of the allele would help disambiguate. However, TRAJ31*01 is not part of the set of human TR alleles available at IMGT/V-QUEST, so we don't have access to its nucleotide sequence.
We were not able to find any reliable information about this allele either.
For mouse alleles TRAJ21*02, TRAJ24*02, and TRDJ2*02, mouse_gl.aux reports:
           first_coding_frame_start  CDR3_stop
TRAJ21*02                         2         20
TRAJ24*02                         3         28
TRDJ2*02                          1         23
The corresponding values for "(CDR3_stop - first_coding_frame_start) modulo 3" are:
- 0 for TRAJ21*02
- 1 for TRAJ24*02
- 1 for TRDJ2*02
Suggested fix:
These alleles are part of the set of mouse TR alleles available at IMGT/V-QUEST. The FASTA files containing the nucleotide sequences for these alleles are here. The header lines in these files contain 15 fields that provide various information about each allele. The 8th field is the "codon start" which is the 1-based equivalent of first_coding_frame_start.
The "codon start" reported in the IMGT FASTA header lines is:
- 1 for TRAJ21*02
- 3 for TRAJ24*02
- 1 for TRDJ2*02
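As a rough sketch, the "codon start" could be read from a header line as follows. This assumes the header fields are separated by `|`, which is not stated above, and the example header is entirely hypothetical (made-up accession, allele name, and positions); the function names are ours.

```python
def imgt_codon_start(header: str) -> int:
    """Return the 1-based codon start (8th field) of an IMGT-style FASTA header."""
    # Assumption: fields are separated by '|' and the line starts with '>'.
    fields = header.lstrip(">").split("|")
    return int(fields[7])  # 8th field has index 7


def to_zero_based(codon_start: int) -> int:
    """Convert the 1-based codon start to a 0-based first_coding_frame_start."""
    return codon_start - 1


# Hypothetical header, for illustration only (not a real IMGT entry).
example_header = ">X00000|EXAMPLEJ1*01|Mus musculus|F|J-REGION|1..60|60 nt|3| | | | |60+0=60| | |"

print(imgt_codon_start(example_header))                 # 3
print(to_zero_based(imgt_codon_start(example_header)))  # 2
```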
The 0-based values would be:
- 0 for TRAJ21*02
- 2 for TRAJ24*02
- 0 for TRDJ2*02
Using these values for first_coding_frame_start in mouse_gl.aux fixes the problem.
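The effect of the substitution can be verified with a short script. The variable names are ours; the numbers are the ones listed above for mouse_gl.aux and for the IMGT "codon start".

```python
# (first_coding_frame_start, CDR3_stop) as reported in mouse_gl.aux (0-based).
reported = {
    "TRAJ21*02": (2, 20),
    "TRAJ24*02": (3, 28),
    "TRDJ2*02": (1, 23),
}

# 1-based "codon start" taken from the IMGT FASTA header lines.
imgt_codon_start_1based = {"TRAJ21*02": 1, "TRAJ24*02": 3, "TRDJ2*02": 1}


def offset_mod3(start: int, stop: int) -> int:
    """(CDR3_stop - first_coding_frame_start) modulo 3; expected to be 2."""
    return (stop - start) % 3


# Residue with the values currently in mouse_gl.aux.
before = {a: offset_mod3(start, stop) for a, (start, stop) in reported.items()}

# Residue after replacing first_coding_frame_start with (codon start - 1).
after = {
    a: offset_mod3(imgt_codon_start_1based[a] - 1, stop)
    for a, (_, stop) in reported.items()
}

print(before)  # {'TRAJ21*02': 0, 'TRAJ24*02': 1, 'TRDJ2*02': 1}
print(after)   # {'TRAJ21*02': 2, 'TRAJ24*02': 2, 'TRDJ2*02': 2}
```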
3. Incompatible "first coding frame start" / "extra bps beyond J coding end" / "sequence length" combination
Some germline J gene alleles have "first coding frame start" (2nd field) and "extra bps beyond J coding end" (5th field) values that are incompatible with the length of the nucleotide sequence provided by IMGT.
More precisely, subtracting the "first coding frame start" and "extra bps beyond J coding end" values from the sequence length gives the length of the coding frame (in number of nucleotides). This should always be a multiple of 3.
However, this is not the case for mouse germline J gene alleles TRAJ31*02, TRAJ32*02, TRAJ45*02, and TRAJ59*01.
For these alleles we have:
           first_coding_frame_start  extra_bps  allele_length
TRAJ31*02                         2          0             57
TRAJ32*02                         1          0             65
TRAJ45*02                         1          1             55
TRAJ59*01                         0          1             62
Note that the allele_length column above is our addition. It contains the length of the allele sequences provided by IMGT.
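The coding frame length and its remainder modulo 3 can be computed for the four alleles with the sketch below. The names are ours; the numbers come from the table above.

```python
# (first_coding_frame_start, extra_bps, allele_length) for each allele.
rows = {
    "TRAJ31*02": (2, 0, 57),
    "TRAJ32*02": (1, 0, 65),
    "TRAJ45*02": (1, 1, 55),
    "TRAJ59*01": (0, 1, 62),
}


def coding_frame_length(start: int, extra_bps: int, allele_length: int) -> int:
    """Number of nucleotides in the coding frame; should be a multiple of 3."""
    return allele_length - start - extra_bps


# Map each allele to (coding frame length, remainder modulo 3).
frame = {
    allele: (coding_frame_length(*values), coding_frame_length(*values) % 3)
    for allele, values in rows.items()
}

for allele, (length, remainder) in frame.items():
    print(allele, length, remainder)
# TRAJ31*02 55 1
# TRAJ32*02 64 1
# TRAJ45*02 53 2
# TRAJ59*01 61 1
```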
Suggested fix:
The "codon start" reported in the IMGT FASTA header lines for these alleles agrees with the first_coding_frame_start value in mouse_gl.aux. Since first_coding_frame_start and allele_length both come from IMGT, we take them as the reference.
This means that extra_bps should be corrected as follows:
- 1 for TRAJ31*02
- 1 for TRAJ32*02
- 0 for TRAJ45*02
- 2 for TRAJ59*01
With these values, allele_length - first_coding_frame_start - extra_bps is a multiple of 3.
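The corrected values can be derived mechanically, as in the sketch below. It assumes that extra_bps is always 0, 1, or 2 (the smallest non-negative value that makes the coding frame length a multiple of 3), which matches the corrections listed above but is our assumption rather than something stated by IgBLAST. The function name is ours.

```python
def expected_extra_bps(allele_length: int, first_coding_frame_start: int) -> int:
    """Smallest non-negative extra_bps making the coding frame a multiple of 3.

    Assumption: extra_bps is always in {0, 1, 2}.
    """
    return (allele_length - first_coding_frame_start) % 3


# (first_coding_frame_start, allele_length), both taken from IMGT.
alleles = {
    "TRAJ31*02": (2, 57),
    "TRAJ32*02": (1, 65),
    "TRAJ45*02": (1, 55),
    "TRAJ59*01": (0, 62),
}

corrected = {
    allele: expected_extra_bps(length, start)
    for allele, (start, length) in alleles.items()
}
print(corrected)
# {'TRAJ31*02': 1, 'TRAJ32*02': 1, 'TRAJ45*02': 0, 'TRAJ59*01': 2}

# With the corrected values, the coding frame length is a multiple of 3.
for allele, (start, length) in alleles.items():
    assert (length - start - corrected[allele]) % 3 == 0
```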