IgBLAST auxiliary data incongruences

Hervé Pagès edited this page Jun 5, 2026 · 8 revisions

1. Introduction

This document keeps track of some incongruences observed in the auxiliary data included in IgBLAST (*_gl.aux files).

Where we can, we suggest a fix, but note that we have no authority in these matters.

2. Incompatible "first coding frame start" / "CDR3 stop" combination

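The "first coding frame start" and "CDR3 stop" fields in a *_gl.aux file (2nd and 4th field) are both 0-based positions. The latter should always be the position of the 3rd nucleotide in a codon. Since the first codon occupies positions start, start + 1, and start + 2, the 3rd nucleotide of any codon sits at position start + 2 + 3k for some k >= 0. This means that the difference between "CDR3 stop" and "first coding frame start" should always be 2 modulo 3.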
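This rule can be checked mechanically. The following is a minimal sketch, assuming that *_gl.aux rows are whitespace-delimited and that comment lines start with `#` (only the positions of the 2nd and 4th fields come from the description above). The function names and the sample rows are hypothetical. The sample rows are not taken from any real file.

```python
def cdr3_frame_residue(first_coding_frame_start: int, cdr3_stop: int) -> int:
    """Return (CDR3_stop - first_coding_frame_start) modulo 3.

    A consistent row gives 2.
    """
    return (cdr3_stop - first_coding_frame_start) % 3


def find_inconsistent_alleles(aux_lines):
    """Yield (allele, residue) for every row whose residue is not 2.

    Assumption: rows are whitespace-delimited and '#' starts a comment line.
    """
    for line in aux_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # row too short to carry a CDR3 stop
        residue = cdr3_frame_residue(int(fields[1]), int(fields[3]))
        if residue != 2:
            yield fields[0], residue


# Hypothetical rows; the 3rd and 5th fields are placeholders.
sample = [
    "# hypothetical example rows",
    "EXAMPLEJ1*01\t0\tXX\t20\t0",  # 20 - 0 = 20, residue 2: consistent
    "EXAMPLEJ2*01\t1\tXX\t22\t0",  # 22 - 1 = 21, residue 0: inconsistent
]
flagged = list(find_inconsistent_alleles(sample))
print(flagged)
```

To scan a real file, the same generator could be fed an open file handle instead of the `sample` list.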

However, this is not the case for the following germline J gene alleles:

  • human TRAJ31*01
  • mouse TRAJ21*02, TRAJ24*02, and TRDJ2*02

2.1 Human TRAJ31*01

For human allele TRAJ31*01, human_gl.aux reports:

               first_coding_frame_start  CDR3_stop
   TRAJ31*01                          1         22

This gives a value of 0 for "(CDR3_stop - first_coding_frame_start) modulo 3" (22 - 1 = 21), instead of the expected 2.

Note that this issue is still present in the updated version of human_gl.aux from April 2025 available at https://ftp.ncbi.nih.gov/blast/executables/igblast/release/patch/optional_file/

Suggested fix:

Knowing the nucleotide sequence of the allele would help disambiguate. However, TRAJ31*01 is not part of the set of human TR alleles available at IMGT/V-QUEST, so we don't have access to its nucleotide sequence.

We were not able to find any reliable information about this allele either.

2.2 Mouse TRAJ21*02, TRAJ24*02, and TRDJ2*02

For mouse alleles TRAJ21*02, TRAJ24*02, and TRDJ2*02, mouse_gl.aux reports:

               first_coding_frame_start  CDR3_stop
   TRAJ21*02                          2         20
   TRAJ24*02                          3         28
   TRDJ2*02                           1         23

The corresponding values for "(CDR3_stop - first_coding_frame_start) modulo 3" are:

  • 0 for TRAJ21*02
  • 1 for TRAJ24*02
  • 1 for TRDJ2*02

Suggested fix:

These alleles are part of the set of mouse TR alleles available at IMGT/V-QUEST. The FASTA files containing the nucleotide sequences for these alleles are here. The header lines in these files contain 15 fields that provide various information about each allele. The 8th field is the "codon start" which is the 1-based equivalent of first_coding_frame_start.
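A minimal sketch of reading the "codon start" from such a header line and converting it to a 0-based value. It assumes the header fields are separated by `|`, which is not stated above. The function names are hypothetical, and the header is a synthetic placeholder built for illustration, not a real IMGT record.

```python
def codon_start_from_header(header: str) -> int:
    """Return the 1-based "codon start" stored in the 8th header field.

    Assumption: header fields are separated by '|'.
    """
    fields = header.strip().lstrip(">").split("|")
    return int(fields[7])  # index 7 is the 8th field


def to_zero_based(codon_start: int) -> int:
    """Convert a 1-based codon start to a 0-based first_coding_frame_start."""
    return codon_start - 1


# Synthetic 15-field header: every field is a placeholder except the 8th.
synthetic_header = ">" + "|".join(
    "3" if i == 8 else f"field{i}" for i in range(1, 16)
)

codon_start = codon_start_from_header(synthetic_header)
first_coding_frame_start = to_zero_based(codon_start)
print(codon_start, first_coding_frame_start)
```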

The "codon start" reported in the IMGT FASTA header lines is:

  • 1 for TRAJ21*02
  • 3 for TRAJ24*02
  • 1 for TRDJ2*02

The 0-based values would be:

  • 0 for TRAJ21*02
  • 2 for TRAJ24*02
  • 0 for TRDJ2*02

Using these values for first_coding_frame_start in mouse_gl.aux fixes the problem.
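The arithmetic behind this claim can be sketched as follows, using only the CDR3_stop values from mouse_gl.aux and the IMGT "codon start" values listed above. The variable names are ours.

```python
# CDR3_stop as reported in mouse_gl.aux (0-based).
cdr3_stop = {"TRAJ21*02": 20, "TRAJ24*02": 28, "TRDJ2*02": 23}

# "codon start" as reported in the IMGT FASTA header lines (1-based).
imgt_codon_start = {"TRAJ21*02": 1, "TRAJ24*02": 3, "TRDJ2*02": 1}

residues = {}
for allele, stop in cdr3_stop.items():
    start = imgt_codon_start[allele] - 1  # convert to 0-based
    residues[allele] = (stop - start) % 3

# Every residue should be 2 once the IMGT-derived start is used.
print(residues)
```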

3. Incompatible "first coding frame start" / "extra bps beyond J coding end" / "sequence length" combination

Some germline J gene alleles have a "first coding frame start" (2nd field) and "extra bps beyond J coding end" (5th field) that are incompatible with the length of the nucleotide sequence provided by IMGT.

More precisely, subtracting the "first coding frame start" and "extra bps beyond J coding end" values from the sequence length gives the length of the coding frame (in number of nucleotides). This should always be a multiple of 3.

However, this is not the case for mouse germline J gene alleles TRAJ31*02, TRAJ32*02, TRAJ45*02, and TRAJ59*01.

For these alleles we have:

               first_coding_frame_start   extra_bps   allele_length
   TRAJ31*02                          2           0              57
   TRAJ32*02                          1           0              65
   TRAJ45*02                          1           1              55
   TRAJ59*01                          0           1              62

Note that the allele_length column above is our addition. It contains the length of each allele sequence as provided by IMGT.
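For reference, the coding frame lengths implied by this table can be computed with a short sketch like the one below. The function name is hypothetical, and the values are copied from the table.

```python
def coding_frame_length(first_coding_frame_start: int, extra_bps: int,
                        allele_length: int) -> int:
    """Length of the coding frame in nucleotides; should be a multiple of 3."""
    return allele_length - first_coding_frame_start - extra_bps


# (first_coding_frame_start, extra_bps, allele_length) from the table above.
reported = {
    "TRAJ31*02": (2, 0, 57),
    "TRAJ32*02": (1, 0, 65),
    "TRAJ45*02": (1, 1, 55),
    "TRAJ59*01": (0, 1, 62),
}

remainders = {}
for allele, (start, extra, length) in reported.items():
    n = coding_frame_length(start, extra, length)
    remainders[allele] = n % 3
    print(allele, n, n % 3)
```

A nonzero remainder in the last column means the three values cannot all be correct for that allele.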

Suggested fix:

The "codon start" reported in the IMGT FASTA header lines for these alleles is in agreement with the first_coding_frame_start value in mouse_gl.aux. Since first_coding_frame_start and allele_length both come from IMGT, we consider them to be the truth.

This means that extra_bps should be corrected as follows:

  • 1 for TRAJ31*02
  • 1 for TRAJ32*02
  • 0 for TRAJ45*02
  • 2 for TRAJ59*01

With these values, allele_length - first_coding_frame_start - extra_bps is a multiple of 3.
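These corrected values can be derived rather than guessed. The sketch below assumes that extra_bps is always 0, 1, or 2 (the leftover nucleotides of an incomplete final codon), in which case it equals (allele_length - first_coding_frame_start) modulo 3. The function name is hypothetical.

```python
def expected_extra_bps(first_coding_frame_start: int, allele_length: int) -> int:
    """Smallest non-negative extra_bps that makes the coding frame a multiple of 3.

    Assumption: extra_bps is always 0, 1, or 2.
    """
    return (allele_length - first_coding_frame_start) % 3


# (first_coding_frame_start, allele_length), both taken from IMGT.
imgt_values = {
    "TRAJ31*02": (2, 57),
    "TRAJ32*02": (1, 65),
    "TRAJ45*02": (1, 55),
    "TRAJ59*01": (0, 62),
}

corrected = {
    allele: expected_extra_bps(start, length)
    for allele, (start, length) in imgt_values.items()
}
print(corrected)

# Sanity check: the coding frame length is now a multiple of 3 for every allele.
for allele, (start, length) in imgt_values.items():
    assert (length - start - corrected[allele]) % 3 == 0
```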