
Auxiliary data in igblastr

Hervé Pagès edited this page Jun 9, 2026 · 29 revisions

1. Introduction

This document gives a brief overview of what auxiliary data is in the context of IgBLAST/igblastr, why it matters, and how the igblastr package handles it.

The install_IMGT_germline_db() function in the igblastr package creates and installs a blast database for germline V, D, and J gene sequences downloaded from IMGT.

All the germline databases below were installed by running:

library(igblastr)
install_IMGT_germline_db("202614-2", "Homo_sapiens")
install_IMGT_germline_db("202614-2", "Mus_musculus")
install_IMGT_germline_db("202614-2", "Oryctolagus_cuniculus")
install_IMGT_germline_db("202614-2", "Rattus_norvegicus")
install_IMGT_germline_db("202614-2", "Macaca_mulatta")

R 4.6.0 and igblastr 1.3.6 (Bioconductor 3.24) were used. See sessionInfo() at the end of this document for the details.

Installing the 5 germline databases above should take less than a minute.

Use list_germline_dbs() to get the list of germline databases currently installed in igblastr's persistent cache:

list_germline_dbs()
#  db_name                                            V  D  J intdata auxdata
#  _OGRDB.human.IGH+IGK+IGL.202410                  342 31 23    TRUE    TRUE
#  _OGRDB.human.IGH+IGK+IGL.202410.src              354 33 24    TRUE    TRUE
#  _OGRDB.human.IGH+IGK+IGL.202605                  367 31 23    TRUE    TRUE
#  _OGRDB.human.IGH+IGK+IGL.202605.src              379 33 24    TRUE    TRUE
#  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603         184  9 22    TRUE    TRUE
#  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603        169 11 22    TRUE    TRUE
#  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603          172  9 22    TRUE    TRUE
#  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205       149  9 22    TRUE    TRUE
#  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410          184 10 22    TRUE    TRUE
#  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602         2294 72 39    TRUE    TRUE
#  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL           730 48 35    TRUE    TRUE
#  IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL         457 49 24    TRUE    TRUE
#  IMGT-202614-2.Mus_musculus.IGH+IGK+IGL           865 61 27    TRUE    TRUE
#  IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL  148 11 34    TRUE    TRUE
#  IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL      403 37 15    TRUE    TRUE

Notes:

  • The _OGRDB.* databases are built-in databases, i.e. they are shipped with the igblastr package and are therefore always present.
  • The V, D, and J columns indicate the number of germline gene alleles stored in the database for each germline gene region.
  • The intdata and auxdata columns indicate whether a database includes its own annotations for the germline V alleles (intdata) and germline J alleles (auxdata). These annotations report the coding frame start position and the FWR/CDR boundaries on the V and J sequences. IgBLAST needs this information in order to annotate the BCR or TCR sequences that it analyzes. In IgBLAST's terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively. As you can see, all the germline databases above include these annotations.

In the rest of this document, we will focus on the auxiliary data.

2. Auxiliary data

This is what the auxiliary data looks like:

load_auxdata("IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL")
#    allele_name coding_frame_start chain_type cdr3_end extra_bps
# 1     IGHJ1*01                  1         JH       21         1
# 2     IGHJ2*01                  1         JH       15         1
# 3     IGHJ3*01                  2         JH       16         1
# 4     IGHJ4*01                  2         JH       19         1
# 5     IGKJ1*01                  1         JK        6         1
# 6   IGKJ2-1*01                  2         JK        7         1
# 7   IGKJ2-2*01                  2         JK        7         1
# 8   IGKJ2-3*01                  2         JK        7         1
# 9     IGKJ3*01                  1         JK       NA         1
# 10    IGKJ4*01                  1         JK        6         1
# 11    IGKJ5*01                  1         JK        6         1
# 12    IGLJ1*01                  1         JL        6         1
# 13    IGLJ2*01                  1         JL        6         1
# 14    IGLJ3*01                  1         JL        6         1
# 15    IGLJ4*01                  1         JL        6         1

The important columns are coding_frame_start and cdr3_end:

  • coding_frame_start reports the 0-based position of the first nucleotide in the coding frame.
  • cdr3_end reports the 0-based position of the last nucleotide in the CDR3 region. Having access to this information is critical because it means that the border between the third complementarity-determining region (CDR3) and the fourth framework region (FWR4) is known. In the rest of this document we'll refer to this border as the CDR3/FWR4 junction.
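
Because both columns are 0-based while string indexing in R is 1-based, it is easy to be off by one when using them. The following is a minimal base R sketch on a made-up J sequence with made-up column values (not taken from any real database). split_J_at_cdr3_end() is a hypothetical helper written for this illustration, not an igblastr function.

```r
## Toy J allele (made up for illustration): one leading nucleotide followed
## by the codons for YFDY (end of CDR3) and WGQGTLVTVSS (FWR4).
j_seq <- "CTACTTTGACTACTGGGGCCAAGGAACCCTGGTCACCGTCTCCTCA"
coding_frame_start <- 1L   # 0-based: the first codon starts at the 2nd nucleotide
cdr3_end <- 12L            # 0-based: the last CDR3 nucleotide is the 13th nucleotide

## Hypothetical helper (not part of igblastr): split a J sequence at the
## CDR3/FWR4 junction, converting the 0-based cdr3_end to 1-based indices.
split_J_at_cdr3_end <- function(j_seq, cdr3_end)
{
    stopifnot(is.character(j_seq), length(j_seq) == 1L,
              !is.na(cdr3_end), cdr3_end >= 0L, cdr3_end < nchar(j_seq))
    list(cdr3_part=substr(j_seq, 1L, cdr3_end + 1L),
         fwr4=substr(j_seq, cdr3_end + 2L, nchar(j_seq)))
}

parts <- split_J_at_cdr3_end(j_seq, cdr3_end)
parts$cdr3_part   # "CTACTTTGACTAC"
parts$fwr4        # "TGGGGCCAAGGAACCCTGGTCACCGTCTCCTCA"

## First codon of the coding frame, again shifting from 0-based to 1-based.
first_codon <- substr(j_seq, coding_frame_start + 1L, coding_frame_start + 3L)
first_codon       # "TAC"
```

Note that a cdr3_end of NA (as for IGKJ3*01 above) means the junction is unknown, which is why the helper refuses it.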

IgBLAST-provided auxiliary data

IgBLAST itself provides its own auxiliary data for 5 organisms: human, mouse, rabbit, rat, and rhesus monkey. We'll refer to these organisms as IgBLAST organisms. Note that this data consists of *_gl.aux files that are shipped with IgBLAST and included in a standard IgBLAST installation.

However, this data doesn't get updated on a regular basis by the NCBI folks, and can be incomplete or out-of-sync with the germline V and J alleles provided by IMGT. This is an important concern and is what motivated the inclusion of igblastr-generated auxiliary data in the germline dbs created by install_IMGT_germline_db().

igblastr-generated auxiliary data

The auxiliary data included in a germline database created with install_IMGT_germline_db() is generated as follows:

  1. The coding_frame_start is obtained from the header lines in the IMGT FASTA files. See ?parse_imgt_fasta_headers for more information.

  2. Using the coding_frame_start information, the germline J allele sequences are translated to amino acid sequences.

  3. To identify the CDR3/FWR4 junction:

    • First a motif-based approach is used: For human (and most organisms), the FWR4 is expected to start with the WGXG or FGXG motif. If we find one of these motifs on a J allele sequence, then we know where the CDR3/FWR4 junction is. See ?compute_auxdata for more information.

    • If the WGXG/FGXG motifs are not found in some alleles (unsolved alleles), then we try to identify the CDR3/FWR4 junction with igblastr::infer_cdr3_ends_via_fwr4_comparisons(). This is how it works:

      • For each unsolved allele, the function moves a sliding window of 10 amino acids along the allele sequence, and, for each window, it computes the Hamming distance between the sequence in the window and each sequence in the set of known FWR4 sequences.

      • The "best window" is the window that minimizes this distance.

      • If the "best window" has no more than 2 mismatches with a known FWR4 sequence (Hamming distance <= 2), then it's considered to be the FWR4 of the unsolved allele.

      See ?infer_cdr3_ends_via_fwr4_comparisons for more information.

  4. Last but not least: the igblastr-generated auxiliary data is compared with the IgBLAST-provided auxiliary data (for the J alleles that are annotated in both), and an error is raised if they disagree.

The 4 steps above are performed on-the-fly by install_IMGT_germline_db().
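
To make the motif-based part of step 3 more concrete, here is a minimal base R sketch. It is not the code used by compute_auxdata(): find_cdr3_end_via_motif() is a hypothetical helper, the input is assumed to be an already translated J allele (step 2), and the amino acid sequences are made up. It assumes that the CDR3 ends on the nucleotide located just before the first codon of the motif.

```r
## Hypothetical helper (not igblastr's implementation): locate the first
## WGXG or FGXG motif in a translated J sequence and convert its position
## to a 0-based cdr3_end on the nucleotide sequence.
find_cdr3_end_via_motif <- function(j_aa, coding_frame_start)
{
    stopifnot(is.character(j_aa), length(j_aa) == 1L)
    ## 1-based position of the motif's first amino acid, or -1 if absent.
    motif_start <- as.integer(regexpr("[WF]G[A-Z]G", j_aa))
    if (motif_start == -1L)
        return(NA_integer_)   # unsolved allele
    ## The codon of the motif's first amino acid starts at 0-based nucleotide
    ## position coding_frame_start + 3 * (motif_start - 1). The CDR3 is
    ## assumed to end on the nucleotide just before it.
    as.integer(coding_frame_start + 3L * (motif_start - 1L) - 1L)
}

## Toy heavy-chain-like J sequence: the WGQG motif starts at amino acid 5.
solved <- find_cdr3_end_via_motif("YFDYWGQGTLVTVSS", coding_frame_start=1L)
solved     # 12

## Same sequence with the motif destroyed: the allele stays unsolved.
unsolved <- find_cdr3_end_via_motif("YFDYAAQGTLVTVSS", coding_frame_start=1L)
unsolved   # NA
```

Alleles for which this kind of search returns NA are the ones that would be handed over to the FWR4 comparison approach.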

3. A close look at some igblastr-generated auxiliary data

3.1 Auxiliary data included in the IMGT germline databases for IgBLAST organisms

In this section we are going to take a close look at the igblastr-generated auxiliary data included in the 5 IMGT germline databases that we installed earlier with install_IMGT_germline_db(). The function we will use for this is print_J_alleles(). See ?print_J_alleles for more information.

Note that these databases are for IgBLAST organisms. This means that IgBLAST provides its own auxiliary data for them. The also_in_IgBLAST_auxdata column displayed by print_J_alleles() indicates whether an allele is also annotated in the IgBLAST-provided auxiliary data or not.
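
The idea behind the also_in_IgBLAST_auxdata column, and behind the consistency check performed at installation time, can be sketched in base R as follows. The two tables and their allele names are made up, and check_auxdata_agreement() is a hypothetical helper, not an igblastr function. In particular, treating a missing value as matching only another missing value is an assumption of this sketch.

```r
## Two small made-up auxiliary data tables (values are illustrative only).
igblastr_auxdata <- data.frame(
    allele_name=c("J1*01", "J2*01", "J3*01"),
    coding_frame_start=c(1L, 2L, 0L),
    cdr3_end=c(12L, 16L, 6L)
)
igblast_auxdata <- data.frame(
    allele_name=c("J1*01", "J2*01"),
    coding_frame_start=c(1L, 2L),
    cdr3_end=c(12L, 16L)
)

## TRUE for the alleles that are annotated in both tables.
also_in_IgBLAST_auxdata <-
    igblastr_auxdata$allele_name %in% igblast_auxdata$allele_name

## Hypothetical check: raise an error if the alleles annotated in both
## tables have different coding_frame_start or cdr3_end values.
check_auxdata_agreement <- function(x, y)
{
    common <- merge(x, y, by="allele_name", suffixes=c(".x", ".y"))
    ## Assumption: an NA agrees only with another NA.
    same <- function(a, b)
        (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
    ok <- same(common$coding_frame_start.x, common$coding_frame_start.y) &
          same(common$cdr3_end.x, common$cdr3_end.y)
    if (!all(ok))
        stop("auxiliary data disagree for: ",
             paste(common$allele_name[!ok], collapse=", "))
    invisible(TRUE)
}

also_in_IgBLAST_auxdata   # TRUE TRUE FALSE
check_auxdata_agreement(igblastr_auxdata, igblast_auxdata)
```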

IMGT germline db for human

The 35 J alleles in IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL are fully annotated (i.e. coding frame and CDR3/FWR4 junction are known):

db_name <- "IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the human IMGT germline db]

Notes:

  • 2 human J alleles from IMGT (IGHJ5*04 and IGKJ4*03) are not annotated in the IgBLAST-provided auxiliary data.
  • For these 2 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for mouse

The 27 J alleles in IMGT-202614-2.Mus_musculus.IGH+IGK+IGL are fully annotated:

db_name <- "IMGT-202614-2.Mus_musculus.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the mouse IMGT germline db]

Notes:

  • 3 mouse J alleles from IMGT (IGLJ2P*01, IGLJ4*01_Mus_spretus, and IGLJ5*01_Mus_spretus) are not annotated in the IgBLAST-provided auxiliary data.
  • For IGLJ4*01_Mus_spretus and IGLJ5*01_Mus_spretus, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
  • For IGLJ2P*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons(). The function moves a sliding window of 10 amino acids along the allele sequence until it finds a match with one of the already known FWR4 sequences in the set of J alleles. In this case there was a perfect match with the FWR4 of allele IGLJ3P*01.
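
The sliding window search mentioned in the last note can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the code of infer_cdr3_ends_via_fwr4_comparisons(): hamming() and find_fwr4_start() are hypothetical helpers, all sequences are made up, and only the first 10 amino acids of each known FWR4 are compared with the window.

```r
## Number of mismatches between two strings of the same length.
hamming <- function(a, b)
    sum(strsplit(a, "")[[1L]] != strsplit(b, "")[[1L]])

## Hypothetical helper: return the 1-based amino acid position where the
## FWR4 of an unsolved allele is believed to start, or NA if no window is
## close enough to a known FWR4.
find_fwr4_start <- function(j_aa, known_fwr4, width=10L, max_mismatches=2L)
{
    ## Assumption: compare against the first 'width' residues of each FWR4.
    known <- unique(substr(known_fwr4[nchar(known_fwr4) >= width], 1L, width))
    nwin <- nchar(j_aa) - width + 1L
    if (nwin < 1L || length(known) == 0L)
        return(NA_integer_)
    ## For each window, distance to the closest known FWR4.
    dists <- vapply(seq_len(nwin), function(i) {
        win <- substr(j_aa, i, i + width - 1L)
        min(vapply(known, hamming, integer(1), b=win))
    }, integer(1))
    best <- which.min(dists)   # the "best window"
    if (dists[best] > max_mismatches)
        return(NA_integer_)
    best
}

known_fwr4 <- c("FGGGTKLTVL", "FGTGTKVTVL")   # made-up known FWR4 sequences

## No WGXG/FGXG motif here, but the window starting at position 3
## (LGGGTKLTVL) has a single mismatch with FGGGTKLTVL.
fwr4_start <- find_fwr4_start("YVLGGGTKLTVL", known_fwr4)
fwr4_start   # 3

## Nothing resembling a known FWR4: the allele remains unsolved.
no_fwr4 <- find_fwr4_start("AAAAAAAAAAAA", known_fwr4)
no_fwr4      # NA
```

The returned amino acid position would still need to be converted to a 0-based nucleotide position (using coding_frame_start) to obtain cdr3_end.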

IMGT germline db for rabbit

The 34 J alleles in IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL are fully annotated:

db_name <- "IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the rabbit IMGT germline db]

Note that all 34 rabbit J alleles from IMGT are also annotated in the IgBLAST-provided auxiliary data.

IMGT germline db for rat

14 of the 15 J alleles in IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL are fully annotated:

db_name <- "IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the rat IMGT germline db]

Notes:

  • All 15 rat J alleles from IMGT are also annotated in the IgBLAST-provided auxiliary data. However, the annotations provided by IgBLAST for rat IGKJ3*01 are incomplete (the CDR3/FWR4 junction is not reported).
  • The CDR3/FWR4 junction for IGKJ3*01 could not be identified by igblastr::compute_auxdata() (WGXG/FGXG motif search), nor by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
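
Alleles with an unknown CDR3/FWR4 junction show up as NA in the cdr3_end column, so they are easy to find programmatically. The sketch below retypes the rat auxiliary data shown in section 2 as a plain data.frame so that it runs without igblastr. With an installed database, the table would presumably come from load_auxdata(db_name) instead, assuming the returned object behaves like a data.frame with these columns, as its printed form suggests.

```r
## The rat auxiliary data from section 2, retyped by hand.
rat_auxdata <- data.frame(
    allele_name=c("IGHJ1*01", "IGHJ2*01", "IGHJ3*01", "IGHJ4*01",
                  "IGKJ1*01", "IGKJ2-1*01", "IGKJ2-2*01", "IGKJ2-3*01",
                  "IGKJ3*01", "IGKJ4*01", "IGKJ5*01",
                  "IGLJ1*01", "IGLJ2*01", "IGLJ3*01", "IGLJ4*01"),
    coding_frame_start=c(1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L,
                         1L, 1L, 1L, 1L, 1L, 1L, 1L),
    chain_type=rep(c("JH", "JK", "JL"), times=c(4L, 7L, 4L)),
    cdr3_end=c(21L, 15L, 16L, 19L, 6L, 7L, 7L, 7L,
               NA, 6L, 6L, 6L, 6L, 6L, 6L),
    extra_bps=1L
)

## Alleles for which the CDR3/FWR4 junction is unknown.
unsolved_alleles <- rat_auxdata$allele_name[is.na(rat_auxdata$cdr3_end)]
unsolved_alleles   # "IGKJ3*01"

## Number of fully annotated alleles.
n_solved <- sum(!is.na(rat_auxdata$cdr3_end))
n_solved           # 14
```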

IMGT germline db for rhesus monkey

The 24 J alleles in IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL are fully annotated:

db_name <- "IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the rhesus monkey IMGT germline db]

Note that all 24 rhesus monkey J alleles from IMGT are also annotated in the IgBLAST-provided auxiliary data.

3.2 Auxiliary data included in the IMGT germline databases for non-IgBLAST organisms

In this section we're going to use install_IMGT_germline_db() on some of the non-IgBLAST organisms available at IMGT.

The organisms available in IMGT release 202614-2 are:

list_IMGT_organisms("202614-2")
#  [1] "Aotus_nancymaae"          "Bos_taurus"              
#  [3] "Camelus_dromedarius"      "Canis_lupus_familiaris"  
#  [5] "Capra_hircus"             "Chondrichthyes"          
#  [7] "Danio_rerio"              "Equus_caballus"          
#  [9] "Felis_catus"              "Gadus_morhua"            
# [11] "Gallus_gallus"            "Gorilla_gorilla_gorilla" 
# [13] "Heterocephalus_glaber"    "Homo_sapiens"            
# [15] "Ictalurus_punctatus"      "Lemur_catta"             
# [17] "Macaca_fascicularis"      "Macaca_mulatta"          
# [19] "Mus_musculus"             "Mus_musculus_C57BL6J"    
# [21] "Mustela_putorius_furo"    "Neogale_vison"           
# [23] "Nonhuman_primates"        "Oncorhynchus_mykiss"     
# [25] "Ornithorhynchus_anatinus" "Oryctolagus_cuniculus"   
# [27] "Ovis_aries"               "Pan_troglodytes"         
# [29] "Pongo_abelii"             "Pongo_pygmaeus"          
# [31] "Rattus_norvegicus"        "Salmo_salar"             
# [33] "Sus_scrofa"               "Teleostei"               
# [35] "Tursiops_truncatus"       "Vicugna_pacos"           

Since IgBLAST provides no auxiliary data for these organisms, install_IMGT_germline_db() will generate the entirety of the auxiliary data that gets included in the germline db. Note that, in this case, print_J_alleles() does not display the also_in_IgBLAST_auxdata column (it would contain FALSE for all the alleles).
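
A quick way to see which of these IMGT organisms are non-IgBLAST organisms is to remove the 5 IgBLAST organisms from the list. The sketch below hard-codes the output of list_IMGT_organisms("202614-2") shown above so that it runs without igblastr or network access. The matching is done on the organism names only, which is a simplification (e.g. Mus_musculus_C57BL6J is kept even though it is a mouse strain).

```r
## Output of list_IMGT_organisms("202614-2"), copied from above.
imgt_organisms <- c(
    "Aotus_nancymaae", "Bos_taurus", "Camelus_dromedarius",
    "Canis_lupus_familiaris", "Capra_hircus", "Chondrichthyes",
    "Danio_rerio", "Equus_caballus", "Felis_catus", "Gadus_morhua",
    "Gallus_gallus", "Gorilla_gorilla_gorilla", "Heterocephalus_glaber",
    "Homo_sapiens", "Ictalurus_punctatus", "Lemur_catta",
    "Macaca_fascicularis", "Macaca_mulatta", "Mus_musculus",
    "Mus_musculus_C57BL6J", "Mustela_putorius_furo", "Neogale_vison",
    "Nonhuman_primates", "Oncorhynchus_mykiss", "Ornithorhynchus_anatinus",
    "Oryctolagus_cuniculus", "Ovis_aries", "Pan_troglodytes",
    "Pongo_abelii", "Pongo_pygmaeus", "Rattus_norvegicus", "Salmo_salar",
    "Sus_scrofa", "Teleostei", "Tursiops_truncatus", "Vicugna_pacos"
)

## The 5 IgBLAST organisms (human, mouse, rabbit, rat, rhesus monkey),
## using the names passed to install_IMGT_germline_db() earlier.
igblast_organisms <- c("Homo_sapiens", "Mus_musculus",
                       "Oryctolagus_cuniculus", "Rattus_norvegicus",
                       "Macaca_mulatta")

non_igblast_organisms <- setdiff(imgt_organisms, igblast_organisms)
length(non_igblast_organisms)   # 31
```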

IMGT germline db for Canis lupus familiaris (dog)

db_name <- install_IMGT_germline_db("202614-2", "Canis_lupus_familiaris")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the dog IMGT germline db]

Notes:

  • Except for allele IGKJ2*01, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
  • For IGKJ2*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().

IMGT germline db for Equus caballus (horse)

db_name <- install_IMGT_germline_db("202614-2", "Equus_caballus")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the horse IMGT germline db]

Notes:

  • IMGT provides no J alleles for the IGL locus.
  • Except for allele IGHJ2*01, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
  • For IGHJ2*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().

IMGT germline db for Gorilla gorilla gorilla

db_name <- install_IMGT_germline_db("202614-2", "Gorilla_gorilla_gorilla")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the gorilla IMGT germline db]

Note that, for all 26 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for Lemur catta (ring-tailed lemur)

db_name <- install_IMGT_germline_db("202614-2", "Lemur_catta")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the ring-tailed lemur IMGT germline db]

Notes:

  • Except for alleles IGKJ5*01 and IGKJ5*02, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
  • For IGKJ5*01 and IGKJ5*02, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().

IMGT germline db for Macaca fascicularis (crab-eating macaque)

db_name <- install_IMGT_germline_db("202614-2", "Macaca_fascicularis")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the crab-eating macaque IMGT germline db]

Notes:

  • IMGT only provides germline J alleles for the IGH locus (heavy chain).
  • For all 7 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for Mustela putorius furo (ferret)

db_name <- install_IMGT_germline_db("202614-2", "Mustela_putorius_furo")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the ferret IMGT germline db]

Notes:

  • For IGKJ2*01 and IGLJ8*01, the CDR3/FWR4 junction could not be identified by igblastr::compute_auxdata() (WGXG/FGXG motif search), nor by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
  • For the other 19 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for Oncorhynchus mykiss (rainbow trout)

db_name <- install_IMGT_germline_db("202614-2", "Oncorhynchus_mykiss")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the rainbow trout IMGT germline db]

Notes:

  • IMGT only provides germline J alleles for the IGH locus (heavy chain).
  • For all 26 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for Ornithorhynchus anatinus (platypus)

db_name <- install_IMGT_germline_db("202614-2", "Ornithorhynchus_anatinus")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the platypus IMGT germline db]

Notes:

  • IMGT only provides germline J alleles for the IGH locus (heavy chain).
  • For all 11 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for Pongo pygmaeus (orangutan)

db_name <- install_IMGT_germline_db("202614-2", "Pongo_pygmaeus")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the orangutan IMGT germline db]

Notes:

  • For IGHJ4*01, IGHJ9*01, and IGHJ9*02, the CDR3/FWR4 junction could not be identified by igblastr::compute_auxdata() (WGXG/FGXG motif search), nor by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
  • For the other 21 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

IMGT germline db for Salmo salar (Atlantic salmon)

db_name <- install_IMGT_germline_db("202614-2", "Salmo_salar")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the Atlantic salmon IMGT germline db]

Notes:

  • IMGT only provides germline J alleles for the IGH locus (heavy chain).
  • Except for allele IGHJ1T1D*01, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
  • For IGHJ1T1D*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().

IMGT germline db for Sus scrofa (pig)

db_name <- install_IMGT_germline_db("202614-2", "Sus_scrofa")
print_J_alleles(db_name, translate=TRUE)
[Image: print_J_alleles() output for the pig IMGT germline db]

Note that, for all 19 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).

4. sessionInfo()

> sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.6.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.6.0/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] igblastr_1.3.6      Biostrings_2.81.2   Seqinfo_1.3.0      
[4] XVector_0.53.0      IRanges_2.47.2      S4Vectors_0.51.3   
[7] BiocGenerics_0.59.6 generics_0.1.4      tibble_3.3.1       

loaded via a namespace (and not attached):
 [1] crayon_1.5.3        vctrs_0.7.3         httr_1.4.8         
 [4] cli_3.6.6           rlang_1.2.0         UCSC.utils_1.9.0   
 [7] jsonlite_2.0.0      xtable_1.8-8        glue_1.8.1         
[10] GenomeInfoDb_1.49.1 lifecycle_1.0.5     compiler_4.6.0     
[13] rvest_1.0.5         pkgconfig_2.0.3     R.oo_1.27.1        
[16] R.utils_2.13.0      R6_2.6.1            pillar_1.11.1      
[19] curl_7.1.0          magrittr_2.0.5      R.methodsS3_1.8.2  
[22] tools_4.6.0         xml2_1.5.2