
Auxiliary data in igblastr

Hervé Pagès edited this page Jun 3, 2026 · 39 revisions

Auxiliary data included in the germline dbs created with install_IMGT_germline_db()

Introduction

The install_IMGT_germline_db() function in the igblastr package creates and installs a BLAST database for the germline V, D, and J gene sequences downloaded from IMGT.

All the germline databases below were installed by running:

library(igblastr)
install_IMGT_germline_db("202614-2", "Homo_sapiens", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Mus_musculus", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Oryctolagus_cuniculus", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Rattus_norvegicus", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Macaca_mulatta", overwrite=TRUE)

R 4.6.0 and igblastr 1.3.6 (Bioconductor 3.24) were used. See sessionInfo() at the end of this document for the details.

Installing the 5 germline databases above should take less than a minute.

To get the list of germline databases currently installed in igblastr's persistent cache:

list_germline_dbs()
#  db_name                                            V  D  J intdata auxdata
#  _OGRDB.human.IGH+IGK+IGL.202410                  342 31 23    TRUE    TRUE
#  _OGRDB.human.IGH+IGK+IGL.202410.src              354 33 24    TRUE    TRUE
#  _OGRDB.human.IGH+IGK+IGL.202605                  367 31 23    TRUE    TRUE
#  _OGRDB.human.IGH+IGK+IGL.202605.src              379 33 24    TRUE    TRUE
#  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603         184  9 22    TRUE    TRUE
#  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603        169 11 22    TRUE    TRUE
#  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603          172  9 22    TRUE    TRUE
#  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205       149  9 22    TRUE    TRUE
#  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410          184 10 22    TRUE    TRUE
#  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602         2294 72 39    TRUE    TRUE
#  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL           730 48 35    TRUE    TRUE
#  IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL         457 49 24    TRUE    TRUE
#  IMGT-202614-2.Mus_musculus.IGH+IGK+IGL           865 61 27    TRUE    TRUE
#  IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL  148 11 34    TRUE    TRUE
#  IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL      403 37 15    TRUE    TRUE

Notes:

  • The _OGRDB.* databases are built-in databases, i.e. they are shipped with the igblastr package and are therefore always present.
  • The V, D, and J columns indicate the number of germline gene alleles stored in the database for each germline gene region.
  • The intdata and auxdata columns indicate whether a database includes its own annotations for the germline V alleles (intdata) and germline J alleles (auxdata). These annotations report the coding frame start position and the FWR/CDR boundaries on the V and J sequences. IgBLAST needs this information to annotate the BCR or TCR sequences that it analyzes. In IgBLAST's terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively. As you can see, all the germline databases above include these annotations. More on the auxiliary data below.
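To make the notion of auxiliary data more concrete, here is a minimal base R sketch that reads a small table laid out like an IgBLAST auxiliary data file. The five-column, tab-delimited layout (allele name, 0-based coding frame start, chain type, 0-based CDR3 end, extra bases beyond the coding end) is assumed here. The allele names and values are made up, and encoding an incomplete record by leaving the trailing fields out is also an assumption made for illustration.

```r
## Made-up records in an assumed IgBLAST-style auxiliary data layout.
## The last record is deliberately incomplete (no CDR3 end reported).
aux_text <- paste(
    "J_a*01\t2\tJH\t13\t1",
    "J_b*01\t1\tJK\t6\t1",
    "J_c*01\t0\tJK",
    sep = "\n"
)

aux <- read.table(
    text = aux_text, sep = "\t", header = FALSE, fill = TRUE,
    quote = "", comment.char = "", stringsAsFactors = FALSE,
    col.names = c("allele", "coding_frame_start", "chain_type",
                  "cdr3_end", "extra_bps")
)

## Alleles whose CDR3/FWR4 junction is missing, i.e. not fully annotated.
incomplete <- aux$allele[is.na(aux$cdr3_end)]
print(incomplete)
```

A J allele counts as fully annotated in the sections below when both its coding frame and its CDR3/FWR4 junction are known.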

Auxiliary data

IMGT germline db for human

The 35 J alleles in IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL are fully annotated (i.e. coding frame and CDR3/FWR4 junction are known):

[Image: IMGT_human_J_alleles]

Notes:

  • 2 alleles (IGHJ5*04 and IGKJ4*03) are not annotated in the auxiliary data shipped with IgBLAST.
  • For these 2 alleles, the CDR3/FWR4 junction was determined with igblastr::compute_auxdata() (WGXG/FGXG motif search).
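As a rough illustration of what a WGXG/FGXG motif search involves, here is a minimal base R sketch. It is not the code behind igblastr::compute_auxdata(). The function names, the rule of skipping reading frames that contain a stop codon, and the example sequence are all assumptions made for illustration. Positions are 0-based, which is the convention assumed here for auxiliary data.

```r
## Standard genetic code, with codons enumerated in TCAG order.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)
amino_acids <- strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                               "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"),
                        "")[[1]]
genetic_code <- setNames(amino_acids, codons)

## Translate 'dna' starting at the 0-based offset 'frame_start'.
translate_frame <- function(dna, frame_start) {
    dna <- toupper(dna)
    n_codons <- (nchar(dna) - frame_start) %/% 3
    if (n_codons < 1) return("")
    starts <- frame_start + 1 + 3 * (seq_len(n_codons) - 1)
    aa <- genetic_code[substring(dna, starts, starts + 2)]
    aa[is.na(aa)] <- "X"  # codons with ambiguous letters
    paste(aa, collapse = "")
}

## Try the 3 reading frames and return the first stop-free frame that
## contains a WGXG or FGXG motif, or NULL if none does.
find_fwr4_motif <- function(dna) {
    for (frame_start in 0:2) {
        aa <- translate_frame(dna, frame_start)
        if (grepl("*", aa, fixed = TRUE)) next
        hit <- as.integer(regexpr("[WF]G.G", aa))
        if (hit == -1L) next
        fwr4_start <- frame_start + 3 * (hit - 1)  # 0-based, in nucleotides
        return(list(coding_frame_start = frame_start,
                    cdr3_end = fwr4_start - 1,
                    fwr4_aa = substring(aa, hit)))
    }
    NULL
}

## Illustrative J-like nucleotide sequence.
j_seq <- "ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG"
res <- find_fwr4_motif(j_seq)
str(res)
```

In this sketch the CDR3 end is simply the last nucleotide before the W or F codon of the motif, so a single regular expression search per reading frame is enough.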

IMGT germline db for mouse

The 27 J alleles in IMGT-202614-2.Mus_musculus.IGH+IGK+IGL are fully annotated:

[Image: IMGT_mouse_J_alleles]

Notes:

  • 3 alleles (IGLJ2P*01, IGLJ4*01_Mus_spretus, and IGLJ5*01_Mus_spretus) are not annotated in the auxiliary data shipped with IgBLAST.
  • For IGLJ4*01_Mus_spretus and IGLJ5*01_Mus_spretus, the CDR3/FWR4 junction was determined with igblastr::compute_auxdata() (WGXG/FGXG motif search).
  • For IGLJ2P*01, the CDR3/FWR4 junction was determined with igblastr::infer_cdr3_ends_via_fwr4_comparisons(). This function slides a window of 10 amino acids along the allele sequence until the window matches one of the FWR4 regions already known in the set of J alleles. In this case there was a perfect match with the FWR4 of allele IGLJ3P*01.
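The sliding-window idea described in the last note can be sketched in a few lines of base R. This is not the implementation of igblastr::infer_cdr3_ends_via_fwr4_comparisons(). The function name find_fwr4_by_window(), the choice of comparing each window with the first 10 amino acids of each known FWR4, the requirement of an exact match, and all the sequences below are assumptions made for illustration.

```r
## Slide a window of 'width' amino acids along 'query_aa' and return the
## first position where the window equals the start of a known FWR4.
find_fwr4_by_window <- function(query_aa, known_fwr4, width = 10L) {
    n_windows <- nchar(query_aa) - width + 1L
    if (n_windows < 1L) return(NULL)
    known_prefix <- substr(known_fwr4, 1L, width)
    for (pos in seq_len(n_windows)) {
        window <- substr(query_aa, pos, pos + width - 1L)
        idx <- match(window, known_prefix)
        if (!is.na(idx))
            return(list(fwr4_start_aa = pos,
                        matched = names(known_fwr4)[idx]))
    }
    NULL
}

## Made-up FWR4 sequences of already annotated alleles.
known_fwr4 <- c(J_a = "FGGGTKLTVL", J_b = "FSSNGTKVTV")

## Made-up translated allele with no WGXG/FGXG motif.
query_aa <- "WIFSSNGTKVTVL"

res <- find_fwr4_by_window(query_aa, known_fwr4)
str(res)
```

Once the amino acid position of the FWR4 start is known, it can be converted to a nucleotide position with the coding frame of the allele, as in a motif-based search.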

IMGT germline db for rabbit

The 34 J alleles in IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL are fully annotated:

[Image: IMGT_rabbit_J_alleles]

Note that all 34 alleles are also annotated in the auxiliary data shipped with IgBLAST.

IMGT germline db for rat

14 of the 15 J alleles in IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL are fully annotated:

[Image: IMGT_rat_J_alleles]

Notes:

  • All 15 alleles are also annotated in the auxiliary data shipped with IgBLAST. However, the annotations provided for IGKJ3*01 are incomplete (the CDR3/FWR4 junction is not reported).
  • The CDR3/FWR4 junction for IGKJ3*01 could not be determined with igblastr::compute_auxdata() (WGXG/FGXG motif search), nor with igblastr::infer_cdr3_ends_via_fwr4_comparisons().

IMGT germline db for rhesus monkey

The 24 J alleles in IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL are fully annotated:

[Image: IMGT_rhesus_monkey_J_alleles]

Note that all 24 alleles are also annotated in the auxiliary data shipped with IgBLAST.

sessionInfo()

> sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.6.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.6.0/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] igblastr_1.3.6      Biostrings_2.81.2   Seqinfo_1.3.0      
[4] XVector_0.53.0      IRanges_2.47.2      S4Vectors_0.51.3   
[7] BiocGenerics_0.59.6 generics_0.1.4      tibble_3.3.1       

loaded via a namespace (and not attached):
 [1] crayon_1.5.3        vctrs_0.7.3         httr_1.4.8         
 [4] cli_3.6.6           rlang_1.2.0         UCSC.utils_1.9.0   
 [7] jsonlite_2.0.0      xtable_1.8-8        glue_1.8.1         
[10] GenomeInfoDb_1.49.1 lifecycle_1.0.5     compiler_4.6.0     
[13] rvest_1.0.5         pkgconfig_2.0.3     R.oo_1.27.1        
[16] R.utils_2.13.0      R6_2.6.1            pillar_1.11.1      
[19] curl_7.1.0          magrittr_2.0.5      R.methodsS3_1.8.2  
[22] tools_4.6.0         xml2_1.5.2         
