Auxiliary data in igblastr
This document provides a brief overview of what the auxiliary data is about in the context of IgBLAST/igblastr, why this data is important, and how the igblastr package handles it.
The install_IMGT_germline_db() function in the igblastr package creates and installs a blast database for germline V, D, and J gene sequences downloaded from IMGT.
All the germline databases below were installed by running:
library(igblastr)
install_IMGT_germline_db("202614-2", "Homo_sapiens")
install_IMGT_germline_db("202614-2", "Mus_musculus")
install_IMGT_germline_db("202614-2", "Oryctolagus_cuniculus")
install_IMGT_germline_db("202614-2", "Rattus_norvegicus")
install_IMGT_germline_db("202614-2", "Macaca_mulatta")
R 4.6.0 and igblastr 1.3.6 (Bioconductor 3.24) were used. See sessionInfo() at the end of this document for the details.
Installing the 5 germline databases above should take less than a minute.
Use list_germline_dbs() to get the list of germline databases currently installed in igblastr's persistent cache:
list_germline_dbs()
# db_name V D J intdata auxdata
# _OGRDB.human.IGH+IGK+IGL.202410 342 31 23 TRUE TRUE
# _OGRDB.human.IGH+IGK+IGL.202410.src 354 33 24 TRUE TRUE
# _OGRDB.human.IGH+IGK+IGL.202605 367 31 23 TRUE TRUE
# _OGRDB.human.IGH+IGK+IGL.202605.src 379 33 24 TRUE TRUE
# _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603 184 9 22 TRUE TRUE
# _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603 169 11 22 TRUE TRUE
# _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603 172 9 22 TRUE TRUE
# _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205 149 9 22 TRUE TRUE
# _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410 184 10 22 TRUE TRUE
# _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602 2294 72 39 TRUE TRUE
# IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL 730 48 35 TRUE TRUE
# IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL 457 49 24 TRUE TRUE
# IMGT-202614-2.Mus_musculus.IGH+IGK+IGL 865 61 27 TRUE TRUE
# IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL 148 11 34 TRUE TRUE
# IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL 403 37 15 TRUE TRUE
Notes:
- The _OGRDB.* databases are built-in databases, i.e. they are shipped with the igblastr package and are therefore always present.
- The V, D, and J columns indicate the number of germline gene alleles stored in the database for each germline gene region.
- The intdata and auxdata columns indicate whether a database includes its own annotations for the germline V alleles (intdata) and germline J alleles (auxdata). These annotations report the coding frame start position and the FWR/CDR boundaries on the V and J sequences. IgBLAST needs this information to annotate the BCR or TCR sequences that it analyzes. In IgBLAST's terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively. As you can see, all the germline databases above include these annotations.
In the rest of this document, we will focus on the auxiliary data.
This is what the auxiliary data looks like:
load_auxdata("IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL")
# allele_name coding_frame_start chain_type cdr3_end extra_bps
# 1 IGHJ1*01 1 JH 21 1
# 2 IGHJ2*01 1 JH 15 1
# 3 IGHJ3*01 2 JH 16 1
# 4 IGHJ4*01 2 JH 19 1
# 5 IGKJ1*01 1 JK 6 1
# 6 IGKJ2-1*01 2 JK 7 1
# 7 IGKJ2-2*01 2 JK 7 1
# 8 IGKJ2-3*01 2 JK 7 1
# 9 IGKJ3*01 1 JK NA 1
# 10 IGKJ4*01 1 JK 6 1
# 11 IGKJ5*01 1 JK 6 1
# 12 IGLJ1*01 1 JL 6 1
# 13 IGLJ2*01 1 JL 6 1
# 14 IGLJ3*01 1 JL 6 1
# 15 IGLJ4*01 1 JL 6 1
The important columns are coding_frame_start and cdr3_end:
- coding_frame_start reports the 0-based position of the first nucleotide in the coding frame.
- cdr3_end reports the 0-based position of the last nucleotide in the CDR3 region. Having access to this information is critical because it means that the border between the third complementarity-determining region (CDR3) and the fourth framework region (FWR4) is known. In the rest of this document we'll refer to this border as the CDR3/FWR4 junction.
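To make the two columns concrete, here is a minimal base-R sketch that uses a 0-based coding_frame_start and cdr3_end to cut a J sequence at the CDR3/FWR4 junction. The helper split_j_at_junction() and the sequence are assumptions made for this illustration. They are not part of igblastr and the sequence is not taken from any of the databases above.

```r
# Hypothetical helper (not an igblastr function): split a J nucleotide
# sequence at the CDR3/FWR4 junction, given 0-based auxiliary data values.
split_j_at_junction <- function(j_seq, coding_frame_start, cdr3_end) {
    stopifnot(is.character(j_seq), length(j_seq) == 1L,
              coding_frame_start >= 0, cdr3_end < nchar(j_seq))
    # R strings are 1-based, so the 0-based position p is character p + 1.
    list(
        cdr3_part = substr(j_seq, 1L, cdr3_end + 1L),
        fwr4      = substr(j_seq, cdr3_end + 2L, nchar(j_seq)),
        # TRUE if the FWR4 starts a whole number of codons after the
        # coding frame start, i.e. the junction sits on a codon boundary.
        in_frame  = (cdr3_end + 1L - coding_frame_start) %% 3L == 0L
    )
}

# Illustrative J-like sequence (an assumption made for this example).
j_seq <- "ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG"
parts <- split_j_at_junction(j_seq, coding_frame_start = 2L, cdr3_end = 13L)
parts$cdr3_part  # "ACTACTTTGACTAC"
parts$fwr4       # starts with "TGG", the codon for W
parts$in_frame   # TRUE
```

The only subtlety is the index shift: the auxiliary data is 0-based whereas substr() is 1-based.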
IgBLAST itself provides its own auxiliary data for 5 organisms: human, mouse, rabbit, rat, and rhesus monkey. We'll refer to these organisms as IgBLAST organisms. Note that this data consists of *_gl.aux files that are shipped with IgBLAST and included in a standard IgBLAST installation.
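As a rough illustration of what such a file holds, the base-R sketch below parses text laid out like a *_gl.aux file. The layout used here (comment lines starting with #, then whitespace-separated fields in the same order as the columns returned by load_auxdata()) is an assumption made for this example, and read_aux_text() is a hypothetical helper that belongs to neither IgBLAST nor igblastr. The rows reuse values from the rat auxiliary data shown above.

```r
# Text laid out like an auxiliary data file (assumed layout, see above).
aux_text <- "
# allele_name coding_frame_start chain_type cdr3_end extra_bps
IGHJ1*01 1 JH 21 1
IGHJ2*01 1 JH 15 1
IGKJ1*01 1 JK 6 1
"

# Hypothetical helper: parse the text into a data frame.
read_aux_text <- function(text) {
    read.table(text = text, header = FALSE, comment.char = "#",
               quote = "", stringsAsFactors = FALSE,
               fill = TRUE,  # tolerate rows with missing trailing fields
               col.names = c("allele_name", "coding_frame_start",
                             "chain_type", "cdr3_end", "extra_bps"))
}

aux <- read_aux_text(aux_text)
aux

# Alleles whose CDR3/FWR4 junction is not reported (none in this toy input).
aux$allele_name[is.na(aux$cdr3_end)]
```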
However, this data doesn't get updated on a regular basis by the NCBI folks, and can be incomplete or out-of-sync with the germline V and J alleles provided by IMGT. This is an important concern and is what motivated the inclusion of igblastr-generated auxiliary data in the germline dbs created by install_IMGT_germline_db().
The auxiliary data included in a germline database created with install_IMGT_germline_db() is generated as follows:
- The coding_frame_start is obtained from the header lines in the IMGT FASTA files. See ?parse_imgt_fasta_headers for more information.
- Using the coding_frame_start information, the germline J allele sequences are translated to amino acid sequences.
- To identify the CDR3/FWR4 junction:
  - First a motif-based approach is used: for human (and most organisms), the FWR4 is expected to start with a WGXG or FGXG motif. If we find one of these motifs on a J allele sequence, then we know where the CDR3/FWR4 junction is. See ?compute_auxdata for more information.
  - If we didn't find a WGXG/FGXG motif in some alleles (unsolved alleles), then we try to identify the CDR3/FWR4 junction with igblastr::infer_cdr3_ends_via_fwr4_comparisons(). This is how it works:
    - For each unsolved allele, the function moves a sliding window of 10 amino acids along the allele sequence, and, for each window, it computes the Hamming distance between the sequence in the window and each of the known FWR4 sequences.
    - The "best window" is the window that minimizes this distance.
    - If the "best window" has no more than 2 mismatches with a known FWR4 sequence (Hamming distance <= 2), then it's considered to be the FWR4 of the unsolved allele.
    See ?infer_cdr3_ends_via_fwr4_comparisons for more information.
- Last but not least: the igblastr-generated auxiliary data is compared with the IgBLAST-provided auxiliary data (for the J alleles that are annotated in both), and an error is raised if they disagree.
The 4 steps above are performed on-the-fly by install_IMGT_germline_db().
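The translation and motif search steps can be sketched in base R as follows. This is only an illustration of the idea, not the code used by compute_auxdata(): the helpers translate_from() and find_cdr3_end() are hypothetical, the input is assumed to be an uppercase A/C/G/T string, and the sequence is made up for the example.

```r
# Standard genetic code, with codons enumerated in TCAG order.
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases,
                    stringsAsFactors = FALSE)
aa_letters <- "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
genetic_code <- setNames(strsplit(aa_letters, "")[[1]],
                         paste0(grid$b1, grid$b2, grid$b3))

# Translate a nucleotide sequence from a 0-based coding frame start.
# A trailing partial codon is ignored.
translate_from <- function(dna, coding_frame_start) {
    n_codons <- (nchar(dna) - coding_frame_start) %/% 3L
    starts <- coding_frame_start + 1L + 3L * (seq_len(n_codons) - 1L)
    codons <- substring(dna, starts, starts + 2L)
    paste(genetic_code[codons], collapse = "")
}

# Hypothetical helper: 0-based cdr3_end implied by the first WGXG or FGXG
# motif in the translated sequence, or NA if no motif is found.
find_cdr3_end <- function(dna, coding_frame_start) {
    aa <- translate_from(dna, coding_frame_start)
    hit <- as.integer(regexpr("[WF]G.G", aa))
    if (hit == -1L)
        return(NA_integer_)
    # 0-based position of the first nucleotide of the motif, minus 1.
    as.integer(coding_frame_start + 3L * (hit - 1L) - 1L)
}

# Illustrative J-like sequence (an assumption made for this example).
j_seq <- "ACTACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG"
translate_from(j_seq, 2L)  # "YFDYWGQGTLVTVSS"
find_cdr3_end(j_seq, 2L)   # 13
```

An allele for which find_cdr3_end() returns NA corresponds to what the text above calls an unsolved allele.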
In this section we are going to take a close look at the igblastr-generated auxiliary data included in the 5 IMGT germline databases that we installed earlier with install_IMGT_germline_db(). The function we will use for this is print_J_alleles(). See ?print_J_alleles for more information.
Note that these databases are for IgBLAST organisms. This means that IgBLAST provides its own auxiliary data for them. The also_in_IgBLAST_auxdata column displayed by print_J_alleles() indicates whether an allele is also annotated in the IgBLAST-provided auxiliary data or not.
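The consistency check and the also_in_IgBLAST_auxdata flag can be pictured with the base-R sketch below. It is not the igblastr implementation: compare_auxdata() is a hypothetical helper, the two small tables are toy inputs (the IGHJ1*01 and IGHJ2*01 rows reuse values from the rat table above, IGHJ9*99 is a made-up allele), and ignoring missing values during the comparison is a design choice made for this example.

```r
# Toy "generated" and "reference" auxiliary data tables.
generated <- data.frame(
    allele_name = c("IGHJ1*01", "IGHJ2*01", "IGHJ9*99"),
    coding_frame_start = c(1L, 1L, 0L),
    cdr3_end = c(21L, 15L, 10L)
)
reference <- data.frame(
    allele_name = c("IGHJ1*01", "IGHJ2*01"),
    coding_frame_start = c(1L, 1L),
    cdr3_end = c(21L, 15L)
)

# Hypothetical helper: stop if the two tables disagree on an allele that
# is annotated in both, otherwise flag the alleles found in the reference.
compare_auxdata <- function(generated, reference) {
    both <- merge(generated, reference, by = "allele_name",
                  suffixes = c(".gen", ".ref"))
    # A missing value on either side is not treated as a disagreement.
    differs <- function(x, y) !is.na(x) & !is.na(y) & x != y
    bad <- differs(both$coding_frame_start.gen, both$coding_frame_start.ref) |
           differs(both$cdr3_end.gen, both$cdr3_end.ref)
    if (any(bad))
        stop("auxiliary data disagree for: ",
             paste(both$allele_name[bad], collapse = ", "))
    generated$also_in_reference <-
        generated$allele_name %in% reference$allele_name
    generated
}

compare_auxdata(generated, reference)
```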
The 35 J alleles in IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL are fully annotated (i.e. coding frame and CDR3/FWR4 junction are known):
db_name <- "IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
Notes:
- 2 human J alleles from IMGT (IGHJ5*04 and IGKJ4*03) are not annotated in the IgBLAST-provided auxiliary data.
- For these 2 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
The 27 J alleles in IMGT-202614-2.Mus_musculus.IGH+IGK+IGL are fully annotated:
db_name <- "IMGT-202614-2.Mus_musculus.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
Notes:
- 3 mouse J alleles from IMGT (IGLJ2P*01, IGLJ4*01_Mus_spretus, and IGLJ5*01_Mus_spretus) are not annotated in the IgBLAST-provided auxiliary data.
- For IGLJ4*01_Mus_spretus and IGLJ5*01_Mus_spretus, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
- For IGLJ2P*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons(). The function moves a sliding window of 10 amino acids along the allele sequence until it finds a match with one of the already known FWR4 in the set of J alleles. In this case there was a perfect match with the FWR4 of allele IGLJ3P*01.
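The sliding-window search can be sketched in base R as follows. This is a simplified illustration rather than the code behind infer_cdr3_ends_via_fwr4_comparisons(): best_fwr4_window() is a hypothetical helper, the amino acid sequences are made up for the example, and comparing each window with the first 10 residues of each known FWR4 is an assumption.

```r
# Hypothetical helper: find the window of `width` amino acids closest, in
# Hamming distance, to the start of any known FWR4. Returns NULL if even
# the best window has more than `max_dist` mismatches.
best_fwr4_window <- function(aa, known_fwr4, width = 10L, max_dist = 2L) {
    stopifnot(all(nchar(known_fwr4) >= width))
    known <- strsplit(substr(known_fwr4, 1L, width), "")
    n_win <- nchar(aa) - width + 1L
    if (n_win < 1L)
        return(NULL)
    best <- list(start = NA_integer_, dist = Inf)
    for (start in seq_len(n_win)) {
        window <- strsplit(substr(aa, start, start + width - 1L), "")[[1]]
        # Smallest number of mismatches against any known FWR4.
        dist <- min(vapply(known, function(k) sum(k != window), integer(1)))
        if (dist < best$dist)
            best <- list(start = start, dist = dist)
    }
    if (best$dist <= max_dist) best else NULL
}

# Toy inputs (assumptions made for this example).
known_fwr4 <- c("WGQGTLVTVSS", "FGGGTKLEIK")
unsolved <- "YFDYWGQRTLVTVSS"  # no WGXG/FGXG motif: WGQR instead of WGQG

hit <- best_fwr4_window(unsolved, known_fwr4)
hit$start  # 5 (1-based position of the inferred FWR4 in the translation)
hit$dist   # 1
```

Given the coding frame start, the 1-based window start converts to a 0-based cdr3_end as coding_frame_start + 3 * (start - 1) - 1.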
The 34 J alleles in IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL are fully annotated:
db_name <- "IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
Note that all 34 rabbit J alleles from IMGT are also annotated in the IgBLAST-provided auxiliary data.
14/15 J alleles in IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL are fully annotated:
db_name <- "IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
Notes:
- All 15 rat J alleles from IMGT are also annotated in the IgBLAST-provided auxiliary data. However, the annotations provided by IgBLAST for rat IGKJ3*01 are incomplete (the CDR3/FWR4 junction is not reported).
- The CDR3/FWR4 junction for IGKJ3*01 could not be identified by igblastr::compute_auxdata() (WGXG/FGXG motif search), nor by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
The 24 J alleles in IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL are fully annotated:
db_name <- "IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL"
print_J_alleles(db_name, translate=TRUE)
Note that all 24 rhesus monkey J alleles from IMGT are also annotated in the IgBLAST-provided auxiliary data.
In this section we're going to use install_IMGT_germline_db() on some of the non-IgBLAST organisms available at IMGT.
The organisms available in IMGT release 202614-2 are:
list_IMGT_organisms("202614-2")
# [1] "Aotus_nancymaae" "Bos_taurus"
# [3] "Camelus_dromedarius" "Canis_lupus_familiaris"
# [5] "Capra_hircus" "Chondrichthyes"
# [7] "Danio_rerio" "Equus_caballus"
# [9] "Felis_catus" "Gadus_morhua"
# [11] "Gallus_gallus" "Gorilla_gorilla_gorilla"
# [13] "Heterocephalus_glaber" "Homo_sapiens"
# [15] "Ictalurus_punctatus" "Lemur_catta"
# [17] "Macaca_fascicularis" "Macaca_mulatta"
# [19] "Mus_musculus" "Mus_musculus_C57BL6J"
# [21] "Mustela_putorius_furo" "Neogale_vison"
# [23] "Nonhuman_primates" "Oncorhynchus_mykiss"
# [25] "Ornithorhynchus_anatinus" "Oryctolagus_cuniculus"
# [27] "Ovis_aries" "Pan_troglodytes"
# [29] "Pongo_abelii" "Pongo_pygmaeus"
# [31] "Rattus_norvegicus" "Salmo_salar"
# [33] "Sus_scrofa" "Teleostei"
# [35] "Tursiops_truncatus" "Vicugna_pacos"
Since IgBLAST provides no auxiliary data for the non-IgBLAST organisms, install_IMGT_germline_db() will generate the entirety of the auxiliary data that gets included in the germline db. Note that, in this case, print_J_alleles() does not display the also_in_IgBLAST_auxdata column (it would contain FALSE for all the alleles).
db_name <- install_IMGT_germline_db("202614-2", "Canis_lupus_familiaris")
print_J_alleles(db_name, translate=TRUE)
Notes:
- Except for allele IGKJ2*01, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
- For IGKJ2*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
db_name <- install_IMGT_germline_db("202614-2", "Equus_caballus")
print_J_alleles(db_name, translate=TRUE)
Notes:
- IMGT provides no J alleles for the IGL locus.
- Except for allele IGHJ2*01, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
- For IGHJ2*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
db_name <- install_IMGT_germline_db("202614-2", "Gorilla_gorilla_gorilla")
print_J_alleles(db_name, translate=TRUE)
Note that, for all 26 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
db_name <- install_IMGT_germline_db("202614-2", "Lemur_catta")
print_J_alleles(db_name, translate=TRUE)
Notes:
- Except for alleles IGKJ5*01 and IGKJ5*02, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
- For IGKJ5*01 and IGKJ5*02, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
db_name <- install_IMGT_germline_db("202614-2", "Macaca_fascicularis")
print_J_alleles(db_name, translate=TRUE)
Notes:
- IMGT only provides germline J alleles for the IGH locus (heavy chain).
- For all 7 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
db_name <- install_IMGT_germline_db("202614-2", "Mustela_putorius_furo")
print_J_alleles(db_name, translate=TRUE)
Notes:
- For IGKJ2*01 and IGLJ8*01, the CDR3/FWR4 junction could not be identified by igblastr::compute_auxdata() (WGXG/FGXG motif search), nor by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
- For all other 19 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
db_name <- install_IMGT_germline_db("202614-2", "Oncorhynchus_mykiss")
print_J_alleles(db_name, translate=TRUE)
Notes:
- IMGT only provides germline J alleles for the IGH locus (heavy chain).
- For all 26 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
db_name <- install_IMGT_germline_db("202614-2", "Ornithorhynchus_anatinus")
print_J_alleles(db_name, translate=TRUE)
Notes:
- IMGT only provides germline J alleles for the IGH locus (heavy chain).
- For all 11 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
db_name <- install_IMGT_germline_db("202614-2", "Pongo_pygmaeus")
print_J_alleles(db_name, translate=TRUE)
Notes:
- For IGHJ4*01, IGHJ9*01, and IGHJ9*02, the CDR3/FWR4 junction could not be identified by igblastr::compute_auxdata() (WGXG/FGXG motif search), nor by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
- For all other 21 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
db_name <- install_IMGT_germline_db("202614-2", "Salmo_salar")
print_J_alleles(db_name, translate=TRUE)
Notes:
- IMGT only provides germline J alleles for the IGH locus (heavy chain).
- Except for allele IGHJ1T1D*01, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
- For IGHJ1T1D*01, the CDR3/FWR4 junction was identified by igblastr::infer_cdr3_ends_via_fwr4_comparisons().
db_name <- install_IMGT_germline_db("202614-2", "Sus_scrofa")
print_J_alleles(db_name, translate=TRUE)
Note that, for all 19 alleles, the CDR3/FWR4 junction was identified by igblastr::compute_auxdata() (WGXG/FGXG motif search).
> sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.6.0/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.6.0/lib/libRlapack.so; LAPACK version 3.12.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] igblastr_1.3.6 Biostrings_2.81.2 Seqinfo_1.3.0
[4] XVector_0.53.0 IRanges_2.47.2 S4Vectors_0.51.3
[7] BiocGenerics_0.59.6 generics_0.1.4 tibble_3.3.1
loaded via a namespace (and not attached):
[1] crayon_1.5.3 vctrs_0.7.3 httr_1.4.8
[4] cli_3.6.6 rlang_1.2.0 UCSC.utils_1.9.0
[7] jsonlite_2.0.0 xtable_1.8-8 glue_1.8.1
[10] GenomeInfoDb_1.49.1 lifecycle_1.0.5 compiler_4.6.0
[13] rvest_1.0.5 pkgconfig_2.0.3 R.oo_1.27.1
[16] R.utils_2.13.0 R6_2.6.1 pillar_1.11.1
[19] curl_7.1.0 magrittr_2.0.5 R.methodsS3_1.8.2
[22] tools_4.6.0 xml2_1.5.2