Auxiliary data in igblastr
The install_IMGT_germline_db() function in the igblastr package creates and installs a blast database for germline V, D, and J gene sequences downloaded from IMGT.
All the germline databases below were installed by running:
library(igblastr)
install_IMGT_germline_db("202614-2", "Homo_sapiens", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Mus_musculus", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Oryctolagus_cuniculus", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Rattus_norvegicus", overwrite=TRUE)
install_IMGT_germline_db("202614-2", "Macaca_mulatta", overwrite=TRUE)
R 4.6.0 and igblastr 1.3.6 (Bioconductor 3.24) were used. See sessionInfo() at the end of this document for the details.
Installing the 5 germline databases above should take less than a minute.
To get the list of germline databases currently installed in igblastr's persistent cache:
list_germline_dbs()
# db_name                                            V  D  J intdata auxdata
# _OGRDB.human.IGH+IGK+IGL.202410                  342 31 23    TRUE    TRUE
# _OGRDB.human.IGH+IGK+IGL.202410.src              354 33 24    TRUE    TRUE
# _OGRDB.human.IGH+IGK+IGL.202605                  367 31 23    TRUE    TRUE
# _OGRDB.human.IGH+IGK+IGL.202605.src              379 33 24    TRUE    TRUE
# _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603         184  9 22    TRUE    TRUE
# _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603        169 11 22    TRUE    TRUE
# _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603          172  9 22    TRUE    TRUE
# _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205       149  9 22    TRUE    TRUE
# _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410          184 10 22    TRUE    TRUE
# _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602         2294 72 39    TRUE    TRUE
# IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL           730 48 35    TRUE    TRUE
# IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL         457 49 24    TRUE    TRUE
# IMGT-202614-2.Mus_musculus.IGH+IGK+IGL           865 61 27    TRUE    TRUE
# IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL  148 11 34    TRUE    TRUE
# IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL      403 37 15    TRUE    TRUE
Notes:
- The _OGRDB.* databases are built-in databases, i.e. they are shipped with the igblastr package and therefore always present.
- The V, D, and J columns indicate the number of germline gene alleles stored in the database for each germline gene region.
- The intdata and auxdata columns indicate whether a database includes its own annotations for the germline V alleles (intdata) and germline J alleles (auxdata). These annotations consist of reporting the coding start position and FWR/CDR boundaries on the V and J sequences. When analyzing BCR or TCR sequences with IgBLAST, the latter needs access to this information in order to annotate the former. In IgBLAST's terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively. As you can see, all the germline databases above include these annotations. More on the auxiliary data below.
The 35 J alleles in IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL are fully annotated (i.e. coding frame and CDR3/FWR4 junction are known):
Notes:
- 2 alleles (IGHJ5*04 and IGKJ4*03) are not annotated in the auxiliary data shipped with IgBLAST.
- For these 2 alleles, the CDR3/FWR4 junction was determined with igblastr::compute_auxdata() (WGXG/FGXG motif search).
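To illustrate what a WGXG/FGXG motif search involves, here is a minimal base R sketch. It is not the implementation used by igblastr::compute_auxdata(); the function name find_fwr4_start(), the regular expression, and the toy J sequence are all assumptions made for this example. The idea is to look for a W or F codon followed by the codons for G, any amino acid, and G. The position of the match gives the first base of FWR4, and its offset modulo 3 gives the coding frame.

```r
# Toy J-region nucleotide sequence (illustrative only, not read from any database).
j_seq <- "ACTACTTTGACTACTGGGGCCAAGGAACCCTGGTCACCGTCTCCTCAG"

# Codon-level pattern for [WF]-G-X-G:
#   W = TGG, F = TTT or TTC, G = GGN, X = any codon.
motif <- "(TGG|TT[CT])GG[ACGT][ACGT]{3}GG[ACGT]"

# Hypothetical helper: returns the 1-based position of the first FWR4 base
# (the W/F codon of the motif) and the 0-based coding frame offset.
# Both are NA if the motif is not found.
find_fwr4_start <- function(seq) {
    pos <- as.integer(regexpr(motif, toupper(seq), perl=TRUE))
    if (pos == -1L)
        return(c(fwr4_start=NA_integer_, frame_offset=NA_integer_))
    c(fwr4_start=pos, frame_offset=(pos - 1L) %% 3L)
}

res <- find_fwr4_start(j_seq)
res
# CDR3 ends on the base just before the motif:
res[["fwr4_start"]] - 1L
```

A search like this returns the leftmost match in any frame, so a real implementation would presumably need extra checks (for example, rejecting matches that imply a stop codon in the inferred frame).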
The 27 J alleles in IMGT-202614-2.Mus_musculus.IGH+IGK+IGL are fully annotated:
Notes:
- 3 alleles (IGLJ2P*01, IGLJ4*01_Mus_spretus, and IGLJ5*01_Mus_spretus) are not annotated in the auxiliary data shipped with IgBLAST.
- For IGLJ4*01_Mus_spretus and IGLJ5*01_Mus_spretus, the CDR3/FWR4 junction was determined with igblastr::compute_auxdata() (WGXG/FGXG motif search).
- For IGLJ2P*01, the CDR3/FWR4 junction was determined with igblastr::infer_cdr3_ends_via_fwr4_comparisons(). The function moves a sliding window of 10 amino acids across the allele sequence until it finds a match with one of the already known FWR4 in the set of J alleles. In this case there was a perfect match with the FWR4 of allele IGLJ3P*01.
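The sliding-window comparison described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code behind igblastr::infer_cdr3_ends_via_fwr4_comparisons(): the helper name find_fwr4_by_comparison(), the choice of exact matching against the first 10 amino acids of each known FWR4, and the toy sequences are all hypothetical.

```r
# Hypothetical helper: slide a window of 'width' amino acids along 'aa' and
# return the 1-based start of the first window that is identical to the first
# 'width' residues of any known FWR4. Returns NA if no window matches.
find_fwr4_by_comparison <- function(aa, known_fwr4, width=10L) {
    known_prefixes <- substr(known_fwr4, 1L, width)
    n_windows <- nchar(aa) - width + 1L
    if (n_windows < 1L)
        return(NA_integer_)
    for (start in seq_len(n_windows)) {
        window <- substr(aa, start, start + width - 1L)
        if (window %in% known_prefixes)
            return(start)
    }
    NA_integer_
}

# Toy data (made up for this example): one known FWR4 lacks the canonical
# [WF]GXG motif, so a motif search alone would miss it in the query.
known_fwr4 <- c(alleleA="FSGGTKLTVL", alleleB="WGQGTLVTVSS")
query_aa <- "YVFSGGTKLTVL"

hit <- find_fwr4_by_comparison(query_aa, known_fwr4)
hit
```

Here the query has no WGXG or FGXG motif, but its last 10 residues are identical to a known FWR4, so the comparison still locates the start of FWR4.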
The 34 J alleles in IMGT-202614-2.Oryctolagus_cuniculus.IGH+IGK+IGL are fully annotated:
Note that all 34 alleles are also annotated in the auxiliary data shipped with IgBLAST.
14/15 J alleles in IMGT-202614-2.Rattus_norvegicus.IGH+IGK+IGL are fully annotated:
Notes:
- All 15 alleles are also annotated in the auxiliary data shipped with IgBLAST. However, the annotations provided for IGKJ3*01 are incomplete (the CDR3/FWR4 junction is not reported).
- The CDR3/FWR4 junction for IGKJ3*01 could not be determined with igblastr::compute_auxdata() (WGXG/FGXG motif search), nor with igblastr::infer_cdr3_ends_via_fwr4_comparisons().
The 24 J alleles in IMGT-202614-2.Macaca_mulatta.IGH+IGK+IGL are fully annotated:
Note that all 24 alleles are also annotated in the auxiliary data shipped with IgBLAST.
> sessionInfo()
R version 4.6.0 (2026-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /home/hpages/R/R-4.6.0/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.6.0/lib/libRlapack.so; LAPACK version 3.12.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] igblastr_1.3.6 Biostrings_2.81.2 Seqinfo_1.3.0
[4] XVector_0.53.0 IRanges_2.47.2 S4Vectors_0.51.3
[7] BiocGenerics_0.59.6 generics_0.1.4 tibble_3.3.1
loaded via a namespace (and not attached):
[1] crayon_1.5.3 vctrs_0.7.3 httr_1.4.8
[4] cli_3.6.6 rlang_1.2.0 UCSC.utils_1.9.0
[7] jsonlite_2.0.0 xtable_1.8-8 glue_1.8.1
[10] GenomeInfoDb_1.49.1 lifecycle_1.0.5 compiler_4.6.0
[13] rvest_1.0.5 pkgconfig_2.0.3 R.oo_1.27.1
[16] R.utils_2.13.0 R6_2.6.1 pillar_1.11.1
[19] curl_7.1.0 magrittr_2.0.5 R.methodsS3_1.8.2
[22] tools_4.6.0 xml2_1.5.2