Genejam is intended to freshen gene annotations to the current official standard. It is particularly useful when comparing genes in two datasets, for example when those datasets may not be using the same gene symbols to represent equivalent genes.
Install using the remotes
package:
remotes::install_github("jmw86069/genejam")
Note: It is recommended not to use the devtools
package to install
Github packages, mostly because the devtools
package has many more
components than are required for installation. Instead, devtools
includes all components needed to develop R packages, beyond the scope
of installing one such R package.
The simplest use is to supply a set of gene symbols:
genejam::freshenGenes(c("APOE", "APOA", "HIST1H1C"))
input | intermediate | SYMBOL |
---|---|---|
APOE | 348 | APOE |
APOA | 4018 | LPA |
HIST1H1C | 3006 | H1-2 |
For slightly more detail, you can edit the argument final
to include
things like "SYMBOL"
(default), "GENENAME"
, "ALIAS"
, "ACCNUM"
,
and more.
genejam::freshenGenes(c("APOE", "APOA", "HIST1H1C"),
final=c("SYMBOL",
"GENENAME",
"ALIAS"))
input | intermediate | SYMBOL | GENENAME | ALIAS |
---|---|---|---|---|
APOE | 348 | APOE | apolipoprotein E | AD2,APO-E,ApoE4,APOE,LDLCQ5,LPG |
APOA | 4018 | LPA | lipoprotein(a) | AK38,APOA,LP,LPA |
HIST1H1C | 3006 | H1-2 | H1.2 linker histone, cluster member | H1-2,H1.2,H1C,H1F2,H1s-1,HIST1H1C |
I frequently find myself wanting gene symbol, and the long gene name, so
I created a simple function freshenGenes2()
that uses default argument
final=c("SYMBOL", "GENENAME")
:
genejam::freshenGenes2(c("APOE", "APOA", "HIST1H1C"))
input | intermediate | SYMBOL | GENENAME |
---|---|---|---|
APOE | 348 | APOE | apolipoprotein E |
APOA | 4018 | LPA | lipoprotein(a) |
HIST1H1C | 3006 | H1-2 | H1.2 linker histone, cluster member |
The other common use case is to include other gene aliases, with the
function freshenGenes3()
:
genejam::freshenGenes3(c("APOE", "APOA", "HIST1H1C"))
input | intermediate | SYMBOL | GENENAME | ALIAS |
---|---|---|---|---|
APOE | 348 | APOE | apolipoprotein E | AD2,APO-E,ApoE4,APOE,LDLCQ5,LPG |
APOA | 4018 | LPA | lipoprotein(a) | AK38,APOA,LP,LPA |
HIST1H1C | 3006 | H1-2 | H1.2 linker histone, cluster member | H1-2,H1.2,H1C,H1F2,H1s-1,HIST1H1C |
What if you already have Entrez gene ID, and want associated
annotations? The function freshenGenes()
runs two steps:
- convert input to
intermediate
(which is the Entrez gene ID) - convert
intermediate
to the output defined byfinal
, for examplefinal=c("SYMBOL")
would create a column"SYMBOL"
.
In this example, the Entrez gene ID values are in a column "ENTREZID"
,
so we will use argument intermediate="ENTREZID"
. In this case, you
already have "intermediate"
, so you invoke the function with a
data.frame
with values in a column named "intermediate"
, and set
try_list=NULL
.
df <- data.frame(ENTREZID=c("348", "4018", "3006", "100"));
genejam::freshenGenes2(x=df, intermediate="ENTREZID")
ENTREZID | SYMBOL | GENENAME |
---|---|---|
348 | APOE | apolipoprotein E |
4018 | LPA | lipoprotein(a) |
3006 | H1-2 | H1.2 linker histone, cluster member |
100 | ADA | adenosine deaminase |
Similarly, you can provide input with a mixture of gene symbols and Entrez gene ID values. Shown below is mixed input.
idf <- data.frame(Gene=c("MINA", "", "GABRR3", "GABRR3", ""),
ENTREZID=c("", "84864", "", "200959", "200959"))
idf
Gene | ENTREZID |
---|---|
MINA | |
84864 | |
GABRR3 | |
GABRR3 | 200959 |
200959 |
You only need to specify intermediate="ENTREZID"
as before.
genejam::freshenGenes2(x=idf, intermediate="ENTREZID")
Gene | ENTREZID | SYMBOL | GENENAME |
---|---|---|---|
MINA | 84864 | RIOX2 | ribosomal oxygenase 2 |
84864 | RIOX2 | ribosomal oxygenase 2 | |
GABRR3 | 200959 | GABRR3 | gamma-aminobutyric acid type A receptor rho3 subunit (gene/pseudogene) |
GABRR3 | 200959 | GABRR3 | gamma-aminobutyric acid type A receptor rho3 subunit (gene/pseudogene) |
200959 | GABRR3 | gamma-aminobutyric acid type A receptor rho3 subunit (gene/pseudogene) |
Notice the values in "ENTREZID"
are updated based upon the first step
resolution of "Gene"
values to "ENTREZID"
. The "SYMBOL"
and
"GENENAME"
columns are populated using values in "ENTREZID"
.
The official gene nomenclature is updated multiple times per year, which means one Entrez gene ID may have a different official gene symbol before and after the update. When comparing data from two experiments, it is important to use the same gene nomenclature. Otherwise, there will be differences in results only because the names of some genes are different.
Most microarray platforms provide gene annotations, which are updated much less frequently than the official genes. For example, Affymetrix array “Clariom D Human” was last updated between 2016 and 2018 (this document was written in 2021.) In order to compare microarray results to those from literature, biological pathways, or other experiments, the gene nomenclature needs to be updated to the most current version.
In rare cases an official gene symbol is “moved” from one Entrez gene ID to another, usually when the original Entrez gene ID is deleted. In these cases, the most reasonable link between an experimental asay and the targeted gene is the gene symbol. An alternative is to use a sequence accession number used to design the assay.
As a result, a “best possible” gene annotation strategy is used.
- When Entrez gene ID is available, use it.
- When Entrez official gene symbol is available, use that to determine the Entrez gene ID.
- When a sequence accession used for assay design is available, use that to determine the associated Entrez gene ID value or values.
- When a gene symbol alias is available, use that to determine the Entrez gene ID value or values associated with this alias.
Sometimes an assay measures two genes. The steps in genejam
are
designed to maintain multi-gene associations where necessary. If one
gene symbol alias is associated with two (or more) genes, then all those
genes will become associated. Note this only happens if a higher
priority association was not already found.
Ultimately the workflow is what I and others have been doing all along, to assemble the best available gene annotations for a given dataset, while also leaving behind the fewest possible un-annotated entries. When one source of annotation fails, try another on missing entries; and so on.
The steps used in genejam
are designed for speed, to the extent that
providing 100,000 rows should return results within a few seconds at
most. Only unique values are queried, and only missing entries are
updated. When multiple values are combined by a delimiter, a highly
optimized method is used.
Lastly, these operations also use optimized mixed-alphanumeric sorting
even in the context of a list
, so things like "chr2"
will appear
before "chr10"
in sort order. Incidentally, a sort step is necessary
so you can compare whether two entries are associated with the same
genes. If one is stored as "APOE,APOE4"
and the other as
"APOE4,APOE"
, this comparison fails.
All that to say, I use these functions a lot so I need them to be
reliable and fast. It takes a few seconds just to load the associated
SQLite gene annotation data, and the process used by freshenGenes()
is
usually substantially faster than that step alone.
A full online function reference is available via the pkgdown documentation:
Full genejam command reference: https://jmw86069.github.io/genejam