GPROB

Multiple diseases can present with similar initial symptoms, making it difficult to clinically differentiate between these conditions. GPROB uses patients’ genetic information to help prioritize a diagnosis. This genetic diagnostic tool can be applied to any situation with phenotypically similar diseases with different underlying genetics.

Citation

Please cite:

Knevel, R. et al. Using genetics to prioritize diagnoses for rheumatology outpatients with inflammatory arthritis. Sci. Transl. Med. 12, (2020)

License

Please see the LICENSE file for details. Contact us for other licensing options.

Installation

Install and load the GPROB R package.

devtools::install_github("immunogenomics/GPROB")
library(GPROB)

Synopsis

GPROB estimates the probability that each individual has a given phenotype.

We need three inputs:

Population prevalences of the phenotypes of interest.
Odds ratios for SNP associations with the phenotypes.
SNP genotypes (0, 1, 2) for each individual.

Example

Let’s use a small example with artificial data to learn how to use GPROB.

Suppose we have 10 patients, and we know of 7 single nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA) or systemic lupus erythematosus (SLE).

Prevalence

First, we should find out the prevalence of RA and SLE in the population that is representative of our patients.

prevalence <- c("RA" = 0.001, "SLE" = 0.001)

Odds Ratios

Next, we need to obtain the odds ratios (ORs) from published genome-wide association studies (GWAS). We should be careful to note which alleles are associated with the phenotype to compute the risk in the correct direction.

or <- read.delim(
  sep = "",
  row.names = 1,
  text = "
snp  RA SLE
SNP1 1.0 0.4
SNP2 1.0 0.9
SNP3 1.0 1.3
SNP4 0.4 1.6
SNP5 0.9 1.0
SNP6 1.3 1.0
SNP7 1.6 1.0
")
or <- as.matrix(or)

Genotypes

Finally, we need the genotype data for each of our 10 patients. Here, the data is coded in the form (0, 1, 2) to indicate the number of copies of the risk allele.

geno <- read.delim(
  sep = "",
  row.names = 1,
  text = "
id SNP1 SNP2 SNP3 SNP4 SNP5 SNP6
 1    0    1    0    2    1    0
 2    0    0    1    0    2    2
 3    1    0    1    1    0    2
 4    1    1    0    2    0    0
 5    0    1    1    1    1    0
 6    0    0    1    3    0    2
 7    2    2    2    2    2    2
 8    1    2    0    2    1    1
 9    0    2    1   NA    1    2
10    1    0    2    2    2    0
")
geno <- as.matrix(geno)

Dealing with missing or invalid data

Before we run the GPROB() function, we need to deal with invalid and missing data.

We remove individuals who have NA for any SNP:

ix <- apply(geno, 1, function(x) !any(is.na(x)))
geno <- geno[ix,]

We remove individuals who have invalid allele counts:

ix <- apply(geno, 1, function(x) !any(x < 0 | x > 2))
geno <- geno[ix,]

And we make sure that we use the same SNPs in the or and geno matrices:

or <- or[colnames(geno),]

Run GPROB

Then we can run the GPROB function to estimate probabilities:

library(GPROB)
res <- GPROB(prevalence, or, geno)
res
#> $pop_prob
#>              RA          SLE
#> 1  0.0003556116 0.0017797758
#> 2  0.0033703376 0.0010049932
#> 3  0.0016672084 0.0006434285
#> 4  0.0003951084 0.0007126714
#> 5  0.0008885550 0.0014465506
#> 7  0.0005407850 0.0004337103
#> 8  0.0004622457 0.0006414499
#> 10 0.0003200618 0.0013374018
#> 
#> $cond_prob
#>           RA       SLE
#> 1  0.1665326 0.8334674
#> 2  0.7703046 0.2296954
#> 3  0.7215363 0.2784637
#> 4  0.3566669 0.6433331
#> 5  0.3805203 0.6194797
#> 7  0.5549386 0.4450614
#> 8  0.4188163 0.5811837
#> 10 0.1931034 0.8068966

In this example, we might interpret the numbers as follows:

Individual 2 has RA with probability 0.003, given individual genetic risk factors, disease prevalence, and the number of patients used in genetic risk score calculations.
Individual 2 has RA with probability 0.77, conditional on the additional assumption that individual 2 has either RA or SLE.

Calculations, step by step

Let’s go through each step of GPROB to understand how how it works.

The genetic risk score S_ki of individual i for disease k is defined as:

where:

x_ij is the number of risk alleles of SNP j in individual i
β_kj is the log odds ratio for SNP j reported in a genome-wide association study (GWAS) for disease k

Note: We might want to consider shrinking the risk by some factor (e.g. 0.5) to correct for possible overestimation of the effect sizes due to publication bias. In other words, consider running

geno \<-
0.5 \* geno

.

risk <- geno %*% log(or)
risk
#>            RA         SLE
#> 1  -1.9379420  0.83464674
#> 2   0.3140075  0.26236426
#> 3  -0.3915622 -0.18392284
#> 4  -1.8325815 -0.08164399
#> 5  -1.0216512  0.62700738
#> 7  -1.5185740 -0.57856671
#> 8  -1.6755777 -0.18700450
#> 10 -2.0433025  0.54844506

The known prevalence V_k of each disease in the general population:

prevalence
#>    RA   SLE 
#> 0.001 0.001

We can calculate the population level probability P_ki that each individual has the disease.

We find α_k for each disease k by minimizing (P̅_k - V_k)². This ensures that the mean probability P̅_k across individuals is equal to the known prevalence V_k of the disease in the population.

# @param alpha A constant that we choose manually.
# @param risk A vector of risk scores for individuals.
# @returns A vector of probabilities for each individual.
prob <- function(alpha, risk) {
  1 / (
    1 + exp(alpha - risk)
  )
}
alpha <- sapply(seq(ncol(risk)), function(i) {
  o <- optimize(
    f        = function(alpha, risk, prevalence) {
      ( mean(prob(alpha, risk)) - prevalence ) ^ 2
    },
    interval = c(-100, 100),
    risk = risk[,i],
    prevalence = prevalence[i]
  )
  o$minimum
})
alpha
#> [1] 6.003374 7.164133

Now that we have computed alpha, we can compute the population-level probabilities of disease for each individual.

# population-level disease probability
p <- sapply(seq_along(alpha), function(i) prob(alpha[i], risk[,i]))
p
#>            [,1]         [,2]
#> 1  0.0003556116 0.0017797758
#> 2  0.0033703376 0.0010049932
#> 3  0.0016672084 0.0006434285
#> 4  0.0003951084 0.0007126714
#> 5  0.0008885550 0.0014465506
#> 7  0.0005407850 0.0004337103
#> 8  0.0004622457 0.0006414499
#> 10 0.0003200618 0.0013374018

Next we assume that each individual has one of the diseases:

Then, we calculate the conditional probability C_ki of each disease k:

# patient-level disease probability
cp <- p / rowSums(p)
cp
#>         [,1]      [,2]
#> 1  0.1665326 0.8334674
#> 2  0.7703046 0.2296954
#> 3  0.7215363 0.2784637
#> 4  0.3566669 0.6433331
#> 5  0.3805203 0.6194797
#> 7  0.5549386 0.4450614
#> 8  0.4188163 0.5811837
#> 10 0.1931034 0.8068966

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github		.github
R		R
inst		inst
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GPROB

Citation

License

Installation

Synopsis

Example

Prevalence

Odds Ratios

Genotypes

Dealing with missing or invalid data

Run GPROB

Calculations, step by step

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GPROB

Citation

License

Installation

Synopsis

Example

Prevalence

Odds Ratios

Genotypes

Dealing with missing or invalid data

Run GPROB

Calculations, step by step

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages