ADPenetrance
Updated 16/12/2022
The repository is maintained by Thomas Spargo (thomas.spargo@kcl.ac.uk) - please reach out with any questions.
The adpenetrance R function can be used to calculate the penetrance of a germline genetic variant, or aggregation of variants, which are pathogenic for an autosomal dominant phenotype. Penetrance can be estimated with or without confidence intervals. The calculation is based on the rate at which one family disease structure ('disease state') occurs across a valid subset of disease states in people who harbour the assessed variant, the average sibship size of the people sampled for these data, and the disease risk for people not harbouring the tested variant.
The approach considers the following disease states:
- familial - two or more first-degree family members are affected
- sporadic - one family member is affected (none of their first-degree relatives are affected)
- unaffected - no family members are affected (a control population)
- affected - one or more first-degree family members are affected
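The state definitions above can be expressed as a small helper. This is an illustrative sketch only: classify_state and is_affected_state are hypothetical names, not part of adpenetrance, and the input is assumed to be the number of affected first-degree family members in each family.

```r
# Hypothetical helper (not part of adpenetrance): map the number of affected
# first-degree family members to the disease states defined above.
classify_state <- function(n_affected) {
  stopifnot(is.numeric(n_affected), all(n_affected >= 0))
  ifelse(n_affected >= 2, "familial",
         ifelse(n_affected == 1, "sporadic", "unaffected"))
}

# The 'affected' state pools the familial and sporadic states.
is_affected_state <- function(n_affected) n_affected >= 1

classify_state(c(0, 1, 3))
```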
To operate the approach, input data must be given for any two or three of the familial, sporadic and unaffected states OR for the affected and unaffected states.
The approach is further described in the details section below, and a comprehensive outline of the method is given in (1).
Estimate penetrance using input data from the familial and sporadic states without confidence intervals, taking into account disease risk for people not harbouring the tested variant, denoted in the model as g:
adpenetrance(N, MF, MS, PF, useG)
As above, but assuming that g = 0:
adpenetrance(N, MF, MS, PF)
Estimate penetrance using input data from the familial and unaffected states with confidence intervals, and assuming that g = 0:
adpenetrance(N, MF, MU, PA, PF, MF_SE, MU_SE, Zout)
Estimate penetrance using input data from the familial and sporadic states without confidence intervals, assuming that g = 0 and supplying the sibship distribution through define_sibstructure:
adpenetrance(N, MF, MS, PF, define_sibstructure=matrix(c(0:4,c(0.18, 0.18, 0.37, 0.16, 0.11)),ncol=2))
Estimate penetrance according to the rate of familial disease across people harbouring the variant sampled from the familial and sporadic disease states (without confidence intervals), assuming that g = 0:
adpenetrance(N, RX, states="fs")
Estimate penetrance according to the rate of familial disease across people harbouring the variant sampled from the familial and sporadic disease states (with confidence intervals), taking into account g:
adpenetrance(N, RX, RX_SE, Zout, states="fs", useG)
N - sibship size (define the average sibship size for the sample from which variant characteristics are estimated or assign an estimate representative of this sample). Must be provided.
useG - Probability of disease among family members not inheriting the variant. Defaults to 0 and can be specified optionally. This can affect penetrance estimates substantially in more common traits.
MF - Variant frequency in familial disease state. Specify alongside MS and/or MU. Do not specify MA or RX if used.
MS - Variant frequency in sporadic disease state. Specify alongside MF and/or MU. Do not specify MA or RX if used.
MU - Variant frequency in unaffected disease state. Specify alongside MF and/or MS OR with MA. Do not specify RX if used.
MA - Variant frequency in affected disease state. Specify alongside MU. Do not specify MF, MS, or RX if used.
PA - Probability that a person from the sampled population is affected. Specify if values are given for MA and/or MU.
PF - Probability of being familial if affected (i.e. the disease first-degree familiality rate). Specify if values are given for MF and/or MS.
MF_SE - Standard error in MF. Used to calculate confidence intervals of penetrance estimate. Specify alongside MF and the SE estimates for each state with variant frequency data provided.
MS_SE - Standard error in MS. Used to calculate confidence intervals of penetrance estimate. Specify alongside MS and the SE estimates for each state with variant frequency data provided.
MU_SE - Standard error in MU. Used to calculate confidence intervals of penetrance estimate. Specify alongside MU and the SE estimates for each state with variant frequency data provided.
MA_SE - Standard error in MA. Used to calculate confidence intervals of penetrance estimate. Specify alongside MA and the SE estimates for each state with variant frequency data provided.
Zout - Specify Z value for deriving confidence intervals of output from the calculated standard error. Defaults to 1.96, estimating 95% confidence intervals.
RX - Specify the rate of 'state X' among people harbouring the tested variant sampled from a valid set of disease states. State X can be either familial, sporadic or affected (see details below). Must also specify the states term where RX is given. Do not specify any of MF, MS, MU, or MA if used.
RX_SE - Standard error in RX. Used to calculate confidence intervals of penetrance estimate. Specify alongside RX.
states - Indicates which states are represented within the RX calculation. Is a string variable and can be defined as: "fsu","fs","fu","su","au" (see details below). Must be provided where RX is defined.
define_sibstructure - Optionally supply either a vector detailing the sibship sizes of all sampled families or a summary of the sibship distribution of the sample (see details below). Passed to adpenetrance.errorfit subfunction.
include_MLE - Logical (defaults to TRUE). For adpenetrance.MLE only, indicate whether or not to make additional unadjusted penetrance estimates via a maximum likelihood approach. If FALSE, adpenetrance.MLE functions equivalently to adpenetrance.
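The variant frequency and standard error arguments can be derived from screening counts. The sketch below assumes a simple binomial standard error, sqrt(p(1-p)/n), which this documentation does not prescribe. All counts, and the values given for N and PF, are hypothetical, and prop_se is an illustrative helper name. The call to adpenetrance only runs if the function has already been loaded.

```r
# Hypothetical screening counts (illustrative values, not real data).
n_familial <- 400;  carriers_familial <- 20
n_sporadic <- 2000; carriers_sporadic <- 30

# Variant frequency in each state.
MF <- carriers_familial / n_familial
MS <- carriers_sporadic / n_sporadic

# Assumption: binomial standard error of a proportion.
prop_se <- function(p, n) sqrt(p * (1 - p) / n)
MF_SE <- prop_se(MF, n_familial)
MS_SE <- prop_se(MS, n_sporadic)

# Run only if adpenetrance has been sourced; N and PF are placeholders.
if (exists("adpenetrance")) {
  res <- adpenetrance(N = 2, MF = MF, MS = MS, PF = 0.1,
                      MF_SE = MF_SE, MS_SE = MS_SE, Zout = 1.96)
  print(res$output)
}
```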
The approach is comprehensively described in the associated manuscript (1) and its supplementary materials.
Function Input
Input data for penetrance calculation can be given as one of two main structures.
In both data structures:
- Sibship size (N) must be indicated. This should represent the average size of sibships across the samples used to define variant characteristics. It can be estimated for the sample either directly, based on the average sibship size among the described families, or indirectly, by designating an estimate representative of the sampled population (e.g. available within global databases). In the original publication, we drew upon the World Bank, World Development Indicators database, approximating N as the Total Fertility Rate of the regions from which variant frequencies were estimated.
- Residual disease risk for people not inheriting the variant (useG) is an optional parameter, but we recommend providing it where possible. This will affect penetrance estimation particularly in more common traits (e.g. where g>0.01). The parameter can be readily calculated using the getResidualRisk function provided (see its documentation).
- The user can optionally indicate the distribution of sibship sizes across sampled families (define_sibstructure). This should be either a vector of integers containing the sibship sizes of each sampled family or a 2 column matrix or data frame where column 1 details the sibship sizes occurring in the sample and column 2 details the sample proportion to which each sib-size corresponds. This sib-structure information is passed to adpenetrance.errorfit and is used to tailor the sibship distribution of an internally-simulated population; where no information is passed to define_sibstructure, sibships in the simulated dataset follow a Poisson distribution, with lambda defined by N. This simulated population is used to fit a polynomial regression model to predict the difference between an unadjusted penetrance estimate and the true penetrance value. The unadjusted penetrance estimate made by adpenetrance is then adjusted by this predicted error to derive the adjusted penetrance estimate (see 1 for further details). Supplying accurate information to define_sibstructure will allow for a more precise adjustment of the penetrance estimate.
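Both N and the two-column form of define_sibstructure can be derived from a vector of observed sibship sizes. The following is a minimal sketch using made-up sibship sizes. The variable names are illustrative, the variant frequency and PF values in the guarded call are placeholders, and the call only runs if adpenetrance has been loaded.

```r
# Hypothetical sibship sizes for ten sampled families (illustrative only).
sibship_sizes <- c(0, 1, 1, 2, 2, 2, 2, 3, 3, 4)

# Average sibship size, usable as N.
N <- mean(sibship_sizes)

# Two-column summary: column 1 = sibship size, column 2 = sample proportion.
size_counts <- table(sibship_sizes)
sibstructure <- cbind(
  size = as.integer(names(size_counts)),
  proportion = as.numeric(size_counts) / length(sibship_sizes)
)

# Either the raw vector or the summary matrix is described as valid input.
if (exists("adpenetrance")) {
  res <- adpenetrance(N = N, MF = 0.05, MS = 0.015, PF = 0.1,
                      define_sibstructure = sibstructure)
}
```

The raw vector carries the same information, so sibship_sizes could be passed in place of the summary matrix.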
Within data structure 1, the user must also indicate:
- Variant frequency within populations screened for the variant, drawn from either two or three of the familial (MF), sporadic (MS) and unaffected (MU) states OR from the affected (MA) and unaffected (MU) states.
- Weighting factors (PA and/or PF) for the variant frequency estimates.
Within data structure 2, the user must also indicate:
- Rate of state X (RX), which is either familial, sporadic, or affected, among people harbouring the assessed variant, drawn from either two or three of the familial, sporadic and unaffected states OR from the affected and unaffected states, as indicated within the states argument. State X is always the first state indicated within the states argument.
- States (states) that are included for calculating RX.
Information for data structure 1
Variant frequency estimates can be given for the following family disease structures ('disease states'):
- familial (MF) - two or more first-degree family members are affected
- sporadic (MS) - one family member is affected (none of their first-degree relatives are affected)
- unaffected (MU) - no family members are affected (a control population)
- affected (MA) - one or more first-degree family members are affected
Note that the familial and sporadic states are subsumed within the affected state; variant frequency estimates for the 'affected' state cannot be provided alongside estimates for either or both of the 'familial' or 'sporadic' states.
Weighting factors must be given, but PA and PF are not both necessary in all disease-state combinations:
- PA must be specified if MA and/or MU are provided.
- PF must be specified if MF and/or MS are provided.
To additionally calculate confidence intervals for the penetrance estimate, the user must indicate the standard error of all the variant frequency estimates provided. These are specified in the arguments MF_SE, MS_SE, MU_SE and MA_SE, and should be given for every state for which a variant frequency estimate was provided. The Zout argument defines the level of confidence to be estimated, defaulting to a value of 1.96, which gives 95% confidence intervals.
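For confidence levels other than 95%, the corresponding Z value can be obtained from the standard normal quantile function. The helper name z_for_confidence below is hypothetical; it simply wraps base R's qnorm for a two-sided interval.

```r
# Convert a two-sided confidence level into the Z value passed as Zout.
z_for_confidence <- function(level) qnorm(1 - (1 - level) / 2)

z_for_confidence(0.95)  # approximately 1.96
z_for_confidence(0.99)  # approximately 2.58
```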
Information for data structure 2
The value of the RX argument indicates the rate at which 'state X' occurs across a valid set of disease states among people harbouring the assessed variant. The value of the states argument indicates from which states people have been sampled and which state is considered to be 'X' within the function:
- "fsu" - people are sampled from the familial, sporadic and unaffected states and state X is familial
- "fs" - people are sampled from the familial and sporadic states and state X is familial
- "fu" - people are sampled from the familial and unaffected states and state X is familial
- "su" - people are sampled from the sporadic and unaffected states and state X is sporadic
- "au" - people are sampled from the affected and unaffected states and state X is affected
The user should specify which states have been sampled in the states argument and the value of RX for this state combination (e.g. if states="fs", RX is the rate of familial disease across people harbouring the assessed variant sampled across the familial and sporadic states).
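As a rough sketch of data structure 2, RX can be computed from counts of variant carriers observed in each sampled state. The counts below are hypothetical, the binomial standard error is an assumption rather than something this documentation prescribes, and the calculation presumes that carriers were ascertained in proportion to how often each state occurs. The call to adpenetrance only runs if the function has been loaded, and N = 2 is a placeholder.

```r
# Hypothetical counts of variant carriers by disease state (illustrative only).
carriers <- c(familial = 20, sporadic = 30)

states <- "fs"
# State X is the first state named in `states` ("f" = familial here).
state_names <- c(f = "familial", s = "sporadic", a = "affected")
state_x <- state_names[[substr(states, 1, 1)]]

# RX: rate of state X among carriers sampled across the listed states.
RX <- carriers[[state_x]] / sum(carriers)

# Assumption: binomial standard error for the rate.
RX_SE <- sqrt(RX * (1 - RX) / sum(carriers))

if (exists("adpenetrance")) {
  res <- adpenetrance(N = 2, RX = RX, RX_SE = RX_SE, states = states)
}
```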
To calculate confidence intervals for the penetrance estimate, the user must indicate the standard error of RX, given in the RX_SE argument. The Zout argument defines the level of confidence to be estimated, defaulting to a value of 1.96 which will give 95% confidence.
The results are returned within a list of elements:
$output
This element stores the results, returned within a matrix.
Rows of the matrix present the rate at which 'state X' occurred across all the states modelled within the input data for people harbouring the assessed variant, the unadjusted penetrance estimate to which this rate corresponds (at the defined N, and useG), and the adjusted penetrance estimate after correcting for systematic bias within the unadjusted estimate.
The disease state rate result is presented in two forms:
- the 'observed' rate
- the 'expected' rate
If data are specified using structure 1, then the 'observed' rate has been calculated as a weighted proportion of the variant frequencies defined in the input data. If data are specified using structure 2 then the 'observed' rate is the value defined as RX.
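For the familial and sporadic states, one way to read 'weighted proportion' is through Bayes' rule, using the definitions of MF, MS and PF given above. The sketch below follows that reading and is not a transcription of the function's internals; refer to (1) for the exact weighting used in each state combination. The input values are hypothetical.

```r
# Hypothetical inputs.
MF <- 0.05   # variant frequency in the familial state
MS <- 0.015  # variant frequency in the sporadic state
PF <- 0.1    # probability of being familial if affected

# Assumed weighting: P(familial | variant, affected) by Bayes' rule,
# with (1 - PF) as the probability of being sporadic if affected.
observed_rate_fs <- (MF * PF) / (MF * PF + MS * (1 - PF))
observed_rate_fs
```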
The 'expected' rate is one of a series of 'state X' rates stored in a lookup table that is generated within the function. These are derived as per the disease model equations followed within this method (1, updated citation needed for model incorporating the 'g' term) for values of penetrance between 0 and 1 at intervals of 0.0001 and the sibship size N. The expected rate shown in the output represents the closest match in the lookup table to the observed rate.
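The lookup step can be illustrated with a toy model. The equations in state_probs below are an assumption made for this sketch (one variant-carrying parent plus N children who each inherit the variant with probability 1/2, and g = 0); they are not taken from the function's source, and (1) should be consulted for the model actually used. The names state_probs and lookup_penetrance are hypothetical.

```r
# Toy disease model (an assumption for illustration, with g = 0): one parent
# carrying the variant plus N children, each inheriting it with probability
# 1/2; every carrier is affected with probability f (the penetrance).
state_probs <- function(f, N) {
  p_unaffected <- (1 - f) * (1 - f / 2)^N
  p_sporadic <- f * (1 - f / 2)^N +
    (1 - f) * N * (f / 2) * (1 - f / 2)^(N - 1)
  cbind(familial = 1 - p_unaffected - p_sporadic,
        sporadic = p_sporadic,
        unaffected = p_unaffected)
}

# Lookup table: expected rate of familial disease across the familial and
# sporadic states for penetrance 0 to 1 in steps of 0.0001.
penetrance_grid <- seq(0, 1, by = 0.0001)
probs <- state_probs(penetrance_grid, N = 2)
expected_rate <- probs[, "familial"] / (probs[, "familial"] + probs[, "sporadic"])
expected_rate[penetrance_grid == 0] <- 0  # 0/0 at f = 0; the rate tends to 0

# Return the closest expected rate and the penetrance it corresponds to.
lookup_penetrance <- function(observed_rate) {
  i <- which.min(abs(expected_rate - observed_rate))
  c(expected_rate = expected_rate[[i]],
    unadjusted_penetrance = penetrance_grid[[i]])
}

lookup_penetrance(0.27)  # 0.27 is a hypothetical observed rate
```

A nearest-match lookup of this kind avoids inverting the model equations analytically, at the cost of resolution limited to the grid step.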
The unadjusted penetrance estimate is obtained from the lookup table and corresponds directly to the 'expected' disease state rate. The adjusted penetrance estimate is derived from the unadjusted estimate and the error predicted in this estimate under an nth degree polynomial regression model which is fitted by adpenetrance.errorfit: adjusted penetrance = unadjusted penetrance + predicted error. Refer to (1) for further detail on fitting this model.
Note that the expected disease state rate should approximately equal the observed rate. An exception to this would be if the unadjusted penetrance is estimated to be 0 or 1. In this scenario, the observed and expected rates may deviate, as the observed rate could be less than or exceed the rate expected respectively at penetrance values of 0 and 1.
If data are given to allow estimation of error in the penetrance estimate, then the output matrix will include estimates of disease state rate and penetrance at the confidence interval bounds. The standard error in the 'observed' rate will also be given.
$states
This character string indicates the disease state combination that adpenetrance has used in its analysis (see details in the 'function input' section above). Check that this matches your expectations; if not, refer back to the input.
$ResidualRiskG
This numeric indicates the probability of disease expected for people in families not harbouring the variant. This value corresponds to the useG input argument, and will be 0 by default.
$errorfit
This contains the polynomial regression model, an lm() object fitted for an nth degree polynomial, with the degree selected according to the Akaike information criterion (AIC) using the optimise() function, to the simulated sibship population as part of the error correction step performed by the adpenetrance.errorfit subfunction. It is provided for reference.
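The kind of model stored here, and the rule adjusted penetrance = unadjusted penetrance + predicted error, can be sketched on synthetic data. Everything below is illustrative: the data are generated from an arbitrary curve rather than from a simulated sibship population, and the degree is chosen with a plain loop over candidates instead of optimise(), so this is not how adpenetrance.errorfit is implemented.

```r
set.seed(1)

# Toy stand-in for the simulated population: unadjusted estimates and their
# error (true penetrance minus unadjusted estimate). Entirely synthetic.
unadjusted <- seq(0.05, 0.95, length.out = 60)
error <- 0.1 * (unadjusted - 0.5)^2 - 0.02 + rnorm(60, sd = 0.002)
sim <- data.frame(unadjusted = unadjusted, error = error)

# Fit polynomials of degree 1 to 5 and keep the one with the lowest AIC.
fits <- lapply(1:5, function(d) lm(error ~ poly(unadjusted, d), data = sim))
best_fit <- fits[[which.min(sapply(fits, AIC))]]

# adjusted penetrance = unadjusted penetrance + predicted error
new_estimate <- data.frame(unadjusted = 0.4)
predicted_error <- predict(best_fit, newdata = new_estimate)
adjusted <- new_estimate$unadjusted + predicted_error
```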
1. Spargo, T. P., Opie-Martin, S., Bowles, H., Lewis, C. M., Iacoangeli, A., & Al-Chalabi, A. (2022). Calculating variant penetrance from family history of disease and average family size in population-scale data. Genome Medicine 14, 141. doi: 10.1186/s13073-022-01142-7