Alexander Rakowski, Remo Monti, Viktoriia Huryn, Marta Lemanczyk, Uwe Ohler, Christoph Lippert
Paper: https://doi.org/10.1093/bioinformatics/btae403
Data used for model training: https://figshare.com/ndownloader/files/38200983
- clone the repository
bash download_data.sh
cp src/config.yaml.template src/config.yaml
- edit the paths in the
src/config.yaml
Training relies on weights and biases (wandb). wandb project and user name / entity are defined inside src/config.yaml
.
For an example on how to set up a sweep consider the information here.
After training has completed, a snakemake workflow automates model (checkpoint) evaluation. Follow these steps to set up snakemake:
- install snakemake in a new conda environment preferably called
snakemake
(see information on https://snakemake.readthedocs.io/en/stable/) - make sure you have configured
src/config.yaml
(variables in there are also used by snakemake) - change the name of the GPU (pytorch,wandb,pytorch lightning) environment from
nucleotran_cuda11_2
to the name of your environment inside the insrc/config.yaml
- In order to configure which checkpoints to evaluate, you can place separate sweep metadata files into
./snakemake/
as long as they follow the pattern./snakemake/sweep*.csv
. Each sweep CSV file needs to have at least these columns:"Name","ID","Sweep","basepath","metadata_file","metadata_mapping_config"
. A CSV like this can be exported from the wandb website or using the wandb API. More information on configuring a workflow with tabular data can be found here.- There is a script
src/wandb_create_sweep_csv.py
that can create such CSV files from a list of run IDs.
- There is a script
- rules (analysis steps) are defined inside
Snakefile
- each rule defines input files, output files, and code which is executed to produce those files.
Snakemake rules define their own compute resources (e.g., cores, GPUs). These can be used to automatically generate job scripts and distribute the workflow using a scheduler like slurm.
This repository comes with a snakemake slurm configuration located at slurm/config.yaml
that works on the DHC-lab cluster.
Rules can be invoked by requesting their output files (e.g., checkpoints/{ID}/best_model.ckpt
), where {ID}
is a "wildcard" that will match the wand run ID in this case (more about wildcards here).
Alternatively, rules can be invoked with the rule name (as long as their ouput files do not contain wildcards, or they have no output files but request only input files).
There is a script that sets up some sensible default snakemake arguments called run_snakemake_sbatch.sh
.
- for example, to request AUC values for the "best" checkpoints saved for all models listed in the
snakemake/sweep.csv
file(s):- check what would be run:
bash run_snakemake_cluster.sh -n all_calculate_best_model_roc_pr_validation
- submit a job that will execute the steps by submitting other jobs:
sbatch run_snakemake_cluster.sh all_calculate_best_model_roc_pr_validation
- check what would be run:
If you only want to request the outputs for a specific run, and set (train
,test
,val
), you could instead:
- check what would be run:
bash run_snakemake_cluster.sh -n checkpoints/{ID}/best_model.{set}.roc_pr.tsv.gz
{ID}
is a wildcard that needs to be substituted witht he run ID, and{set}
has to be substituted with eithertrain
,test
orval
- submit to the cluster
sbatch run_snakemake_cluster.sh checkpoints/{ID}/best_model.{set}.roc_pr.tsv.gz
The Snakefile
contains many rules starting with all_...
that allow running a specific rule for all runs listed in the snakemake/sweep*csv
files.
To run the workflow in debugging mode (subset to only one sample ID and trigger the --debug
flag for many rules), you can set debugging_mode = True
at the top of the Snakefile
. This will make many rules run on just a subset of the data and make them complete a lot faster (make sure to remove the output files after execution though!).
To run all model evaluation rules related to the prediction task:
sbatch run_snakemake_cluster.sh all_model_evaluation_on_prediction_task
To run all gnomAD rules:
sbatch run_snakemake_cluster.sh all_gnomad
To run all GTEx rules:
sbatch run_snakemake_cluster.sh all_gtex
To run all evaluation and downstream tasks:
sbatch run_snakemake_cluster.sh all
To export results from evaluation tasks to a file results_dump.tar.gz
(runs locally and does not trigger re-runs)
bash run_snakemake_cluster.sh all_data_export
See description of the tasks below.
Below, all rules are briefly documented. Usually there is a rule starting with all_<rule name>
to run them for all run IDs.
Uses wandb to collect the logged metrics across epochs for a given run {ID}
.
This file is used to select the best checkpoint.
The metrics TSV will contain all the metrics logged during training like train/loss
, train/loss_pred
, etc.
outputs:
checkpoints/{ID}/metrics.tsv
Selects the "best" available checkpoint based on val/loss
for run {ID}
if more than one checkpoint was saved.
outputs:
checkpoints/{ID}/best_model.ckpt
Tip: you can circumvent this selection by simply defining the
best_model.ckpt
yourself for each{ID}
if you want to select on different criteria.
outputs:
checkpoints/{ID}/best_model.{set}.roc_pr.tsv.gz
used command-line-tool:
src/evaluate_model_roc_pr.py
Calculates the area under the precision recall curve and area under the receiver operator characteristic curve of the model stored in the best_checkpoint.ckpt
file for each run {ID}
and set {set}
(train
,test
,val
). Predictions are averaged across forward and reverse strands.
The rule calculate_best_model_roc_pr
will request to run this rule for all run IDs and the validation set.
These rules evaluate the zero-shot performance of the best model's variant effect predictions to distinguish rare variants (potentially damaging variants) from common variants (likely benign variants). The original data were taken from Benegas et al. GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction.
The full dataset is available on huggingface. This dataset contains human variants. Find more information about gnomAD here.
Variants were intersected with ENCODE-defined cis-regulatory elements (ENCODE CREs) (https://screen.encodeproject.org/) and filtered for non-coding variants. The ENCODE CREs comprise potential enhancers (els
), promoters (pls
) and CTCF-bound elements (ctcf
).
The intersection brings the dataset down to about 1 million variants. Processing steps are described in data/external/00_download_gnomad_variants_intersect_with_encode.ipynb
.
This rule will perform variant effect predictions (model(alt_seq) - model(ref_seq)
) for all variants in {set}
for run {ID}
and dump them to an HDF5 file.
outputs (by name):
vep
: HDF5 file containing variant effect predictions- the HDF5 contains 5 datasets
predictions
,features_biological
,features_technical
,logits_biological
,logits_technical
.
- the HDF5 contains 5 datasets
varid
a text file containing the variant IDs for the predictions invep
.
used command-line-tool:
src/evaluate_VEP.py
special wildcards:
{set}
the ENCODE-CRE set to use (ctcf
,els
,pls
). By default the workflow will chooseels
Note: the HDF5 files are very large (up to 10G), and should be deleted once the other rules depending on these files below have finished.
Calculates the enrichment of rare variants at different VEP thresholds (quantiles).
outputs (by name):
results
: a TSV file containing, for each feature and evaluated threshold, the enrichment for rare variants (see below)total_counts
: a small file containing total counts of Rare/Common variants in the analysed set
used command-line-tool:
src/evaluate_VEP.py
special wildcards:
{pred_type}
one of the types of predictions (predictions
,features_biological
,features_technical
,logits_biological
,logits_technical
)- by default the workflow will run all of these
{method}
whether to look at both extreme sides of VEPs separately (bidirectional
) or take the absolute value of VEPs (absolute
). By default will dobidirectional
analysis
Example output for one feature
q | cut | Common | Rare | OR | pval | idx |
---|---|---|---|---|---|---|
0.0001 | -0.02372 | 44 | 48 | 1.00999 | 1.0 | 0 |
0.001 | -0.00870 | 426 | 493 | 1.07151 | 0.30594 | 0 |
0.01 | -0.00250 | 4480 | 4773 | 0.98623 | 0.51006 | 0 |
0.1 | -0.00040 | 44262 | 47523 | 0.99337 | 0.34005 | 0 |
0.5 | 0.00001 | 220321 | 238463 | 1.00412 | 0.32598 | 0 |
0.9 | 0.00041 | 47829 | 43891 | 0.83433 | 4.06703e-149 | 0 |
0.99 | 0.00233 | 5333 | 3853 | 0.66619 | 1.24165e-82 | 0 |
0.999 | 0.00787 | 556 | 365 | 0.60748 | 8.64057e-14 | 0 |
0.9999 | 0.02182 | 62 | 30 | 0.44794 | 0.00023 | 0 |
- q: the quantile
- cut: the cutoff corresponding to the quantile
- Common: the number of common variants at the cutoff
- Rare: the number of rare variants at the cutoff
- OR: the odds-ratio for the test comparing variants in that quantile agains all the rest (measures the enrichment of rare variants)
- pval: p-value for the fisher exact test
- idx: the index for the feature (in the same order it comes out of the model, i.e., 0 is the first feature)
In the example above, the feature's variant effect predictions deplete rare variant at larger quantiles.
Extract the variants for features for which variant effect predictions were significantly enriched for rare variants. These files can be used to see if different types of predictions or models select different sets of variants, or calculate the number of unique variants prioritized by the model.
outputs (by name):
q001
a TSV file containing variants selected at quantile0.001
q999
a TSV file containing variants selected at quantile0.999
special wildcards:
- uses the same wildcards as the rule above
used command-line-tool:
src/select_variants_from_VEP_quantile.py
the default is to select at quantiles 0.999
and 0.001
and select features enriched with p-value threshold 1e-5
.
Example output
varID | N | i_select |
---|---|---|
1_924305_G_A | 6 | 76,151,246,512,558,2026 |
1_924310_C_G | 6 | 160,517,558,632,1584,18 |
1_924533_A_G | 3 | 160,632,1584 |
1_938654_G_A | 116 | 27,66,70,126,133,153,[...],15 |
1_941787_A_G | 2 | 495,537 |
1_961346_A_C | 1 | 699 |
- varID: the variant identifier
- N: the number of features that selected this variant
- i_select: the feature index values (match
idx
in output of the rule above) in named afte the order they come out of the model
This task was inspired by Karrolus et al. 2023. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. But re-processes the GTEx data from the eQTL-catalog (it does not use the data provided with the Karrolus study). Find more information about GTEx here.
The goal is the measure the zero-shot performance of the model VEPs to distinguish Susie-finemapped-eQTL variants (potentially causal expression eQTL variants with high "posterior inclusion probabilities"; PIP) from position- and allele-frequency-matched non-finemapped variants (potentially non-causal variants with low PIP).
To arrive at the two sets, the following filters were applied:
- exclude all variants that overlap protein-coding genes
- exclude all variants that overlap the gene with which they are associated (if the gene is a lncRNA for example)
for the positive set:
- keep only variants with
PIP > 0.95
in the tissue of interest, and at leastPIP > 0.5
across tissues in which the variant was associated with gene expression.
for the negative set:
- for each positive variant, select variants in a +- 5kb window around the variant
- make sure the variant does not overlap the gene associated with the positive variant
- keep only variants with
max(PIP) < 0.05
across all tissues - keep only the variant with the closest minor allele frequency to the positive variant
- in case there are ties, select the physically closest variant (based on variant position)
Data download and processing are described in
data/external/eQTL_catalog/00_download_eqtl_catalog_gene_annotation.ipynb
data/external/eQTL_catalog/01_download_eqtl_catalog_susie_and_sumstats.ipynb
data/external/eQTL_catalog/02_get_highpip_variants_select_negatives.ipynb
These filters lead to a set of 2339 positives (SNPs) and 2304 negatives (SNPs), some negatives were selected for multiple positives (explaining the small difference between the numbers). There are also some longer indels in the files but these are not evaluated by the rule below.
Run the GTEx variant effect predictions and export result tables similar to those for the gnomAD task above. Unlike for the gnomAD task, this only looks at absolute values for the VEPs. It additionally performs logistic regression of the absolute VEPs against an indicator variable (0/1) indicating if the variant was fine-mapped (positive) or not (negative), as well as a zero-shot AUROC against those labels. It also calculates enrichments and performs Fisher exact tests for fine-mapped variants at two VEP quantiles (0.9 and 0.99, i.e., the 10% and 1% largest VEPs vs the rest) for each feature and prediction type.
outputs (by name):
results
a TSV containing, for each feature and evaluated threshold, the enrichment for fine-mapped variants- there is one results file for each prediction type (
predictions
,features_biological
, ...)
- there is one results file for each prediction type (
variants
a TSV containing variants selected at quantile0.99
for features with VEPs significantly enriched for eQTL variants (p<1e-5
).- there is one variants file for each prediction type
used command-line-tool:
src/predict_and_evaluate_gtex_eqtl_finemapped_variants.py
The variants
TSV files have the same format at the on from gnomAD rule select_variants_from_vep_quantile_best_model
.
The results
TSV contains many columns (displayed transposed here):
idx | 0 | the feature index |
---|---|---|
mean_diff | 0.000407 | the average absolute VEPs |
sd_diff | 0.00125 | the standard deviation of the absolute VEPs |
coef | -0.00514 | coefficient in the logistic regression on (0/1); log(OR)/sd(VEP) |
se | 0.0295 | standard deviation of the coefficient |
stat | -0.174 | test statisic |
pval | 0.862 | p-value |
ci_low | -0.0630 | 95% confidence interval lower bound |
ci_high | 0.0527 | 95% confience interval upper bound |
pseudo_r2 | 4.71e-06 | pseudo r-squared |
converged | True | if the optimization converged |
N | 4643 | total number of observations |
N_pos | 2339 | number of positive variants |
N_neg | 2304 | number of negative variants |
AUROC | 0.506 | zero-shot area under the ROC curve |
cut_q0.9 | 0.711 | cutoff for quantile 0.9 |
N_pos_q0.9 | 245 | number of positives selected at q0.9 |
N_neg_q0.9 | 222 | number of negatives selected at q0.9 |
OR_q0.9 | 1.10 | odds ratio at q0.9 |
pfisher_q0.9 | 0.354 | p-value for the Fisher exact test at q0.9 |
cut_q0.99 | 3.29 | cutoff for quantile 0.99 |
N_pos_q0.99 | 24 | number of positives selected at q0.99 |
N_neg_q0.99 | 24 | number of negatives selected at q0.99 |
OR_q0.99 | 0.985 | odds ratio at q0.99 |
pfisher_q0.99 | 1.0 | p-value for the Fisher exact test at q0.99 |
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources including notebooks for processing
│ └── processed <- Main data for model training etc
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ and a short description. notebooks are ordered in groups with folders
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported (TODO: make this work)
├── src <- Source code for use in this project.
├── __init__.py <- Makes src a Python module
│
├── dataloading <- pytorch dataloaders and related classes
│
├── features <- Scripts to turn raw data into features for modeling
│
├── models <- model-related classes etc
Project based on the cookiecutter data science project template. #cookiecutterdatascience
python3 src/train_wandb.py --no_wandb_use_config_file_directly=sweep_configurations/default_sweep_config.json