This document outlines the anatomy of the Squirls model, specifically, how a Squirls score is calculated for a variant.
As outlined in the Squirls manuscript, Squirls consists of two random forest estimators (one for the donor site and the other for the acceptor site) followed by a logistic regression. Both random forests calculate predictions for a single variant, and the logistic regression then transforms these predictions into the final Squirls score. For a single variant, Squirls calculates scores for all overlapping transcripts.
The first step of the prediction process is the calculation of a small set of interpretable numeric features for machine learning. The features are then passed to random forest estimators. The random forests use different feature subsets to perform the prediction.
This section lists the features used by the donor random forest estimator:
- Ri wt donor
Information content (Ri) of the closest canonical donor site.
- ΔRi canonical donor
Difference between Ri of ref and alt alleles of the closest donor site (0 bits if the variant does not affect the site).
- ΔRi wt closest donor
Difference between Ri of the closest donor and the downstream (3’) donor site (0 bits if this is the donor site of the last intron).
- Donor offset
Number of 1 bp-long steps required to pass through the exon/intron border of the closest donor site. The number is negative if the variant is located upstream from the border.
- max Ri cryptic donor window
Maximum Ri obtained by sliding a window over all 9 bp sequences that contain the alt allele.
- ΔRi cryptic donor
Difference between max Ri of sliding window of all 9 bp sequences that contain the alt allele and Ri of alt allele of the closest donor site.
- phyloP
Mean phyloP score of the ref allele region.
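The sliding-window features above can be illustrated with a minimal sketch. It assumes a toy position frequency matrix (`toy_matrix`, hypothetical, not the matrix used by Squirls), computes Ri as the sum of `2 + log2(frequency)` over the positions of a site without any small-sample correction, and treats "contains the alt allele" as "overlaps the alt allele interval". All function names are illustrative only.

```python
import math

WIDTH = 9  # length of the donor site window in bp
CONSENSUS = "CAGGTAAGT"  # 3 exonic + 6 intronic bases of a consensus-like donor


def toy_matrix():
    """Hypothetical per-position base frequencies; a real matrix would be
    estimated from annotated donor sites."""
    matrix = []
    for i, best in enumerate(CONSENSUS):
        top = 0.97 if i in (3, 4) else 0.70  # the GT dinucleotide is nearly invariant
        rest = (1.0 - top) / 3
        matrix.append({b: (top if b == best else rest) for b in "ACGT"})
    return matrix


def ri(site, matrix):
    """Individual information of a site in bits (no small-sample correction).
    Assumes no frequency in the matrix is zero."""
    return sum(2.0 + math.log2(matrix[i][base]) for i, base in enumerate(site))


def windows_with_allele(seq, start, end, width=WIDTH):
    """Yield every width-long window of seq that overlaps seq[start:end]."""
    first = max(0, start - width + 1)
    last = min(end - 1, len(seq) - width)
    for i in range(first, last + 1):
        yield seq[i:i + width]


def max_ri_cryptic(seq, start, end, matrix):
    """Maximum Ri over all windows that overlap the alt allele."""
    return max(ri(w, matrix) for w in windows_with_allele(seq, start, end))


MATRIX = toy_matrix()
# Alt sequence in which the alt allele T occupies index 12 (0-based)
ALT_SEQ = "ACCTGACGCAGGTAAGTCCTTAGCA"
max_ri = max_ri_cryptic(ALT_SEQ, 12, 13, MATRIX)

# ΔRi cryptic donor: max window Ri minus Ri of the alt allele of the closest
# donor site (the site sequence here is a made-up example)
delta_ri_cryptic = max_ri - ri("AAGGTGAGT", MATRIX)
```

For a single-nucleotide variant, nine windows of 9 bp overlap the altered base, so the maximum is taken over at most nine candidate sites.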
These are the features used by the acceptor random forest estimator:
- ΔRi canonical acceptor
Difference between information content (Ri) of ref and alt alleles of the closest acceptor site (0 if the variant does not affect the acceptor site).
- ΔRi cryptic acceptor
Difference between max Ri of sliding window applied to alt allele neighboring sequence and Ri of alt allele of the closest acceptor site.
- Creates AG in AGEZ
1 if the variant creates a novel AG di-nucleotide in AGEZ, 0 otherwise.
- Creates YAG in AGEZ
1 if the variant creates a novel YAG tri-nucleotide in AGEZ, where Y stands for a pyrimidine derivative (cytosine or thymine), 0 otherwise (see Wimmer et al., 2020).
- Acceptor offset
Number of 1 bp-long steps required to pass through the exon/intron border of the closest acceptor site. The number is negative if the variant is located upstream from the border.
- Exon length
Number of nucleotides spanned by the exon in which the variant is located (-1 for non-coding variants that do not affect the canonical donor/acceptor regions).
- ESRSeq
Estimate of impact of random hexamer sequences on splicing efficiency when inserted into five distinct positions of two different minigene exons obtained by in vitro screening (Ke et al., 2011).
- SMS
Estimated splicing efficiency for 7-mer sequences obtained by saturating a model exon with single and double base substitutions (saturation mutagenesis derived splicing score, Ke et al., 2018).
- phyloP
Mean phyloP score of the ref allele region.
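The two AGEZ indicator features can be sketched as follows. This is a simplified illustration that assumes the ref and alt sequences of the AGEZ are already available as strings of equal length (that is, the variant is a substitution), so that motif positions can be compared directly. The function names are hypothetical and do not come from the Squirls code base.

```python
PYRIMIDINES = {"C", "T"}


def ag_positions(seq):
    """Start indices of every AG di-nucleotide in seq."""
    return {i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"}


def yag_positions(seq):
    """Start indices of every YAG tri-nucleotide (Y = C or T) in seq."""
    return {
        i for i in range(len(seq) - 2)
        if seq[i] in PYRIMIDINES and seq[i + 1:i + 3] == "AG"
    }


def creates_ag_in_agez(ref_agez, alt_agez):
    """1 if the alt sequence has an AG at a position where ref has none."""
    if len(ref_agez) != len(alt_agez):
        raise ValueError("this sketch only handles substitutions")
    return int(bool(ag_positions(alt_agez) - ag_positions(ref_agez)))


def creates_yag_in_agez(ref_agez, alt_agez):
    """1 if the alt sequence has a YAG at a position where ref has none."""
    if len(ref_agez) != len(alt_agez):
        raise ValueError("this sketch only handles substitutions")
    return int(bool(yag_positions(alt_agez) - yag_positions(ref_agez)))


# C>G substitution at index 8 creates CAG: both features fire
ag_1 = creates_ag_in_agez("TTCTCTCACTTTC", "TTCTCTCAGTTTC")
yag_1 = creates_yag_in_agez("TTCTCTCACTTTC", "TTCTCTCAGTTTC")

# The same substitution after a G creates GAG: a novel AG but no novel YAG
ag_2 = creates_ag_in_agez("TTCTCTGACTTTC", "TTCTCTGAGTTTC")
yag_2 = creates_yag_in_agez("TTCTCTGACTTTC", "TTCTCTGAGTTTC")
```

Insertions and deletions shift positions within the AGEZ, so handling them would need an alignment-aware comparison that this sketch leaves out.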
Note
The values of all features based on information theory are in bits of information.
The Squirls algorithm consists of two random forest estimators trained to recognize variants that change splicing of a donor or an acceptor site. Given a set of splice features, each estimator calculates the deleteriousness of the corresponding variant.
If a feature cannot be calculated for a variant, the missing value is imputed with the median value of that feature observed during training of the model.
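Median imputation can be sketched in a few lines. The example below assumes features are stored as dictionaries keyed by feature name and that a missing value is represented by `None` or NaN. The feature names and helper functions are hypothetical.

```python
import math
from statistics import median


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_medians(training_rows, feature_names):
    """Compute the per-feature median over the observed training values."""
    medians = {}
    for name in feature_names:
        observed = [row[name] for row in training_rows if not is_missing(row.get(name))]
        medians[name] = median(observed)
    return medians


def impute(row, medians):
    """Return a copy of row in which missing features are replaced by medians."""
    return {
        name: medians[name] if is_missing(row.get(name)) else row[name]
        for name in medians
    }


TRAINING = [
    {"donor_offset": -2, "phylop": 1.0},
    {"donor_offset": 1, "phylop": 3.0},
    {"donor_offset": 5, "phylop": 10.0},
    {"donor_offset": 7, "phylop": None},  # missing values are ignored when fitting
]
MEDIANS = fit_medians(TRAINING, ["donor_offset", "phylop"])
imputed = impute({"donor_offset": 4, "phylop": float("nan")}, MEDIANS)
```

The medians are computed once from the training data and then reused unchanged at prediction time, so the imputed value does not depend on the variants being scored.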
Each random forest consists of n decision trees that use the splice features to make a decision regarding the deleteriousness of the variant in question.
Squirls uses logistic regression as the final step to integrate outputs of the donor and acceptor random forests into the final Squirls score.
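The way the two stages fit together can be sketched as below. This is an illustration under stated assumptions, not the Squirls implementation: each forest output is taken to be the mean of its tree outputs, the trees are toy decision stumps, and the logistic regression intercept and weights are made-up numbers rather than trained parameters.

```python
import math


def forest_output(trees, features):
    """Average the outputs of the individual decision trees (assumed aggregation)."""
    return sum(tree(features) for tree in trees) / len(trees)


def combine(donor_p, acceptor_p, intercept, w_donor, w_acceptor):
    """Logistic regression over the two forest outputs."""
    z = intercept + w_donor * donor_p + w_acceptor * acceptor_p
    return 1.0 / (1.0 + math.exp(-z))


# Toy decision stumps standing in for trained trees (hypothetical thresholds)
DONOR_TREES = [
    lambda f: 1.0 if f["delta_ri_canonical_donor"] > 2.0 else 0.0,
    lambda f: 1.0 if abs(f["donor_offset"]) <= 2 else 0.0,
    lambda f: 1.0 if f["phylop"] > 5.0 else 0.0,
]
ACCEPTOR_TREES = [
    lambda f: 1.0 if f["delta_ri_canonical_acceptor"] > 2.0 else 0.0,
    lambda f: 1.0 if f["creates_ag_in_agez"] == 1 else 0.0,
]

features = {
    "delta_ri_canonical_donor": 3.5,
    "donor_offset": 1,
    "phylop": 4.0,
    "delta_ri_canonical_acceptor": 0.0,
    "creates_ag_in_agez": 0,
}

donor_p = forest_output(DONOR_TREES, features)
acceptor_p = forest_output(ACCEPTOR_TREES, features)

# Made-up coefficients; the real values are learned during training
score = combine(donor_p, acceptor_p, intercept=-3.0, w_donor=6.0, w_acceptor=6.0)
```

Because the logistic function is monotonic, a higher output from either forest yields a higher final score as long as the corresponding weight is positive.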
- Information content
Individual information content of a nucleotide sequence Ri(j) that is related to thermodynamic entropy and the free energy of binding. Ri can also be used to compare sites with one another.
- AGEZ
AG‐exclusion zone, the sequence between the branch point and the proper 3'ss AG that is devoid of AGs, as defined by Gooding et al., 2006.