# Predicted effects of mutations vs TF motif matches

BlueSTARR models trained on STARR-seq data were used to predict the effects of mutations by in-silico saturated mutagenesis of the ENCODE cCRE regions. To generate these predictions, a trained BlueSTARR model predicts the activation of a sequence window centered on the cCRE, with the same length as the sequence bins on which the model was trained (for example, 300 bp). This is the reference allele prediction. Each base is then mutated to each of the three alternative alleles, and the predicted activation of each mutated sequence is obtained, for a total of 3 × seqlen + 1 predictions per cCRE.
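The enumeration of sequences described above can be sketched as follows. This is a minimal illustration, not the BlueSTARR pipeline itself: the function name `saturated_mutagenesis` and the toy sequence are assumptions made for this example, and the sequence is assumed to contain only A, C, G, and T.

```python
from typing import Iterator, Tuple

BASES = "ACGT"

def saturated_mutagenesis(ref_seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (pos, ref_base, alt_base, mutated_seq) for every single-base substitution.

    Hypothetical helper for illustration; assumes ref_seq contains only A/C/G/T.
    """
    ref_seq = ref_seq.upper()
    for pos, ref_base in enumerate(ref_seq):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue  # skip the reference allele itself
            yield pos, ref_base, alt_base, ref_seq[:pos] + alt_base + ref_seq[pos + 1:]

ref = "ACGTAC"  # toy sequence; a real bin would be, e.g., 300 bp long
variants = list(saturated_mutagenesis(ref))
n_predictions = len(variants) + 1  # 3 x seqlen mutated sequences plus the reference
print(n_predictions)
```

Each yielded sequence would be passed to the model alongside the unmodified reference sequence.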

In this approach, the predicted effect of a mutation is the $\log_2$ of the predicted activation of the mutated sequence minus the $\log_2$ of the predicted activation of the reference sequence, i.e., the $\log_2$ fold change (the `log2FC` column used below).
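Written as a formula, the definition above reads as follows. The symbols $\hat{a}_{\mathrm{alt}}$ and $\hat{a}_{\mathrm{ref}}$ are notation introduced here for the predicted activations of the mutated and reference sequences; they do not come from the BlueSTARR code.

```latex
\mathrm{log2FC}
  = \log_2 \hat{a}_{\mathrm{alt}} - \log_2 \hat{a}_{\mathrm{ref}}
  = \log_2 \frac{\hat{a}_{\mathrm{alt}}}{\hat{a}_{\mathrm{ref}}}
```

A positive value therefore means the mutation is predicted to increase activation relative to the reference, and a negative value means it is predicted to decrease it.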

Here we investigate the concordance (or lack thereof) of transcription factor motif matches and the predicted effects of mutations.

## Setup

In [1]:
import duckdb
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

import sys
if '../' not in sys.path: sys.path.append('../')

For now, we only have mutation predictions from a BlueSTARR model trained on K562 data using 300bp sequence bins. 

In [5]:
PROJECT_ROOT = Path('/hpc/group/igvf')
DB_ROOT = PROJECT_ROOT / 'db'
DATA_ROOT = Path('../../igvf-pm')
MUT_PRED_DB = DB_ROOT / 'cCRE-preds-K562'
MOTIFS_DB = DB_ROOT / 'motifs'

## Open and/or create databases

Open the database of predicted effects of saturated mutagenesis on cCREs. See [igvf_allelepred2db.py](../starrutil/igvf_allelepred2db.py) for how to generate this database from "processed" prediction tables. We also need the database of motif matches deemed significant in a genome-wide scan.

In [7]:
# MUT_PRED_DB is already rooted at DB_ROOT, so build the glob from it directly
mutpreds = duckdb.read_parquet(str(MUT_PRED_DB / '**' / '*.parquet'), hive_partitioning=True)
motifs_db = duckdb.read_parquet(f'{MOTIFS_DB}-signif.parquet')

In [8]:
print(f"There are {mutpreds.count('chrom').fetchall()[0][0]:,} mutations with predicted effect, " +
      f"and {motifs_db.count('chrom').fetchall()[0][0]:,} motif matches deemed significant.")

There are 816,826,569 mutations with predicted effect, and 723,249 motif matches deemed significant.


### Match motif matches to cCRE mutations

To make the join more performant, we use a helper function (`partition_join`) that runs the query one partition at a time, partitioning on the same column (`chrom`) by which the mutation database is partitioned.

Note that the join also includes a fixed number of base pairs (here 5) flanking the motif match on each side.
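The join condition and the per-chromosome partitioning can be illustrated with a small pure-Python sketch. It is not the implementation of `starrutil.dbutil.partition_join`, which operates on DuckDB relations. The function name `windowed_join_by_chrom`, the tuple layouts, and the toy coordinates are all assumptions made for this example, and motif coordinates are assumed to be inclusive.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def windowed_join_by_chrom(motifs, mutations, flank=5):
    """Pair each motif match with the mutations inside its flank-padded window.

    motifs:    iterable of (chrom, start, stop), inclusive coordinates (assumed)
    mutations: iterable of (chrom, allele_pos)
    Returns a list of (chrom, start, stop, allele_pos, pos_relative_to_start).
    """
    # Partition both inputs by chromosome so each partition is joined on its own.
    motifs_by_chrom = defaultdict(list)
    for chrom, start, stop in motifs:
        motifs_by_chrom[chrom].append((start, stop))
    muts_by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        muts_by_chrom[chrom].append(pos)

    hits = []
    for chrom, chrom_motifs in motifs_by_chrom.items():
        positions = sorted(muts_by_chrom.get(chrom, []))
        for start, stop in chrom_motifs:
            # Same condition as the SQL join: start - flank <= pos <= stop + flank
            lo = bisect_left(positions, start - flank)
            hi = bisect_right(positions, stop + flank)
            for pos in positions[lo:hi]:
                hits.append((chrom, start, stop, pos, pos - start))
    return hits

toy_motifs = [("chr1", 100, 109)]
toy_mutations = [("chr1", 94), ("chr1", 95), ("chr1", 100),
                 ("chr1", 114), ("chr1", 115), ("chr2", 100)]
toy_hits = windowed_join_by_chrom(toy_motifs, toy_mutations)
```

In this toy input, only positions 95, 100, and 114 on `chr1` fall within the padded window. The positions relative to the motif start then run from -5 up to the motif length plus 4.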

In [9]:
motif_mut_hits_db = f"{MUT_PRED_DB}_motif-hits.parquet"
if Path(motif_mut_hits_db).exists():
    print(f"Reading motif hits from existing {motif_mut_hits_db}")
else:
    print("Joining motif matches and mutations, one chromosome at a time.")
    num_leading_bps = 5 # number of leading (and trailing) base pairs to include in the join
    from starrutil.dbutil import partition_join
    project_expr = (
        "m.motif_id, m.motif_alt_id, "
        # raw string so the regex backslashes are not treated as (invalid) Python escapes
        r"regexp_extract(m.motif_alt_id, '\.([^\.]+)$', 1) as motif_name, "
        "m.chrom, m.start as mot_start, m.stop as mot_end, "
        "ap.allele_pos, ap.ref_allele, ap.allele as alt_allele, log2FC"
    )
    motif_mut_hits = partition_join(
        motifs_db, mutpreds,
        partition_col="chrom",
        project_expr=project_expr,
        join_expr=f"m.start-{num_leading_bps} <= ap.allele_pos and m.stop+{num_leading_bps} >= ap.allele_pos",
        aliases=("m", "ap"),
        verbose=True
    )
    print(f"Writing motif hits to {motif_mut_hits_db}")
    motif_mut_hits.to_parquet(motif_mut_hits_db)
motif_mut_hits = duckdb.read_parquet(motif_mut_hits_db)

Joining motif matches and mutations, one chromosome at a time.
Running query for chrom='chr11'
Running query for chrom='chr12'
Running query for chrom='chr13'
Running query for chrom='chr19'
Running query for chrom='chr2'
Running query for chrom='chr3'
Running query for chrom='chr5'
Running query for chrom='chr8'
Running query for chrom='chr1'
Running query for chrom='chr22'
Running query for chrom='chr14'
Running query for chrom='chr16'
Running query for chrom='chr18'
Running query for chrom='chr21'
Running query for chrom='chr9'
Running query for chrom='chrX'
Running query for chrom='chrY'
Running query for chrom='chr10'
Running query for chrom='chr15'
Running query for chrom='chr17'
Running query for chrom='chr20'
Running query for chrom='chr4'
Running query for chrom='chr6'
Running query for chrom='chr7'
Writing motif hits to /hpc/group/igvf/db/cCRE-preds-K562_motif-hits.parquet

In [10]:
num_motif_hits = motif_mut_hits.count('chrom').fetchall()[0][0]
print(f"There are {num_motif_hits:,} mutations with motif hits, which is " +
      f"{num_motif_hits/mutpreds.count('chrom').fetchall()[0][0]:.2%} of mutations.")


There are 2,920,638 mutations with motif hits, which is 0.36% of mutations.


## Visualizing predicted activations versus motif matches

### Per-motif statistics for filtering

Note that the relationship between motif name (derived from the suffix of the alternative motif_id) and motif_id is not strictly 1:1. Specifically, for a few motif names there are multiple motif_ids, in some cases with PWMs of different lengths and thus motif matches of different lengths.
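To illustrate how a motif name is derived from the suffix of the alternative ID, and how several motif_ids can collapse onto one name, here is a small sketch using Python's `re` module with the same pattern as the `regexp_extract` call in the join. The alternative-ID strings are hypothetical examples of an assumed `<motif_id>.<name>` format, and the helper name is made up for this example.

```python
import re
from collections import defaultdict

def motif_name_from_alt_id(motif_alt_id: str) -> str:
    """Return the text after the last '.', or '' if the pattern does not match."""
    match = re.search(r"\.([^.]+)$", motif_alt_id)
    return match.group(1) if match else ""

# Hypothetical (motif_id, motif_alt_id) pairs; the alt-ID format is an assumption.
records = [
    ("MA0139.2", "MA0139.2.CTCF"),
    ("MA1929.2", "MA1929.2.CTCF"),
    ("MA0855.1", "MA0855.1.RXRB"),
]

ids_by_name = defaultdict(set)
for motif_id, alt_id in records:
    ids_by_name[motif_name_from_alt_id(alt_id)].add(motif_id)

# Names that map to more than one motif_id, i.e. where the relationship is not 1:1
ambiguous = {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}
print(ambiguous)
```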

In [11]:
duckdb.sql("select motif_name, list(distinct motif_id) as motif_IDs, "
           "min(mot_end - mot_start + 1) as motif_len_min, "
           "max(mot_end - mot_start + 1) as motif_len_max "
           "from motif_mut_hits "
           "group by motif_name having count(distinct motif_id) > 1")

┌────────────┬────────────────────────────────┬───────────────┬───────────────┐
│ motif_name │           motif_IDs            │ motif_len_min │ motif_len_max │
│  varchar   │           varchar[]            │     int64     │     int64     │
├────────────┼────────────────────────────────┼───────────────┼───────────────┤
│ RXRB       │ [MA0855.1, MA1555.1]           │            14 │            14 │
│ THRB       │ [MA1575.2, MA1576.2]           │            17 │            18 │
│ CTCF       │ [MA1930.2, MA1929.2, MA0139.2] │            15 │            33 │
│ RXRG       │ [MA0856.1, MA1556.1]           │            14 │            14 │
│ RARA       │ [MA0730.1, MA0729.1]           │            17 │            18 │
└────────────┴────────────────────────────────┴───────────────┴───────────────┘

We generate per-motif statistics, including the number of matches within cCREs and the most negative and most positive predicted effects, so that we can filter on them. Because of the above, we have to group by motif_id, not motif_name.

More specifically, because we want to relate mutation effect predictions to motif matches, we have to distinguish motifs with different PWMs (and thus IDs).

In [12]:
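motif_mut_stats = duckdb.sql(
    "select motif_id, first(motif_name) as motif_name, "
    "count(distinct chrom) as num_chroms, "
    # count distinct (chrom, mot_start) pairs so that equal start coordinates
    # on different chromosomes are not collapsed into a single hit
    "count(distinct chrom || ':' || cast(mot_start as varchar)) as num_motif_hits, "
    "count(*) as num_mut, "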
    "min(log2FC) as min_log2FC, max(log2FC) as max_log2FC, "
    "max(abs(log2FC)) as max_abs_log2FC "
    "from motif_mut_hits "
    "group by motif_id").df()

### Visualize per-basepair statistics for motif matches

Here we visualize, for each base within a motif match and for a few bases flanking it on either side, the predicted effect of mutating that base.

For each base (= reference) in motif matches to cCREs, we plot (as a boxplot) the largest predicted effect for mutating it. The largest effect is the one with the largest magnitude, positive or negative.
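Selecting the largest-magnitude effect can be sketched in plain Python as follows. The helper name `largest_effect` and the log2FC values are hypothetical. If two alleles tie in magnitude, this sketch keeps the first one encountered, which need not match the tie-breaking of the SQL query used below.

```python
def largest_effect(effects_by_alt):
    """Return (alt_allele, log2FC) for the alternative allele with the largest |log2FC|.

    effects_by_alt maps each alternative allele to its predicted log2FC.
    """
    return max(effects_by_alt.items(), key=lambda item: abs(item[1]))

# Hypothetical predicted effects for one reference base (here an 'A')
site = {"C": 0.12, "G": -0.85, "T": 0.40}
alt_allele, log2fc = largest_effect(site)
print(alt_allele, log2fc)
```

The sign is preserved, so a strongly negative effect is selected over a weaker positive one.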

In [13]:
import matplotlib.ticker as ticker

def plot_motif_mutation_effects(mutation_effects, motif_id, ax, 
                                xlim=None, n_xticks=None):
    motif_name = mutation_effects['motif_name'].unique()[0]
    ax = sns.boxplot(y="motif_pos", x="log2FC", data=mutation_effects,
                     hue="ref_allele", hue_order=["A", "C", "G", "T"],
                     orient="h", 
                     whis=(2.5,97.5), fliersize=0.5, gap=0.2, width=0.9, linewidth=0.5, ax=ax)
    ax.xaxis.grid(True)
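    # y coordinate of each motif position (one tick per position category)
    group_pos = [tick.get_position()[1] for tick in ax.get_yticklabels()]
    # shade the 5 flanking positions on either side of the motif match
    # (assumes the join included 5 leading and trailing base pairs)
    ax.axhspan(group_pos[0]-0.5, group_pos[4]+0.5, color='lightgray', alpha=0.3)
    ax.axhspan(group_pos[-5]-0.5, group_pos[-1]+0.5, color='lightgray', alpha=0.3)
    # dashed separators between adjacent positions within the motif match
    for i in range(5, len(group_pos)-6):
        ax.axhline(group_pos[i]+0.5, color='gray', linewidth=0.5, linestyle='--')
    # keep the first position at the top of the plot
    ax.set_ylim(group_pos[-1]+0.5, group_pos[0]-0.5)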
    ax.set_ylabel("Position relative to start of motif match")
    # raw string so that "\l" in the mathtext is not parsed as a Python escape
    ax.set_xlabel(r"Predicted activation relative to reference ($\log_2(alt/ref)$)")
    if xlim is not None:
        ax.set_xlim(xlim)
    if n_xticks is not None:
        ax.xaxis.set_major_locator(ticker.MaxNLocator(nbins=n_xticks))
    ax.set_title(f"Predicted effects of mutating {motif_name} ({motif_id}) matches")
    ax.legend(title="Ref. allele")

In [None]:
for _, row in motif_mut_stats.sort_values("max_abs_log2FC", ascending=False).iterrows():
    # stop when we reach a motif with absolute predicted effect less than a threshold
    if row['max_abs_log2FC'] < 0.5:
        break
    # skip motifs with fewer than 9 hits across all cCREs
    if row['num_motif_hits'] < 9:
        continue
    motif_id = row['motif_id']
    mutation_effects = duckdb.sql(
        "select motif_name, chrom, mot_start, mot_end, "
        "allele_pos - mot_start as motif_pos, ref_allele, "
        "last(alt_allele order by abs(log2FC)) as alt_allele, "
        "last(log2FC order by abs(log2FC)) as log2FC "
        "from motif_mut_hits " +
        "where motif_id = ? " +
        "group by motif_name, chrom, mot_start, mot_end, allele_pos, ref_allele "
        "order by chrom, mot_start, allele_pos ",
        params=[motif_id]).df()
    fig, ax = plt.subplots(figsize=(6, 20))
    plot_motif_mutation_effects(mutation_effects, motif_id=motif_id, ax=ax, xlim=(-1.1, 1.1))
    # save plot to file
    fig.savefig(Path('../figs') / f"{row['motif_name']}_{motif_id}-mutation-effects.png",
                bbox_inches='tight', dpi=300)
    plt.show()
    plt.close(fig)
