# Sequence bin lengths vs motif statistics

We compare the effects of two sequence bin window sizes, 300bp and 600bp, using distributions and other statistics of transcription factor (TF) motif matches from a genome-wide scan.

## Setup

### Imports

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import duckdb
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('..')

### Paths for data and databases

In [2]:
PROJECT_ROOT = Path('/hpc/group/igvf')
DB_ROOT = PROJECT_ROOT / 'db'
DATA_ROOT = Path('../igvf-pm')
MOTIFS_DB = DB_ROOT / 'motifs'
STARR_DS = {
    '300bp': 'A549-Dex-w300',
    '600bp': 'A549-Dex-w600',
}
STARR_DB = {k: DB_ROOT / f'{ds}' for k, ds in STARR_DS.items()}

## Map motif matches from genome-wide scan to STARRseq sequence bins

We start with the database of significant matches (currently defined as p≤1e-8, corresponding to a q-value of ~0.05).

In [3]:
motifs_db_signif = duckdb.read_parquet(f'{MOTIFS_DB}-signif.parquet')
print(f"{motifs_db_signif.count('motif_id').fetchone()[0]} significant motif matches")

723249 significant motif matches


Open STARRseq database(s) to evaluate:

In [4]:
data_dbs = {
    k: duckdb.read_parquet(f"{ds}/**/*.parquet", hive_partitioning=True) for k, ds in STARR_DB.items()
}

Open the database(s) of motif matches mapped to STARRseq sequence bins. Where they don't exist yet, create them (this can take a while).

To accept a motif hit for a sequence bin, we require that the motif's match to the genome lie fully within the sequence bin.
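
As a minimal sketch of this containment rule (not the actual implementation of `motif_hits_to_dataset`, which is not shown here), the check below assumes half-open coordinates and reuses the column names `mot_start`, `mot_stop`, `seq_start`, and `seq_end` that appear later in this notebook. The helper name `motif_within_bin` and the coordinates are hypothetical.

```python
def motif_within_bin(mot_start: int, mot_stop: int,
                     seq_start: int, seq_end: int) -> bool:
    """Return True if the motif match lies fully within the sequence bin.

    Assumes half-open intervals [start, stop); this is an assumption for
    illustration, not something the hits database is known to guarantee.
    """
    return seq_start <= mot_start and mot_stop <= seq_end


# A 12bp match inside a 300bp bin is accepted ...
inside = motif_within_bin(1_050, 1_062, 1_000, 1_300)
# ... while a match straddling the end of the bin is rejected.
straddling = motif_within_bin(1_295, 1_307, 1_000, 1_300)
print(inside, straddling)
```

Under this rule a motif match that spans the boundary between two adjacent bins is assigned to neither of them.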

In [5]:
hits_dbs = {}
for k in data_dbs.keys():
    data_db = data_dbs[k]
    hits_db_file = f'{MOTIFS_DB}_{STARR_DS[k]}_hits.parquet'
    if Path(hits_db_file).exists():
        print(f"Opening existing motif hits database for {k} dataset")
    else:
        print(f"Generating motif hits database for {k} dataset")
        from starrutil.dbutil import motif_hits_to_dataset
        hits_dbs[k] = motif_hits_to_dataset(data_db, motifs_db_signif, verbose=True)
        hits_dbs[k].to_parquet(hits_db_file)
    hits_dbs[k] = duckdb.read_parquet(hits_db_file)

Opening existing motif hits database for 300bp dataset
Opening existing motif hits database for 600bp dataset


How many sequence bins with at least one TF motif match do we have?

In [None]:
for k, hits_db in hits_dbs.items():
    print(f"Dataset {k} has {hits_db.count('chrom').fetchall()[0][0]:,} motif hits " +
          f"to {hits_db.unique('chrom, seq_start').count('chrom').fetchall()[0][0]:,} sequence bins " +
          f"(out of {data_dbs[k].count('chrom').fetchall()[0][0]:,} total bins)")

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
fig.subplots_adjust(hspace=0.05)
for i, (k, hits_db) in enumerate(hits_dbs.items()):
    data_db = data_dbs[k]
    starr_sample = duckdb.sql('select chrom, start, log2FC from data_db using sample 1000000').df()
    h = axs[i].hist((starr_sample['log2FC'],
                     hits_db.unique('chrom, seq_start, log2FC').df()['log2FC']),
                    density=True, bins=40, label=('Random sample', 'Bins with motif hit(s)'))
    if i == len(hits_dbs) - 1:
        # raw string avoids invalid escape sequences (\l, \T) in the TeX label
        axs[i].set_xlabel(r'$\log_2(\Theta)$ of sequence bins')
    axs[i].set_ylabel(f'Density ({k})')
    axs[i].legend()
axs[0].set_title(r'Distribution of $\log_2(\Theta)$ values')


### Distribution of activations per motif

First we need to compute the median log2FC statistics for each motif.

In [8]:
hit_stats = {}
for k, hits_db in hits_dbs.items():
    hit_stats[k] = hits_db.unique('chrom, seq_start, log2FC, motif_name')\
        .aggregate('motif_name, median(log2FC) as median_log2FC').df()
hit_stats = pd.concat(hit_stats.values(), keys=hit_stats.keys(), names=['dataset'])

We expect far fewer than all motifs to show elevated activation in any given STARR-seq dataset (i.e., cell line and treatment). We therefore limit ourselves to a reasonably small number of motifs, which for simplicity we call "active". We define a motif as active if its median activation is at or above some threshold in at least one of the datasets.

In [None]:
med_log2FC_thresh = 0.0
# keep motifs reaching the threshold in any dataset; take the larger median per motif
activeTFs = hit_stats[hit_stats['median_log2FC'] >= med_log2FC_thresh]\
    .groupby('motif_name').max(numeric_only=True)
activeTFs = activeTFs.reset_index(drop=False).sort_values('median_log2FC')
with pd.option_context('display.max_rows', None):
    display(activeTFs.reset_index(drop=True))
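
To illustrate the selection rule on toy data: a motif counts as "active" if its median log2FC reaches the threshold in at least one dataset, and it is then ranked by the largest such median. The sketch below uses only the standard library; the motif names and values are made up for illustration and are not taken from the datasets.

```python
from statistics import median

# Made-up log2FC values of bins with a hit, keyed by (dataset, motif name).
toy_hits = {
    ('300bp', 'MOTIF_A'): [0.3, 0.1, -0.2],
    ('600bp', 'MOTIF_A'): [-0.1, 0.0, -0.3],
    ('300bp', 'MOTIF_B'): [-0.4, -0.2, -0.1],
    ('600bp', 'MOTIF_B'): [-0.5, -0.3, 0.2],
}
thresh = 0.0

# Median per (dataset, motif); keep the largest median per motif among
# those entries that reach the threshold.
active = {}
for (dataset, motif), values in toy_hits.items():
    med = median(values)
    if med >= thresh:
        active[motif] = max(med, active.get(motif, med))

print(active)
```

Here only `MOTIF_A` is retained, because its median in the toy 300bp data reaches the threshold even though its 600bp median does not.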

We can now extract those sequence bins whose motif match(es) are among those motifs we consider "active".

In [10]:
activeTFseqs = {}
for k, hits_db in hits_dbs.items():
    activeTFseqs[k] = duckdb.sql(
        'select distinct hdb.motif_name, chrom, seq_start, log2FC '
        'from hits_db hdb join activeTFs atf on (hdb.motif_name = atf.motif_name)').df()
activeTFseqs = pd.concat(activeTFseqs.values(), keys=activeTFseqs.keys(), names=['dataset'])

With this subset we make a boxplot of activations vs. motifs:

In [None]:
fig, ax = plt.subplots(figsize=(6, 10))
sns.boxplot(y='motif_name', x='log2FC', data=activeTFseqs, hue='dataset',
            order=activeTFs['motif_name'], fliersize=2, gap=0.2, ax=ax)
# raw (f-)strings avoid invalid escape sequences (\l, \T, \g) in the TeX labels
ax.set_title(rf'Motifs with median $\log_2(\Theta)\geq {med_log2FC_thresh}$')
ax.set_ylabel('Motif')
ax.set_xlabel(r'$\log_2(\Theta)$ of sequence bins matching motif')
ax.xaxis.grid(True)
ax.axvline(0, color='black', linestyle='dotted')

### Sequence bins with combinations of TF motifs

Here we define a motif combination as the set of motif hits (_num\_hits_ of them) that fall into the same sequence bin. A combination can consist of hits by the same motif or by different motifs.

We consider TF motifs different (_num\_tfs_) if their motif IDs are different. Different motif IDs can still have the same name as a part of their _alt ID_. We report the motif names both in "natural" and in lexical order.

To accept a set of hits as a combination for a sequence bin, we require that the last motif in the bin (by position) start after the first motif in the bin ends. That is, at least the first and the last motif along the sequence bin must not overlap; in a combination of _more_ than two motifs, other pairs (such as the first and the second) may still overlap.
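
The acceptance rule can be sketched in plain Python as follows. It mirrors the `having num_hits > 1 and max(mot_start) > min(mot_stop)` condition of the query in the next cell; the helper name `is_accepted_combination` and the coordinates are hypothetical.

```python
def is_accepted_combination(hits):
    """Check one sequence bin's hits, given as (mot_start, mot_stop) tuples.

    Accepted if there are at least two hits and the latest motif start
    lies beyond the earliest motif end.
    """
    if len(hits) < 2:
        return False
    latest_start = max(start for start, _ in hits)
    earliest_stop = min(stop for _, stop in hits)
    return latest_start > earliest_stop


# Two hits that overlap each other: rejected.
overlapping_pair = [(100, 112), (105, 117)]
# The same two hits plus a third, well-separated one: accepted,
# even though the first and second hit still overlap.
with_third = overlapping_pair + [(200, 212)]

print(is_accepted_combination(overlapping_pair),
      is_accepted_combination(with_third))
```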

In [None]:
TFcomb_hits = {}
for k, hits_db in hits_dbs.items():
    TFcomb_hits[k] = duckdb.sql('select chrom, seq_start, seq_end, '
           'count(*) as num_hits, count(distinct motif_id) as num_tfs, '
           'max(mot_start) - min(mot_start) as max_motif_dist, '
           'group_concat(motif_name, \', \') as motif_names, '
           'group_concat(motif_name, \', \' order by motif_name) as motif_names_o, '
           'first(log2FC) as log2FC, '
           'from hits_db mh '
           'group by chrom, seq_start, seq_end '
           'having num_hits > 1 and max(mot_start) > min(mot_stop) '
           'order by chrom, seq_start').df()
TFcomb_hits = pd.concat(TFcomb_hits.values(), keys=TFcomb_hits.keys(), names=['dataset'])
TFcomb_hits.drop(columns=['motif_names_o'])

#### Some comparative statistics for motif combinations

Comparing the 300bp and 600bp datasets, how many sequence bins with multiple motif hits do we have relative to all sequence bins with a motif hit?

In [12]:
for k, hits_db in hits_dbs.items():
    seqs_with_motif = hits_db.unique('chrom, seq_start').count('chrom').fetchall()[0][0]
    print(f'For dataset {k}, of the {seqs_with_motif:,} sequence bins with a motif hit, '+
          f'{len(TFcomb_hits.loc[k])/seqs_with_motif:.1%} ({len(TFcomb_hits.loc[k]):,}) have more than one.')

For dataset 300bp, of the 571,675 sequence bins with a motif hit, 6.6% (37,664) have more than one.
For dataset 600bp, of the 1,173,736 sequence bins with a motif hit, 10.7% (125,918) have more than one.


As we can see, the vast majority of sequence bins with a motif hit have only one motif match, although the 600bp dataset (as expected) has substantially more bins with multiple motif hits, both in absolute number and as a fraction of all bins with a motif hit.
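
As a quick arithmetic check of these fractions, the snippet below recomputes them from the counts printed above. The counts are copied by hand from that output, not recomputed from the databases.

```python
# Counts copied from the printed output above.
bins_with_hit = {'300bp': 571_675, '600bp': 1_173_736}
bins_with_multiple = {'300bp': 37_664, '600bp': 125_918}

fractions = {k: bins_with_multiple[k] / bins_with_hit[k] for k in bins_with_hit}
for k, frac in fractions.items():
    print(f'{k}: {frac:.1%} of bins with a motif hit have more than one')

# How much larger the multi-hit fraction is for 600bp than for 300bp bins.
ratio = fractions['600bp'] / fractions['300bp']
print(f'600bp vs 300bp fraction: {ratio:.2f}x')
```

Doubling the window size thus increases the multi-hit fraction by roughly 1.6 times, while the absolute number of multi-hit bins more than triples.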

For both datasets, in the vast majority of sequence bins with multiple motif hits the combination consists of hits by only one or two distinct motifs (by motif ID):

In [31]:
TFcomb_hits.groupby(['dataset', 'num_tfs'])['chrom'].count()

dataset  num_tfs
300bp    1           34105
         2            3502
         3              49
         4               3
         5               3
         6               2
600bp    1          108297
         2           17178
         3             349
         4              35
         5              39
         6              20
Name: chrom, dtype: int64

How many motif hits are there per sequence bin? For both datasets, the great majority of motif combinations consist of two or three hits (whether by the same or by different motifs). However, there are many hundreds of combinations consisting of 4, 5, 6, or more hits, with a few having dozens. (We cut off the table below at 10 hits.)

In [32]:
t = TFcomb_hits.groupby(['dataset', 'num_hits'])['chrom'].count()
t.loc[:,2:10]

dataset  num_hits
300bp    2            31567
         3             1396
         4              585
         5              465
         6              610
         7              433
         8              418
         9              394
         10             359
600bp    2           103910
         3             8944
         4             1922
         5             1441
         6             1548
         7             1114
         8             1012
         9              983
         10             873
Name: chrom, dtype: int64
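
To put numbers on "vast majority" and "great majority", the shares can be computed directly from the two tables above. The counts below are copied by hand from those tables; for the hit counts only the rows for two and three hits are needed.

```python
# Bins with a motif combination, by number of distinct motif IDs (num_tfs).
num_tfs_counts = {
    '300bp': {1: 34105, 2: 3502, 3: 49, 4: 3, 5: 3, 6: 2},
    '600bp': {1: 108297, 2: 17178, 3: 349, 4: 35, 5: 39, 6: 20},
}
# Bins with a motif combination that has two or three hits (num_hits).
num_hits_counts = {
    '300bp': {2: 31567, 3: 1396},
    '600bp': {2: 103910, 3: 8944},
}

# Total number of bins with a motif combination per dataset.
totals = {k: sum(v.values()) for k, v in num_tfs_counts.items()}
share_tfs = {k: (v[1] + v[2]) / totals[k] for k, v in num_tfs_counts.items()}
share_hits = {k: sum(v.values()) / totals[k] for k, v in num_hits_counts.items()}

for k in totals:
    print(f'{k}: {share_tfs[k]:.1%} have one or two distinct motifs, '
          f'{share_hits[k]:.1%} have two or three hits')
```

The totals recovered from the `num_tfs` table equal the numbers of multi-hit bins reported earlier (37,664 and 125,918), which confirms that the table covers all combinations.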

### Activations per motif combination

Note that for these statistics we use the lexically ordered motif names making up the combination. That is, when deciding whether two combinations are the same, we ignore both the order of the motifs along the sequence bin and the motifs' IDs.
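
A minimal sketch of such an order-insensitive key, analogous to the `motif_names_o` column built with `group_concat(... order by motif_name)`: sort the names, then join them. The helper name `combination_key` is hypothetical, and the TF names serve only as placeholders.

```python
def combination_key(motif_names):
    """Order-insensitive key: names sorted lexically, then joined.

    Repeated names are kept, so two hits by the same motif remain
    distinguishable from a single hit.
    """
    return ', '.join(sorted(motif_names))


# The same three motifs in two different orders along the bin.
key_1 = combination_key(['NFKB1', 'FOS', 'JUN'])
key_2 = combination_key(['JUN', 'NFKB1', 'FOS'])
# Two hits by the same motif.
key_3 = combination_key(['FOS', 'FOS'])

print(key_1, '|', key_2, '|', key_3)
```

Because the key is built from names only, two motifs with different IDs but the same name collapse into the same key.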

In [None]:
TFcomb_hits_reidx = TFcomb_hits.reset_index(drop=False, level='dataset')
TFcomb_stats = duckdb.sql(
    'select mh.dataset, mh.motif_names_o as motif_names, first(num_hits) as num_motifs, first(num_tfs) as num_tfs, '
    'count(*) as num_seqs, '
    'median(log2FC) as median_log2FC '
    'from TFcomb_hits_reidx mh '
    'group by mh.dataset, mh.motif_names_o ').df()
TFcomb_stats.sort_values('median_log2FC', ascending=False).reset_index(drop=True)

#### Distribution of activations per motif combination

As for individual motifs (or TFs), we expect far fewer than all motif combinations to show elevated activation in any given STARR-seq dataset (i.e., cell line and treatment).

Therefore, as for individual motifs, we limit ourselves to a reasonably small number of motif combinations, which for simplicity we call "active": those with a median activation at or above some threshold. In addition, we require a minimum number of sequence bins with that combination, and we cap the number of motif hits making up the combination.

Recall that here the order of motifs in a combination is ignored for deciding whether two combinations are the same. 

In [None]:
med_log2FC_thresh = 0.1
# per combination, take the maximum of each numeric column across datasets
activeTFcombs = TFcomb_stats[(TFcomb_stats['median_log2FC'] >= med_log2FC_thresh) &
                             (TFcomb_stats['num_seqs'] >= 9) &
                             (TFcomb_stats['num_motifs'] <= 3)].groupby('motif_names').max(numeric_only=True)
activeTFcombs = activeTFcombs.reset_index(drop=False).sort_values('median_log2FC')
with pd.option_context('display.max_rows', None):
    display(activeTFcombs.reset_index(drop=True))

Using the list of "active" motif combinations, we subset the sequence bins to those matching those combinations.

In [28]:
activeTFcomb_hits = TFcomb_hits[TFcomb_hits['motif_names_o'].isin(activeTFcombs['motif_names'])]
activeTFcomb_hits = activeTFcomb_hits.drop(columns=['motif_names']).rename(columns={'motif_names_o': 'motif_names'})

Now we can plot the distribution of activations per motif combination for each combination in the list. (Note that the majority of these motif combinations are motif pairs.)

In [None]:
fig, ax = plt.subplots(figsize=(6, 8))
sns.boxplot(y='motif_names', x='log2FC', data=activeTFcomb_hits, hue='dataset',
            order=activeTFcombs['motif_names'], fliersize=2, ax=ax)
# raw (f-)strings avoid invalid escape sequences (\l, \T, \g) in the TeX labels
ax.set_title(rf'Motif combinations with median $\log_2(\Theta)\geq {med_log2FC_thresh}$')
ax.set_ylabel('Motif combination')
ax.tick_params(axis='y', labelrotation=45)
ax.set_xlabel(r'$\log_2(\Theta)$ of sequence bins matching motif combination')
ax.xaxis.grid(True)
ax.axvline(0, color='black', linestyle='dotted')

### Distribution of distances for motif combinations in sequence bins

Distances are measured between pairs of motifs, as the difference between their start positions. Recall, however, that the distance we calculated for each combination is the maximum distance between any pair in it. For combinations that are pairs, this is the same as the motif pair distance; for combinations of more than two motifs, we do not enumerate the distances between all possible pairs.
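
A small sketch of this distance definition, matching the `max(mot_start) - min(mot_start)` expression used when building `TFcomb_hits`. The function name and start positions are made up for illustration.

```python
def max_motif_dist(mot_starts):
    """Largest distance between motif start positions within one sequence bin."""
    return max(mot_starts) - min(mot_starts)


pair = [120, 180]          # two hits: the pair distance itself
triple = [120, 150, 260]   # three hits: only the outermost pair counts

pair_dist = max_motif_dist(pair)
triple_dist = max_motif_dist(triple)
print(pair_dist, triple_dist)
```

For the triple, the inner distances (30 and 110) are not represented in the histogram; only the span between the first and last start position is.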

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
h = ax.hist((TFcomb_hits.loc['300bp']['max_motif_dist'],
             TFcomb_hits.loc['600bp']['max_motif_dist']),
             bins=60, log=True, label=['300bp', '600bp'])
ax.set_xlabel('Distance')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of maximum distances between motifs on sequence bins')
ax.legend()
ax.grid(True)


### Distribution of activations for motif pairs vs their distance

To get a better sense of whether activations change over the range of distances, we divide the distances into bins and then create a boxplot of activations per distance bin.

In [46]:
dist_bin_w = 20
TFcomb_dists = TFcomb_hits.copy()
TFcomb_dists['dist_category'] = TFcomb_dists['max_motif_dist'] // dist_bin_w * dist_bin_w + dist_bin_w/2
TFcomb_dists['dist_category'] = TFcomb_dists['dist_category'].astype(int)
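
To make the binning concrete: each distance is floored to the lower edge of its 20bp bin and then shifted to the bin's midpoint. The sketch below applies the same arithmetic to a few hand-picked distances; the helper name `dist_category` is hypothetical.

```python
dist_bin_w = 20


def dist_category(max_motif_dist: int) -> int:
    """Midpoint of the dist_bin_w-wide bin that the distance falls into."""
    return int(max_motif_dist // dist_bin_w * dist_bin_w + dist_bin_w / 2)


# Hand-picked example distances mapped to their bin midpoints.
examples = {d: dist_category(d) for d in (0, 19, 20, 45, 299)}
print(examples)
```

Distances 0 through 19 thus share the category 10, distances 20 through 39 the category 30, and so on.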

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.boxplot(x='dist_category', y='log2FC', data=TFcomb_dists, hue='dataset',
            native_scale=True, notch=True, fliersize=2, gap=0.2, ax=ax)
# raw strings avoid invalid escape sequences (\l, \T) in the TeX labels
ax.set_title(r'$\log_2(\Theta)$ of sequence bins with motif combination(s) vs motif distance')
ax.set_xlabel('Maximum distance between motifs')
ax.set_ylabel(r'$\log_2(\Theta)$ of sequence bins')
ax.axhline(0, color='black', linestyle='dotted')