
# Pancancer trans effect overlaps

We want to test whether the number of trans-affected proteins shared across multiple cancer types differs from the number we would expect if we randomly chose the same number of proteins from each cancer type. To do this, we use a permutation test. Each permutation randomly selects, from each cancer type, as many proteins as we found to be significantly affected in that cancer type. It then tallies how many of the selected proteins are shared across each possible number of cancer types. Accumulated over all permutations, these tallies form a null distribution for effect overlap among cancer types.
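The idea can be illustrated on toy data before running it on the real results. The following is only a minimal sketch, not the notebook's implementation: the function name `null_overlap_counts`, the toy protein lists, and the per-cancer selection sizes are all hypothetical, and it uses plain NumPy and `collections.Counter` instead of pandas.

```python
from collections import Counter

import numpy as np


def null_overlap_counts(proteins_by_cancer, n_selected, n_permutations=1000, seed=0):
    """Tally, over all permutations, how many randomly selected proteins
    were drawn in exactly k cancer types (k = 1 .. number of cancer types)."""
    rng = np.random.default_rng(seed)
    tallies = Counter()
    for _ in range(n_permutations):
        hits = Counter()  # protein -> number of cancer types it was drawn in
        for cancer, proteins in proteins_by_cancer.items():
            # Draw without replacement, as many proteins as were "affected"
            chosen = rng.choice(proteins, size=n_selected[cancer], replace=False)
            hits.update(chosen.tolist())
        # Count the proteins at each overlap size for this permutation
        tallies.update(hits.values())
    return {k: tallies.get(k, 0) for k in range(1, len(proteins_by_cancer) + 1)}


# Hypothetical toy input: proteins measured in each cancer type
toy_proteins = {
    "cancer_a": ["P1", "P2", "P3", "P4", "P5"],
    "cancer_b": ["P1", "P2", "P3", "P4", "P6"],
    "cancer_c": ["P2", "P3", "P4", "P5", "P6"],
}
# Hypothetical numbers of significantly affected proteins per cancer type
toy_selected = {"cancer_a": 2, "cancer_b": 2, "cancer_c": 3}

toy_counts = null_overlap_counts(toy_proteins, toy_selected, n_permutations=500)
print(toy_counts)
```

A useful sanity check on any such tally: every draw adds one to some protein's overlap size, so the sum of `k * toy_counts[k]` must equal the number of permutations times the total number of proteins drawn per permutation.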

## Setup

In [1]:
import cnvutils
import pandas as pd
import numpy as np
import os
import altair as alt
import operator

In [2]:
NUM_PERMUTATIONS = 10000

params = cnvutils.load_params(os.path.join("..", "data", "params.json"))
CHROMOSOME = params["CHROMOSOME"]
ARM = params["ARM"]
CIS_OR_TRANS = "trans"

ttest_results_file = os.path.join("..", "data", f"{CHROMOSOME}{ARM}_{CIS_OR_TRANS}_ttest.tsv")

## Load and parse ttest results file

In [3]:
ttest_results = pd.read_csv(ttest_results_file, sep="\t")

## Find the number of proteins that were significantly affected in each cancer type

In [4]:
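# Despite its name, props_different holds counts (not proportions) of
# proteins with adj_p <= 0.05 in each cancer type.
props_different = ttest_results.\
groupby("cancer_type")["adj_p"].\
apply(lambda x: (x <= 0.05).sum())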

props_different

cancer_type
brca       25
colon      22
hnscc      28
lscc       32
luad       42
ovarian    29
Name: adj_p, dtype: int64

## Permutation tests

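In each iteration, for each cancer type, randomly select (without replacement) as many proteins as were significantly affected in that cancer type, drawing from the proteins tested in that cancer type. Then count how many cancer types each selected protein was drawn in, and tally the proteins by that overlap size.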

In [5]:
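# Running tally, across all permutations, of proteins selected in exactly k cancer types
counts = pd.Series(0, index=range(1, props_different.size + 1))

# Seeded generator so that the permutations are reproducible
generator = np.random.RandomState(0)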

for i in range(NUM_PERMUTATIONS):
    perm_summary = ttest_results.\
    groupby("cancer_type")["protein"].\
    apply(lambda x: x.sample(n=props_different[x.name], replace=False, random_state=generator)).\
    to_frame().\
    droplevel(1).\
    reset_index().\
    groupby("protein").\
    agg(**{"cancers": ("cancer_type", lambda x: x.sort_values().drop_duplicates(keep="first").tolist())})

    perm_summary = perm_summary.\
    assign(
        num_cancers=perm_summary["cancers"].apply(len),
    ).\
    groupby("num_cancers")["num_cancers"].\
    count()

    counts = counts.combine(perm_summary, func=operator.add, fill_value=0)

counts

1    195507
2    214846
3    193324
4     99501
5     27961
6      3224
dtype: int64

## Make a bar chart of results

In [6]:
counts = counts.rename("counts").to_frame().reset_index(drop=False)

In [7]:
alt.Chart(counts).mark_bar().encode(
    x=alt.X(
        "index:O",
        axis=alt.Axis(
            labelAngle=0,
            title="Overlap size"
        )
    ),
    y=alt.Y(
        "counts",
        axis=alt.Axis(
            title="Number of occurrences"
        )
    )
)

## Calculate one-sided p values for each overlap size

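We calculate one-sided p values because the null hypothesis is that there is no similarity between cancer types, and therefore no overlap beyond what random selection produces, while the alternative hypothesis is that there is similarity between cancer types, so there is more overlap than chance would give. For each overlap size, the p value is the proportion of the null distribution that falls at that overlap size or larger.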

In [8]:
counts = counts.set_index("index")["counts"]

In [9]:
total = counts.sum()
pvals = pd.Series(np.nan, index=counts.index.copy())

for overlap_size in counts.index:
    # Proportion of the null distribution at this overlap size or larger
    as_or_more_extreme_prop = counts[counts.index >= overlap_size].sum() / total
    pvals[overlap_size] = as_or_more_extreme_prop
    
pvals = pvals.rename("pvals").to_frame()
pvals.index.name = "overlap_size"
pvals

              pvals
overlap_size
1             1.000000
2             0.733773
3             0.441212
4             0.177958
5             0.042465
6             0.004390
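
As a cross-check on the loop above, the same upper-tail proportions can be computed without an explicit loop by taking a cumulative sum from the largest overlap size downward. This is a minimal sketch, assuming the tallies are the ones printed in the permutation output; the names `null_counts` and `tail_props` are introduced here for illustration only.

```python
import pandas as pd

# Null-distribution tallies copied from the permutation output above
null_counts = pd.Series(
    [195507, 214846, 193324, 99501, 27961, 3224],
    index=pd.Index(range(1, 7), name="overlap_size"),
)

# Reverse the series, accumulate, then reverse back: each entry becomes the
# sum of the counts at that overlap size or larger. Dividing by the grand
# total turns those sums into upper-tail proportions.
tail_props = null_counts.iloc[::-1].cumsum().iloc[::-1] / null_counts.sum()

print(tail_props.round(6))
```

Rounded to six decimal places, these values should agree with the table produced by the loop.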


In [10]:
save_path = os.path.join("..", "data", f"{CHROMOSOME}{ARM}_{CIS_OR_TRANS}_overlap_pvals.tsv")
pvals.to_csv(save_path, sep="\t")