# Pancancer trans effect overlaps

We want to test whether the number of trans-affected proteins shared across multiple cancer types differs from what we would expect if the same number of proteins had been chosen at random in each cancer type. To do this, we use a permutation test. Each permutation randomly selects, from each cancer type, as many proteins as we found to be significantly affected in that cancer type. It then counts how many of the selected proteins are shared by one, two, three, or more cancer types. Pooling these counts over all permutations gives a null distribution for effect overlap among cancer types.
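
As a rough illustration of the procedure described above, here is a minimal sketch on made-up data that uses only the standard library. The cancer type names, protein names, and counts are all hypothetical; the analysis in this notebook uses pandas on the real t-test results.

```python
import random
from collections import Counter

# Hypothetical toy data: the proteins measured in each cancer type, and
# how many of them were found to be affected (made-up numbers).
measured = {
    "cancer_a": [f"P{i}" for i in range(50)],
    "cancer_b": [f"P{i}" for i in range(20, 70)],
    "cancer_c": [f"P{i}" for i in range(40, 90)],
}
num_affected = {"cancer_a": 10, "cancer_b": 5, "cancer_c": 8}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
null_counts = Counter()

for _ in range(200):
    # For each cancer type, draw as many proteins as were affected.
    times_selected = Counter()
    for cancer, proteins in measured.items():
        for protein in rng.sample(proteins, num_affected[cancer]):
            times_selected[protein] += 1

    # Tally how many proteins were drawn in 1, 2, or 3 cancer types.
    null_counts.update(times_selected.values())

# Each draw is counted once per cancer type it appears in.
total_draws = sum(size * n for size, n in null_counts.items())
print(dict(sorted(null_counts.items())))
```

Because sampling within a cancer type is done without replacement, a protein can be counted at most once per cancer type in each permutation, so the overlap size never exceeds the number of cancer types.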

## Setup

In [1]:
import pandas as pd
import numpy as np
import os
import altair as alt
import operator

In [2]:
NUM_PERMUTATIONS = 1000

CHROMOSOME = "8"
ARM = "p"
TRANS_OR_CIS = "trans"

ttest_results_file = f"{CHROMOSOME}{ARM}_{TRANS_OR_CIS}effects_ttest.tsv"

## Load and parse ttest results file

In [3]:
ttest_results = pd.\
read_csv(ttest_results_file, sep="\t").\
rename(columns={"Name": "protein"}).\
set_index("protein")

cancer_types = sorted(ttest_results.columns.to_series().str.split("_", n=1, expand=True)[0].unique())

long_results = pd.DataFrame()

for cancer_type in cancer_types:
    cancer_df = ttest_results.\
    loc[:, ttest_results.columns.str.startswith(cancer_type)].\
    dropna(axis="index", how="all").\
    reset_index(drop=False)
    
    # If the cancer type has database IDs, make a separate column that has them.
    # If not, create a column of NaNs (so that the tables all match)
    if f"{cancer_type}_Database_ID" in cancer_df.columns:
        cancer_df = cancer_df.rename(columns={f"{cancer_type}_Database_ID": "Database_ID"})
    else:
        cancer_df = cancer_df.assign(Database_ID=np.nan)
        
    # Rename the pvalue and diff columns to not have the cancer type
    cancer_df = cancer_df.rename(columns={
        f"{cancer_type}_pvalue": "adj_p",
        f"{cancer_type}_diff": "change"
    }).\
    assign(cancer_type=cancer_type)
    
    # Reorder the columns
    cancer_df = cancer_df[["cancer_type", "protein", "Database_ID", "adj_p", "change"]]
    
    # Append to the overall dataframe
    long_results = pd.concat([long_results, cancer_df])

# Drop every row that exactly duplicates another row (keep=False removes all copies, not just the repeats), then reset the index
long_results = long_results[~long_results.duplicated(keep=False)].\
reset_index(drop=True)
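
To make the wide-to-long reshaping above easier to follow, here is a small sketch on a made-up table. The protein names and values are hypothetical, and the `Database_ID` handling is left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table with two cancer types; values are made up.
wide = pd.DataFrame(
    {
        "brca_pvalue": [0.01, np.nan],
        "brca_diff": [0.5, np.nan],
        "luad_pvalue": [0.20, 0.03],
        "luad_diff": [-0.1, 0.8],
    },
    index=pd.Index(["A1BG", "TP53"], name="protein"),
)

frames = []
for cancer_type in ["brca", "luad"]:
    prefix = f"{cancer_type}_"
    frame = (
        wide.loc[:, wide.columns.str.startswith(prefix)]
        .dropna(axis="index", how="all")  # drop proteins not measured here
        .rename(columns={f"{prefix}pvalue": "adj_p", f"{prefix}diff": "change"})
        .assign(cancer_type=cancer_type)
        .reset_index()
    )
    frames.append(frame[["cancer_type", "protein", "adj_p", "change"]])

# One row per (cancer type, protein) pair that has data.
toy_long = pd.concat(frames, ignore_index=True)
print(toy_long)
```

Collecting the per-cancer frames in a list and concatenating once at the end avoids growing a DataFrame inside the loop.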

## Find the number of proteins that were significantly affected in each cancer type

In [4]:
props_different = long_results.\
groupby("cancer_type")["adj_p"].\
apply(lambda x: (x <= 0.05).sum())

props_different

cancer_type
brca          2
colon        40
hnscc         7
lscc       1139
luad         63
ovarian       0
Name: adj_p, dtype: int64

## Permutation tests

In each iteration, for each cancer type, randomly select (without replacement) as many of that cancer type's proteins as were significantly affected in it, then count how many of the selected proteins are shared by each number of cancer types.

In [5]:
counts = pd.Series(0, index=range(1, props_different.size + 1))

generator = np.random.RandomState(0)

for i in range(NUM_PERMUTATIONS):
    perm_summary = long_results.\
    groupby("cancer_type")["protein"].\
    apply(lambda x: x.sample(n=props_different[x.name], replace=False, random_state=generator)).\
    to_frame().\
    droplevel(1).\
    reset_index().\
    groupby("protein").\
    agg(**{"cancers": ("cancer_type", lambda x: x.sort_values().drop_duplicates(keep="first").tolist())})

    perm_summary = perm_summary.\
    assign(
        num_cancers=perm_summary["cancers"].apply(len),
    ).\
    groupby("num_cancers")["num_cancers"].\
    count()

    counts = counts.combine(perm_summary, func=operator.add, fill_value=0)

counts

1    1220252
2      14670
3         44
4          0
5          0
6          0
dtype: int64
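
The running tally in the loop above relies on `Series.combine` with `fill_value=0`, so that overlap sizes missing from a given permutation are treated as zero. A minimal sketch of that accumulation step, using made-up per-permutation tallies:

```python
import operator

import pandas as pd

# Running totals for overlap sizes 1 to 3, starting at zero.
running = pd.Series(0, index=range(1, 4))

# Hypothetical per-permutation tallies; the second has no overlaps of size 2.
tallies = [
    pd.Series({1: 10, 2: 1}),
    pd.Series({1: 12}),
]

for tally in tallies:
    # Index labels missing from either side are filled with 0 before adding.
    running = running.combine(tally, func=operator.add, fill_value=0)

print(running)
```

A plain `running + tally` would instead produce NaN for any overlap size that is absent from `tally`, which is why the fill value matters here.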

## Make a barchart of results

In [6]:
counts = counts.rename("counts").to_frame().reset_index(drop=False)

In [7]:
alt.Chart(counts).mark_bar().encode(
    x=alt.X(
        "index:O",
        axis=alt.Axis(
            labelAngle=0,
            title="Overlap size"
        )
    ),
    y=alt.Y(
        "counts",
        axis=alt.Axis(
            title="Number of occurrences"
        )
    )
)

## Calculate one-sided p values for each overlap size

We compute one-sided p values because the null hypothesis is that cancer types share no more affected proteins than random selection would produce, while the alternative hypothesis is that they share more. For each overlap size, the value below is the proportion of all occurrences in the null distribution that have an overlap at least that large.

In [8]:
counts = counts.set_index("index")["counts"]

In [9]:
total = counts.sum()
pvals = pd.Series(np.nan, index=counts.index.copy())

for overlap_size in counts.index:
    # Proportion of null occurrences with an overlap at least this large
    as_or_more_extreme_prop = counts[counts.index >= overlap_size].sum() / total
    pvals[overlap_size] = as_or_more_extreme_prop
    
pvals = pvals.rename("pvals").to_frame()
pvals.index.name = "overlap_size"
pvals

overlap_size,pvals
1,1.0
2,0.011914
3,3.6e-05
4,0.0
5,0.0
6,0.0
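
As a cross-check on the loop above, the same tail proportions can be computed in one vectorized step with a reversed cumulative sum. This is only a sketch; the counts are copied from the permutation output shown earlier in this notebook, and the names `null_counts` and `tail_props` are introduced here for illustration.

```python
import pandas as pd

# Pooled null counts, copied from the permutation output in this notebook.
null_counts = pd.Series(
    [1220252, 14670, 44, 0, 0, 0],
    index=pd.RangeIndex(1, 7, name="overlap_size"),
)

# Reverse, accumulate, and reverse again to get, for each overlap size,
# the number of occurrences with an overlap at least that large.
tail_counts = null_counts.iloc[::-1].cumsum().iloc[::-1]

# Divide by the total to turn the tail counts into proportions.
tail_props = tail_counts / null_counts.sum()
print(tail_props.round(6))
```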


In [10]:
pvals.to_csv(f"overlap_pvals_{CHROMOSOME}{ARM}_{TRANS_OR_CIS}.tsv", sep="\t")