# Equivalence tests for cis genes for 8q

Here we look for genes located within the event region whose protein abundance is not affected by the arm-level event.

## Setup (import necessary packages)

We start by importing the necessary packages and collecting all of the proteomics data we need to run the tests. The cancer types analyzed should have been determined in 01_event_basic_info, where we identify which cancer types appear to have the event we are looking at.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cptac
from scipy import stats
import cnvutils
import cptac.utils
import statsmodels.stats.power
import statsmodels.stats.weightstats
import statsmodels.stats.multitest

In [2]:
# These variables specify which chromosome and arm we're working on, and whether to do cis or trans
CHROMOSOME = '8'
ARM = 'q'
CIS_OR_TRANS = "cis"

In [3]:
if ARM == "p":
    EVENT_COLUMN = "loss_event"
    EXCLUDE_COLUMN = "gain_event"
    EVENT_START = 0
    EVENT_END = 30794385
    
elif ARM == "q":
    EVENT_COLUMN = "gain_event"
    EXCLUDE_COLUMN = "loss_event"
    EVENT_START = 80794385
    EVENT_END = 130794385

else:
    raise ValueError("Invalid value for ARM variable.")

## Load data

In [4]:
# Load in the cptac data for each cancer type that you want to analyze.

br = cptac.Brca()
# cc = cptac.Ccrcc()
co = cptac.Colon()
# en = cptac.Endometrial()
# gb = cptac.Gbm()
hn = cptac.Hnscc()
ls = cptac.Lscc()
lu = cptac.Luad()
ov = cptac.Ovarian()


In [5]:
# Now we need to get the proteomics tables for each type of cancer to analyze.
proteomics = {
    "brca": br.get_proteomics(tissue_type="tumor"),
    "colon": co.get_proteomics(tissue_type="tumor"),
    "hnscc": hn.get_proteomics(tissue_type="tumor"),
    "lscc": ls.get_proteomics(tissue_type="tumor"),
    "luad": lu.get_proteomics(tissue_type="tumor"),
    "ovarian": ov.get_proteomics(tissue_type="tumor")
}

## Select the proteins we're interested in

If we're looking at cis effects, we select proteins within the event. If we're looking at trans effects, we select proteins outside of the event.

Note that the cnvutils.get_event_genes function uses Ensembl gene IDs for the Database_ID column, while the proteomics dataframes that have a Database_ID column use RefSeq protein IDs. So, when we're selecting the genes we want, we ignore the Database_ID column if it is present, and just use gene names.
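
As a minimal sketch of the name-only selection described above, the toy table below is filtered on the "Name" level of a two-level index while the "Database_ID" level is ignored. The gene names, IDs, and values are hypothetical placeholders, not real CPTAC data.

```python
import pandas as pd

# Hypothetical genes and database IDs, for illustration only
index = pd.MultiIndex.from_tuples(
    [("GENE_A", "ID_1"), ("GENE_B", "ID_2"), ("GENE_C", "ID_3")],
    names=["Name", "Database_ID"],
)
toy = pd.DataFrame({"sample_1": [0.1, -0.4, 0.7]}, index=index)

# Gene names we want to keep (stands in for the selected_genes Series)
selected = pd.Series(["GENE_A", "GENE_C"])

# Match on the "Name" level only; the Database_ID level plays no part
kept = toy[toy.index.isin(selected, level="Name")]
print(kept)
```

Matching on a single level means that differing ID schemes (Ensembl versus RefSeq) cannot cause a gene to be dropped by mistake.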

In [6]:
selected_genes = (
    cnvutils.get_event_genes(
        chrm=CHROMOSOME,
        event_start=EVENT_START,
        event_end=EVENT_END,
        cis_or_trans=CIS_OR_TRANS
    )["Name"]
    .drop_duplicates(keep="first")
)

for cancer_type in proteomics.keys():
    df = proteomics[cancer_type].transpose()
    
    if df.index.nlevels == 1:
        df = df[df.index.isin(selected_genes)]
    else:
        df = df[df.index.isin(selected_genes, level="Name")]

    proteomics[cancer_type] = df

## Append Event Data

We now append the data from the event table that should have been created in a previous notebook.
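
A small sketch of what the inner join in this step does, assuming samples are the row index of both tables. The sample IDs, gene name, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical proteomics table: samples as rows, one protein column
prot = pd.DataFrame({"GENE_A": [0.2, -0.1, 0.5]}, index=["S1", "S2", "S3"])

# Hypothetical event table: only two of the three samples are present
event = pd.DataFrame({"gain_event": [True, False]}, index=["S1", "S3"])

# An inner join keeps only samples found in both tables
joined = prot.join(event, how="inner")
print(joined)
```

Sample S2 is dropped because it has no row in the event table, so every remaining sample has a defined event status.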

In [7]:
has_event = dict()
for cancer_type in proteomics.keys():
    
    df = proteomics[cancer_type]
    df = df.transpose()
    
    event = pd.read_csv(
        f'{cancer_type}_has_event.tsv', 
        sep='\t', 
        index_col=0,
        dtype={EVENT_COLUMN: bool}
    )
    
    if EXCLUDE_COLUMN:
        event.drop(EXCLUDE_COLUMN, axis=1, inplace=True)
        
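    # Note: Index.rename returns a new Index rather than modifying in place,
    # so this call leaves the index name of the event table unchanged
    event.index.rename('Name')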
    df = df.join(other=event, how="inner")
    
    has_event[cancer_type] = df[EVENT_COLUMN]
    proteomics[cancer_type] = df



## Run equivalence tests

To set the lower and upper bounds for our equivalence tests, we use a power calculation to estimate, for each gene, the minimum effect size we could have detected in the first place. The bounds are then set to ±1.25 times that minimum effect size. Note that the power calculation is for Student's t-test, while our TOST equivalence tests use Welch's t-test. This is acceptable because Student's t-test is more powerful than Welch's, so the minimum effect size will be underestimated. That makes the equivalence bounds narrower and the test more conservative, so it does not hurt our accuracy. If it ends up being too strict, we'll alter our approach.
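
The rearrangement behind the bounds can be sketched as follows: starting from sample_size = 16 * (stdev / min_effect)^2, solving for the effect size gives min_effect = 4 * stdev / sqrt(sample_size). The helper name `min_detectable_effect` and the toy numbers are illustrative assumptions. The population standard deviation is used because that is the default for `np.std`.

```python
import math
import statistics

def min_detectable_effect(group_a, group_b):
    """Solve n = 16 * (sd / D)**2 for D, giving D = 4 * sd / sqrt(n).

    sd is the mean of the two group standard deviations and n is the
    mean of the two group sizes.
    """
    sd = statistics.mean([statistics.pstdev(group_a), statistics.pstdev(group_b)])
    n = statistics.mean([len(group_a), len(group_b)])
    return 4 * sd / math.sqrt(n)

# Toy data: each group has a standard deviation of 1 and a size of 4
with_event = [1.0, 3.0, 1.0, 3.0]
without_event = [0.0, 2.0, 0.0, 2.0]

effect = min_detectable_effect(with_event, without_event)
low, upp = -1.25 * effect, 1.25 * effect  # equivalence bounds for the TOST
print(effect, low, upp)
```

Averaging the per-group standard deviations, rather than computing one over the pooled values, keeps a shift in group means from inflating the spread estimate.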

In [8]:
results_df = None
for cancer_type in proteomics.keys():
    prot_df = proteomics[cancer_type]
    
    comparisons = []
    pvals = []
    nulls = []
    
    # Iterate over all columns except the event column
    for prot in prot_df.columns[~(prot_df.columns == EVENT_COLUMN)]:
        
        # Get the data
        in_event = prot_df.loc[prot_df[EVENT_COLUMN], [prot]].iloc[:, 0].dropna()
        not_event = prot_df.loc[~prot_df[EVENT_COLUMN], [prot]].iloc[:, 0].dropna()
        
        # Calculate the minimum effect size, to use for the upper and lower bounds of the TOST.
        # The formula comes from the power calculation for a two-sample Student's t-test in The
        # Analysis of Biological Data by Whitlock and Schluter, 2nd edition (2015), Roberts and
        # Company Publishers, pg. 444. The original formula is
        # sample_size = 16 * (stdev / min_effect) ^ 2, which we solve for min_effect.
        
        # We calculate the standard deviation separately for each group and then average them,
        # because the distributions may have different locations even if their standard deviations
        # are similar
        min_effect = 4 * np.mean([np.std(in_event), np.std(not_event)]) / np.sqrt(np.mean([in_event.size, not_event.size]))

        # Run TOST test
        pval, res_lower, res_upper = statsmodels.stats.weightstats.ttost_ind(
            x1=in_event,
            x2=not_event,
            low=-1.25 * min_effect,
            upp=1.25 * min_effect,
            usevar="unequal"
        )
        
        if pd.notnull(pval):
            comparisons.append(prot)
            pvals.append(pval)
        else:
            nulls.append(prot)
            continue

    # Multiple testing correction
    reject, pvals, alpha_sidak, alpha_bonf = statsmodels.stats.multitest.multipletests(
        pvals=pvals, 
        alpha=0.05, 
        method="fdr_bh"
    )
        
    results = pd.DataFrame({"Comparison": comparisons, "P_Value": pvals})

    results.set_index('Comparison', inplace=True)
    if isinstance(results.index[0], tuple):
        results[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(
            results.index.values.tolist(),
            index=results.index
        )
        results.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        results.index.name='Name'
    results.rename(columns={'P_Value': f'{cancer_type}_pvalue'}, inplace=True)
    if results_df is None:
        results_df = results
    else:
        results_df = results_df.join(results)
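
For readers unfamiliar with the "fdr_bh" correction applied above, the following is a minimal NumPy sketch of the Benjamini-Hochberg adjustment. It illustrates the idea and is not the statsmodels implementation. The function name `benjamini_hochberg` and the example p-values are assumptions.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# Hypothetical raw p-values from four tests
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.2])
print(adjusted)
```

Note that the values stored in the results table are these adjusted p-values, since the raw list is overwritten by the output of the correction.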



## Save results

In [9]:
# This will save the resulting table in the same directory as this notebook.
# Modify if you would like to save to a different location.
results_df.to_csv(f"{CHROMOSOME}{ARM}_{CIS_OR_TRANS}_equiv.tsv", sep='\t')