# Trans-Effects with t-tests for 7q

Here we look for genes outside the event region that are affected by the arm-level event (trans effects). We find these effects by running a series of t-tests that compare the proteomic values of patients with the event against those of patients without it.

## Setup (Import necessary packages)

We start by importing the necessary packages and collecting all of the proteomics data needed to run the tests. The cancer types to analyze should have been determined in 01_event_basic_info, where we identify which cancer types appear to have the event of interest.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cptac
from scipy import stats
import cnvutils
import cptac.utils
import os
import sys

In [2]:
# Load in the cptac data for each cancer type that you want to analyze.

br = cptac.Brca()
cc = cptac.Ccrcc()
co = cptac.Colon()
en = cptac.Endometrial()
gb = cptac.Gbm()
# hn = cptac.Hnscc()
ls = cptac.Lscc()
lu = cptac.Luad()
ov = cptac.Ovarian()

In [3]:
# Now we need to get the proteomics tables for each type of cancer to analyze.
proteomics = {
    "brca": br.get_proteomics(tissue_type="tumor"),
    "ccrcc": cc.get_proteomics(tissue_type="tumor"),
    "colon": co.get_proteomics(tissue_type="tumor"),
    "endometrial": en.get_proteomics(tissue_type="tumor"),
    "gbm": gb.get_proteomics(tissue_type="tumor"),
#     "hnscc": hn.get_proteomics(tissue_type="tumor"),
    "lscc": ls.get_proteomics(tissue_type="tumor"),
    "luad": lu.get_proteomics(tissue_type="tumor"),
    "ovarian": ov.get_proteomics(tissue_type="tumor")
}

## Append gene locations

We now append the location information to the proteomics tables. This will allow us to determine which proteins are in the event. 

In [4]:
locations = cnvutils.get_gene_locations()

In [5]:
# This will append the location data to each table in the proteomics dictionary
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.transpose()
    if not isinstance(df.index, pd.MultiIndex):
        new_df = df.join(locations.droplevel(1))
        new_df = new_df.drop_duplicates()
        new_df = new_df[new_df["chromosome"].notna()]
        proteomics[cancer_type] = new_df
    else:
        new_df = df.join(locations)
        new_df.drop_duplicates(inplace=True)
        proteomics[cancer_type] = new_df.dropna()
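
The join above can be illustrated on toy data. This is a minimal sketch, assuming the location table is indexed by gene name and carries the `chromosome`, `arm`, `start_bp`, and `end_bp` columns used later in this notebook; the coordinates and the `FAKE1` protein are made up for illustration and do not come from `cnvutils.get_gene_locations()`.

```python
import pandas as pd

# Toy proteomics table: rows are samples, columns are proteins (made-up values).
prot = pd.DataFrame(
    {"EGFR": [0.5, -0.2], "TP53": [1.1, 0.3], "FAKE1": [0.0, 0.4]},
    index=["S1", "S2"],
)

# Toy location table indexed by gene name (hypothetical coordinates).
locations = pd.DataFrame(
    {
        "chromosome": ["7", "17"],
        "arm": ["p", "p"],
        "start_bp": [100, 200],
        "end_bp": [150, 260],
    },
    index=pd.Index(["EGFR", "TP53"], name="Name"),
)

# Transpose so proteins are rows, then left-join the location columns.
joined = prot.transpose().join(locations)

# Proteins without a known location (here FAKE1) get NaN and are dropped.
located = joined[joined["chromosome"].notna()]
print(located)
```

Because the join is a left join on the protein index, every protein is kept until the explicit `notna()` filter, which makes it easy to count how many proteins were lost for lack of location data.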

## Remove proteins in event

Before running our t-tests, we need to remove all proteins located in the event region, since we only want to look at trans effects.

In [6]:
# Set the chromosome and arm you want to look at.
# The chromosome number should be a string;
# the arm should be either 'p' or 'q' (lowercase).
CHROMOSOME = '7'
ARM = 'q'

In [7]:
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df[(df.arm != ARM) | (df.chromosome != CHROMOSOME)]
    
    # Now that we've selected the proteins we want, we can drop the
    # location information columns.
    df = df.drop(columns=['chromosome', 'start_bp', 'end_bp', 'arm'])
    proteomics[cancer_type] = df
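
The filter condition above may look inverted at first glance, so here is a small sketch on made-up data showing that keeping rows where the arm differs OR the chromosome differs is the same as dropping rows where both match. The protein names and values are hypothetical.

```python
import pandas as pd

CHROMOSOME = "7"
ARM = "q"

# Toy table: one row per protein with its location (values are illustrative).
df = pd.DataFrame(
    {
        "chromosome": ["7", "7", "8", "1"],
        "arm": ["q", "p", "q", "p"],
        "S1": [0.1, 0.2, 0.3, 0.4],
    },
    index=["A", "B", "C", "D"],
)

# A protein is inside the event only if BOTH chromosome and arm match.
in_event = (df["chromosome"] == CHROMOSOME) & (df["arm"] == ARM)

# Keeping "arm differs OR chromosome differs" is the negation of in_event.
kept_or = df[(df["arm"] != ARM) | (df["chromosome"] != CHROMOSOME)]
kept_not = df[~in_event]

print(kept_or.index.tolist())
```

Only protein `A` (on 7q) is removed; `B` (7p) and `C` (8q) each share one coordinate with the event but are kept.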

## Append Event Data

We now append the data from the event table that should have been created in a previous notebook.

In [8]:
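# Use this variable to indicate whether you want to look at amplification or deletion
amp_or_del = "amplification"

if amp_or_del == "amplification":
    EVENT_COLUMN = "prop_arm_amplified"
    COL_NAME = "arm_amplified"
elif amp_or_del == "deletion":
    EVENT_COLUMN = "prop_arm_deleted"
    COL_NAME = "arm_deleted"
else:
    # Fail early instead of hitting a NameError in a later cell.
    raise ValueError(f"amp_or_del must be 'amplification' or 'deletion', got {amp_or_del!r}")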

In [9]:
has_event = dict()
for cancer_type in proteomics.keys():
    
    # Get the proteomics table
    df = proteomics[cancer_type]
    df = df.transpose()
    
    # Get the event table
    event_file_path = os.path.join("01_event_tables", f"{cancer_type}_cna_summary.tsv.gz")
    
    event = (
        pd.read_csv(event_file_path, sep='\t', dtype={"chromosome": str})
        .rename(columns={"Patient_ID": "Name"})
        .set_index("Name")
    )
    
    # Get just the info for the chromosome arm we want
    event = event[(event["chromosome"] == CHROMOSOME) & (event["arm"] == ARM)]

    # We say that >= 95% of the arm has to be affected for it to count as an arm level event
    event = event[EVENT_COLUMN] >= 0.95
    event.name = COL_NAME
    
    # If the df has a multilevel column index, handle that.
    if df.columns.nlevels == 2:
        df = cptac.utils.reduce_multiindex(df, tuples=True)
    
    df = df.join(event)
    df = df.dropna(subset=[COL_NAME])
    has_event[cancer_type] = df[COL_NAME]
    proteomics[cancer_type] = df
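
To make the labeling step concrete, the sketch below applies the same 95% threshold to a made-up summary table and joins the resulting boolean column onto a toy proteomics table. The patient IDs, the `GENE1` column, and all values are hypothetical; only the column names mirror those used in this notebook.

```python
import pandas as pd

# Toy summary table: one row per patient and chromosome arm (made-up values).
summary = pd.DataFrame(
    {
        "Name": ["P1", "P2", "P3", "P1"],
        "chromosome": ["7", "7", "7", "8"],
        "arm": ["q", "q", "q", "q"],
        "prop_arm_amplified": [0.99, 0.40, 0.95, 0.10],
    }
).set_index("Name")

# Keep only the arm of interest, then apply the >= 95% threshold.
arm_rows = summary[(summary["chromosome"] == "7") & (summary["arm"] == "q")]
event = (arm_rows["prop_arm_amplified"] >= 0.95).rename("arm_amplified")

# Toy proteomics table; P4 has no row in the summary table.
prot = pd.DataFrame({"GENE1": [1.0, 2.0, 3.0, 4.0]}, index=["P1", "P2", "P3", "P4"])

# Samples without event information get NaN in the label column and are dropped.
labeled = prot.join(event).dropna(subset=["arm_amplified"])
print(labeled)
```

Note that the threshold is inclusive, so a patient with exactly 95% of the arm affected (`P3` here) counts as having the event.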

## Run T-Tests

In [10]:
results_df = None
for cancer_type in proteomics.keys():
    
    prot_df = proteomics[cancer_type]
  
    results = cptac.utils.wrap_ttest(
        df=prot_df, 
        label_column=COL_NAME,
        correction_method="fdr_bh",
        return_all=True
    )
   
    results.set_index('Comparison', inplace=True)
    
    if isinstance(results.index[0], tuple):
        results[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(
            results.index.values.tolist(),
            index=results.index
        )
        results.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        results.index.name='Name'
    
    results.rename(columns={'P_Value': f'{cancer_type}_pvalue'}, inplace=True)
    
    if results_df is None:
        results_df = results
    else:
        results_df = results_df.join(results)
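
For readers who want to see the general technique behind this step, the following is a rough sketch of a per-protein two-sample t-test followed by a Benjamini-Hochberg correction (the procedure that the `fdr_bh` label conventionally denotes). It is not the implementation of `cptac.utils.wrap_ttest`; the helper names `ttest_per_column` and `benjamini_hochberg` and the simulated data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_per_column(df, label_column):
    """Two-sample t-test for each column, split by a boolean label column."""
    labels = df[label_column].astype(bool)
    values = df.drop(columns=[label_column])
    rows = []
    for name in values.columns:
        with_event = values.loc[labels, name].dropna()
        without_event = values.loc[~labels, name].dropna()
        # Skip columns where either group is too small to estimate a variance.
        if len(with_event) < 2 or len(without_event) < 2:
            continue
        _, p_value = stats.ttest_ind(with_event, without_event)
        rows.append((name, p_value))
    return pd.DataFrame(rows, columns=["Comparison", "P_Value"])


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward, then cap at 1.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.minimum(adjusted, 1.0)
    out = np.empty(n)
    out[order] = adjusted
    return out


# Simulated data: 20 samples, 3 proteins, with a strong shift added to P1.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 3)), columns=["P1", "P2", "P3"])
toy["arm_amplified"] = [True] * 10 + [False] * 10
toy.loc[toy["arm_amplified"], "P1"] += 3.0

res = ttest_per_column(toy, "arm_amplified")
res["Adjusted_P"] = benjamini_hochberg(res["P_Value"])
print(res)
```

The correction is applied across all proteins within one cancer type, which matches how the loop above calls the test once per cancer type.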

## Append Difference Data

We want to collect one more piece of information: the difference between the averages of the two groups. This tells us whether there is a positive or negative correlation between the event and the proteomic data, and it gives us some indication of the strength of the correlation.

In [11]:
def get_diff(col, event):
    # Mean of the samples with the event minus the mean of those without it.
    event = event.astype(bool)
    event_avg = col[event].mean()
    no_event_avg = col[~event].mean()
    return event_avg - no_event_avg

In [12]:
diff_df = None
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.drop(COL_NAME, axis=1)
    results = df.apply(lambda x: get_diff(x, has_event[cancer_type]))
    df = pd.DataFrame(results)
    if isinstance(df.index[0], tuple):
        df[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(df.index.values.tolist(), index=df.index)
        df.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        df.index.name='Name'
    df.rename(columns={0: f'{cancer_type}_diff'}, inplace=True)
    if diff_df is None:
        diff_df = df
    else:
        diff_df = diff_df.join(df)
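
As a quick check of the sign convention, the sketch below computes the same event-minus-no-event difference for all proteins at once on a made-up table. The sample names, gene names, and values are hypothetical.

```python
import pandas as pd

# Toy table: two proteins plus the boolean event label (made-up values).
df = pd.DataFrame(
    {
        "GENE1": [2.0, 4.0, 1.0, 1.0],
        "GENE2": [0.0, 0.0, 3.0, 5.0],
        "arm_amplified": [True, True, False, False],
    },
    index=["S1", "S2", "S3", "S4"],
)

mask = df["arm_amplified"].astype(bool)
values = df.drop(columns="arm_amplified")

# Column-wise mean of the event group minus mean of the no-event group.
diff = values[mask].mean() - values[~mask].mean()
print(diff)
```

A positive value (`GENE1`) means the protein is more abundant in samples with the event; a negative value (`GENE2`) means it is less abundant.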

## Join the tables and save

We now join the difference table and the results table together. We also save the table to a tsv for use in future analyses. 

In [13]:
results_df = results_df.join(diff_df)

In [14]:
# This will save the resulting table in the same directory as this notebook.
# Modify if you would like to save to a different location.
results_df.to_csv(f"{CHROMOSOME}{ARM}_transeffects_ttest.tsv", sep='\t')

## Explore

Next we want to understand what our results mean. There are many ways to find significant proteins and patterns to examine. Some examples of what you might do include:

 * Print the rows where all cancer types have a significant p-value
 * Print the rows where a given portion of the cancer types have a significant p-value
 * Find the proteins that appear in the top left and right corners of the volcano plots. Where do these proteins fall on other graphs? Learn a little about these
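
The first two suggestions in the list above can be sketched as follows. This example uses a made-up results table that follows the `<cancer_type>_pvalue` column naming from this notebook; the 0.05 cutoff (`ALPHA`) and the 50% threshold (`MIN_FRACTION`) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy results table (made-up p-values); None marks a protein not measured.
results = pd.DataFrame(
    {
        "brca_pvalue": [0.01, 0.20, 0.03],
        "luad_pvalue": [0.04, 0.01, 0.50],
        "gbm_pvalue": [0.02, 0.03, None],
    },
    index=pd.Index(["GENE1", "GENE2", "GENE3"], name="Name"),
)

ALPHA = 0.05        # assumed significance cutoff
MIN_FRACTION = 0.5  # assumed minimum share of cancer types

pval_cols = [c for c in results.columns if c.endswith("_pvalue")]

# Missing p-values compare as False, so they never count as significant.
significant = results[pval_cols] < ALPHA

# Rows significant in every cancer type.
all_significant = results[significant.all(axis=1)]

# Rows significant in at least MIN_FRACTION of the cancer types.
mostly_significant = results[significant.mean(axis=1) >= MIN_FRACTION]

print(all_significant.index.tolist())
print(mostly_significant.index.tolist())
```

Treating a missing p-value as not significant is a conservative choice; dividing by the number of cancer types in which the protein was actually measured would be a reasonable alternative.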