# Cis-Effects with t-tests for (your chromosome arm here!)

Here we look for genes located within the arm-level event whose protein abundance is affected by that event (cis-effects). We find these effects by running a series of t-tests that compare the proteomic values of patients with the event against those of patients without it.
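
As a rough illustration of the comparison made for each protein, the sketch below runs a two-sample t-test with `scipy.stats.ttest_ind` on invented values. The numbers and variable names are hypothetical and are not CPTAC data; the notebook itself delegates the testing to `cptac.utils.wrap_ttest` further down.

```python
import numpy as np
from scipy import stats

# Hypothetical abundance values for a single protein (not real CPTAC data).
with_event = np.array([-0.9, -1.1, -0.7, -1.3, -0.8, -1.0])
without_event = np.array([0.1, 0.3, -0.2, 0.4, 0.0, 0.2])

# Two-sample t-test comparing the two patient groups.
t_stat, p_value = stats.ttest_ind(with_event, without_event)

# The sign of the difference in means gives the direction of the effect.
mean_diff = with_event.mean() - without_event.mean()
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, mean difference = {mean_diff:.2f}")
```

A negative mean difference here would indicate lower abundance in patients with the event, which is the pattern one might expect for a protein encoded on a lost arm.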

## Setup (Import necessary packages)

We start by importing the necessary packages and loading the proteomics data needed to run the tests. The cancer types to analyze should already have been chosen in 01_event_basic_info, which identifies the cancer types that appear to have the event of interest.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cptac
from scipy import stats
import cnvutils
import cptac.utils



In [2]:
# Load in the cptac data for each cancer type that you want to analyze.
# Just uncomment the lines for the cancer types you want
# br = cptac.Brca()
# cc = cptac.Ccrcc()
# co = cptac.Colon()
# en = cptac.Endometrial()
# gb = cptac.Gbm()
hn = cptac.Hnscc()
ls = cptac.Lscc()
# lu = cptac.Luad()
ov = cptac.Ovarian()


In [3]:
# Now we need to get the proteomics tables for each type of cancer to analyze.
proteomics = {
    # Uncomment the entries that match the cancer types loaded above
    # (here: hnscc, lscc, and ovarian).

#     "brca": br.get_proteomics(tissue_type="tumor"),
#     "colon": co.get_proteomics(tissue_type="tumor"),
    "hnscc": hn.get_proteomics(tissue_type="tumor"),
    "lscc": ls.get_proteomics(tissue_type="tumor"),
#     "luad": lu.get_proteomics(tissue_type="tumor"),
    "ovarian": ov.get_proteomics(tissue_type="tumor")
}

## Append gene locations

We now append the location information to the proteomics tables. This will allow us to determine which proteins are in the event. 

In [4]:
locations = cnvutils.get_gene_locations()

In [5]:
# This will append the location data to each table in the proteomics dictionary
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.transpose()
    if not isinstance(df.index, pd.MultiIndex):
        new_df = df.join(locations.droplevel(1))
        new_df = new_df.drop_duplicates()
        new_df = new_df[new_df["chromosome"].notna()]
        proteomics[cancer_type] = new_df
    else:
        new_df = df.join(locations)
        new_df.drop_duplicates(inplace=True)
        proteomics[cancer_type] = new_df.dropna()
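
To make the join above easier to picture, here is a minimal sketch on toy data. The gene names, sample IDs, and the `toy_locations` table are all hypothetical stand-ins; the real table returned by `cnvutils.get_gene_locations()` has more columns and may carry a second index level.

```python
import pandas as pd

# Hypothetical proteomics table: rows are samples, columns are genes.
prot = pd.DataFrame(
    {"GENE_A": [0.1, -0.2], "GENE_B": [0.5, 0.4], "GENE_C": [-1.0, -0.8]},
    index=["S1", "S2"],
)
prot.columns.name = "Name"

# Hypothetical location table indexed by gene name.
toy_locations = pd.DataFrame(
    {"chromosome": ["5", "8"], "arm": ["q", "p"]},
    index=pd.Index(["GENE_A", "GENE_B"], name="Name"),
)

# After transposing, genes are rows, so the join aligns on gene name.
annotated = prot.transpose().join(toy_locations)

# GENE_C has no known location, so it is dropped.
annotated = annotated[annotated["chromosome"].notna()]
print(annotated)
```

Transposing first is what lets the gene-level location columns sit alongside the per-sample values for each protein.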

## Remove proteins not in event

Before running our t-tests, we need to remove every protein that is not located in the event we are looking at, because we only want to examine cis-effects.

In [6]:
# Place here which chromosome and arm you want to look at
# the chromosome number should be a string, 
# the arm should be either p or q (lower case)
CHROMOSOME = '5' # Example: '8'
ARM = 'q' # Example: 'p'

In [7]:
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df[(df.arm == ARM) & (df.chromosome == CHROMOSOME)]
    # Now that we've selected the proteins we want, we can drop the
    # location information columns.
    df = df.drop(columns=['chromosome', 'start_bp', 'end_bp', 'arm'])
    proteomics[cancer_type] = df

## Append Event Data

We now append the data from the event table that should have been created in a previous notebook.

In [8]:
# set this variable to the column that represents the event we are looking at
EVENT_COLUMN = "loss_event" # Example: "loss_event"
# If there are more than 2 columns in the dataframes you will need to drop the columns you
# will not be using. If you don't need to drop any columns, leave this as None.
EXCLUDE_COLUMNS = "gain_event" # Example: "gain_event"

In [9]:
has_event = dict()
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.transpose()
    event = pd.read_csv(f'{cancer_type}_has_event.tsv', sep='\t', index_col=0)
    if EXCLUDE_COLUMNS:
        event.drop(EXCLUDE_COLUMNS, axis=1, inplace=True)
    # Index.rename returns a new index rather than renaming in place, so assign it back.
    event.index = event.index.rename('Name')
    df = df.join(event)
    df = df.dropna(subset=[EVENT_COLUMN])
    has_event[cancer_type] = df[EVENT_COLUMN]
    proteomics[cancer_type] = df
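
The effect of this step can be sketched on toy data as follows. The sample IDs, gene names, and event values are invented for illustration; only the column names `loss_event` and `gain_event` are taken from the examples above.

```python
import pandas as pd

# Hypothetical samples-by-genes table (after transposing back).
prot = pd.DataFrame(
    {"GENE_A": [0.1, -0.2, 0.3], "GENE_B": [0.5, 0.4, 0.6]},
    index=pd.Index(["S1", "S2", "S3"], name="Patient_ID"),
)

# Hypothetical event table; S3 was never scored for the event.
event = pd.DataFrame(
    {"loss_event": [True, False], "gain_event": [False, False]},
    index=pd.Index(["S1", "S2"], name="Patient_ID"),
)

# Drop the column we are not testing, then attach the event label.
event = event.drop(columns="gain_event")
joined = prot.join(event)

# Samples without an event label cannot be assigned to either group.
joined = joined.dropna(subset=["loss_event"])
print(joined)
```

Dropping unlabeled samples before testing keeps the two comparison groups well defined.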



## Run T-Tests

In [10]:
results_df = None
for cancer_type in proteomics.keys():
    prot_df = proteomics[cancer_type]
    results = cptac.utils.wrap_ttest(
        df=prot_df, 
        label_column=EVENT_COLUMN,
        correction_method="fdr_bh",
        return_all=True
    )
    results.set_index('Comparison', inplace=True)
    if isinstance(results.index[0], tuple):
        results[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(
            results.index.values.tolist(),
            index=results.index
        )
        results.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        results.index.name='Name'
    results.rename(columns={'P_Value': f'{cancer_type}_pvalue'}, inplace=True)
    if results_df is None:
        results_df = results
    else:
        results_df = results_df.join(results)
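
The `correction_method="fdr_bh"` argument above names the Benjamini-Hochberg false discovery rate procedure. How `wrap_ttest` applies it internally is not shown in this notebook, so the following is only a sketch of the procedure itself, written with NumPy; the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

# Hypothetical raw p-values for five proteins.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)
print(adj)
```

Because many proteins are tested at once, some correction of this kind is needed before treating an individual p-value as significant.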

## Append Difference Data

We want to collect one more piece of information: the difference between the average values of the two groups. The sign of this difference tells us whether the correlation between the event and the proteomic data is positive or negative, and its magnitude gives some indication of the strength of the correlation.

In [11]:
def get_diff(col, event):
    """Return the mean of col for samples with the event minus the mean for samples without it."""
    mask = event.astype(bool)
    event_avg = col[mask].mean()
    no_event_avg = col[~mask].mean()
    return event_avg - no_event_avg

In [12]:
diff_df = None
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.drop(EVENT_COLUMN, axis=1)
    results = df.apply(lambda x: get_diff(x, has_event[cancer_type]))
    df = pd.DataFrame(results)
    if isinstance(df.index[0], tuple):
        df[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(df.index.values.tolist(), index=df.index)
        df.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        df.index.name='Name'
    df.rename(columns={0: f'{cancer_type}_diff'}, inplace=True)
    if diff_df is None:
        diff_df = df
    else:
        diff_df = diff_df.join(df)
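
For comparison, the same difference of group means can be computed for every protein at once with boolean masks. This is a sketch on invented data, not a replacement for the cells above; the table and the `has_event_toy` series are hypothetical.

```python
import pandas as pd

# Hypothetical samples-by-genes table.
prot = pd.DataFrame(
    {"GENE_A": [-1.0, -0.8, 0.2, 0.0], "GENE_B": [0.5, 0.7, 0.4, 0.8]},
    index=["S1", "S2", "S3", "S4"],
)

# Hypothetical event labels aligned to the same samples.
has_event_toy = pd.Series([True, True, False, False], index=prot.index)

# Column-wise mean of each group, then event minus no-event.
diff = prot[has_event_toy].mean() - prot[~has_event_toy].mean()
print(diff)
```

Here GENE_A is lower in the event group while GENE_B shows essentially no difference, which is the kind of contrast the `_diff` columns are meant to expose.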

## Join the tables and save

We now join the difference table and the results table together. We also save the table to a tsv for use in future analyses. 

In [13]:
results_df = results_df.join(diff_df)

In [14]:
# This will save the resulting table in the same directory as this notebook.
# Modify if you would like to save to a different location.
results_df.to_csv(f"{CHROMOSOME}{ARM}_ciseffects_ttest.tsv", sep='\t')

In [15]:
results_df

| Name | lscc_Database_ID | ovarian_Database_ID | hnscc_pvalue | lscc_pvalue | ovarian_pvalue | hnscc_diff | lscc_diff | ovarian_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ABLIM3 | NP_001287947.1\|NP_001287944.1 |  | 0.031005 | 0.862478 |  | -0.279494 | -0.044613 |  |
| ACTBL2 | NP_001017992.1 | NP_001017992 | 0.607895 | 0.265713 | 0.901363 | -0.045491 | -0.216464 | 0.030015 |
| AFAP1L1 | NP_689619.1\|NP_001309991.1\|NP_001139809.1\|NP_001309992.1 |  | 0.069555 | 0.317960 |  | -0.239139 | -0.244924 |  |
| AFF4 | NP_055238.1 | NP_055238 | 0.608087 | 0.667100 | 0.766233 | -0.052258 | -0.094496 | -0.089095 |
| AGGF1 | NP_060516.2 |  | 0.722615 | 0.668629 |  | -0.104129 | -0.081199 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| YIPF5 | NP_001020118.1\|NP_001258661.1 | NP_110426 | 0.028127 | 0.001155 | 0.154862 | -0.180555 | -0.632568 | -0.193072 |
| YTHDC2 | NP_073739.3\|NP_001332904.1\|NP_001332905.1 | NP_073739 | 0.093131 | 0.003844 | 0.109846 | -0.127605 | -0.439153 | -0.255130 |
| ZCCHC9 | NP_001124507.1 |  | 0.990265 | 0.590027 |  | 0.004245 | -0.150852 |  |
| ZFYVE16 | NP_001098721.1\|NP_001271166.1 | NP_001098721 | 0.000550 | 0.000008 | 0.528641 | -0.183690 | -0.559883 | -0.084204 |
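
One way to start reading a table like this is to filter rows by how many cancer types show a significant p-value. The sketch below uses a small invented table with the same column naming pattern; the values, the `ALPHA` threshold, and the `MIN_FRACTION` cutoff are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical results with the same column naming as results_df (values invented).
toy = pd.DataFrame(
    {
        "hnscc_pvalue": [0.001, 0.20, 0.03],
        "lscc_pvalue": [0.004, 0.01, None],
        "hnscc_diff": [-0.5, -0.1, -0.3],
        "lscc_diff": [-0.6, -0.4, None],
    },
    index=pd.Index(["GENE_A", "GENE_B", "GENE_C"], name="Name"),
)

ALPHA = 0.05         # assumed significance threshold
MIN_FRACTION = 0.5   # assumed minimum share of cancer types

# Select only the p-value columns; missing values compare as not significant.
pvals = toy.filter(regex="_pvalue$")
significant = pvals < ALPHA

# Rows significant in every cancer type.
all_sig = toy[significant.all(axis=1)]

# Rows significant in at least MIN_FRACTION of the cancer types.
frac_sig = toy[significant.mean(axis=1) >= MIN_FRACTION]
print(all_sig.index.tolist(), frac_sig.index.tolist())
```

Note that this treats a protein that was not measured in a cancer type as not significant there, which may or may not be the behavior you want.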


## Explore

Next we want to understand what our results mean. There are many ways to find significant proteins and patterns worth a closer look. Some examples of what you might do include:

 * Print the rows where all cancer types have a significant p-value
 * Print the rows where a given portion of the cancer types have a significant p-value
 * Find the proteins that appear in the top left and right corners of the volcano plots. Where do these proteins fall on other graphs? Learn a little about these 