# Cis-Effects with t tests for (your chromosome arm here!)

Here we look for genes located within the arm-level event whose protein abundance is affected by that event (cis effects). We find these effects by running a t-test for each protein, comparing the proteomic values of the patients with the event against those of the patients without it.
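
As a minimal sketch of the idea for a single protein, the snippet below compares two groups of patients with `scipy.stats.ttest_ind`. The abundance values, patient labels, and event flags are invented for illustration; the notebook itself runs the tests through `cptac.utils.wrap_ttest`.

```python
import pandas as pd
from scipy import stats

# Hypothetical abundances for one protein across eight patients.
abundance = pd.Series(
    [1.2, 0.9, 1.4, 1.1, -0.3, 0.1, -0.2, 0.0],
    index=[f"P{i}" for i in range(8)],
)
# True where the patient carries the arm-level event (made-up flags).
has_event = pd.Series([True] * 4 + [False] * 4, index=abundance.index)

with_event = abundance[has_event]
without_event = abundance[~has_event]

# Two-sided, two-sample t-test (scipy's default assumes equal variances).
t_stat, p_value = stats.ttest_ind(with_event, without_event)

# The sign of the mean difference gives the direction of the effect.
mean_diff = with_event.mean() - without_event.mean()
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, mean difference = {mean_diff:.3f}")
```

A small p-value together with a positive mean difference would suggest that the protein is more abundant in patients with the event.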

## Setup (Import necessary packages)

We will start by importing the necessary packages and collecting all of the proteomics data we will need to run the tests. The cancer types analyzed should have been determined in 01_event_basic_info, where we identify which types of cancer appear to have the event we are looking at.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import cptac
from scipy import stats
import cnvutils
import cptac.utils



In [2]:
# Load in the cptac data for each cancer type that you want to analyze.
# Just uncomment the lines for the cancer types you want
br = cptac.Brca()
# cc = cptac.Ccrcc()
# co = cptac.Colon()
# en = cptac.Endometrial()
gb = cptac.Gbm()
# hn = cptac.Hnscc()
# ls = cptac.Lscc()
# lu = cptac.Luad()
ov = cptac.Ovarian()


In [25]:
# Now we need to get the proteomics tables for each type of cancer to analyze.
proteomics = {
    # One entry per dataset loaded above; uncomment the lines that match your cancer types.

    "brca": br.get_proteomics(tissue_type="tumor"),
#     "colon": co.get_proteomics(tissue_type="tumor"),
#     "hnscc": hn.get_proteomics(tissue_type="tumor"),
#     "lscc": ls.get_proteomics(tissue_type="tumor"),
#     "luad": lu.get_proteomics(tissue_type="tumor"),
    "ovarian": ov.get_proteomics(tissue_type="tumor"),
    "gbm": gb.get_proteomics(tissue_type="tumor")
}

## Append gene locations

We now append the location information to the proteomics tables. This will allow us to determine which proteins are in the event. 

In [4]:
locations = cnvutils.get_gene_locations()

In [26]:
# This will append the location data to each table in the proteomics dictionary
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.transpose()
    if not isinstance(df.index, pd.MultiIndex):
        new_df = df.join(locations.droplevel(1))
        new_df = new_df.drop_duplicates()
        new_df = new_df[new_df["chromosome"].notna()]
        proteomics[cancer_type] = new_df
    else:
        new_df = df.join(locations)
        new_df.drop_duplicates(inplace=True)
        proteomics[cancer_type] = new_df.dropna()
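
To illustrate what the join above does, here is a small sketch on toy data. The gene `FAKE1`, the coordinates, and the layout of `toy_locations` are assumptions made for the example; the real table comes from `cnvutils.get_gene_locations()`.

```python
import pandas as pd

# Toy proteomics table: rows are patients, columns are genes (made-up values).
prot = pd.DataFrame(
    {"AAR2": [1.3, 0.9], "TP53": [-0.2, 0.4], "FAKE1": [0.0, 0.1]},
    index=["patient_1", "patient_2"],
)
prot.columns.name = "Name"

# Toy location table indexed by gene name; column names and coordinates are assumed.
toy_locations = pd.DataFrame(
    {
        "chromosome": ["20", "17"],
        "start_bp": [100, 200],
        "end_bp": [150, 260],
        "arm": ["q", "p"],
    },
    index=pd.Index(["AAR2", "TP53"], name="Name"),
)

# Transpose so genes are rows, then left-join the locations on the gene name.
joined = prot.transpose().join(toy_locations)

# Genes without a known location get NaN in the location columns and are dropped.
joined = joined[joined["chromosome"].notna()]
print(joined)
```

After the join, each gene row carries its patient values plus the location columns, which is what makes the chromosome and arm filter in the next section possible.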

## Remove proteins not in event

Before running our t-tests, we need to remove all proteins that are not in the event we are looking at, because we only want to look at cis effects.

In [27]:
# Set the chromosome and arm you want to look at.
# The chromosome number should be a string.
# The arm should be either 'p' or 'q' (lower case), or None to keep the whole chromosome.
CHROMOSOME = '20' # Example: '8'
ARM = None # Example: 'p'

In [28]:
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
#     print(df)
    if ARM:
        df = df[(df.arm == ARM) & (df.chromosome == CHROMOSOME)]
    else:
        df = df[df.chromosome == CHROMOSOME]
        print(df)
    # Now that we've selected the proteins we want, we can drop the
    # location information columns.
    df = df.drop(columns=['chromosome', 'start_bp', 'end_bp', 'arm'])
    proteomics[cancer_type] = df

                                                            CPT000814  \
Name    Database_ID                                                     
AAR2    NP_001258803.1                                         1.3460   
ABHD12  NP_056415.1|NP_001035937.1                            -3.6002   
ACOT8   NP_005460.2                                           -0.0373   
ACSS1   NP_115890.2|NP_001239604.1|NP_001239605.1|NP_00...    -1.4617   
ACSS2   NP_001070020.2|NP_061147.1|NP_001229322.1             -2.7100   
...                                                               ...   
ZHX3    NP_055850.1                                            0.1995   
ZMYND8  NP_001268704.1|NP_001268702.1|NP_001268705.1|NP...    -0.1586   
ZNF217  NP_006517.1|NP_001307484.1|NP_001307485.1|NP_00...     1.0780   
ZNF512B NP_065764.1                                           -0.1351   
ZNFX1   NP_066363.1                                           -0.1332   

                                                  

## Append Event Data

We now append the data from the event table that should have been created in a previous notebook.

In [29]:
# set this variable to the column that represents the event we are looking at
EVENT_COLUMN = "gain_event" # Example: "loss_event"
# If there are more than 2 columns in the dataframes you will need to drop the columns you
# will not be using. If you don't need to drop any columns, leave this as None.
EXCLUDE_COLUMNS = None # Example: "gain_event"

In [30]:
has_event = dict()
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.transpose()
    print(df)
    event = pd.read_csv(f'{cancer_type}_has_event.tsv', sep='\t', index_col=0)
    print(event)
    if EXCLUDE_COLUMNS:
        event.drop(EXCLUDE_COLUMNS, axis=1, inplace=True)
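    # Note: Index.rename returns a new Index rather than renaming in place,
    # so this line has no effect as written. The join below matches on index
    # values, not on the index name, so the result is unaffected.
    event.index.rename('Name')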
    df = df.join(event)
    df = df.dropna(subset=[EVENT_COLUMN])
    has_event[cancer_type] = df[EVENT_COLUMN]
    proteomics[cancer_type] = df

Name                  AAR2                     ABHD12       ACOT8  \
Database_ID NP_001258803.1 NP_056415.1|NP_001035937.1 NP_005460.2   
CPT000814           1.3460                    -3.6002     -0.0373   
CPT001846           0.9834                    -2.5332      0.2359   
X01BR001           -0.2220                    -2.6346     -0.0427   
X01BR008           -0.1332                    -0.1057     -0.1240   
X01BR009           -0.0901                    -0.2262      1.1349   
...                    ...                        ...         ...   
X21BR001            1.2163                    -1.0043     -1.0407   
X21BR002           -0.5463                     1.2052      0.8402   
X21BR010           -0.3436                    -1.7378     -0.7072   
X22BR005            0.7861                     0.2538      0.2682   
X22BR006            0.3259                    -1.8870     -2.5222   

Name                                                           ACSS1  \
Database_ID NP_115890.2|NP_001



## Run T-Tests

In [31]:
results_df = None
for cancer_type in proteomics.keys():
    prot_df = proteomics[cancer_type]
    results = cptac.utils.wrap_ttest(
        df=prot_df, 
        label_column=EVENT_COLUMN,
        correction_method="fdr_bh",
        return_all=True
    )
    print(results)
    results.set_index('Comparison', inplace=True)
    if isinstance(results.index[0], tuple):
        results[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(
            results.index.values.tolist(),
            index=results.index
        )
        results.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        results.index.name='Name'
    results.rename(columns={'P_Value': f'{cancer_type}_pvalue'}, inplace=True)
    if results_df is None:
        results_df = results
    else:
        results_df = results_df.join(results, how='outer')

                                            Comparison   P_Value
0    (MGME1, NP_443097.1|NP_001297267.1|NP_00129726...  0.000153
1                                (ZCCHC3, NP_149080.2)  0.002360
2    (STAU1, NP_001309858.1|NP_001309860.1|NP_00459...  0.015075
3     (NSFL1C, NP_057227.2|NP_001193665.1|NP_061327.2)  0.016260
4    (CTSA, NP_000299.2|NP_001121167.1|NP_001161066.1)  0.021671
..                                                 ...       ...
206                               (PDRG1, NP_110442.1)  0.981472
207                (NOL4L, NP_001243727.1|NP_542183.2)  0.985358
208                             (ZNF512B, NP_065764.1)  0.987401
209                                (CD93, NP_036204.2)  0.988865
210                               (AP5S1, NP_060817.1)  0.988955

[211 rows x 2 columns]
                 Comparison   P_Value
0       (RPRD1B, NP_067038)  0.000168
1        (PCIF1, NP_071387)  0.000168
2      (ADNP, NP_001269460)  0.000278
3        (APMAP, NP_065392)  0.001034
4     
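
The t-test cell passes `correction_method="fdr_bh"`, the name commonly used for the Benjamini-Hochberg false discovery rate procedure. The sketch below shows that adjustment in plain NumPy, assuming a list of raw p-values with no missing entries. The function name `benjamini_hochberg` and the example p-values are illustrative, and how `wrap_ttest` applies the correction internally is not shown in this notebook.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank (rank starts at 1).
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward, then cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    # Put the adjusted values back in the order of the input.
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.60]  # made-up raw p-values
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Proteins whose adjusted p-value falls below the chosen threshold are the ones that remain significant after accounting for the number of tests run.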

## Append Difference Data

We want to collect one more piece of information about the data: the difference between the averages of the two groups (patients with the event minus patients without it). The sign of this difference tells us whether there is a positive or negative correlation between the event and the proteomic data, and its size gives us some indication of the strength of the correlation.

In [32]:
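def get_diff(col, event):
    """Return mean(with event) - mean(without event) for one protein column."""
    # Boolean-mask the patients that have the event
    with_event = col[event]
    # Invert the mask element-wise to select the patients without the event
    invert_list = [not x for x in event]
    no_event = col[invert_list]
    event_avg = with_event.mean()
    no_event_avg = no_event.mean()
    return event_avg - no_event_avg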

In [33]:
diff_df = None
for cancer_type in proteomics.keys():
    df = proteomics[cancer_type]
    df = df.drop(EVENT_COLUMN, axis=1)
    results = df.apply(lambda x: get_diff(x, has_event[cancer_type]))
    df = pd.DataFrame(results)
    if isinstance(df.index[0], tuple):
        df[['Name', f'{cancer_type}_Database_ID']] = pd.DataFrame(df.index.values.tolist(), index=df.index)
        df.set_index(['Name', f'{cancer_type}_Database_ID'], inplace=True)
    else:
        df.index.name='Name'
    df.rename(columns={0: f'{cancer_type}_diff'}, inplace=True)
    if diff_df is None:
        diff_df = df
    else:
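        # Outer join, matching how results_df is built, so proteins missing
        # from the first cancer type are not dropped.
        diff_df = diff_df.join(df, how='outer')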
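
For comparison, the same per-protein difference can be computed for every column at once with boolean masks. This is a sketch on toy data (the genes, patients, and values are invented), and it assumes the event flags are a boolean Series aligned with the table's index.

```python
import pandas as pd

# Toy table: rows are patients, columns are proteins (made-up values).
toy = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, 0.0, 1.0], "GENE_B": [0.5, 0.5, 1.5, 2.5]},
    index=["p1", "p2", "p3", "p4"],
)
# True where the patient has the event; astype(bool) guards against object dtype.
event = pd.Series([True, True, False, False], index=toy.index).astype(bool)

# Column-wise mean of each group, then the difference (event minus no event).
diff = toy[event].mean() - toy[~event].mean()
print(diff)
```

A positive value means the protein is, on average, more abundant in patients with the event.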

## Join the tables and save

We now join the difference table and the results table together. We also save the table to a tsv for use in future analyses. 

In [34]:
results_df = results_df.join(diff_df)

In [35]:
# This will save the resulting table in the same directory as this notebook.
# Modify if you would like to save to a different location.
# `ARM or ''` keeps the literal text "None" out of the file name when ARM is None.
results_df.to_csv(f"{CHROMOSOME}{ARM or ''}_ciseffects_ttest.tsv", sep='\t')

In [36]:
results_df

Name,brca_Database_ID,ovarian_Database_ID,brca_pvalue,ovarian_pvalue,gbm_pvalue,brca_diff,ovarian_diff,gbm_diff
AAR2,NP_001258803.1,NP_001258803,0.178310,0.121384,0.007219,0.314120,0.204359,0.204341
ABHD12,NP_056415.1|NP_001035937.1,NP_001035937,0.509693,0.237899,0.856229,0.340422,0.143669,-0.020127
ACOT8,NP_005460.2,NP_005460,0.083974,0.176220,0.058966,0.667037,0.234363,0.179592
ACSS1,NP_115890.2|NP_001239604.1|NP_001239605.1|NP_001239606.1,NP_001239604,0.967032,0.436928,0.016857,0.029020,0.157158,0.331331
ACSS2,NP_001070020.2|NP_061147.1|NP_001229322.1,NP_001070020,0.303658,0.856865,0.066273,0.424784,0.029700,0.161025
...,...,...,...,...,...,...,...,...
ZMYND8,NP_001268704.1|NP_001268702.1|NP_001268705.1|NP_001268698.1,NP_001268700,0.498110,0.013172,0.265629,0.165025,0.248100,0.092758
ZMYND8,NP_001268704.1|NP_001268702.1|NP_001268705.1|NP_001268698.1,NP_001268703,0.498110,0.019155,0.265629,0.165025,0.235859,0.092758
ZNF217,NP_006517.1|NP_001307484.1|NP_001307485.1|NP_001288708.1|NP_068735.1|NP_149981.2|NP_001288748.1|NP_001287878.1,NP_006517,0.880403,0.739210,,0.058325,0.051781,
ZNF512B,NP_065764.1,NP_065764,0.987401,0.025644,0.246474,-0.009170,0.358483,0.133493


## Explore

Next we want to understand what our results mean. There are many ways to find significant proteins and patterns worth a closer look. Some examples of what you might do include:

 * Print the rows where all cancer types have a significant p-value
 * Print the rows where a given portion of the cancer types have a significant p-value
 * Find the proteins that appear in the top left and right corners of the volcano plots. Where do these proteins fall on other graphs? Learn a little about these 
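
The first two ideas can be sketched with a few lines of pandas. The table below is a hypothetical stand-in for `results_df` with invented values, and the 0.05 threshold and the "at least two cancer types" cutoff are assumptions you would adjust to your own analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical results table shaped like results_df (all values are made up).
toy_results = pd.DataFrame(
    {
        "brca_pvalue": [0.01, 0.20, 0.03, 0.50],
        "ovarian_pvalue": [0.02, 0.01, 0.04, 0.60],
        "gbm_pvalue": [0.04, 0.30, np.nan, 0.70],
        "brca_diff": [0.8, 0.1, -0.6, 0.0],
    },
    index=pd.Index(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], name="Name"),
)

ALPHA = 0.05  # assumed significance threshold
pvalue_cols = [c for c in toy_results.columns if c.endswith("_pvalue")]

# Missing p-values (NaN) compare as False, so they count as not significant.
significant = toy_results[pvalue_cols] < ALPHA

# Rows significant in every cancer type.
all_sig = toy_results[significant.all(axis=1)]
# Rows significant in at least two cancer types.
at_least_two = toy_results[significant.sum(axis=1) >= 2]

# Volcano plot coordinates for one cancer type: x = difference, y = -log10(p).
volcano = pd.DataFrame(
    {"x": toy_results["brca_diff"], "y": -np.log10(toy_results["brca_pvalue"])}
)
print(all_sig.index.tolist(), at_least_two.index.tolist())
```

Points with a large `y` and a strongly negative or positive `x` are the ones that land in the top left and top right corners of a volcano plot.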