# Use Case 3: Associating Clinical Variables with Acetylation

In this use case, we analyze acetylation data together with a clinical attribute, "histologic_type". Our goal is to identify acetylation sites whose measured intensity differs significantly between endometrioid and serous tumors, and then to compare those sites visually against non-tumor samples.

# Step 1: Import Packages and Load Data

First, we'll import the necessary packages, including the cptac package, and load the Endometrial dataset.

In [None]:
import pandas as pd
import numpy as np
import scipy.stats
import statsmodels.stats.multitest
import matplotlib.pyplot as plt
import seaborn as sns
import math
import cptac
import cptac.utils as ut

en = cptac.Ucec()

# Step 2: Understanding the Acetylproteomic Dataframe

The Endometrial acetylproteomic dataframe has a multiindex. The 'Name' index lists the gene of interest, 'Site' index shows the site of acetylation, and 'Peptide' index shows the peptide sequence where the modification took place. 'Database_ID' differentiates entries with the same gene name. After joining with other dataframes, we typically drop 'Database_ID' for easier data manipulation.

In [None]:
en.get_acetylproteomics('umich')
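
To make the column structure described above concrete, here is a minimal sketch built with plain pandas on made-up data. The gene names, sites, peptides, IDs, and values are all hypothetical; only the four level names ('Name', 'Site', 'Peptide', 'Database_ID') come from the description above, and the real dataframe may differ in detail.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the acetylproteomics table: rows are samples, columns are a
# four-level MultiIndex. All identifiers and values below are made up.
columns = pd.MultiIndex.from_tuples(
    [
        ("GENE1", "K10", "PEPTIDEA", "ID_1"),
        ("GENE1", "K25", "PEPTIDEB", "ID_1"),
        ("GENE2", "K7", "PEPTIDEC", "ID_2"),
    ],
    names=["Name", "Site", "Peptide", "Database_ID"],
)
toy_acetyl = pd.DataFrame(
    [[0.1, np.nan, -0.3], [0.4, 0.2, 0.0]],
    index=pd.Index(["S001", "S002"], name="Patient_ID"),
    columns=columns,
)

# Select every site measured for one gene using the top ('Name') level
gene1_sites = toy_acetyl["GENE1"]

# Drop the 'Database_ID' level once it is no longer needed
without_db_id = toy_acetyl.droplevel("Database_ID", axis=1)
print(list(without_db_id.columns.names))
```

Selecting by the top level returns a smaller dataframe whose columns keep the remaining levels, which is why a gene with several acetylation sites yields several columns.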

# Step 3: Choose Clinical Attribute and Join Dataframes

For this use case, we'll use the 'histologic_type' clinical attribute to identify differences in acetylation sites between "endometrioid" and "serous" cancer cells. We'll join this clinical attribute with our acetylation dataframe using the en.join_metadata_to_omics method.

In [None]:
#Set desired attribute to variable 'clinical_attribute'
clinical_attribute = "histologic_type"

#Join attribute with acetylation dataframe
clinical_and_acetylation = en.join_metadata_to_omics(metadata_name='clinical',
                                                    omics_name='acetylproteomics',
                                                    omics_source='umich',
                                                    metadata_source='mssm',
                                                    metadata_cols=clinical_attribute)
clinical_and_acetylation

Now, we'll drop the 'Peptide' level and flatten the 'Site' level, appending the 'Site' to the column names.

In [None]:
# Use the cptac.utils.reduce_multiindex function to combine the multiple column levels
clinical_and_acetylation = ut.reduce_multiindex(clinical_and_acetylation, levels_to_drop="Peptide")
clinical_and_acetylation = ut.reduce_multiindex(clinical_and_acetylation, flatten=True)

clinical_and_acetylation
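
The effect of dropping a level and flattening the rest can be sketched in plain pandas. This is not the implementation of ut.reduce_multiindex; it is an illustration on made-up data, and it assumes that the joined clinical column carries empty strings in its lower levels, which may not match the real dataframe.

```python
import pandas as pd

# Hypothetical joined table: one clinical column plus two acetylation sites
columns = pd.MultiIndex.from_tuples(
    [
        ("histologic_type", "", ""),
        ("GENE1_umich_acetylproteomics", "K10", "PEPTIDEA"),
        ("GENE2_umich_acetylproteomics", "K7", "PEPTIDEC"),
    ],
    names=["Name", "Site", "Peptide"],
)
toy = pd.DataFrame(
    [["Serous carcinoma", 0.1, -0.3], ["Endometrioid carcinoma", 0.4, 0.0]],
    columns=columns,
)

# Step 1: drop the 'Peptide' level
flat = toy.droplevel("Peptide", axis=1)

# Step 2: join the remaining levels with underscores, skipping empty parts
flat.columns = [
    "_".join(str(part) for part in col if part != "") for col in flat.columns
]
print(list(flat.columns))
```

After flattening, each site is addressed by a single string, which is the form used for names such as 'FOXA2_umich_acetylproteomics_K274' later in this use case.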

# Step 4: Format Dataframe to Compare Acetylproteomic Sites Between Histologic Types

In [None]:
clinical_attribute = "histologic_type"
#Show possible variations of histologic_type
clinical_and_acetylation[clinical_attribute].unique()

In this step, we will make two different dataframes for the "Endometrioid carcinoma" and "Serous carcinoma" histologic types, and fill the NaN values in the histologic_type column with "Non_Tumor".

In [None]:
#Make dataframes with only endometrioid and only serous data in order to compare 
endom = clinical_and_acetylation.loc[clinical_and_acetylation[clinical_attribute] == "Endometrioid carcinoma"]
serous = clinical_and_acetylation.loc[clinical_and_acetylation[clinical_attribute] == "Serous carcinoma"]
#Here is where we set the NaN values to "Non_Tumor"
clinical_and_acetylation[[clinical_attribute]] = clinical_and_acetylation[[clinical_attribute]].fillna(
    value="Non_Tumor")
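
As a small illustration of the labeling step, the sketch below uses a made-up dataframe (the site name and values are hypothetical) to show how missing histologic types become "Non_Tumor" and how the rows can then be grouped by label.

```python
import numpy as np
import pandas as pd

# Made-up data: two tumor types plus samples with no histologic type recorded
toy = pd.DataFrame({
    "histologic_type": ["Endometrioid carcinoma", "Serous carcinoma", np.nan,
                        "Endometrioid carcinoma", np.nan],
    "GENE1_K10": [0.1, 0.5, -0.2, 0.3, 0.0],
})

# Count labels before filling, keeping NaN visible
counts_before = toy["histologic_type"].value_counts(dropna=False)

# Replace missing labels with "Non_Tumor"
toy["histologic_type"] = toy["histologic_type"].fillna("Non_Tumor")

# Build one sub-dataframe per label
groups = {label: sub for label, sub in toy.groupby("histologic_type")}
print(toy["histologic_type"].value_counts())
```

Checking the counts with dropna=False before filling is a quick way to confirm how many samples will receive the "Non_Tumor" label.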

Now that we have our different dataframes, we want to make sure that each site has enough data for a meaningful comparison. Since there are fewer patients with "serous" tumors than with "endometrioid" tumors, we will require each acetylation site to have at least five intensity measurements among the serous patients, and we will remove every site from our dataframe that does not meet this threshold.

In [None]:
#Remove every column that doesn't have at least 5 values among the serous patients
print("Total Sites: ", len(serous.columns) - 1)
sites_to_remove = []
for num in range(1, len(serous.columns)):
    serous_site = serous.columns[num]
    one_site = serous[serous_site]
    num_datapoints_ser = one_site.count()
    if num_datapoints_ser.mean() < 5:
        sites_to_remove.append(serous_site)

clinical_and_acetylation = clinical_and_acetylation.drop(sites_to_remove, axis = 1)

#Also remove non-tumor samples and the other histologic types from the dataframe used for comparison, as we want to compare only endometrioid and serous types
clinical_and_acetylation_comparison = clinical_and_acetylation.loc[clinical_and_acetylation['histologic_type'] != 'Non_Tumor']
clinical_and_acetylation_comparison = clinical_and_acetylation_comparison.loc[clinical_and_acetylation_comparison['histologic_type'] != 'Mixed cell adenocarcinoma']
clinical_and_acetylation_comparison = clinical_and_acetylation_comparison.loc[clinical_and_acetylation_comparison['histologic_type'] != 'Clear cell carcinoma']


print("Removed: ", len(sites_to_remove))
print("Remaining Sites: ", len(clinical_and_acetylation_comparison.columns) - 1)
print("Adjusted p-value cutoff will be: ", .05/(len(clinical_and_acetylation_comparison.columns)-1))
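
The same filter can also be written without an explicit loop. The sketch below is an alternative formulation on made-up data, not a replacement for the cell above; the site names and the MIN_VALUES constant are hypothetical. It also shows how the Bonferroni cutoff follows from the number of remaining sites.

```python
import numpy as np
import pandas as pd

MIN_VALUES = 5  # minimum number of measured values required per site

# Made-up serous-only table with one well-measured and one sparse site
serous_toy = pd.DataFrame({
    "histologic_type": ["Serous carcinoma"] * 6,
    "SITE_A": [0.1, 0.2, np.nan, 0.4, 0.5, 0.6],            # 5 measured values
    "SITE_B": [np.nan, np.nan, 0.3, np.nan, 0.1, np.nan],   # 2 measured values
})

# count() ignores NaN, so this gives the number of measurements per site
site_counts = serous_toy.drop(columns="histologic_type").count()
sites_to_remove = site_counts[site_counts < MIN_VALUES].index.tolist()

remaining = serous_toy.drop(columns=sites_to_remove)
n_tests = remaining.shape[1] - 1          # exclude the label column
bonferroni_cutoff = 0.05 / n_tests
print(sites_to_remove, bonferroni_cutoff)
```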

# Step 5: Compare Endometrioid and Serous Values

We will now call the wrap_ttest function, which loops through the data and compares endometrioid versus serous values for each acetylation site. Each site that differs significantly is added to a dataframe along with its p-value. The default alpha is 0.05, which is adjusted for multiple testing using a Bonferroni correction: alpha is divided by the number of comparisons (the number of comparison columns).

In [None]:
#Make list of all remaining sites in dataframe to pass to wrap_ttest function
columns_to_compare = list(clinical_and_acetylation_comparison.columns)

#Remove the "histologic_type" column (at index 0) from this list
columns_to_compare = columns_to_compare[1:]

clinical_and_acetylation_comparison = clinical_and_acetylation_comparison.loc[:,~clinical_and_acetylation_comparison.columns.duplicated()]
#Perform ttest on each column in dataframe
significant_sites_df = ut.wrap_ttest(df=clinical_and_acetylation_comparison, label_column="histologic_type", comparison_columns=columns_to_compare)

#List significant results
significant_sites_df
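
To show what a per-site comparison with a Bonferroni correction looks like, here is a minimal sketch using scipy on made-up data. It is not the implementation of ut.wrap_ttest; the function name bonferroni_ttests, the output column names, and the toy sites are all hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.stats

def bonferroni_ttests(df, label_column, comparison_columns, alpha=0.05):
    """Two-group t-test per column, keeping p-values at or below alpha / n_tests."""
    labels = df[label_column].dropna().unique()
    if len(labels) != 2:
        raise ValueError("expected exactly two groups in the label column")
    group_a = df[df[label_column] == labels[0]]
    group_b = df[df[label_column] == labels[1]]
    cutoff = alpha / len(comparison_columns)
    rows = []
    for col in comparison_columns:
        # Drop missing values so they do not turn the result into NaN
        stat, p_value = scipy.stats.ttest_ind(group_a[col].dropna(),
                                              group_b[col].dropna())
        if p_value <= cutoff:
            rows.append({"Comparison": col, "P_Value": p_value})
    result = pd.DataFrame(rows, columns=["Comparison", "P_Value"])
    return result.sort_values("P_Value").reset_index(drop=True)

# Made-up data: one site with a clear shift, one with identical groups
toy = pd.DataFrame({
    "histologic_type": ["Endometrioid carcinoma"] * 20 + ["Serous carcinoma"] * 20,
    "SITE_SHIFTED": np.concatenate([np.linspace(-1, 1, 20), np.linspace(4, 6, 20)]),
    "SITE_SAME": np.tile(np.linspace(-1, 1, 20), 2),
})
result = bonferroni_ttests(toy, "histologic_type", ["SITE_SHIFTED", "SITE_SAME"])
print(result)
```

With two comparison columns the cutoff is 0.05 / 2 = 0.025, so only the shifted site is reported.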

# Step 6: Graph Results

Now that we have eight acetylation sites whose intensities differ significantly between endometrioid and serous samples, we will graph a couple of them using a boxplot overlaid with a stripplot in order to see the difference visually, as well as to compare with non-tumor samples.

In [None]:
graphingSite = 'FOXA2_umich_acetylproteomics_K274'
clinical_and_acetylation[graphingSite] = pd.to_numeric(clinical_and_acetylation[graphingSite], errors='coerce')
#Omit missing values so that the t-test does not return NaN
print(scipy.stats.ttest_ind(endom[graphingSite], serous[graphingSite], nan_policy='omit'))
sns.boxplot(x=clinical_attribute, y=graphingSite, data=clinical_and_acetylation, showfliers=False, 
            order=["Non_Tumor", "Endometrioid carcinoma", "Serous carcinoma"])
sns.stripplot(x=clinical_attribute, y=graphingSite, data=clinical_and_acetylation, color='.3', 
              order=["Non_Tumor", "Endometrioid carcinoma", "Serous carcinoma"])
plt.show()

In [None]:
graphingSite = 'TBL1XR1_umich_acetylproteomics_K102'
clinical_and_acetylation[graphingSite] = pd.to_numeric(clinical_and_acetylation[graphingSite], errors='coerce')
#Omit missing values so that the t-test does not return NaN
print(scipy.stats.ttest_ind(endom[graphingSite], serous[graphingSite], nan_policy='omit'))
sns.boxplot(x=clinical_attribute, y=graphingSite, data=clinical_and_acetylation, showfliers = False, 
            order=["Non_Tumor", "Endometrioid carcinoma", "Serous carcinoma"])
sns.stripplot(x=clinical_attribute, y=graphingSite, data=clinical_and_acetylation, color='.3', 
              order=["Non_Tumor", "Endometrioid carcinoma", "Serous carcinoma"])
plt.show()
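
As a numeric companion to the plots, a per-group summary can make the differences easier to read off. The sketch below uses a hypothetical helper, summarize_site, on made-up data; the site name and values are illustrative only.

```python
import numpy as np
import pandas as pd

def summarize_site(df, group_column, site, order):
    """Count, median, and mean of one site per group, in the given group order."""
    summary = df.groupby(group_column)[site].agg(["count", "median", "mean"])
    return summary.reindex(order)

# Made-up data with one missing serous measurement
toy = pd.DataFrame({
    "histologic_type": ["Non_Tumor", "Non_Tumor",
                        "Endometrioid carcinoma", "Endometrioid carcinoma",
                        "Serous carcinoma", "Serous carcinoma"],
    "SITE_A": [0.0, 0.2, 0.5, 0.7, 1.5, np.nan],
})
order = ["Non_Tumor", "Endometrioid carcinoma", "Serous carcinoma"]
summary = summarize_site(toy, "histologic_type", "SITE_A", order)
print(summary)
```

Passing the same order list used for the boxplots keeps the rows of the table aligned with the plot's x-axis.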