## Use Case 3: Comparing BMI above and below 25 across the proteomics data

These are the libraries we will use to work with the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

Importing the CPTAC package loads the data we will be working with.

In [2]:
import CPTAC

Loading Clinical Data...
Loading Proteomics Data...
Loading Transcriptomics Data...
Loading CNA Data...
Loading Phosphoproteomics Data...
Loading Somatic Data...

 ******PLEASE READ******


Use case 3: BMI above and below 25
The first step is to load the clinical dataframe and the dataframe to compare it with, in this case the proteomics dataframe.

In [3]:
clinical = CPTAC.get_clinical()
proteomics = CPTAC.get_proteomics()

Next we will use the compare_clinical() function to create a dataframe that appends a column from the clinical dataframe to our chosen dataframe. Before doing so, we print the clinical column names to see which variables are available.
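As a rough illustration of what appending a clinical column involves, the sketch below joins one column onto a proteomics-style table by sample index using plain pandas. The toy dataframes and the helper name `append_clinical_column` are hypothetical. This is not CPTAC's implementation of compare_clinical(), only an approximation of the shape of the result it prints.

```python
import pandas as pd

# Toy stand-ins for the clinical and proteomics tables (all values are made up).
clinical_toy = pd.DataFrame(
    {"BMI": [22.0, 31.5, 27.1], "Age": [61, 55, 70]},
    index=["S001", "S002", "S003"],
)
proteomics_toy = pd.DataFrame(
    {"A1BG": [-1.01, -0.51, -0.56], "A2M": [-0.81, -1.00, -1.33]},
    index=["S001", "S002", "S003"],
)


def append_clinical_column(clinical_df, omics_df, column):
    """Hypothetical helper: place one clinical column in front of the omics columns."""
    # join() aligns rows on the shared sample index; "inner" keeps only
    # samples present in both tables.
    return clinical_df[[column]].join(omics_df, how="inner")


combined = append_clinical_column(clinical_toy, proteomics_toy, "BMI")
print(combined.columns.tolist())
```

Joining on the index rather than concatenating by position guards against the two tables listing samples in a different order.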

In [4]:
print(clinical.columns)

Index(['Proteomics_Aliquot_ID', 'Proteomics_Participant_ID',
       'Proteomics_TMT_batch', 'Proteomics_TMT_plex', 'Proteomics_TMT_channel',
       'Proteomics_Parent_Sample_IDs', 'Proteomics_Tumor_normal',
       'Proteomics_OCT', 'WXS_patient_id', 'WXS_Tumor_sample_id',
       'WXS_Tumor_file', 'WXS_Tumor_UUID', 'WXS_Tumor_type',
       'WXS_Normal_sample_id', 'WXS_Normal_file', 'WXS_Normal_UUID',
       'WXS_Normal_type', 'RNAseq_sample_id', 'RNAseq_patient_id',
       'RNAseq_sample_type', 'RNAseq_UUID_R1', 'RNAseq_filename_R1',
       'RNAseq_UUID_R2', 'RNAseq_filename_R2', 'methylation_sample_id',
       'Histologic_Grade_(FIGO)', 'Histologic_Type',
       'Num_full_term_pregnancies', 'Tumor_Size_(cm)', 'FIGO_stage',
       'Myometrial_invasion_Specify', 'tumor_Stage_Pathological', 'Diabetes',
       'BMI', 'LVSI', 'Endo_S1G1G2_LVSI', 'Age',
       'CIBERSORT-T_cells_CD4_memory_resting',
       'CIBERSORT-Dendritic_cells_resting',
       'CIBERSORT-T_cells_regulatory_(Tregs)', 'C

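Now we choose the clinical variable to compare against. In this run, trait is set to a CIBERSORT immune-cell estimate, but any column of the clinical dataframe, such as BMI, can be used. compare_clinical() then builds a dataframe (traitProt) with the trait in the first column, followed by the proteomics columns.
For a variable such as BMI, this dataframe can be split into two dataframes at a cutoff of 25. This is done by using dataframe logic to create an array of boolean values, which we can then use to select the respective dataframes.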
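The boolean-array selection described above can be sketched as follows. The toy dataframe and the names `bmi_toy`, `is_high`, `high_bmi`, and `low_bmi` are hypothetical, and treating a BMI of exactly 25 as "high" is an assumption, since the text does not say which side the boundary falls on.

```python
import pandas as pd

# Toy dataframe with the trait in the first column (all values are made up).
bmi_toy = pd.DataFrame(
    {"BMI": [22.0, 31.5, 27.1, 24.9], "A1BG": [-1.01, -0.51, -0.56, 0.12]},
    index=["S001", "S002", "S003", "S004"],
)

# Comparing a column to a scalar yields a boolean Series, one value per row.
# Assumption: BMI == 25 counts as "high".
is_high = bmi_toy["BMI"] >= 25

high_bmi = bmi_toy[is_high]   # rows where the mask is True
low_bmi = bmi_toy[~is_high]   # rows where the mask is False

print(len(high_bmi), len(low_bmi))
```

Note that a missing BMI compares as False, so such rows would land in `low_bmi`; dropping rows with a missing trait value before splitting avoids that.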

In [5]:
trait = 'CIBERSORT-T_cells_CD4_memory_activated'

In [6]:
traitProt = CPTAC.compare_clinical(clinical, proteomics, trait)
print(traitProt)

      CIBERSORT-T_cells_CD4_memory_activated  A1BG   A2M  A2ML1  A4GALT  AAAS  \
idx                                                                             
S001                                0.000000 -1.01 -0.81  -0.28    0.24  0.29   
S002                                0.000000 -0.51 -1.00  -0.99    1.50  0.18   
S003                                0.000000 -0.56 -1.33   0.64     NaN -0.26   
S004                                0.000000 -1.53 -1.19  -0.49    0.26 -0.03   
S005                                0.026120 -0.16  0.09   0.01    0.34  0.51   
S006                                0.000000 -1.03 -0.63  -0.04   -0.25 -0.09   
S007                                0.000000 -1.09 -0.60  -1.11    0.02  0.16   
S008                                0.000000 -0.29  0.51  -0.51     NaN  0.46   
S009                                0.002505 -0.93 -1.28   0.67    0.43 -0.05   
S010                                0.000000 -0.44 -0.87   2.83   -0.32  0.18   
S011                        

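We can now check for genes whose protein abundance is significantly correlated with the chosen trait, using a Spearman rank correlation for each gene. Because we run one test per protein, we first need a stricter significance threshold: we divide 0.05 by the number of columns (a Bonferroni-style correction) and also require an absolute correlation of at least 0.5.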
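To make the effect of dividing the significance level by the number of tests concrete, here is a small arithmetic sketch. The function name `bonferroni_threshold` and the test count of 9000 are hypothetical and chosen only for illustration; the actual count comes from the dataframe in the next cell.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate near alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


alpha = 0.05
n_tests = 9000  # hypothetical number of proteins tested

threshold = bonferroni_threshold(alpha, n_tests)

# Without correction, roughly alpha * n_tests tests would pass by chance alone
# even if no protein were truly associated with the trait.
expected_false_positives = alpha * n_tests

print("Per-test threshold:", threshold)
print("Expected chance hits without correction:", round(expected_false_positives))
```

With thousands of tests, an uncorrected 0.05 cutoff would let hundreds of proteins through by chance, which is why the per-test cutoff has to be so small.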

In [7]:
# Bonferroni-style correction: divide alpha by the number of columns
# (the count includes the trait column, so it is very slightly conservative)
threshold = .05 / len(traitProt.columns)
tscutoff = 0.5  # minimum absolute Spearman correlation to report
print("Threshold:", threshold)
significantTests = []
significantGenes = []
# Column 0 is the trait itself, so start from column 1
for gene in traitProt.columns[1:]:
    oneGene = traitProt[[trait, gene]]
    oneGene = oneGene.dropna(axis=0)  # drop samples missing either value
    spearmanrTest = stats.spearmanr(oneGene[trait], oneGene[gene])
    # spearmanrTest[0] is the correlation, spearmanrTest[1] is the p-value
    if (abs(spearmanrTest[0]) >= tscutoff) and (spearmanrTest[1] <= threshold):
        print(spearmanrTest)
        significantTests.append(spearmanrTest)
        significantGenes.append(gene)
print(len(significantGenes))
print(significantGenes)

Threshold: 5.215939912372209e-06
SpearmanrResult(correlation=0.6387130387155361, pvalue=5.233618175945269e-08)
SpearmanrResult(correlation=0.5465199283392974, pvalue=1.210700814001469e-08)
SpearmanrResult(correlation=0.6741628320340319, pvalue=5.8576065289802025e-08)
SpearmanrResult(correlation=0.5409011227280242, pvalue=6.259332705863731e-09)
SpearmanrResult(correlation=0.5001421230266248, pvalue=1.169269869608808e-07)
SpearmanrResult(correlation=0.5168402005832128, pvalue=1.8368839040494775e-07)
6
['CALHM6', 'CCL5', 'CD3E', 'GBP5', 'LCP2', 'SKAP1']
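
The printed results are easier to scan when each gene sits next to its correlation and p-value. The sketch below runs the same kind of per-gene Spearman test on synthetic data and collects the results in a sorted table. The data, the column names `GENE_A` and `GENE_B`, and the `results` table are all hypothetical and unrelated to the CPTAC values above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 60
trait_vals = rng.random(n)

# Synthetic table: trait in the first column, then two made-up "genes".
toy = pd.DataFrame({
    "trait": trait_vals,
    "GENE_A": trait_vals + rng.normal(0, 0.05, n),  # built to track the trait
    "GENE_B": rng.normal(0, 1, n),                   # unrelated noise
})
toy.loc[0, "GENE_A"] = np.nan  # a missing value, as in real proteomics data

rows = []
for gene in toy.columns[1:]:
    # Keep only samples where both the trait and the gene are measured.
    pair = toy[["trait", gene]].dropna()
    corr, pval = stats.spearmanr(pair["trait"], pair[gene])
    rows.append({"gene": gene, "correlation": corr, "pvalue": pval, "n": len(pair)})

# One row per gene, most significant first.
results = pd.DataFrame(rows).sort_values("pvalue").reset_index(drop=True)
print(results)
```

Recording the number of samples used for each gene also shows how much data each test lost to missing values.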
