## Use Case 3: Comparing Clinical Threshold for Significant Genes

## Step 1: Importing Packages and the Data

First, load the standard packages for working with dataframes, plotting, and statistics.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

Next, load the CPTAC data we will be working with.

In [3]:
import CPTAC.Endometrial as en

Welcome to the CPTAC data service package. This import contains
information about the package. In order to access a specific data set,
import a CPTAC subfolder by either 'import CPTAC.DataName' or 'from
CPTAC import DataName'.
Loading Endometrial CPTAC data:
Loading Dictionary...
Loading Clinical Data...
Loading Proteomics Data...
Loading Transcriptomics Data...
Loading CNA Data...
Loading Phosphoproteomics Data...
Loading Somatic Mutation Data...

 ******PLEASE READ******
CPTAC is a community resource project and data are made available
rapidly after generation for community research use. The embargo
allows exploring and utilizing the data, but the data may not be in a
publication until July 1, 2019. Please see
https://proteomics.cancer.gov/data-portal/about/data-use-agreement or
enter embargo() to open the webpage for more details.


## Step 2: Getting data

Load the clinical dataframe and the dataframe to compare it with, in this case proteomics.

In [4]:
clinical = en.get_clinical()
proteomics = en.get_proteomics()

The columns of the clinical data can be viewed with the <code>clinical.columns</code> command. To view all columns without truncation, first run <code>pd.set_option('display.max_seq_items', None)</code>.
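As a minimal sketch of that display setting, the snippet below builds a toy dataframe with 170 placeholder columns (the names <code>toy</code> and <code>col_0</code> through <code>col_169</code> are illustrative and not part of CPTAC). It uses <code>pd.option_context</code>, which applies the option only inside the <code>with</code> block, so the global setting stays unchanged.

```python
import pandas as pd

# Toy frame with 170 placeholder columns, mimicking the width of the clinical table.
toy = pd.DataFrame(columns=[f"col_{i}" for i in range(170)])

# With default settings, a long index is abbreviated with "...".
truncated_repr = repr(toy.columns)

# Inside the with-block, every column name is shown.
with pd.option_context("display.max_seq_items", None):
    full_repr = repr(toy.columns)

print("..." in truncated_repr, "..." in full_repr)
```

Calling <code>pd.set_option</code> instead changes the setting for the rest of the session, which is usually what you want in a notebook.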

In [5]:
print(clinical.columns)

Index(['Proteomics_Participant_ID', 'Case_excluded', 'Proteomics_TMT_batch',
       'Proteomics_TMT_plex', 'Proteomics_TMT_channel',
       'Proteomics_Parent_Sample_IDs', 'Proteomics_Aliquot_ID',
       'Proteomics_Tumor_Normal', 'Proteomics_OCT', 'Country',
       ...
       'RNAseq_R1_sample_type', 'RNAseq_R1_filename', 'RNAseq_R1_UUID',
       'RNAseq_R2_sample_type', 'RNAseq_R2_filename', 'RNAseq_R2_UUID',
       'miRNAseq_sample_type', 'miRNAseq_UUID', 'Methylation_available',
       'Methylation_quality'],
      dtype='object', length=170)


The trait we will use in this example is the continuous variable for p53 pathway activity.

In [6]:
trait = 'Pathway_activity_p53'

## Step 3: Merge Dataframes

Next, we will use the <code>en.compare_clinical()</code> function to create a dataframe that adds a column from the clinical dataframe to our chosen dataframe.
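To illustrate the shape of the result, here is a rough plain-pandas equivalent on toy data. This is only a sketch of the idea, not the implementation of <code>en.compare_clinical()</code>. The frames <code>clinical_toy</code> and <code>proteomics_toy</code> and the column <code>Other_column</code> are hypothetical stand-ins, and we assume both tables are indexed by sample ID.

```python
import pandas as pd

# Toy stand-ins for the clinical and proteomics tables, indexed by sample ID.
clinical_toy = pd.DataFrame(
    {
        "Pathway_activity_p53": [-0.67, -0.53, 0.43],
        "Other_column": ["x", "y", "z"],  # placeholder for the remaining clinical columns
    },
    index=["S001", "S002", "S003"],
)
proteomics_toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685, -0.528], "A2M": [-0.863, -1.07, -1.32]},
    index=["S001", "S002", "S003"],
)

trait = "Pathway_activity_p53"

# Select the single trait column, then attach every protein column,
# aligned on the shared sample index.
merged_toy = clinical_toy[[trait]].join(proteomics_toy)
print(merged_toy)
```

The trait ends up as the first column and each gene follows it, which is the layout the analysis in Step 4 relies on.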

In [7]:
traitProt = en.compare_clinical(proteomics, trait)
traitProt.head()

idx,Pathway_activity_p53,A1BG,A2M,A2ML1,A4GALT,AAAS,AACS,AADAT,AAED1,AAGAB,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
S001,-0.67,-1.18,-0.863,-0.802,0.222,0.256,0.665,1.28,-0.339,0.412,...,-0.0877,,0.0229,0.109,,-0.332,-0.433,-1.02,-0.123,-0.0859
S002,-0.53,-0.685,-1.07,-0.684,0.984,0.135,0.334,1.3,0.139,1.33,...,-0.0356,,0.363,1.07,0.737,-0.564,-0.00461,-1.13,-0.0757,-0.473
S003,0.43,-0.528,-1.32,0.435,,-0.24,1.04,-0.0213,-0.0479,0.419,...,0.00112,-0.145,0.0105,-0.116,,0.151,-0.074,-0.54,0.32,-0.419
S004,,2.35,2.82,-1.47,,0.154,0.0332,0.513,0.674,0.431,...,-0.538,-0.427,0.0926,1.28,1.08,0.0695,0.303,-0.325,0.236,0.443
S005,0.15,-1.67,-1.19,-0.443,0.243,-0.0993,0.757,0.74,-0.929,0.229,...,0.0725,-0.0552,-0.0714,0.0933,0.156,-0.398,-0.0752,-0.797,-0.0301,-0.467


## Step 4: Statistical Analysis

The next step is more statistically intensive. We are looking for genes whose protein abundance correlates significantly with the chosen clinical attribute. Because we test thousands of genes, we first apply a Bonferroni-style correction for multiple testing: we divide 0.05 (the conventional significance level) by the number of columns in the merged dataframe to get a stricter p-value threshold.
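The arithmetic behind that threshold is short enough to check by hand. In the sketch below, the column count of 11000 is an assumption, chosen because it reproduces the threshold the notebook prints. Note that the notebook divides by every column, including the trait column itself, which makes the threshold very slightly stricter than dividing by the number of genes alone.

```python
alpha = 0.05        # conventional significance level
n_columns = 11000   # assumed column count (trait column plus genes)

# Bonferroni-style correction: share alpha equally across all tests.
threshold = alpha / n_columns
print("Threshold:", threshold)

# Dividing by the genes only (excluding the trait column) is marginally looser.
threshold_genes_only = alpha / (n_columns - 1)
print("Genes only:", threshold_genes_only)
```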

Next, we loop through the genes and test each one with a Spearman rank correlation. We keep only the genes whose absolute correlation is at least 0.5 and whose p-value falls at or below the threshold.
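Before running this on the real data, it can help to see the filter on synthetic values. The sketch below is illustrative only: the arrays <code>related</code> and <code>unrelated</code> and the helper <code>passes</code> are hypothetical names, and the data is random noise rather than CPTAC measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples = 100
trait_values = rng.normal(size=n_samples)

# One synthetic "gene" tracks the trait closely; the other is unrelated noise.
related = trait_values + rng.normal(scale=0.3, size=n_samples)
unrelated = rng.normal(size=n_samples)

threshold = 0.05 / 2  # Bonferroni-style correction over the two toy tests
cutoff = 0.5          # minimum absolute correlation, as in the notebook

def passes(values):
    """Return True when the correlation is both strong and significant."""
    rho, pvalue = stats.spearmanr(trait_values, values)
    return abs(rho) >= cutoff and pvalue <= threshold

print(passes(related), passes(unrelated))
```

Requiring both a minimum correlation and a corrected p-value keeps the list focused on effects that are large as well as statistically significant.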

In [8]:
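# Bonferroni-style threshold: 0.05 divided by the number of columns
threshold = .05 / len(traitProt.columns)
# Minimum absolute Spearman correlation to report
tscutoff = 0.5
print("Threshold:", threshold)
significantTests = []
significantGenes = []
# Column 0 is the trait itself, so start at column 1
for num in range(1,len(traitProt.columns)):
    gene = traitProt.columns[num]
    # Keep only samples that have both a trait value and a measurement for this gene
    oneGene = traitProt[[trait, gene]]
    oneGene = oneGene.dropna(axis=0)
    spearmanrTest = stats.spearmanr(oneGene[trait], oneGene[gene])
    # Keep strong, significant correlations; skip a correlation of exactly 1 or a p-value of exactly 0
    if (abs(spearmanrTest[0]) >= tscutoff) and (spearmanrTest[0] < 1) and (spearmanrTest[1] <= threshold) and (spearmanrTest[1] > 0):
        print(spearmanrTest)
        significantTests.append(spearmanrTest)
        significantGenes.append(gene)
print(len(significantGenes))
print(significantGenes)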

Threshold: 4.5454545454545455e-06
SpearmanrResult(correlation=0.5163908610313054, pvalue=2.2193065109225645e-07)
SpearmanrResult(correlation=-0.5196493254270892, pvalue=6.803872988798687e-08)
SpearmanrResult(correlation=-0.512437880703719, pvalue=1.1047601933785191e-07)
SpearmanrResult(correlation=-0.5310749098407427, pvalue=3.0836308255695726e-08)
SpearmanrResult(correlation=-0.5219427182769559, pvalue=5.8180881366640145e-08)
SpearmanrResult(correlation=-0.5181594847541582, pvalue=7.527435145468716e-08)
SpearmanrResult(correlation=-0.5110854794128336, pvalue=3.092274373023385e-07)
SpearmanrResult(correlation=0.5825587253690367, pvalue=4.414786927952596e-07)
SpearmanrResult(correlation=-0.517066035119974, pvalue=8.104508894418577e-08)
SpearmanrResult(correlation=0.5586352677701082, pvalue=4.032275930670269e-09)
SpearmanrResult(correlation=-0.5160791343030166, pvalue=8.661316256665918e-08)
SpearmanrResult(correlation=0.5299340874050138, pvalue=2.17370887957826e-07)
SpearmanrResult(corre