## Use Case 3: Comparing Clinical Threshold for Significant Genes

## Step 1: Importing Packages and the Data

First, import the standard packages for working with dataframes, plotting, and statistics.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

Next, load the CPTAC endometrial data we will be working with.

In [2]:
import CPTAC.Endometrial as en

Loading Endometrial CPTAC data:
Loading Dictionary...
Loading Clinical Data...
Loading Proteomics Data...
Loading Transcriptomics Data...
Loading CNA Data...
Loading Phosphoproteomics Data...
Loading Somatic Mutation Data...

 ******PLEASE READ******
CPTAC is a community resource project and data are made available
rapidly after generation for community research use. The embargo
allows exploring and utilizing the data, but the data may not be in a
publication until July 1, 2019. Please see
https://proteomics.cancer.gov/data-portal/about/data-use-agreement or
enter embargo() to open the webpage for more details.


## Step 2: Getting data

Load the clinical dataframe and the dataframe to compare it with, in this case proteomics.

In [3]:
clinical = en.get_clinical()
proteomics = en.get_proteomics()

The columns of the clinical data can be viewed with the <code>clinical.columns</code> command. To view all columns without truncation, first run <code>pd.set_option('display.max_seq_items', None)</code>.
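
As a minimal sketch of how that display option behaves, the snippet below uses a made-up index of 170 placeholder names (`col_0` to `col_169`, not the real clinical columns). It lifts the truncation limit, captures the full listing, and then restores the pandas default.

```python
import pandas as pd

# Hypothetical stand-in for clinical.columns: 170 placeholder names.
cols = pd.Index([f"col_{i}" for i in range(170)])

# With the default setting, long sequences are abbreviated with "...".
truncated_repr = repr(cols)

# None removes the limit, so every item is shown.
pd.set_option('display.max_seq_items', None)
full_repr = repr(cols)

# Restore the default so later output is abbreviated again.
pd.reset_option('display.max_seq_items')

print(full_repr)
```

Resetting the option afterwards keeps later cells from printing very long listings by accident.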

In [4]:
print(clinical.columns)

Index(['Proteomics_Participant_ID', 'Case_excluded', 'Proteomics_TMT_batch',
       'Proteomics_TMT_plex', 'Proteomics_TMT_channel',
       'Proteomics_Parent_Sample_IDs', 'Proteomics_Aliquot_ID',
       'Proteomics_Tumor_Normal', 'Proteomics_OCT', 'Country',
       ...
       'RNAseq_R1_sample_type', 'RNAseq_R1_filename', 'RNAseq_R1_UUID',
       'RNAseq_R2_sample_type', 'RNAseq_R2_filename', 'RNAseq_R2_UUID',
       'miRNAseq_sample_type', 'miRNAseq_UUID', 'Methylation_available',
       'Methylation_quality'],
      dtype='object', length=170)


The trait we will use for this example is <code>Pathway_activity_p53</code>, a continuous variable describing p53 pathway activity.

In [5]:
trait = 'Pathway_activity_p53'

## Step 3: Merge Dataframes

Next we will use the <code>en.compare_clinical()</code> function to create a dataframe that joins one column from the clinical dataframe to our chosen dataframe, here proteomics.

In [6]:
traitProt = en.compare_clinical(proteomics, trait)
traitProt.head()

idx   Pathway_activity_p53    A1BG     A2M    A2ML1  A4GALT      AAAS    AACS  \
S001                 -0.67 -1.1800 -0.8630 -0.80200  0.2220  0.256000  0.6650   
S002                 -0.53 -0.6850 -1.0700 -0.68400  0.9840  0.135000  0.3340   
S003                  0.43 -0.5280 -1.3200  0.43500     NaN -0.240000  1.0400   
S004                   NaN  2.3500  2.8200 -1.47000     NaN  0.154000  0.0332   
S005                  0.15 -1.6700 -1.1900 -0.44300  0.2430 -0.099300  0.7570   
S006                 -1.98 -0.3740 -0.0206 -0.53700  0.3110  0.375000  0.0131   
S007                  0.56 -1.0800 -0.7080 -0.12600 -0.4260 -0.114000 -0.1110   
S008                 -1.05 -1.3200 -0.7080 -0.80800 -0.0709  0.138000  0.6560   
S009                 -1.60 -0.4670  0.3700 -0.33900     NaN  0.434000  0.0358   
S010                 -1.39 -1.1200 -1.3100  0.91200  0.4180 -0.076800  0.8460   
S011                  0.95 -0.7160 -0.8850  2.82000 -0.3430  0.147000  0.4450   
S012                  0.47 -
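
To illustrate the shape of the merged result, here is a rough pandas equivalent on toy data. It is only a sketch: it assumes the merge aligns rows on the shared sample index and places the trait column first, as in the output above. The internals of `en.compare_clinical()` may differ, and the `Country` values and the names `clinical_toy` and `proteomics_toy` are made up for illustration.

```python
import pandas as pd

samples = ["S001", "S002", "S003"]

# Toy clinical table: the trait of interest plus one unrelated column.
clinical_toy = pd.DataFrame(
    {"Pathway_activity_p53": [-0.67, -0.53, 0.43],
     "Country": ["X", "Y", "Z"]},  # placeholder values
    index=samples,
)

# Toy proteomics table: one column per gene, same sample index.
proteomics_toy = pd.DataFrame(
    {"A1BG": [-1.18, -0.685, -0.528],
     "A2M": [-0.863, -1.07, -1.32]},
    index=samples,
)

trait_name = "Pathway_activity_p53"

# Select only the trait column, then join the gene columns on the index.
merged = clinical_toy[[trait_name]].join(proteomics_toy)
print(merged)
```

Because the trait ends up in column 0, any later loop over genes can start at column 1.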

## Step 4: Statistical Analysis

The next step is more statistically intensive. We are looking for genes whose protein abundance correlates significantly with the chosen clinical attribute. Because we run one test per gene and there are thousands of genes, we first set a stricter p-value threshold by dividing 0.05 (the usual significance level) by the number of columns in the merged dataframe, which is roughly the number of genes tested. This is a Bonferroni correction.
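
The arithmetic behind the corrected threshold can be sketched as follows. The count of 11,000 columns is an assumption, chosen because it is consistent with the threshold printed in the output of the analysis cell. In the notebook itself the count comes from `len(traitProt.columns)`.

```python
# Conventional significance level before correction.
alpha = 0.05

# Assumed number of columns (tests); not read from the real data here.
n_tests = 11000

# Bonferroni correction: divide alpha by the number of tests performed.
threshold = alpha / n_tests
print("Threshold:", threshold)
```

A gene must then reach a p-value at or below this much smaller threshold to count as significant.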

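Next we will loop through the genes, testing each against the trait with a Spearman rank correlation (<code>stats.spearmanr</code>). We keep only the genes whose correlation coefficient has an absolute value of at least 0.5 and whose p-value is at or below the threshold.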
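
Before running the full loop, it may help to see the test applied to a single trait/gene pair. The sketch below uses a small made-up dataframe (the values and the column names `trait` and `gene` are illustrative only). It drops rows with a missing value in either column and then unpacks the correlation coefficient and p-value returned by `stats.spearmanr`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data: one trait column and one gene column, with a missing trait value.
toy = pd.DataFrame({
    "trait": [1.0, 2.0, np.nan, 3.0, 4.0, 5.0],
    "gene":  [2.0, 1.0, 7.0, 3.0, 4.0, 5.0],
})

# Keep only samples where both values are present.
pair = toy.dropna(axis=0)

# spearmanr returns a result that unpacks into (correlation, pvalue).
rho, pvalue = stats.spearmanr(pair["trait"], pair["gene"])
print(rho, pvalue)
```

Spearman's test works on ranks, so it picks up any monotonic relationship and is less sensitive to outliers than a Pearson correlation.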

In [7]:
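# Bonferroni-corrected threshold: 0.05 divided by the number of columns.
# The count includes the trait column, so it is very slightly conservative.
threshold = .05 / len(traitProt.columns)
tscutoff = 0.5  # minimum absolute Spearman correlation to report
print("Threshold:", threshold)
significantTests = []
significantGenes = []
# Column 0 holds the trait itself, so start at column 1.
for num in range(1,len(traitProt.columns)):
    gene = traitProt.columns[num]
    oneGene = traitProt[[trait, gene]]
    oneGene = oneGene.dropna(axis=0)  # drop samples missing either value
    spearmanrTest = stats.spearmanr(oneGene[trait], oneGene[gene])
    # Keep strong, significant correlations; skip degenerate results
    # (a correlation of exactly 1 or a p-value of exactly 0).
    if (abs(spearmanrTest[0]) >= tscutoff) and (spearmanrTest[0] < 1) and (spearmanrTest[1] <= threshold) and (spearmanrTest[1] > 0):
        print(spearmanrTest)
        significantTests.append(spearmanrTest)
        significantGenes.append(gene)
print(len(significantGenes))
print(significantGenes)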

Threshold: 4.5454545454545455e-06
SpearmanrResult(correlation=0.5163908610313054, pvalue=2.2193065109225645e-07)
SpearmanrResult(correlation=-0.5196493254270892, pvalue=6.803872988798687e-08)
SpearmanrResult(correlation=-0.512437880703719, pvalue=1.1047601933785191e-07)
SpearmanrResult(correlation=-0.5310749098407427, pvalue=3.0836308255695726e-08)
SpearmanrResult(correlation=-0.5219427182769559, pvalue=5.8180881366640145e-08)
SpearmanrResult(correlation=-0.5181594847541582, pvalue=7.527435145468716e-08)
SpearmanrResult(correlation=-0.5110854794128336, pvalue=3.092274373023385e-07)
SpearmanrResult(correlation=0.5825587253690367, pvalue=4.414786927952596e-07)
SpearmanrResult(correlation=-0.517066035119974, pvalue=8.104508894418577e-08)
SpearmanrResult(correlation=0.5586352677701082, pvalue=4.032275930670269e-09)
SpearmanrResult(correlation=-0.5160791343030166, pvalue=8.661316256665918e-08)
SpearmanrResult(correlation=0.5299340874050138, pvalue=2.1737088795782603e-07)
SpearmanrResult(cor