## Use Case 3: Comparing Clinical Threshold for Significant Genes

<b>First, load the standard imports for working with dataframes.</b>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

<b>Next, import the CPTAC package, which loads the data we will be working with.</b>

In [2]:
import CPTAC

Loading CPTAC data:
Loading Dictionary...
Loading Clinical Data...
Loading Proteomics Data...
Loading Transcriptomics Data...
Loading CNA Data...
Loading Phosphoproteomics Data...
Loading Somatic Mutation Data...

 ******PLEASE READ******
CPTAC is a community resource project and data are made available
rapidly after generation for community research use. The embargo
allows exploring and utilizing the data, but

<b>The first step is to load the clinical dataframe and the dataframe to compare it with, in this case proteomics.</b>

In [3]:
clinical = CPTAC.get_clinical()
proteomics = CPTAC.get_proteomics()

<b>The columns of the clinical dataframe can be viewed with the <code>clinical.columns</code> attribute.</b>

In [5]:
print(clinical.columns)

Index(['Tumor_Focality', 'Tumor_Size_cm', 'Estrogen_Receptor',
       'Estrogen_Receptor_%', 'Progesterone_Receptor',
       'Progesterone_Receptor_%', 'MLH1', 'MLH2', 'MSH6', 'PMS2', 'p53',
       'Other_IHC_specify', 'MLH1_Promoter_Hypermethylation',
       'Num_full_term_pregnancies', 'EPIC_Bcells', 'EPIC_CAFs',
       'EPIC_CD4_Tcells', 'EPIC_CD8_Tcells', 'EPIC_Endothelial',
       'EPIC_Macrophages', 'EPIC_NKcells', 'EPIC_otherCells',
       'CIBERSORT_B _cells _naive', 'CIBERSORT_B _cells _memory',
       'CIBERSORT_Plasma _cells', 'CIBERSORT_T _cells _CD8',
       'CIBERSORT_T _cells _CD4 _naive',
       'CIBERSORT_T _cells _CD4 _memory _resting',
       'CIBERSORT_T _cells _CD4 _memory _activated',
       'CIBERSORT_T _cells _follicular _helper',
       'CIBERSORT_T _cells _regulatory _(Tregs)',
       'CIBERSORT_T _cells _gamma _delta', 'CIBERSORT_NK _cells _resting',
       'CIBERSORT_NK _cells _activated', 'CIBERSORT_Monocytes',
       'CIBERSORT_Macrophages _M0', 'CIBERSORT

<b>The trait we will use for this example is the continuous variable for p53 pathway activity, <code>Pathway_activity_p53</code>.</b>

In [6]:
trait = 'Pathway_activity_p53'

<b>Next, we use the <code>CPTAC.compare_clinical()</code> function to create a dataframe that combines the chosen column from the clinical dataframe with our chosen dataframe, here proteomics.</b>
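
<b>As a rough illustration of the kind of result this produces, the sketch below joins one clinical column to a small proteomics table using plain pandas. It is not the CPTAC implementation: the helper <code>append_clinical_column</code> and the toy dataframes are hypothetical, and the sketch assumes both tables are indexed by sample ID.</b>

```python
import pandas as pd

# Hypothetical toy data: three samples, one clinical trait, two proteins.
samples = ["S001", "S002", "S003"]
toy_clinical = pd.DataFrame(
    {"Pathway_activity_p53": [-0.67, -0.53, 0.43]}, index=samples
)
toy_proteomics = pd.DataFrame(
    {"A1BG": [-1.18, -0.685, -0.528], "A2M": [-0.863, -1.07, -1.32]},
    index=samples,
)

def append_clinical_column(omics_df, clinical_df, trait):
    """Hypothetical helper: put clinical_df[trait] in front of omics_df.

    Rows are aligned on the shared sample index; samples missing from
    either table are dropped by the inner join.
    """
    return clinical_df[[trait]].join(omics_df, how="inner")

toy_joined = append_clinical_column(
    toy_proteomics, toy_clinical, "Pathway_activity_p53"
)
print(toy_joined)
```

<b>The trait ends up as the first column, followed by one column per protein, which is the layout the rest of this use case relies on.</b>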

In [7]:
traitProt = CPTAC.compare_clinical(proteomics, trait)
print(traitProt)

idx   Pathway_activity_p53    A1BG     A2M    A2ML1  A4GALT      AAAS    AACS  \
S001                 -0.67 -1.1800 -0.8630 -0.80200  0.2220  0.256000  0.6650   
S002                 -0.53 -0.6850 -1.0700 -0.68400  0.9840  0.135000  0.3340   
S003                  0.43 -0.5280 -1.3200  0.43500     NaN -0.240000  1.0400   
S004                   NaN  2.3500  2.8200 -1.47000     NaN  0.154000  0.0332   
S005                  0.15 -1.6700 -1.1900 -0.44300  0.2430 -0.099300  0.7570   
S006                 -1.98 -0.3740 -0.0206 -0.53700  0.3110  0.375000  0.0131   
S007                  0.56 -1.0800 -0.7080 -0.12600 -0.4260 -0.114000 -0.1110   
S008                 -1.05 -1.3200 -0.7080 -0.80800 -0.0709  0.138000  0.6560   
S009                 -1.60 -0.4670  0.3700 -0.33900     NaN  0.434000  0.0358   
S010                 -1.39 -1.1200 -1.3100  0.91200  0.4180 -0.076800  0.8460   
S011                  0.95 -0.7160 -0.8850  2.82000 -0.3430  0.147000  0.4450   
S012                  0.47 -

<b>The next step is more statistically intensive. We are looking for genes whose protein levels correlate significantly with the chosen clinical attribute. Because we test thousands of genes, we first set a stricter p-value threshold by dividing .05 (the usual significance level) by the number of columns. This is a Bonferroni-style correction for multiple testing.</b>

<b>Next, we loop through the genes, testing each with a Spearman rank correlation test and keeping only those that meet our significance criteria: an absolute correlation of at least 0.5 and a p-value at or below the threshold.</b>
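
<b>To see the procedure in isolation, here is a minimal sketch on synthetic data, not on the CPTAC dataframes. The column names (<code>trait</code>, <code>linked</code>, <code>noise_0</code> and so on) and the random values are made up for illustration. Unlike the cell that follows, it divides .05 by the number of gene columns only, excluding the trait column.</b>

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n_samples = 100

# Synthetic data (hypothetical): one trait, one protein that tracks it,
# and four proteins that are pure noise.
trait_values = rng.normal(size=n_samples)
toy = pd.DataFrame({"trait": trait_values})
toy["linked"] = trait_values + rng.normal(scale=0.3, size=n_samples)
for i in range(4):
    toy[f"noise_{i}"] = rng.normal(size=n_samples)

genes = [col for col in toy.columns if col != "trait"]

# Bonferroni correction: divide the significance level by the number of tests.
alpha = 0.05
bonferroni = alpha / len(genes)
min_abs_rho = 0.5  # minimum absolute correlation to report

hits = []
for gene in genes:
    # Drop samples where either value is missing before testing.
    pair = toy[["trait", gene]].dropna()
    rho, pvalue = stats.spearmanr(pair["trait"], pair[gene])
    if abs(rho) >= min_abs_rho and pvalue <= bonferroni:
        hits.append(gene)

print("Threshold:", bonferroni)
print(hits)
```

<b>Only the column constructed to follow the trait should pass both filters; the correlation cutoff and the corrected p-value threshold together screen out the noise columns.</b>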

In [10]:
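# Bonferroni-style threshold: .05 divided by the number of columns
threshold = .05 / len(traitProt.columns)
tscutoff = 0.5  # minimum absolute Spearman correlation to report
print("Threshold:", threshold)
significantTests = []
significantGenes = []
# Column 0 is the trait itself, so start at column 1
for num in range(1, len(traitProt.columns)):
    gene = traitProt.columns[num]
    # Keep only the samples that have both the trait and this gene measured
    oneGene = traitProt[[trait, gene]].dropna(axis=0)
    spearmanrTest = stats.spearmanr(oneGene[trait], oneGene[gene])
    # Require a strong correlation (but not exactly 1) and a nonzero p-value at or below the threshold
    if (abs(spearmanrTest[0]) >= tscutoff) and (spearmanrTest[0] < 1) and (spearmanrTest[1] <= threshold) and (spearmanrTest[1] > 0):
        print(spearmanrTest)
        significantTests.append(spearmanrTest)
        significantGenes.append(gene)
# Report how many genes passed and which ones
print(len(significantGenes))
print(significantGenes)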

Threshold: 4.5454545454545455e-06
SpearmanrResult(correlation=-0.5101680062878945, pvalue=5.045697072830158e-08)
SpearmanrResult(correlation=-0.512397344476199, pvalue=4.313228611332701e-08)
SpearmanrResult(correlation=0.5857624399784926, pvalue=1.2418374041220899e-07)
SpearmanrResult(correlation=0.5244192956885811, pvalue=1.815209632078782e-08)
SpearmanrResult(correlation=0.5332623965636514, pvalue=7.437809995671341e-08)
SpearmanrResult(correlation=0.5384704996030009, pvalue=6.3195068019392865e-09)
SpearmanrResult(correlation=0.5839398341178488, pvalue=1.0517420555123825e-08)
SpearmanrResult(correlation=-0.5128123738673442, pvalue=4.188582828594598e-08)
SpearmanrResult(correlation=0.63012438083762, pvalue=1.663329481961446e-12)
SpearmanrResult(correlation=0.5064287569263813, pvalue=6.547692551446557e-08)
SpearmanrResult(correlation=0.61386265557212, pvalue=8.734191794433271e-12)
SpearmanrResult(correlation=-0.512449764535652, pvalue=4.297292148112274e-08)
SpearmanrResult(correlation=-