## BlackSheep Cookbook Take 2

The Black Sheep analysis lets researchers find trends in abnormal protein enrichment among patients in CPTAC datasets. In this cookbook, we walk through the steps of a full Black Sheep analysis to answer a research question: do BMI, age, or country of origin play a role in protein enrichment for patients with endometrial cancer?

## Step 1a: Import Dependencies
First, install cptac through pip if it is not already installed (uncomment the pip line below), then import the necessary dependencies.

In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#!pip install cptac
import cptac
import binarization_functions as bf
import blackSheepCPTACmoduleCopy as blsh
import importlib
importlib.reload(bf)
importlib.reload(blsh)

<module 'blackSheepCPTACmoduleCopy' from 'C:\\Users\\Daniel\\Documents\\GitHub\\WhenMutationsMatter\\Daniel\\blackSheepCPTACmoduleCopy.py'>

## Step 1b: Load Data and Choose Omics Table

In [2]:
en = cptac.Endometrial()
proteomics = en.get_proteomics()
phospho = en.get_phosphoproteomics()
clinical = en.get_clinical()

Checking that data files are up-to-date...
100% [..................................................................................] 649 / 649
Data check complete.
endometrial data version: 2.1

Loading acetylproteomics data...
Loading clinical data...
Loading CNA data...
Loading definitions data...
Loading miRNA data...
Loading phosphoproteomics_gene data...
Loading phosphoproteomics_site data...
Loading proteomics data...
Loading somatic data...
Loading somatic_binary data...
Loading transcriptomics_circular data...
Loading transcriptomics_linear data...

 ******PLEASE READ******
CPTAC is a community resource project and data are made available
rapidly after generation for community research use. The embargo
allows exploring and utilizing the data, but analysis may not be
published until July 1, 2019. Please see
https://proteomics.cancer.gov/data-portal/about/data-use-agreement or
enter cptac.embargo() to open the webpage for more details.


In [3]:
#Append Genomics_subtype and MSI_status to the clinical DataFrame
df = en.get_derived_molecular()
important_things_to_append = ['Genomics_subtype', 'MSI_status']

#Each column is aligned to clinical on the shared sample index
for col in important_things_to_append:
    clinical[col] = df[col]

In [61]:
miRNA = en.get_miRNA()
miRNA.head()

Sample_ID,hsa-let-7a-2-3p,hsa-let-7a-3p,hsa-let-7a-5p,hsa-let-7b-3p,hsa-let-7b-5p,hsa-let-7c-3p,hsa-let-7c-5p,hsa-let-7d-3p,hsa-let-7d-5p,hsa-let-7e-3p,...,hsa-miR-9901,hsa-miR-9902,hsa-miR-9903,hsa-miR-9983-3p,hsa-miR-9985,hsa-miR-9986,hsa-miR-99a-3p,hsa-miR-99a-5p,hsa-miR-99b-3p,hsa-miR-99b-5p
S001,0.93,6.45,15.92,8.98,13.66,1.33,14.41,9.58,12.31,5.01,...,0.0,0.0,1.56,0.0,4.55,0.0,3.04,11.77,8.42,13.28
S002,0.69,6.86,16.05,9.38,13.69,1.6,14.65,9.04,12.28,4.48,...,0.0,0.0,1.94,0.0,3.16,0.0,2.65,12.27,7.83,12.87
S003,1.2,6.41,15.86,8.83,12.87,3.21,14.06,9.12,11.85,5.31,...,0.16,0.0,3.44,0.77,4.03,0.0,3.81,12.98,7.89,13.12
S005,0.37,8.57,16.21,8.77,13.37,2.5,14.44,8.4,12.23,4.87,...,0.2,0.0,2.43,0.0,4.15,0.0,3.92,11.77,7.8,12.03
S009,1.71,8.11,15.61,8.28,12.23,2.92,13.54,8.31,11.7,5.6,...,0.78,0.0,1.48,0.67,4.27,0.0,3.81,12.41,9.19,12.91


In [55]:
en.list_data()

Below are the dataframes contained in this dataset:
	acetylproteomics
		Dimensions: (144, 10862)
	circular_RNA
		Dimensions: (109, 4945)
	clinical
		Dimensions: (144, 26)
	CNA
		Dimensions: (95, 28057)
	derived_molecular
		Dimensions: (144, 125)
	experimental_setup
		Dimensions: (144, 26)
	miRNA
		Dimensions: (99, 2337)
	phosphoproteomics
		Dimensions: (144, 73212)
	phosphoproteomics_gene
		Dimensions: (144, 8466)
	proteomics
		Dimensions: (144, 10999)
	somatic_mutation
		Dimensions: (52560, 3)
	somatic_mutation_binary
		Dimensions: (95, 51559)
	transcriptomics
		Dimensions: (109, 28057)


## Step 2: Determine what attributes you would like to A/B test
For this analysis, we will be testing MSI_status versus BMI in the proteomics dataset, and Histologic_type versus Age in the phosphoproteomics dataset.

In [41]:
#Create a subset copy of the original Clinical DataFrame for proteomics. 
annotations_prot = clinical[['MSI_status', 'BMI']].copy()

#Binarize the BMI column into two options: Healthy and Unhealthy
annotations_prot['BMI'] = bf.binarizeRange(annotations_prot, 
                                           'BMI', 18.5, 25, 
                                           'Healthy', 'Unhealthy')

#Create a subset copy of the original Clinical DataFrame for phosphoproteomics.
annotations_phospho = clinical[['Histologic_type', 'Age']].copy()

#Binarize the Age column into two options: Old and Young
annotations_phospho['Age'] = bf.binarizeCutOff(annotations_phospho, 
                                               'Age', 65, 'Old', 'Young')
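
The source of `binarization_functions` is not shown in this cookbook, so the following is only a minimal sketch of what the two helpers might do. The function names `binarize_range_sketch` and `binarize_cutoff_sketch` are hypothetical stand-ins, and two behaviors are assumptions: values inside the range (or at or above the cutoff) receive the first label, and missing values stay missing.

```python
import numpy as np
import pandas as pd


def binarize_range_sketch(df, column, lower, upper, in_label, out_label):
    """Label values inside [lower, upper] as in_label and all others as out_label."""
    values = df[column]
    inside = values.between(lower, upper)  # inclusive on both ends
    labels = pd.Series(np.where(inside, in_label, out_label),
                       index=df.index, dtype=object)
    # Assumption: a missing measurement stays missing rather than joining a group.
    labels[values.isna()] = np.nan
    return labels


def binarize_cutoff_sketch(df, column, cutoff, high_label, low_label):
    """Label values at or above cutoff as high_label and the rest as low_label."""
    values = df[column]
    labels = pd.Series(np.where(values >= cutoff, high_label, low_label),
                       index=df.index, dtype=object)
    labels[values.isna()] = np.nan
    return labels


# Toy data only; these are not CPTAC values.
toy = pd.DataFrame({"BMI": [22.0, 31.5, np.nan, 17.0],
                    "Age": [70, 45, 65, np.nan]},
                   index=["S001", "S002", "S003", "S005"])
toy_bmi = binarize_range_sketch(toy, "BMI", 18.5, 25, "Healthy", "Unhealthy")
toy_age = binarize_cutoff_sketch(toy, "Age", 65, "Old", "Young")
```

Keeping missing values as NaN matters here because a sample with no recorded BMI would otherwise be counted as 'Unhealthy' and skew the group sizes used later in the comparison.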

## Step 3: Perform outliers analysis

In [44]:
outliers_prot = blsh.make_outliers_table(proteomics, iqrs=2.0, 
                                         up_or_down='up', 
                                         aggregate=False, 
                                         frac_table=False)

outliers_phospho = blsh.make_outliers_table(phospho, iqrs=2.0, 
                                            up_or_down='up', 
                                            aggregate=False, 
                                            frac_table=False)

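
To make the `iqrs=2.0` and `up_or_down='up'` arguments concrete, here is a rough sketch of an IQR-based outlier call. It assumes that an "up" outlier is any value above `median + iqrs * IQR`, computed per protein across samples; the function name `flag_up_outliers_sketch` is hypothetical and this is not the code inside `make_outliers_table`.

```python
import numpy as np
import pandas as pd


def flag_up_outliers_sketch(omics, iqrs=2.0):
    """Return a boolean table: True where a value exceeds median + iqrs * IQR.

    Rows are samples and columns are proteins, matching the cptac table layout.
    """
    median = omics.median(axis=0)
    iqr = omics.quantile(0.75, axis=0) - omics.quantile(0.25, axis=0)
    threshold = median + iqrs * iqr
    # Comparisons against NaN are False, so missing values are never flagged.
    return omics.gt(threshold, axis=1)


# Toy data only; these are not CPTAC values.
toy = pd.DataFrame(
    {"PROT_A": [0.0, 0.1, -0.1, 0.2, 5.0],
     "PROT_B": [1.0, 1.1, 0.9, np.nan, 1.05]},
    index=["S001", "S002", "S003", "S005", "S009"])
flags = flag_up_outliers_sketch(toy)
```

In this toy table only sample S009 is flagged for `PROT_A`, while `PROT_B` has no outliers because all of its values sit close to the median.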


## Step 4: Wrap your A/B test into the outliers analysis, and create a table

In [45]:
results_prot = blsh.compare_groups_outliers(outliers_prot, 
                                            annotations_prot, 
                                            frac_filter=0.1)

results_phospho = blsh.compare_groups_outliers(outliers_phospho, 
                                               annotations_phospho, 
                                               frac_filter=0.1)

Testing 147 rows for enrichment in MSI_status MSS samples
Testing 292 rows for enrichment in MSI_status MSI-H samples
Testing 72 rows for enrichment in BMI Unhealthy samples
Testing 2121 rows for enrichment in BMI Healthy samples
Testing 41 rows for enrichment in Histologic_type Endometrioid samples
Testing 1550 rows for enrichment in Histologic_type Serous samples
Testing 108 rows for enrichment in Age Young samples
Testing 104 rows for enrichment in Age Old samples
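
The idea behind "testing rows for enrichment" can be illustrated with a one-sided Fisher exact test on a single protein. This is a sketch under assumptions: the notebook does not show which test `compare_groups_outliers` uses, the name `enrichment_p_value_sketch` is hypothetical, and the multiple-testing correction implied by the `_enrichment_FDR` column names is omitted.

```python
from math import comb

import pandas as pd


def enrichment_p_value_sketch(is_outlier, in_group):
    """One-sided Fisher exact test: are outliers over-represented in the group?

    is_outlier and in_group are boolean Series sharing a sample index.
    """
    paired = pd.concat([is_outlier, in_group], axis=1,
                       keys=["outlier", "group"]).dropna()
    outlier = paired["outlier"].astype(bool)
    group = paired["group"].astype(bool)

    n_samples = len(paired)
    n_outliers = int(outlier.sum())
    n_group = int(group.sum())
    n_both = int((outlier & group).sum())

    # Hypergeometric tail: chance of n_both or more outliers landing in the group.
    total = comb(n_samples, n_group)
    tail = sum(comb(n_outliers, k) * comb(n_samples - n_outliers, n_group - k)
               for k in range(n_both, min(n_outliers, n_group) + 1))
    return tail / total


# Toy data only: 8 samples, 4 in the group, 3 outliers that all fall in the group.
samples = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
toy_group = pd.Series([True, True, True, True, False, False, False, False], index=samples)
toy_outlier = pd.Series([True, True, True, False, False, False, False, False], index=samples)
p_value = enrichment_p_value_sketch(toy_outlier, toy_group)
```

With so few samples, even a perfect split gives a p-value of only 5/70, which is one reason a filter such as `frac_filter` that drops rows with too few outliers is useful before testing.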


## Step 5: Visualize these enrichments

In [46]:
results_prot.head()

Unnamed: 0,MSI_status_MSS_enrichment_FDR,MSI_status_MSI-H_enrichment_FDR,BMI_Unhealthy_enrichment_FDR,BMI_Healthy_enrichment_FDR
A1BG,,,,
A2M,,,,
A2ML1,0.521501,,1.0,
A4GALT,,,,0.412229
AAAS,,,,


In [47]:
results_phospho.head()

Unnamed: 0,Histologic_type_Endometrioid_enrichment_FDR,Histologic_type_Serous_enrichment_FDR,Age_Young_enrichment_FDR,Age_Old_enrichment_FDR
AAAS-S495,,,,
AAAS-S541,,,,
AAAS-Y485,,,,
AACS-S618,,,,
AAED1-S12,,,,


## Step 6: Determine significant enrichments, and link with cancer drug database.

In [49]:
#Check for significant columns in proteomics
for col in results_prot.columns:
    bf.significantEnrichments(results_prot, col)

#Check for significant columns in phosphoproteomics
for col in results_phospho.columns:
    bf.significantEnrichments(results_phospho, col)

No significant results in MSI_status_MSS

There is 1 significant protein enrichment in MSI_status_MSI-H:

No significant results in BMI_Unhealthy

No significant results in BMI_Healthy

No significant results in Histologic_type_Endometrioid

There are 523 significant proteins enrichments in Histologic_type_Serous

No significant results in Age_Young

No significant results in Age_Old



In [53]:
#Store the dataframe of significant enrichments
sig_results_MSI = bf.significantEnrichments(results_prot, 
                                            'MSI_status_MSI-H_enrichment_FDR')
sig_results_Hist = bf.significantEnrichments(results_phospho, 
                                            'Histologic_type_Serous_enrichment_FDR', 
                                             0.01)

There is 1 significant protein enrichment in MSI_status_MSI-H:

There are 13 significant proteins enrichments in Histologic_type_Serous
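
The filtering step can be pictured as a simple threshold on one FDR column. The sketch below is an assumption about what `bf.significantEnrichments` does, not its actual code: the name `significant_rows_sketch` is hypothetical, and the default cutoff of 0.05 is a guess based on the third argument (0.01) yielding fewer hits.

```python
import numpy as np
import pandas as pd


def significant_rows_sketch(results, column, cutoff=0.05):
    """Return the rows of `column` whose FDR is below `cutoff`, sorted ascending."""
    fdr = results[column].dropna()  # untested rows are NaN and are skipped
    return fdr[fdr < cutoff].sort_values().to_frame()


# Toy data only; the gene names and FDR values are made up.
toy_results = pd.DataFrame(
    {"Group_A_enrichment_FDR": [np.nan, 0.52, 0.003, 0.04],
     "Group_B_enrichment_FDR": [0.41, np.nan, np.nan, 0.9]},
    index=["GENE1", "GENE2", "GENE3", "GENE4"])

loose = significant_rows_sketch(toy_results, "Group_A_enrichment_FDR")
strict = significant_rows_sketch(toy_results, "Group_A_enrichment_FDR", cutoff=0.01)
```

Tightening the cutoff from 0.05 to 0.01 drops GENE4 from the toy result, mirroring how the Serous list above shrinks from 523 to 13 entries.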



In [54]:
sig_results_Hist.head()

Unnamed: 0,Histologic_type_Serous_P_values
ACTL6A-S233,0.002109
AKAP8L-S552,0.002109
FXR1-S409,0.000261
FXR1-T411,0.001805
INCENP-S197,0.0069
