## BlackSheep Cookbook Exploration

BlackSheep analysis lets researchers find groups of patients in CPTAC datasets that are enriched for abnormal (outlier) protein levels. In this cookbook, we walk through the steps needed to perform a full BlackSheep analysis on the CPTAC ovarian dataset.

## Step 1a: Import Dependencies
First, import the necessary dependencies, including the cptac package and the local helper modules.

In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cptac
import binarization_functions as bf
import blackSheepCPTACmoduleCopy as blsh
import gseapy as gp
from gseapy.plot import barplot, heatmap, dotplot

## Step 1b: Load Data and Choose Omics Table
For this analysis, we will look at results across the proteomics and transcriptomics tables, and use the clinical table to define the patient groups.

In [36]:
ov = cptac.Ovarian()
proteomics = ov.get_proteomics()
transcriptomics = ov.get_transcriptomics()
clinical = ov.get_clinical()

Checking that data files are up-to-date...
100% [..................................................................................] 407 / 407
Data check complete.
ovarian data version: Most recent release

Loading clinical data...
Loading cnv data...
Loading definitions data...
Loading phosphoproteomics data...
Loading proteomics data...
Loading somatic_38 data...
Loading transcriptomics data...
Loading treatment data...


## Step 2: Determine which attributes to A/B test
For this analysis, we iterate over the columns of the clinical dataset and split each one into two groups, so that every attribute can be tested for differences in protein enrichment between its groups.

In [62]:
#Create a copy of the original Clinical DataFrame
annotations = pd.DataFrame(clinical.copy())

In [63]:
#Drop irrelevant columns.
irrelevant_cols = ['Patient_ID', 'Participant_Gender', 'Histological_Subtype']
annotations = annotations.drop(irrelevant_cols, axis=1)

#Determine which columns we should either drop, or be generally skeptical of in our analysis
questionable_cols = ['Participant_History_Neo-adjuvant_Treatment', #all said no but one
                     'Participant_History_Radiation_Therapy', #all said no but one
                     'Participant_History_Hormonal_Therapy', #all said no but one
                     'Participant_Ethnicity', #all are not hispanic or latino, weren't evaluated, or unknown. Only one hispanic
                     'Normal_Sample_4_Surgical_Devascularized_Time', #very few datapoints
                     'Normal_Sample_4_Weight', 'Normal_Sample_4_LN2_Time', #very few datapoints
                     'Normal_Sample_4_Ischemia_Time', #very few datapoints
                     'Normal_Sample_5_Surgical_Devascularized_Time', #very few datapoints
                     'Normal_Sample_5_Weight', 'Normal_Sample_5_LN2_Time', #very few datapoints
                     'Normal_Sample_5_Ischemia_Time', #very few datapoints
                     'Other_New_Tumor_Event_Site', #few datapoints, and hard to really binarize 
                     'Days_Between_Collection_And_New_Tumor_Event_Surgery'] #few datapoints

annotations = annotations.drop(questionable_cols, axis=1)

In [64]:
#Determine which columns are binary and which aren't
binary_cols = []
non_binary_cols = []

for col in annotations.columns:
    if len(annotations[col].value_counts()) == 2:
        binary_cols.append(col)
    elif len(annotations[col].value_counts()) > 2:
        non_binary_cols.append(col)
    else:
        #A column with fewer than two distinct values cannot be split into two groups, so drop it.
        annotations = annotations.drop(col, axis=1)

In [65]:
#for item in non_binary_cols:
    #print(item)
    #print(annotations[item].value_counts())
    #print('\n')
    
#This is where I left off

In [67]:
numeric_non_bin = []
categorical_non_bin = []

for item in non_binary_cols:
    if np.issubdtype(annotations[item].dtype, np.number):
        print(item+" is a numeric column\n")
        numeric_non_bin.append(item)
        mean = annotations[item].mean()
        annotations[item]= bf.binarizeCutOff(annotations, item, mean, 
                                             "Above_Mean("+str(round(mean, 2))+")", 
                                             "Below_Mean("+str(round(mean, 2))+")")
    else:
        print(item+" is a categorical column\n")
        categorical_non_bin.append(item)

Participant_Procurement_Age is a numeric column

Participant_Race is a categorical column

Participant_Jewish_Heritage is a categorical column

Aliquots_Plasma is a numeric column

Blood_Collection_Time is a categorical column

Blood_Collection_Method is a categorical column

Anesthesia_Time is a numeric column

Tumor_Surgical_Devascularized_Time is a numeric column

Tumor_Sample_Number is a numeric column

Tumor_Sample_1_Weight is a numeric column

Tumor_Sample_1_LN2_Time is a numeric column

Tumor_Sample_1_Ischemia_Time is a numeric column

Tumor_Sample_2_Weight is a numeric column

Tumor_Sample_2_LN2_Time is a numeric column

Tumor_Sample_2_Ischemia_Time is a numeric column

Tumor_Sample_3_Weight is a numeric column

Tumor_Sample_3_LN2_Time is a numeric column

Tumor_Sample_3_Ischemia_Time is a numeric column

Tumor_Sample_4_Weight is a numeric column

Tumor_Sample_4_LN2_Time is a numeric column

Tumor_Sample_4_Ischemia_Time is a numeric column

Tumor_Sample_5_Weight is a numeric co

## Step 2a: Binarize column values
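The cells in this step rely on two helpers from the local `binarization_functions` module, whose source is not shown in this notebook. As a rough guide to what they appear to do, here is a minimal sketch on toy data. The function names, the `>=` comparison, and the handling of missing values are all assumptions for illustration, not the module's actual implementation.

```python
import numpy as np
import pandas as pd


def binarize_cutoff(df, column, cutoff, above_label, below_label):
    """Label values >= cutoff as above_label and the rest as below_label.

    Assumption: missing values stay missing rather than joining a group.
    """
    values = df[column]
    labels = pd.Series(np.where(values >= cutoff, above_label, below_label),
                       index=df.index, dtype=object)
    return labels.where(values.notna(), np.nan)


def binarize_categorical(df, column, mapping):
    """Collapse categories into two groups; unmapped values become NaN."""
    return df[column].map(mapping)


# Toy data, not taken from the CPTAC clinical table.
toy = pd.DataFrame({"Age": [40.0, 72.0, np.nan, 65.0],
                    "Race": ["White", "Asian", "Unknown", "White"]})
age_groups = binarize_cutoff(toy, "Age", 65, "Old", "Young")
race_groups = binarize_categorical(toy, "Race",
                                   {"White": "White", "Asian": "Not_White",
                                    "Unknown": np.nan})
print(age_groups.tolist())
print(race_groups.tolist())
```

Keeping missing values as NaN means those patients are left out of both groups in the later comparison instead of being counted in one of them by accident.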

In [8]:
'''annotations['Participant_Procurement_Age'] = bf.binarizeCutOff(annotations, 
                                                               'Participant_Procurement_Age', 
                                                               700, 'Old', 'Young')
'''

In [71]:
race_map = {'White':'White', 
            'Asian':'Not_White', 
            'Black or African American':'Not_White', 
            'Unknown (Could not be determined or unsure)':np.nan, 
            'American Indian or Alaska Native': 'Not_White'}

annotations['Participant_Race'] = bf.binarizeCategorical(annotations, 
                                                         'Participant_Race', 
                                                         race_map)

In [72]:
jewish_map = {'Not Jewish':'Not_Jewish', 
              'Unknown':np.nan, 
              'Ashkenazi':'Jewish', 
              'Jewish, NOS':'Jewish'}

annotations['Participant_Jewish_Heritage'] = bf.binarizeCategorical(annotations, 
                                                                    'Participant_Jewish_Heritage', 
                                                                    jewish_map)

In [11]:
'''annotations['Aliquots_Plasma'] = bf.binarizeCutOff(annotations, 'Aliquots_Plasma', 
                                                   3.0, '3-4', '0-2')
'''

In [73]:
#Replace categorical 'Not Reported/ Unknown' with NaN and convert to numeric for easier binarization
annotations['Blood_Collection_Time'] = annotations['Blood_Collection_Time'].replace('Not Reported/ Unknown', np.nan)
annotations['Blood_Collection_Time'] = pd.to_numeric(annotations['Blood_Collection_Time'])

#Binarize Column
annotations['Blood_Collection_Time'] = bf.binarizeCutOff(annotations, 
                                                         'Blood_Collection_Time', 
                                                         1000, 'Long', 'Short')
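
Replacing one known placeholder string works when that is the only non-numeric entry in the column. If other text values might be present, `pd.to_numeric` with `errors='coerce'` converts every unparseable entry to NaN in one step. A small sketch on made-up values (the second placeholder string is hypothetical):

```python
import pandas as pd

# Made-up values; "not recorded" is a hypothetical second placeholder.
raw = pd.Series(["930", "Not Reported/ Unknown", "1105", "not recorded"])

# Anything that cannot be parsed as a number becomes NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().tolist())
```

The trade-off is that `errors='coerce'` also hides genuine data-entry mistakes, so it is worth checking how many values were coerced.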

In [74]:
blood_collection_method_map = {'Venipuncture (Vacutainer Apparatus)':'Venipuncture', 
                               'Venipuncture (Syringe)':'Venipuncture', 'IV Catheter':'IV'}

annotations['Blood_Collection_Method'] = bf.binarizeCategorical(annotations, 
                                                                'Blood_Collection_Method', 
                                                                blood_collection_method_map)

In [14]:
'''annotations['Anesthesia_Time'] = bf.binarizeCutOff(annotations, 'Anesthesia_Time', 
                                                   1000, 'Long', 'Short')
'''

In [15]:
'''annotations['Tumor_Surgical_Devascularized_Time'] = bf.binarizeCutOff(annotations, 
                                                                      'Tumor_Surgical_Devascularized_Time', 
                                                                      1000, 'Long', 'Short')
'''

In [16]:
'''annotations['Tumor_Sample_Number'] = bf.binarizeCutOff(annotations,
                                                       'Tumor_Sample_Number', 
                                                       4, '4-6', '1-3')
'''

In [17]:
'''annotations['Tumor_Sample_1_Weight'] = bf.binarizeCutOff(annotations, 
                                                         'Tumor_Sample_1_Weight', 
                                                         1000, 'Heavy', 'Light')
'''

In [31]:
print(annotations['Tumor_Sample_2_LN2_Time'].min())
print(annotations['Tumor_Sample_2_LN2_Time'].max())
print(annotations['Tumor_Sample_2_LN2_Time'].mean())

21.0
2127.0
1189.549019607843


In [19]:
'''annotations['Tumor_Sample_1_LN2_Time'] = bf.binarizeCutOff(annotations, 
                                                           'Tumor_Sample_1_LN2_Time', 
                                                           1000, 'Long', 'Short')
'''

In [20]:
'''annotations['Tumor_Sample_1_Ischemia_Time'] = bf.binarizeCutOff(annotations, 
                                                                'Tumor_Sample_1_Ischemia_Time', 
                                                                15, 'Long', 'Short')
'''

In [21]:
'''annotations['Tumor_Sample_2_Weight'] = bf.binarizeCutOff(annotations, 
                                                         'Tumor_Sample_2_Weight', 
                                                         600, 'Heavy', 'Light')
'''

In [32]:
'''annotations['Tumor_Sample_2_LN2_Time'] = bf.binarizeCutOff(annotations, 
                                                           'Tumor_Sample_2_LN2_Time', 
                                                           1200, 'Long', 'Short')
'''

In [76]:
for item in categorical_non_bin:
    print(item)
    print(annotations[item].value_counts())
    print('\n')

Participant_Race
White        95
Not_White    14
Name: Participant_Race, dtype: int64


Participant_Jewish_Heritage
Not_Jewish    60
Jewish        13
Name: Participant_Jewish_Heritage, dtype: int64


Blood_Collection_Time
Long     58
Short    43
Name: Blood_Collection_Time, dtype: int64


Blood_Collection_Method
Venipuncture    86
IV              25
Name: Blood_Collection_Method, dtype: int64


Origin_Site_Disease
Ovary             73
Fallopian tube    22
Peritoneum        10
Name: Origin_Site_Disease, dtype: int64


Anatomic_Site_Tumor
Ovary                    55
Omentum                  43
Pelvic mass               3
Peritoneum                3
Not Reported/ Unknown     1
Name: Anatomic_Site_Tumor, dtype: int64


Anatomic_Lateral_Position_Tumor
Not applicable           49
Right                    31
Left                     12
Not Reported/ Unknown     7
Bilateral                 6
Name: Anatomic_Lateral_Position_Tumor, dtype: int64


Method_of_Pathologic_Diagnosis
Tumor resection   

In [77]:
origin_site_map = {'Ovary':'Ovary', 
                   'Fallopian tube':'Other', 
                   'Peritoneum':'Other'}

annotations['Origin_Site_Disease'] = bf.binarizeCategorical(annotations, 
                                                            'Origin_Site_Disease',
                                                            origin_site_map)

In [79]:
anatomic_site_map = {'Ovary':'Ovary', 
                     'Omentum':'Other', 
                     'Pelvic mass':'Other', 
                     'Peritoneum':'Other', 
                     'Not Reported/ Unknown':np.nan}

annotations['Anatomic_Site_Tumor'] = bf.binarizeCategorical(annotations, 
                                                            'Anatomic_Site_Tumor',
                                                            anatomic_site_map)

In [80]:
anatomic_lateral_map = {'Not applicable':'Other',
                        'Right':'Right',
                        'Left':'Other', 
                        'Not Reported/ Unknown':np.nan, 
                        'Bilateral':'Other'}

annotations['Anatomic_Lateral_Position_Tumor'] = bf.binarizeCategorical(annotations,
                                                                        'Anatomic_Lateral_Position_Tumor',
                                                                        anatomic_lateral_map)

In [81]:
path_diagnosis_map = {'Tumor resection': 'Tumor resection',
                      'Excisional Biopsy':'Biopsy'}

annotations['Method_of_Pathologic_Diagnosis'] = bf.binarizeCategorical(annotations,
                                                                       'Method_of_Pathologic_Diagnosis',
                                                                       path_diagnosis_map)

In [83]:
tumor_stage_map = {'IC':'I_or_II', 
                   'IIB':'I_or_II', 
                   'III':'III_or_IV', 
                   'IIIA':'III_or_IV', 
                   'IIIB':'III_or_IV', 
                   'IIIC':'III_or_IV', 
                   'IV':'III_or_IV', 
                   'Not Reported/ Unknown':np.nan}

annotations['Tumor_Stage_Ovary_FIGO'] = bf.binarizeCategorical(annotations, 
                                                               'Tumor_Stage_Ovary_FIGO', 
                                                               tumor_stage_map)

In [29]:
FIGO_stage_map = {'IA':'I_or_II', 
                  'IB':'I_or_II', 
                  'II':'I_or_II', 
                  'IIIA':'III_or_IV', 
                  'IIIC1':'III_or_IV', 
                  'IVB':'III_or_IV', 
                  'IIIC2':'III_or_IV', 
                  'IIIB':'III_or_IV'}

annotations['FIGO_stage'] = bf.binarizeCategorical(clinical, 
                                                   'FIGO_stage', 
                                                   FIGO_stage_map)

In [None]:
#Continue from here

Tumor_Grade
G3                       84
Not Reported/ Unknown    11
G2                        6
G1                        1
GB                        1
GX                        1

In [30]:
diabetes_map = {'No':'No', 
                'Yes':'Yes', 
                'Unknown':'No'}

annotations['Diabetes'] = bf.binarizeCategorical(clinical, 
                                                 'Diabetes', 
                                                 diabetes_map)

In [31]:
race_map = {'White':'White', 
            'Black or African American':'Not_White', 
            'Asian':'Not_White', 
            'Not Reported':'Not_White'}

annotations['Race'] = bf.binarizeCategorical(clinical, 
                                             'Race', 
                                             race_map)

In [32]:
ethnicity_map = {'Not-Hispanic or Latino':'Not_Hispanic', 
                 'Not reported':'Not_Hispanic', 
                 'Hispanic or Latino':'Hispanic'}

annotations['Ethnicity'] = bf.binarizeCategorical(clinical, 
                                                  'Ethnicity', 
                                                  ethnicity_map)

In [33]:
tumor_site_map = {'Other, specify':'Not_Anterior', 
                  'Anterior endometrium':'Anterior', 
                  'Posterior endometrium':'Not_Anterior'}

annotations['Tumor_Site'] = bf.binarizeCategorical(clinical, 
                                                   'Tumor_Site', 
                                                   tumor_site_map)

In [34]:
annotations['Tumor_Size_cm'] = bf.binarizeCutOff(clinical, 
                                                'Tumor_Size_cm', 4.0, 
                                                'Large_tumor', 
                                                'Small_tumor')

In [35]:
num_pregnancies_map = {2:'Less_than_3', 
                       1:'Less_than_3', 
                       'None':'Less_than_3', 
                       None:'Less_than_3', 
                       3:'3_or_more', 
                       '4 or more':'3_or_more'}

annotations['Num_full_term_pregnancies'] = bf.binarizeCategorical(clinical, 
                                                                  'Num_full_term_pregnancies', 
                                                                  num_pregnancies_map)

In [36]:
genomics_map = {'MSI-H':'MSI-H', 
                'CNV_low':'Other_subtype', 
                'CNV_high':'Other_subtype', 
                'POLE':'Other_subtype'}

annotations['Genomics_subtype'] = bf.binarizeCategorical(clinical, 
                                                         'Genomics_subtype', 
                                                         genomics_map)

## Step 3: Perform outliers analysis
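
The `iqrs=1.5` and `up_or_down='up'` arguments below suggest that a value counts as an outlier when it sits more than 1.5 interquartile ranges above a typical value for that gene. The sketch below illustrates that idea on toy data. It assumes the threshold is median + iqrs × IQR, computed per gene across samples, with genes as rows and samples as columns. The function name is hypothetical and this is not the `blsh` implementation.

```python
import pandas as pd


def flag_up_outliers(values, iqrs=1.5):
    """Mark values above median + iqrs * IQR, computed separately per row."""
    q1 = values.quantile(0.25, axis=1)
    q3 = values.quantile(0.75, axis=1)
    threshold = values.median(axis=1) + iqrs * (q3 - q1)
    # Comparisons against NaN are False, so missing values are never flagged.
    return values.gt(threshold, axis=0)


# Toy abundance table: rows are genes, columns are samples.
toy = pd.DataFrame(
    {"S1": [0.1, 1.0], "S2": [0.0, 1.1], "S3": [-0.1, 0.9],
     "S4": [0.2, 1.1], "S5": [5.0, 1.0]},
    index=["GENE_A", "GENE_B"],
)
flags = flag_up_outliers(toy)
print(flags)
```

Here only the `GENE_A` value in sample `S5` is flagged, because it is far above the rest of that gene's measurements.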

In [37]:
outliers_prot = blsh.make_outliers_table(proteomics, iqrs=1.5, 
                                         up_or_down='up', 
                                         aggregate=False, 
                                         frac_table=False)

outliers_trans = blsh.make_outliers_table(transcriptomics, iqrs=1.5, 
                                          up_or_down='up', 
                                          aggregate=False, 
                                          frac_table=False)



## Step 4: Wrap your A/B test into the outliers analysis, and create a table
First for proteomics, and then transcriptomics.
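
For intuition about what "testing rows for enrichment" can mean, one common choice is a one-sided Fisher's exact test on a 2×2 table of outlier counts in the group of interest versus the other group. Whether `blsh.compare_groups_outliers` builds its tables exactly this way is an assumption here. The sketch uses only the standard library, and the counts are invented.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a: outliers in the group of interest, b: non-outliers in that group
    c: outliers in the other group,       d: non-outliers in the other group
    Returns the probability of seeing at least `a` outliers in the group of
    interest if outliers were spread across the two groups at random.
    """
    row1 = a + b            # size of the group of interest
    col1 = a + c            # total number of outliers
    total = a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(col1, k) * comb(total - col1, row1 - k)
                     for k in range(a, upper + 1))
    return favourable / comb(total, row1)


# Invented counts: 8 of 10 in one group are outliers, versus 1 of 10 in the other.
p_value = fisher_exact_greater(8, 2, 1, 9)
print(round(p_value, 5))
```

With many genes tested per attribute, p-values like this need multiple-testing correction, which is consistent with the `_enrichment_FDR` suffix on the result columns later in the notebook.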

In [38]:
results_prot = blsh.compare_groups_outliers(outliers_prot, 
                                            annotations)

No rows had outliers in at least 0.3 of Proteomics_Tumor_Normal Normal_Tumor samples
Testing 14 rows for enrichment in Proteomics_Tumor_Normal Other_tumor samples
No rows had outliers in at least 0.3 of Country Other samples
No rows had outliers in at least 0.3 of Country US samples
Testing 2 rows for enrichment in Histologic_Grade_FIGO High_grade samples
Testing 1 rows for enrichment in Histologic_Grade_FIGO Low_grade samples
No rows had outliers in at least 0.3 of Myometrial_invasion_Specify under_50% samples
Testing 7 rows for enrichment in Myometrial_invasion_Specify 50%_or_more samples
No rows had outliers in at least 0.3 of Histologic_type Endometrioid samples
Testing 626 rows for enrichment in Histologic_type Serous samples
No rows had outliers in at least 0.3 of Path_Stage_Primary_Tumor-pT Not_FIGO_III samples
Testing 241 rows for enrichment in Path_Stage_Primary_Tumor-pT FIGO_III samples
No rows had outliers in at least 0.3 of Path_Stage_Reg_Lymph_Nodes-pN Not_FIGO_III samples

In [39]:
results_trans = blsh.compare_groups_outliers(outliers_trans, 
                                             annotations)

KeyError: "['S139_outliers', 'S146_outliers', 'S131_outliers', 'S135_outliers', 'S113_outliers', 'S147_outliers', 'S110_outliers', 'S132_outliers', 'S115_outliers', 'S136_outliers', 'S133_outliers', 'S141_outliers', 'S151_outliers', 'S109_outliers', 'S149_outliers', 'S116_outliers', 'S118_outliers', 'S126_outliers', 'S134_outliers', 'S144_outliers', 'S148_outliers', 'S140_outliers', 'S142_outliers', 'S112_outliers', 'S117_outliers', 'S138_outliers', 'S152_outliers', 'S153_outliers', 'S137_outliers', 'S145_outliers', 'S130_outliers', 'S150_outliers', 'S143_outliers', 'S123_outliers', 'S114_outliers'] not in index"
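
The `KeyError` indicates that the annotations and the transcriptomics outliers table do not cover the same set of samples. One possible workaround is to restrict both tables to the samples they share before comparing. The sketch below uses toy stand-ins and assumes, based on the error message, that outlier columns are named `<sample>_outliers` and that the annotations are indexed by the bare sample ID.

```python
import pandas as pd

# Toy stand-ins; the real tables come from blsh.make_outliers_table and clinical.
toy_outliers = pd.DataFrame(
    [[1, 0], [0, 1]],
    index=["GENE_A", "GENE_B"],
    columns=["S001_outliers", "S002_outliers"],
)
toy_annotations = pd.DataFrame({"Group": ["A", "B", "A"]},
                               index=["S001", "S002", "S139"])

suffix = "_outliers"
outlier_samples = {c[: -len(suffix)] for c in toy_outliers.columns
                   if c.endswith(suffix)}

# Keep annotation rows that have a matching outlier column, in original order.
shared = [s for s in toy_annotations.index if s in outlier_samples]
unmatched = [s for s in toy_annotations.index if s not in outlier_samples]

annotations_shared = toy_annotations.loc[shared]
outliers_shared = toy_outliers[[s + suffix for s in shared]]
print("Samples without outlier data:", unmatched)
```

Printing the unmatched samples first is worthwhile, since a systematic gap (for example, a whole sample type missing from one table) may call for a different fix than simply dropping them.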

Many of the output values from compare_groups_outliers are NaN, so here we drop the rows that are entirely NaN for visualization purposes.

In [None]:
results_prot = results_prot.dropna(axis=0, how='all')
results_trans = results_trans.dropna(axis=0, how='all')

## Step 5: Visualize these enrichments

In [None]:
sns.heatmap(results_prot)
plt.show()

In [None]:
sns.heatmap(results_trans)
plt.show()

In [27]:
#How can we automate something like this?
'''results_prot = results_prot.drop(['Proteomics_Tumor_Normal_Other_tumor_enrichment_FDR', 
                                  'Histologic_Grade_FIGO_High_grade_enrichment_FDR',
                                  'Histologic_Grade_FIGO_Low_grade_enrichment_FDR', 
                                  'Myometrial_invasion_Specify_50%_or_more_enrichment_FDR'], 
                                 axis=1)'''
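
The columns listed in the commented-out cell above all correspond to comparisons where only a handful of rows were tested. If the goal is to drop sparse columns like these, one way to automate it is `DataFrame.dropna` with `axis=1` and a `thresh` on the number of non-NaN values. This is a sketch under that assumption; the column names and the threshold are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy results table: NaN means the row was not tested for that comparison.
toy_results = pd.DataFrame(
    {"GroupA_enrichment_FDR": [0.01, 0.20, 0.03, 0.50],
     "GroupB_enrichment_FDR": [np.nan, 0.04, np.nan, np.nan]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

min_tested_rows = 3  # hypothetical threshold; tune it for the dataset
# Keep only columns with at least min_tested_rows non-NaN values.
filtered = toy_results.dropna(axis=1, thresh=min_tested_rows)
print(list(filtered.columns))
```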

## Step 6: Determine significant enrichments, and link with cancer drug database.

In [41]:
print("TESTING FOR PROTEOMICS:")
sig_cols = []
for col in results_prot.columns:
    sig_col = bf.significantEnrichments(results_prot, col, 0.025)
    if sig_col is not None:
        sig_cols.append(sig_col)
    else:
        continue

TESTING FOR PROTEOMICS:
14 significant protein enrichments in Proteomics_Tumor_Normal_Other_tumor

1 significant protein enrichment in Histologic_Grade_FIGO_High_grade:

2 significant protein enrichments in Myometrial_invasion_Specify_50%_or_more

538 significant protein enrichments in Histologic_type_Serous

26 significant protein enrichments in Path_Stage_Reg_Lymph_Nodes-pN_FIGO_III

4 significant protein enrichments in FIGO_stage_III_or_IV

9 significant protein enrichments in LVSI_1.0

7 significant protein enrichments in BMI_Healthy

3 significant protein enrichments in Age_Young

5 significant protein enrichments in Tumor_Site_Anterior

2 significant protein enrichments in Tumor_Focality_Multifocal

7 significant protein enrichments in MSI_status_MSI-H

7 significant protein enrichments in Genomics_subtype_MSI-H
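
The source of `bf.significantEnrichments` is not shown here. Judging by how its result is used in the next cell (`col.columns[0]` and `col.index`), it appears to return a one-column DataFrame indexed by gene, or `None` when nothing passes the cutoff. A minimal sketch of such a helper, with a hypothetical name and toy data:

```python
import numpy as np
import pandas as pd


def significant_enrichments(results, column, fdr_cutoff):
    """Return rows whose FDR in `column` is below the cutoff, or None if empty."""
    # NaN compares as False, so untested rows are excluded automatically.
    significant = results.loc[results[column] < fdr_cutoff, [column]]
    if significant.empty:
        return None
    print(f"{len(significant)} significant enrichments in {column}")
    return significant


toy_results = pd.DataFrame(
    {"GroupA_enrichment_FDR": [0.01, np.nan, 0.30],
     "GroupB_enrichment_FDR": [0.40, np.nan, 0.90]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
hits_a = significant_enrichments(toy_results, "GroupA_enrichment_FDR", 0.025)
hits_b = significant_enrichments(toy_results, "GroupB_enrichment_FDR", 0.025)
```

Returning `None` for empty results matches the `if sig_col is not None` check in the cell above.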



In [42]:
for col in sig_cols:
    col_name = col.columns[0]
    gene_name_list = list(col.index)
    enrichment = gp.enrichr(gene_list = gene_name_list, 
                            description=col_name, 
                            gene_sets='KEGG_2019_Human', 
                            outdir='test/enrichr_kegg', #This isn't saving correctly...why is that?
                            cutoff=0.5)
    print(enrichment.res2d)
    barplot(enrichment.res2d, title=col_name)

           Gene_set                                             Term Overlap  \
0   KEGG_2019_Human                    Glycosaminoglycan degradation    1/19   
1   KEGG_2019_Human                                      Ferroptosis    1/40   
2   KEGG_2019_Human                            TNF signaling pathway   1/110   
3   KEGG_2019_Human             Toll-like receptor signaling pathway   1/104   
4   KEGG_2019_Human            Regulation of lipolysis in adipocytes    1/55   
5   KEGG_2019_Human                                      Endocytosis   2/244   
6   KEGG_2019_Human                                      Necroptosis   1/162   
7   KEGG_2019_Human                       Osteoclast differentiation   1/127   
8   KEGG_2019_Human                T cell receptor signaling pathway   1/101   
9   KEGG_2019_Human                           GnRH signaling pathway    1/93   
10  KEGG_2019_Human                  Adipocytokine signaling pathway    1/69   
11  KEGG_2019_Human                     

[192 rows x 10 columns]
           Gene_set                                               Term  \
0   KEGG_2019_Human                            African trypanosomiasis   
1   KEGG_2019_Human                 Cysteine and methionine metabolism   
2   KEGG_2019_Human                              TNF signaling pathway   
3   KEGG_2019_Human                       NF-kappa B signaling pathway   
4   KEGG_2019_Human                                        Spliceosome   
5   KEGG_2019_Human                                            Malaria   
6   KEGG_2019_Human                      Inositol phosphate metabolism   
7   KEGG_2019_Human              Phosphatidylinositol signaling system   
8   KEGG_2019_Human             Fluid shear stress and atherosclerosis   
9   KEGG_2019_Human  AGE-RAGE signaling pathway in diabetic complic...   
10  KEGG_2019_Human                     Cell adhesion molecules (CAMs)   
11  KEGG_2019_Human               Leukocyte transendothelial migration   
12  KEGG_2019_

3     0.009198              0.010725 -12.009321       61.977171  SRP19  
          Gene_set           Term Overlap   P-value  Adjusted P-value  \
0  KEGG_2019_Human        Measles   1/138  0.013753          0.016432   
1  KEGG_2019_Human  RNA transport   1/165  0.016432          0.016432   

   Old P-value  Old Adjusted P-value   Z-score  Combined Score  Genes  
0     0.017814              0.021274 -8.631029       36.997061  EIF3H  
1     0.021274              0.021274 -5.640213       23.172834  EIF3H  
          Gene_set                                    Term Overlap   P-value  \
0  KEGG_2019_Human                   TNF signaling pathway   1/110  0.037876   
1  KEGG_2019_Human                    Rheumatoid arthritis    1/91  0.031423   
2  KEGG_2019_Human                 IL-17 signaling pathway    1/93  0.032104   
3  KEGG_2019_Human                                Ribosome   1/153  0.052344   
4  KEGG_2019_Human             Chemokine signaling pathway   1/190  0.064644   
5  KEGG_201