## BlackSheep Cookbook Exploration

BlackSheep analysis lets researchers find trends in abnormal protein enrichment among patients in CPTAC datasets. This cookbook walks through the steps needed to perform a full BlackSheep analysis, using the CPTAC colon dataset.

## Step 1a: Import Dependencies
First, import the necessary dependencies, including the cptac package used to load the data.

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cptac
import binarization_functions as bf
import blackSheepCPTACmoduleCopy as blsh
import gseapy as gp
from gseapy.plot import barplot, heatmap, dotplot

## Step 1b: Load Data and Choose Omics Table
For this analysis, we will look at results across the proteomics and transcriptomics tables. The clinical table is loaded as well, since it supplies the attributes for the A/B tests.

In [21]:
co = cptac.Colon()
proteomics = co.get_proteomics()
transcriptomics = co.get_transcriptomics()
clinical = co.get_clinical()

Checking that data files are up-to-date...
100% [..................................................................................] 487 / 487
Data check complete.
colon data version: Most recent release

Loading clinical data...
Loading miRNA data...
Loading mutation data...
Loading mutation_binary data...
Loading phosphoproteomics_normal data...
Loading phosphoproteomics_tumor data...
Loading proteomics_normal data...
Loading proteomics_tumor data...
Loading transcriptomics data...


## Step 2: Determine which attributes to A/B test
For this analysis, we will iterate through the columns of the clinical dataset to determine whether any of them show trends in protein enrichment.

In [22]:
# Create a copy of the original clinical DataFrame.
annotations = pd.DataFrame(clinical.copy())

In [23]:
# Drop columns that are irrelevant for A/B testing.
annotations = annotations.drop(['Patient_ID'], axis=1)

In [24]:
already_binary_columns = ['Mutation_Phenotype', 'Tumor.Status', 
                          'Vital.Status', 'Polyps_Present', 
                          'Polyps_History', 'Synchronous_Tumors', 
                          'Perineural_Invasion', 'Lymphatic_Invasion', 
                          'Vascular_Invasion', 'Mucinous', 'Gender', 
                          'Sample_Tumor_Normal']

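# TODO: confirm whether 'pathalogy_T_stage' should be 'pathology_T_stage'.
# The spelling used here must match the column names in the clinical table.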
columns_2_binarize = ['Age', 'Subsite', 
                      'pathalogy_T_stage',
                      'pathalogy_N_stage', 
                      'Stage', 'CEA', 
                      'Transcriptomic_subtype', 
                      'Proteomic_subtype', 
                      'mutation_rate']
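
The two lists above were written by hand. As a rough sketch of how such a split could be derived instead, the hypothetical helper below (`split_by_cardinality` is not part of cptac or the BlackSheep module) counts the distinct non-null values in each column of a toy table. The toy column names and values are made up for illustration.

```python
import pandas as pd

def split_by_cardinality(df):
    """Return (binary, needs_binarizing) column lists based on distinct non-null values."""
    binary, needs_binarizing = [], []
    for col in df.columns:
        n_unique = df[col].nunique(dropna=True)
        if n_unique == 2:
            binary.append(col)
        elif n_unique > 2:
            needs_binarizing.append(col)
        # Columns with 0 or 1 distinct values cannot define an A/B split and are skipped.
    return binary, needs_binarizing

# Toy stand-in for the clinical table (values are illustrative only).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", None],
    "Stage": ["Stage I", "Stage II", "Stage III", "Stage IV"],
    "Site": ["Colon", "Colon", "Colon", "Colon"],
})

binary_cols, multi_cols = split_by_cardinality(toy)
print(binary_cols)  # ['Gender']
print(multi_cols)   # ['Stage']
```

Numeric columns such as `Age` would land in the second list as well, so the output still needs a manual pass to decide between a cutoff and a category mapping.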

## Step 2a: Binarize column values

First, convert columns that are stored as strings but should be numeric.

In [25]:
clinical['CEA'] = pd.to_numeric(clinical['CEA'])
clinical['Age'] = pd.to_numeric(clinical['Age'])
clinical['mutation_rate'] = pd.to_numeric(clinical['mutation_rate'])
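
By default, `pd.to_numeric` raises a `ValueError` when it meets a value it cannot parse. If the clinical columns ever contain placeholder strings (an assumption; the `"Not Reported"` value below is invented for the example), passing `errors="coerce"` turns them into `NaN` instead. A minimal sketch on toy data:

```python
import pandas as pd

# Toy column mixing numeric strings, a placeholder string, and a missing value.
toy = pd.DataFrame({"CEA": ["3.2", "17", "Not Reported", None]})

# errors="coerce" converts anything unparseable to NaN instead of raising.
toy["CEA"] = pd.to_numeric(toy["CEA"], errors="coerce")

print(toy["CEA"].isna().sum())  # 2
```

Coercion hides bad values silently, so it is worth counting the resulting `NaN` entries before binarizing.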

In [26]:
annotations['Age'] = bf.binarizeCutOff(clinical, 'Age', 
                                       730, '2 years or older', 
                                       'Younger than 2 years')
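
The source of `bf.binarizeCutOff` is not shown in this cookbook. Judging only from the labels used in the calls here, the sketch below assumes that the fourth argument labels values at or above the cutoff and the fifth labels values below it. `binarize_cutoff_sketch` is a hypothetical stand-in, not the actual function, and its handling of missing values is also an assumption.

```python
import numpy as np
import pandas as pd

def binarize_cutoff_sketch(df, column, cutoff, label_at_or_above, label_below):
    """Label each value by which side of the cutoff it falls on; keep NaN as NaN."""
    values = df[column]
    labels = pd.Series(
        np.where(values >= cutoff, label_at_or_above, label_below),
        index=df.index,
        dtype=object,
    )
    # Missing measurements stay missing rather than being assigned to a group.
    return labels.where(values.notna())

toy = pd.DataFrame(
    {"CEA": [2.0, 15.0, 40.0, np.nan]},
    index=["S001", "S002", "S003", "S004"],
)
cea_groups = binarize_cutoff_sketch(toy, "CEA", 15, "High_CEA", "Low_CEA")
print(cea_groups)
```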

In [27]:
subsite_map = {'Sigmoid Colon':'Sigmoid_Colon', 
               'Ascending Colon':'Other_site', 
               'Cecum ':'Other_site', 
               'Descending Colon':'Other_site', 
               'Hepatix Flexure':'Other_site', 
               'Splenic Flexure':'Other_site', 
               'Tranverse Colon':'Other_site'}

annotations['Subsite'] = bf.binarizeCategorical(clinical, 
                                                'Subsite', 
                                                subsite_map)
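
A category mapping like this depends on exact string matches, including stray whitespace such as the trailing space in `'Cecum '`. The sketch below shows one way a categorical binarization could be written with `Series.map`. It is a hypothetical stand-in for `bf.binarizeCategorical`, whose source is not shown. Stripping whitespace and reporting unmapped categories are design choices of this sketch, not confirmed behavior of the original.

```python
import pandas as pd

def binarize_categorical_sketch(df, column, mapping):
    """Map categories to group labels; unmapped or missing values become NaN."""
    # Strip whitespace on both sides so 'Cecum ' and 'Cecum' are treated alike.
    cleaned = df[column].str.strip()
    cleaned_map = {key.strip(): value for key, value in mapping.items()}
    unmapped = set(cleaned.dropna()) - set(cleaned_map)
    if unmapped:
        print(f"Unmapped categories in {column!r}: {sorted(unmapped)}")
    return cleaned.map(cleaned_map)

# Toy data; 'Rectum' is deliberately absent from the mapping.
toy = pd.DataFrame({"Subsite": ["Sigmoid Colon", "Cecum", "Rectum", None]})
toy_map = {"Sigmoid Colon": "Sigmoid_Colon", "Cecum ": "Other_site"}

subsite_groups = binarize_categorical_sketch(toy, "Subsite", toy_map)
print(subsite_groups.tolist())
```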

In [28]:
pathalogy_T_stage_map = {'T3':'T3orT2', 'T2':'T3orT2', 
                         'T4a':'T4', 'T4b':'T4'}

annotations['pathalogy_T_stage'] = bf.binarizeCategorical(clinical, 
                                                          'pathalogy_T_stage', 
                                                          pathalogy_T_stage_map)

In [29]:
pathalogy_N_stage_map = {'N0':'N0', 'N1':'N1orN2',
                         'N1a':'N1orN2', 'N1b':'N1orN2', 
                         'N2a':'N1orN2', 'N2b':'N1orN2'}

annotations['pathalogy_N_stage'] = bf.binarizeCategorical(clinical, 
                                                          'pathalogy_N_stage', 
                                                          pathalogy_N_stage_map)

In [30]:
stage_map = {'Stage I':'StageIorII', 
             'Stage II':'StageIorII', 
             'Stage III':'StageIIIorIV', 
             'Stage IV':'StageIIIorIV'}

annotations['Stage'] = bf.binarizeCategorical(clinical, 
                                              'Stage', 
                                              stage_map)

In [31]:
annotations['CEA'] = bf.binarizeCutOff(clinical, 
                                       'CEA', 15, 
                                       'High_CEA', 
                                       'Low_CEA')

In [32]:
Transcriptomic_subtype_map = {'CMS1':'CMS1or2', 
                              'CMS2':'CMS1or2', 
                              'CMS3':'CMS3or4', 
                              'CMS4':'CMS3or4'}

annotations['Transcriptomic_subtype'] = bf.binarizeCategorical(clinical, 
                                                               'Transcriptomic_subtype', 
                                                               Transcriptomic_subtype_map)

In [33]:
Proteomic_subtype_map = {'A':'AorBorC', 
                         'B':'AorBorC', 
                         'C':'AorBorC', 
                         'D':'DorE', 
                         'E':'DorE'}

annotations['Proteomic_subtype'] = bf.binarizeCategorical(clinical, 
                                                          'Proteomic_subtype', 
                                                          Proteomic_subtype_map)

In [34]:
annotations['mutation_rate'] = bf.binarizeCutOff(clinical, 
                                                 'mutation_rate', 50, 
                                                 'High_Mutation_Rate', 
                                                 'Low_Mutation_Rate')

## Step 3: Perform outliers analysis

In [35]:
outliers_prot = blsh.make_outliers_table(proteomics, iqrs=1.5, 
                                         up_or_down='up', 
                                         aggregate=False, 
                                         frac_table=False)

outliers_trans = blsh.make_outliers_table(transcriptomics, iqrs=1.5, 
                                          up_or_down='up', 
                                          aggregate=False, 
                                          frac_table=False)
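
To make the `iqrs=1.5` and `up_or_down='up'` arguments concrete, here is a minimal sketch of IQR-based outlier calling on a toy table with samples as rows and genes as columns. It assumes the threshold for each gene is median + `iqrs` × IQR. The copied `blsh` module may define the threshold differently, so treat `flag_up_outliers` as an illustration rather than its implementation.

```python
import numpy as np
import pandas as pd

def flag_up_outliers(values, iqrs=1.5):
    """Return 1.0 where a value exceeds median + iqrs * IQR of its column, else 0.0."""
    median = values.median()
    iqr = values.quantile(0.75) - values.quantile(0.25)
    threshold = median + iqrs * iqr
    flags = (values > threshold).astype(float)
    # Keep missing measurements as NaN instead of calling them non-outliers.
    return flags.where(values.notna())

# Toy abundance table (values are illustrative only).
toy = pd.DataFrame(
    {
        "GENE_A": [1.0, 1.1, 0.9, 1.0, 5.0],
        "GENE_B": [0.2, np.nan, 0.1, 0.25, 0.2],
    },
    index=["S001", "S002", "S003", "S004", "S005"],
)

toy_flags = flag_up_outliers(toy, iqrs=1.5)
print(toy_flags)
```

In this toy table only `S005` is flagged for `GENE_A`, and the missing `GENE_B` value for `S002` stays `NaN`, which mirrors the mix of 0.0, 1.0, and empty cells in the outliers table.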

In [45]:
blsh.make_outliers_table

In [36]:
outliers_prot

Unnamed: 0,A1BG,A1CF,A2M,AAAS,AACS,AAGAB,AAK1,AAMDC,AAMP,AAR2,...,ZNHIT6,ZNRD1,ZNRF2,ZPR1,ZRANB2,ZW10,ZWILCH,ZWINT,ZYX,ZZEF1
S002_outliers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
S003_outliers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,0.0,,0.0,0.0,0.0,0.0,,0.0,0.0
S004_outliers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
S005_outliers,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0
S006_outliers,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
S007_outliers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
S008_outliers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.0,1.0,,0.0,0.0
S009_outliers,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,,,,0.0,0.0,0.0,,,0.0,0.0
S010_outliers,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
S011_outliers,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.0,0.0,,0.0,0.0


## Step 4: Wrap your A/B test into the outliers analysis, and create a table
First for proteomics, and then for transcriptomics.

In [44]:
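# This call currently fails. The KeyError below lists '<sample>_outliers' labels
# that cannot be found, which points to a mismatch between the samples in
# annotations and those in the outliers table, not only to DataFrame dimensions.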
results_prot = blsh.compare_groups_outliers(outliers_prot, 
                                            annotations)

KeyError: "['S015_outliers', 'S017_outliers', 'S013_outliers', 'S088_outliers', 'S090_outliers', 'S081_outliers', 'S080_outliers', 'S099_outliers'] not in index"

In [43]:
results_trans = blsh.compare_groups_outliers(outliers_trans, 
                                             annotations)

KeyError: "['S088_outliers'] not in index"
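
Both errors name `<sample>_outliers` labels that are missing from an index. One way to diagnose this, and possibly work around it, is to restrict the annotations to samples that have a matching outlier label before calling `compare_groups_outliers`. The sketch below shows the idea on toy data. It assumes that the annotations are indexed by sample ID and that the outlier labels are row labels, neither of which is confirmed for the real tables.

```python
import pandas as pd

# Toy stand-ins: S015 has annotations but no outlier row.
toy_outliers = pd.DataFrame(
    {"A1BG": [0.0, 1.0]},
    index=["S002_outliers", "S003_outliers"],
)
toy_annotations = pd.DataFrame(
    {"Stage": ["StageIorII", "StageIIIorIV", "StageIorII"]},
    index=["S002", "S003", "S015"],
)

# Build the label each annotated sample is expected to have in the outliers table.
expected_labels = pd.Index([f"{sample}_outliers" for sample in toy_annotations.index])
has_outlier_row = expected_labels.isin(toy_outliers.index)

# Samples that would trigger the KeyError.
missing_samples = toy_annotations.index[~has_outlier_row].tolist()
print(missing_samples)  # ['S015']

# Keep only the samples present in both tables.
annotations_aligned = toy_annotations.loc[has_outlier_row]
print(annotations_aligned.index.tolist())  # ['S002', 'S003']
```

If the real outliers table stores these labels as columns instead of rows, the same check applies with `toy_outliers.columns` in place of `toy_outliers.index`.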

Many of the output values from compare_groups_outliers are NaN, so here we drop the rows that contain only NaN values for visualization purposes.

In [None]:
results_prot = results_prot.dropna(axis=0, how='all')
results_trans = results_trans.dropna(axis=0, how='all')

## Step 5: Visualize these enrichments

In [None]:
sns.heatmap(results_prot)
plt.show()

In [None]:
sns.heatmap(results_trans)
plt.show()
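
If the results tables hold FDR values, as the `_enrichment_FDR` column names used later suggest, small values are the interesting ones and they are hard to distinguish on a linear color scale. A common option is to plot -log10(FDR) instead. The sketch below shows only the transform, on a toy table with made-up gene names and a column name modeled on that pattern.

```python
import numpy as np
import pandas as pd

# Toy FDR table (genes as rows); names and values are illustrative only.
toy_fdr = pd.DataFrame(
    {"Stage_StageIIIorIV_enrichment_FDR": [0.001, 0.2, np.nan]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Larger values now mean stronger enrichment; NaN entries stay NaN.
neg_log_fdr = -np.log10(toy_fdr)
print(neg_log_fdr)

# The transformed table can then be passed to the same plotting call,
# for example sns.heatmap(neg_log_fdr).
```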

In [None]:
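# TODO: automate this step.
# NOTE: none of these names match the clinical columns binarized above
# (for example 'Histologic_Grade_FIGO'), so check them against
# results_prot.columns before running this cell.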
results_prot = results_prot.drop(['Proteomics_Tumor_Normal_Other_tumor_enrichment_FDR', 
                                  'Histologic_Grade_FIGO_High_grade_enrichment_FDR',
                                  'Histologic_Grade_FIGO_Low_grade_enrichment_FDR', 
                                  'Myometrial_invasion_Specify_50%_or_more_enrichment_FDR'], 
                                 axis=1)
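
One possible way to automate this, instead of listing columns by name, is to drop every column that has no value below a chosen FDR cutoff. The sketch below assumes that is the intent of the manual drop. `drop_uninformative_columns` is a hypothetical helper, the cutoff of 0.025 is borrowed from Step 6, and the toy column names are invented.

```python
import numpy as np
import pandas as pd

def drop_uninformative_columns(results, fdr_cutoff=0.025):
    """Keep only columns with at least one value below the cutoff."""
    # Comparisons with NaN evaluate to False, so all-NaN columns are dropped too.
    keep = [col for col in results.columns if (results[col] < fdr_cutoff).any()]
    return results[keep]

toy_results = pd.DataFrame(
    {
        "Stage_StageIIIorIV_enrichment_FDR": [0.001, 0.5],
        "Gender_Male_enrichment_FDR": [0.4, np.nan],
    },
    index=["GENE_A", "GENE_B"],
)

filtered = drop_uninformative_columns(toy_results)
print(filtered.columns.tolist())  # ['Stage_StageIIIorIV_enrichment_FDR']
```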

## Step 6: Determine significant enrichments, and link with cancer drug database.

In [None]:
print("TESTING FOR PROTEOMICS:")
sig_cols = []
for col in results_prot.columns:
    sig_col = bf.significantEnrichments(results_prot, col, 0.025)
    if sig_col is not None:
        sig_cols.append(sig_col)
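
The loop relies on `bf.significantEnrichments` returning either `None` or an object with `.columns` and `.index`, which suggests a single-column DataFrame indexed by gene. Its source is not shown, so the sketch below is a guess at that contract: keep the rows of one column whose value is below the cutoff, or return `None` when nothing passes. The function name, the sorting, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd

def significant_enrichments_sketch(results, column, fdr_cutoff):
    """Return a one-column DataFrame of values below the cutoff, or None if empty."""
    significant = results.loc[results[column] < fdr_cutoff, [column]]
    if significant.empty:
        return None
    return significant.sort_values(column)

toy_results = pd.DataFrame(
    {
        "Stage_StageIIIorIV_enrichment_FDR": [0.001, 0.3, np.nan, 0.02],
        "Gender_Male_enrichment_FDR": [0.4, 0.9, 0.6, np.nan],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

stage_hits = significant_enrichments_sketch(
    toy_results, "Stage_StageIIIorIV_enrichment_FDR", 0.025
)
gender_hits = significant_enrichments_sketch(
    toy_results, "Gender_Male_enrichment_FDR", 0.025
)
print(list(stage_hits.index))  # ['GENE_A', 'GENE_D']
print(gender_hits)             # None
```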

In [None]:
for col in sig_cols:
    col_name = col.columns[0]
    gene_name_list = list(col.index)
    enrichment = gp.enrichr(gene_list=gene_name_list, 
                            description=col_name, 
                            gene_sets='KEGG_2019_Human', 
                            outdir='test/enrichr_kegg', #This isn't saving correctly...why is that?
                            cutoff=0.5)
    print(enrichment.res2d)
    barplot(enrichment.res2d, title=col_name)
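
The inline comment above notes that results are not being saved to `outdir`. Whatever the cause, one workaround that does not depend on gseapy is to write each results table to disk yourself. The sketch below uses a toy DataFrame as a stand-in for `enrichment.res2d`. The helper name, the file naming, and the toy columns are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

def save_enrichment_table(table, outdir, name):
    """Write a results table to <outdir>/<name>.csv, creating the directory if needed."""
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, f"{name}.csv")
    table.to_csv(path, index=False)
    return path

# Toy stand-in for enrichment.res2d (values are illustrative only).
toy_res2d = pd.DataFrame(
    {"Term": ["Pathway one", "Pathway two"], "Adjusted P-value": [0.01, 0.2]}
)

# A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_enrichment_table(
        toy_res2d, os.path.join(tmp, "enrichr_kegg"), "Stage_StageIIIorIV"
    )
    saved_ok = os.path.exists(saved_path)
    round_trip = pd.read_csv(saved_path)

print(saved_ok)  # True
```

In the loop above, this would amount to calling the helper with `enrichment.res2d`, a real output directory, and `col_name`.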