# Make Figure 4

This notebook takes all the trans genes that have both positive and negative results across cancers and runs a GSEA using Reactome. It then takes a subset of genes from the top-hit pathway (Hemostasis) and maps them onto a large circle heat map. This heat map focuses on coagulation- and urokinase-related genes.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats
import re
import sys 
import statsmodels.stats.multitest

import gseapy as gp
from gseapy.plot import barplot, dotplot

import plot_utils as p 


# Step 1: Find Trans proteins with opposite effects in different cancers 

Load the data frame with all of the genes that are FDR significant. This data frame was made in the Make_Supplemental_Tables notebook. See
https://github.com/PayneLab/WhenMutationsDontMatter/blob/master/EGFR/Make_Tables/Make_Supplemental_Tables.ipynb

In [52]:
FDR_sig = pd.read_csv("Make_Tables/csv_files/Supplemental_Table_EGFR_sig_only.csv")
FDR_sig = FDR_sig.set_index("Comparison")
FDR_sig.index.str.startswith("PHLDA3")

In [39]:
FDR_sig.max(axis=1)
FDR_sig.min(axis = 1)

Comparison
PHLDA1    3.507071e-21
GRB2     -6.108891e-01
SOCS2     3.420388e-06
CDH4      3.420388e-06
DAB2     -5.564015e-01
              ...     
CLTC      4.813589e-02
PLEC      4.824560e-02
LRRK2    -2.674570e-01
MBD1     -2.660975e-01
RRP12     4.993781e-02
Length: 6230, dtype: float64

In [3]:
def HasPosNeg(row):
    """Return True if the row has at least one positive and one negative value."""
    hasPos = False
    hasNeg = False

    for item in row:
        if pd.isnull(item):
            continue  # skip missing values
        if item < 0:
            hasNeg = True
        if item > 0:
            hasPos = True

    return hasPos and hasNeg

Subset the data frame to include only trans genes that have opposite effects in different cancers, using the apply function.

In [17]:
col = ["Correlation_GBM","Correlation_ccRCC","Correlation_OV","Correlation_BR","Correlation_LUAD","Correlation_HNSCC","Correlation_LSCC","Correlation_CO"]
FDR_corr = FDR_sig[col].copy()  # copy so the new column is not set on a slice of FDR_sig
FDR_corr["Pos_Neg"] = FDR_corr.apply(HasPosNeg, axis = 1)
FDR_corr_True = FDR_corr[FDR_corr['Pos_Neg']==True]
FDR_corr_True.head(20)


Comparison,Correlation_GBM,Correlation_ccRCC,Correlation_OV,Correlation_BR,Correlation_LUAD,Correlation_HNSCC,Correlation_LSCC,Correlation_CO,Pos_Neg
DAB2,-0.556402,,,0.326055,,,,,True
PLA2G15,-0.556624,-0.298029,,0.274185,,,,,True
CTSC,-0.546285,-0.302316,,0.26694,0.30276,,,,True
SCPEP1,-0.531494,-0.386583,,0.399187,,,,,True
FAM129B,-0.514984,,,0.344093,,,0.360092,,True
PPP1R18,-0.497202,,,0.359142,,,,,True
NPC2,-0.498791,-0.319133,,0.279599,0.29252,,,,True
CTSB,-0.496895,,,0.341048,,,,,True
KYNU,-0.495517,,,0.373575,-0.341363,,,,True
HSD17B11,-0.491843,0.272218,,0.481667,,-0.402146,,,True
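
The row-wise check above can also be written without `apply`. The following is a minimal sketch on a toy frame; the values, the placeholder gene names, and the helper name `has_pos_and_neg` are illustrative and not part of the notebook. Comparisons against NaN evaluate to False, so missing correlations are ignored just as in `HasPosNeg`.

```python
import numpy as np
import pandas as pd

def has_pos_and_neg(corr: pd.DataFrame) -> pd.Series:
    """Return True for rows holding at least one positive and one negative value."""
    # NaN > 0 and NaN < 0 are both False, so missing values are skipped.
    return corr.gt(0).any(axis=1) & corr.lt(0).any(axis=1)

# Toy data (hypothetical values, same column naming as the notebook)
toy = pd.DataFrame(
    {
        "Correlation_GBM": [-0.55, 0.40, np.nan],
        "Correlation_BR": [0.33, 0.25, -0.30],
        "Correlation_CO": [np.nan, np.nan, np.nan],
    },
    index=["DAB2", "GENE_B", "GENE_C"],
)
mask = has_pos_and_neg(toy)
opposite = toy[mask]  # rows with opposite-sign correlations
```

On the real table this would be applied to `FDR_sig[col]`; the boolean mask plays the role of the `Pos_Neg` column.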


In [31]:
def Pvalue_sig(row):
    """Count the non-null values in the row that are below 0.05."""
    numSig = 0

    for item in row:
        if pd.isnull(item):
            continue
        if item < 0.05:
            numSig += 1

    return numSig
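
A vectorized way to get the same per-row count is sketched below on toy data. The `P_value_*` column names are hypothetical stand-ins, since the notebook does not show which columns `Pvalue_sig` would be applied to.

```python
import numpy as np
import pandas as pd

# Toy p-value table; column names and values are hypothetical.
pvals = pd.DataFrame(
    {
        "P_value_GBM": [0.001, 0.20, np.nan],
        "P_value_BR": [0.030, 0.04, np.nan],
        "P_value_CO": [np.nan, 0.50, 0.01],
    },
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# NaN < 0.05 is False, so missing values do not contribute to the count.
num_sig = pvals.lt(0.05).sum(axis=1)
```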

In [50]:
df = FDR_corr_True.drop("Pos_Neg",axis = 1)
diff = df.max(axis=1) - df.min(axis = 1)
diff.sort_values(ascending = False).head(20)

Comparison
AADAT       1.053837
EHBP1       0.976850
ARHGAP10    0.973586
HSD17B11    0.973510
RARA        0.959860
CMBL        0.954009
CELSR1      0.950254
PPP2R3A     0.949914
HAAO        0.944973
TBC1D10C    0.941293
CTNND2      0.938110
FAM49A      0.934342
LPIN1       0.933992
SCPEP1      0.930681
ACSL4       0.930283
TES         0.921650
DSC2        0.919490
GLIPR2      0.917386
CXXC5       0.917084
CRYBG3      0.914881
dtype: float64

In [30]:
abs_val = FDR_corr_True.abs()
abs_val.sum(1).sort_values(ascending = False).head(20)

Comparison
MYO10       3.264387
KIF13B      3.140698
CD109       3.068188
IL16        2.949050
CGGBP1      2.940795
RCSD1       2.912306
CNNM4       2.851991
PLCG2       2.845178
BAG2        2.837454
RIN3        2.834336
BIN2        2.806953
SDC1        2.803257
WIPF1       2.779699
ITGB1       2.743227
MICALL1     2.735946
PSTPIP1     2.734371
CELSR1      2.669411
TRIM26      2.661191
HSD17B11    2.647874
ALDH1L1     2.639734
dtype: float64
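
Note that `FDR_corr_True` still carries the boolean `Pos_Neg` column in the cell above, so depending on the pandas version that column may enter the row sums. A minimal sketch, on toy data with hypothetical values, of restricting the sum to the correlation columns:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking FDR_corr_True (hypothetical values)
toy = pd.DataFrame(
    {
        "Correlation_GBM": [-0.5, 0.4],
        "Correlation_BR": [0.3, np.nan],
        "Pos_Neg": [True, False],
    },
    index=["GENE_A", "GENE_B"],
)

# Keep only the correlation columns before taking absolute values.
corr_only = toy.drop(columns="Pos_Neg")
abs_sum = corr_only.abs().sum(axis=1).sort_values(ascending=False)
```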

The manuscript mentions 945 trans proteins that have opposite effects in different cancers. Here is the derivation of that number.

In [5]:
pos_neg_prot = FDR_corr_True.index.tolist()
print("Total number of trans proteins with opposite effects in different cancers is " + str(len(pos_neg_prot)))

Total number of trans proteins with opposite effects in different cancers is 945


# Step 2: Run GSEA

In [6]:
pos_neg_enr = gp.enrichr(gene_list = pos_neg_prot, description='Tumor_partition', gene_sets='Reactome_2016', 
                       outdir='test/enrichr_Reactome')
pos_neg_enr.res2d.head(5)

,Gene_set,Term,Overlap,P-value,Adjusted P-value,Old P-value,Old Adjusted P-value,Odds Ratio,Combined Score,Genes
0,Reactome_2016,Hemostasis Homo sapiens R-HSA-109582,80/552,2.1650839999999998e-19,3.312578e-16,0,0,3.067249,131.82013,ITGB1;DOCK5;ITGAM;DGKB;DGKA;PROS1;ITGB3;SERPIN...
1,Reactome_2016,Innate Immune System Homo sapiens R-HSA-168249,98/807,4.949186e-18,3.786127e-15,0,0,2.570104,102.411734,AHCYL1;WIPF1;WIPF2;PROS1;ARAF;ICAM3;FGF1;CLU;R...
2,Reactome_2016,Formation of Fibrin Clot (Clotting Cascade) Ho...,20/39,7.420408000000001e-17,3.784408e-14,0,0,10.853344,403.090083,FGB;FGA;VWF;F10;SERPIND1;SERPINC1;PROS1;FGG;F1...
3,Reactome_2016,Immune System Homo sapiens R-HSA-168256,145/1547,5.663493e-16,2.166286e-13,0,0,1.983699,69.642373,AHCYL1;NCF1;NCF2;WIPF1;PROS1;WIPF2;NCF4;ARAF;I...
4,Reactome_2016,Response to elevated platelet cytosolic Ca2+ H...,28/110,1.385496e-13,4.239617e-11,0,0,5.387205,159.501943,ITIH4;PROS1;ITGB3;SERPINE1;F13A1;PLG;A1BG;CLU;...
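
The `P-value` column is an over-representation statistic: it asks how surprising an overlap such as 20/39 is given the size of the input list. As a rough illustration of that idea only, and not a reproduction of Enrichr's computation, the sketch below evaluates a hypergeometric upper tail with `scipy.stats.hypergeom`. The helper name `overlap_pvalue` and the background size of 20,000 genes are assumptions, so the result is not expected to match the table.

```python
from scipy.stats import hypergeom

def overlap_pvalue(overlap: int, set_size: int, list_size: int, background: int) -> float:
    """P(X >= overlap) when list_size genes are drawn from a background
    that contains set_size pathway genes."""
    # sf(k - 1) is the upper tail that includes k itself.
    return float(hypergeom.sf(overlap - 1, background, set_size, list_size))

# 20 of 39 clotting-cascade genes among the 945 input genes.
# The background size is an assumed value, not taken from Enrichr.
p_clot = overlap_pvalue(overlap=20, set_size=39, list_size=945, background=20000)
```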


In [7]:
#get just the clotting cascade genes (row 2 of the results) and add urokinase genes
pos_neg_df = pos_neg_enr.res2d
coag = pos_neg_df.iloc[2,9]  #column 9 is "Genes", a semicolon-separated string
coag = coag.split(';')
upa = ["F3","PLAUR","PLAU","PLG","MMP9","MMP12","SERPINE1"]
coag_upa =  coag + upa
len(coag_upa)

27
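
Selecting the pathway by position (`iloc[2,9]`) depends on the row order of the enrichment results. A sketch of selecting it by term name instead is shown below on a toy stand-in for `pos_neg_enr.res2d`; the term strings and gene lists are abbreviated and hypothetical.

```python
import pandas as pd

# Toy stand-in for the enrichment results (hypothetical, abbreviated)
res = pd.DataFrame(
    {
        "Term": [
            "Hemostasis (toy)",
            "Formation of Fibrin Clot (Clotting Cascade) (toy)",
        ],
        "Genes": ["ITGB1;PLG;SERPINE1", "FGB;FGA;VWF"],
    }
)

# Pick the row by the start of its term name rather than by position.
is_clot = res["Term"].str.startswith("Formation of Fibrin Clot")
coag = res.loc[is_clot, "Genes"].iloc[0].split(";")

upa = ["F3", "PLAUR", "PLAU", "PLG"]
# dict.fromkeys drops any duplicates while keeping first-seen order.
coag_upa = list(dict.fromkeys(coag + upa))
```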

# Step 3: Make data frame for Figure 4

In [8]:
#Get the appended (long-format) version of the df with FDR-significant trans results for all cancer types
df_FDR_append = pd.read_csv("Make_Tables/csv_files/sig_prot_heatmap_EGFR.csv")

#subset dataframe to include only the genes desired for the figure
df_FDR_append = df_FDR_append[df_FDR_append.Comparison.isin(coag_upa)].copy()
df_FDR_append

,Comparison,Correlation,P_Value,Cancer
75,PROCR,-0.470784,0.000120,GBM
198,PLAUR,-0.425639,0.000605,GBM
305,FGB,-0.404936,0.001126,GBM
349,FGG,-0.396275,0.001494,GBM
445,FGA,-0.380889,0.002412,GBM
...,...,...,...,...
8714,SERPIND1,0.351169,0.014485,CO
8755,VWF,0.341362,0.017933,CO
8830,F3,0.400421,0.023047,CO
8920,A2M,0.311803,0.032411,CO


Add a new column to serve as a unique index, then order the rows by that index. This way genes will be grouped by coagulation factors, regulators, and urokinase genes.

In [9]:

df_FDR_append["Index"] = df_FDR_append["Comparison"] + " " + df_FDR_append["Cancer"]
df_FDR_append = df_FDR_append.set_index("Index")
df_ordered = df_FDR_append.reindex(["F2 GBM","F3 GBM","F9 GBM","F10 GBM","F11 GBM","F13A1 GBM","F13B GBM","KLKB1 GBM","VWF CO","FGA GBM","FGB GBM","FGG GBM","SERPINC1 GBM", "SERPIND1 GBM","SERPING1 GBM","A2M GBM","PROS1 GBM","PROC OV","PROCR GBM","THBD GBM","KNG1 GBM","PLAUR GBM","PLAU GBM","PLG GBM","MMP9 BR","MMP12 BR","SERPINE1 GBM",
                                "F2 BR","F9 BR","F10 BR","F11 BR","F13A1 BR","F13B BR","FGA BR","FGB BR","FGG BR", "SERPIND1 BR","SERPING1 BR","A2M BR","PROS1 BR","PROCR BR","KLKB1 BR", "PLAUR BR","PLAU BR","PLG BR","SERPINE1 BR",
                               "VWF HNSCC","THBD HNSCC","PLAUR HNSCC","PLAU HNSCC","SERPINE1 HNSCC",
                               "F9 LUAD","F13A1 LUAD", "F13B LUAD", "SERPIND1 LUAD","PROS1 LUAD","PROC LUAD","VWF LUAD",
                                "PROCR ccRCC",
                                "SERPIND1 OV","PROC OV",
                               "F3 CO","SERPINC1 CO", "SERPIND1 CO","A2M CO","KNG1 CO","KLKB1 CO"])

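The indexing-and-ordering step can be seen in miniature below. This is a sketch on toy data with hypothetical values: `reindex` returns rows in the order of the labels given, and any label without a match in the data becomes an all-NaN row.

```python
import pandas as pd

# Toy long-format table (hypothetical values)
long_df = pd.DataFrame(
    {
        "Comparison": ["F2", "PLAU", "F2"],
        "Cancer": ["GBM", "GBM", "BR"],
        "Correlation": [-0.40, -0.35, 0.30],
        "P_Value": [0.001, 0.004, 0.020],
    }
)

# Build a unique "gene cancer" label and use it as the index.
long_df["Index"] = long_df["Comparison"] + " " + long_df["Cancer"]
long_df = long_df.set_index("Index")

# The label list fixes the row order; "F3 GBM" has no match in the toy data.
order = ["F2 GBM", "F3 GBM", "PLAU GBM", "F2 BR"]
ordered = long_df.reindex(order)
```

Because unmatched labels yield NaN in every column, including `Comparison` and `Cancer`, it may be worth checking `ordered.isna().all(axis=1)` before plotting.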


# Step 4: Plot Figure 4

In [10]:
legend_min = df_ordered["P_Value"].min()
#Make plot using plot utils
p.plotCircleHeatMap(df_ordered, circle_var = "P_Value",color_var = "Correlation", x_axis = "Comparison", y_axis = "Cancer", plot_width= 1000, plot_height = 500, legend_min = legend_min, legend_max = 0.05, font_size = 10, show_legend = True , save_png = "png_files/Figure4.png")

# Check if blanks are due to no data 

The following code chunks show that these cancer/gene combinations don't have data: colon THBD, kidney MMP12, and ovarian MMP12 (as mentioned in the EGFR Figure 2 legend). The output also lists F3 as missing for ovarian.

In [11]:
#Get the appended (long-format) version of the df with all proteins
df_all_prot_append = pd.read_csv("Make_Tables/csv_files/all_prot_heatmap_EGFR.csv")
df_all_prot_append 


,Comparison,Correlation,P_Value,Cancer
0,EGFR,1.000000,0.000000e+00,GBM
1,PHLDA1,0.816848,3.507071e-21,GBM
2,GRB2,-0.610889,6.729990e-08,GBM
3,SOCS2,0.562720,3.420388e-06,GBM
4,CDH4,0.559180,3.420388e-06,GBM
...,...,...,...,...
80644,AK1,-0.000256,9.985768e-01,CO
80645,KRI1,-0.000217,9.986912e-01,CO
80646,MUL1,-0.000272,9.986912e-01,CO
80647,CADPS,0.000064,9.997745e-01,CO


In [12]:
#subset dataframe to include genes only desired for figure 
df_all_comp_coag = df_all_prot_append[df_all_prot_append.Comparison.isin(coag_upa)]
print("Number of rows in data frame " + str(len(df_all_comp_coag)))


Number of rows in data frame 212


Our figure includes 27 genes for 8 cancers. If all data were present, there would be 27 × 8 = 216 rows. However, the data frame has only 212 rows, so 4 gene/cancer combinations are missing.

In [13]:
def find_missing_genes(test_list, full_list):
    """Print every gene in full_list that does not appear in test_list."""
    for gene in full_list:
        if (gene not in test_list):
            print(gene)


In [14]:
#Get list of genes for colon, kidney, and ovarian
colon = df_all_prot_append[df_all_prot_append["Cancer"] == "CO"]
colon_list = colon.Comparison.to_list()

Kidney = df_all_prot_append[df_all_prot_append["Cancer"] == "ccRCC"]
Kidney_list = Kidney.Comparison.to_list()

Ovarian = df_all_prot_append[df_all_prot_append["Cancer"] == "OV"]
Ovarian_list = Ovarian.Comparison.to_list()

In [15]:
#Show the missing genes for each cancer
print("Ovarian missing genes: ")
find_missing_genes(Ovarian_list, coag_upa)
print("Kidney missing genes: ")
find_missing_genes(Kidney_list, coag_upa)
print("Colon missing genes: ")
find_missing_genes(colon_list, coag_upa)

Ovarian missing genes: 
F3
MMP12
Kidney missing genes: 
MMP12
Colon missing genes: 
THBD
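
The same check can be done for all cancers in one pass by comparing the full gene × cancer grid with the pairs present in the data. The sketch below uses toy data, and the helper name `missing_pairs` is hypothetical; on the real table it would be called with `df_all_prot_append`, `coag_upa`, and the eight cancer labels.

```python
from itertools import product

import pandas as pd

def missing_pairs(df: pd.DataFrame, genes, cancers):
    """Return (gene, cancer) pairs from genes x cancers that have no row in df."""
    present = set(zip(df["Comparison"], df["Cancer"]))
    return [pair for pair in product(genes, cancers) if pair not in present]

# Toy data (hypothetical): THBD has no row for CO
toy = pd.DataFrame(
    {
        "Comparison": ["F3", "THBD", "F3"],
        "Cancer": ["GBM", "GBM", "CO"],
    }
)
genes = ["F3", "THBD"]
cancers = ["GBM", "CO"]

absent = missing_pairs(toy, genes, cancers)
expected_rows = len(genes) * len(cancers)  # rows if every pair had data
```

The difference between `expected_rows` and the number of rows found should equal `len(absent)`, which mirrors the 216 versus 212 comparison above.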
