# Create Heat Map for significant DNA Replication genes

This notebook looks at genes in the DNA Replication pathway that are significant in at least one cancer. Pancancer heat maps are created in which circle size shows significance (p-value) and color shows the difference in median.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats
import gseapy as gp
import re
import sys 

import cptac
import cptac.utils as u

import plot_utils as p

# Step 1: Run GSEA for significant genes in at least 1 cancer

First read mult_sig_pval_heatmap.csv into a df. This csv file contains only genes with a significant p-value in at least one cancer. Then run GSEA (Enrichr, KEGG_2016 gene sets) using the list of genes from the df.

In [2]:
root = R'~\Github\WhenMutationsDontMatter\PTEN\Step_3_trans_effect\csv'
sig_df = pd.read_csv(root+R'\mult_sig_pval_heatmap.csv')

prot_list = list(sig_df.Proteomics) # list of genes with a sig pval in >= 1 cancer
prot_enr = gp.enrichr(gene_list = prot_list, description='Tumor_partition', gene_sets='KEGG_2016', 
                       outdir='/Enrichr')

In [3]:
prot_enr.res2d.head()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Old P-value,Old Adjusted P-value,Odds Ratio,Combined Score,Genes,Gene_set
0,Spliceosome Homo sapiens hsa03040,21/134,4.645154e-13,1.36103e-10,0,0,7.626103,216.564409,DDX5;NCBP1;NCBP2;DDX23;DDX42;THOC3;PRPF40A;THO...,KEGG_2016
1,DNA replication Homo sapiens hsa03030,12/36,3.876348e-12,5.67885e-10,0,0,16.2206,426.214559,RFC5;RFC3;PCNA;RFC4;MCM7;RFC1;RFC2;MCM3;MCM4;M...,KEGG_2016
2,RNA transport Homo sapiens hsa03013,19/172,2.855921e-09,2.789283e-07,0,0,5.375431,105.755548,RANBP2;EIF5B;NUP210;NUP155;NCBP1;NUP133;NCBP2;...,KEGG_2016
3,Ribosome biogenesis in eukaryotes Homo sapiens...,14/89,3.615167e-09,2.64811e-07,0,0,7.654665,148.792361,UTP6;IMP3;WDR3;HEATR1;NAT10;WDR75;IMP4;PWP2;UT...,KEGG_2016
4,Mismatch repair Homo sapiens hsa03430,8/23,1.112006e-08,6.516354e-07,0,0,16.925844,309.988622,RFC5;MSH6;RFC3;PCNA;RFC4;RFC1;MSH2;RFC2,KEGG_2016


# Step 2: Get the list of significant genes in the DNA Replication pathway

In [4]:
dna_rep = prot_enr.res2d.Genes[1] # row 1 is 'DNA replication' in the table above
genes = dna_rep.split(';') # Enrichr separates gene names with ';'
print('total genes:',len(genes))

total genes: 12
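
Selecting the pathway by position (row 1) depends on the ordering of the enrichment results, which can change if the input gene list changes. As a minimal sketch, the row could instead be looked up by term name. The helper `genes_for_term` and the toy table `toy_res` are hypothetical and only mimic the `Term` and `Genes` columns shown above; the Spliceosome and DNA replication gene strings are deliberately shortened to the genes visible in the truncated output.

```python
import pandas as pd

# Toy stand-in for prot_enr.res2d (assumption: only Term and Genes are needed).
# The first two gene strings are shortened on purpose; they are NOT the full lists.
toy_res = pd.DataFrame({
    "Term": [
        "Spliceosome Homo sapiens hsa03040",
        "DNA replication Homo sapiens hsa03030",
        "Mismatch repair Homo sapiens hsa03430",
    ],
    "Genes": [
        "DDX5;NCBP1;NCBP2",
        "RFC5;RFC3;PCNA",
        "RFC5;MSH6;RFC3;PCNA;RFC4;RFC1;MSH2;RFC2",
    ],
})

def genes_for_term(res, term_prefix):
    """Return the gene list of the single row whose Term starts with term_prefix."""
    match = res[res["Term"].str.startswith(term_prefix)]
    if len(match) != 1:
        raise ValueError(
            f"expected exactly one term starting with {term_prefix!r}, found {len(match)}"
        )
    return match["Genes"].iloc[0].split(";")

print(genes_for_term(toy_res, "DNA replication"))
```

With the real results, the equivalent call would presumably be `genes_for_term(prot_enr.res2d, "DNA replication")`, assuming the column names match those shown in the output above.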


# Step 3: Create HeatMap

Use the long-format df of significant genes, which is formatted for the plotCircleHeatMap function with one row per gene and cancer (here, sig_df read in Step 1). Then slice out the significant genes in the DNA Replication pathway. About 15 genes show up well in a single plotCircleHeatMap call, so the 12 genes found in Step 2 fit in one plot. To visualize a different set of genes, pass a different gene list to the filter below.

In [5]:
# sig > 1 cancer
bool_df = sig_df.Proteomics.isin(genes)
plot_df = sig_df[bool_df]
plot_df.Proteomics.unique()
plot_df

Unnamed: 0,Proteomics,P_Value,Medians,Cancer
27,MCM6,0.000014,1.085407,Gbm
28,MCM4,0.000014,1.236484,Gbm
32,PCNA,0.000019,0.631486,Gbm
41,RFC5,0.000037,0.481635,Gbm
44,MCM5,0.000041,0.870533,Gbm
...,...,...,...,...
3103,MCM6,0.845341,-0.059250,Colon
3111,MCM2,0.856169,-0.088450,Colon
3115,MCM7,0.864011,-0.002000,Colon
3137,RFC4,0.921316,-0.104850,Colon
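
The long format above stores one row per gene and cancer. For a quick check of the values before plotting, the same data can be pivoted into a cancer-by-gene matrix. This is only an illustrative sketch using four rows copied from the output above; `long_df` and `wide_medians` are names introduced here.

```python
import pandas as pd

# Four rows copied from the plot_df output above (long format).
long_df = pd.DataFrame({
    "Proteomics": ["MCM6", "MCM4", "MCM6", "MCM7"],
    "P_Value": [0.000014, 0.000014, 0.845341, 0.864011],
    "Medians": [1.085407, 1.236484, -0.059250, -0.002000],
    "Cancer": ["Gbm", "Gbm", "Colon", "Colon"],
})

# Wide view: rows are cancers, columns are genes, cells are differences in median.
# Gene/cancer pairs that are absent from the long df show up as NaN.
wide_medians = long_df.pivot(index="Cancer", columns="Proteomics", values="Medians")
print(wide_medians)
```

The wide view is only for inspection; the plotting call in the next cell still takes the long df.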


In [6]:
p.plotCircleHeatMap(plot_df, circle_var = 'P_Value', color_var='Medians', x_axis= 'Proteomics', y_axis = 'Cancer',
                    plot_width=1000)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["size2"] = df[circle_var].apply(lambda x: -1*(np.log(x)))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['size'] = (df["size2"])*3
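
The two code lines quoted in the warning suggest that the circle size is computed as -ln(p) multiplied by 3, so smaller p-values give larger circles. The warning itself appears because new columns are assigned to a df that was produced by boolean filtering. The sketch below reproduces that size calculation and shows one common way to avoid the warning, by filtering into an explicit copy. The function `circle_size` is a hypothetical stand-in, not the actual plot_utils code, and the rows are copied from the output above.

```python
import numpy as np
import pandas as pd

def circle_size(p_value, scale=3):
    """Assumed size mapping, mirroring the warning text: -ln(p) * 3."""
    return -np.log(p_value) * scale

# Three rows copied from the plot_df output above.
sig = pd.DataFrame({
    "Proteomics": ["MCM6", "RFC4", "PCNA"],
    "P_Value": [0.000014, 0.921316, 0.000019],
    "Cancer": ["Gbm", "Colon", "Gbm"],
})

# .copy() makes the filtered frame independent of sig, so adding
# columns to it does not trigger the copy-of-a-slice warning.
subset = sig[sig["Proteomics"].isin(["MCM6", "RFC4"])].copy()
subset["size"] = circle_size(subset["P_Value"])
print(subset)
```

In the notebook, the analogous change would be `plot_df = sig_df[bool_df].copy()` before calling `p.plotCircleHeatMap`.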


# Step 4: Create a HeatMap with both pos and neg differences in median

Read in pos_neg_df.csv to create a df containing only genes that have a positive difference in median in some cancers and a negative difference in others. Then use the list of significant genes in the pathway to slice out the pathway genes that show both directions.

In [7]:
pos_neg_df = pd.read_csv(root+R'\pos_neg_df.csv')

In [8]:
get = pos_neg_df.Proteomics.isin(genes) # bool Series: True where the gene is in the DNA Replication list
genes_pn = pos_neg_df[get] # Keep only pathway genes (all rows in pos_neg_df have both pos and neg)
genes_pn.Proteomics.unique()

array(['MCM6', 'MCM4', 'RFC5', 'MCM5', 'MCM2', 'MCM7', 'MCM3', 'RFC2',
       'RFC4'], dtype=object)
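
How pos_neg_df.csv was built is not shown in this notebook. As a rough sketch of one way such a filter could work, the code below groups a long df by gene and keeps genes whose differences in median are negative in at least one cancer and positive in at least one other. The function `genes_with_mixed_sign` is hypothetical, it ignores significance, and the PCNA Colon value is made up for illustration; the other three values are copied from the Step 3 output.

```python
import pandas as pd

def genes_with_mixed_sign(df, gene_col="Proteomics", value_col="Medians"):
    """Return sorted genes with at least one negative and one positive value."""
    grouped = df.groupby(gene_col)[value_col]
    mixed = (grouped.min() < 0) & (grouped.max() > 0)
    return sorted(mixed[mixed].index)

toy = pd.DataFrame({
    "Proteomics": ["MCM6", "MCM6", "PCNA", "PCNA"],
    # The last value (PCNA in Colon) is a made-up number for illustration only.
    "Medians": [1.085407, -0.059250, 0.631486, 0.2],
    "Cancer": ["Gbm", "Colon", "Gbm", "Colon"],
})

print(genes_with_mixed_sign(toy))
```

A stricter variant might first drop rows above a p-value cutoff so that only significant differences count toward each direction; whether the original csv does this is not stated here.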

In [9]:
p.plotCircleHeatMap(genes_pn, circle_var = 'P_Value', color_var='Medians', x_axis= 'Proteomics', y_axis = 'Cancer', 
                    plot_width=700)