# Mayer and Johnston Pathway Analysis

In this notebook we analyze the proteins reported in the Johnston and Mayer papers. First, we extract the total and differential proteins from each paper and sort the differential proteins into upregulated and downregulated groups. We then run a pathway enrichment analysis on the differential proteins.

In [1]:
johnston_file = "data/134638_0_supp_38937_p0y7zb.xlsx"
mayer_file = "data/133399_0_supp_15943_4ybsvb.xlsx"

In [2]:
import os
from os import path

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import requests
import seaborn as sns
from gprofiler import GProfiler
from matplotlib_venn import venn2, venn3
from numpy import log10
from scipy import stats

import longitudinalCLL

## Mayer Paper

Download and import supplementary table 3, using the "after imputation" sheets (Mayer et al., 2018):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5795392/bin/supp_RA117.000425_133399_0_supp_15943_4ybsvb.xlsx

I parse the dataframe and extract all the proteins labeled as significant (`+` in the t-test significance column), then split them into upregulated (positive t-test difference) and downregulated (negative t-test difference).
The upregulated and downregulated proteins will be used to generate a functional pathway analysis.
The paper reports 440 upregulated and 427 downregulated proteins; however, these numbers double-count some proteins. This analysis finds 425 upregulated and 371 downregulated proteins, and none are double-counted. There are 6,945 total proteins identified in the paper.

Since this paper differentiates between nuclear and cytoplasmic fractions, we have to read in two sheets from the same file. First, we read the nuclear sheet and extract the significant, upregulated, and downregulated proteins.

In [3]:
ne_sheet_name = "NE_after imputation"
#The other is "NE_before imputation"
m_ne_df = pd.read_excel(mayer_file, sheet_name = ne_sheet_name)

In [4]:
m_sig_ne = m_ne_df.loc[m_ne_df['Student\'s t-test Significant CLL vs elderly Bcells'] == '+']
m_up_ne = m_sig_ne.loc[m_sig_ne['Student\'s t-test Difference  CLL vs  elderly B cells'] > 0]
m_down_ne = m_sig_ne.loc[m_sig_ne['Student\'s t-test Difference  CLL vs  elderly B cells'] < 0]

Next, we read the cytoplasmic sheet and extract the significant, upregulated, and downregulated proteins.

In [5]:
mayer_cyt_sheet_name = "CYT_after imputation"
m_cyt_df = pd.read_excel(mayer_file, sheet_name= mayer_cyt_sheet_name,
                         skiprows = 1) #There is a header saying sup. table s3

In [6]:
m_sig_cyt = m_cyt_df.loc[m_cyt_df['Student\'s t-test Significant CLL vs elderly B cells'] == '+']
m_up_cyt = m_sig_cyt.loc[m_sig_cyt['Student\'s t-test Difference  CLL vs  elderly B cells'] > 0]
m_down_cyt = m_sig_cyt.loc[m_sig_cyt['Student\'s t-test Difference  CLL vs  elderly B cells'] < 0]

In [7]:
m_upreg = []

In [8]:
up_frames = [m_up_ne, m_up_cyt]
m_u = pd.concat(up_frames)

In [9]:
for protein in m_u['Protein IDs'] :
    temp = protein.split(";")
    m_upreg.append(temp[0])

In [10]:
print(len(set(m_upreg)))
m_upreg = set(m_upreg)

425


In [11]:
m_downreg = []

In [12]:
down_frames = [m_down_ne, m_down_cyt]
m_d = pd.concat(down_frames)

In [13]:
for protein in m_d['Protein IDs'] :
    temp = protein.split(";")
    m_downreg.append(temp[0])

In [14]:
print(len(set(m_downreg)))
m_downreg = set(m_downreg)

371
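
The list-building cells above can be condensed into one helper. This is a minimal sketch, not the code that produced the counts above: `first_accessions` is a hypothetical name, and the two toy frames with made-up IDs stand in for the nuclear and cytoplasmic tables. It shows why collecting into a set prevents a protein that is significant in both fractions from being counted twice.

```python
import pandas as pd

def first_accessions(frames, column="Protein IDs"):
    """Return the set of leading accessions across several frames (hypothetical helper)."""
    combined = pd.concat(frames, ignore_index=True)
    # Each entry is a ';'-separated protein group; keep only its first ID.
    return set(combined[column].str.split(";").str[0])

# Toy stand-ins for the nuclear and cytoplasmic tables (made-up IDs).
toy_ne = pd.DataFrame({"Protein IDs": ["P1;P2", "P3"]})
toy_cyt = pd.DataFrame({"Protein IDs": ["P1;P9", "P4;P5"]})

# "P1" leads a group in both frames but appears only once in the set.
toy_up = first_accessions([toy_ne, toy_cyt])
print(len(toy_up))
```

With the real data, the equivalent calls would be `first_accessions([m_up_ne, m_up_cyt])` and `first_accessions([m_down_ne, m_down_cyt])`.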


## Johnston Paper
Download and import supplementary table 2 (Johnston et al., 2018)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5880099/bin/supp_RA117.000539_134638_0_supp_38937_p0y7zb.xlsx

I parse the dataframe and extract all the proteins labeled as significantly upregulated (regulation score >= 0.3) and downregulated (regulation score <= -0.3).
The upregulated and downregulated proteins will be used to generate a functional pathway analysis.
The paper reports 544 upregulated and 592 downregulated proteins. This analysis finds 545 upregulated and 592 downregulated.

In [15]:
j_sheet_name = "CLL proteome"
j_df = pd.read_excel(johnston_file, sheet_name = j_sheet_name)

Here I extract all the proteins with a differential regulation score (>= 0.3 or <= -0.3). I sort them into upregulated and downregulated proteins and add them to the differential set. I also extract the names of all proteins identified.

In [16]:
j_u = j_df.loc[j_df['Regulation score'] >= 0.3]
j_upreg = j_u['Protein group accession']

In [17]:
print(len(j_upreg))

545


In [18]:
j_d = j_df.loc[j_df['Regulation score'] <= -0.3]
j_downreg = j_d['Protein group accession']

In [19]:
print(len(j_downreg))

592
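
The cells above use inclusive bounds (`>=` and `<=`). The sketch below, on a toy table with made-up scores, shows how inclusive and strict cutoffs treat proteins that sit exactly on the threshold. `split_by_score` is a hypothetical helper, and the sketch does not establish that this is the reason for the one-protein difference from the paper's upregulated count.

```python
import pandas as pd

# Toy table with made-up scores; column names follow the Johnston sheet.
toy = pd.DataFrame({
    "Protein group accession": ["A", "B", "C", "D", "E"],
    "Regulation score": [0.5, 0.3, 0.0, -0.3, -0.7],
})

def split_by_score(df, cutoff=0.3, inclusive=True):
    """Split accessions into (up, down) lists by regulation score (hypothetical helper)."""
    score = df["Regulation score"]
    if inclusive:
        up, down = df[score >= cutoff], df[score <= -cutoff]
    else:
        up, down = df[score > cutoff], df[score < -cutoff]
    return (list(up["Protein group accession"]),
            list(down["Protein group accession"]))

up_incl, down_incl = split_by_score(toy)                       # as in the cells above
up_strict, down_strict = split_by_score(toy, inclusive=False)  # strict variant
print(up_incl, down_incl)
print(up_strict, down_strict)
```

Proteins "B" and "D" lie exactly on the cutoff, so they are kept only by the inclusive version.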


## Pathway Analysis

Here I run a functional pathway enrichment. I use GProfiler to profile the significantly upregulated and downregulated proteins from each paper, keep only the KEGG terms from the results, and then select the columns I want.
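
The filter-and-select step is repeated four times in the cells below, so it can be wrapped in a small function. This is a sketch under assumptions: `kegg_terms` and `KEEP_COLUMNS` are hypothetical names, and the toy frame with made-up rows stands in for the DataFrame returned by `gp.profile`, so the example runs without a network call.

```python
import pandas as pd

# Columns kept in the tables below.
KEEP_COLUMNS = ["source", "name", "p_value", "description",
                "term_size", "query_size", "intersection_size"]

def kegg_terms(result, source="KEGG", columns=KEEP_COLUMNS):
    """Keep rows from one annotation source and a fixed set of columns (hypothetical helper)."""
    return result.loc[result["source"] == source, columns]

# Toy stand-in for a gp.profile(...) result; all values are made up.
toy_result = pd.DataFrame({
    "source": ["GO:BP", "KEGG", "REAC", "KEGG"],
    "native": ["id1", "id2", "id3", "id4"],
    "name": ["term a", "term b", "term c", "term d"],
    "p_value": [0.01, 0.02, 0.03, 0.04],
    "description": ["term a", "term b", "term c", "term d"],
    "term_size": [10, 20, 30, 40],
    "query_size": [5, 5, 5, 5],
    "intersection_size": [2, 3, 1, 4],
})

toy_kegg = kegg_terms(toy_result)
print(toy_kegg)
```

Like the cells below, the helper keeps the original row index, so the index values show where each KEGG term sat in the full result.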

In [20]:
gp = GProfiler(return_dataframe = True)

In [21]:
johnston_upregulated_gp = gp.profile(organism='hsapiens', query=list(j_upreg))
j_up_df = johnston_upregulated_gp[johnston_upregulated_gp["source"] == "KEGG"]
j_up_df = j_up_df[['source', 'name', 'p_value', 'description', 'term_size', 'query_size', 'intersection_size']]
j_up_df

| | source | name | p_value | description | term_size | query_size | intersection_size |
|---|---|---|---|---|---|---|---|
| 136 | KEGG | Spliceosome | 2.851014e-22 | Spliceosome | 150 | 240 | 37 |
| 535 | KEGG | mRNA surveillance pathway | 5.066431e-08 | mRNA surveillance pathway | 98 | 240 | 18 |
| 544 | KEGG | Valine, leucine and isoleucine degradation | 8.914372e-08 | Valine, leucine and isoleucine degradation | 48 | 240 | 13 |
| 1058 | KEGG | Nucleotide excision repair | 0.03496473 | Nucleotide excision repair | 45 | 240 | 7 |
| 1060 | KEGG | Notch signaling pathway | 0.03512498 | Notch signaling pathway | 59 | 240 | 8 |


In [22]:
johnston_downregulated_gp = gp.profile(organism='hsapiens', query=list(j_downreg))
j_down_df = johnston_downregulated_gp[johnston_downregulated_gp["source"] == "KEGG"]
j_down_df = j_down_df[['source', 'name', 'p_value', 'description', 'term_size', 'query_size', 'intersection_size']]
j_down_df

| | source | name | p_value | description | term_size | query_size | intersection_size |
|---|---|---|---|---|---|---|---|
| 168 | KEGG | Leukocyte transendothelial migration | 1.4e-05 | Leukocyte transendothelial migration | 114 | 291 | 18 |
| 210 | KEGG | Lysosome | 8.3e-05 | Lysosome | 128 | 291 | 18 |
| 256 | KEGG | Chemokine signaling pathway | 0.000551 | Chemokine signaling pathway | 190 | 291 | 21 |
| 321 | KEGG | Rap1 signaling pathway | 0.002607 | Rap1 signaling pathway | 210 | 291 | 21 |
| 350 | KEGG | Platelet activation | 0.004443 | Platelet activation | 124 | 291 | 15 |
| 358 | KEGG | Phosphatidylinositol signaling system | 0.004946 | Phosphatidylinositol signaling system | 97 | 291 | 13 |
| 398 | KEGG | Lipid and atherosclerosis | 0.010659 | Lipid and atherosclerosis | 214 | 291 | 20 |
| 409 | KEGG | Yersinia infection | 0.012997 | Yersinia infection | 136 | 291 | 15 |
| 411 | KEGG | B cell receptor signaling pathway | 0.013457 | B cell receptor signaling pathway | 79 | 291 | 11 |
| 475 | KEGG | Leishmaniasis | 0.027925 | Leishmaniasis | 72 | 291 | 10 |


In [23]:
mayer_upregulated_gp = gp.profile(organism='hsapiens', query=list(m_upreg))
m_up_df = mayer_upregulated_gp[mayer_upregulated_gp["source"] == "KEGG"]
m_up_df = m_up_df[['source', 'name', 'p_value', 'description', 'term_size', 'query_size', 'intersection_size']]
m_up_df

| | source | name | p_value | description | term_size | query_size | intersection_size |
|---|---|---|---|---|---|---|---|
| 479 | KEGG | Pyruvate metabolism | 0.002048 | Pyruvate metabolism | 47 | 204 | 8 |
| 516 | KEGG | Metabolic pathways | 0.00531 | Metabolic pathways | 1491 | 204 | 61 |
| 545 | KEGG | Glyoxylate and dicarboxylate metabolism | 0.008932 | Glyoxylate and dicarboxylate metabolism | 30 | 204 | 6 |


In [24]:
mayer_downregulated_gp = gp.profile(organism='hsapiens', query=list(m_downreg))
m_down_df = mayer_downregulated_gp[mayer_downregulated_gp["source"] == "KEGG"]
m_down_df = m_down_df[['source', 'name', 'p_value', 'description', 'term_size', 'query_size', 'intersection_size']]
m_down_df

| | source | name | p_value | description | term_size | query_size | intersection_size |
|---|---|---|---|---|---|---|---|
| 34 | KEGG | Complement and coagulation cascades | 3.480036e-13 | Complement and coagulation cascades | 85 | 188 | 20 |
| 44 | KEGG | Hematopoietic cell lineage | 4.740364e-10 | Hematopoietic cell lineage | 95 | 188 | 18 |
| 78 | KEGG | ECM-receptor interaction | 1.456775e-07 | ECM-receptor interaction | 88 | 188 | 15 |
| 91 | KEGG | Platelet activation | 3.699831e-07 | Platelet activation | 124 | 188 | 17 |
| 109 | KEGG | Neutrophil extracellular trap formation | 1.45511e-06 | Neutrophil extracellular trap formation | 189 | 188 | 20 |
| 158 | KEGG | Phagosome | 2.947435e-05 | Phagosome | 147 | 188 | 16 |
| 204 | KEGG | Regulation of actin cytoskeleton | 0.000281025 | Regulation of actin cytoskeleton | 216 | 188 | 18 |
| 247 | KEGG | Focal adhesion | 0.001718691 | Focal adhesion | 200 | 188 | 16 |
| 274 | KEGG | Leishmaniasis | 0.004191412 | Leishmaniasis | 72 | 188 | 9 |
| 286 | KEGG | Leukocyte transendothelial migration | 0.006897581 | Leukocyte transendothelial migration | 114 | 188 | 11 |


Next, I check which enriched pathways are shared between the Mayer and Johnston results, separately for the upregulated and downregulated sets.

In [25]:
up_intersect = set(m_up_df['name']).intersection(j_up_df['name'])
print(up_intersect)
down_intersect = set(m_down_df['name']).intersection(j_down_df['name'])
print(down_intersect)

set()
{'Leishmaniasis', 'Leukocyte transendothelial migration', 'Platelet activation'}
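
The set intersection gives only the shared names. To compare how strongly each shared pathway is enriched in the two papers, the tables can be joined on `name`. This is a minimal sketch: the two small frames are hand-copied subsets of the downregulated tables above (names and p-values only), and `shared` is a name introduced here for illustration.

```python
import pandas as pd

# Subsets of the downregulated KEGG tables above (name and p_value only).
m_down_toy = pd.DataFrame({
    "name": ["Platelet activation", "Phagosome", "Leishmaniasis",
             "Leukocyte transendothelial migration"],
    "p_value": [3.699831e-07, 2.947435e-05, 0.004191412, 0.006897581],
})
j_down_toy = pd.DataFrame({
    "name": ["Leukocyte transendothelial migration", "Lysosome",
             "Platelet activation", "Leishmaniasis"],
    "p_value": [1.4e-05, 8.3e-05, 0.004443, 0.027925],
})

# An inner join keeps only pathways present in both tables.
shared = (
    m_down_toy.merge(j_down_toy, on="name", suffixes=("_mayer", "_johnston"))
    .sort_values("name")
    .reset_index(drop=True)
)
print(shared)
```

On the full tables, the same join would be `m_down_df.merge(j_down_df, on="name", suffixes=("_mayer", "_johnston"))`.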


Optionally, export the tables as images or save them to an Excel file. Both cells are commented out by default.

In [26]:
#import dataframe_image as dfi
#dfi.export(j_up_df, 'data/j_up_path.png')
#dfi.export(j_down_df, 'data/j_down_path.png')
#dfi.export(m_up_df, 'data/m_up_path.png')
#dfi.export(m_down_df, 'data/m_down_path.png')

In [27]:
#from pandas import ExcelWriter
#frames = [j_up_df, j_down_df, m_up_df, m_down_df]
#saveFile = 'data/table3.xlsx'
#with ExcelWriter(saveFile) as writer:  # the context manager saves the file on exit
#    for x, frame in enumerate(frames, start=1):
#        frame.to_excel(writer, sheet_name='sheet' + str(x))