## Gene Set Enrichment Analysis
We use this notebook to show that the small-sample lymphocyte proteomics data are in fact representative of lymphocytes specifically.

In [1]:
import load_data
data = load_data.load_FragPipe(version='July_noMBR_FP', contains=[])

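Here we use a function from load_data to get the names from the FASTA file: both the short name code (the part before `_HUMAN`) and the verbose header text. We then add these to the quantification data. Note that this short code is the entry-name mnemonic, which is not always identical to the gene symbol (for example, `DJB12` corresponds to the gene DNAJB12).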

In [2]:
names = load_data.load_fasta()
names

Q96IY4                     CBPB2_HUMAN Carboxypeptidase B2 
P22362                    CCL1_HUMAN C-C motif chemokine 1 
Q8NCR9                                CLRN3_HUMAN Clarin-3 
Q8IUK8                            CBLN2_HUMAN Cerebellin-2 
Q9BX69    CARD6_HUMAN Caspase recruitment domain-contain...
                                ...                        
Q8WUP2           FBLI1_HUMAN Filamin-binding LIM protein 1 
P09038               FGF2_HUMAN Fibroblast growth factor 2 
P10071           GLI3_HUMAN Transcriptional activator GLI3 
P32189                          GLPK_HUMAN Glycerol kinase 
Q9NXW2      DJB12_HUMAN DnaJ homolog subfamily B member 12 
Length: 20364, dtype: object

In [3]:
# Short name code before '_HUMAN' (e.g. 'CBPB2') and the descriptive text after it.
data['Names'] = names.apply(lambda n: str(n).split('_HUMAN')[0])
data['Header'] = names.apply(lambda n: str(n).split('_HUMAN')[1].strip())
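
The split above assumes that every entry contains `_HUMAN`; an entry without it would raise an `IndexError` on the `[1]` lookup. A minimal sketch of a more defensive parse is shown below. The helper `split_entry` is hypothetical (it is not part of `load_data`), and the third example string is made up to show the fallback.

```python
def split_entry(entry, marker="_HUMAN"):
    """Split 'CBPB2_HUMAN Carboxypeptidase B2' into ('CBPB2', 'Carboxypeptidase B2').

    Hypothetical helper for illustration only; not part of load_data.
    """
    name, found, header = str(entry).partition(marker)
    if not found:
        # Marker absent: keep the whole string as the name, leave the header empty.
        return name.strip(), ""
    return name.strip(), header.strip()


examples = [
    "CBPB2_HUMAN Carboxypeptidase B2 ",
    "CCL1_HUMAN C-C motif chemokine 1 ",
    "NOT_A_UNIPROT_STYLE_ENTRY",  # made-up entry without the marker
]
parsed = [split_entry(e) for e in examples]
for name, header in parsed:
    print(repr(name), repr(header))
```

With a pandas Series such as `names`, the same helper could be applied with `names.apply(split_entry)` and the resulting tuples unpacked into the two columns.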

Now we can run Gene Set Enrichment Analysis (GSEA) on the identified proteins, using the Enrichr interface provided by gseapy.

In [4]:
import gseapy as gp
from gseapy.plot import barplot, dotplot
import matplotlib.pyplot as plt


In [11]:
genesets = [
    #'WikiPathways_2019_Human',  #
    #'BioPlanet_2019',   
    'ProteomicsDB_2020',    
    #'CCLE_Proteomics_2020', #says Hematopoietic and Lymphoid Tissues
    #'GO_Molecular_Function_2018',
]


We use the ProteomicsDB_2020 gene sets, which compare our samples to documented lineages. The following results show that the lymphocytes we measured are highly similar to several lymphoblastoid cell lines. Bone shows some similarity as well, due to XXX

In [12]:
from numpy import nan
import data_utils

cell_types = ["cells"]  # other cell types could be listed here; for now all cells are analysed together

    
for t in cell_types:
    for gset in genesets:
        #print(gene_list)
        enr = gp.enrichr(gene_list=list(data.Names.dropna()), 
                       description="Lymphocytes",
                       gene_sets=gset,
                       outdir='/data/test/enrichr'
                    )

        #as table:
        print(gset)
        display(enr.res2d[['Term','Adjusted P-value']][0:20])
        enr.res2d[['Term','Adjusted P-value']][0:20].to_csv( 'data/{0}.tsv'.format(gset), sep='\t')

        #as barplot
        #plt.rcParams['font.size'] = 25

        #plt.subplots_adjust(left=3, right=4,hspace=None)
        #barplot(enr.res2d, top_term=25, figsize=(12, 10))#, title=gset,)
        #plt.savefig('data/{0}_6_FP.png'.format(gset), bbox_inches='tight', dpi=400)
        #plt.show()

ProteomicsDB_2020


   Term                                              Adjusted P-value
0  Lymphoblastoid BTO:0000773 X129.126 HM11.GM18552  2.10347e-08
1  Lymphoblastoid BTO:0000773 X126.126 HM11.GM12878  4.892894e-08
2  Lymphoblastoid BTO:0000773 X127.126 HM11.GM12878  1.790492e-07
3  Lymphoblastoid BTO:0000773 X128.126 HM11.GM12878  1.790492e-07
4  Lymphoblastoid BTO:0000773 X130.126 HM11.GM18522  1.790492e-07
5  Lymphoblastoid BTO:0000773 X131.126 HM11.GM10847  1.790492e-07
6  Lymphoblastoid BTO:0000773 X126.126 HM10.GM12878  2.776208e-07
7  Lymphoblastoid BTO:0000773 X127.126 HM10.GM07000  2.776208e-07
8  Lymphoblastoid BTO:0000773 X128.126 HM10.GM11992  2.776208e-07
9  Lymphoblastoid BTO:0000773 X129.126 HM10.GM06985  2.776208e-07
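
To help read the "Adjusted P-value" column, the sketch below illustrates the general idea behind an over-representation test followed by a multiple-testing correction. It is an illustration under stated assumptions, not gseapy's or Enrichr's actual code: it uses a one-sided hypergeometric tail (equivalent to a one-sided Fisher exact test) and a Benjamini-Hochberg adjustment, which is how Enrichr's p-values are commonly described. All counts and p-values in the example are invented.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(overlap >= k) when n genes are drawn from N, of which K belong to the gene set."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented numbers: 500 identified proteins out of a 20000-gene background,
# a 300-gene set, and 30 genes in common.
p = hypergeom_tail(k=30, n=500, K=300, N=20000)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, adj)
```

Because of the monotonicity step, several terms can end up with the same adjusted value, which is one plausible reason why rows 2 to 5 and rows 6 to 9 of the table above share identical adjusted p-values.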


In [7]:
#supernatant_blanks=data['Blank_3','Blank_4','Blank_5']