# Independent Enrichment Analysis

This Appyter performs enrichment analysis given an input set of items (for example, gene symbols) and a library of sets in GMT format (for example, a gene set library). It uses the Fisher exact test to compute an enrichment p-value for each set in the library, and reports the results as a sorted table, a bar chart, and a Manhattan plot.
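
As a rough illustration of the idea, the sketch below parses a tiny GMT-style library and scores each set with a one-sided Fisher exact test, written as a hypergeometric tail sum. It is a minimal sketch under stated assumptions, not the code this Appyter runs: `parse_gmt` and `overlap_pvalue` are hypothetical helper names, the toy library and the background size of 20 are made up for the example, and the result is not guaranteed to match `enrich_crisp` digit for digit.

```python
from math import comb

def parse_gmt(text):
    """Parse GMT-style text: term<TAB>description<TAB>item1<TAB>item2..."""
    library = {}
    for line in text.splitlines():
        fields = line.split("\t")
        members = {x for x in fields[2:] if x}
        if members:
            library[fields[0]] = members
    return library

def overlap_pvalue(input_items, term_items, n_background):
    """One-sided Fisher exact p-value for the overlap between two sets.

    Assumes both sets are drawn from a universe of n_background items.
    """
    input_items, term_items = set(input_items), set(term_items)
    n_input, n_term = len(input_items), len(term_items)
    n_overlap = len(input_items & term_items)
    # Hypergeometric tail: chance of seeing at least n_overlap shared items.
    numerator = sum(
        comb(n_term, k) * comb(n_background - n_term, n_input - k)
        for k in range(n_overlap, min(n_input, n_term) + 1)
    )
    return numerator / comb(n_background, n_input)

# Toy library with two sets; the description column is left empty.
gmt_text = "TermA\t\tg1\tg2\tg3\nTermB\t\tg7\tg8\n"
library = parse_gmt(gmt_text)
query = {"g1", "g2", "g9"}

# Score every set, then rank by ascending p-value.
ranked = sorted(
    ((term, overlap_pvalue(query, members, 20)) for term, members in library.items()),
    key=lambda pair: pair[1],
)
for term, pvalue in ranked:
    print(term, pvalue)
```

Here `TermA` shares two of its three items with the query and ranks first, while `TermB` has no overlap and receives a p-value of 1.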

In [None]:
#%%appyter init
from appyter import magic
magic.init(lambda _=globals: _())

In [None]:
from maayanlab_bioinformatics.enrichment.crisp import enrich_crisp, fisher_overlap
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from IPython.display import display, FileLink, Markdown, HTML

# Manhattan Plot Imports
import matplotlib.patches as mpatches
import matplotlib.cm as cm

# Bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.models import HoverTool, CustomJS, ColumnDataSource, Span
from bokeh.layouts import layout, row, column, gridplot
from bokeh.palettes import all_palettes

import base64

In [None]:
%%appyter hide_code_exec
{% do SectionField(
    name='Set_Section',
    title='Submit Your Set',
    subtitle='Upload a text file containing your set or copy and paste your set into the text box below (one item per row). You can also try the default set provided.',
    img='analysis.png'
    
) %}
{% do SectionField(
    name='Library_Section',
    title='Submit Your Library',
    subtitle='Upload a GMT file containing your library. You can also load the default library.',
    img='analysis.png'
    
) %}
{% do SectionField(
    name='Background_Section',
    title='(Optional) Submit Your Background',
    subtitle='Upload a text file containing a list of background items to use as a filter for both the input and the library sets. You can also copy and paste your background items into the text box below (one item per row). By default, no background is used and no items are filtered out.',
    img='analysis.png'
    
) %}

In [None]:
%%appyter hide_code

{% set set_kind = TabField(
    name='set_kind',
    label='Set',
    default='Try Example 1 (Drug Set)',
    description='Paste or upload your set',
    choices={
        'Paste': [ 
            TextField(
                name='set_input1',
                label='Set',
                default='',
                description='Paste your set (one item per row). Names in the set should match the names in the GMT file.',
                section = 'Set_Section'
            )
        ],
        
        'Upload': [
            FileField(
                name='set_filename',
                label='Set File',
                default='',
                description='Upload your set as a text file (one item per row). Names in the set should match the names in the GMT file.',
                section = 'Set_Section'
            ),
        ],
        
        'Try Example 1 (Drug Set)': [
            TextField(
                name='set_input2',
                label='Set',
                default='hexachlorophene\nlopinavir\nbazedoxifene\nabemaciclib\ncamostat\nmefloquine\ncyclosporine\nanidulafungin\nchloroquine\namodiaquine\nloperamide\nalmitrine\nhydroxychloroquine\nniclosamide\nivacaftor\nproscillaridin\nremdesivir',
                description='Paste your set (one item per row). Names in the set should match the names in the GMT file.',
                section = 'Set_Section'
            )
        ],
        'Try Example 2 (Gene Set)': [
            TextField(
                name='set_input3',
                label='Set',
                default='TAAR9\nEBF2\nWDR78\nRRAGA\nSPATA18\nSPINT2\nMRGPRD\nCD9\nRBP1\nCYB5RL\nMXRA8\nPM20D1\nITIH5\nEPAS1\nAHCYL2\nPANK2\nPON2\nLRP5\nSLC5A3\nNSL1\nCLDN2\nLRP8\nAQP1\nCLDN1\nTMEM72\nGNG4\nNHLH2\nC10ORF107\nS100A13\nLY6G6C\nPOF1B\nWLS\nC2ORF82\nFZD4\nCOG7\nFZD6\nFOXF1\nFZD7\nERLIN2\nTYSND1\nACADSB\nOR51I2\nPARP12\nPPFIBP2\nATP4A\nALDH7A1\nTCN2\nSLCO5A1\nSFXN4\nPRR15\nMOXD1\nCAPSL\nCOL13A1\nC1ORF177\nWFDC2\nSLC6A2\nDNALI1\nTNS1\nLGALS2\nT\nBLOC1S1\nHMOX1\nPDK4\nLRAT\nMNX1\nSLC19A1\nHOXC9\nSCARF2\nAS3MT\nARGLU1\nACE\nANXA2\nCARD9\nPAX7\nSORCS1\nRAB33B\nPHOX2A\nKIF9\nCLDN16\nPTPRB\nID3\nITPKB\nNCR1\nGAS6\nCC2D1B\nATR\nMYCBP\nIGSF6\nTPH1\nWFIKKN2\nIGSF5\nACY3\nMAOA\nCAB39L\nCTSZ\nPRDM16\nCYP7A1\nLIMD1\nTMEM27\nSLC22A18\nKRT28\nTIMP3\nEMB\nRNF152\nPLEKHN1\nCLIC3\nSTRA6\nCTSC\nCGNL1\nPARP4\nTMEM176B\nELOVL7\nSORBS3\nGPR4\nF5\nGUCA2B\nSERPINB6\nHADHB\nFOXR1\nNBR1\nSHKBP1\nRLIM\nDHRS13\nHRSP12\nCD63\nCCL11\nF13A1\nFAM69C\nKCNA7\nHCCS\nGUCA1A\nADAMTSL2\nLMAN1\nING3\nEGFLAM\nSCML4\nOLFML1\nSOSTDC1\nCTNNA1\nC16ORF78\nFADS1\nCCDC157\nPDGFRB\nCA12\nCD164\nPRLR\nLRRC69\nUNC5CL\nMPEG1\nSLC31A1\nTECRL\nVCAM1\nATP11A\nUBXN10\nZNF558\nDYDC1\nCD69\nS100A8\nFIGF\nPHLDB2\nERVFRD-1\nCD82\nASB14\nGPR65\nVWCE\nTEKT1\nTEKT4\nMSX1\nSLC16A9\nZNF423\nCA14\nIGFBP2\nSLC30A7\nLRRC46\nPDIA2\nPPEF1\nEPHX1\nFANCM\nRBPMS\nTTC21A\nMR1\nDDX52\nLSM5\nKRT31\nMAVS\nTMEM237\nSMO\nC6ORF118\nPGPEP1L\nIL7R\nC21ORF62\nC11ORF97\nDOCK6\nAKNA\nISYNA1\nCD151\nCBFB\nPYROXD2\nSLC2A1\nGSTCD\nLGALS3BP\nHIGD1B\nAK7\nLTBP1\nARHGAP5\nRGS5\nSALL1\nCOBLL1\nFHAD1\nMAEL\nBTLA\nIGFBP7\nODF1\nACAA2\nKL\nTTC16\nEMX2\nTTC12\nGGH\nCCDC37\nCFLAR\nGPR98\nLAMB2\nBICC1\nBMP6\nCUL4B\nDNAJC3\nSP1\nDAP\nDNAJC1\nPIKFYVE\nDMRTA1\nALPL\nMTRF1L\nBCAR3\nKDM5D\nSHC4\nTTC25\nDBH\nDBI\nCHD1\nWNT6\nSPN\nTTC23L\nPLTP\nCYP26B1\nCASP6\nTMEM204\nTMEM207\nCCDC180\nCCDC34\nCA9\nOVGP1\nPLEKHG2\nCPT1A\nPLEKHG3\nMYO10\nRNASET2\nTBC1D9\nNAGA\nPCOLCE\nMUT\nFOXJ1\nSOD3\nATOX1\nKRT73\nSNTB1\nRP2\nRPIA\nCOL8A1\nALS2\nCOL8A2\nSMPDL3A\nPCOLCE2\nSLC25A13\nTAF3\nFOLR1\nITGB2\nHEMGN\nPRPS2\nSLC24A5\nFLT1\nALAS2\nLSP1\nSYCP2\nSEMA3B\nETFB\nPRELP\nZBTB40\nPBXIP1\nSLC4A5\nCLN8\nEFS\nTTR\nRBM3\nHECTD3\nNAGLU\nALDH2\nCTNNAL1\nPCBD1\nCYTH2',
                description='Paste your set (one item per row). Names in the set should match the names in the GMT file.',
                section = 'Set_Section'
            )
        ],
        
    },
    section = 'Set_Section',
) %}

In [None]:
%%appyter code_exec
{% set library_filename = FileField(
    name='library_filename', 
    label='Library file (.gmt or .txt)', 
    default='Example1_DrugSet_L1000FWD_Signature_Down.txt',
    examples={'Example1_DrugSet_L1000FWD_Signature_Down.txt': "https://maayanlab.cloud/DrugEnrichr/geneSetLibrary?mode=text&libraryName=L1000FWD_Signature_Down", 'Example2_GeneSet_aibs_human-ctx_mouse-ctx-hip_10x_scRNA_2020.gmt': "https://appyters.maayanlab.cloud/storage/Independent_Enrichment_Analysis/aibs_human-ctx_mouse-ctx-hip_10x_scRNA_2020.gmt"}, 
    description='A tab-delimited file format that describes sets. Visit https://bit.ly/35crtXQ for more information.', 
    section='Library_Section')

%}

In [None]:
%%appyter hide_code

{% set background_kind = TabField(
    name='background_kind',
    label='Background',
    default='Paste',
    description='Paste or upload your background set',
    choices={
        'Paste': [
            TextField(
                name='background_input',
                label='Background Set',
                default='',
                description='Paste your background set (one item per row). Names in the background set should match the names in the GMT file.',
                section = 'Background_Section'
            ),
        ],
        'Upload': [
            FileField(
                name='background_filename',
                label='Background File',
                default='',
                description='Upload your background set as a text file (one item per row). Names in the background set should match the names in the GMT file.',
                section = 'Background_Section'
            ),
        ],
    },
    section = 'Background_Section',
) %}

In [None]:
%%appyter code_exec

{%- if set_kind.raw_value == 'Paste' or set_kind.raw_value == 'Try Example 1 (Drug Set)' or set_kind.raw_value == 'Try Example 2 (Gene Set)'%}
set_input = {{ set_kind.value[0] }}
{%- else %}
set_filename = {{ set_kind.value[0] }}
{%- endif %}


library_filename = "{{library_filename.value}}"
library_name = library_filename.replace("_", " ").replace(".txt", "").replace(".gmt", "")

{%- if background_kind.raw_value == 'Paste' %}
background_input = {{ background_kind.value[0] }}
{%- else %}
background_filename = {{ background_kind.value[0] }}
{%- endif %}


In [None]:
output_notebook()
# Table Parameters
significance_value = 0.05
display_topk = 20

# Bar Chart Parameters
figure_file_format = ['png', 'svg']
output_file_name = 'Enrichment_analysis_results_bar'
color = 'lightskyblue'
final_output_file_names = ['{0}.{1}'.format(output_file_name, file_type) for file_type in figure_file_format]
topk = 10

# Manhattan Plot Parameters
manhattan_colors = ['#003f5c', '#7a5195', '#ef5675', '#ffa600']

In [None]:
%%appyter code_exec

{%- if set_kind.raw_value == 'Paste' or set_kind.raw_value == 'Try Example 1 (Drug Set)' or set_kind.raw_value == 'Try Example 2 (Gene Set)' %}
items = set_input.split('\n')
items = [x.strip() for x in items]
{%- else %}
# Read the uploaded set, one item per row
with open(set_filename, 'r') as open_set_file:
    items = [x.strip() for x in open_set_file]
{%- endif %}

# remove blank entries and duplicates in items
items = list({x for x in items if x})

In [None]:
%%appyter code_exec

{%- if background_kind.raw_value == 'Paste' %}
background_items = background_input.split('\n')
background_items = [x.strip() for x in background_items]
{%- else %}
# Read the uploaded background, one item per row
with open(background_filename, 'r') as open_background_file:
    background_items = [x.strip() for x in open_background_file]
{%- endif %}
# Drop blank rows; an empty text box yields [""], which means "no background".
# A set also makes the membership checks below fast.
background_items = {x for x in background_items if x}
background_items_bool = len(background_items) > 0
if background_items_bool:
    # Keep only the input items that are present in the background.
    items = [x for x in items if x in background_items]



In [None]:
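def load_library(library_filename, background_items):
    # Parse a GMT file: each row is term<TAB>description<TAB>item1<TAB>item2...
    library_data = dict()
    with open(library_filename, "r") as f:
        for line in f:
            fields = line.strip().split("\t")
            # Restrict each set to the background when one was supplied
            # (background_items_bool is set in the background cell above).
            if background_items_bool:
                elements = [x for x in fields[2:] if x in background_items]
            else:
                elements = fields[2:]
            # Skip sets that are empty or were emptied by the background filter.
            if len(elements) > 0:
                library_data[fields[0]] = elements

    return library_data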

def validate_inputs(items, library_data):
    library_items = set()
    for key, values in library_data.items():
        library_items.update(set(values))
    if len(items) == 0:
        raise Exception('No items in the input set. Please check the background information.') 
    if len(library_data.keys()) == 0:
        raise Exception('No items in the input library. Please check the background information.') 
    elif len(set(items).intersection(library_items)) == 0:
        raise Exception('No matches in the input set and library.') 
        
# Enrichment analysis
def get_library_iter(library_data):
    for term in library_data.keys():
        single_set = library_data[term]
        yield term, single_set

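def get_enrichment_results(items, library_data):
    # 20000 is passed as the size of the background universe; True keeps the overlapping items in each result.
    return sorted(enrich_crisp(items, get_library_iter(library_data), 20000, True), key=lambda r: r[1].pvalue)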


def get_pvalue(row, unzipped_results, all_results):
    if row['Name'] in list(unzipped_results[0]):
        index = list(unzipped_results[0]).index(row['Name'])
        return all_results[index][1].pvalue
    else:
        return 1
    
# Call enrichment results and return a plot and dataframe for Scatter Plot
def get_values(obj_list):
    pvals = []
    odds_ratio = []
    n_overlap = []
    overlap = []
    for i in obj_list:
        pvals.append(i.pvalue)
        odds_ratio.append(i.odds_ratio)
        n_overlap.append(i.n_overlap)
        overlap.append(i.overlap)
    return pvals, odds_ratio, n_overlap, overlap


def enrichment_analysis(items, library_filename):
    library_data = load_library(library_filename, background_items)
    try:
        validate_inputs(items, library_data)
    except Exception as error:
        print('Run-time error:', error)
        # Stop here and return empty results; continuing would fail on an empty result list.
        return [[]], [[]], pd.DataFrame()
    all_results = get_enrichment_results(items, library_data)
    unzipped_results = list(zip(*all_results))
    pvals, odds_ratio, n_overlap, overlap = get_values(unzipped_results[1])
    df = pd.DataFrame({"Name": unzipped_results[0], "p value": pvals,
                       "odds_ratio": odds_ratio, "n_overlap": n_overlap, "overlap": overlap})
    # The column is named "-log(p value)", but the transform is base 10.
    df["-log(p value)"] = -np.log10(df["p value"])
    return [list(unzipped_results[0])], [pvals], df



# Output a table of significant p-values
def create_download_link(df, title = "Download CSV file of this table", filename = "data.csv"):  
    csv = df.to_csv(index = False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

results, pvals, results_df = enrichment_analysis(items, library_filename) 

In [None]:
# Bar Chart Functions
def enrichr_figure(all_terms, all_pvalues, plot_names, all_libraries, bar_color, topk=10): 
    all_terms = [all_terms[0][:topk]]
    all_pvalues = [all_pvalues[0][:topk]]
    # Bar colors
    if bar_color != 'lightgrey':
        bar_color_not_sig = 'lightgrey'
        edgecolor=None
        linewidth=0
    else:
        bar_color_not_sig = 'white'
        edgecolor='black'
        linewidth=1    

    plt.figure(figsize=(24, 12))
    
    i = 0
    bar_colors = [bar_color if (x < 0.05) else bar_color_not_sig for x in all_pvalues[i]]
    fig = sns.barplot(x=np.log10(all_pvalues[i])*-1, y=all_terms[i], palette=bar_colors, edgecolor=edgecolor, linewidth=linewidth)
    fig.axes.get_yaxis().set_visible(False)
    fig.set_title(all_libraries[i], fontsize=26)
    fig.set_xlabel('−log₁₀(p‐value)', fontsize=25)
    fig.tick_params(axis='x', which='major', labelsize=20)
    if max(np.log10(all_pvalues[i])*-1)<1:
        fig.xaxis.set_ticks(np.arange(0, max(np.log10(all_pvalues[i])*-1), 0.1))
    for ii, annot in enumerate(all_terms[i]):
        # Append the p-value to each term label; an asterisk marks significant terms.
        pval_text = np.format_float_scientific(all_pvalues[i][ii], precision=2)
        separator = '  *' if all_pvalues[i][ii] < 0.05 else '  '
        annot = separator.join([annot, pval_text])

        title_start = max(fig.axes.get_xlim())/200
        fig.text(title_start, ii, annot, ha='left', wrap = True, fontsize = 26)

    fig.spines['right'].set_visible(False)
    fig.spines['top'].set_visible(False)
    # Save results 
    for plot_name in plot_names:
        plt.savefig(plot_name, bbox_inches = 'tight')
    
    # Show plot 
    plt.show()  

In [None]:
# Create Manhattan Plots
def manhattan(df):
    df = df.sort_values("Name")
    list_of_xaxis_values = df["Name"].values.tolist()

    # define the output figure and the features we want
    p = figure(x_range = list_of_xaxis_values, plot_height=300, plot_width=750, tools='pan, box_zoom, hover, reset, save')

    # a single library is plotted, so use the first color of the palette
    point_color = manhattan_colors[0]

    # calculate actual p value from -log(p value)
    actual_pvalues = ["{:.5e}".format(10**(-1*log_value)) for log_value in df["-log(p value)"].values.tolist()]

    # define ColumnDataSource with our data for this library
    source = ColumnDataSource(data=dict(
        x = df["Name"].values.tolist(),
        y = df["-log(p value)"].values.tolist(),
        pvalue = actual_pvalues,
    ))

    # plot one point per set in the library
    p.circle(x = 'x', y = 'y', size=5, fill_color=point_color, line_color = point_color, line_width=1, source = source)

    p.background_fill_color = 'white'
    p.xaxis.major_tick_line_color = None 
    p.xaxis.major_label_text_font_size = '0pt'
    p.y_range.start = 0
    p.yaxis.axis_label = '-log10(p value)'

    p.hover.tooltips = [
        ("Name", "@x"),
        ("p value", "@pvalue"),
    ]
    p.output_backend = "svg"
    
    # returns the plot
    return p

# Enrichment Analysis

The table below displays the top 20 enrichment analysis results for the given set library. Each row lists the set name, p-value, odds ratio, number of overlapping items, the overlapping items themselves, and −log₁₀(p-value). The table is sorted by p-value in ascending order. The full results are downloadable in CSV format.
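
To make the ranking and the −log₁₀ column concrete, here is a small sketch using made-up p-values. The term names, the numbers, and the `top_k` variable are illustrative assumptions; only the column names mirror the ones used in this notebook.

```python
import math

# Hypothetical (term, p-value) pairs standing in for real enrichment results.
results = [("TermB", 0.2), ("TermA", 0.001), ("TermC", 0.04)]

# Sort ascending by p-value and add the -log10 transform used for plotting.
table = [
    {"Name": name, "p value": pvalue, "-log(p value)": -math.log10(pvalue)}
    for name, pvalue in sorted(results, key=lambda r: r[1])
]

# Keep only the first rows for display, as the notebook does with its top 20.
top_k = 2
for row in table[:top_k]:
    print(row)
```

Smaller p-values map to larger −log₁₀ values, so the most significant sets sit at the top of the table and at the highest points in the plots.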

In [None]:
results, pvals, results_df = enrichment_analysis(items, library_filename)

In [None]:
if 'p value' in results_df.columns:
    sorted_df = results_df.sort_values(by = ['p value'])
    filtered_df = sorted_df.iloc[:display_topk]
    if len(filtered_df) != 0:
        display(HTML(filtered_df.to_html(index = False)))
        display(Markdown(f"*Table 1. Enrichment analysis results of {library_name}*"))        
        display(create_download_link(sorted_df))

# Bar Chart

In [None]:
display(Markdown(f"The bar chart below shows the top {topk} enriched terms in a given library. Colored bars correspond to terms with significant p-values (<0.05). The bar chart is downloadable as an image in the PNG and SVG formats. "))

In [None]:
enrichr_figure(results, pvals, final_output_file_names, [library_name], color, topk)
display(Markdown(f"*Figure 1. Bar chart of the top {topk} enriched terms in {library_name}, along with their corresponding p-values. Colored bars correspond to terms with significant p-values (<0.05).*"))     
    
# Download Bar Chart
for i, file in enumerate(final_output_file_names):
    display(FileLink(file, result_html_prefix=str('Download ' + figure_file_format[i] + ': ')))
    


# Manhattan Plot

In the Manhattan plot below, each point along the x-axis denotes a single set from the library, while the y-axis measures the −log₁₀(p‐value) for each set. Hovering over a point displays the name of the set and the associated p-value. You can also zoom, pan, and save the plot as an SVG using the toolbar on the right.

In [None]:
show(manhattan(results_df))
display(Markdown(f"*Figure 2. Manhattan plot that displays sets from {library_name} and their p-values on a -log10 scale.*"))     