# Enrichment Analysis using the KnetMiner SPARQL endpoint with Jupyter

This Jupyter Notebook uses the KnetMiner SPARQL endpoint to extract gene expression data from the RDF database and perform enrichment analysis.
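For orientation, the sketch below shows how a SPARQL query is typically sent to an endpoint over HTTP, following the standard SPARQL protocol (a `query` parameter plus an `Accept` header asking for JSON results). The endpoint URL and the query are placeholders, not the ones this notebook uses; the actual queries live in `enrichment_analysis_functions`, which is not shown here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint (assumption): replace with the real SPARQL endpoint URL.
ENDPOINT = "https://example.org/sparql"

# Generic illustrative query (assumption): returns any five triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""

def build_sparql_request(endpoint, query):
    """Build a GET request that follows the SPARQL protocol."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# The request is only built here, not sent. Against a real endpoint,
# urllib.request.urlopen(request) would return the JSON result set.
```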

## Choose the tax ID, concept, and study, or enter a list of genes

### Steps:
1. Run the first cell to get the radio buttons for selection of species, concept and study or list.
2. Then run the second cell to get the results for the selected species and concept.

**Note:** You only need to run the first cell once. If you want to change your choices, make the new selections and then run the second cell again.

### The results generated are:
1. The gene-concept table, containing the gene list with the related ontology terms and evidence
2. The enrichment table

### Please note:
1. The only tax IDs that will generate a table of studies are:
    - 4565: Triticum aestivum (wheat)
    - 3702: Arabidopsis thaliana

2. The list of studies generated are differential studies; baseline studies are filtered out.

3. You can filter the gene-concept table using the 2 cells at the bottom by either:
    - choosing the ontology term to display the related genes
    - or choosing a gene to display the related ontology terms

## Evidence codes
Evidence codes have the form `evidenceType_homology-interaction`, where the `homology` and `interaction` flags are `1` if present and `0` if absent.

**Evidence Types:**
- TM: Text Mining
- GWAS: Genetic Study
- GO: Gene Ontology

**Definitions:**
- A homologous gene (or homolog) is a gene inherited in two species from a common ancestor.
- Genetic interaction networks represent the functional interactions between pairs of genes.

***Example:*** *GWAS_0-1 is Genetic Study, no homology, interaction present*
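As a minimal sketch of how such a code can be decoded, the helper below splits an evidence code into its three parts. The function name `parse_evidence_code` is hypothetical and not part of the notebook, and it assumes the two flags are `0` or `1`, as in the example above.

```python
# Evidence type abbreviations as listed in this notebook.
EVIDENCE_TYPES = {
    "TM": "Text Mining",
    "GWAS": "Genetic Study",
    "GO": "Gene Ontology",
}

def parse_evidence_code(code):
    """Split a code such as 'GWAS_0-1' into type, homology and interaction."""
    evidence_type, flags = code.split("_", 1)
    homology, interaction = flags.split("-", 1)
    return {
        # Fall back to the raw abbreviation if it is not in the lookup table.
        "evidence_type": EVIDENCE_TYPES.get(evidence_type, evidence_type),
        # Assumption: "0" means absent, anything else means present.
        "homology": homology != "0",
        "interaction": interaction != "0",
    }

print(parse_evidence_code("GWAS_0-1"))
```

For `GWAS_0-1` this reports a Genetic Study with no homology and an interaction present, matching the example.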

In [1]:
# Import the libraries and functions
from enrichment_analysis_functions import *
from os.path import exists as file_exists

# display full dataframe pandas
pd.set_option('display.max_rows', None)

# change string for interact button
interact_manual2 = interact.options(manual=True, manual_name="Run")

# create dataframe for Tax IDs and their names
dframe_taxID = df_taxID()
# create list of concepts
concepts = get_concepts()

# display radio buttons for choosing species and concept
print("Select species:")
radiobuttons1 = display_radiobuttons(data = list(dframe_taxID['Tax Names']))

print("Select concept:")
radiobuttons2 = display_radiobuttons(data = concepts)

print("Do you want to get the list of genes from a study or use your own list?")
radiobuttons3 = display_radiobuttons(data = ["Study", "List of Genes"])

Select species:


RadioButtons(options=('Triticum aestivum (wheat)', 'Arabidopsis thaliana (thale cress)', 'Oryza sativa (rice)'…

Select concept:


RadioButtons(options=('Trait', 'BioProcess'), value=None)

Do you want to get the list of genes from a study or use your own list?


RadioButtons(options=('Study', 'List of Genes'), value=None)

In [2]:
# get the selections
species = radiobuttons1.get_interact_value()
concept = radiobuttons2.get_interact_value()
studyOrList = radiobuttons3.get_interact_value()

# Alternatively, set the selections manually. This is useful because the widgets
# can be buggy or fail to work in some Jupyter environments.

species = "Triticum aestivum (wheat)"
#species = "Arabidopsis thaliana (thale cress)"

#concept = "BioProcess"
concept = "Trait"

#studyOrList = "List of Genes"
studyOrList = "Study"


In [2]:

# create global variables for the interactive functions
total_DEXgenes = set()
dframe_GeneTrait_filtered = pd.DataFrame()
df_Ftest_sorted = pd.DataFrame()

# check that user made radiobutton selections
if species and concept and studyOrList:

    # get the tax ID 
    taxID = dframe_taxID[dframe_taxID['Tax Names'] == species]['Tax IDs'].item()
    print(f'Tax ID for {species} is: {taxID}')

    # get total number of genes in the database for the selected tax ID
    total_db_genes = get_gene_count(taxID)
    print('Total Number of Genes = ' + str(total_db_genes))

    # import csv for genes and related concept (for the selected tax ID)
    file_name = f'Gene{concept}Table_{taxID}'
    cols = ["Gene Accession", "Gene Name", "Ontology Term", "Preferred Name", "Evidence", "Network URL"]

    if file_exists(file_name+'.csv'):
        dframe_GeneTrait = pd.read_csv(file_name+'.csv', usecols= cols)
    elif file_exists(file_name+'.zip'):
        dframe_GeneTrait = pd.read_csv(file_name+'.zip', usecols= cols)
    else:
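        # stop here: the rest of the cell needs dframe_GeneTrait to be defined
        raise FileNotFoundError(f"Neither {file_name}.csv nor {file_name}.zip exists!")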


    if studyOrList == "Study":
        print("\nLoading studies ... Please wait.")
        
        # get the dataframe of the studies and their accession numbers
        dframe_study_list = get_study_list(taxID)
        
        if dframe_study_list.shape[0] != 0:
            print("\nChoose from the list of studies related to the chosen Tax ID:")

            @interact_manual2
            def get_study_list_for_species(Study_Title = dframe_study_list['Accession_Title']):

                # get study accession number
                studyAcc = dframe_study_list[dframe_study_list['Accession_Title'] == Study_Title]['Study Accession'].item()
                print("Study Accession is: " + studyAcc)

                # get number of genes in the study
                StudyGeneCount = get_StudyGeneCount(studyAcc)
                print("Total Number of Genes in study = " + str(StudyGeneCount))

                print("\nDo you want to perform the enrichment analysis with all the genes or filter the genes according to p-value?")

                selector = display_radiobuttons(data = ["All Genes", "Filter Genes"])

                @interact_manual2
                def evaluate():
                    global total_DEXgenes
                    global dframe_GeneTrait_filtered
                    global df_Ftest_sorted

                    selection = selector.get_interact_value()

                    if (selection == "All Genes"):

                        # get unique set of genes
                        total_DEXgenes = get_study_DEXgenes(studyAcc)
                        
                        print("\nLoading results for all genes ... Please wait.")
                        # get final tables
                        dframe_GeneTrait_filtered, df_Ftest_sorted = get_df_Ftest_sorted(dframe_GeneTrait, total_DEXgenes, total_db_genes)

                    elif (selection == "Filter Genes"):
                        
                        %matplotlib widget
                        df_StudyPvalues_count = get_StudyPvalues(studyAcc)
                        
                        print("Move the mouse over the line graph to view the values for x and y.\n" +
                        "Or move the slider (below the graph) to view the p-values and the corresponding number of genes.\n" +
                        "Note: Use the options on the left side of the graph to zoom, download and interact with the graph.\n")
                        
                        plot_pvalues(df_StudyPvalues_count, pvalues=0)

                        print("After choosing the p-value and number of genes with the slider, click the 'Run Analysis' button to get the result tables.\n")

                        # create slider and function to print values
                        pvalues_slider = widgets.FloatSlider(value=0, min=0,
                                        max=df_StudyPvalues_count.iloc[-1]['pvalues']+0.001, step=0.0001,
                                        description='Pvalues', readout_format='1.4f', layout=widgets.Layout(width='70%'))

                        def show_slider(pval):
                            if pval != 0 and len(df_StudyPvalues_count[df_StudyPvalues_count['pvalues'] <= pval]) != 0:
                                geneNum = df_StudyPvalues_count[df_StudyPvalues_count['pvalues'] <= pval].iloc[-1]['Cumulative Frequency']
                                print('p-value = %1.4f, Number of genes = %1.0f' % (pval, geneNum))
                            else:
                                print("")

                        w = widgets.interactive(show_slider, pval = pvalues_slider)

                        # create button and its function
                        button = widgets.Button(description="Run Analysis")
                        output = widgets.Output()

                        def on_button_clicked(b):
                            global total_DEXgenes
                            global dframe_GeneTrait_filtered
                            global df_Ftest_sorted

                            with output:
                                output.clear_output(wait=True)
                                pval = pvalues_slider.get_interact_value()

                                if pval != 0 and len(df_StudyPvalues_count[df_StudyPvalues_count['pvalues'] <= pval]) != 0:
                                    # get unique set of genes
                                    total_DEXgenes = get_study_DEXgenes(studyAcc, filter=True, pvalue=pval)
                                    
                                    print(f"\nLoading results for {len(total_DEXgenes)} genes ... Please wait.")
                                    # get final tables
                                    dframe_GeneTrait_filtered, df_Ftest_sorted = get_df_Ftest_sorted(dframe_GeneTrait, total_DEXgenes, total_db_genes)
                                
                                else:
                                    print("No results!")

                        button.on_click(on_button_clicked)

                        # display slider, button and output area for button
                        w.children += (button, output,)
                        display(w)
        
                    else:
                        print("Please make a selection!")

        else:
            print("\nNo studies in the database for the selected tax ID. Please choose 'List of Genes' and provide your list.")
        


    elif studyOrList == "List of Genes":
        print ("\nPlease paste the list of genes (separated by spaces).")
        
        @interact_manual2
        def input_genes_list(genes = ''):
            global total_DEXgenes
            global dframe_GeneTrait_filtered
            global df_Ftest_sorted
            
            # get user input genes as list
            genes_list = genes.split()
            # get unique set of genes
            total_DEXgenes = set(genes_list)
                            
            print("\n" + str(len(total_DEXgenes)) + " genes provided:")
            for g in genes_list:
                print(g)
            
            print("\nLoading results ... Please wait.")
            # get final tables
            dframe_GeneTrait_filtered, df_Ftest_sorted = get_df_Ftest_sorted(dframe_GeneTrait, total_DEXgenes, total_db_genes)


else:
    print("Please select species, concept and study or list.")



Tax ID for Triticum aestivum (wheat) is: 4565
Total Number of Genes = 116503

Loading studies ... Please wait.

Choose from the list of studies related to the chosen Tax ID:


interactive(children=(Dropdown(description='Study_Title', options=('E-GEOD-25759: Effect of downregulation of …

## View whole tables section
If you want to display the whole tables in the notebook, run each of the two cells below.

**Gene-concept Table:**

In [3]:
# copy dataframe to avoid editing and changing data type of the original
df_GeneTrait_filtered = dframe_GeneTrait_filtered[:].copy()

# display the gene-trait table, rendering each network URL as a clickable HTML link
s = "View Network"
df_GeneTrait_filtered['Network URL'] = df_GeneTrait_filtered['Network URL'].apply(lambda x: f'<a href="{x}">{s}</a>')

HTML(df_GeneTrait_filtered.to_html(render_links=True, escape=False))

| | Gene Accession | Gene Name | Ontology Term | Preferred Name | Evidence | Network URL |
|---|---|---|---|---|---|---|
| 0 | TRAESCS2A02G334600 | AIH | 0000072 | Grain hardness | GWAS_0-0 | View Network |
| 1 | TRAESCS3D02G115400 | HSP17.6 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 2 | TRAESCS3A02G113100 | HSP17.6 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 3 | TRAESCS3B02G130900 | HSP17.7 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 4 | TRAESCS3B02G131100 | HSP17.7 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 5 | TRAESCS6A02G316200 | HSP23.6 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 6 | TRAESCS5B02G388900 | LFG4 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 7 | TRAESCS1D02G403400 | TRAESCS1D02G403400 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 8 | TRAESCS2B02G087400 | TRAESCS2B02G087400 | 0000072 | Grain hardness | GWAS_0-1 | View Network |
| 9 | TRAESCS3B02G019900 | TRAESCS3B02G019900 | 0000072 | Grain hardness | GWAS_0-1 | View Network |


**Enrichment Table:**

In [4]:
df_Ftest_sorted

| | Ontology Term | Preferred Name | odds ratio | exact p-value | adj p-value | Reference Genes | User/Study Genes |
|---|---|---|---|---|---|---|---|
| 0 | TO_0000430 | germination rate | 17.746196 | 2.489665e-77 | 3.809187e-75 | 5626 | 107 |
| 1 | TO_0006002 | proline content | 10.730036 | 3.96393e-57 | 3.032406e-55 | 8960 | 107 |
| 2 | TO_0000276 | drought tolerance | 5.94398 | 1.637385e-36 | 8.350665e-35 | 16360 | 112 |
| 3 | TO_0000190 | seed coat color | 14.368467 | 4.736225e-22 | 1.811606e-20 | 1150 | 28 |
| 4 | TO_0002661 | seed maturation | 4.699963 | 4.384546e-16 | 1.341671e-14 | 6291 | 48 |
| 5 | TO_0000253 | seed dormancy | 4.632023 | 5.420875e-14 | 1.382323e-12 | 5296 | 41 |
| 6 | TO_0000112 | disease resistance | 2.867433 | 1.414227e-11 | 3.091096e-10 | 15631 | 70 |
| 7 | TO_0000043 | root morphology trait | 2.772096 | 4.127774e-10 | 7.894368e-09 | 13868 | 62 |
| 8 | TO_0000495 | chlorophyll content | 3.268355 | 6.032035e-10 | 1.025446e-08 | 7763 | 43 |
| 9 | TO_0000259 | heat tolerance | 4.253358 | 6.942071e-08 | 1.062137e-06 | 2870 | 22 |
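
The `odds ratio` and `exact p-value` columns are consistent with a Fisher's exact test computed per ontology term, but the helper `get_df_Ftest_sorted` is not shown in this notebook, so the following is only a sketch of that kind of test, not the notebook's implementation. The function name `fisher_enrichment` is hypothetical. It assumes a one-sided test for over-representation and that the term's gene count includes the study genes; it does not cover the multiple-testing adjustment behind `adj p-value`.

```python
from math import comb

def fisher_enrichment(study_hits, study_size, term_genes, total_genes):
    """One-sided Fisher's exact test for over-representation of one term.

    study_hits  : study genes annotated with the term
    study_size  : all genes in the user/study list
    term_genes  : all database genes annotated with the term (assumed to
                  include the study genes)
    total_genes : all genes in the database for the species
    """
    # 2x2 contingency table
    a = study_hits                    # in study, annotated
    b = study_size - study_hits       # in study, not annotated
    c = term_genes - study_hits       # not in study, annotated
    d = total_genes - study_size - c  # not in study, not annotated
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")

    # Hypergeometric tail: probability of seeing study_hits or more by chance.
    max_hits = min(study_size, term_genes)
    tail = sum(
        comb(term_genes, k) * comb(total_genes - term_genes, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    p_value = tail / comb(total_genes, study_size)
    return odds_ratio, p_value

# Toy numbers, not taken from the table above.
odds, p = fisher_enrichment(study_hits=3, study_size=4, term_genes=4, total_genes=8)
print(f"odds ratio = {odds:.2f}, one-sided p-value = {p:.4f}")
```

An odds ratio above 1 with a small p-value indicates that the term appears in the study list more often than expected from its frequency in the whole database.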


## Choose an ontology term to display related genes

In [5]:
if len(df_Ftest_sorted) != 0:
      
      @interact
      def get_gene_list_for_trait(Ontology = sorted(df_Ftest_sorted['Preferred Name'].unique())):
      
            print(" Ontology Term is: " +
                  str(df_Ftest_sorted[df_Ftest_sorted['Preferred Name'] == Ontology]['Ontology Term'].item()))
            
            print(" Adjusted p-value is: " +
                  str(df_Ftest_sorted[df_Ftest_sorted['Preferred Name'] == Ontology]['adj p-value'].item()))
            
            df = dframe_GeneTrait_filtered.loc[dframe_GeneTrait_filtered['Preferred Name'] == Ontology]
            df = df[["Gene Accession", "Gene Name", "Evidence", "Network URL"]]
            df = df.reset_index(drop=True)
            
            s = "View Network"
            df['Network URL'] = df['Network URL'].apply(lambda x: f'<a href="{x}">{s}</a>')
            
            print("\n Total number of related unique genes from \n user/study list of genes is: " +
                  str(df_Ftest_sorted[df_Ftest_sorted['Preferred Name'] == Ontology]['User/Study Genes'].item()))  

            return HTML(df.to_html(render_links=True, escape=False))

else:
      print("No result!")

interactive(children=(Dropdown(description='Ontology', options=('1000-grain weight', 'Grain hardness', 'Indole…

## Choose a gene to display related ontology terms

**Please Note:** A gene name can have multiple accession numbers, which will be displayed in the printed table.
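
To illustrate the point, the sketch below groups a few accession and gene name pairs copied from the gene-concept table output earlier in this notebook. It uses plain Python rather than the notebook's dataframes, and the variable names are illustrative only.

```python
from collections import defaultdict

# (gene accession, gene name) pairs taken from the gene-concept table output
rows = [
    ("TRAESCS3D02G115400", "HSP17.6"),
    ("TRAESCS3A02G113100", "HSP17.6"),
    ("TRAESCS3B02G130900", "HSP17.7"),
    ("TRAESCS3B02G131100", "HSP17.7"),
    ("TRAESCS6A02G316200", "HSP23.6"),
]

# Collect every accession that shares the same gene name.
accessions_by_name = defaultdict(list)
for accession, name in rows:
    accessions_by_name[name].append(accession)

for name, accessions in accessions_by_name.items():
    print(f"{name}: {len(accessions)} accession(s) -> {', '.join(accessions)}")
```

Here `HSP17.6` and `HSP17.7` each map to two accessions, so selecting either name in the cell below would list both of its accessions.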

In [6]:
if len(dframe_GeneTrait_filtered) != 0:
    
    @interact
    def get_trait_list_for_gene(Gene_Name = sorted(dframe_GeneTrait_filtered['Gene Name'].unique())):
        
        df = dframe_GeneTrait_filtered.loc[dframe_GeneTrait_filtered['Gene Name'] == Gene_Name]
        df = df[["Gene Accession", "Ontology Term", "Preferred Name", "Evidence", "Network URL"]]
        df = df.reset_index(drop=True)
        
        s = "View Network"
        df['Network URL'] = df['Network URL'].apply(lambda x: f'<a href="{x}">{s}</a>')
        
        print("\n Total number of related unique ontology terms is: " + str(len(df['Ontology Term'].unique())))
        
        return HTML(df.to_html(render_links=True, escape=False))

else:
    print("No result!")

interactive(children=(Dropdown(description='Gene_Name', options=('AHA7', 'AIH', 'ALN', 'ARGAH1', 'ASK4', 'ATPA…