# Getting NR3C1 Rankings with the Enrichr API

This notebook reads the text files in the gene lists directory and submits the gene set in each file to Enrichr to find transcription factor overlap based on the ChEA 2016 library. The NR3C1 rankings are then extracted to see how well the three DEG methods performed.

In [1]:
import json
import requests
import pandas as pd
import time
import os
import scipy.stats as stats
import plotly.graph_objects as go

## Import Gene set files

In [2]:
# retrieves the file names and sorts in alphabetical order
directory = "gene_lists"
gene_files = os.listdir(directory)
gene_files.sort()
gene_files

['cd_200_down_genes.txt',
 'cd_200_up_genes.txt',
 'cd_up+down_genes.txt',
 'limma_200_down_genes.txt',
 'limma_200_up_genes.txt',
 'limma_up+down_genes.txt',
 'logfc_200_down_genes.txt',
 'logfc_200_up_genes.txt',
 'logfc_up+down_genes.txt']

In [3]:
# Creates a list to store the DEG methods associated with each file
types = ['CD', 'CD', 'CD', 'Limma', 'Limma', 'Limma', 'LogFC', 'LogFC', 'LogFC']
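
Because the labels in `types` must stay aligned with the sorted file names, one option is to derive them from the file-name prefix instead of typing them by hand. The sketch below assumes every file name starts with `cd_`, `limma_`, or `logfc_`, as in the listing above; `METHOD_LABELS` and `method_from_filename` are hypothetical names introduced only for illustration.

```python
# Hypothetical mapping from file-name prefix to DEG method label
METHOD_LABELS = {"cd": "CD", "limma": "Limma", "logfc": "LogFC"}


def method_from_filename(filename):
    """Return the method label for a file named '<prefix>_...'."""
    prefix = filename.split("_", 1)[0]
    # Raises KeyError for an unexpected prefix, which surfaces misnamed files early
    return METHOD_LABELS[prefix]


example_files = [
    "cd_200_down_genes.txt",
    "cd_up+down_genes.txt",
    "limma_200_up_genes.txt",
    "logfc_up+down_genes.txt",
]
derived_types = [method_from_filename(name) for name in example_files]
print(derived_types)
```

Deriving the labels this way keeps them correct if files are added or the sort order changes.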

In [4]:
# Read the genes from each file and store them as a list
gene_lists = []

for filename in gene_files:
    f = os.path.join(directory, filename)
    # skip anything that is not a regular file
    if os.path.isfile(f):
        # the context manager closes the file even if reading fails
        with open(f, 'r') as open_gene_list_file:
            genes = [x.strip() for x in open_gene_list_file]
        gene_lists.append(genes)

## Running Enrichr

The function below uses the Enrichr API to submit a gene list and retrieve its enrichment results. The code is adapted from the Appyter at https://appyters.maayanlab.cloud/Enrichr_compressed_bar_chart_figure/

It is used to run all the gene sets through Enrichr.

In [5]:
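# Function to get Enrichr results
def Enrichr_API(enrichr_gene_list, all_libraries):
    # Returns [ranks, terms, p-values, adjusted p-values, short ID, successful libraries];
    # the first four entries hold one inner list per successfully queried library.

    all_ranks = []
    all_terms = []
    all_pvalues = []
    all_adjusted_pvalues = []
    library_success = []
    short_id = ''

    for library_name in all_libraries: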
        ENRICHR_URL = 'https://maayanlab.cloud/Enrichr/addList'
        genes_str = '\n'.join(enrichr_gene_list)
        description = 'Example gene list'
        payload = {
            'list': (None, genes_str),
            'description': (None, description)
        }

        response = requests.post(ENRICHR_URL, files=payload)
        if not response.ok:
            raise Exception('Error analyzing gene list')

        data = json.loads(response.text)
        time.sleep(0.5)
        ENRICHR_URL = 'https://maayanlab.cloud/Enrichr/enrich'
        query_string = '?userListId=%s&backgroundType=%s'
        user_list_id = data['userListId']
        short_id = data["shortId"]
        gene_set_library = library_name
        response = requests.get(
            ENRICHR_URL + query_string % (user_list_id, gene_set_library)
        )
        if not response.ok:
            raise Exception('Error fetching enrichment results')
        try:
            data = json.loads(response.text)
            results_df  = pd.DataFrame(data[library_name])
            all_ranks.append(list(results_df[0]))
            all_terms.append(list(results_df[1]))
            all_pvalues.append(list(results_df[2]))
            all_adjusted_pvalues.append(list(results_df[6]))
            library_success.append(library_name)
        except Exception as err:
            # Report the failing library and the reason instead of hiding it
            print('Error for ' + library_name + ' library: ' + str(err))

    return [all_ranks, all_terms, all_pvalues, all_adjusted_pvalues, str(short_id), library_success]

In [6]:
# Run Enrichr on the gene sets
results = []

for i in range(len(gene_lists)):
    result = Enrichr_API(gene_lists[i], ['ChEA_2016'])
    results.append(result)
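
Each enrichment row that Enrichr returns for a library is a positional list, which is why `Enrichr_API` reads columns `0`, `1`, `2`, and `6`. As a minimal sketch, the rows can be labeled by field name before further processing. The field order below is assumed from the Enrichr API documentation; `ENRICHR_FIELDS`, `label_enrichr_rows`, and the sample row are illustrative assumptions, not real results.

```python
# Assumed positional layout of one Enrichr result row
ENRICHR_FIELDS = [
    "rank",
    "term",
    "p_value",
    "z_score",
    "combined_score",
    "overlapping_genes",
    "adjusted_p_value",
    "old_p_value",
    "old_adjusted_p_value",
]


def label_enrichr_rows(rows):
    """Turn positional result rows into dictionaries keyed by field name."""
    return [dict(zip(ENRICHR_FIELDS, row)) for row in rows]


# Made-up placeholder row, not actual Enrichr output
sample_rows = [
    [1, "TF_A 00000000 ChIP-Seq CellLine Human", 0.01, -1.5, 6.9,
     ["GENE1", "GENE2"], 0.04, 0, 0],
]
labeled = label_enrichr_rows(sample_rows)
print(labeled[0]["term"], labeled[0]["adjusted_p_value"])
```

Named fields make it harder to confuse the raw p-value (index 2) with the adjusted p-value (index 6).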

## Get NR3C1 Rankings

The NR3C1 rankings are taken by iterating through the results for each gene set. The rankings for each file are displayed in a summary table.

In [7]:
# Initialize lists for storing NR3C1 information
gene_set = []
names = []
ranks = []
p_val = []
methods = []

# Iterate over each result
for i in range(len(results)):
    # Within each gene set, iterate over the transcription factor names
    for j in range(len(results[i][1][0])):
        # If NR3C1 is found, add the information to the lists
        if 'NR3C1' in results[i][1][0][j]:
            names.append(results[i][1][0][j])
            ranks.append(results[i][0][0][j])
            p_val.append(results[i][2][0][j])
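            # splitext removes only the extension; str.strip(".txt") strips any of those characters from both ends
            gene_set.append(os.path.splitext(gene_files[i])[0])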
            methods.append(types[i])

print(gene_set, methods, names, ranks, p_val)

['cd_200_down_genes', 'cd_200_down_genes', 'cd_200_up_genes', 'cd_200_up_genes', 'cd_up+down_genes', 'cd_up+down_genes', 'limma_200_down_genes', 'limma_200_down_genes', 'limma_200_up_genes', 'limma_200_up_genes', 'limma_up+down_genes', 'limma_up+down_genes', 'logfc_200_down_genes', 'logfc_200_down_genes', 'logfc_200_up_genes', 'logfc_200_up_genes', 'logfc_up+down_genes', 'logfc_up+down_genes'] ['CD', 'CD', 'CD', 'CD', 'CD', 'CD', 'Limma', 'Limma', 'Limma', 'Limma', 'Limma', 'Limma', 'LogFC', 'LogFC', 'LogFC', 'LogFC', 'LogFC', 'LogFC'] ['NR3C1 23031785 ChIP-Seq PC12 Mouse', 'NR3C1 21868756 ChIP-Seq MCF10A Human', 'NR3C1 23031785 ChIP-Seq PC12 Mouse', 'NR3C1 21868756 ChIP-Seq MCF10A Human', 'NR3C1 23031785 ChIP-Seq PC12 Mouse', 'NR3C1 21868756 ChIP-Seq MCF10A Human', 'NR3C1 23031785 ChIP-Seq PC12 Mouse', 'NR3C1 21868756 ChIP-Seq MCF10A Human', 'NR3C1 23031785 ChIP-Seq PC12 Mouse', 'NR3C1 21868756 ChIP-Seq MCF10A Human', 'NR3C1 23031785 ChIP-Seq PC12 Mouse', 'NR3C1 21868756 ChIP-Seq MCF1
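
The nested index loop in the cell above can also be written as a single pass over paired lists. The sketch below is an alternative formulation, not the notebook's code: `find_tf_hits` is a hypothetical helper, the ranks and p-values are made-up placeholders, and it matches on the first whitespace-separated token of the term name, so a term that merely contains the string `NR3C1` somewhere else would not be counted.

```python
def find_tf_hits(tf_name, terms, ranks, p_values):
    """Return (term, rank, p_value) for terms whose first token equals tf_name."""
    hits = []
    for term, rank, p_value in zip(terms, ranks, p_values):
        tokens = term.split()
        if tokens and tokens[0] == tf_name:
            hits.append((term, rank, p_value))
    return hits


# Placeholder inputs; the ranks and p-values are invented for illustration
terms = [
    "TF_A 00000000 ChIP-Seq CellLine Human",
    "NR3C1 23031785 ChIP-Seq PC12 Mouse",
    "NR3C1 21868756 ChIP-Seq MCF10A Human",
]
ranks = [1, 2, 3]
p_values = [0.01, 0.02, 0.03]

nr3c1_hits = find_tf_hits("NR3C1", terms, ranks, p_values)
print(nr3c1_hits)
```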

In [8]:
# Create a display dataframe
df = pd.DataFrame(list(zip(gene_set, methods, names, ranks, p_val)),
                 columns = ['Gene Set', 'Method', 'Name','Rank','p-value'])
df

| | Gene Set | Method | Name | Rank | p-value |
|---|---|---|---|---|---|
| 0 | cd_200_down_genes | CD | NR3C1 23031785 ChIP-Seq PC12 Mouse | 348 | 0.436921 |
| 1 | cd_200_down_genes | CD | NR3C1 21868756 ChIP-Seq MCF10A Human | 416 | 0.582927 |
| 2 | cd_200_up_genes | CD | NR3C1 23031785 ChIP-Seq PC12 Mouse | 90 | 0.955186 |
| 3 | cd_200_up_genes | CD | NR3C1 21868756 ChIP-Seq MCF10A Human | 117 | 0.972929 |
| 4 | cd_up+down_genes | CD | NR3C1 23031785 ChIP-Seq PC12 Mouse | 202 | 0.823345 |
| 5 | cd_up+down_genes | CD | NR3C1 21868756 ChIP-Seq MCF10A Human | 283 | 0.915026 |
| 6 | limma_200_down_genes | Limma | NR3C1 23031785 ChIP-Seq PC12 Mouse | 78 | 0.209472 |
| 7 | limma_200_down_genes | Limma | NR3C1 21868756 ChIP-Seq MCF10A Human | 262 | 0.701285 |
| 8 | limma_200_up_genes | Limma | NR3C1 23031785 ChIP-Seq PC12 Mouse | 14 | 0.209472 |
| 9 | limma_200_up_genes | Limma | NR3C1 21868756 ChIP-Seq MCF10A Human | 132 | 0.701285 |


In [50]:
# Bar chart based on the rankings
fig1 = go.Figure()
fig1.add_trace(go.Bar(
    x=df[df['Name'] == "NR3C1 21868756 ChIP-Seq MCF10A Human"]['Gene Set'],
    y=df[df['Name'] == "NR3C1 21868756 ChIP-Seq MCF10A Human"]['Rank'],
    name="Human"))
fig1.add_trace(go.Bar(
    x=df[df['Name'] == "NR3C1 23031785 ChIP-Seq PC12 Mouse"]['Gene Set'],
    y=df[df['Name'] == "NR3C1 23031785 ChIP-Seq PC12 Mouse"]['Rank'],
    name="Mouse"))
# Grouped bars, with gene sets ordered by total rank (the earlier 'stack' setting was overridden anyway)
fig1.update_layout(barmode='group', xaxis={'categoryorder':'total descending'})
fig1.show()

In [53]:
# Bar chart based on p-values
fig2 = go.Figure()
fig2.add_trace(go.Bar(
    x=df[df['Name'] == "NR3C1 21868756 ChIP-Seq MCF10A Human"]['Gene Set'],
    y=df[df['Name'] == "NR3C1 21868756 ChIP-Seq MCF10A Human"]['p-value'],
    name="Human"))
fig2.add_trace(go.Bar(
    x=df[df['Name'] == "NR3C1 23031785 ChIP-Seq PC12 Mouse"]['Gene Set'],
    y=df[df['Name'] == "NR3C1 23031785 ChIP-Seq PC12 Mouse"]['p-value'],
    name="Mouse"))
fig2.show()

## Comparing methods

Here, the results are sorted by the mean rank for each combination of method and NR3C1 ChIP-seq dataset (mouse or human). From this preliminary summary, the characteristic direction (CD) method does not seem to work as well as the Limma and LogFC methods. A Wilcoxon signed-rank test is also performed on each pair of methods to check whether their NR3C1 ranks differ significantly. A smaller p-value indicates stronger evidence that the two methods produce different ranks.

In [10]:
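# Calculate the mean rank and p-value for each method and name, then sort by mean rank
# (selecting the numeric columns avoids averaging the text column 'Gene Set')
df.groupby(['Method', 'Name'])[['Rank', 'p-value']].mean().sort_values(by='Rank')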

| Method | Name | Rank | p-value |
|---|---|---|---|
| Limma | NR3C1 23031785 ChIP-Seq PC12 Mouse | 40.0 | 0.176366 |
| LogFC | NR3C1 23031785 ChIP-Seq PC12 Mouse | 41.666667 | 0.094045 |
| LogFC | NR3C1 21868756 ChIP-Seq MCF10A Human | 128.0 | 0.335204 |
| Limma | NR3C1 21868756 ChIP-Seq MCF10A Human | 188.0 | 0.71702 |
| CD | NR3C1 23031785 ChIP-Seq PC12 Mouse | 213.333333 | 0.738484 |
| CD | NR3C1 21868756 ChIP-Seq MCF10A Human | 272.0 | 0.823627 |


In [11]:
# Getting ranks of each method as a list
limma_ranks = df[df["Method"] == 'Limma']['Rank'].tolist()
cd_ranks = df[df["Method"] == 'CD']['Rank'].tolist()
fc_ranks = df[df["Method"] == 'LogFC']['Rank'].tolist()

# Report results of Wilcoxon test on each pairing of methods

w1, p1 = stats.wilcoxon(cd_ranks, limma_ranks)
w2, p2 = stats.wilcoxon(cd_ranks, fc_ranks)
w3, p3 = stats.wilcoxon(limma_ranks, fc_ranks)

print("The p-value for CD and Limma is " + str(p1))
print("The p-value for CD and LogFC is " + str(p2))
print("The p-value for Limma and LogFC is " + str(p3))

The p-value for CD and Limma is 0.0625
The p-value for CD and LogFC is 0.09375
The p-value for Limma and LogFC is 0.6875


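From these p-values and the summary table, the ranking of methods from best to worst in this scenario appears to be Limma, LogFC, Characteristic Direction. None of the pairwise differences is significant at the 0.05 level, however: the two comparisons involving CD come closest (p = 0.0625 and p = 0.09375), while Limma and LogFC are not distinguishable (p = 0.6875).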
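
With only six paired ranks per comparison, an exact Wilcoxon signed-rank test can return only a few distinct p-values, which helps when reading the numbers above. The sketch below enumerates every sign pattern to compute the exact two-sided p-value. It assumes no zero differences and no tied absolute differences, uses made-up ranks rather than the notebook's data, and `exact_signed_rank_p` is a hypothetical helper, not a SciPy function.

```python
from itertools import product


def exact_signed_rank_p(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    abs_diffs = [abs(d) for d in diffs]
    if 0 in abs_diffs or len(set(abs_diffs)) != len(abs_diffs):
        raise ValueError("zero or tied differences; enumeration is not valid here")

    n = len(diffs)
    # Rank absolute differences from 1 (smallest) to n (largest)
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    # Observed statistic: sum of ranks attached to positive differences
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Null distribution: each of the 2**n sign patterns is equally likely
    null = [
        sum(r for r, s in zip(range(1, n + 1), signs) if s)
        for signs in product([0, 1], repeat=n)
    ]
    lower = sum(w <= w_plus for w in null) / len(null)
    upper = sum(w >= w_plus for w in null) / len(null)
    return min(1.0, 2 * min(lower, upper))


# Made-up paired ranks: every difference positive
p_all_same = exact_signed_rank_p([10, 20, 30, 40, 50, 60], [9, 18, 27, 36, 45, 54])
# Same data, but the smallest difference reversed
p_one_flip = exact_signed_rank_p([10, 20, 30, 40, 50, 60], [11, 18, 27, 36, 45, 54])
print(p_all_same, p_one_flip)
```

Under these assumptions the smallest achievable two-sided p-value for six pairs is 2/64 = 0.03125 and the next is 4/64 = 0.0625, so a result of 0.0625 means that all differences except the smallest one pointed the same way.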