Compute enrichment of gene sets in our predictions #6

Closed
tmmurali opened this issue Apr 2, 2020 · 8 comments

@tmmurali
Member

tmmurali commented Apr 2, 2020

We have a ranked list of predictions coming from network propagation or from host-virus PPI prediction. This issue is relevant mainly for human proteins. We also have a collection of gene sets, e.g., from https://amp.pharm.mssm.edu/covid19/. We want to assess the extent to which each gene set is enriched in our list of predictions.

There are two approaches I suggest:

  1. For each value of k, use Fisher's exact test (hypergeometric test) to compute the p-value of the overlap between the top-k predictions and a gene set. Plot the negative logarithm of the p-value as k increases. Alternatively, plot the size of the overlap and colour each point according to whether the overlap is statistically significant. There is no need to try every value of k; increments of 10, 50, or 100 may be sufficient. The increment can be a parameter to the code.
  2. Use an enrichment method such as GSEA that can consider the entire ranked list of predictions.
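
A minimal sketch of the first approach, using only the Python standard library. The function names (`hypergeom_sf`, `topk_enrichment`) and the `step` parameter are illustrative, not taken from the project's code. The sketch assumes the ranked list has no duplicate genes and treats the ranked list itself as the background universe, which may not be the right choice for every analysis.

```python
from math import comb


def hypergeom_sf(overlap, universe_size, set_size, k):
    """P(X >= overlap) when k genes are drawn without replacement from a
    universe of `universe_size` genes, `set_size` of which are in the gene set."""
    upper = min(set_size, k)
    if overlap > upper:
        return 0.0
    # Sum exact integer counts first, then divide once to limit rounding error.
    numerator = sum(
        comb(set_size, i) * comb(universe_size - set_size, k - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / comb(universe_size, k)


def topk_enrichment(ranked_genes, gene_set, step=50):
    """Return (k, overlap, p_value) for k = step, 2*step, ... over a ranked list.

    Assumption: the universe is the set of genes in `ranked_genes`, so gene-set
    members that were never ranked are ignored.
    """
    universe = set(ranked_genes)
    members = set(gene_set) & universe
    results = []
    for k in range(step, len(ranked_genes) + 1, step):
        overlap = len(members.intersection(ranked_genes[:k]))
        p_value = hypergeom_sf(overlap, len(universe), len(members), k)
        results.append((k, overlap, p_value))
    return results


if __name__ == "__main__":
    ranked = [f"g{i}" for i in range(1, 11)]  # toy ranked list, best first
    for k, overlap, p in topk_enrichment(ranked, {"g1", "g2"}, step=2):
        print(k, overlap, p)
```

The resulting `(k, overlap, p_value)` triples can be plotted directly, for example as the negative log of the p-value against k.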

We must correct for testing multiple hypotheses.
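
One common way to do this is the Benjamini-Hochberg procedure, which controls the false discovery rate. The issue does not say which correction is intended, so the choice of method here is an assumption, and the function name `benjamini_hochberg` is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

When scanning many values of k for many gene sets, the number of hypotheses is the number of (gene set, k) pairs tested, so all of those p-values would be passed in together.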

@tmmurali
Member Author

tmmurali commented Apr 2, 2020

Let us catalogue gene sets here. We need to download each one (see #5) and add it to the enrichment analysis.

@jlaw9
Contributor

jlaw9 commented Apr 8, 2020

Currently, the downloadable GMT file for the COVID-19 Crowd Generated Gene Sets does not include the main descriptor text of each gene set, making most gene sets unidentifiable.

I opened an issue on their repo (#82) asking them to fix it.

@jlaw9
Contributor

jlaw9 commented Apr 8, 2020

Just found out that, besides running GSEA, GSEApy also has an enrichr module, which lets you run Enrichr's analysis through its API. This could be very useful, as Enrichr has tons of gene sets!
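
For reference, the rank-based statistic behind GSEA can be sketched in plain Python. This is the unweighted running-sum enrichment score only (every hit counts equally, and no permutation p-value is computed); it is not GSEApy's implementation, and the function name `enrichment_score` is illustrative.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style enrichment score for one gene set.

    Walk down the ranked list, stepping up by 1/n_hits at each gene in the
    set and down by 1/n_miss otherwise; return the largest deviation from 0.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    print(enrichment_score(ranked, {"a", "b"}))  # set concentrated at the top
    print(enrichment_score(ranked, {"c", "d"}))  # set concentrated at the bottom
```

A positive score means the gene set is concentrated near the top of the ranking; assessing significance would still need a null distribution, which this sketch omits.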

@jlaw9
Contributor

jlaw9 commented Apr 9, 2020

They fixed the GMT file for the COVID-19 Crowd Generated Gene Sets!

jlaw9 added a commit that referenced this issue May 7, 2020
of the top predictions of each algorithm, and of any given gene
set. Currently only tests GO BP, MF, and CC. Issue #6
n-tasnina pushed a commit that referenced this issue May 16, 2020
of the top predictions of each algorithm, and of any given gene
set. Currently only tests GO BP, MF, and CC. Issue #6
@tmmurali
Member Author

@jlaw9 @n-tasnina what is the status of running our enrichment pipeline on the COVID-19 gene sets?

@jlaw9
Contributor

jlaw9 commented May 22, 2020

We have the COVID-19 gene sets in GMT format; we just need to update our scripts to test them for enrichment. Here's the clusterProfiler documentation for using our own gene sets.
@n-tasnina can you add a function for that in our enrichment.py?
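
A minimal sketch of reading such a file in Python, assuming the standard GMT layout (tab-separated: set name, description, then the member genes). The function name `parse_gmt` and the example file name are hypothetical and are not part of the project's enrichment.py.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {set_name: {"description", "genes"}}.

    Each line is expected to be: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with no genes are skipped.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        name, description, *genes = fields
        gene_sets[name] = {
            "description": description,
            # Drop empty fields left by trailing tabs.
            "genes": {gene for gene in genes if gene},
        }
    return gene_sets


if __name__ == "__main__":
    example = [
        "set_A\tfirst toy set\tACE2\tTMPRSS2\n",
        "set_B\tsecond toy set\tIL6\t\n",
    ]
    for name, entry in parse_gmt(example).items():
        print(name, entry["description"], sorted(entry["genes"]))
```

To read from disk, an open file handle can be passed in place of the list, e.g. `with open("covid19_gene_sets.gmt") as f: gene_sets = parse_gmt(f)` (the file name here is a placeholder).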

@n-tasnina

n-tasnina commented May 22, 2020 via email

@n-tasnina

We can close this issue as well. Here is the link to the Python script where we did the enrichment analysis.
https://github.com/Murali-group/SARS-CoV-2-network-analysis/blob/enrichment/src/Enrichment/fss_enrichment.py

3 participants