reduce redundancy of enriched GO terms #28

GuangchuangYu · 2015-10-20T07:08:27Z

To simplify the enriched result, we can use a slim version of GO and run the analysis with the enricher() function.

Another strategy is to use GOSemSim to calculate the similarity of GO terms and remove highly similar terms, keeping one representative term.

The criterion for selecting the representative term can be:

1. the most informative term (needs pre-calculated IC data, which is only available for organisms internally supported by GOSemSim; can be extended to unsupported organisms);
2. the most significant term (as in REVIGO).

I prefer the second criterion because it is more intuitive and easier to implement for organisms not internally supported by GOSemSim.

I propose to define a function that simplifies the output from enrichGO by removing redundant GO terms.

simplify <- function(enrichResult, cutoff=0.7, by="p.adjust", select_fun=min) {
     ## GO terms whose semantic similarity is higher than `cutoff` are treated as redundant.
     ## One representative term is selected by applying `select_fun` to the feature specified by `by`.
     ## Users can define their own `select_fun` function.

     ## Returns an updated `enrichResult` object.
}
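
The selection logic described in the comments above can be sketched in base R. This is only an illustration under stated assumptions, not the code that ended up in clusterProfiler: `simplify_sketch` is a hypothetical helper, the input is a plain data frame with an `ID` column rather than an `enrichResult` object, the similarity matrix is supplied by hand instead of being computed with GOSemSim, and the grouping is a simple greedy pass. All values are made up.

```r
# Made-up term IDs, adjusted p-values and pairwise similarities.
ids <- c("GO:A", "GO:B", "GO:C")
res <- data.frame(ID = ids,
                  p.adjust = c(0.01, 0.001, 0.02),
                  stringsAsFactors = FALSE)
sim <- matrix(c(1.0, 0.9, 0.1,
                0.9, 1.0, 0.2,
                0.1, 0.2, 1.0),
              nrow = 3, dimnames = list(ids, ids))

# Hypothetical helper: greedily group terms that are more similar than
# `cutoff` to a seed term and keep one representative per group.
simplify_sketch <- function(res, sim, cutoff = 0.7, by = "p.adjust",
                            select_fun = min) {
  all_ids <- as.character(res$ID)
  keep <- character(0)
  remaining <- all_ids
  while (length(remaining) > 0) {
    seed <- remaining[1]
    # The seed plus every remaining term that is redundant with it.
    group <- remaining[sim[seed, remaining] > cutoff | remaining == seed]
    scores <- res[[by]][match(group, all_ids)]
    # First term whose score equals select_fun(scores) represents the group.
    best <- group[which(scores == select_fun(scores))[1]]
    keep <- c(keep, best)
    remaining <- setdiff(remaining, group)
  }
  res[all_ids %in% keep, , drop = FALSE]
}

slim <- simplify_sketch(res, sim)
print(slim)
```

In this toy input, GO:A and GO:B exceed the cutoff and GO:B has the smaller adjusted p-value, so GO:B represents that pair while GO:C is kept on its own.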

Any comment/suggestion is welcome.

Reference:

GuangchuangYu · 2015-10-21T09:59:55Z

finished in 7e32c7d

I will close the issue. Comment is still welcome.

GuangchuangYu · 2015-10-21T12:55:10Z

Output from compareCluster is also supported, see 983262b.

GuangchuangYu · 2015-10-23T09:28:09Z

see also the blog post

fssdlyl001 · 2016-10-14T17:14:06Z

Hi, I applied the simplify() function to a compareCluster result, but it returned an error message: "Error in FUN(X[[i]], ...) : unused argument (organism = "human")". With an enrichGO result as input, simplify worked. Could you help me solve this problem? Thank you.

GuangchuangYu · 2016-10-17T02:40:49Z

It was fixed in the GitHub version, which will be released on Wednesday in BioC 3.4.

fssdlyl001 · 2016-10-18T11:20:47Z

Thank you very much. Just one more question: which "semData" should I use for simplifying compareCluster results? If it needs to be set manually, what is the default semData of simplify()?

GuangchuangYu · 2016-10-18T11:53:13Z

you need to refer to the GOSemSim vignette.

If you use measure = 'Wang', you don't need to pass semData.

fssdlyl001 · 2016-10-18T12:02:37Z

But when I run simplify(x, measure="Wang"), it also returns an error: Error in FUN(X[[i]], ...) : argument "semData" is missing, with no default. If I add 'semData = NULL', I get Error in match.arg(ont, c("BP", "CC", "MF")) : 'arg' must be of length 1. Where did I go wrong?

GuangchuangYu · 2016-10-18T14:06:36Z

library(clusterProfiler)
data(gcSample)  # example gene clusters shipped with clusterProfiler
x <- compareCluster(gcSample, fun='enrichGO', ont='MF', OrgDb='org.Hs.eg.db')
y <- simplify(x, measure='Wang', semData=NULL)

works.

Please follow the guide and provide a reproducible example.

tigerxu · 2017-02-28T06:57:52Z

Hi, I applied the simplify() function to reduce the redundancy of the enriched GO terms produced by enrichGO. The code below runs without error, but I noticed that several redundant GO terms were not filtered out. The example is below.

library(clusterProfiler)
library(org.Hs.eg.db)
genedata <- read.table("genedata.txt", quote=NULL, header=TRUE, check.names=FALSE, sep="\t")
head(genedata)
# The column name contains a space, so it must be quoted with backticks.
eg <- bitr(genedata$`Protein IDs`, fromType="UNIPROT", toType="ENTREZID", OrgDb="org.Hs.eg.db")
egobp <- enrichGO(gene = eg$ENTREZID,
                  OrgDb = org.Hs.eg.db,
                  ont = "BP",
                  pAdjustMethod = "BH",
                  pvalueCutoff = 0.01,
                  qvalueCutoff = 0.01,
                  readable = TRUE)
egobp2 <- simplify(egobp, cutoff=0.6, by="p.adjust", select_fun=min)
head(egobp2)

I show the first six GO terms below
Count
GO:0006335 24
GO:0034723 24
GO:0045861 64
GO:0051290 25
GO:0000183 24
GO:0002576 36

I compared the similarity of GO terms using GOSemSim and the result is as follows

goSim("GO:0006335", "GO:0034723", semData=hsGObp, measure="Jiang")
[1] 1
goSim("GO:0006335", "GO:0034723", semData=hsGObp, measure="Wang")
[1] 0.671

It looks like one of "GO:0006335" and "GO:0034723" should have been filtered out, but neither was. Could you help look into this problem? Any advice is much appreciated.

Thanks for developing clusterProfiler!

Zhuofei Xu

GuangchuangYu · 2017-02-28T07:19:53Z

pls follow https://guangchuangyu.github.io/2016/07/how-to-bug-author/ and provide a reproducible example.

msquatrito · 2017-03-15T18:15:23Z

Hi,
I've been trying to use the simplify function and I had a similar issue to @tigerxu. I realised that in my case the problem was that GO terms with the same "p.adjust" value are not filtered out.
I solved it by using simplify(..., by = "pvalue").
Best,
Massimo
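
One way ties could produce this behaviour can be sketched in base R. The selection rule below is an assumption made for illustration, not the package's actual code: `pick` is a hypothetical helper that keeps every term whose value equals `select_fun()` of the group, and the IDs and p-values are made up.

```r
# Two redundant terms with identical adjusted p-values but different raw p-values.
group <- data.frame(ID = c("GO:A", "GO:B"),
                    pvalue = c(1e-6, 2e-6),
                    p.adjust = c(1e-4, 1e-4),
                    stringsAsFactors = FALSE)

# Hypothetical selection rule: keep every term whose value equals select_fun().
pick <- function(df, by, select_fun = min) {
  df$ID[df[[by]] == select_fun(df[[by]])]
}

kept_by_padj <- pick(group, "p.adjust")  # both terms tie, so both are kept
kept_by_pval <- pick(group, "pvalue")    # the tie is broken, one term is kept
print(kept_by_padj)
print(kept_by_pval)
```

Under such a rule, BH-adjusted p-values are prone to ties because the adjustment can assign the same value to neighbouring terms, whereas raw p-values usually differ.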

tigerxu · 2017-03-16T00:49:39Z

Thanks for sharing this info, Massimo! I will try again.
Zhuofei

hmontenegro · 2018-02-06T02:03:58Z

Doesn't simplify work with gseaResult objects?

cc.simp <- simplify( ego.cc, cutoff = 0.7, by = "p.adjust", select_fun = min )

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘simplify’ for signature ‘"gseaResult"’

haiyueliu · 2018-02-28T17:43:55Z

Hi,

I am using the simplify function, but it fails with an error.

ego2_52_slim <- simplify(ego2_52, cutoff=0.7, by="p.adjust", select_fun=min)
Error in .local(x, ...) :
simplify only applied to output from enrichGO...

I am sure the input is a result from enrichGO. Could you please take a look at this?

Also, you mentioned that you prefer the second criterion, which is "REVIGO", but simplify integrates GOSemSim, right?

Thanks,
Haiyue

Lucyyang1991 · 2018-10-16T08:40:28Z

It takes a really long time to run the simplify function. Is that normal?

sjidong12 · 2018-12-08T07:20:36Z

I have a similar problem to "Lucyyang1991". It takes quite a long time to run the simplify function. Is that normal?

davemcg · 2018-12-12T12:30:54Z

simplify does not work with enrichGO output if the ontology used is "ALL"

steffenheyne · 2019-01-09T15:26:20Z

same here...also when I don't use "ALL" but like ont="BP", simplify never finishes for me

VictorGoitea · 2019-01-24T11:04:35Z

the same... ont="BP", simplify never finishes for me

steffenheyne · 2019-01-24T12:18:31Z

Actually, the problem seems to be that now (was there a change?) e.g. enrichGO() returns an object that also stores all non-significant results in ego@result. When I strip this down to the significant results only, e.g. by:

ego@result = ego@result[ego@result$qvalue<=ego@qvalueCutoff,]

simplify finishes normally. But I don't know if the result of simplify is still correct with that change.
@GuangchuangYu ?

I also strip down ego@geneSets = data.frame(), as the result df contains all the gene sets. With that I get a much smaller object. But I think this slot is not used by simplify at all.
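
The filtering step above can be sketched on a plain data frame that stands in for `ego@result`. The column names follow the ones used in this thread; `keep_significant` is a hypothetical helper, the cutoffs are assumptions, and the values are made up. Whether simplify gives the same answer on a filtered object is not verified here.

```r
# Stand-in for `ego@result`; the values are made up for illustration.
result <- data.frame(ID = c("GO:A", "GO:B", "GO:C", "GO:D"),
                     p.adjust = c(0.001, 0.04, 0.20, 0.03),
                     qvalue = c(0.0008, 0.03, 0.15, NA),
                     stringsAsFactors = FALSE)

# Hypothetical helper: keep rows that pass both cutoffs; rows with NA fail.
keep_significant <- function(df, padj_cutoff = 0.05, q_cutoff = 0.05) {
  ok <- !is.na(df$p.adjust) & df$p.adjust <= padj_cutoff &
        !is.na(df$qvalue) & df$qvalue <= q_cutoff
  df[ok, , drop = FALSE]
}

sig <- keep_significant(result)
print(sig)
```

The explicit `is.na()` checks matter because indexing a data frame with a logical vector that contains NA returns all-NA rows instead of dropping them.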

GuangchuangYu · 2019-01-25T22:58:36Z

@steffenheyne thanks for pointing this out. Yes, the result should be stripped down before simplify. Will fix it.

VictorGoitea · 2019-01-26T10:18:56Z

Thanks Steffen! Your suggestion worked perfectly.

xxz19900 · 2019-03-13T13:59:01Z

> simplify does not work with enrichGO output if the ontology used is "ALL"

Have you solved this problem? I also encountered such condition.

steffenheyne · 2019-03-14T07:08:43Z

@xxz19900 simplify only works within a sub-ontology like BP, MF or CC, not with ont= 'ALL', like

x <- enrichGO(dat, ont = 'BP', OrgDb = 'org.Hs.eg.db')
y <- simplify(x)

kamalmdmostafa · 2020-05-26T09:35:42Z

I am having the following error when trying to simplify Arabidopsis GO BP data:

egoBP_earlyresponse <- enrichGO(gene = earlyresponse,
                                universe = allOE_genes,
                                keyType = "TAIR",
                                OrgDb = org.At.tair.db,
                                ont = "BP",
                                pAdjustMethod = "BH",
                                pvalueCutoff = 0.01,
                                qvalueCutoff = 0.05,
                                readable = FALSE)

cluster_summaryBP_earlyresponse <- data.frame(egoBP_earlyresponse)
#simplify

simplify(egoBP_earlyresponse, cutoff = 0.7, by = "p.adjust", select_fun=min)
Error in simplify(egoBP_earlyresponse, cutoff = 0.7, by = "p.adjust", :
unused arguments (cutoff = 0.7, by = "p.adjust", select_fun = min)

Any suggestion will be appreciated.

laijen000 · 2020-07-29T17:03:18Z

Hi Dr. Yu (@GuangchuangYu ),

I would like to use GOSemSim to select the most informative term when comparing lists of GO IDs using mgoSim(). I am using the mouse genome and calculated IC scores: mmGO <- godata('org.Mm.eg.db', ont='BP').
Once we obtain the similarity matrix, is there a way to filter it to select just the most informative term above a certain similarity cutoff?

Because my GO IDs are not from using enrichGO, I am unable to use the simplify function. So I would like to try GOSemSim instead. Thank you!

claraina · 2020-10-07T04:20:41Z

I am having the same problem as @kamalmdmostafa, getting the error "unused arguments (cutoff = 0.7, by = "p.adjust", select_fun = min)" even though it seems to work with another set of genes. I don't think it's an issue with my enrich object (ego) as it is still showing the relevant GO Terms.

EEDDOC_ego <- enrichGO(gene = EEDDOC_gene.df$ENTREZID,
                       OrgDb = org.Mm.eg.db,
                       keyType = 'ENTREZID',
                       ont = "BP",
                       pAdjustMethod = "BH",
                       pvalueCutoff = 0.05,
                       qvalueCutoff = 0.05,
                       readable = TRUE)

sEEDDOC_ego <- simplify(EEDDOC_ego, cutoff=0.7, by="p.adjust", select_fun = min)

Error in simplify(EEDDOC_ego, cutoff = 0.7, by = "p.adjust", select_fun = min) :
unused arguments (cutoff = 0.7, by = "p.adjust", select_fun = min)

kamalmdmostafa · 2020-10-29T11:01:00Z

@claraina were you able to solve this issue? It seems like they gave up on it. Is there any other way to solve this? Creating a new issue, maybe?

huerqiang · 2020-10-29T17:02:06Z

@kamalmdmostafa #278

nikofleischer · 2021-10-27T12:28:34Z

Hi, I love the clusterProfiler package and regularly use the gseGO and gseKEGG functions to investigate my scRNAseq experiment results. However, with these there is always the problem of many related terms being enriched. Would it be possible to implement a function that selects only the most specific enriched term for visualization in the dotplot() function? And/or extend the simplify function to work on gsea objects?

Best, Niko

huerqiang · 2021-10-27T14:18:01Z

@sparsepix You could use the showCategory = Your_selected_terms parameter in dotplot to display your selected terms.

nikofleischer · 2021-10-27T15:02:57Z

@huerqiang thanks for the tip. Is there a way to easily compute the terms to select (based on them being the most specific one in a subtree, or selecting only the term with the lowest FDR within the subtree, or something similar)?

TylerSagendorf · 2022-05-08T02:20:58Z

> Hi, I love the clusterProfiler package and regularly use the gseGO and gseKEGG functions to investigate my scRNAseq experiment results. However, with these there is always the problem of many related terms being enriched. Would it be possible to implement a function that selects only the most specific enriched term for visualization in the dotplot() function? And/or extend the simplify function to work on gsea objects?
>
> Best, Niko

@sparsepix Hi Niko, here is some code that selects the most general term: #372

If you instead compute last_GO_level = unlist(lapply(GO_levels, max)) and then use

x <- simplify(x, cutoff = 0.7, by = "last_GO_level", select_fun = max),

you will get the most specific enriched terms.

If you are not looking for the most specific terms and just want to deal with the redundancy issue, use the msigdbr package to obtain non-redundant gene sets/pathways. Then, supply them to clusterProfiler::GSEA or fgsea::fgsea to perform FGSEA.

deevdevil88 · 2022-05-09T14:32:17Z

hi @TylerSagendorf
You mentioned at the end of your post that I can obtain non-redundant gene sets/pathways directly from msigdbr, but I couldn't find a specific function that lets me download non-redundant GO BP or GO CC gene sets. Unless the gene sets on MSigDB are non-redundant already?

TylerSagendorf · 2022-05-09T15:50:37Z

> hi @TylerSagendorf You mentioned at the end of your post that I can obtain non-redundant gene sets/pathways directly from msigdbr, but I couldn't find a specific function that lets me download non-redundant GO BP or GO CC gene sets. Unless the gene sets on MSigDB are non-redundant already?

@deevdevil88 The gene sets/pathways provided by MSigDB on their website and through the msigdbr package are non-redundant. They explain how GO terms are filtered in the v7.0 release notes. I only know for sure that the same type of procedure is applied to Reactome pathways, but they claim that the entire database has had its redundancy decreased (doi: 10.1016/j.cels.2015.12.004).

deevdevil88 · 2022-05-09T16:54:03Z

@TylerSagendorf awesome! Thank you for confirming and for including the relevant release notes. I use MSigDB a lot, but this non-redundancy of GO terms wasn't obvious. Thanks again.

TylerSagendorf · 2022-05-09T16:56:44Z

@deevdevil88 No problem! Glad I could help.

GuangchuangYu added enhancement GO labels Oct 20, 2015

GuangchuangYu changed the title ~~reduce reduency of enriched GO terms~~ reduce redundancy of enriched GO terms Oct 21, 2015

GuangchuangYu closed this as completed Oct 21, 2015

GuangchuangYu mentioned this issue Oct 23, 2015

GO enrichment at specific level #30

Closed

malcook mentioned this issue Dec 4, 2016

simplify fails on compareCluster result when ont passed as variable #72

Closed