reduce redundancy of enriched GO terms #28

Closed
GuangchuangYu opened this issue Oct 20, 2015 · 39 comments

Comments

@GuangchuangYu
Member

To simplify the enrichment result, we can use a slim version of GO and run the analysis with the enricher() function.

Another strategy is to use GOSemSim to calculate the semantic similarity of GO terms and remove highly similar terms, keeping one representative term.

The criterion for selecting the representative term can be:

  • most informative term (needs pre-calculated IC data, which is only available for organisms internally supported by GOSemSim; this can be extended to unsupported organisms).
  • most significant term (as in REVIGO)

I prefer the second criterion because it is more intuitive and easier to implement for organisms not internally supported by GOSemSim.

I propose to define a function to simplify the output from enrichGO by removing redundant GO terms.

simplify <- function(enrichResult, cutoff=0.7, by="p.adjust", select_fun=min) {
     ## GO terms with semantic similarity higher than `cutoff` are treated as redundant.
     ## Select one representative term by applying `select_fun` to the feature specified by `by`.
     ## Users can define their own `select_fun` function.

     ## Return an updated `enrichResult` object.
}
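
The selection rule described in the comments above can be sketched in base R. This is only an illustration under stated assumptions, not the committed implementation: the helper name `simplify_sketch`, the toy similarity matrix, and the adjusted p-values are all invented, and in practice the similarities would come from GOSemSim.

```r
# Sketch only: `simplify_sketch`, the toy similarity matrix and the
# adjusted p-values are hypothetical; real similarities would come from GOSemSim.
simplify_sketch <- function(sim, score, cutoff = 0.7, select_fun = min) {
  ids <- rownames(sim)
  to_remove <- character()
  for (id in ids) {
    # Terms (including `id` itself) whose similarity to `id` exceeds the cutoff
    group <- ids[sim[id, ] > cutoff]
    if (length(group) < 2) next
    # Keep the representative(s) picked by `select_fun`; mark the rest as redundant
    keep <- group[score[group] == select_fun(score[group])]
    to_remove <- union(to_remove, setdiff(group, keep))
  }
  setdiff(ids, to_remove)
}

terms <- c("GO:A", "GO:B", "GO:C")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.1,
                0.2, 0.1, 1.0),
              nrow = 3, dimnames = list(terms, terms))
p_adjust <- c("GO:A" = 0.001, "GO:B" = 0.01, "GO:C" = 0.02)

kept <- simplify_sketch(sim, p_adjust, cutoff = 0.7, select_fun = min)
print(kept)
```

In this toy input, GO:A and GO:B exceed the cutoff, so GO:B is dropped in favour of the more significant GO:A, while GO:C is untouched.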

Any comment/suggestion is welcome.


@GuangchuangYu changed the title from "reduce reduency of enriched GO terms" to "reduce redundancy of enriched GO terms" on Oct 21, 2015
@GuangchuangYu
Member Author

finished in 7e32c7d

I will close the issue. Comments are still welcome.

@GuangchuangYu
Member Author

Output from compareCluster is also supported; see 983262b.

@GuangchuangYu
Member Author

GuangchuangYu commented Oct 23, 2015

see also the blog post

@fssdlyl001

Hi, I applied the simplify() function to a compareCluster result, but it returned an error message: "Error in FUN(X[[i]], ...) : unused argument (organism = "human")". If I use an enrichGO result as input, simplify works. Could you help me solve this problem? Thank you.

@GuangchuangYu
Member Author

It was fixed in the GitHub version, which will be released on Wednesday in BioC 3.4.

@fssdlyl001

Thank you very much. Just one more question: which "semData" should I use for simplifying compareCluster results? If it needs to be set manually, what is the default semData of simplify()?

@GuangchuangYu
Member Author

you need to refer to the GOSemSim vignette.

If you use measure = 'Wang', you don't need to pass semData.

@fssdlyl001

But when I run simplify(x, measure="Wang"), it also returns an error: Error in FUN(X[[i]], ...) : argument "semData" is missing, with no default. If I add 'semData = NULL', I get Error in match.arg(ont, c("BP", "CC", "MF")) : 'arg' must be of length 1. Where did I go wrong?

@GuangchuangYu
Member Author

data(gcSample)
x = compareCluster(gcSample, fun='enrichGO', ont='MF', OrgDb='org.Hs.eg.db')
y <- simplify(x, measure='Wang', semData=NULL) 

works.

pls follow the guide and provide a reproducible example.

@tigerxu

tigerxu commented Feb 28, 2017

Hi, I applied the simplify() function to reduce the redundancy of the enriched GO terms produced by enrichGO. The code below runs without error, but I noticed that several redundant GO terms were not filtered out. An example follows.

library(clusterProfiler)
library(org.Hs.eg.db)

genedata <- read.table("genedata.txt", quote=NULL, header=TRUE, check.names=FALSE, sep="\t")
head(genedata)
## the column name contains a space, so it has to be quoted with backticks
eg <- bitr(genedata$`Protein IDs`, fromType="UNIPROT", toType="ENTREZID", OrgDb="org.Hs.eg.db")
egobp <- enrichGO(gene = eg$ENTREZID,
                  OrgDb = org.Hs.eg.db,
                  ont = "BP",
                  pAdjustMethod = "BH",
                  pvalueCutoff = 0.01,
                  qvalueCutoff = 0.01,
                  readable = TRUE)
## drop terms whose semantic similarity exceeds 0.6, keeping the lowest p.adjust
egobp2 <- simplify(egobp, cutoff=0.6, by="p.adjust", select_fun=min)
head(egobp2)

I show the first six GO terms below
Count
GO:0006335 24
GO:0034723 24
GO:0045861 64
GO:0051290 25
GO:0000183 24
GO:0002576 36

I compared the similarity of GO terms using GOSemSim and the result is as follows

goSim("GO:0006335", "GO:0034723", semData=hsGObp, measure="Jiang")
[1] 1
goSim("GO:0006335", "GO:0034723", semData=hsGObp, measure="Wang")
[1] 0.671

It looks like one of "GO:0006335" and "GO:0034723" should have been filtered out, but neither was. Could you look into this problem? Any advice is much appreciated.

Thanks for developing clusterProfiler!

Zhuofei Xu

@GuangchuangYu
Member Author

pls follow https://guangchuangyu.github.io/2016/07/how-to-bug-author/ and provide a reproducible example.

@msquatrito

Hi,
I've been trying to use the simplify function and I had a similar issue to @tigerxu. I realised that in my case the problem was that GO terms with the same "p.adjust" are not filtered out.
I've solved it using simplify(..., by = "pvalue").
Best,
Massimo
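
A small base R sketch of why tied values could leave redundant terms in place. It assumes the selection step keeps every term whose `by` value equals `select_fun(...)`; that is an assumption about the internals, not something stated in this thread. Under that rule, two similar terms with identical adjusted p-values both count as representatives, whereas raw p-values rarely tie. The helper `pick` and all numbers are invented.

```r
# Toy values, not taken from the report above: two highly similar terms
# share the same adjusted p-value but have different raw p-values.
group <- data.frame(
  ID       = c("term_1", "term_2"),
  pvalue   = c(1.0e-12, 3.0e-12),
  p.adjust = c(2.0e-10, 2.0e-10),
  stringsAsFactors = FALSE
)

# Assumed selection rule: keep every row whose value equals select_fun(values)
pick <- function(df, by, select_fun = min) {
  df$ID[df[[by]] == select_fun(df[[by]])]
}

kept_by_padj   <- pick(group, by = "p.adjust")  # tie: both terms survive
kept_by_pvalue <- pick(group, by = "pvalue")    # tie broken: one term survives

print(kept_by_padj)
print(kept_by_pvalue)
```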

@tigerxu

tigerxu commented Mar 16, 2017

Thanks for sharing this info, Massimo! I will try again.
Zhuofei

@hmontenegro

Doesn't simplify work with gseaResult objects?

cc.simp <- simplify( ego.cc, cutoff = 0.7, by = "p.adjust", select_fun = min )

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘simplify’ for signature ‘"gseaResult"’

@haiyueliu

Hi,

I am using the simplify function, but it fails with an error.

ego2_52_slim <- simplify(ego2_52, cutoff=0.7, by="p.adjust", select_fun=min)
Error in .local(x, ...) :
simplify only applied to output from enrichGO...

I am sure the input is the result of enrichGO. Could you please take a look at this?

Also, you mentioned that you prefer the second criterion, which is "REVIGO", but simplify integrates GOSemSim, right?

Thanks,
Haiyue

@Lucyyang1991

It takes a really long time to run the simplify function. Is that normal?

@sjidong12

I have a similar problem to "Lucyyang1991". It takes quite a long time to run the simplify function. Is that normal?

@davemcg

davemcg commented Dec 12, 2018

simplify does not work with enrichGO output if the ontology used is "ALL"

@steffenheyne

steffenheyne commented Jan 9, 2019

same here...also when I don't use "ALL" but like ont="BP", simplify never finishes for me

@VictorGoitea

the same... ont="BP", simplify never finishes for me

@steffenheyne

steffenheyne commented Jan 24, 2019

Actually, the problem seems to be that enrichGO() now (was there a change?) returns an object that also stores all non-significant results in ego@result. When I strip this down to the significant results only, e.g. with:

ego@result = ego@result[ego@result$qvalue<=ego@qvalueCutoff,]

simplify finishes normally. But I don't know if the result of simplify is still correct with that change.
@GuangchuangYu ?

I also strip down the gene sets with ego@geneSets = data.frame(), as the result df contains all the gene sets. With that I get a much smaller object. But I think this slot is not used by simplify at all.
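
The effect of the filtering workaround above can be illustrated in base R on a plain data frame. This is a sketch under assumptions: `keep_significant` is a hypothetical helper, the table is invented, and the speed-up argument assumes that the cost of simplification is dominated by pairwise term comparisons.

```r
# Hypothetical helper and toy table; only the `qvalue` column name is
# taken from the workaround above.
keep_significant <- function(result, qvalue_cutoff = 0.05) {
  ok <- !is.na(result$qvalue) & result$qvalue <= qvalue_cutoff
  result[ok, , drop = FALSE]
}

toy <- data.frame(
  ID     = paste0("term_", 1:5),
  qvalue = c(0.0004, 0.004, 0.08, 0.35, 0.85),
  stringsAsFactors = FALSE
)

sig <- keep_significant(toy, qvalue_cutoff = 0.05)

# The number of pairwise comparisons grows roughly quadratically with the term count
n_pairs <- function(n) n * (n - 1) / 2
cat(nrow(toy), "terms ->", n_pairs(nrow(toy)), "pairs\n")
cat(nrow(sig), "terms ->", n_pairs(nrow(sig)), "pairs\n")
```

With thousands of non-significant terms left in the table, the same quadratic growth would explain run times that look like a hang.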

@GuangchuangYu
Member Author

@steffenheyne thanks for pointing this out. Yes, the result should be stripped down before simplify. Will fix it.

@VictorGoitea

Thanks Steffen! Your suggestion worked perfectly.

@xxz19900

simplify does not work with enrichGO output if the ontology used is "ALL"

Have you solved this problem? I have run into the same situation.

@steffenheyne

steffenheyne commented Mar 14, 2019

@xxz19900 simplify only works within a sub-ontology like BP, MF or CC, not with ont= 'ALL', like

x <- enrichGO(dat, ont = 'BP', OrgDb = 'org.Hs.eg.db')
y <- simplify(x)
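
If a combined table covering all three sub-ontologies is still wanted, one possible approach is to reduce each sub-ontology separately and bind the pieces back together. The base R sketch below only shows that split/apply/combine shape. It assumes the table carries an `ONTOLOGY` column, and `reduce_one` is a hypothetical stand-in for a real per-ontology simplification step, not a semantic similarity filter.

```r
# Sketch under assumptions: the table has an ONTOLOGY column, and
# `reduce_one` is a hypothetical stand-in for a real per-ontology simplify step.
reduce_by_ontology <- function(result, reduce_one) {
  parts <- split(result, result$ONTOLOGY)
  reduced <- lapply(parts, reduce_one)
  out <- do.call(rbind, reduced)
  rownames(out) <- NULL
  out
}

toy <- data.frame(
  ONTOLOGY = c("BP", "BP", "MF", "CC", "CC"),
  ID       = paste0("term_", 1:5),
  p.adjust = c(0.01, 0.001, 0.02, 0.03, 0.004),
  stringsAsFactors = FALSE
)

# Stand-in reducer: keep only the most significant term of each sub-ontology
most_significant <- function(df) df[which.min(df$p.adjust), , drop = FALSE]

res <- reduce_by_ontology(toy, most_significant)
print(res)
```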

@kamalmdmostafa

I am having the following error when trying to simplify Arabidopsis GO BP data:

egoBP_earlyresponse <- enrichGO(gene = earlyresponse,
                                universe = allOE_genes,
                                keyType = "TAIR",
                                OrgDb = org.At.tair.db,
                                ont = "BP",
                                pAdjustMethod = "BH",
                                pvalueCutoff = 0.01,
                                qvalueCutoff = 0.05,
                                readable = FALSE)

cluster_summaryBP_earlyresponse <- data.frame(egoBP_earlyresponse)
#simplify

simplify(egoBP_earlyresponse, cutoff = 0.7, by = "p.adjust", select_fun=min)
Error in simplify(egoBP_earlyresponse, cutoff = 0.7, by = "p.adjust", :
unused arguments (cutoff = 0.7, by = "p.adjust", select_fun = min)

Any suggestions will be appreciated.

@laijen000

laijen000 commented Jul 29, 2020

Hi Dr. Yu (@GuangchuangYu ),

I would like to use GOSemSim to select the most informative term when comparing lists of GO IDs using mgoSim(). I am using the mouse genome and calculated IC scores: mmGO <- godata('org.Mm.eg.db', ont='BP').
Once we obtain the similarity matrix, is there a way to filter it to select just the most informative term above a certain similarity cutoff?

Because my GO IDs are not from using enrichGO, I am unable to use the simplify function. So I would like to try GOSemSim instead. Thank you!
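
One way such a filter could look, sketched in base R: visit terms from highest to lowest information content and keep a term only if it is not too similar to any term already kept. The helper `pick_most_informative`, the similarity matrix, and the IC values are all invented for illustration. With GOSemSim, the matrix could presumably come from mgoSim() with `combine = NULL`, and the IC values from the object returned by godata().

```r
# Hypothetical helper; the similarity matrix and IC values are invented.
pick_most_informative <- function(sim, ic, cutoff = 0.7) {
  ordered <- names(sort(ic, decreasing = TRUE))
  kept <- character()
  for (id in ordered) {
    # Keep a term only if it is not too similar to a more informative kept term
    if (all(sim[id, kept] <= cutoff)) {
      kept <- c(kept, id)
    }
  }
  kept
}

terms <- c("term_a", "term_b", "term_c", "term_d")
sim <- diag(4)
dimnames(sim) <- list(terms, terms)
sim["term_a", "term_b"] <- sim["term_b", "term_a"] <- 0.85
sim["term_c", "term_d"] <- sim["term_d", "term_c"] <- 0.40
ic <- c(term_a = 2.1, term_b = 5.3, term_c = 4.0, term_d = 3.2)

representatives <- pick_most_informative(sim, ic, cutoff = 0.7)
print(representatives)
```

Here term_a is dropped because it is highly similar to the more informative term_b, while term_c and term_d stay because their similarity is below the cutoff.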

@claraina

claraina commented Oct 7, 2020

I am having the same problem as @kamalmdmostafa, getting the error "unused arguments (cutoff = 0.7, by = "p.adjust", select_fun = min)" even though it seems to work with another set of genes. I don't think it's an issue with my enrich object (ego) as it is still showing the relevant GO Terms.

EEDDOC_ego <- enrichGO(gene = EEDDOC_gene.df$ENTREZID,
OrgDb = org.Mm.eg.db,
keyType = 'ENTREZID',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
readable = TRUE)

sEEDDOC_ego <- simplify(EEDDOC_ego, cutoff=0.7, by="p.adjust", select_fun = min)

Error in simplify(EEDDOC_ego, cutoff = 0.7, by = "p.adjust", select_fun = min) :
unused arguments (cutoff = 0.7, by = "p.adjust", select_fun = min)

@kamalmdmostafa

@claraina were you able to solve this issue? It seems like they gave up on it. Is there any other way to solve this? Creating a new issue, maybe?

@huerqiang
Contributor

@kamalmdmostafa #278
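
The linked issue is not reproduced in this thread. One common way to get an "unused arguments" error from an otherwise correct call in R is name masking: another attached package exports its own `simplify` with a different signature and is found first on the search path. Whether that is the cause here is an assumption. The base R sketch below imitates the situation with two hypothetical environments standing in for packages; if masking is the cause, a qualified call such as `clusterProfiler::simplify(...)` would be the analogous fix.

```r
# Hypothetical stand-ins for two attached packages that both export `simplify`
pkg_enrich <- new.env()
pkg_graph  <- new.env()
pkg_enrich$simplify <- function(x, cutoff = 0.7, by = "p.adjust", select_fun = min) {
  "enrichment simplify called"
}
pkg_graph$simplify <- function(graph) "graph simplify called"

attach(pkg_enrich, name = "pkg_enrich", warn.conflicts = FALSE)
attach(pkg_graph,  name = "pkg_graph",  warn.conflicts = FALSE)  # attached last, so it is found first

# The unqualified call reaches the wrong function and fails
msg <- tryCatch(simplify(1, cutoff = 0.7), error = function(e) conditionMessage(e))

# Calling the intended function explicitly works (analogous to pkg::fun)
ok <- pkg_enrich$simplify(1, cutoff = 0.7)

detach("pkg_graph")
detach("pkg_enrich")
print(msg)
print(ok)
```

This would also be consistent with the call working in one session and failing in another, since the outcome depends on which packages were loaded and in what order.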


@nikofleischer

nikofleischer commented Oct 27, 2021

Hi, I love the clusterProfiler package and regularly use the gseGO and gseKEGG functions to investigate my scRNAseq experiment results. However, with these there is always the problem of many related terms being enriched. Would it be possible to implement a function that selects only the most specific enriched term for visualization in the dotplot() function? And/or extend the simplify function to work on GSEA objects?

Best, Niko

@huerqiang
Contributor

@sparsepix You could use the showCategory = Your_selected_terms parameter in dotplot to display your selected terms.

@nikofleischer

@huerqiang thanks for the tip. Is there an easy way to compute the terms to select (e.g. the most specific term in a subtree, the term with the lowest FDR within the subtree, or something similar)?

@TylerSagendorf

TylerSagendorf commented May 8, 2022

> Hi, I love the clusterprofiler package and regularly use the gseaGO and gseaKEGG functions to investigate my scRNAseq experiment results. However, using these there is always the problem of having many related terms enriched. Would it be possible to implement a function that only selects the most specific term that is enriched in the result for visualization in the dotplot() function? And/or extend the simplify function to work on gsea objects?
>
> Best, Niko

@sparsepix Hi Niko, here is some code that selects the most general term: #372

If you instead compute last_GO_level = unlist(lapply(GO_levels, max)) and then use

x <- simplify(x, cutoff = 0.7, by = "last_GO_level", select_fun = max),

you will get the most specific enriched terms.
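
A toy illustration of the `last_GO_level` computation, in base R. The shape of `GO_levels` is an assumption inferred from the expression above (one integer vector of ontology levels per enriched term, since a term can be reached at several depths); the code from #372 is not reproduced here and the values are invented.

```r
# Assumed shape: one integer vector of ontology levels per enriched term.
# Term names and levels are invented.
GO_levels <- list(
  term_1 = c(3L, 4L),
  term_2 = c(5L, 6L, 7L),
  term_3 = 2L
)

# Deepest level at which each term appears
last_GO_level <- unlist(lapply(GO_levels, max))

print(last_GO_level)
```

The resulting vector would then need to be stored as a `last_GO_level` column of the enrichment result so that `by = "last_GO_level"` can refer to it.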

If you are not looking for the most specific terms and just want to deal with the redundancy issue, use the msigdbr package to obtain non-redundant gene sets/pathways. Then, supply them to clusterProfiler::GSEA or fgsea::fgsea to perform FGSEA.

@deevdevil88

Hi @TylerSagendorf,
you mentioned at the end of your post that I can obtain non-redundant gene sets/pathways directly from msigdbr, but I couldn't find a specific function that lets me download non-redundant GO BP or GO CC gene sets. Unless the gene sets on MSigDB are already non-redundant?

@TylerSagendorf

TylerSagendorf commented May 9, 2022

> hi @TylerSagendorf you mentioned in your post at the end that i can obtain non-redundant gene-sets/pathways directly from misgdbr? but i couldn't find a specific function which allows me to download non-redundant GO BP or GO CC genesets. Unless , the genesets on MSigDB are non redundant already ?

@deevdevil88 The gene sets/pathways provided by MSigDB on their website and through the msigdbr package are non-redundant. They explain how GO terms are filtered in the v7.0 release notes. I only know for sure that the same type of procedure is applied to Reactome pathways, but they claim that the entire database has had its redundancy decreased (doi: 10.1016/j.cels.2015.12.004).

@deevdevil88

@TylerSagendorf awesome! Thank you for confirming and for including the relevant release notes. I use MSigDB a lot, but this non-redundancy of GO terms wasn't obvious. Thanks again.

@TylerSagendorf

@deevdevil88 No problem! Glad I could help.
