Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Appropiate background genes for GC & length correction #71

Closed
xoelmb opened this issue Jul 11, 2022 · 8 comments
Closed

Appropiate background genes for GC & length correction #71

xoelmb opened this issue Jul 11, 2022 · 8 comments
Assignees
Labels

Comments

@xoelmb
Copy link

xoelmb commented Jul 11, 2022

Is your feature request related to a problem? Please describe.
I have run the bootstrap enrichment test with EWCE on ~40 genesets and a scRNA-seq dataset with and without geneSizeControl (GC % & gene length control), and with automatic background genes. The number of significant results (q value < 0.05) I get when geneSizeControl is on is huge in comparison. I have been reading the code and I think that the background genes created when geneSizeControl is on include genes not present in the SCT expression dataset.

Describe the solution you'd like
I think there is a simple solution to this problem: restricting the background genes used even when geneSizeControl is on to those present in the expression data. So, basically removing one check would make it. Changing this
#### Restrict gene sets to only genes in the SCT dataset ####
if (!geneSizeControl) {
hits <- hits[hits %in% sct_genes]
bg <- bg[bg %in% sct_genes]
}
For this
#### Restrict gene sets to only genes in the SCT dataset ####
hits <- hits[hits %in% sct_genes]
bg <- bg[bg %in% sct_genes]

The function is check_ewce_genelist_input

#### Restrict gene sets to only genes in the SCT dataset ####
if (!geneSizeControl) {
hits <- hits[hits %in% sct_genes]
bg <- bg[bg %in% sct_genes]
}

Describe alternatives you've considered
Maybe I'm missing something. Is this filtering done in some other part of the code that I have not reached? Is this expected behavior?

Additional context
I think this filtering is meant to be done. According to the original EWCE article:
Methods:
Bootstrapping with Controls for Transcript Length and GC Content

...The deciles of gene size and GC content were calculated over the set of genes expressed in the SCT dataset (after dropping those with low expression levels as described above). The two sets of decile values were used to define a grid, and each gene assigned to a position within the grid based on it's transcript lengths and GC content...

@Al-Murphy
Copy link
Collaborator

Hey,

Thanks for bringing this to our attention. We are looking into it and will get back to you on it soon.

@bschilder bschilder self-assigned this Jul 21, 2022
@bschilder bschilder added the bug label Jul 21, 2022
@bschilder
Copy link
Collaborator

bschilder commented Jul 21, 2022

Hi @xoelmb, thanks for bringing this to our attention. I just checked if this was something @Al-Murphy or I had introduced when making updates to EWCE over the last couple years, but it seems that this bit of code existed long before any of those changes were made (at least as far back as May 2018):

if(geneSizeControl==FALSE){

I agree that this seems to contradict what the documentation says and I can't think of a reason why genes absent from the SCT/CTD wouldn't be dropped in this situation. @NathanSkene do you have any insight into why this was here originally?

@NathanSkene
Copy link
Owner

NathanSkene commented Jul 21, 2022 via email

@bschilder
Copy link
Collaborator

Will do @NathanSkene

Come to think of it, I seem to recall @KittyMurphy encountered a similar effect of increased significant results when using geneSizeControl =TRUE. This likely explains why.

@NathanSkene
Copy link
Owner

NathanSkene commented Jul 21, 2022 via email

@xoelmb
Copy link
Author

xoelmb commented Jul 22, 2022

Great! Thank you all for spending your time on this! I'll keep an eye on how this gets solved and in the meantime I'll be using the results without geneSizeControl.
I'm happy I could make this little contribution to such a good package!

@Al-Murphy
Copy link
Collaborator

Hey! The fix has been implemented (version 1.5.4) in the Github master branch so feel free to install EWCE from there so you can use it with geneSizeControl. Otherwise it will be up on the bioconductor dev version in the next few days (usually 2-3). You can use that by installing:

BiocManager::install(version='devel')

Thanks for pointing this out!

@xoelmb
Copy link
Author

xoelmb commented Jul 22, 2022

Great! Thanks to you again!! ☺️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants