Variance and C-, S-, D- scores not showing #4

Open
GettyScience opened this issue Jul 12, 2023 · 3 comments
Comments

@GettyScience

GettyScience commented Jul 12, 2023

Hello,

I am an undergraduate, so my errors may well come down to inexperience, but I am trying to analyze RNA-seq data using your code. I work in a lab looking for gravitropic genes in the Arabidopsis thaliana model. We have a previously examined RNA-seq dataset that I am trying to run through the code in R, but the results are confusing, to say the least. With 10 bootstrap iterations, we get no variance and the C- and D-values cap out at "infinity". With 100 bootstrap iterations, we get no C-, S-, or D-scores, with numbers showing only for Rho2 and var2.

I am running the analysis on my laptop (16 GB of RAM), and apart from a longer wait it runs just fine. Our data also has only 4 samples per treatment and tissue, so the analysis covers more than 27,000 genes but only 4 samples. Could either of these be related to the issues we are seeing, or could you offer any other advice?

Thank you,

@yaccos
Collaborator

yaccos commented Jul 12, 2023

Thank you for your request. Unfortunately, 4 samples per treatment is too few considering that you have more than 27,000 genes. With such a high number of genes, I would recommend having more than 100 samples per treatment.

@GettyScience
Author

Thank you for the quick reply. Would you be willing to elaborate on why the small sample size causes the calculations to fail? In our field, small samples are common as the model is very inbred and genetically controlled. Is there a way to get past this issue within the larger code or would it remain inappropriate for the statistical methods being used?

Thank you,

@yaccos
Collaborator

yaccos commented Jul 13, 2023

Since csdR makes an all-to-all comparison of the genes, it reports C-, S-, and D-values for 364,486,500 gene pairs when run with 27,000 genes. Having just 4 samples per treatment will therefore produce many spurious associations. csdR was originally intended for large-scale clinical studies where sufficiently large sample sizes are available, but obtaining that amount of data is often infeasible for other types of studies. For your sample size, there are still some analyses you can use. You could, for instance, measure the fold change in gene expression between the two conditions and show the results in a volcano plot.
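
To make the point about spurious associations concrete, here is a minimal sketch in plain Python. It is not csdR code. It assumes the co-expression measure is a rank (Spearman) correlation, that there are no tied values, and that the two genes in a pair are truly independent. Under those assumptions it counts the gene pairs for 27,000 genes and enumerates every possible rank ordering of 4 samples to see how often a perfect correlation arises by chance alone.

```python
from itertools import permutations
from math import comb


def spearman_rho(rank_x, rank_y):
    """Spearman's rho for two tie-free rank vectors of equal length."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n**2 - 1))


n_genes = 27_000   # gene count quoted in the thread
n_samples = 4      # samples per treatment quoted in the thread

# All-to-all comparison: every unordered pair of distinct genes.
n_pairs = comb(n_genes, 2)

# With no ties, independence means every relative ordering of the
# 4 samples is equally likely, so enumerate all 4! = 24 of them.
base = tuple(range(1, n_samples + 1))
rhos = [spearman_rho(base, perm) for perm in permutations(base)]

# Only a handful of correlation values are possible at all.
distinct_rhos = sorted(set(round(r, 10) for r in rhos))

# Orderings giving rho = +1 or rho = -1 purely by chance.
n_perfect = sum(abs(r) == 1 for r in rhos)
expected_perfect_pairs = n_pairs * n_perfect / len(rhos)

print(f"gene pairs: {n_pairs:,}")
print(f"distinct rho values with {n_samples} samples: {len(distinct_rhos)}")
print(f"chance of |rho| = 1 for unrelated genes: {n_perfect}/{len(rhos)}")
print(f"expected perfectly correlated pairs by chance: {expected_perfect_pairs:,.0f}")
```

With 4 samples, 2 of the 24 orderings give a perfect correlation, so roughly one in twelve unrelated gene pairs would look perfectly co-expressed. Across hundreds of millions of pairs, that is tens of millions of chance hits, which is why more samples are needed before pairwise scores become informative.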
