
RAM Usage and other #6

Open
GabrieleNocchi opened this issue Nov 11, 2019 · 2 comments

Comments

@GabrieleNocchi

From my first tries, it seems that what you save in storage (compared to other software) you spend in RAM.

Do you have any tests showing this tool's RAM usage? Running it on a 71 GB VCF file (360 individuals, about 7,000,000 SNPs), it cannot finish with 8 GB of RAM; the jobs get killed because it runs out of memory. I am now trying with 64 GB. Any suggestions?

Also, I had a look at your paper, and in Table 1 you say that Plink cannot do this on VCF files directly. This is not true. Plink's --r2 function does exactly this, and you can then plot the result in R. Plink also has the --blocks function, which lets you calculate the LD block size distribution of your dataset. And yes, Plink has been able to run directly on VCF files for a good few years now (via the --vcf flag).

Finally, sorry for posting here even though this is not a bug. I tried to join your QQ group and downloaded the app, but it is not ideal for Europeans: the app is not in English, and I was not able to create an account because I do not understand it.

@GabrieleNocchi
Author

Just reporting my results. With 64 GB of RAM it worked. Checking RAM usage, I saw that it used slightly less than 10 GB of RAM to do the job on my 71 GB VCF file (360 individuals, 7,000,000 SNPs).

That is actually quite good, and the files generated are very small (as advertised). I like this tool.

As I pointed out before, Plink can run directly on VCF files and do this. In my personal experience, the real advantage of this tool compared to Plink is that:

In Plink, if you use the --r2 function to calculate linkage between all SNP pairs on a big, dense VCF file like mine, even if you constrain it to calculate LD only up to a max distance of 300 kb, you will still produce a huge output file with all the pairwise distances and LD values. This becomes very tedious to plot in R, because loading huge files into R is not ideal. A workaround for this when using Plink is to thin your VCF, leaving only about one SNP every 500 bases or so. But doing this is, of course, a random thinning, and you lose information. With this tool, PopLDdecay, you can run straight on a dense VCF without needing to thin and without having to bin the output before you can plot it, which saves a lot of time and effort.

Also, the plot looks quite good.
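The binning step described above (averaging pairwise r² in fixed-width distance bins before plotting) could be sketched in Python roughly as follows. This is only an illustration of the idea, not PopLDdecay's actual method; the input tuples and the bin width are hypothetical, loosely modelled on the (BP_A, BP_B, R2) columns of Plink's --r2 output:

```python
from collections import defaultdict

def bin_ld_decay(pairs, bin_size=1000, max_dist=300_000):
    """Average r^2 in fixed-width distance bins.

    pairs: iterable of (bp_a, bp_b, r2) tuples, e.g. parsed from
    the output of `plink --r2`. Returns {bin_start: mean_r2},
    keyed by the start of each distance bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for bp_a, bp_b, r2 in pairs:
        dist = abs(bp_b - bp_a)
        if dist > max_dist:          # mimic a max-distance cutoff
            continue
        b = (dist // bin_size) * bin_size
        sums[b] += r2
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Toy data: two pairs fall in the 0-1 kb bin, one in the 1-2 kb bin.
pairs = [(100, 600, 0.9), (100, 900, 0.7), (100, 1700, 0.2)]
print(bin_ld_decay(pairs))  # {0: 0.8, 1000: 0.2}
```

The binned means (one value per distance bin) are what you would actually load into R and plot as the decay curve, instead of the full pairwise table.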

@hewm2008
Contributor

Thank you very much for your recognition of this tool.

Sorry, we have not tested a 71 GB VCF file (360 individuals, about 7,000,000 SNPs) on a computer with 8 GB of RAM.

Yes, the new version of Plink can read VCF files. At the time, we were using the old Plink and did not know it could read VCF directly. We are very sorry that the article is misleading to some people on this point.

The QQ group is mainly for communication among bioinformatics personnel in China, and it is really not friendly to Europeans; sorry about that. If you have any questions, you can email me.

Your satisfaction is my motivation for writing these tools and software. Thanks a lot.
