Converting VCF --> data.table #57
@hpages ^^ I (Vince) think there are a number of opportunities for enhancing your R function. Herve might see them all at once.

Found a somewhat similar method by @seandavi here:
After some quick benchmarking, I also tried a method using only

Benchmark:

Running the internal code line-by-line, I've identified the following as the rate-limiting steps:

1. Reading in the full VCF. Reading in the full VCF is necessary for
2. Converting to data.frame. Converting the VCF https://github.com/seandavi/VCFWrenchR/blob/73da547dda3789887656c42f5ce4d529d5cbe498/R/vcf.R#L35
There may be some code in
Thanks for the tips, @lawremi. Let me know if you come across any specific bits of code that would be helpful to integrate.
Updates: reading VCFs

I wrote a function that identifies empty columns in the VCF based on the first `nrows` rows:

```r
#' Select VCF fields
#'
#' Select non-empty columns from each VCF field type.
#'
#' @param nrows Number of rows to use when inferring empty columns.
#' @inheritParams read_vcf2
#'
#' @returns \code{ScanVcfParam} object.
#' @keywords internal
select_vcf_fields <- function(path,
                              nrows = 5e6) {
  #### Read header ####
  ## Read the first nrows to determine which columns are useful.
  messager("Finding empty VCF columns based on first", nrows, "rows.")
  t1 <- Sys.time()
  gr <- GenomicRanges::GRanges(paste0("1:1-", as.integer(nrows)))
  vcf_top <- VariantAnnotation::readVcf(
    file = path,
    param = VariantAnnotation::ScanVcfParam(which = gr)
  )
  #### Find empty cols ####
  df <- vcf2df(vcf = vcf_top,
               add_sample_names = FALSE)
  empty_cols <- check_empty_cols(sumstats_dt = df)
  fields <- VariantAnnotation::vcfFields(vcf_top)
  for (x in names(fields)) {
    fields[[x]] <- fields[[x]][
      !fields[[x]] %in% names(empty_cols)
    ]
  }
  messager("Constructing ScanVcfParam object.")
  param <- VariantAnnotation::ScanVcfParam(
    fixed = fields$fixed,
    info = fields$info,
    geno = fields$geno,
    samples = fields$samples,
    trimEmpty = TRUE)
  methods::show(round(difftime(Sys.time(), t1), 1))
  return(param)
}
```

For the VCF path given in the above example, this reduces import time from 4.4 min to 3.3 min, which is good but could still use improvement. Support functions referenced in this code (`vcf2df`, `check_empty_cols`, `messager`) can be found in their respective files here: VCF --> data.table

@hpages @seandavi, any ideas about how to improve this? I've made several attempts but none seem to offer any speed boost.
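For context, a hypothetical usage of the function above, showing the two-pass pattern it enables: first infer the non-empty fields from the head of the file, then read the full file with the resulting param. The file path is a placeholder, and this assumes a bgzipped, tabix-indexed VCF:

```r
## Sketch only: assumes an indexed VCF at `path` and the
## select_vcf_fields() helper defined above.
library(VariantAnnotation)

path <- "sumstats.vcf.gz"  # hypothetical file

## Pass 1: infer which fixed/info/geno fields are non-empty.
param <- select_vcf_fields(path)

## Pass 2: read the full VCF, importing only the useful fields.
vcf <- readVcf(file = path, param = param)
```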
@vjcitn given that this seems to be the limiting factor, could you look into the method (which is internal to
Thank you for the link. We can look at it after the release of 3.15. Your tested PR would be appreciated -- are you able to do this sort of coding?
I don't feel that code with such hard-coded constants should be reused without some careful improvements. |
Happy to provide clues here and there, but someone with deeper knowledge of how
Agreed, but I just meant this as a starting point rather than a finished product. Feel free to make improvements as needed.
https://github.com/neurogenomics/MungeSumstats/blob/f581998e40fac9be31fcb4a1313a48543189ddc9/R/vcf2df.R#L101 is very slow for the example. I am not well-versed in VCF and I did not write VariantAnnotation. I think the task you have taken on may need to be trimmed down. For example, are these fields standardized in any way? I think one could rapidly preallocate and fill a matrix of the numeric/non-NA components of this family of fields. My approach would be to require the user to say which of the numeric components of geno(x) should be kept, build a matrix from those, and then convert that to data.table if you must. I think that would remove one of the slowest components of your process.
Exactly, omitting all-NA elements is what
Thank you for clarifying. Is there someone else I should be addressing for these Issues? I realize you and your co-authors initially published
The field names are standardized, but exactly which fields appear in which rows varies (even within a single file!). That's why simple parsing strategies using

For example, not every SNP/row will have an "SI" field, because not every SNP is imputed (some are directly measured by the genotyping assay). That said, some columns may be more consistent across rows than others: e.g. I would expect "ES" to be present across all rows. But this is all mostly conjecture based on a few example VCFs I've worked with, so I'm not sure how well it applies to others.
Maybe we could reset the conversation a bit with what the specific goal is? Parse VCF to data.frame / data.table? Or, maybe a little deeper: is there some functionality related to processing VCFs that is not available in the Bioconductor packages?

The way to iterate through a VCF file is to use

The GenomicFiles package provides functionality for processing chunks in parallel; usually this is meant to be a 'reduction', distributing the reading task to workers who also summarize some interesting aspect of the file. See

I will look at this in a little more detail today.
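As a concrete (hypothetical) sketch of the chunk-wise pattern described above, using a `VcfFile` with a `yieldSize` and `GenomicFiles::reduceByYield`; the file path, genome label, and chunk size are placeholders, not values from this thread:

```r
## Sketch only: chunked iteration over a VCF, assuming a
## bgzipped/indexed file at `path`.
library(VariantAnnotation)
library(GenomicFiles)

path <- "sumstats.vcf.gz"                # hypothetical file
vf <- VcfFile(path, yieldSize = 100000)  # read 100k records per chunk

## Count records without ever holding the whole file in memory:
## YIELD reads the next chunk, MAP summarizes it, REDUCE combines.
n_records <- reduceByYield(
  vf,
  YIELD  = function(x) readVcf(x, genome = "GRCh37"),
  MAP    = function(vcf) nrow(vcf),
  REDUCE = `+`
)
```

The MAP step is where a per-chunk `vcf2df()`-style conversion could go, with REDUCE concatenating the resulting tables.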
To clarify a couple of things.
In GenomicFiles using
to accomplish the same thing. Probably the munged summary statistics (the equivalent of
I think it is more robust to use independent processes (
I then have
This can be contrasted with sequential evaluation
The serial version takes about 180 seconds, whereas the parallel version takes about 60 seconds. I don't think the strategy of scanning the first
restricting to just the
decreases processing time to about 43 seconds under SnowParam. If there is just a single field of interest, then maybe update the
which takes about 30 seconds. Note that using
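The serial-vs-parallel comparison above refers to code chunks that were lost in this copy of the thread; a hedged approximation of the range-parallel approach is to tile the genome and read each tile in an independent worker process via `BiocParallel::SnowParam`. The file path, worker count, and tile count here are assumptions:

```r
## Sketch only: range-parallel VCF reading. Assumes an indexed VCF at
## `path` whose header declares contig lengths; numbers are illustrative.
library(VariantAnnotation)
library(GenomicRanges)
library(BiocParallel)

path  <- "sumstats.vcf.gz"  # hypothetical file
hdr   <- scanVcfHeader(path)
tiles <- unlist(tileGenome(seqinfo(hdr), ntile = 20))

## Independent processes (SnowParam), as suggested above.
bp <- SnowParam(workers = 4)
chunks <- bplapply(seq_along(tiles), function(i, path, tiles) {
  VariantAnnotation::readVcf(
    path,
    param = VariantAnnotation::ScanVcfParam(which = tiles[i])
  )
}, path = path, tiles = tiles, BPPARAM = bp)
```

Restricting the `ScanVcfParam` to a single field of interest (e.g. via its `info=` or `geno=` arguments) is what produces the further speedups quoted above.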
It seems like your use case is not to actually use the data structures / functionality of VariantAnnotation / GenomicRanges, but simply to parse VCF files into a different format (tibbles / data.frames). Depending on the use case, gdsfmt (via SeqArray, for instance) provides a very performant way to work with large VCF files. There is an 'import' step, so one would take this option if one intends to make repeated use of the data in the VCF. GDS files are generally (a little) smaller than compressed VCF while containing the same information stored in a more accessible way.

This took 190 seconds, which is comparable to readVcf, so not too costly; I believe this is also memory-efficient. Queries are very fast and flexible

implying that your data.frame / tibble representation could be constructed from GDS files in just a few seconds, from just the data that you're interested in. I have not used this tool extensively.
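A hedged sketch of the GDS route described above: a one-time import with `SeqArray::seqVCF2GDS`, followed by fast per-field queries with `seqGetData`. The file names are placeholders:

```r
## Sketch only: one-time VCF -> GDS conversion, then fast queries.
library(SeqArray)

vcf_path <- "sumstats.vcf.gz"  # hypothetical input
gds_path <- "sumstats.gds"     # hypothetical output

## Import step (slow, but done once per file).
seqVCF2GDS(vcf_path, gds_path)

## Queries against the GDS file pull only the requested fields,
## so a table can be assembled in seconds.
gds <- seqOpen(gds_path)
df <- data.frame(
  chrom = seqGetData(gds, "chromosome"),
  pos   = seqGetData(gds, "position"),
  rsid  = seqGetData(gds, "annotation/id")
)
seqClose(gds)
```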
This issue appears to have been addressed by updating the Rhtslib package to htslib 1.15.1. The update is available only in Bioconductor version 3.16 (Rsamtools 2.13.1 and later).
@mtmorgan this seems a bit unrelated to the

I think the end result is that the

So instead, I wrote a function to accomplish this, optimised the process as much as I could, and added it to our package

With the helpful suggestions from @mtmorgan and @vjcitn, this was further scaled up using parallelization (useful for very large files; for smaller files it's best to use the single-threaded version, since the overhead cost makes parallelization not worth it in those cases).
Hello,

We (myself and @Al-Murphy) use `VariantAnnotation` in our package `MungeSumstats` and are looking for a way to efficiently convert the (collapsed) VCF format to `data.table` (since the rest of our munging pipeline relies on everything being in table format). We've adapted solutions from various forums but it's still rather slow.

https://github.com/neurogenomics/MungeSumstats/blob/master/R/vcf2df.R

I was wondering if you know of a more efficient way to convert VCF to data.table (or a regular data.frame), or could see ways that our existing function could be improved?

Many thanks in advance,
Brian