
Converting VCF --> data.table #57

Closed
bschilder opened this issue Apr 15, 2022 · 18 comments

@bschilder
Contributor

Hello,

We (@Al-Murphy and I) use VariantAnnotation in our package MungeSumstats and are looking for a way to efficiently convert the (collapsed) VCF format to data.table, since the rest of our munging pipeline relies on everything being in table format.

We've adapted solutions from various forums but it's still rather slow.
https://github.com/neurogenomics/MungeSumstats/blob/master/R/vcf2df.R
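
In essence, the function boils down to something like this (a simplified sketch; the real function linked above also handles list-columns such as ALT and multi-sample geno data, which is where most of the time goes):

library(VariantAnnotation)
library(data.table)

vcf2dt_sketch <- function(vcf) {
    gr <- SummarizedExperiment::rowRanges(vcf)
    dt <- data.table::data.table(
        SNP = names(gr),
        CHR = as.character(GenomeInfoDb::seqnames(gr)),
        BP  = BiocGenerics::start(gr))
    # INFO: assumes atomic columns only; list-columns (Number=A/.)
    # need unlisting first, which is the slow part in practice.
    cbind(dt, data.table::as.data.table(
        as.data.frame(VariantAnnotation::info(vcf))))
}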

I was wondering if you know of a more efficient way to convert VCF to data.table (or a regular data.frame), or could see ways that our existing function could be improved?

Many thanks in advance,
Brian

@vjcitn
Contributor

vjcitn commented Apr 15, 2022

@hpages ^^ I (Vince) think there would be a number of opportunities for enhancing your R function. Herve might see them all at once.

@bschilder
Contributor Author

Found a somewhat similar method by @seandavi here:
https://github.com/seandavi/VCFWrenchR/blob/master/R/vcf.R

@bschilder
Contributor Author

bschilder commented Apr 15, 2022

After some quick benchmarking, VCFWrenchR's approach seems to be about 3x faster than ours and returns almost identical data.frames.

I also tried a method using only VariantAnnotation functions, which works but is far slower than the other 2 methods.

Benchmark: N <- 1e5

library(VCFWrenchR)
library(VariantAnnotation)

path <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
vcf <- VariantAnnotation::readVcf(file = path)
N <- 1e5
vcf_sub <- vcf[1:N,]

res <- microbenchmark::microbenchmark(
    "vcf2df"={dat1 <- MungeSumstats:::vcf2df(vcf = vcf_sub)},
    "VCFWrenchR"= {dat2 <- as.data.frame(x = vcf_sub)}, 
    "VRanges"={dat3 <- data.table::as.data.table(
        methods::as(vcf_sub, "VRanges"))},
    times=1
)
testthat::expect_equal(nrow(dat1), N)
testthat::expect_equal(nrow(dat2), N)
testthat::expect_equal(nrow(dat3), N) 

Results: [screenshot of microbenchmark timings]

Benchmark: N <- 1e6

Given that 1e5 represents <1% of the total variants in this VCF (11,734,353 total), it becomes clear that the "VRanges" method is not scalable.

N <- 1e6
vcf_sub <- vcf[1:N,]

res <- microbenchmark::microbenchmark(
    "vcf2df"={dat1 <- MungeSumstats:::vcf2df(vcf = vcf_sub)},
    "VCFWrenchR"= {dat2 <- as.data.frame(x = vcf_sub)},
    # "VRanges"={dat3 <- data.table::as.data.table(
    #     methods::as(vcf_sub, "VRanges"))},
    times=1
)

Results: [screenshot of microbenchmark timings]

Benchmark: all variants

Finally, I tested VCFWrenchR using all 11+ million variants, which took ~5 min. That's OK, but it would be great to optimise further to get it into the <1 min range.

Other notes

We also have tried parsing VCFs by reading them in with data.table; this is very messy (doesn't parse VCFs properly in all situations) but extremely fast (<30 seconds).
https://github.com/neurogenomics/MungeSumstats/blob/baf7fe076341ff9dbda919e41e551e6e70f8a6e8/R/read_vcf.R
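
The core of that approach is just (a sketch; the linked read_vcf.R handles the messy parts):

# fread() can jump straight past the '##' meta-lines by searching for
# the '#CHROM' column-header line (skip= accepts a search string).
dat <- data.table::fread(path, skip = "#CHROM")
# The FORMAT/sample columns then still need splitting on ':', which is
# exactly where this approach gets messy.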

VariantAnnotation handles VCF data parsing much better though. So we would love to find a way that can take advantage of the speed benefits of data.table and the comprehensiveness of VariantAnnotation (or, seemingly, VCFWrenchR).

@vjcitn is this something you could try implementing within VariantAnnotation directly? VCFWrenchR isn't a Bioc/CRAN package, so we wouldn't be able to use it in Bioc packages (e.g. MungeSumstats), and I think it would make sense to have this function/method bundled within VariantAnnotation itself.

Session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.40.0    Rsamtools_2.10.0            Biostrings_2.62.0          
 [4] XVector_0.34.0              SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [7] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[10] S4Vectors_0.32.4            MatrixGenerics_1.6.0        matrixStats_0.61.0         
[13] BiocGenerics_0.40.0         MungeSumstats_1.3.19       

loaded via a namespace (and not attached):
  [1] TH.data_1.1-0                               rjson_0.2.21                               
  [3] ellipsis_0.3.2                              rprojroot_2.0.3                            
  [5] fs_1.5.2                                    rstudioapi_0.13                            
  [7] roxygen2_7.1.2                              waldo_0.4.0                                
  [9] remotes_2.4.2                               bit64_4.0.5                                
 [11] mvtnorm_1.1-3                               AnnotationDbi_1.56.2                       
 [13] fansi_1.0.3                                 diffobj_0.3.5                              
 [15] xml2_1.3.3                                  codetools_0.2-18                           
 [17] splines_4.1.3                               R.methodsS3_1.8.1                          
 [19] cachem_1.0.6                                knitr_1.38                                 
 [21] pkgload_1.2.4                               jsonlite_1.8.0                             
 [23] dbplyr_2.1.1                                png_0.1-7                                  
 [25] SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20    R.oo_1.24.0                                
 [27] SNPlocs.Hsapiens.dbSNP144.GRCh38_0.99.20    compiler_4.1.3                             
 [29] httr_1.4.2                                  assertthat_0.2.1                           
 [31] Matrix_1.4-1                                fastmap_1.1.0                              
 [33] gargle_1.2.0                                cli_3.2.0                                  
 [35] htmltools_0.5.2                             prettyunits_1.1.1                          
 [37] tools_4.1.3                                 glue_1.6.2                                 
 [39] GenomeInfoDbData_1.2.7                      dplyr_1.0.8                                
 [41] rappdirs_0.3.3                              Rcpp_1.0.8.3                               
 [43] xopen_1.0.0                                 vctrs_0.4.0                                
 [45] rtracklayer_1.54.0                          xfun_0.30                                  
 [47] stringr_1.4.0                               ps_1.6.0                                   
 [49] brio_1.1.3                                  testthat_3.1.3                             
 [51] lifecycle_1.0.1                             restfulr_0.0.13                            
 [53] devtools_2.4.3                              XML_3.99-0.9                               
 [55] googleAuthR_2.0.0                           MASS_7.3-56                                
 [57] zlibbioc_1.40.0                             zoo_1.8-9                                  
 [59] microbenchmark_1.4.9                        BSgenome_1.62.0                            
 [61] hms_1.1.1                                   parallel_4.1.3                             
 [63] sandwich_3.0-1                              rematch2_2.1.2                             
 [65] VCFWrenchR_0.1.1                            BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000     
 [67] yaml_2.3.5                                  curl_4.3.2                                 
 [69] memoise_2.0.1                               rcmdcheck_1.4.0                            
 [71] biomaRt_2.50.3                              stringi_1.7.6                              
 [73] RSQLite_2.2.12                              BiocIO_1.4.0                               
 [75] desc_1.4.1                                  GenomicFeatures_1.46.5                     
 [77] filelock_1.0.2                              pkgbuild_1.3.1                             
 [79] BiocParallel_1.28.3                         rlang_1.0.2                                
 [81] pkgconfig_2.0.3                             commonmark_1.8.0                           
 [83] bitops_1.0-7                                evaluate_0.15                              
 [85] lattice_0.20-45                             purrr_0.3.4                                
 [87] GenomicAlignments_1.30.0                    bit_4.0.4                                  
 [89] processx_3.5.3                              tidyselect_1.1.2                           
 [91] magrittr_2.0.3                              R6_2.5.1                                   
 [93] generics_0.1.2                              multcomp_1.4-18                            
 [95] DelayedArray_0.20.0                         DBI_1.1.2                                  
 [97] pillar_1.7.0                                withr_2.5.0                                
 [99] survival_3.3-1                              KEGGREST_1.34.0                            
[101] RCurl_1.98-1.6                              tibble_3.1.6                               
[103] crayon_1.5.1                                utf8_1.2.2                                 
[105] BiocFileCache_2.2.1                         rmarkdown_2.13                             
[107] progress_1.2.2                              usethis_2.1.5                              
[109] grid_4.1.3                                  data.table_1.14.2                          
[111] blob_1.2.2                                  callr_3.7.0                                
[113] digest_0.6.29                               BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
[115] R.utils_2.11.0                              sessioninfo_1.2.2   

@bschilder
Contributor Author

bschilder commented Apr 15, 2022

Running the internal code line-by-line, I've identified the following as the rate-limiting steps:

1. Reading in the full VCF

Reading in the full VCF is necessary for MungeSumstats. Does readVcf take advantage of parallelisation? Are there other ways to speed this step up?

vcf <- VariantAnnotation::readVcf(file = path)

2. Converting to data.frame

Converting the VCF rowRanges and info to data.frame (or data.table) takes quite a long time. Is there a more efficient method to do this?

https://github.com/seandavi/VCFWrenchR/blob/73da547dda3789887656c42f5ce4d529d5cbe498/R/vcf.R#L35

@lawremi
Contributor

lawremi commented Apr 15, 2022

There may be some code in writeVcf() that could accelerate some parts of this. Not sure, since it's been a while since I worked on it. You could speed up readVcf() by restricting the read operation to the columns you care about. I am pretty sure there is a way to read chunks of the file, which could be done in parallel. Some sort of synchronized iterator would be ideal.
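
For instance, something like this (a sketch; the ES/SE field names come from the example file above):

library(VariantAnnotation)
param <- ScanVcfParam(info = NA,             # NA = return none of these fields
                      geno = c("ES", "SE"))  # keep only these FORMAT fields
vcf <- readVcf("ubm-a-2929.vcf.gz", genome = "GRCh37", param = param)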

@bschilder
Contributor Author

Thanks for the tips, @lawremi. Let me know if you come across any specific bits of code that would be helpful to integrate.

@bschilder
Contributor Author

Updates

Reading VCFs

I wrote a function that identifies empty columns in the VCF based on its first nrows rows, and omits them from the full query.

#' Select VCF fields
#' 
#' Select non-empty columns from each VCF field type.
#' 
#' @param nrows Number of rows to use when inferring empty columns.
#' @inheritParams read_vcf2
#' 
#' @returns \code{ScanVcfParam} object.
#' @keywords internal
select_vcf_fields <- function(path,
                              nrows=5e6){
    #### Read header ####
    ## Read the first nrows to determine which columns are useful.
    messager("Finding empty VCF columns based on first",nrows,"rows.") 
    t1 <- Sys.time()
    gr <- GenomicRanges::GRanges(paste0("1:1-",as.integer(nrows)))
    vcf_top <- VariantAnnotation::readVcf(
        file = path, 
        param = VariantAnnotation::ScanVcfParam(which = gr)
    ) 
    #### Find empty cols #####
    df <- vcf2df(vcf = vcf_top, 
                 add_sample_names = FALSE)
    empty_cols <- check_empty_cols(sumstats_dt = df) 
    fields <- VariantAnnotation::vcfFields(vcf_top)
    for(x in names(fields)){
        fields[[x]] <- fields[[x]][
            !fields[[x]] %in% names(empty_cols)
            ]
    } 
    messager("Constructing ScanVcfParam object.")
    param <- VariantAnnotation::ScanVcfParam(
        fixed = fields$fixed, 
        info = fields$info, 
        geno = fields$geno, 
        samples = fields$samples, 
        trimEmpty = TRUE)
    methods::show(round(difftime(Sys.time(),t1),1))
    return(param)
}

For the VCF path given in the above example, this reduces import time from 4.4 min to 3.3 min, which is good but could still use improvement. Support functions referenced in this code can be found in their respective files here:
https://github.com/neurogenomics/MungeSumstats/tree/master/R
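
Usage then looks roughly like this (same path as in the benchmarks above):

path <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
param <- select_vcf_fields(path = path)
vcf <- VariantAnnotation::readVcf(file = path, param = param)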

VCF --> data.table

@hpages @seandavi, any ideas about how to improve this? I've made several attempts but none seem to offer any speed boost.

2. Converting to data.frame

Converting the VCF rowRanges and info to data.frame (or data.table) takes quite a long time. Is there a more efficient method to do this?

https://github.com/seandavi/VCFWrenchR/blob/73da547dda3789887656c42f5ce4d529d5cbe498/R/vcf.R#L35

@vjcitn given that this seems to be the limiting factor, could you look into the method (internal to VariantAnnotation?) that handles this and see if it could be improved? Thanks!

@bschilder
Contributor Author

@vjcitn I believe this is the chunking code within writeVcf that @lawremi was referring to. Do you think you could adapt this to be used in readVcf as well?

@vjcitn
Contributor

vjcitn commented Apr 25, 2022

Thank you for the link. We can look at it after release of 3.15. Your tested PR would be appreciated -- are you able to do this sort of coding?

@vjcitn
Contributor

vjcitn commented Apr 25, 2022

I don't feel that code with such hard-coded constants should be reused without some careful improvements.

@bschilder
Contributor Author

bschilder commented Apr 25, 2022

Thank you for the link. We can look at it after release of 3.15. Your tested PR would be appreciated -- are you able to do this sort of coding?

Happy to provide clues here and there, but someone with deeper knowledge of how VariantAnnotation works internally (i.e. the maintainers) would be much better suited to implement these suggestions. But I'd be happy to be a tester once the changes are implemented.

I don't feel that code with such hard-coded constants should be reused without some careful improvements.

Agreed, but I just meant this as a starting point rather than a finished product. Feel free to make improvements as needed.

@vjcitn
Contributor

vjcitn commented May 3, 2022

https://github.com/neurogenomics/MungeSumstats/blob/f581998e40fac9be31fcb4a1313a48543189ddc9/R/vcf2df.R#L101 is very slow. For the example ubm-a-2929.vcf.gz that you provided (perhaps elsewhere), there are elements of geno() that are all NA.

I am not well-versed in VCF and I did not write VariantAnnotation. I think the task you have taken on may need to be trimmed down. For example, are

##FORMAT=<ID=ES,Number=A,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard error of effect size estimate">
##FORMAT=<ID=LP,Number=A,Type=Float,Description="-log10 p-value for effect estimate">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Alternate allele frequency in the association study">
##FORMAT=<ID=SS,Number=A,Type=Float,Description="Sample size used to estimate genetic effect">
##FORMAT=<ID=EZ,Number=A,Type=Float,Description="Z-score provided if it was used to derive the EFFECT and SE fields">
##FORMAT=<ID=SI,Number=A,Type=Float,Description="Accuracy score of summary data imputation">
##FORMAT=<ID=NC,Number=A,Type=Float,Description="Number of cases used to estimate genetic effect">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Study variant identifier">

these standardized in any way? I think one could rapidly preallocate and fill a matrix of the numeric/non-NA components of this family of fields. My approach would be to require the user to specify which of the numeric components of geno(x) should be kept, build a matrix from those, and then convert that to data.table if you must. I think that would remove one of the slowest components of your process.
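
Roughly like this, I imagine (a sketch of the preallocation idea; fields would be user-supplied, and it assumes a single-sample VCF with one value per record):

geno_to_matrix <- function(vcf, fields = c("ES", "SE")) {
    out <- matrix(NA_real_, nrow = nrow(vcf), ncol = length(fields),
                  dimnames = list(NULL, fields))
    for (f in fields) {
        g <- VariantAnnotation::geno(vcf)[[f]]
        # Number=A fields can come back as matrices of lists; flatten,
        # assuming one ALT allele (hence one value) per variant.
        out[, f] <- as.numeric(unlist(g, use.names = FALSE))
    }
    out
}
# dt <- data.table::as.data.table(geno_to_matrix(vcf_sub))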

@bschilder
Contributor Author

bschilder commented May 4, 2022

https://github.com/neurogenomics/MungeSumstats/blob/f581998e40fac9be31fcb4a1313a48543189ddc9/R/vcf2df.R#L101 is very slow. For the example ubm-a-2929.vcf.gz that you provided (perhaps elsewhere), there are elements of geno() that are all NA.

Exactly; omitting all-NA elements is what select_vcf_fields aims to do, by sampling only the first N rows and inferring which columns are empty from those.

I am not well-versed in VCF and I did not write VariantAnnotation.

Thank you for clarifying. Is there someone else I should be addressing for these Issues? I realize you and your co-authors initially published VariantAnnotation quite a while ago. Who is currently responsible for its maintenance? Perhaps some of the other co-authors could offer some insight as to how this is being handled? @mtmorgan @vobencha @lawremi

I think the task you have taken on may need to be trimmed down. For example, are

##FORMAT=<ID=ES,Number=A,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard error of effect size estimate">
##FORMAT=<ID=LP,Number=A,Type=Float,Description="-log10 p-value for effect estimate">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Alternate allele frequency in the association study">
##FORMAT=<ID=SS,Number=A,Type=Float,Description="Sample size used to estimate genetic effect">
##FORMAT=<ID=EZ,Number=A,Type=Float,Description="Z-score provided if it was used to derive the EFFECT and SE fields">
##FORMAT=<ID=SI,Number=A,Type=Float,Description="Accuracy score of summary data imputation">
##FORMAT=<ID=NC,Number=A,Type=Float,Description="Number of cases used to estimate genetic effect">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Study variant identifier">

these standardized in any way? I think one could rapidly preallocate and fill a matrix of the numeric/non-NA components of this family of fields. My approach would be to require the user to specify which of the numeric components of geno(x) should be kept, build a matrix from those, and then convert that to data.table if you must. I think that would remove one of the slowest components of your process.

The field names are standardized, but exactly which fields appear in which rows varies (even within a single file!). That's why simple parsing strategies using data.table alone fail to extract this information correctly (as we tried originally in MungeSumstats). The power of VariantAnnotation for parsing VCFs correctly is why so many other R packages rely on it.

For example, not every SNP/row will have a "SI" field because not every SNP is imputed (some are directly measured by the genotyping assay). That said, some columns may be more consistent across rows than others: e.g. I would expect "ES" to be present across all rows. But this is all mostly conjecture based on a few example VCFs I've worked with, so not sure how well it applies to others.

@mtmorgan
Contributor

mtmorgan commented May 4, 2022

Maybe we could reset the conversation a bit: what is the specific goal? Parsing VCF to data.frame / data.table? Or, a little deeper, is there some functionality related to processing VCFs that is not available in the Bioconductor packages?

The way to iterate through a VCF file is to use VcfFile() with yieldSize=; reading section 7 of the vignette also points to other ways to increase performance, including selecting specific fields for parsing using the info= and geno= arguments. (The vignette refers to TabixFile() but is out of date [though not broken], as there is now a VcfFile() with similar functionality.)

The GenomicFiles package provides functionality for processing chunks in parallel; usually this is meant to be a 'reduction', distributing the reading task to workers who also summarize some interesting aspect of the file. See ?reduceByYield.

I will look at this in a little more detail today.

@mtmorgan
Contributor

mtmorgan commented May 4, 2022

To clarify a couple of things. yieldSize= can be used to read through the entire file, as in this example

library(VariantAnnotation)
fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation",
                       mustWork=TRUE)
nrow(readVcf(fl)) # only 3791 rows

vcffile <- VcfFile(fl, yieldSize = 1000) # unrealistically small yieldSize for illustration
open(vcffile)
n <- 0L
repeat {
    vcf = readVcf(vcffile) # read in a chunk of 1000
    if (nrow(vcf) == 0L)
        break
    message(nrow(vcf))  # informational
    n <- n + nrow(vcf)  # MAP from vcf to nrow(vcf), REDUCE by summing
}
close(vcffile)
n   # 3791

In GenomicFiles using reduceByYield() we would write

library(GenomicFiles)
vcffile <- VcfFile(fl, yieldSize = 1000)
reduceByYield(vcffile, YIELD = readVcf, MAP = nrow, REDUCE = `+`)

to accomplish the same thing. reduceByYield() does have a parallel= argument but, as the help page indicates, it is the MAP stage (not the YIELD [read] stage) that is run in parallel. This would be appropriate if you were performing significant (tens of seconds?) computation on each chunk.

Probably the munged summary statistics (the equivalent of nrow() in our example) are quickly calculated. It then makes sense to iterate across the genome by ranges, with each range processed independently & in parallel. I implemented this using tileGenome() to divide the genome into successive windows

fl <- "ubm-a-2929.vcf.gz"
tiles <-
    seqinfo(VcfFile(fl)) |>
    keepSeqlevels(as.character(1:22)) |>
    tileGenome(cut.last.tile.in.chrom = TRUE, tilewidth = 10000000)

tiles spans the (autosome) genome with 300 ranges of 10,000,000 base pairs each. I then use reduceByRange() to iterate over the ranges. reduceByRange() takes a MAP= function that interprets each range and file for processing. I wrote this as

MAP <- function(range, file) {
    param = VariantAnnotation::ScanVcfParam(which = range)
    vcf = VariantAnnotation::readVcf(file, "HG19/GRCh37", param)
    nrow(vcf)
}

I think it is more robust to use independent processes (SnowParam()) rather than forked processes (MulticoreParam()) for parallel processing, because the underlying libraries were not written to be re-entrant. So for instance I registered the following

BiocParallel::register(SnowParam(8, progressbar = TRUE))

I then have

system.time({
    result <- reduceByRange(tiles, fl, MAP, `+`)
})

This can be contrasted with sequential evaluation

BiocParallel::register(SerialParam(progressbar = TRUE))
system.time({
    result <- reduceByRange(tiles, fl, MAP, `+`)
})

The serial version takes about 180 seconds, whereas the parallel version takes about 60 seconds.

I don't think the strategy of scanning the first n records and then selecting only the non-empty geno fields is a good one. The header is like a contract that describes the content of the VCF file (or perhaps of related VCF files, one for each chromosome). If the header is inaccurate, then it should be updated. But likely the header is accurate, just not for the subset of records examined. It can make a significant difference to processing speed to be selective about which info and geno fields are processed, so for instance

> geno(scanVcfHeader(fl))
DataFrame with 9 rows and 3 columns
        Number        Type            Description
   <character> <character>            <character>
ES           A       Float Effect size estimate..
SE           A       Float Standard error of ef..
LP           A       Float -log10 p-value for e..
AF           A       Float Alternate allele fre..
SS           A       Float Sample size used to ..
EZ           A       Float Z-score provided if ..
SI           A       Float Accuracy score of su..
NC           A       Float Number of cases used..
ID           1      String Study variant identi..

restricting to just the ES and SE fields

MAP <- function(range, file) {
    param = VariantAnnotation::ScanVcfParam(
        which = range,
        geno = c("ES", "SE")
    )
    vcf = VariantAnnotation::readVcf(file, "HG19/GRCh37", param)
    nrow(vcf)
}

decreases processing time to about 43 seconds under SnowParam.

If there is just a single field of interest, then maybe update the MAP function to

MAP <- function(range, file) {
    param = VariantAnnotation::ScanVcfParam(which = range)
    geno = VariantAnnotation::readGeno(file, "ES", param)
    nrow(geno)
}

which takes about 30 seconds. Note that using readGeno twice would take more than twice the time it took to read both geno fields with readVcf(), so this would not be a useful strategy if 'munging' more than one field.

readVcf() and reduceByRange() under parallel evaluation are pretty competitive with (serial) data.table but with parsing included.
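
To connect this back to the original goal, the per-range MAP could return data.table chunks directly, bound once at the end (a sketch reusing tiles and fl from above; the geno handling assumes one value per record, as in this single-sample file):

MAP_dt <- function(range, file) {
    param <- VariantAnnotation::ScanVcfParam(which = range,
                                             geno = c("ES", "SE"))
    vcf <- VariantAnnotation::readVcf(file, "HG19/GRCh37", param)
    gr <- SummarizedExperiment::rowRanges(vcf)
    data.table::data.table(
        chr = as.character(GenomeInfoDb::seqnames(gr)),
        pos = BiocGenerics::start(gr),
        ES  = as.numeric(unlist(VariantAnnotation::geno(vcf)$ES)),
        SE  = as.numeric(unlist(VariantAnnotation::geno(vcf)$SE)))
}
# Pass objects explicitly so SnowParam workers can see them.
chunks <- BiocParallel::bplapply(
    seq_along(tiles),
    function(i, tiles, fl, MAP) MAP(tiles[i], fl),
    tiles = tiles, fl = fl, MAP = MAP_dt)
dat <- data.table::rbindlist(chunks)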

@mtmorgan
Contributor

mtmorgan commented May 6, 2022

It seems like your use case is not to actually use the data structures / functionality of VariantAnnotation / GenomicRanges, but simply to parse VCF files to a different format (tibbles / data.frames).

Depending on the use case, gdsfmt (via SeqArray, for instance) provides a very performant way to work with large VCF files. There is an 'import' step, so one would take this option if one intends to make repeated use of the data in the VCF. GDS files are generally (a little) smaller than compressed VCF while containing the same information stored in a more accessible way.

library(SeqArray)
vcffile <- "ubm-a-2929.vcf.gz"
gdsfile <- "ubm-a-2929.gds"

system.time({
    seqVCF2GDS(vcffile, gdsfile, verbose = FALSE, parallel = TRUE)
})

This took 190 seconds, which is comparable to readVcf so not too costly; I believe this is also memory efficient. Queries are very fast and flexible

gds <- seqOpen(gdsfile)
system.time({
    es <- seqGetData(gds, "annotation/format/ES")
})
## 1.53 seconds

implying that your data.frame / tibble representation of the fields you're interested in could be constructed from GDS files in just a few seconds. I have not used this tool extensively.
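
For instance (the "annotation/..." paths follow SeqArray's standard VCF import layout; treat this as a sketch):

library(SeqArray)
library(data.table)

gds <- seqOpen(gdsfile)
# Variable-length FORMAT fields come back as a list with $length/$data;
# this file has one value per variant.
get_fmt <- function(gds, field) {
    x <- seqGetData(gds, paste0("annotation/format/", field))
    if (is.list(x)) x <- x$data
    as.numeric(x)
}
dat <- data.table(
    chr = seqGetData(gds, "chromosome"),
    pos = seqGetData(gds, "position"),
    id  = seqGetData(gds, "annotation/id"),
    ES  = get_fmt(gds, "ES"),
    SE  = get_fmt(gds, "SE"))
seqClose(gds)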

@mtmorgan
Contributor

This issue appears to have been addressed by updating the Rhtslib package to htslib 1.15.1. The update is available only in Bioconductor 3.16 (Rsamtools 2.13.1 and later).

@bschilder
Contributor Author

@mtmorgan this seems a bit unrelated to the Rhtslib updates (though they were certainly helpful to other issues).

I think the end result is that the VariantAnnotation authors decided that converting VCFs to data.table format was outside the scope of the package.

So instead, I wrote a function to accomplish this, optimised the process as much as I could, and added it to our package MungeSumstats (with @Al-Murphy). It is now implemented as the internal function MungeSumstats:::vcf2df, which is called by the exported functions MungeSumstats::read_vcf and MungeSumstats::read_sumstats.

With the helpful suggestions from @mtmorgan and @vjcitn, this was further scaled up using parallelisation (useful for very large files; for smaller files the single-threaded version is best, since the parallelisation overhead outweighs the gains).
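
For anyone landing here later, usage looks like this (argument names may differ; see the function documentation for the authoritative signature):

# Read a VCF of GWAS summary statistics into a data.table.
dat <- MungeSumstats::read_vcf(path = "ubm-a-2929.vcf.gz")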
