Sample.idError with seqVCF2GDS #87

Open
alexisregelson opened this issue Nov 6, 2023 · 3 comments

alexisregelson commented Nov 6, 2023

Hello, I am trying to use seqVCF2GDS and am getting the following error:

library(SeqArray)
library(data.table)

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=6L)
Mon Nov 6 16:09:06 2023
Variant Call Format (VCF) Import:
file(s):
r4_PASS_chr1_updated_varID_dups_drop_updated_IDs_nhw_hwe6_noNHWrelateds_high_mod_impact.vcf (198.8M)
file format: VCFv4.2
the number of sets of chromosomes (ploidy): 2
the number of samples: 14,306
genotype storage: bit2
compression method: LZMA_RA
# of samples: 14306
calculating the total number of variants ...
the total number of variants for import: 3,632
Writing to 6 files:
r4_chr1_high_mod_tmp01_ad336f56fc72 [1..606]
r4_chr1_high_mod_tmp02_ad3315e862b7 [607..1,212]
r4_chr1_high_mod_tmp03_ad33613818b1 [1,213..1,818]
r4_chr1_high_mod_tmp04_ad33473817c6 [1,819..2,424]
r4_chr1_high_mod_tmp05_ad334e0fea8c [2,425..3,030]
r4_chr1_high_mod_tmp06_ad33607634f8 [3,031..3,632]
Done (Mon Nov 6 16:09:10 2023).
Output:
r4_chr1_high_mod.gds
Merging:
opening 'r4_chr1_high_mod_tmp01_ad336f56fc72' ... [done]
opening 'r4_chr1_high_mod_tmp02_ad3315e862b7' ... [done]
opening 'r4_chr1_high_mod_tmp03_ad33613818b1' ... [done]
opening 'r4_chr1_high_mod_tmp04_ad33473817c6' ... [done]
opening 'r4_chr1_high_mod_tmp05_ad334e0fea8c' ... [done]
opening 'r4_chr1_high_mod_tmp06_ad33607634f8' ... [done]
Digests:
sample.idError: segfault from C stack overflow

Do the sample IDs need to be in a particular format? I created my VCF with plink and used the double-id option. The IDs are in the format A-[Cohort]-[A#####]. A .gds file is output, but I don't know whether it is incorrect because of the segfault.

gds <- seqOpen("r4_chr1_high_mod.gds")
gds
Object of class "SeqVarGDSClass"
File: r4_chr1_high_mod.gds (294.4K)

+ [ ] *
|--+ description [ ] *
|--+ sample.id { Str8 14306 LZMA_ra(2.94%), 12.6K }
|--+ variant.id { Int32 3632 LZMA_ra(12.7%), 1.8K }
|--+ position { Int32 3632 LZMA_ra(62.3%), 8.8K }
|--+ chromosome { Str8 3632 LZMA_ra(1.62%), 125B }
|--+ allele { Str8 3632 LZMA_ra(24.4%), 4.0K }
|--+ genotype [ ] *
|  |--+ data { Bit2 2x14306x3632 LZMA_ra(0.95%), 242.2K }
|  |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
|  |--+ data { Bit1 14306x3632 LZMA_ra(0.02%), 1.3K }
|  |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
|  |--+ id { Str8 3632 LZMA_ra(28.1%), 16.0K }
|  |--+ qual { Float32 3632 LZMA_ra(0.92%), 141B }
|  |--+ filter { Int32 3632 LZMA_ra(0.92%), 141B }
|  |--+ info [ ]
|  |  \--+ PR { Bit1 3632 LZMA_ra(18.9%), 93B } *
|  \--+ format [ ]
\--+ sample.annotation [ ]
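
One way to judge whether a file written before the segfault is usable is to compare the sample IDs stored in the GDS with those in the VCF header, and to look for empty, duplicated, or whitespace-containing IDs. The following is only a sketch: `check_sample_ids()` and `compare_vcf_gds_ids()` are hypothetical helpers, not part of SeqArray, and passing these checks would not prove that the rest of the file is intact.

```r
# Hypothetical helper: basic sanity checks on a vector of sample IDs.
check_sample_ids <- function(ids) {
  ids <- as.character(ids)
  list(
    n         = length(ids),                      # number of IDs
    n_missing = sum(is.na(ids) | !nzchar(ids)),   # NA or empty strings
    n_dup     = sum(duplicated(ids)),             # repeated IDs
    n_space   = sum(grepl("[[:space:]]", ids))    # IDs containing whitespace
  )
}

# Hypothetical helper: compare the IDs in a VCF header with those in a
# SeqArray GDS file. Requires SeqArray; file names are supplied by the caller.
compare_vcf_gds_ids <- function(vcf_fn, gds_fn) {
  stopifnot(requireNamespace("SeqArray", quietly = TRUE))
  vcf_ids <- SeqArray::seqVCF_SampID(vcf_fn)
  gds <- SeqArray::seqOpen(gds_fn)
  on.exit(SeqArray::seqClose(gds))
  gds_ids <- SeqArray::seqGetData(gds, "sample.id")
  list(
    identical = identical(as.character(vcf_ids), as.character(gds_ids)),
    vcf       = check_sample_ids(vcf_ids),
    gds       = check_sample_ids(gds_ids)
  )
}
```

With the files from this report, the call would look like `compare_vcf_gds_ids(high_mod_vcf, "r4_chr1_high_mod.gds")`.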

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/lib/libR.so
LAPACK: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/modules/lapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.8 SeqArray_1.26.2 gdsfmt_1.22.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.32.0 compiler_3.6.0 IRanges_2.20.2
[4] XVector_0.26.0 parallel_3.6.0 GenomicRanges_1.38.0
[7] GenomeInfoDbData_1.2.2 RCurl_1.95-4.12 Biostrings_2.54.0
[10] S4Vectors_0.24.4 BiocGenerics_0.32.0 GenomeInfoDb_1.22.1
[13] bitops_1.0-6 stats4_3.6.0

Thank you,
Alexis

zhengxwen (Owner) commented

See this line in your log:
the total number of variants for import: 3,632
This number is too small for parallel=6L to help at all.
I guess parallel=6L might trigger a bug in merging the data files when the number of variants is this small.

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=1)

It might solve your problem.
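
As a rough sketch of this workaround, the number of workers could be chosen from the variant count so that small files are imported with a single process. `choose_workers()` and its `min_per_worker` threshold are hypothetical and purely illustrative; they are not SeqArray settings, and this is a workaround rather than a confirmed fix.

```r
# Hypothetical helper: fall back to one worker when the VCF has few variants.
# `min_per_worker` is an arbitrary illustrative threshold.
choose_workers <- function(n_variants, requested = 6L, min_per_worker = 10000L) {
  stopifnot(n_variants >= 0, requested >= 1, min_per_worker >= 1)
  as.integer(max(1L, min(requested, n_variants %/% min_per_worker)))
}

# Variant count taken from the import log in this issue.
n_workers <- choose_workers(3632)

# Placeholder paths, not the files from this issue; the import only runs
# when SeqArray is installed and the input exists.
vcf_fn <- "input.vcf"
if (requireNamespace("SeqArray", quietly = TRUE) && file.exists(vcf_fn)) {
  SeqArray::seqVCF2GDS(vcf_fn, "output.gds", parallel = n_workers)
}
```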

@zhengxwen zhengxwen self-assigned this Nov 7, 2023
@zhengxwen zhengxwen added the bug label Nov 7, 2023
alexisregelson (Author) commented Jan 8, 2024

Hello,

I've now tried this with a VCF with 200k+ variants. I have successfully converted this VCF to a GDS file using SNPRelate. However, I am using other software that specifically needs the GDS file in SeqArray format, not SNPRelate format. But I am still getting the same error: sample.idError: segfault from C stack overflow.

Alexis
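
Since a SNPRelate GDS file already exists, one option that might be worth trying is SeqArray's `seqSNP2GDS()`, which converts a SNPRelate GDS file into SeqArray format. The wrapper below is a sketch only: `snp_to_seq_gds()` is a hypothetical helper, the file names are placeholders, and whether this route avoids the segfault on the R and package versions in this report is untested.

```r
# Hypothetical wrapper around SeqArray::seqSNP2GDS(); returns the number of
# samples in the converted file as a quick sanity check.
snp_to_seq_gds <- function(snp_gds_fn, seq_gds_fn) {
  if (!file.exists(snp_gds_fn)) {
    stop("SNPRelate GDS file not found: ", snp_gds_fn)
  }
  if (!requireNamespace("SeqArray", quietly = TRUE)) {
    stop("SeqArray is not installed")
  }
  # Convert the SNPRelate GDS file to a SeqArray GDS file.
  SeqArray::seqSNP2GDS(snp_gds_fn, seq_gds_fn)
  # Reopen the result and count the stored sample IDs.
  gds <- SeqArray::seqOpen(seq_gds_fn)
  on.exit(SeqArray::seqClose(gds))
  length(SeqArray::seqGetData(gds, "sample.id"))
}
```

A call would look like `snp_to_seq_gds("snprelate.gds", "seqarray.gds")`, with both names replaced by real paths.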

zhengxwen (Owner) commented

Your R and gdsfmt versions are old.
The recent updates were made with a focus on R (>= v4.0).
I suggest using SeqArray GDS format instead of SNPRelate GDS.
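
A quick way to check a session against the R (>= 4.0) requirement mentioned above is sketched here. `is_older()` is a hypothetical helper; the package check only reports the installed versions and does not assume any particular minimum for gdsfmt or SeqArray.

```r
# Hypothetical helper: TRUE when `installed` is older than `minimum`.
is_older <- function(installed, minimum) {
  package_version(installed) < package_version(minimum)
}

# The R version from the sessionInfo() in this issue is older than 4.0.0.
is_older("3.6.0", "4.0.0")   # TRUE

# Check the running session.
r_too_old <- is_older(as.character(getRversion()), "4.0.0")

# Report versions of the relevant packages, if they are installed.
pkgs <- c("gdsfmt", "SeqArray")
installed <- pkgs[vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
versions <- vapply(installed,
                   function(p) as.character(utils::packageVersion(p)),
                   character(1))
print(versions)
```

If `r_too_old` is `TRUE`, upgrading R and then reinstalling both packages from Bioconductor (for example with `BiocManager::install()`) would be the natural next step.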
