Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File check passed: FALSE #4

Open
simonharnqvist opened this issue Apr 16, 2024 · 17 comments
Open

File check passed: FALSE #4

simonharnqvist opened this issue Apr 16, 2024 · 17 comments

Comments

@simonharnqvist
Copy link

Hi @StuartJEBaird and @nmartinkova

I'm trying to run DiemR following the vignettes (https://cran.r-project.org/web/packages/diemr/vignettes/Importing-data-for-genome-polarisation.html and https://cran.r-project.org/web/packages/diemr/vignettes/diemr-diagnostic-index-expecation-maximisation-in-r.html), but I'm running into issues with converting my VCF to Diem format.

I've converted my (subset of a) VCF file to the Diem format:

vcf2diem(SNP="brenthis_chr1.vcf",
         filename="brenthis",
         chunk=10) 

(Side question: what exactly does chunk set? It's clearly not the size of chunks in markers, nor the number of chunks - would be great if documentation could be a bit more explicit here)

I then went on to check if the first chunk is correctly formatted:

CheckDiemFormat(files="brenthis-01.txt",
                ChosenInds = 1:13,
                ploidy=list(rep(2,13)))

Which gives me:

File check passed: FALSE
Ploidy check passed: TRUE

I'm not getting any traceback, so it's hard to know what's wrong with my file. Here's what the first few lines of brenthis-01.txt looks like:

S22___________
S
S______2_0____
S______2_0____
S00_0____0____
S
S
S21_0_01_0____
S21_0_01_0____
S______2_0____
S
S______0_00___
S______0_02___
S
S
S_2____121____
S_1____001____

Ignoring this check and running diem anyway, it tells me that I have characters that are not allowed:

Error in emPolarise(origM = x[2:length(x)], changePolarity = x[1]) : 
  origM must contain only characters _, 0, 1, 2

I'm guessing it's not objecting to the 'S' beginning each line in the txt file?

What am I doing wrong here?

Thanks in advance!

@nmartinkova
Copy link
Collaborator

nmartinkova commented Apr 16, 2024

Your approach is correct and there is a bug in the diemr 1.2.2 available on CRAN that leaves blank lines starting with an S for omitted loci. The vcf2diem version available on GitHub has been fixed. Download the file and source it after loading the diemr package to update the function.

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 16, 2024 via email

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 16, 2024 via email

@simonharnqvist
Copy link
Author

Thanks for that lightning speed reply - sourcing the new version of vcf2diem did indeed fix the formatting problem. However, the file check still fails without further explanation.

Ignoring once again:

diem(files="brenthis-01.txt", 
     ploidy = list(rep(2,13)),
     markerPolarity = list(c(FALSE, FALSE, TRUE)),
     ChosenInds = 1:13,
     nCores = 1)

Gives me:

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 14, 26

(14,26) is invariant to the ploidy and ChosenInds parameter values, so I'm guessing the problem isn't there.

@StuartJEBaird - yes please, will bother you with questions about how to do this on a genome scale once I have the subset test working.

@nmartinkova
Copy link
Collaborator

Simon, at this stage, I cannot see what the problem is right away, and would need your input file to debug. Are you open to emailing it?

@simonharnqvist
Copy link
Author

@nmartinkova That's probably the best way to go about it - I'll share a link with you later this week. Thanks!

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 19, 2024 via email

@simonharnqvist
Copy link
Author

Emailed it to Natalia earlier - have also sent to your email now @StuartJEBaird

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 19, 2024 via email

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 19, 2024 via email

@simonharnqvist
Copy link
Author

Just to let you know: my immediate problem is fixed by using an alternative conversion script, so I'll leave it to you to decide when to close this issue

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 26, 2024 via email

@simonharnqvist
Copy link
Author

Hi Stuart - if the Dropbox link I sent is still working you will have brenthis_ino_daphne.vcf.gz in there. I used a script by Sam Ebdon (I've asked him if I can share it with you) that replaces vcf2diem.R. The output of that scrip passes both checks and diem seems to be running happily.

@StuartJEBaird
Copy link
Owner

StuartJEBaird commented Apr 26, 2024 via email

@nmartinkova
Copy link
Collaborator

Hi Simon,
I have now updated vcf2diem to solve the issue with marker 27750 in your file. The function now correctly identifies the two most frequent alleles in each marker regardless of whether the most frequent alleles are REF or ALT. Addtionally:

  • The includedLoci file now lists marker quality value from the vcf file and shows which allele is encoded as 0 in its homozygous state and which is encoded as 2.
  • The omittedLoci file now also includes the QUAL column data and provides a reason why each specific marker was omitted.

The function manual describes the reasoning for the chunk values. Briefly, if chunk < 100, vcf2diem expects to create as many output files and tries to guess how many markers should go into each. Otherwise, chunk is interpreted as the desired number of markers per output file.
I hope these updates will facilitate diem usage and result interpretation.
Thank you for your patience, while I was "terminally busy" (Stuart's quote).
Best wishes,
Natalia

@nmartinkova
Copy link
Collaborator

Hi Simon,

The issue with parsing the vcf files when the reference allele is not one of the two most frequent ones has now been solved and the update will be available from the version diemr 1.3 on CRAN.

Best,
Natalia

@simonharnqvist
Copy link
Author

Thanks - I'll give this a go when I get a chance, and then we can hopefully close this ticket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants