Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

realSFS cat #60

Closed
claudiuskerth opened this issue Dec 1, 2016 · 13 comments
Closed

realSFS cat #60

claudiuskerth opened this issue Dec 1, 2016 · 13 comments

Comments

@claudiuskerth
Copy link

Hi Thorfinn,

I have created many saf files that I thought I could concatenate with realSFS cat, but I can't make it work. Could you please provide some usage information or a little example?

many thanks

claudius

@claudiuskerth
Copy link
Author

Ok, a little more trial and error leads me to think that the usage of realSFS cat is:

realSFS cat -outnames <merged_filename_stub> [*saf.idx]

This creates a merged *saf.idx, saf.gz and saf.pos.gz file.

Although I have thus merged 70 saf files without error message from realSFS cat, I cannot print the merged *saf.idx file. When I run realSFS print merged.saf.idx | less -S, I get

[E::bgzf_read] bgzf_read_block error -1 after 0 of 528 bytes
-> Problem reading chunk in bgzf_read

@ANGSD
Copy link
Owner

ANGSD commented Dec 2, 2016 via email

@claudiuskerth
Copy link
Author

It's not the globbing (wouldn't that be done by the shell anyway?). Here is what I did and the STDERR output:

realSFS cat -outnames Par.unfolded.merged PAR.unfolded.cq.saf.idx PAR.unfolded.cr.saf.idx

-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> Version of fname:PAR.unfolded.cq.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cq.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cq.saf.pos.gz
-> Version of fname:PAR.unfolded.cr.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cr.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cr.saf.pos.gz
-> outnames: 'Par.unfolded.merged' number of safs:2
-> Merging 0/2
-> Merging 1/2

realSFS print Par.unfolded.merged.saf.idx

[E::bgzf_read] bgzf_read_block error -1 after 0 of 208 bytes
-> Problem reading chunk in bgzf_read

@ANGSD
Copy link
Owner

ANGSD commented Dec 5, 2016

Hi Im currently travelling with limited access to internet, i will look into this when im back. Sorry about this.

@ANGSD
Copy link
Owner

ANGSD commented Dec 8, 2016

Hi It seems to work for my test scenarios. Can you do print on all your input saf.idx?

thorfinn@rowdy:/angsd$ ./angsd -b list -dosaf 1 -gl 1 -out one -anc /home/thorfinn/shared/hg19ancNoChr.fa.gz -r 1
-> angsd version: 0.915-16-gf8a6695-dirty (htslib: 1.3.2-129-gdaae2ea) build(Nov 24 2016 15:13:05)
-> Reading fasta: /home/thorfinn/shared/hg19ancNoChr.fa.gz
-> Parsing 33 number of samples
-> Region lookup 1/1
-> Printing at chr: 1 pos:14086716 chunknumber 300 contains 251 sites
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"one.arg"
->"one.saf.gz"
->"one.saf.pos.gz"
->"one.saf.idx"
-> Thu Dec 8 11:30:31 2016
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 14.34 sec
[ALL done] walltime used = 60.00 sec
thorfinn@rowdy:
/angsd$ ./angsd -b list -dosaf 1 -gl 1 -out two -anc /home/thorfinn/shared/hg19ancNoChr.fa.gz -r 2
-> angsd version: 0.915-16-gf8a6695-dirty (htslib: 1.3.2-129-gdaae2ea) build(Nov 24 2016 15:13:05)
-> Reading fasta: /home/thorfinn/shared/hg19ancNoChr.fa.gz
-> Parsing 33 number of samples
-> Region lookup 1/1
-> Printing at chr: 2 pos:14087882 chunknumber 300 contains 293 sites
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"two.arg"
->"two.saf.gz"
->"two.saf.pos.gz"
->"two.saf.idx"
-> Thu Dec 8 11:31:25 2016
-> Arguments and parameters for all analysis are located in .arg file
[ALL done] cpu-time used = 12.56 sec
[ALL done] walltime used = 46.00 sec
thorfinn@rowdy:/angsd$ ./misc/realSFS cat
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> outnames: '(null)' number of safs:0
thorfinn@rowdy:
/angsd$ ./misc/realSFS cat one.saf.idx two.saf.idx -outnames merged
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> Version of fname:one.saf.idx is:2
-> Assuming .saf.gz file: one.saf.gz
-> Assuming .saf.pos.gz: one.saf.pos.gz
-> Version of fname:two.saf.idx is:2
-> Assuming .saf.gz file: two.saf.gz
-> Assuming .saf.pos.gz: two.saf.pos.gz
-> outnames: 'merged' number of safs:2
-> Merging 0/2
-> Merging 1/2
thorfinn@rowdy:/angsd$ ./misc/realSFS print merged.saf.idx |wc -l
-> Version of fname:merged.saf.idx is:2
-> Assuming .saf.gz file: merged.saf.gz
-> Assuming .saf.pos.gz: merged.saf.pos.gz
-> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:merged.saf.idx fstout:(null) oldout:0 seed:1481193159 bootstrap:0 whichFst:0
-> Will jump to multisaf printer and will only print intersecting sites between populations
[printMulti]
-> dim(merged.saf.idx):67
-> Dimension of parameter space: 67
-> Done reading data from chromosome will prepare next chromosome
-> Is in multi sfs, will now read data from chr:1
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
-> Done reading data from chromosome will prepare next chromosome
-> Is in multi sfs, will now read data from chr:2
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
-> Done reading data from chromosome will prepare next chromosome
-> Run completed
196836
thorfinn@rowdy:
/angsd$ ./misc/realSFS print one.saf.idx |wc -l
-> Version of fname:one.saf.idx is:2
-> Assuming .saf.gz file: one.saf.gz
-> Assuming .saf.pos.gz: one.saf.pos.gz
-> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:one.saf.idx fstout:(null) oldout:0 seed:1481193174 bootstrap:0 whichFst:0
-> Will jump to multisaf printer and will only print intersecting sites between populations
[printMulti]
-> dim(one.saf.idx):67
-> Dimension of parameter space: 67
-> Done reading data from chromosome will prepare next chromosome
-> Is in multi sfs, will now read data from chr:1
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
-> Done reading data from chromosome will prepare next chromosome
-> Run completed
98333
thorfinn@rowdy:/angsd$ ./misc/realSFS print two.saf.idx |wc -l
-> Version of fname:two.saf.idx is:2
-> Assuming .saf.gz file: two.saf.gz
-> Assuming .saf.pos.gz: two.saf.pos.gz
-> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:two.saf.idx fstout:(null) oldout:0 seed:1481193182 bootstrap:0 whichFst:0
-> Will jump to multisaf printer and will only print intersecting sites between populations
[printMulti]
-> dim(two.saf.idx):67
-> Dimension of parameter space: 67
-> Done reading data from chromosome will prepare next chromosome
-> Is in multi sfs, will now read data from chr:2
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
-> Done reading data from chromosome will prepare next chromosome
-> Run completed
98503
thorfinn@rowdy:
/angsd$ R -q

98503+98333
[1] 196836

@claudiuskerth
Copy link
Author

Hi Thorfinn,

I can print the two input *saf.idx files without error. I have created them (and many others) with the following command:

for POP in ERY PAR; do ls ../split_noNegFis.sorted.rf/* | parallel -j 12 "angsd -bam $POP.SLIM.bamfile.list -ref ../reduced_ref.fa -anc ../reduced_ref.fa -out merge/$POP.unfolded.{/} -sites ../keep.sites -rf {} -only_proper_pairs 0 -baq 1 -minMapQ 5 -minQ 20 -minInd 10 -doCounts 1 -setMaxDepth 450 -GL 1 -doSaf 1 -doMajorMinor 1 -skipTriallelic 1"; echo "done $POP"; sleep 10; done

I am using GNU parallel because running ANGSD with multithreading makes it very slow to read in the data. My regions file contains 34,901 sorted contig ID's. So, I have split this file with GNU split to parallise over the split regions files.

I have taken the liberty to upload the two SAF index and merged file as well as companion files for you to check. It's only 6MB in total.
SAF.merge.tar.gz

@ANGSD
Copy link
Owner

ANGSD commented Dec 9, 2016

Hmm, I dont have any problems catting the files. What is your platform?
Can you run the exact commands shown below?

[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS cat PAR.unfolded.cr.saf.idx PAR.unfolded.cq.saf.idx -outnames mymerge
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> Version of fname:PAR.unfolded.cr.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cr.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cr.saf.pos.gz
-> Version of fname:PAR.unfolded.cq.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cq.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cq.saf.pos.gz
-> outnames: 'mymerge' number of safs:2
-> Merging 0/2
-> Merging 1/2
[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS print mymerge.saf.idx 2>/dev/null |wc -l
2964150
[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS print PAR.unfolded.cr.saf.idx 2>/dev/null |wc -l
1386375
[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS print PAR.unfolded.cq.saf.idx 2>/dev/null |wc -l
1577775
[fvr124@zl08368 Downloads]$ R

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

1386375+1577775
[1] 2964150

@claudiuskerth
Copy link
Author

Ok, I just pulled the newest version of ANGSD and htslib and made sure that realSFS is recompiled with the up-to-date htslib. In the following I am using that version:

~/Downloads/ANGSD/angsd/misc/realSFS cat PAR.unfolded.cr.saf.idx PAR.unfolded.cq.saf.idx -outnames mymerge
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> Version of fname:PAR.unfolded.cr.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cr.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cr.saf.pos.gz
-> Version of fname:PAR.unfolded.cq.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cq.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cq.saf.pos.gz
-> outnames: 'mymerge' number of safs:2
-> Merging 0/2
-> Merging 1/2

So, merging works without error message. Then printing the merged saf file leads to the same error message as before on my system:

~/Downloads/ANGSD/angsd/misc/realSFS print mymerge.saf.idx
-> Version of fname:mymerge.saf.idx is:2
-> Assuming .saf.gz file: mymerge.saf.gz
-> Assuming .saf.pos.gz: mymerge.saf.pos.gz
-> args: tole:0.000001 nthreads:4 maxiter:100 nsites:0 start:(null) chr:(null) start:-1 stop:-1 fname:mymerge.saf.idx fstout:(null) oldout:0 seed:1481287828 bootstrap:0 whichFst:0
-> Will jump to multisaf printer and will only print intersecting sites between populations
[printMulti]
-> dim(mymerge.saf.idx):37
-> Dimension of parameter space: 37
-> Done reading data from chromosome will prepare next chromosome
-> Is in multi sfs, will now read data from chr:Contig_618690
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Problem reading chunk in bgzf_read

I am running this on Ubuntu Linux: Linux huluvu 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux.

Let me know if you need more info.

claudius

@ANGSD
Copy link
Owner

ANGSD commented Mar 1, 2017

Hi Claudius, Im not sure if this is helpful. But we had storage problems on some of our servers, and then I also observed the bgzf_read error, with a different storage device it worked.

I redid your uploaded files, that also seemed to work.

[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS cat PAR.unfolded.cr.saf.idx PAR.unfolded.cq.saf.idx -outnames mymerge
-> This will cat together .saf files from angsd
-> regions has to be disjoint between saf files. This WONT be checked (alot) !
-> This has only been tested on safs for different chrs !
-> Version of fname:PAR.unfolded.cr.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cr.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cr.saf.pos.gz
-> Version of fname:PAR.unfolded.cq.saf.idx is:2
-> Assuming .saf.gz file: PAR.unfolded.cq.saf.gz
-> Assuming .saf.pos.gz: PAR.unfolded.cq.saf.pos.gz
-> outnames: 'mymerge' number of safs:2
-> Merging 0/2
-> Merging 1/2
[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS print mymerge.saf.idx 2>&1 1>/dev/null|tail
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
-> Done reading data from chromosome will prepare next chromosome
-> Is in multi sfs, will now read data from chr:Contig_772867
-> hello Im the master merge part of realSFS. and I'll now do a tripple bypass to find intersect
-> 1) Will set iter according to chooseChr and start and stop, and possibly using -sites
-> Only read nSites: 0 will therefore prepare next chromosome (or exit)
-> Done reading data from chromosome will prepare next chromosome
-> Run completed
[fvr124@zl08368 Downloads]$ ../angsd/misc/realSFS mymerge.saf.idx 2>/dev/null| tail
38565.263071 213.801777 226.889112 82.205709 0.927908 34.948713 56.955107 0.000000 0.620625 69.896858 20.849552 4.000924 21.680033 1.636306 0.000125 0.000279 40.276621 2.594707 0.000000 0.000015 0.000002 0.001715 35.259449 0.000000 0.000000 0.000005 0.000004 0.000514 30.123424 4.230963 20.734794 0.000000 0.000274 18.564696 5.999078 27.121379 37.416262
[fvr124@zl08368 Downloads]$

@claudiuskerth
Copy link
Author

Interesting. However, I had no luck so far. I moved my split SAF files onto another hard drive on the same system, but got the same bgzf error. I also tried installing ANGSD on my Mac, but I couldn't make the compilation work. I used the commit 08e032a (24th Feb) and two different gcc versions (4.2 and 6.2), each gave different error messages. I did manage to install ANGSD (commit 08e032a) on a 32 bit Ubuntu system, but there already realSFS cat stops with an error on the example files above. I am always using the newest git version of htslib for compilation, i. e. not system-wide htslib.

Let me know if exact commands, error messages and system details would be of any help to you.

thank you for your support!

claudius

@ANGSD
Copy link
Owner

ANGSD commented Jun 19, 2017

Hi again Claudius. Did you resolve this problem?

@claudiuskerth
Copy link
Author

Yes, this is working now with the new version of realSFS.

many thanks Thorfinn!

claudius

@ANGSD
Copy link
Owner

ANGSD commented Jun 22, 2017

Great, im closing this issue then.

@ANGSD ANGSD closed this as completed Jun 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants