Last revision 11/14/16
New version (July 2016); extensive rewrite and bugfix to cpulldown (old program somtimes pulled down bases that should have been masked) (Nov 2016) Better handling of short maskk files and some bugfixes to cpoly
This document is the instruction of using the cTools to access the light version (compressed) Simons genomic diversity project (SGDP lite) files. The cTools include:
Tools for handling samples:
##Compile cTools In the src folder:
##Run cTools Samtools may be needed (accessable by command "Samtools") to run cTools properly.
Please see how to run cascertain, cpulldown and cpoly in README.analysis.md; see how to run uncompress.pl, ccompress and cmakefilter in README.prepare.md.
Related file formats
1. .ind file
#FullyPublicGeneral Altai F Altai Denisova F Denisovan Loschbour M WHG Stuttgart F LBK
- Every thing after '#' is comment.
- 1st column: sample name
- 2nd column: gender
- 3rd column: population
2. .snp file
X:21_15838960 21 0.158390 15838960 C T X:21_16081858 21 0.160819 16081858 T C X:21_16495307 21 0.164953 16495307 G A ...
- 1st column: name of the SNP site (if the SNP site is in DBSNP: DBSNP name; else: arbitrary name)
- 2nd column: chromosome name
- 3rd column: genetic location
- 4th column: coordinate
- 5th column: base allele. In the output of cascertain (with polarize to Href set), this is the Href allele.
- 6th column: alternative allele. In the output of cascertain, this is the forward strand allele.
- For cpulldown and cpoly, the column 5 and 6 of the output .snp file are the same as that of the input. They can be any user chosen alleles.
3. .geno file
00002 00002 22220 ...
- .geno file is used together with .ind file and .snp file.
- The count of columns in .geno file is the count of individuals in .ind file. From left to right, each column of .geno corresponds to one individual (from top to bottom) in .ind file.
- The count of lines in .geno file is the count of SNP sites in .snp file. Each line of .geno corresponds to one line in .snp file.
- Each number in .geno denotes the count of base alleles (the allele of the 5th column of .snp file) of the corresponding individual at the corresponding SNP site.
- The number is the count of the allele in the 5th column of the .snp file. For example, if the .snp file is example.snp, for the 1st snp locus, '0' means homozygous T, '2' means homozygous C.
Href - /home/mz128/cteam/usr/data/Href.fa Chimp - /home/mz128/cteam/usr/data/Chimp.fa S_Albanian-1 - /home/mz128/cteam/usr/data/S_Albanian-1.fa S_Dai-1 - /home/mz128/cteam/usr/data/S_Dai-1.fa S_Yoruba-1 - /home/mz128/cteam/usr/data/S_Yoruba-1.fa
- 1st column: sample name
- 2nd column: -
- 3rd column: location of the hetfa files (/absolute_path/sample.fa)
Href - NULL Chimp - NULL S_Albanian-1 - /home/mz128/cteam/usr/data/S_Albanian-1.filter.fa S_Dai-1 - /home/mz128/cteam/usr/data/S_Dai-1.filter.fa S_Yoruba-1 - /home/mz128/cteam/usr/data/S_Yoruba-1.filter.fa
- The 1st and 2nd columns are the same as the .dblist file for hetfa files.
- For chimp and human references - 3rd column: NULL
- For samples - 3rd column: location of the mask files (/absolute_path/sample.filter.fa)
For questions please contact Nick Patterson (firstname.lastname@example.org).