Here I will filter a VCF for loci that fit a certain criterion.

That criterion is that they must be heterozygous in all F1 generation samples from his mating experiments. This should include the parthenogenesis locus. 

Note that this is analogous to choosing loci that are fixed for different alleles in the male and the female that started the breeding line, but filtering on the hybrids should be less prone to incorrect genotype calls. 
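
The equivalence can be checked with a small sketch. It assumes a biallelic locus and simple Mendelian inheritance (one allele from each parent). The genotype tuples and the helper names `offspring_genotypes` and `is_het` are illustrative and not part of the analysis.

```python
from itertools import product

def offspring_genotypes(mother, father):
    """Return the set of possible offspring genotypes (sorted allele tuples),
    assuming one allele is inherited from each parent."""
    return {tuple(sorted(pair)) for pair in product(mother, father)}

def is_het(genotype):
    """A genotype is heterozygous if its two alleles differ."""
    return genotype[0] != genotype[1]

# Parents fixed for different alleles: every possible F1 is heterozygous.
fixed = offspring_genotypes((0, 0), (1, 1))
print(fixed, all(is_het(g) for g in fixed))

# One heterozygous parent: homozygous F1s are possible, so such a locus
# is not expected to be heterozygous in every F1.
mixed = offspring_genotypes((0, 1), (1, 1))
print(mixed, all(is_het(g) for g in mixed))
```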

Here are the samples in this dataset.

CZ515_S f1  
CZ513_S f1  
CZ512_S f1  
CZ318_S f1  
CZ316_S f1  
CZ315_S f1  
CZ314_S f1  
PM658_Pmale     pmale  
PF3_S   female  
PF5_S   female  


### So . . . I will find loci that are heterozygous in (nearly) all of the F1s.

In [30]:
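## Function to check if the F1s meet the filtering criteria.

def locus_filter(record, threshold, sample_list):
    """Return record.ID if the proportion of called F1s that are
    heterozygous is >= threshold, otherwise return None."""

    N_F1s_called = 0
    N_het = 0

    for sample in record.samples:
        if sample.sample in sample_list:
            if sample.called:
                N_F1s_called += 1
                if sample.is_het:
                    N_het += 1

    ## No F1 genotype was called at this locus, so it cannot pass
    ## (this also avoids dividing by zero).
    if N_F1s_called == 0:
        return None

    ## float() forces true division even if "from __future__ import division"
    ## was not active when this cell was run (Python 2).
    if float(N_het) / N_F1s_called >= threshold:
        return record.ID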
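
To see how the filter behaves without loading a VCF, here is a minimal sketch using stand-in objects. The `Call` and `Record` namedtuples are assumptions that expose only the attributes the filter reads (`sample`, `called`, `is_het`, `ID`, `samples`). `het_proportion_filter` is a hypothetical helper that mirrors the logic of `locus_filter`, with an added guard for loci where no F1 was called. The locus IDs and genotype states are made up.

```python
from collections import namedtuple

# Stand-ins exposing only the attributes the filter reads.
Call = namedtuple("Call", ["sample", "called", "is_het"])
Record = namedtuple("Record", ["ID", "samples"])

def het_proportion_filter(record, threshold, sample_list):
    """Return record.ID if enough of the called F1s are heterozygous."""
    calls = [c for c in record.samples if c.sample in sample_list and c.called]
    if not calls:  # no called F1s, nothing to evaluate
        return None
    n_het = sum(1 for c in calls if c.is_het)
    if n_het / float(len(calls)) >= threshold:
        return record.ID
    return None

f1s = ["CZ515_S", "CZ513_S", "CZ512_S", "CZ318_S", "CZ316_S"]

# Made-up locus "101": 4 of 5 called F1s are heterozygous.
# The male parent is not in the F1 list, so his call is ignored.
rec_pass = Record("101", [Call(s, True, True) for s in f1s[:4]]
                  + [Call(f1s[4], True, False),
                     Call("PM658_Pmale", True, False)])

# Made-up locus "102": 2 of 4 called F1s are heterozygous, one F1 is uncalled.
rec_fail = Record("102", [Call(f1s[0], True, True),
                          Call(f1s[1], True, True),
                          Call(f1s[2], True, False),
                          Call(f1s[3], True, False),
                          Call(f1s[4], False, False)])

result_pass = het_proportion_filter(rec_pass, 0.8, f1s)
result_fail = het_proportion_filter(rec_fail, 0.8, f1s)
print(result_pass, result_fail)
```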


In [48]:
from __future__ import division
import vcf

vcf_path = "/home/djeffrie/Data/RADseq/CASPER/parents_and_f1.vcf"
popmap_path = "/home/djeffrie/Data/RADseq/CASPER/popmap_Dan.txt"

## Collect the F1 sample names. Each popmap line is "<sample> <group>",
## so match on the group column rather than anywhere in the line.
F1s = []

with open(popmap_path, 'r') as popmap:
    for line in popmap:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "f1":
            F1s.append(fields[0])

myVCF = vcf.Reader(open(vcf_path, 'r'))

filtered_loci = []
het_threshold = 0.8 ## minimum proportion of called F1s that must be heterozygous at a locus.
N_loci = 0

for record in myVCF:
    N_loci += 1 ## counts VCF records

    loc_ID = locus_filter(record, het_threshold, F1s)

    ## Keep each ID only once, in case a locus ID appears on more than one record.
    if loc_ID and loc_ID not in filtered_loci:
        filtered_loci.append(loc_ID)

print("%s loci out of %s passed the filter" % (len(filtered_loci), N_loci))

572 loci out of 1241 passed the filter
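
Because the filter divides by the number of F1s that were called (`N_het / N_F1s_called`), the number of heterozygous F1s needed to reach the 0.8 threshold depends on how much data is missing at a locus. A small sketch of that arithmetic follows. `min_hets_needed` is a hypothetical helper, and the range of 1 to 7 called samples reflects the seven F1s listed above.

```python
def min_hets_needed(n_called, threshold):
    """Smallest number of heterozygous F1s for which
    n_het / n_called >= threshold, using the same comparison as the filter."""
    for n_het in range(n_called + 1):
        if n_het / float(n_called) >= threshold:
            return n_het
    return None  # only reached if threshold > 1

# Minimum heterozygous calls required for each possible number of called F1s.
needed = {n: min_hets_needed(n, 0.8) for n in range(1, 8)}
for n_called in sorted(needed):
    print(n_called, needed[n_called])
```

With all seven F1s called, one non-heterozygous call is tolerated. With four or fewer called, every called F1 must be heterozygous, and a locus with a single called, heterozygous F1 also passes.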


We now have a list of the loci that passed the heterozygosity filter. Export this list so that it can be used as a whitelist in Stacks. 

In [49]:
with open("/home/djeffrie/Data/RADseq/CASPER/Whitelist_0.8.txt", 'w') as outfile:
    for locus_id in filtered_loci:
        outfile.write("%s\n" % locus_id)
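
A quick sanity check before handing the file to Stacks is to read it back and confirm that it holds one unique locus ID per line. The sketch below does this on a temporary file with made-up IDs. The helper names `write_whitelist` and `read_whitelist` and the file name are hypothetical.

```python
import os
import tempfile

def write_whitelist(locus_ids, path):
    """Write one locus ID per line."""
    with open(path, "w") as outfile:
        for locus_id in locus_ids:
            outfile.write("%s\n" % locus_id)

def read_whitelist(path):
    """Read the IDs back, skipping blank lines."""
    with open(path) as infile:
        return [line.strip() for line in infile if line.strip()]

example_ids = ["12", "57", "103"]  # made-up locus IDs

tmp_path = os.path.join(tempfile.mkdtemp(), "Whitelist_example.txt")
write_whitelist(example_ids, tmp_path)
loaded = read_whitelist(tmp_path)

# The file should round-trip unchanged and contain no duplicate IDs.
print(loaded == example_ids, len(set(loaded)) == len(loaded))
```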