## Bootstrapping sex linked marker sets

In [1]:
%matplotlib inline
import MISC_RAD_tools as MISC
import SLMF_lightweight as SLMF_L

In this notebook I first compile the final Stacks outputs for each species dataset (or subset) and then, for each one, randomise the male and female assignments across the samples. For each of 100 random male/female assignments I then run the sex linked marker finding analyses, using exactly the parameters used to identify the final set of sex linked markers in the paper. The only difference is the male and female assignments.

This randomisation will give an idea of the false positive rate in the dataset. For example, a skew towards more males or females in the dataset may make false positives of one type or another more likely. Also, if there are several populations in the data and males and females are not distributed evenly among them, then population structure could look like sex linkage. Randomising male and female assignments across all samples will allow us to account for this.
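
A minimal sketch of what one randomisation does, assuming the sex labels are held in a simple list (the `M`/`F` labels and sample numbers here are invented for illustration, and `randomise_sexes` is a hypothetical helper). Shuffling moves labels between samples but leaves the number of males and females unchanged, so any skew in the sex ratio is carried into the randomised datasets.

```python
import random
from collections import Counter

def randomise_sexes(sexes, seed=None):
    """Return a shuffled copy of the sex labels, leaving the input list untouched."""
    rng = random.Random(seed)
    shuffled = list(sexes)
    rng.shuffle(shuffled)
    return shuffled

## hypothetical dataset with a male-biased sex ratio
true_sexes = ["M"] * 12 + ["F"] * 6
rand_sexes = randomise_sexes(true_sexes, seed=1)

## labels move between samples, but the overall sex ratio is preserved
assert Counter(rand_sexes) == Counter(true_sexes)
```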

Because it would take prohibitively long, I will not do 1000 randomisations for each species, although I would prefer to. Instead I will do 100, which should still give a reasonably good estimate of the false positive rate.

With regards to how these randomisations will be used to judge the validity of the dataset, I will look for sample sets where the number of sex-linked markers found using the correct male/female assignments is above the 95th percentile of the distribution of sex-linked markers found in the randomisations. 
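
As a rough sketch of that comparison, assuming the marker counts from the randomisations are held in a list of integers. The numbers below are invented, and `percentile` and `exceeds_null` are hypothetical helpers (using the nearest-rank definition of a percentile), not functions from the SLMF modules.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q: the smallest value with
    at least q percent of the data at or below it."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  ## integer ceiling of q% of n
    return ordered[max(rank, 1) - 1]

def exceeds_null(observed, null_counts, q=95):
    """True if the observed count is above the q-th percentile of the randomised counts."""
    return observed > percentile(null_counts, q)

## invented example: marker counts from 20 randomisations
null_counts = [0, 1, 0, 2, 3, 1, 0, 0, 4, 2, 1, 0, 5, 1, 2, 0, 3, 1, 0, 2]
threshold = percentile(null_counts, 95)
```

With these invented counts the threshold is 4, so a dataset would need more than 4 markers under the true sex assignments to pass.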

In combination with the genome mapping, this should help validate the sex-linked marker sets found.



#### First, make a list of all of the parameter dictionaries, which contain the paths and parameters used for finding sex linked markers


In [4]:
Parameter_dictionaries = []

## Wasps

In [10]:
Parameter_dict = {}
Parameter_dict["Name"] = "wasps"

##### Data ########################

Parameter_dict['Catalog'] =  "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict['VCF'] =  "/home/djeffrie/Data/Caspers_data/CSD/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict['Pop_map'] = "/home/djeffrie/Data/Caspers_data/CSD/Sex_ID_info_heterozygosity.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the freq calculation to find sex linked snps, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5) 
Parameter_dict['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict['coverage_threshold1'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict['homogamtic_homozygosity_threshold'] = 0.9 ## (Default = 0.9) The minimum proportion of the homogametic sex that must be homozygous at a locus for it to be considered sex linked
Parameter_dict['heterogamtic_heterozygosity_threshold'] = 0.5 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict['coverage_threshold2'] = 4 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict)
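
To make the frequency threshold concrete, the window it implies can be sketched as follows. This only illustrates the arithmetic described in the `X_or_Z_freq_threshold` comment; `freq_window` and `in_window` are hypothetical helpers, not part of `SLMF_lightweight`, and the small tolerance is an assumption added to absorb floating point rounding.

```python
def freq_window(lower_threshold):
    """Return (lower, upper), with upper as far above 0.5 as lower is below it."""
    upper_threshold = 0.5 + (0.5 - lower_threshold)
    return lower_threshold, upper_threshold

def in_window(freq_diff, lower_threshold=0.4, tol=1e-9):
    """True if the frequency difference lies inside the window (inclusive)."""
    lower, upper = freq_window(lower_threshold)
    return lower - tol <= freq_diff <= upper + tol

## the value used for this dataset gives a window of roughly 0.4 to 0.6
lower, upper = freq_window(0.4)
```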


In [11]:
## Define a function to help parallelise the analyses


def Super_SLM_finder_parallel(popmap, Parameter_dict):
    """Run the sex linked marker finder for one randomised popmap.

    Expects a copy of the VCF next to the popmap, with the same name but a .vcf extension.
    """
    import os
    
    params = dict(Parameter_dict) ## work on a copy so the caller's dictionary is not modified
    params['Pop_map'] = popmap
    params['VCF'] = "%s.vcf" % popmap.rpartition(".")[0]
    results_dict = {}
    results_dict["XYset"], results_dict["ZWset"], results_dict["Detailed"] = SLMF_L.Super_SLM_finder(params, "111", verbose = False, write_files=False, plot=False)
    
    os.remove(params['VCF']) ## remove VCFs as they are used
    os.remove("%s.altered" % params['VCF']) ## and the header-altered copy

    return results_dict
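
The function relies on a naming convention: each randomised popmap has its own copy of the VCF with the same file stem. A small sketch of how `rpartition` derives that path (`vcf_path_for` is a hypothetical helper and the directory is a made-up example):

```python
def vcf_path_for(popmap_path):
    """Swap the popmap's final extension for .vcf, as done in Super_SLM_finder_parallel."""
    stem = popmap_path.rpartition(".")[0]
    return "%s.vcf" % stem

example_popmap = "/tmp/Randomisations/rand_popmap_7.txt"  ## hypothetical path
example_vcf = vcf_path_for(example_popmap)
## -> "/tmp/Randomisations/rand_popmap_7.vcf"
```

Note that a path containing no `.` gives an empty stem, so the popmap files need an extension for this to work.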
    
    

In [12]:
from joblib import Parallel, delayed
import multiprocessing
import os
import shutil
from random import shuffle

results_dict = {}

for dataset in Parameter_dictionaries:
    print "processing dataset in", dataset["VCF"]
    # 1. Make a new directory next to the VCF file.
    Randomisation_dir = "%s/Randomisations" % dataset["VCF"].rpartition("/")[0]
    if not os.path.exists(Randomisation_dir):
        os.makedirs(Randomisation_dir)
    
    # 2. Copy sex info and VCF into that folder
    
    shutil.copyfile(dataset["Pop_map"], "%s/Sex_ID_info.txt" % Randomisation_dir)  ## sex info
    shutil.copyfile(dataset["VCF"], "%s/batch_1.vcf" % Randomisation_dir)  ## VCF
    
    orig_vcf = "%s/batch_1.vcf" % Randomisation_dir
    
    print "\nRandomisations happening in %s" % Randomisation_dir
    
    ## 3. Make the randomised sex info files

    orig_popmap_path = "%s/Sex_ID_info.txt" % Randomisation_dir
    orig_popmap = open(orig_popmap_path, 'r').readlines()

    sexes = []
    samples = []
    IDs = []
    randomisations = []

    for line in orig_popmap:
        sexes.append(line.strip().split()[1])
        samples.append(line.strip().split()[0])
        IDs.append(line.strip().split()[2])

    popmaps = []
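    for i in range(100):
        popmap_path = "%s/rand_popmap_%s.txt" % (Randomisation_dir, i)
        popmaps.append(popmap_path)
        rand_popmap = open(popmap_path, 'w')
        shuffle(sexes)
        randomisations.append(list(sexes)) ## store a copy: shuffle() works in place, so appending sexes itself would store the same list 100 times
        
        for j in range(len(samples)): ## j rather than i, to avoid reusing the outer loop variable
            rand_popmap.write("%s\t%s\t%s\n" % (samples[j], sexes[j], IDs[j]))

        rand_popmap.close()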
    
    print "\nRandom sex info files made"
    
    ## make new VCFs for parallelised analyses
    
    for i in popmaps:
        new_vcf = "%s.vcf" % i.rpartition(".")[0]
        shutil.copyfile(orig_vcf, new_vcf)
        
    ## 4. Run the randomisations
    
    print "\nRunning randomisations\n"
    
    results_dict[dataset["Name"]] = Parallel(n_jobs=4, verbose = 1)(delayed(Super_SLM_finder_parallel)(i, dataset) for i in popmaps)
    
    ## 5. Output the results for each species after the species is complete (i.e. checkpoints)
    
    outfile = open("%s/Randomisations_%s.txt" % (Randomisation_dir, dataset["Name"]), 'w')
    
    for Randomisation in results_dict[dataset["Name"]]:

        XYfreq = len(Randomisation["Detailed"]["XY"]["freq"])
        XYhet = len(Randomisation["Detailed"]["XY"]["het"])
        Ytags = len(Randomisation["Detailed"]["XY"]["Ytags"])
    
        ZWfreq = len(Randomisation["Detailed"]["ZW"]["freq"])
        ZWhet = len(Randomisation["Detailed"]["ZW"]["het"])
        Wtags = len(Randomisation["Detailed"]["ZW"]["Wtags"])
                
        line = "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (dataset["Name"],XYfreq,XYhet,Ytags,ZWfreq,ZWhet,Wtags)
    
        outfile.write(line)
        
    outfile.close()
    
    print "Results outputted to %s/Randomisations_%s.txt" % (Randomisation_dir, dataset["Name"])
    
    
    

processing dataset in /home/djeffrie/Data/Caspers_data/CSD/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/Caspers_data/CSD/Randomisations

Random sex info files made

Running randomisations



[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 11.7min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 27.5min finished


Results outputted to /home/djeffrie/Data/Caspers_data/CSD/Randomisations/Randomisations_wasps.txt
processing dataset in /home/djeffrie/Data/Caspers_data/CSD/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/Caspers_data/CSD/Randomisations

Random sex info files made

Running randomisations
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 12.0min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 27.3min finished


Results outputted to /home/djeffrie/Data/Caspers_data/CSD/Randomisations/Randomisations_wasps.txt
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final 

## So all went well, randomisations completed in less than 24 hours with 100 randomisations per dataset. 

### Now to plot

In [109]:
results_dict1 ## contains first half
results_dict ## contains 2nd half.

In [152]:
randomisation_files = []
for root, dirs, files in os.walk("/home/djeffrie/Data/RADseq/Randomisations"):
    for fil in files:
        if fil.startswith("Randomisation"):
            randomisation_files.append("%s/%s" % (root, fil))

In [170]:
randomisation_filepath = "/home/djeffrie/Data/RADseq/Randomisations/Randomisations_Ryav.txt"

randomisations = open(randomisation_filepath, 'r').readlines()

XYfreqs = []
XYhets = []
Ytags = []

ZW_freqs = []
ZW_hets = []
W_tagss = []

XYvsZW_freqs = []
XYvsZW_hets = []
XYvsZW_tagss = []

for line in randomisations:
    fields = line.split()
    species = fields[0]
    
    ## the six counts, in the order they were written to the results file
    XYfreq, XYhet, Y_tags, ZW_freq, ZW_het, W_tags = [int(x) for x in fields[1:7]]

    XYfreqs.append(XYfreq)
    XYhets.append(XYhet)
    Ytags.append(Y_tags)
    ZW_freqs.append(ZW_freq)
    ZW_hets.append(ZW_het)
    W_tagss.append(W_tags)
    
    ## difference between the XY and ZW counts for each approach
    XYvsZW_freqs.append(XYfreq - ZW_freq)
    XYvsZW_hets.append(XYhet - ZW_het)
    XYvsZW_tagss.append(Y_tags - W_tags)
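
The same parsing can be written more compactly by naming the columns once. This is a sketch only: `parse_randomisation_lines` is a hypothetical helper, the column names simply mirror the order in which the counts were written to the results file, and the two example lines are invented.

```python
COLUMNS = ["XYfreq", "XYhet", "Ytags", "ZWfreq", "ZWhet", "Wtags"]

def parse_randomisation_lines(lines):
    """Return {column name: list of int counts}, one entry per randomisation."""
    table = dict((name, []) for name in COLUMNS)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  ## skip blank lines
        ## fields[0] is the dataset name; the six counts follow
        for name, value in zip(COLUMNS, fields[1:7]):
            table[name].append(int(value))
    return table

## invented example lines in the same tab-separated layout
example_lines = ["wasps\t3\t0\t1\t2\t0\t0\n", "wasps\t1\t1\t0\t4\t0\t2\n"]
table = parse_randomisation_lines(example_lines)

## XY minus ZW count for the frequency approach, per randomisation
xy_minus_zw_freq = [xy - zw for xy, zw in zip(table["XYfreq"], table["ZWfreq"])]
```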
    

In [190]:
help(plt.vlines)

Help on function vlines in module matplotlib.pyplot:

vlines(x, ymin, ymax, colors=u'k', linestyles=u'solid', label=u'', hold=None, **kwargs)
    Plot vertical lines.
    
    Plot vertical lines at each `x` from `ymin` to `ymax`.
    
    Parameters
    ----------
    x : scalar or 1D array_like
        x-indexes where to plot the lines.
    
    ymin, ymax : scalar or 1D array_like
        Respective beginning and end of each line. If scalars are
        provided, all lines will have same length.
    
    colors : array_like of colors, optional, default: 'k'
    
    linestyles : ['solid' | 'dashed' | 'dashdot' | 'dotted'], optional
    
    label : string, optional, default: ''
    
    Returns
    -------
    lines : `~matplotlib.collections.LineCollection`
    
    Other parameters
    ----------------
    kwargs : `~matplotlib.collections.LineCollection` properties.
    
    See also
    --------
    hlines : horizontal lines
    
    Examples
    ---------
    .. plot:: mpl_exam

In [207]:
from matplotlib import pyplot as plt

fig = plt.figure(figsize = (20,10))

## add_subplot takes (nrows, ncols, index): three stacked panels, one per approach.
## Each panel has a red vertical line at x = 100.

fig.add_subplot(3,1,1)
counts, bins, bars = plt.hist(XYfreqs, bins= 20, edgecolor = "lightblue", color = "royalblue")
plt.xlim((0,200))
plt.vlines(100, 0, max(counts)*0.75, color = "red")

fig.add_subplot(3,1,2)
counts, bins, bars = plt.hist(XYhets, bins= 20, edgecolor = "lightblue", color = "royalblue")
plt.xlim((0,200))
plt.vlines(100, 0, max(counts)*0.75, color = "red")

fig.add_subplot(3,1,3)
counts, bins, bars = plt.hist(Ytags, bins= 20, edgecolor = "lightblue", color = "royalblue")
plt.xlim((0,200))
plt.vlines(100, 0, max(counts)*0.75, color = "red")

plt.show() ## show once, after all three panels are drawn

In [195]:
print counts

[  8.   7.  18.  14.  20.  12.   5.   4.   4.   3.   1.   1.   0.   0.   0.
   0.   1.   1.   0.   1.]
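
These counts can be checked against the number of randomisations, and accumulated to see where the bulk of the distribution lies. A short sketch using the values printed above (the bin edges were not printed, so only bin indices are used):

```python
counts = [8, 7, 18, 14, 20, 12, 5, 4, 4, 3, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]

total = sum(counts)  ## one histogram entry per randomisation

## running total of randomisations up to and including each bin
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running)

## index (0-based) of the first bin by which at least 95% of randomisations are covered
bin_95 = next(i for i, c in enumerate(cumulative) if 100 * c >= 95 * total)
```

Here `total` is 100, matching the 100 randomisations, and `bin_95` is 9, i.e. 95 of the 100 randomisations fall within the first 10 of the 20 bins.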
