### Here I want to calculate a P-value for the sex-linked markers identified with the three methods described in Brelsford & Lavanchy et al. (2016).


The logic is as follows:

I want to know how likely I am to find a given number of sex-linked markers in a dataset by chance. As many of our datasets aren't ideal, false positives and noise are always present to a certain extent. Furthermore, the presence of multiple Y haplotypes means that there is often natural variation in the sex linkage of a given locus, even within the same population.

So I propose a randomisation approach here. The first step is to identify sex-linked (SL) markers using the correct male and female assignments. The next step is to randomise the male and female assignments across samples 1000 times (this will probably take a while). This gives a null distribution of the number of SL markers found in each randomised test. I can then compare the number of SL markers found with the correct sex assignments against this distribution and identify its 95th percentile, above which there is less than a 0.05 probability of finding that number of SL markers by chance.
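The cut-off described above can be sketched in a few lines. This is only a minimal sketch, not code from the analysis itself: `upper_threshold` and the toy `null_counts` are hypothetical, and the simple rank-based rule used to pick the cut-off is an assumption.

```python
import math

def upper_threshold(null_counts, alpha=0.05):
    """Return the count that at most a fraction `alpha` of randomised runs exceed."""
    ordered = sorted(null_counts)
    n_tail = int(math.floor(alpha * len(ordered)))  # runs allowed in the upper tail
    return ordered[len(ordered) - n_tail - 1]

# Toy null distribution (hypothetical): 20 randomisations finding 0..19 markers
null_counts = list(range(20))
threshold = upper_threshold(null_counts)

# An observed count is unusual at the 0.05 level only if it lies above the threshold
observed = 25
is_unusual = observed > threshold
```

With fewer than 20 randomisations the upper tail holds no runs at all, so the function simply returns the largest randomised count. That is one reason to run many randomisations.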

#### First step:

Create 1000 random sex-assignment combinations. I am going to do this by writing them to file: (1) so that I have a record of the combinations, and (2) so I don't have to change my functions!


In [1]:
from __future__ import division
import os
from random import shuffle
import MISC_RAD_tools as MISC
import time

In [6]:
orig_popmap_path = "/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations/Sex_ID_info_Y_haplogroup_1.txt"
orig_popmap = open(orig_popmap_path, 'r').readlines()
p_val_dir = orig_popmap_path.rpartition("/")[0] ## working in dir with orig popmap in. So remember to set up a new directory!

sexes = []
samples = []
IDs = []
randomisations = []

for line in orig_popmap:
    sexes.append(line.strip().split()[1])
    samples.append(line.strip().split()[0])
    IDs.append(line.strip().split()[2])


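for i in range(100): ## the text above says 1000; increase for the full run
    rand_popmap = open("%s/rand_popmap_%s.txt" % (p_val_dir, i), 'w')
    shuffle(sexes)
    randomisations.append(list(sexes)) ## append a copy, otherwise every entry points to the same (re-shuffled) list
    for j in range(len(samples)): ## j, so the outer loop counter i is not overwritten
        rand_popmap.write("%s\t%s\t%s\n" % (samples[j], sexes[j], IDs[j]))
    
    rand_popmap.close()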
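The same step can be written as a reusable function. This is a hedged sketch rather than the code used here: `write_random_popmaps`, the `seed` argument and the toy sample names are all hypothetical. It assumes the three-column, tab-separated popmap layout used above, and runs under Python 2 or 3.

```python
import os
import random
import tempfile

def write_random_popmaps(samples, sexes, ids, n_rand, out_dir, seed=None):
    """Write n_rand popmaps with shuffled sex labels; return the shuffled label lists."""
    rng = random.Random(seed)  # a seeded generator makes the randomisations repeatable
    randomisations = []
    for i in range(n_rand):
        shuffled = list(sexes)  # copy, so the real assignments are left untouched
        rng.shuffle(shuffled)
        randomisations.append(shuffled)
        path = os.path.join(out_dir, "rand_popmap_%s.txt" % i)
        with open(path, "w") as handle:
            for sample, sex, ident in zip(samples, shuffled, ids):
                handle.write("%s\t%s\t%s\n" % (sample, sex, ident))
    return randomisations

# Toy data (hypothetical sample names), written to a temporary directory
demo_dir = tempfile.mkdtemp()
demo = write_random_popmaps(["s1", "s2", "s3", "s4"], ["M", "M", "F", "F"],
                            ["1", "1", "1", "1"], 3, demo_dir, seed=1)
```

Shuffling a copy keeps the number of males and females the same in every randomisation while leaving the original list in its true order.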

Here's the parameter dictionary for the analyses.

#### NOTE: The randomisations MUST be run with the same parameters as the real analyses

In [12]:
print "/home/djeffrie/Data/RADseq/Rarvalis_NEW//Stacks/Sex_ID_info_Y_haplogroup_1.txt".rpartition("/")[2].rpartition(".")[0]

Sex_ID_info_Y_haplogroup_1


In [3]:
Parameter_dict = {}

##### Data ########################

Parameter_dict['Catalog'] =  "/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict['VCF'] =  "/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Populations_Y_haplogroup_1/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict['Pop_map'] = "/home/djeffrie/Data/RADseq/Rarvalis_NEW//Stacks/Sex_ID_info_Y_haplogroup_1.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the freq calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5) 
Parameter_dict['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict['homogamtic_homozygosity_threshold'] = 0.9 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.
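To make the frequency window in the `X_or_Z_freq_threshold` comment concrete, here is a small worked sketch. It is my reading of that comment only, not the code inside `Super_SLM_finder`: `freq_window` and `passes_freq_filter` are hypothetical helper names, and f(F) and f(M) are taken to be the frequency of the same allele in females and males.

```python
def freq_window(lower=0.4):
    """Upper bound sits as far above 0.5 as the lower bound sits below it."""
    upper = 0.5 + (0.5 - lower)
    return lower, upper

def passes_freq_filter(freq_female, freq_male, lower=0.4):
    """True if f(F) - f(M) falls inside the window implied by `lower`."""
    low, high = freq_window(lower)
    diff = freq_female - freq_male
    return low <= diff <= high

# XY example: allele fixed in females (1.0), all males heterozygous (0.5)
ideal_xy = passes_freq_filter(1.0, 0.5)   # difference of 0.5, inside the window
# Autosomal example: same frequency in both sexes
autosomal = passes_freq_filter(0.5, 0.5)  # difference of 0.0, outside the window
```

A fully sex-linked SNP in an XY system gives a difference of exactly 0.5, so a lower threshold of 0.4 tolerates some genotyping error or incomplete linkage on either side of that value.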


In [9]:
XY_numbs = []
ZW_numbs = []

for root,dirs,files in os.walk("/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations/"):
    for fil in files[:10]: ## remove slice after tests
        if fil.startswith("rand"): ## only run on the randomised popmaps
            Parameter_dict['Pop_map'] = "%s/%s" % (root, fil)
            print fil, "started: ", time.strftime("%H:%M:%S")
            
            XYset, ZWset = MISC.Super_SLM_finder(Parameter_dict, "010")
            ### Capture the number of SL markers identified for this randomisation. 
            
            XY_numbs.append(len(XYset))
            ZW_numbs.append(len(ZWset))
            
            print fil, "finished: ", time.strftime("%H:%M:%S")
        

rand_popmap_375.txt started:  11:13:43

##### Using SNP heterozygosity approach #####
 
Number of loci = 58288
Number of samples = 40
Number of loci with too few samples = 0
Number of loci with low MAF = 0
Number of loci with enough data = 58287
Number of putative X linked snps = 1
Number of putative X linked tags = 1
Number of putative Z linked markers = 2
Number of putative Z linked tags = 2

 ### DONE! ### 

Sex linked tags outputted to fastas 'Putative_XYlinked_makers.fa' and Putative_ZWlinked_makers.fa
in the directory /home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks

 ## After merging tags accross methods ## 

Final number of XY tags = 1
Final number of ZW tags = 2
rand_popmap_375.txt finished:  11:15:06
rand_popmap_107.txt started:  11:15:06

##### Using SNP heterozygosity approach #####
 
Number of loci = 58288
Number of samples = 40
Number of loci with too few samples = 0
Number of loci with low MAF = 0
Number of loci with enough data = 58287
Number of putative X linked snps = 1

In [4]:
import SLMF_lightweight as SLMF_L

XY_numbs = []
ZW_numbs = []

for root,dirs,files in os.walk("/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations/"):
    for fil in files[:10]: ## remove slice after tests
        if fil.startswith("rand"): ## only run on the randomised popmaps
            Parameter_dict['Pop_map'] = "%s/%s" % (root, fil)
            print fil, "started: ", time.strftime("%H:%M:%S")
            
            XYset, ZWset = SLMF_L.Super_SLM_finder(Parameter_dict, "111", verbose = False, write_files=False, plot=False)
            ### Capture the number of SL markers identified for this randomisation. 
            
            XY_numbs.append(len(XYset))
            ZW_numbs.append(len(ZWset))
        
    print fil, "finished: ", time.strftime("%H:%M:%S") ## only printed once all files in the directory are done
        

rand_popmap_375.txt started:  13:04:18
Final number of XY tags = 0
Final number of ZW tags = 0
rand_popmap_107.txt started:  13:06:33
Final number of XY tags = 0
Final number of ZW tags = 0
rand_popmap_837.txt started:  13:08:46
Final number of XY tags = 0
Final number of ZW tags = 2
rand_popmap_133.txt started:  13:10:59


KeyboardInterrupt: 

### Try to parallelise this

In [10]:
from joblib import Parallel, delayed
import multiprocessing
import shutil
import SLMF_lightweight as SLMF_L

In [7]:
popmaps = []

for root,dirs,files in os.walk("/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations/"):
    for fil in files[:100]: ## remove slice after tests
        if fil.startswith("rand"):
            popmaps.append("%s/%s" % (root, fil))

orig_vcf = "/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations/batch_1.vcf"
for i in popmaps:
    new_vcf = "%s.vcf" % i.rpartition(".")[0]
    shutil.copyfile(orig_vcf, new_vcf)

In [12]:
def Super_SLM_finder_parallel(popmap):
    import os
    
    params = dict(Parameter_dict) ## work on a copy so each job leaves the shared dictionary untouched
    params['Pop_map'] = popmap
    params['VCF'] = "%s.vcf" % popmap.rpartition(".")[0] ## each popmap has its own copy of the VCF (made above)
    results_dict = {}
    results_dict["XYset"], results_dict["ZWset"], results_dict["Detailed"] = SLMF_L.Super_SLM_finder(params, "100", verbose = False, write_files=False, plot=False)
    
    os.remove(params['VCF']) ## remove VCFs as they are used
    
    return results_dict
    
    

In [15]:
popmaps

['/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_375.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_107.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_837.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_133.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_543.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_257.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_196.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_473.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_598.txt',
 '/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Haplogroup_1_randomisations//rand_popmap_168.txt',


In [13]:
results = Parallel(n_jobs=4, verbose = 5)(delayed(Super_SLM_finder_parallel)(i) for i in popmaps[:8])

[Parallel(n_jobs=4)]: Done   3 out of   8 | elapsed:  1.6min remaining:  2.7min
[Parallel(n_jobs=4)]: Done   5 out of   8 | elapsed:  3.3min remaining:  2.0min
[Parallel(n_jobs=4)]: Done   8 out of   8 | elapsed:  3.4min finished


Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of XY tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0
Final number of ZW tags = 0


In [14]:
results

[{'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 'het': []}},
  'XYset': set(),
  'ZWset': set()},
 {'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 'het': []}},
  'XYset': set(),
  'ZWset': set()},
 {'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 'het': []}},
  'XYset': set(),
  'ZWset': set()},
 {'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 'het': []}},
  'XYset': set(),
  'ZWset': set()},
 {'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 'het': []}},
  'XYset': set(),
  'ZWset': set()},
 {'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 'het': []}},
  'XYset': set(),
  'ZWset': set()},
 {'Detailed': {'XY': {'Ytags': [], 'freq': set(), 'het': []},
   'ZW': {'Ytags': [], 'freq': set(), 

In [None]:
def Randomisationalizer():
    pass ## placeholder, not written yet

In [42]:
for randomisation in results:
    print "N XY tags found = %s" % len(randomisation["XYset"])
    print "N ZW tags found = %s" % len(randomisation["ZWset"])
    
    if len(randomisation["XYset"]) < 1:
        print "No XY tags found"
    elif len(randomisation["ZWset"]) < 1:
        print "No ZW tags found"
    else:
        print "ratio of XY to ZW tags found =", len(randomisation["XYset"])/len(randomisation["ZWset"])

 N XY tags found = 2
N ZW tags found = 3
ratio of XY to ZW tags found = 0.666666666667
N XY tags found = 2
N ZW tags found = 2
ratio of XY to ZW tags found = 1.0
N XY tags found = 2
N ZW tags found = 1
ratio of XY to ZW tags found = 2.0
N XY tags found = 1
N ZW tags found = 3
ratio of XY to ZW tags found = 0.333333333333
N XY tags found = 2
N ZW tags found = 0
No ZW tags found
N XY tags found = 2
N ZW tags found = 2
ratio of XY to ZW tags found = 1.0
N XY tags found = 1
N ZW tags found = 2
ratio of XY to ZW tags found = 0.5
N XY tags found = 3
N ZW tags found = 1
ratio of XY to ZW tags found = 3.0
N XY tags found = 1
N ZW tags found = 4
ratio of XY to ZW tags found = 0.25
N XY tags found = 3
N ZW tags found = 1
ratio of XY to ZW tags found = 3.0
N XY tags found = 0
N ZW tags found = 5
No XY tags found
N XY tags found = 7
N ZW tags found = 3
ratio of XY to ZW tags found = 2.33333333333
N XY tags found = 1
N ZW tags found = 2
ratio of XY to ZW tags found = 0.5
N XY tags found = 2
N ZW ta
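The per-randomisation counts and ratios printed above can also be collected into lists for later use. This is a minimal sketch assuming each entry of `results` is a dictionary with `"XYset"` and `"ZWset"` keys, as returned by `Super_SLM_finder_parallel`. The helper name `tally` and the toy input are hypothetical.

```python
def tally(results):
    """Return XY counts, ZW counts and XY/ZW ratios (None where no ZW tags were found)."""
    xy_counts = [len(r["XYset"]) for r in results]
    zw_counts = [len(r["ZWset"]) for r in results]
    ratios = [float(xy) / zw if zw else None  # avoid dividing by zero
              for xy, zw in zip(xy_counts, zw_counts)]
    return xy_counts, zw_counts, ratios

# Toy results (hypothetical tag names) in the same shape as the real output
toy_results = [
    {"XYset": set(["a", "b"]), "ZWset": set(["c", "d", "e"])},
    {"XYset": set(["a", "b"]), "ZWset": set()},
]
xy_counts, zw_counts, ratios = tally(toy_results)
```

Storing `None` for runs with no ZW tags keeps the three lists the same length, so each index still refers to the same randomisation.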

In [59]:
for i in results:
    print i

(set(['6986_649624', '54943_5109619', '128347_11936204', '88248_8207052', '102403_9523446', '86223_8018692', '71253_6626496', '113324_10539083', '58095_5402792', '95746_8904377', '147009_13671808', '139353_12959815', '137727_12808549', '139353_12959812', '16338_1519364', '88469_8227542', '98426_9153534', '127022_11813043', '30284_2816333', '135965_12644692', '130464_12133077', '34656_3222963', '6778_630270', '52638_4895287', '30566_2842604', '21238_1975135', '50872_4731041', '112336_10447190', '109752_10206925', '138231_12855470', '83672_7781434', '2842_264250', '28245_2626725', '120488_11205314', '54413_5060326', '74047_6886288', '89775_8348994', '33862_3149164', '106319_9887598', '811_75370', '99782_9279722', '67262_6255305', '12293_1143173', '100320_9329720', '93209_8668428', '97830_9098132', '117136_10893644', '117136_10893642', '40256_3743738', '27511_2558479', '105157_9779591', '48342_4495769', '94948_8830143', '88130_8196072', '4394_408585', '168079_15631284', '248849_23142894',

In [53]:
for i in results:
    for j in i:
        print "NEXT"
        print j
    
    

NEXT
set(['6986_649624', '54943_5109619', '128347_11936204', '88248_8207052', '102403_9523446', '86223_8018692', '71253_6626496', '113324_10539083', '58095_5402792', '95746_8904377', '147009_13671808', '139353_12959815', '137727_12808549', '139353_12959812', '16338_1519364', '88469_8227542', '98426_9153534', '127022_11813043', '30284_2816333', '135965_12644692', '130464_12133077', '34656_3222963', '6778_630270', '52638_4895287', '30566_2842604', '21238_1975135', '50872_4731041', '112336_10447190', '109752_10206925', '138231_12855470', '83672_7781434', '2842_264250', '28245_2626725', '120488_11205314', '54413_5060326', '74047_6886288', '89775_8348994', '33862_3149164', '106319_9887598', '811_75370', '99782_9279722', '67262_6255305', '12293_1143173', '100320_9329720', '93209_8668428', '97830_9098132', '117136_10893644', '117136_10893642', '40256_3743738', '27511_2558479', '105157_9779591', '48342_4495769', '94948_8830143', '88130_8196072', '4394_408585', '168079_15631284', '248849_231428

In [45]:
help(delayed)

Help on function delayed in module joblib.parallel:

delayed(function, check_pickle=True)
    Decorator used to capture the arguments of a function.
    
    Pass `check_pickle=False` when:
    
    - performing a possibly repeated check is too costly and has been done
      already once outside of the call to delayed.
    
    - when used in conjunction `Parallel(backend='threading')`.



In [None]:
#Parameter_dict['Pop_map'] = "%s/%s" % (root, fil)
#print fil

XYset, ZWset = MISC.Super_SLM_finder(Parameter_dict, "010")
### So add in here the SL_marker_finder function and capture the number of SL markers identified. 

XY_numbs.append(len(XYset))
ZW_numbs.append(len(ZWset))

print popmap

In [None]:
num_cores = multiprocessing.cpu_count()-1 ## leave one core free

results = Parallel(n_jobs=num_cores)(delayed(Super_SLM_finder_parallel)(i) for i in popmaps)

In [35]:

print XY_numbs
print ZW_numbs

for i in range(len(XY_numbs)):
    print XY_numbs[i] / ZW_numbs[i]


[709, 809, 789, 843, 830, 852, 812, 1136, 814, 830]
[757, 553, 597, 679, 754, 680, 575, 688, 642, 509]
0.936591809775
1.46292947559
1.3216080402
1.24153166421
1.10079575597
1.25294117647
1.41217391304
1.6511627907
1.26791277259
1.63064833006


### The real assignments

In [36]:
Parameter_dict['Pop_map'] = "/home/djeffrie/Data/RADseq/Lpipiens/P_val_calculations/Sex_info_LAR_SCO_kept_certain.txt"

XYset, ZWset = MISC.Super_SLM_finder(Parameter_dict, "010")


##### Using SNP heterozygosity approach #####
 
Number of loci = 52351
Number of samples = 19
Number of loci with too few samples = 0
Number of loci with low MAF = 0
Number of loci with enough data = 52350
Number of putative X linked snps = 864
Number of putative X linked tags = 864
Number of putative Z linked markers = 549
Number of putative Z linked tags = 549

 ### DONE! ### 


 ## After merging tags accross methods ## 

Final number of XY tags = 864
Final number of ZW tags = 549
Sex linked tags outputted to fastas 'Putative_XYlinked_makers.fa' and Putative_ZWlinked_makers.fa
in the directory //home/djeffrie/Data/RADseq/Lpipiens/Lpip_all_stacks


In [37]:
len(XYset)/len(ZWset)

1.5737704918032787
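The counts printed above are enough to sketch the comparison itself. This is a hedged illustration, not a final result: `empirical_p` is a hypothetical helper, the "+1" in numerator and denominator is a commonly used convention for permutation tests that I am assuming here, and ten randomisations are far too few for a reliable P-value. The numbers are copied from the randomised and real runs shown above.

```python
def empirical_p(observed, null_values):
    """Share of randomised runs that found at least as many markers as the real run."""
    n_as_extreme = sum(1 for value in null_values if value >= observed)
    # +1 counts the real assignment as one possible arrangement of the labels
    return (n_as_extreme + 1.0) / (len(null_values) + 1.0)

# Counts from the ten randomised runs printed above
XY_null = [709, 809, 789, 843, 830, 852, 812, 1136, 814, 830]
ZW_null = [757, 553, 597, 679, 754, 680, 575, 688, 642, 509]

# Counts from the real sex assignments (864 XY tags, 549 ZW tags)
p_xy = empirical_p(864, XY_null)
p_zw = empirical_p(549, ZW_null)
```

Only one of the ten randomised runs found at least 864 XY tags, whereas nine of the ten found at least 549 ZW tags, so the two P-values differ widely even with this small sample.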