## Bootstrapping sex linked marker sets

In [172]:
%matplotlib inline
import MISC_RAD_tools as MISC
import SLMF_lightweight as SLMF_L

In this notebook I will first compile all of the final Stacks outputs for each species dataset (or subset) and then, for each one, randomise the male and female assignments across the samples. For each of 100 random male/female assignments I will then run the sex-linked marker finding analyses, using exactly the parameters used to identify the final set of sex-linked markers in the paper. The only difference is the male and female assignments.

This randomisation will give an idea of what the false positive rate in the dataset is. For example, a skew towards more males or females in the dataset may make false positives of one type or another more likely. Also, if there are several populations in the data and males and females are not distributed evenly among them, then population structure could look like sex linkage. Randomising male and female assignments across all samples will allow us to account for these effects.
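As a minimal sketch of the randomisation step, the function below permutes the sex column of a population map while keeping the sample order and the number of males and females fixed. It assumes a two-column, tab-delimited pop map (sample name, then sex label), as in a Stacks pop map; the function name `shuffle_sex_labels` and the example sample names are hypothetical and are not part of `SLMF_lightweight`.

```python
import random

def shuffle_sex_labels(pop_map_lines, seed=None):
    """Return pop-map lines with the sex labels permuted across samples.

    Assumes each non-empty line is "<sample>\t<sex>". The number of
    males and females is unchanged; only who carries which label changes.
    """
    rng = random.Random(seed)  # seeded so a given randomisation can be repeated
    samples, sexes = [], []
    for line in pop_map_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sample, sex = line.split("\t")[:2]
        samples.append(sample)
        sexes.append(sex)
    rng.shuffle(sexes)  # permute labels in place
    return ["%s\t%s" % (s, x) for s, x in zip(samples, sexes)]

# Hypothetical pop map with two males and three females
example = ["ind1\tM", "ind2\tM", "ind3\tF", "ind4\tF", "ind5\tF"]
for i in range(3):
    print(shuffle_sex_labels(example, seed=i))
```

Each shuffled list could then be written to a temporary pop map file and passed to the analysis in place of the real one.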

Because 1000 randomisations per species would take a prohibitively long time, I will not run that many, although this is what I would prefer. Instead I will run just 100, which should still give a reasonably good estimate of the false positive rate.

With regard to how these randomisations will be used to judge the validity of the dataset, I will look for sample sets where the number of sex-linked markers found using the correct male/female assignments is above the 95th percentile of the distribution of sex-linked markers found in the randomisations.
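The comparison against the 95th percentile can be sketched as follows. This is an illustration only: the helper names are hypothetical, the percentile uses simple linear interpolation between ranks, and the counts are made-up placeholder numbers, not results from any dataset. The empirical p-value shown alongside uses the common (hits + 1) / (N + 1) estimator, which with 100 randomisations cannot go below 1/101.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def exceeds_null(observed, null_counts, q=95):
    """True if the observed count lies above the q-th percentile of the null."""
    return observed > percentile(null_counts, q)

def empirical_p(observed, null_counts):
    """Share of randomisations with at least as many markers as observed."""
    hits = sum(1 for n in null_counts if n >= observed)
    return (hits + 1.0) / (len(null_counts) + 1.0)

# Placeholder numbers for illustration only (not real results)
null_counts = [i % 7 for i in range(100)]  # markers found in 100 randomisations
observed = 25                              # markers found with the true sexes

print(percentile(null_counts, 95))
print(exceeds_null(observed, null_counts))
print(empirical_p(observed, null_counts))
```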

In combination with the genome mapping, this should help validate the sex-linked marker sets found.



#### First, make a list of all of the parameter dictionaries, which contain the paths and parameters used for finding sex-linked markers


In [100]:
Parameter_dictionaries = []
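The dictionaries below share most of their entries, so one possible way to build them is from a single set of defaults plus per-species overrides. The sketch below is hypothetical: `DEFAULT_THRESHOLDS` and `make_parameter_dict` are not part of `SLMF_lightweight`, the paths are placeholders, and the default values are simply those quoted in the comments of the cells that follow. The key names (including their spelling) are copied from those cells.

```python
# Defaults as quoted in the "(Default = ...)" comments in this notebook
DEFAULT_THRESHOLDS = {
    # 1. Frequency approach
    "X_or_Z_freq_threshold": 0.4,
    "sample_presence_cutoff1": 0.75,
    "coverage_threshold1": 3,
    "maf_threshold1": 0.05,
    "homogametic_REF_allele_freq": 0.95,
    # 2. Heterozygosity approach
    "homogamtic_homozygosity_threshold": 0.9,
    "heterogamtic_heterozygosity_threshold": 0.5,
    "sample_presence_cutoff2": 0.75,
    "coverage_threshold2": 3,
    "maf_threshold2": 0.05,
    # 3. Sex specific presence or absence approach
    "sex_presence_threshold": 0.5,
}

def make_parameter_dict(name, catalog, vcf, pop_map, **overrides):
    """Build one parameter dictionary from the defaults plus overrides."""
    unknown = set(overrides) - set(DEFAULT_THRESHOLDS)
    if unknown:
        # Catch misspelled keys early instead of silently ignoring them
        raise KeyError("Unknown threshold(s): %s" % ", ".join(sorted(unknown)))
    params = dict(DEFAULT_THRESHOLDS)
    params.update(overrides)
    params.update({"Name": name, "Catalog": catalog, "VCF": vcf, "Pop_map": pop_map})
    return params

# Placeholder paths and name, for illustration only
example_dict = make_parameter_dict(
    "Example",
    "/path/to/batch_1.catalog.tags.tsv.gz",
    "/path/to/batch_1.vcf",
    "/path/to/Sex_ID_info.txt",
    coverage_threshold1=7,
    coverage_threshold2=7,
)
print(example_dict["coverage_threshold1"], example_dict["maf_threshold1"])
```

The explicit per-species cells are kept as they are in this notebook so that every value used is visible next to its description.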

### L. chiricahuensis

In [98]:
Parameter_dict_Lchiri = {}
Parameter_dict_Lchiri["Name"] = "Lchiri"

##### Data ########################

Parameter_dict_Lchiri['Catalog'] =  "/home/djeffrie/Data/RADseq/Lchricahuensis/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Lchiri['VCF'] =  "/home/djeffrie/Data/RADseq/Lchricahuensis/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Lchiri['Pop_map'] = "/home/djeffrie/Data/RADseq/Lchricahuensis/Sex_ID_info_reassigned.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Lchiri['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Lchiri['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Lchiri['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Lchiri['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Lchiri['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Lchiri['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Lchiri['heterogamtic_heterozygosity_threshold'] = 0.5 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Lchiri['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Lchiri['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Lchiri['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Lchiri['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Lchiri)
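The comment on `X_or_Z_freq_threshold` describes a window that is symmetric around 0.5. As a small sketch of that arithmetic (the function names are hypothetical and this is not the code in `SLMF_lightweight`), the upper bound is 0.5 + (0.5 - lower), i.e. 1 - lower:

```python
def frequency_window(lower):
    """Return (lower, upper), with upper mirrored around 0.5."""
    if not 0.0 <= lower <= 0.5:
        raise ValueError("lower threshold must be between 0 and 0.5")
    return lower, 1.0 - lower

def in_window(freq_diff, lower=0.4):
    """True if an allele-frequency difference falls inside the window."""
    low, high = frequency_window(lower)
    return low <= freq_diff <= high

print(frequency_window(0.4))
# A difference of 0.5 would arise if one sex were fixed for an allele
# (frequency 1.0) and the other carried it at frequency 0.5
print(in_window(1.0 - 0.5))
print(in_window(0.3))
```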

### L. montezumae

No sex linked markers found

### L. pipiens (HOR / SWE populations)

In [99]:
Parameter_dict_LpipHORSWE = {}
Parameter_dict_LpipHORSWE["Name"] = "LpipHORSWE"

##### Data ########################

Parameter_dict_LpipHORSWE['Catalog'] =  "/home/djeffrie/Data/RADseq/Lpipiens/Lpip_all_stacks/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_LpipHORSWE['VCF'] =  "/home/djeffrie/Data/RADseq/Lpipiens/Lpip_all_stacks/HOR_SWE_populations/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_LpipHORSWE['Pop_map'] = "/home/djeffrie/Data/RADseq/Lpipiens/Lpip_all_stacks/HOR_SWE_populations/Sex_ID_info_kept_certain.txt"

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_LpipHORSWE['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_LpipHORSWE['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_LpipHORSWE['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_LpipHORSWE['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_LpipHORSWE['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_LpipHORSWE['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_LpipHORSWE['heterogamtic_heterozygosity_threshold'] = 0.8 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_LpipHORSWE['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_LpipHORSWE['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_LpipHORSWE['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_LpipHORSWE['sex_presence_threshold'] =  0.7 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_LpipHORSWE)

### L. tarahumarae

In [22]:
Parameter_dict_Ltarah = {}
Parameter_dict_Ltarah["Name"] = "Ltarah"

working_dir = "/home/djeffrie/Data/RADseq/Ltarahumarae/Sex_linked_markers/"

##### Data ########################

Parameter_dict_Ltarah['Catalog'] =  "%s/batch_1.catalog.tags.tsv.gz" % working_dir ## Path to the catalog file - used by all approaches.
Parameter_dict_Ltarah['VCF'] =  "%s/batch_1.vcf"  % working_dir ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Ltarah['Pop_map'] = "%s/Sex_ID_info_kept_2.txt"  % working_dir ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Ltarah['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Ltarah['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ltarah['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ltarah['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
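Parameter_dict_Ltarah['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default.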

# 2. Heterozygosity approach
Parameter_dict_Ltarah['homogamtic_homozygosity_threshold'] = 0.9 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Ltarah['heterogamtic_heterozygosity_threshold'] = 0.5 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Ltarah['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ltarah['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ltarah['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Ltarah['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Ltarah)

### R. berlandieri

In [23]:
Parameter_dict_R_ber = {}
Parameter_dict_R_ber["Name"] = "R_ber"

##### Data ########################

Parameter_dict_R_ber['Catalog'] =  "/home/djeffrie/Data/RADseq/Rberlandieri/Sex_linked_markers/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_R_ber['VCF'] =  "/home/djeffrie/Data/RADseq/Rberlandieri/Sex_linked_markers/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_R_ber['Pop_map'] = "/home/djeffrie/Data/RADseq/Rberlandieri/Sex_linked_markers/Sex_ID_info_kept_2.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_R_ber['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_R_ber['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_R_ber['coverage_threshold1'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_R_ber['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_R_ber['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_R_ber['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_R_ber['heterogamtic_heterozygosity_threshold'] = 0.5 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_R_ber['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_R_ber['coverage_threshold2'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_R_ber['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_R_ber['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_R_ber)

### P. perezi

In [24]:
Parameter_dict_Pper = {}
Parameter_dict_Pper["Name"] = "Pper"

##### Data ########################

Parameter_dict_Pper['Catalog'] =  "/home/djeffrie/Data/RADseq/Pperezi/Sex_linked_markers/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Pper['VCF'] =  "/home/djeffrie/Data/RADseq/Pperezi/Sex_linked_markers/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Pper['Pop_map'] = "/home/djeffrie/Data/RADseq/Pperezi/Sex_linked_markers/Sex_info_ID.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Pper['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Pper['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Pper['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Pper['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Pper['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Pper['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Pper['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Pper['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Pper['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Pper['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Pper['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Pper)

### R. arvalis (all)

In [25]:
Parameter_dict_Rarv = {}
Parameter_dict_Rarv["Name"] = "Rarv"

##### Data ########################

Parameter_dict_Rarv['Catalog'] =  "/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Rarv['VCF'] =  "/home/djeffrie/Data/RADseq/Rarvalis_NEW/Stacks/Populations_kept_2/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Rarv['Pop_map'] = "/home/djeffrie/Data/RADseq/Rarvalis_NEW//Stacks/Sex_ID_info_kept_2.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Rarv['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Rarv['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rarv['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rarv['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Rarv['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Rarv['homogamtic_homozygosity_threshold'] = 0.9 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Rarv['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Rarv['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rarv['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rarv['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Rarv['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Rarv)

### R. chensinensis

In [26]:
Parameter_dict_Rchen = {}
Parameter_dict_Rchen["Name"] = "Rchen"

##### Data ########################

Parameter_dict_Rchen['Catalog'] =  "/home/djeffrie/Data/RADseq/Rchensinensis/Sex_linked_markers/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Rchen['VCF'] =  "/home/djeffrie/Data/RADseq/Rchensinensis/Sex_linked_markers/batch_1_strict_kept_3.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Rchen['Pop_map'] = "/home/djeffrie/Data/RADseq/Rchensinensis/Sex_linked_markers/Sex_ID_info_kept_3.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Rchen['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Rchen['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rchen['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rchen['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Rchen['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Rchen['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Rchen['heterogamtic_heterozygosity_threshold'] = 0.7 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Rchen['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rchen['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rchen['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Rchen['sex_presence_threshold'] =  0.7 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Rchen)

### R. dalmatina 

In [101]:
Parameter_dict_Rdal = {}
Parameter_dict_Rdal["Name"] = "Rdal"

##### Data ########################

Parameter_dict_Rdal['Catalog'] =  "/home/djeffrie/Data/RADseq/Rdalmatina/Populations_final/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Rdal['VCF'] =  "/home/djeffrie/Data/RADseq/Rdalmatina/Populations_final/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Rdal['Pop_map'] = "/home/djeffrie/Data/RADseq/Rdalmatina/Populations_final/Sex_ID_info.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Rdal['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Rdal['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rdal['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rdal['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Rdal['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Rdal['homogamtic_homozygosity_threshold'] = 0.9 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Rdal['heterogamtic_heterozygosity_threshold'] = 0.5 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Rdal['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rdal['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rdal['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Rdal['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Rdal)

### R. iberica all northern Spain

In [102]:
Parameter_dict_Ribe_Sp = {}
Parameter_dict_Ribe_Sp["Name"] = "Ribe_Sp"

##### Data ########################

Parameter_dict_Ribe_Sp['Catalog'] =  "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Ribe_Sp['VCF'] =  "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/Populations_N_spain/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Ribe_Sp['Pop_map'] = "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/Populations_N_spain/Sex_ID_info_N_spain.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Ribe_Sp['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the frequency calculation to find sex-linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5)
Parameter_dict_Ribe_Sp['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ribe_Sp['coverage_threshold1'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ribe_Sp['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Ribe_Sp['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Ribe_Sp['homogamtic_homozygosity_threshold'] = 0.9 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Ribe_Sp['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Ribe_Sp['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ribe_Sp['coverage_threshold2'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ribe_Sp['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Ribe_Sp['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Ribe_Sp)
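
The frequency window described in the `X_or_Z_freq_threshold` comment can be made concrete with a small worked example. This is only a sketch and is not taken from `SLMF_lightweight`: `freq_window` and `in_freq_window` are hypothetical helpers that apply the rule as stated in the comment (the upper bound sits as far above 0.5 as the lower bound sits below it).

```python
def freq_window(lower_threshold):
    """Return (lower, upper) bounds that are symmetric around 0.5."""
    if not 0.0 <= lower_threshold <= 0.5:
        raise ValueError("lower threshold must lie between 0 and 0.5")
    upper_threshold = 0.5 + (0.5 - lower_threshold)
    return lower_threshold, upper_threshold


def in_freq_window(freq_difference, lower_threshold=0.4):
    """True if a between-sex frequency difference falls inside the window."""
    lower, upper = freq_window(lower_threshold)
    return lower <= freq_difference <= upper


lower, upper = freq_window(0.4)
print("window: %.2f to %.2f" % (lower, upper))
```

With the threshold of 0.4 used for every dataset here, a frequency difference of 0.5 would pass and a difference of 0.3 would not.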

### R. iberica "family" (which isn't actually a family)

In [103]:
Parameter_dict_Ribe_fam = {}
Parameter_dict_Ribe_fam["Name"] = "Ribe_fam"

##### Data ########################

Parameter_dict_Ribe_fam['Catalog'] =  "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Ribe_fam['VCF'] =  "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/FAMILY/New_family_assignments/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Ribe_fam['Pop_map'] = "/home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/FAMILY/New_family_assignments/Sex_ID_info.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Ribe_fam['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the freq calculation to find sex linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5) 
Parameter_dict_Ribe_fam['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ribe_fam['coverage_threshold1'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ribe_fam['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Ribe_fam['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Ribe_fam['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Ribe_fam['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Ribe_fam['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ribe_fam['coverage_threshold2'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ribe_fam['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Ribe_fam['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Ribe_fam)

### R. italica (CM pop)

In [104]:
Parameter_dict_Rita = {}
Parameter_dict_Rita["Name"] = "Rita"

##### Data ########################

Parameter_dict_Rita['Catalog'] =  "/home/djeffrie/Data/RADseq/Ritalica/Sex_linked_markers/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Rita['VCF'] =  "/home/djeffrie/Data/RADseq/Ritalica/Sex_linked_markers/Pop_CM/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Rita['Pop_map'] = "/home/djeffrie/Data/RADseq/Ritalica/Sex_linked_markers/Pop_CM/CM_sex_ID_info.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Rita['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the freq calculation to find sex linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5) 
Parameter_dict_Rita['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rita['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rita['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Rita['homogametic_REF_allele_freq'] = 0.9 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Rita['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Rita['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Rita['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rita['coverage_threshold2'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rita['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Rita['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Rita)

### R. kukinoris (Nanping only)

In [105]:
Parameter_dict_Rkuk = {}
Parameter_dict_Rkuk["Name"] = "Rkuk"

##### Data ########################

Parameter_dict_Rkuk['Catalog'] =  "/home/djeffrie/Data/RADseq/Rkukinoris/Stacks/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Rkuk['VCF'] =  "/home/djeffrie/Data/RADseq/Rkukinoris/Stacks/Populations_Nanping_kept_altered/batch_1.vcf" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Rkuk['Pop_map'] = "/home/djeffrie/Data/RADseq/Rkukinoris/Stacks/Populations_Nanping_kept_altered/Sex_ID_info.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Rkuk['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the freq calculation to find sex linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5) 
Parameter_dict_Rkuk['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rkuk['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rkuk['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Rkuk['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Rkuk['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Rkuk['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Rkuk['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Rkuk['coverage_threshold2'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Rkuk['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Rkuk['sex_presence_threshold'] =  0.7 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Rkuk)

### R. yavapaiensis

In [106]:
Parameter_dict_Ryav = {}
Parameter_dict_Ryav["Name"] = "Ryav"

## NOTE. DICT KEY NAMES MUST NOT BE CHANGED!

##### Data ########################

Parameter_dict_Ryav['Catalog'] =  "/home/djeffrie/Data/RADseq/Ryavapaiensis/Sex_linked_markers/batch_1.catalog.tags.tsv.gz" ## Path to the catalog file - used by all approaches.
Parameter_dict_Ryav['VCF'] =  "/home/djeffrie/Data/RADseq/Ryavapaiensis/Sex_linked_markers/batch_1.vcf.altered" ## path to vcf file (note this will be altered to make header compatible with Pyvcf. New vcf will have same name with ".altered" appended to the end). Used by Approach i) and ii)
Parameter_dict_Ryav['Pop_map'] = "/home/djeffrie/Data/RADseq/Ryavapaiensis/Sex_linked_markers/Sex_ID_info.txt" ## path to population map file containing sex information. Same format as Stacks pop map file. Used by all approaches.

###### threshold parameters #######

# 1. Frequency approach
Parameter_dict_Ryav['X_or_Z_freq_threshold'] = 0.4  ## (Default = 0.4) The lower threshold for the freq calculation to find sex linked SNPs, e.g. for an XY system, a threshold of 0.4 means that f(F) - f(M) can be >= 0.4 and <= 0.6 (the upper threshold is automatically calculated to be the same distance above 0.5 as the lower threshold is below 0.5) 
Parameter_dict_Ryav['sample_presence_cutoff1'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ryav['coverage_threshold1'] = 3 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ryav['maf_threshold1'] =  0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 
Parameter_dict_Ryav['homogametic_REF_allele_freq'] = 1 ## (Default = 0.95) The sex linked SNP will be the minor allele, so a check is done to make sure that the homogametic sex is above the threshold specified for the major allele. In theory this should be 1. But allowing for some error 0.95 is used as a default. 

# 2. Heterozygosity approach
Parameter_dict_Ryav['homogamtic_homozygosity_threshold'] = 1 ## (Default = 0.9) The minimum number of the homogametic sex which must not have the tag for that tag to be considered linked to the sex-limited chromosome
Parameter_dict_Ryav['heterogamtic_heterozygosity_threshold'] = 0.6 ## (Default = 0.5) The lower threshold for the proportion of heterozygotes in the heterogametic sex at a locus 
Parameter_dict_Ryav['sample_presence_cutoff2'] = 0.75 ## (Default = 0.75) a locus must be called in at least this proportion of all samples (not within populations) to be considered
Parameter_dict_Ryav['coverage_threshold2'] = 7 ## (Default = 3) a locus must have at least this threshold in a sample to be considered for that sample. Note that loci below this threshold will be removed from a sample, and this can push the locus below the sample presence cut-off, which will then remove the locus.
Parameter_dict_Ryav['maf_threshold2'] = 0.05 ## (Default = 0.05) minor allele frequency cutoff for a locus across all samples. 

# 3. Sex specific presence or absence approach
Parameter_dict_Ryav['sex_presence_threshold'] =  0.5 ## (Default = 0.5) The minimum proportion of the heterogametic sex that a tag must be present in.

Parameter_dictionaries.append(Parameter_dict_Ryav)

In [107]:
print "There are %s species for which sex linked markers were found that will be randomised" % len(Parameter_dictionaries)

There are 6 species for which sex linked markers were found that will be randomised


The program will work as follows:

For every species, it will
    1. make a new directory
    2. Copy the sex information file and VCF file into that directory
    3. Make the randomised sex information files and renamed VCF files
    4. Run the randomisation program for that species and collect results
    5. Write results to an output file
    6. Tidy up loose files
    7. Plot the results
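
Step 3 is the core of the procedure, so here is a minimal sketch of it in isolation. It assumes a three-column sex information file (sample, sex, ID), which is how the file is read in this notebook. The function name, the toy sample names and the seed are made up for illustration.

```python
import random


def randomise_sexes(popmap_lines, seed=None):
    """Permute the sex column of a pop map, leaving samples and IDs in place."""
    rng = random.Random(seed)  # a seeded generator makes a randomisation repeatable
    rows = [line.split() for line in popmap_lines if line.strip()]
    sexes = [row[1] for row in rows]
    rng.shuffle(sexes)  # permutation, so the numbers of males and females are unchanged
    return ["%s\t%s\t%s" % (row[0], sex, row[2]) for row, sex in zip(rows, sexes)]


# Toy pop map: sample name, sex, ID
toy_popmap = ["ind1\tM\t1", "ind2\tM\t2", "ind3\tF\t3", "ind4\tF\t4", "ind5\tF\t5"]
shuffled = randomise_sexes(toy_popmap, seed=1)
for line in shuffled:
    print(line)
```

Because the labels are permuted rather than redrawn, every randomised file keeps the sex ratio of the original, so any skew towards males or females is carried into the null distribution.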



In [35]:
## Define a helper function so the analyses can be run in parallel


def Super_SLM_finder_parallel(popmap, Parameter_dict):
    import os
    
    ## Work on a copy so the caller's dictionary is not modified
    Parameter_dict = dict(Parameter_dict)
    Parameter_dict['Pop_map'] = popmap
    Parameter_dict['VCF'] = "%s.vcf" % popmap.rpartition(".")[0]  ## each randomised pop map has its own copy of the VCF
    results_dict = {}
    results_dict["XYset"], results_dict["ZWset"], results_dict["Detailed"] = SLMF_L.Super_SLM_finder(Parameter_dict, "111", verbose = False, write_files=False, plot=False)
    
    ## Remove the per-randomisation VCF and the intermediate files written alongside it
    os.remove(Parameter_dict['VCF'])
    os.remove("%s.altered" % Parameter_dict['VCF'])
    os.remove("%s.all_frequencies.tsv" % Parameter_dict['VCF'])
    

    return results_dict
    
    

In [110]:
from joblib import Parallel, delayed
import multiprocessing
import os
import shutil
from random import shuffle

results_dict = {}

for dataset in Parameter_dictionaries:
    print "processing dataset in", dataset["VCF"]
    # 1. Make a new directory next to the VCF. 
    Randomisation_dir = "%s/Randomisations" % dataset["VCF"].rpartition("/")[0]
    if not os.path.exists(Randomisation_dir):
        os.makedirs(Randomisation_dir)
    
    # 2. Copy sex info and VCF into that folder
    
    shutil.copyfile(dataset["Pop_map"], "%s/Sex_ID_info.txt" % Randomisation_dir)  ## sex info
    shutil.copyfile(dataset["VCF"], "%s/batch_1.vcf" % Randomisation_dir)  ## VCF
    
    orig_vcf = "%s/batch_1.vcf" % Randomisation_dir
    
    print "\nRandomisations happening in %s" % Randomisation_dir
    
    ## 3. Make the randomised sex info files

    orig_popmap_path = "%s/Sex_ID_info.txt" % Randomisation_dir
    orig_popmap = open(orig_popmap_path, 'r').readlines()

    sexes = []
    samples = []
    IDs = []
    randomisations = []

    for line in orig_popmap:
        fields = line.strip().split()
        samples.append(fields[0])
        sexes.append(fields[1])
        IDs.append(fields[2])

    popmaps = []
    for i in range(100):
        popmap_path = "%s/rand_popmap_%s.txt" % (Randomisation_dir, i)
        popmaps.append(popmap_path)
        rand_popmap = open(popmap_path, 'w')
        shuffle(sexes)  ## shuffles in place, so the sex ratio is preserved
        randomisations.append(list(sexes))  ## store a copy; appending sexes itself would store the same list 100 times
        
        for j in range(len(samples)):  ## j, so the outer loop counter i is not reused
            rand_popmap.write("%s\t%s\t%s\n" % (samples[j], sexes[j], IDs[j]))

        rand_popmap.close()
    
    print "\nRandom sex info files made"
    
    ## make new VCFs for parallelised analyses
    
    for i in popmaps:
        new_vcf = "%s.vcf" % i.rpartition(".")[0]
        shutil.copyfile(orig_vcf, new_vcf)
        
    ## 4. Run the randomisations
    
    print "\nRunning randomisations\n"
    
    results_dict[dataset["Name"]] = Parallel(n_jobs=4, verbose = 1)(delayed(Super_SLM_finder_parallel)(i, dataset) for i in popmaps)
    
    ## 5. Output the results for each species after the species is complete (i.e. checkpoints)
    
    outfile = open("%s/Randomisations_%s.txt" % (Randomisation_dir, dataset["Name"]), 'w')
    
    for Randomisation in results_dict[dataset["Name"]]:

        XYfreq = len(Randomisation["Detailed"]["XY"]["freq"])
        XYhet = len(Randomisation["Detailed"]["XY"]["het"])
        Ytags = len(Randomisation["Detailed"]["XY"]["Ytags"])
    
        ZWfreq = len(Randomisation["Detailed"]["ZW"]["freq"])
        ZWhet = len(Randomisation["Detailed"]["ZW"]["het"])
        Wtags = len(Randomisation["Detailed"]["ZW"]["Wtags"])
                
        line = "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (dataset["Name"],XYfreq,XYhet,Ytags,ZWfreq,ZWhet,Wtags)
    
        outfile.write(line)
        
    outfile.close()
    
    print "Results outputted to %s/Randomisations_%s.txt" % (Randomisation_dir, dataset["Name"])
    
    
    

processing dataset in /home/djeffrie/Data/RADseq/Rdalmatina/Populations_final/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/RADseq/Rdalmatina/Populations_final/Randomisations

Random sex info files made

Running randomisations



[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  3.5min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:  8.4min finished


Results outputted to /home/djeffrie/Data/RADseq/Rdalmatina/Populations_final/Randomisations/Randomisations_Rdal.txt
processing dataset in /home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/Populations_N_spain/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/Populations_N_spain/Randomisations

Random sex info files made

Running randomisations

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  9.5min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 21.7min finished


Results outputted to /home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/Populations_N_spain/Randomisations/Randomisations_Ribe_Sp.txt
processing dataset in /home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/FAMILY/New_family_assignments/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/FAMILY/New_family_assignments/Randomisations

Random sex info files made

Running randomisations

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  8.5min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 19.5min finished


Results outputted to /home/djeffrie/Data/RADseq/Riberica/Stacks_trimmed/IDd/FAMILY/New_family_assignments/Randomisations/Randomisations_Ribe_fam.txt
processing dataset in /home/djeffrie/Data/RADseq/Ritalica/Sex_linked_markers/Pop_CM/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/RADseq/Ritalica/Sex_linked_markers/Pop_CM/Randomisations

Random sex info files made

Running randomisations

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  9.9min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 22.6min finished


Results outputted to /home/djeffrie/Data/RADseq/Ritalica/Sex_linked_markers/Pop_CM/Randomisations/Randomisations_Rita.txt
processing dataset in /home/djeffrie/Data/RADseq/Rkukinoris/Stacks/Populations_Nanping_kept_altered/batch_1.vcf

Randomisations happening in /home/djeffrie/Data/RADseq/Rkukinoris/Stacks/Populations_Nanping_kept_altered/Randomisations




[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 30.7min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 69.9min finished


Results outputted to /home/djeffrie/Data/RADseq/Rkukinoris/Stacks/Populations_Nanping_kept_altered/Randomisations/Randomisations_Rkuk.txt
processing dataset in /home/djeffrie/Data/RADseq/Ryavapaiensis/Sex_linked_markers/batch_1.vcf.altered

Randomisations happening in /home/djeffrie/Data/RADseq/Ryavapaiensis/Sex_linked_markers/Randomisations

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 14.1min
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed: 32.1min finished


Results outputted to /home/djeffrie/Data/RADseq/Ryavapaiensis/Sex_linked_markers/Randomisations/Randomisations_Ryav.txt

## So all went well, randomisations completed in less than 24 hours with 100 randomisations per dataset. 

### Now to plot

In [109]:
results_dict1 ## contains first half
results_dict ## contains 2nd half.

In [152]:
randomisation_files = []
for root, dirs, files in os.walk("/home/djeffrie/Data/RADseq/Randomisations"):
    for fil in files:
        if fil.startswith("Randomisation"):
            randomisation_files.append("%s/%s" % (root, fil))

In [170]:
randomisation_filepath = "/home/djeffrie/Data/RADseq/Randomisations/Randomisations_Ryav.txt"

randomisations = open(randomisation_filepath, 'r').readlines()

XYfreqs = []
XYhets = []
Ytags = []

ZW_freqs = []
ZW_hets = []
W_tagss = []

XYvsZW_freqs = []
XYvsZW_hets = []
XYvsZW_tagss = []

for line in randomisations:
    fields = line.split()
    species = fields[0]
    ## columns 2 to 7 hold the marker counts, in the order they were written out
    XYfreq, XYhet, Y_tags, ZW_freq, ZW_het, W_tags = [int(x) for x in fields[1:7]]

    XYfreqs.append(XYfreq)
    XYhets.append(XYhet)
    Ytags.append(Y_tags)
    ZW_freqs.append(ZW_freq)
    ZW_hets.append(ZW_het)
    W_tagss.append(W_tags)

    ## difference between the XY and ZW counts for each approach
    XYvsZW_freqs.append(XYfreq - ZW_freq)
    XYvsZW_hets.append(XYhet - ZW_het)
    XYvsZW_tagss.append(Y_tags - W_tags)
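
With the counts parsed, the comparison described at the top of the notebook (observed count versus the 95th percentile of the randomisations) could be made along these lines. This is a sketch and not the analysis used in the paper: the nearest-rank percentile definition, the function names and the toy numbers are all assumptions made for illustration.

```python
def percentile_nearest_rank(values, q):
    """q-th percentile (integer q from 1 to 100) using the nearest-rank definition."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # ceil(q * n / 100) using integers only
    return ordered[rank - 1]


def empirical_p_value(null_counts, observed):
    """Proportion of randomisations with at least as many markers as observed.

    One is added to the numerator and denominator so the estimate is never zero.
    """
    n_as_extreme = sum(1 for count in null_counts if count >= observed)
    return (n_as_extreme + 1.0) / (len(null_counts) + 1.0)


# Toy numbers only: 100 made-up randomisation counts and a made-up observed count
toy_null = list(range(1, 101))
toy_observed = 120

cutoff = percentile_nearest_rank(toy_null, 95)
print("95th percentile of randomisations: %s" % cutoff)
print("observed exceeds cutoff: %s" % (toy_observed > cutoff))
print("empirical p-value: %.4f" % empirical_p_value(toy_null, toy_observed))
```

In practice `toy_null` would be replaced by one of the lists built above (for example `XYfreqs`) and `toy_observed` by the number of markers found with the true sex assignments. With 100 randomisations the smallest p-value this estimator can return is 1/101.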
    

In [190]:
help(plt.vlines)

In [207]:
from matplotlib import pyplot as plt

fig = plt.figure(figsize = (20,10))

## add_subplot takes (nrows, ncols, index): three rows, one column, panels 1 to 3

fig.add_subplot(3,1,1)
counts, bins, bars = plt.hist(XYfreqs, bins= 20, edgecolor = "lightblue", color = "royalblue")
plt.xlim((0,200))
plt.vlines(100, 0, max(counts)*0.75, color = "red")  ## red reference line at x = 100

fig.add_subplot(3,1,2)
counts, bins, bars = plt.hist(XYhets, bins= 20, edgecolor = "lightblue", color = "royalblue")
plt.xlim((0,200))
plt.vlines(100, 0, max(counts)*0.75, color = "red")

fig.add_subplot(3,1,3)
counts, bins, bars = plt.hist(Ytags, bins= 20, edgecolor = "lightblue", color = "royalblue")
plt.xlim((0,200))
plt.vlines(100, 0, max(counts)*0.75, color = "red")

plt.show()  ## show once, after all three panels have been drawn

In [195]:
print counts

[  8.   7.  18.  14.  20.  12.   5.   4.   4.   3.   1.   1.   0.   0.   0.
   0.   1.   1.   0.   1.]
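
The histogram counts printed above can also be summed cumulatively to see how many bins are needed to cover 95% of the randomisations. The sketch below only re-uses the numbers printed in this cell. The bin edges are not shown in the notebook, so the result is expressed as a number of bins rather than a marker count, and `printed_counts` is just a hand-copied list.

```python
# Counts copied by hand from the printed histogram output (20 bins)
printed_counts = [8, 7, 18, 14, 20, 12, 5, 4, 4, 3, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]

total = sum(printed_counts)
running = 0
for bin_index, count in enumerate(printed_counts):
    running += count
    if running >= 0.95 * total:
        break  # bin_index is now the first bin at which 95% is reached

print("total randomisations: %d" % total)
print("bins needed to reach 95%%: %d of %d" % (bin_index + 1, len(printed_counts)))
```

The counts sum to the 100 randomisations that were run, which is a quick check that no randomisation was dropped when the results file was parsed.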
