# Aligning linkage maps to scaffolds

Here I will take the 17235 linkage mapped RADtags from the 9 families of RADseq, and align them to the scaffolds of the Rtemp PacBio and Optical mapping hybrid assembly. 

I will take this reasonably slowly, as we know that the assembly is extremely heterozygous, so the tags are likely to align to more than one position if that position is represented by two scaffolds in the assembly. . . 

Usually, tags will align to many different places, due to the high repeat content of the genome, so our usual criteria on which to keep an alignment if it aligns either uniquely (uncommon), or if the best hit for a given tag has an e-value which is 1e-05 better than any other hit. In this case however, we should also consider alignments where there are 2 hits which are 1e-05 better than the rest. These could represent the two scaffolds for a given region. We can then examine the tags which map to two scaffolds with high fidelity, and see if those two scaffolds are often aligned to by many tags. . . . if so, then we can keep them. Or perhaps just choose one for ALLMAPS (the one with the most hits) and manually assign the other with the same position. 

Lets go . . . 

## Step 1. Make a fasta from the whitelists of linkage mapped RADtags.

Th issue here is that the tag IDs are not preserved in the Lepmap outputs - the tags just get an index based on their order in the input file. 

So I need to make a dictionary of the real vs Lepmap tag_IDs. . . 

In [2]:
from __future__ import division

import numpy as np

In [1]:
### SbfI

N_markers_SbfI = 25572

SbfI_vcf = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/Meitriele_Linkage_map/VCF/batch_1.vcf.altered", 'r').readlines()

real_tag_IDs = {}

marker_number = 1

for line in SbfI_vcf:
    if not line.startswith("#"):
        marker_ID = line.split()[2]
        
        real_tag_IDs[str(marker_number)] = marker_ID
        
        marker_number += 1
        
print len(real_tag_IDs), "SNPs in the VCF"




25572 SNPs in the VCF


In [3]:
## Note that there are tags with more than one SNP in the map, which means that, when aligning, these two markers will be represented by only one alignment
print len(real_tag_IDs.values()), "SNPs in the input VCF"
print len(set(real_tag_IDs.values())), "Tags contain those SNPs"
## But it is important to keep this redundency in the list of real marker IDs at this point, removing it will result in getting different marker names when using the Lepmap tag_IDs as an index for the list of real names. 


25572 SNPs in the input VCF
15937 Tags contain those SNPs


In [29]:
## Here are the tag_IDs from the LepMap outputs

whitelist = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/Meitriele_Linkage_map/LepMap_Fem_linkage_map/SbfI_tags.txt", 'r').readlines()

real_whitelist = []

## as these LepMap IDs are just the order of the markers in the input file, I can use them as an index to extract the 
## real tag ID from the list I made above from the vcf.

for tag in whitelist:
    real_tag_ID = real_tag_IDs[tag.strip()] 
    real_whitelist.append(real_tag_ID)

print len(real_whitelist), "SNPs in the linkage map"    
print "Found in %s RADtags" % len(set(real_whitelist))

11304 SNPs in the linkage map
Found in 8490 RADtags


In [24]:
def fasta_maka(whitey, cat, out = None):

    """
    whitey = whitelist (either a python list or a file path) containing locus IDs in the form of "<Tag_ID>_<Position>"
    cat    = path to the catalog file to get sequences from

    """

    import sys
    import gzip

    if isinstance(whitey, str):
        loci = open(whitey, 'r').readlines()
    elif isinstance(whitey, (list, set)):
        loci = whitey
    else:
        sys.exit("Unknown whitelist format - expected a python list or a file path")

    if cat.endswith("gz"):
        tags = gzip.open(cat, 'r').readlines()
    else:
        tags = open(cat, 'r').readlines()

        
    ## Pull out the locus ID's from the whitelist

    Loc_IDs = []
    
    for locus in loci:
        if locus.startswith("compli"):
            Loc_id = locus.split("_")[1]
        else:
            Loc_id = locus.split("_")[0]
        Loc_IDs.append(Loc_id.strip())

    print "Number of tags in whitelist:",len(Loc_IDs)

    ## Write the fasta

    if not out == None:
        fasta = open(out, 'w')
        outpath = out
    else:
        fasta = open("%s/%s" % (cat.rpartition('/')[0], 'Whitelist_tags.fa'), 'w')
        outpath = "%s/%s" % (cat.rpartition('/')[0], 'Whitelist_tags.fa')

        
    found_IDs = []
    
    count = 0
    for line in tags:
        if 'consensus' in line:
            Tag_ID = line.split()[2]
            if Tag_ID in Loc_IDs:
                found_IDs.append(Tag_ID)
                count+=1
                fasta.write('>'+ Tag_ID +'\n'+line.split()[8]+'\n')

    print count, "sequences written to", outpath

    fasta.close()

In [25]:
import MISC_RAD_tools as MISC

## use the fasta_maka function

catalog_tags = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Meitriele_Linkage_map/Catalog/batch_1.catalog.tags.tsv.gz"

## SbfI

out = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Meitriele_Linkage_map/LepMap_Fem_linkage_map/SbfI_tags.fasta"

## pass the whitelist (with real tag_IDs) to the fasta_maka function, to make the fasta for mapping
fasta_maka(real_whitelist, catalog_tags)


Number of tags in whitelist: 11304
8490 sequences written to /home/djeffrie/Data/Genomes/Rtemp_hybrid/Meitriele_Linkage_map/Catalog/Whitelist_tags.fa
[]


So there were 13544 SNPs in the linkage map, and these were found on 9897 RADtags

### PstI

In [826]:
### PstI

PstI_vcf = open("/home/djeffrie/Data/RADseq/R_temp_fams/NEW/Populations_PstI_pool/batch_1.vcf", 'r').readlines()

real_tag_IDs = {}


marker_number = 1
for line in PstI_vcf:
    if not line.startswith("#"):
        marker_ID = line.split()[2]
        
        real_tag_IDs[str(marker_number)] = marker_ID
        
        marker_number += 1
        
print len(real_tag_IDs), "SNPs in the VCF"
real_tag_IDs_PstI = real_tag_IDs

21611 SNPs in the VCF


In [756]:
## Note that there are tags with more than one SNP in the map, which means that, when aligning, these two markers will be represented by only one alignment
print len(real_tag_IDs_PstI.values()), "SNPs in the input VCF"
print len(set(real_tag_IDs_PstI.values())), "Tags contain those SNPs"
## But it is important to keep this redundency in the list of real marker IDs at this point, removing it will result in getting different marker names when using the Lepmap tag_IDs as an index for the list of real names. 


21611 SNPs in the input VCF
14351 Tags contain those SNPs


In [805]:
## Here are the tag_IDs from the LepMap outputs

whitelist = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_tags.txt", 'r').readlines()

real_whitelist = []

## as these LepMap IDs are just the order of the markers in the input file, I can use them as an index to extract the 
## real tag ID from the list I made above from the vcf.

for tag in whitelist:
    real_tag_ID = real_tag_IDs_PstI[tag.strip()] ## +1 because it is a zero based indexing system.
    real_whitelist.append(real_tag_ID)

print len(real_whitelist), "SNPs in the linkage map"    
real_whitelist_PstI = real_whitelist

5680 SNPs in the linkage map


In [806]:
import MISC_RAD_tools as MISC

## use the fasta_maka function

catalog_tags = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/RADseq_catalog/batch_1.catalog.tags.tsv.gz"

## PstI

out = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_tags.fasta"

## pass the whitelist (with real tag_IDs) to the fasta_maka function, to make the fasta for mapping
MISC.fasta_maka(real_whitelist_PstI, catalog_tags)



Number of tags in whitelist: 5680
4509 sequences written to /home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/RADseq_catalog/Whitelist_tags.fa


### EcoRI

In [929]:
### PstI

EcoRI_vcf = open("/home/djeffrie/Data/RADseq/R_temp_fams/NEW/Populations_C1_kept/batch_1.vcf", 'r').readlines()

real_tag_IDs = {}


marker_number = 1
for line in EcoRI_vcf:
    if not line.startswith("#"):
        marker_ID = line.split()[2]
        
        real_tag_IDs[str(marker_number)] = marker_ID
        
        marker_number += 1
        
print len(real_tag_IDs), "SNPs in the VCF"
real_tag_IDs_EcoRI = real_tag_IDs

13427 SNPs in the VCF


In [769]:
## Note that there are tags with more than one SNP in the map, which means that, when aligning, these two markers will be represented by only one alignment
print len(real_tag_IDs.values()), "SNPs in the input VCF"
print len(set(real_tag_IDs.values())), "Tags contain those SNPs"
## But it is important to keep this redundency in the list of real marker IDs at this point, removing it will result in getting different marker names when using the Lepmap tag_IDs as an index for the list of real names. 


13427 SNPs in the input VCF
10404 Tags contain those SNPs


In [770]:
## Here are the tag_IDs from the LepMap outputs

whitelist = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_tags.txt", 'r').readlines()

real_whitelist = []

## as these LepMap IDs are just the order of the markers in the input file, I can use them as an index to extract the 
## real tag ID from the list I made above from the vcf.

for tag in whitelist:
    real_tag_ID = real_tag_IDs[tag.strip()] ## +1 because it is a zero based indexing system.
    real_whitelist.append(real_tag_ID)

print len(real_whitelist), "SNPs in the linkage map"    

3192 SNPs in the linkage map


In [771]:
import MISC_RAD_tools as MISC

## use the fasta_maka function

catalog_tags = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/RADseq_catalog/batch_1.catalog.tags.tsv.gz"

## SbfI

out = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_tags.fasta"

## pass the whitelist (with real tag_IDs) to the fasta_maka function, to make the fasta for mapping
MISC.fasta_maka(real_whitelist, catalog_tags)



Number of tags in whitelist: 3192
2657 sequences written to /home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/RADseq_catalog/Whitelist_tags.fa


So the final number of tags aligned to the genome from these linkage maps will be:

SbfI = 9897  
PstI = 4287  
EcoRI = 2657


I will now map them


## Step 2. Align the tags to the genome

In [5]:
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
Rtemp_db = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_blastn"

In [494]:
### SbfI 

my_query = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_tags.fasta"

## xml format
blast_outs = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml"
blastn_cline = BLASTn(query=my_query, db=Rtemp_db, outfmt=5, out=blast_outs, num_threads = 7, max_target_seqs = 100)


stdout, stderr = blastn_cline()

In [807]:
### PstI

my_query = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_tags.fasta"

## xml format
blast_outs = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_blast_hits_MTS_100.xml"
blastn_cline = BLASTn(query=my_query, db=Rtemp_db, outfmt=5, out=blast_outs, num_threads = 7, max_target_seqs = 100)

stdout, stderr = blastn_cline()

In [925]:
### EcoRI

my_query = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_tags.fasta"

## xml format
blast_outs = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_blast_hits_MTS_100.xml"
blastn_cline = BLASTn(query=my_query, db=Rtemp_db, outfmt=5, out=blast_outs, num_threads = 7, max_target_seqs = 100)

stdout, stderr = blastn_cline()

## Step 3. Examine the alignment outputs - SbfI

Questions to answer:

- Whats the alignment rate?   
    - How many unique hits?  
    - How many tags with 1 "Best" hit?   


- How accurate are the alignments? 
    - Are the alignment filters optimal?
    

I will do everything for the SbfI map first, to get functions working and get the workflow running, then do PstI and EcoRI maps later. 

### What is the alignment rate (keeping only reliable alignments)?

In [727]:
blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-20
best_hit_criteria = 1e-5
window = 2000
get_frags = 0
verb = 1

Kept_blast_records = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 2575
Number of unique alingments kept: 2260


So in total for SbfI I keep 4835 tags. Thats an alignmnet rate of about 48.8% with these alignment filtering options

### How accurate are the alignments?

So what I want to do next is confirm that tags that map to the same scaffold are all from the same region in the linkage map, as a validation that the mapping has been accurate.

So first I will get the linkage map information for the aligned tags

In [863]:
## Function to get the information from the linkage map

def Get_LM_info(LM_path, real_tag_IDs):
    
    LM = open(LM_path, 'r').readlines()

    Linkage_map_dict = {}
    
    print len(LM)

    for tag in LM[1:]:
        #print tag
        tag_ID = tag.split()[0]  ## remember this needs to be converted to the "real" tag_ID
        real_tag_ID = real_tag_IDs[tag_ID.strip()]
        LG = tag.split()[1]
        POS = tag.split()[2]

        Linkage_map_dict[real_tag_ID] = {}
        Linkage_map_dict[real_tag_ID]["LG"] = LG
        Linkage_map_dict[real_tag_ID]["POS"] = POS
        
    return Linkage_map_dict
        


In [526]:
linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/SbfI_map/MAP_20/map_20_ordered_1_inf_mask23_FEMALE_genetic_mapper.dat"

Linkage_map_dict_SbfI = Get_LM_info(linkage_map_path, real_tag_IDs)

In [528]:
## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_SbfI = {}

for tag in Kept_blast_records:
    if tag in Linkage_map_dict_SbfI:
        Blast_plus_LM_info_SbfI[tag] = {}
        Blast_plus_LM_info_SbfI[tag]["scaff"] = Kept_blast_records[tag]["Ref_hit_id"]
        Blast_plus_LM_info_SbfI[tag]["STRT"] = Kept_blast_records[tag]["Hit_start_coord"]
        Blast_plus_LM_info_SbfI[tag]["LG"] = Linkage_map_dict_SbfI[tag]["LG"]
        Blast_plus_LM_info_SbfI[tag]["POS"] = Linkage_map_dict_SbfI[tag]["POS"]
        

In [502]:
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_ALLMAPS_input.dat", 'w')

for i in Blast_plus_LM_info:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info[i]["scaff"],Blast_plus_LM_info[i]["STRT"],Blast_plus_LM_info[i]["LG"],Blast_plus_LM_info[i]["POS"]))
                  
outfile.close()

So, how consistent are the mappings to the same scaffold?

In [595]:
## get per scaffold information
    
scaff_dict = {}
for record in Blast_plus_LM_info:
    if Blast_plus_LM_info[record]["scaff"] not in scaff_dict:
        scaff_dict[Blast_plus_LM_info[record]["scaff"]] = {}
        scaff_dict[Blast_plus_LM_info[record]["scaff"]][record] = {}
        scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["LG"] = Blast_plus_LM_info[record]["LG"]
        scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["POS"] = Blast_plus_LM_info[record]["POS"]
    else:
        scaff_dict[Blast_plus_LM_info[record]["scaff"]][record] = {}
        scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["LG"] = Blast_plus_LM_info[record]["LG"]
        scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["POS"] = Blast_plus_LM_info[record]["POS"]

In [596]:
scaff_dict

{'Super-Scaffold_347': {u'107019': {'LG': '2', 'POS': '148.77'},
  u'108544': {'LG': '2', 'POS': '148.77'},
  u'20872': {'LG': '2', 'POS': '146.29'},
  u'29074': {'LG': '2', 'POS': '148.77'},
  u'29991': {'LG': '2', 'POS': '148.77'},
  u'92226': {'LG': '2', 'POS': '148.77'},
  u'96867': {'LG': '2', 'POS': '148.77'}},
 'Super-Scaffold_692': {u'25555': {'LG': '2', 'POS': '54.71'},
  u'32959': {'LG': '2', 'POS': '55.64'},
  u'81424': {'LG': '2', 'POS': '52.24'}},
 'Super-Scaffold_2843': {u'30690': {'LG': '4', 'POS': '66.72'}},
 'Super-Scaffold_3178': {u'27868': {'LG': '1', 'POS': '129.13'}},
 'Super-Scaffold_3171': {u'45831': {'LG': '5', 'POS': '173.50'},
  u'77957': {'LG': '5', 'POS': '173.50'}},
 'Super-Scaffold_3170': {u'26922': {'LG': '9', 'POS': '61.55'},
  u'55304': {'LG': '9', 'POS': '62.79'}},
 'Super-Scaffold_1210': {u'89347': {'LG': '5', 'POS': '38.12'}},
 'Super-Scaffold_1995': {u'16131': {'LG': '8', 'POS': '87.25'}},
 'Super-Scaffold_1994': {u'42649': {'LG': '2', 'POS': '94.64

#### Answer: VERY consistent!

### Summary so far 

Ok, so this looks great. The consistency between the tags matching the scaffolds and their positions on the linkage map is almost perfect. 

However, before I proceed to the next step, I want to check that these blast filtering parameters are optimal. If they are too stringent, then perhaps I could squeeze a few more loci out here, and in doing so, anchor more scaffolds.

So I want to quantify the amount of noise that is introduced if I relax the blast filtering parameters. Specifically, if I reduce the minumum evalue thresh for a mapping to 1e-18 or 1e-15, and reduce the difference between the first and second best hits to 1e-4 or 1e-3. 

So I will run through the above again, and each time, I will look at the postion in the linkage maps of tags that map to the same scaffolds. If relaxing the blast parameters results in erroneous mapping of RADtags to scaffolds, then I would see more cases where tags from multiple linkage groups match to the same scaffold. So I will quantify this and use it as a metric to assess the "accuracy" of the mappings given different blast hit filters

### Blast filtering Tests

In [576]:
def evaluate_mapping_efficiency(blast_outs_path, linkage_map_path, N_min_mappings, evalue_thresh, best_hit_criteria):
    
    from collections import Counter
    import operator

    
    ## filter blast hits
    
    Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
    window = 2000
    get_frags = 0
    verb = 1

    Kept_blast_records = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

    ## Get linkage map information
    
    Linkage_map_dict = Get_LM_info(linkage_map_path, real_tag_IDs)
    
    ## compile alignment and linkage mapping info for each tag into a single dictionary

    Blast_plus_LM_info = {}

    for tag in Kept_blast_records:
        if tag in Linkage_map_dict:
            Blast_plus_LM_info[tag] = {}
            Blast_plus_LM_info[tag]["scaff"] = Kept_blast_records[tag]["Ref_hit_id"]
            Blast_plus_LM_info[tag]["STRT"] = Kept_blast_records[tag]["Hit_start_coord"]
            Blast_plus_LM_info[tag]["LG"] = Linkage_map_dict[tag]["LG"]
            Blast_plus_LM_info[tag]["POS"] = Linkage_map_dict[tag]["POS"]
    
    
    ## get per scaffold information
    
    scaff_dict = {}
    for record in Blast_plus_LM_info:
        if Blast_plus_LM_info[record]["scaff"] not in scaff_dict:
            scaff_dict[Blast_plus_LM_info[record]["scaff"]] = {}
            scaff_dict[Blast_plus_LM_info[record]["scaff"]][record] = {}
            scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["LG"] = Blast_plus_LM_info[record]["LG"]
            scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["POS"] = Blast_plus_LM_info[record]["POS"]
        else:
            scaff_dict[Blast_plus_LM_info[record]["scaff"]][record] = {}
            scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["LG"] = Blast_plus_LM_info[record]["LG"]
            scaff_dict[Blast_plus_LM_info[record]["scaff"]][record]["POS"] = Blast_plus_LM_info[record]["POS"]

    
    ## keep only scaffolds with <N_min_mappings> hits
    
    scaff_dict_kept = {}

    for scaff in scaff_dict:
        if len(scaff_dict[scaff]) >= N_min_mappings:
            scaff_dict_kept[scaff] = scaff_dict[scaff]
            

    ## Now count the number of different linkage groups which have been mapped to the same scaffold
    
    consensus_values = []
    for scaff in scaff_dict_kept:
        LGs = []
        for record in scaff_dict_kept[scaff]:
            LGs.append(scaff_dict_kept[scaff][record]["LG"])

        LG_counts = Counter(LGs)
        consensus_LG = max(LG_counts.iteritems(), key=operator.itemgetter(1))[0]  ## the LG with the highest number of mappings to that scaffold

        consensus_value = LG_counts[consensus_LG] / sum(LG_counts.values())

        consensus_values.append(consensus_value)


    ### Outputs

    N_mapped_markers = len(Kept_blast_records)
    N_scaffolds_mapped_to = len(scaff_dict)
    N_scaffs_over_10 = len(scaff_dict_kept)
    mean_consensuses = np.mean(consensus_values)

    print "Total N mapped markers:", N_mapped_markers
    print "N scaffolds mapped to:", N_scaffolds_mapped_to
    print "N scaffolds with >10 mappings:", N_scaffs_over_10
    print "Mean consesnsus value: %.6f" % mean_consensuses

#### SbfI

Eval_thresh = 1e-20  
Best_hit_criteria = 1e-5

In [582]:
blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml"
linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/SbfI_map/MAP_20/map_20_ordered_1_inf_mask23_FEMALE_genetic_mapper.dat"
N_mappings_min = 10
evalue_threshold = 1e-20
best_hit_threshold = 1e-5

evaluate_mapping_efficiency(blast_outs_path, linkage_map_path, N_mappings_min, evalue_threshold, best_hit_threshold)


Number of multi-alingments kept: 2575
Number of unique alingments kept: 2260
Total N mapped markers: 4835
N scaffolds mapped to: 1646
N scaffolds with >10 mappings: 82
Mean consesnsus value: 0.992328


Eval_thresh = 1e-18  
Best_hit_criteria = 1e-4

In [583]:
blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml"
linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/SbfI_map/MAP_20/map_20_ordered_1_inf_mask23_FEMALE_genetic_mapper.dat"
N_mappings_min = 10
evalue_threshold = 1e-18
best_hit_threshold = 1e-4

evaluate_mapping_efficiency(blast_outs_path, linkage_map_path, N_mappings_min, evalue_threshold, best_hit_threshold)


Number of multi-alingments kept: 2647
Number of unique alingments kept: 2263
Total N mapped markers: 4910
N scaffolds mapped to: 1660
N scaffolds with >10 mappings: 82
Mean consesnsus value: 0.992337


Eval_thresh = 1e-16  
Best_hit_criteria = 1e-3

In [584]:
blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml"
linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/SbfI_map/MAP_20/map_20_ordered_1_inf_mask23_FEMALE_genetic_mapper.dat"
N_mappings_min = 10
evalue_threshold = 1e-16
best_hit_threshold = 1e-3

evaluate_mapping_efficiency(blast_outs_path, linkage_map_path, N_mappings_min, evalue_threshold, best_hit_threshold)


Number of multi-alingments kept: 3204
Number of unique alingments kept: 2272
Total N mapped markers: 5476
N scaffolds mapped to: 1822
N scaffolds with >10 mappings: 94
Mean consesnsus value: 0.990366


### Summary

So it looks like it is the best_hit_criteria that is really responsible for the change in the number of loci retained. But there doesn't seem to be too much of a drop in the mapping efficiency even when using eval = 1e-16 and the best_hit_diff eval of 1e-3. Yet I can retain an extra 600 or so mappings, almost 200 extra scaffolds. So I think it is worth running ALLMAPS with both strict and relaxed parameters. So from here, I will take the strict (eval = 1e-20 and the best_hit_diff eval of 1e-5) and relaxed (eval = 1e-16 and the best_hit_diff eval of 1e-3) filtering outputs from each map, and use them in two separate runs of ALLMAPS.

## Step 4. Identifying homologous scaffolds in the assembly

Our assembly is about 150% the size (in bp) of the expected genome size. This is likely due to the fact that some regions of the genome are represented by 2 scaffolds, which each represent one of the haplotypes at that position. 

I would like to try to identify the pairs of scaffolds for which this is the case, for two reasons:

- I am not sure how this will affect the ALLMAPS analysis, so it seems best to remove one of each pair
- For future studies mapping to the genome, it will be useful to know which scaffolds are homologous.

So I will try to do this using the linkage mapped RADtags. . . My theory is that, a tag which lies in a region represented by two scaffolds will likely map to both scaffolds with roughly equal efficiency. So I can look for pairs of scaffolds where multiple tags map to both of them with high probability. I will use a similar approach to the normal blast filtering here, i.e. tags must map to both with a max evalue of 1e-20, and the first two hits of that tag must have evalues which are 1e-5 times lower than the third best hit.

If I find some pairs of scaffolds with several tags which map to both reliably, then I can check the position of those RADtags in the linkage map. If the tags are close to eachother in the map, then this is a strong validation that these scaffolds are indeed homologous, and so I can name them as such in the genome assembly. For these scaffolds, I can also remove one of each pair from the ALLMAPS inputs, hopefully this will avoid any problems they might have caused. 


### So how many tags consistently hit two scaffolds?

In [505]:
evalue_thresh = 1e-20
best_hit_criteria = 1e-5

Co_alignments = {}
unique = 0
Best_multi = 0
Two_best_multi = 0

N_records = 0

blast_outs_handle = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml", 'r')
blastouts = NCBIXML.parse(blast_outs_handle)

for record in blastouts:
    
    if len(record.alignments) == 1:
        if record.alignments[0].hsps[0].expect <= evalue_thresh:
            unique += 1
        
    elif len(record.alignments) == 2:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh, record.alignments[0].hsps[0].expect < best_hit_criteria * record.alignments[1].hsps[0].expect]): ## if the first alignmnet has an eval lower than 1e-5 * the eval of the second alignment.
            Best_multi += 1

    
    #print record.alignments[0].hsps[0]
    elif len(record.alignments) >= 3:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh, record.alignments[0].hsps[0].expect < best_hit_criteria * record.alignments[1].hsps[0].expect]): ## if the first alignmnet has an eval lower than 1e-5 * the eval of the second alignment.
            Best_multi += 1

        elif all([all([record.alignments[0].hsps[0].expect <= evalue_thresh,record.alignments[1].hsps[0].expect <= evalue_thresh,record.alignments[0].hsps[0].expect > best_hit_criteria * record.alignments[1].hsps[0].expect,record.alignments[1].hsps[0].expect < best_hit_criteria * record.alignments[2].hsps[0].expect])]): ## and if the 2 alignment has an eval lower than 1e-5 * the eval of the third alignment
            Two_best_multi += 1
            
            key_1 = "%s %s" % (record.alignments[0].title, record.alignments[1].title)
            key_2 = "%s %s" % (record.alignments[1].title, record.alignments[0].title)
    
            if key_1 in Co_alignments:
                Co_alignments[key_1].append(record.query)
            elif key_2 in Co_alignments:
                Co_alignments[key_2].append(record.query)
            else:
                Co_alignments[key_1] = [record.query]
    


    N_records += 1

print "Unique = ", unique
print "Best single hit from multi", Best_multi
print "Best two hits from multi", Two_best_multi
    
#Co_alignments


Unique =  2260
Best single hit from multi 2575
Best two hits from multi 1351


So on top of the 4835 tags which I would usually keep, there are 1351 tags which map to a pair scaffolds more reliably than any other scaffolds. 

So how many of these scaffold pairs are mapped to by more than one tag?

In [602]:
counter = 0
for i in Co_alignments:
    if len(Co_alignments[i]) >1:
        counter +=1
        print i, len(Co_alignments[i])
counter 

gnl|BL_ORD_ID|3164 Super-Scaffold_3165 gnl|BL_ORD_ID|767 Super-Scaffold_768 3
gnl|BL_ORD_ID|1303 Super-Scaffold_1304 gnl|BL_ORD_ID|766 Super-Scaffold_767 2
gnl|BL_ORD_ID|1767 Super-Scaffold_1768 gnl|BL_ORD_ID|1768 Super-Scaffold_1769 3
gnl|BL_ORD_ID|2089 Super-Scaffold_2090 gnl|BL_ORD_ID|550 Super-Scaffold_551 3
gnl|BL_ORD_ID|3080 Super-Scaffold_3081 gnl|BL_ORD_ID|949 Super-Scaffold_950 3
gnl|BL_ORD_ID|1435 Super-Scaffold_1436 gnl|BL_ORD_ID|719 Super-Scaffold_720 2
gnl|BL_ORD_ID|1564 Super-Scaffold_1565 gnl|BL_ORD_ID|198 Super-Scaffold_199 5
gnl|BL_ORD_ID|989 Super-Scaffold_990 gnl|BL_ORD_ID|145 Super-Scaffold_146 3
gnl|BL_ORD_ID|4575 Super-Scaffold_10000023259 gnl|BL_ORD_ID|1829 Super-Scaffold_1830 2
gnl|BL_ORD_ID|1812 Super-Scaffold_1813 gnl|BL_ORD_ID|1361 Super-Scaffold_1362 2
gnl|BL_ORD_ID|4493 Super-Scaffold_10000012119 gnl|BL_ORD_ID|616 Super-Scaffold_617 3
gnl|BL_ORD_ID|1793 Super-Scaffold_1794 gnl|BL_ORD_ID|497 Super-Scaffold_498 2
gnl|BL_ORD_ID|1481 Super-Scaffold_1482 gnl|BL_

265

265 pairs are mapped to by more than one tag. Lets just take a look at one of those pairs, to see where the tags that map to them come from in the SbfI linkage map

In [604]:
test_loci = Co_alignments["gnl|BL_ORD_ID|1435 Super-Scaffold_1436 gnl|BL_ORD_ID|719 Super-Scaffold_720"]

for i in test_loci:
    print Linkage_map_dict_SbfI[i]
    

{'LG': '10', 'POS': '59.68'}
{'LG': '10', 'POS': '60.61'}


In [600]:
test_loci = Co_alignments["gnl|BL_ORD_ID|526 Super-Scaffold_527 gnl|BL_ORD_ID|190 Super-Scaffold_191"]

for i in test_loci:
    print Linkage_map_dict_SbfI[i]
    

{'LG': '5', 'POS': '72.47'}
{'LG': '5', 'POS': '70.61'}
{'LG': '5', 'POS': '74.01'}
{'LG': '5', 'POS': '74.01'}
{'LG': '5', 'POS': '72.47'}
{'LG': '5', 'POS': '72.47'}


Ok, so they all seem to come from a 2cM region of LG 5 for this last pair! Thats a pretty strong validation that these two scaffolds are indeed homologous. And even for the pair with only 2 markers, the linkage map positions are very consistent. So this seems to be a very good way of identifying homologous scaffolds. 

### Now I will filter these co-alignment scaffold pairs

I will keep pairs with 2 or more markers, where all markers come from the same linkage group. Note I will only do this with the strict blast filters, as adding noise at this stage could be bad news.

In [613]:
Passed_scaffold_pairs = []
Failed_scaffold_pairs = []

for scaffold_pair in Co_alignments:
    if len(Co_alignments[scaffold_pair]) >= 2:
        LGs = []
        for locus in Co_alignments[scaffold_pair]:
            locus_LG = Linkage_map_dict_SbfI[locus]["LG"]
            LGs.append(locus_LG)
        if len(set(LGs)) == 1:
            Passed_scaffold_pairs.append(scaffold_pair)
        else:
            Failed_scaffold_pairs.append(scaffold_pair)

In [614]:
print len(Passed_scaffold_pairs), "scaffolds passed"
print len(Failed_scaffold_pairs), "scaffolds failed"

261 scaffolds passed
4 scaffolds failed


Thats a pretty good pass rate! But lets take a look at the 4 that failed

In [617]:
test_loci = Co_alignments["gnl|BL_ORD_ID|756 Super-Scaffold_757 gnl|BL_ORD_ID|124 Super-Scaffold_125"]

for pair in Failed_scaffold_pairs:
    loci = Co_alignments[pair]
   
    for i in loci:
        print pair, Linkage_map_dict_SbfI[i]

gnl|BL_ORD_ID|3006 Super-Scaffold_3007 gnl|BL_ORD_ID|2066 Super-Scaffold_2067 {'LG': '1', 'POS': '165.00'}
gnl|BL_ORD_ID|3006 Super-Scaffold_3007 gnl|BL_ORD_ID|2066 Super-Scaffold_2067 {'LG': '2', 'POS': '81.97'}
gnl|BL_ORD_ID|283 Super-Scaffold_284 gnl|BL_ORD_ID|900 Super-Scaffold_901 {'LG': '5', 'POS': '88.57'}
gnl|BL_ORD_ID|283 Super-Scaffold_284 gnl|BL_ORD_ID|900 Super-Scaffold_901 {'LG': '5', 'POS': '87.33'}
gnl|BL_ORD_ID|283 Super-Scaffold_284 gnl|BL_ORD_ID|900 Super-Scaffold_901 {'LG': '8', 'POS': '69.58'}
gnl|BL_ORD_ID|756 Super-Scaffold_757 gnl|BL_ORD_ID|124 Super-Scaffold_125 {'LG': '11', 'POS': '45.02'}
gnl|BL_ORD_ID|756 Super-Scaffold_757 gnl|BL_ORD_ID|124 Super-Scaffold_125 {'LG': '11', 'POS': '43.17'}
gnl|BL_ORD_ID|756 Super-Scaffold_757 gnl|BL_ORD_ID|124 Super-Scaffold_125 {'LG': '6', 'POS': '82.43'}
gnl|BL_ORD_ID|1796 Super-Scaffold_1797 gnl|BL_ORD_ID|556 Super-Scaffold_557 {'LG': '5', 'POS': '175.01'}
gnl|BL_ORD_ID|1796 Super-Scaffold_1797 gnl|BL_ORD_ID|556 Super-Scaff

So two out of these 4 could be salvaged - they have 2 consistent mappings, and one false mapping. The other two have only 2 mappings, so knowing if one is correct is not possible. 

So I will add the salvagable ones to the passed list

In [618]:
Passed_scaffold_pairs.append("gnl|BL_ORD_ID|283 Super-Scaffold_284 gnl|BL_ORD_ID|900 Super-Scaffold_901")
Passed_scaffold_pairs.append("gnl|BL_ORD_ID|756 Super-Scaffold_757 gnl|BL_ORD_ID|124 Super-Scaffold_125")

So how many scaffolds can I retain by including these in the analyses?

In [622]:
print "%s scaffolds" % str(len(Passed_scaffold_pairs))

263 scaffolds


That means I also remove 263 scaffolds from the assembly essentially, or at least account for 263 scaffolds-worth of the assembly length. I can take a look at the amount that this corresponds to later, to see how much it brings down the overall assembly length. 

Next then, I would like to remove one of the scaffold pairs, and retain the other for use in the ALLMAPS analysis. I would like to retain the one with the highest number of total hits. . . that could include hits not already counted for the pair, as there may be tags which mapped to only one in each pair. 

So, for each pair, I will search the individual scaffolds against the retained blast hits, to count for each one individually, the number of tags which aligned to it.

### Remove one of the homologous scaffold pairs

This isn't as simple as it could be, as I have just realised that there are some scaffolds that are in more than one pair of co-aligned scaffs. So How to deal with these . . . ?

In [686]:
retained_scaffolds_from_coaligned_pairs = {}
set_aside_scaffolds_from_coaligned_pairs = {}

count = 0 

total_mappings = {}

for scaffold_pair in Passed_scaffold_pairs:
    scaffold_1 = scaffold_pair.split()[1]
    scaffold_2 = scaffold_pair.split()[3]
    
    if scaffold_1 not in total_mappings:
        total_mappings[scaffold_1] = []
    if scaffold_2 not in total_mappings:
        total_mappings[scaffold_2] = []

    
    for record in Kept_blast_records:
        if scaffold_1 == Kept_blast_records[record]["Ref_hit_id"]:
            total_mappings[scaffold_1].append(record)

        elif scaffold_2 == Kept_blast_records[record]["Ref_hit_id"]:
            total_mappings[scaffold_2].append(record)


In [689]:
len(Passed_scaffold_pairs)

263

In [703]:
retained_scaffolds_from_coaligned_pairs = {}
set_aside_scaffolds_from_coaligned_pairs = {}

for scaffold_pair in Passed_scaffold_pairs:
    scaffold_1 = scaffold_pair.split()[1]
    scaffold_2 = scaffold_pair.split()[3]
    
    if len(total_mappings[scaffold_1]) > len(total_mappings[scaffold_2]):
        if scaffold_1 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_1] = [scaffold_2]
        if scaffold_2 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]

        
    elif len(total_mappings[scaffold_2]) > len(total_mappings[scaffold_1]):
        if scaffold_2 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]
            
        if scaffold_1 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]
        
    elif len(total_mappings[scaffold_1]) == len(total_mappings[scaffold_2]):
        if scaffold_1 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_1] = [scaffold_2]
        if scaffold_2 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]


        

In [700]:
len(retained_scaffolds_from_coaligned_pairs)

239

In [701]:
len(set_aside_scaffolds_from_coaligned_pairs)

250

Ok, so I will retain 239 of these scaffolds, and set aside 250 of them for the ALLMAPS analyses, but I will keep those 250 in the genome assembly and keep note that they are homolgous with the others. I will probably add this info to the headers later on. 

## Output for ALLMAPS SbfI

So now I will output the SbfI data for allmaps, thats the retained blast hits containing unique mappings and the multiple mappings with ONE clear winner, and added to these scaffolds will be the mappings to the retained scaffolds as well. 

In [894]:
### First get the mapping info for the retained co-alignment scaffolds

blast_outs_handle = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_blast_hits_MTS_100.xml", 'r')
blastouts = NCBIXML.parse(blast_outs_handle)

retained_co_alignments_mapping_info = {}
count = 0
for record in blastouts:
    if len(record.alignments) >= 3:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh,record.alignments[1].hsps[0].expect <= evalue_thresh,record.alignments[0].hsps[0].expect > best_hit_criteria * record.alignments[1].hsps[0].expect,record.alignments[1].hsps[0].expect < best_hit_criteria * record.alignments[2].hsps[0].expect]): ## and if the 2 alignment has an eval lower than 1e-5 * the eval of the third alignment
            count += 1
            if record.alignments[0].title.split()[1] in retained_scaffolds_from_coaligned_pairs or record.alignments[1].title.split()[1] in retained_scaffolds_from_coaligned_pairs:
                retained_co_alignments_mapping_info[record.query] = {}
                retained_co_alignments_mapping_info[record.query]["Ref_hit_id"] = str(record.alignments[0].hit_def)
                retained_co_alignments_mapping_info[record.query]["Evalue"] = float(record.alignments[0].hsps[0].expect)
                retained_co_alignments_mapping_info[record.query]["Hit_start_coord"] = int(record.alignments[0].hsps[0].sbjct_start)
                retained_co_alignments_mapping_info[record.query]["Hit_end_coord"] = int(record.alignments[0].hsps[0].sbjct_end)
print count

IOError: [Errno 2] No such file or directory: '/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_only/SbfI_blast_hits_MTS_100.xml'

In [713]:
len(retained_co_alignments_mapping_info)

774

In [718]:
help(a.update)

Help on built-in function update:

update(...)
    D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
    If E present and has a .keys() method, does:     for k in E: D[k] = E[k]
    If E present and lacks .keys() method, does:     for (k, v) in E: D[k] = v
    In either case, this is followed by: for k in F: D[k] = F[k]



In [891]:
## Now make the ALLMAPS inputs

All_ALLMAPS_records = dict(Kept_blast_records.items() + retained_co_alignments_mapping_info.items())


In [735]:
print len(Kept_blast_records)
print len(retained_co_alignments_mapping_info)
print len(All_ALLMAPS_records)

4835
774
5609


In [892]:
## Get linkage map info

linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/SbfI_map/MAP_20/map_20_ordered_1_inf_mask23_FEMALE_genetic_mapper.dat"
Linkage_map_dict_SbfI = Get_LM_info(linkage_map_path, real_tag_IDs)

## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_SbfI = {}

for tag in All_ALLMAPS_records:
    if tag in Linkage_map_dict_SbfI:
        Blast_plus_LM_info_SbfI[tag] = {}
        Blast_plus_LM_info_SbfI[tag]["scaff"] = All_ALLMAPS_records[tag]["Ref_hit_id"]
        Blast_plus_LM_info_SbfI[tag]["STRT"] = All_ALLMAPS_records[tag]["Hit_start_coord"]
        Blast_plus_LM_info_SbfI[tag]["LG"] = Linkage_map_dict_SbfI[tag]["LG"]
        Blast_plus_LM_info_SbfI[tag]["POS"] = Linkage_map_dict_SbfI[tag]["POS"]
        
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_ALLMAPS_with_coaligned_input.dat", 'w')

for i in Blast_plus_LM_info_SbfI:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info_SbfI[i]["scaff"],Blast_plus_LM_info_SbfI[i]["STRT"],Blast_plus_LM_info_SbfI[i]["LG"],Blast_plus_LM_info_SbfI[i]["POS"]))
                  
outfile.close()

13629


KeyError: '24183'

So I ran this through ALLMAPS, and already the output looks very nice. . . . but I will process the other maps and add them to the analyses before I take a look at all the outputs.  First though, I will output the SbfI data with the relexed blast outputs too.

### SbfI relaxed blast filters

In [738]:
## get the relaxed blast outputs

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_only/SbfI_blast_hits_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-16
best_hit_criteria = 1e-3
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_relaxed = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 3204
Number of unique alingments kept: 2272


In [739]:
## add the coalignment scaffs

All_ALLMAPS_records_relaxed = dict(Kept_blast_records_relaxed.items() + retained_co_alignments_mapping_info.items())

In [743]:
## Get linkage map info

linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/SbfI_map/MAP_20/map_20_ordered_1_inf_mask23_FEMALE_genetic_mapper.dat"
Linkage_map_dict_SbfI = Get_LM_info(linkage_map_path, real_tag_IDs)

## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_SbfI = {}

for tag in All_ALLMAPS_records_relaxed:
    if tag in Linkage_map_dict_SbfI:
        Blast_plus_LM_info_SbfI[tag] = {}
        Blast_plus_LM_info_SbfI[tag]["scaff"] = All_ALLMAPS_records_relaxed[tag]["Ref_hit_id"]
        Blast_plus_LM_info_SbfI[tag]["STRT"] = All_ALLMAPS_records_relaxed[tag]["Hit_start_coord"]
        Blast_plus_LM_info_SbfI[tag]["LG"] = Linkage_map_dict_SbfI[tag]["LG"]
        Blast_plus_LM_info_SbfI[tag]["POS"] = Linkage_map_dict_SbfI[tag]["POS"]
        
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/SbfI_ALLMAPS_with_coaligned_input_relaxed.dat", 'w')

for i in Blast_plus_LM_info_SbfI:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info_SbfI[i]["scaff"],Blast_plus_LM_info_SbfI[i]["STRT"],Blast_plus_LM_info_SbfI[i]["LG"],Blast_plus_LM_info_SbfI[i]["POS"]))
                  
outfile.close()

Kept an extra 2% of the genome, not bad really, still see how things look with all maps both relaxed and strict

## No make outputs for PstI

So for PstI I need to find the multi mappings, as before, get relaxed and strict blast outs, merge them with the coaligned scaffolds, and then output.

In [898]:
## strict blast filtered

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_blast_hits_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-20
best_hit_criteria = 1e-5
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_PstI_strict = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 934
Number of unique alingments kept: 1011


In [899]:
## strict blast filtered

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_blast_hits_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-16
best_hit_criteria = 1e-3
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_PstI_relaxed = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 1247
Number of unique alingments kept: 1024


Now get co-aligned scaffold pairs

In [900]:
evalue_thresh = 1e-20
best_hit_criteria = 1e-5

Co_alignments = {}
unique = 0
Best_multi = 0
Two_best_multi = 0

N_records = 0

blast_outs_handle = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_blast_hits_MTS_100.xml", 'r')
blastouts = NCBIXML.parse(blast_outs_handle)

for record in blastouts:
    
    if len(record.alignments) == 1:
        if record.alignments[0].hsps[0].expect <= evalue_thresh:
            unique += 1
        
    elif len(record.alignments) == 2:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh, record.alignments[0].hsps[0].expect < best_hit_criteria * record.alignments[1].hsps[0].expect]): ## if the first alignmnet has an eval lower than 1e-5 * the eval of the second alignment.
            Best_multi += 1

    
    #print record.alignments[0].hsps[0]
    elif len(record.alignments) >= 3:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh, record.alignments[0].hsps[0].expect < best_hit_criteria * record.alignments[1].hsps[0].expect]): ## if the first alignmnet has an eval lower than 1e-5 * the eval of the second alignment.
            Best_multi += 1

        elif all([all([record.alignments[0].hsps[0].expect <= evalue_thresh,record.alignments[1].hsps[0].expect <= evalue_thresh,record.alignments[0].hsps[0].expect > best_hit_criteria * record.alignments[1].hsps[0].expect,record.alignments[1].hsps[0].expect < best_hit_criteria * record.alignments[2].hsps[0].expect])]): ## and if the 2 alignment has an eval lower than 1e-5 * the eval of the third alignment
            Two_best_multi += 1
            
            key_1 = "%s %s" % (record.alignments[0].title, record.alignments[1].title)
            key_2 = "%s %s" % (record.alignments[1].title, record.alignments[0].title)
    
            if key_1 in Co_alignments:
                Co_alignments[key_1].append(record.query)
            elif key_2 in Co_alignments:
                Co_alignments[key_2].append(record.query)
            else:
                Co_alignments[key_1] = [record.query]
    


    N_records += 1

print "Unique = ", unique
print "Best single hit from multi", Best_multi
print "Best two hits from multi", Two_best_multi
    
#Co_alignments


Unique =  1011
Best single hit from multi 934
Best two hits from multi 460


In [919]:
linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/PstI_map/MAP_15/Inf_mask23/map_15_ordered_1_FEMALE_genetic_mapper.dat"

Linkage_map_dict_PstI = Get_LM_info(linkage_map_path, real_tag_IDs_PstI)

5681


In [902]:
counter = 0
for i in Co_alignments:
    if len(Co_alignments[i]) >1:
        counter +=1
        print i, Co_alignments[i]
counter 

gnl|BL_ORD_ID|392 Super-Scaffold_393 gnl|BL_ORD_ID|1494 Super-Scaffold_1495 [u'162503', u'246771']
gnl|BL_ORD_ID|2089 Super-Scaffold_2090 gnl|BL_ORD_ID|550 Super-Scaffold_551 [u'235049', u'294839']
gnl|BL_ORD_ID|4567 Super-Scaffold_10000022714 gnl|BL_ORD_ID|500 Super-Scaffold_501 [u'183938', u'207464']
gnl|BL_ORD_ID|352 Super-Scaffold_353 gnl|BL_ORD_ID|1866 Super-Scaffold_1867 [u'280810', u'302907']
gnl|BL_ORD_ID|2982 Super-Scaffold_2983 gnl|BL_ORD_ID|483 Super-Scaffold_484 [u'209571', u'287558']
gnl|BL_ORD_ID|858 Super-Scaffold_859 gnl|BL_ORD_ID|38 Super-Scaffold_39 [u'280602', u'283619']
gnl|BL_ORD_ID|1442 Super-Scaffold_1443 gnl|BL_ORD_ID|62 Super-Scaffold_63 [u'18917', u'185025', u'223440', u'278948']
gnl|BL_ORD_ID|2130 Super-Scaffold_2131 gnl|BL_ORD_ID|3688 Super-Scaffold_3689 [u'219612', u'295176']
gnl|BL_ORD_ID|299 Super-Scaffold_300 gnl|BL_ORD_ID|14 Super-Scaffold_15 [u'253913', u'322762']
gnl|BL_ORD_ID|1424 Super-Scaffold_1425 gnl|BL_ORD_ID|479 Super-Scaffold_480 [u'200479', u

57

Ok, so there are 57 tags scaffol pairs which have 2 or more tags which map to them . . . 

Now filter these for those where the tags come from the same linkage group.

In [903]:
Passed_scaffold_pairs = []
Failed_scaffold_pairs = []

for scaffold_pair in Co_alignments:
    if len(Co_alignments[scaffold_pair]) >= 2:
        LGs = []
        for locus in Co_alignments[scaffold_pair]:
            locus_LG = Linkage_map_dict_PstI[locus]["LG"]
            LGs.append(locus_LG)
        if len(set(LGs)) == 1:
            Passed_scaffold_pairs.append(scaffold_pair)
        else:
            Failed_scaffold_pairs.append(scaffold_pair)

Take a look at the 1 scaffold that failed

In [904]:
for i in Co_alignments["gnl|BL_ORD_ID|2369 Super-Scaffold_2370 gnl|BL_ORD_ID|1595 Super-Scaffold_1596"]:
        print i, Linkage_map_dict_PstI[i]

228113 {'LG': '1', 'POS': '324.46'}
308140 {'LG': '5', 'POS': '86.15'}


Yep, can't do anything about that. . . .so now I will find the scaffold in each pair with the most hits, retain these and set the others aside. 

In [905]:
retained_scaffolds_from_coaligned_pairs = {}
set_aside_scaffolds_from_coaligned_pairs = {}

count = 0 

total_mappings = {}

for scaffold_pair in Passed_scaffold_pairs:
    scaffold_1 = scaffold_pair.split()[1]
    scaffold_2 = scaffold_pair.split()[3]
    
    if scaffold_1 not in total_mappings:
        total_mappings[scaffold_1] = []
    if scaffold_2 not in total_mappings:
        total_mappings[scaffold_2] = []

    
    for record in Kept_blast_records_PstI_strict:
        if scaffold_1 == Kept_blast_records_PstI_strict[record]["Ref_hit_id"]:
            total_mappings[scaffold_1].append(record)

        elif scaffold_2 == Kept_blast_records_PstI_strict[record]["Ref_hit_id"]:
            total_mappings[scaffold_2].append(record)


In [906]:
retained_scaffolds_from_coaligned_pairs = {}
set_aside_scaffolds_from_coaligned_pairs = {}

for scaffold_pair in Passed_scaffold_pairs:
    scaffold_1 = scaffold_pair.split()[1]
    scaffold_2 = scaffold_pair.split()[3]
    
    if len(total_mappings[scaffold_1]) > len(total_mappings[scaffold_2]):
        if scaffold_1 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_1] = [scaffold_2]
        if scaffold_2 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]

        
    elif len(total_mappings[scaffold_2]) > len(total_mappings[scaffold_1]):
        if scaffold_2 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]
            
        if scaffold_1 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]
        
    elif len(total_mappings[scaffold_1]) == len(total_mappings[scaffold_2]):
        if scaffold_1 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_1] = [scaffold_2]
        if scaffold_2 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]


        

### Now make the strict ALLMAPS input for PstI

In [908]:
### First get the mapping info for the retained co-alignment scaffolds

blast_outs_handle = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_blast_hits_MTS_100.xml", 'r')
blastouts = NCBIXML.parse(blast_outs_handle)

retained_co_alignments_mapping_info = {}
count = 0
for record in blastouts:
    if len(record.alignments) >= 3:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh,record.alignments[1].hsps[0].expect <= evalue_thresh,record.alignments[0].hsps[0].expect > best_hit_criteria * record.alignments[1].hsps[0].expect,record.alignments[1].hsps[0].expect < best_hit_criteria * record.alignments[2].hsps[0].expect]): ## and if the 2 alignment has an eval lower than 1e-5 * the eval of the third alignment
            count += 1
            if record.alignments[0].title.split()[1] in retained_scaffolds_from_coaligned_pairs or record.alignments[1].title.split()[1] in retained_scaffolds_from_coaligned_pairs:
                retained_co_alignments_mapping_info[record.query] = {}
                retained_co_alignments_mapping_info[record.query]["Ref_hit_id"] = str(record.alignments[0].hit_def)
                retained_co_alignments_mapping_info[record.query]["Evalue"] = float(record.alignments[0].hsps[0].expect)
                retained_co_alignments_mapping_info[record.query]["Hit_start_coord"] = int(record.alignments[0].hsps[0].sbjct_start)
                retained_co_alignments_mapping_info[record.query]["Hit_end_coord"] = int(record.alignments[0].hsps[0].sbjct_end)
print count

460


### Now make the ALLMAPS inputs - Strict blast filters

In [921]:
## Now make the ALLMAPS inputs

All_ALLMAPS_records_strict_PstI = dict(Kept_blast_records_PstI_strict.items() + retained_co_alignments_mapping_info.items())


In [916]:
print len(Kept_blast_records_PstI_strict)
print len(retained_co_alignments_mapping_info)
print len(All_ALLMAPS_records)

1945
141
2086


In [956]:
## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_PstI = {}

for tag in All_ALLMAPS_records_strict_PstI:
    if tag in Linkage_map_dict_PstI:
        Blast_plus_LM_info_PstI[tag] = {}
        Blast_plus_LM_info_PstI[tag]["scaff"] = All_ALLMAPS_records_strict_PstI[tag]["Ref_hit_id"]
        Blast_plus_LM_info_PstI[tag]["STRT"] = All_ALLMAPS_records_strict_PstI[tag]["Hit_start_coord"]
        Blast_plus_LM_info_PstI[tag]["LG"] = Linkage_map_dict_PstI[tag]["LG"]
        Blast_plus_LM_info_PstI[tag]["POS"] = Linkage_map_dict_PstI[tag]["POS"]

print len(Blast_plus_LM_info_PstI)
        
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_ALLMAPS_with_coaligned_input_strict.dat", 'w')

for i in Blast_plus_LM_info_PstI:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info_PstI[i]["scaff"],Blast_plus_LM_info_PstI[i]["STRT"],Blast_plus_LM_info_PstI[i]["LG"],Blast_plus_LM_info_PstI[i]["POS"]))
                  
outfile.close()

2086


### Now make the relaxed ALLMAPS input for PstI

In [923]:
## Now make the ALLMAPS inputs

All_ALLMAPS_records_relaxed_PstI = dict(Kept_blast_records_PstI_relaxed.items() + retained_co_alignments_mapping_info.items())
print len(All_ALLMAPS_records_relaxed_PstI)

2398


In [957]:
## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_PstI_relaxed = {}

for tag in All_ALLMAPS_records_relaxed_PstI:
    if tag in Linkage_map_dict_PstI:
        Blast_plus_LM_info_PstI_relaxed[tag] = {}
        Blast_plus_LM_info_PstI_relaxed[tag]["scaff"] = All_ALLMAPS_records_relaxed_PstI[tag]["Ref_hit_id"]
        Blast_plus_LM_info_PstI_relaxed[tag]["STRT"] = All_ALLMAPS_records_relaxed_PstI[tag]["Hit_start_coord"]
        Blast_plus_LM_info_PstI_relaxed[tag]["LG"] = Linkage_map_dict_PstI[tag]["LG"]
        Blast_plus_LM_info_PstI_relaxed[tag]["POS"] = Linkage_map_dict_PstI[tag]["POS"]

print len(Blast_plus_LM_info_PstI_relaxed)
        
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/PstI_ALLMAPS_with_coaligned_input_relaxed.dat", 'w')

for i in Blast_plus_LM_info_PstI_relaxed:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info_PstI_relaxed[i]["scaff"],Blast_plus_LM_info_PstI_relaxed[i]["STRT"],Blast_plus_LM_info_PstI_relaxed[i]["LG"],Blast_plus_LM_info_PstI_relaxed[i]["POS"]))
                  
outfile.close()

2398


## And finally do the same for EcoRI

In [945]:
## strict blast filtered

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_blast_hits_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-20
best_hit_criteria = 1e-5
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_EcoRI_strict = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 665
Number of unique alingments kept: 741


In [946]:
## strict blast filtered

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_blast_hits_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-16
best_hit_criteria = 1e-3
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_EcoRI_relaxed = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 849
Number of unique alingments kept: 743


Now get co-aligned scaffold pairs

In [928]:
evalue_thresh = 1e-20
best_hit_criteria = 1e-5

Co_alignments = {}
unique = 0
Best_multi = 0
Two_best_multi = 0

N_records = 0

blast_outs_handle = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_blast_hits_MTS_100.xml", 'r')
blastouts = NCBIXML.parse(blast_outs_handle)

for record in blastouts:
    
    if len(record.alignments) == 1:
        if record.alignments[0].hsps[0].expect <= evalue_thresh:
            unique += 1
        
    elif len(record.alignments) == 2:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh, record.alignments[0].hsps[0].expect < best_hit_criteria * record.alignments[1].hsps[0].expect]): ## if the first alignmnet has an eval lower than 1e-5 * the eval of the second alignment.
            Best_multi += 1

    
    #print record.alignments[0].hsps[0]
    elif len(record.alignments) >= 3:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh, record.alignments[0].hsps[0].expect < best_hit_criteria * record.alignments[1].hsps[0].expect]): ## if the first alignmnet has an eval lower than 1e-5 * the eval of the second alignment.
            Best_multi += 1

        elif all([all([record.alignments[0].hsps[0].expect <= evalue_thresh,record.alignments[1].hsps[0].expect <= evalue_thresh,record.alignments[0].hsps[0].expect > best_hit_criteria * record.alignments[1].hsps[0].expect,record.alignments[1].hsps[0].expect < best_hit_criteria * record.alignments[2].hsps[0].expect])]): ## and if the 2 alignment has an eval lower than 1e-5 * the eval of the third alignment
            Two_best_multi += 1
            
            key_1 = "%s %s" % (record.alignments[0].title, record.alignments[1].title)
            key_2 = "%s %s" % (record.alignments[1].title, record.alignments[0].title)
    
            if key_1 in Co_alignments:
                Co_alignments[key_1].append(record.query)
            elif key_2 in Co_alignments:
                Co_alignments[key_2].append(record.query)
            else:
                Co_alignments[key_1] = [record.query]
    


    N_records += 1

print "Unique = ", unique
print "Best single hit from multi", Best_multi
print "Best two hits from multi", Two_best_multi
    
#Co_alignments


Unique =  741
Best single hit from multi 665
Best two hits from multi 346


In [934]:
linkage_map_path = "/home/djeffrie/Data/RADseq/R_temp_fams/NEW/EcoRI_map/Lepmap/MAP_9_5/InfMask23/map_9_5_ordered_1_infMask_23_FEMALE_genetic_mapper.dat"

Linkage_map_dict_EcoRI = Get_LM_info(linkage_map_path, real_tag_IDs_EcoRI)

3193


In [931]:
counter = 0
for i in Co_alignments:
    if len(Co_alignments[i]) >1:
        counter +=1
        print i, Co_alignments[i]
counter 

gnl|BL_ORD_ID|1016 Super-Scaffold_1017 gnl|BL_ORD_ID|948 Super-Scaffold_949 [u'626758', u'692696']
gnl|BL_ORD_ID|515 Super-Scaffold_516 gnl|BL_ORD_ID|1613 Super-Scaffold_1614 [u'606525', u'728830']
gnl|BL_ORD_ID|1134 Super-Scaffold_1135 gnl|BL_ORD_ID|1074 Super-Scaffold_1075 [u'600084', u'745826']
gnl|BL_ORD_ID|3080 Super-Scaffold_3081 gnl|BL_ORD_ID|949 Super-Scaffold_950 [u'554986', u'619217']
gnl|BL_ORD_ID|959 Super-Scaffold_960 gnl|BL_ORD_ID|437 Super-Scaffold_438 [u'570829', u'745133']
gnl|BL_ORD_ID|1342 Super-Scaffold_1343 gnl|BL_ORD_ID|173 Super-Scaffold_174 [u'565977', u'589166']
gnl|BL_ORD_ID|2947 Super-Scaffold_2948 gnl|BL_ORD_ID|172 Super-Scaffold_173 [u'609429', u'610424']
gnl|BL_ORD_ID|3075 Super-Scaffold_3076 gnl|BL_ORD_ID|198 Super-Scaffold_199 [u'678013', u'706219', u'721304']
gnl|BL_ORD_ID|20 Super-Scaffold_21 gnl|BL_ORD_ID|715 Super-Scaffold_716 [u'584063', u'608919']
gnl|BL_ORD_ID|3474 Super-Scaffold_3475 gnl|BL_ORD_ID|869 Super-Scaffold_870 [u'616085', u'693734']
gnl

40

Ok, so there are 40 tags scaffol pairs which have 2 or more tags which map to them . . . 

Now filter these for those where the tags come from the same linkage group.

In [935]:
Passed_scaffold_pairs = []
Failed_scaffold_pairs = []

for scaffold_pair in Co_alignments:
    if len(Co_alignments[scaffold_pair]) >= 2:
        LGs = []
        for locus in Co_alignments[scaffold_pair]:
            locus_LG = Linkage_map_dict_EcoRI[locus]["LG"]
            LGs.append(locus_LG)
        if len(set(LGs)) == 1:
            Passed_scaffold_pairs.append(scaffold_pair)
        else:
            Failed_scaffold_pairs.append(scaffold_pair)

In [937]:
Failed_scaffold_pairs

[u'gnl|BL_ORD_ID|1342 Super-Scaffold_1343 gnl|BL_ORD_ID|173 Super-Scaffold_174',
 u'gnl|BL_ORD_ID|3113 Super-Scaffold_3114 gnl|BL_ORD_ID|184 Super-Scaffold_185',
 u'gnl|BL_ORD_ID|3263 Super-Scaffold_3264 gnl|BL_ORD_ID|974 Super-Scaffold_975',
 u'gnl|BL_ORD_ID|1658 Super-Scaffold_1659 gnl|BL_ORD_ID|647 Super-Scaffold_648']

Take a look at the 4 scaffolds that failed

In [939]:
for pair in Failed_scaffold_pairs:
    for i in Co_alignments[pair]:
        print pair, i, Linkage_map_dict_EcoRI[i]

 gnl|BL_ORD_ID|1342 Super-Scaffold_1343 gnl|BL_ORD_ID|173 Super-Scaffold_174 565977 {'LG': '15', 'POS': '43.62'}
gnl|BL_ORD_ID|1342 Super-Scaffold_1343 gnl|BL_ORD_ID|173 Super-Scaffold_174 589166 {'LG': '12', 'POS': '4.00'}
gnl|BL_ORD_ID|3113 Super-Scaffold_3114 gnl|BL_ORD_ID|184 Super-Scaffold_185 535749 {'LG': '4', 'POS': '66.36'}
gnl|BL_ORD_ID|3113 Super-Scaffold_3114 gnl|BL_ORD_ID|184 Super-Scaffold_185 710115 {'LG': '16', 'POS': '12.70'}
gnl|BL_ORD_ID|3263 Super-Scaffold_3264 gnl|BL_ORD_ID|974 Super-Scaffold_975 564859 {'LG': '16', 'POS': '15.37'}
gnl|BL_ORD_ID|3263 Super-Scaffold_3264 gnl|BL_ORD_ID|974 Super-Scaffold_975 736390 {'LG': '4', 'POS': '62.36'}
gnl|BL_ORD_ID|1658 Super-Scaffold_1659 gnl|BL_ORD_ID|647 Super-Scaffold_648 641430 {'LG': '1', 'POS': '197.56'}
gnl|BL_ORD_ID|1658 Super-Scaffold_1659 gnl|BL_ORD_ID|647 Super-Scaffold_648 724235 {'LG': '13', 'POS': '12.15'}


Nothing I can do here, carry on with just the ones that passed

In [941]:
retained_scaffolds_from_coaligned_pairs = {}
set_aside_scaffolds_from_coaligned_pairs = {}

count = 0 

total_mappings = {}

for scaffold_pair in Passed_scaffold_pairs:
    scaffold_1 = scaffold_pair.split()[1]
    scaffold_2 = scaffold_pair.split()[3]
    
    if scaffold_1 not in total_mappings:
        total_mappings[scaffold_1] = []
    if scaffold_2 not in total_mappings:
        total_mappings[scaffold_2] = []

    
    for record in Kept_blast_records_PstI_strict:
        if scaffold_1 == Kept_blast_records_PstI_strict[record]["Ref_hit_id"]:
            total_mappings[scaffold_1].append(record)

        elif scaffold_2 == Kept_blast_records_PstI_strict[record]["Ref_hit_id"]:
            total_mappings[scaffold_2].append(record)


In [942]:
retained_scaffolds_from_coaligned_pairs = {}
set_aside_scaffolds_from_coaligned_pairs = {}

for scaffold_pair in Passed_scaffold_pairs:
    scaffold_1 = scaffold_pair.split()[1]
    scaffold_2 = scaffold_pair.split()[3]
    
    if len(total_mappings[scaffold_1]) > len(total_mappings[scaffold_2]):
        if scaffold_1 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_1] = [scaffold_2]
        if scaffold_2 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]

        
    elif len(total_mappings[scaffold_2]) > len(total_mappings[scaffold_1]):
        if scaffold_2 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]
            
        if scaffold_1 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]
        
    elif len(total_mappings[scaffold_1]) == len(total_mappings[scaffold_2]):
        if scaffold_1 in retained_scaffolds_from_coaligned_pairs:
            retained_scaffolds_from_coaligned_pairs[scaffold_1].append(scaffold_2)
        else:
            retained_scaffolds_from_coaligned_pairs[scaffold_1] = [scaffold_2]
        if scaffold_2 in set_aside_scaffolds_from_coaligned_pairs:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2].append(scaffold_1)
        else:
            set_aside_scaffolds_from_coaligned_pairs[scaffold_2] = [scaffold_1]


        

### Now make the strict ALLMAPS input for PstI

In [943]:
### First get the mapping info for the retained co-alignment scaffolds

blast_outs_handle = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_blast_hits_MTS_100.xml", 'r')
blastouts = NCBIXML.parse(blast_outs_handle)

retained_co_alignments_mapping_info = {}
count = 0
for record in blastouts:
    if len(record.alignments) >= 3:
        if all([record.alignments[0].hsps[0].expect <= evalue_thresh,record.alignments[1].hsps[0].expect <= evalue_thresh,record.alignments[0].hsps[0].expect > best_hit_criteria * record.alignments[1].hsps[0].expect,record.alignments[1].hsps[0].expect < best_hit_criteria * record.alignments[2].hsps[0].expect]): ## and if the 2 alignment has an eval lower than 1e-5 * the eval of the third alignment
            count += 1
            if record.alignments[0].title.split()[1] in retained_scaffolds_from_coaligned_pairs or record.alignments[1].title.split()[1] in retained_scaffolds_from_coaligned_pairs:
                retained_co_alignments_mapping_info[record.query] = {}
                retained_co_alignments_mapping_info[record.query]["Ref_hit_id"] = str(record.alignments[0].hit_def)
                retained_co_alignments_mapping_info[record.query]["Evalue"] = float(record.alignments[0].hsps[0].expect)
                retained_co_alignments_mapping_info[record.query]["Hit_start_coord"] = int(record.alignments[0].hsps[0].sbjct_start)
                retained_co_alignments_mapping_info[record.query]["Hit_end_coord"] = int(record.alignments[0].hsps[0].sbjct_end)
print count

346


In [947]:
## Now make the ALLMAPS inputs

All_ALLMAPS_records_strict_EcoRI = dict(Kept_blast_records_EcoRI_strict.items() + retained_co_alignments_mapping_info.items())


In [950]:
## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_EcoRI = {}

for tag in All_ALLMAPS_records_strict_EcoRI:
    if tag in Linkage_map_dict_EcoRI:
        Blast_plus_LM_info_EcoRI[tag] = {}
        Blast_plus_LM_info_EcoRI[tag]["scaff"] = All_ALLMAPS_records_strict_EcoRI[tag]["Ref_hit_id"]
        Blast_plus_LM_info_EcoRI[tag]["STRT"] = All_ALLMAPS_records_strict_EcoRI[tag]["Hit_start_coord"]
        Blast_plus_LM_info_EcoRI[tag]["LG"] = Linkage_map_dict_EcoRI[tag]["LG"]
        Blast_plus_LM_info_EcoRI[tag]["POS"] = Linkage_map_dict_EcoRI[tag]["POS"]

print len(Blast_plus_LM_info_EcoRI)
        
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_ALLMAPS_with_coaligned_input_strict.dat", 'w')

for i in Blast_plus_LM_info_EcoRI:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info_EcoRI[i]["scaff"],Blast_plus_LM_info_EcoRI[i]["STRT"],Blast_plus_LM_info_EcoRI[i]["LG"],Blast_plus_LM_info_EcoRI[i]["POS"]))
                  
outfile.close()

1488


### Now make the relaxed ALLMAPS input for PstI

In [951]:
## Now make the ALLMAPS inputs

All_ALLMAPS_records_relaxed_EcoRI = dict(Kept_blast_records_EcoRI_relaxed.items() + retained_co_alignments_mapping_info.items())
print len(All_ALLMAPS_records_relaxed_EcoRI)

1661


In [955]:
## compile alignment and linkage mapping info for each tag into a single dictionary

Blast_plus_LM_info_EcoRI_relaxed = {}

for tag in All_ALLMAPS_records_relaxed_EcoRI:
    if tag in Linkage_map_dict_EcoRI:
        Blast_plus_LM_info_EcoRI_relaxed[tag] = {}
        Blast_plus_LM_info_EcoRI_relaxed[tag]["scaff"] = All_ALLMAPS_records_relaxed_EcoRI[tag]["Ref_hit_id"]
        Blast_plus_LM_info_EcoRI_relaxed[tag]["STRT"] = All_ALLMAPS_records_relaxed_EcoRI[tag]["Hit_start_coord"]
        Blast_plus_LM_info_EcoRI_relaxed[tag]["LG"] = Linkage_map_dict_EcoRI[tag]["LG"]
        Blast_plus_LM_info_EcoRI_relaxed[tag]["POS"] = Linkage_map_dict_EcoRI[tag]["POS"]

print len(Blast_plus_LM_info_EcoRI_relaxed)
        
## Output to file

outfile = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/Mapped_tag_fastas/EcoRI_ALLMAPS_with_coaligned_input_relaxed.dat", 'w')

for i in Blast_plus_LM_info_EcoRI_relaxed:
    outfile.write("%s\t%s\t%s\t%s\t%s\n" % (i, Blast_plus_LM_info_EcoRI_relaxed[i]["scaff"],Blast_plus_LM_info_EcoRI_relaxed[i]["STRT"],Blast_plus_LM_info_EcoRI_relaxed[i]["LG"],Blast_plus_LM_info_EcoRI_relaxed[i]["POS"]))
                  
outfile.close()

1661


### Summary so far,

So I made the inputs. I identified some homologous scaffolds, and I ran things through ALLMAPS in a preliminary run. The problem is I think the maps I made with EcoRI and PstI were wrong in some places. . . I used LepMap3 for this, and perhaps it is not ideal for only one or two families. 

So anyway, I have got the maps made by Gemma Palomar, from the Palomar et al paper, and the map from Alan, so I will use these instead. Also its easier to cite these than use different programs for the linkage maps. 

So I need to do a bit of work before I can incorporate these maps, I need to get the markers, map them to the genome, then repeat the methods above for each one. Should be able to do this today!




### PstI map from Palomar et al 2017

I have a tdf (tab delim file) containing the tag_ID and the position in the map. I also have two fasta files contining the tag sequences, one for R1s and one for R2s.  I think most of the sequences in the map file, if not all, are from the map. So I will just blast them against the genome and take directly from the filtered blast hits. 

In [965]:
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
Rtemp_db = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_blastn"

### R1 tags

my_query = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ForwardContigReference.fasta"

## xml format
blast_outs = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ForwardContigReference_blastouts_MTS_100.xml"
blastn_cline = BLASTn(query=my_query, db=Rtemp_db, outfmt=5, out=blast_outs, num_threads = 7, max_target_seqs = 100)


stdout, stderr = blastn_cline()



In [966]:
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
Rtemp_db = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_blastn"

### R1 tags

my_query = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ReverseContigsReference.fasta"

## xml format
blast_outs = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ReverseContigsReference_blastouts_MTS_100.xml"
blastn_cline = BLASTn(query=my_query, db=Rtemp_db, outfmt=5, out=blast_outs, num_threads = 7, max_target_seqs = 100)


stdout, stderr = blastn_cline()



Filter the blast hits

In [967]:
## R1s strict

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ForwardContigReference_blastouts_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-20
best_hit_criteria = 1e-5
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_R1s_strict = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 494
Number of unique alingments kept: 611


In [968]:
## R1s relaxed

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ForwardContigReference_blastouts_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-16
best_hit_criteria = 1e-3
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_R1s_relaxed = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 638
Number of unique alingments kept: 620


In [970]:
## R2s strict

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ReverseContigsReference_blastouts_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-20
best_hit_criteria = 1e-5
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_R2s_strict = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 570
Number of unique alingments kept: 697


In [971]:
## R2s relaxed

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ReverseContigsReference_blastouts_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-16
best_hit_criteria = 1e-3
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_R2s_strict = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 710
Number of unique alingments kept: 704


Ok so about 46% of the tags can be mapped again. . . . Now I can combine this alignment information with their map information.

In [985]:
Female_PstI_Palomar_map = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/Female_palomar_map.txt", 'r').readlines()

### Strict

In [991]:
ALLMAPS_input_pstI_Palomar = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ALLMAPS_input_pstI_Palomar.csv", 'w')
ALLMAPS_input_pstI_Palomar.write("Scaffold ID,scaffold position,LG,genetic position\n")

for i in Female_PstI_Palomar_map:
    snp_ID = i.split()[0]
    tag_ID = i.split()[0].rpartition("_")[0]
    LG = i.split()[1]
    POS = i.split()[2]
    if tag_ID in Kept_blast_records_R1s_strict:
        ALLMAPS_input_pstI_Palomar.write("%s,%s,%s,%s\n" % (Kept_blast_records_R1s_strict[tag_ID]["Ref_hit_id"], Kept_blast_records_R1s_strict[tag_ID]['Hit_start_coord'], LG, POS))
    elif tag_ID in Kept_blast_records_R2s_strict:
        ALLMAPS_input_pstI_Palomar.write("%s,%s,%s,%s\n" % (Kept_blast_records_R2s_strict[tag_ID]["Ref_hit_id"], Kept_blast_records_R2s_strict[tag_ID]['Hit_start_coord'], LG, POS))
        
ALLMAPS_input_pstI_Palomar.close()

### Relaxed

In [1036]:
ALLMAPS_input_pstI_Palomar = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/PstI_map/Tags_from_Palomar_etal/ALLMAPS_input_pstI_Palomar_relaxed.csv", 'w')
ALLMAPS_input_pstI_Palomar.write("Scaffold ID,scaffold position,LG,genetic position\n")

for i in Female_PstI_Palomar_map:
    snp_ID = i.split()[0]
    tag_ID = i.split()[0].rpartition("_")[0]
    LG = i.split()[1]
    POS = i.split()[2]
    if tag_ID in Kept_blast_records_R1s_relaxed:
        ALLMAPS_input_pstI_Palomar.write("%s,%s,%s,%s\n" % (Kept_blast_records_R1s_relaxed[tag_ID]["Ref_hit_id"], Kept_blast_records_R1s_relaxed[tag_ID]['Hit_start_coord'], LG, POS))
    elif tag_ID in Kept_blast_records_R1s_relaxed:
        ALLMAPS_input_pstI_Palomar.write("%s,%s,%s,%s\n" % (Kept_blast_records_R1s_relaxed[tag_ID]["Ref_hit_id"], Kept_blast_records_R1s_relaxed[tag_ID]['Hit_start_coord'], LG, POS))
        
ALLMAPS_input_pstI_Palomar.close()

Ok, that went well I think, I have an extra 1819 SNPs that are in the PstI map and aligned to the scaffolds. Now for the same thing with EcoRI. 

### EcoRI

For EcoRI, the protocol is slightly different. The SNPs were called by mapping the raw reads to the genome, so the markers in the map are the markers that could be mapped. So there are no RADtag sequences here, only positions in an old genome assembly. So I need to extract sequence from around these locations, and map them to the new genome. But how much to extract? In theory the more the better, as there will be greater chance of having a unique hit with more sequence, but chimeric contigs in either assembly could reduce efficiency if the fragments are too large. So I think I will try 200bp fragments around the SNP (100 each side), 500 fragments, 1kb frags and 2 kb frags. When I filter the blast results of these fragments I should see the improvment or loss of effienciency and be able to choose the appropriate one. 

In [1008]:
## Get the linkage map info first

Brelsford_EcoRI_linkage_map = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/Female_map_brelsford.txt", 'r').readlines()

LM_dict = {}

for i in  Brelsford_EcoRI_linkage_map:
    marker_id = i.split()[0]
    scaff_id = marker_id.split("c")[1].split("_")[0]
    marker_position = marker_id.split("_")[1].split("b")[0]
    LG = i.split()[1]
    LG_position = i.split()[2]
    
    
    if scaff_id not in LM_dict:
        LM_dict[scaff_id] = {}
    LM_dict[scaff_id][marker_position] = {}
    LM_dict[scaff_id][marker_position]["LG"] = LG
    LM_dict[scaff_id][marker_position]["LG_position"] = LG_position
    

In [1027]:
## Now extract the relevent bit of each scaff

from Bio import SeqIO

window = 250  ## 100bp each side of the SNP

old_genome = SeqIO.parse(open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Old_genome/Rtk43.200.fa" ,'r'),"fasta")

outs = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/marker_seqs_%s.fasta" % str(window) ,'w')

for record in old_genome:
    scaff = record.id.split(".")[0].split("d")[1]
    
    if scaff in LM_dict:
        for SNP in LM_dict[scaff]:
            if int(SNP) < window:  ## if the pos of the snp is less than the window width from the start of the seq
                outs.write(">scaffold%s|%s_%s\n" % (scaff, str(0), str(int(SNP)+window)))
                outs.write("%s\n" % record.seq[: int(SNP)+window])
            elif int(SNP) > len(record.seq) - window:
                outs.write(">scaffold%s|%s_%s\n" % (scaff, str(int(SNP)-window), len(record.seq)))
                outs.write("%s\n" % record.seq[int(SNP)-window:])
            else:
                outs.write(">scaffold%s|%s_%s\n" % (scaff, str(int(SNP)-window), str(int(SNP)+window)))
                outs.write("%s\n" % record.seq[int(SNP)-window: int(SNP)+window])
        
outs.close()        
        
        
        
    



Now blast those bits of scaffold to the new genome assembly

In [1028]:
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
Rtemp_db = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_blastn"

### R1 tags

my_query = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/marker_seqs_250.fasta"

## xml format
blast_outs = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/marker_seqs_250_blastouts_MTS_100.xml"
blastn_cline = BLASTn(query=my_query, db=Rtemp_db, outfmt=5, out=blast_outs, num_threads = 7, max_target_seqs = 100)


stdout, stderr = blastn_cline()



In [1029]:
## Blast filters strict

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/marker_seqs_250_blastouts_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-20
best_hit_criteria = 1e-5
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_EcoRI_brels_strict = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 541
Number of unique alingments kept: 180


In [1037]:
## Blast filters strict

blast_outs_path = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/marker_seqs_250_blastouts_MTS_100.xml"
Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp_hybrid/Genome/Rtemp_hyb_final_assemb.fasta"
evalue_thresh = 1e-16
best_hit_criteria = 1e-3
window = 2000
get_frags = 0
verb = 1

Kept_blast_records_EcoRI_brels_relaxed = MISC.BlastParseExtra(blast_outs_path, Rtemp_fasta, best_hit_criteria, evalue_thresh, get_frags, window, 1)

Number of multi-alingments kept: 567
Number of unique alingments kept: 180


Note, I did a couple of tests to see what length of sequence around each marker to retain. . . . I tested 100 each side (200 fragments) and 250 each side (500 bp frags). They both returned a similar number of total hits, but the 500 frags mapping returned a lot more unique mappings, so I kept these. 

Now I just need to compile the linkage map and scaffold data and thats it. 

### Strict

In [1035]:
EcoRI_brels_ALLMAPS_inputs = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/EcoRI_Brels_ALLMAPS_inputs.txt", 'w')

EcoRI_brels_ALLMAPS_inputs.write("Scaffold ID,scaffold position,LG,genetic position\n")

for i in Kept_blast_records_EcoRI_brels_strict:
    scaff = i.split("|")[0].split("d")[1]
    if scaff in LM_dict:
        for SNP in LM_dict[scaff]:
            EcoRI_brels_ALLMAPS_inputs.write("%s,%s,%s,%s\n" % (Kept_blast_records_EcoRI_brels_strict[i]["Ref_hit_id"],Kept_blast_records_EcoRI_brels_strict[i]["Hit_start_coord"], LM_dict[scaff][SNP]["LG"], LM_dict[scaff][SNP]["LG_position"]))
        
EcoRI_brels_ALLMAPS_inputs.close()
    


### Relaxed

In [1038]:
EcoRI_brels_ALLMAPS_inputs = open("/home/djeffrie/Data/Genomes/Rtemp_hybrid/ALLMAPS/EcoRI_map/Brelsford_map/EcoRI_Brels_ALLMAPS_inputs_relaxed.txt", 'w')

EcoRI_brels_ALLMAPS_inputs.write("Scaffold ID,scaffold position,LG,genetic position\n")

for i in Kept_blast_records_EcoRI_brels_relaxed:
    scaff = i.split("|")[0].split("d")[1]
    if scaff in LM_dict:
        for SNP in LM_dict[scaff]:
            EcoRI_brels_ALLMAPS_inputs.write("%s,%s,%s,%s\n" % (Kept_blast_records_EcoRI_brels_relaxed[i]["Ref_hit_id"],Kept_blast_records_EcoRI_brels_relaxed[i]["Hit_start_coord"], LM_dict[scaff][SNP]["LG"], LM_dict[scaff][SNP]["LG_position"]))
        
EcoRI_brels_ALLMAPS_inputs.close()
    