### Here I map RADtags from 6 Rtemp families to scaffolds containing portions of DMRT1

The RAD dataset used:

6 families from the SAME population (Tvedora, Sweden)

- After the Stacks catalog stage, 260,000 RADtags were found.
- Kept only RADtags present in all parents and in at least 75% of offspring.
- Removed loci with coverage < 3, MAF < 0.05 or heterozygosity > 0.75.
- This left ~30,000 RADtags.
- Of these, 13,000 were polymorphic and contained ~20,000 SNPs (~1.5 SNPs per tag).  
(Note that although this filtering is stringent, more relaxed filtering would at best double the number of usable loci, at the cost of data reliability.)
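
The filters listed above can be written as a single pass/fail test per locus. This is only a minimal sketch and not what Stacks does internally: the `Locus` record, its field names and the example values are hypothetical, and it assumes a locus is dropped if it fails any one of the thresholds.

```python
from collections import namedtuple

# Hypothetical per-locus summary; the field names are illustrative only.
Locus = namedtuple("Locus", ["in_all_parents", "offspring_fraction",
                             "coverage", "maf", "heterozygosity"])

def passes_filters(locus, min_offspring=0.75, min_cov=3,
                   min_maf=0.05, max_het=0.75):
    """Return True if the locus survives every filter described above."""
    return (locus.in_all_parents
            and locus.offspring_fraction >= min_offspring
            and locus.coverage >= min_cov
            and locus.maf >= min_maf
            and locus.heterozygosity <= max_het)

# Made-up example loci
good = Locus(True, 0.90, 12, 0.20, 0.45)
too_heterozygous = Locus(True, 0.95, 30, 0.50, 0.98)
```

Here `passes_filters(good)` is true, while `too_heterozygous` is rejected by the heterozygosity ceiling alone.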


The analyses below use the sequences of the ~13,000 polymorphic RADtags. 


In [3]:
import sys
sys.path.append('/home/djeffrie/Dropbox/My_Dropbox_Scripts/Python/My_Modules/')
import MISC_RAD_tools as MISC
import Bio.SeqIO  ## "import Bio" alone does not load the SeqIO submodule used below

In [83]:
## Get the filtered tag IDs

## Read one tag ID per line; the "with" block closes the file afterwards
with open("/home/djeffrie/Data/RADseq/R_temp_fams/Populations_all_kept/whitelist.txt", 'r') as whitelist_file:
    whitelist = [i.strip() for i in whitelist_file]

In [12]:
## Make a fasta file of the tag sequences

MISC.fasta_maka(whitelist, "/home/djeffrie/Data/RADseq/R_temp_fams/batch_1.catalog.tags.tsv.gz")

Number of tags in whitelist: 12938
12938 sequences written to /home/djeffrie/Data/RADseq/R_temp_fams/Whitelist_tags.fa


####  These sequences were blasted to the DMRT1 scaffold sequences (blastn)

In [15]:
## Filtering blast hits

VCF_tags_to_DMRT1_outs = "/home/djeffrie/Data/RADseq/R_temp_fams/Populations_all_kept/VCF_tags_to_DMRT1.xml"  ## blast outputs
DMRT1_fasta = "/home/djeffrie/Data/Genomes/Rtemp/DMRT1_scaffs/scaffolds_dmrt1all.fasta"  ## DMRT1 fasta

## Filtering criteria

best_hit_crit = 1e-5 ## the top hit for a tag must be 5 orders of magnitude better than the next best hit
Eval_threshold = 1e-20 ## the e-value cutoff for a hit (the worst e-value accepted)

VCF_tags_to_DMRT1_Filtered_blasts = MISC.BlastParseExtra(VCF_tags_to_DMRT1_outs , DMRT1_fasta , best_hit_crit, Eval_threshold, 1 , 0 , 1) ## filtering function call

Number of multi-alingments kept: 2
Number of unique alingments kept: 43
Getting subject scaffold segments from /home/djeffrie/Data/Genomes/Rtemp/DMRT1_scaffs/scaffolds_dmrt1all.fasta . . . 
4 sequence scaffold segments are in /home/djeffrie/Data/RADseq/R_temp_fams/Populations_all_kept/blast_2000_chunks.fa
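
The two criteria above can be illustrated with a small stand-alone function. This is a sketch of the idea only, not the code inside `MISC.BlastParseExtra`: the function name `classify_tag` is hypothetical, and it assumes that "unique" means a tag with a single hit and "multi" means a tag with several hits whose best hit clearly beats the runner-up.

```python
def classify_tag(evalues, best_hit_crit=1e-5, eval_threshold=1e-20):
    """Classify one tag from the e-values of all its hits.

    Returns "unique", "multi" or None (tag rejected).
    """
    if not evalues:
        return None
    ranked = sorted(evalues)
    best = ranked[0]
    if best > eval_threshold:
        # Even the best hit is not good enough
        return None
    if len(ranked) == 1:
        return "unique"
    second = ranked[1]
    # The best hit must beat the runner-up by the required margin.
    # Two hits that both have an e-value of 0.0 cannot be told apart.
    if second > 0.0 and best <= best_hit_crit * second:
        return "multi"
    return None
```

For example, e-values of `[1e-40, 1e-30]` give `"multi"` because the best hit is ten orders of magnitude better, whereas `[1e-40, 1e-38]` is rejected as ambiguous.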


So 45 tags apparently hit the DMRT1 scaffolds (43 unique alignments plus 2 multi-alignments). This seems very high, so I look into it below. 

In [86]:
## get real scaffold lengths (the lengths in the scaffold IDs are wrong)

scaff_lengths = {}

dmrt1s = Bio.SeqIO.parse("/home/djeffrie/Data/Genomes/Rtemp/DMRT1_scaffs/scaffolds_dmrt1all.fasta", "fasta")

for i in dmrt1s:
    scaff_lengths[i.id] = len(i.seq)

print scaff_lengths

print "Total length of scaffolds = %s" % sum(scaff_lengths.values())

{'scaffold7867.1|size68756': 58504, 'scaffold60721.1|size23283': 23365, 'scaffold9351.1|size72711': 91773, 'scaffold86492.1|size16590': 15889, 'scaffold142901.1|size18471': 19316}
Total length of scaffolds = 208847
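
To see how far off the lengths in the scaffold IDs are, the sizes written in the IDs can be compared with the real lengths. This is a small sketch using the values printed above; the helper `size_in_id` is hypothetical and assumes every ID ends in `|size<number>`.

```python
# Real lengths copied from the output above
scaff_lengths = {
    'scaffold7867.1|size68756': 58504,
    'scaffold60721.1|size23283': 23365,
    'scaffold9351.1|size72711': 91773,
    'scaffold86492.1|size16590': 15889,
    'scaffold142901.1|size18471': 19316,
}

def size_in_id(scaff_id):
    """Extract the length claimed in an ID such as 'scaffold7867.1|size68756'."""
    return int(scaff_id.split('|size')[1])

# Real length minus the length claimed in the ID, per scaffold
discrepancy = {}
for scaff_id, real_length in scaff_lengths.items():
    discrepancy[scaff_id] = real_length - size_in_id(scaff_id)
```

The differences range from about 10 kb shorter to about 19 kb longer than the ID suggests, which is why the lengths are taken from the sequences themselves.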


In [89]:
## Look at the distribution of the hits across the scaffolds

scaff_positions = {}

for tag in VCF_tags_to_DMRT1_Filtered_blasts:

    hit = VCF_tags_to_DMRT1_Filtered_blasts[tag]
    scaff = hit['Ref_hit_id']

    ## First hit on this scaffold: start every position at a depth of 0
    if scaff not in scaff_positions:
        scaff_positions[scaff] = {}
        for pos in range(1, scaff_lengths[scaff]):
            scaff_positions[scaff][pos] = 0

    ## Hits on the reverse strand have start > end, so order the coordinates first
    start, end = sorted((hit["Hit_start_coord"], hit["Hit_end_coord"]))

    ## Add 1 to the depth of every position covered by this hit
    ## (as before, the end coordinate itself is not counted)
    for k in range(start, end):
        scaff_positions[scaff][k] += 1
    
## Take a look at the number of RADtags that map to each position on the scaffolds - below I show just a representative portion.

for i in range(16975,17134):
    print i, scaff_positions['scaffold7867.1|size68756'][i]
        
        
        

16975 0
16976 0
16977 0
16978 0
16979 0
16980 0
16981 0
16982 0
16983 0
16984 0
16985 0
16986 0
16987 0
16988 0
16989 0
16990 0
16991 0
16992 0
16993 4
16994 4
16995 4
16996 4
16997 4
16998 4
16999 4
17000 4
17001 4
17002 4
17003 4
17004 4
17005 4
17006 4
17007 4
17008 4
17009 4
17010 4
17011 4
17012 4
17013 4
17014 4
17015 4
17016 4
17017 4
17018 4
17019 4
17020 4
17021 4
17022 4
17023 4
17024 4
17025 4
17026 4
17027 4
17028 4
17029 4
17030 4
17031 5
17032 5
17033 5
17034 5
17035 6
17036 6
17037 6
17038 6
17039 6
17040 6
17041 6
17042 6
17043 6
17044 6
17045 5
17046 5
17047 7
17048 9
17049 10
17050 10
17051 10
17052 10
17053 10
17054 10
17055 10
17056 10
17057 10
17058 10
17059 10
17060 10
17061 10
17062 10
17063 10
17064 10
17065 9
17066 9
17067 9
17068 9
17069 9
17070 9
17071 9
17072 9
17073 9
17074 9
17075 9
17076 9
17077 9
17078 9
17079 9
17080 9
17081 9
17082 9
17083 9
17084 9
17085 7
17086 7
17087 7
17088 7
17089 7
17090 7
17091 7
17092 7
17093 7
17094 7
17095 7
17096 7
17097 7


Even by eye you can see regions where several RADtags hit the same positions (up to 10 tags deep in the window shown), together with many short alignments. This suggests that these scaffolds contain repeats . . . 

So this points to the problem: repeats are present in both the DMRT1 scaffolds and the RADtags. This gives lots of hits but, importantly, you cannot tell whether these are the true locations of those RADtags. 
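
The per-position output is long to read by eye, so it can help to collapse it into runs of equal depth. The function below is a minimal sketch (the name `depth_runs` is hypothetical); it takes a `{position: depth}` dictionary shaped like one entry of `scaff_positions`.

```python
def depth_runs(depths):
    """Collapse {position: depth} into (start, end, depth) runs of equal depth."""
    runs = []
    for pos in sorted(depths):
        depth = depths[pos]
        if runs and runs[-1][2] == depth and runs[-1][1] == pos - 1:
            runs[-1][1] = pos               # extend the current run
        else:
            runs.append([pos, pos, depth])  # start a new run
    return [tuple(run) for run in runs]

# A few positions copied from the output above
example = {16990: 0, 16991: 0, 16992: 0, 16993: 4, 16994: 4, 16995: 4}
runs = depth_runs(example)
```

Applied to a whole scaffold, any run with a depth above 1 marks a stretch hit by more than one RADtag.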

<b>The problem with this approach is that, by only mapping to DMRT1 scaffolds, you don't see all the other mappings that those RADtags would have throughout the genome . . . </b>

So the correct approach would be:
- Map all RADtags to the whole genome
- Filter out tags which don't have a confident mapping position
- Only then look for tags that hit one of the DMRT1 scaffolds 
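
A toy example shows why the order of these steps matters. Everything here is made up for illustration (the tag names, the scaffold names and the simple "exactly one scaffold" rule); it is not the filtering used in this notebook.

```python
def confident_tags(hits, allowed_scaffolds=None):
    """Return {tag: scaffold} for tags that hit exactly one scaffold.

    hits: {tag: set of scaffold IDs with a good hit} (hypothetical input)
    allowed_scaffolds: if given, hits elsewhere are invisible, as they are
    when the BLAST database contains only those scaffolds.
    """
    placed = {}
    for tag, scaffolds in hits.items():
        visible = set(scaffolds)
        if allowed_scaffolds is not None:
            visible &= set(allowed_scaffolds)
        if len(visible) == 1:
            placed[tag] = next(iter(visible))
    return placed

dmrt1 = {"dmrt1_scaff_A"}
hits = {
    "tag_1": {"dmrt1_scaff_A", "other_scaff_1", "other_scaff_2"},  # repeat
    "tag_2": {"other_scaff_3"},
}

dmrt1_only = confident_tags(hits, allowed_scaffolds=dmrt1)
genome_wide = confident_tags(hits)
on_dmrt1 = set(t for t, s in genome_wide.items() if s in dmrt1)
```

Searching only the DMRT1 scaffold makes the repeat-derived `tag_1` look uniquely placed, while the genome-wide search drops it and leaves no tags on DMRT1.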

. . .I do this below

In [78]:
VCF_tags_to_Rtemp_outs = "/home/djeffrie/Data/RADseq/R_temp_fams/Populations_all_kept/VCF_tags_to_Rtemp.xml"  ## blast outputs (tags vs whole genome)

Rtemp_fasta = "/home/djeffrie/Data/Genomes/Rtemp/Rtemp_gapfilled.iter2.fa"  ## whole-genome fasta


best_hit_crit = 1e-5 ## the top hit for a tag must be 5 orders of magnitude better than the next best hit
Eval_threshold = 1e-20 ## the e-value cutoff for a hit (the worst e-value accepted)
Window = 2000 ## the size (each side of the mapping) of the fragment to retrieve from the fasta for the next mapping stage. 

VCF_tags_to_Rtemp_Filtered_blasts = MISC.BlastParseExtra(VCF_tags_to_Rtemp_outs, Rtemp_fasta, best_hit_crit, Eval_threshold, 0, Window, 1)

Number of multi-alingments kept: 3651
Number of unique alingments kept: 3429


So about 7,000 RADtags (3,651 + 3,429 = 7,080) can be confidently mapped. . . how many of these are on DMRT1 scaffolds?

In [82]:
for i in VCF_tags_to_Rtemp_Filtered_blasts:
    if VCF_tags_to_Rtemp_Filtered_blasts[i]["Ref_hit_id"] in scaff_lengths:
        print i

<b>ANSWER: None</b>

So even if you managed to double or triple the number of RADtags to use (which is unlikely when dealing with samples across Europe), the number of loci hitting DMRT1 is going to be somewhere between none and very few. 
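
A rough back-of-the-envelope calculation points the same way. This sketch assumes that tags fall uniformly along the genome, which RADtags do not strictly do, and it uses a placeholder genome size of 4 Gb. That genome size is an assumption and does not come from this notebook, so replace it with the real assembly size.

```python
def expected_tags_on_target(n_tags, target_bp, genome_bp):
    """Expected number of tags on a target region if tags fall uniformly."""
    return n_tags * float(target_bp) / genome_bp

mapped_tags = 3651 + 3429   # confidently mapped tags reported above
dmrt1_bp = 208847           # total DMRT1 scaffold length reported above
assumed_genome_bp = 4e9     # ASSUMPTION: placeholder genome size, not from this notebook

expected = expected_tags_on_target(mapped_tags, dmrt1_bp, assumed_genome_bp)
```

Under these assumptions the expectation is well below one tag, so even tripling the number of mapped tags would give only about one.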
