# Looking at the distribution of TEs in the temporaria genome.
    
    Here I want to use the TE library created by dnaPipeTE to find the distribution of TEs throughout the genome. 
    
    The first step here is to assign <i>Rana temporaria</i> contigs to Xenopus chromosomes. To do this, I am blasting the 
    Xenopus exon sequences against the Rtemp genome. I will keep every contig that a Xenopus exon hits confidently (best-hit
    e-value no more than 1e-5 times that of the next best hit).
    
    I will then give the repeat library that I created with dnaPipeTE to RepeatMasker, and its output will tell me the 
    distribution of those TEs on my scaffolds.

##1.   Blasting Xenopus exons to the current R. temporaria genome. 

I extracted the exons from the Xenopus genome (v.9) using the gff gene annotation file. I gave this file and the genome fasta to bedtools. (Note - I modified this file so that the 2nd column contained the "Name:" of the sequence, and retained only "gene" lines, not CDS, RNA etc. This was necessary to get unique sequence names from the bedtools command below.)

    bedtools getfasta -fi Xtropicalis.v9.repeatMasked.fa -bed Unique_names_Xen_exons.gff3 -fo Xentrop_exons_unique_names.fa -name
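
The GFF edit described in the note above could be sketched roughly as follows. This is not the script I actually used: the function name `rename_gene_lines` is hypothetical, and it assumes standard GFF3 attributes (`key=value` pairs separated by `;` in column 9) and that the whole `Name=...` token is what goes into column 2.

```python
def rename_gene_lines(gff_lines):
    """Keep only 'gene' features and copy their Name attribute into column 2.

    Hypothetical helper: assumes tab-separated GFF3 with 9 columns.
    """
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue  # drop CDS, mRNA, exon etc.
        # Column 9 holds ';'-separated key=value attribute pairs
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if "Name" not in attributes:
            continue  # nothing to build a sequence name from
        fields[1] = "Name=" + attributes["Name"]
        kept.append("\t".join(fields))
    return kept


# Toy input (made-up records, not from the Xenopus annotation)
example = [
    "##gff-version 3",
    "Chr01\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1;Name=geneA",
    "Chr01\tsrc\tCDS\t100\t300\t.\t+\t0\tID=c1;Parent=g1",
]
renamed = rename_gene_lines(example)
```

Only the `gene` line survives here, with `Name=geneA` in its second column.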

Used the Rtemp_new.fa genome version. 

Used the following blast command to map the Xen exons to the Rtemp scaffolds:

    blastn -num_threads 6 -query Xentrop_exons.fa -db Rtemp_new.fa -out Rtemp_vs_xen_exons.xml -outfmt 5 > Rtemp_vs_xen.log
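
For orientation, here is a minimal sketch of how the e-values can be pulled out of BLAST XML (`-outfmt 5`) with only the standard library. It is not what `MISC_RAD_tools` does internally. The function name `evalues_per_query` and the toy XML are made up for illustration, and a real output file of this size would be better streamed (e.g. with `ET.iterparse`) than loaded in one go.

```python
import xml.etree.ElementTree as ET


def evalues_per_query(xml_text):
    """Map each query name to a sorted list of (best e-value, subject) per hit."""
    root = ET.fromstring(xml_text)
    hits = {}
    for iteration in root.iter("Iteration"):  # one Iteration per query
        query = iteration.findtext("Iteration_query-def")
        rows = []
        for hit in iteration.iter("Hit"):  # one Hit per subject sequence
            subject = hit.findtext("Hit_def")
            evalues = [float(hsp.findtext("Hsp_evalue")) for hsp in hit.iter("Hsp")]
            if evalues:
                rows.append((min(evalues), subject))  # best HSP for this subject
        hits[query] = sorted(rows)  # lowest e-value first
    return hits


# Toy XML with the same element names as BLAST XML output (values made up)
EXAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>Name=geneA</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>scaffoldY</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-30</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>scaffoldX</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_evalue>1e-40</Hsp_evalue></Hsp>
            <Hsp><Hsp_evalue>1e-25</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

hits = evalues_per_query(EXAMPLE_XML)
```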


##2. Sorting blast hits for only the best ones.

I will do this with my BlastParse function, which keeps queries whose best hit has an e-value no more than 1e-5 times that of the next best hit, and queries with only one match. In both cases the best e-value must be below 1e-20. 
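
The rule itself is simple enough to write down. The sketch below is my reading of the criteria as stated above, not the code inside `MISC_RAD_tools`. The name `classify_hit` and its return labels are hypothetical.

```python
def classify_hit(evalues, best_hit_eval_difference=1e-5, evalue_threshold=1e-20):
    """Classify one query from the e-values of its hits.

    Returns "unique" (single hit), "multi" (best hit clearly better than
    the runner-up) or None (query is discarded).
    """
    if not evalues:
        return None
    ranked = sorted(evalues)
    best = ranked[0]
    if best >= evalue_threshold:
        return None  # best hit is not significant enough
    if len(ranked) == 1:
        return "unique"
    second = ranked[1]
    # Best e-value must be at most 1e-5 times the runner-up. The strict
    # comparison rejects ties, including two hits that both report 0.0.
    if best < second and best <= second * best_hit_eval_difference:
        return "multi"
    return None


examples = {
    "single good hit": classify_hit([1e-30]),
    "clear winner": classify_hit([1e-40, 1e-30]),
    "too close to call": classify_hit([1e-30, 1e-28]),
    "single weak hit": classify_hit([1e-10]),
}
```

With these toy e-values, the first two queries are kept (as "unique" and "multi") and the last two are dropped.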


In [2]:
import MISC_RAD_tools as MISC

In [133]:
working_dir = "/home/djeffrie/Data/Genomes/Rtemp_vs_Xen"

blast_output_path = "%s/Rtemp_vs_xen_blast.xml" % working_dir
Rtemp_genome_path = "%s/Rtemp_new.fa" % working_dir
best_hit_eval_difference = 1e-5
Evalue_threshold = 1e-20
Scaff_window_size = 1000000000 ## Set to 1 Gb so I get the whole scaffold!

In [134]:
retained_blast_hits = MISC.BlastParseExtra(blast_output_path,Rtemp_genome_path,best_hit_eval_difference,Evalue_threshold,Scaff_window_size)

Number of multi-alingments kept: 3224
Number of unique alingments kept: 1124
Getting subject scaffold segments from /home/djeffrie/Data/Genomes/Rtemp_vs_Xen/Rtemp_new.fa . . . 
4348 sequence scaffold segments are in /home/djeffrie/Data/Genomes/Rtemp_vs_Xen/blast_1000000000_chunks.fa


To start with, there were <b>1,751,062 scaffolds</b> in the Rtemp genome and <b>26,550 exons</b> from the Xenopus genome.

Combining the multiple hits and unique hits, <b>I retained 4348 scaffolds</b>. That's <b>0.25% of the temporaria scaffolds</b> and <b>16.4% of the Xenopus exons</b>. 
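
A quick check of those percentages, using the counts reported above (the variable names are just for illustration):

```python
n_scaffolds = 1751062   # scaffolds in the Rtemp genome
n_exons = 26550         # Xenopus exons used as queries
n_retained = 3224 + 1124  # multi-alignments kept + unique alignments kept

pct_scaffolds = 100.0 * n_retained / n_scaffolds
pct_exons = 100.0 * n_retained / n_exons

print("%d retained: %.2f%% of scaffolds, %.1f%% of exons"
      % (n_retained, pct_scaffolds, pct_exons))
```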

So these are small representations, but the TEs are so prolific, it could still give a good idea of how abundant they are on each chromosome. Just need to be careful to scale by the amount of info on each chromosome!
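
One way to do that scaling is sketched below: divide the TE-covered bases on each chromosome by the total length of the scaffolds assigned to it. The function name and the three input dictionaries are hypothetical. In practice they would come from the blast assignments, the scaffold lengths and the RepeatMasker output, none of which are parsed here.

```python
def te_density_per_chromosome(scaffold_to_chrom, scaffold_length, te_bp_on_scaffold):
    """Fraction of assigned sequence covered by TEs, per chromosome."""
    assigned_bp = {}
    te_bp = {}
    for scaffold, chrom in scaffold_to_chrom.items():
        # Total sequence assigned to this chromosome
        assigned_bp[chrom] = assigned_bp.get(chrom, 0) + scaffold_length[scaffold]
        # TE-covered bases on the same scaffolds (0 if none were annotated)
        te_bp[chrom] = te_bp.get(chrom, 0) + te_bp_on_scaffold.get(scaffold, 0)
    return {chrom: float(te_bp[chrom]) / assigned_bp[chrom] for chrom in assigned_bp}


# Toy numbers only, to show the scaling
density = te_density_per_chromosome(
    scaffold_to_chrom={"scaffA": "Chr01", "scaffB": "Chr01", "scaffC": "Chr02"},
    scaffold_length={"scaffA": 1000, "scaffB": 3000, "scaffC": 500},
    te_bp_on_scaffold={"scaffA": 200, "scaffB": 600, "scaffC": 250},
)
```

In this toy case Chr01 has more TE bases in total (800 vs 250) but a lower density than Chr02, which is exactly the kind of difference the raw counts would hide.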


Ok, took about 2 hours, but it finished. This is now hopefully a conservative subset of scaffolds that match between R. temporaria and Xenopus. 

The only way error can creep in at this stage is from mis-assembled scaffolds. But I can't do anything about this until the better genome arrives. 


### However . . . . 

I think that I am in danger of excluding regions where the X and the Y are particularly diverged in our Spain genome. 

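In [147]:
scaff_to_exons_path = "/home/djeffrie/Data/Genomes/Rtemp_vs_Xen/Rtemp_scaff_to_Xen_exon_blastparseextra.txt"

In [150]:
## Open, write and close in one cell, so re-running it cannot hit an already-closed file handle
## (re-running the write loop on its own gave "ValueError: I/O operation on closed file").
with open(scaff_to_exons_path, 'w') as scaff_to_exons_table:
    for exon in retained_blast_hits:
        ## Exon name (text after "=" in the query ID), then the Rtemp scaffold it hit
        scaff_to_exons_table.write("%s\t%s\n" % (exon.split("=")[1], retained_blast_hits[exon]["Ref_hit_id"]))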

In [146]:
## Look at one retained exon to check the layout of the records
first_exon = next(iter(retained_blast_hits))

print("%s %s" % (first_exon.split("=")[1], retained_blast_hits[first_exon]["Ref_hit_id"]))

LOC100494812-like scaffold121.1|size320889
