To mask with a custom repeat library, I need to make a FASTA file with the PASTEC classifications in the headers. RepeatMasker should then hopefully produce meaningful output. 

One thing to think about, though, is that each PASTEC classification comes with a confidence index (CI). If a CI is low, I could decide, for example, not to keep the annotation and class the sequence as "No category". But what CI do I use as a cutoff? 

So the first thing I'll do is make a histogram of the CIs to see whether any values stand out. 



In [1]:
from matplotlib import pyplot as plt

### The output of PASTEC looks like this:

1. sequence name
2. sequence length
3. strand: "+" or "-" or "."
4. "ok" or "PotentialChimeric":
   - "ok" means that only one classification was found
   - "PotentialChimeric" means that several classifications are possible for this sequence. In this case, the best classification is given according to the confidence index. If no decision is possible, all the classifications are returned in the "order" field (separated by "|").
5. class classification: "I" or "II" or "noCat" or "NA"
6. order classification ("LTR" and/or "TIR" and/or "LINE" and/or "Crypton", ...) or "PotentialHostGene" or "rDNA" or "SSR" or "noCat"
7. completeness: "complete" or "incomplete" or "NA"
8. confidence index ("CI=") and evidences. The confidence index is computed according to the evidence found for this classification (the best CI is 100). The evidences are separated into 2 types: structural ("struct=") and homology ("coding="). The evidences unused for the considered classification are in the "other=" section.

"noCat" means that no classification was found at this level.
"NA" means "not available", according to the information in the "order" field.

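Since the CI sits in the last column as `CI=<number>`, one way to collect the values for the histogram is a regular expression over each line. The sketch below is only an illustration: the helper names `extract_ci` and `bin_cis`, the bin width of 10, and the two example lines (including their evidence text) are made up, and it assumes the CI is written as an integer and that lines without a numeric CI should simply be skipped.

```python
import re
from collections import Counter

# Assumption: the confidence index appears as "CI=<integer>" somewhere in the line.
CI_PATTERN = re.compile(r"CI=(\d+)")

def extract_ci(line):
    """Return the confidence index from one .classif line, or None if absent."""
    match = CI_PATTERN.search(line)
    return int(match.group(1)) if match else None

def bin_cis(lines, bin_width=10):
    """Count CIs per bin; each key is the lower edge of a bin."""
    counts = Counter()
    for line in lines:
        ci = extract_ci(line)
        if ci is not None:
            # With bin_width=10, CI=100 lands in its own top bin (lower edge 100)
            counts[(ci // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Made-up example lines in the 8-column layout described above
example_lines = [
    "seq1\t5120\t+\tok\tI\tLTR\tincomplete\tCI=57; coding=(Gypsy)",
    "seq2\t830\t.\tok\tnoCat\tnoCat\tNA\tCI=NA",
]
print(bin_cis(example_lines))
```

With the real file, the raw list of CIs (rather than the binned counts) could be passed to `plt.hist` to draw the histogram.
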
### And the recommended format for IDs in a custom library is:

`repeatname#class/subclass`  
or simply  
`repeatname#class`  

In this format, the data will be processed (overlapping repeats are
merged etc), alternative output (.ace or .gff) can be created and an
overview .tbl file will be created. Classes that will be displayed in
the .tbl file are 'SINE', 'LINE', 'LTR', 'DNA', 'Satellite', anything
with 'RNA' in it, 'Simple_repeat', and 'Other' or 'Unknown' (the
latter defaults when class is missing). Subclasses are plentiful. They
are not all tabulated in the .tbl file or necessarily spelled
identically as in the repeat files, so check the RepeatMasker.embl
file for names that can be parsed into the .tbl file.

So below I'll parse the PASTEC output and construct the header in that format.

In [29]:
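## Read the PASTEC classification table; columns are whitespace-separated
with open("/Users/danieljeffries/Data/M_huetii_genome/PASTEC/PASTEC_homology_1.classif", 'r') as classif_file:
    PASTEClassifications = classif_file.readlines()

headers = {}

for line in PASTEClassifications:

    fields = line.split()
    if not fields:
        continue  ## skip blank lines

    seq_id = fields[0]
    rep_class = fields[4]  ## "I", "II", "noCat" or "NA"; not used in the header
    rep_order = fields[5]  ## e.g. "LTR", "TIR", "LINE", "PotentialHostGene"

    if rep_class == "noCat":
        rep_class = "Unknown"
    if rep_order == "noCat":
        rep_order = "Unknown"

    ## Find the subclass. From the PASTEC output for Mercurialis I only see Gypsy & Copia, so I only look for these

    subclass = "Unknown"

    for field in fields:
        if "Gypsy" in field:
            subclass = "Gypsy"
        elif "Copia" in field:
            subclass = "Copia"

    ## The PASTEC order (not the class) goes in the RepeatMasker "class" slot
    headers[seq_id] = "%s#%s/%s" % (seq_id, rep_order, subclass)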
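
Once a cutoff has been chosen from the histogram, the header construction could downgrade low-confidence annotations to `Unknown`. This is a minimal sketch, not something I have run on the real data: `build_header` is a hypothetical helper, the default `min_ci=50` is a placeholder rather than a recommendation, and keeping the `PotentialHostGene` label regardless of CI is my own assumption so that those sequences can still be filtered out afterwards.

```python
def build_header(seq_id, rep_order, subclass, ci, min_ci=50):
    """Build a 'repeatname#class/subclass' header, dropping weak annotations.

    ci is the confidence index as an int, or None when it is unavailable.
    min_ci is a placeholder cutoff; a real value should come from the CI histogram.
    """
    if rep_order == "noCat":
        rep_order = "Unknown"
    low_confidence = ci is None or ci < min_ci
    # Assumption: keep the host-gene label even at low CI so it can be removed later
    if low_confidence and rep_order != "PotentialHostGene":
        rep_order, subclass = "Unknown", "Unknown"
    return "%s#%s/%s" % (seq_id, rep_order, subclass)

print(build_header("seq1", "LTR", "Gypsy", 80))
print(build_header("seq2", "LTR", "Copia", 20))
print(build_header("seq3", "PotentialHostGene", "Unknown", 10))
```

Whether a missing CI should count as low confidence is a judgment call; here it does.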

### Now go through the replong fasta file and change the headers

I'll also remove the sequences that were annotated as potential host genes

In [31]:
from Bio import SeqIO

in_path = "/Users/danieljeffries/Data/M_huetii_genome/M_annua_homologs/result.fa"
out_path = "/Users/danieljeffries/Data/M_huetii_genome/M_annua_homologs/result_classif.fa"

with open(in_path, 'r') as in_handle, open(out_path, 'w') as out_fasta_handle:
    for seq in SeqIO.parse(in_handle, "fasta"):
        if seq.id in headers:
            ## Skip sequences that PASTEC annotated as potential host genes
            if "PotentialHostGene" not in headers[seq.id]:
                seq.id = headers[seq.id]
                ## Clear the description, otherwise the old ID is written after the new one
                seq.description = ""
                SeqIO.write(seq, out_fasta_handle, "fasta")

Ended up with 1619 sequences after removing 270 potential host gene sequences. 
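
As a sanity check on counts like these, the headers can be tallied by class before anything is written. This is a sketch under assumptions: `count_classes` is a hypothetical helper, the example dictionary is made up, and the tally only matches the output FASTA if every ID in the dictionary also occurs in the input FASTA.

```python
from collections import Counter

def count_classes(headers):
    """Tally 'name#class/subclass' headers by the class between '#' and '/'."""
    counts = Counter()
    for header in headers.values():
        # rsplit in case the sequence name itself contains '#'
        rep_class = header.rsplit("#", 1)[1].split("/", 1)[0]
        counts[rep_class] += 1
    return counts

# Made-up headers in the same format as the dictionary built earlier
example_headers = {
    "a": "a#LTR/Gypsy",
    "b": "b#PotentialHostGene/Unknown",
    "c": "c#Unknown/Unknown",
    "d": "d#LTR/Copia",
}
counts = count_classes(example_headers)
kept = sum(n for cls, n in counts.items() if cls != "PotentialHostGene")
print(counts["PotentialHostGene"], kept)
```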