## What is it good for?
Consensus files of TE sequences often need tedious pre-processing: LTR transposable elements are split into their internal and LTR parts. If you want to use those TE sequences for an alignment, you have to restore the full sequence of each TE by flanking the internal part with the LTR part on both sides (like this: LTR|Internal|LTR).
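
As a minimal illustration of the LTR|Internal|LTR layout, the sketch below rebuilds a full-length element from two strings. The function name `flank_with_ltr` and the toy sequences are hypothetical and are not taken from the script.

```python
def flank_with_ltr(internal_seq, ltr_seq):
    """Return the full-length element: LTR + internal part + LTR."""
    return ltr_seq + internal_seq + ltr_seq


# Toy sequences, far shorter than real consensus sequences.
full_length = flank_with_ltr(internal_seq="ACGTACGT", ltr_seq="TGTT")
print(full_length)  # TGTTACGTACGTTGTT
```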

The goal of this script is to simplify this task and to provide a correctly reconstructed new fasta file.

## How to use it?

This script needs two files: a classic **consensus fasta file** (for example, the Dfam families.fa file present in the folder) and a **dictionary in tsv format** that maps each internal part to its LTR part.

1) Generating the dictionary

You can write the dictionary yourself (TE_name, TE_intern_part, TE_LTR_part, tab-separated):

Dictionary example:  
`Copia  Copia_I    Copia_LTR`  
`Roo   Roo-I_DM   Roo-LTR_DM`

OR generate it automatically using the 'generate_dictionnary.py' script and the default list of suffixes ("standard_suffixes.tsv"):

`generate_dictionnary.py families.fa standard_suffixes.tsv > families.dictionnary.tsv`

You can also specify a custom list of suffixes in a tsv file, with the first line listing the internal suffixes and the second line listing the LTR suffixes, separated by tabs.

Example:  
`_I -I_DM`  
`_LTR   -LTR_DM`
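
To make the suffix file format concrete, here is a small sketch of how its two lines can be read and used to classify a sequence name. It follows the same idea as the dictionary generator but is not the script itself: `parse_suffixes` and `classify` are hypothetical helper names, and trying the longest suffixes first is an assumption made here so that a short suffix never shadows a longer one.

```python
def parse_suffixes(text):
    """Split the two-line suffix table into (internal, LTR) suffix lists."""
    lines = text.splitlines()
    return lines[0].split(), lines[1].split()


def classify(name, internal_suffixes, ltr_suffixes):
    """Return (TE_name, part), where part is 'internal', 'LTR' or None."""
    candidates = [(suffix, "internal") for suffix in internal_suffixes]
    candidates += [(suffix, "LTR") for suffix in ltr_suffixes]
    # Assumption: longest suffixes are tried first.
    for suffix, part in sorted(candidates, key=lambda c: len(c[0]), reverse=True):
        if name.endswith(suffix):
            return name[:-len(suffix)], part
    return name, None


internal, ltr = parse_suffixes("_I\t-I_DM\n_LTR\t-LTR_DM\n")
print(classify("Copia_I", internal, ltr))     # ('Copia', 'internal')
print(classify("Roo-LTR_DM", internal, ltr))  # ('Roo', 'LTR')
print(classify("FOO", internal, ltr))         # ('FOO', None)
```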

**It is advised to review the generated dictionary and manually curate it if needed.**

2) Get the new consensus fasta file using this command:

`TE_LTR_flanker.py families.fa families.dictionnary.tsv > new_families.fa`

Notes:

TE sequences that are not present in the dictionary are written unchanged to the output.  
The name used for the whole TE in the new fasta is the name of its internal part. (TODO: maybe give the option of providing a third column with TE names?)
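
The two notes above can be sketched as follows, assuming the sequences are already loaded into a plain dict that maps sequence names to sequences. This illustrates the described behavior and is not the code of `TE_LTR_flanker.py`: `rebuild_records` is a hypothetical name, and dropping the standalone LTR record of a matched pair is an assumption, since the notes do not say what happens to it.

```python
def rebuild_records(records, te_dict):
    """records: {seq_name: sequence}; te_dict: {TE_name: [internal_name, ltr_name]}."""
    # Map each internal part name to the name of its LTR part.
    ltr_of = {pair[0]: pair[1] for pair in te_dict.values() if len(pair) == 2}
    paired_ltrs = set(ltr_of.values())
    rebuilt = {}
    for name, seq in records.items():
        if name in ltr_of:
            ltr_seq = records[ltr_of[name]]
            # The full element keeps the name of its internal part.
            rebuilt[name] = ltr_seq + seq + ltr_seq
        elif name in paired_ltrs:
            # Assumption: the standalone LTR of a matched pair is not repeated.
            continue
        else:
            # Not in the dictionary: written unchanged.
            rebuilt[name] = seq
    return rebuilt


records = {"Copia_I": "ACGT", "Copia_LTR": "TG", "FOO": "GGCC"}
te_dict = {"Copia": ["Copia_I", "Copia_LTR"], "FOO": []}
print(rebuild_records(records, te_dict))
# {'Copia_I': 'TGACGTTG', 'FOO': 'GGCC'}
```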




In [25]:
# import warnings
# from warnings import warn
import logging
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

Dfam_consensus_fasta = "families.fa"
suffix_file = "standard_suffixes.tsv"



In [26]:
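## Dictionary generator

def generate_matching_pairs_from_suffixes(consensus_fasta, suffix_file):
    """Build the tab-separated TE dictionary from a consensus fasta and a suffix file."""
    # First line of the suffix file: internal suffixes; second line: LTR suffixes.
    with open(suffix_file, 'r') as handle:
        I_suffix_list = handle.readline().split()
        LTR_suffix_list = handle.readline().split()
    I_suffix_set = set(I_suffix_list)
    LTR_suffix_set = set(LTR_suffix_list)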

    def is_LTR_part(name):
        return any(name.endswith(suffix) for suffix in LTR_suffix_set)

    def is_I_part(name):
        return any(name.endswith(suffix) for suffix in I_suffix_set)

    def remove_suffix(seq_ID):
        suffix_list = list(LTR_suffix_set) + list(I_suffix_set)
        for suffix in suffix_list:
            if seq_ID.endswith(suffix):
                return seq_ID[:-len(suffix)]
        return seq_ID

    I_part_dict = {}
    LTR_part_dict = {}
    not_splitted_TE = []

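    # Classify every fasta header by its suffix; sequence lines are ignored.
    with open(consensus_fasta, 'r') as handle:
        for line in handle:
            if line.startswith('>'):
                # The sequence ID is taken as the last whitespace-separated field of the header.
                seq_ID = line.split()[-1].replace('>', '')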
                TE_name = remove_suffix(seq_ID)
                if is_LTR_part(seq_ID):
                    LTR_part_dict[TE_name] = seq_ID
                elif is_I_part(seq_ID):
                    I_part_dict[TE_name] = seq_ID
                else:
                    not_splitted_TE.append(seq_ID)
    
    matching_TE_parts = set(I_part_dict.keys()).intersection(set(LTR_part_dict.keys()))
    unmatched_I_list = [I_part_dict[x] for x in set(I_part_dict.keys()).difference(set(LTR_part_dict.keys()))]
    unmatched_LTR_list = [LTR_part_dict[x] for x in set(LTR_part_dict.keys()).difference(set(I_part_dict.keys()))]
    if len(unmatched_I_list + unmatched_LTR_list) > 0 :
        logging.error("These TE names ends with a recognized suffix, but couldn't be matched. You might want to manually cure some of them in the dictionnary :\n" + ', '.join(unmatched_I_list) + "\n" + ', '.join(unmatched_LTR_list) + "\n")
    TE_dictionnary = ""
    for TE_name in matching_TE_parts:
        TE_dictionnary += "\t".join([TE_name, I_part_dict[TE_name], LTR_part_dict[TE_name]]) + "\n"
    TE_dictionnary += '\n'.join(unmatched_I_list + unmatched_LTR_list + not_splitted_TE)
    return TE_dictionnary

print(generate_matching_pairs_from_suffixes(Dfam_consensus_fasta, suffix_file))


ERROR:root:These TE names ends with a recognized suffix, but couldn't be matched. You might want to manually cure some of them in the dictionnary :
HMSBEAGLE_I, TOM_I
Stalker3_LTR, Gypsy6A_LTR, DMTOM1_LTR, DM412B_LTR, Gypsy12A_LTR

Invader6	Invader6_I	Invader6_LTR
BATUMI	BATUMI_I	BATUMI_LTR
MICROPIA	MICROPIA_I	MICROPIA_LTR
NOMAD	NOMAD_I	NOMAD_LTR
ZAM	ZAM_I	ZAM_LTR
Gypsy9	Gypsy9_I	Gypsy9_LTR
ROVER	ROVER-I_DM	ROVER-LTR_DM
STALKER4	STALKER4_I	STALKER4_LTR
MDG1	MDG1_I	MDG1_LTR
Copia2	Copia2_I	Copia2_LTR_DM
Gypsy5	Gypsy5_I	Gypsy5_LTR
MDG3	MDG3_I	MDG3_LTR
DIVER	DIVER_I	DIVER_LTR
GTWIN	GTWIN_I	GTWIN_LTR
BURDOCK	BURDOCK_I	BURDOCK_LTR
Gypsy7	Gypsy7_I	Gypsy7_LTR
IDEFIX	IDEFIX_I	IDEFIX_LTR
Invader1	Invader1_I	Invader1_LTR
Chouto	Chouto_I	Chouto_LTR
BLASTOPIA	BLASTOPIA_I	BLASTOPIA_LTR
Bica	Bica_I	Bica_LTR
Invader2	Invader2_I	Invader2_LTR
QUASIMODO2	QUASIMODO2-I_DM	QUASIMODO2-LTR_DM
QUASIMODO	QUASIMODO_I	QUASIMODO_LTR
Stalker2	Stalker2_I	Stalker2_LTR
Gypsy3	Gypsy3_I	Gypsy3_LTR
DM1731	DM1731_I	DM173

In [None]:
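## Generating new fasta with flanked internal parts using the dictionary

def import_dictionnary(TE_dictionnary):
    """Read the tsv dictionary file into {TE_name: [I_part, LTR_part]} ([] for unsplit TEs)."""
    TE_dict = dict()

    with open(TE_dictionnary, 'r') as dictionnary:
        for line in dictionnary:
            fields = line.split()
            if len(fields) == 3:
                TE_name, I_part, LTR_part = fields
                TE_dict[TE_name] = [I_part, LTR_part]
            elif len(fields) == 1:
                # Single-column line: a TE that is not split into parts.
                TE_dict[fields[0]] = []
            elif len(fields) == 0:
                # Skip blank lines.
                continue
            else:
                raise ValueError("Incorrect dictionary format. Make sure every line either contains 3 columns corresponding to the TE_name, TE_intern_part, TE_LTR_part OR only contains 1 column corresponding to the TE_name.")
    return TE_dict

# import_dictionnary expects the path to a dictionary file, such as the one written in step 1.
print(import_dictionnary("families.dictionnary.tsv"))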

def generating_LTR_flanked_fasta_file(consensus_fasta, TE_dictionnary):
    