## What is it good for?
Consensus files of TE sequences often need tedious pre-processing: LTR transposable elements are split into their internal and LTR parts. If you want to use those TE sequences for an alignment, you have to restore the full sequence of each TE by flanking the internal part with the LTR part (like this: LTR|Internal|LTR).

The goal of this script is to simplify this task and to provide a correctly reconstructed new fasta file.
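
The reconstruction itself is a plain concatenation of the LTR, the internal part, and the LTR again. Here is a minimal sketch of that step, assuming both parts are already loaded as strings; `flank_with_ltr` and the toy sequences are hypothetical and only illustrate the idea, they are not taken from the script:

```python
def flank_with_ltr(internal_seq: str, ltr_seq: str) -> str:
    """Return the full-length element: LTR + internal part + LTR."""
    return ltr_seq + internal_seq + ltr_seq


# Toy sequences, made up for illustration only.
full_length = flank_with_ltr(internal_seq="ACGTACGT", ltr_seq="TGTTG")
print(full_length)  # TGTTGACGTACGTTGTTG
```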

## How to use it?

This script needs two files: a classic **consensus fasta file** (for example the Dfam families.fa file present in the folder) and a **dictionary in tsv format** that maps each internal part to its LTR part.

1) Generating the dictionary

You can write the dictionary yourself, one TE per line with tab-separated columns (TE_name, TE_intern_part, TE_LTR_part):

Dictionary example:  
`Copia  Copia_I    Copia_LTR`  
`Roo   Roo-I_DM   Roo-LTR_DM`

OR generate it automatically using the 'generate_dictionnary.py' script and the default list of suffixes ("standard_suffixes.tsv"):

`generate_dictionnary.py families.fa standard_suffixes.tsv > families.dictionnary.tsv`

You can also specify a custom list of suffixes in a tsv file: the first line lists the internal suffixes and the second line lists the LTR suffixes, each separated by tabs.

Example:  
`_I -I_DM`  
`_LTR   -LTR_DM`
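
To see how such a suffix file drives the classification of sequence names, here is a small sketch. It assumes the two-line layout described above; `parse_suffix_file`, `classify` and the name `Unsplit_TE` are hypothetical and are not part of the scripts:

```python
def parse_suffix_file(text):
    """Split the two-line suffix file into (internal suffixes, LTR suffixes)."""
    lines = text.splitlines()
    internal_suffixes = tuple(lines[0].split())
    ltr_suffixes = tuple(lines[1].split())
    return internal_suffixes, ltr_suffixes


def classify(name, internal_suffixes, ltr_suffixes):
    """Label a sequence name as 'LTR', 'internal' or 'unsplit' from its suffix."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    if name.endswith(ltr_suffixes):
        return "LTR"
    if name.endswith(internal_suffixes):
        return "internal"
    return "unsplit"


# Same content as the custom suffix example, with real tab characters.
suffix_text = "_I\t-I_DM\n_LTR\t-LTR_DM\n"
internal_suffixes, ltr_suffixes = parse_suffix_file(suffix_text)

for name in ["Copia_I", "Roo-LTR_DM", "Unsplit_TE"]:
    print(name, classify(name, internal_suffixes, ltr_suffixes))
```

Names that match no suffix are treated as complete elements that need no reconstruction.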

**It is advised to review the generated dictionary and curate it manually if needed.**
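
One way to support this review is to load the dictionary and list the entries whose parts do not appear among the fasta headers. The sketch below assumes the three-column tab-separated layout described earlier; `read_te_dictionary` and `find_missing_parts` are hypothetical helpers, not functions provided by the scripts:

```python
import csv
import io


def read_te_dictionary(handle):
    """Parse a 3-column TSV (TE name, internal part, LTR part) into a dict."""
    te_pairs = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got {row!r}")
        te_name, internal_name, ltr_name = (field.strip() for field in row)
        te_pairs[te_name] = (internal_name, ltr_name)
    return te_pairs


def find_missing_parts(te_pairs, fasta_names):
    """Return {TE name: [part names absent from fasta_names]}."""
    missing = {}
    for te_name, parts in te_pairs.items():
        absent = [part for part in parts if part not in fasta_names]
        if absent:
            missing[te_name] = absent
    return missing


# Toy data for illustration; Roo-LTR_DM is deliberately left out of the headers.
dictionary_tsv = io.StringIO("Copia\tCopia_I\tCopia_LTR\nRoo\tRoo-I_DM\tRoo-LTR_DM\n")
fasta_names = {"Copia_I", "Copia_LTR", "Roo-I_DM"}

te_pairs = read_te_dictionary(dictionary_tsv)
print(find_missing_parts(te_pairs, fasta_names))
```

Any TE reported here points at a dictionary line worth checking by hand.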

2) Get the new consensus fasta file using this command:

`TE_LTR_flanker.py families.fa families.dictionnary.tsv > new_families.fa`

Notes:

TE sequences that are not present in the dictionary are written unchanged to the output.  
The name used for the whole TE in the new fasta is the name of the internal part. (TODO: maybe give the option of providing a third column with TE names?)
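
The two rules in these notes can be sketched as follows. This is an illustration under stated assumptions, not the code of `TE_LTR_flanker.py`: the helpers `parse_fasta` and `rebuild_ltr_elements` are hypothetical, the sequence name is assumed to be the first word of the header, and standalone LTR records that were used for a reconstruction are assumed to be dropped from the output (the README does not say what happens to them).

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first word of the header line.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def rebuild_ltr_elements(records, te_pairs):
    """Flank each internal part listed in te_pairs with its LTR.

    te_pairs maps a TE name to (internal part name, LTR part name).
    """
    # Only keep pairs for which both parts exist in the fasta.
    ltr_of = {internal: ltr for internal, ltr in te_pairs.values()
              if internal in records and ltr in records}
    paired_ltrs = set(ltr_of.values())
    rebuilt = {}
    for name, seq in records.items():
        if name in ltr_of:
            ltr_seq = records[ltr_of[name]]
            rebuilt[name] = ltr_seq + seq + ltr_seq  # keeps the internal part's name
        elif name in paired_ltrs:
            continue  # assumption: LTR records used above are not written separately
        else:
            rebuilt[name] = seq  # not in the dictionary: written unchanged
    return rebuilt


# Toy input, made up for illustration.
fasta_text = ">Copia_I\nACGT\nACGT\n>Copia_LTR\nTGTTG\n>Unsplit_TE\nGGCC\n"
te_pairs = {"Copia": ("Copia_I", "Copia_LTR")}

for name, seq in rebuild_ltr_elements(parse_fasta(fasta_text), te_pairs).items():
    print(f">{name}\n{seq}")
```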




import warnings

Dfam_consensus_fasta = "families.fa"
suffix_file = "standard_suffixes.tsv"

def generate_matching_pairs_from_suffixes(consensus_fasta, suffix_file):
    # The suffix file holds two lines: internal-part suffixes, then LTR-part suffixes.
    with open(suffix_file, 'r') as suffix_handle:
        I_suffix_list = suffix_handle.readline().split()
        LTR_suffix_list = suffix_handle.readline().split()
    I_suffix_set = set(I_suffix_list)
    LTR_suffix_set = set(LTR_suffix_list)
    print(I_suffix_set)
    print(LTR_suffix_set)

    def is_LTR_part(name):
        # str.endswith accepts a tuple of candidate suffixes.
        return name.endswith(tuple(LTR_suffix_set))

    def is_I_part(name):
        return name.endswith(tuple(I_suffix_set))

    def remove_suffix(seq_ID):
        # Try longer suffixes first so the most specific match is stripped.
        suffix_list = sorted(LTR_suffix_set | I_suffix_set, key=len, reverse=True)
        for suffix in suffix_list:
            if seq_ID.endswith(suffix):
                return seq_ID[:-len(suffix)]
        return seq_ID

    I_part_dict = {}
    LTR_part_dict = {}
    not_splitted_TE = []

    with open(consensus_fasta, 'r') as fasta_handle:
        for line in fasta_handle:
            if line.startswith('>'):
                # The sequence ID is taken from the last whitespace-separated field of the header.
                seq_ID = line.split()[-1].replace('>', '')
                TE_name = remove_suffix(seq_ID)
                if is_LTR_part(seq_ID):
                    LTR_part_dict[TE_name] = seq_ID
                elif is_I_part(seq_ID):
                    I_part_dict[TE_name] = seq_ID
                else:
                    not_splitted_TE.append(seq_ID)
    
    # TE names for which both an internal part and an LTR part were found.
    matching_TE_parts = set(I_part_dict).intersection(LTR_part_dict)
    # Parts that carry a recognized suffix but whose counterpart is missing.
    unmatched_I_list = [I_part_dict[x] for x in set(I_part_dict).difference(LTR_part_dict)]
    unmatched_LTR_list = [LTR_part_dict[x] for x in set(LTR_part_dict).difference(I_part_dict)]

    if unmatched_I_list or unmatched_LTR_list:
        warnings.simplefilter("default")
        warnings.warn(
            "These TE names end with a recognized suffix, but couldn't be matched. "
            "You might want to manually curate them in the dictionnary:\n"
            + str(unmatched_I_list) + "\n" + str(unmatched_LTR_list))

    # Build the dictionnary: one tab-separated line (TE name, internal part, LTR part) per matched TE.
    TE_dictionnary = ""
    for TE_name in sorted(matching_TE_parts):
        TE_dictionnary += "\t".join([TE_name, I_part_dict[TE_name], LTR_part_dict[TE_name]]) + "\n"
    return TE_dictionnary

# The call prints the two suffix sets; a warning is also emitted if some parts are unmatched.
TE_dictionnary = generate_matching_pairs_from_suffixes(Dfam_consensus_fasta, suffix_file)


{'_I', '-I_DM'}
{'-LTR_DM', '_LTR_DM', '_LTR'}
