# Concatenation

In this example we concatenate the alignments of phylome 15 in three different ways:

1. Standard 1-to-1 concatenation
2. Collapse lineage-specific duplications
3. Get orthologous subtrees

First, load the functions.

In [None]:
import analyse_phylome as ap

Next, we set the variables used throughout the example. We need the best_trees file and the uncompressed directory of alignments for the phylome. We then create the pathfile, which is used later to locate the correct alignment files.

In [None]:
readalPath = "/home/giacomo/master-thesis/Second_proj/obtain_phylome_data_scripts/readal"

outDir = "out_dir15"
# uncompressed PDB algs directory 
path = "test_data/all_algs"
pathsFile = outDir + "/paths15.txt"
# PDB best trees
treeFile = "test_data/best_trees15.txt"
trees121File = outDir + "/phy_15_trees121.txt"
tag = "alg_aa"

ap.create_folder(outDir)
ap.create_pathfile(pathsFile, path, tag)

Then we need the species-to-age (spe2age) dictionary for the phylome. It can be built from a species tree (for example, one obtained from duptree). The values in the dictionary may be wrong, since only the keys matter. In fact, a list of species codes also works, except for concatenation method 2, where the trees are rerooted with the spe2age dictionary before duplications are collapsed.

In [None]:
# Either with a list (methods 1 and 3)
spe_list = ap.get_all_species(treeFile)

# or with spe2age dict
sptree = ap.load_species_tree("test_data/rooted_phylome_15_sp_duptree.txt")
spe2age = ap.build_sp2age(sptree, "341454")
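
Since only the keys matter for methods 1 and 3, a placeholder dictionary could also be built from a plain list of species codes. The sketch below is plain Python and not part of analyse_phylome: the helper name `placeholder_spe2age`, the dummy age value, and the example codes are assumptions made for illustration.

```python
def placeholder_spe2age(species_codes, dummy_age=1):
    """Map every species code to the same dummy age (hypothetical helper)."""
    # De-duplicate and sort so the resulting dictionary is deterministic.
    return {code: dummy_age for code in sorted(set(species_codes))}


# Example species codes, invented for illustration only.
example = placeholder_spe2age(["341454", "9606", "9606", "10090"])
print(example)
```

Such a dictionary carries no real age information, so it should not be used for method 2, which relies on the ages to reroot the trees.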

#### Method 1

In the first method we extract the 1-to-1 trees and concatenate the corresponding alignments. By default, alignments containing at least 90% of the species are concatenated; this threshold can be changed with the prop argument. Concatenation also stops once 100 files have been concatenated (change this with the at_least parameter) or when the maximum length (max_length, 50000 by default) is reached. The function writes the raw alignment files to a directory named out/concatenated/, together with the concatenated sequences in both fasta and phy format. One concatenated file is produced for each species count from all species down to the prop cutoff: with 10 species and prop=0.9 you get the concatenated_10 and concatenated_9 files. A stats.txt file in the same directory records the number of genes and the length of the file.
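
To make the effect of prop concrete, the following sketch lists the species counts for which a concatenated file would be expected. It is an illustration only: the helper `concatenation_levels` is hypothetical, and rounding the cutoff up is an assumption about how the library treats fractional values.

```python
import math


def concatenation_levels(n_species, prop=0.9):
    """Species counts from all species down to the prop cutoff (illustrative)."""
    # Round before taking the ceiling to avoid floating-point noise
    # such as 7.000000000000001 being pushed up to 8.
    minimum = math.ceil(round(n_species * prop, 9))
    return list(range(n_species, minimum - 1, -1))


levels = concatenation_levels(10, prop=0.9)
file_stems = [f"concatenated_{n}" for n in levels]
print(file_stems)
```

With 10 species and prop=0.9 this gives the two levels named in the text, 10 and 9.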

In [None]:
# Obtain the 1-to-1 trees
ap.obtain_121_trees(treeFile, trees121File)

help(ap.build_concatenated_alg)
ap.build_concatenated_alg(trees121File, spe2age, pathsFile, outDir, readalPath, prop=0.9, at_least=100, max_length=50000)
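
The idea behind a 1-to-1 tree can be pictured with a small check on leaf names: a tree qualifies when no species contributes more than one sequence. This is a minimal sketch, not the code of ap.obtain_121_trees. It assumes leaf names of the form `<protein id>_<species code>`, and the names `species_of` and `is_one_to_one` are hypothetical.

```python
from collections import Counter


def species_of(leaf_name):
    # Assumption: the species code is the part after the last underscore.
    return leaf_name.rsplit("_", 1)[-1]


def is_one_to_one(leaf_names):
    """True when every species is represented by exactly one leaf."""
    counts = Counter(species_of(name) for name in leaf_names)
    return all(n == 1 for n in counts.values())


# Invented leaf names for illustration.
single_copy = is_one_to_one(["Phy001_A", "Phy002_B", "Phy003_C"])
with_duplication = is_one_to_one(["Phy001_A", "Phy002_A", "Phy003_B"])
print(single_copy, with_duplication)
```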

#### Method 2

With this method we first collapse lineage-specific duplications, which may yield more 1-to-1 trees. Then we concatenate the corresponding alignments. The output is the same as before.

In [None]:
help(ap.build_extra_concatenated_alg2)
ap.build_extra_concatenated_alg2(treeFile, pathsFile, outDir, readalPath, spe2age, min=15, at_least=100)
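
Collapsing a lineage-specific duplication can be sketched on a toy tree written as nested tuples: any clade whose leaves all belong to one species is replaced by a single representative leaf. This is an illustration under stated assumptions, not the library's implementation. It skips the rerooting step, assumes leaf names of the form `<gene id>_<species code>`, and keeps the first leaf arbitrarily; all function names are hypothetical.

```python
def species_of(leaf_name):
    # Assumption: the species code follows the last underscore.
    return leaf_name.rsplit("_", 1)[-1]


def leaves(tree):
    """Yield the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaves(child)


def collapse_lineage_specific(tree):
    """Replace every single-species clade with one representative leaf."""
    if isinstance(tree, str):
        return tree
    names = list(leaves(tree))
    if len({species_of(name) for name in names}) == 1:
        # Keep the first leaf; a real implementation may choose differently.
        return names[0]
    return tuple(collapse_lineage_specific(child) for child in tree)


# Species A has two in-paralogs (g1 and g2) that form their own clade.
tree = ((("g1_A", "g2_A"), "g3_B"), "g4_C")
collapsed = collapse_lineage_specific(tree)
print(collapsed)
```

After collapsing, the toy tree has one leaf per species, so it would pass a 1-to-1 filter that the original tree fails.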

#### Method 3

In this method we try to extract a 1-to-1 orthologous subtree from trees with duplications. Then we concatenate as before.

In [None]:
help(ap.build_extra_concatenated_alg3)
ap.build_extra_concatenated_alg3(treeFile, pathsFile, outDir, readalPath, spe2age, min=15, at_least=100)
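
One way to picture the extraction of an orthologous subtree is to search a tree for its largest clade in which every species occurs only once. The sketch below does this on a nested-tuple toy tree. Choosing the largest clade is an assumption made for illustration and may differ from the criterion used by ap.build_extra_concatenated_alg3; the leaf-name format and all function names are hypothetical as well.

```python
def species_of(leaf_name):
    # Assumption: the species code follows the last underscore.
    return leaf_name.rsplit("_", 1)[-1]


def leaves(tree):
    """Yield the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaves(child)


def n_leaves(tree):
    return sum(1 for _ in leaves(tree))


def best_single_copy_subtree(tree):
    """Return the largest clade in which no species is repeated."""
    species = [species_of(name) for name in leaves(tree)]
    if len(species) == len(set(species)):
        # Also covers single leaves, which are trivially single copy.
        return tree
    # The clade contains a duplication, so look inside each child.
    candidates = [best_single_copy_subtree(child) for child in tree]
    return max(candidates, key=n_leaves)


# Species A and B are duplicated; the left clade covers A, B and C once each.
tree = ((("g1_A", "g2_B"), "g3_C"), ("g4_A", "g5_B"))
subtree = best_single_copy_subtree(tree)
print(subtree)
```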