# Concatenation

In this example we will concatenate the alignments of phylome 15 in three different ways:

1. Standard 1-to-1 concatenation
2. Collapse lineage-specific duplications
3. Get orthologous subtrees

First, load the functions.

In [1]:
import os
os.chdir('..')

In [2]:
import analyse_phylome as ap

We can first check whether the data we need are on the FTP server with ap.get_ftp_stats(). It returns every file in the directory of that phylome together with its size in bytes (you need a network connection to run this).

In [3]:
phy_id = ["15"]
ap.get_ftp_stats(phy_id)

{'phylome_0015': {'all_protein_names.txt.gz': 1176790,
  'all_algs.tar.gz': 44447239,
  'orthologs.txt.gz': 358637,
  'all_gene_names.txt.gz': 534174,
  'all_trees.tar.gz': 2040889,
  'best_trees.txt.gz': 1015595,
  'phylome_info.txt.gz': 1079,
  'all_id_conversion.txt.gz': 1207330}}
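The sizes are easier to read once converted from bytes. The sketch below is only an illustration: it copies the dictionary printed above into a variable named stats, and human_size is a hypothetical helper, not part of analyse_phylome.

```python
# Dictionary copied from the output of ap.get_ftp_stats() shown above
stats = {'phylome_0015': {'all_protein_names.txt.gz': 1176790,
  'all_algs.tar.gz': 44447239,
  'orthologs.txt.gz': 358637,
  'all_gene_names.txt.gz': 534174,
  'all_trees.tar.gz': 2040889,
  'best_trees.txt.gz': 1015595,
  'phylome_info.txt.gz': 1079,
  'all_id_conversion.txt.gz': 1207330}}


def human_size(n_bytes):
    """Format a byte count with binary units (hypothetical helper)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


for phylome, files in stats.items():
    print(phylome)
    # Largest files first
    for name, size in sorted(files.items(), key=lambda item: -item[1]):
        print(f"  {name:<28}{human_size(size):>12}")
    total = sum(files.values())
    print(f"  {'total':<28}{human_size(total):>12}")
```

This makes it clear at a glance that all_algs.tar.gz accounts for most of the download.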

You can use the following function to download the files from the FTP server. By default it downloads the best_trees, all_algs and phylome_info files. The files still have to be extracted afterwards.

In [None]:
outDir = "out_dir15"

ap.create_folder(outDir)
ap.get_ftp_files(15, outdir=outDir)
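The extraction step can be done with the Python standard library. The following is a minimal sketch, assuming that the downloaded files keep the names listed on the FTP server (for example all_algs.tar.gz and best_trees.txt.gz) and sit directly in outDir; extract_downloads is a hypothetical helper, not part of analyse_phylome.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_downloads(directory):
    """Unpack every .tar.gz and gunzip every other .gz file in directory."""
    directory = Path(directory)
    # sorted() builds the full list first, so files created below are not revisited
    for archive in sorted(directory.glob("*.gz")):
        if archive.name.endswith(".tar.gz"):
            # Only extract archives from a source you trust
            with tarfile.open(archive) as tar:
                tar.extractall(path=directory)
        else:
            # e.g. best_trees.txt.gz -> best_trees.txt
            target = archive.with_suffix("")
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)


# Assumption: same output directory as in the cell above
if Path("out_dir15").is_dir():
    extract_downloads("out_dir15")
```

The compressed files are left in place, so the step can be repeated safely.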

First of all, we set the variables we need. We need the best_trees file and the uncompressed directory of the phylome alignments. Then we create the pathfile that will be used later to locate the correct files.

In [6]:
readalPath = ("/home/giacomo/master-thesis/Second_proj/obtain_phylome_data_scripts/readal")

# uncompressed PDB algs directory 
path = "test_data/all_algs"
pathsFile = outDir + "/paths15.txt"
# PDB best trees
treeFile = "test_data/best_trees15.txt"
trees121File = outDir + "/phy_15_trees121.txt"
tag = "alg_aa"

ap.create_pathfile(pathsFile, path, tag)
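Before relying on the pathfile, it can help to check that the alignment directory really contains files carrying the tag. The sketch below is an independent sanity check and makes no claim about how ap.create_pathfile works or what it writes; find_tagged_files is a hypothetical helper, and matching the tag against the file name is an assumption.

```python
import os


def find_tagged_files(root, tag):
    """Return the sorted paths of all files under root whose name contains tag."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if tag in name:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Same values as in the cell above
alg_dir = "test_data/all_algs"
alg_tag = "alg_aa"
if os.path.isdir(alg_dir):
    found = find_tagged_files(alg_dir, alg_tag)
    print(f"{len(found)} files contain '{alg_tag}' in their name")
```

If the count is zero, the directory path or the tag is probably wrong.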

Then we need the species-to-age (spe2age) dictionary for the phylome. It can be built from a species tree (for example, one obtained from duptree). The values in the dictionary may be wrong, as only the keys matter. In fact, a list of the species codes also works, except for concatenation method 2, which reroots the trees with the spe2age dictionary before collapsing duplications.

In [7]:
# Either with a list (methods 1 and 3)
spe_list = ap.get_all_species(treeFile)

# or with spe2age dict
sptree = ap.load_species_tree("test_data/rooted_phylome_15_sp_duptree.txt")
spe2age = ap.build_sp2age(sptree, "341454")

In [10]:
# Alternatively you can use the json file with sp2age dict for most phylomes:
import json

with open('data/root_phy.json') as f:
    all_dicts = json.load(f)
    
phy_id = '15'
spe2age_stored = all_dicts[phy_id]
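Since only the keys matter for methods 1 and 3, a placeholder dictionary can stand in for a real one. The sketch below uses made-up species codes and a hypothetical helper, same_species, to compare the species sets coming from two sources; it is not suitable for method 2, which needs real ages.

```python
# Hypothetical species codes, for illustration only
species_codes = ["SPA", "SPB", "SPC"]

# Every species gets the same dummy age: only the keys are meaningful here
placeholder_spe2age = {code: 1 for code in species_codes}


def same_species(first, second):
    """True if two collections (lists or dicts) name the same set of species."""
    return set(first) == set(second)


# A dict iterates over its keys, so it can be compared directly with a list
print(same_species(species_codes, placeholder_spe2age))
```

The same comparison can be used to check that spe_list, spe2age and spe2age_stored agree on the species in the phylome.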

#### Method 1

In the first method we extract the 1-to-1 trees and concatenate the corresponding alignments. By default, alignments containing at least 90% of the species are concatenated; this threshold can be changed with the prop argument. Concatenation stops once 100 files have been concatenated (change this with the at_least parameter) or when the maximum length (50000 by default) is reached.

The function writes the raw alignment files to a directory named out/concatenated/, together with the concatenated sequences in both fasta and phy format. One concatenated file is produced for each species count from all species down to prop: with 10 species and prop=0.9 you get the concatenated_10 and concatenated_9 files. A stats.txt file in the same directory reports the number of genes and the length of each file.

Finally, there is an option to build the partition model file needed for a partitioned analysis. Setting partition=True creates the RAxML partition file using the best model from the best trees file. For method 1 the user has to provide the path to the best trees file; for the other two methods setting partition=True is enough.

In [None]:
#obtain the 1-to-1 trees
ap.obtain_121_trees(treeFile, trees121File)

help(ap.build_concatenated_alg)
ap.build_concatenated_alg(trees121File, spe2age, pathsFile, outDir, readalPath, prop=0.9, at_least=100, max_length=50000, partition=True, treeFile=treeFile)
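To see which concatenated files to expect for a given prop, the species counts can be worked out by hand. The sketch below reproduces the example from the text (10 species, prop=0.9); species_thresholds is a hypothetical helper, and rounding the lower bound up is an assumption, since the exact rounding used by the library is not stated here.

```python
import math


def species_thresholds(n_species, prop=0.9):
    """Species counts from n_species down to the smallest count allowed by prop."""
    # The small epsilon guards against floating-point noise in n_species * prop
    minimum = math.ceil(n_species * prop - 1e-9)
    return list(range(n_species, minimum - 1, -1))


for count in species_thresholds(10, prop=0.9):
    print(f"concatenated_{count}")
```

Lowering prop lengthens the list, so more (and progressively sparser) concatenated files are produced.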

#### Method 2

With this method we first collapse lineage-specific duplications so that more trees become 1-to-1. Then we concatenate the corresponding alignments. The output is the same as before. The min argument plays a role similar to prop: only trees with at least min species are considered.

In [None]:
help(ap.build_extra_concatenated_alg2)
ap.build_extra_concatenated_alg2(treeFile, pathsFile, outDir, readalPath, spe2age, min=15, at_least=100, partition=True)
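The idea of collapsing lineage-specific duplications can be illustrated on a toy tree. This is a simplified sketch and not the implementation used by analyse_phylome: trees are nested tuples, leaf names follow a made-up geneid_SPECIES convention, and a clade is collapsed whenever all of its leaves belong to one species.

```python
def species_of(leaf):
    """Species code of a leaf named geneid_SPECIES (hypothetical convention)."""
    return leaf.rsplit("_", 1)[-1]


def leaves(tree):
    """All leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def collapse(tree):
    """Replace every single-species clade by its first leaf."""
    if isinstance(tree, str):
        return tree
    names = leaves(tree)
    if len({species_of(name) for name in names}) == 1:
        return names[0]  # keep one copy as the representative
    return tuple(collapse(child) for child in tree)


def is_one_to_one(tree):
    """True if no species appears more than once among the leaves."""
    species = [species_of(name) for name in leaves(tree)]
    return len(species) == len(set(species))


toy_tree = ((("g1_A", "g2_A"), "g3_B"), "g4_C")  # species A is duplicated
collapsed = collapse(toy_tree)
print(is_one_to_one(toy_tree), is_one_to_one(collapsed))
```

The toy tree is not 1-to-1 because species A has two copies, but it becomes 1-to-1 once the A-only clade is collapsed, which is why this method can recover more trees than method 1.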

#### Method 3

In this method we try to get a 1-to-1 ortholog subtree from trees with duplications. Then we concatenate as before.

In [None]:
help(ap.build_extra_concatenated_alg3)
ap.build_extra_concatenated_alg3(treeFile, pathsFile, outDir, readalPath, spe2age, min=15, at_least=100)
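One simple way to picture the subtree extraction is to look for the largest subtree in which no species is repeated. The sketch below does exactly that on a toy tree; it is an illustration under assumptions (nested-tuple trees, made-up geneid_SPECIES leaf names, a hypothetical min_species cutoff) and the procedure used by analyse_phylome may well differ.

```python
def species_of(leaf):
    """Species code of a leaf named geneid_SPECIES (hypothetical convention)."""
    return leaf.rsplit("_", 1)[-1]


def leaves(tree):
    """All leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaves(child))
    return names


def subtrees(tree):
    """Yield the tree itself and every subtree below it."""
    yield tree
    if not isinstance(tree, str):
        for child in tree:
            yield from subtrees(child)


def largest_one_to_one_subtree(tree, min_species=2):
    """Largest subtree without repeated species, or None if none is big enough."""
    best, best_size = None, 0
    for sub in subtrees(tree):
        species = [species_of(name) for name in leaves(sub)]
        is_one_to_one = len(species) == len(set(species))
        if is_one_to_one and len(species) >= min_species and len(species) > best_size:
            best, best_size = sub, len(species)
    return best


toy_tree = ((("g1_A", "g2_B"), "g3_C"), (("g4_A", "g5_B"), "g6_A"))
print(largest_one_to_one_subtree(toy_tree))
```

In the toy tree the whole tree contains species A three times, so only the left clade, which holds one gene each from A, B and C, qualifies.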