# **CHAPTER 2. Phylogenomic analysis**

Import all the modules needed

In [1]:
import os
import re
import pandas as pd
from Bio import SeqIO

Create the directory to store the data

In [3]:
%%bash

mkdir phylogenomics/
mkdir phylogenomics/Proteins_renamed

Take the sequence files from the `pangenome/Annotation/Proteins_classic` directory, rename every header to the `>[GENE]` format and keep only the genes of interest

In [None]:
%%bash

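# Note: the NAD4 expression also matches NAD4L headers (every NAD4L spelling
# contains a NAD4 spelling), so the NAD4L expression never fires and NAD4L
# proteins are emitted as >NAD4. Duplicate names are numbered in a later step.
for file in pangenome/Annotation/Proteins_classic/*.prt
do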
  sed -E -e 's/^.*(ND1|nd1|NAD1|nad1).*/>NAD1/' \
         -e 's/^.*(ND2|nd2|NAD2|nad2).*/>NAD2/' \
         -e 's/^.*(ND3|nd3|NAD3|nad3).*/>NAD3/' \
         -e 's/^.*(ND4|nd4|NAD4|nad4).*/>NAD4/' \
         -e 's/^.*(ND4L|nd4l|NAD4L|nad4l|NAD4l|nad4L).*/>NAD4L/' \
         -e 's/^.*(ND5|nd5|NAD5|nad5).*/>NAD5/' \
         -e 's/^.*(ND6|nd6|NAD6|nad6).*/>NAD6/' \
         -e 's/^.*(COX1|cox1).*/>COX1/' \
         -e 's/^.*(COX2|cox2).*/>COX2/' \
         -e 's/^.*(COX3|cox3).*/>COX3/' \
         -e 's/^.*(COB|cob).*/>COB/' \
         -e 's/^.*(rps3|RPS3).*/>rps3/' \
         -e 's/^.*(ATP6|atp6).*/>ATP6/' \
         -e 's/^.*(ATP8|atp8).*/>ATP8/' \
         -e 's/^.*(ATP9|atp9).*/>ATP9/' \
         -e 's/^.*(CYTB|cytb).*/>CYTB/' "$file" | \
  awk '/^>/ {keep=0} /^>(NAD1|NAD2|NAD3|NAD4|NAD4L|NAD5|NAD6|COX1|COX2|COX3|COB|rps3|ATP6|ATP8|ATP9|CYTB)/ {keep=1} keep' \
  > "phylogenomics/Proteins_renamed/$(basename "$file" .prt)_renamed_mt_prot.fa"
done

In [19]:
! head -1 phylogenomics/Proteins_renamed/NC_015789_renamed_mt_prot.fa

>COX1


Good.
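The header normalisation can also be sketched in Python. This is a minimal illustration, not the method used above: the names `GENES` and `canonical_gene` are hypothetical, the matching is case-insensitive (an assumption that is broader than the spellings listed in the `sed` call), and the example headers are made up. In a chain of `sed` expressions, a header containing `nad4L` already matches the `NAD4` expression; the sketch avoids that by testing `NAD4L` first and by refusing matches that continue with a letter or digit.

```python
import re

# Canonical names in test order: NAD4L comes before NAD4 so the longer name wins.
GENES = ["NAD4L", "NAD1", "NAD2", "NAD3", "NAD4", "NAD5", "NAD6",
         "COX1", "COX2", "COX3", "COB", "CYTB", "rps3", "ATP6", "ATP8", "ATP9"]


def canonical_gene(header):
    """Return the canonical gene name found in a FASTA header, or None."""
    for gene in GENES:
        if gene.startswith("NAD"):
            # Accept both the NAD and the ND spelling (e.g. nad4L, ND4L).
            core = "NA?D" + gene[3:]
        else:
            core = gene
        # The lookahead rejects a match that continues with a letter or digit,
        # so "nad4" is not found inside "nad4L" and "ND1" not inside "ND10".
        if re.search(core + r"(?![0-9A-Za-z])", header, flags=re.IGNORECASE):
            return gene
    return None


# Made-up headers, for illustration only
print(canonical_gene(">lcl|hypothetical_prot_nad4L_7"))  # NAD4L
print(canonical_gene(">lcl|hypothetical_prot_nad4_8"))   # NAD4
```

Records for which `canonical_gene` returns `None` would be skipped, which mirrors the `awk` filter above.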

Download `BBMap`

In [None]:
! wget https://sourceforge.net/projects/bbmap/files/latest/download -O BBMap.tar.gz

Unarchive `BBMap` and delete the initial archive

In [None]:
! tar -xvzf BBMap.tar.gz && rm -rf BBMap.tar.gz

Create a directory to store renamed sequences

In [None]:
! mkdir phylogenomics/Proteins_renamed_r2/

Using `BBMap`, rename the sequences from the `>[GENE]` format to the `>accession_number_[GENE]` format

In [None]:
%%bash

for file in phylogenomics/Proteins_renamed/*mt_prot.fa
do
  # The accession number is the file name without its suffix
  species_name=$(basename "$file" _renamed_mt_prot.fa)
  bbmap/rename.sh in="$file" prefix="$species_name" addprefix=true out="phylogenomics/Proteins_renamed_r2/$species_name.mt_prots.fa" ignorejunk=true
done

In [21]:
! head -1 phylogenomics/Proteins_renamed_r2/NC_015789.mt_prots.fa

>NC_015789_COX1


Good.

Some genes may not be single-copy: an organism can carry several `cox1` genes, for example. After the previous steps such copies share the same header, so we need to number them!

In [3]:
def rename_sequences(input_dir, output_dir):
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Loop through each FASTA file in the directory
    for filename in os.listdir(input_dir):
        if filename.endswith(".fasta") or filename.endswith(".fa"):
            filepath = os.path.join(input_dir, filename)
            output_filepath = os.path.join(output_dir, filename)

            # Dictionary to keep track of sequence names and their counts
            name_count = {}
            
            # List to store the updated sequences
            updated_sequences = []
            
            # Read the FASTA file
            for record in SeqIO.parse(filepath, "fasta"):
                name = record.id
                
                # If the name is already seen, append a count suffix
                if name in name_count:
                    name_count[name] += 1
                    new_name = f"{name}_{name_count[name]}"
                else:
                    name_count[name] = 1
                    new_name = f"{name}_1"
                
                # Update the record ID
                record.id = new_name
                record.description = ""  # Optionally clear the description
                
                # Store the updated record
                updated_sequences.append(record)

            # Write the updated sequences to a new FASTA file
            with open(output_filepath, "w") as output_handle:
                SeqIO.write(updated_sequences, output_handle, "fasta")

In [4]:
input_dir = "phylogenomics/Proteins_renamed_r2"
output_dir = "phylogenomics/Proteins_renamed_r3"

rename_sequences(input_dir, output_dir)

In [22]:
! head -1 phylogenomics/Proteins_renamed_r3/NC_015789.mt_prots.fa

>NC_015789_COX1_1


Good.
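The numbering rule used by `rename_sequences` can be checked on a plain list of IDs, without any FASTA files. The sketch below is only an illustration: `enumerate_ids` is a hypothetical helper and the IDs are made up.

```python
from collections import Counter


def enumerate_ids(ids):
    """Append a running copy number to each ID: the first copy gets _1, the second _2, ..."""
    seen = Counter()
    numbered = []
    for name in ids:
        seen[name] += 1
        numbered.append(f"{name}_{seen[name]}")
    return numbered


# Made-up IDs: one single-copy gene and one gene present twice
print(enumerate_ids(["NC_000001_COX1", "NC_000001_NAD4", "NC_000001_NAD4"]))
# ['NC_000001_COX1_1', 'NC_000001_NAD4_1', 'NC_000001_NAD4_2']
```

Every sequence gets a suffix, including single-copy ones, which is why all names in the later tables end in `_1` or higher.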

Now we prepare to launch `Proteinortho`. `Proteinortho` is strict about its inputs: for instance, if the directory contains empty files, it will simply crash. So we delete the empty files!

In [5]:
for file in os.listdir("phylogenomics/Proteins_renamed_r3"):
    path = os.path.join("phylogenomics/Proteins_renamed_r3", file)
    if os.path.isfile(path) and os.path.getsize(path) == 0:
        os.remove(path)

Launch `Proteinortho`

In [6]:
! proteinortho phylogenomics/Proteins_renamed_r3/*.fa -cpus=10

*****************************************************************
Proteinortho with PoFF version 6.3.4 - An orthology detection tool
*****************************************************************
Using 10 CPU threads (1 threads per processes each with 10 threads), Detected 'diamond' version 2.1.11
Checking input files.
Checking phylogenomics/Proteins_renamed_r3/NC_015789.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_015991.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_023125.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_023126.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_023127.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_023128.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_025200.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_027422.mt_prots.fa... ok
Checking phylogenomics/Proteins_renamed_r3/NC_031375.mt_prots.fa... ok
Checking phylogenomics/Pro

`Proteinortho` writes its results to the current working directory (as `myproject*` files), so we move them manually afterwards!

Create a directory to store `Proteinortho` outputs

In [7]:
! mkdir phylogenomics/protein_ortho_output/

Move `Proteinortho` outputs

In [8]:
! mv myproject* phylogenomics/protein_ortho_output/

Now let us work with `Proteinortho` output

Read the `phylogenomics/protein_ortho_output/myproject.proteinortho.tsv` file into a `pandas` DataFrame

In [9]:
df_all = pd.read_csv('phylogenomics/protein_ortho_output/myproject.proteinortho.tsv', sep='\t')

This is how it looks

In [10]:
df_all.head(15)

Unnamed: 0,# Species,Genes,Alg.-Conn.,NC_015789.mt_prots.fa,NC_015991.mt_prots.fa,NC_023125.mt_prots.fa,NC_023126.mt_prots.fa,NC_023127.mt_prots.fa,NC_023128.mt_prots.fa,NC_025200.mt_prots.fa,...,NC_056146.mt_prots.fa,NC_056147.mt_prots.fa,NC_056148.mt_prots.fa,NC_056195.mt_prots.fa,NC_061762.mt_prots.fa,NC_069201.mt_prots.fa,NC_070174.mt_prots.fa,NC_071210.mt_prots.fa,NC_082275.mt_prots.fa,NC_082276.mt_prots.fa
0,22,22,1.0,NC_015789_ATP6_1,NC_015991_ATP6_1,NC_023125_ATP6_1,NC_023126_ATP6_1,NC_023127_ATP6_1,NC_023128_ATP6_1,NC_025200_ATP6_1,...,NC_056146_ATP6_1,NC_056147_ATP6_1,NC_056148_ATP6_1,NC_056195_ATP6_1,NC_061762_ATP6_1,NC_069201_ATP6_1,NC_070174_ATP6_1,NC_071210_ATP6_1,NC_082275_ATP6_1,NC_082276_ATP6_1
1,22,22,1.0,NC_015789_COX2_1,NC_015991_COX2_1,NC_023125_COX2_1,NC_023126_COX2_1,NC_023127_COX2_1,NC_023128_COX2_1,NC_025200_COX2_1,...,NC_056146_COX2_1,NC_056147_COX2_1,NC_056148_COX2_1,NC_056195_COX2_1,NC_061762_COX2_1,NC_069201_COX2_1,NC_070174_COX2_1,NC_071210_COX2_1,NC_082275_COX2_1,NC_082276_COX2_1
2,22,22,1.0,NC_015789_COX3_1,NC_015991_COX3_1,NC_023125_COX3_1,NC_023126_COX3_1,NC_023127_COX3_1,NC_023128_COX3_1,NC_025200_COX3_1,...,NC_056146_COX3_1,NC_056147_COX3_1,NC_056148_COX3_1,NC_056195_COX3_1,NC_061762_COX3_1,NC_069201_COX3_1,NC_070174_COX3_1,NC_071210_COX3_1,NC_082275_COX3_1,NC_082276_COX3_1
3,22,22,1.0,NC_015789_CYTB_1,NC_015991_CYTB_1,NC_023125_COB_1,NC_023126_COB_1,NC_023127_COB_1,NC_023128_COB_1,NC_025200_COB_1,...,NC_056146_COB_1,NC_056147_COB_1,NC_056148_COB_1,NC_056195_COB_1,NC_061762_COB_1,NC_069201_COB_1,NC_070174_COB_1,NC_071210_COB_1,NC_082275_COB_1,NC_082276_COB_1
4,22,22,0.909,NC_015789_NAD1_1,NC_015991_NAD1_1,NC_023125_NAD1_1,NC_023126_NAD1_1,NC_023127_NAD1_1,NC_023128_NAD1_1,NC_025200_NAD1_1,...,NC_056146_NAD1_1,NC_056147_NAD1_1,NC_056148_NAD1_1,NC_056195_NAD1_1,NC_061762_NAD1_1,NC_069201_NAD1_1,NC_070174_NAD1_1,NC_071210_NAD1_1,NC_082275_NAD1_1,NC_082276_NAD1_1
5,22,22,0.909,NC_015789_NAD6_1,NC_015991_NAD6_1,NC_023125_NAD6_1,NC_023126_NAD6_1,NC_023127_NAD6_1,NC_023128_NAD6_1,NC_025200_NAD6_1,...,NC_056146_NAD6_1,NC_056147_NAD6_1,NC_056148_NAD6_1,NC_056195_NAD6_1,NC_061762_NAD6_1,NC_069201_NAD6_1,NC_070174_NAD6_1,NC_071210_NAD6_1,NC_082275_NAD6_1,NC_082276_NAD6_1
6,21,21,1.0,NC_015789_ATP8_1,*,NC_023125_ATP8_1,NC_023126_ATP8_1,NC_023127_ATP8_1,NC_023128_ATP8_1,NC_025200_ATP8_1,...,NC_056146_ATP8_1,NC_056147_ATP8_1,NC_056148_ATP8_1,NC_056195_ATP8_1,NC_061762_ATP8_1,NC_069201_ATP8_1,NC_070174_ATP8_1,NC_071210_ATP8_1,NC_082275_ATP8_1,NC_082276_ATP8_1
7,21,21,1.0,NC_015789_NAD2_1,*,NC_023125_NAD2_1,NC_023126_NAD2_1,NC_023127_NAD2_1,NC_023128_NAD2_1,NC_025200_NAD2_1,...,NC_056146_NAD2_1,NC_056147_NAD2_1,NC_056148_NAD2_1,NC_056195_NAD2_1,NC_061762_NAD2_1,NC_069201_NAD2_1,NC_070174_NAD2_1,NC_071210_NAD2_1,NC_082275_NAD2_1,NC_082276_NAD2_1
8,21,21,1.0,NC_015789_NAD4_1,*,NC_023125_NAD4_1,NC_023126_NAD4_1,NC_023127_NAD4_1,NC_023128_NAD4_1,NC_025200_NAD4_1,...,NC_056146_NAD4_4,NC_056147_NAD4_4,NC_056148_NAD4_2,NC_056195_NAD4_1,NC_061762_NAD4_1,NC_069201_NAD4_1,NC_070174_NAD4_1,NC_071210_NAD4_2,NC_082275_NAD4_1,NC_082276_NAD4_1
9,21,21,1.0,NC_015789_NAD4_2,NC_015991_NAD4_1,NC_023125_NAD4_2,NC_023126_NAD4_2,NC_023127_NAD4_2,NC_023128_NAD4_2,NC_025200_NAD4_2,...,NC_056146_NAD4_1,NC_056147_NAD4_1,NC_056148_NAD4_1,*,NC_061762_NAD4_2,NC_069201_NAD4_2,NC_070174_NAD4_2,NC_071210_NAD4_1,NC_082275_NAD4_2,NC_082276_NAD4_2


We shall keep only the first 13 rows

In [None]:
# .copy() makes the slice an independent DataFrame, so the row assignment below is safe
df_all = df_all.iloc[:13].copy()

In [12]:
df_all

Unnamed: 0,# Species,Genes,Alg.-Conn.,NC_015789.mt_prots.fa,NC_015991.mt_prots.fa,NC_023125.mt_prots.fa,NC_023126.mt_prots.fa,NC_023127.mt_prots.fa,NC_023128.mt_prots.fa,NC_025200.mt_prots.fa,...,NC_056146.mt_prots.fa,NC_056147.mt_prots.fa,NC_056148.mt_prots.fa,NC_056195.mt_prots.fa,NC_061762.mt_prots.fa,NC_069201.mt_prots.fa,NC_070174.mt_prots.fa,NC_071210.mt_prots.fa,NC_082275.mt_prots.fa,NC_082276.mt_prots.fa
0,22,22,1.0,NC_015789_ATP6_1,NC_015991_ATP6_1,NC_023125_ATP6_1,NC_023126_ATP6_1,NC_023127_ATP6_1,NC_023128_ATP6_1,NC_025200_ATP6_1,...,NC_056146_ATP6_1,NC_056147_ATP6_1,NC_056148_ATP6_1,NC_056195_ATP6_1,NC_061762_ATP6_1,NC_069201_ATP6_1,NC_070174_ATP6_1,NC_071210_ATP6_1,NC_082275_ATP6_1,NC_082276_ATP6_1
1,22,22,1.0,NC_015789_COX2_1,NC_015991_COX2_1,NC_023125_COX2_1,NC_023126_COX2_1,NC_023127_COX2_1,NC_023128_COX2_1,NC_025200_COX2_1,...,NC_056146_COX2_1,NC_056147_COX2_1,NC_056148_COX2_1,NC_056195_COX2_1,NC_061762_COX2_1,NC_069201_COX2_1,NC_070174_COX2_1,NC_071210_COX2_1,NC_082275_COX2_1,NC_082276_COX2_1
2,22,22,1.0,NC_015789_COX3_1,NC_015991_COX3_1,NC_023125_COX3_1,NC_023126_COX3_1,NC_023127_COX3_1,NC_023128_COX3_1,NC_025200_COX3_1,...,NC_056146_COX3_1,NC_056147_COX3_1,NC_056148_COX3_1,NC_056195_COX3_1,NC_061762_COX3_1,NC_069201_COX3_1,NC_070174_COX3_1,NC_071210_COX3_1,NC_082275_COX3_1,NC_082276_COX3_1
3,22,22,1.0,NC_015789_CYTB_1,NC_015991_CYTB_1,NC_023125_COB_1,NC_023126_COB_1,NC_023127_COB_1,NC_023128_COB_1,NC_025200_COB_1,...,NC_056146_COB_1,NC_056147_COB_1,NC_056148_COB_1,NC_056195_COB_1,NC_061762_COB_1,NC_069201_COB_1,NC_070174_COB_1,NC_071210_COB_1,NC_082275_COB_1,NC_082276_COB_1
4,22,22,0.909,NC_015789_NAD1_1,NC_015991_NAD1_1,NC_023125_NAD1_1,NC_023126_NAD1_1,NC_023127_NAD1_1,NC_023128_NAD1_1,NC_025200_NAD1_1,...,NC_056146_NAD1_1,NC_056147_NAD1_1,NC_056148_NAD1_1,NC_056195_NAD1_1,NC_061762_NAD1_1,NC_069201_NAD1_1,NC_070174_NAD1_1,NC_071210_NAD1_1,NC_082275_NAD1_1,NC_082276_NAD1_1
5,22,22,0.909,NC_015789_NAD6_1,NC_015991_NAD6_1,NC_023125_NAD6_1,NC_023126_NAD6_1,NC_023127_NAD6_1,NC_023128_NAD6_1,NC_025200_NAD6_1,...,NC_056146_NAD6_1,NC_056147_NAD6_1,NC_056148_NAD6_1,NC_056195_NAD6_1,NC_061762_NAD6_1,NC_069201_NAD6_1,NC_070174_NAD6_1,NC_071210_NAD6_1,NC_082275_NAD6_1,NC_082276_NAD6_1
6,21,21,1.0,NC_015789_ATP8_1,*,NC_023125_ATP8_1,NC_023126_ATP8_1,NC_023127_ATP8_1,NC_023128_ATP8_1,NC_025200_ATP8_1,...,NC_056146_ATP8_1,NC_056147_ATP8_1,NC_056148_ATP8_1,NC_056195_ATP8_1,NC_061762_ATP8_1,NC_069201_ATP8_1,NC_070174_ATP8_1,NC_071210_ATP8_1,NC_082275_ATP8_1,NC_082276_ATP8_1
7,21,21,1.0,NC_015789_NAD2_1,*,NC_023125_NAD2_1,NC_023126_NAD2_1,NC_023127_NAD2_1,NC_023128_NAD2_1,NC_025200_NAD2_1,...,NC_056146_NAD2_1,NC_056147_NAD2_1,NC_056148_NAD2_1,NC_056195_NAD2_1,NC_061762_NAD2_1,NC_069201_NAD2_1,NC_070174_NAD2_1,NC_071210_NAD2_1,NC_082275_NAD2_1,NC_082276_NAD2_1
8,21,21,1.0,NC_015789_NAD4_1,*,NC_023125_NAD4_1,NC_023126_NAD4_1,NC_023127_NAD4_1,NC_023128_NAD4_1,NC_025200_NAD4_1,...,NC_056146_NAD4_4,NC_056147_NAD4_4,NC_056148_NAD4_2,NC_056195_NAD4_1,NC_061762_NAD4_1,NC_069201_NAD4_1,NC_070174_NAD4_1,NC_071210_NAD4_2,NC_082275_NAD4_1,NC_082276_NAD4_1
9,21,21,1.0,NC_015789_NAD4_2,NC_015991_NAD4_1,NC_023125_NAD4_2,NC_023126_NAD4_2,NC_023127_NAD4_2,NC_023128_NAD4_2,NC_025200_NAD4_2,...,NC_056146_NAD4_1,NC_056147_NAD4_1,NC_056148_NAD4_1,*,NC_061762_NAD4_2,NC_069201_NAD4_2,NC_070174_NAD4_2,NC_071210_NAD4_1,NC_082275_NAD4_2,NC_082276_NAD4_2


Good.

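Non-single-copy orthologs must be removed too! `Proteinortho` lists co-orthologs from one genome in a single cell, separated by commas, so we replace the commas in row 1 with `*` to flag those cells for the next step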

In [13]:
df_all.loc[1] = df_all.loc[1].apply(lambda x: x.replace(',', '*') if isinstance(x, str) else x)

In [14]:
df_all

Unnamed: 0,# Species,Genes,Alg.-Conn.,NC_015789.mt_prots.fa,NC_015991.mt_prots.fa,NC_023125.mt_prots.fa,NC_023126.mt_prots.fa,NC_023127.mt_prots.fa,NC_023128.mt_prots.fa,NC_025200.mt_prots.fa,...,NC_056146.mt_prots.fa,NC_056147.mt_prots.fa,NC_056148.mt_prots.fa,NC_056195.mt_prots.fa,NC_061762.mt_prots.fa,NC_069201.mt_prots.fa,NC_070174.mt_prots.fa,NC_071210.mt_prots.fa,NC_082275.mt_prots.fa,NC_082276.mt_prots.fa
0,22,22,1.0,NC_015789_ATP6_1,NC_015991_ATP6_1,NC_023125_ATP6_1,NC_023126_ATP6_1,NC_023127_ATP6_1,NC_023128_ATP6_1,NC_025200_ATP6_1,...,NC_056146_ATP6_1,NC_056147_ATP6_1,NC_056148_ATP6_1,NC_056195_ATP6_1,NC_061762_ATP6_1,NC_069201_ATP6_1,NC_070174_ATP6_1,NC_071210_ATP6_1,NC_082275_ATP6_1,NC_082276_ATP6_1
1,22,22,1.0,NC_015789_COX2_1,NC_015991_COX2_1,NC_023125_COX2_1,NC_023126_COX2_1,NC_023127_COX2_1,NC_023128_COX2_1,NC_025200_COX2_1,...,NC_056146_COX2_1,NC_056147_COX2_1,NC_056148_COX2_1,NC_056195_COX2_1,NC_061762_COX2_1,NC_069201_COX2_1,NC_070174_COX2_1,NC_071210_COX2_1,NC_082275_COX2_1,NC_082276_COX2_1
2,22,22,1.0,NC_015789_COX3_1,NC_015991_COX3_1,NC_023125_COX3_1,NC_023126_COX3_1,NC_023127_COX3_1,NC_023128_COX3_1,NC_025200_COX3_1,...,NC_056146_COX3_1,NC_056147_COX3_1,NC_056148_COX3_1,NC_056195_COX3_1,NC_061762_COX3_1,NC_069201_COX3_1,NC_070174_COX3_1,NC_071210_COX3_1,NC_082275_COX3_1,NC_082276_COX3_1
3,22,22,1.0,NC_015789_CYTB_1,NC_015991_CYTB_1,NC_023125_COB_1,NC_023126_COB_1,NC_023127_COB_1,NC_023128_COB_1,NC_025200_COB_1,...,NC_056146_COB_1,NC_056147_COB_1,NC_056148_COB_1,NC_056195_COB_1,NC_061762_COB_1,NC_069201_COB_1,NC_070174_COB_1,NC_071210_COB_1,NC_082275_COB_1,NC_082276_COB_1
4,22,22,0.909,NC_015789_NAD1_1,NC_015991_NAD1_1,NC_023125_NAD1_1,NC_023126_NAD1_1,NC_023127_NAD1_1,NC_023128_NAD1_1,NC_025200_NAD1_1,...,NC_056146_NAD1_1,NC_056147_NAD1_1,NC_056148_NAD1_1,NC_056195_NAD1_1,NC_061762_NAD1_1,NC_069201_NAD1_1,NC_070174_NAD1_1,NC_071210_NAD1_1,NC_082275_NAD1_1,NC_082276_NAD1_1
5,22,22,0.909,NC_015789_NAD6_1,NC_015991_NAD6_1,NC_023125_NAD6_1,NC_023126_NAD6_1,NC_023127_NAD6_1,NC_023128_NAD6_1,NC_025200_NAD6_1,...,NC_056146_NAD6_1,NC_056147_NAD6_1,NC_056148_NAD6_1,NC_056195_NAD6_1,NC_061762_NAD6_1,NC_069201_NAD6_1,NC_070174_NAD6_1,NC_071210_NAD6_1,NC_082275_NAD6_1,NC_082276_NAD6_1
6,21,21,1.0,NC_015789_ATP8_1,*,NC_023125_ATP8_1,NC_023126_ATP8_1,NC_023127_ATP8_1,NC_023128_ATP8_1,NC_025200_ATP8_1,...,NC_056146_ATP8_1,NC_056147_ATP8_1,NC_056148_ATP8_1,NC_056195_ATP8_1,NC_061762_ATP8_1,NC_069201_ATP8_1,NC_070174_ATP8_1,NC_071210_ATP8_1,NC_082275_ATP8_1,NC_082276_ATP8_1
7,21,21,1.0,NC_015789_NAD2_1,*,NC_023125_NAD2_1,NC_023126_NAD2_1,NC_023127_NAD2_1,NC_023128_NAD2_1,NC_025200_NAD2_1,...,NC_056146_NAD2_1,NC_056147_NAD2_1,NC_056148_NAD2_1,NC_056195_NAD2_1,NC_061762_NAD2_1,NC_069201_NAD2_1,NC_070174_NAD2_1,NC_071210_NAD2_1,NC_082275_NAD2_1,NC_082276_NAD2_1
8,21,21,1.0,NC_015789_NAD4_1,*,NC_023125_NAD4_1,NC_023126_NAD4_1,NC_023127_NAD4_1,NC_023128_NAD4_1,NC_025200_NAD4_1,...,NC_056146_NAD4_4,NC_056147_NAD4_4,NC_056148_NAD4_2,NC_056195_NAD4_1,NC_061762_NAD4_1,NC_069201_NAD4_1,NC_070174_NAD4_1,NC_071210_NAD4_2,NC_082275_NAD4_1,NC_082276_NAD4_1
9,21,21,1.0,NC_015789_NAD4_2,NC_015991_NAD4_1,NC_023125_NAD4_2,NC_023126_NAD4_2,NC_023127_NAD4_2,NC_023128_NAD4_2,NC_025200_NAD4_2,...,NC_056146_NAD4_1,NC_056147_NAD4_1,NC_056148_NAD4_1,*,NC_061762_NAD4_2,NC_069201_NAD4_2,NC_070174_NAD4_2,NC_071210_NAD4_1,NC_082275_NAD4_2,NC_082276_NAD4_2


Good.

Now we shall find every column that contains a `*`

In [15]:
df_all_cols_to_drop = df_all.columns[df_all.apply(lambda col: col.astype(str).str.contains(r'\*', regex=True).any())]

If we drop the `NC_015991.mt_prots.fa`, `NC_051483.mt_prots.fa` and `NC_056195.mt_prots.fa` columns, no `*` remains in our data! This leaves a set of 13 genes that all cover exactly the same 19 taxa (PERFECT!!!)

In [16]:
df_all_cols_to_drop

Index(['NC_015991.mt_prots.fa', 'NC_051483.mt_prots.fa',
       'NC_056195.mt_prots.fa'],
      dtype='object')
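The two checks above (co-orthologs marked by a comma, missing genes marked by `*`) can also be done in one pass over the raw table. The sketch below is an assumption-laden illustration rather than the code used here: `incomplete_taxa` is a hypothetical helper, it inspects every one of the first `n_rows` families (not only row 1), and the toy table is made up.

```python
import csv
import io


def incomplete_taxa(tsv_text, n_rows, n_meta_cols=3):
    """Return genome columns with a missing ('*') or multi-copy (',') cell.

    Only the first n_rows families are inspected; the first n_meta_cols
    columns ('# Species', 'Genes', 'Alg.-Conn.') are skipped.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    flagged = set()
    for i, row in enumerate(reader):
        if i >= n_rows:
            break
        for name, cell in zip(header[n_meta_cols:], row[n_meta_cols:]):
            if cell == "*" or "," in cell:
                flagged.add(name)
    # Preserve the column order of the table
    return [name for name in header[n_meta_cols:] if name in flagged]


# Made-up table with three genomes and three families
toy = (
    "# Species\tGenes\tAlg.-Conn.\tA.fa\tB.fa\tC.fa\n"
    "3\t3\t1\tA_ATP6_1\tB_ATP6_1\tC_ATP6_1\n"
    "2\t2\t1\tA_ATP8_1\t*\tC_ATP8_1\n"
    "3\t4\t0.5\tA_COX2_1\tB_COX2_1\tC_COX2_1,C_COX2_2\n"
)
print(incomplete_taxa(toy, n_rows=3))  # ['B.fa', 'C.fa']
```

Checking all rows may flag more genomes than the row-1 replacement above, so the result should be compared with `df_all_cols_to_drop` before relying on it.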

Drop'em and save the df to `phylogenomics/protein_ortho_output/All.tsv` file

In [17]:
df_all = df_all.drop(columns=df_all_cols_to_drop)
df_all.to_csv('phylogenomics/protein_ortho_output/All.tsv', sep='\t', index=False)

Now we shall start the process of extracting desired sequences

This script writes, for each family, the names of the sequences to be extracted

In [None]:
! Rscript scripts/write_family_names.R phylogenomics/protein_ortho_output/All.tsv phylogenomics/All/ phylogenomics/All_names/
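The R script itself is not shown in this notebook. As a rough idea of what such a step involves, here is a Python sketch that writes one names file per row of the table. Everything about it is an assumption: the helper name `write_family_names`, the `family<N>.names.txt` naming (inferred from the file names used in the cells further down), numbering from 1, and skipping the three metadata columns. It does not reproduce whatever the script writes to `phylogenomics/All/`.

```python
import csv
import os


def write_family_names(table_path, out_dir, n_meta_cols=3):
    """Write one 'family<N>.names.txt' file per row of a Proteinortho-style table."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(table_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header line
        for n, row in enumerate(reader, start=1):
            # Skip the metadata columns and any missing ('*') entries
            names = [cell for cell in row[n_meta_cols:] if cell != "*"]
            path = os.path.join(out_dir, f"family{n}.names.txt")
            with open(path, "w") as out:
                out.write("\n".join(names) + "\n")
            written.append(path)
    return written
```

Under these assumptions it would be called as `write_family_names("phylogenomics/protein_ortho_output/All.tsv", "phylogenomics/All_names/")`, giving one sequence name per line in each file.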

In this step we concatenate all the sequences we have been analysing into one large FASTA file

In [47]:
! cat phylogenomics/Proteins_renamed_r3/*.fa > phylogenomics/all_pep.fa

Create a directory to store the desired sequences

In [49]:
! mkdir phylogenomics/All_seqs/

Now, using `BBMap` once again, we extract the sequences of each family listed in the `Proteinortho` output!

In [None]:
%%bash

for file in phylogenomics/All_names/family*.txt
do 
  bbmap/filterbyname.sh in=phylogenomics/all_pep.fa out="phylogenomics/All_seqs/$(basename "$file" .names.txt).seq.fa" include=t names="$file" overwrite=true ignorejunk=true
done

Let us check the names of sequences in files

In [23]:
! head -1 phylogenomics/All_seqs/family1.seq.fa

>NC_015789_ATP6_1


Well, this is good. But to build the phylogenomic tree, the sequences of a given taxon must have the same name in every gene set! (e.g. not `>NC_015789_[GENE]`, but `>NC_015789`)

Create the directory to store the renamed sequences

In [52]:
! mkdir phylogenomics/All_seqs_renamed/

In [53]:
%%bash

for file in phylogenomics/All_seqs/*.fa
do
  # On header lines, keep the accession (everything before the second "_");
  # sequence lines pass through unchanged.
  # Writing with ">" instead of ">>" means a re-run overwrites rather than duplicates.
  sed '/^>/ s/\(_[^_]*\)\(_.*\)/\1/' "$file" > "phylogenomics/All_seqs_renamed/$(basename "$file")"
done

In [24]:
! head -1 phylogenomics/All_seqs_renamed/family1.seq.fa

>NC_015789


Good.
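Before aligning, it can be worth confirming that every family now holds exactly the same set of names. The sketch below shows the header shortening and such a check on toy data. The helper names (`accession_only`, `fasta_names`, `same_taxa`) and the short sequences are made up for illustration.

```python
def accession_only(header):
    """'>NC_015789_ATP6_1' -> '>NC_015789' (keep the text before the second '_')."""
    return "_".join(header.split("_")[:2])


def fasta_names(lines):
    """Collect the header names (without '>') from an iterable of FASTA lines."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def same_taxa(families):
    """True if every family (a dict: name -> FASTA lines) holds the same names."""
    name_sets = [fasta_names(lines) for lines in families.values()]
    return all(s == name_sets[0] for s in name_sets)


# Toy families with made-up sequences
fam1 = [">NC_015789_ATP6_1", "MKL", ">NC_023125_ATP6_1", "MKV"]
fam2 = [">NC_015789_COX2_1", "MAF", ">NC_023125_COX2_1", "MAY"]
renamed = {
    "family1": [accession_only(l) if l.startswith(">") else l for l in fam1],
    "family2": [accession_only(l) if l.startswith(">") else l for l in fam2],
}
print(same_taxa(renamed))  # True
```

With the original headers the check would fail, because `NC_015789_ATP6_1` and `NC_015789_COX2_1` count as different names.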

Now we have everything ready for building the phylogenomic tree!

Create the directory to store MSAs

In [54]:
%%bash

mkdir phylogenomics/MSAs/
mkdir phylogenomics/MSAs/All/

Run `MAFFT`

In [55]:
! for file in phylogenomics/All_seqs_renamed/*.fa; do mafft --auto "$file" > phylogenomics/MSAs/All/$(basename "$file" .fa).aln; done

outputhat23=16
treein = 0
compacttree = 0
stacksize: 8176 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.525
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
   10 / 19
done.

Progressive alignment ... 
STEP    18 /18 
done.
tbfast (aa) Version 7.525
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

   10 / 19
Segment   1/  1    1- 274
STEP 003-006-1  identical.   
Oscillating.

done
dvtditr (aa) Version 7.525
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
0 thread(s)


Strategy:
 L-INS-i (Probably most accurate, very slow)
 I

Create a directory to store trimmed MSAs

In [56]:
%%bash

mkdir phylogenomics/trimmed_MSAs/
mkdir phylogenomics/trimmed_MSAs/All/

Run `trimAl`

In [57]:
! for file in phylogenomics/MSAs/All/*aln; do trimal -in "$file" -out "phylogenomics/trimmed_MSAs/All/trimmed_$(basename "$file")" -automated1; done

Create a directory to store `ModelFinder` data

In [58]:
%%bash

mkdir phylogenomics/model-finder/
mkdir phylogenomics/model-finder/All/

Run `ModelFinder` (given a directory, `-s` makes `IQ-TREE` read and concatenate all the alignments in it)

In [59]:
! iqtree2 -m MFP -s phylogenomics/trimmed_MSAs/All/ --prefix phylogenomics/model-finder/All/All -T AUTO

IQ-TREE multicore version 2.4.0 for MacOS Intel 64-bit built Feb 12 2025
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong

Host:    Ilas-Mac-mini.local (SSE4.2, 16 GB RAM)
Command: iqtree2 -m MFP -s phylogenomics/trimmed_MSAs/All/ --prefix phylogenomics/model-finder/All/All -T AUTO
Seed:    594338 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Mon Mar 17 20:59:14 2025
Kernel:  SSE2 - auto-detect threads (10 CPU cores detected)

Reading 13 alignment files in directory phylogenomics/trimmed_MSAs/All/
Reading alignment file phylogenomics/trimmed_MSAs/All/trimmed_family1.seq.aln ... Fasta format detected
Reading fasta file: done in 0.000638962 secs using 62.13% CPU
Alignment most likely contains protein sequences
Alignment has 19 sequences with 258 columns, 131 distinct patterns
73 parsimony-informative, 25 singleton sites, 160 constant sites
           Gap/Ambiguity  Composition 

Get the best substitution model!

In [60]:
! head -42 phylogenomics/model-finder/All/All.iqtree | tail -6

Best-fit model according to BIC: cpREV+F+R3

List of models sorted by BIC scores: 

Model                  LogL         AIC      w-AIC        AICc     w-AICc         BIC      w-BIC
cpREV+F+R3       -32411.668   64939.336 +    0.335   64940.980 +    0.341   65307.524 +    0.923


Create a directory to store final tree files

In [61]:
%%bash

mkdir phylogenomics/tree/
mkdir phylogenomics/tree/All/

Run `IQ-TREE` with the best-fit model and 10,000 ultrafast bootstrap replicates

In [62]:
! iqtree2 -s phylogenomics/trimmed_MSAs/All/ -m cpREV+F+R3 --prefix phylogenomics/tree/All/All -bb 10000 -T AUTO

IQ-TREE multicore version 2.4.0 for MacOS Intel 64-bit built Feb 12 2025
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong

Host:    Ilas-Mac-mini.local (SSE4.2, 16 GB RAM)
Command: iqtree2 -s phylogenomics/trimmed_MSAs/All/ -m cpREV+F+R3 --prefix phylogenomics/tree/All/All -bb 10000 -T AUTO
Seed:    404982 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Mon Mar 17 21:00:50 2025
Kernel:  SSE2 - auto-detect threads (10 CPU cores detected)

Reading 13 alignment files in directory phylogenomics/trimmed_MSAs/All/
Reading alignment file phylogenomics/trimmed_MSAs/All/trimmed_family1.seq.aln ... Fasta format detected
Reading fasta file: done in 0.000556946 secs using 93.01% CPU
Alignment most likely contains protein sequences
Alignment has 19 sequences with 258 columns, 131 distinct patterns
73 parsimony-informative, 25 singleton sites, 160 constant sites
           Gap/Ambiguity  Com

Now add the `.1` version suffix to the accession numbers in the tree

In [64]:
def modify_tree_file(input_file, output_file):
    """
    Reads a Newick tree from a file, adds ".1" to all accession numbers, 
    and writes the modified tree to an output file.

    Args:
        input_file (str): Path to the input tree file.
        output_file (str): Path to save the modified tree.
    """

    def add_suffix(match):
        return match.group(0) + ".1"

    # Read tree from file
    with open(input_file, "r") as infile:
        tree_str = infile.read().strip()

    # Modify tree
    modified_tree = re.sub(r'NC_\d+', add_suffix, tree_str)

    # Write modified tree to output file
    with open(output_file, "w") as outfile:
        outfile.write(modified_tree)

    print(f"Modified tree saved to: {output_file}")

In [65]:
input_tree_file = "phylogenomics/tree/All/All.treefile"
output_tree_file = "phylogenomics/tree/All/All_ready.treefile"

modify_tree_file(input_tree_file, output_tree_file)

Modified tree saved to: phylogenomics/tree/All/All_ready.treefile
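One property worth knowing about the pattern `NC_\d+`: running the function on its own output would add a second `.1`. A variant with a negative lookahead avoids that. The sketch below is an illustration only; `add_version_suffix` is a hypothetical helper and the Newick string, including its branch lengths and support value, is made up.

```python
import re


def add_version_suffix(newick, suffix=".1"):
    """Append a version suffix to NC_ accessions that do not already carry one."""
    # (?![\d.]) requires that the match covers the whole number and is not
    # followed by a '.', so 'NC_015789.1' is left untouched on a second pass.
    return re.sub(r"NC_\d+(?![\d.])", lambda m: m.group(0) + suffix, newick)


# Made-up tree for illustration
tree = "((NC_015789:0.1,NC_023125:0.2)100:0.05,NC_025200:0.3);"
once = add_version_suffix(tree)
print(once)
# ((NC_015789.1:0.1,NC_023125.1:0.2)100:0.05,NC_025200.1:0.3);
print(add_version_suffix(once) == once)  # True
```

The character class includes digits because a lookahead on `.` alone would let the regex backtrack and match a shortened number such as `NC_01578`.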


Now please proceed to `04_ggtree_journal.R` to visualize the phylogenetic trees