# Overview
Gene trees for archaeal FsxAs are inferred by aligning sequences with MAFFT (under the L-INS-I algorithm) and running IQTree. The MSAs are also cleaned with <name of the program here> in order to test whether this changes the results. The aligned sequences are either full-length FsxA sequences or their corresponding ectodomains.
The sequences are those sent by Dr. Pablo Aguilar by email in Excel format. Following his suggestion, some minor modifications were made to the data, and a duplicated sequence was removed.

In [1]:
# importing libraries
import os
import glob
import subprocess
from Bio import SeqIO
from Bio.Seq import Seq
import pandas as pd

# creating the directories used to store data (if they don't exist yet)
allocate_dirs = ['../data/sequences', '../data/MSAs', '../data/MSAs/raw', '../data/MSAs/clean']
for directory in allocate_dirs:
    # makedirs also creates missing parent directories such as ../data
    os.makedirs(directory, exist_ok=True)

## Loading ecological data, performing minor data filtering and saving FASTAs in R
Some ectodomain sequences and complete sequences seemed to be missing, but the problem was actually on my side, introduced when converting the Excel file into a machine-readable xlsx file. It is solved here with some minor modifications using tidyverse.

In [2]:
import rpy2
%load_ext rpy2.ipython

In [3]:
%%R 

library(tidyverse)
library(magrittr)
library(glue)
library(readxl)
library(bioseq)

# loading ecological metadata
ecological_data_columns = c('HEADER','scaffold ID_ORF','scaffold ID_ORF (SPADES)','CONFIDENCE','METAGENOMICS PROJECT',
                               'TAXA','scaffold length','COMPLETE SEQUENCE','LENGTH','SIGNAL PEPTIDE?','TMs','ECTODOMAIN',
                               'ECTO LENGTH','C','Num CYS','ECTO Isoelectric point','COMMENTS','BIOSAMPLE','MG NAME',
                               'HABITAT_Detailed','Temperature_Detailed','elev mts','collec DATE','HABITAT','AUTHORS',
                               'CONTACT','PAPER DOI','ISOLATION','SOLID','AQUEOUS','SALT?','pH','T_Classified',
                               'ALT_DEPT (mts)','FILTER FRACTION','O2')

ecological_metadata_table = readxl::read_xlsx('../data/metadata/modified_FsxAs-Kosher-Taxo-Abr-2021.xlsx',
                                            col_names = ecological_data_columns, 
                                            skip = 1)
        
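# keep only rows that have both a complete sequence and an ectodomain
# (to inspect the dropped rows: ecological_metadata_table %>% dplyr::filter(is.na(ECTODOMAIN)))
ecological_metadata_table %<>% dplyr::filter(!is.na(`COMPLETE SEQUENCE`) & !is.na(`ECTODOMAIN`)) 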

# saving this data in TSV format in order to avoid problems when reading
ecological_metadata_table %>% readr::write_tsv(., '../data/metadata/modified_FsxAs-Kosher-Taxo-Abr-2021.tsv')
    
# creating FASTAs with bioseq
tibble(label = ecological_metadata_table$HEADER,
       sequence = ecological_metadata_table$`COMPLETE SEQUENCE`) %>%
    deframe() %>%
    as_aa() %>%
    bioseq::write_fasta(., '../data/sequences/FsxA_full_length.faa')
            
tibble(label = ecological_metadata_table$HEADER,
       sequence = ecological_metadata_table$`ECTODOMAIN`) %>%
    deframe() %>%
    as_aa() %>%
    bioseq::write_fasta(., '../data/sequences/FsxA_ectodomains.faa')

R[write to console]: ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──

R[write to console]: ✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

R[write to console]: ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

R[write to console]: 
Attaching package: ‘magrittr’


R[write to console]: The following object is masked from ‘package:purrr’:

    set_names


R[write to console]: The following object is masked from ‘package:tidyr’:

 

**Notes of importance**: 
- The number of sequences that actually have both a COMPLETE SEQUENCE and an ECTODOMAIN should be checked against the number mentioned by Martin by email.
- Saving the FASTAs with Bio.SeqIO in Python raised a weird error, maybe related to something else. This will be polished later, but it is the reason why the processing of FASTAs is done in two different parts of the script.

**Note**: checked the number of sequences in both FASTAs and it is OK (94 sequences).
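
The count check noted above can be repeated from Python without Biopython. The following is a minimal sketch and not part of the original analysis: `to_fasta` and `count_fasta_records` are hypothetical helper names, and the two records are invented for illustration.

```python
def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, sequence in records:
        lines.append('>' + header)
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return '\n'.join(lines) + '\n'


def count_fasta_records(fasta_text):
    """Count records by counting header lines (those starting with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith('>'))


# toy records, invented for illustration only
example_records = [('seq_A', 'MKTLLV'), ('seq_B', 'MSTAAG')]
example_fasta = to_fasta(example_records)
n_records = count_fasta_records(example_fasta)
```

On the real data, the text passed to `count_fasta_records` would be read from `../data/sequences/FsxA_full_length.faa` and `../data/sequences/FsxA_ectodomains.faa`, and both counts should match the 94 sequences reported in the note.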

## Running MSAs and MSA trimming

## Tree only with FsxA ectodomains

### MAFFT (L-INS-I algorithm)

In [4]:
# running MSA with MAFFT under L-INS-I algorithm
fasta_sets = [(fasta_file.split('/')[3].replace('.faa', ''),
              fasta_file) for fasta_file in glob.glob('../data/sequences/*faa')]
fasta_sets


[('FsxA_full_length', '../data/sequences/FsxA_full_length.faa'),
 ('FsxA_ectodomains', '../data/sequences/FsxA_ectodomains.faa'),
 ('hap2.P.HU', '../data/sequences/hap2.P.HU.faa')]

In [5]:
for fasta_set in fasta_sets:
  # unpacking variables
  tag, fasta_file = fasta_set
  msa_output = '../data/MSAs/raw/{0}.msa'.format(tag)
  if not os.path.exists(msa_output):
    # build the command as a list and close the output file once MAFFT finishes
    mafft_command = ['mafft', '--maxiterate', '1000', '--localpair', fasta_file]
    with open(msa_output, 'w') as out_file:
      subprocess.run(mafft_command, stdout = out_file)
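
The MAFFT manual gives `--localpair --maxiterate 1000` as the option pair for L-INS-i. The sketch below shows one possible way to derive the dataset tag and build that command without depending on the position of the file name in the path or on splitting a string on spaces. The helper names `fasta_tag` and `mafft_linsi_command` are hypothetical and not part of the original notebook.

```python
import os


def fasta_tag(fasta_path):
    """Return the file name without its directory or final extension."""
    return os.path.splitext(os.path.basename(fasta_path))[0]


def mafft_linsi_command(fasta_path):
    """Build the MAFFT L-INS-i command as an argument list for subprocess.run."""
    return ['mafft', '--maxiterate', '1000', '--localpair', fasta_path]


example_path = '../data/sequences/FsxA_ectodomains.faa'
example_tag = fasta_tag(example_path)
example_cmd = mafft_linsi_command(example_path)
```

Passing an argument list keeps paths containing spaces intact, which `'...'.split(' ')` would break apart.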

## Running IQTree

In [6]:
# running IQTree with model selection and maximum-likelihood tree inference,
# with 1000 ultrafast-bootstrap replicates (-B) and 1000 SH-aLRT replicates (-alrt)
for fasta_set in fasta_sets:
  # unpacking variables
  tag, fasta_file = fasta_set
  # creating directories to store results (including missing parent directories)
  os.makedirs('../data/trees/infered_by_mauricio', exist_ok=True)
  family_iqtree_dir = '../data/trees/infered_by_mauricio/{0}'.format(tag)
  msa_output = '../data/MSAs/raw/{0}.msa'.format(tag)
  if not os.path.exists(family_iqtree_dir):
    # create dir and run IQTree
    os.mkdir(family_iqtree_dir)
    iqtree_cmd = 'iqtree2 -s {0} -m TEST --threads-max 15 -alrt 1000 -B 1000 -pre {2}/{1}'.format(msa_output, tag, family_iqtree_dir).split(' ')
    subprocess.run(iqtree_cmd)
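
The loop above skips a dataset whenever its output directory already exists, so a run that failed after the directory was created would not be retried. One possible alternative, sketched here on the assumption that IQTree writes its final tree to `<prefix>.treefile`, is to test for that file instead. `iqtree_needs_run` is a hypothetical helper, and the demonstration uses a throwaway directory rather than the project paths.

```python
import os
import tempfile


def iqtree_needs_run(output_dir, tag):
    """Return True unless the expected tree file already exists."""
    return not os.path.exists(os.path.join(output_dir, tag + '.treefile'))


# demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    before = iqtree_needs_run(tmp_dir, 'FsxA_ectodomains')
    # simulate a finished run by creating an empty tree file
    open(os.path.join(tmp_dir, 'FsxA_ectodomains.treefile'), 'w').close()
    after = iqtree_needs_run(tmp_dir, 'FsxA_ectodomains')
```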

## Tree with FsxAs + HAP2

### Extracting HAP2 ectodomains

In [12]:
import shutil

# copy HAP2 sequences
if not os.path.exists('../data/sequences/hap2.P.HU.faa'):
    shutil.copy(src = '/media4/eletor/genomas/hap2.P.HU.faa', dst = '../data/sequences/hap2.P.HU.faa')
    
# perform search of FsxA ectodomain HMM against sequences with hmmsearch
# create directories to store results (including missing parent directories)
hap2_dirs = ['../results/extracting_HAP2_ectodomains', '../results/extracting_HAP2_ectodomains/hmmsearchout', '../results/extracting_HAP2_ectodomains/sequences']
for directory in hap2_dirs:
    os.makedirs(directory, exist_ok=True)

# copy FsxA ectodomain HMM and perform hmmsearch
if not os.path.exists('../data/sequences/fsx.ectos.hmm'):
    shutil.copy(src = '/media4/eletor/FsxA/Halobacteria/HMMfsxa/fsx.ectos.hmm', dst = '../data/sequences/fsx.ectos.hmm')
    
hmmsearchout = '../results/extracting_HAP2_ectodomains/hmmsearchout/fsxA_ectodomain_vs_HAP2s.hmmsearchout'
hmmsearchtblout = '../results/extracting_HAP2_ectodomains/hmmsearchout/fsxA_ectodomain_vs_HAP2s.tblout'
hmmsearchdomtblout = '../results/extracting_HAP2_ectodomains/hmmsearchout/fsxA_ectodomain_vs_HAP2s.domtblout'
if not os.path.exists(hmmsearchout):
  hmmsearch_cmd = 'hmmsearch -o {0} --tblout {1} --domtblout {2} --cpu 3 ../data/sequences/fsx.ectos.hmm ../data/sequences/hap2.P.HU.faa'.format(hmmsearchout, hmmsearchtblout, hmmsearchdomtblout).split(' ')
  subprocess.run(hmmsearch_cmd)

Parsing the hmmsearch output files with the R library rhmmer

In [13]:
import rpy2

In [14]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [15]:
%%R -o fsx_ectodomains_vs_HAP2s_filtered_table
# script made to parse hmmsearch results of homologous groups against ORFs
# loading libraries
library(rhmmer)
library(tidyverse)
library(magrittr)
library(glue)

# create vector with column names of hmmsearch output
domtblouts_colnames = c('target_name', 't_accession', 'tlen', 'query_name', 'q_accession', 'qlen',
                    'fullseq_evalue', 'fullseq_score', 'fullseq_bias', 'num_of_domain', 'total_hit_domains', 
                    'c-evalue', 'i-evalue', 'hmm_score', 'hmm_bias', 'hmm_coord_from', 'hmm_coord_to', 'ali_coord_from',
                    'ali_coord_to', 'env_coord_from', 'env_coord_to', 'acc', 'description_of_target')

# load every domtblout file and keep only the hits that pass the filters below
filtered_hits.tibble = list.files('../results/extracting_HAP2_ectodomains/hmmsearchout', pattern = '.domtblout', full.names=T) %>%
  as.list() %>%
  purrr::map_dfr(., ~{
    # load results
    current_domtblout.tibble = rhmmer::read_domtblout(file = .x)
    # rename columns
    colnames(current_domtblout.tibble) = domtblouts_colnames
    # filtering results
    current_domtblout.tibble %>%
      rowwise() %>%
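      # creating column with alignment length and getting qcov
      # note: ali_coord_* are positions on the target sequence, so this ratio
      # (aligned target span / HMM length) can be larger than 1
      dplyr::mutate(aln_length = abs(ali_coord_to - ali_coord_from),
                    qcov = aln_length/qlen) %>%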
      # query coverage must be at least 70%
      dplyr::filter(qcov >= 0.7) %>%
      # global sequence e-value must be at most 1e-15
      dplyr::filter(fullseq_evalue <= 1e-15)
                 })

# saving table
filtered_hits.tibble %>% readr::write_tsv(., '../results/extracting_HAP2_ectodomains/hmmsearchout/fsx_ectodomain_vs_HAP2s_hmmsearch_filtered.tsv')
        
# create variable to export to python 
fsx_ectodomains_vs_HAP2s_filtered_table = filtered_hits.tibble
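
As a cross-check on the rhmmer-based parsing, a domtblout data line can also be split directly in Python: the first 22 fields are separated by whitespace and the remainder of the line is the free-text description. This is a minimal sketch that reuses the column names from the R cell. `parse_domtblout_line` is a hypothetical helper, and the example line is entirely invented.

```python
DOMTBLOUT_FIELDS = [
    'target_name', 't_accession', 'tlen', 'query_name', 'q_accession', 'qlen',
    'fullseq_evalue', 'fullseq_score', 'fullseq_bias', 'num_of_domain',
    'total_hit_domains', 'c-evalue', 'i-evalue', 'hmm_score', 'hmm_bias',
    'hmm_coord_from', 'hmm_coord_to', 'ali_coord_from', 'ali_coord_to',
    'env_coord_from', 'env_coord_to', 'acc', 'description_of_target',
]


def parse_domtblout_line(line):
    """Parse one domtblout data line into a dict of strings.

    Returns None for comment lines (starting with '#') and blank lines.
    """
    if not line.strip() or line.startswith('#'):
        return None
    # split on whitespace at most 22 times so the description keeps its spaces
    parts = line.rstrip('\n').split(None, 22)
    # pad with empty strings if the description is missing
    parts += [''] * (len(DOMTBLOUT_FIELDS) - len(parts))
    return dict(zip(DOMTBLOUT_FIELDS, parts))


# invented line, for illustration only
toy_line = ('toy_target - 100 toy_hmm - 50 1e-20 80.0 0.5 1 1 1e-22 1e-20 '
            '79.0 0.4 3 48 10 60 8 65 0.90 toy description here')
parsed = parse_domtblout_line(toy_line)
```

All values are returned as strings, so numeric columns would still need converting with `int` or `float` before filtering.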

In [16]:
# inspect the filtered hits before extracting the matching regions from the HAP2 sequences
# important: in this table ali_coord_from and ali_coord_to refer to coordinates in the subject sequence,
# and hmm_coord_from and hmm_coord_to refer to positions in the Fsx ectodomain HMM
fsx_ectodomains_vs_HAP2s_filtered_table.head()

| | target_name | t_accession | tlen | query_name | q_accession | qlen | fullseq_evalue | fullseq_score | fullseq_bias | num_of_domain | ... | hmm_coord_from | hmm_coord_to | ali_coord_from | ali_coord_to | env_coord_from | env_coord_to | acc | description_of_target | aln_length | qcov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Cper_4312 | - | 981 | fsx.ectos.mafft | - | 484 | 7.2e-68 | 223.2 | 13.7 | 1 | ... | 8 | 446 | 25 | 521 | 21 | 587 | 0.86 | | 496 | 1.024793 |
| 2 | 000313135.1 | - | 927 | fsx.ectos.mafft | - | 484 | 3.4e-62 | 204.4 | 10.6 | 1 | ... | 6 | 470 | 34 | 580 | 29 | 604 | 0.81 | ELR19439.1 | 546 | 1.128099 |
| 3 | Sarc_20979 | - | 945 | fsx.ectos.mafft | - | 484 | 4.3e-61 | 200.8 | 10.4 | 1 | ... | 9 | 429 | 34 | 528 | 27 | 542 | 0.86 | | 494 | 1.020661 |
| 4 | Cfra_3504 | - | 1093 | fsx.ectos.mafft | - | 484 | 9.1e-57 | 186.5 | 1.1 | 1 | ... | 19 | 424 | 7 | 473 | 2 | 489 | 0.84 | | 466 | 0.96281 |
| 5 | 003719475.1 | - | 585 | fsx.ectos.mafft | - | 484 | 2.1e-56 | 185.4 | 0.0 | 1 | ... | 7 | 481 | 24 | 548 | 20 | 552 | 0.81 | RNF04482.1 | 524 | 1.082645 |
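
The `qcov` values above exceed 1 for several hits because they divide the aligned span on the target sequence by the HMM length. A coverage measured on the HMM itself would use the `hmm_coord_*` columns instead. The sketch below contrasts the two on the first row of the table. Both function names are hypothetical, and which definition fits the 70% filter is left as an open choice.

```python
def hmm_coverage(hmm_from, hmm_to, hmm_length):
    """Fraction of the HMM covered by the hit (1-based, inclusive coordinates)."""
    return (hmm_to - hmm_from + 1) / hmm_length


def target_span_ratio(ali_from, ali_to, hmm_length):
    """Aligned span on the target divided by HMM length, as computed in the R cell."""
    return abs(ali_to - ali_from) / hmm_length


# values taken from the first row of the table (Cper_4312)
cov_hmm = hmm_coverage(8, 446, 484)
cov_target = target_span_ratio(25, 521, 484)
```

Here `cov_target` reproduces the `qcov` of that row, while `cov_hmm` stays below 1 by construction as long as the coordinates lie within the model.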


### Selecting HAP2 sequences to be used
 - Only 3 or 4 seqs are going to be included. In this section I tried to retrieve some taxonomic features from NCBI in order to decide which ones (**this didn't work, so taking 5 at random**)

In [37]:
# going to do this by API calls
import requests
# import time
# from requests.adapters import HTTPAdapter
# #response = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term={0}'.format('SBT79210.1'))
# 
# page = ''
# while page == '':
#     try:
#         page = requests.get('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term={0}'.format('SBT79210.1'), HTTPAdapter(max_retries=1))
#         break
#     except:
#         print("Connection refused by the server..")
#         print("Let me sleep for 5 seconds")
#         print("ZZzzzz...")
#         time.sleep(5)
#         print("Was a nice sleep, now let me continue...")
#         continue
# response_json = response.json()

In [38]:
# importing Biopython entrez tools
from Bio import Entrez

# #help(Entrez.esummary)
# Entrez.email = 'mauricio.langleib@gmail.com'
# #handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
# #record = Entrez.read(handle)
# #record["IdList"]
# #record["IdList"][0]
# #
# #handle = Entrez.efetch(db="Taxonomy", id="158330", retmode="xml")
# #records = Entrez.read(handle)
# #
# #records[0]
# 
# handle = Entrez.esearch(db="protein", term="SBT79210.1", retmode='xml')
# record = Entrez.read(handle)
# record["IdList"]

### Checking length of Fsx ectodomains
Just to check that the obtained result makes sense...

In [39]:
from Bio import SeqIO

# loading HAP2 sequences and extracting HAP2 ectodomains
# create variable to allocate ectodomain sequences
hap2_ectodomains = []
# create dict with HAP2 sequences
hap2_seqs_dict = {record.id: record for record in SeqIO.parse('../data/sequences/hap2.P.HU.faa', 'fasta')}

# iterate over hit table and extract domains
for index, row in fsx_ectodomains_vs_HAP2s_filtered_table.iterrows():
    # get sequence id, and HMM-to-seq alignment
    sequence_id = row['target_name']
    # hmmsearch coordinates are 1-based and inclusive; cast to int in case rpy2 returned floats
    aln_start = int(row['ali_coord_from'])
    aln_stop = int(row['ali_coord_to'])
    # get sequence record
    seq_record = hap2_seqs_dict[sequence_id]
    # subset to the alignment coordinates (start shifted by one for 0-based slicing)
    # and add the word "ECTODOMAIN" to the description
    seq_record_ectodomain = seq_record[aln_start - 1:aln_stop]
    seq_record_ectodomain.description = seq_record_ectodomain.description + '_ECTODOMAIN'
    # append to ectodomains list
    hap2_ectodomains.append(seq_record_ectodomain)
    
# save HAP2 ectodomains to FASTA file
with open('../results/extracting_HAP2_ectodomains/sequences/HAP2_ectodomains.faa', 'w') as handle_fasta:
    SeqIO.write(hap2_ectodomains, handle_fasta,'fasta')
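
hmmsearch reports alignment coordinates as 1-based and inclusive, whereas Python slices are 0-based with an exclusive end, so the start has to be shifted by one when cutting a domain out of a sequence. A minimal sketch of the conversion follows. `slice_one_based` is a hypothetical helper and the sequence is a toy string.

```python
def slice_one_based(sequence, start, stop):
    """Return residues start..stop of `sequence`, using 1-based inclusive coordinates."""
    start, stop = int(start), int(stop)
    if start < 1 or stop < start:
        raise ValueError('expected 1 <= start <= stop')
    return sequence[start - 1:stop]


toy_sequence = 'ABCDEFGHIJ'
# residues 3 to 6 inclusive, i.e. four residues
toy_domain = slice_one_based(toy_sequence, 3, 6)
```

The length of the result equals `stop - start + 1`, which is a quick sanity check when applying the same slicing to `SeqRecord` objects.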

In [47]:
# selecting five HAP2 seqs at random, setting seed previously in order to get reproducible results if script runs again
import random
random.seed(22)
hap2_selected_ectodomains = random.sample(hap2_ectodomains, 5)
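
Seeding the global generator works, but any other call to `random` made before the sampling changes the outcome. One alternative is a private generator created with `random.Random(seed)`, sketched below with a hypothetical helper `sample_reproducibly` and invented identifiers.

```python
import random


def sample_reproducibly(items, k, seed):
    """Draw k items without replacement using a private, seeded generator."""
    rng = random.Random(seed)
    return rng.sample(items, k)


# invented identifiers, for illustration only
toy_ids = ['hap2_{0}'.format(i) for i in range(20)]
first_draw = sample_reproducibly(toy_ids, 5, seed=22)
second_draw = sample_reproducibly(toy_ids, 5, seed=22)
```

The draw also depends on the order of `items`, so the hit table has to be read in the same order for the selection to be repeatable.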

### Join HAP2 ectodomains with FsxA ectodomains

In [48]:
# appending to FsxA ectodomain sequences
# loading FsxA ectodomains
fsxa_ectodomains = [record for record in SeqIO.parse('../data/sequences/FsxA_ectodomains.faa', 'fasta')]
fsx_complete_set = fsxa_ectodomains + hap2_selected_ectodomains

# saving 
with open('../data/sequences/fsxA_and_hap2_ectodomains.faa', 'w') as handle_fasta:
    SeqIO.write(fsx_complete_set, handle_fasta, 'fasta')
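
Since a duplicated sequence already had to be removed from the original data, and alignment and tree programs generally expect unique sequence names, it may be worth checking the combined set for repeated identifiers before writing it. The sketch below uses a hypothetical helper `find_duplicate_ids` and invented identifiers.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the identifiers that occur more than once, in first-seen order."""
    counts = Counter(ids)
    duplicates = []
    for identifier in ids:
        if counts[identifier] > 1 and identifier not in duplicates:
            duplicates.append(identifier)
    return duplicates


# invented identifiers, for illustration only
toy_ids = ['fsxA_1', 'fsxA_2', 'hap2_1', 'fsxA_2']
duplicates = find_duplicate_ids(toy_ids)
```

In this notebook the call would be `find_duplicate_ids([record.id for record in fsx_complete_set])`, with an empty list meaning that every identifier is unique.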

### Infer MSA and phylogeny

In [49]:
# running MSA with MAFFT under L-INS-I algorithm
fasta_sets = [(fasta_file.split('/')[3].replace('.faa', ''),
              fasta_file) for fasta_file in glob.glob('../data/sequences/*faa')]

for fasta_set in fasta_sets:
  # unpacking variables
  tag, fasta_file = fasta_set
  msa_output = '../data/MSAs/raw/{0}.msa'.format(tag)
  if not os.path.exists(msa_output):
    # build the command as a list and close the output file once MAFFT finishes
    mafft_command = ['mafft', '--maxiterate', '1000', '--localpair', fasta_file]
    with open(msa_output, 'w') as out_file:
      subprocess.run(mafft_command, stdout = out_file)
    

In [50]:
# running IQTree with model selection and maximum-likelihood tree inference,
# with 1000 ultrafast-bootstrap replicates (-B) and 1000 SH-aLRT replicates (-alrt)
for fasta_set in fasta_sets:
  # unpacking variables
  tag, fasta_file = fasta_set
  # creating directories to store results (including missing parent directories)
  os.makedirs('../data/trees/infered_by_mauricio', exist_ok=True)
  family_iqtree_dir = '../data/trees/infered_by_mauricio/{0}'.format(tag)
  msa_output = '../data/MSAs/raw/{0}.msa'.format(tag)
  if not os.path.exists(family_iqtree_dir):
    # create dir and run IQTree
    os.mkdir(family_iqtree_dir)
    iqtree_cmd = 'iqtree2 -s {0} -m TEST --threads-max 15 -alrt 1000 -B 1000 -pre {2}/{1}'.format(msa_output, tag, family_iqtree_dir).split(' ')
    subprocess.run(iqtree_cmd)