# Haloferax hits of protein 2090

Experimental evidence points towards differential activity among homologs of A/B hydrolase 2090.

For instance, _Haloferax volcanii_, which we use as a model organism to ectopically express the protein, does not seem to kill _Pontibacillus SP9-4_ natively. This is surprising because its genome does encode a similar protein.

In [12]:
import os
from pathlib import Path
import re

import numpy as np
import pandas as pd
from Bio import SeqIO
from Bio import Phylo

cwd = os.getcwd()
if cwd.endswith('notebook'):
    os.chdir('..')
    cwd = os.getcwd()

from src.tree.itol_annotation import itol_labels

In [2]:
data_folder = Path('./data')
assert data_folder.is_dir()

## Load Haloferax homologs of protein 2090

In [5]:
hits = pd.read_csv(data_folder / 'hydrolase_search' / 'search_output.csv')

haloferax_hits = hits[
    (hits['gtdb_genus'] == 'Haloferax') |
    (hits['ncbi_genus'] == 'Haloferax')
].copy()

print(f'Number of Haloferax hits: {len(haloferax_hits)}')

haloferax_hits[['id', 'gtdb_id', 'db_proka_id', 'gtdb_species', 'ncbi_species', 'bits']]

Number of Haloferax hits: 29


| | id | gtdb_id | db_proka_id | gtdb_species | ncbi_species | bits |
|---|---|---|---|---|---|---|
| 0 | A0A1H7H3G6 | | | | Haloferax larsenii | 590 |
| 1 | M0GTD3 | NZ_AOLI01000034.1_29 | WP_007545272.1@GCF_000336955.1 | Haloferax larsenii | Haloferax larsenii | 581 |
| 2 | NZ_AOLK01000012.1_176 | NZ_AOLK01000012.1_176 | | Haloferax elongans | | 552 |
| 3 | M0HUC4 | | | | Haloferax elongans | 526 |
| 4 | M0I6U3 | NZ_AOLN01000018.1_554 | | Haloferax mucosum | Haloferax mucosum | 455 |
| 5 | I3R2Q4 | NC_017941.2_785 | WP_004057439.1@GCF_000306765.2 | Haloferax mediterranei | Haloferax mediterranei | 449 |
| 6 | A0A371M0N6 | | | | Haloferax sp. Atlit-4N | 440 |
| 7 | A0A871BD94 | | | | Haloferax gibbonsii | 440 |
| 8 | A0A371N295 | NZ_AOLJ01000011.1_106 | WP_004972615.1@GCF_000336775.1 | Haloferax gibbonsii | Haloferax sp. Atlit-6N | 440 |
| 9 | A0A371MMW4 | | | | Haloferax sp. Atlit-12N | 439 |


## Export metadata and sequences

In [6]:
haloferax_hits.to_csv(data_folder / 'hydrolase_search' / 'haloferax_hits' / 'haloferax_2090_homologs.csv', index=False)

In [8]:
records = []
hits_set = set(haloferax_hits['id'].values)
for r in SeqIO.parse(data_folder / 'hydrolase_search' / 'search_output.fasta', 'fasta'):
    if r.id == 'pgaptmp_002090_1' or r.id in hits_set:
        records.append(r)

assert len(records) == (len(haloferax_hits) + 1)

with (data_folder / 'hydrolase_search' / 'haloferax_hits' / 'haloferax_2090_homologs.fasta').open('w') as f_out:
    SeqIO.write(records, f_out, 'fasta')

## Tree

- Alignment from MAFFT (L-INS-I option)
- Tree inferred with IQ-TREE, using automatic model selection and ultrafast bootstrap, via the [IQ-TREE web server](http://iqtree.cibiv.univie.ac.at).

The chosen model is `JTT+G4`.

See alignment and tree logs in [`data/hydrolase_search/haloferax_hits/`](../data/hydrolase_search/haloferax_hits/).
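The tree above was built on the web server, so no command line was run in this notebook. As a rough sketch of what a local equivalent might look like, the snippet below only assembles the two commands. It assumes `mafft` and `iqtree2` are installed and on the `PATH`, that the file names are relative to the working directory, and that 1000 ultrafast bootstrap replicates are wanted (the replicate count is an assumption, not taken from the logs).

```python
import shlex
from pathlib import Path

# Hypothetical local paths; adjust to the actual project layout.
fasta_in = Path('haloferax_2090_homologs.fasta')
aln_out = Path('haloferax_2090_homologs.aln.fasta')

# MAFFT L-INS-i corresponds to --localpair --maxiterate 1000.
# MAFFT writes the alignment to standard output.
mafft_cmd = ['mafft', '--localpair', '--maxiterate', '1000', str(fasta_in)]

# IQ-TREE 2: -m MFP runs ModelFinder and then infers the tree,
# -B sets the number of ultrafast bootstrap replicates (assumed: 1000).
iqtree_cmd = ['iqtree2', '-s', str(aln_out), '-m', 'MFP', '-B', '1000']

# Print the commands rather than running them.
print(shlex.join(mafft_cmd), '>', aln_out)
print(shlex.join(iqtree_cmd))
```

To actually run them, each list can be passed to `subprocess.run(..., check=True)`, redirecting MAFFT's standard output to the alignment file. By default IQ-TREE names its outputs after the alignment file, which is consistent with the `.aln.fasta.treefile` name read later in this notebook.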

### Tree post-processing

IQ-TREE replaces every character it considers "special" in a sequence ID with an underscore.

Let's recover the original IDs.

In [11]:
tree = Phylo.read(data_folder / 'hydrolase_search' / 'haloferax_hits' / 'tree' / 'haloferax_2090_homologs.aln.fasta.treefile', 'newick')

leaf_ids = set()
for leaf in tree.get_terminals():
    if re.match(r'^.+_GC[AF]_.+$', leaf.name) is not None:
        leaf.name = leaf.name.replace('_GC', '@GC')

    leaf_ids.add(leaf.name)

final_tree_path = data_folder / 'hydrolase_search' / 'haloferax_hits' / 'tree' / 'haloferax_2090_homologs.phyloxml'
with final_tree_path.open('w') as f_out:
    Phylo.write([tree], f_out, 'phyloxml')

tree = Phylo.read(final_tree_path, 'phyloxml')

print(f'Number of leaves: {len(list(tree.get_terminals())):,}')

Number of leaves: 30
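The ID restoration used in the cell above can be checked in isolation. The sketch below wraps the same regular expression and replacement in a small function and applies it to a few leaf names. `restore_id` is a hypothetical helper name, and the first example is a made-up sanitized name built from an assembly-tagged ID of the kind seen in the `db_proka_id` column.

```python
import re

# Same pattern as in the tree post-processing cell: a name containing
# an underscore followed by a GCA_/GCF_ assembly accession.
ASSEMBLY_TAGGED = re.compile(r'^.+_GC[AF]_.+$')


def restore_id(name: str) -> str:
    """Put back the '@' between a protein accession and its assembly accession."""
    if ASSEMBLY_TAGGED.match(name) is not None:
        return name.replace('_GC', '@GC')
    return name


# Illustrative leaf names only; the first one is hypothetical.
examples = ['WP_007545272.1_GCF_000336955.1', 'M0GTD3', 'pgaptmp_002090_1']
restored = [restore_id(name) for name in examples]
print(restored)
```

Note that `str.replace` substitutes every occurrence of `_GC`, so an ID that contains that substring somewhere other than in front of the assembly accession would be altered there as well.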


### Tree annotation: labels

In [13]:
labels = []
protein_ids = sorted(leaf_ids)
metadata_df = haloferax_hits.set_index('id')
# Iterate over the sorted IDs so the label file has a reproducible order.
for protein_id in protein_ids:
    if protein_id == 'pgaptmp_002090_1':
        label = 'Haloferax sp. S5a-1 (pgaptmp_002090_1)'
        labels.append([protein_id, label])
        continue

    row = metadata_df.loc[protein_id]
    
    if not pd.isnull(row['uniprot_id']):
        species = row['ncbi_species']
        if pd.isnull(species):
            raise ValueError(f'No species specified for protein {protein_id}')
        
        uniprot_id = row['uniprot_id']
        id_str = f'UniProt: {uniprot_id}'
    else:
        species = row['gtdb_species']
        gtdb_id = row['gtdb_id']
        id_str = f'GTDB: {gtdb_id}'
    
    label = f'{species} ({id_str})'

    labels.append([protein_id, label])

itol_labels(
    labels, 
    data_folder / 'hydrolase_search' / 'haloferax_hits' / 'tree' / 'tree_labels.txt'
)
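The implementation of `itol_labels` lives in `src/tree/itol_annotation.py` and is not shown here. As a minimal sketch of what such a helper might do, the snippet below writes an iTOL `LABELS` annotation file with a tab separator. `write_itol_labels` is a hypothetical stand-in, not the project's function, and the demo labels are illustrative only.

```python
import tempfile
from pathlib import Path


def write_itol_labels(labels, output_path):
    """Hypothetical stand-in for the project's itol_labels helper.

    Writes an iTOL LABELS annotation file: a short header, then one
    tab-separated "leaf ID, label" pair per line.
    """
    lines = ['LABELS', 'SEPARATOR TAB', 'DATA']
    lines += [f'{leaf_id}\t{label}' for leaf_id, label in labels]
    Path(output_path).write_text('\n'.join(lines) + '\n')


# Illustrative labels only (the second one assumes a UniProt ID is set).
demo_labels = [
    ['pgaptmp_002090_1', 'Haloferax sp. S5a-1 (pgaptmp_002090_1)'],
    ['M0GTD3', 'Haloferax larsenii (UniProt: M0GTD3)'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'tree_labels.txt'
    write_itol_labels(demo_labels, demo_path)
    demo_text = demo_path.read_text()

print(demo_text)
```

A tab separator is used in this sketch because the labels contain spaces and could contain commas.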