# Alpha/beta hydrolase protein 2090

- Predicted to be an alpha/beta hydrolase.
- Contains the Pfam domain [DUF900 (PF05990)](https://www.ebi.ac.uk/interpro/entry/pfam/PF05990/) (positions 61-236).
- From the organism _Haloferax sp. s5a-1_, isolated from a saltern in Margherita di Savoia, Italy ([Atanasova et al., 2013](https://doi.org/10.1002/mbo3.115)).
- Sequenced and assembled by Chahrazad Taissir.

## Protein sequence

```fasta
>gnl|extdb|pgaptmp_002090_1 alpha/beta hydrolase [Haloferax sp. s5a-1] 
MASRRRFLKTTAATFAGLTVFGATSGAAASTPYISTRDHFDDDANLTSGHTARGYDTSGDVPVVDSGSTS
EIFVFAHGWDKNSDNPEQDALEKIAKADTKLTEAGYDCEVVGYTWDSDKGDGWEFGWFEAQEIAQKNGRK
LAQFALDVKRASPGTTVRFTSHSLGAQVIFSALRTLDSRSAWTDSGYTIETMHPFGAATDNEVPGKEEGR
DTYEAIQESAGHVYNYYNAADDVLQWVYNTIEFDQALGETGLEGGDTPAGNYTDRDVESQVGDDHGNYLD
TIADDIVGDI
```

## Homology search

Similar protein sequences are searched for with [`MMseqs2`](https://github.com/soedinglab/MMseqs2) in two databases: [GTDB](https://gtdb.ecogenomic.org/) and [UniProtKB](https://www.uniprot.org/help/uniprotkb).

The search procedure works as follows:

1. Search single protein sequence `pgaptmp_002090_1` (alias `2090`) in GTDB & UniProtKB.
2. Map GTDB hits to UniProtKB.
3. Map GTDB hits to our own internal DB, `db_proka`, a phylogenetically balanced subset of GTDB release 214 ([Strock et al., 2024](https://doi.org/10.1101/2024.09.18.613068)).
4. Merge datasets into one.

Steps 1 to 3 are implemented in the script [`src/homology_search/hydrolase_2090_search.sh`](src/homology_search/), which runs on Imperial's HPC.

Step 4 is implemented in this notebook.
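As a rough illustration of what mapping and merging hit tables can look like in pandas, here is a minimal sketch. It is not the notebook's actual merge code: the mapping table, the `uniprot_id` column, and the `PLACEHOLDER_1` accession are hypothetical, and only the two GTDB targets and their bit scores are taken from the search output shown in this notebook.

```python
import pandas as pd

# Toy GTDB hit table; targets and bit scores are copied from this notebook's output.
gtdb = pd.DataFrame({
    'target': ['NZ_AOLI01000034.1_29', 'NZ_AOLK01000012.1_176'],
    'bits': [581, 552],
})

# Hypothetical mapping from GTDB protein IDs to UniProtKB accessions.
# 'PLACEHOLDER_1' is not a real accession.
gtdb_to_uniprot = pd.DataFrame({
    'target': ['NZ_AOLI01000034.1_29'],
    'uniprot_id': ['PLACEHOLDER_1'],
})

# A left join keeps every GTDB hit, including those without a UniProtKB match.
merged = gtdb.merge(gtdb_to_uniprot, on='target', how='left', indicator=True)

# Turn the merge indicator into a plain boolean flag and drop the helper column.
merged['in_uniprot'] = merged['_merge'] == 'both'
merged = merged.drop(columns='_merge')

print(merged)
```

A left join with `indicator=True` makes unmapped hits explicit instead of silently dropping them, which helps when checking how many GTDB hits have a UniProtKB counterpart.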



In [7]:
import os
from pathlib import Path
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO
from Bio.SearchIO.HmmerIO.hmmer3_domtab import Hmmer3DomtabHmmqueryParser

# Run from the repository root so that relative paths such as ./data resolve
cwd = os.getcwd()
if cwd.endswith('notebook'):
    os.chdir('..')
    cwd = os.getcwd()

In [2]:
sns.set_palette('colorblind')
sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.8)
plt.rcParams['font.family'] = 'Helvetica'

palette = sns.color_palette().as_hex()


data_folder = Path('./data')
assert data_folder.is_dir()

## Process homology search results

### Parse GTDB hits

In [3]:
# Load the GTDB hits, then split the semicolon-separated lineage into one column per rank
gtdb_hits = pd.read_csv(data_folder / 'hydrolase_search' / 'gtdb.tsv', sep='\t')
gtdb_hits['domain'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[0].replace('d_', ''))
gtdb_hits['gtdb_phylum'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[1].replace('p_', ''))
gtdb_hits['gtdb_class'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[2].replace('c_', ''))
gtdb_hits['gtdb_order'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[3].replace('o_', ''))
gtdb_hits['gtdb_family'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[4].replace('f_', ''))
gtdb_hits['gtdb_genus'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[5].replace('g_', ''))
gtdb_hits['gtdb_species'] = gtdb_hits['taxlineage'].apply(lambda t: t.split(';')[6].replace('s_', ''))
gtdb_hits.head()

| Unnamed: 0 | query | target | evalue | bits | tstart | tend | taxlineage | domain | gtdb_phylum | gtdb_class | gtdb_order | gtdb_family | gtdb_genus | gtdb_species |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pgaptmp_002090_1 | NZ_AOLI01000034.1_29 | 7.391e-185 | 581 | 1 | 290 | d_Archaea;p_Halobacteriota;c_Halobacteria;o_Ha... | Archaea | Halobacteriota | Halobacteria | Halobacteriales | Haloferacaceae | Haloferax | Haloferax larsenii |
| 1 | pgaptmp_002090_1 | NZ_AOLK01000012.1_176 | 5.303e-175 | 552 | 1 | 289 | d_Archaea;p_Halobacteriota;c_Halobacteria;o_Ha... | Archaea | Halobacteriota | Halobacteria | Halobacteriales | Haloferacaceae | Haloferax | Haloferax elongans |
| 2 | pgaptmp_002090_1 | NZ_AOLN01000018.1_554 | 2.454e-141 | 455 | 1 | 290 | d_Archaea;p_Halobacteriota;c_Halobacteria;o_Ha... | Archaea | Halobacteriota | Halobacteria | Halobacteriales | Haloferacaceae | Haloferax | Haloferax mucosum |
| 3 | pgaptmp_002090_1 | NC_017941.2_785 | 3.791e-139 | 449 | 1 | 290 | d_Archaea;p_Halobacteriota;c_Halobacteria;o_Ha... | Archaea | Halobacteriota | Halobacteria | Halobacteriales | Haloferacaceae | Haloferax | Haloferax mediterranei |
| 4 | pgaptmp_002090_1 | NZ_AOLJ01000011.1_106 | 5.312e-136 | 440 | 1 | 290 | d_Archaea;p_Halobacteriota;c_Halobacteria;o_Ha... | Archaea | Halobacteriota | Halobacteria | Halobacteriales | Haloferacaceae | Haloferax | Haloferax gibbonsii |
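The lineage parsing above repeats the same split-and-strip step for each rank. A more compact alternative is sketched below, assuming every lineage has exactly seven semicolon-separated ranks. The helper name `parse_lineage` is hypothetical and not part of this repository. Unlike `str.replace`, the anchored regular expression only strips the leading rank prefix, and it accepts one or more underscores (`d_` or `d__`).

```python
import re

# Output column names, in the order the ranks appear in the lineage string.
RANKS = ('domain', 'gtdb_phylum', 'gtdb_class', 'gtdb_order',
         'gtdb_family', 'gtdb_genus', 'gtdb_species')

# Leading rank prefix: one rank letter followed by one or more underscores.
_PREFIX = re.compile(r'^[dpcofgs]_+')


def parse_lineage(taxlineage):
    """Split a lineage string into a {column name: taxon name} dict."""
    parts = taxlineage.split(';')
    if len(parts) != len(RANKS):
        raise ValueError(f'expected {len(RANKS)} ranks, got {len(parts)}: {taxlineage!r}')
    return {rank: _PREFIX.sub('', part) for rank, part in zip(RANKS, parts)}


# Lineage of the top hit, reconstructed from the parsed columns shown above.
example = ('d_Archaea;p_Halobacteriota;c_Halobacteria;o_Halobacteriales;'
           'f_Haloferacaceae;g_Haloferax;s_Haloferax larsenii')
print(parse_lineage(example))
```

Applied to the hit table, something like `pd.DataFrame(gtdb_hits['taxlineage'].map(parse_lineage).tolist(), index=gtdb_hits.index)` should yield the seven taxonomy columns in one pass, and malformed lineages raise an explicit error.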


In [4]:
n_archaea = len(gtdb_hits[gtdb_hits['domain'] == 'Archaea'])
n_bacteria = len(gtdb_hits[gtdb_hits['domain'] == 'Bacteria'])

print(f'Number of archaea: {n_archaea:,}')
print(f'Number of bacteria: {n_bacteria:,}')

Number of archaea: 309
Number of bacteria: 19


Distribution of archaeal phyla:

In [5]:
gtdb_hits[gtdb_hits['domain'] == 'Archaea']['gtdb_phylum'].value_counts()

gtdb_phylum
Halobacteriota       160
Thermoproteota       148
Nanohaloarchaeota      1
Name: count, dtype: int64

Distribution of bacterial phyla:

In [6]:
gtdb_hits[gtdb_hits['domain'] == 'Bacteria']['gtdb_phylum'].value_counts()

gtdb_phylum
Chlamydiota          5
4484-113             3
Pseudomonadota       3
Actinomycetota       3
Verrucomicrobiota    2
Bacteroidota         1
Planctomycetota      1
Gemmatimonadota      1
Name: count, dtype: int64
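The two phylum distributions can also be computed in a single pass by grouping on domain and phylum together. The sketch below uses a small toy frame with the same column names as `gtdb_hits`. The rows are illustrative only and do not reproduce the counts above.

```python
import pandas as pd

# Toy stand-in for gtdb_hits, using only the two columns needed here.
hits = pd.DataFrame({
    'domain': ['Archaea', 'Archaea', 'Archaea', 'Bacteria'],
    'gtdb_phylum': ['Halobacteriota', 'Halobacteriota',
                    'Thermoproteota', 'Chlamydiota'],
})

# Count hits per (domain, phylum) pair, largest groups first.
counts = (
    hits.groupby(['domain', 'gtdb_phylum'])
        .size()
        .sort_values(ascending=False)
)

print(counts)
```

The result is a Series indexed by `(domain, gtdb_phylum)`, so `counts.loc['Archaea']` returns the archaeal phyla on their own.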