# Tree of hydrolase 2090

- Homology search is documented in [`notebook/hydrolase_tree.ipynb`](notebook/hydrolase_tree.ipynb).
- Alignment, trimming and tree building are done in the script [`src/tree/make_tree.sh`](src/tree/make_tree.sh).

In [4]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
from Bio import SeqIO

cwd = os.getcwd()
if cwd.endswith('notebook'):
    os.chdir('..')
    cwd = os.getcwd()

In [5]:
data_folder = Path('./data')
assert data_folder.is_dir()

## Alignment

### Deleted sequences

After the proteins have been aligned and the alignment trimmed, sequences with more than 50% gaps are removed.
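
The gap filter itself lives in `src/tree/make_tree.sh`, which is not shown here. The following is only a minimal sketch of what such a filter could look like. It assumes that gaps are written as `-` or `.`, and that a sequence is dropped when its gap fraction is strictly greater than 0.5. The function names and the toy alignment are hypothetical.

```python
from __future__ import annotations

# Assumption: gap characters in the trimmed alignment are '-' or '.'.
GAP_CHARS = frozenset("-.")


def gap_fraction(aligned_seq: str) -> float:
    """Return the fraction of alignment columns that are gaps for one sequence."""
    if not aligned_seq:
        # Treat an empty sequence as all gaps so that it is always dropped.
        return 1.0
    n_gaps = sum(1 for ch in aligned_seq if ch in GAP_CHARS)
    return n_gaps / len(aligned_seq)


def split_by_gap_fraction(
    alignment: dict[str, str], max_gap_fraction: float = 0.5
) -> tuple[dict[str, str], list[str]]:
    """Split an alignment into kept sequences and sorted IDs of dropped ones.

    Assumption: a sequence is dropped only when its gap fraction is strictly
    greater than `max_gap_fraction`.
    """
    kept: dict[str, str] = {}
    dropped: list[str] = []
    for seq_id, seq in alignment.items():
        if gap_fraction(seq) > max_gap_fraction:
            dropped.append(seq_id)
        else:
            kept[seq_id] = seq
    return kept, sorted(dropped)


# Toy alignment (hypothetical data, 10 columns each).
toy_alignment = {
    "seq_a": "MKT-LV--AG",  # 3 gaps out of 10 columns: kept
    "seq_b": "M------VAG",  # 6 gaps out of 10 columns: dropped
    "seq_c": "-----LVAAG",  # 5 gaps out of 10 columns: kept (not strictly above 0.5)
}
kept, dropped = split_by_gap_fraction(toy_alignment)
print("Dropped:", dropped)
```

With Biopython records, the input dictionary could be built as `{seq_id: str(record.seq) for seq_id, record in trimmed_aln.items()}`.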

How many sequences were dropped? Which ones?

In [11]:
trimmed_aln = SeqIO.to_dict(SeqIO.parse(data_folder / 'hydrolase_tree' / 'sequences.aln.trimmed.fasta', 'fasta'))
final_aln = SeqIO.to_dict(SeqIO.parse(data_folder / 'hydrolase_tree' / 'alignment_final.fasta', 'fasta'))

dropped_ids = sorted(set(trimmed_aln.keys()) - set(final_aln.keys()))

print('Number of dropped sequences:', len(dropped_ids))

Number of dropped sequences: 6


In [14]:
metadata_df = pd.read_csv(data_folder / 'hydrolase_search' / 'search_output.csv', index_col='id')
metadata_df.loc[dropped_ids]

id,gtdb_id,db_proka_id,uniprot_id,domain,gtdb_phylum,gtdb_class,gtdb_order,gtdb_family,gtdb_genus,gtdb_species,ncbi_phylum,ncbi_class,ncbi_order,ncbi_family,ncbi_genus,ncbi_species,tstart,tend,evalue,bits
A0A5Q0UT33,NZ_CP044130.1_50,,A0A5Q0UT33,Archaea,Halobacteriota,Halobacteria,Halobacteriales,Haloarculaceae,Halomicrobium,Halomicrobium sp009617995,Euryarchaeota,Halobacteria,Halobacteriales,Haloarculaceae,Halomicrobium,Halomicrobium sp. LC1Hm,35,137,5.661e-05,57
A0A6A7LEE0,,,A0A6A7LEE0,Archaea,,,,,,,Nitrososphaerota,Nitrososphaeria,Nitrososphaerales,Nitrososphaeraceae,,Nitrososphaeraceae archaeon,10,87,7.392e-09,69
A0A835X6S7,,,A0A835X6S7,Archaea,,,,,,,Nitrososphaerota,Nitrososphaeria,Nitrosopumilales,,,Nitrosopumilales archaeon,43,147,4.548e-06,61
A0A842N6W6,,,A0A842N6W6,Archaea,,,,,,,Nitrososphaerota,,,,Candidatus Nitrosopelagicus,Candidatus Nitrosopelagicus sp.,7,142,1.903e-06,62
JAHENP010000151.1_1,JAHENP010000151.1_1,,,Archaea,Halobacteriota,Halobacteria,Halobacteriales,Haloferacaceae,Halobaculum,Halobaculum sp018609965,,,,,,,16,137,0.000567,54
WP_210424182.1@GCF_004765785.1,NZ_SBIT01000010.1_266,WP_210424182.1@GCF_004765785.1,,Archaea,Halobacteriota,Halobacteria,Halobacteriales,Haladaptataceae,Halorussus,Halorussus ruber,,,,,,,17,93,6.846999999999999e-19,100


In [15]:
pfam_df = pd.read_csv(data_folder / 'hydrolase_search' / 'search_output.pfam.csv', index_col='protein_id')
pfam_df.loc[sorted(set(pfam_df.index) & set(dropped_ids))]

protein_id,hmm_accession,hmm_query,evalue,bitscore,accuracy,start,end
A0A835X6S7,PF05990.17,DUF900,1.2e-07,26.3,0.79,31,148


- 5 of the 6 dropped sequences have no Pfam hit at all.
- 1 of the 6 (A0A835X6S7) has a single, weak Pfam hit to DUF900 (PF05990, e-value 1.2e-07), which suggests a distant homolog.
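
The counts above can be double-checked with plain set operations. This is a small sketch rather than part of the original analysis: the helper name `summarize_pfam_hits` is hypothetical, and the IDs are copied from the two tables shown earlier.

```python
def summarize_pfam_hits(dropped_ids, pfam_protein_ids):
    """Split dropped sequence IDs into those with and without any Pfam hit."""
    dropped = set(dropped_ids)
    annotated = set(pfam_protein_ids)
    with_hit = sorted(dropped & annotated)
    without_hit = sorted(dropped - annotated)
    return with_hit, without_hit


# IDs of the six dropped sequences, as listed in the metadata table.
example_dropped_ids = [
    "A0A5Q0UT33",
    "A0A6A7LEE0",
    "A0A835X6S7",
    "A0A842N6W6",
    "JAHENP010000151.1_1",
    "WP_210424182.1@GCF_004765785.1",
]
# Dropped proteins that appear in the Pfam table.
example_pfam_ids = ["A0A835X6S7"]

with_hit, without_hit = summarize_pfam_hits(example_dropped_ids, example_pfam_ids)
print(f"{len(without_hit)} of {len(example_dropped_ids)} have no Pfam hit")
print(f"{len(with_hit)} of {len(example_dropped_ids)} have at least one Pfam hit: {with_hit}")
```

In the notebook, the same helper could be called as `summarize_pfam_hits(dropped_ids, pfam_df.index)`.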