In [3]:
!pip install -qr ../requirements.txt
!bash ../mount_efs.sh

In [16]:
from dotenv import load_dotenv
import os

from graph_db.db_connection import Neo4jConnection

load_dotenv()
uri = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")

# Enzyme Analysis and Recommendations

## 1. Lipase (EC 3.1.1.3) - 12,300 results

**Structural Complexity**: Lipases typically have an α/β hydrolase fold and are often monomeric, although some form oligomers. Many exhibit interfacial activation: a lid domain covers the active site until the enzyme contacts a lipid-water interface, so both closed and open conformations have to be considered when modeling.

**Industrial Potential**: High. Used extensively in the food industry, detergents, pharmaceuticals, and biodiesel production.

**Substrate Complexity**: Lipid substrates are hydrophobic and flexible, making docking challenging. The interfacial activation mechanism adds complexity to finding accurate docking poses.

**Additional Factors**: The presence of a lid domain requires careful consideration during optimization.

**References**:
- Sharma, R., et al. (2001). Lipases: Sources, structure, properties and applications. *Biotechnology Advances*, 19(8), 627-662.
- Houde, A., et al. (2004). Lipases and their industrial applications. *Applied Biochemistry and Biotechnology*, 118(1-3), 155-170.

---

## 2. Amylase (EC 3.2.1.1) - 4,834 results

**Structural Complexity**: Generally monomeric with a well-characterized (β/α)₈-barrel structure, making them simpler for computational modeling.

**Industrial Potential**: High. Widely used in baking, brewing, textiles, detergents, and biofuel industries.

**Substrate Complexity**: Acts on starch polysaccharides. While starch is complex, smaller substrate analogs can simplify docking studies.

**Additional Factors**: Their stability and broad applicability make them ideal candidates for optimization.

**References**:
- Van der Maarel, M. J., et al. (2002). Properties and applications of starch-converting enzymes of the α-amylase family. *Journal of Biotechnology*, 94(2), 137-155.
- Gupta, R., et al. (2003). Microbial α-amylases: a biotechnological perspective. *Process Biochemistry*, 38(11), 1599-1616.

---

## 3. Cellulase (EC 3.2.1.4) - 3,551 results

**Structural Complexity**: Often modular with catalytic and carbohydrate-binding modules connected by flexible linkers, increasing complexity.

**Industrial Potential**: High. Essential in biofuel production, paper, textile, and food industries.

**Substrate Complexity**: Cellulose is an insoluble, fibrous polysaccharide, posing challenges in modeling and docking.

**Additional Factors**: The modular nature and substrate insolubility make cellulases less ideal for initial optimization efforts.

**References**:
- Wilson, D. B. (2009). Cellulases and biofuels. *Current Opinion in Biotechnology*, 20(3), 295-299.
- Baldrian, P., & Valášková, V. (2008). Degradation of cellulose by basidiomycetous fungi. *FEMS Microbiology Reviews*, 32(3), 501-521.

---

## 4. Serine Protease (Subtilisin, EC 3.4.21.62) - 38 results

**Structural Complexity**: Typically monomeric with a well-understood catalytic triad, simplifying structural analysis.

**Industrial Potential**: Moderate to High. Used in detergents, leather processing, food industry, and pharmaceuticals.

**Substrate Complexity**: Peptide substrates are small and easier to model, facilitating docking studies.

**Additional Factors**: The low number of sequences might limit diversity but simplifies selection.

**References**:
- Hedstrom, L. (2002). Serine protease mechanism and specificity. *Chemical Reviews*, 102(12), 4501-4524.
- Rao, M. B., et al. (1998). Molecular and biotechnological aspects of microbial proteases. *Microbiology and Molecular Biology Reviews*, 62(3), 597-635.

---

## 5. Lactase (EC 3.2.1.23) - 10,183 results

**Structural Complexity**: Often multimeric (e.g., *E. coli* β-galactosidase is a homotetramer), increasing optimization complexity.

**Industrial Potential**: High. Used in dairy to produce lactose-free products and in pharmaceuticals.

**Substrate Complexity**: Lactose is a disaccharide, making docking manageable.

**Additional Factors**: Multimeric nature and large size may pose challenges.

**References**:
- Juers, D. H., et al. (2001). Structural basis for the regulation of beta-galactosidase activity. *Journal of Molecular Biology*, 311(5), 951-962.
- Heyman, M. B. (2006). Lactose intolerance in infants, children, and adolescents. *Pediatrics*, 118(3), 1279-1286.

---

## 6. Xylanase (EC 3.2.1.8) - 6,496 results

**Structural Complexity**: Mostly monomeric, with some having modular architectures similar to cellulases.

**Industrial Potential**: High. Important in paper and pulp industry, animal feed, and biofuel production.

**Substrate Complexity**: Xylan is a complex polysaccharide; however, modeling smaller oligomers can simplify docking.

**Additional Factors**: Stability under extreme conditions enhances industrial applicability.

**References**:
- Polizeli, M. L., et al. (2005). Xylanases from fungi: properties and industrial applications. *Applied Microbiology and Biotechnology*, 67(5), 577-591.
- Subramaniyan, S., & Prema, P. (2002). Biotechnology of microbial xylanases: enzymology, molecular biology, and application. *Critical Reviews in Biotechnology*, 22(1), 33-64.

---

## 7. Catalase (EC 1.11.1.6) - 7,003 results

**Structural Complexity**: Commonly tetrameric, which may complicate optimization.

**Industrial Potential**: Moderate. Used in food preservation, textile bleaching, and wastewater treatment.

**Substrate Complexity**: Hydrogen peroxide is a small molecule, making docking straightforward.

**Additional Factors**: The multimeric structure might pose challenges in computational studies.

**References**:
- Chelikani, P., et al. (2004). Diversity of structures and properties among catalases. *Cellular and Molecular Life Sciences*, 61(2), 192-208.
- Kirkman, H. N., & Gaetani, G. F. (2007). Mammalian catalase: a venerable enzyme with new mysteries. *Trends in Biochemical Sciences*, 32(1), 44-50.

---

# Recommendations

Based on the analysis:

- **Amylase** and **Serine Protease** emerge as top candidates due to their monomeric nature, high industrial relevance, and manageable substrate complexity.
- **Xylanase** is also a good candidate, being mostly monomeric with significant industrial applications, although substrate complexity is higher than in amylases.
- **Lipase** could be considered if you focus on monomeric variants and are prepared to handle the complexities associated with lipid substrates and interfacial activation.

## Enzymes to Prioritize for Optimization:
- **Amylase** (EC 3.2.1.1)
- **Serine Protease** (EC 3.4.21.62)
- **Xylanase** (EC 3.2.1.8)

## Enzymes to Defer:
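- **Cellulase**: Modular architecture and an insoluble substrate complicate modeling and docking.
- **Lactase**: Multimeric structure and large size, even though its disaccharide substrate is manageable to dock.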
- **Catalase**: Multimeric structure may complicate optimization efforts.
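
The triage above can be written down as an explicit rule, which makes it easier to revisit when new enzymes are added. The sketch below is only an illustration: the structure labels are a coarse transcription of the analysis text, and the `triage` function and its three outcomes are assumptions introduced here, not part of the project code.

```python
# Qualitative labels transcribed from the analysis above. The three-way
# structure label and the triage rule are illustrative assumptions.
ENZYMES = [
    # (name, EC number, structure label, modeling caveat or None)
    ("Lipase", "3.1.1.3", "monomeric", "lid domain / interfacial activation"),
    ("Amylase", "3.2.1.1", "monomeric", None),
    ("Cellulase", "3.2.1.4", "modular", "insoluble substrate"),
    ("Serine Protease", "3.4.21.62", "monomeric", None),
    ("Lactase", "3.2.1.23", "multimeric", None),
    ("Xylanase", "3.2.1.8", "monomeric", None),
    ("Catalase", "1.11.1.6", "multimeric", None),
]


def triage(structure, caveat):
    """Defer non-monomeric enzymes; flag monomeric ones that carry a caveat."""
    if structure != "monomeric":
        return "defer"
    return "conditional" if caveat else "prioritize"


decisions = {name: triage(structure, caveat) for name, _ec, structure, caveat in ENZYMES}
prioritized = [name for name, decision in decisions.items() if decision == "prioritize"]
print(prioritized)  # ['Amylase', 'Serine Protease', 'Xylanase']
```

With these labels the rule reproduces the lists above, and Lipase comes out as "conditional", matching the note that it is worth considering only for monomeric variants.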

In [4]:
from src.databases import extract_reaction_data

reactions_path = "../data/modelSEED/reactions.json"
compounds_path = "../data/modelSEED/compounds.json"

reactions, compounds = extract_reaction_data(reactions_path, compounds_path)

In [23]:
reactions[0]

{'abbreviation': 'R00004',
 'abstract_reaction': None,
 'aliases': ['AraCyc: INORGPYROPHOSPHAT-RXN',
  'BiGG: IPP1; PPA; PPA_1; PPAm',
  'BrachyCyc: INORGPYROPHOSPHAT-RXN',
  'KEGG: R00004',
  'MetaCyc: INORGPYROPHOSPHAT-RXN',
  'Name: Diphosphate phosphohydrolase; Inorganic diphosphatase; Inorganic pyrophosphatase; Pyrophosphate phosphohydrolase; diphosphate phosphohydrolase; inorganic diphosphatase; inorganic diphosphatase (one proton translocation); inorganicdiphosphatase; pyrophosphate phosphohydrolase'],
 'code': '(1) cpd00001[0] + (1) cpd00012[0] <=> (2) cpd00009[0]',
 'compound_ids': 'cpd00001;cpd00009;cpd00012;cpd00067',
 'definition': '(1) H2O[0] + (1) PPi[0] <=> (2) Phosphate[0] + (1) H+[0]',
 'deltag': -3.46,
 'deltagerr': 0.05,
 'direction': '=',
 'ec_numbers': ['3.6.1.1'],
 'equation': '(1) cpd00001[0] + (1) cpd00012[0] <=> (2) cpd00009[0] + (1) cpd00067[0]',
 'id': 'rxn00001',
 'is_obsolete': 0,
 'is_transport': 0,
 'linked_reaction': 'rxn27946;rxn27947;rxn27948;rxn32487;
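
The `equation` field in the record above encodes stoichiometry as text. A minimal sketch of turning it into reactant and product dictionaries follows; `parse_equation` is a hypothetical helper written for this note, and it assumes every term has the `(coefficient) cpdNNNNN[compartment]` shape seen in this one record. Note that in this record the `code` field omits `cpd00067` (H+) while `equation` includes it, so the two fields are not interchangeable.

```python
import re

# One term looks like "(1) cpd00001[0]": coefficient, compound ID, compartment.
TERM = re.compile(r"\((\d+(?:\.\d+)?)\)\s+(cpd\d+)\[(\d+)\]")
# The arrow separates the two sides; "<=>" is the form seen above, the
# one-directional forms are included as an assumption.
ARROW = re.compile(r"\s(<=>|=>|<=)\s")


def parse_equation(equation):
    """Split an equation string into ({cpd: coef}, arrow, {cpd: coef})."""
    left, arrow, right = ARROW.split(equation, maxsplit=1)

    def side(text):
        return {cpd: float(coef) for coef, cpd, _compartment in TERM.findall(text)}

    return side(left), arrow, side(right)


equation = "(1) cpd00001[0] + (1) cpd00012[0] <=> (2) cpd00009[0] + (1) cpd00067[0]"
reactants, arrow, products = parse_equation(equation)
print(reactants, arrow, products)
```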

## Lipase

1. Search the candidates for BSLA-like sequences (*Bacillus subtilis* lipase A) or other lid-free lipases, using taxonomy as a first filter

In [14]:
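import pandas as pd

ec = "3.1.1.3"
df = pd.read_csv(f"outputs/enzyme_query/{ec}.tsv", sep="\t")
# Keep only rows whose GTDB classification mentions the phylum Bacillota
# (na=False treats missing classifications as non-matches).
df = df[df["gtdb_classification"].str.contains("Bacillota", na=False)]
df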

Unnamed: 0,protein_id,protein_name,protein_ec_numbers,genome_id,gtdb_classification,sample_id,sample_temperature,sample_depth,sample_latitude,sample_longitude,associated_reaction_ids


__RESULT__: the filter returns no rows, i.e. no lipase candidate has a GTDB classification containing "Bacillota".
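
Before reading the empty result as a true absence, it may be worth checking which phylum names actually occur in the table. The other outputs in this notebook show names such as `p__Proteobacteria` and `p__Actinobacteriota`; my understanding is that GTDB releases using those names label the *Bacillus* phylum `Firmicutes` rather than `Bacillota`, but treat that as an assumption to verify against the release used here. The helpers `phylum_counts` and `filter_phyla` are hypothetical, and the demo rows are made up for illustration.

```python
import re

import pandas as pd


def phylum_counts(df, column="gtdb_classification"):
    """Count rows per phylum by pulling the 'p__...' rank out of the lineage."""
    phyla = df[column].str.extract(r"p__([^;]+)", expand=False)
    return phyla.value_counts()


def filter_phyla(df, names, column="gtdb_classification"):
    """Keep rows whose lineage starts any of the given phylum names."""
    pattern = "|".join(f"p__{re.escape(name)}" for name in names)
    return df[df[column].str.contains(pattern, na=False)]


# Made-up demo rows; replace with the lipase table loaded above.
demo = pd.DataFrame({"gtdb_classification": [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    None,
]})
counts = phylum_counts(demo)
matches = filter_phyla(demo, ["Bacillota", "Firmicutes"])
print(counts)
print(len(matches))
```

Because the pattern matches on the prefix, it would also pick up suffixed phylum labels such as `Firmicutes_A` if they occur.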

## Serine Protease

1. Evaluate all 38 sequences

In [5]:
import pandas as pd
from src.databases import get_enzymes_with_complete_smiles

ec = "3.4.21.62"
df = pd.read_csv(f"outputs/enzyme_query/{ec}.tsv", sep="\t")

# Process the dataframe
filtered_df = get_enzymes_with_complete_smiles(df, reactions, compounds)
filtered_df

Unnamed: 0,protein_id,protein_name,protein_ec_numbers,genome_id,gtdb_classification,sample_id,sample_temperature,sample_depth,sample_latitude,sample_longitude,associated_reaction_ids
0,OceanDNA-b22424_00127_1,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b22424,d__Bacteria;p__Planctomycetota;c__Planctomycet...,SAMN08714628,,1.0,22.34,114.27,[rxn40535]
1,OceanDNA-b22856_00119_6,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b22856,d__Bacteria;p__Planctomycetota;c__Planctomycet...,SAMN08714535,,1.0,22.34,114.27,[rxn40535]
2,OceanDNA-b22853_00436_2,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b22853,d__Bacteria;p__Planctomycetota;c__Planctomycet...,SAMN08714535,,1.0,22.34,114.27,[rxn40535]
3,OceanDNA-b24029_00389_2,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b24029,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,SAMN13674978,,1500.0,80.03,179.55,[rxn40535]
4,OceanDNA-b21391_00015_14,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b21391,d__Bacteria;p__Patescibacteria;c__ABY1;o__GWA2...,SAMN05224487,9.181,200.0,48.591667,-123.505,[rxn40535]
5,OceanDNA-b21387_00016_14,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b21387,d__Bacteria;p__Patescibacteria;c__ABY1;o__GWA2...,SAMN05224480,9.101,150.0,48.591667,-123.505,[rxn40535]
6,OceanDNA-b16100_00582_1,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b16100,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia...,SAMN08714533,,1.0,22.2,39.04,[rxn40535]
7,OceanDNA-b22858_00035_26,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b22858,d__Bacteria;p__Planctomycetota;c__Planctomycet...,SAMN08714628,,1.0,22.34,114.27,[rxn40535]
8,OceanDNA-b34239_00006_19,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b34239,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,,,,,,[rxn40535]
9,OceanDNA-b21394_00016_1,subtilisin [EC:3.4.21.62],['3.4.21.62'],OceanDNA-b21394,d__Bacteria;p__Patescibacteria;c__ABY1;o__GWA2...,SAMN05224533,9.011,165.0,48.591667,-123.505,[rxn40535]


## Print reaction compounds

In [5]:
rxn_id = "rxn40535"
# Index compounds by ID once instead of scanning the list for every lookup.
compounds_by_id = {c["id"]: c for c in compounds}
# Take the first reaction with a matching ID.
rxn = next(r for r in reactions if r["id"] == rxn_id)
for cpd_id in rxn["compound_ids"].split(";"):
    cpd = compounds_by_id[cpd_id]
    print(f"Compound  ID: {cpd_id}, name: {cpd['name']}, smiles: {cpd['smiles']}")

Compound  ID: cpd00001, name: H2O, smiles: O
Compound  ID: cpd00067, name: H+, smiles: [H+]
Compound  ID: cpd27149, name: General-Protein-Substrates, smiles: *[NH2+][C@@H](*)C(=O)N[C@@H](*)C(=O)O*
Compound  ID: cpd36163, name: Peptides-holder, smiles: *[NH2+][C@@H](*)C(=O)[O-]
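
Two of the four compounds printed above carry `*` wildcard atoms in their SMILES, meaning they are generic placeholders rather than concrete molecules, so a specific peptide substrate would still have to be chosen before docking. A small check for this is sketched below; `is_generic_smiles` is a hypothetical helper, and whether `get_enzymes_with_complete_smiles` already treats such entries as incomplete is not something this notebook shows.

```python
def is_generic_smiles(smiles):
    """Return True if the SMILES is missing or contains a '*' wildcard atom."""
    return not smiles or "*" in smiles


# SMILES strings copied from the output above.
examples = {
    "cpd00001": "O",
    "cpd00067": "[H+]",
    "cpd27149": "*[NH2+][C@@H](*)C(=O)N[C@@H](*)C(=O)O*",
    "cpd36163": "*[NH2+][C@@H](*)C(=O)[O-]",
}
generic = [cpd_id for cpd_id, smiles in examples.items() if is_generic_smiles(smiles)]
print(generic)  # ['cpd27149', 'cpd36163']
```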


## Selection strategy: low-temperature adapted proteases

__Subtilisin__
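
One way to act on this strategy is to rank the subtilisin candidates by the temperature recorded for their sample. The sketch below is an assumption-laden illustration: `select_cold_candidates` is a hypothetical helper, the 10-degree cutoff is arbitrary, and `sample_temperature` is assumed to be in °C. Rows without a recorded temperature are dropped rather than guessed. The demo rows are copied from the table above.

```python
import pandas as pd


def select_cold_candidates(df, max_temp=10.0):
    """Keep rows with a recorded temperature at or below max_temp, coldest first."""
    measured = df.dropna(subset=["sample_temperature"])
    cold = measured[measured["sample_temperature"] <= max_temp]
    return cold.sort_values("sample_temperature")


# Four rows from the subtilisin table above (other columns omitted).
demo = pd.DataFrame({
    "protein_id": [
        "OceanDNA-b22424_00127_1",
        "OceanDNA-b21391_00015_14",
        "OceanDNA-b21387_00016_14",
        "OceanDNA-b21394_00016_1",
    ],
    "sample_temperature": [None, 9.181, 9.101, 9.011],
})
cold = select_cold_candidates(demo)
print(cold)
```

Sample temperature is only a proxy for cold adaptation of the enzyme itself, so candidates selected this way would still need experimental or structural confirmation.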

## Amylase

In [31]:
import pandas as pd

ec = "3.2.1.1"
df = pd.read_csv(f"outputs/enzyme_query/{ec}.tsv", sep="\t")
df.head()

Unnamed: 0,protein_id,protein_name,protein_ec_numbers,genome_id,gtdb_classification,sample_id,sample_temperature,sample_depth,sample_latitude,sample_longitude,associated_reaction_ids
0,OceanDNA-b34977_00071_12,alpha-amylase [EC:3.2.1.1],['3.2.1.1'],OceanDNA-b34977,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,SAMN05422165,0.851,4002.0,-33.55,39.89,"['rxn06093', 'rxn09952', 'rxn15543', 'rxn17890..."
1,OceanDNA-b39144_00021_33,alpha-amylase [EC:3.2.1.1],['3.2.1.1'],OceanDNA-b39144,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,SAMEA2621010,7.026748,700.0,-31.0379,4.66455,"['rxn06093', 'rxn09952', 'rxn15543', 'rxn17890..."
2,OceanDNA-b16139_00029_7,alpha-amylase [EC:3.2.1.1],['3.2.1.1'],OceanDNA-b16139,d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia...,SAMN08714533,,1.0,22.2,39.04,"['rxn06093', 'rxn09952', 'rxn15543', 'rxn17890..."
3,OceanDNA-b2637_00327_3,alpha-amylase [EC:3.2.1.1],['3.2.1.1'],OceanDNA-b2637,d__Bacteria;p__Actinobacteriota;c__Actinomycet...,,,,,,"['rxn06093', 'rxn09952', 'rxn15543', 'rxn17890..."
4,OceanDNA-b35218_00007_42,alpha-amylase [EC:3.2.1.1],['3.2.1.1'],OceanDNA-b35218,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,SAMN09765683,,4000.0,22.75,-158.0,"['rxn06093', 'rxn09952', 'rxn15543', 'rxn17890..."
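
In the table above, `associated_reaction_ids` appears as a quoted string that merely looks like a list, which is what happens when a list column is written to TSV and read back. If the IDs are needed as real lists, they can be parsed safely with `ast.literal_eval`. The helper `parse_id_list` is hypothetical, and the demo value is a shortened version of the entries shown above.

```python
import ast

import pandas as pd


def parse_id_list(value):
    """Convert a stringified list such as "['rxn1', 'rxn2']" into a list."""
    if isinstance(value, list):
        return value  # already parsed
    if pd.isna(value):
        return []  # missing entry
    return ast.literal_eval(value)


# Shortened demo values; the real column holds longer lists.
raw = pd.Series(["['rxn06093', 'rxn09952']", None])
parsed = raw.apply(parse_id_list)
print(parsed.tolist())  # [['rxn06093', 'rxn09952'], []]
```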


In [32]:
from src.databases import get_enzymes_with_complete_smiles

# Process the dataframe
filtered_df = get_enzymes_with_complete_smiles(df, reactions, compounds)

print(f"Original dataframe shape: {df.shape}")
print(f"Filtered dataframe shape: {filtered_df.shape}")

Original dataframe shape: (4834, 12)
Filtered dataframe shape: (4834, 11)
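
The shapes show that the filter kept all 4834 rows but returned one column fewer. To see which column was dropped, the column lists can be compared directly. `dropped_columns` is a hypothetical helper, and the demo frames use placeholder column names rather than the real ones.

```python
import pandas as pd


def dropped_columns(before, after):
    """List columns present in `before` but absent from `after`, in order."""
    return [column for column in before.columns if column not in after.columns]


# Placeholder frames; in the notebook this would be dropped_columns(df, filtered_df).
before = pd.DataFrame(columns=["col_a", "col_b", "col_c"])
after = pd.DataFrame(columns=["col_a", "col_c"])
missing = dropped_columns(before, after)
print(missing)  # ['col_b']
```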


## Xylanase

1. Search for monomeric (single-domain) candidates; open question: can taxonomy serve as a filter?

In [8]:
import pandas as pd
from src.databases import get_enzymes_with_complete_smiles

ec = "3.2.1.8"
df = pd.read_csv(f"outputs/enzyme_query/{ec}.tsv", sep="\t")

# Process the dataframe
filtered_df = get_enzymes_with_complete_smiles(df, reactions, compounds)
print(f"Original dataframe shape: {df.shape}")
print(f"Filtered dataframe shape: {filtered_df.shape}")
filtered_df

Original dataframe shape: (6496, 12)
Filtered dataframe shape: (6496, 11)


Unnamed: 0,protein_id,protein_name,protein_ec_numbers,genome_id,gtdb_classification,sample_id,sample_temperature,sample_depth,sample_latitude,sample_longitude,associated_reaction_ids
0,OceanDNA-b11441_00307_2,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b11441,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,SAMEA2622429,10.258400,380.0,-1.85465,-84.6212,"[rxn17994, rxn17995, rxn41846]"
1,OceanDNA-b12969_00099_2,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b12969,d__Bacteria;p__Bacteroidota;c__Rhodothermia;o_...,SAMEA2620078,26.817650,5.0,18.58120,66.5622,"[rxn17994, rxn17995, rxn41846]"
2,OceanDNA-b34913_00002_30,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b34913,d__Bacteria;p__Proteobacteria;c__Gammaproteoba...,SAMEA4398266,3.196100,5.0,69.11150,-51.5727,"[rxn17994, rxn17995, rxn41846]"
3,OceanDNA-b13052_00008_84,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b13052,d__Bacteria;p__Bacteroidota;c__Rhodothermia;o_...,SAMEA2622119,20.649431,50.0,-12.94740,-96.0527,"[rxn17994, rxn17995, rxn41846]"
4,OceanDNA-b23891_00051_3,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b23891,d__Bacteria;p__Proteobacteria;c__Alphaproteoba...,SAMN03169781,9.600000,1.0,54.17420,7.9000,"[rxn17994, rxn17995, rxn41846]"
...,...,...,...,...,...,...,...,...,...,...,...
6491,OceanDNA-b7432_00001_98,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b7432,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,SAMN03169778,7.300000,1.0,54.17420,7.9000,"[rxn17994, rxn17995, rxn41846]"
6492,OceanDNA-b7285_00046_2,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b7285,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,SAMN06266121,9.700000,1.0,54.18830,7.9000,"[rxn17994, rxn17995, rxn41846]"
6493,OceanDNA-b22665_00006_12,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b22665,d__Bacteria;p__Planctomycetota;c__Planctomycet...,SAMN13674978,,1500.0,80.03000,179.5500,"[rxn17994, rxn17995, rxn41846]"
6494,OceanDNA-b12431_00027_8,"endo-1,4-beta-xylanase [EC:3.2.1.8]",['3.2.1.8'],OceanDNA-b12431,d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__...,SAMN07224114,26.848000,15.0,24.42000,-156.3100,"[rxn17994, rxn17995, rxn41846]"
