# Use the KG to identify seed protein candidates for Proteus

Objective: Use the KG to identify seed protein candidates for Proteus based on EC numbers, substrates, and environmental metadata. Once the list of seed protein candidates is retrieved, it can be filtered further: for instance, by running CD-HIT on the sequences to maximize sequence diversity, or by selecting proteins from different taxonomic groups to maximize taxonomic diversity.
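The taxonomic-diversity filter mentioned above could be sketched as follows. This is a minimal illustration, not the project's implementation: the `gtdb_classification` field name, the candidate records, and both helper functions are hypothetical, and the sketch simply keeps the first candidate seen per GTDB phylum.

```python
def gtdb_rank(classification, prefix="p__"):
    """Return the name at one GTDB rank (default: phylum), or None if absent."""
    for field in classification.split(";"):
        field = field.strip()
        if field.startswith(prefix):
            return field[len(prefix):]
    return None


def one_per_taxon(candidates, prefix="p__"):
    """Keep the first candidate seen for each taxon at the chosen rank."""
    selected = {}
    for candidate in candidates:
        taxon = gtdb_rank(candidate["gtdb_classification"], prefix)
        selected.setdefault(taxon, candidate)
    return list(selected.values())


# Hypothetical candidates, for illustration only.
candidates = [
    {"protein_id": "P001",
     "gtdb_classification": "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria"},
    {"protein_id": "P002",
     "gtdb_classification": "d__Bacteria;p__Firmicutes;c__Bacilli"},
    {"protein_id": "P005",
     "gtdb_classification": "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria"},
]

seeds = one_per_taxon(candidates)
print([c["protein_id"] for c in seeds])  # ['P001', 'P002']
```

Passing `prefix="c__"` would select one candidate per class instead. A real filter would probably rank candidates within each taxon rather than keep the first one.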

__NOTE__:

The KG will not store actual sequences but only sequence IDs. Sequences will be retrieved from our internal sequence database based on the sequence IDs.

### Example queries (illustrative)

1. __EC number-based query with optional environmental filters__

This query searches for proteins that have a specific EC number in their list of EC numbers and allows for optional filtering based on environmental parameters (temperature, salinity) and geographical location. It returns the matching proteins along with their associated GENOMEs and sample information.

```cypher
MATCH (p:Protein)-[:CATALYZES]->(r:Reaction)
WHERE $ecNumber IN p.ec_numbers
OPTIONAL MATCH (m:GENOME)-[:CONTAINS]->(p)
OPTIONAL MATCH (m)-[:ORIGINATED_FROM]->(s:Samples)
WHERE 
  ($minTemp IS NULL OR s.temperature >= $minTemp) AND
  ($maxTemp IS NULL OR s.temperature <= $maxTemp) AND
  ($minSalinity IS NULL OR s.salinity >= $minSalinity) AND
  ($maxSalinity IS NULL OR s.salinity <= $maxSalinity) AND
  ($minLat IS NULL OR s.latitude >= $minLat) AND
  ($maxLat IS NULL OR s.latitude <= $maxLat) AND
  ($minLon IS NULL OR s.longitude >= $minLon) AND
  ($maxLon IS NULL OR s.longitude <= $maxLon)
RETURN p, m, s
LIMIT 1000
```
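The query above references all eight optional filter parameters, so each one needs a value (possibly null) when the query is run; to my knowledge Neo4j rejects a query whose referenced parameters are missing. A small helper along these lines could fill in the defaults. The helper name is hypothetical; the parameter names are taken from the query.

```python
OPTIONAL_FILTERS = (
    "minTemp", "maxTemp",
    "minSalinity", "maxSalinity",
    "minLat", "maxLat",
    "minLon", "maxLon",
)


def build_ec_query_params(ec_number, **filters):
    """Build the parameter dict for the EC query, defaulting unset filters to None."""
    unknown = set(filters) - set(OPTIONAL_FILTERS)
    if unknown:
        raise ValueError(f"Unknown filter(s): {sorted(unknown)}")
    params = {name: None for name in OPTIONAL_FILTERS}
    params.update(filters)
    params["ecNumber"] = ec_number
    return params


# Only a temperature window is set; all other filters stay None (disabled).
params = build_ec_query_params("3.2.1.1", minTemp=10.0, maxTemp=25.0)
print(params["minTemp"], params["maxSalinity"])  # 10.0 None
```

Rejecting unknown keyword arguments catches typos such as `minTemperature` that would otherwise silently disable a filter.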

__Example output__

This is just an illustrative (not real) example of the output format. The actual output will depend on the data in the KG.

EC number-based query output (for EC 3.2.1.1 - alpha-amylase):

| Protein ID | Protein Name | EC Numbers | MAG ID | MAG Division | GTDB Classification | Sample Name | Temperature (°C) | Salinity (PSU) | Depth (m) | Latitude | Longitude |
|------------|--------------|------------|--------|--------------|---------------------|-------------|------------------|----------------|-----------|----------|-----------|
| P001 | Alpha-amylase | 3.2.1.1 | MAG001 | Bacteria | d__Bacteria;<br>p__Proteobacteria;<br>c__Gammaproteobacteria;<br>o__Enterobacterales;<br>f__Enterobacteriaceae;<br>g__Escherichia;<br>s__Escherichia coli | OceanSample001 | 15.2 | 35.1 | 100.5 | 40.7128 | -74.0060 |
| P002 | Alpha-<br>glucosidase | 3.2.1.1,<br>3.2.1.20 | MAG002 | Bacteria | d__Bacteria;<br>p__Firmicutes;<br>c__Bacilli;<br>o__Bacillales;<br>f__Bacillaceae;<br>g__Bacillus;<br>s__Bacillus subtilis | OceanSample002 | 18.7 | 34.8 | 50.2 | 34.0522 | -118.2437 |

<br>

2. __SMILES similarity-based query with optional environmental filters__

This query finds compounds similar to a target compound (based on pre-computed SMILES similarity), then retrieves the proteins that catalyze reactions involving these similar compounds. It allows for optional filtering based on environmental parameters and geographical location. The results are ordered by chemical similarity and include the proteins, similar compounds, similarity scores, associated GENOMEs, and sample information.

```cypher
MATCH (target:Compound {smiles: $targetSmiles})
MATCH (target)-[sim:CHEMICALLY_SIMILAR]->(c:Compound)-[:SUBSTRATE_OF]->(r:Reaction)<-[:CATALYZES]-(p:Protein)
WHERE sim.similarity >= $similarityThreshold
OPTIONAL MATCH (m:GENOME)-[:CONTAINS]->(p)
OPTIONAL MATCH (m)-[:ORIGINATED_FROM]->(s:Samples)
WHERE 
  ($minTemp IS NULL OR s.temperature >= $minTemp) AND
  ($maxTemp IS NULL OR s.temperature <= $maxTemp) AND
  ($minSalinity IS NULL OR s.salinity >= $minSalinity) AND
  ($maxSalinity IS NULL OR s.salinity <= $maxSalinity) AND
  ($minLat IS NULL OR s.latitude >= $minLat) AND
  ($maxLat IS NULL OR s.latitude <= $maxLat) AND
  ($minLon IS NULL OR s.longitude >= $minLon) AND
  ($maxLon IS NULL OR s.longitude <= $maxLon)
RETURN p, c, sim.similarity AS similarity, m, s
ORDER BY similarity DESC
LIMIT 1000
```

__Example output__

This is just an illustrative (not real) example of the output format. The actual output will depend on the data in the KG.

SMILES similarity-based query output (for target compound CCO - ethanol):

| Protein ID | Protein Name | EC Numbers | Compound ID | Compound Name | SMILES | Similarity | MAG ID | MAG Division | GTDB Classification | Sample Name | Temperature (°C) | Salinity (PSU) | Depth (m) | Latitude | Longitude |
|------------|--------------|------------|-------------|---------------|--------|------------|--------|--------------|---------------------|-------------|------------------|----------------|-----------|----------|-----------|
| P003 | Alcohol<br>dehydrogenase | 1.1.1.1 | C001 | Ethanol | CCO | 1.00 | MAG003 | Bacteria | d__Bacteria;<br>p__Actinobacteria;<br>c__Actinobacteria;<br>o__Corynebacteriales;<br>f__Mycobacteriaceae;<br>g__Mycobacterium;<br>s__Mycobacterium smegmatis | OceanSample003 | 22.1 | 33.9 | 10.5 | 51.5074 | -0.1278 |
| P004 | Methanol<br>dehydrogenase | 1.1.1.244 | C002 | Methanol | CO | 0.88 | MAG004 | Bacteria | d__Bacteria;<br>p__Proteobacteria;<br>c__Alphaproteobacteria;<br>o__Rhizobiales;<br>f__Methylobacteriaceae;<br>g__Methylobacterium;<br>s__Methylobacterium extorquens | OceanSample004 | 20.5 | 35.2 | 75.8 | 48.8566 | 2.3522 |

## Computing Tanimoto distances for ModelSEED compounds

In [1]:
from src.distances import compute_fingerprint_distances
from src.utils import extract_data


reactions_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/reactions.json"
compounds_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/compounds.json"

n = None
reactions, compounds = extract_data(reactions_path, compounds_path, n)

distances = compute_fingerprint_distances(compounds, n_jobs=12)
print(f"Computed {len(distances)} pairwise distances")

In [5]:
similarities = [(c1, c2, 1 - d) for (c1, c2, d) in distances]
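For reference, the relation between Tanimoto similarity and distance used in the cell above (similarity = 1 - distance) can be illustrated on plain Python sets. This sketch assumes fingerprints are represented as sets of "on" bit indices. It does not reproduce `compute_fingerprint_distances`, whose internals are not shown here, and the fingerprints are made up.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity: shared on-bits divided by total on-bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch only
    return len(a & b) / len(a | b)


def tanimoto_distance(bits_a, bits_b):
    """Distance is the complement of similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Made-up fingerprints: 2 shared bits out of 5 distinct bits overall.
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8}

print(tanimoto_similarity(fp_x, fp_y))  # 0.4
```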

In [9]:
from src.distances import store_distances_parquet, store_similarities_parquet

distance_file = "outputs/distances.parquet"
store_distances_parquet(distances, distance_file)

similarity_file = "outputs/similarities.parquet"
store_similarities_parquet(similarities, similarity_file)

Stored 130661695 pairwise similarities in outputs/similarities.parquet
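As a sanity check on the number above: if every unordered compound pair is stored exactly once (an assumption about `compute_fingerprint_distances`), the count should equal n(n-1)/2 for n compounds. Inverting that formula gives a whole number of compounds, which is consistent with that assumption.

```python
import math

n_pairs = 130_661_695  # value reported in the output above

# Solve n * (n - 1) / 2 = n_pairs for n using the quadratic formula.
n = (1 + math.isqrt(1 + 8 * n_pairs)) // 2

print(n)                # 16166
print(math.comb(n, 2))  # 130661695
```

Under this reading, 16,166 compounds entered the pairwise comparison.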


In [7]:
from src.distances import read_distance_parquet

distance_file = "outputs/compound_distances.parquet"
distance = read_distance_parquet(distance_file, "cpd00020", "cpd00061")
print(f"Distance between COMP1 and COMP2: {distance}")

Distance between COMP1 and COMP2: 0.7391304347826086


## Make reaction and compound databases


Simplify the ModelSEED database and make CSV files for compounds and reactions.

In [1]:
from src.utils import extract_data


reactions_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/reactions.json"
compounds_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/compounds.json"

n = None
reactions, compounds = extract_data(reactions_path, compounds_path, n)

In [3]:
[c for c in compounds if "pyruvate" in c["name"].lower()][0]

{'abbreviation': 'pyr',
 'abstract_compound': None,
 'aliases': ['Name: 2-Oxopropanoate; 2-Oxopropanoic acid; 2-oxo-propionic acid; 2-oxopropanoate; 2-oxopropanoic acid; BTS; Pyroracemic acid; Pyruvate; Pyruvic acid; acetylformic acid; alpha-ketopropionic acid; pyroracemic acid; pyruvate; pyruvic acid',
  'AraCyc: PYRUVATE',
  'BiGG: pyr',
  'BrachyCyc: PYRUVATE',
  'KEGG: C00022',
  'MetaCyc: PYRUVATE'],
 'charge': -1,
 'comprised_of': None,
 'deltag': -82.56,
 'deltagerr': 0.18,
 'formula': 'C3H3O3',
 'id': 'cpd00020',
 'inchikey': 'LCTONWCANYUPML-UHFFFAOYSA-M',
 'is_cofactor': 0,
 'is_core': 1,
 'is_obsolete': 0,
 'linked_compound': None,
 'mass': 87.0,
 'name': 'Pyruvate',
 'notes': ['GC', 'EQ', 'EQU'],
 'pka': '1:4:2.93',
 'pkb': '1:5:-9.58',
 'smiles': 'CC(=O)C(=O)[O-]',
 'source': 'Primary Database'}

In [4]:
[c for c in compounds if "pyruvate" in c["name"].lower()][1]

{'abbreviation': 'pep',
 'abstract_compound': None,
 'aliases': ['Name: 2-(phosphooxy)- 2-propenoate; P-enol-pyr; P-enol-pyruvate; PEP; Phosphoenolpyruvate; Phosphoenolpyruvic acid; phosphoenolpyruvate',
  'AraCyc: PHOSPHO-ENOL-PYRUVATE',
  'BiGG: pep',
  'BrachyCyc: PHOSPHO-ENOL-PYRUVATE',
  'KEGG: C00074',
  'MetaCyc: PHOSPHO-ENOL-PYRUVATE'],
 'charge': -3,
 'comprised_of': None,
 'deltag': -284.84,
 'deltagerr': 0.2,
 'formula': 'C3H2O6P',
 'id': 'cpd00061',
 'inchikey': 'DTBNBXWJWCWCIK-UHFFFAOYSA-K',
 'is_cofactor': 0,
 'is_core': 1,
 'is_obsolete': 0,
 'linked_compound': None,
 'mass': 166.0,
 'name': 'Phosphoenolpyruvate',
 'notes': ['GC', 'EQ', 'EQU'],
 'pka': '1:4:3.36;1:8:0.76;1:10:6.02',
 'pkb': '',
 'smiles': 'C=C(OP(=O)([O-])[O-])C(=O)[O-]',
 'source': 'Primary Database'}

### Filter reactions by list of properties

In [11]:
def filter_reaction_dicts(reaction_dicts):
    # Define the keys to keep
    keys_to_keep = [
        "aliases", "code", "compound_ids", "definition", "deltag", "deltagerr",
        "direction", "ec_numbers", "equation", "is_transport", "linked_reaction",
        "name", "pathways", "reversibility", "source", "status", "stoichiometry"
    ]

    # Function to filter a single dictionary
    def filter_dict(d):
        filtered = {k: d[k] for k in keys_to_keep if k in d}
        # Rename 'id' to 'reaction_id' if present
        if 'id' in d:
            filtered['reaction_id'] = d['id']
        return filtered

    # Apply the filter to all dictionaries in the list
    return [filter_dict(d) for d in reaction_dicts]


filtered_reactions = filter_reaction_dicts(reactions)

### Filter compounds by list of properties

In [10]:
def filter_compound_dicts(compound_dicts):
    # Define the keys to keep
    keys_to_keep = [
        "aliases", "charge", "deltag", "deltagerr", "formula", "inchikey",
        "is_cofactor", "is_core", "mass", "name", "pka", "pkb", "smiles", "source"
    ]

    # Function to filter a single dictionary
    def filter_dict(d):
        filtered = {k: d[k] for k in keys_to_keep if k in d}
        # Rename 'id' to 'compound_id' if present
        if 'id' in d:
            filtered['compound_id'] = d['id']
        return filtered

    # Apply the filter to all dictionaries in the list
    return [filter_dict(d) for d in compound_dicts]


filtered_compounds = filter_compound_dicts(compounds)

### Save to JSON

In [12]:
import json

def save_to_json(data, filename):
    """
    Save a list of dictionaries to a JSON file.
    
    Args:
    data (list): List of dictionaries to save
    filename (str): Name of the file to save the data to
    """
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


save_to_json(filtered_reactions, 'outputs/filtered_reactions.json')
save_to_json(filtered_compounds, 'outputs/filtered_compounds.json')
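The section goal mentions CSV files, while the cell above writes JSON. A minimal sketch of a CSV export is shown below, assuming list-valued fields such as `aliases` are serialized as JSON strings inside a single cell. The `write_csv` helper is hypothetical; the example rows reuse values from the compound records printed earlier.

```python
import csv
import io
import json


def write_csv(rows, handle):
    """Write a list of dicts as CSV; lists and dicts are stored as JSON strings."""
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            key: json.dumps(value, ensure_ascii=False)
            if isinstance(value, (list, dict)) else value
            for key, value in row.items()
        })


example_rows = [
    {"compound_id": "cpd00020", "name": "Pyruvate", "charge": -1,
     "aliases": ["KEGG: C00022", "BiGG: pyr"]},
    {"compound_id": "cpd00061", "name": "Phosphoenolpyruvate", "charge": -3,
     "aliases": ["KEGG: C00074", "BiGG: pep"]},
]

buffer = io.StringIO()
write_csv(example_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])  # aliases,charge,compound_id,name
```

To write to disk, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer.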

## Notes

1. RDKit chemical similarity function: https://github.com/rdkit/rdkit-orig/blob/57058c886a49cc597b0c40641a28697ee3a57aee/rdkit/DataStructs/__init__.py#L31

2. Use the monorepo `db_connection` module to get a connection class (`graph_db.db_connection.Neo4jConnection`, as imported in the cells below).

## Test NAL KG locally

In [25]:
from dotenv import load_dotenv
import os

load_dotenv()
uri = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")

In [27]:
from neo4j import GraphDatabase

def get_schema_info(tx):
    # Get node labels
    labels = tx.run("CALL db.labels()").data()
    print("Node Labels:", [label["label"] for label in labels])

    # Get relationship types
    rel_types = tx.run("CALL db.relationshipTypes()").data()
    print("Relationship Types:", [rel["relationshipType"] for rel in rel_types])

    # Get property keys
    prop_keys = tx.run("CALL db.propertyKeys()").data()
    print("Property Keys:", [prop["propertyKey"] for prop in prop_keys])

    # Get detailed schema with property types inferred from string values.
    # A backticked `${label}` is a literal label name in Cypher, not an
    # interpolated one, so labels and keys are unwound per node instead.
    # Note: this scans every node and can be slow on a large KG.
    # Raw string: Cypher then receives '\\d', i.e. the regex \d.
    schema_query = r"""
    MATCH (n)
    UNWIND labels(n) AS label
    UNWIND keys(n) AS prop
    WITH label, prop, n[prop] AS value
    RETURN DISTINCT label, prop,
           CASE
             WHEN value =~ '^-?\\d+$' THEN 'Integer'
             WHEN value =~ '^-?\\d*\\.?\\d+$' THEN 'Float'
             WHEN value IN ['true', 'false'] THEN 'Boolean'
             ELSE 'String'
           END AS type
    LIMIT 1000
    """
    schema = tx.run(schema_query).data()
    
    print("\nDetailed Schema:")
    for item in schema:
        print(f"Label: {item['label']}, Property: {item['prop']}, Type: {item['type']}")




with GraphDatabase.driver(uri, auth=(username, password)) as driver:
    with driver.session() as session:
        # execute_read replaces read_transaction, which is deprecated in the 5.x driver
        session.execute_read(get_schema_info)


Node Labels: ['Compound', 'Genome', 'Reaction', 'BGC', 'Protein', 'Sample']
Relationship Types: ['CONTAINS', 'ORIGINATED_FROM', 'SUBSTRATE_OF', 'PRODUCT_OF', 'CHEMICAL_SIMILARITY', 'CATALYZES']
Property Keys: ['UNIQUE IMPORT ID', 'name', 'is_core', 'charge', 'notes', 'aliases', 'inchikey', 'smiles', 'pkb', 'mass', 'pka', 'deltag', 'source', 'abbreviation', 'is_obsolete', 'formula', 'deltagerr', 'id', 'is_cofactor', 'linked_compound', 'stoichiometry', 'code', 'equation', 'is_transport', 'ec_numbers', 'compound_ids', 'reversibility', 'definition', 'direction', 'status', 'pathways', 'linked_reaction', 'gcc_prevalence', 'gcc_to_refseq', 'gcc', 'gcf', 'gcc_only_mag', 'other', 'non_ribosomal_peptyde_synthestases', 'distance_refseq', 'bgc_representative', 'type_ii_iii_polyketide_synthestases', 'terpene', 'ripps', 'distance_mibig', 'type_i_polyketide_synthestases', 'bgc_complete', 'bgc_length', 'acc', 'mag_id', 'bgc_class', 'contig_length', 'file', 'start', 'length', 'end', 'contig_id', 'tool'
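The labels and relationship types listed above differ from some names used in the illustrative queries at the top of this document (`GENOME`, `Samples`, `CHEMICALLY_SIMILAR`). A rough check like the following could flag such mismatches before a query is run. The regular expressions are a simplification introduced here: they cover only simple `(var:Label)` and `[var:TYPE]` patterns and are not a Cypher parser.

```python
import re

# Names copied from the schema output above.
KG_NODE_LABELS = {"Compound", "Genome", "Reaction", "BGC", "Protein", "Sample"}
KG_REL_TYPES = {"CONTAINS", "ORIGINATED_FROM", "SUBSTRATE_OF", "PRODUCT_OF",
                "CHEMICAL_SIMILARITY", "CATALYZES"}

NODE_LABEL_RE = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
REL_TYPE_RE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def unknown_schema_names(query):
    """Return (labels, relationship types) used in the query but absent from the KG."""
    labels = set(NODE_LABEL_RE.findall(query))
    rel_types = set(REL_TYPE_RE.findall(query))
    return labels - KG_NODE_LABELS, rel_types - KG_REL_TYPES


illustrative_query = """
MATCH (target:Compound {smiles: $targetSmiles})
MATCH (target)-[sim:CHEMICALLY_SIMILAR]->(c:Compound)-[:SUBSTRATE_OF]->(r:Reaction)<-[:CATALYZES]-(p:Protein)
OPTIONAL MATCH (m:GENOME)-[:CONTAINS]->(p)
OPTIONAL MATCH (m)-[:ORIGINATED_FROM]->(s:Samples)
RETURN p, c, m, s
"""

unknown_labels, unknown_rels = unknown_schema_names(illustrative_query)
print(sorted(unknown_labels))  # ['GENOME', 'Samples']
print(sorted(unknown_rels))    # ['CHEMICALLY_SIMILAR']
```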

In [5]:
from src.utils import extract_data

reactions_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/reactions.json"
compounds_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/compounds.json"

n = None
reactions, compounds = extract_data(reactions_path, compounds_path, n)

In [3]:
from src.queries import find_reactions_with_similar_product_compounds


compound_id = "cpd00069"
compound_name = [c for c in compounds if c["id"] == compound_id][0]["name"]
print(f"Target compound: {compound_name}")

results = find_reactions_with_similar_product_compounds(
    uri, username, password,
    compound_id,
    similarity_threshold=0.2,
    limit=100,
)
results[0]

Target compound: L-Tyrosine


{'reaction_id': 'rxn45976',
 'reaction_name': '',
 'similar_compound_id': 'cpd03843',
 'similar_compound_name': 'D-Tyrosine',
 'similar_compound_smiles': '[NH3+][C@H](Cc1ccc(O)cc1)C(=O)[O-]',
 'similarity_distance': 0.0}

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()
uri = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")

In [5]:
from graph_db.db_connection import Neo4jConnection

conn = Neo4jConnection(uri, username, password)

In [23]:
query = """
MATCH (c:Compound {compound_id: $compound_id})
OPTIONAL MATCH (c)-[r:PRODUCT_OF]-(reaction:Reaction)
RETURN c, r, reaction
LIMIT 1000"""

params = {"compound_id": "cpd00069"}
conn.query(query, params)

[<Record c=<Node element_id='4:7de5358e-0bc2-417b-aa77-97dd0da86487:30819875' labels=frozenset({'Compound'}) properties={'is_core': 1, 'aliases': ['Name: (S)-2-Amino-3-(p-hydroxyphenyl)propionic acid; (S)-3-(p-Hydroxyphenyl)alanine; L-Tyrosine; L-tyr; L-tyrosine; Tyrosine; tyr; tyrosine', 'AraCyc: TYR', 'BiGG: tyr__L', 'BrachyCyc: TYR', 'KEGG: C00082', 'MetaCyc: TYR'], 'inchikey': 'OUYCCCASQSFEME-QMMMGPOBSA-N', 'charge': 0, 'smiles': '[NH3+][C@@H](Cc1ccc(O)cc1)C(=O)[O-]', 'pkb': '1:8:-5.96;1:13:9.19', 'pka': '1:8:9.79;1:11:2.00', 'mass': 181.0, 'compound_id': 'cpd00069', 'deltag': 18.52, 'source': 'Primary Database', 'name': 'L-Tyrosine', 'formula': 'C9H11NO3', 'deltagerr': 0.71, 'is_cofactor': 0}> r=<Relationship element_id='5:7de5358e-0bc2-417b-aa77-97dd0da86487:1096688' nodes=(<Node element_id='4:7de5358e-0bc2-417b-aa77-97dd0da86487:30819875' labels=frozenset({'Compound'}) properties={'is_core': 1, 'aliases': ['Name: (S)-2-Amino-3-(p-hydroxyphenyl)propionic acid; (S)-3-(p-Hydroxyphe

In [8]:
query = """
MATCH (c:Compound {compound_id: $compound_id})-[sim:CHEMICAL_SIMILARITY]-(similar:Compound)
MATCH (similar)-[:PRODUCT_OF]->(r:Reaction)
WHERE toFloat(sim.distance) <= $distance_threshold
RETURN r.reaction_id AS reaction_id,
       r.name AS reaction_name,
       similar.compound_id AS similar_compound_id,
       similar.name AS similar_compound_name,
       similar.smiles AS similar_compound_smiles,
       toFloat(sim.distance) AS similarity_distance
ORDER BY similarity_distance ASC
LIMIT $limit"""

conn = Neo4jConnection(uri, username, password)

params = {"compound_id": "cpd00009", "distance_threshold": 0.4, "limit": 100}
conn.query(query, params)

[<Record reaction_id='rxn40068' reaction_name='2-hydroxyethylphosphonate:O2 1,2-oxidoreductase (methylphosphonate forming)' similar_compound_id='cpd25960' similar_compound_name='MePn' similar_compound_smiles='CP(=O)([O-])O' similarity_distance=0.36363636363636365>]

In [10]:
query = """
MATCH (c:Compound {compound_id: $compound_id})-[sim:CHEMICAL_SIMILARITY]-(similar:Compound)
MATCH (similar)-[:PRODUCT_OF]->(r:Reaction)
MATCH (p:Protein)-[:CATALYZES]->(r)
WHERE toFloat(sim.distance) <= $distance_threshold
RETURN 
    r.reaction_id AS reaction_id,
    r.name AS reaction_name,
    similar.compound_id AS similar_compound_id,
    similar.name AS similar_compound_name,
    similar.smiles AS similar_compound_smiles,
    toFloat(sim.distance) AS similarity_distance,
    p.protein_id AS catalyzing_protein_id,
    p.name AS catalyzing_protein_name,
    p.ec_numbers AS catalyzing_protein_ec_numbers
ORDER BY similarity_distance ASC
LIMIT $limit"""


conn = Neo4jConnection(uri, username, password)

params = {"compound_id": "cpd00009", "distance_threshold": 0.9, "limit": 100}
conn.query(query, params)

[<Record reaction_id='rxn00293' reaction_name='UTP:N-acetyl-alpha-D-glucosamine-1-phosphate uridylyltransferase' similar_compound_id='cpd00012' similar_compound_name='PPi' similar_compound_smiles='O=P([O-])([O-])OP(=O)([O-])O' similarity_distance=0.5333333333333333 catalyzing_protein_id='OceanDNA-b44519_00103_2' catalyzing_protein_name='bifunctional UDP-N-acetylglucosamine pyrophosphorylase / glucosamine-1-phosphate N-acetyltransferase [EC:2.7.7.23 2.3.1.157]' catalyzing_protein_ec_numbers=['2.3.1.157', '2.7.7.23']>,
 <Record reaction_id='rxn00293' reaction_name='UTP:N-acetyl-alpha-D-glucosamine-1-phosphate uridylyltransferase' similar_compound_id='cpd00012' similar_compound_name='PPi' similar_compound_smiles='O=P([O-])([O-])OP(=O)([O-])O' similarity_distance=0.5333333333333333 catalyzing_protein_id='OceanDNA-b36760_00082_1' catalyzing_protein_name='bifunctional UDP-N-acetylglucosamine pyrophosphorylase / glucosamine-1-phosphate N-acetyltransferase [EC:2.7.7.23 2.3.1.157]' catalyzing_p

In [3]:
[r for r in reactions if r["id"] == "rxn00001"][0]

{'abbreviation': 'R00004',
 'abstract_reaction': None,
 'aliases': ['AraCyc: INORGPYROPHOSPHAT-RXN',
  'BiGG: IPP1; PPA; PPA_1; PPAm',
  'BrachyCyc: INORGPYROPHOSPHAT-RXN',
  'KEGG: R00004',
  'MetaCyc: INORGPYROPHOSPHAT-RXN',
  'Name: Diphosphate phosphohydrolase; Inorganic diphosphatase; Inorganic pyrophosphatase; Pyrophosphate phosphohydrolase; diphosphate phosphohydrolase; inorganic diphosphatase; inorganic diphosphatase (one proton translocation); inorganicdiphosphatase; pyrophosphate phosphohydrolase'],
 'code': '(1) cpd00001[0] + (1) cpd00012[0] <=> (2) cpd00009[0]',
 'compound_ids': 'cpd00001;cpd00009;cpd00012;cpd00067',
 'definition': '(1) H2O[0] + (1) PPi[0] <=> (2) Phosphate[0] + (1) H+[0]',
 'deltag': -3.46,
 'deltagerr': 0.05,
 'direction': '=',
 'ec_numbers': ['3.6.1.1'],
 'equation': '(1) cpd00001[0] + (1) cpd00012[0] <=> (2) cpd00009[0] + (1) cpd00067[0]',
 'id': 'rxn00001',
 'is_obsolete': 0,
 'is_transport': 0,
 'linked_reaction': 'rxn27946;rxn27947;rxn27948;rxn32487;

In [17]:
from graph_db.db_connection import Neo4jConnection

conn = Neo4jConnection(uri, username, password)

query = """
MATCH (p:Protein)-[:CATALYZES]->(r:Reaction)
WHERE $ec_number IN p.ec_numbers
RETURN
   p.protein_id AS protein_id,
   p.name AS protein_name,
   p.ec_numbers AS protein_ec_numbers,
   r.reaction_id AS reaction_id,
   r.name AS reaction_name,
   r.ec_numbers AS reaction_ec_numbers
LIMIT 100"""

params = {
    "ec_number": "3.2.1.4"
}

results = conn.query(query, params)

In [19]:
results

[]

In [41]:
set([",".join(r["reaction_ec_numbers"]) for r in results])

{'3.2.1.10,3.2.1.20,3.2.1.48', '3.2.1.183,5.1.3.14'}

In [46]:
[r for r in reactions if ((r["ec_numbers"] is not None) and ("3.2.1.1" in r["ec_numbers"]))]

[{'abbreviation': 'R02108',
  'abstract_reaction': None,
  'aliases': ['KEGG: R02108', 'Name: 1,4-alpha-D-Glucan glucanohydrolase'],
  'code': '(1) cpd00001[0] <=> (1) cpd11594[0]',
  'compound_ids': 'cpd00001;cpd11594',
  'definition': '(1) H2O[0] <=> (1) Dextrin[0]',
  'deltag': 10000000.0,
  'deltagerr': 10000000.0,
  'direction': '=',
  'ec_numbers': ['3.2.1.1'],
  'equation': '(1) cpd00001[0] <=> (1) cpd11594[0]',
  'id': 'rxn06093',
  'is_obsolete': 0,
  'is_transport': 0,
  'linked_reaction': None,
  'name': '1,4-alpha-D-Glucan glucanohydrolase',
  'notes': ['GCP', 'EQP'],
  'pathways': ['KEGG: rn00500 (Starch and sucrose metabolism)'],
  'reversibility': '?',
  'source': 'Primary Database',
  'status': 'MI:C:12/H:18/O:9/R:2',
  'stoichiometry': '-1:cpd00001:0:0:"H2O";1:cpd11594:0:0:"Dextrin"'},
 {'abbreviation': 'AAMYL',
  'abstract_reaction': None,
  'aliases': ['AlgaGEM: R_R02108_p',
   'AraGEM: R_R02108_p',
   'BiGG: AAMYL',
   'Maize_C4GEM: R_R02108_p',
   'iAF1260: AAMYL',

In [26]:
filtered_results = [r for r in results if r["sample_temperature"] is not None and (r["protein_ec_numbers"] in r["reaction_ec_numbers"])]
filtered_results

[]

It looks like we are getting partial matches on the EC number: the results include reaction rxn00235, which is assigned EC number 3.1.1.33 rather than the queried 3.1.1.3.
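One possible explanation (an assumption, not verified against how the KG edges were built) is that EC numbers are compared as substrings somewhere in the pipeline. The sketch below shows why substring and prefix tests both accept 3.1.1.33 for a 3.1.1.3 query, while a field-by-field comparison does not. `same_ec` is a hypothetical helper.

```python
query_ec = "3.1.1.3"
annotated_ec = "3.1.1.33"

# Substring and prefix tests both report a (spurious) match.
print(query_ec in annotated_ec)           # True
print(annotated_ec.startswith(query_ec))  # True

# Membership in a list of EC strings compares whole strings.
print(query_ec in [annotated_ec])         # False


def same_ec(ec_a, ec_b):
    """Compare two EC numbers field by field."""
    return ec_a.strip().split(".") == ec_b.strip().split(".")


print(same_ec(query_ec, annotated_ec))    # False
```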

## Search for proteins with more than one EC Number

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
uri = os.getenv("NEO4J_URI")
username = os.getenv("NEO4J_USERNAME")
password = os.getenv("NEO4J_PASSWORD")

In [15]:
from graph_db.db_connection import Neo4jConnection

conn = Neo4jConnection(uri, username, password)

query = """
MATCH (p:Protein {protein_id: "OceanDNA-b40955_00050_5"})
RETURN p.protein_id, p.name, p.sequence, p.ec_numbers
""" 

res = conn.query(query)
res

[<Record p.protein_id='OceanDNA-b40955_00050_5' p.name='(R,R)-butanediol dehydrogenase / meso-butanediol dehydrogenase / diacetyl reductase [EC:1.1.1.4 1.1.1.- 1.1.1.303]' p.sequence=None p.ec_numbers='1.1.1.4, 1.1.1.-, 1.1.1.303'>]

In [17]:
res[0].data()["p.ec_numbers"]

'1.1.1.4, 1.1.1.-, 1.1.1.303'

In [8]:
from graph_db.db_connection import Neo4jConnection

conn = Neo4jConnection(uri, username, password)

query = """
MATCH (r:Reaction {reaction_id: "rxn02112"})
RETURN r.reaction_id, r.name, r.ec_numbers
""" 

res = conn.query(query)
res

[<Record r.reaction_id='rxn02112' r.name='(R,R)-Butane-2,3-diol:NAD+ oxidoreductase' r.ec_numbers=['1.1.1.4']>]

In [11]:
res[0].data()["r.ec_numbers"]

['1.1.1.4']
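The two outputs above show that `ec_numbers` is stored as a comma-separated string on this protein node but as a list on the reaction node. A client-side normalization such as the following could make the two comparable. Both helpers are hypothetical, and dropping incomplete entries such as `1.1.1.-` is a design choice made for this sketch.

```python
def normalize_ec_numbers(value):
    """Return EC numbers as a list, whether stored as a string, a list, or None."""
    if value is None:
        return []
    if isinstance(value, str):
        value = value.split(",")
    return [ec.strip() for ec in value if ec and ec.strip()]


def shared_complete_ecs(protein_ecs, reaction_ecs):
    """EC numbers present on both sides, ignoring incomplete ones like '1.1.1.-'."""
    protein = {ec for ec in normalize_ec_numbers(protein_ecs) if "-" not in ec}
    reaction = set(normalize_ec_numbers(reaction_ecs))
    return protein & reaction


protein_value = "1.1.1.4, 1.1.1.-, 1.1.1.303"  # as returned for the protein
reaction_value = ["1.1.1.4"]                    # as returned for the reaction

print(normalize_ec_numbers(protein_value))
# ['1.1.1.4', '1.1.1.-', '1.1.1.303']
print(shared_complete_ecs(protein_value, reaction_value))
# {'1.1.1.4'}
```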

In [6]:
[r for r in reactions if ((r["ec_numbers"] is not None) and ("1.1.1.4" in r["ec_numbers"]))]

[{'abbreviation': 'R02946',
  'abstract_reaction': None,
  'aliases': ['BiGG: BTDD-RR',
   'KEGG: R02946',
   'MaizeCyc: RR-BUTANEDIOL-DEHYDROGENASE-RXN',
   'MetaCyc: RR-BUTANEDIOL-DEHYDROGENASE-RXN',
   'PoplarCyc: RR-BUTANEDIOL-DEHYDROGENASE-RXN',
   'Name: (R)-2,3-butanediol dehydrogenase; (R)-diacetyl reductase; (R,R)-Butane-2,3-diol:NAD+ oxidoreductase; (R,R)-butane-2,3-diol:NAD+ oxidoreductase; (R,R)-butanediol dehydrogenase; 1-amino-2-propanol dehydrogenase; 1-amino-2-propanol oxidoreductase; 2,3-butanediol dehydrogenase; D-(-)-butanediol dehydrogenase; D-1-amino-2-propanol dehydrogenase; D-1-amino-2-propanol:NAD+ oxidoreductase; D-aminopropanol dehydrogenase; D-butanediol dehydrogenase; aminopropanol oxidoreductase; butylene glycol dehydrogenase; butyleneglycol dehydrogenase; diacetyl (acetoin) reductase'],
  'code': '(1) cpd00003[0] + (1) cpd01947[0] <=> (1) cpd00004[0] + (1) cpd00361[0]',
  'compound_ids': 'cpd00003;cpd00004;cpd00067;cpd00361;cpd01947',
  'definition': '(1) 

In [2]:
from graph_db.db_connection import Neo4jConnection

conn = Neo4jConnection(uri, username, password)

query = """
MATCH (p:Protein {protein_id: "OceanDNA-b44106_00008_5"})
RETURN p.protein_id, p.name, p.sequence
""" 

res = conn.query(query)
res

[<Record p.protein_id='OceanDNA-b44106_00008_5' p.name='triacylglycerol lipase [EC:3.1.1.3]' p.sequence=None>]

In [20]:
from graph_db.db_connection import Neo4jConnection

conn = Neo4jConnection(uri, username, password)

query = """
MATCH (p:Protein {protein_id: "OceanDNA-b44106_00008_5"})
OPTIONAL MATCH (p)-[:CATALYZES]->(r:Reaction)
RETURN p, collect(r) as catalyzed_reactions
"""

res = conn.query(query)
res

[<Record p=<Node element_id='4:7de5358e-0bc2-417b-aa77-97dd0da86487:23679428' labels=frozenset({'Protein'}) properties={'protein_id': 'OceanDNA-b44106_00008_5', 'genome_id': 'OceanDNA-b44106', 'kegg_ortholog_id': 'K01046', 'name': 'triacylglycerol lipase [EC:3.1.1.3]', 'contig_id': 'OceanDNA-b44106_00008', 'ec_numbers': '3.1.1.3'}> catalyzed_reactions=[<Node element_id='4:7de5358e-0bc2-417b-aa77-97dd0da86487:30840618' labels=frozenset({'Reaction'}) properties={'stoichiometry': '-1:cpd00001:0:0:"H2O";-1:cpd01726:0:0:"6-Acetyl-D-glucose";1:cpd00027:0:0:"D-Glucose";1:cpd00029:0:0:"Acetate";1:cpd00067:0:0:"H+"', 'aliases': ['KEGG: R00327', 'MetaCyc: 6-ACETYLGLUCOSE-DEACETYLASE-RXN', 'Name: 6-Acetyl-D-glucose acetylhydrolase'], 'code': '(1) cpd00001[0] + (1) cpd01726[0] <=> (1) cpd00027[0] + (1) cpd00029[0]', 'equation': '(1) cpd00001[0] + (1) cpd01726[0] => (1) cpd00027[0] + (1) cpd00029[0] + (1) cpd00067[0]', 'deltag': -6.8, 'is_transport': 0, 'source': 'Primary Database', 'ec_numbers': [

## Candidate enzymes with high industrial relevance


### Lipase (EC 3.1.1.3)
Main usage: Food industry (dairy, baking), detergents, biofuel production

### Amylase (EC 3.2.1.1)
Main usage: Starch processing, baking, brewing, textile industry

### Cellulase (EC 3.2.1.4)
Main usage: Biofuel production, textile industry (stone-washing denim), paper and pulp industry

### Serine Protease (EC 3.4.21.62)
Main usage: Detergents, food processing, leather industry, pharmaceuticals

### Phytase (EC 3.1.3.26)
Main usage: Animal feed industry, particularly for poultry and swine

### Lactase (EC 3.2.1.23)
Main usage: Dairy industry (lactose-free products), food and beverage

### Glucose oxidase (EC 1.1.3.4)
Main usage: Food preservation, glucose biosensors, baking industry

### Xylanase (EC 3.2.1.8)
Main usage: Baking industry, animal feed, paper and pulp industry

### Catalase (EC 1.11.1.6)
Main usage: Food preservation, textile industry (bleaching), cosmetics

### Pectinase (EC 3.2.1.15)
Main usage: Fruit juice production, wine making, textile processing
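To run the EC-number query from the notebook over this list, the candidates can be kept in a small mapping. The sketch below only builds the parameter dictionaries. The variable and function names are hypothetical, and the commented-out call assumes the `conn` and `query` objects defined in the earlier cells.

```python
# Enzyme names and EC numbers copied from the list above.
INDUSTRIAL_ENZYMES = {
    "Lipase": "3.1.1.3",
    "Amylase": "3.2.1.1",
    "Cellulase": "3.2.1.4",
    "Serine Protease": "3.4.21.62",
    "Phytase": "3.1.3.26",
    "Lactase": "3.2.1.23",
    "Glucose oxidase": "1.1.3.4",
    "Xylanase": "3.2.1.8",
    "Catalase": "1.11.1.6",
    "Pectinase": "3.2.1.15",
}


def ec_query_params(enzymes):
    """Return (enzyme name, query parameters) pairs for the $ec_number query."""
    return [(name, {"ec_number": ec}) for name, ec in enzymes.items()]


for name, params in ec_query_params(INDUSTRIAL_ENZYMES):
    print(f"{name}: {params}")
    # results = conn.query(query, params)  # assumes conn/query from earlier cells
```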