# Use the KG to identify seed protein candidates for ProtEvoPy

Objective: use the knowledge graph (KG) to identify seed protein candidates for ProtEvoPy based on EC numbers, substrates, and environmental metadata. Once the list of candidates is retrieved, it can be filtered further. For instance, we could maximize sequence diversity by clustering the sequences with CD-HIT, or maximize taxonomic diversity by selecting proteins from different taxonomic groups.
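
As a rough illustration of the taxonomic-diversity option, the sketch below keeps at most one candidate per genus, parsed from a GTDB-style classification string like the ones in the example outputs further down. The record layout (`protein_id`, `gtdb_classification`) and both function names are assumptions made for illustration; they are not part of ProtEvoPy or the KG schema.

```python
def gtdb_rank(classification, rank="g"):
    """Return the taxon name at a GTDB rank prefix (d, p, c, o, f, g, s), or None."""
    prefix = rank + "__"
    for field in classification.split(";"):
        field = field.strip()
        if field.startswith(prefix):
            # An empty name (e.g. "g__") means the rank is unclassified
            return field[len(prefix):] or None
    return None


def pick_one_per_taxon(candidates, rank="g"):
    """Keep the first candidate seen for each taxon at the chosen rank."""
    seen = set()
    selected = []
    for cand in candidates:
        taxon = gtdb_rank(cand["gtdb_classification"], rank)
        # Candidates without a classification at this rank are skipped
        if taxon is None or taxon in seen:
            continue
        seen.add(taxon)
        selected.append(cand)
    return selected


# Hypothetical candidate records (not real data)
candidates = [
    {"protein_id": "P001",
     "gtdb_classification": "d__Bacteria;p__Proteobacteria;g__Escherichia;s__Escherichia coli"},
    {"protein_id": "P002",
     "gtdb_classification": "d__Bacteria;p__Firmicutes;g__Bacillus;s__Bacillus subtilis"},
    {"protein_id": "P005",
     "gtdb_classification": "d__Bacteria;p__Firmicutes;g__Bacillus;s__Bacillus licheniformis"},
]
selected_ids = [c["protein_id"] for c in pick_one_per_taxon(candidates, rank="g")]
```

Taking the first hit per taxon is the simplest choice; ranking candidates within a taxon (for example by annotation quality) before selecting would be a natural refinement.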

__NOTE__:

The KG will not store actual sequences but only sequence IDs. Sequences will be retrieved from our internal sequence database based on the sequence IDs.

### Example queries (illustrative)

1. __EC number-based query with optional environmental filters__

This query searches for proteins whose `ec_numbers` list contains a given EC number, with optional filters on environmental parameters (temperature, salinity) and geographic location (latitude, longitude). It returns the matching proteins along with their associated genomes and sample information.

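```cypher
// Proteins annotated with the requested EC number
MATCH (p:Protein)-[:CATALYZES]->(:Reaction)
WHERE $ecNumber IN p.ec_numbers
OPTIONAL MATCH (m:GENOME)-[:CONTAINS]->(p)
OPTIONAL MATCH (m)-[:ORIGINATED_FROM]->(s:Samples)
// Filter in a separate WITH ... WHERE: a WHERE attached directly to
// OPTIONAL MATCH would only set s to null instead of dropping the row.
// Filters whose parameter is NULL are skipped; once a filter is set,
// proteins without matching sample metadata are excluded.
WITH DISTINCT p, m, s
WHERE
  ($minTemp IS NULL OR s.temperature >= $minTemp) AND
  ($maxTemp IS NULL OR s.temperature <= $maxTemp) AND
  ($minSalinity IS NULL OR s.salinity >= $minSalinity) AND
  ($maxSalinity IS NULL OR s.salinity <= $maxSalinity) AND
  ($minLat IS NULL OR s.latitude >= $minLat) AND
  ($maxLat IS NULL OR s.latitude <= $maxLat) AND
  ($minLon IS NULL OR s.longitude >= $minLon) AND
  ($maxLon IS NULL OR s.longitude <= $maxLon)
RETURN p, m, s
LIMIT 1000
```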
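
Every filter in the query is written as `$param IS NULL OR ...`, so each parameter has to be supplied, with `None` (Cypher `NULL`) for the filters that should be skipped. A minimal sketch of building such a parameter map in Python follows. The helper name `build_ec_query_params` and the `EC_QUERY` string are hypothetical, and the commented-out call assumes the official `neo4j` Python driver with placeholder connection details.

```python
# Optional filter parameters expected by the query
FILTER_KEYS = (
    "minTemp", "maxTemp", "minSalinity", "maxSalinity",
    "minLat", "maxLat", "minLon", "maxLon",
)


def build_ec_query_params(ec_number, **filters):
    """Return a full parameter map; filters that are not given default to None."""
    unknown = set(filters) - set(FILTER_KEYS)
    if unknown:
        raise ValueError(f"Unknown filter(s): {sorted(unknown)}")
    params = {key: None for key in FILTER_KEYS}
    params.update(filters)
    params["ecNumber"] = ec_number
    return params


# Only restrict temperature; all other filters stay disabled
params = build_ec_query_params("3.2.1.1", minTemp=10.0, maxTemp=25.0)

# Assumed usage with the neo4j driver (URI and credentials are placeholders):
# from neo4j import GraphDatabase
# with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
#     with driver.session() as session:
#         records = list(session.run(EC_QUERY, params))
```

Rejecting unknown keys catches typos such as `minTemperature`, which would otherwise leave the intended filter silently disabled.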

__Example output__

This is just an illustrative (not real) example of the output format. The actual output will depend on the data in the KG.

EC number-based query output (for EC 3.2.1.1 - alpha-amylase):

| Protein ID | Protein Name | EC Numbers | MAG ID | MAG Division | GTDB Classification | Sample Name | Temperature (°C) | Salinity (PSU) | Depth (m) | Latitude | Longitude |
|------------|--------------|------------|--------|--------------|---------------------|-------------|------------------|----------------|-----------|----------|-----------|
| P001 | Alpha-amylase | 3.2.1.1 | MAG001 | Bacteria | d__Bacteria;<br>p__Proteobacteria;<br>c__Gammaproteobacteria;<br>o__Enterobacterales;<br>f__Enterobacteriaceae;<br>g__Escherichia;<br>s__Escherichia coli | OceanSample001 | 15.2 | 35.1 | 100.5 | 40.7128 | -74.0060 |
| P002 | Alpha-<br>glucosidase | 3.2.1.1,<br>3.2.1.20 | MAG002 | Bacteria | d__Bacteria;<br>p__Firmicutes;<br>c__Bacilli;<br>o__Bacillales;<br>f__Bacillaceae;<br>g__Bacillus;<br>s__Bacillus subtilis | OceanSample002 | 18.7 | 34.8 | 50.2 | 34.0522 | -118.2437 |

<br>

2. __SMILES similarity-based query with optional environmental filters__

This query finds compounds similar to a target compound (based on pre-computed SMILES similarity), then retrieves the proteins that catalyze reactions in which these similar compounds are substrates. Results can optionally be filtered by environmental parameters and geographic location. They are ordered by decreasing chemical similarity and include the proteins, the similar compounds, the similarity scores, the associated genomes, and sample information.

```cypher
// Compounds pre-computed as similar to the target, and the proteins acting on them
MATCH (target:Compound {smiles: $targetSmiles})
MATCH (target)-[sim:CHEMICALLY_SIMILAR]->(c:Compound)-[:SUBSTRATE_OF]->(:Reaction)<-[:CATALYZES]-(p:Protein)
WHERE sim.similarity >= $similarityThreshold
OPTIONAL MATCH (m:GENOME)-[:CONTAINS]->(p)
OPTIONAL MATCH (m)-[:ORIGINATED_FROM]->(s:Samples)
// Filter in a separate WITH ... WHERE: a WHERE attached directly to
// OPTIONAL MATCH would only set s to null instead of dropping the row.
// Filters whose parameter is NULL are skipped; once a filter is set,
// proteins without matching sample metadata are excluded.
WITH DISTINCT p, c, sim.similarity AS similarity, m, s
WHERE
  ($minTemp IS NULL OR s.temperature >= $minTemp) AND
  ($maxTemp IS NULL OR s.temperature <= $maxTemp) AND
  ($minSalinity IS NULL OR s.salinity >= $minSalinity) AND
  ($maxSalinity IS NULL OR s.salinity <= $maxSalinity) AND
  ($minLat IS NULL OR s.latitude >= $minLat) AND
  ($maxLat IS NULL OR s.latitude <= $maxLat) AND
  ($minLon IS NULL OR s.longitude >= $minLon) AND
  ($maxLon IS NULL OR s.longitude <= $maxLon)
RETURN p, c, similarity, m, s
ORDER BY similarity DESC
LIMIT 1000
```

__Example output__

This is just an illustrative (not real) example of the output format. The actual output will depend on the data in the KG.

SMILES similarity-based query output (for target compound CCO - ethanol):

| Protein ID | Protein Name | EC Numbers | Compound ID | Compound Name | SMILES | Similarity | MAG ID | MAG Division | GTDB Classification | Sample Name | Temperature (°C) | Salinity (PSU) | Depth (m) | Latitude | Longitude |
|------------|--------------|------------|-------------|---------------|--------|------------|--------|--------------|---------------------|-------------|------------------|----------------|-----------|----------|-----------|
| P003 | Alcohol<br>dehydrogenase | 1.1.1.1 | C001 | Ethanol | CCO | 1.00 | MAG003 | Bacteria | d__Bacteria;<br>p__Actinobacteria;<br>c__Actinobacteria;<br>o__Corynebacteriales;<br>f__Mycobacteriaceae;<br>g__Mycobacterium;<br>s__Mycobacterium smegmatis | OceanSample003 | 22.1 | 33.9 | 10.5 | 51.5074 | -0.1278 |
| P004 | Methanol<br>dehydrogenase | 1.1.1.244 | C002 | Methanol | CO | 0.88 | MAG004 | Bacteria | d__Bacteria;<br>p__Proteobacteria;<br>c__Alphaproteobacteria;<br>o__Rhizobiales;<br>f__Methylobacteriaceae;<br>g__Methylobacterium;<br>s__Methylobacterium extorquens | OceanSample004 | 20.5 | 35.2 | 75.8 | 48.8566 | 2.3522 |

## Computing Tanimoto distances for ModelSEED compounds

```python
from src.distances import compute_fingerprint_distances
from src.utils import extract_data


reactions_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/reactions.json"
compounds_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/compounds.json"

n = None
reactions, compounds = extract_data(reactions_path, compounds_path, n)

distances = compute_fingerprint_distances(compounds, n_jobs=12)
print(f"Computed {len(distances)} pairwise distances")
```

```text
Computed 130661695 pairwise distances
```
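
The reported count is consistent with one distance per unordered pair of compounds: n compounds give n(n-1)/2 pairs. Assuming that is how `compute_fingerprint_distances` enumerates pairs (its implementation is not shown here, so this is an assumption), the number of compounds that were compared can be recovered from the count. The helper name below is hypothetical.

```python
import math


def n_items_from_pairs(n_pairs):
    """Solve n * (n - 1) / 2 == n_pairs for a positive integer n."""
    discriminant = 1 + 8 * n_pairs
    root = math.isqrt(discriminant)
    if root * root != discriminant:
        raise ValueError(f"{n_pairs} is not of the form n * (n - 1) / 2")
    return (1 + root) // 2


n_compounds = n_items_from_pairs(130661695)
print(n_compounds)  # 16166
```

Under that assumption, 16166 compounds entered the comparison, since 16166 × 16165 / 2 = 130661695.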


```python
from src.distances import store_distances_parquet

distance_file = "outputs/distances.parquet"
store_distances_parquet(distances, distance_file)
```

```text
Stored 130661695 pairwise distances in outputs/distances.parquet
```


```python
from src.distances import read_distance_parquet

distance = read_distance_parquet(distance_file, "cpd00001", "cpd00002")
print(f"Distance between COMP1 and COMP2: {distance}")
```

```text
Distance between COMP1 and COMP2: 1.0
```
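
Under the usual definition (distance = 1 − Tanimoto similarity), a distance of 1.0 means the two fingerprints share no set bits. For fingerprints with on-bit sets A and B, the Tanimoto similarity is |A ∩ B| / |A ∪ B|. The pure-Python sketch below illustrates the formula on toy bit sets, not on real fingerprints. How `src.distances` builds and compares fingerprints is not shown here, and the convention chosen for two empty fingerprints is an assumption.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


d_disjoint = tanimoto_distance({1, 5, 9}, {2, 6})      # no shared bits
d_partial = tanimoto_distance({1, 5, 9}, {5, 9, 12})   # 2 shared bits out of 4
d_same = tanimoto_distance({3, 4}, {3, 4})             # identical bit sets
```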


## Make reaction and compound databases


Simplify the ModelSEED database by keeping a subset of properties for compounds and reactions, then save the results to JSON files.

```python
from src.utils import extract_data


reactions_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/reactions.json"
compounds_path = "/home/robaina/Documents/NewAtlantis/enzyme_activity/notebooks/data/annotations/modelseed/compounds.json"

n = None
reactions, compounds = extract_data(reactions_path, compounds_path, n)
```

### Filter reactions by list of properties

```python
def filter_reaction_dicts(reaction_dicts):
    # Define the keys to keep
    keys_to_keep = [
        "aliases", "code", "compound_ids", "definition", "deltag", "deltagerr",
        "direction", "ec_numbers", "equation", "is_transport", "linked_reaction",
        "name", "pathways", "reversibility", "source", "status", "stoichiometry"
    ]

    # Function to filter a single dictionary
    def filter_dict(d):
        filtered = {k: d[k] for k in keys_to_keep if k in d}
        # Rename 'id' to 'reaction_id' if present
        if 'id' in d:
            filtered['reaction_id'] = d['id']
        return filtered

    # Apply the filter to all dictionaries in the list
    return [filter_dict(d) for d in reaction_dicts]


filtered_reactions = filter_reaction_dicts(reactions)
```

### Filter compounds by list of properties

```python
def filter_compound_dicts(compound_dicts):
    # Define the keys to keep
    keys_to_keep = [
        "aliases", "charge", "deltag", "deltagerr", "formula", "inchikey",
        "is_cofactor", "is_core", "mass", "name", "pka", "pkb", "smiles", "source"
    ]

    # Function to filter a single dictionary
    def filter_dict(d):
        filtered = {k: d[k] for k in keys_to_keep if k in d}
        # Rename 'id' to 'compound_id' if present
        if 'id' in d:
            filtered['compound_id'] = d['id']
        return filtered

    # Apply the filter to all dictionaries in the list
    return [filter_dict(d) for d in compound_dicts]


filtered_compounds = filter_compound_dicts(compounds)
```

### Save to json

```python
import json

def save_to_json(data, filename):
    """
    Save a list of dictionaries to a JSON file.

    Args:
        data (list): List of dictionaries to save
        filename (str): Name of the file to save the data to
    """
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


save_to_json(filtered_reactions, 'outputs/filtered_reactions.json')
save_to_json(filtered_compounds, 'outputs/filtered_compounds.json')
```
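
If CSV files are preferred (for example, for bulk import into a graph database), the filtered records could be flattened along the following lines. This is only a sketch: the helper name `save_to_csv` is hypothetical, and encoding nested values (lists, dicts) as JSON strings is one possible convention, not necessarily the one used in this project.

```python
import csv
import io
import json


def save_to_csv(records, fileobj):
    """Write a list of dictionaries as CSV to an open text file object."""
    # Use the union of keys, since not every record has every field
    fieldnames = sorted({key for rec in records for key in rec})
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    for rec in records:
        # Nested values are serialized as JSON strings so each fits in one cell
        row = {
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in rec.items()
        }
        writer.writerow(row)


# Toy records standing in for filtered_compounds
toy_records = [
    {"compound_id": "cpd00001", "name": "H2O", "aliases": ["Water"]},
    {"compound_id": "cpd00002", "name": "ATP"},
]
buffer = io.StringIO()
save_to_csv(toy_records, buffer)
csv_text = buffer.getvalue()
```

When writing to disk instead of an in-memory buffer, open the file with `newline=""`, as the `csv` module documentation recommends.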

## Notes

1. RDKit chemical similarity functions: https://github.com/rdkit/rdkit-orig/blob/57058c886a49cc597b0c40641a28697ee3a57aee/rdkit/DataStructs/__init__.py#L31